Information on spiders (crawler, parser) in PHP?

PHP

toporov, 2011-03-01 12:47:32

Hello.
I need to write a content parser for third-party sites in PHP. The idea behind the module is as follows:
- the administrator sets the rules for parsing a particular site (page), assigning weights to certain selectors (tags);
- the module parses the site (page);
- we analyze the parsing result by applying the administrator's rules to it.
The output should be a page context of the form array('word1'=>int(...), 'word2'=>int(...)...), where word1 is a word the spider extracted from the page and int(...) is the weight that word receives after the admin rules are applied to the parsed result. In this way we get an approximate page context, i.e.
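To make the weighting step concrete, here is a minimal sketch of one way it could work. The rule set, the sample fragments, and the function name buildPageContext are all hypothetical, and summing the selector weight per word occurrence is just one possible scoring scheme, not something the task prescribes.

```php
<?php
// Hypothetical admin rules: selector (tag name) => weight.
$rules = ['title' => 5, 'h1' => 3, 'p' => 1];

// Hypothetical parser output: selector => text fragments found on the page.
$fragments = [
    'title' => ['PHP spiders'],
    'h1'    => ['Writing spiders in PHP'],
    'p'     => ['A spider downloads a page and parses it.'],
];

/**
 * Build a word => weight map: every occurrence of a word adds
 * the weight of the selector it was found in.
 */
function buildPageContext(array $fragments, array $rules): array
{
    $context = [];
    foreach ($fragments as $selector => $texts) {
        $weight = $rules[$selector] ?? 0;
        if ($weight === 0) {
            continue; // no rule for this selector, skip its text
        }
        foreach ($texts as $text) {
            // strtolower() only lowercases ASCII; non-ASCII text would need mb_strtolower().
            $words = preg_split('/[^\p{L}\p{N}]+/u', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
            foreach ($words as $word) {
                $context[$word] = ($context[$word] ?? 0) + $weight;
            }
        }
    }
    arsort($context); // heaviest words first
    return $context;
}

$context = buildPageContext($fragments, $rules);
```

Stop words such as "a" or "in" still end up in the result here; a real module would probably filter them out before scoring.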
Scraping the content itself is not a problem. You can use the native DOMDocument with XPath (fast, but costly to write and maintain), or Zend_Dom_Query, phpQuery, or Nokogiri (see the post by Habr user w999d), which are slower but easier to write with and offer good parsing capabilities. (If you know of good parsing libraries I have not listed, please tell me.)
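For reference, the native route might look roughly like the sketch below. It uses only DOMDocument and DOMXPath from PHP's dom extension; the sample markup and the helper name extractTexts are made up for illustration.

```php
<?php
// Sample markup standing in for a downloaded page (hypothetical content).
$html = '<html><head><title>PHP spiders</title></head>'
      . '<body><h1>Writing spiders</h1><p>First paragraph.</p><p>Second paragraph.</p></body></html>';

// Real-world pages are rarely valid, so collect libxml warnings instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

/**
 * Return the trimmed text of every node matched by an XPath expression.
 */
function extractTexts(DOMXPath $xpath, string $query): array
{
    $texts = [];
    foreach ($xpath->query($query) as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

$title      = extractTexts($xpath, '//title');
$paragraphs = extractTexts($xpath, '//p');
```

Each admin rule could then be stored as an XPath expression plus a weight, which keeps the per-site configuration as plain data.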
So the actual question is how to organize the analysis and parsing of page content so that the output is a kind of digest, the content context of the page (the sites to be parsed will vary widely in structure and content). Are there any open-source crawlers that parse pages efficiently? Could you also point me to information on how to build a search index?
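As a starting point for the search index part, the simplest structure is probably an inverted index: a map from each word to the pages that contain it. The sketch below is only an illustration under that assumption; the function names, page names, and the occurrence-count scoring are hypothetical.

```php
<?php
/** Split text into lowercase words (ASCII lowercasing only in this sketch). */
function tokenize(string $text): array
{
    return preg_split('/[^\p{L}\p{N}]+/u', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

/** Build the inverted index: word => [pageId => number of occurrences]. */
function buildIndex(array $pages): array
{
    $index = [];
    foreach ($pages as $pageId => $text) {
        foreach (tokenize($text) as $word) {
            $index[$word][$pageId] = ($index[$word][$pageId] ?? 0) + 1;
        }
    }
    return $index;
}

/** Return pageId => score for pages containing any query word, best first. */
function search(array $index, string $query): array
{
    $scores = [];
    foreach (tokenize($query) as $word) {
        foreach ($index[$word] ?? [] as $pageId => $count) {
            $scores[$pageId] = ($scores[$pageId] ?? 0) + $count;
        }
    }
    arsort($scores);
    return $scores;
}

// Hypothetical pages: id => already extracted text.
$pages = [
    'a.html' => 'PHP spider parses a page',
    'b.html' => 'Search index for a PHP spider, built in PHP',
];
$index  = buildIndex($pages);
$result = search($index, 'php index');
```

The raw occurrence counts could be replaced by the selector weights set by the administrator, so that a word found in a title counts for more than the same word in body text.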
I apologize for the somewhat vague wording of the question. Thanks for your attention!

4 answer(s)

Ogra, 2011-03-01
@Ogra

Yahoo Pipes?

andrew_tch, 2011-03-01
@andrew_tch

1) xpath (memory)
2) learn statistics (and read books on data analysis)

Evgeny Elizarov, 2011-03-01
@KorP

- phpQuery
- Simple HTML DOM
habrahabr.ru/blogs/php/114323/

egorist, 2017-05-07
@egorist

https://github.com/wasinger/htmlpagedom