What is the best and fastest way to write a parser (PHP)?
So, the contestants:
1. phpQuery
+ supports a bunch of selectors
- low speed
2. Simple HTML Dom
+ good documentation
+ easy to learn
- reported memory leaks, so large files cannot be parsed
- problems with parsing speed
3. Nokogiri
+ high speed
- terrible documentation
I listed everything that came to mind; maybe I missed something. So what is the best choice?
DiDom: https://github.com/Imangazaliev/DiDOM
+ high speed (see the comparison with other parsers)
+ good documentation
+ large number of supported selectors
+ most importantly, it has tests
A simple example:
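use DiDom\Document; // assumes DiDOM is installed via Composer and autoloaded

// The second argument (true) tells DiDOM to load from a file or URL instead of a string.
$document = new Document('http://www.example.com/', true);
echo $document->first('title::text'); // text of the first <title>
// href values of all links
$links = $document->find('a[href]::attr(href)');
var_dump($links);
// href values of links that contain an image
$links = $document->find('a[href]:has(img)::attr(href)');
var_dump($links);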
PHP: multi-curl+regex+DOMXPath
Example (https://www.ibm.com/developerworks/en/library/x-xp...):
$doc = new DOMDocument;
$doc->load('products.xml');
$xpath = new DOMXPath($doc);
$products = $xpath->query("/PRODUCTS/PRODUCT[SKU='soft5678']/NAME");
foreach ($products as $product) {
    print($product->nodeValue);
}
Over years of parsing data, I have settled on a simple stack:
cURL + tidy + DOMXPath
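A minimal sketch of how that stack could fit together, assuming the curl and dom extensions are available and treating tidy as optional. The helper names fetchHtml() and extractLinks() and the sample markup are made up for illustration; they are not from any of the answers.

```php
<?php
// Sketch only: fetchHtml() and extractLinks() are hypothetical helper names.

/** Download a page with cURL and return its body, or throw on failure. */
function fetchHtml(string $url): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_TIMEOUT        => 15,   // seconds
    ]);
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL error: ' . curl_error($ch));
    }
    return $body;
}

/** Repair the markup with tidy when the extension is loaded, then collect href values via XPath. */
function extractLinks(string $html): array
{
    if (function_exists('tidy_repair_string')) {
        $repaired = tidy_repair_string($html, ['wrap' => 0], 'utf8');
        if ($repaired !== false) {
            $html = $repaired;
        }
    }

    // Collect libxml warnings quietly; real-world HTML is rarely valid.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = [];
    foreach ($xpath->query('//a[@href]') as $a) {
        $links[] = $a->getAttribute('href');
    }
    return $links;
}

// Made-up sample with an unclosed <a>; with a live page, pass fetchHtml($url) instead.
$sample = '<html><body><p>Links: <a href="/one">one</a> <a href="/two">two</p></body></html>';
print_r(extractLinks($sample));
```

Keeping the download and the parsing in separate functions makes it possible to test the XPath part on saved pages without touching the network.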
First question: a parser of what?
And if I understood the task of this parser correctly, why are you reinventing the wheel?
cURL for getting content: php.net/manual/en/book.curl.php
SimpleXML for document parsing.
Both components ship with PHP and have unified, documented, well-known interfaces.
Another option is the Symfony2 DomCrawler component: symfony.com/doc/current/components/dom_crawler.html
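For the built-in route (cURL to fetch, SimpleXML to parse), a rough sketch of the parsing half might look like this. The XML is made-up sample data standing in for a response body, and parseProducts() is a hypothetical helper name. Note that SimpleXML expects well-formed XML, so this suits feeds and XHTML rather than arbitrary HTML.

```php
<?php
// Sketch only: parseProducts() and the sample data are hypothetical.

/** Map each product's sku attribute to its name. */
function parseProducts(string $xml): array
{
    $previous = libxml_use_internal_errors(true);
    $root = simplexml_load_string($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if ($root === false) {
        throw new RuntimeException('The input is not well-formed XML');
    }

    $names = [];
    foreach ($root->product as $product) {
        // Array access reads attributes, property access reads child elements.
        $names[(string) $product['sku']] = (string) $product->name;
    }
    return $names;
}

// In a real script this string would be the body returned by curl_exec().
$xml = '<products>'
     . '<product sku="soft5678"><name>Editor</name></product>'
     . '<product sku="soft9012"><name>Compiler</name></product>'
     . '</products>';

print_r(parseProducts($xml));

// SimpleXML also accepts XPath queries directly.
$match = simplexml_load_string($xml)->xpath("/products/product[@sku='soft5678']/name");
echo (string) $match[0], PHP_EOL;
```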
I settled on Nokogiri
It is really fast (I chose from the same candidates as you, only 1.5-2 years ago) and uses less memory than the rest. I don't remember how good the docs were, but it was not difficult to figure out.
The most reliable option is PhantomJS, since it is a full-fledged browser. phpQuery is faster.
Simple HTML Dom is said to work with invalid HTML. It does not. That was my reason for switching from it to phpQuery.
I'll add my own homegrown parser to the collection =)
https://bitbucket.org/ramzeska/html-dom-parser/wik...
If you just need to get the text, DOMDocument is the fastest. phpQuery is about 4 times slower (according to my personal tests), but it has a bunch of selectors, and in the end I chose it for myself.
Simple HTML Dom is very slow.
Nokogiri is a pure parser with no way to modify the document, and it is in fact the same DOMDocument overgrown with hacks, so there is no point in it when it comes to speed.
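For the text-only case, a minimal sketch with plain DOMDocument could look like this. The helper name htmlToText() and the sample markup are made up for illustration, and no speed figures are implied.

```php
<?php
// Sketch only: htmlToText() is a hypothetical helper name.

/** Return the readable text of an HTML document, without script and style content. */
function htmlToText(string $html): string
{
    // Collect libxml warnings quietly; real-world HTML is rarely valid.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    // Assumption: ASCII input or a charset declared in the markup;
    // otherwise loadHTML() may guess the encoding wrongly.
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $root = $doc->documentElement;
    if ($root === null) {
        return '';
    }

    // Copy the node list to an array first, because removing nodes
    // while iterating a live DOMNodeList can skip entries.
    $xpath = new DOMXPath($doc);
    foreach (iterator_to_array($xpath->query('//script | //style')) as $node) {
        $node->parentNode->removeChild($node);
    }

    // Collapse runs of whitespace into single spaces.
    return trim((string) preg_replace('/\s+/', ' ', $root->textContent));
}

$page = '<html><head><style>p{color:red}</style></head>'
      . '<body><p>Hello,   <b>world</b>!</p><script>var x = 1;</script></body></html>';
echo htmlToText($page), PHP_EOL;
```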
Which is faster?
I'll just leave a link here: https://github.com/Imangazaliev/DiDOM/wiki/%D0%A1%... Draw your own conclusions.
https://github.com/FriendsOfPHP/Goutte
An excellent parser that can follow links, with a simple, user-friendly interface. Its composer.json lists symfony/dom-crawler as a dependency, so if you trust the benchmarks linked in the answer above, its performance is average by comparison. But the barrier to entry is low: selection by CSS selectors, getting data and attributes through jQuery-like .text() and .attr() methods, and iteration through .each().