What is the best and fastest way to write a parser (PHP)?

php man, 2016-08-08 20:12:52

So, the contestants:
1. phpQuery
+ supports a bunch of selectors
- low speed
2. Simple HTML Dom
+ good documentation
+ easy to learn
- reports of memory leaks, so large files cannot be parsed
- slow parsing
3. Nokogiri
+ high speed
- terrible documentation
I've listed everything that came to mind; maybe I missed something. So which is the best choice?

14 answer(s)

Muhammad, 2016-08-08
@php man

DiDom: https://github.com/Imangazaliev/DiDOM
+ high speed (comparison with other parsers)
+ good docs
+ large number of supported selectors
+ most importantly, it has tests
A simple example:

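use DiDom\Document;

// The second argument (true) tells DiDom to load the document from a file or URL.
$document = new Document('http://www.example.com/', true);

echo $document->first('title::text');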

A little more complicated - we parse all links:

$links = $document->find('a[href]::attr(href)');

var_dump($links);

Harder still: get the addresses of all links that wrap an image:

$links = $document->find('a[href]:has(img)::attr(href)');

var_dump($links);

Other options:
- Symfony DomCrawler
- Zend Dom Query

xmoonlight, 2016-08-09
@xmoonlight

PHP: multi-curl+regex+DOMXPath
Example (https://www.ibm.com/developerworks/en/library/x-xp...):

$doc = new DOMDocument;
$doc->load('products.xml');
$xpath = new DOMXPath($doc);
// Select the NAME of every PRODUCT whose SKU is 'soft5678'.
$products = $xpath->query("/PRODUCTS/PRODUCT[SKU='soft5678']/NAME");
foreach ($products as $product) {
    print($product->nodeValue);
}
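
The products.xml file is not shown, so here is a rough, self-contained sketch of the same query against an inline document. The XML structure, the product names, and the productNames() helper are assumptions made up for illustration.

```php
<?php
// Assumed sample data; the real products.xml is not shown in this thread.
$xml = <<<XML
<PRODUCTS>
  <PRODUCT><SKU>soft1234</SKU><NAME>Text Editor</NAME></PRODUCT>
  <PRODUCT><SKU>soft5678</SKU><NAME>Spreadsheet</NAME></PRODUCT>
</PRODUCTS>
XML;

// Hypothetical helper: return the NAME of every PRODUCT with the given SKU.
// The SKU is concatenated into the XPath expression, so it must be a
// trusted value that contains no quote characters.
function productNames(string $xml, string $sku): array
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);

    $xpath = new DOMXPath($doc);
    $names = [];
    foreach ($xpath->query("/PRODUCTS/PRODUCT[SKU='" . $sku . "']/NAME") as $node) {
        $names[] = $node->nodeValue;
    }

    return $names;
}

print_r(productNames($xml, 'soft5678'));
```

For real-world HTML rather than XML, loadHTML() would probably be the call to use instead of loadXML().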

Ilya, 2016-08-18
@glebovgin

Over the years of parsing data, I came up with a simple set:
Curl + tidy + DOMXpath
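
A minimal sketch of the tidy + DOMXPath half of that set, assuming the page body has already been downloaded (with cURL or anything else). The extractTexts() helper and the sample markup are hypothetical, and tidy is only used when the extension is installed.

```php
<?php
// Hypothetical helper: repair messy markup with tidy when the extension is
// available, then run an XPath query and return the trimmed text of each match.
function extractTexts(string $html, string $xpathQuery): array
{
    if (extension_loaded('tidy')) {
        // 'wrap' => 0 stops tidy from re-wrapping long lines.
        $repaired = tidy_repair_string($html, ['wrap' => 0], 'utf8');
        if ($repaired !== false) {
            $html = $repaired;
        }
    }

    // Collect libxml warnings about invalid markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $texts = [];
    foreach ($xpath->query($xpathQuery) as $node) {
        $texts[] = trim($node->textContent);
    }

    return $texts;
}

// The <li> tags are deliberately left unclosed.
print_r(extractTexts('<ul><li>One<li>Two</ul>', '//li'));
```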

oe24y, 2016-08-09
@oe24y

Here is another jQuery-like option:
PHP Simple HTML DOM Parser

Sergii Buinytskyi, 2016-08-18
@boonya

First question: a parser of what?
And if I understood the task of this parser correctly, why are you reinventing the wheel?
cURL for getting the content: php.net/manual/en/book.curl.php
SimpleXML for parsing the document.
Both components come out of the box with PHP and have unified, documented and well-known interfaces. Or here's another option, the Symfony2 component: symfony.com/doc/current/components/dom_crawler.html
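
A rough sketch of that cURL + SimpleXML combination. The helper names fetchBody() and itemTitles() and the RSS-like sample are assumptions for illustration only. Note that SimpleXML expects well-formed XML, so this fits feeds and XML APIs better than arbitrary HTML pages.

```php
<?php
// Hypothetical helper: download a document with cURL and return its body.
function fetchBody(string $url): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT        => 10,
    ]);
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL error: ' . curl_error($ch));
    }

    return $body;
}

// Hypothetical helper: collect the item titles from an RSS-like document.
function itemTitles(string $xml): array
{
    $feed = simplexml_load_string($xml);
    if ($feed === false) {
        throw new InvalidArgumentException('The document is not well-formed XML');
    }

    $titles = [];
    foreach ($feed->channel->item as $item) {
        $titles[] = (string) $item->title;
    }

    return $titles;
}

// Assumed sample data; in practice the string would come from fetchBody($url).
$sample = '<rss><channel>'
    . '<item><title>First</title></item>'
    . '<item><title>Second</title></item>'
    . '</channel></rss>';
print_r(itemTitles($sample));
```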

Alexander, 2016-08-09
@OneFive

Very cool thing https://github.com/sleeping-owl/apist

PooH63, 2016-08-12
@PooH63

I settled on Nokogiri.
It really is fast (I chose from the same candidates as you, only 1.5-2 years ago) and uses less memory than the rest. I don't remember what the docs were like, but it was not difficult to figure out.

bartmanskyi, 2016-08-18
@bartmanskyi

PhantomJS is the most reliable, since it is a full-fledged browser. phpQuery is faster.
Simple HTML Dom is said to work with invalid HTML. It does not. That was my reason for switching from it to phpQuery.

Ramzeska, 2016-08-14
@Ramzeska

I'll add my own homegrown solution to the collection =)
https://bitbucket.org/ramzeska/html-dom-parser/wik...

morsvox, 2016-08-18
@morsvox

If you just need to get the text, DOMDocument is the fastest. phpQuery is about 4 times slower (according to my personal tests), but it has a bunch of selectors, and in the end I chose it for myself.
Simple HTML Dom is very slow.
Nokogiri only parses (it cannot replace or modify content) and is in fact the same DOMDocument overgrown with hacks, so there is no point in it when it comes to speed.
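
For the text-only case, a minimal sketch with the built-in DOMDocument might look like this. The pageTitleAndText() helper and the sample markup are made up for illustration, and nothing here measures speed.

```php
<?php
// Hypothetical helper: pull the title and the visible body text out of a page
// using only the built-in DOM extension.
function pageTitleAndText(string $html): array
{
    // Collect libxml warnings about invalid markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $titleNode = $doc->getElementsByTagName('title')->item(0);
    $bodyNode  = $doc->getElementsByTagName('body')->item(0);

    return [
        'title' => $titleNode ? trim($titleNode->textContent) : '',
        // Collapse runs of whitespace so the text reads as a single line.
        'text'  => $bodyNode
            ? trim((string) preg_replace('/\s+/', ' ', $bodyNode->textContent))
            : '',
    ];
}

$page = '<html><head><title>Demo</title></head>'
    . '<body><p>Hello   <b>world</b></p></body></html>';
print_r(pageTitleAndText($page));
```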

LightAir, 2016-08-18
@LightAir

Which is faster

The answer is simple: you will be faster with a tool that is familiar and/or well documented.
Better still, a native one. To be precise, the one the esteemed xmoonlight suggested.

Andrey Sanych, 2016-08-18
@mountpoint

I'll just leave a link here: https://github.com/Imangazaliev/DiDOM/wiki/%D0%A1%...
Draw your own conclusions.

webstorm, 2017-02-07
@webstorm

https://github.com/FriendsOfPHP/Goutte
An excellent parser with the ability to follow links. Simple and user-friendly interface. Its composer.json lists symfony/dom-crawler as a dependency, so if you believe the tests linked in the answer above, its performance is average by comparison. But the entry threshold is low (selecting by CSS selectors, getting data and attributes through jQuery-like .text() and .attr() methods, and iterating with .each()).

Ruslan, 2017-03-04
@mitrm

GuzzleHttp + phpQuery