PHP
Evgeny Orlov, 2018-10-03 06:36:40

PHP parser speed comparison?

In general, I asked myself whether to switch from the old simple_html_dom parser to something more nimble.
* I am only interested in PHP parsers.
Google, Toster, and other resources suggested that for performance it is better to use phpQuery or DiDOM.
I wrote the same script for all three parsers.
And... simple_html_dom, which everyone criticizes for being slow, runs it faster.
So tell me where I'm wrong.
Or recommend a really fast PHP parser.
For the test I parse Toster:
1) get the list of questions on the main page
2) for each question, open the page with the question itself (to test loading and speed)

Simple Html Dom
<meta http-equiv=Content-Type content="text/html;charset=UTF-8">

<?php
set_time_limit(0);
$start = microtime(true);

# cURL helper for the parser
function dlPage($href)
{
  $curl = curl_init();
  curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
  curl_setopt($curl, CURLOPT_HEADER, false);
  curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
  curl_setopt($curl, CURLOPT_URL, $href);
  curl_setopt($curl, CURLOPT_REFERER, $href);
  curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
  curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.4 (KHTML, like Gecko) Chrome/5.0.375.125 Safari/533.4");
  $str = curl_exec($curl);
  curl_close($curl);

  $dom = new simple_html_dom();
  $dom->load($str);
  return $dom;
}


include_once('simple_html_dom/simple_html_dom.php');
$html=dlPage("https://toster.ru/questions");

foreach($html->find('a[class="question__title-link"]') as $div)
{
  $link=$div->href;
  $name=$div->innertext;
  echo $name." = ".$link."<br>";
  
  $html2=dlPage($link);
}

echo "<hr>".round(microtime(true) - $start, 4);
?>
phpQuery
<meta http-equiv=Content-Type content="text/html;charset=UTF-8">

<?php
set_time_limit(0);
$start = microtime(true);

# note: ini_set('user_agent', ...) takes only the value, without the "User-Agent: " prefix
$fake_user_agent = "Mozilla/5.0 (Windows; U; Windows NT 6.1; rv:2.2) Gecko/20110201";
ini_set('user_agent', $fake_user_agent);

require('phpQuery/phpQuery-onefile.php');


$html=file_get_contents('https://toster.ru/questions');
$document=phpQuery::newDocument($html);

$hentry=$document->find('a.question__title-link');
foreach ($hentry as $el)
{
  $pq = pq($el);
  
  $name=$pq->text();
  $href=$pq->attr('href');
  echo $name." = $href<br>";
  
  $html2=file_get_contents($href);
  $document2=phpQuery::newDocument($html2);
}

echo "<hr>".round(microtime(true) - $start, 4);
?>
DiDOM
<meta http-equiv=Content-Type content="text/html;charset=UTF-8">

<?php
set_time_limit(0);
$start = microtime(true);

# pretend we are not a bot (the value goes without the "User-Agent: " prefix)
#$fake_user_agent = "Mozilla/5.0 (X11; Linux i686) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.47 Safari/536.11";
$fake_user_agent = "Mozilla/5.0 (Windows; U; Windows NT 6.1; rv:2.2) Gecko/20110201";
ini_set('user_agent', $fake_user_agent);

# include the parser
require_once('DiDom/ClassAttribute.php');
require_once('DiDom/Document.php');
require_once('DiDom/Element.php');
require_once('DiDom/Encoder.php');
require_once('DiDom/Errors.php');
require_once('DiDom/Query.php');
require_once('DiDom/StyleAttribute.php');
require_once('DiDom/Exceptions/InvalidSelectorException.php');
use DiDom\ClassAttribute;
use DiDom\Document;
use DiDom\Element;
use DiDom\Encoder;
use DiDom\Errors;
use DiDom\Query;
use DiDom\StyleAttribute;
use DiDom\Exceptions\InvalidSelectorException;
#########################

$document = new Document('https://toster.ru/questions', true);

$posts = $document->find('.question__title-link');
foreach($posts as $post)
{
  echo $post->text(), " = ".$post->href."<br>";
  $document2=new Document($post->href, true);
}

echo "<hr>".round(microtime(true) - $start, 4);
?>

2 answer(s)
DevMan, 2018-10-03
@Miracl

The lion's share of the time is spent downloading pages, so it is not surprising that there is no noticeable difference.
Nobody does it that way, especially when speed matters:
first, the required URLs are prepared/collected;
then they are downloaded in many parallel streams (for example, with multi-cURL) and saved to a database, to disk, or elsewhere;
the saved documents are then quietly parsed locally in the background.
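A minimal sketch of the parallel-download step described above, using PHP's curl_multi API; the URL list and output file names here are placeholders, not part of the original answer:

```php
<?php
// Download a list of URLs in parallel with curl_multi and save each
// response to disk, so parsing can later run locally with no network waits.
// Placeholder URLs standing in for the collected question links:
$urls = [
    'https://toster.ru/questions',
    'https://toster.ru/questions?page=2',
];

$mh = curl_multi_init();
$handles = [];
foreach ($urls as $i => $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_multi_add_handle($mh, $ch);
    $handles[$i] = $ch;
}

// Drive all transfers until every handle has finished.
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh);
} while ($running > 0);

// Collect each response and persist it for later offline parsing.
foreach ($handles as $i => $ch) {
    file_put_contents("page_$i.html", curl_multi_getcontent($ch));
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);
```

With the pages on disk, the choice of parser only affects the local parsing phase, which is where a benchmark comparison actually becomes meaningful.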

index0h, 2018-10-03
@index0h

In general, tell me if I'm wrong.

* In that benchmarks need to be run many times, thousands of iterations at least.
* In that your "parsing time" measurement includes the page download time.
* In that it includes the output (echo) time.
* In that it includes the environment setup time (includes/requires).
What you have now is just garbage data, literally meaningless.
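A sketch of a cleaner benchmark along those lines: the page is downloaded once outside the timed section, output happens only after the loop, and only the parse step is repeated and timed. The iteration count is illustrative, and simple_html_dom stands in for whichever parser is under test:

```php
<?php
require_once 'simple_html_dom/simple_html_dom.php';

// Fetch the document ONCE, outside the measured section,
// so network latency never contaminates the parse timing.
$html = file_get_contents('https://toster.ru/questions');

$iterations = 1000;
$start = microtime(true);
for ($i = 0; $i < $iterations; $i++) {
    // Only parsing and selection are inside the timed loop: no I/O, no echo.
    $dom = new simple_html_dom();
    $dom->load($html);
    $links = $dom->find('a[class="question__title-link"]');
    $dom->clear(); // free simple_html_dom's internal references
}
$elapsed = microtime(true) - $start;

// Report the average per-iteration time after the loop, not inside it.
printf("%.6f sec per parse over %d iterations\n", $elapsed / $iterations, $iterations);
```

Running the same skeleton with phpQuery and DiDOM in place of the simple_html_dom block would give numbers that are actually comparable.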
