PHP parser speed comparison?
I've been thinking about switching from the old PHP parser simple_html_dom to something faster.
* I'm only interested in PHP parsers.
Google, Toster, and other resources suggested that phpQuery or DiDOM would perform better.
I wrote the same script for all 3 parsers.
And simple_html_dom, which everyone criticizes for being slow, runs it faster.
So tell me if I'm doing something wrong.
Or recommend a PHP parser that is actually fast.
For the test I parse Toster:
1) get the list of questions from the main page
2) for each question, open the page with the question itself (for load and a speed test)
<meta http-equiv=Content-Type content="text/html;charset=UTF-8">
<?php
set_time_limit(0);
$start = microtime(true);
# cURL for the parser
function dlPage($href)
{
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
    curl_setopt($curl, CURLOPT_HEADER, false);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_URL, $href);
    curl_setopt($curl, CURLOPT_REFERER, $href);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.4 (KHTML, like Gecko) Chrome/5.0.375.125 Safari/533.4");
    $str = curl_exec($curl);
    curl_close($curl);
    $dom = new simple_html_dom();
    $dom->load($str);
    return $dom;
}
include_once('simple_html_dom/simple_html_dom.php');
$html=dlPage("https://toster.ru/questions");
foreach($html->find('a[class="question__title-link"]') as $div)
{
    $link=$div->href;
    $name=$div->innertext;
    echo $name." = ".$link."<br>";
    $html2=dlPage($link);
}
echo "<hr>".round(microtime(true) - $start, 4);
?>
<meta http-equiv=Content-Type content="text/html;charset=UTF-8">
<?php
set_time_limit(0);
$start = microtime(true);
# the user_agent setting takes only the header value, without a "User-Agent:" prefix
$fake_user_agent = "Mozilla/5.0 (Windows; U; Windows NT 6.1; rv:2.2) Gecko/20110201";
ini_set('user_agent', $fake_user_agent);
require('phpQuery/phpQuery-onefile.php');
$html=file_get_contents('https://toster.ru/questions');
$document=phpQuery::newDocument($html);
$hentry=$document->find('a.question__title-link');
foreach ($hentry as $el)
{
    $pq = pq($el);
    $name=$pq->text();
    $href=$pq->attr('href');
    echo $name." = $href<br>";
    $html2=file_get_contents($href);
    $document2=phpQuery::newDocument($html2);
}
echo "<hr>".round(microtime(true) - $start, 4);
?>
<meta http-equiv=Content-Type content="text/html;charset=UTF-8">
<?php
set_time_limit(0);
$start = microtime(true);
# pretend that we are not a bot
#$fake_user_agent = "Mozilla/5.0 (X11; Linux i686) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.47 Safari/536.11";
# the user_agent setting takes only the header value, without a "User-Agent:" prefix
$fake_user_agent = "Mozilla/5.0 (Windows; U; Windows NT 6.1; rv:2.2) Gecko/20110201";
ini_set('user_agent', $fake_user_agent);
# include the parser
require_once('DiDom/ClassAttribute.php');
require_once('DiDom/Document.php');
require_once('DiDom/Element.php');
require_once('DiDom/Encoder.php');
require_once('DiDom/Errors.php');
require_once('DiDom/Query.php');
require_once('DiDom/StyleAttribute.php');
require_once('DiDom/Exceptions/InvalidSelectorException.php');
use DiDom\ClassAttribute;
use DiDom\Document;
use DiDom\Element;
use DiDom\Encoder;
use DiDom\Errors;
use DiDom\Query;
use DiDom\StyleAttribute;
use DiDom\Exceptions\InvalidSelectorException;
#########################
$document = new Document('https://toster.ru/questions', true);
$posts = $document->find('.question__title-link');
foreach($posts as $post)
{
    echo $post->text(), " = ".$post->href."<br>";
    $document2=new Document($post->href, true);
}
echo "<hr>".round(microtime(true) - $start, 4);
?>
The lion's share of the time is spent loading pages, so it is not surprising that there is no noticeable difference between the parsers.
And who parses pages this way, one request at a time, especially when speed is needed?
First, the required URLs are prepared/collected.
Then they are downloaded in many parallel streams (for example, with multi-cURL) and saved to a database, to disk, or somewhere else.
The documents are then parsed locally, in the background.
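To check where the time actually goes, the download and the parsing can be timed separately. Below is a rough sketch, not a finished benchmark: `timePhases` and `parseWithDom` are hypothetical helper names, and the parser here is PHP's built-in DOM extension, used only as a stand-in for whichever of the three libraries is being measured.

```php
<?php
// Hypothetical helper: accumulates download time and parse time separately.
// $fetch and $parse are caller-supplied callables (an assumption of this sketch).
function timePhases(array $urls, callable $fetch, callable $parse): array
{
    $timings = ['fetch' => 0.0, 'parse' => 0.0];
    foreach ($urls as $url) {
        $t = microtime(true);
        $html = $fetch($url);
        $timings['fetch'] += microtime(true) - $t;

        $t = microtime(true);
        $parse($html);
        $timings['parse'] += microtime(true) - $t;
    }
    return $timings;
}

// Stand-in parser based on the built-in DOM extension.
// Returns the number of <a> elements found in the document.
function parseWithDom(string $html): int
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally; real-world HTML is rarely well-formed.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    return $doc->getElementsByTagName('a')->length;
}
```

Something like `print_r(timePhases(['https://toster.ru/questions'], 'file_get_contents', 'parseWithDom'));` would then print the two totals side by side. If the `fetch` figure dominates, swapping the parser will hardly change the overall result.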
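The steps above could be sketched roughly like this, using PHP's `curl_multi_*` functions for the parallel download. The names `fetchAll` and `savePages`, the 30-second timeout, and the md5-based file names are assumptions made for illustration; a real crawler would also want a concurrency limit and retries.

```php
<?php
// Download all URLs concurrently; returns [url => body, or null on failure].
function fetchAll(array $urls, int $timeout = 30): array
{
    $multi = curl_multi_init();
    $handles = [];
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
        curl_multi_add_handle($multi, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until none are still running.
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running > 0) {
            curl_multi_select($multi, 1.0); // wait for activity instead of busy-looping
        }
    } while ($running > 0 && $status === CURLM_OK);

    // Note which transfers finished with an error.
    $failed = [];
    while (($info = curl_multi_info_read($multi)) !== false) {
        if ($info['result'] !== CURLE_OK) {
            $url = array_search($info['handle'], $handles, true);
            if ($url !== false) {
                $failed[$url] = true;
            }
        }
    }

    $pages = [];
    foreach ($handles as $url => $ch) {
        $body = curl_multi_getcontent($ch);
        $pages[$url] = (!isset($failed[$url]) && is_string($body)) ? $body : null;
        curl_multi_remove_handle($multi, $ch);
    }
    return $pages;
}

// Save each downloaded page to disk so it can be parsed later, offline.
function savePages(array $pages, string $dir): array
{
    $files = [];
    foreach ($pages as $url => $body) {
        if ($body === null) {
            continue; // skip failed downloads
        }
        $path = $dir . '/' . md5($url) . '.html';
        file_put_contents($path, $body);
        $files[$url] = $path;
    }
    return $files;
}
```

A separate script can then read the saved files and run any of the three parsers on them, so the parser speed is measured without network latency mixed in.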