PHP
@Twitt, 2017-06-13 16:06:35

How to parse many pages quickly in PHP?

There is a site with many pages, and from each page I need to take only one block of information. There are about 10,000 such pages. They are numbered sequentially, not scattered.
My algorithm is as follows: in a loop we request each page via cURL, then use the phpQuery library to find the class that contains the information we need, and print it all on the page. Here is how the code looks now:


<?php
require_once 'phpquery/phpQuery/phpQuery.php';
ini_set('max_execution_time', 0);

for ($i = 1; $i <= 600; $i++) { // process 600 profiles
    $url = 'https://site_address/anketa/' . $i; // profiles are processed in order, from 1 to 600
    $val = curlIt($url);
    $html = phpQuery::newDocument($val);
    $pq = pq($html);
    $elem = $pq->find('.distance'); // get the content inside the .distance class
    if (!$elem->length) continue;   // if there is no such block, skip the page
    echo $elem;                     // display the content of the .distance block
}

function curlIt($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $res = curl_exec($ch);
    curl_close($ch);
    return $res;
}

It all works, but displaying even 50 results takes several minutes of waiting. I tried to display 500 results and gave up after 10 minutes without getting a result.
How do I do this properly so that it runs faster?


6 answers
Alexander Aksentiev, 2017-06-13

Well, first of all, google curl_multi.
Secondly, if Composer is not a scary word for you, then here.
Thirdly, most of the time is most likely spent not on the requests themselves but on DOM parsing with phpQuery. It is better to use something more modern and probably faster.
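
A minimal sketch of the curl_multi approach mentioned above, assuming the same URL pattern and .distance parsing as in the question; the fetchBatch() helper and the batch size of 20 are just for illustration:

<?php
// Fetch a batch of URLs in parallel with curl_multi instead of one by one
function fetchBatch(array $urls) {
    $mh = curl_multi_init();
    $handles = [];
    foreach ($urls as $key => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_multi_add_handle($mh, $ch);
        $handles[$key] = $ch;
    }

    // Run all transfers until every handle has finished
    do {
        $status = curl_multi_exec($mh, $running);
        if ($running) {
            curl_multi_select($mh);
        }
    } while ($running && $status == CURLM_OK);

    $results = [];
    foreach ($handles as $key => $ch) {
        $results[$key] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $results;
}

// Usage: fetch profiles 1..600 in batches of 20
for ($i = 1; $i <= 600; $i += 20) {
    $urls = [];
    for ($j = $i; $j < $i + 20 && $j <= 600; $j++) {
        $urls[$j] = 'https://site_address/anketa/' . $j;
    }
    $pages = fetchBatch($urls);
    // ...parse each $pages[$j] with phpQuery as before
}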

Maxim Timofeev, 2017-06-13
@webinar

Why speed? What it takes is patience. Start the process and let it parse on its own. For such purposes I use the contentDownloader software; it's faster to set everything up and run it there. Start it in 10 threads and go drink tea. 50,000 pages is not that much: 1-2 hours and everything will be ready. I once had a case with 5,000,000 pages and a complex data selection; it took 3 days. But if it parses on its own, it's still not a problem.
Or else trigger the script with cron once a minute and write the results to the database (see the sketch after this answer). You can trigger several instances at a time. And after n hours, where n is clearly less than 10, everything will be ready.
Optimizing this would take longer than just running it. If we were talking about parsing 1M pages daily, then it would be worth thinking: measure where more time is lost, loading the page or running the script, and choose libraries accordingly. For a one-off task of 50k pages, optimizing anything is too much honor.
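
A minimal sketch of the cron variant: every run processes the next batch of profiles and stores the parsed block in a database. The DSN, the "profiles" table and the batch size of 100 are assumptions, not part of the question.

<?php
require_once 'phpquery/phpQuery/phpQuery.php';

$pdo = new PDO('mysql:host=localhost;dbname=parser', 'user', 'password'); // assumed connection
$batchSize = 100;

// Continue from the last profile id processed on the previous run
$last = (int)$pdo->query('SELECT COALESCE(MAX(id), 0) FROM profiles')->fetchColumn();
$insert = $pdo->prepare('INSERT INTO profiles (id, html) VALUES (?, ?)');

for ($i = $last + 1; $i <= $last + $batchSize; $i++) {
    $page = curlIt('https://site_address/anketa/' . $i);
    $elem = phpQuery::newDocument($page)->find('.distance');
    // Store an empty string when the block is missing so progress is still recorded
    $insert->execute([$i, $elem->length ? (string)$elem : '']);
}

function curlIt($url) { // same helper as in the question
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $res = curl_exec($ch);
    curl_close($ch);
    return $res;
}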

Sergey Pugovkin, 2017-06-13
@Driver86

Here the speed is limited by the site's bandwidth and by its protection against frequent requests, if it has any.

nirvimel, 2017-06-13
@nirvimel

Caching the intermediate results (after phpQuery) in a database will help here. If you always need to return up-to-date data while the original data is constantly changing, then each cache entry should carry a timestamp, and if more than N seconds/days have passed since then, the original page is requested again (a sketch is below).
As for fetching 500 pages synchronously in one thread... if I saw such code in a project, I would assume it was done on purpose: either to offload the server for other tasks, or to give the user time to go out for a smoke.
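
A minimal sketch of this caching idea: the already-parsed block is stored together with a timestamp, and the original page is refetched only when the entry is older than the TTL. The SQLite file, the table name and the 3600-second TTL are assumptions.

<?php
require_once 'phpquery/phpQuery/phpQuery.php';

$pdo = new PDO('sqlite:cache.db');
$pdo->exec('CREATE TABLE IF NOT EXISTS cache (url TEXT PRIMARY KEY, content TEXT, fetched_at INTEGER)');

function getBlock(PDO $pdo, $url, $ttl = 3600) {
    $stmt = $pdo->prepare('SELECT content, fetched_at FROM cache WHERE url = ?');
    $stmt->execute([$url]);
    $row = $stmt->fetch(PDO::FETCH_ASSOC);

    // Fresh enough: return the cached block without touching the remote site
    if ($row && time() - $row['fetched_at'] < $ttl) {
        return $row['content'];
    }

    // Missing or stale: request the original page again and update the cache
    $page = curlIt($url); // curlIt() is the helper from the question
    $content = (string)phpQuery::newDocument($page)->find('.distance');

    $upsert = $pdo->prepare('INSERT OR REPLACE INTO cache (url, content, fetched_at) VALUES (?, ?, ?)');
    $upsert->execute([$url, $content, time()]);
    return $content;
}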

lxfr, 2017-06-13
@lxfr

Besides speed, you should also worry about the question "how soon will my IP get banned". These days half of all sites do this automatically when they see a lot of strange requests.

entermix, 2017-06-13
@entermix

Use curl_multi, or Gearman for example, to run multiple operations in parallel (a sketch is below).
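
A minimal sketch of the Gearman variant, as two separate scripts. It requires the pecl gearman extension and a running gearmand server; the "fetch_profile" job name is an assumption. Run several copies of the worker to get parallelism.

<?php
// worker.php -- start several copies of this script to process jobs in parallel
require_once 'phpquery/phpQuery/phpQuery.php';

$worker = new GearmanWorker();
$worker->addServer(); // defaults to 127.0.0.1:4730
$worker->addFunction('fetch_profile', function (GearmanJob $job) {
    $url = $job->workload();
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $page = curl_exec($ch);
    curl_close($ch);
    $block = phpQuery::newDocument($page)->find('.distance');
    // In a real setup you would write $block to a database here; the sketch just prints it
    echo $block, PHP_EOL;
});
while ($worker->work());

<?php
// client.php -- queue all 600 pages as background jobs
$client = new GearmanClient();
$client->addServer();
for ($i = 1; $i <= 600; $i++) {
    $client->doBackground('fetch_profile', 'https://site_address/anketa/' . $i);
}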
