How to scrape a site with more than 30,000 links using PHP?
There is a website with about 30,000 items.
There is also a PHP script that parses one link and outputs the result.
How can the script be made to parse all 30,000 links? Should they all be put into an array? But then the file would be huge and slow to execute.
// Load the page
$max_timeout = 10;   // request timeout in seconds
$proxy = false;      // e.g. "1.2.3.4:3128", or false for a direct connection
$product_url = "https://www.ikea.com/ru/ru/catalog/products/303012";
$data = request($product_url, $max_timeout, $proxy);

// Start parsing (requires the phpQuery library)
$pq = phpQuery::newDocument($data['data']);

// Product title
$result['title'] = trim($pq->find('div.range-revamp-header-section__title--big')->html());

function request($url, $timeout = 10, $proxy = false)
{
    $headers = [];
    $headers[] = "User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:61.0) Gecko/20100101 Firefox/61.0";
    $headers[] = "Accept: */*";
    $headers[] = "Accept-Language: ru-RU,ru;q=0.8,en-US;q=0.5,en;q=0.3";

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
    if ($proxy) {
        // Only route through a proxy when one is actually supplied
        curl_setopt($ch, CURLOPT_PROXY, $proxy);
    }

    $data = curl_exec($ch);
    $httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    return [
        'httpcode' => $httpcode,
        'data'     => $data,
    ];
}
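As for the "huge array" worry: the links do not have to live inside the PHP file at all. Here is a minimal sketch, assuming the URLs sit in a hypothetical links.txt (one per line) and reusing the request() function above; memory use stays constant no matter how many links there are:

// Read the URLs one line at a time instead of declaring a giant array
$fh = fopen('links.txt', 'r'); // hypothetical file, one URL per line
while (($url = fgets($fh)) !== false) {
    $url = trim($url);
    if ($url === '') {
        continue;
    }
    $data = request($url, 10); // the request() function from the question
    if ($data['httpcode'] == 200) {
        $pq = phpQuery::newDocument($data['data']);
        $result['title'] = trim($pq->find('div.range-revamp-header-section__title--big')->html());
        // ... save the result somewhere (CSV, database, etc.) ...
    }
}
fclose($fh);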
Before scraping a site, first check whether it has an API. In this case IKEA has one:
https://developer.inter.ikea.com/
If you want to split the work across several parallel workers, first rewrite your script to go through proxies, because without a proxy you are likely to get banned very quickly.
The options are (see the sketches after this list):
- run 10-100 copies of your parser in parallel and tweak the code so that each one takes the next link from a database that handles concurrent access via locking or transactions;
- rework the parser so that it still runs in a single process but uses, for example, curl_multi, so that requests to the site go out asynchronously.
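A minimal sketch of the first option, assuming a hypothetical MySQL table `links` (columns id, url, status) and the request() function from the question. Each worker claims the next unprocessed link inside a transaction, so no two workers ever fetch the same URL:

// Connect once per worker; the `links` table and credentials are hypothetical
$db = new PDO('mysql:host=localhost;dbname=scraper', 'user', 'password');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$proxy = false; // plug in a real proxy here, as advised above

while (true) {
    // Claim one link inside a transaction; FOR UPDATE locks the row so
    // concurrent workers cannot grab the same one (on MySQL 8+ you can
    // append SKIP LOCKED so workers skip locked rows instead of waiting)
    $db->beginTransaction();
    $row = $db->query(
        "SELECT id, url FROM links WHERE status = 'new' ORDER BY id LIMIT 1 FOR UPDATE"
    )->fetch(PDO::FETCH_ASSOC);
    if (!$row) {
        $db->commit(); // queue is empty, this worker is done
        break;
    }
    $db->prepare("UPDATE links SET status = 'taken' WHERE id = ?")
       ->execute([$row['id']]);
    $db->commit();

    // Fetch and parse outside the transaction so the row lock is held briefly
    $data = request($row['url'], 10, $proxy);
    // ... parse $data['data'] with phpQuery and save the result ...

    $db->prepare("UPDATE links SET status = 'done' WHERE id = ?")
       ->execute([$row['id']]);
}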
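And a minimal sketch of the second option with curl_multi, assuming the same hypothetical links.txt as above. It keeps a window of 10 requests in flight and refills the window as transfers complete:

$urls = file('links.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$window = 10; // how many requests to keep in flight at once
$mh = curl_multi_init();

// Create a handle for one URL and register it with the multi handle
function add_url($mh, $url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    curl_multi_add_handle($mh, $ch);
}

// Fill the initial window
foreach (array_splice($urls, 0, $window) as $url) {
    add_url($mh, $url);
}

do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh, 1.0); // wait for network activity instead of spinning

    // Drain finished transfers and top the window up from the queue
    while ($done = curl_multi_info_read($mh)) {
        $ch = $done['handle'];
        $url = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
        $html = curl_multi_getcontent($ch);
        // ... hand $html to the phpQuery parsing code here ...
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
        if ($urls) {
            add_url($mh, array_shift($urls));
        }
    }
} while ($running > 0 || $urls);

curl_multi_close($mh);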
And remember, the site admin may not like a huge flood of requests to their server, since it looks like a DDoS attack.