PHP
damarkuzz, 2021-04-25 19:26:32

How do you scrape a site with more than 30,000 links using PHP?

There is a website with about 30,000 items.
There is also a PHP script that parses one link and outputs the result.
How can I make the script process all 30,000 links? Should they all be put into an array? But then the script file would be huge and slow to execute.

// Loading page
$max_timeout = 10;
$proxy = false;
$product_url = "https://www.ikea.com/ru/ru/catalog/products/303012";
$data = request($product_url, $max_timeout, $proxy);

// Start parsing
$pq = phpQuery::newDocument($data['data']);

// Product title
$result['title'] = trim($pq->find('div.range-revamp-header-section__title--big')->html());

function request($url, $timeout = 10, $proxy = false)
{
    $headers[] = "User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:61.0) Gecko/20100101 Firefox/61.0";
    $headers[] = "Accept: */*";
    $headers[] = "Accept-Language: ru-RU,ru;q=0.8,en-US;q=0.5,en;q=0.3";

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
    if ($proxy) {
        // Only set a proxy when one is actually passed in.
        curl_setopt($ch, CURLOPT_PROXY, $proxy);
    }

    $data = curl_exec($ch);
    $httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    // Return both the HTTP status code and the page body.
    $result['httpcode'] = $httpcode;
    $result['data'] = $data;
    return $result;
}
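
One way to avoid putting all 30,000 links into the script itself is to keep them in a separate text file and read them line by line, so memory use stays flat. A minimal sketch, assuming a hypothetical links.txt with one URL per line and the request() function above:

// Read the URLs one at a time instead of loading a huge array.
$fh = fopen('links.txt', 'r'); // hypothetical file, one URL per line
if ($fh === false) {
    die('Cannot open links.txt');
}

while (($line = fgets($fh)) !== false) {
    $url = trim($line);
    if ($url === '') {
        continue;
    }

    $res = request($url, 10, false);
    if ($res['httpcode'] != 200) {
        // Log the failure and continue with the next link.
        file_put_contents('errors.log', $url . "\n", FILE_APPEND);
        continue;
    }

    $pq = phpQuery::newDocument($res['data']);
    $title = trim($pq->find('div.range-revamp-header-section__title--big')->html());

    // Append each result to disk immediately so nothing piles up in memory.
    file_put_contents('results.csv', '"' . str_replace('"', '""', $title) . '";' . $url . "\n", FILE_APPEND);
}

fclose($fh);

Run it from the command line (php parser.php) so the web server's execution time limit does not apply.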


3 answers
Denis Yuriev, 2021-04-25
@dyuriev

Before scraping a site, first check whether it has an API:
https://developer.inter.ikea.com/

Nadim Zakirov, 2021-04-25
@zkrvndm

If you want to split execution into parallel parts, first rewrite your script to use proxies, because without a proxy you are likely to get banned very quickly.
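
For example, a minimal sketch of rotating proxies with the request() helper from the question; the addresses in the list are placeholders, not working proxies:

// Placeholder proxy list; substitute real proxies in ip:port form.
$proxies = [
    '127.0.0.1:8080',
    '127.0.0.1:8081',
];

$product_url = "https://www.ikea.com/ru/ru/catalog/products/303012";

// Pick a random proxy for each request so the load is spread across them.
$proxy = $proxies[array_rand($proxies)];
$data = request($product_url, 10, $proxy);

// A short random pause between requests also lowers the ban risk.
usleep(random_int(500000, 1500000)); // 0.5 to 1.5 seconds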

rPman, 2021-04-25
@rPman

The options:
- run 10-100 copies of your parser in parallel and tweak the code so that each one takes the next link from a database that handles concurrent access or locking transactions (see the first sketch below);
- rework the parser so it still runs in one thread but uses, for example, curl_multi, so that requests to the site go out asynchronously (see the second sketch below).
And remember, the site admin may not like 100500 requests to their server, as it looks like a DDoS.
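
For the first option, a minimal sketch of claiming the next link from a shared queue table so that parallel workers never process the same URL twice. The database credentials and the links table schema (id, url, status) are assumptions for illustration:

$pdo = new PDO('mysql:host=localhost;dbname=parser', 'user', 'pass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Atomically claim one pending link; FOR UPDATE locks the row so
// two workers cannot grab it at the same time.
function claimNextLink(PDO $pdo)
{
    $pdo->beginTransaction();
    $row = $pdo->query(
        "SELECT id, url FROM links WHERE status = 'pending' LIMIT 1 FOR UPDATE"
    )->fetch(PDO::FETCH_ASSOC);

    if ($row === false) {
        $pdo->commit();
        return null; // the queue is empty
    }

    $pdo->prepare("UPDATE links SET status = 'taken' WHERE id = ?")
        ->execute([$row['id']]);
    $pdo->commit();

    return $row;
}

while (($link = claimNextLink($pdo)) !== null) {
    $data = request($link['url'], 10, false);
    // ... parse $data['data'] here, as in the question ...
    $pdo->prepare("UPDATE links SET status = 'done' WHERE id = ?")
        ->execute([$link['id']]);
}

On MySQL 8+ you can add SKIP LOCKED after FOR UPDATE so that workers skip each other's locked rows instead of waiting on them.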
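
For the second option, a minimal sketch of fetching a batch of pages concurrently with curl_multi from a single process (the URLs are placeholders):

// Fetch a batch of pages concurrently with curl_multi.
$urls = [
    'https://www.ikea.com/ru/ru/catalog/products/303012',
    'https://www.ikea.com/ru/ru/catalog/products/303013',
];

$mh = curl_multi_init();
$handles = [];

foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}

// Run all transfers until every one has finished.
do {
    $status = curl_multi_exec($mh, $running);
    if ($running) {
        curl_multi_select($mh); // wait for activity instead of busy-looping
    }
} while ($running && $status === CURLM_OK);

foreach ($handles as $url => $ch) {
    $html = curl_multi_getcontent($ch);
    // ... parse $html with phpQuery here, as in the question ...
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}

curl_multi_close($mh);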
