PHP
lesh_a, 2019-12-17 21:47:49

How to scrape a large number of sites?

I need to parse a large number of sites (about 30-50k). At first I chose multi-cURL for this, wrote a couple of classes myself, and tested them on a small sample; everything was fine. But as soon as I ran it on more sites (about 300-1000), problems began: the first sites were processed, and the rest simply did not load. Moreover, if I take the sites that did not load and run them separately, they are processed normally. My hand-written classes had no error reporting, so I could not tell what was going wrong.
Then I rewrote everything using this library, which uses ReactPHP and DOM parsing. It works the same way, and the same problem appears: after some number of sites it stops working. This library does report errors, and this is what it says: Connection to XXX:80 failed during DNS lookup: DNS query for XXX failed: too many retries.
When I run the sites that produced this error separately, everything works.
Please tell me what is wrong and how to fix it.
The code looks roughly like this:

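// $loop and $connector are created earlier (not shown)
$client = new Browser($loop, $connector);
// VersionParser is my own class
$parser = new \app\VersionParser($client, $loop);
$parser->parse($urls);

// Run the event loop until all pending requests have finished
$loop->run();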


2 answer(s)
Dimonchik, 2019-12-17
@dimonchik2013

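Use Scrapy.
For PHP, I came across a crawler about 5 years ago; there is probably something available now as well.
As for the DNS errors: skip the failed site and go on to the next one. You can also set up a caching DNS server that forwards to 8 well-known, powerful upstream resolvers (Google, Cloudflare, Yandex, Hurricane Electric).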
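
The "skip and go to the next one" advice can be sketched as follows, in Python since the answer points to Scrapy. This is only an illustration, assuming the failures come from too many simultaneous requests (and therefore DNS lookups) rather than from the sites themselves. The names `crawl`, `fake_fetch`, `max_concurrency` and `retries` are hypothetical and belong to neither Scrapy nor ReactPHP; `fake_fetch` stands in for a real HTTP client so the example runs without network access.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable, List, Tuple


async def crawl(
    urls: Iterable[str],
    fetch: Callable[[str], Awaitable[str]],
    max_concurrency: int = 20,
    retries: int = 2,
) -> Tuple[Dict[str, str], List[str]]:
    """Fetch every URL with at most `max_concurrency` requests in flight.

    Returns (pages, skipped): bodies of the URLs that succeeded, and the
    URLs that still failed after `retries` extra attempts.
    """
    semaphore = asyncio.Semaphore(max_concurrency)
    pages: Dict[str, str] = {}
    skipped: List[str] = []

    async def worker(url: str) -> None:
        async with semaphore:  # caps simultaneous connections and lookups
            for attempt in range(retries + 1):
                try:
                    pages[url] = await fetch(url)
                    return
                except Exception:
                    # Any failure (DNS, timeout, ...): back off briefly,
                    # then retry if attempts remain.
                    if attempt < retries:
                        await asyncio.sleep(0.01 * (attempt + 1))
            skipped.append(url)  # give up on this site; the others continue

    await asyncio.gather(*(worker(u) for u in urls))
    return pages, skipped


async def fake_fetch(url: str) -> str:
    """Hypothetical stand-in for an HTTP client; fails for 'bad' hosts."""
    await asyncio.sleep(0)
    if "bad" in url:
        raise OSError("DNS lookup failed")
    return "<html>" + url + "</html>"


if __name__ == "__main__":
    urls = ["http://site%d.example" % i for i in range(5)]
    urls.append("http://bad.example")
    pages, skipped = asyncio.run(crawl(urls, fake_fetch, max_concurrency=2))
    print(len(pages), "fetched; skipped:", skipped)
```

The skipped list can be fed into a second pass later, which matches the observation in the question that failed sites load fine when run separately. The same pattern, a fixed cap on in-flight requests plus a retry list, should carry over to a ReactPHP event loop, though the exact API there would differ.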

Yuri Paimurzin, 2020-01-16
@rusellsystems

A few years ago I parsed JavaScript-driven sites by simulating a browser, with queues and a Postgres database. I tested all of this on Linux servers with RabbitMQ; the setup ran for half a year until I got tired of Chromium and Lazarus-IDE on the server side, with ...
