How to avoid getting blacklisted during multi-threaded parsing?
I wrote a multi-threaded PHP parser using the Guzzle library. It goes through a large number of sites and fetches the content of a single page from each. After a while, instead of content the server starts returning a 302 response code. Actually, it is not really a matter of time; I ran a couple of tests:
1. 300 sites in total, parsed in 100 threads.
2. 300 sites parsed in 300 threads.
In the second run, the 302 code started coming back immediately, along with a message that my IP was on a blacklist, while in the first run everything was fine. Moreover, I launched the first run after the second one, so if my IP really is blacklisted, why did parsing 100 sites right after parsing 300 work fine?
What are some ways to minimize this problem, or at least reduce the number of sites returning a 302 response? Is it possible to do this without changing IP addresses?
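For reference, a minimal sketch of such a run with Guzzle's Pool, capping concurrency and disabling automatic redirects so 302 responses stay visible. The URL list, the concurrency value, and the logging are illustrative, not the asker's actual code:

```php
<?php
// Minimal sketch (not the asker's actual parser): fetch a list of URLs with a
// bounded concurrency and report any 302 responses instead of following them.
require 'vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Pool;
use GuzzleHttp\Psr7\Request;

$urls = ['https://example.com/page']; // placeholder for the ~300 sites

$client = new Client([
    'timeout'         => 10,
    'allow_redirects' => false, // keep 302s visible instead of silently following them
]);

$requests = function (array $urls) {
    foreach ($urls as $url) {
        yield new Request('GET', $url);
    }
};

$pool = new Pool($client, $requests($urls), [
    'concurrency' => 20, // far below 100-300 simultaneous requests
    'fulfilled'   => function ($response, $index) use ($urls) {
        if ($response->getStatusCode() === 302) {
            echo "302 from {$urls[$index]}\n";
        }
    },
    'rejected'    => function ($reason, $index) use ($urls) {
        echo "Request failed: {$urls[$index]}\n";
    },
]);

$pool->promise()->wait();
```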
As long as your parser's behavior is indistinguishable from that of ordinary site visitors, it is invisible.
It is enough to correctly emulate a real user (browser headers, JS events, request frequency, and sensible navigation) to avoid getting banned.
In short: no more than one page request per domain every 15 seconds.
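A rough sketch of that advice with Guzzle: send browser-like headers and never hit the same host more often than once per 15 seconds. The headers and the helper name fetchPolitely are illustrative; the 15-second interval is the one from the answer above.

```php
<?php
// Sketch: browser-like headers plus a per-host rate limit of one request
// every MIN_INTERVAL seconds.
require 'vendor/autoload.php';

use GuzzleHttp\Client;

const MIN_INTERVAL = 15; // seconds between requests to the same host

$client = new Client([
    'headers' => [
        'User-Agent'      => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Accept'          => 'text/html,application/xhtml+xml',
        'Accept-Language' => 'en-US,en;q=0.9',
    ],
]);

$lastRequestAt = []; // host => unix timestamp of the last request

function fetchPolitely(Client $client, string $url, array &$lastRequestAt): string
{
    $host = parse_url($url, PHP_URL_HOST);

    // Wait if this host was requested less than MIN_INTERVAL seconds ago.
    if (isset($lastRequestAt[$host])) {
        $elapsed = time() - $lastRequestAt[$host];
        if ($elapsed < MIN_INTERVAL) {
            sleep(MIN_INTERVAL - $elapsed);
        }
    }

    $lastRequestAt[$host] = time();

    return (string) $client->get($url)->getBody();
}
```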
One way to get banned is when many of the sites are hosted on the same server.
The simplest workaround is to resolve each site's IP and pause between requests that go to the same IP.
In theory, curl should have a built-in mechanism for such pauses, and it can probably be controlled from PHP. For details, read the manual ;)
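A minimal sketch of that grouping idea in plain PHP. The URL list and the 15-second pause are placeholders, and the actual request call is omitted; gethostbyname() is just one way to resolve the IPs.

```php
<?php
// Sketch: group URLs by the IP they resolve to, then pause between consecutive
// requests that land on the same IP (many small sites share one server).
$urls = ['https://site-a.example/', 'https://site-b.example/']; // placeholder list

$byIp = [];
foreach ($urls as $url) {
    $host = parse_url($url, PHP_URL_HOST);
    $ip   = gethostbyname($host); // returns the hostname unchanged on failure
    $byIp[$ip][] = $url;
}

foreach ($byIp as $ip => $group) {
    foreach ($group as $i => $url) {
        // ... perform the request for $url here ...
        if ($i < count($group) - 1) {
            sleep(15); // pause only between requests that hit the same server
        }
    }
}
```

URLs behind different IPs could still be fetched in parallel (for example, one Guzzle pool per IP); the pause only matters within a group.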
A more complex and rarer case is several IPs on one server, or several servers behind one firewall that treats the flood of requests as a DDoS attack. In that case you have to work out the subnet, or even all of the data center's subnets, to set the pauses.