How can I avoid being blocked by a third-party server if I steal data from it?

Parsing

DemonIa, 2018-09-19 20:57:47

There is a certain well-known site that does not provide a public API, but its front end makes AJAX requests to the server every few seconds to keep the data up to date.
There is nothing related to CORS in the headers, so I can simply make requests to this server with cURL and get the data I need.
But I strongly suspect that if I run an application on my server that makes a steady stream of requests around the clock, my server's IP will be banned and I will be left with nothing.
Q: What can be done to avoid being blocked in the future?
My thoughts on this:
1. Collect an array of, say, 50 different User-Agent strings and put a random element from the array into the header of each request. This should make it easier to "get lost in the crowd".
2. Buy a pool of IP addresses and make requests through them at random. As far as I understand, proxy servers were invented to solve exactly this kind of problem.
If I buy a package of IPs, for example here (proxywhite.com), what do I do with them? I'm interested in the technical side of the issue.
Are there any out-of-the-box solutions for binding an IP pool to a web server running on Node.js?
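
As a rough illustration of the two ideas above, here is a minimal Node.js sketch that picks a random User-Agent and a random proxy for each request and builds the options object for the built-in `http.request()`. Everything in it is an assumption made for illustration: the User-Agent strings, the proxy addresses (taken from a documentation-only IP range), and the helper names `pickRandom` and `buildProxiedOptions` are hypothetical. It only covers plain-HTTP targets behind an ordinary forward proxy; an HTTPS target would need a CONNECT tunnel or a proxy-agent library, which is not shown here.

```javascript
// Hypothetical pools: every value below is a placeholder, not a real browser or proxy.
const USER_AGENTS = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
  "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/2.0",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13) ExampleBrowser/3.0",
];
const PROXIES = [
  { host: "203.0.113.10", port: 8080 },
  { host: "203.0.113.11", port: 8080 },
];

// Pick a random element; `random` is injectable so the choice can be tested.
function pickRandom(items, random = Math.random) {
  return items[Math.floor(random() * items.length)];
}

// Build options for http.request() that route a plain-HTTP GET through a
// forward proxy: connect to the proxy and put the absolute URL in the path.
function buildProxiedOptions(targetUrl, random = Math.random) {
  const target = new URL(targetUrl);
  const proxy = pickRandom(PROXIES, random);
  return {
    host: proxy.host,
    port: proxy.port,
    method: "GET",
    path: target.href, // absolute URL, as a forward proxy expects
    headers: {
      Host: target.host,
      "User-Agent": pickRandom(USER_AGENTS, random),
    },
  };
}
```

A call might look like `require("node:http").get(buildProxiedOptions("http://example.com/data"), (res) => res.pipe(process.stdout))`, where the URL is again only a placeholder.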
Thank you!

3 answers

Vladimir Letyagin, 2018-09-19
@DemonIa

First of all, I would test from a different server whether the IP really gets banned after a large number of requests. This doesn't happen very often.
Changing the User-Agent is unlikely to help.
That leaves two options: either limit the request rate (for example, to 1 per second) or go through proxies.
One way to do this: keep an array of proxies and either fetch the data through them in turn, or fetch it asynchronously, with each proxy pausing for a couple of seconds between its own requests.
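
The "in turn, with a pause per proxy" idea could be sketched roughly like this in Node.js. The names `createProxyRotator`, `poll`, and `fetchVia` are hypothetical, and `fetchVia` stands for whatever function actually performs one request through a given proxy; the sketch only handles the scheduling.

```javascript
// Round-robin over a list of proxies, making sure each proxy rests for
// `pauseMs` between its own requests. `now` is injectable for testing.
function createProxyRotator(proxies, pauseMs, now = Date.now) {
  const lastUsed = new Map(); // proxy -> time its last request was scheduled
  let index = 0;
  return function next() {
    const proxy = proxies[index];
    index = (index + 1) % proxies.length;
    const current = now();
    const readyAt = lastUsed.has(proxy) ? lastUsed.get(proxy) + pauseMs : current;
    const startAt = Math.max(current, readyAt);
    lastUsed.set(proxy, startAt);
    // waitMs is how long the caller should sleep before using this proxy.
    return { proxy, waitMs: startAt - current };
  };
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Fetch the URLs one by one, each through the next proxy in the rotation.
// `fetchVia(url, proxy)` is a caller-supplied (hypothetical) request function.
async function poll(urls, next, fetchVia) {
  const results = [];
  for (const url of urls) {
    const { proxy, waitMs } = next();
    if (waitMs > 0) await sleep(waitMs);
    results.push(await fetchVia(url, proxy));
  }
  return results;
}
```

With a single entry in the list and `pauseMs` set to 1000, the same rotator acts as a plain "1 request per second" limit; with several proxies, the overall rate goes up while each individual IP still pauses between its requests.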

Denis, 2018-09-25
@Dennes

1) Besides the User-Agent header, your "target" can also identify bots by other parameters (whether Flash, JS, WebRTC, etc. are enabled). Use a sniffer to find out what data the site "pulls" from the client.
2) You are thinking in the right direction. I don't think paid proxies are necessary in your case; publicly available ones will be enough.

DoctorGata, 2019-02-28
@DoctorGata

There is a residential proxy provider . IP pool - 10 million, 190+ countries.