Parsing
sandrain, 2013-11-22 18:51:24

How can I scrape websites unobtrusively and effectively?

Many sites ban you when you send a large number of requests.
- Are there any statistics or generally accepted norms for the number of requests within a given period of time?
- What additional information should be collected from a site so you can quickly understand why data was unavailable at a given moment?
- What additional information should be collected to reduce the risk of being banned by the site in the future?
Thank you.


3 answers
Vlad Zhivotnev, 2013-11-22
@sandrain

There is a concept called throttling, and it fits your case well =)
If the site starts responding more slowly, reduce the load; once it responds normally again, increase the load in small steps. If it starts returning 500s, cut the load several times over at once.
But @L3n1n is right: my homepage will take 10k requests without flinching, while some little blog will fall over at 300 rps. So the specific numbers differ for every site.
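As a rough illustration of this back-off idea (not Vlad's exact scheme), here is a minimal Python sketch; the delay values, the 2-second "slow" threshold, and the fetch loop are all assumptions made for the example:

```python
import time
import requests

BASE_DELAY = 1.0   # assumed starting delay between requests, in seconds
MAX_DELAY = 60.0   # assumed upper bound on the delay

def crawl(urls):
    """Throttle adaptively: back off when the site slows down or errors,
    and speed back up in small steps while it responds normally."""
    delay = BASE_DELAY
    for url in urls:
        start = time.monotonic()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            delay = min(delay * 2, MAX_DELAY)  # network trouble: back off
            continue
        elapsed = time.monotonic() - start

        if resp.status_code >= 500:
            delay = min(delay * 4, MAX_DELAY)    # 500s: cut load sharply
        elif elapsed > 2.0:
            delay = min(delay * 1.5, MAX_DELAY)  # slow responses: reduce load
        else:
            delay = max(delay * 0.9, BASE_DELAY) # healthy: small step up

        yield url, resp
        time.sleep(delay)
```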

Stepan, 2013-11-22
@L3n1n

- Each project sets its own limits on the number of requests.
- That's a strange question; please describe it in more detail.
- In my opinion, what data you collect plays no role in whether you get banned while scraping.

Masterme, 2013-11-22
@Masterme

Use a pool of anonymous proxies, don't try to download the entire site in one go, disguise yourself as a search-engine crawler, set a plausible Referer header, and send back the cookies you receive from the site.
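A minimal sketch of those points using the Python requests library is shown below; the proxy addresses, User-Agent, and Referer values are placeholder assumptions. Note that posing as a real search-engine bot (e.g. Googlebot) can be verified by the site via reverse DNS, so a plain browser User-Agent is used here instead:

```python
import random
import requests

# Hypothetical proxy pool; in practice you would maintain and health-check it.
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
]

# A Session stores cookies the site sets and sends them back automatically.
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",  # placeholder UA
    "Referer": "https://www.google.com/",  # a believable referrer
})

def fetch(url):
    proxy = random.choice(PROXIES)  # rotate through the pool per request
    return session.get(
        url,
        proxies={"http": proxy, "https": proxy},
        timeout=10,
    )
```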
