What is the best way to parse a large volume?

go

alst161, 2016-12-23 01:05:20

There are 20 sites that need to be monitored for new information. At the moment each one is checked every minute through free (or cheap) anonymous proxies, but over time I have concluded that this setup is not stable. Requesting directly from the server is not an option, because the sites detect the IP and block access. I would like advice on what is better in this situation: a VPN or simply high-quality proxies, preferably with a pointer to a reliable provider.

3 answers

xmoonlight, 2016-12-23
@xmoonlight

1. Why are you hammering the source site every minute?
If you are watching for new posts there, it is enough to poll the RSS/Atom feed every 3-5 minutes and, when a change appears, fetch the new content via the link in the feed item for further parsing on your side.
2. The crawler should work round-robin (a "carousel"): a link from resource 1 -> a link from resource 2 -> ... -> a link from resource N -> loop (start over). Do not walk through all the links of ONE resource in a row, clogging its channel!
3. Regarding a "high-quality resource": the crawler needs to pretend to be a regular user. Do not send requests too often, and view 5-6 linked pages within one session.
Then 2-3 proxy addresses will be enough for you for a long time.
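
The round-robin order from point 2 could be sketched in Go roughly like this. It is only an illustration: the function name `interleave` and the example URLs are made up, and a real crawler would also pause between requests.

```go
package main

import "fmt"

// interleave takes one queue of links per site and returns a single crawl
// order that takes one link from each site in turn, so that no site gets
// a burst of consecutive requests.
func interleave(queues [][]string) []string {
	var order []string
	for i := 0; ; i++ {
		added := false
		for _, q := range queues {
			if i < len(q) {
				order = append(order, q[i])
				added = true
			}
		}
		// Every queue is exhausted once a full pass adds nothing.
		if !added {
			return order
		}
	}
}

func main() {
	// Hypothetical links, for illustration only.
	queues := [][]string{
		{"https://site-a.example/1", "https://site-a.example/2"},
		{"https://site-b.example/1"},
		{"https://site-c.example/1", "https://site-c.example/2"},
	}
	for _, u := range interleave(queues) {
		fmt.Println(u)
	}
}
```

Sites with fewer links simply drop out of later passes, so the remaining sites are still visited in turn.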

DannyFork, 2016-12-23
@DannyFork

Crawlera has an automatic rotator for thousands of proxies: https://scrapinghub.com/crawlera/
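
For a short list of your own proxies, the basic idea of a rotator can be sketched on the client side with Go's standard `net/http`. This is not Crawlera's API, just a minimal illustration: the `rotator` type, `newClient`, and the proxy addresses (from a documentation-only IP range) are assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/url"
	"sync"
	"time"
)

// rotator hands out proxies from a fixed list in round-robin order.
type rotator struct {
	mu      sync.Mutex
	proxies []*url.URL
	next    int
}

// newRotator parses the proxy addresses and rejects an empty list.
func newRotator(addrs []string) (*rotator, error) {
	if len(addrs) == 0 {
		return nil, errors.New("no proxies configured")
	}
	r := &rotator{}
	for _, a := range addrs {
		u, err := url.Parse(a)
		if err != nil {
			return nil, err
		}
		r.proxies = append(r.proxies, u)
	}
	return r, nil
}

// pick has the signature expected by http.Transport.Proxy and returns
// the next proxy in the list, wrapping around at the end.
func (r *rotator) pick(_ *http.Request) (*url.URL, error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	u := r.proxies[r.next]
	r.next = (r.next + 1) % len(r.proxies)
	return u, nil
}

// newClient builds an HTTP client that asks the rotator which proxy to use.
func newClient(r *rotator) *http.Client {
	return &http.Client{
		Transport: &http.Transport{Proxy: r.pick},
		Timeout:   15 * time.Second,
	}
}

func main() {
	// Placeholder addresses; replace with real proxies.
	r, err := newRotator([]string{"http://192.0.2.1:8080", "http://192.0.2.2:8080"})
	if err != nil {
		panic(err)
	}
	client := newClient(r)
	fmt.Println("client timeout:", client.Timeout)
	for i := 0; i < 3; i++ {
		u, _ := r.pick(nil)
		fmt.Println(u)
	}
}
```

A hosted service adds things this sketch does not attempt, such as detecting banned proxies and retrying through another one.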

Mouvdy, 2016-12-23
@Mouvdy

We use proxies from actproxy.com in our work. They have been stable and reliable for us and cost relatively little, so they are worth a try.