What is the best way to scrape a large volume of sites?
There are 20 sites that need to be monitored for new information. At the moment they are checked every minute through free (or at least cheap) anonymous proxies, but over time this has turned out to be unstable. Scraping directly from our server is not an option, because the sites detect the IP and block access. I would like advice on what is better in this situation: a VPN or high-quality proxies, preferably with a recommendation of a good provider.
1. Why are you hammering the source site every minute?
If you are monitoring posts there, it is enough to check the RSS/Atom feed every 3-5 minutes and, when something changes, fetch the new content via the link in the feed item and parse it on your side (a rough sketch of this follows after the list).
2. The crawler should work round-robin ("carousel"): a link from resource 1 -> a link from resource 2 -> ... -> a link from resource N -> loop back to the start. Do not walk through all the links of one resource in a row and clog its channel.
3. As for a "quality resource": the crawler has to behave like a regular user: do not request pages too often, and view 5-6 linked pages within a single session (see the second sketch below).
If you do that, 2-3 proxy addresses will be enough for you for a long time.
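To illustrate points 1 and 2, here is a minimal sketch of a round-robin RSS/Atom poller in Python. The feed URLs, the poll interval, and the handle_new_entry hook are placeholders for illustration: each carousel pass touches every site exactly once and only looks at feeds, never at full pages, unless something new has appeared.

```python
# Minimal round-robin feed poller (points 1-2). Feed URLs are placeholders.
import time
import feedparser   # pip install feedparser

FEEDS = [
    "https://site1.example/feed.xml",
    "https://site2.example/atom.xml",
    # ... the rest of the 20 monitored sites
]

FULL_PASS_SECONDS = 4 * 60          # one "carousel" pass every ~3-5 minutes
seen = {url: set() for url in FEEDS}

def handle_new_entry(site, entry):
    """Placeholder: fetch entry.link and parse it on your side."""
    print(f"[{site}] new item: {entry.get('title', '')} -> {entry.link}")

while True:
    for feed_url in FEEDS:           # one request per site, then move to the next
        parsed = feedparser.parse(feed_url)
        for entry in parsed.entries:
            if entry.link not in seen[feed_url]:
                seen[feed_url].add(entry.link)
                handle_new_entry(feed_url, entry)
        # spread the pass evenly so no single site sees bursts of requests
        time.sleep(FULL_PASS_SECONDS / len(FEEDS))
```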
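And a sketch of point 3: when something new does appear, fetch it through a single requests.Session with a realistic User-Agent and pauses, viewing a handful of linked pages the way a person would. The link-extraction helper is a deliberately crude placeholder; in the poller above, handle_new_entry could simply call fetch_like_a_user(entry.link).

```python
# Minimal "behave like a regular user" fetcher (point 3).
import re
import time
import requests
from urllib.parse import urljoin, urlparse

def internal_links(html, base_url):
    """Crude same-host link extraction; replace with real HTML parsing."""
    host = urlparse(base_url).netloc
    links = (urljoin(base_url, href) for href in re.findall(r'href="([^"#]+)"', html))
    return [link for link in links if urlparse(link).netloc == host]

def fetch_like_a_user(article_url, pages_per_session=5, pause=3.0):
    """Fetch the article plus a few linked pages within one session."""
    with requests.Session() as s:
        s.headers["User-Agent"] = "Mozilla/5.0 (compatible; FeedMonitor/1.0)"
        resp = s.get(article_url, timeout=15)
        resp.raise_for_status()
        for link in internal_links(resp.text, article_url)[:pages_per_session]:
            time.sleep(pause)            # don't clog the site's channel
            s.get(link, timeout=15)
        return resp.text                 # hand this off to your own parser
```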
Crawlera has an automatic rotator for thousands of proxies: https://scrapinghub.com/crawlera/
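If you do end up needing more than a couple of addresses, the same rotation idea can be done by hand. Below is a minimal sketch with a small pool of your own proxies (the addresses are placeholders, not real endpoints), not the Crawlera API itself:

```python
# Minimal do-it-yourself proxy rotation; services like Crawlera do this at scale.
import itertools
import requests

PROXY_POOL = itertools.cycle([
    "http://user:pass@proxy1.example:8080",
    "http://user:pass@proxy2.example:8080",
    "http://user:pass@proxy3.example:8080",
])

def get_via_next_proxy(url, **kwargs):
    """Send each request through the next proxy in the pool."""
    proxy = next(PROXY_POOL)
    return requests.get(url, proxies={"http": proxy, "https": proxy},
                        timeout=15, **kwargs)
```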