Computer networks
Sirion, 2016-12-27 10:11:11

Web scraping of blocked sites: what do you recommend?

The bottom line is this: I need to collect data from some sites that are blocked by Roskompozor. If I did it by hand, I would use Tor or friGate, but of course I don't want to do it by hand. Accordingly, I see two approaches. I'm asking for advice on which one to choose, and how best to implement it.
1. Take lists of free proxies somewhere on the Internet and methodically walk through them. Where would you recommend getting them?
2. Don't be a cheapskate and rent my own proxy server. Again, where/how is the best way to do this? I have never done this, and something tells me that the first page of Google results will lead me somewhere suboptimal.
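The first option above (walking through a list of free proxies) can be sketched roughly like this in Python, using only the standard library. The proxy addresses are placeholders from the TEST-NET range, not real servers; in practice you would load a fresh list, since free proxies die quickly:

```python
# Sketch: rotate through a list of (placeholder) HTTP proxies,
# retrying the fetch on the next proxy whenever one fails.
import itertools
import urllib.error
import urllib.request

PROXIES = [
    "203.0.113.10:8080",  # placeholder addresses (TEST-NET-3 range)
    "203.0.113.11:3128",
    "203.0.113.12:8080",
]

def make_rotator(proxies):
    """Return a callable that yields the next proxy on each call, cycling forever."""
    pool = itertools.cycle(proxies)
    return lambda: next(pool)

def fetch_via_proxies(url, proxies, attempts=3, timeout=10):
    """Try the URL through successive proxies; re-raise the last error if all fail."""
    next_proxy = make_rotator(proxies)
    last_err = None
    for _ in range(attempts):
        proxy = next_proxy()
        opener = urllib.request.build_opener(
            urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        )
        try:
            with opener.open(url, timeout=timeout) as resp:
                return resp.read()
        except (urllib.error.URLError, OSError) as err:
            last_err = err  # dead free proxies are the norm; move on to the next
    raise last_err

if __name__ == "__main__":
    print(fetch_via_proxies("http://example.com/", PROXIES)[:100])
```

The rotation logic is deliberately separated from the fetching so it can be swapped for something smarter (e.g. dropping proxies that fail repeatedly).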



2 answer(s)
jacob1237, 2016-12-27
@jacob1237

If I did it by hand, I would use Tor or friGate, but of course I don't want to do it by hand

Tor works great as a proxy for bots/crawlers. The only real concerns are connection stability and landing on IP blacklists, since some services can detect whether an IP belongs to the Tor network.
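Routing a crawler through Tor is mostly a matter of pointing it at Tor's local SOCKS listener. A minimal sketch, assuming a Tor daemon running with its default SOCKS port 9050 and the `requests` library installed with SOCKS support (`pip install requests[socks]`):

```python
# Sketch: route a crawler's HTTP(S) requests through a local Tor daemon.
# Assumes Tor is running with its default SOCKS listener on 127.0.0.1:9050.

# socks5h (not socks5) makes DNS resolution happen inside Tor as well,
# so the target hostname never leaks to the local resolver.
TOR_SOCKS = "socks5h://127.0.0.1:9050"

def tor_proxies(socks_url=TOR_SOCKS):
    """Proxy mapping in the form requests expects: both schemes go via Tor."""
    return {"http": socks_url, "https": socks_url}

def fetch_over_tor(url, timeout=30):
    # deferred import so the pure helper above works even without requests installed
    import requests
    return requests.get(url, proxies=tor_proxies(), timeout=timeout)

if __name__ == "__main__":
    # check.torproject.org reports whether the request really arrived via Tor
    print(fetch_over_tor("https://check.torproject.org/").status_code)
```

Restarting the Tor circuit (e.g. via its control port) gives you a fresh exit IP, which helps with the blacklist problem mentioned above.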
If you need to do it professionally, look at services like Crawlera, a very convenient proxy auto-rotator.
But if you are not collecting data on a gigantic scale, and not hammering sites at full speed (effectively flooding them), then the easiest option is to buy an account with any foreign VPN service and run your bots from your home PC through the VPN tunnel, i.e. from a single IP.

Максим Тимофеев, 2016-12-27
@webinar, curator of the Web Development tag

1. Option one: take European or Ukrainian hosting, write your parser there, and everything runs outside Roskomnadzor's reach.
2. Option two: take software like ContentDownloader, load a list of non-Russian proxies into it, and scrape without problems.
