Yelp.com: scraping with Python through a proxy - how to avoid the ban?
I collect data from yelp.com. At first I scrape directly from my main IP address, and after 150-200 pages I get banned. Then I start scraping through a proxy, and the proxy gets banned very quickly.
In the code I also set the User-Agent:
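import requests  # SOCKS proxies also need PySocks: pip install requests[socks]

s = requests.Session()
proxy = 'socks4://42.105.99.197:35519'
# Browser-like request headers, including the User-Agent
ua = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    # Language ranges are separated by commas, not semicolons
    'Accept-Language': 'en-US;q=0.8,en;q=0.3',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Host': 'www.yelp.com',
    'Pragma': 'no-cache',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}
# url is the Yelp page being fetched (set elsewhere in the script)
r = s.get(url, headers=ua, proxies={'https': proxy, 'http': proxy})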
Yes, this is not a tough defense and it is easy to get around. Just buy access to a paid proxy list and use it through its API. You don't even need to touch the User-Agent. Fetch 50 records, switch to a new proxy, fetch another 50, switch again, and so on ad infinitum. Everything is automated, so there is no need to hunt down junk free proxies somewhere.
If they ban after 150 pages - why not change the proxy after 100? :)
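The rotation described in the answers could look roughly like the sketch below. It is only an illustration under some assumptions: the names proxy_schedule and fetch_all are made up for this example, the proxy list is assumed to be already downloaded from whatever paid provider you use (its API is not shown), and 50 pages per proxy is simply the number mentioned above.

```python
from itertools import cycle


def proxy_schedule(proxies, pages_per_proxy=50):
    """Yield one proxy URL per request, switching every pages_per_proxy requests.

    Hypothetical helper: cycles through the list forever, so a proxy is
    reused only after all the others have had their turn.
    """
    proxies = list(proxies)
    if not proxies:
        raise ValueError("at least one proxy URL is required")
    if pages_per_proxy < 1:
        raise ValueError("pages_per_proxy must be at least 1")
    for proxy in cycle(proxies):
        for _ in range(pages_per_proxy):
            yield proxy


def fetch_all(urls, proxies, pages_per_proxy=50, headers=None):
    """Fetch each URL through the scheduled proxy and yield (url, response)."""
    # Third-party dependency; SOCKS proxies additionally need requests[socks].
    import requests

    for url, proxy in zip(urls, proxy_schedule(proxies, pages_per_proxy)):
        response = requests.get(
            url,
            headers=headers,
            proxies={"http": proxy, "https": proxy},
            timeout=30,
        )
        yield url, response
```

You would call fetch_all(page_urls, proxy_list) and iterate over the results. Keeping pages_per_proxy well below the point where the ban kicks in (for example 100 if the ban comes after about 150 pages) is the whole idea; the schedule itself does nothing to detect a ban that has already happened.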