Python
Maxim Artemiev, 2018-11-22 16:47:18

Yelp.com: scraping with Python through a proxy - how do I get around the ban?

I am collecting data from yelp.com. At first I scraped from my own IP address and got banned after 150-200 pages. Then I switched to scraping through proxies, but each proxy gets banned very quickly.
In the code I also set a custom User-Agent along with other browser-like headers.

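import requests  # SOCKS proxies require the extra: pip install requests[socks]

s = requests.Session()  # assuming `s` is a requests session
proxy = 'socks4://42.105.99.197:35519'
# Browser-like request headers, including the User-Agent
ua = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    # Language entries are separated by commas, not semicolons
    'Accept-Language': 'en-US;q=0.8,en;q=0.3',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Host': 'www.yelp.com',
    'Pragma': 'no-cache',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}
# `url` is the Yelp page being fetched; route both schemes through the proxy
r = s.get(url, headers=ua, proxies={'https': proxy, 'http': proxy})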

No proxy lasts more than 100-150 pages before it gets a hard ban (for example, my own address was banned a week ago and still hasn't been unbanned). Many users online complain that their shared (NAT) addresses have been banned as well.
I tried Selenium WebDriver: scraping is much slower, and it also gets banned after about 100 pages. I gathered about 20k working proxies from the Internet, and for every 1,000 pages collected, more than 4,000 proxies end up banned...
What are the options for getting around protection this strict?


3 answers
Semyon Semyonov, 2018-11-22
@man_without_face

This isn't tough protection at all, and it's solved in no time. Just buy access to a paid proxy list and use it through its API. You don't even need to touch the User-Agent. Fetch 50 records, switch to a new proxy, fetch another 50, switch again, and so on indefinitely. Everything is automated, so there's no need to hunt for junk free proxies around the web.
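
The rotation described above could be sketched roughly as follows. This is only an illustration, not any provider's API: `assign_proxies` and `pages_per_proxy` are hypothetical names, the proxy addresses are placeholders from a documentation-only range, and `itertools.cycle` stands in for whatever call a paid service offers to hand out a fresh proxy.

```python
from itertools import cycle


def assign_proxies(urls, proxies, pages_per_proxy=50):
    """Pair each URL with a proxy, moving to the next proxy
    after every `pages_per_proxy` URLs (hypothetical helper)."""
    pool = cycle(proxies)  # stand-in for a provider's "give me a new proxy" call
    current = None
    for index, url in enumerate(urls):
        if index % pages_per_proxy == 0:
            current = next(pool)  # time to switch proxies
        yield url, current


# Placeholder values for illustration only
example_proxies = ["socks4://203.0.113.10:1080", "socks4://203.0.113.11:1080"]
example_urls = ["page-%d" % n for n in range(120)]
plan = list(assign_proxies(example_urls, example_proxies))
```

Each `(url, proxy)` pair can then go into the request from the question, e.g. `s.get(url, headers=ua, proxies={'http': proxy, 'https': proxy})`. Keeping `pages_per_proxy` well below the observed ban threshold is the whole point of the scheme.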

CityCat4, 2018-11-22
@CityCat4

If they ban you after 150 pages, why not switch to a new proxy after 100? :)

Yuryy, 2018-11-23
@Yuraz

You can also use this: https://github.com/aivarsk/scrapy-proxies
Random proxy middleware for Scrapy
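
For orientation, enabling that middleware in a Scrapy project's `settings.py` looks roughly like this. The setting names and priorities are given as I recall them from the project's README, so check them against the current README before relying on them; the proxy list path is a placeholder.

```python
# settings.py fragment (sketch; verify against the scrapy-proxies README)

# Retry often, since free proxies fail frequently
RETRY_TIMES = 10
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

DOWNLOADER_MIDDLEWARES = {
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": 90,
    "scrapy_proxies.RandomProxy": 100,
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 110,
}

# Text file with one proxy per line, e.g. http://host:port (placeholder path)
PROXY_LIST = "/path/to/proxy/list.txt"

# 0 = pick a different random proxy for every request
PROXY_MODE = 0
```

With this mode every request goes out through a randomly chosen proxy from the list, so no single address accumulates enough requests to hit the ban threshold quickly.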
