Why does Yandex ban a Python parser almost immediately?

Semyon, 2020-11-15 16:40:53

Scrapy

Here is a Yandex search parser.
P.S. The code is not optimized at all.

The code

import time

import requests
from bs4 import BeautifulSoup

PAGES = 5  # How many result pages to parse

def get_search(search_str):
    headers_Get = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/536 (KHTML, like Gecko) Chrome/86.0 Safari/536',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
        'Accept-Encoding': 'gzip, deflate',
        'DNT': '1',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1'
    }  # Request headers
    blok_list = search_str.split()
    url_query = '%20'.join(blok_list)  # Replace spaces with the %20 escape
    output = []
    for page in range(PAGES):
        url = 'https://yandex.ru/search/?text=' + url_query + '&p=' + str(page) + '&lr=213'  # URL to fetch
        time_start = time.time()  # Time at the start of the request
        r = requests.get(url, headers=headers_Get)  # Fetch the page
        soup = BeautifulSoup(r.text, "html.parser")  # Hand the HTML to BeautifulSoup
        for searchWrapper in soup.find_all('li', {'class': 'serp-item'}):  # Find all search results
            link = searchWrapper.find('a', {'class': 'i-bem'})  # Link element of the result, if any
            href = link.get("href") if link else None  # Skip results without a matching link
            if href and href.startswith("h"):  # Is it a normal link (http)?
                output.append(href)  # Link found, take it
        a = time.time() - time_start  # Elapsed time, so that 3 s pass between requests
        if 0 < a < 3 and (page + 1) != PAGES:  # Do not wait longer than needed or after the last page
            time.sleep(3 - a)
    return output

# The query is Russian for "Ugh, scoundrels! Why ban right away!?"
print(len(get_search("ух, негодяи! Зачем банить так сразу!?")))
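One variation on the loop above that may be worth trying is to send every request through a single session object and to randomize the pause instead of waiting a fixed 3 seconds. The sketch below only illustrates that idea. `fetch_pages`, `MIN_DELAY` and `MAX_DELAY` are hypothetical names, the 7-second upper bound is an arbitrary choice, and nothing in this thread confirms that it prevents the ban.

```python
import random
import time

MIN_DELAY = 3.0  # seconds; same as the fixed pause in the code above
MAX_DELAY = 7.0  # seconds; hypothetical upper bound chosen for illustration


def fetch_pages(session, urls, sleep=time.sleep, rng=random):
    """Fetch each URL through one session-like object, pausing randomly in between.

    `session` is anything with a requests-style `get(url, timeout=...)` method,
    for example a `requests.Session()` whose headers were set beforehand.
    """
    pages = []
    for i, url in enumerate(urls):
        response = session.get(url, timeout=10)
        pages.append(response.text)
        if i + 1 != len(urls):  # no need to wait after the last request
            sleep(rng.uniform(MIN_DELAY, MAX_DELAY))
    return pages
```

With `requests` this would be called as `fetch_pages(session, urls)` after `session = requests.Session()` and `session.headers.update(headers_Get)`, so cookies set by the first response are sent with the later ones.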

I run it from a home computer in Russia with a delay of a full 3 seconds between requests, and I get banned after 10 requests. Could those who know take a look: is the problem in the code, or is Yandex just that strict? If I have to use proxies at about 15 rubles each, that works out to 1.5 rubles per request?! That seems strange.
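To tell a ban apart from a parsing bug, it may help to check each response before handing it to BeautifulSoup. The helper below is a rough sketch. `looks_blocked` is a hypothetical name, and the `showcaptcha` URL marker and the 403/429 status codes are assumptions about what a block looks like, not something confirmed in this thread.

```python
from urllib.parse import urlparse

# Assumption: a block redirects to a page whose path contains this marker.
CAPTCHA_MARKER = "showcaptcha"


def looks_blocked(status_code, final_url, html):
    """Guess whether a response is a block page rather than search results."""
    if status_code in (403, 429):  # assumed status codes for a refusal
        return True
    if CAPTCHA_MARKER in urlparse(final_url).path.lower():
        return True
    # A page without any 'serp-item' element is suspicious, although it
    # could also be a legitimate page with zero results.
    return "serp-item" not in html
```

Inside the loop it would be called as `looks_blocked(r.status_code, r.url, r.text)`, since `r.url` in `requests` holds the final URL after redirects.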

1 answer

Sergey Gornostaev, 2020-11-15
@Hitreno

How to parse without a ban?