Scrapy
Semyon, 2020-11-15 16:40:53

Why does Yandex ban a Python parser almost immediately?

I have the following Yandex search parser.
P.S. The code is not optimized at all.

import requests
from bs4 import BeautifulSoup
import time

PAGES=5 # How many result pages to parse

def get_search(search_str):
    headers_Get = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/536 (KHTML, like Gecko) Chrome/86.0 Safari/536',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
        'Accept-Encoding': 'gzip, deflate',
        'DNT': '1',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1'
    } # Request headers
    blok_list = search_str.split()
    url_query = '%20'.join(blok_list) # Replace spaces with the %20 escape
    output = []
    for page in range(PAGES):
        url = 'https://yandex.ru/search/?text=' + url_query + '&p='+str(page)+'&lr=213' # URL of the results page
        time_start=time.time() # Time at the start of the request
        r = requests.get(url, headers=headers_Get) # Fetch the page
        soup = BeautifulSoup(r.text, "html.parser") # Hand the HTML to BeautifulSoup
        for searchWrapper in soup.find_all('li', {'class':'serp-item'}): # Find all search results
            url = searchWrapper.find('a', {'class':'i-bem'})["href"] # Take the link from the result
            if url[0]=="h": # Is it a normal link (http)?
                output.append(url) # Link found, take it
        a=time.time()-time_start # Elapsed time, used to keep 3 s between requests
        if 0<a<3 and (page+1)!=PAGES: # Avoid waiting longer than needed
            time.sleep(3-a)
    return output

# The query below means: "Ugh, scoundrels! Why ban so quickly!?"
print(len(get_search("ух, негодяи! Зачем банить так сразу!?")))
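A side note on the URL building above: joining the words with %20 escapes spaces only, so a query containing characters such as & or # would produce a malformed URL. A minimal sketch of the same URL built with the standard library follows; the helper name build_search_url and the constant BASE_URL are made up for illustration, and this is not claimed to have any effect on the ban.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://yandex.ru/search/"  # same endpoint as in the code above


def build_search_url(search_str, page, region=213):
    """Return the URL of one results page with the query percent-encoded.

    Hypothetical helper: it mirrors the original behaviour (whitespace runs
    collapsed, spaces sent as %20) but also escapes reserved characters.
    """
    query = " ".join(search_str.split())  # collapse whitespace like split() did
    params = {"text": query, "p": page, "lr": region}
    # quote_via=quote gives %20 for spaces instead of the default "+"
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


print(build_search_url("a  b&c", 0))
```

The printed URL can be passed to requests.get together with the same headers dictionary.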
I run it from a home computer in Russia, and the delay between requests is a full 3 seconds, yet I get banned after 10 requests. Could someone who knows take a look: is the problem in the code, or is Yandex just that strict? If I have to use proxies at roughly 15 rubles each, that works out to 1.5 rubles per request?! That seems strange.
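The 1.5 rubles figure can be reproduced with a one-line calculation. This is only a sketch of the arithmetic, assuming every proxy gets banned after the same 10 requests as the home IP and is thrown away afterwards; the function name cost_per_request is hypothetical.

```python
def cost_per_request(proxy_price_rub, requests_before_ban):
    """Price of one request if a proxy is discarded once it is banned."""
    return proxy_price_rub / requests_before_ban


# Numbers from the question: about 15 rubles per proxy, ban after 10 requests
print(cost_per_request(15, 10))
```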


1 answer
Sergey Gornostaev, 2020-11-15
@Hitreno

How to parse without a ban?
