Python
Mr_Ares, 2021-11-20 01:09:29

Why do aiohttp GET requests return a 429 error when sent from one application, but not when split across several applications?

The program's task is to collect all the file links from a site with articles, but the server limits the number of requests: after a certain number of them it starts responding with 429 (Too Many Requests).
When I first wrote the code it used a single session, and I noticed that if I run the program several times, so that each session lives in its own process, no errors come back (I ran 5 copies of the program, hence 5 sessions). So I rewrote the program to create several sessions inside one process. The per-session throttling stayed the same: 5 requests, then a 3-second pause. And that is where I ran into the problem!

With 2 sessions everything works fine, but as soon as there are more than 2 sessions, 429 responses start coming. Moreover, if I run the 2-session version of the program several times in parallel, everything still works. In other words, 3 or more sessions inside one running program produce errors, while, say, 3 parallel copies of the 2-session program (6 sessions in total) produce none.

I want to understand why it behaves this way and how to fix it so that all the sessions can run from a single application.

I looked for the problem in how the aiohttp.ClientSession() and TCPConnector objects are created, but found nothing relevant in the documentation or elsewhere on the Internet.
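
(What I mean by a TCPConnector-based session is roughly the sketch below; the limit values are just examples, and I have not confirmed that connection limits have any effect on the 429 responses.)

import aiohttp


async def make_limited_session() -> aiohttp.ClientSession:
    # 'limit' caps the total number of simultaneous connections,
    # 'limit_per_host' caps simultaneous connections to a single host;
    # neither limits the request rate by itself.
    connector = aiohttp.TCPConnector(limit=10, limit_per_host=2)
    return aiohttp.ClientSession(connector=connector)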

Below is my code.

import asyncio
import json
import aiohttp
from bs4 import BeautifulSoup
from lxml import etree


MAIN_URLS = [
    'https://www.mdpi.com/search?sort=pubdate&page_no={0}&page_count=15&year_from=2015&year_to=2021&subjects=chem-materials&view=compact',
    'https://www.mdpi.com/search?sort=pubdate&page_no={0}&page_count=15&year_from=2015&year_to=2021&subjects=engineering&view=compact',
    'https://www.mdpi.com/search?sort=pubdate&page_no={0}&page_count=15&year_from=2015&year_to=2021&subjects=med-pharma&view=compact',
    'https://www.mdpi.com/search?sort=pubdate&page_no={0}&page_count=15&year_from=2015&year_to=2021&subjects=physics-astronomy&view=compact',
    'https://www.mdpi.com/search?sort=pubdate&page_no={0}&page_count=15&year_from=2015&year_to=2021&subjects=arts-humanity&view=compact',
    'https://www.mdpi.com/search?sort=pubdate&page_no={0}&page_count=15&year_from=2015&year_to=2021&subjects=environment&view=compact',
    'https://www.mdpi.com/search?sort=pubdate&page_no={0}&page_count=15&year_from=2015&year_to=2021&subjects=bio-life&view=compact',
    'https://www.mdpi.com/search?sort=pubdate&page_no={0}&page_count=15&year_from=2015&year_to=2021&subjects=health&view=compact',
    'https://www.mdpi.com/search?sort=pubdate&page_no={0}&page_count=15&year_from=2015&year_to=2021&subjects=computer-math&view=compact',
    'https://www.mdpi.com/search?sort=pubdate&page_no={0}&page_count=15&year_from=2015&year_to=2021&subjects=business-econ&view=compact',
    'https://www.mdpi.com/search?sort=pubdate&page_no={0}&page_count=15&year_from=2015&year_to=2021&q=Agriculture&view=compact',
    'https://www.mdpi.com/search?sort=pubdate&page_no={0}&page_count=15&year_from=2015&year_to=2021&q=Animals&view=compact',
    'https://www.mdpi.com/search?sort=pubdate&page_no={0}&page_count=15&year_from=2015&year_to=2021&q=plants&view=compact',
    'https://www.mdpi.com/search?sort=pubdate&page_no={0}&page_count=15&year_from=1996&year_to=2021&q=crop&view=compact']
SESSIONS_COUNT = 2  # how many aiohttp sessions to run in parallel
start_url, stop_url = 0, 1  # slice of MAIN_URLS to process in this run

def get_urls_from_page(html_text: str) -> list:
    """Takes page HTML, returns the list of PDF links found on that page."""
    bs = BeautifulSoup(html_text, 'lxml')
    return [tag.get('href') for tag in bs.find_all('a', {'class': 'UD_Listings_ArticlePDF'})]


async def get_page(url, session):
    """Получет ссылку на стр и сессию, возвращает html"""
    status = 1
    while status != 200:
        if status != 1:
            await asyncio.sleep(70)
        async with session.get(url) as response:
            html_text = await response.text()
            status = response.status
            print(str(status) + ' ', end='')
    return html_text


def save(main_url_num: int, new_info: dict):
    """Получает номер главной ссылки и новую информацию и сохраняет"""
    save_dir = f'saves/main_url_{main_url_num + 1}.json'
    with open(save_dir, 'r', encoding='UTF-8') as file:
        data = json.load(file)
    data['pages'] += new_info['pages']
    data['urls'] += new_info['urls']
    with open(save_dir, 'w', encoding='UTF-8') as file:
        json.dump(data, file)
    print(f'Saved {main_url_num}')


async def get_urls(session: aiohttp.ClientSession, main_url_num, page_num):
    """Downloads one search-results page and returns the PDF links found on it."""
    url = MAIN_URLS[main_url_num].format(page_num)
    html_text = await get_page(url, session)
    urls = get_urls_from_page(html_text)
    return urls


async def create_session():
    return aiohttp.ClientSession()


def get_saved_pages(main_url_num: int) -> list:
    """Получает номер главной ссылки возвращает список сохранённых стр"""
    save_dir = f'saves/main_url_{main_url_num + 1}.json'
    with open(save_dir, 'r', encoding='UTF-8') as file:
        data = json.load(file)
    return data['pages']


def get_pages_count_from_page(html_text):
    """Extracts the total number of result pages from the search page HTML."""
    bs = BeautifulSoup(html_text, 'lxml')
    dom = etree.HTML(str(bs))
    return int(dom.xpath('//*[@id="exportArticles"]/div/div[3]/div/div[2]/div[1]')[0].text.split()[-1][:-1])


async def get_work(session):
    """Builds the list of [main_url_num, page] pairs that still have to be downloaded."""
    work = []
    for i in range(start_url, stop_url):
        saved_pages = get_saved_pages(i)
        html_text = await get_page(MAIN_URLS[i].format(1), session)
        pages_count = get_pages_count_from_page(html_text)
        work_pages = [[i, page] for page in range(1, pages_count + 1) if page not in saved_pages]
        work += work_pages
        print(f'Work list for URL {i + 1} received!')
    return work


class Session:
    """Wraps one aiohttp.ClientSession and works through its assigned [main_url_num, page] tasks."""

    def __init__(self, session, ioloop: asyncio.AbstractEventLoop, session_num):
        self.session_num = session_num
        self.counter = 0
        self.ioloop = ioloop
        self.tasks = []
        self.session = session
        self.complete_tasks = []
        print(session)

    async def start(self):
        for task in self.tasks:
            print(f'Session - {self.session_num}, url - {task[0]}, page - {task[1]}')
            result = await self.ioloop.create_task(get_urls(self.session, *task))
            self.complete_tasks.append([task, result])
            self.counter += 1
            if self.counter >= 5:  # throttle: after 5 requests pause for 3 seconds
                self.counter = 0
                await asyncio.sleep(3)
            if len(self.complete_tasks) >= 20:
                self.save()
        self.save()

    def save(self):
        saves = {}
        for info, urls in self.complete_tasks:
            main_url_num, page = info
            if main_url_num in saves:
                saves[main_url_num]['pages'].append(page)
                saves[main_url_num]['urls'] += urls
            else:
                saves[main_url_num] = {'pages': [page], 'urls': urls}
        for key in saves:
            save(key, saves[key])
        self.complete_tasks.clear()

    def add_tasks(self, work):
        self.tasks.append(work)


def main():
    ioloop = asyncio.get_event_loop()
    sessions = [ioloop.run_until_complete(create_session()) for _ in range(SESSIONS_COUNT)]
    Sessions = [Session(sessions[i], ioloop, i) for i in range(SESSIONS_COUNT)]
    works = ioloop.run_until_complete(get_work(sessions[0]))
    work_to_one_session_count = len(works) // SESSIONS_COUNT
    for Sess in Sessions[:-1]:
        for _ in range(work_to_one_session_count):
            Sess.add_tasks(works.pop(0))
    for work in works:
        Sessions[-1].add_tasks(work)
    tasks = [ioloop.create_task(Sess.start()) for Sess in Sessions]
    ioloop.run_until_complete(asyncio.wait(tasks))

    print('Complete!')

    for session in sessions:
        ioloop.run_until_complete(session.close())


if __name__ == '__main__':
    main()

Vladimir Korotenko, 2021-11-20
@firedragon

429 Too Many Requests - HTTP - MDN Web Docs
Well, or change SESSIONS_COUNT = 2.
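
If the goal is still to run every session from a single application, one option (a sketch of a different approach, not taken from the answer above and not tested against this particular server) is to make all Session objects share one global rate limiter, so the total request rate stays the same no matter how many sessions are created. The 5-requests-per-3-seconds values below simply mirror the numbers from the question.

import asyncio
import time


class SharedThrottle:
    """Allows at most max_requests requests per period seconds, shared by every session."""

    def __init__(self, max_requests: int = 5, period: float = 3.0):
        self.max_requests = max_requests
        self.period = period
        self._lock = asyncio.Lock()
        self._window_start = 0.0
        self._count = 0

    async def wait(self):
        async with self._lock:
            now = time.monotonic()
            if now - self._window_start >= self.period:
                # a new time window has started, reset the counter
                self._window_start = now
                self._count = 0
            if self._count >= self.max_requests:
                # window is full: sleep out the rest of it, then start a new one
                await asyncio.sleep(self.period - (now - self._window_start))
                self._window_start = time.monotonic()
                self._count = 0
            self._count += 1


# Usage: create one SharedThrottle, pass it to every Session, and call
# `await throttle.wait()` right before each `session.get(...)`.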
