Parsing
bfesiuk, 2020-06-14 02:06:05

How do I parse AJAX-loaded content with BeautifulSoup in Python?

I decided to try writing a scraper for job listings. The site returns only 20 links, followed by a "More" button.
In the browser's "Network" tab I looked at which request the button sends. I pulled out the CSRF_TOKEN with a hack (the extraction only works every other time) and sent the request myself, but I get status code 403.

Website: https://jobs.dou.ua/vacancies/?category=Ruby

Code:

import requests
from bs4 import BeautifulSoup

HEADERS = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"}

URL = "https://jobs.dou.ua/vacancies/?category=Ruby"

session = requests.Session()


def get_html(url):
    r = session.get(url, headers=HEADERS)
    return r


def get_links(response):
    if response.status_code == 200:
        html = BeautifulSoup(response.text, "html.parser")
        lis = html.find_all('li', class_="l-vacancy")

        # Number of vacancies before clicking "More"
        print(len(lis))

        # Hacky extraction of the CSRF token
        script = str(html.select('script')[5])
        csrf = str(script[32:32+64])
        print(script)
        print(csrf)

        load_data = {
            'csrfmiddlewaretoken': csrf,
            'count': 20}
        response = session.post('https://jobs.dou.ua/vacancies/xhr-load/?category=Ruby', data=load_data)
        print(response.status_code)

        html = BeautifulSoup(response.text, "html.parser")
        lis = html.find_all('li', class_="l-vacancy")

        # Number of vacancies after clicking "More"
        print(len(lis))
    else:
        return 'Connection error!'


get_links(get_html(URL))

1 answer(s)
soremix, 2020-06-14
@bfesiuk

The site is blocked in the Russian Federation, so I can't check it myself, but here are some general recommendations that should work:

  1. That is an unusual way to extract the token. Is it in JSON format, as far as I understand? If so, import the json library and call json.loads(script), then read the token from the result as you would from a regular dictionary. I also doubt that the script tag has no attributes; it's better to select it by class, id, etc. rather than by position.
  2. The XHR request doesn't look complete. Are there any other parameters, by any chance?
  3. Add headers to the XHR request.
  4. A User-Agent alone may not be enough. Look at what the browser sends and try adding the other headers; some unusual ones may be present. You can try adding Accept, Referer, and others.
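
The points above can be sketched roughly as follows. This is only an illustration under assumptions, not a verified fix: it assumes the token sits in an inline script as something like window.CSRF_TOKEN = "..." (the exact markup is a guess, so the regular expression may need adjusting), and the helper names extract_csrf_token, build_xhr_headers and load_more are made up for this example. The field name csrfmiddlewaretoken suggests a Django backend, and Django's CSRF middleware checks the Referer header on HTTPS requests, so a missing Referer is a plausible cause of the 403. Which headers this particular site actually checks is not known.

```python
import re

# Assumption: the page embeds the token in an inline script as
#   window.CSRF_TOKEN = "<token>";
# Adjust the pattern if the real page source looks different.
CSRF_PATTERN = re.compile(r"""CSRF_TOKEN\s*=\s*["']([^"']+)["']""")

PAGE_URL = "https://jobs.dou.ua/vacancies/?category=Ruby"
XHR_URL = "https://jobs.dou.ua/vacancies/xhr-load/?category=Ruby"


def extract_csrf_token(page_html):
    """Return the CSRF token found in the page source, or None if nothing matches."""
    match = CSRF_PATTERN.search(page_html)
    return match.group(1) if match else None


def build_xhr_headers(user_agent, referer):
    """Headers a browser typically sends with an XHR; which of them the site requires is a guess."""
    return {
        "User-Agent": user_agent,
        "Referer": referer,
        "X-Requested-With": "XMLHttpRequest",
        "Accept": "application/json, text/javascript, */*; q=0.01",
    }


def load_more(session, user_agent, count=20):
    """Load the listing page, then repeat the "More" request with the token and headers.

    `session` is expected to be a requests.Session, so that the cookies set by
    the first GET are sent along with the POST.
    """
    page = session.get(PAGE_URL, headers={"User-Agent": user_agent})
    token = extract_csrf_token(page.text)
    if token is None:
        raise ValueError("CSRF token not found; check the pattern against the page source")
    payload = {"csrfmiddlewaretoken": token, "count": count}
    headers = build_xhr_headers(user_agent, PAGE_URL)
    return session.post(XHR_URL, data=payload, headers=headers)
```

To try it, call load_more(requests.Session(), HEADERS["user-agent"]) and print the status code of the returned response. If it is still 403, compare the headers and form fields with what the browser shows in the "Network" tab.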
