Why can't I parse page elements with obscure classes?


Python

Fallerwood, 2021-03-30 15:19:09

I want to learn parsing. I had already written a parser once and it even worked. After a while I decided to come back to it, so I started writing a new parser and relearning almost everything from scratch. For example, I wanted to parse information about matches on the betting site parimatch. But when I try to select the elements that contain the information, the parser either cannot find them or returns an empty object. Why?
P.S. I've read a lot on forums and tried using Selenium, with the same result.

import requests
from bs4 import BeautifulSoup


URL = 'https://www.parimatch.ru/'
HEADERS = {
    'accept': 'image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36 Edg/89.0.774.63'
}


# Fetch the HTML
def get_html(url, params=None):
    html = requests.get(url, headers=HEADERS, params=params).text
    return html  # Return the fetched page


# Find the required content
def get_content(html):
    soup = BeautifulSoup(html, "html.parser")
    items = soup.find_all('div', {'class': 'QHMOkrbtqvSkGzF6oZD2a'})
    print(items)


if __name__ == '__main__':
    html = get_html(URL)
    get_content(html)

Output:

[]
Process finished with exit code 0


2 answer(s)


MinTnt, 2021-03-30
@Fallerwood

In general, parsing is not always as simple as it seems. Sites often try to protect themselves from simple parsers, even with minimal measures, and there are many kinds of protection.
If you want to see what the GET request actually returns, just write the response to a file; that makes it easier to understand where the error is. For example:

import requests

getpost = requests.get('https://www.parimatch.ru/')
with open('log.html', 'w', encoding='utf-8') as f:
  f.write(getpost.text)

Then open the saved page and look at what was actually loaded.
As you can see, it is just an empty page with a splash screen, so the data is loaded by a script.
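The same check can be done without opening the saved file. This is only a minimal sketch: the helper name `is_in_static_html` is made up for illustration, and `SHELL_PAGE` is a stand-in for the real response text, not the actual markup of the site.

```python
def is_in_static_html(html, class_name):
    """Return True if class_name occurs anywhere in the raw HTML text."""
    return class_name in html


# Stand-in for a server response: an empty shell whose content
# is filled in later by a script (hypothetical markup).
SHELL_PAGE = (
    '<html><body><div id="root"></div>'
    '<script src="app.js"></script></body></html>'
)

if __name__ == "__main__":
    # With a live response, pass requests.get(URL).text instead of SHELL_PAGE.
    print(is_in_static_html(SHELL_PAGE, "QHMOkrbtqvSkGzF6oZD2a"))
```

If the class never appears in the raw text, no BeautifulSoup selector can find it, because the element does not exist until the script has run.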
Again, it is not that simple, but that is what makes it fun. :D
From what I've seen so far, the matches are first fetched with a GET request to https://www.parimatch.ru/api/top-matches, which returns data in this format:

{"abTestLabel":null,"topEvents":[{"id":"F","eventList":["1|6154167","1|6154172","1|6154171","1|6154169","1|6153670","1|6154164","1|6154166","1|6154165","1|6154170","1|6154168"]},{"id":"CS","eventList":["1|6193860","1|6173617","1|6193859","1|6161642","2|6192368","1|6193855","1|6193858","2|6191488","2|6192369","2|6191486"]},{"id":"H","eventList":["1|6185855","1|6185856","1|6174639","1|6174637","1|6174635","1|6174636","1|6190210","1|6174680","1|6174948","1|6179742"]},{"id":"B","eventList":["1|6173785","1|6173786","1|6173784","1|6173976","1|6174103","1|6173789","1|6173929","1|6166406","1|6166663","2|6188578"]},{"id":"T","eventList":["1|6189125","1|6192182","1|6191996","1|6190277","1|6190232","1|6189853","1|6192338","1|6192328","2|6186610","1|6191995"]},{"id":"TT","eventList":["2|6193585","2|6192227","2|6193586","2|6193234","2|6193462","2|6193912","2|6192904","1|6193575","1|6193624","1|6193623"]},{"id":"VB","eventList":["2|6187527","1|6187528","1|6191657","1|6187530","1|6186281","1|6177549","1|6186390","1|6186283","1|6186388","1|6186284"]}],"source":"TopMatch"}

Based on this data, a second request is then sent to fetch the match details at
https://www.parimatch.ru/content/strapi/system/graphql
with query: "query getData($id: [String]) { events(where: {id: $id}) { slug, id, sportCode, categoryId, tournamentId }}". The match IDs received from the first request are passed in the request data as "variables":{"id":["6173617","6154171","6154169".
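The two steps described above could be sketched roughly like this. It is only a sketch under assumptions: the sample payload is a shortened copy of the response quoted earlier, the helper names `extract_event_ids` and `build_graphql_body` are made up for illustration, and the HTTP method and headers the site actually expects are not confirmed here.

```python
# Query text as quoted above.
QUERY = (
    "query getData($id: [String]) { events(where: {id: $id}) "
    "{ slug, id, sportCode, categoryId, tournamentId }}"
)

# Shortened copy of the top-matches response quoted above.
SAMPLE_TOP_MATCHES = {
    "abTestLabel": None,
    "topEvents": [
        {"id": "F", "eventList": ["1|6154167", "1|6154172"]},
        {"id": "CS", "eventList": ["1|6193860", "2|6192368"]},
    ],
    "source": "TopMatch",
}


def extract_event_ids(top_matches):
    """Collect the numeric part of every "prefix|id" entry, in order."""
    event_ids = []
    for group in top_matches.get("topEvents", []):
        for entry in group.get("eventList", []):
            # "1|6154167" -> "6154167"
            event_ids.append(entry.split("|")[-1])
    return event_ids


def build_graphql_body(event_ids):
    """Build a request body holding the query and the IDs as variables."""
    return {"query": QUERY, "variables": {"id": list(event_ids)}}


if __name__ == "__main__":
    ids = extract_event_ids(SAMPLE_TOP_MATCHES)
    print(build_graphql_body(ids)["variables"])
```

With a live session, the dict returned by `requests.get('https://www.parimatch.ru/api/top-matches').json()` would replace the sample. The body could then be sent to the graphql URL, for instance with `requests.post(url, json=body)`, though whether the endpoint accepts it in that form is an assumption.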
Hope it helped.


s7500, 2021-03-30
@s7500

It looks like a generated class. Try selecting a parent element that has a normal, stable name and reaching the child element through it.
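A minimal sketch of that idea with BeautifulSoup, assuming the HTML you have already contains the rendered elements (for example, page source taken from a browser). The markup and the class name `event-list` are hypothetical stand-ins, not the real structure of the site, and `find_children_of` is a made-up helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical rendered markup: "event-list" stands in for a stable,
# human-readable class; the inner class imitates a generated one.
SAMPLE_HTML = """
<div class="event-list">
  <div class="QHMOkrbtqvSkGzF6oZD2a"><span>Team A - Team B</span></div>
  <div class="QHMOkrbtqvSkGzF6oZD2a"><span>Team C - Team D</span></div>
</div>
"""


def find_children_of(html, parent_class):
    """Return the direct child <div>s of the first div with parent_class."""
    soup = BeautifulSoup(html, "html.parser")
    parent = soup.find("div", class_=parent_class)
    if parent is None:
        return []
    # Match children by tag and position only, ignoring their class.
    return parent.find_all("div", recursive=False)


if __name__ == "__main__":
    for item in find_children_of(SAMPLE_HTML, "event-list"):
        print(item.get_text(strip=True))
```

Because the children are matched by tag and position rather than by class, the selector keeps working if the generated class name changes between builds.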