Python
Li Uzumaki, 2020-08-30 17:11:46

What is wrong with my Python parser?

I tried to parse the game details (timeline, ratings) that appear on the site, but the script doesn't return the items I was trying to get.
What did I do wrong?

Here is the code:

#parse
import requests
from bs4 import BeautifulSoup

url = 'https://osu.ppy.sh/users/16873295'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36 OPR/68.0.3618.206', 'accept': '*/*'}

def get_html(url, params=None):
    r = requests.get(url, headers=headers, params=params)
    return r
    

def get_content(html):
    soup = BeautifulSoup(html, 'html.parser')
    items = soup.find_all('div', class_='play-detail')
    
    print(items)


def parse():
    html = get_html(url)
    if html.status_code == 200:
        get_content(html.text)
    else:
        print('Error')


parse()


#python 3.7.0

1 answer(s)
Sergey Karbivnichy, 2020-08-30
@Termot

I have advised this here repeatedly, so take it as a rule: before doing any parsing, have a script save the page to your disk. Then open the saved page in a text editor and look in the HTML for the element with the class (or id) you need. If it is there, you can work with requests. If it is not, use Selenium (XHR requests are another option...).
Here is the code itself:

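import requests

# Browser-like request headers
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
}

url = 'link'  # replace with the URL of the page to save
filename = 'index.html'

response = requests.get(url, headers=headers)
if response.status_code == 200:
    # Write as UTF-8 explicitly so pages with non-ASCII text save on any platform
    with open(filename, 'w', encoding='utf-8') as file:
        file.write(response.text)
else:
    # Prints the response object, e.g. <Response [404]>
    print(response)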

Enter the link and run the script. If everything is OK, an index.html file will appear on disk (you can keep practicing your parsing on this file). Otherwise, the response with its HTTP status code is printed to the console. If you get an error, adjust the headers, cookies... and try again.
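
The "look for the element in the saved HTML" step can also be done from a script rather than a text editor. Below is a minimal sketch; the helper names count_marker and check_saved_page and the two inline sample pages are made up for illustration and only stand in for a real saved index.html.

```python
from pathlib import Path


def count_marker(html, marker):
    """Count how many times the marker text occurs in the raw HTML."""
    return html.count(marker)


def check_saved_page(path, marker):
    """Read a page saved to disk and count the marker occurrences in it."""
    html = Path(path).read_text(encoding='utf-8')
    return count_marker(html, marker)


# Hypothetical samples: one page with the element in its HTML,
# one page where a script would build it later in the browser.
static_page = '<div class="play-detail">...</div>'
dynamic_page = '<div id="app"></div><script src="app.js"></script>'

print(count_marker(static_page, 'play-detail'))   # 1
print(count_marker(dynamic_page, 'play-detail'))  # 0
```

For a real check, call check_saved_page('index.html', 'play-detail') on the file written by the script above. A count of 0 means the element is not in the downloaded HTML, so requests with find_all alone will not see it.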
In this particular case, the HTML contains no div element with the play-detail class. It only appears after the page's JavaScript has been run by a JS engine. But there is a way out: all the data is there, in JSON format, inside a script tag with the id (if I'm not mistaken) json-extras.
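
Pulling JSON out of such a script tag might look roughly like the sketch below. It uses only the standard library so it runs without extra packages. The id json-extras is taken from the guess above and is not verified, and the sample HTML with its "scores" and "rank" keys is invented for illustration, not the site's real data layout.

```python
import json
from html.parser import HTMLParser


class ScriptJsonExtractor(HTMLParser):
    """Collect the text inside the <script> tag that has the given id."""

    def __init__(self, script_id):
        super().__init__()
        self.script_id = script_id
        self._inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'script' and dict(attrs).get('id') == self.script_id:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == 'script':
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


def extract_json(html, script_id='json-extras'):
    """Return the parsed JSON from the matching script tag, or None."""
    parser = ScriptJsonExtractor(script_id)
    parser.feed(html)
    parser.close()
    text = ''.join(parser.chunks).strip()
    return json.loads(text) if text else None


# Invented sample page; the real keys have to be read from the saved file.
sample_html = ('<html><body><script id="json-extras" type="application/json">'
               '{"scores": [{"rank": "A"}]}</script></body></html>')

data = extract_json(sample_html)
print(data['scores'][0]['rank'])  # A
```

With BeautifulSoup, which the question already uses, the same lookup should be soup.find('script', id='json-extras') followed by json.loads on the tag's string. Either way, check the saved index.html first to confirm the tag and its id actually exist.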
