Why does bs4 parse the page incorrectly?

Python

Alena_Y, 2020-07-01 14:10:08

Good day. I'm trying to parse a VKontakte avatar, using Pavel Durov's page as an example. Part of the code is as follows:

import bs4
import requests

def getting_avatar(user_id):
    # Build the profile URL; the f-string also accepts an integer id.
    request = requests.get(f"https://vk.com/id{user_id}")
    b = bs4.BeautifulSoup(request.text, "html.parser")
    print(b)

getting_avatar(1)

The problem is that the page "Pavel Durov | VKontakte" contains about 2500 lines, among them the tag I need with id = profile_photo_link, and the result ...

3 answer(s)

Alena_Y, 2020-07-02
@Alena_Y

The question is solved. You can do it like this:
from selenium import webdriver
from selenium.webdriver.common.by import By

url = input()  # profile URL, e.g. https://vk.com/id1
driver = webdriver.Chrome()
driver.get(url)
# Screenshot the avatar element as rendered by the browser and save it as PNG.
avatar = driver.find_element(By.XPATH, '//*[@id="profile_photo_link"]/img')
with open('filename.png', 'wb') as file:
    file.write(avatar.screenshot_as_png)
driver.quit()  # end the browser session

Sergey Karbivnichy, 2020-07-01
@hottabxp

import json

import requests
from bs4 import BeautifulSoup

response = requests.get('https://vk.com/id1')

soup = BeautifulSoup(response.text, 'html.parser')
avatar = soup.find('div', id='page_avatar').a.get('onclick')
json_raw = avatar[avatar.find('{'):avatar.rfind('}') + 1]  # Extract the JSON from the onclick attribute
json_data = json.loads(json_raw)
print(json_data['temp']['x'])  # Get the avatar URL from the JSON

requests does not execute JavaScript, so anything the page adds through scripts is missing from response.text.
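
To illustrate the point, here is a small offline sketch. The markup in SAMPLE_HTML is made up for the example (it is not real VK HTML, and showPhoto is a hypothetical function name); it only assumes, as in the code above, that the avatar data sits as JSON inside an onclick attribute while the profile_photo_link element is created by a script.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the answer above; not real VK HTML.
SAMPLE_HTML = """
<div id="page_avatar">
  <a onclick='return showPhoto("photo1_0", {"temp": {"x": "https://example.com/avatar.jpg"}})'></a>
</div>
<script>
  // A browser would run this and add the element; requests never does.
  var a = document.createElement("a");
  a.id = "profile_photo_link";
  document.body.appendChild(a);
</script>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The element would exist only after the script runs, so it is absent here.
js_only = soup.find(id="profile_photo_link")

# The onclick attribute is part of the static markup, so it can be read.
onclick = soup.find("div", id="page_avatar").a.get("onclick")
json_raw = onclick[onclick.find("{"):onclick.rfind("}") + 1]
avatar_url = json.loads(json_raw)["temp"]["x"]

print(js_only)
print(avatar_url)
```

Slicing from the first "{" to the last "}" is a quick trick that works only while the attribute holds a single JSON object; if the real attribute looks different, the slice has to be adjusted.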

Guerro69, 2020-07-01
@Guerro69

Try getting the image through the VK API method:
https://api.vk.com/method/users.get?user_ids=1&fie...
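
A rough sketch of how such a call might look is below. The URL above is truncated, so the rest is assumed: the field name photo_max_orig, the API version 5.131, and the need for an access_token are my assumptions, and the helper names are hypothetical. Only the standard library is used.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://api.vk.com/method/users.get"


def build_users_get_url(user_id, access_token,
                        fields="photo_max_orig", version="5.131"):
    """Build the request URL; fields and version are assumed values."""
    params = {
        "user_ids": user_id,
        "fields": fields,
        "access_token": access_token,  # assumed: a valid VK token is required
        "v": version,
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def extract_photo(payload, field="photo_max_orig"):
    """Return the photo URL from a decoded reply, or None if it is absent."""
    users = payload.get("response") or []
    return users[0].get(field) if users else None


def fetch_avatar_url(user_id, access_token):
    """Call the API over the network and return the avatar URL, if any."""
    url = build_users_get_url(user_id, access_token)
    with urllib.request.urlopen(url) as resp:
        return extract_photo(json.load(resp))
```

With a real token this would be called as fetch_avatar_url(1, "YOUR_TOKEN"). The reply is assumed to have the shape {"response": [{...}]}; an error reply has no "response" key, so extract_photo returns None.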