Python
black_dis, 2021-10-22 13:53:37

How to parse a dynamic site?

https://forum.malinovka.org/topic/13323-list-action...
I need to parse the list of leaders and the information about them from this site.
With an ordinary requests call I get "Please turn JavaScript on and reload the page." instead of the page itself, so I can't get the data I need.

The code I am using:

import requests
from bs4 import BeautifulSoup

headers = {"user-agent" : "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36"}
res = requests.get("https://forum.malinovka.org/topic/13323-список-действующих-лидеров/", headers = headers)
soup = BeautifulSoup(res.content, "html.parser")

# The method is find_all; soup.findall is treated as a tag lookup and returns None
all_liders = soup.find_all("div", class_ = "ipsType_normal ipsType_richText ipsContained")
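
The "Please turn JavaScript on" response can be detected in code before any parsing is attempted, which makes it obvious when a plain HTTP request is not enough. A minimal sketch, assuming the placeholder text quoted in the question is what the server returns to clients that do not run JavaScript; `looks_like_js_challenge` and `fetch_static` are hypothetical helper names, not part of requests:

```python
# Text quoted in the question; assumed to appear in the placeholder page.
JS_CHALLENGE_MARKER = "Please turn JavaScript on"


def looks_like_js_challenge(html: str) -> bool:
    """Return True if the body looks like the JavaScript placeholder page."""
    return JS_CHALLENGE_MARKER.lower() in html.lower()


def fetch_static(url: str, user_agent: str) -> str:
    """Fetch the page with a plain HTTP request (no JavaScript is executed)."""
    # Imported here so the check above can be used without requests installed.
    import requests

    response = requests.get(url, headers={"user-agent": user_agent}, timeout=10)
    return response.text
```

If `looks_like_js_challenge(fetch_static(url, user_agent))` returns True, the post content is not in the response, and no BeautifulSoup selector will find it.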

2 answer(s)
Vladimir Kuts, 2021-10-22
@fox_12

Use Selenium and you're set.

jerwright, 2021-10-22
@jerwright

import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

URL = 'https://forum.malinovka.org/topic/13323-список-действующих-лидеров/'

options = webdriver.ChromeOptions()
# Selenium 4 takes the driver path through Service; Selenium 3 used executable_path=
driver = webdriver.Chrome(service=Service("chromedriver.exe"), options=options)
try:
    driver.get(URL)
    time.sleep(2)  # crude pause so the JavaScript check can finish
    needed_html_code = driver.page_source
finally:
    driver.quit()  # quit() closes the window and stops the driver process

soup = BeautifulSoup(needed_html_code, "html.parser")

# Body of the first post; the first <p> is skipped
content_div = soup.find('div', class_='cPost_contentWrap ipsPad')
for p in content_div.find_all('p')[1:]:
    for item in p.contents:
        # item.string is None for tags without a single text child, e.g. <br>
        print(item.string or '')
    print("-" * 15)

For the code to work you need to install a WebDriver (ChromeDriver in my case). If you deploy the code to Heroku, for example, you can install the driver there as well.
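
On a server such as Heroku there is no display, and a fixed sleep may be too short or too long. A minimal sketch of one way to handle both, assuming Selenium 4 with a Chrome driver that Selenium can locate on its own; `chrome_arguments` and `fetch_rendered_html` are hypothetical helper names, and the Chrome flags are commonly used container settings rather than anything this site requires:

```python
def chrome_arguments(headless: bool = True) -> list:
    """Chrome command-line flags; the container flags are an assumption for hosts like Heroku."""
    args = ["--no-sandbox", "--disable-dev-shm-usage"]
    if headless:
        # "--headless=new" needs a recent Chrome; older builds use plain "--headless"
        args.insert(0, "--headless=new")
    return args


def fetch_rendered_html(url: str, css_selector: str, timeout: float = 15.0) -> str:
    """Open the page in Chrome and return its HTML once css_selector is present."""
    # Imported lazily so chrome_arguments() can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    for arg in chrome_arguments():
        options.add_argument(arg)

    # Assumes chromedriver is on PATH or can be resolved by Selenium itself.
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until the element exists instead of sleeping a fixed time;
        # raises TimeoutException if it never appears.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return driver.page_source
    finally:
        driver.quit()
```

Calling `fetch_rendered_html(URL, "div.cPost_contentWrap")` would return the HTML that can then be passed to BeautifulSoup exactly as in the code above.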
