Python
Stepan Sidorov, 2020-04-24 21:00:20

How do I get around a site's protection against parsing?

I need to parse this site: https://runcsgo.org.
The site is protected, so I use fake-useragent to get past the block.
The request seems to go through, but what I get back is completely different from what I see when I open the site in a browser.
Here is my code:

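import requests as req
from bs4 import BeautifulSoup as BS
from fake_useragent import UserAgent

# Send a Chrome-like User-Agent string generated by fake-useragent
headers = {'User-Agent': UserAgent().chrome}
html = req.get("http://csgorun.org", headers=headers)
soup = BS(html.text, features="html.parser")
# Print the parsed markup; print(html) would only show the Response object
print(soup)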

I know about Selenium, but it opens a browser window, which would interfere with my program.
Most likely I am fetching the page incorrectly, but I am not sure; the protection may also be the cause.
If anyone knows how to solve this, please reply. It would help a lot.

3 answer(s)
Sergey Karbivnichy, 2020-04-24
@Stepan47

Who told you the site is blocking you?
1) Some of the data is loaded via XHR.
2) The data on the site is also updated over a WebSocket.
websockets.readthedocs.io
PyPI websockets 8.1
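
To illustrate the two points above, here is a minimal sketch of reading the same data the page's scripts read, instead of parsing the initial HTML. The URLs `API_URL` and `WS_URL` are placeholders, not the site's real endpoints; the real ones would have to be looked up in the browser's Network tab. It also assumes the payloads are JSON, which is common but not confirmed for this site. The XHR part uses the standard library's `urllib` so it runs without extra packages; `requests` would work the same way.

```python
import json
from urllib.request import Request, urlopen

# Hypothetical endpoints: replace with the ones seen in the browser's Network tab.
API_URL = "https://example.com/api/current-state"
WS_URL = "wss://example.com/socket"


def build_xhr_request(url, user_agent="Mozilla/5.0"):
    """Build a GET request that asks for JSON, the way a page script would."""
    return Request(url, headers={
        "User-Agent": user_agent,
        "Accept": "application/json",
    })


def decode_json(raw):
    """Decode a JSON response body or WebSocket frame (assumes JSON payloads)."""
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8")
    return json.loads(raw)


def fetch_json(url):
    """Fetch an XHR endpoint and return the parsed payload."""
    with urlopen(build_xhr_request(url), timeout=10) as response:
        return decode_json(response.read())


async def listen(url, on_message, limit=5):
    """Pass up to `limit` decoded WebSocket messages to `on_message`."""
    import websockets  # third-party package: pip install websockets

    async with websockets.connect(url) as ws:
        for _ in range(limit):
            on_message(decode_json(await ws.recv()))
```

With real endpoints filled in, `fetch_json(API_URL)` would return the data as Python objects, and `asyncio.run(listen(WS_URL, print))` (after `import asyncio`) would print the first few live updates. No browser is needed for either call.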

Stepan Sidorov, 2020-04-25
@Stepan47

Here is my own answer as well.
The site does not block me when I send a User-Agent header. However, I could not get the whole page with requests and BS4, so I used the Chrome driver in headless (background) mode.
Here is the resulting code:

from time import sleep

from bs4 import BeautifulSoup as BS
from selenium import webdriver

# Run Chrome without opening a visible window
options = webdriver.ChromeOptions()
options.add_argument('--headless')
browser = webdriver.Chrome(options=options)
try:
    browser.get('https://csgorun.org/')
    sleep(5)  # crude pause so the page's scripts can load data; the delay is a guess
    soup = BS(browser.page_source, "html.parser")
finally:
    browser.quit()  # always shut down the background browser
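
One caveat with a headless browser: `page_source` can be read before the page's scripts have finished loading their data. A rough sketch of one way to handle that is to poll the source until some expected text appears. The helper `wait_for_text` and the marker string are hypothetical names made up for this illustration, not part of Selenium.

```python
import time


def wait_for_text(get_source, marker, timeout=10.0, interval=0.5):
    """Poll get_source() until `marker` appears.

    Returns the page source that contains the marker,
    or None if the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    while True:
        source = get_source()
        if marker in source:
            return source
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)
```

With the browser from the code above, this could be called as `wait_for_text(lambda: browser.page_source, "some-class-name")`, where the marker is any string that only shows up once the data has loaded. Selenium also ships its own explicit waits (`WebDriverWait`), which serve the same purpose.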

Thanks to everyone who helped.

ghazar7an, 2020-04-25
@ghazar7an

Here is a simple page parser for you.

from bs4 import BeautifulSoup
import requests
url = 'http://csgorun.org'
page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")
print(soup)
