How to get page code?


Python

rsefsE, 2020-07-29 16:26:32

I'm parsing Facebook and I've run into a problem. When I inspect the page in the browser, I see a nicely formatted HTML tree broken into blocks, and so on. But when I build the soup, I get what looks to me like obfuscated page code. If you've run into this, what did you do? Or perhaps you know some good sources that explain obfuscation in a way a beginner can follow. I'd be grateful for any help. Here is the sample code where I get the soup:

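from bs4 import BeautifulSoup

if not self.browser.is_free():
    self.browser.driver.get(url)
    # Alternative: read the live DOM via JavaScript.
    # Note the `return`; without it execute_script() gives back None.
    # js_code = "return document.getElementsByTagName('html')[0].outerHTML"
    # your_elements = self.browser.driver.execute_script(js_code)
    html = self.browser.driver.page_source  # HTML of the current page as reported by the driver

    soup = BeautifulSoup(html, 'html.parser')

    return soup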


2 answer(s)


Stalker_RED, 2020-07-29
@rsefsE

When you look in devtools, you see the DOM built by the browser: it parsed the markup as best it could, fixed whatever errors it could, and rendered the result as a tidy tree. When you view the source (Ctrl+U in the browser), you see what actually came from the server.
HTML entities are not hard to decode:

import html
x = html.unescape('&#x42d;&#x445;&#x43e; &#x41c;&#x43e;&#x441;&#x43a;&#x432;&#x44b;')
print(x)  # -> Эхо Москвы

https://ideone.com/vtqrhO
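
As a follow-up to the decoding snippet above: an HTML parser usually decodes entities on its own, so there is rarely a need to call html.unescape() by hand on text taken from a parsed tree. Below is a minimal sketch using only the standard library. The TextCollector class and the sample markup are made up for illustration and are not taken from the Facebook page; BeautifulSoup's get_text() decodes entities in the same way.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: collects the non-empty text nodes of a document."""

    def __init__(self):
        # convert_charrefs=True is the default, so entities such as
        # &#x42d; reach handle_data() already decoded.
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


# Made-up markup that mimics what the raw page source looks like.
raw = ('<div><span>&#x42d;&#x445;&#x43e; '
       '&#x41c;&#x43e;&#x441;&#x43a;&#x432;&#x44b;</span></div>')

collector = TextCollector()
collector.feed(raw)
collector.close()
print(collector.chunks)  # ['Эхо Москвы']
```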


soremix, 2020-07-29
@SoreMix

I don't quite understand what the problem is.
In F12 -> Elements you see the page code after JS has processed it.
In Ctrl+U you see the source code, without any JS processing.
In the source code the tree is not laid out nicely because Facebook decided so: the code is not meant for humans to read, and the computer understands it perfectly well in minified form.
Or is it this kind of encoding

&#x42d;&#x445;&#x43e; &#x41c;&#x43e;&#x441;&#x43a;&#x432;&#x44b;

that frightened you?
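
To illustrate the point that minified source is still a perfectly normal tree for a parser, here is a rough sketch that re-indents a one-line HTML string using only the standard library. The TreePrinter class, the VOID_TAGS set, and the sample markup are assumptions made up for this example; with BeautifulSoup, soup.prettify() gives a similar view.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag (partial list, enough for the sketch).
VOID_TAGS = {"br", "img", "input", "meta", "link", "hr"}


class TreePrinter(HTMLParser):
    """Hypothetical helper: rebuilds an indented outline of the markup.

    Attributes are dropped to keep the output short.
    """

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + f"<{tag}>")
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        self.depth = max(self.depth - 1, 0)
        self.lines.append("  " * self.depth + f"</{tag}>")

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.lines.append("  " * self.depth + text)


# Made-up minified markup, all on one line as in the raw page source.
minified = '<div class="a"><ul><li>one</li><li>two</li></ul></div>'

printer = TreePrinter()
printer.feed(minified)
printer.close()
print("\n".join(printer.lines))
```

The printed outline has one tag or text node per line, indented by nesting depth, which is roughly the view devtools shows.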