Selenium returns a truncated response. How do I get the full page source?

Dao131, 2022-02-28 11:40:51

Python

I'm trying to parse the site https://www.houzz.ru/professionals/remont-i-otdelk...

Below is the code I'm using.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
import time


# Path to the local ChromeDriver binary
chromedriver = r'E:/ProgrammFiles/chromdriver/chromedriver.exe'
opts = Options()
opts.add_argument("user-agent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36")
opts.add_argument('--headless')
browser = webdriver.Chrome(service=Service(chromedriver), options=opts)
browser.get('https://www.houzz.ru/professionals/remont-i-otdelka-kvartir-i-domov')

# Scroll to the bottom until the page source stops changing
content = ''
while True:
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)
    if content == browser.page_source:
        break
    content = browser.page_source

# Give any remaining scripts time to finish, then dump the body markup
time.sleep(10)
requiredHtml = browser.execute_script("return document.body.innerHTML;")
print(requiredHtml)
browser.quit()

This is the response I am getting.

https://pastebin.com/txVmHsNH

The first line is JSON, but it is cut off at the very beginning (I also shortened the output so that it would fit on Pastebin), and it does not contain the data I expected.

I need to get either the JSON stored in

<script id="hz-ctx" type="application/json">...</script>

or the HTML itself, which I can then parse with BeautifulSoup.
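
To make the first option concrete, here is a minimal sketch of pulling the JSON out of such a script tag once the full page markup is available as a string. It uses only the standard library rather than BeautifulSoup. The names `ScriptJsonExtractor` and `extract_script_json` and the sample markup are made up for illustration; they are not part of Selenium or of the site. Note that `document.body.innerHTML` covers only the body, so if the tag happens to sit in the head, something like `browser.page_source` would be the input to try instead.

```python
import json
from html.parser import HTMLParser


class ScriptJsonExtractor(HTMLParser):
    """Collects the text inside the first <script> tag with a given id."""

    def __init__(self, script_id):
        super().__init__()
        self.script_id = script_id
        self._inside = False   # True while we are inside the target tag
        self._done = False     # True once the target tag has been closed
        self.chunks = []       # script text may arrive in several pieces

    def handle_starttag(self, tag, attrs):
        if tag == "script" and not self._done and dict(attrs).get("id") == self.script_id:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script" and self._inside:
            self._inside = False
            self._done = True

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


def extract_script_json(html, script_id="hz-ctx"):
    """Return the parsed JSON from <script id=script_id>, or None if absent."""
    parser = ScriptJsonExtractor(script_id)
    parser.feed(html)
    parser.close()
    raw = "".join(parser.chunks).strip()
    return json.loads(raw) if raw else None


# Made-up sample markup; with Selenium the input would be browser.page_source.
sample = (
    '<html><head><script id="hz-ctx" type="application/json">'
    '{"data": {"count": 2}}</script></head><body><p>hi</p></body></html>'
)
print(extract_script_json(sample))
```

If the page is parsed with BeautifulSoup anyway, the same tag can be located by its `id` there; the sketch above only avoids the extra dependency.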

How can I do this for this site?