Python
Dao131, 2022-02-28 11:40:51

Selenium returns a truncated response. How do I get the full page source?

I'm trying to parse the site https://www.houzz.ru/professionals/remont-i-otdelk...

Below is the code I'm using.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
import time


# Path to the local ChromeDriver binary
chromedriver = r'E:/ProgrammFiles/chromdriver/chromedriver.exe'
opts = Options()
opts.add_argument("user-agent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36")
opts.add_argument('--headless')
browser = webdriver.Chrome(service=Service(chromedriver), options=opts)
browser.get('https://www.houzz.ru/professionals/remont-i-otdelka-kvartir-i-domov')

# Scroll to the bottom repeatedly until the page source stops changing
content = ''
while True:
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)
    if content == browser.page_source:
        break
    content = browser.page_source

time.sleep(10)  # extra pause in case scripts are still running
requiredHtml = browser.execute_script("return document.body.innerHTML;")
print(requiredHtml)
browser.quit()
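
A possible variation on the scroll loop above, offered as a sketch rather than a confirmed fix: it compares `document.body.scrollHeight` between rounds instead of the full `page_source`, since the source of a dynamic page can differ between two reads even when nothing new has loaded. The function name `scroll_to_bottom` and the `pause` and `max_rounds` parameters are made up for illustration; `driver` is assumed to be the `browser` object created above.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=50):
    """Scroll down until document.body.scrollHeight stops growing.

    `driver` is any object exposing Selenium's execute_script(script).
    Returns the last scroll height that was observed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):  # upper bound so the loop cannot run forever
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazily loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing was appended, so assume the bottom was reached
        last_height = new_height
    return last_height
```

It could be called as `scroll_to_bottom(browser)` in place of the `while True` loop. Whether this changes what the site returns has not been checked.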


This is the response I am getting.

https://pastebin.com/txVmHsNH

The first line is JSON, but it is cut off right at the start (I also shortened it myself so that the response would fit on Pastebin), and it does not contain the data I expected.

I need to get either the JSON stored in
<script id="hz-ctx" type="application/json">...</script>
or the HTML itself, which I can then parse with BeautifulSoup.

Please tell me how to do it on this site.
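
To make the goal concrete, here is a minimal sketch of the extraction step, assuming the full HTML is already available as a string (for example from `browser.page_source`). It uses only the standard library's `html.parser` rather than BeautifulSoup, and the names `ScriptById` and `extract_json_script` are hypothetical helpers, not part of Selenium. The sample HTML is invented for illustration and is not the site's real payload.

```python
import json
from html.parser import HTMLParser


class ScriptById(HTMLParser):
    """Collect the text of the first <script> element with a given id."""

    def __init__(self, script_id):
        super().__init__()
        self.script_id = script_id
        self.inside = False   # currently between the matching <script> tags
        self.found = False    # matching element was seen and closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and not self.found and dict(attrs).get("id") == self.script_id:
            self.inside = True

    def handle_data(self, data):
        if self.inside:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self.inside:
            self.inside = False
            self.found = True


def extract_json_script(html, script_id="hz-ctx"):
    """Return the parsed JSON held in <script id=script_id>, or raise ValueError."""
    parser = ScriptById(script_id)
    parser.feed(html)
    parser.close()
    if not parser.found:
        raise ValueError(f"no complete <script id={script_id!r}> element found")
    return json.loads("".join(parser.chunks))


# Invented sample document, only to show the call
sample = (
    '<html><body>'
    '<script id="hz-ctx" type="application/json">{"data": {"count": 2}}</script>'
    '</body></html>'
)
ctx = extract_json_script(sample)
print(ctx["data"]["count"])
```

With a live session, the same text could presumably also be read directly via `browser.find_element(By.ID, "hz-ctx").get_attribute("textContent")` (with `By` imported from `selenium.webdriver.common.by`) and passed to `json.loads`. This only helps if the page delivered to the headless browser actually contains the complete element.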
