How to get rendered HTML page using Selenium or PhantomJS?

PHP

pirate_prentice, 2019-10-23 12:23:53

Hello! Please help me solve a problem.
I'm writing a parser in PHP that extracts data from a page rendered by JS on the client side. For this I decided to use Selenium (via facebook/php-webdriver) or PhantomJS (via the jonnnnyw/php-phantomjs package).
But so far both approaches have only returned the original, non-rendered HTML. With Selenium, I can see that the page is fully rendered in the browser, yet $driver->getPageSource() still returns the same raw HTML and JS scripts. Adding timeouts didn't help.
How can I solve this?

3 answer(s)

Sergey c0re, 2019-10-23
@pirate_prentice

I haven't worked with Selenium, but IMHO you don't need getPageSource. Instead, once the whole page has loaded, find an element such as body and take its innerHTML if you parse with regexps. Or work with the page's DOM directly, which is probably more convenient.
Something like this (I may have the syntax wrong):

$element = $driver->findElement(WebDriverBy::cssSelector('body'));
$src = $element->getAttribute('innerHTML');

# or like this
$src = $driver->executeScript("return document.body.innerHTML");
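
If regexps get unwieldy, the rendered markup in $src can also be parsed on the PHP side. Below is a minimal sketch, assuming the dom extension is enabled; the helper name extractTexts, the sample markup, and the XPath query are made up for illustration and are not from the original page.

```php
<?php
/**
 * Return the trimmed text of every node matching $xpathQuery
 * in an HTML fragment (hypothetical helper, not part of any library).
 */
function extractTexts(string $html, string $xpathQuery): array
{
    $doc = new DOMDocument();

    // Collect libxml warnings about imperfect real-world markup
    // instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // The XML prolog is a common trick to make libxml treat the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $texts = [];
    foreach ($xpath->query($xpathQuery) as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

// Hypothetical markup standing in for the $src obtained from the browser.
$src = '<div id="app"><ul><li class="item">First</li><li class="item">Second</li></ul></div>';
print_r(extractTexts($src, '//li[@class="item"]'));
```

In a real parser, $src would come from one of the two calls above rather than from a literal string.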

kvaks, 2019-10-23
@kvaks

In Selenium, add a wait for the page to finish loading.
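
A rough sketch of such a wait with facebook/php-webdriver, using its explicit-wait API. The function name fetchRenderedHtml, the URL, and the CSS selector are assumptions for illustration; the selector has to be replaced with one that only exists after the JS has rendered.

```php
<?php
use Facebook\WebDriver\Remote\RemoteWebDriver;
use Facebook\WebDriver\WebDriverBy;
use Facebook\WebDriver\WebDriverExpectedCondition;

/**
 * Open $url, wait until an element matching $cssSelector is present,
 * then return the HTML of the live, rendered DOM.
 * (Hypothetical helper; until() throws if the timeout expires first.)
 */
function fetchRenderedHtml(
    RemoteWebDriver $driver,
    string $url,
    string $cssSelector,
    int $timeoutSec = 10
): string {
    $driver->get($url);

    // Poll every 250 ms, for at most $timeoutSec seconds.
    $driver->wait($timeoutSec, 250)->until(
        WebDriverExpectedCondition::presenceOfElementLocated(
            WebDriverBy::cssSelector($cssSelector)
        )
    );

    // Serialize the current DOM rather than relying on getPageSource().
    return $driver->executeScript('return document.documentElement.outerHTML;');
}

// Usage (assumed URL and selector):
// $html = fetchRenderedHtml($driver, 'https://example.com/list', '#app .item');
```

Unlike a fixed sleep, this returns as soon as the element shows up and fails loudly if it never does.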

grinat, 2019-10-23
@grinat

Look at which block appears once the page has rendered and do a waitForSelector with the path to that block. Once it has appeared, whatever is used for rendering (React, Vue, etc.) has done its job. Also, I don't recommend Selenium because it's shit, and PhantomJS hasn't been maintained for a long time. Take Puppeteer, the official library from Google for headless Chrome.
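
For anyone staying in PHP, the same "wait for the selector" idea can be approximated with a small polling helper. This is only a sketch: waitFor and selectorExists are made-up names, and $driver is assumed to be any object with an executeScript($script, $args) method, such as a php-webdriver RemoteWebDriver.

```php
<?php
/**
 * Call $probe repeatedly until it returns a truthy value or the timeout expires.
 *
 * @return mixed the first truthy value returned by $probe
 * @throws RuntimeException if the timeout expires first
 */
function waitFor(callable $probe, float $timeoutSec = 10.0, int $intervalMs = 250)
{
    $deadline = microtime(true) + $timeoutSec;
    do {
        $result = $probe();
        if ($result) {
            return $result;
        }
        usleep($intervalMs * 1000);
    } while (microtime(true) < $deadline);

    throw new RuntimeException("Condition not met within {$timeoutSec} seconds");
}

/**
 * Build a probe that asks the browser whether $cssSelector matches anything.
 * $driver is an assumption: any object exposing executeScript($script, $args).
 */
function selectorExists($driver, string $cssSelector): callable
{
    return function () use ($driver, $cssSelector) {
        return $driver->executeScript(
            'return document.querySelector(arguments[0]) !== null;',
            [$cssSelector]
        );
    };
}

// Usage (assumed selector):
// waitFor(selectorExists($driver, '#content .row'), 15.0);
// $html = $driver->executeScript('return document.documentElement.outerHTML;');
```

The probe is just a callable, so the same loop can wait on other conditions, for example a minimum number of rendered rows.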