PHP
mrantony, 2019-01-15 18:03:17

How do I get past a JS script that interferes with page parsing?

Good evening everyone!
When I try to parse the page, I get the following result:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
 <head>
   <script type="text/javascript">
     (function() {
       var xhr = window.XMLHttpRequest ? new XMLHttpRequest() : new ActiveXObject('Microsoft.XMLHTTP');

       xhr.onreadystatechange = function() {
         if (xhr.readyState == 4 && xhr.responseText == 1) {
           var date = new Date();
           date.setTime(date.getTime() + 60000);
           document.cookie = 'referrer=' + encodeURIComponent(document.referrer);
           window.location = window.location.href;
           if(window.location.hash.length) location.reload();
         }
       };

       var url = location.protocol + '//' + location.hostname + '/check.page';
       var data = 'ua=' + encodeURIComponent(navigator.userAgent) + '&sec=' + encodeURIComponent('superkey14') + '&rnd=' + Math.random() + '&loc=' + encodeURIComponent(window.location.href);

       xhr.open('POST', url, true);
       xhr.setRequestHeader('Content-Type', 'application/x-www-form-urlencoded');
       xhr.send(data);
     })();
   </script>
 </head>
 <body></body>
</html>

I get the same result when I view the page source in Chrome.
I also tried php-phantomjs:
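require 'vendor/autoload.php'; // Composer autoloader (standard install path)

use JonnyW\PhantomJs\Client;

$client = Client::getInstance();
// 'url' stands in for the address of the page being parsed
$request = $client->getMessageFactory()->createRequest('url', 'GET');
$response = $client->getMessageFactory()->createResponse();
$client->send($request, $response);
echo $response->getContent();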

The result is the same.
How can I get the page as it appears in the browser?


2 answers
rPman, 2019-01-15
@mrantony

Accept it: half of the Internet already depends on JavaScript.
Load pages with a headless browser. Browser components are available for all major programming languages and platforms, based on WebKit (for example, Java WebEngine), on Firefox (Mono WebBrowser), or on Internet Explorer (.NET WebBrowser).
P.S. For PHP: https://github.com/chrome-php/headless-chromium-php
Then get the page either directly from your code by requesting webengine.document.innerHTML, or take a screenshot, or inject JavaScript into the page and work with it however you like, including emulating button presses and user behavior in general.
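
As a rough illustration of the headless-browser approach with the PHP library linked above, here is a minimal sketch. It assumes the package is installed via Composer (chrome-php/chrome), that the Composer autoloader is already loaded, and that a Chrome or Chromium binary is available locally. The helper names isChallengeStub and fetchRenderedHtml are hypothetical, and the stub detection is only a heuristic based on the HTML shown in the question, not something the site documents.

```php
<?php
// Sketch only. Assumes `composer require chrome-php/chrome`, the Composer
// autoloader already loaded, and a local Chrome/Chromium binary.
use HeadlessChromium\BrowserFactory;

/**
 * Heuristic (an assumption): the HTML still looks like the stub from the
 * question if it mentions /check.page and has an empty <body>.
 */
function isChallengeStub(string $html): bool
{
    return strpos($html, '/check.page') !== false
        && preg_match('~<body[^>]*>\s*</body>~i', $html) === 1;
}

/**
 * Open the URL in headless Chrome and poll until the stub has been
 * replaced by the real page, or until $maxAttempts is reached.
 */
function fetchRenderedHtml(string $url, int $maxAttempts = 10): string
{
    $browser = (new BrowserFactory())->createBrowser();
    try {
        $page = $browser->createPage();
        $page->navigate($url)->waitForNavigation();

        $html = '';
        for ($i = 0; $i < $maxAttempts; $i++) {
            try {
                // Read the current DOM by injecting JavaScript into the page.
                $html = $page->evaluate('document.documentElement.outerHTML')
                             ->getReturnValue();
                if (!isChallengeStub($html)) {
                    break;
                }
            } catch (\Throwable $e) {
                // The page may be reloading at this moment; retry after a pause.
            }
            sleep(1);
        }
        return $html;
    } finally {
        $browser->close();
    }
}
```

Calling fetchRenderedHtml with the target address should return whatever the DOM contains once the script's own reload has finished; if the site needs longer than the polling window, the stub is returned unchanged.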

grinat, 2019-01-15
@grinat

There are services built on Puppeteer that return the rendered HTML directly: https://github.com/GoogleChrome/rendertron (demo: https://render-tron.appspot.com/)
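
For reference, a Rendertron instance exposes a /render/<url> endpoint that returns the serialized page. The sketch below shows one way to call it from plain PHP. The function names are hypothetical, the instance base address is whatever you deploy or have access to (the demo host above may not be available), and percent-encoding the whole target address is an assumption about what the instance accepts.

```php
<?php
// Sketch only: talks to a Rendertron instance over plain HTTP.

/** Build the render endpoint address for a target page. */
function rendertronUrl(string $base, string $target): string
{
    // Assumption: the whole target URL is percent-encoded into the path.
    return rtrim($base, '/') . '/render/' . rawurlencode($target);
}

/** Fetch the rendered HTML; throws if the request fails. */
function fetchViaRendertron(string $base, string $target, int $timeout = 30): string
{
    $context = stream_context_create(['http' => ['timeout' => $timeout]]);
    $html = file_get_contents(rendertronUrl($base, $target), false, $context);
    if ($html === false) {
        throw new RuntimeException('Rendertron request failed');
    }
    return $html;
}
```

Whether the service gets past the check in the question depends on how long it waits for the page's own reload, so the returned HTML is worth inspecting before parsing it.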
