Why, when parsing the site, I can not get some text data?

V

Vladimir2019-01-22 09:27:00

Parsing

Vladimir, 2019-01-22 09:27:00

Good day to all professional in my field, I really need your help, I'm trying to parse the page " https://razmerkoles.ru/size/peugeot/308/2013/ "
the problem is to get this data

<?php
require_once 'pr-24.12.2018/simple_html_dom.php';
ini_set('memory_limit', '500M');

function getCurlResult ($url) {
  $ch = curl_init();
  curl_setopt($ch, CURLOPT_URL, $url);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
  curl_setopt($ch, CURLOPT_FOLLOWLOCATION,1);
  curl_setopt($ch, CURLPROTO_HTTPS,1);
  $htmltext = curl_exec($ch);
  curl_close($ch);
  //$htmltext = iconv("CP1251", "UTF-8", $htmltext);
  return $htmltext;
}
$urlYear = 'https://razmerkoles.ru/size/peugeot/308/2013/';
      // print(getCurlResult($urlYear));
      // $file = fopen("5.txt", "w");
      // fwrite($file, getCurlResult($urlYear));
      // fclose($file);
      $html3 = str_get_html(getCurlResult($urlYear));			
      $htmlArr3 = $html3->find('#vehicle-market-data .vehicle-market'); // массив внутренних рынков
      if(count($htmlArr3) != 1) {// проверка есть ли разделения на внутренние рынки
        foreach ($htmlArr3 as $key => $div) {
          // echo $div->outertext . "<br>";
          $market = $html3->find('#vehicle-market-data .vehicle-market h4', $key)->plaintext; // выводит название внутреннего рынка
          echo $market . "<br>";
          foreach ($div->find('.modification-item') as $modifications) { // проходит по модификациям модели на одном рынке						// echo "<pre>";
            // var_dump($modifications);
            // echo "</pre>";						
            foreach ($modifications->find('tbody tr') as $k => $size) {
              // print($size->outertext . "<br>");
              $b = $size->find('td[class=data-rim aux-table-cell]')->plaintext;
              var_dump($b);
              // $file = fopen($k . ".txt", "w");
              // fwrite($file, $size->outertext);
              // fclose($file);

              // $zz++;
              // if($zz > 3) {
              // exit;
              // }
            }
          }					
        }				
      } else { //если нет разделения на внутренние рынки
      }
      $html3->clear(); // подчищаем за собой
      unset($html3);

If the Curl page is displayed on the screen, then all the information is contained, if written to
// $file = fopen("5.txt", "w");
// fwrite($file, getCurlResult($urlYear));
// fclose($file);
Then it will no longer be in a text file

. Your help is very needed!

Reply

Answer the question

In order to leave comments, you need to log in

2 answer(s)

X

xoo, 2019-01-22
L

An interesting task
Look,
you are interested in the Information block is already here, but first encoded in base64 , and then masked .

A

Alexey Ukolov, 2019-01-22
@alexey-m-ukolov

What exactly "all information" do you need? Most likely, it is generated using javascript, so when you "output it to the screen" (to the browser, as I suspect), you see it - js is executed there, but not in curl.
There is protection against parsing, it td[class=data-rim aux-table-cell]is filled through javascript and only with the user agent of a real browser.

<script>
!function () {
    (function () {
        for (var t = [/PhantomJS/.test(window.navigator.userAgent), /HeadlessChrome/.test(window.navigator.userAgent), navigator.webdriver, window.callPhantom || window._phantom], e = 0; e < t.length; e++) if (t[e]) return !0;
        return !1
    })() || (function () {
        for (var t, e = document.querySelectorAll("span[data-rim]"), r = 0; r < e.length; ++r) {
            var n = e[r], a = n.getAttribute("data-rim");
            n.innerHTML = (t = a) ? atob(function (t) {
                return t.split("").map(function (t) {
                    return t === t.toUpperCase() ? t.toLowerCase() : t.toUpperCase()
                }).join("")
            }(t.replace(/-/g, "="))) : t, n.parentNode.classList.add("aux-table-cell")
        }
    }(), function () {
        for (var t = document.querySelectorAll("tbody[data-vehicle]"), e = function (t) {
            return String.fromCharCode(t)
        }, r = 0; r < t.length; ++r) {
            var n = t[r], a = n.getAttribute("data-vehicle");
            a = a.match(/\d{3}/g).map(e).join("");
            for (var o = n.querySelectorAll("tr>td.data-bolt-pattern"), i = 0; i < o.length; ++i) o[i].innerHTML = a
        }
    }())
}();
</script>

Accordingly, you need to use, for example, PhantomJS or Chrome in headless mode , but pass the user agent of the desktop browser.