PHP Simple HTML DOM Parser. Why can't I get the item?

PHP

Igor, 2014-01-25 14:00:34

Good day!
This is my first time writing a parser.
I need to parse a page of oils like this one, or rather the information in its table.
I googled around and decided to use PHP Simple HTML DOM Parser.
It partially works, but I can't figure out how to get the elements marked in the screenshot:

My code:

<?php
include 'simple_html_dom.php';

$link = 'http://lubematch.shell.com/ru/ru/equipment/100_2_8i_avant_001755';

$data = file_get_html($link);

$result = array();

foreach ($data->find('td.application') as $a) {
    $result['application'][] = $a->plaintext;
}

foreach ($data->find('td.recommendation') as $a) {
    $result['recommendation'][] = $a->plaintext;
}

foreach ($data->find('td.capacity') as $a) {
    $result['capacity'][] = $a->plaintext;
}

echo "<pre>";
print_r($result);
echo "</pre>";
?>

This is the result I get:

Thanks in advance for your help.

2 answer(s)

Alexey Sundukov, 2014-01-25
@GansikUA

Use XPath, Luke.

<?php

// [1- Download the file
// Create a stream context
$opts = array(
  'http' => array(
    'method'  => 'GET',
    'timeout' => 10,
  ),
);

$context = stream_context_create($opts);

// Fetch the page using the HTTP options set above
$page_content = file_get_contents('http://lubematch.shell.com/ru/ru/equipment/100_2_8i_avant_001755', false, $context);
// -1]

// [2- Parse the data
// [3- Build the DOM
// in effect, suppress the output of validation errors
libxml_use_internal_errors(true);
$page_dom = new \DOMDocument();

$page_dom->strictErrorChecking = false;
$page_dom->preserveWhiteSpace  = false;
$page_dom->validateOnParse     = true;

// [4- loadHTML does not handle UTF-8 by itself, so use the hack from http://php.net/manual/en/domdocument.loadhtml.php#95251
$page_dom->loadHTML('<?xml encoding="UTF-8">' . $page_content);

foreach ($page_dom->childNodes as $node) {
  if ($node->nodeType == XML_PI_NODE) {
    $page_dom->removeChild($node);
  }
}
$page_dom->encoding = 'UTF-8';
// -4]

$page_xpath = new \DOMXPath($page_dom);
// -3]

// Extract "Standard"
$param_1 = $page_xpath->query('//table[@id="recommendation"]//tr[2]/th')->item(0)->nodeValue;
// Extract "Spirax S4 ATF HDX"
$param_2 = $page_xpath->query('//table[@id="recommendation"]//tr[5]/td[1]')->item(0)->nodeValue;
// -2]

var_dump($param_1, $param_2);
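
The two queries above depend on fixed row positions. If the page layout changes and a query matches nothing, item(0) returns null and reading nodeValue from it fails. Below is a minimal sketch of a guard, run against a small inline fragment rather than the live page. Only the table id and the two texts come from this thread; the row layout of the fragment and the helper name first_text are assumptions made for illustration.

```php
<?php
// Hypothetical fragment: the table id and the two cell texts are quoted in
// this thread; the row layout is assumed and far simpler than the real page.
$html = '<table id="recommendation">
  <tr><th class="tiername">Standard</th></tr>
  <tr><td>Spirax S4 ATF HDX</td></tr>
</table>';

libxml_use_internal_errors(true); // keep markup warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Hypothetical helper: trimmed text of the first node matching $query,
// or null when the query fails or matches nothing.
function first_text(DOMXPath $xpath, string $query): ?string
{
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

$tier    = first_text($xpath, '//table[@id="recommendation"]//tr[1]/th');
$product = first_text($xpath, '//table[@id="recommendation"]//tr[2]/td[1]');
$missing = first_text($xpath, '//table[@id="recommendation"]//tr[9]/td[1]');

var_dump($tier, $product, $missing);
```

With a helper like this, a missing row shows up as null in the result instead of stopping the script.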

Igor Deyashkin, 2014-01-25
@Lobotomist

If you look at the page source, it becomes clear why the marked text is missing from your selection.
For example, you search for td elements with the class recommendation, but not every td in the third column has that class: <td>Spirax S4 ATF HDX</td> has no class at all. You also take no data from the column that holds the headers, <th class="tiername tiername">Standard</th>, so where would it come from? =)
In your place I would parse on a different principle. What structure do you want to end up with?
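
One possible "other principle", sketched under assumptions: walk the table row by row, let each th row set the current tier, and read the td cells of the following rows by position instead of by class, so a td without a class is not lost. The fragment below is hypothetical. Only the table id, the class names and the texts "Standard" and "Spirax S4 ATF HDX" appear in this thread. The placeholder cells and the column order are invented, and the real page may need extra handling (for example for cells that span several rows).

```php
<?php
// Hypothetical fragment modeled on the names quoted in this thread.
$html = '<table id="recommendation">
  <tr><th class="tiername">Standard</th></tr>
  <tr>
    <td class="application">Example application</td>
    <td>Spirax S4 ATF HDX</td>
    <td class="capacity">Example capacity</td>
  </tr>
</table>';

libxml_use_internal_errors(true); // keep markup warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$result = [];
$tier = ''; // rows seen before any header row are grouped under ''

foreach ($xpath->query('//table[@id="recommendation"]//tr') as $row) {
    // A header row switches the tier that the following rows belong to.
    $header = $xpath->query('./th', $row);
    if ($header->length > 0) {
        $tier = trim($header->item(0)->textContent);
        continue;
    }

    // Collect data cells by position, whether or not they carry a class.
    $cells = [];
    foreach ($xpath->query('./td', $row) as $cell) {
        $cells[] = trim($cell->textContent);
    }
    if ($cells !== []) {
        $result[$tier][] = $cells;
    }
}

print_r($result);
```

The result is grouped by tier, with one list of cell texts per table row. Which position means "application", "recommendation" or "capacity" has to be checked against the real markup.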