How to write XPath correctly?

PHP

Dmitry Baibukhtin, 2014-06-09 19:39:35

Hello, I would appreciate some help. I have a table and need to select all of its rows.
From each row I need a div with the class name, a div with the class description, and a div with the class rating.
How do I do this? I assume it takes several XPath queries: one to get all the rows, and the rest to get the individual elements from each row (in practice it is not that simple, because the HTML is a mess).
I tried the following, but DOMXPath accepts only a DOMDocument object and nothing else:

$pageDom = new DOMDocument();
@$pageDom->loadHTML($pageHtml);
$pageXPath = new DomXPath($pageDom);
$elementsDom = $pageXPath->query('table/tr');
// Process all elements
foreach ($elementsDom as $elementDom) {
    // The error is here
    $elementXPath = new DomXPath($elementDom);

    $element = array();
    $element['name'] = $elementXPath->query('div[class="name"]')->item(0)->nodeValue;
    $element['description'] = $elementXPath->query('div[class="description"]')->item(0)->nodeValue;
    $element['rating'] = $elementXPath->query('div[class="rating"]')->item(0)->nodeValue;

    $elements[] = $element;
}

1 answer(s)

nowm, 2014-06-09
@PiloTeZ

First of all, understand that query() can only be called on a DOMXPath object, and a DOMXPath can only be constructed from a DOMDocument. That is all there is to it: you cannot hand it a DOMNodeList or a DOMNode. Because of this, you need a different approach to getting the data.
You assume you can find the table with one XPath object, build another one from the result to find a DIV in it, then another to find a SPAN in that DIV, and another to find an A in that. DOMXPath does not work like that. If you want to find an element, search for it right away, from the DOM root.
I may have made mistakes in the XPath expressions themselves, but you need something like this:

$pageDom = new DOMDocument();
@$pageDom->loadHTML($pageHtml);
$pageXPath = new DomXPath($pageDom);

$elementsName = $pageXPath->query('//table//div[@class="name"]');
$elementsDescription = $pageXPath->query('//table//div[@class="description"]');
$elementsRating = $pageXPath->query('//table//div[@class="rating"]');

$elements = array();

for ($i = 0; $i < $elementsName->length; $i++) {
    $elements[] = array(
        'name' => $elementsName->item($i)->nodeValue,
        'description' => $elementsDescription->item($i)->nodeValue,
        'rating' => $elementsRating->item($i)->nodeValue,
    );
}

//Profit
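
A note on the predicate syntax, since it is easy to get wrong: in XPath an attribute is addressed as @class, while a bare class is a test for a child element named class. Below is a minimal sketch with made-up HTML (the markup and the variable names are my assumptions, not the asker's page) that compares the two, plus a token match for divs that carry several classes.

```php
<?php
// Hypothetical sample markup, invented for this illustration.
$html = '<table>'
      . '<tr><td><div class="name">Alpha</div></td></tr>'
      . '<tr><td><div class="name featured">Beta</div></td></tr>'
      . '</table>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of silencing with @
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Bare "class" tests for a child element <class>, so nothing matches here.
$childTest = $xpath->query('//table//div[class="name"]');

// "@class" is the attribute; an exact comparison matches only class="name".
$exact = $xpath->query('//table//div[@class="name"]');

// Token match: also accepts class="name featured".
$token = $xpath->query(
    '//table//div[contains(concat(" ", normalize-space(@class), " "), " name ")]'
);

echo $childTest->length, ' ', $exact->length, ' ', $token->length, PHP_EOL;
```

If the real page uses divs with more than one class, the token form is probably the one to use; otherwise the exact comparison is enough.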

You can also anchor a search to earlier results. DOMXPath::query() takes an optional second parameter of type DOMNode, the context node, and that gives you this kind of sub-query:

$pageDom = new DOMDocument();
@$pageDom->loadHTML($pageHtml);
$pageXPath = new DomXPath($pageDom);

$elementsDom = $pageXPath->query('.//table/tr');

$elements = array();

foreach ($elementsDom as $elementDom) {
    $elements[] = array(
        'name' => $pageXPath->query('.//div[@class="name"]', $elementDom)->item(0)->nodeValue,
        'description' => $pageXPath->query('.//div[@class="description"]', $elementDom)->item(0)->nodeValue,
        'rating' => $pageXPath->query('.//div[@class="rating"]', $elementDom)->item(0)->nodeValue,
    );
}

The key point is that the same $pageXPath is reused, and there is no attempt to create a separate DOMXPath from a DOMNode. The search then runs in the context of the earlier result, because the second argument of DOMXPath::query(query_string, context_node) names the node relative to which the expression is evaluated. So in this situation './/div[@class="name"]' is searched not in the whole document but only inside the current TR row. The leading dot matters: '//div[...]' would search the whole document even when a context node is passed.
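
To try the context-node approach without the real page, here is a small self-contained sketch. The HTML is invented, and firstText() is a hypothetical helper I added so that a row with a missing div yields null instead of a fatal error on item(0); treat it as one possible way to write this, not the only one.

```php
<?php
// Hypothetical sample markup; the second row deliberately has no rating div.
$pageHtml = '<table>'
    . '<tr><td><div class="name">Alpha</div><div class="description">First</div>'
    . '<div class="rating">5</div></td></tr>'
    . '<tr><td><div class="name">Beta</div><div class="description">Second</div></td></tr>'
    . '</table>';

$pageDom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of silencing with @
$pageDom->loadHTML($pageHtml);
libxml_clear_errors();

$pageXPath = new DOMXPath($pageDom);

// Hypothetical helper: text of the first node matching $query inside $context, or null.
function firstText(DOMXPath $xpath, string $query, DOMNode $context): ?string
{
    $nodes = $xpath->query($query, $context);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

$elements = array();

// One query for the rows, then relative queries evaluated against each row.
foreach ($pageXPath->query('//table//tr') as $row) {
    $elements[] = array(
        'name' => firstText($pageXPath, './/div[@class="name"]', $row),
        'description' => firstText($pageXPath, './/div[@class="description"]', $row),
        'rating' => firstText($pageXPath, './/div[@class="rating"]', $row),
    );
}

print_r($elements);
```

Because every field is looked up inside its own row, the values stay aligned per row even when one of the divs is absent, which three separate document-wide lists indexed by position cannot guarantee.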