PHP
Mykola, 2014-06-11 18:31:40

Why doesn't xpath work?

I want to parse a table. Here is my code:

$fileByUrl = 'http://w1.c1.rada.gov.ua/pls/z7503/a002';
$referer = 'http://rada.gov.ua/';

$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_REFERER, $referer);
curl_setopt($ch, CURLOPT_USERAGENT, "Opera/9.80 (Windows NT 5.1; U; ru) Presto/2.9.168 Version/11.51");
curl_setopt($ch, CURLOPT_URL, $fileByUrl);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookie.txt');
curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookie.txt');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_ENCODING, 'gzip');
$str = curl_exec($ch);
$info = curl_getinfo($ch);
curl_close($ch);

$code = $info['http_code'];
if ($code == 200) {
    $doc = new DOMDocument;
    $doc->load($str);

    $xpath = new DomXPath($doc);
    $res = $xpath->query('//*[@id="content-all"]/div[2]/div/table/tbody/tr[3]');
    foreach ($res as $obj) {
        echo $obj->nodeValue;
    }
}

The echo doesn't output anything.


1 answer(s)
nowm, 2014-06-11
@iSensetivity

First of all, because of the call to $doc->load($str).
"load" reads a document from a file, so its parameter must be a path to that file. To parse markup that you already hold in a string, as you do after curl_exec, use the "loadHTML" method instead.
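As a minimal sketch of that change, the snippet below parses a string with loadHTML and queries it. The markup in $str is a made-up stand-in for the page returned by curl_exec, not the real page.

```php
<?php
// Hypothetical markup standing in for the string returned by curl_exec().
$str = '<html><body><div id="content-all"><p>Hello</p></div></body></html>';

$doc = new DOMDocument();
// loadHTML() parses an HTML string; load() would treat $str as a file path.
$loaded = $doc->loadHTML($str);

$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//*[@id="content-all"]/p');
$text = $nodes->item(0)->nodeValue;

echo $text, PHP_EOL;
```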
After that you will get a bunch of warnings. If one of them complains about the encoding, you can get rid of it by correcting the loadHTML line.
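The exact correction is not shown here. One common approach, given only as a sketch and assuming the string is already UTF-8, is to prepend a meta tag that declares the charset so the parser does not have to guess. The sample text and the $hint variable are illustrative; a page served in another encoding would first need converting, for example with iconv.

```php
<?php
// Hypothetical UTF-8 fragment; this file itself must be saved as UTF-8.
$str = '<div id="content-all"><p>Верховна Рада</p></div>';

// Assumption: declaring the charset up front lets libxml decode the bytes as UTF-8.
$hint = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';

$doc = new DOMDocument();
$doc->loadHTML($hint . $str);

$xpath = new DOMXPath($doc);
$text = $xpath->query('//*[@id="content-all"]/p')->item(0)->nodeValue;

echo $text, PHP_EOL;
```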
Besides the message about the encoding, there will be a bunch of other warnings, such as:

Warning: DOMDocument::loadHTML(): Opening and ending tag mismatch: li and div in Entity
Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity
Warning: DOMDocument::loadHTML(): Opening and ending tag mismatch: td and b in Entity

To keep these warnings from cluttering the output, you can prefix the "loadHTML" call with the "@" error-suppression operator.
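An alternative to "@", offered as a sketch rather than as part of the original answer, is to let libxml collect parse errors internally. Nothing is printed, but the messages stay available for inspection. The broken markup below is invented for illustration.

```php
<?php
// Hypothetical malformed markup, similar in spirit to the warnings above.
$str = '<ul><li>one</div></ul><p>a & b</p>';

// Collect libxml errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($str);

// Each entry is a LibXMLError object with message, line and column fields.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

$items = $doc->getElementsByTagName('li')->length;
echo 'Parsed ', $items, ' <li> element(s), ', count($errors), ' parser message(s) collected', PHP_EOL;
```

Unlike "@", this keeps other errors on the same line visible, which can help when debugging.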
Next, to check that the nodes you are looking for actually exist, you can list every node in the document, like this:
$res = $xpath->query('.//*');
foreach ($res as $obj) {
    // Print the full path of each node, one per line.
    echo $obj->getNodePath() . PHP_EOL;
}

The listing shows that the "table/tbody/tr" part of the path is wrong: there is no "tbody" in the document. The same XPath query works fine in FirePath in Firefox, for example, because Firefox normalizes the document's DOM to what it considers an ideal state: it adds "tbody" where it is missing, closes unclosed tags, and so on.
With DomDocument and DomXPath, it is better to look at the raw source of the page and build queries from that, not from the DOM generated by the browser.
In your case, you just need to remove "tbody" from the query, which gives:
$res = $xpath->query('//*[@id="content-all"]/div[2]/div/table/tr[3]');
I see a solution has already been posted, but in general the approach I described helps track down errors in situations like this.
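The effect can be reproduced on a small scale. This sketch uses an invented table with no "tbody" in its source and a shortened path; it is not the real page structure. The query that includes "tbody" matches nothing, and the one without it finds the third row.

```php
<?php
// Hypothetical markup: the source has no <tbody>, as on the page discussed.
$str = '<div id="content-all"><table>'
     . '<tr><td>row 1</td></tr><tr><td>row 2</td></tr><tr><td>row 3</td></tr>'
     . '</table></div>';

$doc = new DOMDocument();
$doc->loadHTML($str);
$xpath = new DOMXPath($doc);

// A browser would add <tbody>; DOMDocument keeps the tree as written.
$withTbody = $xpath->query('//*[@id="content-all"]/table/tbody/tr[3]');
$withoutTbody = $xpath->query('//*[@id="content-all"]/table/tr[3]');

echo 'with tbody: ', $withTbody->length, PHP_EOL;
echo 'without tbody: ', $withoutTbody->length, PHP_EOL;
```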
