Strange behavior of NodeJS+XPath

cat_crash, 2012-10-27 13:20:07

Node.js

Good afternoon

I'm trying to learn the currently fashionable NodeJS, and as an exercise I decided to build a service for scraping sites with XPath queries.
Why NodeJS? Because I want to learn it.
Why XPath? Because I decided not to reinvent the wheel.
Why not jQuery? Because with jQuery you cannot get the value of a node's attribute with a single query (the XPath analogue is //a/@href).

Here is the NodeJS code itself:

var xpath = require('xpath');
var dom = require('xmldom').DOMParser;
var request = require('request');
var tidy = require('tidy2');
var util = require('util');

// Pages to scrape
var url = [
  'http://www.rul.by/companies/sect.1.html',
  'http://www.rul.by/companies/sect.1.page.2.html'
];

url.forEach(function (item) {
  request(item, function (error, response, body) {
    console.log('Fetch url:' + item);

    if (!error && response.statusCode == 200) {
      // Repair the broken HTML with Tidy, then parse it into a DOM
      var html = tidy.tidyString(body);
      var doc = new dom().parseFromString(html);

      // XPath queries for the fields to extract
      var fields = [
        "//li[@class='nm']/a/text()",
        "//li[@class='adr']/text()",
        "//li[@class='txt']/text()",
        "//td[@class='cont']/ul/li/text()",
        "//td[@class='cont']/ul/li/a/text()"
      ];

      fields.forEach(function (field) {
        // Run the query and dump the result with util.inspect
        var tmp = xpath.select(field, doc);
        console.log(util.inspect(tmp));
      });
    }
  });
}, this);

Everything works fine up to this point: the request is sent, the content is received, and Tidy repairs the broken HTML. Then something incomprehensible begins. XPath works, but returns garbage containing binary data and pieces of what look like NodeJS libraries. Here is an example of the output: www.dropbox.com/s/17hny8fso80d2hi/out.txt (download it and open it in an editor that can display non-ASCII characters, for example Notepad++).
For simpler queries such as //title, however, it returns what is expected.

The question is: has anyone encountered this kind of problem? How can I trace the cause of the error? To whom should the bug be reported? (I assume the author of the XPath library.)

In general, any comments that help to understand the reason are welcome.

1 answer(s)

Malyw, 2012-10-27
@cat_crash

If you just want the content itself, use:
console.log(tmp.toString());
I checked your example with this replacement, and everything works.
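
A likely explanation, sketched under assumptions: util.inspect prints every property of each node object in the result, including references to the parent node and the owning document, while toString() yields only the content. The snippet below imitates this with a hypothetical stand-in object (makeTextNode, doc and li are made up for illustration and are not the real xmldom classes), so it runs with nothing but Node's built-in util module.

```javascript
const util = require('util');

// Hypothetical stand-in for a DOM text node (NOT the real xmldom class):
// it keeps back-references to its parent and document, plus a toString()
// that returns only the text content.
function makeTextNode(text, parent) {
  return {
    nodeType: 3,
    data: text,
    parentNode: parent,
    ownerDocument: parent.ownerDocument,
    toString() { return this.data; }
  };
}

// Made-up document and element objects, just enough to create nesting.
const doc = { nodeName: '#document', implementation: { features: {} } };
const li = { nodeName: 'li', ownerDocument: doc, attributes: { class: 'nm' } };

// Pretend this array is what an XPath text() query returned.
const tmp = [makeTextNode('Company A', li), makeTextNode('Company B', li)];

// util.inspect walks the properties of every node, so the dump is dominated
// by internals (parentNode, ownerDocument, ...) rather than the text.
const inspected = util.inspect(tmp);

// Array#toString joins the elements with commas, calling toString() on each.
const joined = tmp.toString();

// Reading the text property of each node gives one clean string per match.
const texts = tmp.map(function (node) { return node.data; });

console.log(inspected);
console.log(joined);
console.log(texts);
```

With the real libraries the same idea would presumably be to print tmp.toString(), or to map over the returned nodes and read each text node's data property, instead of passing the whole array to util.inspect.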