Node.js
kto_to, 2015-03-24 10:26:38

Node.js: stream (sax.js) - data loss in file streaming?

Greetings.
I worked with XML files using xml2js until I had to process large files.
After searching the Internet, I decided to use sax and saxpath for this.
But when processing some files, a new problem arose: "Unexpected close tag"

events.js:72
        throw er; // Unhandled 'error' event
              ^
Error: Unexpected close tag
Line: 819060
Column: 36
Char: >
at error (d:\js\node_prog\xml_parse\node_modules\sax\lib\sax.js:652:8)
at strictFail (d:\js\node_prog\xml_parse\node_modules\sax\lib\sax.js:672:22)
at closeTag (d:\js\node_prog\xml_parse\node_modules\sax\lib\sax.js:867:7)
at Object.write (d:\js\node_prog\xml_parse\node_modules\sax\lib\sax.js:1341:29)
at SAXStream.write (d:\js\node_prog\xml_parse\node_modules\sax\lib\sax.js:227:16)
at ReadStream.<anonymous> (d:\js\node_prog\xml_parse\pdm_xml_2.js:36:15)
at ReadStream.EventEmitter.emit (events.js:95:17)
at ReadStream.<anonymous> (_stream_readable.js:745:14)
at ReadStream.EventEmitter.emit (events.js:92:17)
at emitReadable_ (_stream_readable.js:407:10)

Trying to understand this problem, I found that the error occurs when a chunk arrives in the write() function of the sax module with two characters missing. That is, the boundary between two chunks falls inside a tag name, and since 2 characters are lost at the boundary (either at the end of the first chunk or at the beginning of the next one), the parser receives a non-existent tag name and raises the "Unexpected close tag" error.
Example of my code:
'use strict';
var saxpath = require('saxpath');
var fs = require('fs');
var sax = require('sax');

var saxParser = sax.createStream(true);
var streamer = new saxpath.SaXPath(saxParser, '//items/items');

streamer.on('match', function (xml) {
    // do something with xml
    // console.log(xml);
});

var xmlFileName = 'xml_files/MyFileName.xml';

function createClearString(origString) {
    return origString.replace("\ufeff", "");
}

var readable = fs.createReadStream(xmlFileName);
readable.on('data', function (buf) {
    buf.write(createClearString(buf.toString('utf8')));
    saxParser.write(buf);
});

The createClearString() function was added to remove the non-printable characters (\ufeff) at the beginning of the file, which caused an error.
For anyone who has the time, here is one of the files on which the described error occurs: GoogleDrive, Mega
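One possible explanation, offered as an assumption and not a confirmed diagnosis: the 'data' handler decodes each chunk to a string and writes it back into the same fixed-size buffer. If a chunk boundary splits a multi-byte UTF-8 character, the stray byte decodes to U+FFFD, which takes three bytes when re-encoded, so the string no longer fits and buf.write() drops the tail. The sketch below reproduces that effect on made-up data; roundTrip and the sample text are hypothetical and not taken from the file in question.

```javascript
'use strict';

// Hypothetical helper mirroring the round trip in the 'data' handler:
// decode the chunk, strip U+FEFF, write the string back into the same buffer.
function roundTrip(buf) {
  const text = buf.toString('utf8').replace('\ufeff', '');
  buf.write(text); // writes from offset 0 and stops when the buffer is full
  return buf;
}

// Illustrative data only: the Cyrillic letter below takes two bytes in UTF-8.
const whole = Buffer.from('я<description>', 'utf8');

// Simulate a chunk boundary that falls between those two bytes.
// Buffer.from(buffer) copies, so `whole` stays untouched.
const secondChunk = Buffer.from(whole.slice(1));

const after = roundTrip(secondChunk).toString('utf8');

// The lone byte became U+FFFD (3 bytes), so the last 2 bytes no longer fit.
console.log(JSON.stringify(after)); // the replacement character, then "<descriptio"
```

Here the final two characters of the chunk ("n>") are lost, which is the same size of loss as described above. Stripping the BOM from the first chunk has the opposite effect: the string becomes shorter than the buffer, so the old bytes at the end of the buffer are left in place.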
The error log above was produced while processing this file. But
Line: 819060
Column: 36
does not correspond to reality. In this file, the characters sc disappeared from the description tag (which is on line 819061, according to Sublime). One chunk ended with de, and the next one started with ription. I would be grateful if someone could help me understand this problem or suggest alternatives.
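As one alternative, assuming the BOM is the only reason for touching the chunks, the BOM can be removed at the byte level once, at the start of the stream, and every later chunk passed through unchanged. The sketch below uses only Node's stream module; stripBom and createBomStripper are hypothetical names, and the wiring into saxParser is shown as a comment because it relies on the sax stream accepting piped Buffer chunks and decoding them itself, which should be checked against the installed version.

```javascript
'use strict';
const { Transform } = require('stream');

const BOM = Buffer.from([0xef, 0xbb, 0xbf]); // UTF-8 byte order mark

// Hypothetical helper: drop a leading UTF-8 BOM without decoding the bytes.
function stripBom(buf) {
  const hasBom = buf.length >= 3 && buf.slice(0, 3).equals(BOM);
  return hasBom ? buf.slice(3) : buf;
}

// Hypothetical helper: a Transform that applies stripBom() to the start of
// the stream only and passes every later chunk through untouched.
function createBomStripper() {
  let head = Buffer.alloc(0); // bytes held back until 3 are available
  let checked = false;
  return new Transform({
    transform(chunk, encoding, callback) {
      if (checked) return callback(null, chunk);
      head = Buffer.concat([head, chunk]);
      if (head.length < 3) return callback(); // wait for more bytes
      checked = true;
      callback(null, stripBom(head));
    },
    flush(callback) {
      // Input shorter than 3 bytes cannot hold a full BOM: emit it as is.
      if (!checked && head.length > 0) this.push(head);
      callback();
    }
  });
}

// Intended wiring with the names from the code above (not executed here):
//   fs.createReadStream(xmlFileName).pipe(createBomStripper()).pipe(saxParser);
```

Because no chunk is ever decoded and re-encoded, a chunk boundary inside a multi-byte character or a tag name leaves the bytes intact for whatever consumes the stream next.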
