How do I correctly read a UTF-8 file in Node.js?

Node.js

Ivan_korolev, 2015-06-28 09:31:03

I read the file line by line and load the lines into the list array.
The file can grow to about 2 GB or even more, so I don't want to read it into memory in one go.

var fs=require('fs');

function rf2m(path){
  var handle=fs.openSync(path, 'r');
  var list=[], n=[], sdata='';
  do{
    n=fs.readSync(handle, 10, null, 'utf8'); // n[0] is the text read, n[1] is the length of the chunk
    sdata+=n[0]; // Append the new chunk to whatever was left after the last '\n'
    var x=sdata.split("\n"); // Split the data into lines
    sdata=x[x.length-1]; // Keep in the variable whatever follows the last '\n'
    for(var i=0; i<x.length-1;i++){
      list.push(x[i]);
      //fs.writeFileSync('log.txt', x[i]+"\n", {flag:'a'},'binary');
    }
    if(n[1]==0){ // If the length of the chunk is zero
      if(x[x.length-1]!='') list.push(x[x.length-1]);
      break;
    }
  }while(true)
  fs.closeSync(handle);
  return list;
}

When the file contains Russian characters, the output looks something like this:
[OP]Sergey
Info
Oy
[OP]Sergey
Info
In UTF-8, Russian characters take 2 bytes and Latin characters take 1. How can I solve this problem?

2 answers

Mark, 2015-06-28
@Ivan_korolev

Does https://github.com/jahewson/node-byline work correctly for you?
Ideally, you should read bytes (a Buffer) rather than strings; then nothing will break. Convert to a string only just before use.
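
The advice to read bytes and decode late could be sketched roughly as follows. This is only an illustration, not the code of node-byline: readLinesUtf8, chunkSize and the sample file contents are names and data made up for this example. It assumes Node's built-in string_decoder module, whose StringDecoder holds back an incomplete multi-byte sequence until the rest of it arrives.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');
const { StringDecoder } = require('string_decoder');

// Hypothetical helper: reads a UTF-8 file in fixed-size byte chunks and
// returns its lines. chunkSize is an illustrative parameter.
function readLinesUtf8(filePath, chunkSize = 64 * 1024) {
  const fd = fs.openSync(filePath, 'r');
  const decoder = new StringDecoder('utf8');
  const buf = Buffer.alloc(chunkSize);
  const lines = [];
  let rest = ''; // text after the last '\n' seen so far
  try {
    let bytesRead;
    // readSync returns the number of bytes read, 0 at end of file.
    while ((bytesRead = fs.readSync(fd, buf, 0, chunkSize, null)) > 0) {
      // write() decodes only complete characters and keeps a dangling
      // partial character for the next call.
      rest += decoder.write(buf.subarray(0, bytesRead));
      const parts = rest.split('\n');
      rest = parts.pop();
      for (const part of parts) lines.push(part);
    }
    rest += decoder.end(); // flush anything still buffered
    if (rest !== '') lines.push(rest);
  } finally {
    fs.closeSync(fd);
  }
  return lines;
}

// Usage with made-up sample data; the tiny chunk size of 7 bytes forces
// chunk boundaries to fall inside 2-byte Cyrillic characters.
const demoPath = path.join(os.tmpdir(), 'rf2m-demo-' + process.pid + '.txt');
fs.writeFileSync(demoPath, 'Сергей\nInfo\nОй\n', 'utf8');
const demoLines = readLinesUtf8(demoPath, 7);
fs.unlinkSync(demoPath);
console.log(demoLines);
```

Like the function in the question, this sketch still collects every line in an array, so memory use grows with the file; only the raw reading is done in small pieces.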

scapp, 2015-06-28
@scapp

"I get something like this at the output": which output exactly? Please be more specific.
n=fs.readSync(handle, 10, null, 'utf8');
If you call console.error(n) right after this line, is the encoding problem already there or not?
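
One way to check where the damage appears is to reproduce the cut in isolation. The sketch below does not call the readSync variant from the question; it only assumes that a 10-byte piece is decoded on its own, and uses a made-up sample string to show what happens when the boundary falls inside a 2-byte character.

```javascript
const { StringDecoder } = require('string_decoder');

// Made-up sample: one Latin letter followed by Cyrillic text.
const bytes = Buffer.from('AСергей', 'utf8'); // 1 + 6 * 2 = 13 bytes

// Cutting after 10 bytes lands in the middle of a 2-byte character.
const head = bytes.subarray(0, 10);
const tail = bytes.subarray(10);

// Decoding each piece on its own damages the character at the boundary.
const naive = head.toString('utf8') + tail.toString('utf8');

// A StringDecoder carries the dangling byte over to the next piece.
const decoder = new StringDecoder('utf8');
const careful = decoder.write(head) + decoder.write(tail) + decoder.end();

console.error(JSON.stringify(naive));
console.error(JSON.stringify(careful));
```

If the first line printed contains replacement characters (U+FFFD) while the second does not, the problem comes from decoding each chunk separately and not from how the lines are printed later.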