JavaScript
kocherman, 2012-01-30 00:31:35

Node.js: parsing large text files in UTF-8

I apologize in advance for the noob question.
While programming under Node.js I ran into a problem that is new to me (though the platform may not matter). Suppose there is a large file containing a list of lines, each of which needs to be read, parsed, and saved. The file is quite large, contains Cyrillic text, and is stored in UTF-8.
To read such a file you should, of course, use a binary-safe method and process the data in chunks.
In Node.js I allocate a 32 KB buffer, read into it, split the contents into lines, parse them, and save the results. Everything seemed to work fine, but a bug crept in unnoticed: after a while I spotted artifacts in the saved output.
I immediately understood what the problem was, but I don't know how to solve it elegantly. Characters in UTF-8 take up different numbers of bytes, so reading in fixed-size blocks can "break" a character at the boundary between two blocks. In such a situation you have to discard the trailing bytes of the current block and read them again as part of the next one.
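For example, here is a minimal sketch of the effect (the split two-byte character is hard-coded here rather than read from a real file):

// 'я' is encoded as the two bytes 0xD1 0x8F in UTF-8. Cutting a read
// buffer between those bytes yields the replacement character U+FFFD
// in each half instead of the original letter.
const whole = Buffer.from('я', 'utf8');          // <Buffer d1 8f>
console.log(whole.slice(0, 1).toString('utf8')); // '\ufffd' (artifact)
console.log(whole.slice(1).toString('utf8'));    // '\ufffd' (artifact)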

The question is: how do I calculate the number of bytes that need to be discarded? In C, PHP, or Java I could turn to iconv for help. What do I do in Node.js? The only tool in the arsenal seems to be the function that converts a buffer into a string:

buffer.toString('utf8', 0, bytesRead);
Has anyone run into this problem?


1 answer
PomanoB, 2012-01-30

en.wikipedia.org/wiki/UTF-8 - there is a table there; from the first byte of a character you can tell how many bytes the character occupies.
And everything is even simpler if the text is only Latin and Cyrillic, since every character is then at most two bytes, so you never discard more than one.
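A sketch of that approach; the helper name incompleteTailLength is my own, not something from the question or a library, and it assumes the input is valid UTF-8 apart from a possibly truncated final character:

// Returns how many bytes at the end of buf belong to an incomplete
// UTF-8 character (0 if the buffer ends on a character boundary).
function incompleteTailLength(buf) {
  let i = buf.length - 1;
  let continuations = 0;
  // Walk back over continuation bytes (10xxxxxx).
  while (i >= 0 && (buf[i] & 0xc0) === 0x80) {
    continuations++;
    i--;
  }
  if (i < 0) return buf.length;                  // only continuation bytes
  const lead = buf[i];
  let expected;
  if ((lead & 0x80) === 0x00) expected = 1;      // 0xxxxxxx, ASCII
  else if ((lead & 0xe0) === 0xc0) expected = 2; // 110xxxxx
  else if ((lead & 0xf0) === 0xe0) expected = 3; // 1110xxxx
  else if ((lead & 0xf8) === 0xf0) expected = 4; // 11110xxx
  else return continuations;                     // invalid lead byte
  const have = 1 + continuations;                // lead + continuations
  return have < expected ? have : 0;             // incomplete -> carry over
}

// Usage in the 32 KB read loop (sketch): decode only the complete part
// and prepend the tail bytes to the next block before decoding it.
// const tail = incompleteTailLength(buffer.slice(0, bytesRead));
// const text = buffer.toString('utf8', 0, bytesRead - tail);

For what it's worth, Node's built-in string_decoder module performs the same carry-over internally: a StringDecoder's write(chunk) returns only the complete characters and buffers a trailing partial sequence until the next chunk arrives.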
