NodeJS: parsing large text files in UTF-8
I apologize in advance for the noob question.
I've run into a problem that's new to me while programming in NodeJS (though the platform probably doesn't matter). Suppose there is a large file containing a list of lines, each of which needs to be read, parsed, and saved. The file is quite large, contains Cyrillic text, and is stored in UTF-8 encoding.
To read such a file you should, of course, use a binary-safe method and process the data by reading it in chunks.
In NodeJS I allocate a 32 KB buffer, read into it, split the contents into lines, parse them, and save the results. Everything seemed to work fine, but a bug crept in unnoticed: after a while I started seeing artifacts (mangled characters) in the saved results.
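A minimal reproduction of the effect (illustrative, not my real code): a two-byte Cyrillic character cut at a chunk boundary decodes into replacement characters.

const whole = Buffer.from('абв', 'utf8');      // 6 bytes, 2 per character
const part1 = whole.slice(0, 3);               // cuts 'б' in the middle
const part2 = whole.slice(3);
console.log(part1.toString('utf8') + part2.toString('utf8')); // 'а��в'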
I immediately understood what the matter was, but I don't know how to solve the problem elegantly. Characters in UTF-8 have varying sizes in bytes, and reading in fixed-length blocks can "break" a character at the boundary between blocks. In such a situation you need to discard the last few bytes of the block you just read and re-read them at the start of the next block.
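One way I can imagine computing the discard length by hand (a sketch; the helper name is mine, not a library function): UTF-8 lead bytes encode the sequence length in their high bits, so walking back at most three bytes from the end of the chunk finds any unfinished sequence. The returned count is how many bytes to keep and prepend to the next block before decoding.

function incompleteTailLength(buf, len) {
  // A UTF-8 sequence is 1-4 bytes long, so at most the last 3 bytes
  // can belong to an unfinished sequence; walk back from the end.
  for (let i = 1; i <= 3 && i <= len; i++) {
    const byte = buf[len - i];
    if ((byte & 0x80) === 0x00) return 0;      // ASCII byte: nothing to carry
    if ((byte & 0xC0) === 0xC0) {              // lead byte (11xxxxxx) found
      let expected;                            // full length from the high bits
      if ((byte & 0xE0) === 0xC0) expected = 2;
      else if ((byte & 0xF0) === 0xE0) expected = 3;
      else expected = 4;
      return i < expected ? i : 0;             // unfinished: carry these i bytes
    }
    // otherwise a continuation byte (10xxxxxx): keep walking back
  }
  return 0;                                    // no lead byte in last 3: complete
}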
The question is: how do I calculate the number of bytes that need to be discarded? If this were C, PHP, or Java, I could turn to iconv for help. But what do I do in NodeJS? All I seem to have in the arsenal is the function that converts a buffer into a string:
buffer.toString('utf8', 0, bytesRead);
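For comparison, Node's core string_decoder module buffers an incomplete trailing UTF-8 sequence between write() calls, which addresses exactly this case. A minimal sketch of a chunked read loop using it (the file name, chunk size, and loop structure are illustrative, not the original code; Buffer.alloc assumes a reasonably modern Node):

const fs = require('fs');
const { StringDecoder } = require('string_decoder');

const decoder = new StringDecoder('utf8');
const fd = fs.openSync('data.txt', 'r');         // file name is an assumption
const buf = Buffer.alloc(32 * 1024);
let leftover = '';                               // unfinished line between chunks
let bytesRead;

while ((bytesRead = fs.readSync(fd, buf, 0, buf.length, null)) > 0) {
  // write() returns only complete characters; an incomplete trailing
  // UTF-8 sequence is held back and prepended to the next write().
  const text = leftover + decoder.write(buf.slice(0, bytesRead));
  const lines = text.split('\n');
  leftover = lines.pop();                        // may be a partial line
  for (const line of lines) {
    // parse and save each complete line here
  }
}
if (leftover) {
  // handle the final line
}
fs.closeSync(fd);

Note that the decoder only handles character boundaries; partial lines still have to be carried over manually, as the leftover variable does here.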
Has anyone faced this problem?