How to correctly read a file in UTF-8 encoding?
I have a very large UTF-8 encoded file.
I want to do two things:
1) Read up to one of several given characters (say, "a" or "b")
2) Know which character I stopped at ("a" or "b")
I have been experimenting with BufferedReader, but nothing useful has come of it.
The essence of what I do: read bytes into a buffer, decode them, and then process the result. Everything breaks at the decoding step, because UTF-8 characters vary in length, so the buffer can end with only part of a character (for example, 1 byte out of 3).
What should I do? Maybe the approach is wrong? I am hoping for a standard solution that, for some reason, I am not seeing.
In the end, I learned about code points and came up with this:
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class Main {
    public static void main(String[] args) {
        Path path = Paths.get("src/test");
        assert Files.exists(path) : "File not found";
        // Files.newBufferedReader(Path) decodes the file as UTF-8
        try (BufferedReader bufferedReader = Files.newBufferedReader(path)) {
            int ch;
            char surrogate = 0;
            while ((ch = bufferedReader.read()) != -1) {
                if (surrogate != 0) {
                    // second half of a surrogate pair: combine both halves
                    ch = Character.toCodePoint(surrogate, (char) ch);
                    surrogate = 0;
                } else if (Character.isHighSurrogate((char) ch)) {
                    // first half of a surrogate pair: remember it and read on
                    surrogate = (char) ch;
                    continue;
                }
                // ch now holds a complete code point
                // and can be handled like an ordinary character
                System.out.println(Character.toChars(ch));
            }
        } catch (IOException e) {
            e.printStackTrace();
            System.exit(-1);
        }
    }
}
There is no need to decode anything by hand; just specify the file's encoding when you open it:
try (BufferedReader in = new BufferedReader(
        new InputStreamReader(
                new FileInputStream(file), StandardCharsets.UTF_8))) {
    // Do whatever you need with the input stream;
    // in.read() returns already decoded characters, not raw bytes
}
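To tie this back to the original question, here is a minimal sketch of reading up to "a" or "b" and reporting which one was hit. The class name ReadUntilDemo and the helper readUntil are hypothetical, not part of any library, and the input is an in-memory byte array standing in for the file. It assumes the stop characters fit in a single Java char, which is true for "a" and "b".

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

class ReadUntilDemo {
    // Hypothetical helper: copies decoded characters into 'out' until one of
    // 'stops' is read. Returns the stop character, or -1 at end of input.
    static int readUntil(Reader in, StringBuilder out, int... stops) throws IOException {
        int ch;
        while ((ch = in.read()) != -1) {
            for (int stop : stops) {
                if (ch == stop) {
                    return ch;
                }
            }
            out.append((char) ch);
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        // A Cyrillic word (2 bytes per letter in UTF-8) followed by " b tail".
        // Written with Unicode escapes so the source file encoding does not matter.
        byte[] utf8 = "\u043f\u0440\u0438\u0432\u0435\u0442 b tail"
                .getBytes(StandardCharsets.UTF_8);
        try (Reader in = new InputStreamReader(
                new ByteArrayInputStream(utf8), StandardCharsets.UTF_8)) {
            StringBuilder prefix = new StringBuilder();
            int stop = readUntil(in, prefix, 'a', 'b');
            System.out.println("read " + prefix.length() + " chars, stopped at: "
                    + (stop == -1 ? "end of input" : String.valueOf((char) stop)));
        }
    }
}
```

For a real file, the same helper should work with the BufferedReader opened above; for a huge file you would probably process the characters as they arrive rather than collect them in a StringBuilder.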
For example, you can treat the buffer as a stream of bytes: as soon as reading reaches the end of the buffer, load the next portion and reset the pointer to the beginning. When decoding, don't rely on hard-coded offsets like a[i+1], a[i+2], a[i+3], etc.; instead, get the next byte through something like mybuffer.getNextByte(). That method, in turn, must handle the situation where the buffer is exhausted and, in that case, load the next portion of bytes.
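The idea can be sketched roughly as follows. The class name RefillingUtf8Source and the method nextCodePoint are made up for illustration; only getNextByte comes from the description above. This is a simplified decoder under stated assumptions: it checks lead and continuation bytes but does not reject overlong encodings or encoded surrogates, which the standard InputStreamReader already handles for you. The buffer is deliberately tiny (2 bytes) so that multi-byte characters are split across refills.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class RefillingUtf8Source {
    private final InputStream in;
    private final byte[] buf;
    private int pos = 0;    // index of the next unread byte in buf
    private int limit = 0;  // number of valid bytes currently in buf

    RefillingUtf8Source(InputStream in, int bufferSize) {
        this.in = in;
        this.buf = new byte[bufferSize];
    }

    // Returns the next byte as 0..255, refilling the buffer when it is
    // exhausted, or -1 at end of input.
    int getNextByte() throws IOException {
        if (pos == limit) {
            int n = in.read(buf);
            pos = 0;
            limit = Math.max(n, 0);
            if (n <= 0) {
                return -1;
            }
        }
        return buf[pos++] & 0xFF;
    }

    // Decodes one code point, pulling every byte through getNextByte(),
    // so a character split across two buffer loads is not a special case.
    int nextCodePoint() throws IOException {
        int b0 = getNextByte();
        if (b0 < 0x80) {
            return b0; // ASCII, or -1 at end of input
        }
        int extra;
        int cp;
        if ((b0 & 0xE0) == 0xC0) {
            extra = 1; cp = b0 & 0x1F;
        } else if ((b0 & 0xF0) == 0xE0) {
            extra = 2; cp = b0 & 0x0F;
        } else if ((b0 & 0xF8) == 0xF0) {
            extra = 3; cp = b0 & 0x07;
        } else {
            throw new IOException("invalid UTF-8 lead byte");
        }
        for (int i = 0; i < extra; i++) {
            int b = getNextByte();
            if (b < 0 || (b & 0xC0) != 0x80) {
                throw new IOException("truncated or malformed UTF-8 sequence");
            }
            cp = (cp << 6) | (b & 0x3F);
        }
        return cp;
    }

    public static void main(String[] args) throws IOException {
        // 'a' (1 byte), the euro sign (3 bytes), an emoji (4 bytes)
        byte[] utf8 = "a\u20ac\ud83d\ude00".getBytes(StandardCharsets.UTF_8);
        RefillingUtf8Source src =
                new RefillingUtf8Source(new ByteArrayInputStream(utf8), 2);
        int cp;
        while ((cp = src.nextCodePoint()) != -1) {
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase());
        }
    }
}
```

Run as is, this prints U+61, U+20AC and U+1F600 on separate lines. In practice the reader-based answers above are the simpler choice; a hand-written decoder like this mostly makes sense when you need to control buffering yourself.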