Java
aarifkhamdi, 2018-08-06 18:47:15

How to correctly read a file in UTF-8 encoding?

I have a huge UTF-8 encoded file.
I want to do two things:
1) Read up to certain characters (say, "a" or "b")
2) Know which character I stopped at ("a" or "b")
I have been experimenting with BufferedReader, but without success.
What I do in essence: read bytes into a buffer, decode them, then process the result. Everything breaks at the decode step, because UTF-8 characters vary in length, so the buffer can end with only part of a character (for example, 1 byte out of 3).
What should I do? Maybe the approach is wrong? I am hoping for a standard solution that for some reason I am not seeing.

3 answer(s)
aarifkhamdi, 2018-08-07
@aarifkhamdi

In the end, I learned about code points and came up with this:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;


public class Main {
    public static void main(String[] args) {
        Path path = Paths.get("src/test");
        assert Files.exists(path) : "File not found";
        try (BufferedReader bufferedReader = Files.newBufferedReader(path)) {
            int ch;
            char surrogate = 0;
            while ((ch = bufferedReader.read()) != -1) {
                if (surrogate != 0) {
                    ch = Character.toCodePoint(surrogate, (char) ch);
                    surrogate = 0;
                } else if (Character.isHighSurrogate((char) ch)) {
                    surrogate = (char) ch;
                    continue;
                }
                // at this point ch holds a complete code point
                // and can be handled like an ordinary character
                System.out.println(Character.toChars(ch));
            }
        } catch (IOException e) {
            e.printStackTrace();
            System.exit(-1);
        }
    }
}

Maybe this will come in handy for someone.
P.S. The file src/test contains:
"幸福幸福幸福幸福一个梦想一个梦想Ðtestтест123456"

Sergey Gornostaev, 2018-08-06
@sergey-gornostaev

There is no need to decode anything yourself; just specify the file encoding when opening the file:

try (BufferedReader in = new BufferedReader(
       new InputStreamReader(
         new FileInputStream(file), "UTF8"))) {
    // Do whatever you need with the input stream
}
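
Once the reader decodes for you, the original task (read up to "a" or "b" and know which one stopped the read) becomes a plain loop over read(). Below is a minimal sketch, not a definitive implementation: the class name StopCharReader and the method readUntilAOrB are hypothetical, and a StringReader stands in for the file so the example is self-contained. Comparing single char values is safe here because "a" and "b" are ASCII and can never match one half of a surrogate pair.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

final class StopCharReader {
    /**
     * Appends decoded characters to out until 'a' or 'b' is read.
     * Returns the stop character, or -1 if the input ended first.
     * (Hypothetical helper, written only for illustration.)
     */
    static int readUntilAOrB(Reader in, StringBuilder out) throws IOException {
        int ch;
        while ((ch = in.read()) != -1) {
            if (ch == 'a' || ch == 'b') {
                return ch; // the caller learns which character stopped the read
            }
            out.append((char) ch);
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        // Unicode escapes keep this source file ASCII-only;
        // the text is Cyrillic and CJK letters followed by "b123a".
        String sample = "\u0442\u0435\u0441\u0442\u5e78\u798fb123a";
        try (Reader in = new BufferedReader(new StringReader(sample))) {
            StringBuilder text = new StringBuilder();
            int stop = readUntilAOrB(in, text);
            System.out.println("read " + text.length() + " chars, stopped at: "
                    + (stop == -1 ? "end of input" : String.valueOf((char) stop)));
        }
    }
}
```

For a real file, the same method should work unchanged with the BufferedReader opened above in place of the StringReader.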

SagePtr, 2018-08-06
@SagePtr

For example, treat the buffer as a stream of bytes: as soon as reading reaches the end of the buffer, load the next portion and reset the pointer to the beginning. When decoding, do not rely on hard-coded offsets like a[i+1], a[i+2], a[i+3], etc.; instead, get each next byte through a call such as mybuffer.getNextByte(). That method, in turn, must handle the case when the buffer is exhausted and load the next portion of bytes.
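
If you do want to manage the byte buffer yourself, the standard CharsetDecoder already handles a character split across two buffer fills: it stops before an incomplete sequence, and ByteBuffer.compact() carries the leftover bytes over to the next read. The following is a sketch under assumptions rather than the approach described above: the class name ChunkedUtf8Decoder and the method decode are hypothetical, and the 4-byte buffer is deliberately tiny so that characters are split on nearly every read.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

final class ChunkedUtf8Decoder {
    /** Decodes a UTF-8 stream through a small, fixed-size byte buffer. */
    static String decode(InputStream in, int bufferSize) throws IOException {
        if (bufferSize < 4) {
            // The longest UTF-8 sequence is 4 bytes and must fit in the buffer.
            throw new IllegalArgumentException("bufferSize must be at least 4");
        }
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        ByteBuffer bytes = ByteBuffer.allocate(bufferSize);
        CharBuffer chars = CharBuffer.allocate(bufferSize);
        StringBuilder out = new StringBuilder();
        boolean eof = false;
        while (!eof) {
            // Fill the free part of the buffer; leftover bytes sit at the front.
            int n = in.read(bytes.array(), bytes.position(), bytes.remaining());
            if (n == -1) {
                eof = true;
            } else {
                bytes.position(bytes.position() + n);
            }
            bytes.flip();
            CoderResult result;
            do {
                // An incomplete trailing sequence is left unread in bytes.
                result = decoder.decode(bytes, chars, eof);
                if (result.isError()) {
                    result.throwException(); // malformed input
                }
                chars.flip();
                out.append(chars);
                chars.clear();
            } while (result.isOverflow());
            bytes.compact(); // move the unread bytes to the front for the next fill
        }
        decoder.flush(chars);
        chars.flip();
        out.append(chars);
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // CJK, Cyrillic, one supplementary character (a surrogate pair), then ASCII.
        String original = "\u5e78\u798f\u0442\u0435\u0441\u0442\ud834\udd1eab";
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        String decoded = decode(new ByteArrayInputStream(utf8), 4);
        System.out.println("round trip ok: " + decoded.equals(original));
    }
}
```

InputStreamReader performs this kind of incremental decoding internally, so in most cases opening the file with an explicit charset is simpler than a hand-written loop like this one.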
