How do I correctly read a UTF-8 encoded file?

Java

aarifkhamdi, 2018-08-06 18:47:15

I have a huge UTF-8 encoded file.
I want to do two things:
1) Read up to certain characters (say, "a" and "b").
2) Know which character I stopped at ("a" or "b").
I tried working with BufferedReader, but nothing good came of it.
The gist of what I do: read bytes into a buffer, decode them, then process the result. Everything breaks at the decode step, because UTF-8 characters have different byte lengths, so I end up with only part of a character in the buffer (for example, 1 byte out of 3).
What should I do? Maybe the approach is wrong? I am hoping for a standard solution that I am somehow missing.

3 answer(s)

aarifkhamdi, 2018-08-07
@aarifkhamdi

In the end I learned about code points and came up with this:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;


public class Main {
    public static void main(String[] args) {
        Path path = Paths.get("src/test");
        assert Files.exists(path) : "File not found";
        try (BufferedReader bufferedReader = Files.newBufferedReader(path)) {
            int ch;
            char surrogate = 0;
            while ((ch = bufferedReader.read()) != -1) {
                if (surrogate != 0) {
                    ch = Character.toCodePoint(surrogate, (char) ch);
                    surrogate = 0;
                } else if (Character.isHighSurrogate((char) ch)) {
                    surrogate = (char) ch;
                    continue;
                }
                // ch now holds a complete code point
                // and can be handled like an ordinary character
                System.out.println(Character.toChars(ch));
            }
        } catch (IOException e) {
            e.printStackTrace();
            System.exit(-1);
        }
    }
}

Maybe this will come in handy for someone.
P.S. The file src/test contains:

"幸福幸福幸福幸福一个梦想一个梦想Ðtestтест123456"

Sergey Gornostaev, 2018-08-06
@sergey-gornostaev

There is no need to decode anything by hand; just specify the file encoding when opening it:

// Requires java.io.* and java.nio.charset.StandardCharsets
try (BufferedReader in = new BufferedReader(
       new InputStreamReader(
         new FileInputStream(file), StandardCharsets.UTF_8))) {
    // Do whatever you need with the input stream
}
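
To tie this back to the two goals in the question, here is a minimal sketch of reading up to "a" or "b" through such a reader and reporting which one was hit. The class name ReadUntil and the method readUntil are made up for illustration, and an in-memory byte array stands in for the real file. Because the reader decodes for you, a multi-byte character can never be split across buffer refills.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class ReadUntil {
    /**
     * Reads decoded characters from the reader and appends them to out until
     * one of the two stop characters is seen. Returns the stop character that
     * was hit, or -1 if the input ended first. The stop character itself is
     * not appended.
     */
    public static int readUntil(Reader in, char stopA, char stopB, StringBuilder out)
            throws IOException {
        int ch;
        while ((ch = in.read()) != -1) {
            if (ch == stopA || ch == stopB) {
                return ch;
            }
            out.append((char) ch);
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        // Four Cyrillic letters written as escapes, then ASCII text containing both stop characters.
        byte[] data = "\u0442\u0435\u0441\u0442b123a".getBytes(StandardCharsets.UTF_8);
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(data), StandardCharsets.UTF_8))) {
            StringBuilder text = new StringBuilder();
            int stop = readUntil(in, 'a', 'b', text);
            System.out.println("read " + text.length() + " chars, stopped at " + (char) stop);
        }
    }
}
```

Comparing single char values is enough here because "a" and "b" are in the Basic Multilingual Plane, and surrogate halves of larger code points can never be equal to them.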

SagePtr, 2018-08-06
@SagePtr

For example, treat the buffer as a stream of bytes: as soon as reading reaches the end of the buffer, load the next chunk and reset the pointer to the beginning. When decoding, don't rely on hard-coded offsets like a[i+1], a[i+2], a[i+3], etc.; instead, fetch each next byte through something like mybuffer.getNextByte(). That method, in turn, must handle the case where the buffer is exhausted and load the next chunk of bytes.
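
A rough sketch of that idea, purely for illustration: the class name Utf8ByteDecoder and the method nextCodePoint are made up, only getNextByte comes from the description above. It assumes well-formed input and does not reject overlong or otherwise invalid encodings, so in practice the standard InputStreamReader is the safer choice.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class Utf8ByteDecoder {
    private final InputStream in;
    private final byte[] buffer;
    private int pos = 0;   // index of the next unread byte in buffer
    private int limit = 0; // number of valid bytes currently in buffer

    public Utf8ByteDecoder(InputStream in, int bufferSize) {
        this.in = in;
        this.buffer = new byte[bufferSize]; // bufferSize must be at least 1
    }

    /** Returns the next byte as 0..255, refilling the buffer when exhausted; -1 at end of input. */
    public int getNextByte() throws IOException {
        if (pos == limit) {
            limit = in.read(buffer, 0, buffer.length);
            pos = 0;
            if (limit <= 0) {
                limit = 0;
                return -1;
            }
        }
        return buffer[pos++] & 0xFF;
    }

    /** Decodes the next code point, or returns -1 at end of input. */
    public int nextCodePoint() throws IOException {
        int first = getNextByte();
        if (first == -1) return -1;
        int extra;     // number of continuation bytes that follow the lead byte
        int codePoint; // payload bits collected so far
        if (first < 0x80) { extra = 0; codePoint = first; }
        else if ((first & 0xE0) == 0xC0) { extra = 1; codePoint = first & 0x1F; }
        else if ((first & 0xF0) == 0xE0) { extra = 2; codePoint = first & 0x0F; }
        else if ((first & 0xF8) == 0xF0) { extra = 3; codePoint = first & 0x07; }
        else throw new IOException("Invalid UTF-8 lead byte: " + first);
        for (int i = 0; i < extra; i++) {
            int next = getNextByte(); // may trigger a refill in the middle of a character
            if ((next & 0xC0) != 0x80) {
                throw new IOException("Truncated or invalid UTF-8 sequence");
            }
            codePoint = (codePoint << 6) | (next & 0x3F);
        }
        return codePoint;
    }

    public static void main(String[] args) throws IOException {
        // A 3-byte character, a 4-byte character (written as a surrogate pair), then 'b'.
        byte[] data = "\u5E78\uD834\uDD1Eb".getBytes(StandardCharsets.UTF_8);
        // A 2-byte buffer forces every multi-byte character to span a refill.
        Utf8ByteDecoder decoder = new Utf8ByteDecoder(new ByteArrayInputStream(data), 2);
        int cp;
        while ((cp = decoder.nextCodePoint()) != -1) {
            if (cp == 'a' || cp == 'b') {
                System.out.println("stopped at " + (char) cp);
                break;
            }
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase());
        }
    }
}
```

Since every byte is fetched through getNextByte, the decoder never needs to know where a chunk boundary falls.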