How do I handle UTF-8 encoding in Java?

Java

Artemy Merkulov, 2020-07-27 12:10:21

The task is as follows.
Given: a file containing UTF-8 encoded text. The characters are arbitrary.
Required: read a string of a given length from a given position in the file and print it to the console.

The problem is that the RandomAccessFile class gives me an array of bytes, and after converting it to a string I get one extra character (depending on whether or not the range includes a space).

Could you please tell me how to properly decode an array of UTF-8 bytes into a string?

An example of a line in a file: Thank you for being you

Code:

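import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Main {

    public static final int CHARS_PER_PAGE = 19;

    // getPage declares IOException, so main must declare it as well.
    public static void main(String[] args) throws IOException {
        System.out.println(getPage("test.txt", 0));
    }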

    public static String getPage(String filePath, int pageNum) throws IOException {
        int startPos = CHARS_PER_PAGE * pageNum;
        byte[] pageBytes = new byte[CHARS_PER_PAGE];

        RandomAccessFile raf = new RandomAccessFile(filePath, "r");

        raf.seek(startPos);
        raf.read(pageBytes, 0, CHARS_PER_PAGE);

        System.out.println("Bytes Array: " + Arrays.toString(pageBytes));
        System.out.println("Result String: " + new String(pageBytes, StandardCharsets.UTF_8));

        raf.close();

        return new String(pageBytes, StandardCharsets.UTF_8);
    }
}

1 answer(s)

Roman Mirilaczvili, 2020-07-27
@2ord

How to read a UTF-8 string from a file in general:
https://dzone.com/articles/read-utf-8-file-java

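// Uses java.io.* and java.nio.charset.StandardCharsets.
try (BufferedReader in = new BufferedReader(
        new InputStreamReader(new FileInputStream("file"), StandardCharsets.UTF_8))) {
    String line;
    while ((line = in.readLine()) != null) {
        System.out.println(line); // already decoded as UTF-8, no re-encoding needed
    }
}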

Required: Get a string of given length from a given location in a file

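The catch is that UTF-8 is a variable-length encoding: each Unicode character takes from 1 to 4 octets, and the number is not known in advance. Cyrillic letters, for example, take 2 octets each. So a position or length counted in bytes does not match one counted in characters, and a fixed-size byte read can cut a multi-byte character in half, which then decodes as a stray replacement character.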
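One possible way around this, sketched below, is to let a Reader do the UTF-8 decoding and count positions in characters instead of bytes. This is only a sketch under assumptions: the class name Utf8Pager, the charsPerPage parameter and the sample text in main are made up for illustration, and "character" here means a Java char (a UTF-16 code unit), so a symbol outside the Basic Multilingual Plane such as an emoji would count as two.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class Utf8Pager {

    // Returns the 0-based page pageNum, where a page is charsPerPage Java chars
    // (not bytes). Returns a shorter or empty string near or past the end of file.
    public static String getPage(String filePath, int pageNum, int charsPerPage) throws IOException {
        long toSkip = (long) charsPerPage * pageNum;
        try (BufferedReader reader =
                 Files.newBufferedReader(Paths.get(filePath), StandardCharsets.UTF_8)) {
            // skip() may skip fewer chars than requested, so loop until done or EOF.
            while (toSkip > 0) {
                long skipped = reader.skip(toSkip);
                if (skipped <= 0) {
                    return ""; // the page starts beyond the end of the file
                }
                toSkip -= skipped;
            }
            // read() may also return fewer chars than requested, so fill in a loop.
            char[] page = new char[charsPerPage];
            int filled = 0;
            while (filled < charsPerPage) {
                int n = reader.read(page, filled, charsPerPage - filled);
                if (n < 0) {
                    break; // end of file
                }
                filled += n;
            }
            return new String(page, 0, filled);
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical Cyrillic sample, written with Unicode escapes so the
        // source compiles regardless of the compiler's source encoding.
        String text = "\u0421\u043f\u0430\u0441\u0438\u0431\u043e, "
                + "\u0447\u0442\u043e \u0442\u044b \u0435\u0441\u0442\u044c";
        Path file = Files.createTempFile("utf8pager", ".txt");
        try {
            Files.write(file, text.getBytes(StandardCharsets.UTF_8));
            System.out.println("chars: " + text.length()
                    + ", bytes: " + text.getBytes(StandardCharsets.UTF_8).length);
            System.out.println("page 0: " + getPage(file.toString(), 0, 7));
            System.out.println("page 1: " + getPage(file.toString(), 1, 7));
        } finally {
            Files.delete(file);
        }
    }
}
```

Note that this version decodes the file from the beginning on every call, so its cost grows with the page number. Seeking straight to a page with RandomAccessFile would require knowing the byte offset where that page starts, for example from an index built in a single pass over the file.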