How to determine the encoding of an ID3 tag?
I have two test MP3 files whose tags are in Latin1 and UTF-8, respectively. I'm trying to read them:
System.out.println(id3v2Tag.getAlbum());
?????????? ????????: ???????? ?? ??? ???????
Эльфийская Рукопись: Сказание На Все Времена
System.out.println(new String(id3v2Tag.getAlbum().getBytes("Latin1")));
Эльфийская Рукопись: Сказание На Все Времена
?????????? ????????: ???????? ?? ??? ???????
Let me explain why System.out.println(new String(id3v2Tag.getAlbum().getBytes("Latin1"))) works.
As I understand it, you are using some library that can read id3 tags.
This library reads a raw byte array from the file. It needs to convert the bytes to a string, and for that it needs to use some encoding. Ideally, this encoding should be set in the library settings. But if the encoding is not set, then, apparently, Latin1 is used.
So the library converts bytes to a string using Latin1. It works like this: a byte is taken, mapped to a character, and that character is stored in the string. For example, the library reads the byte 0xD5, which represents the letter "Х" in windows-1251; in Latin1 this byte corresponds to "Õ". If you convert such a string to a byte array using UTF-8 and write the bytes to a file, you will not see Russian letters when you view the file as UTF-8.
Next, you want to print the string, so you convert it to bytes using Latin1. The character "Õ" is mapped back to the byte that represents the letter "Х" in windows-1251. Then a string is created from these bytes again, this time using the system default encoding, which on your machine is windows-1251. As a result, the byte yields the character "Х", as intended, and the string is displayed correctly on the screen.
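The round trip described above can be reproduced without any MP3 library. This is only a sketch: the class and method names are made up for illustration, and it assumes the tag bytes really are windows-1251 and that the library decodes them as Latin1. The byte 0xD5 is used because it is "Х" in windows-1251 and "Õ" in Latin1.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Latin1RoundTrip {
    // Assumption: the tag bytes in the file are windows-1251, as suggested in this thread.
    static final Charset CP1251 = Charset.forName("windows-1251");

    /** What the tag library is assumed to do: decode the raw bytes as Latin1. */
    static String decodeAsLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    /** Undo the Latin1 decoding, then decode the same bytes as windows-1251. */
    static String repair(String misdecoded) {
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, CP1251);
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0xD5};               // one byte as stored in the file
        String misdecoded = decodeAsLatin1(raw);  // the string the library hands back
        System.out.println((int) misdecoded.charAt(0));           // 213, i.e. U+00D5
        System.out.println(repair(misdecoded).equals("\u0425"));  // true: U+0425 is the Cyrillic letter
    }
}
```

Unlike new String(bytes) in the question, the charset is passed explicitly here, so the result does not depend on the system default encoding.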
How to proceed: the first comment gives a link to the library you need. Get the tags as byte arrays and convert them to strings using the encoding detected by juniversalchardet. If the MP3 library does not let you get tags as byte arrays, convert the values it returns into bytes using Latin1, and only then detect the encoding and create the strings.
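A minimal sketch of that fallback path, for the case where the MP3 library only returns strings. TagRecoder and its methods are hypothetical names, and the detector parameter is just a hook: you could plug juniversalchardet or your own heuristic into it. The check for characters above U+00FF is my own assumption, added so that tags the library already decoded correctly (for example real UTF-8 or UTF-16 tags) are not damaged.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

class TagRecoder {
    /**
     * Recovers the original bytes from a string that was decoded as Latin1.
     * Returns null if the string holds characters above U+00FF: such a string
     * cannot come from a Latin1 decode, so it should be left alone.
     */
    static byte[] recoverBytes(String fromLibrary) {
        for (int i = 0; i < fromLibrary.length(); i++) {
            if (fromLibrary.charAt(i) > 0xFF) {
                return null;
            }
        }
        return fromLibrary.getBytes(StandardCharsets.ISO_8859_1);
    }

    /** detector is a hypothetical hook that maps raw bytes to a charset, or null if unsure. */
    static String fix(String fromLibrary, Function<byte[], Charset> detector) {
        byte[] raw = recoverBytes(fromLibrary);
        if (raw == null) {
            return fromLibrary;                   // already decoded correctly
        }
        Charset detected = detector.apply(raw);
        return detected == null ? fromLibrary : new String(raw, detected);
    }

    public static void main(String[] args) {
        Charset cp1251 = Charset.forName("windows-1251");
        String original = "\u042d\u043b\u044c\u0444";  // four Cyrillic letters
        // Simulate a library that decoded windows-1251 bytes as Latin1.
        String broken = new String(original.getBytes(cp1251), StandardCharsets.ISO_8859_1);
        // The detector is stubbed to always answer windows-1251.
        System.out.println(fix(broken, bytes -> cp1251).equals(original));
    }
}
```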
You are confusing something. There are no Cyrillic characters in the Latin1 encoding. Most likely it is cp1251 there.
Question marks are displayed in place of non-printable characters, which may have special meaning and disrupt normal terminal operation. Displaying question marks is safer, especially since binary gibberish would not tell you much anyway.
In Windows Explorer and in players, both tags are displayed correctly. How do they determine the encoding?
Probably getBytes("latin1") simply does not recode anything, so the text is displayed in the native Windows encoding.
You can probably distinguish UTF-8 from windows-1251 without any magic: Russian text in windows-1251 will not be a valid UTF-8 byte sequence. But to distinguish single-byte encodings from one another, you need frequency analysis.
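The UTF-8 check can be sketched with the standard CharsetDecoder, which can be told to report malformed input instead of silently replacing it. Utf8Sniffer and guess are hypothetical names, and falling back to windows-1251 is an assumption that only makes sense if you expect Russian tags. It does not attempt the frequency analysis needed to tell single-byte encodings apart.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Sniffer {
    /** Returns true if the bytes form a well-formed UTF-8 sequence. */
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Assumption: anything that is not valid UTF-8 is treated as windows-1251. */
    static Charset guess(byte[] data) {
        return isValidUtf8(data) ? StandardCharsets.UTF_8 : Charset.forName("windows-1251");
    }

    public static void main(String[] args) {
        String text = "\u042d\u043b\u044c\u0444";  // four Cyrillic letters
        byte[] asCp1251 = text.getBytes(Charset.forName("windows-1251"));
        byte[] asUtf8 = text.getBytes(StandardCharsets.UTF_8);
        System.out.println(guess(asCp1251));
        System.out.println(guess(asUtf8));
    }
}
```

Pure ASCII input is valid in both encodings, so it is reported as UTF-8; that is harmless because both decode it to the same string.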