How to determine encoding of id3 tag?

deadvip, 2012-09-15 14:01:24

I have two test mp3 files with tags in Latin1 and UTF-8 respectively. I'm trying to read them:

System.out.println(id3v2Tag.getAlbum());

Output:

?????????? ????????: ???????? ?? ??? ???????
Эльфийская Рукопись: Сказание На Все Времена

Then I read them like this:

System.out.println(new String(id3v2Tag.getAlbum().getBytes("Latin1")));

Output:

Эльфийская Рукопись: Сказание На Все Времена
?????????? ????????: ???????? ?? ??? ???????

In Windows Explorer and in players, both tags are displayed correctly. How do they determine the encoding?

And another question: why are question marks displayed instead of mojibake (garbled characters)? How does the JVM decide that the output is invalid and should be replaced with question marks? Is it possible to detect this in code?

UPD: Indeed, the first file was in cp1251, but then it is not clear why the code:

System.out.println(new String(id3v2Tag.getAlbum().getBytes("Latin1")));

works fine. Probably there is a bug in the library I use to read the tags.

4 answer(s)

yupic, 2012-09-15
@deadvip

Let me explain why System.out.println(new String(id3v2Tag.getAlbum().getBytes("Latin1"))) works.
As I understand it, you are using a library that reads id3 tags.
The library reads a raw byte array from the file. To turn those bytes into a string it has to pick an encoding. Ideally the encoding would be set in the library settings, but if it is not set, the library apparently falls back to Latin1.
So the library decodes the bytes as Latin1: it takes each byte, maps it to a character, and stores that character in the string. For example, it reads the byte 0xC0, which represents the letter "А" in windows-1251, but in Latin1 that byte maps to "À". If you encode such a string to bytes as UTF-8 and write them to a file, you will not see Russian letters when you view the file as UTF-8.
Next, you want to print the string, so you convert it back to bytes using Latin1. The character "À" maps back to the byte that represents "А" in windows-1251. A new string is then created from these bytes using the system default encoding, which on your machine is windows-1251. The byte becomes "А", as originally intended, and the string is displayed correctly.
How to proceed: the first comment gives a link to the library you need. Get the tags as byte arrays and convert them to strings using the encoding detected by juniversalchardet. If your MP3 library cannot return tags as byte arrays, convert the strings it returns back into bytes using Latin1, and only then detect the encoding and create the strings.
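
The round trip described above can be reproduced without any tag library. This is only a sketch: the class and method names (Latin1RoundTrip, decodeAsLatin1, repair) are made up for illustration, and it assumes the tag bytes really are windows-1251 and that the library really decodes them as Latin1. Unlike the code in the question, it passes the target charset explicitly instead of relying on the system default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Latin1RoundTrip {
    // Assumption: the bytes stored in the tag are windows-1251.
    static final Charset CP1251 = Charset.forName("windows-1251");

    // What the tag library presumably does: decode the raw bytes as Latin1.
    static String decodeAsLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Undo the wrong decoding. Latin1 maps U+0000..U+00FF back to the same
    // byte values, so the original bytes are recovered without loss and can
    // then be decoded with the charset they were actually written in.
    static String repair(String misdecoded, Charset actual) {
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, actual);
    }

    public static void main(String[] args) {
        // "Альбом", written with escapes so the source file encoding does not matter.
        String original = "\u0410\u043B\u044C\u0431\u043E\u043C";
        byte[] raw = original.getBytes(CP1251);          // bytes as stored in the file
        String misdecoded = decodeAsLatin1(raw);         // accented Latin letters
        String repaired = repair(misdecoded, CP1251);    // Cyrillic again
        System.out.println(repaired.equals(original));
    }
}
```

The trick only works because Latin1 assigns a character to every byte value; a wrong decoding through UTF-8, for example, would already have destroyed the invalid bytes.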

S1ashka, 2012-09-15
@S1ashka

code.google.com/p/juniversalchardet/

Alexey Huseynov, 2012-09-15
@kibergus

You are confusing something. There are no Cyrillic characters in the Latin1 encoding. Most likely the tag is actually in cp1251.
Question marks are displayed in place of non-printing characters, which may have special meaning and disrupt normal terminal operation. It is safer to display question marks, especially since binary garbage would not tell you much anyway.
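
On the "can this be detected in code" part of the question: in Java, one well-defined source of question marks is the encoder itself. String.getBytes(Charset) replaces every character the target charset cannot represent with that charset's replacement byte, which is '?' for Latin1. A small sketch of how to see this and check for it in advance; the class and method names (QuestionMarks, encodeLossy, encodable) are invented for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class QuestionMarks {
    // Encode and decode with the same charset; characters the charset
    // cannot represent come back as '?'.
    static String encodeLossy(String s, Charset cs) {
        return new String(s.getBytes(cs), cs);
    }

    // True if every character of s can be encoded in cs,
    // i.e. no '?' substitution would happen.
    static boolean encodable(String s, Charset cs) {
        return cs.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        String cyrillic = "\u0410\u043B\u044C\u0431\u043E\u043C"; // "Альбом"
        Charset latin1 = StandardCharsets.ISO_8859_1;
        Charset cp1251 = Charset.forName("windows-1251");

        System.out.println(encodeLossy(cyrillic, latin1)); // ??????
        System.out.println(encodable(cyrillic, latin1));   // false
        System.out.println(encodable(cyrillic, cp1251));   // true
    }
}
```

Printing to System.out goes through an encoder for the console charset as well, so a string containing characters the console charset lacks is likely to show up as question marks for the same reason.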

In Windows Explorer and in players, both tags are displayed correctly. How do they determine the encoding?

With the help of magic. They analyze which characters are used, how often they occur, and which character combinations are common. The most advanced players simply assume that the tags are in UTF-8, and that everyone who stores them in a different encoding is deeply wrong. It is best to use exactly such players. There will be fewer problems.

vsespb, 2012-09-15
@vsespb

Probably getBytes("latin1") simply does not recode anything, so the result is displayed in the native Windows encoding.
You can probably distinguish UTF-8 from windows-1251 without any magic: Russian text in cp1251 will almost never be a valid UTF-8 byte sequence. But to tell single-byte encodings apart from each other, you need frequency analysis.
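
A minimal sketch of that check using only the JDK: decode strictly as UTF-8 and fall back to windows-1251 if the bytes are malformed. The names Utf8OrCp1251, isValidUtf8 and decodeTag are hypothetical, and cp1251 as the fallback is an assumption that fits Russian tags; it says nothing about other single-byte encodings.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8OrCp1251 {
    // Strict UTF-8 check: report malformed input as an error
    // instead of silently replacing it.
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Assumption: anything that is not valid UTF-8 is windows-1251.
    static String decodeTag(byte[] data) {
        Charset cs = isValidUtf8(data)
                ? StandardCharsets.UTF_8
                : Charset.forName("windows-1251");
        return new String(data, cs);
    }

    public static void main(String[] args) {
        String album = "\u0410\u043B\u044C\u0431\u043E\u043C"; // "Альбом"
        byte[] asUtf8 = album.getBytes(StandardCharsets.UTF_8);
        byte[] asCp1251 = album.getBytes(Charset.forName("windows-1251"));
        System.out.println(decodeTag(asUtf8).equals(album));
        System.out.println(decodeTag(asCp1251).equals(album));
    }
}
```

The check is not airtight: a few cp1251 byte pairs (for example a capital letter followed by "ё") happen to form valid UTF-8, so very short tags can be misjudged.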