Why does the encoding of data downloaded by a Perl script get corrupted when it is inserted into the database?

Vyacheslav Golovanov, 2013-10-13 12:39:15

Perl

I moved to a new hosting provider. The OS is the same (FreeBSD), but one of my scripts stopped working.

It downloads data from another site over https and saves it to the database.
The data is in cp1251 encoding; the database, the tables and the MySQL connection use the same encoding.

my.cnf:

    character-set-server=cp1251
    collation-server=cp1251_general_ci
    init-connect="SET NAMES cp1251"

When I connect to the database from the script, I execute:

    $dbh->do('SET CHARACTER SET cp1251');

The data is downloaded like this:

    $ua = new LWP::UserAgent;
    ....
    $res = $ua->get(....);
    $s = $res->decoded_content();

Then the variable $s is parsed and the result is inserted into the database. The text ends up corrupted in the database:

Г'ГЎГҐГ°ГЎГ ГГЄ ГђГ” (ГЊG'ГЉ), ГЇГ®ГЇГ®Г«ГГҐГГЁГҐ

While tinkering with the script I discovered a very strange thing. If I just save the received data to a text file, then read it back from the same file and insert it into the database, the encoding is not corrupted!

If I look at this text file, the encoding there is correct: cp1251.

What has changed since the previous hosting:

perl: was 5.10.1, now 5.14.4
libwww: was 5.835, now 6.05
mysql server: unchanged, still 5.1

UPDATE: I just discovered that if I write $res->content() instead of $res->decoded_content(), everything works.
Perhaps this is because the downloaded page does not have a charset in its headers.
But I still don't understand what is happening with the string: why is it in the wrong encoding when inserted into the database, but in the correct one when written to a file? Is some UTF-8 flag being set on it? I do not understand :(

3 answer(s)

vsespb, 2013-10-13
@SLY_G

> Perhaps due to the fact that the downloaded page does not have charset in the headers.
Or the other way around: it does have one.
> If you just save the received data to a text file, then read them from the same file and insert them into the database, the encoding does not deteriorate!
It would help to see the code that writes the file and the code that reads it back.
> If you look at this text file, you can see that the encoding is correct there, cp1251
That does not prove anything yet.
For the theory, you need to understand how Unicode works in Perl:
perldoc.perl.org/perlunitut.html
perldoc.perl.org/perluniintro.html
perldoc.perl.org/perlunifaq.html
habrahabr.ru/post/190584/
Also enable use strict and use warnings.
Then Dump the data with the Devel::Peek module at the right places, and it will become possible to see where the bug is. It would also help to see all the DBD::mysql options you use.
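A minimal sketch of such a check, using only core modules. The sample bytes are an assumption for illustration (the word "При" in cp1251), not data from the question. Dump prints to STDERR; the thing to look for is whether UTF8 appears in the FLAGS line.

```perl
use strict;
use warnings;
use Devel::Peek qw(Dump);
use Encode qw(decode);

# Hypothetical sample: "При" as raw cp1251 bytes.
my $bytes = "\xCF\xF0\xE8";

# The same text decoded into a Perl character string.
my $chars = decode('cp1251', $bytes);

Dump($bytes);   # byte string: the UTF8 flag is not set
Dump($chars);   # character string: the UTF8 flag is set

# The flag can also be checked directly, without reading the dump.
printf "bytes flagged: %d, chars flagged: %d\n",
    utf8::is_utf8($bytes) ? 1 : 0,
    utf8::is_utf8($chars) ? 1 : 0;
```

Calling Dump on $s right after decoded_content() and again right before the insert should show which kind of string actually reaches the database.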
So far my impression is that your code does not work with Perl text (character) strings at all and instead uses the legacy single-byte encoding everywhere. That is a workable approach, but then you need to use content and not decoded_content. It probably worked before because the old version of LWP did not recognize the encoding of this particular page, so calling decoded_content was equivalent to content. It is not clear why the data changes after being written to a file and read back; that may depend on the options you use when working with files.
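One mechanism that would fit these symptoms, as a sketch under two assumptions that the thread does not confirm: that the body was decoded as ISO-8859-1 because no charset was found, and that the driver sends the string's internal UTF-8 buffer as-is. The sample bytes are hypothetical, and an in-memory filehandle stands in for the text file.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample: "При" as raw cp1251 bytes, standing in for the page body.
my $raw = "\xCF\xF0\xE8";

# Assumption: the body was decoded as ISO-8859-1, giving a character string.
my $text = decode('ISO-8859-1', $raw);

# Printing to a handle with no encoding layer: every character is below 256,
# so Perl writes one byte per character and the original bytes come back.
my $file_bytes = '';
open my $out, '>', \$file_bytes or die "open: $!";
print {$out} $text;
close $out or die "close: $!";

# Assumption: this is what reaches the server if the internal buffer is sent
# unchanged, i.e. the UTF-8 encoding of the Latin-1 characters.
my $internal = $text;
utf8::encode($internal);

printf "file: %s\n", unpack('H*', $file_bytes);   # file: cff0e8
printf "sent: %s\n", unpack('H*', $internal);     # sent: c38fc3b0c3a8
```

If that is what happens, the file contains valid cp1251 while the database receives twice as many bytes, which would be stored as garbage in a cp1251 connection. Reading the file back gives a plain byte string again, which would explain why the round trip "fixes" the data.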

vsespb, 2013-10-13
@vsespb

del

kirichenko, 2013-10-20
@kirichenko

decoded_content() first decompresses gzip/deflate, and second converts the text from its actual encoding (the one it managed to detect) into Perl's internal representation (UTF-8). To keep it from touching the encoding, you can do this:
decoded_content(charset=>'none')
In general, before asking such questions, it would be nice to read the documentation...
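A small sketch of the difference, built on a hand-made HTTP::Response so that it runs without a network. The body bytes and the Content-Type header are assumptions for illustration (the question's page may have had no charset at all).

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical sample: "При" as raw cp1251 bytes.
my $raw = "\xCF\xF0\xE8";

# A response built by hand; the charset in the header is an assumption.
my $res = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=windows-1251' ],
    $raw,
);

# Default call: Content-Encoding is undone and the charset is decoded,
# so the result is a character string.
my $chars = $res->decoded_content;

# charset => 'none': Content-Encoding is still undone, but the bytes are
# returned as they are, ready for a cp1251 connection.
my $bytes = $res->decoded_content(charset => 'none');

printf "chars: U+%04X, length %d\n", ord($chars), length($chars);
printf "bytes: %s\n", unpack('H*', $bytes);
```

If the rest of the script works with characters instead, the other option is to keep the default decoded_content() and convert explicitly with Encode::encode('cp1251', ...) just before the insert.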