PHP
Dmitry Shcherbakov, 2015-12-29 13:22:27

How to understand the encoding when downloading via wget?

The task: download a file from a site via wget (because wget lets you find out the file's real name).
It sounds simple, but the first difficulty is the file's address: site.ru/export.html?hash=q1w2e3 (I can share the real link, but only by PM).
If you open the link in a browser, an xls file with Russian letters in its name is downloaded immediately, for example "free_remains.xls".
When I point wget at the link, the name reported in the log (which is where the exact file name can be read from) is export.html?hash=q1w2e3
Clearly some option is needed. It turns out there is a --content-disposition option, and the name is passed in the response header of the same name. The file now downloads and the name seems to be correct, BUT this is where the actual question comes in.
Here's what's in the logs: \361\342\356\341\356\344\355\373\345_\356\361\362\340\362\352\350.xls
And here's what's in the folder:
Once wget has finished, the name of the downloaded file, \361\342\356\341\356\344\355\373\345_\356\361\362\340\362\352\350.xls, is pulled from the log, but accessing the file by that name fails.
So the actual task is: how do I access the file by this numeric name? Even better would be to get the name in utf-8 (although it doesn't really matter, since the file will be renamed later anyway).
UPDATE
I found out that the line that ends up in the log is windows-1251 written in octal notation:

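// Single quotes keep the backslashes literal, exactly as they appear in the wget log.
// In double quotes PHP would already expand \361 and friends into bytes itself.
$string = '\361\342\356\341\356\344\355\373\345_\356\361\362\340\362\352\350.xls';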

function convertOctalToCharacter($octal) {
    return chr(octdec($octal[1]));
}

echo iconv('windows-1251', 'utf-8', preg_replace_callback('/\\\\([0-7]{1,3})/', 'convertOctalToCharacter', $string));

Hooray, the name is now in utf-8. Of course, I could now simply download the file again via curl under a name of my own choosing, but that is a hack.
Next I'll try accessing the file by the name decoded to windows-1251; in theory that should work.

3 answer(s)
Dmitry Shcherbakov, 2015-12-29
@DimNS

Answering my own question)) I found a solution; posting it in case it comes in handy for someone.

// Source string: a windows-1251 name written as octal escapes
// (single quotes keep the backslashes literal, as in the wget log)
$string = '\361\342\356\341\356\344\355\373\345_\356\361\362\340\362\352\350.xls';

// Converts one matched octal escape into the corresponding byte
function convertOctalToCharacter($octal) {
    return chr(octdec($octal[1]));
}

// Decode the octal escapes into raw windows-1251 bytes
$filename = preg_replace_callback('/\\\\([0-7]{1,3})/', 'convertOctalToCharacter', $string);
// Re-encode to utf-8
$filename_utf8 = iconv('windows-1251', 'utf-8', $filename);
// Rename the file so that its name is in utf-8
rename($filename, $filename_utf8);

Melkij, 2015-12-29
@melkij

There is no "real" file name in any of these cases; it is ephemeral. The file may not exist on the server at all.
And what do you need that name for anyway? Save the download straight away under a name you can find it by, using wget's --output-document option.
Or just fetch it with curl directly from PHP.
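
A rough sketch of the --output-document suggestion, called from PHP. The helper name buildWgetCommand and the target name export.xls are made up for illustration; the URL is the placeholder from the question. It only builds the command line and leaves running it to the caller.

```php
<?php
// Hypothetical helper: builds a wget command that saves the response under
// a name chosen by the caller instead of the server-supplied one.
function buildWgetCommand(string $url, string $target): string
{
    // escapeshellarg() quotes each argument so that "?" and "=" in the URL
    // are not interpreted by the shell.
    return 'wget --output-document=' . escapeshellarg($target)
        . ' ' . escapeshellarg($url);
}

// Placeholder URL from the question and an illustrative local file name.
$command = buildWgetCommand('site.ru/export.html?hash=q1w2e3', 'export.xls');
echo $command, PHP_EOL;

// To actually run the download:
// exec($command, $output, $exitCode);
```

Since the script picks the local name itself, nothing has to be parsed out of the wget log afterwards.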

romy4, 2015-12-29
@romy4

Look at the response headers.
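
If the name is read from the headers directly, something along these lines might do. This is only a sketch: the helper name filenameFromContentDisposition is invented here, the sample header is assembled from the bytes shown in the question, and it assumes the server sends the name as raw windows-1251 in a plain filename= parameter (the RFC 6266 filename*= form is not handled).

```php
<?php
// Hypothetical helper: pulls the filename parameter out of a raw
// Content-Disposition header value and re-encodes it as UTF-8.
// Assumption: the server sends the name as raw windows-1251 bytes.
function filenameFromContentDisposition(string $header, string $sourceCharset = 'windows-1251'): ?string
{
    // Match filename="..." or an unquoted filename=... token.
    if (!preg_match('/filename\s*=\s*(?:"([^"]*)"|([^;\s]+))/i', $header, $m)) {
        return null;
    }
    // Group 1 holds a quoted value, group 2 an unquoted one.
    $raw = $m[1] !== '' ? $m[1] : ($m[2] ?? '');
    $utf8 = iconv($sourceCharset, 'UTF-8', $raw);
    return $utf8 === false ? null : $utf8;
}

// Sample header value. Inside double quotes PHP expands the octal escapes
// into the same windows-1251 bytes that wget printed in its log.
$header = 'attachment; filename="'
    . "\361\342\356\341\356\344\355\373\345_\356\361\362\340\362\352\350.xls"
    . '"';

echo filenameFromContentDisposition($header), PHP_EOL; // свободные_остатки.xls
```

The header value itself can be obtained with any HTTP client; only the parsing and re-encoding step is shown here.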
