Tant, 2012-05-22 22:51:54

Google Translate

Let's run this script:

<?php
// take the Czech word 'Koláče' (pies) and prepare it for passing in a URL
$text = urlencode('Koláče');
// build the request
$query = "http://translate.google.com/translate_a/t?client=x&text={$text}&sl=cs&tl=en";
// which ends up looking like this:
// http://translate.google.com/translate_a/t?client=x&text=Kol%C3%A1%C4%8De&sl=cs&tl=en


// send it off for translation
// I took this piece from here:
// http://stackoverflow.com/questions/542046/php-file-get-contentsloc-fails
$curl = curl_init();
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_URL, $query);
$response = curl_exec($curl);
curl_close($curl);
// a simpler alternative works too, with the same result:
// $response = file_get_contents($query);


echo $response;
?>

In response, we get a JSON object containing this garbage:
{"sentences":[{
  "trans":"Kol\u0102\u0104\u00C4 e",
  "orig":"Kol\u0102\u0104\u00C4 e",
  "translit":"",
  "src_translit":""
}],"src":"cs","server_time":2}

Note the orig field: somehow the string 'Kol%C3%A1%C4%8De' was turned into 'Kol\u0102\u0104\u00C4 e'.
But if the same request, translate.google.com/translate_a/t?client=x&text=Kol%C3%A1%C4%8De&sl=cs&tl=en, is simply entered into the browser's address bar, we get a nice, correct response:
{"sentences":[{
  "trans":"Pies",
  "orig":"Koláče",
  "translit":"",
  "src_translit":""
}],"src":"cs","server_time":41}

I would be very grateful if someone could explain why.

Additional information: the original string is in UTF-8, but that shouldn't matter, because after urlencode only ASCII characters remain. Translating from English works fine, which makes sense, because those characters need no additional encoding.
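
That said, the server still has to decode the percent-encoded bytes under some charset. As a guess only, the escapes in the bad response look like what you get when the UTF-8 bytes of 'Koláče' are read as ISO-8859-2. The sketch below reproduces that locally. The helper name misreadAsLatin2 is made up for illustration, and nothing here shows what Google actually does on its side.

```php
<?php
// Hypothetical helper (not from the original post): treat the given bytes
// as ISO-8859-2 text and re-encode the result as UTF-8.
function misreadAsLatin2(string $bytes): string
{
    return iconv('ISO-8859-2', 'UTF-8', $bytes);
}

// UTF-8 bytes of 'Koláče', written as escapes so the file encoding is irrelevant.
$original = "Kol\xC3\xA1\xC4\x8De";

// Assumption being tested: the server decoded these bytes with the wrong charset.
$misread = misreadAsLatin2($original);

// json_encode escapes non-ASCII characters as \uXXXX by default,
// which makes the result easy to compare with the "orig" field above.
echo json_encode($misread), PHP_EOL;
```

The first three mis-decoded characters come out as \u0102, \u0104 and \u00c4, the same code points as in the bad "orig" field. The fourth byte (0x8D) has no printable character in that charset, which may be why a space appears there in the response.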

I suspect that browsers send some "correct" headers. Is that true? If so, how can I find out which ones, and most importantly, how do I send them manually?


2 answer(s)
mitry, 2012-05-22
@Tant

It looks like Google's response depends on the User-Agent header.
You can check this at web-sniffer.net/. For browsers, the response comes back in UTF-8, while for 'Web-sniffer' or an empty User-Agent it comes back \u-escaped.
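
If the User-Agent really is the deciding factor, a minimal sketch of sending one from PHP might look like this. The helper names (buildTranslateUrl, fetchWithUserAgent, fetchWithStreamContext) and the User-Agent string are placeholders for illustration. Whether the endpoint then answers in UTF-8 is an assumption to verify, not something this code guarantees.

```php
<?php
// Hypothetical helper: build the same query URL as in the question.
function buildTranslateUrl(string $text, string $from, string $to): string
{
    return 'http://translate.google.com/translate_a/t?client=x'
        . '&text=' . urlencode($text)
        . '&sl=' . $from . '&tl=' . $to;
}

// Hypothetical helper: same cURL calls as in the question, plus a User-Agent.
function fetchWithUserAgent(string $url, string $userAgent)
{
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_USERAGENT, $userAgent); // sets the User-Agent request header
    $response = curl_exec($curl);                      // string on success, false on failure
    curl_close($curl);
    return $response;
}

// Hypothetical helper: the file_get_contents variant, using an HTTP stream context.
function fetchWithStreamContext(string $url, string $userAgent)
{
    $context = stream_context_create(['http' => ['user_agent' => $userAgent]]);
    return file_get_contents($url, false, $context);
}

// Usage (placeholder User-Agent; copy the real string from your own browser):
// $ua = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36';
// echo fetchWithUserAgent(buildTranslateUrl('Koláče', 'cs', 'en'), $ua);
```

The actual string a browser sends can be read from its developer tools or from a header-inspection service such as the one mentioned above.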
