Google
Alexey Zorin, 2015-11-01 20:13:30

Parsing Google search results. What else did I miss?

Hello. Please leave the moral side of the question out of the discussion.
Perhaps someone has already dealt with this...
The task is to parse Google search results.
The code:

/**
   * Fetch the HTML for a request
   * @param string $url request URL
   * @return string  HTML of the results page
   */
  private function getHtml($url)
  {
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL,				$url); 
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 	true); 
    curl_setopt($curl, CURLOPT_USERAGENT, 		'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/45.0.2454.101 Chrome/45.0.2454.101 Safari/537.36');
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 	true);
    $response = curl_exec($curl);

    if(curl_getinfo($curl,CURLINFO_HTTP_CODE) !== 200)
    {

      # Fetch the captcha image and the cookies
      $imgUrl = phpQuery::newDocument($response)->find("img")->attr("src");
      $curlImage = curl_init();
      curl_setopt($curlImage, CURLOPT_URL, 			"https://www.google.ru".$imgUrl); 
      curl_setopt($curlImage, CURLOPT_USERAGENT, 		'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/45.0.2454.101 Chrome/45.0.2454.101 Safari/537.36');
      curl_setopt($curlImage, CURLOPT_RETURNTRANSFER, true); 
      curl_setopt($curlImage, CURLOPT_COOKIEJAR, 		__DIR__."/../../html/assets/cookies.txt"); 
      file_put_contents("assets/captcha.jpg", curl_exec($curlImage));
      curl_close($curlImage);
      
      # Solve the captcha
      $antiCaptcha = new AntiCaptcha;
      $antiCaptcha->sendCaptcha();
      $captcha = $antiCaptcha->getCaptchaValue();

      # Build the request URL
      $url = "https://ipv4.google.com/sorry/CaptchaRedirect?continue=".urlencode(phpQuery::newDocument($response)->find("[name=\"continue\"]")->attr("value"))
          ."&id=".urlencode(phpQuery::newDocument($response)->find("[name=\"id\"]")->attr("value"))
          ."&captcha=".$captcha
          ."&submit="."Submit";
          
      # Request the URL with all the required data
      $curlGoogleAntiCaptcha = curl_init();
      curl_setopt($curlGoogleAntiCaptcha, CURLOPT_URL, 			$url);
      curl_setopt($curlGoogleAntiCaptcha, CURLOPT_USERAGENT, 		'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/45.0.2454.101 Chrome/45.0.2454.101 Safari/537.36');
      curl_setopt($curlGoogleAntiCaptcha, CURLOPT_RETURNTRANSFER, true);
      curl_setopt($curlGoogleAntiCaptcha, CURLOPT_COOKIEFILE, 	__DIR__."/../../html/assets/cookies.txt");
      $result = curl_exec($curlGoogleAntiCaptcha);
// For some reason I get the captcha page again here (((
      return $result;
    }

    curl_close($curl);

    return $response;
  }

Description:
1. If the response code is not 200, we follow the redirect URL.
2. We save the values of all the inputs we need from this page.
3. We fetch the captcha image and save the cookies to a file (yes, they are only issued together with the image).
4. We solve the captcha through a separate class.
5. We attach the cookies and the captcha value and request the resulting URL.
And here we get a captcha again... What am I doing wrong?
PS: The captcha is solved correctly. The URL in step 5 is definitely correct.

3 answer(s)
frees2, 2015-11-02
@frees2

Why bother, when search results are available as an RSS feed, the old-fashioned way?
https://news.google.com/news?pz=1&cf=all&ned=ru_ru...

Oleg Matrozov, 2015-11-01
@Mear

Good afternoon.
I solved a similar problem quite recently))) The difference I see right away is that in the final request to Google (the captcha confirmation) I send an empty continue parameter. As I recall, I also had trouble getting past it, with the same symptoms as yours.
I can share an abstract "debug" class that I made for testing and debugging: pastebin.com/Eymi1U1K
And yes, all cookie handling is delegated to curl, so don't be surprised that the cookies do not appear explicitly in the code.

Alexey Zorin, 2015-11-01
@newbie67

Thanks everyone, I found my mistake.
joxi.ru/BA00GP6u5GpgAy
That's where the problem was hiding... The cookies were saved for *.google.ru, but I was sending the captcha to *.google.com
And, Dmitry, this is possible in PHP too.
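
For anyone who runs into the same symptom: a rough way to check which host the saved cookies actually belong to is to read the cookie jar that curl writes via CURLOPT_COOKIEJAR (Netscape format, tab-separated) and apply a simplified version of the RFC 6265 domain-matching rule. This is only a sketch. The function names and the EXAMPLE cookie are made up for illustration, and path, secure flag, expiry and IP-address hosts are ignored.

```php
<?php
// Hypothetical helpers for illustration; not from the original code.

/** Simplified RFC 6265 domain match: same host, or a subdomain of the cookie domain. */
function cookieDomainMatches(string $host, string $cookieDomain): bool
{
    $host = strtolower($host);
    $cookieDomain = strtolower(ltrim($cookieDomain, '.'));
    if ($host === $cookieDomain) {
        return true;
    }
    $suffix = '.' . $cookieDomain;
    return substr($host, -strlen($suffix)) === $suffix;
}

/** Names of the cookies in a Netscape-format jar that would be sent to $host. */
function cookieNamesForHost(string $jarContents, string $host): array
{
    $names = [];
    foreach (preg_split('/\R/', $jarContents) as $line) {
        // curl prefixes HttpOnly cookies with "#HttpOnly_"; other "#" lines are comments.
        if (strpos($line, '#HttpOnly_') === 0) {
            $line = substr($line, strlen('#HttpOnly_'));
        } elseif ($line === '' || $line[0] === '#') {
            continue;
        }
        $fields = explode("\t", $line);
        if (count($fields) < 7) {
            continue; // not a cookie line
        }
        // Fields: domain, include-subdomains flag, path, secure, expiry, name, value
        [$domain, $includeSubdomains, , , , $name] = $fields;
        $matches = $includeSubdomains === 'TRUE'
            ? cookieDomainMatches($host, $domain)
            : strtolower($host) === strtolower($domain);
        if ($matches) {
            $names[] = $name;
        }
    }
    return $names;
}

// A jar line as curl would write it for a cookie set on .google.ru
$jar = ".google.ru\tTRUE\t/\tFALSE\t0\tEXAMPLE\tvalue\n";
var_dump(cookieNamesForHost($jar, 'www.google.ru'));   // the cookie applies
var_dump(cookieNamesForHost($jar, 'ipv4.google.com')); // empty: different domain
```

If the second call comes back empty for the host you are about to request, curl will not send those cookies there either, which matches what I saw.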
