Google
Alexey Zorin, 2015-12-22 17:46:15

Parsing Google search results: how to get around the blocking?

Please leave the moral side of the question out of the discussion.
Oh, what haven't I tried by now. It worked great, and at some point it just stopped.
Let's say we have a request:

https://www.google.ru/search?q=%D0%BF%D1%80%D0%BE%D0%B4%D0%B2%D0%B8%D0%B6%D0%B5%D0%BD%D0%B8%D0%B5+%D1%81%D0%B0%D0%B9%D1%82%D0%BE%D0%B2&num=100

And here is the code:

  $useragent = $this->getUseragent();
  $curl = curl_init();

  // Request headers as captured from the browser.
  $headers = array();
  $headers[] = "Connection: keep-alive";
  $headers[] = "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8";
  $headers[] = "Upgrade-Insecure-Requests: 1";
  $headers[] = "User-Agent: " . $useragent;
  $headers[] = "Accept-Language: ru-RU,ru;q=0.8,en-US;q=0.6,en;q=0.4";

  curl_setopt($curl, CURLOPT_URL, $url);
  curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
  curl_setopt($curl, CURLOPT_HTTPHEADER, $headers);
  curl_setopt($curl, CURLOPT_USERAGENT, $useragent);
  curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 30);
  curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
  curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
  curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0); // this option takes 0 or 2, not a boolean

  $response = curl_exec($curl);
  curl_close($curl);

I'm testing from a local machine. The cURL request gets a 403 and the "banned" page.
Right after that I open the same request in the browser in incognito mode, and I immediately get a 200.
Do I understand correctly that in incognito mode there are no cookies at all on the first request, i.e. Google relies only on the request headers?
What am I doing wrong? I copied all the headers in the $headers array straight from the browser.
Apparently there is some other parameter that I'm not passing.
Any ideas?
UPD: There was a suggestion that Google bans both users with certain cookies and users without any cookies.
If that is the case, then the blocking can most likely be bypassed by continuously collecting "working" cookies.
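A minimal sketch of that idea with PHP/cURL (the jar file name and the flow are assumptions for illustration, not part of the original code): pointing CURLOPT_COOKIEFILE and CURLOPT_COOKIEJAR at the same file makes every request replay the cookies collected so far and store whatever new ones arrive.

  // Sketch: persist cookies across requests in one jar file.
  $cookieJar = __DIR__ . '/cookies.txt'; // illustrative path

  $curl = curl_init();
  curl_setopt($curl, CURLOPT_URL, $url);
  curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
  curl_setopt($curl, CURLOPT_COOKIEFILE, $cookieJar); // read cookies saved by earlier requests
  curl_setopt($curl, CURLOPT_COOKIEJAR, $cookieJar);  // write back any Set-Cookie from this one
  $response = curl_exec($curl);
  curl_close($curl); // the jar is flushed to disk here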


3 answers
nirvimel, 2015-12-22
@nirvimel

  1. The Accept and Accept-Language headers depend on the User-Agent. You change the User-Agent, and Google can catch the mismatch. Start with a constant string instead of getUseragent(). Then, if you really need to rotate the User-Agent, look at how Random Agent Spoofer does it: it matches all the other headers to the faked User-Agent. You may have to dig into its source, which is why a constant string is easier.
  2. Try removing ru-RU from Accept-Language. That does not mean Russian-language results will disappear from the SERP or that you will get encoding problems. My entire system and browser are not localized at all, and that does not stop Google from answering me in Russian.
  3. Requests like https://www.google.ru/search?q=q&num=100 are sent only by bots. In a browser, a search from Google's main page produces a huge request with a dozen parameters, including some unique hashes. Try requesting the main page first, accept and store all the cookies, pull out of the search form the URL the query would be submitted to, append q=blabla to it, and send the new request with all the cookies. By the way, new cookies arrive with every response, and it would be good to use them in the next request, as a real browser would; this increases the time / number of requests before the ban (see the sketch after this list).
  4. Don't ask for num=100 right away: it is easier for you to parse, and it is just as easy for Google to ban those who parse that way. Drop that parameter and fetch the SERP one page at a time. Pause a few seconds between requests, the way a live person surfs. In parallel you can run another query from another session with a different set of cookies and User-Agent, as if several people were sitting behind one IP via NAT. Still, sampling the SERP deeper than one or two pages greatly raises suspicion and brings the captcha closer; if at all possible, give up the 100-result selection entirely so that the parser works at least somehow.
  5. Before doing any of this, run Wireshark and compare the two requests live: one from the browser, the other from your script in its current form. Some difference may catch your eye right away.
  6. Even if every condition is met, a ban is inevitable sooner or later; it depends on how much traffic you generate from one IP. Nothing can be done about that. Only a large pool of proxies will save you.
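Putting points 3, 4 and 6 together, here is a rough PHP/cURL sketch of the flow. It is an illustration under assumptions, not a tested recipe: the cookie-jar path, the two-page loop, the pause range and the (empty) $proxies pool are all made up for the example.

  // Illustrative only: warm up a session, then walk the SERP page by page,
  // persisting cookies in a jar and optionally rotating proxies.
  $jar = __DIR__ . '/session_cookies.txt'; // assumed jar location
  $proxies = array(); // fill with 'host:port' strings once you have a pool

  function fetch($url, $jar, $proxy = null) {
      $curl = curl_init();
      curl_setopt($curl, CURLOPT_URL, $url);
      curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
      curl_setopt($curl, CURLOPT_COOKIEFILE, $jar); // send previously stored cookies
      curl_setopt($curl, CURLOPT_COOKIEJAR, $jar);  // keep the cookies this response sets
      if ($proxy !== null) {
          curl_setopt($curl, CURLOPT_PROXY, $proxy); // point 6: spread traffic across IPs
      }
      $html = curl_exec($curl);
      curl_close($curl);
      return $html;
  }

  // Point 3: visit the main page first and collect its cookies.
  fetch('https://www.google.ru/', $jar);

  // Point 4: no num=100 — one page of 10 results at a time, with human-like pauses.
  for ($page = 0; $page < 2; $page++) {
      $url = 'https://www.google.ru/search?q=' . urlencode('продвижение сайтов')
           . '&start=' . ($page * 10);
      $proxy = empty($proxies) ? null : $proxies[array_rand($proxies)];
      $html = fetch($url, $jar, $proxy);
      sleep(rand(3, 8)); // a few seconds between requests, like a live person
  }

The shared jar means the fresh cookies from each response ride along on the next request, as point 3 suggests; start= is Google's standard pagination offset.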

evnuh, 2015-12-22
@evnuh

Try copying the whole request from the browser at once in cURL format (in Chrome DevTools: Network → right-click the request → Copy as cURL) and replaying it from your console.

frees2, 2015-12-22
@frees2

anonymouse.org/cgi-bin/anon-www_de.cgi/http://news...
