Parsing Google search results: how to get around the blocking?
Please leave the moral side of the issue out of discussion.
Oh, what haven't I tried already. It worked great, and at some point it just stopped.
Let's say we have a request:
https://www.google.ru/search?q=%D0%BF%D1%80%D0%BE%D0%B4%D0%B2%D0%B8%D0%B6%D0%B5%D0%BD%D0%B8%D0%B5+%D1%81%D0%B0%D0%B9%D1%82%D0%BE%D0%B2&num=100
$url = 'https://www.google.ru/search?q=%D0%BF%D1%80%D0%BE%D0%B4%D0%B2%D0%B8%D0%B6%D0%B5%D0%BD%D0%B8%D0%B5+%D1%81%D0%B0%D0%B9%D1%82%D0%BE%D0%B2&num=100';
$useragent = $this->getUseragent();

$curl = curl_init();

// Headers sent along with the search request
$headers = array();
$headers[] = "Connection: keep-alive";
$headers[] = "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8";
$headers[] = "Upgrade-Insecure-Requests: 1";
$headers[] = "User-Agent: " . $useragent;
$headers[] = "Accept-Language: ru-RU,ru;q=0.8,en-US;q=0.6,en;q=0.4";

curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_HTTPHEADER, $headers);
curl_setopt($curl, CURLOPT_USERAGENT, $useragent);
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 30);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, false);

$response = curl_exec($curl);
curl_close($curl);
Accept and Accept-Language depend on the User-Agent. You change the User-Agent, and Google can catch that discrepancy. Try starting with a constant string instead of getUseragent(). If you really do need to rotate the User-Agent, look at how Random Agent Spoofer does it: it matches all the other headers to the faked User-Agent. You may have to dig into its source, which is why a constant string is easier.
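A minimal sketch of the constant-string approach, plugged into the cURL setup from the question (the User-Agent string below is just a placeholder; the idea is that the Accept values and the UA are copied from the same real browser so they do not contradict each other):

// One fixed browser profile: the User-Agent and the Accept* headers
// all come from the same real browser.
$useragent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36';

$headers = array(
    'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language: ru-RU,ru;q=0.8,en-US;q=0.6,en;q=0.4',
    'Upgrade-Insecure-Requests: 1',
);

curl_setopt($curl, CURLOPT_HTTPHEADER, $headers);
curl_setopt($curl, CURLOPT_USERAGENT, $useragent);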
Drop ru-RU from Accept-Language. This does not mean that Russian-language results will disappear from the SERP or that you will get encoding problems. My whole system and browser are not localized at all, and that does not stop Google from answering in Russian.

https://www.google.ru/search?q=q&num=100 — requests like this are sent only by bots. In a browser, a search from the Google home page produces a huge request with a dozen parameters, including some unique hashes. Try requesting the home page first, accept and store all the cookies, pull out of the search form the URL the query will go to, append q=blabla to it and send a new request with all those cookies. By the way, new cookies come with every response, and it would be good to use them in the next request, as would happen in a real browser; this increases the time / number of requests before the ban.
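Roughly, with cURL's cookie jar, that flow could look like this (a sketch: the cookie-file path is arbitrary, $headers and $useragent are the values from above, and the search URL is hard-coded here only for illustration — in reality you would pull it out of the home-page HTML as described):

// 1. Fetch the Google home page first and let cURL collect the cookies it sets.
$cookieFile = '/tmp/google_cookies.txt';

$curl = curl_init('https://www.google.ru/');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_HTTPHEADER, $headers);
curl_setopt($curl, CURLOPT_USERAGENT, $useragent);
curl_setopt($curl, CURLOPT_COOKIEJAR, $cookieFile);   // write received cookies to the jar
curl_setopt($curl, CURLOPT_COOKIEFILE, $cookieFile);  // and send them back on later requests
$homepage = curl_exec($curl);
curl_close($curl);

// 2. Parse the search form out of $homepage to get the real action URL and its hidden
//    fields, then append the query. Hard-coded here only to keep the sketch short.
$searchUrl = 'https://www.google.ru/search?q=' . urlencode('продвижение сайтов');

$curl = curl_init($searchUrl);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_HTTPHEADER, $headers);
curl_setopt($curl, CURLOPT_USERAGENT, $useragent);
curl_setopt($curl, CURLOPT_COOKIEJAR, $cookieFile);   // keep updating the jar so that
curl_setopt($curl, CURLOPT_COOKIEFILE, $cookieFile);  // fresh cookies go into the next request
$serp = curl_exec($curl);
curl_close($curl);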
num=100 — it's easier for you to parse, and it's easier for Google to ban those who parse. Remove this option and download the SERP one page at a time. Pause for a few seconds between requests, the way a live person would browse. At the same time you can run another request in parallel from a different session, with its own set of cookies and its own User-Agent, as if several people were sitting behind one IP because of NAT. But in general, pulling the SERP deeper than one or two pages greatly increases the suspicion towards you and brings the captcha closer, so, if at all possible, give up fetching 100 results entirely so that the parser keeps working at least somehow.
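A sketch of page-by-page fetching with pauses, reusing $headers, $useragent and $cookieFile from above (start is the usual SERP offset parameter, 10 results per page; the delay range is just an illustrative guess at a "human" pause):

$query = urlencode('продвижение сайтов');
$pages = array();

// Walk the first few SERP pages one at a time instead of asking for num=100,
// sleeping a few seconds between requests like a person reading the results.
for ($page = 0; $page < 3; $page++) {
    $url = 'https://www.google.ru/search?q=' . $query . '&start=' . ($page * 10);

    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_HTTPHEADER, $headers);
    curl_setopt($curl, CURLOPT_USERAGENT, $useragent);
    curl_setopt($curl, CURLOPT_COOKIEJAR, $cookieFile);
    curl_setopt($curl, CURLOPT_COOKIEFILE, $cookieFile);
    $pages[] = curl_exec($curl);
    curl_close($curl);

    sleep(rand(4, 9));   // pause like a person reading the results
}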
Also try copying all the headers from the browser at once in cURL format (the browser dev tools can copy a request as a ready cURL command) and replaying the request from your console.