PHP
hyget, 2018-03-22 13:06:36

How to parse an HTTPS site correctly?

Hello, I'm writing a parser for an HTTPS site
using Guzzle with a proxy,
but I still get blocked, and the block lifts after about 10 minutes. Without a proxy, I get this certificate error:
[Screenshot of the certificate error: 5ab37b9f018fd967258530.png]

$response = $client->get('https://site/page/' . $i, ['verify' => false, 'delay' => 3000, 'proxy' => $proxy]);

Which direction should I dig in? Or, on the contrary, should I supply the certificate instead of disabling verification?
Each proxy is used for 2 pages, then I move on to the next one in a cycle. I could of course use many more proxies, but I would like to understand what the actual problem is.
When I try to detect the block:
$response->getStatusCode(); // returns 200 even when blocked
$response->getReasonPhrase(); // "OK", but when blocked: Fatal error: Uncaught GuzzleHttp\Exception\ConnectException
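
The fatal error above is an uncaught exception, and the 200 status suggests the block can also arrive as an ordinary page. A minimal sketch of handling both cases, assuming Guzzle is installed via Composer and its autoloader is already loaded; the helper names `looksBlocked` and `tryFetch` and the marker text are hypothetical and would have to be adapted to what the target site actually returns:

```php
<?php
use GuzzleHttp\Client;
use GuzzleHttp\Exception\ConnectException;

/**
 * Hypothetical check: treat the page as a block page if it contains
 * a marker phrase (case-insensitive). The marker is an assumption.
 */
function looksBlocked(string $html, string $marker): bool
{
    return stripos($html, $marker) !== false;
}

/**
 * Fetch a page; return its HTML, or null if the connection failed
 * or the body looks like a block page.
 */
function tryFetch(Client $client, string $url, array $options, string $marker): ?string
{
    try {
        $response = $client->get($url, $options);
    } catch (ConnectException $e) {
        // Connection refused, dropped, or timed out: signal the caller
        // to switch proxy or wait before retrying.
        return null;
    }

    $html = (string) $response->getBody();

    return looksBlocked($html, $marker) ? null : $html;
}
```

A caller could treat a `null` result as the signal to rotate to the next proxy or to wait out the block. By default Guzzle also throws other exceptions for 4xx and 5xx responses; those are deliberately not caught here.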


1 answer(s)
Mikhail Sisin, 2018-03-23
@hyget

First of all, try to disguise yourself as a browser as much as possible: send headers similar to a browser's with each request, and make sure the User-Agent you send is a browser one and not something like "php-crawler". Clear the cookie jar after fetching each page (this helps very often). Pause between page fetches; you can experiment with anything from a few seconds to minutes, and make the pauses random. As for the certificate, you can turn off certificate verification:

$this->client = new GuzzleClient(['verify' => false ]);
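
The other suggestions (browser-like headers, a fresh cookie jar per page, random pauses) might be combined roughly as follows. This is only a sketch, assuming Guzzle is installed via Composer and its autoloader is already loaded; the function names, header values, and the 3 to 15 second range are illustrative assumptions, not values known to work for any particular site:

```php
<?php
use GuzzleHttp\Client;
use GuzzleHttp\Cookie\CookieJar;

/** Browser-like headers; illustrative values, copy real ones from your own browser. */
function browserHeaders(): array
{
    return [
        'User-Agent'      => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36',
        'Accept'          => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language' => 'en-US,en;q=0.9',
    ];
}

/** Random pause in milliseconds between $minSeconds and $maxSeconds inclusive. */
function randomDelayMs(int $minSeconds, int $maxSeconds): int
{
    return random_int($minSeconds * 1000, $maxSeconds * 1000);
}

/** Fetch one page with browser headers, an empty cookie jar, and a random delay. */
function fetchPage(Client $client, string $url, string $proxy): string
{
    $response = $client->get($url, [
        'headers' => browserHeaders(),
        'cookies' => new CookieJar(),      // fresh, empty jar for every page
        'delay'   => randomDelayMs(3, 15), // Guzzle's delay option is in milliseconds
        'proxy'   => $proxy,
        'verify'  => false,                // as in the answer; disables certificate checks
    ]);

    return (string) $response->getBody();
}
```

With a client created as above, each page would then be fetched with something like `fetchPage($client, 'https://site/page/' . $i, $proxy)`.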
