Sergey Mironov, 2020-10-13 17:13:48

How to scrape secure sites?

I need to scrape AliExpress, but after a while a verification page pops up.

[Screenshot: the verification page]

I've been fighting with this for several days, and it won't let me parse anything.

But if I open the link directly, everything is fine: the page loads without any verification.
Here is my code.

$url = "https://aliexpress.ru/af/category/202003449.html?categoryBrowse=y&origin=n&CatId=202003449&catName=sweaters";

$headers = array(
  'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
  // accept-encoding is not sent by hand: cURL does not decompress the body unless CURLOPT_ENCODING is set
  'accept-language: ru-RU,ru;q=0.9,en-US;q=0.8,en;q=0.7',
  'cache-control: max-age=0',
  'sec-fetch-dest: document',
  'sec-fetch-mode: navigate',
  'sec-fetch-site: none',
  'sec-fetch-user: ?1',
  'upgrade-insecure-requests: 1',
  'User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'
);

$ch = curl_init($url);

curl_setopt($ch, CURLOPT_COOKIEFILE, __DIR__ . '/cookie.txt');
curl_setopt($ch, CURLOPT_COOKIEJAR, __DIR__ . '/cookie.txt');
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36");
curl_setopt($ch, CURLOPT_REFERER, "https://aliexpress.ru/");
curl_setopt($ch, CURLOPT_HEADER, true);
$html = curl_exec($ch);
// The status code is only available after the request has been executed
$code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
print_r($html);


Which direction should I dig in?

1) Is it possible to parse AliExpress with curl alone and get past the protection? Or is it so heavily protected that nothing can be done?

2) Maybe I should use curl + Selenium? But that technology is still unclear to me. Maybe someone knows it.


2 answer(s)
Sergey Karbivnichy, 2020-10-13
@hottabxp

How to scrape secure sites in 2020?
Exactly the same as in 2019.
There is a (more or less) reliable way: download an extension skeleton for your browser and write a JS extension that does the parsing. I don't know JS, but knowing Python, I wrote an extension for parsing numbers from OLX in a couple of evenings. It was simple, and the numbers were just printed to the developer console, but it worked. If you understand the essentials of programming, you can write software that solves your problem in almost any language in a couple of evenings.

Maxim, 2020-10-13
@Tomio

Buy a pool of proxy addresses from different countries =) 100-200 proxies will let you parse a certain amount of data. You can also try registering as a developer and using their API.
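For illustration, rotating requests across a proxy pool with cURL could look roughly like the sketch below. The proxy addresses are placeholders from a documentation range, and `pickProxy` and `fetchViaProxy` are hypothetical helper names, not part of any library; only `CURLOPT_PROXY` and the other cURL options are real.

```php
<?php
// Placeholder proxy list (assumption): replace with addresses from your own pool.
$proxies = [
    'http://203.0.113.10:8080',
    'http://203.0.113.11:8080',
    'http://203.0.113.12:8080',
];

/**
 * Pick a proxy for the given request number, round-robin.
 * $requestNumber is expected to be zero or greater.
 */
function pickProxy(array $proxies, int $requestNumber): string
{
    if ($proxies === []) {
        throw new InvalidArgumentException('Proxy list is empty');
    }
    return $proxies[$requestNumber % count($proxies)];
}

/**
 * Fetch a URL through the given proxy.
 * Returns [HTTP status code, response body]; the body is '' on failure.
 */
function fetchViaProxy(string $url, string $proxy): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_PROXY          => $proxy,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT        => 30,
    ]);
    $body = curl_exec($ch);
    // The status code is read after the request has been executed
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    return [$code, $body === false ? '' : $body];
}
```

Inside a loop over pages, a call such as `[$code, $html] = fetchViaProxy($url, pickProxy($proxies, $i));` would send request number `$i` through the next proxy in the list. Round-robin is just the simplest choice here; picking at random or dropping proxies that start returning the verification page would work as well.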
