PHP
Egor Davydov, 2021-03-15 14:32:40

How to bypass Cloudflare protection when scraping?

Greetings!

I need to scrape a site that is protected against scraping by Cloudflare.
That is, a plain file_get_contents call returns HTTP error 403.
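
For illustration, here is a minimal sketch of how the 403 can be observed from PHP instead of getting only a warning and `false`. The helper names `statusFromHeaders` and `fetchWithStatus` are hypothetical; the sketch assumes the standard HTTP stream wrapper, which fills `$http_response_header` in the calling scope (newer PHP versions also offer `http_get_last_response_headers()`).

```php
<?php
// Extract the numeric status code from a list of raw response header lines,
// such as the $http_response_header array filled in by file_get_contents().
function statusFromHeaders(array $headers): ?int
{
    $status = null;
    foreach ($headers as $line) {
        // Redirects produce several status lines; keep the last one.
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $status = (int) $m[1];
        }
    }
    return $status;
}

// Fetch a URL and return [status code or null, body or null].
function fetchWithStatus(string $url): array
{
    $context = stream_context_create([
        'http' => [
            'ignore_errors' => true, // return the body even on 4xx/5xx
            'timeout'       => 10,
        ],
    ]);
    $body = @file_get_contents($url, false, $context);
    // The HTTP wrapper sets $http_response_header in this local scope.
    $headers = $http_response_header ?? [];
    return [statusFromHeaders($headers), $body === false ? null : $body];
}
```

With `ignore_errors` enabled, the body of the 403 response is returned as well, so it can be inspected to see what the server actually sent.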

After reading a little about how this protection works, I concluded that Cloudflare proxies requests and, before serving the HTML, presents a kind of "invisible" JavaScript captcha. In effect, this captcha is JS code that runs right after loading and then lets the rest of the page load.
With ordinary requests, of course, the JS never runs.

Searching the web, I found several outdated solutions, such as Guzzle Cloudflare bypass, but I could not get any of the ready-made solutions to run.

If I understand the algorithm correctly, the request has to go through a browser that executes JS. In theory, a command-line headless browser should be suitable for this.
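
A rough sketch of that idea from PHP: shell out to a headless Chromium and read back the DOM after scripts have run. The function names `headlessDumpCommand` and `renderPage` are hypothetical, and the binary name `chromium` is an assumption (it may be `chromium-browser` or `google-chrome` on a given system). Whether the rendered page gets past the protection is exactly what would need testing.

```php
<?php
// Build the shell command that asks a headless Chromium to print the rendered DOM.
// The default binary name is an assumption and differs between distributions.
function headlessDumpCommand(string $url, string $binary = 'chromium'): string
{
    return implode(' ', [
        escapeshellarg($binary),
        '--headless',     // run without a visible window
        '--disable-gpu',
        '--dump-dom',     // print the DOM to stdout once the page has loaded
        escapeshellarg($url),
    ]);
}

// Run the command and return the rendered HTML, or null if the browser failed.
function renderPage(string $url): ?string
{
    exec(headlessDumpCommand($url), $lines, $exitCode);
    return $exitCode === 0 ? implode("\n", $lines) : null;
}
```

Inside a container, Chromium often will not start as root without additional flags, so the command may need adjusting for a Docker setup.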

Questions:
1. What is a headless browser? I have only a vague idea of it and have never worked with one.
2. How can I organize multithreading (multi-tabbing?) with such a headless browser?
3. Will this approach work at all to bypass Cloudflare protection?
4. What are the pitfalls to expect?

PS: I'm currently deploying a second Docker machine with Chromium to test the theory, but so as not to waste time I decided to post the question on Habr right away. I will be immensely grateful to anyone who can help with an answer or a comment!

1 answer(s)
FasterTans, 2021-03-15
@FasterTans

1. What is a headless browser? I have only a vague idea of it and have never worked with one.
See Selenium and Puppeteer.
2. How can I organize multithreading (multi-tabbing?) with such a headless browser?
Depending on the tool you choose, search Google for "selenium multithreading" or "puppeteer multithreading".
3. Will this approach work at all to bypass Cloudflare protection?
Yes; search Google for how to bypass the protection.
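
On question 2, one possible approach from plain PHP, sketched under assumptions: instead of threads or tabs, start several external processes (for example, headless browser invocations) and cap how many run at once. The function name `runPool` is hypothetical; it uses only `proc_open` and non-blocking pipes, and it collects stdout only.

```php
<?php
// Run shell commands in parallel, at most $limit at a time.
// Returns the captured stdout per command, keyed like the input array
// (null if a process could not be started).
function runPool(array $commands, int $limit = 4): array
{
    $queue   = $commands; // commands not started yet
    $running = [];        // key => [process handle, stdout pipe]
    $output  = [];

    while ($queue || $running) {
        // Start new processes until the limit is reached.
        while ($queue && count($running) < $limit) {
            $key = array_key_first($queue);
            $cmd = $queue[$key];
            unset($queue[$key]);

            $proc = proc_open($cmd, [1 => ['pipe', 'w']], $pipes);
            if ($proc === false) {
                $output[$key] = null;
                continue;
            }
            stream_set_blocking($pipes[1], false);
            $running[$key] = [$proc, $pipes[1]];
            $output[$key]  = '';
        }

        // Drain available output and clean up finished processes.
        foreach ($running as $key => [$proc, $pipe]) {
            $output[$key] .= stream_get_contents($pipe);
            if (!proc_get_status($proc)['running']) {
                $output[$key] .= stream_get_contents($pipe); // read the remainder
                fclose($pipe);
                proc_close($proc);
                unset($running[$key]);
            }
        }
        usleep(10000); // short pause to avoid a busy loop
    }
    return $output;
}
```

For example, `runPool(['a' => 'echo one', 'b' => 'echo two'], 1)` runs the two commands one after another, while a larger limit lets them overlap. Selenium and Puppeteer have their own mechanisms for multiple tabs or sessions, which this sketch does not cover.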
