How to scrape websites protected by CloudFront?

Crawling

EgorkaOle, 2020-09-22 00:06:02

There is a popular website behind CloudFront that occasionally publishes news I need to scrape as quickly as possible. Each edge node caches the page for two minutes, so you usually get a cached copy. Is there a more efficient option than fetching the page through hundreds of different proxies at once and hoping one of them returns the latest version? If not, which proxy is best for this?

1 answer(s)

alpeg, 2020-09-22

Whether you can bypass the cache depends entirely on the site's settings; you can't figure it out without experimenting.
Try varying the headers, GET parameters, and cookies.
I recommend reading the CloudFront documentation itself, especially the sections on Query String Parameters, Cookies, and Request Headers.
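
To make the experiment concrete, here is a minimal sketch of such a probe. It assumes the site exposes CloudFront's usual X-Cache response header ("Hit from cloudfront" / "Miss from cloudfront") and the standard Age header. The `_cb` parameter, the `cb` cookie, the `X-Cache-Probe` header and the function names are made up for illustration; which of them (if any) is part of the cache key depends on how the distribution is configured.

```python
import urllib.request
import uuid
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_variants(url):
    """Return (label, url, headers) tuples, each changing one request dimension."""
    token = uuid.uuid4().hex  # random value so every probe is unique
    parts = urlsplit(url)
    # Append a hypothetical cache-busting parameter to the existing query string.
    query = parse_qsl(parts.query, keep_blank_values=True) + [("_cb", token)]
    busted_url = urlunsplit(parts._replace(query=urlencode(query)))
    return [
        ("baseline", url, {}),
        ("query parameter", busted_url, {}),
        ("cookie", url, {"Cookie": f"cb={token}"}),          # hypothetical cookie name
        ("header", url, {"X-Cache-Probe": token}),           # hypothetical header name
    ]


def cache_status(headers):
    """Return the first word of X-Cache in lowercase ('hit', 'miss', ...) or 'unknown'."""
    words = (headers.get("X-Cache") or "").split()
    return words[0].lower() if words else "unknown"


def probe(url, timeout=10):
    """Send each variant once and print what the edge reported for it."""
    for label, variant_url, headers in build_variants(url):
        request = urllib.request.Request(variant_url, headers=headers)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            status = cache_status(response.headers)
            age = response.headers.get("Age")
            print(f"{label}: X-Cache={status} Age={age}")
```

Call `probe()` with the page URL. If a variant keeps returning "miss" on repeated runs while the baseline returns "hit", that dimension is probably part of the cache key and can be used to force a fresh fetch. If every variant comes back as a hit, the distribution most likely ignores all three, and you are back to the proxy approach.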