How to scrape websites protected by CloudFront?
There is one popular website protected by CloudFront that occasionally releases news that needs to be scraped as quickly as possible. The cache time is two minutes on each node, so you usually get the cached version. Are there any efficient options other than fetching a page from hundreds of different proxies at the same time hoping to get the latest version somewhere? If not, what is the best proxy?
Whether you can bypass the cache depends entirely on the site's settings, and you can't figure that out without experimenting.
Try varying the request headers, GET parameters, and cookies to see which of them, if any, are part of the cache key.
I recommend reading the CloudFront documentation itself, especially the sections on Query String Parameters, Cookies, and Request Headers.
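As a rough sketch of that experiment, something like the following could work. It sends the same URL four ways, changing one thing at a time, and reads the `X-Cache` response header that CloudFront adds (for example `Hit from cloudfront` or `Miss from cloudfront`). The parameter name `cb`, the header `X-Probe`, and the cookie `probe` are made-up placeholders, not anything the site is known to honor; whether any of them changes the result depends on how the distribution is configured.

```python
import urllib.request
import uuid
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_variants(url):
    """Return (label, url, headers) tuples, each changing one part of the request."""
    token = uuid.uuid4().hex  # random value so every probe is unique
    parts = urlsplit(url)
    # "cb" is a hypothetical cache-busting parameter name.
    query = parse_qsl(parts.query, keep_blank_values=True) + [("cb", token)]
    busted = urlunsplit(parts._replace(query=urlencode(query)))
    return [
        ("baseline", url, {}),
        ("query", busted, {}),
        ("header", url, {"X-Probe": token}),            # hypothetical header
        ("cookie", url, {"Cookie": "probe=" + token}),  # hypothetical cookie
    ]


def cache_status(headers):
    """Return the first word of X-Cache in lowercase, or 'unknown' if absent."""
    value = headers.get("X-Cache") or ""
    words = value.split()
    return words[0].lower() if words else "unknown"


def probe(url, timeout=10):
    """Request every variant once and report its cache status and Age header."""
    results = {}
    for label, target, headers in build_variants(url):
        req = urllib.request.Request(target, headers=headers)
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            results[label] = (cache_status(resp.headers), resp.headers.get("Age"))
    return results
```

Calling `probe("https://example.com/news")` with the real page URL returns a dict keyed by variant. If the baseline comes back as `hit` while one of the variants comes back as `miss`, that part of the request is probably included in the cache key and can be used to force a fresh fetch. If every variant is a `hit`, none of these three tricks works on that site.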