Python
Rinat Bakiev, 2017-06-02 08:33:18

How do I scrape a site that is "forever" unavailable?

Hello!
A government agency's site times out on almost "146%" of requests, even at night (I hoped the load would be lower then). Sometimes it does work, and I get a response like this:

Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Content-Length: 34854
Content-Type: text/html; charset=windows-1251
Date: Fri, 02 Jun 2017 04:51:08 GMT
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Pragma: no-cache
Server: Microsoft-IIS/6.0
Set-Cookie: phpbb3_kwgpv_u=1 ; expires=Sat, 02-Jun-2018 04:51:08 GMT; path=/; HttpOnly
phpbb3_kwgpv_k=; expires=Sat, 02-Jun-2018 04:51:08 GMT; path=/; HttpOnly
X-Powered-By: PHP/5.2.8

I need to download the HTML pages, plus the .js files that some of the data was put into. The list of download links is already built. Most of the time, though, the response is this error:
Content-Length: 866
Content-Type: text/html
Date: Fri, 02 Jun 2017 05:26:42 GMT
Server: Microsoft-IIS/6.0
FastCGI Error
The FastCGI Handler was unable to process the request.
Error Details:
The FastCGI process exceeded configured request timeout
Error Number: 258 (0x80070102).
Error Description: The wait operation timed out.
HTTP Error 500 - Server Error.
Internet Information Services (IIS)
What's the best way to download 50,000 files?


2 answers
Artem, 2017-06-02
@devspec

1. Scrape in a single thread so you don't put extra load on the site.
2. Set a higher timeout.
3. If a page is unavailable, request it again until it downloads.
It will take a long time, but there is no other way.
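
The three steps above can be sketched roughly like this. It is a minimal illustration, not a tested scraper for this particular site: the names `fetch_url` and `download_all`, the 120-second timeout, the retry limit, the pause between attempts, and the index-based file names are all assumptions chosen for the example.

```python
import http.client
import time
import urllib.request
from pathlib import Path


def fetch_url(url, timeout=120):
    """Download one URL with a generous timeout and return the raw bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_all(urls, out_dir, fetch=fetch_url, max_attempts=20,
                 pause=5.0, sleep=time.sleep):
    """Download URLs strictly one at a time, retrying each failure.

    Returns the list of URLs that still failed after max_attempts tries.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for index, url in enumerate(urls):
        # Hypothetical naming scheme: position in the link list.
        target = out_dir / f"{index:05d}.dat"
        if target.exists():
            continue  # saved on an earlier run, so the job can be resumed
        for attempt in range(1, max_attempts + 1):
            try:
                data = fetch(url)
            except (OSError, http.client.HTTPException):
                # HTTP 500 responses, connection errors and socket timeouts
                # all end up here; wait a little longer after each failure.
                sleep(pause * attempt)
                continue
            target.write_bytes(data)
            break
        else:
            failed.append(url)  # never succeeded within max_attempts
    return failed
```

Because files that already exist are skipped, the script can be stopped and restarted without re-downloading anything, which matters for a job of 50,000 files. Running `download_all` again on the returned list of failed URLs would require keeping the original indices; the simplest option is to rerun it on the full link list.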

InoMono, 2018-02-06
@InoMono

Look for the pages in a web archive:
https://en.wikipedia.org/wiki/%D0%90%D1%80%D1%85%D...
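
If the archive meant here is the Internet Archive's Wayback Machine (an assumption, since the link above is truncated), its public availability endpoint can be asked whether a snapshot of a page exists. The sketch below only builds the query and reads the JSON reply; the function names are hypothetical, and whether this particular site was ever archived is unknown.

```python
import json
import urllib.parse
import urllib.request

# Wayback Machine availability endpoint (assumed to be the archive in question).
AVAILABILITY_API = "https://archive.org/wayback/available"


def availability_query(page_url):
    """Build the availability-check URL for one page."""
    return AVAILABILITY_API + "?" + urllib.parse.urlencode({"url": page_url})


def closest_snapshot(api_response_text):
    """Return the URL of the closest archived snapshot, or None if there is none."""
    data = json.loads(api_response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def find_snapshot(page_url, timeout=60):
    """Ask the archive about one page and return a snapshot URL or None."""
    query = availability_query(page_url)
    with urllib.request.urlopen(query, timeout=timeout) as response:
        return closest_snapshot(response.read().decode("utf-8"))
```

Pages that have a snapshot could be fetched from the archive, leaving only the remainder to be requested from the slow server.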
