C++ / C#
NoEscape, 2014-01-17 01:30:14

What is the most efficient way to download several million HTML pages without it taking forever?

My C# program makes GET requests and downloads the page source. In a loop I create a new thread for each page, and each thread runs a function that downloads the HTML. After 1400 iterations of the loop the program hangs, and Visual Studio hangs with it. Sometimes it hangs and throws an out-of-memory error. It seems to be caused by the large number of threads.
Question: how can I download several million HTML pages as efficiently as possible, without it taking forever?


2 answers
Sergey, 2014-01-17
Protko @Fesor

Are you really creating 1400 threads? If you have 8-16 GB of RAM, then of course there won't be enough memory... Do you at least free the memory?
You need to write a queue manager. A few worker threads stay alive the whole time, and each one asks the manager for a new task (handing out tasks has to happen under a lock, so that the threads don't race for the same resource). Once it has a task, the worker thread downloads the data, saves the result to the database or file system, and asks for the next task...
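
The queue-manager idea above could be sketched roughly like this. It is only a sketch under assumptions: `QueueDownloader`, `Run`, `fetch` and `save` are hypothetical names, the queue limit of 1000 is arbitrary, and the standard `BlockingCollection<T>` stands in for a hand-written manager because it already does the locking between threads.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

public static class QueueDownloader
{
    // Processes every URL with a fixed number of worker threads.
    // fetch and save are hypothetical callbacks supplied by the caller:
    // fetch returns the HTML for a URL, save stores it (database, file, ...).
    public static void Run(IEnumerable<string> urls, int workerCount,
                           Func<string, string> fetch,
                           Action<string, string> save)
    {
        // Bounded queue (assumed limit: 1000). Add() blocks while the queue
        // is full, so millions of URLs never sit in memory at once.
        using (var queue = new BlockingCollection<string>(boundedCapacity: 1000))
        {
            var workers = new List<Thread>();
            for (int i = 0; i < workerCount; i++)
            {
                var worker = new Thread(() =>
                {
                    // The collection synchronizes access internally,
                    // so each URL is handed to exactly one worker.
                    foreach (string url in queue.GetConsumingEnumerable())
                    {
                        try
                        {
                            save(url, fetch(url));
                        }
                        catch (Exception ex)
                        {
                            // One failed page must not kill the worker.
                            Console.Error.WriteLine(url + ": " + ex.Message);
                        }
                    }
                });
                worker.Start();
                workers.Add(worker);
            }

            foreach (string url in urls)
            {
                queue.Add(url);
            }

            // Signals that no more tasks are coming; workers exit
            // once the queue has been drained.
            queue.CompleteAdding();
            foreach (Thread worker in workers)
            {
                worker.Join();
            }
        }
    }
}
```

A call might look like `QueueDownloader.Run(urls, 16, url => client.GetStringAsync(url).Result, SavePage)`, where `client` is one shared `HttpClient` and `SavePage` is whatever storage routine you already have. The thread count stays at 16 no matter how many pages are queued, which is what keeps memory use flat.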

nekipelov, 2014-01-17
@nekipelov

I don't program in C#, so I can only comment on the approach to the task. A thread for every page is wasteful. Besides, everyone keeps talking about how convenient asynchronous code is in C#. In other words, threads aren't needed at all: download asynchronously, and a single thread will cope perfectly well with downloading the data, even on a fairly wide channel.
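
For illustration, the asynchronous approach might look something like the sketch below. The names `AsyncDownloader`, `RunAsync`, `fetchAsync`, `saveAsync` and `maxInFlight` are made up for this example, and the cap on simultaneous requests is an added assumption (without one, millions of requests would start at once). Note that in a console program the continuations run on a few thread-pool threads rather than strictly one, but no thread is ever tied to a single page.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class AsyncDownloader
{
    // Downloads all URLs asynchronously with at most maxInFlight
    // requests running at the same time. fetchAsync and saveAsync
    // are hypothetical callbacks supplied by the caller.
    public static async Task RunAsync(IEnumerable<string> urls, int maxInFlight,
                                      Func<string, Task<string>> fetchAsync,
                                      Func<string, string, Task> saveAsync)
    {
        using (var gate = new SemaphoreSlim(maxInFlight))
        {
            var inFlight = new List<Task>();
            foreach (string url in urls)
            {
                // Waits for a free slot without blocking a thread.
                await gate.WaitAsync();
                inFlight.Add(ProcessAsync(url, gate, fetchAsync, saveAsync));

                // Forget finished tasks so the list stays small
                // even for millions of URLs.
                inFlight.RemoveAll(t => t.IsCompleted);
            }

            // Wait for the last requests still running.
            await Task.WhenAll(inFlight);
        }
    }

    private static async Task ProcessAsync(string url, SemaphoreSlim gate,
                                           Func<string, Task<string>> fetchAsync,
                                           Func<string, string, Task> saveAsync)
    {
        try
        {
            string html = await fetchAsync(url);
            await saveAsync(url, html);
        }
        catch (Exception ex)
        {
            // Log and continue; one bad page should not stop the run.
            Console.Error.WriteLine(url + ": " + ex.Message);
        }
        finally
        {
            // Free the slot for the next URL.
            gate.Release();
        }
    }
}
```

With a single shared `HttpClient` named `client`, `fetchAsync` can simply be `url => client.GetStringAsync(url)`. A `maxInFlight` somewhere around 50 to 200 is a plausible starting point, but the right value depends on the channel and on how much load the target servers tolerate.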
