yurabrg, 2015-04-02 16:24:34
C++ / C#

Many parallel HttpWebRequest calls through different proxies. How to get maximum speed?

I am writing an application that parses a bookmaker's website. The goal is to fetch fresh HTML from the site as often and as quickly as possible (it contains the information I need). Since the bookmaker bans an IP address that sends too many requests, I distribute the requests across a pool of 50 to 100 proxies.
The language is C#. I use asynchronous requests (HttpWebRequest) like this:

// Namespaces used: System, System.IO, System.Net, System.Threading.Tasks
ServicePointManager.DefaultConnectionLimit = 1000;
ServicePointManager.Expect100Continue = false;
ServicePointManager.UseNagleAlgorithm = false;

for (var i = 0; i < proxyCollection.Count; i++)
{
    // Copy loop state into locals so each lambda captures its own values
    var proxyLocal = proxyCollection[i];
    var iLocal = i;
    Task.Run(async () =>
    {
        var httpWebRequest = (HttpWebRequest) WebRequest.Create(String.Format("https://bookmaker.com/page{0}", iLocal));
        httpWebRequest.Proxy = proxyLocal;
        httpWebRequest.PreAuthenticate = true;
        httpWebRequest.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;

        using (var httpWebResponse = (HttpWebResponse) await httpWebRequest.GetResponseAsync())
        using (var responseStream = httpWebResponse.GetResponseStream())
        using (var streamReader = new StreamReader(responseStream))
        {
            var stringContent = await streamReader.ReadToEndAsync();
            // Here I process the data received from the site. Processing is very fast; it is not the problem
            ProcessStringContent(stringContent);
        }
    });
}

All the settings I use for ServicePointManager and for each individual request are shown in the code.
First, for some reason the requests do not all start at the same time. When I watch the network load in Task Manager, I see two or more bursts of requests with up to several seconds of silence between them. I do not add any delays anywhere, so this behavior is strange. It also shows in another way: I have several very fast proxies whose requests take no more than 200 ms, but when I launch all requests at once, the requests through those fast proxies have often not even started after 5 seconds.
Second, the total execution time of all requests is quite large. This is most likely a consequence of the first problem, but it still deserves attention. If I make a request through each proxy separately and synchronously, each one takes 100-1000 ms. So I expect the total time for all (100) requests sent at once not to exceed, say, 2 seconds. After all, they are all processed asynchronously!
Third, and strangest of all, the code above makes the UI lag. I have a WPF application, and I am quite familiar with the dispatcher thread and everything related to it. Still, it puzzles me how code that runs in a new Task, and is asynchronous on top of that, can slow down the UI.
My only guess is that the synchronous WebRequest.Create call may take a while (people online sometimes say that detecting proxy settings and doing DNS lookups takes a long time). If so, all the threads are busy creating requests, and that causes the lag. Is it possible to speed this up somehow (skip the proxy settings lookup, speed up DNS resolution) and avoid the lag?
I have tried different combinations of ServicePointManager settings and have not found one that helps.
I would appreciate any help with this. I am interested in any opinions on how to properly organize this kind of task in .NET. There is very little information online; maybe someone can share their experience? Surely there are existing solutions for frequent, fast collection of data from various provider sites.
Thanks in advance.

1 answer(s)
AxisPod, 2015-04-02

So you have boxed yourself in with Task.Run and async/await, and now you ask why. Task.Run does not guarantee that everything runs on separate threads; in fact the work is queued on the ThreadPool. And using await ends up funneling everything onto a very small number of threads. Here it is better to do it the old-fashioned way, with the old mechanisms; it will be more efficient. BeginGetResponse/BeginGetRequestStream will help you, and you won't need many threads.
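
A minimal sketch of the callback-based (Begin/End) pattern mentioned above might look like the following. The `ApmFetcher` class, the `BeginFetch` helper, and its `onDone`/`onError` parameters are hypothetical names for illustration and are not part of the original question or answer; only `BeginGetResponse`, `EndGetResponse`, and `GetResponseStream` are real `WebRequest` members. The caller is assumed to create and configure the request (proxy, decompression) the same way as in the question.

```csharp
using System;
using System.IO;
using System.Net;

static class ApmFetcher
{
    // Hypothetical helper: starts the request with BeginGetResponse and
    // reports the body (or the failure) from the completion callback.
    public static void BeginFetch(WebRequest request,
                                  Action<string> onDone,
                                  Action<Exception> onError)
    {
        request.BeginGetResponse(asyncResult =>
        {
            try
            {
                // EndGetResponse completes the operation started above and
                // rethrows any error that occurred while it was in flight.
                using (var response = request.EndGetResponse(asyncResult))
                using (var stream = response.GetResponseStream())
                using (var reader = new StreamReader(stream))
                {
                    // Simplification: the body is read synchronously on the
                    // callback thread. A fuller version could use
                    // Stream.BeginRead to keep this part asynchronous too.
                    onDone(reader.ReadToEnd());
                }
            }
            catch (Exception ex)
            {
                onError(ex);
            }
        }, null);
    }
}
```

In the loop from the question, this could be called once per proxy, for example `ApmFetcher.BeginFetch(httpWebRequest, ProcessStringContent, ex => { /* log or retry */ })`, without wrapping the call in Task.Run. Note that, according to the documentation, BeginGetResponse still performs some setup synchronously (DNS resolution, proxy detection, opening the TCP connection) before it becomes asynchronous, so this sketch alone may not remove the startup delays described in the question.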
