kotarak, 2019-02-20 00:03:22
Parsing

How to quickly parse links on the portal?

I need to find broken links on a portal.
I download an HTML page, find all the links on it, and add them to the download queue. Along the way, I flag the broken or incorrect ones.

try
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(link.ToString());
    request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; rv:11.0) Gecko/20100101 Firefox/15.0";
    // The Accept property takes only the header value, without the "Accept: " prefix.
    request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
    request.AllowAutoRedirect = true;
    request.Timeout = 10000; // milliseconds
    // The using blocks release the response and its connection even if reading fails.
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    using (StreamReader respStream = new StreamReader(response.GetResponseStream()))
    {
        html = respStream.ReadToEnd();
    }
}
catch (Exception ex)
{
    // Report the failed link together with the page it was found on.
    System.Console.WriteLine("-------------\n" +
            "Bad link: " + link + "\n" +
            "From: " + link.Parent +
            "\n" + ex.Message);
    link.ErrorComments = ex.Message;
    link.Parent.AddSon(link);
    continue;
}

Above is the download code.
I parallelized the program, but the overall speed barely changed: it went from 60 links per minute in a single thread to only 120. Maybe I'm not downloading pages in the optimal way? My Internet connection is fast enough to download 10-20 pages per second. On top of that, the download speed drops over time. How can I speed up the process?
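
One possible factor worth checking, assuming the program runs on .NET Framework: HttpWebRequest goes through ServicePointManager, whose DefaultConnectionLimit is 2 concurrent connections per host for client applications, so additional threads may mostly wait for a free connection. The following is only a minimal sketch of raising that limit and downloading in parallel, not the program described above; the URLs, the limit of 20, and the class name are hypothetical values chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Threading.Tasks;

class CrawlerSketch
{
    static void Main()
    {
        // Assumption: .NET Framework, where a client application gets
        // 2 concurrent connections per host by default.
        ServicePointManager.DefaultConnectionLimit = 20; // illustrative value

        // Hypothetical URLs standing in for the real download queue.
        var links = new List<string>
        {
            "http://example.com/",
            "http://example.com/about"
        };

        // Cap the number of parallel downloads at the connection limit.
        var options = new ParallelOptions { MaxDegreeOfParallelism = 20 };

        Parallel.ForEach(links, options, url =>
        {
            try
            {
                var request = (HttpWebRequest)WebRequest.Create(url);
                request.Timeout = 10000; // milliseconds
                // Disposing the response returns the connection to the pool.
                using (var response = (HttpWebResponse)request.GetResponse())
                {
                    Console.WriteLine(url + " -> " + (int)response.StatusCode);
                }
            }
            catch (WebException ex)
            {
                Console.WriteLine("Bad link: " + url + " (" + ex.Message + ")");
            }
        });
    }
}
```

Whether this helps depends on the runtime and on how many of the links point at the same host; the limit has to be set before the first request to that host is created.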
