Alexey Smirnov, 2017-02-18 13:07:00
.NET

How to parse HTML using HttpClient?

Hello.
I need to parse HTML data. To do this, I use the following code (C# .NET):

string pathToHtml = "link";
WebClient client = new WebClient();
var data = client.DownloadData(pathToHtml);
var html = Encoding.UTF8.GetString(data);

// Create the local variable "doc".
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();

// Load the HTML into the local variable "doc".
doc.LoadHtml(html);

var x = doc.DocumentNode.SelectNodes("XPath expression").Elements("tr").ToList();

The above code works well on my computer.
Now I need to use this code in .NET Core. The problem is that WebClient is not supported in .NET Core: stackoverflow.com. For .NET Core, you need (though it may not be strictly required) to use HttpClient.
I tried to rewrite the code in different ways that I found on the Internet, but the code did not work for me.
I would like to ask you to show me a simple example of working HTML parsing code with HttpClient.
PS HtmlAgilityPack has a version for .NET Core (called HtmlAgilityPack.NetCore), so Agility for .NET Core works well.
PPS I'm inexperienced.

1 answer(s)
Tom Nolane, 2017-02-18
@ERAFY

There are many ways, but I will suggest a universal one. It is a bit of a crutch, but it is compact and needs no third-party libraries:

using System;
using System.Net.Http;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

namespace ConsoleApplication3
{
    public static class Program
    {
        // One shared HttpClient for the whole application.
        private static readonly HttpClient client = new HttpClient();

        private static void Main()
        {
            // Block until the async work finishes.
            ShowTags("https://www.yandex.ru/", "a").GetAwaiter().GetResult();
            Console.ReadKey();
        }

        // Default tag to search for: <a></a>.
        private static async Task ShowTags(string my_url, string tag = "a")
        {
            // Load the page; null means the download failed.
            string data = await GetHtmlPageText(my_url);

            if (data != null)
            {
                string pattern = string.Format(@"\<{0}\b.*?\>(?<tegData>.+?)\<\/{0}\>", tag.Trim());
                // \<{0}\b.*?\> - opening tag (\b stops "a" from also matching e.g. <abbr>)
                // \<\/{0}\> - closing tag
                // (?<tegData>.+?) - tag content, captured in the tegData group

                Regex regex = new Regex(pattern, RegexOptions.ExplicitCapture);
                MatchCollection matches = regex.Matches(data);

                foreach (Match match in matches)
                {
                    Console.WriteLine(match.Value);
                    Console.WriteLine("Content:");
                    Console.WriteLine(match.Groups["tegData"].Value);
                    Console.WriteLine("---------------------------");
                }
            }
            else
            {
                Console.WriteLine("Error loading page: " + my_url);
            }
        }

        private static async Task<string> GetHtmlPageText(string url)
        {
            try
            {
                // ... use HttpClient.
                using (HttpResponseMessage response = await client.GetAsync(url))
                {
                    // Throws HttpRequestException on a non-success status code.
                    response.EnsureSuccessStatusCode();
                    // ... read the response body
                    return await response.Content.ReadAsStringAsync();
                }
            }
            catch (HttpRequestException)
            {
                // Signal failure to the caller instead of crashing.
                return null;
            }
        }
    }
}

Result for the Yandex example:
<a href="http://mail.yandex.ru"onclick="c(this,17,1080)">Войти&nbsp;в&nbsp;почту</a>
Content:
Войти&nbsp;в&nbsp;почту
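
The tag-matching idea can be tried on an in-memory string, without any network call. The sketch below is only an illustration: TagExtractor and ExtractTagContents are hypothetical names, and the pattern is a slightly stricter variant of the one in the answer (it adds \b so that "a" does not match <abbr>, and uses Singleline and IgnoreCase, which the answer's code does not).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagExtractor
{
    // Hypothetical helper: returns the inner content of every <tag>...</tag> pair.
    public static List<string> ExtractTagContents(string html, string tag)
    {
        // \b keeps "a" from also matching <abbr> or <article>.
        // [^>]* skips the attributes of the opening tag.
        string pattern = string.Format(
            @"<{0}\b[^>]*>(?<tegData>.*?)</{0}>", Regex.Escape(tag.Trim()));

        // Singleline lets "." cross line breaks; IgnoreCase accepts <A> as well as <a>.
        RegexOptions options = RegexOptions.Singleline | RegexOptions.IgnoreCase;

        var results = new List<string>();
        foreach (Match match in Regex.Matches(html, pattern, options))
        {
            results.Add(match.Groups["tegData"].Value);
        }
        return results;
    }

    public static void Main()
    {
        string sample = "<abbr>x</abbr><a href=\"/mail\">Mail</a> <A>Second\nlink</A>";
        foreach (string content in ExtractTagContents(sample, "a"))
        {
            Console.WriteLine(content);
        }
    }
}
```

Like any regex approach to HTML, this does not handle nested tags of the same name or malformed markup.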

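For simple extraction like this, regex tends to be faster than a full HTML parser, though it is brittle on nested or malformed markup.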
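
Since the question already uses HtmlAgilityPack, the original code can probably be ported by swapping only the download step from WebClient to HttpClient. This is a minimal sketch, not a tested port: the URL and XPath are placeholders, async Main assumes C# 7.1 or later, and it assumes the HtmlAgilityPack build you reference exposes SelectNodes, as the question's code relies on.

```csharp
using System;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack; // NuGet package, as used in the question

public static class Program
{
    // One shared HttpClient for the whole application.
    private static readonly HttpClient Client = new HttpClient();

    public static async Task Main()
    {
        // Placeholders: substitute your own URL and XPath expression.
        string url = "https://example.com/";
        string xpath = "//table";

        // GetStringAsync throws HttpRequestException on a non-success status code.
        string html = await Client.GetStringAsync(url);

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches the XPath.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
        {
            Console.WriteLine("No nodes matched: " + xpath);
            return;
        }

        // Same call chain as in the question: direct <tr> children of the matched nodes.
        var rows = nodes.Elements("tr").ToList();
        Console.WriteLine("Rows found: " + rows.Count);
    }
}
```

The parsing half is unchanged from the question, so any XPath that worked there should behave the same way here.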
