How to parse a page in C#?
The task is to download a page and parse it so that the output contains all (I emphasize, ALL) the text the user actually sees on the page: link text, article titles, and the article content itself. Here is what I do: I added the AngleSharp library so I can select elements by tag, I download the page with an HTTP request, and then I extract all the text with QuerySelectorAll("body").Select(x => x.TextContent). This almost works, except for one problem: it also swallows the JavaScript code that sits inside the body tag (on sites that have it). How can I avoid that?
In the fully general case, this problem can only be solved with OCR (optical character recognition): render the page and feed the rendered image to an OCR engine. The output will be some percentage (close to 100%) of the text recognized.
Everything else is just plain HTML parsing plus per-site (or per-CMS) exceptions. If script text is leaking into your output, remove the script tags from the document before reading the text content. And so on.
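A minimal sketch of that approach, assuming the library in the question is AngleSharp (the sample HTML string and class names here are illustrative, not from the original post): remove the non-visible nodes first, then read `TextContent` from the body.

```csharp
using System;
using System.Linq;
using AngleSharp.Html.Parser;

class Program
{
    static void Main()
    {
        // Stand-in for a downloaded page; in the real case this string
        // would come from an HTTP request.
        var html = "<body><script>var x = 1;</script><p>Hello</p></body>";

        var parser = new HtmlParser();
        var doc = parser.ParseDocument(html);

        // Strip nodes whose text content is never rendered to the user,
        // so their contents do not leak into the extracted text.
        foreach (var el in doc.QuerySelectorAll("script, style, noscript").ToArray())
            el.Remove();

        Console.WriteLine(doc.Body.TextContent.Trim()); // prints "Hello"
    }
}
```

The same `Remove()` pass works on a document loaded with `BrowsingContext.OpenAsync`; the key point is only that the stripping happens before `TextContent` is read.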