How to find the main content of a web page?

Parsing

jallvar, 2019-03-01 16:02:41

Hello everyone. The problem:
- Extract the main content from a web page.
My idea:
iterate over all HTML tags and pick the one with the most content. (Error-prone, I know.)
Are there any ready-made solutions or ideas on how to do this?
This service, for example, manages to do it:
https://be1.ru/antiplagiat-online/ (not an ad)
Preferably in Python or C#.
Thanks in advance.

2 answers

Alexander Madzhugin (@Suntechnic), 2019-03-07

With that approach, body will win: it contains all the other tags, so it always has the most content.
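To illustrate, here is a minimal sketch of the "tag with the most content" idea using only Python's standard `html.parser`. The names `TextPerTag`, `largest_text_tag` and `SAMPLE` are made up for this example, and "content" is assumed to mean the total length of nested text (script and style text is not filtered out).

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TextPerTag(HTMLParser):
    """Records, for every closed element, the length of all text nested in it."""

    def __init__(self):
        super().__init__()
        self.stack = []   # open elements as [tag, text_length]
        self.closed = []  # (tag, text_length) for every closed element

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append([tag, 0])

    def handle_endtag(self, tag):
        if not any(name == tag for name, _ in self.stack):
            return  # stray closing tag, ignore it
        while self.stack:
            name, length = self.stack.pop()
            self.closed.append((name, length))
            if self.stack:
                # The text of a child also belongs to its parent.
                self.stack[-1][1] += length
            if name == tag:
                break

    def handle_data(self, data):
        if self.stack:
            self.stack[-1][1] += len(data.strip())


def largest_text_tag(html):
    """Return (tag, text_length) of the element holding the most text."""
    parser = TextPerTag()
    parser.feed(html)
    parser.close()
    return max(parser.closed, key=lambda item: item[1])


SAMPLE = ("<html><body><nav>Home About</nav>"
          "<article><p>Long article text here.</p></article>"
          "</body></html>")

print(largest_text_tag(SAMPLE)[0])  # body, not article
```

Because every parent accumulates the text of its children, the winner is always an outermost container, which is why a plain maximum is not enough on its own.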
Take two groups of pages:
Group A - the target page, loaded many times.
Group B - other pages at the same nesting level. It is better to load each of these 2-3 times as well.
Now take the target page and remove from it every element that differs on at least one page from group A. This discards ads, news columns that load differently each time, and so on.
The next step is to remove from the page every element that matches an element on any page from group B.
What remains is the main content, in a general sense. It may still include, say, lists of recommended products, or lists of similar news or articles on the same topic: these will most likely differ from the ones on the group B pages, yet they do not change when the page is reloaded, so the comparison with group A does not eliminate them. Here you can try removing regular (repeating) structures, and accept that cleanup only if it removes the smaller part of the content (this is needed so that you don't strip, for example, the p tags out of an article). You can also take into account that such regular structures will have many nested tags, unlike ordinary content.
That is roughly how I would do it.
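The two comparison steps above could be sketched roughly like this, using only the standard library. This is a simplification under stated assumptions: an "element" is reduced to a (tag path, text) pair, elements are compared by exact equality, and the names `BlockExtractor`, `blocks` and `main_content` are made up for the example. The final cleanup of regular structures is not covered.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are kept out of the path.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class BlockExtractor(HTMLParser):
    """Collects (tag path, text) pairs for every text node on the page."""

    def __init__(self):
        super().__init__()
        self.path = []    # currently open tags, outermost first
        self.blocks = []  # e.g. ("body/article/p", "Story text")

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.path.append(tag)

    def handle_endtag(self, tag):
        if tag in self.path:
            # Also closes any unclosed tags nested inside this one.
            while self.path.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and not {"script", "style"} & set(self.path):
            self.blocks.append(("/".join(self.path), text))


def blocks(html):
    parser = BlockExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks


def main_content(target_html, reloads_a, pages_b):
    """Texts that are stable across reloads (A) and absent from sibling pages (B)."""
    candidate = blocks(target_html)
    # Step 1: drop everything that differs on at least one reload of the page.
    for html in reloads_a:
        seen = set(blocks(html))
        candidate = [block for block in candidate if block in seen]
    # Step 2: drop everything that also appears on any sibling page.
    shared = set()
    for html in pages_b:
        shared.update(blocks(html))
    return [text for path, text in candidate if (path, text) not in shared]


target = "<body><nav>Menu</nav><div>Ad 1</div><article><p>Story text</p></article></body>"
reload = "<body><nav>Menu</nav><div>Ad 2</div><article><p>Story text</p></article></body>"
sibling = "<body><nav>Menu</nav><div>Ad 3</div><article><p>Other story</p></article></body>"

print(main_content(target, [reload], [sibling]))  # ['Story text']
```

In practice the HTML for each group would be downloaded first, and exact matching would probably have to be loosened, since even small markup changes between loads would otherwise throw away real content.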

Dimonchik, 2019-04-05
@dimonchik2013

https://www.slideshare.net/PyNSK/python-53858880