How do I extract the article text from an HTML page?

Python

albertalexandrov, 2018-07-23 14:05:12

Hey!
There are news sites, blogs, and so on whose pages contain articles. Besides the article text, these pages also contain comments, advertising, navigation, and other elements. The task is to extract only the article text from such pages. Since the sources differ, each one has its own HTML markup. In other words, I need to implement something like a browser's reader mode.

2 answers

Oleg Zakharov, 2018-07-23
@blazenn12

Take bs4 (BeautifulSoup) and write a parser for each site.
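
A minimal sketch of that approach, assuming each supported site can be handled with a single CSS selector for its article container. The hostnames and selectors in `SITE_SELECTORS` are hypothetical placeholders and `extract_article` is just an illustrative name; real sites need selectors found by inspecting their markup.

```python
from urllib.parse import urlparse

from bs4 import BeautifulSoup

# Hypothetical per-site CSS selectors; every real site needs its own entry.
SITE_SELECTORS = {
    "news.example.com": "div.article-body",
    "blog.example.org": "article .post-content",
}


def extract_article(url: str, html: str) -> str:
    """Return the article text for a registered site, or raise otherwise."""
    host = urlparse(url).netloc
    selector = SITE_SELECTORS.get(host)
    if selector is None:
        raise KeyError(f"no parser registered for {host}")

    soup = BeautifulSoup(html, "html.parser")
    node = soup.select_one(selector)
    if node is None:
        raise ValueError(f"selector {selector!r} matched nothing on {url}")

    # Drop elements that never belong to the article text.
    for junk in node.find_all(["script", "style", "aside"]):
        junk.decompose()

    # Keep one normalized line of text per non-empty paragraph.
    paragraphs = [" ".join(p.get_text().split()) for p in node.find_all("p")]
    return "\n\n".join(p for p in paragraphs if p)
```

Supporting a new source then means adding one entry to the dictionary, which is also the weak point: the selectors break whenever a site changes its layout.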

Doc44, 2018-07-23
@Doc44

If the site has micro-markup (structured data such as schema.org), it's easy.
If not, write a parser individually for each site.
Or try to guess automatically, but the quality will be much worse.
Or ask the user to pick the right block.
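
For the micro-markup case, here is a rough sketch that assumes the site publishes schema.org data as JSON-LD and fills in the `articleBody` property, which many sites omit. It uses only the standard library; `JsonLdCollector`, `find_article_body`, and `article_from_json_ld` are illustrative names. Sites that use `itemprop="articleBody"` attributes instead would need a different lookup.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the raw contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = None  # list of text chunks while inside a JSON-LD script

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.blocks.append("".join(self._buffer))
            self._buffer = None


def find_article_body(node):
    """Depth-first search for the first non-empty "articleBody" string."""
    if isinstance(node, dict):
        body = node.get("articleBody")
        if isinstance(body, str) and body.strip():
            return body.strip()
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_article_body(child)
        if found:
            return found
    return None


def article_from_json_ld(html):
    """Return the article text from JSON-LD markup, or None if there is none."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed block; try the next one
        body = find_article_body(data)
        if body:
            return body
    return None
```

When this returns `None`, fall back to one of the other options in the list.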
See how this parser works: https://evernote.com/intl/en/products/webclipper
Sometimes it guesses right, sometimes it misses, but it offers several options, in some cases with the possibility of manual correction.
The developers who made it are well paid.
Evernote is worth a billion dollars, and notes are its main service.
So rest assured, the parser is well done.
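
The "guess automatically" option can be sketched roughly like this. It is not how the Evernote clipper works internally (that is not public here); it is a simple text-density heuristic of my own choosing: score each container by the amount of paragraph text outside links, and return the top few candidates so a user can pick or correct the result. The tag sets and the names `CandidateScorer` and `guess_article` are assumptions for illustration.

```python
from html.parser import HTMLParser

CONTAINERS = {"article", "main", "section", "div", "td", "body"}
SKIP = {"script", "style", "nav", "footer", "aside", "form"}


class CandidateScorer(HTMLParser):
    """Score containers by the paragraph text they hold directly."""

    def __init__(self):
        super().__init__()
        self.stack = []        # open containers, innermost last
        self.candidates = []   # closed containers that held paragraphs
        self.skip_depth = 0    # > 0 while inside a SKIP element
        self.link_depth = 0    # > 0 while inside <a>
        self.paragraph = None  # text chunks of the open <p>, or None

    def _close_paragraph(self):
        if self.paragraph is not None and self.stack:
            text = " ".join("".join(self.paragraph).split())
            if text:
                self.stack[-1]["paragraphs"].append(text)
        self.paragraph = None

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self._close_paragraph()
            self.skip_depth += 1
        elif self.skip_depth:
            return
        elif tag in CONTAINERS:
            self._close_paragraph()
            self.stack.append({"paragraphs": [], "score": 0})
        elif tag == "p":
            self._close_paragraph()
            self.paragraph = [] if self.stack else None
        elif tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth:
            return
        elif tag == "p":
            self._close_paragraph()
        elif tag == "a":
            self.link_depth = max(0, self.link_depth - 1)
        elif tag in CONTAINERS:
            self._close_paragraph()
            if self.stack:
                node = self.stack.pop()
                if node["paragraphs"]:
                    self.candidates.append(node)

    def handle_data(self, data):
        if self.skip_depth or self.paragraph is None or not self.stack:
            return
        self.paragraph.append(data)
        if not self.link_depth:  # link text does not count toward the score
            self.stack[-1]["score"] += len(data.strip())


def guess_article(html, top=3):
    """Return up to `top` candidate article texts, best guess first."""
    scorer = CandidateScorer()
    scorer.feed(html)
    scorer.close()
    ranked = sorted(scorer.candidates, key=lambda c: c["score"], reverse=True)
    return ["\n\n".join(c["paragraphs"]) for c in ranked[:top]]
```

Expect it to miss on pages with long comment threads or unusual markup, which is why returning several candidates instead of one is the safer design.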