I have a news site parser. It collects articles, but how can I make sure it does not pick up unwanted paragraphs?

Kirill Petrov, 2018-07-29 12:04:22

Python

I have a parser that goes through the navigation page, collects URLs, follows them, and copies the news articles. The problem is that the articles contain tags that also get parsed, but I don't need them. How can I skip them? The unwanted tag has the class "insert", but the strip() method doesn't help: it removes all the tags.

1 answer(s)

Mikhail Bobkov, 2018-07-29
@mike_bma

You need either regular expressions or a DOM crawler. Find the unwanted tag and remove it along with its content.
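
To illustrate the DOM-style approach, here is a minimal sketch that uses only the standard library's html.parser. The thread does not say which parsing library the original code uses, so the names ArticleTextExtractor and extract_text and the sample markup are hypothetical. It also assumes reasonably well-formed HTML in which every non-void element is closed. If the parser is built on Beautiful Soup, calling decompose() on each tag with the "insert" class before extracting the text should give the same result.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ArticleTextExtractor(HTMLParser):
    """Collect text, skipping elements whose class list contains skip_class.

    Hypothetical helper; assumes every non-void element is properly closed.
    """

    def __init__(self, skip_class="insert"):
        super().__init__(convert_charrefs=True)
        self.skip_class = skip_class
        self.skip_depth = 0  # > 0 while we are inside an unwanted element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        # Start skipping at the unwanted tag; keep counting nested tags inside it.
        if self.skip_depth or self.skip_class in classes:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every unwanted element.
        if not self.skip_depth:
            self.chunks.append(data)


def extract_text(html, skip_class="insert"):
    """Return the article text with unwanted blocks removed and whitespace collapsed."""
    parser = ArticleTextExtractor(skip_class)
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


# Made-up sample markup, for illustration only.
sample = (
    '<article><p>First paragraph.</p>'
    '<div class="insert"><p>Read also: other news</p></div>'
    '<p>Second paragraph.</p></article>'
)
print(extract_text(sample))  # First paragraph. Second paragraph.
```

A parser-based filter is usually safer than a regular expression here, because the unwanted block may contain nested tags of the same name, which a simple pattern cannot match reliably.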