How to clean up (post-process) HTML after scraping?

Python

weranda, 2018-12-19 09:52:08

Greetings.
I'm practicing on cats. Pulling individual elements out of a page is more or less clear to me, but post-processing the saved HTML is not. I still don't understand how to strip it of unnecessary data (links, text highlighting, other tags). Regular expressions come to mind, and in theory they could be used, but they are generally advised against for this.
What I do:
- Use Scrapy to save the contents of the .mw-parser-output block
What is not clear:
- how to remove link tags and text highlighting (bold, italics)
- how to remove the table of contents block
- how to remove all class and id attributes
- how to remove entire blocks (notes, literature, links)
- post-processing of the content in general
Of course, when selecting elements in Scrapy, I could write this right away:

//div[@class='mw-parser-output']/*[not(@class='toc' or @class='reflist not-references')]
This excludes the navigation block and the literature block.
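
For anyone who wants to try the filter without running a spider, here is a minimal sketch of the same selection logic using only the standard library. The SAMPLE markup is a made-up, well-formed stand-in for a saved block, and kept_children is a hypothetical helper name. ElementTree's XPath subset has no not() or "or", so the exclusion is done in plain Python; in a Scrapy spider the expression above can be passed to response.xpath() as is.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a saved .mw-parser-output block.
SAMPLE = """
<div class="mw-parser-output">
  <p>The <b>cat</b> is a <a href="/wiki/Mammal">mammal</a>.</p>
  <div class="toc">Contents</div>
  <h2>Notes</h2>
  <div class="reflist not-references">Literature list</div>
</div>
"""

# The same exact-match class values that the XPath predicate excludes.
EXCLUDED_CLASSES = {"toc", "reflist not-references"}


def kept_children(block):
    """Return the direct children of block whose class attribute is not excluded.

    Elements without a class attribute are kept, as they are by the XPath.
    """
    return [child for child in block if child.get("class") not in EXCLUDED_CLASSES]


root = ET.fromstring(SAMPLE)
kept = kept_children(root)
print([child.tag for child in kept])
```

Note that, like the XPath, this compares the whole class string, so an element with class="toc extra" would not be excluded.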

But I feel that there are much better options.
Please share your knowledge.

2 answers

Vladimir, 2018-12-19
@vintello

Nobody bothers with cleaning up afterwards :)
You get the HTML, pull out what you need from it, structure it, and put it in a database or somewhere else.
Scrapy is built the same way: you have a final Items structure that describes the data to be output, and while parsing the HTML you simply add the necessary data to those Items. That's all.
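
A rough sketch of that idea, assuming the only thing wanted from the page is the plain text of its paragraphs. To keep it runnable without Scrapy installed, a plain dataclass (the hypothetical ArticleItem) stands in for a Scrapy Item, and the standard library's HTMLParser stands in for Scrapy selectors; the HTML string is made up. Because only the text inside <p> is collected, link and bold/italic tags, the table of contents, and all class/id attributes simply never reach the result.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class ArticleItem:
    """Hypothetical item: only the fields we actually want to store."""
    title: str
    paragraphs: list = field(default_factory=list)


class ParagraphTextParser(HTMLParser):
    """Collect the plain text of every <p>, ignoring inline tags like <a>, <b>, <i>."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._parts = None  # text chunks while inside a <p>, otherwise None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._parts = []

    def handle_endtag(self, tag):
        if tag == "p" and self._parts is not None:
            # Join the chunks and normalise whitespace.
            text = " ".join("".join(self._parts).split())
            if text:
                self.paragraphs.append(text)
            self._parts = None

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)


# Made-up page fragment for illustration.
html = (
    '<div class="mw-parser-output">'
    '<p>The <b>cat</b> is a <a href="/wiki/Mammal">small mammal</a>.</p>'
    '<div class="toc"><ul><li>Contents</li></ul></div>'
    '<p>Cats <i>purr</i>.</p>'
    '</div>'
)

parser = ParagraphTextParser()
parser.feed(html)
parser.close()

item = ArticleItem(title="Cat", paragraphs=parser.paragraphs)
print(item)
```

In a real spider the same shape would presumably be a scrapy.Item subclass filled in from response.xpath() or response.css() results inside the parse callback.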

Michael, 2019-01-02
@moonz

The right way is to make the search itself more specific, i.e. to filter out the superfluous at the data-collection stage rather than afterwards.