How do I clean up (post-process) HTML after scraping it?
Greetings
I'm practicing on cats. Pulling individual elements out of a page is more or less clear to me, but post-processing the saved HTML is not. I still don't understand how to strip it of unnecessary data (links, highlighting, other tags). Regular expressions come to mind, and in theory they would work, but they are generally advised against for HTML.
What I do:
- Use Scrapy to save the contents of the .mw-parser-output block
What is not clear:
- how to remove link tags and highlighting (bold, italics) while keeping their text
- how to remove the table of contents block
- how to strip all classes and identifiers
- how to remove entire blocks (notes, literature, links)
- how to approach post-processing of the content in general
Of course, when selecting elements in Scrapy, I could exclude some blocks right away:
//div[@class='mw-parser-output']/*[not(@class='toc' or @class='reflist not-references')]
This excludes the navigation (table of contents) block and the literature block.
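For the cleanup steps listed above, one option that avoids regular expressions is to walk the markup with a parser and re-emit only what should stay. Below is a minimal sketch using only Python's standard `html.parser`. The names `HtmlCleaner`, `clean_html`, `UNWRAP_TAGS`, `SKIP_CLASSES` and `DROP_ATTRS` are made up for illustration, and the tag and class lists are assumptions that would need adjusting for the real page. It also assumes the markup is reasonably well nested.

```python
import html
from html.parser import HTMLParser

# Illustrative assumptions, not taken from the original post:
UNWRAP_TAGS = {"a", "b", "i", "em", "strong", "span"}  # drop the tag, keep its text
SKIP_CLASSES = {"toc", "reflist"}                      # drop the block and its content
DROP_ATTRS = {"class", "id", "style"}                  # attributes to strip
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class HtmlCleaner(HTMLParser):
    """Re-emits HTML without unwanted tags, attributes and blocks."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a block that is being dropped

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            # Track nesting so we know when the dropped block ends.
            if tag not in VOID_TAGS:
                self.skip_depth += 1
            return
        classes = set((dict(attrs).get("class") or "").split())
        if classes & SKIP_CLASSES:
            if tag not in VOID_TAGS:
                self.skip_depth = 1
            return
        if tag in UNWRAP_TAGS:
            return
        kept = "".join(
            f' {name}="{html.escape(value or "")}"'
            for name, value in attrs
            if name not in DROP_ATTRS
        )
        self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return  # void elements have no closing tag to emit or count
        if self.skip_depth:
            self.skip_depth -= 1
            return
        if tag not in UNWRAP_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            # Text arrives with entities decoded, so escape it again.
            self.out.append(html.escape(data, quote=False))


def clean_html(fragment):
    parser = HtmlCleaner()
    parser.feed(fragment)
    parser.close()  # flush any buffered text
    return "".join(parser.out)
```

Calling `clean_html()` on the saved `.mw-parser-output` markup would return the same structure with links and highlighting unwrapped, `class`/`id` attributes gone, and the listed blocks dropped. A tree-based library would probably be more robust against unclosed tags; this version is only meant to show the idea.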
Nobody bothers with cleaning up afterwards :)
You get the HTML, pull out what you need from it, structure it, and put it in a database or somewhere else.
Scrapy is built the same way: you define a final Items structure that describes the data to be output, and while parsing the HTML you simply add the necessary data to those Items. That's all.
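A rough sketch of that idea, under some assumptions: the item here is a plain dataclass (Scrapy accepts dataclass objects as items in recent versions), and the standard `html.parser` stands in for Scrapy's selectors so the example runs on its own. `ArticleItem`, `ParagraphText` and `build_item` are hypothetical names, and the choice of fields is only an example.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class ArticleItem:
    # Hypothetical fields: describe whatever the final output should contain.
    title: str = ""
    paragraphs: list = field(default_factory=list)


class ParagraphText(HTMLParser):
    """Collects the plain text of every <p>, ignoring inline markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._buf = None  # list of text chunks while inside a <p>, else None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "p" and self._buf is not None:
            # Join the chunks and collapse runs of whitespace.
            text = " ".join("".join(self._buf).split())
            if text:
                self.paragraphs.append(text)
            self._buf = None

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)


def build_item(title, html_fragment):
    """Fill the item with only the data that is needed, nothing else."""
    parser = ParagraphText()
    parser.feed(html_fragment)
    parser.close()
    return ArticleItem(title=title, paragraphs=parser.paragraphs)
```

In a real spider the parse callback would build the item from selector results and yield it. Either way the tags, classes and unwanted blocks never reach the output, so there is nothing left to clean.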