Parsing HTML and extracting the text without its formatting. Python + LXML
I'm trying to parse a blog that publishes homework for school students. The HTML page contains DIVs that can be uniquely identified by their CSS and that hold the homework text, unfortunately with decorative formatting added.
If I take the text of an element with element.text_content(), I get everything run together with no markup at all, so the whole homework assignment ends up on one line as a jumbled mess.
If I use XPath instead, for example spans = elementlist[0].xpath("*/span//text()"), then every bit of formatting, whether <b>, <u>, <p>, etc., becomes a separate element. Printing the elements one per line gives an ugly column of values in which it is practically impossible to tell where the real line breaks were.
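To make the two behaviours concrete, here is a minimal sketch. The SAMPLE markup is a made-up stand-in for one homework DIV (it is not copied from the real blog), and the "post-body" class name is an assumption.

```python
from lxml import html

# Hypothetical fragment standing in for one homework DIV (not the real page).
SAMPLE = (
    '<div class="post-body">'
    '<p><span><b>Math:</b> page 12, <u>No. 3</u></span></p>'
    '<p><span>Reading: <i>pages 40-42</i></span></p>'
    '</div>'
)

div = html.fromstring(SAMPLE)

# text_content() concatenates every text node, so the boundary between
# the two paragraphs disappears.
flat = div.text_content()

# The XPath query returns one string per text node, so each inline tag
# (<b>, <u>, <i>) splits a visual line into several list items.
pieces = div.xpath("*/span//text()")

print(repr(flat))
for piece in pieces:
    print(repr(piece))
```

With this sample, the two visual lines come back either as one string or as five separate fragments, and neither result says where the paragraph ended.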
Question: how can I extract the text while keeping the line breaks but ignoring the formatting (spans, bold, italics, etc.)?
The original HTML (example) is available at irina2013-2gymn.blogspot.ru/2013/12/blog-post_4.html
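One possible approach, sketched here under assumptions rather than offered as the definitive answer: walk the element tree by hand, emit a newline only for <br> and for block-level tags, and let inline tags such as <b>, <u> and <span> pass their text through unchanged. The helper names text_with_breaks and _collect, the BLOCK_TAGS and SKIP_TAGS sets, and the SAMPLE markup are all hypothetical; the real page may need a different list of block tags.

```python
import re
from lxml import html

# Assumption: these tags start and end a line; every other tag is inline.
BLOCK_TAGS = {"p", "div", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}
# Assumption: text inside these tags is never wanted.
SKIP_TAGS = {"script", "style"}
_WS = re.compile(r"\s+")


def _collect(node, parts):
    """Append the text of node to parts, adding newlines at line boundaries."""
    if not isinstance(node.tag, str) or node.tag in SKIP_TAGS:
        return  # comment, processing instruction, or skipped tag
    is_block = node.tag in BLOCK_TAGS
    if is_block or node.tag == "br":
        parts.append("\n")
    if node.text:
        # Whitespace in the HTML source (including newlines) is not a line break.
        parts.append(_WS.sub(" ", node.text))
    for child in node:
        _collect(child, parts)
        if child.tail:  # text that follows the child's closing tag
            parts.append(_WS.sub(" ", child.tail))
    if is_block:
        parts.append("\n")


def text_with_breaks(element):
    """Return the element's text, one output line per <br> or block element."""
    parts = []
    _collect(element, parts)
    lines = (" ".join(line.split()) for line in "".join(parts).split("\n"))
    return "\n".join(line for line in lines if line)


# Hypothetical fragment standing in for one homework DIV (not the real page).
SAMPLE = (
    '<div class="post-body">'
    '<p><span><b>Math:</b> page 12, <u>No. 3</u></span></p>'
    '<p><span>Reading: <i>pages 40-42</i><br>Learn the poem</span></p>'
    '</div>'
)

result = text_with_breaks(html.fromstring(SAMPLE))
print(result)
```

On the real page the same function would be called on each matched DIV, for example text_with_breaks(elementlist[0]). Empty lines are dropped in this sketch; if blank lines between paragraphs matter, the final filter would need to change.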