Python
ctocopok, 2013-12-29 14:17:55

Parsing HTML and extracting unformatted text from it. Python + lxml

I'm trying to parse a blog that publishes homework for school students. The HTML page contains DIVs that can be uniquely identified via CSS and that hold the homework text, unfortunately with decorative formatting mixed in.
If I take the element's text with element.text_content(), I get everything in one run with no markup at all, so the homework ends up on a single line, as mush.
If I go through XPath instead, for example spans = elementlist[0].xpath("*/span//text()"), then every bit of formatting, whether <b>, <u>, <p>, etc., becomes a separate item, and printing the items line by line gives an ugly column of values in which it is impossible to tell where the real line breaks were.
Question: how do I extract the text while keeping the line breaks but ignoring the formatting (spans, bold, italics, etc.)?
The original HTML (an example) is available at irina2013-2gymn.blogspot.ru/2013/12/blog-post_4.html


1 answer
maxaon, 2013-12-29
@maxaon

Convert the required element back to HTML and replace '<br>' and '</p>' with newlines. The remaining tags can then simply be removed with re, using a non-greedy match.
PS: Grab works well for web scraping.
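
A rough sketch of that idea, in case it helps. The function names html_fragment_to_text and element_to_text and the sample markup are made up for illustration, and the regular expressions are only a guess at what is enough for simple blog markup (they do not handle comments, scripts, or a '>' inside attribute values). The lxml call is lxml.html.tostring with encoding="unicode", which returns a str.

```python
import re
from html import unescape


def html_fragment_to_text(fragment: str) -> str:
    """Turn an HTML fragment into plain text, keeping only line breaks."""
    # <br>, <br/>, <br /> and closing </p> tags become newlines.
    text = re.sub(r"(?i)<br\b[^>]*>|</p\s*>", "\n", fragment)
    # Every remaining tag is dropped; .*? is the non-greedy match.
    text = re.sub(r"(?s)<.*?>", "", text)
    # Decode entities such as &amp; or &nbsp; left in the serialized markup.
    text = unescape(text)
    # Trim each line and drop the empty ones.
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


def element_to_text(element) -> str:
    """Serialize an lxml element back to HTML, then flatten it."""
    from lxml import html as lxml_html  # imported lazily; requires lxml

    # with_tail=False leaves out text that follows the element's closing tag.
    markup = lxml_html.tostring(element, encoding="unicode", with_tail=False)
    return html_fragment_to_text(markup)


if __name__ == "__main__":
    # Hypothetical markup, loosely modelled on a decorated homework post.
    sample = (
        "<div><p><b>Math</b>: p. 12, <u>ex. 3</u></p>"
        "<span>Reading &amp; writing<br>Learn the poem</span></div>"
    )
    print(html_fragment_to_text(sample))
```

With the elements from the question, the call would presumably be element_to_text(elementlist[0]). Other block-level closing tags such as '</div>' or '</li>' can be added to the first pattern if they should also end a line.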
