How to parse the text in a div while ignoring nested tags with BeautifulSoup?

Python

Mist8, 2015-03-24 23:24:43

How can I parse only part of a div like this one:

<div class="example">
<p>bla-bla-bla</p>
<div>something not important</div>
<strong>SomeText</strong>
<br>
Wanted text
<span style="color:red">Also wanted text</span>
Wanted text
</div>

The text I need sits directly inside the div, either as bare text or wrapped in a tag. The problem is that the same div also starts with a nested div that should not be parsed. How do I cut off the unwanted part (the nested div)?

2 answers

Mist8, 2015-03-25
@Mist8

One way to remove the unwanted elements:

from bs4 import BeautifulSoup

html_doc = """
<div class="example">
<p>bla-bla-bla</p>
<div>something not important</div>
<strong>SomeText</strong>
<br>
Wanted text
<span style="color:red">Also wanted text</span>
Wanted text
</div>
"""
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html_doc, "html.parser")
tag = soup.find("div", class_="example")

tag.div.decompose()  # remove the nested div
tag.p.decompose()    # remove the <p> tag and its text
tag.br.decompose()   # remove the <br> line break
print(tag)
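
If you would rather not modify the tree, a possible alternative is to walk the direct children of the div and keep only the bare strings and the tags you care about. This is only a sketch: the `SKIP` set and the `parts` list are names made up for illustration, and it assumes, like the code above, that `<strong>` should be kept.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

html_doc = """
<div class="example">
<p>bla-bla-bla</p>
<div>something not important</div>
<strong>SomeText</strong>
<br>
Wanted text
<span style="color:red">Also wanted text</span>
Wanted text
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
tag = soup.find("div", class_="example")

# Hypothetical set of direct child tags to ignore (an assumption, adjust as needed)
SKIP = {"div", "p", "br"}

parts = []
for child in tag.children:
    if isinstance(child, NavigableString):
        # Bare text that sits directly inside the div
        text = child.strip()
    elif isinstance(child, Tag) and child.name not in SKIP:
        # Text of a nested tag we want to keep, such as <span>
        text = child.get_text(strip=True)
    else:
        text = ""
    if text:
        parts.append(text)

print(parts)
```

The original `tag` stays untouched here, so the same soup can still be used for other lookups afterwards.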

IRIP, 2018-03-15
@IRIP

How do you destroy the nested elements before reading the text?
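
One possible way to do that, as a rough sketch: call `decompose()` on every unwanted direct child and only then read the text with `get_text()`. The helper name `text_without` and its parameters are hypothetical, not part of BeautifulSoup, and the sketch assumes the unwanted tags are direct children of the container.

```python
from bs4 import BeautifulSoup


def text_without(html, container_class, unwanted=("div", "p", "br")):
    """Return the container's text after removing unwanted direct child tags.

    Hypothetical helper for illustration; the names are assumptions.
    """
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_=container_class)
    for name in unwanted:
        # recursive=False limits the search to direct children of the container
        for nested in container.find_all(name, recursive=False):
            nested.decompose()  # destroy the element together with its contents
    # Join the remaining strings with a space and drop surrounding whitespace
    return container.get_text(" ", strip=True)


html_doc = """
<div class="example">
<p>bla-bla-bla</p>
<div>something not important</div>
<strong>SomeText</strong>
<br>
Wanted text
<span style="color:red">Also wanted text</span>
Wanted text
</div>
"""

result = text_without(html_doc, "example")
print(result)
```

If `<strong>` should go too, adding `"strong"` to `unwanted` would be enough under the same assumptions.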