How do I use BS4 to extract the text of tags nested inside a found tag?


Python

Safronov_Alexei, 2020-12-27 20:26:39

Hi Habr!

<LTTextBoxHorizontal y0="677.457" y1="698.418" x0="47.208" x1="86.76" width="39.552" height="20.962" bbox="[47.208, 677.457, 86.76, 698.418]" index="3">
<LTTextLineHorizontal y0="688.668" y1="698.418" x0="53.168" x1="86.76" width="33.592" height="9.75" bbox="[53.168, 688.668, 86.76, 698.418]" word_margin="0.1">Patient </LTTextLineHorizontal>
<LTTextLineHorizontal y0="677.457" y1="687.207" x0="47.208" x1="86.76" width="39.552" height="9.75" bbox="[47.208, 677.457, 86.76, 687.207]" word_margin="0.1">Jonson </LTTextLineHorizontal>
</LTTextBoxHorizontal>

I have XML like the snippet above. I want to find the LTTextBoxHorizontal tag that contains a line with the text "Patient", and then print the text of the remaining LTTextLineHorizontal tags inside that same LTTextBoxHorizontal tag.

I'm using bs4. How can I do this?


2 answer(s)


soremix, 2020-12-27
@SoreMix

It works just like with regular HTML; see the documentation:

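from bs4 import BeautifulSoup

# xml holds the markup from the question; the 'xml' parser requires lxml
soup = BeautifulSoup(xml, 'xml')
box = soup.find('LTTextBoxHorizontal')
# Print the text of every line inside the box
for line in box.find_all('LTTextLineHorizontal'):
    print(line.text)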
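
The question also asks to pick only the box that has a "Patient" line and to print the other lines in it. A minimal sketch of that, not a definitive solution: the helper name `lines_after_marker` is made up for illustration, the attributes are trimmed, and the second box is invented sample data so the filtering is visible. It uses the built-in 'html.parser' so that lxml is not needed; that parser lowercases tag names, so the lookups are lowercase. With the 'xml' parser the original-case names would be used instead.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question; the second box is
# invented sample data, added only to show that it gets skipped.
xml = '''<LTTextBoxHorizontal index="3">
<LTTextLineHorizontal>Patient </LTTextLineHorizontal>
<LTTextLineHorizontal>Jonson </LTTextLineHorizontal>
</LTTextBoxHorizontal>
<LTTextBoxHorizontal index="4">
<LTTextLineHorizontal>Date </LTTextLineHorizontal>
<LTTextLineHorizontal>2020-12-27 </LTTextLineHorizontal>
</LTTextBoxHorizontal>'''


def lines_after_marker(markup, marker='Patient'):
    """Return the other line texts from every box that has a `marker` line."""
    # html.parser lowercases tag names, so search with lowercase names.
    soup = BeautifulSoup(markup, 'html.parser')
    result = []
    for box in soup.find_all('lttextboxhorizontal'):
        # strip=True drops the trailing space that each line carries
        texts = [line.get_text(strip=True)
                 for line in box.find_all('lttextlinehorizontal')]
        if marker in texts:
            result.extend(t for t in texts if t != marker)
    return result


print(lines_after_marker(xml))  # ['Jonson']
```

Comparing the stripped text exactly (rather than with a substring check) keeps a line such as "Patient ID" from matching; loosen that if your files need it.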


devdb, 2020-12-28
@devdb

html = '''<LTTextBoxHorizontal y0="677.457" y1="698.418" x0="47.208" x1="86.76" width="39.552" height="20.962" bbox="[47.208, 677.457, 86.76, 698.418]" index="3">
<LTTextLineHorizontal y0="688.668" y1="698.418" x0="53.168" x1="86.76" width="33.592" height="9.75" bbox="[53.168, 688.668, 86.76, 698.418]" word_margin="0.1">Patient </LTTextLineHorizontal>
<LTTextLineHorizontal y0="677.457" y1="687.207" x0="47.208" x1="86.76" width="39.552" height="9.75" bbox="[47.208, 677.457, 86.76, 687.207]" word_margin="0.1">Jonson </LTTextLineHorizontal>
</LTTextBoxHorizontal>'''

import bs4

# get_text() returns the text of every tag in the document, concatenated
text = bs4.BeautifulSoup(html, 'html.parser').get_text()


>>> print(text)

Patient
Jonson

>>> text
'\nPatient \nJonson \n'

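A regex variant, which is faster and lighter on resources, although it only strips the tags (it does not decode entities or check that the markup is well-formed):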

import re
text = re.sub(r'<[^>]+>', '', html)
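
If you stay with regular expressions but want only the lines from the box that contains "Patient", something along these lines might work. This is a rough sketch under assumptions: the names `BOX_RE`, `LINE_RE` and `lines_after_marker_re` are invented here, the attributes are trimmed, and it relies on the boxes not being nested and on no attribute value containing a `>` character.

```python
import re
from html import unescape

# Trimmed copy of the markup from the question
xml = '''<LTTextBoxHorizontal index="3" bbox="[47.208, 677.457, 86.76, 698.418]">
<LTTextLineHorizontal word_margin="0.1">Patient </LTTextLineHorizontal>
<LTTextLineHorizontal word_margin="0.1">Jonson </LTTextLineHorizontal>
</LTTextBoxHorizontal>'''

# Non-greedy bodies so that each match stops at the nearest closing tag
BOX_RE = re.compile(
    r'<LTTextBoxHorizontal\b[^>]*>(.*?)</LTTextBoxHorizontal>', re.DOTALL)
LINE_RE = re.compile(
    r'<LTTextLineHorizontal\b[^>]*>(.*?)</LTTextLineHorizontal>', re.DOTALL)


def lines_after_marker_re(markup, marker='Patient'):
    """Return the other line texts from every box that has a `marker` line."""
    result = []
    for box_body in BOX_RE.findall(markup):
        # unescape() turns entities such as &amp; back into characters
        texts = [unescape(t).strip() for t in LINE_RE.findall(box_body)]
        if marker in texts:
            result.extend(t for t in texts if t != marker)
    return result


print(lines_after_marker_re(xml))  # ['Jonson']
```

For anything less regular than this output, a real parser is the safer choice.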