Python
Mike, 2018-10-22 03:36:25

How to get text from all 'p' tags?

I'm using BeautifulSoup and can't get the text from all the 'p' tags.

from bs4 import BeautifulSoup

soup = BeautifulSoup('<html><div><p>hello world 1</p></div><div><p>hello world 2</p></div></html>', features='lxml')
tags = soup.find('div')
for x in tags.find_all('p'):
    print(x.get_text())

This prints only the first 'hello world 1'. How do I get all the hello worlds?
Of course, I can do the following and get all of them, but that doesn't work for my real script:
for x in soup.find_all('p'):
    print(x.get_text())


2 answer(s)
MrDwayne, 2018-10-22
@google_online

The whole problem is that you want the text from every 'p' tag, but you only extract a single 'div' tag: soup.find('div') returns just the first match.
To get the text from all the 'p' tags, you first need to collect all the 'div' tags. Replace tags = soup.find('div') with tags = soup.find_all('div'). This gives a list containing every 'div' tag, stored in the tags variable.
Next, iterate over that list. For each 'div' element, collect a new list of the 'p' tags located inside that particular 'div':

for x in tags:
    texts = x.find_all('p')

Inside that loop, texts holds the 'p' tags of the current 'div', and it is rebuilt for each 'div' as we iterate over tags. All that remains is to iterate over texts, still within the loop over tags, and get the text from each 'p' tag:
    for text in texts:
        print(text.get_text())

Full code (variables renamed for better readability):
from bs4 import BeautifulSoup

soup = BeautifulSoup('<html><div><p>hello world 1</p></div><div><p>hello world 2</p></div></html>', features='lxml')
divs = soup.find_all('div')
for div in divs:
    ps = div.find_all('p')
    for p in ps:
        print(p.get_text())
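
If the nested loops feel heavy, a CSS selector may express the same idea in one step. This is only a sketch, not part of the original answer: it assumes a bs4 version with select() support, and it uses the built-in 'html.parser' instead of 'lxml' so that nothing extra needs to be installed. The selector 'div p' matches every 'p' that sits anywhere inside a 'div'.

```python
from bs4 import BeautifulSoup

html = (
    "<html><div><p>hello world 1</p></div>"
    "<div><p>hello world 2</p></div></html>"
)

# Assumption: 'html.parser' is used here for portability; the thread uses 'lxml'.
soup = BeautifulSoup(html, features="html.parser")

# 'div p' is a descendant selector: every <p> nested inside any <div>.
texts = [p.get_text() for p in soup.select("div p")]
print(texts)
```

On a real page the selector would likely need to be narrowed, for example to a particular class on the 'div', so that unrelated 'p' tags are not picked up.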

lega, 2018-10-22
@lega

>>> import re
>>> html = '<html><div><p>hello world 1</p></div><div><p>hello world 2</p></div> </html>'
>>> re.findall(r'<p>([^<]+)</p>', html)
['hello world 1', 'hello world 2']
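
The pattern above only matches a bare <p> with no attributes and no line breaks in a position where [^<]+ would stop. A slightly more tolerant variant might look like the sketch below. The sample HTML (with a class attribute and a line break) and the P_TAG name are made up for illustration, and a regex will still return raw inner markup if a 'p' contains nested tags, so a real parser is usually the safer choice.

```python
import re

# Illustrative input, not from the question: one <p> has an attribute,
# the other contains a line break.
html = (
    '<html><div><p class="a">hello world 1</p></div>'
    '<div><p>hello\nworld 2</p></div></html>'
)

# <p\b[^>]*> allows attributes on the opening tag (and will not match <pre>);
# (.*?) is non-greedy, so each match stops at the nearest </p>;
# DOTALL lets '.' match newlines, IGNORECASE accepts <P> as well.
P_TAG = re.compile(r"<p\b[^>]*>(.*?)</p>", re.IGNORECASE | re.DOTALL)

texts = P_TAG.findall(html)
print(texts)
```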
