How to get text from all 'p' tags?


Python

Mike, 2018-10-22 03:36:25

I'm using BeautifulSoup and can't get the text from all the 'p' tags.

from bs4 import BeautifulSoup

soup = BeautifulSoup('<html><div><p>hello world 1</p></div><div><p>hello world 2</p></div></html>', features='lxml')
tags = soup.find('div')
for x in tags.find_all('p'):
    print(x.get_text())

This prints only the first 'hello world 1'. How do I get all of them?
I could of course do the following and get every 'hello world', but that doesn't work for my real script:
for x in soup.find_all('p'):
    print(x.get_text())

2 answer(s)

MrDwayne, 2018-10-22
@google_online

The problem is that you want the text from all the 'p' tags, but soup.find('div') returns only the first 'div' tag.
To get the text from every 'p' tag, you first need to collect all the 'div' tags. Replace tags = soup.find('div') with tags = soup.find_all('div'). This gives a list containing all the 'div' tags; for the sample HTML, tags now holds two 'div' elements.
Next, iterate over that list and, for each 'div', call find_all('p') to get the 'p' tags located inside that particular 'div':

for x in tags:
    texts = x.find_all('p')

Now texts holds the 'p' tags of one particular 'div'; it is rebuilt for each 'div' as we iterate over tags. All that remains is to loop over texts, inside that same loop, and get the text from each 'p' tag:

for text in texts:
    print(text.get_text())
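
To make the difference between find and find_all concrete, here is a small sketch that runs both on the sample HTML from the question. The variable names and the 'html.parser' backend are assumptions made for this illustration (the question uses lxml); the behaviour shown should be the same with either parser for this input.

```python
from bs4 import BeautifulSoup

html = '<html><div><p>hello world 1</p></div><div><p>hello world 2</p></div></html>'
# Assumption: the built-in 'html.parser' is used here to avoid the lxml dependency.
soup = BeautifulSoup(html, 'html.parser')

first_div = soup.find('div')      # a single Tag: only the first <div>
all_divs = soup.find_all('div')   # a list-like ResultSet with every <div>

# Searching inside the first <div> alone can only ever reach its own <p> tags.
texts_from_first = [p.get_text() for p in first_div.find_all('p')]
# Searching inside each <div> in turn reaches the <p> tags of all of them.
texts_from_all = [p.get_text() for div in all_divs for p in div.find_all('p')]

print(texts_from_first)  # ['hello world 1']
print(texts_from_all)    # ['hello world 1', 'hello world 2']
```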

Full code (variables renamed for readability):

from bs4 import BeautifulSoup

soup = BeautifulSoup('<html><div><p>hello world 1</p></div><div><p>hello world 2</p></div></html>', features='lxml')
divs = soup.find_all('div')
for div in divs:
    ps = div.find_all('p')
    for p in ps:
        print(p.get_text())
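
As a possible variation on the nested loops, the same result can be sketched with a CSS descendant selector through select. This is not part of the answer above: paragraph_texts is a hypothetical helper name, and 'html.parser' is an assumed parser choice. One reason it may be worth considering is that when 'div' tags are nested inside each other, the nested find_all loops visit an inner 'p' once per enclosing 'div', whereas select returns each matching 'p' only once.

```python
from bs4 import BeautifulSoup


def paragraph_texts(html):
    """Return the text of every <p> located inside some <div> (hypothetical helper)."""
    soup = BeautifulSoup(html, 'html.parser')  # parser choice is an assumption
    # 'div p' is a CSS descendant selector; each matching <p> appears once.
    return [p.get_text() for p in soup.select('div p')]


flat = '<html><div><p>hello world 1</p></div><div><p>hello world 2</p></div></html>'
nested = '<div><div><p>hello world 1</p></div><p>hello world 2</p></div>'

print(paragraph_texts(flat))    # ['hello world 1', 'hello world 2']
print(paragraph_texts(nested))  # ['hello world 1', 'hello world 2']
```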

lega, 2018-10-22
@lega

>>> import re
>>> html = '<html><div><p>hello world 1</p></div><div><p>hello world 2</p></div> </html>'
>>> re.findall(r'<p>([^<]+)</p>', html)
['hello world 1', 'hello world 2']
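
The pattern above fits the sample exactly, but it may be worth seeing how it behaves on slightly different markup. The sketch below uses a made-up HTML string (an assumption, not taken from the question) in which one 'p' has an attribute and another contains a nested tag. The looser pattern and the tag stripping are only an illustration and remain fragile; a parser-based approach such as find_all('p') with get_text() handles both cases without changes.

```python
import re

# Hypothetical input: a <p> with an attribute and a <p> with nested markup.
html = '<div><p class="a">hello world 1</p></div><div><p>hello <b>world</b> 2</p></div>'

# The strict pattern requires a bare <p> and no inner tags, so nothing matches here.
strict = re.findall(r'<p>([^<]+)</p>', html)

# A looser pattern: allow attributes on <p> and any inner content (non-greedy).
inner = re.findall(r'<p\b[^>]*>(.*?)</p>', html, flags=re.DOTALL)
# Crude removal of any remaining tags inside each match.
loose = [re.sub(r'<[^>]+>', '', chunk) for chunk in inner]

print(strict)  # []
print(loose)   # ['hello world 1', 'hello world 2']
```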