How do I get the contents of a tag with XPath, including its inner tags?


Python

Daniel Reed, 2015-04-22 12:45:53

Good afternoon. I need to extract the full contents of a tag from an HTML page, including its inner tags.
For example:

<html>
 <body>
  <div class="post">
   text <p> text </p> text <a> text </a>
   <span> text </span>
  </div>
  <div class="post">
   another text <p> text </p>
  </div>
 </body>
</html>

And I want to get the contents of the first <div class="post">:

text <p> text </p> text <a> text </a>
   <span> text </span>

So far I can only get the text, using this expression (it also skips script tags):

(//div[@class="post"])[1]/descendant-or-self::*[not(name()="script")]/text()

Result: text text text text text
If I use node() instead, each tag comes back as an object such as <Element p at 0xb62f939c>, and I don't know how to turn the whole result back into an HTML string. It returns something like this:

[<Element div at 0xb648193c>, u'\u0420\u0430\u0431\u043e\u0442\u0430 \u0441 \u0441\u0443\u0431\u0442\u0438\u0442\u0440\u0430\u043c\u0438', <Element p at 0xb62f939c>, ...]

I could use BeautifulSoup, but I'm still hoping for an XPath solution. Please help.

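from bs4 import BeautifulSoup

# html holds the page source as a string
soup = BeautifulSoup(html, 'html.parser')
# Keep text nodes as stripped strings and serialize child tags back to HTML
text = [child.strip() if isinstance(child, str) else str(child) for child in soup.find('div', attrs={'class': 'post'})]
text = ''.join(text)
print(text)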


3 answer(s)

belanchuk, 2015-12-21
@belanchuk

A late answer, of course. :)

from lxml.html import fromstring
string = '''<html>
 <body>
  <div class="post">
   text <p> text </p> text <a> text </a>
   <span> text </span>
  </div>
  <div class="post">
   another text <p> text </p>
  </div>
 </body>
</html>'''
html = fromstring(string)
# text_content() returns only the text of the element; inner tags are dropped
post = html.xpath('.//div[@class="post"]')[0].text_content()
print(post)

Igor Lyutoev, 2015-04-22
@loader777

Doesn't /html() work?

sim3x, 2015-04-22
@sim3x

from lxml import etree

tree = etree.fromstring('<html><head><title>foo</title></head><body><div class="name"><p>foo</p></div><div class="name"><ul><li>bar</li></ul></div></body></html>')
for elem in tree.xpath("//div[@class='name']"):
    # pretty_print indents the output; encoding='unicode' returns text rather than bytes.
    print(etree.tostring(elem, pretty_print=True, encoding='unicode'))
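
The question also asks to ignore script tags. A possible sketch, assuming Python 3 and lxml: remove the script elements from a copy of the node with etree.strip_elements before serializing it. PAGE is a made-up fragment used only for illustration.

```python
import copy

from lxml import etree, html

# Hypothetical fragment with a <script> inside the post.
PAGE = '<div class="post">text <script>var x = 1;</script><p>more</p> tail</div>'

post = html.fromstring(PAGE)

# Work on a copy so the parsed tree itself is left untouched.
cleaned = copy.deepcopy(post)
# with_tail=False keeps any text that follows a removed <script> element.
etree.strip_elements(cleaned, "script", with_tail=False)

result = html.tostring(cleaned, encoding="unicode")
print(result)
```

Copying first is a design choice: strip_elements modifies the tree in place, which matters if the same tree is queried again later.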

from lxml import html

tree = html.parse('http://rutracker.org/forum/index.php')
for elem in tree.xpath("//div[@class='category']"):
    print(html.tostring(elem, pretty_print=True, encoding='unicode'))

"Your Galya is spoiled" (c)

from io import BytesIO
from lxml import html
import requests

c = requests.get('http://rutracker.org/forum/index.php').content

# .content is bytes, so wrap it in BytesIO for html.parse
tree = html.parse(BytesIO(c))

for elem in tree.xpath("//div[@class='category']"):
    print(html.tostring(elem, pretty_print=True, encoding='unicode'))