How do I get the contents of a tag with XPath, including its inner tags?
Good afternoon. I need to extract the full content of a tag from an HTML page, including its inner tags.
For example:
<html>
<body>
<div class="post">
text <p> text </p> text <a> text </a>
<span> text </span>
<div class="post">
another text <p> text </p>
</body>
</html>
From this I need to get the first block with its markup intact:
<div class="post">
text <p> text </p> text <a> text </a>
<span> text </span>
This XPath query returns only the text:
(//div[@class="post"])[1]/descendant-or-self::*[not(name()="script")]/text()
text text text text text
Other queries give me element objects mixed with strings, such as
<Element p at 0xb62f939c>
and I don't know how to convert these back to markup:
[<Element div at 0xb648193c>, u'\u0420\u0430\u0431\u043e\u0442\u0430 \u0441 \u0441\u0443\u0431\u0442\u0438\u0442\u0440\u0430\u043c\u0438', <Element p at 0xb62f939c>, ...]
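from bs4 import BeautifulSoup

# html is the page source as a string
soup = BeautifulSoup(html, 'html.parser')
# keep child tags as markup and strip whitespace around bare text nodes
text = [child.strip() if isinstance(child, str) else str(child) for child in soup.find('div', attrs={'class': 'post'})]
text = ''.join(text)
print(text)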
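For comparison, here is a minimal sketch of the same idea using Beautiful Soup's own serializers instead of a hand-written loop. It assumes the bs4 package; SAMPLE is a made-up fragment, not the real page. decode_contents() gives the inner markup of a tag, and str() gives the tag together with its own wrapper.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, shaped like the example in the question.
SAMPLE = '<div class="post">text <p>inner</p> tail</div>'

soup = BeautifulSoup(SAMPLE, 'html.parser')
post = soup.find('div', attrs={'class': 'post'})

# Inner markup only: text nodes and child tags, without the <div> itself.
inner = post.decode_contents()
# Full markup, including the wrapping <div class="post">.
outer = str(post)

print(inner)
print(outer)
```

Unlike the list comprehension, this does not strip whitespace around the text nodes, so pick whichever behaviour suits the task.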
A bit late, of course. :)
from lxml.html import fromstring
string = '''<html>
<body>
<div class="post">
text <p> text </p> text <a> text </a>
<span> text </span>
<div class="post">
another text <p> text </p>
</body>
</html>'''
html = fromstring(string)
# text_content() returns only the text of the element and its descendants, without the tags
post = html.xpath('.//div[@class="post"]')[0].text_content()
print(post)
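To make the difference concrete, here is a small sketch that puts text_content() next to lxml.html.tostring() on the same element. SAMPLE and the helper name text_and_markup are made up for illustration. The point is only that text_content() drops the inner tags, while tostring() keeps them.

```python
from lxml.html import fromstring, tostring

# Hypothetical sample markup, modelled on the example in the question.
SAMPLE = ('<html><body><div class="post">text <p>inner</p> tail '
          '<a>link</a></div></body></html>')


def text_and_markup(source):
    """Return (text only, serialized markup) for the first div.post."""
    doc = fromstring(source)
    post = doc.xpath('.//div[@class="post"]')[0]
    # Text of the element and all descendants, tags removed.
    text_only = post.text_content()
    # The element serialized back to HTML, inner tags included.
    markup = tostring(post, encoding='unicode')
    return text_only, markup


text_only, markup = text_and_markup(SAMPLE)
print(text_only)
print(markup)
```

So if the inner tags are needed, tostring() is probably the call to reach for here rather than text_content().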
from lxml import etree
tree = etree.fromstring('<html><head><title>foo</title></head><body><div class="name"><p>foo</p></div><div class="name"><ul><li>bar</li></ul></div></body></html>')
for elem in tree.xpath("//div[@class='name']"):
    # pretty_print ensures that it is nicely formatted.
    # encoding='unicode' returns text rather than bytes.
    print(etree.tostring(elem, pretty_print=True, encoding='unicode'))
from lxml import etree, html
tree = html.parse('http://rutracker.org/forum/index.php')
for elem in tree.xpath("//div[@class='category']"):
    print(html.tostring(elem, pretty_print=True, encoding='unicode'))
from io import BytesIO
from lxml import html
import requests
# .content is the raw response body as bytes
c = requests.get('http://rutracker.org/forum/index.php').content
tree = html.parse(BytesIO(c))
for elem in tree.xpath("//div[@class='category']"):
    print(html.tostring(elem, pretty_print=True, encoding='unicode'))