Python
studprogrammist, 2018-03-04 12:18:40

How to extract text between tags?

We have the following code, an XML parser:

import xml.etree.ElementTree as ET

doc = """
<?xml version="1.0" encoding="ANSI" ?>
<data>
     <items>
         <item name="item1">1</item>
         <item name="item2">2</item>
         <item name="item3">3</item>
         <item name="item4">4</item>
     </items>
</data>
.----------------------------------------------------------
"""

tree = ET.fromstring(doc)

print(tree.find('.//item[@name="item1"]').text)
print(tree.find('.//item[@name="item4"]').text)

As you can see, the string does not start with the root tag, and it ends with unrelated content after the closing tag.
Hence the error:
xml.etree.ElementTree.ParseError: XML or text declaration not at start of entity
Please tell me how (and with what tool) to extract all the text between the opening and closing tags.


1 answer(s)
javedimka, 2018-03-04
@studprogrammist

Use lxml with a parser that recovers from errors:

>>> import lxml.etree
>>> doc = """
... <?xml version="1.0" encoding="ANSI" ?>
... <data>
...      <items>
...          <item name="item1">1</item>
...          <item name="item2">2</item>
...          <item name="item3">3</item>
...          <item name="item4">4</item>
...      </items>
... </data>
... .----------------------------------------------------------
... """
>>> parser = lxml.etree.XMLParser(recover=True)
>>> tree = lxml.etree.fromstring(doc, parser)
>>> [element.text for element in tree.iter('item')]
['1', '2', '3', '4']

Without lxml you can do this:
>>> import xml.etree.ElementTree as ET
>>> doc = """
... <?xml version="1.0" encoding="ANSI" ?>
... <data>
...      <items>
...          <item name="item1">1</item>
...          <item name="item2">2</item>
...          <item name="item3">3</item>
...          <item name="item4">4</item>
...      </items>
... </data>
... .----------------------------------------------------------
... """
>>> tree = ET.fromstring(doc.strip('\n-.'))
>>> [element.text for element in tree.iter('item')]
['1', '2', '3', '4']
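
The strip() call above only works because the extra content consists of the characters '\n', '-' and '.'. If the surrounding junk can be anything, one possible approach is to cut the string down to the root element before parsing. This is only a sketch: extract_root is a hypothetical helper (not part of ElementTree), and it assumes the root tag name is known in advance and appears exactly once as the outermost element.

```python
import xml.etree.ElementTree as ET


def extract_root(doc, root_tag):
    """Hypothetical helper: return only the <root_tag>...</root_tag> span of doc."""
    start = doc.find("<" + root_tag)      # first opening tag (assumes no similarly named tag comes earlier)
    end_marker = "</" + root_tag + ">"
    end = doc.rfind(end_marker)           # last closing tag
    if start == -1 or end == -1:
        raise ValueError("root tag %r not found" % root_tag)
    return doc[start:end + len(end_marker)]


doc = """
<?xml version="1.0" encoding="ANSI" ?>
<data>
     <items>
         <item name="item1">1</item>
         <item name="item2">2</item>
         <item name="item3">3</item>
         <item name="item4">4</item>
     </items>
</data>
.----------------------------------------------------------
"""

# Everything before <data> and after </data> is discarded before parsing.
tree = ET.fromstring(extract_root(doc, "data"))
texts = [element.text for element in tree.iter("item")]
print(texts)  # ['1', '2', '3', '4']
```

Note that the slice also drops the XML declaration, so the leading newline in front of it no longer matters.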
