How to parse XML of this kind?

Python

Rennicks, 2018-07-30 21:57:40

Good day!
I'm asking for help with reinventing the wheel.
I have an XML file in the format shown below.
I added line breaks and shortened the lines to improve readability. The original contains about 80k lines with 20 attributes each.

<xml>
  <Detail_collection>
    <Detail 
      Полное_и_сокращенное_наименование_организации="Общество с ограниченной ответственностью "РогаИКо" Сокращенно: ООО "РогаИКо"" 
      ИНН_организации="0123456789" 
      КПП_организации="123456789" 
      Адрес__место_нахождения___организации="РОССИЯ,0123456,"Кукуево г,,Затерянная ул,15/7,," 
      Адрес_электронной_почты_организации="[email protected]"  />
    <Detail Полное_и_сокращенное_наименование_организации="..".." />
    <Detail Полное_и_сокращенное_наименование_организации="..".." />
</Detail_collection>
</xml>

I need to extract the attribute values from it and then write them to a database.
I practiced on a simplified version:

<?xml version="1.0" encoding="utf-8" ?> 
<xml>
    <Detail_collection>
        <Detail text1="sometext11" text2="sometext21" text3="sometext31" />
        <Detail text1="sometext12" text2="sometext22" text3="sometext32" />
    </Detail_collection>
</xml>

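import xml.etree.ElementTree as ET  # cElementTree is deprecated and was removed in Python 3.9

from SQL_worker import Write_to_SQL

tree = ET.parse("data.xml")
root = tree.getroot()

# Write the three attributes of every <Detail> element to the database.
for detail in root.findall(".//Detail"):
    a = detail.attrib["text1"]
    b = detail.attrib["text2"]
    c = detail.attrib["text3"]
    Write_to_SQL(a, b, c)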

This code handles the simplified example perfectly.
But parsing the original document fails with an error:
xml.etree.ElementTree.ParseError: not well-formed (invalid token)
The error points to the very first fragment, the one with the "extra" quotes.
I could not find a way to normalize such a file so that it can be parsed afterwards.
There is an XML schema for it, but as far as I understand, a schema is only useful for validation.
At the moment I'm leaning towards parsing with regular expressions, but I'd like to believe there is a more elegant solution.

1 answer(s)

Rsa97, 2018-07-30

Here, IMHO, you have to parse it with regular expressions. Whoever generated this XML forgot to escape the quotes, and perhaps other characters too.
If all the attributes always come in the same order, then in principle it is easy to parse.
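
A rough sketch of what such a regular-expression parser might look like. It is not tied to a fixed attribute order. Instead it assumes that a value ends at a quote followed either by the next name=" pair or by the end of the tag, that "/>" never occurs inside a value, and that no value contains text that itself looks like ` name="`. The names iter_details, DETAIL_RE and ATTR_RE and the ASCII sample data are made up for illustration; in Python 3, \w also matches Cyrillic letters, so the real attribute names should work the same way.

```python
import re

# One self-closing <Detail ... /> element. Assumption: "/>" never appears
# inside an attribute value. \b keeps <Detail_collection> from matching.
DETAIL_RE = re.compile(r"<Detail\b(.*?)/>", re.DOTALL)

# name="value", where the closing quote is the first quote that is followed
# by the next name=" pair or by the end of the tag body.
ATTR_RE = re.compile(r'(\w+)="(.*?)"(?=\s+\w+="|\s*$)', re.DOTALL)


def iter_details(text):
    """Yield one dict (attribute name -> raw value) per <Detail> element."""
    for match in DETAIL_RE.finditer(text):
        yield dict(ATTR_RE.findall(match.group(1)))


# Hypothetical sample with unescaped quotes inside the values.
sample = '''<xml>
  <Detail_collection>
    <Detail
      name="LLC "HornsAndCo" Short: LLC "HornsAndCo""
      inn="0123456789"
      address="RU,0123456,"Kukuevo,,Lost st,15/7,," />
    <Detail name="plain" inn="1" address="x" />
  </Detail_collection>
</xml>'''

rows = list(iter_details(sample))
for row in rows:
    print(row)
```

Each resulting dict can then be handed to the database writer from the question, for example Write_to_SQL(row["text1"], row["text2"], row["text3"]) with the real attribute names substituted. If the file also contains partially escaped entities such as &amp;quot;, the raw values would still need unescaping, which this sketch does not attempt.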