Artem [email protected], 2018-08-30 10:19:42
Python

How to quickly parse XML and write to the database?

How can I speed up the parsing itself and the subsequent inserts into the database?
At the moment I use the algorithm below, but with large files (and when there are many files) reading the XML and doing the inserts takes a long time.
Is there a faster way to parse the XML, and a way to insert not one record at a time but a whole group at once?
For example, first write everything that is in ZAP to client_table, then everything in DATA to usl_table.

import xml.etree.ElementTree as ET  # cElementTree is deprecated; ElementTree uses the C accelerator automatically

tree = ET.parse(filexml)
element_xml_root = tree.getroot()
dbcur = db.cursor()

# Build the statements once, outside the loop
query_pac = ('INSERT INTO client_table (idclient, fam, im, ot)'
             ' VALUES (:idclient, :fam, :im, :ot)')
query_usl = ('INSERT INTO usl_table (idpac, code_usl, date_usl, price_usl)'
             ' VALUES (:idpac, :code_usl, :date_usl, :price_usl)')

for elem in element_xml_root.findall('ZAP'):
    idclient = elem.find('ID_CLIENT').text
    fam = elem.find('FAM').text
    im = elem.find('IM').text
    ot = elem.find('OT').text
    dbcur.execute(query_pac, (idclient, fam, im, ot))
    for data in elem.findall('DATA'):
        code_usl = data.find('CODE').text
        date_usl = data.find('DATE').text
        price_usl = data.find('PRICE').text
        dbcur.execute(query_usl, (idclient, code_usl, date_usl, price_usl))

db.commit()
dbcur.close()


1 answer
Sergey Pankov, 2018-08-30
@file2

Profile your function in detail and you will see what slows it down the most.
There are three likely problems:
1) Slow XML parsing. Its time depends on the size of the file, and you only start inserting into the database after parsing is complete. On top of that, you hold the entire file in memory as an object tree, which can be very inefficient.
2) Slow search for the needed elements in the tree. It is doubtful that this step slows things down significantly compared to the others.
3) Slow insertion because each record is a separate statement.
1, 2) The first problem can be solved by streaming the XML through a SAX parser. Handlers are attached to the closing of specific tags, and the parser object accumulates data in its state. This lets you receive data while the file is still being read and parsed, rather than only afterwards. The second problem, by the way, is also solved by a SAX parser: there is simply no extra overhead for traversing a tree, because no tree is built.
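The streaming approach described above can be sketched with the standard xml.sax module. The tag names ZAP, ID_CLIENT, FAM, IM, OT come from the question; the inline sample document and the records list are invented for illustration (in real code you would buffer rows for a batch insert instead):

```python
import xml.sax

class ZapHandler(xml.sax.ContentHandler):
    """Accumulates field text and flushes a record each time </ZAP> closes."""
    def __init__(self):
        super().__init__()
        self.record = {}
        self.text = []
        self.records = []  # illustration only; in real code, buffer for a batch insert

    def startElement(self, name, attrs):
        self.text = []  # reset the character buffer for the new element

    def characters(self, content):
        # May be called several times per element, so collect the pieces
        self.text.append(content)

    def endElement(self, name):
        if name in ('ID_CLIENT', 'FAM', 'IM', 'OT'):
            self.record[name] = ''.join(self.text).strip()
        elif name == 'ZAP':
            self.records.append(self.record)  # record is complete at the closing tag
            self.record = {}

handler = ZapHandler()
xml.sax.parseString(
    b"<ROOT><ZAP><ID_CLIENT>1</ID_CLIENT><FAM>Ivanov</FAM>"
    b"<IM>Ivan</IM><OT>Ivanovich</OT></ZAP></ROOT>",
    handler)
```

With xml.sax.parse(open(filexml, 'rb'), handler) the file is read incrementally, so memory use stays flat no matter how large the file is.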
3) You can try cursor.executescript for batch execution of queries. The downside is that you have to validate and escape the data by hand, and do it correctly: you need to beware of SQL injection.
It's better to use cursor.executemany. Here, for example, is a discussion of it on Stack Overflow: https://stackoverflow.com/questions/18244565/how-c...
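A minimal sketch of the executemany approach: buffer the parsed rows, then insert the whole group in one call with the parameters bound by the driver, so no manual escaping is needed. The question does not name its database driver, so sqlite3 is used here purely for illustration; the table and column names are taken from the question:

```python
import sqlite3

# Stand-in database; the real code would use the question's own connection.
db = sqlite3.connect(':memory:')
dbcur = db.cursor()
dbcur.execute('CREATE TABLE client_table (idclient TEXT, fam TEXT, im TEXT, ot TEXT)')

# Rows collected during parsing (sample data), inserted as one batch
# instead of one execute() per record.
rows = [
    ('1', 'Ivanov', 'Ivan', 'Ivanovich'),
    ('2', 'Petrov', 'Petr', 'Petrovich'),
]
dbcur.executemany(
    'INSERT INTO client_table (idclient, fam, im, ot) VALUES (?, ?, ?, ?)',
    rows)
db.commit()
```

Because the values travel as bound parameters rather than being spliced into the SQL text, this also sidesteps the injection risk mentioned above.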
