How to read data from a huge XML file?
I need to read data from a huge XML file, say 50 GB in size. The file is a root element containing a collection of similar nodes (same name, almost always the same set of attributes and nested nodes). The structure of each node is not known in advance: some fields must be read as-is, others are converted or calculated on the fly, or after all the other fields have been read. Before parsing, the structure of the entities in the file is unknown. Everything goes into a database, and the list of tables and columns can change over the lifetime of the application. There are rules by which the XML is parsed, for example:
- For PrimaryKey, compute a hash of the value, if there is one, and put it in the KeyHash cell.
- For each XXXId field of a table, look at the parent and child nodes named XXX and take the value from there.

Writing each record to the database immediately via SqlBulkCopy is not an option, so the current code reads with while(reader.Read()) { ... }
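The node-by-node streaming read described above can be sketched with Python's `xml.etree.ElementTree.iterparse` (the asker's stack is .NET `XmlReader`, but the pull-parse-and-discard pattern is the same). The node name `record` and the `PrimaryKey`/`KeyHash` hashing rule below are illustrative assumptions, not the asker's actual schema:

```python
import hashlib
import xml.etree.ElementTree as ET
from io import BytesIO

def stream_rows(source, node="record"):
    """Yield one dict per record node without loading the whole document."""
    # iterparse fires an event per element; only one record is
    # materialized at a time, so memory stays flat even for a
    # multi-gigabyte file.
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != node:
            continue
        row = {child.tag: child.text for child in elem}
        # Illustrative rule: hash the primary key into a KeyHash column.
        pk = row.get("PrimaryKey")
        if pk is not None:
            row["KeyHash"] = hashlib.sha256(pk.encode("utf-8")).hexdigest()
        yield row
        elem.clear()  # discard the finished node to keep memory flat

# Tiny in-memory document standing in for the 50 GB file.
doc = (b"<root>"
       b"<record><PrimaryKey>42</PrimaryKey><Name>a</Name></record>"
       b"<record><PrimaryKey>43</PrimaryKey><Name>b</Name></record>"
       b"</root>")
rows = list(stream_rows(BytesIO(doc)))
```

The key point is the `elem.clear()` call after each record: without it, the tree accumulates every parsed node and memory grows with file size.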
All parsed data is stored in memory in arrays and then dumped into the database once enough has accumulated, via SqlBulkCopy.WriteToServer(IDataReader). The problem is that parsing even a couple of gigabytes takes a very long time. There is a page describing a parser with impressive processing figures, but I can't find anything about this there either.
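The batch-and-flush approach described (accumulate rows in memory, bulk-write when a threshold is reached) can be sketched as follows. This is a minimal illustration, not the asker's code: `sqlite3` stands in for SQL Server and `SqlBulkCopy`, and the `items` table and its columns are made up:

```python
import sqlite3

def bulk_load(rows, conn, batch_size=1000):
    """Buffer parsed rows and flush them to the database in batches,
    the same shape as collecting arrays and calling
    SqlBulkCopy.WriteToServer once enough data has accumulated."""
    conn.execute("CREATE TABLE IF NOT EXISTS items (pk TEXT, name TEXT)")
    buffer = []
    for row in rows:
        buffer.append((row["pk"], row["name"]))
        if len(buffer) >= batch_size:
            conn.executemany("INSERT INTO items VALUES (?, ?)", buffer)
            buffer.clear()
    if buffer:  # flush the final, partially filled batch
        conn.executemany("INSERT INTO items VALUES (?, ?)", buffer)
    conn.commit()

conn = sqlite3.connect(":memory:")
bulk_load([{"pk": "1", "name": "a"},
           {"pk": "2", "name": "b"},
           {"pk": "3", "name": "c"}], conn, batch_size=2)
count = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
```

Batching amortizes per-round-trip overhead; the batch size is a tuning knob, and the final partial batch must be flushed explicitly or those rows are silently lost.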