Dmitry Filippov, 2018-03-29 15:54:26
C++ / C#

How to split XML file parsing into multiple threads in C#?

Good afternoon.
I need to parse a huge XML file (1 TB) and load the data into a database.
In a single thread this is very slow; the parser would need about three years to finish :D
So the parsing has to be split across several threads somehow. What are my options?
P.S. I have never worked with multithreading in C#.
Here is the code I am currently using:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml;
using System.Xml.Linq;
using Npgsql;

namespace MapReader
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.Write("> ");

            string path = Console.ReadLine();

            IEnumerable<XElement> root = from el in Root(path) select el;

            Osm2Pg pgosm = new Osm2Pg();

            pgosm.CreateTables();

            foreach (XElement item in root)
            {

                if (item.Name == "way")
                {

                    long wayID = long.Parse(item.Attribute("id").Value);

                    Console.WriteLine("way: " + item.Attribute("id").Value);

                    foreach (XElement nd in item.Elements("nd"))
                    {

                        long nodeReference = long.Parse(nd.Attribute("ref").Value);
                        pgosm.InsertWayNds(wayID, nodeReference);

                        Console.WriteLine("--nd: " + nd.Attribute("ref").Value);
                    }

                    foreach (XElement tag in item.Elements("tag"))
                    {

                        string key = tag.Attribute("k").Value;
                        string value = tag.Attribute("v").Value;
                        pgosm.InsertWayTags(wayID, key, value);

                        Console.WriteLine("--tag: " + tag.Attribute("k").Value);
                    }

                }


                // Process node elements.
                if (item.Name == "node")
                {
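                    // Convert the coordinates from the geographic system to Cartesian.
                    // Parse with the invariant culture so "55.75" is read correctly
                    // even when the machine's locale uses a decimal comma.
                    var inv = System.Globalization.CultureInfo.InvariantCulture;
                    double lon = double.Parse(item.Attribute("lon").Value, inv);
                    double lat = double.Parse(item.Attribute("lat").Value, inv);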

                    float x = (float)GeoHelper.lonToX(lon);
                    float z = (float)GeoHelper.latToY(lat);


                    long nodeId = long.Parse(item.Attribute("id").Value);
                    pgosm.InsertNodes(nodeId, x, z);
                    
                    Console.WriteLine("node: " + x + "," + z);

                    if (item.HasAttributes)
                    {
                        foreach (XElement tag in item.Elements("tag"))
                        {

                            string key = tag.Attribute("k").Value;
                            string value = tag.Attribute("v").Value;
                            pgosm.InsertNodeTags(nodeId, key, value);

                            Console.WriteLine("--tag: " + tag.Attribute("k").Value);
                        }
                    }
                }
            }

            Console.WriteLine("End of program...");
            Console.Read();

        }

        // Magic: streams way/node elements one at a time instead of loading the whole file.
        static IEnumerable<XElement> Root(string path)
        {
            using (XmlReader reader = XmlReader.Create(path))
            {
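                // ReadFrom already advances past the element it consumed, so Read()
                // is called only otherwise; an unconditional Read() would skip
                // every second element when there is no whitespace between them.
                while (!reader.EOF)
                {
                    if (reader.NodeType == XmlNodeType.Element &&
                        (reader.Name == "way" || reader.Name == "node"))
                    {
                        XElement el = XNode.ReadFrom(reader) as XElement;
                        if (el != null)
                            yield return el;
                    }
                    else
                    {
                        reader.Read();
                    }
                }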
            }

        }

    }
}


2 answer(s)
Roman Mirilaczvili, 2018-03-29
@2ord

In addition to what cicatrix said, replace the foreach loops using the Task-based Asynchronous Programming library.
This eliminates the need for

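The answer does not name the replacement for foreach. One plausible candidate from the Task Parallel Library is Parallel.ForEach, so here is a minimal sketch under that assumption. ParallelForEachSketch, Run, and handle are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

static class ParallelForEachSketch
{
    // Calls `handle` for every item of `source`, using at most `maxThreads`
    // concurrent workers, and returns how many items were handled.
    public static long Run<T>(IEnumerable<T> source, int maxThreads, Action<T> handle)
    {
        long processed = 0;
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxThreads };

        // The library enumerates `source` itself; the loop bodies run
        // concurrently on thread-pool threads, in no guaranteed order.
        Parallel.ForEach(source, options, item =>
        {
            handle(item);
            Interlocked.Increment(ref processed);
        });

        return Interlocked.Read(ref processed);
    }
}
```

With the code from the question this could be called as ParallelForEachSketch.Run(Root(path), 4, item => { ... }), where the lambda holds the body of the original foreach. Anything the body touches must then be safe to use from several threads at once. Npgsql connections are not meant to be shared between threads, so if Osm2Pg wraps a single connection (its implementation is not shown), each worker would presumably need its own instance.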

cicatrix, 2018-03-29
@cicatrix

I see you have some kind of custom XML reader. A lot depends on its implementation, specifically on whether it is thread-safe.
In principle, your main foreach can be parallelized using the producer-consumer pattern.
You will have one producer: your reader, which supplies references to the individual nodes of your file. The consumers are threads that each grab the next available node and parse it. Keep in mind from the start that the nodes will therefore be processed out of order.
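The producer-consumer arrangement described above can be sketched roughly as follows. This is an illustration rather than a drop-in solution: ProducerConsumerSketch, Run, handle, and the queue capacity of 10000 are assumptions, and BlockingCollection is just one standard .NET way to build the queue.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

static class ProducerConsumerSketch
{
    // One producer (the calling thread, enumerating `source`) feeds a bounded
    // queue; `consumerCount` consumers call `handle` on whatever comes next.
    public static void Run<T>(IEnumerable<T> source, int consumerCount, Action<T> handle)
    {
        // Bounded capacity makes the producer wait when consumers fall behind,
        // so a huge input cannot pile up in memory.
        using (var queue = new BlockingCollection<T>(boundedCapacity: 10000))
        {
            var consumers = new Task[consumerCount];
            for (int i = 0; i < consumerCount; i++)
            {
                consumers[i] = Task.Factory.StartNew(() =>
                {
                    // Each consumer takes the first item available,
                    // so the overall processing order is not preserved.
                    foreach (T item in queue.GetConsumingEnumerable())
                        handle(item);
                }, TaskCreationOptions.LongRunning);
            }

            try
            {
                // Only this thread touches the source (the XML reader).
                foreach (T item in source)
                    queue.Add(item);
            }
            finally
            {
                // Signals the consumers that no more items will arrive.
                queue.CompleteAdding();
            }

            Task.WaitAll(consumers);
        }
    }
}
```

With the question's code, the call would look like ProducerConsumerSketch.Run(Root(path), 4, item => { ... }). Because only the producer thread reads from the XmlReader, the reader itself does not have to be thread-safe; the database code inside the lambda does. Error handling is left out: if every consumer fails, the producer would block on a full queue.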
Choose the number of threads to suit the specific hardware, or (if you want to go to the trouble) make it adaptive: start with 4 threads and measure the average processing speed (nodes per minute), then add one thread per minute and check whether the processing time went up or down. If it went up, return to the previous thread count; if it went down, add another thread, and repeat until you find the best setting.
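The tuning loop from the last paragraph might look something like this. It is a sketch only: ThreadCountTuner, FindBest, the measure callback, and the upper limit of 64 are hypothetical, and real measurements are noisy enough that a single sample per thread count may mislead. The code compares throughput (higher is better), which is the same test as "did the time go down".

```csharp
using System;

static class ThreadCountTuner
{
    // `measure(n)` is a hypothetical callback: run with n threads for a fixed
    // interval (the answer suggests one minute) and return nodes processed.
    public static int FindBest(Func<int, double> measure, int start = 4, int max = 64)
    {
        int best = start;
        double bestRate = measure(best);

        for (int n = start + 1; n <= max; n++)
        {
            double rate = measure(n);
            if (rate <= bestRate)
                break;              // no gain: keep the previous thread count

            best = n;               // faster: accept and try one more thread
            bestRate = rate;
        }

        return best;
    }
}
```

For example, with a made-up measure function that improves up to 8 threads and degrades afterwards, ThreadCountTuner.FindBest(n => n <= 8 ? n * 100.0 : 800.0 - n) returns 8.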
