What is the best way to insert a huge amount of data into the database?
There is a problem with inserting a huge amount of data into the database, namely:
1. Data is streamed from XML files of varying sizes (from a few megabytes to 50+ gigabytes) and parsed according to certain rules.
2. Each file is processed in the main thread (if the files are small, several are processed in parallel).
3. After a file is parsed, an SQL script is built by string concatenation, of the form insert into schema.table values ..., with 1000 rows per query; each file yields one or more such scripts (see the first sketch after this list).
4. The database tables have no indexes or primary keys at all, so checking for duplicates at the moment each row is inserted takes a huge amount of time (I tried it; it hurts to look at).
5. After all the data has been inserted, duplicate removal begins on two columns (one must be unique within the table; the second, of timestamp type, is the time processing of the data batch started). I create a temporary table of two columns, write the actual data into it, and wipe from the main table everything that is not in it (see the second sketch after this list).
6. Transactions are prohibited; several applications work with the database at the same time.
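For concreteness, a generated script from step 3 might look like the following. This is only a sketch: app.events and the columns value_hash, processed_at, payload are hypothetical placeholder names, not the real schema.

-- One generated statement: a single multi-row INSERT, 1000 value tuples each.
-- app.events, value_hash, processed_at, payload are placeholder names.
INSERT INTO app.events (value_hash, processed_at, payload) VALUES
  ('9f86d081...', '2024-01-15 08:00:00', '<record>...</record>'),
  ('e3b0c442...', '2024-01-15 08:00:00', '<record>...</record>'),
  -- ... up to 1000 tuples per statement ...
  ('2c26b46b...', '2024-01-15 08:00:00', '<record>...</record>');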
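The cleanup in step 5 is described tersely, so the row-id bookkeeping below is one way to make "wipe everything that is not in the temp table" concrete. Same placeholder names; PostgreSQL is assumed (ctid and DISTINCT ON are PostgreSQL-specific).

-- Collect one surviving physical row per (value_hash, processed_at) pair,
-- then delete every other row. ctid is PostgreSQL's physical row id and is
-- only safe to rely on while no concurrent writer updates these rows.
CREATE TEMPORARY TABLE keep AS
SELECT DISTINCT ON (value_hash, processed_at)
       value_hash, processed_at, ctid AS keep_ctid
FROM app.events;

DELETE FROM app.events e
WHERE e.ctid NOT IN (SELECT keep_ctid FROM keep);

DROP TABLE keep;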
How can this be implemented quickly and reliably? I have no strength left. There are ideas, but no holistic, stable solution that keeps the database from breaking (a lot of scripts are generated, and you never know what will fail at some point) and makes the whole thing run not in days but as quickly as possible, ideally in minutes.
UPDATE TO THE ORIGINAL PROBLEM:
5. The uniqueness of a record is determined by two columns: one is a hash of the value in the source XML, the other is the start time of data processing (timestamp type). There must not be two rows in the table with identical values in both fields.
6. Adding indexes, keys, and transactions is still prohibited, but creating temporary tables with any set of fields and indexes is allowed (and that is just what we need).
SOLUTION
Several applications can work with the database at the same time (replicas of the application being developed, or an entirely different application using the same database), so we still cannot use transactions or add indexes or composite keys to the existing tables.
You can, however, create temporary tables: pour absolutely all the data from the XML into them, add indexes to the temporary tables, erase the duplicates there, merge the data into the main tables, drop the temporary tables, and repeat for each new data batch.
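A minimal sketch of that pipeline, assuming PostgreSQL (temporary-table and bulk-load syntax differ in other DBMSs); app.events and the columns value_hash, processed_at, payload are the same hypothetical placeholders as above:

-- 1. Session-local staging table shaped like the target. Temporary tables
--    are invisible to the other applications sharing the database.
CREATE TEMPORARY TABLE staging (LIKE app.events);

-- 2. Bulk-load everything parsed from the XML batch. COPY is typically far
--    faster than thousands of concatenated INSERT statements.
--    (Server-side COPY shown; use psql's \copy if the file is on the client.)
COPY staging (value_hash, processed_at, payload)
FROM '/tmp/batch.csv' WITH (FORMAT csv);

-- 3. Indexes are allowed on temporary tables; this one speeds up the
--    dedup and the merge below.
CREATE INDEX staging_key_idx ON staging (value_hash, processed_at);

-- 4. Keep exactly one row per (value_hash, processed_at).
--    DISTINCT ON is PostgreSQL-specific.
CREATE TEMPORARY TABLE deduped AS
SELECT DISTINCT ON (value_hash, processed_at) * FROM staging;

-- 5. Merge: insert only key pairs not already present in the main table.
--    With no indexes on app.events this runs as one anti-join scan of the
--    main table per batch, rather than one lookup per inserted row.
INSERT INTO app.events
SELECT d.* FROM deduped d
WHERE NOT EXISTS (
  SELECT 1 FROM app.events e
  WHERE e.value_hash = d.value_hash
    AND e.processed_at = d.processed_at
);

-- 6. Clean up and repeat for the next batch (temporary tables also vanish
--    automatically when the session ends).
DROP TABLE staging;
DROP TABLE deduped;

The point of the design is that all the expensive work (bulk loading, indexing, deduplication) happens on session-local temporary tables, so the ban on touching the shared tables is respected, and the main table is hit by exactly one bulk insert per batch. Note that without transactions or a unique constraint, two batches merging the same key pair at the same moment could both pass the NOT EXISTS check; staggering batches or re-running a periodic dedup pass covers that gap.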