C++ / C#
Andrew, 2012-11-14 22:41:15

How to store and search 10 billion records?

There are 500 billion entries in total; each entry is a few numbers and some text.
The whole set is divided into N parts of about 10 billion entries each.

At the moment, each part of 10 billion entries is stored in one file (about 5 TB). There are several indexes for this file: binary files of (key -> offset in the data file) pairs, sorted by key, so a lookup is a simple binary search.

The main problem is that new data arrives constantly, N million entries a day, and when these entries are added to an index file, the entire index file (about 500 GB) has to be rewritten. The same applies to every index, and there are several of them for each part. This takes a long time.

How are such problems usually solved? How should indexes like these be stored? Maybe there is some kind of DB that can hold this many records with several indexes and sort orders.
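To make the current setup concrete, here is a minimal sketch of the kind of index lookup described above. The fixed-width entry format (a 64-bit key followed by a 64-bit offset, sorted by key) is an assumption for illustration, not something stated in the question:

```cpp
// Assumed index entry layout: a 64-bit key followed by a 64-bit offset,
// sorted by key. The index file is opened with std::ios::binary.
#include <cstdint>
#include <fstream>
#include <optional>

struct IndexEntry {
    std::uint64_t key;
    std::uint64_t offset;  // byte offset of the record in the big data file
};

// Binary search over the sorted on-disk index without loading it into memory.
std::optional<std::uint64_t> find_offset(std::ifstream& index, std::uint64_t key) {
    index.seekg(0, std::ios::end);
    const std::int64_t count =
        static_cast<std::int64_t>(index.tellg()) / sizeof(IndexEntry);

    std::int64_t lo = 0, hi = count - 1;
    while (lo <= hi) {
        const std::int64_t mid = lo + (hi - lo) / 2;
        IndexEntry e{};
        index.seekg(mid * static_cast<std::int64_t>(sizeof(IndexEntry)));
        index.read(reinterpret_cast<char*>(&e), sizeof e);
        if (!index) return std::nullopt;
        if (e.key == key) return e.offset;
        if (e.key < key) lo = mid + 1; else hi = mid - 1;
    }
    return std::nullopt;  // key not in this index
}
```

A lookup touches only O(log N) entries of the index file, but an insert anywhere into the sorted array forces rewriting everything after it, which is exactly the problem described.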


4 answers
amarao, 2012-11-14
@amarao

The smart-sounding brush-off answer: Hadoop.
If you actually want to solve it yourself: if there are no performance problems and a single 5 TB file works fine, then you just need to store the index as a tree and update only the parts of the index affected by the newly arrived data.
The simplest example of such an index: turn the key into a hash (it doesn't matter how; identity works, or md5 and take the low bits), then create directories named after the first byte of the hash, inside them directories for the second byte, and so on, until what remains in each leaf is very compact. When new data is added and indexed, only the small pieces of the index that actually changed are updated; see the sketch below.
This is a quick-and-dirty solution; if you need something more serious, look at specialized databases.
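A rough sketch of that hash-prefix directory idea (the two-level split, the hex directory names and the append-only bucket files are my assumptions, not part of the answer):

```cpp
// Rough sketch of a hash-prefix directory index. Caveat: for a persistent
// index the hash must be stable across runs (identity or md5 low bits, as
// the answer says); std::hash here is just a placeholder.
#include <cstdint>
#include <cstdio>
#include <filesystem>
#include <fstream>
#include <functional>

namespace fs = std::filesystem;

// Bucket that owns this key: root/ab/cd/bucket.idx, where "ab" and "cd" are
// the first two bytes of the key's hash in hex (two levels = 65536 buckets).
fs::path bucket_path(const fs::path& root, std::uint64_t key) {
    const std::uint64_t h = std::hash<std::uint64_t>{}(key);
    char b1[3], b2[3];
    std::snprintf(b1, sizeof b1, "%02x", static_cast<unsigned>(h & 0xff));
    std::snprintf(b2, sizeof b2, "%02x", static_cast<unsigned>((h >> 8) & 0xff));
    return root / b1 / b2 / "bucket.idx";
}

// Indexing a new entry rewrites (here: appends to) one small bucket file,
// not the whole multi-hundred-gigabyte index.
void index_add(const fs::path& root, std::uint64_t key, std::uint64_t offset) {
    const fs::path p = bucket_path(root, key);
    fs::create_directories(p.parent_path());
    std::ofstream out(p, std::ios::binary | std::ios::app);
    out.write(reinterpret_cast<const char*>(&key), sizeof key);
    out.write(reinterpret_cast<const char*>(&offset), sizeof offset);
}
```

Adding a day's worth of entries then rewrites only the buckets those entries hash into, instead of one monolithic 500 GB index file.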

darkdimius, 2012-11-14
@darkdimius

For such tasks the following idea sometimes works: divide the database into parts (packages), query them independently, and then combine the results.
For example, keep the data for each day since Sunday in its own one-day package, and merge everything into a single package once every 7 days.
If you need a lookup by key, query the packages in ascending order of their "age", i.e. newest first.
If you need sorted output, then after the search you have to merge the results, with newer records overriding older ones.
A smarter strategy is to merge packages by a power law, that is, keep only packages covering 2^i days. A sketch follows below.
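A small sketch of the package idea (the in-memory std::map standing in for a sorted on-disk package, and the newest-first ordering, are simplifying assumptions):

```cpp
// Each package maps key -> record; newer packages shadow older ones.
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <vector>

using Package = std::map<std::uint64_t, std::string>;  // sorted by key

// Packages ordered newest first; the first hit wins, so fresh data
// overrides stale data without rewriting the old packages.
std::optional<std::string> lookup(const std::vector<Package>& packages,
                                  std::uint64_t key) {
    for (const Package& p : packages) {
        auto it = p.find(key);
        if (it != p.end()) return it->second;
    }
    return std::nullopt;
}

// Merging two packages (e.g. the daily ones into a weekly one):
// on duplicate keys the newer package's value is kept.
Package merge(const Package& newer, const Package& older) {
    Package out = older;
    for (const auto& [k, v] : newer) out[k] = v;  // newer overwrites older
    return out;
}
```

With power-of-two package sizes, each record participates in roughly O(log N) merges over its lifetime rather than being rewritten on every daily update.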

pav, 2012-11-14
@pav

Take a look at LucidWorks Big Data. I haven't worked with it myself, but I do work with LucidWorks Search and so far there have been no problems (~15 GB, 10k documents).

relgames, 2012-11-15
@relgames

I think Cassandra is worth a try. It can search very quickly not only by primary key but also by secondary indexes: www.datastax.com/docs/1.0/ddl/indexes
