MySQL
ruboss, 2015-09-19 19:21:01

How should a database and search be organized for 1,000,000,000,000 (a trillion) records taking up about 100 TB?

Hello everyone. I'm working on an image-recognition project and ran into a very interesting problem, and I think not only for me: searching through huge amounts of data.
The hashes go into a database; I don't know the exact length yet, I expect 32-64 UTF-8 characters.
Each image produces approximately 5,000 hashes. Since there will be a lot of images (really a lot, at least to me), 720,000,000 (720 million), the search will have to cover more than 1 trillion records, which in turn will take up about 100 TB.
How can the structure be designed so that it is extensible and works at all under such conditions?
In theory a hash lookup should be O(1); will MySQL handle it?
Which direction should I dig in? Thank you!
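
A rough back-of-the-envelope behind these numbers (a sketch assuming fixed-width binary hashes; index overhead not included):

```python
# Back-of-the-envelope for the volumes in the question (assumes
# fixed-width binary hashes; index overhead not included).
images = 720_000_000                       # 720 million images
hashes_per_image = 5_000
total_hashes = images * hashes_per_image   # 3.6e12, i.e. over a trillion rows

for hash_bytes in (32, 64):
    raw_tb = total_hashes * hash_bytes / 1024**4
    print(f"{hash_bytes}-byte hashes: ~{raw_tb:.0f} TB raw")
# 32-byte hashes: ~105 TB raw
# 64-byte hashes: ~210 TB raw
```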


12 answers
Sergey, 2015-09-19
@ruboss

cassandra.apache.org

Anton, 2015-09-19
@Largo1

Hmm, all this is strange... usually whoever creates a database like this already knows what to do. Work with Oracle.

Max, 2015-09-19
@MaxDukov

MySQL won't handle that much either. Look at Hadoop.

sim3x, 2015-09-20
@sim3x

FS is also a DB

PC-1 for routing
returns the addresses of the machines that hold the hashes and images,
for example by the first 4 bytes of the hash

PC-1 for hashes
|-/file_with_hash_of_region: content hash of image
|-....

PC-n for hashes
|-/file_with_hash_of_region: content hash of image
|-....

PC-1 for images
|-/image_file_with_hash_as_name
|-....

PC-n for images
|-/image_file_with_hash_as_name
|-....
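
A minimal sketch of the routing step (node addresses are hypothetical; any stable mapping from hash prefix to machine will do):

```python
# Sketch of prefix routing: the first bytes of the hash choose the node
# that stores it. NODES is a hypothetical address list; a real setup
# would keep this table on the routing machine (PC-1 above).
NODES = ["pc-hash-1:9000", "pc-hash-2:9000", "pc-hash-3:9000"]

def node_for_hash(hash_hex: str) -> str:
    prefix = int(hash_hex[:8], 16)     # first 4 bytes (8 hex chars)
    return NODES[prefix % len(NODES)]

print(node_for_hash("a3f1c2d4" + "00" * 28))  # -> one of the NODES entries
```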

Dimchansky, 2015-09-24
@Dimchansky

It's unlikely anything will be faster than an Aerospike cluster on SSD drives.

xmoonlight, 2015-09-19
@xmoonlight

Make each hash a primary key and then look here:
https://dev.mysql.com/doc/refman/5.5/en/innodb-ind...
UPD: I would add that when training on and matching an image (against a set of similar ones from the database), you should remove intermediate "close"/"similar" instances from further selection (in a single pass through the entire database), keeping a certain tolerance percentage on the parameters. That way the database will not grow with near-"copies" of similar instances.
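
For illustration, a minimal sketch of such a schema (table and connection details are hypothetical; note that an InnoDB primary-key lookup is a B-tree search, so O(log n) rather than a strict O(1)):

```python
# Sketch: hashes as (part of) the InnoDB primary key, per the link above.
# Fixed-width BINARY avoids the variable width of UTF-8 text; connection
# parameters and names here are hypothetical.
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="app",
                               password="secret", database="imgsearch")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS image_hashes (
        hash     BINARY(32)      NOT NULL,  -- fixed-width binary hash
        image_id BIGINT UNSIGNED NOT NULL,
        PRIMARY KEY (hash, image_id)        -- clustered index on the hash
    ) ENGINE=InnoDB
""")
conn.commit()
```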

beduin01, 2015-09-19
@beduin01

Try ArangoDB. The API is very simple and the throughput is excellent. That is, if you want to try a NoSQL solution.

Alexander Chernykh, 2015-09-24
@sashkets

Here is another piece of fresh news: www.nixp.ru/news/13589.html

Yuri Yarosh, 2015-09-24
@voidnugget

I would sooner go in the direction of ScyllaDB, a smarter thing than Cassandra/HBase.

gro, 2015-09-24
@gro

So many answers, and yet no one even asked what the author means by a hash search.
Given a single hash, return the ID of the photo?

Alexey Akulovich, 2015-09-24
@AterCattus

If you need to get the IDs of the pictures whose hashes occur most frequently in the requested selection, then you need to build not just a key-value store but more suitable indexes...
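
A sketch of what that lookup could look like (the in-memory dict is a hypothetical stand-in for the real hash-to-IDs index):

```python
# Sketch of ranking candidate images by shared hash count.
# `index` is a hypothetical in-memory stand-in for the real
# hash -> image-IDs store (an inverted index).
from collections import Counter

index = {
    "h1": [10, 11],
    "h2": [10],
    "h3": [11, 12],
}

def top_matches(query_hashes, k=2):
    counts = Counter()
    for h in query_hashes:
        counts.update(index.get(h, ()))
    return counts.most_common(k)  # [(image_id, shared_hashes), ...]

print(top_matches(["h1", "h2", "h3"]))  # [(10, 2), (11, 2)]
```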

pansa, 2015-09-25
@pansa

Personally, I was also confused by the following points:
1) What kind of strange hashes are these, measured in UTF-8 characters? You do realize that a single character in that encoding can take from 1 to 6 bytes, which gives a huge spread at this number of records. If your hash is ASCII, why drag UTF-8 into it?
2) 32-64 characters: so 32 or 64? At your scale that difference is roughly 50 TB. Those are quite serious volumes.
3) How did you arrive at 100 TB? Did you account for the space taken by the index?
Ideas on the problem:
1) It's not worth dragging a relational database into this, because...
2) It's obvious all of this needs to run on more than one machine, at least 2 at a glance, not counting backups (are they needed?) or replicas => sharding => key-value storage would be a better fit (if we understood correctly what you want)
3) Nothing is said about the volume of requests, inserts vs. reads. But I would consider putting a preliminary check through a Bloom filter in front of this storage, so as not to hit the storage unnecessarily (see the sketch below). For that you need to know the nature of the data and the queries.
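
A minimal sketch of such a Bloom-filter pre-check (sizing constants are illustrative only, nowhere near tuned for 10^12 keys):

```python
# Minimal Bloom filter: a cheap membership pre-check in front of the
# storage. Sizing below is illustrative, not tuned for 10^12 keys.
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1_000_003, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key: str):
        for i in range(self.num_hashes):
            d = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(d[:8], "big") % self.size

    def add(self, key: str) -> None:
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "probably present".
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

bf = BloomFilter()
bf.add("some-known-hash")
print(bf.might_contain("some-known-hash"))  # True
print(bf.might_contain("unknown-hash"))     # almost certainly False
```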
