How should the database and search be organized for 1,000,000,000,000 (one trillion) records totaling about 100 TB?
Hello everyone. I'm working on a project related to image recognition and have run into a very interesting problem, and I suspect I'm not the only one: searching through huge amounts of data.
Hashes go into the database; I don't know the exact length yet, but I expect 32-64 UTF-8 characters.
Each image produces approximately 5,000 hashes. Since there will be a lot of images (really a lot, at least to me), 720,000,000 (720 million) of them, the search will have to cover more than 1 trillion records, which in turn will take up about 100 TB.
How can the structure be designed so that it is extensible and still works under such conditions?
In theory a hash lookup should be O(1); will MySQL handle it?
Which direction should I dig in? Thanks!
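For scale, here is a rough back-of-envelope check of the figures above; this is only a sketch, and it assumes fixed 32-byte binary hashes, since the question leaves the length open:

```python
# Back-of-envelope size estimate from the numbers in the question.
images = 720_000_000       # 720 million images
hashes_per_image = 5_000
hash_bytes = 32            # assumption; the question says 32-64 characters

total_hashes = images * hashes_per_image        # 3.6e12 records
raw_size_tb = total_hashes * hash_bytes / 1e12  # bytes -> terabytes

print(f"{total_hashes:.1e} hashes, ~{raw_size_tb:.0f} TB raw")
# -> 3.6e+12 hashes, ~115 TB raw (before any index overhead)
```

So the record count is closer to 3.6 trillion than 1 trillion, and the raw payload alone already lands around the quoted 100 TB before any indexes.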
Hmm, all this is strange... usually whoever creates such a database already knows what to do. Work with Oracle.
The FS is also a DB:

PC-1 for routing
    returns the addresses of the machines that store the hashes and
    images, based on, for example, the first 4 bytes of the hash
PC-1 for hashes
|- /file_with_hash_of_region: content hash of the image
|- ...
PC-n for hashes
|- /file_with_hash_of_region: content hash of the image
|- ...
PC-1 for images
|- /image_file_with_hash_as_name
|- ...
PC-n for images
|- /image_file_with_hash_as_name
|- ...
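A minimal sketch of the routing idea above, with made-up host names and a made-up shard count; the first 4 bytes of the hash select the machine:

```python
# Toy router: pick the machine that stores a given hash (or image)
# from the first 4 bytes of the hash, as in the scheme above.
HASH_SHARDS = [f"hash-pc-{i}.internal" for i in range(16)]   # hypothetical hosts
IMAGE_SHARDS = [f"image-pc-{i}.internal" for i in range(16)]

def shard_for(hash_bytes: bytes, shards: list) -> str:
    prefix = int.from_bytes(hash_bytes[:4], "big")  # first 4 bytes of the hash
    return shards[prefix % len(shards)]

h = bytes.fromhex("deadbeef" + "00" * 28)  # a 32-byte example hash
print(shard_for(h, HASH_SHARDS))   # machine holding the hash record
print(shard_for(h, IMAGE_SHARDS))  # machine holding the image file
```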
It is unlikely that anything will be faster than an Aerospike cluster on SSD drives.
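For illustration, a hedged sketch of the hash-to-images lookup with the official aerospike Python client; the host, namespace, and set names here are assumptions (the namespace has to exist in the server config):

```python
import aerospike

config = {"hosts": [("127.0.0.1", 3000)]}   # made-up host
client = aerospike.client(config).connect()

# One record per hash; a bin holds the ids of the images containing it.
key = ("imgdb", "hashes", "ab" * 32)        # (namespace, set, user key)
client.put(key, {"image_ids": [42, 1337]})

_, _, bins = client.get(key)
print(bins["image_ids"])  # -> [42, 1337]
```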
Make each hash a primary key and then look here:
https://dev.mysql.com/doc/refman/5.5/en/innodb-ind...
UPD: I would add that when training and when benchmarking an image (against a set of similar ones from the database), intermediate "close"/"similar" instances should be removed from further selection (in a single pass over the entire database), leaving a certain tolerance on the parameters. That way the database will not grow with "copies" of similar instances.
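A minimal sketch of the primary-key layout suggested above, assuming the mysql-connector-python driver and made-up connection details. Note that an InnoDB primary-key lookup is a clustered B-tree traversal, so it is O(log n) rather than a literal O(1):

```python
# Sketch: store the digest as fixed-width binary, not UTF-8 text, and
# make it the leading primary-key column so lookups go straight through
# InnoDB's clustered index.
import mysql.connector

DDL = """
CREATE TABLE image_hashes (
    hash     BINARY(32)      NOT NULL,  -- fixed-width digest, not VARCHAR
    image_id BIGINT UNSIGNED NOT NULL,
    PRIMARY KEY (hash, image_id)        -- one hash may occur in many images
) ENGINE=InnoDB
"""

conn = mysql.connector.connect(host="localhost", user="app",
                               password="secret", database="imgdb")
cur = conn.cursor()
cur.execute(DDL)

# Point lookup: every image containing this region hash.
cur.execute("SELECT image_id FROM image_hashes WHERE hash = %s",
            (bytes.fromhex("ab" * 32),))
print(cur.fetchall())
```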
Try ArangoDB: the API is very simple and the throughput is great. But that's if you want to try a NoSQL solution.
I would sooner dig in the direction of ScyllaDB, a smarter beast than Cassandra/HBase.
So many answers, and yet no one has even pinned down what the author means by "hash search".
Given a single hash, return the ID of the photo?
If you need the IDs of the images whose hashes occur most frequently in the requested sample, then you need to build not just a key-value store but more suitable indexes...
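As a sketch of that second interpretation: with roughly 5,000 hashes per query image, each one is looked up and candidate images are ranked by how many hashes they share with the query. The lookup function here is a hypothetical stand-in for a key-value store query:

```python
from collections import Counter

def rank_candidates(query_hashes, lookup, top_n=10):
    """lookup(h) -> iterable of image ids containing hash h
    (hypothetical; e.g. one get() against a key-value store)."""
    counts = Counter()
    for h in query_hashes:
        counts.update(lookup(h))
    return counts.most_common(top_n)  # images sharing the most hashes
```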
Personally, I was also confused by the following points:
1) what kind of strange hashes are these, measured in UTF-8 characters? You do realize that a single character in this encoding can take from 1 to 4 bytes, which produces a huge spread at this number of records. If your hash is ASCII, why drag UTF-8 into this at all? (See the sketch after this list.)
2) 32 or 64 characters, which is it? By your own count, that difference is about ±50 TB. These are quite serious volumes.
3) How did you arrive at 100 TB? Did you account for the space the index takes?
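To make point 1 concrete, a small sketch of the raw-bytes-versus-text difference:

```python
import hashlib

digest = hashlib.sha256(b"some image region").digest()  # 32 raw bytes
hex_form = digest.hex()             # 64 ASCII chars = 64 bytes in UTF-8

print(len(digest), len(hex_form))   # 32 64 -> text form doubles the storage
```

At a trillion-plus records, storing the digest as raw fixed-width bytes instead of hex text halves the payload on its own.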
Ideas on the problem:
1) it's not worth dragging a relational database in here, because...
2) obviously all of this has to run on more than one machine, at least two at a rough estimate, not counting backups (are they needed?) or replicas => sharding => key-value stores would be a better fit (if we've understood correctly what you want)
3) nothing is said about the number of requests (inserts/reads), but I would consider putting a preliminary Bloom-filter check in front of the storage so as not to hit it unnecessarily (a toy sketch follows this list). For that, though, you need to know the nature of the data and the queries.
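The Bloom-filter idea from point 3 as a toy sketch; a real deployment would use a tuned library implementation (and stores like Cassandra/ScyllaDB already keep internal per-SSTable Bloom filters):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: answers 'definitely absent' or 'maybe present'."""
    def __init__(self, size_bits=1 << 20, n_hashes=7):
        self.size = size_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: bytes):
        # Derive n independent bit positions from keyed BLAKE2b digests.
        for i in range(self.n_hashes):
            h = hashlib.blake2b(item, digest_size=8,
                                person=i.to_bytes(8, "big"))
            yield int.from_bytes(h.digest(), "big") % self.size

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter()
bf.add(b"\xab" * 32)
print(bf.might_contain(b"\xab" * 32))  # True
print(bf.might_contain(b"\xcd" * 32))  # False with high probability
```

A negative answer lets you skip the network round-trip entirely; only "maybe present" queries hit the cluster.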