Fast search over a large array of heterogeneous data: what should I choose?
Let's say there are 10 million records, each averaging about 5 kilobytes.
The data in the records is not text; it is close to binary in places, and it is not possible to build a dictionary (an index) over it.
Full-text search needs to be organized over this data.
What system would you recommend for this task so that the search is as fast as possible (how should the data be stored, and how should it be searched)?
UPD:
The question can also be approached from the other side, for example distributed computing: GAE, say, or Amazon SimpleDB. Has anyone had experience with that?
Apply the Bloom filter idea:
choose 10-30 features that are easy to compute for both the query and the stored content and that give a roughly even true/false distribution on your data set. Then restrict the search to those records that exhibit every feature found in the query.
For example, you can choose features like "contains a substring of N characters whose byte sum equals K". Obviously, if such a substring is present in the query, it must also be present in the matching records. Out of curiosity, I ran an experiment on jpg avatars with an average size of 4 KB and ended up with the following (N, K) pairs: (3, 97), (3, 98), (3, 99), (3, 102), (3, 104), (3, 105), (4, 161), (4, 173), (4, 178), (5, 247), (5, 251), (5, 255)…
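A minimal Python sketch of that filtering step. The (N, K) pairs are taken from the experiment above; interpreting "sum" as the window's byte sum modulo 256 is my assumption, as is keeping all records and masks in memory.

```python
# Hypothetical feature set: the (N, K) pairs from the experiment above.
FEATURES = [(3, 97), (3, 98), (3, 99), (3, 102), (3, 104), (3, 105),
            (4, 161), (4, 173), (4, 178), (5, 247), (5, 251), (5, 255)]

def feature_mask(data: bytes) -> int:
    """Bit i is set iff data has a window of N_i bytes whose sum mod 256 equals K_i."""
    mask = 0
    for i, (n, k) in enumerate(FEATURES):
        for j in range(len(data) - n + 1):
            if sum(data[j:j + n]) % 256 == k:
                mask |= 1 << i
                break
    return mask

def build_masks(records):
    # Precompute one small integer per record; this is the whole "filter index".
    return [feature_mask(r) for r in records]

def candidates(query: bytes, records, masks):
    # A record can contain the query only if it exhibits every feature
    # the query exhibits, so everything else is discarded cheaply.
    q = feature_mask(query)
    return [rec for rec, m in zip(records, masks) if m & q == q]
```

The exact substring check (e.g. `query in rec`) is then run only on the surviving candidates.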
Full-text search is described reasonably well here: www.mysql.ru/docs/man/Fulltext_Search.html
Here only a "top-down" approach is possible. If the data is completely random and there is no point in sorting it, there is probably no other way out. If it can nevertheless be sorted, you can try placing "tags" and searching in the gap between them.
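A minimal sketch of that idea in Python, assuming the records can be sorted and the query is matched as a prefix (the "top-down" case); the tag spacing is an arbitrary assumption.

```python
import bisect

STEP = 1000  # hypothetical tag spacing

def build_tags(sorted_records):
    # Keep every STEP-th sorted record in memory as a "tag".
    return sorted_records[::STEP]

def search(query: bytes, sorted_records, tags):
    # The last tag <= query marks the gap where matches can start.
    start = max(bisect.bisect_right(tags, query) - 1, 0) * STEP
    out = []
    for rec in sorted_records[start:]:
        if rec.startswith(query):
            out.append(rec)
        elif rec > query and not rec.startswith(query):
            break  # sorted order: nothing past this point can match
    return out
```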
10,000,000 records of 5,000 bytes each?
I.e. about 50 gigabytes of textual information?
I would try Sphinx; it is pretty good at digesting large volumes of text data:
sphinxsearch.com/about/sphinx/
If your sequences happen to be related to biology (and even if they are not), bioinformatics has specialized algorithms for this, such as BLAST.
If there is a lot of data, find a way to work with an index. There are no other effective solutions.
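One possible reading of "work with an index" for binary records is an inverted index over fixed-length byte n-grams. A minimal in-memory sketch in Python; the n-gram length is an assumption.

```python
from collections import defaultdict

N = 4  # hypothetical n-gram length

def build_index(records):
    # Map every N-byte window to the set of record ids containing it.
    index = defaultdict(set)
    for rec_id, data in enumerate(records):
        for i in range(len(data) - N + 1):
            index[data[i:i + N]].add(rec_id)
    return index

def search(query: bytes, records, index):
    # Every N-byte window of the query must occur in a matching record.
    grams = [query[i:i + N] for i in range(len(query) - N + 1)]
    if not grams:
        return []
    ids = set.intersection(*(index.get(g, set()) for g in grams))
    # Verify the candidates with an exact substring check.
    return [records[i] for i in ids if query in records[i]]
```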
50 GB of essentially random byte sequences? No separators? What is the average length of the fragment being searched for? And how often will queries be run?