Fast search over a large array of heterogeneous data: what should I choose?
Let's say there are 10 million records, each averaging about 5 kilobytes.
The data in the records is not text; it is close to binary in places, and it is not possible to build a dictionary (an index) over it.
Full-text search needs to be organized over this data.
What system would you recommend for this task so that the search is as fast as possible (how should the data be stored, and how should it be searched)?
UPD:
The question can also be approached from the other side, for example distributed computing: GAE, say, or Amazon SimpleDB. Has anyone had experience with that?
Apply the Bloom filter idea:
choose 10-30 features that are easy to compute for both the query and the stored content and that give a roughly even true/false distribution on your data set. Then restrict the search to those records that exhibit every feature found in the query.
For example, you can choose features like "contains a substring of N characters whose byte sum equals K". Obviously, if such a substring is present in the query, it must also be present in the matching records. Out of curiosity, I ran an experiment on jpg avatars with an average size of 4 KB and ended up with the following (N, K) pairs: (3, 97), (3, 98), (3, 99), (3, 102), (3, 104), (3, 105), (4, 161), (4, 173), (4, 178), (5, 247), (5, 251), (5, 255)…
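A minimal Python sketch of that filtering step. The (N, K) pairs are taken from the experiment above; interpreting "sum" as the window's byte sum modulo 256 is my assumption, as is keeping all records and masks in memory.

```python
# Hypothetical feature set: the (N, K) pairs from the experiment above.
FEATURES = [(3, 97), (3, 98), (3, 99), (3, 102), (3, 104), (3, 105),
            (4, 161), (4, 173), (4, 178), (5, 247), (5, 251), (5, 255)]

def feature_mask(data: bytes) -> int:
    """Bit i is set iff data has a window of N_i bytes whose sum mod 256 equals K_i."""
    mask = 0
    for i, (n, k) in enumerate(FEATURES):
        for j in range(len(data) - n + 1):
            if sum(data[j:j + n]) % 256 == k:
                mask |= 1 << i
                break
    return mask

def build_masks(records):
    # Precompute one small integer per record; this is the whole "filter index".
    return [feature_mask(r) for r in records]

def candidates(query: bytes, records, masks):
    # A record can contain the query only if it exhibits every feature
    # the query exhibits, so everything else is discarded cheaply.
    q = feature_mask(query)
    return [rec for rec, m in zip(records, masks) if m & q == q]
```

The exact substring check (e.g. `query in rec`) is then run only on the surviving candidates.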
Full-text search is described reasonably well here: www.mysql.ru/docs/man/Fulltext_Search.html
Here only a "top-down" approach is possible. If the data is completely random and there is no point in sorting it, there is probably no other way out. If it can nevertheless be sorted, you can try placing "tags" and searching in the gap between them.
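A minimal sketch of that idea in Python, assuming the records can be sorted and the query is matched as a prefix (the "top-down" case); the tag spacing is an arbitrary assumption.

```python
import bisect

STEP = 1000  # hypothetical tag spacing

def build_tags(sorted_records):
    # Keep every STEP-th sorted record in memory as a "tag".
    return sorted_records[::STEP]

def search(query: bytes, sorted_records, tags):
    # The last tag <= query marks the gap where matches can start.
    start = max(bisect.bisect_right(tags, query) - 1, 0) * STEP
    out = []
    for rec in sorted_records[start:]:
        if rec.startswith(query):
            out.append(rec)
        elif rec > query and not rec.startswith(query):
            break  # sorted order: nothing past this point can match
    return out
```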
10,000,000 records of 5,000 bytes each?
I.e. about 50 gigabytes of textual information?
I would try Sphinx; it is pretty good at digesting large volumes of text data:
sphinxsearch.com/about/sphinx/
If your sequences happen to be related to biology (and even if they are not), bioinformatics has specialized algorithms for this, such as BLAST.
If there is a lot of data, find a way to work with an index. There are no other effective solutions.
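One possible reading of "work with an index" for binary records is an inverted index over fixed-length byte n-grams. A minimal in-memory sketch in Python; the n-gram length is an assumption.

```python
from collections import defaultdict

N = 4  # hypothetical n-gram length

def build_index(records):
    # Map every N-byte window to the set of record ids containing it.
    index = defaultdict(set)
    for rec_id, data in enumerate(records):
        for i in range(len(data) - N + 1):
            index[data[i:i + N]].add(rec_id)
    return index

def search(query: bytes, records, index):
    # Every N-byte window of the query must occur in a matching record.
    grams = [query[i:i + N] for i in range(len(query) - N + 1)]
    if not grams:
        return []
    ids = set.intersection(*(index.get(g, set()) for g in grams))
    # Verify the candidates with an exact substring check.
    return [records[i] for i in ids if query in records[i]]
```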
50 GB of essentially random byte sequences? No separators? What is the average length of the fragment being searched for? And how often will queries be run?