SQL
Alexey Solodky, 2015-05-20 16:42:51

How to implement the intersection of two sets (a lot of data)?

i.imgur.com/fTwGNcc.png
There is a list of sites (10M)
and a list of keys (40M).
Each site has a set of keys. The site in the screenshot has 500k of them.
The competitor table for a particular site is every site that shares keys with it, sorted roughly by the number of shared keys / the total number of the site's keys.
These tables need to be retrieved quickly, page by page.
The main problem is the amount of data.
The table is large (736k rows), with pagination and sorting by any column.
I am interested in the approach rather than a specific solution. I suspect MySQL will not cope with this task. What might fit?
How can such queries be run in a reasonable time (about 10 sec)?
Maybe graph databases? Or will a regular relational database suffice?
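
To pin down the ranking described above, here is a minimal brute-force sketch in Python. The site names and keys are made up, and reading "total number of site keys" as the key count of the candidate site is an assumption. At 10M sites this loop would be far too slow; it only serves as a definition of the result that needs to be produced.

```python
def competitors(target, site_keys):
    """Rank every other site by shared keys / its own key count (assumed formula)."""
    target_keys = site_keys[target]
    rows = []
    for site, keys in site_keys.items():
        if site == target:
            continue
        shared = len(target_keys & keys)
        if shared:  # sites with no common keys are not competitors
            rows.append((site, shared, shared / len(keys)))
    # Highest score first; ties broken by raw intersection size, then by name.
    rows.sort(key=lambda r: (-r[2], -r[1], r[0]))
    return rows

# Hypothetical toy data: site -> set of keys.
SITE_KEYS = {
    "a.example": {"k1", "k2", "k3", "k4"},
    "b.example": {"k1", "k2"},
    "c.example": {"k3", "k9", "k10", "k11"},
    "d.example": {"k7"},
}

RANKED = competitors("a.example", SITE_KEYS)
```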


3 answer(s)
Dmitry Entelis, 2015-05-20
@DmitriyEntelis

IMHO the right direction to look and think in is Hadoop and MapReduce.
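
As a rough illustration of how the job could be phrased in MapReduce terms, here is a single-process Python simulation. It is not actual Hadoop code, and the function names and toy data are hypothetical. The map step groups sites by key; the reduce step counts, for every pair of sites, how many keys they share.

```python
from collections import defaultdict
from itertools import combinations

def map_phase(site_keys):
    """Emit (key, site) records and group them by key, as the shuffle step would."""
    by_key = defaultdict(list)
    for site, keys in site_keys.items():
        for key in keys:
            by_key[key].append(site)
    return by_key

def reduce_phase(by_key):
    """For each key, emit every pair of sites sharing it and sum the counts per pair."""
    shared = defaultdict(int)
    for sites in by_key.values():
        for a, b in combinations(sorted(sites), 2):
            shared[(a, b)] += 1
    return dict(shared)

# Hypothetical toy data: site -> set of keys.
SITE_KEYS = {
    "a.example": {"k1", "k2", "k3"},
    "b.example": {"k1", "k2"},
    "c.example": {"k3"},
}

PAIR_COUNTS = reduce_phase(map_phase(SITE_KEYS))
```

Note that a key shared by n sites produces n(n-1)/2 pairs, so very common keys would probably need to be capped or handled separately.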

lega, 2015-05-21
@lega

You can try sphinxsearch (or Elasticsearch): they sort results by relevance, so the sites with the largest key intersections come out on top, but a query can take a long time if there are a lot of intersections.
Or try building an inverted index (key to a sorted list of sites), compute a site's intersections in one pass, shard the sites across nodes, and load the result into a database for sorting.
How many keys does a site have on average?
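
The inverted-index idea could look roughly like this in Python. This is a minimal in-memory sketch with hypothetical names and toy data; sharding across nodes and loading into a database are left out.

```python
from collections import Counter, defaultdict

def build_inverted_index(site_keys):
    """Map each key to a sorted list of the sites that have it."""
    index = defaultdict(list)
    for site in sorted(site_keys):  # sorted iteration keeps every list sorted
        for key in site_keys[site]:
            index[key].append(site)
    return index

def shared_key_counts(target, site_keys, index):
    """One pass over the target's posting lists, counting hits per site."""
    counts = Counter()
    for key in site_keys[target]:
        counts.update(index[key])
    del counts[target]  # a site is not its own competitor
    return counts

# Hypothetical toy data: site -> set of keys.
SITE_KEYS = {
    "a.example": {"k1", "k2", "k3"},
    "b.example": {"k1", "k2"},
    "c.example": {"k3", "k4"},
}

INDEX = build_inverted_index(SITE_KEYS)
COUNTS = shared_key_counts("a.example", SITE_KEYS, INDEX)
```

The work per query is proportional to the total length of the target's posting lists, which is why the average number of keys per site matters.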

Alexey Likhachev, 2015-05-21
@Playbot

I don't quite understand the essence of the problem, but if your queries are correct, that is, the database returns all the data you need directly, without additional logic in code, and you use pagination (say, you request 100 rows per page), then everything will run in a reasonable time. If not, check whether your queries actually use indexes.
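
For illustration, here is a small sketch of indexed pagination using Python's built-in sqlite3 module. The `competitors` table, its columns, and the generated rows are assumptions made up for the example, and it presumes the competitor rows have already been computed and stored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE competitors ("
    " site_id INTEGER, competitor_id INTEGER, shared INTEGER, score REAL)"
)
# Composite index matching the WHERE + ORDER BY, so a page is read in index order.
conn.execute(
    "CREATE INDEX idx_site_score ON competitors (site_id, score DESC, competitor_id)"
)
conn.executemany(
    "INSERT INTO competitors VALUES (?, ?, ?, ?)",
    [(1, c, c, c / 1000.0) for c in range(2, 502)],  # 500 made-up rows for site 1
)

def fetch_page(conn, site_id, page, page_size=100):
    """Return one page of competitors for site_id, best score first."""
    return conn.execute(
        "SELECT competitor_id, shared, score FROM competitors"
        " WHERE site_id = ? ORDER BY score DESC, competitor_id"
        " LIMIT ? OFFSET ?",
        (site_id, page_size, page * page_size),
    ).fetchall()

first_page = fetch_page(conn, 1, 0)
```

Sorting by a different column would likely need its own index, and large OFFSET values still make the database skip over all the preceding rows.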
