S
S
Sergey Sokolov2015-03-21 12:45:06
Algorithms
Sergey Sokolov, 2015-03-21 12:45:06

What is the optimal way to find subsets in a many-to-many dataset?

There are two classes with a many-to-many relationship: eg. users ( U ) and communities ( C ) they belong to. One user can belong to several communities, and there can be several users in a community.
We know who is in what. You can present the data in any convenient form. So far I've settled on this:

c1: [u1, u2, u3],
c2: [u1, u2, u4],
c3: [u2, u3, u5, u6],
...

"Cores" I call relations with at least two participants from each side. In this example, there are two "cores":
[u1, u2] => [c1, c2],
[u2, u3] => [c1, c3],

How to find all kernels among fairly large data?
In addition to sorting through all possible combinations, nothing has yet come to mind. Very inefficient on large sets (tens of thousands on each side).

Answer the question

In order to leave comments, you need to log in

4 answer(s)
M
Mrrl, 2015-03-22
@sergiks

There is a simple solution for (number of communities)*(number of user-community relationships).
For each community Y:
- create an array R[C], where C is the number of communities
- for each user X from community Y:
- - for each community M that includes user X: R[M]=R[M]+ 1
- for each community M: if M!=Y and R[M] > 1, then the pair (Y,M) is the core.
It doesn't get any faster.

A
Armenian Radio, 2015-03-21
@gbg

Store the "kernels" themselves in a classic many-to-many relational database. Three tables - users, groups, connections.
rows in the link table: user id, group id.

S
Sergey, 2015-03-21
Protko @Fesor

It is worth thinking about how to transform the structure into a graph, so that there would be a connection not only between the community and the user, but vice versa. So the search will be more efficient. Also, if there is a lot of data, you can arm yourself with neo4j

S
SeptiM, 2015-03-21
@SeptiM

Let's start with an arbitrary graph. Each vertex will be a community, and we will put two users on each edge. A user belongs to the community if their edge touches the corresponding vertex. For such an example there would be Omega(C^2) nuclei, all different. This imposes some lower bounds on the algorithm.
The trivial way will work for O(C^2 U) + sorting of users in each community. It is clear that we are comparing the communities in pairs, and are looking for an intersection behind the line according to the sorted lists.
It is possible to improve the algorithm through minhash ( en.wikipedia.org/wiki/MinHash ) by paying with precision. Minhash allows you to calculate the Jaccard symbol - the size of the intersection of two sets divided by the size of the union. Only large intersections can be weeded out.

Didn't find what you were looking for?

Ask your question

Ask a Question

731 491 924 answers to any question