A
A
Andrey Kulagin2020-08-11 19:43:04
Java
Andrey Kulagin, 2020-08-11 19:43:04

How to find duplicate elements in different collections in linear time?

Tell me pliz with an algorithm for solving the problem, I imagine what needs to be done with HashMap, the key will be user, the value will be a list of emails (I can add them to the hashset for example), how then to find duplicate elements in different collections in linear time?

There are n users, each of them corresponds to a list of emails
(all users have m emails).
For example:
user1 ->[email protected],[email protected],[email protected] ([email protected],[email protected],[email protected])
user2 ->[email protected], [email protected] ([email protected],[email protected])
user3 ->[email protected],[email protected] ([email protected],[email protected])
user4 ->[email protected] pisem.net,[email protected] ([email protected],[email protected])
user5 ->[email protected]
It is believed that if two users have a common email, then this is
the same user. It is required to build
and implement an algorithm that performs user merging. The output
should be a list of users with their emails (same as the
input).
Any of the
original names can be used as the merged user name. The list of user emails should contain only
unique emails.
Parameters n and m are arbitrary, the length of a particular list of emails is
not limited in any way.
It is required that the asymptotic running time of the obtained solution be
linear, or close to linear.

Answer the question

In order to leave comments, you need to log in

2 answer(s)
W
Wataru, 2020-08-11
@wataru

This is a task of finding connected components in a graph. You have a bipartite graph, but that doesn't matter. Vertices - emails and users, edges - correspondence of the user to the email. Solved by depth-first or breadth-first. Both solutions are linear in the number of edges (in your case, the total number of emails).
Renumber all emails and all users.
The code will be simpler if emails and users are stored in the same number space.
This is implemented with one hashMap, which will give the number by line, and one array of strings, which will store the original line by number. You will also need a boolean array to store whether the given node is a user or an email. When you enter, you get some string, and call the GetID(s, is_user) function from it, which checks if the given string is in the map. If there is, returns the number. If not, it appends the string to the array of strings, writes its index to the map, and returns it.
When you enter - build a graph.
Store the edges in an adjacency list - an array of arrays or lists, where for each vertex number you build a list of all associated with it.
When you enter, you have the number of the user vertex and you read emails and translate their numbers. At the same time, add the email number to the list for the user and vice versa.
Get an array of "bypassed" labels for all vertices. It will be int. 0 - unvisited vertices, otherwise the number of the connected component.
Run DFS/BFS on this graph from each vertex that has not yet been traversed in the loop and mark all reachable vertices with a new number (can be passed as a second parameter to DFS). You can immediately fill in the response structure during the crawl - one number for the user and a list for emails. Or, after a loop with DFS, you can create an array of lists for the answer, go through the array and shove the numbers of vertices into lists. Use a boolean array to figure out which top is the user and which is the email. From users, take only one representative, and push all emails into the list in response. Then output by converting the numbers to strings using the available array.

S
Sergey Tikhonov, 2020-08-11
@tumbler

In my opinion, building a reverse index "email - list of users" will just take linear time (if the insertion into the hashmap is linear), along the way, you can mark those emails in which there is more than one element in the list.

Didn't find what you were looking for?

Ask your question

Ask a Question

731 491 924 answers to any question