How can I identify similar phrases in a column of a SQL table?

SQL

Ernest Faizullin, 2016-08-11 22:04:46

The table holds thousands of well-known music artists who perform at different venues, so the name of a single band can be written in different ways.
The task is to group similar artist names by assigning each of them a group_id equal to the minimum artist id among the similar names. Ideally, it should look like this:
id 1137 Red Hot Chili Peppers in the Olympic - group_id 1137
id 1138 Red Hot Chili Peppers - group_id 1137
id 1139 Group Red Hot Chili Peppers - group_id 1137
id 1140 Red Hot (Live in CA) - group_id 1137
Right now, however, the group_id field is empty for many artists.
Currently a nightly cron script goes through exact matches on the artist name and groups them, but many artists remain ungrouped.
At the algorithm level, in general terms, how can similar phrases be identified and combined into groups?
Thank you all in advance!

2 answer(s)

ThunderCat, 2016-08-11
@ThunderCat

I would first go through the whole selection and count how often each word occurs. The top of that list will probably be noise such as "Group", "Live" and so on. Ideally these words should be removed entirely (or marked somehow, for example as {{live}}). Next, take one word (say, Red) and select every name that contains it. If the selection has many combinations of two or more words and few single-word ones, the single-word entries are most likely junk, and the multi-word ones contain at least a two-word name. After that it is pure statistics: count the occurrences of each word within the selection; a word that occurs often is part of the name, a word that occurs rarely is noise. The name is then assembled from the most frequently repeated words. For the details, seriously, sketch it out on a piece of paper and work out a coherent algorithm.
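The word-statistics idea above could be sketched roughly as follows. This is only an illustration under assumptions: the rows come from the question's example plus two made-up Metallica rows, the NOISE set stands in for a list picked by hand from the top of the global word counts, and bucketing by the first remaining word is a simplification of the "select by one word" step. All names here (ROWS, NOISE, word_counts, assign_groups) are hypothetical.

```python
import re
from collections import Counter, defaultdict

# Hypothetical sample rows (id, name); the last two are invented for illustration.
ROWS = [
    (1137, "Red Hot Chili Peppers in the Olympic"),
    (1138, "Red Hot Chili Peppers"),
    (1139, "Group Red Hot Chili Peppers"),
    (1140, "Red Hot (Live in CA)"),
    (2001, "Group Metallica"),
    (2002, "Metallica (Live)"),
]

# Assumed noise words, chosen by hand after inspecting the global counts.
NOISE = {"group", "live", "in", "the"}

def words(name):
    return re.findall(r"\w+", name.lower())

def word_counts(rows):
    # Step 1: global word frequencies; on a real table the top entries
    # are candidates for the noise list.
    return Counter(w for _, name in rows for w in words(name))

def tokens(name):
    return [w for w in words(name) if w not in NOISE]

def canonical_name(members):
    # Step 3: a word that occurs in more than half of the bucket's names
    # is treated as part of the artist name; rarer words are dropped.
    counts = Counter(w for _, toks in members for w in set(toks))
    ordered = []
    for _, toks in members:
        for w in toks:
            if w not in ordered and counts[w] * 2 > len(members):
                ordered.append(w)
    return " ".join(ordered)

def assign_groups(rows):
    # Step 2 (simplified): bucket names by their first non-noise word.
    buckets = defaultdict(list)
    for row_id, name in rows:
        toks = tokens(name)
        if toks:
            buckets[toks[0]].append((row_id, toks))
    result = {}
    for members in buckets.values():
        group_id = min(row_id for row_id, _ in members)
        name = canonical_name(members)
        for row_id, _ in members:
            result[row_id] = (group_id, name)
    return result

if __name__ == "__main__":
    print(word_counts(ROWS).most_common(5))
    for row_id, (group_id, name) in sorted(assign_groups(ROWS).items()):
        print(row_id, group_id, name)
```

On this sample, all four Red Hot Chili Peppers variants end up with group_id 1137. On real data, two different artists whose names start with the same word would land in one bucket, so the bucketing step would need to be refined.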

Denis, 2016-08-12
@prototype_denis

One option is k-means clustering (you know the number of clusters: it is the number of group IDs).
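As a rough illustration of the k-means suggestion, here is a minimal pure-Python sketch. k-means needs numeric vectors, and the answer does not say how to build them, so the character-bigram vectors used here are an assumption, as is seeding each cluster with one row whose group is already known. The sample rows and the function names (bigrams, vectorize, kmeans, group_ids) are hypothetical; a real project would more likely use a library implementation.

```python
from collections import Counter

# Hypothetical sample rows (id, name); the last two are invented for illustration.
ROWS = [
    (1137, "Red Hot Chili Peppers in the Olympic"),
    (1138, "Red Hot Chili Peppers"),
    (1139, "Group Red Hot Chili Peppers"),
    (1140, "Red Hot (Live in CA)"),
    (2001, "Group Metallica"),
    (2002, "Metallica (Live)"),
]

def bigrams(name):
    # Lowercase, replace punctuation with spaces, collapse whitespace,
    # then count overlapping two-character substrings.
    s = "".join(ch if ch.isalnum() else " " for ch in name.lower())
    s = " ".join(s.split())
    return Counter(s[i:i + 2] for i in range(len(s) - 1))

def vectorize(names):
    # One unit-length bigram-count vector per name (assumed representation).
    grams = [bigrams(n) for n in names]
    vocab = sorted(set().union(*grams))
    vectors = []
    for g in grams:
        v = [g[t] for t in vocab]
        norm = sum(x * x for x in v) ** 0.5 or 1.0
        vectors.append([x / norm for x in v])
    return vectors

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, seeds, iterations=10):
    # Plain Lloyd iterations; `seeds` are indices of the initial centroids.
    centroids = [list(vectors[i]) for i in seeds]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [
            min(range(len(centroids)), key=lambda c: dist2(v, centroids[c]))
            for v in vectors
        ]
        for c in range(len(centroids)):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def group_ids(rows, seeds):
    # group_id of a row = smallest id found in its cluster.
    ids = [row_id for row_id, _ in rows]
    labels = kmeans(vectorize([name for _, name in rows]), seeds)
    lowest = {}
    for row_id, label in zip(ids, labels):
        lowest[label] = min(row_id, lowest.get(label, row_id))
    return {row_id: lowest[label] for row_id, label in zip(ids, labels)}

if __name__ == "__main__":
    # Seed with rows 1138 and 2001 (indices 1 and 4), assumed already grouped.
    for row_id, group_id in sorted(group_ids(ROWS, seeds=[1, 4]).items()):
        print(row_id, group_id)
```

The result depends on the chosen seeds and on the vector representation, so on a real table the clusters would need to be checked by hand before writing group_id back.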