Clustering
AgentFSB, 2019-11-11 11:17:36

Clustering SMS messages and extracting the variable part of each cluster: what solutions are there?

I have a sample of SMS messages. The task is to build regular expressions that match these messages. The texts can be on completely different topics and follow no fixed pattern. I split the task into two parts.

  • The first is clustering.
  • The second is generating a regular expression for each cluster.

For clustering, I use my own algorithm, based on Oliver's algorithm for measuring string similarity. I also tried DBSCAN, but ran into the problem of choosing epsilon and minPts: a value that is too small for some texts is too large for others, and I could not find a workable middle ground. For example, take these texts:
"Raymond Adamson your are arrived. Phone - 12341234."
"Raymond Adamson your are arrived. Phone - 12341234."
"Peter Parker your are arrived. Phone - 12121212."

They should end up in the same cluster and the output should be something like "{var} your are arrived. Phone - {var}."
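To make the expected output concrete, here is a minimal sketch of the second step (turning one finished cluster into a template and then into a regular expression). It is not the algorithm described above: it swaps in Python's standard `difflib.SequenceMatcher` for the token alignment, and the helper names (`tokenize`, `merge_pair`, `collapse_vars`, `cluster_template`, `template_to_regex`) are hypothetical. It assumes the cluster is already correct and that no message contains the literal token `{var}`.

```python
import re
from difflib import SequenceMatcher

VAR = "{var}"  # placeholder name taken from the examples in the question

def tokenize(text):
    # Words, single punctuation marks, and whitespace runs; keeping the
    # whitespace lets "".join() rebuild the original spacing exactly.
    return re.findall(r"\s+|\w+|[^\w\s]", text)

def collapse_vars(tokens):
    # Turn "VAR VAR" and "VAR <space> VAR" into a single VAR.
    out = []
    for tok in tokens:
        if tok == VAR and out and out[-1] == VAR:
            continue
        if tok == VAR and len(out) >= 2 and out[-1].isspace() and out[-2] == VAR:
            out.pop()  # drop the whitespace between two placeholders
            continue
        out.append(tok)
    return out

def merge_pair(a, b):
    # Keep the tokens both sequences share; put VAR wherever they differ.
    out = []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(a[i1:i2])
        elif not out or out[-1] != VAR:
            out.append(VAR)
    return collapse_vars(out)

def cluster_template(messages):
    # Fold every message of the cluster into one running template.
    template = tokenize(messages[0])
    for msg in messages[1:]:
        template = merge_pair(template, tokenize(msg))
    return "".join(template)

def template_to_regex(template):
    # Literal parts are escaped; each placeholder becomes a lazy group.
    literals = [re.escape(part) for part in template.split(VAR)]
    return re.compile("(.+?)".join(literals))

messages = [
    "Raymond Adamson your are arrived. Phone - 12341234.",
    "Raymond Adamson your are arrived. Phone - 12341234.",
    "Peter Parker your are arrived. Phone - 12121212.",
]
template = cluster_template(messages)
print(template)  # {var} your are arrived. Phone - {var}.
pattern = template_to_regex(template)
print(pattern.fullmatch(messages[2]).groups())  # ('Peter Parker', '12121212')
```

The placeholder groups are deliberately loose (`.+?`); narrowing them to digits or names would need extra rules that this sketch does not attempt.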
or
assigned green Ford Escape A1234BC, +16507599755.
assigned red NISSAN V555QW, +16507512321.

They must also end up in the same cluster, and the result should be "assigned {var}".
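For the clustering step itself, the following is only a rough sketch of one simple alternative, not the similarity algorithm mentioned above: greedy "leader" clustering with a single similarity cutoff. The token-level `difflib.SequenceMatcher.ratio()` stands in for the string similarity measure, and the function names and the `threshold=0.4` default are assumptions picked by hand for these five messages. It avoids minPts, but the cutoff plays the same role as epsilon, so the tuning problem does not go away.

```python
import re
from difflib import SequenceMatcher

def tokenize(text):
    # Words and single punctuation marks; whitespace is ignored here.
    return re.findall(r"\w+|[^\w\s]", text)

def similarity(a, b):
    # Token-level ratio in [0, 1]; 1.0 means identical token sequences.
    return SequenceMatcher(None, tokenize(a), tokenize(b), autojunk=False).ratio()

def greedy_clusters(messages, threshold=0.4):
    # Each cluster is represented by its first message (the "leader").
    clusters = []
    for msg in messages:
        best, best_score = None, threshold
        for cluster in clusters:
            score = similarity(cluster[0], msg)
            if score >= best_score:
                best, best_score = cluster, score
        if best is None:
            clusters.append([msg])  # nothing similar enough: start a new cluster
        else:
            best.append(msg)
    return clusters

messages = [
    "Raymond Adamson your are arrived. Phone - 12341234.",
    "Raymond Adamson your are arrived. Phone - 12341234.",
    "Peter Parker your are arrived. Phone - 12121212.",
    "assigned green Ford Escape A1234BC, +16507599755.",
    "assigned red NISSAN V555QW, +16507512321.",
]
for cluster in greedy_clusters(messages):
    print(len(cluster), cluster[0])
```

With this cutoff the five messages fall into two groups of three and two. The two "assigned" messages share only a few tokens, so a cutoff of 0.5 would already split them, which illustrates how sensitive the result is to the threshold.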
The problem is clustering completely different texts correctly. Has anyone dealt with something similar?
Are there ready-made solutions or libraries for tasks of this kind?
