How to calculate full name in columns with wrong order?

G

Gryllidae2021-10-22 09:20:04

SQL Server

Gryllidae, 2021-10-22 09:20:04

Given:

id  |  col1       |  col2         |  col3
--------------------------------------
1   |  Иванов | Иван          | Иванович
2   |  Иван     | Иванович  | Иванов
3   |  Петров | Сергей        | Алексеевич
4   |  Петров | Сергей        | Алексеевич
5   |  Сергей  | Алексеевич| Петров

Ivanov Ivan Ivanovich and Ivan Ivanovich Ivanov are the same person, but the id is different, the order of the full name is simply confused. How to put in the "correct order" and count (group) by full name?

Reply

Answer the question

In order to leave comments, you need to log in

3 answer(s)

A

Akina, 2021-10-22
@CricketIvan

Implementation example for MySQL 8+:

WITH 
cte1 AS ( SELECT id, col1 val FROM test UNION ALL 
          SELECT id, col2 FROM test UNION ALL
          SELECT id, col3 FROM test ),
cte2 AS ( SELECT id, GROUP_CONCAT(val ORDER BY val) FIO
          FROM cte1 
          GROUP BY id )

SELECT test.id, test.col1, test.col2, test.col3, GROUP_CONCAT(cte2_2.id) ids
FROM test
JOIN cte2 cte2_1 USING (id)
JOIN cte2 cte2_2 USING (FIO)
GROUP BY test.id, test.col1, test.col2, test.col3

Conclusion:

id	col1		col2		col3		ids
1	Иванов		Иван		Иванович	1,2
2	Иван		Иванович	Иванов		1,2
3	Петров		Сергей		Алексеевич	3,4,5
4	Петров		Сергей		Алексеевич	3,4,5
5	Сергей		Алексеевич	Петров		3,4,5

DEMO
If you also need normalization (i.e. strictly put the last name in col1, first name in col2 ...) - I think the best option is to download directories of first names, patronymics and surnames (completely accessible directories) and use them for categorization. 99% of the records will be completely processed by them, and the remaining ten records (where two or all three values are missing in the directories) can be processed manually, or rather, replenish the directory so that at the second start all 100% are recognized and categorized correctly.

will be executed on MS SQL Server, I can't tell the version.

For MS SQL, GROUP_CONCAT will be STRING_AGG(column) [WITHIN GROUP ( ORDER BY column)] instead . https://docs.microsoft.com/en-us/sql/t-sql/functio...

C

ComodoHacker, 2021-10-22
@ComodoHacker

You can only hardly request. More like code.
Analyze the endings of words, make lists of endings characteristic of surnames, names, patronymics. As well as exclusion lists. And all the same, the result will need to be checked with the eyes and corrected in some places.
Data cleaning is such an expensive and troublesome business. And it is desirable to conduct it before entering the base.

R

rPman, 2021-10-22
@rPman

since the goal is to find the same people with a confused order
, find records in which col1 matches col2 then col3, then col2 with col3, you will get a list of pairs in which this is confused, carefully, degenerate cases (for example, initials are written in the full name) or for example, oriental names (there are very complex composite names that are written down by someone in what they are), these moments are best processed with separate logic

select a.*,b.*
from table a inner join table b on
a.col1=b.col2
-- a.col1=b.col3
-- a.col2=b.col3