I
I
Ilya2016-07-24 21:50:05
PHP
Ilya, 2016-07-24 21:50:05

How to determine "crazy" by means of MYSQL/PHP?

Hello.
There is MySQL and an application in UTF-8.
There is a table (suppose user: id, name) that contains a large number of records, the values ​​of which are obtained from third-party sources.
It happens that such sources change their encoding, suppose to CP1251, but I still continue to write such values ​​​​(without prior conversion), for example:

РњС‹ Р'ыбираем
Мало Мне

Actually, a question. What is the best way to determine all such entries to delete them? I note, not treatment, namely removal.
In my solution, I make a list of forbidden substrings, for example: Р° , after which I get all the records from the table and use PHP to check if such a substring is in the name field . This approach works well, but with a certain error, and if you increase the base of forbidden characters, the possibility of hitting whole data also increases.
I'd love to hear possible solutions.
Thank you so much! ;-)

Answer the question

In order to leave comments, you need to log in

2 answer(s)
I
Ilya, 2016-07-27
Hrebet @hrebet

As a result, as planned, I collected a list of the most popular substrings, I did it:

$chars = [
  'Ђ',
  'Рњ',
  'Р°',
  'Рµ',
  'Рџ',
  'Р№',
  'СЏ',
  'Р‘',
  'Рђ',
  'Р»',
  'Рї',
  'С€',
  '¶',
  'Р”',
  '‡',
  'Р›',
  'Р•',
];

I cycled through all the records to determine the occurrences of such substrings using regular expressions or substr_count and, if found, got rid of such positions.
I note that I was not interested in the treatment of such lines, otherwise it would be possible to do with the usual iconv .
;-)

N
nozzy, 2016-07-25
@nozzy

Characters can be added to the regexp condition

select
text_column 
from
your_table
where not text_column regexp '[A-Za-z0-9]';

Didn't find what you were looking for?

Ask your question

Ask a Question

731 491 924 answers to any question