PHP
Name, 2021-06-17 16:02:06

How can I speed up a script that searches for similar records and removes them from the database (example script in the question)?

The script takes a very long time to run: at the current rate, processing three million records would take months.
This is the script I use to delete similar records.
The idea is this: take the records in id order and compare each one with all remaining records; if the similarity percentage exceeds the threshold, the matching record is deleted.

// Take the records in id order and scan the table for each one.
// Note: the mysql_* functions were removed in PHP 7; the original API is kept here.
$row_res = mysql_query("SELECT id, title FROM blog WHERE st = 0 ORDER BY id");

while ($row = mysql_fetch_assoc($row_res)) {
    $id = (int) $row['id'];

    // st = 2 means the record has already been processed
    mysql_query("UPDATE blog SET st = 2 WHERE id = $id");

    // Scan all unprocessed records and compare each with the current one
    $rows_res = mysql_query("SELECT id, title FROM blog WHERE st = 0 ORDER BY id");

    while ($rows = mysql_fetch_assoc($rows_res)) {
        similar_text($row['title'], $rows['title'], $perc);
        if (round($perc) > 73) {
            $dupId = (int) $rows['id'];
            mysql_query("DELETE FROM blog WHERE id = $dupId");
        }
    }
}


But a script like this takes far too long on three million records: it would finish in a couple of months.

Is there a way to speed this up?


2 answers
Rsa97, 2021-06-17
@Rsa97

Read all (`id`, `title`) pairs into an associative array at once and process it in memory, adding the ids of unique records to a new array. Then, in batches of, say, a hundred ids, set a uniqueness flag in the database, and finally delete all records where this flag is not set.
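The in-memory approach described above could be sketched roughly as follows. This is only an illustration under assumptions: the helper names `findUniqueIds` and `buildMarkQueries` are hypothetical, the value `st = 1` as the "unique" flag is an assumption (the question only uses 0 and 2), and the threshold of 73 is taken from the question's script. A record is kept unless it resembles an earlier kept record, which may differ slightly from what the original nested loops delete.

```php
<?php
// Hypothetical helper: $titles maps id => title, loaded once in id order.
// Returns the ids of records judged unique (not similar to an earlier kept one).
function findUniqueIds(array $titles, int $threshold = 73): array
{
    $kept = []; // id => title of records kept so far
    foreach ($titles as $id => $title) {
        $isDuplicate = false;
        foreach ($kept as $keptTitle) {
            similar_text($title, $keptTitle, $perc);
            if (round($perc) > $threshold) {
                $isDuplicate = true;
                break; // one match is enough, stop comparing
            }
        }
        if (!$isDuplicate) {
            $kept[$id] = $title;
        }
    }
    return array_keys($kept);
}

// Hypothetical helper: builds one UPDATE per batch of ids.
// ASSUMPTION: st = 1 is used as the "unique" flag; adjust to the real schema.
function buildMarkQueries(array $ids, int $batchSize = 100): array
{
    $queries = [];
    foreach (array_chunk($ids, $batchSize) as $chunk) {
        $idList = implode(',', array_map('intval', $chunk));
        $queries[] = "UPDATE blog SET st = 1 WHERE id IN ($idList)";
    }
    return $queries;
}
```

After running the generated queries, a single `DELETE FROM blog WHERE st <> 1` would remove everything that was not marked, again assuming that flag value.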
But still, in the worst case, there will be 3,000,000 * 3,000,000/2 = 4,500,000,000,000 comparisons. So, first of all, we need to speed up the similar_text function.
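One cheap way to cut down the number of `similar_text` calls is a length check. The percentage that `similar_text` reports is the number of matching bytes divided by the average of the two string lengths, times 100, and the number of matching bytes cannot exceed the shorter length. So the lengths alone give an upper bound on the percentage. The sketch below uses that bound to skip pairs that cannot pass the threshold; the function names `maxPossiblePercent` and `isSimilar` are hypothetical, and the threshold of 73 comes from the question.

```php
<?php
// Upper bound on the percentage similar_text() can report for $a and $b,
// derived from the byte lengths only.
function maxPossiblePercent(string $a, string $b): float
{
    $total = strlen($a) + strlen($b);
    if ($total === 0) {
        return 0.0; // similar_text() reports 0% for two empty strings
    }
    return 200.0 * min(strlen($a), strlen($b)) / $total;
}

// Same decision as round($perc) > $threshold in the question's script,
// but the expensive call is skipped when the lengths already rule the pair out.
function isSimilar(string $a, string $b, int $threshold = 73): bool
{
    if (round(maxPossiblePercent($a, $b)) <= $threshold) {
        return false;
    }
    similar_text($a, $b, $perc);
    return round($perc) > $threshold;
}
```

With a threshold of 73, this rules out any pair where the shorter title is less than roughly 58% of the length of the longer one, so sorting titles by length and comparing only within that window might reduce the number of pairs considerably. How much it helps depends on how the title lengths are distributed.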

Vitaly Kachan, 2021-06-17
@MANAB

Rewrite the similar_text logic as a function, or move the entire deletion logic into a stored procedure. Or at least move the marking logic there, so that the deletion itself can be done later by a separate script; that way you can first see exactly which data will be deleted and check it beforehand.
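The "mark first, delete later" idea can be illustrated in PHP as a dry run that only reports what would be deleted. This is a sketch, not a stored procedure: the function name `findDuplicatePairs` is hypothetical, and the titles are assumed to be already loaded as an id => title array in id order.

```php
<?php
// Hypothetical dry-run helper: nothing is deleted here.
// Returns a map of would-be-deleted id => id of the earlier record it resembles,
// so the pairs can be reviewed before any DELETE is issued.
function findDuplicatePairs(array $titles, int $threshold = 73): array
{
    $kept = [];       // id => title of records that would stay
    $duplicates = []; // duplicate id => matching kept id
    foreach ($titles as $id => $title) {
        foreach ($kept as $keptId => $keptTitle) {
            similar_text($title, $keptTitle, $perc);
            if (round($perc) > $threshold) {
                $duplicates[$id] = $keptId;
                continue 2; // move on to the next candidate record
            }
        }
        $kept[$id] = $title;
    }
    return $duplicates;
}
```

The returned pairs could be written to a log or a review table, and the actual deletion run only after the list has been checked.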
