What is the fastest way to check a database for the presence of multiple rows?
There is a database, and in the future it will be quite large. The point is that before processing a ~2 MB text file, I need to select from it only those lines that are not yet in the database. Is there a faster way than simply fetching the data and comparing it row by row? Thank you.
As far as I know, MySQL already executes queries using efficient algorithms, so the only reasonable option is to take a line, run
SELECT `id` FROM `table` WHERE `field`="$our_string"
and look at the result.
A 2 MB text file is not that much. I would parse the text file into a separate temporary table (it can even be an in-memory one), and then do an OUTER JOIN from it to the main table: the rows that are in the file but not in the main table come back with NULL, which is how you select what should be added to the main table.
In my opinion, this is a reasonably sound and efficient approach.
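A minimal sketch of what this could look like (the table and column names here are invented for illustration):

CREATE TEMPORARY TABLE lines_tmp (
    line VARCHAR(255) NOT NULL,
    INDEX (line)
) ENGINE = MEMORY;

-- ...parse the file and load its lines into lines_tmp...

-- Rows that are in the file but not in the main table come back
-- with NULL on the right side of the join:
SELECT t.line
FROM lines_tmp AS t
LEFT JOIN main_table AS m ON m.line = t.line
WHERE m.line IS NULL;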
Of course, there must be indexes on the corresponding fields in both tables, and beyond that it all depends on a few points. First, if the strings are compared for exact equality but are longer than 32 bytes, it makes sense to add extra fields to the tables, store the MD5 hashes of the strings in them, and index by the hashes.
If something in the strings can vary, for example the character case, then the strings must be normalized to a single form before the hash is calculated. In addition to the hashes, you can still keep the original strings in the tables; it is only a question of which kind of comparison you need.
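For example, the hash fields could be added like this (a sketch with invented names; the hash is computed over the lower-cased string so that character case does not affect the comparison):

ALTER TABLE main_table
    ADD COLUMN line_md5 CHAR(32) NOT NULL,
    ADD INDEX idx_line_md5 (line_md5);

UPDATE main_table SET line_md5 = MD5(LOWER(line));

-- the join condition then compares short fixed-width hashes instead
-- of the long strings, e.g. ON m.line_md5 = MD5(LOWER(t.line))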
Are the strings stored in the database in full?
If so, I would store the MD5 of each string (unless, of course, the original strings are shorter than 32 characters), put a unique index on it, and be done.
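Something like this, as a sketch (the names are invented): with a unique index on the hash, the database itself rejects duplicate lines.

CREATE TABLE main_table (
    id INT AUTO_INCREMENT PRIMARY KEY,
    line TEXT NOT NULL,
    line_md5 CHAR(32) NOT NULL,
    UNIQUE KEY uniq_line_md5 (line_md5)
);

INSERT IGNORE INTO main_table (line, line_md5)
VALUES ('some line from the file', MD5('some line from the file'));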
If you are loading with LOAD DATA INFILE, then use LOAD DATA INFILE 'file_name' IGNORE: rows that duplicate an existing unique key value will be skipped.
If you load via an INSERT with a set of VALUES, then INSERT IGNORE can be used in the same way.
However, with both options, every failed attempt to write a duplicate still increments the AUTO_INCREMENT counter of the target table (if it has one).
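Both variants look roughly like this (the file path, table, and column names are placeholders); in both cases it is a unique index on the field that tells the server which rows count as duplicates:

LOAD DATA INFILE '/path/to/file.txt'
IGNORE INTO TABLE target_table (field1);

INSERT IGNORE INTO target_table (field1)
VALUES ('line one'), ('line two');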
To avoid this problem, you can load into a temporary table, and then do something like this:
INSERT INTO target_table (field1, field2)
SELECT field1, field2 FROM temporary_table
WHERE NOT EXISTS (
    SELECT 1 FROM target_table WHERE temporary_table.id = target_table.id
);
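The temporary table itself can be populated, for example, like this (the file path is a placeholder):

CREATE TEMPORARY TABLE temporary_table LIKE target_table;

LOAD DATA INFILE '/path/to/file.txt'
INTO TABLE temporary_table;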
I had a similar task: I had to add to the database the data from a file that was not already in it. There were 4,000,000 records in the file and the same number in the database, so a head-on comparison would have taken roughly 8 * 10^12 iterations.
Instead, with a single query I fetched from the database all the values of the fields being compared, sorted by those fields. Then I looped over all the records in the file and used binary search to find the ones that were not in the database. In total it came to no more than 5 * 10^7 iterations and one database query to find the records from the file that were not yet in the database.
But I had numeric data, which is easy to binary-search, and the comparison involved only 2 fields of the table.
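Since the answer describes the algorithm in words only, here is a rough sketch of the idea in Python; the names are invented, and it assumes the database values were fetched already sorted (via ORDER BY), as described above.

import bisect

def find_missing(file_values, db_values_sorted):
    # db_values_sorted comes from a single query with ORDER BY,
    # so bisect can binary-search it.
    missing = []
    for value in file_values:
        i = bisect.bisect_left(db_values_sorted, value)
        if i == len(db_values_sorted) or db_values_sorted[i] != value:
            missing.append(value)
    return missing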