Question about MySQL indexes?

Hint2012-03-01 12:56:33

MySQL

Hint, 2012-03-01 12:56:33

Large table (millions of records). Rows are tasks for processing. When a record is added, the “processed” flag is set to 0. Then a separate service processes the records and changes the flag to 1. Records are not deleted, the flag from state 1 never goes to state 0. An index is set on the flag field. It is required to remember the time when the file was processed (unix_timestamp in int).
The question is, is it possible to remove the flag from the table and make a selection only by processing time (processed_time = 0)? Or will an index on a 0/1 flag field work more efficiently than on an int field?

Answer the question

In order to leave comments, you need to log in

9 answer(s)

Vitali Borovik, 2012-03-01
@WAYS

Probably, in the case of such an amount of data, I would make a separate table (archive) where I threw all the processed records, if they never become 0, it makes sense to waste time on them. And no indexes are needed at all.

zuborg, 2012-03-01
@zuborg

It's hard to say unequivocally, there are different mutually compensating effects.
On the one hand, you can get rid of the field with the flag that is no longer needed - this is a plus, on the other hand, you will have to add an index on the field with a timestamp, which already takes more than one byte, like the previous one - this is a minus (the index will take up a lot of memory). The indexes themselves will work equally efficiently for the specified selection.
If I were you, I would create a test plate and make a benchmark, at the same time, based on the results, you can unsubscribe an article on Habr, many would be interested.
Also, the question of choosing a storage engine is not trivial, innodb will have a very noticeable overhead in place for row headers if the table structure is very simple and the row size is small.
Needs to be tested...

Vitaly Zheltyakov, 2012-03-01
@VitaZheltyakov

An index on a field with two values is inefficient initially.

Grigory Peretyaka, 2012-03-01
@Peretyaka

There are opinions with which I do not quite agree. Let's take a look at what an index is in a simplified form.
If the index on the field is not set, then it goes through all the records and compares, if the flag = 1, then select the record. With a million records, there will be a million such comparisons.
If you put an index, it will reduce the list to unique values and it will look like this:
0 - rows where the entry is 0
1 - rows where the entry is 1
That is, there will be only 2 comparisons.
If you combine the fields, then in the index there will be 999,900 unique temporary values and one - 0, corresponding to 100 records, that is, there will be 999,901 comparisons.
(if the field is unique at all, then the benefit is that it will stop the search after the first element it finds)
(this is far from everything that indexes work in, besides, there are different types, nuances)
How significant is the comparison of integers, even multiplied by a million, I can’t judge, as well as the increase in hard disk accesses. But it will definitely be longer, the only question is how much, here it is already necessary to test.
Another situation with NULL. IS NULL - will always work without comparison at all. This is where I would do it.

Vampiro, 2012-03-01
@Vampiro

To optimize this table, I would make a breakdown into PARTITIONS
either by dates or by the processing field Index by 0/1 is fast, since the records you need are stored first in HASH (and I sincerely hope that you made a HASH index, not BTREE)
CREATE TABLE demo (col1 INT, tasks VARCHAR(255), col3 DATE) PARTITION BY KEY(col3) PARTITIONS 4;
PARTITION BY LIST (id) ( PARTITION r0 VALUES IN (0), PARTITION r1 VALUES IN (1),

Vitaly Zheltyakov, 2012-03-02
@VitaZheltyakov

Tested:
— table with 1,000,000 rows, innobd engine;
— Structure of fields id, flag (TINYINT), time(timestamp), data(char);
— The flag field has two possible values;
— 10 lines have a different flag field value;
- Query caching is disabled.
Bottom line:
- Indexes really help to filter out the results even if there are few values. I got an average time of 1.89 seconds without indexes, and 0.015 seconds with indexes.
- Using NULL as one of the values gave a speed increase of 0.005 sec. on one million records without indexes, with indexes grew too small.

Fastto, 2012-03-01
@Fastto

It seems to me that the following option will be effective (if I'm not mistaken, a similar solution in the symphony tasks):
do not use either the flag or the time for sampling, but simply the second, processing service "remember" the last successfully processed identifier, and make the selection
where ID > @yourLastCalcedID
This it will be most efficient, of course, if the processing by the service proceeds sequentially, linearly ID by ID, without gaps and branches between processing elements

Artyushov, 2012-03-01
@Artyushov

if your field accepts a limited number of values, (enumeration), then a bit scale will be built on it. That is, for each value, a vector of 0 and 1 will be stored with the length of the number of records. I think it's more clear.
Thus, you can immediately get the numbers of records with a given field value.

hom9k, 2012-03-01
@hom9k

Most likely, there will be no difference, since in both cases the number of records with flag = 0 or processed_time = 0 is the same. An index on processed_time will only allow you to do statistical queries with the number of processed jobs in time intervals much faster (do you need this?).
The index has such a value as cardinality (literally from English: "power"). And the closer the value of this property is to the total number of records, the faster the index search works.