Database
Vlad Mistetsky, 2011-05-17 22:17:37

Is it a good idea to use an MD5 hash as an ID (primary key)?

I'm designing a database whose entities include images. I want to avoid problems with duplicate names and with the holes left after deletions (which you get with a plain auto-increment id), so I'd like to use hashes as IDs everywhere. Given the expected number of objects in the database, the probability of collisions looks negligible.
What pitfalls might there be?
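A minimal sketch of the scheme the question describes: the ID is derived from the image bytes themselves, so identical content always maps to the same key (the function name is made up for illustration):

```python
import hashlib

def image_id(data: bytes) -> str:
    """Derive a 32-char hex ID from the image bytes themselves."""
    return hashlib.md5(data).hexdigest()

a = image_id(b"\x89PNG fake image bytes")
b = image_id(b"\x89PNG fake image bytes")
assert a == b  # identical content -> identical ID (deduplication for free)
```

This is exactly what makes the approach attractive, and also the source of most of the pitfalls the answers below discuss.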


14 answer(s)
Konstantin, 2011-05-17
@Norraxx

My Java teacher used to tell me: "Every time you use a hash as an id, you kill a kitten."

Jazzist, 2011-05-18
@Jazzist

Yes, easily. Don't be fooled by old teachings: modern database engines, even MyISAM, handle text indexes not significantly slower than integer ones.
If in doubt, just try to estimate the real load. Most likely this will not be the bottleneck. Remember the principle about the harm of premature optimization: "Premature optimization is the root of all evil."
I would question the expediency of the solution itself, though, not in terms of resource usage but in terms of ease of development. Getting the identifier of a newly inserted record, carrying it through foreign keys, handling exceptions: with auto-increment keys all of this is done on autopilot. That is, unless you really need extra functionality that requires additional keys (which, in turn, can also be served from a separate table, like the parent ones...).
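The "autopilot" workflow mentioned above can be sketched with SQLite (table and column names here are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE images (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT)")
conn.execute("CREATE TABLE tags (image_id INTEGER REFERENCES images(id), tag TEXT)")

# Insert a record and immediately get its auto-generated surrogate key.
cur = conn.execute("INSERT INTO images (name) VALUES (?)", ("cat.png",))
new_id = cur.lastrowid

# Carry the key through a foreign key in a related table.
conn.execute("INSERT INTO tags VALUES (?, ?)", (new_id, "pets"))
```

With a hash-based key you would instead have to compute the hash in application code and pass it around explicitly, which is the extra development friction the answer alludes to.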

Vitaly Zheltyakov, 2011-05-17
@VitaZheltyakov

An md5 hash should be used as an id only if it replaces a composite key (several key fields). Otherwise, use an unsigned int (or bigint) with auto_increment.
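The "replacing a composite key" case can be sketched as collapsing several key fields into one fixed-width value (the field values here are invented for illustration):

```python
import hashlib

def composite_key_hash(*fields) -> bytes:
    """Collapse several key fields into one 16-byte value.
    A separator prevents ('ab', 'c') colliding with ('a', 'bc')."""
    joined = "\x00".join(str(f) for f in fields)
    return hashlib.md5(joined.encode()).digest()

key = composite_key_hash("user42", "album7", "photo.jpg")
assert len(key) == 16  # fits a BINARY(16) column instead of a 3-column key
```

The separator detail matters: naive concatenation of the fields would let different field tuples produce the same hash input.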

Maxim Avanov, 2011-05-18
@Ghostwriter

For your task you need neither MD5 nor the auto-incrementing INT suggested above.
Use a binary field as the primary key and store 16-byte UUID4 values in it; you can expose them to users as 32-character hex strings.
About overhead: compared to an int64 auto-increment counter, the field is only 8 bytes larger (twice the size), which is no problem at all in your case.
Just make sure your indexes always fit entirely in physical memory rather than spilling, partially or entirely, to disk. By its nature a UUID produces evenly distributed values, so searches on such an index behave like random lookups. If the index is even partially on disk, that leads to numerous random seeks and very slow queries.
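The 16-byte-binary-inside / 32-char-hex-outside split described above is trivial with the standard library:

```python
import uuid

u = uuid.uuid4()
pk = u.bytes    # 16-byte value for a BINARY(16) primary key column
public = u.hex  # 32-character hex string shown to users in URLs

# The two forms round-trip losslessly.
assert uuid.UUID(hex=public).bytes == pk
```

The uniform distribution that causes the random-lookup effect is also what makes UUID4 collision-safe without any coordination between writers.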

Zorkus, 2011-05-18
@Zorkus

About "holes after deletions (if you just use an auto-increment id)": what exactly is the problem here? So there will be holes; let there be. These are surrogate keys.
Yes, if you dig deeper you can talk about the clustering factor on indexes and so on ;) but those are trifles.
I repeat: inside the DBMS I would stick with this surrogate-key approach.

Vladimir Chernyshev, 2011-05-18
@VolCh

If you have to compute the MD5 anyway and it will be guaranteed unique (that is, the logic does not allow adding two pictures with the same hash), then why not, provided the indexes of both the main and related tables fit in memory, if speed matters. In any case, as far as I understand, the hash is computed only when a picture is added to the database, and lookups by id will vastly outnumber such insertions.
But I would advise you to think carefully about whether you really need the hash to be unique, and whether that requirement might change later. Say you decide to switch the hash algorithm: changing a single unique field is a much cheaper operation than changing the primary key and all the indexes of related tables.
Also, if I understand the idea correctly (something like photo hosting, where the image hash is used in URLs), interesting situations can arise. One user uploads a picture and gets its new URL; then a second user uploads the same picture, the system detects the duplicate and returns the existing hash. The first user then deletes the picture; the system sees that links remain and removes only the link from the first user's albums, without deleting the picture itself. As a result, the first user can see that despite the deletion, the picture is still available at the old URL. Some won't care, but others may raise a cry about personal data, etc.
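The deletion scenario described above can be sketched as reference-counted deduplication; this toy in-memory store (all names invented for illustration) reproduces exactly the surprise the answer warns about:

```python
import hashlib

store = {}  # hash -> {"data": bytes, "refs": set of user ids}

def add_image(user: str, data: bytes) -> str:
    h = hashlib.md5(data).hexdigest()
    entry = store.setdefault(h, {"data": data, "refs": set()})
    entry["refs"].add(user)
    return h  # both uploaders get the SAME url for identical content

def delete_image(user: str, h: str) -> None:
    entry = store[h]
    entry["refs"].discard(user)
    if not entry["refs"]:
        del store[h]  # blob removed only when the last reference is gone

h1 = add_image("alice", b"pic")
h2 = add_image("bob", b"pic")   # duplicate detected, existing hash returned
delete_image("alice", h1)
assert h1 in store  # Alice "deleted" it, yet the old URL still resolves
```

Whether this behavior is a feature (storage savings) or a privacy bug is the policy question the answer raises.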

Zorkus, 2011-05-18
@Zorkus

What are good reasons not to use an autogenerated surrogate key?

Puma Thailand, 2011-05-18
@opium

Don't go down the path of idiocy.

xiWera, 2011-05-17
@xiWera

And what is the size of the field you would compute the md5 hash over? And what kind of data is it?

Alexey, 2011-05-18
@alexkbs

Use SHA-1, like Git does.
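Worth noting: Git does not hash raw file bytes. It prepends a `blob <size>\0` header before hashing, so a sketch of a Git-style content ID looks like this:

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Compute the ID Git assigns to a blob: SHA-1 of 'blob <size>\\0' + content."""
    header = f"blob {len(content)}\x00".encode()
    return hashlib.sha1(header + content).hexdigest()
```

This matches `git hash-object` for plain files, and the header means a file's Git ID differs from a plain SHA-1 of its contents.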

Andrey Shaydurov, 2011-05-18
@GearHead

Which DB? MongoDB has an excellent ID for this case: a 12-byte value built from a timestamp, a machine/node identifier (useful with sharding), the process ID, and only three bytes of auto-increment (more details here). Perhaps there is something similar for relational DBs.
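The layout the answer describes (4-byte timestamp, 3-byte machine id, 2-byte pid, 3-byte counter, as in the classic ObjectId scheme) can be sketched with the standard library; this is a simplified illustration, not MongoDB's driver code:

```python
import hashlib
import os
import socket
import struct
import time

_counter = int.from_bytes(os.urandom(3), "big")  # randomly seeded, as ObjectId does

def objectid_like() -> bytes:
    """Build a 12-byte id: 4-byte time + 3-byte machine + 2-byte pid + 3-byte counter."""
    global _counter
    _counter = (_counter + 1) % 0x1000000
    ts = struct.pack(">I", int(time.time()))
    machine = hashlib.md5(socket.gethostname().encode()).digest()[:3]
    pid = struct.pack(">H", os.getpid() % 0x10000)
    return ts + machine + pid + _counter.to_bytes(3, "big")

oid = objectid_like()
```

Because the timestamp leads, ids generated on one machine sort roughly by creation time, avoiding the random-lookup index problem that pure UUID4 keys have.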

ComodoHacker, 2011-05-19
@ComodoHacker

In addition to the pitfalls already mentioned (table and index sizes, performance) and other trifles, the most important one has not been named: using a natural key for the PK instead of a surrogate one. Plenty of smart articles and dumb holy wars on forums have been written on this topic; I won't rehash them here, I'll just give one scenario:
"Oops, the wrong picture was uploaded! We need to fix it urgently!... Oh, that was a month ago, there are already lots of links to it in a bunch of tables... Oh, there are constraints, a simple UPDATE won't do... Oh, some of the data is already in the archive, what do we do with that?.."

kapitansky, 2011-05-19
@kapitansky

I agree with Norraxx's teacher (May 17 at 23:58: "Every time you use a hash as an id, you kill a kitten.") and with VitaZheltyakov (an md5 hash should be used as an id only if it replaces a composite key of several fields; otherwise use unsigned int/bigint with auto_increment), and even then only if the total cost of searching by several key fields exceeds the cost of searching by the hash...
The smaller the key, the faster the search on it, so you should always strive to minimize its size.
Searching by a numeric key is often much faster (sometimes up to 4 times, possibly more) than searching by a string of the same length, and this holds not only for relational DBMSs but for the vast majority of programming languages (quite likely you will also need to search within the result set you fetched from the database).
Given your task, I would start with unsigned mediumint (up to 16,777,215 records) as the PK type; it can always be widened to int, and later to bigint (if that is ever needed).
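A rough micro-benchmark of the in-language half of that claim: looking up an integer key versus a 32-character hex key in a Python dict. This says nothing about a real DBMS index, and the measured ratio varies by machine; it only illustrates that key size and type affect lookup cost:

```python
import timeit

n = 100_000
int_index = {i: None for i in range(n)}
hex_index = {f"{i:032x}": None for i in range(n)}  # 32-char keys, like an MD5 hex digest

t_int = timeit.timeit(lambda: 99_999 in int_index, number=200_000)
t_hex = timeit.timeit(lambda: f"{99_999:032x}" in hex_index, number=200_000)
print(f"int lookup: {t_int:.3f}s, 32-char hex lookup: {t_hex:.3f}s")
```

For a real decision, benchmark the actual queries against the actual DBMS rather than trusting any rule-of-thumb multiplier.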
