SQL
zhaar, 2019-03-18 17:56:52

Does it make sense to use checksum or hashbyte in JOINs?

Actually a "simple" question - let's consider three situations with LEFT JOIN. As initial data there is a small table (tbl1) of 10k records with 3 columns: a date (date), text (varchar(1000)), and a number (int), as well as a second table (tbl2) of 100k records with a bunch of columns, including the ones needed for the join.
The task is to find differences in the data and display them to the user.
What will work faster? The question is purely theoretical.
1) A left join in which the link for each field is specified explicitly, i.e.:

left join tbl2 on tbl1.date=tbl2.date and tbl1.text=tbl2.text and tbl1.size=tbl2.size

2) A left join where CONCAT over all fields is used to build one long string to join on, i.e.:
left join tbl2 on concat(tbl1.date,tbl1.text,tbl1.size) = concat(tbl2.date,tbl2.text,tbl2.size)

3) A left join where a hash function or checksum is used, i.e.:
left join tbl2 on checksum(tbl1.date,tbl1.text,tbl1.size) = checksum(tbl2.date,tbl2.text,tbl2.size)


4 answer(s)
Valentin Avramko, 2019-03-21
@zhaar

Definitely the first option. And the genuinely correct approach is to also build indexes on those fields.

Sumor, 2019-03-18
@Sumor

The surest way to find out is to create a sample and try it.
I suppose that in most cases the first option will win, although for tables with fewer than tens of thousands of records the difference will be within the margin of error. The database will most likely use the date and the number effectively for pre-filtering before comparing the text; the second option is therefore a loser in any case.
Computing something at query time for the entire table is not a good idea at all.
In some cases, if there really is a lot of data, you can calculate the hash in advance, build an index on it, and search on that.
P.S. Out of academic interest, you can construct a database and data such that each of the three options wins.
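The "create a sample and try it" advice can be sketched roughly like this (a sketch for MSSQL; table and column names follow the question, the data-generation trick is just one common way to fill rows):

```sql
-- Generate ~10k random test rows for tbl1 via a cross join of system views,
-- then time each of the three LEFT JOIN variants.
SELECT TOP (10000)
       DATEADD(DAY, ABS(CHECKSUM(NEWID())) % 365, '2019-01-01') AS [date],
       CONVERT(varchar(1000), NEWID())                          AS [text],
       ABS(CHECKSUM(NEWID())) % 100000                          AS [size]
INTO tbl1
FROM sys.all_objects a
CROSS JOIN sys.all_objects b;

SET STATISTICS TIME ON;
-- run each of the three LEFT JOIN variants here and compare elapsed times
```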

zhaar, 2019-03-21
@zhaar

From what I checked myself, the results are almost the same. The only thing I found out is that it is better not to use CHECKSUM for joins, because it often produces "collisions", which cause the joined rows to be duplicated.
Adding indexes helps a lot on very large tables with a lot of data (for example, covering more periods than specified in the WHERE clause).
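The collisions mentioned above are easy to detect with a grouping query - a sketch assuming the three-column tbl1 from the question:

```sql
-- Find CHECKSUM collisions: checksum values shared by more than one
-- distinct row. Any result means a join on CHECKSUM alone duplicates rows.
SELECT CHECKSUM([date], [text], [size]) AS cs,
       COUNT(*)                         AS row_cnt
FROM tbl1
GROUP BY CHECKSUM([date], [text], [size])
HAVING COUNT(DISTINCT CONCAT([date], '|', [text], '|', [size])) > 1;
```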

Ruslan., 2019-03-22
@LaRN

You can create a field in the tbl table in which to store checksum(tbl.date, tbl.text, tbl.size) and index that field.
If CHECKSUM gives too many collisions, you can try another hash; in MSSQL, for example, there are quite a few options:
MD2 | MD4 | MD5 | SHA | SHA1 | SHA2_256 | SHA2_512
https://docs.microsoft.com/ru-ru/sql/t-sql/functio...
But a hash is only good when you need to answer the question = or <>; for other conditions you still have to go to the table fields.
I had a case where matches had to be found across 17 analytic fields, and there this method sped up the search 10x compared to an ordinary search over a bunch of fields. In your case there are only 3 fields, so a join by the fields plus a selective index on at least one of them is probably enough.
It is better not to build an index on the varchar(1000) field, but on the other two it is quite possible.
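A sketch of the stored-hash approach for MSSQL, assuming the tbl1 layout from the question (the column and index names here are illustrative, not from the original):

```sql
-- Persisted computed column with a stronger hash than CHECKSUM.
-- HASHBYTES takes a single string input, so the fields are
-- concatenated with a separator to avoid accidental equal strings.
ALTER TABLE tbl1
ADD row_hash AS
    HASHBYTES('SHA2_256',
              CONCAT(CONVERT(varchar(10), [date], 120), '|',
                     [text], '|',
                     CONVERT(varchar(12), [size])))
    PERSISTED;

CREATE INDEX IX_tbl1_row_hash ON tbl1 (row_hash);

-- With the same column on tbl2, the diff query joins on one indexed
-- column; the extra field comparisons guard against collisions.
SELECT t1.*
FROM tbl1 t1
LEFT JOIN tbl2 t2
       ON t2.row_hash = t1.row_hash
      AND t2.[date] = t1.[date]
      AND t2.[text] = t1.[text]
      AND t2.[size] = t1.[size]
WHERE t2.row_hash IS NULL;
```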
