Storing hashes in MySQL and memory consumption: which algorithm is more efficient?

wanomgn, 2014-10-24 17:27:15

MySQL

I need to store hashes in MySQL. A lot of them: several billion.
What is the most efficient form to store them in, in terms of memory and performance? Presumably the less space each hash takes, the better?
With several billion rows, is a 32-bit hash out of the question?
That leaves 64 bits, for example murmur3. But that comes to about 20 digits in the hash column.
Are there hashes that are not purely numeric? Then the column would need far fewer than 20 characters. Or is a short alphanumeric hash worse than a long numeric one?

3 answer(s)

Дмитрий Энтелис, 2014-10-24
@DmitriyEntelis

Please clarify the requirements.
What kind of hashes are these, what are they computed from, what are they for, and how do you plan to access them?
There is a huge number of different hash functions.
Google "hash functions".
Since 20 characters will not fit in a BIGINT, you will have to store this as text, and in that case a character-based hash is definitely better.
You could perhaps additionally pack it into non-printable characters and store it as some kind of BINARY.
And yes, billions of rows in a single MySQL table is already extreme :)
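
To put numbers on the storage question, here is a minimal Python sketch (not from the thread) that encodes one 64-bit hash three ways and measures each form. The function name `encodings` is purely illustrative. One fact worth keeping in mind: the largest unsigned 64-bit value, 18446744073709551615, is 20 digits long and is also the upper limit of MySQL's BIGINT UNSIGNED, so an 8-byte numeric column or a BINARY(8) column can hold any 64-bit hash.

```python
import struct


def encodings(h: int) -> dict:
    """Return the same unsigned 64-bit hash in three storable forms."""
    if not 0 <= h < 2**64:
        raise ValueError("expected an unsigned 64-bit value")
    return {
        # Decimal digits as text: up to 20 bytes.
        "decimal text": str(h).encode("ascii"),
        # Zero-padded hexadecimal text: always 16 bytes.
        "hex text": format(h, "016x").encode("ascii"),
        # Big-endian packed integer: always 8 bytes.
        "packed binary": struct.pack(">Q", h),
    }


if __name__ == "__main__":
    largest = 2**64 - 1
    for name, value in encodings(largest).items():
        print(f"{name:14} {len(value):2d} bytes")
```

The packed form is what a BINARY(8) column would hold. It takes less than half the space of the decimal text, and the difference is multiplied by every row and every index entry.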

The table is a "lookup table" with two columns:
1. hash
2. text, up to 200 characters
Given a ready-made hash, we query the table to find out which row it corresponds to.

1. There is no need to keep the content and the hash in one table. The right layout is two tables: (hash, text_id) and (text_id, text).
2. A hash never answers "definitely yes" anyway. It only answers "maybe yes".
Given that, I would choose a hash function whose output fits in an UNSIGNED INT and then additionally verify the result against the text itself.
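
The two-table idea with a final text check can be sketched roughly as follows. This is only an illustration: two in-memory dictionaries stand in for the MySQL tables, `zlib.crc32` is an arbitrary stand-in for "some hash that fits in UNSIGNED INT", and the function names are made up for the example.

```python
import zlib
from typing import Optional

# Stand-ins for the two tables (assumption: real code would query MySQL).
hash_to_ids = {}  # (hash, text_id): one hash may map to several text ids
id_to_text = {}   # (text_id, text)


def hash32(text: str) -> int:
    """32-bit hash of the text; the value is in the range 0 .. 2**32 - 1."""
    return zlib.crc32(text.encode("utf-8"))


def add_text(text: str) -> int:
    """Store the text and register its id under the text's hash."""
    text_id = len(id_to_text) + 1
    id_to_text[text_id] = text
    hash_to_ids.setdefault(hash32(text), []).append(text_id)
    return text_id


def find_text(text: str) -> Optional[int]:
    """The hash only says "maybe"; comparing the text itself says "yes"."""
    for text_id in hash_to_ids.get(hash32(text), []):
        if id_to_text[text_id] == text:
            return text_id
    return None
```

Because every hash match is confirmed against the stored text, a collision costs an extra comparison but never produces a wrong answer. Note that this check needs the original text at lookup time.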

Rsa97, 2014-10-25
@Rsa97

1. There are different ways to store this. For example, you can split the table into 2ⁿ tables by the first bits of the hash (table_00, table_01, ... table_ff).
2. A hash as such does not guarantee a one-to-one mapping, so it is quite possible for two different strings to have the same hash. For an n-bit hash, hashing about 2^(n/2) strings produces at least two equal hash values with a probability of roughly 39% (the birthday paradox). Using the table, you can estimate the collision probability for your number of rows at different hash lengths.
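
Both points can be tried out with a short Python sketch. It is an illustration only: the function names and default parameters are assumptions, and the probability uses the standard birthday-bound approximation 1 - exp(-k(k-1) / 2^(n+1)) for k rows and an n-bit hash, which assumes the hash values are uniformly distributed.

```python
import math


def shard_table(hash_value: int, hash_bits: int = 64, prefix_bits: int = 8) -> str:
    """Name of the table for a hash, chosen by its leading prefix_bits bits."""
    prefix = hash_value >> (hash_bits - prefix_bits)
    width = (prefix_bits + 3) // 4  # hex digits needed for the prefix
    return f"table_{prefix:0{width}x}"


def collision_probability(rows: int, hash_bits: int) -> float:
    """Approximate chance that at least two of `rows` hashes are equal."""
    exponent = -rows * (rows - 1) / 2.0 ** (hash_bits + 1)
    # expm1 keeps precision when the probability is very small.
    return -math.expm1(exponent)


if __name__ == "__main__":
    print(shard_table(0xFF00000000000000))
    for bits in (32, 64, 128):
        p = collision_probability(3_000_000_000, bits)
        print(f"{bits:3d}-bit hash, 3 billion rows: {p:.3g}")
```

With 8 prefix bits this yields 256 tables, matching the table_00 ... table_ff naming. Running the loop for your own row count shows how quickly the collision risk drops as the hash gets longer.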

wanomgn, 2014-10-24
@wanomgn Question author

The table is a "lookup table" with two columns:
1. hash
2. text, up to 200 characters
Given a ready-made hash, we query the table to find out which row it corresponds to.