MySQL
camokatik, 2010-10-09 18:51:54

Filtering out duplicate rows with MySQL?

Hello,
I need to deduplicate about 60 GB of string data; only about 25-30% of it is unique.
I decided to use MySQL with a unique index for this.
Questions:
1. Is it better to put the unique index on the string column itself (1-5 words), or is it more efficient to first compute a CRC32 of the string and put the unique index on the hash?
2. Is it possible to apply some kind of partitioning, not at the table level but at the database level?
For example, split the data by the first letter of the string (giving 28 physical databases) and fill only one of them at a time, thereby reducing RAM consumption?
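A sketch of the two options from question 1; the table and column names here are hypothetical, not from the original post:

```sql
-- Option 1: unique index directly on the string.
-- For 1-5 word strings the key is short enough to index as-is.
CREATE TABLE strings (
    id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    text VARCHAR(255) NOT NULL,
    UNIQUE KEY uq_text (text)
);

-- Option 2: unique index on a CRC32 hash of the string.
-- The key is only 4 bytes per row, but CRC32 has just 2^32 possible
-- values, so at hundreds of millions of rows distinct strings will
-- collide and be wrongly rejected as duplicates.
CREATE TABLE strings_hashed (
    id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    text VARCHAR(255) NOT NULL,
    hash INT UNSIGNED NOT NULL,
    UNIQUE KEY uq_hash (hash)
);
```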


3 answers
camokatik, 2010-10-09
@camokatik

The initial data is in text files and has yet to be loaded.
In general, a unique index can be added to an uncleaned table like this:

ALTER IGNORE TABLE `test` ADD UNIQUE (`text`)
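With the unique index in place, the text files can be loaded with the duplicate-skipping form of LOAD DATA: the IGNORE keyword silently drops rows that violate the unique key. The file path is hypothetical; the table and column names match the ALTER statement above:

```sql
LOAD DATA LOCAL INFILE '/path/to/data.txt'
IGNORE INTO TABLE `test`
LINES TERMINATED BY '\n'
(`text`);
```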

tzlom, 2010-10-09
@tzlom

So you want to create a table with a unique key and try to import all the rows into it?
I don't know about partitioning at the database level; I haven't seen such a feature.
I wouldn't recommend computing a hash: 1-5 words is a very short piece of text, and MySQL will handle indexing it on its own.
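A rough sanity check on the hash idea: CRC32 has only 2^32 ≈ 4.3 billion possible values, and by the birthday bound a 50% chance of at least one collision is reached after only about 77,000 random strings, so at this data volume a unique index on the hash alone would wrongly discard distinct strings. If a hash is used at all, it can only serve as a non-unique prefilter, with the full string still deciding uniqueness (table and column names hypothetical):

```sql
-- The CRC32 column narrows the lookup via an ordinary (non-unique)
-- index; the string comparison makes the final duplicate decision.
SELECT id FROM strings
WHERE hash = CRC32('some phrase') AND text = 'some phrase';
```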

camokatik, 2010-10-10
@camokatik

Unfortunately, it has to be MySQL: besides deduplication, additional functionality is needed. I settled on MySQL because it's the only DBMS I've worked with closely, and my experience with it is an order of magnitude greater than with any other.
