Chunks: break and reassemble, how does it work?

Data storage

antobra, 2018-09-12 16:38:51

I recently learned that some companies that store user files split each file into chunks of a fixed size and look each chunk up in their database: if the chunk is not there yet, it is saved; if it is, the new copy is discarded so that data is not duplicated. As a result, the database record for an uploaded file holds a list of links to all of its chunks, which together make up the file the user uploaded.
That raised a question for people who understand this: how are the chunks glued back together before the file is served to the user? Logically, a request looks like this: we query the database for the requested file, get a list of links to all of its chunks across the cluster/servers, and then somehow serve the file(s). How does the gluing or serving actually happen, and does it happen at all? Does one server copy the chunks to itself and serve the assembled file?
Thanks for your time and your reply.

1 answer(s)

Roman Mirilaczvili, 2018-09-12
@antobra

This appears to be about data deduplication.
Each file can be split into N segments of equal size plus one shorter remainder segment (when the file size is not a multiple of the segment size). If the segments are numbered in order, and the database stores each sequence number together with the hash of its segment, while the segment data is saved in files named by those hashes, then restoring the file only requires looking up all segments that belong to it in the database and reading the corresponding data from the segment files in sequence order. It does not matter which storage nodes hold the segment files; what matters is that a single server assembles them into one file.
Deduplication pays off when pieces of content repeat often. For example, document archives tend to contain many repetitions (duplicates of entire files or of parts of them). It can also give a good gain when the same video files appear in different parts of an archive, although the chances of finding duplicate pieces among different video files are very small.
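
The scheme described above can be sketched roughly as follows. This is only a minimal in-memory illustration, not how any particular storage service does it: the names chunk_store, file_index, put_file, get_file and stream_file are hypothetical, two dictionaries stand in for the segment files and the database, SHA-256 is an assumed choice of hash, and the 4-byte CHUNK_SIZE is deliberately tiny so the behaviour is easy to follow.

```python
import hashlib

# Assumption: a tiny segment size for demonstration; real systems use far larger ones.
CHUNK_SIZE = 4

# Stand-in for segment files named by their hash: hash -> segment bytes.
chunk_store = {}
# Stand-in for the database: file name -> ordered list of segment hashes.
file_index = {}


def put_file(name, data):
    """Split data into fixed-size segments and store each distinct segment once."""
    hashes = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]  # the last one may be shorter
        digest = hashlib.sha256(chunk).hexdigest()
        # Save the segment only if this hash has not been seen before.
        chunk_store.setdefault(digest, chunk)
        hashes.append(digest)
    # The position in the list plays the role of the sequence number.
    file_index[name] = hashes


def get_file(name):
    """Reassemble the whole file by reading its segments in sequence order."""
    return b"".join(chunk_store[h] for h in file_index[name])


def stream_file(name):
    """Yield segments one by one, so the full file never has to be held in memory."""
    for h in file_index[name]:
        yield chunk_store[h]
```

For instance, storing b"abcdabcdxy" under some name records three segment hashes for the file but keeps only two segments in chunk_store, because "abcd" occurs twice. The stream_file variant hints at an answer to the original question: under these assumptions the assembling server does not have to copy every segment to itself first, it can fetch each segment from wherever it lives and pass it on to the client in order.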
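
To get a feel for when deduplication helps, one can count how many segments a set of files produces in total and how many of them are distinct. The helper below is a hypothetical sketch under the same assumptions as before (fixed-size segments, SHA-256 as the hash); count_chunks and its chunk_size parameter are illustrative names, not part of any real tool.

```python
import hashlib


def count_chunks(files, chunk_size=4096):
    """Return (total segments, distinct segments) for an iterable of byte strings."""
    total = 0
    unique = set()
    for data in files:
        for offset in range(0, len(data), chunk_size):
            chunk = data[offset:offset + chunk_size]
            unique.add(hashlib.sha256(chunk).digest())
            total += 1
    return total, len(unique)
```

If the same file is passed in twice, the total doubles while the number of distinct segments stays the same, so only half of the segments need to be stored. For files with unrelated content the two numbers are equal and nothing is saved.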