Python
kAIST, 2012-06-06 16:12:45

Python: how to pack thousands of files into one?

Greetings!
I need to build a container for files. The requirements:
Thousands of small files (5-100 KB) must be packed into one large file and extracted from it, and it must be possible to modify or delete an individual file.
What didn't work:
zipfile and tarfile: they cannot delete or modify files inside the archive.
Databases (sqlite3, dbm, etc.):
They would be fine, but on one of the platforms where the application will be used nothing can be compiled, so only pure-Python scripts are available.
Are there any options?

10 answer(s)
bagyr, 2012-06-06
@bagyr

Make something like the CP/M file system: split each file into fixed-size chunks and write them to the archive, recording somewhere which chunk belongs to which file. On top of that, implement iterating over files, saving, deleting, reading, and writing.
An epic kludge, but I can't think of a simpler way to keep the archive from growing during file operations without rewriting it.
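
A minimal sketch of the chunk idea in pure Python. The class name `ChunkStore`, the 4096-byte block size, and the JSON sidecar index file (`.idx`) are all assumptions made for illustration; the answer only says the chunk map is saved "somewhere".

```python
import json
import os

BLOCK = 4096  # assumed fixed chunk size


class ChunkStore:
    """Files are split into fixed-size blocks; freed blocks are reused."""

    def __init__(self, path):
        self.path = path
        self.index_path = path + ".idx"  # hypothetical sidecar index
        if os.path.exists(self.index_path):
            with open(self.index_path) as f:
                state = json.load(f)
        else:
            state = {"files": {}, "free": [], "blocks": 0}
            open(path, "wb").close()  # start with an empty data file
        self.files = state["files"]    # name -> [size, [block numbers]]
        self.free = state["free"]      # block numbers available for reuse
        self.blocks = state["blocks"]  # total blocks allocated so far

    def _save(self):
        state = {"files": self.files, "free": self.free, "blocks": self.blocks}
        with open(self.index_path, "w") as f:
            json.dump(state, f)

    def write(self, name, data):
        self.delete(name)  # release the old blocks when overwriting
        chunks = []
        with open(self.path, "r+b") as f:
            for pos in range(0, len(data), BLOCK):
                if self.free:
                    n = self.free.pop()
                else:
                    n = self.blocks
                    self.blocks += 1
                f.seek(n * BLOCK)
                # Pad the last chunk so every block has the same size.
                f.write(data[pos:pos + BLOCK].ljust(BLOCK, b"\0"))
                chunks.append(n)
        self.files[name] = [len(data), chunks]
        self._save()

    def read(self, name):
        size, chunks = self.files[name]
        parts = []
        with open(self.path, "rb") as f:
            for n in chunks:
                f.seek(n * BLOCK)
                parts.append(f.read(BLOCK))
        return b"".join(parts)[:size]  # drop the padding

    def delete(self, name):
        entry = self.files.pop(name, None)
        if entry is not None:
            self.free.extend(entry[1])
            self._save()
```

With this layout, deleting a file only returns its blocks to the free list, so a later write fills the holes instead of growing the archive. The price is up to one block of padding per file.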

Denis, 2012-06-06
@uscr

A strange task, but maybe pickle will help? Read the files into a list and save the list to a file.
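
Roughly like this. The helper names are made up, and a dict keyed by file name is used here instead of a plain list so a file can be looked up by name. Note that any change means loading and rewriting the whole container.

```python
import pickle


def save_container(path, files):
    """Write the whole {name: bytes} mapping to one file."""
    with open(path, "wb") as f:
        pickle.dump(files, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_container(path):
    """Read the whole mapping back into memory."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

Deleting a file is then `files = load_container(path); del files[name]; save_container(path, files)`.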

rPman, 2012-06-06
@rPman

Do you need efficient placement of files inside the container? Or is a two- to three-fold size overhead relative to the total volume of the files tolerable, given that you can run a rebuild of the entire archive when necessary?
habrahabr.ru/qa/10694/#answer_46206
The code is very simple, anyone can write it in an evening (the real work is in the variety of service utilities).

sergeypid, 2012-06-06
@sergeypid

I found a database like that written in pure Python: buzhug.sourceforge.net/

niko83, 2012-06-06
@niko83

How Facebook stores a huge number of photos: "Images began to be stored in large binary files (blobs), with the application given information about which file each photo is in and at what offset from the beginning (in effect, an identifier). At Facebook this service was called Haystack, and it turned out to be ten times more efficient than the "simple" approach and three times more efficient than the "optimized" one. As they say, everything ingenious is simple!"
Taken from here: www.xakep.ru/post/55510/

eaa, 2012-06-06
@eaa

Based on the answers above and my own experience asking a similar question: if the data is read many times and rarely changed (changed, not appended to), you can use tar. It reads quickly and can append to the end, but to change something you have to re-create the file, because a member cannot be modified in place. With frequent changes it is not an option.
Also look at dar. There was a discussion on its mailing list with proposals to allow modifying an archive, but the author strongly resisted and refused to add such functionality. I don't know whether anything has changed since then.
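
A small sketch of the append-only tar variant using the standard `tarfile` module (the helper names `tar_append` and `tar_read` are made up for the example). Mode `"a"` works only for uncompressed archives and creates the file if it does not exist.

```python
import io
import tarfile


def tar_append(archive, name, data):
    """Append one member to an uncompressed tar archive."""
    info = tarfile.TarInfo(name=name)
    info.size = len(data)
    with tarfile.open(archive, "a") as tar:
        tar.addfile(info, io.BytesIO(data))


def tar_read(archive, name):
    """Return the contents of a member by name."""
    with tarfile.open(archive, "r") as tar:
        return tar.extractfile(name).read()
```

If the same name is appended twice, `tarfile` treats the last occurrence as the current one when looking a member up by name, so appending a new version works as a crude update. The old copy stays in the archive and the file keeps growing until it is rebuilt.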

kAIST, 2012-06-06
@kAIST

Thanks, everyone. In the end I built my own kludge.
Only a small part of the data changes constantly (the list of files, flags, file attributes), and deleting a file is a very rare operation, so I did this:
The contents of each file are simply written into the container one after another. At the end of the container there is a metadata block: JSON with whatever is needed, plus a list of files with the start offset and length of each. After that block, at the very end, comes the size of the block.
To add a new file, seek to the start of the metadata block, write the file there, and then append the updated metadata block after it. To change only the metadata, rewrite it at the end of the container without touching the rest.
Deletion is rarely used, and as a rule many files need to be deleted at once. That can be done by going through the whole container, writing what is needed to a new one and skipping the rest.
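
A minimal sketch of the layout described above, not the actual code. The function names, the `"files"` key in the JSON, and the 8-byte little-endian size footer are assumptions made for illustration.

```python
import json
import os
import struct

FOOTER = struct.Struct("<Q")  # assumed: 8-byte size of the JSON block


def _load_meta(f):
    """Return (metadata dict, offset where the metadata block starts)."""
    f.seek(0, os.SEEK_END)
    end = f.tell()
    if end == 0:  # brand-new, empty container
        return {"files": {}}, 0
    f.seek(end - FOOTER.size)
    (meta_len,) = FOOTER.unpack(f.read(FOOTER.size))
    meta_start = end - FOOTER.size - meta_len
    f.seek(meta_start)
    return json.loads(f.read(meta_len).decode("utf-8")), meta_start


def _write_meta(f, meta, meta_start):
    """Write metadata plus size footer at meta_start, drop anything after."""
    blob = json.dumps(meta).encode("utf-8")
    f.seek(meta_start)
    f.write(blob)
    f.write(FOOTER.pack(len(blob)))
    f.truncate()


def add_file(path, name, data):
    """Write data over the old metadata block, then re-append metadata."""
    mode = "r+b" if os.path.exists(path) else "w+b"
    with open(path, mode) as f:
        meta, meta_start = _load_meta(f)
        f.seek(meta_start)
        f.write(data)
        meta["files"][name] = [meta_start, len(data)]
        _write_meta(f, meta, meta_start + len(data))


def read_file(path, name):
    with open(path, "rb") as f:
        meta, _ = _load_meta(f)
        offset, length = meta["files"][name]
        f.seek(offset)
        return f.read(length)


def delete_files(path, names):
    """Rare operation: copy everything except `names` into a new container."""
    tmp = path + ".tmp"
    with open(path, "rb") as src, open(tmp, "w+b") as dst:
        meta, _ = _load_meta(src)
        new_meta = {"files": {}}
        for name, (offset, length) in meta["files"].items():
            if name in names:
                continue
            src.seek(offset)
            new_meta["files"][name] = [dst.tell(), length]
            dst.write(src.read(length))
        _write_meta(dst, new_meta, dst.tell())
    os.replace(tmp, path)
```

Usage would look like `add_file("data.bin", "a.txt", b"hello")`, then `read_file("data.bin", "a.txt")`, and occasionally `delete_files("data.bin", {"a.txt"})`. A crash between overwriting the old metadata and writing the new one would leave the container unreadable, so a real version would want some protection there.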

impass, 2012-06-06
@impass

"on one of the platforms where the application will be used nothing can be compiled"

Is it something embedded? There is surely a compiler for it, so it should still be possible to build one of the key-value databases.

Maxim Avanov, 2012-06-06
@Ghostwriter

Elliptics from Yandex does this kind of thing. It can also work as a share-nothing DHT. There are bindings for Python.

smashrod, 2012-06-07
@smashrod

You can put a hash table of offsets at the beginning of the large file; then a file can be found quickly without any external knowledge.
Stevens has a good example of a file database in his C book, where he describes how to organize a key-value database. It would fit well here.
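
A rough sketch of the "hash table of offsets at the beginning" idea. This is not the design from Stevens' book. The fixed capacity of 1024 slots, the slot layout (MD5 digest of the name, offset, length), linear probing, and the function names are all assumptions made for illustration.

```python
import hashlib
import os
import struct

SLOTS = 1024                      # assumed fixed table capacity
SLOT = struct.Struct("<16sQQ")    # name digest, data offset, data length
HEADER_SIZE = SLOTS * SLOT.size   # data always starts after the table


def create(path):
    """Create a container whose header is an empty hash table."""
    with open(path, "wb") as f:
        f.write(b"\0" * HEADER_SIZE)


def _probe(f, name):
    """Yield (slot position, digest, stored digest, offset, length)."""
    digest = hashlib.md5(name.encode("utf-8")).digest()
    start = int.from_bytes(digest[:4], "little") % SLOTS
    for i in range(SLOTS):
        pos = ((start + i) % SLOTS) * SLOT.size
        f.seek(pos)
        stored, offset, length = SLOT.unpack(f.read(SLOT.size))
        yield pos, digest, stored, offset, length


def put(path, name, data):
    """Append data at the end and record its offset in the table."""
    with open(path, "r+b") as f:
        for pos, digest, stored, offset, _ in _probe(f, name):
            # Offset 0 marks an empty slot: real data never starts at 0.
            if offset == 0 or stored == digest:
                f.seek(0, os.SEEK_END)
                new_offset = f.tell()
                f.write(data)
                f.seek(pos)
                f.write(SLOT.pack(digest, new_offset, len(data)))
                return
        raise RuntimeError("hash table is full")


def get(path, name):
    """Find the slot by hash, then read the data it points to."""
    with open(path, "rb") as f:
        for _, digest, stored, offset, length in _probe(f, name):
            if offset == 0:
                break  # empty slot: the name was never stored
            if stored == digest:
                f.seek(offset)
                return f.read(length)
    raise KeyError(name)
```

A lookup reads one or a few 32-byte slots and then seeks straight to the data. The sketch leaves out deletion (which would need tombstones with linear probing), and overwriting a name leaves the old bytes behind as dead space.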
