Is there any protection against leaking electronic documents in Python?

Python

Kirill Petrov, 2022-01-18 00:34:26

Greetings! Suppose there is a DOC or PDF file. The task is to generate an image of it for each user, slightly distorting or shifting some characters, so that if an image of the document later surfaces somewhere, these changes make it possible to determine which user leaked it.

I see this solution: we give each user a unique token, then derive a set of pseudorandom numbers from the token (used as a salt). For each document, we slightly shift, crop, or distort some characters according to those numbers, save the resulting image, and give it to the user.
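
The token-to-numbers step could look roughly like the sketch below. It only derives repeatable per-character shifts; actually rendering the shifted characters into an image is left out. The function name `glyph_offsets`, the `doc_id` parameter, and the one-pixel `max_shift` are assumptions made for illustration, not part of the question.

```python
import hashlib
import random


def glyph_offsets(token, doc_id, n_glyphs, max_shift=1):
    """Derive a repeatable (dx, dy) pixel shift for each glyph from a user token."""
    # Hash token + document id, so one user gets a different pattern per document.
    digest = hashlib.sha256(f"{token}:{doc_id}".encode("utf-8")).digest()
    # Seeded PRNG: the same token and document always produce the same shifts.
    rng = random.Random(int.from_bytes(digest, "big"))
    return [
        (rng.randint(-max_shift, max_shift), rng.randint(-max_shift, max_shift))
        for _ in range(n_glyphs)
    ]


# Hypothetical token and file name, purely for demonstration.
offsets = glyph_offsets("user-42-token", "contract.pdf", n_glyphs=20)
print(offsets[:3])
```

Because the shifts are recomputed from the token on demand, nothing except the token itself has to be stored per user.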

To determine whose copy a leaked document is, we simply compare parts of it against the original with a sliding window, using OpenCV as far as I remember.
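
A minimal pure-Python sketch of that comparison, assuming the glyph patches and the regions they should fall in have already been cut out as small grayscale grids (lists of pixel rows). The helper names `best_offset` and `identify` are hypothetical. On real images, OpenCV's `cv2.matchTemplate` performs this kind of sliding-window comparison far more efficiently.

```python
def best_offset(patch, region):
    """Slide `patch` over `region` and return the (dx, dy) with the lowest absolute difference."""
    ph, pw = len(patch), len(patch[0])
    rh, rw = len(region), len(region[0])
    best = None
    for dy in range(rh - ph + 1):
        for dx in range(rw - pw + 1):
            cost = sum(
                abs(patch[y][x] - region[dy + y][dx + x])
                for y in range(ph)
                for x in range(pw)
            )
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best[1], best[2]


def identify(measured, expected_by_user):
    """Return the user whose expected offsets agree with the most measured ones."""
    def score(user):
        return sum(m == e for m, e in zip(measured, expected_by_user[user]))
    return max(expected_by_user, key=score)


# Toy data: a 2x2 glyph patch placed at dx=1, dy=2 inside a 4x4 region.
patch = [[1, 1], [1, 0]]
region = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 0, 0]]
print(best_offset(patch, region))
```

A real scan adds noise, scaling, and rotation, so exact equality between measured and expected offsets is optimistic; a tolerance or a statistical score would be needed in practice.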

There seem to be no pitfalls, and I am surely not the first to think of this, so maybe an implementation of something similar already exists?

UPD: Links to other programs/resources with this feature are also welcome)

3 answer(s)

kisaa, 2022-01-18
@kisaa

The idea is sound, and something similar has come up before. In short: within a document you can replace individual characters with look-alikes (Cyrillic "с" => Latin "c"; ugh, this breaks search within the document) or play with spaces (insert a second space between words). Of course, if the user suspects this kind of DRM, stripping it out of a DOC is a piece of cake; with a PDF it is harder.
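
A rough sketch of the look-alike substitution, assuming the source text is Latin-script and contains none of the Cyrillic twins to begin with (for a Russian document the direction would simply be reversed). The letter map is partial, and `embed` / `extract` are hypothetical names. Each eligible letter carries one bit of the user's ID.

```python
# Latin letters and their visually similar Cyrillic counterparts (partial map).
HOMOGLYPHS = {"a": "\u0430", "c": "\u0441", "e": "\u0435", "o": "\u043e", "p": "\u0440"}
REVERSE = {cyr: lat for lat, cyr in HOMOGLYPHS.items()}


def embed(text, bits):
    """Swap the i-th eligible letter for its Cyrillic twin when bit i is 1."""
    out, i = [], 0
    for ch in text:
        if ch in HOMOGLYPHS and i < len(bits):
            out.append(HOMOGLYPHS[ch] if bits[i] else ch)
            i += 1
        else:
            out.append(ch)
    if i < len(bits):
        raise ValueError("text has too few eligible letters for this ID")
    return "".join(out)


def extract(marked, n_bits):
    """Read the bits back: a Cyrillic twin is 1, an untouched eligible letter is 0."""
    bits = []
    for ch in marked:
        if len(bits) == n_bits:
            break
        if ch in REVERSE:
            bits.append(1)
        elif ch in HOMOGLYPHS:
            bits.append(0)
    return bits


marked = embed("please scan each page once", [1, 0, 1, 1, 0])
print(extract(marked, 5))
```

The marked string looks the same on screen but no longer matches a plain-text search for the original words, which is exactly the drawback mentioned above.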

hint000, 2022-01-18
@hint000

I will add to the previous answers. Yes, if the document is run through OCR, then the substitution (Cyrillic "с" => Latin "c") will not survive, and extra spaces may not either (small shifts and distortions of characters even less so). But intentional spelling and punctuation errors can work, as long as you do not overdo it: numerous errors stand out, while many readers will not notice a single error on a page.
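
To illustrate the point, here is a crude stand-in for what OCR (or a deliberate cleanup) does to character-level marks: it maps look-alike letters back to one alphabet and collapses repeated spaces. The `normalize` helper and the sample sentences are made up for the example; a real OCR engine behaves less predictably.

```python
import re

# Cyrillic look-alikes mapped back to Latin (partial, illustrative map).
CYRILLIC_TO_LATIN = str.maketrans(
    {"\u0430": "a", "\u0441": "c", "\u0435": "e", "\u043e": "o", "\u0440": "p"}
)


def normalize(text):
    """Map look-alike letters to Latin and collapse runs of spaces or tabs."""
    text = text.translate(CYRILLIC_TO_LATIN)
    return re.sub(r"[ \t]+", " ", text)


clean = "The contract is valid."
copy_a = "The  \u0441ontract is valid."  # double space + Cyrillic "с"
copy_b = "The contract is vaild."        # deliberate misspelling

print(normalize(copy_a) == clean)  # character-level marks are gone
print(normalize(copy_b) == clean)  # the misspelling is still there
```

The misspelled copy stays distinguishable after normalization, which is why content-level marks are the more robust option here.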

AlexVWill, 2022-01-18
@AlexVWill

"if the image of the document pops up somewhere"

And what about the case where the DOC or PDF itself surfaces somewhere, is that not considered? Besides, how are you going to hand the user a distorted image? Of course, you could tell them: "Here is a distorted image for you to post, and we will use it to track you down", but that sort of logic resembles the logic of the mentally ill or of State Duma deputies.
Moreover, given a good scan, the first thing anyone will do is run it through an OCR program and produce a normal PDF from it.
I think it is better to accept that if you need to protect data, you should: a) not give it to anyone; b) choose media formats that protect it from copying; c) in general, not distribute confidential data on electronic media.