How to programmatically determine the uniqueness of text in search engines?
How do services like Copyscape and antiplagiat.ru determine the uniqueness of a text?
Most likely they search for similar documents: if the text under review is very close to some other document by a similarity metric, it is treated as a copy. The same check may also be done at the paragraph level.
To find similar documents quickly, use LSH (locality-sensitive hashing) and clustering.
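As a rough illustration of the LSH idea, here is a minimal sketch using MinHash signatures with banding. It is not how Copyscape or antiplagiat.ru actually work (that is not public); the function names, the 3-word shingle size, the 64 hash functions and the 16 bands are all assumptions chosen for the example.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (k=3 is an arbitrary choice)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def jaccard(a, b):
    """Exact Jaccard similarity of two non-empty shingle sets."""
    return len(a & b) / len(a | b)


def _hash(shingle, seed):
    # Derive a family of hash functions by salting SHA-1 with a seed.
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def minhash_signature(shingle_set, num_hashes=64):
    """For each hash function keep the minimum hash over all shingles."""
    return [min(_hash(s, seed) for s in shingle_set) for seed in range(num_hashes)]


def lsh_candidates(signatures, bands=16):
    """Group documents whose signatures agree on at least one whole band.

    signatures: dict mapping doc id -> MinHash signature (equal lengths).
    Returns a set of (id_a, id_b) candidate pairs with id_a < id_b.
    """
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        ordered = sorted(ids)
        for i in range(len(ordered)):
            for j in range(i + 1, len(ordered)):
                pairs.add((ordered[i], ordered[j]))
    return pairs
```

Only the candidate pairs returned by `lsh_candidates` need the exact (and more expensive) `jaccard` comparison, which is what makes the search fast on a large collection.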
Use shingles. These services take a random shingle from the text (usually 5 to 9 words long, if I remember correctly) and submit it to a search engine in quotes. If there is more than one result, someone has copied from someone else. From there, the search engine's own algorithm for identifying the original takes over, and it does not always pick the original source correctly.
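The shingle approach could be sketched roughly like this. The search call itself is left out on purpose: `count_results` is a hypothetical callable you would have to supply (query string in, number of hits out), and the default shingle length of 7 words and 3 samples are just assumptions within the 5 to 9 range mentioned above.

```python
import random


def random_shingle_queries(text, shingle_len=7, samples=3, seed=None):
    """Pick random word shingles from the text and wrap each in quotes
    so a search engine treats it as an exact phrase."""
    words = text.split()
    if not words:
        return []
    if len(words) < shingle_len:
        # Text is shorter than one shingle: query the whole text.
        return ['"' + " ".join(words) + '"']
    rng = random.Random(seed)
    starts = range(len(words) - shingle_len + 1)
    chosen = rng.sample(starts, min(samples, len(starts)))
    return ['"' + " ".join(words[s:s + shingle_len]) + '"' for s in sorted(chosen)]


def looks_copied(text, count_results, **kwargs):
    """Return True if any sampled shingle yields more than one search result.

    count_results is a hypothetical user-supplied function: query -> hit count.
    """
    return any(count_results(q) > 1 for q in random_shingle_queries(text, **kwargs))
```

Sampling several shingles instead of one reduces the chance of hitting a common phrase that appears on many pages by coincidence.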