Search Engine Optimization
icegreenberry, 2010-10-06 12:20:27

How to programmatically determine the uniqueness of text in search engines?

How do services like Copyscape and antiplagiat.ru determine whether a text is unique?


2 answer(s)
kzn, 2010-10-06
@kzn

Most likely, they look for similar documents. If the text under examination is very similar to some other document by some metric, it is considered a copy. The same check may also be done at the paragraph level.
To find similar documents quickly, use LSH (locality-sensitive hashing) and clustering.
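
To make the LSH idea concrete, here is a minimal sketch using MinHash signatures over word shingles, with banding to pick candidate pairs. This is one common way to do LSH for text, not necessarily what Copyscape or antiplagiat.ru do. All function names, the shingle size, the number of hashes, and the number of bands are assumptions chosen for illustration.

```python
import hashlib
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (k=3 is an arbitrary choice)."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def _hash(shingle, seed):
    """Deterministic 64-bit hash of a shingle, varied by seed."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(shingle_set, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    return [min(_hash(s, seed) for s in shingle_set) for seed in range(num_hashes)]


def estimated_similarity(sig_a, sig_b):
    """The fraction of matching positions estimates the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def lsh_buckets(signature, bands=16):
    """Split the signature into bands; each band is one bucket key.

    Two documents that share at least one bucket become a candidate pair
    and are then compared in full.
    """
    rows = len(signature) // bands
    return {(b, tuple(signature[b * rows:(b + 1) * rows])) for b in range(bands)}
```

In use, you would compute the buckets for every indexed document once, store them in a dictionary from bucket key to document ids, and look up only the buckets of the new text. That avoids comparing the new text against every document, which is the point of LSH.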

Andryxa, 2010-10-08
@Andryxa

They use shingles. That is, they take a random shingle from the text (shingles are usually somewhere between 5 and 9 words long, I don't remember exactly) and submit it as a quoted query to a search engine. If there is more than one result, then someone copy-pasted from someone. At that point the search engines' own algorithm for determining the original takes over, and it does not always identify the original source correctly.
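
A rough sketch of that procedure, assuming 5-word shingles and three random samples per text. The search itself is left as a callable you supply: `count_results` is a hypothetical parameter, not a real search engine API, and is expected to return the number of hits for a quoted query.

```python
import random
import re


def word_shingles(text, k=5):
    """Return all runs of k consecutive words in order (k=5 is an assumption)."""
    words = re.findall(r"\w+", text.lower())
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]


def sample_queries(text, k=5, samples=3, seed=None):
    """Pick random shingles and wrap each in quotes for an exact-phrase search."""
    rng = random.Random(seed)
    candidates = word_shingles(text, k)
    picked = rng.sample(candidates, min(samples, len(candidates)))
    return [f'"{s}"' for s in picked]


def looks_copied(text, count_results, k=5, samples=3, seed=None):
    """Flag the text if any sampled phrase has more than one search result.

    count_results is a hypothetical callable: quoted query -> number of hits.
    """
    queries = sample_queries(text, k, samples, seed)
    return any(count_results(q) > 1 for q in queries)
```

The "more than one result" threshold follows the description above and assumes the text being checked is itself already indexed. For a text that has not been published yet, any result at all would be a sign of copying.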
