Python
Astrohas, 2017-02-01 09:06:06

How to analyze string similarity and check for plagiarism?

I need to measure how similar a given string is to the strings stored in a database. The strings are short, under 1024 characters. Which direction should I dig in?

1 answer(s)
al_gon, 2017-02-01
@Astrohas

https://en.wikipedia.org/wiki/Category:String_simi...
chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-mat...
Plagiarism is harder. There the comparison has to work at the level of structure and semantics.
For a rough check, though, where some words are changed or parts of a sentence are rearranged, similarity measures will also do the job.
A primitive and naive solution in Python:

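def dice_coefficient(a, b):
    # Note: set(a) collects the unique characters of a string, not bigrams.
    a_chars = set(a)
    b_chars = set(b)
    if not a_chars and not b_chars:
        return 1.0  # two empty strings: avoid division by zero
    overlap = len(a_chars & b_chars)
    return overlap * 2.0 / (len(a_chars) + len(b_chars))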

dice_coefficient("2","3")
=> 0.0
dice_coefficient("2","23")
=> 0.6666666666666666
dice_coefficient("Как осуществить анализ схожести строк и проверить на плагиат?","плагиат?")
=> 0.5454545454545454
dice_coefficient("Как осуществить анализ схожести строк и проверить на плагиат?","плагиат dsfsf?")
=> 0.5405405405405406
dice_coefficient("Как осуществить анализ схожести строк и проверить на плагиат?","плагиат dsfsf? fdedfdfdfgdgh")
=> 0.5
dice_coefficient("Как осуществить анализ схожести строк и проверить на плагиат?","Как осуществить анализ схожести строк и проверить на плагиат?")
=> 1.0
dice_coefficient("Как осуществить анализ схожести строк и проверить на плагиат?","Как осуществить?")
=> 0.8
dice_coefficient("Как осуществить анализ схожести строк и проверить на плагиат","анализ схожести строк и проверить на плагиат?")
=> 0.9090909090909091

PS: This is only an example, not a recommendation to use it in this form.
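
The function above compares sets of single characters, so letter order and repeated letters do not affect the score. Below is a minimal sketch of a variant that works on character bigrams and ranks candidate strings by score. It assumes that lowercasing and collapsing whitespace is acceptable for the data. The names bigram_dice and most_similar and the 0.5 threshold are illustrative and not part of the original answer, and the candidates list stands in for rows fetched from the database.

```python
from collections import Counter


def normalize(text):
    """Lowercase the text and collapse runs of whitespace into one space."""
    return " ".join(text.lower().split())


def bigrams(text):
    """Return a multiset (Counter) of adjacent character pairs."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def bigram_dice(a, b):
    """Dice coefficient over character bigrams, between 0.0 and 1.0."""
    a, b = normalize(a), normalize(b)
    a_grams, b_grams = bigrams(a), bigrams(b)
    total = sum(a_grams.values()) + sum(b_grams.values())
    if total == 0:
        # Neither string has a bigram: fall back to plain equality.
        return 1.0 if a == b else 0.0
    # Counter & Counter keeps the minimum count of each shared bigram.
    overlap = sum((a_grams & b_grams).values())
    return 2.0 * overlap / total


def most_similar(query, candidates, threshold=0.5):
    """Return (score, candidate) pairs at or above threshold, best first."""
    scored = [(bigram_dice(query, c), c) for c in candidates]
    return sorted((s for s in scored if s[0] >= threshold), reverse=True)


if __name__ == "__main__":
    # Hypothetical stand-in for strings loaded from the database.
    candidates = ["string similarity check", "unrelated text", "String  Similarity"]
    for score, text in most_similar("string similarity", candidates):
        print(round(score, 3), repr(text))
```

With real bigrams, "2" against "23" scores 0.0 instead of 0.67, because a one-character string has no bigrams at all. For a table of any real size, comparing the query against every row like this is slow, so it is only a starting point.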
