How to compare two versions of marked up text at the word level?

youngmysteriouslight, 2018-10-16 19:11:08

bash

The general task is as follows:
there are two files with marked-up text, or rather one file and one diff produced by the version control system.
From them I need to build a single file, valid in the markup language, that contains both versions: common parts are not duplicated, and differing parts are wrapped in special tags.
Changes should be marked at the level of words, not whole lines.
For example,

File 1:
<h1>Header</h1>
Subheading
Mom washed the frame.
There were no clouds in the sky.

File 2:
<h1>Title</h1>
<h2>Subheading</h2>
Mom washed the poodle.
In the sky there were clouds.

Result:
<h1><span class=old>Header</span> <span class=new>Title</span></h1>
<h2>Subheading</h2>
Mom washed the <span class=old>frame</span> <span class=new>poodle</span>.
<span class=old>There were no clouds in the sky.</span>
<span class=new>In the sky there were clouds.</span>

In principle, I can write a function that takes two fragments and returns the merged result, taking into account what I want the result to look like and the particular markup language used.
The task then reduces to finding those fragments.
Plain diff does not work because it compares line by line, and a whole line is too large a unit to mark as changed and duplicate when only one word in it differs, as in the frame and header examples.
In addition, when a change is small, I want the fragment to contain markup tags as rarely as possible (see the header example).
Inserting extra line breaks so that every word sits on its own line and then running diff is not an option either: large differing fragments almost certainly share some words, for example conjunctions, and diff would match on them and break the fragment apart (see the example with the word "sky").
I would like to find a program that works like kdiff3, which can highlight the differing characters within a line,
and whose output is in a format that I can feed to my function.
So the question is:
is there a program that can determine the fragments in which two files differ at the level of words, not lines, and produce a list of those fragments with their coordinates in an easily parsable format?
Or should I approach the problem from a different angle?
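
As a rough illustration of the kind of output asked about here, the following Python sketch compares two texts token by token with the standard library's difflib and lists the differing fragments with character offsets. The tokenizer, the function names tokenize and changed_fragments, and the dictionary layout are assumptions made for this example, not the format of any existing tool.

```python
import difflib
import re

# Hypothetical tokenizer: a token is a markup tag, a run of word characters,
# or a single other non-space character. Whitespace is skipped.
TOKEN_RE = re.compile(r"<[^>]+>|\w+|[^\w\s]")


def tokenize(text):
    """Return (token, start, end) triples with character offsets into text."""
    return [(m.group(), m.start(), m.end()) for m in TOKEN_RE.finditer(text)]


def changed_fragments(old, new):
    """List the fragments where old and new differ, compared token by token."""
    a, b = tokenize(old), tokenize(new)
    matcher = difflib.SequenceMatcher(
        None, [t[0] for t in a], [t[0] for t in b], autojunk=False
    )
    fragments = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            continue
        # Character span of the fragment in each text; None means the
        # fragment is empty on that side (pure insertion or deletion).
        old_span = (a[i1][1], a[i2 - 1][2]) if i2 > i1 else None
        new_span = (b[j1][1], b[j2 - 1][2]) if j2 > j1 else None
        fragments.append({"op": op, "old": old_span, "new": new_span})
    return fragments


if __name__ == "__main__":
    for fragment in changed_fragments("Mom washed the frame.",
                                      "Mom washed the poodle."):
        print(fragment)
```

Because a tag such as <h1> is a single token in this sketch, a changed word inside a heading is reported without the surrounding tags. The sketch does not address the repeated-word problem described above: a word shared by two otherwise different sentences will still be matched.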

1 answer(s)

Dimonchik, 2018-10-16
@dimonchik2013

How do you imagine such a program would work?
Compare the fragments with Levenshtein distance: where they are close, split them into words; where they are far apart, treat them as whole lines.
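
A minimal sketch of this heuristic in Python, under several assumptions: the two inputs are already paired lines, similarity is the Levenshtein distance normalized by the longer line's length, the threshold of 0.5 is an arbitrary illustrative value, and the word-level comparison uses the standard library's difflib. The names levenshtein and merge_lines are hypothetical.

```python
import difflib
import re


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete x
                           cur[j - 1] + 1,             # insert y
                           prev[j - 1] + (x != y)))    # substitute or match
        prev = cur
    return prev[-1]


def merge_lines(old, new, threshold=0.5):
    """Merge two versions of a line: word level if close, line level if far."""
    if old == new:
        return old
    similarity = 1 - levenshtein(old, new) / max(len(old), len(new))
    if similarity < threshold:
        # Far apart: keep both lines whole (threshold is an assumed value).
        return f"<span class=old>{old}</span>\n<span class=new>{new}</span>"
    # Close: split into runs of word characters and runs of everything else,
    # so that joining the tokens reproduces the original spacing.
    a = re.findall(r"\w+|\W+", old)
    b = re.findall(r"\w+|\W+", new)
    out = []
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            out.extend(a[i1:i2])
            continue
        parts = []
        if i2 > i1:
            parts.append(f"<span class=old>{''.join(a[i1:i2])}</span>")
        if j2 > j1:
            parts.append(f"<span class=new>{''.join(b[j1:j2])}</span>")
        out.append(" ".join(parts))
    return "".join(out)


if __name__ == "__main__":
    print(merge_lines("Mom washed the frame.", "Mom washed the poodle."))
```

For the pair "Mom washed the frame." and "Mom washed the poodle." the lines are close, so only the changed word is wrapped and the final period stays outside the tags. The sketch gives markup tags no special treatment and does not check that the merged output is still valid markup; that would have to be handled separately.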