Python
Alexander Makarov, 2018-11-20 17:21:23

How to divide text into logical blocks?

Good afternoon!
The task: take a Word document containing a contract as input and produce a list of text blocks grouped by logical proximity as output.
More specifically, this is an insurance contract. It lists the risks that are covered and those that are not. If fire is covered, that is not necessarily stated with a single phrase like "fire is covered" (in that case I would just split the text into paragraphs and be done). The fire coverage can be spread over several paragraphs with sub-items (at least they appear one after another, not in different places in the text). To a program these are separate paragraphs; to a person they are a single logical block of text.
What makes the task easier: since this is a contract, the text is structured, at least to some degree. As a rule, the clauses are numbered: items, sub-items, sub-sub-items.
What complicates the task: the structuring rules differ from contract to contract. In some contracts the whole text is essentially one list, and you can pick a certain list level and treat it as the minimum logical unit. All sub-items below that level can then simply be appended to the text of their parent item without breaking the logic of the split. But that is not always the case.
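For the simple case described above (everything is a numbered list and one level is chosen as the logical unit), a minimal sketch might look like the following. It assumes the paragraphs have already been extracted from the Word file as plain strings and that the numbering is typed into the text in decimal form ("3.", "3.1", "3.1.2."). The names `NUMBERING`, `numbering_depth` and `group_by_level` are made up for illustration, and the pattern will also match a paragraph that merely starts with a number, such as "10 days after ...".

```python
import re

# Assumed numbering style: multilevel decimal numbers such as "3.", "3.1" or "3.1.2."
# followed by whitespace. Other styles (letters, brackets) are not handled here.
NUMBERING = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s+")


def numbering_depth(paragraph):
    """Return the depth of the leading number (1 for "3.", 2 for "3.1"), or None."""
    match = NUMBERING.match(paragraph)
    if match is None:
        return None
    return match.group(1).count(".") + 1


def group_by_level(paragraphs, level):
    """Merge paragraphs into blocks that start at items of depth <= level.

    Deeper sub-items and unnumbered paragraphs are appended to the
    block that precedes them.
    """
    blocks = []
    for paragraph in paragraphs:
        depth = numbering_depth(paragraph)
        starts_block = depth is not None and depth <= level
        if starts_block or not blocks:
            blocks.append([paragraph])
        else:
            blocks[-1].append(paragraph)
    return ["\n".join(block) for block in blocks]


paragraphs = [
    "3. Covered risks",
    "3.1. Fire",
    "3.1.1. caused by lightning",
    "3.1.2. caused by a short circuit",
    "3.2. Flood",
    "including burst pipes",
]
blocks = group_by_level(paragraphs, level=2)
```

With `level=2`, the two "3.1.x" sub-items end up in the same block as "3.1. Fire", and the unnumbered paragraph is attached to "3.2. Flood". Choosing the right level for a given contract is still a manual or heuristic decision.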
The list may be lettered rather than numbered. There may be no list formatting at all, with each item simply starting with a number (or a letter) typed into the text.
Or with a certain number of spaces or tabs.
Or the logical blocks may be marked off by HEADINGS (lines of text consisting entirely of capital letters).
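The marker variants listed above (numbers, letters, indentation, all-caps headings) could be detected with a rule-based classifier before deciding where to split. The following is only a sketch under assumptions: the text is available as plain lines, the regular expressions and the category names are invented for this example, and `split_blocks` leaves the choice of which marker kinds open a new block to the caller.

```python
import re

# Hypothetical marker patterns; real contracts will need tuning.
NUMBER = re.compile(r"^\d+(?:\.\d+)*[.)]?\s+\S")   # "1. text", "2.3 text", "4) text"
LETTER = re.compile(r"^[^\W\d_][.)]\s+\S")         # "a) text", "B. text" (any alphabet)
INDENT = re.compile(r"^(?:\t| {2,})\S")            # a tab or two or more leading spaces


def classify_line(line):
    """Return one of: blank, number, letter, heading, indent, plain."""
    stripped = line.strip()
    if not stripped:
        return "blank"
    if NUMBER.match(stripped):
        return "number"
    if LETTER.match(stripped):
        return "letter"
    letters = [ch for ch in stripped if ch.isalpha()]
    if letters and all(ch.isupper() for ch in letters):
        return "heading"
    if INDENT.match(line):
        return "indent"
    return "plain"


def split_blocks(lines, split_on):
    """Start a new block at every line whose kind is in split_on."""
    blocks = []
    for line in lines:
        kind = classify_line(line)
        if kind == "blank":
            continue
        if kind in split_on or not blocks:
            blocks.append([line.strip()])
        else:
            blocks[-1].append(line.strip())
    return blocks


lines = [
    "COVERED RISKS",
    "a) fire",
    "\tincluding lightning",
    "b) flood",
    "",
    "EXCLUSIONS",
    "1) war",
]
by_heading = split_blocks(lines, split_on={"heading"})
by_item = split_blocks(lines, split_on={"heading", "letter", "number"})
```

Counting how often each kind occurs in a given contract is one possible way to guess which marker that contract uses for its top-level blocks, though that heuristic is an assumption and not something tested on real contracts.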
What should I do in this case? What algorithm might work? Or maybe an ML model?
I would be grateful for any tips!


1 answer(s)
Dimonchik, 2018-11-25
@dimonchik2013

Tomita parser is the closest thing.
