Machine learning
pcdesign, 2017-10-31 12:25:48

How to recognize paragraphs in text?

Definition of the word paragraph:

A paragraph (a section or part of a text) is a piece of written text consisting of one or more sentences.
A paragraph groups related units of the exposition and fully covers one of its points (a theme, a plot step, etc.).

Goal: feed the program a text without paragraph breaks and get the same text back, split into paragraphs.


1 answer(s)
Alexander, 2017-10-31
@pcdesign

With good accuracy, there is no way to do this.
But if some error rate is acceptable, you can take a fairly large corpus of texts that are already divided into paragraphs, split it into individual sentences, and compute a set of metrics for each sentence. For example: the number of words, the number of punctuation marks, the ratio of punctuation marks to words, the average word length, the sentence's resulting vector in word-vector space, plus at least a few dozen more metrics of the same kind that you come up with yourself.
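A minimal sketch of the per-sentence metrics listed above, using only the Python standard library. The function names `split_sentences` and `sentence_features` are made up for illustration, the sentence splitter is deliberately naive (it breaks after `.`, `!` or `?` followed by whitespace), and the word-vector metric is left out because it depends on whichever embedding model you choose.

```python
import re
import string


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def sentence_features(sentence):
    """Compute a few simple metrics for one sentence (illustrative subset)."""
    words = re.findall(r"\w+", sentence)
    n_words = len(words)
    # Counts ASCII punctuation only; extend the set for other scripts.
    n_punct = sum(1 for ch in sentence if ch in string.punctuation)
    return {
        "n_words": n_words,
        "n_punct": n_punct,
        "punct_per_word": n_punct / n_words if n_words else 0.0,
        "avg_word_len": sum(len(w) for w in words) / n_words if n_words else 0.0,
    }
```

On real text the naive splitter will trip on abbreviations and decimals, so a proper sentence tokenizer is probably worth swapping in before computing features on a large corpus.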
After that, everything is standard: you have a set of input features and a target (whether the sentence is the first sentence of a paragraph, and whether it is the last one). The result is a model that, for each sentence, estimates the probability that it is the "head" of a paragraph and the probability that it is the "tail".
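One possible way to derive the "head" and "tail" targets from a corpus that is already split into paragraphs is sketched below. The function name `label_sentences` and the input layout (a list of paragraphs, each a list of sentence strings) are assumptions made for this example, not something the answer prescribes.

```python
def label_sentences(paragraphs):
    """Label every sentence as paragraph head and/or tail.

    paragraphs: list of paragraphs, each a list of sentence strings.
    Returns a list of (sentence, is_head, is_tail) tuples with 0/1 labels.
    """
    rows = []
    for sentences in paragraphs:
        last = len(sentences) - 1
        for i, sentence in enumerate(sentences):
            # A one-sentence paragraph is both head and tail.
            rows.append((sentence, int(i == 0), int(i == last)))
    return rows
```

The two label columns can then be used as targets for any binary classifier that outputs probabilities, trained on the per-sentence metrics.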
Then you just insert a paragraph break after each sentence that has a high "tail" probability and is followed by a sentence with a high "head" probability.
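The break-insertion rule can be sketched as follows. It assumes the model's outputs are already available as two lists of probabilities, `p_head` and `p_tail`, aligned with the sentences; the function name `join_with_breaks` and the default threshold of 0.5 are arbitrary choices for illustration.

```python
def join_with_breaks(sentences, p_head, p_tail, threshold=0.5):
    """Join sentences, inserting a blank line where a paragraph likely ends.

    A break goes between sentence i-1 and sentence i when the "tail"
    probability of i-1 and the "head" probability of i both reach the
    threshold; otherwise the sentences are joined with a single space.
    """
    if not sentences:
        return ""
    out = [sentences[0]]
    for i in range(1, len(sentences)):
        is_break = p_tail[i - 1] >= threshold and p_head[i] >= threshold
        out.append("\n\n" if is_break else " ")
        out.append(sentences[i])
    return "".join(out)
```

Raising the threshold gives fewer, more confident breaks; lowering it gives shorter paragraphs at the cost of more false splits.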
