Andrew, 2016-12-27 00:46:32
PHP

What is the best way to split text into paragraphs (remove extra line breaks and keep or add them in the right places)?

Hello. Many of you are probably familiar with this: when you copy text, for example from a PDF or from a page with hard line breaks, it looks something like this (no need to read it):


I lost interest in work. This piece of paper is the fruit of a
vain attempt to occupy oneself. I have not disclosed anything here that
could harm the Brotherhood, which is probably why I am still alive.
Maybe I'll destroy this sheet or send it to the moon, or for
fun, I'll send it to you by mail and put a good
spell on it, so that if it doesn't reach the right place, the culprit will take a sip,
how much is a pound.
I have to ask you. If this letter does not disappear
without a trace, but reaches you, then - print it for edification or as a
lesson, but just so that people, guys, girls read - do not ruin
yourself, do not do magic, do not give announcements like
"Respond, magicians black and white." Suddenly, someone from the Brotherhood will not
disdain and find you, and you will approach him, and he will begin to teach you
This... Live quietly for yourself, when there is nothing to choose from,
the road ahead is so simple and clear - home, family, work , children -
as many as 70-80, or even more years of serene existence.
What a simple happiness - it is no longer destined for me. I have forgotten how
to love, I have no relatives, friends, a favorite thing - so why do I need
this immortality, which in my life I was rewarded with
magic?
When this letter reaches you, anyway, I will not
read it on your pages anyway, because today at the right time,
if I have not miscalculated, my voice will transport me to the World of the
Nameless. If it suddenly comes out, I may be back, but this is
unlikely - no one has ever returned from there.
May your path be clear.
DIEHARD.

I wrote a script that turns it into this (no need to read it):

I lost interest in work. This piece of paper is the fruit of a vain attempt to occupy oneself. I have not disclosed anything here that could harm the Brotherhood, which is probably why I am still alive. Maybe I'll destroy this sheet or send it to the moon, or for fun, I'll send it to you by mail and put a good spell on it, so that if it doesn't reach the right place, the culprit will take a sip, how much is a pound.
I have to ask you. If this letter does not disappear without a trace, but reaches you, then - print it for edification or as a lesson, but just so that people, guys, girls read - do not ruin yourself, do not do magic, do not give announcements like "Respond, magicians black and white." Suddenly, someone from the Brotherhood will not disdain and find you, and you will approach him, and he will begin to teach you This... Live quietly for yourself, when there is nothing to choose from, the road ahead is so simple and clear - home, family, work , children - as many as 70-80, or even more years of serene existence. What a simple happiness - it is no longer destined for me. I have forgotten how to love, I have no relatives, friends, a favorite thing - so why do I need this immortality, which in my life I was rewarded with magic?
When this letter reaches you, anyway, I will not read it on your pages anyway, because today at the right time, if I have not miscalculated, my voice will transport me to the World of the Nameless. If it suddenly comes out, I may be back, but this is unlikely - no one has ever returned from there.
May your path be clear.
DIEHARD.

The script itself:
$file_name = 'file.txt';
// Read the file into an array of lines, without the trailing line breaks.
$file = file($file_name, FILE_IGNORE_NEW_LINES);

// Average line length, computed once instead of once per line.
$avg = mb_strlen(implode('', $file)) / max(count($file), 1);

$arr = array_map(function ($e) use ($avg) {
  $line = trim($e);
  $len = mb_strlen($line);
  $last_chr = mb_substr($line, -1);

  // Does the line end with an end-of-sentence character?
  $ends_sentence = in_array($last_chr, ['.', '!', '?'], true);

  if ($len < $avg && $ends_sentence) {
    // Short line that ends a sentence: treat it as the end of a paragraph.
    return $line . "\r\n";
  }

  // Otherwise the line continues the paragraph: join it with a space.
  return $line . ' ';
}, $file);

$text = implode('', $arr);

That is, I take the number of characters in each line and compute the average. If a line is shorter than the average and its last character is an end-of-sentence mark, a line break goes after it; otherwise the line is joined to the next one.
The algorithm is not the best: it easily goes wrong when, for example, a paragraph should end on a line that is longer than the average. Unfortunately, I could not think of a better one.
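
To make that failure case concrete, here is a minimal sketch. The function name `join_lines_by_average` and the three-line sample are made up for illustration; the rule is the same average-length rule described above, only wrapped in a function so it can be run on an array of lines.

```php
<?php
// Hypothetical helper: the average-length rule, wrapped in a function.
function join_lines_by_average(array $lines): string
{
    $lines = array_map('trim', $lines);
    // Average line length over the whole input.
    $avg = mb_strlen(implode('', $lines)) / max(count($lines), 1);

    $text = '';
    foreach ($lines as $line) {
        $ends_sentence = in_array(mb_substr($line, -1), ['.', '!', '?'], true);
        // Break only after a line that ends a sentence AND is shorter than average.
        $text .= ($ends_sentence && mb_strlen($line) < $avg) ? $line . "\n" : $line . ' ';
    }
    return rtrim($text);
}

// Made-up sample: the first paragraph ends on a full-width line.
$sample = [
    'The first paragraph is short, and it ends on a full line.',
    'The second paragraph begins here, runs on to a second line',
    'and then stops.',
];

$result = join_lines_by_average($sample);
echo $result, "\n";
```

On this sample the whole input comes back as a single line. The first line ends with a period, but it is longer than the average, so the break after the first paragraph is lost.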
Maybe someone has already faced a similar task, or has some existing code or ideas on how to improve my script?
