Python
Evgeny Trofimov, 2016-12-23 04:04:40

How to break text into sentences without using regular expressions?

I was given an assignment at university: split a Russian text in UTF-8 into sentences, but without regular expressions.
The program has to account for the fact that the first word of a sentence begins with an uppercase letter.
It would seem easy:

text = text.replace('. ', '.|').replace('! ', '!|').replace('? ', '?|')
sentences = text.split('|')

But the given text contains
Try it, come on! wow what! here's a big chick! you think I won't find a trial for you

which is a single sentence, so '! ' will already be handled incorrectly.
OK, I thought, then I can walk through the list of sentences and check whether the next one starts with a lowercase letter; if so, merge it with the previous one.
That is, something like:
i = 0
while i < len(sentences) - 1:
    if not sentences[i + 1].istitle():
        sentences[i] += sentences[i + 1]
        sentences.pop(i + 1)
    i += 1

But this hack doesn't really work, perhaps because of the Russian encoding.
I clearly misunderstand something. I tried comparing against the range of Unicode code points for capital letters, but for some reason it complains about
ch1 = 'Б'
print ord(ch1)

It says:
TypeError: ord() expected a character, but string of length 2 found

And .istitle() does not work correctly at all:
ch1 = 'Б'
ch2 = 'б'
if ch1.istitle():
    print ("Верхний")
else:
    print ("Нижний")

if ch2.istitle():
    print ("Верхний")
else:
    print ("Нижний")

It prints "Нижний" ("Lower") in both cases.
Maybe I've wandered off into the weeds and the solution is much simpler?
Here is the text that the assignment asks to split into sentences:
"Look how brave you are!" said Chub, left alone in the street. "Try it, come on! wow what! here's a big chick! you think I won't find a trial for you. No, my dear, I'll go, and I'll go straight to the commissioner. You will know me. I will not see that you are a blacksmith and painter. However, look at the back and shoulders: I think there are blue spots. The son of the enemy must have beaten him painfully! it's a pity that it's cold and you don't want to throw off the casing! Wait, you demonic blacksmith, so that the devil beats both you and your forge, you will dance with me! you see, damned shibenik! however, because now he is not at home. Solokha, I think, is sitting alone. Hm... it's not far from here; would go! The time is now such that no one will catch us. Maybe even that will be possible... see how painfully the damned blacksmith beat him!'

So in essence the algorithm is simple: append to the list everything up to and including the i-th character if
1) [i] == '.' and [i+1] == ' ' and [i+2] is an uppercase character
2) [i] == '!' and [i+1] == ' ' and [i+2] is an uppercase character
3) the same for '?'
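
The three conditions above can be sketched directly as an index scan. This is only a minimal Python 3 sketch under stated assumptions: the function name split_sentences and the short sample string are illustrative, and the rule is taken literally, so a sentence that opens with a quotation mark (". " followed by '"') is not detected.

```python
def split_sentences(text):
    """Split after '.', '!' or '?' when a space and an uppercase letter follow."""
    sentences = []
    start = 0  # index where the current sentence begins
    for i in range(len(text) - 2):
        if text[i] in '.!?' and text[i + 1] == ' ' and text[i + 2].isupper():
            sentences.append(text[start:i + 1])  # keep the punctuation mark
            start = i + 2                        # skip the separating space
    if start < len(text):
        sentences.append(text[start:])           # the remainder is the last sentence
    return sentences


sample = 'Попробуй, подойди! вишь какой! Нет, голубчик, я пойду. Ты у меня будешь знать.'
for sentence in split_sentences(sample):
    print(sentence)
```

Because str.isupper() works on Unicode characters in Python 3, no hand-written table of Cyrillic capitals is needed here.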

4 answer(s)
aRegius, 2016-12-23
@deadrime

You can "squeeze" a bit more out of your own solution
by adding a check for capital letters and slightly adjusting the arguments to replace():

>>> letters = 'АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЭЮЯ'
>>> text = '"Смотри, как расхрабрился!" говорил Чуб, оставшись один на улице..'  # and so on through the text
>>> for letter in letters:
...     if letter in text:
...         text = text.replace('. ' + letter, '.|' + letter).replace('. "' + letter, '.|"' + letter).replace('! ' + letter, '!|' + letter)
...
>>> for sentence in text.split('|'):
...     print(sentence)
...
"Смотри, как расхрабрился!" говорил Чуб, оставшись один на улице.
"Попробуй, подойди! вишь какой! вот большая цяца! ты думаешь, я на тебя суда не найду.
Нет, голубчик, я пойду, и пойду прямо к комиссару.
Ты у меня будешь знать.
...
etc.

NaName, 2016-12-23
@NaName

ch1 = u'Б'
ch2 = u'б'
if ch1.istitle():
    print ("Верхний")
else:
    print ("Нижний")

Did you try this?
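
For context, here is a small Python 3 sketch of what the u prefix changes. In Python 2, a plain 'Б' literal in a UTF-8 source file is a byte string of two bytes, which matches the ord() error in the question, while u'Б' is a single Unicode character. Python 3 strings are Unicode by default, so the byte-string case is only imitated here with encode(); the names ch and raw are illustrative.

```python
ch = 'Б'                   # in Python 3 this is already a Unicode string
raw = ch.encode('utf-8')   # the two bytes a plain Python 2 literal would hold

print(len(raw))            # 2: why Python 2's ord() saw a "string of length 2"
print(len(ch))             # 1
print(ord(ch))             # 1041, the code point of CYRILLIC CAPITAL LETTER BE
print(ch.isupper(), 'б'.isupper())   # True False

# istitle() checks every word, so it suits single letters but not whole sentences.
print('Нет, голубчик'.istitle())     # False, although the sentence starts with a capital
print('Нет, голубчик'[0].isupper())  # True
```

The last two lines suggest that, for whole sentences, checking the first character with isupper() is likely a better fit than istitle().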

Alexey S., 2016-12-23
@Winsik

i = 0
while i < len(sentences) - 1:
    first = sentences[i + 1][:1]  # ord() accepts one character, not a whole string
    if first and 1040 <= ord(first) <= 1071:  # 'А'..'Я'; note that 'Ё' is 1025
        print(sentences[i + 1])
    i += 1

;) A crude method =)
P.S. Just don't forget that you also have sentences like this: blah blah. "Oh..."
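
Building on the loop above and the P.S. about quotes, the merge step from the question could be sketched like this in Python 3. The helper names first_letter and merge_fragments and the sample text are hypothetical. The sketch assumes the fragments come from the replace()/split('|') approach in the question, which removes the space after the punctuation mark, so the space is restored when merging.

```python
def first_letter(fragment):
    """Return the first alphabetic character, skipping quotes and other marks."""
    for ch in fragment:
        if ch.isalpha():
            return ch
    return ''


def merge_fragments(fragments):
    """Glue a fragment onto the previous one unless it starts with a capital."""
    merged = []
    for fragment in fragments:
        if merged and not first_letter(fragment).isupper():
            merged[-1] += ' ' + fragment  # restore the space removed by the split
        else:
            merged.append(fragment)
    return merged


text = 'Бла бла. "Попробуй, подойди! вишь какой!" говорил Чуб. Нет, голубчик.'
fragments = text.replace('. ', '.|').replace('! ', '!|').replace('? ', '?|').split('|')
print(merge_fragments(fragments))
```

Building a new list avoids the index bookkeeping of pop() inside a while loop, where a merged element would otherwise need to be re-checked against its new neighbour.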

abcd0x00, 2016-12-24
@abcd0x00

It would seem easy

It only looks easy, and the issue is not how to tell a lowercase letter from an uppercase one. The issue is the algorithm. This task is essentially about writing a lexical analyzer, and those are built on finite automata.
wiki. state machine (example)
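
To make the finite-automaton idea concrete, here is one possible Python 3 sketch. It is not taken from the linked example: the state names, the function split_sentences_fsm and the sample text are all assumptions made for illustration. The automaton reads one character at a time and cuts a sentence only after end punctuation, then a space, then an uppercase letter, optionally preceded by an opening double quote.

```python
from enum import Enum, auto

END_MARKS = '.!?'


class State(Enum):
    TEXT = auto()   # inside a sentence
    END = auto()    # just read end punctuation (or a closing quote after it)
    GAP = auto()    # spaces after end punctuation
    QUOTE = auto()  # a double quote seen right after the gap


def split_sentences_fsm(text):
    sentences = []
    start = 0       # where the current sentence begins
    boundary = 0    # where it would end if the split is confirmed
    candidate = 0   # where the next sentence would begin (at the quote)
    state = State.TEXT
    for i, ch in enumerate(text):
        if state is State.TEXT:
            if ch in END_MARKS:
                state = State.END
        elif state is State.END:
            if ch == ' ':
                boundary = i
                state = State.GAP
            elif ch not in END_MARKS and ch != '"':
                state = State.TEXT          # punctuation inside a token
        elif state is State.GAP:
            if ch == '"':
                candidate = i
                state = State.QUOTE
            elif ch.isupper():
                sentences.append(text[start:boundary])
                start = i
                state = State.TEXT
            elif ch in END_MARKS:
                state = State.END
            elif ch != ' ':
                state = State.TEXT          # lowercase: the sentence continues
        else:  # State.QUOTE
            if ch.isupper():
                sentences.append(text[start:boundary])
                start = candidate
            state = State.END if ch in END_MARKS else State.TEXT
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


sample = 'Один на улице. "Попробуй, подойди! вишь какой!" Нет, голубчик. Хм... недалеко; пошел бы!'
for sentence in split_sentences_fsm(sample):
    print(sentence)
```

Compared with chained replace() calls, each rule lives in one transition, so cases such as "..." or a quoted sentence can be handled by adding a state or an arc rather than another string pattern.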
