How to use the re.split method?

Timebird, 2017-12-30 16:13:14

Python

Say I want to extract the most frequent words from a text. I split the text on a set of characters using re.split:
words = re.split("[ \n.,?!:;']", corpus)
I also want to add the " character to that set, but if I add it, the string literal is closed prematurely. How do I add it?
Another small question: where should I look so that the split happens on every character except uppercase and lowercase letters? I remember the syntax being something like re.split("[^[az][AZ]"). What is this construction called?

2 answer(s)

tema_sun, 2017-12-30
@Timebird

Escape it with a backslash: re.split("[ \n.,?!:;'\"]", corpus)
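A minimal sketch of the escaping suggested here, using a made-up sample string for `corpus` (not from the original question). It also shows that a single-quoted literal is an alternative, in which case the apostrophe is the character that needs the backslash:

```python
import re

# Hypothetical sample text, used only for illustration
corpus = 'He said: "Hello, world!" Then left.'

# Double-quoted literal: the inner " must be escaped with a backslash
pattern_escaped = "[ \n.,?!:;'\"]"
# Single-quoted literal: the inner ' must be escaped instead
pattern_single = '[ \n.,?!:;\'"]'

# Both literals describe the same character class
assert pattern_escaped == pattern_single

# Adjacent delimiters produce empty strings, so filter them out
words = [w for w in re.split(pattern_escaped, corpus) if w]
print(words)
```

The filter on empty strings matters because re.split returns an empty string between any two delimiters that sit next to each other, such as a comma followed by a space.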

Sergey Gornostaev, 2017-12-30
@sergey-gornostaev

First, you can escape the double quote the same way you escape the newline character: "[ \n.,?!:;'\"]". Second, there is a simpler and faster way:

from collections import Counter
import string

# Translation table that deletes every ASCII punctuation character
punctuation_map = str.maketrans('', '', string.punctuation)
# Short function words to leave out of the count
prepositions = {'и', 'в', 'без', 'до', 'из', 'к', 'на', 'по', 'о', 'от', 'перед', 'при', 'через', 'с', 'у', 'за', 'над', 'об', 'под', 'про', 'для'}

with open('WarAndPeace.txt', encoding='utf-8') as fh:
    text = fh.read()

# Lowercase before filtering so that capitalized prepositions are excluded too
clean_data = text.translate(punctuation_map).lower()
words = Counter(word for word in clean_data.split() if word not in prepositions)

# The single most frequent word together with its count
print(words.most_common(1))
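Neither answer addresses the second part of the question, splitting on everything except letters. The following is only a sketch, assuming the half-remembered construction is a negated character class such as `[^a-zA-Z]` (a `^` right after the opening bracket inverts the set). The sample strings are hypothetical. Since `[^a-zA-Z]` treats only ASCII letters as word characters, the second pattern, `[\W\d_]`, is shown as one possible option for Russian text.

```python
import re

# Hypothetical sample text, used only for illustration
corpus = 'Hello, world! "Quoted" text: 42 times.'

# [^a-zA-Z] matches any character that is NOT an ASCII letter;
# the + collapses a run of such characters into a single delimiter
ascii_words = [w for w in re.split(r"[^a-zA-Z]+", corpus) if w]
print(ascii_words)

# [\W\d_] matches non-word characters, digits and the underscore,
# so letters of any alphabet (including Cyrillic) are kept
unicode_words = [w for w in re.split(r"[\W\d_]+", "Война и мир, том 1") if w]
print(unicode_words)
```

The raw-string prefix `r` keeps the backslashes in `\W` and `\d` from being interpreted by Python before the regex engine sees them.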