How to implement a loop for deleting words from a string in a dataframe?

Python

madsee, 2020-08-21 21:02:39

Good afternoon.
I'm trying to write a loop that removes certain words from a string; perhaps there is a much simpler way to do this.
I have a list of settlements with inconsistent spellings: in some entries "е" is written instead of the letter "ё", or "village" is written instead of "city". I'm trying to create a column that contains only the name, with all the 'extra' words removed. With the code below, nothing changes.

wordlist = ['поселок','посёлок','городской','городского','типа','деревня']
import re
def locality_id(row):
    name_id = row['locality_name']
    if name_id in wordlist:
        name_id = re.sub('(' + '|'.join(wordlist) + ')','',name_id)
        return name_id
    else:
        return name_id

2 answer(s)

PavelMos, 2020-08-22
@madsee

1. How is the locality written in locality_name? If it looks like 'village Bearish' (the type word followed by the name), the function will not process it, because the `if` checks whether the whole string equals a list element, not whether the string contains the word 'village'. IMHO it is easier to drop the extra check and process every value with the regexp.
2. You need to remove the space(s) left behind by the deleted word: either strip them with lstrip, or add a trailing space to the words in the regexp.
3. Add the capitalized variants: Village, City, Township.
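Point 1 can be checked directly. Here is a minimal sketch, assuming a made-up value 'деревня Кудрово' (not taken from the question's data), that compares the list membership test with a regexp search:

```python
import re

wordlist = ['поселок', 'посёлок', 'городской', 'городского', 'типа', 'деревня']
name = 'деревня Кудрово'  # hypothetical example value

# `in` on a list tests whether the whole string equals one of the elements
whole_match = name in wordlist

# a regexp search finds any of the words anywhere inside the string
pattern = re.compile('|'.join(map(re.escape, wordlist)))
contains_word = pattern.search(name) is not None

print(whole_match, contains_word)  # False True
```

So the original `if name_id in wordlist` branch is only taken when the cell consists of nothing but one of the listed words.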

import re

wordlist = ['Посёлок','Поселок','поселок','посёлок','городской','городского','типа','деревня','Деревня']

def locality_id(row):
    name_id = row['locality_name']
    # Remove every listed word, then strip the whitespace left at the start
    name_id = re.sub('(' + '|'.join(wordlist) + ')', '', name_id).lstrip()
    return name_id


for idx, row in df1.iterrows():
    print('cell=', df1.loc[idx, 'locality_name'])
    # Compute the cleaned value first, then write it back to the dataframe
    new_cell = locality_id(row)
    df1.loc[idx, 'locality_name'] = new_cell
    print('new_cell=', df1.loc[idx, 'locality_name'])
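As a possible alternative to listing every capitalization by hand, the sketch below compiles one case-insensitive pattern with word boundaries and collapses the leftover spaces. The helper name `clean_name` and the sample values are hypothetical, introduced only for illustration:

```python
import re

# Type words to drop; lowercase only, since matching ignores case.
wordlist = ['посёлок', 'поселок', 'городской', 'городского', 'типа', 'деревня']

# \b restricts matches to whole words; IGNORECASE also covers 'Посёлок', 'Деревня'.
pattern = re.compile(
    r'\b(?:' + '|'.join(map(re.escape, wordlist)) + r')\b',
    flags=re.IGNORECASE,
)

def clean_name(name):
    """Return the locality name without the type words (hypothetical helper)."""
    without_words = pattern.sub('', name)
    # split() + join() collapses the whitespace left by the removed words
    return ' '.join(without_words.split())

# Hypothetical sample values, not taken from the question's data
samples = ['Посёлок городского типа Рябово', 'деревня Кудрово', 'Санкт-Петербург']
cleaned = [clean_name(s) for s in samples]
print(cleaned)
```

With pandas, a function like this could be applied to the whole column as `df1['locality_name'].apply(clean_name)`, which avoids the explicit `iterrows` loop.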

Ternick, 2020-08-21
@Ternick

What are you passing in as row?