What is the correct way to use the and (&) operator in pandas?

P

Pavel_1321312021-06-17 13:23:15

Python

Pavel_132131, 2021-06-17 13:23:15

I have a lot of data frames (in the example there are several), I need to check each of them for the presence of keywords in the first column. To get the desired date set, I have to get two (or more) matches, in a normal situation this is solved through the 'and' operator, but it does not work in pandas (I know that in pandas the 'and' operator is denoted as '&').
At the same time, the 'or' operator ('|') works great. But the problem is that '|' at least one match is enough to fulfill the condition, but I need several.
I understand that with the '&' operator both conditions will be checked, which will be applied to each row in the data frame, I would like to know if it is possible to apply the conditions to the whole column and not row by row ?

import pandas as pd
import numpy as np
import re


df_16 = pd.DataFrame({"number": ["onE:    ", "  Two:", np.nan, "fIve", "[email protected]", "tHree", "   one", "tWo", "tWo     SIx", "one"], 'phone': [11, 22, 33, 44, 55, 300, 77, 1000, 1, 5],
 "item": ["debt", "phone", "room", "go", "hoot", "aaple", "orange", "phone", "tool", 'hoooo'],
  "seria": [1050.34344343, 150, 544, 245, 230, 50, 300, 70, 33, 100]},
  index=['q', 'e', 'r', 't', 'u', 'y', 'j', 'v', 'm', 'g'])

df_17 = pd.DataFrame({"number": ["onE#", "tWo:", "six      ", " fiVE   ", "four", "thRee:", np.nan, "twO", "two Six", "Two fOr"], 'phone': [11, 22, 33, 44, 55, 300, 77, 1000, 1, 5],
 "item": ["debt", "phone", "room", "go", "hoot", "aaple", "orange", "phone", "tool", 'hoooo'],
  "seria": [1050.34344343, 160, 544, 245, 230, 50, 300, 70, 33, 100]},
  index=['q', 'e', 'r', 't', 'u', 'y', 'j', 'v', 'm', 'g'])

df_18 = pd.DataFrame({"number": ["gggg", "TTt", "HHHH     ", " HJkl   ", "green", "redd", np.nan, "twO", "nuul", "Null"], 'phone': [11, 22, 33, 44, 55, 300, 77, 1000, 1, 5],
 "item": ["debt", "phone", "room", "go", "hoot", "aaple", "orange", "phone", "tool", 'hoooo'],
  "seria": [1050.34344343, 160, 544, 245, 230, 50, 300, 70, 33, 100]},
  index=['q', 'e', 'r', 't', 'u', 'y', 'j', 'v', 'm', 'g'])

list_data = [df_16, df_17]

for i in list_data:
    i['number'] = i['number'].fillna('empty') # заполняет nan значения 
    i['number'] = i['number'].agg([lambda do: re.sub(r'[#@:]', ' ', do)], axis=0) # убирает ненужные символы
    i['number'] = i['number'].agg([lambda do: re.sub(r'\s+', ' ', do)], axis=0) # убирает двойные (или больше) пробелы
    
    # переводит строк в нижний регистр, убирает пробелы (которые остались), поиск совпадений
    check = (i['number'].astype(str).str.lower().str.strip().str.fullmatch('two (six|for)') & i['number'].astype(str).str.lower().str.strip().str.fullmatch('six')) 
    if check.any():
        getting = i
        print(getting)        
        print(check)

I partly solved this problem, I performed keyword checking, separately, but it will be wrong when you need to search for many keywords.

import pandas as pd
import numpy as np
import re


df_16 = pd.DataFrame({"number": ["onE:    ", "  Two:", np.nan, "fIve", "[email protected]", "tHree", "   one", "tWo", "tWo     SIx", "one"], 'phone': [11, 22, 33, 44, 55, 300, 77, 1000, 1, 5],
 "item": ["debt", "phone", "room", "go", "hoot", "aaple", "orange", "phone", "tool", 'hoooo'],
  "seria": [1050.34344343, 150, 544, 245, 230, 50, 300, 70, 33, 100]},
  index=['q', 'e', 'r', 't', 'u', 'y', 'j', 'v', 'm', 'g'])

df_17 = pd.DataFrame({"number": ["onE#", "tWo:", "six      ", " fiVE   ", "four", "thRee:", np.nan, "twO", "two Six", "Two fOr"], 'phone': [11, 22, 33, 44, 55, 300, 77, 1000, 1, 5],
 "item": ["debt", "phone", "room", "go", "hoot", "aaple", "orange", "phone", "tool", 'hoooo'],
  "seria": [1050.34344343, 160, 544, 245, 230, 50, 300, 70, 33, 100]},
  index=['q', 'e', 'r', 't', 'u', 'y', 'j', 'v', 'm', 'g'])

df_18 = pd.DataFrame({"number": ["gggg", "TTt", "HHHH     ", " HJkl   ", "green", "redd", np.nan, "twO", "nuul", "Null"], 'phone': [11, 22, 33, 44, 55, 300, 77, 1000, 1, 5],
 "item": ["debt", "phone", "room", "go", "hoot", "aaple", "orange", "phone", "tool", 'hoooo'],
  "seria": [1050.34344343, 160, 544, 245, 230, 50, 300, 70, 33, 100]},
  index=['q', 'e', 'r', 't', 'u', 'y', 'j', 'v', 'm', 'g'])

list_data = [df_16, df_17]

for i in list_data:
    i['number'] = i['number'].fillna('empty') # заполняет nan значения 
    i['number'] = i['number'].agg([lambda do: re.sub(r'[#@:]', ' ', do)], axis=0) # убирает ненужные символы
    i['number'] = i['number'].agg([lambda do: re.sub(r'\s+', ' ', do)], axis=0) # убирает двойные (или больше) пробелы
    
    check = i['number'].astype(str).str.lower().str.strip().str.fullmatch('two (six|for)') # Проверка 1
    check_1 = i['number'].astype(str).str.lower().str.strip().str.fullmatch('six') # Проверка 2
    
    if check.any() and check_1.any():
        print(check)
        print(check_1)
        getting = i
        print(getting)

Reply

Answer the question

In order to leave comments, you need to log in

1 answer(s)

O

o5a, 2021-06-17
@o5a

It is not entirely clear why the checks have different conditions, if the condition required to find more than 1 match of the condition, and not several conditions?
In general, check returns an array of boolean values, i.e. the number of matches can be found simply through sum()

check = (i['number'].astype(str).str.lower().str.strip().str.fullmatch('two (six|for)')
print('совпадений:', sum(check))

If the match of 2 conditions was really required, then they do not make sense in this form: a string cannot simultaneously match 'two (six|for)' and 'for'