Why is pandas not handling numbers from csv correctly?

M

M-Pack2021-02-06 12:39:34

Python

M-Pack, 2021-02-06 12:39:34

After executing this code:

col_names = ['Project', 'OrderDate', 'orderid', 'ClientID','IsRepeat','IsBlocked','IsManual','AutoDecision','ManualApprove','IsLoan','ShortTermAmount','ShortTermPeriod','LongTermAmount','LongTermPeriod','RequestedAmount','RequestedPeriod','LoanSum','Period','ShortTermScore','LongTermScore']
#dtypes={"Project": bool, "OrderDate": 'str', "orderid": int, "ClientID" : int,"IsRepeat" :bool,"IsBlocked":bool,"IsManual":bool,"AutoDecision":bool,"ManualApprove":bool,"IsLoan":bool,"ShortTermAmount":"Int64","ShortTermPeriod":"Int64","LongTermAmount":"Int64","LongTermPeriod":"Int64","RequestedAmount":"Int64","RequestedPeriod":"Int64","LoanSum":"Int64","Period":"Int64","ShortTermScore":"float64","LongTermScore":"float64"}
test = pd.read_csv("/home/man/Test_task.csv",sep=',', thousands =',', header = 0, decimal='.', names=col_names, usecols=col_names).apply(pd.to_numeric, errors='coerce')#.astype('Float64')#,'Period': lambda x: np.float64(x) if x != '-' else np.nan})#'orderid': lambda x: int(x.replace(',','')) if x != '-' else np.nan, .fillna(0)
df = pd.DataFrame(data=test) 
df['ShortTermAmount'] = pd.read_csv("/home/man/Test_task.csv",sep=',', thousands =',', dtype = {'ShortTermAmount':"str"}, converters = {'ShortTermAmount': lambda x: x if x != '-' else np.nan}).fillna(0).replace(',', '').apply(pd.to_numeric, errors='coerce')
df.head()

I got this table:

Project 	OrderDate 	orderid 	ClientID 	IsRepeat 	IsBlocked 	IsManual 	AutoDecision 	ManualApprove 	IsLoan 	ShortTermAmount 	ShortTermPeriod 	LongTermAmount 	LongTermPeriod 	RequestedAmount 	RequestedPeriod 	LoanSum 	Period 	ShortTermScore 	LongTermScore
0 	1.0 	NaN 	1794004 	1040307 	NaN 	1.0 	NaN 	NaN 	NaN 	NaN 	1.0 	NaN 	NaN 	NaN 	6700 	12 	NaN 	NaN 	0.085 	0.085
1 	1.0 	NaN 	1794005 	1305335 	NaN 	1.0 	NaN 	NaN 	NaN 	NaN 	1.0 	NaN 	NaN 	NaN 	1900 	26 	NaN 	NaN 	0.017 	0.017
2 	1.0 	NaN 	1794021 	1174614 	1.0 	NaN 	NaN 	1.0 	NaN 	1.0 	1.0 	40.0 	NaN 	NaN 	4500 	20 	NaN 	20.0 	0.926 	0.926
3 	1.0 	NaN 	1794032 	1356306 	1.0 	NaN 	NaN 	1.0 	NaN 	1.0 	1.0 	40.0 	NaN 	NaN 	6000 	40 	NaN 	40.0 	0.679 	0.679
4 	1.0 	NaN 	1794057 	1120819 	1.0 	NaN 	NaN 	1.0 	NaN 	1.0 	1.0 	40.0 	NaN 	NaN 	70000 	168 	NaN 	40.0 	0.737

Despite the fact that in my original csv file ( https://drive.google.com/file/d/1Oseh4KnE98tC3-jRy... ) some columns have different values. So, in the ShortTermAmount column, there are integers, in which the thousandth place is separated by a comma. At the same time, the numbers from the RequestedAmount column are loaded normally, without a comma, and then they are assigned the int64 type, and the ShortTermAmount is assigned the float64 type.

How to fix it?

Reply

Answer the question

In order to leave comments, you need to log in

1 answer(s)

M

M-Pack, 2021-02-11
@M-Pack

I'll answer myself, if no one answered:

df1 = df.apply(lambda x: pd.to_numeric(x.astype(str).str.replace(',',''),errors='coerce'))