How to remove rows from a table that are repeated in another table?

Python

HelloDarknessMyOldFried, 2020-09-03 14:35:16

Good afternoon!
The problem: there is a large table and a small table. The small table contains some of the rows from the large one.
"Large":
  col1  col2  col3
0    A     1     5
1    B     2     6
2    C     3     7
3    D     4     8
4    C     3   102

"Small":
  col1  col2  col3
0    C     3     7

How can I remove from the large table the rows that also appear in the small one (in this case, the row "C 3 7")?
The result should look like this:

  col1  col2  col3
0    A     1     5
1    B     2     6
2    D     4     8
3    C     3   102

It is desirable to do this without loops, since real tables contain hundreds of thousands of rows and many repeating values.

Thank you very much!

2 answer(s)

zexer, 2020-09-03
@HelloDarknessMyOldFried

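import pandas as pd

# Sample data from the question
df1 = pd.DataFrame({'col1': ['A', 'B', 'C', 'D', 'C'], 'col2': [1, 2, 3, 4, 3], 'col3': [5, 6, 7, 8, 102]})
df2 = pd.DataFrame({'col1': ['C'], 'col2': [3], 'col3': [7]})

# Merge on all shared columns; indicator=True adds a '_merge' column
# marking each row as 'left_only', 'right_only' or 'both'.
# how='left' keeps the row order of df1 (how='outer' sorts by the join keys).
df_new = pd.merge(df1, df2, how='left', indicator=True)

# Keep only the rows that are absent from df2, then drop the helper column
df_new = df_new.loc[df_new['_merge'] == 'left_only'].drop('_merge', axis=1).reset_index(drop=True)
df_new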
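
If this is needed in more than one place, the merge approach can be wrapped in a small function. The following is only a minimal sketch, assuming both tables have the same column names; remove_matching_rows is a made-up name, not a pandas function, and the drop_duplicates() call on the small table is an extra precaution that the snippet above does not take.

```python
import pandas as pd


def remove_matching_rows(large: pd.DataFrame, small: pd.DataFrame) -> pd.DataFrame:
    """Return the rows of `large` that have no identical row in `small`.

    Hypothetical helper, not part of pandas. Rows are compared on all
    columns the two frames share, as pd.merge does by default.
    """
    # drop_duplicates() on `small` keeps a repeated row there from
    # multiplying its matches in `large` during the merge.
    merged = large.merge(small.drop_duplicates(), how='left', indicator=True)
    kept = merged.loc[merged['_merge'] == 'left_only']
    return kept.drop(columns='_merge').reset_index(drop=True)


# Sample data from the question
large = pd.DataFrame({'col1': ['A', 'B', 'C', 'D', 'C'],
                      'col2': [1, 2, 3, 4, 3],
                      'col3': [5, 6, 7, 8, 102]})
small = pd.DataFrame({'col1': ['C'], 'col2': [3], 'col3': [7]})

result = remove_matching_rows(large, small)
print(result)
```

With the tables from the question this should leave the rows A, B, D and "C 3 102", in that order, with a fresh 0..3 index.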

P.S.
Learn to search Google in English.
I'm no pandas expert, but even a blunt, obvious query like "how to delete rows from table from other table pandas" returns, as the very first result, the link this solution was taken from:
https://stackoverflow.com/questions/39880627/in-pa...

PavelMos, 2020-09-03
@PavelMos

Convert the dataframes into lists, more precisely into a two-dimensional format (a list of lists), filter them with a list comprehension that has a condition, then convert the result back into a dataframe with the same column names as before.

import pandas as pd

small_rows = df2.values.tolist()  # compute once rather than on every iteration
new_list = [x for x in df1.values.tolist() if x not in small_rows]
df3 = pd.DataFrame.from_records(data=new_list, columns=df1.columns)
df3
Out[217]: 
  col1  col2  col3
0    A     1     5
1    B     2     6
2    D     4     8
3    C     3   102

Or, as written above, do it with a merge on the dataframes themselves. Check which one works faster.
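
One way to check is to time both variants with timeit. This is a rough sketch on synthetic data: the table sizes, the random values and the function names via_merge and via_lists are all assumptions made up for the illustration, not anything from the real tables. Note that the list version compares every row of the large table against the rows of the small one, so it will probably scale worse as the tables grow, but measure on your own data.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data, assumed for illustration only: 2,000 rows in the large
# table and up to 200 of them sampled into the small one.
rng = np.random.default_rng(0)
large = pd.DataFrame({
    'col1': rng.choice(list('ABCD'), size=2000),
    'col2': rng.integers(0, 50, size=2000),
    'col3': rng.integers(0, 50, size=2000),
})
small = large.sample(n=200, random_state=0).drop_duplicates()


def via_merge():
    # Left merge with an indicator column, keep rows found only in `large`
    merged = large.merge(small, how='left', indicator=True)
    return merged.loc[merged['_merge'] == 'left_only'].drop(columns='_merge')


def via_lists():
    # List comprehension over plain Python lists of row values
    small_rows = small.values.tolist()
    kept = [row for row in large.values.tolist() if row not in small_rows]
    return pd.DataFrame.from_records(kept, columns=large.columns)


# Both approaches should keep the same number of rows
assert len(via_merge()) == len(via_lists())

# Total seconds for 5 runs of each variant
print('merge:', timeit.timeit(via_merge, number=5))
print('lists:', timeit.timeit(via_lists, number=5))
```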