Answer the question
In order to leave comments, you need to log in
How to split data in Pandas row?
Good afternoon, I'm trying to comprehend the Pandas library and I wanted to conduct a study on the number and features of films of different genres in one dataset.
The dataframe has a column with genres separated by this | character.
I want to transform my data so that in my dataframe each row has only one genre and let the films repeat, then I will simply conveniently group the data by the name of the film.
The use of the split function is just spinning in my head, but I can’t continue the idea further
Answer the question
In order to leave comments, you need to log in
>>> import pandas as pd
>>> df = pd.DataFrame(, columns=['title', 'genre'])
>>> df
title genre
0 123 Anime|Action
1 321 Adventure|Comedy
>>> df['genre'] = df['genre'].apply(lambda x: x.split('|'))
>>> df
title genre
0 123 [Anime, Action]
1 321 [Adventure, Comedy]
>>> df.explode('genre')
title genre
0 123 Anime
0 123 Action
1 321 Adventure
1 321 Comedy
IMHO it is more expedient to make attributes like "is it a comedy? yes / no (1/0)", to do this, enter additional. columns. If you duplicate rows, then the size of the dataframe will also increase greatly.
import pandas
df1=pandas.DataFrame.from_records((
(1, 'xxx', 'Adv|Ani|Doc'),
(2, 'yyy', 'Adv|Doc'),
(3, 'zzz', 'Comedy|Doc')),
columns=['movieId','title','genres'])
genres_list=('Adv','Ani','Doc','Comedy')
for i in genres_list:
df1[i]=[0]*len(df1) #сначала прописать всем нули
print (df1)
for idx, row in df1.iterrows():
c=(row[2])
l=c.split('|')
for g in genres_list:
if g in l:
df1.loc[idx, g]=1
print (df1)
movieId title genres Adv Ani Doc Comedy
0 1 xxx Adv|Ani|Doc 0 0 0 0
1 2 yyy Adv|Doc 0 0 0 0
2 3 zzz Comedy|Doc 0 0 0 0
movieId title genres Adv Ani Doc Comedy
0 1 xxx Adv|Ani|Doc 1 1 1 0
1 2 yyy Adv|Doc 1 0 1 0
2 3 zzz Comedy|Doc 0 0 1 1
Didn't find what you were looking for?
Ask your questionAsk a Question
731 491 924 answers to any question