How to split one column into two in a dataframe?

Apache Spark

denislysenko, 2021-12-12 18:19:29

I am working in a Zeppelin notebook and have this DataFrame:

splited_genres_df.show(20)

+-------+--------------------+---------+
|movieId|               title|   genres|
+-------+--------------------+---------+
|      1|    Toy Story (1995)|Adventure|
|      1|    Toy Story (1995)|Animation|
|      1|    Toy Story (1995)| Children|
|      1|    Toy Story (1995)|   Comedy|
|      1|    Toy Story (1995)|  Fantasy|
|      2|      Jumanji (1995)|Adventure|
|      2|      Jumanji (1995)| Children|
|      2|      Jumanji (1995)|  Fantasy|
|      3|Grumpier Old Men ...|   Comedy|
|      3|Grumpier Old Men ...|  Romance|
|      4|Waiting to Exhale...|   Comedy|
|      4|Waiting to Exhale...|    Drama|
|      4|Waiting to Exhale...|  Romance|
|      5|Father of the Bri...|   Comedy|
|      6|         Heat (1995)|   Action|
|      6|         Heat (1995)|    Crime|
|      6|         Heat (1995)| Thriller|
|      7|      Sabrina (1995)|   Comedy|
|      7|      Sabrina (1995)|  Romance|
|      8| Tom and Huck (1995)|Adventure|
+-------+--------------------+---------+
only showing top 20 rows

The title column contains both the film's name and its release year. I need to put the year into a separate column called year and change title so that it holds only the film's name, without the year.

1 answer(s)

Slava Rozhnev, 2021-12-12
@denislysenko

# pandas syntax: this assumes splited_genres_df is a pandas DataFrame, not a Spark one
splited_genres_df['year'] = splited_genres_df['title'].str.extract(r'\((\d+)\)', expand=False)
splited_genres_df['title'] = splited_genres_df['title'].str.extract(r'(.+)\(\d+\)', expand=False).str.strip()
splited_genres_df.head()
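
The `.str.extract` calls above are pandas string accessors, while the question is about a Spark DataFrame. Below is a rough sketch of the same idea with Spark's built-in `regexp_extract` and `regexp_replace`, assuming the notebook uses the PySpark interpreter and that the year always appears as four digits in parentheses at the end of the title. The names `split_title` and `add_year_column` are made up for this example; `split_title` is only a plain-Python helper for checking the patterns without a cluster.

```python
import re

# Assumption: the year is a 4-digit number in parentheses at the end of the title.
YEAR_PATTERN = r"\((\d{4})\)\s*$"         # captures the year itself
YEAR_SUFFIX_PATTERN = r"\s*\(\d{4}\)\s*$"  # matches " (1995)" so it can be removed


def split_title(title):
    """Plain-Python check of the two patterns: returns (clean_title, year)."""
    match = re.search(YEAR_PATTERN, title)
    year = match.group(1) if match else ""
    return re.sub(YEAR_SUFFIX_PATTERN, "", title), year


def add_year_column(df):
    """Hypothetical helper: adds a string `year` column and strips the year from `title`."""
    # Imported here so the regex helper above can be tried without Spark installed.
    from pyspark.sql import functions as F

    return (
        # Extract the year first, while `title` still contains it.
        # regexp_extract returns an empty string when the pattern does not match.
        df.withColumn("year", F.regexp_extract("title", YEAR_PATTERN, 1))
        # Then remove the trailing "(yyyy)"; titles without a year stay unchanged.
        .withColumn("title", F.regexp_replace("title", YEAR_SUFFIX_PATTERN, ""))
    )
```

In the notebook this would be called as `add_year_column(splited_genres_df).show(20)`. Note that `show()` shortens long strings only for display (hence `Grumpier Old Men ...` in the output above); the patterns run against the full title stored in the column.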