Answer the question
In order to leave comments, you need to log in
What is the best approach to filter tables?
Hello. Sorry for the long question, but please read to the end. Thanks in advance.
My assignment is that I have two files: movies.csv and rating.csv (the first one has 10,000 lines, the second one has 100,000 lines)
The movies.csv file looks like this:
movieId,title,genres
1,Toy Story (1995 ),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
5 ,Father of the Bride Part II (1995),Comedy
6,Heat (1995),Action|Crime|Thriller
7,Sabrina (1995),Comedy|Romance
8,Tom and Huck (1995),Adventure|Children
9,Sudden Death (1995),Action
...
The rating.csv file looks like this:
userId,movieId,rating,timestamp
1,1,4.0,964982703
1,3,4.0,964981247
1,6,4.0,964982224
1,47,5.0,964983815
1,50,5.0,964982931
1 ,
70.3.0.964982400
1,101.5.5.0.964980868
2.110.4.0.964982176
2.151.5.5.5.5.5.964984041
2,157,0,964984100
3.163.5.964983650
3.216.5.0,964981208
3.223.3.0.98080EL
_
information about different films, in rating.csv information about reviews of these films.
I need to write a console utility in pure python (without pandas and SQL, you can use only built-in functions in python) (determine the top n most rated films (by average rating) for each specified genre), the result will be a csv-like dataset with the title genre , title, year, rating)
command line arguments:
-N - the number of the most rated films for each genre. optional
-genres - user defined genre filter. may be multiple. for example, Comedy|Adventure or Comedy&Adventure. optional
-year_from
-year_to - filter for release years of films. optional
-regexp - filter (regular expression) for the movie title. optional
if some arguments are not specified, then there is no corresponding filter.
if nothing is specified at all, then all films are displayed sorted by genre and average rating (in descending order)
here is what I wrote, but it works slowly
import argparse
import csv
parser = argparse.ArgumentParser()
parser.add_argument('-N',
type=int,
help='the number of the highest rated films for each genre'
)
parser.add_argument('-genres',
type=str,
help='filter by genre'
)
parser.add_argument('-year_from',
type=int,
help='filter by year (FROM YEAR)',
default=1800
)
parser.add_argument('-year_to',
type=int,
help='filter by year (TO YEAR)',
default=2025
)
parser.add_argument('-regexp',
type=str,
help='filter on the movie name'
)
args = parser.parse_args()
data_m = []
with open('files/movies.csv', encoding='utf-8') as file:
reader = csv.reader(file, delimiter=',')
for row in reader:
data_m.append(row)
data_movies = data_m[1:]
data_r = []
with open('files/ratings.csv', encoding='utf-8') as file:
reader = csv.reader(file, delimiter=',')
for row in reader:
data_r.append(row)
data_rating = data_r[1:]
result = [] # сюда будет добавляться средний рейтинг по фильмам
for row in data_movies:
film_ID = row[0]
sum_rating = 0.0
count = 0.0
for line in data_rating:
if film_ID == line[1]:
sum_rating += float(line[2])
count += 1
if count != 0.0:
result.append(sum_rating / count)
else:
result.append(0)
# добавление в таблицу среднего рейтинга и года
for i in range(len(data_movies)):
data_movies[i].append(result[i])
data_movies[i].append(data_movies[i][1][-5:-1])
# сейчас к data_movies добавлен средний рейтинг к каждому фильму и год выпуска фильма
# вывожу в консоль эту таблицу
for row in data_movies:
print(row)
"""
Короче, сейчас у нас есть data_movies таблица, где добавлен средний рейтинг и год к каждому фильму
!!! НО ЭТО ОЧЕНЬ МЕДЛЕННО РАБОТАЕТ
"""
Answer the question
In order to leave comments, you need to log in
Because you go through the entire rating table for each film, and even parse the numbers every time. It still wouldn't slow down.
Wouldn't it be easier for each film to store the sum of the ratings and their number, and then go through the rating table once and fill these storages?
from collections import defaultdict
total = defaultdict(float)
count = defaultdict(int)
for line in data_rating:
total[line[1]] += float(line[2])
count[line[1]] += 1
averages = { id: total[id]/count[id] for id in count } #средние рейтинги
#и сортируем список фильмов по убыванию рейтинга
data_movies.sort(key=lambda item: averages.get(item[1], 0), reverse=True)
Didn't find what you were looking for?
Ask your questionAsk a Question
731 491 924 answers to any question