What is the best approach to filter tables?

D

denislysenko2021-10-05 20:29:50

Python

denislysenko, 2021-10-05 20:29:50

Hello. Sorry for the long question, but please read to the end. Thanks in advance.

My assignment is that I have two files: movies.csv and rating.csv (the first one has 10,000 lines, the second one has 100,000 lines)

The movies.csv file looks like this:
movieId,title,genres
1,Toy Story (1995 ),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
5 ,Father of the Bride Part II (1995),Comedy
6,Heat (1995),Action|Crime|Thriller
7,Sabrina (1995),Comedy|Romance
8,Tom and Huck (1995),Adventure|Children
9,Sudden Death (1995),Action
...

The rating.csv file looks like this:
userId,movieId,rating,timestamp
1,1,4.0,964982703
1,3,4.0,964981247
1,6,4.0,964982224
1,47,5.0,964983815
1,50,5.0,964982931
1 , 70.3.0.964982400
1,101.5.5.0.964980868
2.110.4.0.964982176
2.151.5.5.5.5.5.964984041
2,157,0,964984100
3.163.5.964983650
3.216.5.0,964981208
3.223.3.0.98080EL
_

information about different films, in rating.csv information about reviews of these films.
I need to write a console utility in pure python (without pandas and SQL, you can use only built-in functions in python) (determine the top n most rated films (by average rating) for each specified genre), the result will be a csv-like dataset with the title genre , title, year, rating)

command line arguments:
-N - the number of the most rated films for each genre. optional
-genres - user defined genre filter. may be multiple. for example, Comedy|Adventure or Comedy&Adventure. optional
-year_from
-year_to - filter for release years of films. optional
-regexp - filter (regular expression) for the movie title. optional

if some arguments are not specified, then there is no corresponding filter.
if nothing is specified at all, then all films are displayed sorted by genre and average rating (in descending order)

here is what I wrote, but it works slowly

import argparse
import csv

parser = argparse.ArgumentParser()
parser.add_argument('-N',
                    type=int,
                    help='the number of the highest rated films for each genre'
                    )
parser.add_argument('-genres',
                    type=str,
                    help='filter by genre'
                    )
parser.add_argument('-year_from',
                    type=int,
                    help='filter by year (FROM YEAR)',
                    default=1800
                    )
parser.add_argument('-year_to',
                    type=int,
                    help='filter by year (TO YEAR)',
                    default=2025
                    )
parser.add_argument('-regexp',
                    type=str,
                    help='filter on the movie name'
                    )
args = parser.parse_args()

data_m = []
with open('files/movies.csv', encoding='utf-8') as file:
    reader = csv.reader(file, delimiter=',')
    for row in reader:
        data_m.append(row)
data_movies = data_m[1:]

data_r = []
with open('files/ratings.csv', encoding='utf-8') as file:
    reader = csv.reader(file, delimiter=',')
    for row in reader:
        data_r.append(row)
data_rating = data_r[1:]





result = []  # сюда будет добавляться средний рейтинг по фильмам
for row in data_movies:
    film_ID = row[0]
    sum_rating = 0.0
    count = 0.0
    for line in data_rating:
        if film_ID == line[1]:
            sum_rating += float(line[2])
            count += 1
    if count != 0.0:
        result.append(sum_rating / count)
    else:
        result.append(0)

# добавление в таблицу среднего рейтинга и года
for i in range(len(data_movies)):
    data_movies[i].append(result[i])
    data_movies[i].append(data_movies[i][1][-5:-1])

# сейчас к data_movies добавлен средний рейтинг к каждому фильму и год выпуска фильма
# вывожу в консоль эту таблицу
for row in data_movies:
    print(row)

"""
Короче, сейчас у нас есть data_movies таблица, где добавлен средний рейтинг и год к каждому фильму
!!! НО ЭТО ОЧЕНЬ МЕДЛЕННО РАБОТАЕТ
"""

The bottom line is that I made a table where there is all the information about the films and the average rating for each film is added, but in order to get this average rating for each film, I had to put a loop in the loop (because a different number of reviews were made for each film by different people ) and there were about 1 billion passes, and it works very slowly, just to print this final table to the console takes about 30 seconds.

I planned to apply filters to this final table (from the command line arguments) and output the result to the console in the way that I found it quite convenient to have a ready-made table with an average rating and simply apply various conditions to this table based on the command line arguments.

BUT, JUST TO MAKE A CONVENIENT TABLE - WORKS VERY SLOWLY

WHICH APPROACH TO SOLVING THIS PROBLEM WOULD YOU USE, IN WHAT WAY WOULD YOU SOLVE THIS PROBLEM WITHOUT THE USE OF THIRD-PARTY LIBRARIES?
Thanks in advance

Reply

Answer the question

In order to leave comments, you need to log in

1 answer(s)

V

Vindicar, 2021-10-05
@denislysenko

Because you go through the entire rating table for each film, and even parse the numbers every time. It still wouldn't slow down.
Wouldn't it be easier for each film to store the sum of the ratings and their number, and then go through the rating table once and fill these storages?

from collections import defaultdict
total = defaultdict(float)
count = defaultdict(int)
for line in data_rating:
    total[line[1]] += float(line[2])
    count[line[1]] += 1

averages = { id: total[id]/count[id] for id in count } #средние рейтинги
#и сортируем список фильмов по убыванию рейтинга
data_movies.sort(key=lambda item: averages.get(item[1], 0), reverse=True)