denislysenko, 2021-11-23 23:28:29
Python

What is the map function supposed to do in MapReduce?

The essence of my task:
There is a movies.csv file (about 10,000 lines). In case it matters, it can be downloaded with: wget https://files.grouplens.org/datasets/movielens/ml- ...

The file looks like this:

movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
5,Father of the Bride Part II (1995),Comedy
6,Heat (1995),Action|Crime|Thriller
7,Sabrina (1995),Comedy|Romance
8,Tom and Huck (1995),Adventure|Children
9,Sudden Death (1995),Action
...

Using the MapReduce approach, I need a file called mapper.py that prints key-value pairs to the console, where the key is a movie's genre and the value is the movie's name and year. It should be possible to filter movies by name, year, and genre via command-line arguments; if no arguments are passed, all movies should be printed.

Example:
Take this line:
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy

It has several genres, so mapper.py should print the movie to the console once per genre, like this:

Adventure Toy Story;1995
Animation Toy Story;1995
Children Toy Story;1995
Comedy Toy Story;1995
Fantasy Toy Story;1995

And so on for all the movies.
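
To make the expected behaviour concrete, here is a minimal sketch of the record-to-pairs step on its own, with no filtering. The names `map_record` and `TITLE_YEAR` are hypothetical and are not part of my mapper.py, and using the csv module (so that quoted titles containing commas still parse) is my assumption, not something the task requires.

```python
import csv
import re

# Hypothetical helper: matches a title that ends in "(YYYY)".
TITLE_YEAR = re.compile(r"^(?P<name>.*)\s\((?P<year>\d{4})\)\s*$")


def map_record(line):
    """Turn one CSV line into (genre, "name;year") pairs."""
    row = next(csv.reader([line]), None)
    if not row or len(row) != 3:
        return  # blank or malformed line
    _, title, genres = row
    match = TITLE_YEAR.match(title)
    if match is None:
        return  # header line, or a title without a year
    value = "{};{}".format(match.group("name"), match.group("year"))
    for genre in genres.strip().split("|"):
        if genre != "(no genres listed)":
            yield genre, value


if __name__ == "__main__":
    sample = "1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy"
    for key, value in map_record(sample):
        print("{}\t{}".format(key, value))
```

The function takes exactly one input record and yields zero or more key-value pairs, without touching stdin, stdout, or the command-line arguments.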

I run it like this: cat files/movies.csv | python3 mapper.py [optional arguments to filter movies]
My mapper.py file looks like this:

import sys
import argparse
import re


def argpars():
    parser = argparse.ArgumentParser()
    parser.add_argument('-genres',
                        type=str,
                        help='filter by genre',
                        default=''
                        )
    parser.add_argument('-year_from',
                        type=int,
                        help='filter by year (FROM YEAR)',
                        default=1800
                        )
    parser.add_argument('-year_to',
                        type=int,
                        help='filter by year (TO YEAR)',
                        default=2025
                        )
    parser.add_argument('-regexp',
                        type=str,
                        help='filter on the movie name',
                        default=''
                        )
    return parser.parse_args()


def print_result(data, year_from, year_to, name, genres_argument):
    for line in data:
        for key, value in map(line, year_from, year_to, name, genres_argument):
            print(key, "\t", str(value))


def map(line, year_from, year_to, name, genres_argument):
    list_line = line.split(",")
    list_line[2] = list_line[2].rstrip('\r\n')  # strip the line ending without assuming it is CRLF
    # filter by year and regexp
    if filter_by_year(year_from, year_to, list_line[1]) and filter_by_regexp(name, list_line[1]):
        # filter by genres
        list_genres_argument = genres_argument.split('|')
        genres = list_line[2].split('|')
        for arg_genre in list_genres_argument:
            for genre in genres:
                if filter_by_genres(genre, arg_genre):
                    key = genre
                    value = '{};{}'.format(list_line[1][:-7], list_line[1][-5:-1])
                    yield key, value


def filter_by_year(year_from, year_to, string):
    pattern = r'\(\d{4}\)'
    if re.search(pattern, string):
        year = re.search(pattern, string)
        a = year.group(0)[1:-1]
        int_year = int(a)
        if year_from <= int_year <= year_to:
            return True


def filter_by_regexp(name, string):
    pattern = name
    if re.search(pattern, string):
        return True


def filter_by_genres(genre, genres_argument):
    if genres_argument == '' and genre != '(no genres listed)':
        return True
    elif genre == genres_argument and genre != '(no genres listed)':
        return True


if __name__ == "__main__":
    args = argpars()
    print_result(sys.stdin, args.year_from, args.year_to, args.regexp, args.genres)


This code is fully functional and does its job correctly: it prints a key and a value to the console, where the key is the genre and the value is the string movie_name;year. From the point of view of MapReduce, however, the map function is not implemented quite right here. How can I change it to make it more correct? I would also be glad to see any comments on the code, for example better names for functions or variables that would make it more readable.
There has to be a map function, so how do I implement it in a way that is fully consistent with the MapReduce approach? Do I need to change the structure of the code?
What should the map() function do within my task: what should it take as input, and what should it return?
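
For reference, this is a rough sketch of the kind of structure I am asking about, not a confirmed answer: parsing, filtering, and emitting are separated, and the map step sees one line at a time. All names here (`Movie`, `parse`, `keep`, `map_line`, `run`) are hypothetical. It also assumes that the name pattern is matched against the title without the year, which differs slightly from my current code.

```python
import csv
import re
import sys
from typing import Iterator, NamedTuple, Optional, Tuple


class Movie(NamedTuple):
    name: str
    year: int
    genres: Tuple[str, ...]


def parse(line: str) -> Optional[Movie]:
    """Parse one CSV line; return None for the header or unusable lines."""
    row = next(csv.reader([line]), None)
    if not row or len(row) != 3:
        return None
    match = re.match(r"^(.*)\s\((\d{4})\)\s*$", row[1])
    if match is None:
        return None
    genres = tuple(row[2].strip().split("|"))
    return Movie(match.group(1), int(match.group(2)), genres)


def keep(movie: Movie, year_from: int, year_to: int, regexp: str) -> bool:
    """Record-level filters: year range and name pattern."""
    in_range = year_from <= movie.year <= year_to
    return in_range and re.search(regexp, movie.name) is not None


def map_line(line: str, year_from: int = 1800, year_to: int = 2025,
             regexp: str = "", genres: str = "") -> Iterator[Tuple[str, str]]:
    """Map step: one input line in, zero or more (genre, "name;year") out."""
    movie = parse(line)
    if movie is None or not keep(movie, year_from, year_to, regexp):
        return
    wanted = set(genres.split("|")) if genres else None
    for genre in movie.genres:
        if genre == "(no genres listed)":
            continue
        if wanted is None or genre in wanted:
            yield genre, "{};{}".format(movie.name, movie.year)


def run(lines, out=sys.stdout, **filters) -> None:
    """Driver: feed each line to the map step and write key<TAB>value."""
    for line in lines:
        for key, value in map_line(line, **filters):
            out.write("{}\t{}\n".format(key, value))
```

In this layout, `run(sys.stdin, year_from=..., year_to=..., regexp=..., genres=...)` would take the place of print_result, and the map step keeps no state between lines.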
