The lab session will take place synchronously in Amphi 2 and in the usual BBB room. For all questions, students are welcome to use the DATA AI mattermost (click on gitlab authentication, then on shibboleth, then select Télécom Paris in the list of authentication providers and enter your Télécom credentials), where I have created a DataAI922 room. I strongly invite you to be on the BBB at 13:30 for an explanation of the tasks (it has been recorded).
The lab session is composed of 4 exercises, using the MovieLens dataset. The goal is to train yourself to write programs in the Map-Reduce model with the Map-Reduce API (including the difficulties of manipulating strings and so on). Since most of you do not know Java, I created an API inspired by the Java API but in Python. The API that you will have to use defines four classes: Mapper, Combiner, Reducer, and Job. The most important one is Job, which allows you to define Map-Reduce jobs. To create a Map-Reduce job, you create a Job, then add your mappers (using myJob.add_mapper("inputfile","path.to.mapper.class")), your combiner if you have one (using myJob.add_combiner("path.to.combiner.class")) and finally your reducer (using myJob.add_reducer("outputfile","path.to.reducer.class")). Note that you can add several mappers to the same Map-Reduce job, but there can be only one reducer and at most one combiner. The combiner is optional, but there must be a reducer and at least one mapper.
Once you have defined a Map-Reduce job, you can run it directly (using myJob.run()) or use it later as a dependency of another job (using myOtherJob.add_dependency(myJob)). Note that this dependency capability is there to match the Java API, but you can limit yourself to using only run().
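As a rough illustration of how the Job calls described above fit together, here is a minimal sketch. The Job class below is only a stand-in so that the snippet runs on its own (the real one comes with the lab's API and actually executes the job); the no-argument constructor, the file names and the class paths are all hypothetical.

```python
# Stand-in for the lab's Job class so this sketch runs on its own.
# In the lab environment, use the Job class provided by the API instead.
class Job:
    def __init__(self):
        self.mappers = []      # (input file, mapper class path) pairs
        self.combiner = None   # at most one combiner
        self.reducer = None    # (output file, reducer class path)

    def add_mapper(self, input_file, mapper_path):
        self.mappers.append((input_file, mapper_path))

    def add_combiner(self, combiner_path):
        self.combiner = combiner_path

    def add_reducer(self, output_file, reducer_path):
        self.reducer = (output_file, reducer_path)

    def run(self):
        # The real Job executes the Map-Reduce. This stand-in only checks
        # the rule stated above: at least one mapper and one reducer.
        if not self.mappers or self.reducer is None:
            raise ValueError("a job needs at least one mapper and a reducer")
        return True


# Hypothetical file names and class paths, for illustration only.
count_job = Job()
count_job.add_mapper("ratings.csv", "exercise.RatingMapper")
count_job.add_combiner("exercise.CountCombiner")  # optional
count_job.add_reducer("counts.txt", "exercise.CountReducer")

# A second job reading the output of the first one.
filter_job = Job()
filter_job.add_mapper("counts.txt", "exercise.FilterMapper")
filter_job.add_reducer("frequent.txt", "exercise.FilterReducer")

# Using only run(): execute the jobs in the order they depend on each other.
count_job.run()
filter_job.run()
```

Exercise 0 shows the real calls in context and remains the reference for the exact usage.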
To define a mapper class, you need to create a class that inherits from the Mapper class and overrides the map function. This map function takes two arguments: self (the object) and line, the line that is going to be mapped. Note that the line includes the '\n' symbol (so you probably want to start the map function with something like line = line[:-1] to remove it). The map function should return a list of pairs, and the two elements of each pair need to be serializable (i.e. we can cast them to strings).
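To make the mapper description concrete, here is a small sketch of a mapper that emits one (movie id, 1) pair per rating line. The empty Mapper base class is a stand-in so that the snippet runs outside the lab environment, the class name MovieCountMapper is made up, and the comma-separated layout userId,movieId,rating,timestamp is an assumption about the input file; check the actual files of the lab.

```python
# Stand-in base class so the sketch runs outside the lab environment.
# In the lab, inherit from the Mapper class provided by the API.
class Mapper:
    pass


class MovieCountMapper(Mapper):
    """Emits (movie_id, 1) for every rating line."""

    def map(self, line):
        # Same effect as line[:-1] when the line ends with '\n',
        # but harmless if the last line of the file has no newline.
        line = line.rstrip("\n")
        # Assumed layout: userId,movieId,rating,timestamp
        fields = line.split(",")
        if len(fields) < 4 or not fields[1].isdigit():
            return []  # skip a header line or a malformed line
        # Both elements of the pair can be cast to a string.
        return [(fields[1], 1)]


mapper = MovieCountMapper()
print(mapper.map("1,31,2.5,1260759144\n"))
print(mapper.map("userId,movieId,rating,timestamp\n"))
```

Returning an empty list for lines that should be ignored keeps the return type uniform: the function always returns a list of pairs.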
To define a reducer class, you need to create a class that inherits from the Reducer class and overrides the reducer function. This reducer function takes three arguments: self (the object), key, the key (which is a string!), and values, the list of values associated with the key (this list contains only strings). The output of the reducer should be a list of strings. Each of those strings will be included in the output of the Map-Reduce job.
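The following sketch shows what a counting reducer could look like. The empty Reducer base class is a stand-in so that the snippet runs on its own, the class name MovieCountReducer is made up, and the method name reducer is taken from the wording above; check Exercise 0 for the exact name expected by the API. The tab-separated output format is a free choice made for this example.

```python
# Stand-in base class so the sketch runs outside the lab environment.
# In the lab, inherit from the Reducer class provided by the API.
class Reducer:
    pass


class MovieCountReducer(Reducer):
    """Sums the counts received for one key."""

    def reducer(self, key, values):
        # key and values are strings, so convert before doing arithmetic.
        total = sum(int(v) for v in values)
        # One output line per key: "key<TAB>total".
        return [f"{key}\t{total}"]


r = MovieCountReducer()
result = r.reducer("31", ["1", "1", "1"])
print(result)
```

Forgetting the int(...) conversion is a common mistake here: calling sum directly on the list of strings raises a TypeError.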
To define a combiner class, you need to create a class that inherits from the Combiner class and overrides the combiner function. This combiner function takes the same arguments as a reducer: self (the object), key, the key (which is a string!), and values, the list of values associated with the key (this list contains only strings). The output of the reducer should be a list of strings. Each of those strings will be included in the output of the Map-Reduce job.
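Here is a sketch of a combiner for the counting case, together with a quick check of why it is safe. It assumes that the strings returned by the combiner are passed on as values for the same key to the reducer; the text above does not confirm this, so verify it against Exercise 0. The empty Combiner base class is a stand-in, and the class name MovieCountCombiner and the method name combiner (taken from the wording above) are assumptions as well.

```python
# Stand-in base class so the sketch runs outside the lab environment.
# In the lab, inherit from the Combiner class provided by the API.
class Combiner:
    pass


class MovieCountCombiner(Combiner):
    """Pre-sums the counts for one key before they reach the reducer."""

    def combiner(self, key, values):
        partial = sum(int(v) for v in values)  # values are strings
        # Assumption: the returned strings become values for the same key.
        return [str(partial)]


# Why a combiner is safe for counting: addition is associative and
# commutative, so summing partial sums gives the same total as summing
# all the original values at once.
values = ["1", "1", "1", "1"]
direct_total = sum(int(v) for v in values)

c = MovieCountCombiner()
partials = c.combiner("31", values[:3]) + c.combiner("31", values[3:])
combined_total = sum(int(v) for v in partials)
print(direct_total, combined_total)
```

The same reasoning does not hold for every computation: averaging partial averages, for instance, generally does not give the overall average, so a combiner for an average would have to carry a sum and a count instead.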
Exercise 0 contains an example of two Map-Reduce jobs: the first counts the distinct items, and the second keeps only those appearing more than twice. Exercise 0 will not be taken into account for the grade of the lab, but the other exercises will! The number of attempts that you make will also be considered, so do not "submit" (the evaluate button) carelessly. You can use the "run" button and the "debug" button as much as you want.
The grading will consider three aspects:
- The correctness of your solution
- The number of attempts per exercise (the number of times you click on "run" or "debug" will not be counted; I only count the number of failed "evaluate" attempts before a correct solution)
- The style of your code.
This lab session is relatively long, and you are not expected to finish everything to get a passing grade. In contrast, to get the maximum grade you are expected to finish everything and have clean solutions, with useful comments where needed and with combiners where you think they are appropriate (in which case your comments should explain why a combiner might help).
You need to be connected to see the exercises!