Topic outline

  • Generalities

    If you have not done so already, you need to click on the "Join course" button below to get access to all the exercises.




  • Lesson 15/09/2021



    Here is a recording of the course: https://partage.imt.fr/index.php/s/4F4H2jR9YXcsmJN

  • Lesson from last year (09/09/2020): introduction to Big Data Processing and Hadoop

    The first session will start on 09/09 at 13:30. I will meet you at the URL https://digimedia1.r2.enst.fr/b/lou-ybz-9q5 for a synchronous introduction (i.e., via video conference). This session will be recorded, but you are invited to attend this first one in person :)


    The course will then cover the Hadoop paradigm, on which we will do a lab session next week (meeting in person for those of you who are available)!


    Plan of the course:

  • Lesson 22/09: Spark RDD

    We will go over the Spark RDD slides and do some of the exercises. For the exercises, you have access to the following functions on RDDs:

    • map(fun)
    • filter(fun)
    • flatMap(fun)
    • mapValues(fun)
    • groupByKey()
    • reduce(fun)
    • reduceByKey(fun)
    • coGroup(otherRDD)
    • union(otherRDD)
    The API also provides the following two functions (but you should not use them):
    • count()
    • collect()


    Examples of Spark RDD usage:

    • l = sc.parallelize([e for e in range(10)])   # build an RDD from a Python list
    • l.collect()   # bring the RDD back to the driver: [0, 1, ..., 9]
    • def add(x, y):
          return "(" + x + "+" + y + ")"

      a = sc.parallelize([str(e) for e in range(4)])
      a.reduce(add)   # folds the elements pairwise, e.g. "(((0+1)+2)+3)"; the exact nesting depends on the partitioning
    • a.fold("zero", add)   # same, but "zero" is also folded in once per partition and once when merging the partition results
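
    To see how the allowed functions combine, here is a minimal word-count sketch (the lines RDD is an assumption for the example, e.g. lines = sc.parallelize(["a b", "b c"])):

        words = lines.flatMap(lambda line: line.split())   # one element per word
        pairs = words.map(lambda word: (word, 1))          # turn each word into a (word, 1) pair
        counts = pairs.reduceByKey(lambda x, y: x + y)     # sum the counts of each word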

    The lesson was not recorded, but here is the recording from last year.

  • Lab Session 29/09

    The lab session is composed of 4 exercises, using the MovieLens dataset. The goal is for you to practice writing programs in the Map-Reduce model with the Map-Reduce API (and to face the difficulties of manipulating strings, and so on...). Since most of you do not know Java, I created an API inspired by the Java API but in Python.


    Instructions for the Python version

    The API that you will have to use defines four classes: Mapper, Combiner, Reducer, and Job. The most important one is Job, which allows you to define Map-Reduce jobs. To create a Map-Reduce job, you create a Job, then add your mappers (using myJob.add_mapper("inputfile","path.to.mapper.class")), your optional combiner (using myJob.add_combiner("path.to.combiner.class")), and finally your reducer (using myJob.add_reducer("outputfile","path.to.reducer.class")). Note that you can add several mappers to the same Map-Reduce job, but there can be only one reducer and only one combiner. The combiner is optional, but there must be a reducer and at least one mapper.

    Once you have defined a Map-Reduce job, you can run it directly (using myJob.run()) or use it later as a dependency of another job (using myOtherJob.add_dependency(myJob)). Note that this dependency capability is there to match the Java API, but you can limit yourself to using only run().
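
    To make this concrete, here is a minimal sketch of two chained jobs (assuming Job takes no constructor arguments; the file names and the exercise0.* class paths are illustrations, not part of the actual exercises; the mapper/reducer classes themselves are described below):

        job1 = Job()
        job1.add_mapper("ratings.csv", "exercise0.CountMapper")    # at least one mapper
        job1.add_combiner("exercise0.CountCombiner")               # optional
        job1.add_reducer("counts.txt", "exercise0.CountReducer")   # exactly one reducer

        job2 = Job()
        job2.add_mapper("counts.txt", "exercise0.FilterMapper")
        job2.add_reducer("final.txt", "exercise0.FilterReducer")
        job2.add_dependency(job1)   # job1 will run before job2
        job2.run()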

    To define a mapper class, you need to create a class that inherits from the Mapper class and overrides the map function. This map function takes two arguments: self (the object) and line (the line that is going to be mapped). Note that the line includes the '\n' symbol (so you probably want to start the map function with something like line = line[:-1] to remove it). The map function should return a list of pairs, and the two elements of each pair need to be serializable (i.e., they can be cast to strings).
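
    For instance, a mapper that emits one (word, "1") pair per word of its input line could look like this (a sketch following the description above; the class name WordMapper is only an illustration):

        class WordMapper(Mapper):
            def map(self, line):
                line = line[:-1]   # drop the trailing '\n'
                # return a list of pairs whose elements can be cast to strings
                return [(word, "1") for word in line.split()]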

    To define a reducer class, you need to create a class that inherits from the Reducer class and overrides the reducer function. This reducer function takes three arguments: self (the object), key (the key, which is a string!), and values (the list of values associated with the key; this list contains only strings). The output of the reducer should be a list of strings. Each of those strings will be included in the output of the Map-Reduce job.
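
    For instance, a reducer that sums the values of each key could look like this (a sketch; remember that key and the elements of values arrive as strings and that the output must be a list of strings):

        class SumReducer(Reducer):
            def reducer(self, key, values):
                total = sum(int(v) for v in values)
                # each returned string becomes one line of the job output
                return [key + "\t" + str(total)]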

    To define a combiner class, you need to create a class that inherits from the Combiner class and overrides the combiner function. This combiner function takes the same arguments as a reducer: self (the object), key (the key, which is a string!), and values (the list of values associated with the key; this list contains only strings). The output of the combiner should also be a list of strings; each of those strings will be passed on to the reducer as one of the values associated with key.
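
    A matching combiner can pre-aggregate the values locally before they reach the reducer (a sketch; the partial sums are emitted as strings so that a reducer like SumReducer above can add them up):

        class SumCombiner(Combiner):
            def combiner(self, key, values):
                # emit one partial sum for this key
                return [str(sum(int(v) for v in values))]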

    Exercise 0 contains an example of two chained Map-Reduce jobs: the first counts the distinct items, and the second keeps only those appearing more than twice. Exercise 0 will not be taken into account for the grade of the lab, but the other exercises will! The number of attempts that you make will also be considered, so do not "submit" (the evaluate button) carelessly. You can use the "run" button and the "debug" button as much as you want.


    Instructions for grading

    Ideally you should submit the Java version of each exercise; however, some of you may not know Java, so I will also look at the Python versions you submit on VPL. It is recommended to start with the Python version of each exercise, but to also try to implement at least the first exercise in Java. Note that this lab is already long when using only one language, so you are not expected to finish everything.

    The grading will consider three aspects:

    1. The correctness of your solution
    2. The number of attempts per exercise (the number of times you click on "run" or "debug" will not be counted; I only count the number of "evaluate" attempts that failed before a correct solution)
    3. The style of your code.

    This lab session is relatively long, and you are not expected to finish everything to get a passing grade. By contrast, to get the maximal grade you are expected to finish everything and to hand in clean solutions, with useful comments where needed, using combiners where you think they are appropriate (in which case your comments should explain why a combiner might help).


    You need to be logged in to see the exercises!

  • Lab session 13/10: Spark RDD

    Section 4 will be counted as a bonus.


  • Miscellaneous topics, chosen among: Spark Optimization, Exam preparation, Streams, Kafka, Comparison of distributed frameworks