• Generalities


    The course will happen in a mostly asynchronous way. However, to keep some interactivity and to answer your questions, I will be available during the hours of the course at the URL https://digimedia1.r2.enst.fr/b/lou-ybz-9q5.

    Moreover, once every two weeks, I will try to be physically at Télécom to answer students' questions more directly (depending on room availability...) and for lab sessions.

    If you have not done so already, you need to click on the "Join Course" button to get access to all the exercises.


    • Lesson 09/09: Introduction to Big Data Processing and Hadoop


      The first session will start on 09/09 at 13:30. I will meet you at the URL https://digimedia1.r2.enst.fr/b/lou-ybz-9q5 for a synchronous introduction (i.e. through videoconference). This session will be recorded, but you are invited to be physically present for this first one :)


      The course will then be about the Hadoop paradigm, on which we will do a lab session next week (meeting physically, for those of you who are available)!


      Plan of the course:

    • Lab Session 16/09


      The lab session will happen synchronously in Amphi 2 and in the usual BBB room. Furthermore, for all questions, students are welcome to use the DATA AI Mattermost (click on GitLab authentication, then on Shibboleth, then select Télécom Paris in the list of authentication providers, and enter your Télécom credentials), where I have created a DataAI922 room. I strongly invite you to be on the BBB at 13:30 for an explanation of the tasks (it has been recorded).

      The lab session is composed of 4 exercises, using the MovieLens dataset. The goal is to train yourself to write programs in the Map-Reduce model with the Map-Reduce API (and to face the difficulties of manipulating strings and so on...). Since most of you do not know Java, I created an API inspired by the Java API but in Python. The API that you will have to use defines four classes: Mapper, Combiner, Reducer, and Job. The most important one is Job, which allows you to define Map-Reduce jobs. To create a Map-Reduce job, you create a Job, then you add your mappers (using myJob.add_mapper("inputfile","path.to.mapper.class")), your optional combiner (using myJob.add_combiner("path.to.combiner.class")), and finally your reducer (using myJob.add_reducer("outputfile","path.to.reducer.class")). Note that you can add several mappers to the same Map-Reduce job, but there can be only one reducer and only one combiner. The combiner is optional, but there must be a reducer and at least one mapper.

      Once you have defined a Map-Reduce job, you can directly run it (using myJob.run()) or use it later as a dependency of another job (myOtherJob.add_dependency(myJob)). Note that this dependency capability is there to match the Java API, but you can limit yourself to using only run().
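
      For instance, putting the calls above together, a job definition could look like the following sketch (the file names and class paths are placeholders, and I assume here that Job takes no constructor argument):

        # Hypothetical wiring of a Map-Reduce job with the API described above.
        myJob = Job()
        myJob.add_mapper("ratings.tsv", "path.to.mapper.class")
        myJob.add_combiner("path.to.combiner.class")  # optional
        myJob.add_reducer("output.tsv", "path.to.reducer.class")
        myJob.run()

        # Or, instead of run(), use the job as a dependency of another job:
        myOtherJob = Job()
        myOtherJob.add_dependency(myJob)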

      To define a mapper class, you need to create a class that inherits from the Mapper class and overrides the map function. This map function takes two arguments: self (the object) and line, the line that is going to be mapped. Note that the line includes the '\n' symbol (so you probably want to start the map function with something like line = line[:-1] to remove it). The map function should return a list of pairs, and the two elements of each pair need to be serializable (i.e. we can cast them to strings).
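
      For instance, a minimal mapper could look like the sketch below (the class name and the input format are illustrative assumptions, not part of the API):

        # Hypothetical mapper: emit one (movie_id, rating) pair per input line,
        # assuming tab-separated lines of the form "user_id\tmovie_id\trating".
        class RatingMapper(Mapper):
            def map(self, line):
                line = line[:-1]  # drop the trailing '\n'
                user_id, movie_id, rating = line.split('\t')
                return [(movie_id, rating)]  # a list of (key, value) pairs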

      To define a reducer class, you need to create a class that inherits from the Reducer class and overrides the reducer function. This reducer function takes three arguments: self (the object), key, the key (which is a string!), and values, the list of values associated with the key (this list contains only strings). The output of the reducer should be a list of strings. Each of those strings will be included in the output of the Map-Reduce job.
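
      Continuing the illustration, a reducer matching the mapper above could be sketched as follows (the class name and the averaging logic are mine; only the base class and the signature come from the API):

        # Hypothetical reducer: compute the average rating of each movie.
        class AverageReducer(Reducer):
            def reducer(self, key, values):
                ratings = [float(v) for v in values]  # values are strings
                return [key + '\t' + str(sum(ratings) / len(ratings))]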

      To define a combiner class, you need to create a class that inherits from the Combiner class and overrides the combiner function. This combiner function takes the same arguments as a reducer: self (the object), key, the key (which is a string!), and values, the list of values associated with the key (this list contains only strings). The output of the combiner should be a list of strings. Each of those strings will be included in the values passed to the reducer for that key.
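
      For instance, in a job that counts occurrences (where the mapper emits ("item", "1") pairs), a combiner could pre-sum the counts locally, as in this sketch (the class name is an illustrative assumption):

        # Hypothetical combiner: locally sum the partial counts for a key, so
        # that the reducer only receives pre-aggregated values.
        class CountCombiner(Combiner):
            def combiner(self, key, values):
                return [str(sum(int(v) for v in values))]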

      Exercise 0 contains an example of two Map-Reduce jobs, the first counting the distinct items and the second keeping only those appearing more than twice. Exercise 0 will not be taken into account for the grade of the lab, but the other exercises will! The number of attempts that you make will also be considered, so do not "submit" (the "evaluate" button) foolishly. You can use the "run" button and the "debug" button as much as you want.

      The grading will consider three aspects:

      1. The correctness of your solution
      2. The number of attempts per exercise (the number of times you click on "run" or "debug" will not be counted; I only count the number of "evaluate" attempts that failed before a correct solution)
      3. The style of your code.

      This lab session is relatively long, and you are not expected to finish everything to get a passing grade. In contrast, to get the maximal grade, you are expected to finish everything and to have nice solutions, with useful comments where needed and with combiners if you think they are appropriate (in which case your comments should explain why a combiner might help).


      You need to be logged in to see the exercises!

      • Lesson 23/09: Spark RDD


        I was not able to pre-record videos this time, so this lesson will be entirely synchronous (but recorded). We will use the usual room and these slides. Here is the recording.

        For the exercises, you have access to the following functions on RDD:

        • map(fun)
        • filter(fun)
        • flatMap(fun)
        • mapValues(fun)
        • groupByKey()
        • reduce(fun)
        • reduceByKey(fun)
        • coGroup(otherRDD)
        • union(otherRDD)
        The API also provides the following two functions (but you should not use them):
        • count()
        • collect()
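
        As an illustration, here is a minimal PySpark sketch combining a few of these operations (this is standard PySpark, which the exercise API mirrors; collect() is used here only to print the result, whereas in the exercises you should avoid it):

          from pyspark import SparkContext

          sc = SparkContext("local", "rdd-demo")
          ratings = sc.parallelize([("Alien", 4.0), ("Brazil", 5.0), ("Alien", 3.0)])

          # Average rating per movie: pair each rating with a count of 1,
          # sum both components by key, then divide.
          sums = ratings.mapValues(lambda r: (r, 1)) \
                        .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
          averages = sums.mapValues(lambda s: s[0] / s[1])

          print(averages.collect())  # e.g. [('Alien', 3.5), ('Brazil', 5.0)]
          sc.stop()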


      • Lab Session 30/09: Spark RDD


        The lab session will happen simultaneously in Amphi Estaunié and in the usual BBB room. Written questions can also be asked on the Data AI Mattermost.

        Section 5 will not be counted.


      • Lesson on Spark DataFrames


        As usual, we will do a short synchronous session on BBB followed by pre-recorded videos. I tried to use the "interactive" videos, but this is my first time playing with this module, so it is not really interactive… The course will end with a solution to the exercises and then a solution to the previous lab session (on movie recommendation) using Spark DataFrames. This correction will happen synchronously so that you can ask questions (but it will be recorded).
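
        To give a taste of the DataFrame style before the session, here is a minimal PySpark sketch (standard PySpark; the column names and data are illustrative, not the lab's actual solution):

          from pyspark.sql import SparkSession
          from pyspark.sql import functions as F

          spark = SparkSession.builder.appName("df-demo").getOrCreate()
          ratings = spark.createDataFrame(
              [("Alien", 4.0), ("Brazil", 5.0), ("Alien", 3.0)],
              ["movie", "rating"])

          # The same per-movie average as in the RDD sketch above,
          # expressed declaratively.
          ratings.groupBy("movie").agg(F.avg("rating").alias("avg_rating")).show()
          spark.stop()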

      • Lab Session 3 on Spark DF and ML

      • Miscellaneous topics among: Spark Optimization, Exam Preparation, Streams, Kafka, Comparison

