Shivansh Yadav

Introduction to Apache Hadoop & MapReduce

The History of Hadoop

There are mainly two problems with big data:

  • Storing a huge amount of data.
  • Processing that stored data.

In 2003, Google published a paper on its distributed file system, GFS (Google File System), which could be used for storing large datasets.

Similarly, in 2004, Google published a paper on MapReduce, which described a solution for processing large datasets.

Doug Cutting and Mike Cafarella (the founders of Hadoop) came across both of these papers, describing GFS and MapReduce, while working on the Apache Nutch project.

The aim of the Apache Nutch project was to build a search engine system that could index 1 billion pages. They concluded that building such a system would cost millions of dollars.

On their own, the two Google papers were not a complete solution for the Nutch project.

Fast forward to 2006: Doug Cutting joined Yahoo and started the Hadoop project, implementing the ideas from the Google papers.

Finally, in 2008, Yahoo released Hadoop as an open source project to the ASF (Apache Software Foundation), and successfully tested a 4000-node cluster with Hadoop.


Intro to Apache Hadoop

Apache Hadoop is a software framework for the distributed storage and distributed processing of big data using the MapReduce programming model.

Hadoop comes with the following 4 modules:

  1. HDFS (Hadoop Distributed File System): A file system, inspired by GFS, that is used for the distributed storage of big data (see the sketch after this list).

  2. YARN (Yet Another Resource Negotiator): A resource manager that can be used for job scheduling and cluster resource management. It keeps track of which node does which work.

  3. MapReduce: The programming model used for distributed processing. It divides the data into partitions that are mapped (transformed) and reduced (aggregated).

  4. Hadoop Common: It includes libraries and utilities used and shared by other Hadoop modules.
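
As a rough mental model for the HDFS module (a simplified sketch, not the actual HDFS implementation), a file is split into fixed-size blocks, and each block is replicated on several data nodes; in recent Hadoop versions the defaults are 128 MB blocks and 3 replicas. The Python below only illustrates that idea, with toy sizes and made-up node names.

# Toy illustration of HDFS-style storage: split a file into blocks and
# replicate each block across data nodes. Sizes and node names are made up;
# real HDFS defaults to 128 MB blocks and a replication factor of 3.
BLOCK_SIZE = 4        # pretend block size in bytes, to keep the demo tiny
REPLICATION = 2       # pretend replication factor
NODES = ["node1", "node2", "node3"]

def split_into_blocks(data, block_size):
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes, replication):
    # Round-robin placement: each block is stored on `replication` distinct nodes.
    return {i: [nodes[(i + r) % len(nodes)] for r in range(replication)]
            for i in range(len(blocks))}

blocks = split_into_blocks(b"hello hadoop world!", BLOCK_SIZE)
print(place_blocks(blocks, NODES, REPLICATION))
# e.g. {0: ['node1', 'node2'], 1: ['node2', 'node3'], 2: ['node3', 'node1'], ...}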

Here is a block diagram representation of how they all work together.

[Block diagram: HDFS, YARN, MapReduce, and Hadoop Common working together]


MapReduce

As we know, MapReduce is a programming model that can process big data in a distributed manner. Let's see how MapReduce works internally.

There are mainly 3 tasks performed during a MapReduce job:

  1. Mapper
  2. Shuffle & Sort
  3. Reducer

Below is an example of what a MapReduce job looks like:

[Diagram of an example MapReduce job]

This can vary and depends on how we want MapReduce to process our data.
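
To make the three phases concrete, here is a minimal, framework-free Python sketch of a word count over a few hypothetical lines of text: the mapper emits (word, 1) pairs, grouping by key stands in for shuffle & sort, and the reducer sums each group.

from itertools import groupby
from operator import itemgetter

lines = ["the quick brown fox", "the lazy dog", "the quick dog"]

# Mapper: emit a (word, 1) pair for every word in every line.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle & Sort: bring together all pairs that share the same key (word).
mapped.sort(key=itemgetter(0))
grouped = {word: [v for _, v in pairs]
           for word, pairs in groupby(mapped, key=itemgetter(0))}

# Reducer: aggregate the values for each key.
counts = {word: sum(ones) for word, ones in grouped.items()}

print(counts)  # {'brown': 1, 'dog': 2, 'fox': 1, 'lazy': 1, 'quick': 2, 'the': 3}

In real Hadoop, the map and reduce steps run in parallel on different nodes and the framework handles the shuffle & sort between them; the sketch above only shows the data flow.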

Hadoop & MapReduce are written natively in Java, but Hadoop Streaming allows interfacing with other languages such as Python.
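
With Hadoop Streaming, the mapper and reducer are ordinary scripts that read from stdin and write tab-separated key/value lines to stdout; Hadoop sorts the mapper output by key before feeding it to the reducer. Below is a rough word-count sketch of that protocol (the two functions would normally live in separate scripts, e.g. hypothetical mapper.py and reducer.py passed to the hadoop-streaming JAR with its -mapper and -reducer options).

import sys

def run_mapper():
    # Mapper script: read raw text lines from stdin, write "word<TAB>1" to stdout.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def run_reducer():
    # Reducer script: Hadoop Streaming delivers the mapper output sorted by key,
    # so all lines for the same word arrive consecutively.
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")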

Here is an example of Python code for a MapReduce job, written with the mrjob library:

from mrjob.job import MRJob
from mrjob.step import MRStep

class RatingsBreakdown(MRJob):

    def steps(self):
        # Two chained MapReduce steps: count ratings per movie, then sort by count.
        return [
            MRStep(mapper=self.mapper_get_ratings,
                   reducer=self.reducer_count_ratings),
            MRStep(reducer=self.reducer_sorted_output)
        ]

    def mapper_get_ratings(self, _, line):
        # Each input line: user_id, movie_id, rating, timestamp (tab-separated).
        (user_id, movie_id, rating, timestamp) = line.split('\t')
        yield movie_id, 1

    def reducer_count_ratings(self, key, values):
        # Sum the 1s emitted for each movie. Zero-padding the count makes the
        # string keys sort numerically in the next step's shuffle & sort phase.
        yield str(sum(values)).zfill(5), key

    def reducer_sorted_output(self, count, movies):
        # Keys (counts) arrive sorted, so movies are emitted in order of popularity.
        for movie in movies:
            yield movie, count

if __name__ == '__main__':
    RatingsBreakdown.run()
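Assuming mrjob is installed (pip install mrjob) and the script is saved as, say, ratings_breakdown.py (a hypothetical file name), it can be run locally on a tab-separated ratings file such as the MovieLens u.data file with python ratings_breakdown.py u.data; mrjob can also submit the same script to a Hadoop cluster through its runner options (e.g. -r hadoop).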

The Hadoop ecosystem has grown significantly and includes various tools and frameworks that build upon or complement the basic MapReduce model. Here’s a look at some of these technologies:

[Diagram of the Hadoop ecosystem and related tools]

While newer technologies offer more straightforward ways to handle big data, understanding MapReduce is fundamental to grasping the field's broader concepts.


THE END
