Hadoop logo


Fast Hadoop10 minutes

...in 10 minutes

Hadoop is

  • A NOSQL platform
  • An open-source Apache project implementing the 2004 MapReduce paper
  • Written in Java, with Java and streaming API
  • Designed for use on commodity hardware (e.g. racks of Dells, Amazon EC2) and scalable to thousands of nodes, although most clusters are tens of nodes
  • The name of Doug Cutting's son's toy elephant

Aside: Not Only SQL

  • Is for storing images, multimedia, objects of varying size
  • Some database-style options: HBase, Cassandra
  • Some key, value pair options: MongoDB, DynamoDB

Aside: Big Data

  • Tweets that appear faster than MySQL can create keys
  • Data that exceed supported table dimensions
    • e.g. columns as positions in the human genome
    • e.g. rows as items on the receipt of every Walmart transaction in a year
  • multimedia, sparse tables, or huge variations in object size

Hadoop architecture is

  1. A distributed file system
  2. A distributed compute system

Hadoop architecture is

  1. A distributed file system
  2. A distributed compute system

Hadoop architecture is

  1. A distributed file system
  2. A distributed compute system

These are the individual machines

(But all of the Hadoop jobs can run on one computer / VM for debugging and development.)

  1. One master for the HDFS, one for MapReduce jobs
  2. Secondary master
  3. Multiple workers

The individual machines

(But all of the Hadoop jobs can run on one computer / VM for debugging and development.)

  1. One master for the HDFS, one for MapReduce jobs
  2. Secondary master
  3. Multiple workers

The individual machines

(But all of the Hadoop jobs can run on one computer / VM for debugging and development.)

  1. One master for the HDFS, one for MapReduce jobs
  2. Secondary master
  3. Multiple workers

MapReduce

The best explanation of MapReduce I've ever heard
nickel quarter penny

Problem: want pizza

But we're broke ⇒ must count our change
  • Map step
    • Everyone empty their pockets; bring the change to the front table
  • Shuffle Sort step
    • I (the Master) will group coins by denomination
  • Reduce step
    • Everyone count a pile
  • End
    • Ask everyone for their result and sum
penny dime

So what?

Parallel computation +
every variable imaginable for every customer
=
Target identifies a pregnant teen by her purchase behavior before her dad knows

...enter Mahout (2011)

Mahout is a Machine Learning and parallel linear algebra library that has a Spark + Scala shell implementing the distributed matrix libraries only (with plans for the rest). Written in Java.

...and Yarn (MapReduce 2 — 2012)

MapReduce is painfully slow and inflexible; MapReduce 2 splits the JobTracker to two processes, broadens the API, and gives the user more control over how the jobs execute.

introducing YARN

...and Spark (2014)

Speed wins: Mahout announced near the end of last year that it will stop developing for MapReduce and will exclusively write code for use with Spark.
No wonder when looking at the performance:

MLib is part of Spark; ALS is a machine learning algorithm.

blog post with benchmark