What Is Hadoop?

1. Fun fact: Hadoop was created by Doug Cutting and Mike Cafarella in 2005

  • Named after Doug’s son’s toy elephant, Hadoop

2. Hadoop is a distributed data management and processing system that runs on clusters of hardware (servers)

  • Unlike the database systems we have studied, such as MongoDB, Hadoop can do all sorts of processing (not just database operations).

3. Hadoop is open source (maintained by the Apache Software Foundation)

4. Hadoop is the foundation of many Big Data frameworks


Hadoop Users

  • Amazon

  • Google

  • Expedia

  • JP Morgan

  • Facebook

  • Yahoo

  • eBay


History and Evolution












What Is Hadoop Used For?

  • Searching

  • Log Processing

  • Recommendation Systems

  • Data Analytics

  • Video and Image Analysis

  • Data Storage/Retention

      • Structured, unstructured, and semi-structured data

  • Machine Learning Models


Hadoop Distributions

  • Amazon Web Services

  • Apache Bigtop

  • Cascading

  • Cloudera

  • Cloudspace

  • Datameer

  • Data Mine Lab

  • Datasalt

  • DataStax

  • DataTorrent

  • Ndisco

  • Debian

  • Emblocsoft

  • Hortonworks

  • HStreaming

  • IBM

  • Impetus

  • Jaspersoft

  • Karmasphere

  • Apache Mahout

  • Nutch

  • And others


Hadoop Distributed File System (HDFS)

  • Data is divided into fixed-size blocks (128 MB by default in Hadoop 2 and later); only the last block of a file may be smaller.

  • Each block is replicated (default replication factor: 3).

  • Each replica is stored on a different DataNode (some of which may be in different racks).

  • The NameNode holds the metadata that records which blocks are on which DataNodes.
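
The block and replica layout described above can be illustrated with a small simulation. This is only a sketch, not HDFS code: the names `split_into_blocks` and `place_replicas`, the tiny block size, and the round-robin placement are all assumptions made for illustration (real HDFS uses a rack-aware placement policy).

```python
BLOCK_SIZE = 4     # bytes; deliberately tiny so the example is easy to follow
REPLICATION = 3    # the default replication factor mentioned above


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Cut data into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, datanodes: list,
                   replication: int = REPLICATION) -> dict:
    """Build a NameNode-style map: block id -> DataNodes holding a replica.

    Assumption: simple round-robin placement. It only guarantees that the
    replicas of one block land on different DataNodes.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    block_map = {}
    for block_id in range(num_blocks):
        start = block_id % len(datanodes)
        block_map[block_id] = [
            datanodes[(start + k) % len(datanodes)] for k in range(replication)
        ]
    return block_map
```

For example, `split_into_blocks(b"abcdefghij")` yields three blocks, and `place_replicas(3, ["dn1", "dn2", "dn3", "dn4"])` returns a map in which each block id points to three distinct DataNodes, which is the kind of information the NameNode keeps.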


  • The NameNode is the primary (master) node

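  • For a write, the client asks the NameNode which DataNodes (the worker nodes) should hold the block, then sends the data to the first of them

  • That DataNode writes the data and then forwards it to the other DataNodes that hold the block's replicas.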

  • A read request can go to any of the replicas

  • Unlike Cassandra, HDFS does not run consistency checks across replicas when data is read

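  • Data integrity is protected by checksums (a checksum detects corruption; it is not a digital signature): a checksum is stored with each block on the DataNode, the data returned by a read is verified against it, and a replica that fails the check is reported to the NameNode so that another replica can be used.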
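
The write, read, and checksum steps listed above can be sketched as a toy model. Everything here is hypothetical and for illustration only: the `DataNode` class, `write_block`, and `read_block` are not Hadoop APIs, `zlib.crc32` stands in for whatever checksum HDFS uses, and storing the checksum next to each replica is an assumption of this sketch.

```python
import zlib


class DataNode:
    """Toy DataNode: keeps each block together with its checksum."""

    def __init__(self, name: str):
        self.name = name
        self.blocks = {}  # block id -> (data, checksum)

    def write(self, block_id: int, data: bytes, checksum: int, rest: list):
        # Store the replica locally, then forward it down the pipeline.
        self.blocks[block_id] = (data, checksum)
        if rest:
            rest[0].write(block_id, data, checksum, rest[1:])


def write_block(block_id: int, data: bytes, pipeline: list) -> None:
    """Send a block to the first DataNode; it forwards to the others."""
    checksum = zlib.crc32(data)
    pipeline[0].write(block_id, data, checksum, pipeline[1:])


def read_block(block_id: int, replicas: list):
    """Read from the first replica whose data matches its checksum."""
    for node in replicas:
        entry = node.blocks.get(block_id)
        if entry is None:
            continue
        data, checksum = entry
        if zlib.crc32(data) == checksum:
            return data, node.name
        # A real system would report the corrupt replica here.
    raise IOError("no valid replica found for block %d" % block_id)
```

In this model, if the copy on the first DataNode is corrupted after the write, `read_block` skips it and returns the data from the next replica that still matches its checksum.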

