What Is Hadoop?

1. Fun Fact: created by Doug Cutting and Mike Cafarella in 2005

2. Hadoop is a Distributed Data Management and Processing system on clusters of hardware (servers)

Unlike MongoDB Database Systems we have studied, Hadoop can do all sorts of processing (not just Database operations).

3. Hadoop is open source (Provisioned through Apache Foundation)

4. Hadoop is the foundation of Big Data Frameworks

Hadoop Users

History and Evolution

What is Hadoop Used for?

• Structured/Unstructured/Semi-Structured

Hadoop Distributions

Hadoop Distributed File System (HDFS)

Data is divided into blocks of the same size.
Each block is replicated (default 3)
Each replica is stored in a different Datanode (some of which may be in different Racks)
Namenode contains all the information about which blocks are in which Datanode

NameNode is Primary node
A write request can go to any DataNode (Secondary Node)
That node makes the write and then sends the data to its replicas.
A read request can go to any of the replicas
No consistency checks like Cassandra
NameNode can have a CheckSum (digital signature) of the Data and check the returned data against its memory of CheckSums.

RealCode4You