1. Fun Fact: created by Doug Cutting and Mike Cafarella in 2005
Named After Doug’s son’s toy elephant named, Hadoop
2. Hadoop is a Distributed Data Management and Processing system on clusters of hardware (servers)
Unlike MongoDB Database Systems we have studied, Hadoop can do all sorts of processing (not just Database operations).
3. Hadoop is open source (Provisioned through Apache Foundation)
4. Hadoop is the foundation of Big Data Frameworks
History and Evolution
What is Hadoop Used for?
Video and Image Analysis
Machine Learning Models
Amazon Web Services
Data Mine Lab
And more others
Hadoop Distributed File System (HDFS)
Data is divided into blocks of the same size.
Each block is replicated (default 3)
Each replica is stored in a different Datanode (some of which may be in different Racks)
Namenode contains all the information about which blocks are in which Datanode
NameNode is Primary node
A write request can go to any DataNode (Secondary Node)
That node makes the write and then sends the data to its replicas.
A read request can go to any of the replicas
No consistency checks like Cassandra
NameNode can have a CheckSum (digital signature) of the Data and check the returned data against its memory of CheckSums.