I was excited when someone introduce the early alpha version of Hadoop in 2007. So i just want to share a little bit fundamental of apache hadoop.
The article will include this 5 topics
1. HADOOP OVERVIEW
2. HDFS (MOTHER)
3. YARN (FATHER)
4. MAP REDUCE (THINKER)
5. SPARK (ANOTHER SMART THINKER)
Let’s go through one by one…
1. HADOOP OVERVIEW
- Named after a toy elephant belonging to developer Doug Cutting’s son
- Hadoop’s roots wind back to Nutch, the 2002 project started by Doug Cutting and Mike Cafarella
- Concept start when Google release their paper called
GFS (Google File System – 2003) = Hadoop HDFS,http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf
MapReduce (2004) = Hadoop MapReduce,http://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf
- Reference : https://hadoop.apache.org/
- “Framework that allows distributed processing of large data sets across clusters of computers…
- using simple programming models.
- It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
- Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.”
Hadoop Key Component:
- Hadoop Common: Common utilities
- (Storage Component) Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access
- Many other data storage approaches also in use
E.G., Apache Cassandra, Apache Hbase, Apache Accumulo (NSA-contributed)
- (Scheduling) Hadoop YARN: A framework for job scheduling and cluster resource management
- (Processing) Hadoop MapReduce (MR2): A YARN-based system for parallel processing of large data sets
Other execution engines increasingly in use, e.g., Spark
All of these key components are OSS under Apache 2.0 license
Related: Ambari, Avro, Cassandra, Chukwa, Hbase, Hive, Mahout, Pig, Tez, Zookeeper
2. HDFS (MOTHER)
- Inspired by “Google File System”
- Stores large files (typically gigabytes-terabytes) across multiple machines, replicating across multiple hosts
- Breaks up files into fixed-size blocks (typically 64 MiB), distributes blocks
- The blocks of a file are replicated for fault tolerance
- The block size and replication factor are configurable per file
- Default replication value (3) – data is stored on three nodes: two on the same rack, and one on a different rack
- File system intentionally not fully POSIX-compliant
- Write-once-read-many access model for files. A file once created, written, and closed cannot be changed. This assumption simplifies data coherency issues and enables high throughput data access
- Intend to add support for appending-writes in the future
- Can rename & remove files
- “Namenode” tracks names and where the blocks are
- Hadoop can work with any distributed file system but this loses locality
- To reduce network traffic, Hadoop must know which servers are closest to the data; HDFS does this
- Hadoop job tracker schedules jobs to task trackers with an awareness of the data location
For example, if node A contains data (x,y,z) and node B contains data (a,b,c), the job tracker schedules node B to perform tasks on (a,b,c) and node A would be scheduled to perform tasks on (x,y,z)
This reduces the amount of traffic that goes over the network and prevents unnecessary data transfer
Location awareness can significantly reduce job-completion times when running data-intensive jobs
Handling structured data
- Data often more structured
- Google BigTable (2006 paper)
- Designed to scale to Petabytes, 1000s of machines
- Maps two arbitrary string values (row key and column key) and timestamp into an associated arbitrary byte array
- Tables split into multiple “tablets” along row chosen so tablet will be ~200 megabytes in size (compressed when necessary)
- Data maintained in lexicographic order by row key; clients can exploit this by selecting row keys for good locality (e.g., reversing hostname in URL)
- Not a relational database; really a sparse, distributed multi-dimensional sorted map
- Implementations of approach include: Apache Accumulo (from NSA; cell-level access labels), Apache Cassandra, Apache Hbase, Google Cloud BigTable (released 2005)
- Sources: https://en.wikipedia.org/wiki/BigTable; “Bigtable…” by Chang
3. YARN (Yet Another Resource Negotiator) – FATHER
- Apache Hadoop YARN (Yet Another Resource Negotiator) is a cluster management technology.
- YARN is one of the key features in the second-generation Hadoop 2 version of the Apache Software Foundation’s open source distributed processing framework.
- YARN is a software rewrite that decouples MapReduce’s resource management and scheduling capabilities from the data processing component, enabling Hadoop to support more varied processing approaches and a broader array of applications
- YARN combines a central resource manager that reconciles the way applications use Hadoop system resources with node manager agents that monitor the processing operations of individual cluster nodes
- ResourceManager (RM) is the master that arbitrates all the available cluster resources and thus helps manage the distributed applications running on the YARN system. It works together with the per-node NodeManagers (NMs) and the per-application ApplicationMasters (AMs).
- NodeManagers take instructions from the ResourceManager and manage resources available on a single node.
- ApplicationMasters are responsible for negotiating resources with the ResourceManager and for working with the NodeManagers to start the containers.
- The NodeManager (NM) is YARN’s per-node agent, and takes care of the individual compute nodes in a Hadoop cluster. This includes keeping up-to date with the ResourceManager (RM), overseeing containers’ life-cycle management; monitoring resource usage (memory, CPU) of individual containers, tracking node-health, log’s management and auxiliary services which may be exploited by different YARN applications.
- The ApplicationMaster is, in effect, an instance of a framework-specific library and is responsible for negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the containers and their resource consumption. It has the responsibility of negotiating appropriate resource containers from the ResourceManager, tracking their status and monitoring progress.
4. MAPREDUCE V2 (Thinker)
- MapReduce is a programming model for processing and generating large data sets with a parallel, distributed algorithm on a cluster
- Programmer defines two functions, map & reduce
- Map(k1,v1) → list(k2,v2). Takes a series of key/value pairs, processes each, generates zero or more output key/value pairs
- Reduce(k2, list (v2)) → list(v3). Executed once for each unique key k2 in the sorted order; iterate through the values associated with that key and produce zero or more outputs
- System “shuffles” data between map and reduce (so “reduce” function has set of data for its keys) automatically handles system failures, etc.
MapReduce: Word Count Example (Pseudocode)
map(String input_key, String input_value): // input_key: document name // input_value: document contents for each word w in input_value: EmitIntermediate(w, "1"); reduce(String output_key, Iterator intermediate_values): // output_key: a word // output_values: a list of counts int result = 0; for each v in intermediate_values: result += ParseInt(v); Emit(AsString(result));
- Can also define an option function “Combiner” (to optimize bandwidth)
- If defined, runs after Mapper & before Reducer on every node that has run a map task
- Combiner receives as input all data emitted by the Mapper instances on a given node
- Combiner output sent to the Reducers, instead of the output from the Mappers
- Is a “mini-reduce” process which operates only on data generated by one machine
- If a reduce function is both commutative and associative, then it can be used as a Combiner as well
- Useful for word count – combine local counts
MapReduce problems & potential solution
- Many problems aren’t easily described as map-reduce
- Persistence to disk typically slower than in-memory work
Alternative: Apache Spark
- a general-purpose processing engine that can be used instead of MapReduce
5. APACHE SPARK (ANOTHER SMART THINKER)
- Apache Spark is a cluster computing platform designed to be fast and general-purpose.
- On the speed side, Spark extends the popular MapReduce model to efficiently support more types of computations, including interactive queries and stream processing. Speed is important in processing large datasets, as it means the difference between exploring data interactively and waiting minutes or hours. One of the main features Spark offers for speed is the ability to run computations in memory, but the system is also more efficient than MapReduce for complex applications running on disk
Spark Mapreduce Comparison
- Hadoop MapReduce is meant for data that does not fit in the memory whereas Apache Spark has a better performance for the data that fits in the memory, particularly on dedicated clusters.
- Hadoop MapReduce can be an economical option because of Hadoop as a service offering(HaaS) and availability of more personnel. According to the benchmarks, Apache Spark is more cost effective but staffing would be expensive in case of Spark.
- Apache Spark and Hadoop MapReduce both are failure tolerant but comparatively Hadoop MapReduce is more failure tolerant than Spark.
- Spark and Hadoop MapReduce both have similar compatibility in terms of data types and data sources.
- Programming in Apache Spark is easier as it has an interactive mode whereas Hadoop MapReduce requires core java programming skills, however there are several utilities that make programming in Hadoop MapReduce easier.