Hadoop is a Java-based framework that uses a large cluster of affordable commodity hardware to store and process enormous amounts of data. It is built on the MapReduce programming model developed at Google. Today, many large corporations, such as Facebook, Yahoo, Netflix, and eBay, use Hadoop internally to manage large amounts of data. The four main components of the Hadoop architecture are listed below.
- MapReduce
- HDFS (Hadoop Distributed File System)
- YARN (Yet Another Resource Negotiator)
- Common Utilities or Hadoop Common
Let’s take a closer look at each of these components’ functions.
1. MapReduce
MapReduce is a programming model and processing framework that runs on top of YARN. It is primarily responsible for carrying out parallel, distributed processing in a Hadoop cluster, which is what lets Hadoop work through huge datasets so quickly; serial processing is no longer practical when working with Big Data. A MapReduce job is separated into two phases:
Map runs in the first phase and Reduce runs in the second.
Here, the Map() function receives the input, and its output then serves as the input for the Reduce() function, which produces the final output. Let’s examine what Map() and Reduce() do.
Since we are working with Big Data, the input handed to Map() is a collection of data read as blocks. The Map() function splits these data blocks into tuples, which are simply key-value pairs. These key-value pairs are then sent as input to Reduce(). The Reduce() function groups the tuples (key-value pairs) by key and performs operations such as sorting, summation, and so on, producing a smaller set of tuples that is delivered to the output. That is how the final output is obtained.
The actual processing done in the Reducer always depends on the business requirements of the sector in question. This is how Map() is used first and Reduce() afterward, one after the other.
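To make the Map()/Reduce() flow concrete, here is a minimal word-count sketch using the standard Hadoop MapReduce Java API (the class names TokenizerMapper and IntSumReducer are illustrative, not part of Hadoop itself): the mapper emits a (word, 1) pair for every word it reads, and the reducer sums the counts it receives for each word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map(): turns each line of input into (word, 1) key-value pairs.
class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // emit an intermediate key-value pair
        }
    }
}

// Reduce(): receives all values for one key and aggregates them (a summation here).
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result);     // final (word, count) pair
    }
}
```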
Let’s go further into the Map Task and Reduce Task.
Map Task:
- RecordReader: The RecordReader’s job is to break the input into records and hand the Map() function key-value pairs. The key is typically the position at which the record is located, and the value is the data of the record itself.
- Map: A map is simply a user-defined function that processes the tuples (key-value pairs) supplied by the RecordReader. For each input pair, Map() may emit no key-value pairs at all or many of them.
- Combiner: The combiner groups the data during the map phase and behaves like a local reducer: it merges the intermediate key-value pairs produced by the Map on the map side before they are sent across the network. The combiner is optional, so using one is not required.
- Partitioner: The partitioner fetches the key-value pairs produced in the map phase and assigns each of them to a reducer, creating one shard (partition) per reducer. To do this it takes each key’s hashcode and applies a modulus over the number of reducers: key.hashCode() % (number of reducers). A sketch of a custom partitioner built on this rule is shown after this list.
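As a hedged illustration of that hashcode-modulus rule, below is a minimal custom partitioner (the class name WordPartitioner is hypothetical). Hadoop ships a built-in HashPartitioner that does essentially the same thing, so a custom one is only needed when you want different routing logic.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each intermediate (key, value) pair to a reducer using
// key.hashCode() modulo the number of reducers, as described above.
class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask off the sign bit so the result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```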
Reduce Task:
- Shuffle and Sort: Shuffling is the process by which the intermediate key-value pairs created by the Mapper are transferred to the Reducer; it is the step that kicks off the Reduce task. During shuffling the system sorts the data by key. Shuffling can begin as soon as some of the map tasks finish, rather than waiting for all mapping work to complete, which makes the overall job quicker.
- Reduce: The primary duty of Reduce is to gather the tuples produced by Map and perform some kind of aggregation on those key-value pairs, grouped by their key element.
- OutputFormat: Once all operations are complete, a RecordWriter writes the key-value pairs into the output file, with each record on a new line and the key and value separated by a tab character (the default separator of TextOutputFormat).
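Putting the map-side and reduce-side pieces together, a driver program wires the mapper, the optional combiner, the partitioner, the reducer, and the output format into a single MapReduce job. This is a sketch that assumes the TokenizerMapper, IntSumReducer, and WordPartitioner classes from the earlier snippets and takes the input and output paths from the command line.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);      // optional local reducer
        job.setPartitionerClass(WordPartitioner.class); // hashcode % numReducers
        job.setReducerClass(IntSumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setOutputFormatClass(TextOutputFormat.class); // tab-separated key/value records

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a jar and launched with the hadoop jar command against an input directory in HDFS, this job writes one output file per reducer under the chosen output path.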
2. HDFS
HDFS (Hadoop Distributed File System) provides the storage layer. It is designed to build a distributed file system on low-cost commodity hardware, and it is structured to favor storing data in large blocks rather than small ones.
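As a minimal sketch of how a client program talks to this storage layer, the snippet below writes a small file to HDFS and reads it back through Hadoop’s FileSystem Java API (the path /tmp/hdfs-demo.txt is only an example); the client never needs to know which DataNodes end up holding the blocks.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsHelloWorld {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/tmp/hdfs-demo.txt");   // example path

        // Write a small file; HDFS splits larger files into blocks transparently.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back and copy the contents to stdout.
        try (FSDataInputStream in = fs.open(path)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}
```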
HDFS provides fault tolerance and high availability to the storage layer and to the other components of the Hadoop cluster. HDFS uses two kinds of nodes to store data:
- NameNode (Master)
- DataNode (Slave)
NameNode: In a Hadoop cluster, the NameNode serves as the master and directs the DataNodes (slaves). The primary purpose of the NameNode is to store metadata, that is, data about the data; for example, the transaction logs that record user activity in the Hadoop cluster are part of this metadata.
To locate the nearest DataNode for faster communication, the NameNode keeps location information about the DataNodes (block numbers, block IDs) as part of this metadata. The NameNode also gives the DataNodes instructions for operations such as create, replicate, and delete.
DataNode: DataNodes work as slaves. A Hadoop cluster may contain anywhere from one to 500 or even more DataNodes, and they are where the data in the cluster is actually stored. The more DataNodes the cluster has, the more data it can store, so it is recommended that DataNodes have a high storage capacity in order to hold a large number of file blocks.
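To see this block metadata from the client side, here is a hedged sketch that asks the NameNode, via FileSystem.getFileBlockLocations(), which DataNodes hold each block of a file (the path /data/input.txt is illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/input.txt")); // example file

        // The NameNode answers with one BlockLocation per block of the file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
    }
}
```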
High Level Architecture Of Hadoop
File Block In HDFS: HDFS always stores data in terms of blocks. A single file is therefore divided into multiple blocks, each with a default size of 128 MB, which you can adjust manually.
Let’s use an example to explain the idea of dividing files into blocks. If you upload a 400 MB file to HDFS, it will be partitioned into blocks of 128 MB, 128 MB, 128 MB, and 16 MB, totaling 400 MB. In other words, four blocks are created, and all but the last are 128 MB; the final block simply holds whatever data remains, since Hadoop neither knows nor cares what is inside the blocks. A file block in the Linux file system is around 4 KB, far smaller than the Hadoop file system’s default block size. Hadoop is primarily designed to store massive amounts of data, on the order of petabytes, and this ability to scale is what sets it apart from other file systems; in practice, Hadoop file blocks of 128 MB to 256 MB are used.
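A quick way to sanity-check that arithmetic: with a 128 MB block size, a file occupies fileSize / blockSize full blocks plus one partial block for the remainder. A tiny standalone sketch, using the 400 MB file from the example:

```java
public class BlockMath {
    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileSize = 400 * mb;      // the 400 MB file from the example
        long blockSize = 128 * mb;     // HDFS default block size

        long fullBlocks = fileSize / blockSize;           // 3 full 128 MB blocks
        long lastBlock = fileSize % blockSize;            // 16 MB partial block
        long totalBlocks = fullBlocks + (lastBlock > 0 ? 1 : 0);

        System.out.printf("blocks=%d, last block=%d MB%n", totalBlocks, lastBlock / mb);
    }
}
```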
Replication In HDFS: Replication ensures the availability of the data. Making copies of something is replication, and the replication factor is the number of copies kept of a given block. As we saw with file blocks, HDFS stores data as multiple blocks, and Hadoop is configured to create copies of those file blocks.
Hadoop’s replication factor is configurable, meaning you can change it manually to meet your needs. With the default replication factor of 3, the example above with its 4 file blocks ends up with each block stored 3 times, for a total of 4 × 3 = 12 block copies kept for backup purposes.
This is because Hadoop runs on inexpensive system hardware, known as commodity hardware, which is prone to failure at any moment; our Hadoop setup does not run on supercomputers. That is why HDFS needs a fault-tolerance capability that duplicates file blocks for backup purposes.
One other point worth noting is that keeping so many copies of our file blocks wastes a significant amount of storage. However, large organizations accept this overhead because the data is more valuable than the storage. The replication factor is configured in the hdfs-site.xml file.
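Since the replication factor lives in hdfs-site.xml (as the dfs.replication property), here is a hedged sketch of two ways to influence it from client code: overriding the default for files this client creates via the Configuration object, and changing the replication factor of an existing file with FileSystem.setReplication() (the path is illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default replication factor for files created by this client
        // (the cluster-wide default lives in hdfs-site.xml as dfs.replication).
        conf.setInt("dfs.replication", 3);

        FileSystem fs = FileSystem.get(conf);

        // Change the replication factor of an existing file to 2.
        boolean changed = fs.setReplication(new Path("/data/input.txt"), (short) 2);
        System.out.println("replication changed: " + changed);
    }
}
```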
Rack Awareness: A rack is nothing more than a physical collection of nodes in our Hadoop cluster (perhaps 30 to 40 machines). A large Hadoop cluster contains many racks. The NameNode uses this rack information to choose the nearest DataNode, which maximizes read/write throughput and saves network traffic.
HDFS Architecture
3. YARN (Yet Another Resource Negotiator)
MapReduce runs on a framework called YARN. YARN carries out two responsibilities: job scheduling and resource management. Job scheduling breaks a large task into smaller jobs so that each job can be distributed across the slaves in a Hadoop cluster, maximizing parallel processing. The job scheduler also keeps track of job priorities, dependencies between jobs, importance levels, and other details such as job timing. The Resource Manager manages all the resources made available for running the Hadoop cluster.
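As a hedged sketch of the resource-management side, the YarnClient API lets a program ask the ResourceManager which applications are running and what resources the NodeManagers offer; this only reads cluster state, and actually submitting an application would need considerably more setup.

```java
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class ClusterStatus {
    public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new Configuration());
        yarn.start();

        // Applications currently known to the ResourceManager.
        List<ApplicationReport> apps = yarn.getApplications();
        for (ApplicationReport app : apps) {
            System.out.println(app.getApplicationId() + " " + app.getYarnApplicationState());
        }

        // Resources offered by the running NodeManagers.
        List<NodeReport> nodes = yarn.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId() + " capability=" + node.getCapability());
        }

        yarn.stop();
    }
}
```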
Features of YARN
- Multi-Tenancy
- Scalability
- Cluster-Utilization
- Compatibility
4. Hadoop Common or Common Utilities
Hadoop Common, often known as the common utilities, is simply the set of Java libraries, files, and scripts required by all the other components of a Hadoop cluster. HDFS, YARN, and MapReduce all rely on these utilities to function. Hadoop Common works on the assumption that hardware failure in a Hadoop cluster is frequent, so failures need to be handled automatically in software by the Hadoop framework.