What is Hadoop? How to configure Hadoop on Windows

Big Data is becoming an essential part of every company, and Hadoop is the core technology for storing and accessing large amounts of data.

Hadoop is an open-source Apache framework, written in Java, that enables distributed processing of large datasets across clusters of computers using a simple programming model. Hadoop is designed to scale from a single server to thousands of machines, each offering local computation and storage.

Hadoop architecture

Hadoop framework consists of 4 modules:

  • Hadoop Common: The Java libraries and utilities required by the other Hadoop modules. These libraries provide file-system and OS-level abstractions and contain the Java code and scripts needed to start Hadoop.
  • Hadoop YARN: A framework for job scheduling and cluster resource management.
  • Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
  • Hadoop MapReduce: A YARN-based system for processing large data sets in parallel.

The diagram below illustrates the four components that make up the Hadoop framework.

Since 2012, the term “Hadoop” refers not only to the modules mentioned above but also to the additional software packages that can be installed alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, and Apache Spark.

MapReduce

Hadoop MapReduce is a framework for writing applications that process large amounts of data in parallel, with high fault tolerance, across clusters of thousands of machines.

The term MapReduce refers to the two tasks that a Hadoop program performs:

  • Map: The first task, which converts the input data into a set of key/value pairs.
  • Reduce: This task takes the output of the Map task and merges those key/value pairs into a smaller set of results.

Normally, both the input and the output are stored in the file system. The framework automatically schedules the tasks, monitors them, and re-executes any that fail.
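
To make the two tasks concrete, here is a minimal word-count sketch written against the standard Hadoop MapReduce Java API. The class names and the input/output paths passed on the command line are illustrative only, not part of the original article.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: turn each line of input into (word, 1) key/value pairs
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: merge the values for each word into a single count
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}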

The MapReduce framework consists of a single JobTracker master and one TaskTracker slave per cluster node. The master is responsible for managing resources, scheduling the job's tasks on the slaves, monitoring them, and re-executing any that fail. The TaskTracker slaves execute the tasks assigned by the master and report task-status information back to it.

The JobTracker is a weak point of Hadoop MapReduce: if it fails, all jobs in progress are interrupted.

Hadoop Distributed File System

Hadoop can work directly with several distributed file systems, such as the local FS, HFTP FS, S3 FS, and others. But the file system most commonly used by Hadoop is the Hadoop Distributed File System (HDFS).

The Hadoop Distributed File System (HDFS), based on the Google File System (GFS), provides a distributed file system designed to run on large clusters of commodity machines (thousands of computers) with high fault tolerance.

HDFS uses a master/slave architecture in which the master is a single NameNode that manages the file system metadata, and the slaves are one or more DataNodes that store the actual data.

An HDFS file is split into several blocks, and these blocks are stored on a set of DataNodes. The NameNode determines the mapping of blocks to DataNodes. The DataNodes serve read and write requests on the file system, and also handle block creation, deletion, and replication under instruction from the NameNode.

Like other file systems, HDFS also supports shell commands for interacting with files.
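
For example, a few common HDFS shell commands (the directory and file names here are placeholders):

hadoop fs -mkdir /user/demo            :: create a directory in HDFS
hadoop fs -put data.txt /user/demo     :: copy a local file into HDFS
hadoop fs -ls /user/demo               :: list the directory
hadoop fs -cat /user/demo/data.txt     :: print a file's contents
hadoop fs -get /user/demo/data.txt .   :: copy a file back to the local disk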

How does Hadoop work?

Stage 1

A user or an application submits a job to Hadoop (via the hadoop job client), providing the processing request together with the following basic information:

  • The locations of the input and output data on the distributed file system.
  • The Java classes, packaged as a JAR file, that implement the map and reduce functions.
  • Job-specific settings, passed as input parameters.

Stage 2

The Hadoop job client then submits the job (the JAR file) and its configuration to the JobTracker. The master distributes the work to the slave machines, monitors and manages their progress, and provides status and diagnostic information to the job client.
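
Assuming the word-count sketch above has been packaged into a JAR (the file name, class name, and paths are placeholders), the submission typically looks like this:

hadoop jar wordcount.jar WordCount /user/demo/input /user/demo/output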

Stage 3

TaskTrackers on different nodes execute the MapReduce task and return the output to the file system.

“Running Hadoop” means running a set of daemons, or resident programs, on the different servers in your network. These daemons have specific roles; some exist on only one server, while others run on multiple servers.

The daemons include:

  • NameNode
  • DataNode
  • SecondaryNameNode
  • JobTracker
  • TaskTracker

NameNode

The NameNode is the most important Hadoop daemon. Hadoop uses a master/slave architecture for both distributed storage and distributed computation. The distributed storage layer is called the Hadoop Distributed File System, or HDFS. The NameNode is the master of HDFS and directs the slave DataNode daemons to perform the low-level I/O tasks. The NameNode keeps track of how your files are split into blocks, which nodes store those blocks, and the overall health of the distributed file system. The work of the NameNode is memory and I/O intensive. For that reason, the NameNode host typically does not store any user data or perform any computation for MapReduce applications, in order to reduce the load on the machine. In other words, the NameNode server does not double as a DataNode or a TaskTracker.

Unfortunately, there is a negative side to the importance of the NameNode: it is a single point of failure for your Hadoop cluster. For any of the other daemons, if their node fails for software or hardware reasons, the Hadoop cluster can continue to run smoothly, or you can restart it quickly. This does not apply to the NameNode.

DataNode

Each slave machine in your cluster hosts a DataNode daemon to perform the low-level work of the distributed file system: reading and writing HDFS blocks to actual files on the local file system. When you want to read or write an HDFS file, the file is broken into blocks and the NameNode tells your client which DataNode daemons hold each block. Your client then communicates directly with the DataNode daemons to process the local files corresponding to those blocks. In addition, a DataNode may communicate with other DataNodes to replicate its data blocks for redundancy.

The DataNodes report to the NameNode regularly. On startup, each DataNode informs the NameNode of the blocks it currently stores. After this initial mapping is complete, the DataNode keeps polling the NameNode to report local changes and to receive instructions to create, move, or delete blocks on its local disk.

Secondary NameNode

The Secondary NameNode (SNN) is a daemon that monitors the state of the HDFS cluster. Like the NameNode, each cluster has one SNN, and it typically resides on its own machine, with no DataNode or TaskTracker daemons running on the same server. The SNN differs from the NameNode in that it does not receive or record real-time changes to HDFS. Instead, it communicates with the NameNode to take snapshots of the HDFS metadata at intervals defined by the cluster configuration.

As mentioned earlier, the NameNode is a single point of failure for a Hadoop cluster, and the SNN snapshots help minimize downtime and data loss. However, a NameNode failure still requires human intervention to reconfigure the cluster to use the SNN as the primary NameNode.

JobTracker

The JobTracker daemon is the liaison between your application and Hadoop. Once you submit your code to the cluster, the JobTracker determines the execution plan by deciding which files to process, assigning tasks to nodes, and monitoring all tasks while they run. If a task fails, the JobTracker automatically relaunches it, possibly on a different node, up to a predefined retry limit.

There is only one JobTracker on a Hadoop cluster. It usually runs on a server as a master node of the cluster.

TaskTracker

As with the storage daemons, the computing daemons also follow a master/slave architecture: the JobTracker is the master overseeing the overall execution of a MapReduce job, and the TaskTrackers manage the execution of individual tasks on each slave node. Each TaskTracker is responsible for executing the specific tasks assigned to it by the JobTracker. Although there is only one TaskTracker per slave node, each TaskTracker can spawn multiple JVMs to handle many map or reduce tasks in parallel.

One of the responsibilities of the TaskTracker is to constantly send heartbeats to the JobTracker. If the JobTracker does not receive a heartbeat from a TaskTracker within a specified amount of time, it assumes the TaskTracker has crashed and resubmits the corresponding tasks to other nodes in the cluster.

In this topology, a master node runs the NameNode and JobTracker daemons, and a separate node hosts the SNN in case the master node fails. For small clusters, the SNN can reside on one of the slave nodes; for large clusters, the NameNode and JobTracker should be separated onto two different machines. Each slave machine hosts a DataNode and a TaskTracker, so tasks run on the same node where their data is stored.

Advantages of Hadoop

  • The Hadoop framework allows users to quickly write and test distributed systems. It distributes data and work across machines efficiently, taking advantage of the parallelism of the underlying CPU cores.
  • Hadoop does not rely on hardware for fault tolerance and high availability (FTHA). Instead, Hadoop's own libraries are designed to detect and handle failures at the application layer.
  • Servers can be added to or removed from the cluster dynamically, and Hadoop keeps running without interruption.
  • Another great advantage of Hadoop, besides being open source, is that it is compatible with all platforms because it is written in Java.

Install Hadoop on Windows

Download software
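
For the paths used in the configuration below you need the Apache Hadoop 2.7.3 binary distribution unpacked to C:\hadoop-2.7.3, plus a Java JDK installed on the machine.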

Hadoop configuration

Edit the file C:\hadoop-2.7.3\etc\hadoop\core-site.xml

<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
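
The fs.defaultFS property sets the default file system URI, so Hadoop clients and daemons will use the NameNode listening at hdfs://localhost:9000.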

Edit the file C:\hadoop-2.7.3\etc\hadoop\mapred-site.xml

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
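
Setting mapreduce.framework.name to yarn makes MapReduce jobs run on YARN instead of the local job runner.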

Edit the file C:\hadoop-2.7.3\etc\hadoop\hdfs-site.xml

<configuration>
	<property>
		<name>dfs.replication</name>
		<value>1</value>
	</property>
	<property>
		<name>dfs.namenode.name.dir</name>
		<value>C:\hadoop-2.7.3\data\namenode</value>
	</property>
	<property>
		<name>dfs.datanode.data.dir</name>
		<value>C:\hadoop-2.7.3\data\datanode</value>
	</property>
</configuration>
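
Because this is a single-node setup with only one DataNode, dfs.replication is set to 1; the two directory properties tell the NameNode and the DataNode where to keep their data on the local disk.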

Edit the file C:\hadoop-2.7.3\etc\hadoop\yarn-site.xml

<configuration>
	<property>
		<name>yarn.nodemanager.aux-services</name>
		<value>mapreduce_shuffle</value>
	</property>
	<property>
		<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
		<value>org.apache.hadoop.mapred.ShuffleHandler</value>
	</property>
</configuration>
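
The mapreduce_shuffle auxiliary service lets each NodeManager serve map outputs to the reducers during the shuffle phase of a MapReduce job.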

Edit the hadoop-env.cmd file, replacing the line JAVA_HOME=%JAVA_HOME% with the path of your Java JDK installation.
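
For example (this JDK path is only an illustration; use the path where your JDK is actually installed):

set JAVA_HOME=C:\Java\jdk1.8.0_144

Once the configuration files are in place, the NameNode is normally formatted once and the daemons are started with the scripts shipped in the Hadoop distribution:

C:\hadoop-2.7.3\bin\hdfs namenode -format
C:\hadoop-2.7.3\sbin\start-dfs.cmd
C:\hadoop-2.7.3\sbin\start-yarn.cmd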