- Install Hadoop Multi Node Cluster On Ubuntu 18.04
Since you have reached this blog post on setting up a multi-node Hadoop cluster, I assume you have already read and experimented with my previous post, HOW TO INSTALL APACHE HADOOP 2.6.0 IN UBUNTU (SINGLE NODE SETUP). If not, I recommend reading it first before proceeding here. Since we are setting up a multi-node Hadoop cluster, we need multiple machines arranged in a master-slave architecture.
From single-node clusters to a multi-node cluster: we will build a multi-node cluster out of individual Ubuntu boxes. In my humble opinion, the best way for starters is to install, configure and test a "local" Hadoop setup on each box, and in a second step "merge" these single-node clusters into one multi-node cluster. So, as a prerequisite, first set up two or three single-node Hadoop clusters using the previous post; then we will merge them into a complete multi-node distributed Hadoop cluster. In this tutorial I assume we have three individual Ubuntu machines, each with a single-node Hadoop cluster installed. This post describes how to install and configure Hadoop clusters ranging from a few nodes to very large clusters. If you just want to play with Hadoop, you may first want to install it on a single machine (see the single-node setup).
A multi-node Hadoop cluster is composed of a master-slave architecture to accomplish big data processing across multiple nodes. In this post I will use three machines (one as the master node and the other two as slave nodes) to set up the Hadoop cluster. There is a correlation between the number of computers in the cluster and the size of the data and the processing applied to it: the heavier the dataset (and the heavier the processing technique), the more nodes the Hadoop cluster requires.
Let’s get started setting up a fresh multi-node Hadoop (2.6.0) cluster. Follow these steps:
- Installation and configuration of single-node Hadoop:
Install and configure a single-node Hadoop setup, which will become our master node. For instructions on setting up single-node Hadoop, see the previous blog post: http://pingax.com/install-hadoop2-6-0-on-ubuntu/.
- Prepare your computer network (decide the number of nodes in the cluster):
Based on parameters such as the purpose of the multi-node Hadoop cluster, the size of the dataset to be processed and the availability of machines, decide how many master and slave nodes to configure for the Hadoop cluster setup.
- Basic installation and configuration:
Step 3A: Choose hostnames for the nodes to be configured in the further steps. We will name the master node HadoopMaster, and the two slave nodes HadoopSlave1 and HadoopSlave2, respectively. After deciding the hostnames, assign them by updating each machine's hostname (you can skip this step if you do not want to set up names), and add all of the hostnames to the /etc/hosts file on all machines (master and slave nodes).
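The entries would look roughly like the fragment below; the IP addresses are placeholders, so substitute the real addresses of your machines:

```
# /etc/hosts (append on every node; IPs below are examples only)
192.168.1.10    HadoopMaster
192.168.1.11    HadoopSlave1
192.168.1.12    HadoopSlave2
```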
Step 3B: Create a hadoop group and an hduser user on all machines (if not already created). If you need to add hduser to sudoers, run sudo visudo and add the line: hduser ALL=(ALL:ALL) ALL
Step 3C: Install rsync (sudo apt-get install rsync) on all machines; it will be used to share the configured Hadoop source with the rest of the machines.
Step 3D: Reboot all of the machines so that the changes above take effect.
Hadoop configuration steps
- Applying common Hadoop configuration:
Although we will be configuring a master-slave architecture, we first need to apply the changes that are common to both master and slave nodes in the Hadoop config files, before distributing these files to the rest of the machines. These changes are made on your existing single-node Hadoop setup; from step 6 onward we will make changes specific to the master and slave nodes respectively.
- Update core-site.xml
Update this file by changing the hostname from localhost to HadoopMaster.
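A sketch of the resulting property; the port 9000 is the value used in the single-node guide, so adjust it if your setup differs:

```xml
<!-- core-site.xml: point the default filesystem at the master -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://HadoopMaster:9000</value>
</property>
```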
- Update hdfs-site.xml
Update this file by changing the replication factor from 1 to 3.
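The property in question looks like this:

```xml
<!-- hdfs-site.xml: one replica per node in a three-node cluster -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```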
- Update yarn-site.xml
Update this file by changing the hostname from localhost to HadoopMaster in the following three properties:
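A sketch of the three ResourceManager address properties; the port numbers are example values, so keep whatever ports your single-node setup used:

```xml
<!-- yarn-site.xml: point the YARN services at the master -->
<property>
  <name>yarn.resourcemanager.resource-tracker.address</name>
  <value>HadoopMaster:8025</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value>HadoopMaster:8030</value>
</property>
<property>
  <name>yarn.resourcemanager.address</name>
  <value>HadoopMaster:8050</value>
</property>
```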
- Update mapred-site.xml
Update this file by updating and adding the following properties:
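At minimum, MapReduce must be told to run on YARN; a sketch (the blog's original snippet may have set additional job-tracker addresses, which are not reproduced here):

```xml
<!-- mapred-site.xml: run MapReduce jobs on the YARN framework -->
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
```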
- Update masters
Add the hostname of the master node of the Hadoop cluster to this file:
- Update slaves
Add the hostnames of the slave nodes of the Hadoop cluster to this file:
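With the hostnames chosen earlier, the two files would contain:

```
# $HADOOP_HOME/etc/hadoop/masters
HadoopMaster

# $HADOOP_HOME/etc/hadoop/slaves
HadoopSlave1
HadoopSlave2
```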
- Copying/sharing/distributing the Hadoop config files to all other nodes (master/slaves):
Use rsync to distribute the configured Hadoop source to the rest of the nodes over the network, copying the hadoop folder to the same location (/usr/local/hadoop) on each slave node. That way you do not need to download and configure everything again on the remaining nodes; you only need Java and rsync installed there. The JAVA_HOME path must match the one in the $HADOOP_HOME/etc/hadoop/hadoop-env.sh file of your Hadoop distribution, which we already configured during the single-node setup.
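A sketch of the distribution step, run on HadoopMaster; the commands are echoed so you can inspect them first, so remove the leading `echo` to actually copy the tree (paths and hostnames follow this guide's layout):

```shell
# Push the configured Hadoop tree to each slave node over SSH.
for node in HadoopSlave1 HadoopSlave2; do
  echo rsync -avz /usr/local/hadoop/ "hduser@${node}:/usr/local/hadoop/"
done
```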
- Applying Master node specific Hadoop configuration: (Only for master nodes)
These are configuration changes to be applied on the Hadoop master node (since we have only one master node, they apply to just that machine).
Step 6A: Remove the existing Hadoop data folder (created during the single-node Hadoop setup).
Step 6B: Recreate the same directory (/usr/local/hadoop_tmp/hdfs) and create a NameNode directory inside it.
Step 6C: Make hduser the owner of that directory.
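Steps 6A-6C as a command sketch (run as root or via sudo; the ownership step assumes the hduser account from step 3B exists):

```shell
# 6A: remove the data folder left over from the single-node setup
rm -rf /usr/local/hadoop_tmp
# 6B: recreate it with a NameNode subdirectory
mkdir -p /usr/local/hadoop_tmp/hdfs/namenode
# 6C: make hduser the owner (skipped here if the user does not exist yet)
if id hduser >/dev/null 2>&1; then
  chown -R hduser:hadoop /usr/local/hadoop_tmp
fi
```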
- Applying slave-node-specific Hadoop configuration (only for slave nodes):
Since we have two slave nodes, we will apply the following changes on both the HadoopSlave1 and HadoopSlave2 nodes.
Step 7A: Remove the existing Hadoop data folder (created during the single-node Hadoop setup).
Step 7B: Recreate the same (/usr/local/hadoop_tmp/hdfs) folder and, inside it, create a DataNode directory.
Step 7C: Make hduser the owner of that directory.
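Steps 7A-7C as a command sketch, run on each slave (as root or via sudo):

```shell
# 7A: remove the data folder left over from the single-node setup
rm -rf /usr/local/hadoop_tmp
# 7B: recreate it with a DataNode subdirectory
mkdir -p /usr/local/hadoop_tmp/hdfs/datanode
# 7C: make hduser the owner (skipped here if the user does not exist yet)
if id hduser >/dev/null 2>&1; then
  chown -R hduser:hadoop /usr/local/hadoop_tmp
fi
```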
- Copying the SSH key to set up password-less SSH access from master to slave nodes:
To manage (start/stop) all nodes of the master-slave architecture, hduser (the Hadoop user on the master node) needs to be able to log in to all slave nodes as well as the master itself, which is made possible by setting up password-less SSH login. (If you skip this, you will have to type a password whenever daemons on the slave nodes are started or stopped from the master node.) Share the public SSH key, the $HOME/.ssh/id_rsa.pub file of the HadoopMaster node, into the authorized_keys files of the hduser accounts on HadoopSlave1 and HadoopSlave2.
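One common way to do this is `ssh-copy-id`; the sketch below (run as hduser on HadoopMaster) echoes the commands so you can review them, so drop the leading `echo` to run them for real:

```shell
# Generate a key pair once (no passphrase), then copy the public key
# to the hduser account on every node, including the master itself.
echo ssh-keygen -t rsa -P "" -f "$HOME/.ssh/id_rsa"
for node in HadoopMaster HadoopSlave1 HadoopSlave2; do
  echo ssh-copy-id -i "$HOME/.ssh/id_rsa.pub" "hduser@${node}"
done
```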
- Format the NameNode (run on the master node):
- Starting up Hadoop cluster daemons : (Run on MasterNode)
Start HDFS daemons:
Start MapReduce daemons:
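The full startup sequence on HadoopMaster (as hduser), echoed here as a preview; remove the `echo` to execute, and note that formatting wipes any existing HDFS metadata (the start scripts live in $HADOOP_HOME/sbin):

```shell
echo hdfs namenode -format   # one-time only, before the first start
echo start-dfs.sh            # NameNode on master, DataNodes on slaves
echo start-yarn.sh           # ResourceManager on master, NodeManagers on slaves
```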
Instead of both of the above commands you can also use start-all.sh, but it is now deprecated, so it is not recommended for Hadoop operations.
- Track/Monitor/Verify Hadoop cluster : (Run on any Node)
Verify the Hadoop daemons on the master:
Verify the Hadoop daemons on all slave nodes:
(As shown in the snapshot above, the services running on HadoopSlave1 will be the same on every slave node configured in the Hadoop cluster.)
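A quick way to verify the daemons on any node is `jps`, which lists the running Java processes; the expected names are noted in the comments below:

```shell
# Master should list: NameNode, SecondaryNameNode, ResourceManager (+ Jps)
# Slaves should list: DataNode, NodeManager (+ Jps)
if command -v jps >/dev/null 2>&1; then
  jps
else
  echo "jps not found - make sure the JDK bin directory is on your PATH"
fi
```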
Monitor the Hadoop ResourceManager and Hadoop NameNode via their web UIs.
If you wish to track Hadoop MapReduce as well as HDFS, try exploring the web views of the ResourceManager and NameNode, which Hadoop administrators commonly use. Open your default browser and visit the following links from any node:
For ResourceManager – http://HadoopMaster:8088
For NameNode – http://HadoopMaster:50070
If you get output similar to the snapshot above for the master and slave nodes, then congratulations! You have successfully installed Apache Hadoop on your cluster; if not, post your error messages in the comments and we will be happy to help you. Happy Hadooping!
Running Hadoop on Ubuntu Linux (Multi-Node Cluster)
From single-node clusters to a multi-node cluster
We will merge three (or more) single-node clusters into one multi-node cluster, in which one Ubuntu box becomes the designated master and the other boxes become slaves.
Configure the single-node clusters first; here we use three of them. Shut down each single-node cluster (for example with stop-dfs.sh and stop-yarn.sh) before making the changes below.
The easiest setup is to put the three machines on the same network, with similar hardware and software configuration. Map aliases to the IP addresses of all the machines in /etc/hosts on every machine. Here we are creating a cluster of three machines: one is master, one is slave1 and the other is slave2. Add a line for each of the three hosts.
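The entries would look like the fragment below; the IP addresses are placeholders for your machines' real addresses:

```
# /etc/hosts (same on all three machines; IPs are examples only)
192.168.0.1    master
192.168.0.2    slave1
192.168.0.3    slave2
```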
The hduser1 user on the master (ssh hduser1@master) must be able to connect:
- to its own user account on the master – i.e. ssh master in this context;
- to the hduser1 user account on each slave (i.e. ssh hduser1@slave1 and ssh hduser1@slave2) via a password-less SSH login.
Set up password-less SSH login across the cluster:
- Connect as user hduser1 from the master to the hduser1 account on slave1 and slave2.
From master to master
From master to slave1
From slave1 to slave2
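As before, `ssh-copy-id` is one way to distribute the key; a preview (run as hduser1 on the master, and repeat from slave1 for the slave1-to-slave2 connection), with the `echo` removed to run for real:

```shell
for node in master slave1 slave2; do
  echo ssh-copy-id -i "$HOME/.ssh/id_rsa.pub" "hduser1@${node}"
done
```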
The following describes how to configure one Ubuntu box as the master node and the other boxes as slave nodes.
The machine on which sbin/start-dfs.sh is run becomes the primary NameNode. Create the masters file in the $HADOOP_HOME/etc/hadoop directory on all the nodes, containing the single line master.
The slaves file should likewise be updated on all the nodes. Open the slaves file in the $HADOOP_HOME/etc/hadoop directory and list the slave hostnames, removing localhost if it is present.
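With the hostnames used in this tutorial, the two files would contain:

```
# $HADOOP_HOME/etc/hadoop/masters
master

# $HADOOP_HOME/etc/hadoop/slaves
slave1
slave2
```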
$HADOOP_HOME/etc/hadoop/*-site.xml (All nodes.)
Open core-site.xml and change the fs.default.name parameter (in $HADOOP_HOME/etc/hadoop/core-site.xml), which specifies the NameNode (the HDFS master) host and port.
$HADOOP_HOME/etc/hadoop/core-site.xml (All nodes.)
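A sketch of the property; the port 54310 is a commonly used example value, so keep whatever port your setup uses:

```xml
<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>
</property>
```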
$HADOOP_HOME/etc/hadoop/mapred-site.xml (All nodes.)
Open mapred-site.xml and change the mapred.job.tracker parameter (in $HADOOP_HOME/etc/hadoop/mapred-site.xml), which specifies the JobTracker (MapReduce master) host and port.
$HADOOP_HOME/etc/hadoop/mapred-site.xml (All nodes.)
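A sketch of the property; the port 54311 is a commonly used example value:

```xml
<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
</property>
```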
$HADOOP_HOME/etc/hadoop/hdfs-site.xml (All nodes.)
Open hdfs-site.xml and change the dfs.replication parameter (in $HADOOP_HOME/etc/hadoop/hdfs-site.xml), which specifies the default block replication. We have two slave nodes available, so we set it to 2. Paste the property between <configuration> and </configuration> in the file.
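The property looks like this:

```xml
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
```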
Applying Master node specific Hadoop configuration: (Only for master nodes)
These are configuration changes to be applied on the Hadoop master node (since we have only one master node, they apply to just that machine).
Remove the existing Hadoop data folder (created during the single-node Hadoop setup), then recreate the same directory (/app/hadoop/tmp), create a NameNode directory inside it, and make hduser1 the owner of that directory.
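These steps as a command sketch (run as root or via sudo; the ownership step assumes the hduser1 account exists):

```shell
# Remove the single-node leftovers and recreate the NameNode directory.
rm -rf /app/hadoop/tmp
mkdir -p /app/hadoop/tmp/namenode
# Make hduser1 the owner (skipped here if the user does not exist yet):
if id hduser1 >/dev/null 2>&1; then
  chown -R hduser1:hadoop /app/hadoop/tmp
fi
```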
Applying Slave node specific Hadoop configuration (Only for slave nodes)
Since we have two slave nodes, we will apply the following changes on the slave1 and slave2 nodes:
Remove the existing Hadoop data folder (created during the single-node Hadoop setup), then recreate the same (/app/hadoop/tmp) folder, create a DataNode directory inside it, and make hduser1 the owner of that directory.
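The same steps on each slave, as a sketch (run as root or via sudo):

```shell
# Remove the single-node leftovers and recreate the DataNode directory.
rm -rf /app/hadoop/tmp
mkdir -p /app/hadoop/tmp/datanode
# Make hduser1 the owner (skipped here if the user does not exist yet):
if id hduser1 >/dev/null 2>&1; then
  chown -R hduser1:hadoop /app/hadoop/tmp
fi
```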
Formatting the HDFS filesystem via the NameNode (Only for master nodes)
Format the cluster's HDFS file system
Starting the multi-node cluster (Only for master nodes)
This command starts the NameNode daemon on the master and DataNode daemons on all slaves (here: slave1 and slave2).
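The format-and-start sequence, run on the master as hduser1 and echoed here as a preview (drop the `echo` to execute; formatting destroys any existing HDFS data):

```shell
echo hdfs namenode -format   # one-time only, before the first start
echo start-dfs.sh            # HDFS daemons on master and slaves
echo start-yarn.sh           # YARN daemons on master and slaves
```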
Track/Monitor/Verify Hadoop cluster (Run on any Node)
To verify the Hadoop daemons on the master, run jps.
To verify the Hadoop daemons on any slave (here: slave1 and slave2), run jps there as well; DataNode and NodeManager should be running.
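A small check script for any node (expected daemon names are in the comments):

```shell
# Master should list: NameNode, SecondaryNameNode, ResourceManager (+ Jps)
# Slaves should list: DataNode, NodeManager (+ Jps)
if command -v jps >/dev/null 2>&1; then
  jps
else
  echo "jps not found - make sure the JDK bin directory is on your PATH"
fi
```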
Monitor the Hadoop ResourceManager and Hadoop NameNode via their web UIs.
Note: the JobTracker and TaskTracker concepts are different in Hadoop YARN; in newer versions of Hadoop, running jobs are monitored via the ResourceManager.
http://localhost:50070 - web UI of the NameNode daemon
Finally, smoke-test the cluster: create an input directory on HDFS, execute an example program, and check the output directory.
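A preview of such a smoke test using the bundled WordCount example; the jar path and version below are assumptions, so adjust them to your installation, and remove the `echo` prefixes to run for real:

```shell
echo hdfs dfs -mkdir -p /user/hduser1/input
echo hdfs dfs -put /usr/local/hadoop/etc/hadoop/*.xml /user/hduser1/input
echo hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar \
  wordcount /user/hduser1/input /user/hduser1/output
echo hdfs dfs -cat /user/hduser1/output/part-r-00000
```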