Install Hadoop Multi Node Cluster Ubuntu

As you have reached this blogpost on setting up a Multinode Hadoop cluster, I assume that you have already read and experimented with my previous blogpost on HOW TO INSTALL APACHE HADOOP 2.6.0 IN UBUNTU (SINGLE NODE SETUP). If not, I recommend reading it first before proceeding here. Since we are interested in setting up a Multinode Hadoop cluster, we must have multiple machines to fit into the Master-Slave architecture.

From single-node clusters to a multi-node cluster – in this tutorial we will build a multi-node cluster out of individual Ubuntu boxes. In my humble opinion, the best way to do this for starters is to install, configure and test a “local” Hadoop setup on each Ubuntu box first, and only in a second step “merge” these single-node clusters. So, as a prerequisite, we first set up two or three single-node Hadoop clusters using the previous post and then merge them into a complete multi-node distributed Hadoop cluster of two or three nodes. In this tutorial I will assume that we have three individual Ubuntu machines, each with a single-node Hadoop cluster installed. This post describes how to install and configure Hadoop clusters ranging from a few nodes to very large clusters; to play with Hadoop, you may first want to install it on a single machine (see the single-node setup).

Here, a Multinode Hadoop cluster is composed of a Master-Slave architecture containing multiple nodes to accomplish BigData processing. So, in this post I am going to use three machines (one as the MasterNode and the other two as SlaveNodes) for setting up the Hadoop cluster. There is a correlation between the number of computers in the cluster and the size of the data and the data-processing technique: the heavier the dataset (and the heavier the data-processing technique), the larger the number of computers/nodes required in the Hadoop cluster.

Let’s get started with setting up a fresh Multinode Hadoop (2.6.0) cluster. Follow the steps below.

Prerequisites

  1. Installation and Configuration of Single node Hadoop :
    Install and configure single-node Hadoop, which will be our Masternode. For instructions on how to set up single-node Hadoop, visit the previous blog http://pingax.com/install-hadoop2-6-0-on-ubuntu/.
  2. Prepare your computer network (decide the number of nodes in the cluster) :
    Based on parameters such as the purpose of the Hadoop Multinode cluster, the size of the dataset to be processed and the availability of machines, you need to decide the number of Master nodes and the number of Slave nodes to be configured for the Hadoop cluster setup.
  3. Basic installation and configuration :

    Step 3A : Decide the hostnames of the nodes to be configured in the further steps. We will name the Masternode HadoopMaster and the two Slave nodes HadoopSlave1 and HadoopSlave2 respectively. After deciding the hostname of each node, assign it by updating the machine's hostname (you can skip this step if you do not want to set up names), and add all hostnames to the /etc/hosts file on all machines (Master and Slave nodes), for example:
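    A minimal example of the /etc/hosts entries (the IP addresses below are placeholders; replace them with the real addresses of your machines):

    ```
    # /etc/hosts on every node (illustrative addresses)
    192.168.1.100    HadoopMaster
    192.168.1.101    HadoopSlave1
    192.168.1.102    HadoopSlave2
    ```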

    Step 3B : Create hadoop as a group and hduser as a user on all machines (if not already created).
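    The standard Ubuntu commands for this, using the names assumed throughout this post:

    ```
    sudo addgroup hadoop
    sudo adduser --ingroup hadoop hduser
    ```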

    If you need to add hduser to the sudoers, then fire the following command
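    On Ubuntu this can be done by simply adding hduser to the sudo group:

    ```
    sudo adduser hduser sudo
    ```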

    OR
    Add the following line to /etc/sudoers (edit it with sudo visudo)
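    The sudoers entry would look like this (it grants hduser full sudo rights):

    ```
    hduser ALL=(ALL:ALL) ALL
    ```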

    Step 3C : Install rsync for sharing the hadoop source with the rest of the machines.
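    On Ubuntu, rsync is available from the standard repositories:

    ```
    sudo apt-get install rsync
    ```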

    Step 3D : To make the above changes take effect, we need to reboot all of the machines.
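    For example, on each machine:

    ```
    sudo reboot
    ```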

Hadoop configuration steps

  1. Applying Common Hadoop Configuration :
    Since we will be configuring a Master-Slave architecture, we need to apply the common changes to the Hadoop config files (i.e. changes common to both Master and Slave nodes) before we distribute these Hadoop files to the rest of the machines/nodes. Hence, these changes will be reflected in your single-node Hadoop setup. From step 6 onwards we will make changes specific to the Master and Slave nodes respectively.

    Changes:

    1. Update core-site.xml
      Update this file by changing the hostname from localhost to HadoopMaster.
    2. Update hdfs-site.xml
      Update this file by changing the replication factor from 1 to 3.
    3. Update yarn-site.xml
      Update this file by updating the following three properties, changing the hostname from localhost to HadoopMaster.
    4. Update mapred-site.xml
      Update this file by updating and adding the following properties.
    5. Update masters
      Update this file with the hostname of the master node of the Hadoop cluster.
    6. Update slaves
      Update this file with the hostnames of the slave nodes of the Hadoop cluster.
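    The snippets below illustrate what the changed properties could look like; the hostname HadoopMaster is the one chosen above, while the port numbers are placeholders (keep whichever ports your single-node setup already uses). Each <property> block goes between the <configuration> tags of the corresponding file:

    ```
    <!-- core-site.xml : point the default file system at the master -->
    <property>
      <name>fs.default.name</name>
      <value>hdfs://HadoopMaster:9000</value>
    </property>

    <!-- hdfs-site.xml : raise the replication factor to 3 -->
    <property>
      <name>dfs.replication</name>
      <value>3</value>
    </property>

    <!-- yarn-site.xml : point the ResourceManager addresses at the master -->
    <property>
      <name>yarn.resourcemanager.resource-tracker.address</name>
      <value>HadoopMaster:8025</value>
    </property>
    <property>
      <name>yarn.resourcemanager.scheduler.address</name>
      <value>HadoopMaster:8030</value>
    </property>
    <property>
      <name>yarn.resourcemanager.address</name>
      <value>HadoopMaster:8050</value>
    </property>

    <!-- mapred-site.xml : run MapReduce on YARN; the jobtracker address is kept for completeness -->
    <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
    </property>
    <property>
      <name>mapreduce.jobtracker.address</name>
      <value>HadoopMaster:54311</value>
    </property>
    ```

    The masters file then contains the single line HadoopMaster, and the slaves file lists HadoopSlave1 and HadoopSlave2, one per line.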
  2. Copying/Sharing/Distributing Hadoop config files to the rest of the nodes – master/slaves
    Use rsync to distribute the configured Hadoop source to the rest of the nodes over the network, for example:
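    A sketch of the rsync commands, assuming Hadoop lives in /usr/local/hadoop on every node and that the hduser account can write to that path on the slaves:

    ```
    sudo rsync -avxP /usr/local/hadoop/ hduser@HadoopSlave1:/usr/local/hadoop/
    sudo rsync -avxP /usr/local/hadoop/ hduser@HadoopSlave2:/usr/local/hadoop/
    ```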

    The above command will copy the files stored within the hadoop folder to the Slave nodes at the location /usr/local/hadoop. So you do not need to download and set up the above configuration again on the rest of the nodes; you only need Java and rsync installed on all nodes. The JAVA_HOME path also needs to match the $HADOOP_HOME/etc/hadoop/hadoop-env.sh file of your Hadoop distribution, which we already configured in the single-node Hadoop setup.

  3. Applying Master node specific Hadoop configuration: (Only for master nodes)
    These are configurations to be applied on the Hadoop MasterNode (since we have only one master node, they will be applied to that node only).

    Step 6A : Remove the existing Hadoop data folder (which was created during the single-node Hadoop setup).
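    Assuming the single-node setup used /usr/local/hadoop_tmp for its HDFS data:

    ```
    sudo rm -rf /usr/local/hadoop_tmp/hdfs/
    ```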

    Step 6B : Recreate the same directory (/usr/local/hadoop_tmp/hdfs) and create the NameNode directory (/usr/local/hadoop_tmp/hdfs/namenode) inside it.
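    A single mkdir -p creates both levels at once:

    ```
    sudo mkdir -p /usr/local/hadoop_tmp/hdfs/namenode
    ```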

    Step 6C : Make hduser the owner of that directory.
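    For example, using the hduser user and hadoop group created earlier:

    ```
    sudo chown -R hduser:hadoop /usr/local/hadoop_tmp/
    ```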

  4. Applying Slave node specific Hadoop configuration : (Only for slave nodes)
    Since we have two slave nodes, we will be applying the following changes over the HadoopSlave1 and HadoopSlave2 nodes.

    Step 7A : Remove the existing Hadoop data folder (which was created during the single-node Hadoop setup).
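    Same as on the master:

    ```
    sudo rm -rf /usr/local/hadoop_tmp/hdfs/
    ```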

    Step 7B : Create the same (/usr/local/hadoop_tmp/hdfs) directory and, inside it, create the DataNode directory (/usr/local/hadoop_tmp/hdfs/datanode).
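    For example:

    ```
    sudo mkdir -p /usr/local/hadoop_tmp/hdfs/datanode
    ```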

    Step 7C : Make hduser the owner of that directory.
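    For example:

    ```
    sudo chown -R hduser:hadoop /usr/local/hadoop_tmp/
    ```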

  5. Copying ssh key for Setting up passwordless ssh access from Master to Slave node :
    To manage (start/stop) all nodes of the Master-Slave architecture, hduser (the hadoop user of the Masternode) needs to be able to log in to all Slave nodes as well as the Master node, which is made possible by setting up passwordless SSH login. (If you do not set this up, you will have to provide a password while starting and stopping daemons on the Slave nodes from the Master node.)

    Fire the following command for sharing the public SSH key – the $HOME/.ssh/id_rsa.pub file (of the HadoopMaster node) – to the authorized_keys file of hduser@HadoopSlave1 and hduser@HadoopSlave2 (in $HOME/.ssh/authorized_keys):
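    A sketch using ssh-copy-id, run as hduser on HadoopMaster (generate the key first if it does not exist yet):

    ```
    ssh-keygen -t rsa -P ""
    ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@HadoopSlave1
    ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@HadoopSlave2
    ```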

  6. Format the NameNode (Run on MasterNode) :
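    For example, run as hduser on HadoopMaster (note that formatting erases any existing HDFS data):

    ```
    hdfs namenode -format
    ```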
  7. Starting up Hadoop cluster daemons : (Run on MasterNode)
    Start HDFS daemons:
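    Assuming the Hadoop sbin directory is on your PATH (otherwise call the script with its full $HADOOP_HOME/sbin path):

    ```
    start-dfs.sh
    ```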

    Start MapReduce daemons:
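    In Hadoop 2.x the MapReduce daemons run on YARN, so this is done with:

    ```
    start-yarn.sh
    ```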

    Instead of both of the above commands you can also use start-all.sh, but it is now deprecated, so it is not recommended for Hadoop operations.

  8. Track/Monitor/Verify Hadoop cluster : (Run on any Node)
    Verify Hadoop daemons on Master :
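    Use the jps tool shipped with the JDK; on the master you should see roughly NameNode, SecondaryNameNode and ResourceManager listed alongside Jps:

    ```
    jps
    ```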


    Verify Hadoop daemons on all slave nodes :
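    Again with jps; each slave should show DataNode and NodeManager:

    ```
    jps
    ```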

    (The running services on HadoopSlave1 will be the same on every Slave node configured in the Hadoop cluster.)

    Monitor the Hadoop ResourceManager and the Hadoop NameNode via the web UI.

    If you wish to track Hadoop MapReduce as well as HDFS, you can also explore the web UIs of the ResourceManager and the NameNode, which are commonly used by Hadoop administrators. Open your default browser and visit the following links from any of the nodes.

    For ResourceManager – http://HadoopMaster:8088
    For NameNode – http://HadoopMaster:50070

    If you are getting similar output for the Master and Slave nodes then congratulations! You have successfully installed Apache Hadoop in your cluster; if not, post your error messages in the comments and we will be happy to help you. Happy Hadooping! You can also request a blog title ([email protected]) if you want me to write about it.

Running Hadoop on Ubuntu Linux (Multi-Node Cluster)

From single-node clusters to a multi-node cluster

We will build a multi-node cluster by merging three single-node clusters into one multi-node cluster, in which one Ubuntu box becomes the designated master and the other boxes become slaves.

Environment Versions

Prerequisites

Configure the single-node clusters first; here we have used three single-node clusters. Shut down each single-node cluster with the following command:
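These single-node clusters can be stopped with the standard Hadoop 2.x scripts (run as the hadoop user on each box; stop-all.sh would also work but is deprecated):

```
stop-yarn.sh
stop-dfs.sh
```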

Networking

The easiest approach is to put all three machines in the same network, with similar hardware and software configuration.

Update /etc/hosts on all machines. Map an alias to the IP address of each machine. Here we are creating a cluster of 3 machines: one is master, one is slave1 and the other is slave2:

Add the following lines for the three-node cluster
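The IP addresses below are placeholders; substitute the real addresses of your three machines:

```
# /etc/hosts (illustrative addresses)
10.0.0.1    master
10.0.0.2    slave1
10.0.0.3    slave2
```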

SSH access

The hduser1 user on the master (ssh hduser1@master) must be able to connect:

  • to its own user account on the master - i.e. ssh master in this context.
  • to the hduser1 user account on the slaves (i.e. ssh hduser1@slave1 and ssh hduser1@slave2) via a password-less SSH login.

Set up password-less SSH login between the machines of the cluster:

  • Connect as user hduser1 from the master to the hduser1 account on slave1 and slave2.

From master to master

From master to slave1

From master to slave2
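One way to set this up is with ssh-copy-id, run as hduser1 on the master (this assumes the RSA key pair already exists from the single-node setup; otherwise create it first with ssh-keygen -t rsa -P ""):

```
# from master to master
ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser1@master
# from master to slave1
ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser1@slave1
# from master to slave2
ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser1@slave2
```

Afterwards, verify each connection once (ssh master, ssh slave1, ssh slave2) so the host keys are accepted.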

Hadoop

Cluster Overview

This section describes how to configure one Ubuntu box as the master node and the other Ubuntu boxes as slave nodes.

Configuration

$HADOOP_HOME/etc/hadoop/masters

The machine on which sbin/start-dfs.sh is running will become the primary NameNode. This file should be updated on all the nodes. Create the masters file in the $HADOOP_HOME/etc/hadoop/ directory:

Add the following line
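In this setup the masters file simply names the master box:

```
master
```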

$HADOOP_HOME/etc/hadoop/slaves

This file should be updated on all the nodes. Open the slaves file in the $HADOOP_HOME/etc/hadoop/ directory:

Add the following lines (remove localhost)
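With the two slave boxes of this cluster (you could also list master here if you want the master to run a DataNode as well):

```
slave1
slave2
```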

$HADOOP_HOME/etc/hadoop/*-site.xml (All nodes.)

Open this file in the $HADOOP_HOME/etc/hadoop/ directory:

Change the fs.default.name parameter (in $HADOOP_HOME/etc/hadoop/core-site.xml), which specifies the NameNode (the HDFS master) host and port.

$HADOOP_HOME/etc/hadoop/core-site.xml (All nodes.)
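A sketch of the relevant property; the port 54310 is only an example, so keep whatever port your single-node setup already used:

```
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:54310</value>
  </property>
</configuration>
```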

$HADOOP_HOME/etc/hadoop/mapred-site.xml (All nodes.)

Open this file in the $HADOOP_HOME/etc/hadoop/ directory

Change the mapred.job.tracker parameter (in $HADOOP_HOME/etc/hadoop/mapred-site.xml), which specifies the JobTracker (MapReduce master) host and port, and add the mapreduce.framework.name property.

$HADOOP_HOME/etc/hadoop/mapred-site.xml (All nodes.)
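Again only as an illustration (port 54311 is a placeholder):

```
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>master:54311</value>
  </property>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```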

$HADOOP_HOME/etc/hadoop/hdfs-site.xml (All nodes.)

Open this file in the $HADOOP_HOME/etc/hadoop/ directory

Change the dfs.replication parameter (in $HADOOP_HOME/etc/hadoop/hdfs-site.xml), which specifies the default block replication. We have two slave nodes available, so we set dfs.replication to 2. The change should look like this:
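```
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
```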

Paste the following between <configuration></configuration> in file $HADOOP_HOME/etc/hadoop/yarn-site.xml:
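A minimal sketch that points the NodeManagers at the master's ResourceManager and enables the MapReduce shuffle service (these are the standard Hadoop 2.x property names; the hostname master comes from /etc/hosts above):

```
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>master</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
```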

Applying Master node specific Hadoop configuration: (Only for master nodes)

These are configurations to be applied on the Hadoop master node (since we have only one master node, they apply to just that node).

Remove the existing Hadoop data folder (which was created during the single-node Hadoop setup):
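Assuming /app/hadoop/tmp was the hadoop.tmp.dir of your single-node setup:

```
sudo rm -rf /app/hadoop/tmp
```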

Make the same (/app/hadoop/tmp) directory and create the NameNode directory (/app/hadoop/tmp/hdfs/namenode) inside it:
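A single mkdir -p creates all the levels at once:

```
sudo mkdir -p /app/hadoop/tmp/hdfs/namenode
```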

Make hduser1 the owner of that directory:
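For example (the hadoop group name is assumed from the single-node setup; adjust it if yours differs):

```
sudo chown -R hduser1:hadoop /app/hadoop/tmp
```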

Applying Slave node specific Hadoop configuration (Only for slave nodes)

Since we have two slave nodes, we will be applying the following changes over the slave1 and slave2 nodes:

Remove the existing Hadoop data folder (which was created during the single-node Hadoop setup):
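Same as on the master:

```
sudo rm -rf /app/hadoop/tmp
```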

Create the same (/app/hadoop/tmp) folder and, inside this folder, create the DataNode directory (/app/hadoop/tmp/hdfs/datanode):
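For example:

```
sudo mkdir -p /app/hadoop/tmp/hdfs/datanode
```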

Make hduser1 the owner of that directory:
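Again the group name is assumed; adjust it if yours differs:

```
sudo chown -R hduser1:hadoop /app/hadoop/tmp
```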

Formatting the HDFS filesystem via the NameNode (Only for master nodes)

Format the cluster's HDFS file system
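Run as hduser1 on master (this erases any data currently stored in HDFS):

```
hdfs namenode -format
```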

Starting the multi-node cluster (Only for master nodes)
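A sketch of the start sequence, run as hduser1 on master (assuming $HADOOP_HOME/sbin is on the PATH); start-dfs.sh brings up HDFS and start-yarn.sh then brings up the YARN daemons:

```
start-dfs.sh
start-yarn.sh
```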

By the start-dfs.sh command the NameNode daemon is started on master, and DataNode daemons are started on all slaves (here: slave1 and slave2); start-yarn.sh then starts the ResourceManager on master and a NodeManager on each slave.

Track/Monitor/Verify Hadoop cluster (Run on any Node)

To verify the Hadoop daemons on the master, run the following command:
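On the master, jps should list roughly NameNode, SecondaryNameNode and ResourceManager alongside Jps:

```
jps
```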

Verify the Hadoop daemons on any slave (here: slave1 and slave2); DataNode and NodeManager should be running:
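Run jps on each slave:

```
jps
```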

Monitor the Hadoop ResourceManager and the Hadoop NameNode via the web UI

ResourceManager: http://master:8088

Note: The JobTracker and TaskTracker concepts are different in Hadoop YARN; in the new version of Hadoop we can monitor the jobs being executed via the ResourceManager.

http://localhost:50070 - web UI of the NameNode daemon

Datanode Information

Create input directory on HDFS:
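A hedged example; the /input path and the sample files copied into it are arbitrary choices:

```
hdfs dfs -mkdir -p /input
hdfs dfs -put $HADOOP_HOME/etc/hadoop/*.xml /input
```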

Execute example program:
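For example, the bundled WordCount job (the exact jar file name depends on your Hadoop release, hence the wildcard; /input and /output are the illustrative paths from above):

```
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /input /output
```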

Check output directory:
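List the output directory and print the result (part-r-00000 is the usual name of the first reducer's output file):

```
hdfs dfs -ls /output
hdfs dfs -cat /output/part-r-00000
```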