Wednesday, 18 December 2013

MultiNode Hadoop Cluster Setup.

A "Hadoop Cluster" is a type of computational cluster designed particularly for storing and analyzing large amount of data (Which can be of any form like structured, semi-structured or unstructured etc..) in a distributed manner. Hadoop Cluster runs Hadoop's open source distributed processing framework built up on java on low cost commodity computers. Hadoop is designed on a concept called WORM-Write Once Read/Execute Multiple times. Also it is called a "Shared Nothing" architecture. Hadoop Cluster can be massively scalable from a single machine to thousands of machines. But the only thing which is being shared between them is the network that connects them. That is why this is referred as  "Shared Nothing" architecture.
Hadoop Cluster are designed for boosting the speed of data analysis applications. Hadoop cluster are highly scalable. If the existing cluster's processing power exceeds by the growth of data, we can add additional nodes to increase thee throughput of the processing unit. Hadoop clusters also are highly resistant to node failure because each blocks of data is copied onto two more cluster nodes(replication of the data blocks- default 3 and it is scalable), which ensures the data redundancy even with the failure of one node.
Now Let's see how a production Hadoop Cluster can be built. Which is explained in detail bellow.

1. Designing The Cluster Network.
2. Operating System Selection And Preparation.
3. Configure Nameserver And Adding Hostname To Each Machine.
4. Unix Firewall On Hadoop Cluster.
5. Disable The Fastest Mirror Hunting Process By Yum.
6. Setting Passwordless SSH Login On All The Nodes Of The Hadoop Cluster.
7. Installation Of Java.
8. Hadoop Installation.
9. Hadoop Cluster Configuration.
10. Synchronizing The Configuration Files Between Master And The Slave Nodes.
11. Adding The Dedicated Users To The Hadoop Group.
12. Assigning Dedicated Users For The Daemons In HDFS.
13. Formatting The HDFS File System Via Name-Node.
14. Create Working Directories For Jobtracker And Tasktracker In HDFS And Assign Permissions.
15. Starting The Multi-Node Cluster.
16. Hadoop Daemons Web UI.

Step:1
Designing The Cluster Network
It is important to make a careful decision about the network architecture when deploying a Hadoop cluster. To get the best throughput from the cluster, the nodes should be kept as physically close together as possible: the lower the network latency, the better the performance.
Another way to reduce background traffic is to create a Virtual Private Network (VPN) dedicated to the cluster. Any other application server connected to this cluster should sit in a different subnet, which further reduces background traffic and can serve as the access point to the outside world, such as a cloud environment.
The anticipated latency within the VPN is ~1-2 ms, which makes physical proximity less of an issue; this should still be confirmed through testing in the target environment.
Network Architecture recommendation:

   Dedicated ToR (Top of Rack) switches for Hadoop
   Use dedicated core switching blades or switches
   Ensure application servers are “close” to Hadoop
   Consider Ethernet bonding (NIC teaming) for increased capacity/bandwidth utilization (optional)



(Figure: Network Architecture)

Step:2
Operating System Selection and Preparation
Why Linux? Hadoop is a framework written in Java, but there is enough native code and there are enough Linux-isms in its surrounding infrastructure to make Linux the only production-quality option today. Since both are open source, it is strongly recommended to run the Hadoop cluster on a Unix/Linux base operating system. The choice of operating system may be influenced by administration tools, hardware support, or commercial software support; the best choice is usually to minimize the variables and reduce risk by picking the distribution with which you are most comfortable. Preparing the OS for Hadoop requires a number of steps, and repeating them on a large number of machines is both time-consuming and error-prone. For this reason, it is strongly advised that a software configuration management system be used; Puppet and Chef are two open source tools that fit this requirement. A significant number of production clusters run on Red Hat Enterprise Linux or its freely available sister, CentOS. Ubuntu, SUSE Linux Enterprise, and Debian deployments also exist in production and work perfectly well. We selected Linux (CentOS) as the operating system for our Hadoop cluster. It is best practice to create a custom CentOS image preconfigured with the required software so that all machines contain the same set of utilities.
The following settings should be applied on OS level:

   # Do not use Logical Volume Management
   # Use a configuration management system (yum, permissions, sudoers, etc.), e.g. Puppet
   # Reduce kernel swappiness (see the sketch after this list)
   # Revoke access to cluster machines from regular users
   # Do NOT use virtualization for the cluster nodes
   # At a minimum, the following Linux commands (among others) are required:
   # ln, chmod, chown, chgrp, mount, umount, kill, rm, yum, mkdir, rsync, etc.
     It is better to follow the "express text" based installation on all the nodes, instead of the default GUI installation. This avoids installing unwanted packages on the OS.
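As a minimal sketch of the swappiness recommendation above (a value of 1 is a commonly used assumption for Hadoop nodes; tune it for your environment):
# echo "vm.swappiness = 1" >> /etc/sysctl.conf   # 1 is an assumed, commonly used value
# sysctl -p
# cat /proc/sys/vm/swappiness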
Step:3
Configure Nameserver And Add Hostname To Each Machine
Edit the resolv.conf file in /etc/
# vi /etc/resolv.conf
  search example.com
  nameserver  192.168.1.1
Change the hostname on all the machines
# vi /etc/sysconfig/network
  HOSTNAME=<hostname>.example.com
  GATEWAY=192.168.1.1
Enter the details of all the nodes in the /etc/hosts file, in the format IP address, canonical name, alias name:
# vi /etc/hosts
  192.168.1.101 n1.xyz.com n1
  192.168.1.102 n2.xyz.com n2
  192.168.1.103 n3.xyz.com n3
  192.168.1.104 n4.xyz.com n4
  192.168.1.105 n5.xyz.com n5
  192.168.1.106 n6.xyz.com n6
These entries should be added on all the nodes so that every node can reach the others by hostname.
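As an optional sanity check (a small sketch that assumes the six hostnames defined above), confirm that every node resolves and responds:
# for h in n1 n2 n3 n4 n5 n6; do ping -c 1 $h > /dev/null && echo "$h OK" || echo "$h FAILED"; done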

Step:4
Unix Firewall On Hadoop Cluster
Why The Firewall Has To Be Disabled In The Hadoop Cluster:
Hadoop is generally referred to as a "shared nothing" architecture; the only thing shared in the cluster is the network that interconnects the nodes. A Hadoop cluster runs many daemons, and they use a handful of TCP ports. Some of these ports are used by Hadoop's daemons to communicate amongst themselves (to schedule jobs, replicate blocks, etc.); other ports listen directly to users, either via an interposed Java client that communicates over internal protocols, or through plain old HTTP. Clusters therefore tend to run "firewall-less" between the Hadoop nodes, but on a separate network with a gateway node between the normal network and the Hadoop cluster.
What are the Linux Firewalls and how to disable them?
SELinux and iptables are the two access-control/firewall mechanisms on these systems that need to be disabled. Let's see how to disable them.
SELinux
# vi /etc/selinux/config
  SELINUX=disabled
IP-Tables
iptables comes in two variants, IPv4 (iptables) and IPv6 (ip6tables), and we need to disable both of them.
     For IPV4
# service iptables stop
# chkconfig iptables off
    For IPV6
# service ip6tables stop
# chkconfig ip6tables off
iptables can be disabled at any time, but the SELinux change only takes effect after a reboot. Restart the machine and then check the status of both as follows.
# init 6 OR reboot
After reboot Check the status of SELinux and IP-Tables
To check the status of SELinux:
# /usr/sbin/sestatus
OR
# sestatus
SELinux status:      disabled
To check the status of IP-Tables:
# /etc/init.d/iptables status
iptables: Firewall is not running.
# /etc/init.d/ip6tables status
Ip6tables: Firewall is not running.

Step:5
Disable The Fastest Mirror Hunting Process By Yum
By default, yum checks for the fastest available mirror before downloading packages for installation. We can disable this mirror-hunting process and make yum proceed with the configured mirrors. To do this, change the value in the yum plug-in configuration file under /etc/yum/pluginconf.d/.
# vi /etc/yum/pluginconf.d/fastestmirror.conf
  enabled=0
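As a minimal sketch, the same change can be made non-interactively, assuming the plug-in file is /etc/yum/pluginconf.d/fastestmirror.conf on every node:
# sed -i 's/^enabled=1/enabled=0/' /etc/yum/pluginconf.d/fastestmirror.conf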
Step:6
Setting Up Passwordless SSH Authentication On All The Nodes
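A minimal sketch of one common approach: generate an SSH key pair on the master node and copy the public key to every node, assuming the hostnames n1-n6 from the /etc/hosts file above:
# ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
# for h in n1 n2 n3 n4 n5 n6; do ssh-copy-id -i ~/.ssh/id_rsa.pub root@$h; done   # hostnames are assumptions from Step 3
# ssh n2 hostname   # should log in and print the hostname without prompting for a password
Repeat (or at least copy the keys) for any other user that needs to start the daemons remotely.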

Step:7
Installation Of Java.
Download the Java bin file, copy it into the /usr/local directory, and run it:
# chmod 755 jdk-6u30-linux-x64.bin
# ./jdk-6u30-linux-x64.bin
Set JAVA_HOME & PATH
Open the /etc/profile file and add these two lines at the end of the file.
# vim /etc/profile
  export JAVA_HOME=/usr/local/java
  export PATH=$PATH:/usr/local/java/bin
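If the JDK extracts into a versioned directory such as /usr/local/jdk1.6.0_30 (the path the hadoop-env.sh example later assumes), a symlink keeps JAVA_HOME=/usr/local/java valid; then reload the profile in the current shell:
# ln -s /usr/local/jdk1.6.0_30 /usr/local/java   # adjust to your actual JDK directory
# source /etc/profile
# java -version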

We can check the java installation as follows
# echo $JAVA_HOME
/usr/local/java
# echo $PATH
/usr/local/sbin:/usr/sbin:/sbin:/usr/local/bin:/usr/bin:/bin:/root/bin:/usr/local/java/bin:/root/bin
# which java
/usr/local/java/bin/java
#

Step:8
Hadoop Installation
This installation can be done in different ways: either a manual installation [tarballs, online installation from a vendor like Cloudera (using its online yum repository), a local yum repository, or RPM installation] or an automatic installation (Cloudera Manager).
Here we are going to do the online installation from Cloudera (using the online yum repository). Let's see how to do that.
Step 1: Download the CDH3 Repository Package.
First of all we need to download and install the CDH3 repository package, which is necessary for the remaining installation of Cloudera Hadoop. Here we are going to install Hadoop on a CentOS cluster.
 # curl -O http://archive.cloudera.com/redhat/6/x86_64/cdh/cdh3-repository-1.0-1.noarch.rpm
Download this RPM into your home directory and install it using the following command:
# yum --nogpgcheck localinstall cdh3-repository-1.0-1.noarch.rpm
Step 2: Install CDH3 on all hosts
Find and install the Hadoop core and native packages. For example:
# yum search hadoop
 This will display all the packages available to install with that version of Hadoop (here, CDH3).
# yum install hadoop-0.20 hadoop-0.20-native
With this, the base packages are installed; this has to be repeated on all the machines. Now we need to install the daemon packages on the appropriate machines:
n1.xyz.com : Namenode
n2.xyz.com : Secondarynamenode
n3.xyz.com : Jobtracker
n4.xyz.com : DataNode, TaskTracker
n5.xyz.com : DataNode, TaskTracker
n6.xyz.com : DataNode, TaskTracker

Step 3: Install daemon packages on the appropriate machines.
 # yum install hadoop-0.20-<daemon type>
where <daemon type> is one of the following:

     namenode
     datanode
     secondarynamenode
     jobtracker
     tasktracker
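For example, following the node-to-daemon mapping above (a sketch; adjust it to your own layout), the per-machine commands would be:
On n1.xyz.com:
# yum install hadoop-0.20-namenode
On n2.xyz.com:
# yum install hadoop-0.20-secondarynamenode
On n3.xyz.com:
# yum install hadoop-0.20-jobtracker
On n4, n5 and n6 (the worker nodes):
# yum install hadoop-0.20-datanode hadoop-0.20-tasktracker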

With this, Hadoop is installed across the cluster. To get the cluster running we need to configure it; every parameter of the cluster is configurable. In the following section we will see how to configure the cluster.
Step:9
Hadoop Cluster Configuration.
A Hadoop cluster is a master-slave architecture designed for distributed storage and processing. For both storage and processing there are masters that control and coordinate their slaves. Let's see the master-slave architecture and the corresponding daemons.
(Figure: Master-Slave Architecture Of Hadoop Cluster)
In the Cloudera Hadoop installation, the minimum required users and groups for the cluster are created by default. To control the distributed storage a user called hdfs is created, while for distributed processing the user is called mapred; in addition, a group called hadoop is created for the cluster. The daemons are user specific, so each daemon must be mapped to the corresponding user: all DFS storage and its working directories should be owned by the hdfs user, and all MapReduce processes and their working directories should be owned by the mapred user, with both belonging to the hadoop group as well. We will see all of these steps in detail along the way.
All the Hadoop cluster configuration files reside in the conf directory of the Hadoop home; the exact location varies. Go to the Hadoop installation directory; for Cloudera it can be found in locations such as /etc/hadoop, /usr/lib/hadoop, /usr/lib/hadoop-<version>, etc. In any of these locations we can find the configuration directory named conf, and inside it the configuration files. The main configuration files that need to be edited in order to run the cluster are the following.
Hadoop Environment configuration file: hadoop-env.sh
A shell script that sets the Hadoop-specific environment variables, such as JAVA_HOME, HADOOP_HEAPSIZE, HADOOP_OPTS, etc.
Hadoop Core configuration file: core-site.xml
This is an XML file. It holds core properties such as the Hadoop temp directory and the default filesystem (NameNode) URI for the cluster.
Hadoop Distributed storage configuration file: hdfs-site.xml
This XML file specifies the NameNode metadata directory, the DataNode storage directories, user permissions, and so on.
Hadoop Distributed processing configuration file: mapred-site.xml
This is the XML file for Hadoop's distributed processing framework, MapReduce. It specifies the JobTracker and TaskTracker working directories for intermediate results, the default JobTracker machine, etc.
Let's see the configuration one by one..
hadoop-env.sh
Go to the conf directory of Hadoop, open the hadoop-env.sh file, uncomment the required lines, and give them appropriate values.
# cd /usr/lib/hadoop/conf/
# vim hadoop-env.sh
# The java implementation to use. Required.
export JAVA_HOME=/usr/local/jdk1.6.0_30/
# The maximum amount of heap to use, in MB. Default is 1000
export HADOOP_HEAPSIZE=2000
Here JAVA_HOME must point to the same JDK as in the /etc/profile file.
With this, the basic environment setup is ready. Now log out of all terminals, open a new terminal, and check the Hadoop version as follows. It can be verified either as the hadoop user or as the root user.
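For example (the exact version string printed will depend on the CDH3 release that was installed):
# which hadoop
# hadoop version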

core-site.xml
Create a tmp directory in some location and give it the proper ownership and permissions, then map that location in the core-site.xml file. This is the temporary folder in which data bound for HDFS is first staged.

# vim core-site.xml
<!-- In: conf/core-site.xml -->
<configuration>
    <property>
          <name>hadoop.tmp.dir</name>
          <value>/usr/lib/hadoop/tmp</value>
    </property>
    <property>
          <name>fs.default.name</name>
          <value>hdfs://hostname of NN machine:8020</value>
    </property>
</configuration>

# mkdir /usr/lib/hadoop/tmp
# chmod 775 /usr/lib/hadoop/tmp
# chown -R hdfs:hadoop /usr/lib/hadoop/tmp
# ls -ld /usr/lib/hadoop/tmp
drwxrwxr-x   3 hdfs hadoop    4096 Mar 17 23:07 /usr/lib/hadoop/tmp

hdfs-site.xml
# vim hdfs-site.xml
<!-- In: conf/hdfs-site.xml -->
<configuration>
<property>
  <name>dfs.permissions</name>
  <value>false</value>
  <final>true</final>
</property>
<property>
   <name>dfs.name.dir</name>
   <value>/dfs/name</value>
   <final>true</final>
</property>
<property>
   <name>dfs.data.dir</name>
   <value>/dfs/data1,/dfs/data2</value>
   <final>true</final>
</property>
<property>
   <name>dfs.datanode.failed.volumes.tolerated</name>
   <value>1</value>
   <final>true</final>
</property>
<property>
   <name>dfs.replication</name>
   <value>3</value>
   <final>true</final>
</property>
<property>
   <name>dfs.block.size</name>
   <value>67108864</value>
   <final>true</final>
</property>
<property>
   <name>dfs.checkpoint.dir</name>
   <value>/dfs/checkpoint</value>
   <final>true</final>
</property>
<property>
   <name>dfs.http.address</name>
   <value>0.0.0.0:50070</value>
   <final>true</final>
</property>
<property>
   <name>dfs.datanode.http.address</name>
   <value>0.0.0.0:50075</value>
   <final>true</final>
</property>
<property>
   <name>dfs.secondary.http.address</name>
   <value>0.0.0.0:50090</value>
   <final>true</final>
</property>
<property>
   <name>dfs.backup.http.address</name>
   <value>0.0.0.0:50105</value>
   <final>true</final>
</property>
</configuration>


Now we will create a new directory, dfs, in the / (root) directory of all nodes and assign the proper permissions. On all the nodes, give ownership of the dfs directory to the hdfs user and the hadoop group.
# mkdir -p /dfs
# chmod 775 /dfs/
# chown -R hdfs:hadoop /dfs/
mapred-site.xml
# vim mapred-site.xml
<!-- In: conf/mapred-site.xml -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>hdfs://hostname of jobtracker machine:8021</value>
    <final>true</final>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/mapred/system</value>
    <final>true</final>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/mapred/local</value>
    <final>true</final>
  </property>
  <property>
    <name>mapred.temp.dir</name>
    <value>/mapred/temp</value>
    <final>true</final>
  </property>
  <property>
    <name>mapreduce.jobtracker.staging.root.dir</name>
    <value>/user</value>
    <final>true</final>
  </property>
  <property>
    <name>mapred.job.tracker.http.address</name>
    <value>0.0.0.0:50030</value>
    <final>true</final>
  </property>
  <property>
    <name>mapred.task.tracker.http.address</name>
    <value>0.0.0.0:50060</value>
    <final>true</final>
  </property>
</configuration>

For the TaskTracker and JobTracker we need to create the system and working directories in HDFS; this can be done only after formatting the NameNode (see Step 14).
Step:10
Synchronizing The Configuration Files Between Master And The Slave Nodes.
Now we need to synchronize the configuration files (core-site.xml, hdfs-site.xml and mapred-site.xml) from the master node to all the slave nodes. For a small or medium-sized cluster it is reasonable to do this manually, but for a large production cluster tools like Puppet or Chef should be used. The configuration must be identical throughout the cluster nodes. Here we can use rsync or scp to push the configuration files.

core-site.xml
# rsync -avrt /usr/lib/hadoop/conf/core-site.xml root@IP:/usr/lib/hadoop/conf/core-site.xml
hdfs-site.xml
# rsync -avrt /usr/lib/hadoop/conf/hdfs-site.xml root@IP:/usr/lib/hadoop/conf/hdfs-site.xml
mapred-site.xml
# rsync -avrt /usr/lib/hadoop/conf/mapred-site.xml root@IP:/usr/lib/hadoop/conf/mapred-site.xml
This has to be repeated for every slave node, replacing IP with each node's address; a small loop like the sketch below can save some typing.
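A minimal sketch of such a loop, assuming n1 is the node where the files were edited and n2-n6 are the slave nodes with passwordless SSH for root:
# for h in n2 n3 n4 n5 n6; do rsync -avrt /usr/lib/hadoop/conf/*-site.xml root@$h:/usr/lib/hadoop/conf/; done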

Step:11
Adding The Dedicated Users To The Hadoop Group.
Now we will add the dedicated users (hdfs and mapred) to the hadoop group on all the nodes.
# gpasswd -a hdfs hadoop
  Adding user hdfs to group hadoop
# gpasswd -a mapred hadoop
  Adding user mapred to group hadoop
Step:12
Assigning Dedicated Users For The Daemons In HDFS:
# export HADOOP_NAMENODE_USER=hdfs
# export HADOOP_SECONDARYNAMENODE_USER=hdfs
# export HADOOP_DATANODE_USER=hdfs
# export HADOOP_JOBTRACKER_USER=mapred
# export HADOOP_TASKTRACKER_USER=mapred
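These exports only last for the current shell session. One common way to make them persistent (an assumption; adapt the path to your installation) is to append them to conf/hadoop-env.sh:
# cat >> /usr/lib/hadoop/conf/hadoop-env.sh << 'EOF'
export HADOOP_NAMENODE_USER=hdfs
export HADOOP_SECONDARYNAMENODE_USER=hdfs
export HADOOP_DATANODE_USER=hdfs
export HADOOP_JOBTRACKER_USER=mapred
export HADOOP_TASKTRACKER_USER=mapred
EOF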
Step:13
Formatting The hdfs File System Via Name-Node.
Now move to the bin directory of Hadoop and format the HDFS filesystem via the NameNode.

# cd /usr/lib/hadoop/bin/
# hadoop namenode -format
.
.
12/03/21 13:29:17 INFO common.Storage: Storage directory /dfs/name has been successfully formatted.
12/03/21 13:29:17 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at master/192.168.208.217
************************************************************/
[root@master bin]#
Step:14
Create Working Directories For Jobtracker And Tasktracker In HDFS And Assign Permissions
For the TaskTracker and JobTracker we need to create the local and system directories in HDFS; now that the NameNode has been formatted, this can be done. Note that the HDFS daemons must be running for these commands to succeed, so if they are not yet up, start HDFS first (for example with start-dfs.sh, or by starting the NameNode and DataNode services) before running the commands below.

# su hdfs
$ hadoop fs -mkdir -p /mapred/system
$ hadoop fs -mkdir -p /mapred/local
$ hadoop fs -mkdir -p /mapred/temp
$ hadoop fs -mkdir -p /user

$ hadoop fs -chmod 755 /mapred/system
$ hadoop fs -chmod 755 /mapred/local
$ hadoop fs -chmod 755 /mapred/temp
$ hadoop fs -chmod 755 /user

$ hadoop fs -chown -R mapred:hadoop /mapred/system
$ hadoop fs -chown -R mapred:hadoop /mapred/local
$ hadoop fs -chown -R mapred:hadoop /mapred/temp
$ hadoop fs -chown -R mapred:hadoop /user
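An optional check that the directories and their ownership look right:
$ hadoop fs -ls /
$ hadoop fs -ls /mapred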
Step:15
Starting The Multi-Node Cluster.
We have now successfully formatted the Hadoop filesystem. Next we will start the daemons on the appropriate nodes (provided passwordless SSH has been configured).

# cd /usr/lib/hadoop/bin
# ./start-all.sh
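To verify that the expected daemons came up on each node, the jps tool that ships with the JDK can be used; a quick sketch (it assumes jps is on the PATH of the remote shells, otherwise run jps locally on each node):
# for h in n1 n2 n3 n4 n5 n6; do echo "== $h =="; ssh $h jps; done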
Step:16
Hadoop Daemons Web UI
Each daemon has a corresponding web UI, which can be used to monitor and administer the Hadoop cluster. Let's look at them one by one.

Namenode:
http://hostname:50070
SecondaryNamenode:
http://hostname:50090
Datanode:
http://hostname:50075
Jobtracker:
http://hostname:50030
Tasktracker:
http://hostname:50060
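A quick way to confirm from the command line that the UIs are answering (a sketch assuming the NameNode runs on n1 and the JobTracker on n3, as in the mapping above):
# curl -sI http://n1:50070 | head -1   # NameNode web UI
# curl -sI http://n3:50030 | head -1   # JobTracker web UI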



That is all about setting up a production Hadoop cluster. The installation and configuration of other Hadoop ecosystem components will be explained in upcoming articles. For any query, please contact me at: anulsasidharan@gmail.com
