What and Why High Availability Cluster?
HA Cluster Architecture
Components of HA Cluster
Installation and configuration
Install and Configure JAVA
OS Configuration and Optimization
Disable the yum fastestmirror plugin
Disable Firewall (iptables and selinux)
Install the necessary packages for hadoop
Download and Add the Cloudera repository key
Check the available hadoop packages
Installing HDFS package
Hadoop Configuration files
NameNode HA Configuration
JAVA configuration for hadoop cluster
Initializing the Journal Nodes and Formatting the HA Cluster NameNodes
Initializing ZKFC by formatting Zookeeper
Activating the passive NameNode – Bootstrap Process
Checking the HA_Cluster
What and Why High Availability Cluster?
NameNode HA Architecture
Components of HA Cluster
- Active NameNode
- Standby NameNode
- Zookeeper
- ZKFC - Zookeeper Fail-over Controller
- Journal Nodes
Installation and configuration
Install and Configure JAVA
Install JAVA. Here I have selected the /usr/local directory for the JDK installation: copy the bin file to /usr/local and execute the commands below. You can also use an rpm based installation instead.
# chmod 755 jdk-6u45-linux-x64-rpm.bin
# ./jdk-6u45-linux-x64-rpm.bin
|
Update the /etc/profile file (and the user's .bashrc) with JAVA_HOME and the bin PATH.
# vim /etc/profile
# cd ~
# vim .bashrc
export JAVA_HOME={PATH TO JAVA_HOME}
export PATH=$PATH:{PATH TO JAVA_HOME}/bin
|
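To apply the new variables in the current shell without rebooting, the updated files can simply be sourced (logging off and on again works as well):
# source /etc/profile
# source ~/.bashrc
|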
For a 32 bit architecture a reboot is needed for this configuration to take effect. It is better to log off and log on to the machine once, then check the java installation:
# echo $JAVA_HOME
# echo $PATH
# which java
|
OS Configuration and Optimization
Edit the /etc/hosts file and add the entries for all the hosts:
# vim /etc/hosts
192.168.0.100   nn1.hadoop.com   nn1
192.168.0.101   nn2.hadoop.com   nn2
192.168.0.102   dn1.hadoop.com   dn1
192.168.0.103   dn2.hadoop.com   dn2
192.168.0.104   dn3.hadoop.com   dn3
|
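To confirm that the names resolve as expected, a quick check on each node (assuming getent and ping are available) is:
# getent hosts nn1.hadoop.com
# ping -c 1 nn2
|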
Update the /etc/sysconfig/network file with the correct hostname on each node:
NETWORKING=yes
HOSTNAME=hostname.hadoop.com
GATEWAY=192.168.0.1
|
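The hostname change only takes effect after a reboot; to apply it to the running system immediately and verify it, something like the following can be used (replace hostname.hadoop.com with the node's actual FQDN):
# hostname hostname.hadoop.com
# hostname -f
|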
Disable the yum fastestmirror plugin
# vim /etc/yum/pluginconf.d/fastestmirror.conf
enabled=0
|
Disable Firewall (iptables and selinux)
iptables (IPv4)
# service iptables stop
# chkconfig iptables off
|
ip6tables (IPv6)
# service ip6tables stop
# chkconfig ip6tables off
|
SELinux
# vim /etc/selinux/config
SELINUX=disabled
|
Reboot the node after disabling the firewall.
To see the status of the firewall:
# service iptables status; service ip6tables status
|
# sestatus
|
Install the necessary packages for hadoop
# yum install perl openssh-clients
|
Download and Add the Cloudera repository key
Download the Cloudera repository file and save it in the /etc/yum.repos.d/ directory:
# wget -P /etc/yum.repos.d/ http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/cloudera-cdh5.repo
|
Add the repository key:
# rpm --import http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
|
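To confirm that yum now sees the Cloudera repository, the repository list can be checked (the exact repository id depends on the repo file):
# yum repolist | grep -i cloudera
|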
Check the available hadoop packages
# yum search hadoop
|
Installing HDFS package
Here we are using three nodes: nn1, nn2 and dn1. We will start by installing the packages required by the NameNode on nn1, nn2 and dn1. The reason we install the HDFS packages on a non-NameNode server (dn1) is that we will need to run a journal node there.
Install the hadoop-hdfs-namenode package on all three nodes:
# yum install hadoop-hdfs-namenode
|
Install the journalnode package on all three nodes:
# yum install hadoop-hdfs-journalnode
|
Install the ZooKeeper server package on all three nodes:
# yum install zookeeper-server
|
Install the failover controller on the two NameNodes only, nn1 and nn2:
# yum install hadoop-hdfs-zkfc
|
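Optionally, verify that the expected packages landed on each node (the list will differ between the NameNodes and dn1):
# rpm -qa | grep -E 'hadoop-hdfs|zookeeper'
|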
Before configuring the NameNode we need to make sure that the ZooKeeper cluster is up and running. Here we have three ZooKeeper servers: nn1, nn2 and dn1. We need to add this information to the ZooKeeper configuration file, /etc/zookeeper/conf/zoo.cfg.
Enter the following details at the end of the configuration file.
General syntax: server.id=host:port:port
# vim /etc/zookeeper/conf/zoo.cfg
server.1=nn1.hadoop.com:2888:3888
server.2=nn2.hadoop.com:2888:3888
server.3=dn1.hadoop.com:2888:3888
|
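The rest of zoo.cfg can stay at the packaged defaults; it is worth confirming that the client port matches the one used later in ha.zookeeper.quorum (2181) and that the data directory exists. The values below are the usual CDH defaults, adjust if yours differ:
# grep -E 'clientPort|dataDir' /etc/zookeeper/conf/zoo.cfg
clientPort=2181
dataDir=/var/lib/zookeeper
|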
Update the zoo.cfg file on all the ZooKeeper nodes.
If you are deploying multiple ZooKeeper servers after a fresh install, you need to create a myid file in each server's data directory. You can do this by means of an init command option:
Server1
# service zookeeper-server init --myid=1
|
Server2
# service zookeeper-server init --myid=2
|
Server3
# service zookeeper-server init --myid=3
|
Note:
The myid file of server 1 will contain the text "1" and nothing else. The id must be unique within the ensemble and should have a value between 1 and 255.
Start the ZooKeeper server on all the nodes:
# service zookeeper-server start
# chkconfig zookeeper-server on
|
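A quick way to confirm that each ZooKeeper server is answering is the ruok four-letter command (this assumes nc is installed; a healthy server replies imok):
# echo ruok | nc nn1.hadoop.com 2181
imok
|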
Hadoop Configuration files
There are three main configuration files for the hadoop components: core-site.xml, hdfs-site.xml and mapred-site.xml. The core-site.xml file contains configuration options that are common to all the servers in the cluster, while hdfs-site.xml and mapred-site.xml provide the configuration options for the HDFS and MapReduce components respectively. The configuration files are located in /etc/hadoop/conf and /etc/default; some of the environment variables have been moved to /etc/default.
NameNode HA Configuration
Hadoop XML file general property syntax:
<property>
<name> </name>
<value> </value>
<description> </description>
<final> </final>
</property>
|
Open the core-site.xml file and enter the following details.
<property>
<name>fs.default.name</name>
<value>hdfs://sample-cluster/</value>
<description>NameNode Cluster Name</description>
</property>
<property>
<name>ha.zookeeper.quorum</name>
<value>nn1.hadoop.com:2181,nn2.hadoop.com:2181,dn1.hadoop.com:2181</value>
<description>Specifies the location and port of the ZooKeeper cluster</description>
</property>
|
Open the hdfs-site.xml file and enter the following details.
<property>
<name>dfs.name.dir</name>
<value>/dfs/nn/</value>
</property>
<property>
<name>dfs.nameservices</name>
<value>sample-cluster</value>
<description>Logical name of the NameNode cluster</description>
</property>
<property>
<name>dfs.ha.namenodes.sample-cluster</name>
<value>nn1,nn2</value>
<description>NameNodes that make up the HA cluster</description>
</property>
<!-- rpc-address properties for the HA_NN_Cluster -->
<property>
<name>dfs.namenode.rpc-address.sample-cluster.nn1</name>
<value>nn1.hadoop.com:8020</value>
<description>nn1 rpc-address</description>
</property>
<property>
<name>dfs.namenode.rpc-address.sample-cluster.nn2</name>
<value>nn2.hadoop.com:8020</value>
<description>nn2 rpc-address</description>
</property>
<!-- http-address properties for the HA_NN_Cluster -->
<property>
<name>dfs.namenode.http-address.sample-cluster.nn1</name>
<value>nn1.hadoop.com:50070</value>
<description>nn1 http-address</description>
</property>
<property>
<name>dfs.namenode.http-address.sample-cluster.nn2</name>
<value>nn2.hadoop.com:50070</value>
<description>nn2 http-address</description>
</property>
|
Note:
The standby NameNode uses HTTP calls to periodically copy the fsimage file from the active NameNode, perform the checkpoint operation, and ship it back.
<!--quorum journal properties for the HA_NN_Cluster-->
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://nn1.hadoop.com:8485;nn2.hadoop.com:8485;dn1.hadoop.com:8485/sample-cluster</value>
<description>Specifies the setup of JournalNode Cluster</description>
</property>
|
Note:
This property specifies the setup of the JournalNode cluster. Both the active and standby NameNodes use it to identify which hosts they should contact to send or receive new edit log changes.
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/dfs/journal</value>
<description>Location on the local file system where editlog changes
will be stored</description>
</property>
<!--failover properties for the HA_NN_Cluster-->
<property>
<name>dfs.client.failover.proxy.provider.sample-cluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
<description>The Java class that HDFS clients use to contact the
Active NameNode</description>
</property>
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
<description>Indicates if the NameNode cluster will use manual or
automatic failover</description>
</property>
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
<description>Fencing method used to cut off the previously active NameNode during a failover</description>
</property>
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/var/lib/hadoop-hdfs/.ssh/id_rsa</value>
<description>SSH private key used by the sshfence method; the hdfs user must be able to log in to the other NameNode with this key</description>
</property>
|
The sshfence method requires passwordless SSH authentication for the hdfs user between the two NameNodes, as sketched below.
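A minimal sketch of setting this up, assuming the hdfs home directory is /var/lib/hadoop-hdfs as configured above: generate a key pair as the hdfs user on each NameNode and append each public key to the other node's /var/lib/hadoop-hdfs/.ssh/authorized_keys file.
# sudo -u hdfs mkdir -p /var/lib/hadoop-hdfs/.ssh
# sudo -u hdfs ssh-keygen -t rsa -f /var/lib/hadoop-hdfs/.ssh/id_rsa -N ""
# sudo -u hdfs cat /var/lib/hadoop-hdfs/.ssh/id_rsa.pub
|
A quick test such as sudo -u hdfs ssh nn2.hadoop.com hostname should then succeed without a password prompt.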
Then synchronize these configuration files to the other nodes in the cluster; the rsync command can be used for this, for example as shown below.
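A possible one-liner, run from nn1 and repeated for each remaining host (this assumes root SSH access between the nodes; adjust the host name as needed):
# rsync -av /etc/hadoop/conf/ nn2.hadoop.com:/etc/hadoop/conf/
|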
JAVA Configuration for Hadoop cluster
JAVA_HOME must be configured in the /etc/default/bigtop-utils file:
export JAVA_HOME=/usr/local/java
|
Initializing the Journal Nodes and Formatting the HA Cluster NameNodes
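Before starting any services, make sure the local directories referenced in hdfs-site.xml exist and are owned by the hdfs user. The paths below follow the dfs.name.dir and dfs.journalnode.edits.dir values configured earlier; run this on the nodes where those roles live:
# mkdir -p /dfs/nn /dfs/journal
# chown -R hdfs:hdfs /dfs
|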
Now we can start the journal nodes on all three nodes:
# service hadoop-hdfs-journalnode start
|
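Optionally, enable the journal node service at boot in the same way as ZooKeeper (the CDH init scripts normally support chkconfig):
# chkconfig hadoop-hdfs-journalnode on
|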
Now we need to format HDFS for the first time; to do this, run the following command on the first HA NameNode, nn1.
# sudo -u hdfs hdfs namenode -format
|
Initializing ZKFC by formatting Zookeeper
The next step is to create an entry for the HA cluster in ZooKeeper, then start the NameNode and ZKFC services on the NameNode we just formatted (nn1):
# sudo -u hdfs hdfs zkfc -formatZK
# service hadoop-hdfs-namenode start
# service hadoop-hdfs-zkfc start
|
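To confirm that the formatZK step created the cluster's znode, the ZooKeeper CLI can be used; the znode name follows dfs.nameservices, so it should list sample-cluster here:
# zookeeper-client -server nn1.hadoop.com:2181 ls /hadoop-ha
|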
Activating the passive NameNode – Bootstrap Process
To activate the passive NameNode, an operation called bootstrapping needs to be performed; execute the following command on nn2:
# sudo -u hdfs hdfs namenode -bootstrapStandby
|
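Once the standby has been bootstrapped, the NameNode and ZKFC services also need to be started on nn2, using the same commands as on nn1:
# service hadoop-hdfs-namenode start
# service hadoop-hdfs-zkfc start
|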
Checking the HA_Cluster
To check which NameNode is active and which is standby (one should report "active" and the other "standby"):
# sudo -u hdfs hdfs haadmin -getServiceState nn1
# sudo -u hdfs hdfs haadmin -getServiceState nn2
|
To get the web UI of the active NN:
http://nn1.hadoop.com:50070
|
To get the web UI of the standby NN:
http://nn2.hadoop.com:50070
|
Happy Hadooping!......