- What and Why High Availability Cluster?
- HA Cluster Architecture
- Components of HA Cluster
- Installation and configuration
- Install and Configure JAVA
- OS Configuration and Optimization
- Disable the yum fastestmirror plugin
- Disable Firewall (iptables and selinux)
- Install the necessary packages for hadoop
- Download and Add the Cloudera repository key
- Check the available hadoop packages
- Installing HDFS package
- Hadoop Configuration files
- NameNode HA Configuration
- JAVA configuration for hadoop cluster
- Initializing the Journal Nodes and Formatting the HA Cluster NameNodes
- Initializing ZKFC by formatting Zookeeper
- Activating the passive NameNode – Bootstrap Process
- Checking the HA_Cluster
What and Why High Availability Cluster?
In a standard HDFS deployment the NameNode is a single point of failure; a High Availability (HA) cluster removes this by running an Active and a Standby NameNode, so the standby can take over automatically if the active NameNode fails.
NameNode HA Architecture
Components of HA Cluster
- Active NameNode
- Standby NameNode
- Zookeeper
- ZKFC - Zookeeper Fail-over Controller
- Journal Nodes
Installation and configuration
Install and Configure JAVA
Install JAVA. Here the /usr/local directory is chosen for the JDK installation: copy the bin file to /usr/local and execute the following commands. You can also use an RPM-based installation instead.
| 
# chmod 755 jdk-6u45-linux-x64-rpm.bin 
# ./jdk-6u45-linux-x64-rpm.bin | 
Update the /etc/profile file with JAVA_HOME and the JDK bin directory in PATH.
| 
# vim /etc/profile 
# cd ~ 
# vim .bashrc 
export JAVA_HOME={PATH TO JAVA_HOME} 
export PATH=$PATH:{PATH TO JAVA_HOME}/bin | 
On 32-bit architectures a reboot is needed for this configuration to take effect. It is better to log off and log on once, then verify the Java installation:
| 
# echo $JAVA_HOME 
# echo $PATH 
# which java | 
OS Configuration and Optimization
Edit the /etc/hosts file and add entries for all hosts in the cluster.
| 
# vim /etc/hosts 
192.168.0.100 nn1.hadoop.com nn1 
192.168.0.101 nn2.hadoop.com nn2 
192.168.0.102 dn1.hadoop.com dn1 
192.168.0.103 dn2.hadoop.com dn2 
192.168.0.104 dn3.hadoop.com dn3 | 
Update the /etc/sysconfig/network file with the correct hostname on each node.
| 
NETWORKING=yes 
HOSTNAME=hostname.hadoop.com 
GATEWAY=192.168.0.1 | 
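As an optional sanity check, each node should report the fully qualified hostname configured above and be able to resolve the other nodes by name, for example:
| 
# hostname -f 
# ping -c 1 nn2.hadoop.com | 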
Disable the yum fastestmirror plugin
| 
# vim /etc/yum/pluginconf.d/fastestmirror.conf 
enabled=0 | 
Disable Firewall (iptables and selinux)
iptables (IPv4)
| 
# service iptables stop 
# chkconfig iptables off | 
ip6tables (IPv6)
| 
# service ip6tables stop 
# chkconfig ip6tables off | 
SELinux
| 
#vim /etc/selinux/config 
SELINUX=disabled | 
Reboot the node after disabling the firewall and SELinux.
To check the firewall status:
| 
# service iptables status; service ip6tables status | 
| 
# sestatus | 
Install the necessary packages for hadoop
| 
# yum install perl openssh-clients | 
Download and Add the Cloudera repository key
Download the Cloudera repository file and save it in the /etc/yum.repos.d/ directory.
| 
# cd /etc/yum.repos.d/ 
# wget http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/cloudera-cdh5.repo | 
Add the
repository key
| 
# rpm --import http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera | 
Check the available hadoop packages
| 
#yum search hadoop | 
Installing HDFS package
Here we are taking three nodes: nn1, nn2 and dn1. We will start by installing the packages required by the NameNode on nn1, nn2 and dn1. The reason we also install the HDFS packages on a non-NameNode server (dn1) is that a JournalNode will run there.
Install the hadoop-hdfs-namenode package on all three nodes
| 
# yum install hadoop-hdfs-namenode | 
Install the journalnode package on all three nodes
| 
# yum install hadoop-hdfs-journalnode | 
Install the Zookeeper Server package on all three nodes
| 
# yum install zookeeper-server | 
Install the failover controller (ZKFC) only on the two NameNodes, nn1 and nn2
| 
# yum install hadoop-hdfs-zkfc | 
Before configuring the NameNodes we need to make sure that the ZooKeeper cluster is up and running. Here we have three ZooKeeper servers: nn1, nn2 and dn1. We need to add this information to the ZooKeeper configuration file, /etc/zookeeper/conf/zoo.cfg. Enter the following details at the end of the configuration file.
General syntax: server.id=host:peerPort:leaderElectionPort (2888 is the peer communication port and 3888 the leader election port).
| 
# vim /etc/zookeeper/conf/zoo.cfg 
server.1=nn1.hadoop.com:2888:3888 
server.2=nn2.hadoop.com:2888:3888 
server.3=dn1.hadoop.com:2888:3888 | 
Update the zoo.cfg file on all the ZooKeeper nodes.
If you are deploying multiple ZooKeeper servers after a fresh install, you need to create a myid file in the data directory. You can do this by means of the init command's --myid option:
Server1
| 
# service zookeeper-server init --myid=1 | 
Server2
| 
# service zookeeper-server init --myid=2 | 
Server3
| 
# service zookeeper-server init --myid=3 | 
Note: the myid file of server 1 would contain the text "1" and nothing else. The id must be unique within the ensemble and should have a value between 1 and 255.
Start the ZooKeeper server on all the nodes and enable it at boot:
| 
# service zookeeper-server start 
# chkconfig zookeeper-server on | 
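Before moving on, it is worth verifying that each ZooKeeper server is healthy. A minimal check, assuming the default client port 2181 and that nc (netcat) is available, is to send ZooKeeper's built-in ruok four-letter command; a healthy server answers imok:
| 
# echo ruok | nc localhost 2181 
imok | 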
Hadoop Configuration files 
There are three main configuration files for the Hadoop components: core-site.xml, hdfs-site.xml and mapred-site.xml. The core-site.xml file contains configuration options that are common to all servers in the cluster, while hdfs-site.xml and mapred-site.xml provide the configuration options for the HDFS and MapReduce components respectively. The configuration files are located in /etc/hadoop/conf and /etc/default; some of the environment variables have been moved to /etc/default.
NameNode HA Configuration
Hadoop XML file general property syntax:
| 
<property> 
<name>  </name> 
<value>  </value> 
<description>  </description> 
<final>  </final> 
</property> | 
Open the core-site.xml file and enter the following details.
| 
<property> 
<name>fs.default.name</name> 
<value>hdfs://sample-cluster/</value> 
<description>NameNode Cluster Name</description> 
</property> 
<property> 
<name>ha.zookeeper.quorum</name> 
<value>nn1.hadoop.com:2181,nn2.hadoop.com:2181,dn1.hadoop.com:2181</value> 
<description>Specifies the location and port of the ZooKeeper cluster</description> 
</property> | 
Open the hdfs-site.xml file and enter the following details.
| 
<property> 
<name>dfs.name.dir</name> 
<value>/dfs/nn/</value> 
</property> 
<property> 
<name>dfs.nameservices</name> 
<value>sample-cluster</value> 
<description>Logical name of the NameNode cluster</description> 
</property> 
<property> 
<name>dfs.ha.namenodes.sample-cluster</name> 
<value>nn1,nn2</value> 
<description>NameNodes that make up the HA cluster</description> 
</property> 
<!-- rpc-address properties for the HA_NN_Cluster --> 
<property> 
<name>dfs.namenode.rpc-address.sample-cluster.nn1</name> 
<value>nn1.hadoop.com:8020</value> 
<description>nn1 rpc-address</description> 
</property> 
<property> 
<name>dfs.namenode.rpc-address.sample-cluster.nn2</name> 
<value>nn2.hadoop.com:8020</value> 
<description>nn2 rpc-address</description> 
</property> 
<!-- http-address properties for the HA_NN_Cluster --> 
<property> 
<name>dfs.namenode.http-address.sample-cluster.nn1</name> 
<value>nn1.hadoop.com:50070</value> 
<description>nn1 http-address</description> 
</property> 
<property> 
<name>dfs.namenode.http-address.sample-cluster.nn2</name> 
<value>nn2.hadoop.com:50070</value> 
<description>nn2 http-address</description> 
</property> | 
Note: the Standby NameNode uses HTTP calls to periodically copy the fsimage file from the active NameNode, perform the checkpoint operation, and ship it back.
| 
<!--quorum journal properties for the HA_NN_Cluster--> 
<property> 
<name>dfs.namenode.shared.edits.dir</name> 
<value>qjournal://nn1.hadoop.com:8485;nn2.hadoop.com:8485;dn1.hadoop.com:8485/sample-cluster</value> 
<description>Specifies the setup of JournalNode Cluster</description> 
</property> | 
Note: this property specifies the setup of the JournalNode cluster. Both the Active and the Standby NameNode use it to identify the hosts they should contact to send or receive new edit-log changes.
| 
<property> 
<name>dfs.journalnode.edits.dir</name> 
<value>/dfs/journal</value> 
<description>Location on the local file system where edit-log changes will be stored</description> 
</property> 
<!-- failover properties for the HA_NN_Cluster --> 
<property> 
<name>dfs.client.failover.proxy.provider.sample-cluster</name> 
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value> 
<description>The Java class that HDFS clients use to contact the Active NameNode</description> 
</property> 
<property> 
<name>dfs.ha.automatic-failover.enabled</name> 
<value>true</value> 
<description>Indicates if the NameNode cluster will use manual or automatic failover</description> 
</property> 
<property> 
<name>dfs.ha.fencing.methods</name> 
<value>sshfence</value> 
<description>Fencing method used during a failover</description> 
</property> 
<property> 
<name>dfs.ha.fencing.ssh.private-key-files</name> 
<value>/var/lib/hadoop-hdfs/.ssh/id_rsa</value> 
<description>SSH private key used by the sshfence method</description> 
</property> | 
The sshfence method requires passwordless SSH for the hdfs user between the two NameNodes in the Cloudera Hadoop cluster.
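A minimal sketch of one way to set this up, assuming the hdfs user's home directory is /var/lib/hadoop-hdfs (the CDH package default) and that root can copy files between the nodes: generate a key pair for the hdfs user on nn1, append its public key to the hdfs user's authorized_keys on nn2, then repeat the same steps in the opposite direction (nn2 to nn1) and test the connection.
| 
# sudo -u hdfs mkdir -p /var/lib/hadoop-hdfs/.ssh 
# sudo -u hdfs ssh-keygen -t rsa -N "" -f /var/lib/hadoop-hdfs/.ssh/id_rsa 
# scp /var/lib/hadoop-hdfs/.ssh/id_rsa.pub nn2:/tmp/nn1-hdfs.pub 
(on nn2) 
# cat /tmp/nn1-hdfs.pub >> /var/lib/hadoop-hdfs/.ssh/authorized_keys 
# chown -R hdfs:hdfs /var/lib/hadoop-hdfs/.ssh 
# chmod 700 /var/lib/hadoop-hdfs/.ssh; chmod 600 /var/lib/hadoop-hdfs/.ssh/authorized_keys 
(back on nn1) 
# sudo -u hdfs ssh nn2.hadoop.com hostname | 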
Then synchronize these configuration files to the other nodes in the cluster; we can use the rsync command to do this, as in the sketch below.
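For example, a rough sketch of pushing the Hadoop configuration directory from nn1 to the other nodes (hostnames as defined in /etc/hosts earlier; this assumes root SSH access between the nodes):
| 
# for host in nn2 dn1 dn2 dn3; do rsync -av /etc/hadoop/conf/ ${host}:/etc/hadoop/conf/; done | 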
JAVA Configuration for Hadoop cluster
JAVA_HOME must be configured in the /etc/default/bigtop-utils file
| 
export JAVA_HOME=/usr/local/java | 
Initializing the Journal Nodes and Formatting the HA Cluster NameNodes
Now we can start the JournalNode service on all three nodes.
| 
# service hadoop-hdfs-journalnode start | 
Now we need to format HDFS for the first time; to do this, run the following command on the HA node nn1.
| 
# sudo -u hdfs hdfs namenode -format | 
Initializing ZKFC by formatting Zookeeper
The next step is to create an entry for the HA cluster in ZooKeeper, then start the NameNode and ZKFC services on the NameNode we just formatted (nn1):
| 
# sudo -u hdfs hdfs zkfc -formatZK 
# service hadoop-hdfs-namenode start 
# service hadoop-hdfs-zkfc start | 
Activating the passive NameNode – Bootstrap Process
To activate the passive NameNode, an operation called bootstrapping needs to be performed; execute the following command on nn2:
| 
# sudo -u hdfs hdfs namenode -bootstrapStandby | 
Checking the HA_Cluster
To check which NameNode is active and which is standby:
| 
# sudo -u hdfs hdfs haadmin -getServiceState nn1 
# sudo -u hdfs hdfs haadmin -getServiceState nn2 | 
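If the cluster is healthy, one of these commands should report active and the other standby (which node becomes active depends on which ZKFC wins the ZooKeeper election), for example:
| 
# sudo -u hdfs hdfs haadmin -getServiceState nn1 
active 
# sudo -u hdfs hdfs haadmin -getServiceState nn2 
standby | 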
To get the web UI of the active NN
| 
http://nn1.hadoop.com:50070 | 
To get the web UI of the standby NN
| 
http://nn2.hadoop.com:50070 | 
Happy Hadooping!......






 
