Wednesday 25 December 2013

Issue in Bringing up the "System eth0" in Virtual Machines.



There are many cases where, on a Linux machine, the default Ethernet interface "System eth0" (or simply "eth0") will not come up after the operating system is installed. Several different errors can appear, and troubleshooting can easily turn into a long loop. The most common case is virtualization. A simple way to create a virtual machine from an existing system is VM cloning, and this issue can occur with both thin and thick provisioned clones. Let us discuss the issue in detail.
First of all, let's create a fresh VM and see whether this is true.
Let's check whether the network is controlled by "eth0". If we look at the network connections, it is clear that the connection comes in through "System eth0".


Because this VM was installed from scratch, it uses "System eth0" for its network connectivity.
Now let's move to the "/etc/sysconfig/network-scripts/" directory, where we can see the configuration file corresponding to System eth0, i.e. ifcfg-eth0, which is the only one present by default, as seen below.



We have a base template Linux VM named "base" (it can be on any platform such as VMware or Oracle VirtualBox). If we need to add a new node to a cluster in the DC (Data Center), we just clone from the appropriate template VM. Now let's clone the VM and see what happens next.

Here we will repeat the same steps and see what the difference is.



In this cloned machine, instead of System eth0 we have Auto eth1 and Auto eth2. We can see the corresponding configuration files in the network configuration directory as well, as shown below.


However, if you clone a VMware or Oracle VirtualBox VM, you will notice that the cloned machine's network interfaces fail to come up and throw errors. Let's try to bring up eth0 and list the possible errors.

First, confirm that the network service is configured to start at boot.


[root@devhost ~]# chkconfig --list | grep network
network              0:off  1:off  2:on   3:on   4:on   5:on   6:off
[root@devhost ~]#

Error: 1
[root@localhost ~]# ifup eth0
Error: No suitable device found: no device found for connection 'System eth0'.
[root@localhost ~]#


OR

#ifup eth0
Device eth0 does not seem to be present, delaying initialization

When we try to restart the network service, we may get an error like this:

Error: 2


[root@localhost ~]# service network restart
Shutting down loopback interface:                          [  OK  ]
Bringing up loopback interface:                            [  OK  ]
Bringing up interface eth0:  Error: No suitable device found: no device found for connection 'System eth0'.
                                                           [FAILED]
[root@localhost ~]#

Now let's check with network monitoring tools such as mii-tool and ethtool and see the errors they report.

Error: 3

[root@localhost ~]# mii-tool eth0
SIOCGMIIPHY on 'eth0' failed: No such device
[root@localhost ~]#

Error: 4
[root@localhost ~]# ethtool eth0
Settings for eth0:
Cannot get device settings: No such device
Cannot get wake-on-lan settings: No such device
Cannot get message level: No such device
Cannot get link status: No such device
No data available
[root@localhost ~]#


These are the possible errors that come up while bringing up the eth0 (System eth0) interface.

So... how to troubleshoot this issue?


Solution:



We started by cloning a VM from the "base" VM. Cloning copies the entire VM state, including the MAC address (HW address) of the base VM's network interface. Machines with the same MAC address cannot communicate with each other on a network, so to work around this the cloned VM ends up with a new NIC configuration, created by default and named Auto eth1 or Auto eth2. For confirmation, clone one more VM from the base machine and compare the ifcfg-eth0 files in /etc/sysconfig/network-scripts/: the MAC address recorded for eth0 in every clone is still the same as that of the "base" VM. Let's see how to overcome this situation.

What's happening behind the scenes is this: when you clone your VM, VirtualBox or VMware assigns a new MAC address to the network interface, but the Linux configuration files are not updated to mirror this change. The kernel (via udev and the network scripts) cannot find a device matching the existing configuration, which still holds the old MAC address, and at the same time it detects a new interface, with the new MAC address, for which it has no configuration information. The result is that the networking service can only start the loopback interface, and eth0 stays dead.
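For illustration, the udev persistent network rules file on such a clone typically contains entries along these lines (a sketch only; the addresses shown reuse the old and new MACs from this example, and udev normally records them in lowercase):

# cat /etc/udev/rules.d/70-persistent-net.rules
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:0c:29:62:5c:cf", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:0c:29:59:84:be", ATTR{type}=="1", KERNEL=="eth*", NAME="eth1"

The first rule still binds the old MAC to eth0, while the clone's new MAC has been registered as eth1; that is exactly why "System eth0" cannot find its device.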

Fixing this is not rocket science; let's see how to do it.
First, look at the ifcfg-eth0 configuration file:

[root@localhost network-scripts]# cat /etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE="eth0"
BOOTPROTO="dhcp"
NM_CONTROLLED="yes"
ONBOOT="yes"
HWADDR=00:0C:29:62:5C:CF
MTU=1500
TYPE=Ethernet
DEFROUTE=yes
PEERDNS=yes
PEERROUTES=yes
IPV4_FAILURE_FATAL=yes
IPV6INIT=no
NAME="System eth0"
UUID=5fb06bd0-0bb0-7ffb-45f1-d6edd65f3e03
[root@localhost network-scripts]#

Step: 1

Remove the kernel's persistent network interface rules so that they are regenerated. The rules are kept in the /etc/udev/rules.d/ directory, in a file named 70-persistent-net.rules.


# rm -f /etc/udev/rules.d/70-persistent-net.rules

Step: 2
Once we remove the rules related to the network interfaces, the operating system needs to pick up the new MAC/HW address. For this a reboot is necessary.
# reboot


Once the OS is up, log in as root and check the IP configuration:


[root@localhost Desktop]# ifconfig
eth0        Link encap:Ethernet  HWaddr 00:0C:29:59:84:BE 
          inet addr:192.168.40.181  Bcast:192.168.40.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:fe59:84be/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:191 errors:0 dropped:0 overruns:0 frame:0
          TX packets:85 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:29908 (29.2 KiB)  TX bytes:11864 (11.5 KiB)

lo        Link encap:Local Loopback 
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:16 errors:0 dropped:0 overruns:0 frame:0
          TX packets:16 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:960 (960.0 b)  TX bytes:960 (960.0 b)

[root@localhost Desktop]#



Step: 3
Update your interface configuration file. Copy the new HW address from the ifconfig output above and replace the old HWADDR value in the System eth0 configuration file /etc/sysconfig/network-scripts/ifcfg-eth0.

# vim /etc/sysconfig/network-scripts/ifcfg-eth0

Step: 4
Remove the HWADDR/MACADDR entry or update it to the new MAC address of the interface (listed in /etc/udev/rules.d/70-persistent-net.rules, which was regenerated at reboot, or shown by the ifconfig command).

Step: 5
Remove the UUID entry
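After Steps 3 to 5, the file should look roughly like this (a sketch only; the HWADDR value is taken from the ifconfig output above, and your own addresses will differ):

[root@localhost network-scripts]# cat /etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE="eth0"
BOOTPROTO="dhcp"
NM_CONTROLLED="yes"
ONBOOT="yes"
HWADDR=00:0C:29:59:84:BE
MTU=1500
TYPE=Ethernet
DEFROUTE=yes
PEERDNS=yes
PEERROUTES=yes
IPV4_FAILURE_FATAL=yes
IPV6INIT=no
NAME="System eth0"
[root@localhost network-scripts]#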

Step: 6
Save and exit the file.




Now, if we look at the network connections (in the network applet at the top), it is clear that System eth0 is updated and active.

If we try to restart the network service, we now get the expected output.


[root@localhost Desktop]# service network restart
Shutting down loopback interface:                          [  OK  ]
Bringing up loopback interface:                            [  OK  ]
Bringing up interface eth0:  Active connection state: activating
Active connection path: /org/freedesktop/NetworkManager/ActiveConnection/3
state: activated
Connection activated
                                                           [  OK  ]
[root@localhost Desktop]#


If required, we can remove the unwanted interfaces such as Auto eth1 and Auto eth2.

Now re-run the commands that gave unexpected output earlier (ifup, mii-tool, ethtool) and compare the results.


Hence the issue is solved.

Relation Between NM_CONTROLLED And MACADDR/HWADDR



Let's go a little deeper into the configuration details: is there any relation between NM_CONTROLLED in the ifcfg-eth* file and MACADDR/HWADDR? I would say yes.
To address the NIC in the configuration file we use the parameter MACADDR or HWADDR, whose value is the MAC address (hardware address) of that machine's NIC. If we use the MACADDR parameter, then NM_CONTROLLED must be set to "no"; if we use the HWADDR parameter, then NM_CONTROLLED must be set to "yes". The reverse combination will make the same error repeat throughout the troubleshooting.

Scenario: 1

HWADDR=00:0C:29:59:84:BE 
NM_CONTROLLED="yes"


Connection is successful.



Scenario: 2

HWADDR=00:0C:29:59:84:BE 
NM_CONTROLLED="no"

Here the connection is unsuccessful.


Scenario: 3



MACADDR=00:0C:29:59:84:BE 
NM_CONTROLLED="no"

Here the behaviour is inconsistent: sometimes the connection comes up and sometimes it does not.

Scenario: 4



MACADDR=00:0C:29:59:84:BE 
NM_CONTROLLED="yes"


Those experimenting with this issue can try a few more combinations to pin down the exact behaviour.



If you find any errors, please let me know by mail.
Email: anulsasidharan@gmail.com














Wednesday 18 December 2013

MultiNode Hadoop Cluster Setup.

A "Hadoop Cluster" is a type of computational cluster designed particularly for storing and analyzing large amount of data (Which can be of any form like structured, semi-structured or unstructured etc..) in a distributed manner. Hadoop Cluster runs Hadoop's open source distributed processing framework built up on java on low cost commodity computers. Hadoop is designed on a concept called WORM-Write Once Read/Execute Multiple times. Also it is called a "Shared Nothing" architecture. Hadoop Cluster can be massively scalable from a single machine to thousands of machines. But the only thing which is being shared between them is the network that connects them. That is why this is referred as  "Shared Nothing" architecture.
Hadoop Cluster are designed for boosting the speed of data analysis applications. Hadoop cluster are highly scalable. If the existing cluster's processing power exceeds by the growth of data, we can add additional nodes to increase thee throughput of the processing unit. Hadoop clusters also are highly resistant to node failure because each blocks of data is copied onto two more cluster nodes(replication of the data blocks- default 3 and it is scalable), which ensures the data redundancy even with the failure of one node.
Now Let's see how a production Hadoop Cluster can be built. Which is explained in detail bellow.

1.Designing The Cluster Network . 
2.Operating System Selection And Preparation.
3.Configure Nameserver And Adding Hostname To Each Machines.
4.Unix Firewall On Hadoop Cluster.
5.Disable The Fastest Mirror Hunting Process By The Yum.
6.Setting Password Less SSH login In All The Nodes Of Hadoop Cluster
7.Installation Of Java.
8.Hadoop Installation.
9.Hadoop Cluster Configuration.
10.Synchronizing the configuration files between master and the slave nodes.
11.Adding The Dedicated Users To The Hadoop Group.
12.Assigning Dedicated Users For The Daemons In hdfs.
13.Formatting The hdfs File System Via Name-Node.
14.Create Working Directory For Jobtracker And Tasktracker In The Hdfs And Assigning Permission. 
15.Starting The Multi-Node Cluster.
16.Running A Map-Reduce Job On The Cluster. 

Step:1
Designing The Cluster Network
It is important to make careful decisions about the network architecture when deploying a Hadoop cluster. To get the best throughput from the cluster, the nodes should be as physically close together as possible: the lower the network latency, the better the performance.
Another way to reduce background traffic is to create a Virtual Private Network (VPN) dedicated to the cluster. Any application server connected to the cluster should sit in a different subnet, which further reduces background traffic and can also provide the access point to the outside world, such as a cloud environment.
The anticipated latency of the VPN is ~1-2 ms. This means physical proximity becomes less of an issue, but we should confirm this through environment testing.
Network Architecture recommendation:

   Dedicate TOR (Top of Rack) switches to Hadoop
   Use dedicated core switching blades or switches
   Ensure application servers are “close” to Hadoop
   Consider Ethernet bonding (NIC teaming) for increased capacity/bandwidth utilization (optional)



Network Architecture

Step:2
Operating System Selection and Preparation
Why Linux? Hadoop is a framework written in Java, but there is enough native code and enough Linux-isms in its surrounding infrastructure to make Linux the only production-quality option today. Since both are open source, it is strongly recommended to run the Hadoop cluster on a Unix-like base operating system. The choice of distribution may be influenced by administration tools, hardware support, or commercial software support; the best choice is usually to minimize the variables and reduce risk by picking the distribution you are most comfortable with. Preparing the OS for Hadoop requires a number of steps, and repeating them on a large number of machines is both time-consuming and error-prone, so it is strongly advised to use a software configuration management system; Puppet and Chef are two open source tools that fit this requirement. A significant number of production clusters run on Red Hat Enterprise Linux or its freely available sister, CentOS. Ubuntu, SUSE Enterprise Linux, and Debian deployments also exist in production and work perfectly well. We selected Linux as the operating system for our Hadoop cluster. It is best practice to create a custom CentOS image preconfigured with the required software, so that all machines contain the same set of utilities.
The following settings should be applied on OS level:

   # Do not use Logical Volume Management
   # Use a configuration management system (yum, permissions, sudoers, etc.), e.g. Puppet
   # Reduce kernel swappiness (see the example after this list)
   # Revoke access to cluster machines from average users
   # Do NOT use virtualization for clustering
   # At a minimum, the following Linux commands (but not limited to these) are required:
     ln, chmod, chown, chgrp, mount, umount, kill, rm, yum, mkdir, rsync, etc.
     For this it is better to follow the "express text" based installation on all the nodes, instead of the default GUI installation. This avoids installing unwanted packages on the OS.
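As an example of one of these settings, kernel swappiness can be reduced persistently through sysctl (the value 10 below is just a commonly used low value for Hadoop nodes, not a requirement; tune it for your environment):

# echo "vm.swappiness = 10" >> /etc/sysctl.conf
# sysctl -p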
Step:3
Configure Nameserver And Add Hostname To Each Machine
Edit the resolv.conf file in /etc/
# vi /etc/resolv.conf
  search example.com
  nameserver  192.168.1.1
Change the hostname in all the machines
# vi /etc/sysconfig/network
  HOSTNAME=<hostname>.example.com
  GATEWAY=192.168.1.1
Enter the details of all the nodes in the /etc/hosts file, with the IP address, canonical name and alias:
# vi /etc/hosts
  192.168.1.101 n1.xyz.com n1
  192.168.1.102 n2.xyz.com n2
  192.168.1.103 n3.xyz.com n3
  192.168.1.104 n4.xyz.com n4
  192.168.1.105 n5.xyz.com n5
  192.168.1.106 n6.xyz.com n6
This should be entered on all the nodes so that the nodes can communicate with each other using their hostnames.

Step:4
Unix Firewall On Hadoop Cluster
Why the firewall has to be disabled in the Hadoop cluster:
Hadoop is generally referred to as a "shared nothing" architecture; the only thing shared in the Hadoop cluster is the network that interconnects the nodes. A Hadoop cluster runs a number of daemons, and they use a handful of TCP ports. Some of these ports are used by Hadoop's daemons to communicate amongst themselves (to schedule jobs, replicate blocks, etc.). Other ports listen directly to users, either via an interposed Java client, which communicates via internal protocols, or through plain old HTTP. Clusters therefore tend to be "firewall-less" between the various Hadoop nodes, but operate on a separate network with a gateway node between the normal network and the Hadoop cluster.
What are the Linux Firewalls and how to disable them?
SELinux and iptables are the two types of firewall mechanisms on Linux systems. Let's see how to disable them.
SELinux
# vi /etc/selinux/config
  SELINUX=disabled
IP-Tables
iptables comes in two flavours, IPv4 and IPv6. We need to disable both of them.
     For IPV4
# service iptables stop
# chkconfig iptables off
    For IPV6
# service ip6tables stop
# chkconfig ip6tables off
iptables can be disabled at any time, but to disable SELinux we need to restart the machine once. Restart the machine and check the status of the firewalls as follows.
# init 6 OR reboot
After reboot Check the status of SELinux and IP-Tables
To check the status of SELinux:
# /usr/sbin/sestatus
OR
# sestatus
SELinux status:      disabled
To check the status of IP-Tables:
# /etc/init.d/iptables status
iptables: Firewall is not running.
# /etc/init.d/ip6tables status
ip6tables: Firewall is not running.

Step:5
Disable The Fastest Mirror Hunting Process By Yum
The yum package manager always checks for the fastest available mirror before downloading packages. We can disable this mirror hunting and make yum proceed with the configured mirrors. For this we change a value in the yum plug-in configuration file under /etc/yum/pluginconf.d/.
# vi /etc/yum/pluginconf.d/fastestmirror.conf
  enabled=0
Step:6
Setting Up Passwordless SSH Authentication On All The Nodes
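A minimal sketch of one common way to set this up, assuming the openssh-clients package (which provides ssh-copy-id) is installed: generate a key pair on the master node (press Enter to accept an empty passphrase) and copy the public key to every node listed in /etc/hosts.

# ssh-keygen -t rsa
# ssh-copy-id root@n1.xyz.com
# ssh-copy-id root@n2.xyz.com
Repeat ssh-copy-id for the remaining nodes (n3 to n6), then verify that the login no longer asks for a password:
# ssh n2.xyz.com hostname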

Step:7
Installation Of Java.
Download the Java .bin installer, copy it into the /usr/local directory, and run it:
# chmod 755 jdk-6u30-linux-x64.bin
# ./jdk-6u30-linux-x64.bin
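Running the installer extracts the JDK into a versioned directory such as /usr/local/jdk1.6.0_30. To make the JAVA_HOME used below (/usr/local/java) work regardless of the exact version, one option (an assumption of this write-up, not a requirement) is to create a symlink:

# ln -s /usr/local/jdk1.6.0_30 /usr/local/java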
Set JAVA_HOME & JAVA  PATH
 Open  /etc/profile  file and add this two lines at the end of the file.
# vim /etc/profile
  export JAVA_HOME=/usr/local/java
  export PATH=$PATH:/usr/local/java/bin

After sourcing /etc/profile (or opening a new shell), we can check the Java installation as follows:
# echo $JAVA_HOME
/usr/local/java
# echo $PATH
/usr/local/sbin:/usr/sbin:/sbin:/usr/local/bin:/usr/bin:/bin:/root/bin:/usr/local/java/bin:/root/bin
# which java
/usr/local/java/bin/java
#

Step:8
Hadoop Installation
This installation can be done in different ways: manual installation [tarballs, online installation from vendors like Cloudera (using the online yum repository), a local yum repository, or RPM installation] or automatic installation (Cloudera Manager).
Here we are going to do the online installation from Cloudera using the online yum repository. Let's see how to do that.
Step 1: Download the CDH3 Repository or Package.
First of all we need to download and install the CDH3 repository package, which is necessary for the rest of the Cloudera Hadoop installation. Here we are going to install Hadoop on a CentOS cluster.
 # curl -O http://archive.cloudera.com/redhat/6/x86_64/cdh/cdh3-repository-1.0-1.noarch.rpm
Download this RPM to your home directory and install it using the following command:
# yum --nogpgcheck localinstall cdh3-repository-1.0-1.noarch.rpm
Step 2: Install CDH3 on all hosts
Find and install the Hadoop core and native packages. For example:
# yum search hadoop
 This will display all the packages available for this version of Hadoop (here CDH3).
# yum install hadoop-0.20 hadoop-0.20-native
Hence the base packages are installed. This has to be repeated on all the machines. Now we need to install the daemon packages on the appropriate machines.
n1.xyz.com : Namenode
n2.xyz.com : Secondarynamenode
n3.xyz.com : Jobtracker
n4.xyz.com : DataNode, TaskTracker
n5.xyz.com : DataNode, TaskTracker
n6.xyz.com : DataNode, TaskTracker
.
.

Step 3: Install daemon packages on appropriate machines.
 # yum install hadoop-0.20-<daemon type>
where <daemon type> is one of the following:

     namenode
     datanode
     secondarynamenode
     jobtracker
     tasktracker

Hence we have installed the Hadoop packages. To get this cluster running we need to configure it; every parameter in the cluster is configurable. In the following section we will see how to configure the cluster.
Step:9
Hadoop Cluster Configuration.
A Hadoop cluster is a master-slave architecture designed for distributed storage and processing, so for both storage and processing we have masters that control and coordinate their slaves. Let's see the master-slave architecture and the corresponding daemons.
Master-Slave Architecture Of Hadoop Cluster
In the Cloudera Hadoop installation, the packages by default create the minimum required users and groups for the Hadoop cluster. To control the distributed storage a new user called hdfs is created, while for distributed processing the new user is called mapred; in addition a new group called hadoop is created for the cluster. The daemons are user specific, so each daemon must be mapped to the corresponding user: all DFS storage and its working directories should be owned by the hdfs user, and all MapReduce processes and their working directories should be owned by mapred, both mapped to the hadoop group as well. We will see all these steps in detail along the way.
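You can quickly confirm that the package installation created these accounts on each node with the standard id and getent commands (the numeric IDs shown on your systems will of course differ):

# id hdfs
# id mapred
# getent group hadoop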
All the Hadoop cluster configuration files reside in the conf directory of the Hadoop home; the exact location varies. Go to the Hadoop installation directory; for Cloudera it can be found in various locations such as /etc/hadoop, /usr/lib/hadoop, /usr/lib/hadoop-<version>, etc. In any of these locations we can find the configuration directory named conf, and inside it the configuration files. The main configuration files which need to be edited in order to run the Hadoop cluster are the following.
Hadoop Environment configuration file: hadoop-env.sh
A shell script that sets the Hadoop-specific environment variables, such as JAVA_HOME, HADOOP_HEAPSIZE, HADOOP_OPTS, etc.
Hadoop Core configuration file: core-site.xml
This is an XML file. It consists of properties such as the Hadoop temp directory, the default namenode URI for the cluster, etc.
Hadoop Distributed storage configuration file: hdfs-site.xml
This XML file contains the location of the namenode directories in the HDFS filesystem, the datanode data directories, permission settings for users, etc.
Hadoop distributed processing configuration file: mapred-site.xml
This is the XML file for Hadoop's distributed processing framework, MapReduce. It contains the jobtracker and tasktracker working directories for intermediate results, the default jobtracker address, etc.
Let's see the configuration one by one..
hadoop-env.sh
Go to the conf directory of Hadoop, open the hadoop-env.sh file, uncomment the required lines and give the appropriate values.
# cd /usr/lib/hadoop/conf/
# vim hadoop-env.sh
# The java implementation to use. Required.
export JAVA_HOME=/usr/local/java
# The maximum amount of heap to use, in MB. Default is 1000
export HADOOP_HEAPSIZE=2000
Here JAVA_HOME must be the same as in the /etc/profile file.
Hence the basic environment setup is ready. Now log out from all terminals, open a new terminal and check the Hadoop version as shown below. It can be verified either as the hadoop user or as root.
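A quick sanity check (the exact version string printed will depend on the CDH3 release you installed):

# hadoop version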

core-site.xml
Create the tmp directory in a suitable location, give it the proper ownership and permissions, and then map the location in the core-site.xml file. This is the temporary directory where data written to HDFS is first staged.

# vim core-site.xml
<!-- In: conf/core-site.xml -->
<configuration>
    <property>
          <name>hadoop.tmp.dir</name>
          <value>/usr/lib/hadoop/tmp</value>
    </property>
    <property>
          <name>fs.default.name</name>
          <value>hdfs://hostname of NN machine:8020</value>
    </property>
</configuration>

# mkdir /usr/lib/hadoop/tmp
# chmod 775 tmp/
# chown -R hdfs:hadoop tmp/
# ls -ld tmp/
drwxrwxr-x   3 hdfs hadoop    4096 Mar 17 23:07 tmp/

hdfs-site.xml
# vim hdfs-site.xml
<!-- In: conf/hdfs-site.xml -->
<configuration>
<property>
  <name>dfs.permissions</name>
  <value>false</value>
  <final>true</final>
</property>
<property>
   <name>dfs.name.dir</name>
   <value>/dfs/name</value>
   <final>true</final>
</property>
<property>
   <name>dfs.data.dir</name>
   <value>/dfs/data1,/dfs/data2</value>
   <final>true</final>
</property>
<property>
   <name>dfs.datanode.failed.volumes.tolerated</name>
   <value>1</value>
   <final>true</final>
</property>
<property>
   <name>dfs.replication</name>
   <value>3</value>
   <final>true</final>
</property>
<property>
   <name>dfs.block.size</name>
   <value>67108864</value>
   <final>true</final>
</property>
<property>
   <name>dfs.checkpoint.dir</name>
   <value>/dfs/checkpoint</value>
   <final>true</final>
</property>
<property>
   <name>dfs.http.address</name>
   <value>0.0.0.0:50070</value>
   <final>true</final>
</property>
<property>
   <name>dfs.datanode.http.address</name>
   <value>0.0.0.0:50075</value>
   <final>true</final>
</property>
<property>
   <name>dfs.secondary.http.address</name>
   <value>0.0.0.0:50090</value>
   <final>true</final>
</property>
<property>
   <name>dfs.backup.http.address</name>
   <value>0.0.0.0:50105</value>
   <final>true</final>
</property>
</configuration>


Now we will create a new directory, dfs, in the / (root) directory of all nodes, and assign the appropriate permissions. On all nodes, give ownership of the dfs directory to the hdfs user and the hadoop group.
# mkdir -p /dfs
# chmod 775 /dfs/
# chown -R hdfs:hadoop /dfs/
mapred-site.xml
# vim mapred-site.xml
<!-- In: conf/mapred-site.xml -->
<configuration>
<property>
   <name>mapred.job.tracker</name>
   <value>hdfs://hostname of jobtracker machine:8021</value>
   <final>true</final>
</property>
<property>
   <name>mapred.system.dir</name>
   <value>/mapred/system</value>
   <final>true</final>
</property>
<property>
   <name>mapred.local.dir</name>
   <value>/mapred/local</value>
   <final>true</final>
</property>
<property>
   <name>mapred.temp.dir</name>
   <value>/mapred/temp</value>
   <final>true</final>
</property>
<property>
   <name>mapreduce.jobtracker.staging.root.dir</name>
   <value>/user</value>
   <final>true</final>
</property>
<property>
   <name>mapred.job.tracker.http.address</name>
   <value>0.0.0.0:50030</value>
   <final>true</final>
</property>
<property>
   <name>mapred.task.tracker.http.address</name>
   <value>0.0.0.0:50060</value>
   <final>true</final>
</property>
</configuration>

For the tasktracker and jobtracker we need to create the local and system directories in HDFS. This can be done only after formatting the Hadoop namenode (see Step 14).
Step:10
Synchronizing The Configuration Files Between Master And The Slave Nodes.
Now we need to synchronize the configuration files (core-site.xml, hdfs-site.xml and mapred-site.xml) from the master node to all the slave nodes. For a small or medium sized cluster it is reasonable to do this manually, but for a large production cluster we can use tools like Puppet or Chef. The configuration must be identical across the cluster nodes. Here we can use rsync or scp to synchronize the files.

core-site.xml
# rsync -avrt /usr/lib/hadoop/conf/core-site.xml root@IP:/usr/lib/hadoop/conf/core-site.xml
hdfs-site.xml
# rsync -avrt /usr/lib/hadoop/conf/hdfs-site.xml root@IP:/usr/lib/hadoop/conf/hdfs-site.xml
mapred-site.xml
# rsync -avrt /usr/lib/hadoop/conf/mapred-site.xml root@IP:/usr/lib/hadoop/conf/mapred-site.xml
This has to be repeated for every slave node (replace IP with each node's address).

Step:11
Adding The Dedicated Users To The Hadoop Group.
Now we will add the dedicated users (hdfs and mapred) to the hadoop group on all the nodes.
# gpasswd -a hdfs hadoop
  Adding user hdfs to group hadoop
# gpasswd -a mapred hadoop
  Adding user mapred to group hadoop
Step:12
Assigning Dedicated Users For The Daemons In hdfs :
# export HADOOP_NAMENODE_USER=hdfs
# export HADOOP_SECONDARYNAMENODE_USER=hdfs
# export HADOOP_DATANODE_USER=hdfs
# export HADOOP_JOBTRACKER_USER=mapred
# export HADOOP_TASKTRACKER_USER=mapred
Step:13
Formatting The hdfs File System Via Name-Node.
Now move to the bin directory of Hadoop and format the HDFS filesystem via the namenode.

# cd /usr/lib/hadoop/bin/
# hadoop namenode -format
.
.
12/03/21 13:29:17 INFO common.Storage: Storage directory /dfs/name has been successfully formatted.
12/03/21 13:29:17 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at master/192.168.208.217
************************************************************/
[root@master bin]#
Step:14
Create Working Directories For Jobtracker And Tasktracker In HDFS And Assign Permissions
For the tasktracker and jobtracker we need to create the local and system directories in HDFS. This can be done only after formatting the Hadoop namenode.

# su hdfs
$ hadoop fs -mkdir -p /mapred/system
$ hadoop fs -mkdir -p /mapred/local
$ hadoop fs -mkdir -p /mapred/temp
$ hadoop fs -mkdir -p /user

$ hadoop fs -chmod 755 /mapred/system
$ hadoop fs -chmod 755 /mapred/local
$ hadoop fs -chmod 755 /mapred/temp
$ hadoop fs -chmod 755 /user

$ hadoop fs -chown -R mapred:hadoop /mapred/system
$ hadoop fs -chown -R mapred:hadoop /mapred/local
$ hadoop fs -chown -R mapred:hadoop /mapred/temp
$ hadoop fs -chown -R mapred:hadoop /user
Step:15
Starting The Multi-Node Cluster.
Hence we have successfully formatted our Hadoop filesystem. Now we will start the daemons on the appropriate nodes (provided passwordless SSH has been configured).

# cd /usr/lib/hadoop/bin
# ./start-all.sh
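To confirm that the daemons actually came up, one simple check (assuming the JDK's jps tool is on the PATH) is to run jps on every machine; the masters should list NameNode, SecondaryNameNode or JobTracker, and the slaves should list DataNode and TaskTracker:

# jps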
Step:16
Hadoop Daemons Web UI
Each daemon has a corresponding web UI, which we can use to monitor and administer the Hadoop cluster. Let's see them one by one.

Namenode:
http://hostname:50070
SecondaryNamenode:
http://hostname:50090
Datanode:
http://hostname:50075
Jobtracker:
http://hostname:50030
Tasktracker:
http://hostname:50060
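For a quick reachability check you can hit one of these ports with curl from any machine, for example against the namenode UI (n1.xyz.com is the example hostname from the /etc/hosts table above):

# curl -I http://n1.xyz.com:50070/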



And this is all about setting up a production Hadoop cluster. Installation and configuration of the other Hadoop ecosystem components will be explained in upcoming articles. For any query please contact me at: anulsasidharan@gmail.com