A "Hadoop Cluster"
is a type of computational cluster designed particularly for storing and
analyzing large amount of data (Which can
be of any form like structured, semi-structured or unstructured etc..) in a
distributed manner. Hadoop Cluster runs Hadoop's open source distributed
processing framework built up on java on low cost commodity computers. Hadoop
is designed on a concept called WORM-Write
Once Read/Execute Multiple times.
Also it is called a "Shared Nothing"
architecture. Hadoop Cluster can be massively scalable from a single
machine to thousands of machines. But the only thing which is being shared
between them is the network that connects them. That is why this is referred as
"Shared
Nothing" architecture.
Hadoop Cluster are designed for
boosting the speed of data analysis applications. Hadoop cluster are highly
scalable. If the existing cluster's processing power exceeds by the growth of data,
we can add additional nodes to increase thee throughput of the processing unit.
Hadoop clusters also are highly resistant to node failure because each blocks
of data is copied onto two more cluster nodes(replication of the data blocks-
default 3 and it is scalable), which ensures the data redundancy even with the
failure of one node.
Now let's see how a production Hadoop cluster can be built. The steps are explained in detail below.
1. Designing The Cluster Network
2. Operating System Selection And Preparation
3. Configure Nameserver And Adding Hostname To Each Machine
4. Unix Firewall On Hadoop Cluster
5. Disable The Fastest Mirror Hunting Process By Yum
6. Setting Passwordless SSH Login On All The Nodes Of The Hadoop Cluster
7. Installation Of Java
8. Hadoop Installation
9. Hadoop Cluster Configuration
10. Synchronizing The Configuration Files Between Master And The Slave Nodes
11. Adding The Dedicated Users To The Hadoop Group
12. Assigning Dedicated Users For The Daemons In HDFS
13. Formatting The HDFS File System Via The Name-Node
14. Create Working Directories For Jobtracker And Tasktracker In HDFS And Assign Permissions
15. Starting The Multi-Node Cluster
16. Running A Map-Reduce Job On The Cluster
Step:1
Designing The Cluster Network
It is important to make a careful decision about the network architecture when deploying a Hadoop cluster. To get the best throughput from the cluster, the nodes should be physically as close together as possible: the lower the network latency, the better the performance.
Another way to reduce the amount of background traffic is to create a Virtual Private Network (VPN) dedicated to the cluster. Any other application server connected to this cluster should be placed in a different subnet to further reduce background traffic; that subnet, in turn, can provide the access point to the outside world, such as the cloud.
The anticipated latency of the VPN is ~1-2 ms. This makes physical proximity less of an issue, but it should be confirmed through environment testing.
Network architecture recommendations:
Dedicated TOR (Top of Rack) switches for Hadoop
Use dedicated core switching blades or switches
Ensure application servers are "close" to Hadoop
Consider Ethernet bonding (NIC teaming) for increased capacity/bandwidth utilization (optional)
[Figure: Network architecture]
Step:2
Operating System Selection and Preparation
Why Linux? Hadoop is a framework written in Java, but enough native code and Linux-isms exist in its surrounding infrastructure to make Linux the only production-quality option today. Since both are open source, it is strongly recommended to run the Hadoop cluster on a Unix-like base operating system. The choice of operating system may be influenced by administration tools, hardware support, or commercial software support; the best choice is usually to minimize the variables and reduce risk by picking the distribution with which you're most comfortable. Preparing the OS for Hadoop requires a number of steps, and repeating them on a large number of machines is both time-consuming and error-prone. For this reason, it is strongly advised that a software configuration management system be used; Puppet and Chef are two open-source tools that fit this requirement. A significant number of production clusters run on Red Hat Enterprise Linux or its freely available sister, CentOS. Ubuntu, SuSE Enterprise Linux, and Debian deployments also exist in production and work perfectly well. We selected Linux as the operating system for our Hadoop cluster. It is best practice to create a custom CentOS image preconfigured with the required software so that all machines contain the same set of utilities.
The following settings should be applied at the OS level:
# Do not use Logical Volume Management
# Use a configuration management system (yum, permissions, sudoers, etc.) – Puppet
# Reduce kernel swappiness (see the sketch after this list)
# Revoke access to cluster machines from average users
# Do NOT use virtualization for clustering
# At a minimum the following Linux commands (but not limited to these) are required: ln, chmod, chown, chgrp, mount, umount, kill, rm, yum, mkdir, rsync, etc.
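The swappiness change above is usually applied through sysctl; here is a minimal sketch, assuming a CentOS-style /etc/sysctl.conf and that a very low value (0, or 1 on newer kernels) is acceptable for your workload:
# sysctl -w vm.swappiness=0
# echo "vm.swappiness = 0" >> /etc/sysctl.conf
The first command applies the setting immediately; the second persists it across reboots so the kernel is less inclined to swap out the Hadoop daemons' JVM heaps.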
To keep the installation lean, it is better to follow the "express text"-based installation on all the nodes instead of the default GUI installation; this avoids installing unwanted packages on the OS.
Step:3
Configure Nameserver And Adding Hostname To Each Machine
Edit the resolv.conf file in /etc/:
# vi /etc/resolv.conf
search example.com
nameserver 192.168.1.1
Change the hostname on all the machines:
# vi /etc/sysconfig/network
HOSTNAME=<hostname>.example.com
GATEWAY=192.168.1.1
Enter the details of all the nodes in the /etc/hosts file, in the name-server format with the canonical name and an alias:
# vi /etc/hosts
192.168.1.101 n1.xyz.com n1
192.168.1.102 n2.xyz.com n2
192.168.1.103 n3.xyz.com n3
192.168.1.104 n4.xyz.com n4
192.168.1.105 n5.xyz.com n5
192.168.1.106 n6.xyz.com n6
This should be entered on all the nodes so that every node can reach the others by hostname.
Step:4
Unix Firewall On Hadoop Cluster
Why Firewall Has To Be Disabled in the Hadoop Cluster:
Hadoop is generally referred to as a "shared nothing" architecture; the only thing shared in the Hadoop cluster is the network that interconnects the nodes. A Hadoop cluster runs many daemons, and they use a handful of TCP ports. Some of these ports are used by Hadoop's daemons to communicate amongst themselves (to schedule jobs, replicate blocks, etc.). Other ports listen directly to users, either via an interposed Java client, which communicates via internal protocols, or through plain old HTTP. The nodes therefore tend to run firewall-less among themselves, but on a separate network, with a gateway node between the normal network and the Hadoop cluster.
What are the Linux firewalls and how do we disable them?
SELinux and iptables are the two such mechanisms on Linux systems. Let's see how to disable them.
SELinux
# vi /etc/selinux/config
SELINUX=disabled
IP-Tables
There are two variants of iptables, for IPv4 (iptables) and IPv6 (ip6tables). We need to disable both of them.
For IPv4
# service iptables stop
# chkconfig iptables off
For IPv6
# service ip6tables stop
# chkconfig ip6tables off
We can disable iptables at any time; it is easy. But to disable SELinux we need to restart the machine once. Restart it and check the status of the firewalls as follows.
# init 6 OR reboot
After the reboot, check the status of SELinux and iptables.
To check the status of SELinux:
# /usr/sbin/sestatus
OR
# sestatus
SELinux status: disabled
To check the status of iptables:
# /etc/init.d/iptables status
iptables: Firewall is not running.
# /etc/init.d/ip6tables status
ip6tables: Firewall is not running.
Step:5
Disable The Fastest Mirror Hunting Process By Yum
The yum package manager always checks for the fastest available mirror before downloading packages for installation. We can disable this mirror-hunting process and make yum proceed with the available mirrors. For this we need to change a value in the configuration file of the yum fastestmirror plug-in, located in /etc/yum/pluginconf.d/.
# vi /etc/yum/pluginconf.d/fastestmirror.conf
enabled=0
Step:6
Setting Up Passwordless SSH Authentication On All The Nodes
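The original post does not show the commands for this step, so here is a minimal sketch, assuming the cluster is administered as root from the master node, that ssh-copy-id is available, and that the short hostnames from the /etc/hosts example above are used:
# ssh-keygen -t rsa -P ""
# for host in n1 n2 n3 n4 n5 n6; do ssh-copy-id -i ~/.ssh/id_rsa.pub root@$host; done
# ssh n2 hostname
The first command generates a passphrase-less key pair, the loop copies the public key to every node (each node's password is entered once), and the final command verifies that a login now succeeds without a password prompt, which is what the start scripts in Step 15 rely on.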
Step:7
Installation Of Java.
Download and copy the Java bin file into the /usr/local directory.
# chmod 755 jdk-6u30-linux-x64.bin
# ./jdk-6u30-linux-x64.bin
Set JAVA_HOME and the Java PATH: open the /etc/profile file and add these two lines at the end of the file.
# vim /etc/profile
export JAVA_HOME=/usr/local/java
export PATH=$PATH:/usr/local/java/bin
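Note that /etc/profile points JAVA_HOME at /usr/local/java while the JDK unpacks into a versioned directory; a symlink (an assumption here, not shown in the original) keeps that path stable across JDK updates, and re-sourcing the profile applies the change to the current shell:
# ln -s /usr/local/jdk1.6.0_30 /usr/local/java
# source /etc/profile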
We can check the Java installation as follows:
# echo $JAVA_HOME
/usr/local/java
# echo $PATH
/usr/local/sbin:/usr/sbin:/sbin:/usr/local/bin:/usr/bin:/bin:/root/bin:/usr/local/java/bin:/root/bin
# which java
/usr/local/java/bin/java
Step:8
Hadoop Installation
This installation can be done in different ways: either a manual installation [tarballs, online installation from a vendor like Cloudera (using its online yum repository), a local yum repository, or RPM installation] or an automatic installation (Cloudera Manager). Here we are going to do the online installation from Cloudera using the online yum repository. Let's see how to do that.
Step 1: Download the CDH3 repository package.
First of all we need to download and install the CDH3 repository package, which is necessary for the remaining installation of Cloudera Hadoop. Here we are going to install Hadoop on a CentOS cluster.
# curl -O http://archive.cloudera.com/redhat/6/x86_64/cdh/cdh3-repository-1.0-1.noarch.rpm
Download this RPM into your home directory and install it using the following command:
# yum --nogpgcheck localinstall cdh3-repository-1.0-1.noarch.rpm
Step 2: Install CDH3 on all hosts.
Find and install the Hadoop core and native packages. For example:
# yum search hadoop
# yum install hadoop-0.20 hadoop-0.20-native
Now the base is installed; this has to be done on all the machines. Next we need to install the daemons on the appropriate machines:
n1.xyz.com: Namenode
n2.xyz.com: Secondarynamenode
n3.xyz.com: Jobtracker
n4.xyz.com: DataNode, TaskTracker
n5.xyz.com: DataNode, TaskTracker
n6.xyz.com: DataNode, TaskTracker
Step 3: Install the daemon packages on the appropriate machines.
Each daemon has its own package, where <daemon type> is one of: namenode, datanode, secondarynamenode, jobtracker, tasktracker.
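The original post omits the actual install command here. With the CDH3 packaging, the daemon packages follow the hadoop-0.20-<daemon type> naming, so the per-machine installation is expected to look like this:
On n1.xyz.com:
# yum install hadoop-0.20-namenode
On n2.xyz.com:
# yum install hadoop-0.20-secondarynamenode
On n3.xyz.com:
# yum install hadoop-0.20-jobtracker
On n4.xyz.com, n5.xyz.com and n6.xyz.com:
# yum install hadoop-0.20-datanode hadoop-0.20-tasktracker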
Now Hadoop is installed across the cluster. To get the cluster running we need to configure it; every parameter in the cluster is configurable. In the following section we will see how to configure the cluster.
Step:9
Hadoop Cluster Configuration
A Hadoop cluster is a master-slave architecture designed for distributed storage and processing, so for both storage and processing there are masters that control and coordinate their slaves. Let's see the master-slave architecture and the corresponding daemons.
[Figure: Master-slave architecture of a Hadoop cluster]
The Cloudera Hadoop installation by default creates the minimum required users and groups to run the Hadoop cluster. To control the distributed storage a new user called hdfs is created, whereas for distributed processing the newly created user is called mapred. It also creates a new group for the Hadoop cluster called hadoop. The daemons are user-specific, so each daemon should be mapped to the corresponding user: all DFS storage and its working directories should be owned by the hdfs user, and all MapReduce processes and their working directories should be owned by mapred, both of which are in turn mapped to the hadoop group. We will see all these steps in detail along the way.
All the Hadoop cluster configuration files reside in the conf directory of the Hadoop home; the location is arbitrary. Go to the Hadoop installation directory; for Cloudera it can be found in various locations such as /etc/hadoop, /usr/lib/hadoop, /usr/lib/hadoop-<version>, etc. In any of these locations we can find the configuration directory named conf, and inside it the configuration files. The main configuration files which need to be edited in order to run the Hadoop cluster are the following.
Hadoop environment configuration file: hadoop-env.sh
A shell script that includes the Hadoop-specific environment variables, such as JAVA_HOME, HADOOP_HEAPSIZE, HADOOP_OPTS, etc.
Hadoop core configuration file: core-site.xml
An XML file that consists of many properties, such as the Hadoop temp directory, the default namenode in the network, etc.
Hadoop distributed storage configuration file: hdfs-site.xml
This XML file consists of the location of the namenode directory for the HDFS file system, the locations of the data directories on the datanodes, permissions for the users, etc.
Hadoop distributed processing configuration file: mapred-site.xml
This is the XML file for Hadoop's distributed processing, called MapReduce. It consists of the jobtracker and tasktracker temp locations for intermediate results, the default jobtracker machine, etc.
Let's see the configuration files one by one.
hadoop-env.sh
Go to the conf directory of Hadoop, open the hadoop-env.sh file, uncomment the required lines and give the appropriate values. JAVA_HOME here must point to the same JDK as the one set in /etc/profile.
# cd /usr/lib/hadoop/conf/
# vim hadoop-env.sh
# The java implementation to use. Required.
export JAVA_HOME=/usr/local/jdk1.6.0_30/
# The maximum amount of heap to use, in MB. Default is 1000.
export HADOOP_HEAPSIZE=2000
Now the basic environment setup is ready. Log out from all terminals, open a new terminal and check the Hadoop version as follows. It can be verified either as the hadoop user or as the root user.
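A minimal check (the exact version string printed will depend on the CDH3 release that was installed):
# hadoop version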
core-site.xml
Create a tmp directory in some location and give it the proper ownership and permissions, then map that location in the core-site.xml file. This will be the temporary folder in which data headed for HDFS is first saved.
# vim core-site.xml
<!-- In: conf/core-site.xml -->
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/lib/hadoop/tmp</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://hostname of NN machine:8020</value>
</property>
</configuration>
# mkdir /usr/lib/hadoop/tmp
# chmod 775 /usr/lib/hadoop/tmp
# chown -R hdfs:hadoop /usr/lib/hadoop/tmp
# ls -ld /usr/lib/hadoop/tmp
drwxrwxr-x 3 hdfs hadoop 4096 Mar 17 23:07 /usr/lib/hadoop/tmp
hdfs-site.xml
# vim hdfs-site.xml
<!-- In: conf/hdfs-site.xml -->
<configuration>
<property>
<name>dfs.permissions</name>
<value>false</value>
<final>true</final>
</property>
<property>
<name>dfs.name.dir</name>
<value>/dfs/name</value>
<final>true</final>
</property>
<property>
<name>dfs.data.dir</name>
<value>/dfs/data1,/dfs/data2</value>
<final>true</final>
</property>
<property>
<name>dfs.datanode.failed.volumes.tolerated</name>
<value>1</value>
<final>true</final>
</property>
<property>
<name>dfs.replication</name>
<value>3</value>
<final>true</final>
</property>
<property>
<name>dfs.block.size</name>
<value>67108864</value>
<final>true</final>
</property>
<property>
<name>dfs.checkpoint.dir</name>
<value>/dfs/checkpoint</value>
<final>true</final>
</property>
<property>
<name>dfs.http.address</name>
<value>0.0.0.0:50070</value>
<final>true</final>
</property>
<property>
<name>dfs.datanode.http.address</name>
<value>0.0.0.0:50075</value>
<final>true</final>
</property>
<property>
<name>dfs.secondary.http.address</name>
<value>0.0.0.0:50090</value>
<final>true</final>
</property>
<property>
<name>dfs.backup.http.address</name>
<value>0.0.0.0:50105</value>
<final>true</final>
</property>
</configuration>
Now we will create a new directory, dfs, in the / (root) directory of all nodes and assign permissions to it. On all the nodes, give ownership of the dfs directory to the hdfs user and the hadoop group.
# mkdir -p /dfs
# chmod 775 /dfs/
# chown -R hdfs:hadoop /dfs/
mapred-site.xml
# vim mapred-site.xml
<!-- In: conf/mapred-site.xml -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>hostname of jobtracker machine:8021</value>
<final>true</final>
</property>
<property>
<name>mapred.system.dir</name>
<value>/mapred/system</value>
<final>true</final>
</property>
<property>
<name>mapred.local.dir</name>
<value>/mapred/local</value>
<final>true</final>
</property>
<property>
<name>mapred.temp.dir</name>
<value>/mapred/temp</value>
<final>true</final>
</property>
<property>
<name>mapreduce.jobtracker.staging.root.dir</name>
<value>/user</value>
<final>true</final>
</property>
<property>
<name>mapred.job.tracker.http.address</name>
<value>0.0.0.0:50030</value>
<final>true</final>
</property>
<property>
<name>mapred.task.tracker.http.address</name>
<value>0.0.0.0:50060</value>
<final>true</final>
</property>
</configuration>
For the tasktracker and the jobtracker we need to create the local and system directories in HDFS; this can be done only after formatting the Hadoop namenode (see Step 14).
Step:10
Synchronizing The Configuration Files Between Master And The Slave Nodes.
Now we need to synchronize the configuration files (core-site.xml, hdfs-site.xml and mapred-site.xml) from the master node to all the slave nodes. For a small or medium-size cluster it is reasonable to do this manually, but for a large production cluster tools like Puppet or Chef should be used. The configuration must be the same throughout the cluster nodes. Here we can use rsync or scp to synchronize the configuration files.
core-site.xml
# rsync -avrt /usr/lib/hadoop/conf/core-site.xml root@IP:/usr/lib/hadoop/conf/core-site.xml
hdfs-site.xml
# rsync -avrt /usr/lib/hadoop/conf/hdfs-site.xml root@IP:/usr/lib/hadoop/conf/hdfs-site.xml
mapred-site.xml
# rsync -avrt /usr/lib/hadoop/conf/mapred-site.xml root@IP:/usr/lib/hadoop/conf/mapred-site.xml
This has to be repeated for every slave node (or wrapped in a loop, as sketched below).
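A minimal sketch of such a loop, assuming passwordless root SSH from the master (Step 6) and the short hostnames from the /etc/hosts example above:
# for host in n2 n3 n4 n5 n6; do rsync -avrt /usr/lib/hadoop/conf/*-site.xml root@$host:/usr/lib/hadoop/conf/; done
The *-site.xml glob picks up core-site.xml, hdfs-site.xml and mapred-site.xml in one pass.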
Step:11
Adding The Dedicated Users To The Hadoop Group.
Now we will add the dedicated users (hdfs and mapred) to the hadoop group on all the nodes.
# gpasswd -a hdfs hadoop
Adding user hdfs to group hadoop
# gpasswd -a mapred hadoop
Adding user mapred to group hadoop
Step:12
Assigning Dedicated Users For The Daemons In hdfs :
# export HADOOP_NAMENODE_USER=hdfs
# export HADOOP_SECONDARYNAMENODE_USER=hdfs
# export HADOOP_DATANODE_USER=hdfs
# export HADOOP_JOBTRACKER_USER=mapred
# export HADOOP_TASKTRACKER_USER=mapred
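These exports only live in the current shell. One way to make the daemon-to-user mapping survive new sessions and reboots (an assumption, not stated in the original) is to append the same lines to hadoop-env.sh, which the Hadoop scripts source on every start:
# vim /usr/lib/hadoop/conf/hadoop-env.sh
export HADOOP_NAMENODE_USER=hdfs
export HADOOP_SECONDARYNAMENODE_USER=hdfs
export HADOOP_DATANODE_USER=hdfs
export HADOOP_JOBTRACKER_USER=mapred
export HADOOP_TASKTRACKER_USER=mapred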
Step:13
Formatting The hdfs File System Via Name-Node.
Now move to the bin directory of Hadoop; we are about to format our HDFS filesystem via the namenode.
# cd /usr/lib/hadoop/bin/
# hadoop namenode -format
.
.
12/03/21 13:29:17 INFO common.Storage: Storage directory /dfs/name has been successfully formatted.
12/03/21 13:29:17 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at master/192.168.208.217
************************************************************/
[root@master bin]#
Step:14
Create Working Directories For Jobtracker And Tasktracker In HDFS And Assign Permissions
For the tasktracker and the jobtracker we need to create the local and system directories in HDFS; this can be done only after formatting the Hadoop namenode.
# su hdfs
$ hadoop fs -mkdir -p /mapred/system
$ hadoop fs -mkdir -p /mapred/local
$ hadoop fs -mkdir -p /mapred/temp
$ hadoop fs -mkdir -p /user
$ hadoop fs -chmod 755 /mapred/system
$ hadoop fs -chmod 755 /mapred/local
$ hadoop fs -chmod 755 /mapred/temp
$ hadoop fs -chmod 755 /user
$ hadoop fs -chown -R mapred:hadoop /mapred/system
$ hadoop fs -chown -R mapred:hadoop /mapred/local
$ hadoop fs -chown -R mapred:hadoop /mapred/temp
$ hadoop fs -chown -R mapred:hadoop /user
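An optional sanity check, assuming the HDFS daemons are already running (if the namenode has not been started yet, these hadoop fs commands will fail with a connection error and can simply be rerun after Step 15):
$ hadoop fs -ls /
$ hadoop fs -ls /mapred
The listing should show /mapred/system, /mapred/local, /mapred/temp and /user owned by mapred and group hadoop.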
Step:15
Starting The Multi-Node Cluster.
We have now successfully formatted our Hadoop filesystem. Next we will start the daemons on the appropriate nodes (provided passwordless SSH has been configured).
# cd /usr/lib/hadoop/bin
# ./start-all.sh
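To confirm which daemons came up where, one option (not part of the original post) is to run jps on every node and compare against the role table from Step 8; run as root it should list the hdfs and mapred users' Java processes as well:
# jps
On n1.xyz.com this should show NameNode, on n2.xyz.com SecondaryNameNode, on n3.xyz.com JobTracker, and on n4.xyz.com, n5.xyz.com and n6.xyz.com both DataNode and TaskTracker.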
Step:16
Hadoop Daemons Web UI
For each daemon there is a corresponding web UI, which we can use to monitor and administer the Hadoop cluster. Let's see them one by one.
Namenode: http://hostname:50070
SecondaryNamenode: http://hostname:50090
Datanode: http://hostname:50075
Jobtracker: http://hostname:50030
Tasktracker: http://hostname:50060
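The table of contents above also lists running a Map-Reduce job on the cluster as the final step. A minimal smoke test is the stock wordcount example; this is a sketch assuming the examples jar shipped with CDH3's hadoop-0.20 package is available at /usr/lib/hadoop/hadoop-examples.jar (the exact path and file name may differ on your installation):
$ hadoop fs -mkdir /user/hdfs/wordcount-input
$ hadoop fs -put /etc/hosts /user/hdfs/wordcount-input/
$ hadoop jar /usr/lib/hadoop/hadoop-examples.jar wordcount /user/hdfs/wordcount-input /user/hdfs/wordcount-output
$ hadoop fs -cat /user/hdfs/wordcount-output/part-*
The job should appear on the Jobtracker web UI at port 50030 while it runs, and the final command prints the word counts computed from the sample input file.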
And this is all about setting up a production Hadoop cluster. The installation and configuration of the other Hadoop ecosystem components will be explained in upcoming articles. For any query please contact me at: anulsasidharan@gmail.com