Aim:
(i) Set up and install Hadoop in its three operating modes: Standalone, Pseudo-distributed, Fully distributed
(ii) Use web-based tools to monitor your Hadoop setup
Ans) Hadoop is an open-source framework written in Java that is used to store, analyze, and process huge amounts of data in a distributed environment across clusters of computers in an efficient manner. It provides the capability to process distributed data using a simplified programming model. It is used by Google, Facebook, Yahoo, YouTube, Twitter, etc. It was created by Doug Cutting, who joined Yahoo in 2006, and is inspired by the Google File System and Google's MapReduce algorithm. Hadoop provides its own distributed file system, HDFS (Hadoop Distributed File System), to store data across the cluster.
Operational modes of configuring a Hadoop cluster
Hadoop can be run in one of three supported modes:
1) Local (Standalone) mode - By default, Hadoop is configured to run in a single-node, non-distributed mode, as a single Java process. This is useful for debugging. The usage of this mode is very limited, and it is mainly used for experimentation.
2) Pseudo-distributed mode - Hadoop is run on a single node, where each Hadoop daemon (NameNode, DataNode, Secondary NameNode, JobTracker, TaskTracker) runs in a separate Java process, unlike local mode, where Hadoop runs as a single Java process.
3) Fully distributed mode - In this mode, all daemons are executed on separate nodes, forming a multi-node cluster. This setup offers true distributed computing capability with built-in reliability, scalability, and fault tolerance.
Standalone mode
1) Add the Java software information to the repository
$ sudo add-apt-repository ppa:webupd8team/java
2) Update the repository
$ sudo apt-get update
3) Install Java 8
$ sudo apt-get install oracle-java8-installer
4) Verify which Java version is installed
$ java -version
5) Install Hadoop-2.8.1
$ wget http://mirrors.sonic.net/apache/hadoop/common/hadoop-2.8.1/hadoop-2.8.1.tar.gz
(or) $ wget http://www-us.apache.org/dist/hadoop/common/hadoop-2.8.1/hadoop-2.8.1.tar.gz
6) Extract the tar.gz file; a hadoop-2.8.1 folder is created
$ tar -zxvf hadoop-2.8.1.tar.gz
7) Ensure HADOOP_HOME is correctly set in the .bashrc file
export HADOOP_HOME=hadoop-2.8.1
export PATH=$PATH:$HADOOP_HOME/bin
8) Evaluate the .bashrc file
$ source ~/.bashrc
9) Verify that Hadoop is working by issuing the following command
$ hadoop version
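As a quick sanity check of standalone mode, one of the example MapReduce jobs bundled with Hadoop can be run on local files (a minimal sketch, assuming the hadoop-2.8.1 folder from step 6 and that the output directory does not already exist):
$ mkdir input
$ cp hadoop-2.8.1/etc/hadoop/*.xml input (use the bundled config files as sample input)
$ hadoop jar hadoop-2.8.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.1.jar grep input output 'dfs[a-z.]+' (run the bundled example grep job)
$ cat output/* (view the matched strings and their counts)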
Pseudo-distributed mode
1) Configure Hadoop
Hadoop configuration files can be found in $HADOOP_HOME/etc/hadoop. In order to develop Hadoop programs in Java, the location of Java must be set in the hadoop-env.sh file
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_131
2) Several files needed to configure Hadoop, located in $HADOOP_HOME/etc/hadoop, are described below
a) core-site.xml (contains configuration settings that Hadoop uses when started, such as the port number used for the Hadoop instance, memory allocated for the file system, the memory limit for storing data, and the size of read/write buffers. It also specifies where the NameNode runs in the cluster. Note that fs.default.name is deprecated in Hadoop 2.x in favor of fs.defaultFS, though both are accepted.)
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
    <description>Name of the default file system. The URI specifies the hostname and port number of the file system.</description>
  </property>
</configuration>
b) hdfs-site.xml (contains information such as the replication factor for data, the namenode path, and the datanode paths on the local file systems)
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>The actual number of block replications (copies) can be specified when the file is created.</description>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>hadoop-2.8.1/namenodenew</value>
    <description>Directory for the namenode.</description>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>hadoop-2.8.1/datanodenew</value>
    <description>Directory for the datanode.</description>
  </property>
</configuration>
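Before formatting, it is safest to ensure the directories named in dfs.namenode.name.dir and dfs.datanode.data.dir exist and are writable; a minimal sketch, assuming the relative paths above:
$ mkdir -p hadoop-2.8.1/namenodenew hadoop-2.8.1/datanodenew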
3) Format the namenode
$ hdfs namenode -format
4) Start the single-node cluster
$ start-dfs.sh
5) Check whether the Hadoop daemons are running by using jps (Java Virtual Machine Process Status tool)
$ jps
13136 DataNode
13427 SecondaryNameNode
12916 NameNode
13578 Jps
6) Access Hadoop in a browser at http://localhost:50070 (50070 is the default port number for the NameNode web UI)
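When finished, the HDFS daemons started in step 4 can be stopped with the companion script:
$ stop-dfs.sh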
Fully-distributed mode
Steps to be followed for configuring the master and slave nodes
1) Create the same account on all nodes to use for the Hadoop installation
$ sudo useradd cse (cse is the username)
$ sudo passwd cse (enter a password for cse)
2) Edit the /etc/hosts file on all nodes, which maps the IP address of each node to its hostname (example addresses assumed), and add the following lines
192.168.100.22 master
192.168.100.23 slave1
192.168.100.24 slave2
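To confirm the mappings, each node should be able to reach the others by hostname (a quick check, assuming the example addresses above):
$ ping -c 1 slave1
$ ping -c 1 slave2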
3) Install SSH on all nodes
$ sudo apt-get install openssh-server
4) Configure key-based login on all nodes so that they can communicate with each other without prompting for a password
$ su cse (switch to the cse account)
$ ssh-keygen -t rsa -P "" (a public key is generated)
$ ssh-copy-id -i /home/cse/.ssh/id_rsa.pub cse@slave1 (copy the public key from the master to the slave nodes)
$ ssh-copy-id -i /home/cse/.ssh/id_rsa.pub cse@slave2
$ exit
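The key-based login can be verified from the master as the cse user; if the setup succeeded, the commands below print each slave's hostname without prompting for a password:
$ ssh cse@slave1 hostname
$ ssh cse@slave2 hostname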
Installing Hadoop on the master node
5) Create a hadoop folder
$ sudo mkdir /usr/local/hadoop
6) Download Hadoop
$ wget http://mirrors.sonic.net/apache/hadoop/common/hadoop-2.8.1/hadoop-2.8.1.tar.gz
7) Extract the tar.gz file; a hadoop-2.8.1 folder is created
$ tar -zxvf hadoop-2.8.1.tar.gz
8) Move the Hadoop installation folder to the newly created directory
$ sudo mv hadoop-2.8.1 /usr/local/hadoop
9) Make cse the owner of the hadoop folder
$ sudo chown -R cse /usr/local/hadoop/hadoop-2.8.1
Configuring Hadoop on the master node
10) a) core-site.xml (contains configuration settings that Hadoop uses when started, such as the port number used for the Hadoop instance, memory allocated for the file system, the memory limit for storing data, and the size of read/write buffers. It also specifies where the NameNode runs in the cluster.)
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:54310</value>
    <description>Name of the default file system. The URI specifies the hostname and port number of the file system.</description>
  </property>
</configuration>
b) hdfs-site.xml (contains information such as the replication factor for data, the namenode path, and the datanode paths on the local file systems)
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
    <description>The actual number of block replications (copies) can be specified when the file is created.</description>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/usr/local/hadoop/hadoop-2.8.1/namenodenew</value>
    <description>Directory for the namenode.</description>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/usr/local/hadoop/hadoop-2.8.1/datanodenew</value>
    <description>Directory for the datanode.</description>
  </property>
</configuration>
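As in pseudo-distributed mode, it is safest to create the namenode and datanode directories before formatting; since cse owns the folder after step 9, this can be done without sudo (the scp in step 15 then copies them to the slaves):
$ mkdir -p /usr/local/hadoop/hadoop-2.8.1/namenodenew /usr/local/hadoop/hadoop-2.8.1/datanodenew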
11) Ensure HADOOP_HOME is correctly set in the .bashrc file
export HADOOP_HOME=/usr/local/hadoop/hadoop-2.8.1
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
12) Evaluate the ~/.bashrc file
$ source ~/.bashrc
13) Hadoop configuration files can be found in $HADOOP_HOME/etc/hadoop. In order to develop Hadoop programs in Java, the location of Java must be set in the hadoop-env.sh file
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_131
14) Update the remaining configuration files on the master node
$ sudo gedit slaves (in the $HADOOP_HOME/etc/hadoop folder)
slave1
slave2
$ sudo gedit masters
master
15) Transfer the hadoop folder from the master node to the slave nodes
$ scp -r /usr/local/hadoop cse@slave1:/home/cse
$ scp -r /usr/local/hadoop cse@slave2:/home/cse
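Assuming the scp above completed, the copy can be checked from the master; the hadoop-2.8.1 folder should be listed on each slave:
$ ssh cse@slave1 ls /home/cse/hadoop
$ ssh cse@slave2 ls /home/cse/hadoop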
16) Format the namenode on the master node
$ hdfs namenode -format
17) Start the Hadoop cluster on the master node
$ start-dfs.sh
18) Verify the Hadoop daemons on the master and slave nodes using jps
$ jps
19) Access Hadoop in a browser from any node using http://master:50070
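With the slaves and masters files above, jps output roughly like the following is expected (an illustrative sketch; process IDs will differ):
On the master: NameNode, SecondaryNameNode, Jps
On each slave: DataNode, Jps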