
Hadoop Training
Hands On Exercise

1. Getting started:
Step 1: Download and install VMware Player
- Download VMware-player-5.0.1-894247.zip and unzip it on your Windows machine
- Run the .exe and install VMware Player

Step 2: Download and install the VMware image
- Download Hadoop Training - Distribution.zip and unzip it on your Windows machine
- Double-click centos-6.3-x86_64-server.vmx to start the virtual machine

Step 3: Log in and do a quick check
- Once the VM starts, log in with the following credentials:
Username: training
Password: training
- Quickly check that Eclipse and MySQL Workbench are installed



2. Installing Hadoop in pseudo-distributed mode:

Step 1: Run the following command to install Hadoop from the yum
repository in pseudo-distributed mode (Already done for you,
please don't run this command)
sudo yum install hadoop-0.20-conf-pseudo

Step 2: Verify if the packages are installed properly


rpm -ql hadoop-0.20-conf-pseudo

Step 3: Format the namenode

sudo -u hdfs hdfs namenode -format


Step 4: Stop existing services (As Hadoop was already installed for
you, there might be some services running)
$ for service in /etc/init.d/hadoop*
> do
> sudo $service stop
> done

Step 5: Start HDFS
$ for service in /etc/init.d/hadoop-hdfs-*
> do
> sudo $service start
> done



Step 6: Verify if HDFS has started properly (In the browser)
http://localhost:50070

Step 7: Create the /tmp directory


$ sudo -u hdfs hadoop fs -mkdir /tmp
$ sudo -u hdfs hadoop fs -chmod -R 1777 /tmp


Step 8: Create mapreduce specific directories

sudo -u hdfs hadoop fs -mkdir /var
sudo -u hdfs hadoop fs -mkdir /var/lib
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache/mapred
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache/mapred/mapred
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs hadoop fs -chmod 1777 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs hadoop fs -chown -R mapred /var/lib/hadoop-hdfs/cache/mapred

Step 9: Verify the directory structure


$ sudo -u hdfs hadoop fs -ls -R /

Output should be

drwxrwxrwt   - hdfs   supergroup          0 2012-04-19 15:14 /tmp
drwxr-xr-x   - hdfs   supergroup          0 2012-04-19 15:16 /var
drwxr-xr-x   - hdfs   supergroup          0 2012-04-19 15:16 /var/lib
drwxr-xr-x   - hdfs   supergroup          0 2012-04-19 15:16 /var/lib/hadoop-hdfs
drwxr-xr-x   - hdfs   supergroup          0 2012-04-19 15:16 /var/lib/hadoop-hdfs/cache
drwxr-xr-x   - mapred supergroup          0 2012-04-19 15:19 /var/lib/hadoop-hdfs/cache/mapred
drwxr-xr-x   - mapred supergroup          0 2012-04-19 15:29 /var/lib/hadoop-hdfs/cache/mapred/mapred
drwxrwxrwt   - mapred supergroup          0 2012-04-19 15:33 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging


Step 10: Start MapReduce
$ for service in /etc/init.d/hadoop-0.20-mapreduce-*
> do
> sudo $service start
> done

Step 11: Verify if MapReduce has started properly (In Browser)
http://localhost:50030


Step 12: Verify that the installation went well by running a program

Step 12.1: Create a home directory on HDFS for the user



sudo -u hdfs hadoop fs -mkdir /user/training
sudo -u hdfs hadoop fs -chown training /user/training

Step 12.2: Make a directory in HDFS called input and copy some XML files
into it by running the following commands

$ hadoop fs -mkdir input


$ hadoop fs -put /etc/hadoop/conf/*.xml input
$ hadoop fs -ls input
Found 3 items
-rw-r--r-- 1 joe supergroup 1348 2012-02-13 12:21 input/core-site.xml
-rw-r--r-- 1 joe supergroup 1913 2012-02-13 12:21 input/hdfs-site.xml
-rw-r--r-- 1 joe supergroup 1001 2012-02-13 12:21 input/mapred-site.xml

Step 12.3: Run an example Hadoop job to grep with a regular expression in
your input data.

$ /usr/bin/hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar grep input output 'dfs[a-z.]+'

Step 12.4: After the job completes, you can find the output in the HDFS
directory named output because you specified that output directory to
Hadoop.

$ hadoop fs -ls
Found 2 items
drwxr-xr-x   - joe supergroup          0 2009-08-18 18:36 /user/joe/input
drwxr-xr-x   - joe supergroup          0 2009-08-18 18:38 /user/joe/output





Step 12.5: List the output files



$ hadoop fs -ls output

Found 3 items
drwxr-xr-x   - joe supergroup          0 2009-02-25 10:33 /user/joe/output/_logs
-rw-r--r--   1 joe supergroup       1068 2009-02-25 10:33 /user/joe/output/part-00000
-rw-r--r--   1 joe supergroup          0 2009-02-25 10:33 /user/joe/output/_SUCCESS




Step 12.6: Read the output


$ hadoop fs -cat output/part-00000 | head

1  dfs.datanode.data.dir
1  dfs.namenode.checkpoint.dir
1  dfs.namenode.name.dir
1  dfs.replication
1  dfs.safemode.extension
1  dfs.safemode.min.datanodes








3. Accessing HDFS from command line:


This exercise is just to get you familiar with HDFS. Run the following commands:

Command 1: List the files in the /user/training directory
$> hadoop fs -ls

Command 2: List the files in the root directory


$> hadoop fs -ls /

Command 3: Push a file to HDFS

$> hadoop fs -put test.txt /user/training/test.txt
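
Note: test.txt here is just any small local file. If you don't have one in the
home directory yet, create it first and then run the put, for example:

$> echo "this is a test file" > test.txt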






Command 4: View the contents of the file
$> hadoop fs -cat /user/training/test.txt

Command 5: Delete a file


$> hadoop fs -rmr /user/training/test.txt


4. Running the Wordcount Mapreduce job


Step 1: Put the data in the HDFS
hadoop fs -mkdir /user/training/wordcountinput
hadoop fs -put wordcount.txt /user/training/wordcountinput



Step 2: Create a new project in Eclipse called wordcount

1. cp -r /home/training/exercises/wordcount /home/training/workspace/wordcount
2. Open Eclipse->New Project->wordcount->location /home/training/workspace
3. Right click the wordcount project->Properties->Java Build Path->Libraries->Add External Jars->Select all jars from /usr/lib/hadoop and /usr/lib/hadoop-0.20-mapreduce->Ok
4. Make sure that there are no compilation errors






Step 3: Create a jar file

1. Right click the project->Export->Java->JAR file->Select the location as /home/training->Make sure wordcount is checked->Finish


Step 4: Run the jar file
hadoop jar wordcount.jar WordCount wordcountinput wordcountoutput
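
For reference, the WordCount class you just ran follows the standard Hadoop
word-count pattern. The code below is a minimal sketch of what the provided
project typically contains; the version shipped in
/home/training/exercises/wordcount may differ in details.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emit (word, 1) for every word in each input line
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sum the counts for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: args[0] = input directory, args[1] = output directory
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}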

5. Mini Project: Importing MySQL Data Using Sqoop and Querying It Using Hive

5.1 Setting up Sqoop
Step 1: Install Sqoop (Already done for you, please don't run
this command)

$> sudo yum install sqoop



Step 2: View list of databases

$> sqoop list-databases \
--connect jdbc:mysql://localhost/training_db \
--username root --password root

Step 3: View list of tables




$> sqoop list-tables \

--connect jdbc:mysql://localhost/training_db \

--username root --password root





Step 4: Import data to HDFS



$> sqoop import \

--connect jdbc:mysql://localhost/training_db \
--table user_log --fields-terminated-by '\t' \
-m 1 --username root --password root
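
Optionally, check what Sqoop wrote to HDFS. By default the import lands under
your home directory in a directory named after the table; this is the same path
the Hive LOAD DATA step uses later.

$> hadoop fs -ls user_log
$> hadoop fs -cat user_log/part-m-00000 | head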

5.2 Setting up Hive


Step 1: Install Hive


$> sudo yum install hive (Already done for you, don't run this command)
$> sudo -u hdfs hadoop fs -mkdir /user/hive/warehouse
$> hadoop fs -chmod g+w /tmp
$> sudo -u hdfs hadoop fs -chmod g+w /user/hive/warehouse
$> sudo -u hdfs hadoop fs -chown -R training /user/hive/warehouse
$> sudo chmod 777 /var/lib/hive/metastore
$> hive
hive> show tables;



Step 2: Create table

hive> CREATE TABLE user_log (country STRING, ip_address STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      STORED AS TEXTFILE;

Step 3: Load Data




hive> LOAD DATA INPATH "/user/training/user_log/part-m-00000" INTO TABLE user_log;



Step 4: Run the query



hive> select country, count(1) from user_log group by country;

6. Setting up Flume
Step 1: Install Flume
$> sudo yum install flume-ng (Already done for you, please don't run this command)
$> sudo -u hdfs hadoop fs -chmod 1777 /user/training

Step 2: Copy the configuration file


$> sudo cp /home/training/exercises/flume-config/flume.conf /usr/lib/flume-ng/conf
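
The exact configuration is supplied in the exercises folder. For reference, a
Flume NG configuration for this exercise typically looks something like the
sketch below; the agent name matches the --name agent used in Step 3, while the
component names, spooling directory and HDFS path are assumptions based on
Steps 4 and 5.

# Components of the agent named "agent"
agent.sources = src1
agent.channels = ch1
agent.sinks = sink1

# Source: pick up files dropped into /home/training (assumed spooling directory)
agent.sources.src1.type = spooldir
agent.sources.src1.spoolDir = /home/training
agent.sources.src1.channels = ch1

# Channel: buffer events in memory
agent.channels.ch1.type = memory
agent.channels.ch1.capacity = 10000

# Sink: write events into HDFS under /user/training/logs
agent.sinks.sink1.type = hdfs
agent.sinks.sink1.hdfs.path = /user/training/logs
agent.sinks.sink1.hdfs.fileType = DataStream
agent.sinks.sink1.channel = ch1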


Step 3: Start the flume agent
$> flume-ng agent --conf-file /usr/lib/flume-ng/conf/flume.conf --name agent -Dflume.root.logger=INFO,console


Step 4: Push the file in a different terminal
$> sudo cp /home/training/exercises/log.txt /home/training


Step 5: View the output
$> hadoop fs -ls logs


7. Setting up a multi node cluster
Step 1: To convert from pseudo-distributed mode to fully distributed
mode, the first step is to stop the existing services (To be done on all
nodes)
$> for service in /etc/init.d/hadoop*
> do
> sudo $service stop
> done

Step 2: Create a new set of blank configuration files. The conf.empty
directory contains blank files, so we will copy those to a new
directory (To be done on all nodes)

$> sudo cp -r /etc/hadoop/conf.empty \
> /etc/hadoop/conf.class

Step 3: Point the Hadoop configuration to the new configuration (To be
done on all nodes)

$> sudo /usr/sbin/alternatives --install \
> /etc/hadoop/conf hadoop-conf \
> /etc/hadoop/conf.class 99


Step 4: Verify Alternatives (To be done on all nodes)

$> /usr/sbin/update-alternatives \
> --display hadoop-conf

Step 5: Setting up the hosts (To be done on all nodes)


Step 5.1: Find the IP address of your machine



$> /sbin/ifconfig

Step 5.2: List all the IP addresses that will belong to your cluster
and decide a name for each one. In our example, let's say we are
setting up a 3-node cluster, so we fetch the IP address of each node
and name it namenode or datanode<n>.
Update the /etc/hosts file with the IP addresses as shown, so the
/etc/hosts file on each node looks something like this

192.168.1.12 namenode
192.168.1.21 datanode1
192.168.1.22 datanode2


Step 5.3: Update the /etc/sysconfig/network file with the hostname

Open /etc/sysconfig/network on your machine and make sure that
HOSTNAME is set to your node's name, i.e. namenode or datanode<n>.
For example, if this machine is going to be datanode1
(192.168.1.21), the entry should be
HOSTNAME=datanode1


Step 5.4: Restart your machine and try pinging the other machines

$> ping namenode



Step 6: Changing configuration files (To be done on all nodes)
The format for adding a configuration parameter is
<property>
<name>property_name</name>
<value>property_value</value>
</property>

Add the following configuration properties to the following files:

Name                                 Value

Filename: /etc/hadoop/conf.class/core-site.xml
fs.default.name                      hdfs://<namenode>:8020

Filename: /etc/hadoop/conf.class/hdfs-site.xml
dfs.name.dir                         /home/disk1/dfs/nn,/home/disk2/dfs/nn
dfs.data.dir                         /home/disk1/dfs/dn,/home/disk2/dfs/dn
dfs.http.address                     namenode:50070

Filename: /etc/hadoop/conf.class/mapred-site.xml
mapred.local.dir                     /home/disk1/mapred/local,/home/disk2/mapred/local
mapred.job.tracker                   namenode:8021
mapred.jobtracker.staging.root.dir   /user
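
Putting the property format and the table together, core-site.xml on each node
would, for example, look like this (with <namenode> replaced by the hostname
chosen in Step 5):

<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode:8020</value>
  </property>
</configuration>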


Step 7: Create necessary directories (To be done on all nodes)
$> sudo mkdir -p /home/disk1/dfs/nn
$> sudo mkdir -p /home/disk2/dfs/nn
$> sudo mkdir -p /home/disk1/dfs/dn
$> sudo mkdir -p /home/disk2/dfs/dn
$> sudo mkdir -p /home/disk1/mapred/local
$> sudo mkdir -p /home/disk2/mapred/local



Step 8: Manage permissions (To be done on all nodes)

$> sudo chown -R hdfs:hadoop /home/disk1/dfs/nn
$> sudo chown -R hdfs:hadoop /home/disk2/dfs/nn
$> sudo chown -R hdfs:hadoop /home/disk1/dfs/dn
$> sudo chown -R hdfs:hadoop /home/disk2/dfs/dn
$> sudo chown -R mapred:hadoop /home/disk1/mapred/local
$> sudo chown -R mapred:hadoop /home/disk2/mapred/local



Step 9: Reduce Hadoop Heapsize (To be done on all nodes)


$> export HADOOP_HEAPSIZE=200




Step 10: Format the namenode (Only on Namenode)
$> sudo -u hdfs hadoop namenode -format

Step 11: Start HDFS processes

On Namenode
$> sudo /etc/init.d/hadoop-hdfs-namenode start
$> sudo /etc/init.d/hadoop-hdfs-secondarynamenode start

On Datanode
$> sudo /etc/init.d/hadoop-hdfs-datanode start

Step 12: Create directories in HDFS (Only one member should do this)
$> sudo -u hdfs hadoop fs -mkdir /user/training
$> sudo -u hdfs hadoop fs -chown training /user/training

Step 13: Create directories for mapreduce (Only one member should do this)
$> sudo -u hdfs hadoop fs -mkdir /mapred/system
$> sudo -u hdfs hadoop fs -chown mapred:hadoop \
> /mapred/system

Step 14: Start the Mapreduce process


On Namenode
$> sudo /etc/init.d/hadoop-0.20-mapreduce-jobtracker start

On Slave node
$> sudo /etc/init.d/hadoop-0.20-mapreduce-tasktracker start



Step 15: Verify the cluster
Visit http://namenode:50070 and check the number of live nodes
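
You can also check from the command line that all datanodes have registered
with the namenode:

$> sudo -u hdfs hadoop dfsadmin -report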
