
Hadoop Training
Hands On Exercise

1. Getting started:
Step 1: Download and install VMware Player
- Download VMware-player-5.0.1-894247.zip and unzip it on your Windows machine
- Run the .exe and install VMware Player

Step 2: Download and install the VMware image
- Download Hadoop Training - Distribution.zip and unzip it on your Windows machine
- Double-click centos-6.3-x86_64-server.vmx to start the virtual machine

Step 3: Log in and do a quick check
- Once the VM starts, log in with the following credentials:
Username: training
Password: training
- Quickly check that Eclipse and MySQL Workbench are installed



2. Installing Hadoop in pseudo-distributed mode:

Step 1: Run the following command to install Hadoop from the yum
repository in pseudo-distributed mode (Already done for you,
please don't run this command)
sudo yum install hadoop-0.20-conf-pseudo

Step 2: Verify if the packages are installed properly


rpm -ql hadoop-0.20-conf-pseudo

Step 3: Format the namenode

sudo -u hdfs hdfs namenode -format


Step 4: Stop existing services (As Hadoop was already installed for
you, there might be some services running)
$ for service in /etc/init.d/hadoop*
> do
> sudo $service stop
> done

Step 5: Start HDFS
$ for service in /etc/init.d/hadoop-hdfs-*
> do
> sudo $service start
> done



Step 6: Verify if HDFS has started properly (In the browser)
http://localhost:50070

Step 7: Create the /tmp directory


$ sudo -u hdfs hadoop fs -mkdir /tmp
$ sudo -u hdfs hadoop fs -chmod -R 1777 /tmp


Step 8: Create mapreduce specific directories

sudo -u hdfs hadoop fs -mkdir /var
sudo -u hdfs hadoop fs -mkdir /var/lib
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache/mapred
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache/mapred/mapred
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs hadoop fs -chmod 1777 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs hadoop fs -chown -R mapred /var/lib/hadoop-hdfs/cache/mapred

Step 9: Verify the directory structure


$ sudo -u hdfs hadoop fs -ls -R /

Output should be

drwxrwxrwt   - hdfs   supergroup          0 2012-04-19 15:14 /tmp
drwxr-xr-x   - hdfs   supergroup          0 2012-04-19 15:16 /var
drwxr-xr-x   - hdfs   supergroup          0 2012-04-19 15:16 /var/lib
drwxr-xr-x   - hdfs   supergroup          0 2012-04-19 15:16 /var/lib/hadoop-hdfs
drwxr-xr-x   - hdfs   supergroup          0 2012-04-19 15:16 /var/lib/hadoop-hdfs/cache
drwxr-xr-x   - mapred supergroup          0 2012-04-19 15:19 /var/lib/hadoop-hdfs/cache/mapred
drwxr-xr-x   - mapred supergroup          0 2012-04-19 15:29 /var/lib/hadoop-hdfs/cache/mapred/mapred
drwxrwxrwt   - mapred supergroup          0 2012-04-19 15:33 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging


Step 10: Start MapReduce
$ for service in /etc/init.d/hadoop-0.20-mapreduce-*
> do
> sudo $service start
> done

Step 11: Verify if MapReduce has started properly (In Browser)
http://localhost:50030


Step 12: Verify that the installation went well by running a program

Step 12.1: Create a home directory on HDFS for the user



sudo -u hdfs hadoop fs -mkdir /user/training
sudo -u hdfs hadoop fs -chown training /user/training

Step 12.2: Make a directory in HDFS called input and copy some XML files
into it by running the following commands

$ hadoop fs -mkdir input


$ hadoop fs -put /etc/hadoop/conf/*.xml input
$ hadoop fs -ls input
Found 3 items
-rw-r--r-- 1 joe supergroup 1348 2012-02-13 12:21 input/core-site.xml
-rw-r--r-- 1 joe supergroup 1913 2012-02-13 12:21 input/hdfs-site.xml
-rw-r--r-- 1 joe supergroup 1001 2012-02-13 12:21 input/mapred-site.xml

Step 12.3: Run an example Hadoop job to grep with a regular expression in
your input data.

$ /usr/bin/hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar grep input output 'dfs[a-z.]+'

Step 12.4: After the job completes, you can find the output in the HDFS
directory named output because you specified that output directory to
Hadoop.

$ hadoop fs -ls
Found 2 items
drwxr-xr-x   - joe supergroup          0 2009-08-18 18:36 /user/joe/input
drwxr-xr-x   - joe supergroup          0 2009-08-18 18:38 /user/joe/output





Step 12.5: List the output files



$ hadoop fs -ls output

Found 3 items
drwxr-xr-x   - joe supergroup          0 2009-02-25 10:33 /user/joe/output/_logs
-rw-r--r--   1 joe supergroup       1068 2009-02-25 10:33 /user/joe/output/part-00000
-rw-r--r--   1 joe supergroup          0 2009-02-25 10:33 /user/joe/output/_SUCCESS




Step 12.6: Read the output


$ hadoop fs -cat output/part-00000 | head

1  dfs.datanode.data.dir
1  dfs.namenode.checkpoint.dir
1  dfs.namenode.name.dir
1  dfs.replication
1  dfs.safemode.extension
1  dfs.safemode.min.datanodes








3. Accessing HDFS from command line:


This exercise is just to get you familiar with HDFS. Run the following commands:

Command 1: List the files in the /user/training directory
$> hadoop fs -ls

Command 2: List the files in the root directory


$> hadoop fs -ls /

Command 3: Push a file to HDFS

$> hadoop fs -put test.txt /user/training/test.txt
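
Note: test.txt here is just any small local file. If you don't have one in the
home directory yet, create it first and then run the put, for example:

$> echo "this is a test file" > test.txt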






Command 4: View the contents of the file
$> hadoop fs -cat /user/training/test.txt

Command 5: Delete a file


$> hadoop fs -rmr /user/training/test.txt


4. Running the Wordcount Mapreduce job


Step 1: Put the data in the HDFS
hadoop fs -mkdir /user/training/wordcountinput
hadoop fs -put wordcount.txt /user/training/wordcountinput



Step 2: Create a new project in Eclipse called wordcount

1. cp -r /home/training/exercises/wordcount /home/training/workspace/wordcount
2. Open Eclipse->New Project->wordcount->location /home/training/workspace
3. Right click the wordcount project->Properties->Java Build Path->Libraries->Add External Jars->Select all jars from /usr/lib/hadoop and /usr/lib/hadoop-0.20-mapreduce->Ok
4. Make sure that there are no compilation errors






Step 3: Create a jar file

1. Right click the project->Export->Java->JAR file->Select the location as /home/training->Make sure wordcount is checked->Finish


Step 4: Run the jar file
hadoop jar wordcount.jar WordCount wordcountinput wordcountoutput
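
For reference, the WordCount class you just ran follows the standard Hadoop
word-count pattern. The code below is a minimal sketch of what the provided
project typically contains; the version shipped in
/home/training/exercises/wordcount may differ in details.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emit (word, 1) for every word in each input line
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sum the counts for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: args[0] = input directory, args[1] = output directory
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}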

5. Mini Project: Importing MySQL Data Using Sqoop and Querying It Using Hive

5.1 Setting up Sqoop
Step 1: Install Sqoop (Already done for you, please don't run
this command)

$> sudo yum install sqoop



Step 2: View list of databases

$> sqoop list-databases \
--connect jdbc:mysql://localhost/training_db \
--username root --password root

Step 3: View list of tables




$> sqoop list-tables \

--connect jdbc:mysql://localhost/training_db \

--username root --password root





Step 4: Import data to HDFS



$> sqoop import \

--connect jdbc:mysql://localhost/training_db \
--table user_log --fields-terminated-by '\t' \
-m 1 --username root --password root
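
Optionally, check what Sqoop wrote to HDFS. By default the import lands under
your home directory in a directory named after the table; this is the same path
the Hive LOAD DATA step uses later.

$> hadoop fs -ls user_log
$> hadoop fs -cat user_log/part-m-00000 | head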

5.2 Setting up Hive


Step 1: Install Hive


$> sudo yum install hive (Already done for you, don't run this command)
$> sudo -u hdfs hadoop fs -mkdir /user/hive/warehouse
$> hadoop fs -chmod g+w /tmp
$> sudo -u hdfs hadoop fs -chmod g+w /user/hive/warehouse
$> sudo -u hdfs hadoop fs -chown -R training /user/hive/warehouse
$> sudo chmod 777 /var/lib/hive/metastore
$> hive
hive> show tables;



Step 2: Create table

hive> CREATE TABLE user_log (country STRING, ip_address STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      STORED AS TEXTFILE;

Step 3: Load Data




hive> LOAD DATA INPATH "/user/training/user_log/part-m-00000" INTO TABLE user_log;



Step 4: Run the query



hive> select country, count(1) from user_log group by country;

6. Setting up Flume
Step 1: Install Flume
$> sudo yum install flume-ng (Already done for you, please don't run this command)
$> sudo -u hdfs hadoop fs -chmod 1777 /user/training

Step 2: Copy the configuration file


$> sudo cp /home/training/exercises/flume-config/flume.conf /usr/lib/flume-ng/conf
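
The exact configuration is supplied in the exercises folder. For reference, a
Flume NG configuration for this exercise typically looks something like the
sketch below; the agent name matches the --name agent used in Step 3, while the
component names, spooling directory and HDFS path are assumptions based on
Steps 4 and 5.

# Components of the agent named "agent"
agent.sources = src1
agent.channels = ch1
agent.sinks = sink1

# Source: pick up files dropped into /home/training (assumed spooling directory)
agent.sources.src1.type = spooldir
agent.sources.src1.spoolDir = /home/training
agent.sources.src1.channels = ch1

# Channel: buffer events in memory
agent.channels.ch1.type = memory
agent.channels.ch1.capacity = 10000

# Sink: write events into HDFS under /user/training/logs
agent.sinks.sink1.type = hdfs
agent.sinks.sink1.hdfs.path = /user/training/logs
agent.sinks.sink1.hdfs.fileType = DataStream
agent.sinks.sink1.channel = ch1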


Step 3: Start the flume agent
$> flume-ng agent --conf-file /usr/lib/flume-ng/conf/flume.conf --name agent -Dflume.root.logger=INFO,console


Step 4: Push the file in a different terminal
$> sudo cp /home/training/exercises/log.txt /home/training


Step 5: View the output
$> hadoop fs -ls logs


7. Setting up a multi node cluster
Step 1: To convert from pseudo-distributed mode to fully distributed
mode, the first step is to stop the existing services (To be done on all
nodes)
$> for service in /etc/init.d/hadoop*
> do
> sudo $service stop
> done

Step 2: Create a new set of blank configuration files. The conf.empty
directory contains blank files, so we will copy those to a new
directory (To be done on all nodes)

$> sudo cp -r /etc/hadoop/conf.empty \
> /etc/hadoop/conf.class

Step 3: Point the Hadoop configuration to the new configuration (To be
done on all nodes)

$> sudo /usr/sbin/alternatives --install \
> /etc/hadoop/conf hadoop-conf \
> /etc/hadoop/conf.class 99


Step 4: Verify Alternatives (To be done on all nodes)

$> /usr/sbin/update-alternatives \
> --display hadoop-conf

Step 5: Setting up the hosts (To be done on all nodes)


Step 5.1: Find the IP address of your machine



$> /sbin/ifconfig

Step 5.2: List all the IP addresses that will belong to your cluster
and decide a name for each one. In our example, let's say we are
setting up a 3-node cluster, so we fetch the IP address of each node
and name it namenode or datanode<n>.
Update the /etc/hosts file with the IP addresses as shown, so the
/etc/hosts file on each node looks something like this

192.168.1.12 namenode
192.168.1.21 datanode1
192.168.1.22 datanode2


Step 5.3: Update the /etc/sysconfig/network file with the hostname

Open /etc/sysconfig/network on your machine and make sure that
HOSTNAME is set to your node's name, i.e. namenode or datanode<n>.
For example, if this machine is going to be datanode1
(192.168.1.21), the entry should be
HOSTNAME=datanode1


Step 5.4: Restart your machine and try pinging the other machines

$> ping namenode



Step 6: Changing configuration files (To be done on all nodes)
The format for adding a configuration parameter is
<property>
<name>property_name</name>
<value>property_value</value>
</property>

Add the following configuration properties to the following files:

Name                                 Value

Filename: /etc/hadoop/conf.class/core-site.xml
fs.default.name                      hdfs://<namenode>:8020

Filename: /etc/hadoop/conf.class/hdfs-site.xml
dfs.name.dir                         /home/disk1/dfs/nn,/home/disk2/dfs/nn
dfs.data.dir                         /home/disk1/dfs/dn,/home/disk2/dfs/dn
dfs.http.address                     namenode:50070

Filename: /etc/hadoop/conf.class/mapred-site.xml
mapred.local.dir                     /home/disk1/mapred/local,/home/disk2/mapred/local
mapred.job.tracker                   namenode:8021
mapred.jobtracker.staging.root.dir   /user
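
Putting the property format and the table together, core-site.xml on each node
would, for example, look like this (with <namenode> replaced by the hostname
chosen in Step 5):

<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode:8020</value>
  </property>
</configuration>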


Step 7: Create necessary directories (To be done on all nodes)
$> sudo mkdir -p /home/disk1/dfs/nn
$> sudo mkdir -p /home/disk2/dfs/nn
$> sudo mkdir -p /home/disk1/dfs/dn
$> sudo mkdir -p /home/disk2/dfs/dn
$> sudo mkdir -p /home/disk1/mapred/local
$> sudo mkdir -p /home/disk2/mapred/local



Step 8: Manage permissions (To be done on all nodes)

$> sudo chown -R hdfs:hadoop /home/disk1/dfs/nn
$> sudo chown -R hdfs:hadoop /home/disk2/dfs/nn
$> sudo chown -R hdfs:hadoop /home/disk1/dfs/dn
$> sudo chown -R hdfs:hadoop /home/disk2/dfs/dn
$> sudo chown -R mapred:hadoop /home/disk1/mapred/local
$> sudo chown -R mapred:hadoop /home/disk2/mapred/local



Step 9: Reduce Hadoop Heapsize (To be done on all nodes)


$> export HADOOP_HEAPSIZE=200




Step 10: Format the namenode (Only on Namenode)
$> sudo -u hdfs hadoop namenode -format

Step 11: Start HDFS processes

On Namenode
$> sudo /etc/init.d/hadoop-hdfs-namenode start
$> sudo /etc/init.d/hadoop-hdfs-secondarynamenode start

On Datanode
$> sudo /etc/init.d/hadoop-hdfs-datanode start

Step 12: Create directories in HDFS (Only one member should do this)
$> sudo -u hdfs hadoop fs -mkdir /user/training
$> sudo -u hdfs hadoop fs -chown training /user/training

Step 13: Create directories for mapreduce (Only one member should do this)
$> sudo -u hdfs hadoop fs -mkdir /mapred/system
$> sudo -u hdfs hadoop fs -chown mapred:hadoop \
> /mapred/system

Step 14: Start the Mapreduce process


On Namenode
$> sudo /etc/init.d/hadoop-0.20-mapreduce-jobtracker start

On Slave node
$> sudo /etc/init.d/hadoop-0.20-mapreduce-tasktracker start



Step 15: Verify the cluster
Visit http://namenode:50070 and check the number of live nodes
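
You can also check from the command line that all datanodes have registered
with the namenode:

$> sudo -u hdfs hadoop dfsadmin -report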
