Training Hands-On Exercise
1. Getting started

Step 1: Download and install the VMware Player
- Download VMware-player-5.0.1-894247.zip and unzip it on your Windows machine
- Run the .exe to install VMware Player
Step 2: Download and install the VMware image
- Download Hadoop Training - Distribution.zip and unzip it on your Windows machine
- Click on centos-6.3-x86_64-server.vmx to start the virtual machine
Step 3: Log in and do a quick check
- Once the VM starts, use the following credentials:
  Username: training
  Password: training
- Quickly check that Eclipse and MySQL Workbench are installed
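One quick way to check is from a terminal; the binary names below are assumptions and may differ on your image:
$ which eclipse
$ which mysql-workbench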
Step 4: Stop existing services
(As Hadoop was already installed for you, there might be some services running)
$ for service in /etc/init.d/hadoop*
> do
>   sudo $service stop
> done
Step 5: Start HDFS
$ for service in /etc/init.d/hadoop-hdfs-*
> do
>   sudo $service start
> done
Step 6: Verify that HDFS has started properly (in the browser)
http://localhost:50070
Step 8: Create MapReduce-specific directories
sudo -u hdfs hadoop fs -mkdir /var
sudo -u hdfs hadoop fs -mkdir /var/lib
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache/mapred
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache/mapred/mapred
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs hadoop fs -chmod 1777 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs hadoop fs -chown -R mapred /var/lib/hadoop-hdfs/cache/mapred
Output should be:
drwxrwxrwt   - hdfs   supergroup  0 2012-04-19 15:14 /tmp
drwxr-xr-x   - hdfs   supergroup  0 2012-04-19 15:16 /var
drwxr-xr-x   - hdfs   supergroup  0 2012-04-19 15:16 /var/lib
drwxr-xr-x   - hdfs   supergroup  0 2012-04-19 15:16 /var/lib/hadoop-hdfs
drwxr-xr-x   - hdfs   supergroup  0 2012-04-19 15:16 /var/lib/hadoop-hdfs/cache
drwxr-xr-x   - mapred supergroup  0 2012-04-19 15:19 /var/lib/hadoop-hdfs/cache/mapred
drwxr-xr-x   - mapred supergroup  0 2012-04-19 15:29 /var/lib/hadoop-hdfs/cache/mapred/mapred
drwxrwxrwt   - mapred supergroup  0 2012-04-19 15:33 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
Step 10: Start MapReduce
$ for service in /etc/init.d/hadoop-0.20-mapreduce-*
> do
>   sudo $service start
> done
Step 11: Verify that MapReduce has started properly (in the browser)
http://localhost:50030
Step 12: Verify that the installation went well by running a program
Step 12.2: Make a directory in HDFS called input and copy some XML files into it by running the following commands
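For example (a sketch following the standard CDH tutorial, which copies the Hadoop configuration files):
$ hadoop fs -mkdir input
$ hadoop fs -put /etc/hadoop/conf/*.xml input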
Step 12.3: Run an example Hadoop job to grep with a regular expression in your input data.
$ /usr/bin/hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar grep input output 'dfs[a-z.]+'
Step 12.4: After the job completes, you can find the output in the HDFS directory named output, because you specified that output directory to Hadoop.
$ hadoop fs -ls
Found 2 items
drwxr-xr-x   - joe supergroup  0 2009-08-18 18:36 /user/joe/input
drwxr-xr-x   - joe supergroup  0 2009-08-18 18:38 /user/joe/output
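To inspect the matches themselves, you can list and print the result files (a sketch; the exact part-file names depend on the job):
$ hadoop fs -ls output
$ hadoop fs -cat output/part-00000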
2. Create a new project in Eclipse called wordcount

Step 1:
1. cp -r /home/training/exercises/wordcount /home/training/workspace/wordcount
2. Open Eclipse -> New Project -> wordcount -> location: /home/training/workspace
3. Right-click on the wordcount project -> Properties -> Java Build Path -> Libraries -> Add External JARs -> select all jars from /usr/lib/hadoop and /usr/lib/hadoop-0.20-mapreduce -> OK
4. Make sure that there are no more compilation errors
Step 3: Create a jar file
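One way to package the classes that Eclipse compiled (a sketch, assuming they land in the project's bin directory) is:
$ cd /home/training/workspace/wordcount
$ jar cvf wordcount.jar -C bin/ .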
Step 4: Run the jar file
hadoop jar wordcount.jar WordCount wordcountinput wordcountoutput
Step 2: View list of databases
$> sqoop list-databases \
--connect jdbc:mysql://localhost/training_db \
--username root --password root
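In the same way, you can list the tables inside the database (sqoop list-tables is a standard Sqoop command):
$> sqoop list-tables \
--connect jdbc:mysql://localhost/training_db \
--username root --password root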
$> sudo yum install hive (Already done for you, don't run this command)
$> sudo -u hdfs hadoop fs -mkdir /user/hive/warehouse
$> hadoop fs -chmod g+w /tmp
$> sudo -u hdfs hadoop fs -chmod g+w /user/hive/warehouse
$> sudo -u hdfs hadoop fs -chown -R training /user/hive/warehouse
$> sudo chmod 777 /var/lib/hive/metastore
$> hive
hive> show tables;
Step 2: Create table
hive> create table user_log (country STRING, ip_address STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
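To try the table out, you would typically load a tab-delimited file and query it; a sketch (the file path here is a hypothetical example, not part of the exercise):
hive> LOAD DATA LOCAL INPATH '/home/training/exercises/user_log.txt' INTO TABLE user_log;
hive> select * from user_log limit 10;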
6. Setting up Flume

Step 1: Install Flume
$> sudo yum install flume-ng (Already done for you, please don't run this command)
$> sudo -u hdfs hadoop fs -chmod 1777 /user/training
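Step 3 below expects a configuration at /usr/lib/flume-ng/conf/flume.conf defining an agent named agent. A minimal sketch of such a config, assuming a spooling-directory source on /home/training and an HDFS sink writing to the logs directory (the actual training config may differ):
agent.sources = src1
agent.channels = ch1
agent.sinks = sink1

# Watch a local directory for new files
agent.sources.src1.type = spooldir
agent.sources.src1.spoolDir = /home/training
agent.sources.src1.channels = ch1

# Buffer events in memory
agent.channels.ch1.type = memory

# Write events to the logs directory in HDFS
agent.sinks.sink1.type = hdfs
agent.sinks.sink1.hdfs.path = logs
agent.sinks.sink1.hdfs.fileType = DataStream
agent.sinks.sink1.channel = ch1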
Step 3: Start the Flume agent
$> flume-ng agent --conf-file /usr/lib/flume-ng/conf/flume.conf --name agent -Dflume.root.logger=INFO,console
Step 4: Push the file (in a different terminal)
$> sudo cp /home/training/exercises/log.txt /home/training
Step 5: View the output
$> hadoop fs -ls logs
7. Setting up a multi-node cluster

Step 1: To convert from pseudo-distributed mode to distributed mode, the first step is to stop the existing services (to be done on all nodes)
$> for service in /etc/init.d/hadoop*
> do
>   sudo $service stop
> done
Step 4: Verify alternatives (to be done on all nodes)
$> /usr/sbin/update-alternatives \
> --display hadoop-conf
$> /sbin/ifconfig
Step 5.2: List all the IP addresses in your cluster setup, i.e. the ones that will belong to your cluster, and decide on a name for each one. In our example, let's say we are setting up a 3-node cluster, so we fetch the IP address of each node and name it namenode or datanode<n>. Update the /etc/hosts file with the IP addresses as shown, so that /etc/hosts on each node looks something like this:
192.168.1.12 namenode
192.168.1.21 datanode1
192.168.1.22 datanode2
Then ping namenode to confirm that the names resolve.
Step 6: Change the configuration files (to be done on all nodes)
The format for adding a configuration parameter is:
<property>
  <name>property_name</name>
  <value>property_value</value>
</property>
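For example, pointing clients at the namenode host defined in /etc/hosts would look like this in core-site.xml (a sketch using the standard property name for this Hadoop generation; the full set of parameters to change depends on your cluster):
<property>
  <name>fs.default.name</name>
  <value>hdfs://namenode:8020</value>
</property>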
Step 7: Create the necessary directories (to be done on all nodes)
$> sudo mkdir -p /home/disk1/dfs/nn
$> sudo mkdir -p /home/disk2/dfs/nn
$> sudo mkdir -p /home/disk1/dfs/dn
$> sudo mkdir -p /home/disk2/dfs/dn
$> sudo mkdir -p /home/disk1/mapred/local
$> sudo mkdir -p /home/disk2/mapred/local
Step 8: Manage permissions (to be done on all nodes)
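Following the usual CDH convention, the directories created in Step 7 would be owned as follows (a sketch, assuming the hdfs and mapred users and the hadoop group created by the packages):
$> sudo chown -R hdfs:hadoop /home/disk1/dfs /home/disk2/dfs
$> sudo chown -R mapred:hadoop /home/disk1/mapred /home/disk2/mapred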
Step 15: Verify the cluster
Visit http://namenode:50070 and look at the number of nodes.
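You can also verify from the command line with the standard HDFS report:
$> sudo -u hdfs hadoop dfsadmin -report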