
Setting up a Hadoop 2.4.1 multi-node cluster in Ubuntu 14.04 64-bit

1. Follow my post on setting up a single-node Hadoop cluster and do that setup on all of your slave computers.

2. One PC will be the master, from which everything is controlled. All other PCs are slaves. NOTE: We will assume mypc1 is the master and the other PCs are slaves.
3. Edit the hosts file so it says at which IP address each of your computers (the master and all slave PCs) can be reached, and modify the following lines accordingly. NOTE: The hostname used here can be different from the hostname set on that PC.

sudo gedit /etc/hosts


For example, if PC1 is the name of a computer and its IP is 10.200.1.8, the hostname in its hosts file entry can be mypc2 or anything else.
NOTE: Remove the line starting with 127.0.1.1 in the hosts file. You should also have the same PC name (as specified by the 127.0.0.1 line, in this case PC1) in both the /etc/hosts file and the /etc/hostname file, or else you will get a "host not found" error. Restart the system for the changes to take effect.
127.0.0.1 localhost PC1
10.200.1.7 mypc1
10.200.1.8 mypc2
10.200.1.9 mypc3
10.200.1.10 mypc4
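
A quick way to sanity-check the hosts file on each machine is to resolve every name and confirm the IPs match the table above (a small optional check; the hostnames are the example ones used in this post):

getent hosts mypc1 mypc2 mypc3 mypc4
# each line should show the IP from /etc/hosts, e.g. 10.200.1.8 mypc2
ping -c 1 mypc2
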
4. Configurations to be done in both Master and Slave Computers

Replace the code in the core-site.xml file with the following code. Change mypc1 to the name of your Master PC. (First change directory using the cd command below.)

cd /usr/local/hadoop/etc/hadoop
sudo gedit core-site.xml

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://mypc1:54310</value>
    <description>The name of the default file system. A URI whose
    scheme and authority determine the FileSystem implementation. The
    uri's scheme determines the config property (fs.SCHEME.impl) naming
    the FileSystem implementation class. The uri's authority is used to
    determine the host, port, etc. for a filesystem.</description>
  </property>
</configuration>
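
After saving the file, a quick way to confirm that Hadoop actually picks the value up (a small optional check, run on any node once the file is in place):

hdfs getconf -confKey fs.default.name
# should print hdfs://mypc1:54310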

Replace the code in the hdfs-site.xml file with the following code. (The value of replication should equal the number of slave computers; in this case, 4.)

sudo gedit hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>4</value>
    <description>Default block replication.
    The actual number of replications can be specified when the file is created.
    The default is used if replication is not specified in create time.
    </description>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/usr/local/hadoop/hdfs</value>
    <description>Directory to store files in HDFS.
    This directory is not formatted when namenode is formatted.
    </description>
  </property>
</configuration>
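
The dfs.data.dir path above must exist and be writable by hduser on every node, or the DataNode on that node will fail to start. A minimal sketch, assuming the hduser user and hadoop group from the single-node post:

sudo mkdir -p /usr/local/hadoop/hdfs
sudo chown -R hduser:hadoop /usr/local/hadoop/hdfs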

Replace the code in the mapred-site.xml file with the following code. (Modify mypc1 to the name of your Master PC.)

sudo gedit mapred-site.xml
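
On a fresh Hadoop 2.4.x install the mapred-site.xml file may not exist yet; only a template ships with the tarball. If gedit opened an empty new file, you can start from the template instead (optional, assuming the install path used above):

sudo cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml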

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>mypc1:54311</value>
    <description>The host and port that the MapReduce job tracker runs
    at. If "local", then jobs are run in-process as a single map
    and reduce task.
    </description>
  </property>
</configuration>

Replace the code in the yarn-site.xml file with the following code. (Replace mypc1 with your Master Node's name or IP address.)

sudo gedit yarn-site.xml

<configuration>
  <!-- Site specific YARN configuration properties -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>mypc1:8025</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>mypc1:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>mypc1:8050</value>
  </property>
</configuration>
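
Since these four files must be identical on every node, one option instead of re-editing them on each slave is to copy them from the master once passwordless SSH (step 6 below) is working. A rough sketch, assuming the same /usr/local/hadoop path on all PCs and that hduser can write to that directory (otherwise copy to a temporary location and move the files with sudo):

for host in mypc2 mypc3 mypc4; do
  scp /usr/local/hadoop/etc/hadoop/{core-site.xml,hdfs-site.xml,mapred-site.xml,yarn-site.xml} hduser@$host:/usr/local/hadoop/etc/hadoop/
done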

5. Configurations to be done only on the Master Computer. (NOTE: The user should be hduser in the terminal. Redo this step if you change IPs later.) First delete the .ssh folder in hduser's home and generate a new key:

sudo rm -R /home/hduser/.ssh && ssh-keygen -t rsa -P "" && cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
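
If SSH still asks for a password later on, one common cause (an optional extra, not part of the original steps) is overly open permissions on the key files, which OpenSSH refuses to use:

chmod 700 $HOME/.ssh
chmod 600 $HOME/.ssh/authorized_keys
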
6. Enabling SSH access, so that the master can reach all computers, including the master PC mypc1 itself. (Use id_dsa.pub in case id_rsa.pub doesn't work.)

ssh-copy-id -i ~/.ssh/id_rsa.pub hduser@mypc1
ssh-copy-id -i ~/.ssh/id_rsa.pub hduser@mypc2
ssh-copy-id -i ~/.ssh/id_rsa.pub hduser@mypc3
ssh-copy-id -i ~/.ssh/id_rsa.pub hduser@mypc4

Now test the SSH connection to all PCs using these commands. (SSH from your PC's hduser account, i.e. type exit after you SSH into a PC and then SSH into the next one.)

ssh mypc1
ssh mypc2
ssh mypc3
ssh mypc4
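
To check all four in one go, a small optional loop works as well (same hostnames as above); each line should print the remote hostname without asking for a password:

for host in mypc1 mypc2 mypc3 mypc4; do ssh $host hostname; done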

Create a masters file and a slaves file to specify which PCs are masters and which are slaves. First change directory, then paste the following lines into each file respectively:

cd /usr/local/hadoop/etc/hadoop/
sudo gedit masters

mypc1

sudo gedit slaves

mypc1
mypc2
mypc3
mypc4

Finally, the output of the masters file should be (check using cat /usr/local/hadoop/etc/hadoop/masters):

mypc1

The output of the slaves file (it also includes the master PC mypc1, as we want to run programs on the master PC too) should be (check using cat /usr/local/hadoop/etc/hadoop/slaves):

mypc1
mypc2
mypc3
mypc4

7. Testing Time!! - On the Master PC (NOTE: the user should be hduser in the terminal).

Change directory:

cd /usr/local/hadoop/bin

Format the NameNode:

hadoop namenode -format

By starting the HDFS daemons (start-dfs.sh), the NameNode daemon is started on the Master PC and DataNode daemons are started on all nodes listed in the slaves file (which here includes the master). By starting the YARN daemons (start-yarn.sh), the ResourceManager daemon is started on the Master PC and NodeManager daemons are started on all PCs. NOTE: Check whether the respective daemons are running by using jps.

cd /usr/local/hadoop/etc/hadoop
start-dfs.sh && start-yarn.sh
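
Roughly what jps should report once everything is up (a sketch only: process IDs will differ, and this assumes mypc1 is also listed in the slaves file as configured above):

ssh mypc1 jps   # expect NameNode, SecondaryNameNode, ResourceManager, DataNode, NodeManager, Jps
ssh mypc2 jps   # expect DataNode, NodeManager, Jps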

A warning about being unable to load native Hadoop libraries is OK; it won't affect Hadoop's functionality.

To stop the daemons:

stop-dfs.sh
stop-yarn.sh

8. Web Interfaces (NOTE: mypc1 can be replaced by the IP address of the Master PC)

Web UI displaying all cluster info - http://mypc1:8088/

Web UI of the cluster health and the NameNode - http://mypc1:50070/

Web UI of the DataNodes, for accessing logs & browsing the file system - http://mypc1:50075/
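
To confirm the web UIs are actually listening without opening a browser, a quick optional check from any PC in the cluster (assumes curl is installed; the ResourceManager UI may answer with a redirect instead of 200):

curl -s -o /dev/null -w "%{http_code}\n" http://mypc1:8088/
curl -s -o /dev/null -w "%{http_code}\n" http://mypc1:50070/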

If any of the daemons are not running, follow these steps (do this on all PCs):

1. Remove temporary folders:

sudo rm -R /tmp/*

2. Only on the master - check that SSH to all the PCs works without asking for a password every time.

3. Kill all the processes if they are already running (don't do this if you have any jobs running):

sudo kill -9 $(lsof -ti:8088) && sudo kill -9 $(lsof -ti:8042) && sudo kill -9 $(lsof -ti:50070) && sudo kill -9 $(lsof -ti:50075) && sudo kill -9 $(lsof -ti:50090)

4. Format the NameNode (it should not ask you to re-format the file system, and the exit status should be 0):

hadoop namenode -format

5. From the master, start all the daemons again.
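
Once all the daemons are up again, a simple end-to-end test is to run one of the bundled example jobs from the master. A sketch only: the jar path assumes the Hadoop 2.4.1 tarball was unpacked to /usr/local/hadoop as in the single-node post, so adjust the version in the filename if yours differs:

hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar pi 2 10
# a successful run prints an estimate of pi and the job shows up at http://mypc1:8088/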

Comments are highly appreciated.
