Installation
- Start an Ubuntu instance and set up /etc/resolv.conf and /etc/hosts
- Update the apt package index: sudo apt-get update
- Install Java 7 with sudo aptitude -y install openjdk-7-jdk
- Edit your .bashrc file and add export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
- Get the newest version of Hadoop with sudo wget http://apache.openmirror.de/hadoop/common/hadoop-2.5.2/hadoop-2.5.2.tar.gz
- Unpack Hadoop with sudo tar -xvzf hadoop-2.5.2.tar.gz
- Rename the folder with sudo mv hadoop-2.5.2 hadoop
- Adjust the JAVA_HOME variable in etc/hadoop/hadoop-env.sh (see the sketch after this list)
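A minimal sketch of checking the Java path before wiring it into both files, assuming the OpenJDK 7 package path used above:
# verify the JDK landed where we expect it
ls /usr/lib/jvm/java-7-openjdk-amd64/bin/java
# reload .bashrc and confirm the variable is set
source ~/.bashrc
echo $JAVA_HOME
# the same path goes into etc/hadoop/hadoop-env.sh:
# export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64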
Standalone MapReduce
By default, Hadoop is configured to run in non-distributed mode. You can run a small MapReduce job as follows:
$ mkdir input
$ cp etc/hadoop/*.xml input
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar grep input output 'dfs[a-z.]+'
$ cat output/*
which should deliver: 1 dfsadmin
Congratulations! You have Hadoop running on Ubuntu.
Set up a one-node cluster
Hadoop can also be run on a single node in pseudo-distributed mode, where each Hadoop daemon runs in a separate Java process.
First of all we need to turn off IPv6, because Hadoop binds to 0.0.0.0 and on Ubuntu this can end up using IPv6 instead of IPv4.
Edit /etc/sysctl.conf with the lines below and reboot, or just add export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true to etc/hadoop/hadoop-env.sh:
#disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
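If you go the sysctl route, the settings can also be applied without a reboot; a quick sketch using standard Linux tooling:
# reload the settings from /etc/sysctl.conf
sudo sysctl -p
# should print 1 once IPv6 is disabled
cat /proc/sys/net/ipv6/conf/all/disable_ipv6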
Edit file etc/hadoop/core-site.xml:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
Edit file etc/hadoop/hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
Hadoop uses ssh to communicate, so we need to set up passwordless ssh to localhost.
Let's create a key without a passphrase: ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
and add it to the authorized keys: cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
Now it should be possible to ssh localhost without being asked for a password.
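At this point the single-node HDFS can be formatted and started; a minimal sketch using the standard Hadoop 2.x scripts, run from the hadoop directory:
# format the HDFS filesystem (first run only)
bin/hdfs namenode -format
# start the NameNode and DataNode daemons
sbin/start-dfs.sh
# list the running Java processes to check that the daemons came up
jps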
Set up a multi-node cluster
This tutorial is an excerpt from Mohamed and Marawan's Hadoop Tutorial.
The Hadoop tutorial for multi-node cluster installation doesn't offer much in the way of detailed steps; it mostly presents a list of the available configuration options, which is confusing, because a minimal setup only requires changing the following:
In etc/hadoop/core-site.xml, instead of having the fs.defaultFS property set to localhost, we set it to point to the IP address of the NameNode:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://172.16.19.10:9000</value>
</property>
</configuration>
This has two effects. First, on the NameNode: the NameNode process will listen on 172.16.19.10:9000, which allows machines on the network to connect to it, in contrast with listening on localhost:9000, which only allows processes on the same machine to connect. The second effect is on the DataNodes: this config simply tells them which NameNode to connect to and on which port.
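Once HDFS has been started (see further below), a quick way to confirm that the NameNode really listens on the network interface rather than only on loopback is a sketch like this (netstat is part of the standard net-tools package):
# run on the NameNode; expect 172.16.19.10:9000 in the output, not 127.0.0.1:9000
netstat -tlnp | grep 9000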
In etc/hadoop/yarn-site.xml, specify the ResourceManager IP; here we simply used the NameNode IP (but that doesn't always have to be the case):
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>172.16.19.10</value>
<description>The hostname of the Resource Manager.</description>
</property>
</configuration>
The above configs eventually need to be on both the NameNode and the DataNodes, so we applied them to our single machine and then created a snapshot of it. That snapshot is the Hadoop image we'll use to create our cluster.
(Optional) In etc/hadoop/hdfs-site.xml, specify the SecondaryNameNode address, which is by default set to '''0.0.0.0:50090''' (i.e. all IP addresses currently configured on the machine), and dfs.replication, which is how many copies to store for each block of a file in HDFS.
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>172.16.19.17:50090</value>
</property>
(Recommended) In etc/hadoop/hdfs-site.xml, disable the permission checks for HDFS files. You would probably leave them enabled in a production environment, but for testing they can lead to Permission denied exceptions:
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
Specific Configuration
Now that we have an image with the correct Hadoop configuration, we proceeded by creating cloned instances from that image to build a small cloud with one name node and 3 data nodes. The IP addresses of our machines:
172.16.19.10 (NameNode)
172.16.19.11 (DataNode)
172.16.19.12 (DataNode)
172.16.19.13 (DataNode)
Now all machines are clones. First we need to adjust the hostname on each machine (fixing the "unknown host" problem); a sketch of this is shown below.
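A minimal sketch for renaming a clone, using one of the hostnames from this tutorial as an example:
# set the hostname for the running system and persist it across reboots
sudo hostname datanode-1
echo datanode-1 | sudo tee /etc/hostname
# the name also has to be resolvable locally; see the /etc/hosts edits below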
Then, in terms of Hadoop configuration, only the NameNode needs two additional configurations:
On the name node server we edited the etc/hadoop/slaves file and added the IP addresses of the 3 data nodes:
172.16.19.11
172.16.19.12
172.16.19.13
Now the NameNode knows which slaves it can use. Important: make sure that the NameNode can ssh to each of the DataNodes without a passphrase or any user input. From the NameNode:
ssh ubuntu@172.16.19.11
ssh ubuntu@172.16.19.12
ssh ubuntu@172.16.19.13
This should be the case if you had already followed the single-node cluster configuration on one node and then cloned it, as the ssh key should already be in the list of authorized keys for the machine.
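If the key is missing on any DataNode, ssh-copy-id can install it; the loop below is a sketch assuming the ubuntu user and the IP addresses used in this tutorial:
# run on the NameNode; installs the public key on each data node (asks for the password once per node)
for ip in 172.16.19.11 172.16.19.12 172.16.19.13; do
  ssh-copy-id ubuntu@$ip
done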
On the name node server, edit /etc/hosts and add the hostname-IP mappings of the datanodes, so that the file looks like this:
ubuntu@namenode:~/hadoop$ cat /etc/hosts
127.0.0.1 localhost
127.0.0.1 namenode:localdomain namenode
'''172.16.19.11 datanode-1'''
'''172.16.19.12 datanode-2'''
'''172.16.19.13 datanode-3'''
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
On the data nodes, edit /etc/hosts and add the hostname-IP mappings of the namenode and the other datanodes. This is required because when running MapReduce jobs, the data nodes will need to communicate with each other and with the name node. You also need to ('''IMPORTANT''') comment out the line with localdomain so that the file looks like this:
ubuntu@datanode-1:~/hadoop$ cat /etc/hosts
127.0.0.1 localhost
#127.0.0.1 datanode-1:localdomain datanode-1
172.16.19.10 namenode
172.16.19.11 datanode-1
172.16.19.12 datanode-2
172.16.19.13 datanode-3
....
Without these modifications, HDFS will work correctly, but running a MapReduce job will fail with a "Connection Refused" error.
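A quick sanity check that every node can resolve the others; a sketch using getent and the hostnames above:
# run on each node; every name should resolve to the IP from /etc/hosts
getent hosts namenode datanode-1 datanode-2 datanode-3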
Upload Files using HDFS
Start the Hadoop cluster by executing the following on the NameNode:
bin/hadoop namenode -format
sbin/start-dfs.sh
sbin/start-yarn.sh
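To verify that everything came up, jps (which ships with the JDK) lists the running Hadoop daemons on each machine, and the yarn CLI can list the registered NodeManagers; a sketch:
# on the NameNode this should typically list NameNode and ResourceManager,
# on each DataNode it should typically list DataNode and NodeManager
jps
# on the NameNode: show the NodeManagers that registered with the ResourceManager
bin/yarn node -list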
Now we are able to upload files. HDFS has its own file namespace, and data stored on it is not directly accessible with Linux commands; instead we have to use the bin/hadoop command to access it. By default any relative path on HDFS resolves to /user/[username]/, but this directory is not created automatically, so we need to create it:
ubuntu@namenode:~/hadoop$ bin/hdfs dfs -mkdir /user
ubuntu@namenode:~/hadoop$ bin/hdfs dfs -mkdir /user/ubuntu
You can list the files in /user/ubuntu with:
ubuntu@namenode:~/hadoop$ bin/hdfs dfs -ls
Note that we didn't need to specify /user/ubuntu in the ls command, because that's the default directory. Now copy a test file over and list it with -ls:
ubuntu@namenode:~/hadoop$ echo Hello HDFS > ~/hellohdfs.txt
ubuntu@namenode:~/hadoop$ bin/hadoop dfs -put ~/hellohdfs.txt testfile
ubuntu@namenode:~/hadoop$ bin/hadoop dfs -ls
Found 1 items
-rw-r--r-- 3 ubuntu supergroup 16 2014-11-04 13:06 testfile
We did this on the namenode; now go to a datanode and view the file content from there:
ubuntu@datanode-1:~/hadoop$ bin/hadoop dfs -ls
Found 1 items
-rw-r--r-- 3 ubuntu supergroup 16 2014-11-04 12:06 testfile
ubuntu@datanode-1:~/hadoop$ bin/hadoop dfs -cat testfile
Hello HDFS
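Since dfs.replication was set earlier, it can be interesting to see how the file's blocks are actually placed; a sketch using the standard HDFS fsck tool:
ubuntu@namenode:~/hadoop$ bin/hdfs fsck /user/ubuntu/testfile -files -blocks
# prints the block list for testfile together with its replication information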
Using the GUI to browse HDFS
It is also possible to use the GUI to browse HDFS. For this, have a look at Section Module0: Preparation & Setup - GUI
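Independently of that GUI, the NameNode also serves a built-in web interface; in Hadoop 2.x it listens on port 50070 by default, so in the setup above it should be reachable at:
http://172.16.19.10:50070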