
Hadoop 2.2.0 – Single Node Cluster

We're going to use the Hadoop tarball we compiled earlier to run a pseudo-distributed cluster, i.e. a one-node cluster running on a single machine. If you haven't already read the tutorial on building the tarball, please head over and do that first.

Getting started with Hadoop 2.2.0 — Building

Start up your (virtual) machine and log in as the user 'hadoop'. First, we're going to set up the essentials required to run Hadoop. By the way, if you are running a VM, I suggest you kill the machine used for building Hadoop and start from a fresh instance of Ubuntu to avoid any compatibility issues later. For reference, the OS we are using is 64-bit Ubuntu 12.04.3 LTS.

Environment Setup

First thing we need to do is create an RSA keypair for our user.

ssh-keygen -t rsa  # Don't enter a password, pick all defaults
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
ssh localhost 

We just created a key and added it to our user's authorized keys so that we can do password-less login to our own machine. The last command above should log you in without asking for a password.
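If you want a non-interactive way to confirm this, OpenSSH's BatchMode option makes ssh fail instead of prompting, so something like the following only succeeds if key-based login is actually working (an optional check, not part of the required steps):

ssh -o BatchMode=yes localhost 'echo password-less login works'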

Next, we want to call this system ‘master’. It’s a master of itself — this will come in handy in the future. We need to change the hostname and then configure our hosts file.

sudo hostname master
sudo vi /etc/hostname 

The hostname file should have a single line:

master

Next, modify the /etc/hosts file and, just after the localhost line, add an entry identifying 'master':

127.0.0.1           localhost 
192.168.56.101      master 

This is assuming your machine has the IP address 192.168.56.101. Make sure that this address is accessible from other machines; we will be using it later to look at some stats, inshaallah.
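A quick, optional way to verify reachability is to ping the address from another machine on the same network (adjust the IP to whatever your VM actually has):

ping -c 3 192.168.56.101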

Next, we install Java. OpenJDK 7 works fine.

sudo apt-get install -y openjdk-7-jdk

Now we can start setting up Hadoop. Copy the tarball over to master. You can use scp or WinSCP for that, or put it on a web server and access it from there. After you have the compiled tarball in the hadoop user's home folder on master, it's time to extract and configure it.

cd /usr/local 
sudo tar zxf ~/hadoop-2.2.0.tar.gz   
sudo chown hadoop:hadoop -R hadoop-2.2.0/  # change ownership 
sudo ln -s hadoop-2.2.0 hadoop             # create a symbolic link for future upgrades
sudo chown hadoop:hadoop -R hadoop        

# create DFS storage location and set permissions 
sudo mkdir -p /app/hadoop/tmp
sudo chown hadoop:hadoop /app/hadoop/tmp/ -R

Ok. That was easy. Now, let's go ahead and append Hadoop's executable folders to our PATH definition. I made the changes in /etc/environment, but you can also modify your ~/.bashrc file. Your choice. Just append the following to your PATH definition:

:/usr/local/hadoop/bin:/usr/local/hadoop/sbin
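For reference, after the edit the PATH line in /etc/environment might look roughly like this (your existing entries may differ; this is only an illustration):

PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/hadoop/bin:/usr/local/hadoop/sbin"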

Now, source the file to put it into effect:

. /etc/environment 
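To confirm the PATH change took effect, you can check that the Hadoop scripts now resolve (a quick, optional check):

which hdfs           # should print /usr/local/hadoop/bin/hdfs
which start-dfs.sh   # should print /usr/local/hadoop/sbin/start-dfs.sh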

Configuration

It is now time to create some configuration files. There are quite a few, but don't worry: I'm going to try to explain them, they're fairly straightforward, and they work as of this writing.

cd /usr/local/hadoop 
mkdir conf 
touch conf/core-site.xml 
touch conf/mapred-site.xml 
touch conf/hdfs-site.xml 
touch conf/yarn-site.xml 
touch conf/capacity-scheduler-site.xml 
touch conf/hadoop-env.sh 
touch conf/slaves

The first one, conf/core-site.xml, is pretty easy.

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>

  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:54310/</value>
  </property>
</configuration>

Both properties matter here. The first one, hadoop.tmp.dir, tells Hadoop where to put its filesystem (meta)data; that's the directory we created above, and Hadoop will create its own subfolders under it. The second property, fs.default.name, tells Hadoop where to look for HDFS. If you later see the machine's local filesystem when you try to retrieve a directory listing from HDFS, you will know that you messed this setting up. Notice the host 'master' over here; the port is best left as shown.

Next file is conf/mapred-site.xml. It only has one setting:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>master:54311</value>
    <description>The host and port that the MapReduce job tracker runs
      at.  If "local", then jobs are run in-process as a single map
      and reduce task.
    </description>
  </property>
</configuration>

The description basically says it all. The HDFS configuration goes in hdfs-site.xml and is also straightforward, as we are not doing any customization for now.

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.permissions.superusergroup</name>
    <value>hadoop</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Default block replication.
    The actual number of replications can be specified when the file is created.
    The default of 3 is used if replication is not specified. 
    </description>
  </property>
</configuration>

The last one is the most detailed but is still pretty simple. This is the listing for yarn-site.xml; YARN basically replaces the JobTracker and TaskTracker of MapReduce 1.

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.dispatcher.exit-on-error</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.staging-dir</name>
    <value>/user</value>
  </property>
  <property>
    <name>yarn.application.classpath</name>
    <value> 
      $HADOOP_CONF_DIR,
      $HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,
      $HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,
      $HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,
      $HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*
    </value>
  </property>

  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>master:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>master:8031</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>master:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>master:8033</value>
  </property>
  <property>
    <name>yarn.web-proxy.address</name>
    <value>master:8034</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>master:8088</value>
  </property>
</configuration>

The bottom half is of some importance to us right now. These are the port configurations for the different services offered by the YARN ResourceManager. Make a note of the last one; that's the web front-end we can use to monitor our cluster.
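Once the daemons are running (we start them later in this tutorial), that front-end should be reachable in a browser at http://master:8088, and you can also poke it from the command line as a quick check:

curl -s http://master:8088/cluster | head   # ResourceManager web UI; only works after the cluster is up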

Next, we put the following content in conf/capacity-scheduler-site.xml.

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
    <value>0.1</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>default</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.capacity</name>
    <value>100</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.user-limit-factor</name>
    <value>1</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.maximum-capacity</name>
    <value>100</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.state</name>
    <value>RUNNING</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.acl_submit_applications</name>
    <value>*</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.acl_administer_queue</name>
    <value>*</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.node-locality-delay</name>
    <value>-1</value>
  </property>
</configuration>

The last file we want to configure is the environment file conf/hadoop-env.sh. This will be read by the different scripts and can be used to set up environment variables specific to Hadoop.

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=/usr/local/hadoop/conf
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true

export HADOOP_COMMON_HOME=/usr/local/hadoop
export HADOOP_HDFS_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=/usr/local/hadoop
export HADOOP_YARN_HOME=/usr/local/hadoop

export YARN_CONF_DIR=/usr/local/hadoop/conf

We also need another .sh file, but it is the same as the above, so let's just copy it.

cp conf/hadoop-env.sh conf/yarn-env.sh 

One final thing we need to do is to tell Hadoop about our slaves. For now, we have only one so put the following in conf/slaves:

master

Well, that’s done. Now, let’s see if we can execute it.

Cluster Startup

Since we are only on one node, starting it up is pretty easy (although, to look at the official docs, you’d think this was rocket science).

First, we need to format our namenode.

hdfs namenode -format 

The hdfs command is actually in /usr/local/hadoop/bin, but since we have that added to the path, this works fine. After this, we need to start our HDFS.

. /etc/environment
start-dfs.sh 
# Alternative commands to start namenode and datanode 
# hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start namenode
# hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start datanode 

If, on a later run, you get an error like this:

There appears to be a gap in the edit log. We expected txid 1, but got txid 705.

Append -recover to the namenode command above.
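In other words, run the namenode in recovery mode; it will walk you through repairing the edit log (only needed if you actually hit that error):

hdfs namenode -recover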

Check which daemons are running by executing jps, the Java process status tool:

jps

You should see a NameNode and a DataNode, along with Jps itself.

And now the YARN daemons.

yarn-daemon.sh --config $HADOOP_CONF_DIR start resourcemanager
yarn-daemon.sh --config $HADOOP_CONF_DIR start nodemanager

JPS should now list a NameNode, a DataNode, a ResourceManager, a NodeManager and JPS itself. To test the HDFS, you can issue the following command:

hdfs dfs -ls / 

It shouldn’t return anything at this point. If it lists your local system files, re-check the fs.default.name settings. If everything works fine, go ahead and try to see if you can run the hadoop examples.
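As an aside, if the listing ever shows local files, one way to see which filesystem URI is actually in effect is to query the live configuration (hdfs getconf ships with the standard hdfs script):

hdfs getconf -confKey fs.default.name   # should print hdfs://master:54310/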

Executing an Example Job

Let's first get some files to upload to our HDFS. As usual, we will get a few files from Project Gutenberg. See details here: http://www.gutenberg.org

cd /tmp 
mkdir gutenberg
cd gutenberg 
wget http://www.gutenberg.org/cache/epub/20417/pg20417.txt
wget http://www.gutenberg.org/cache/epub/5000/pg5000.txt
wget http://www.gutenberg.org/cache/epub/4300/pg4300.txt

Now, let’s create a folder in our HDFS and upload the folder there.

hdfs dfs -mkdir -p /user/hadoop/
hdfs dfs -copyFromLocal /tmp/gutenberg /user/hadoop/
hdfs dfs -ls /user/hadoop/gutenberg 

If you can see the three files listed properly, we’re all good to go here and we can now run the wordcount example on this.

cd /usr/local/hadoop
find . -name "*examples*.jar"
# see where the file is found and use it below

cp share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar ./
hadoop jar hadoop-mapreduce-examples-2.2.0.jar wordcount /user/hadoop/gutenberg /user/hadoop/gutenberg-out

That should run for a bit and then produce something like the following at the end:

File System Counters
		FILE: Number of bytes read=813183
		FILE: Number of bytes written=4754129
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=8241626
		HDFS: Number of bytes written=0
		HDFS: Number of read operations=24
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=3
	Map-Reduce Framework
		Map input records=77931
		Map output records=629172
		Map output bytes=6076101
		Map output materialized bytes=1459156
		Input split bytes=352
		Combine input records=629172
		Combine output records=101113
		Reduce input groups=0
		Reduce shuffle bytes=0
		Reduce input records=0
		Reduce output records=0
		Spilled Records=101113
		Shuffled Maps =0
		Failed Shuffles=0
		Merged Map outputs=0
		GC time elapsed (ms)=516
		CPU time spent (ms)=0
		Physical memory (bytes) snapshot=0
		Virtual memory (bytes) snapshot=0
		Total committed heap usage (bytes)=524660736
	File Input Format Counters 
		Bytes Read=3671523
	File Output Format Counters 
		Bytes Written=4753421

You can now do a usual dfs -ls on the output folder to check, and then get the output using the following command:

hdfs dfs -getmerge /user/hadoop/gutenberg-out/part-r-00000 /tmp/gutenberg-wordcount
# Alternative command
# hdfs dfs -copyToLocal /user/hadoop/gutenberg-out /tmp/

head /tmp/gutenberg-wordcount 

The contents should look something like this:

"(Lo)cra"       1
"1490   1
"1498," 1
"35"    1
"40,"   1
"A      2
"AS-IS".        1
"A_     1
"Absoluti       1
"Alack! 1

And that’s it. How do you turn this single-node cluster to a multi-node cluster? That’s not difficult but you’ll have to wait a few days for that.

Let me know in the comments section if you face any problem. I might be able to point you in the right direction.
