Friday, November 7, 2014

Configuring Hadoop on Mac OS X in pseudo-distributed mode.


Hadoop is an open-source Apache project that enables processing of extremely large datasets in a distributed computing environment. There are three different modes in which it can be run:

1. Standalone Mode
2. Pseudo-Distributed Mode
3. Fully-Distributed Mode

This post covers setting up Hadoop 2.5.1 in pseudo-distributed mode, in which each Hadoop daemon runs as a separate Java process on a single machine.

Prerequisites


Java: Install Java if it isn't already installed on your Mac.
Homebrew: Homebrew is a package manager for OS X. You can find the installation instructions on the Homebrew site.
Keyless SSH: First, ensure Remote Login under System Preferences -> Sharing is checked to enable SSH, then generate a key pair and authorize it:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
Now ssh into your localhost; you should not be prompted for a password (the first connection may ask you to accept the host key).
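For example, logging in and out again:
$ ssh localhost
$ exit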

Installation


This is where Homebrew is used.
$ brew install hadoop
If you do not want to use Homebrew, or you want to install a specific version of Hadoop, you can download it from the Apache Hadoop releases page. Unpack the tarball to a location of your choice and assign ownership to the user setting up Hadoop.
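As a rough sketch of the manual route (the mirror URL and target directory below are only examples; adjust them to your setup):
$ curl -O https://archive.apache.org/dist/hadoop/common/hadoop-2.5.1/hadoop-2.5.1.tar.gz
$ sudo tar -xzf hadoop-2.5.1.tar.gz -C /usr/local
$ sudo chown -R $(whoami) /usr/local/hadoop-2.5.1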

Configuration


Each Hadoop component is configured with an XML file located in /usr/local/Cellar/hadoop/2.5.1/libexec/etc/hadoop (the path for the Homebrew install of 2.5.1). MapReduce properties go in mapred-site.xml, HDFS properties in hdfs-site.xml, and common properties in core-site.xml. The general Hadoop environment settings are found in hadoop-env.sh.
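For example, with the Homebrew install above you can list all of these files with:
$ ls /usr/local/Cellar/hadoop/2.5.1/libexec/etc/hadoop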

hadoop-env.sh

Replace the existing HADOOP_OPTS with the following (a commonly used workaround for the Kerberos-related "Unable to load realm info from SCDynamicStore" warning on OS X).
export HADOOP_OPTS="-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk"
If Homebrew was not used to install Hadoop, also set JAVA_HOME to point to your Java installation.
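On OS X, a convenient way to do this in hadoop-env.sh is to ask the system for the active JDK (this relies on the standard /usr/libexec/java_home helper that ships with OS X):
export JAVA_HOME=$(/usr/libexec/java_home)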

core-site.xml

The common Hadoop properties go in this config file. Here the default filesystem is pointed at an HDFS instance on localhost.
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

hdfs-site.xml

The Hadoop Distributed File System properties go in this config file. Since we are only setting up one node, we set the value of dfs.replication to 1.
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
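By default HDFS keeps its metadata and block data under a directory in /tmp (derived from hadoop.tmp.dir), which may be cleared by the OS. Optionally, you can point the storage directories somewhere persistent by adding properties such as the following inside the same configuration block; the paths here are only examples:
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///usr/local/var/hadoop/hdfs/namenode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///usr/local/var/hadoop/hdfs/datanode</value>
    </property>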


Execution

Before starting the daemons we must format the newly installed HDFS.
$ cd /usr/local/Cellar/hadoop/2.5.1/libexec/bin
$ hdfs namenode -format

Start the Daemons:
$ cd /usr/local/Cellar/hadoop/2.5.1/libexec/sbin
$ ./start-dfs.sh
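When you are done, the daemons can be stopped from the same directory:
$ ./stop-dfs.sh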

Monitoring
Check the output of jps:
$ jps
10756 NameNode
10842 DataNode
11022 Jps
10951 SecondaryNameNode

Alternatively, the web interface for the NameNode can be browsed at http://localhost:50070

Running Examples
1. Create the HDFS directories required to execute MapReduce jobs:
$ cd /usr/local/Cellar/hadoop/2.5.1/libexec/bin
$ hdfs dfs -mkdir /user
$ hdfs dfs -mkdir /user/<username>
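If you prefer not to type your username by hand, the same two directories can also be created in one step (the -p flag creates the parent directory as needed):
$ hdfs dfs -mkdir -p /user/$(whoami)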

2. Copy the input files to the Hadoop Distributed File System
$ hdfs dfs -put ../etc/hadoop input
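You can confirm that the files landed in HDFS with:
$ hdfs dfs -ls input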

3. Run the grep example that ships with Hadoop (the jar path is relative to libexec/bin, where the previous steps left us)
$ hadoop jar ../share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.1.jar grep input output 'dfs[a-z.]+'

4. View the output files on HDFS
$ hdfs dfs -cat output/*
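Alternatively, copy the output from HDFS to the local filesystem and examine it there:
$ hdfs dfs -get output output
$ cat output/*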

Reference : http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html#Pseudo-Distributed_Operation

1 comment:

  1. If you get the following exception while trying to write to HDFS (the DataNode is not accepting any blocks):
    "org.apache.hadoop.ipc.RemoteException(java.io.IOException): File could only be replicated to 0 nodes instead of minReplication (=1). There are 0 datanode(s) running and no node(s) are excluded in this operation."

    try the following (a command sketch follows the reference below):
    * Stop the Hadoop daemons
    * Delete all Hadoop files in /tmp
    * Reformat the NameNode
    * Start DFS again

    Reference: http://stackoverflow.com/questions/26545524/there-are-0-datanodes-running-and-no-nodes-are-excluded-in-this-operation
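
    A rough sketch of those steps, assuming the Homebrew install path used above and the default HDFS storage location under /tmp (adjust the paths if your setup differs):
    $ cd /usr/local/Cellar/hadoop/2.5.1/libexec/sbin
    $ ./stop-dfs.sh
    $ rm -rf /tmp/hadoop-*
    $ cd ../bin
    $ hdfs namenode -format
    $ cd ../sbin
    $ ./start-dfs.sh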
