Friday, September 7, 2012

Configure your MacOS to run Hadoop MapReduce jobs

Purpose:
This article shows how to configure a single-node Hadoop setup on MacOS from scratch and run a basic MapReduce job.

Step1:
Download a stable Hadoop release from the Apache server. I use hadoop-1.0.3.
Use the following command to extract the files into your favorite directory:
tar xzvf hadoop-1.0.3.tar.gz 

Make sure your working directory is ~/SOME/hadoop-1.0.3.
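
If you have not downloaded the tarball yet, one way to fetch it is from the Apache archive; the URL below is just one possible mirror path for hadoop-1.0.3 and may have moved, so check the current mirrors:
$ curl -O https://archive.apache.org/dist/hadoop/core/hadoop-1.0.3/hadoop-1.0.3.tar.gz
$ tar xzvf hadoop-1.0.3.tar.gz
$ cd hadoop-1.0.3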

Step2:
Add the following lines to the file ./conf/hadoop-env.sh:
# set JAVA_HOME in this file
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/CurrentJDK/Home

# suppress the "Unable to load realm info from SCDynamicStore" warning from the Mac JRE
export HADOOP_OPTS="-Djava.security.krb5.realm= -Djava.security.krb5.kdc="
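
If you are not sure where Java lives on your Mac, /usr/libexec/java_home prints the active Java home; if it differs from the path above (which assumes the stock Apple JDK of that era), use that value for JAVA_HOME instead:
$ /usr/libexec/java_home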


Step3:
Modify the following files:
conf/core-site.xml

<configuration>
        <property>
                <name>hadoop.tmp.dir</name>
                <value>/Users/YOURNAME/tmp1</value>
                <description>A base for other temporary directories.</description>
        </property>

        <property>
                <name>fs.default.name</name>
                <value>hdfs://localhost:9000</value>
        </property>
</configuration>

Note: the key {hadoop.tmp.dir} sets the base location where HDFS stores its metadata and data files.

conf/hdfs-site.xml

<configuration>
        <property>
                <name>dfs.replication</name>
                <value>1</value>
        </property>
</configuration>

conf/mapred-site.xml

<configuration>
        <property>
                <name>mapred.job.tracker</name>
                <value>localhost:9001</value>
        </property>
</configuration>


Step 4:
Enable Remote Login on your MacOS: go to System Preferences -> Sharing and check Remote Login. Then issue the following commands in your terminal:

$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa 
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
$ ssh localhost
If the last command logs you into localhost without prompting for a password, passwordless SSH is working.
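
If ssh localhost still prompts for a password, a common cause is overly permissive file modes on the key files; tightening them as below usually helps (paths assume the defaults used above):
$ chmod 700 ~/.ssh
$ chmod 600 ~/.ssh/authorized_keys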

Step5:
Format a new distributed filesystem and start the Hadoop framework:
$ bin/hadoop namenode -format
$ bin/start-all.sh 
At this point you should find several newly-created directories under /Users/YOURNAME/tmp1, the path you set for the key {hadoop.tmp.dir}.
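
To double-check that all daemons came up, you can list the running Java processes with jps (shipped with the JDK). On a healthy single-node Hadoop 1.x setup you would roughly expect to see NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker:
$ jps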

Now open your browser and visit the following addresses:

NameNode - http://localhost:50070/

JobTracker - http://localhost:50030/

You should see the NameNode and JobTracker status pages.
Note: there should be 1 live node if everything goes well; otherwise restart the Hadoop daemons. Sometimes the live-node count stays at 0, which means you cannot execute any MapReduce job. In my experience, an effective fix is to stop the daemons and delete all directories generated by Hadoop under {hadoop.tmp.dir}.
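A minimal cleanup sketch, assuming {hadoop.tmp.dir} is /Users/YOURNAME/tmp1 as configured above:
$ bin/stop-all.sh
$ rm -rf /Users/YOURNAME/tmp1/*
Then run the following commands again: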
$ bin/hadoop namenode -format
$ bin/start-all.sh 

Step6:
Test your MapReduce job execution by issuing the following commands:
$ bin/hadoop fs -put conf input
$ bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
You can monitor the job's progress through your terminal or in the web browser at http://localhost:50030/.
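
Once the job finishes, you can inspect the result with the standard HDFS shell; the output files land in the output directory named on the command line above:
$ bin/hadoop fs -cat output/*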