Setting up Hadoop on Windows

Posted on January 21, 2012. Filed under: Uncategorized

Usually you’ll want to set up Hadoop on a UNIX variant like Linux, Solaris or even Mac OS X. For one, Hadoop DFS is so close to UNIX-type systems in its usage that running it on Windows feels a lot more “alien”. Besides that, large Windows-based clusters haven’t been tested yet, most probably for the aforementioned reason.
However, in a development environment, Windows is a lot more commonplace. So, after having used Hadoop in a VM on my Vista box and on my MacBook, I took heart and explored the steps necessary to run it on Vista (or any other Windows flavour).
I. Cygwin
The first step is to install Cygwin. This is an absolute requirement to overcome the “alien” part of Windows shell handling. Installing Cygwin is easy: its setup program lets you choose which ported packages to install and where. Make sure to select the openssh package in the Net category.
Once Cygwin is set up, you have to set up your ssh keys and install sshd as a service. Follow these steps:
0. Open cygwin:
Right-click the Cygwin icon and choose “Run as administrator” from the context menu.
1. Set permissions:
Recent releases of Cygwin have a number of permission problems.
Run these four commands as a workaround:
chmod +r  /etc/passwd
chmod u+w /etc/passwd
chmod +r  /etc/group
chmod u+w /etc/group
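If you want to confirm that the workaround took effect, a quick check along these lines should do. This is only a sketch: check_rw is a helper name I made up, and all it does is test whether the current user can read and write each file.

```shell
# check_rw FILE: succeed only if the current user can read and write FILE.
# (check_rw is a made-up helper name, not part of Cygwin.)
check_rw() {
  [ -r "$1" ] && [ -w "$1" ]
}

# Report on the two files touched by the chmod commands.
for f in /etc/passwd /etc/group; do
  if check_rw "$f"; then
    echo "ok: $f"
  else
    echo "still not readable/writable: $f"
  fi
done
```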
2. Run SSH configuration script:
ssh-host-config
The script will ask a lot of questions. Answer all of them with “yes”, except for “This script plans to use cyg_server, Do you want to use a different name?”, where the answer is “no”. The script will also prompt you for a value for the CYGWIN environment variable. Set it to ntsec tty.
ntsec is a setting of the CYGWIN variable that tells Cygwin to use Windows’ security rules for controlling users’ access to files and other operating system facilities.
tty is a setting of the CYGWIN variable that makes Cygwin work properly with editors. It stands for “teletype”. That’s stuff from way back in time :)
The script will also ask you for a password. This will allow you to connect to your Windows box.
3. Start sshd as a service
Run net start sshd. This starts the sshd service that the configuration script installed. As a service, sshd can also be started automatically at Windows startup.
4. Test.
$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
RSA key fingerprint is 9f:48:5e:da:0f:11:3b:19:29:56:9f:0b:34:45:2b:4b.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
You’ll be prompted for your user’s password.
Now we can start installing hadoop.
II. Hadoop
Hadoop supports two operating modes: single instance and clustered. For the sake of simplicity, we’ll outline setting up a single instance installation. Now it’s time to go in medias res:
0. Prerequisites
The first and major prerequisite for running Hadoop is of course Java. You’ll need a current JDK, which you can obtain from Sun at http://java.sun.com. Hadoop recommends Java 6, although it can still be run with Java 5. The next requirement is key-based ssh access. For that you’ll need to generate a key pair and publish the public key as an authorized key in your .ssh directory:
$> ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$> cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
Now test it:
$> ssh localhost
Last login: Sat Dec  5 15:51:02 2009 from 127.0.0.1
If you get an error, then your sshd is most probably not running. See the section above.
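One thing to watch out for: running the cat >> command twice leaves the same key in authorized_keys twice. Here is a small sketch of an append that is safe to repeat. authorize_key is a name I made up for illustration, and the chmod 600 is my own assumption about what sshd is happy with, so adjust it to your setup.

```shell
# authorize_key PUBKEY_FILE AUTHORIZED_KEYS_FILE
# Appends the public key only if that exact line is not already present.
# (authorize_key is a hypothetical helper, not an ssh tool.)
authorize_key() {
  pub=$1
  auth=$2
  touch "$auth"
  # -x: whole-line match, -F: fixed strings, -f: read patterns from the key file
  if ! grep -qxF -f "$pub" "$auth"; then
    cat "$pub" >> "$auth"
  fi
  # Assumption: keep the file private to the owner.
  chmod 600 "$auth"
}

# Example call, matching the key generated earlier:
# authorize_key ~/.ssh/id_dsa.pub ~/.ssh/authorized_keys
```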
1. Hadoop installation
Grab a distribution from http://hadoop.apache.org. At the time of writing it’s 0.20.1. Unpack it somewhere handy on your Windows C: drive. I put it in C:\Servers, where I keep other stuff like JBoss, Tomcat and such.
After unpacking, we’ll do some symbolic linking to make handling in Cygwin easier:
Link Hadoop:
$> mkdir /u01
$> ln -s /cygdrive/c/Servers/hadoop/ /u01/hadoop
It doesn’t have to be /u01; you can put it anywhere you like in your own Cygwin environment.
Now symlink Java:
$> ln -s /cygdrive/c/Program\ Files/Java/jdk1.6.0_11/ /usr/java
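If you re-run ln -s against a link that already points to a directory, the new link ends up inside that directory instead of replacing the old one. A sketch of a repeatable variant follows. link_dir is a made-up helper name, and it assumes your ln supports -n, which the GNU version does.

```shell
# link_dir TARGET LINK: create or refresh a symlink without nesting.
# (link_dir is a hypothetical helper name.)
link_dir() {
  target=$1
  link=$2
  # Never clobber a real file or directory.
  if [ -e "$link" ] && [ ! -L "$link" ]; then
    echo "refusing to replace non-symlink: $link" >&2
    return 1
  fi
  mkdir -p "$(dirname "$link")"
  # -f replaces an existing link, -n avoids descending into it.
  ln -sfn "$target" "$link"
}

# Example calls, using the paths from this post:
# link_dir /cygdrive/c/Servers/hadoop /u01/hadoop
# link_dir "/cygdrive/c/Program Files/Java/jdk1.6.0_11" /usr/java
```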
2. Hadoop configuration
Hadoop must now know where Java is installed, so cd to /u01/hadoop/conf and edit hadoop-env.sh. Locate the JAVA_HOME line, uncomment it, and set the path to /usr/java.
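If you’d rather script that edit, something like the following sketch might work. It assumes hadoop-env.sh contains an export JAVA_HOME= line, commented out or not. set_java_home is a name I invented, and the path must not contain a | character because that is used as the sed delimiter.

```shell
# set_java_home ENV_FILE JAVA_DIR
# Rewrites the (possibly commented-out) JAVA_HOME export in ENV_FILE.
# (set_java_home is a hypothetical helper name.)
set_java_home() {
  env_file=$1
  java_dir=$2
  # Match "export JAVA_HOME=..." with any leading "#" or spaces,
  # write to a temp file first, then swap it in.
  sed "s|^[# ]*export JAVA_HOME=.*|export JAVA_HOME=$java_dir|" \
    "$env_file" > "$env_file.tmp" &&
    mv "$env_file.tmp" "$env_file"
}

# Example call:
# set_java_home /u01/hadoop/conf/hadoop-env.sh /usr/java
```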
If you use the new configuration scheme which came with Hadoop 0.20, you’ll want to configure at least HDFS and MapReduce. Here are the contents of hdfs-site.xml:


<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/u01/hadoop/data/dfs/data</value>
    <description>Determines where on the local filesystem an DFS data node
    should store its blocks. If this is a comma-delimited list of
    directories, then data will be stored in all named directories,
    typically on different devices. Directories that do not exist are
    ignored.</description>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/u01/hadoop/data/dfs/name</value>
    <description>Determines where on the local filesystem the DFS name
    node should store the name table. If this is a comma-delimited list
    of directories then the name table is replicated in all of the
    directories, for redundancy.</description>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop-${user.name}</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost</value>
  </property>
</configuration>
mapred-site.xml looks like this:

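<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>hdfs://localhost:54311</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>8</value>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m</value>
  </property>
</configuration>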

core-site.xml can stay empty.
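To double-check which value a site file assigns to a property, a rough lookup like this can help. It is only a sketch and not a real XML parser: it assumes the usual Hadoop layout of name and value elements inside each property, with the name coming first. get_property is a name I made up.

```shell
# get_property SITE_FILE PROPERTY_NAME
# Prints the <value> that follows the matching <name>; prints nothing
# if the property is absent. (get_property is a hypothetical helper.)
get_property() {
  awk -v want="$2" '
    /<name>/ {
      name = $0
      sub(/.*<name>/, "", name)
      sub(/<\/name>.*/, "", name)
    }
    /<value>/ && name == want {
      value = $0
      sub(/.*<value>/, "", value)
      sub(/<\/value>.*/, "", value)
      print value
      exit
    }
  ' "$1"
}

# Example calls:
# get_property /u01/hadoop/conf/hdfs-site.xml dfs.name.dir
# get_property /u01/hadoop/conf/mapred-site.xml mapred.job.tracker
```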
Once you’ve set up Java and Hadoop, you’re ready to initialize the DFS file system. In your Cygwin shell, cd to the Hadoop root directory and issue the following command:
$> bin/hadoop namenode -format
It will initialize the DFS. You’ll see some output indicating the physical location of the DFS on your system.
After that you’ll be able to start Hadoop:
$> /u01/hadoop/bin/start-all.sh
This does the trick and starts all Hadoop components. You can check whether everything is running correctly by calling jps:
$> /usr/java/bin/jps.exe
(it’s still a Windows system, right?) which should output the following:
5004 DataNode
5772 Jps
4120 JobTracker
4220 SecondaryNameNode
2044 NameNode
5972 TaskTracker
The numbers are the Windows PIDs.
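Rather than eyeballing the list, you could pipe the jps output through a small filter that names whatever is missing. This is a sketch: missing_daemons is my own helper name, and the five daemon names are simply taken from the listing in this post.

```shell
# missing_daemons: reads jps output on stdin and prints every expected
# Hadoop daemon that does not appear in it.
# (missing_daemons is a hypothetical helper name.)
missing_daemons() {
  output=$(cat)
  for daemon in NameNode SecondaryNameNode DataNode JobTracker TaskTracker; do
    # -w stops "NameNode" from matching inside "SecondaryNameNode".
    printf '%s\n' "$output" | grep -qw "$daemon" || echo "$daemon"
  done
}

# Example call:
# /usr/java/bin/jps.exe | missing_daemons
```

No output means all five components were found.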
3. Conclusion
Now you should have a running Hadoop on your Vista box. Please keep in mind that all command line handling must go through the Cygwin environment. If you plan to do some scripting or call your MapReduce programs, you always have to use that intermediary. If you’re looking for your Windows files in Cygwin, you’ll find all lettered Windows drives under /cygdrive/.
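To illustrate the /cygdrive/ mapping, here is a small sketch that turns a Windows path into its Cygwin form. Cygwin ships its own cygpath tool for this (cygpath -u), which is what you should use in practice. win_to_cygdrive is a made-up name, and it only handles plain drive-letter paths.

```shell
# win_to_cygdrive 'C:\path\to\dir': print the /cygdrive equivalent.
# (win_to_cygdrive is a hypothetical helper; it ignores UNC paths.)
win_to_cygdrive() {
  # First character is the drive letter, lower-cased.
  drive=$(printf '%s' "$1" | cut -c1 | tr 'A-Z' 'a-z')
  # Everything after "X:" with backslashes turned into slashes.
  rest=$(printf '%s' "$1" | cut -c3- | tr '\\' '/')
  printf '/cygdrive/%s%s\n' "$drive" "$rest"
}

# Example call:
# win_to_cygdrive 'C:\Servers\hadoop'    # prints /cygdrive/c/Servers/hadoop
```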
