Elephants and Pigs

Since I did have Homebrew installed, I went ahead with this set of instructions, with some variation. Note that at the time I did this, Homebrew installed Hadoop 2.6.0, not the 2.4.x described at this site:

https://www.getblueshift.com/setting-up-hadoop-2-4-and-pig-0-12-on-osx-locally

I had previously verified that I could ssh to my localhost, so I skipped that step and went straight to this:

brew install hadoop
cd /usr/local/Cellar/hadoop/2.6.0/libexec/etc/hadoop
cp mapred-site.xml.template mapred-site.xml
vi hdfs-site.xml core-site.xml mapred-site.xml yarn-site.xml

And got the files looking like this:

<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- mapred-site.xml -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

<!-- yarn-site.xml -->
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
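
Before formatting anything, it is worth confirming that Hadoop is actually picking these values up. A quick sketch using hdfs getconf (paths assume the Homebrew 2.6.0 layout above; expect a deprecation nag about fs.default.name):

# Print the effective values of two of the properties set above
/usr/local/Cellar/hadoop/2.6.0/bin/hdfs getconf -confKey fs.default.name
/usr/local/Cellar/hadoop/2.6.0/bin/hdfs getconf -confKey dfs.replication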

Following along with the instructions at the original site, I did the following, all of which seemed to work as expected (note that there are some errors in the original site, which I've corrected below):

cd /usr/local/Cellar/hadoop/2.6.0
./bin/hdfs namenode -format
./sbin/start-dfs.sh
./bin/hdfs dfs -mkdir /user
./bin/hdfs dfs -mkdir /user/robert
./sbin/start-yarn.sh
./bin/hdfs dfs -put libexec/etc/hadoop input
./bin/hadoop jar libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar grep input output 'dfs[a-z.]+'
./bin/hdfs dfs -get output output
cat output/*
./bin/hdfs dfs -rm -r /user/robert/input/*
./bin/hdfs dfs -rm -r /user/robert/output/*
./sbin/stop-yarn.sh
./sbin/stop-dfs.sh
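
While the daemons are up (i.e. between start-dfs.sh/start-yarn.sh and the stop scripts), the web UIs make a convenient sanity check. These are the stock Hadoop 2.x ports, assuming nothing in the configuration overrides them:

# NameNode web UI: HDFS health, live datanodes, browse the filesystem
open http://localhost:50070
# ResourceManager web UI: YARN applications and their status
open http://localhost:8088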

This all seemed to work; however, there were frequent warnings, which seem to be a known issue:

2015-03-09 10:13:47.495 java[11710:371646] Unable to load realm info from SCDynamicStore
15/03/09 10:13:47 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
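
The native-library warning is just Hadoop noting that no native OS X build is bundled, so it falls back to the built-in Java classes. For the SCDynamicStore message, the workaround that gets passed around for Hadoop on OS X is to hand the JVM empty Kerberos realm settings in libexec/etc/hadoop/hadoop-env.sh (a sketch, which I have not verified on this install):

# hadoop-env.sh: silence "Unable to load realm info from SCDynamicStore"
export HADOOP_OPTS="${HADOOP_OPTS} -Djava.security.krb5.realm= -Djava.security.krb5.kdc="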

As an aside, it thrills me that the two warnings arrive in different log formats. Since I’ve been accumulating a few bits and pieces like this, I created a pair of .command bash scripts in my local Applications folder:

#!/bin/bash
# hadoopStart.command: bring up HDFS and YARN
. ~/.bashrc
cd $HADOOP_HOME
./sbin/start-dfs.sh
./sbin/start-yarn.sh

#!/bin/bash
# companion stop script
. ~/.bashrc
cd $HADOOP_HOME
./sbin/stop-dfs.sh
./sbin/stop-yarn.sh
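
The one wrinkle with .command files is that Finder won't run them until they are executable, so something along these lines (assuming the pair are named hadoopStart.command and hadoopStop.command):

chmod +x ~/Applications/hadoopStart.command ~/Applications/hadoopStop.command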

Most of the sites that talk about installing Hadoop 2.4.1 on the Mac indicate that this does not work:

brew install pig

However, I did that with 2.6.0 and it seems to have successfully installed Pig 0.14.0. I then adapted the experiments at the following page (using a 100 MB file initially rather than 1 GB)

http://ericlondon.com/2014/08/01/hadoop-pig-ruby-map-reduce-on-osx-via-homebrew.html

to verify that Pig was, well, piggy. Specifically, I copied his Ruby script and played with it. Note that the article does not indicate where the data comes from, so I used the following after creating the script:

mkdir Sandbox/pig
cd Sandbox/pig
vi map_reduce.rb
chmod +x map_reduce.rb

# Test the script outside hadoop
cat /usr/share/dict/words | ./map_reduce.rb --map | sort | ./map_reduce.rb --reduce

# create a JSON file with the script, then test with Pig
~/Applications/hadoopStart.command
./map_reduce.rb --create_json_file
$HADOOP_HOME/bin/hdfs dfs -mkdir input
$HADOOP_HOME/bin/hdfs dfs -rm -r output
$HADOOP_HOME/bin/hdfs dfs -put data.json input
pig
grunt> json_data = LOAD 'input/data.json' USING JsonLoader('first_name:chararray, last_name:chararray, address:chararray, city:chararray, state:chararray');
grunt> store json_data into 'output/data.csv' using PigStorage('\t','-schema');
grunt> quit
$HADOOP_HOME/bin/hdfs dfs -cat output/data.csv/.pig_header > data.csv
$HADOOP_HOME/bin/hdfs dfs -cat output/data.csv/part* >> data.csv
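
A quick sanity check of the stitched-together file, using nothing more exotic than the shell:

head -5 data.csv     # the .pig_header line followed by the first few records
wc -l data.csv       # record count plus one line for the header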

Doing that did reveal some problems, in that nothing was working. A detour through the bowels of the logs showed all sorts of network problems, with some parts of the system trying to reach ‘localhost’ and other parts the apparent hostname. After flailing around, I went back to this stage:

./bin/hdfs namenode -format
./sbin/start-dfs.sh
./bin/hdfs dfs -mkdir /user
./bin/hdfs dfs -mkdir /user/robert

and then re-ran the Pig test, which seemed to sort it out. I think it’s all predicated on the assumption that the hostname does not change, and I’ve noticed that Mac OS X has a tendency to change the hostname somewhat arbitrarily. I can probably get around this by assigning a fixed name as an alias for localhost and altering the Hadoop configuration to use that fixed name. And sacrificing a chicken at the altar of the elder gods of the network.
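
For the record, the fixed-name idea would look roughly like this; ‘hadoopbox’ is an invented alias, and I have not actually switched over to it:

# /etc/hosts: give the loopback address a name that never changes
127.0.0.1   localhost hadoopbox

with core-site.xml pointing at the alias instead of localhost:

<property>
  <name>fs.default.name</name>
  <value>hdfs://hadoopbox:9000</value>
</property>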

ADDENDUM: I spoke too soon; the Pig test threw a weird message, but it does still seem to create the output:

2015-03-09 12:31:25,901 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2015-03-09 12:31:26,904 [main] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:10020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-03-09 12:31:27,909 [main] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:10020. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-03-09 12:31:28,914 [main] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:10020. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-03-09 12:31:29,919 [main] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:10020. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-03-09 12:31:30,925 [main] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:10020. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-03-09 12:31:31,927 [main] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:10020. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-03-09 12:31:32,932 [main] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:10020. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-03-09 12:31:33,935 [main] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:10020. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-03-09 12:31:34,940 [main] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:10020. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-03-09 12:31:35,946 [main] INFO  org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:10020. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

Addendum to the Addendum
This problem turned out to be that the job history server was not running; port 10020 in the retries above is the default mapreduce.jobhistory.address, so the client was knocking on a door nobody was answering. The two .command scripts were therefore amended to start and stop that server as well:

#!/bin/bash
. ~/.bashrc
cd $HADOOP_HOME
./sbin/start-dfs.sh
./sbin/start-yarn.sh
./sbin/mr-jobhistory-daemon.sh start historyserver

#!/bin/bash
. ~/.bashrc
cd $HADOOP_HOME
./sbin/mr-jobhistory-daemon.sh stop historyserver
./sbin/stop-dfs.sh
./sbin/stop-yarn.sh
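
With the history server in the mix, jps is a quick way to confirm that everything which should be running actually is, since it lists the local Java processes by class name:

jps
# expect roughly:
#   NameNode, DataNode, SecondaryNameNode   (start-dfs.sh)
#   ResourceManager, NodeManager            (start-yarn.sh)
#   JobHistoryServer                        (mr-jobhistory-daemon.sh)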
