Creating a custom Kylo Sandbox

I had a need – or desire – to build a VM with a certain version of NiFi on it, and a handful of other Hadoop-type services, to act as a local sandbox. As I’ve mentioned before, I do find it slightly more convenient to use a single VM for a collection of services, rather than a collection of Docker images, mainly because it allows me to open the bonnet of the box and get my hands dirty fiddling with the insides of the machine. Since I wanted to be picky about what was getting installed, I opted to start from scratch rather than re-using the HDP or Kylo sandboxes.

The only real complication was that I realised I also wanted to drop Kylo onto this sandbox, a realisation which came after I’d already gone down the route of getting NiFi installed. This was entertaining, as it revealed various ways in which the documentation and scripts around installing Kylo make inadvertent hard-wired assumptions about where and how NiFi is installed, which I needed to work around.

The base of the installation was just to run up a raw CentOS 7 installation, which I covered previously, and then install Ambari as a mechanism to bootstrap the installation of the Hadoop bits and pieces I wanted. Before installing Ambari, I first cracked open a variety of network ports that I knew I’d need for NiFi and Kylo (note that unless I indicate otherwise, everything is done as root), and set the system name and timezone:

$ firewall-cmd --zone=public --permanent --add-port=8080/tcp
$ firewall-cmd --zone=public --permanent --add-port=8088/tcp
$ firewall-cmd --zone=public --permanent --add-port=8400/tcp
$ firewall-cmd --zone=public --permanent --add-port=8440/tcp
$ firewall-cmd --zone=public --permanent --add-port=8441/tcp
$ firewall-cmd --zone=public --permanent --add-port=8451/tcp
$ firewall-cmd --zone=public --permanent --add-port=18080/tcp
$ firewall-cmd --reload
$ hostnamectl set-hostname nifi.sandbox.io
$ timedatectl set-timezone UTC
$ shutdown -r now

Installing Ambari is pretty straightforward:

$ yum install wget
$ wget -nv \
http://public-repo-1.hortonworks.com/ambari/centos7/2.x/updates/2.4.0.1/ambari.repo \
-O /etc/yum.repos.d/ambari.repo
$ yum install ambari-server
$ ambari-server setup
$ ambari-server setup \
--jdbc-db=postgres \
--jdbc-driver=/usr/lib/ambari-server/postgresql-9.3-1101-jdbc4.jar
$ yum install ambari-agent

However, I did run into some problems when starting it up and trying to use the UI. First, it was necessary to disable SELinux by editing /etc/sysconfig/selinux:

SELINUX=disabled

Additionally, the generated /etc/init.d/ambari-server was just daft, and I had to modify the ROOT setting to stop it prefixing all paths with /etc. As you can see, the default construction of the root location is not particularly useful:

SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
#export ROOT=`dirname $(dirname $SCRIPT_DIR)`
#ROOT=`echo $ROOT | sed 's/\/$//'`
ROOT=""

Once Ambari was running as expected, it was a simple but tedious matter to use it to install the desired software set. I did have adventures ensuring these would be booted when the sandbox is started, which I will cover later. The set of software I chose was:

  • HDFS
  • Yarn
  • Hive
  • ZooKeeper
  • Ambari Infra
  • Ambari Metrics
  • Kafka
  • Spark
  • Spark 2
  • whatever additional prerequisites Ambari wanted based on this selection

For our purposes, the key items were Spark, Kafka and Hive; however, in order to support these, a bunch of other bits go in as well.
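
As a quick sanity check once everything was installed and reporting as started (and as a preview of the Ambari REST API that the boot-time script later on leans on), the service states can be queried directly; this assumes the cluster was named sandbox and the default admin/admin credentials:

$ curl -u admin:admin \
"http://localhost:8080/api/v1/clusters/sandbox/services?fields=ServiceInfo/state"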

As a quality of life improvement, I edited /etc/issue so that a nice message will show up on the console:

-----------------------------------------------
NiFi 1.1.0 Sandbox
-----------------------------------------------
Ambari: http://192.168.63.121:8080  (admin/admin)
NiFi:   http://192.168.63.121:8079/nifi
Kylo:   http://192.168.63.121:8400  (dladmin/thinkbig)

ssh centos@192.168.63.121 (centos)
-----------------------------------------------

\S
Kernel \r on an \m

And also preemptively made some HDFS directories that would be needed (as the hdfs user):

$ hdfs dfs -mkdir /user/admin
$ hdfs dfs -chown admin /user/admin

Next step was to install NiFi, which went pretty much as I wrote up previously:

$ adduser nifi
$ usermod -a -G hdfs nifi
$ sudo -i -u nifi
$ wget https://archive.apache.org/dist/nifi/1.1.0/nifi-1.1.0-bin.tar.gz
$ tar xvf nifi-1.1.0-bin.tar.gz
$ ln -s nifi-1.1.0 ./latest
$ ln -s nifi-1.1.0 ./current
$ exit

Then I created the systemd service file /etc/systemd/system/nifi.service:

[Unit]
Description=Apache NiFi
After=network.target

[Service]
Type=forking
User=nifi
Group=nifi
ExecStart=/home/nifi/current/bin/nifi.sh start
ExecStop=/home/nifi/current/bin/nifi.sh stop
ExecReload=/home/nifi/current/bin/nifi.sh restart

[Install]
WantedBy=multi-user.target

and enabled it:

$ systemctl daemon-reload
$ systemctl enable nifi

The final step with NiFi was to change the port it was listening on from 8080 to 8079, by editing /home/nifi/current/conf/nifi.properties, since Ambari is on port 8080, and Kylo expects NiFi to be on 8079.
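
That boils down to a one-liner against nifi.web.http.port, and since 8079 wasn’t among the ports opened earlier, it also needs punching through the firewall if the NiFi UI is to be reached from outside the VM; something along these lines:

$ sed -i 's|nifi.web.http.port=8080|nifi.web.http.port=8079|' /home/nifi/current/conf/nifi.properties
$ firewall-cmd --zone=public --permanent --add-port=8079/tcp
$ firewall-cmd --reload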

Installing Kylo was somewhat painful, with quite a bit of backwards and forwards. The manual installation instructions at Kylo.io are not as complete or accurate as they could be, and the shell scripts used to bootstrap up the manual process have a variety of inadvertent hard-wired assumptions. The process below is a distillation of the intent of those scripts, customised for the actual location of NiFi and configuration of the target environment. The process starts with obtaining and unpacking the RPM:

$ export SETUP_DIR=/opt/kylo/setup
$ useradd -U -r -m -s /bin/bash kylo
$ usermod -a -G hdfs kylo
$ mkdir -p /opt/kylo/setup
$ chown kylo:kylo /opt/kylo
$ cd /opt/kylo/setup
$ wget  https://s3-us-west-2.amazonaws.com/kylo-io/releases/rpm/0.8.2/kylo-0.8.2-1.noarch.rpm
$ rpm -ivh kylo-0.8.2-1.noarch.rpm

MySQL had been added to the virtual machine by Ambari, but without a password. This will cause problems, as the Kylo bootstrapping process assumes and requires there to be a root password. So the first thing was to set the root password (and pre-create the kylo database), working from a passwordless mysql -u root session:

set password for 'root'@'nifi.sandbox.io' = PASSWORD('hadoop');
set password for 'root'@'localhost' = PASSWORD('hadoop');
set password for 'root'@'127.0.0.1' = PASSWORD('hadoop');
create database if not exists kylo character set utf8 collate utf8_general_ci;
quit;
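
A quick check that the kylo database actually exists before moving on (using the freshly set password):

$ mysql -u root -phadoop -e "show databases like 'kylo';"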

I then installed Elasticsearch and set it running (note that this is still relative to /opt/kylo/setup):

$ ./elasticsearch/install-elasticsearch.sh
$ systemctl daemon-reload
$ systemctl enable elasticsearch.service
$ systemctl start elasticsearch.service
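
Assuming the install script leaves Elasticsearch on its default port of 9200, a quick curl confirms it is answering:

$ curl -s "http://localhost:9200/_cluster/health?pretty"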

Similarly for ActiveMQ:

$ useradd -U -r -m -s /bin/bash activemq
$ usermod -a -G users activemq
$ mkdir /opt/activemq
$ chown activemq:users /opt/activemq
$ ./activemq/install-activemq.sh /opt/activemq activemq users
$ systemctl daemon-reload
$ systemctl enable activemq
$ systemctl start activemq
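
And a similar sanity check for ActiveMQ (assuming the stock OpenWire port of 61616):

$ systemctl status activemq
$ ss -ltn | grep 61616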

Kylo needs to be told where Java is (note that when I installed Java as part of the base machine setup, I ensured that $JAVA_HOME is set for all users):

$ java/remove-default-kylo-java-home.sh /opt/kylo
$ java/change-kylo-java-home.sh $JAVA_HOME /opt/kylo
$ java/install-java-crypt-ext.sh $JAVA_HOME

The next stage is to modify NiFi to include artefacts from Kylo (support JARs and new NiFi processor NARs), and ensure various locations are as expected by Kylo:

$ systemctl stop nifi

$ NIFI_INSTALL_HOME=/home/nifi
$ mkdir $NIFI_INSTALL_HOME/nifi-1.1.0/lib/app
$ mkdir -p $NIFI_INSTALL_HOME/data/lib/app
$ cp nifi/*.nar $NIFI_INSTALL_HOME/data/lib
$ cp nifi/kylo-spark-*.jar $NIFI_INSTALL_HOME/data/lib/app

$ mkdir -p $NIFI_INSTALL_HOME/data/conf
$ mv $NIFI_INSTALL_HOME/current/conf/authorizers.xml $NIFI_INSTALL_HOME/data/conf
$ mv $NIFI_INSTALL_HOME/current/conf/login-identity-providers.xml $NIFI_INSTALL_HOME/data/conf
$ chown -R nifi:nifi $NIFI_INSTALL_HOME

$ sed -i "s|nifi.flow.configuration.file=.\/conf\/flow.xml.gz|nifi.flow.configuration.file=$NIFI_INSTALL_HOME\/data\/conf\/flow.xml.gz|" $NIFI_INSTALL_HOME/current/conf/nifi.properties
$ sed -i "s|nifi.flow.configuration.archive.dir=.\/conf\/archive\/|nifi.flow.configuration.archive.dir=$NIFI_INSTALL_HOME\/data\/conf\/archive\/|" $NIFI_INSTALL_HOME/current/conf/nifi.properties
$ sed -i "s|nifi.authorizer.configuration.file=.\/conf\/authorizers.xml|nifi.authorizer.configuration.file=$NIFI_INSTALL_HOME\/data\/conf\/authorizers.xml|" $NIFI_INSTALL_HOME/current/conf/nifi.properties
$ sed -i "s|nifi.templates.directory=.\/conf\/templates|nifi.templates.directory=$NIFI_INSTALL_HOME\/data\/conf\/templates|" $NIFI_INSTALL_HOME/current/conf/nifi.properties
$ sed -i "s|nifi.flowfile.repository.directory=.\/flowfile_repository|nifi.flowfile.repository.directory=$NIFI_INSTALL_HOME\/data\/flowfile_repository|" $NIFI_INSTALL_HOME/current/conf/nifi.properties
$ sed -i "s|nifi.content.repository.directory.default=.\/content_repository|nifi.content.repository.directory.default=$NIFI_INSTALL_HOME\/data\/content_repository|" $NIFI_INSTALL_HOME/current/conf/nifi.properties
$ sed -i "s|nifi.content.repository.archive.enabled=true|nifi.content.repository.archive.enabled=false|" $NIFI_INSTALL_HOME/current/conf/nifi.properties
$ sed -i "s|nifi.provenance.repository.directory.default=.\/provenance_repository|nifi.provenance.repository.directory.default=$NIFI_INSTALL_HOME\/data\/provenance_repository|" $NIFI_INSTALL_HOME/current/conf/nifi.properties
$ sed -i "s|nifi.provenance.repository.implementation=org.apache.nifi.provenance.PersistentProvenanceRepository|nifi.provenance.repository.implementation=com.thinkbiganalytics.nifi.provenance.repo.KyloPersistentProvenanceEventRepository|"   $NIFI_INSTALL_HOME/current/conf/nifi.properties
$ sed -i 's/NIFI_LOG_DIR=\".*\"/NIFI_LOG_DIR=\"\/var\/log\/nifi\"/' $NIFI_INSTALL_HOME/current/bin/nifi-env.sh
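
Given how easy it is to mangle one of those sed expressions, it’s worth eyeballing the result before going any further:

$ grep -E 'nifi.flow.configuration|nifi.authorizer|nifi.templates.directory|nifi.provenance.repository|nifi.content.repository' \
$NIFI_INSTALL_HOME/current/conf/nifi.properties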

The Kylo artefacts are symbolically linked into the NiFi locations (note that this is really a bit messy, and there’s no particularly good reason to have previously copied these artefacts out of their location in the Kylo installation directory; it would be cleaner to link directly to the originals rather than the copies, as sketched after the block below):

$ ln -f -s $NIFI_INSTALL_HOME/data/lib/kylo-nifi-core-service-nar-*.nar $NIFI_INSTALL_HOME/current/lib/kylo-nifi-core-service-nar.nar
$ ln -f -s $NIFI_INSTALL_HOME/data/lib/kylo-nifi-standard-services-nar-*.nar $NIFI_INSTALL_HOME/current/lib/kylo-nifi-standard-services-nar.nar
$ ln -f -s $NIFI_INSTALL_HOME/data/lib/kylo-nifi-core-v1-nar-*.nar $NIFI_INSTALL_HOME/current/lib/kylo-nifi-core-nar.nar
$ ln -f -s $NIFI_INSTALL_HOME/data/lib/kylo-nifi-spark-v1-nar-*.nar $NIFI_INSTALL_HOME/current/lib/kylo-nifi-spark-nar.nar
$ ln -f -s $NIFI_INSTALL_HOME/data/lib/kylo-nifi-hadoop-v1-nar-*.nar $NIFI_INSTALL_HOME/current/lib/kylo-nifi-hadoop-nar.nar
$ ln -f -s $NIFI_INSTALL_HOME/data/lib/kylo-nifi-hadoop-service-v1-nar-*.nar $NIFI_INSTALL_HOME/current/lib/kylo-nifi-hadoop-service-nar.nar
$ ln -f -s $NIFI_INSTALL_HOME/data/lib/kylo-nifi-elasticsearch-v1-nar-*.nar $NIFI_INSTALL_HOME/current/lib/kylo-nifi-elasticsearch-nar.nar
$ ln -f -s $NIFI_INSTALL_HOME/data/lib/kylo-nifi-provenance-repo-v1-nar-*.nar $NIFI_INSTALL_HOME/current/lib/kylo-nifi-provenance-repo-nar.nar
$ ln -f -s $NIFI_INSTALL_HOME/data/lib/app/kylo-spark-validate-cleanse-spark-v1-*-jar-with-dependencies.jar $NIFI_INSTALL_HOME/current/lib/app/kylo-spark-validate-cleanse-jar-with-dependencies.jar
$ ln -f -s $NIFI_INSTALL_HOME/data/lib/app/kylo-spark-job-profiler-spark-v1-*-jar-with-dependencies.jar $NIFI_INSTALL_HOME/current/lib/app/kylo-spark-job-profiler-jar-with-dependencies.jar
$ ln -f -s $NIFI_INSTALL_HOME/data/lib/app/kylo-spark-interpreter-spark-v1-*-jar-with-dependencies.jar $NIFI_INSTALL_HOME/current/lib/app/kylo-spark-interpreter-jar-with-dependencies.jar
$ chown -h nifi:nifi $NIFI_INSTALL_HOME/current/lib/kylo*.nar
$ chown -h nifi:nifi $NIFI_INSTALL_HOME/current/lib/app/kylo*.jar
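
To illustrate the cleaner variant mentioned above, the links could point straight at the originals under /opt/kylo/setup/nifi rather than the copies; for example, for the core service NAR:

$ ln -f -s /opt/kylo/setup/nifi/kylo-nifi-core-service-nar-*.nar $NIFI_INSTALL_HOME/current/lib/kylo-nifi-core-service-nar.nar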

There are some directories that Kylo expects:

$ mkdir /home/nifi/mysql /home/nifi/ext-config /home/nifi/h2 /home/nifi/activemq /home/nifi/feed_flowfile_cache
$ chown -R nifi:nifi /home/nifi/mysql /home/nifi/ext-config /home/nifi/h2 /home/nifi/activemq /home/nifi/feed_flowfile_cache

$ cp nifi/config.properties /home/nifi/ext-config/
$ chown nifi:nifi /home/nifi/ext-config/config.properties

$ cp nifi/activemq/*jar /home/nifi/activemq/
$ chown -R nifi:nifi /home/nifi/activemq

$ sed -i "s|kylo.provenance.cache.location=\/opt\/nifi\/feed-event-statistics.gz|kylo.provenance.cache.location=/home\/nifi\/feed-event-statistics.gz|" /home/nifi/ext-config/config.properties

$ mkdir -p /var/dropzone
$ chown nifi:nifi /var/dropzone

$ mkdir /opt/nifi
$ ln -s /home/nifi/current /opt/nifi/current

Kylo services need to be told where NiFi is (/opt/kylo/kylo-services/conf/application.properties):

config.nifi.home=/home/nifi
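
The NiFi REST connection settings live in the same file; I believe the relevant Kylo 0.8.x properties are nifi.rest.host and nifi.rest.port, and the defaults already line up with the 8079 port chosen earlier, but they are worth double-checking:

nifi.rest.host=localhost
nifi.rest.port=8079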

Finally, a configuration option needs to be added to the NiFi bootstrap.conf (/home/nifi/current/conf/bootstrap.conf); the index just needs to be one that isn’t already taken by an existing java.arg.* entry:

java.arg.15=-Dkylo.nifi.configPath=/home/nifi/ext-config

At this stage, I was able to start NiFi using systemctl start nifi. I confirmed that NiFi looked healthy, then started up Kylo using the start/stop scripts in /opt/kylo. I was happy to see that Kylo created its database schema in the MySQL instance correctly, but less happy that there were no templates in either Kylo or NiFi.

It turns out these need to be created manually, using some REST invocations taken from https://github.com/ThinkBigAnalytics/kylo-devops/blob/master/packer/common/kylo/install-samples.sh. This was a flaky process, and I went through several iterations of applying one or more of these invocations and manually removing the results until the templates and processors were in place. In particular, one of these invocations created controller services for JMS and JDBC in NiFi that were misconfigured, and their configuration had to be corrected before the other invocations would work.

$ curl -i -X POST -u dladmin:thinkbig -H "Content-Type: multipart/form-data" \
-F "overwrite=true" \
-F "categorySystemName=" \
-F "importConnectingReusableFlow=NOT_SET" \
-F "file=@/opt/kylo/setup/data/feeds/nifi-1.0/index_schema_service_elasticsearch.feed.zip" \
http://localhost:8400/proxy/v1/feedmgr/admin/import-feed

$ curl -i -X POST -u dladmin:thinkbig -H "Content-Type: multipart/form-data" \
-F "overwrite=true" \
-F "categorySystemName=" \
-F "importConnectingReusableFlow=NOT_SET" \
-F "file=@/opt/kylo/setup/data/feeds/nifi-1.0/index_text_service_elasticsearch.feed.zip" \
http://localhost:8400/proxy/v1/feedmgr/admin/import-feed

$ curl -i -X POST -u dladmin:thinkbig -H "Content-Type: multipart/form-data" \
-F "overwrite=true" \
-F "createReusableFlow=false" \
-F "importConnectingReusableFlow=YES" \
-F "file=@/opt/kylo/setup/data/templates/nifi-1.0/data_ingest.zip" \
http://localhost:8400/proxy/v1/feedmgr/admin/import-template

$ curl -i -X POST -u dladmin:thinkbig -H "Content-Type: multipart/form-data" \
-F "overwrite=true" \
-F "createReusableFlow=false" \
-F "importConnectingReusableFlow=YES" \
-F "file=@/opt/kylo/setup/data/templates/nifi-1.0/data_transformation.zip" \
http://localhost:8400/proxy/v1/feedmgr/admin/import-template

$ curl -i -X POST -u dladmin:thinkbig -H "Content-Type: multipart/form-data" \
-F "overwrite=true" \
-F "createReusableFlow=false" \
-F "importConnectingReusableFlow=NOT_SET" \
-F "file=@/opt/kylo/setup/data/templates/nifi-1.0/data_confidence_invalid_records.zip" \
http://localhost:8400/proxy/v1/feedmgr/admin/import-template

At this point, rebooting the VM revealed that while NiFi and Kylo were attempting to start during boot, the Hadoop services were not. To rectify that, we need to set up a startup script that pokes Ambari’s API. There are two parts to this, the systemd service (/usr/lib/systemd/system/hadoop.service) and the script itself:

[Unit]
Description=Hadoop Start
Requires=ambari-server.service
Before=nifi.service kylo-services.service kylo-ui.service kylo-spark-shell.service
After=ambari-server.service ambari-agent.service

[Service]
Type=oneshot
User=root
Group=root
ExecStart=/opt/hadoop-start.sh
TimeoutSec=2000
TimeoutStartSec=2000

[Install]
WantedBy=multi-user.target

The startup script (/opt/hadoop-start.sh) was pinched from the Kylo sandbox, but tweaked to provide some pauses to allow Ambari to catch up (after it responds to the initial ping request, it’s not necessarily ready to accept action requests):

#!/bin/bash

USER=admin
PASSWORD=admin
AMBARI_HOST=localhost
CLUSTER=sandbox

a=1
until [ $a -eq 180 ]
do
  echo "Attempting to connect [$a]"
  a=`expr $a + 1`
  curl -u $USER:$PASSWORD -XGET http://$AMBARI_HOST:8080/api/v1/clusters/$CLUSTER/services/HIVE?fields=ServiceInfo
  if [ "$?" = "7" ]; then
    echo 'connection refused or cant connect to server/proxy';
    sleep 10s;
  else
    echo "Ambari is now listening for REST calls"
    break;
  fi
done

echo "Sleeping since making the REST calls to quick may fail"
sleep 30s

for SERVICE in ZOOKEEPER HDFS YARN MAPREDUCE2 HIVE SPARK SPARK2
do
  echo "starting $SERVICE"
  curl -u $USER:$PASSWORD -i -H "X-Requested-By: ambari" -X PUT -d "{\"RequestInfo\": {\"context\" :\"Start $SERVICE via REST\"}, \"Body\": {\"ServiceInfo\": {\"state\": \"STARTED\"}}}" http://$AMBARI_HOST:8080/api/v1/clusters/$CLUSTER/services/$SERVICE
done
echo "** WAITING FOR HADOOP **"

Of course after creating these, we need to ensure that the privileges on the script are ok, and that systemd has been tickled:

$ chown root:root /opt/hadoop-start.sh
$ chmod 700 /opt/hadoop-start.sh
$ systemctl daemon-reload
$ systemctl enable hadoop.service

Rebooting the VM, I found that everything now starts at boot time (even if it does take 7+ minutes), and I could access and use both Kylo and NiFi as expected. Well, almost as expected. It turns out that my usual smoke test with Kylo – ingest a CSV into a Hive Table, and then fiddle with the ad-hoc Kylo query page to examine the resulting table – does not work out of this box. There’s a known problem around empty ORC tables, and there’s a defined fix for HDP and Kylo, by editing the Spark configuration (/usr/hdp/current/spark-client/conf/spark-defaults.conf):

spark.sql.hive.convertMetastoreOrc false
spark.sql.hive.convertMetastoreParquet false
