6 Easy Steps: Deploy Pivotal's Hadoop on Docker

While Hadoop is becoming more and more mainstream, many development leaders want to speed up and reduce errors in their development and deployment processes (i.e., devops) by using platforms like PaaS and lightweight runtime containers. One of the most interesting recent stats in the devops arena is that companies with high-performing devops processes can ship code 30x more frequently and complete deployment processes 8,000 times faster.

To this end, Docker is a new-but-rising lightweight virtualization solution (more precisely, a lightweight Linux isolation container). Basically, Docker allows you to package and configure a run-time and deploy it on Linux machines—it’s build-once-run-anywhere, isolated like a virtual machine, and runs faster and lighter than traditional VMs. Today, I will show you how two components of the Pivotal Big Data Suite—our Hadoop distribution, Pivotal HD, and our SQL interface on Hadoop, HAWQ—can be quickly and easily set up to run on a developer laptop with Docker.

With the Docker model, we can literally turn heavyweight app environments on and off like a light switch! The steps below typically take less than 30 minutes: 1) Download and Import the Docker Images, 2) Run the Docker Containers, 3) SSH in to the Environment to Start Pivotal HD, 4) Test Hadoop’s HDFS and MapReduce, 5) Start HAWQ—SQL on Hadoop, and 6) Test HAWQ.

If you would prefer to see a video, here is a demonstration of Hadoop on Docker.

Hadoop on Docker—Architecture

This diagram explains the overall deployment of Pivotal HD and HAWQ across several Docker containers. Basically, the workloads run on a Hadoop master node (namenode, etc.) with some Hadoop slave nodes (datanode, etc.) and a HAWQ master with two segment servers.


There are a few other components worth mentioning:

  • tar files – These are the Docker image files. In the future, we plan to upload these to Docker’s repository so that you can pull them directly from Docker. Currently, you need to download a gzipped file from our repository.
  • Containers – These are the Docker containers that contain Pivotal Command Center (our Pivotal HD cluster orchestration tool) and the deployed Pivotal HD and HAWQ components. You will NOT have to install and deploy Pivotal HD. It is already built as part of the Docker files!
  • Other libraries – DNS and SSH servers are set up to work for the cluster.

That’s it. You don’t need any other files—the tar images contain everything you need to set this up on your own laptop or development environment.

Hadoop on Docker—Environments and Prerequisites

Currently, I run this entire environment on my development laptop, and the specs are below. It’s a decent set-up in terms of compute and memory:

  1. Ubuntu 13.10 (on Windows 7 using VirtualBox)
  2. 2 CPUs allocated (Intel i5, 2.60GHz)
  3. 10GB of memory allocated

In addition, I run this Hadoop on Docker environment on an Amazon Web Services Ubuntu 13.10 64-bit m2.xlarge virtual machine. If you are using Amazon, make sure that your root directory has plenty of space, since /var/lib/docker will be used for image extraction. The AMI is ubuntu-saucy-13.10-amd64-server-20140212.
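Since the extracted images take several gigabytes, it is worth checking the free space under Docker's storage directory before importing. A quick check (assuming the default /var/lib/docker location):

```shell
# Show free space on the filesystem that holds Docker's storage;
# /var/lib/docker is the default location. Fall back to /var if the
# directory does not exist yet.
df -h /var/lib/docker 2>/dev/null || df -h /var
```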

There aren’t many other software prerequisites. Basically, you need Docker 0.9.1 installed, and the Docker team has a good setup document.

In theory, this Hadoop on Docker install should work on all Linux systems, but I have only tested it on Ubuntu 13.10 64-bit. Also, make sure that you aren’t running any containers (i.e., the docker ps command does not return any container IDs). There are some hardcoded values and limitations, which I will fix in the future.
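A small guard for the empty-docker-ps requirement could look like the sketch below (it assumes the docker CLI is on your PATH):

```shell
# Warn if any containers are already running; docker ps -q prints one
# container ID per line, so empty output means a clean slate.
if [ -n "$(docker ps -q 2>/dev/null)" ]; then
  echo "Stop or remove the running containers before continuing." >&2
fi
```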

6 Simple Steps to Start Hadoop with SQL on Docker

1. Download and Import the Hadoop on Docker Images

First, you are going to download the tarball, extract it, and import the images into Docker. Remember to verify that the md5 checksum of the download matches the published hash.

tar -xvf phd-docker.tar.gz
sudo su # Make sure you run the docker command as root.
cd images
cat phd-master.tar | docker import - phd:master
cat phd-slave1.tar | docker import - phd:slave1
cat phd-slave2.tar | docker import - phd:slave2
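For the md5 check, md5sum -c verifies a file against a stored checksum. The sketch below uses a throwaway demo file, since the tarball's published hash isn't reproduced here; for the real download, put the published md5 and the tarball name into a .md5 file instead:

```shell
# Demonstration with a throwaway file; substitute the real tarball and
# its published checksum in practice.
printf 'demo payload\n' > phd-docker.demo
md5sum phd-docker.demo > phd-docker.demo.md5
md5sum -c phd-docker.demo.md5   # prints "phd-docker.demo: OK" if intact
```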

2. Run the Hadoop on Docker Containers

Once the images are imported, we can run the Hadoop on Docker containers. This part of the process requires some parameters, but don’t be afraid; it should work in your environment without any problem.

# Set a variable
DOCKER_OPTION="--privileged --dns --dns -d -t -i"

# Start master container
docker run ${DOCKER_OPTION} -p 5443 --name phd-master -h phd:master bash -c "/tmp/phd-docker/bin/"

# Start slave containers
for x in {1..2} ; do docker run ${DOCKER_OPTION} --name phd-slave${x} -h slave${x} phd:slave${x} /tmp/phd-docker/bin/ ; done
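If you want to preview exactly what the loop expands to before running it, prefix the command with echo. This dry run uses a placeholder DOCKER_OPTION value, since the real one above includes your DNS settings:

```shell
# Dry run: print each docker run invocation instead of executing it.
DOCKER_OPTION="--privileged -d -t -i"   # placeholder; use your real options
for x in {1..2} ; do
  echo docker run ${DOCKER_OPTION} --name phd-slave${x} -h slave${x} phd:slave${x} /tmp/phd-docker/bin/
done
```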

Wow, within a second or two, you have three nodes (VMs, if you wish to call them that) running on your machine. Pivotal Command Center, Pivotal HD, and HAWQ are all deployed.

Isn’t that lightning fast?

3. SSH and Start the Pivotal Hadoop on Docker Cluster

Now we can log in (ssh) to the master container and start the Pivotal HD cluster.

ssh root@ # Password: changeme

# Make sure all services are running. Wait a few moments if any of them are not running yet.
service commander status

# Login as PHD admin user
su - gpadmin

# Start the cluster
icm_client start -l test

4. Test HDFS and MapReduce for Hadoop on Docker

At this point, Pivotal Hadoop is running, and HAWQ is not started yet. Before we start HAWQ, we should test to see if Hadoop is running.

First, we go to the web-based user interfaces for the Hadoop status:

  • Go to – This shows the HDFS status. If you see 2 Live Nodes, it is a good sign! Your data nodes on the Docker containers are connected to your master container.
  • Go to – This shows the MapReduce status. You can check job status while you are trying MapReduce in the next section.

Now, we can run a simple word count test within MapReduce and check the web UI while the job is running.

# Simple ls command
hadoop fs -ls /

# Make an input directory
hadoop fs -mkdir /tmp/test_input

# Copy a text file
hadoop fs -copyFromLocal /usr/lib/gphd/hadoop/CHANGES.txt /tmp/test_input

# Run a wordcount MapReduce!
hadoop jar /usr/lib/gphd/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount /tmp/test_input /tmp/test_output

# Check the result
hadoop fs -cat /tmp/test_output/part*
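Each line of the wordcount output is a word followed by its count. To get a feel for that format without a cluster, the standard Unix pipeline below produces the same shape from a small sample file:

```shell
# Local stand-in for the MapReduce wordcount output (word<TAB>count)
printf 'the quick fox\nthe lazy dog\n' > sample.txt
tr -s ' ' '\n' < sample.txt | sort | uniq -c | awk '{print $2 "\t" $1}'
```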

5. Start HAWQ—SQL on Hadoop

Now that we have confirmed Hadoop is running fine on Docker, we can start the HAWQ cluster. In this example, the HAWQ master runs on the slave1 machine, so remember to log into slave1.

ssh slave1 # which is hawq master
su - gpadmin

# Source the environment variables for HAWQ.
source /usr/lib/gphd/hawq/

# SSH keys should be set among HAWQ cluster nodes.
echo -e "slave1\nslave2" > HAWQ_HOSTS.txt
gpssh-exkeys -f HAWQ_HOSTS.txt # gpadmin's password is gpadmin
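Note that echo -e is a bashism; if your shell's echo does not interpret escape sequences, printf is a portable way to write the host list with one hostname per line:

```shell
# Portable alternative for creating the HAWQ host list
printf 'slave1\nslave2\n' > HAWQ_HOSTS.txt
cat HAWQ_HOSTS.txt
```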

# Initialize HAWQ cluster.
/etc/init.d/hawq init

6. Test HAWQ—SQL on Hadoop

Since the initialization completed, HAWQ is running. Let’s test it.

# On slave1, login as gpadmin and source the environment variables if you haven’t already.
su - gpadmin
source /usr/lib/gphd/hawq/

# Postgres shell
psql -p 5432

-- Inside psql: create a table
create table test1 (a int, b text);

-- Insert a couple of records
insert into test1 values (1, 'text value1');
insert into test1 values (2, 'text value2');

-- Confirm it returns the rows
select * from test1;

# Exit psql with \q, then find how the rows are stored in HDFS
hadoop fs -cat /hawq_data/gpseg*/*/*/* # This shows a raw HAWQ file on HDFS

Well done! Pivotal HD and HAWQ are running on your laptop within the Docker container.

Cleaning Up Your Mess

In my opinion, cleanup is the beauty of container solutions like Docker. You can make any mess you like, and then you can just kill it; everything is gone. Of course, VMs are a good solution too, but their stop and start commands take much longer than with a container like Docker. Here is how to clean up your environment:

# docker ps -a -q lists all container IDs, and docker rm -f removes them
docker ps -a -q | xargs docker rm -f

After running this, none of the services, web UI links, or commands should work. Similarly, you can no longer ssh to the nodes we’ve created. Everything is clean.

Thank you for reading and watching. I hope you enjoyed it!
