Cloudera on Docker on AWS

Running Cloudera Quickstart on Docker on AWS EC2.

I needed a test Cloudera environment, and my MacBook Pro with 16GB of RAM was running hot with the Cloudera Quickstart Docker image, so I decided to run it on EC2.

This is the setup.

Configuring and Running an EC2 Instance

I went for:

  • Instance type: t2.2xlarge
  • AMI: Canonical, Ubuntu
  • Security Group that allows 8888 (for Hue), 7180 (for Cloudera Manager) and 8088 (for YARN)
  • 32 GB EBS on SSD

(I first tried with my default, Amazon Linux, and the Docker daemon didn’t start. I didn’t have time to troubleshoot that, so I just went for Ubuntu and it worked first time.)
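
If you prefer to script the security group rather than click through the console, something like the following might do it with the AWS CLI. This is only a sketch: SG_ID and MY_CIDR are placeholders I made up, open_cloudera_ports is a hypothetical helper name, and DRY_RUN=1 just prints the commands instead of calling AWS.

```shell
# Placeholders (assumptions): substitute your own security group ID and source IP range.
SG_ID="${SG_ID:-sg-0123456789abcdef0}"
MY_CIDR="${MY_CIDR:-203.0.113.10/32}"

# Hypothetical helper: allow inbound TCP on the Hue, Cloudera Manager and YARN ports.
open_cloudera_ports() {
  local port
  for port in 8888 7180 8088; do
    # With DRY_RUN set, the command is echoed rather than executed.
    ${DRY_RUN:+echo} aws ec2 authorize-security-group-ingress \
      --group-id "$SG_ID" --protocol tcp --port "$port" --cidr "$MY_CIDR"
  done
}
```

Running `DRY_RUN=1 open_cloudera_ports` first is a cheap way to check the three commands before letting them touch the real group. Restricting the CIDR to your own address seems sensible, given the default logins on this image.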

Downloading and running the Cloudera Quickstart Docker Image

Install Docker:

sudo apt-get update
sudo apt-get install docker.io

And check it is working:

sudo docker run hello-world

Then get the Cloudera Quickstart image (it’s 4.4 GB, so it will take a moment):

sudo docker pull cloudera/quickstart

Get its Image ID:

sudo docker images
REPOSITORY              TAG         IMAGE ID         CREATED       SIZE
cloudera/quickstart     latest      4239cd2958c6     2 years ago   6.34 GB
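
Rather than copying the ID by hand, it can be captured in a variable. A minimal sketch, assuming the image was pulled as cloudera/quickstart:latest; image_id is a hypothetical helper name of mine, and `docker images -q` prints only the short image ID.

```shell
# Hypothetical helper: print the short ID of a local image, e.g. cloudera/quickstart:latest.
image_id() {
  # -q restricts the output to image IDs; head keeps the first match only.
  sudo docker images -q "$1" | head -n 1
}
```

Usage would be `IMAGE_ID="$(image_id cloudera/quickstart:latest)"`. Docker also accepts the repository name itself in place of the ID, so this step is optional.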

Run it:

sudo docker run \
--privileged=true -ti \
-d \
-p 8888:8888 \
-p 8088:8088 \
-p 80:80 \
-p 7180:7180 \
--name cdh \
--hostname=quickstart.cloudera 4239cd2958c6 /usr/bin/docker-quickstart

Start Services via Cloudera Manager

Browse to Cloudera Manager (using the public DNS name of your EC2 instance, port 7180) and log in (default credentials cloudera/cloudera).
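
The web UI may not answer straight away, so polling the port from your own machine before opening the browser can save some refreshing. This is a rough sketch, not part of my original setup: cm_url and wait_for_cm are hypothetical helpers, the hostname is whatever public DNS name your instance has, and POLL_SECONDS is a variable I invented for the delay between attempts.

```shell
# Hypothetical helper: build the Cloudera Manager URL from the instance's public DNS name.
cm_url() {
  echo "http://$1:7180"
}

# Hypothetical helper: poll the URL until it answers, up to $2 attempts (default 30).
wait_for_cm() {
  local url attempts i
  url="$(cm_url "$1")"
  attempts="${2:-30}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # curl exits 0 as soon as the server returns any HTTP response.
    if curl -s -o /dev/null --max-time 5 "$url"; then
      echo "Cloudera Manager is answering at $url"
      return 0
    fi
    i=$((i + 1))
    sleep "${POLL_SECONDS:-10}"
  done
  echo "Gave up waiting for $url" >&2
  return 1
}
```

A failure here usually means either the service is not listening yet or the security group is not letting port 7180 through.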

Troubleshooting

Clock Offset: “The host’s NTP service could not be located or did not respond to a request for the clock offset.” I’ve not fixed this yet, but it’s affecting the health check status of several services.

Load Data

Get some data into the container:

sudo docker cp dictionary.txt cdh:/home/cloudera/
sudo docker cp events.csv cdh:/home/cloudera/
sudo docker cp ginf.csv cdh:/home/cloudera/

And copy it into HDFS:

sudo docker exec cdh hdfs dfs -put /home/cloudera/events.csv
sudo docker exec cdh hdfs dfs -put /home/cloudera/dictionary.txt
sudo docker exec cdh hdfs dfs -put /home/cloudera/ginf.csv
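
With more than a handful of files, the two steps above can be folded into one loop. A sketch under the same assumptions as the commands above (container name cdh, staging directory /home/cloudera/); load_into_hdfs is a hypothetical helper name. Since `hdfs dfs -put` is given no destination, files land in the HDFS home directory of the user running the command.

```shell
# Hypothetical helper: copy each local file into the container, then put it into HDFS.
# Usage: load_into_hdfs <container> <file>...
load_into_hdfs() {
  local container file
  container="$1"
  shift
  for file in "$@"; do
    # Stop at the first failure so a missing file does not go unnoticed.
    sudo docker cp "$file" "$container:/home/cloudera/" &&
      sudo docker exec "$container" hdfs dfs -put "/home/cloudera/$(basename "$file")" ||
      return 1
  done
  # List the HDFS home directory to confirm the uploads landed.
  sudo docker exec "$container" hdfs dfs -ls
}
```

For the three files here that would be `load_into_hdfs cdh dictionary.txt events.csv ginf.csv`.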

Make a Hive Table

Get into the container:

sudo docker exec -it cdh /bin/bash

And launch Beeline:

beeline -u "jdbc:hive2://172.17.0.2:10000/default;user=cloudera;password=cloudera" --silent

create database football;

create external table football.events (
   id_odsp        STRING
  ,id_event       STRING
  ,sort_order     INT
  ,time           INT
  ,text           STRING
  ,event_team     STRING
  ,event_type2    INT
  ,side           INT
  ,opponent       STRING
  ,player         STRING
  ,player2        STRING
  ,player_in      STRING
  ,player_out     STRING
  ,shot_place     INT
  ,shot_outcome   INT
  ,is_goal        INT
  ,`location`     INT
  ,bodypart       INT
  ,assist_method  INT
  ,situation      INT
  ,fast_break     INT
)
COMMENT 'Football data from Kaggle'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
;

LOAD DATA INPATH 'hdfs:///user/root/events.csv' into table football.events;
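
To check the load from the EC2 host without opening an interactive session, Beeline can run a single statement with -e. A small sketch, assuming the same container name and JDBC URL as above; hive_query is a hypothetical wrapper name.

```shell
# Hypothetical wrapper: run one HiveQL statement through Beeline inside the cdh container.
hive_query() {
  # The URL is quoted so the shell does not treat its semicolons as command separators.
  sudo docker exec cdh beeline \
    -u "jdbc:hive2://172.17.0.2:10000/default;user=cloudera;password=cloudera" \
    --silent -e "$1"
}
```

For example, `hive_query 'select count(*) from football.events;'` should return a row count if the load worked.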