Cloudera on Docker on AWS
Running Cloudera Quickstart on Docker on AWS EC2.
I needed a test Cloudera environment, and my MacBook Pro with 16 GB of RAM was running hot with the Cloudera Quickstart Docker image, so I decided to run it on EC2 instead.
This is the setup.
Configuring and Running an EC2 Instance
I went for:
- Instance type: t2.2xlarge
- AMI: Canonical, Ubuntu
- Security group that allows 8888 (for Hue), 7180 (for Cloudera Manager) and 8088 (for YARN)
- 32 GB EBS on SSD
(I first tried my default, Amazon Linux, and the Docker daemon didn’t start. I didn’t have time to troubleshoot that, so I just went for Ubuntu and it worked first time.)
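The security group rules listed above can also be added from the AWS CLI. This is a minimal sketch, assuming the CLI is installed and configured; the group ID and source CIDR are hypothetical placeholders, and the function only prints the commands so they can be reviewed before anything is changed.

```shell
# Placeholders (assumptions, not values from this setup): replace with your own.
SG_ID="${SG_ID:-sg-0123456789abcdef0}"   # hypothetical security group ID
MY_CIDR="${MY_CIDR:-203.0.113.10/32}"    # hypothetical source address range

# Print one ingress-rule command per port: Hue, Cloudera Manager, YARN.
print_ingress_commands() {
  for port in 8888 7180 8088; do
    echo "aws ec2 authorize-security-group-ingress" \
         "--group-id $SG_ID --protocol tcp --port $port --cidr $MY_CIDR"
  done
}

print_ingress_commands
```

Once the values look right, the output can be piped to sh. Limiting the CIDR to your own address keeps the default Cloudera logins off the open internet.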
Downloading and running the Cloudera Quickstart Docker Image
Install Docker:
sudo apt-get update
sudo apt-get install docker.io
And check it is working:
sudo docker run hello-world
Then get the Cloudera Quickstart image (it’s 4.4 GB, so it will take a moment):
sudo docker pull cloudera/quickstart
Get its Image ID:
sudo docker images
REPOSITORY            TAG      IMAGE ID       CREATED       SIZE
cloudera/quickstart   latest   4239cd2958c6   2 years ago   6.34 GB
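Rather than copying the ID by hand, it can be captured in a variable. A small sketch: the function name is my own, and the `-q` flag makes `docker images` print only image IDs.

```shell
# Print the image ID of cloudera/quickstart:latest (empty if not pulled yet).
quickstart_image_id() {
  sudo docker images -q cloudera/quickstart:latest | head -n 1
}

# Usage (needs the Docker daemon running):
#   IMAGE_ID="$(quickstart_image_id)"
#   echo "$IMAGE_ID"
```

The repository name cloudera/quickstart can also be passed to docker run directly in place of the ID.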
Run it:
sudo docker run \
  --privileged=true -ti \
  -d \
  -p 8888:8888 \
  -p 8088:8088 \
  -p 80:80 \
  -p 7180:7180 \
  --name cdh \
  --hostname=quickstart.cloudera \
  4239cd2958c6 /usr/bin/docker-quickstart
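Because the container runs detached, it is worth checking that it actually came up. A minimal sketch, assuming the container name cdh from the command above; the function name is hypothetical.

```shell
# Show whether the container named "cdh" is running, then its recent startup log.
check_cdh_container() {
  sudo docker ps --filter "name=cdh" --format "{{.Names}}: {{.Status}}"
  sudo docker logs --tail 20 cdh
}

# Usage (needs the Docker daemon running):
#   check_cdh_container
```

If the first command prints nothing, the container has exited, and sudo docker ps -a will show it with its exit status.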
Start Services via Cloudera Manager
Browse to Cloudera Manager (using the public DNS of your EC2 instance, port 7180) and log in (default username/password: cloudera/cloudera).
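For reference, the published ports map to the following URLs. A small sketch: the hostname below is a made-up example, so substitute the public DNS of your own instance.

```shell
# Print the web UI URLs for a given EC2 public DNS name.
cdh_urls() {
  host="$1"
  echo "Cloudera Manager: http://$host:7180"
  echo "Hue:              http://$host:8888"
  echo "YARN:             http://$host:8088"
}

# Hypothetical hostname, for illustration only.
cdh_urls "ec2-203-0-113-10.compute-1.amazonaws.com"
```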
Troubleshooting
Clock Offset: The host’s NTP service could not be located or did not respond to a request for the clock offset. I’ve not fixed this yet, but it’s impacting the health check status of several services.
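A workaround often suggested for this warning is to start the NTP daemon inside the container. This is an untested sketch: it assumes the image ships an ntpd init script that the service command can start, which I have not confirmed, and the function name is hypothetical.

```shell
# Assumption: the container has an ntpd service that `service` can start.
start_ntpd_in_cdh() {
  sudo docker exec cdh service ntpd start
}

# Usage (needs the cdh container running):
#   start_ntpd_in_cdh
```

If it works, the clock offset health check in Cloudera Manager should clear after its next run.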
Load Data
Get some data into the container:
sudo docker cp dictionary.txt cdh:/home/cloudera/
sudo docker cp events.csv cdh:/home/cloudera/
sudo docker cp ginf.csv cdh:/home/cloudera/
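And copy it into HDFS (with no destination given, -put writes to the HDFS home directory of the user running the command, here /user/root):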
sudo docker exec cdh hdfs dfs -put /home/cloudera/events.csv
sudo docker exec cdh hdfs dfs -put /home/cloudera/dictionary.txt
sudo docker exec cdh hdfs dfs -put /home/cloudera/ginf.csv
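The two steps above can be combined into one loop over the files. A minimal sketch, assuming the container name cdh and the /home/cloudera staging directory used above; the function name is my own.

```shell
# Copy each file into the container, then put it into HDFS.
# With no destination, -put writes to the HDFS home directory of the calling user.
load_into_hdfs() {
  for f in "$@"; do
    sudo docker cp "$f" cdh:/home/cloudera/ || return 1
    sudo docker exec cdh hdfs dfs -put "/home/cloudera/$(basename "$f")" || return 1
  done
}

# Usage (needs the cdh container running):
#   load_into_hdfs dictionary.txt events.csv ginf.csv
```

Afterwards, sudo docker exec cdh hdfs dfs -ls should list the three files.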
Make a Hive Table
Get into the container:
sudo docker exec -it cdh /bin/bash
And launch Beeline:
beeline -u "jdbc:hive2://172.17.0.2:10000/default;user=cloudera;password=cloudera" --silent=true
Then, at the Beeline prompt, create the database and table and load the file from HDFS:
create database football;
create external table football.events (
id_odsp STRING
,id_event STRING
,sort_order INT
,time INT
,text STRING
,event_team STRING
,event_type2 INT
,side INT
,opponent STRING
,player STRING
,player2 STRING
,player_in STRING
,player_out STRING
,shot_place INT
,shot_outcome INT
,is_goal INT
,`location` INT
,bodypart INT
,assist_method INT
,situation INT
,fast_break INT
)
COMMENT 'Football data from Kaggle'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
;
LOAD DATA INPATH 'hdfs:///user/root/events.csv' into table football.events;
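To check the load from the EC2 host without opening an interactive session, Beeline can run a single statement with -e. A minimal sketch, reusing the JDBC address and credentials from above; the function name is hypothetical.

```shell
# Run a single HiveQL statement through Beeline inside the cdh container.
# The JDBC URL is quoted so the shell does not treat ';' as a command separator.
hive_query() {
  sudo docker exec cdh beeline \
    -u "jdbc:hive2://172.17.0.2:10000/default;user=cloudera;password=cloudera" \
    --silent=true -e "$1"
}

# Usage (needs the cdh container running):
#   hive_query "select count(*) from football.events;"
```

Note that LOAD DATA INPATH moves the file rather than copying it, so events.csv will no longer appear under /user/root afterwards.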