Setting up a Dremio test environment with Docker
I needed to run some experiments on Dremio, an OSS lakehouse solution. Instead of installing Dremio manually (I’m using a macOS workstation), I decided to use Docker. To do something useful, I also needed an RDBMS instance to query through Dremio. I picked MySQL 8.4.0, again as a Docker image. The first step is to find the right images on Docker Hub.
Btw, I don’t use Docker Desktop, since I’m running this on corporate hardware and we don’t have Docker licenses. Instead, I’ve adopted Rancher Desktop, which works pretty well, does the same job, and doesn’t force the user to buy a license.
Docker images offer a set of advantages, some of them being:
- Quick go-live: just ask Docker to run an image and it’ll download everything and set it up for you
- Composability: if you need more than one container, as we do, you can leverage tools like docker-compose to create a testbed in moments
To describe the testbed, I’ve written a docker-compose.yml file that declares several sections. The file uses version 3 of the format:
version: '3'
The first thing to do is to declare a common network that the two containers are going to share:
networks:
  dremio:
    name: dremio
Then in the services section of the file, we can first define the MySQL instance:
services:
  mysql:
    image: mysql:8.4.0
    hostname: mysql
    ports:
      - "3306:3306"
      - "33060:33060"
    networks:
      - dremio
    environment:
      MYSQL_ROOT_PASSWORD: weakPassword
I’ve included the MySQL root password here only because this is a local test environment. This is not acceptable in production, where proper secret management should be adopted.
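Even for a local testbed, the password can be kept out of the compose file. The sketch below is one possible approach, not a replacement for real secret management: it assumes Compose picks up a `.env` file from the project directory for variable substitution, and that the compose file is changed to read `MYSQL_ROOT_PASSWORD: ${MYSQL_ROOT_PASSWORD}`. The helper name `write_env_file` is made up for this example.

```shell
# Hypothetical helper: writes MYSQL_ROOT_PASSWORD=<32 hex chars> to the given file.
write_env_file() {
  env_file=${1:-.env}
  # 16 random bytes rendered as 32 lowercase hex characters
  password=$(head -c 16 /dev/urandom | od -An -tx1 | tr -d ' \n')
  # The subshell limits the restrictive umask to this single write,
  # so a newly created file is readable by its owner only.
  ( umask 077; printf 'MYSQL_ROOT_PASSWORD=%s\n' "$password" > "$env_file" )
}

# Create the file next to docker-compose.yml
write_env_file .env
```

The `.env` file should stay out of version control, for instance by listing it in `.gitignore`.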
With this file in place, I’ve used docker-compose to start MySQL:
$ docker-compose up -d mysql
Now it’s time to define the Dremio instance as well:
services:
  mysql:
    ...
  dremio:
    image: dremio/dremio-oss
    hostname: dremio
    networks:
      - dremio
    ports:
      - "9047:9047" # Web UI (HTTP)
      - "31010:31010" # ODBC/JDBC clients
      - "32010:32010" # Apache Arrow Flight clients
      - "45678:45678"
Now I can start Dremio as well with:
$ docker-compose up -d dremio
Unfortunately, Dremio kept shutting down with exit code 137. That code means the process was killed with SIGKILL (128 + 9), which for a container usually points to insufficient memory, so I increased the Docker VM memory to 8 GB, restarted the Rancher environment and tried again. This time the container started without problems.
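As a small illustration of how to read such an exit code, the sketch below maps a code above 128 back to the signal that terminated the process. The function name `signal_from_exit_code` is made up for this example; it relies only on the shell's `kill -l`.

```shell
# Hypothetical helper: translate a container exit code into a signal name.
# By shell convention, exit codes above 128 mean "killed by signal (code - 128)".
signal_from_exit_code() {
  code=$1
  if [ "$code" -gt 128 ]; then
    # kill -l <number> prints the signal name without the SIG prefix
    kill -l "$((code - 128))"
  else
    echo "none"
  fi
}

signal_from_exit_code 137
```

To check whether Docker itself recorded an out-of-memory kill, `docker inspect --format '{{.State.ExitCode}} {{.State.OOMKilled}}' <container>` can be run against the stopped container.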
I can see that both containers are running:
$ docker ps | grep dremio_lab
f40c36bc9e7e dremio/dremio-oss "bin/dremio start-fg" 43 minutes ago Up 22 minutes 0.0.0.0:9047->9047/tcp, :::9047->9047/tcp, 0.0.0.0:31010->31010/tcp, :::31010->31010/tcp, 0.0.0.0:32010->32010/tcp, :::32010->32010/tcp, 0.0.0.0:45678->45678/tcp, :::45678->45678/tcp dremio_lab-dremio-1
2f3f86db83d9 mysql:8.4.0 "docker-entrypoint.s…" 2 hours ago Up 22 minutes 0.0.0.0:3306->3306/tcp, :::3306->3306/tcp, 0.0.0.0:33060->33060/tcp, :::33060->33060/tcp dremio_lab-mysql-1
I can also check that exposed ports are actually available:
$ netstat -lant | egrep '3306|33060|9047|31010|32010|45678'
tcp4 0 0 *.33060 *.* LISTEN
tcp4 0 0 *.45678 *.* LISTEN
tcp4 0 0 *.9047 *.* LISTEN
tcp4 0 0 *.3306 *.* LISTEN
tcp4 0 0 *.32010 *.* LISTEN
tcp4 0 0 *.31010 *.* LISTEN
Now, if I access port 9047 on my host’s IP address (don’t use 127.0.0.1 or localhost, it won’t work), I can reach the Dremio web interface.
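Dremio can take a little while to start, so a script can poll the web UI before continuing. The following is a minimal sketch that assumes `curl` is installed; the function name `wait_for_http` is made up for this example, and any HTTP response at all is treated as "reachable".

```shell
# Hypothetical helper: poll a URL until it answers or roughly <timeout> seconds pass.
wait_for_http() {
  url=$1
  timeout=${2:-60}
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    # curl exits 0 as soon as the server returns any HTTP response
    if curl --silent --output /dev/null --max-time 2 "$url"; then
      return 0
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 1
}
```

A call would look like `wait_for_http "http://<host-ip>:9047/" 120 && echo "Dremio is up"`, where `<host-ip>` is the address of the machine running the containers.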
This looks pretty good, but whenever these Docker containers are removed, all the data inside them is lost, which is not what I want. I want the data to persist. To do this, I have to change my docker-compose.yml to add some managed volumes and to configure the mysql and dremio services to store their data on those volumes. First let’s add the volumes (we’ll need four: one for MySQL and three for Dremio; the details are available on the Docker Hub pages):
volumes:
  db:
  dremio_data:
  dremio_lib:
  dremio_local:
Then let’s configure the volumes on the services. I’ll post the complete docker-compose.yml here, since this is the last modification we have to make:
version: '3'
networks:
  dremio:
    name: dremio
volumes:
  db:
  dremio_data:
  dremio_lib:
  dremio_local:
services:
  mysql:
    image: mysql:8.4.0
    hostname: mysql
    ports:
      - "3306:3306"
      - "33060:33060"
    volumes:
      - db:/var/lib/mysql:rw
    networks:
      - dremio
    environment:
      MYSQL_ROOT_PASSWORD: weakPassword
  dremio:
    image: dremio/dremio-oss
    hostname: dremio
    networks:
      - dremio
    volumes:
      - dremio_data:/opt/dremio/data:rw
      - dremio_lib:/var/lib/dremio:rw
      - dremio_local:/localFiles:rw
    ports:
      - "9047:9047" # Web UI (HTTP)
      - "31010:31010" # ODBC/JDBC clients
      - "32010:32010" # Apache Arrow Flight clients
      - "45678:45678"
With this configuration, starting the whole testbed, shared network included, only takes one single command:
$ docker-compose up -d
assuming the command is run inside the directory containing the docker-compose.yml file.
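Tearing the testbed down is where the named volumes matter. The sketch below wraps the two variants in small functions: `down` removes the containers and the network but keeps named volumes, while `down -v` also deletes the volumes and therefore the data. The function names and the `COMPOSE` variable are made up for this example.

```shell
# Compose command to use; override with COMPOSE="docker compose" if preferred.
COMPOSE=${COMPOSE:-docker-compose}

# Remove containers and network, keep the named volumes (data survives).
stop_keep_data() {
  # $COMPOSE is left unquoted on purpose so "docker compose" splits into two words
  $COMPOSE down
}

# Remove containers, network AND named volumes (all data is deleted).
stop_and_wipe() {
  $COMPOSE down -v
}
```

After `stop_keep_data`, a later `docker-compose up -d` should find the MySQL and Dremio data again; after `stop_and_wipe`, both start from scratch.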
To see what’s going on in the containers, we can inspect their logs, first getting the container names:
$ docker-compose ps
NAME IMAGE COMMAND SERVICE CREATED STATUS PORTS
dremio_lab-dremio-1 dremio/dremio-oss "bin/dremio start-fg" dremio 54 minutes ago Up 34 minutes 0.0.0.0:9047->9047/tcp, :::9047->9047/tcp, 0.0.0.0:31010->31010/tcp, :::31010->31010/tcp, 0.0.0.0:32010->32010/tcp, :::32010->32010/tcp, 0.0.0.0:45678->45678/tcp, :::45678->45678/tcp
dremio_lab-mysql-1 mysql:8.4.0 "docker-entrypoint.s…" mysql 2 hours ago Up 34 minutes 0.0.0.0:3306->3306/tcp, :::3306->3306/tcp, 0.0.0.0:33060->33060/tcp, :::33060->33060/tcp
and then asking Docker for the logs of each container:
$ docker logs dremio_lab-dremio-1
starting dremio
2024-05-26T09:55:19.783+0000: [GC pause (G1 Evacuation Pause) (young), 0.0334300 secs]
[Parallel Time: 24.4 ms, GC Workers: 2]
[GC Worker Start (ms): Min: 1898.6, Avg: 1898.6, Max: 1898.6, Diff: 0.0]
[Ext Root Scanning (ms): Min: 7.1, Avg: 10.7, Max: 14.4, Diff: 7.3, Sum: 21.5]
[Update RS (ms): Min: 0.0, Avg: 0.1, Max: 0.1, Diff: 0.1, Sum: 0.1]
[... many more log lines here ...]
2024-05-26 10:42:10,658 [scheduler-5] INFO c.d.e.w.p.ActiveQueryListService - Starting activeQueryListTask on this coordinator.
2024-05-26 10:42:10,712 [FABRIC-rpc-event-queue] INFO com.dremio.sabot.exec.MaestroProxy - There are no active queries on this executor. So nothing to reconcile.
2024-05-26 10:47:10,656 [scheduler-7] INFO c.d.e.w.p.ActiveQueryListService - Starting activeQueryListTask on this coordinator.
2024-05-26 10:47:10,723 [FABRIC-rpc-event-queue] INFO com.dremio.sabot.exec.MaestroProxy - There are no active queries on this executor. So nothing to reconcile.
172.23.0.1 - - [26/May/2024:10:51:42 +0000] "GET /signup HTTP/1.1" 200 2582 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36"
172.23.0.1 - - [26/May/2024:10:51:43 +0000] "GET /apiv2/login/?nocache=1716720703167 HTTP/1.1" 403 66 "http://192.168.1.11:9047/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36"
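The log above mixes JVM garbage-collection output, application messages and HTTP access lines. When only the application messages are of interest, a filter along these lines can help. It is a rough sketch based solely on the line format visible in the excerpt above (date, time, thread in brackets, level), and the function name `app_log_lines` is made up for this example.

```shell
# Hypothetical filter: keep only lines shaped like
#   2024-05-26 10:42:10,658 [thread-name] INFO ...
# and drop GC output and HTTP access lines.
app_log_lines() {
  grep -E '^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9:,]+ \[[^]]+\] +(DEBUG|INFO|WARN|ERROR)'
}
```

It can be used as `docker logs dremio_lab-dremio-1 2>&1 | app_log_lines`.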
Our testbed is ready, it’s time to experiment with Dremio!