As discussed in Terraform Pain Points, one of the most frustrating things about working with Terraform is the pace and direction of upstream development. Useful features and bug fixes often languish for months or years because the maintainers lack bandwidth to review them or don’t see the utility. Check out all of the (justified) angst in the comments of terraform-provider-aws#8268, as hundreds of people waited eight months for it to be merged. This can feel like an impassable barrier.
It’s not an impassable barrier, though. In fact, we can sidestep this entirely if we run a custom build of Terraform and/or its providers. Here are some reasons you might want to do this:
- A `deepmerge` function (Terraform #25032), once you realize that Terraform's `merge` only goes one layer deep.
- A `missing_okay` flag to ECS data sources, to work around the bootstrapping issues discussed here. I didn't bother submitting it upstream because optional/gracefully-failing data sources are unwelcome.
This article will explain the details of a practical approach to doing that. It's surprisingly liberating, and IMO one of the most impactful ways to improve the Terraform experience.
It’s important that your team runs Terraform in a consistent way. For one thing, Terraform is touchy about its version: it will only operate on state generated by the same or older version, so everybody has to upgrade in sync. But when we start relying on patched behavior in Terraform or its providers, it’s even more critical to be in sync so that a consistent set of patches is used. (Side note: “everybody” includes CI. A mature Terraform workflow should route primarily through an automation/CI platform).
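One cheap guardrail, independent of the Docker approach below, is a wrapper script that refuses to run when the local binary doesn't match the team's pin. Here's a minimal sketch (the pinned version is illustrative; in practice it would live in a source-controlled file):
#!/bin/sh
# Fail fast if the local terraform doesn't match the version the team pinned.
pinned="v0.13.5"  # example pin
actual=$(terraform version | head -n 1 | awk '{print $2}')
if [ "$actual" != "$pinned" ]; then
    echo "Expected terraform $pinned but found $actual" >&2
    exit 1
fi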
Packaging Terraform in a Docker image is a great way to encapsulate everything. We can ship an image with the patched Terraform binary and whatever patched providers we need, at the appropriate locations on the file system. Everything we care about is pinned. We only need to build for one target architecture. Distribution is a cinch (`docker pull ...`). Running it is less of a cinch, but will be discussed below.
There’s also terraform-bundle. I haven’t used it, but it’s worth being aware of. It seems more complicated than a Docker image (e.g. to distribute), but I could see it being a good fit for some workflows (especially if you’re using Terraform Enterprise).
If you want to skip the explanation, a demo implementation of these ideas may be found here.
There are many ways to approach maintaining and building this Docker image. Here I’ll describe one which aims for a declarative paradigm, making automation and ease of maintenance a focus. What we’ll do is specify a base reference (e.g. a tag or commit hash in the hashicorp/terraform repo) and a list of patches to apply. We’ll have a shell script which applies those patches before compiling the code.
This declarative approach doesn’t involve maintaining a fork, and has some advantages over a fork: it’s easy to tell what base versions and patches a given version of the image contains, and it’s usually a simple one-line code change to upgrade the base reference. However, it’s difficult to handle things which require manual intervention (e.g. merge conflict resolution), and a manually-maintained fork makes that easier because the workflow is less automated to begin with. If another approach works better in your situation, run with that. The pattern described here has been powerful and easy to use in my experience.
To keep things simple, we can list our patches as files on disk. Here’s the proposed file structure:
├── Dockerfile
├── apply-patches
└── patches
    ├── terraform
    │   ├── PR123-some-great-feature
    │   │   ├── remote
    │   │   ├── reference
    │   │   └── explanation
    └── terraform-provider-aws
        ├── PR987-adding-a-resource
        │   ├── remote
        │   ├── reference
        │   └── explanation
        └── some-custom-patch
            ├── apply
            └── explanation
If we're pulling in the contents of a public changeset (e.g. a PR), our patch directory will contain files `remote` and `reference`. The contents of `remote` will be a git remote (e.g. `https://github.com/somebody/terraform`). The contents of `reference` will be a git reference that can be checked out in that repository, like a branch name (`feature/some-great-feature`) or a long or short commit hash (`0abcdef`). The file `explanation` is just free-form text explaining what the patch is for, and can be named anything.
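For example, capturing the hypothetical upstream PR from the file tree above would look like this (the remote URL and branch name are made up):
mkdir -p patches/terraform-provider-aws/PR987-adding-a-resource
cd patches/terraform-provider-aws/PR987-adding-a-resource
echo 'https://github.com/somebody/terraform-provider-aws' > remote
echo 'feature/adding-a-resource' > reference
echo 'Pulls in upstream PR 987, which adds a resource we need.' > explanation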
If we have some other sort of patch, e.g. a `.patch` file, then we can bring an executable shell script named `apply` which does whatever is necessary.
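The `apply` script receives the patch name as its first argument (see the `apply-patches` script below) and should leave its changes staged, since `apply-patches` commits after each patch. A minimal sketch, assuming the patch directory carries a hypothetical `changes.patch` file:
#!/bin/sh
set -ex
# $1 is the patch name; we run from the repository root.
# --index stages the changes so the commit in apply-patches picks them up.
git apply --index "patches/$1/changes.patch"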
The `apply-patches` script (detailed below) will be called from the Dockerfile, which we break up into stages like this:
FROM golang:1.15-buster AS builder-terraform
ARG TERRAFORM_BASE_REFERENCE
WORKDIR /code/hashicorp/terraform
RUN git clone https://github.com/hashicorp/terraform . && git checkout $TERRAFORM_BASE_REFERENCE
COPY apply-patches .
COPY patches/terraform/ patches/
RUN ./apply-patches
# Stripping debug symbols for smaller binary:
# `-ldflags '-s -w'`
# Output lives at /go/bin/terraform
RUN go install -a -ldflags '-s -w' .
FROM golang:1.15-buster AS builder-terraform-provider-aws
ARG AWS_PROVIDER_BASE_VERSION
WORKDIR /code/terraform-provider-aws
RUN git clone https://github.com/terraform-providers/terraform-provider-aws . && git checkout v$AWS_PROVIDER_BASE_VERSION
COPY apply-patches .
COPY patches/terraform-provider-aws/ patches/
RUN ./apply-patches
# Stripping debug symbols for smaller binary:
# `-ldflags '-s -w'`
# Output lives at /go/bin/terraform-provider-aws
RUN go install -ldflags '-s -w'
This is taking a base version of the provider as a build argument, and assuming a tag named `v...` exists. Other providers may not follow this pattern. It would be preferable to allow an arbitrary git reference to be passed in, but it's done like this because Terraform 0.13+ is very specific about where it wants providers installed, and that location requires a semver-alike version string (a bit more detail in the comments of the next Dockerfile segment). It's possible to hardcode that version string to something like `0.0.1` (for some reason, `0.0.0` doesn't work) and then use any git reference here, but it could make the output of `terraform init` confusing.
FROM debian:buster-20201012-slim
ARG AWS_PROVIDER_BASE_VERSION
# For pulling Terraform modules from git repositories
RUN apt-get update && apt-get install -y git && apt-get clean
COPY --from=builder-terraform /go/bin/terraform /bin/terraform
ENTRYPOINT ["terraform"]
# Note! This is only correct for TF >=0.13.
#
# Put the bundled provider where Terraform will look for it, following
# https://gist.github.com/mildwonkey/85df0f0605880a0f08b6f05c15092bd7
#
# Note that there are some restrictions on the provider version used in the path. A `v` prefix is not allowed
# (e.g. `/v2.70.0/`) and neither is a custom suffix (e.g. `/2.70.0-custom/`). In both cases, Terraform will
# ignore our provider and try to install from the public registry.
ENV AWS_PROVIDER_PATH=/usr/share/terraform/plugins/registry.terraform.io/hashicorp/aws/$AWS_PROVIDER_BASE_VERSION/linux_amd64/terraform-provider-aws
COPY --from=builder-terraform-provider-aws /go/bin/terraform-provider-aws $AWS_PROVIDER_PATH
If you’re using Terragrunt or other supporting tools, you would also download them in this final layer.
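For example, here's a sketch of adding Terragrunt to this layer (the version is illustrative, and you'd likely want checksum verification too):
ARG TERRAGRUNT_VERSION=0.26.7
RUN apt-get update && apt-get install -y curl && apt-get clean && \
    curl -fsSL -o /bin/terragrunt \
      "https://github.com/gruntwork-io/terragrunt/releases/download/v${TERRAGRUNT_VERSION}/terragrunt_linux_amd64" && \
    chmod +x /bin/terragrunt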
Aside: There was a big change in Terraform 0.13 regarding where Terraform searches for provider binaries. If you’re using Terraform 0.12 or earlier, the provider binary can simply be put next to Terraform, like this:
# Note! This is only correct for TF <0.13 (i.e. 0.12.* or earlier)
#
# Put the bundled AWS provider alongside the Terraform binary (/bin/terraform), which is the second
# place that Terraform checks for plugins.
# https://www.terraform.io/docs/extend/how-terraform-works.html#discovery
COPY --from=builder-terraform-provider-aws /go/bin/terraform-provider-aws /bin/terraform-provider-aws
Here's the `apply-patches` shell script, which applies the patches stored on disk as described above:
#!/bin/sh
set -ex
# This script is run as part of `docker build`.
#
# We should be in the directory where the base git reference has been checked out.
# Now, we loop over the patches in the patches/ directory and apply them.
#
# Usage: ./apply-patches
# git doesn't let us commit unless we configure author details
git config user.name terraform-image-automation
git config user.email nobody@nowhere.com
apply_from_remote_and_reference() {
    patch_name=$1
    remote=$2
    reference=$3
    git remote add "$patch_name" "$remote"
    git fetch "$patch_name" "$reference"
    # Get the exact revision so we can log it for audit purposes (what, exactly, went into this
    # image?). If we're working with a branch or tag, resolve that. Otherwise, use the reference,
    # which is assumed to be a commit hash. Fallback logic cf. https://stackoverflow.com/a/62338364
    revision=$(git rev-parse -q --verify "$patch_name"/"$reference" || echo "$reference")
    echo Revision: "$revision"
    git merge --squash "$revision"
}

apply_from_script() {
    patch_name=$1
    ./patches/"$patch_name"/apply "$patch_name"
}

for patch_path in patches/*
do
    # If there were no glob matches, don't loop.
    if [ ! -e "$patch_path" ]; then
        break
    fi
    patch_name=$(basename "$patch_path")
    echo Applying patch "$patch_name"
    if [ -f patches/"$patch_name"/remote ] && [ -f patches/"$patch_name"/reference ]; then
        # A typical patch specifies a remote (e.g. https://github.com/somebody/terraform-provider-aws)
        # and a reference (a branch name or commit hash) and doesn't provide an `apply` script.
        apply_from_remote_and_reference "$patch_name" "$(cat patches/"$patch_name"/remote)" "$(cat patches/"$patch_name"/reference)"
    elif [ -f "$patch_path"/apply ]; then
        # Patch bringing its own apply script
        apply_from_script "$patch_name"
    else
        echo Patch "$patch_name" should have either \`remote\` and \`reference\` files or an \`apply\` script.
        exit 1
    fi
    # Can't `merge --squash` more than once in a row without committing in between.
    git commit -m "Applying patch $patch_name"
done
Now `docker build` takes care of the rest. You can set `DOCKER_BUILDKIT=1` to use BuildKit, which will parallelize the independent stages.
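Concretely, a build invocation might look like this (the base versions are examples; use whatever you're pinning):
DOCKER_BUILDKIT=1 docker build \
    --build-arg TERRAFORM_BASE_REFERENCE=v0.13.5 \
    --build-arg AWS_PROVIDER_BASE_VERSION=3.12.0 \
    -t custom-terraform:ourtag .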
A full demo of this implementation may be found here. That repository uses another tool I wrote, Dockerfiler, which is useful for automated “build me an image based on this Dockerfile” processes like this.
I’ve previously written about how great it is to run things in Docker. If it’s unfamiliar to you, here’s a useful mental model in the form of a shell alias:
alias terraform='docker run -it --rm -v "$(pwd)":"$(pwd)" -w "$(pwd)" -u $(id -u):$(id -g) custom-terraform:ourtag'
Now we might run `terraform init` or `terraform plan` and it routes through the container. That `docker run` command (well, stub of a command) is functionally equivalent to invoking a locally-installed Terraform binary. It's better in many ways, though: explicit choice of version via the image tag, easy cross-platform support, environmental context doesn't leak in by accident (AWS credentials, file system access). The article goes more in depth on the benefits (and drawbacks).
In practice, an alias like this isn't the best way to do it: not source controlled, not easy to share with a team, some workflow steps require different credentials passed to the container (e.g. `init` needs access to download modules while `plan` or `apply` do not).
One thing that I’ve found useful for this is to use a Makefile like:
image_reference = custom-terraform:ourtag

# This would volume in credentials necessary for init (i.e. for downloading modules, accessing remote state)
# and volume in the `terraform-known-hosts` and `terraform-etc-passwd` files.
tf_init_docker_options = ...

terraform = docker pull $(image_reference) && \
	docker run --rm -u $$(id -u):$$(id -g) -v "$$(pwd)":"$$(pwd)" -w "$$(pwd)" $(2) $(image_reference) $(1)

# Specific to if we're downloading modules from GitHub (or another git remote)
terraform-known-hosts:
	ssh-keyscan github.com > terraform-known-hosts

# OpenSSH is finicky about running as a user that doesn't exist in /etc/passwd, so create a basic
# /etc/passwd to get around that (and use /tmp as $HOME so it's writable).
terraform-etc-passwd:
	echo "terraform:x:$$(id -u):$$(id -g):terraform,,,:/tmp:/bin/sh" > terraform-etc-passwd

tf-init: terraform-known-hosts terraform-etc-passwd
	$(call terraform, init $(args), $(tf_init_docker_options) $(docker_args))

tf:
	$(call terraform, $(args), $(docker_args))
The workflow is then `make tf-init` followed by `make tf args='plan'`, etc. There are a lot of ways to make this more ergonomic, but the basic functionality is all here.
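A typical session then looks like this (the plan file name is arbitrary):
make tf-init
make tf args='plan -out=tf.plan'
make tf args='apply tf.plan'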
There are some subtleties to making this work seamlessly. I’ll call them out here, but I’ll be light on details (which could fill another article).
Sourcing cloud provider credentials will probably look different when you run locally and when you run in automation. For instance, if we're using AWS, the local credentials might be provided to the container by voluming in `~/.aws`, but our automation might use an assumed role/the EC2 instance metadata service. To handle discrepancies like that, it's useful to distinguish between automated/non-automated sessions. Something like this can work:
# Identify interactive session by presence of stdin, cf. https://stackoverflow.com/a/4251643
is_interactive := $(shell [ -t 0 ] && echo 1)
ifdef is_interactive
special_docker_run_options = -it -v ~/.aws:/tmp/.aws -e HOME=/tmp -e AWS_PROFILE
else
special_docker_run_options = -e TF_IN_AUTOMATION=1 -e TF_INPUT=0
endif
Then we'd add `$(special_docker_run_options)` to the `docker run` template. More about `/tmp` as a home directory below.
The Dockerfile doesn't create a non-root user. It's a good practice to run as a non-root user, and actually to run as the same uid/gid as the person invoking Docker, but that can't be baked into the image because the user id is only known at runtime. Containers are usually okay running as whatever user, even if it's unknown, but anything that tries to write to the home directory will fail unless we set `HOME` to something writable. `HOME=/tmp` is a simple thing that works in those situations.
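As a quick smoke test, running as an arbitrary (even nonexistent) user works once `HOME` points somewhere writable:
docker run --rm -u 12345:12345 -e HOME=/tmp custom-terraform:ourtag version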
If we're downloading Terraform modules via git, then we run into an issue where OpenSSH errors out if the user isn't known to the OS. To get around that, we have this trick in the Makefile to create a dummy `/etc/passwd` with the same uid/gid as our current user.
Specific to downloading Terraform modules over git/SSH: we need to pre-fetch the remote's host keys with `ssh-keyscan`. Otherwise, we get the dreaded `Are you sure you want to continue connecting (yes/no)?` interactive prompt from OpenSSH.
It’s usually necessary to volume in an SSH agent. This has been possible on a Mac since late 2019, but requires running as root in the container. The following should work:
ifeq ($(shell uname), Darwin)
# /run/host-services/ssh-auth.sock is a special socket exposed in the Docker-for-Mac Linux VM since
# Docker for Mac Edge release 2.1.4.0 (2019-10-15) https://docs.docker.com/docker-for-mac/edge-release-notes/
# The socket can be volumed into a container to use the host's SSH agent in the container. This only
# works when running as root in the container, but that's okay in D4M because it only ever writes to
# the host file system as the user running Docker (e.g. doesn't write files as root).
tf_init_docker_options = -v /run/host-services/ssh-auth.sock:/ssh-agent -e SSH_AUTH_SOCK=/ssh-agent \
-u 0:0 -v "$$(pwd)"/terraform-known-hosts:/tmp/.ssh/known_hosts
else
tf_init_docker_options = -v $$SSH_AUTH_SOCK:/ssh-agent -e SSH_AUTH_SOCK=/ssh-agent \
-v "$$(pwd)"/terraform-etc-passwd:/etc/passwd \
-v "$$(pwd)"/terraform-known-hosts:/tmp/.ssh/known_hosts
endif
On a Mac, watch out for terrible I/O performance when voluming in the working directory. It may be possible to work around that, though I haven’t.
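One commonly suggested mitigation, which I haven't verified, is Docker for Mac's relaxed volume consistency modes, e.g.:
docker run --rm -v "$(pwd)":"$(pwd)":delegated -w "$(pwd)" custom-terraform:ourtag plan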
Setting up a system to produce custom builds takes some work, but I hope this article makes that path a little easier for you. Once you start running a custom build of Terraform, you’re going to wonder how you ever got along without it.