Infrastructure automation tools like Chef and Puppet are becoming increasingly common as organizations look to manage their infrastructure as code, the same way they manage their applications.
At Brigade, we use Chef to build our infrastructure. One of the challenges we’ve faced is the tradeoff between thorough testing and the time it takes to run those tests. With every Chef cookbook we write, we want to be able to deploy changes with reasonable confidence that they will work in production. At the same time, we don’t want to wait hours for automated scripts to spin up, converge, and spin down nodes in order to verify that our code works.
One of the most powerful tools available in the Chef ecosystem is test-kitchen, a project that simplifies the process of converging your cookbooks in an isolated test environment. Out of the box, it works by spinning up a VirtualBox VM for each test suite, converging the VM with Chef, and then running some tests to verify the VM is in a desired state.
The Problem: Long Test Times
The limitation of this approach is that creating and destroying virtual machines is time-consuming and resource-intensive. For our collection of over 20 test suites representing various core components of our infrastructure, testing all of them took almost 2 hours. Converging and testing our base cookbook (the cookbook included on all nodes in our infrastructure, which contains common global configuration) alone took almost 15 minutes, so even running tests in parallel would still take significant time.
These long test times were not ideal because:
- We weren’t getting quick feedback on whether a commit broke our infrastructure in some way, which slowed down our development speed.
- Adding more test suites made our integration tests take even longer, so as our infrastructure grew the time it took to test grew at an unsustainable rate.
We clearly needed to explore other solutions to this problem.
Identifying Bottlenecks
Digging into where we were spending significant amounts of time in our tests, we realized that:
- Creating and destroying VMs took a minimum of 1–2 minutes for each test suite.
- There was noticeable CPU and disk I/O overhead running within a VM (5–10% in some cases).
- VMs limited our ability to parallelize, as each VM took a minimum amount of memory, even if it didn't use all of it. This limited the utilization of our resources, as a lot of memory would effectively go unused during the course of a test run, memory that could have been used to run other test suites.
While there certainly were ways we could have addressed some of these issues while continuing to use VMs, it seemed like we needed to drastically rethink how we were going about solving the problem, as the fundamental issue was the overhead of running VMs.
Solution: Containers as VMs
We realized that containers might be a viable alternative. Containers are quick to provision, incur less overhead than a full VM, and make better use of resources as they don’t reserve memory on startup, but share whatever system memory is available. We decided to explore using Docker to provision VM-like containers that we could run our tests within.
kitchen-docker
Fortunately, there was already a plugin for test-kitchen that added support for provisioning Docker containers. However, Docker containers are usually very lightweight and typically aren't running SSH, while the test-kitchen runtime relies on SSH to connect to and run commands on a machine.
The kitchen-docker plugin handles this by generating a Dockerfile that runs an SSH daemon within the container. This allows test-kitchen to access the container via SSH and execute commands like running a Chef client and tests. However, this only allows you to work with very simple Chef cookbooks: using resource providers like service will cause Chef runs to fail, since there is no init process running in the container.
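For example, a recipe as simple as the following will fail to converge in a plain container, because the service provider has no init system (systemd on CentOS 7) to talk to. The ntpd service here is just an illustration:

```ruby
# Hypothetical recipe snippet: enabling and starting a service.
# In a container with no init process running as PID 1, Chef's service
# provider cannot reach systemd, and the run fails.
service 'ntpd' do
  action [:enable, :start]
end
```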
Thick Containers
In order to test our cookbooks that used the service resource provider, we needed to come up with a way to get an init process running as the root process in our containers. It seemed like we were trying to fit a square peg into a round hole, and in many ways we were: Docker isn't really intended to be used for so-called “thick” containers, but it is perfectly capable of doing so.
Our infrastructure is built entirely on CentOS 7, whose init process is systemd. Building on Jim Perrin’s Dockerfile for a CentOS container with systemd, we came up with the following Dockerfile:
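A minimal sketch of such a Dockerfile, adapted from Jim Perrin's CentOS 7/systemd image, looks roughly like this; the exact package list, unit-file cleanup, and Chef install command are illustrative and may not match ours exactly:

```dockerfile
FROM centos:7

# Swap the fakesystemd dummy package for real systemd, then remove unit
# files that only make sense on a "real" machine.
RUN yum -y swap -- remove fakesystemd -- install systemd systemd-libs && \
    (cd /lib/systemd/system/sysinit.target.wants/ && \
     for unit in *; do \
       [ "$unit" = "systemd-tmpfiles-setup.service" ] || rm -f "$unit"; \
     done) && \
    rm -f /lib/systemd/system/multi-user.target.wants/* \
          /etc/systemd/system/*.wants/* \
          /lib/systemd/system/local-fs.target.wants/* \
          /lib/systemd/system/sockets.target.wants/*udev* \
          /lib/systemd/system/sockets.target.wants/*initctl* \
          /lib/systemd/system/basic.target.wants/* \
          /lib/systemd/system/anaconda.target.wants/*

# Run SSH as a systemd service so test-kitchen can connect to the container.
RUN yum -y install openssh-server openssh-clients sudo && \
    systemctl enable sshd.service

# Pre-install basic utilities and Chef itself so each test run doesn't have to.
RUN yum -y install which curl tar && \
    curl -L https://omnitruck.chef.io/install.sh | bash

# systemd needs cgroups from the host and must run as the container's root process.
VOLUME ["/sys/fs/cgroup"]
CMD ["/usr/sbin/init"]
```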
Much of the above code is commented, but here are the high-level takeaways:
- We swap the fakesystemd dummy package included in the base CentOS 7 image with the actual systemd package, and remove some unnecessary unit files that only make sense for a “real” machine.
- We run SSH as a service under systemd.
- Finally, we install some basic utilities, including Chef itself, so that we don't need to install them on each test run, saving us time.
The kitchen-docker configuration in our .kitchen.yml file is pretty straightforward:
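It looks something like the following sketch; the Dockerfile path is hypothetical, and option names can vary slightly between kitchen-docker versions:

```yaml
driver:
  name: docker
  use_sudo: false
  privileged: true                  # systemd inside the container needs access to cgroups
  dockerfile: docker/Dockerfile     # the custom CentOS 7 + systemd image described above
  run_command: /usr/sbin/init       # run systemd as the container's root process
  volume:
    - /sys/fs/cgroup:/sys/fs/cgroup
```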
Loose Ends
This got us 95% of the way towards our goal. However, two problems still remained:
- If our SSH daemon wasn’t working for whatever reason, we could not log into the machine via the kitchen login command.
- We noticed that destroying containers would time out, resulting in a forced shutdown which slowly leaked loopback devices.
The first problem was solved by monkey-patching the kitchen-docker driver to execute docker exec -it %{container_id} bash instead of SSH when using kitchen login. Regardless of the state of the SSH daemon, we could now always run a bash shell within the container to poke around and debug.
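A rough sketch of that patch, kept in a small Ruby file (here hypothetically patches/kitchen_docker_patches.rb, loaded from .kitchen.yml as described below), might look like the following. The method and LoginCommand constructor signatures differ between test-kitchen and kitchen-docker versions, so treat this as illustrative:

```ruby
require 'kitchen/driver/docker'
require 'kitchen/login_command'

module Kitchen
  module Driver
    class Docker
      # Drop into the container with `docker exec` instead of SSH so that
      # `kitchen login` keeps working even when sshd is misbehaving.
      def login_command(state)
        LoginCommand.new('docker', ['exec', '-it', state[:container_id], 'bash'])
      end
    end
  end
end
```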
The second problem turned out to be a bit thornier. When Docker shuts down a container, it sends the SIGTERM signal to the root process. Apparently, systemd simply re-execs itself when receiving SIGTERM, and there is no other signal that causes it to shut down gracefully.
Thus, we needed to monkey-patch the kitchen-docker driver to execute shutdown now in the container to trigger a proper shutdown of systemd.
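Again as a sketch rather than the exact patch, the idea is to ask systemd inside the container to shut down cleanly before the driver removes the container. The destroy and docker_command methods come from the kitchen-docker driver, and their details vary by version:

```ruby
require 'kitchen/driver/docker'

module Kitchen
  module Driver
    class Docker
      alias_method :original_destroy, :destroy

      # Tell systemd to stop its units and exit before the container is
      # removed, so Docker never has to force-kill PID 1.
      def destroy(state)
        if state[:container_id]
          docker_command("exec #{state[:container_id]} shutdown now") rescue nil
        end
        original_destroy(state)
      end
    end
  end
end
```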
This fixed our leaky loopback device woes.
In order to get these monkey-patches to work, we needed to include the following line at the top of our .kitchen.yml file:
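Something along these lines, where the path to the patch file is hypothetical and assumes kitchen is run from the repository root:

```yaml
# <% require './patches/kitchen_docker_patches.rb' %>
```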
While it looks like it's in a comment, this actually gets executed, since the .kitchen.yml file is interpreted as ERB.
The Results
The results of these efforts were substantial. Using the same test hardware as before, our overall test time dropped from almost 2 hours to ~30 minutes, and our base cookbook test suite dropped from 15 minutes to 3.
This was due to the containers being faster to create and destroy, as well as having lower CPU overhead. We were also able to run more test suites in parallel, since containers didn’t reserve memory all at once, but rather only as needed.
Gotchas
While this approach works for our infrastructure needs, it is not a silver bullet. Some important notes:
- The host machine running the container must also be CentOS 7 (or at least be running the same Linux kernel version).
- The containers need to be run with root privileges in order for the systemd within the container to access cgroups. We’re ok with this in our test infrastructure, but others might not like the idea of containers with root privileges on the host system.
- Perhaps obvious, but we’re not actually converging our cookbooks on a real machine. For our purposes this is fine, but there may be certain use cases for which you really need an isolated machine. In cases where you need true isolation, you can set the individual test suite to use the default test-kitchen Vagrant/VirtualBox integration, as sketched below.
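For instance, a single suite can override the driver while the rest continue to use Docker; the suite name and run list here are hypothetical:

```yaml
suites:
  - name: needs-real-vm      # hypothetical suite that requires true isolation
    driver:
      name: vagrant          # fall back to the default Vagrant/VirtualBox driver
    run_list:
      - recipe[base::default]
```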