This is a bit of an odd post, as product marketing is usually not in the position of recommending against the use of certain product features. But a question about Chef’s “why-run” mode came up during a webinar last week, and we’ve heard several other stories of customers still using this feature in Chef. We wanted to provide clear guidance: if you are still using why-run mode, you should stop. We’ll explain why (no pun intended) through the rest of this post, and what you should do instead.
A dry-run or no-op mode was originally proposed all the way back in January 2009 when Chef was first released. This was probably because both CFEngine and Puppet had this feature, and early users were familiar with this style of “validating” changes before rolling them into production. Yet right from the beginning, everyone knew that such a feature has “crazy warts”, based on prior experience with these tools. Thus, the community spent the next few years debating whether or not to implement it despite those shortcomings. Chef finally got a why-run mode around 2012 in version 10.14.
If only everyone had waited a little longer. The June 2011 release of Stephen Nelson-Smith’s book Test-Driven Infrastructure with Chef sent us down a radically different path: Stephen realized that what we really needed was a reliable workflow to test infrastructure-as-code in isolated environments before deploying it to production. Originally, Stephen used Cucumber to make his point, and his work inspired the Chef community to develop more sysadmin-native infrastructure testing tools like Test Kitchen, ChefSpec, Foodcritic, Cookstyle, and more recently, InSpec. Ecosystem tooling made this practical: Vagrant, first released in 2010, lets engineers rapidly provision virtual machines from known images for functional testing, and Docker later did the same with far faster startup times. With all of this available, there is no justification today for continuing to test in production.
I mentioned before that no-op or dry-run modes don’t truly work. That’s because resources in a configuration management system can be related to one another. Something that runs earlier in a Chef run (or a Puppet catalog, or Ansible play) can modify the system so that later resources can observe a different state and thus exhibit different behavior. Yet no-op modes by definition can only observe resources in isolation and try to forecast what will happen based on that limited view. This is especially problematic when guards are used that cause inter-resource dependencies. Take this code snippet for example:
package 'httpd' do
  action :install
end

execute '/tools/letsencrypt.sh' do
  action :run
  only_if 'rpm -q httpd'
end
Why-run mode will infer that only a single resource is going to change, because it has no way to evaluate the guard in the subsequent execute block to know that its value will change based on the real execution of the first resource. Running this recipe for real changes two resources: not what why-run would have told you.
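To make the mismatch concrete, here is a minimal plain-Ruby sketch of the same situation. It is a toy model, not Chef's actual implementation: `Resource`, `converge`, `dry_run`, and `RECIPE` are hypothetical names invented for illustration, and the system is reduced to a `Set` of facts. Each guard is a lambda standing in for a check such as `rpm -q httpd`.

```ruby
require 'set'

# Toy model only; none of these names exist in Chef.
# guard: returns true when the resource should act, given the system state.
# apply: mutates the system state the way a real converge would.
Resource = Struct.new(:name, :guard, :apply, keyword_init: true)

# Real run: each guard sees the state left behind by earlier resources.
def converge(resources, state)
  resources.each_with_object([]) do |res, changed|
    next unless res.guard.call(state)
    res.apply.call(state)
    changed << res.name
  end
end

# Dry run: nothing is applied, so every guard sees only the initial state.
def dry_run(resources, state)
  resources.each_with_object([]) do |res, would_change|
    would_change << res.name if res.guard.call(state)
  end
end

RECIPE = [
  Resource.new(
    name: 'package[httpd]',
    guard: ->(state) { !state.include?('httpd') },  # "is it missing?"
    apply: ->(state) { state << 'httpd' }
  ),
  Resource.new(
    name: 'execute[/tools/letsencrypt.sh]',
    guard: ->(state) { state.include?('httpd') },   # stands in for only_if 'rpm -q httpd'
    apply: ->(state) { state << 'letsencrypt-ran' }
  )
].freeze

# Start both runs from an empty system (httpd not installed).
puts "dry run predicts: #{dry_run(RECIPE, Set.new).inspect}"
puts "real run changes: #{converge(RECIPE, Set.new).inspect}"
```

Starting from a system without httpd, the dry run reports only the package resource, while the real run also fires the execute resource, because its guard becomes true only after the package has actually been installed.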
Equally alarming: despite the name, no-op modes are not free of side effects on the systems they inspect. That may surprise anyone who assumes why-run is safe to use in production. For example, one customer reported that a nightly why-run cron job was randomly breaking production servers. Their OS version happened to be running a buggy version of systemd that would occasionally lock up when interrogated about the state of running services, even though “no changes were being made”. Running this cron job across a sufficiently large fleet was enough to guarantee a production outage on at least several machines every night.
I hope it’s clear by now that testing in production is not the right approach. But what about using why-run mode to report on system compliance?
We think it’s great that people want to continuously evaluate, or audit, system state either before or after making changes. We even have a pattern for this called detect, correct, automate. But putting the “detect” phase in the hands of your configuration management system (in charge of “correct”) is not the right approach. Configuration management is good at enforcing the state of things that you declared. What about the system state you haven’t declared?
More importantly, though, there’s a separation-of-duties requirement here, which is the same reason you need auditors in the first place. The well-known proverb “trust but verify” only holds when the change is verified by a different program (or even a different team) than the one making it. It’s one of the reasons we developed and released InSpec: to create a clear delineation of roles and responsibilities. Developers and sysadmins are responsible for the Chef code; security engineers are responsible for the InSpec controls. And never the twain shall meet. (It’s also the same reason why we strongly discourage importing Chef attributes into your InSpec controls.)
In other words: don’t rely on dry-run or no-op modes for compliance. Not only will the results be incomplete, but they probably won’t stand up to an auditor’s scrutiny either.
To sum up, why-run mode:
In the spirit of not leaving non-recommended features lying around, we will consider the removal of why-run mode in a future major version of Chef.