Chef Blogs

Chef Client 12.14.60 – Escaped defects and corrective actions

Nathen Harvey | announcements | community | release

Despite our normal testing and processes, the 12.14.60 release of Chef Client included a number of regressions and escaped defects (you may also call them “bugs”).  One of the defects was in the yum_repository resource, which was added in chef-client version 12.14.60. The resource previously shipped as part of the yum cookbook.  We will use the regressions around the yum_repository resource as a proxy for the release as a whole and will not dig into the specifics of the other regressions, though they will be captured in this incident report.

Allowing any defects to escape is problematic.  The defects that shipped in this release caused some of you significant pain.  I am sorry for the pain this caused you and your teams.

We held an internal post mortem to discuss this incident.  Read below for more information on the issue, its impact, and the corrective actions we are taking to reduce our time to detect and resolve these kinds of issues.

Impact

  • Failed chef-client runs for anyone running chef-client version 12.14.60 and using a yum_repository resource with a url parameter or a :delete action.

Time to Detect and Resolve

Time to detect and time to resolve are two important metrics that we track for every incident.

  • Time to detect – 50 minutes
    • 14-Sep-2016 18:19 Chef Client 12.14.60 released, 14-Sep-2016 19:09 GitHub issue 5317 opened.
  • Time to resolve – 6 days, 5 hours, 1 minute
    • 6 hours, 52 minutes
      • 14-Sep-2016 18:19 Chef Client 12.14.60 released, 15-Sep-2016 01:11 current build of chef-client released that includes the fixes.
    • 5 days, 5 hours, 27 minutes
      • 14-Sep-2016 18:19 Chef Client 12.14.60 released, 19-Sep-2016 23:46 Chef Client 12.14.77 released
    • 6 days, 5 hours, 1 minute
      • 14-Sep-2016 18:19 Chef Client 12.14.60 released, 20-Sep-2016 23:20 Doc site includes yum_repository resource
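The three time-to-resolve figures above can be reproduced from the listed timestamps. The following is a minimal Python sketch, not part of our tooling; the names RELEASED, MILESTONES, and elapsed are illustrative, and it assumes all timestamps are in the same time zone.

```python
from datetime import datetime

# Release of Chef Client 12.14.60 (start of the clock for every milestone).
RELEASED = datetime(2016, 9, 14, 18, 19)

# Resolution milestones, taken from the timeline above.
MILESTONES = {
    "current build with fixes": datetime(2016, 9, 15, 1, 11),
    "Chef Client 12.14.77 released": datetime(2016, 9, 19, 23, 46),
    "doc site includes yum_repository": datetime(2016, 9, 20, 23, 20),
}


def elapsed(start, end):
    """Return (days, hours, minutes) between two timestamps."""
    delta = end - start
    hours, remainder = divmod(delta.seconds, 3600)
    return delta.days, hours, remainder // 60


for label, when in MILESTONES.items():
    days, hours, minutes = elapsed(RELEASED, when)
    print(f"{label}: {days} days, {hours} hours, {minutes} minutes")
```

The last milestone is the one reported as the overall time to resolve, since the incident was not considered closed until the documentation caught up with the release.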

Preventing Similar Incidents

The specific steps we are taking to improve our response to these incidents include:

  • Provide more timely announcements when we know that software we have shipped requires an immediate release to resolve escaped defects or regressions.
  • Automate and improve generation of documentation.
  • Add more tests when migrating providers from cookbooks into Chef Client.
  • Consider moving target release dates to earlier in the week, which would leave more working days to repair any reported issues and avoid delays over the weekend.

Complete Post Mortem

The complete post mortem, including timeline, contributing factors, and more, is available in this GitHub Gist.