Approximately two hours after Supermarket became the official community site, it started to experience increased latency and intermittent unresponsiveness. This made downloading cookbooks problematic and affected several sites, including supermarket.getchef.com, community.opscode.com, cookbooks.opscode.com, and api.berkshelf.com. We are sorry for the problems this caused.
In this post, I will explain what happened and what we are doing to make sure this doesn’t happen in the future.
At Chef, we perform a postmortem for any significant production outage or incident that results in an engineer being paged. We publish a writeup with the timeline, root cause, and corrective actions in a private repository. Then we hold an internal postmortem meeting, where the incident leader and everyone who was involved discuss what happened, why it happened, and how to prevent it from happening again. These meetings are conducted in a blameless manner and are a learning experience for everyone. When an outage has external customer impact, we also often publish a public blog post.
The writeup and meeting are normally internal. However, Supermarket is the community site; it serves the community. The application repository and its supporting cookbook are open source, and issues that affect supermarket.getchef.com may also affect anyone who runs the application on their own infrastructure. So for this incident, we conducted the postmortem meeting in the open, giving the community the same chance to participate as anyone at Chef.
The Supermarket application runs in an AWS account. At launch, it was hosted on three m3.medium instances, with an RDS database, an ElastiCache Redis cache, and an Elastic Load Balancer (ELB) in front of the instances. Supermarket is a Rails app that runs under the Unicorn HTTP server, which listens on a Unix domain socket on each instance. Nginx runs on each application server as a local reverse proxy in front of Unicorn.
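As a rough illustration of that topology, here is a minimal Unicorn configuration of the kind such a setup typically uses. The socket path, deploy directory, worker count, and timeout below are assumptions for the sake of the example, not Supermarket's actual settings; the local Nginx reverse proxy would point its upstream at the same socket.

```ruby
# config/unicorn.rb -- illustrative sketch only; paths and values are assumed,
# not taken from the Supermarket cookbook.

# Number of worker processes; on a single-core m3.medium there is little
# headroom to raise this.
worker_processes 4

working_directory "/srv/supermarket/current"   # hypothetical deploy path

# Listen on a Unix domain socket that the local Nginx reverse proxy
# forwards to (e.g. proxy_pass http://unix:/var/run/unicorn/supermarket.sock;).
listen "/var/run/unicorn/supermarket.sock", backlog: 64

# Kill workers that take too long, so a stuck request cannot hold a
# worker forever.
timeout 30

preload_app true
```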
We determined a number of contributing factors that led to this outage.
We realized at the time of the soft launch that we could not cut over to the new site without supporting the /universe endpoint. We held off on cutting over until that was implemented, but we still wanted to launch as soon as possible, and in our haste we skipped the load-planning step.
The ELB health check timeouts were set too low to be effective. We ended up in a state where nodes were being taken out of the pool, causing a domino effect: as one node was pulled out, the remaining nodes saw a significant increase in traffic, which caused them to time out and fall out of the pool as well.
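To make the failure mode concrete, here is a sketch of adjusting a classic ELB health check so that a slow but still-working node is not ejected so aggressively, using the AWS SDK for Ruby. The load balancer name, health check target, and threshold values are made up for the example and are not the settings we deployed.

```ruby
# Illustrative only: loosen the classic ELB health check so that transient
# slowness does not immediately pull a node out of the pool.
require "aws-sdk-elasticloadbalancing"

elb = Aws::ElasticLoadBalancing::Client.new(region: "us-east-1")

elb.configure_health_check(
  load_balancer_name: "supermarket-elb",      # hypothetical name
  health_check: {
    target:              "HTTP:80/status",    # hypothetical health endpoint
    interval:            30,  # seconds between checks
    timeout:             10,  # tolerate slow responses before counting a failure
    unhealthy_threshold: 5,   # require several consecutive failures before removal
    healthy_threshold:   2    # bring recovered nodes back quickly
  }
)
```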
The application servers were woefully undersized for the traffic we needed to handle. The single CPU core of an m3.medium was simply not enough to keep up with demand.
We added new instances and changed the instance type we were using in production. We now have five m3.xlarge instances serving traffic.
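With more cores per instance, the Unicorn worker count can also be scaled up to match. The snippet below is a hedged sketch of one way to express that in the same hypothetical unicorn.rb, assuming a Ruby version that provides Etc.nprocessors; Supermarket's actual worker settings may differ.

```ruby
# config/unicorn.rb (excerpt) -- illustrative sketch, not Supermarket's config.
require "etc"

# Run a couple of workers per CPU core, so an m3.xlarge with multiple vCPUs
# gets meaningfully more concurrency than a single-core m3.medium did.
worker_processes Etc.nprocessors * 2
```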
One remaining corrective action involves the /universe endpoint. (Berkshelf Core Team)

We are sorry about the issues with Supermarket, especially so soon after launch. We know that the community depends on these services to be reliable, and we will work hard to prevent these issues in the future.
Thank you!