Approximately two hours after Supermarket became the official community site, it started to experience increased latency and intermittent unresponsiveness. This made downloading cookbooks problematic and affected several sites, including supermarket.getchef.com, community.opscode.com, cookbooks.opscode.com, and api.berkshelf.com. We are sorry for the problems this caused.
In this post, I will explain what happened and what we are doing to make sure this doesn’t happen in the future.
At Chef, we perform a postmortem for any significant production outage or incident that results in an engineer being paged. We publish a writeup with the timeline, root cause, and corrective actions in a private repository. Then we hold an internal postmortem meeting, where the incident leader and everyone who was involved discuss what happened, why it happened, and how to prevent it from happening again. These meetings are conducted in a blameless manner and are a learning experience for everyone. When an outage has external customer impact, we also often publish a public blog post.
The writeup and meeting are normally internal. However, Supermarket is the community site; it serves the community. The application repository and its supporting cookbook are open source, and issues that affect supermarket.getchef.com may also affect anyone who runs the application on their own infrastructure. So for this incident, we conducted the postmortem meeting in the open, giving the community the same chance to participate as anyone at Chef.
The Supermarket application runs in an AWS account. At launch, it was hosted on three m3.medium instances, with an RDS database, an ElastiCache Redis cache, and an Elastic Load Balancer (ELB) in front of the instances. Supermarket is a Rails app that runs under the Unicorn HTTP server, which listens on a Unix domain socket on each instance. Nginx runs on each application server as a local reverse proxy in front of Unicorn.
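As a rough illustration of that topology, here is a minimal Unicorn configuration of the kind such a setup typically uses. The socket path, deploy directory, worker count, and timeout below are assumptions for the sake of the example, not Supermarket's actual settings; the local Nginx reverse proxy would point its upstream at the same socket.

```ruby
# config/unicorn.rb -- illustrative sketch only; paths and values are assumed,
# not taken from the Supermarket cookbook.

# Number of worker processes; on a single-core m3.medium there is little
# headroom to raise this.
worker_processes 4

working_directory "/srv/supermarket/current"   # hypothetical deploy path

# Listen on a Unix domain socket that the local Nginx reverse proxy
# forwards to (e.g. proxy_pass http://unix:/var/run/unicorn/supermarket.sock;).
listen "/var/run/unicorn/supermarket.sock", backlog: 64

# Kill workers that take too long, so a stuck request cannot hold a
# worker forever.
timeout 30

preload_app true
```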
We determined a number of contributing factors that led to this outage.
We realized at the time of the soft launch that we could not cut over to the new site without supporting the /universe endpoint. We held off on cutting over until that was implemented, but we still wanted to launch as soon as possible, and in our haste we skipped the load-planning step.
The ELB health check timeouts were set too low to be effective. We ended up in a state where nodes were being taken out of the pool, causing a domino effect: as one node was pulled out, the remaining nodes saw a significant increase in traffic, which caused them to time out and fall out of the pool as well.
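To make the failure mode concrete, here is a sketch of adjusting a classic ELB health check so that a slow but still-working node is not ejected so aggressively, using the AWS SDK for Ruby. The load balancer name, health check target, and threshold values are made up for the example and are not the settings we deployed.

```ruby
# Illustrative only: loosen the classic ELB health check so that transient
# slowness does not immediately pull a node out of the pool.
require "aws-sdk-elasticloadbalancing"

elb = Aws::ElasticLoadBalancing::Client.new(region: "us-east-1")

elb.configure_health_check(
  load_balancer_name: "supermarket-elb",      # hypothetical name
  health_check: {
    target:              "HTTP:80/status",    # hypothetical health endpoint
    interval:            30,  # seconds between checks
    timeout:             10,  # tolerate slow responses before counting a failure
    unhealthy_threshold: 5,   # require several consecutive failures before removal
    healthy_threshold:   2    # bring recovered nodes back quickly
  }
)
```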
The application servers were woefully undersized for the traffic we needed to handle. The single CPU core of an m3.medium was simply not enough to keep up with demand.
We added new instances and changed the instance type we were using in production. We now have five m3.xlarge instances serving traffic.
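With more cores per instance, the Unicorn worker count can also be scaled up to match. The snippet below is a hedged sketch of one way to express that in the same hypothetical unicorn.rb, assuming a Ruby version that provides Etc.nprocessors; Supermarket's actual worker settings may differ.

```ruby
# config/unicorn.rb (excerpt) -- illustrative sketch, not Supermarket's config.
require "etc"

# Run a couple of workers per CPU core, so an m3.xlarge with multiple vCPUs
# gets meaningfully more concurrency than a single-core m3.medium did.
worker_processes Etc.nprocessors * 2
```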
One remaining corrective action involves the /universe endpoint. (Berkshelf Core Team)

We are sorry about the issues with Supermarket, especially so soon after launch. We know that the community depends on these services to be reliable, and we will work hard to prevent these issues in the future.
Thank you!