Earlier today we were alerted that Berkshelf v3 clients were crashing when attempting to download cookbooks. We traced the crashes to Supermarket returning HTTP URLs instead of HTTPS URLs in the Berkshelf API endpoint, and we flushed the cache to correct the issue.
At Chef, we conduct postmortem meetings for outages and issues with the site and services. Since Supermarket belongs to the community, and we are developing the application in the open, we would like to invite you, the community, to listen in or participate in public postmortem meetings for these outages. Additionally, due to the nature of the problem, we don’t have a good way to determine when the problem started. We would love any feedback from the community as to when they started to see this issue. Please email cwebber@getchef.com with any info.
Postmortem Meeting
We held a public postmortem on Tuesday, July 15, 2014 at 19:00 UTC.
Description
https://supermarket.getchef.com/universe was returning http URLs instead of https URLs, causing Berkshelf clients to crash.
Timeline
All times UTC
- Unknown – Issue occurs
- 2014-07-14 09:07:32 – (#chef) acoulton reports seeing an issue.
- 2014-07-14 09:07:57 – (#chef) jrwesolo confirms issue.
- 2014-07-14 09:10:06 – (#chef) coderanger mentions cwebber.
- 2014-07-14 09:11:19 – (#chef) cwebber responds and begins investigating
- 2014-07-14 09:13:49 – cwebber confirms that /universe is returning http URLs
- 2014-07-14 09:14:12 – (#chef) cwebber notifies #chef that he has identified the issue
- 2014-07-14 09:17:29 – cwebber flushes the cache for the Universe controller.
knife ssh 'role:supermarket-app AND \
chef_environment:supermarket-prod' \
-a ec2.public_hostname \
'(cd /srv/supermarket/current && \
sudo RAILS_ENV=production \
bundle exec rails runner \
"Rails.cache.delete(Api::V1::UniverseController::CACHE_KEY)")'
- 2014-07-14 09:18:23 – (#chef) cwebber asks acoulton to confirm that things are fixed.
- 2014-07-14 09:20:27 – (#chef) acoulton confirms that issue is resolved
- 2014-07-14 09:20:38 – (#chef) ambient sound also confirms that the issue is resolved
- 2014-07-14 09:33:30 – cwebber updates https://status.getchef.com
Contributing Factors
- A bug in Ruby’s OpenURI library does not allow for protocol changes in redirects.
- The cache key used to store the /universe response does not account for the protocol of the request. As a result, whichever protocol was used by the request that populated the cache is the protocol that appears in the URLs served to every later client.
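The second factor can be illustrated with a minimal sketch. This is not Supermarket's code: a plain Hash stands in for the Rails cache, and the key name, host, and `download_url` field are placeholders chosen for illustration. It only shows how a protocol-agnostic cache key lets the first request decide the protocol for everyone.

```ruby
# Hypothetical stand-in for the Rails cache; a plain Hash keeps the sketch self-contained.
CACHE = {}

# Assumed key name. The real key is Api::V1::UniverseController::CACHE_KEY.
CACHE_KEY = "universe"

# Builds a /universe-style payload whose URLs reuse the protocol of the request.
# The host and field name are placeholders, not real Supermarket data.
def build_universe(protocol)
  {
    "apt" => {
      "2.4.0" => {
        "download_url" => "#{protocol}://supermarket.example/cookbooks/apt/versions/2.4.0/download"
      }
    }
  }
end

# Protocol-agnostic lookup: whichever request populates the cache first
# decides the protocol that every later request receives.
def universe(protocol)
  CACHE[CACHE_KEY] ||= build_universe(protocol)
end

first  = universe("http")   # an HTTP request populates the cache
second = universe("https")  # a later HTTPS request is served the HTTP payload

puts second["apt"]["2.4.0"]["download_url"]
```

In this sketch the HTTPS request receives `http://` URLs until the cached entry is deleted, which is why flushing the cache resolved the symptom.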
Stabilization Steps
Flushed the cache for the Universe controller:
knife ssh 'role:supermarket-app AND \
chef_environment:supermarket-prod' \
-a ec2.public_hostname \
'(cd /srv/supermarket/current && \
sudo RAILS_ENV=production \
bundle exec rails runner \
"Rails.cache.delete(Api::V1::UniverseController::CACHE_KEY)")'
Impact
Berkshelf v3.x clients and Berkshelf v2.x (prior to v2.0.18) crashed on attempting to download the first cookbook.
Participants
- Brian Cobb
- Jamie Winsor
- Seth Vargo
- Pauly Comtois
- Nathen Harvey
Web Resources
- Notification on status.opscode.com
- Blog post announcing the post mortem
- Video of the public post mortem
Corrective Actions Discussed
These actions were discussed during the postmortem but were determined to have little or no impact on the time-to-detect or time-to-resolve this particular failure mode. As such, they are not actions that will be captured as immediate next-steps for this outage.
- Run Berkshelf integration test suite after all Supermarket code deploys.
- Add flushing the cache to Chef’s internal ChatOps utilities (only relevant if we expect to need to flush the cache more often).
- Force SSL on Supermarket. This is already in place for the site itself, but the redirect from http://api.berkshelf.com to https://supermarket.getchef.com is not enforced, because enforcing it would break older versions of Berkshelf. The versions that would fail are:
  - prior to 3.1.4 in the 3.x series
  - prior to 2.0.18 in the 2.x series
Corrective Actions
- Add monitoring to ensure that none of the URLs are returning as http – Joshua Timberman
- Include the protocol as part of the cache key. This would mean we have two instances of the /universe endpoint in the cache, one with protocol of http and one with a protocol of https. – Supermarket Team – Trello Card
- Remove the monitor ensuring the /universe endpoint is returning the proper protocol (after Supermarket has been updated) – Joshua Timberman
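The first two corrective actions might look roughly like the following sketch. Everything here is an assumption made for illustration: the `universe_cache_key` and `insecure_urls` helpers, the key format, and the sample payload are hypothetical and do not describe the actual Supermarket change or the monitor that was deployed.

```ruby
require "json"

# Assumed cache-key scheme: one cache entry per request protocol, so an HTTP
# request can no longer overwrite what HTTPS clients are served.
def universe_cache_key(protocol)
  "universe-#{protocol}"
end

# Recursively collects every string value that starts with "http://".
# A monitor could fetch /universe over HTTPS and alert when this list is non-empty.
def insecure_urls(node)
  case node
  when Hash   then node.values.flat_map { |value| insecure_urls(value) }
  when Array  then node.flat_map { |value| insecure_urls(value) }
  when String then node.start_with?("http://") ? [node] : []
  else []
  end
end

# Illustrative payload; the host and field name are placeholders, not real data.
sample = JSON.parse(<<~PAYLOAD)
  {
    "apt": {
      "2.4.0": { "download_url": "http://supermarket.example/cookbooks/apt/versions/2.4.0/download" },
      "2.3.10": { "download_url": "https://supermarket.example/cookbooks/apt/versions/2.3.10/download" }
    }
  }
PAYLOAD

offenders = insecure_urls(sample)
puts offenders.empty? ? "OK" : "CRITICAL: #{offenders.size} http URL(s) found"
```

With the protocol in the key, the cache holds one entry for http and one for https, as the corrective action describes. The check becomes unnecessary once that change ships, which matches the plan to remove the monitor afterwards.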