Earlier today we were alerted that Berkshelf v3 clients were crashing when attempting to download cookbooks. We traced the problem to Supermarket returning HTTP URLs instead of HTTPS URLs from the Berkshelf API endpoint, and flushed the cache to correct it.
At Chef, we conduct postmortem meetings for outages and issues with the site and services. Since Supermarket belongs to the community, and we are developing the application in the open, we would like to invite you, the community, to listen in or participate in public postmortem meetings for these outages. Additionally, due to the nature of the problem, we don’t have a good way to determine when the problem started. We would love any feedback from the community as to when they started to see this issue. Please email cwebber@getchef.com with any info.
We held a public postmortem on Tuesday, July 15, 2014 at 19:00 UTC.
Description
https://supermarket.getchef.com/universe was returning HTTP URLs instead of HTTPS URLs, causing Berkshelf clients to crash.
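As a rough illustration of the symptom described above, the following Ruby sketch scans a universe-style document for entries whose URL is not HTTPS. The JSON layout (cookbook name, then version, then a `location_path` key), the sample data, and the helper name `insecure_universe_entries` are assumptions made for illustration; they are not taken from the Supermarket code or from the actual response served during the incident.

```ruby
require "json"

# Return [cookbook, version, url] triples for every entry whose
# location_path does not start with "https://".
# ASSUMPTION: the universe document maps cookbook name => version => a hash
# containing a "location_path" key. This layout is assumed for the sketch.
def insecure_universe_entries(universe)
  universe.flat_map do |cookbook, versions|
    versions.filter_map do |version, details|
      url = details["location_path"].to_s
      [cookbook, version, url] unless url.start_with?("https://")
    end
  end
end

# Hypothetical sample data standing in for a real /universe response.
sample = JSON.parse(<<~JSON)
  {
    "apt": {
      "2.4.0": { "location_path": "http://supermarket.getchef.com/api/v1" },
      "2.3.10": { "location_path": "https://supermarket.getchef.com/api/v1" }
    },
    "nginx": {
      "2.7.4": { "location_path": "https://supermarket.getchef.com/api/v1" }
    }
  }
JSON

insecure_universe_entries(sample).each do |cookbook, version, url|
  puts "#{cookbook} #{version}: #{url}"
end
# prints "apt 2.4.0: http://supermarket.getchef.com/api/v1"
```

A check along these lines could run against the parsed body of the live endpoint as a periodic monitor, which might help narrow down when a regression like this one begins.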
Timeline
All times UTC
Contributing Factors
Stabilization Steps
Flushed the cache for the Universe controller with the following command:

knife ssh 'role:supermarket-app AND \
chef_environment:supermarket-prod' \
-a ec2.public_hostname \
'(cd /srv/supermarket/current && \
sudo RAILS_ENV=production \
bundle exec rails runner \
"Rails.cache.delete(Api::V1::UniverseController::CACHE_KEY)")'
Impact
Berkshelf v3.x clients and Berkshelf v2.x clients (prior to v2.0.18) crashed when attempting to download the first cookbook.
Participants
Web Resources
Corrective Actions Discussed
These actions were discussed during the postmortem but were determined to have little or no impact on the time-to-detect or time-to-resolve this particular failure mode. As such, they are not actions that will be captured as immediate next-steps for this outage.
Corrective Actions