We are building and operating larger, more complex systems, and those systems degrade and fail in new and unexpected ways. We must learn to observe, respond to, and learn from these failures. The Chaos Engineering track at ChefConf is the place to share stories of improving mean-time-to-detect and mean-time-to-resolve, improving post-mortems and learning reviews, practicing game days, and, in short, learning how to be more responsive to systemic failures. I’m personally inviting you to submit your stories for our Chicago event through our ChefConf call for presenters (CFP). This blog post offers some thoughts and inspiration for building a proposal for the Chaos Engineering track.
Perhaps you have heard of Netflix’s Simian Army or Chaos Monkey. Maybe you are practicing game days in your own environment. These tools and practices are all about introducing chaos into the system and understanding how the system responds. It’s important to remember that the “system” in this case includes not only the infrastructure and applications we are running but also the people responsible for running them. You will get a chance to practice your response to failure. Using these tools and practices can help you practice that response in a slightly more controlled way.
All systems fail. It is not a question of “if” but “when.” When systems do fail, we have a chance to learn a lot and identify opportunities for improvement. These failures offer a great opportunity for collaboration and can help provide better context about the customers and users of our systems. Failures are tremendous learning opportunities. Do not waste them!
Chaos Engineering, as outlined on http://principlesofchaos.org/, is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production. This community is bringing a formal approach to experimentation and espousing the need for automation. How are you putting some of these ideas to work today?
Incident response can be improved and time-to-recover reduced when we use formal approaches or frameworks. The incident command system (ICS), the observe, orient, decide, act (OODA) loop, and the Cynefin framework are just some of the frameworks that can help teams understand and respond to chaos and failures. Perhaps you have experience with one or more of these frameworks and can provide some insight and inspiration to your peers.
Schadenfreude is pleasure derived from another person’s misfortune. Let’s face it, in our industry we love to talk about the epic failures and disasters we have experienced. The “other person” is sometimes, nay often, our own past self. Laughing, or crying, about our own past failures can be cathartic and can really help others learn.
Your story and experiences are worth sharing with the community. Help others learn and further your own knowledge through sharing. The ChefConf CFP is open now. Use some of the questions posed here to help form a talk proposal for the Chaos Engineering track. Besides this track, we are encouraging submissions across these tracks as well:
Submit your talk proposal now! The deadline is Wednesday, January 10, 2018 at 11:59 PM Pacific time.