Traditional monitoring covers low-level items like CPU, memory, and disk utilization. It is still important to collect and understand this data, since it helps with capacity planning and can provide useful data points when responding to incidents and outages.
DevOps is a cultural and professional movement focused on how to build and operate high-velocity organizations.
Practicing DevOps means understanding who the customers of a service are and what they need. A DevOps approach to monitoring may start by answering the question, “is it up?” Starting there encourages discussion and helps uncover what “up” means. Those discussions happen across many parts of the organization to ensure a common understanding. “Up” may mean that your customers can pay you money, that they can stream video from your site, or that they can reserve seats on a flight or in a venue.
Customers’ experience of a single interface is likely provided by a number of backend services working in concert to help the customer complete the task at hand. Understanding “up” means understanding how all of these services work together and which parts are essential.
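One way to make that conversation concrete is to write the definition of “up” down as code. The following is only a minimal sketch, assuming each backend service reports a simple healthy/unhealthy status and that the team has already agreed which services are essential. The `ServiceStatus` type, the `is_up` and `degraded` helpers, and the service names are all hypothetical, chosen for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ServiceStatus:
    """Hypothetical health record for one backend service."""
    name: str
    healthy: bool
    essential: bool  # assumption: the team has agreed on this flag per service


def is_up(statuses: Iterable[ServiceStatus]) -> bool:
    """Customer-facing "up": every essential service is healthy."""
    return all(s.healthy for s in statuses if s.essential)


def degraded(statuses: Iterable[ServiceStatus]) -> List[str]:
    """Names of non-essential services that are currently unhealthy."""
    return [s.name for s in statuses if not s.healthy and not s.essential]


# Illustrative data only: a checkout flow backed by three services.
checkout = [
    ServiceStatus("payments", healthy=True, essential=True),
    ServiceStatus("cart", healthy=True, essential=True),
    ServiceStatus("recommendations", healthy=False, essential=False),
]

print("up:", is_up(checkout))
print("degraded:", degraded(checkout))
```

In this sketch a failing recommendations service leaves the checkout “up” but degraded, while a failing payments service would not. Deciding which services get the essential flag is exactly the cross-team discussion described above.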
It’s nearly impossible to talk about monitoring without also discussing alerting. Typically, alerts are sent to people when monitors pick up anomalies. Sometimes these alerts are actionable but, too often, they end up being just noise. For example, there may be a spike in CPU load caused by some batch processing that is not actually having an impact on the customer experience and will end after the batch process has completed. Given that scenario, it is not appropriate to send an alert that could wake someone at 3 AM. Yet this is often what happens. Practicing DevOps means that we put our people first, and waking them at 3 AM to tell them about something that is not important and requires no immediate action is inhumane.
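The batch-processing example suggests a simple routing rule: page a person only when a signal both affects customers and needs immediate action. Here is a rough sketch of that idea. The `Signal` type, the `route` function, and the three destinations ("page", "ticket", "log") are assumptions made for illustration, not a description of any particular alerting tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    """Hypothetical anomaly reported by a monitor."""
    name: str
    customer_impacting: bool
    needs_immediate_action: bool


def route(signal: Signal) -> str:
    """Decide where an anomaly goes (assumed policy, for illustration)."""
    if signal.customer_impacting and signal.needs_immediate_action:
        return "page"    # worth waking someone up
    if signal.customer_impacting or signal.needs_immediate_action:
        return "ticket"  # handle during working hours
    return "log"         # keep the data, notify nobody


# The batch CPU spike from the example above: no customer impact, no action needed.
batch_spike = Signal("cpu spike during batch job", False, False)
print(route(batch_spike))
```

Under this policy the CPU spike is still recorded for capacity planning, but nobody is woken at 3 AM for it.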
I recently had the pleasure of joining Leon Adato (@leonadato), Clinton Wolfe (@clintoncwolfe), and Michael Coté (@cote) at THWACKcamp, an online conference hosted by SolarWinds, to discuss these ideas about monitoring and more. We were all part of a panel discussion titled ‘When DevOps Says “Monitor”.’ A recording of the panel, a transcript, and other resources are all freely available now on the THWACKcamp site. Check out the recording and let us know what you think.
How are you reinventing your organization’s approach to monitoring and alerting?