Sometimes in IT, certain terms take on a life of their own. They push past their original meaning and become something different. Rollback is one of these terms.
In the language of enterprise IT, “rollback” means getting the system back into a working state, so that it can be restored immediately after a failure without disrupting the business. I’ve heard “rollback” from CIOs, Directors of Operations, and VPs of Development. In those conversations we weren’t comparing the technical implications of rollback vs. other techniques. What they were really asking was, “What type of insurance policy does Chef provide to get the system back into a working state after an issue has been detected?”
With the criticality of IT systems and the pressure for speed, companies need an insurance policy for when things go wrong. They need to be able to quickly restore service after a failed change or a failed release.
Rollforward is required for applications where returning to a previous version would be destructive, making rollback impossible. The process would look something like this:
Start running version A -> upgrade to B -> detect failure -> correct failure in new version C -> upgrade to C
Your application may have a breaking change or, more commonly, a database schema change that makes rollback destructive. With SQL databases, for example, upgrades should be additive; once the schema has changed, it is not safe to return to a previous version of the application. To restore service after a failed deployment, the failure must be diagnosed and corrected in a new version.
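To make the schema point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `customers` table, the `email` column, and the function names are all hypothetical, chosen only for illustration. Version B adds a column, and undoing that change throws away everything written to the column while B was live.

```python
import sqlite3


def create_schema_a(conn):
    # Hypothetical version A schema.
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")


def upgrade_a_to_b(conn):
    # Additive change: existing rows and columns are left untouched.
    conn.execute("ALTER TABLE customers ADD COLUMN email TEXT")


def rollback_b_to_a(conn):
    # Rebuild the table without the new column. Every email captured
    # while version B was running is lost at this point.
    conn.execute("CREATE TABLE customers_a (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO customers_a (id, name) SELECT id, name FROM customers")
    conn.execute("DROP TABLE customers")
    conn.execute("ALTER TABLE customers_a RENAME TO customers")


def column_names(conn):
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    return [row[1] for row in conn.execute("PRAGMA table_info(customers)")]


def demo():
    conn = sqlite3.connect(":memory:")
    create_schema_a(conn)
    conn.execute("INSERT INTO customers (name) VALUES ('Ada')")

    upgrade_a_to_b(conn)
    conn.execute(
        "INSERT INTO customers (name, email) VALUES ('Grace', 'grace@example.com')"
    )
    after_upgrade = column_names(conn)
    emails_before = conn.execute("SELECT COUNT(email) FROM customers").fetchone()[0]

    rollback_b_to_a(conn)
    after_rollback = column_names(conn)
    rows_after = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
    conn.close()
    return after_upgrade, emails_before, after_rollback, rows_after
```

Both rows survive the rollback, but the email data does not. That loss is the reason the safer path here is to fix the problem in a version C and roll forward.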
Rollback is a good choice if you can return to a previous working version of your application. The process would look something like this:
Start running version A -> upgrade to B -> detect failure -> return to version A
This can be advantageous for applications where reverting to a previous version is sufficient to restore service. With a Java/Tomcat application, for example, this could be as simple as removing your failed WAR (Web ARchive) file and redeploying the previous working WAR file.
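The WAR swap described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a recommended deployment tool. The `rollback_war` function, the directory layout, and the `<app>-<version>.war` naming scheme are all hypothetical. It also assumes Tomcat auto-deploy is enabled so the replaced file is picked up; otherwise a restart or an explicit redeploy would be needed.

```python
import shutil
from pathlib import Path


def rollback_war(webapps_dir, releases_dir, app_name, previous_version):
    """Replace the deployed WAR with an archived copy of a previous version.

    Assumes (hypothetically) that earlier builds are kept in releases_dir
    as '<app_name>-<version>.war'.
    """
    webapps = Path(webapps_dir)
    previous_war = Path(releases_dir) / f"{app_name}-{previous_version}.war"
    if not previous_war.is_file():
        # Refuse to remove the failed release if there is nothing to put back.
        raise FileNotFoundError(f"no archived WAR at {previous_war}")

    deployed_war = webapps / f"{app_name}.war"
    exploded_dir = webapps / app_name

    # Remove the failed WAR and its unpacked directory, if present.
    if deployed_war.exists():
        deployed_war.unlink()
    shutil.rmtree(exploded_dir, ignore_errors=True)

    # Put the previous working WAR back in place.
    shutil.copy2(previous_war, deployed_war)
    return deployed_war
```

Note that the function checks for the archived WAR before deleting anything. A rollback that removes the failed release and then discovers there is no previous artifact leaves you worse off than before.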
The application architecture will dictate whether rollforward is required or whether you can choose between rollback and rollforward depending on the situation.
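One way to picture that decision is as a small planning function. The sketch below is purely illustrative: the `Release` class, its `reversible` flag, and `plan_recovery` are hypothetical names, and a real release process would weigh more than a single boolean.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Release:
    version: str
    # False when the release carries a change that cannot be undone safely,
    # such as a database schema change.
    reversible: bool


def plan_recovery(previous: Release, failed: Release, fix_version: str) -> List[str]:
    """Return the ordered steps to restore service after a failed release."""
    if failed.reversible:
        # Rollback: A -> B -> detect failure -> return to A
        return [f"redeploy {previous.version}"]
    # Rollforward: A -> B -> detect failure -> correct in C -> upgrade to C
    return [
        f"diagnose failure in {failed.version}",
        f"build {fix_version}",
        f"deploy {fix_version}",
    ]
```

For example, `plan_recovery(Release("A", True), Release("B", False), "C")` returns the three rollforward steps, because B is marked as not reversible.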
Step 1: Plan for Failure
The time to figure out how to respond to a failed release isn’t after the release has failed. We need to plan for failure and practice responding to failure. To minimize risk and ensure application availability, every application delivery plan needs to include a failure recovery methodology, and that methodology needs to be tested ahead of time.
“Plans are useless, but planning is indispensable” – Dwight D. Eisenhower
Step 2: Decouple the Application from Supporting Components
Today’s applications are built upon an interconnected web of components. By decoupling the application from its supporting components and providing a clean contract between them, we enable a more manageable rollback scenario. Can you roll back an individual component, or do you need to roll back a dozen components in a specific order? Are the functions and expectations between components well established? Do you understand what a deployment or rollback of component A does to component B? Answering these questions ahead of time and decoupling the application from its dependencies lets us avoid “big bang” deployments and keep both releases and rollbacks as small as possible.
Step 3: Select the right technology
The capabilities of your application delivery solution matter when it comes time to roll back. For rollback to be a real insurance policy, you need to be confident it will work when you need it.
It’s not enough for your solution to simply revert to the previous version of your application. What counts is that you can quickly restore service, and depending on your application architecture, that may actually require you to roll forward.
Capabilities needed for rapid service restoration:
In my next post, Automated Application Rollback with Chef, we’ll review how Chef provides that insurance policy, our methodology, and how our technology is uniquely suited to restoring service after a failure.