We don’t have to find Permit A38: Better, Faster Incident Management
I grew up in Europe. As a child I read comics called Asterix and Obelix. Asterix and Obelix are two Gaulish warriors, who have adventures and fight the Roman Empire during the era of Julius Caesar. In one of their adventures, they have to fulfill 12 impossible tasks. And one of them is obtaining permit ‘A38’ from the administration of ancient Rome. (see extract here: https://youtu.be/GI5kwSap9Ug)
The term “A38” has become a common saying in my generation, referring to the endless bureaucratic nightmare that Asterix and Obelix get stuck in.
‘A38’ sometimes comes to mind when I think about the ITIL process library with its 26 processes and 5 stages. Comparing Roman bureaucracy to IT processes may seem a little over the top at first, but it captures how some users feel. Bear with me.
The essence of bureaucracy is that you can’t go to someone who actually solves your problem. Rather you have to contact someone who starts the process of solving your problem, but who will mostly just hand it off to someone else. Sound familiar?
For me, that is reminiscent of the service structure for incidents. For complicated problems, you have to pass through the 1st level, then the 2nd level, and finally the 3rd level to find someone who will actually solve your problem. This structure works. It’s a structure for cost efficiency. But it creates artificial barriers between the levels, and between the levels and the customer.
We now see that this structure does not take the DevOps mindset into account. It’s not optimized for speed. It’s not ‘you build it, you run it’. It shields teams from customers’ feedback. It breaks the team into isolated silos.
We have to rethink the traditional Incident Management process.
Our new structure breaks with the traditional process through the following innovations:
Proactive: It starts before the incident, using data-driven insights and simulations to prevent incidents from happening.
Holistic: One team works on an incident from start to end, with no handovers.
Efficient: It saves time through automation: automated deployment, customer self-help, and self-healing applications.
Cost-optimizing: Continuous improvement is built in to increase the return on investment for our company.
Our new structure is divided into four stages. The idea is to have a lightweight structure compared with ITIL. It involves the full team and doesn’t lead to siloed thinking. One team works across all four stages:
Observe & Anticipate: We mitigate the risk upfront.
Orient & Decide: What is going on and what are we going to do?
React: We act according to our decision and solve the issue.
Learn & Optimize: We optimize the return on the unplanned investment.
In step one, all team members observe and anticipate. They monitor the application from the outside to spot known issues, and they observe it from the inside to spot the unknown. They simulate problems to understand how the application reacts in extreme situations, and they analyse historical incidents to find patterns and resolve them.
The second step starts when an issue comes up. This can come from our incident ticket system or because we found a problem proactively in step one. In such situations, the incident commander, the person in charge, orients and decides on the next steps. This includes gathering information about the surroundings, such as other applications, and swarming to inform people and to get the right experts for the job.
If not already done, we name an operation lead who responds to the incident and solves the issue. The optimal role structure during the reaction phase is based on the incident command system. This role structure is optimized for flexibility and keeps pace with issues that spread and escalate quickly. If an update of the production system is required, the operation lead deploys the fix through the usual DevOps pipeline and doesn’t take a shortcut.
After the solution is provided, we have to learn and optimize. In this phase, we optimize the ROI for our company. To do that, we facilitate a blame-free post-mortem meeting. We draft actions and update our backlog accordingly. We make our learnings transparent for everybody. This transparency is important. We want our company to learn, not only our team.
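For readers who think in code, the four stages described above can be sketched as a tiny state machine. This is only an illustration under assumptions, not a tool we actually use: the names `Stage`, `Incident`, `advance`, and the example team and notes are all hypothetical. The point it tries to show is that the owning team never changes while the incident moves through the stages.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    # The four stages named in this article, in order.
    OBSERVE_ANTICIPATE = "Observe & Anticipate"
    ORIENT_DECIDE = "Orient & Decide"
    REACT = "React"
    LEARN_OPTIMIZE = "Learn & Optimize"


# Enum members iterate in definition order.
ORDER = list(Stage)


@dataclass
class Incident:
    """Hypothetical incident record owned by a single team throughout."""
    title: str
    team: str
    stage: Stage = Stage.OBSERVE_ANTICIPATE
    history: list = field(default_factory=list)

    def advance(self, note: str) -> Stage:
        """Log a note for the current stage, then move to the next stage.

        The last stage has no successor, so the incident stays there.
        """
        self.history.append((self.stage, self.team, note))
        position = ORDER.index(self.stage)
        if position < len(ORDER) - 1:
            self.stage = ORDER[position + 1]
        return self.stage


# Example walk-through with made-up data.
incident = Incident("Checkout latency spike", team="checkout-team")
incident.advance("Monitoring spotted the issue proactively")
incident.advance("Incident commander named an operation lead")
incident.advance("Fix deployed through the usual pipeline")
incident.advance("Blame-free post-mortem held, backlog updated")

# Every history entry carries the same team: no handovers.
owners = {team for _, team, _ in incident.history}
```

There is deliberately no field for a 1st, 2nd, or 3rd level here. The only thing that changes over the life of the incident is the stage, never the owner.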
Incidents are opportunities to harness the power of DevOps. The DevOps mindset is to learn from the customer’s feedback as fast as possible. We automate deployment and the quality gates to get this feedback quickly. We can thus take this unplanned investment, an incident, and turn it into high-value insights about our system, our application, and its usability.
Just like Obelix smashing down the walls of the department, we have to tear down the walls between our teams. Developers have to be directly responsible for the customer feedback. At the same time, we have to be smart like Asterix and work in a well-defined structure. The structure coordinates the team to work efficiently.
Incident management breaks down the silo between the developers and our customers.