Incident Management

Kindor
Aug 29, 2024
4 min read

Updated: Aug 6

Why you should have an incident management process no matter how large your engineering team is

In a previous post, we talked about the importance of starting to measure the DORA metrics. A critical part of DORA is the way your team detects and reacts to an incident.

If you don't have an incident management process in place, this post will guide you through the importance of this process and how to start implementing it following simple steps.

What is an incident management process?

When building software, one thing is certain: outages will happen. Mistakes happen all the time due to human, process or system failures. However, that doesn't mean you can't work to prevent them and, when they do occur, mitigate them as quickly as possible.

An incident management process not only provides a guided series of steps to identify, analyze and respond to disruptions, but it also serves as a framework from which you can continuously learn, adapt and prevent future outages.

Additionally, an incident management process and the metrics derived from it can be valuable indicators to answer questions such as:

Is my team moving fast enough without sacrificing quality?
Do I have quality issues in a particular part of my product?
Do I have the right processes and tools to quickly recover from an outage?

What are the key steps in an incident management process?

Detection: This can occur when someone from any department notices that something is "off" or when an alert is triggered.
Logging and declaration: After the outage is detected and an engineer confirms that the outage is occurring, it is crucial to register and communicate it to the affected stakeholders.
Categorization and prioritization: Once the issue is confirmed, an initial assessment must be conducted to evaluate the impact and urgency of the problem. This will help to determine the best team to handle it and how quickly they need to react.
Diagnosis and investigation: In this step, the engineering team performs a quick analysis to understand what caused the issue and how it can be mitigated.
Mitigation in progress: During mitigation, the involved team provides regular updates to affected stakeholders. It is usually recommended to have a single point of contact (e.g., an Incident Commander) who is in charge of keeping all parties informed about the progress.
Resolution and recovery: A fix or workaround has been implemented and pushed to production. Sometimes, there's a recovery phase while the fix is propagated to all systems.
Incident analysis / postmortem / root cause analysis: Once the issue is mitigated, the team needs to regroup to summarize what happened, document learnings, and define the tasks to be done to prevent similar incidents in the future. Here, it is crucial to define clear owners and deadlines. Otherwise, priorities might shift and nothing may actually be executed.
Continuous Improvement: Beyond the postmortem process, it is important to evaluate whether the team is improving week over week. This is where metrics like Change Failure Rate, Number of Incidents per Week and Mean Time to Recover (MTTR) are important and will help you evaluate your progress over time.
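
As a rough illustration of the metrics named in the last step, here is a minimal Python sketch. The Incident record, its field names and the sample data are assumptions made for this example and do not come from any specific tool. It treats MTTR as the average time from detection to resolution, and Change Failure Rate as the share of deployments that led to an incident.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record; field names are illustrative only.
    detected_at: datetime
    resolved_at: datetime


def mean_time_to_recover(incidents):
    """Average time from detection to resolution."""
    if not incidents:
        return timedelta(0)
    total = sum((i.resolved_at - i.detected_at for i in incidents), timedelta(0))
    return total / len(incidents)


def incidents_per_week(incidents):
    """Count incidents by ISO (year, week) of detection."""
    counts = Counter()
    for i in incidents:
        iso = i.detected_at.isocalendar()
        counts[(iso[0], iso[1])] += 1
    return dict(counts)


def change_failure_rate(failed_deployments, total_deployments):
    """Share of deployments that led to an incident."""
    if total_deployments == 0:
        return 0.0
    return failed_deployments / total_deployments


# Made-up sample data for the example.
incidents = [
    Incident(datetime(2024, 8, 5, 10, 0), datetime(2024, 8, 5, 11, 0)),
    Incident(datetime(2024, 8, 7, 14, 0), datetime(2024, 8, 7, 17, 0)),
    Incident(datetime(2024, 8, 13, 9, 0), datetime(2024, 8, 13, 9, 30)),
]

print("MTTR:", mean_time_to_recover(incidents))
print("Incidents per week:", incidents_per_week(incidents))
print("Change failure rate:", change_failure_rate(2, 40))
```

Comparing these numbers week over week matters more than any single value: a falling MTTR or Change Failure Rate suggests that the postmortem action items are having an effect.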

Tools and how to get started

There are multiple tools available to support your incident management process. Besides tracking the lifecycle of an incident, they can help you manage an on-call engineer or team, notify the right people when needed, and document postmortems:

PagerDuty: One of the most popular tools for end-to-end orchestration of rapid incident resolution.
Incident.io: Another popular tool for incident response and status updates.
Opsgenie: The Atlassian offering for incident management, which complements and integrates easily with the other tools in their ecosystem.

However, if you are still a small team and are looking for simpler and cheaper solutions, you can get started with tools that you already use on a day-to-day basis, such as:

Coordination through Slack and Google Meet.
Registering an incident and tracking its progress in Jira or Notion.
Writing the incident postmortem in a Google Doc, Notion or Confluence.
Tracking follow-up action items in Jira or Notion.

For example, you could define a specific issue type or label in Jira to register incidents and move them through the different board statuses to track their progress. Once incidents are resolved, you could count the issues marked as incidents and average their lead or cycle times to approximate your MTTR.
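
To make this concrete, here is a small Python sketch that works from a ticket's status history. The history list, the status names and the function names are hypothetical. In practice the data would come from an export of your tracker, and the statuses would match your own board.

```python
from datetime import datetime, timedelta

# Hypothetical status history for one incident ticket: (status, entered_at).
# The status names are illustrative; your board may use different ones.
history = [
    ("Declared", datetime(2024, 8, 5, 10, 0)),
    ("Investigating", datetime(2024, 8, 5, 10, 10)),
    ("Mitigating", datetime(2024, 8, 5, 10, 40)),
    ("Resolved", datetime(2024, 8, 5, 11, 30)),
]


def time_in_each_status(history):
    """Return how long the ticket stayed in each status before moving on."""
    durations = {}
    # Pair each entry with the next one to get when the status was left.
    for (status, entered), (_, left) in zip(history, history[1:]):
        durations[status] = durations.get(status, timedelta(0)) + (left - entered)
    return durations


def time_to_recover(history, resolved_status="Resolved"):
    """Time from the first recorded status until the ticket is first resolved."""
    start = history[0][1]
    for status, entered in history:
        if status == resolved_status:
            return entered - start
    return None  # the incident is still open


for status, spent in time_in_each_status(history).items():
    print(status, spent)
print("Time to recover:", time_to_recover(history))
```

Averaging time_to_recover over all tickets marked as incidents gives an approximate MTTR, and the per-status breakdown shows where incidents spend most of their time.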

Also, while the issue is being analyzed and worked on, you could create a dedicated Slack channel to report progress.

Finally, it is recommended to have a status page, especially when your incidents have a direct impact on your customers, to keep them updated about the status of the outage.

How Kindor makes this process easier for you

With Kindor, you can have a fully functional incident management process with the tools you already use (GitHub, GitLab, Bitbucket, Jira, Notion, Slack, Google Calendar and Meet). Kindor will simplify the process of understanding your number of incidents per week, the average time it takes your teams to recover from an outage, and many more metrics, such as the time an incident spent in each stage of its lifecycle and the pull requests created to fix it.

With its powerful notification engine, Kindor will also keep the involved teams informed regarding the incident progress and the final outcomes, using tools like Slack.

Our team at Kindor will work with your organization to make sure you have an incident management process in place that adapts to your needs and doesn't force you to completely change your way of working.

Conclusion

In summary, having an incident management process shouldn't be viewed as a process that only big companies require. Once you have a product in the market, you should start thinking about establishing an incident management process in your tech organization. This will not only help your customers, since you'll be able to proactively communicate when something is "off", but it will also help you to evaluate the speed and quality of your product development.

The sooner you establish this process, the easier it will be for your organization to adopt it.

If you are interested in establishing this but don't know where to start, Kindor can help you. As long as you are using a project management tool like Jira or Notion, Kindor can guide you or adapt to your processes to empower your tech organization with the right metrics to become more efficient and productive over time.