Startups operate under constant pressure to innovate, stay competitive, and consistently deliver excellent user experiences. That pressure translates into shipping more changes, and each change can make or break the functionality or user experience of the products and services they offer. This makes startups an interesting place for incident management.
Incident Management in Startups
Incidents erode customer trust, and for growth-stage startups, customer trust is crucial. Customers should be confident in the startup's ability to offer a reliable product as well as improve it. Besides, any reliability issues that surface should be mitigated quickly and never repeated. That requires dedicated engineering time to make the product more reliable, but startups are almost always resource-constrained, which makes this practically impossible without a deliberate approach.
In startups, the typical response to an incident is to get all hands on deck and resolve the issue. While this works for small teams, it won't scale as the business scales. As the organization grows, there will be more information silos, tool sprawl, and gaps in knowledge about the overall software infrastructure. This slows down incident resolution and burns out teams. To tame the chaos and take back control of incident management, it is imperative to build an incident management strategy.
Build a Lightweight Incident Management Process
In a startup that is trying to move as fast as possible, introducing lengthy, heavyweight processes will create a bottleneck. What is needed is a lean process that has structure but is flexible enough to evolve with the organization and its technology landscape. The aim should be a process that teams will actually use.
Observability
A prerequisite for any incident management is knowing when things are broken. To achieve this, the startup needs some observability in place. This is not to say that organizations should have full-blown observability implemented before setting up incident management. Like many other cross-cutting initiatives in software engineering, observability has a maturity model, typically evolving as Metrics > Logs > Traces > Profiles. So start with basic metrics for the infrastructure and applications. Any system exposes a large number of metrics to pick from, so use the Golden Signals (latency, traffic, errors, and saturation) as a starting point. Find reasonable thresholds for these metrics and set alarms accordingly. It is also good to plot basic business metrics to identify traffic patterns and use them as a signal.
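To make the idea concrete, here is a minimal sketch of threshold-based alerting on a few Golden Signals. The metric names, threshold values, and the metrics snapshot are all hypothetical; in practice the values would come from your monitoring system, and the thresholds would be tuned to your traffic.

```python
# Hypothetical thresholds for a few Golden Signals.
GOLDEN_SIGNAL_THRESHOLDS = {
    "latency_p99_ms": 500,      # alert if p99 latency exceeds 500 ms
    "error_rate_pct": 1.0,      # alert if more than 1% of requests fail
    "saturation_cpu_pct": 85,   # alert if CPU utilization exceeds 85%
}

def check_signals(metrics: dict) -> list[str]:
    """Return an alert message for every signal above its threshold."""
    alerts = []
    for name, threshold in GOLDEN_SIGNAL_THRESHOLDS.items():
        value = metrics.get(name)
        if value is not None and value > threshold:
            alerts.append(f"ALERT: {name}={value} exceeds {threshold}")
    return alerts

# A hypothetical snapshot of current metric values.
snapshot = {"latency_p99_ms": 620, "error_rate_pct": 0.4, "saturation_cpu_pct": 91}
for alert in check_signals(snapshot):
    print(alert)
```

Even a crude check like this, run on a schedule and wired to a notification channel, beats finding out about outages from customers.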
Basic Incident Management Process
When dealing with incidents, a few basic pieces of information should be well known or easy to figure out.
- Who is responsible for identifying problems and declaring an incident – While a full-fledged on-call setup with an escalation matrix might be overkill for a small startup, a team of any size should have some sort of rotation between engineers, so that the engineer on rotation is the first to be notified of, identify, and investigate any issues. This avoids burnout and helps everyone in the rotation build a well-rounded understanding of the software infrastructure.
- A centralized knowledge base of software applications, infrastructure, and the ownership of each system – While this sounds like a big undertaking, it doesn't have to be, especially for startups. Start with a spreadsheet that lists your microservices, databases, and cloud components, and record the engineers who primarily work on each component as its owners. Adding a link to the codebase and the deployment pipeline adds further value. This little index of the software infrastructure pays off when incidents occur: the engineer on rotation doesn't have to spend time identifying the affected service, its owners, or the recent code and config changes.
- Collecting incident data and learning from it – Incidents are unavoidable; in a startup especially, things will break as you make progress with your product. The idea is to learn from incidents so that the root cause can be fixed and future incidents prevented. That means recording any information relevant to analyzing the incident. Forget fancy tools and start with a basic document – a wiki page, a text file, a Word or Google doc – anything will do. Record what was observed, what the timeline was, relevant links, who did what, and so on. This becomes a living document that is updated during the course of the incident and then used for incident reviews and postmortems.
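The spreadsheet-as-catalog idea above can also start life as a plain CSV checked into a repo. The sketch below shows how a few lines of code turn such a file into a lookup table for the engineer on rotation. The service names, owners, and URLs are invented placeholders, not a prescribed schema.

```python
import csv
import io

# A hypothetical service catalog, exactly as it might live in a CSV file.
CATALOG_CSV = """service,owner,repo,pipeline
payments-api,alice,https://example.com/repos/payments,https://ci.example.com/payments
orders-db,bob,https://example.com/repos/orders,https://ci.example.com/orders
"""

def load_catalog(raw_csv: str) -> dict:
    """Index the catalog by service name for quick lookup during an incident."""
    return {row["service"]: row for row in csv.DictReader(io.StringIO(raw_csv))}

catalog = load_catalog(CATALOG_CSV)
# During an incident: who owns the failing service, and where is its pipeline?
print(catalog["payments-api"]["owner"])     # -> alice
print(catalog["payments-api"]["pipeline"])
```

Starting this simply keeps the catalog easy to maintain; it can graduate to a proper service catalog tool once the process itself has proven useful.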
Post-incident Activities
Once the incident is resolved, the relevant engineering teams should investigate the root cause further and keep the incident document updated. Once there is enough data to ascertain the root cause and decide on action items for a long-term fix, all stakeholders should get together for an incident review or postmortem. Once all parties agree on the action items, they should be prioritized and executed.
Beyond individual incidents, it is always worth looking back at historical incident data. A record of past incidents is helpful both when making architectural decisions and when prioritizing reliability fixes.
As discussed so far, setting up a basic incident management process is not hard. What is necessary, however, is an engineering culture that encourages retrospection, so teams learn from incidents and bounce back from them; solid leadership buy-in and a top-down initiative are needed to institute such a culture. Which tools are used to record and improve the incident management process matters far less. The priority is to establish a baseline process; once that process matures, it can be streamlined further and effective incident management tools can be adopted.
Stackbeaver can help you streamline your incident management process and pick the right incident management tools. Reach out to us.