The moment after an incident is resolved is perhaps the most relaxing for any IT team. A system that is finally functioning properly puts the entire organization at ease, but the most daunting task is yet to come: root cause analysis (RCA). Much like a football team reviewing previous plays to pinpoint areas for improvement, root cause analysis works back through the incident data to find what initially caused the incident.
Analyzing the root cause of a problem presents a unique challenge for an organization. There can be many factors that make the process harder, from too many alerts to lack of documentation. Perhaps the most detrimental is not having a set procedure in place. This key step is missing from many organizations’ incident plans. Any good incident plan includes a process, not just a requirement, for root cause analysis. You can read about one of Netreo’s favorite process methodologies here.
Note that a few things can be done during incident resolution, before root cause analysis begins, to make the analysis easier: assigning and defining roles, establishing best practices, and leveraging available tools. Each enterprise will have different needs depending on its functions and size. Even so, clearly defining the functions and scope of each role helps avoid major incidents. The following are a few key roles that each organization should have:
Each incident should have only one incident lead, who acts as the captain. Strong command skills and experience in incident management are paramount. The incident lead should also understand problem diagnosis and resolution. Their general knowledge should extend beyond the system monitoring and diagnostics tools to the application and infrastructure components, as well as the engineering tools available. They will direct resources where they're needed the most and will drive all problem resolution actions as needed. Since this role is effectively in charge, the incident lead is responsible for collecting the data needed for the final root cause analysis.
A service lead will help direct the restoration efforts and set priorities based on their knowledge of what is important to the business. They should be an experienced engineer or manager who understands the system aspects and delivery requirements for the services that have been impacted. They also should be familiar with and be able to direct service restoration routines and procedures. Service leads are the ones who will know potential downstream impacts that must be considered and addressed. Additionally, they must know which business units and contacts must be engaged to minimize impact while the incident is being worked.
A technical lead is a specialist or subject matter expert. This is typically a senior engineer who has a full understanding of the production environment. Their job is to diagnose problems and lead the resolution effort in their component area (e.g., storage, network, or DBMS). Technical leads throughout the organization must coordinate and communicate with each other to solve issues that may lie between or beyond component areas.
Now that all of the roles have been defined, it is important to outline some best practices the team should adhere to during the incident resolution process to make root cause analysis (RCA) easier.
Having too many alerts can make root cause analysis more difficult. There are some ways you can reduce the amount of alerting noise that can obscure the root cause of an incident. A general rule of thumb is to make sure that active alerts are only for actionable items.
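As a rough illustration of that rule of thumb, the sketch below keeps only alerts that someone can act on and collapses repeats of the same alert. It is a minimal example under stated assumptions, not a description of how any particular monitoring product works: the `Alert` fields, the severity names, and the idea that an alert counts as actionable only when it has a `runbook` attached are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical severities that warrant a response; adjust to your own policy.
ACTIONABLE_SEVERITIES = {"warning", "critical"}


@dataclass(frozen=True)
class Alert:
    source: str                    # host or service that raised the alert
    message: str                   # human-readable description
    severity: str                  # e.g. "info", "warning", "critical"
    runbook: Optional[str] = None  # hypothetical: documented response, if any


def is_actionable(alert: Alert) -> bool:
    """Assumed policy: actionable means severe enough AND has a known response."""
    return alert.severity in ACTIONABLE_SEVERITIES and alert.runbook is not None


def active_alerts(alerts: List[Alert]) -> List[Alert]:
    """Return actionable alerts in arrival order, dropping repeats
    of the same (source, message) pair."""
    seen = set()
    kept = []
    for alert in alerts:
        key = (alert.source, alert.message)
        if is_actionable(alert) and key not in seen:
            seen.add(key)
            kept.append(alert)
    return kept
```

Under this policy, an informational "backup finished" message or a warning with no documented response never becomes an active alert, which leaves fewer entries to sift through when reconstructing the incident afterward.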
Using your tools as effectively as possible is key to faster incident resolution and root cause analysis.
Root cause analysis is important for resolving future incidents faster and preventing them from happening again. Implementing the practices above in your resolution plans will make for a more efficient and optimized organization. Netreo provides you with the keys to do that with ease through its automated reporting and integrated platform.