Podcast #2 – Finding The Root Cause And Preventing Further Incidents


Podcast #2: Problem Management – The Process of Finding The Underlying Root Cause of the Problem

Transcript:

Jack: Welcome to another episode of the Transformative IT Service Management Podcast here at Bell Techlogix. This is Jack Mansfield. We’re going to talk today about problem management. Joining me again is Brenda Lichtenberg. Brenda is the Senior Vice President of Strategy and Portfolio. We talked last time quite a bit about incident management and how it connects with the other service management processes. What is problem management? Let’s define that first.

Brenda: Problem management really is a process where you look at several incidents that have been opened. It can be identified as … in working with different types of incidents in terms of the priority. If I have a priority one or two, it will automatically open up a problem ticket. And you may say, “Why is that?” The reason you open up a problem ticket is that those are critical issues. And when those critical issues occur, what you want to do is make sure that that SEV-1 does not happen again. So, automatically, a problem ticket is opened. A problem ticket will then drill down to find out what that issue is, and then also determine the root cause. And root cause analysis is a very key element of problem management. Problem management will track SEV-1s and SEV-2s end to end, and a percentage of severity threes as well. We normally address the severity ones first, the severity twos next, and make sure that problem management takes care of each one of them by performing a root cause analysis.

Certainly the root cause analysis identifies how often that issue happened. Why did it happen? Do I need a change to enable that to not occur again? So there’s a lot of things to consider when you’re looking at problem management. It may also involve other vendors or other players in the network, for example, that you may need to pull together a supplier or vendor network, server management to address those particular issues.

Jack: So problem management is a lot more than just getting the thing back on line, right? Getting it back on line, that’s more of incident management. Figuring out why it went offline and preventing it from going offline again, that’s more of problem management, the underlying problem.

Brenda: Right. You’re going to do a lot more research and a lot more digging, and that may then lead you into change management. If we’re seeing it happen all the time, maybe that’s a server that’s going bad. So you may have to open a change ticket in relation to that problem ticket to resolve the issue permanently, and that would be part of your root cause analysis.

Jack: So, you talked about how sometimes you may log a problem ticket for every P-1 or P-2, maybe some of the P-3’s. Is there a methodology for figuring out when to log a problem ticket? Does it connect back to analytics at all?

Brenda: It absolutely connects back to analytics. In fact, a lot of the times, you may set certain goals, that for instance, every P-1, you’ll open up a problem investigation. Then maybe a percentage of P-2’s and a percentage of P-3’s, but when you look at the analytics against it, you’re going to be looking for trending, why are we seeing this problem so much? Maybe what we need to do, is up the percentage of evaluations that we’re doing on P-2’s to make sure that we address all of the problems.

Another thing that is also looked at in problem management, is that agents, if they commonly see an issue cropping up as they are logging incidents, they can initiate a problem investigation. That’s also another trigger that you want to train your staff to be looking for those repetitive problems, time and time again.
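The triggers Brenda describes here (every P-1, a percentage of P-2’s and P-3’s, plus anything an agent flags as repetitive) can be sketched as a small rule. This is only an illustration: the `Incident` type, the `should_open_problem` function, and the sampling rates are hypothetical and were not given in the conversation.

```python
import random
from dataclasses import dataclass

# Hypothetical sampling rates: every P-1, plus a share of P-2's and P-3's.
# The actual percentages are assumptions; an organization would tune them
# based on the trends it sees in its own analytics.
SAMPLING_RATES = {1: 1.0, 2: 0.5, 3: 0.1}


@dataclass
class Incident:
    incident_id: str
    priority: int                   # 1 = highest priority
    flagged_by_agent: bool = False  # agent noticed a repetitive issue


def should_open_problem(incident, rates=SAMPLING_RATES, rng=random.random):
    """Decide whether an incident should trigger a problem investigation."""
    # An agent who keeps seeing the same issue can always start one.
    if incident.flagged_by_agent:
        return True
    # Otherwise sample by priority; priorities without a rate are never sampled.
    rate = rates.get(incident.priority, 0.0)
    return rng() < rate
```

Raising the P-2 rate in `SAMPLING_RATES` corresponds to "upping the percentage of evaluations" when the trending shows the same problems coming back.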

Jack: So if there’s a lower priority incident, but it’s happening all the time, even though when it happens that one time, it may not have a huge business impact, the collective business impact across all of the different times that it’s happening, might actually make it as bad to the business as a single major incident.

Brenda: Correct, that could be the catalyst of a major problem. So when you see those repetitive problems, even at a lower level, even at a normal priority, you want to create a problem investigation to start checking that out and to avoid a larger and more costly issue.
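One rough way to picture the collective impact of repetitive low-priority incidents is to total them by category and flag any category whose count or combined impact crosses a threshold. The function name, the input shape, and both thresholds below are assumptions made for the sake of the sketch.

```python
from collections import defaultdict


def find_recurring_issues(incidents, min_count=5, impact_threshold=240):
    """Return categories that recur often or add up to a large impact.

    incidents: iterable of (category, impact_minutes) pairs.
    min_count and impact_threshold are illustrative values only.
    """
    totals = defaultdict(lambda: [0, 0])  # category -> [count, total minutes]
    for category, minutes in incidents:
        totals[category][0] += 1
        totals[category][1] += minutes
    # Flag a category if it repeats often OR its combined impact is large.
    return sorted(
        category
        for category, (count, minutes) in totals.items()
        if count >= min_count or minutes >= impact_threshold
    )
```

Ten 30-minute incidents in one category add up to 300 minutes, so that category is flagged even though no single occurrence would stand out on its own.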

Jack: So you used a phrase earlier, and it’s pretty common in our industry, root cause analysis. Is root cause analysis the same as problem management?

Brenda: No, they are completely different. So problem management is the act of looking at a particular problem, resolving it in terms of tracking it end to end and bringing in the right parties. Root cause analysis is, you’re going to go after the exact root cause of that problem. You may open up a change ticket. You may bring in other suppliers and vendors to resolve that particular issue. So a lot of times it really can have almost any output of, as far as what that cause of that problem can be, but that is in fact, the essence of the word. Right? Root cause analysis.

Jack: I was at a conference last year and there was a CIO speaking and to illustrate the difference between root cause analysis and problem management, he talked about backhoe interrupts. Right? A case where a backhoe was actually doing some excavating for a business property and took out a fiber line. Well, there’s not an underlying problem per se, but the root cause was, the fiber line was cut. So you could do a root cause analysis, but there wasn’t necessarily a big underlying problem to the overall architecture of the IT environment.

Brenda: Exactly and that’s where other factors come into play. So that sometimes a root cause analysis can be very easy, like in the example you just quoted, and sometimes root cause analysis can be very trying, especially when you’re dealing with multiple vendors, perhaps network issues, that type of thing, that may actually make that problem investigation go out for several days in duration.

Jack: Even if there are maybe multiple potential underpinning causes and multiple stakeholders, problem management still applies.

Brenda: It certainly does. There are certain … Lots of different conditional issues that will crop up, that will lead you to continue to drive forward to solve a problem investigation.

Jack: So is this connected at all to SIAM?

Brenda: Yeah, so SIAM is another area where … When we were talking about vendors and different providers that may contribute to our problem investigation or root cause analysis, SIAM is the service integration and management process that really pulls it all together. Right? We’ll look at how your providers are tracking in terms of performance, how they’re looking at their incidents, looking at their problems, looking at their root cause analysis. Above all, it’s even tracking to the point of processes. You want to make sure that your suppliers are all using the same process, the same system. That really gives you a picture of transparency and truth of what’s happening in your environment.

In SIAM, you will also track an overall dashboard of how everyone is performing. So it is really, collectively, teamwork. When you have a SIAM practice involved, you’re looking for everyone, all your stakeholders, all your suppliers, to be using the same processes and same tools, and thus that will show everyone how you’re working together and improve operational costs and effectiveness.

Jack: I can see how that would be particularly important with problem management, where you might have a complicated problem, or you may have multiple vendors, whether it’s different network carriers, whether it’s a colocation facility with Smart Hands that are touching the switches and the routers and the storage application providers. If you have a complicated problem, there has to be some sort of coordinated effort to get that base level information from all of the different parties, to really get to the root of what is the cause of this problem and how are we going to work together in the future to implement maybe a change through change management to resolve that underlying problem and make a broader business impact.

Brenda: And really, with having a SIAM practice in place, you kind of hit the nail on the head there, is that when you’re looking at all the different providers, if you know who the provider is, and we’ve already established ourselves as a team, you can quickly get to the right person at the right time and get those issues resolved a lot faster. That’s what really a SIAM practice brings to the table.

Jack: So if I’m doing problem management and I’m identifying potential problems and I’ve gone through, for those that are appropriate, a root cause analysis, I’m going to have a lot of data. What do we do with that data?

Brenda: All of that data is really looked at in terms of planning for the future. Planning for the future, and that can be the near future or the distant future as well. So we’ll look at the trending and the … Look at where the problems are, to say, “Maybe we need better knowledge articles. Maybe we have a system that’s consistently hitting high thresholds and we’ve done corrective actions, but really, it needs to be replaced.” So maybe you need hardware and that goes into the budget, right? It may not need to be done now, but it may be something you plan on in the future so-

Jack: So, like capacity management, capacity planning?

Brenda: Correct, right. It will highlight those issues, not necessarily point out, but that’s where your root cause analysis will come into play as well, determining that, “Hey, it’s hitting a capacity constraint and we need to plan for a new server.” So it’s all about planning and it’s all about being able to be proactive about your environment and address issues in a timely fashion.

Jack: Well, if problem management really is more of a forward-looking view on how you’re going to impact the future state of the business, how can an organization actually measure the success of problem management?

Brenda: The way that that’s captured is that you should not be seeing those same incidents reoccur time and time again. Right? If you’re seeing those same incidents reoccur, that means that your root cause analysis is not working. It means that the change that was potentially put in did not work, so that’s when you will really drive right back around to root cause analysis and say, “We didn’t address the issue. We need to do another corrective action and look at this area.” The evidence around whether it’s working or not is pretty prevalent in the numbers and in the transparency of the data that will be coming back to you.
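The measure Brenda describes, whether the same incidents come back after a fix, can be approximated by counting incidents of the same kind logged after the corrective change went in. The helper names and the use of plain dates are assumptions for this sketch, not something specified in the conversation.

```python
from datetime import date


def recurrence_count(incident_dates, fix_date):
    """Count incidents of the same kind logged after the fix went in."""
    return sum(1 for d in incident_dates if d > fix_date)


def fix_held(incident_dates, fix_date):
    """True if no matching incident has reoccurred since the fix."""
    return recurrence_count(incident_dates, fix_date) == 0
```

If `fix_held` comes back false, that is the signal to "drive right back around" to root cause analysis and plan another corrective action.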

Jack: One more question, that just as you were talking about the reduction of repetitive problems and the forward looking view, it made me think. I know we both have some history in operations, running different aspects of IT, I guess the question comes up, how do you find that time to focus on problem management as opposed to that day to day firefighting of incident management?

Brenda: You really need to make it a priority. A lot of people really do take problem management as a second priority, but if put … In making problem management a priority, you’re gonna save cost, you’re going to save time overall in your environment and you’re going to be more efficient overall. So problem management is really part of the key success in incident management, problem management, root cause analysis. In fact, a lot of times with service desk, they bundled incident management and problem management together, because of the fact that they were so closely related. So problem management should be made a priority.

Jack: Well, and I know as I hear from a lot of IT leaders, every year, there are increased budget constraints. They’re constantly being asked to either maintain or cut their budget as the business continues to scale and grow. I think the only way that you’re able to make that happen is through good problem management and getting rid of those repetitive incidents that are constantly a drain on that day to day firefighting. It also connects back to what we talked about in the incident management section with automation, where if we find a repetitive problem through our problem management process, that might be a target for an opportunity, whether it’s a script that we put into an intelligent agent, or whether it’s something that we try to proactively heal through operation management, as well as some automation. But there are a lot of opportunities there.

Well thanks Brenda. This has been another episode of The Transformative IT Service Management Podcast. Today we’ve been talking about problem management. Tune in next time, where we’re gonna talk about the next step after you’ve identified a problem. How do you actually remediate that problem and that may be through change management.