I fly a lot. Mercifully, not as much as your typical road warrior type, but enough that I’ve developed a routine on travel days. If it’s a good weather day, then I’ll hear the pilot announce, “We’ll be cruising at 33,000 ft. and expect clear skies all the way to our destination.” And if there are lots of clouds? There’s auto-pilot, air traffic control and radar available to do some of the heavy lifting. That got me thinking. What if the software tools we rely on every day to do our jobs aren’t clear? Is our only recourse to seek out documentation, vendor tech support, or our local tool guru for assistance? Sadly, the answer here is “yes” and that’s an excellent introduction to our fourth “C” in the “4 Cs of Quality Monitoring Tools” series: “Clarity”.
In the spirit of this particular blog post, let’s first be clear about our terms (see what I did there?). What does it really mean to have “Clarity” as a cornerstone in a network and systems management application? It comes in many forms. First, you want a clean, digestible primary UI that requires a minimal amount of your attention budget. The second thing to look for is intelligible and succinct alerting behavior. Lastly, you want to be able to rely on concise “tear away” reports and report setup.
A typical IT engineer has a pretty packed day. They have to deal with over-utilized hard drives, non-responding applications, network slowdowns, and we certainly can’t forget about angry, pitchfork-wielding end-users who want their issues resolved too. The last thing on Earth an engineer wants to deal with is an NMS with a noisy, cluttered UI. Why? The primary reason is efficiency. If you’ve already got a full plate of troubleshooting activities, then it’s only logical that a UI which highlights your broken stuff AND brings it to your attention should be the desired end state. Why waste more time trying to figure out what needs to be fixed? When I’ve expressed this idea to customers in the past, a common objection is “For an application like an NMS system a simple UI is ineffective. I need to know details of the problems I’m troubleshooting.” My response is “Yes, you do, just not on your primary UI that everybody uses”. You want the primary UI to be simple, clear, and demonstrative. Leave the details to the actual troubleshooting process.
Speaking of things vying for the attention of IT engineers, have you ever been fixing a device that’s either broken unintentionally or “under the knife” for an upgrade while your smartphone won’t stop chirping to let you know that your device is “DOWN” and should be looked after? I have. Notifications like that are about as helpful as an illuminated “Check Engine” light on your dash after you’ve already broken down by the side of the road. During my time as a sysadmin I threw my pager across the room more than once in frustration. A phrase I’m fond of using with regard to NMS alerting is “If everything is a CRITICAL alert in your environment, then nothing is a CRITICAL alert in your environment.” Too much noise in your inbox will lead to legitimate problems getting overlooked. Besides the fact that an NMS with intelligent and succinct alert behavior will keep the message quantity in your inbox to a manageable level, think about the money you’ll save by NOT having to replace a smashed smartphone.
Perhaps it’s obvious why an uncluttered UI and controlled alerting are important to the clarity of an NMS tool, and there was no need to explain. In contrast, the third element of a clear NMS probably isn’t obvious, but it should be. Why is it important for a tool to have concise and easy “tear-away” reporting and setup? The ‘obvious’ answer is that if a tool isn’t easy to use, then it’s not going to be used. Period. Easy reporting invites usage and efficiency. It allows users and managers to make informed decisions quickly and get on with their day.
There’s a big difference between the ‘concept’ of having clarity in your Network and Systems Management platform and bringing that concept to life. To call on the overused MBA-ism, “When the Rubber Meets the Road”, how can we achieve an acceptable level of clarity in our toolset?
First things first, seek out a toolset with an intuitive UI. You want your navigation to start at the 50,000ft level and work its way down into more details as you click through. Of course, part of clarity also means not going too far. If you can’t get to the detail you need for initial troubleshooting within one or two clicks of the mouse, you’re probably working too hard. For example, on a primary dashboard all you need is some kind of a designation indicating a CRITICAL problem. Drilling in from there, you should be able to see specifics: the offending device, time of the problem, and any other details.
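To make the "one word up top, specifics one click down" idea concrete, here is a minimal Python sketch. Everything in it is made up for illustration (the `Problem` record, the three severity levels, the function names); it is not how any particular NMS, Omnicenter included, models its dashboard.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical severity scale, ordered from least to most severe.
SEVERITY_ORDER = ["OK", "WARNING", "CRITICAL"]


@dataclass
class Problem:
    device: str    # the offending device
    severity: str  # one of SEVERITY_ORDER
    time: str      # when the problem was detected
    detail: str    # anything else worth knowing while troubleshooting


def dashboard_status(problems: List[Problem]) -> str:
    """Primary UI: a single designation, the worst severity currently open."""
    if not problems:
        return "OK"
    return max((p.severity for p in problems), key=SEVERITY_ORDER.index)


def drill_down(problems: List[Problem], severity: str) -> List[Problem]:
    """One click in: the specifics behind the designation shown up top."""
    return [p for p in problems if p.severity == severity]


open_problems = [
    Problem("core-sw-01", "CRITICAL", "09:14", "uplink down"),
    Problem("file-srv-02", "WARNING", "08:50", "disk 85% full"),
]
status = dashboard_status(open_problems)
specifics = drill_down(open_problems, status)
```

The primary view only ever computes `status`; the device, time, and detail fields stay out of sight until someone asks for them.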
We know what the user interface should represent at a high level: simply that there’s a problem. Leave the details for further analysis. Even better, however, is if your system doesn’t generate CRITICAL conditions for things that aren’t critical. Unfortunately, computers have this pesky problem of doing precisely what you tell them to do, not what you think they should do. When I ask people their pain points regarding their NMS systems, ‘alert deluge’ is high on the list. Drilling further, on more than one occasion, I’ve heard “Well I want to make sure if any link in my network has latency levels higher than 400ms, then I get sent an alert”. Okay, but what if latency isn’t the real problem? What if somebody at a large remote office (with lots of monitored devices) is streaming last night’s episode of “America’s Got Talent”? Billy Bob Haywood, who does a trapeze juggling act while setting himself on fire, is probably consuming a good portion of your bandwidth to that site. The result is that every other device at the site (all competing for the same bandwidth) will show high latency and generate alerts based on the threshold you set. Your NMS is doing what you told it to do, yet you still got a gazillion CRITICAL alerts. Look for tools that either allow you to train them to spot conditions like this and NOT alert you, or detect conditions like this automatically using machine learning (ML) to weed out the incident before it becomes an alarm in your inbox.
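One simple, rule-based way to picture the "spot the condition and don't flood me" behavior is to collapse per-device latency alerts into a single site-level alert when most devices at a site cross the threshold at once. The sketch below assumes that approach; the 400ms threshold comes from the example above, while the 80% cutoff, the tuple layout, and the function name are assumptions for illustration, not a description of any vendor's correlation engine (and certainly not of an ML-based one).

```python
from collections import defaultdict

LATENCY_THRESHOLD_MS = 400  # the threshold from the example above
SITE_WIDE_FRACTION = 0.8    # assumed cutoff: "most of the site is slow"


def correlate_latency_alerts(samples):
    """Turn (site, device, latency_ms) readings into a short list of alerts.

    If most devices at a site are slow at the same time, the shared link is
    the likely culprit, so emit one site-level alert instead of one per device.
    """
    by_site = defaultdict(list)
    for site, device, latency_ms in samples:
        by_site[site].append((device, latency_ms))

    alerts = []
    for site, readings in sorted(by_site.items()):
        slow = [dev for dev, ms in readings if ms > LATENCY_THRESHOLD_MS]
        if not slow:
            continue
        if len(readings) > 1 and len(slow) / len(readings) >= SITE_WIDE_FRACTION:
            alerts.append(
                f"CRITICAL: site {site} link congested "
                f"({len(slow)} of {len(readings)} devices slow)"
            )
        else:
            for dev in slow:
                alerts.append(
                    f"CRITICAL: {dev} at {site} latency above {LATENCY_THRESHOLD_MS}ms"
                )
    return alerts


samples = [
    ("remote", "pc-1", 650), ("remote", "pc-2", 700), ("remote", "pc-3", 610),
    ("remote", "pc-4", 820), ("remote", "printer-1", 590),
    ("hq", "db-1", 20), ("hq", "web-1", 450), ("hq", "web-2", 18), ("hq", "app-1", 25),
]
alerts = correlate_latency_alerts(samples)
```

With these made-up readings, the five slow devices at the remote office produce one alert about the site, and the single slow server at headquarters still gets its own alert.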
The last key to achieving clarity in your network management system is easy report setup. Reports are often one of those things that are an afterthought. In other words, you typically deploy an NMS to tell you what’s broken right now so you can fix it. However, one of the best ways to shift from “fire-fighting” mode to “fire-prevention” mode is to take advantage of reporting. Use reports to spot trends and snuff out problems before they become critical. A great ally here is the idea of reporting templates and periodic delivery. Find a tool that has this capability as a feature. Take the time to set up report templates that map to the most common trouble spots in your infrastructure (for example, high disk utilization on servers). Then have the reports delivered once a week (daily is probably too much, and monthly isn’t enough). NMS reporting doesn’t have to be a tool exclusively for the Pointy-Haired Bosses. Reports can effectively be used as a proxy for incident creation within your monitoring tools. By extension they also function as an excellent sleep-aid. If you stay on top of diagnostic reports you’ll be less likely to be jarred from a deep sleep because your primary database server ran out of drive space.
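As a rough illustration of what a weekly "fire-prevention" report could compute, here is a small Python sketch that projects when each server's disk will fill up, based on a week of daily utilization samples. The straight-line projection, the 30-day horizon, the server names, and the numbers are all assumptions chosen for the example; a real tool would pull the samples from its own data store and might well use a smarter forecast.

```python
def days_until_full(daily_pct_used):
    """Estimate days until 100% full from average daily growth (linear guess)."""
    if len(daily_pct_used) < 2:
        return None
    growth = (daily_pct_used[-1] - daily_pct_used[0]) / (len(daily_pct_used) - 1)
    if growth <= 0:
        return None  # flat or shrinking usage: nothing to project
    return (100.0 - daily_pct_used[-1]) / growth


def weekly_disk_report(history, horizon_days=30):
    """Return (server, current % used, est. days left) rows, most urgent first.

    Only servers projected to fill within horizon_days make the report,
    which keeps it short enough to actually get read.
    """
    rows = []
    for server, samples in history.items():
        eta = days_until_full(samples)
        if eta is not None and eta <= horizon_days:
            rows.append((server, samples[-1], round(eta, 1)))
    return sorted(rows, key=lambda row: row[2])


# Made-up daily disk utilization (%) for the past seven days.
history = {
    "db01": [70, 72, 74, 76, 78, 80, 82],
    "web01": [40, 40, 40, 40, 40, 40, 40],
}
report = weekly_disk_report(history)
```

Here `db01` is growing about two points a day, so it lands on the report with roughly nine days to spare, while the steady `web01` stays off the list.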
Where does “Clarity” rank among “Collaboration”, “Convenience”, and “Cost”, which are the other “C’s” I’ve discussed in this blog series? That answer is hard to determine. It depends upon the specific scenario and environment that the tool is deployed into. Regardless of how you rank them, they’re all pretty important. I suppose you could think of them as the equivalent of auto-pilot, air traffic control and radar for monitoring your environment. Each can be used to help you do some of the heavy lifting when things get turbulent.
Keep clarity in mind to make sure critical issues are easily found and solved. Netreo’s Omnicenter will help keep things simple.