The “core” network infrastructure is the heart of today’s business: without a fully functioning network, your business cannot run properly. If the network is malfunctioning, your employees cannot work on documents, send faxes, or make phone calls; in other words, your business is down. So it is crucial that you adopt an adequate network fault management approach and execute it precisely.
There are many facets to IT monitoring and management: monitoring systems and applications, managing resources, polling routers and switches, and capturing real-time data in the form of SNMP traps. All are important. But one of the most critical parts of your IT management solution is adequate trap management. Fault management is not simply capturing SNMP traps; it also includes proper filtering, de-duplication, and event correlation. Without these, your monitoring will be incomplete.
But there are problems with traditional fault processing. SNMP is normally carried over the User Datagram Protocol (UDP), which runs on top of the Internet Protocol (IP) but is not connection-based. This means that a trap is sent once with no acknowledgement, so there is a chance that messages sent from routers, switches, agents, and other devices will not reach their destination. It is this connectionless nature of SNMP that makes relying on traditional SNMP management via fault and trap capture alone very dangerous. A solution to this is to use TCP-based traps. But very few SNMP agents support sending connection-based TCP SNMP traps, and even fewer systems can receive them.
Another solution is to configure your SNMP-enabled devices to send SNMPv2c Informs. An Inform is retransmitted to the SNMP manager until the manager acknowledges it with a response or the sender’s retry limit is reached. So there is a way around this “unreliability”, but it is imperative that you invest in a fault solution that supports responding to SNMPv2c Informs.
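The difference between a trap and an Inform can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `LossyLink`, `send_trap`, and `send_inform` are hypothetical names, the network is simulated rather than real, and the per-attempt timeout a real agent would wait for is left out.

```python
class LossyLink:
    """Hypothetical stand-in for a UDP path that drops the first `drops` packets."""

    def __init__(self, drops: int) -> None:
        self.drops = drops
        self.delivered = []  # payloads that actually reached the manager

    def transmit(self, payload: bytes) -> bool:
        """Return True if the manager received the packet and acknowledged it."""
        if self.drops > 0:
            self.drops -= 1
            return False  # packet lost, so no acknowledgement comes back
        self.delivered.append(payload)
        return True


def send_trap(link: LossyLink, payload: bytes) -> None:
    """Trap semantics: send once and never learn whether it arrived."""
    link.transmit(payload)  # the result is ignored, as with a plain UDP trap


def send_inform(link: LossyLink, payload: bytes, max_attempts: int = 4) -> int:
    """Inform semantics: retransmit until acknowledged; return attempts used."""
    for attempt in range(1, max_attempts + 1):
        if link.transmit(payload):
            return attempt
    raise TimeoutError(f"no acknowledgement after {max_attempts} attempts")
```

On a link that loses the first two packets, `send_trap` silently loses the message, while `send_inform` gets it through on the third attempt. The retry limit matters too: an Inform is still not guaranteed delivery, it only tells the sender when delivery has failed.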
Another critical aspect of the overall solution is de-duplication. You have probably heard the phrase “when it rains, it pours.” Very often a small problem in the network snowballs into a larger one. This is especially true with applications. Each affected device sends SNMP trap messages. As more components have issues, go down, or become unavailable, more and more devices start sending messages. What you end up with is the “perfect storm” of the IT management world, in essence a network “trap flood”. This occurs when everything on the network decides to complain or communicate at the same time, sending complaints and alerts all to the same SNMP network manager. It is this type of behavior that causes most event and trap managers to fall over. Yes, that is right: the majority of fault solutions get completely overwhelmed in these situations and either take forever to respond to the events (and by that time it is too late), or they fail completely. A strong de-duplication algorithm keeps a trap capture management solution from being overwhelmed by these trap storms and lets the IT management system stay on top of events as they occur. In addition, a robust SNMP trap forwarder and a distributed fault capturing solution can help a great deal with these types of events.
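As a rough sketch of what de-duplication means in practice, the Python below collapses repeated traps from the same device into one event with a repeat counter. The class name `TrapDeduplicator`, the choice of (source, trap OID) as the duplicate key, and the 60-second window are all assumptions made for illustration; commercial products use their own, usually richer, rules.

```python
from dataclasses import dataclass

LINK_DOWN = "1.3.6.1.6.3.1.1.5.3"  # standard linkDown notification OID


@dataclass
class Event:
    source: str
    oid: str
    first_seen: float
    last_seen: float
    count: int = 1


class TrapDeduplicator:
    """Collapse repeats of the same (source, OID) seen within a time window."""

    def __init__(self, window_seconds: float = 60.0) -> None:
        self.window = window_seconds
        self.events = {}  # (source, oid) -> Event

    def ingest(self, source: str, oid: str, timestamp: float) -> bool:
        """Return True for a new event, False for a duplicate that was folded in."""
        key = (source, oid)
        event = self.events.get(key)
        if event is not None and timestamp - event.last_seen <= self.window:
            # Same trap again, soon after the last one: count it, do not re-raise it.
            event.count += 1
            event.last_seen = timestamp
            return False
        # First sighting, or the previous one has aged out of the window.
        self.events[key] = Event(source, oid, timestamp, timestamp)
        return True
```

During a trap flood, only the calls that return `True` need to reach an operator or a downstream correlation step; the rest just increment `count`. Because the lookup is a single dictionary access per trap, the cost of absorbing a duplicate stays small even when thousands arrive at once.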
Event correlation is another difficult but crucial portion of this solution. To determine exactly what is going on in the network, what is being affected, and what the results may be, it is essential that your solution have a holistic view of the network. That includes not only capturing SNMP traps, but also polling devices in your network, looking for trends, signaling events when certain thresholds are crossed, and then cross-referencing these events with SNMP trap-based events to see what is currently happening and what has happened in the past. With that history, the system can more easily look into the future. Predictive modeling is the ultimate goal: anticipating faults before they take down the network, and with it your business.
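The cross-referencing step can be illustrated with a deliberately simple Python sketch: pair each trap with any polled threshold event on the same device that occurred close to it in time. The `Observation` type, the `correlate` function, and the five-minute window are hypothetical; a real correlation engine would also use topology, severity, and historical trends.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    device: str
    kind: str       # "trap" (received) or "threshold" (raised by polling)
    detail: str
    timestamp: float


def correlate(observations, window_seconds: float = 300.0):
    """Pair each trap with threshold events on the same device within the window."""
    traps = [o for o in observations if o.kind == "trap"]
    thresholds = [o for o in observations if o.kind == "threshold"]
    pairs = []
    for trap in traps:
        for crossed in thresholds:
            same_device = crossed.device == trap.device
            close_in_time = abs(trap.timestamp - crossed.timestamp) <= window_seconds
            if same_device and close_in_time:
                pairs.append((trap, crossed))
    return pairs
```

For example, a `linkDown` trap from a router that arrives two minutes after polling flagged rising interface errors on that same router would be returned as one pair, which suggests the two are symptoms of a single fault rather than separate incidents.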
So the ultimate solution for fault management should include a system that supports SNMPv2c Informs, as well as an adequate event correlation engine and event de-duplication. These systems are few and far between; they can be counted on one hand. Most of the time, they cost millions of dollars and are offered only by the largest of companies. But there are a few lower-cost systems available (no, these are not open-source systems) with excellent technology and great support. The initial investment and even the annual support costs will pale in comparison to the amount of money that could be lost if the proper system is not put into place.