Best practice guide

Sure-fire remedies for your Monday morning networking headaches

If you're an ICT network technician, manager, or decision-maker, Mondays usually come with a heavier workload and intense pressure to sort out the problems that have crept up over the weekend. Networking issues may slow the business to a snail's pace, or even force it to a grinding halt. Your ICT network is the nervous system of your organisation. A slow or dysfunctional network equates to a massive corporate headache that needs to be relieved to avoid permanent damage.

Luckily, there's a permanent cure for many of these Monday morning headaches: once you understand the cause, you can prescribe a treatment of consistent, reliable network management processes and procedures.
The tension headache: weekend configuration changes

The scenario is familiar to many: you arrive at work after a weekend, only to find that the ICT network has slowed down to the point of being almost unusable. After several hours of painstaking investigation, it transpires that a technician has made a minor configuration change over the weekend, bypassing change control, without understanding what the impact might be on the rest of the infrastructure.

Organisations often underestimate the value of proper change management because they don't fully understand the complexity inherent in even the simplest of adjustments. Often, there are many variables which could affect the way the entire network behaves. When technicians make ad hoc changes over weekends, mostly out of necessity at the time, formal change processes are not followed, which may lead to a blinding networking tension headache on a Monday morning.

For an organisation with a vast and varied network, such as an urban railway operator, the results may be catastrophic, particularly given that Mondays in large cities usually come with their own additional volume of urban travellers. If the system is slow, ticket queues will soon snake out of subway stations and the entire city's public transport system could be brought to its knees. With a large and complicated infrastructure, an underperforming network is sometimes harder to fix than a network that's completely down, because it's more difficult to troubleshoot and find the fault.

Cure: proper change management processes and procedures

Proper change management usually starts when someone raises a change request with a change advisory board. This board typically includes a management representative, a business owner who understands the business impact should the change cause a failure, and someone with strong technical capability and skills. Vendors may also be represented. Regular, scheduled change management calls to discuss planned changes are recommended.
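The review flow above (a request is raised, considered by the board, then approved or rejected) can be sketched as a small state machine that refuses to skip the review step. The state names and class below are illustrative, not any particular ITSM tool's model:

```python
from dataclasses import dataclass, field
from enum import Enum, auto

class State(Enum):
    RAISED = auto()
    UNDER_REVIEW = auto()
    APPROVED = auto()
    REJECTED = auto()

# Allowed transitions: every request must pass through board review,
# so there is no direct path from RAISED to APPROVED.
TRANSITIONS = {
    State.RAISED: {State.UNDER_REVIEW},
    State.UNDER_REVIEW: {State.APPROVED, State.REJECTED},
    State.APPROVED: set(),
    State.REJECTED: set(),
}

@dataclass
class ChangeRequest:
    summary: str
    state: State = State.RAISED
    history: list = field(default_factory=list)  # audit trail of transitions

    def advance(self, new_state: State) -> None:
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(
                f"cannot move from {self.state.name} to {new_state.name}"
            )
        self.history.append((self.state, new_state))
        self.state = new_state

cr = ChangeRequest("Update edge router firmware")
cr.advance(State.UNDER_REVIEW)   # the board reviews the request
cr.advance(State.APPROVED)       # approved after impact analysis
```

Encoding the allowed transitions as data makes the "no ad hoc changes" rule mechanical: a request that never went to the board simply cannot reach the approved state.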
Often, the more thorough and effective these meetings are, the fewer emergency changes and calls there need to be. That's not to say that change management will eliminate all risks, but it can minimise most.

The change advisory board considers the request and recommends an impact analysis, which could be either simple and informal, or more complex and formal, depending on the complexity of the change. For this, an engineer usually investigates, often through lab testing and/or load testing, what the impact of the change may be on the rest of the infrastructure. Often quick and simple, this due diligence process is critical for all changes, even those perceived to be minor. The impact analysis includes not only failure predictions, but also performance impacts.

Based on the analysis, a change plan will be developed, which would include such aspects as the length of time the change would take to implement, the resources needed to effect the change, the various risks involved, and a rollback plan should the change not be successful and need to be reversed. This may not be as easy as simply undoing what's been done: sometimes steps need to be taken to make a rollback possible, particularly if certain files need to be destroyed or configurations permanently adjusted to make the change. Backups or alternatives then need to be considered and put in place.

After the change has been made, a period of early life support should follow. This requires assigning skilled resources to closely monitor the systems affected by the change and provide immediate support when needed.
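The rollback principle described above, taking a restorable snapshot before touching anything, can be sketched as follows. The config file name and the failing change function are hypothetical, stand-ins for a real device configuration and change script:

```python
import shutil
import tempfile
from pathlib import Path

def apply_change_with_rollback(config_path: Path, change) -> bool:
    """Apply `change` to a config file; restore the backup if it fails."""
    backup = config_path.with_suffix(config_path.suffix + ".bak")
    shutil.copy2(config_path, backup)   # snapshot first, so rollback is possible
    try:
        change(config_path)
        return True
    except Exception:
        shutil.copy2(backup, config_path)  # roll back to the known-good state
        return False

# Demonstration with a temporary config file and a change that fails
tmp = Path(tempfile.mkdtemp()) / "router.conf"
tmp.write_text("hostname edge-01\n")

def bad_change(path: Path) -> None:
    path.write_text("hostname edge-01\nbad line\n")
    raise RuntimeError("post-change validation failed")

ok = apply_change_with_rollback(tmp, bad_change)
# ok is False and the original config content has been restored
```

The key design point is ordering: the backup is taken before the change is attempted, so the roll-back path never depends on being able to "undo" the change itself.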
Life-support engineers need to understand the impact and associated risks of the change in order to react swiftly and decisively with the appropriate remedial action, including the instigation of the planned rollback strategy, should any problems arise. Once the change has been made, there should also be a reliable mechanism to test whether or not it's been successful before going live with normal traffic volumes.

Organisations easily underestimate the value of having remedial procedures in place following a change. Often, the assumption is that there won't be any problems if a change is effected using proper change management and impact analysis. However, there are always aspects of any system change that can't be predicted; even a lab can never completely mimic a live environment.

If an organisation chooses to outsource its network management to an external provider, it's imperative that all aspects of proper change management are in place and aligned with industry standards such as ITIL. These should be followed meticulously. The offloading of regular, day-to-day network management tasks will then free up the organisation's own ICT team to work on more strategic projects and innovations.
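The "test before going live" mechanism above can be sketched as a simple post-change checklist runner; the check names and always-passing lambdas are placeholders for real probes such as pings, interface counters, or latency measurements:

```python
def post_change_checks(checks):
    """Run named health checks after a change; report any failures.

    `checks` maps a check name to a zero-argument callable returning True
    on success. Every check must pass before normal traffic goes live.
    """
    return [name for name, check in checks.items() if not check()]

# All probes pass: the change is verified and traffic can go live
results = post_change_checks({
    "device reachable": lambda: True,        # e.g. a ping probe in practice
    "interface errors stable": lambda: True,
    "latency within baseline": lambda: True,
})

# One probe fails: early life support is alerted before Monday traffic hits
failures = post_change_checks({
    "device reachable": lambda: True,
    "interface errors stable": lambda: False,
})
```

An empty failure list is the go-live signal; anything else hands the named failing checks straight to the life-support engineers, who can then decide whether to remediate or trigger the planned rollback.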
The migraine: incident management over weekends

An unavoidable occurrence in ICT networks is failing devices. These incidents simply can't be predicted accurately, so the onus is on the network manager to have the most effective procedures in place to deal with failures as and when they occur. This usually takes the form of a maintenance and support contract with either a service provider or the vendor itself.

The challenge arises when a failing device is not covered by such a contract. This may be because the organisation decided not to cover devices over a certain age to save on support costs. Surprisingly often, though, it simply wasn't aware of the device until it failed. This is often due to sheer asset sprawl resulting in a network inventory that's not up to date.

When unsupported devices fail on weekends, the organisation often has to wait until Monday to initiate service. It would also have to make special arrangements so that the service provider can work with the vendor to replace the device sooner than the usual three- to four-day waiting period for a new device to ship to site. This inevitably becomes a costly remedy.

Depending on the service level agreement that the organisation has with its service provider, there will usually be engineers on duty 24/7 over weekends. These technicians would have access to spares, but probably wouldn't have the authority to make decisions that would help an organisation whose failed device isn't contractually covered. Skilled, experienced, decision-making employees are expensive to keep on duty, so support coverage usually suffers over weekends. The result is that problems are often delayed or deferred, to be resolved during the week. It's not necessarily the severity of these issues, but the sheer volume to be resolved that causes a massive Monday migraine.
Cure: a proper inventory and skilled, experienced employees on duty who follow standardised incident management processes

To solve the headache of weak incident management, it's firstly important for an organisation to have an accurate view of its entire networking estate. This should include all of its devices, their relative age, and their individual risk and value profiles. It's not imperative for each and every device to be covered by a support and maintenance contract: that would mean the organisation is paying unnecessarily to secure devices that aren't critical. Having a more accurate view of the estate means the organisation can right-size its maintenance contract, matching each device's criticality and risk profile with the appropriate service level.

It's also important to have access to skilled technicians with extensive knowledge and experience, and the authority to make decisions when devices need to be replaced. Often, this type of employee is expensive to employ, upskill, retain, and keep on duty in-house. A network managed service delivered by a trusted, experienced service provider is therefore a far more cost-effective and convenient solution.
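The right-sizing idea above, matching each device's criticality and age to an appropriate service level rather than covering everything equally, can be sketched as follows. The device fields, thresholds, and service-level names are illustrative assumptions, not contractual terms:

```python
def recommend_cover(device):
    """Map a device's criticality and age to an illustrative support level."""
    if device["critical"]:
        # Critical devices justify the most expensive cover
        return "24/7 with 4-hour response"
    if device["age_years"] >= 5:
        # Ageing non-critical kit: hold spares and plan replacement
        return "spares-only (plan replacement)"
    return "next-business-day"

# A tiny stand-in for an up-to-date network inventory
inventory = [
    {"name": "core-switch-01", "critical": True,  "age_years": 3},
    {"name": "branch-ap-17",   "critical": False, "age_years": 6},
    {"name": "edge-fw-02",     "critical": False, "age_years": 2},
]

for device in inventory:
    print(device["name"], "->", recommend_cover(device))
```

The point is that the recommendation is only as good as the inventory feeding it: a device that isn't in the list gets no cover at all, which is exactly the weekend-failure scenario described above.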
The rebound headache: weekend releases and deployments

For many organisations that manage their own networks, the implementation and deployment of patches and other general maintenance processes occur on weekends, when both network traffic and the risk of affecting the organisation's business are at their lowest. Unfortunately, this means that user acceptance testing is often not done properly when the work is complete. User acceptance testing involves testing the system with a group of users to check whether it works following a new release. The users are more familiar with the environment and the system's functionality, and can provide more pertinent feedback.

Proper testing after weekend downtime is the best way to gauge whether the network will be ready for the greater traffic volumes that usually come with Monday mornings. All too often, the testing hasn't been comprehensive enough, which causes a rebound headache on a Monday. For example, an organisation recently patched a Cisco UCCX system over the weekend. But when its contact centre agents returned to work on Monday and tried to use the system, some were unable to log in, and others weren't able to transfer or conference their calls, or see proper call data on their desktops. It transpired that many steps were left out during the upgrade over the weekend, including testing the system with more than just one agent.

Cure: proper lab testing and user acceptance testing after release and deployment

While lab testing and user acceptance testing are also involved in change management processes, they're even more critical in proper release and deployment management. If an organisation chooses to outsource its network management to a managed network service provider, it's imperative that the provider delivers proper impact analysis, staging of releases, and early life support to avoid a Monday morning headache.

CS / DDMS-1774 / 09/14 Copyright Dimension Data 2011 For further information visit: www.dimensiondata.com