Here’s a scenario we can all envision: A sudden shift in traffic patterns floods the network with unprecedented intensity and duration, and despite the use of cloud services and a plethora of backup components across your core network infrastructure, the network still suffers disruptions. Customers are beyond frustrated, and now they’re venting on social media.

This is a nightmare scenario, but it’s also a reality for companies across the world. Whether it’s sudden disruption from innovation, a natural disaster, or a pandemic forcing people to stay home, business must continue, and this means networking systems and processes must anticipate unexpected change. They must be future-proofed, and this begins with establishing network resilience. This is never more important than during a crisis.

To ensure business continuity, enterprises and service providers must incorporate end-to-end resilience into their network planning from the outset. This is measured by organizations’ ability to resume normal operations after a network outage and their ability to quickly anticipate, prevent, and react to problems based on visibility into all equipment in a data center or edge site.

In this effort, tools like smart out-of-band (OOB) management, failover to cellular, and NetOps automation deliver critical benefits: remote monitoring and management, continued internet connectivity during an ISP outage, and network management that requires minimal human intervention.

What Causes an Outage and How Can You Prepare?

From ISP carrier issues to power cuts at a last-mile connection to simple human error, there is a range of reasons networks crash. Network infrastructures and the software stacks that support them are also becoming more complex, which makes them more susceptible to errors and security breaches.

The increased enterprise adoption of virtualization and SD-WAN has also brought new challenges. These technologies provide for greater flexibility, more efficient services, and reduced expenditures, and they also enable cloud-based control. However, they can also introduce additional points of failure.

The connection to an SD-WAN router may be severed, firmware updates may not work as planned, or a security breach may occur in a visibility blind spot. All of this increases the chance of an outage.

Taking this into account, it’s not a matter of if an issue will occur but when, and how well you’re prepared to handle it. Self-healing tools help greatly here: they provide failover for continued operations as well as a secure alternate path for engineers to reach and configure the console port of any networking device, even when the primary production network is down.

There are many methods to help achieve this type of resilience. One is to institute link diversity with failover to a high-speed cellular 4G LTE network. Another is to separate network management from the main production network and institute automated rules for network management and monitoring. Better yet would be incorporating these elements of resilience into a unified platform that houses automated self-healing capabilities as well as methods for engineers to remotely manage operations outside of the data and control planes.
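
To make the idea of failover to cellular concrete, here is a minimal sketch of a health-check loop that moves traffic to a backup link after repeated failures on the primary and fails back when the primary recovers. Everything in it is an assumption for illustration: the `Link` and `FailoverController` names, the link names "wan" and "lte", the three-failure threshold, and the scripted probe results that stand in for real health checks. It does not represent any particular vendor's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Link:
    """An uplink plus a health check that returns True when the link is usable."""
    name: str
    probe: Callable[[], bool]


class FailoverController:
    """Switches to the backup link after repeated primary failures (hypothetical policy)."""

    def __init__(self, primary: Link, backup: Link, max_failures: int = 3) -> None:
        self.primary = primary
        self.backup = backup
        self.max_failures = max_failures  # assumed threshold, tune per site
        self.failures = 0
        self.active = primary.name

    def tick(self) -> str:
        """Run one health-check cycle and return the name of the active link."""
        if self.primary.probe():
            self.failures = 0
            self.active = self.primary.name  # fail back once the primary recovers
        else:
            self.failures += 1
            # Only switch when the backup itself passes its health check.
            if self.failures >= self.max_failures and self.backup.probe():
                self.active = self.backup.name
        return self.active


def simulate(primary_results: Iterable[bool]) -> List[str]:
    """Replay scripted primary probe results; the backup is assumed always healthy."""
    results = list(primary_results)
    scripted = iter(results)
    controller = FailoverController(
        primary=Link("wan", probe=lambda: next(scripted)),
        backup=Link("lte", probe=lambda: True),
    )
    return [controller.tick() for _ in results]


print(simulate([True, False, False, False, False, True]))
```

With the scripted sequence above, the controller stays on "wan" through the first two failed checks, moves to "lte" on the third consecutive failure, and returns to "wan" when the primary passes again. Requiring several consecutive failures is a design choice that avoids flapping on a single dropped probe; a real deployment would also want a hold-down timer before failing back.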

Separating Network Management

When considering how best to maintain uptime, it is important to consider the levels at which a network operates: the data, control, and management planes. The data plane, where most users and customers operate, carries the data itself. This may be from a web server to a customer’s computer or vice versa, and it is where most cyber breaches occur. The control plane keeps data flowing: it houses the rules that route information packets from one place to another. Finally, the management plane is used to configure and manage network devices and services.

When management and data flow through the same interface in the data plane, this is called an in-band approach to network management. In this system, both data and management commands travel across the same network path, so the management plane has the same security vulnerabilities as the data plane, and operators may find themselves locked out of the management plane if the primary production network goes down.

As opposed to in-band management, OOB management provides an alternate way to connect to remote equipment, such as routers, switches, and servers, through the management plane, without directly interfacing with the production IP address in the data plane. This is highly preferred, as it gives administrators a secure path to monitor, access, and manage all devices without traveling through or providing management access to the data plane.
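
The difference between the two approaches can be sketched as a small path-selection function: given the state of the production network and of the OOB network, which route, if any, can an engineer use to reach a device? This is only a toy model. The `Device` fields, the `management_path` function, the console identifier, and the addresses (taken from the 192.0.2.0/24 documentation range) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Device:
    name: str
    production_ip: str             # in-band address on the data plane
    console_path: Optional[str]    # hypothetical OOB console identifier, None if not wired


def management_path(device: Device, production_up: bool, oob_up: bool) -> Optional[str]:
    """Return how an engineer can reach the device, preferring the OOB path."""
    if device.console_path is not None and oob_up:
        return f"oob:{device.console_path}"
    if production_up:
        return f"in-band:{device.production_ip}"
    return None  # locked out: no management path is available


router = Device("edge-router", "192.0.2.1", console_path="console-port-1")
switch = Device("access-switch", "192.0.2.2", console_path=None)

# Production network down, OOB network still up.
for dev in (router, switch):
    print(dev.name, management_path(dev, production_up=False, oob_up=True))
```

In this model the router, which has a console connection on the OOB network, stays reachable during a production outage, while the switch that is managed in-band only does not.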

Though in-band management costs less initially, organizations often pay more in the long run than if they had adopted an OOB approach. With in-band, outages and attacks can compromise both user data and the ability of administrators to remediate issues. To put it simply, OOB management is crucial for enhanced security and for withstanding an ISP outage, a last-mile error, or a failure of the production network. It is also necessary for companies with remote offices because it enables technicians to troubleshoot and administer equipment anywhere, anytime, through a central management system.

It’s important to keep in mind that the bandwidth needed for admin tasks is minimal, so those tasks can easily be carried out over a wireless network. The most feasible network will probably be 4G LTE for the next several years, which is far superior to plain old telephone service but less resource-intensive than 5G.

Looking to a Post-COVID-19 World

We are witnessing a time when people are being forced to stay at home, leading to much more data-rich and virtual behavior patterns. Beyond increasing capacity, introducing more redundancy into systems has been a common strategy to improve availability. But this may not be enough if a primary network failure happens or something fails with any piece of equipment other than the redundant elements and there is no way for a technician to remotely manage an issue.
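
A short back-of-the-envelope calculation shows why redundancy alone may fall short. The sketch below uses the standard series and parallel availability formulas and assumes failures are independent; the availability figures are hypothetical and chosen only for illustration.

```python
from math import prod
from typing import Iterable


def parallel(availabilities: Iterable[float]) -> float:
    """Availability of redundant elements: the group fails only if all of them fail."""
    return 1 - prod(1 - a for a in availabilities)


def series(availabilities: Iterable[float]) -> float:
    """Availability of a chain: every element must be up for the chain to work."""
    return prod(availabilities)


# Hypothetical numbers: two redundant uplinks at 99% each,
# followed by a single non-redundant device at 99.5%.
redundant_uplinks = parallel([0.99, 0.99])
end_to_end = series([redundant_uplinks, 0.995])

print(f"redundant uplinks: {redundant_uplinks:.6f}")
print(f"end to end:        {end_to_end:.6f}")
```

Under these assumptions the redundant pair is far more available than either uplink alone, yet the end-to-end figure is capped below the availability of the single non-redundant device. When that device fails, a way to reach it remotely is what determines how quickly service comes back.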

Prior to our current situation, network infrastructure had already been shifting to become more geographically dispersed to handle more data-intensive, internet-connected devices. While it has been somewhat overhyped in the past, the IoT movement is driving a big shift that will create much more data for systems to process, which will require more local connections closer to the end user.

Historically, IT infrastructure management trended toward centralizing complexity from the edge to the core and cloud data center infrastructure. In the next few years, however, we will see the reverse of that trend, as we move back to the edge due to increased adoption of IoT technology, AI, rich media content, and VR/AR that will increase the volume of data and require more IT infrastructure next to end users.

As networks migrate closer to the edge, this will reenergize the need for remote management technologies, such as OOB management, remote provisioning, and automated remote management. It will also bring the need for more stable connections between the core and the edge to the forefront. This means enterprises will need to monitor and manage equipment at the edge of the network and make sure there is resilience in the connection between the internet or the core network and the edge infrastructure.

What we are seeing now, and what we will see in the future, is the need for a mentality of end-to-end network resilience, with the ability to react to events regardless of location. Building this level of resilience to future-proof network infrastructures and ensure high levels of uptime will require clear separation of the network management plane so that both network engineers and management tools can run at a central site and reach infrastructure remotely, regardless of the status of the production network. This requires network resilience tools such as smart OOB management, failover to cellular, and NetOps automation.

Those who are prepared will be better equipped to withstand disaster and ready for the future of IT infrastructure. A philosophy of resilience is not a luxury but a necessity.