Enterprises are taking advantage of core and edge clouds to digitally transform their operations with new distributed applications. These applications place different demands on infrastructure and consume data center and network resources in less predictable ways. The rapid growth in the number and logical distribution of these applications across cloud data centers is already challenging network operations, a trend that intensified during the pandemic and is expected to continue.

A growing question among network operators is: “How can I match the growing scale of cloud-based applications without disproportionately increasing the costs of designing, building, and operating the new data center network environment?” The only way to ensure operational models keep pace and scale cost-effectively is end-to-end network automation. Specifically, closed-loop automation continuously monitors the network, traffic, and available resources, and automatically makes adjustments to maintain optimal service quality and resource utilization based on predetermined intent. Achieving such intent-based networking with closed-loop automation requires a solid foundation that makes network resources as consumable as compute and storage.
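Conceptually, the closed loop is a simple reconciliation cycle: compare the observed state of the network against the declared intent and apply only the changes needed to close the gap. The Python sketch below illustrates that cycle; the intent store, telemetry feed, and configuration backend are hypothetical callables rather than any specific product’s API.

```python
# Minimal sketch of a closed-loop reconciliation cycle, assuming intent and
# observed state are both exposed as flat key/value dictionaries. The three
# callables are hypothetical stand-ins for an intent store, a telemetry feed,
# and a configuration backend.
import time
from typing import Callable, Dict

def reconcile(intent: Dict[str, object], observed: Dict[str, object]) -> Dict[str, object]:
    """Compute the changes needed to move observed state toward the intent."""
    return {key: value for key, value in intent.items() if observed.get(key) != value}

def control_loop(
    get_intent: Callable[[], Dict[str, object]],
    get_observed: Callable[[], Dict[str, object]],
    apply_changes: Callable[[Dict[str, object]], None],
    poll_interval_s: float = 10.0,
) -> None:
    """Continuously monitor the network and adjust it until it matches the intent."""
    while True:
        changes = reconcile(get_intent(), get_observed())
        if changes:
            apply_changes(changes)  # in a real system, push config via gNMI/NETCONF or an SDK
        time.sleep(poll_interval_s)
```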

Evolving Network Operating Systems

Traditionally, data center networks have been somewhat of a “black box” to the applications that run over them. Ideally, data center operators want to consume network resources on demand, just as they consume compute and storage resources, with the network operating invisibly to support the applications. For this to work, however, the network operating system (NOS) needs to be architected differently.

Traditional closed, vendor-proprietary NOSs offer limited visibility into their operations and little control to applications higher up the stack. Automation toolkits provided by network equipment vendors offer only a limited set of tools and force operations teams to write any network applications in the proprietary language of the vendor’s NOS. This demands extra resources, narrows the scope of what can be automated, and creates inconveniences such as recompiling automation applications with every new release of the vendor’s NOS.

More open NOSs, often based on Linux, have evolved to solve some of these problems. They use standardized functions that leverage the work of the open-source community and reduce the amount of custom coding required. They have their restrictions, however, and can still be difficult to customize, integrate, and automate. There is a DIY mentality to this approach, which works for some, but it has a steep learning curve and requires investment in specialist expertise to gather and test modules and write applications to automate operations.

Infrastructure as Code

The emerging trend in the industry is to get the best of both worlds. In this model, the NOS vendor builds on the open Linux foundation but assembles the toolkit itself, typically from a mix of proprietary and open-source modules tailored and integrated so the result is flexible and consumable.

The key to this best-of-both-worlds approach is the use of declarative, intent-based automation and operations toolkits built on a container orchestration system such as Kubernetes. This aligns with the broader “infrastructure as code” movement, which is important for solutions spanning on-premises and off-premises hybrid clouds.
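In practice, “declarative” means the desired outcome is captured as data that lives in version control and is applied through the orchestration layer. The sketch below shows what such an intent might look like, written here as a Python dictionary in the style of a Kubernetes custom resource; the API group, kind, and field names are illustrative assumptions, not any vendor’s schema.

```python
# Illustrative sketch of a declarative fabric intent expressed as data, in the
# style of a Kubernetes custom resource. The group/kind and field names are
# hypothetical; the point is that the desired state lives in version control
# and a controller reconciles the network toward it.
fabric_intent = {
    "apiVersion": "fabric.example.com/v1",  # hypothetical API group
    "kind": "FabricIntent",
    "metadata": {"name": "dc1-pod1"},
    "spec": {
        "racks": 4,
        "serversPerRack": 24,
        "dualHomed": True,
        "underlay": "ebgp",                 # desired outcome, not device commands
    },
}

# In a GitOps workflow, this document would be committed alongside application
# manifests and applied through the same orchestration pipeline, rather than
# typed into individual switches.
```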

The network should align so closely with these cloud ecosystems that it follows the needs of applications and remains invisible until an issue arises. The fabric operations platform must adopt a loosely coupled, cloud-native approach to enable plug-and-play integrations with software-defined data center (SDDC) stacks, such as VMware or Kubernetes-based stacks.

Templating Intents

In this new framework, the data center operations team uses fabric design models that have been trialed for stability and verified in the network vendor’s lab. The “fabric intent” is abstracted to a level where operations teams do not need to be aware of the underlying advanced networking details and do not need highly trained and certified personnel to provide a service.

The fabric is made up of different network virtualizations, such as a “logical distributed switch” or a “logical distributed router.” The abstract intent focuses on generic constructs of data center infrastructure, such as the number of racks, servers per rack, and dual-homing, and uses them to automatically design and deploy standard Border Gateway Protocol (BGP)-based IP fabrics. Network automation can be applied to both virtual and physical resources. For physical switching and routing resources, this has the added advantage of eliminating human error in the configuration of the data center stack.
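To make the idea concrete, the following sketch shows how an abstract fabric intent might be expanded into per-leaf BGP underlay parameters. The numbering scheme and data model are illustrative assumptions, not any vendor’s actual templating engine.

```python
# A minimal sketch of expanding an abstract fabric intent (racks, servers per
# rack, dual-homing) into concrete per-leaf parameters for an eBGP leaf-spine
# underlay. The ASN scheme and field names are illustrative assumptions.
from dataclasses import dataclass
from typing import List

@dataclass
class LeafPlan:
    name: str
    asn: int
    uplinks: int        # links toward the spines
    server_ports: int   # ports reserved for servers

def expand_fabric_intent(racks: int, servers_per_rack: int,
                         dual_homed: bool, spines: int = 2) -> List[LeafPlan]:
    """Derive a per-leaf plan for a standard eBGP-based IP fabric."""
    leaves_per_rack = 2 if dual_homed else 1
    base_asn = 65001                    # private ASN range, one ASN per leaf
    plans: List[LeafPlan] = []
    for rack in range(racks):
        for n in range(leaves_per_rack):
            idx = rack * leaves_per_rack + n
            plans.append(LeafPlan(
                name=f"leaf-{rack + 1}-{n + 1}",
                asn=base_asn + idx,
                uplinks=spines,
                server_ports=servers_per_rack,
            ))
    return plans

# Example: a 4-rack, dual-homed pod yields 8 leaves, each with a unique ASN.
for leaf in expand_fabric_intent(racks=4, servers_per_rack=24, dual_homed=True):
    print(leaf)
```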

Maximize Agility, Minimize Risk

This evolution to infrastructure as code brings the network fabric more in line with data center operational philosophies, such as DevOps, which uses extensible automation platforms to streamline continuous integration and continuous delivery. This new intent-based approach to automating the network fabric allows rapid and frequent changes, ensuring distributed applications are integrated and developed in lockstep with the network fabric needed to support them.

This means that, along with the other changes, a network digital sandbox will be needed: to adopt the language of software development, a digital twin of the real production network. Network equipment vendors have traditionally developed and tested various scenarios in their physical labs. However, not every scenario can be created or validated there, and securing lab resources quickly isn’t always possible. A digital sandbox allows operations teams to rapidly experiment, test, and validate automation steps and, more importantly, validate failure scenarios and the associated closed-loop automation without the risk of trying them out on the production network.
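As an illustration, the sketch below validates a failure scenario against a highly simplified in-memory “twin” of a leaf-spine fabric. A real digital sandbox would run emulated NOS instances, but the workflow of asserting behavior before production is the same.

```python
# Illustrative sketch of validating a failure scenario against a digital twin
# before touching production. The "twin" here is just an in-memory model of a
# leaf-spine fabric's links; a real sandbox would emulate full NOS instances.
from itertools import product

def build_twin(leaves: int = 4, spines: int = 2) -> set:
    """Model the fabric as a set of (leaf, spine) links."""
    return {(f"leaf-{l}", f"spine-{s}")
            for l, s in product(range(1, leaves + 1), range(1, spines + 1))}

def fail_node(links: set, node: str) -> set:
    """Simulate a node failure by removing all of its links."""
    return {link for link in links if node not in link}

def leaves_still_connected(links: set, leaves: int = 4) -> bool:
    """Every leaf must retain at least one healthy uplink."""
    remaining = {leaf for leaf, _ in links}
    return {f"leaf-{l}" for l in range(1, leaves + 1)} <= remaining

# Validate the failure scenario: losing one spine must not isolate any leaf.
twin = build_twin()
assert leaves_still_connected(fail_node(twin, "spine-1")), "spine loss isolates a leaf"
print("failure scenario validated in the sandbox")
```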

Observability and Automation

Automation and observability go hand in hand. Unfortunately, the traditional approach of collecting all kinds of data and pushing it at operations teams without interpretation makes the operator’s task complex while providing little useful information. This is often labeled telemetry, but raw data is not what operators need; what is needed is the extraction and delivery of contextual insights that enable the operator to understand the root cause of a problem and mitigate it.

The modern data center operations platform must implement an insight database that merges configuration and observability data to deliver contextual operational insights in an easy-to-understand form. In addition, these operational insights must be programmatically consumable so the operator can drive closed-loop automation.
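The sketch below illustrates the difference between raw data and a contextual insight: by joining intended configuration with observed state, the platform can report what broke, why it matters, and what to do next. The record layout and remediation hint are illustrative assumptions.

```python
# A minimal sketch of turning merged configuration and state data into a
# contextual insight rather than a raw alarm. The record layout and the
# suggested actions are illustrative assumptions.
from typing import Dict, List

def derive_insights(config: Dict[str, dict], state: Dict[str, dict]) -> List[dict]:
    """Join intended config with observed state to explain what broke and why it matters."""
    insights = []
    for intf, cfg in config.items():
        oper = state.get(intf, {}).get("oper_status", "unknown")
        if cfg.get("admin_enabled") and oper != "up":
            insights.append({
                "object": intf,
                "symptom": f"{intf} is {oper} although it is administratively enabled",
                "impact": f"member of {cfg.get('lag', 'no LAG')}; redundancy reduced",
                "suggested_action": "check optics/cabling or trigger the failover workflow",
            })
    return insights

config = {"ethernet-1/1": {"admin_enabled": True, "lag": "lag-1"}}
state = {"ethernet-1/1": {"oper_status": "down"}}

for insight in derive_insights(config, state):
    print(insight)  # a closed-loop handler could consume this instead of printing it
```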

As the variability and complexity of collected data rise, applying standard business logic is insufficient. Instead, machine learning-based baselining and analytics can provide deeper insights to a human operator. In this way, the software operator can empower the human operator to carry out the complex and intricate operations required in modern data centers.
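As a simple illustration of baselining, the sketch below learns what “normal” looks like from a rolling window of samples and flags values that deviate strongly from it. Production platforms use far richer models; this only shows the principle of learning a baseline rather than hard-coding thresholds.

```python
# A simple baselining sketch: flag samples that deviate strongly from a rolling
# baseline of recent values, instead of relying on a fixed threshold.
from collections import deque
from statistics import mean, stdev

def detect_anomalies(samples, window: int = 20, k: float = 3.0):
    """Yield (index, value) pairs that fall outside mean +/- k * stdev of the rolling window."""
    history = deque(maxlen=window)
    for i, value in enumerate(samples):
        if len(history) == window:
            mu, sigma = mean(history), stdev(history)
            if sigma > 0 and abs(value - mu) > k * sigma:
                yield i, value
        history.append(value)

# Example: steady interface utilization with one sudden spike at the end.
utilization = [40 + (i % 3) for i in range(60)] + [95]
print(list(detect_anomalies(utilization)))  # flags only the spike
```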

Summary

To achieve the scalability and flexibility needed by modern data centers, closed-loop automation is the key to making network resources as consumable as compute and storage. Today’s most advanced NOSs make it possible to deliver automation via abstract intent, combined with innovative network virtualization. This enables the network to remain invisible within the cloud ecosystem until it is needed. Following the best practices of DevOps, these NOSs include a digital sandbox that enables operations teams to design network automation for failure scenarios. They leverage the best features of an open approach by delivering plug-and-play integrations and, most importantly, by tightly combining observability with automation. This approach and combination of features are already proving themselves in the field, providing operations teams with a solid foundation to deploy much-needed closed-loop automation for their data centers.