According to a recent Gartner study, 80% of organizations intend to use some form of cloud services by mid-2014. The same study predicts that cloud adoption will be primarily tactical rather than strategic.
- Asking Tough Questions
- Service is What Matters
- The Legacy Process
- Too Many Rules, Too Many Changes
- Event Management Today
- Situation Management
- Adaptive Algorithms Hold the Key
- Avoiding Doom: Fulfilling the Promise of Cloud Computing
Couple this with the Cisco Global Cloud Index — which predicts reaching a tipping point where two-thirds of the datacenter workload will be processed in the cloud — and you have a recipe for disaster in the IT operations center. It’s the countdown to Cloudmaggedon.
If we are to avoid a worst-case scenario, IT operations and solutions providers must start asking some tough questions, including:
- How can IT operations support rapidly changing business models?
- How can 20th century legacy incident management processes and tools possibly monitor 21st century cloud technologies?
- How can IT continue to guarantee service levels?
The answer is: they can’t, if they are using 20th century IT support tools and processes. A new approach is needed. But first, let’s take a look at the problem of “service assurance” in general.
Whether you are an internal IT department or an external provider, it’s all about services, and services run across IT infrastructures.
In the enterprise, when services run across a combined hybrid infrastructure (20th century legacy plus 21st century dynamic/cloud), it’s impossible to guarantee service levels. Early adopters of hybrid cloud architectures are now wrestling with this dilemma.
Let’s drill down into this and see why.
The challenge is to identify problems in the infrastructure before end users see a disruption in service. Remember that the service level agreement (SLA) clock starts ticking when end users are impacted.
Here’s the way 20th century legacy monitoring tools and processes attack the challenge:
Step 1. Document topology. This means itemizing all of the servers, storage, network devices, applications, etc., in the infrastructure and identifying all of the dependencies.
Step 2. Write rules for handling device and system failures (events) to discern the “root cause” of a problem (incident). Today, with dynamic (cloud and software-defined networking (SDN)) infrastructures and multi-participant environments, often there is no single cause of an incident. History may repeat itself, but never in exactly the same way. This is why it is difficult for traditional rules-based or behavior-based systems to catch new variants of anomalies.
While this sounds simple, in practice the rules can be voluminous and complex. For example, over the course of 10 years one large organization has written more than 13,000 rules for BMC Event Manager. Who knows how many of those rules are still relevant today? Certainly not the person who originally wrote them. Chances are that person is no longer on your payroll.
There are additional considerations as well:
- What happens when the topology changes? You have to re-write the rules.
- What happens when IT starts adopting cloud services? Topology changes constantly.
In the case of the large organization just mentioned, the number of infrastructure changes that occur daily in 2013 exceeds the number of changes per year in 2003!
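To make the fragility concrete, here is a minimal sketch of a topology-bound rule. It is not taken from BMC Event Manager or any real product; the topology map, the device names, and the `root_cause_rule` function are all hypothetical and exist only for illustration.

```python
# Hypothetical topology model: each service maps to the devices it is
# documented to depend on. Every name here is illustrative.
TOPOLOGY = {"trade-settlement": {"db-01", "switch-07"}}


def root_cause_rule(event):
    """Legacy-style rule: blame a service only if the failing device
    appears in the documented topology. Returns the service or None."""
    for service, devices in TOPOLOGY.items():
        if event["device"] in devices and event["status"] == "down":
            return service
    return None


# The rule works while the documented topology matches reality.
before = root_cause_rule({"device": "db-01", "status": "down"})

# The workload moves to a cloud instance nobody added to the topology.
# The same failure is now invisible to the rule.
after = root_cause_rule({"device": "cloud-vm-4f2a", "status": "down"})

print(before)  # trade-settlement
print(after)   # None
```

The rule is not wrong so much as stale: it stays correct only as long as someone updates `TOPOLOGY` every time the infrastructure changes.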
With piecemeal cloud adoption, there is no way of knowing for sure if your application is dependent on the cloud that Amazon or Rackspace manages, the infrastructure that you manage, or both.
In global financial institutions, for example, the problem is incredibly complex. Failing to settle a trade at the end of the day because of IT support’s inability to spot trouble early on could cost many millions of dollars.
Today, IT support organizations follow a process called “event management,” which is based on the premise that a single event (device, system, or application failure) is the root cause of service disruption (an “incident”). Thus, when end users call up and complain about poor service, IT scrambles to look down multiple IT “silos” to find the root cause of the problem. The silos may include databases, storage, network devices, servers, applications, and so on, including virtual networks and servers. Each silo has its own team of operators specializing in that area of IT. Sometimes these teams are outsourced. In theory, legacy tools are supposed to filter out extraneous events before kicking off a trouble ticket, which is supposed to point the experts in the right direction.
However, this approach is dysfunctional when cloud technology is involved, because a disruption in service doesn’t necessarily have one root cause. Many times it is a coincident combination of things that may be occurring outside in the cloud and/or inside the firewall.
Event management (looking down individual silos without seeing the “big picture”) wastes significant amounts of very expensive resources and slows down mean time to repair (MTTR). In one instance at a global bank, our team documented an incident where six highly paid experts (application troubleshooters) spent nearly an hour attempting to resolve six separate trouble tickets. In reality, these tickets reflected the same underlying problem. However, because the team depended on a legacy system that could not see across silos and provide a big-picture view, they were duplicating their work efforts and could not grasp the entire situation.
It’s time for a new, 21st century approach that is capable of looking at the big picture by inferring the cause of trouble (inside the firewall or in the cloud) without depending on outdated rules or post-mortem examinations of log files. “Situation management” is one such approach (Figure 1).
In the new era of cloud technology you must be able to identify, in real time, that an unfolding incident is the result of complex interplay between multiple systems, some of which you own and some of which you don’t. This complex interplay unfolds in situations and the new process is situation management.
Situation management makes no assumptions about the “root cause” of a problem but rather follows a coordinated cross-silo approach to minimizing service disruption. Situation management doesn’t depend on topology models because, in the world of demand-driven provisioning, the infrastructure underpinning services two minutes ago may not be the same as the infrastructure operating now.
Situation management requires a system that can read the content of event messages (machine data) in real time and make inferences about which messages are related to a given incident and which are noise.
The system supports “situational awareness,” automatically categorizing which events and early indicators are likely to impact service. There are no pre-set rules. Autonomous machine algorithms determine the relationship and relevance of the machine-generated data.
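The article does not specify which algorithms such a system uses, so the following is only a rough sketch of the general idea, under stated assumptions. It groups events that arrive close together in time and share message tokens, using a simple Jaccard similarity as a stand-in for whatever inference a real product performs. The event data, the `window` and `threshold` values, and all function names are hypothetical.

```python
import re


def tokens(message):
    """Lower-case tokens of a message; identifiers like 'db-01' stay whole."""
    return set(re.findall(r"[a-z0-9-]+", message.lower()))


def jaccard(a, b):
    """Share of tokens two sets have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_events(events, window=60, threshold=0.2):
    """Greedily group (timestamp_s, silo, message) events into situations.

    An event joins the first situation that was active within `window`
    seconds and whose accumulated tokens are similar enough; otherwise it
    starts a new situation. No topology model or device rule is consulted.
    """
    situations = []
    for ts, silo, message in sorted(events):
        toks = tokens(message)
        for s in situations:
            recent = ts - s["last_seen"] <= window
            if recent and jaccard(toks, s["tokens"]) >= threshold:
                s["events"].append((ts, silo, message))
                s["tokens"] |= toks
                s["last_seen"] = ts
                break
        else:  # no existing situation matched
            situations.append(
                {"events": [(ts, silo, message)], "tokens": set(toks), "last_seen": ts}
            )
    return situations


# Illustrative events from three silos plus one unrelated message.
EVENTS = [
    (0, "network", "switch-07 link down"),
    (5, "database", "db-01 unreachable via switch-07 link"),
    (9, "application", "settlement app timeout db-01 unreachable"),
    (400, "storage", "nightly snapshot completed"),
]

situations = group_events(EVENTS)
for s in situations:
    silos = sorted({silo for _, silo, _ in s["events"]})
    print(len(s["events"]), "event(s) across", silos)
```

In this toy run the three related events from the network, database, and application silos land in one situation, while the storage message stands alone as likely noise. A production system would need something far more robust than token overlap, but the shape is the same: relationships are inferred from the event stream itself instead of being looked up in a rule base.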
Situation management creates the opportunity to work across IT silos. In social-media-inspired “situation rooms,” staff can comment and collaborate on an abnormal situation to avoid duplicated effort and ensure faster MTTR.
For organizations adopting cloud services, situation management technologies offer the only real solution for mapping IT support processes to new business models. These solutions must:
- Continually adapt to change
- Work at ultra-high speeds to identify problems early before they impact SLAs
- Reduce the time it takes to fix problems
This is absolutely essential in a time when IT budgets are shrinking and SLA expectations are rising.
IT is in the early stages of a revolution. Just as in the mid-1990s, when organizations jumped to connect to the Internet and then realized they had to manage the mess, the world is now rushing to the cloud.
The countdown to Cloudmaggedon has begun. Incumbent vendors seem to be looking the other way. Innovative start-ups are stepping up to the plate, asking the tough questions and offering bold answers. That is the only way organizations will be able to meet their SLAs in the new demand-driven service economy.
This article was originally published as “Cloudmaggedon: How Global Cloud Adoption Impacts Service Assurance” in Cloud Strategy Magazine.