Public cloud environments such as AWS, Microsoft® Azure, and Google® Cloud Platform have been promoted as a way to save money on IT infrastructure. Unfortunately, that is often not the case. The growing complexity of cloud offerings, combined with the limited visibility most organizations have into these environments, makes it difficult to control costs effectively. Many organizations unwittingly overprovision in the public cloud, an error that’s too costly to ignore. By avoiding five of the most common mistakes, teams can maximize cloud resource efficiency and reduce performance risk in these newer environments.
Mistake #1: Not Understanding the Detailed Application Workload Patterns
Not all workloads are created equal, and regardless of which public cloud you’re using, the devil is in the details when it comes to cloud instance selection. It’s important to understand both the purpose of the workload and the detailed shape of its utilization pattern.
The economics of a batch job that spins up once at the end of every month to do its work are very different from those of an app that stays busy all day with varying peaks and valleys. To select the right resources and cloud instance, you need to understand the intra-day workload pattern and how that pattern changes over a business cycle.
Unfortunately, many organizations take a simplistic approach to workload analysis, looking only at daily averages or percentiles instead of examining the specific patterns in depth. The result is an inaccurate picture of resource requirements, which can cause both over-provisioning and performance issues; these simple approaches rarely get it right. When you evaluate a solution to help you select cloud instances, choose one that truly understands the detailed utilization patterns of your workloads.
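To make the averaging problem concrete, here is a minimal Python sketch that compares a day’s average CPU utilization with its intra-day peak. The function name, the 24 hourly samples, and the utilization values are all hypothetical; a real analysis would look at many days across a full business cycle rather than a single one.

```python
from statistics import mean

def summarize_day(hourly_cpu_pct: list[float]) -> dict[str, float]:
    """Contrast a day's average CPU utilization with its intra-day peak."""
    if len(hourly_cpu_pct) != 24:
        raise ValueError("expected 24 hourly samples")
    avg = mean(hourly_cpu_pct)
    peak = max(hourly_cpu_pct)
    return {
        "average": avg,
        "peak": peak,
        "peak_hour": hourly_cpu_pct.index(peak),  # first hour at which the peak occurs
        "peak_to_average": peak / avg if avg else float("inf"),
    }

# Hypothetical app: 10% CPU most of the day, busy from 09:00 to 11:59.
day = [10.0] * 24
day[9:12] = [90.0, 95.0, 85.0]

summary = summarize_day(day)
print(summary["average"], summary["peak"], summary["peak_hour"])  # 20.0 95.0 10
```

For this made-up day, the average works out to 20% while the peak reaches 95% at hour 10. Sizing to the average would starve the workload during its busiest hours, while sizing every hour to the peak would overpay for the quiet ones.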
Mistake #2: Not Leveraging Benchmarks to Normalize Data Between Platforms
A common approach to sizing resource allocations for the cloud is to size “like for like” when moving from one virtual or cloud environment to another, meaning a workload is allocated the same resources in the new environment that it had in the old one. But not every environment runs the same hardware with the same specifications. If you don’t use benchmarks to normalize workload data and account for performance differences in the underlying hardware, you won’t get an accurate picture of how the workload will perform in the new environment.
Newer environments often run more powerful hardware, giving you more bang for your buck; as a result, workloads often don’t need the same amount of resources allocated to them. This matters both when transforming servers and when optimizing your public cloud use, because providers constantly release updated instance types running on new hardware. To avoid leaving money on the table, you need to compare “apples to apples,” and the only way to do that is to normalize the data.
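A minimal sketch of benchmark normalization, assuming that CPU demand scales linearly with a per-vCPU benchmark score. The function names and the scores below are hypothetical stand-ins for whatever benchmark you use to compare the source and target hardware.

```python
import math

def normalized_cpu_demand(used_vcpus: float, source_score_per_vcpu: float,
                          target_score_per_vcpu: float) -> float:
    """Convert vCPU demand measured on source hardware into equivalent target vCPUs.

    Assumes throughput scales linearly with the per-vCPU benchmark score.
    """
    if source_score_per_vcpu <= 0 or target_score_per_vcpu <= 0:
        raise ValueError("benchmark scores must be positive")
    # Work done on the source, in benchmark units, divided by the
    # capacity that each target vCPU provides.
    return used_vcpus * source_score_per_vcpu / target_score_per_vcpu

def required_vcpus(used_vcpus: float, source_score: float, target_score: float) -> int:
    """Round normalized demand up to a whole number of vCPUs."""
    return math.ceil(normalized_cpu_demand(used_vcpus, source_score, target_score))

# Hypothetical scores: older hardware scores 40 per vCPU, newer hardware 60.
print(normalized_cpu_demand(6, 40, 60))  # 4.0
print(required_vcpus(5, 40, 60))         # 4
```

Sized like for like, the first workload would get 6 vCPUs on the new hardware; normalized, it needs about 4. Linear scaling is a simplifying assumption: memory-bound or I/O-bound workloads may not track a CPU benchmark at all.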
Mistake #3: Focusing on Right-Sizing and Ignoring Modernizing Workloads
Modernizing workloads onto newer cloud instance offerings that run on newer, more powerful hardware can be a very effective way to reduce costs. In fact, we have found that right-sizing instances alone typically delivers 20% savings on a public cloud bill, whereas modernizing and right-sizing together increase savings to 41% on average.
With the dizzying number of services and instance types that public cloud vendors offer, it is difficult for organizations to choose the right instance, let alone keep up with new options. The potential savings, though, make it worth the effort. As mentioned, doing this properly requires a detailed understanding of the workloads, the cloud instance catalogs, and their costs, as well as the ability to normalize the data to account for performance differences between environments. This can’t be done manually; it requires a thorough analysis to find the combinations that save money while ensuring performance. It should also be done regularly, because apps deployed even a few months ago may already be strong modernization candidates.
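The sketch below illustrates how right-sizing alone and right-sizing plus modernization can land on different instances. The catalog, the instance names (oldgen.*, newgen.*), the benchmark scores, and the prices are all invented for illustration and do not correspond to any provider’s real offerings; CPU demand is expressed in the same hypothetical benchmark units a normalization step would produce.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

@dataclass(frozen=True)
class InstanceType:
    name: str              # hypothetical "family.size" name
    vcpus: int
    memory_gib: float
    score_per_vcpu: float  # hypothetical benchmark score per vCPU
    hourly_price: float    # hypothetical on-demand price in USD

    @property
    def family(self) -> str:
        return self.name.split(".")[0]

    @property
    def capacity(self) -> float:
        # Total CPU capacity in benchmark units.
        return self.vcpus * self.score_per_vcpu

def cheapest_fit(catalog: Iterable[InstanceType], cpu_demand: float, memory_gib: float,
                 families: Optional[set[str]] = None) -> Optional[InstanceType]:
    """Return the cheapest instance covering CPU (benchmark units) and memory demand."""
    candidates = [
        i for i in catalog
        if (families is None or i.family in families)
        and i.capacity >= cpu_demand
        and i.memory_gib >= memory_gib
    ]
    return min(candidates, key=lambda i: i.hourly_price, default=None)

CATALOG = [
    InstanceType("oldgen.xlarge", 4, 16, 40, 0.20),
    InstanceType("oldgen.2xlarge", 8, 32, 40, 0.40),
    InstanceType("oldgen.4xlarge", 16, 64, 40, 0.80),
    InstanceType("newgen.large", 2, 8, 60, 0.10),
    InstanceType("newgen.xlarge", 4, 16, 60, 0.19),
]

# Hypothetical peak demand: 5 old-generation vCPUs (5 x 40 = 200 units) and 12 GiB.
right_sized = cheapest_fit(CATALOG, 200, 12, families={"oldgen"})
modernized = cheapest_fit(CATALOG, 200, 12)
print(right_sized.name, modernized.name)  # oldgen.2xlarge newgen.xlarge
```

Restricted to the workload’s current family, the cheapest fit is oldgen.2xlarge at $0.40 per hour; opening the search to the newer family finds newgen.xlarge at $0.19 per hour, because its faster vCPUs cover the same demand with half as many cores. A real catalog also involves storage, network, and pricing-model differences that this sketch ignores.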
Mistake #4: Getting Caught in the ‘Bump-up Loop’
The “bump-up loop” is an insidious cycle that leads to over-provisioning and overspending. Say a workload is running and you see CPU at 100%. A simple tool would look at this, deem the workload under-provisioned, and recommend bumping up its CPU resources (and the cost of your cloud instance). The problem is that some workloads will use as much resource as they’re given. If you provision more CPU, these apps will take it and still run at 100%, perhaps just for a shorter time. The cycle repeats, and you’re stuck in the costly bump-up loop.
To avoid this resource-sucking loop, you need to understand exactly what a workload does and how it behaves. Again, this comes back to understanding each workload’s individual pattern and nature. This is particularly important for memory, which is a major driver of cloud cost.
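One way to spot the loop is to compare a job’s runs before and after an upsize. The minimal sketch below is a heuristic, not a definitive test: it assumes you have per-run averages for vCPUs, CPU utilization, and duration, and the 95% saturation threshold and 10% tolerance are arbitrary placeholders. If the workload is still saturated after the upsize but consumed roughly the same CPU-hours, it simply absorbed the extra capacity.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RunObservation:
    vcpus: int
    cpu_util_pct: float    # average CPU utilization while the job ran
    duration_hours: float

def cpu_hours(obs: RunObservation) -> float:
    """Total CPU work consumed by one run."""
    return obs.vcpus * obs.cpu_util_pct / 100 * obs.duration_hours

def looks_like_bump_up_loop(before: RunObservation, after: RunObservation,
                            saturation_pct: float = 95.0, tolerance: float = 0.1) -> bool:
    """Heuristic: upsized, still saturated, and about the same total work done."""
    upsized = after.vcpus > before.vcpus
    still_saturated = after.cpu_util_pct >= saturation_pct
    work_before, work_after = cpu_hours(before), cpu_hours(after)
    same_work = abs(work_after - work_before) <= tolerance * work_before
    return upsized and still_saturated and same_work

# Hypothetical runs of the same batch job.
before = RunObservation(vcpus=4, cpu_util_pct=100.0, duration_hours=2.0)
print(looks_like_bump_up_loop(before, RunObservation(8, 100.0, 1.0)))  # True
print(looks_like_bump_up_loop(before, RunObservation(8, 60.0, 2.0)))   # False
```

In the first case the job did the same 8 CPU-hours of work, just in half the time. Whether that speed-up was worth doubling the instance depends on whether the original two-hour run already met its window, which utilization alone cannot tell you.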
Mistake #5: Letting Idle Zombie Instances Go Unmanaged
Most organizations don’t have an effective process for identifying idle “zombie” instances, so these slip under the radar and pile up over time. They usually result from someone hastily deploying an instance for a short-term need and forgetting to shut it down. Zombie instances do nothing but waste budget. To avoid this unnecessary cost, organizations must examine each instance’s workload pattern across a full business cycle, using weeks or months of history. Identifying and eliminating this deadwood can easily save thousands a year, but it requires longer-term visibility into the workload than most tools provide.
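A minimal sketch of why history length matters for zombie detection. The function, the thresholds (5% peak CPU and 10 MB of network traffic per day), and the 30-day minimum are hypothetical choices; the point is that an instance should only be flagged if it stays idle across the whole window.

```python
def is_zombie(daily_peak_cpu_pct: list[float], daily_net_mb: list[float],
              min_days: int = 30, cpu_threshold_pct: float = 5.0,
              net_threshold_mb: float = 10.0) -> bool:
    """Flag an instance as idle only if it stayed quiet on every day of a long enough window."""
    if min(len(daily_peak_cpu_pct), len(daily_net_mb)) < min_days:
        return False  # not enough history to make a call
    return max(daily_peak_cpu_pct) < cpu_threshold_pct and max(daily_net_mb) < net_threshold_mb

# Hypothetical month-end batch job: idle for 29 days, busy on day 30.
cpu = [2.0] * 29 + [90.0]
net = [1.0] * 29 + [500.0]

print(is_zombie(cpu[:7], net[:7], min_days=7))  # True  (one week looks idle)
print(is_zombie(cpu, net))                       # False (full cycle shows the burst)
print(is_zombie([1.0] * 30, [0.5] * 30))         # True  (idle all month)
```

Looking at only the first week, the month-end job is wrongly flagged as a zombie; with the full 30 days of history, its day-30 burst keeps it off the list.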
Most organizations don’t realize how much money they are leaving on the table in the public cloud. Getting that money back requires paying much closer attention to how your workloads use resources and what they truly need to run efficiently without compromising performance. Understanding that detail is the only way to avoid a hemorrhaging cloud budget.