Cloud technology has made huge strides and is now widely used across the enterprise. Object storage scales significantly better and is far more cost-effective than earlier solutions such as NAS, but enterprises remain skeptical about cloud storage performance. To date, they have largely limited their use of cloud storage to backup, forgoing the benefits of using this storage tier for active data sets and applications ranging from active archive to Tier 2-3 workloads currently running on traditional storage subsystems.

Adding a caching strategy to an object storage-based system helps address performance concerns and lets users access files stored in the cloud more efficiently. While performance is always important, it is particularly crucial for customer workloads with high data throughput requirements, such as in-cloud video rendering, data crunching for medical research, and digital video recording (DVR) in the cloud. For these and the many other applications where throughput and latency of data access matter, a caching architecture designed specifically for cloud-backed storage is required to meet end-user requirements.

The vast majority of cloud storage systems, such as cloud gateway appliances, are monolithic: they’re built around a single local cache and a single interface to the cloud through which all data and metadata flows. Like any monolithic device, these appliances become a bottleneck once the network connection to the cloud is saturated or the local cache fills up. The latter condition, and the “cache thrashing” it creates, is particularly devastating to end-user performance, because endpoints must make the full round trip to the object store to retrieve data that should be, but cannot be, cached locally.
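The thrashing effect can be illustrated with a minimal sketch. It assumes a least-recently-used (LRU) eviction policy and a workload that repeatedly cycles through the same set of blocks; neither assumption comes from any particular product, and the names `LRUCache` and `hit_rate` are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used key."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1  # stands in for a round trip to the object store
            self.entries[key] = True
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict the oldest entry


def hit_rate(capacity, working_set, passes=10):
    """Cycle through `working_set` distinct blocks `passes` times."""
    cache = LRUCache(capacity)
    for _ in range(passes):
        for block in range(working_set):
            cache.read(block)
    return cache.hits / (cache.hits + cache.misses)
```

Under these assumptions, `hit_rate(100, 100)` is 0.9 (only the first pass misses), while `hit_rate(100, 101)` is 0.0: with a working set just one block larger than the cache, every read evicts the block that will be needed next, so every read goes back to the object store.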


New Caching Architectures

Next-generation software-defined cloud storage solutions address this pervasive problem with multi-tiered, intelligent caching architectures built from the ground up to get the most out of whatever network connection is available. Rather than relying on a single static cache, these solutions place caches at multiple points in the data path, from the cloud to the endpoint device, and use them adaptively to maximize throughput and minimize latency.

This new generation of cache architectures consists of the following components, starting with the endpoint itself and moving outward until reaching the object store.

Endpoint cache. This capability establishes an encrypted, multi-tier, adaptive cache using the endpoint device’s own local storage and memory to store frequently used data and metadata. The cache is multi-tiered in that it leverages both persistent (on-disk) and ephemeral (in-memory) storage, not only to maximize the density of the endpoint cache (storing the greatest amount of data possible) but also to maximize the speed with which endpoints can access that data. Critically, all data persisted on the device remains in “chunked” and encrypted format, rendering it opaque and useless to anyone who cannot authenticate their session through the enterprise’s native identity management system, such as Active Directory. Combined with granular, global, per-share deduplication, endpoint caching minimizes the number of times a client must communicate with the object store to retrieve data. Indeed, depending on the type of workload, deduplication ratios can exceed 90%.
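As a rough sketch of how a chunked, deduplicated, two-tier endpoint cache might be organized, the example below splits data into fixed-size chunks, stores each unique chunk once under its SHA-256 digest, and promotes chunks from a "disk" tier to a small in-memory tier on first read. The class name, the tiny chunk size, and the use of a plain dictionary in place of real on-disk storage are all assumptions made for illustration, and encryption is deliberately left out.

```python
import hashlib
from collections import OrderedDict

CHUNK_SIZE = 4  # hypothetical; tiny so the example is easy to follow


class EndpointCache:
    """Two-tier, content-addressed chunk cache (encryption omitted)."""

    def __init__(self, memory_slots):
        self.memory = OrderedDict()  # small, fast tier with LRU eviction
        self.disk = {}               # stands in for the persistent tier
        self.memory_slots = memory_slots
        self.logical_bytes = 0       # bytes written by callers
        self.stored_bytes = 0        # bytes actually kept after dedup

    def put(self, data):
        """Split data into chunks and return the list of chunk digests."""
        digests = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            self.logical_bytes += len(chunk)
            if digest not in self.disk:  # store each unique chunk once
                self.disk[digest] = chunk
                self.stored_bytes += len(chunk)
            digests.append(digest)
        return digests

    def get(self, digest):
        """Return (chunk, tier), where tier is 'memory', 'disk', or 'miss'."""
        if digest in self.memory:
            self.memory.move_to_end(digest)
            return self.memory[digest], "memory"
        if digest in self.disk:
            self._promote(digest, self.disk[digest])
            return self.disk[digest], "disk"
        return None, "miss"  # a real client would fetch from the object store

    def _promote(self, digest, chunk):
        self.memory[digest] = chunk
        if len(self.memory) > self.memory_slots:
            self.memory.popitem(last=False)  # evict least recently used

    def dedup_savings(self):
        """Fraction of written bytes that did not need to be stored."""
        return 1 - self.stored_bytes / self.logical_bytes
```

For instance, writing `b"AAAABBBBAAAAAAAA"` produces four chunks of which only two are unique, so `dedup_savings()` returns 0.5. The first `get` of a chunk is served from the disk tier and the second from memory.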

Regional cache. While endpoint caching provides the performance and latency characteristics required for most users and workloads in most enterprises, there are scenarios in which a large, dedicated cache on-site at a branch office can add significant value. A quintessential example is a very large branch with a relatively “thin” network connection to the object store and a large number of users who frequently share and collaborate on common data sets. A regional cache, deployed as a virtual machine at the branch office, requiring little or no local support, and fully under the control of central IT, adds another level of caching in the path between the cloud and the endpoint. It does so without affecting standard file system semantics, including failsafe file locking and the ability to use unmodified existing applications with data stored in the cloud.
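The way a regional cache sits between endpoints and the object store can be sketched as a chain of read-through tiers. This is only an illustration of the read path: the `Tier` class and the tier and file names are hypothetical, and eviction, invalidation, and file locking are ignored.

```python
class Tier:
    """One level in the read path; misses fall through to `upstream`."""

    def __init__(self, name, upstream=None, contents=None):
        self.name = name
        self.upstream = upstream
        self.store = dict(contents or {})

    def read(self, key):
        """Return (value, name of the tier that served the read)."""
        if key in self.store:
            return self.store[key], self.name
        if self.upstream is None:
            raise KeyError(key)
        value, served_by = self.upstream.read(key)
        self.store[key] = value  # populate this tier on the way back
        return value, served_by


# Hypothetical branch office: two endpoints share one regional cache.
object_store = Tier("object-store", contents={"report.docx": b"quarterly numbers"})
regional = Tier("regional", upstream=object_store)
alice = Tier("endpoint-alice", upstream=regional)
bob = Tier("endpoint-bob", upstream=regional)
```

In this sketch, the first read by `alice` is served by the object store and warms both her endpoint cache and the regional cache. A later read of the same file by `bob` is served by the regional cache, so it never crosses the thin link to the cloud, and a repeat read by `alice` is served from her own endpoint.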

CDN caching. Increasingly, cloud service providers and geo-dispersed private clouds are adding content distribution network (CDN) capabilities to their infrastructure. A current leader in this space is Amazon Web Services, whose CloudFront offering is a full-fledged, worldwide CDN that integrates with Amazon S3 object storage. The concept behind CDNs is straightforward: by replicating data across one or more CDN edge nodes, geographically distributed workloads get much better read performance by reading the required data from a nearby edge node rather than from the central object store.

Moreover, with some CDNs such as CloudFront, the service provider offers a high-speed backbone connecting the CDN edge nodes to the central object store. This improves performance even on the first read, since the time it takes for data to traverse the network from the object store to the edge node is minimal, leaving only the “last mile” to the user. Next-generation software-defined storage solutions now integrate natively with CloudFront when using S3 as the backing object store. Central IT chooses which CDN regions to include in its topology, after which all reads from those regions are accelerated by virtue of data locality.

Metadata server caching. Leading software-defined storage solutions should separate data from metadata and provide a metadata server that hosts all metadata, communicates with endpoint agents, and arbitrates data operations. The throughput and latency of metadata transfers are just as important as those of data transfers, so such solutions should have a purpose-built caching mechanism to optimize metadata performance (and therefore overall performance) across deployments.

Next-generation software-defined storage solutions offer multi-tiered, intelligent caching and storage for many types of metadata. For example, they greatly reduce the time required to access critical metadata such as chunk maps by caching it in memory. These caching architectures also speed up the process of establishing and maintaining user authentication.