Operational systems are, by most definitions, mission-critical for the companies running them. These systems record sales and manage assets, inventory, and people; at the very core, they run a company’s day-to-day operations.

Databases are critical components of most companies’ IT infrastructures. We hear about the increasing importance of data in company operations, with some organizations making analytics a critical piece of their IT systems, often through automated analysis of data with machine learning and artificial intelligence. Transactional systems, however, lie at the heart of these IT infrastructures and have been operational almost from the beginning. Because of this, they are treated very carefully: while a company might experiment with a new data warehouse platform (or even several), it would never “play around” with its transactional systems.

My company, Fauna, offers an operational database called FaunaDB. We spend much of our time talking to customers and prospects about their current and future needs for managing data, particularly their operational data. The world of information technology is constantly evolving, but the rate of change seems to have accelerated over the past 10 years or so. Interestingly, operational databases have been slow to evolve. This is due in part to the point made previously: most companies cannot afford to experiment with their operational databases and are likely to keep using what they already have. But it is also because database vendors have been slow to advance their products to the point where they are worthy replacements for existing systems.

Survey says...

We recently funded a third-party survey of enterprise database users and decision makers called The State of the Enterprise 2018. The survey shows the importance of the cloud for IT databases: 55% of respondents have datastores deployed in the cloud versus 45% who have datastores in their own data centers (keep in mind there is some overlap, since a company can have both on-premises and cloud databases). The difference was more pronounced in large enterprises, where only a small number of respondents reported still running databases on their own systems. You can read about the survey at https://bit.ly/2FwDITN.

Moving to the cloud

Just about any traditional database can be made to run in the cloud. In one sense, the cloud can be as simple as a managed/outsourced data center. Someone could install a traditional database on AWS, Google Cloud Platform, Microsoft Azure, or even a private cloud platform, and run it like a traditional database.

Part of the attraction of the major cloud providers, and of their growth, is their global presence. You can put your data and run your applications in any of their regions or availability zones, or in all of them. You could serve users in Europe or Asia out of a cloud presence in the United States, but performance would suffer due to the network latencies involved; the user experience is much better when users can access your applications and data locally. Running the same application across multiple regions (and thus across multiple data centers) can therefore be highly desirable for a company moving its applications to the cloud.

Cloud-native databases

Some databases are starting to shine in the cloud. These tend to be databases that were actually designed to run in the cloud, not those that are simply modified to be cloud-friendly. On the analytical side, Snowflake has been winning over fans and customers with its cloud-native (in fact, cloud-only) data warehouse. More recently, cloud-native operational databases have come on the market and are changing the transactional game in the cloud as well.

Why is the operational database state-of-the-art finally advancing? Like many business events, the emergence of capable cloud-native operational databases is due to the confluence of demand and enabling technical capability. Some would say that technical capability is really driven by the demand as well — where there is a need, there is a way — but often the technical capability takes a significant amount of time to develop, and it is instead the result of a prescient research team doing their work years before demand materializes.

New technical capability sometimes begins as a research project at an academic institution or a large corporation. If the team makes an interesting or promising discovery, they may present their research to the world through a paper, often presenting the paper at a conference or other public event related to the topic.

In the case of distributed transactional databases, two research projects (and their resulting papers) presented around 2012 showed tremendous promise for enabling true global transactionality for databases. Those two projects, Spanner and Calvin, have both since surfaced in commercial offerings (most notably Google Cloud Spanner and FaunaDB).

Transaction protocols

The primary enabling factor for practical global transactionality is an effective transaction protocol. Transactions are tricky things: they are very difficult to do “correctly” and very easy to mess up. The goal of a transaction protocol is to enforce database consistency, meaning all of the nodes of the database hold the same version of the data. If a database is read-only, consistency is easy: once the nodes agree on the data, the data doesn’t change and the nodes never diverge. The point of transactional databases, however, is to enable transactions that change data, and then consistency becomes a great challenge for globally distributed transactions.

A traditional method of achieving consistency for transactions (as well as for other elements of networking) is the two-phase commit protocol. Most databases update records by rewriting existing records. If there are multiple transactions in flight, especially across multiple nodes in a cluster, the database needs to make sure there is no conflict on any node over the data involved before a transaction is allowed to commit. Two-phase commit does this by first checking with all the nodes to make sure the transaction is acceptable to them, and then, if every node agrees, getting all of them to commit it. Once committed, the transaction is recorded by all nodes and the next transaction can be run through the same process. If any node rejects the transaction (because of a conflicting transaction, for example), it is aborted and the process must start again.
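
To make the two phases concrete, here is a minimal, illustrative sketch of a coordinator driving the protocol. The Node class and its prepare/commit/abort methods are hypothetical simplifications for this article, not any particular database’s API, and real implementations also handle timeouts, crash recovery, and a durable coordinator log.

    # Minimal two-phase commit sketch (illustrative only).

    class Node:
        """Hypothetical participant that votes on and applies transactions."""
        def __init__(self, name):
            self.name = name
            self.committed = []

        def prepare(self, txn):
            # Phase 1: vote yes only if the transaction conflicts with nothing
            # already in flight on this node (conflict detection elided here).
            return True

        def commit(self, txn):
            # Phase 2: make the transaction durable and visible.
            self.committed.append(txn)

        def abort(self, txn):
            pass  # discard any state held for the transaction


    def two_phase_commit(nodes, txn):
        # Phase 1: ask every node whether the transaction is acceptable.
        votes = [node.prepare(txn) for node in nodes]
        if all(votes):
            # Phase 2: every node voted yes, so tell them all to commit.
            for node in nodes:
                node.commit(txn)
            return "committed"
        # Any "no" vote aborts the transaction everywhere; the client may retry.
        for node in nodes:
            node.abort(txn)
        return "aborted"


    nodes = [Node("us-east"), Node("eu-west"), Node("ap-south")]
    print(two_phase_commit(nodes, txn={"id": 1, "op": "debit account 42 by 10"}))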

In the operation of two-phase commit, the database nodes need to communicate with each other to stay in agreement on the proceedings; each transaction requires multiple network round trips. Two-phase commit works well enough for database clusters that are geographically close, as in a single cloud region. When the distance between nodes is significant, however, so are the potential latencies for consistent transactions.

Network communication speeds are limited by the laws of physics: a signal takes a certain amount of time to travel from one point on the globe to another, so a database that relies on two-phase commit faces the challenge of distance (network latency). There is also a challenge in the standard used to order transactions in a two-phase commit system: time.
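
To put rough numbers on that physical limit, here is a back-of-the-envelope calculation. The distance, the speed of light in fiber, and the assumption of two round trips per commit are all approximations chosen for illustration, not measurements from any particular system.

    # Back-of-the-envelope latency for a globally distributed two-phase commit.
    # The figures below are rough approximations for illustration only.

    distance_km = 20_000            # roughly halfway around the globe
    fiber_speed_km_per_s = 200_000  # light in optical fiber travels at about 2/3 of c

    one_way_ms = distance_km / fiber_speed_km_per_s * 1000  # ~100 ms
    round_trip_ms = 2 * one_way_ms                           # ~200 ms

    # Two-phase commit needs at least two round trips between the coordinator
    # and the farthest participant (prepare, then commit).
    min_commit_ms = 2 * round_trip_ms                        # ~400 ms

    print(f"one-way: {one_way_ms:.0f} ms, round trip: {round_trip_ms:.0f} ms, "
          f"minimum 2PC commit: {min_commit_ms:.0f} ms")

No amount of engineering removes that floor; it can only be hidden by keeping coordinating nodes close together or by changing the protocol.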

The ordering of transactions is a critical function of transaction protocols. Transactions must be placed in an order so that the database can determine whether applying them will keep it consistent; if a violation of consistency is predicted, the offending transaction is aborted. Databases often use timestamps to determine the order of transactions. Keep in mind that time is a relative thing, especially when multiple servers and multiple data centers are involved in determining it. The clocks within those servers and data centers drift apart over time, which causes problems for any application running across them that depends on timestamps. That divergence in time values among servers is called “clock skew.”
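
As a small, purely illustrative example of how clock skew can invert transaction order, imagine two servers whose clocks disagree by 250 milliseconds. The numbers and server roles below are made up for the sake of the example.

    # Illustration of clock skew reordering transactions (numbers are made up).
    # Server B's clock runs 250 ms behind server A's.

    true_time_ms = 1_000_000        # a reference "real" wall-clock instant
    skew_b_ms = -250                # server B lags behind real time

    # Transaction T1 commits on server A; 100 ms later, T2 commits on server B.
    t1_timestamp = true_time_ms + 0                 # stamped with A's clock
    t2_timestamp = true_time_ms + 100 + skew_b_ms   # stamped with B's skewed clock

    ordered = sorted([("T1", t1_timestamp), ("T2", t2_timestamp)],
                     key=lambda t: t[1])
    print(ordered)  # [('T2', 999850), ('T1', 1000000)] -- T2 appears to precede T1,
                    # even though it really happened after T1.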

Spanner

Spanner is, in some ways, a traditional transaction protocol: it relies on consensus between nodes to determine transaction order and it uses timestamps to order transactions. Google faced the time challenge mentioned previously, but it found a way to change the game and win the battle over time (and physics). Google created a new time service, called TrueTime, that gives every Spanner node a tightly bounded, agreed-upon view of the current time to use for ordering transactions. Google also runs Spanner nodes on its own very fast private network, so network latencies are lower than they would be on the public internet. Together these allow nodes located in different parts of the planet to reach consensus on transactions within a very small time window, small enough to make Spanner practical for global transactions.
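
The key idea is that TrueTime reports time as an interval with bounded uncertainty rather than a single instant, and Spanner waits out that uncertainty before a transaction’s commit timestamp becomes visible. The sketch below is a simplified illustration of that “commit wait” idea, not Google’s actual API; the TrueTime class and its uncertainty bound are hypothetical stand-ins.

    import time

    # Simplified illustration of TrueTime-style commit wait (not Google's API).
    # Time is exposed as an interval (earliest, latest) whose width reflects the
    # clock uncertainty; waiting out that interval ensures the chosen timestamp
    # is in the past on every node before the commit becomes visible.

    class TrueTime:
        def __init__(self, uncertainty_s=0.007):   # ~7 ms, a hypothetical bound
            self.uncertainty_s = uncertainty_s

        def now(self):
            t = time.time()
            return (t - self.uncertainty_s, t + self.uncertainty_s)

        def after(self, ts):
            earliest, _ = self.now()
            return earliest > ts    # ts is definitely in the past everywhere


    def commit(tt):
        # Pick the commit timestamp at the pessimistic end of the interval...
        _, commit_ts = tt.now()
        # ...then "commit wait": block until that timestamp is guaranteed past.
        while not tt.after(commit_ts):
            time.sleep(0.001)
        return commit_ts


    print(commit(TrueTime()))

The smaller the uncertainty bound, the shorter the wait, which is why Google backs TrueTime with GPS and atomic clocks in its data centers.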

Calvin

Around the same time the Spanner paper was presented, another research project in the realm of distributed transactions was under way. This project, called Calvin, was done by a Yale University team led by Professor Daniel Abadi. Abadi and his team realized that time was a significant obstacle to global transaction consistency, so they devised a protocol that does not rely on time at all. Instead, it relies on a deterministic log to establish consensus on transaction order. They found that by reaching agreement on the order of transactions in that log before the transactions are actually executed in the database, they could reduce the number of network round trips needed to achieve consensus and thus overcome the challenge of time. Calvin was used as the base transaction protocol for FaunaDB, and other database development efforts are starting to surface around Calvin (one recent project is streamy-db).
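
In very rough terms, the idea is to agree on a global order of transactions up front and then have every replica execute that agreed log deterministically, so replicas never need to coordinate during execution. The sketch below is a toy illustration of that ordering-then-execution split, not Calvin’s or FaunaDB’s actual implementation.

    # Toy illustration of Calvin-style deterministic execution (not the real
    # Calvin or FaunaDB implementation). Replicas first agree on a global log
    # of transactions, then each replica applies that log in order with no
    # further coordination: identical inputs in an identical order yield
    # identical state on every replica.

    def sequence(batches):
        """Stand-in for the replicated, agreed-upon global log."""
        log = []
        for batch in batches:
            log.extend(sorted(batch, key=lambda txn: txn["id"]))  # deterministic order
        return log

    def apply_log(log):
        """Each replica runs this independently and deterministically."""
        state = {}
        for txn in log:
            key, delta = txn["key"], txn["delta"]
            state[key] = state.get(key, 0) + delta
        return state

    batches = [
        [{"id": 2, "key": "acct-1", "delta": -10}, {"id": 1, "key": "acct-2", "delta": +10}],
        [{"id": 3, "key": "acct-1", "delta": +5}],
    ]

    log = sequence(batches)
    replica_a = apply_log(log)
    replica_b = apply_log(log)
    assert replica_a == replica_b   # every replica converges to the same state
    print(replica_a)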

A few recent articles and blog posts compare these two transaction protocols, and I encourage you to read them for a detailed comparison. What is important here is that these protocols can enable strong consistency for transactions (and thus for databases) in a single data center or across multiple data centers, even data centers in different parts of the world. The protocols are not sufficient in and of themselves: turning an academic or research paper into a practical product takes considerable effort, and certain decisions are often left as an exercise for the reader. It is quite possible for a company to make poor decisions in the design and implementation of a database based on a particular transaction protocol, even a promising, theoretically sound one like Spanner or Calvin. This is why rigorous, unbiased testing, such as Jepsen testing of distributed systems, is extremely useful.

Distributed transaction processing is thus one of the technical enablers mentioned before. The internet has opened applications and data up to a global audience, and companies are finding that their potential customers can now be anywhere in the world. There is therefore strong demand for conducting business effectively and efficiently (which means performing transactions) through systems all around the world, not just through one data center.

The standard for mission-critical operations: the mainframe

If there is an archetypal computer for mission-critical applications, many would say it is the mainframe. Many of the world’s transactions still occur on mainframes. While most people might believe that commodity servers, like those used by the major cloud providers, have taken over the computing world, some data tells a different story. A recent study by BMC (The 2018 BMC Mainframe Survey Results: The Mainframe’s Bright Future) found that most of the executives surveyed (93%) believe in the long-term viability of the mainframe, and that over half of the respondents reported that more than half of their data resides on mainframes.

Given the issues with mainframes that fueled the growth of other types of systems (first minicomputers, then client/server, the internet, and most recently the cloud), such as cost and a shortage of talent, why would companies continue to use and rely on mainframes?

Some of the reluctance to move off mainframes has been because other systems lack the robustness that mainframes have. For operational systems, robustness is critical: the platform needs to stay up and running, and it needs to be easily serviceable, with minimal (if any) downtime. The concept of RAS (Reliability/Availability/Serviceability) was used for many years as a yardstick to promote computer systems like mainframes, but subsequent computing platforms, and their corresponding database offerings, often fell short in one or more of the RAS dimensions.

Cloud platforms offer a significant degree of RAS by the nature of their implementations: thousands of servers, massive redundancy for compute, storage, and networking, and processes and procedures that have been tested through real-world failure scenarios. Still, some companies don’t want to take the chance of running on a single cloud platform, since you still hear about the various cloud providers having a significant failure or going offline. Those companies have the option of running their applications as multi-cloud, across multiple cloud platforms and perhaps even their own private cloud, for higher availability.

Now, with new distributed operational databases that deliver the trustworthy transactionality of mainframe systems along with the strong RAS characteristics mainframe customers require, these companies finally have platforms they can use to augment or even replace their trusted mainframes. At Fauna, we have already seen this occur. One customer, a major financial institution, is porting an existing application, a distributed ledger, from the mainframe to the cloud. For them, being able to perform transactions across all regions of the world through their chosen cloud platform was critical, and they were able to achieve that with FaunaDB’s cloud-native operational database.

Conclusion

The IT world is constantly changing and evolving. The cloud is the destination for many companies’ current IT modernization efforts, and an expanding user base is driving more and more of them to look to the cloud to enable true global operations. Transactional databases are part of these companies’ mission-critical IT infrastructure, and the emergence of true cloud-native distributed operational databases is finally letting them augment their on-premises mainframe applications with cloud deployments, or even migrate some of those applications completely to the cloud.