Database Conundrum: Failover Or Fall Over?
When a critical piece of technology fails, a database will either fail over or fall over. Databases are almost always backed up, so a database that falls over can normally be recovered. But interruptions to important applications can be very expensive, so important databases are rarely allowed to fall over.
- How Bad Is This in Practice?
- Failover to the Rescue?
- Is Failover Sufficient?
- A New Approach in the Cloud
When a database falls over, it usually recovers like this: The database keeps a log file of all completed transactions. If it fails, you run a recovery job that returns it to a known consistent state corresponding either to the last backup or to a checkpoint. The job then reapplies the completed transactions recorded in the log file since that point, and you reconnect all the users. Users check their last transaction to see if it completed and then everything continues as before.
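The recovery sequence described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the `LogRecord` format, the `recover` function, and the key/value "database" are all hypothetical, and a real recovery job also has to handle incomplete transactions, disk pages, and locking.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRecord:
    """One completed transaction in the log (hypothetical format)."""
    seq: int    # monotonically increasing sequence number
    key: str    # the item the transaction changed
    value: int  # the value it wrote


def recover(checkpoint: dict, checkpoint_seq: int, log: list) -> dict:
    """Rebuild state from a checkpoint plus the completed transactions after it."""
    state = dict(checkpoint)  # start from the known consistent state
    for record in sorted(log, key=lambda r: r.seq):
        if record.seq <= checkpoint_seq:
            continue  # already reflected in the checkpoint
        state[record.key] = record.value  # reapply the completed transaction
    return state


# The checkpoint was taken after transaction 2; transactions 3 and 4 completed later.
checkpoint = {"alice": 100, "bob": 50}
log = [
    LogRecord(seq=1, key="alice", value=100),
    LogRecord(seq=2, key="bob", value=50),
    LogRecord(seq=3, key="alice", value=70),
    LogRecord(seq=4, key="carol", value=30),
]
recovered = recover(checkpoint, checkpoint_seq=2, log=log)
print(recovered)
```

The further back the checkpoint, the more log there is to replay, which is one reason recovery time grows with the interval between checkpoints.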
Experience suggests that the users will lose the database for at least 20 minutes. They will be unhappy. But it will be worse if there are any dependencies. Not so long ago we built silo systems: one or more applications and a database to serve them. If we lost the database then only a limited number of users would be impacted. But then came the internet, web services, and service-oriented architecture (SOA).
The internet is merciless about availability: it demands 24/7. Web services and SOA spread these high service levels further because they enabled software reuse, which created dependencies. Reuse is a good thing and dependencies are not a bad thing, provided you can meet the service levels.
As time goes on, applications and databases become more and more interdependent, their availability requirements approach 24/7, and a database falling over becomes unacceptable. That’s pretty much where many companies are right now.
Failover is better. Technically, failover means automatically switching to a standby environment (hardware, database software, and network connections) when a database fails. To achieve this, a considerable amount of redundancy (duplication) needs to be built into the environment: a replica database must be ready to go into action when the live database fails.
This can be achieved by continually passing transactions from the live database to the replica and, conveniently, this usually involves very little overhead. There will also normally be a heartbeat “pulse” continually passed between the primary and the replica so that the replica knows when to go into action, although some databases require operator approval before they fail over.
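A heartbeat check of this kind might look like the following minimal sketch. The `StandbyMonitor` class, its method names, and the timeout value are assumptions made for illustration and do not correspond to any particular database product. Timestamps are passed in explicitly so the logic is easy to follow and to test.

```python
class StandbyMonitor:
    """Tracks heartbeats from the primary and decides when the replica takes over.

    Hypothetical sketch: names and behavior are illustrative only.
    """

    def __init__(self, timeout_s: float, require_approval: bool = False):
        self.timeout_s = timeout_s                # silence tolerated before suspecting failure
        self.require_approval = require_approval  # some setups want an operator to confirm
        self.last_heartbeat = None
        self.approved = False

    def on_heartbeat(self, now: float) -> None:
        """Record that a pulse arrived from the primary at time `now` (seconds)."""
        self.last_heartbeat = now

    def approve(self) -> None:
        """Operator confirms that failover may proceed."""
        self.approved = True

    def primary_suspected_down(self, now: float) -> bool:
        if self.last_heartbeat is None:
            return False  # no pulse seen yet, so there is nothing to time out
        return now - self.last_heartbeat > self.timeout_s

    def should_promote(self, now: float) -> bool:
        """True when the replica should go into action."""
        if not self.primary_suspected_down(now):
            return False
        return self.approved or not self.require_approval


monitor = StandbyMonitor(timeout_s=3.0)
monitor.on_heartbeat(now=10.0)
print(monitor.should_promote(now=12.0))  # pulse still within the timeout
print(monitor.should_promote(now=14.5))  # pulse overdue
```

Choosing the timeout is a trade-off: too short and a brief network hiccup triggers an unnecessary failover, too long and users wait before the replica takes over.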
In theory, failover is invisible; the database gets swapped out and nobody notices. In practice it is not so simple. There is a series of reconnections from application to database to be made, and the replica suddenly has a large workload to run. It may be that insufficient resources are immediately available and operations staff need to bolster the resources to meet the demand.
A hot standby needs to have exactly as much resource as the primary if failover is going to work perfectly. But if you provided that much resource for many databases you would duplicate a large amount of resource. It’s more economical to provide enough resource for failover to commence and then gradually add more. There will be a dip in performance, but not for very long.
When disaster recovery is added to the mix everything gets more complex. Now think of two data centers many miles apart. You might like to have the hot standby database running in the remote data center. But if you do, there are problems when you failover. In that situation, local applications will have to access the standby database many miles away over the network and the latency involved in that may be prohibitive.
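A rough calculation shows why distance matters. Light in optical fiber covers roughly 200 km per millisecond, so distance alone sets a floor on round-trip time. The sketch below uses that figure; the 500 km distance and the 200 sequential queries per page are made-up numbers for illustration, and real networks add routing and switching delay on top.

```python
FIBER_KM_PER_MS = 200.0  # approximate speed of light in optical fiber


def round_trip_ms(distance_km: float) -> float:
    """Minimum round-trip time imposed by distance alone."""
    return 2 * distance_km / FIBER_KM_PER_MS


def added_delay_ms(distance_km: float, sequential_queries: int) -> float:
    """Extra wait when an application issues its queries one after another."""
    return round_trip_ms(distance_km) * sequential_queries


# Hypothetical case: standby 500 km away, a page that issues 200 sequential queries.
print(round_trip_ms(500))        # milliseconds per query
print(added_delay_ms(500, 200))  # milliseconds added to the whole page
```

A few milliseconds per query sounds harmless, but an application that makes many sequential calls multiplies it into a delay that users will notice.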
So, it may not work to use the disaster recovery site for failover. Failover may have to be set up locally.
A complicated picture quickly starts to develop. Some databases (ones that don’t matter so much) will be allowed to fall over and will be recovered slowly. More important databases, at greater resource cost, will fail over, and those which are truly mission critical will have both a failover capability and a disaster recovery service.
One of the extraordinary things about some of the newer distributed cloud database technologies is that you do not have to worry about any of this. Explained simply, they keep a full disk copy of the database at every database location. You can have one or more locations within a data center and a location in each of multiple data centers. Set this up correctly and you can forget about failover and disaster recovery for the database entirely. It’s already done.
Take the fall over situation. Imagine that in one of the database locations a critical component fails and eliminates one or more of the servers running one of these new databases in that location. If the data on disk is still available at that location and there is still at least one server running, then the database will continue to run rather than fall over.
However, it may be short of resource to execute all the work it is receiving. If so, the connection brokers will divert some of the workload to a different database location which has available resource. It will happen automatically. No one will notice.
Now take the failover situation. Imagine that some critical component fails which takes out one of the database locations completely. The connection brokers simply make connections to one or more new locations and transfer the workload accordingly. The brokers will know which connections will be most efficient. Failover will happen automatically. No one will notice.
Now take the disaster recovery situation. Imagine that a whole data center disappears. The connection brokers will make connections to another physical site. Disaster recovery will happen automatically. No one will notice.
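The three situations above share one mechanism: a broker that sends each connection to the best location still able to take it. The sketch below is a toy model of that idea, not the design of any real product. The `Location` and `ConnectionBroker` classes, the latency figures, and the capacity counts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Location:
    name: str
    data_center: str
    latency_ms: float  # as seen from the application (hypothetical metric)
    capacity: int      # connections the location can still accept
    healthy: bool = True


class ConnectionBroker:
    """Routes each new connection to the lowest-latency usable location."""

    def __init__(self, locations):
        self.locations = list(locations)

    def route(self) -> Location:
        candidates = [loc for loc in self.locations if loc.healthy and loc.capacity > 0]
        if not candidates:
            raise RuntimeError("no database location available")
        chosen = min(candidates, key=lambda loc: loc.latency_ms)
        chosen.capacity -= 1  # the new connection uses up one slot
        return chosen

    def mark_data_center_down(self, data_center: str) -> None:
        """Simulate losing every location in a data center."""
        for loc in self.locations:
            if loc.data_center == data_center:
                loc.healthy = False


broker = ConnectionBroker([
    Location("dc1-a", "dc1", latency_ms=1.0, capacity=1),
    Location("dc1-b", "dc1", latency_ms=2.0, capacity=5),
    Location("dc2-a", "dc2", latency_ms=40.0, capacity=5),
])
first = broker.route().name   # nearest location with spare capacity
second = broker.route().name  # nearest location is now full, so work is diverted
broker.mark_data_center_down("dc1")
third = broker.route().name   # whole data center lost, so go to the other site
print(first, second, third)
```

In this model a short-of-resource location, a failed location, and a lost data center are all handled by the same routing rule, which is why the three cases look alike from the database's point of view.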
In reality, with failover or disaster recovery, keeping the applications running will be much more of a challenge than keeping your new database running.
It could be argued that today’s new distributed cloud database technology doesn’t really fail over or recover, because it has been designed to never fail.