High availability with Oracle Standard Edition
Simple Continuous Operation
If your task is to establish a high-availability (HA) environment in the Oracle universe, Oracle offers you its "Maximum Availability Architecture," consisting of RAC (Real Application Clusters), Data Guard, and GoldenGate. However, this package is made up of different solutions for various problems, which leads to questions regarding the need and suitability of the various products for your company. Keeping the definition of high availability in mind, you need to ask the following for your database environment: What should be protected?
High availability is thus divided into several considerations: Does an application need to be safeguarded against failure or – more likely in many cases – does protection need to be provided against potential loss of data? This takes you to the technical differences between RAC and Data Guard. Simply setting up RAC does not provide any protection against the data center failing, for example, because of a fire. This additionally requires a standby solution for which you can set up a Data Guard configuration.
Unfortunately, however, Data Guard can only be used in the Enterprise Edition. A nice note can be found in the Data Guard documentation: It may be possible to set up a standby database environment manually in the Standard Edition itself by moving archived redo logfiles onto the standby site via the command line and then importing them by script, but, of course, not with the monitoring and management capabilities that Data Guard offers.
Standard Edition and High Availability
Is high availability then possible only using the Enterprise Edition? A look at Oracle GoldenGate, which can also be used with the Standard Edition, shows several additional features compared with Data Guard. These include cross-platform and cross-version use and bidirectional replication. However, these features are not needed for the example here – a high-availability environment with the Standard Edition.
It's high time to think outside the box and address various third-party solutions if the Standard Edition is sufficient for your purposes, but you do not want to do without affordable protection against failure and data loss. In addition to the well-known SharePlex from Quest or Dell, Dbvisit has become a talking point of late with its standby product. Additionally, Dbvisit has developed its own replication solution in the form of Replicate. This tool presents an alternative to Oracle Streams, which you should take a look at because streams were set to a status of "deprecated" in database release 12c R1, alongside Advanced Replication. Thus, the feature is still included in 12c, but Oracle has discontinued development. It is thus only one step ahead of "desupported" and ultimately – along with the limited usability in the Standard Edition – a reason why it probably should be dropped.
Defining Tolerable Data Loss
Replication and standby provide different solutions because, apart from mirrored disks, redundant data storage through replication – logically or physically – as a standby is possible. Unfortunately, the terminology and its use also often causes confusion, especially with Data Guard. The differences lie in the usability of the replica. Although a "Physical Standby" is built exclusively for a failure event, the target database can be fully used with logical replication (e.g., as a reporting database) (Figure 1).
Multiple solutions for lossless replication while simultaneously using the replica are available when using the Enterprise Edition with Active Data Guard and Logical Standby. However, the challenge with the Standard Edition is much greater. Solutions such as Dbvisit Standby, which are based on a redo log shipping, may allow elegant and quick-to-implement high availability; however, they have the disadvantage that data loss of several minutes (about 15 minutes in practice) must be accepted if there is an error, and the replica cannot be used during normal operation.
The most important question when analyzing an HA environment with the Standard Edition is therefore: How much data loss can be tolerated? If the number of transactions is on the low side, a standby solution where the redo log files are transmitted every five minutes might come into question. Be careful: This can quickly become a bottleneck in batch runs or other major actions. Thus, you should refrain from physical replication if the maximum data loss must be less than 15 minutes. Here logical replication demonstrates its strengths, because it can reduce data loss to a minimum when there is a failover.
No recovery of the archived redo log files occurs with logical replication, as opposed to the Physical Standby; instead, the DML and DDL commands are restored directly from the redo logfiles and run on the standby server. Data Guard now goes in other directions and writes directly via Log Writer from the primary database in parallel to separate standby redo logs on the target side.
Completely normal recovery is then performed from this in the Physical Standby version; with Logical Standby, the same is done by a SQL Apply with the database open. The minimal data loss with logical replication is limited to transactional data that has not yet been sent from the primary site to the replication site until the failover. This means that everything that landed in the redo logfiles and was applied also exists in the replicated database. There are two conceptual differences when transmitting the data: SharePlex and Dbvisit Replicate transmit the data before the commit occurs, whereas GoldenGate transmits the data together with the commit. A single "commit" could still be carried out if an error occurs, but larger transactions that have been transferred as a whole are unlikely to be successful.
Challenges in the HA Environment
Although a physical standby is set up quickly and no further work is generally required (in addition to monitoring), this is a separate database with logical replication that must be managed separately and backed up for productive use. The backup should not be an unsolvable problem. The fact that changes to the SYS and SYSTEM schemes were not included in the logical replication, however, creates more difficulties. The same applies to ROWID
s.
Other problems could arise from specific data types that are not supported by a replication solution such as XMLType
. Triggers that might now fire twice represent another task. Additionally, there's a problem with sequences, which are often used as unique IDs by some applications, in that an uninterrupted numerical order cannot be guaranteed – with or without replication. Simply rolling back a transaction creates a gap. Despite this fact, many applications rely on sequences or the primary key (which may be guaranteed to be unique but is not definitely continuous) that could cause discrepancies in a failover.
Another problem that must be avoided involves conflicts that can arise if constraints on the replication site prevent data from being inserted from the primary side. This problem, however, can only occur if changes have been made by third parties in the replication database. Thus, you must ensure appropriate access protection.
The last major challenge is to keep the complexity as low as possible and to develop a workable system that both meets the requirements and remains manageable. Of course, you cannot disregard human error in all considerations regarding high availability and so need to implement a robust backup strategy for this case, too.
Replication Is a Project
Quickly setting up a working replication will work – but usually not for very long. You should plan at least six months until commissioning for such a project, including evaluation and testing. Here, I describe a project in which a migration took place (together with a hardware migration) from a former Oracle Enterprise Edition 11g with Data Guard environment to an Oracle Standard Edition One 11g with Dbvisit Replicate.
In this migration, several problems immediately needed to be solved by the responsible team of database administrators and developers. The reorganization was initiated to create a new environment with powerful hardware, significantly lower licensing costs, and an improved backup concept from the obsolete hardware with expiring support, less power, and high licensing costs. The following example only concerns setup problems and solutions related to the high-availability environment. The data quantity on the database includes several hundred gigabytes, with access from internal and external employees from different locations throughout Europe.
The solution was implemented with a Windows Server 2008 R2, which has 64GB of RAM with two octa-core processors and SAS hard drives with 6Gbps transfer rate. The server was set up accordingly a second time in another data center for the HA environment. The aim was to create an HA environment with those conditions that ensures the least data loss. As already described, the possibilities of a physical standby database are limited to a certain time factor. Logical replication was therefore used, and the team chose Dbvisit Replicate as a third-party product.
Configuring Replication
Before replication can begin, it is essential to make sure that the same database structure and contents are available on both sides. When working with an already growing database, you can best implement this using an RMAN DUPLICATE
. After installing the software on the standby side, the team simply cloned the database one-to-one using standard tools. We therefore avoided conflicts that could arise, for example, by manually setting up the database and its structures and manually loading data.
RMAN makes it pretty simple at this point: After creating and mounting an instance on the standby side, you only need to deposit a DUPLICATE
:
RMAN> DUPLICATE TARGET DATABASE \ TO 'prod' FROM ACTIVE DATABASE \ NOFILENAMECHECK;
After completely cloning the database, you can start to install and set up Dbvisit Replicate (Figure 2) by installing it on both sides. Afterward, the replication configuration proceeds in four steps. Managing and logging the replication occurs using a scheme that is applied to both sides for Dbvisit. The replication itself can take place either in one direction or bi-directionally. For this example, only replication from Primary toward Standby comes into question to completely avoid conflict.
You can specifically determine which tables and schemes to replicate. Finally, you can configure a MINE
process on the primary and an APPLY
process on the standby side. The replication starts running as soon as the preconfigured services are started. To this end, Dbvisit generates PLOGs on the primary side from the database redo logs, pushes them to the standby side via the network, and applies them to the standby database.
Adjustments Required
As already described, individual solutions to typical problems of a replication solution are needed for this environment. All subsequent statements refer exclusively to replication with Dbvisit Replicate and may differ from other tools. Note that the replicated database is standalone, which means that SYS
and SYSTEM
objects are not replicated; it also means that appropriate monitoring must take place here. More importantly, it means that not every DDL is supported by it, and, for example, a table space is therefore not automatically created on both sides.
In the current setup, Dbvisit basically replicates no triggers, constraints, sequences, ALTER DATABASE
and ALTER SYSTEM
commands, or database structures. To this end, I create an appropriate procedure in the operation that ensures such changes are applied on both sides and checked for conformity. Specific data types that are not supported by Dbvisit Replicate and – if used – must therefore be put on the replicated database cause another problem. This concerns the XMLType
in this project. Using Toad and a DB Compare, it is possible to show differences at this point and to apply them to the other side, as required.
Because developers on our team often use triggers, the problem of these now firing on both sides occurred during the first tests. Trigger commands were still replicated at this point in the replication setup. This means that a DML command on the primary side initiated by a trigger is registered once by the replication, and the trigger itself is executed by replicating a second time on the standby side. A conflict occurs immediately because of the duplicate entry or an attempt to perform one, and replication comes to a standstill because it performs a "retry" until the conflict is resolved.
Dbvisit provides a "conflict handler" for this, which generally offers two options: to ignore the conflict and roll back the statement or simply to update an overwrite and the target data in violation of the WHERE
clause. Neither method provides certainty regarding achieving data consistency in the end and a guarantee that the same content was written on both sides. What might be tolerable for a test environment must be ruled out for backing up a production database. At this point, it becomes clear why no conflicts can be tolerated. The solution to the trigger problem is then conceivably easy; because there is full access to the replica, all triggers must be shut down before starting replication:
SQL> ALTER TABLE table_name \ DISABLE ALL TRIGGERS;
The last challenge to take up is shifting in case of failure. Although the main purpose of the HA environment has been realized through replication – the least possible loss of data – that is not enough, because it should be possible to use the replica as a productive environment in case of doubt. The hurdle to overcome is therefore switchover and failover by applying an eponymous dynamic database service to both sides that is control via the DBMS_SERVICE
package as a first measure to protect the replication from unauthorized access. Access to the user's PC is realized exclusively via this service, which can be started and stopped at any time.
Both servers are registered on the client side, and a registration can only be done on the production side on which the service is started. If the production fails, then the service on the replication side is release – after assurance that all production-related data are provided here:
SQL> EXEC \ DBMS_SERVICE.START_SERVICE('PRODUSER');
Yet another problem is caused by replication: Because sequences are not replicated, they need to be adjusted to the current highest value before activation for production. For safety reasons, you should determine the highest value and add 100, for example, to make completely sure that no conflicts arise. This fact often raises discussions regarding accounting and the tax office, which demand consecutive numbers for invoices.
Sequences that do not guarantee consecutive numbering under all circumstances are the cause of problems. A simple test with a rollback shows that a sequence used once is gone and thus immediately causes a gap. The challenge for developers in this case is to find another way, because a sequence is not the appropriate means to ensure consecutive numbering. There are good and less good alternatives for this – depending on the generated overhead – on the Internet.
Live Activation
Live activation of replication followed. This approach ran smoothly at the beginning but, unfortunately, further problems gradually arose during continued production. These were minor bugs in Windows and a bigger problem in the context of CLOBs that were not replicated cleanly. These obstacles had to be worked through step by step.
Because the provision of the HA environment was not ensured at this time, a standby database was set up in parallel with Dbvisit Standby. This was not a problem regarding either performance aspects or a license. Dbvisit provided the Standby license free of charge for as long as the problems with replication were not solved.
All in all, a number of checks and procedures need to be established for the regular operation and a possible switchover, as well as for the failover, to ensure smooth replication and a secure HA environment. Difficulties, complexities, and limitations resulted in an unexpected solution at the end of this project. Together with Dbvisit Standby, a double strategy, so to speak, was implemented that benefitted from the advantage of both solutions: All restrictions regarding non-replicated DDLs, SYS and SYSTEM schemes, and data types using Dbvisit Standby were bypassed, which allowed losses of actual production data to remain as low as possible using replication.
Another advantage is seen in the ease of handling Dbvisit Standby during a switchover. The replication is simply suspended. When a failover occurs (i.e., a total failure of the primary server), the standby database can be activated and, in case of doubt, lost data that has reached the other side via replication, can be imported using a DB Compare.
Conclusions
Is logical replication therefore suitable for implementing a high-availability environment? Sadly, the unsatisfactory answer is: It depends. A standby database provides the advantage of close integration in the database architecture. However, it is static, but very robust, when it is installed correctly.
In this case, several problems and all the options in Oracle Database Standard Edition bring the logical replication to its limits. The more complex the database structure, the less advisable it becomes to rely solely on logical replication. With additional goals, such as a user-friendly switchover, the IT manager must opt for the simpler standby solution with time offset or accept additional costs in the Enterprise Edition and Data Guard (or other products).