Hi all,
I'd like to hear what the general community's thoughts are on this sort of topic.
To sketch the scene, we have a 200TB database running on an almost-full-rack Exadata X6-2 (there is another 160TB DB running on a full X3-2). Needless to say, the volume of archivelogs generated by these large systems is... large. As a standard, we've doubled down on the merged incremental strategy, so on all systems in excess of 20TB there is local storage assigned which is typically 1.5x larger than the allocation for the actual database. In the non-Exadata space, the production database resides on Tier 1 storage (e.g. EMC XtremIO/VMAX) and the FRA sits on more cost-effective storage (e.g. EMC VNX). In the Exadata space, a ZFS appliance is used for backups. For these large systems, there is no backup of the FRA to a tertiary device, partly because of the (storage) cost of an additional backup "to tape" and partly because of how long such a backup would take.
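For anyone not familiar with it, the merged incremental ("incrementally updated backup") approach we standardised on boils down to the well-known RMAN pattern below; the tag name is just a placeholder:

```sql
RUN {
  -- Roll the previous level 1 into the datafile image copies,
  -- then take a fresh level 1. The copy set is therefore always
  -- one incremental behind the live database.
  RECOVER COPY OF DATABASE WITH TAG 'MERGED_INCR';
  BACKUP INCREMENTAL LEVEL 1
    FOR RECOVER OF COPY WITH TAG 'MERGED_INCR'
    DATABASE;
}
```

The attraction at this scale is that in a disaster you can SWITCH DATABASE TO COPY and run directly off the image copies instead of restoring 200TB.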
Here's the thing: even when considering going back to a "traditional" weekly full/daily incremental strategy, some of the problems (in my mind) on these ultra-large systems are:
- If a partial restore/recovery is required, where are you going to restore to when i$h hits the fan?
- How long will it take?
- How do you give management confidence that you are able to restore from any scenario?
There has been a situation where a developer accidentally dropped a multi-terabyte table while connected to production. This is a 12.1.0.2 environment, and one of the options attempted was the "recover table from backup" feature. It worked up to a point, but given the size of the table (the bulk of it was HCC-compressed), producing a Data Pump export from which to import just wasn't practical given the storage constraints. We were able to reload the most recent months' worth of data from our mediation system, but everything older was lost.
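For those who haven't used it, the 12c feature in question is RMAN's RECOVER TABLE, which spins up an auxiliary instance, restores the tablespaces involved, and then runs a Data Pump export/import of the table. That intermediate export is exactly where the storage maths falls over at this scale. A sketch of what we attempted (schema, table and paths are made up):

```sql
RECOVER TABLE APPDATA.BIG_FACT_TABLE
  UNTIL TIME "TO_DATE('2017-03-01 08:00:00','YYYY-MM-DD HH24:MI:SS')"
  AUXILIARY DESTINATION '/backup/aux'      -- scratch space for the auxiliary instance
  DATAPUMP DESTINATION '/backup/dpdump'    -- where the intermediate export dump lands
  REMAP TABLE APPDATA.BIG_FACT_TABLE:BIG_FACT_RESTORED;
```

The auxiliary destination needs room for SYSTEM, SYSAUX, UNDO and every tablespace containing the table, and the Data Pump destination needs room for the export dump, which is no longer HCC-compressed.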
With all said and done, I have considered a "radical" approach: use the storage assigned for merged incremental backups to create a physical standby database with an apply delay of 'x' hours/days. Not only are you able to provide assurance that your backup is up to date (using Data Guard, the broker and OEM monitoring for apply lag), but you have an exact copy that you can readily make use of in a situation like the dropped table described above. You also have the ability to fail over to this copy, similar to the SWITCH DATABASE TO COPY command applicable in the merged incremental backup space.
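Concretely, the apply delay can be set either through the broker or directly in SQL*Plus; the standby name and the 24-hour delay below are purely illustrative:

```sql
-- Via the broker (delay is specified in minutes):
DGMGRL> EDIT DATABASE 'stby' SET PROPERTY DelayMins = 1440;

-- Or directly on the standby, without the broker:
SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE
       DELAY 1440 DISCONNECT FROM SESSION;
```

One caveat: the delay only applies when the standby is applying from archived logs, since real-time apply ignores it. For the dropped-table scenario, the idea would be to stop apply, recover the standby to just before the drop, open it read-only and pull the data across.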
The only "negative" is that you have to bear the cost of assigning memory/CPU to the apply instance. There are also question marks around whether the apply process would be able to keep up (Multi-Instance Redo Apply could help in 12.2), but I think the benefits substantially outweigh this. I'm having a tough time convincing my colleagues of this strategy, though; they think I'm going off my rocker. My feeling is that if what's in place isn't working, it's time to try something else, isn't it?
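For completeness, the 12.2 Multi-Instance Redo Apply I mentioned is enabled with a USING INSTANCES clause on the recover command (or the equivalent broker property); something along these lines:

```sql
-- Spread redo apply across all RAC instances of the standby:
SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE
       DISCONNECT FROM SESSION USING INSTANCES ALL;

-- Broker equivalent:
DGMGRL> EDIT DATABASE 'stby' SET PROPERTY ApplyInstances = ALL;
```

Whether that keeps up with the redo rate of a system this size is exactly the open question.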
Looking forward to hearing your thoughts/comments on this!
Regards
Ian
P.S. I deliberately used the word "protecting" instead of "backing up"