An Interview – Backups & Recoverability – Part 1

Jim Hannan (@HoBHannan), Principal Architect & Joe Grant, Principal Architect

Jim Hannan recently interviewed his colleague and fellow Principal Architect Joe Grant, who has completed over 100 restores in his more than 10 years as an Oracle DBA, for his insights regarding best practices for backups and recoverability.

In the more than 100 restores you’ve worked on, has there been a common problem or common issues?

Joe: The most common issue that I’ve run across has been hardware failures of some sort. Whether it be a power outage, so there was a little bit of block corruption because things didn’t get flushed to disk properly, or whether it be a SAN unit losing battery backup of some sort or other things of this nature.

Followed closely by number two of “I didn’t mean to drop that table.”

Jim: That was a great answer on the hardware, so definitely storage issues cause block corruption and missing file systems.

Joe: My favorite response for why a database can’t be restored is when we had a client who had a tape physically snap.

Jim: That leads to a great question…

What do you think of tape backups?

Joe: I’m not a big fan.

Jim: Yeah, it’s a bad medium, isn’t it?

Joe: It’s a terrible medium.

Jim: What is the problem with tape? I think they tell you tapes go bad in two years, and most people forget to replace them.

Joe: So my primary concerns with tape are – first and foremost, it is crazy slow. Yes, it can hold a huge volume of data, but it’s crazy slow, both for the backup and for the restore. That’s even if you’re using SBT tape and pushing it straight to the device. And yes, the tape itself is a fragile medium. Hard drives fail, but tapes fail even more – whether it’s from just sheer decay or, as in this customer’s case, they were in the middle of a restore and the tape physically snapped.

Jim: That’s crazy. That’s a great story.

Joe: Yeah, the longer version of the story is that it was an absolutely huge database that took like 30-something hours to back up. So they would do a weekly full backup straight to tape, and then during the week they would do archive-log-only backups, with no incrementals. It didn’t happen on these dates, but it simplifies things to say that the full happened on Saturday the 1st and on Saturday the 8th there was an error with the full backup. The thinking was, “Oh, we’ll be all right, we still have all of our archive logs.” Then along comes day 12 or 13, when something else occurred and they decided they needed to restore. We had told them on numerous occasions that they needed to put their archive logs in more than one backup piece. We had said that so many times that we offended the client, and eventually they said, “You are no longer responsible for backups in any way, shape, or form.” OK, we’re done. So in the middle of this restore, they called us for general advice; they didn’t ask us to do it. It was more of a general “OK, this is what we are doing, this is what we have going on.” We told them they were headed down the right path.

So the client was doing this restore and they wanted to restore to current, which would have been day 13, basically just before the next full backup, because the day 8 full backup had failed. So they applied the full backup from day 1 and started applying archive logs. Somewhere around day 3 or 4 of archive logs, the media physically snapped. That’s when they called us back and said, “OK, great – how do we get around this?” First question: did you start putting archive logs in more than one backup piece like we told you to? “No.” OK, do you have the archive logs available anywhere? “No.” Did you duplicate the backup? Did you duplicate the tape anywhere? “No.” OK, you’re done. Open resetlogs – enjoy. They didn’t take it well, but you’re done.
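The reason a single unreadable stretch of archive logs ends the recovery can be sketched in a few lines of Python. This is only an illustrative model, not anything the client or Joe ran: the function name `last_recoverable_day`, the day numbering, and the copy counts are all hypothetical, chosen to mirror the simplified timeline in the story.

```python
from typing import Dict


def last_recoverable_day(full_backup_day: int, readable_copies: Dict[int, int]) -> int:
    """Return the last day a recovery can reach from the given full backup.

    readable_copies[d] is the number of still-readable backup copies of the
    archive logs generated on day d (hypothetical bookkeeping for this sketch).
    """
    day = full_backup_day
    # Archive logs must be applied in order, so recovery stops at the
    # first day for which no readable copy of the logs exists.
    while readable_copies.get(day + 1, 0) > 0:
        day += 1
    return day


# Hypothetical timeline: full backup on day 1, archive logs for days 2-13.
# Single copy: everything after day 3 sat on the part of the tape that snapped.
single_copy = {d: (1 if d <= 3 else 0) for d in range(2, 14)}
# Same failure, but the tape had been duplicated, so one copy survives.
duplicated = {d: (2 if d <= 3 else 1) for d in range(2, 14)}

print(last_recoverable_day(1, single_copy))  # 3
print(last_recoverable_day(1, duplicated))   # 13
```

With a single copy the recovery stalls at day 3 no matter how many later logs were taken. With a duplicate elsewhere, the same failure still allows recovery through day 13.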

Jim: So archive logs should be backed up to at least two backup pieces?

Joe: Archive logs need to be in no fewer than 2 backup pieces, although here it wouldn’t have made much difference.

Joe: Another issue with tape is that I can’t see my backup pieces. A lot of information can be gleaned from the file name itself. Yeah, you can ask the media management layer for that information, but it just complicates things.

What do you think of RMAN catalogs?

Joe: RMAN catalogs are a version-specific thing.

Jim: I had forgotten about that.

Joe: In 8i and in 9i, they didn’t put a whole lot of catalog information in the control file. There was also no re-catalog command. In 9i, the catalog command would only catalog a copy of a data file, so essentially an image copy. Because of this, some of your recovery options in 8i and 9i were somewhat limited. So for 8i, 9i, and even 10.1, use a catalog. In 10.1 and 10.2, they started putting a whole lot more of the catalog information into the control file, and they also introduced the catalog command that allows you to cram your own catalog information into the control file. Once this happened, the control file became a whole lot more useful.

The primary issue with the catalog is that now you have another database that you have to maintain and that is a horrible pain. Anyhow, the rumor is that Oracle started putting all that information into the control file because the vast majority of their RMAN questions were catalog related and so they wanted to stop having to support all of these RMAN catalogs.

Jim: Is there a set of questions that you should ask yourself on whether you need a catalog? It seems to me that when I ask most people why they are using a catalog, the answer is almost, “I don’t know (shrugs shoulders), it’s just what we’re doing.”

Joe: Right, but a lot of it is also retention requirements. So if you have those crazy retention requirements, where you need to keep annuals for 7 years, quarterlies for 5 years, and monthlies for 3 years, then your catalog is a better way to keep track of all that because you can’t expire certain backups. Since you have to have them, it’s easier to look up some of that information.
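To make the bookkeeping behind a tiered retention scheme concrete, here is a minimal Python sketch using the 7/5/3-year figures Joe mentions. It is not how RMAN or a recovery catalog tracks retention. The names `RETENTION_YEARS`, `keep_until`, and `must_keep`, and the idea of tagging each backup with a tier, are assumptions made purely for illustration.

```python
from datetime import date

# Retention tiers from the example above: years each kind of backup is kept.
RETENTION_YEARS = {"annual": 7, "quarterly": 5, "monthly": 3}


def keep_until(kind: str, taken: date) -> date:
    """Return the date until which a backup of the given tier must be kept."""
    target_year = taken.year + RETENTION_YEARS[kind]
    try:
        return taken.replace(year=target_year)
    except ValueError:
        # A Feb 29 backup whose target year is not a leap year: use Feb 28.
        return taken.replace(year=target_year, day=28)


def must_keep(kind: str, taken: date, today: date) -> bool:
    """True while the backup is still inside its retention window."""
    return today < keep_until(kind, taken)


today = date(2023, 6, 1)  # hypothetical "current" date
print(must_keep("monthly", date(2020, 1, 31), today))    # False
print(must_keep("quarterly", date(2020, 3, 31), today))  # True
print(keep_until("annual", date(2020, 12, 31)))          # 2027-12-31
```

Even in this toy form, every backup carries its own expiry rule, which is the kind of record keeping that gets harder to do by hand as the number of tiers and years grows.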

Anytime I have a media management layer controlling the backup and/or trying to keep track of the backup, a lot of the software requires the use of a catalog. It’s just easier to deal with it sometimes.

Jim: So the backup software or API might drive the requirement?

Joe: Yep.  You know for short retention requirements, or some other scheme, where you’re keeping track of RMAN piece information, I don’t want to use a catalog unless I absolutely have to for some reason or another.  In most situations, I have typically advised against the use of a catalog.

What is the most underused feature of RMAN?

Joe: RMAN. We are still running into DBAs who think that they can do it better.

Jim: What about image copies, compression, list backup summaries, or report schema?

Joe: Honestly, I would say block change tracking.

Jim: Good answer.

Joe: It can make incrementals go so much faster. But, in Oracle 8i, RMAN didn’t work nearly as well as it should have, so it turned a lot of people off and we are still dealing with the fallout of that amongst our clients. They don’t like it. They don’t trust it. They’re concerned that their backups are going to get corrupted or things of that nature. We still have to convince people that the use of RMAN is not a bad thing.

Once you are using RMAN, there are tons of features that you can use. For me, it’s all requirements driven.

You just mentioned requirements. When planning a recovery strategy, what would that look like?

Joe: The absolute best thing that I’ve ever been told about a backup and recovery strategy is don’t ever plan for your backup, it doesn’t matter – it’s completely irrelevant. Plan for your recovery. How long does the recovery need to take? Do you always have to be current? Do you always have to not lose any data ever? What are your incomplete recovery scenarios? How do you deal with those things? Then you work backwards from there.
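One way to picture "working backwards" from the recovery is a back-of-the-envelope check of estimated recovery time against the time the business can tolerate. The Python sketch below is only that: the function names, the throughput figures, and the 8-hour target are hypothetical placeholders, and real rates would have to come from measured test restores.

```python
def estimated_recovery_hours(db_size_gb: float,
                             restore_gb_per_hour: float,
                             archive_log_gb: float,
                             apply_gb_per_hour: float) -> float:
    """Rough estimate: time to restore the data files plus time to apply logs."""
    restore_hours = db_size_gb / restore_gb_per_hour
    apply_hours = archive_log_gb / apply_gb_per_hour
    return restore_hours + apply_hours


def meets_target(estimated_hours: float, target_hours: float) -> bool:
    """True if the estimate fits inside the required recovery time."""
    return estimated_hours <= target_hours


# Hypothetical numbers: 3000 GB restored at 100 GB/hour,
# then 200 GB of archive logs applied at 50 GB/hour.
hours = estimated_recovery_hours(3000, 100, 200, 50)
print(hours)                   # 34.0
print(meets_target(hours, 8))  # False
```

If the estimate misses the target, the strategy has to change, for example with more frequent fulls, incrementals, or faster media. That is the sense in which the recovery requirement drives the backup design rather than the other way around.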

Jim: So Nick Walter, one of our co-workers, has this quote. He says, “Don’t plan as if you’re going to have to do a recovery. Plan that you are going to need to do a recovery.” I think what he’s telling customers is that there will be a day when you will have to do a recovery.

I don’t know if this is fair, Joe, but I think there is still a tendency in the industry for people to think, “Not me.” Wishful thinking.

Joe: Yeah. I’m just lucky – it will be fine.

With that in mind, how often should RMAN backups be tested? How should they be tested?  What do you think of RMAN validate?

Joe: So it sounds weird, but testing your backups with a recovery of some sort is dependent on how often you do a duplicate. Since the duplicate command uses the RMAN pieces, if you are doing regular cloning activity or regular duplication activity, you are testing your backups. Other than that, hey, test them as frequently as you are willing to lose data. Right?

Jim: If you were starting up your own company tomorrow and your DBAs asked you, “How often do I need to test backups?”, what guidance would you give them? Monthly, weekly, quarterly?

Joe: At a very minimum, the validate command should be a part of your backup script. OK – I just took all this. Did it work?

You want to validate with every backup, and then yeah, I would try to do at least monthly or quarterly restores. Not only for the fact of “do these RMAN pieces work?” but because the middle of a recovery is not when you want to break out the book and learn. In the middle of a recovery you need to look up syntax, sure; you need to look up the nuances that you can’t quite remember. I understand that, but that’s not the time to learn what a backup is about and how to restore.

Even if you don’t have a need to do a duplicate, or you don’t have a need to do a restore, practice.

Conclusion

In part one of their interview, Jim and Joe discussed common issues they’ve experienced during restores, their thoughts on tape backups, when it makes sense to use RMAN catalogs, when and how to test backups and which features of RMAN are underused.

In the next part of their interview, Jim and Joe discuss what makes up a good RMAN backup script, how to get started with RMAN, what constitutes a good recovery strategy, and Joe’s thoughts on opening a service request (SR) during a restore event.
