Jim Hannan (@HoBHannan), Principal Architect & Joe Grant, Principal Architect
In this final part of a recent interview between Jim Hannan and his colleague and fellow Principal Architect, Joe Grant, Joe shares his favorite RMAN restore war story.
Tell me your best RMAN restore war story
Joe: My war story involves a California-based client running Oracle 9.2.0.4 on HP-UX, without Oracle support because they were still on 9.2.0.4 and their app vendor wouldn’t let them upgrade. They had Fibre Channel-attached storage (but not a stellar array), and they had a failing controller card either in the HP-UX box or in the storage controller; I can’t remember which.
So in the middle of an operation of some sort, they were pretty sure that a data file got corrupted, due to a bad write, likely through the controller card failing. They also had not noticed it for a relatively long period of time.
They were very much an 8-to-5 customer, with overnight batch cycles. So if they had to rerun an overnight batch cycle, it was no big deal. Believe it or not, if they had to tell their users to recreate the work, it was also not the end of the world.
I don’t remember all of the exact timing, but eventually we decided that we needed to do a full database restore.
Jim: You just said something that I really like, because we have been talking about backup strategy and approach. You should always ask yourself, if I can’t get to an RMAN backup, how catastrophic is that to my application? Because that will tell you how good your strategy needs to be. We’ve certainly worked with people who have said, “we can actually recreate a day’s worth of work”, or said “it is a data warehouse, we’ll just refresh it.” You certainly need to ask yourself, how good does my strategy need to be here?
Joe: Yeah, how close do I have to get?
So in this case, the issue occurred at one or two in the afternoon and the client wanted to restore back to six or seven o’clock in the morning, whenever the batch cycle was complete, but before users started using the system. Because it was easy enough for them to recreate that work – it was easier for them to go to their end users and say “you have to recreate this day” as opposed to them going to their users and saying, “OK, what did you do after lunch?” Right? It was like saying you have to recreate everything after 12:32. Their end users didn’t know what they did at 12:32, but they knew what they did starting in the morning, so we had to recover to either 6:30 or 7:30 in the morning.
The other issue was that they had a relatively large database with a relatively small spot to put the RMAN pieces in, so things aged out quickly. We had them restore certain RMAN pieces, and then we found out that those pieces didn’t have all the archive logs we needed to roll forward. So we had to have them restore more RMAN pieces, shuffling pieces on and off the file system because of space constraints and all this other good stuff.
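The shuffling described above amounts to staging backup pieces in batches that fit the available space. Here is a minimal sketch of that idea, assuming the piece names, sizes, and staging capacity are already known; every name and number below is hypothetical and not taken from the actual recovery.

```python
def plan_staging_batches(pieces, capacity_gb):
    """Group backup pieces, in order, into batches that fit the staging area.

    pieces: list of (piece_name, size_gb) tuples, in the order they are needed.
    capacity_gb: free space available in the staging file system.
    """
    batches = []
    current = []
    used = 0
    for name, size in pieces:
        if size > capacity_gb:
            raise ValueError(f"{name} ({size} GB) can never fit in {capacity_gb} GB")
        # Start a new batch when this piece would overflow the staging area.
        if used + size > capacity_gb:
            batches.append(current)
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        batches.append(current)
    return batches


# Hypothetical pieces, each at or under a 2 GB file size limit.
pieces = [("p1", 2), ("p2", 2), ("p3", 2), ("p4", 1)]
batches = plan_staging_batches(pieces, capacity_gb=5)
for number, batch in enumerate(batches, start=1):
    print(f"batch {number}: {', '.join(batch)}")
```

Each batch is restored, used, and removed before the next one is staged. The simple in-order grouping keeps pieces in the sequence the recovery needs them, at the cost of sometimes leaving space unused.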
This is where we found out the hard way that the catalog command does not exist in 9.2. Believe it or not, there’s a DBMS package that can read an RMAN backup piece and extract whatever you need out of it; in our case, we were looking to restore archive logs. So I went looking at old RMAN backup logs – and this goes back to the question about what makes a good script – crazy good logging. We had crazy good logging, so I could tell which archive logs were in which RMAN backup pieces. We had to use this DBMS package to recover the archive logs, so that we could apply them to the database restore we were in the middle of and get everything up to 6:30 in the morning.
It was a nightmare and a half. The first couple of times I ran the DBMS job it wouldn’t work, because we had these 10 archive logs spread across these 5 RMAN backup pieces: they all came from the same channel, but there was a 2 GB file size limit. So we had to do it literally through trial and error: recover this archive log from this piece, then try that piece, until you found the one you needed.
Jim: The whole time you’re talking, all I’m thinking is that 10g would have been nice.
Joe: Yes – very much so!
So literally through trial and error: I want this archive log out of this piece, OK no – this piece, OK no – this piece, ooh good I got lucky, it was in this one. OK now I need this archive log, was it in this piece – no, was it in this piece – nope, was it in this piece – no… So you had to go back and forth and back and forth, and when you eventually got to the end you had all of the archive logs you needed to restore.
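Good backup logging can narrow that trial and error to a short list of candidate pieces per archive log. The sketch below assumes the logs have already been parsed into a mapping of backup piece to the archive log sequence numbers recorded for it. The piece names, sequence numbers, and the parsing step are hypothetical, and it does not call any Oracle tool.

```python
from collections import defaultdict


def index_pieces_by_sequence(piece_contents):
    """Invert {piece: [sequence, ...]} into {sequence: [piece, ...]}."""
    index = defaultdict(list)
    for piece, sequences in piece_contents.items():
        for seq in sequences:
            index[seq].append(piece)
    return dict(index)


def pieces_for_range(piece_contents, first_seq, last_seq):
    """Return candidate pieces per needed sequence, plus sequences found nowhere."""
    index = index_pieces_by_sequence(piece_contents)
    needed = {}
    missing = []
    for seq in range(first_seq, last_seq + 1):
        if seq in index:
            needed[seq] = sorted(index[seq])  # candidates to try, in name order
        else:
            missing.append(seq)  # not in any logged piece: restore more pieces
    return needed, missing


# Hypothetical mapping recovered from old backup logs.
piece_contents = {
    "arch_piece_01": [1001, 1002, 1003],
    "arch_piece_02": [1003, 1004],
    "arch_piece_03": [1006],
}
needed, missing = pieces_for_range(piece_contents, 1002, 1006)
for seq, candidates in needed.items():
    print(f"sequence {seq}: try {', '.join(candidates)}")
print("not found in any logged piece:", missing)
```

A sequence that lists several candidates still has to be tried piece by piece, as in the story, but the search is bounded. Anything reported as missing means more pieces have to come back from tape.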
Restore database, recover database, open resetlogs. And it all worked. The customer was ecstatic that they didn’t completely lose the entire database, and their employees were able to put together the day’s worth of work that they had lost.
But they did have old, slow systems – it took the weekend to do the whole recovery. Old, slow systems – we literally had to call Iron Mountain or whoever was storing their tapes. It was in L.A., so fighting traffic to drive the tapes to them because, of course, you want them 30 miles away or whatever. All that other crazy stuff.
But, in the end, we got all the pieces and parts to be able to put the database back together again, in probably one of the most painful ways that I’ve seen.
Jim: So you ran into space issues, feature issues with 9i, logistical issues with the tapes, and no Oracle support – just about everything, sort of a Murphy’s Law kind of situation.
But hopefully a CIO will read this and think, “Man we don’t think or talk about recovery enough at my shop.” People don’t. They think “it’s not going to happen to me.”
Share Your War Story
What is your favorite backup and recovery war story? We encourage you to share it through the comments section below – we’d love to hear it.