An Interview – Backups & Recoverability – Part 2

Jim Hannan (@HoBHannan), Principal Architect & Joe Grant, Principal Architect

Jim Hannan recently interviewed his colleague, and fellow Principal Architect Joe Grant, who has completed over 100 restores in his more than 10 years as an Oracle DBA for his insights regarding best practices for backups and recoverability.

Below is the second part of their interview.

What makes a good RMAN backup script? Or more appropriately, where should someone run a backup – out of cloud control or a script?

Joe: That depends. In a larger organization with lots of DBAs that all need to have their fingers into something, I would say likely Cloud Control. But, it also depends upon whether or not there is some other sort of enterprise scheduling within the environment – because you want that central place to be able to touch and play and fix an update.

And here’s the fun thing, with Cloud Control you can have a script that exists on each of the DB hosts that is just called from Cloud Control. So Cloud Control is the scheduler, but you still have the script and database host where you are doing the work. And that is what we’ve done a lot of.

Jim: So for clarification, it sounds like you’re suggesting using Cloud Control as a scheduler but don’t use the native backup tooling. You should write your own scripts because we can put more capabilities in it.

Joe: Correct. The other thing is you’re going to want is the option of doing more than just backing up the database.

That takes us to the next question of what makes up a good backup script that you asked. What makes a good backup script is checking every single stupid little step that you have. So you have to do a lot of things within the script if you’re going to keep it generic. Am I going to have a script for every database, or am I going to have one script that has a parameter file and config file or something like that? I tend to favor the config file. That way I can have the script be more useful and more portable. And so you’ve got to error check every stupid little step. Is my config file readable? OK, great, now I can read it again. Now I’ve read it in. Did I read it all correctly? Are the variables that I was looking for now defined? Do I have defaults for certain variables so if they’re not defined I can at least continue on? Is there a catalog database? Great, then TNS Ping it first.

Jim: For error checking?

Joe: Yes, for error checking. OK, did the TNS Ping come back correct, or as good? Does that cause you a halt or not? You know, things of this nature all the way through. So in our backup script that we use here, it’s several hundred lines long but the actual backup is 2.

Jim: Are those 98 lines because of error checking?

Joe: Yeah. So it’s all he error checking going on to make sure that every single step along the way ran correctly. I’d have to grab the count of the number of times that the error or the page function can possibly get called. But its not a low number, especially for the fact that it’s, in theory, a relatively simple backup script.

A client is new to RMAN. What are your recommendations for getting started?

Joe: Well, not to plug Oracle, but there is a class. And we had a gray market class that pretty much matched their syllabus, and that’s how I learned. And it’s only a three-day class.

Barring spending that much money, any basic RMAN book will walk you through all recovery scenarios. Do a full recovery. Do an incomplete recovery. Do a tablespace point-in-time recovery. Whatever. Even looking in the table of contents you can see all of those different scenarios that the book will walk you through. And it’s telling your boss. “Ok, I’m going to work from home the next three days,” and you treat it as a class, even though it’s a book. That ‘s where you begin to get that familiarity with these are all of my possible different recovery scenarios. What happens if I don’t have the control file? What happens if you have nothing? You have your RMAN pieces and a blank server that doesn’t even have Oracle installed yet. And you’ve got to walk through all those crazy scenarios.

Jim: You plugged Oracle and also talked about how we used to have a three-day class that was very similar to what they did.

While we don’t currently offer that class, one thing we can do is do on this is mentoring. Spending a few days with our customers to help develop an RMAN script. Not only that, but also get in there and test some recovery scenarios like block corruption and schema like you just talked about. I think that, for customers we are working with, it is a really good way to get going with RMAN and get more comfortable.

So what makes a good recovery strategy?

Joe: Again, it’s what are your failure scenarios? And what do you want to able to recover from?

Personally, I want to be able to have a pile of RMAN pieces and the Oracle installer and be able to put everything together from that.

OK, so what do I need to know from that? And the answer is shockingly very little. As long as I know what database version created those RMAN pieces. OK great, I want to assume that my primary site is a smoking hole and I have a tape, right? So I have all these pieces and I know that I’m on 11.2.04. What does it take to put it together from that? And so you log in and install all your Oracle binaries and then you start off with recovering the control file and recovering the SP file.

Jim: I want to emphasize in your answer that you’re talking about the tactical approach. But before you start that you have the white board strategy.

Joe: Yeah, its what happened? Do I need to restore everything? Did somebody inadvertently drop a table or a tablespace and so I need to do a partial recovery?

Jim: Or can I use a flashback feature?

Joe: Exactly.

Jim: Before I start a recovery?

Joe: And we’ve done all of them. Is database flashback enabled? Do you have flashback logs? Do we have undo flashback technologies available?

Jim: I like what you’re saying and I really want to emphasize this. You have to understand what the issue is. And I know that’s obvious, but I think there’s a tendency sometimes, in a panic, that you just go with what you’re comfortable with. And that might be an RMAN restore. And that might be the inappropriate solution. So you’ve got to be patient. You’ve got to take your time and understand the problem.

Joe: Yeah, it’s always taking the step back and asking what happened. How did it happen? And then what needs to get fixed? And so if you have those scenarios, user deleted certain records, user deleted or removed a table, truncated a table, dropped a tablespace. Those are all things that don’t necessarily need a full database restore, depending upon how the application deals with that missing data for the time that it takes you to restore it. And so that’s when you can restore to another machine. Grab your data file, grab your tablespace, grab your table, grab your couple of rows, whatever you need and then just import it back in. That’s that tablepace point-in-time recovery or things of this nature.

Jim: You’re pointing out there’s a tendency to rush into a full restore.

Joe: Right, and there have been numerous times where we’ve restored a table or tablespace – a chunk of the database – that wasn’t a whole, while the whole was still up and available to users doing stuff. A couple people got weird errors. But for the most part, everything was up and available.

Jim: When you think about something as catastrophic as a corrupted database, that’s a pretty big win that your users were still in.

Joe: Yeah, and you were at least mostly functional.

What are your thoughts on opening an SR during a restore? Do you feel it is a waste of time, or helpful?

Joe: It depends on if I have to wind up doing something I’ve never done before. Or is maybe poorly documented. There are some oddball things that you are going to wind up with where you don’t always necessarily know what you are doing.

For example, we had one client with just a ridiculously large number of tablespaces and data files – five or six hundred data files and a couple hundred tablespaces. Well they needed to restore a portion of that. OK, so open an SR because there might be certain tips, tricks, or techniques that you may not necessarily know about that Oracle support will. Now that being said, opening a service request (SR), that’s 30 to 50 minutes of I’m not doing recovery right now.

Jim: It is an investment.

Joe: So you have to weigh that against what you’re trying to accomplish. I’ve got a 200 gig database and I’m doing a straight full restore and I know I’ve got all of my archive log. I’m not going bother opening and SR because I know how to do that. I have literally done that in my sleep.

Jim: I kind of come out the same way you do on the SR. That is an investment that I’m not jumping into right away – it’s conditional.

Conclusion

In this segment of their interview, Jim and Joe discussed what makes a good RMAN script, their recommendations for getting started with RMAN, what makes a for successful recovery strategy, and their thoughts on opening a service request (SR), during a restore.

In the next part of their interview, Jim and Joe discuss their concerns about the lack of alerts on failed backups, their thoughts on the best RMAN books and how they’ve seen customers effectively engage with House of Brick during a restore to assist in a successful recovery effort.

Table of Contents

Related Posts