Jim Hannan (@HoBHannan), Principal Architect & Joe Grant, Principal Architect
Jim Hannan recently interviewed his colleague, and fellow Principal Architect Joe Grant, who has completed over 100 restores in his more than 10 years as an Oracle DBA for his insights regarding best practices for backups and recoverability.
Below is the second part of their interview.
What makes a good RMAN backup script? Or more appropriately, where should someone run a backup โ out of cloud control or a script?
Joe: That depends. In a larger organization with lots of DBAs that all need to have their fingers into something, I would say likely Cloud Control. But, it also depends upon whether or not there is some other sort of enterprise scheduling within the environment โ because you want that central place to be able to touch and play and fix an update.
And hereโs the fun thing, with Cloud Control you can have a script that exists on each of the DB hosts that is just called from Cloud Control. So Cloud Control is the scheduler, but you still have the script and database host where you are doing the work. And that is what weโve done a lot of.
Jim: So for clarification, it sounds like youโre suggesting using Cloud Control as a scheduler but donโt use the native backup tooling. You should write your own scripts because we can put more capabilities in it.
Joe: Correct. The other thing is youโre going to want is the option of doing more than just backing up the database.
That takes us to the next question of what makes up a good backup script that you asked. What makes a good backup script is checking every single stupid little step that you have. So you have to do a lot of things within the script if youโre going to keep it generic. Am I going to have a script for every database, or am I going to have one script that has a parameter file and config file or something like that? I tend to favor the config file. That way I can have the script be more useful and more portable. And so youโve got to error check every stupid little step. Is my config file readable? OK, great, now I can read it again. Now Iโve read it in. Did I read it all correctly? Are the variables that I was looking for now defined? Do I have defaults for certain variables so if theyโre not defined I can at least continue on? Is there a catalog database? Great, then TNS Ping it first.
Jim: For error checking?
Joe: Yes, for error checking. OK, did the TNS Ping come back correct, or as good? Does that cause you a halt or not? You know, things of this nature all the way through. So in our backup script that we use here, itโs several hundred lines long but the actual backup is 2.
Jim: Are those 98 lines because of error checking?
Joe: Yeah. So itโs all he error checking going on to make sure that every single step along the way ran correctly. Iโd have to grab the count of the number of times that the error or the page function can possibly get called. But its not a low number, especially for the fact that itโs, in theory, a relatively simple backup script.
A client is new to RMAN. What are your recommendations for getting started?
Joe: Well, not to plug Oracle, but there is a class. And we had a gray market class that pretty much matched their syllabus, and thatโs how I learned. And itโs only a three-day class.
Barring spending that much money, any basic RMAN book will walk you through all recovery scenarios. Do a full recovery. Do an incomplete recovery. Do a tablespace point-in-time recovery. Whatever. Even looking in the table of contents you can see all of those different scenarios that the book will walk you through. And itโs telling your boss. โOk, Iโm going to work from home the next three days,โ and you treat it as a class, even though itโs a book. That โs where you begin to get that familiarity with these are all of my possible different recovery scenarios. What happens if I donโt have the control file? What happens if you have nothing? You have your RMAN pieces and a blank server that doesnโt even have Oracle installed yet. And youโve got to walk through all those crazy scenarios.
Jim: You plugged Oracle and also talked about how we used to have a three-day class that was very similar to what they did.
While we donโt currently offer that class, one thing we can do is do on this is mentoring. Spending a few days with our customers to help develop an RMAN script. Not only that, but also get in there and test some recovery scenarios like block corruption and schema like you just talked about. I think that, for customers we are working with, it is a really good way to get going with RMAN and get more comfortable.
So what makes a good recovery strategy?
Joe: Again, itโs what are your failure scenarios? And what do you want to able to recover from?
Personally, I want to be able to have a pile of RMAN pieces and the Oracle installer and be able to put everything together from that.
OK, so what do I need to know from that? And the answer is shockingly very little. As long as I know what database version created those RMAN pieces. OK great, I want to assume that my primary site is a smoking hole and I have a tape, right? So I have all these pieces and I know that Iโm on 11.2.04. What does it take to put it together from that? And so you log in and install all your Oracle binaries and then you start off with recovering the control file and recovering the SP file.
Jim: I want to emphasize in your answer that youโre talking about the tactical approach. But before you start that you have the white board strategy.
Joe: Yeah, its what happened? Do I need to restore everything? Did somebody inadvertently drop a table or a tablespace and so I need to do a partial recovery?
Jim: Or can I use a flashback feature?
Joe: Exactly.
Jim: Before I start a recovery?
Joe: And weโve done all of them. Is database flashback enabled? Do you have flashback logs? Do we have undo flashback technologies available?
Jim: I like what youโre saying and I really want to emphasize this. You have to understand what the issue is. And I know thatโs obvious, but I think thereโs a tendency sometimes, in a panic, that you just go with what youโre comfortable with. And that might be an RMAN restore. And that might be the inappropriate solution. So youโve got to be patient. Youโve got to take your time and understand the problem.
Joe: Yeah, itโs always taking the step back and asking what happened. How did it happen? And then what needs to get fixed? And so if you have those scenarios, user deleted certain records, user deleted or removed a table, truncated a table, dropped a tablespace. Those are all things that donโt necessarily need a full database restore, depending upon how the application deals with that missing data for the time that it takes you to restore it. And so thatโs when you can restore to another machine. Grab your data file, grab your tablespace, grab your table, grab your couple of rows, whatever you need and then just import it back in. Thatโs that tablepace point-in-time recovery or things of this nature.
Jim: Youโre pointing out thereโs a tendency to rush into a full restore.
Joe: Right, and there have been numerous times where weโve restored a table or tablespace โ a chunk of the database โ that wasnโt a whole, while the whole was still up and available to users doing stuff. A couple people got weird errors. But for the most part, everything was up and available.
Jim: When you think about something as catastrophic as a corrupted database, thatโs a pretty big win that your users were still in.
Joe: Yeah, and you were at least mostly functional.
What are your thoughts on opening an SR during a restore? Do you feel it is a waste of time, or helpful?
Joe: It depends on if I have to wind up doing something Iโve never done before. Or is maybe poorly documented. There are some oddball things that you are going to wind up with where you donโt always necessarily know what you are doing.
For example, we had one client with just a ridiculously large number of tablespaces and data files โ five or six hundred data files and a couple hundred tablespaces. Well they needed to restore a portion of that. OK, so open an SR because there might be certain tips, tricks, or techniques that you may not necessarily know about that Oracle support will. Now that being said, opening a service request (SR), thatโs 30 to 50 minutes of Iโm not doing recovery right now.
Jim: It is an investment.
Joe: So you have to weigh that against what youโre trying to accomplish. Iโve got a 200 gig database and Iโm doing a straight full restore and I know Iโve got all of my archive log. Iโm not going bother opening and SR because I know how to do that. I have literally done that in my sleep.
Jim: I kind of come out the same way you do on the SR. That is an investment that Iโm not jumping into right away โ itโs conditional.
Conclusion
In this segment of their interview, Jim and Joe discussed what makes a good RMAN script, their recommendations for getting started with RMAN, what makes a for successful recovery strategy, and their thoughts on opening a service request (SR), during a restore.
In the next part of their interview, Jim and Joe discuss their concerns about the lack of alerts on failed backups, their thoughts on the best RMAN books and how theyโve seen customers effectively engage with House of Brick during a restore to assist in a successful recovery effort.