An Interview – Backups & Recoverability – Part 3
Jim Hannan (@HoBHannan), Principal Architect & Joe Grant, Principal Architect
This is the third part of a recent interview between Jim Hannan and his colleague and fellow Principal Architect, Joe Grant. Joe has completed over 100 restores in his more than 10 years as an Oracle DBA, and Jim asked for his insights regarding best practices for backups and recoverability.
I’ve observed that the industry isn’t alerting on failed backups and I find that very concerning. Would you agree?
Joe: Yes, very much so.
Jim: Nobody is reacting to a failed backup. You know, something could be failing for days and an email just might not be getting read.
Joe: Yeah, any failed backup should wake somebody up. There are maybe some small components within the script. OK, great. We couldn’t connect to the RMAN catalog. I still want to do a backup because my recovery information will be in my control file. And so, OK, I’m going to throw a warning that I couldn’t connect to the RMAN catalog, but I’m still going to continue an RMAN backup. However, if I get to the end of my RMAN backup and my exit status is something other than 0, somebody needs to get woken up. Period.
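Joe's rule (warn on a catalog problem, page on a non-zero exit status) could be sketched as a small wrapper. This is only an illustration, not the script he describes: the function name, the severity labels, and the idea of passing the catalog check and the backup as two separate commands are all assumptions.

```python
import subprocess
import sys


def run_backup(catalog_check_cmd, backup_cmd):
    """Run a backup and return (severity, message).

    Both arguments are command lists supplied by the caller. They are
    hypothetical here; the sketch assumes no particular RMAN invocation.
    """
    warnings = []

    # A failed catalog connection is only a warning: the recovery
    # information is still recorded in the control file.
    if subprocess.run(catalog_check_cmd).returncode != 0:
        warnings.append("could not connect to the RMAN catalog")

    # The backup itself is what matters: any non-zero exit status
    # means somebody gets woken up.
    result = subprocess.run(backup_cmd)
    if result.returncode != 0:
        return ("PAGE", f"backup exited with status {result.returncode}")
    if warnings:
        return ("WARN", "; ".join(warnings))
    return ("OK", "backup completed")


if __name__ == "__main__":
    # Stand-in commands so the sketch runs anywhere:
    # the catalog check fails, the backup succeeds.
    fail = [sys.executable, "-c", "import sys; sys.exit(1)"]
    ok = [sys.executable, "-c", "pass"]
    print(run_backup(fail, ok))
```

The returned severity would then be handed to whatever paging or email system is in place, which is outside the scope of this sketch.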
Jim: It’s interesting that we’re talking about this because one of the things that I like that I saw was a customer wasn’t doing any alerting but they were using about 300 Oracle databases, and they would write to a table in the RMAN catalog. And basically what it would do is it would query the RMAN output. So it would grab those lines and if there was an error or success it was writing to this global table. And what I thought was interesting was they were going to get alerted but they could report later on against how many RMAN failures did we see? What were they?
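The approach Jim describes, scanning the RMAN output and logging each run to a central table, might look roughly like the following. It is a sketch under several assumptions: SQLite stands in for the table in the RMAN catalog database, the table and function names are hypothetical, and any line carrying an RMAN-nnnnn or ORA-nnnnn code is treated as a failure, which would over-report runs that only produced coded warnings.

```python
import re
import sqlite3

# RMAN and Oracle error messages carry an "RMAN-nnnnn" or "ORA-nnnnn" code.
ERROR_LINE = re.compile(r"\b(RMAN|ORA)-\d{5}\b")


def open_history(path=":memory:"):
    """Open (and create if needed) the hypothetical global history table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS backup_history ("
        "db_name TEXT, status TEXT, detail TEXT, "
        "logged_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    return conn


def record_backup(conn, db_name, rman_output):
    """Store one row per backup run: SUCCESS, or FAILED with the error lines."""
    errors = [ln.strip() for ln in rman_output.splitlines() if ERROR_LINE.search(ln)]
    status = "FAILED" if errors else "SUCCESS"
    conn.execute(
        "INSERT INTO backup_history (db_name, status, detail) VALUES (?, ?, ?)",
        (db_name, status, "\n".join(errors)),
    )
    conn.commit()
    return status


def failure_counts(conn):
    """Report how many failed runs each database has had."""
    rows = conn.execute(
        "SELECT db_name, COUNT(*) FROM backup_history "
        "WHERE status = 'FAILED' GROUP BY db_name ORDER BY db_name"
    )
    return dict(rows.fetchall())
```

Because every run is stored, not only the failures, the same table can answer both of Jim's questions later: how many failures there were, and what they were.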
Joe: OK, now that’s interesting.
Jim: It’s obviously an enterprise-like solution. But I think the point I am making is that there are a lot of really helpful views for RMAN. If I was thinking about features that people aren’t using, that would be one of them that I would recommend.
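One example of such a view is V$RMAN_BACKUP_JOB_DETAILS, which keeps one row per RMAN job along with its status. The sketch below is an assumption about how it might be queried from Python, not something described in the interview: the function name is hypothetical, and `cursor` is assumed to be a DB-API cursor from an Oracle driver that accepts named bind variables.

```python
# Jobs from the last :days days whose status is anything other than
# a clean COMPLETED or a still-clean RUNNING.
PROBLEM_JOBS_SQL = """
SELECT session_key, input_type, status, start_time, end_time
  FROM v$rman_backup_job_details
 WHERE start_time > SYSDATE - :days
   AND status NOT IN ('COMPLETED', 'RUNNING')
 ORDER BY start_time
"""


def recent_problem_backups(cursor, days=7):
    """Return rows for recent RMAN jobs that failed or raised warnings/errors.

    `cursor` is any DB-API cursor connected to the target database
    (an assumption of this sketch; no specific driver is required).
    """
    cursor.execute(PROBLEM_JOBS_SQL, {"days": days})
    return cursor.fetchall()
```

An empty result is the normal case; anything returned is a candidate for the kind of alert discussed above.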
Is there an RMAN book out there that is one of your favorites?
Joe: I don’t necessarily have a favorite. There’s an Oracle Press book that was co-written by Freeman, I think. That just out and out explains backup and recovery, and that one is very good at that. And there are a couple of recipe books that are more by example.
Jim: I think you’re talking about the Apress books. The one I’m looking at right now is RMAN Recipes for Oracle Database 11g. And personally that’s one of my favorites.
Joe: But for me, I don’t know that I would necessarily go to the recipe book first. I want a deep understanding of how it works. The recipe books tend to have a shorter explanation on certain things.
Maybe one of the questions that we didn’t talk about earlier is what is important to know for a recovery? For me it is absolutely, what does it mean for a database to be consistent? What pieces and parts do you have to have there to know what that means? So you have to have all of your redo. You have to have all of your archive logs. Yes, redo is an archive log, so if you had to you can point the database to a redo log file as if it were an archive log file.
Jim: You’re making a great point. Before you can start really testing scenarios, you’ve got to understand the core components of the database. How does the archiver work? How do redo logs work? What’s in a redo log? What’s a change vector? How does database writer kick-in? Why does Oracle say the redo logs are the most important? You’ve got to understand all of that first.
Joe: For me, that’s one of my interview questions. What does it mean for a database to be consistent? Even to the point where the SCN is recorded in the header of all the data files.
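The idea behind Joe's interview question can be illustrated with a deliberately simplified model: compare the checkpoint SCN the control file expects with the SCN recorded in each data file header. Oracle tracks considerably more state than this, so the function names and the single-SCN comparison below are assumptions made for illustration only.

```python
def is_consistent(controlfile_scn, header_scns):
    """True when every data file header carries the control file's SCN."""
    return all(scn == controlfile_scn for scn in header_scns.values())


def files_needing_recovery(controlfile_scn, header_scns):
    """Return {file_name: header_scn} for data files behind the control file.

    controlfile_scn: checkpoint SCN recorded in the control file.
    header_scns: mapping of data file name to the SCN in its header.
    Redo from each file's header SCN onward would have to be applied
    to bring that file up to the control file's SCN.
    """
    return {name: scn for name, scn in header_scns.items() if scn < controlfile_scn}


if __name__ == "__main__":
    headers = {"system01.dbf": 1000, "users01.dbf": 950}
    print(is_consistent(1000, headers))
    print(files_needing_recovery(1000, headers))
```

In this toy example one file was restored from an older backup, so its header lags behind and the database is not consistent until redo has been applied to it.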
Jim: I always liked that OCP question where they talk about the startup of a database and then you instantiate the memory. I can’t remember the wording, but basically when the database is opening what’s it doing? Is it reading through the data files? Is it only looking at the headers?
See? You and I know and we’re shaking our heads. What does it mean to roll back and how is it doing that before it opens the database for business? So I think those are really important things to understand.
Joe: Yeah. And you’ve absolutely got to know how a database is consistent and how it maintains its consistency. And then that falls into that other question of what type of recovery do I need?
Jim: I think that there is some deception in being good at RMAN duplicate that can create overconfidence, that “I know how to do a recovery.” It oversimplifies the recovery, doesn’t it?
Joe: Yeah. It’s what is your recovery scenario? OK, great. Things are completely dead and I’ve got RMAN pieces and the Oracle install files. And I know that I need to be on 11.2.0.4. And that’s about it.
How have you seen customers engage HoB during a restore and effectively use our skills to assist in their recovery efforts?
Jim: The first one that pops into my mind is a customer who only wanted to discuss the strategy of the restore. And I thought wow, talk about maximizing House of Brick’s skills.
Joe: Well, a large part of that is just the sheer fact that we have the experience of doing this a lot, and most customers do not. They don’t run into it very often because hey, we got lucky – our systems are stable.
Jim: To your point, some ask can you just do the restore for us?
Joe: Some just say can you restore this for me? Sure.
Jim: That’s probably most of the calls we get.
Joe: Yes, but every now and again you get the call of “oh, we had this for a backup strategy and now we can’t seem to find this archive log”. OK, great. And you have to explain to people that they’re hosed. I have explained to more than one company “No, you can’t recover through an archive log that you don’t have.”
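Why a missing archive log is fatal to everything after it can be shown with a short sketch. Recovery applies logs in sequence order, so it stops at the first gap no matter how many later logs survive. The function name is hypothetical, and the sketch assumes a single redo thread for simplicity.

```python
def last_recoverable_sequence(start_seq, available_seqs):
    """Return the highest log sequence recovery can apply without a gap.

    start_seq: first sequence needed after the restored backup.
    available_seqs: sequence numbers of the archive logs that still exist.
    Returns None if even the first required log is missing.
    """
    have = set(available_seqs)
    seq = start_seq
    # Walk forward one sequence at a time until a log is missing.
    while seq in have:
        seq += 1
    return seq - 1 if seq > start_seq else None


if __name__ == "__main__":
    # Sequence 104 is missing, so 105 and 106 cannot be applied.
    print(last_recoverable_sequence(101, [101, 102, 103, 105, 106]))
```

Running a check like this against the backup inventory before a failure, not during one, is one way to find the gap while it can still be fixed.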
Jim: Let me describe a scenario that I remember from a while ago. An experienced DBA called us. His flash cards had given him some issue. It was a driver issue and there was corruption. So he was familiar with recoveries, and he’s done Oracle recoveries before. But how he utilized us really well is that he just wanted a second DBA opinion. So it was a phone call that took about an hour. He said “here’s what I’m seeing and here’s what I think the approach is. What do you think?” And you said “I think that’s a great approach, but I would recommend that you turn off parallelism so we can get this recovery to go faster.”
Joe: I remember that phone call now. Sorry, I didn’t remember it in the context of an RMAN recovery. But yeah, he needed to get data files off the flash. And I was so thankful that he was willing to do the work.
Jim: It’s so nice; we have this luxury of working every day with a lot of smart DBAs. There’s always somebody that will answer the phone in the middle of the night. But I think for a customer where you might be the only DBA, it could just be an hour conversation with us to say, “Here’s what I think the strategy is….”
Joe: Yeah, or even just to walk through the type of recovery you need. Oh hey I’m missing this. OK, you don’t need to do a full. You can just do this.
Jim: We’d both really encourage customers to reach out right away before they even start the recovery. That’s a really optimal time for us to give them some input.
Joe: For our MCS clients, we are the ones doing it – we’re doing the recovery. Even for those that just need that few minutes, we’re going to be on the phone faster than Oracle support could.
Jim: That’s a great point. We’re not going to ask for logs. We’re going to talk it through on the phone to find out what happened.
In this third part of their interview, Jim and Joe discussed their concerns about the lack of alerts on failed backups, their thoughts on the best RMAN books and how they’ve seen customers effectively engage with House of Brick during a restore to assist in a successful recovery effort.
Next week read all about Joe’s favorite RMAN restore war story in the final part of their interview.