Dave Welch (@OraVBCA), CTO and Chief Evangelist
Have you ever noticed that just about 100% of grade school through high school musicians are capable of playing the right note on their sheet music at the right time? That said, precious few of them have been taught to play by ear – a prerequisite for improvisation.
This post is not about the technique of playing business continuity and security notes in routine production operations. Rather, you could say this post is about improvisation.
Let’s introduce the topic with a story.
I was a grow-from-within Oracle DBA in a large health care concern, the first of an eventual team. This was in the early 1990s. The opportunity included five rapid-fire weeks of Oracle University classes, one of which was given in Microsoft’s backyard – Redmond, WA. The class was full with 20 students. National Language Support came up and was slated to consume at least half a day. I incited a rebellion: “How many of you would rather allocate that time to a deeper exploration of backup/recovery theory and application?” My idea was met with an enthusiastic and unanimous response. The instructor was thrilled, and off we went as he improvised his apparently significant business continuity experience into an invigorating and memorable on-the-fly knowledge transfer. His presentation was course-corrected and refined by an unusual dose of totally on-point student interaction. By the time we were done, we had assembled ourselves into an email group. In the following weeks, we exchanged a WordPerfect document that eventually became a 20-page insert in the back of my Franklin Day Planner. I was ready, at least in theory.
Back at the office, I asked my supervisor for a recovery trial. “No can do. We don’t have the time or budget for that.” My probably not-so-humble response was that we would have our recovery trial one way or the other – controlled or uncontrolled.
Less than a year later, it happened. Our clinical production IBM 590 RS/6000 box lost both sides of internal drive mirroring, and IBM could never figure out what happened based on system log analysis. I recall receiving an apology call from an IBM higher-up in a position I didn’t even know existed.
We loaded our backup cartridges and used the HP OpenView recovery client to list the inventory of available files. However, there were a lot of empty rows on the green screen between occasional files. We continued to load and list cartridges from earlier in the same backup, and from earlier backups, with the same result. A Sev-One service request to HP led to the discovery of a bug in the recovery client. Despite the product’s market penetration, the client had a defect that surfaced when it was run under a root surrogate account. We had hosed ourselves by following a security best practice – don’t log in as root (user ID 0). Within 24 hours, HP FTP’d the patched recovery client and we got underway with restoring our database files.
As I kicked off Oracle recovery, the Oracle client asked me for an archive log that didn’t exist. We had a hole in our archive log thread. Further research determined the cause. Between the time the HP OpenView client built its in-memory array of inodes to be backed up and the time it actually got to backing up the archive logs, the transactional heat in the database had caused the online redo logs to roll more than three times. Additional archive logs were generated that weren’t in the OpenView client’s inode array. If we had scripted the OpenView client to back up the database files and archive logs in separate, contiguous calls, we would have been OK. But the Oracle University class’s enhanced ad hoc backup/recovery theory had failed in actual application, through no fault of Oracle University.
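A hole like the one described above is the kind of thing a simple post-backup check can surface before a recovery ever needs it. What follows is only a minimal sketch, not what we ran at the time: it assumes the backup inventory can be exported as a list of file names, and the `arch_<thread>_<sequence>.arc` naming pattern and both function names are hypothetical, made up for illustration.

```python
import re

# Hypothetical naming pattern (an assumption): arch_<thread>_<sequence>.arc
ARCHIVE_NAME = re.compile(r"arch_(\d+)_(\d+)\.arc$")


def archived_sequences(filenames, thread=1):
    """Collect the archive log sequence numbers found for one redo thread."""
    seqs = set()
    for name in filenames:
        match = ARCHIVE_NAME.search(name)
        if match and int(match.group(1)) == thread:
            seqs.add(int(match.group(2)))
    return seqs


def find_sequence_gaps(seqs):
    """Return the sequence numbers missing between the lowest and highest seen."""
    if not seqs:
        return []
    return [n for n in range(min(seqs), max(seqs) + 1) if n not in seqs]


if __name__ == "__main__":
    inventory = [
        "/backup/arch_1_101.arc",
        "/backup/arch_1_102.arc",
        "/backup/arch_1_105.arc",
    ]
    gaps = find_sequence_gaps(archived_sequences(inventory))
    if gaps:
        print("Archive log sequences missing from backup:", gaps)
```

A check like this only proves the sequence is contiguous within the inventory. It does not prove the logs can actually be applied, which is what a recovery trial is for.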
Our uncontrolled “recovery trial” produced a triple failure, two aspects of which would very probably have manifested themselves in a controlled trial. All told, it took 36 hours to bring the mission-critical system back online. Once that was done, our clinicians had to manually re-key seven hours of data. They were understandably hot about it.
In another organization, I pulled what some considered a stunt. I walked into the ops manager’s office on two different occasions months apart and declared, “Recovery Trial”. I had backed up and deleted obscure files out of a mission-critical file system and invited the ops staff to take no more than 24 hours to restore the pair of files I identified for them. It didn’t go well on either occasion. At the end of each trial, I listened to varying excuses as to why the files couldn’t be restored. On both occasions, I dutifully informed my business vertical’s management team that their system stacks were at risk.
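For anyone inclined to run a similar trial, the bookkeeping can be sketched in a few lines: record a checksum of each file before removing it, then verify whatever ops restores against that baseline. This is an illustrative sketch only. The function names are hypothetical, and it assumes the files are restored to their original paths.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files don't have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_baseline(paths):
    """Capture checksums before the trial files are deleted."""
    return {str(p): sha256_of(p) for p in paths}


def verify_restore(baseline):
    """Classify each trial file as 'ok', 'missing', or 'mismatch' after the restore."""
    results = {}
    for path, expected in baseline.items():
        if not Path(path).is_file():
            results[path] = "missing"
        elif sha256_of(path) != expected:
            results[path] = "mismatch"
        else:
            results[path] = "ok"
    return results
```

Keeping the baseline somewhere other than the file system under test means the trial's pass/fail result doesn't depend on the thing being tested.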
Over House of Brick’s twenty years in business, I can think of at least three occasions where we have apologetically, but boldly, notified clients’ senior management that we were hijacking whatever engagement we had contracted for. While each client was going through the motions of backup or DR operations, our experience with what an operational backup or DR capability requires led us to notice quickly that their backup or DR images were utterly unrestorable and/or unrecoverable.
Thirteen years ago in Baltimore, I listened in awe as a customer IT director told the story of a CIO colleague at another organization who walked into his organization’s cold room. With no authorization from (or coordination with) anyone, the CIO announced, “Recovery Trial!” He leaned on the big red button on the wall and cut the power to the room. We could debate the risks and costs of such arrogance to the organization and the individual, but the anecdote is most instructive.
Cloud: Easier Or Harder?
Now, on to the cloud. Don’t tell me how polished a cloud provider’s security implementation slides are, how many industry security certifications they have obtained, or how robust the contractual security guarantees are. I don’t care who your prospective cloud provider is. If they have a clause in the cloud managed services agreement that makes scanning grounds for your eviction, then in my mind you have no business putting anything other than encrypted data there. (See the prohibition on scanning in the Oracle Cloud Services Agreement.)
Someone is indeed scanning into the environment, and very probably also laterally within it, despite any contractual prohibition that may exist. And no, I’m not saying cloud customers should have the contractual privilege of mounting trial Denial-of-Service attacks.
IT professionals don’t own the data. They are data stewards. As such, the least of their concerns should be that a security breach could add their organization to the sorry parade of nameplate security breach victims on the front page of the Wall Street Journal and Financial Times, or that they might lose their jobs.
- If you aren’t doing routinely scheduled recovery trials, you have no backup.
- If you aren’t doing routinely scheduled HA failover trials, you have no HA.
- If you aren’t doing routinely scheduled DR trials, you have no DR.
- If you aren’t doing routinely scheduled scanning into the cloud environment, and laterally within it, without the provider’s prior knowledge and coordination, you have no security.