How to Eat an Elephant – Getting Started with Large Projects
by Joe Grant (@dba_jedi), Principal Architect
Through the years, I have been assigned to several projects that can be considered large migrations. Some were just database migrations, but many of them included a variety of different applications. Other examples of large projects include platform migrations, such as migrating databases from AIX to Linux on x86_64, or simply moving everything in datacenter A to datacenter B (a project that may have no database component at all).
As one works through these large projects, there are several aspects of project planning and management that need to be addressed, which do not necessarily come up during smaller projects. These additional components can be very intimidating, and may cause fear and panic. It is easy to respond to a crazy business decision or upper management request of “what would it take to do…?” with a simple “that task is impossible.” The real challenge, and in my case the fun, comes from figuring out how to accomplish these crazy tasks.
As I was writing this article, I found it very difficult to give direction without sharing specific examples. The whole thing was simply too generic and difficult to understand. As an example to illustrate the concept for this article, I will be discussing a very large datacenter move that I was a part of a few years ago. Even with a specific project in mind, many of the lessons in this example are applicable to different types of large projects.
As for the project itself, we needed to move a client from datacenter A to datacenter B. Their datacenter provider sold the building they were in, and everything had to go. We were lucky in that the new datacenter was in the same metro area, and we had decent connectivity. In addition, the client was nearly completely virtualized, so for the most part, we were simply pushing VMs around (there were around 525 VMs and we had about six months for the move). There were also a few physical systems that needed to be moved.
Someone may say that vMotion and Storage vMotion should take care of everything. However, if that were technically possible in this case, we certainly would have done that. It would have simplified a lot of things. In this case, trust me, it simply was not possible.
No Plan is Perfect
The absolute first thing that needs to be understood is:
“No battle plan ever survives contact with the enemy.”
— Helmuth von Moltke the Elder
No matter how well you think you have planned, something will come up that requires you to change the plan. No plan is perfect. As long as this is understood, you will panic less and better understand what to do next. The key is not to get frustrated and quit. Keep moving and keep pushing the project forward.
Kick off the project by collecting as much data as possible, and start organizing it as best you can. Do not focus too much on getting the organization right the first time, as it will change. As you collect information, patterns will emerge and you will begin to see the divisions to use for organization. This may mean going back to groups (or individuals) to clarify information, and this is OK.
The specific information to collect will depend on the type of project, but some basics for our datacenter move example included:
A list of all VMs
– All means all. Even if you think a VM will not need to move, keep it on your list.
Information for the VMs
– CPU and memory settings
– Storage allocation
– Networks it is attached to
Once you have a list of everything, then you need to start gathering some additional information in order to make the list and data relevant.
– Application hosted on the VM
– Business users, application owner, etc. Basically, who is going to tell you when you can move it and let you know that it all still works after the move.
– Dependencies on other applications and/or VMs. Keeping app and DB servers together is a good thing.
What is going to change?
– Do the VMs get to keep their IP addresses?
– Are upgrades needed along the way for vSphere, guest operating systems, applications, or VMware Tools? This list can go on and on.
– For my sample project, the datacenter provider also managed backups. The backup infrastructure was different between the datacenters, so backup software had to be changed. Not a big deal, but it was still something that had to be tracked.
Start tracking performance characteristics of the CPUs and storage. During these types of moves, end users wind up reporting all kinds of crazy things that have nothing to do with the move itself. It’s just stuff that has been bothering them, and now all of a sudden someone is paying attention to them.
Critical personnel for moving any specific application, VM, and/or database.
– Are they critical to more than one phase of your project? Be careful not to double book them.
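The checklist above maps naturally onto one record per VM. The following is only a minimal sketch of how such an inventory might be kept, not the tooling we actually used. Every field name, the `missing_info` helper, and the sample VMs are made up for illustration (Python 3.9 or later assumed).

```python
from dataclasses import dataclass, field


@dataclass
class VMRecord:
    """One row of the move inventory. All field names are illustrative."""
    name: str
    cpus: int
    memory_gb: int
    storage_gb: int
    networks: list[str] = field(default_factory=list)
    application: str = ""      # application hosted on the VM
    owner: str = ""            # who approves the move and validates afterwards
    depends_on: list[str] = field(default_factory=list)
    keeps_ip: bool = True      # does the VM keep its IP address?
    notes: str = ""            # upgrades, backup changes, and so on


def missing_info(vm: VMRecord) -> list[str]:
    """Return the fields that still need a follow-up conversation."""
    gaps = []
    if not vm.application:
        gaps.append("application")
    if not vm.owner:
        gaps.append("owner")
    if not vm.networks:
        gaps.append("networks")
    return gaps


# Hypothetical sample data: one complete record, one with gaps.
inventory = [
    VMRecord("app01", cpus=4, memory_gb=16, storage_gb=200,
             networks=["vlan10"], application="billing",
             owner="finance-team", depends_on=["db01"]),
    VMRecord("db01", cpus=8, memory_gb=64, storage_gb=2000,
             networks=["vlan20"]),
]

# VMs that need someone to go back and ask more questions.
follow_ups = {vm.name: missing_info(vm) for vm in inventory if missing_info(vm)}
print(follow_ups)
```

A spreadsheet does the same job. The point is simply that every VM gets a row, and the gaps in that row tell you who you still need to talk to.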
During this phase, you will begin to see patterns. You will want to make four or five buckets to start putting things into, as well as to start identifying those things that will need an exception process.
- The ‘do not move’ bucket – As you gather information, and start asking questions, you will find at least a few systems that have been forgotten about and do not need to be moved, since no one is using them anymore. In addition, it is likely that a few apps will be retired as a part of the process, for any number of reasons.
- The ‘easy’ bucket – These are the ones that can move at any time. Really no one cares or they are very flexible. Usually these systems are non-production systems that simply have few restrictions on them.
- The ‘weekend only, night time only, and/or some other mild restriction’ bucket – There could be two to three versions of this bucket, but for the most part they can be addressed collectively. These systems are fairly easy to schedule and while there is some concern, overall it is not a big deal.
- The ‘restricted’ bucket – This is where the 80/20 rule applies. There will hopefully be very few in this bucket, but they will take a significant amount of time. They will have very tight outage windows, be very large, have weird maintenance windows, and/or have other issues that impact the move. Physical systems often fall into this bucket.
- The ‘exception’ bucket – Hopefully, there are only two to four items in this bucket. These are the problem children, and will require a significant amount of your time to address.
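As a rough illustration of the buckets above, here is a sketch of a first-pass sorting rule. The rules, the `window` labels, and the function name are all assumptions made for this example. In a real project the bucket is a judgment call made with the application owners, and treating every physical system as restricted is a simplification.

```python
from enum import Enum


class Bucket(Enum):
    DO_NOT_MOVE = "do not move"
    EASY = "easy"
    MILD = "mild restriction"
    RESTRICTED = "restricted"
    EXCEPTION = "exception"


def classify(in_use, window, is_physical=False, needs_exception=False):
    """Assign a first-pass bucket.

    `window` is one of 'any', 'nights', 'weekends', or 'tight'
    (labels invented for this sketch).
    """
    if not in_use:
        return Bucket.DO_NOT_MOVE      # forgotten or retired systems
    if needs_exception:
        return Bucket.EXCEPTION        # the problem children
    if is_physical or window == "tight":
        return Bucket.RESTRICTED       # simplification: physical implies restricted
    if window in ("nights", "weekends"):
        return Bucket.MILD
    return Bucket.EASY


print(classify(in_use=True, window="any"))
print(classify(in_use=True, window="weekends"))
print(classify(in_use=False, window="any"))
```

Expect to reclassify systems as you learn more. The first pass only needs to be good enough to start scheduling.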
No project gets started without a deadline; otherwise, nothing would ever get done. Any good project manager will tell you that you need to gather all relevant information, organize, and then plan for the project. From there, the project end date can be determined. This is nice in theory, but in practice it is a fantasy. Any good project manager will also tell you this rarely happens.
In the case of the datacenter move, we had a “we are turning the power off” date, and therefore we had no choice but to be done by that date. To accomplish this, we employed a project management approach that does not involve living in a fantasy land. We took our date, subtracted a few weeks (something was going to go wrong), and then started calculating what it was going to take to get there. It involved very simple math: 525 VMs divided by the number of weeks we had.
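To make the arithmetic concrete, here is a small sketch. The 525 VMs come from the project. The 26 weeks (roughly six months) and the three-week buffer are assumed numbers picked for illustration, not the actual figures we used.

```python
import math


def weekly_move_target(total_vms, weeks_until_deadline, buffer_weeks):
    """VMs that must move each week to finish before the buffer starts."""
    usable_weeks = weeks_until_deadline - buffer_weeks
    if usable_weeks <= 0:
        raise ValueError("the buffer uses up the whole schedule")
    # Round up: a fraction of a VM still has to move in some week.
    return math.ceil(total_vms / usable_weeks)


# 525 VMs, an assumed 26 weeks to the deadline, an assumed 3 weeks held back.
target = weekly_move_target(525, 26, 3)
print(target)  # 23
```

With these assumptions the pace works out to about 23 VMs per week. Change the buffer and you can see immediately how much harder each week becomes.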
Once we knew how many VMs needed to move on a weekly basis, we dug into our list and started assigning move dates. Here you start with your easy bucket to work out the kinks in the process. As soon as you can, start tackling the restricted buckets and use the easy bucket for filler. You don’t want to get to your deadline and only have the hard ones left. You will fail.
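The ordering described above can be sketched as a simple loop: one or more pilot weeks of easy moves, then a few hard ones each week with easy ones as filler. This is an illustration under assumed parameters (`pilot_weeks`, `hard_per_week`, and the VM names are invented), not the scheduler we used.

```python
from collections import deque


def build_schedule(hard, easy, per_week, pilot_weeks=1, hard_per_week=1):
    """Return a list of weeks, each a list of VM names to move that week."""
    if per_week < 1:
        raise ValueError("per_week must be at least 1")
    hard_q, easy_q = deque(hard), deque(easy)
    weeks = []
    while hard_q or easy_q:
        week = []
        # After the pilot weeks, hard moves get first claim on the week.
        if len(weeks) >= pilot_weeks:
            while hard_q and len(week) < min(hard_per_week, per_week):
                week.append(hard_q.popleft())
        # Easy moves fill whatever room is left.
        while easy_q and len(week) < per_week:
            week.append(easy_q.popleft())
        # Once the easy bucket is empty, remaining hard moves fill the week.
        while not easy_q and hard_q and len(week) < per_week:
            week.append(hard_q.popleft())
        weeks.append(week)
    return weeks


plan = build_schedule(
    hard=["h1", "h2", "h3"],
    easy=["e1", "e2", "e3", "e4", "e5", "e6"],
    per_week=3,
)
for number, week in enumerate(plan, start=1):
    print(number, week)
```

In this toy run the first week is all easy moves, and the hard ones are spread across the later weeks rather than piling up at the end.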
Long Term Planning
When you are working a project that will take several months, and involves a lot of moving parts, there are several other aspects that need to be addressed. For example, will the Green Bay Packers make the playoffs this year? Hey, do not laugh, this is a real question. There are many environmental factors that you will need to take into consideration, and a sports team making the playoffs may be one. For example, if your lead technical resource is a huge Packers fan, and doesn’t care about the project timeline, he is watching the game instead of performing the migration that is so critical to your project.
When dealing with long term projects, there are all sorts of activities that you will need to plan around that are not normally taken into consideration. Things like vacations, training and conferences, hurricane/tornado season, and business restrictions (for example holds on projects during the holidays for retail businesses) will all come into play. So, be sure to include this information as a part of the data you collect.
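One way to keep these restrictions visible is to strike them from the list of candidate move dates before any VM is assigned. The sketch below assumes moves happen on Saturdays. The dates and blackout windows are invented examples.

```python
from datetime import date, timedelta


def available_saturdays(start, end, blackouts):
    """List Saturdays from start to end (inclusive) outside every blackout.

    `blackouts` is a list of (first_day, last_day) date pairs.
    """
    # date.weekday() returns 5 for Saturday; step to the first one on/after start.
    day = start + timedelta(days=(5 - start.weekday()) % 7)
    result = []
    while day <= end:
        if not any(first <= day <= last for first, last in blackouts):
            result.append(day)
        day += timedelta(days=7)
    return result


# Hypothetical blackout windows: a retail holiday freeze and year-end vacations.
blackouts = [
    (date(2024, 11, 22), date(2024, 12, 1)),
    (date(2024, 12, 20), date(2024, 12, 31)),
]
open_dates = available_saturdays(date(2024, 11, 1), date(2024, 12, 31), blackouts)
for d in open_dates:
    print(d.isoformat())
```

In this example nine Saturdays shrink to five, which changes the weekly pace considerably. It is much better to find that out during planning than in week ten.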
So, all of the above comes from real world experiences managing large projects. Here are a few reasons why these things are important.
What About the Books?
One of the first lessons I ever learned is that the people part of the engagement is very important. Several years back, I thought I was so cool, as I had just been given one of my first large projects. We were migrating the databases for 14 rather tightly coupled applications, which had all sorts of dependencies, restrictions, and upgrades to manage along the way. I collected all of the data I thought I needed for the project, and then locked myself into a conference room for two days with lots of white boards. By the end of those two days, I had an aggressive schedule, and dang it we were going to get this pounded out in record time.
Then reality hit when we presented the client with the schedule. Sure, we could move things, but I would have been overbooking resources (some of them 3-5 times over). There literally would have been no one to test, help with application configuration, or to validate that the migration actually worked. I never asked who would be responsible for such tasks.
I took my own advice, did not panic, and simply reworked the schedule to accommodate realistic expectations for everyone who was not me. We still had an aggressive timeline, and with the exception of one app, things went very well. It may not have been completed in record time, but it was still fast and the client was happy.
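A quick check like the one below would have caught my overbooking before the client did. It is only a sketch: the (date, person) pairs, the names, and the capacity of one migration per person per date are all assumptions for illustration.

```python
from collections import Counter


def overbooked(assignments, capacity=1):
    """Return {(move_date, person): count} for anyone booked beyond capacity.

    `assignments` holds one (move_date, person) pair per migration that
    needs that person to test or validate on that date.
    """
    counts = Counter(assignments)
    return {key: n for key, n in counts.items() if n > capacity}


# Hypothetical schedule: three migrations need alice on the same day.
assignments = [
    ("2024-03-02", "alice"),
    ("2024-03-02", "alice"),
    ("2024-03-02", "alice"),
    ("2024-03-02", "bob"),
    ("2024-03-09", "alice"),
]
conflicts = overbooked(assignments)
print(conflicts)
```

Anything that shows up in the result is a conversation to have with that person before the schedule goes to the client.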
Just Because You Can Do Something, Doesn’t Mean You Should
For the project referenced in this article, I did have one system administrator apologize to me several times over one VM. This one very much falls under the category of “just because you can do something doesn’t mean you should.” He had created a Windows VM with a single 50 TB VMDK attached. He had been working on a project, and the software vendor demanded that it be created as such, and yes technically VMware and the storage available at the time supported it. So, it was created.
Then came the time to do something with this VM. If I did not know better, I would swear that this VM was sentient and absolutely did not want to move. It literally fought us every step of the way. Nearly everything we tried to move this VM failed. We had storage issues, network issues, OS issues, VMware issues, you name it. Most of the issues were related to the 50 TB VMDK, and the fact that even though it was possible to create it, it was completely unmovable.
The point is not how we solved the issue. Rather, this serves as a cautionary tale: do not paint yourself into a corner. There are always options, and had the administrator pushed back before creating this 50 TB VMDK, our lives would have been so much better at migration time.
One Step at a Time
In short, the business or technical management teams are not trying to make your life difficult by asking for crazy things. For the most part, these requests are completely workable. Don’t panic. Just take them on one step at a time and work the problem. Realize that the plan will change, likely multiple times, and don’t quit. The warm fuzzy feelings of accomplishment at the end are usually well worth it.