When we last left off we were exploring the world of decentralization. Or were we? In an alternate universe somewhere far, far away, but not too far, decentralization is actually preferred. Let's take an adventure and find out what the benefits might be.
Standing up a new DPM server can be just that easy. Get the OS installed, install DPM, plug in a USB drive and you're done. You're free to start backing up clients with no repercussions from a global policy or standard. You can configure backups however best suits the clients they're protecting.
Is there more? Sure, there might be more, but in the end you'll always find yourself asking, can I do more? And you can, with centralization! No, it's not a new product, but rather a continuation of our topic from the first segment. So let's recap…
We talked about reporting, monitoring, and agnostic operations, and how those points play a role in centralized operation. We need a single report that shows us all the data we might need in one view, and a monitoring tool to show those same details in real time. We also discussed agnostic operations and how we need a master and slave setup to mitigate possible outages and balance loads.
So where do we go from here? Like any good mystery we need a segue to carry us to part two. We left off with agnostic operations, so let's continue on to automation. You might be thinking, why do I need to automate when I could just hire more minions or engineers? While that is very true, and the price of minions is going down given our economic situation, that is a whole different topic we won't go into. So let's keep within scope and review, from a technical perspective, why more engineers is not better. Note that the interchangeability of the word minions for engineers is purely coincidental.
Automation
Let's start off with a very important fact: DPM certainly does have a great deal of automation built in. From job management and remediation to auto-growth and auto-discovery, it's all there, so what's missing? Remember, we're talking about centralization and executing the same task against many servers.
There are many cases where something happens on the backend that stops backups across several DPM servers. Or maybe backups fail for a group of protected servers due to a known outage. It would be nice to have a built-in automation tool to select a group of protected clients across multiple DPM servers and say do x, y, and z. That might be a simple action such as "run consistency check" or something more complex like a PowerShell script with triaging steps.
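Today, the closest you can get is rolling your own loop in the DPM Management Shell. Below is a minimal sketch of the "run a consistency check for this group everywhere" chore; the server list and protection group name are placeholders, and the cmdlet and property names (Get-DPMProtectionGroup, Get-DPMDatasource, Start-DPMDatasourceConsistencyCheck, FriendlyName) should be verified against your DPM version.

```powershell
# Rough sketch only: kick off a consistency check for every datasource in a
# named protection group, across a hand-maintained list of DPM servers.
# Server names and the group name below are placeholders.

$dpmServers  = @('DPM01', 'DPM02', 'DPM03')   # the list you maintain by hand
$targetGroup = 'FileServers'                  # hypothetical protection group name

foreach ($server in $dpmServers) {
    # Find the protection group we care about on this DPM server
    $groups = Get-DPMProtectionGroup -DPMServerName $server |
              Where-Object { $_.FriendlyName -eq $targetGroup }

    foreach ($group in $groups) {
        foreach ($ds in (Get-DPMDatasource -ProtectionGroup $group)) {
            Write-Host "Consistency check: $($ds.Name) on $server"
            Start-DPMDatasourceConsistencyCheck -Datasource $ds
        }
    }
}
```

That works, but notice who is doing the centralizing: you, with a hard-coded server list.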
There are many cases where you might need to run a simple task across many DPM servers or protected clients. But in the end, there's no built-in way to do that, due to the lack of centralization. But this is not where this road ends…
Client Recovery/Configuration
I originally referred to this section as "End User Recovery/Configuration." With more thought, however, this problem is much worse than just end user setup. Regardless of whether you have four or forty DPM servers, when you have to set up client protection you always ask the same question: "on which DPM server should I set up protection?" Even if you go as far as dedicating DPM servers to specific sets of servers, you still have to log in to that individual host, configure the backup, and make sure the settings match the other hosts. In a master/slave setup you would be able to log in to one central master, set up the client, and be done.
The same thing applies to recovery scenarios. In a master/slave setup you could go to the master and start the restore process from there. Which slave the data lives on, or which one does the work, doesn't matter, because, for lack of a better word, the master would be driving the slaves.
That sounds backwards, doesn't it? In history class we learned that the slaves drove the master around. Well, that's apparently the model DPM was designed on, as DPM is the slave and it drives you, the admin/master, around! So when that restore request comes in, you find yourself checking all the DPM servers, or the client itself, to figure out which DPM host it's attached to. The latter is a good trick, but it doesn't always work if you've accidentally attached the client to multiple DPM servers and forgot to clean up afterwards. Now you're eating into your RTO (recovery time objective) trying to figure out where to pull the backup from.
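In the meantime, the workaround is the brute-force search: ask every DPM server whether it knows about the client. Here is a minimal sketch, assuming the DPM Management Shell is loaded; the server list and client name are placeholders, and the Computer property on the datasource object is an assumption worth checking on your DPM version.

```powershell
# Rough sketch: brute-force "which DPM server is this client on?" because
# there's no central catalog to ask. Names below are placeholders.

$dpmServers = @('DPM01', 'DPM02', 'DPM03', 'DPM04')
$clientName = 'FILESERVER07'   # the box the restore request is for

foreach ($server in $dpmServers) {
    $found = Get-DPMDatasource -DPMServerName $server |
             Where-Object { $_.Computer -like "*$clientName*" }

    foreach ($ds in $found) {
        # If the client was accidentally attached to more than one DPM server,
        # it shows up more than once here -- exactly the mess described above.
        [pscustomobject]@{
            DPMServer  = $server
            Computer   = $ds.Computer
            Datasource = $ds.Name
        }
    }
}
```

Every minute that loop spends crawling servers is a minute off your RTO.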
Another point we'll briefly touch on is end user recovery. The DPM end user recovery tool requires the user to know which DPM server they are configured on. Since there's no centralization to balance the load, what happens if you have to shuffle things around? How do you let the end users know which DPM server they are backed up to, let alone have confidence that they'll remember the new DPM server name? Take this approach and you can guarantee yourself a call at two in the morning with the question "uh, what DPM or something server am I on again?" And good luck telling Joe, "hang on, let me figure that out," as you log in to the x number of servers handling client backups.
That brings us to today’s final segment of this miniseries that’s quickly turning into a drama:
Load Balancing
So here's a simple question: what do you do when you've reached the supported limits of DPM on a particular DPM server? You stand up a new one and migrate those data sources, while still retaining the old recovery points on disk on the old host for your full retention period. So now you not only have to manage the new DPM server and balance out that host, but you also have to manage this rolling window and mark your calendar for the date the backups expire on the old host. That's not where this drama ends, though.
Let's take a common situation where you have a DPM server set up and it's working like a charm. You're protecting twenty clients on there with fifty terabytes of backups, and you're under the supported limits for DPM. One day, two of the hosts have a burst of utilization and you no longer have enough storage allocated in the storage pool on that host, so backups fail. All the while, your number two DPM server is sitting pretty with oodles of free space in its storage pool. In a master/slave setup, the master would say "Hey, you're not using that free space, I'm taking it back for x datasource on another DPM server." It would automatically reallocate the needed space before the primary host filled up, and backups would succeed.
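Since nothing will rebalance that space for you, the best you can do today is report on the imbalance and shuffle disks or datasources by hand. A minimal sketch of that report follows, assuming the classic disk-based storage pool and the Get-DPMDisk cmdlet; the TotalCapacity and UnallocatedSpace property names are assumptions to verify against your DPM version.

```powershell
# Rough sketch: a poor man's "who has free space?" report across DPM servers.
# Property names on the disk objects are assumptions; check your DPM version.

$dpmServers = @('DPM01', 'DPM02')

foreach ($server in $dpmServers) {
    $disks = Get-DPMDisk -DPMServerName $server

    # Sum capacity and unallocated space across the storage pool disks
    $totalGB = ($disks | Measure-Object -Property TotalCapacity    -Sum).Sum / 1GB
    $freeGB  = ($disks | Measure-Object -Property UnallocatedSpace -Sum).Sum / 1GB

    [pscustomobject]@{
        DPMServer   = $server
        TotalGB     = [math]::Round($totalGB)
        FreeGB      = [math]::Round($freeGB)
        PercentFree = [math]::Round(($freeGB / $totalGB) * 100, 1)
    }
}
```

A report tells you about the problem; it still doesn't move a single gigabyte for you.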
A similar situation would be where you've allocated two DPM servers for all Exchange backups and another two DPM servers for all SQL backups. Let's say the DPM servers handling the SQL backups can no longer keep up with the number of datasources, or the DBA team adds another ten SQL servers to the environment. What do you do? In a master/slave setup that wouldn't matter, as you would simply add another slave to pull the load. Pardon the pun. But not with DPM! As mentioned before, you would have to add a new DPM server to the mix, and then manage those backups as a completely separate set while balancing the load against the existing set of backups.
And in a final attempt, let's look at balancing the backend storage. In a larger deployment you'll likely be using a SAN as your backup target. But with no load balancing, manual or automatic, you will likely find yourself hitting the SAN all at once, on the hour, from every DPM server and on multiple fabric paths. You might think that this is not a big deal since the target LUNs on the storage arrays are load balanced in a RAID 5 or 6 group. That's a valid point, however, consider two things. One, you are slamming data through the controllers, possibly at higher rates than they can handle, causing packet loss and retransmit requests. Two, regardless of the RAID setup, you will cause all the spindles, or disks, in that array to spin up to full speed at once. This is not only a huge power drain, but also a strain on the entire array in one blast, as opposed to being balanced out over a span of time.
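The usual band-aid is to stagger the schedules yourself so every DPM server isn't hammering the SAN at the top of the hour. Here's a trivial sketch of carving the hour into per-server offsets; actually applying those offsets to the protection group schedules is left to the DPM schedule cmdlets and your change window, and the server names are placeholders.

```powershell
# Rough sketch: spread the DPM servers across the hour so their jobs don't all
# slam the SAN at once. Server names are placeholders.

$dpmServers    = @('DPM01', 'DPM02', 'DPM03', 'DPM04', 'DPM05', 'DPM06')
$windowMinutes = 60

for ($i = 0; $i -lt $dpmServers.Count; $i++) {
    # Evenly space each server's start time within the hourly window
    $offset = [int][math]::Floor(($windowMinutes / $dpmServers.Count) * $i)
    '{0} -> start jobs at :{1:d2} past the hour' -f $dpmServers[$i], $offset
}
```

Of course, that spreadsheet-style staggering is exactly the kind of bookkeeping a central master should be doing for you.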
Summary
You've now completed two parts of this miniseries that's quickly turning into a drama slash horror flick. But like all those cheesy movies out there, you just can't take your eyes off the screen and are hungry for more. Well, fear not, as part three is coming, with even more luster and despair!
Maybe I'm being overdramatic, but let's take a peek at what's coming up (recapped from part one):
- Tape Management and Long Term Retention
Anyone still do tape out there? VTLs? How do you maintain a searchable catalog of long term backups for x number of DPM servers for x years?
- Workflow
You have to repeat the same tasks on multiple DPM servers. Human error is going to happen. Due to boredom or fat fingers, it's inevitable.
- Data Retention Periods and Legal Requirements
How would you quickly validate to the legal department that X data is only retained for X number of days, while at the same time ensuring that another datasource only exists in one place on the backup infrastructure, since federal regulations define how many times a particular dataset can exist?
So you might find yourself asking, "What's the cliffhanger this time?" Well, there isn't one! Like those cheesy flicks, you're going to come back because you want to see just how badly this train can derail. How much worse can this possibly get?
There’s light at the end of the tunnel, I promise. Just hold on tight and we’ll make it through, together!