One of DPM’s biggest problems when you try to scale the product is the lack of centralization. You might think it’s not really necessary, but read on to see some of the pitfalls that come with a non-centralized product. And yes, SCOM with the DPM 2012 management pack is an attempt, but a fair attempt at best.
So why is centralization so important? Well, for one, everyone else in the enterprise is doing it! So does that mean DPM should do it too? Unlike jumping off a cliff because others did it, centralization is genuinely necessary when it comes to backups. So let’s start by exploring some basics…
Reporting

Reporting is available only on a per-server basis, where each server must be individually managed, configured, and analyzed. There is no way to correlate real-time or historical data between servers. How would you know if a network-wide outage caused backups to fail? Or if a planned application outage caused backup failures spread across multiple DPM servers? You could find yourself chasing a failure that was never the fault of DPM at all.
Or maybe it was a fault of the DPM backend:
- Storage array going down? You would never know, since you’re looking at one server at a time and might think it’s just one LUN that went down. An hour or two later you realize it’s a whole disk group, then the entire array.
- DPM’s SQL backend restarted or rebooted? You might think it’s just one server whose in-progress jobs failed or failed to start, until you realize that everything died at the same time!
- Patch rollout? Similar to the SQL example above, but this is a slow rolling death where jobs start failing on different DPM servers as the patches get pushed out or the servers reboot. Then you come in in the morning and try to figure out what happened, since some servers didn’t reboot because their patch deployment failed.

So enough real-world examples; I’ve been there and seen most of it. So how is this fixed? Simple! A one-stop location for all reports and DPM data. This way you could see all LUNs across all DPM servers in one window and easily see if there’s a bigger problem at hand. Or if all jobs died at a similar time, you would know something major might have happened across the board.

Another major plus of centralized reporting is metrics. How can you possibly tell management what the backup success rate is at the drop of a dime? You can’t tell a manager, “Hold on, I need to log in to every server, run the built-in reports, export the data, and calculate the averages.” With centralized reporting, you can give the manager an Excel spreadsheet with linked tables to the backend data warehouse and pivot tables and say, “Here, the data is all there in pretty charts, just hit refresh anytime you want to know how awesome we’re doing!”

From there, you can turn that around and give the same datasheet to architecture and capacity planning and say, “Here’s the raw data so you can plan for growth as needed to support our business demands.” Or use that data yourself to understand where you, as the DPM admin, are doing things right or wrong. Do you have a habit of overloading DPM servers with too many datasources? Are you over-provisioning your disks to the DPM pool? Or maybe you’re doing a great job and now you have the proof in the pudding, so you pat yourself on the back and treat yourself to a romantic dinner.

So onto the next topic…

Monitoring

This one falls along the lines of reporting, so we won’t spend too much time here.
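To make the metrics point concrete, here’s a minimal sketch of what centralized data buys you. It assumes you’ve already exported per-server job results (e.g. via each server’s built-in reports or scripts) into simple rows of server, status, and completion time; the server names, statuses, and timestamps below are purely illustrative:

```python
from collections import Counter
from datetime import datetime

# Hypothetical rows pulled from each DPM server's job history:
# (DPM server, job status, completion time).
jobs = [
    ("DPM01", "Succeeded", "2013-05-01 13:00"),
    ("DPM01", "Failed",    "2013-05-01 14:00"),
    ("DPM02", "Failed",    "2013-05-01 14:00"),
    ("DPM03", "Succeeded", "2013-05-01 15:00"),
]

# Environment-wide success rate: the one number management asks for.
total = len(jobs)
succeeded = sum(1 for _, status, _ in jobs if status == "Succeeded")
success_rate = 100.0 * succeeded / total

# Failures bucketed by hour: many servers failing in the same bucket
# hints at a network/storage/SQL problem, not an issue with DPM itself.
failures_per_hour = Counter(
    datetime.strptime(ts, "%Y-%m-%d %H:%M").strftime("%Y-%m-%d %H:00")
    for _, status, ts in jobs
    if status == "Failed"
)

print(f"Success rate: {success_rate:.0f}%")
print(failures_per_hour.most_common(1))
```

With all servers feeding one dataset, the cross-server correlation (two servers failing in the same hour) falls out of a three-line grouping instead of a morning of logging in to each console.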
This can be thought of as more of a real-time reporting or alerting function. SCOM 2012 provides a central collection of alerts and states for monitoring; however, the built-in views are not practical for a large environment and will have to be heavily customized for internal administration. The focus of monitoring is still on a per-server basis, where the admin must focus on issues that may be affecting an entire server rather than on the environment as a whole. Also, there is no easy way to customize the data that you need. As mentioned before, you might need averages, historical charts, or fancy pivot tables, and you can’t do that in the SCOM console.

And if we look at the alerting component, take an example of 100 failed backups at 2PM: how would you go about checking for commonalities and quickly actioning all the related alerts? What if you have 10 DPM servers, each with 15 LUNs? That’s 150 LUNs to watch so that when one or two fail, you can pick those alerts out of the noise.

Again, not much here that we haven’t already covered. The point is that the SCOM console is a good attempt, and something that could be usable if the DPM admins have a good understanding of SCOM and know how to customize it to suit their needs. But the limitations of the collected data and its presentation must also be understood and mitigated with custom reporting tools.

Agnostic Operation
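As a sketch of the correlation the console won’t do for you, suppose (hypothetically) you’ve dumped the open alerts to rows of DPM server, error type, and the hour raised. Grouping them shows at a glance whether 100 alerts are really one root cause; the error names below are illustrative, not actual DPM alert identifiers:

```python
from collections import defaultdict

# Hypothetical alert dump: (DPM server, error type, hour raised).
alerts = [
    ("DPM01", "ReplicaInconsistent", "14:00"),
    ("DPM02", "ReplicaInconsistent", "14:00"),
    ("DPM03", "ReplicaInconsistent", "14:00"),
    ("DPM07", "RecoveryPointFailed", "09:00"),
]

# Group by (error, hour): a large group spanning many DPM servers is
# one incident to action, not dozens of independent failures.
groups = defaultdict(set)
for server, error, hour in alerts:
    groups[(error, hour)].add(server)

# Biggest group first: the most likely environment-wide event.
for (error, hour), servers in sorted(groups.items(), key=lambda kv: -len(kv[1])):
    print(f"{error} @ {hour}: {len(servers)} DPM servers -> {sorted(servers)}")
```

The same grouping idea works for LUN alerts: 150 LUNs collapse into a handful of (error, time) buckets, and the one or two real failures stand out.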
Along the lines of what we’ve previously reviewed, DPM servers are not aware of what the others are doing. Because there is no “Master Commander” or “Central Manager” to control each server, it is left up to the DPM admin to best distribute the load across the DPM servers and the storage array.

Another issue is when a DPM server is unavailable: all backups on that server fail until the host is fixed. In a clustered environment, the central manager would recognize that backups are at risk due to faulty hardware and migrate jobs to a healthy node to continue backups. Backups on the storage backend might be lost, but at least jobs can seamlessly continue to run. Then, when the hardware is fixed, the node is returned to the cluster and resource capacity is back to normal levels.

Finally, another problem is when the server is down due to Windows issues: the storage pool is inaccessible, and although the data is intact, the DPM server that holds the catalog cannot be accessed. While we can build another server and load the original catalog, this is additional time during which we cannot execute restores, possibly leaving end users waiting. Again, in a centrally managed setup, the central server would know where the data is and would utilize another DPM server to extract that data and fulfill restore requirements.

Summary
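Since there is no central manager doing this placement for you, the distribution math lands on the admin. A minimal sketch of one sensible approach, assuming hypothetical datasource sizes you’d pull from your own protection groups, is a greedy “largest datasource to the least-loaded server” pass:

```python
import heapq

# Hypothetical datasource sizes in GB; in practice you'd gather these
# from your protection groups yourself, since DPM won't do it globally.
datasources = {"SQLDB1": 800, "Exchange1": 1200, "FileShare1": 300, "SQLDB2": 500}
dpm_servers = ["DPM01", "DPM02"]

# Greedy placement: always give the next-largest datasource to the
# currently least-loaded server (a longest-processing-time heuristic).
heap = [(0, name) for name in dpm_servers]  # (current load, server)
heapq.heapify(heap)
placement = {}
for ds, size in sorted(datasources.items(), key=lambda kv: -kv[1]):
    load, server = heapq.heappop(heap)
    placement[ds] = server
    heapq.heappush(heap, (load + size, server))

print(placement)
print(sorted(heap))  # final load per server
```

This is exactly the kind of logic a “Central Manager” could run continuously, rebalancing as datasources grow; without one, it’s a spreadsheet exercise you redo by hand every time the environment changes.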
We’ve touched on a few points and some of the major downfalls of a non-centralized, non-clustered product. So we’ll end here and call this part 1 of a mini-series. Like all the great mini-series out there, we’ll end with a cliffhanger and some bait to lure you into the next installment.

The bait? Oh, there’s more, lots more where this came from! Here are some of the pain points we’re going to cover in the upcoming installments, and for the thrill of mystery, in no particular order!
- Tape Management and Long Term Retention
Anyone still do tape out there? VTLs? How do you maintain a searchable catalog of long term backups for x amount of DPM servers for x years?
It’s not automation when you have to manually log in to each server, execute the script, then concatenate the results.
- End User Recovery/Configuration
Good luck telling your end users, “We have 40 DPM servers, you figure out where your data is from this report.”
You have to repeat the same tasks on multiple DPM servers, so human error is going to happen. Due to boredom or fat fingers, it’s inevitable.
- Load Balancing
What if a datasource grows beyond expectations? The DPM admin can’t be expected to babysit every backed up datasource.
- Data Retention Periods and Legal Requirements
How would you quickly demonstrate to the legal department that X data is retained for only X days, while at the same time ensuring that another datasource exists in only one place on the backup infrastructure, since federal regulations define how many times a particular dataset can exist?

The cliffhanger? Are there any benefits to a decentralized product like DPM? Stay tuned, folks, and come back next time when the mystery is revealed…