Sunday, November 11, 2007

Hard Disk Backup

As an IT professional, I still find that some things get my goat. My department is charged with maintaining quite a number of servers. However, some aspects of these machines are maintained by other departments, such as database and hard disk (HDD) backups.

This past week a server's RAID controller died. At first glance this wasn't going to be a big deal. We have backups of the C (OS), D (application) and P (page file) drives that are written to tape on a daily basis. The database for this application backs itself up every day to the D drive, and the HDD backup then backs up everything present. This sounds logical. These backups are also taken hot. Therefore, unlike Ghost backups, they are not bootable.

Well, the server is still under a maintenance plan and IBM was tasked with restoring the server to operation. However, what wasn't brought to our attention was that with the motherboard and RAID controller replacement the entire system would have to be wiped and rebuilt from scratch. I understand that working with SCSI equipment can get complicated. However, the one thing I don't understand is why, if the parts were swapped one-for-one in make, model, and version, we would have to start over with this machine.

Anyway, the server operations team installed a new OS on the machine. It was previously Windows 2000 and they upgraded it to Windows 2003. So far, so good. Well, since the OS was wiped, all of the applications would need to be reinstalled. Once again, we were not stressing over this. However, we were later informed that the databases that were supposed to be in the backup folder on the HDD were not present. WTF? What good is it to back up all of this data and then not have the information required to restore from it?

Furthermore, why back up drives like C and P if they are throwaways in the event of a catastrophic failure? Why back up only portions of an HDD? The backup of the drive in question (the D drive for the application) was in the neighborhood of 40 GB. How do you have a 40 GB backup and not get the databases that the application requires? I don't understand.
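A gap like this can be caught before the restore ever matters by checking, after each nightly run, that the database backup actually landed in the folder the tape job picks up. Here is a minimal sketch of such a check. The `.bak` extension, the 24-hour window, and the function names are my own assumptions for illustration, not details of the system described above.

```python
from datetime import datetime, timedelta
from pathlib import Path


def find_recent_backups(backup_dir, patterns=("*.bak",), max_age_hours=24, now=None):
    """Return non-empty files in backup_dir that match one of the patterns
    and were modified within the last max_age_hours."""
    now = now or datetime.now()
    cutoff = now - timedelta(hours=max_age_hours)
    folder = Path(backup_dir)
    if not folder.is_dir():
        # A missing folder means there is nothing to restore from.
        return []
    recent = []
    for pattern in patterns:
        for path in folder.glob(pattern):
            if not path.is_file():
                continue
            info = path.stat()
            modified = datetime.fromtimestamp(info.st_mtime)
            # Skip zero-byte files: they exist but cannot be restored.
            if info.st_size > 0 and modified >= cutoff:
                recent.append(path)
    return sorted(recent)


def backup_is_restorable(backup_dir, **kwargs):
    """True if at least one recent, non-empty backup file is present."""
    return len(find_recent_backups(backup_dir, **kwargs)) > 0
```

Run against the backup folder right after the nightly database job, a `False` result is the signal to alert someone before the tape backup captures a folder with no databases in it.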

Now for the kicker. There is a known DR flaw in this system. This system is charged with distributing software to ATMs and collecting journals from them. The permanent storage of this information does not reside on this server, so we're not missing much. However, the ATMs in this system normally do not talk to the server unless spoken to. The only way to force an ATM to talk to the server is to restart the service. Well, the way ATMs are designed, you can't get to the desktop to bounce the service when needed. The only other way to accomplish this is to reboot the machine. Therefore, a technician must be dispatched to reboot each and every ATM on the platform before we can reacquire the ATMs on this server. What a waste of money. If we had bridged our DR gap or had the application database, this would not have been an issue.

In hindsight, it would have been really simple to deploy a scheduled task to force the ATMs to talk to the server once a day no matter what. However, we hadn't done that yet. It just got bumped up in priority. I predict that this will be bridged in no less than 2 weeks.
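For what it's worth, the scheduled task could be as small as a script that bounces the service. The sketch below is only an illustration: the service name `AtmAgent` is hypothetical, and I'm assuming the ATM runs Windows with the standard `net stop` and `net start` commands available and a Python interpreter to run it.

```python
import subprocess

# Hypothetical service name; the real one depends on the ATM software.
SERVICE_NAME = "AtmAgent"


def restart_commands(service_name):
    """Build the Windows commands that stop and then start a service."""
    return [["net", "stop", service_name], ["net", "start", service_name]]


def restart_service(service_name, run=subprocess.run):
    """Stop the service, then start it.

    The result of the stop is ignored because the service may already be
    stopped. Returns True only if the start command exits with code 0.
    """
    stop_command, start_command = restart_commands(service_name)
    run(stop_command)
    result = run(start_command)
    return result.returncode == 0


# Intended use from a daily scheduled job:
#     restart_service(SERVICE_NAME)
```

Registered as a daily job in the Windows task scheduler on each ATM, this would force the check-in without anyone needing access to the desktop. The `run` parameter exists only so the logic can be exercised without touching a real service.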
