Wednesday, October 15, 2008

Death of an Xserve...or why you need more than one server...

I get this question all the time: "You spec'd four servers for our Xsan system? Two MDCs and two ODCs? Do I really need that much infrastructure for a few people?!?"

The answer is yes. Absolutely yes. Definitively yes.

The last week has proven this point very painfully. I've never seen this type of behavior before in an Xserve, but it makes the case for multiple Xserves extraordinarily clear.

Editorial crew (three people) came in Tuesday to find the machines unable to connect to their single Xserve (which acts as Xsan MDC and ODC). They were running slowly Monday; however, I suspect one of their machines took over as MDC and the Xserve had actually crashed sometime over the weekend.

They hooked up a monitor to the Xserve to find the "prohibited symbol" against a grey background. I drove over to check it out myself. When I arrived, the monitor had the grey screen, Apple logo, and the circling status icon. It was rotating around like it was trying to load.

After a few minutes, it was apparent it would not boot. I shut it down, and restarted it in Target mode (which launched successfully and allowed me to mount Server HD on my laptop). DiskWarrior showed some minor things to clean up, which I allowed it to do. The folder structure of the Server HD looked fine. I disconnected and attempted to start it up.

No dice. Xserve got to the grey screen, Apple logo, and the circling status icon--but nothing else.

Thinking it might be a hardware problem with the hard drive, I attempted to boot off the Restore CD to see if Disk Utility had any thoughts.

The Xserve would not boot off the CD either: neither the 10.4.10 restore disc nor the latest Mac OS X 10.5 Leopard Server disc worked.

Now I am thinking it's strictly a hardware problem, not a software issue.

Praying hard, I loaded Xserve Diagnostics and pressed "D" which DID allow it to boot into diagnostics mode. I thought I was right around the corner from discovering which hardware was the issue...except that all tests passed. Nuts...

As a last resort I had Carl run over with a brand new set of RAM for the Xserve and swapped out the old. No help.

Everything was disconnected--including the RAIDs and the ethernet connections. The only thing left in the machine besides the RAM was the fibre card. I even yanked the fibre card to see if it would boot--but no luck.

I left Tuesday at 2:45 PM and set the diagnostics to run for 15 hours to see if it would pick up an intermittent error.

At 9 AM on Wednesday, I arrived to find that Diagnostics had hung at the 10-hour mark. It froze while checking one of the CPUs.

Now realize this...the client's budget did not allow for even a parts kit. At this point, I had no choice but to call AppleCare and wait four hours for an engineer and logic board to arrive. The AppleCare call was fairly straightforward since I had already eliminated nearly everything besides the CPUs and logic board.

Engineer arrived around 1 PM on Wed. and swapped the logic board.

Did...not...work...

Now frustrated, the engineer called someone high up the Tier level at Apple and requested a bunch of parts for next day delivery. RAID card, fibre card, logic board, CPUs, etc...

As of this writing, their system is STILL down. So, while there is no resolution for the client yet, there IS a moral to this story.

1. Servers will die.
2. The best defense against server death is a PRE-DEPLOYED backup to act as failover.
3. You need backups of your MDC and ODC machines. Yes--four servers.
4. Because MDC and ODC duties were hosted on the same machine with network home folders, the client couldn't even log into his own machine to do any local work. Having an ODC (and backup) separate from the MDC would have allowed them to work even if the SAN was down.
5. Parts kits DO NOT ALWAYS mean you are back in action. We replaced the logic board and the server was still down.

The Cost?

A creative director and two editors at a total of $84 per hour (includes fringe, etc...) times 24 hours (three eight-hour days of lost time) equals a straight loss of over $2000--not to mention the grief and the possible sales the CEO will lose from his creative team being down for three days.
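For anyone who wants to plug in their own numbers, the arithmetic above can be sketched in a few lines of Python. The function name and parameters are made up for illustration; the $84 combined hourly rate, the eight-hour day, and the three days of downtime are the figures from this outage. It only counts direct labor, not lost sales or grief.

```python
def downtime_cost(team_rate_per_hour, hours_per_day, days_down):
    """Direct labor cost of an outage for the whole team (hypothetical helper)."""
    return team_rate_per_hour * hours_per_day * days_down


# Figures from this outage: $84/hour combined for the creative director
# and two editors, eight-hour days, three days down.
loss = downtime_cost(team_rate_per_hour=84, hours_per_day=8, days_down=3)
print(f"Direct labor loss: ${loss}")  # prints "Direct labor loss: $2016"
```

Compare that result against the price of one more basic Xserve and the case for a second server is easy to argue.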

At the very least, one more server in this workflow would have already paid for itself. These don't have to be huge render-class servers--just basic Xserves.

One funny note--the engineer works contracts for Apple and Dell. When asked, he said most of his dispatch repair work is on the Dell machines. He says he puts hands on an Xserve about twice per year for actual repairs.

Update: Turns out it was the logic board AND a RAID card that went bad. The Xserve went right back up serving out Xsan as soon as these were replaced.
