theweaselking

From:

theweaselking.livejournal.com

Of the machines that failed, 4 were powered off the entire time, one was a dumb-as-hell 100M workgroup switch, and one was remotely powered on by an overeager admin[1] when power and internet came back up 1.5 hours into the scheduled 3-hour outage. He didn't think to question *WHY* power was on so early or whether work was actually complete, which meant that machine was *on* when the power went out again. The UPS should have helped, but still, 2 hours later everything came back up and that server had lost a power supply.

(So that machine wasn't actually "dead", since only 50% of its power failed and it can run on 50%. On the other hand, a cleanly shut down Xen hypervisor is MUCH easier to bring back up than one that had the power yanked again after booting.)

Essentially, what should have been a 45-minute Saturday turned into *hours* due to human error, and on top of that there were a really wacky number of unexpected hardware failures in the more elderly kit.

[1]: Who was not me, for the record.