[personal profile] theweaselking
So.

The legendarily erratic Fedora Core 4 machine is crashing regularly, at least once a week. When it crashes, services stop responding one after another rather than all at once. It takes about 5 minutes from the first problems to the point where all services are down. You can get a login prompt, but entering your password doesn't work. If you're already logged in, you can keep running your existing program and navigate the shell with cd and the like, but as soon as you try to run any program at all, even ls, your session hangs and you're stuck.

I can force it to crash by having it write to tape - around the 60GB mark, it goes down. It never completes writing to tape.
I can force it to crash by burning DVDs - it usually gets one done, sometimes two, sometimes it dies before starting the first one, but burning DVDs (with Nautilus) crashes the machine. Same symptoms.
It will crash on its own over the course of a week, even when the tape drive, SCSI card, and DVD burner are physically disconnected from the machine.

I've watched it crash through top and Webmin and df. It's not running out of memory. It's not running out of swap space or space on any partition. I've got a script running every minute to write the results of lsof into a file that archives itself when the machine reboots - it's not producing an exceptional number of entries there when it crashes.
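For what it's worth, here is a minimal sketch of the same kind of once-a-minute logging done from a single long-running process, assuming Python is available on the box. Since existing programs keep running while new ones hang, a process that is already loaded might keep recording a little longer than a job that has to be started fresh every minute, though that is a guess, not something I've verified on this machine. The function names, the chosen fields, and the log path are all hypothetical; only the /proc/meminfo format is real.

```python
import os
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integers (kB for most keys)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def snapshot_line(info, now=None):
    """Format one timestamped log line with free memory and free swap."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    parts = ["%s=%s" % (key, info.get(key, "?")) for key in ("MemFree", "SwapFree")]
    return stamp + " " + " ".join(parts)


def append_durably(path, line):
    """Append a line and fsync it, so the record is pushed to disk before a hang."""
    with open(path, "a") as f:
        f.write(line + "\n")
        f.flush()
        os.fsync(f.fileno())


def watch(path, interval=60):
    """Log a snapshot every `interval` seconds until the process is killed."""
    while True:
        with open("/proc/meminfo") as f:
            append_durably(path, snapshot_line(parse_meminfo(f.read())))
        time.sleep(interval)

# Example (hypothetical log path): watch("/var/tmp/crashwatch.log")
```

The fsync is the point of the exercise: after a hard crash, the last line in the file shows how far the machine got.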

/proc/sys/fs/file-max is over two hundred thousand.
/proc/sys/fs/file-nr never goes above a few thousand.
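To make that comparison concrete, here is a small sketch that turns the contents of /proc/sys/fs/file-nr into a usage fraction. The file holds three numbers: allocated file handles, allocated-but-unused handles, and the maximum (the same value as file-max). The function names are made up for illustration, and the sample figures in the comment are placeholders in the range described above, not readings from the machine.

```python
def parse_file_nr(text):
    """Split the contents of /proc/sys/fs/file-nr into its three integer fields."""
    allocated, unused, maximum = (int(field) for field in text.split())
    return allocated, unused, maximum


def handle_usage(text):
    """Return the fraction of the system-wide file handle limit currently in use."""
    allocated, unused, maximum = parse_file_nr(text)
    return (allocated - unused) / maximum


def current_handle_usage():
    """Read the live value; only works on a Linux system with /proc mounted."""
    with open("/proc/sys/fs/file-nr") as f:
        return handle_usage(f.read())

# Placeholder figures: handle_usage("3456\t0\t205000\n") is under 2% of the limit.
```

A few thousand handles against a limit of two hundred thousand is a couple of percent, so file handle exhaustion looks unlikely.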

(The RAID partition for /home is always unclean after a crash and reboot. This is not really remarkable, but might be worth mentioning.)


Assuming, for a moment, that it's not physically broken hardware causing this, what:
A) could it be?
B) can I do to nail down what it could be?

Assuming we allow for physically broken hardware, what:
A) could it be?
B) can I do to nail down what it could be?

(My current plan: Install Debian Stable on another HDD, install it in the machine, boot from that, duplicate the setup, and try to force it to crash there.)

(no subject)

Date: 2007-04-30 02:30 pm (UTC)
From: [identity profile] mhoye.livejournal.com
Watch it crash while you're tailing log files ( tail -f )
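If starting several tail processes is awkward, a rough Python equivalent that follows a set of files from one process might look like the sketch below. The function names are hypothetical, and the paths in the usage comment are only the usual Fedora locations, assumed here, not a list taken from this machine.

```python
import os
import time


def read_new_lines(f):
    """Return the complete lines appended to f since the last call."""
    lines = []
    while True:
        pos = f.tell()
        line = f.readline()
        if not line.endswith("\n"):
            f.seek(pos)  # nothing new, or a partial line: retry it next time
            break
        lines.append(line.rstrip("\n"))
    return lines


def follow_many(paths, interval=1.0):
    """Print lines as they are appended to any of the given files, like tail -f."""
    handles = {}
    for path in paths:
        f = open(path, errors="replace")
        f.seek(0, os.SEEK_END)  # start at the end, skipping existing content
        handles[path] = f
    while True:
        for path, f in handles.items():
            for line in read_new_lines(f):
                print("%s: %s" % (path, line), flush=True)
        time.sleep(interval)

# Example (assumed paths): follow_many(["/var/log/messages", "/var/log/secure"])
```

Running it over an ssh session from another machine means the last lines before the hang stay on a screen that isn't going down with it.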

(no subject)

Date: 2007-04-30 02:39 pm (UTC)
From: [identity profile] theweaselking.livejournal.com
Which log files? There's nothing showing in any of the logs at the time of the crash to indicate a problem.

(no subject)

Date: 2007-04-30 03:29 pm (UTC)
From: [identity profile] elffin.livejournal.com
Statistically speaking, if this device is a production device and has been more-or-less stable up to this point (which I really can't speak to, since you say it is legendarily erratic), I call hardware failure.

Boot Debian stable, or find a build of Knoppix that works on that hardware, and force it to crash. If it still goes down, that will confirm hardware failure, especially if you're booting from Knoppix and force it to load its filesystems to a RAM disk with no hard drive access.

In my experience, that kind of behaviour is caused by faulty RAM, a faulty RAM controller, or a fault somewhere in the swap hardware chain (RAM, RAM controller, the RAM the kernel is sitting in, storage controller, storage device). My experience is primarily with Windows, not *nix, so take that with a grain of salt.
You could also try moving your swap partition to another hardware device and/or recreating it and/or running without it. If it then runs for a week or more with much improved stability, you had a corrupted swap partition. If it merely takes /longer/ to die, it's almost certainly RAM.

(no subject)

Date: 2007-04-30 06:23 pm (UTC)
From: [identity profile] theweaselking.livejournal.com
Other people suggested heat - I knew the machine wasn't overheating, but I checked anyway, and the air was cool, the fans running, the CPU perfectly within acceptable parameters.... and the memory control chip's heat sink was at 70 degrees. As in, burning hot to the touch.

I think that's my problem.

(no subject)

Date: 2007-04-30 08:00 pm (UTC)
From: [identity profile] elffin.livejournal.com
I look forward to seeing if remediation helps the situation.
