Geek Pop Quiz.
Apr. 30th, 2007 10:06 am
So.
The legendarily erratic Fedora Core 4 machine is crashing regularly - about once a week, at least. When it crashes, services stop responding one after another rather than all at once; it takes about 5 minutes from the first problems to the point where all services are down. You can get a login prompt, but entering your password doesn't work. If you're already logged in, you can keep running your existing program and navigate the shell through cd and the like, but as soon as you try to run any new program at all, even ls, your session hangs and you're stuck.
I can force it to crash by having it write to tape - around the 60GB mark, it goes down. It never completes writing to tape.
I can force it to crash by burning DVDs - it usually gets one done, sometimes two, sometimes it dies before starting the first one, but burning DVDs (with Nautilus) crashes the machine. Same symptoms.
It will crash on its own over the course of a week, even when the tape drive, SCSI card, and DVD burner are physically disconnected from the machine.
I've watched it crash through top and Webmin and df. It's not running out of memory. It's not running out of swap space or space on any partition. I've got a script running every minute to write the results of lsof into a file that archives itself when the machine reboots - it's not producing an exceptional number of entries there when it crashes.
/proc/sys/fs/file-max is over two hundred thousand.
/proc/sys/fs/file-nr never goes above a few thousand.
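The file-handle numbers could be logged every minute the same way as the lsof output, so there is a record right up to the crash. A minimal sketch, assuming the usual three-field layout of /proc/sys/fs/file-nr (allocated, allocated-but-unused, maximum); the function names and the log path are made up for illustration and are not part of the existing script:

```python
import os
import time


def parse_file_nr(text):
    """Split the contents of /proc/sys/fs/file-nr into three integers."""
    allocated, unused, maximum = (int(field) for field in text.split())
    return allocated, unused, maximum


def format_sample(timestamp, text):
    """Build one log line from a timestamp string and raw file-nr contents."""
    allocated, unused, maximum = parse_file_nr(text)
    in_use = allocated - unused
    pct = 100.0 * in_use / maximum
    return "%s in_use=%d max=%d pct=%.2f" % (timestamp, in_use, maximum, pct)


def log_sample(log_path, proc_path="/proc/sys/fs/file-nr"):
    """Append one sample to log_path and try to push it to disk at once."""
    with open(proc_path) as proc_file:
        text = proc_file.read()
    timestamp = time.strftime("%Y-%m-%dT%H:%M:%S")
    with open(log_path, "a") as log:
        log.write(format_sample(timestamp, text) + "\n")
        log.flush()
        # Ask the kernel to write the line out now, so the last sample
        # has a chance of reaching the disk before a hang.
        os.fsync(log.fileno())
```

Run from cron once a minute with something like log_sample("/var/log/file-nr.log") (a hypothetical path). If the last lines before a crash still show only a few thousand handles in use, file-handle exhaustion can be ruled out with some confidence.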
(The RAID partition for /home is always unclean after a crash and reboot. This is not really remarkable, but might be worth mentioning.)
Assuming, for a moment, that it's not physically broken hardware causing this, what:
A) could it be?
B) can I do to nail down what it could be?
Assuming we allow for physically broken hardware, what:
A) could it be?
B) can I do to nail down what it could be?
(My current plan: Install Debian Stable on another HDD, install it in the machine, boot from that, duplicate the setup, and try to force it to crash there.)
(no subject)
Date: 2007-04-30 02:30 pm (UTC)
(no subject)
Date: 2007-04-30 02:39 pm (UTC)
(no subject)
Date: 2007-04-30 03:29 pm (UTC)
I call hardware failure. Boot Debian stable, or find a build of Knoppix that works on that hardware, and force it to crash; if it still goes down, that will confirm hardware failure, especially if you boot from Knoppix and force it to load its filesystems to a RAM disk with no hard disk access.
In my experience, that kind of behaviour (and my experience is primarily with the Windows architecture, not the *nix, so grain of salt) is caused by faulty RAM, a faulty RAM controller, or a fault somewhere in the swap hardware chain (RAM, RAM controller, the RAM that the kernel is sitting in, storage controller, storage device).
You could also try moving your swap partition to another hardware device and/or recreating it and/or running without it. If it runs for a week or more afterward with much improved stability, you had a corrupted swap partition. If it merely takes /longer/ to die, it's almost certainly RAM.
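To compare runs before and after moving swap, it may help to record which device swap lives on and how much of it is in use. A rough sketch, assuming the usual /proc/swaps layout (a header line, then whitespace-separated Filename, Type, Size, Used, Priority columns with sizes in KiB); parse_swaps is a name invented here for illustration:

```python
def parse_swaps(text):
    """Turn the contents of /proc/swaps into a list of dicts, one per swap area."""
    entries = []
    lines = text.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 5:
            continue  # ignore anything that does not look like a swap entry
        device, kind, size_kb, used_kb, priority = fields[:5]
        entries.append({
            "device": device,
            "type": kind,
            "size_kb": int(size_kb),
            "used_kb": int(used_kb),
            "priority": int(priority),
        })
    return entries


def read_swaps(path="/proc/swaps"):
    """Read and parse the live swap table (Linux only)."""
    with open(path) as swaps_file:
        return parse_swaps(swaps_file.read())
```

An empty list from read_swaps() would confirm the machine really is running without swap during the test week.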
(no subject)
Date: 2007-04-30 06:23 pm (UTC)
I think that's my problem.
(no subject)
Date: 2007-04-30 08:00 pm (UTC)