Geek pop quiz.
Feb. 9th, 2007 05:01 pm
Update on the previous Fedora machine dying-suddenly idiocy:
#1: Odd behaviour that I can't explain: Every time I reboot, /dev/nst0 has 600 permissions set. nst0 is the tape drive, owned by root:disk, and it needs 660 permissions so the tape backups can run.
I can set this manually after every reboot, and I can set a script to run this manually after every reboot. That's not the point. The point is that this *keeps happening*, and I don't know why it should. Can anyone tell me why?
#2: I think the freeze/lockup/crash problem is related to the tape backups themselves. It's a SCSI Exabyte tape drive - a damn impressive one, as such things go - but the crashes seem to come *only* when Amanda is running hard and backing up all the data. Of course, amanda logs nothing out of the ordinary - in fact, it logs that it successfully completed all functions, does not log the start of another function, and then dies. I can *see* the point at which it dies - after gnutar finishes compressing the data to write to tape, but before amanda logs the final results of writing to tape. I know this because I can see the *successful* logs, and they look different from the logs that involve "system died at some point during the night".
Of course, nothing gives a fucking error message, anywhere.
It appears to be related to the *volume* of data being written to tape, in that a small backup works fine but a large backup *might* hang the machine. The same data will work one day and not the next.
Any suggestions for troubleshooting this? Rob, I'm happy to log lsof and the like all day long, but I honestly don't know what I'm looking for in the logs once I've got them.
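One low-effort way to pin down *when* the box dies, as a rough sketch rather than a known fix: write a timestamped heartbeat to disk every few seconds and force each line out with fsync, so the last line that survives the lockup can be lined up against the Amanda logs. The function names, the log path, and the choice of load average as the thing to record are all assumptions made for illustration; nothing here comes from Amanda or Fedora itself.

```python
import os
import time


def heartbeat_line(now=None):
    """Build one log line: local timestamp plus the 1/5/15-minute load averages."""
    now = time.time() if now is None else now
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    load1, load5, load15 = os.getloadavg()
    return f"{stamp} load={load1:.2f},{load5:.2f},{load15:.2f}\n"


def write_heartbeat(path, line):
    """Append a line and fsync it so it is on disk before a hard lockup."""
    with open(path, "a") as fh:
        fh.write(line)
        fh.flush()
        os.fsync(fh.fileno())


def run(path, interval=10, count=None):
    """Write a heartbeat every `interval` seconds; count=None means run forever."""
    written = 0
    while count is None or written < count:
        write_heartbeat(path, heartbeat_line())
        written += 1
        if count is None or written < count:
            time.sleep(interval)
    return written
```

Started before the backup window with something like `run("/var/log/heartbeat.log")` (a hypothetical path), the timestamp of the final line gives the time of death to within one interval, which is at least something concrete to look for.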
(I really, really hate asking questions like this. I'm fine with Linux when the docs cover it or when I know where to look, but right now, I have neither, and that's bad.)
(no subject)
Date: 2007-02-09 10:52 pm (UTC)
(no subject)
Date: 2007-02-10 01:43 am (UTC)
It's not worth the effort any more. Replace the OS with something enterprise-grade.
Yeah, easier said than done, but in my experience Fedora Core 4 had a multitude of bugs that were only finally crushed when it was converted to the formal Red Hat release. As this isn't the formal Red Hat flavor, even if you did solve this one problem... others may arise. If you were running FC6 (Zod), I'd suggest investing a little more time with the issue as FC6 seems to be the most enterprise-worthy Fedora out there to date, but that's not the case....
Debian calls. If anyone gives you crap, point to the irresponsibility and/or illegality of running a known beta product in a production environment. Oh, the liability that could demand should word get to the auditors and/or the board of directors.
...or maybe that's just a threat we use in banking. ;-)
- James -
(no subject)
Date: 2007-02-10 05:44 am (UTC)
#1: It's currently, with the exception of those tape backups, a fully working production system.
#2: I have *ZERO* clue how to set up a RAID and SCSI tape drive in Debian. None. This means I'd be taking the *currently working* system offline and starting over from scratch with no certainty of getting it back online in a reasonable timeframe.
These two things mean I want to keep Fedora until and unless I can confirm that the problem is *definitely* Fedora, and I won't just have the same problem with the tape drive if I manage to get Debian in there.
(no subject)
Date: 2007-02-10 06:49 am (UTC)
I admit to having set up a SCSI RAID array under both Fedora and SuSE, but I did so in a manner totally eschewed by "real Linux professionals": I used the freakin' GUI. With the latest Fedora Core 6, it was a pretty damned painless process, too. I was lucky in that the IBM-branded server and SCSI card were auto-detected, of course. I have little experience with Debian, I'm afraid.
I agree with the other posters - check out the SCSI configuration and see if you can't kick the problem with those earlier suggestions. If not, all I can ask is that you keep some education on Linux/SCSI configuration on deck in case you do need to migrate the box to a new server. In fact, that's the best suggestion - if you can, get a different server (re-purposed from elsewhere, etc.) and build Debian parallel to the working Fedora box and move production over more slowly. But, I suspect that's what you would do if it came to that anyway.
Good luck, man.
- James -
(no subject)
Date: 2007-02-10 02:35 pm (UTC)
While I really do like your solution, I do have to hold it as a last resort.
(no subject)
Date: 2007-02-10 02:18 am (UTC)
Long long ago on a Linux far too far away, I tried using optical disks (PD: Phase Differential read/write disks). The system looked fine 'till the console filled with hundreds of SCSI write commands that were queued and timed out (the timeouts were apparently too short to allow for such a slow device). I wonder if the SCSI system is hanging on a too-slow command? I wish I had learned more about SCSI systems when I had access to the logic analyzers :"<
(no subject)
Date: 2007-02-10 02:56 am (UTC)
HedRat checks for hardware changes automatically with every boot, and adds and removes device nodes during this process (it also "fixes" any it thinks aren't right).
Add a script in /etc/rc3.d that checks the device perms and corrects them if they aren't right.
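A minimal sketch of what such a check might look like, written in Python purely for illustration (an rc script would more usually be shell). The function name `ensure_mode` is made up here; `/dev/nst0` and the 0660 mode are taken from the original post. It only covers the check-and-correct approach suggested above and does not explain why the mode is reset at boot.

```python
import os
import stat


def ensure_mode(path, wanted=0o660):
    """Set path's permission bits to `wanted` if they differ. Return True if changed."""
    current = stat.S_IMODE(os.stat(path).st_mode)
    if current == wanted:
        return False
    os.chmod(path, wanted)
    return True


if __name__ == "__main__":
    device = "/dev/nst0"  # tape device named in the post
    try:
        changed = ensure_mode(device)
        print(f"{device}: {'corrected to 0660' if changed else 'already 0660'}")
    except OSError as exc:
        # Missing device node, or not running as root.
        print(f"{device}: could not check or fix permissions ({exc})")
```

It has to run as root, and late enough in the boot sequence that the device node already exists.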
Your issue looks like a SCSI issue with the locking up now that you've got a bit more info. Either you've got a bad cache chip on the chain (remember kids, SCSI devices are intelligent, they have an onboard controller of their own), or a cable that's only intermittently failing, or your termination is fucked up. Oh, and you could be suffering from device ID conflicts as well, but those usually cause lots of interesting things to be logged by the kernel when it tries to probe the bus.
(no subject)
Date: 2007-02-10 05:41 am (UTC)
The tape drive is new. Brand new. As in, was bought about a month ago and installed by my predecessors, and I'm the first user - they installed Amanda, but didn't configure it to run.
That makes this possible, but unlikely.
or your termination is fucked up.
Given that I'm there because my predecessors were incompetent fuckwits, I suspect this may be the case.
What I know about SCSI and terminators, though, is what I can find through Google: A wealth of information, very little to help in any specific case.
Do you know of any way to test to see if the termination is fucked up? If the SCSI bus runs straight from the tape drive to the card in the back of the PC without stopping, shouldn't that be correctly terminated pretty much by default? And if it wasn't terminated, shouldn't it be dying long, long before it gets to that point?
(As far as I can tell, it dies at about the 65GB mark)
Oh, and you could be suffering from device ID conflicts as well, but those usually cause lots of interesting things to be logged by the kernel when it tries to probe the bus.
There is one and exactly one SCSI device on the bus, and the other end of the bus is a SCSI card that's plugged into what I suspect is a PCI slot, but can't be sure because I've never pulled the machine apart. Wouldn't that make conflicts unlikely?
(no subject)
Date: 2007-02-10 06:42 am (UTC)
Are the cable and terminator new too?
If this is an Adaptec card, hitting Ctrl-A while the controller probes the bus during POST will take you into the SCSI card's own BIOS, and from there you have lots of fun little options, one of which is to enable termination on the card. (I can't remember what LSI and Buslogic use, but generally you'll get some sort of message telling you.) If you don't have internal SCSI peripherals, you really want to do this. The default on Adaptec is Automatic, and normally this works fine (I give Adaptec because they are the most common PCI SCSI host controllers on the market).
Which terminator you need is entirely dependent upon what your peripherals are.
What's important is that the narrowest device is on the end of the chain. I.e., if you've got a SCSI [U|F]W (68 pin MMD) drive and a SCSI [2|F] (50 pin MMD) drive on the same chain, the narrow device has to be last, and you terminate with a terminator on the final 50 pin MMD interface. All 68 pin devices must come in the chain before 50 pin, and all 50 pin before 25 pin (and if you have a 25 pin SCSI device, do yourself a favor and accidentally drop it from at least a fourth floor window, just to be sure).
If your SCSI controller is new, which it sounds like it is, it's probably a U160 or U320 LVD controller. The important thing there is that your tape drive is in no way, shape, or form an Ultra [160|320] device (unless Exabyte have suddenly stopped being the company I've known and hated for the last 14 years). Make sure you're using a non-LVD terminator. LVD controllers can handle SCSI[2|F|U|W] devices, but you must terminate accordingly, and the whole bus always runs at the speed of the least capable device (so putting an old SCSI1 (10 MB/s) device at the end of your chain of otherwise U320 (320 MB/s) devices forces all devices on the chain down to SCSI1).
There is one and exactly one SCSI device on the bus, and the other end of the bus is a SCSI card that's plugged into what I suspect is a PCI slot, but can't be sure because I've never pulled the machine apart. Wouldn't that make conflicts unlikely?
The controller itself has an ID. On SGIs, the controller is at ID 0; on Suns, the controller is at ID 7. On PCs, you set the damn thing wherever you want it in the onboard BIOS. Adaptec, LSI, and Buslogic controllers all default to ID 7. So yes, even with only one device, you can have ID conflicts.
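To make the point concrete, here is a toy sketch of the check: treat the controller as just another entry on the bus and look for any ID claimed twice. The function name and the example ID assignments are hypothetical, chosen only to illustrate the idea; the one detail taken from the text is that ID 7 is a common controller default.

```python
def find_id_conflicts(assignments):
    """Return {scsi_id: [names]} for every ID claimed by more than one entry on the bus."""
    by_id = {}
    for name, scsi_id in assignments.items():
        by_id.setdefault(scsi_id, []).append(name)
    return {i: sorted(names) for i, names in by_id.items() if len(names) > 1}


# Hypothetical bus: controller at the common default of 7, tape drive jumpered to 7 as well.
print(find_id_conflicts({"controller": 7, "tape": 7}))
# Same bus with the tape drive moved to a different (hypothetical) ID.
print(find_id_conflicts({"controller": 7, "tape": 4}))
```

The first call reports both entries under ID 7; the second returns an empty dict, meaning no conflict.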
(no subject)
Date: 2007-02-12 02:40 pm (UTC)
Adaptec SCSI card, ID 7, set to "automatic" termination (with "disable" being the only other option).
The only device is the Exabyte VXA-320, ID #4.
It's got a terminator plugged into the other slot, with the little light lit up to tell us that it's there and working.
How do I tell if the terminator is LVD?
(no subject)
Date: 2007-02-12 02:49 pm (UTC)
(no subject)
Date: 2007-02-12 04:25 pm (UTC)
(no subject)
Date: 2007-02-10 08:44 pm (UTC)
Can you put two half backups on the same tape?
(no subject)
Date: 2007-02-10 09:05 pm (UTC)
Two half backups to the same tape: Not sure, actually.