theweaselking: (Work now)
[personal profile] theweaselking
Here's a stumper for you.

Once upon a time, long ago, there was a DNS server. This DNS server no longer really exists - it's still there, still turned on, but nothing uses it for DNS. This machine also used to be a mail server, but, again, nothing uses it regularly, these days. It's only still taking up server room space and electricity because there's a couple of IMAP mailboxes there that people still access once in a while.

There is an SQL server (which also serves internal web pages) and a Web server (for external pages, calling the SQL server and accessing the same databases as the internal pages). The web server lives in a DMZ, with pinholes from it to the CURRENT DNS server (not the old one!), and to the SQL server. The web server cannot, under any circumstances, see the old DNS server.

If the old DNS server is turned off, page serving from the Web server is EXTREMELY slow. Like, 10-15 second delays before pages load. This ONLY happens to pages that make database connections - serving local files with no DB connection is fast.

The SQL server reports no errors, and serves its local, internal pages pulling from the exact same database at normal speed. Logs show that the DB requests are being served as soon as they arrive, they're just not really arriving on time.

I logged into the old DNS server and started killing services. With NOTHING running on it (nearly literally - it was sitting there with an IP and that's all, no programs or services running) the web pages are perfectly fast. As soon as it doesn't have a network connection, the web server gets these delays again.

I set up a new SQL server and pointed a copy of the website, running on the same web server, at it. I rebooted the old DNS server while watching the pages. The "real" page, going to the real SQL server, slows to a crawl. The fake page to the new SQL server is still its normal blazingly fast self.

So, the problem has to be the SQL server.... except that it serves all requests EXCEPT the Web Server ones perfectly fast, and it serves the Web Server requests as soon as it gets them. It just isn't getting them.
The Web server *cannot* speak to the old DNS server. It simply can't reach it, and has never been programmed to reach it. The web server postdates the DNS server's decommissioning. It CAN reach the current DNS, but the current DNS doesn't speak to the old DNS. Also, the problem doesn't happen when Bind on the old server is stopped - it only happens when the old server is turned off or unplugged from the network.

Judicious use of grep on the Web server, SQL server, and other DNS servers on the network has shown that there are NO references to the old DNS server anywhere in their configuration, by name, alternate name, or IP.
The test SQL server I set up is an exact mirror of the normal SQL server's configuration, with only the hostname and IP address changed - and yet, calls to it don't crawl the way calls to the normal SQL server do.
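
For anyone who wants to repeat the config sweep without fiddling with grep flags, here is a minimal Python sketch of the same idea. The function name `find_references` is made up for illustration, and any names or addresses you feed it are your own; nothing here is the real hostname or IP from this network.

```python
import os

def find_references(root, needles):
    """Walk `root` and return (path, line_number, line) for every line
    that contains any of the `needles`, compared case-insensitively."""
    lowered = [needle.lower() for needle in needles]
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path):
                # Skip sockets, FIFOs and dangling symlinks.
                continue
            try:
                with open(path, "r", errors="replace") as handle:
                    for number, line in enumerate(handle, start=1):
                        if any(needle in line.lower() for needle in lowered):
                            hits.append((path, number, line.rstrip("\n")))
            except OSError:
                # Unreadable file (usually permissions): move on.
                continue
    return hits
```

Calling something like `find_references("/etc", ["olddns", "192.0.2.53"])` (a hypothetical name and a documentation-range IP) returns a list you can print or diff between machines. Like grep, it only finds references stored as plain text on disk, which is exactly why a clean result doesn't rule out something learned at runtime.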

I'm at the point of stealing the IP of the old DNS server with my laptop and running Wireshark just to see what the hell is calling. I'm *that* stumped. Any ideas, other than that?

(no subject)

Date: 2011-04-06 04:50 pm (UTC)
From: [identity profile] silmaril.livejournal.com
They have unionized and they are slowing work in solidarity with their old comrade, almost forgotten and under the threat of being decommissioned.

(I've got nothing. Additionally, I don't have any of the know-how to actually have anything.)

(no subject)

Date: 2011-04-06 05:03 pm (UTC)
From: [identity profile] fourgates.livejournal.com
I would guess that the new web server *box* is using the old DNS *box* as a gateway. Not DNS, but IP Gateway. Or some other funkiness with how your DMZ is configured, at the routing level.

Grabbing the IP address from the old DNS box and moving it to the web server box would be my next move.

I'm at least 20% confident here, so good luck with that.

(no subject)

Date: 2011-04-06 05:18 pm (UTC)
From: [identity profile] theweaselking.livejournal.com
Unfortunately, no, it's not. And if it was, "same server, new SQL server" wouldn't fix the problem, and if the SQL server had that problem it would show the same behaviour on local pages and SQL requests from machines other than the web server.

(no subject)

Date: 2011-04-06 05:12 pm (UTC)
From: [identity profile] skiriki.livejournal.com
Since I don't know what, exactly, you use to run the systems...

I'd just grep in /etc/*/* and other relevant dirs for the IP address/name and see if it is in confs somewhere. ;)

For starters! Then I'd swear a lot and ask and just as I post my question I figure out where the problem lies and I'd be forced to post "NM, I R STUPID". ;)

(no subject)

Date: 2011-04-06 05:19 pm (UTC)
From: [identity profile] theweaselking.livejournal.com
The machines: An unholy mess of Debian and CentOS boxes of various distributions.

I've already done the grep bit and found nothing. And, unfortunately, posting it here hasn't led me to a revelation, which was kind of what I was hoping for.

(no subject)

Date: 2011-04-06 05:25 pm (UTC)
From: [identity profile] skiriki.livejournal.com
Oh, damn. :(

Well, I'm just glad that I'm not the only one who does "ask and receive instant enlightenment" thing.

I think hijacking the IP might be a good idea and checking who calls what and where and why. Clearly, something is calling home.

If you find out, do let us know, 'cuz this has made me curious. :)

(no subject)

Date: 2011-04-06 05:30 pm (UTC)
From: [identity profile] theweaselking.livejournal.com
Hijack-and-listen requires a bit of scheduling, though, because the public-side website that slows to a crawl really is a public-side website and needs to stay up.

(no subject)

Date: 2011-04-06 06:04 pm (UTC)
From: [identity profile] xengar.livejournal.com
The only thing that comes to my mind is routing table weirdness, and even then it's only because you didn't specifically mention having checked the routers.

This is quite a puzzle you've got here.

(no subject)

Date: 2011-04-06 06:11 pm (UTC)
From: [identity profile] theweaselking.livejournal.com
I kind of think it HAS to be routing, somewhere.

I wonder what the SQL server's routing table looks like.
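
A rough sketch of one way to answer that on a Linux box without squinting at `route -n` output: parse `/proc/net/route` and list anything that goes via a given gateway. The function names are invented for this example, and the decoding assumes a little-endian machine (ordinary x86), where the kernel prints each address as byte-reversed hex.

```python
import socket
import struct

def _hex_to_ip(hex_field):
    """Decode one little-endian hex address field from /proc/net/route."""
    return socket.inet_ntoa(struct.pack("<L", int(hex_field, 16)))

def parse_proc_net_route(text):
    """Return a list of dicts (iface, destination, gateway, mask) parsed
    from the contents of Linux's /proc/net/route."""
    routes = []
    lines = text.strip().splitlines()
    for line in lines[1:]:  # the first line is the column header
        fields = line.split()
        if len(fields) < 8:
            continue
        routes.append({
            "iface": fields[0],
            "destination": _hex_to_ip(fields[1]),
            "gateway": _hex_to_ip(fields[2]),
            "mask": _hex_to_ip(fields[7]),
        })
    return routes

def routes_via(routes, gateway_ip):
    """Pick out the routes whose next hop is `gateway_ip`."""
    return [route for route in routes if route["gateway"] == gateway_ip]
```

Feed it `open("/proc/net/route").read()` and ask `routes_via(...)` for the old box's address; an empty list means no route on that machine uses it as a next hop.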

(no subject)

Date: 2011-04-06 06:34 pm (UTC)
From: [identity profile] mhoye.livejournal.com
If you've got a problem at the IP layer (which you clearly do), the first things I think are "corrupt ARP cache" or "routing problem." The network layer in the SQL box is still asking the old DNS server for something and then timing out.

Hmmm... Any chance the old SQL server has logging to an external machine set up, and is looking to the old DNS server for information about it?

(no subject)

Date: 2011-04-06 06:41 pm (UTC)
From: [identity profile] mhoye.livejournal.com
Actually, if you can kill all the processes running on the old DNS server, why can't you just run wireshark on it directly?

(no subject)

Date: 2011-04-06 07:00 pm (UTC)
From: [identity profile] theweaselking.livejournal.com
#1: Because I always think of wireshark as a desktop app and hadn't really considered it.
#2: Because the Unstable Legacy Machine is absolutely not allowed to have radical and unclean changes made to it that may result in the unavailability of the ancient-ass IMAP boxes. Also, if I break it we don't have a fix for the website being slow.

(no subject)

Date: 2011-04-06 07:01 pm (UTC)
From: [identity profile] mhoye.livejournal.com
Ok, so: Unstable Legacy Machine Ethernets: 100BT, or 10BT?

(no subject)

Date: 2011-04-06 07:02 pm (UTC)
From: [identity profile] theweaselking.livejournal.com

#3: Because the machine is so old it wants Ethereal, not Wireshark... and requires X to go with it.

On the other hand, writing network traffic to stdout should be doable.

(no subject)

Date: 2011-04-06 10:51 pm (UTC)
From: [identity profile] nsanity-au.livejournal.com
isn't tcpdump the same as ethereal anyway? just grep that.

(no subject)

Date: 2011-04-07 05:21 am (UTC)
From: [identity profile] zastrazzi.livejournal.com
on the old dns server

tcpdump -i ethx -vvv -w output.cap host olddns and \( host sqlserver or host webserver \)

Then transfer the cap file to a box with wireshark and open it up.

(no subject)

Date: 2011-04-07 05:05 pm (UTC)
From: [identity profile] theweaselking.livejournal.com

Yeah, did that. It's fucking RDNS lookups, and I don't know why.
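
For reference, a minimal sketch of what a reverse lookup looks like from the asking side, and how to time one. Both function names are made up for the example, and it says nothing about which piece of software here is actually issuing the lookups. `ptr_name` shows the in-addr.arpa name that gets queried; `timed_reverse_lookup` measures how long the system resolver takes to answer or give up.

```python
import ipaddress
import socket
import time

def ptr_name(ip):
    """Return the PTR record name that a reverse lookup of `ip` asks for."""
    return ipaddress.ip_address(ip).reverse_pointer

def timed_reverse_lookup(ip, resolver=socket.gethostbyaddr):
    """Reverse-resolve `ip` and return (hostname_or_None, seconds_taken).

    `resolver` defaults to the system resolver; it is a parameter only so
    the function can be exercised without touching the network."""
    start = time.monotonic()
    try:
        hostname = resolver(ip)[0]
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses.
        hostname = None
    elapsed = time.monotonic() - start
    return hostname, elapsed
```

Running `timed_reverse_lookup` against the web server's address from the SQL box, once with the old machine up and once with it unplugged, should show whether the multi-second delay lives in the resolver.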

(no subject)

Date: 2011-04-06 07:37 pm (UTC)
From: [personal profile] secretagentmoof
It makes me think that the old dns server is set up to be authoritative for reverse dns or some other zone. New server tries to query old server for data: if the service is stopped, ICMP unreachable; if it works, it works; if it's not on the net, it hangs.
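
Those three cases can be told apart from the client side. A minimal sketch, using a plain TCP connect instead of a real DNS query purely to keep it short; `probe_tcp` is a made-up name. A live host with the service stopped refuses almost instantly (a reset for TCP, an ICMP port-unreachable for UDP), while a host that is off the network just eats the packets until the timeout expires.

```python
import socket

def probe_tcp(host, port, timeout=3.0):
    """Classify a TCP connection attempt to (host, port).

    Returns 'answered' if something accepted the connection, 'refused'
    if the host is up but nothing listens on the port, 'timeout' if
    nothing came back before `timeout` seconds, and 'unreachable' for
    any other network error (for example no route to the host)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "answered"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError:
        return "unreachable"
```

If the old box really is being queried, pointing this at port 53 of its address should give "answered" with Bind running, "refused" with Bind stopped, and "timeout" with the cable pulled, which would line up with the fast/fast/slow pattern described in the post.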

(no subject)

Date: 2011-04-06 07:57 pm (UTC)
From: [identity profile] theweaselking.livejournal.com
That's clever. I'll have to look at that tomorrow.

(no subject)

Date: 2011-04-07 01:37 am (UTC)
From: [identity profile] quotation.livejournal.com
That was my gut reaction, too

Also, the tool you need is iptraf.

(no subject)

Date: 2011-04-07 03:29 am (UTC)
From: [identity profile] fridgepunk.livejournal.com
This sounds like the proper way to say what I was going to say – the SQL server is doing something that involves in step 1, trying to find something on the Old DNS (ODNS) that isn't there any more and then if, and only if, it can't get a negative from the ODNS does it then move to step 2 which involves looking on the SQL server, or the SQL server is for some reason convinced that the ODNS is part of the SQL server and so the SQL is having the ODNS reroute stuff from the SQL server back into the SQL server and then switches to some weird and ghastly kludge someone has built on top of the process for when the ODNS goes offline.

Though this is all more akin to a genetic disorder of flowers than anything an admin should be dealing with, so lifting off and nuking from orbit would be my advice.

(no subject)

Date: 2011-04-06 08:01 pm (UTC)
From: [identity profile] sbisson.livejournal.com
Are you using named pipes for SQL Server connectivity?

(no subject)

Date: 2011-04-06 08:06 pm (UTC)
From: [identity profile] theweaselking.livejournal.com
All TCP/IP, I think.

(no subject)

Date: 2011-04-07 12:12 am (UTC)
From: [identity profile] dreamshade.livejournal.com
hm. theweaselking has lost a machine.. literally _lost_. it responds to ping, it works completely, he just can't figure out where in his apartment it is.

(no subject)

Date: 2011-04-07 01:29 am (UTC)

(no subject)

Date: 2011-04-07 03:33 am (UTC)
From: [identity profile] jdarkwulf.livejournal.com
Is that kind of like the techie story where they were trying to find an active server for inventory, and couldn't locate it for the life of them? The one where they eventually find out that it mistakenly got sealed behind some drywall during a remodel? :)

(no subject)

Date: 2011-04-07 12:51 pm (UTC)
From: [identity profile] theweaselking.livejournal.com
First: That's a true story. Almost all apocryphal computer stories are true, because they have all happened to me at some point, occasionally with a few of the smaller details gotten wrong in the retelling.

Second: Not exactly, because we know precisely where all the machines involved are on the network and in the office, at all times. We just know they're talking behind our backs.

(no subject)

Date: 2011-04-07 01:38 pm (UTC)
From: [identity profile] alchemist.livejournal.com
The server in question was at UNC Chapel Hill. It was a Novell box, and ran just fine for YEARS, and I forget why they actually had to chase it down.

But yes, there are photos, and I know a couple of the eye-witnesses.

(no subject)

Date: 2011-04-07 01:49 pm (UTC)
From: [identity profile] theweaselking.livejournal.com
The same thing happened to me at MCI. It wasn't "drywalled in", it was put into the inner-wall space of an adjacent unfinished-walls storeroom, with a hole drilled through the drywall for the network cable, but the result was the same - it worked fine for years, until it stopped working and suddenly NOBODY knew where the server that hosted this app was, only that it was down.

(I also got the cupholder and the "no, I can't see behind my computer because the power is out" guy. No, really. There Are No Apocryphal Computer Stories.)

(no subject)

Date: 2011-04-07 07:34 pm (UTC)
From: [identity profile] ixeian.livejournal.com
Love that story. I remember reading it and thinking it was so damn funny.

(no subject)

Date: 2011-04-07 01:41 pm (UTC)
From: [identity profile] alchemist.livejournal.com
My gut reaction, after all that, is:
1) Check the firewall
2) Check the router/switch

Also, if you turn off the box and bring up the IP on another box, does it still "fix" the problem? An IP alias is a cheap fix if it keeps things running while you chase down the real culprit.
