theweaselking: (Work now)
[personal profile] theweaselking
There is a machine. Ubuntu 8.04 Desktop.

In the last week, since a power failure, it has started acting oddly - the internet connection will come and go, it will stop talking to the outside world suddenly and without warning, etc.

The catch: The machine is *always* rock-solid on the internal network. I can SSH into it and, over the one single solitary network connection, watch *internet* access come and go. It can always reach internal addresses - just anything outside 192.168.1.0/24 is simply not there.

The router is not blocking any traffic. DNS is working - it's correctly resolving all the host names, and attempting to connect via IP gives the same results.
The router's logging capabilities are utter crap, but they *do* see the attempt by the machine to connect out - if I tell it to wget www.google.com, the router's outgoing connection log will show a connection from this machine to something in Google's IP range.
No other machine on the network is having a problem.
Changing this machine to have a different static IP (like the other, working server), or to run via DHCP (like the other, working desktops) doesn't change the behavior.

So: My latest thought was a Routing problem - no default route! A static route to somewhere! Right? Well, no. Route looked like crap, so I cleared 'em all, restarted, and now this is what "route" gives me:

Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.168.1.0     *               255.255.255.0   U     0      0        0 eth0
link-local      *               255.255.0.0     U     1000   0        0 eth0
default         192.168.1.1     0.0.0.0         UG    100    0        0 eth0


Looks good to me! netstat -rn gives the same results! Unfortunately, I still can't get a single packet out except to 192.168.1.1
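
For what it's worth, that table can be sanity-checked by hand. Here's a minimal Python sketch of the longest-prefix-match lookup that route selection boils down to (it ignores metrics and everything else the kernel really does). The `ROUTES` list and `pick_route` are names made up for illustration, the "link-local" entry is assumed to mean 169.254.0.0/16, and 8.8.8.8 is just a stand-in for some outside address.

```python
import ipaddress

# Routes transcribed from the "route" output above.
# Assumption: "link-local" is 169.254.0.0/16; "*" means no gateway (on-link).
ROUTES = [
    ("192.168.1.0/24", None, "eth0"),
    ("169.254.0.0/16", None, "eth0"),
    ("0.0.0.0/0", "192.168.1.1", "eth0"),  # the default route
]


def pick_route(destination):
    """Return (network, gateway, iface) of the most specific matching route."""
    addr = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(net), gw, iface)
        for net, gw, iface in ROUTES
        if addr in ipaddress.ip_network(net)
    ]
    if not matches:
        return None
    # Longest prefix wins; metrics are ignored in this sketch.
    return max(matches, key=lambda route: route[0].prefixlen)


if __name__ == "__main__":
    for dest in ("192.168.1.4", "192.168.1.1", "8.8.8.8"):
        net, gw, iface = pick_route(dest)
        via = gw if gw else "directly on-link"
        print(f"{dest}: {net} via {via} ({iface})")
```

On paper, then, anything outside 192.168.1.0/24 should be handed to 192.168.1.1, which is why the table itself doesn't look like the thing dropping the packets.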

So: Those are my symptoms, and they do not make sense.

/etc/network/interfaces:
# The primary network interface
auto eth0
iface eth0 inet static
address 192.168.1.15
netmask 255.255.255.0
gateway 192.168.1.1
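
Same kind of sanity check for that stanza: a small Python sketch that confirms the gateway sits inside the subnet implied by the address and netmask. `check_static_config` is a hypothetical helper written for this post, not anything Ubuntu ships.

```python
import ipaddress


def check_static_config(address, netmask, gateway):
    """Return the derived subnet and whether the gateway is on-link in it."""
    # ip_interface accepts "address/netmask" and derives the network from it.
    iface = ipaddress.ip_interface(f"{address}/{netmask}")
    gw = ipaddress.ip_address(gateway)
    return iface.network, gw in iface.network


if __name__ == "__main__":
    network, ok = check_static_config(
        "192.168.1.15", "255.255.255.0", "192.168.1.1"
    )
    print(f"subnet {network}, gateway on-link: {ok}")
```

A gateway outside the subnet would be an obvious culprit, but with these values it isn't the case here.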


Traceroute from the machine to the router:
traceroute to 192.168.1.1 (192.168.1.1), 30 hops max, 40 byte packets
 1  192.168.1.1 (192.168.1.1)  0.456 ms  0.757 ms  1.068 ms


Traceroute from the machine to another machine on the network:
traceroute to 192.168.1.4 (192.168.1.4), 30 hops max, 40 byte packets
 1  oasis.local (192.168.1.4)  0.163 ms  0.153 ms  0.152 ms


Traceroute to the external IP, on the far side of the router:
traceroute to XXX.XXX.XXX.XXX (XXX.XXX.XXX.XXX), 30 hops max, 40 byte packets
 1  * * *
 2  * * *
 3  * * *


Traceroute to the external IP from another machine on the network, run at the same time:
traceroute to XXX.XXX.XXX.XXX, 30 hops max, 40 byte packets
 1  XXX.XXX.XXX.XXX  1.518 ms  2.108 ms  2.715 ms


... this HAS to be a routing problem, right? The symptoms don't match *anything* else. I just can't figure out *why* the routing is busted.

Help me, interwebs. What am I doing wrong?


EDIT: And now the bastard thing is working. And I don't know why. And I can't break it again.
(deleted comment)

(no subject)

Date: 2009-10-02 05:28 pm (UTC)
From: [identity profile] flemco.livejournal.com
Beat me to it.

I had a very similar situation on a Puppy Linux box. (Fuck you, don't you judge me.) Turned out the NIC I had in there stopped working properly after a system update. Switched it with another, issue resolved.

(no subject)

Date: 2009-10-02 05:39 pm (UTC)
From: [identity profile] theweaselking.livejournal.com
#1: No spare NIC. Good thought, but I don't have one handy. It's definitely on my list.
#2: If the NIC was fucked, how am I SSH'd in this whole time? And why can it speak to all those internal IPs? Gah!

(no subject)

Date: 2009-10-05 02:12 pm (UTC)
From: [identity profile] elffin.livejournal.com
The copy of the firmware that the NIC loaded into its own RAM probably got corrupted; since then, it's had the opportunity to reload the firmware from storage into RAM. Since the NIC probably has WOL turned on, even powering off the main computer doesn't re-IPL the NIC; that takes a power disconnect and tapping the power switch while disconnected to drain the NIC's power supply.

This conjecture would be irrelevant if the machine was completely taken off power as described above while you were troubleshooting it.

If the hardware NIC has some manner of "network accelerator" or "TCP offload engine", then that would be the likely region of the problem. You might try turning that off.

(no subject)

Date: 2009-10-02 07:29 pm (UTC)
From: [identity profile] mhoye.livejournal.com
srsly. "Something goes bad after a power failure", that's my first guess.

(no subject)

Date: 2009-10-02 05:23 pm (UTC)
From: [identity profile] ronebofh.livejournal.com
Maybe iptables is running to block anything coming back into the box?

(no subject)

Date: 2009-10-02 05:41 pm (UTC)
From: [identity profile] theweaselking.livejournal.com
Doesn't *look* like it. Good thought, though!

(no subject)

Date: 2009-10-02 06:25 pm (UTC)
From: [identity profile] anivair.livejournal.com
This was my guess, but only because it happened to me earlier this week. I don't believe that Ubuntu runs iptables by default, but by crumb if CentOS doesn't!

(no subject)

Date: 2009-10-02 08:47 pm (UTC)
From: [personal profile] andrewducker
I recommend asking people on LJ. When I did that with my fridge it started working again within two hours.

(no subject)

Date: 2009-10-02 08:59 pm (UTC)
From: [identity profile] theweaselking.livejournal.com
Took about 25 minutes after my post. Fucking thing's still working. I don't trust it.

(no subject)

Date: 2009-10-02 08:55 pm (UTC)
From: [identity profile] aeduna.livejournal.com
I've seen something similar where a network card got a tiny bit fried from the power outage. It wouldn't work with one specific wireless router any more - other machines would work with that router, and it would work with other routers, but just the combination tickled some one-bit damage on the card somewhere.

(no subject)

Date: 2009-10-02 08:58 pm (UTC)
From: [identity profile] theweaselking.livejournal.com
My current theory is that the screwed-up routing was the problem, and I fixed that, restarted the networking, flushed the DNS, restarted the computer, and then *waited 30 minutes* before the fucking thing noticed the routing had been corrected.

Because the routing *was* wrong, and I *did* change it. It just didn't start working again at that point - and when it *did* start working, nobody had done anything to it in ~25 minutes. We were just brainstorming to try to find a solution.

(no subject)

Date: 2009-10-02 10:07 pm (UTC)
From: [identity profile] alumiere.livejournal.com
This could also possibly be a damaged Cat5 cable or Ethernet port at either end. The local router is so close that it may get a clean enough signal for a few return packets to make it back and traces to go through, but even your provider's first hop has a considerably longer round-trip time, and that means the few packets that do get to your router may not get to your ISP's first hop quickly enough to avoid timing out.

(no subject)

Date: 2009-10-03 01:21 am (UTC)
From: [identity profile] theweaselking.livejournal.com
Decent thought, except that first hop is on the far side of the router - as in, it's the *other interface* of the same device. It's really hard to imagine a damaged cable that would allow you to get through the switch to port 1 but die between port 1 and port WAN, every single time, without fail.
