There is a machine. Ubuntu 8.04 Desktop.
In the last week, since a power failure, it has started acting oddly - the internet connection will come and go, it will stop talking to the outside world suddenly and without warning, etc.
The catch: The machine is *always* rock-solid on the internal network. I can SSH into it and, over the one single solitary network connection, watch *internet* access come and go. It can always reach internal addresses - just anything outside 192.168.1.0/24 is simply not there.
The router is not blocking any traffic. DNS is working - it's correctly resolving all the host names, and connecting directly by IP address fails the same way.
The router's logging capabilities are utter crap, but they *do* see the attempt by the machine to connect out - if I tell it to wget www.google.com, the router's outgoing connection log will show a connection from this machine to something in Google's IP range.
No other machine on the network is having a problem.
Changing this machine to have a different static IP (like the other, working server), or to run via DHCP (like the other, working desktops) doesn't change the behavior.
So: My latest thought was a routing problem - no default route! A static route to somewhere! Right? Well, no. The routes looked like crap, so I cleared 'em all, restarted, and now this is what "route" gives me:
    Kernel IP routing table
    Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
    192.168.1.0     *               255.255.255.0   U     0      0        0 eth0
    link-local      *               255.255.0.0     U     1000   0        0 eth0
    default         192.168.1.1     0.0.0.0         UG    100    0        0 eth0
Looks good to me! netstat -rn gives the same results! Unfortunately, I still can't get a single packet out past 192.168.1.1.
So: Those are my symptoms, and they do not make sense.
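One way to double-check that the table really says what it appears to say is to run the route selection by hand. The sketch below is only an illustration: it parses the "route" output shown above and picks the most specific matching entry for a destination. The function names, the SYMBOLIC mapping, and the stand-in external address 203.0.113.10 (a documentation-only address) are assumptions made for the example, and it ignores metrics, which the kernel also considers.

```python
import ipaddress

ROUTE_OUTPUT = """\
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.168.1.0     *               255.255.255.0   U     0      0        0 eth0
link-local      *               255.255.0.0     U     1000   0        0 eth0
default         192.168.1.1     0.0.0.0         UG    100    0        0 eth0
"""

# `route` prints names for some destinations; map the two that appear
# above back to addresses (link-local is 169.254.0.0/16).
SYMBOLIC = {"default": "0.0.0.0", "link-local": "169.254.0.0"}


def parse_routes(text):
    """Return a list of (network, gateway_or_None, iface) tuples."""
    routes = []
    for line in text.splitlines()[2:]:  # skip the title and column header
        fields = line.split()
        if len(fields) < 8:
            continue
        dest, gateway, genmask, iface = fields[0], fields[1], fields[2], fields[7]
        dest = SYMBOLIC.get(dest, dest)
        network = ipaddress.ip_network(f"{dest}/{genmask}")
        routes.append((network, None if gateway == "*" else gateway, iface))
    return routes


def lookup(routes, address):
    """Pick the most specific route containing `address` (longest prefix)."""
    addr = ipaddress.ip_address(address)
    matches = [r for r in routes if addr in r[0]]
    return max(matches, key=lambda r: r[0].prefixlen) if matches else None


if __name__ == "__main__":
    routes = parse_routes(ROUTE_OUTPUT)
    for target in ("192.168.1.4", "203.0.113.10"):
        network, gateway, iface = lookup(routes, target)
        print(target, "->", network, "via", gateway or "direct", "on", iface)
```

By this reading, internal addresses go out directly on eth0 and everything else is handed to 192.168.1.1, which is what a healthy table should do. That is consistent with the table not being the culprit, though it does not prove it.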
/etc/network/interfaces:
    # The primary network interface
    auto eth0
    iface eth0 inet static
        address 192.168.1.15
        netmask 255.255.255.0
        gateway 192.168.1.1
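A quick consistency check on that stanza, as a rough sketch: the gateway has to sit inside the subnet implied by the address and netmask, or the default route can never work. The helper names here are made up for illustration, and the parser only understands the simple static stanza shown above, not the full interfaces(5) syntax.

```python
import ipaddress

INTERFACES = """\
# The primary network interface
auto eth0
iface eth0 inet static
    address 192.168.1.15
    netmask 255.255.255.0
    gateway 192.168.1.1
"""


def static_settings(text, iface="eth0"):
    """Collect the option lines under `iface <name> inet static`."""
    settings, inside = {}, False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        words = line.split()
        if words[0] == "iface":
            # A new stanza starts; only track the one we were asked about.
            inside = words[1:4] == [iface, "inet", "static"]
        elif words[0] in ("auto", "allow-hotplug", "mapping", "source"):
            inside = False
        elif inside and len(words) >= 2:
            settings[words[0]] = words[1]
    return settings


def gateway_is_on_link(settings):
    """True if the gateway lies inside the interface's own subnet."""
    iface = ipaddress.ip_interface(f"{settings['address']}/{settings['netmask']}")
    return ipaddress.ip_address(settings["gateway"]) in iface.network


if __name__ == "__main__":
    print(gateway_is_on_link(static_settings(INTERFACES)))
```

For the file above the check passes, so the static configuration itself looks sane.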
Traceroute from the machine to the router:
    traceroute to 192.168.1.1 (192.168.1.1), 30 hops max, 40 byte packets
     1  192.168.1.1 (192.168.1.1)  0.456 ms  0.757 ms  1.068 ms
Traceroute from the machine to another machine on the network:
    traceroute to 192.168.1.4 (192.168.1.4), 30 hops max, 40 byte packets
     1  oasis.local (192.168.1.4)  0.163 ms  0.153 ms  0.152 ms
Traceroute to the external IP, on the far side of the router:
    traceroute to XXX.XXX.XXX.XXX (XXX.XXX.XXX.XXX), 30 hops max, 40 byte packets
     1  * * *
     2  * * *
     3  * * *
Traceroute to the external IP from another machine on the network, run at the same time:
    traceroute to XXX.XXX.XXX.XXX, 30 hops max, 40 byte packets
     1  XXX.XXX.XXX.XXX  1.518 ms  2.108 ms  2.715 ms
... this HAS to be a routing problem, right? The symptoms don't match *anything* else. I just can't figure out *why* the routing is busted.
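To compare the two external traceroutes side by side, here is a small sketch that reduces traceroute output to "which hops answered at all". The function name and regex are illustrative assumptions, and the sample strings are the two external traces from above with the address still masked.

```python
import re

# A hop line starts with the hop number; the header line does not.
HOP = re.compile(r"^\s*(\d+)\s+(.*)$")

FAILING = """\
traceroute to XXX.XXX.XXX.XXX (XXX.XXX.XXX.XXX), 30 hops max, 40 byte packets
 1  * * *
 2  * * *
 3  * * *
"""

WORKING = """\
traceroute to XXX.XXX.XXX.XXX, 30 hops max, 40 byte packets
 1  XXX.XXX.XXX.XXX  1.518 ms  2.108 ms  2.715 ms
"""


def answering_hops(output):
    """Return {hop_number: bool}; True if any probe for that hop got a reply."""
    hops = {}
    for line in output.splitlines():
        match = HOP.match(line)
        if match is None:
            continue  # the "traceroute to ..." header line
        replies = [tok for tok in match.group(2).split() if tok != "*"]
        hops[int(match.group(1))] = bool(replies)
    return hops


if __name__ == "__main__":
    print("broken machine: ", answering_hops(FAILING))
    print("working machine:", answering_hops(WORKING))
```

The thing this makes obvious is that on the broken machine not even hop 1 answers, while the working machine gets its reply at hop 1.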
Help me, interwebs. What am I doing wrong?
EDIT: And now the bastard thing is working. And I don't know why. And I can't break it again.
(no subject)
Date: 2009-10-02 05:28 pm (UTC)
I had a very similar situation on a Puppy Linux box. (Fuck you, don't you judge me.) Turned out the NIC I had in there stopped working properly after a system update. Switched it with another, issue resolved.
(no subject)
Date: 2009-10-02 05:39 pm (UTC)
#2: If the NIC was fucked, how am I SSH'd in this whole time? And why can it speak to all those internal IPs? Gah!
(no subject)
Date: 2009-10-05 02:12 pm (UTC)
This conjecture would be irrelevant if the machine was completely taken off power as described above while you were troubleshooting it.
If the hardware NIC has some manner of "network accelerator" or "TCP/IP offloader engine", then that would be the region of the problem. Might turn that off.
(no subject)
Date: 2009-10-02 08:58 pm (UTC)
Because the routing *was* wrong, and I *did* change it. It just didn't start working again at that point - and when it *did* start working, nobody had done anything to it in ~25 minutes. We were just brainstorming to try to find a solution.