Switches Gone Wild

Had an interesting network error the other day. Very suddenly, while I was paying bills online and my kids were watching Doctor Who on Netflix, the network became completely unresponsive. My first thought was that I had lost my connection to the internet again. But when I tried to connect to my server to check, I couldn't reach that machine either. Even when my internet connection drops, I don't lose connection to machines on the internal network.

I went downstairs to check the server. It appeared to be running normally, and I was able to log in to the machine using the keyboard and ancient CRT monitor I keep plugged in for emergencies. I tried resetting the network stack. Curiously, it threw an error message while bringing up the network card for the internal network, something about an inability to allocate memory. But the external interface came up without error. I tried a second time with the same result: the internal interface still couldn't reach any other machine.

Thinking there shouldn't be anything wrong with the network adapter itself, I spent quite a bit of time checking my iptables rules. I assumed there must be something in there preventing packets from the internal network from getting processed. Perhaps I had set the wrong rule when I was trying to get a couple new devices to play nice with the email server. But why would it work fine for a while and only now, many days later, decide to drop internal traffic? And why couldn't I fix it? Even when I dropped all filtering rules, I couldn't get any traffic through my internal network to that network card.

Not long ago, my server failed to get an IP address from my ISP, and for whatever reason, it wouldn't get one until I gave up and rebooted it. Figuring this was another symptom of whatever the problem was before, I just rebooted the server. It came up again, with only the errors I've seen before (fairly obscure warning messages that have never caused any noticeable issues). It acquired an address from the ISP quickly enough, but still, the internal interface didn't appear to respond.

I finally decided to take a close look at the network switch. One light was flickering rapidly: the one on the port connected to the entertainment center. This seemed odd, since the only devices on at the time were the TV (which does have an Ethernet port, but its network features have been pretty useless so far) and the Blu-ray player (which had been streaming Netflix before the network crashed, was then powered off by the kids, and, since I had turned it back on to check, was displaying a screen complaining about a lack of network connectivity).

Just to check if it was a bad port, I pulled the cord for the entertainment center and plugged it into an empty port. The light on that port came on solid, then started to flicker rapidly as before. I then unplugged the switch's power cable, waited several seconds, and plugged it back in. Each of the lights cycled in sequence as the switch went through the startup sequence, then all lights on ports connected to live devices turned on solid. Some started to blink, and the one connected to the entertainment center started to flicker rapidly again.

I went up to the entertainment center to check on the switch there. The lights on ports connected to the TV and Blu-ray player were on solid. Only the light on the port connected to the wall (and back down to the main switch) was blinking, and it was flashing rapidly.

This was seriously odd. No lights connected to any computers or devices were blinking with any intensity — only the lights on the switches that connected to each other. Were the switches generating their own traffic, talking back and forth to each other? These are fairly inexpensive, unmanaged switches, with no network address of their own to speak of; what could they possibly be saying to each other, and how?

I unplugged the power cord on the switch behind the entertainment center, waited a few seconds, and plugged it back in. All its lights came on briefly as it powered up, and then finally, mercifully, the lights on all connected ports came on and stayed solid, including the light on the port leading to the switch by the server. I checked the Blu-ray player and my laptop, and both were able to connect to the server and, by extension, the internet.

I still have no clue what was causing all that traffic. I did notice that, when the Blu-ray player started streaming Netflix again, the lights on the two ports of that switch that connect to the wall and to the Blu-ray player were flickering quickly, looking much like the flickering on the one port when things weren't working (although there's no way to tell by sight if the flickering was exactly the same). Near as I can figure, there was so much of this mysterious traffic that it jammed the main switch so thoroughly that no traffic could get through any other port. (Since both are gigabit switches, traffic between them certainly could reach that volume. The only other gigabit network device on my home network is the network adapter on the server facing the internal network, but the light on the switch on that port wasn't showing anything but the most rudimentary activity.)
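Judging traffic by blinking lights only goes so far. Since I could still log in at the server's console, one way to put a number on it from that side would have been to sample the kernel's per-interface counters twice and compare. The sketch below is only that, a sketch: it assumes a Linux server (which mine is, given the iptables rules) and reads /proc/net/dev, and the function names are my own invention for illustration. It can only show what reaches the server's own interfaces, not what two switches are saying to each other.

```python
import time


def parse_proc_net_dev(text):
    """Parse the text of /proc/net/dev into {interface: counters}."""
    stats = {}
    # The first two lines of /proc/net/dev are column headers.
    for line in text.splitlines()[2:]:
        if ":" not in line:
            continue
        name, fields = line.split(":", 1)
        values = [int(v) for v in fields.split()]
        if len(values) < 16:
            continue  # not the 8 receive + 8 transmit columns we expect
        stats[name.strip()] = {
            "rx_bytes": values[0],
            "rx_packets": values[1],
            "tx_bytes": values[8],
            "tx_packets": values[9],
        }
    return stats


def rates(before, after, seconds):
    """Per-second change in each counter between two samples."""
    result = {}
    for name, new in after.items():
        old = before.get(name)
        if old is None:
            continue  # interface appeared between samples
        result[name] = {key: (new[key] - old[key]) / seconds for key in new}
    return result


def sample(path="/proc/net/dev"):
    """Read and parse the current counters (Linux only)."""
    with open(path) as handle:
        return parse_proc_net_dev(handle.read())


if __name__ == "__main__":
    interval = 1.0  # seconds between the two samples
    first = sample()
    time.sleep(interval)
    second = sample()
    for name, r in sorted(rates(first, second, interval).items()):
        print(
            f"{name}: rx {r['rx_packets']:.0f} pkt/s "
            f"({r['rx_bytes'] * 8 / 1e6:.2f} Mbit/s), "
            f"tx {r['tx_packets']:.0f} pkt/s "
            f"({r['tx_bytes'] * 8 / 1e6:.2f} Mbit/s)"
        )
```

Run on the server while the lights are going crazy, this would at least confirm whether any of the flood was arriving at the internal interface or whether, as the nearly idle light on that port suggested, it was all staying between the two switches.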
