2016-08-30

The server exploded.

As a warning, because I know the boys will tell you as soon as you walk in the door, the internet is down.

As I was debating whether or not I should leave work or try to get one more thing done first, I got a text from my wife saying the internet was down at the house. It happens from time to time. Usually, I just have to reboot something, and usually, that thing is the wireless router that most everything connects to. It's a Rosewill, which I bought mostly because my trusty Linksys WRT-54G was having a hard time keeping up with all the devices we kept adding to the mix. The Rosewill supports wireless N and has a much better signal range, but maybe once every week or so, things just go a little "wonky" and it has to be rebooted. Just this past weekend, in fact, it ended up "jamming" my home network completely, sending so much traffic (that wasn't actually going anywhere) that none of my machines could hear each other. I didn't immediately know it was the router at the time, but when I went to the basement to check the servers and all the networking gear, that's when I saw the rapid flashing on the switch connected to the Rosewill router.

I didn't expect it to be a big deal this time, either, but I got a couple more texts with some more details. Of course, my wife had tried rebooting the wireless router already. (Even the kids know that, sometimes, you just have to go over to it, pull the power plug, wait a few seconds, and plug it back in.) When that didn't help, she went down to the basement herself, and she heard some high-pitched beeping from what she described as a small box with blue lights on it. She turned it off, waited a bit, and tried turning it on again; and when it started screaming at her immediately, she just turned it back off.

From her description, I knew the device in question was the UPS. It seemed strange that the UPS would be beeping like that, unless the power was off and it was running out of batteries or something. It's a common story in tech support circles to get a call from someone who claims their computer doesn't work, and only after troubleshooting for a while does the clueless user say something like, "Well, I can't quite see, because the power is out and it's dark in here." I didn't believe my wife would fail to mention a power outage, though, so I figured it must be something else. The UPS going bad, perhaps? A tripped circuit breaker that cut the power to that outlet?

I got home and went downstairs to check things out. It was very quiet, which seemed like a bad sign. Two servers — the email server, and the main server that does just about everything else — are plugged into the UPS, but a third — the media server, which stores all our DVDs for easy access — is plugged straight into the outlet. If the UPS were bad, the media server should still have power. At this point, I'm thinking it's a tripped circuit breaker.

I went out to the power box to check the breakers, but none were tripped. I located the one going to the servers and flipped it off and on, just in case; then I headed back inside. At this point, I was getting a little concerned. We ran that power line ourselves; did we do something really wrong in the process? It had been fine for a few years, though. If it did go bad, what could I do to get power to that corner of the basement while we figured things out?

Back in the basement, the outlet still had no power. I noticed, though, that we had installed a GFI outlet. Maybe that's what tripped. I pushed the red "reset" button, and as soon as it clicked, the media server hummed to life. Ok, that's what's keeping things down. Now, I'll bring things back up and see if it trips again, and then figure out what's causing the problem. I turned on the UPS, which gave only the slightest of beeps. The email server gave a soft beep as it got power, and then….

A little background on the main server. This thing runs pretty much everything. It is the only thing connected to the cable modem on one network card, and another network card connects to the switches that distribute internet traffic to the rest of the house. It was a machine I built several years ago, picking out the parts and assembling them myself. I didn't look for anything special in the case, but the one I happened to find on sale had some interesting LED lights on the fans and a clear side window, so you can see everything inside. I didn't even know about these features of the case when I bought it; I was just looking for something that would hold all the parts together for a decent price.

Over the years, the server has been carefully configured to do everything I need it to. It has a web server, which is mostly used by my wife for her web design work. It has a minimal email server, which does some preliminary filtering before passing email on to my "real" email server inside the network. It does the firewall and routing, with some hand-crafted iptables scripts to make sure bits go where they're supposed to. It has a DNS server, which is configured to give easy access to important devices on the network by name, with the bonus that a few hundred known advertising sites are redirected to the address 0.0.0.0 as a convenient, network-wide ad block. (Fun fact: I tried to do the same with porn sites, but when I got a list of known sites and fed them into my DNS server, it promptly crashed. There were just way too many to filter out wholesale.) It also has a large file store with an FTP server used internally to back up, share, and keep files we want to hang on to.
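As an illustration of the DNS ad-block idea above, here is a minimal sketch in Python. It is not the configuration from this server, whose DNS software and file format aren't described here. It assumes the block list is kept in hosts-file style ("address name" per line), which is one common way to hand such a list to a resolver. The function name `blocklist_entries` and the example domains are made up for the illustration.

```python
def blocklist_entries(domains, sink="0.0.0.0"):
    """Return hosts-file style lines pointing each domain at the sink address.

    Blank lines, '#' comments, and duplicates are skipped, which helps keep
    the list small (a very large list is what crashed the DNS server in the
    story above).
    """
    seen = set()
    lines = []
    for raw in domains:
        # Normalize: trim whitespace, lowercase, drop a trailing root dot.
        name = raw.strip().lower().rstrip(".")
        if not name or name.startswith("#") or name in seen:
            continue
        seen.add(name)
        lines.append(f"{sink} {name}")
    return lines


if __name__ == "__main__":
    # Hypothetical input list; real lists usually come from a downloaded file.
    ad_domains = ["ads.example.com", "Tracker.Example.net ", "ads.example.com", "# comment"]
    print("\n".join(blocklist_entries(ad_domains)))
```

Any client that asks the house DNS server for one of those names gets 0.0.0.0 back, so the ad request goes nowhere without anything having to be installed on the individual machines.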

Anyway, as the main server got its turn to power up, there was a series of three or four very loud POPs, accompanied by a bright flash that could be clearly seen through the case's clear side panel. Somewhere in the middle of the popping, I shouted something that I don't quite remember. And then everything went quiet again as the GFI outlet once again tripped and cut the power. A thin tendril of blue smoke leaked out of the power supply fan of the main server, and the smell of fried electrical parts hung in the air.

I went upstairs and told my wife the bad news. The server just exploded.

My wife helped me get the server unplugged (mostly because, even with the power cut, I was still a little terrified to touch the thing after what I had just seen), and I took it upstairs where I had more light and began taking it apart. There were a few cobwebs and a lot of dust inside, but no obvious sign of what blew up. Unfortunately, there's no real easy way to tell what may be good and what may be dangerous. Unwilling to risk frying any more components than necessary, I resigned myself to having to buy a new machine and rebuild.

I can only hope at this point that the hard drives are ok. The server contained four in total — two smaller ones that held most of the OS, and two larger ones that made up the file share, each pair in a RAID-1 array. But without access to the internet, downloading the appropriate installation media would be tricky. Not that I had a replacement server handy, anyway. First things first, find a replacement.

I took a quick trip to the nearest electronics-type store, that being Best Buy. I knew it was probably a long shot going in there, and, unfortunately, I was right. Plenty of laptops and costly consumer desktop systems, but nothing that would be good for a server. I wasn't willing to overspend on a system that wasn't suited for the task.

My next bet was Micro Center, which was a half hour away. Unfortunately, that, too, was a wasted trip. Pre-assembled systems were limited to the desktop and laptop variety. They do have a large array of components for building machines from parts, but, being perfectly honest with myself, I was not in a frame of mind to start piecing one together in a hurry. If I'm going to build something, I want to take the time to research, and really put together what I want for the best value. But I needed a server, and quick. I figured Amazon was probably going to be my best bet.

In the parking lot of the Micro Center, I double-checked Amazon's site. (I had looked before I left the house, but I didn't commit to anything as I wanted to at least try to buy something from a local store that I could take home and start working on that night.) I found a couple possibilities, but my biggest issue was trying to find the internal specs on the machines. This mini-tower server looks like a good deal, but does it have the internal space and ports for four full-sized SATA hard drives? I don't know if it was because I was trying to use the mobile website, or if their site was really lacking that information, but it was really hard to track down. (I probably would have found more details on NewEgg, but I wanted to take advantage of Amazon's better prices and faster shipping.) I found one that actually included a mention of "space for 6 drives" in the description, placed the order, and elected to pay extra for one-day shipping.

On my way home, I started to go over my options. I wouldn't be able to restore the web and file server until the new machine arrived, but what could I get up and running now? I had that old Linksys wireless router, which I had installed DD-WRT firmware on, so it was very configurable and something I could really tweak. That, I figured, could take the duty of routing and firewalling for the internal network, and we would at least have internet access again. Email might be a bigger problem, though. Sure, the email server was alive, but the way I had it configured, I depended on the main server to filter email first. Maybe some of the security settings I had applied in the not-too-distant past would allow me to grant it more direct access to the internet without becoming an open relay for spam mail. But that could be a secondary task.

I got home and set to work, hooking up the Linksys router in the place of the main server. I had some issues getting it configured, since my prior tinkering with the device (when it was just a toy to play with) had left it in a weird state. I ultimately had to reset it to its default state and rebuild it from there. DD-WRT has a very convenient web-based interface, though, and it took me much less time than I expected to get things to a working state. The thing that slowed me down the most was the fact that devices on the network still remembered their configuration from the main server, and didn't immediately update to point to the Linksys router when I brought it online.

With that accomplished, I figured I'd try setting up email. I did have a few small issues configuring the network, but again it came down to having to just reboot the server a couple times to force it to update its network configuration. I had some issues from there trying to get some external email server testing programs to talk to my email server, and that slowed me down a bit. It turned out that the email server didn't take too kindly to being forcibly rebooted, and the email services just plain hadn't started up. (It's amazing how much better things can work if the expected program is actually running.) I forwarded the secure email ports through the firewall easily enough, but I wasn't too sure about opening up the unsecured email port required to let outside email come in. It turned out much better than I expected. The security settings I had enabled recently were working perfectly. I ran a couple different open relay tests against my email server (which is something I always, always do when I tinker with it; the last thing I want is to get shut down because my email server is sending out everyone else's spam mail), and it passed perfectly.
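For anyone wondering what an open relay test actually checks, here is a rough sketch in Python using the standard `smtplib` module. It is not one of the testing services mentioned above, which try many more address tricks than this. The idea: connect from outside the network, claim to be an outside sender, and ask the server to accept mail for an outside recipient. If the server answers the `RCPT TO` with a 2xx code, it is willing to relay for strangers. The function names and the `.example` addresses are placeholders invented for the sketch.

```python
import smtplib


def looks_like_open_relay(rcpt_code):
    """A 2xx reply to RCPT TO for a foreign recipient means the relay was accepted."""
    return 200 <= rcpt_code < 300


def probe_relay(host, port=25,
                sender="probe@outside.example",
                recipient="someone@elsewhere.example",
                timeout=10):
    """Ask `host` to relay between two addresses it is not responsible for.

    Stops after RCPT TO, so no message is ever sent. Run it from a machine
    outside the network, since servers often relay for local clients on purpose.
    """
    with smtplib.SMTP(host, port, timeout=timeout) as smtp:
        smtp.ehlo_or_helo_if_needed()
        mail_code, _ = smtp.mail(sender)
        if not 200 <= mail_code < 300:
            return False  # the server refused the outside sender outright
        rcpt_code, _ = smtp.rcpt(recipient)
        return looks_like_open_relay(rcpt_code)
```

A server that is locked down properly typically answers the recipient step with a 5xx rejection, in which case `probe_relay` returns `False`.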

So, now I'm back up and running with internet access and email. The major items are taken care of, so everything else from here can get rebuilt on a much less rushed timeline. (Still want to do it quickly, but it doesn't have to be done yesterday.)

Time to count the blessings and see what I learned.

The biggest blessing is that nothing burned down. The GFI outlet tripped, but the UPS at that point should have still been providing power. Near as I can figure, it also detected something was wrong and cut power, then beeped as an alarm. When my wife turned it off and back on, it must have been able to still detect the problem and not try powering on the server. I'm not sure what changed when I got to it later, but when I tried turning things on and it started making loud boomy noises, the GFI tripped again and the UPS just shut itself off immediately. If it hadn't, there could have been much more damage done, and possibly an electrical fire as well. (I'm still keeping my fingers crossed that the hard drives aren't fried.)

We're up and running. Email and internet are the most important things we have to keep going, especially with one child doing homeschool and taking lessons over the internet. I pay for a backup email server that, when our server is down, will receive and hold our email in a queue until our server comes back online; so we haven't lost any email.
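The usual mechanism behind a backup email server like this is a pair of DNS MX records with different preference numbers, though the exact setup isn't spelled out here, so treat that as an assumption. Sending servers try the lowest number first and fall through to the next one when it doesn't answer. A small Python sketch of that ordering, with made-up host names and helper names:

```python
def delivery_order(mx_records):
    """Order (preference, host) pairs the way a sender tries them: lowest number first."""
    return [host for _, host in sorted(mx_records)]


def first_reachable(mx_records, is_up):
    """Return the first host in delivery order that `is_up` reports as reachable.

    Returns None when nothing answers; a real sender would then keep the
    message in its own queue and retry later.
    """
    for host in delivery_order(mx_records):
        if is_up(host):
            return host
    return None


if __name__ == "__main__":
    # Hypothetical records: the home server is preferred, the paid backup is second.
    records = [(20, "backup.mail.example"), (10, "mail.home.example")]
    home_is_down = lambda host: host != "mail.home.example"
    print(first_reachable(records, home_is_down))
```

With the home server down, mail lands on the backup host, which holds it and hands it over once the primary is reachable again.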

With the Linksys router doing the routing and firewall duties, I can rebuild the main server and just keep it behind the firewall itself, without having to configure it for routing as well. Whenever the server has an issue in the future, I won't have to bring down the whole network. Plus, keeping the file share off of the computer exposed to the internet is a better setup anyway.

I should probably look into offsite backups. While I can hope that the hard drives didn't get fried, if it turns out that they did, I could be up a very smelly creek in a barbed wire canoe without a paddle.