Friday, October 18, 2013

Lessons 67 and 68 - Internet Outage

So far, I've lost 8 hours and $150.  But we have internet again.

It was Tuesday. I am normally at the office on Tuesdays. But my teammate was out and I had a full plate, so I thought I'd save the commute time and work from home. While sitting with my wife and contemplating the day before us, I noticed the lights blink. We both heard a beep.  Right away I knew two or three appliances had reset, including a server or two.  No biggie, or so I thought.

When we have an outage, I try to learn from the mistakes. Whether I missed a step or some system has let me down, it's a healthy challenge to review and adjust.

Mistake Number 1 - delayed UPS battery replacement

The beep was from the UPS which covers our "important" computer gear: the cablemodem, the router, and the main server.  The BSL1079 (lead/acid "gel cell") failed long ago, but I had been using an aging car battery.  This was not simply putting things off. The car battery, even aging, has more than ten times the capacity of the normal UPS battery.  But either the battery had aged more than I knew or my spit-n-bailin-wire rigging had loosened up.  This was totally my fault.

Lesson: just order the [expletive deleted] normal battery and do high capacity as a separate project.

Mistake Number 2 - mixing services ... and service levels

Some time back, all our stuff hubbed off a server called "main": NFS, YP/NIS, SMB, NTP, DNS, SMTP, IMAP, internal HTTP, and notably DHCP. Most of these services have since been doled out to dedicated appliances or to service providers.  The exception is DHCP.  So when "main" powered back up, it had its old filesystems to check. (Things did not come down clean, so an integrity check was warranted.) DHCP had to wait until that was done.

The filesystems are still used, but with a lower service level requirement. DHCP has a much higher service level requirement, especially with increased WiFi. So the idea that a high priority service is waiting behind a lower priority service is bass-ackards (as we say in Texas).  This will change.  My fault, there's history.

Lesson: consider service requirements and plan accordingly. (DHCP will move)
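
The inversion described above (DHCP stuck behind a filesystem check) is the kind of thing that can be spotted on paper before the next outage. A rough Python sketch follows; the service names, the priority numbers, and the dependency table are illustrative assumptions, not the actual boot configuration of "main".

```python
# Hypothetical service table: a lower number means a stricter service level.
PRIORITY = {"dhcp": 1, "dns": 1, "nfs": 3, "fsck-old-filesystems": 3}

# Hypothetical boot dependencies: each service starts only after these finish.
WAITS_ON = {
    "dhcp": ["fsck-old-filesystems"],
    "nfs": ["fsck-old-filesystems"],
    "dns": [],
    "fsck-old-filesystems": [],
}


def priority_inversions(priority, waits_on):
    """Return (service, blocker) pairs where a stricter service waits on a looser one."""
    found = []
    for service, blockers in waits_on.items():
        for blocker in blockers:
            if priority[service] < priority[blocker]:
                found.append((service, blocker))
    return sorted(found)


if __name__ == "__main__":
    for service, blocker in priority_inversions(PRIORITY, WAITS_ON):
        print(f"{service} (priority {PRIORITY[service]}) waits on "
              f"{blocker} (priority {PRIORITY[blocker]})")
```

With this made-up table, only DHCP is flagged: NFS also waits on the check, but at the same service level, so that wait is tolerable.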

Mistake Number 3 (not mine) - deceptive diagnostics (and this was the worst)

With the server back, and the network units functioning normally (I thought), I checked on our IPv6 tunnel server.  This is a Xen virtual machine hosted by the same physical box as "main".  Native IPv6 is not available yet where we are, but the SixXS tunnel does nicely.  But the tunnel wasn't starting.

Turned out that IPv4 connectivity was still down.
Turned out that the router had no DHCP lease from our ISP.
After multiple (controlled) on/off cycles of both the cablemodem and router, the relationship was still "we're not talking".  Activity lights, yes, but all zeros for the external address. Plugged a laptop directly into the cablemodem; got a lease!  So that indicated clearly the router had failed. Clearly.
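
The "all zeros" reading is worth checking mechanically rather than by eye. A small Python sketch, using only the standard `ipaddress` module, that sorts a reported external address into rough categories; the function name and the category labels are my own invention for illustration, not anything the router or the provider exposes.

```python
import ipaddress


def wan_lease_status(address):
    """Classify the external IPv4 address a router reports (labels are hypothetical)."""
    ip = ipaddress.ip_address(address)
    if ip.is_unspecified:
        return "no lease"        # 0.0.0.0: nothing was assigned
    if ip.is_link_local:
        return "self-assigned"   # 169.254.x.x: the client gave up on DHCP
    if ip.is_private:
        return "private"         # RFC 1918 space: not routable on the internet
    return "public"


if __name__ == "__main__":
    for addr in ("0.0.0.0", "169.254.10.20", "192.168.100.10", "8.8.8.8"):
        print(addr, "->", wan_lease_status(addr))
```

The order of the checks matters: the unspecified and link-local addresses also count as "private" in Python's classification, so they are tested first.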

This NetGear router has been giving us a little trouble on the WiFi side. Dunno if it is just RF interference or perhaps something we can blame on the internet provider.  There are gaps in 802.11 coverage inside the house, dead zones. So off I went to Best Buy, returning with a shiny new LinkSys "AC" model, with A/B/G/N backward compatibility.

The new router also failed to obtain a DHCP lease.  Huh?!?

Pause and reflect:  Old router, no lease.  Laptop, yes lease.  But new router, also no lease. The cablemodem is just not smart enough (one would think) to distinguish between a "computer" and a "router". I had already tried cloning the MAC address of the laptop to the old router so it would look to the provider's DHCP server "just like the laptop" (which had succeeded in DHCP).  No joy.

This is where I lost the rest of the day ... multiple reboots, power on/off cycles, WiFi reassociations, and DHCP transactions (on the "inside").  Cablemodem worked with two computers, failed on one, failed on both routers.  The new router got a DHCP lease on our internal LAN, and then so did the old router.  [sigh]   (I could have gotten a better price on the new router if I had not been in a rush from the outage.)

What a waste of time.

What finally worked was to put our old "firewall" on the cablemodem. This is a Linux box with two ethernet ports.  That's one I got right. And the "mistake" was misleading cues from Time Warner Cable's device.

Lesson: hang onto what worked before, at least one generation. Maybe consider changing internet providers!

Mistake Number 4 - too much reliance on internet (?)

Uhh... maybe not.
We do *business* online.
For most transactions, using the Internet is no less legitimate, even more reliable, than using the telephone.  So what are you gonna tell me? Too much reliance on the phone?

I grant you that if we rely on internet for these things we need to have reliable alternatives, whether a habitual fall-back to voice or perhaps redundant internet service.  Yeah ... that's it.  Redundant internet service.  In a prior job, my employer paid for internet and we paid for a second line, keeping personal and "work" on separate channels. WHEN (not if) there was a failure on one, I would switch all traffic to the other.
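
The switch-over decision itself is simple logic. A minimal Python sketch, assuming two uplinks with made-up names and a health check supplied by the caller (a ping or a TCP connect, say); it only picks which line to use and does not touch routing, and it is not what that old setup actually ran.

```python
def pick_uplink(uplinks, is_up):
    """Return the first uplink, in preference order, that passes the health check.

    uplinks: list of uplink names, most preferred first (names are hypothetical).
    is_up:   caller-supplied function taking a name and returning True or False.
    """
    for name in uplinks:
        if is_up(name):
            return name
    return None  # both lines are down: time to pick up the phone


if __name__ == "__main__":
    # Stand-in health table; a real check would probe each line.
    status = {"work-line": False, "home-line": True}
    print(pick_uplink(["work-line", "home-line"], status.get))
```

Keeping the health check as a parameter means the same logic can be exercised with a fake status table, as above, before trusting it with a real probe.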

The idea that Joe Suburbia have two internet contracts is silly.
The idea that my family have two internet contracts is less so, but still pricey.

Lesson: be prepared, and have alternatives.

I wrote this in a hurry because I thought it should be dispatched quickly. Hopefully it's not too jumbled.  Hopefully your internet experience is more effective.

-- R; <><