Sunday, August 21, 2016

The Death of Veracity

Your vocabulary word for today, students, is "veracity". You may notice that Wikipedia re-directs to the page for "honesty". But this post is about a computer, not about moral integrity.

How we came to name the machine "veracity" is a sentimental saga for another season. The soggy sadness of today is the fallout from Veracity (the machine) having silently succumbed to an as-yet unknown hardware failure.

In recent weeks I've gotten a number of alarms saying "you need to take backups". We seem to have had a rash of machine failures. Veracity's demise is the second outage this week alone. (The other was "the day the WiFi died", which I'll describe in a separate post.)

This is a disaster recovery story, a D/R tale with a happy ending.

Veracity and Virtualization

The good news is that the systems hosted on Veracity appear to be intact. One of them, Zechariah by name, is our primary IPv6 gateway server. It is up again, now that its system disk and root disk have been copied to another virtualization host. Whatever failed on Veracity, thankfully the disk was okay.

The other guest, Jeremiah, followed soon after. It acts as our "main" server (email, files, print, DNS, DHCP). But I had gotten lax about backups. The D/R plan for Jeremiah was that if it failed we'd switch the IP address for main over from Jeremiah to Nehemiah. While it lived, Nehemiah held regular backups of Jeremiah's content. We did switch between the two once or twice in those days.

This method of using a similar-but-not-identical system for failover goes back before we had virtual machines on our little network. Where physical systems are involved, the historical plan for D/R is to have another system with the same or better capability standing ready to pick up the load. I was introduced to virtualization in 1982 but pervasive PC-think prevented me from applying tried and true V12N methods to personal systems. Bummer.

It began to dawn on me that we don't need a secondary server for a virtual system. All we really need is a copy of that system, a clone. Call it replication. Then when disaster strikes, bring up the clone on a designated alternate hypervisor: no moving around of IP addresses, no quirks from the subtle differences between the recovery system and the real system. A copy of a virtual machine is a perfect substitute because it's not actually a substitute. They're more identical than twins. 
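For the curious, the replication step can be sketched in a few lines of Python. This is only an illustration, not the procedure I actually ran: the function names and the idea of a replica directory on the alternate hypervisor are assumptions, and the guest should be shut down (or its disk otherwise quiesced) before the image is copied.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replicate_image(source: Path, replica_dir: Path) -> Path:
    """Copy a guest disk image into replica_dir and verify the copy.

    replica_dir is a hypothetical location visible to the alternate
    hypervisor, for example a mounted share or a staging directory.
    """
    replica_dir.mkdir(parents=True, exist_ok=True)
    replica = replica_dir / source.name
    shutil.copy2(source, replica)  # copies content plus timestamps and mode
    if sha256_of(source) != sha256_of(replica):
        replica.unlink()
        raise IOError(f"replica of {source} does not match the original")
    return replica
```

Run on a schedule, something like this keeps a bootable clone waiting on the other host, which is the whole point: the recovery system is the real system.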

Replication and Recovery

Zechariah and Jeremiah are in better shape now than they were before the mishap. The host hardware to which they got moved has KVM. Previously they were Xen guests. Not complaining about Xen, but the change forced me to make some adjustments that had been put off, things that needed to be done anyway. They were already configured to share the O/S (another fun rabbit trail, maybe another blog post). They share a common system disk image, now corrected for KVM. (They each have their own root filesystem.) Once the KVM changes were done for one, the other instantly got the same benefit.

I had more success recovering Zechariah and Jeremiah than with this blog post. (Don't get me started about how the Blogger app likes to lose draft updates.) 

NORD to the Rescue

As it happened, I had a newer kernel for NORD than that of the SUSE release Jeremiah and Zechariah run, and I already had a KVM-bootable system disk. So I copied NORD's kernel to the SUSE system disk and re-stamped the bootstrap. Generally, one simply brings in the kernel and its related modules. As long as the kernel and modules are at least as new as the ones they replace, and within the same generation, the userland parts should have few problems. It works.
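That rule of thumb can be written down as a small check. The sketch below is my own reading of it, not anything NORD or SUSE ships: it assumes "same generation" means the same major version number, and the release strings in the usage lines are made up for illustration.

```python
import re


def parse_release(release: str) -> tuple:
    """Turn a kernel release string such as '4.4.14-default' into (4, 4, 14)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    # A missing patch level (as in '4.4') is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def is_safe_swap(current: str, candidate: str) -> bool:
    """Apply the rule of thumb: same generation and at least as new.

    Assumption: 'generation' is taken to mean the major version number.
    """
    cur, cand = parse_release(current), parse_release(candidate)
    same_generation = cur[0] == cand[0]
    return same_generation and cand >= cur


# Hypothetical release strings, for illustration only.
print(is_safe_swap("4.1.27-default", "4.4.14"))  # True
print(is_safe_swap("4.4.14", "4.1.27-default"))  # False: candidate is older
print(is_safe_swap("3.16.7", "4.4.14"))          # False: different generation
```

It says nothing about whether the matching modules are actually present on the disk, which still has to be checked by hand.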

Note: this is a good reason to sym-link /lib/modules to /boot/modules and commit to a replaceable /boot volume. It's a feature of NORD but trivial with any Linux distro.
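Here is a minimal sketch of that sym-link arrangement, for anyone who wants to try it on a distro other than NORD. The function name and the root argument are hypothetical; the root argument is there so you can rehearse on a scratch directory or a mounted guest image before touching a live system.

```python
import os
import shutil
from pathlib import Path


def relocate_modules(root: Path) -> Path:
    """Move <root>/lib/modules to <root>/boot/modules and leave a symlink.

    root is the top of the filesystem tree to change: Path('/') for a
    running system, or the mount point of a guest's system disk.
    """
    lib_modules = root / "lib" / "modules"
    boot_modules = root / "boot" / "modules"

    if lib_modules.is_symlink():
        return lib_modules  # already relocated; nothing to do

    if lib_modules.is_dir():
        if boot_modules.exists():
            raise FileExistsError(f"{boot_modules} already exists")
        boot_modules.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(lib_modules), str(boot_modules))
    else:
        boot_modules.mkdir(parents=True, exist_ok=True)
        lib_modules.parent.mkdir(parents=True, exist_ok=True)

    # Use a relative target so the link still resolves when the tree is
    # mounted somewhere else. Resolve first in case lib is itself a symlink.
    target = os.path.relpath(boot_modules.resolve(), lib_modules.parent.resolve())
    os.symlink(target, lib_modules)
    return lib_modules
```

With the modules living on /boot, swapping in a new kernel means replacing one volume, and the kernel and its modules can never drift apart. The price is that /boot has to be mounted whenever modules are loaded.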

KVM performance sucks relative to that of Xen. Any time you can use para-virtualization (instead of full hardware emulation) you're going to see better performance. Xen was originally para-virt-only and continues to strongly support para-virtualization. But we're using KVM for the sake of manageability. The guests can run with no knowledge of the hypervisor. (We can always switch to para-virt later, selectively per device.) And these guests aren't doing heavy multi-media work. Performance is sufficient. Presence is paramount.

You can see Zechariah for yourself. (You'll need IPv6 to hit it.) The web content is presently unchanged, demonstrating the effect of a replicated virtual machine. An update with part of this story will be forthcoming. Jeremiah's connectivity is more controlled, not generally reachable from outside.

-- R; <><