Since our last update I’ve encountered a couple of bugs where the problem had been hiding in plain sight, in one case for weeks and in another for a year or more. Discovering the root causes of long-term, subtle bugs is both satisfying and important for creating a better product.
The self-hiding log message
In our second most recent release, Beta 8, there was a neat little line in the changelog mentioning:
Multihomed exits are now on by default, increasing reliability.
This is a feature I've been working on in the background for quite some time. Most modern web applications are hosted in clusters of servers so that any one can fail without taking the whole service down with it.
'Multihomed exits' describes moving our exit servers into this sort of architecture, where one or more could fail without significant impact on their users.
This is actually quite technically challenging because of the encryption we are using. WireGuard provides Althea connections with a property called perfect forward secrecy. This means the server uses its public encryption key to negotiate another encryption key that will be discarded by both parties at a later time.
Even if a hostile entity decides to capture traffic for years and then some day gets its hands on the exit’s encryption key, it still won’t be able to decrypt the traffic captured years ago, because that traffic is actually encrypted with a session key that was generated by both parties and then scrubbed from existence after exactly a minute.
While this is an excellent privacy feature, it means we can’t just redirect traffic from one exit to the other. Even if they share the same encryption keys, the session key would need to be securely renegotiated on demand, a feature WireGuard does not provide.
Cloudflare ran into this same issue with their WireGuard-based WARP VPN. Their solution is actually remarkably similar to ours, which I find somewhat disappointing as I was hoping to find a superior one.
A single Cloudflare server failing won’t take down everyone on their WARP service, but it will disrupt connectivity for all the users on that machine for a minute.
Cloudflare uses their own packet tagging system to achieve the same effect that we get by redirecting traffic with Babel. The connection drops for 60 seconds while the user waits for WireGuard to renegotiate the key, and then the user is back online on the other exit.
Obviously that's a pretty disruptive wait, but it sounds easy to deal with: just make sure that traffic doesn’t get redirected unless the primary server is actually down.
As I learned the hard way, that’s not actually so easy. Babel is designed to switch connections proactively as it tries to maintain a solid network out of a series of unreliable links.
I thought I could stay ahead of this by configuring Babel with a very strong preference for one exit server. I tried this solution and, after some testing and the design of a monitoring tool, moved it into production.
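To make the idea of a "very strong preference" concrete, here is a minimal Python sketch of that kind of selection rule. It is a conceptual model only, not babeld configuration syntax or Babel's actual route selection code: the function name `pick_exit` and the `penalty` value are assumptions made up for illustration. The one real detail is that Babel treats a metric of 0xFFFF as unreachable.

```python
INFINITY = 0xFFFF  # Babel treats this metric value as "unreachable"


def pick_exit(primary_metric, secondary_metric, penalty=10_000):
    """Choose an exit, strongly preferring the primary.

    Lower metrics are better. `penalty` is a hypothetical handicap added
    to the secondary exit so that it only wins when the primary is
    unreachable or dramatically worse.
    """
    primary_down = primary_metric >= INFINITY
    secondary_down = secondary_metric >= INFINITY

    if primary_down and secondary_down:
        return None  # nothing usable
    if primary_down:
        return "secondary"
    if secondary_down:
        return "primary"
    # Both exits are reachable: the secondary must beat the primary
    # even after carrying the penalty.
    if secondary_metric + penalty < primary_metric:
        return "secondary"
    return "primary"
```

Note that under this rule a large enough spike in the primary's metric still flips traffic to the secondary, even though the primary never actually went down. A strong preference makes unwanted switches rarer, it does not rule them out.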
Everything seemed perfect, even on prolonged observation. I congratulated myself on a job well done and moved on.
Over the next few weeks, reports of momentary connectivity outages rolled in, but none of my monitoring was being tripped.
I was puzzled for a few weeks. The reports were very infrequent, but I couldn’t deny that something was going on that I couldn’t see.
Eventually I realized that I had designed my monitoring with a fatal flaw: I was using the connection that was being disrupted to monitor connection disruption. Very short disruptions that resolved themselves in under a minute would have their notification messages lost, as the connection to the new exit server would never finish being established.
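One general way to avoid this class of blind spot is to record events locally first and only drop them once delivery has succeeded. The sketch below is illustrative and is not the monitoring tool described above. The class name `OutageLog` and the `send` callback are hypothetical, and it assumes `send` raises `OSError` when the link is down.

```python
import time
from collections import deque


class OutageLog:
    """Queue outage events locally and deliver them once the link works."""

    def __init__(self, send, clock=time.time, max_events=1000):
        self._send = send      # hypothetical callable, raises OSError on failure
        self._clock = clock    # injectable so the example is easy to test
        self._pending = deque(maxlen=max_events)  # oldest events drop first

    def record(self, message):
        # Timestamp at the moment of the event, not at delivery time.
        self._pending.append((self._clock(), message))

    def flush(self):
        """Deliver queued events in order; return how many were sent."""
        delivered = 0
        while self._pending:
            event = self._pending[0]
            try:
                self._send(event)
            except OSError:
                break  # link still down, keep the event for the next attempt
            self._pending.popleft()
            delivered += 1
        return delivered
```

Calling `flush()` periodically means a disruption that heals in under a minute is still reported afterwards, with the timestamp of when it actually happened.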
After rolling back the change, the problems abated. Cloudflare avoided solving this problem for good reason: it’s complicated and will require nothing short of shared cryptographic session state across a cluster of exits, a challenge it seems neither of us is really prepared to take on at the moment.
For the time being we will have to satisfy ourselves with scripted failover to our secondary exits rather than the dynamic, latency- and packet-loss-detecting failover that Babel would have allowed.
Nodes won’t accept new connections after months of uptime
As our test network grows larger the amount of node churn any individual node sees inevitably increases.
We started to run into another puzzling bug: after several thousand different node connections and reconnections, an Althea router would be unable to successfully connect with new peers.
This condition would take weeks or months of uninterrupted uptime to appear on production routers, but appear it did. The root cause turned out to be rather interesting.
Babel (our mesh daemon) needs to be started using an interface. But since all of the interfaces we talk to peers over are virtual encrypted links created by Rita (our billing daemon), it’s sort of a chicken-and-egg problem. My ‘clever’ solution was to launch Babel bound to the system’s loopback interface.
This turns out to trigger buggy behaviour in Babel, where any interface going from an active to an inactive state will trigger a memory allocation. It’s pretty similar to the bug outlined here. At first glance this sounds like the sort of memory leak that would eventually crash the program, which would be fine, as it would be restarted by our downtime watchdog script.
But due to a quirk of exactly how interface pointers are allocated, only additional virtual memory is consumed, presumably because malloc is run but the memory is never written to before the handler exits. So eventually Babel would be stuck, unable to allocate additional memory from the operating system but otherwise continuing to function perfectly.
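A leak like this is invisible to a watchdog that only reacts to crashes, but on Linux it does show up as a growing gap between a process's virtual size and its resident size. The following Python sketch parses the `VmSize` and `VmRSS` fields from the text of `/proc/<pid>/status`. The function names and the `ratio` threshold are assumptions chosen for illustration, and this is not the fix that shipped.

```python
def parse_vm_fields(status_text):
    """Extract VmSize and VmRSS (both in kB) from /proc/<pid>/status text."""
    fields = {}
    for line in status_text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("VmSize", "VmRSS"):
            fields[key] = int(rest.split()[0])  # lines look like "VmRSS:  1500 kB"
    return fields


def looks_like_virtual_leak(fields, ratio=50):
    """Flag a process whose virtual size dwarfs its resident memory.

    `ratio` is an arbitrary illustrative threshold, not a tuned value.
    """
    vm_size = fields.get("VmSize")
    vm_rss = fields.get("VmRSS")
    if not vm_size or not vm_rss:
        return False  # fields missing, nothing to judge
    return vm_size > ratio * vm_rss
```

In practice the input would come from reading `/proc/<pid>/status` for the daemon's process ID, and a sensible threshold would have to be found by watching a healthy process over time.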
Our latest stable release, Beta 9, incorporates a fix for this problem. Ironically, it turns out that feeding Babel the name of an interface that doesn’t exist is much safer and more gracefully handled.
Another lesson for the ‘too clever for my own good’ pile.
We’ve been working on stabilizing the Beta 9 release for the last couple of weeks, and with Beta 9 RC5 we’re going to call it a wrap and move on to more ambitious features for Beta 10. We really wanted a stable base for our first Altheahoods as they get off the ground, and Beta 9 will provide that.
At this point we have a pretty rock solid product where nodes can easily accumulate months of uptime without failures. The primary goal of all future development is to add new functionality without changing that.
Beta 10 will include compressed telemetry logging, which saves quite a bit of upload data for large nodes such as gateways, which often have telemetry on. It will also include our first foray into IPv6 dualstacking for user devices, something we’ve avoided so far for compatibility reasons despite our own internal network being IPv6 native.
As user counts grow, we really don’t have any choice but to get dualstacking (providing user devices both an IPv4 and IPv6 default route simultaneously) figured out. Otherwise our exits will require an unsustainable number of external IPv4 addresses to map NAT traffic.
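A rough back-of-the-envelope calculation shows why. Each external IPv4 address only has so many ports available for NAT mappings, so the number of addresses an exit needs grows linearly with users. The Python sketch below uses made-up numbers throughout: the user count, the ports per user, and the choice to reserve the first 1024 ports are all assumptions, not measurements from our network.

```python
def ipv4_addresses_needed(users, ports_per_user, usable_ports_per_address=65536 - 1024):
    """Estimate external IPv4 addresses required to NAT a set of users.

    All inputs are illustrative assumptions. Integer ceiling division is
    used so that any partial address counts as a whole one.
    """
    total_ports = users * ports_per_user
    return -(-total_ports // usable_ports_per_address)


# Hypothetical exit with 10,000 users each holding 500 concurrent mappings.
ipv4_only = ipv4_addresses_needed(10_000, 500)

# Same exit if dualstacking moved half of each user's flows onto IPv6.
dual_stack = ipv4_addresses_needed(10_000, 250)
```

With these assumed numbers the estimate drops from 78 addresses to 39, since every flow that can travel over IPv6 no longer needs a NAT mapping at all.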
Beta 11 will update Babel to include upstream fixes and add the ability to set per-interface prices. This is very important for relays that serve both individual users and other relays. Otherwise you have to give either all the direct users the wholesale rate or your wholesale buyer the user rate. Neither is sustainable as networks get deeper and less tree-shaped.
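As a minimal sketch of what per-interface pricing means for a relay, consider a price table keyed by interface name with a fallback default. This is purely illustrative and is not how Babel or Rita will implement it: the class `RelayPricing`, the interface names, and the price units are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RelayPricing:
    """Hypothetical price table for a relay, in arbitrary units per byte."""

    default_price: int                                  # rate for direct users
    per_interface: dict = field(default_factory=dict)   # overrides by interface name

    def price_for(self, interface: str) -> int:
        # Fall back to the default when no override exists for this link.
        return self.per_interface.get(interface, self.default_price)


# A relay charging direct users 10 and one wholesale relay peer 4.
pricing = RelayPricing(default_price=10, per_interface={"wg_relay0": 4})
wholesale_rate = pricing.price_for("wg_relay0")
user_rate = pricing.price_for("wg_user3")
```

With a single global price, both lookups would have to return the same number, which is exactly the wholesale-versus-retail squeeze described above.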