Althea Development Update #67: Chasing ghosts in the machine

Once you design a distributed system to handle any single point of failure, you don't stop having failures; they just get more complicated and interconnected.

Beta 3 is finally rolling out, after several weeks of furious QA. Here are some example bugs.

1) Fixed a race condition where the Linux kernel doesn't assign fe80 link-local IPs to interfaces despite completing the 'network' boot step. This occurs in no more than one in several dozen reboot cycles.

2) Fixed a bug where, after about two years of power-on time (across the fleet of devices), the local server responds improperly to requests. The fix for this is actually the same as the fix for the lack-of-recovery bug we mentioned in the last update.
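To make bug (1) concrete, here is a minimal sketch of how one might detect the symptom on a Linux machine: poll until an interface actually holds an fe80::/10 address instead of trusting that the 'network' boot step finished. This is not the fix that shipped in Beta 3. The function names, timeout, and polling interval are all hypothetical and chosen for illustration; the only real pieces are the Python standard library and the kernel's /proc/net/if_inet6 table.

```python
import ipaddress
import time


def link_local_interfaces(if_inet6_text):
    """Return the names of interfaces that hold an fe80::/10 address.

    Expects text in the layout of /proc/net/if_inet6: six whitespace-separated
    fields per line, with the address as 32 hex digits first and the device
    name last.
    """
    found = set()
    for line in if_inet6_text.splitlines():
        fields = line.split()
        if len(fields) != 6:
            continue  # skip blank or malformed lines
        addr = ipaddress.IPv6Address(int(fields[0], 16))
        if addr.is_link_local:
            found.add(fields[5])
    return found


def read_proc_if_inet6():
    """Read the kernel's IPv6 address table (Linux only)."""
    with open("/proc/net/if_inet6") as table:
        return table.read()


def wait_for_link_local(name, read_table=read_proc_if_inet6, timeout_s=10.0,
                        poll_s=0.5, sleep=time.sleep, clock=time.monotonic):
    """Poll until interface `name` has a link-local address or time runs out.

    The defaults are illustrative assumptions, not values from Althea.
    """
    deadline = clock() + timeout_s
    while True:
        if name in link_local_interfaces(read_table()):
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

A caller could run wait_for_link_local("eth0") after boot and log or retry interface setup when it returns False. The reader and sleep functions are passed in so the polling logic can be exercised without real hardware, which matters for a bug that shows up this rarely.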

There's a good reason why the basics of home networks haven't changed in decades. Debugging a program running in an uncontrolled environment out in the field requires a huge amount of intuition and often more than a little luck.

I only figured out what was going on with (1) because I stumbled on a machine that did it every single time. As for (2), it's a bug we've been seeing in one form or another for months; we just finally caught an instance that gave us the right data to reach conclusions.

It's great that we're fixing these sorts of long-tail issues; it means we're getting close to the point where we can take Althea out of beta.