Althea Development Update #71: Transaction amnesia

// removed and placed into error handling in payment controller
// time will tell if that was a good idea
// update_nonce(our_address, &web3);

Lesson of the week: if your gut tells you that something might be a bad idea, you should probably listen to it.

The ‘nonce’ is the sequence number of a transaction in Ethereum. Nonces must be issued strictly in order: if you have a published transaction with nonce 1, your next transaction must contain nonce 2.

I wrote this code on May 1st. The goal was to start managing our transaction nonces locally. Previously we had been able to get away with just asking the full node what our current nonce was and using that.

But if the blockchain was congested this was a problem. The full node will happily tell you the current valid nonce, but not how many transactions you have pending.

My goal was to let Althea devices manage multiple pending transactions at the same time. If we just incremented the nonce locally, this should work perfectly. If something actually did go wrong, we could just update our nonce from the full node at that time.
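
A minimal sketch of that plan, for illustration only: the `NonceTracker` type and its method names are hypothetical and are not taken from the Althea codebase. It hands out nonces from a local counter and only overwrites that counter with the full node's answer when error handling asks it to.

```rust
/// Hypothetical local nonce tracker (illustrative, not Althea's actual code).
#[derive(Debug)]
pub struct NonceTracker {
    /// The nonce the next outgoing transaction will use.
    next: u64,
}

impl NonceTracker {
    /// Start from the nonce the full node reports as the next valid one.
    pub fn new(node_nonce: u64) -> Self {
        NonceTracker { next: node_nonce }
    }

    /// Return the nonce for a new transaction and advance the local counter,
    /// so several transactions can be pending at the same time.
    pub fn take(&mut self) -> u64 {
        let nonce = self.next;
        self.next += 1;
        nonce
    }

    /// Look at the next nonce without consuming it.
    pub fn peek(&self) -> u64 {
        self.next
    }

    /// Called from error handling: trust the full node and overwrite
    /// the local counter with whatever it reports.
    pub fn resync(&mut self, node_nonce: u64) {
        self.next = node_nonce;
    }
}
```

Note that `resync` is only as good as the value it is given. If the full node's idea of the nonce is wrong, the tracker faithfully copies the mistake.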

As I slapped this solution together I got a bad feeling. What if the full node lied to us? I really couldn’t think of a situation in which it would, but I just couldn’t shake the feeling.

It's around midnight on Friday, May 24th, nearly a month after we put this code into production. My brother is in town for the first time in a year and we're out getting drinks and regaling each other with the various ridiculous failures of our respective tech jobs when I get a call. Over on the west coast it's 9pm, prime time for internet usage in our test network.

Unfortunately, over in the rainy Cascades, all is not well. The relays have not been paid by the gateway in over a day, and high weekend usage has pushed them over the edge into demanding the late payment they were due by slowing down their connection to the gateway.

This inevitably caused the network to catch fire and my phone to ring.

It took me about 30 minutes to piece together what had happened. My gut feeling had been correct: there was a way for the full nodes to lie to us.

Remember what I said about transactions having to be in exact sequence to be valid? This was the key cause of our error.

Transaction 728 was submitted on May 23rd at 5am and made it into the blockchain a few minutes later.

Transaction 729 was submitted shortly thereafter, but after hours of waiting it never got into a block. Here’s where everything goes horribly wrong.

When a transaction is broadcast into the network of Ethereum full nodes, it’s in the hope that a miner (who creates blocks) will pick it up and place it into the blockchain. Exactly how long any given miner keeps transactions around waiting to put them into a block varies, but it usually isn’t longer than an hour.

Our full nodes, on the other hand, are not miners. They merely keep track of blockchain data and respond to questions from Althea devices. They had no configured timeout for transactions.

So when we submitted transaction 730, our full nodes thought this was perfectly fine. After all, whenever 729 got in, 730 would follow shortly thereafter.

The miners, having discarded 729 after an hour or two, didn’t agree. To them 730 was a totally invalid transaction: they didn’t have a 729, and without some way to fill in the sequence there was no way to publish 730.
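
The disagreement can be shown with a toy model. This is only a sketch: the `minable` function and the idea of modelling each pool as a set of nonces are assumptions made for illustration, not how any real Ethereum client is implemented. Given the next nonce the chain expects and the nonces a node still holds for our address, it returns the unbroken run of transactions that could actually be included.

```rust
use std::collections::BTreeSet;

/// Return the pending nonces that could be mined right now: the unbroken
/// run starting at `next_confirmed`. Anything after a gap is stuck.
pub fn minable(next_confirmed: u64, pending: &BTreeSet<u64>) -> Vec<u64> {
    let mut ready = Vec::new();
    let mut expected = next_confirmed;
    // Walk forward one nonce at a time and stop at the first missing one.
    while pending.contains(&expected) {
        ready.push(expected);
        expected += 1;
    }
    ready
}
```

With 728 confirmed, the chain expects 729 next. A full node that never expires transactions still holds {729, 730, 731} and sees all three as ready, while a miner that has dropped 729 holds only {730, 731} and gets an empty result: nothing can be published until a new 729 shows up.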

On and on this goes, up to nonce 1000+, until my phone rings at the bar.

In the end, my late night fix simply involved manually setting the nonce to the correct value and restarting the gateway, which immediately produced a new transaction 729 that is in the blockchain to this day. This is a classic sort of distributed systems bug: correctness is a matter of perspective.

Development update


Since our last dev update we’ve been working to finish stabilizing payments and prepare our own infrastructure (specifically the Rita exit code) to scale as our growth rate increases.

Some of these performance improvements have trickled down into the routers, which didn’t really need them so much, but hey, faster is faster.

We’ve also continued to patch issues. As the payment code becomes more and more stable, smaller and smaller problems become apparent and can be tweaked. At this point we can maintain payment consensus for several weeks on end despite whatever chaos is going on in production, whether that is 4x swings in the gas price or antenna degradation and packet loss. We’re well into the long tail, and I would expect that within an update or two we may just have nothing left to patch about billing.

What's new in Beta 6?

---

  • Password protection support for the router dashboard!
  • New relay management screen to manage neighbours' connections and payments
  • Dramatic efficiency improvements thanks to async Babel interactions
  • Faster saving for bandwidth use display features, now every 4 hours rather than 24 so less data is lost on reboot
  • Use 105% of the median gas price in order to avoid the situation described in Dev update #70
  • Devices will now apply spare change from slight overpayments before instead of after enforcing
  • Fixed a bug where the exit may be paid twice in quick succession
  • Fixed a bug where the Babel routing table could grow too large for Rita to read
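
The gas price item in the list above is simple arithmetic, and a rough sketch may help. The function name `padded_median_gas_price` is hypothetical, and where the price samples come from is an assumption left open here; the list only states the 105% figure.

```rust
/// Take the median of some observed gas prices (in wei) and add a 5% margin.
/// Returns `None` when there are no samples to work from.
pub fn padded_median_gas_price(samples: &[u64]) -> Option<u128> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    let mid = sorted.len() / 2;
    // Widen to u128 so neither the average nor the multiplication can overflow.
    let median: u128 = if sorted.len() % 2 == 0 {
        (sorted[mid - 1] as u128 + sorted[mid] as u128) / 2
    } else {
        sorted[mid] as u128
    };
    // 105% of the median, using integer arithmetic (rounds down).
    Some(median * 105 / 100)
}
```

For samples of 10, 20 and 30 gwei the median is 20 gwei, so this returns 21 gwei expressed in wei.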


The Altheahoods program is going pretty well, all told. At this point it seems that getting backhaul on location will probably take longer than getting all the signups, which is either really good or a scathing criticism of backhaul fiber providers, depending on your perspective.

You will also notice that we’ve changed our domain (yet again) to althea.net. This was the result of a lot of thinking about our branding choices. We’ve needed to ‘drop the mesh’ for a while now, as it tends to cause confusion in different audiences. It’s much better for us to define what we do rather than be pigeon-holed into what people think when they hear mesh (which can vary widely from venue to venue).

The bird also got dropped, since it didn’t match the new art and branding style of the website. Let’s all have a moment of silence for Flappy.