Althea Development Update #52: Asking the right questions about performance

Here’s our Alpha 6 release

We’re attending and presenting at SOON in Toronto this weekend, so keep an eye on the IPFS-powered livestream.

What’s new in Alpha 6

  • DNS servers have been moved to more stable defaults after a recent incident
  • Meshing over device WiFi is now a runtime toggle instead of a firmware compile time option
  • Dashboard API endpoints are now automatically tested and should generally be more reliable
  • Moved to the system allocator for Rust stable builds

Now Hiring!

We have raised a funding round and are now hiring remote developers. See the application here. We’re a fully remote team and we will happily hire wherever the talent is. We are also pretty flexible about pay, so generally expect a little above the average for your region.

In which detailed data confirms the obvious

If it becomes clear that the Zyxel is a part of a larger pattern of substandard performance rather than the exception I’ll start digging into kernel performance profiling and really get to the bottom of the issue.

The good news is that the GL-B1300 has good enough price/performance to become our new go-to midrange pick. The bad news is that’s 100 Mbps of throughput, at least 3x less than what I had estimated the device to be capable of.

True to my word, I dug into the performance at the kernel level on both the WD MyNet n600 and the GL-B1300, trying to get to the bottom of our performance woes.

First off, thanks to Brendan Gregg for the wonderful Perl script that converts data from the Linux perf tool into easily readable Flame Graphs. For those of you not familiar, each graph represents call stacks: the higher up a frame sits, the deeper ‘down’ the call stack it is. The width of a frame is a relative measure of how much CPU time was spent in that function, including the functions it calls.
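To make the width idea concrete, here is a minimal sketch of how frame widths fall out of sampled stacks. It assumes samples have already been folded into the "frame;frame;frame count" text format used by the Flame Graph tooling. The sample data, frame names, and the frame_widths helper are all made up for illustration; none of it is taken from the captures in this post.

```python
from collections import Counter

# HYPOTHETICAL folded stacks: "root;child;leaf sample_count".
# Frame names and counts are invented for illustration only.
FOLDED = """\
kworker;process_one_work;decrypt_worker;chacha20 60
kworker;process_one_work;decrypt_worker;poly1305 25
kworker;process_one_work;receive_packet;iptables_rules 15
"""

def frame_widths(folded):
    """Map each call path to its share of all samples (its relative width)."""
    totals = Counter()
    grand_total = 0
    for line in folded.splitlines():
        if not line.strip():
            continue
        stack, count_text = line.rsplit(" ", 1)
        count = int(count_text)
        grand_total += count
        frames = stack.split(";")
        # Every prefix of the stack was on-CPU for this sample, so a parent
        # frame is at least as wide as all of its children combined.
        for depth in range(1, len(frames) + 1):
            totals[tuple(frames[:depth])] += count
    return {path: n / grand_total for path, n in totals.items()}

widths = frame_widths(FOLDED)
for path, share in sorted(widths.items()):
    indent = "  " * (len(path) - 1)
    print(f"{indent}{path[-1]:<24} {share:.0%}")
```

A frame that is wide but has narrow children spent its time in its own code; a frame that is wide only because of one child points you further up the graph.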

The graphs below were captured from kworker threads over 100-second intervals during an iperf UDP test, using kernels compiled with flags as close to identical as possible.

The n600 is a pretty old device; it runs a 560 MHz AR9344 MIPS core right out of 2011. Variants of this core remain popular even in modern routers. Tested speed with Althea is ~25 Mbps.

The GL-B1300 is a brand new device built around the Atheros IPQ4028, a quad-core 717 MHz ARM chip. Tested speed with Althea is ~100 Mbps.

Considering the process and architecture improvements between these two processors, we should be seeing a lot more than a 4x improvement. The latter processor is easily a dozen times more powerful.

The reason we’re focusing on processing power is that in our design with Althea we actually nest two WireGuard tunnels: one provides security for your traffic as it traverses out to the internet, and the other exists between every hop. A very important part of being able to bill for traffic is being able to identify who is responsible for paying for that traffic.

Since each hop pays the hop adjacent to it in our system, WireGuard is actually the most expedient way to verify that traffic really comes from the peer who will be billed. In theory we could improve efficiency by removing the ChaCha20 encryption and using only the Poly1305 authentication.
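As a rough illustration of what "authentication only" would mean, the sketch below tags each packet with a keyed MAC and leaves the payload unencrypted. Note the swap: it uses HMAC-SHA256 from the Python standard library as a stand-in for Poly1305, and the seal and open_from_peer helpers and the key names are hypothetical. It is not how WireGuard or Althea works internally, and it omits nonces and replay protection entirely.

```python
import hashlib
import hmac
import os

TAG_LEN = 16  # Poly1305 tags are 16 bytes; the HMAC output is truncated to match

def seal(peer_key, payload):
    """Append an authentication tag; the payload itself stays in the clear."""
    tag = hmac.new(peer_key, payload, hashlib.sha256).digest()[:TAG_LEN]
    return payload + tag

def open_from_peer(peer_key, packet):
    """Return the payload if the tag matches this peer's key, else None."""
    if len(packet) < TAG_LEN:
        return None
    payload, tag = packet[:-TAG_LEN], packet[-TAG_LEN:]
    expected = hmac.new(peer_key, payload, hashlib.sha256).digest()[:TAG_LEN]
    # Constant-time comparison avoids leaking how many tag bytes matched.
    return payload if hmac.compare_digest(tag, expected) else None

# ASSUMPTION: each pair of adjacent hops shares one secret key.
neighbor_key = os.urandom(32)
stranger_key = os.urandom(32)

packet = seal(neighbor_key, b"forward me")
print(open_from_peer(neighbor_key, packet))                    # tag verifies
print(open_from_peer(neighbor_key, seal(stranger_key, b"x")))  # tag does not verify
```

The point is only that a valid tag ties a packet to a key holder, which is the property billing needs; anyone on the path could still read the payload.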

This is something we seriously discussed as it became clear that even modern devices were not performing to expectations. But I decided to do some more investigating before going down that route.

In the graph for the n600, CPU time is dominated by WireGuard tasks, so much so that the iptables and routing rules remain thin little licks of ‘flame’.

Kernel worker thread sample for the n600

But the picture painted by the B1300 is very different. The previously flame-thin iptables stacks now make up nearly the entire graph. Packet decryption is relegated to a little corner on the left and an even smaller one about a fourth of the way over, where you can barely see a chacha20_neon function.

Kernel worker thread sample for the GL-B1300

What we’re seeing is WireGuard’s optimized ARM assembly implementation paying off. It’s not so much that iptables tasks are larger on the B1300 as that WireGuard tasks are massively smaller.

The similarly optimized MIPS implementation does impressive work for such an old processor. But exactly as expected the newer device blows it out of the water.

So then why is it only 4x faster?

The answer is packet forwarding hardware acceleration. While the n600 achieves its sticker 100 Mbps running stock OpenWRT, most modern routers need special drivers to reach their sticker performance.

Testing the B1300 with stock OpenWRT reveals it gets… about 100 Mbps throughput without WireGuard. The CPU is easily powerful enough to perform the WireGuard operations, exactly as I estimated.
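The reasoning here boils down to a simple bottleneck model: a packet passes through crypto and forwarding in series, so the slower stage sets the overall rate. The sketch below plugs in the figures from this post. The stage names and helper functions are hypothetical, the 300 Mbps crypto figure is an assumption derived from the earlier estimate of at least 3x the measured speed, and treating the n600’s 25 Mbps as its crypto ceiling is an inference from its flame graph rather than a separate measurement.

```python
def path_throughput_mbps(stages):
    """Throughput of a serial packet path is set by its slowest stage."""
    return min(stages.values())

def bottleneck(stages):
    """Name of the stage that limits the path."""
    return min(stages, key=stages.get)

n600 = {
    "wireguard_crypto": 25,    # INFERRED: Althea speed, crypto-bound per the flame graph
    "packet_forwarding": 100,  # stock OpenWRT reaches the sticker speed
}
b1300 = {
    "wireguard_crypto": 300,   # ASSUMPTION: earlier estimate of at least 3x the measured 100
    "packet_forwarding": 100,  # measured on stock OpenWRT without WireGuard
}

for name, stages in (("n600", n600), ("B1300", b1300)):
    print(f"{name}: {path_throughput_mbps(stages)} Mbps, limited by {bottleneck(stages)}")
```

Under these assumptions, speeding up crypto on the B1300 changes nothing until the forwarding stage gets faster, which is why the ratio between the two devices stalls at 4x.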

To be honest I also suspected packet acceleration issues before I began, but it’s good to have rock solid confirmation.

The conclusions we can draw here are that on modern devices Althea’s encryption/authentication stack is efficient enough to not be a problem. Instead we should focus our performance efforts on FOSS drivers for more common packet accelerators.

Fortunately such drivers already exist and are floating around in the depths of the forums, waiting to be upstreamed or, if that fails, applied as local patches to our router firmware.