Performance improvements via TSO/GRO and UDP_SEGMENT #439

byo-books · 2023-07-07T16:31:05Z

This blog post by Tailscale sounds promising. It points out that the Linux TUN device supports TSO/GRO offloading.
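For illustration, a minimal sketch of how a TUN device could be opened with segmentation offload enabled on Linux. The helper names `tun_offload_flags` and `open_tun_with_offload` are hypothetical and not part of tinc; the ioctls and flags come from `linux/if_tun.h`. With `IFF_VNET_HDR` set, every read and write on the descriptor is prefixed by a `struct virtio_net_hdr`, so the packet-handling code would have to be adapted accordingly.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/if.h>
#include <linux/if_tun.h>

/* Offload features to request: checksum offload plus TCP segmentation
 * offload for IPv4 and IPv6. TSO requires TUN_F_CSUM to be set as well. */
unsigned int tun_offload_flags(void) {
    return TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6;
}

/* Open a TUN interface called `name` with offloads enabled.
 * Returns the file descriptor, or -1 on failure (errno is set).
 * Needs CAP_NET_ADMIN, so this is only a sketch of the call sequence. */
int open_tun_with_offload(const char *name) {
    assert(name != NULL);

    int fd = open("/dev/net/tun", O_RDWR);
    if (fd < 0) {
        return -1;
    }

    struct ifreq ifr;
    memset(&ifr, 0, sizeof(ifr));
    /* IFF_VNET_HDR makes the kernel prepend a virtio_net_hdr to each packet;
     * it is required before offloads can be requested. */
    ifr.ifr_flags = IFF_TUN | IFF_NO_PI | IFF_VNET_HDR;
    strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);

    if (ioctl(fd, TUNSETIFF, &ifr) < 0 ||
        ioctl(fd, TUNSETOFFLOAD, tun_offload_flags()) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```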

Also, there is another post on using GSO (Generic Segmentation Offload) to send multiple UDP packets from a single large buffer.
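As a rough sketch of the UDP side, assuming Linux 4.18 or newer: setting the `UDP_SEGMENT` socket option tells the kernel to cut each large send into datagrams of a fixed size, where only the last one may be shorter. The function names below are hypothetical, and the fallback `#define` values are only there for older libc headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/udp.h>

#ifndef SOL_UDP
#define SOL_UDP 17       /* same value as IPPROTO_UDP */
#endif
#ifndef UDP_SEGMENT
#define UDP_SEGMENT 103  /* value from linux/udp.h */
#endif

/* Number of datagrams the kernel emits for one send of total_len bytes. */
size_t gso_datagram_count(size_t total_len, uint16_t seg_size) {
    assert(seg_size > 0);
    return (total_len + seg_size - 1) / seg_size;
}

/* Ask the kernel to split every send on this socket into seg_size-byte
 * datagrams. Returns 0 on success, -1 (errno set) if unsupported. */
int enable_udp_gso(int fd, uint16_t seg_size) {
    int val = seg_size;
    return setsockopt(fd, SOL_UDP, UDP_SEGMENT, &val, sizeof(val));
}

/* Hand one large buffer to the kernel. With UDP_SEGMENT set it leaves the
 * host as several datagrams after a single network stack traversal. */
ssize_t send_batched(int fd, const struct sockaddr *dst, socklen_t dst_len,
                     const void *buf, size_t len) {
    return sendto(fd, buf, len, 0, dst, dst_len);
}
```

Because every segment except the last must have the same length, this only helps when several equally sized packets for the same destination are available at once.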

Both techniques reduce network stack traversals. Unfortunately, these features do not seem to be well documented.

splitice · 2024-03-03T03:32:11Z

If you look at benchmarks of tinc you will quickly find that for many real-world workloads the largest user of CPU time is TUN/TAP.

I did some work on sendmmsg in the past but ran into issues, primarily architectural ones. Tinc was never built to handle a queue of packets (but this can change!)
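To make the queueing requirement concrete, here is a minimal sketch of what a batched send with `sendmmsg` could look like, assuming a connected datagram socket and a caller that has already collected several packets. `struct packet`, `MAX_BATCH` and `send_batch` are hypothetical names and do not exist in tinc.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* sendmmsg() and struct mmsghdr are GNU extensions */
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define MAX_BATCH 64

/* One queued, already encrypted packet (hypothetical type). */
struct packet {
    const void *data;
    size_t len;
};

/* Send up to MAX_BATCH queued packets on a connected socket with a single
 * system call. Returns the number of packets sent, or -1 on error. */
int send_batch(int fd, const struct packet *pkts, unsigned int count) {
    assert(pkts != NULL || count == 0);

    struct mmsghdr msgs[MAX_BATCH];
    struct iovec iov[MAX_BATCH];

    if (count > MAX_BATCH) {
        count = MAX_BATCH;
    }
    memset(msgs, 0, sizeof(msgs));

    for (unsigned int i = 0; i < count; i++) {
        /* Each packet becomes its own datagram with a single iovec. */
        iov[i].iov_base = (void *)pkts[i].data;
        iov[i].iov_len = pkts[i].len;
        msgs[i].msg_hdr.msg_iov = &iov[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }
    return sendmmsg(fd, msgs, count, 0);
}
```

The system call itself is simple; the hard part is that the caller needs a queue of packets to pass in, which is exactly the architectural piece that is missing.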

If you really want performance for tinc, build ktincd (Linux kernel tinc). I've debated it numerous times.

It was originally going to be one of my next experiments after the AES protocol changes merged (but they never did).

The networking side of it wouldn't be too hard. Tinc is structured well enough that adapting to a Linux netdev would not be too difficult. Configuration though is potentially a real nightmare.

gsliepen · 2024-03-03T13:56:19Z

Another option might be investigating if io_uring can be used, and what performance improvements that can give.

splitice · 2024-03-03T14:17:30Z

I don't know if io_uring is really worth the effort tbh. I don't have strong data to back this up however.

Packet mmap appears to be the fastest way to read / send from tap.

See for example https://github.com/google/gvisor/blob/master/pkg/tcpip/link/fdbased/mmap_unsafe.go#L50

However tinc doesn't have the architecture in place for batching on the tap side. And that's what holds me back. I'm not certain I want to do that level of change without guidance.

gsliepen · 2024-03-03T15:28:04Z

The advantage of io_uring is that you don't have to batch things at all in the application. You can still do single packet read()/write()/send()/recv() calls, but instead of them being system calls, you enqueue them on the io_uring. You can also have buffers shared between userspace and kernelspace, so you can theoretically avoid copies being made. However, I don't know how well that works compared to packet mmap.

gsliepen added the labels enhancement (New feature requests or performance improvement.) and linux (Issues specific to Linux.) on Mar 23, 2024
