High Throughput Ring Buffers

2024-06-28

Digital Signal Processing (DSP) often requires vast amount of data at high speeds and low latency. An often overlooked method when storing data with locality constraints is the Circular Buffer. This data-structure has a finite cyclic structure with FIFO (first in, first out) logic.

Implementation Issues

Even though the Ring Buffer is relatively simple, it can be a hassle of setting up and using.

// Simple ring buffer
template <typename T>
struct ring_buffer {
    T *data;
    size_t size;
}

When using the buffer checks must be made to make sure that reading and writing is allowed within bounds. This in itself is not wrong, but will require checks to see reading or writing is allowed.

template <typename T>
T read_buffer(ring_buffer<T> *rb, unsigned i) {
    unsigned pos = i % rb->size;
    return (T)*(rb->data + pos * sizeof(T));
}

Sneaky Optimizations

If we pause for a second to think about what we are trying to achieve we may find something interesting. We do not want to read outside our buffer. This is quite simple, if we use an unsigned integer for indexing, we may use \(N\) bits to automatically rollover. This will prevent getting indexes out of bounds. But what happens when we need to read or write \(K\) items? That may be problematic because that will mean that we need to stop reading and jump to the next part of the data expensive…

Abusing the page mapping process

Here comes the fun part! What if we didn’t have to worry about boundschecking? In theory this should speed up the process due to less instructions for the CPU. Let me introduce the page mapping process. When the program asks the kernel for memory the kernel usually provides what is called virtual memory. So if the program asks for 16bytes the kernel might give 16bytes of contigious memory. This is not always true and may introduce performance penalties if the CPU have to jump around a lot through the memory. What you can do is ask for contigious memory specifially. This should when possible return memory with perfect locality another improvement can be to allign the memory for better access. This still won’t solve the problem with bounds checking!!! But here is the key component, the kernel returns a virtual memory chunk to the program. In practice this means that the memory the program now holds is contigious somewhere in the physical memory but to the program it looks like it’s located at some address: 0x0001 or similar. In practice the program memory 0x0001 might actually map to 0x5A7E. So if the program reads out of bounds i.e 0x0002 this should result in undefined behaviour or more likely an error. But if you_ tell the kernel that the next virtual memory is located on the same physical address it will look something like this:

    0x0001     0x0002  // Virtual
        \        /     // --------
          0x517E       // Physical

which migh be the following: