Custom Buffers

This documentation was originally copied from the wiki within gnuradio/gnuradio-ngsched, written by David Sorber, and pulled on Oct 26 2021

Description
GNU Radio provides a flexible block-based interface for signal processing tasks. Historically GNU Radio signal processing blocks have been written in software but there is increasing need to offload complex signal processing algorithms to accelerator devices including GPUs, FPGAs, and DSPs. Many accelerated blocks have been created using GNU Radio's block interface but these blocks require manual handling of data movement to and from the accelerator device. The purpose of this project is to add accelerator device support directly to GNU Radio.

Project Goals

 * Maintain backwards compatibility with all existing blocks (both in-tree and from OOT modules)
 * Create flexible interface for creating "custom buffers" to support accelerated devices
 * Custom buffer interface provides necessary hooks to allow the scheduler to handle data movement
 * Provide infrastructure to support "_insert signal processing here_" paradigm for common accelerator devices such as NVidia GPUs

High-level Plan

 * Milestone 1 - completed: May 11, 2021
 * refactor existing code and create single mapped buffer abstraction
 * support single accelerated block (block responsible for data movement)
 * simple custom buffer interface
 * Milestone 2 - completed: August 5, 2021
 * support multiple accelerated blocks with zero-copy between
 * more flexible custom buffer interface (scheduler handles data movement)

Overview and Usage
![double copy](https://i.imgur.com/QYZfucc.png)

GNU Radio's block-based interface is very flexible and has allowed users to create their own accelerated blocks for some time. However, this approach has some limitations. In particular if the accelerator device requires special (DMA) buffers for data movement, the accelerator block must then copy data from the GNU Radio buffer into the device's buffer on the input path and vice versa on the output path. This process is inefficient and is known as the "double copy" problem as shown in the diagram above. Furthermore, in addition to the double copy inefficiency, accelerated blocks written in this fashion require the writer to manage data movement explicitly. While this is doable it may be challenging for a novice and off-putting for a user that wishes to concentrate on implementing a signal processing algorithm. The new accelerated block interface changes address both of these issues while (very importantly) maintaining backwards compatibility for all existing GNU Radio blocks.

Supporting Code
The accelerated block interface changes currently live in [this repository](https://github.com/gnuradio/gnuradio-ngsched) however the intention is to upstream these changes into GNU Radio. The following repositories contain supporting code that is also intended to be upstreamed to the project but not directly into the main GNU Radio repository itself (**NOTE:** both of the repositories below require the accelerated block interface changes, also called "ngsched"):


 * [gr-cuda_buffer](https://github.com/BlackLynx-Inc/gr-cuda_buffer) - This repository contains an OOT module containing the  class which is a "custom buffer" supporting the CUDA runtime for NVidia GPUs. This module is intended to be a base CUDA buffer implementation for CUDA blocks and can be used directly when writing CUDA accelerated blocks for NVidia GPUs.
 * [gr-blnxngsched](https://github.com/BlackLynx-Inc/gr-blnxngsched) - This repository contains an OOT module containing various examples of the accelerated block interface (aka "ngsched") changes. These blocks are described in a additional detail in the "Examples" section below. Note that the CUDA-related blocks in this OOT require  from gr-cuda-buffer.

Examples

 * - A buffer object that shows a simple example for creating a custom buffer object. It uses normal host buffers and does not require any specific accelerator hardware.
 * - A loopback block that uses the  class from above. It shows a very simple example of how to use a custom buffer object defined within an OOT module.
 * - (NOTE: requires CUDA and gr-cuda-buffer) A simple CUDA-based fanout block that utilizes block history.
 * - (NOTE: requires CUDA and gr-cuda-buffer) A CUDA-based loopback block that uses the  class from gr-cuda-buffer.
 * - (NOTE: requires CUDA and gr-cuda-buffer) A CUDA-based complex multiplication block. This block has two inputs and one output.
 * - (NOTE: requires CUDA and gr-cuda-buffer) A loopback block that combines a CUDA-based loopback with a simple host loopback. This block has two inputs and two outputs. One input/output pair uses   while the other uses default GNU Radio host buffers.

How to Use a Custom Buffer
The following instructions illustrate how to write a block using a "custom buffer". These instructions use  from gr-cuda-buffer for example purposes but the same general concepts can be applied to any custom buffer class.

1. If the custom buffer class exists in a separate OOT module, install that OOT module in the same path prefix as the OOT module containing the block which will use the buffer class. For example, to use, the gr-cuda-buffer OOT module must be installed in the same prefix. 1. In the implementation source file include the appropriate header file for the buffer class. For example, in : 3. Next, within the block's constructor update the `gr::io_signature` to include the desired buffer's type. For example: 4. Finally, the pointers passed to the block's work function will now be of the selected type. For the  class used in this example the pointers passed to the work function are not host accessible and should not be dereferenced on the host. Instead the pointers should be passed to a kernel invocation. Note that the pointer usage restrictions depend on the buffer class being used.

Detailed Design
This section contains detailed design information for the changes introduced in the accelerated block interface.


 * 1) Single Mapped Buffer Abstraction

The ability of the GNU Radio runtime to directly manipulate device-specific (aka "custom") buffers is a key aspect of the accelerated block interface changes. Integrating device-specific buffers into the runtime required several changes, the most significant of which is the refinement of the runtime `buffer` class and the creation of the single mapped buffer abstraction.

A GNU Radio flow graph contains a series of blocks where each pair of blocks is connected by a buffer. The "upstream" block writes (produces) to the buffer while the "downstream" block reads (consumes) from the buffer.

![buffer](https://i.imgur.com/fCouLOE.png)

A GNU Radio buffer may have only one writer but multiple readers. The  class provides an abstraction over top of the underlying buffer itself. Likewise the  class provides an abstraction for each reader, one or more   instances are attached to each   instance. The underlying buffers, which the  class wraps, are very elegantly implemented circular buffers. The  class provides the circular buffer interface although several implementations exist. Fundamentally the  interface relies on virtual memory "double mapping" to provide the illusion of an automatically wrapping memory buffer (see https://www.gnuradio.org/blog/2017-01-05-buffers/ for additional details). This approach works very well for host buffers where the virtual memory mapping to the underlying physical memory can be manipulated but does not work well for device-specific buffers where the virtual memory mapping cannot be manipulated.

The single mapped buffer abstraction was created to provide a similar encapsulation for underlying buffers whose virtual memory mapping cannot be manipulated. This applies to the majority, if not all, device-specific buffers. One side effect of the single mapped buffer abstraction is that, unlike the traditional double mapped buffers, single mapped buffers require explicit management to handle wrapping from the end of the buffer back to the beginning. For trivial cases where the item consumption or production size aligns with the size of the buffer the  and   functions handle wrapping automatically. However for more complex wrapping cases two callback functions are used to handle wrapping; they are described in detail in the [Callback Functions](#callback-functions) section below.

Some refactoring was necessary to accommodate the new single mapped buffer abstraction alongside the existing double mapped buffer abstraction. The existing  class was refactored to be an abstract base class for underlying buffer wrapper abstractions. It provides the interface that those abstractions use to hook into the GNU Radio runtime. The  and   classes now derive from the   class and implement its interface (see the simplified class diagram below). The  class, as its name suggests, contains the double mapped buffer abstraction that was previously contained within the   class. The  class contains the new single mapped buffer abstraction that was added as part of the accelerated block interface changes.

![buffer_hierarchy](https://i.imgur.com/otkG1MB.png)

The  class is itself an abstract class that represents the interface for single mapped buffers. In the diagram above the interface functions are listed in the gray box. Functions that are pure virtual, that is functions that must be overridden, are listed in bold. The remaining non-bold functions are all virtual, that is they may be overridden if specific customization is necessary. Device-specific "custom buffers" must derive from the `buffer_single_mapped` class and implement the interface, which includes the functions listed in bold at a minimum. Note that the `host_buffer` class is an example implementation of the interface using host buffers. Additional information about it is available in the [host_buffer Class](#host_buffer-class) section below. The  and `hip_buffer` classes also derive from the `buffer_single_mapped` class but they reside externally in separate OOT modules.

![buffer_reader_hierarchy](https://i.imgur.com/bJiTPHo.png)

Additional minor refactoring of the `buffer_reader` class was also necessary to support the single mapped buffer abstraction. The `buffer_reader_sm` ("sm" for single mapped) was created to customize the behavior of several `buffer_reader` functions when used with `buffer_single_mapped` derived classes. To support this, several functions in the `buffer_reader` class were marked virtual. The `buffer_reader_sm` then derives from `buffer_reader` and overrides those virtual functions. The `buffer_reader` class was also slightly refactored to redirect most calls back to the corresponding `buffer` object where possible. This refactoring eliminated the need to create a specific `buffer_reader` derived class to accompany each `buffer_single_mapped` derived class. That is, it should be possible to use `buffer_reader_sm` with any `buffer_single_mapped` derived class. For example, the `host_buffer`, `cuda_buffer`, and `hip_buffer` classes all utilized `buffer_reader_sm` without requiring their own corresponding specific derived classes.

Device Buffer Encapsulation and Data Movement
The single mapped buffer abstraction described above provides a suitable abstraction for managing device-specific buffers. However additional functionality to support data movement is needed to incorporate accelerated blocks into GNU Radio. This functionality was also added to the `buffer_single_mapped` class.

![flowgraph](https://i.imgur.com/Cp2bvGG.png)

The diagram above shows a conceptual representation of an accelerated block within a simplified GNU Radio flowgraph. In the lower portion of the diagram the underlying buffer contained within the `buffer` object spanning the host to accelerator device boundary is shown as split into two pieces with a red arrow connecting the two. Each piece represents an underlying buffer residing on the host and accelerator device respectively. The red arrow represents the data movement path between underlying buffers across the host to accelerator device (and vice versa) boundary. `buffer_single_mapped` derived classes use the same arrangement shown in the lower portion of the diagram; they wrap two underlying buffers, a host buffer and a device-specific buffer, and the newly added `post_work` function is used to explicitly move data between the two buffers. Note that the host buffer is provided directly by default by the `buffer_single_mapped` class while the device-specific buffer must be implemented in the derived class.

The `post_work` function was added to the `buffer` class interface. It is called in the `block_executor::run_one_iteration` function after a block's `work` function has been called and is responsible for performing the necessary device-specific data movement between the encapsulated underlying buffers. The `post_work` function uses a buffer's assigned "context" (see [Replace Upstream](#replace-upstream) for details on buffer context) in order to conditionally move data between the buffers. Also note that the accelerated block interface supports back-to-back accelerated blocks provided that they reside on the same accelerator device. In this case the `post_work` may simply perform a "no-op" as data may not need to be moved at all in order to be accessible by the downstream block (this concept is also called "zero copy").

Below is a simplified (error checking removed for brevity) version of the `post_work` function from the `cuda_buffer` class:

host_buffer Class
The  class is included as an example   derived class that is intended to be used for testing in the   test module. As its name suggests the  class implements the   interface but using only host buffers and therefore does not require any specific accelerator device hardware.

Buffer Type
The  type (see  ) was created to provide an easy mechanism to connect blocks with the potentially custom buffers that they intend to use. Furthermore  is intended do this in a type consistent way such that arbitrary `buffer_single_mapped` derived classes implementing "custom buffers" can be added easily via OOT modules. In addition to providing type information  instances also provide a mechanism, via a pointer to a simple factory function, to create the corresponding `buffer` derived class instance.

The `buffer_type` type itself is defined as: Each `buffer_type` instance is therefore a constant reference to singleton instance of a class derived from `buffer_type_base`. A macro function: `MAKE_CUSTOM_BUFFER_TYPE` was created to facilitate easy creation of `buffer_type` clasess. The first argument is the desired name of the derived class, it must be unique for each unique buffer type. Within the macro function the class name argument will be prefixed with `buftype_` resulting in a class named `buftype_ `. The second argument to the macro function is a pointer of type `factory_func_ptr` to a simple factory function for creating `buffer` derived class instances of the corresponding type. Although not required it is recommended that the factory function be made a static function within the `buffer` derived class. Likewise it is also recommended that the `buffer_type` be included as a static member (called `type` by convention) within the `buffer` derived class.

Below is a simple example showing usage of `MAKE_CUSTOM_BUFFER_TYPE` for the fictional `my_device_buffer` class (`my_device_buffer.h`): Within `my_device_buffer.cc`: Note for interface consistency `buffer_type` is used by the `buffer_double_mapped` class too. The `buffer_double_mapped` class's buffer type value is `buftype_DEFAULT_NON_CUSTOM`. It is used transparently by the runtime for all existing blocks.

Replace Upstream
By convention GNU Radio blocks are responsible for allocating their (downstream) output buffer(s). Downstream blocks then attach their own `buffer_reader` instances to the corresponding output buffer. This presents a problem for accelerated blocks because most require use of a device-specific buffer for both the input data path and the output data path. To solve this problem additional logic was added to the `flat_flowgraph::connect_block_inputs` function. The `connect_block_inputs` function iterates over the input connections for a given block and adds the necessary `buffer_reader` instances. It was modified to perform two additional actions:

1. First the upstream buffer type is compared to the block's output buffer type. If the buffers' types match or if the blocks's output buffer type is the default (`buffer_double_mapped::type`), no further action is taken. If the block's output buffer and the upstream buffer are both not the default type and not the same then a runtime error is produced. This case means that two incompatible accelerated blocks are connected to each other, which is not permitted. Finally, if the block's output buffer is not the default type but the upstream buffer is the default type, then the upstream block's output buffer (i.e. the upstream buffer) is replaced with a new buffer of the block's own output buffer type. The `replace_buffer` function was added to the `block` class to handle buffer replacement when appropriate. 1. In addition to potentially replacing the upstream buffer the `connect_block_inputs` function also calculates and set each buffer's "context". A buffer's context is represented by the `transfer_type` enumeration and represents the direction of data movement for data within the buffer. There are four possible values: * **HOST_TO_HOST** - both the upstream and downstream blocks are normal non-accelerated host blocks; this is the default context for blocks * **HOST_TO_DEVICE** - the upstream block is a normal host block while the downstream block is an accelerated block; within the buffer wrapper object data must be moved from the host to a device in order to be consumed by the downstream block * **DEVICE_TO_HOST** - the upstream block is an accelerated block while the downstream block is a normal host block; within the buffer wrapper object data must be moved from the device to the host in order to be consumed by the downstream block * **DEVICE_TO_DEVICE** - both the upstream and downstream blocks are accelerated blocks; within the buffer wrapper object some device-specific operation (possibly nothing) must be performed in order for the data to be consumed by the downstream block

Note that not all all usages of the accelerated block interface and custom buffers exactly fit the chosen nomenclature. The names chosen were meant to reflect the most common logical use cases.

Additional details about how buffer context is used are described in the [Data Movement](#data-movement) section.

Callback Functions
As mentioned above the single mapped buffer abstraction requires explicit logic to handle wrapping from the end of the underlying buffer back to the beginning of the buffer. Two callback functions, `input_blocked_callback` and `output_blocked_callback`, were added to the `buffer` class interface to handle non-trivial wrapping cases.

The "input blocked" case occurs when the desired number of input items cannot be read even though additional input items may be available back at the beginning of the buffer. The `input_blocked_callback` function of the `buffer_reader` class is called when the number of items available between the current read pointer and the end of a single mapped buffer is less than the minimum number required. It attempts to realign the available input data to be contiguous beginning at the start of the buffer so that it can be read.

![input_blocked_callback Example](https://i.imgur.com/3XkNZoz.png)

In the above example, the reader needs to read three items however only two items are available until the end of the buffer is reached (shown in blue). Note that there are additional items ready to be read located at the beginning of the buffer (shown in green). The first step of the `input_blocked_callback` is to copy the "down" to make room for the items located at the end of the buffer. Next, the items at the end of the buffer are copied into the now free space at the beginning of the buffer. Finally the read and write pointers are updated to reflect the changes. After the `input_blocked_callback` has run the input reader is no longer blocked and can now read the three items it needs.

Similarly the "output blocked" case occurs when the desired number of output items cannot be written to the buffer. The `output_blocked_callback` function of the `buffer` class is called when the space available between the current write pointer and the end of the buffer is less than the space required. It attempts to copy any unread data back to the beginning of the buffer and then reset the write pointer immediately after the data such that additional space is available for writing.

![output_blocked_callback Example](https://i.imgur.com/wjXwZl3.png)

In the above example, the writer needs to write three items however space is only available for two items before the end of the buffer is reached. Note that unread data exists between the read pointer and write pointer (shown in blue). First, the `output_blocked_callback` copies any unread data from its current location back to the beginning of the buffer. Next, the read and write pointers are adjusted such that the read pointer points to the unread data aligned at the beginning of the buffer and the write pointer points immediately after it. After the `output_blocked_callback` has run the output reader is no longer blocked and the next three items can be written after the unread data.

Although it is possible to customize the implementation of the callback functions, it is not recommended. Instead, to connect `buffer_single_mapped` derived classes to the callback functions, two additional functions have been created `input_blocked_callback_logic` and `output_blocked_callback_logic`. Each function includes the logic to perform the general steps described in the examples above however the exact functions used for data movement within the buffer are abstracted and must be passed in as parameters (via function pointers) to each function. Specifically, `buffer_single_mapped` derived classes must implement two functions, `memcpy` and `memmove` that operate on the underlying device-specific buffer and behave in the same fashion as the standard C library functions of the same names. For convenience a `std::function` pointer type has been defined to be used for these two functions. In addition to the data movement functions, the underlying buffer on which to operate is also passed as an argument to the `input_blocked_callback_logic` and `output_blocked_callback_logic` functions. The `input_blocked_callback` and `output_blocked_callback` functions (which themselves must be implemented `buffer_single_mapped` derived class) must use a buffer's context to determine which underlying buffer pointer and corresponding data movement function pointers should be passed into the `input_blocked_callback_logic` and `output_blocked_callback_logic` functions.

Custom Lock Interface
Executing the callback functions described above potentially rearranges data and adjusts read and write pointers within a buffer. As such it is important to execute the callback functions safely so that they do not interfere with blocks attempting to read or write data to and from the buffer. Each buffer object is protected by a mutex while the runtime calculates the read and write pointers for a given block's next call to its `work` function. However, the runtime releases the buffer's mutex and then hands the pointers to the block for use. This means that simply locking the buffer object's mutex is not sufficient to protect it as attached blocks could already be using previously calculated pointers into the buffer even if its mutex is locked. A more sophisticated locking mechanism is needed to protect buffers while callback functions are being executed.

Several additions were added to the `buffer` class to build such a locking mechanism. First a count of the number of active pointers (either read or write) was added along with mutex protected functions for incrementing and decrementing the counts. The `block_executor::run_one_iteration` function was then updated to call these functions in the appropriate places before and after the call to a block's `work` function. Next, a simple flag indicating if a callback function is currently executing was added to the `buffer` class. Finally, the `custom_lock_if` was added to the `buffer` class to provide additional locking logic. The `custom_lock_if` class is an abstract class and the `buffer` class derives from it (this provides a C++-style "interface").

The `custom_lock` class implements an [RAII style](https://en.wikipedia.org/wiki/Resource_acquisition_is_initialization) lock (similar to `std::lock_guard`) using a mutex and a pointer to an object implementing the `custom_lock_if`. The `custom_lock_if` includes two functions `on_lock` and `on_unlock` that are called when the created `custom_lock` object is locked (constructed) and unlocked (destructed) respectively. The `buffer` class's implementation of `on_lock` uses a condition variable to wait until the active pointer count is zero and the callback flag is false. The condition variable then locks the passed mutex when the condition is met. This ensures that no blocks have active pointers into the buffer and that no other callback are currently executing on the buffer. Furthermore locking the mutex prevents other blocks from calculating pointers into the buffer until the lock is unlocked and the mutex is released. The custom lock interface as implemented but the `buffer` class provides sufficient protection that allows the `input_blocked_callback` and `output_blocked_callback` functions to execute safely.

Acknowledgements
The work described herein was performed by David Sorber between August 2020 and August 2021.

I would like to thank both Joshua Morman and Seth Hitefield for their support and feedback during this project.

Methodology and Description
![methodology](https://i.imgur.com/MW7XFAz.png)


 * Used [gr-bench](https://github.com/mormj/gr-bench) approach and scripts
 * Tests were run using CUDA loopback blocks in the flow graph shown above for each of the following three cases:
 * stock GR 3.9 + legacy (double copy) loopback - shown in blue in the graphs below
 * ngsched + legacy (double copy) loopback - shown in orange in the graphs below
 * ngsched + single mapped custom buffer - shown in green in the graphs below
 * Each test case iterated over various values for "veclen" (batch size) and number of loopback blocks
 * "veclen" (batch size) values: 1024, 2048, 4096, 8192, 16384, 32768
 * number of loopback blocks value: 1, 2, 4, 16
 * Each test case was run 10 times
 * Each plot below shows execution time plotted against veclen. Note an equivalent plot could be made showing throughput (MB/s) vs. veclen.
 * Total data copied was 100,000,000 * 8 byte `gr_complex` values for a total of 800,000,000 bytes (~762.94 MB)

Dell XPS 15 laptop + NVidia GTX 1650 GPU

 * Dell XPS 15 laptop
 * Intel i9-10885H (8 cores/16 threads)
 * 32 GB DDR4-2933
 * NVidia GTX 1650 GPU

![xps15_1](https://i.imgur.com/qRdPEBl.png) ![xps15_2](https://i.imgur.com/9uzwAFd.png)

SuperMicro SM-X11DGQ Server + NVidia P100

 * SuperMicro SM-X11DGQ Server
 * 2x Intel Gold 6148 (20 cores/40 threads each)
 * 512 GB DDR4-
 * NVidia P100 GPU

![sm-x11dgq_1](https://i.imgur.com/2d41zuC.png) ![sm-x11dgq_2](https://i.imgur.com/y8gaCfZ.png)

NVidia Jetson AGX Xavier

 * NVidia Jetson AGX Xavier
 * 8-core ARM v8.2 CPU
 * 32GB 256-Bit LPDDR4x
 * 512-core Volta GPU with Tensor Cores

![xavier_1](https://i.imgur.com/SzpJOya.png) ![xavier_2](https://i.imgur.com/8zSdhRz.png)