= Co-Processor Buffer Design Document =

This document is a work in progress until it is finalized. It poses questions that need to be answered, and likely contains incomplete and/or incorrect information. Please feel free to answer questions and correct any part of this document.

== Background ==
This document records the design goals of proposed changes to the GNURadio scheduler that will enable use of various co-processors across a variety of platforms.

In particular, we want to identify which use cases are either not well suited to, or not supported by, the current GNURadio API.

Note that nothing currently prevents the use of co-processors with GNURadio (e.g. see gr-dsp: https://github.com/alfayez/gr-dsp, fosphor: http://sdr.osmocom.org/trac/wiki/fosphor, gr-theano: https://github.com/osh/gr-theano). However, the desire is to improve the efficiency of co-processor support, i.e. to ensure that we achieve the maximum possible performance on measurable metrics (e.g. throughput, latency). Often this comes down to improving the flow of data between the co-processor and the rest of the GNURadio flow graph: the aim is to reduce the number of copies required to move data between blocks running on the host processor and blocks running on the targeted co-processor.

The ultimate goal is to ensure that GNURadio supports the desired use cases, but that no hardware-specific code resides within the mainline code base. All hardware-specific support code should reside in OOT modules.

== Reference documents ==
Most of the content of this document is based on the discussions held by the Co-Processors Working Group at GRCon'13 and GRCon'14.

Additionally, we are referencing the issues created after GRCon'14: #729, #730, #731, #732

== Use Cases ==
In general, co-processor use cases break down into two categories of interest:

1. Hardware supports scatter/gather DMA: i.e. the driver handles all the hard parts


 * No changes are needed to the GNURadio scheduler
 * However the buffer base/length should be accessible to the  functions: #729 (which has been addressed in commit 1fe3a091a6ad0940ea8880c796f89ab194508b7e)
 * It is desirable that the OOT module can receive signals from the scheduler indicating that the read/write pointers have moved (i.e. upstream/downstream blocks are populating inputs/consuming outputs): #732
 * Example use cases: GPUs, Xilinx Zynq

2. Hardware doesn't support scatter/gather: i.e. it requires buffers consisting of contiguous memory


 * The GNURadio scheduler needs to support blocks with custom allocators (i.e. the scheduler must allow the block to control its input/output buffers): #730
 * Additionally, if the custom allocator cannot provide doubly-mapped buffers, the scheduler must handle this appropriately: #731
 * Example use cases: ARM SoCs with DSP co-processors (e.g. TI OMAP, Qualcomm Snapdragon, etc.)
 * Example contiguous memory allocators: TI's CMEM, Android ION, NVidia NVMAP, Linux dmabuf

At this stage, use cases with hardware that doesn't fall into one of the two categories above are considered unsupportable, and will generally require that the OOT blocks perform the copying to/from the co-processor manually.

== Implementation Concepts ==
With the above use cases in mind, the proposed implementation is as follows:


 * io_signature flags: following Corgan's "buffer_flags" branch on github ( https://github.com/jmcorgan/gnuradio.git ), add a flags argument (uint32_t) to io_signature::make, allowing the caller to set specific flags such as "MEM_BLOCK_OWNS" (meaning that the block wants to allocate and own its own memory);
   * flags can be specified independently (e.g., as a vector of flags), or packed into a single uint32_t using OR and AND bit manipulation (32 independent settings + default);
   * bit manipulation is probably easier in terms of the API, but more limited.
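
As an illustration of the single-uint32_t option, the following is a minimal sketch of how such flags could be set, cleared and tested. Only the name MEM_BLOCK_OWNS comes from the text above; the bit values, the MEM_DEFAULT and MEM_SINGLE_MAPPED names, and the helper functions are assumptions made for this example, and nothing here is taken from the "buffer_flags" branch or the io_signature API.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical flag values: only the name MEM_BLOCK_OWNS appears in this
// document; the other names and all bit positions are assumptions.
enum buffer_flags : uint32_t {
    MEM_DEFAULT       = 0,       // scheduler allocates buffers as it does today
    MEM_BLOCK_OWNS    = 1u << 0, // block allocates and owns its own memory
    MEM_SINGLE_MAPPED = 1u << 1  // assumed flag: buffer is not doubly mapped
};

// Set a flag with bitwise OR.
constexpr uint32_t set_flag(uint32_t flags, uint32_t f) { return flags | f; }

// Clear a flag with bitwise AND against the inverted mask.
constexpr uint32_t clear_flag(uint32_t flags, uint32_t f) { return flags & ~f; }

// Test a flag with bitwise AND (always false for MEM_DEFAULT, which is 0).
constexpr bool has_flag(uint32_t flags, uint32_t f) { return (flags & f) != 0; }

// Usage sketch: a block requesting ownership of its buffer memory.
inline uint32_t example_flags()
{
    uint32_t flags = set_flag(MEM_DEFAULT, MEM_BLOCK_OWNS);
    assert(has_flag(flags, MEM_BLOCK_OWNS));
    return flags;
}
```

A vector of flags would lift the 32-setting limit, at the cost of a heavier call signature.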


 * blocks: create a new base block type that provides a method returning a specially allocated memory pointer, the buffer length, and whether the buffer is single or double mapped. CoProc blocks must inherit from this block and must define this method. This block inherits from either basic_block or block (it is not yet clear which makes more sense, given the functionality required for egress / ingress blocks). In my github branch I implemented this as a  function for , which currently gets called by   immediately after the block_detail is allocated. The rationale is that this could be useful for custom blocks beyond co-processors with custom allocators/block-owned buffers. It is the responsibility of the   function to allocate the buffers (including wrapping the custom allocated space in a   class instance), and then pass those buffers into the.
   * there must also be a "deallocate" method, called when the block is being deleted.
   * we probably also want "start hardware" and "stop hardware" methods, to handle situations where the hardware needs some extra functionality to get going beyond memory allocation / deallocation. This should be handled in the  function.
   * we probably also want methods to handle buffer reads and writes, for hardware that requires it.
   * can these methods be moved into the buffer class (see below)? This seems practical given the number of methods and their uses.
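
To make the shape of such a base block concrete, here is a minimal sketch of the interface described above. Every name in it (coproc_buffer_info, coproc_block_base, allocate_buffer, deallocate_buffer, start_hardware, stop_hardware, malloc_coproc_block) is hypothetical and chosen for this example; it deliberately does not derive from basic_block or block and does not reflect the code in the github branch. The malloc-backed subclass only stands in for a real contiguous-memory allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical description of a block-owned buffer: base pointer, length,
// and whether the memory is doubly mapped.
struct coproc_buffer_info {
    void*       base = nullptr;
    std::size_t length = 0;
    bool        double_mapped = false;
};

// Hypothetical base class for co-processor blocks. In GNURadio it would
// derive from basic_block or block; that is omitted to keep this standalone.
class coproc_block_base
{
public:
    virtual ~coproc_block_base() = default;

    // Return specially allocated memory of at least nbytes, or a
    // zero-length result on failure.
    virtual coproc_buffer_info allocate_buffer(std::size_t nbytes) = 0;

    // Release memory obtained from allocate_buffer().
    virtual void deallocate_buffer(coproc_buffer_info& info) = 0;

    // Optional hooks for hardware that needs extra bring-up or tear-down.
    virtual bool start_hardware() { return true; }
    virtual bool stop_hardware() { return true; }
};

// Stand-in implementation using malloc in place of a real allocator such
// as CMEM or ION; the result is single mapped.
class malloc_coproc_block : public coproc_block_base
{
public:
    coproc_buffer_info allocate_buffer(std::size_t nbytes) override
    {
        assert(nbytes > 0);
        coproc_buffer_info info;
        info.base = std::malloc(nbytes);
        info.length = info.base ? nbytes : 0;
        info.double_mapped = false;
        return info;
    }

    void deallocate_buffer(coproc_buffer_info& info) override
    {
        std::free(info.base);
        info = coproc_buffer_info{}; // reset to the "no buffer" state
    }
};
```

If these methods move into the buffer class instead, the same signatures could live there with the block only returning a buffer instance.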


 * buffers: use a base buffer class with the minimal set of methods required to meet the scheduler's needs. Create two buffer classes that inherit from this base, double mapped and single mapped, to meet the use cases listed above.
   * can we just have the block return an instance of this new buffer, with all of the methods provided by the buffer instead of the block? This seems practical.
   * if we do the buffers correctly, no changes should be required to the scheduler.
   * a single mapped buffer can be made much larger than required (say, 10x the desired size [d_size]); once a read or write pointer gets to within d_size of the end, we do a memcpy and reset the pointers to be near the start of the buffer, preserving history and such. In this way the overhead for memcpy isn't too large (perhaps 1/10 of that for a buffer of the shorter length), and we hopefully reduce the changes to the scheduler, or require none at all.
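
The oversized single mapped buffer idea can be sketched as follows. This is a simplified illustration under several assumptions: one reader, byte-sized items, a class name (single_mapped_buffer) and method names invented for the example, and compaction performed inside produce()/consume(), which is only safe when no previously returned pointer is still in use. It uses memmove rather than memcpy because the source and destination ranges may overlap.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical single mapped buffer, allocated `factor` times larger than
// d_size so that compaction (copy back to the start) happens only rarely.
class single_mapped_buffer
{
public:
    single_mapped_buffer(std::size_t d_size, std::size_t history, std::size_t factor = 10)
        : d_size_(d_size), history_(history), mem_(d_size * factor),
          rd_(0), wr_(0), compactions_(0) {}

    std::size_t space_available() const { return mem_.size() - wr_; } // contiguous bytes
    std::size_t items_available() const { return wr_ - rd_; }
    char* write_pointer() { return mem_.data() + wr_; }
    const char* read_pointer() const { return mem_.data() + rd_; }
    std::size_t compactions() const { return compactions_; }

    void produce(std::size_t n)
    {
        assert(n <= space_available());
        wr_ += n;
        maybe_compact();
    }

    void consume(std::size_t n)
    {
        assert(n <= items_available());
        rd_ += n;
        maybe_compact();
    }

private:
    // When fewer than d_size bytes remain at the end, move the history and
    // the unread data to the start of the buffer and rebase both pointers.
    void maybe_compact()
    {
        if (mem_.size() - wr_ >= d_size_)
            return; // still enough room at the end
        const std::size_t keep_from = rd_ > history_ ? rd_ - history_ : 0;
        if (keep_from == 0)
            return; // nothing can be reclaimed; the writer has to wait
        std::memmove(mem_.data(), mem_.data() + keep_from, wr_ - keep_from);
        rd_ -= keep_from;
        wr_ -= keep_from;
        ++compactions_;
    }

    std::size_t d_size_;
    std::size_t history_;
    std::vector<char> mem_;
    std::size_t rd_;
    std::size_t wr_;
    std::size_t compactions_;
};
```

Because each compaction moves only the history plus the unread data, a larger factor makes the copy less frequent at the cost of more memory; whether a factor of 10 is right would need measurement.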


 * allocation: flat_flowgraph::allocate_buffer in gnuradio-runtime/flat_flowgraph.cc . When a block is created with "MEM_BLOCK_OWNS" in the io_signature, and assuming there is no conflict with adjacent blocks (adjacent in the flow-graph sense; any given single connection between blocks is actually just one buffer), call the block's special memory allocator method and use the returned pointer to create a single or double-mapped buffer (depending on the values actually returned from allocation).
   * if there is a conflict where adjacent blocks both want to allocate memory, we'll need to decide what makes sense to do. As a first effort, print a warning and allow neither to allocate.
   * if the block's special memory allocator does not return valid or useful information (e.g., buffer address is 0; buffer length is 0), then print a warning and revert to current usage (or, maybe, error out?).
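
The decision rules above can be sketched as a small standalone helper. The names choose_owner, is_usable, buffer_owner and alloc_result, as well as the bit value of MEM_BLOCK_OWNS, are assumptions for illustration; this is not the logic of flat_flowgraph::allocate_buffer, only one way the "first effort" policy could be expressed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iostream>

constexpr uint32_t MEM_BLOCK_OWNS = 1u << 0; // hypothetical bit value

// Who allocates the single buffer that sits on one flow-graph connection.
enum class buffer_owner { scheduler, upstream_block, downstream_block };

// Hypothetical result of a block's special allocator method.
struct alloc_result {
    void*       base;
    std::size_t length;
    bool        double_mapped;
};

// First-effort policy: if both adjacent blocks want to own the buffer,
// warn and let neither allocate, falling back to the scheduler.
inline buffer_owner choose_owner(uint32_t upstream_flags, uint32_t downstream_flags)
{
    const bool up = (upstream_flags & MEM_BLOCK_OWNS) != 0;
    const bool down = (downstream_flags & MEM_BLOCK_OWNS) != 0;
    if (up && down) {
        std::cerr << "warning: both adjacent blocks request MEM_BLOCK_OWNS; "
                     "using the default scheduler allocation\n";
        return buffer_owner::scheduler;
    }
    if (up)
        return buffer_owner::upstream_block;
    if (down)
        return buffer_owner::downstream_block;
    return buffer_owner::scheduler;
}

// A zero address or zero length means the allocation is not usable and the
// caller should warn and revert to the current behaviour (or error out).
inline bool is_usable(const alloc_result& r)
{
    return r.base != nullptr && r.length != 0;
}

// Usage sketch: a plain block feeding a co-processor block.
inline buffer_owner example_edge()
{
    buffer_owner owner = choose_owner(0u, MEM_BLOCK_OWNS);
    assert(owner == buffer_owner::downstream_block);
    return owner;
}
```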

== Verifying Functionality ==
This is likely the most difficult part of the task: testing to make sure that the new methods and classes function as desired (building, linking, basic use case needs are met), and that they are used correctly during runtime. A good design should minimize runtime issues to be addressed, but we will need lots of testing here.