5. Programming with GNAT for CUDA®

5.1. CUDA API

The CUDA API available from GNAT for CUDA® is a binding to the CUDA API provided by NVIDIA. The NVIDIA API is installed with the CUDA driver. You access the Ada API by adding a reference to cuda_host.gpr (on the host) and cuda_device.gpr (on the target). The initial installation script generates the Ada version of the API from the CUDA version that’s installed on your system.

Two versions of the Ada API are available:

a “thick” binding version. These are child units of the CUDA package, the main one being CUDA.Runtime_API. This is the API you will most likely use. However, this API is still in the process of being completed and a number of types and subprogram specs have not been mapped to higher-level Ada constructs. For example, you will still see a lot of references to System.Address where you would normally expect specific access types in Ada.
a “thin” binding version. These are typically identified by having a suffix of _h. They are direct bindings to the underlying C APIs. These bindings are functional and complete. They can be used as a low-level alternative to the thick binding, but they don’t expose an interface consistent with the Ada programming style and may require more work to use.

You can regenerate these bindings at any time. You may want to do this, for example, if you install a new version of CUDA. To regenerate these bindings, execute the bind.sh script located in <your GNAT for CUDA installation>/cuda/api/.

5.2. Defining and calling Kernels

Just as in a typical CUDA program, programming in GNAT for CUDA requires you to identify application entry points to the GPU code, called “kernels”. In Ada, you do this by annotating a procedure with the CUDA_Global aspect, which serves the same role as the CUDA __global__ modifier. For example:

procedure My_Kernel (X : Some_Array_Access)
with CUDA_Global;

Kernels are compiled both for the host and the device. They can be called as regular procedures, e.g:

My_Kernel (An_Array_Instance);

The above makes a regular single-threaded call to the kernel and executes it on the host. You may want to do this because of a better debugging environment on the host.

To call a kernel on the device (which means copying it to the device and executing it there), you use the CUDA_Execute pragma:

pragma CUDA_Execute (My_Kernel (An_Array_Instance), 10, 1);

The procedure call looks the same as a regular call, but this call is surrounded by the pragma CUDA_Execute, which has two extra parameters defining, respectively, the number of blocks per grid and the number of threads per block. This is equivalent to the CUDA call:

<<<10, 1>>> myKernel (someArray);

In each case, these calls launch ten instances of the kernel to the device.

The numbers of threads per block and blocks per grid can be expressed as a one-dimensional scalar or a Dim3 value that specifies all three dimensions (x, y, and z). For example:

pragma CUDA_Execute (My_Kernel (An_Array_Instance), (3, 3, 3), (3, 3, 3));

The above call launches (3 * 3 * 3) * (3 * 3 * 3) = 729 instances of the kernel on the device.

5.3. Passing Data between Device and Host

5.3.1. Using Storage Model Aspect

“Storage Model” is an extension to the Ada language that is currently under development. General description of this capability can be found here.

GNAT for CUDA provides a storage model that maps to CUDA primitives for allocation, deallocation, and copying. The model is declared in the package CUDA.Storage_Models. You may either use CUDA.Storage_Models.Model itself or you may create your own.

When a pointer type is associated with a CUDA storage model, memory allocation through that pointer occurs on the device in the same manner as it would in the host if a storage model wasn’t specified. For example:

type Int_Array is array (Integer range <>) of Integer;

type Int_Array_Device_Access is access Int_Array
   with Designated_Storage_Model => CUDA.Storage_Model.Model;

Device_Array : Int_Array_Device_Access := new Int_Array (1 .. 100);

In addition to allocation being done on the device, copies between the host and device are converted to call the CUDA memory copy operations. So you can write:

procedure Main is
   type Int_Array_Host_Access is access Int_Array;

   Host_Array : Int_Array_Host_Access := new Int_Array (1 .. 100);
   Device_Array : Int_Array_Device_Access := new Int_Array'(Host_Array.all);
begin
   pragma CUDA_Execute (
       Some_Kernel (Device_Array),
       Host_Array.all'Length,
       1);

   Host_Array.all := Device_Array.all;
end Main;

On the kernel side, CUDA.Storage_Model.Model is the native storage model (as opposed to the foreign device one when on the host side). You can use Int_Array_Device_Access directly:

procedure Kernel (Device_Array : Int_Array_Device_Access) is
begin
   Device_Array (Thread_IDx.X) := Device_Array (Thread_IDx.X) + 10;
end Kernel;

This is the recommended way of sharing memory between device and host. However, the storage model can be extended to support capabilities such as streaming or unified memory.

5.3.2. Using Unified Storage Model

An alternative to using the default CUDA Storage model is to use so-called “unified memory”. In that model, the device memory is mapped directly onto host memory, so no special copy operation is necessary. The factors that may lead you to choose to one model over the other are outside of the scope of this manual. To use unified memory, you use the package Unified_Model instead of the default one:

type Int_Array is array (Integer range <>) of Integer;

type Int_Array_Device_Access is access Int_Array
   with Designated_Storage_Model => CUDA.Storage_Model.Unified_Model;

5.3.3. Using Storage Model with Streams

CUDA streams allows you to launch several computations in parallel. This model allows you to specify which computation write and read operation must wait for. The Ada CUDA API doesn’t provide a pre-allocated stream memory model. Instead, it provides a type, CUDA_Async_Storage_Model, that you can instantiate and specify the specific stream:

My_Stream_Model : CUDA.Storage_Model.CUDA_Async_Storage_Model
  (Stream => Stream_Create);

type Int_Array is array (Integer range <>) of Integer;

type Int_Array_Device_Access is access Int_Array
   with Designated_Storage_Model => My_Stream_Model;

The data stream associated with a specific model can vary over time, allowing different parts of a given object to be used by different streams, e.g.:

   X : Int_Array_Device_Access := new Int_Array (1 .. 10_000);
   Stream_1 : Stream_T := Stream_Create;
   Stream_2 : Stream_T := Stream_Create;
begin
   My_Stream_Model.Stream := Stream_1;
   X (1 .. 5_000) := 0;
   My_Stream_Model.Stream := Stream_2;
   X (5_001 .. 10_000) := 0;

5.3.4. Low-Level Data Transfer

At the lowest level, you can allocate memory to the device using the standard CUDA function malloc that’s bound from CUDA.Runtime_API.Malloc. E.g.:

Device_Array : System.Address := CUDA.Runtime_API.Malloc (Integer'Size * 100);

This is equivalent to the following CUDA code:

int *deviceArray = cudaMalloc (sizeof (int) * 100);

In this example, objects on the Ada side aren’t typed. Creating typed objects requires more advanced Ada constructions that are described later.

The above statement created space in the device memory of 100 integers. That space can now be used to perform copies back and forth from host memory. For example:

procedure Main is
   type Int_Array is array (Integer range <>) of Integer;
   type Int_Array_Access is access all Int_Array;

   Host_Array : Int_Array_Access := new Int_Array (1 .. 100);
   Device_Array : System.Address := CUDA.Runtime_API.Malloc (Integer'Size * 100);
begin
   Host_Array := (others => 0);

   CUDA.Runtime_API.Memcpy
       (Dst   => Device_Array,
        Src   => Host_Array.all'Address,
        Count => Host_Array.all'Size,
        Kind  => Memcpy_Host_To_Device);

    pragma CUDA_Execute (
        Some_Kernel (Device_Array, Host_Array.all'Length),
        Host_Array.all'Length,
        1);

    CUDA.Runtime_API.Memcpy
       (Dst   => Host_Array.all'Address
        Src   => Device_Array,
        Count => Host_Array.all'Size,
        Kind  => Memcpy_Device_To_Host);
end Main;

This code copies the contents of Host_Array to Device_Array, performs some computations on that data on the device, and then copies the data back. At this level of coding, we’re not passing a typed array but instead a raw address. On the kernel side, we need to reconstruct the array with an overlay:

procedure Kernel (Array_Address : System.Address; Length : Integer) is
   Device_Array : Int_Array (1 .. Length)
      with Address => Array_Address;
begin
   Device_Array (Thread_IDx.X) := Device_Array (Thread_IDx.X) + 10;
end Kernel;

While it works, this method of passing data back and forth is not very satisfactory and you should reserve it for cases where an alternative doesn’t exist or doesn’t exist yet. In particular, typing is lost at the interface, and you need to carefully check manually for type correctness.

5.4. Specifying Where Code is For

Like in CUDA, a GNAT for CUDA application contains code that may be compiled exclusively for the host, the device, or both. By default, all code is compiled for both the host and the device. You can identify code as only being compilable for the device by using the CUDA_Device aspect:

procedure Some_Device_Procedure
   with CUDA_Device;

Some_Device_Procedure will not exist on the host. Calling it will result in a compilation error.

The corresponding CUDA_Host aspect is currently not implemented.

5.5. Accessing Block and Thread Indexes and Dimensions

GNAT for CUDA® allows you to access block and thread indexes and dimensions in a way that’s similar to CUDA. The package CUDA.Runtime_API declares Block_Dim, Grid_Dim, Block_IDx and Thread_IDx which map directly to the corresponding PTX registers. For example:

J : Integer := Integer (Block_Dim.X * Block_IDx.Y + Thread_IDx.X);