5. Programming with GNAT for CUDA®
5.1. CUDA API
The CUDA API available from GNAT for CUDA® is a binding to the CUDA API
provided by NVIDIA. The NVIDIA API is installed with the CUDA driver. You
access the Ada API by adding a reference to cuda_host.gpr (on the
host) and cuda_device.gpr (on the target). The initial
installation script generates the Ada version of the API from the CUDA
version that’s installed on your system.
Two versions of the Ada API are available:

- A "thick" binding. These are child units of the CUDA package, the main one being CUDA.Runtime_API. This is the API you will most likely use. However, this API is still in the process of being completed, and a number of types and subprogram specs have not yet been mapped to higher-level Ada constructs. For example, you will still see many references to System.Address where you would normally expect specific access types in Ada.
- A "thin" binding. These units are typically identified by a _h suffix. They are direct bindings to the underlying C APIs. These bindings are functional and complete. They can be used as a low-level alternative to the thick binding, but they don't expose an interface consistent with the Ada programming style and may require more work to use.
You can regenerate these bindings at any time. You may want to do this, for
example, if you install a new version of CUDA. To regenerate these
bindings, execute the bind.sh script located in <your GNAT
for CUDA installation>/cuda/api/.
5.2. Defining and Calling Kernels
Just as in a typical CUDA program, programming in GNAT for CUDA requires
you to identify application entry points to the GPU code, called
“kernels”. In Ada, you do this by annotating a procedure with the
CUDA_Global aspect, which serves the same role as the CUDA
__global__ modifier. For example:
procedure My_Kernel (X : Some_Array_Access)
  with CUDA_Global;
Kernels are compiled for both the host and the device. They can be called as regular procedures, e.g.:
My_Kernel (An_Array_Instance);
The above makes a regular single-threaded call to the kernel and executes it on the host. You may want to do this to take advantage of the better debugging environment available on the host.
To call a kernel on the device (which means copying it to the device and
executing it there), you use the CUDA_Execute pragma:
pragma CUDA_Execute (My_Kernel (An_Array_Instance), 10, 1);
The procedure call looks the same as a regular call, but this call is
surrounded by the pragma CUDA_Execute, which has two extra
parameters defining, respectively, the number of blocks per grid and the
number of threads per block. This is equivalent to the CUDA call:
myKernel<<<10, 1>>> (someArray);
In each case, these calls launch ten instances of the kernel on the device.
The numbers of threads per block and blocks per grid can be expressed as a
one-dimensional scalar or a Dim3 value that specifies all three
dimensions (x, y, and z). For example:
pragma CUDA_Execute (My_Kernel (An_Array_Instance), (3, 3, 3), (3, 3, 3));
The above call launches (3 * 3 * 3) * (3 * 3 * 3) = 729 instances of the kernel on the device.
5.3. Passing Data between Device and Host
5.3.1. Using Storage Model Aspect
"Storage Model" is an extension to the Ada language that is currently under development. A general description of this capability can be found in the GNAT documentation.
GNAT for CUDA provides a storage model that maps to CUDA primitives for
allocation, deallocation, and copying. The model is declared in the package
CUDA.Storage_Models. You may either use
CUDA.Storage_Models.Model itself or you may create your own.
When a pointer type is associated with a CUDA storage model, memory allocated through that pointer is allocated on the device, in the same manner as it would be allocated on the host if no storage model were specified. For example:
type Int_Array is array (Integer range <>) of Integer;
type Int_Array_Device_Access is access Int_Array
  with Designated_Storage_Model => CUDA.Storage_Models.Model;

Device_Array : Int_Array_Device_Access := new Int_Array (1 .. 100);
In addition to allocation being done on the device, copies between the host and device are converted to call the CUDA memory copy operations. So you can write:
procedure Main is
   type Int_Array_Host_Access is access Int_Array;

   Host_Array   : Int_Array_Host_Access := new Int_Array (1 .. 100);
   Device_Array : Int_Array_Device_Access := new Int_Array'(Host_Array.all);
begin
   pragma CUDA_Execute
     (Some_Kernel (Device_Array),
      Host_Array.all'Length,
      1);

   Host_Array.all := Device_Array.all;
end Main;
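As noted above, the storage model also maps deallocation onto the corresponding CUDA primitive. A minimal sketch, reusing the Int_Array and Int_Array_Device_Access declarations from the example above (the instantiation name Free is illustrative):

```ada
with Ada.Unchecked_Deallocation;

--  Deallocation through this instance goes through the storage
--  model, which releases the device-side memory.
procedure Free is new Ada.Unchecked_Deallocation
  (Int_Array, Int_Array_Device_Access);
```

Calling Free (Device_Array) then releases the device allocation and sets Device_Array to null, just as it would for a host allocation.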
On the kernel side, CUDA.Storage_Models.Model is the native storage
model (as opposed to the foreign device model when on the host side). You
can use Int_Array_Device_Access directly:
procedure Kernel (Device_Array : Int_Array_Device_Access) is
begin
   Device_Array (Thread_IDx.X) := Device_Array (Thread_IDx.X) + 10;
end Kernel;
This is the recommended way of sharing memory between device and host. However, the storage model can be extended to support capabilities such as streaming or unified memory.
5.3.2. Using Unified Storage Model
An alternative to the default CUDA storage model is so-called
"unified memory". In that model, device memory is mapped directly onto
host memory, so no explicit copy operation is necessary. The factors that
may lead you to choose one model over the other are outside the scope
of this manual. To use unified memory, specify Unified_Model
instead of the default model:
type Int_Array is array (Integer range <>) of Integer;
type Int_Array_Device_Access is access Int_Array
  with Designated_Storage_Model => CUDA.Storage_Models.Unified_Model;
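With the unified model, no Memcpy-style transfer is needed around a kernel launch. A minimal sketch, assuming a Some_Kernel kernel as in the earlier examples (the launch geometry is illustrative):

```ada
Device_Array : Int_Array_Device_Access := new Int_Array (1 .. 100);
begin
   --  Direct host access: the allocation is visible on both sides.
   Device_Array.all := (others => 0);

   pragma CUDA_Execute (Some_Kernel (Device_Array), 1, 100);

   --  Once the device work has completed, the host reads the same
   --  memory directly, again without an explicit copy.
```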
5.3.3. Using Storage Model with Streams
CUDA streams allow you to launch several computations in parallel. This
model lets you specify which computations a given write or read operation
must wait for. The Ada CUDA API doesn't provide a pre-allocated stream memory
model. Instead, it provides a type, CUDA_Async_Storage_Model, that
you can instantiate with a specific stream:
My_Stream_Model : CUDA.Storage_Models.CUDA_Async_Storage_Model
  (Stream => Stream_Create);

type Int_Array is array (Integer range <>) of Integer;
type Int_Array_Device_Access is access Int_Array
  with Designated_Storage_Model => My_Stream_Model;
The data stream associated with a specific model can vary over time, allowing different parts of a given object to be used by different streams, e.g.:
X        : Int_Array_Device_Access := new Int_Array (1 .. 10_000);
Stream_1 : Stream_T := Stream_Create;
Stream_2 : Stream_T := Stream_Create;
begin
   My_Stream_Model.Stream := Stream_1;
   X (1 .. 5_000) := (others => 0);
   My_Stream_Model.Stream := Stream_2;
   X (5_001 .. 10_000) := (others => 0);
5.3.4. Low-Level Data Transfer
At the lowest level, you can allocate memory on the device using the
standard CUDA function cudaMalloc, bound as
CUDA.Runtime_API.Malloc. Note that CUDA sizes are expressed in bytes
while the Ada 'Size attribute is in bits, hence the division by 8. E.g.:

Device_Array : System.Address := CUDA.Runtime_API.Malloc (Integer'Size / 8 * 100);
This is equivalent to the following CUDA code:
int *deviceArray;
cudaMalloc ((void **) &deviceArray, sizeof (int) * 100);
In this example, objects on the Ada side aren’t typed. Creating typed objects requires more advanced Ada constructions that are described later.
The above statement creates space in device memory for 100 integers. That space can now be used to perform copies back and forth between host and device memory. For example:
procedure Main is
   type Int_Array is array (Integer range <>) of Integer;
   type Int_Array_Access is access all Int_Array;

   Host_Array   : Int_Array_Access := new Int_Array (1 .. 100);
   Device_Array : System.Address :=
     CUDA.Runtime_API.Malloc (Integer'Size / 8 * 100);
begin
   Host_Array.all := (others => 0);

   CUDA.Runtime_API.Memcpy
     (Dst   => Device_Array,
      Src   => Host_Array.all'Address,
      Count => Host_Array.all'Size / 8,
      Kind  => Memcpy_Host_To_Device);

   pragma CUDA_Execute
     (Some_Kernel (Device_Array, Host_Array.all'Length),
      Host_Array.all'Length,
      1);

   CUDA.Runtime_API.Memcpy
     (Dst   => Host_Array.all'Address,
      Src   => Device_Array,
      Count => Host_Array.all'Size / 8,
      Kind  => Memcpy_Device_To_Host);
end Main;
This code copies the contents of Host_Array to
Device_Array, performs some computations on that data on the
device, and then copies the data back. At this level of coding, we’re not
passing a typed array but instead a raw address. On the kernel side, we
need to reconstruct the array with an overlay:
procedure Kernel (Array_Address : System.Address; Length : Integer) is
   Device_Array : Int_Array (1 .. Length)
     with Address => Array_Address;
begin
   Device_Array (Thread_IDx.X) := Device_Array (Thread_IDx.X) + 10;
end Kernel;
While it works, this method of passing data back and forth is not very satisfactory, and you should reserve it for cases where no alternative exists yet. In particular, typing is lost at the interface, so you must manually verify type correctness.
5.4. Specifying Where Code is For
Like in CUDA, a GNAT for CUDA application contains code that may be
compiled exclusively for the host, the device, or both. By default, all
code is compiled for both the host and the device. You can identify code as
only being compilable for the device by using the CUDA_Device
aspect:
procedure Some_Device_Procedure
  with CUDA_Device;
Some_Device_Procedure will not exist on the host. Calling it from host
code will result in a compilation error.
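A typical use of CUDA_Device is a helper subprogram that only makes sense in kernel code. A sketch, with illustrative names:

```ada
--  Device-only helper: no host version is generated.
function Scaled (X : Integer) return Integer
  with CUDA_Device;

--  Legal: a kernel (device code) may call Scaled in its body.
procedure My_Kernel (A : Int_Array_Device_Access)
  with CUDA_Global;
```

A call to Scaled from ordinary host code would be rejected at compile time, since the subprogram only exists on the device.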
The corresponding CUDA_Host aspect is currently not implemented.
5.5. Accessing Block and Thread Indexes and Dimensions
GNAT for CUDA® allows you to access block and thread indexes and
dimensions in a way that’s similar to CUDA. The package
CUDA.Runtime_API declares Block_Dim, Grid_Dim,
Block_IDx and Thread_IDx which map directly to the
corresponding PTX registers. For example:
J : Integer := Integer (Block_Dim.X * Block_IDx.X + Thread_IDx.X);
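Put together, these values yield the usual global-index computation for a one-dimensional launch. A sketch of a kernel using it (the kernel name, array type, and range guard are illustrative):

```ada
procedure Add_One (A : Int_Array_Device_Access)
  with CUDA_Global;

procedure Add_One (A : Int_Array_Device_Access) is
   --  Global index of this thread across all blocks of the launch.
   I : constant Integer :=
     Integer (Block_Dim.X * Block_IDx.X + Thread_IDx.X);
begin
   --  Guard: the launch may create more threads than array elements.
   if I in A.all'Range then
      A (I) := A (I) + 1;
   end if;
end Add_One;
```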