Monday, March 18, 2013

Intro to CUDA programming

I have already posted a basic introduction to CUDA at http://itsitrc.blogspot.in/2013/01/compute-unified-device-architecture-cuda.html. In this blog we will discuss the basic concepts of CUDA programming. First, let us look at some of the identifiers/keywords used in CUDA:
  1. __global__ :- A function defined as __global__ (a kernel) is executed on the device (GPU) but is called from the host (CPU).
  2. __shared__:- A variable defined as __shared__ resides in the shared memory of a thread block and is visible to all threads in that block.
  3. __host__:- A function defined as __host__ runs on the host and can be called only from the host. This is the default for ordinary C functions.
  4. __device__:- A function or variable defined as __device__ is stored in device (GPU) memory and is executed or accessed on the device only.
  5. __constant__:- A variable defined as __constant__ resides in the device's constant memory, which is read-only from device code and cached for fast access.
  6. function<<<n,m>>>:- This syntax launches a kernel. Here n is the number of thread blocks and m is the number of threads per block, so n*m threads are created in total.
  7. cudaMalloc():- This function allocates memory on the device.
  8. cudaFree():- This function releases device memory allocated with cudaMalloc().
  9. cudaMemcpy():- This function copies data from device memory to host memory and vice versa; the direction is given by its last argument.
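As a sketch of how __shared__ and the launch syntax fit together, here is a minimal kernel that reverses a small array within a single thread block using shared memory. The kernel name, array size, and values are our own choices for illustration, not from the original post:

```cuda
#include <stdio.h>

#define N 8

// Each thread copies one element into shared memory, then the block
// synchronizes, and each thread writes back the element from the
// mirrored position, reversing the array.
__global__ void reverse(int *d)
{
    __shared__ int s[N];          // one copy per thread block
    int t = threadIdx.x;
    s[t] = d[t];
    __syncthreads();              // wait until all threads have filled s[]
    d[t] = s[N - 1 - t];
}

int main(void)
{
    int h[N] = {0, 1, 2, 3, 4, 5, 6, 7};
    int *d;
    cudaMalloc((void **)&d, N * sizeof(int));
    cudaMemcpy(d, h, N * sizeof(int), cudaMemcpyHostToDevice);
    reverse<<<1, N>>>(d);         // 1 block of N threads
    cudaMemcpy(h, d, N * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++)
        printf("%d ", h[i]);      // prints: 7 6 5 4 3 2 1 0
    printf("\n");
    cudaFree(d);
    return 0;
}
```

The __syncthreads() barrier is essential here: without it a thread could read s[N - 1 - t] before the owning thread has written it.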
Example:-

#include <stdio.h>

__global__ void add(int *a, int *b, int *c)
{
*c = *a + *b;
}
int main(void)
{
int a, b, c;                                             // host copies of a, b, c
int *d_a, *d_b, *d_c;                             // device copies of a, b, c
int size = sizeof(int);
                                                             // Allocate space for device copies of a, b, c
cudaMalloc((void **)&d_a, size);
cudaMalloc((void **)&d_b, size);
cudaMalloc((void **)&d_c, size);
                                                             // Setup input values
a = 2;
b = 7;
                                                             // Copy inputs to device
cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);
                                                             // Launch add() kernel on GPU
add<<<1,1>>>(d_a, d_b, d_c);
                                                             // Copy result back to host
cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);
printf("%d + %d = %d\n", a, b, c);           // prints: 2 + 7 = 9
                                                             // Cleanup
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
return 0;
}
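The example above adds a single pair of integers with one thread. To see where the launch parameters <<<n,m>>> really matter, the same idea can be extended (this is our own sketch, not part of the original post) to add whole arrays, with each thread handling one element:

```cuda
#include <stdio.h>

#define N 512

// Each thread computes one element of c. blockIdx.x, blockDim.x and
// threadIdx.x combine to give every thread a unique global index.
__global__ void add(int *a, int *b, int *c)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)                    // guard in case N is not a multiple of blockDim
        c[i] = a[i] + b[i];
}

int main(void)
{
    int a[N], b[N], c[N];
    int *d_a, *d_b, *d_c;
    int size = N * sizeof(int);

    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    add<<<(N + 127) / 128, 128>>>(d_a, d_b, d_c);   // 4 blocks of 128 threads

    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
    printf("c[10] = %d\n", c[10]);                  // 10 + 2*10 = 30

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```

The index guard `if (i < N)` is a common idiom: the grid may launch slightly more threads than there are elements, and the extra threads must not write out of bounds.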