This week NVIDIA provided a tutorial outlining first steps for GPU acceleration using OpenACC and CUDA. This was offered as part of the “GPUs Accelerating Research” week at Northeastern University and Boston University. After attending, it seemed appropriate to review and boil it down to the essentials. There’s no way we can cover everything, but sometimes getting started is the hardest part.
Set aside half an hour, four hours, or eight hours, whatever you can spare. The steps below will get your code up and running on GPUs, and the resources at the end will be essential for anyone seriously digging into acceleration.
You very likely have a decent GPU in an existing laptop, workstation or server. While NVIDIA Tesla GPUs are typically best for computation, a GeForce or Quadro card will be fine for your first development attempts. You are welcome to contact us for a GPU Test Drive on Tesla once you have working code and want to see the speedups available from professional cards.
Getting Started with GPU Acceleration
- OpenACC
For first attempts, there’s no question that OpenACC is easier than coding in CUDA. One or two comments (pragmas) added above your most intensive loop is all that’s needed. You’ll see cases where a few hours of developer time translate into 2X or 5X speedups. Even the largest labs, such as ORNL, are using OpenACC because their legacy codes show great results with much less effort than a re-write in CUDA would require. They’ve also discovered that OpenACC sped up the CPU-only versions of the code by 50% to 100%, which was an unexpected benefit! And because OpenACC pragmas look like comments to other compilers, you only need to maintain a single code base.

For most readers, the best way to use OpenACC will be the Portland Group (PGI) Accelerator compilers. You may start with a trial version and later purchase with academic or commercial pricing.
Find the most intensive loop and place this statement before the loop begins (on the line above the for or while statement):
#pragma acc kernels
That single line informs the compiler that you have a loop you wish to run on the GPU rather than the CPU. Do take note that you’ll want to pick loops with a lot of math operations (on the order of half a million or more). If you have a few large loops in a row, put a #pragma in front of each. With just that one change, save your code and compile using these flags:

pgcc -acc -ta=nvidia <filename>
Compare timings versus a CPU-only version and see how it worked. You might immediately see speedups, or you may need to make tweaks before the GPU version beats the CPU version.
- OpenACC Assumptions and Timings
Many applications use sophisticated algorithms, so if the first attempt did not yield a speedup, there are some simple fixes possible. The PGI compiler is able to show you some very useful information about the accelerated code. Recompile using these flags and re-run the application:

pgcc -acc -ta=nvidia,time -Minfo=accel <filename>
Take note of the -Minfo=accel and time flags above. The first tells you what assumptions have been made by the compiler: during compilation, you’ll see a printout of the GPU resources that will be used. The time flag prints timings at the end of your application’s execution. Among other things, the timings show how long your code spends computing on the GPU and how long is spent transferring data to and from the GPU. These can immediately point you to issues which need to be fixed, as it is common to see too much time spent on data transfer.
- OpenACC Data Construct
If the previous step showed the code spending as much time transferring data as it did computing, it’s likely that the compiler needs a few more hints from you. To be safe, it defaults to copying data back and forth between the CPU and GPU at every stage. If possible, you want to supply a start (where data is sent to the GPU) and an end (where all the results are returned by the GPU). This may line up perfectly with your loops. Consider this simple 2D Laplace function from NVIDIA’s tutorial:

```c
#pragma acc data copy(Anew, A)
while ( error > tol && iter < iter_max ) {
    error = 0.0;

    #pragma acc kernels
    for ( int j = 1; j < n-1; j++ ) {
        for ( int i = 1; i < m-1; i++ ) {
            Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                + A[j-1][i] + A[j+1][i] );
            error = fmax( error, fabs(Anew[j][i] - A[j][i]) );
        }
    }

    #pragma acc kernels
    for ( int j = 1; j < n-1; j++ ) {
        for ( int i = 1; i < m-1; i++ ) {
            A[j][i] = Anew[j][i];
        }
    }

    iter++;
}
```
There are three #pragma statements; the remainder of the C code is completely unchanged (you can ignore the implementation details; just look at where the pragmas are positioned). The second two mark the for loops to be accelerated. The #pragma acc data copy indicates that the matrices Anew and A should be copied to the GPU before entering the while loop. Both will be automatically copied back when the while loop exits.

That’s it! There’s a lot more sophistication inside OpenACC, and many more clauses you should learn about, but knowing just these two directives will give you a start. PGI offers many resources and tutorials along with their free trial.
- Go Direct with NVIDIA CUDA
For those unable to use OpenACC, or those who prefer to get closer to the bare metal, CUDA is approachable. Anyone who has worked in C or Fortran should be comfortable with the basics of CUDA after a couple of hours.

Starting with CUDA requires its own tutorial, and there are incredible resources already available. Work through some of them separately. Then come back and make sure you’ve tried these tips:
- Be sure your CUDA kernels make coalesced memory requests (reads and writes). A block of threads should all be working within the same section of memory if at all possible. Groups of threads which access memory in many different locations will not run efficiently.
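As a sketch of the difference (both kernels below are hypothetical, written for illustration), compare a copy where consecutive threads touch consecutive elements against one where neighboring threads are a stride apart:

```cuda
// Coalesced: consecutive threads in a warp read consecutive floats,
// so the hardware can combine the warp's requests into a few wide
// memory transactions.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];            // thread i touches element i
}

// Strided: neighboring threads read elements that are `stride` apart,
// scattering each warp's requests across memory and wasting most of
// every transaction. Expect this to run far slower for large strides.
__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = (i * stride) % n;      // neighboring threads land far apart
    if (i < n)
        out[i] = in[j];
}
```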
- Remember that you can include printf statements inside CUDA kernels. These may be invaluable for debugging in a pinch:

```cuda
__global__ void print(int *value)
{
    printf("Current value is: %d\n", *value);
}
```
Don’t forget to include a call to cudaDeviceSynchronize() to flush your print statements from the GPU. This call cannot go inside a CUDA kernel function (__global__); it must be in the host code.

- Try using shared memory (cache) within your CUDA kernels. This prevents the threads from waiting so long for the GPU’s main memory to respond. Your algorithm may be able to read in a chunk of data that will satisfy all the threads in a thread block. Don’t forget to insert __syncthreads() before you switch between reading and writing variables that are shared! You can’t have some threads reading while others are writing to the same location.
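The sketch below (a hypothetical block-wide sum, not code from the tutorial) ties these two tips together: each thread stages one element in shared memory, __syncthreads() separates the read and write phases of the reduction, and the host calls cudaDeviceSynchronize() after the launch:

```cuda
#include <cstdio>

// Hypothetical sketch: a block-wide sum using shared memory.
// Each thread loads one element into the on-chip shared array, then
// the block cooperates on the reduction without touching global
// memory again until the final result is written out.
__global__ void block_sum(const float *in, float *out, int n)
{
    __shared__ float cache[256];        // one slot per thread
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    cache[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                    // all loads finish before any reads

    for (int step = blockDim.x / 2; step > 0; step /= 2) {
        if (threadIdx.x < step)
            cache[threadIdx.x] += cache[threadIdx.x + step];
        __syncthreads();                // finish this round before the next
    }

    if (threadIdx.x == 0)
        out[blockIdx.x] = cache[0];     // one partial sum per block
}

int main()
{
    const int n = 256;
    float h_in[n], h_out;
    for (int i = 0; i < n; i++) h_in[i] = 1.0f;

    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    block_sum<<<1, 256>>>(d_in, d_out, n);
    cudaDeviceSynchronize();            // also flushes any device-side printf

    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %g\n", h_out);        // 256 ones sum to 256

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```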
- Run your application under NVIDIA’s Visual Profiler (nvvp). It’s cross-platform and offers a simple interface for finding out what your code is doing on the hardware. Intelligent messages suggest improvements you should consider.
- GPU-Accelerated Libraries
If you suspect your application uses some fairly common algorithms, you might be right! Thousands of the most common routines are already GPU-accelerated in libraries such as cuBLAS, cuFFT, and Thrust. Your application may be able to speed up simply by switching your function calls. Google is your friend, and NVIDIA also maintains a list of GPU-accelerated libraries.
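As a sketch of the library approach, here is the earlier SAXPY operation handed to cuBLAS instead of a hand-written kernel (this assumes the cuBLAS v2 API and linking with -lcublas; the values are illustrative):

```cuda
#include <cstdio>
#include <cublas_v2.h>

// Sketch: replacing a hand-written SAXPY loop with the
// GPU-accelerated cuBLAS routine. The library handles the kernel;
// your code only moves the data and makes one call.
int main()
{
    const int n = 4;
    float h_x[n] = {1, 2, 3, 4};
    float h_y[n] = {10, 20, 30, 40};
    const float alpha = 2.0f;

    float *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);   // y = alpha*x + y
    cublasDestroy(handle);

    cudaMemcpy(h_y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; i++)
        printf("%g\n", h_y[i]);        // 12 24 36 48

    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}
```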
Obviously, there is an incredible depth of knowledge needed for perfect GPU acceleration. Nevertheless, the quantity and quality of the available tools and libraries offer beginners very usable options. Sit down and try a few of these simple steps.