CUDA (originally an acronym for Compute Unified Device Architecture although this is no longer used) is a parallel computing architecture developed by NVIDIA. Simply put, CUDA is the computing engine in NVIDIA graphics processing units or GPUs, that is accessible to software developers through industry standard programming languages. Programmers use 'C for CUDA' (C with NVIDIA extensions), compiled through a PathScale Open64 C compiler to code algorithms for execution on the GPU. CUDA architecture supports a range of computational interfaces including OpenCL and DirectX Compute Third party wrappers are also available for Python, Fortran and Java.
CUDA has several advantages over traditional general purpose computation on GPUs (GPGPU) using graphics APIs.
- Scattered reads – code can read from arbitrary addresses in memory.
- Shared memory – CUDA exposes a fast shared memory region (16KB in size) that can be shared amongst threads. This can be used as a user-managed cache, enabling higher bandwidth than is possible using texture lookups.
- Faster downloads and readbacks to and from the GPU
- Full support for integer and bitwise operations, including integer texture lookups.



General

