General-purpose computing on the GPU (GPGPU)

Tuesday, September 15, 2009

Parallelism is the future of computing. Future microprocessor development efforts will continue to concentrate on adding cores rather than increasing single-thread performance [1]. One example of this trend is the heterogeneous nine-core Cell Broadband Engine, used in the Sony PlayStation 3, which has also attracted substantial interest from the scientific computing community [1,7]. Similarly, the highly parallel graphics processing unit (GPU) is rapidly gaining maturity as a powerful engine for computationally demanding applications. The GPU’s performance and potential offer a great deal of promise for future computing systems. However, the architecture and programming model of the GPU differ markedly from those of commodity single-chip processors.

One of the historical difficulties in programming GPGPU applications has been that despite their general-purpose tasks having nothing to do with graphics, the applications still had to be programmed using graphics APIs. In addition, the program had to be structured in terms of the graphics pipeline, with the programmable units only accessible as an intermediate step in that pipeline, when the programmer would almost certainly prefer to access the programmable units directly [1].

Today, GPU computing applications are structured in the following way.
  1. The programmer directly defines the computation domain of interest as a structured grid of threads.
  2. An SPMD general-purpose program computes the value of each thread.
  3. The value for each thread is computed by a combination of math operations and both “gather” (read) accesses from and “scatter” (write) accesses to global memory. Unlike in earlier graphics-based approaches, the same buffer can be used for both reading and writing, allowing more flexible algorithms (for example, in-place algorithms that use less memory).
  4. The resulting buffer in global memory can then be used as an input in future computation.
This programming model is powerful for several reasons. First, it allows the hardware to fully exploit the application’s data parallelism by explicitly specifying that parallelism in the program. Next, it strikes a careful balance between generality (a fully programmable routine at each element) and restrictions to ensure good performance (the SPMD model, the restrictions on branching for efficiency, restrictions on data communication between elements and between kernels/passes, and so on). Finally, its direct access to the programmable units eliminates much of the complexity faced by previous GPGPU programmers in co-opting the graphics interface for general-purpose programming. As a result, programs are more often expressed in a familiar programming language and are simpler and easier to build and debug. The result is a programming model that allows its users to take full advantage of the GPU’s powerful hardware but also permits an increasingly high-level programming model that enables productive authoring of complex applications.
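The four steps above can be sketched as a minimal CUDA-style kernel (CUDA is discussed later in this article; the names here are hypothetical illustration, not code from any of the cited systems):

```cuda
// Step 1: the host launch below defines the domain as a grid of threads.
// Steps 2-3: every thread runs the same SPMD program, "gathering" (reading)
// from and "scattering" (writing) to buffers in global memory. The buffer
// y is both read and written -- the in-place style mentioned in the text.
__global__ void axpy_inplace(const float *x, float *y, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element
    if (i < n)
        y[i] = a * y[i] + x[i];                     // gather, compute, scatter
}

// Host side: define the grid (step 1) and launch. Afterwards y remains in
// global memory and can feed a future kernel (step 4).
void run(const float *d_x, float *d_y, float a, int n)
{
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    axpy_inplace<<<blocks, threadsPerBlock>>>(d_x, d_y, a, n);
}
```

Note that each thread touches only its own element of y, so the in-place update is race-free; the restrictions on data communication between elements mentioned above are what make this safe.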

GPU versus CPU

Recent experiments show that GPU implementations of traditionally CPU-bound algorithms achieve performance increases of an order of magnitude or greater. Figure 1 gives an overview of current performance.

   Figure 1. GPU versus CPU performance

Furthermore, Intel is developing Larrabee, a hybrid between a multi-core CPU and a GPU, targeted to hit 2 teraflops. This GPGPU chip should be released at the end of 2009; the first two versions will have 32 and 24 cores respectively, with a 48-core version coming in 2010. Several short articles about Larrabee claim that it will have a TDP as large as 300W [4], that it will use a 12-layer PCB, and that its cooling system "is meant to look similar to what you can find on high-end Nvidia cards today” [5]. Larrabee will use GDDR5 memory and is targeted to deliver 2 teraflops of single-precision computing power [6].

AMD, having acquired ATI, is already hitting 1.2 teraflops with the Radeon HD 4870, and the dual-GPU Radeon HD 4870 X2 should break 2 teraflops.

NVIDIA will offer the GT300 containing up to 512 cores, up from 240 cores in NVIDIA's current high-end GPU. Since the new chips will be on the 40nm process node, NVIDIA could also crank up the clock. The current Tesla GPUs run at 1.3-1.4 GHz and deliver about 1 teraflop single precision and less than 100 gigaflops double precision. There is also speculation that a 2 GHz clock could up that to 3 teraflops of single-precision performance and, because of other architectural changes, double-precision performance would get an even larger boost. Furthermore, according to Valich [2], the upcoming GPU will sport a 512-bit interface connected to GDDR5 memory. If true, he says, "we are looking at memory bandwidth anywhere between 268.8-294.4 GB/s per single GPU" [3].

Software Environments

In the past, the majority of GPGPU programming was done directly through graphics APIs. Although many researchers were successful in getting applications to work, there is a fundamental mismatch between the traditional programming models people were using and the goals of the graphics APIs. Originally, people used fixed-function, graphics-specific units (e.g. texture filters, blending, and stencil buffer operations) to perform GPGPU operations [1]. This improved quickly with fully programmable fragment processors, which provided pseudo-assembly languages, but these were still unapproachable by all but the most ardent researchers.

With DirectX 9, higher level shader programming was made possible through the “high-level shading language” (HLSL), presenting a C-like interface for programming shaders. NVIDIA’s Cg provided similar capabilities as HLSL but was able to compile to multiple targets and provided the first high-level language for OpenGL. The OpenGL Shading Language (GLSL) is now the standard shading language for OpenGL. However, the main issue with Cg/HLSL/GLSL for GPGPU is that they are inherently shading languages. Computation must still be expressed in graphics terms like vertices, textures, fragments, and blending. So, although you could do more general computation with graphics APIs and shading languages, they were still largely unapproachable by the common programmer [1]. What developers really wanted were higher level languages that were designed explicitly for computation and abstracted all of the graphics-isms of the GPU.

Most high-level GPU programming languages today share one thing in common: they are designed around the idea that GPUs generate pictures. As such, these high-level languages are often referred to as shading languages [1]. That is, they are high-level languages that compile a shader program into a vertex shader and a fragment shader to produce the image described by the program.

Cg [8], HLSL [9], and the OpenGL Shading Language [10] all abstract the capabilities of the underlying GPU and allow the programmer to write GPU programs in a more familiar C-like programming language. They do not stray far from their origins as languages designed to shade polygons. All retain graphics-specific constructs: vertices, fragments, textures, etc. Cg and HLSL provide abstractions that are very close to the hardware, with instruction sets that expand as the underlying hardware capabilities expand. The OpenGL Shading Language was designed looking a bit further out, with many language features (e.g. integers) that do not directly map to hardware available today [1].

Sh is a shading language implemented on top of C++ [16]. Sh provides a shader algebra for manipulating and defining procedurally parameterized shaders. Sh manages buffers and textures, and handles shader partitioning into multiple passes. Sh also provides a stream programming abstraction suitable for GPGPU programming [1].

BrookGPU [12] takes a pure streaming-computation abstraction approach, representing data as streams and computation as kernels. There is no notion of textures, vertices, fragments, or blending in Brook. Kernels are written in a restricted subset of C, notable for the absence of pointers and scatter, and the input, output, and gather streams used in a kernel are declared as part of the kernel definition. The user’s kernels are mapped to fragment shader code and streams to textures. Data upload and download to the GPU is performed via explicit read/write calls, which translate into texture updates and framebuffer readbacks [1]. Lastly, computation is performed by rendering a quad covering the pixels in the output domain.

Microsoft’s Accelerator [13] project has a similar goal as Brook in being very compute-centric, but instead of using offline compilation, Accelerator relies on just-in-time compilation of data-parallel operators to fragment shaders. Unlike Brook, but similar to Sh, the delayed evaluation model allows for more aggressive online compilation, leading to potentially more specialized and optimized generated code for execution on the GPU [1].

RapidMind [14] commercialized Sh and now targets multiple platforms including GPUs, the STI Cell Broadband Engine, and multicore CPUs, and the new system is much more focused on computation as compared to Sh, which included many graphics-centric operations [1].

PeakStream [15] (purchased by Google in 2007) is a new system, inspired by Brook, designed around operations on arrays. Similar to RapidMind and Accelerator, PeakStream uses just-in-time compilation but is much more aggressive about vectorizing the user’s code to maximize performance on SIMD architectures. PeakStream is also the first platform to provide profiling and debugging support, the latter continuing to be a serious problem in GPGPU development [1].

Ashli [11] works at a level one step above that of Cg, HLSL, or the OpenGL Shading Language. Ashli reads as input shaders written in HLSL, the OpenGL Shading Language, or a subset of RenderMan. Ashli then automatically compiles and partitions the input shaders to run on a programmable GPU [1].

AMD announced and released their system to researchers in late 2006. CTM (Close To the Metal) provides a low-level hardware abstraction layer (HAL) for the R5XX and R6XX series of ATI GPUs. The CTM HAL provides raw assembly-level access to the fragment engines (stream processors), along with an assembler and command buffers to control execution on the hardware. AMD also offers the compute abstraction layer (CAL), which adds higher-level constructs, similar to those in the Brook runtime system, and compilation support to GPU ISA for GLSL, HLSL, and pseudo-assembly like Pixel Shader 3.0. For higher-level programming, AMD supports compilation of Brook programs directly to R6XX hardware, providing a higher-level programming abstraction than provided by CAL or HAL.

NVIDIA’s CUDA is a higher-level interface than AMD’s HAL and CAL. Similar to Brook, CUDA provides a C-like syntax for executing on the GPU and compiles offline. However, unlike Brook, which exposed only one dimension of parallelism (data parallelism via streaming), CUDA exposes two levels of parallelism: data parallelism and multithreading. CUDA also exposes much more of the hardware resources than Brook, exposing multiple levels of the memory hierarchy: per-thread registers, fast shared memory between threads in a block, on-board (global) memory, and host memory. Kernels in CUDA are also more flexible than those in Brook: they allow the use of pointers (although the data must reside on the board), general load/store to memory that lets the user scatter data from within a kernel, and synchronization between threads in a thread block. However, all of this flexibility and potential performance gain comes at the cost of requiring the user to understand more of the low-level details of the hardware, notably register usage, thread and thread-block scheduling, and the behavior of memory access patterns.
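As a sketch of the features just described (the two-level thread hierarchy, fast per-block shared memory, and barrier synchronization within a block), a standard CUDA block-wise reduction kernel might look like the following; the names are illustrative, not from any cited source:

```cuda
// Two levels of parallelism: a grid of blocks, each block a group of
// cooperating threads. Threads within a block share fast on-chip memory
// and can synchronize at barriers; blocks are independent.
// Assumes blockDim.x is a power of two and at most 256.
__global__ void block_sum(const float *in, float *out, int n)
{
    __shared__ float tile[256];            // fast shared memory, per block
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    tile[tid] = (i < n) ? in[i] : 0.0f;    // each thread loads one element
    __syncthreads();                       // barrier: whole block waits here

    // Tree reduction in shared memory; half the threads drop out each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            tile[tid] += tile[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = tile[0];         // one partial sum per block
}
```

None of this intra-block cooperation is expressible in Brook's pure streaming model, which is precisely the extra flexibility (and extra burden) the text describes.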

CUDA provides tuned and optimized basic linear algebra subprograms (BLAS) and fast Fourier transform (FFT) libraries to use as building blocks for large applications. Low-level access to hardware, such as that provided by CTM, or GPGPU-specific systems like CUDA, allows developers to effectively bypass the graphics drivers and maintain stable performance and correctness.

NVIDIA’s CUDA allows the user to access memory using standard C constructs (arrays, pointers, variables). AMD’s CTM is nearly as flexible but uses 2-D addressing.

The use of direct-compute layers such as CUDA and CTM both simplifies and improves the performance of linear algebra on the GPU. For example, NVIDIA provides CUBLAS, a dense linear algebra package implemented in CUDA that follows the popular BLAS conventions.
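A hedged sketch of how a host program might call CUBLAS's SGEMM through the original single-GPU API of that era; error checking is omitted, and the `gpu_sgemm` wrapper itself is a hypothetical name:

```cuda
#include <cublas.h>

// Computes C = A * B for n-by-n matrices using CUBLAS SGEMM.
// Host matrices hA, hB, hC are assumed to be column-major, per BLAS.
void gpu_sgemm(int n, const float *hA, const float *hB, float *hC)
{
    float *dA, *dB, *dC;
    cublasInit();
    cublasAlloc(n * n, sizeof(float), (void **)&dA);
    cublasAlloc(n * n, sizeof(float), (void **)&dB);
    cublasAlloc(n * n, sizeof(float), (void **)&dC);
    cublasSetMatrix(n, n, sizeof(float), hA, n, dA, n);  // host -> board
    cublasSetMatrix(n, n, sizeof(float), hB, n, dB, n);
    // C = 1.0 * A * B + 0.0 * C; 'N' means no transpose.
    cublasSgemm('N', 'N', n, n, n, 1.0f, dA, n, dB, n, 0.0f, dC, n);
    cublasGetMatrix(n, n, sizeof(float), dC, n, hC, n);  // board -> host
    cublasFree(dA); cublasFree(dB); cublasFree(dC);
    cublasShutdown();
}
```

CUBLAS follows the column-major, Fortran-style BLAS conventions, which is why matrices are passed with a leading dimension rather than as C row-major arrays.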


[1] Owens, J. D., Houston, M., Luebke, D., Green, S., Stone, J. E., and Phillips, J. C.: GPU computing. Proceedings of the IEEE 96, 5 (May 2008), 879–899.

[2] Theo Valich: GT300 to feature 512-bit interface - nVidia set to continue with complicated controllers?

[3] nVidia's GT300 specifications revealed - it's a cGPU!

[4] Larrabee to launch at 300W TDP. Retrieved on 2008-08-06.

[5] Larrabee will use a 12-layer PCB. Retrieved on 2009-07-09.

[6] Larrabee will use GDDR5 memory. Retrieved on 2008-08-06.

[7] Jakub Kurzak, Alfredo Buttari, Piotr Luszczek, Jack Dongarra, "The PlayStation 3 for High-Performance Scientific Computing," Computing in Science and Engineering, vol. 10, no. 3, pp. 84-87, May/June, 2008.

[8] Mark W. R., Glanville R. S., Akeley K., Kilgard M. J.: Cg: A system for programming graphics hardware in a C-like language. ACM Transactions on Graphics 22, 3 (July 2003), 896–907.

[9] Microsoft: High-level shading language, 2005.

[10] Kessenich J., Baldwin D., Rost R.: The OpenGL Shading Language, version 1.10.59, Apr. 2004.

[11] Bleiweiss A., Preetham A.: Ashli—Advanced shading language interface. ACM SIGGRAPH Course Notes (July 2003).

[12] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan, “Brook for GPUs: Stream computing on graphics hardware”, ACM Trans. Graph., vol. 23, no. 3, pp. 777–786, Aug. 2004.

[13] D. Tarditi, S. Puri, and J. Oglesby, “Accelerator: Using data-parallelism to program GPUs for general-purpose uses”, in Proc. 12th Int. Conf. Architect. Support Program. Lang. Oper. Syst., Oct. 2006, pp. 325–335.

[14] M. McCool, “Data-parallel programming on the cell BE and the GPU using the RapidMind development platform”, in Proc. GSPx Multicore Applicat. Conf., Oct.–Nov. 2006.

[15] PeakStream: The PeakStream platform: High productivity software development for multi-core processors.

[16] Mccool M., Du Toit S., Popa T., Chan B., Moule K.: Shader algebra. ACM Transactions on Graphics 23, 3 (Aug. 2004), 787–795.


Omkar said...

Hi, can you provide the reference for Fig. 1, which gives a graph comparing Intel CPUs and nVidia GPUs?

Bruno Simões said...

Someone sent me that picture a year ago. I'm not able to find out who it was, since I can't find the document. Anyway, you should take into account that this picture is from 2008 (more than a year old). If you need some references, check this presentation (2007):

Simon Green, NVIDIA
GPU Physics

If you find some version from 2009, I would be extremely happy if you could email it to me.