Subversion through proxy

Wednesday, September 23, 2009

People behind a network proxy run into a little snag when trying to access an SVN repository through it. SVN over HTTP requires more than the usual GET and POST: it relies on WebDAV methods such as PROPFIND and REPORT. Proxies by default aren't configured to allow these extra HTTP methods. However, this problem can be solved.

Open the servers file in the following location.

Windows XP
C:\Documents and Settings\[YOUR USERNAME]\Application Data\Subversion\servers

Windows Vista
C:\Users\[YOUR USERNAME]\AppData\Roaming\Subversion\servers

Linux
/etc/subversion/servers (system-wide) or ~/.subversion/servers (per user)

Update the following lines in the [global] section at the end of the file.
[global]
http-proxy-exceptions = yourproxy exceptions (e.g. dev.company.com)
http-proxy-host = yourproxyhost (e.g. proxy.company.com)
http-proxy-port = your proxy port (e.g. 8080)
http-proxy-username = proxyusername
http-proxy-password = proxypassword
Update the Eclipse proxy settings (this step is usually not needed)

Window > Preferences

Type 'proxy' in the filter text box. Go to the 'Network Connections' config page. Choose manual proxy configuration and update the proxy details.
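One way to check whether a proxy lets the extra methods through is to send one yourself. The sketch below uses the `java.net.http` client from Java 11; the class name `ProxyProbe`, the proxy host and port, and the repository URL are placeholders chosen for illustration, not values from a real setup.

```java
import java.net.InetSocketAddress;
import java.net.ProxySelector;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

final class ProxyProbe {
    /** Build a body-less request that uses an arbitrary HTTP method such as PROPFIND. */
    static HttpRequest buildRequest(String method, String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .method(method, HttpRequest.BodyPublishers.noBody())
                .build();
    }

    /** Send the request through the given proxy and return the HTTP status code. */
    static int probe(String proxyHost, int proxyPort, String method, String url)
            throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .proxy(ProxySelector.of(new InetSocketAddress(proxyHost, proxyPort)))
                .build();
        HttpResponse<Void> response = client.send(
                buildRequest(method, url), HttpResponse.BodyHandlers.discarding());
        return response.statusCode();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder values: replace with your own proxy and repository URL.
        int status = probe("proxy.company.com", 8080, "PROPFIND",
                "http://svn.example.com/repo/");
        System.out.println("PROPFIND returned HTTP " + status);
    }
}
```

If GET works through the proxy but PROPFIND comes back with an error status, the proxy is most likely filtering the method and the exceptions or proxy settings above need attention.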

GPU Computing using jCUDA

Sunday, September 20, 2009

jCUDA provides access to CUDA for Java programmers, exploiting the full power of GPU hardware from Java-based applications. Using jCUDA you can create cross-platform CUDA solutions that run on any operating system supported by CUDA without changing your code.

What do I need?
  1. Recent NVIDIA hardware with a CUDA driver
  2. Java SDK
  3. Eclipse (optional)
  4. jCUDA library
  5. Visual Studio, which you will probably have to install if you are using Windows

Which are the magic steps?


1. Download and install all the requirements

Check if you have a CUDA-enabled graphics card. If you don’t have one, you need to buy it. Then install the CUDA drivers for your graphics card and the Java SDK.
Download the jCuda library and use it together with your favorite IDE (if you have one).


2. Write your Java application

You can write your Java application normally and access your GPU through the jCUDA library. However, you need to learn how to use it. This example illustrates how to detect available GPUs and how to execute your CUDA program on them.

Note that the CUDA program is stored inside a folder called resources. Be careful with the number of bytes occupied by pointers: the pointer size usually follows the architecture of the target machine, so a pointer occupies 32 bits (4 bytes) on a 32-bit platform and 64 bits (8 bytes) on a 64-bit platform.
CUDA cuda = new CUDA(true);

int count = cuda.getDeviceCount();
if(count == 0){
 System.out.println("No GPU devices found"); 
 return;
}

System.out.println("Total number of devices: " + count); 
for (int i = 0; i < count; i++) {
 CUdevice dev = cuda.getDevice(i);
 
 String name = cuda.getDeviceName(dev);
 System.out.println("Name: " + name);
 
 int version[] = cuda.getDeviceComputeCapability(dev);
 System.out.println("Version: " + String.format("%d.%d", version[0], version[1]));
 
 CUdevprop prop = cuda.getDeviceProperties(dev);
 System.out.println("Clock rate: " + 
                      prop.clockRate + " MHz");
 System.out.println("Max threads per block: " + 
                      prop.maxThreadsPerBlock);
}

/** Select 1st device */
cuda.getDevice(0);

/** Create a context (necessary) */
cuda.createContext();

/** Load the module */
File cubinFile = new File("resources", "sub_module.cubin");
cuda.loadModule(cubinFile.getAbsolutePath());

/** Get the function we want */
cuda.getModuleFunction("subtract");

/** Now we should allocate the necessary memory */
int memSize = Integer.SIZE / 8 * 64;
CUdeviceptr devicePtr = cuda.allocate(memSize);

int originalData[] = new int[64];
for (int i=0; i<originalData.length; i++)
 originalData[i] = i;

/** Copy the original array to device */
cuda.copy(devicePtr, originalData, memSize);

/**
 * Setup function parameters
 * 1st Parameter is pointer to device memory
 * Note on 64 bit platforms, each pointer consumes 8 bytes, 
 * on 32 bit only 4.
 */

int offset = 0;
int delta = 13;
cuda.setParameter(offset, devicePtr);

/** Change to 4 if you are using a 32-bit platform */
offset = 8;
cuda.setParameter(offset, delta);

/** 
 * An int scalar takes 4 bytes on both 32-bit and 64-bit platforms;
 * only the pointer size differs.
 */
offset += Integer.SIZE / 8;

cuda.setParameterSize(offset);

/** Now, configure the execution configuration */
cuda.setFunctionBlockShape(originalData.length, 1, 1);

cuda.launch();

/** Wait for all operations to complete */
cuda.synchronizeStream(new CUstream(0));

/** Copy results back */
cuda.copy(originalData, devicePtr, memSize);

/** Verify results.... */
boolean correct = true;
for (int i=0; i<originalData.length; i++) {
 if (originalData[i] != (i - delta)*(i - delta)) {
  System.out.println("Error at " + i +":"+ originalData[i]);
  correct = false;
 }
}

if (correct)
 System.out.println("Test passed");
else
 System.out.println("Test failed");

/** Release resources */
cuda.free(devicePtr);
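
The offset bookkeeping in the listing above can be wrapped in a small helper so that the pointer size is the only platform-dependent value. This is a minimal sketch: `ParamLayout` is a hypothetical class, not part of jCUDA, and it assumes every parameter is a scalar aligned to its own size (4 bytes for an int, 4 or 8 bytes for a pointer).

```java
final class ParamLayout {
    private int offset = 0;

    /** Round offset up to the next multiple of alignment (a power of two). */
    static int alignUp(int offset, int alignment) {
        return (offset + alignment - 1) & ~(alignment - 1);
    }

    /** Reserve space for one parameter and return the offset at which it starts. */
    int add(int sizeInBytes) {
        offset = alignUp(offset, sizeInBytes); // assumption: alignment equals size
        int start = offset;
        offset += sizeInBytes;
        return start;
    }

    /** Total parameter size once every parameter has been added. */
    int totalSize() {
        return offset;
    }

    public static void main(String[] args) {
        int pointerSize = 8; // assumption: 64-bit platform; use 4 on a 32-bit one
        ParamLayout layout = new ParamLayout();
        int srcOffset = layout.add(pointerSize);       // int *src
        int amountOffset = layout.add(Integer.BYTES);  // int amount
        System.out.println(srcOffset + " " + amountOffset + " " + layout.totalSize());
    }
}
```

With a pointer size of 8 the parameters start at offsets 0 and 8 and the total size is 12; with a pointer size of 4 they start at 0 and 4 and the total is 8. Those are the values to pass to setParameter and setParameterSize.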

3. Write your CUDA code

The following example is written in CUDA, which is very similar to the C language. If you can’t follow it, all you need to know is that each element ends up equal to (src - amount) * (src - amount).
extern "C" __global__ void subtract(int *src, int amount){
 src[threadIdx.x] -= amount;
 src[threadIdx.x] *= src[threadIdx.x];
}

4. Compile your CUDA code

Before running the application, your CUDA code must be compiled to PTX, an intermediate assembly language, and then assembled into the cubin file format.

Here is the simplified compilation process: CUDA source (.cu) → PTX → cubin.

If you are using Windows, you can execute the following command.
nvcc.exe -cubin -I $(SDK)/common/inc -o resources/sub_module.cubin resources/sub_module.cu

If you prefer Ant, use this script together with your favorite IDE.
<?xml version="1.0" encoding="UTF-8"?>
<project name="GPGPU" basedir="." default="all">
 <target name="all">
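  <exec executable="nvcc.exe">
   <arg value="resources/sub_module.cu" />
   <arg value="--cubin" />
   <!-- write the cubin where the Java code loads it from -->
   <arg value="-o" />
   <arg value="resources/sub_module.cubin" />
  </exec>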
 </target>
</project>

5. Compile and run your Java application

Don't forget that GPUs are connected to the main processor by a relatively slow connection, the bus. For this reason, moving data on and off the GPU is expensive compared to performing calculations directly on the GPU. This can create critical bottlenecks, for instance when one performs a calculation on the CPU, continues it on the GPU, and then uses the GPU's output on the CPU once more. The overhead introduced by data transfers through the bus can overwhelm the benefits of fast GPU computation.
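
A rough way to reason about this trade-off is to add the transfer times to the GPU compute time and compare the sum with the CPU time. The sketch below does exactly that; the class `OffloadEstimate`, the 4 GB/s bus bandwidth, and the timings are made-up illustration values, not measurements.

```java
final class OffloadEstimate {
    /** Seconds needed to move the given number of bytes over the bus. */
    static double transferSeconds(long bytes, double busBytesPerSecond) {
        return bytes / busBytesPerSecond;
    }

    /** True when copy-in + GPU compute + copy-out beats staying on the CPU. */
    static boolean worthOffloading(long bytesIn, long bytesOut, double busBytesPerSecond,
                                   double gpuSeconds, double cpuSeconds) {
        double total = transferSeconds(bytesIn, busBytesPerSecond)
                + gpuSeconds
                + transferSeconds(bytesOut, busBytesPerSecond);
        return total < cpuSeconds;
    }

    public static void main(String[] args) {
        double bus = 4e9;          // assumed effective bus bandwidth: 4 GB/s
        long bytes = 400_000_000L; // 400 MB copied in and 400 MB copied out
        // The kernel is 10x faster than the CPU version, yet the transfers dominate.
        System.out.println(worthOffloading(bytes, bytes, bus, 0.01, 0.1));
        // A heavier computation on the same data amortizes the transfers.
        System.out.println(worthOffloading(bytes, bytes, bus, 0.1, 1.0));
    }
}
```

In the first case the two copies alone take 0.2 s against 0.1 s of CPU work, so offloading loses. In the second, 0.3 s in total beats 1 s on the CPU. Keeping data on the GPU across several kernels removes the copies from the sum altogether.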

GPGPU Debugging Tools

Thursday, September 17, 2009

Until recently, support for debugging on GPUs was fairly limited, and the features necessary for a good GPU debugger were not well defined. The advent of GPGPU programming makes it clear that a GPU debugger should have capabilities similar to those of traditional CPU debuggers, including variable watches, program breakpoints, and single-step execution. GPU programs often involve user interaction. While a debugger does not need to run the application at full speed, the application being debugged should maintain some degree of interactivity [8]. A GPU debugger should be easy to add to and remove from an existing application, should mangle GPU state as little as possible, and should execute the debug code on the GPU, not in a software rasterizer. Finally, a GPU debugger should support the major GPU programming APIs and vendor-specific extensions.

There are a few different systems for debugging GPU programs available to use, but nearly all are missing one or more of the important features we just discussed.

gDEBugger [1] and GLIntercept [2] are tools designed to help debug OpenGL programs. Both are able to capture and log OpenGL state from a program. gDEBugger allows a programmer to set breakpoints and watch OpenGL state variables at runtime, as well as to profile applications using GPU hardware performance signals [8]. There is currently no specific support for debugging shaders, but both support runtime shader editing.

The Microsoft Shader Debugger [3], however, does provide runtime variable watches and breakpoints for shaders. The shader debugger is integrated into the Visual Studio IDE, and provides all the same functionality programmers are used to for traditional programming [8]. Unfortunately, debugging requires the shaders to be run in software emulation rather than on the hardware. In contrast, the Apple OpenGL Shader Builder [4] also has a sophisticated IDE and actually runs shaders in real time on the hardware during shader debug and edit. The downside to this tool is that it was designed for writing shaders, not for computation. The shaders are not run in the context of the application, but in a separate environment designed to help facilitate shader writing.

The Shadesmith Fragment Program Debugger [5] was the first system to automate printf-style debugging while providing basic shader debugging functionality like breakpoints, program stepping, and programmable scale and bias for the image printf [8]. While Shadesmith represents a big step in the right direction for GPGPU debugging, it still has many limitations, the largest of which is that Shadesmith is currently limited to debugging assembly language shaders. Additionally, Shadesmith only works for OpenGL fragment programs, and provides no support for debugging OpenGL state.

Finally, Duca et al. recently described a system that provides debugging not only for graphics state but also for both vertex and fragment programs [6]. Their system builds a database of graphics state for which the user writes SQL-style queries. Based on the queries, the system extracts the necessary graphics state and program data and draws the appropriate data into a debugging window [8]. The system is built on top of the Chromium [7] library, enabling debugging of any OpenGL application without modification to the original source program. This promising approach combines graphics state debugging and program debugging with visualizations in a transparent and hardware-rendered approach.


References

[1] Graphic Remedy gDEBugger. http://www.gremedy.com/, 2006.

[2] Trebilco D.: GLIntercept. http://glintercept.nutty.org/, 2006.

[3] Microsoft shader debugger. http://msdn.microsoft.com/library/default.asp?url=/library/en-us/directx9_c/directx/graphics/Tools/ShaderDebugger.asp, 2005.

[4] Apple Computer OpenGL shader builder profiler. http://developer.apple.com/graphicsimaging/opengl/, 2006.

[5] Purcell T. J., Sen P.: Shadesmith fragment program debugger. http://graphics.stanford.edu/projects/shadesmith/, 2003.

[6] Duca N., Niski K., Bilodeau J., Bolitho M., Chen Y., Cohen J.: A relational debugging engine for the graphics pipeline. ACM Transactions on Graphics 24, 3 (Aug. 2005), 453–463.

[7] Humphreys G., Houston M., Ng R., Frank R., Ahern S., Kirchner P., Klosowski J.: Chromium: A stream-processing framework for interactive rendering on clusters. ACM Transactions on Graphics 21, 3 (July 2002), 693–702.

[8] Owens, J. D., Luebke, D., Govindaraju, N., Harris, M., Krüger, J., Lefohn, A. E., and Purcell, T. J. A Survey of General-Purpose Computation on Graphics Hardware. Computer Graphics Forum, 26(1):80–113, March 2007.

General-purpose computing on the GPU (GPGPU)

Tuesday, September 15, 2009

Parallelism is the future of computing. Future microprocessor development efforts will continue to concentrate on adding cores rather than increasing single-thread performance [1]. One example of this trend is the heterogeneous nine-core Cell Broadband Engine, currently used in the Sony PlayStation 3, which has also attracted substantial interest from the scientific computing community [1,7]. Similarly, the highly parallel graphics processing unit (GPU) is rapidly gaining maturity as a powerful engine for computationally demanding applications. The GPU’s performance and potential offer a great deal of promise for future computing systems. However, the architecture and programming model of the GPU differ from those of commodity single-chip processors.

One of the historical difficulties in programming GPGPU applications has been that despite their general-purpose tasks having nothing to do with graphics, the applications still had to be programmed using graphics APIs. In addition, the program had to be structured in terms of the graphics pipeline, with the programmable units only accessible as an intermediate step in that pipeline, when the programmer would almost certainly prefer to access the programmable units directly [1].

Today, GPU computing applications are structured in the following way.
  1. The programmer directly defines the computation domain of interest as a structured grid of threads.
  2. An SPMD general-purpose program computes the value of each thread.
  3. The value for each thread is computed by a combination of math operations and both “gather” (read) accesses from and “scatter” (write) accesses to global memory. Unlike in the earlier graphics-API-based approaches, the same buffer can be used for both reading and writing, allowing more flexible algorithms (for example, in-place algorithms that use less memory).
  4. The resulting buffer in global memory can then be used as an input in future computation.
This programming model is powerful for several reasons. First, it allows the hardware to fully exploit the application’s data parallelism by explicitly specifying that parallelism in the program. Next, it strikes a careful balance between generality (a fully programmable routine at each element) and restrictions to ensure good performance (the SPMD model, the restrictions on branching for efficiency, restrictions on data communication between elements and between kernels/passes, and so on). Finally, its direct access to the programmable units eliminates much of the complexity faced by previous GPGPU programmers in co-opting the graphics interface for general-purpose programming. As a result, programs are more often expressed in a familiar programming language and are simpler and easier to build and debug. The result is a programming model that allows its users to take full advantage of the GPU’s powerful hardware but also permits an increasingly high-level programming model that enables productive authoring of complex applications.
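
The four steps above can be imitated on the CPU to make the model concrete. The following is only an illustration using Java parallel streams, not GPU code: the class `SpmdSketch` and its `launch` helper are hypothetical, the array stands in for global memory, and each logical thread touches only its own element.

```java
import java.util.Arrays;
import java.util.function.IntConsumer;
import java.util.stream.IntStream;

final class SpmdSketch {
    /** Run the same kernel once per index of a 1-D grid, in parallel. */
    static void launch(int gridSize, IntConsumer kernel) {
        IntStream.range(0, gridSize).parallel().forEach(kernel);
    }

    public static void main(String[] args) {
        float[] buffer = new float[8]; // stands in for a buffer in global memory

        // Pass 1: every thread scatters (writes) its own index.
        launch(buffer.length, i -> buffer[i] = i);

        // Pass 2: every thread gathers (reads) and updates the same buffer in place;
        // the output of pass 1 is the input of pass 2.
        launch(buffer.length, i -> buffer[i] = buffer[i] * buffer[i] + 1.0f);

        System.out.println(Arrays.toString(buffer));
    }
}
```

The kernel body is written for a single element and the grid size decides how many times it runs, which is the sense in which the parallelism is specified explicitly in the program.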


GPU versus CPU

Recent experiments show that GPU implementations of traditionally CPU-restricted algorithms have performance increases of an order of magnitude or greater. Figure 1 gives an overview of current performance.

   Figure 1. GPU versus CPU performance


Furthermore, Intel is developing Larrabee, a hybrid between a multi-core CPU and a GPU that targets 2 teraflops of single-precision computing power [6]. This GPGPU chip should be released at the end of 2009; the first two versions will have 32 and 24 cores respectively, with a 48-core version coming in 2010. Several short articles claim that Larrabee will have a TDP as large as 300 W [4], that it will use a 12-layer PCB, and that its cooling system "is meant to look similar to what you can find on high-end Nvidia cards today" [5]. Larrabee will use GDDR5 memory [6].

AMD, which has already bought ATI, is hitting 1.2 teraflops with the Radeon HD 4870, and the Radeon HD 4870 X2 should already break 2 teraflops.

NVIDIA will offer the GT300 containing up to 512 cores, up from 240 cores in NVIDIA's current high-end GPU. Since the new chips will be on the 40 nm process node, NVIDIA could also crank up the clock. The current Tesla GPUs run at 1.3-1.4 GHz and deliver about 1 teraflop in single precision and less than 100 gigaflops in double precision. There is also speculation that a 2 GHz clock could raise that to 3 teraflops of single-precision performance and that, because of other architectural changes, double-precision performance would get an even larger boost. Furthermore, according to Valich [2], the upcoming GPU will sport a 512-bit interface connected to GDDR5 memory. If true, he says, "we are looking at memory bandwidth anywhere between 268.8-294.4 GB/s per single GPU" [3].
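
The quoted bandwidth range is consistent with a simple calculation: bus width in bytes multiplied by the data rate per pin. The sketch below reproduces it; the per-pin rates of 4.2 and 4.6 Gbit/s are back-calculated assumptions that happen to match the quoted figures, not numbers stated in the sources, and `MemoryBandwidth` is a hypothetical class name.

```java
import java.util.Locale;

final class MemoryBandwidth {
    /** Peak bandwidth in GB/s for a bus of the given width and per-pin data rate. */
    static double gigabytesPerSecond(int busWidthBits, double gigabitsPerSecondPerPin) {
        return busWidthBits / 8.0 * gigabitsPerSecondPerPin;
    }

    public static void main(String[] args) {
        // 512 bits = 64 bytes transferred per data-rate tick.
        System.out.printf(Locale.ROOT, "%.1f GB/s%n", gigabytesPerSecond(512, 4.2));
        System.out.printf(Locale.ROOT, "%.1f GB/s%n", gigabytesPerSecond(512, 4.6));
    }
}
```

This prints 268.8 GB/s and 294.4 GB/s, the two ends of the quoted range.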


Software Environments

In the past, the majority of GPGPU programming was done directly through graphics APIs. Although many researchers were successful in getting applications to work, there is a fundamental mismatch between the traditional programming models people were using and the goals of the graphics APIs. Originally, people used fixed function, graphics-specific units (e.g. texture filters, blending, and stencil buffer operations) to perform GPGPU operations [1]. This quickly got better with fully programmable fragment processors which provided pseudo assembly languages, but this was still unapproachable by all but the most ardent researchers.

With DirectX 9, higher level shader programming was made possible through the “high-level shading language” (HLSL), presenting a C-like interface for programming shaders. NVIDIA’s Cg provided similar capabilities as HLSL but was able to compile to multiple targets and provided the first high-level language for OpenGL. The OpenGL Shading Language (GLSL) is now the standard shading language for OpenGL. However, the main issue with Cg/HLSL/GLSL for GPGPU is that they are inherently shading languages. Computation must still be expressed in graphics terms like vertices, textures, fragments, and blending. So, although you could do more general computation with graphics APIs and shading languages, they were still largely unapproachable by the common programmer [1]. What developers really wanted were higher level languages that were designed explicitly for computation and abstracted all of the graphics-isms of the GPU.

Most high-level GPU programming languages today share one thing in common: they are designed around the idea that GPUs generate pictures. As such, the high-level programming languages are often referred to as shading languages [1]. That is, they are a high-level language that compiles a shader program into a vertex shader and a fragment shader to produce the image described by the program.

Cg [8], HLSL [9], and the OpenGL Shading Language [10] all abstract the capabilities of the underlying GPU and allow the programmer to write GPU programs in a more familiar C-like programming language. They do not stray far from their origins as languages designed to shade polygons. All retain graphics-specific constructs: vertices, fragments, textures, etc. Cg and HLSL provide abstractions that are very close to the hardware, with instruction sets that expand as the underlying hardware capabilities expand. The OpenGL Shading Language was designed looking a bit further out, with many language features (e.g. integers) that do not directly map to hardware available today [1].

Sh is a shading language implemented on top of C++ [16]. Sh provides a shader algebra for manipulating and defining procedurally parameterized shaders. Sh manages buffers and textures, and handles shader partitioning into multiple passes. Sh also provides a stream programming abstraction suitable for GPGPU programming [1].

BrookGPU [12] takes a pure streaming computation abstraction approach, representing data as streams and computation as kernels. There is no notion of textures, vertices, fragments, or blending in Brook. Kernels are written in a restricted subset of C, notably without pointers and scatter, and the input, output, and gather streams used in a kernel are defined as part of the kernel definition. The user’s kernels are mapped to fragment shader code and streams to textures. Data upload and download to the GPU is performed via explicit read/write calls translating into texture updates and framebuffer read backs [1]. Lastly, computation is performed by rendering a quad covering the pixels in the output domain.

Microsoft’s Accelerator [13] project has a goal similar to Brook's in being very compute-centric, but instead of using offline compilation, Accelerator relies on just-in-time compilation of data-parallel operators to fragment shaders. Unlike Brook, but similar to Sh, the delayed evaluation model allows for more aggressive online compilation, leading to potentially more specialized and optimized generated code for execution on the GPU [1].

RapidMind [14] commercialized Sh and now targets multiple platforms including GPUs, the STI Cell Broadband Engine, and multicore CPUs. The new system is much more focused on computation than Sh, which included many graphics-centric operations [1].

PeakStream [15] (purchased by Google in 2007) is a new system, inspired by Brook, designed around operations on arrays. Similar to RapidMind and Accelerator, PeakStream uses just-in-time compilation but is much more aggressive about vectorizing the user’s code to maximize performance on SIMD architectures. PeakStream is also the first platform to provide profiling and debugging support, the latter continuing to be a serious problem in GPGPU development [1].

Ashli [11] works at a level one step above that of Cg, HLSL, or the OpenGL Shading Language. Ashli reads as input shaders written in HLSL, the OpenGL Shading Language, or a subset of RenderMan. Ashli then automatically compiles and partitions the input shaders to run on a programmable GPU [1].

AMD announced and released its CTM (Close To the Metal) system to researchers in late 2006. CTM provides a low-level hardware abstraction layer (HAL) for the R5XX and R6XX series of ATI GPUs. CTM HAL provides raw assembly-level access to the fragment engines (stream processors) along with an assembler and command buffers to control execution on the hardware. AMD also offers the compute abstraction layer (CAL), which adds higher level constructs, similar to those in the Brook runtime system, and compilation support to GPU ISA for GLSL, HLSL, and pseudo assembly like Pixel Shader 3.0. For higher level programming, AMD supports compilation of Brook programs directly to R6XX hardware, providing a higher level programming abstraction than provided by CAL or HAL.

NVIDIA’s CUDA is a higher level interface than AMD’s HAL and CAL. Similar to Brook, CUDA provides a C-like syntax for executing on the GPU and compiles offline. However, unlike Brook, which only exposed one dimension of parallelism, data parallelism via streaming, CUDA exposes two levels of parallelism, data parallel and multithreading. CUDA also exposes much more of the hardware resources than Brook, exposing multiple levels of memory hierarchy: per-thread registers, fast shared memory between threads in a block, board memory, and host memory. Kernels in CUDA are also more flexible than those in Brook by allowing the use of pointers (although data must be on board), general load/store to memory allowing the user to scatter data from within a kernel, and synchronization between threads in a thread block. However, all of this flexibility and potential performance gain comes at the cost of requiring the user to understand more of the low-level details of the hardware, notably register usage, thread and thread block scheduling, and behavior of access patterns through memory.

CUDA provides tuned and optimized basic linear algebra subprograms (BLAS) and fast Fourier transform (FFT) libraries to use as building blocks for large applications. Low-level access to hardware, such as that provided by CTM or by GPGPU-specific systems like CUDA, allows developers to effectively bypass the graphics drivers and maintain stable performance and correctness.

NVIDIA’s CUDA allows the user to access memory using standard C constructs (arrays, pointers, variables). AMD’s CTM is nearly as flexible but uses 2-D addressing.

The use of direct-compute layers such as CUDA and CTM both simplifies and improves the performance of linear algebra on the GPU. For example, NVIDIA provides CuBLAS, a dense linear algebra package implemented in CUDA and following the popular BLAS conventions.

References

[1] Owens, J. D., Houston, M., Luebke, D., Green, S., Stone, J. E., and Phillips, J. C. GPU computing. Proceedings of the IEEE, 96(5):879-899, May 2008.

[2] Theo Valich. GT300 to feature 512-bit interface - nVidia set to continue with complicated controllers?. http://www.brightsideofnews.com/news/2009/5/5/gt300-to-feature-512-bit-interface---nvidia-set-to-continue-with-complicated-controllers.aspx

[3] nVidia's GT300 specifications revealed - it's a cGPU!. http://www.brightsideofnews.com/news/2009/4/22/nvidias-gt300-specifications-revealed---its-a-cgpu!.aspx

[4] Larrabee to launch at 300W TDP. fudzilla.com. http://www.fudzilla.com/index.php?option=com_content&task=view&id=7651&Itemid=1. Retrieved on 2008-08-06.

[5] Larrabee will use a 12-layer PCB. fudzilla.com. http://www.fudzilla.com/index.php?option=com_content&task=view&id=8435&Itemid=1. Retrieved on 2009-07-09.

[6] Larrabee will use GDDR5 memory. fudzilla.com. http://www.fudzilla.com/index.php?option=com_content&task=view&id=8460&Itemid=1. Retrieved on 2008-08-06.

[7] Jakub Kurzak, Alfredo Buttari, Piotr Luszczek, Jack Dongarra, "The PlayStation 3 for High-Performance Scientific Computing," Computing in Science and Engineering, vol. 10, no. 3, pp. 84-87, May/June, 2008.

[8] Mark W. R., Glanville R. S., Akeley K., Kilgard M. J.: Cg: A system for programming graphics hardware in a C-like language. ACM Transactions on Graphics 22, 3 (July 2003), 896–907.

[9] Microsoft high-level shading language. http://msdn.microsoft.com/library/default.asp?url=/library/en-us/directx9_c/directx/graphics/reference/hlslreference/hlslreference.asp, 2005.

[10] Kessenich J., Baldwin D., Rost R.: The OpenGL Shading Language version 1.10.59. http://www.opengl.org/documentation/oglsl.html, Apr. 2004.

[11] Bleiweiss A., Preetham A.: Ashli—Advanced shading language interface. ACM SIGGRAPH Course Notes (July 2003). http://www.ati.com/developer/SIGGRAPH03/AshliNotes.pdf.

[12] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan, “Brook for GPUs: Stream computing on graphics hardware”, ACM Trans. Graph., vol. 23, no. 3, pp. 777–786, Aug. 2004.

[13] D. Tarditi, S. Puri, and J. Oglesby, “Accelerator: Using data-parallelism to program GPUs for general-purpose uses”, in Proc. 12th Int. Conf. Architect. Support Program. Lang. Oper. Syst., Oct. 2006, pp. 325–335.

[14] M. McCool, “Data-parallel programming on the cell BE and the GPU using the RapidMind development platform”, in Proc. GSPx Multicore Applicat. Conf., Oct.–Nov. 2006.

[15] PeakStream, The PeakStream platform: High productivity software development for multi-core processors. [Online]. Available: http://www.peakstreaminc.com/reference/peakstream_platform_technote.pdf

[16] Mccool M., Du Toit S., Popa T., Chan B., Moule K.: Shader algebra. ACM Transactions on Graphics 23, 3 (Aug. 2004), 787–795.