I am new to this topic, so I don't know much about it.

As the topic says, I want to do general-purpose computing on the GPU. I know that there are full-blown APIs like DirectX and OpenGL to help people do that. But what I want to do is nothing as complex as drawing vertices, textures and so on, which DirectX and OpenGL make easy through their API procedures. I want to start with something as simple as adding two numbers on the GPU, and maybe later write a custom JPEG decoder that uses the GPU (preferably in assembly).

I have been looking at a few books and tutorials, and I see that DirectX and OpenGL jump straight into the big stuff with their API procedures. I just want to know how to directly access the video instructions and memory. But I am also aware that although what I am trying to do sounds simple, it might not be, because of the abstraction layers and because the drivers may translate the instructions differently for each graphics card's hardware implementation.
Posted on 2011-04-18 09:29:18 by banzemanga
Write your own shader? Or use DirectCompute/OpenCL/CUDA. Can't think of any other way.
Posted on 2011-04-18 17:11:13 by ti_mo_n
No, I don't want to start with anything as complicated as making my own shader. Let's start simple. Add two numbers on the GPU. Move a value to a register or video memory location.
Posted on 2011-04-18 18:02:48 by banzemanga
There are two major frameworks to assist in writing GPU-based code: CUDA and OpenCL (both fall under the umbrella term GPGPU).

If you insist on creating your own solution, then you cannot avoid writing a shader function (actually, you can't ever avoid this but I digress).
The typical solution is to use a custom pixel shader, along with your own custom Pixel Format for passing your data to/from the GPU code, via 'custom textures' which are just flat arrays of your custom pixel structure, and using 'render-to-texture' to capture the output back from the pixel shader.

There are of course other avenues, which may be more or less efficient depending on your requirements.

Just to add two numbers together, you can get away with reading and writing shader variables, with the most braindead shader code ever, and no textures.
Shader functions are meant primarily to operate upon their input args; however, it's our function, so we can make it do what we want.
Posted on 2011-04-19 00:52:07 by Homer
You can write assembly only with DirectX 9 shaders (SM3.0 max) or OpenGL vertex/fragment programs (SM2.0 max), which is not REAL assembly, but more like bytecode as in Java/.NET, or by using Cuda (again a sort of pseudo-assembly).
DX10+ and newer OpenGL versions only support HLSL/GLSL for shaders, since there's no point in using assembly when it's a bytecode language anyway.

Programming directly on the GPU isn't going to work, since all GPUs are massively different. A GeForce 7-series has a completely different architecture and instruction set from a GeForce 8-series, and so on. Not to mention the differences between nVidia, AMD and Intel GPUs. For that reason, there really aren't any interfaces available for any GPU to write native assembly directly. It's just useless, as it will only run on that particular GPU.

So in short, I wouldn't even bother trying to use assembly. Just stick to the C-like languages of the particular APIs. There's nothing to gain with assembly, and lots to lose.

The difference between writing graphics shaders (vertex/geometry/pixel/etc) and using GPGPU (Cuda/DirectCompute/OpenCL/Stream) is mainly in the flexibility of memory access.
Graphics shaders have very limited input and output. The input can only come from a vertex buffer or a texture. And the output can only go to a single pixel. Another limitation is that you cannot use the same texture as both input and output, which means that shaders cannot access each other's data.

With GPGPU, there is a local storage buffer. This allows you to share data between shaders, which allows you to solve a lot more problems (or solve the same problems in more efficient ways, by requiring fewer render passes).
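
As a rough illustration of why shared local storage matters, here is a minimal CPU-side sketch in plain C++ (not real GPU code). It mimics one group of threads summing values through a buffer that every thread in the group can read and write. The names groupSum and kGroupSize are made up for this example, and the inner loop only stands in for threads that a GPU would run concurrently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical group size for this sketch; real APIs let you pick it per kernel.
const std::size_t kGroupSize = 8;

// CPU stand-in for one thread group summing kGroupSize values through a
// buffer shared by all threads of the group.
int groupSum(const std::vector<int>& input) {
    assert(input.size() == kGroupSize);

    // Each "thread" first copies its own element into the shared buffer.
    std::vector<int> shared(input.begin(), input.end());

    // Tree reduction: in every phase, thread tid adds in the partial sum
    // that another thread produced in the previous phase.
    for (std::size_t stride = kGroupSize / 2; stride > 0; stride /= 2) {
        for (std::size_t tid = 0; tid < stride; ++tid) {
            shared[tid] += shared[tid + stride];
        }
        // On a GPU, a group-wide barrier would be needed here so that no
        // thread starts the next phase before all writes have landed.
    }
    return shared[0];
}
```

With pixel shaders alone, each halving step would typically need its own render pass with a separate input and output texture, which is the extra cost the post refers to.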

But really, if you just want to add 2+2 together on a GPU, you're already going to need quite a bit of setup code. The programming model generally looks something like this:
1) Initialize the device through the API (create a device object, create a context etc).
2) Initialize your input and output buffers.
3) Initialize your GPU code (in most cases this means actually sending the ASCII source code to the driver to have it compiled, and it will return you an object representing the compiled code. Cuda allows you to compile the shader and link it directly to your program. It just appears as an extern "C" function).
4) Set the input and output buffers for your GPU code.
5) Execute the GPU code.
6) Read back the results from the output buffers on the CPU.

Another thing you have to realize is that the execution model is vastly different from a CPU. On a GPU you never just want to run a single addition or such. A GPU is a massively parallel architecture. Its speed does not come from its linear processing power, but from the fact that it can run many (slow) threads at the same time.
With GPGPU you generally execute a grid of data at the same time (this can be structured as 1D, 2D or 3D, depending on what makes the most sense for your particular code).
Basically all shaders in the grid run the exact same code (it's a form of SIMD), but they are all fed with different input grid coordinates. You use these grid coordinates to determine what input to sample from the input buffer, and where to write the output. So basically you are executing an array at a time.
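
The grid idea above can be sketched on the CPU in plain C++. This is only a stand-in, not a real GPU API: addKernel and runGrid1D are hypothetical names, and the loop runs sequentially where a GPU would run the iterations in parallel. What it does show is that every grid cell executes the same code and differs only in the coordinate it receives.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// The "kernel": every grid cell runs this same code and differs only in
// the coordinate x it is handed.
void addKernel(std::size_t x,
               const std::vector<float>& a,
               const std::vector<float>& b,
               std::vector<float>& out) {
    out[x] = a[x] + b[x];  // use the coordinate to pick input and output
}

// CPU stand-in for executing a 1D grid over two input buffers.
std::vector<float> runGrid1D(const std::vector<float>& a,
                             const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::vector<float> out(a.size());            // output buffer
    for (std::size_t x = 0; x < a.size(); ++x) { // one "thread" per element
        addKernel(x, a, b, out);
    }
    return out;                                  // "read back" the results
}
```

Adding 2+2 is then just a grid of size one, which is why the setup cost dominates for such a small job.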
Posted on 2011-04-19 05:20:16 by Scali
Thanks for your responses and explanations.

Now I see what ti_mo_n meant by writing my "own shader". One of the simplest ways to make the GPU do calculations is to go through the shader pipeline, even though that means taking big leaps to do simple steps.

I did expect that, because GPUs are massively different in architecture, device drivers would do some encoding/decoding to tell the GPU which instructions to execute. But I was still holding on to the possibility of being able to program it directly.

I did a little research and found out that OpenCL is the only open standard (pseudo-)assembly(?) option out there. CUDA and ATI Stream are pretty much proprietary high-level language standards. And even OpenCL for GPGPU programming is a rather new thing. OpenCL 1.1 was standardized in 2010! Only one year ago! DirectX is only for Windows, while for some reason I am a little reluctant about GLSL.

Probably the only way to access the GPU directly is by creating my own custom device driver, which I can just forget about. Also, the custom driver would only work on that specific device.

I hoped that GPUs would have a certain standard architecture, like AMD and Intel CPUs: different on the inside, but with mostly the same general instruction set.

What I really wanted to do was something like a custom JPEG decoder that uses the GPU instead of the CPU. And I hoped it would be something simple enough to work on a really old computer with an old graphics card.

Which brings me to what I am now confused about. How did earlier versions of DirectX take advantage of older graphics cards, when those cards didn't have enough SIMD capability to be considered GPGPU in the first place? Does it mean that DirectX did the computations on the CPU and then sent the preliminary results to the graphics card's 2D acceleration pipeline for final processing?
Posted on 2011-04-19 06:53:26 by banzemanga
OpenCL has no assembly, only a C-like language.
And you should be very VERY glad that GPUs do NOT have an architectural standard such as Intel's x86. Videocards have evolved at a much faster rate than CPUs, and pack a lot more processing power in a single chip, exactly for that reason: There is no architectural standard. You can do anything you like (such as Intel attempting to build a GPU with a set of Atom-like x86 cores with extended SIMD units, aka Larrabee), as long as you can write a driver that will support DirectX and/or OpenGL (and these days they're so similar that if you can support one, you can support the other as well).

GPGPU is still relatively new. The first videocards to support OpenCL/DirectCompute/Cuda are the GeForce 8-series. And you can probably just forget about doing something like JPG decoding without GPGPU.

How older DirectX/OpenGL worked is a very long story, but I'll try to make it as short as possible.
Evolution of videocards and APIs:
1) 3D was completely CPU-based. The CPU would perform all lighting, transforming, rasterizing and finally the actual pixel drawing.
(Pre-accelerator era, CGA/EGA/VGA)
2) The inner loop of the triangle filling routine was accelerated by the videocard. A triangle is rendered as two scanline-oriented quads (upper and lower half). The CPU could pass these quads to the videocard, and the scanlines were filled automatically. Basic texturing and shading could be applied as well, but the CPU still had to do the setup to calculate the gradients for the quads.
(Early VooDoo cards, pre-D3D to early D3D)
3) Rasterizing and triangle gradient setup were accelerated by the videocard. The CPU could now feed triangles in screenspace directly to the videocard.
(Roughly D3D5-era)
4) The dawn of the GPU: Transforming and lighting were accelerated by the videocard. The CPU could now pass triangles in object space (which could be stored in videomemory, since they would be static throughout the lifetime of the object), transform matrices and light parameters to the GPU, and the GPU would completely accelerate the drawing process from start to finish.
5) The dawn of programmable shaders: Up to now, the lighting and shading were fixed-function, and operated as a state machine. The CPU would set a few states to control how the GPU would perform shading. This state machine had become so complex (and, because of multitexturing, it already worked in multiple stages) that it started to make sense to model these states as simple instructions with input and output registers. The fixed-function T&L and shading operations could now be 'scripted' in an assembly-like language.
6) Unified shaders and GPGPU: Up to now, vertex processing and pixel processing were two separate types of operations, requiring separate types of execution units. The GPU would have a small set of vertex units, which would have high-precision floating point and a relatively powerful instruction set. Then it would have a larger set of pixel units, which were aimed more at texturing, and had lower-precision arithmetic and a simpler, less powerful instruction set. You basically had to use two languages when programming: vertex shader language and pixel shader language.
But now, all shaders were made unified. So now you could use the same high-precision powerful instructions for pixel shaders as for vertex shaders. The hardware now also used a single large array of shader units, which could dynamically be allocated to whatever shaders were running (effectively an automatic load balancing system between vertex processing and pixel processing).
At this time, nVidia also introduced the first real GPGPU: the GeForce 8-series. Its unified shaders were linked to a large shared cache, and could be used outside the graphics pipeline, which had been hardwired up to now (if you wanted to do any calculations, you'd always have to set up geometry and render actual triangles, in order to make pixel shaders execute and output data to a buffer).

So in short: Yes, at first the CPU did everything, then very gradually, the GPU started to take over. At first the GPU would just have hardwired functionality, so it was not programmable at all. The first generation of 'programmable' GPUs were barely more 'programmable' than the last generation of state-based fixed function GPUs (especially regarding pixel processing. At first vertex shaders were the biggest step forward. Pixel shaders were very limited).
Posted on 2011-04-19 08:00:45 by Scali
Thank you very much, Scali. You are such a knowledge mine. I have learned so much from your replies that it would probably take me weeks or months of research to dig up the answers from articles or books.
Posted on 2011-04-19 16:50:02 by banzemanga
You're welcome. I've been doing graphics since the days of Amiga and PCs with VGA cards. So I've experienced pretty much the entire development of 3d acceleration and APIs from the beginning.
I guess a big problem with finding information on the internet is that it's difficult to place it in the proper context. Videocards and APIs have changed a lot over the years, and there's a lot of outdated, useless, or even downright wrong information out there.

I've written a JPG decoder myself (just CPU though), and I have some experience with Cuda and DirectCompute, so I might be able to help.
F0dder also mentioned some GPU-accelerated JPG stuff to me a while ago. I think it was done with Cuda. One of the experimental things there was a 'trial-and-error' approach to the decoding of the Huffman bitstream. They would cut up the stream in small pieces, and have many threads decode parts of them in parallel.
The stream pieces had no clear beginning and ending, so the threads would just have to try to find a starting point for decoding, and just find out where it leads to.

Most of the other parts of JPG decoding are very straightforward for a SIMD-oriented parallel architecture. It's mainly just operations on 8x8 matrices.
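
To give an idea of what those 8x8 operations look like, here is a minimal C++ sketch of the textbook 8x8 inverse DCT, written the slow, direct way. It is not taken from any of the decoders mentioned in this thread, and the names Block8x8, idctSample and idct8x8 are made up for the example. It also leaves out the dequantization, the +128 level shift and the clamping that a real JPEG decoder needs. The point is that every output sample depends only on the coefficient block, so each sample (or each block) could be handed to its own GPU thread.

```cpp
#include <array>
#include <cassert>
#include <cmath>

typedef std::array<std::array<double, 8>, 8> Block8x8;

// Computes a single output sample (x, y) of the 8x8 inverse DCT.
// Samples are independent of each other, so they can run in parallel.
double idctSample(const Block8x8& coeff, int x, int y) {
    assert(x >= 0 && x < 8 && y >= 0 && y < 8);
    const double pi = 3.14159265358979323846;
    double sum = 0.0;
    for (int v = 0; v < 8; ++v) {
        for (int u = 0; u < 8; ++u) {
            // Normalization factors: 1/sqrt(2) for the DC term, 1 otherwise.
            const double cu = (u == 0) ? std::sqrt(0.5) : 1.0;
            const double cv = (v == 0) ? std::sqrt(0.5) : 1.0;
            sum += cu * cv * coeff[v][u]
                 * std::cos((2 * x + 1) * u * pi / 16.0)
                 * std::cos((2 * y + 1) * v * pi / 16.0);
        }
    }
    return sum / 4.0;
}

// CPU stand-in for a grid of 8x8 "threads", one per output sample.
Block8x8 idct8x8(const Block8x8& coeff) {
    Block8x8 out{};
    for (int y = 0; y < 8; ++y) {
        for (int x = 0; x < 8; ++x) {
            out[y][x] = idctSample(coeff, x, y);
        }
    }
    return out;
}
```

A production decoder would use a separable or otherwise factored IDCT instead of this direct 64-term sum per sample; the direct form is used here only because it makes the independence of the samples obvious.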
Posted on 2011-04-22 05:14:39 by Scali
I found some CUDA jpeg decoder stuff on sourceforge yeah, but there was also some article somewhere - don't have any links available, so you can probably google it faster than I can, heading off to bed.

But the gist was that jpeg decoding on GPU isn't super easy, you have a massive bottleneck around the huffman decode part which is very hard to parallelize; it's basically, as scali mentions, a trial and error hackjob. Syncing between the GPU threads is very expensive, if even possible at all.
Posted on 2011-05-18 18:23:46 by f0dder
I think this was the one: http://www.eecg.toronto.edu/~moshovos/CUDA08/arx/JPEG_report.pdf
Posted on 2011-05-18 18:32:29 by Scali
After abandoning this, I decided to come back with a different approach. I tried using OpenCL with ATI Stream, since I have an ATI video card. I managed to make a simple program that does some simple math computation. But I pretty much let it go, because the compiled code was some kind of bytecode, as mentioned before, and it had to be run through additional software provided by ATI.

Then I remembered that creating a custom shader in DirectX was also a possibility. So I was wondering if I could get the output of the shader and store it in one of the variables of my application. I have almost zero knowledge of programming with DirectX, so I need some help, and I can't find the missing pieces in the tutorials I find online.

So let's assume this simple shader, which returns the value of a variable that was set equal to '1':

// 'static' makes the global writable; plain HLSL globals are implicitly constant.
static int global_dx;

int output() {
    global_dx = 1;
    return global_dx;
}

And then my application which prints the value of the shader to the console:

#include <iostream>
#include <d3d11.h>
#include <d3dx11.h>
#include <d3dx10.h>

// include the Direct3D Library file
#pragma comment (lib, "d3d11.lib")
#pragma comment (lib, "d3dx11.lib")
#pragma comment (lib, "d3dx10.lib")

using namespace std;

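int main() {
    // Compile the "output" entry point of the shader file into bytecode.
    ID3D10Blob *VS = 0;
    HRESULT hr = D3DX11CompileFromFile("shaders.hls", 0, 0, "output", "vs_5_0", 0, 0, 0, &VS, 0, 0);
    if (FAILED(hr)) {
        cout << "Shader compilation failed" << endl;
        return 1;
    }

    //I need help here
    //int value_from_shader = DX_function_calls_shader_function;
    //cout << value_from_shader  << endl;

    VS->Release();
    return 0;
}
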
I need to know how to call the shader function from my application.
Posted on 2011-06-19 16:51:18 by banzemanga
You don't 'call' the shader, you set the active shaders to the pipeline (the name of the function reflects the part of the pipeline where it is set, see http://msdn.microsoft.com/en-us/library/ff476882(v=vs.85).aspx).
There are 6 different types of shaders in Direct3D 11, each with their own function to set them as active:
- Vertex shaders: VSSetShader()
- Hull shaders (step 1 of tessellation): HSSetShader()
- Domain shaders (step 2 of tessellation): DSSetShader()
- Geometry shaders: GSSetShader()
- Pixel shaders: PSSetShader()

Wait, that's only 5, and I said 6. But we covered the whole pipeline already.
That's because these 5 shaders are all used simultaneously during rendering (although some of them are optional). This is the *graphics* pipeline, after all. So you set all the shaders you want to use, and then you set up geometry to be rendered. That is how you 'execute' the shaders.

The last type of shader is the compute shader. You set this with the CSSetShader() function. However, it is not executed during rendering. There is a special function to run a compute shader. Instead of the Draw() functions, you use the Dispatch() functions to execute compute shaders. The graphics pipeline is not used in this case. You specify input and output buffers for your compute shader, a more generalized version of the textures and render targets that you'd use in rendering. There is also no fixed-function hardware in between (unlike with rendering, where you still get 'automatic' perspective divide, clipping, triangle setup and everything); only your shader code is executed, nothing else.
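
To make Dispatch() a little less abstract: it launches a number of thread groups, and each group contains the number of threads declared with numthreads in the shader. The plain C++ sketch below only reproduces that index arithmetic on the CPU; UInt3, groupsNeeded and dispatchThreadId are hypothetical helper names, not part of Direct3D. The formula mirrors how HLSL's SV_DispatchThreadID relates to SV_GroupID and SV_GroupThreadID.

```cpp
#include <cassert>

// Hypothetical 3-component unsigned vector, standing in for HLSL's uint3.
struct UInt3 {
    unsigned x, y, z;
};

// Number of thread groups to dispatch so that 'count' items are covered
// when each group holds 'groupSize' threads (rounds up).
unsigned groupsNeeded(unsigned count, unsigned groupSize) {
    assert(groupSize > 0);
    return (count + groupSize - 1) / groupSize;
}

// Global thread coordinate of one thread:
// group id * threads per group + thread id within the group.
UInt3 dispatchThreadId(UInt3 groupId, UInt3 numThreads, UInt3 groupThreadId) {
    UInt3 id = {
        groupId.x * numThreads.x + groupThreadId.x,
        groupId.y * numThreads.y + groupThreadId.y,
        groupId.z * numThreads.z + groupThreadId.z
    };
    return id;
}
```

Because the group count is rounded up, the last group can contain threads whose index lies past the end of the data, so a shader would normally compare its index against the element count before reading or writing.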

The data you get in the blob after compiling a shader is still just bytecode. It is not native to your videocard, the bytecode will be compiled to native code by the driver. The main reason why D3D allows you access to the bytecode (unlike OpenGL/OpenCL) is to be able to pre-compile shaders. This makes them more compact, and it makes your application start more quickly, since most of the compilation is already done. There's also the benefit that you don't need to distribute the source code of your shaders.

So what you want to do, extract values from a shader, can only be done in 2 ways:
1) Use a compute shader, Dispatch() it, and extract the output from the output buffer.
2) Set up geometry so you can execute your graphics shaders and capture the results you want in the pixels of a rendertarget. Then extract the output from the rendertarget.

Clearly compute shaders are the nicer option.
Posted on 2011-06-20 03:51:39 by Scali
It's also good to know that you can run Compute Shaders on DX10-capable hardware (I.E. you don't need a DX11 gfx adapter).
Posted on 2011-06-23 08:07:11 by ti_mo_n

To be exact: *some* DX10-capable hardware can run Compute Shaders. There are 3 versions of Compute Shaders:
CS4.0: DX10.0 class (GeForce 8-series and up).
CS4.1: DX10.1 class (Radeon 4000-series and up).
CS5.0: DX11 class
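
As a small sketch of how an application might act on the list above, the C++ helper below maps a hardware class to the matching compute shader compile profile string (cs_4_0, cs_4_1, cs_5_0). The FeatureLevel enum and computeProfileFor are hypothetical stand-ins made up for this example; a real program would query the feature level from the Direct3D 11 device instead.

```cpp
#include <string>

// Hypothetical stand-in for the feature level reported by the device.
enum class FeatureLevel { Level9, Level10_0, Level10_1, Level11_0 };

// Returns the compute shader profile for a hardware class, or an empty
// string if that class has no compute shader version at all.
std::string computeProfileFor(FeatureLevel level) {
    switch (level) {
        case FeatureLevel::Level10_0: return "cs_4_0";
        case FeatureLevel::Level10_1: return "cs_4_1";
        case FeatureLevel::Level11_0: return "cs_5_0";
        default:                      return "";
    }
}
```

A non-empty result only says which profile to compile for; on DX10.0/10.1 class hardware the driver must still report that compute shaders are actually available.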

They do require driver support (it's an optional feature for DX11 in DX10/DX10.1 downlevel mode. For a true DX11 device, CS support is required, so all DX11 hardware supports it). nVidia and AMD have driver support for their DX10.0/10.1 devices (although afaik AMD only supports the 4000-series and up, even though the 2000 and 3000 series are also DX10.0 and DX10.1 respectively). Intel does not supply drivers with compute shader support, so although their DX10+ hardware might be capable of compute shaders in theory, you cannot use them.
I don't have experience with other GPU brands (S3? XGI?), so I don't know if any brand other than nVidia and AMD support compute shaders in practice.

Also, obviously you need a DX11-capable OS, so it only works on Windows 7 or Vista with SP2 and the Platform Update.
Posted on 2011-06-23 10:04:46 by Scali