I tried my hand at learning some MMX and this is the result.
I use the following formula:
dest = alpha * (source - dest) / 255 + dest;
Previously the best I could do was work on each channel of the pixel, so doing everything at once I imagine is a bit quicker.
I'm using a DIB section for drawing, so I get a pointer to the bits (which is dst).
color is a dword, and serves as the source.
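For reference, the per-channel version boils down to something like this (a rough scalar sketch; the helper name is mine, and I'm assuming a 32-bit XRGB pixel like the DIB gives me):
#include <stdint.h>

// Scalar reference: dest = alpha * (source - dest) / 255 + dest, one channel at a time.
static uint32_t BlendPixel(uint32_t dest, uint32_t color, uint32_t alpha)
{
    uint32_t result = 0;
    for (int shift = 0; shift < 32; shift += 8)
    {
        int32_t d = (dest  >> shift) & 0xFF;
        int32_t s = (color >> shift) & 0xFF;
        int32_t c = (int32_t)alpha * (s - d) / 255 + d;  // always lands back in 0..255
        result |= (uint32_t)c << shift;
    }
    return result;
}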
Basically this is how I tried to lay it out...
00 XX 00 RR 00 GG 00 BB
-
00 XX 00 RR 00 GG 00 BB
*
00 AA 00 AA 00 AA 00 AA
/
00 FF 00 FF 00 FF 00 FF
+
00 XX 00 RR 00 GG 00 BB
pxor mm3, mm3 //clear register (all zeroes, used for unpacking/packing)
mov eax, dst //get everything ready (dst = pointer into the DIB bits; the bracketed operands were lost in the post, so these names are assumed)
movd mm0, dword ptr [eax] //dest pixel
movd mm1, color //src
movd mm2, alpha //alpha, 0..255 in the low byte
punpcklbw mm0, mm3 //unpack dst to words
punpcklbw mm1, mm3 //unpack color
punpcklbw mm2, mm2 //unpack alpha...
punpcklbw mm2, mm2
punpcklbw mm2, mm3 //...into 00AA 00AA 00AA 00AA
psubw mm1, mm0 //(color - dest) as signed words (psubusb would clamp negative channels to 0)
pmullw mm1, mm2 //alpha * (color - dest), low 16 bits of each product
psllw mm0, 8 //dest * 256
paddw mm1, mm0 //alpha * (color - dest) + dest * 256 = alpha * color + (256 - alpha) * dest
psrlw mm1, 8 //... / 256 (an approximation of / 255)
packuswb mm1, mm3 //pack the words back down to bytes
movd dword ptr [eax], mm1 //write the blended pixel back
Any ideas on improvements? It's very straightforward, but if anyone is aware of any neat tricks / improvements, please let me know.
Good work 8)
The only way to do much better is to take advantage of the video card via per-pixel operations written in a (preferably asm) shader language... aka a "pixel shader".
This essentially means rendering under D3D or OGL.
I have considered it, but they are a lot deeper / harder to get into than GDI; this is only magnified by the fact that I'm horrible at math, and have no real experience in it past a basic college level. :)
I've been goofing with creating some effects like starfields, fun stuff.
Keep doing that; once you've written a pixel effect for GDI, you can translate it into GPU code at a later date, should you choose to do so.
And don't worry about the maths too much; I'm in the same boat. You pick it up as you go along (as you need it).
I've been reading around some more:
http://www.tommesani.com/SSEPrimer.html
http://avisynth.org/mediawiki/Filter_SDK/Simple_MMX_optimization
I realized one can do
pshufw mm2, mm2, 0
instead of the 3 unpacks for alpha. ;)
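In intrinsic form (assuming the compiler's MMX/SSE headers are available; just a sketch, the helper name is mine), that alpha setup collapses to:
#include <xmmintrin.h>  // MMX + SSE intrinsics; pshufw = _mm_shuffle_pi16

// Broadcast an 8-bit alpha into all four word lanes: 00AA 00AA 00AA 00AA.
static __m64 BroadcastAlpha(int alpha)
{
    __m64 a = _mm_cvtsi32_si64(alpha);  // 00 00 00 00 00 00 00 AA
    return _mm_shuffle_pi16(a, 0);      // replicate word 0 into all four words
}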
Assembly shader language has been abandoned in DX10 though. There was little to no advantage over HLSL anymore once the compiler matured (the assembly is a virtual format anyway: it's merely bytecode that gets compiled by the actual video card driver to the native architecture, which may or may not bear any resemblance to the DX assembly language).
For alphablending you don't necessarily need shaders though. Historically, alphablending was a special stage after rasterization, with dedicated hardware. This saves you an extra renderpass (you can't read from the framebuffer in a shader, so you'd have to use render-to-texture).
You mean DX11, which is far from finalized, and Microsoft is known to make radical changes days before finalizing; and the asm is still there in the form of vendor-agnostic macro-instructions in binary blobs. Meanwhile, Microsoft explains they're moving to HLSL-only for the sole reason of inlining and unrolling loops and subroutines, as both have unexpectedly awful performance even on the latest GPUs, and the only way to improve the situation is to have hundreds of variations of each shader. In any case, yes, asm is irrelevant on an ever-changing platform where HLSL/Cg operations map directly to asm opcodes, memory is accessed via functions, and stalls are easily avoided by switching threads on every single cycle.
Anyway, we're talking about a single one-liner for this alphablending:
gl_FragColor = texture2D(tex,varCoord); // glsl
TEX.F oCol, fragment.texcoord[1], texture[0], 2D ; nvASM4
Then simply set renderstate to alphablend.
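For GL that renderstate is just a couple of calls (a sketch; the texture and quad setup are assumed to be done elsewhere):
#include <windows.h>
#include <GL/gl.h>

// Fixed-function alpha blending: framebuffer = srcAlpha * src + (1 - srcAlpha) * framebuffer.
static void EnableAlphaBlend()
{
    glEnable(GL_BLEND);
    glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);
}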
No, I meant DX10, you know, that API that has been out for a few years already.
ASM has been gone for a while, as far as I know:
http://msdn.microsoft.com/en-us/library/bb509561(VS.85).aspx
"With the introduction of the Direct3D 10 API, the pipeline is now virtually 100% programmable using only HLSL; in fact, assembly is no longer used to generate shader code with Direct3D 10."
http://msdn.microsoft.com/en-us/library/bb205073(VS.85).aspx#Porting_Shaders
"Direct3D 10 limits the use of assembly language to that of debugging purposes only, therefore any hand written assembly shaders used in Direct3D 9 will need to be converted to HLSL."
So I most certainly did not mean DX11. I think both you and Homer somehow must have missed out on DX10 altogether?
There still is a bytecode-like language, yes, which is passed on to the display driver for final compilation into hardware-specific code. However, bytecode is not assembly code. There's a difference there. DX10 doesn't give you the tools to write bytecode with mnemonics (aka assembly programming); HLSL is the only way. In theory it will always be possible, but there is no use, just like there's no use in writing Java or .NET directly in assembly rather than using a regular programming language (aside from perhaps obfuscation reasons or such).
Also, the code you provided doesn't actually perform the alphablending. It just does a texture fetch. The actual alphablending is done in the output-merger stage, which can be controlled with blendstates.
So your example doesn't really show anything :)
You wouldn't need shaders anyway, in DX9 or OpenGL, if you just wanted to do a texture fetch. Fixed-function is well capable of that, and it's less code to write/maintain.
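Roughly, in DX10 that blendstate setup looks like this (a from-memory sketch, so double-check it against the SDK headers):
#include <d3d10.h>

// Classic src-alpha blending via the output-merger stage; no shader involvement needed.
ID3D10BlendState* CreateAlphaBlendState(ID3D10Device* device)
{
    D3D10_BLEND_DESC desc = {};
    desc.BlendEnable[0]           = TRUE;
    desc.SrcBlend                 = D3D10_BLEND_SRC_ALPHA;
    desc.DestBlend                = D3D10_BLEND_INV_SRC_ALPHA;
    desc.BlendOp                  = D3D10_BLEND_OP_ADD;
    desc.SrcBlendAlpha            = D3D10_BLEND_ONE;
    desc.DestBlendAlpha           = D3D10_BLEND_ZERO;
    desc.BlendOpAlpha             = D3D10_BLEND_OP_ADD;
    desc.RenderTargetWriteMask[0] = D3D10_COLOR_WRITE_ENABLE_ALL;

    ID3D10BlendState* state = NULL;
    device->CreateBlendState(&desc, &state);
    return state;  // bind with: FLOAT bf[4] = {0}; device->OMSetBlendState(state, bf, 0xFFFFFFFF);
}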
You're right, I didn't thoroughly check the params of ID3D10Device::CreatePixelShader.
I haven't used DX10, but I have been checking its references (while never even thinking of coding in macro-asm, so I didn't notice...) from the initial public drafts to the March 2009 version. Instead, I've been using this and more functionality via GL2.1, GL3.0 and GL3.1.
Scali, don't step on my foot here :P. I've been coding and optimizing SM4+ graphics for a year or so already: things like custom improved motion-blurs/DOFs, hybrid deferred renderers, antialiasers, etc etc - for as much detail and eyecandy without aliasing as possible.
Getting shaders to work is a few more lines of code; why rely on fixed-func to generate the optimal shaders and IA layouts for you?
Well, you were stepping on my feet.
If you want to correct me, fine, but at least make sure I'm wrong first. You were coming off as rather pedantic with your "You mean DX11" rant. I know what I mean. You could have bothered to check your facts first.
I've been doing graphics programming since the early 90s. So I'm not easily impressed.
It's all moot anyway. I wouldn't recommend starting with anything but DX10 at this point, and DX10 doesn't have any fixed-function anymore, so shaders are the only option.
True, the idea of that bytecode is to allow GFX card manufacturers their own native language. .NET was supposed to do the same with CPU manufacturers.
Yea, .NET, and Java before it.
One of the main reasons why GPUs can evolve so quickly is that they aren't tied to an instruction set. A DX9 card can execute DX8 code, but the underlying hardware is completely different. Likewise, a DX10 card is completely different from a DX9 card. But because of the abstraction done in DirectX/OpenGL shaders, the programmer doesn't notice any of this, and his code just continues to work.
I've often wondered what the world of CPUs would look like if something like Java or .NET became popular, and CPU designers didn't have to put x86 compatibility into the hardware, but instead were free to design a custom instruction set that would suit the current state of the art in the best way possible... We might get a much faster rate of evolution in the CPU world, just like with GPUs.
But ironically, the opposite seems to be happening now. Because GPUs are getting ever more generic, Intel is actually designing a 'GPU' based on many cores with x86 technology (Larrabee). So who knows, future GPUs may be x86 as well :)
I've been coding and optimizing SM4+ graphics for a year or so already: things like custom improved motion-blurs/DOFs, hybrid deferred renderers, antialiasers, etc etc - for as much detail and eyecandy without aliasing as possible.
This is really interesting, do you work for a specific game studio or something?
slovach: no. The SM4+ graphics are a hobby of mine. I was sick of writing optimized rasterizers+engines for the different weak PDAs at work.