Heya.
I'm living back in the Dark Ages. I don't know a lot of new-fangled instructions.
Would anyone be willing to help optimize these two short functions?
I want to either make no assumptions about the CPU, or have cases optimized for various CPUs.

;===============================================
;Some helper functions I had to code myself :(
;Badly need optimizing !!!

ColorLerp PROC pC1:DWORD, pC2:DWORD, F:FLOAT, pOut:DWORD

fld [pC2].D3DXCOLOR.r ; pOut->r = pC1->r + s * (pC2->r - pC1->r);
fsub [pC1].D3DXCOLOR.r
fmul F
fadd [pC1].D3DXCOLOR.r
fstp [pOut].D3DXCOLOR.r

fld [pC2].D3DXCOLOR.g ; pOut->g = pC1->g + s * (pC2->g - pC1->g);
fsub [pC1].D3DXCOLOR.g
fmul F
fadd [pC1].D3DXCOLOR.g
fstp [pOut].D3DXCOLOR.g

fld [pC2].D3DXCOLOR.b ; pOut->b = pC1->b + s * (pC2->b - pC1->b);
fsub [pC1].D3DXCOLOR.b
fmul F
fadd [pC1].D3DXCOLOR.b
fstp [pOut].D3DXCOLOR.b

fld [pC2].D3DXCOLOR.a ; pOut->a = pC1->a + s * (pC2->a - pC1->a);
fsub [pC1].D3DXCOLOR.a
fmul F
fadd [pC1].D3DXCOLOR.a
fstp [pOut].D3DXCOLOR.a

ret
ColorLerp ENDP

;=====================================
;Convert D3DXColor into ARGB dword
DXColorToDW PROC pIn:DWORD
local btemp:DWORD
fld pIn.D3DXCOLOR.a
fistp btemp
mov ebx,btemp
mov al,bl
shl eax,8
fld pIn.D3DXCOLOR.r
fistp btemp
mov ebx,btemp
mov al,bl
shl eax,8
fld pIn.D3DXCOLOR.g
fistp btemp
mov ebx,btemp
mov al,bl
shl eax,8
fld pIn.D3DXCOLOR.b
fistp btemp
mov ebx,btemp
mov al,bl
ret
DXColorToDW ENDP

Well, there they are, in all their gory.
They simply suck. I guess I do too.
Thanks in advance to anyone who takes pity on me, I was a dab hand when I knew all the opcodes :(
Posted on 2003-01-06 12:43:15 by Homer
ColorLerp already looks optimized, and I'm too lazy to build something to test the code, but this "might" work for DXColorToDW...

not tested.

DXColorToDW PROC pIn:DWORD
local btemp:DWORD
fld pIn.D3DXCOLOR.a
fistp btemp
mov ebx,btemp
movzx eax, bl
shl eax,8

fld pIn.D3DXCOLOR.r
fistp btemp
mov ebx,btemp
shl ebx, 24
shr ebx, 24
or eax, ebx
shl eax,8

fld pIn.D3DXCOLOR.g
fistp btemp
mov ebx,btemp
shl ebx, 24
shr ebx, 24
or eax, ebx
shl eax,8

fld pIn.D3DXCOLOR.b
fistp btemp
mov ebx,btemp
shl ebx, 24
shr ebx, 24
or eax, ebx
ret
DXColorToDW ENDP

I don't know if this is faster or if it works but hey there's no harm in trying... :grin:

The point is that I'm using full 32-bit registers throughout: writing an 8-bit register (mov al, bl) and then reading the full 32-bit one can cause partial-register stalls on some CPUs ... blah! blah! :grin:
Posted on 2003-01-06 15:16:48 by arkane
Originally posted by EvilHomer2k
fld [pC2].D3DXCOLOR.r ; pOut->r = pC1->r + s * (pC2->r - pC1->r);

How will this work? If pC2 is a pointer to a D3DXCOLOR struct, you first have to dereference it before adding the '.r' offset (like mov eax, [pC2] / fld [eax].D3DXCOLOR.r). Have you tested whether your code works? I don't think it would work this way; this code would just load the pointer itself as a floating-point value :)

Well, there they are, in all their gory.
They simply suck. I guess I do too.
Thanks in advance to anyone who takes pity on me, I was a dab hand when I knew all the opcodes :(

Don't be so negative :) Just try, measure the time your code takes and try to improve it. It's the only way you'll learn it.

Thomas
Posted on 2003-01-06 15:21:54 by Thomas
Two problems I see with your code, one major the other one minor. Nothing to do however with optimization.

Major: RGBa is the little-endian representation of a 32-bit color. Which means that the "red" component must be in the least significant byte of the dword. Your code loads the blue component in the least significant byte.
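
To make the byte-order point concrete, here is a minimal C sketch. The helper names are made up for illustration, and which layout is the right one depends on what consumes the dword. It packs four 8-bit channels with red in the least significant byte, as described above, and, for comparison, in the 0xAARRGGBB order that Direct3D's D3DCOLOR_ARGB macro produces, where blue sits in the least significant byte.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers, not part of D3DX. */

/* Red in the least significant byte: the value reads 0xAABBGGRR. */
static uint32_t pack_rgba_red_low(uint8_t r, uint8_t g, uint8_t b, uint8_t a)
{
    return (uint32_t)r | ((uint32_t)g << 8) | ((uint32_t)b << 16) | ((uint32_t)a << 24);
}

/* Blue in the least significant byte: the value reads 0xAARRGGBB. */
static uint32_t pack_argb_blue_low(uint8_t a, uint8_t r, uint8_t g, uint8_t b)
{
    return ((uint32_t)a << 24) | ((uint32_t)r << 16) | ((uint32_t)g << 8) | (uint32_t)b;
}

/* Copy the four bytes of a dword out in memory order. On a little-endian
   x86 machine out[0] is the least significant byte. */
static void dword_to_bytes(uint32_t value, uint8_t out[4])
{
    assert(out != NULL);
    memcpy(out, &value, sizeof value);
}
```

On x86, dword_to_bytes applied to the first layout yields the bytes R, G, B, A in memory order; applied to the second it yields B, G, R, A.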

Minor: If there ever is a possibility of the calculated color intensity being higher than 255.5, the FPU would return a value of 256 when the FPU is in the default mode. Which means that the color intensity effectively used in such a case would be "0". Either you set the FPU for truncation before storing fp's as integers, or you must check each stored value for the possibility of the 100h value. (I must assume that there is no possibility of values >=256 from your calculations.)
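
The wrap-around is easy to reproduce in C. The sketch below is only an illustration with hypothetical function names: it emulates the FPU's default round-to-nearest (ties to even) by hand, then compares "keep the low byte" with "clamp first".

```c
#include <assert.h>
#include <stdint.h>

/* Round to nearest, ties to even: the x87 default that fistp uses.
   Assumes v is small enough to fit in a long. */
static long round_nearest_even(float v)
{
    assert(v > -1.0e9f && v < 1.0e9f);
    long n = (long)v;               /* truncates toward zero */
    float frac = v - (float)n;      /* same sign as v, magnitude below 1 */
    long step = (v < 0.0f) ? -1 : 1;
    if (frac < 0.0f) frac = -frac;
    if (frac > 0.5f || (frac == 0.5f && (n & 1) != 0)) n += step;
    return n;
}

/* What "fistp, then keep the low byte" effectively does. */
static uint8_t channel_low_byte(float v)
{
    return (uint8_t)(round_nearest_even(v) & 0xFF);
}

/* Clamp to 0..255 before narrowing, so 256 becomes 255 instead of 0. */
static uint8_t channel_clamped(float v)
{
    long n = round_nearest_even(v);
    if (n < 0) n = 0;
    if (n > 255) n = 255;
    return (uint8_t)n;
}
```

With an input of 255.6 the first function returns 0 and the second returns 255, which is exactly the difference described above.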

Have fun

Raymond
Posted on 2003-01-06 19:56:20 by Raymond
Heya.

Thanks for the feedback on those.
Here is the (tested) revised version of those functions.
Anyone wanna have a go at an SSE2 version ?


;Some helper functions I had to code myself :(
;Badly need optimizing !!!

ColorLerp PROC pC1:LPD3DXCOLOR, pC2:LPD3DXCOLOR, F:FLOAT, pOut:LPD3DXCOLOR
push esi
push edi
push ecx
mov esi,pC1
mov edi,pOut
mov ecx,pC2

fld [ecx].D3DXCOLOR.r ;pOut->r = pC1->r + s * (pC2->r - pC1->r);
fsub [esi].D3DXCOLOR.r
fmul F
fadd [esi].D3DXCOLOR.r
fstp [edi].D3DXCOLOR.r

fld [ecx].D3DXCOLOR.g ; pOut->g = pC1->g + s * (pC2->g - pC1->g);
fsub [esi].D3DXCOLOR.g
fmul F
fadd [esi].D3DXCOLOR.g
fstp [edi].D3DXCOLOR.g

fld [ecx].D3DXCOLOR.b ; pOut->b = pC1->b + s * (pC2->b - pC1->b);
fsub [esi].D3DXCOLOR.b
fmul F
fadd [esi].D3DXCOLOR.b
fstp [edi].D3DXCOLOR.b

fld [ecx].D3DXCOLOR.a ; pOut->a = pC1->a + s * (pC2->a - pC1->a);
fsub [esi].D3DXCOLOR.a
fmul F
fadd [esi].D3DXCOLOR.a
fstp [edi].D3DXCOLOR.a

pop ecx
pop edi
pop esi
ret
ColorLerp ENDP

;=====================================================================================
;Convert D3DXColor(RGBA floats) into ARGB dword in ass-about format
;floats truncated at 255 (saturated byte)

DXColorToDW PROC pIn:LPD3DXCOLOR
local btemp:DWORD
push esi
mov esi,pIn
fld [esi].D3DXCOLOR.b ; blue ends up in the top byte of eax
fistp btemp
mov ebx,btemp
.if ebx>255 ; unsigned compare, so negative results also clamp to 255
mov ebx,255
.endif
mov al,bl
shl eax,8
fld [esi].D3DXCOLOR.g
fistp btemp
mov ebx,btemp
.if ebx>255
mov ebx,255
.endif
mov al,bl
shl eax,8
fld [esi].D3DXCOLOR.r
fistp btemp
mov ebx,btemp
.if ebx>255
mov ebx,255
.endif
mov al,bl
shl eax,8
fld [esi].D3DXCOLOR.a ; alpha ends up in the low byte
fistp btemp
mov ebx,btemp
.if ebx>255
mov ebx,255
.endif
mov al,bl
pop esi
ret
DXColorToDW ENDP
Posted on 2003-01-07 06:56:10 by Homer
for ColorLerp - assuming everything is in float size
#include<stdio.h>


typedef struct
{
float r;
float g;
float b;
float a;
} D3DXCOLOR;

int main(void)
{
D3DXCOLOR d1 = {3.5f, 3.5f, 3.5f, 3.5f};
D3DXCOLOR d2 = {1.5f, 1.5f, 1.5f, 1.5f};
D3DXCOLOR f = {2.0f, 2.0f, 2.0f, 2.0f};
D3DXCOLOR output;

__asm
{
movaps xmm0, d2 ; pC2
movaps xmm1, d1 ; pC1
movaps xmm2, f ; s replicated into all four lanes
subps xmm0, xmm1 ; pC2 - pC1
mulps xmm0, xmm2 ; s * (pC2 - pC1)
addps xmm0, xmm1 ; pC1 + s * (pC2 - pC1)
movaps output, xmm0
}

printf("%f\n", output.r);
printf("%f\n", output.g);
printf("%f\n", output.b);
printf("%f\n", output.a);
return 0;
}
:grin: C he! he! :)

Just in case you're wondering about D3DXCOLOR f: I did that for parallel processing. Whatever the value of F is, just copy it into every field of the structure. It doesn't matter which field is which, as long as all four hold the same value. :)

pc1 == d1
pc2 == d2
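
The same idea can be sketched with compiler intrinsics instead of inline assembly. This is a rough illustration, not a tuned routine: Color4 and color_lerp4 are hypothetical names, the loads are unaligned so nothing is assumed about where the structs live, and _mm_set1_ps does the "same value in every field" step, so no separate f structure is needed. A scalar fallback is included for builds without SSE.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define LERP_USE_SSE 1
#endif

/* Stand-in for D3DXCOLOR: four consecutive floats. */
typedef struct { float r, g, b, a; } Color4;

/* out = c1 + s * (c2 - c1) on all four channels. */
static void color_lerp4(const Color4 *c1, const Color4 *c2, float s, Color4 *out)
{
    assert(c1 != NULL && c2 != NULL && out != NULL);
#ifdef LERP_USE_SSE
    __m128 a = _mm_loadu_ps(&c1->r);   /* movups: no alignment assumed */
    __m128 b = _mm_loadu_ps(&c2->r);
    __m128 f = _mm_set1_ps(s);         /* s copied into all four lanes */
    __m128 t = _mm_mul_ps(f, _mm_sub_ps(b, a));
    _mm_storeu_ps(&out->r, _mm_add_ps(a, t));
#else
    /* Scalar fallback for targets without SSE. */
    out->r = c1->r + s * (c2->r - c1->r);
    out->g = c1->g + s * (c2->g - c1->g);
    out->b = c1->b + s * (c2->b - c1->b);
    out->a = c1->a + s * (c2->a - c1->a);
#endif
}
```

With c1 = 3.5, c2 = 1.5 and s = 2.0 in every channel, each output channel is 3.5 + 2.0 * (1.5 - 3.5) = -0.5.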

for DXColorToDW - Not tested again :)
DXColorToDW PROC pIn:DWORD

local btemp:DWORD
local limit:DWORD
mov edx, pIn ; dereference the pointer before using the field offsets
mov limit, 255 ; cmovcc takes no immediate operand, so the limit lives in memory
fld [edx].D3DXCOLOR.b
fistp btemp
mov ebx,btemp
cmp ebx, 255
cmova ebx, limit ; needs a P6-class CPU or later
movzx eax, bl
shl eax,8

fld [edx].D3DXCOLOR.g
fistp btemp
mov ebx,btemp
cmp ebx, 255
cmova ebx, limit
shl ebx, 24
shr ebx, 24
or eax, ebx
shl eax,8

fld [edx].D3DXCOLOR.r
fistp btemp
mov ebx,btemp
cmp ebx, 255
cmova ebx, limit
shl ebx, 24
shr ebx, 24
or eax, ebx
shl eax,8

fld [edx].D3DXCOLOR.a
fistp btemp
mov ebx,btemp
cmp ebx, 255
cmova ebx, limit
shl ebx, 24
shr ebx, 24
or eax, ebx
ret
DXColorToDW ENDP




btw, I think you should switch from ebx to ecx/edx: ebx has to be preserved across the call, so using a scratch register saves you from pushing ebx on entry to the procedure and popping it afterwards.



Don't forget 16-byte alignment (align 16) of memory operands when coding in assembly... movaps requires this alignment.
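
For the C test programs in this thread, the equivalent of align 16 can be sketched as follows. This assumes a C11 compiler; AlignedColor4 and is_aligned16 are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 16-byte aligned stand-in for D3DXCOLOR. Aligning the first
   member raises the alignment of the whole struct to 16. */
typedef struct {
    alignas(16) float r;
    float g, b, a;
} AlignedColor4;

/* Nonzero if p is on a 16-byte boundary, which is what movaps needs. */
static int is_aligned16(const void *p)
{
    assert(p != NULL);
    return ((uintptr_t)p & 15u) == 0;
}
```

When the data cannot be guaranteed to be aligned (for example a D3DXCOLOR handed in by a caller), movups is the safe choice instead of movaps.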



I changed the order in which DXColorToDW processes the structure fields; I was still following the old code, which went A, R, G, B rather than B, G, R, A. :)

I think you can eliminate the shl ebx, 24 / shr ebx, 24 pairs once the conditional move is in place, since the clamp already guarantees the value fits in 8 bits.
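
That claim is easy to check in C. A small sketch with hypothetical helper names, modelling the cmp/cmova pair as an unsigned clamp and the shift pair as "keep the low byte":

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* cmp ebx, 255 / cmova: unsigned clamp to 255. */
static uint32_t clamp255(uint32_t x) { return x > 255u ? 255u : x; }

/* shl ebx, 24 / shr ebx, 24: keep only the low byte. */
static uint32_t low_byte(uint32_t x) { return (x << 24) >> 24; }

/* Asserts that the shift pair never changes an already clamped value. */
static void check_shift_pair_is_redundant(void)
{
    /* Includes negative fistp results, which look like huge unsigned values. */
    static const uint32_t samples[] = {
        0u, 1u, 254u, 255u, 256u, 300u, 65535u, 0x80000000u, 0xFFFFFFFFu
    };
    size_t i;
    for (i = 0; i < sizeof samples / sizeof samples[0]; ++i) {
        uint32_t c = clamp255(samples[i]);
        assert(low_byte(c) == c);
    }
}
```

Because the compare is unsigned, a negative float (which fistp stores as a negative integer) also clamps to 255 rather than to 0.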

I don't know since I'm too lazy to test anything... :grin:
Posted on 2003-01-07 10:11:04 by arkane
Maybe something tricky that a compiler couldn't dream up:
_PUSHAD STRUCT

_EDI DWORD ?
_ESI DWORD ?
_EBP DWORD ?
_ESP DWORD ? ; not used when POPAD
_EBX DWORD ?
_EDX DWORD ?
_ECX DWORD ?
_EAX DWORD ?
_PUSHAD ENDS

DXColorToDW:
mov edx, [esp+4]
pushad

fld [edx].D3DXCOLOR.a
fld [edx].D3DXCOLOR.r
fld [edx].D3DXCOLOR.g
fld [edx].D3DXCOLOR.b

fistp [esp]._PUSHAD._ESP ; :)
fistp [esp]._PUSHAD._EAX
fistp [esp]._PUSHAD._EDX
fistp [esp]._PUSHAD._ECX

mov eax, [esp]._PUSHAD._ESP
mov ebx, [esp]._PUSHAD._EAX
mov ecx, [esp]._PUSHAD._EDX
mov edx, [esp]._PUSHAD._ECX

sub eax, 255
sbb esi, esi
sub ebx, 255
sbb edi, edi
sub ecx, 255
sbb ebp, ebp

and eax, esi
and ebx, edi
and ecx, ebp

sub edx, 255
sbb esi, esi
and edx, esi

add eax, 255
add ebx, 255
add ecx, 255
add edx, 255

shl eax, 24
shl ebx, 16
shl ecx, 8

or eax, edx
or ecx, ebx

or eax, ecx

mov [esp]._PUSHAD._EAX, eax
popad
ret 4
Conditional moves are not much better than conditional jump instructions - better to do without them altogether. This code is good for processors without SSE/K3D/SSE2.
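
The sub / sbb / and / add clamp used above may be easier to follow in C. This is only a sketch of the same trick with a hypothetical function name; a compiler is free to emit different instructions for it.

```c
#include <assert.h>
#include <stdint.h>

/* Branch-free min(x, 255) for unsigned x, mirroring sub / sbb / and / add. */
static uint32_t clamp255_branchless(uint32_t x)
{
    uint32_t diff = x - 255u;                   /* sub eax, 255 (wraps if x < 255) */
    uint32_t mask = 0u - (uint32_t)(x < 255u);  /* sbb esi, esi: all ones if x < 255 */
    uint32_t kept = diff & mask;                /* and eax, esi */
    return kept + 255u;                         /* add eax, 255 */
}

/* Compare against the obvious branching version over a range of inputs. */
static void check_clamp(void)
{
    uint32_t x;
    for (x = 0; x < 1024u; ++x) {
        assert(clamp255_branchless(x) == (x > 255u ? 255u : x));
    }
}
```

When x is below 255 the mask keeps the (wrapped) difference, and adding 255 back restores x. Otherwise the difference is zeroed and the result is 255.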
Posted on 2003-01-07 15:59:25 by bitRAKE
final :) - optimizations welcome :)
#include<stdio.h>


typedef struct
{
float r;
float g;
float b;
float a;
} D3DXCOLOR;

typedef struct
{
unsigned char r;
unsigned char g;
unsigned char b;
unsigned char a;
} BYTEVIEWER;

BYTEVIEWER finaloutput;

int main(void)
{
D3DXCOLOR d1 = {3.5f, 3.5f, 3.5f, 3.5f};
D3DXCOLOR d2 = {1.5f, 1.5f, 1.5f, 1.5f};
D3DXCOLOR f = {2.0f, 2.0f, 2.0f, 2.0f};
D3DXCOLOR dxcolor2dw = {1.5f, 300.59f, 259.0f, 240.0f};
D3DXCOLOR output;

__asm
{
movaps xmm0, d2 ; pC2
movaps xmm1, d1 ; pC1
movaps xmm2, f ; s replicated into all four lanes
subps xmm0, xmm1 ; pC2 - pC1
mulps xmm0, xmm2 ; s * (pC2 - pC1)
addps xmm0, xmm1 ; pC1 + s * (pC2 - pC1)
movaps output, xmm0
}

printf("%f\n", output.r);
printf("%f\n", output.g);
printf("%f\n", output.b);
printf("%f\n", output.a);

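__asm
{
movaps xmm0, dxcolor2dw
movaps xmm1, xmm0
cvtps2pi mm0, xmm0 ; r, g -> two rounded dwords in mm0
shufps xmm1, xmm0, 0Eh ; move b, a into the low half of xmm1
cvtps2pi mm1, xmm1 ; b, a -> two rounded dwords in mm1
packssdw mm0, mm1 ; four words r, g, b, a (signed saturation)
packuswb mm0, mm0 ; four bytes, each saturated to 0..255
movq mm1, mm0
psllq mm0, 16
psrlq mm0, 16
psllq mm1, 48
por mm0, mm1
movd eax, mm0
mov finaloutput, eax
emms
}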

printf("%d\n", finaloutput.r);
printf("%d\n", finaloutput.g);
printf("%d\n", finaloutput.b);
printf("%d\n", finaloutput.a);

return 0;
}
I could have exploited some SSE2 instructions, but since I only have a P3 and not a P4, I have to limit myself to SSE instructions.

The second block of code is for DXColorToDW... I use the BYTEVIEWER structure to look at the individual bytes and see what happens to results that exceed 255.

Just in case anyone forgets: rounding applies here, since we're converting from float to integer.
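
For anyone with SSE2 available, here is a rough sketch of how the same conversion could look with intrinsics, staying in xmm registers so that no MMX state or emms is involved. It is an untuned illustration: Color4, to_byte, pack_scalar and pack_color are hypothetical names, the default MXCSR rounding mode (round to nearest, ties to even) is assumed, and the inputs are assumed to be well inside the 32-bit integer range. The portable scalar version is there as a cross-check and as a fallback for non-SSE2 builds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define PACK_USE_SSE2 1
#endif

typedef struct { float r, g, b, a; } Color4;   /* stand-in for D3DXCOLOR */

/* Portable reference: round to nearest (ties to even), clamp to 0..255. */
static uint8_t to_byte(float v)
{
    long n = (long)v;                 /* truncates toward zero */
    float frac = v - (float)n;
    long step = (v < 0.0f) ? -1 : 1;
    if (frac < 0.0f) frac = -frac;
    if (frac > 0.5f || (frac == 0.5f && (n & 1) != 0)) n += step;
    if (n < 0) n = 0;
    if (n > 255) n = 255;
    return (uint8_t)n;
}

/* Red in the low byte, alpha in the high byte, like the MMX/SSE version. */
static uint32_t pack_scalar(const Color4 *c)
{
    assert(c != NULL);
    return (uint32_t)to_byte(c->r) | ((uint32_t)to_byte(c->g) << 8) |
           ((uint32_t)to_byte(c->b) << 16) | ((uint32_t)to_byte(c->a) << 24);
}

/* Same result using SSE2 when the build supports it. */
static uint32_t pack_color(const Color4 *c)
{
    assert(c != NULL);
#ifdef PACK_USE_SSE2
    __m128  f   = _mm_loadu_ps(&c->r);          /* movups: no alignment assumed  */
    __m128i i32 = _mm_cvtps_epi32(f);           /* cvtps2dq: rounds per MXCSR    */
    __m128i i16 = _mm_packs_epi32(i32, i32);    /* packssdw: signed saturation   */
    __m128i u8  = _mm_packus_epi16(i16, i16);   /* packuswb: unsigned saturation */
    return (uint32_t)_mm_cvtsi128_si32(u8);     /* movd                          */
#else
    return pack_scalar(c);
#endif
}
```

For the test color {1.5, 300.59, 259.0, 240.0} both paths give 0xF0FFFF02: 1.5 rounds to 2, the two values above 255 saturate to 255, and 240 passes through unchanged.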

:)
Posted on 2003-01-07 16:17:48 by arkane
much better... dunno why I was thinking of shifts and ors. :grin:
movaps xmm0, dxcolor2dw

movaps xmm1, xmm0
cvtps2pi mm0, xmm0
shufps xmm1, xmm0, 0Eh
cvtps2pi mm1, xmm1
packssdw mm0, mm1
packuswb mm0, mm0
movd eax, mm0

;Not Needed

mov finaloutput, eax
Posted on 2003-01-07 22:43:52 by arkane
awww shucks :)

I don't know what to say except THANKS :alright:
I'll try the SSE version of the functions as soon as I get my example working lol.
The ones I coded are tested and working, but the example which uses them needs more work :(
I will benchmark on a few machines and submit my findings.
Has anyone written per-CPU code paths before, rather than one naive FPU version?
Did you simply detect the chip at run time and call the appropriate function?
Or did you build separately for the various chips?
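
One common pattern for the first option (not necessarily what anyone in this thread used) is to test the CPU once at startup and keep a function pointer to the chosen routine. The sketch below is only an outline: all names are hypothetical, the "SSE" routine is a placeholder that forwards to the plain one, and the feature test assumes GCC or Clang on x86, where __builtin_cpu_supports is available. In MASM the equivalent test would use cpuid (function 1, EDX bit 25 for SSE, bit 26 for SSE2).

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float r, g, b, a; } Color4;   /* stand-in for D3DXCOLOR */
typedef void (*lerp_fn)(const Color4 *, const Color4 *, float, Color4 *);

/* Baseline path: plain C, runs on anything. */
static void lerp_plain(const Color4 *c1, const Color4 *c2, float s, Color4 *out)
{
    assert(c1 != NULL && c2 != NULL && out != NULL);
    out->r = c1->r + s * (c2->r - c1->r);
    out->g = c1->g + s * (c2->g - c1->g);
    out->b = c1->b + s * (c2->b - c1->b);
    out->a = c1->a + s * (c2->a - c1->a);
}

/* Placeholder for an SSE build of the same routine. It only forwards to
   the plain version so that the sketch stays portable. */
static void lerp_sse_placeholder(const Color4 *c1, const Color4 *c2, float s, Color4 *out)
{
    lerp_plain(c1, c2, s, out);
}

/* Run-time feature test. Uses the GCC/Clang x86 builtin where available
   and reports "no SSE" everywhere else. */
static int cpu_has_sse(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__i386__) || defined(__x86_64__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse") != 0;
#else
    return 0;
#endif
}

/* Pick the implementation once; callers keep the returned pointer. */
static lerp_fn select_lerp(void)
{
    return cpu_has_sse() ? lerp_sse_placeholder : lerp_plain;
}
```

The selection cost is paid once, and each later call goes through one indirect call. Building separate binaries per chip avoids even that, at the price of shipping and testing several builds.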
Posted on 2003-01-08 05:49:04 by Homer