Ah, I know why mine is faster: it has no emms. If you try to add it, the tick count jumps to 260+.

I notice my code is nearly identical to the other MMX code. :grin:

Forget the code above. :grin:

But I have other solutions that complicate the code further without making it any faster (good for mind-twisting, for those who will try to debug the app). :grin:
Posted on 2003-04-04 03:24:13 by arkane
	pxor	mm7, mm7

pxor mm6, mm6

movq mm0, [_table + 0x00]
movq mm1, [_table + 0x08]

movq mm2, [_table + 0x10]
movq mm3, [_table + 0x18]

pcmpeqb mm0, mm7
pcmpeqb mm1, mm7

pcmpeqb mm2, mm7
pcmpeqb mm3, mm7

paddb mm0, mm1
paddb mm2, mm3

psubb mm6, mm0
psubb mm6, mm2

psadbw mm6, mm7

movd eax, mm6
movzx eax, ax

emms
ret
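
For anyone following along without an MMX debugger handy, here is a rough scalar model of the routine above in C. It is only a sketch: the name count_zero_bytes_model is made up for illustration, and the 32-byte table size is an assumption read off the four 8-byte loads. The routine adds the compare masks first and subtracts the sums afterwards; the model subtracts each mask directly, which gives the same lane counters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of the MMX zero-byte count. Assumes a 32-byte table
   (four 8-byte rows); the function name is hypothetical. */
unsigned count_zero_bytes_model(const uint8_t *table)
{
    assert(table != NULL);
    uint8_t counter[8] = {0};                 /* plays the role of mm6 */

    for (size_t row = 0; row < 4; ++row) {
        for (size_t lane = 0; lane < 8; ++lane) {
            /* pcmpeqb: 0xFF where the byte equals zero, else 0x00 */
            uint8_t mask = (table[row * 8 + lane] == 0) ? 0xFF : 0x00;
            /* psubb: subtracting 0xFF (that is, -1) bumps the lane by one */
            counter[lane] = (uint8_t)(counter[lane] - mask);
        }
    }

    /* psadbw against zero: horizontal sum of the eight unsigned lanes */
    unsigned total = 0;
    for (size_t lane = 0; lane < 8; ++lane)
        total += counter[lane];
    return total;
}
```

Each lane can reach at most 4, so the byte counters cannot overflow for a table this small.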
Posted on 2003-04-04 06:28:21 by bitRAKE
SSE2. somewhat faster. will post more complete stuff later.



_cz11:
pxor xmm7, xmm7 ; up counter
pxor xmm6, xmm6 ; up counter

movups xmm0, [_table + 0x00]
movups xmm1, [_table + 0x10]

pcmpeqb xmm0, xmm7
pcmpeqb xmm1, xmm7

psubb xmm6, xmm0 ; count up
psubb xmm7, xmm1 ; count up

paddb xmm6, xmm7
pxor xmm7, xmm7

psadbw xmm6, xmm7

movups xmm7, xmm6

psrldq xmm7, 8

paddw xmm6, xmm7

movd eax, xmm6


;mov [counter], al
movzx eax, al
ret
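
The same idea can be written with SSE2 compiler intrinsics. The sketch below is not the routine above translated line by line: it keeps a single counter register instead of two, the name count_zero_bytes_sse2 is hypothetical, and the 32-byte table is assumed. It falls back to a plain loop when the compiler does not define __SSE2__.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Counts zero bytes in an assumed 32-byte table (hypothetical helper). */
unsigned count_zero_bytes_sse2(const uint8_t *table)
{
    assert(table != NULL);
#if defined(__SSE2__)
    const __m128i zero = _mm_setzero_si128();
    /* unaligned loads, like movups */
    __m128i lo = _mm_loadu_si128((const __m128i *)(const void *)table);
    __m128i hi = _mm_loadu_si128((const __m128i *)(const void *)(table + 16));

    lo = _mm_cmpeq_epi8(lo, zero);             /* pcmpeqb: 0xFF where zero */
    hi = _mm_cmpeq_epi8(hi, zero);

    __m128i counts = _mm_sub_epi8(zero, lo);   /* psubb: count up */
    counts = _mm_sub_epi8(counts, hi);

    __m128i sums = _mm_sad_epu8(counts, zero); /* psadbw: one sum per 64-bit half */
    sums = _mm_add_epi16(sums, _mm_srli_si128(sums, 8)); /* psrldq 8 + paddw */

    return (unsigned)_mm_cvtsi128_si32(sums) & 0xFFu;    /* movd + movzx al */
#else
    /* Portable fallback when SSE2 is not available at compile time. */
    unsigned total = 0;
    for (size_t i = 0; i < 32; ++i)
        total += (table[i] == 0);
    return total;
#endif
}
```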


also messing with ICL7 intrinsics, but that'll have to wait - girlfriend here :)
Posted on 2003-04-04 09:34:15 by f0dder
Once again, I've been at work and come back to find the thread has moved on!

Well this is what I came up with at work...



pxor XMM7, XMM7

movups XMM0, [Table + 00h]
movups XMM1, [Table + 10h]

pcmpeqb XMM0, XMM7
pcmpeqb XMM1, XMM7

paddb XMM0, XMM1

psadbw XMM0, XMM7

movd edx, XMM0
lea eax, [edx + 32]
and eax, 0FFh


Not had a chance to time it...

Mirno
Posted on 2003-04-04 11:56:33 by Mirno
doesn't give correct results here, mirno
Posted on 2003-04-04 12:10:42 by f0dder
It goes to show, you should test before you post :o

Anyway, here's an MMX version which sparked off the SSE attempt.



pxor MM7, MM7

movq MM0, QWORD PTR [Table + 0]
movq MM1, QWORD PTR [Table + 8]
movq MM2, QWORD PTR [Table + 16]
movq MM3, QWORD PTR [Table + 24]

pcmpeqb MM0, MM7
pcmpeqb MM1, MM7
pcmpeqb MM2, MM7
pcmpeqb MM3, MM7

paddb MM0, MM1
paddb MM0, MM2
paddb MM0, MM3

psadbw MM0, MM7

movd edx, MM0
lea eax, [edx + 32]
and eax, 0FFh
Posted on 2003-04-04 14:13:02 by Mirno



lea eax, Table
mov edx, 36
xor ecx, ecx

@@:
cmp DWORD PTR [eax + edx*4 - 4], 1
adc ecx, 0
dec edx
jnz @B



Mirno

You can save a byte the easy way without harming
speed or changing the logic of your algo:
1. change mov edx,36 to mov edx,35
2. change cmp DWORD PTR [eax + edx*4 - 4], 1
to cmp DWORD PTR [eax + edx*4], 1
3. change
dec edx
jnz @B
to
dec edx
jns @B
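
A quick way to convince yourself that the two loop forms visit the same 36 entries is to model them in C. This is only a sketch with made-up function names; the comparison `x < 1u` stands in for `cmp x, 1`, which sets the carry flag exactly when x is zero, and the `+=` stands in for `adc ecx, 0`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Original form: index runs n..1 and the address carries a -4 displacement. */
unsigned count_zeros_jnz(const uint32_t *table, size_t n)
{
    assert(table != NULL);
    unsigned count = 0;
    for (size_t i = n; i != 0; --i)          /* dec edx / jnz */
        count += (table[i - 1] < 1u);        /* [eax + edx*4 - 4] */
    return count;
}

/* Suggested form: index runs n-1..0, no displacement, loop while not negative. */
unsigned count_zeros_jns(const uint32_t *table, size_t n)
{
    assert(table != NULL);
    unsigned count = 0;
    for (ptrdiff_t i = (ptrdiff_t)n - 1; i >= 0; --i)   /* dec edx / jns */
        count += (table[i] < 1u);                       /* [eax + edx*4] */
    return count;
}
```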
Posted on 2003-04-04 14:25:54 by The Svin
Perhaps it is better to use a negative index to get the data read in the 'normal' order.



lea eax, Table+36*4
mov edx, -36
xor ecx, ecx
@@:
cmp DWORD PTR [eax + edx*4], 1
adc ecx, 0
inc edx
jnz @B


Faster for data caching ?
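
The negative-index idiom carries over to C as well. A minimal sketch, with a hypothetical function name: the base pointer sits one past the end of the table, and the index climbs from -n to 0, so memory is read in ascending address order and the loop still ends on a zero test. Whether that helps caching is not something this sketch measures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Counts zero dwords while reading the table in ascending address order. */
unsigned count_zeros_negative_index(const uint32_t *table, size_t n)
{
    assert(table != NULL);
    const uint32_t *end = table + n;                 /* lea eax, Table+36*4 */
    unsigned count = 0;
    for (ptrdiff_t i = -(ptrdiff_t)n; i != 0; ++i)   /* mov edx,-36 ... inc edx / jnz */
        count += (end[i] == 0);                      /* cmp [eax + edx*4], 1 / adc */
    return count;
}
```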
Posted on 2003-04-04 15:12:41 by MCoder
hrm, still doesn't give correct results, mirno - or is it just me who's stupid? :/
Posted on 2003-04-04 15:14:23 by f0dder

Perhaps it is better to use a negative index to get the data read in the 'normal' order.



lea eax, Table+36*4
mov edx, -36
xor ecx, ecx
@@:
cmp DWORD PTR [eax + edx*4], 1
adc ecx, 0
inc edx
jnz @B


Faster for data caching ?

I don't see why it's faster for data caching.
Posted on 2003-04-04 15:55:03 by The Svin
I actually tested the code above, f0dder, and it worked for me.

I also found out why the SSE(2) version didn't work: it actually needs SSE2, and although MASM assembled it without any warning with the XMM registers for pxor etc., it actually emitted the MMX-register versions! I think that may be a bug in MASM.

The actual code I believe should work, I checked it all in MASM last night, but I'll explain my logic here:

pcmpeqb MM0, MM7 ...
This turns every zero byte into -1, and all non-zero bytes into zero.

paddb MM0, MM1 ...
Now MM0 has 8 bytes, each ranging from 0 to -4

psadbw MM0, MM7
Sum the 8 bytes in MM0. This will leave a 16-bit word at the bottom of MM0 which is 0 if there are no zero bytes, and otherwise somewhere between 0FCh and 7F8h (7E0h when all 32 bytes are zero). The bottom byte of this value (when treated as signed) is the negative of the number of zero bytes there are in the table.
All that is left to do is add 32 to the value in MM0's bottom word, and truncate it to a byte.
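
The steps above can be modelled in scalar C to see what the final byte works out to. This is a sketch only: mirno_model is a made-up name and the 32-byte table is assumed. Working it through, the value left after adding 32 and masking is 32 minus the number of zero bytes, in other words the count of non-zero bytes. If the goal is the number of zeros, that might be related to the mismatch reported in the thread, though nothing here confirms it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of: pcmpeqb, paddb, psadbw, add 32, and 0FFh.
   Assumes a 32-byte table; the function name is hypothetical. */
unsigned mirno_model(const uint8_t *table)
{
    assert(table != NULL);
    uint8_t lanes[8] = {0};

    for (size_t row = 0; row < 4; ++row) {
        for (size_t lane = 0; lane < 8; ++lane) {
            /* pcmpeqb: zero byte -> 0xFF (-1), anything else -> 0 */
            uint8_t mask = (table[row * 8 + lane] == 0) ? 0xFF : 0x00;
            /* paddb: each lane ends up at 0, -1, -2, -3 or -4 (mod 256) */
            lanes[lane] = (uint8_t)(lanes[lane] + mask);
        }
    }

    /* psadbw against zero sums the lanes as UNSIGNED bytes */
    unsigned word = 0;
    for (size_t lane = 0; lane < 8; ++lane)
        word += lanes[lane];

    /* lea eax, [edx + 32] / and eax, 0FFh */
    return (word + 32u) & 0xFFu;
}
```

With a table holding a single zero byte, the unsigned sum is 0FFh, and (0FFh + 32) masked to a byte is 31.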

In the meantime, I'll check again.

Mirno

-------------------------------- edit --------------------------------
Here's the MASM code I used, and it worked (It didn't have the SSE2 conditional code in though).


.686
.MMX
.XMM
.model flat,stdcall
option casemap:none

SSE2 EQU 0

.nolist
.nocref
include \masm32\include\windows.inc
include \masm32\include\kernel32.inc
include \masm32\include\user32.inc
includelib \masm32\lib\kernel32.lib
includelib \masm32\lib\user32.lib
.cref
.list

.data
Table db 001h, 001h, 001h, 001h, 001h, 001h, 001h, 001h
db 001h, 001h, 001h, 001h, 000h, 001h, 001h, 001h
db 001h, 001h, 001h, 001h, 001h, 001h, 001h, 001h
db 001h, 001h, 001h, 001h, 001h, 001h, 001h, 001h

Format db "%d", 0
Buffer db 32 DUP(0)

.code
start:
IF SSE2
db 66h
ENDIF
pxor MM7, MM7

IF SSE2
movups XMM0, [Table + 00h]
movups XMM1, [Table + 10h]
ELSE
movq MM0, QWORD PTR [Table + 0]
movq MM1, QWORD PTR [Table + 8]
movq MM2, QWORD PTR [Table + 16]
movq MM3, QWORD PTR [Table + 24]
ENDIF

IF SSE2
db 66h
ENDIF
pcmpeqb MM0, MM7

IF SSE2
db 66h
ENDIF
pcmpeqb MM1, MM7

IF NOT SSE2
pcmpeqb MM2, MM7
pcmpeqb MM3, MM7
ENDIF

IF SSE2
db 66h
ENDIF
paddb MM0, MM1

IF NOT SSE2
paddb MM0, MM2
paddb MM0, MM3
ENDIF

IF SSE2
db 66h
ENDIF
psadbw MM0, MM7

IF SSE2
db 66h
ENDIF
movd eax, MM0
add eax, 32
and eax, 0FFh

invoke wsprintf, ADDR Buffer, ADDR Format, eax
invoke MessageBox, NULL, ADDR Buffer, ADDR Buffer, MB_OK

invoke ExitProcess, 0

end start


If someone with a P4 could check whether the SSE2 thing worked I'd be grateful, but it's not really too important. Just assemble with SSE2 EQU 1 for the P4 test, or SSE2 EQU 0 for the vanilla MMX/SSE1 code (psadbw is SSE1 I think).

Mirno
Posted on 2003-04-05 03:05:14 by Mirno
mirno, I tested both your SSE and MMX (not the newest post though) in my framework (assembled with nasm), and it didn't give correct results. I test 500k iterations where the table is filled with random values (within the -16 to +8 range), and your routines failed - and yes, I'm on a P4.
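
A checking loop along those lines might look like the C sketch below. Everything in it is an assumption made for illustration, not the actual framework: the names, the signed 32-byte table, the function-pointer signature and the small LCG used so runs are repeatable. It fills the table with values from -16 to +8, compares a candidate against a plain reference loop, and returns the number of mismatching iterations.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef unsigned (*count_fn)(const int8_t *table);   /* hypothetical signature */

/* Plain reference: number of zero bytes in an assumed 32-byte table. */
unsigned reference_count(const int8_t *table)
{
    unsigned total = 0;
    for (size_t i = 0; i < 32; ++i)
        total += (table[i] == 0);
    return total;
}

/* Example of a wrong candidate: returns the NON-zero count instead. */
unsigned nonzero_count(const int8_t *table)
{
    return 32u - reference_count(table);
}

/* Small linear congruential generator, used only for repeatable test data. */
static uint32_t next_random(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state >> 16;
}

/* Returns how many iterations disagreed with the reference. */
unsigned long check_candidate(count_fn candidate, unsigned long iterations,
                              uint32_t seed)
{
    assert(candidate != NULL);
    unsigned long failures = 0;
    int8_t table[32];

    for (unsigned long it = 0; it < iterations; ++it) {
        for (size_t i = 0; i < 32; ++i)   /* 25 possible values: -16 .. +8 */
            table[i] = (int8_t)((int)(next_random(&seed) % 25u) - 16);
        if (candidate(table) != reference_count(table))
            ++failures;
    }
    return failures;
}
```

A call such as check_candidate(nonzero_count, 500000, 1) should report failures, while passing reference_count itself should report none.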
Posted on 2003-04-05 07:33:43 by f0dder
Ok, here's the stuff. Compiled with ICL7. Test executable is built with
SSE2 stuff in it, so you'll need a P4 to run this (or run testp.exe,
which only has "plain" (MMX) code). Test machine is a 2.53GHz P4.
The MMX intrinsic code ICL7 generates blows. While this compiler is the
best I've seen yet, it stinks with regard to its own MMX intrinsics,
so you might as well handcode it. It couldn't figure out how to unroll
a loop properly, it uses memory operands for psrl*, etc.

Scali's SSE2 stuff seems to work pretty nicely :)

cz1 - simple C code (2)
cz2 - TryToBeClever C code (2)
cz3 - mirno SimpleLoop (2)
cz4 - MMX 1 (with *cough* help from scali) (2)
cz5 - MMX 2 (butchered from scali) (2)
cz6 - scali pplain (2)
cz7 - scali pplain (SLOW ANSI C) (2)
cz8 - scali pplain (SLOW ANSI C, GCC) (2)
cz9 - bitRAKE MMX (2)
cz10- more scali code (2)
cz11- scali SSE2 (2)
cz12- icl intrinsics 1 (naive) (2)


      run 1           run 2           run 3           run 4
cz1   (000438 ticks)  (000438 ticks)  (000437 ticks)  (000437 ticks)
cz2   (000500 ticks)  (000484 ticks)  (000485 ticks)  (000500 ticks)
cz3   (000609 ticks)  (000656 ticks)  (000672 ticks)  (000641 ticks)
cz4   (000156 ticks)  (000157 ticks)  (000156 ticks)  (000156 ticks)
cz5   (000156 ticks)  (000156 ticks)  (000156 ticks)  (000156 ticks)
cz6   (000235 ticks)  (000219 ticks)  (000234 ticks)  (000235 ticks)
cz7   (000172 ticks)  (000171 ticks)  (000172 ticks)  (000187 ticks)
cz8   (000328 ticks)  (000329 ticks)  (000328 ticks)  (000313 ticks)
cz9   (000125 ticks)  (000140 ticks)  (000125 ticks)  (000140 ticks)
cz10  (000140 ticks)  (000125 ticks)  (000125 ticks)  (000141 ticks)
cz11  (000079 ticks)  (000094 ticks)  (000094 ticks)  (000078 ticks)
cz12  (000187 ticks)  (000187 ticks)  (000187 ticks)  (000188 ticks)
Posted on 2003-04-05 08:23:14 by f0dder
f0dder, I cannot run either of the test programs on my Athlon TB.

Scali, welcome back. :) There is no need to zero XMM7 twice?
Posted on 2003-04-05 13:54:30 by bitRAKE
Scali, forgive my ignorance - don't have a P4 to play with, but would this be faster?


pxor xmm6, xmm6 ; up counter
movaps xmm0, [_table + 0x00]

movaps xmm1, [_table + 0x10]
pcmpeqb xmm0, xmm6

pcmpeqb xmm1, xmm6
psubb xmm6, xmm0 ; count up

pxor xmm7, xmm7 ; zero operand for psadbw
psubb xmm6, xmm1 ; count up

psadbw xmm6, xmm7
movaps xmm7, xmm6
psrldq xmm6, 8
paddw xmm6, xmm7

movd eax, xmm6

movzx eax, al
Also, with the psadbw instruction: where are sums put in the octword? Intel docs say they are at [0:15] and [64:79], but that doesn't fit with your algo?
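
On the psadbw layout: the 128-bit form produces two independent sums, one per 64-bit half, in bits [0:15] and [64:79], with the remaining bits cleared. That is why the routines here shift the upper half down by 8 bytes and add it to the lower one. The C sketch below shows the two sums directly; sad_halves is a hypothetical name, and a scalar fallback is used when __SSE2__ is not defined.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* out[0] = sum of bytes 0..7  (word 0, bits 0:15)
   out[1] = sum of bytes 8..15 (word 4, bits 64:79) */
void sad_halves(const uint8_t *bytes16, unsigned out[2])
{
    assert(bytes16 != NULL && out != NULL);
#if defined(__SSE2__)
    __m128i v = _mm_loadu_si128((const __m128i *)(const void *)bytes16);
    __m128i s = _mm_sad_epu8(v, _mm_setzero_si128());   /* psadbw vs zero */
    out[0] = (unsigned)_mm_extract_epi16(s, 0);          /* like movd + mask */
    out[1] = (unsigned)_mm_extract_epi16(s, 4);          /* like pextrw ..., 4 */
#else
    /* Portable model of the same layout. */
    out[0] = 0;
    out[1] = 0;
    for (size_t i = 0; i < 8; ++i) {
        out[0] += bytes16[i];
        out[1] += bytes16[8 + i];
    }
#endif
}
```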
Posted on 2003-04-05 14:16:14 by bitRAKE
Scali, thanks for the clarification.
Posted on 2003-04-05 14:37:17 by bitRAKE
New version attached. This time testp.exe is (should be? :stupid: ) compiled for the ppro+mmx architecture, and the assembly SSE2 version is excluded; if it doesn't run on a non-P4, I've fucked up again (give me the exception address).

bitRAKE, your SSE2 code seems to clock at the same speed as Scali's (~78 ticks).
Posted on 2003-04-05 15:24:01 by f0dder
K7 TB 1.333Ghz
(000701 ticks) - cz1 - simple C code (2)

(000711 ticks) - cz2 - TryToBeClever C code (2)
(000851 ticks) - cz3 - mirno SimpleLoop (2)
(000160 ticks) - cz4 - MMX 1 (with *cough* help from scal
(000170 ticks) - cz5 - MMX 2 (butchered from scali) (2)
(000461 ticks) - cz6 - scali pplain (2)
(000320 ticks) - cz7 - scali pplain (SLOW ANSI C) (2)
(000471 ticks) - cz8 - scali pplain (SLOW ANSI C, GCC) (2)
(000100 ticks) - cz9 - bitRAKE MMX (2) :)
(000110 ticks) - cz10- more scali code (2)
(000201 ticks) - cz12- icl intrinsics 1 (naive) (2)
(000190 ticks) - cz14- icl intrinsics 2 (still naive) (2)
How about this one:
	pxor	xmm6, xmm6		; up counter

movaps xmm0, [_table + 0x00]

movaps xmm1, [_table + 0x10]
pcmpeqb xmm0, xmm6

pcmpeqb xmm1, xmm6
psubb xmm6, xmm0 ; count up

pxor xmm7, xmm7 ; zero operand for psadbw
psubb xmm6, xmm1 ; count up

psadbw xmm6, xmm7

movd eax, xmm6
pextrw edx, xmm6, 4

add eax, edx
Posted on 2003-04-05 17:14:09 by bitRAKE