Hey all, I was wondering about something
I'm making some vector/matrix classes to use with OpenGL. DirectX has D3DX with everything you could think of, and I wanted something similar for OpenGL, so I thought it would be a good opportunity to program some inline ASM. I asked Intel for their manuals and they kindly sent me 5 of them, including the instruction reference (A-M & N-Z). I was looking around (just to see if I could find some useful instructions) when I saw PSHUFD - Shuffle Packed Doublewords. I felt like an anime character with a sweat drop on my face, because until now I was using SHUFPS to (sorry if it's the wrong word) broadcast. What's the difference? =| I mean, the instructions are different of course: one takes the values of one register and puts them into another, while the other uses the values of two registers. But for broadcasting, PSHUFD is MUCH simpler O_O Is there anything wrong with my line of thinking? I only ever see SHUFPS in code examples (for broadcasting/reordering a register =|). Is SHUFPS faster or something?

Sorry, I think it's quite a noob question but I'm kinda curious =|

EDIT: What I mean by broadcast is changing the order of the register's dwords; here's an example: http://www.songho.ca/misc/sse/sse.html
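
For reference, the comparison I have in mind is roughly this (just a sketch, the registers are picked arbitrarily):

shufps xmm0, xmm0, 0 ; SSE: to broadcast in place, source and destination have to be the same register
pshufd xmm1, xmm0, 0 ; SSE2: reads xmm0 and writes xmm1 in a single instruction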
Posted on 2008-12-27 15:46:01 by Melanikus
It's an SSE2 instruction, so there are bound to be fewer examples of it than of shufps.
Latencies of SSEx instructions change with each new generation, so you'll have to benchmark the different implementations. (Posting a command-line benchmark here for us to run on different CPUs/systems would be very welcome.)
Posted on 2008-12-28 13:49:34 by Ultrano
How would one do a benchmark we can trust? I used to get a clock() before and after, subtract, and display in milliseconds, so it would be something like: a hundred million calls take 700 milliseconds... But then I found an instruction that measures the number of clock cycles since the program started (sorry, I don't remember the instruction right now). There's a "catch" with it, though: you have to store the result in a local variable (in memory), and (I'm speculating here) a cache miss adds many extra clocks to the total for the call. And if you try to measure the clocks of the cache miss itself, sometimes there is no cache miss =p so you end up getting 0 clocks. What I do is something like this:

time1 = getClocksHere
for (i = 0; i < 100000000; i++)
    callInlineAsmInlineFunc()
time2 = getClocksHere

totalTime = (time2 - time1) / 100000000

A very large iteration count makes the cache-miss time less significant, but it's still there. Maybe there's a way to force the time1 & time2 variables to be in the cache? (prefetcht0?)
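
Maybe something like this right before taking the first reading? (Just a guess, I have no idea if it actually helps):

prefetcht0 byte ptr time1 ; try to pull the locals' cache line in before measuring
prefetcht0 byte ptr time2 ; they're next to each other on the stack, so probably the same line anyway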

Well... is there any way to KNOW FOR SURE how many clocks an instruction took to run, or is it all speculation/best guess/trusting the Intel/AMD specs?
Posted on 2008-12-28 18:54:50 by Melanikus
Here's my reliable way:

;=====[[ Benchmarking macros START_TEST/END_TEST >>===\
TEST_ITERS equ 10000000
TEST_SUBCYCLE = 1 ; 0 or 1: whether to show the result as a floating-point number (cycles per iteration)

TEST_ID = 0 ; internal counter, used to generate a unique loop label per test

START_TEST macro where:REQ
@START_TEST_VAR textequ <where>
mov where,-1

; raise priority and pin the thread to one core, so the TSC readings stay consistent
invoke GetCurrentProcess ; SetPriorityClass wants a process handle
invoke SetPriorityClass,eax,HIGH_PRIORITY_CLASS
invoke GetCurrentThread
invoke SetThreadAffinityMask,eax,1

invoke SwitchToThread ; give up the rest of the timeslice, to reduce the chance of being preempted mid-measurement

TEST_ID = TEST_ID + 1
rdtsc ; starting timestamp
push eax ; start low dword, ends up at [esp+4]
push edx ; start high dword, ends up at [esp]
mov ecx,TEST_ITERS
align 16
@CatStr(<testlabl>,%TEST_ID,<:>)
endm

END_TEST macro
dec ecx
jnz @CatStr(<testlabl>,%TEST_ID)
nop
nop
nop
nop

rdtsc ; ending timestamp
sub eax,[esp+4] ; subtract the starting timestamp:
sbb edx,[esp] ; edx:eax = 64-bit cycle delta
if TEST_SUBCYCLE EQ 0
add esp,8
mov ecx,TEST_ITERS
div ecx ; eax = cycles per iteration (integer)
.if eax<@START_TEST_VAR
mov @START_TEST_VAR,eax ; keep the lowest reading
.endif

invoke GetCurrentProcess
invoke SetPriorityClass,eax,NORMAL_PRIORITY_CLASS

%@CatStr(<print >,@START_TEST_VAR)
else
mov [esp+4],edx ; store the delta back on the stack as a qword
mov [esp],eax
fild qword ptr [esp]
mov dword ptr [esp],TEST_ITERS
fidiv dword ptr [esp] ; cycles per iteration, keeping the fractional part
fstp @START_TEST_VAR
add esp,8

invoke GetCurrentProcess
invoke SetPriorityClass,eax,NORMAL_PRIORITY_CLASS

%@CatStr(<PrintFloat >,@START_TEST_VAR)
endif
endm

;=======/

Example usage:


main proc
local time1,time2

; five dependent INCs
START_TEST time1
inc ebx
inc ebx
inc ebx
inc ebx
inc ebx
END_TEST

; the same thing with ADDs, for comparison
START_TEST time2
add ebx,1
add ebx,1
add ebx,1
add ebx,1
add ebx,1
END_TEST

ret
main endp

Note that I avoid benchmarking a single instruction, as its latency can be hidden by the looping (it gets executed while the jump is happening).
It's best to benchmark a whole proc call that has e.g. 30+ instructions in it.
Then also try with and without warming up the caches.
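
For example, roughly like this (myVectorProc is just a stand-in for whatever routine you're measuring):

invoke myVectorProc ; warm-up call, not measured: gets the code and data into the caches
START_TEST time1
invoke myVectorProc ; the measured call (the macros loop it TEST_ITERS times)
END_TEST

For the cold-cache case, simply leave the warm-up call out.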

The "print" macro is actually the VKDebug PrintDec, renamed
Posted on 2008-12-28 22:21:38 by Ultrano