Hey all, I was wondering about something.
I'm writing some vector/matrix classes to use with OpenGL. DirectX has D3DX with everything you can think of, and I wanted something similar for OpenGL, so I thought it would be a good opportunity to program some inline ASM. I asked Intel for their manuals and they kindly sent me 5 of them, including the instruction reference (A-M & N-Z).

I was looking around (just to see if I could find some useful instructions) when I saw PSHUFD - Shuffle Packed Doublewords. I felt like an anime character with a sweat drop on my face, because until now I was using SHUFPS to (sorry if it's the wrong word) broadcast. What's the difference? =|

I mean, the instructions are different of course: one takes the values of 1 register and puts them into another, and the other uses the values of 2 registers. But in the case of broadcasting, PSHUFD is MUCH simpler O_O Is there anything wrong with my line of thinking? I only see SHUFPS in the code examples around (for broadcasting/reordering a register =|). Is SHUFPS faster or something?

Sorry, I think it's quite a noob question but I'm kinda curious =|

EDIT: What I mean by broadcast is changing the order of the register's dwords. Here's an example: http://www.songho.ca/misc/sse/sse.html
Posted on 2008-12-27 15:46:01 by Melanikus
PSHUFD is an SSE2 instruction, so there are bound to be fewer examples of it than of SHUFPS.
Latencies of SSEx instructions change with each new CPU generation, so you'll have to benchmark the different implementations. (If you post a command-line benchmark here, we'll gladly run it on different CPUs/systems.)
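To make the difference between the two instructions concrete, here is a minimal scalar model in C of how each one selects its dwords from the 8-bit immediate. It is only a sketch of the selection rules as documented in the instruction reference, not SSE code; the function names (model_pshufd, model_shufps, shuffles_agree_for_all_imm8) are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of PSHUFD dst, src, imm8:
   every output dword is picked from src by one 2-bit field of imm8. */
static void model_pshufd(uint32_t dst[4], const uint32_t src[4], uint8_t imm8)
{
    uint32_t tmp[4];
    assert(dst && src);
    for (int i = 0; i < 4; ++i)
        tmp[i] = src[(imm8 >> (2 * i)) & 3];
    for (int i = 0; i < 4; ++i)
        dst[i] = tmp[i];
}

/* Scalar model of SHUFPS dst, src, imm8:
   the two low output dwords are picked from dst (the old destination),
   the two high output dwords are picked from src. */
static void model_shufps(uint32_t dst[4], const uint32_t src[4], uint8_t imm8)
{
    uint32_t tmp[4];
    assert(dst && src);
    tmp[0] = dst[imm8 & 3];
    tmp[1] = dst[(imm8 >> 2) & 3];
    tmp[2] = src[(imm8 >> 4) & 3];
    tmp[3] = src[(imm8 >> 6) & 3];
    for (int i = 0; i < 4; ++i)
        dst[i] = tmp[i];
}

/* Returns 1 if SHUFPS with dst == src gives the same result as PSHUFD
   for every possible immediate, 0 otherwise. */
static int shuffles_agree_for_all_imm8(const uint32_t v[4])
{
    for (int imm = 0; imm < 256; ++imm) {
        uint32_t a[4] = { v[0], v[1], v[2], v[3] };  /* SHUFPS reg, same reg */
        uint32_t b[4];
        model_shufps(a, a, (uint8_t)imm);
        model_pshufd(b, v, (uint8_t)imm);
        for (int i = 0; i < 4; ++i)
            if (a[i] != b[i])
                return 0;
    }
    return 1;
}
```

In this model, SHUFPS with the same register as both operands and PSHUFD produce identical results, so for broadcasting or reordering one register the practical differences are that PSHUFD needs SSE2 and can write to a different register without a separate copy. Which one is faster still has to be measured.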
Posted on 2008-12-28 13:49:34 by Ultrano
How would one do a benchmark we can trust? I used to call clock() before and after, subtract, and display the result in milliseconds, so it would be something like: a hundred million calls take 700 milliseconds. But then I found an instruction that returns the number of clock cycles since the program started (sorry, I don't remember the instruction right now). There's a "catch" with it, though: you have to store the result in a local variable (in memory), and in case of a cache miss (this is speculation on my part) the store takes many more clocks, which get added to the total for the call. If you try to measure the cost of the cache miss itself, sometimes there's no cache miss =p so you end up getting 0 clocks. What I do is something like:

time1 = getClocksHere
for (i = 0; i < 100000000; i++)
    call the code being timed
time2 = getClocksHere

totalTime = (time2 - time1) / 100000000

A very large iteration count makes the cache miss time less significant, but it's still there. Maybe there's a way to force the time1 & time2 variables to be in cache? (PREFETCHT0?)
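For reference, the averaging idea above could be sketched in C roughly like this. It uses the standard clock() timer rather than a cycle counter, so it reports CPU seconds instead of clocks; work() is a hypothetical stand-in for whatever code is being timed, and average_seconds_per_call is a made-up name.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical stand-in for the code being timed. */
static unsigned work(unsigned x)
{
    return x * 2654435761u + 1u;
}

/* Runs work() iters times and returns the average CPU seconds per call. */
static double average_seconds_per_call(unsigned long iters)
{
    /* volatile keeps the compiler from removing the loop as dead code */
    volatile unsigned sink = 0;
    clock_t t1, t2;

    assert(iters > 0);

    t1 = clock();
    for (unsigned long i = 0; i < iters; ++i)
        sink = work(sink);
    t2 = clock();

    /* Subtract first, then divide: (t2 - t1) / iters, not t2 - t1 / iters. */
    return ((double)(t2 - t1) / CLOCKS_PER_SEC) / (double)iters;
}
```

Calling average_seconds_per_call(100000000UL) matches the hundred million iterations mentioned above. The resolution of clock() is coarse, which is one more reason to use a large iteration count; the fixed cost of reading the timer twice is spread over all iterations.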

Well... is there any way to KNOW FOR SURE how many clocks an instruction took to run? Or is it all speculation/best guess/trusting the Intel/AMD specs?
Posted on 2008-12-28 18:54:50 by Melanikus
Here's my reliable way:

;=====[[ Benchmarking macros START_TEST/END_TEST >>===\
TEST_ITERS equ 10000000
TEST_SUBCYCLE = 1 ; 0 or 1, whether to show FP result

TEST_ID = 0 ; internal thingie
START_TEST macro where:REQ
local where2
@START_TEST_VAR textequ <where>
mov where,-1

invoke GetModuleHandle,0
invoke SetPriorityClass,eax,HIGH_PRIORITY_CLASS
invoke GetCurrentThread
invoke SetThreadAffinityMask,eax,1

invoke SwitchToThread

push eax
push edx
mov ecx,TEST_ITERS
align 16
END_TEST macro
dec ecx
jnz @CatStr(<testlabl>,%TEST_ID)

sub eax,
sbb edx,
add esp,8
mov ecx,TEST_ITERS
div ecx

invoke GetModuleHandle,0
invoke SetPriorityClass,eax,NORMAL_PRIORITY_CLASS

%@CatStr(<print >,@START_TEST_VAR)
mov ,edx
mov ,eax
fild qword ptr
mov dword ptr,TEST_ITERS
fidiv dword ptr
add esp,8

invoke GetModuleHandle,0
invoke SetPriorityClass,eax,NORMAL_PRIORITY_CLASS

%@CatStr(<PrintFloat >,@START_TEST_VAR)


Example usage:

main proc
local time1,time2

inc ebx
inc ebx
inc ebx
inc ebx
inc ebx


add ebx,1
add ebx,1
add ebx,1
add ebx,1
add ebx,1

main endp

Note that I avoid benchmarking a single instruction, as its latency could be hidden by the loop itself (it can execute while the jump is being processed).
It's best to benchmark a whole proc call that has, e.g., 30+ instructions in it.
Then also try with and without warming up the caches.
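As a rough illustration of timing a whole routine with and without a warm-up pass, here is a small C sketch. It is not a translation of the macros above: it uses clock() instead of a cycle counter, and routine_under_test, g_table and bench are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

typedef void (*bench_fn)(void);

/* volatile sink so the routine's work is not optimized away */
static volatile unsigned g_sink;

/* Hypothetical routine under test: walks a 64 KiB table, one byte per
   64-byte step, so that cache state can influence the timing. */
static unsigned char g_table[64 * 1024];

static void routine_under_test(void)
{
    unsigned sum = 0;
    for (size_t i = 0; i < sizeof g_table; i += 64)
        sum += g_table[i];
    g_sink = sum;
}

/* Returns average CPU seconds per call of fn over iters calls.
   If warm_up is nonzero, fn runs once untimed first to prime the caches. */
static double bench(bench_fn fn, unsigned long iters, int warm_up)
{
    clock_t t1, t2;

    assert(fn != NULL && iters > 0);

    if (warm_up)
        fn();

    t1 = clock();
    for (unsigned long i = 0; i < iters; ++i)
        fn();
    t2 = clock();

    return ((double)(t2 - t1) / CLOCKS_PER_SEC) / (double)iters;
}
```

Comparing bench(routine_under_test, 100000, 1) with bench(routine_under_test, 100000, 0) gives the two figures. With a large iteration count only the first call runs cold, so the two results will be close; a cycle counter and a small iteration count would be needed to see the cold-cache cost clearly.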

The "print" macro is actually the VKDebug PrintDec, renamed
Posted on 2008-12-28 22:21:38 by Ultrano