Finally got my hands on a Pentium 4 (1.4 GHz).

In the thread "Quick reference to MMX/SSE/SSE2 ..." ...

http://www.asmcommunity.net/board/index.php?topic=3708

... bitRAKE optimized an MMX 3D-transform loop from Intel (see below) from 15 down to only 10 cycles on an AMD Athlon (data from/to register).

Now we have the cycles/loop for four processors:

1) PMMX: 12 cycles
2) P3: 12 cycles
3) AMD Athlon: 10 cycles

and:

4) P4: 17 cycles !!!

If there is no catch, then it looks like Intel's marketing department is now involved in research and development too...

There are exactly 17 MMX instructions in the loop, so the statement (from the link above) ... :

"But troubles do not stop here. The image above outlines how instructions are addressed to specific ports in the P4 execution engine. All MMX instructions are queued in Port 1! This is a major drawback compared to the P6 core, in which most MMX instructions could be issued to Port 0 or Port 1."

... seems to be true.
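A quick sanity check of that claim (my arithmetic, not from the linked article; I'm assuming each MMX instruction is a single uop and that issue throughput is the only limit):

; P4: 17 MMX uops, all queued on port 1       -> at best 17 cycles per vector
; P6: 17 MMX uops spread over ports 0 and 1   -> at best ~9 cycles per vector
; measured: 17 cycles on the P4, 12 on the P3 - both close to the issue limit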

Consequence? Enjoy your machine (and wait for the Hammer).

The loop (3D-transforms xyzw vectors, each component 16 bit):


---------------------------
NextVect:
    ; Load vector (4 16-bit elements) into reg
    movq        mm3,mm7         ; original: movq mm3, - memory load replaced by a dummy move for the register-only test

    movq        mm4,mm3

    pmaddwd     mm3,mm0
    movq        mm5,mm4

    pmaddwd     mm4,mm1
    inc         ecx

    pmaddwd     mm5,mm2
    movq        mm6,mm3

    punpckldq   mm3,mm4

    punpckhdq   mm6,mm4

    movq        mm4,mm5
    punpckhdq   mm5,mm5

    paddd       mm3,mm6
    paddd       mm5,mm4

    psrad       mm3,AnzNKomma   ; AnzNKomma = number of fixed-point fraction bits
    psrad       mm5,AnzNKomma

    packssdw    mm3,mm5
    movq        mm4,mm3         ; original: movq ,mm3 - memory store replaced by a dummy move for the register-only test

    jnz         NextVect        ; flags come from the INC ECX above - the MMX instructions leave them untouched
---------------------------
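For anyone trying to follow the dataflow, here is my reading of the trick (my annotation, not from the original post; I'm assuming mm0..mm2 each hold one matrix row as four signed 16-bit coefficients, with the x coefficient in the low word):

; after the three PMADDWD (v = input vector x,y,z,w):
;   mm3 = | z*A2 + w*A3 | x*A0 + y*A1 |     two partial sums of row A
;   mm4 = | z*B2 + w*B3 | x*B0 + y*B1 |     two partial sums of row B
;   mm5 = | z*C2 + w*C3 | x*C0 + y*C1 |     two partial sums of row C
; PUNPCKLDQ/PUNPCKHDQ regroup the halves so each PADDD finishes two dot products:
;   mm3 = |    v.rowB   |    v.rowA   |
;   mm5 = |  byproduct  |    v.rowC   |     high dword is unused
; PSRAD rescales the fixed-point results, PACKSSDW packs them back into
;   | w' (unused) | z' | y' | x' |  as four saturated 16-bit words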




Rem.:
---quote bitRAKE---------
"I figured it out - these dummy test runs don't work on out-of-order processors. The dummy moves to MM7 were working against the internal optimizers within the core of the processor"
-----------------------------

bitRAKE: What was the issue with reading/writing the vectors from/to the MMX registers? (On the P4 it was exactly the code snippet above.)


VShader
Posted on 2002-03-03 10:39:39 by VShader
VShader, the problem is that the Athlon optimizes the MacroOps internally, and the dummy moves create false dependencies - preventing optimization.

Intel really wants people to use SSE/SSE2, because the Athlon so thoroughly beats the P3/P4 at MMX. I read up on both chips' internal workings before buying an Athlon. It performs as expected - the only problem is keeping the CPU fed with data from memory. And this is the problem that the P4 tried to solve - on a 2.2 GHz chip, 17 cycles isn't a problem because memory is so slow. You need to use prefetch to speed up real-world use of these algos, because the weakest link isn't the CPU - it is memory. This is why no real speed increase is perceived between these very fast CPUs and those at slightly more than half the speed - most applications aren't optimized for that performance range.
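A minimal way to wire prefetch into a loop like the one above (my sketch, not bitRAKE's code; esi is a hypothetical source pointer and the 256-byte distance is just a starting guess to tune):

    prefetchnta [esi + 256]     ; ask for a cache line a few vectors ahead of the loads
    movq        mm3,[esi]       ; current xyzw vector (4 x 16 bit)
    add         esi,8
    ; ... transform mm3 as in the loop above ...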

What timing did the code in (this) post show?
Posted on 2002-03-03 11:53:22 by bitRAKE
I have only tested the sequence above, and I am not sure if I can test again because my colleague wrote something about an explosion and reformatting her hard disk - but I think she is joking.

VShader
Posted on 2002-03-03 12:44:45 by VShader
btw: Some very cool and in-depth information on the Pentium 4 and other Intel processors can be found here:

http://www.emulators.com/pentium4.htm


VShader
Posted on 2002-03-03 13:37:41 by VShader
My tests show that if the vectors aren't in the cache, then each transform takes ~40 cycles. Those 30 extra cycles are based on a 1.3 GHz CPU with DDR memory - expect this to be higher on faster processors. This leaves quite some room for prefetch optimizations! The ideal design would be one where blocks of vectors are processed from start to finish in small batches - where all the memory reads/writes take place in cache, and no prefetching is needed. The block size would be based on the data cache size. Most processes can't fit into such a tight design constraint, so work is still to be done on the above algo to even get close to the 10 cycle minimum, and the final product will be processor dependent. :(
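Rough numbers for that batch idea, as a sketch (my arithmetic, not bitRAKE's; I'm assuming 8-byte xyzw vectors and reserving half of the Athlon's 64K L1 data cache for the source and half for the results):

VEC_SIZE        equ 8                           ; xyzw, 4 * 16 bit
BATCH_BYTES     equ 32*1024                     ; source half of the L1 data cache
BATCH_VECTORS   equ BATCH_BYTES / VEC_SIZE      ; = 4096 vectors per batch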

Thanks for the link.
Posted on 2002-03-03 13:48:57 by bitRAKE
All the info on the P4 suggests Intel has been working on the Prestonia Hyper-Threading for a very long time, and everything in the design is geared towards that feature. The P4 isn't the best on the block right now, but maybe in the near future it will be a different story? I imagine the Prestonia running as four concurrent processors at 2 GHz, and leaving the Athlon in the dust. The main thing to come from this is that MHz/GHz doesn't mean much anymore. :)
Posted on 2002-03-03 14:39:09 by bitRAKE
I don't think MHz has been an exact indicator since the 486 processor; I believe that is when internally multiplied clock speeds were introduced.
Actually, as processors have progressed past the original 8086, the clock cycles needed per instruction have generally gone down, with a few exceptions.

Dig
Posted on 2002-03-03 22:43:49 by dig
I just found this page about Athlon instruction timing experiments:
http://members.jcom.home.ne.jp/kgoto/athlon.html

Using prefetch and MOVNTQ, I have only been able to get the timing down to an average of 30 cycles per vector over 100,000 vectors. This is down from 45 cycles - before, when I stated 40, I was reading and writing the same memory. That is still 66% of the time spent waiting on memory!

Edit: Adding a 32K 'touch' prefetch on the source has helped drop the timing to <24 cycles per vector - again 100,000 vectors not in cache.
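Putting those numbers side by side (my accounting, taking the 10-cycle register-only result from earlier in the thread as the compute floor):

;   plain loop, data not cached:   45 cycles/vector -> ~35 cycles (~78%) stalled on memory
;   prefetch + MOVNTQ:             30 cycles/vector -> ~20 cycles (~66%) stalled on memory
;   + 32K 'touch' prefetch:       <24 cycles/vector -> <14 cycles (<60%) stalled on memory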

Very solid timing of 24 +/- 0.5 cycles per vector:
    mov     ecx,iNumVec

    mov     eax,pMatrix
    lea     edx,[ecx*8]             ; size of the vector array in bytes
    neg     ecx                     ; ecx = -iNumVec, counts up toward zero

    movq    mm0,[eax + 0]           ; three matrix rows -> mm0..mm2
    movq    mm1,[eax + 8]
    movq    mm2,[eax + 16]

    mov     eax,edx
    add     edx,pVector             ; edx = end of source, so [edx + ecx*8] starts at the first vector
    add     eax,pResult             ; eax = end of destination, indexed the same way

_32k_prefetch:
    mov     ebx,(32*1024/64)/2      ; touch 32K of source, two 64-byte lines per iteration
    ; unrolling this more than two doesn't help
@@: mov     esi,[edx + ecx*8 + 64*0]
    mov     edi,[edx + ecx*8 + 64*1]
    add     ecx, 64*2/8
    dec     ebx
    jne     @B
    mov     ebx, ecx                ; ebx marks where the next 32K block will start
    sub     ecx, 64*2/8 * (32*1024/64)/2    ; rewind ecx to the start of the block just touched

NextVect:
    cmp     ecx, ebx
    je      _32k_prefetch           ; finished the touched block - pull in the next 32K

    movq    mm3,[edx + ecx*8]
    inc     ecx                     ; only flag writer in the loop body - JNZ below uses its ZF

    movq    mm4,mm3
    pmaddwd mm3,mm0

    movq    mm5,mm4
    pmaddwd mm4,mm1

    movq    mm6,mm3
    pmaddwd mm5,mm2

    punpckldq mm3,mm4
    punpckhdq mm6,mm4

    movq    mm4,mm5
    punpckhdq mm5,mm5

    paddd   mm3,mm6
    paddd   mm5,mm4

    psrad   mm3,NUMBER_SCALE-2
    psrad   mm5,NUMBER_SCALE-2

    packssdw mm3,mm5
    movntq  [eax + ecx*8 - 8],mm3   ; non-temporal store (ecx already incremented, hence the -8)

    jnz     NextVect

    sfence
    emms
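In case anyone wants to drop this into a test piece, here is roughly how I would declare the inputs (a sketch only; the names iNumVec, pMatrix, pVector and pResult come from the code above, while the variable types, sizes and alignment are my assumptions):

.data
align 8
TheMatrix   dw  3*4 dup(?)          ; three rows of four signed 16-bit coefficients -> mm0..mm2
SrcVectors  dw  4*100000 dup(?)     ; 100,000 xyzw source vectors, 16 bit per component
DstVectors  dw  4*100000 dup(?)     ; keep something readable right after the source: as far as
                                    ; I can tell the touch loop reads whole 32K blocks, so it can
                                    ; run past the end when iNumVec isn't a multiple of 4096
iNumVec     dd  100000
pMatrix     dd  offset TheMatrix
pVector     dd  offset SrcVectors
pResult     dd  offset DstVectors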
Posted on 2002-03-04 02:39:15 by bitRAKE
bitRAKE,

The w component (16 bit) of the untransformed xyzw MMX vector is always 1.0 (fixed point).

So on a faster processor with more cycles to spare, you can store the vertices in only 6 bytes (xyz xyz xyz xyz ...) instead of 8 (xyzw ...) and insert the 1.0 while transforming, chopping another 25% off the input data.
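A minimal sketch of that packing trick (my code, not VShader's; the constant names are mine, and it assumes the vertex buffer has at least 2 bytes of padding at the end because the load grabs a whole quadword):

    movq    mm3,[esi]           ; loads x,y,z of this vertex plus 2 stray bytes of the next one
    add     esi,6               ; vertices are now packed 6 bytes apart
    pand    mm3,qwXYZMask       ; qwXYZMask = 0000FFFFFFFFFFFFh - clear the w lane
    por     mm3,qwOneW          ; qwOneW = fixed-point 1.0 (1 shl AnzNKomma) in the top 16-bit lane
    ; ... transform mm3 as before ...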

BTW: With a bit of loop unrolling and EDO-RAM-friendly reading (5-2-2-2 burst) I now get about 10,000,000 vector transforms/s (while running the 3D engine), i.e. 20 cycles per vector, on my P200 MMX - not fair (because you have more Hz), but: go for it! :grin:

VShader
Posted on 2002-03-04 16:15:55 by VShader
VShader, this has been a good test for me. It's demonstrated what kind of performance hit is taken by non-cached data on newer processors, and how much time is spent moving data around. My approach to data processing will be different in the future. Thanks for the idea, I'll see what I can do with it.
Posted on 2002-03-04 17:22:29 by bitRAKE