Hello, a pure assembly contrib.


A very cool reference for the MMX/SSE/SSE2
instruction sets, with some performance
comparisons between AMD, the PIII and the PIV.
Posted on 2002-02-18 13:05:25 by marsface
Does anybody have some hands-on experience with MMX code on the P4 (in comparison to MMX code on the PMMX, P2 and P3)?

Does the increased latency on the P4 slow down MMX code considerably?

Is it true that (2x) 64-bit MMX on the P3 is faster than 128-bit MMX on the P4?

Posted on 2002-02-18 14:18:48 by VShader
Great Reference!
I learnt more from this site than from any other source!:alright:

Posted on 2002-02-18 22:34:50 by dig
VShader, I haven't coded on a P4, but you have to understand the whole pipeline: latency is just one factor. MMX code is usually used in a loop to process large amounts of data, and even with high latency the instructions will execute in parallel if there are free execution units. The Athlon can execute two MMX instructions in parallel (it would be three, but only eight bytes can be decoded at once and MMX instructions are too long). I'm assuming the P4 has similar technology behind it.
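The latency-hiding idea can be sketched in scalar C (illustrative only; `dot4` is a made-up name, not code from this thread): several independent accumulators break the dependency chain, so high-latency multiplies can overlap in flight, just as independent MMX instructions do inside a loop.

```c
#include <stdint.h>
#include <stddef.h>

/* Multiply-accumulate with four independent accumulators: each add
   depends only on its own chain, so the CPU can keep several
   long-latency multiplies in flight at once. */
int64_t dot4(const int32_t *a, const int32_t *b, size_t n)
{
    int64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i;
    for (i = 0; i + 4 <= n; i += 4) {      /* four independent chains */
        s0 += (int64_t)a[i + 0] * b[i + 0];
        s1 += (int64_t)a[i + 1] * b[i + 1];
        s2 += (int64_t)a[i + 2] * b[i + 2];
        s3 += (int64_t)a[i + 3] * b[i + 3];
    }
    for (; i < n; i++)                     /* leftover elements */
        s0 += (int64_t)a[i] * b[i];
    return s0 + s1 + s2 + s3;
}
```

The same result as a single-accumulator loop, but without a serial add chain limiting throughput.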
Posted on 2002-02-18 23:05:37 by bitRAKE
I asked because of these three passages in the link above:

1) "Assuming that we should have P4s running at 2 GHz and more pretty soon, I would not worry about the doubling in latency of most MMX instructions. But the multiply instructions' latency (PMADDWD / PMULHW / PMULLW) jumped from 3 cycles in the P6 core to 8 cycles in the Pentium 4! This will affect all convolutional kernel codes that are widely used, for example, in audio applications. Another troublesome latency is MOVQ's 6 cycles versus only 1 cycle on the P6 core, given that it is widely used to move memory blocks and copy results."

2) "But troubles do not stop here. The image above outlines how instructions are addressed to specific ports in the P4 execution engine. All MMX instructions are queued in Port 1! This is a major drawback compared to the P6 core, in which most MMX instructions could be issued to Port 0 or Port 1."

3) "Summing up, the P4 can issue only one MMX instruction per cycle, and the latency is at best twice that on the older Pentium III processor. In pathological conditions, this adds up to bring the P4's SIMD performance down to about one third of the P-III's. Until the P4 ramps up into the 2+ GHz frequency range, its integer SIMD execution speed will simply lag behind the venerable P6 core."

Because I want to compare some real-world data, I did a test on my P200mmx.

It would be nice if we could get the numbers for the P2, P3, P4 and AMD.
Don't forget to put the Hz of your machine in "YourMachine_Hz".

Here is the loop:

Posted on 2002-02-19 10:39:26 by VShader
Yes, it does sound like the case for the P4 isn't that good. My Athlon has a latency of 4 cycles for PMULx. I will test when I get home later. I would like to state again that if the dependencies on the results of these operations are pushed back far enough, then the latency doesn't affect the code. My alpha-blend code is fastest on the Athlon, two cycles more on the P3, and untested on the P4. Another thing is that the Athlon/P4 will execute instructions out-of-order, but that doesn't mean it couldn't use some help from the programmer. ;)

Posted on 2002-02-19 11:21:18 by bitRAKE
My numbers are (Athlon 1334 MHz):
eax: 13
ebx: 65.522.762
ecx: 49 :grin:

Now it would be interesting to see if we can optimize this, both for the general case and for the processor-specific case. If you can find a way around using MOVD (load/store using only MOVQ), then you'll get better performance in actual use. Also, if you can find a way to process the vectors in batches that remain in the cache during all operations on the vectors, then you should see a boost in speed on faster processors.

To give you an idea of how memory-bound 2 GHz processors are: the AMD-optimized memcpy is able to move 1.63 GBytes with a 1 GHz processor and DDR2100 memory, and there is no speed increase with a 1.6 GHz processor; the faster processor spends over half the time waiting on memory. This is a best case, where the data movement to/from memory has been optimized. In the majority of code, things will be much worse if data isn't handled in batches that fit within the cache.
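The batching idea can be sketched in C (an illustrative scalar sketch, not code from this thread; `scale_pass`, `bias_pass` and `BLOCK` are hypothetical names): run every pass over one cache-sized block before moving on, instead of streaming the whole array through memory once per pass.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK 1024   /* elements per batch; tune so a batch fits in L1/L2 */

/* Two illustrative passes standing in for real per-vector work. */
static void scale_pass(int16_t *v, size_t n)
{
    for (size_t i = 0; i < n; i++) v[i] = (int16_t)(v[i] * 2);
}
static void bias_pass(int16_t *v, size_t n)
{
    for (size_t i = 0; i < n; i++) v[i] = (int16_t)(v[i] + 1);
}

/* Apply both passes batch by batch, so the second pass reads data
   that the first pass just left cache-hot. */
void process_batched(int16_t *data, size_t n)
{
    for (size_t i = 0; i < n; i += BLOCK) {
        size_t len = (n - i < BLOCK) ? n - i : BLOCK;
        scale_pass(data + i, len);   /* batch is now in the cache ... */
        bias_pass(data + i, len);    /* ... for the second pass       */
    }
}
```

The results are identical to running each pass over the whole array; only the memory traffic changes.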

Note: I did notice that the algo above is exactly the same as the Intel MMX code. Their code is often not the best possible, even on their own processors. I think a couple of clocks could be saved within the loop? (I haven't spent much time with it.)

Here are papers regarding the algo:
Posted on 2002-02-19 21:03:11 by bitRAKE
Hasn't been tested, but I think that this code:
movq      mm6,mm3     ;add row0 high and low order 32-bit results
psrlq     mm3,32
paddd     mm3,mm6
movq      mm6,mm4     ;add row1 high and low order 32-bit results
psrlq     mm4,32
psrad     mm3,15-2    ;shift 32-bit to 16-bit; also app. specific <<2
paddd     mm4,mm6
psrad     mm4,15-2    ;shift 32-bit to 16-bit; also app. specific <<2
punpcklwd mm3,mm4     ;copy word0 of mm4 into word0 of mm3
movd      [eax]-8+0,mm3 ;store 1st and 2nd elements, one 32-bit write
should be replaced by this code:
; mm3 = A1 A2
; mm4 = B1 B2
movq      mm6,mm3     ; A1 A2
punpckldq mm3,mm4     ; B2 A2
punpckhdq mm6,mm4     ; B1 A1
paddd     mm3,mm6     ; B1+B2 A1+A2
psrad     mm3,15-2    ; scale dwords back to signed fixed point
packssdw  mm3,mm3     ; pack dwords into words
movd      [eax]-8+0,mm3 ; store 1st and 2nd elements, one 32-bit write
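As a sanity check, here is what the replacement sequence computes, modeled in scalar C (illustrative only; the names are made up, and C's `>>` on negative values is assumed to be an arithmetic shift, which matches PSRAD on common compilers):

```c
#include <stdint.h>

/* PACKSSDW behaviour: clamp a 32-bit value into a signed 16-bit word. */
static int16_t saturate16(int32_t v)
{
    if (v >  32767) return  32767;
    if (v < -32768) return -32768;
    return (int16_t)v;
}

/* a[] holds the two dwords of mm3 (a[0] = A2 low, a[1] = A1 high),
   b[] the two dwords of mm4.  Produces the packed word pair that the
   punpckldq / punpckhdq / paddd / psrad / packssdw sequence stores. */
void hadd_scale_pack(const int32_t a[2], const int32_t b[2], int16_t out[2])
{
    int32_t lo = a[0] + a[1];        /* A2+A1 : paddd, low dword  */
    int32_t hi = b[0] + b[1];        /* B2+B1 : paddd, high dword */
    out[0] = saturate16(lo >> 13);   /* psrad mm3,15-2 then packssdw */
    out[1] = saturate16(hi >> 13);
}
```

This is the same result the original shift-based sequence produced, just with the horizontal adds done by unpacking instead of a 64-bit shift.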
If someone from Intel is browsing the forum:
I wouldn't mind doing this for a living! ;)
Posted on 2002-02-20 00:31:01 by bitRAKE
When you test CPUs' relative performance, make sure you're not using memory, or at least that everything is in the cache; otherwise a Pentium 60 will look faster than an Athlon 1400, cycles-wise.

I recall a guy who was once extremely pissed off because the OUT instruction took 20 times longer to execute on the latter than on the former. I leave the comments to you. ;)

Posted on 2002-02-20 05:00:06 by Maverick
Your code is correct; I put it in my 3D engine.

It has three instructions fewer than the original MMX transform loop from Intel, but I spent half an hour reordering the instructions and could not get it noticeably faster (imperfect pairing? MMX shift instructions cannot pair?!).

Perhaps it looks better on a machine other than my old PMMX. How is it on your Athlon?


btw: Nobody with a P4 and MASM here?
Posted on 2002-02-20 13:35:47 by VShader
I will test and follow up later, but I wanted to ask whether you had tested it again with the above method, and what your results were. There will most likely be no perceived speed increase in actual use without prefetching data into the cache. Also, you will be able to save another two instructions by combining the pack/store. ;) You should notice a speed increase by eliminating one of the stores: see the note about MOVD above.
Posted on 2002-02-20 14:12:27 by bitRAKE
Yes, I benched with real data in my engine.

Reading from mm7 instead of from memory:
Here are the cycles per loop for the Intel code and your modification on my P200mmx:

Posted on 2002-02-20 17:51:37 by VShader
VShader, both those sections of code are the same.
I would expect similar results from them on the same CPU. :)
Posted on 2002-02-20 22:34:33 by bitRAKE
Here is the pretty looking loop I came up with, but I can't get
it faster than 19 cycles! And I don't know why? I replaced the
moves like in your test code above. I'll go sit in the corner
until I figure this out. :)

; Load vector (4 16-bit elements) into reg
movq mm3,[edx + ecx*8]

movq mm4,mm3 ;copy to other regs for use by 3 pmadds
pmaddwd mm3,mm0 ;multiply row0 X vector

movq mm5,mm4
pmaddwd mm4,mm1 ;multiply row1 X vector

movq mm6,mm3 ; A1 A2
pmaddwd mm5,mm2 ;multiply row2 X vector

punpckldq mm3,mm4 ; B2 A2
punpckhdq mm6,mm4 ; B1 A1

movq mm4,mm5 ;add row2 high and low order 32-bit results
psrlq mm5,32

paddd mm3,mm6 ; B1+B2 A1+A2
paddd mm5,mm4

psrad mm3,NUMBER_SCALE-2
psrad mm5,NUMBER_SCALE-2

packssdw mm3,mm5 ; pack dwords into words
dec ecx

movq [eax + ecx*8],mm3 ; store resulting vector
jnz NextVect ;then loop back to do the next one.
Edit: I figured it out - these dummy test runs don't work on out-of-order processors. The dummy moves to MM7 were working against the internal optimizers within the core of the processor, and costing 6 cycles! I've got the execution down to 12 cycles, but I need to make a better test app - it should be lower, IMO. The Athlon can execute almost all MMX instructions in parallel:
The AMD Athlon processor floating-point logic is a
high-performance, fully-pipelined, superscalar, out-of-order
execution unit. It is capable of accepting three MacroOPs of any
mixture of x87 floating-point, 3DNow! or MMX operations per cycle.
To execute three MMX instructions in parallel one would have to be a load/store, and only one could be a multiply - any combination that follows those guidelines should work if there aren't any forward dependencies.
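For reference, one iteration of the loop above can be modeled in scalar C (an illustrative sketch, not the MMX code itself; `NUMBER_SCALE` is assumed to be 15, and `>>` on negative values is assumed arithmetic, matching PSRAD):

```c
#include <stdint.h>

#define NUMBER_SCALE 15   /* fixed-point fraction bits, assumed */

/* PACKSSDW behaviour: clamp a 32-bit value into a signed 16-bit word. */
static int16_t saturate16(int32_t v)
{
    if (v >  32767) return  32767;
    if (v < -32768) return -32768;
    return (int16_t)v;
}

/* rows are three 4-element rows of signed 16-bit fixed-point
   coefficients, vec is one 4-element vector.  Each row does what one
   PMADDWD plus horizontal add does, then the result is scaled like
   PSRAD NUMBER_SCALE-2 and saturated like PACKSSDW. */
void transform_vec(const int16_t rows[3][4], const int16_t vec[4],
                   int16_t out[3])
{
    for (int r = 0; r < 3; r++) {
        int32_t acc = 0;                          /* pmaddwd + paddd */
        for (int c = 0; c < 4; c++)
            acc += (int32_t)rows[r][c] * vec[c];
        out[r] = saturate16(acc >> (NUMBER_SCALE - 2));
    }
}
```

A model like this is handy for checking the MMX loop against known inputs while reordering instructions.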
Posted on 2002-02-21 00:47:46 by bitRAKE
Here is a RadASM project that outputs to a debug window (vkim's). This shows <11.5 cycles per vector for 1024 vectors (src/dest in the cache). Please try this on your machine if you can; otherwise let me know what you're working with and I'll see what I can do. These figures fall in line with what I expect.

As to prefetching on your CPU: that just consists of 'touching' (loading) memory in the next cache line. I don't think it's going to matter at 200 MHz. The above algo should be faster nonetheless. :)

Edit: I have it at 10 cycles for the Athlon! I'm guessing your CPU will weigh in at 11, considering I got rid of all but two shifts! :) Let me know if I broke the algo? :eek:

Edit Again: I think the punpckldq mm5,mm5 should be punpckhdq mm5,mm5? And you might have to mask off the high word of mm3 before it is stored, if you need that value to be zero? Okay, my final guess is 12 cycles on your CPU. I'll test on a P3 tomorrow at work.
Posted on 2002-02-21 02:25:58 by bitRAKE
Hi all,
I posted a link earlier to the Microsoft reference for processors in the thread "Instruction set".
There you will find a link to the Microsoft website where you can download the help file for the processor instruction sets. It contains the MMX/SSE/SSE2 instructions with their reference, and comparisons between AMD and Intel processors. It also contains the 3DNow! instruction set. Try it and tell me your opinion.
Posted on 2002-02-21 03:11:50 by amr
amr, there is no documentation at that link:
The Visual C++ 6.0 Processor Pack provides intrinsic support for enhanced instruction sets supported by Intel and Advanced Micro Devices (AMD) processors. The instructions sets supported are Intel's Pentium III new instruction sets (Streaming SIMD Extensions ) and Intel's Pentium 4 new instruction sets (Streaming SIMD Extensions 2 ) as well as AMD's 3DNow! Instruction sets.
This is an add-on for VC++ that allows you to use the additional instructions of newer processors. The version of ML.EXE I'm using has this support. Your statements are false, amr. No help file, no reference for processors? Or is there something I misunderstand?
Posted on 2002-02-21 08:06:56 by bitRAKE
VectorC *online* compiler:


Posted on 2002-02-21 10:22:34 by Maverick
Maverick, that's cool, but don't you have to learn the compiler switches and stuff to produce really good code? Guess I'll have to try it out...

The algo above is ~12 cycles on P3.
Posted on 2002-02-21 10:35:49 by bitRAKE
Hi pal:)

Well, I would have especially advised downloading the stand-alone executable demo, but it's not available anymore (hopefully it will be again), so all that remains to play with is the online compiler. Dunno about the switches; the real product surely has an *interactive* optimizer, which the online compiler doesn't have. I guess that says it all, though. ;)

Posted on 2002-02-21 10:42:35 by Maverick