A friendly technical discussion on the win32asm board got quite out of hand
(http://www.asmcommunity.net/board/index.php?topic=12764), so
here we go.

The purpose of the algorithm is the following: multiply the values of an
array of 16bit unsigned words with a constant. The array size is 2048 items
(one x86 4096 byte page), and the constant chosen is 92. Algorithms that
can easily be modifiable to use other constants are cool, but focus should
be put on optimizing for the constant 92. Why? Because that was the original

The release of this is somewhat rushed, but I'll be gone for the weekend
and really want to have some feedback when I get home. Please read through
all this, and all the stuff in readme.txt in the zip file, before you run
any of the binary code. Benchmark results for built-in functions are not
desired; I already have plenty. The benchmarks are just there for your
amusement. You _WILL_ want to edit yodel.ini for non-gigahertz machines!

Any instructionsets are allowed - algorithms should be classified as
"good generic", "good for pentium or older", "MMX", "Athlon", "Pentium4".
Of course athlon and P4 optimized routines may use MMX as well, as may
they use simple instruction sets.

This test isn't to be seen as a "P4 vs. Athlon" or a "C/Whatever vs. Assembly",
it's just for fun - with a side purpose of developing a good test/benchmarking
base. Of course it's fun to see how we can get Athlons and P4's to perform,
especially when effectively taking fractional clocks per multiply :-).

The benchmarking/testing suite. I've found myself writing variations of
the same code too many times, so now is probably a good time to write something
reusable. I call this "yodel", simply because that was the first word that
poppen into my mind when I named the source file. At the time of this writing,
it does its job fairly well, but isn't the most flexible nor accurate. Things
will change, I hope.

If you wish to write your own test function: make a "timedll.dll" that has
the export "_func". This is a STDCALL calling convention function that take
a single parameter: a pointer to an array of 2048 16bit unsigned words.
The function must return either 1 (success) or 0 (failure) in EAX. In other
words, "bool __stdcall func(unsigned short *parm);" -- for most languages,
you will need to write the equivalent of a .def file to get the export name
right, or use a hex editor to hack up the binary :-)

The idea is to get something that does fairly accurate benchmarks (the timing
code should be souped up, perhaps using Maverick's stuff), as well as being
"somewhat portable" (by this I mean sane win32 and linux environments. Main
reason for linux support is that is I have a P3 class machine running linux,
which would be nice to be able to run benchmarks on). In the more distant
feature, even nicer things could be added, like a nice GUI, generating test
overviews, etc. Scali is working on some code to get system (OS and CPU)
information, this will also be used to make the benchmark not try to test
routines the current test rig can't handle (SSE2 on a Pentium3, MMX on a pplain,
et cetera).

There might be other suites around that does this already, but I feel like
spending some time writing "my own" - it's something you can learn from.
Would also be nice to get some feedback from people and such.

I'd really like people to try this stuff out. If you're going to try your
hand at optimizing the algorithm, please use yodel to test it - that way
results are comparable. Also, suggestions, bug reports, et cetera. Furthermore,
I hope we can maintain a friendly atmosphere. Competitive is a-okay, that's
part of the fun anyway, but the keyword here is *FRIENDLY* (and accurate;
guesses and mistakes are okay, as long as people are willing to accept their
mistakes being corrected. I am.)
Posted on 2003-04-25 08:25:37 by f0dder
Posted on 2003-04-27 13:22:52 by Hiroshimator
thanks, Hiro.
Hope to see some posts soon :)
Posted on 2003-04-27 14:05:02 by f0dder
Weird the program still attempt to call the functions through it does not conform
/] running performance tests
test01-simple C code ...021719 ticks (effective 6.342 clk/mul)
test02-Ekted 1 ...030359 ticks (effective 8.865 clk/mul)
test03-Roticv 1 ...012312 ticks (effective 3.595 clk/mul)
test04-Hutch 1 ...017593 ticks (effective 5.137 clk/mul)
test05-f0dder imul ...017594 ticks (effective 5.137 clk/mul)
test06-scali MMX ...003547 ticks (effective 1.036 clk/mul)
test07-scali SSE2 ...doesn't conform(!)...002219 ticks (effective -1
.#IO clk/mul)
test09-Jibz' C++ code ...026625 ticks (effective 7.774 clk/mul)
test10-hutch fmul 1 ...doesn't conform(!)...038704 ticks (effective 11
.301 clk/mul)
test11-scali athlon1 ...017578 ticks (effective 5.133 clk/mul)

Just a question, what is wrong with my dll?

.model flat, stdcall
option casemap:none

include /masm32/include/windows.inc
DllEntry proc hInstDLL:DWORD, reason:DWORD, reserved1:DWORD
DllEntry endp

_func proc
push esi
push edi

mov esi, [esp+12]
mov edi, 2048
sub esi, 2

movzx eax, word ptr[esi + edi*2]

; roticv code begin
lea ecx, [eax*8+eax] ; n * 9
shl eax, 7 ; n * 128
shl ecx, 2 ; n * 36
sub eax, ecx ; n * 128 - n*36
; roticv code end

mov [esi + edi*2], ax
sub edi, 1
jnz _loop

pop edi
pop esi
mov eax, 1
ret 4
_func endp

end DllEntry

Just testing code, but it seems to fail. Yes I did name it timedll.dll and the function _func.
Posted on 2003-04-28 09:30:55 by roticv
roticv, edit the ini file - it lets you change whether to run a function if it doesn't conform (conform=true). Furtheremore, you should set "plain = true" since you don't have MMX.

One problem with your DLL is DllMain - it should return 1. Quoting PlatformSDK:

When the system calls the DllMain function with the DLL_PROCESS_ATTACH value, the function returns TRUE if it succeeds or FALSE if initialization fails. If the return value is FALSE when DllMain is called because the process uses the LoadLibrary function, LoadLibrary returns NULL. (The system immediately calls your entry-point function with DLL_PROCESS_DETACH and unloads the DLL.) If the return value is FALSE when DllMain is called during process initialization, the process terminates with an error.

Next, are you sure "_func" ends up as "_func" in the executable? I can't remember how STDCALL does it when there's on parameters, I'd have though "_func@0". I should whip up a small DLL template for masm, but there's a lot of stuff on my todo list currently :)
Posted on 2003-04-28 09:36:30 by f0dder

Instead of measuring ticks, elapsed time is more accurately measured on all Pentium + machines with the processor?s time stamp counter - # of clock cycles since last system reset: Before executing each routine use RDTSC to read the 64 bit counter into EDX:EAX, and save these; after executing the routine use RDTSC to read the new counter into EDX:EAX, and subtract the old values from the new?

Here is sample code taken from an example program for Randall Hyde?s HLA. prevTime and qw are 32bit variables (double actually).

mov( eax, prevTime );
mov( edx, prevTime[4] );

?? routine to be timed

sub( prevTime, eax );
sbb( prevTime[4], edx );
mov( eax, (type dword qw));
mov( edx, (type dword qw[4]));

Remember the brackets and semicolon are not used in NASM, MASM code, and src, dest order is reversed in the HLA instructions compared to NASM, MASM. In MASM code, that would be (I think ?)

mov prevTime, eax
mov prevTime[4], edx

?? routine to be timed

sub eax, prevTime
sbb edx, prevTime[4]
mov qw, eax
mov qw[4], edx

Using the sensitive clock counter (less than 1 nanosecond per clock on a 1GHz + machine) instead of ticks (milliseconds), you can time 2048 multiplications and the clock count would be a 32 bit number (in EAX only ? 32 bit # clocks is 1.4 seconds on a 3GHz processor). To get # of clocks (including decimal places) per iteration divide by 2048 (or right shift 11 bits to get integer clock count). Each test would therefore take less than 1 millsecond on any processor including a 166MHz Pentium MMX machine.

If an insensitive measure like ticks are counted, you need to do a long test ---1 million iterations of 2048 multiplications just for each test to take a few seconds 2.53GHz Pentium 4 (enough to be able to divide and get a reasonable average), however, on a 166MHz Pentium MMX machine, for example, each test now takes minutes. And by 2004, 10 GHz Pentium 6 machines would be too fast for your code again anyway.

Now IF YOU NEED TO use ticks, since you are measuring the clock speed at the start of the program, you should probably use different numbers of iterations for different speeds: a simple way of doing it would be to divide the clock speed in MHz by 100 and use that as an index into a table of # of iterations to use:

0 (0-99 MHz) ? 50,000 ; the first figure is increased by necessity
1 (100-199) ? 50,000
2 (200-299) ? 100,000
3 (300-399) ? 150,000
4 (400-499) ? 200,000
20 (2000-2100) ? 1,000,000

[2 GHz and above use 1,000,000? for simplicity, and so as not to have an infinite table] Or simply multiply clock speed in MHz by 500 (as above).

Easiest solution: Measure clock cycles with RDTSC. If you do that you may not even NEED to bump the process priority up?.
Posted on 2003-04-28 22:08:58 by V Coder
f0dder, nice work. I tried to do a similar thing with macros for a timing test framework. How about a general program that would work for any test routines? I litte more involved but your half way there. The expected input and parameters would have to be defined, then each routine would have an information block stating it's requirements (MMX,K3D,SSE). There could be a whole bank of test to run on new chips like x86-64, Crusoe, emulators -- to get a feel for their strengths or weaknesses.
Posted on 2003-04-28 23:52:12 by bitRAKE
VCoder, thanks but I already know :)
I was planning to use RDTSC in the timing later on (I'm already using that approach in in CPU speed detection.) - I still want to run through a few iterations, but you're definitely right about not having to run through as much to get stable results. I'll want to use the full 64bit range of RDTSC too.

I think it's fair to assume at least pentium/rdtsc in the benchmarking. I wouldn't like developing on/for anything less. As for process priority bumping... well, it's already a togglable option in the INI file. I like it, because it gives more accurate results (very very few ms off by each iteration). Of course if your code doesn't even run for a full timeslice, it might not be necessary... but I'll leave in the option (and time with in on myself).

bitRAKE, that's exactly what I want to do :). scali has made up a nice piece of code that detects a helluvalot CPU information (and converts intel cache descriptor format to AMD - ie, same interface). Functions like hasFPU, hasMMX, hasSSE (etc) will be used to only run the supported tests. Linux version on the way, too.

One of the very important parts of this benchmark thingy is the conformance tests. Ok, you don't necessarily spot errors by just feeding in random numbers, but it works most of the time. And conformance testing _is_ pretty important imo.

Well, I think I'll get some more work done on yodel today. Perhaps the linux port, perhaps rdtsc timing. Probably adding in scalis CPU feature detection code.
Posted on 2003-04-29 01:56:55 by f0dder

I notice that SSE2 (3 instructions operate on 8 words) is twice as fast as MMX (3 instructions operate on 4 words). Yet, only Pentium 4 processors (and Opeteron) have SSE2 capability, whilst all Pentium MMX or later processors have MMX capability. It is not just AMD users at risk. Intel still sells Pentium III processors for notebooks. Therecore compatibility is a primary concern across the board. SSE2 coders must test for compatibility and include an MMX optimized version of their code anyway...

I am interested in the timings for a partially unrolled loop of scali's MMX. I'm wondering how much of a penalty the sub and jnz are, and therefore if one unroll will speed up measureably.

Can you include this test in the next release please. In addition to the standard MMX Test,

mov edi, (2048) / 4 ; we can do 4 words at a time with MMX
sub esi, 8

; scali code begin
movq mm0, [esi + (edi*8)]
pmullw mm0, [UBER_CONSTANT]
movq [esi + (edi*8)], mm0
; scali code end

sub edi, 1
jnz .loop

Please have scali MMX partial unroll...

mov edi, (2048) / 8 ; we can do 4 words (twice) at a time with MMX
sub esi, 16

; scali code (twice) begin
movq mm0, [esi + (edi*8)]
pmullw mm0, [UBER_CONSTANT]
movq [esi + (edi*8)], mm0

movq mm0, [esi + (edi*8)+8]
pmullw mm0, [UBER_CONSTANT]
movq [esi + (edi*8)+8], mm0 ; scali code end

sub edi, 2
jnz .loop

Thanks much.
Posted on 2003-04-29 08:10:48 by V Coder

SSE2 coders must test for compatibility and include an MMX optimized version of their code anyway...

Sure, just like you shouldn't provide solely 3dnow. I think it's safe+okay to require a minimum of pentium+mmx, unless you're specifically targetting lesser CPUs.

When doing routines optimized for the more funky instruction sets, you should choose either the optimized or the "plain" version runtime, depending on CPU capabilities. This can be done in a number of ways. The easiest is probably a function pointer, but you could do code overwriting too.

I'll have a look at your unrolled routine when I get home.
Posted on 2003-04-29 08:39:53 by f0dder
V Coder, don't forget the latency of the multiple instruction - you will have to unroll further and use more mmx registers to get that jump in speed your looking for!
Posted on 2003-04-29 10:21:43 by bitRAKE
On my P4, the MMX unrolling had about no effects. On my kid brothers' athlon700, it improved stuff nicely. The unrolling is probably somewhat naive :).
MMX*1: 002263 ticks (effective 0.773 clk/mul)
MMX*4: 001162 ticks (effective 0.397 clk/mul)

Original routine:

mov eax, [esp+4]
mov ecx, (2048) / 4 ; we can do 4 words at a time with MMX
sub eax, 8

align 16
; scali code begin
movq mm0, [eax + (ecx*8)]
pmullw mm0, [UBER_CONSTANT]
movq [eax + (ecx*8)], mm0
; scali code end

sub ecx, 1
jnz .loop

mov eax, 1
emms ; yeah, required for MMX
ret 4

4x unrolled:

; we can do 4 words at a time with MMX, and loop is 2x unrolled - but x16
; isn't possible in effective address calculation, so we do it the old way.
mov eax, [esp+4]
mov ecx, (2048*2)
movq mm0, [UBER_CONSTANT]

align 16
movq mm1, [eax + 0]
movq mm2, [eax + 8]
movq mm3, [eax + 16]
movq mm4, [eax + 24]

pmullw mm1, mm0
pmullw mm2, mm0
pmullw mm3, mm0
pmullw mm4, mm0

movq [eax + 0], mm1
movq [eax + 8], mm2
movq [eax + 16], mm3
movq [eax + 24], mm4

add eax, (4*2*4)
sub ecx, (4*2*4)
jnz .loop

mov eax, 1
ret 4

You will notice that the 4x unroll loads the constant in a register, while the single version constantly uses it from memory. On the P4, there wasn't a difference between the two. Also, the 4x unroll has a "plain" loop structure (no "complex" effective addresses, add+sub), where the 1x version has "eax + (ecx*8)" - on my P4, changing the 1x version to "plain" caused a major slowdown.
Posted on 2003-04-29 11:06:18 by f0dder
f0dder, 0.375 clk/mul is maximum possible on Athlon MMX - that's close enough. :)
Posted on 2003-04-29 12:44:06 by bitRAKE
Well, that's fast. I like the code rearrangement too...

How soon can we download yodel_06 or later? Thanks.
Posted on 2003-04-30 07:09:34 by V Coder
I have a preliminary linux version of yodel running now; if I'm not too trashed after work, I might work on rdtsc timing in the evening.
Posted on 2003-04-30 07:17:39 by f0dder