Nice stuff Thomas.. I like the idea of putting challenges on the board, keeps the brains fresh. ;)

I'd like to propose one not-too-trivial practical problem I came across some time ago.. let's see:

Input: You have a 32bit CPU register containing four bytes, let's call them ABCD.

Output: You have to output 3 registers (one of which will be the original register), containing the original ABCD components, but reordered this way:

orig.reg = DBCD
2nd.reg = CDBC
3rd.reg = BCDB

The challenge for the most optimized asm code is open ;-)
I didn't spend much time on my code but it seems very efficient. I'll post it after you all.. and if you wrote a better one, I'll jump into the challenge myself to try to do even better. :)

Greets,
Maverick
Posted on 2002-02-09 13:03:29 by Maverick
Nice stuff Thomas.. I like the idea of putting challenges on the board, keeps the brains fresh.

Well it was The Svin's idea, I just posted the challenge..

For your challenge, here's one solution... to be the first :), it can probably be done much better.



mov ecx, eax
shl eax, 8
mov al, cl
ror eax, 8
mov ecx, eax
shl ecx, 16
ror eax, 8
mov edx, eax
mov cx, ax
rol eax, 8
ror edx, 16
mov dl, ch

edit:
eax = src & first output
ecx = second output
edx = third output

Thomas
Posted on 2002-02-09 13:44:02 by Thomas
A shorter and faster version:


mov ecx, eax
shl eax, 8
mov edx, eax
mov al, ah
ror ecx, 16
mov dl, cl
ror eax, 16
mov cx, ax
rol eax, 8


About 4 clockcycles on an athlon thunderbird.

Thomas
Posted on 2002-02-09 13:56:03 by Thomas
This only uses rol's and mov's:


rol eax, 8
mov edx, eax
rol eax, 8
mov ecx, eax
mov ah, dh
mov dl, al
rol eax, 16
mov ch, ah

; eax = DBCD
; ecx = CDBC
; edx = BCDB
Posted on 2002-02-09 17:39:02 by eet_1024
Hi eet_1024,

I can talk only at the end, but I must warn you that your routine produces a wrong result in ECX:

It produces CDCB instead of CDBC.

Just wanted to inform you about that.

When the posts stop I will start a deeper discussion on all the implications, and why did I say "not-too-trivial practical problem".. and the seriously understimated importance that development tools have in my opinion. :)

Greets,
Maverick
Posted on 2002-02-09 17:55:33 by Maverick
Fixed:


shl eax, 8
mov ecx, eax
mov edx, eax
shr ecx, 16
or dl, ch
shl eax, 8
or ecx, eax
or ax, dx
rol eax, 16


After finishing this, I noticed that it looks very similar to Thomas's.
Posted on 2002-02-09 20:56:36 by eet_1024
This is a little different:
	mov edx,eax ; ABCD

shl eax,8 ; BCD0
xchg dl,dh ; ABDC
mov ecx,eax ; BCD0
bswap edx ; CDBA
mov al,ah ; BCDD
mov cl,dh ; BCDB *
ror eax,8 ; DBCD *
;-
mov dl,ah ; CDBC *
Should be fast, too? I am very interested in the explaination for needing this algo, and await your comments, Maverick.
	mov edx,eax ; ABCD

shl eax,8 ; BCD0
xchg dl,dh ; ABDC
mov ecx,eax ; BCD0
mov al,ah ; BCDD
bswap edx ; CDBA
ror eax,8 ; DBCD *
mov cl,dh ; BCDB *
mov dl,ah ; CDBC *
Posted on 2002-02-09 22:46:12 by bitRAKE
Another solution :



mov eax, 0AABBCCDDh ;// eax = ABCD
mov edx, eax ;// edx = ABCD
mov ecx, eax ;// ecx = ABCD
shl edx, 8 ;// edx = BCD0
shld ecx, edx, 8 ;// ecx = BCDB
shld eax, ecx, 24 ;// eax = DBCD
shrd edx, eax, 16 ;// edx = CDBC
Posted on 2002-02-10 03:31:35 by Dr. Manhattan
This is even slower than Dr. Manhattan's. ;)
20 bytes (one byte larger than Dr. Manhattan's).
push eax

rol eax,8
push eax ; Stack = ADCBDCBA
mov eax,[esp + 1] ; DBCD
mov edx,[esp + 2] ; CDBC
mov ecx,[esp + 3] ; BCDB
add esp,8
This is the fastest on my machine:
	mov edx,eax ; ABCD

rol eax,8 ; BCD0
rol edx,16 ; CDAB
mov ecx,eax ; BCD0
mov al,ah ; BCDD
rol dx,8 ; CDBA
ror eax,8 ; DBCD *
mov cl,dh ; BCDB *
mov dl,ah ; CDBC *
...under 4 ticks on Athlon. ;)
Posted on 2002-02-10 10:25:57 by bitRAKE
shld and shlr are sloooow :)

By the way, how do you measure the number of ticks ?

I use the following code

.data
TICK_A DWORD 0

.CODE

rdtsc
mov TICK_A, eax

; code...

rdtsc
sub eax, TICK_A

But the timings are not constant, even when there is no code to test !
Posted on 2002-02-10 12:33:06 by Dr. Manhattan
There is a very long winded explaination as to why, but I use the lowest count of several loops with overlaping copies. Here is the code:
Posted on 2002-02-10 12:40:57 by bitRAKE
Originally written by Maverick

Nice stuff Thomas.. I like the idea of putting challenges on the board, keeps the brains fresh.

I'd like to propose one not-too-trivial practical problem I came across some time ago.. let's see:

Input: You have a 32bit CPU register containing four bytes, let's call them ABCD.

Output: You have to output 3 registers (one of which will be the original register), containing the original ABCD components, but reordered this way:

orig.reg = DBCD
2nd.reg = CDBC
3rd.reg = BCDB

The challenge for the most optimized asm code is open ;-)

Several replies came. I think that the proposed solutions span all possible implementations, so I think I could close the "competition".

Well, in my original post I called it a "not-too-trivial practical problem" for a reason: this is a good example of an algorithm for which probably there is not a good solution that spans all the CPU's. It is also a good example of how much a human skilled coder is better than compilers, and/or of how good a certain compiler is. Also, when doing your evaluations, keep in mind that all the modern CPU's are designed with the goal to execute as fast as possible the code generated by high level (mostly C/C++) compilers.. rather than human asm coders.

But it's also a good example of how much primitive are still today the standard tools that ~everybody uses and thinks that are "THE ONLY WAY THAT EXISTS".
I've always believed that tools are an extremely important part of the work, and I've always felt kinda lonely in this belief. I created my own programming language, compiler and source-level debugger on the Amiga, as well on the PC (work in progress), where, among many other things, I introduced a layered loader.. which, again among other things, can load different code depending on the CPU running in the system. So you can have automatically loaded the best routine for the PC your program is running on, and leave the others on disk. If you think that 3x times of difference in clock cycles count for an innerloop is a difference, then you will agree that the development time to spend on own development tools pays back (BTW: it took me less time to write my HLA language on Amiga plus my game Virtual Karting than if I wrote Virtual Karting 100% in assembly! And then the language was there to be reused, as well as most of my first HLA-made game's code. Note that my language has the same name of Randall Hyde's HLA, but Randall's one came many years after and has really nothing in common with my language (I was also a bit unhappy about the shared name, urgh, not his fault though; he even proved that IBM used HLA even before me)). Anyway, let's look at the results of our optimizations:

First of all, the routines:




Maverick - Athlon XP optimized n.1 (original routine):
; EAX = ABCD (entry)
shl eax,8 ; EAX = BCD0
mov ebx,eax ; EBX = BCD0
mov al,ah ; EAX = BCDD
mov ecx,ebx ; ECX = BCD0
rol ebx,8 ; EBX = CD0B
ror eax,8 ; EAX = DBCD (out 1)
mov cl,bl ; ECX = BCDB (out 3)
mov bh,bl ; EBX = CDBB
mov bl,ah ; EBX = CDBC (out 2)


Maverick - Athlon XP optimized n.2 (based on BitRake 2nd, and improved):
; EAX = ABCD (entry)
mov ecx,eax ; ECX = ABCD
rol eax,8 ; EAX = BCD0
rol cx,8 ; ECX = ABDC
mov ebx,eax ; EBX = BCD0
mov al,ah ; EAX = BCDD
bswap ecx ; ECX = CDBA
ror eax,8 ; EAX = DBCD (out 1)
mov bl,ch ; EBX = BCDB (out 3)
mov cl,ah ; ECX = CDBC (out 2)


Maverick - Pentium-II optimized (original routine, extremely similar to Dr.Manhattan's!!):
; EAX = ABCD (entry)
mov ecx,eax ; ECX = ABCD
mov ebx,eax ; EBX = ABCD
rol ecx,8 ; ECX = BCDA
shld ebx,ecx,8 ; EBX = BCDB (out 3)
shld eax,ebx,24 ; EAX = DBCD (out 1)
shrd ecx,eax,16 ; ECX = CDBC (out 2)


Thomas 1st attempt:
mov ecx, eax
shl eax, 8
mov al, cl
ror eax, 8
mov ecx, eax
shl ecx, 16
ror eax, 8
mov edx, eax
mov cx, ax
rol eax, 8
ror edx, 16
mov dl, ch


Thomas 2nd attempt:
mov ecx, eax
shl eax, 8
mov edx, eax
mov al, ah
ror ecx, 16
mov dl, cl
ror eax, 16
mov cx, ax
rol eax, 8


eet_1024:
shl eax, 8
mov ecx, eax
mov edx, eax
shr ecx, 16
or dl, ch
shl eax, 8
or ecx, eax
or ax, dx
rol eax, 16


BitRake 1st attempt:
mov edx,eax ; ABCD
shl eax,8 ; BCD0
xchg dl,dh ; ABDC
mov ecx,eax ; BCD0
bswap edx ; CDBA
mov al,ah ; BCDD
mov cl,dh ; BCDB *
ror eax,8 ; DBCD *
mov dl,ah ; CDBC *


BitRake 2nd attempt:
mov edx,eax ; ABCD
shl eax,8 ; BCD0
xchg dl,dh ; ABDC
mov ecx,eax ; BCD0
mov al,ah ; BCDD
bswap edx ; CDBA
ror eax,8 ; DBCD *
mov cl,dh ; BCDB *
mov dl,ah ; CDBC *


Dr.Manhattan:
;// eax = ABCD
mov edx, eax ;// edx = ABCD
mov ecx, eax ;// ecx = ABCD
shl edx, 8 ;// edx = BCD0
shld ecx, edx, 8 ;// ecx = BCDB
shld eax, ecx, 24 ;// eax = DBCD
shrd edx, eax, 16 ;// edx = CDBC


WatcomC/C++ 11.0 optimized code (one of the best optimizing compilers around):
shl ecx,0x00000008 ; ECX = entry
mov eax,ecx
mov edx,ecx
and eax,0x0000ff00
xor cl,cl
shr eax,0x00000008
or ecx,eax
mov eax,edx
mov esi,edx
shl eax,0x00000008
shr edx,0x00000018
or edx,eax
mov eax,ecx
shr eax,0x00000008
shl ecx,0x00000018
or ecx,eax
mov eax,edx
and si,0x0ff00
and eax,0x000000ff
xor dh,dh
or esi,eax
shl eax,0x00000008
or edx,eax
mov eax,ecx
and eax,0x0000ff00
xor dl,dl ; ECX = 1st
shr eax,0x00000008 ; EDX = 2nd
or edx,eax ; ESI = 3rd




I'm impressed about four things in particular:

1) The great disparity of optimal code depending on the CPU (and that's why I called it a not-so-trivial *practical* problem). I will say more about this soon.

2) One of my solutions was almost identical to Dr.Manhattan's one.. (too bad that although it looks like the most elegant and short of all, it's actually by far the slowest on my Athlon XP! SHLD/SHRD are such a good resource that I wish it was more optimized in these CPU's.. I kept it only because I later found it to be still optimal on PentiumII, half cycle better than a previous routine I wrote which used only 32bit registers to avoid partial register stalls on those CPU's). Also, I wrote a stack-based routine which looked extremely elegant too, but didn't perform well on most modern CPU's.

3) Many solutions were extremely similar to each other.. showing that we're bare to the metal already (if it was necessary to show this).

4) Me (Maverick), BitRake and Dr.Manhattan all used the same style in commenting.. shows maybe a similar structured mind. ;)

---

Now it's time for some benchmarking. Available machines are:

Athlon-XP
Pentium-II
K6-2
Pentium-S

Benchmark is done by counting the number of CPU cycles that it takes to compute each routine unrolled 10 times and looped 1,000,000 of times. The total clock cycles count is then divided by 10,000,000. The benchmark is performed under Dos (i.e. no virtual memory nor multitasking interferences).

---

on the Athlon-XP:
4.20 cycles = Maverick - Athlon XP opt.1
3.70 cycles = Maverick - Athlon XP opt.2
10.20 cycles = Maverick - Pentium-II opt
6.10 cycles = Thomas 1st attempt
4.40 cycles = Thomas 2nd attempt
4.30 cycles = eet_1024
5.00 cycles = BitRake 1st attempt
5.00 cycles = BitRake 2nd attempt
10.20 cycles = Dr.Manhattan
11.90 cycles = WatcomC/C++ 11.0 opt


on the Pentium-II:
9.20 cycles = Maverick - Athlon XP opt.1
11.10 cycles = Maverick - Athlon XP opt.2
7.00 cycles = Maverick - Pentium-II opt
12.20 cycles = Thomas 1st attempt
9.30 cycles = Thomas 2nd attempt
10.10 cycles = eet_1024
17.50 cycles = BitRake 1st attempt
11.70 cycles = BitRake 2nd attempt
7.00 cycles = Dr.Manhattan
32.00 cycles = WatcomC/C++ 11.0 opt


on the K6-2:
10.00 cycles = Maverick - Athlon XP opt.1
12.30 cycles = Maverick - Athlon XP opt.2
9.30 cycles = Maverick - Pentium-II opt
16.60 cycles = Thomas 1st attempt
11.00 cycles = Thomas 2nd attempt
11.40 cycles = eet_1024
13.20 cycles = BitRake 1st attempt
11.10 cycles = BitRake 2nd attempt
9.00 cycles = Dr.Manhattan
24.00 cycles = WatcomC/C++ 11.0 opt


on the Pentium-S:
7.00 cycles = Maverick - Athlon XP opt.1
9.00 cycles = Maverick - Athlon XP opt.2
15.00 cycles = Maverick - Pentium-II opt
12.00 cycles = Thomas 1st attempt
9.00 cycles = Thomas 2nd attempt
7.00 cycles = eet_1024
10.00 cycles = BitRake 1st attempt
9.00 cycles = BitRake 2nd attempt
15.00 cycles = Dr.Manhattan
20.00 cycles = WatcomC/C++ 11.0 opt


---

My development machine is an Athlon-XP, but I can have (rare) testing access on those other listed machines as well.

It is my habit now to produce (for my key routines, where speed matters) different versions for each CPU. My own executable format's loader will then load the right routine for the host PC. This includes also MMX2,3DNow,etc..

If you do not think that this is useful, then.. for example, if I had to choose my Pentium-II optimized routine for all, then it would be nearly 3 times suboptimal on Athlon-XP's!!! Is 3x an significant difference? It is the same difference (well, ratio ;) ) between 1700 MHz and 5100 MHz.. it is not much, right? "Why bother.. use a C compiler, it's unbeatable", and the virtual 5100 MHz drop to 1115 (Pentium-II case). Buy a faster CPU, eh!! ;-)

Run-time checks are tedious, and you also waste RAM keeping all the versions together; it's much better to have a loader that chooses the right routine for you. A loader can do much more for you, it can resolve inter-module referiments (dynamic linking) and e.g. avoid many virtual function pointers, for increased performance not possible with the standard .EXE OS loader, and give other benefits as well (also for protection purposes).

---

PS: BitRake,
The use of this routine is simply to pack xRGB values into RGBRGBRGBRGB, so to e.g. quickly draw horizontal lines or bars on a 24bit screen, and not have to go to 32bit which steals precious bandwidth, or anyway in the cases where one has to work with 24bit screens. When I faced this problem I thought it was a very good testbench for my compiler's optimizer, and for any human coder.. and a good example of how key routines (as e.g. gfx routines) should come in many versions.. one of each possible host CPU, and automatically loaded. Also, you gave me the possibility to improve further my Athlon-XP version. ;)

I'm forwarding this post now to comp.lang.asm.x86, I think it may interest some people there.

Greets,
Maverick
Posted on 2002-02-10 13:01:45 by Maverick
Maverick, I get 3.9000023 cycles for your best on my Athlon? Mine is 3.8000021, very strange? I tried to use the same test conditions you outlined. Could you post the test proggie, or use mine above? Can you try the second version from my second post:
	mov edx,eax ; ABCD

rol eax,8 ; BCD0
rol edx,16 ; CDAB
mov ecx,eax ; BCD0
mov al,ah ; BCDD
rol dx,8 ; CDBA ; faster than mov dh,dl?
ror eax,8 ; DBCD *
mov cl,dh ; BCDB *
mov dl,ah ; CDBC *
The XCHG instruction is too costly.

I find that the added bandwidth of 32 bits is offset by the speed of MMX and the use of an alpha channel.

I like the sound of your loader being able to use different version of code! :alright:
Posted on 2002-02-10 14:57:51 by bitRAKE
Howdy bitRAKE :)

Maverick, I get 3.9000023 cycles for your best on my Athlon? Mine is 3.8000021, very strange? I tried to use the same test conditions you outlined. Could you post the test proggie, or use mine above? Can you try the second version from my second post:

Done.. here's the benchmark results with your added routines, on a Athlon-XP (not Athlon.. but I will have access to an Athlon machine in 3 days and test it also there):

Maverick Athlon-XP opt.1: cycles=4.20
Maverick Athlon-XP opt.2: cycles=3.70
Maverick Pentium-II: cycles=10.20
Thomas 1st attempt: cycles=6.10
Thomas 2nd attempt: cycles=4.40
eet_1024 fxd attempt: cycles=4.30
BitRake 1st attempt: cycles=5.00
BitRake 2nd attempt: cycles=5.00
BitRake 3rd attempt: cycles=18.00
BitRake 4th attempt: cycles=3.80
Dr.Manhattan attempt: cycles=10.20
C much opt. routine: cycles=11.90

I'm attaching the sourcecode of my test program.. I wrote it in 5 minutes, it doesn't subtract some setup cycles etc.. today I've been with my girlfriend and didn't really have much time for coding.. the results are reliable, though. I guess the difference is because of your Athlon vs my Athlon-XP.
Also, I'm sure you're not running it in Windows though.. since that would false the results a bit, and make them not constant (I guess this is the problem that Dr.Manhattan was referring to).

---

I find that the added bandwidth of 32 bits is offset by the speed of MMX and the use of an alpha channel.

Indeed.. I'm not a fan of 24bit at all. I just *had* to use it for someone that commissioned me a library ;)
There was a GUI specific case in which I've chosen to use it, though.. but it's rare.

If you're curious about what I often like and use.. and maybe even why.. normally I use 3 byteplanes (4 or more sometimes, for added effects), and for some gfx things the "classical CPU" is still much better than MMX. ;)
We can discuss more about it.. but there are some things that I will keep for myself.

Byteplanes are 256 colors 2D arrays of only one color component.
I use them because they're much more cache-friendly.. it's clear that doing 8bit gfx three times what fits in the cache is better than doing one time 32bit what doesn't fit in the cache. ;)
Moreover, it simplifyes some tricky algorithms. I can also use wordplanes, and custom color formats in certain cases. I've always been kinda maniacal in this aspect, I even designed a hardware device on the Amiga (called AgaEXTENDER, you can search the Internet for it) that if Viscorp and Escom didn't go bankruptcy (after Commodore) would have given coders new impressive possibilities to do nice things.

Now about the CPU vs MMX part: my font routine, but not only that, (sprites and more..) loads gfx but then precompiles it into machinecode. This means that e.g. an "A" becomes a bunch of MOV instructions, and some LEA or ADD. Even better a "I". I could print 8 millions of characters (12x16) on real world tests very easily on a Athlon-800 @ 100MHz fsb and SDR memory this way. The same is for sprites with e.g. many holes in them.. and even more for other advanced stuff about which I can't talk, sorry.
Of course MMX is still very ok for other cases, as well as hardware 3D acceleration is. There's no rule of thumb.. only specific cases and specific solutions.

Greets,
Maverick
Posted on 2002-02-10 16:29:00 by Maverick
Thanks for posting the bench code. Very strange that there is a diffence between XP and standard Athlon? Guess there really was some changes in the CPU.

I totally agree about breaking the data up when it benefits the algo, and MMX certainly has it's limitations.
Posted on 2002-02-10 18:41:13 by bitRAKE

Thanks for posting the bench code.
You're wellcome :)

Very strange that there is a diffence between XP and standard Athlon? Guess there really was some changes in the CPU.
According to AMD there is.. my personal experience at the begin was that there wasn't any difference instead, so I suspected that the claimed better performances of the XP were rather because of the support of SSE/SSE2. But given our thread's results, I now believe that also the core was slightly improved. In any case, I'll know better in a couple of days (getting my hands on my old Athlon machine).

Greets,
Mav
Posted on 2002-02-11 16:51:06 by Maverick