Here is a program (written in HLA - sorry folks... ) which tests the speed of addition of two BCD numbers - actually reverse addition of a 1 megabyte BCD number to itself.

The number is chosen to cause a carry after each addition.

The addition is repeated several times and a clock cycle count per single digit addition calculated. (The total cycle count is shown in decimal if edx=0, else in hexadecimal)

Executable and HLA source (and the MASM output of HLA) are enclosed in the zip file below.

Feel free to post optimizations (incl. MMX, SSE, SSE2) and times.



// **********************************
// Reverse Addition Test Program
// Coded using HLA
//
// **********************************

program RevaddCheck;
#include( "stdlib.hhf" );

const
memtoalloc :=$200000; // [2MB = 1048576 bytes]
upperalloc :=$100000; // [1MB = 1048576 bytes]
numberadds :=50;
static (4)

mem1addr: pointer to byte;
mem2addr: pointer to byte;
mem1lsb: pointer to byte;
mem2lsb: pointer to byte;
numadd: int32:=numberadds;

t:time.timerec;

begin RevaddCheck;

console.cls();
console.gotoxy(4, 15);
stdout.put ( "Addition Test Program.", nl);
console.gotoxy(5, 15);
stdout.put ( " Coded using HLA", nl, nl);
stdout.put ( "This program times the reverse addition (using aaa)", nl);
stdout.put ( "of a megabyte length BCD number to itself...", nl, nl, nl);

stdout.put ( "Initializing memory. Initial Time: ");
time.curTime(t);
// stdout.put(t.hours:2, ':', t.mins:2, ':', t.secs:2);
stdout.puti16Size(t.hours, 2, '0');
stdout.put(':');
stdout.puti8Size(t.mins, 2, '0');
stdout.put(':');
stdout.puti8Size(t.secs, 2, '0');


// Allocate RAM; On return EAX contains address;
malloc (memtoalloc);

// Store memory address
mov (eax, mem1addr);
add (upperalloc,eax);
mov (eax, mem2addr);
dec (eax);
mov (eax, mem1lsb);
add (upperalloc,eax);
mov (eax, mem2lsb);

// Clear memory
mov (mem1addr, esi);
mov (mem2addr, edi);
xor (eax, eax); // Fill m3m2 with BCD 0000
mov ($07060504,edx); // Fill mem1 with BCD 7654
mov (upperalloc, ecx);
sub (4, ecx);
clm1:
mov (edx, [esi+ecx]);
mov (eax, [edi+ecx]);

sub (4, ecx);
jns clm1;

// Commencing Message
stdout.put ( nl, nl,"Commencing addition. Time: ");
time.curTime(t);
// stdout.put(t.hours:2, ':', t.mins:2, ':', t.secs:2);
stdout.puti16Size(t.hours, 2, '0');
stdout.put(':');
stdout.puti8Size(t.mins, 2, '0');
stdout.put(':');
stdout.puti8Size(t.secs, 2, '0');

// ADD NUMBER at ds:esi to its reverse at ds:ebx; store at ds:edi
// actually [esi], [ebx] are the least, and most significant digit
// respectively of the source number and [edi] is the LSD of the target

rdtsc; // First measure of time
// rdtsc takes 13 cycles on Pentium MMX,
push (edx); // I store the 64 bit cycle count on the stack.
push (eax);

adding:
mov (mem1lsb, esi); // source
mov (mem1addr, ebx); // source
mov (mem2lsb, edi); // destination
mov (upperalloc, ecx); // number of digits


// add byte pairs with aaa
#asm

add1: mov al,[esi]
adc al,[ebx]
aaa
mov [edi],al
dec esi
inc ebx
dec edi
dec ecx
jnz add1
#endasm
// Note the final carry from the final addition is not stored

// repeat numadd times
dec (numadd);
jne adding;

// Check clocks
rdtsc; // Second measure of time
push (edx);
push (eax);

// Completion Message
stdout.put ( nl, nl,"Addition Completed. Time: ");
time.curTime(t);
// stdout.put(t.hours:2, ':', t.mins:2, ':', t.secs:2);
stdout.puti16Size(t.hours, 2, '0');
stdout.put(':');
stdout.puti8Size(t.mins, 2, '0');
stdout.put(':');
stdout.puti8Size(t.secs, 2, '0');

mov (upperalloc, ebx);

stdout.put(nl,nl);
stdout.putu32(ebx);
stdout.put (" single-digit additions performed ");
stdout.putu32(numberadds);
stdout.put (" times ");

// Retrieve 2nd rdtsc
pop (eax);
pop (edx);

sub ([esp], eax); // subtract first from second
sbb ([esp+4],edx); // result in EDX:EAX

add (8, esp); // remove first edx, eax from stack


if (edx=0) then
stdout.put (nl, "in ");
stdout.putu32(eax);
stdout.put(" clock cycles = ", nl);
else
stdout.put (nl, "in $",edx," ",eax, " clock cycles = ", nl);
endif;

mov (upperalloc, ebx);
div (ebx, edx:eax);
stdout.putu32(eax);
push (eax);
stdout.put (" clock cycles for ");
stdout.putu32(numberadds);
stdout.put (" single digit additions =",nl);

pop (eax);
xor (edx,edx);
mov (numberadds, ebx);
div (ebx, edx:eax);

stdout.putu32(eax);
stdout.put (" clock cycles per single digit addition.",nl,nl);

end RevaddCheck;


I get variable timings (internet connected) of 16 cycles (and above) per addition on a Pentium MMX, and of 64 cycles (and above) on a K6-2.

Why so slow? Alignment?
Posted on 2003-05-01 06:31:10 by V Coder
Try changing dec xxx to sub xxx,1. That's the only speed optimisation I can find. Anyway, Athlon is weird. xor eax,eax is slower on Athlon than on intel.
Posted on 2003-05-01 06:48:41 by roticv
Thanks roticv, but.... as I was just about to do that ... I wondered if the rationale was that ADD & SUB pair better than INC & DEC? or would they take up less space?

So, I went to the Integer Pairing Tables, and found that whereas ADC and SBB pair in the U pipe only, ADD, SUB, INC and DEC all pair in U or V... So there should be no theoretical improvement with using Sub instead of DEC...

Then I noticed another thing: SUB breaks the main routine.

The routine is a simple loop ADC (with carry) {after the addition, the carry will always be 0, since two nibbles are added in al from , . any overflow past 9 will be recognized by ...} AAA {if al >9, al=al-10 and carry set; else carry remains cleared} INC & DEC pointers, maintaining carry flag {INC & DEC affect the Zero and Sign flags but not the carry flag}



// add byte pairs with aaa
#asm

add1: mov al,[esi]
adc al,[ebx]
aaa
mov [edi],al
dec esi
inc ebx
dec edi
dec ecx
jnz add1
#endasm


Once DEC ecx does not reach 0, the loop continues.

Unfortunately, SUB would also zero the carry flag, so the carry from the last addition would be lost...

I believe that this represents the most optimal routine for single digit addition. What would an MMX optimization or something for BCD addition look like?
Posted on 2003-05-01 21:11:36 by V Coder
This 4 digit optimization, http://www.asmcommunity.net/board/showthread.php?threadid=12949 based on Jones' algorithm, http://www.cs.uiowa.edu/~jones/bcd/bcd.html takes 10 cycles...



sub edi,3
sub esi,3
sub esi, ecx
sub edi, ecx

clc
pushf
add1: popf
mov eax,[esi+ecx]
bswap eax
adc eax,[ebx]

add eax, 0f6f6f6f6h
pushf
mov edx, eax
and edx, 30303030h
shr edx, 3
sub eax, edx
and eax, 0f0f0f0fh

bswap eax
mov [edi+ecx],edx
add ebx, 4
sub ecx, 4
jnz add1

popf1


Please see Jones webpage for an explanation of how it works.
Posted on 2003-05-04 01:02:24 by V Coder
You most probably don't always have a multiple of 4 bytes in the "string" you have to add. I assume you would follow up with single-byte additions to complete the overall addition.

I can also only assume that it was a typo in your posted code. I would think that
mov ,edx should have been mov ,eax.

Raymond
Posted on 2003-05-04 11:04:17 by Raymond
Here's the joy of it...

You don't have to follow up with single digit adds!!!!

And thanks, about the _edx_ thing. Nope I'm not blind. Actually, its a test routine, and I was trying to avoid a pairing problem with writing to a register (mov edx, eax) and changing it right after (and edx...), so I changed to and eax..., but some of the final changes did not make it to the post.

This is the corrected version. It reported 7 cycles immediately when I tested it. Thereafter it is reporting 9 cycles. :grin:



sub edi,3
sub esi,3
sub esi, ecx
sub edi, ecx

clc
pushf
add1: popf
mov eax,[esi+ecx]
bswap eax
adc eax,[ebx]

add eax, 0f6f6f6f6h
pushf
mov edx, eax
and eax, 30303030h ; use eax here instead of edx
shr eax, 3
sub edx, eax
and edx, 0f0f0f0fh ; result in edx

bswap edx
mov [edi+ecx],edx
add ebx, 4
sub ecx, 4
jnz add1

popf
Posted on 2003-05-04 12:14:34 by V Coder