In a separate thread I use ADC and AAA in adding BCD digits of a number.

http://www.asmcommunity.net/board/index.php?topic=12934

The main loop is as follows:



// add byte pairs with aaa
#asm

add1: mov al,[esi]
adc al,[ebx]
aaa
mov [edi],al
dec esi
inc ebx
dec edi
dec ecx
jnz add1
#endasm


(Yes ESI is decremented while EBX is incremented - that's because it is a reverse addition - as in the palindrome quest...)

I believe this is the most optimal single digit addition routine, excluding SIMD routines with MMX, etc (except if I align it on a 16 byte boundary, I suppose). Is it?
Posted on 2003-05-01 21:22:01 by V Coder
add1:     mov al,[esi]

dec esi

adc al,[ebx]
inc ebx

aaa

mov [edi],al
dec edi

dec ecx
jnz addl
Look at the latency of the AAA instruction and how long 8 bytes takes to execute - MMX could do better. On the Athlon this loop takes 10 cycles. Your test says 14 cycles for loop?
Posted on 2003-05-01 22:58:31 by bitRAKE
Actually, bitRake, I originally wrote it something like that, (months ago) but thought the further away inc esi is from accessing, better to allow pairing... I can't remember what kind of stall it is now.

However, from the pairing rules, I think both ways pair identically, and when I tested various variations, the program speed was not affected (not with RDTSC).

On my Pentium MMX, 16 cycles is the best I get for any mod of this routine...

But I have one more change I'll try (dec esi, inc ebx just after aaa)... Nope. Same 16 cycles.
Posted on 2003-05-01 23:30:10 by V Coder

V Coder, I met a woman on chat from where you are. She seems really nice. Maybe I can introduce you to each other? She says she is 20 and has her own business. Who knows?


Try this one?:
	sub	esi, ecx

sub edi, ecx

addl: mov al, [esi+ecx]

adc al, [ebx]
inc ebx

aaa

mov [edi+ecx], al
dec ecx

jnz addl
Convert Norbert Juffa's algorithm to MMX - you'd have a real winner!
Propagating the carry across MMX registers is the really hard part.

What do you need the algorithm for? Why not convert to binary and back? In some cases keeping the numbers BCD is faster, but that is very rare case, IMHO.

My explaintion for your timing is that memory is slow and your operating on data that will not fit in cache, so part of your timing is the memory access time. Try the test on (cache size / 4) byte numbers and see if it is faster?
Posted on 2003-05-01 23:57:03 by bitRAKE
Thanks. I'll try that routine.


(Yes ESI is decremented while EBX is incremented - that's because it is a reverse addition - as in the palindrome quest...)


I'm working on palindrome 196. Reverse and add a number to itself. Repeat if the result is not a palindrome {121, 235747532, etc}. 196 has been reversed and added to itself (196+691=887+788=1675....) probably 220 million times to yield over 90 million digits over about 3 1/2 years and has not yet yielded a palindrome. It is not expected to. Of course the machines performing the checks now are lots faster than those used earlier, and the software is highly optimized (MMX, SSE2), etc. I am just trying to optimize a program to perform the reverse and add, to see if I can get one of comparable speed as others...

Juffa's algorithm is found here, along with an algorithm for unpacked BCD addition. I'll try this routine too, and let you know if it is any faster.
http://www.cs.uiowa.edu/~jones/bcd/bcd.html
Posted on 2003-05-02 00:51:25 by V Coder
Perhaps you should use tables to perform the addition.
Instead of adding BCD numbers, you just have to add 2 binary digits (0-063h).
Or perhaps use one digit per byte, since we have now computers with 1Gb RAM.

JC
Posted on 2003-05-02 18:28:42 by MCoder

Perhaps you should use tables to perform the addition.
Instead of adding BCD numbers, you just have to add 2 binary digits (0-063h).


Not sure I understand...


Or perhaps use one digit per byte, since we have now computers with 1Gb RAM.


Yup, I am using 1 digit per byte - unpacked BCD numbers.

The version based on Jones algorithm (to add two 32-bit 4 digit numbers instead of two 8-bit single digit numbers) takes 12 cycles instead of 16. I suppose an MMX version to add two 8 digit numbers would increase the speed slightly more. edit - after editing as below, it takes 10 cycles

I have reached the stage I believe I need to learn how to increase thread priority in HLA.

Please visit Jones' website to learn more about adding BCD numbers, and the rationale for the strange algorithm.

Hey, MMX, 3dNow, SSE and SSE2 coders, any optimizations???

          sub edi,3

sub esi,3
sub esi, ecx
sub edi, ecx


clc
pushf
add1: popf
mov eax,[esi+ecx]
bswap eax
adc eax,[ebx]

add eax, 0f6f6f6f6h
pushf
mov edx, eax
and edx, 30303030h
shr edx, 3
sub eax, edx
and eax, 0f0f0f0fh

bswap eax
mov [edi+ecx],edx
add ebx, 4
;** sub edi, 4 ;removed
;** sub esi, 4 ;removed
sub ecx, 4
jnz add1

popf


Edit - 2 subtractionss and attachment removed. Fixed below.
Posted on 2003-05-03 23:52:29 by V Coder
V Coder, memory violation on add_test_jo.
Posted on 2003-05-04 01:02:48 by bitRAKE
Rake, I get the error in Windows 2000 (Pentium 4), but not on Windows 95 (Pentium MMX).

Sorry, I see my error... I decrement edi and esi twice effectively (by subtracting 4 from edi, esi, and then subtracting 4 from ecx, when the index used is edi/esi +ecx)

After fixing it, the version takes 10 cycles.

add_test_jo
adding:

// add dword pairs with aaa
#asm
sub edi,3
sub esi,3
sub esi, ecx
sub edi, ecx


clc
pushf
add1: popf
mov eax,[esi+ecx]
bswap eax
adc eax,[ebx]

add eax, 0f6f6f6f6h
pushf
mov edx, eax
and eax, 30303030h
shr eax, 3
sub edx, eax
and edx, 0f0f0f0fh

bswap edx
mov [edi+ecx],edx
add ebx, 4
sub ecx, 4
jnz add1

popf
#endasm
Posted on 2003-05-04 01:40:41 by V Coder
Perhaps you should use tables to perform the addition.
Instead of adding BCD numbers, you just have to add 2 binary digits (0-063h).
Or perhaps use one digit per byte, since we have now computers with 1Gb RAM.


V-Coder, I realized that you were adding one digit at a time.
Instead of working one digit at a time, wouldn't it be faster to work on 2 digits ?
Ok, here is some unoptimized code to show my idea:



inverttable db 100 dup (?) ;this table will contain the reversed numbers: 10 -> 01, and 01->10
xor eax, eax
xor ebx, ebx
xor ecx, ecx

mov al,[esi]
mov bl,[edi]
add bl,cl
xor ecx,ecx
add bl,[inverttable+eax]
cmp bl,100
jb skip
sub bl,100
inc ecx
skip:
mov [edx], bl



Perhaps this addition could be done efficiently in MMX ?

JC
Posted on 2003-05-04 05:56:26 by MCoder
MCoder you missed when I changed to 4 digits in 32-bits using Jones algorithm...

Actually, there's also no need for tables, since I am reversing not the bits in each digit, but the digits in the number 12345 becomes 54321 (Hexademinal: 0102030405 becomes 0504030201)... But thanks much.
Posted on 2003-05-04 12:24:49 by V Coder
This works beautifully: :)
.data


ALIGN 16

db 16 dup(0) ; this buffer is needed for algorithm overflow
Number1 db 16*8 dup(0)

db 16 dup(0) ; this buffer is needed for algorithm overflow
Number2 db 16*8 dup(0)

mmx_0A QWORD 0A0A0A0A0A0A0A0Ah

.code

Palendrome_Test PROC
lea edi, Number1

mov DWORD PTR [edi], 010906h ; 196
lea esi, [edi + 3]

lea ebp, Number2
lea ebx, [ebp + 3]

_0: call Palindrome_Sum

; carry flag set if block overflows
; adjust esi/ebx as number grows
adc BYTE PTR [ebx], 0
je _2
inc esi
inc ebx
_2: xchg ebp, edi
xchg ebx, esi

call Palindrome_Find
; loop until we find a palidrone
jne _0

ret
Palendrome_Test ENDP




Palindrome_Sum PROC
push esi ; start address of first number
push edi ; end address of buffer
push ebx ;
push ebp ; end address of destination buffer

lea eax, [esi + 15]
sub eax, edi
shr eax, 4
push eax ; 16 byte blocks to process

pcmpeqd mm6, mm6 ; mmx_FF
movq mm7, mmx_0A

clc
_0: ; unaligned high order digits (higher addresses)
mov eax, [esi - 1*4]
mov ebx, [esi - 2*4]
mov ecx, [esi - 3*4]
mov edx, [esi - 4*4]

bswap eax ; Athlon note: faster than MMX
bswap ebx
bswap ecx
bswap edx

; fix to ensure overflow into next byte
lea eax, [eax + 0F6F6F6F6h]
lea ebx, [ebx + 0F6F6F6F6h]
lea ecx, [ecx + 0F6F6F6F6h]
lea edx, [edx + 0F6F6F6F6h]

; 00 01 02 03 04 05 06 07 08 09
; F6 F7 F8 F9 FA FB FC FD FE FF

; Adding carry flags across MMX registers is slow, too

; aligned low order digits
adc eax, [edi + 0*4]
adc ebx, [edi + 1*4]
adc ecx, [edi + 2*4]
adc edx, [edi + 3*4]

lea esi, [esi - 4*4]
lea edi, [edi + 4*4]

; aligned low order digits
mov [ebp + 0*4], eax
mov [ebp + 1*4], ebx
mov [ebp + 2*4], ecx
mov [ebp + 3*4], edx

movq mm0, [ebp]
movq mm1, [ebp+8]
pcmpgtb mm0, mm6
pcmpgtb mm1, mm6
pandn mm0, mm7
pandn mm1, mm7
paddb mm0, [ebp]
paddb mm1, [ebp+8]
movq [ebp], mm0
movq [ebp+8], mm1

lea ebp, [ebp + 4*4]

dec DWORD PTR [esp]
jne _0

pop eax ; 16 byte blocks to process
pop ebp ; end address of destination buffer
pop ebx ;
pop edi ; end address of buffer
pop esi ; start address of first number
ret
Palindrome_Sum ENDP


Palindrome_Find PROC
or eax, -1
ret
Palindrome_Find ENDP
The numbers in the data section are just to show the format in memory - I'm sure you will allocate some larger blocks. All but four memory accesses are aligned (they are aligned sometimes, too)! If this isn't faster then it's only memory speed holding it back. Note: numbers are stored with least significant digit first. I will test some more before reporting an official timing on the algorithm.

call Palendrome_Test ; to get it started...
Posted on 2003-05-04 16:46:30 by bitRAKE
Here is a test program. 1000 Mhz Athlon clocks in at <7 secs for 2^16 digits.
This actually calculates the digits for the number 196.

2^20 digits takes 3 hours, 33 minutes, 23.52 seconds!

Sure beats ( this ) guy's 1 day and 18 hours!


; 2^11 ( 2k): 0 ms
; 2^12 ( 4k): 20 ms
; 2^13 ( 8k): 100 ms
; 2^14 ( 16k): 400 ms
; 2^15 ( 32k): 1 552 ms
; 2^16 ( 64k): 6 839 ms
; 2^17 ( 128k): 28 270 ms
; 2^18 ( 256k): 482 003 ms (no longer in the cache!)
; 2^19 ( 512k):
; 2^20 (1024k): 12 803 520 ms
Posted on 2003-05-04 17:49:32 by bitRAKE
Hey, wait a minute, you weren't supposed to post an entire palindrome program... I'm working on it...:rolleyes:

Anyway, check out the full list of times at http://home.cfl.rr.com/p196/milestones.html

and the software comparisons page at http://home.cfl.rr.com/p196/compare.html

You should probably be able to submit your code to be officially timed. And if he likes your code enough, he would use it until a faster one is developed.

My intention was to write a routine that was faster on my Pentium MMX 166 than the 1 day 18 hours on a Pentium II 266.
http://www.jasondoucette.com/worldrecords.html

My fastest routine (8 digit MMX) running in unreal mode for DOS finishes 1 million in 14.7 hours on a K6-2 550, 33% faster on a Pentium II 350, and 50% longer on a Pentium MMX 166 - so I have accomplished my objective...

Beyond this I wanted to submit for testing to the Software comparisons page.

Unfortunately, the tester prefers a Windows version. I estimate that with all the penalties I accrue from segment prefix, 32 bit prefix, my code should be at least four times as fast in Win32. My initial simple single digit add was half the speed on Pentium MMX. Worse yet, Pentium 4's hate 16 bit code with segment prefixes, etc. The unreal mode code runs barely faster on a 1.8GHz Pentium 4 than on a 166 MHz Pentium MMX.

Here's my problem now. I am coding in HLA, and there's some problem preventing the program from working in Win32. I port the routine over and it works for the first adds, then breaks. Plus HLA does not support all the MMX instructions, so I have to use a #asm, #endasm hack. (It must be so annoying to you MASM guys to see HLA code... my bad)

My interest here was to develop the best add routine...

I had discovered Jones' website months ago, but could not understand his algorithm until I read it carefully a few nights ago, and I was able to post code. I was debugging my MMX routine... Here it is, on my Pentium MMX, it takes 11 cycles, compared to Jones routine, 7-9 cycles... (I do not adjust thread priority yet).

It is slower than Jones' but I wrote it myself!!!:grin:
Posted on 2003-05-04 23:32:39 by V Coder
My MMX algorithm is as follows - only 4 digit is included, I also have it extended to 8 digits ...



#asm
sub edi,3
sub esi,3
sub esi, ecx
sub edi, ecx

clc
pushf
add1: popf
mov eax,[esi+ecx]
bswap eax
adc eax,[ebx]
add eax, 0f6f6f6f6h
pushf ; carry from addition

movd mm0,eax ; MMX
movd mm1,eax
#endasm
pcmpgtb (allff, mm1); // $ffffffffffffffff
pxor (allff, mm1); // reverse bits
pand (allf6, mm1);
#asm
psubb mm0, mm1
movd eax,mm1

bswap eax
mov [edi+ecx],eax
add ebx, 4
sub ecx, 4
jnz add1

popf
#endasm


Let me explain how it works...

After adding the two numbers together with carry from previous add, I add $f6f6f6f6 to eax to force the carry out of each digit. Now, any byte from f6-ff has not had the carry out, and any byte from 0-9 has had a carry. Store carry on stack

If it is from 0-9, all well and good. If from f6-ff, then f6 needs to be subtracted.

MMX signed byte compare with ffffffff. Any byte which is positive will now become ff, any byte which is ff or less will become 00. xor to switch from 00 to ff and vice versa. And with f6. Thus everywhere 0-9 is found the byte value is 00. Everywhere f6-ff is found the byte value is f6.
Subtract, pass back to eax, bswap and store...

Okay my algorithm has a problem.

I am storing all numbers in big endian format (or is that little?) The least significant number is in higher numbered bytes, as opposed to IA-32 architecture. This is why I have to bswap before store. Easy to change. I just have not done it yet, since I cannot get HLA to work with my routine... and I cannot code MASM to save anyone's life.
Posted on 2003-05-04 23:47:42 by V Coder

Hey, wait a minute, you weren't supposed to post an entire palindrome program... I'm working on it...:rolleyes:
Sorry, didn't mean to help too much - just got carried away.

V Coder, your doing well. Look at the algo above - it fixes all the problems you state your's has:

1) little endian numbers: only one set of non-aligned loads, only one bswap

2) pxor mm0, <FF> ; pand mm0, <F6> should be changed to pandn mm0, <F6> !!

3) you don't want to use MOVD it is really slow, use only MOVQ

4) almost no forward dependancies


On cached data each digit takes 2.66 clock cycles!
The faster your memory the faster the algorithm!
I still think it is twice as slow as it should be!

Do you know where I can get the files of numbers already computed? I'd like to start at 90 million digits if I can. I'm going to attempt a packed BCD version because memory is a bottleneck here and we have plenty of time to do more calculations.
Posted on 2003-05-05 00:21:33 by bitRAKE
Ah, I like your routine...

(Apart from the fact that you add $0a, and I subtract $f6 - as we say in Trinidad, six of one, half a dozen of the other).

Cute way of setting up allff - $ffffffffffffffff. You can see that I have not yet put allf6 and allff in mmx registers... So I have room for optimization.

I suppose moving the data back to memory is faster than movd and shift... which is what I do in my 8 byte version.

However, I believe you are wrong...After the pcmpgtd you need to xor $ffffffff. Wait, okay. That's a pandn. That's cute.

I suppose it is supported by a C launcher or something?

I only now realize that I do not need to push the flags here in this test program, since I do not affect the flags with the MMX comparisons...

Well, in my full program, I use an integer CMP to see if the number is a palindrome. Obviously CMP affects the flags, so I need to preserve it... Where do you check for the palindrome?

Actually, in theory, you do not need to specifically check for palindrome. If there is any carry in any addition, then the number being formed will not be a palindrome. That's the theory at least. It fails for small numbers, eg 29+92=121. And there are a couple of others. However, it is hard to imagine that any addition of a 90 million digit number will not produce a carry somewhere, so no one actually believes there will be a palindrome, so you do not really need to check right?

You need to keep score of how many digits in the new number, and how many additions have passed so far. Also, in my DOS unreal program I use BP, but in my windows port, I leave EBP severely alone.

I understand the LEAs. I did not think of them, but then again, I am a relative newbie. I started writing my routine 5 years 4 months ago (before the current "pioneers" started). I got my initial program working in Unreal mode, but thought it was too slow and did not want to stress my mind in learning Win32 API. So I let it go until Jan 1, 2003. I reworked the routines until I reached an MMX version, started teaching myself HLA. But I got stuck January 15th when I could not get the routines that work perfectly under NASM to work under HLA. I have yet to submit my program there for debugging. But I'm sure it is not my routines themselves, but the HLA packaging (yeah for sure!!!)... so, I'm here to try to learn MASM32, and/or fix up the HLA code... Meanwhile I wanted to develop my routines further.

Will I get flamed if I post my Unreal Mode routine here?

Here goes.

I compiles as is with NASMW. I have no idea what to do to make it work with the others.
Posted on 2003-05-05 00:35:17 by V Coder
Will I get flamed if I post my Unreal Mode routine here?
It fits in well with the discussion, IMHO.

I don't test for a palindrome because I'm not to 90 million digits yet - no need. The algorithm posted above moves ESI/EBX as the number grows - they point to the byte after the number. With some minor changes (prefetch) I am at 7.5 clock cycles for uncached data. :)
; 2^18 ( 256k):    482 003 ms uncached data

; 2^18 ( 256k): 343 944 ms " " w/ PREFETCHNTA
...a 28.6% improvement.
Posted on 2003-05-05 00:54:05 by bitRAKE
I took too long typing my answer. I started before you...

Jones' code uses 5 instructions as opposed to your 6, however.

You have to contact the guy at www.p196.org to get the 90 million byte file.

Have a look at my HLA please.

I am trying to debug it. With your deep understanding of ASM, it will probably only bring furrows to your forehead... (Why does V Coder accept reversed operands, mandatory brackets and semicolon?)

Thanks

You are slightly behind the fastest on the software comparisons page. But then again, you're on a slower clocked machine... Try Jones' routine too.
Posted on 2003-05-05 00:57:41 by V Coder
I have a record of the correct number of additions to reach each 10,000 digits.

10000 digits, 24094 additions;
20000 - 48150
50000 - 120428 etc.

Just for you to verify that your program is accurate (which goes without saying, V Coder :grin: ) I used it to verify each new routine I wrote, and the unreal mode code beeps at each 10,000, so I can time it manually.
Posted on 2003-05-05 01:21:50 by V Coder