In a separate thread I use ADC and AAA in adding BCD digits of a number.

http://www.asmcommunity.net/board/index.php?topic=12934

The main loop is as follows:

(Yes ESI is decremented while EBX is incremented - that's because it is a reverse addition - as in the palindrome quest...)

I believe this is the most optimal single digit addition routine, excluding SIMD routines with MMX, etc (except if I align it on a 16 byte boundary, I suppose). Is it?

http://www.asmcommunity.net/board/index.php?topic=12934

The main loop is as follows:

```
```

// add byte pairs with aaa

#asm

add1: mov al,[esi]

adc al,[ebx]

aaa

mov [edi],al

dec esi

inc ebx

dec edi

dec ecx

jnz add1

#endasm

(Yes ESI is decremented while EBX is incremented - that's because it is a reverse addition - as in the palindrome quest...)

I believe this is the most optimal single digit addition routine, excluding SIMD routines with MMX, etc (except if I align it on a 16 byte boundary, I suppose). Is it?

```
add1: mov al,[esi]
```

dec esi

adc al,[ebx]

inc ebx

aaa

mov [edi],al

dec edi

dec ecx

jnz addl

Look at the latency of the AAA instruction and how long 8 bytes takes to execute - MMX could do better. On the Athlon this loop takes 10 cycles. Your test says 14 cycles for loop?Actually, bitRake, I originally wrote it something like that, (months ago) but thought the further away inc esi is from accessing, better to allow pairing... I can't remember what kind of stall it is now.

However, from the pairing rules, I think both ways pair identically, and when I tested various variations, the program speed was not affected (not with RDTSC).

On my Pentium MMX, 16 cycles is the best I get for any mod of this routine...

But I have one more change I'll try (dec esi, inc ebx just after aaa)... Nope. Same 16 cycles.

However, from the pairing rules, I think both ways pair identically, and when I tested various variations, the program speed was not affected (not with RDTSC).

On my Pentium MMX, 16 cycles is the best I get for any mod of this routine...

But I have one more change I'll try (dec esi, inc ebx just after aaa)... Nope. Same 16 cycles.

**V Coder**, I met a woman on chat from where you are. She seems really nice. Maybe I can introduce you to each other? She says she is 20 and has her own business. Who knows?

Try this one?:

```
sub esi, ecx
```

sub edi, ecx

addl: mov al, [esi+ecx]

adc al, [ebx]

inc ebx

aaa

mov [edi+ecx], al

dec ecx

jnz addl

Convert **Norbert Juffa**'s algorithm to MMX - you'd have a real winner!

Propagating the carry across MMX registers is the really hard part.

What do you need the algorithm for? Why not convert to binary and back? In some cases keeping the numbers BCD is faster, but that is very rare case, IMHO.

My explaintion for your timing is that memory is slow and your operating on data that will not fit in cache, so part of your timing is the memory access time. Try the test on (cache size / 4) byte numbers and see if it is faster?

Thanks. I'll try that routine.

(Yes ESI is decremented while EBX is incremented - that's because it is a reverse addition - as in the palindrome quest...)

I'm working on palindrome 196. Reverse and add a number to itself. Repeat if the result is not a palindrome {121, 235747532, etc}. 196 has been reversed and added to itself (196+691=887+788=1675....) probably 220 million times to yield over 90 million digits over about 3 1/2 years and has not yet yielded a palindrome. It is not expected to. Of course the machines performing the checks now are lots faster than those used earlier, and the software is highly optimized (MMX, SSE2), etc. I am just trying to optimize a program to perform the reverse and add, to see if I can get one of comparable speed as others...

Juffa's algorithm is found here, along with an algorithm for unpacked BCD addition. I'll try this routine too, and let you know if it is any faster.

http://www.cs.uiowa.edu/~jones/bcd/bcd.html

(Yes ESI is decremented while EBX is incremented - that's because it is a reverse addition - as in the palindrome quest...)

I'm working on palindrome 196. Reverse and add a number to itself. Repeat if the result is not a palindrome {121, 235747532, etc}. 196 has been reversed and added to itself (196+691=887+788=1675....) probably 220 million times to yield over 90 million digits over about 3 1/2 years and has not yet yielded a palindrome. It is not expected to. Of course the machines performing the checks now are lots faster than those used earlier, and the software is highly optimized (MMX, SSE2), etc. I am just trying to optimize a program to perform the reverse and add, to see if I can get one of comparable speed as others...

Juffa's algorithm is found here, along with an algorithm for unpacked BCD addition. I'll try this routine too, and let you know if it is any faster.

http://www.cs.uiowa.edu/~jones/bcd/bcd.html

Perhaps you should use tables to perform the addition.

Instead of adding BCD numbers, you just have to add 2 binary digits (0-063h).

Or perhaps use one digit per byte, since we have now computers with 1Gb RAM.

JC

Instead of adding BCD numbers, you just have to add 2 binary digits (0-063h).

Or perhaps use one digit per byte, since we have now computers with 1Gb RAM.

JC

Perhaps you should use tables to perform the addition.

Instead of adding BCD numbers, you just have to add 2 binary digits (0-063h).

Not sure I understand...

Or perhaps use one digit per byte, since we have now computers with 1Gb RAM.

Yup, I am using 1 digit per byte - unpacked BCD numbers.

The version based on Jones algorithm (to add two 32-bit 4 digit numbers instead of two 8-bit single digit numbers) takes 12 cycles instead of 16. I suppose an MMX version to add two 8 digit numbers would increase the speed slightly more.

*edit - after editing as below, it takes 10 cycles*

I have reached the stage I believe I need to learn how to increase thread priority in HLA.

Please visit Jones' website to learn more about adding BCD numbers, and the rationale for the strange algorithm.

Hey, MMX, 3dNow, SSE and SSE2 coders, any optimizations???

```
sub edi,3
```

sub esi,3

sub esi, ecx

sub edi, ecx

clc

pushf

add1: popf

mov eax,[esi+ecx]

bswap eax

adc eax,[ebx]

add eax, 0f6f6f6f6h

pushf

mov edx, eax

and edx, 30303030h

shr edx, 3

sub eax, edx

and eax, 0f0f0f0fh

bswap eax

mov [edi+ecx],edx

add ebx, 4

;** sub edi, 4 ;removed

;** sub esi, 4 ;removed

sub ecx, 4

jnz add1

popf

Edit - 2 subtractionss and attachment removed. Fixed below.

**V Coder**, memory violation on add_test_jo.

Rake, I get the error in Windows 2000 (Pentium 4), but not on Windows 95 (Pentium MMX).

Sorry, I see my error... I decrement edi and esi twice effectively (by subtracting 4 from edi, esi, and then subtracting 4 from ecx, when the index used is edi/esi +ecx)

After fixing it, the version takes 10 cycles.

add_test_jo

Sorry, I see my error... I decrement edi and esi twice effectively (by subtracting 4 from edi, esi, and then subtracting 4 from ecx, when the index used is edi/esi +ecx)

After fixing it, the version takes 10 cycles.

add_test_jo

```
adding:
```

// add dword pairs with aaa

#asm

sub edi,3

sub esi,3

sub esi, ecx

sub edi, ecx

clc

pushf

add1: popf

mov eax,[esi+ecx]

bswap eax

adc eax,[ebx]

add eax, 0f6f6f6f6h

pushf

mov edx, eax

and eax, 30303030h

shr eax, 3

sub edx, eax

and edx, 0f0f0f0fh

bswap edx

mov [edi+ecx],edx

add ebx, 4

sub ecx, 4

jnz add1

popf

#endasm

**Perhaps you should use tables to perform the addition.**

Instead of adding BCD numbers, you just have to add 2 binary digits (0-063h).

Or perhaps use one digit per byte, since we have now computers with 1Gb RAM.

Instead of adding BCD numbers, you just have to add 2 binary digits (0-063h).

Or perhaps use one digit per byte, since we have now computers with 1Gb RAM.

V-Coder, I realized that you were adding one digit at a time.

Instead of working one digit at a time, wouldn't it be faster to work on 2 digits ?

Ok, here is some unoptimized code to show my idea:

Perhaps this addition could be done efficiently in MMX ?

JC

V-Coder, I realized that you were adding one digit at a time.

Instead of working one digit at a time, wouldn't it be faster to work on 2 digits ?

Ok, here is some unoptimized code to show my idea:

```
```

inverttable db 100 dup (?) ;this table will contain the reversed numbers: 10 -> 01, and 01->10

xor eax, eax

xor ebx, ebx

xor ecx, ecx

mov al,[esi]

mov bl,[edi]

add bl,cl

xor ecx,ecx

add bl,[inverttable+eax]

cmp bl,100

jb skip

sub bl,100

inc ecx

skip:

mov [edx], bl

Perhaps this addition could be done efficiently in MMX ?

JC

**
**

Hey, wait a minute, you weren't supposed to post an entire palindrome program... I'm working on it...:rolleyes:Sorry, didn't mean to help too much - just got carried away.

1) little endian numbers: only one set of non-aligned loads, only one bswap

2) pxor mm0, <FF> ; pand mm0, <F6> should be changed to pandn mm0, <F6> !!

3) you don't want to use MOVD it is really slow, use only MOVQ

4) almost no forward dependancies

On cached data each digit takes 2.66 clock cycles!

The faster your memory the faster the algorithm!

I still think it is twice as slow as it should be!

Do you know where I can get the files of numbers already computed? I'd like to start at 90 million digits if I can. I'm going to attempt a packed BCD version because memory is a bottleneck here and we have plenty of time to do more calculations.

It fits in well with the discussion, IMHO.

I don't test for a palindrome because I'm not to 90 million digits yet - no need. The algorithm posted above moves ESI/EBX as the number grows - they point to the byte after the number. With some minor changes (prefetch) I am at 7.5 clock cycles for uncached data. :)

MCoder you missed when I changed to 4 digits in 32-bits using Jones algorithm...

Actually, there's also no need for tables, since I am reversing not the bits in each digit, but the digits in the number 12345 becomes 54321 (Hexademinal: 0102030405 becomes 0504030201)... But thanks much.

Actually, there's also no need for tables, since I am reversing not the bits in each digit, but the digits in the number 12345 becomes 54321 (Hexademinal: 0102030405 becomes 0504030201)... But thanks much.

This works beautifully: :)

call Palendrome_Test ; to get it started...

```
.data
```

ALIGN 16

db 16 dup(0) ; this buffer is needed for algorithm overflow

Number1 db 16*8 dup(0)

db 16 dup(0) ; this buffer is needed for algorithm overflow

Number2 db 16*8 dup(0)

mmx_0A QWORD 0A0A0A0A0A0A0A0Ah

.code

Palendrome_Test PROC

lea edi, Number1

mov DWORD PTR [edi], 010906h ; 196

lea esi, [edi + 3]

lea ebp, Number2

lea ebx, [ebp + 3]

_0: call Palindrome_Sum

; carry flag set if block overflows

; adjust esi/ebx as number grows

adc BYTE PTR [ebx], 0

je _2

inc esi

inc ebx

_2: xchg ebp, edi

xchg ebx, esi

call Palindrome_Find

; loop until we find a palidrone

jne _0

ret

Palendrome_Test ENDP

Palindrome_Sum PROC

push esi ; start address of first number

push edi ; end address of buffer

push ebx ;

push ebp ; end address of destination buffer

lea eax, [esi + 15]

sub eax, edi

shr eax, 4

push eax ; 16 byte blocks to process

pcmpeqd mm6, mm6 ; mmx_FF

movq mm7, mmx_0A

clc

_0: ; unaligned high order digits (higher addresses)

mov eax, [esi - 1*4]

mov ebx, [esi - 2*4]

mov ecx, [esi - 3*4]

mov edx, [esi - 4*4]

bswap eax ; Athlon note: faster than MMX

bswap ebx

bswap ecx

bswap edx

; fix to ensure overflow into next byte

lea eax, [eax + 0F6F6F6F6h]

lea ebx, [ebx + 0F6F6F6F6h]

lea ecx, [ecx + 0F6F6F6F6h]

lea edx, [edx + 0F6F6F6F6h]

; 00 01 02 03 04 05 06 07 08 09

; F6 F7 F8 F9 FA FB FC FD FE FF

; Adding carry flags across MMX registers is slow, too

; aligned low order digits

adc eax, [edi + 0*4]

adc ebx, [edi + 1*4]

adc ecx, [edi + 2*4]

adc edx, [edi + 3*4]

lea esi, [esi - 4*4]

lea edi, [edi + 4*4]

; aligned low order digits

mov [ebp + 0*4], eax

mov [ebp + 1*4], ebx

mov [ebp + 2*4], ecx

mov [ebp + 3*4], edx

movq mm0, [ebp]

movq mm1, [ebp+8]

pcmpgtb mm0, mm6

pcmpgtb mm1, mm6

pandn mm0, mm7

pandn mm1, mm7

paddb mm0, [ebp]

paddb mm1, [ebp+8]

movq [ebp], mm0

movq [ebp+8], mm1

lea ebp, [ebp + 4*4]

dec DWORD PTR [esp]

jne _0

pop eax ; 16 byte blocks to process

pop ebp ; end address of destination buffer

pop ebx ;

pop edi ; end address of buffer

pop esi ; start address of first number

ret

Palindrome_Sum ENDP

Palindrome_Find PROC

or eax, -1

ret

Palindrome_Find ENDP

The numbers in the data section are just to show the format in memory - I'm sure you will allocate some larger blocks. All but four memory accesses are aligned (they are aligned sometimes, too)! If this isn't faster then it's only memory speed holding it back. Note: numbers are stored with least significant digit first. I will test some more before reporting an official timing on the algorithm.
call Palendrome_Test ; to get it started...

Here is a test program. 1000 Mhz Athlon clocks in at <7 secs for 2^16 digits.

This actually calculates the digits for the number 196.

2^20 digits takes 3 hours, 33 minutes, 23.52 seconds!

Sure beats ( this ) guy's 1 day and 18 hours!

This actually calculates the digits for the number 196.

2^20 digits takes 3 hours, 33 minutes, 23.52 seconds!

Sure beats ( this ) guy's 1 day and 18 hours!

```
```

; 2^11 ( 2k): 0 ms

; 2^12 ( 4k): 20 ms

; 2^13 ( 8k): 100 ms

; 2^14 ( 16k): 400 ms

; 2^15 ( 32k): 1 552 ms

; 2^16 ( 64k): 6 839 ms

; 2^17 ( 128k): 28 270 ms

; 2^18 ( 256k): 482 003 ms (no longer in the cache!)

; 2^19 ( 512k):

; 2^20 (1024k): 12 803 520 ms

Hey, wait a minute, you weren't supposed to post an entire palindrome program... I'm working on it...:rolleyes:

Anyway, check out the full list of times at http://home.cfl.rr.com/p196/milestones.html

and the software comparisons page at http://home.cfl.rr.com/p196/compare.html

You should probably be able to submit your code to be officially timed. And if he likes your code enough, he would use it until a faster one is developed.

My intention was to write a routine that was faster on my Pentium MMX 166 than the 1 day 18 hours on a Pentium II 266.

http://www.jasondoucette.com/worldrecords.html

My fastest routine (8 digit MMX) running in unreal mode for DOS finishes 1 million in 14.7 hours on a K6-2 550, 33% faster on a Pentium II 350, and 50% longer on a Pentium MMX 166 - so I have accomplished my objective...

Beyond this I wanted to submit for testing to the Software comparisons page.

Unfortunately, the tester prefers a Windows version. I estimate that with all the penalties I accrue from segment prefix, 32 bit prefix, my code should be at least four times as fast in Win32. My initial simple single digit add was half the speed on Pentium MMX. Worse yet, Pentium 4's hate 16 bit code with segment prefixes, etc. The unreal mode code runs barely faster on a 1.8GHz Pentium 4 than on a 166 MHz Pentium MMX.

Here's my problem now. I am coding in HLA, and there's some problem preventing the program from working in Win32. I port the routine over and it works for the first adds, then breaks. Plus HLA does not support all the MMX instructions, so I have to use a #asm, #endasm hack. (It must be so annoying to you MASM guys to see HLA code... my bad)

My interest here was to develop the best add routine...

I had discovered Jones' website months ago, but could not understand his algorithm until I read it carefully a few nights ago, and I was able to post code. I was debugging my MMX routine... Here it is, on my Pentium MMX, it takes 11 cycles, compared to Jones routine, 7-9 cycles... (I do not adjust thread priority yet).

It is slower than Jones' but I wrote it myself!!!:grin:

Anyway, check out the full list of times at http://home.cfl.rr.com/p196/milestones.html

and the software comparisons page at http://home.cfl.rr.com/p196/compare.html

You should probably be able to submit your code to be officially timed. And if he likes your code enough, he would use it until a faster one is developed.

My intention was to write a routine that was faster on my Pentium MMX 166 than the 1 day 18 hours on a Pentium II 266.

http://www.jasondoucette.com/worldrecords.html

My fastest routine (8 digit MMX) running in unreal mode for DOS finishes 1 million in 14.7 hours on a K6-2 550, 33% faster on a Pentium II 350, and 50% longer on a Pentium MMX 166 - so I have accomplished my objective...

Beyond this I wanted to submit for testing to the Software comparisons page.

Unfortunately, the tester prefers a Windows version. I estimate that with all the penalties I accrue from segment prefix, 32 bit prefix, my code should be at least four times as fast in Win32. My initial simple single digit add was half the speed on Pentium MMX. Worse yet, Pentium 4's hate 16 bit code with segment prefixes, etc. The unreal mode code runs barely faster on a 1.8GHz Pentium 4 than on a 166 MHz Pentium MMX.

Here's my problem now. I am coding in HLA, and there's some problem preventing the program from working in Win32. I port the routine over and it works for the first adds, then breaks. Plus HLA does not support all the MMX instructions, so I have to use a #asm, #endasm hack. (It must be so annoying to you MASM guys to see HLA code... my bad)

My interest here was to develop the best add routine...

I had discovered Jones' website months ago, but could not understand his algorithm until I read it carefully a few nights ago, and I was able to post code. I was debugging my MMX routine... Here it is, on my Pentium MMX, it takes 11 cycles, compared to Jones routine, 7-9 cycles... (I do not adjust thread priority yet).

It is slower than Jones' but I wrote it myself!!!:grin:

My MMX algorithm is as follows - only 4 digit is included, I also have it extended to 8 digits ...

Let me explain how it works...

After adding the two numbers together with carry from previous add, I add $f6f6f6f6 to eax to force the carry out of each digit. Now, any byte from f6-ff has not had the carry out, and any byte from 0-9 has had a carry. Store carry on stack

If it is from 0-9, all well and good. If from f6-ff, then f6 needs to be subtracted.

MMX signed byte compare with ffffffff. Any byte which is positive will now become ff, any byte which is ff or less will become 00. xor to switch from 00 to ff and vice versa. And with f6. Thus everywhere 0-9 is found the byte value is 00. Everywhere f6-ff is found the byte value is f6.

Subtract, pass back to eax, bswap and store...

Okay my algorithm has a problem.

I am storing all numbers in big endian format (or is that little?) The least significant number is in higher numbered bytes, as opposed to IA-32 architecture. This is why I have to bswap before store. Easy to change. I just have not done it yet, since I cannot get HLA to work with my routine... and I cannot code MASM to save anyone's life.

```
```

#asm

sub edi,3

sub esi,3

sub esi, ecx

sub edi, ecx

clc

pushf

add1: popf

mov eax,[esi+ecx]

bswap eax

adc eax,[ebx]

add eax, 0f6f6f6f6h

pushf ; carry from addition

movd mm0,eax ; MMX

movd mm1,eax

#endasm

pcmpgtb (allff, mm1); // $ffffffffffffffff

pxor (allff, mm1); // reverse bits

pand (allf6, mm1);

#asm

psubb mm0, mm1

movd eax,mm1

bswap eax

mov [edi+ecx],eax

add ebx, 4

sub ecx, 4

jnz add1

popf

#endasm

Let me explain how it works...

After adding the two numbers together with carry from previous add, I add $f6f6f6f6 to eax to force the carry out of each digit. Now, any byte from f6-ff has not had the carry out, and any byte from 0-9 has had a carry. Store carry on stack

If it is from 0-9, all well and good. If from f6-ff, then f6 needs to be subtracted.

MMX signed byte compare with ffffffff. Any byte which is positive will now become ff, any byte which is ff or less will become 00. xor to switch from 00 to ff and vice versa. And with f6. Thus everywhere 0-9 is found the byte value is 00. Everywhere f6-ff is found the byte value is f6.

Subtract, pass back to eax, bswap and store...

Okay my algorithm has a problem.

I am storing all numbers in big endian format (or is that little?) The least significant number is in higher numbered bytes, as opposed to IA-32 architecture. This is why I have to bswap before store. Easy to change. I just have not done it yet, since I cannot get HLA to work with my routine... and I cannot code MASM to save anyone's life.

Hey, wait a minute, you weren't supposed to post an entire palindrome program... I'm working on it...:rolleyes:

**V Coder**, your doing well. Look at the algo above - it fixes all the problems you state your's has:

1) little endian numbers: only one set of non-aligned loads, only one bswap

2) pxor mm0, <FF> ; pand mm0, <F6> should be changed to pandn mm0, <F6> !!

3) you don't want to use MOVD it is really slow, use only MOVQ

4) almost no forward dependancies

On cached data each digit takes 2.66 clock cycles!

The faster your memory the faster the algorithm!

I still think it is twice as slow as it should be!

Do you know where I can get the files of numbers already computed? I'd like to start at 90 million digits if I can. I'm going to attempt a packed BCD version because memory is a bottleneck here and we have plenty of time to do more calculations.

Ah, I like your routine...

(Apart from the fact that you add $0a, and I subtract $f6 - as we say in Trinidad, six of one, half a dozen of the other).

Cute way of setting up allff - $ffffffffffffffff. You can see that I have not yet put allf6 and allff in mmx registers... So I have room for optimization.

I suppose moving the data back to memory is faster than movd and shift... which is what I do in my 8 byte version.

However, I believe you are wrong...After the pcmpgtd you need to xor $ffffffff. Wait, okay. That's a pandn. That's cute.

I suppose it is supported by a C launcher or something?

I only now realize that I do not need to push the flags here in this test program, since I do not affect the flags with the MMX comparisons...

Well, in my full program, I use an integer CMP to see if the number is a palindrome. Obviously CMP affects the flags, so I need to preserve it... Where do you check for the palindrome?

Actually, in theory, you do not need to specifically check for palindrome. If there is any carry in any addition, then the number being formed will not be a palindrome. That's the theory at least. It fails for small numbers, eg 29+92=121. And there are a couple of others. However, it is hard to imagine that any addition of a 90 million digit number will not produce a carry somewhere, so no one actually believes there will be a palindrome, so you do not really need to check right?

You need to keep score of how many digits in the new number, and how many additions have passed so far. Also, in my DOS unreal program I use BP, but in my windows port, I leave EBP severely alone.

I understand the LEAs. I did not think of them, but then again, I am a relative newbie. I started writing my routine 5 years 4 months ago (before the current "pioneers" started). I got my initial program working in Unreal mode, but thought it was too slow and did not want to stress my mind in learning Win32 API. So I let it go until Jan 1, 2003. I reworked the routines until I reached an MMX version, started teaching myself HLA. But I got stuck January 15th when I could not get the routines that work perfectly under NASM to work under HLA. I have yet to submit my program there for debugging. But I'm sure it is not my routines themselves, but the HLA packaging (yeah for sure!!!)... so, I'm here to try to learn MASM32, and/or fix up the HLA code... Meanwhile I wanted to develop my routines further.

Will I get flamed if I post my Unreal Mode routine here?

Here goes.

I compiles as is with NASMW. I have no idea what to do to make it work with the others.

(Apart from the fact that you add $0a, and I subtract $f6 - as we say in Trinidad, six of one, half a dozen of the other).

Cute way of setting up allff - $ffffffffffffffff. You can see that I have not yet put allf6 and allff in mmx registers... So I have room for optimization.

I suppose moving the data back to memory is faster than movd and shift... which is what I do in my 8 byte version.

However, I believe you are wrong...After the pcmpgtd you need to xor $ffffffff. Wait, okay. That's a pandn. That's cute.

I suppose it is supported by a C launcher or something?

I only now realize that I do not need to push the flags here in this test program, since I do not affect the flags with the MMX comparisons...

Well, in my full program, I use an integer CMP to see if the number is a palindrome. Obviously CMP affects the flags, so I need to preserve it... Where do you check for the palindrome?

Actually, in theory, you do not need to specifically check for palindrome. If there is any carry in any addition, then the number being formed will not be a palindrome. That's the theory at least. It fails for small numbers, eg 29+92=121. And there are a couple of others. However, it is hard to imagine that any addition of a 90 million digit number will not produce a carry somewhere, so no one actually believes there will be a palindrome, so you do not really need to check right?

You need to keep score of how many digits in the new number, and how many additions have passed so far. Also, in my DOS unreal program I use BP, but in my windows port, I leave EBP severely alone.

I understand the LEAs. I did not think of them, but then again, I am a relative newbie. I started writing my routine 5 years 4 months ago (before the current "pioneers" started). I got my initial program working in Unreal mode, but thought it was too slow and did not want to stress my mind in learning Win32 API. So I let it go until Jan 1, 2003. I reworked the routines until I reached an MMX version, started teaching myself HLA. But I got stuck January 15th when I could not get the routines that work perfectly under NASM to work under HLA. I have yet to submit my program there for debugging. But I'm sure it is not my routines themselves, but the HLA packaging (yeah for sure!!!)... so, I'm here to try to learn MASM32, and/or fix up the HLA code... Meanwhile I wanted to develop my routines further.

Will I get flamed if I post my Unreal Mode routine here?

Here goes.

I compiles as is with NASMW. I have no idea what to do to make it work with the others.

**Will I get flamed if I post my Unreal Mode routine here?**

It fits in well with the discussion, IMHO.

I don't test for a palindrome because I'm not to 90 million digits yet - no need. The algorithm posted above moves ESI/EBX as the number grows - they point to the byte after the number. With some minor changes (prefetch) I am at 7.5 clock cycles for uncached data. :)

```
; 2^18 ( 256k): 482 003 ms uncached data
```

; 2^18 ( 256k): 343 944 ms " " w/ PREFETCHNTA

...a 28.6% improvement.I took too long typing my answer. I started before you...

Jones' code uses 5 instructions as opposed to your 6, however.

You have to contact the guy at www.p196.org to get the 90 million byte file.

Have a look at my HLA please.

I am trying to debug it. With your deep understanding of ASM, it will probably only bring furrows to your forehead... (Why does V Coder accept reversed operands, mandatory brackets and semicolon?)

Thanks

You are slightly behind the fastest on the software comparisons page. But then again, you're on a slower clocked machine... Try Jones' routine too.

Jones' code uses 5 instructions as opposed to your 6, however.

You have to contact the guy at www.p196.org to get the 90 million byte file.

Have a look at my HLA please.

I am trying to debug it. With your deep understanding of ASM, it will probably only bring furrows to your forehead... (Why does V Coder accept reversed operands, mandatory brackets and semicolon?)

Thanks

You are slightly behind the fastest on the software comparisons page. But then again, you're on a slower clocked machine... Try Jones' routine too.

I have a record of the correct number of additions to reach each 10,000 digits.

10000 digits, 24094 additions;

20000 - 48150

50000 - 120428 etc.

Just for you to verify that your program is accurate (which goes without saying, V Coder :grin: ) I used it to verify each new routine I wrote, and the unreal mode code beeps at each 10,000, so I can time it manually.

10000 digits, 24094 additions;

20000 - 48150

50000 - 120428 etc.

Just for you to verify that your program is accurate (which goes without saying, V Coder :grin: ) I used it to verify each new routine I wrote, and the unreal mode code beeps at each 10,000, so I can time it manually.