Paul Hsieh included the following code in his webpage: http://www.azillionmonkeys.com/qed/asmexample.html

Norbert Juffa has given me some devilishly clever code for performing a 64 bit BCD addition.

Douglas Jones's Webpage http://www.cs.uiowa.edu/~jones/bcd/index.html discusses BCD and packed BCD addition, and provides algorithms, but this code carries it further allowing carry in and out, and it can do full 8 digit numbers instead of only 7.

Next step, adaptation to the palindrome quest.

**9.64bit BCD add**

Norbert Juffa has given me some devilishly clever code for performing a 64 bit BCD addition.

```
```

; by Norbert Juffa

mov eax, dword ptr [x] ; x (lo)

mov ebx, dword ptr [y] ; y (lo)

mov edx, dword ptr [x+4] ; x (hi)

mov ecx, dword ptr [y+4] ; y (hi)

; here: EDX:EAX = augend, ECX:EBX = addend

mov esi, eax ; x

lea edi, [eax+66666666h] ; x + 0x66666666

xor esi, ebx ; x ^ y

add eax, ebx ; x + y

shr esi, 1 ; t1 = (x ^ y) >> 1

add edi, ebx ; x + y + 0x66666666

sbb ebx, ebx ; capture carry

rcr edi, 1 ; t2 = (x + y + 0x66666666) >> 1

xor esi, edi ; t1 ^ t2

and esi, 88888888h ; t3 = (t1 ^ t2) & 0x88888888

add eax, esi ; x + y + t3

shr esi, 2 ; t3 >> 2

sub eax, esi ; x + y + t3 - (t3 >> 2)

sub edx, ebx ; propagate carry

mov esi, edx ; x

lea edi, [edx+66666666h] ; x + 0x66666666

xor esi, ecx ; x ^ y

add edx, ecx ; x + y

shr esi, 1 ; t1 = (x ^ y) >> 1

add edi, ecx ; x + y + 0x66666666

;;sbb esi, esi ; capture carry

rcr edi, 1 ; t2 = (x + y + 0x66666666) >> 1

xor esi, edi ; t1 ^ t2

and esi, 88888888h ; t3 = (t1 ^ t2) & 0x88888888

add edx, esi ; x + y + t3

shr esi, 2 ; t3 >> 2

sub edx, esi ; x + y + t3 - (t3 >> 2)

; here EDX:EAX = sum

mov edi, z

mov [edi], eax

mov [edi+4], edx

Douglas Jones's Webpage http://www.cs.uiowa.edu/~jones/bcd/index.html discusses BCD and packed BCD addition, and provides algorithms, but this code carries it further allowing carry in and out, and it can do full 8 digit numbers instead of only 7.

Next step, adaptation to the palindrome quest.

```
; swap high order and low order nibbles
```

pshufw mm0, [ebx], 00011011y ; MNOP IJKL EFGH ABCD

movq mm1, mm0

psllw mm0, 8 ; ..MN ..IJ ..EF ..AB

psrlw mm1, 8 ; OP.. KL.. GH.. CD..

por mm0, mm1 ; OPMN KLIJ GHEF CDAB

movq mm1, mm0

pand mm0, mmx_0F ; .P.N .L.J .H.F .D.B

psrlw mm1, 4 ; .OPM .KLI .GHE .CDA

psllw mm0, 4 ; P.N. L.J. H.F. D.B.

pand mm1, mmx_0F ; .O.M .K.I .G.E .C.A

por mm0, mm1 ; PONM LKJI HGFE DCBA

:)Thanks bitRAKE.

Now I did not notice until recently but the first 5 lines of bitRAKE's code are an elegant way to reverse 8 bytes. I thought it needed 7 instructions (two pands to zero the upper bytes of each word in mm0, and lower bytes of each word in mm1). But the word shift (psllw, psrlw) eliminates that need! So, bitRAKE has provided the most optimal MMX byte swap (BSWAP), and the most optimal MMX nibble swap.

Juffa's code above can be understood. Just check Douglas Jones' website to understand the general algorithm, then this code becomes easy. Now, because the logical operations (and, xor, etc) clear the carry flag, it must be preserved elsewhere. That elsewhere is a general register. It is stored in ebx, while working on the lower dword, then in esi while working on the upper dword.

Unfortunately this uses basically all our registers (except ebp). Obviously we cannot keep all the data in registers...

In my palindrome program, esi, edi and ebp contain contain index values, so I do everything in four registers with one register as the scratch (carry register).

Now I did not notice until recently but the first 5 lines of bitRAKE's code are an elegant way to reverse 8 bytes. I thought it needed 7 instructions (two pands to zero the upper bytes of each word in mm0, and lower bytes of each word in mm1). But the word shift (psllw, psrlw) eliminates that need! So, bitRAKE has provided the most optimal MMX byte swap (BSWAP), and the most optimal MMX nibble swap.

Juffa's code above can be understood. Just check Douglas Jones' website to understand the general algorithm, then this code becomes easy. Now, because the logical operations (and, xor, etc) clear the carry flag, it must be preserved elsewhere. That elsewhere is a general register. It is stored in ebx, while working on the lower dword, then in esi while working on the upper dword.

Unfortunately this uses basically all our registers (except ebp). Obviously we cannot keep all the data in registers...

In my palindrome program, esi, edi and ebp contain contain index values, so I do everything in four registers with one register as the scratch (carry register).

**bitRAKE**

I finally got to working on packed BCD addition a month later. It is easier than I thought. I have a repeated 8 digit addition (as above) working, and I can write ISF files (for the palindrome test). I have not coded the Load file for the packed BCD as yet.

The program is slower than I thought it would be: as slow as (exact time) my initial version using a repeated 8 digit unpacked BCD addition. The latter has been unrolled and optimized several times and now takes 3/4 the original time.

I can only hope the packed version speeds up more with optimization. Unfortunately, there is heavy dependency on results from previous operations.

I expect you have a MMX packed BCD addition working.

Do you have a MMX packed BCD addition working?

I am moving - unable to post any code from work, but there is a better algorithm.

**bitRAKE**

What is the better algorithm? Eagerly awaiting a hint.

**bitRAKE**,

I concluded in June that MMX may be a bad idea from the standpoint of the instructions are high latency on the Pentium 4 (and that is the platform I am coding for)...

I have not tried it yet though (been real busy, have not coded ASM for about 6 weeks), but you said there's a better one... I am looking forward to hearing about it.

X+Y+Z = S + 2C

S =: XOR( X, XOR( Y, Z ) )

C =: OR( AND( X, Y ), OR( AND( Y, Z ), AND( Z, X ) ) )

No carry is required and this should be possible with MMX/SSE2 registers.

S =: XOR( X, XOR( Y, Z ) )

C =: OR( AND( X, Y ), OR( AND( Y, Z ), AND( Z, X ) ) )

No carry is required and this should be possible with MMX/SSE2 registers.

**bitRAKE**,

Thanks. A search for "x+y+y=s+2c" yields that this is a carry save adder. It converts a 3 input addition to a two input addition (which is apparently useful for multiplication, but I am trying to understand how that is implemented too).

I am also trying to figure out how a carry save adder would be implemented to add two BCD (or packed BCD) numbers.

Please help. A bigger hint...

Thanks.

Split the bits of the BCD digits and preform the binary operations in parallel.

bitRAKE,

can you give me an example code for carry save adder being used to add two long numbers?

Thanks.

can you give me an example code for carry save adder being used to add two long numbers?

Thanks.

No, I haven't developed a method to generate the algorithm, yet.

Split the bits of the BCD digits and preform the binary operations in parallel.

**bitRAKE**, that about you think? :)

I don't known how split the bits of the

**BCD**digits ... :)

Preform the binary operations in parallel:

```
```

model flat

codeseg

startupcode

mov eax,085555555h ; x

mov ebx,00B555555h ; y

; eax=x

; ebx=y

mov ecx,eax

and ecx,ebx

or eax,ebx

mov edx,ecx

add edx,edx

mov ebx,edx

not ecx

and ecx,eax

add eax,eax

rept 31

add ebx,ebx

and ebx,eax

or edx,ebx

endm

xor ecx,edx

; ecx=x+y

exitcode

end

Hmm...

**Nexo**, I got 90AAAAAAh for an answer in ECX. How does this equal x+y? Furthermore, where is the carry flag for multi-percision use? I don't understand the numbering scheme used - what is the value put into EBX (00B555555h) in decimal notation?

BCD digits require four bits each. If we spread those bits across four registers, such that bit zero of each register is a component of the least significant BCD digit, we may operate on all BCD bits of similar magnitude in parallel. Not that this will help - I am just brain-storming out loud.

Hi,

I am only try show with this source that BITS carry is useless and low performance. Because simple "ADD" is faster.

Good luck.

**BitRAKE**. I am just kill my bored time. This souce is little joke :) 85555555h+0B555555h=90AAAAAh is right. Because this applicable ONLY for add of BINARY numbers. "the carry flag for multi-percision use" is use for carry between BITS. "X+Y+Z = S + 2C". Is expression ONLY for BINARY numbers? If we spread bits of BCD across four registers, we will have BINARY numbers (digits of BCD). We can add those BINARY numbers with the carry flag for multi-percision use between bits of BCD digit. But I don't see as we will do carry between**BCD**DIGITS with "X+Y+Z = S + 2C".I am only try show with this source that BITS carry is useless and low performance. Because simple "ADD" is faster.

Good luck.

**Nexo**, oh. :) It does make sense, but I was thinking maybe a multiply instruction could be used to do multiple adds in some way?

Do you have a MMX packed BCD addition working?

The 32 bit general registers version operated as fast in practice as it did in theory (by adding the cycles manually): no amount of optimization was possible. I last tried doing an MMX version October 12 last year, but it did not work. I finally got back to it September 1 and got it working about 70% as fast as general registers. Nothing that seems to be an obvious improvement works since MMX works on parallel bytes, words, dwords, but not nibbles. Plus 8 extra (MMX) instructions to get the carry another one to propagate it...

So yes... I have found that MMX packed BCD has its challenges!!!