Paul Hsieh included the following code in his webpage: http://www.azillionmonkeys.com/qed/asmexample.html

Norbert Juffa has given me some devilishly clever code for performing a 64 bit BCD addition.
``````
; by Norbert Juffa
mov  eax, dword ptr [x]  ; x (lo)
mov  ebx, dword ptr [y]  ; y (lo)
mov  edx, dword ptr [x+4] ; x (hi)
mov  ecx, dword ptr [y+4] ; y (hi)
; here: EDX:EAX = augend, ECX:EBX = addend
mov  esi, eax            ; x
lea  edi, [eax+66666666h] ; x + 0x66666666
xor  esi, ebx            ; x ^ y
add  eax, ebx            ; x + y
shr  esi, 1              ; t1 = (x ^ y) >> 1
add  edi, ebx            ; x + y + 0x66666666
sbb  ebx, ebx            ; capture carry
rcr  edi, 1              ; t2 = (x + y + 0x66666666) >> 1
xor  esi, edi            ; t1 ^ t2
and  esi, 88888888h      ; t3 = (t1 ^ t2) & 0x88888888
add  eax, esi            ; x + y + t3
shr  esi, 2              ; t3 >> 2
sub  eax, esi            ; x + y + t3 - (t3 >> 2)
sub  edx, ebx            ; propagate carry
mov  esi, edx            ; x
lea  edi, [edx+66666666h] ; x + 0x66666666
xor  esi, ecx            ; x ^ y
add  edx, ecx            ; x + y
shr  esi, 1              ; t1 = (x ^ y) >> 1
add  edi, ecx            ; x + y + 0x66666666
;;sbb  esi, esi            ; capture carry
rcr  edi, 1              ; t2 = (x + y + 0x66666666) >> 1
xor  esi, edi            ; t1 ^ t2
and  esi, 88888888h      ; t3 = (t1 ^ t2) & 0x88888888
add  edx, esi            ; x + y + t3
shr  esi, 2              ; t3 >> 2
sub  edx, esi            ; x + y + t3 - (t3 >> 2)
; here EDX:EAX = sum
mov  edi, z
mov  [edi], eax
mov  [edi+4], edx
``````

Douglas Jones's Webpage http://www.cs.uiowa.edu/~jones/bcd/index.html discusses BCD and packed BCD addition, and provides algorithms, but this code carries it further allowing carry in and out, and it can do full 8 digit numbers instead of only 7.

Next step, adaptation to the palindrome quest.
Posted on 2003-05-11 21:51:37 by V Coder
``````; swap high order and low order nibbles
pshufw	mm0, [ebx], 00011011y		; MNOP IJKL EFGH ABCD

movq	mm1, mm0

psllw	mm0, 8				; ..MN ..IJ ..EF ..AB
psrlw	mm1, 8				; OP.. KL.. GH.. CD..

por	mm0, mm1			; OPMN KLIJ GHEF CDAB

movq	mm1, mm0

pand	mm0, mmx_0F			; .P.N .L.J .H.F .D.B
psrlw	mm1, 4				; .OPM .KLI .GHE .CDA

psllw	mm0, 4				; P.N. L.J. H.F. D.B.
pand	mm1, mmx_0F			; .O.M .K.I .G.E .C.A

por	mm0, mm1			; PONM LKJI HGFE DCBA``````
:)
Posted on 2003-05-11 22:52:28 by bitRAKE
Thanks bitRAKE.

Now I did not notice until recently but the first 5 lines of bitRAKE's code are an elegant way to reverse 8 bytes. I thought it needed 7 instructions (two pands to zero the upper bytes of each word in mm0, and lower bytes of each word in mm1). But the word shift (psllw, psrlw) eliminates that need! So, bitRAKE has provided the most optimal MMX byte swap (BSWAP), and the most optimal MMX nibble swap.

Juffa's code above can be understood. Just check Douglas Jones' website to understand the general algorithm, then this code becomes easy. Now, because the logical operations (and, xor, etc) clear the carry flag, it must be preserved elsewhere. That elsewhere is a general register. It is stored in ebx, while working on the lower dword, then in esi while working on the upper dword.

Unfortunately this uses basically all our registers (except ebp). Obviously we cannot keep all the data in registers...

In my palindrome program, esi, edi and ebp contain contain index values, so I do everything in four registers with one register as the scratch (carry register).
Posted on 2003-05-31 23:24:20 by V Coder
bitRAKE

I finally got to working on packed BCD addition a month later. It is easier than I thought. I have a repeated 8 digit addition (as above) working, and I can write ISF files (for the palindrome test). I have not coded the Load file for the packed BCD as yet.

The program is slower than I thought it would be: as slow as (exact time) my initial version using a repeated 8 digit unpacked BCD addition. The latter has been unrolled and optimized several times and now takes 3/4 the original time.

I can only hope the packed version speeds up more with optimization. Unfortunately, there is heavy dependency on results from previous operations.

I expect you have a MMX packed BCD addition working.
Posted on 2003-06-26 09:22:52 by V Coder

Do you have a MMX packed BCD addition working?
Yes, but the propagating the carries really limit what can be done - as you have found.
Posted on 2003-06-26 10:13:12 by bitRAKE
I am moving - unable to post any code from work, but there is a better algorithm.
Posted on 2003-06-27 16:47:47 by bitRAKE
bitRAKE

What is the better algorithm? Eagerly awaiting a hint.
Posted on 2003-07-01 09:22:32 by V Coder
bitRAKE,

I concluded in June that MMX may be a bad idea from the standpoint of the instructions are high latency on the Pentium 4 (and that is the platform I am coding for)...

I have not tried it yet though (been real busy, have not coded ASM for about 6 weeks), but you said there's a better one... I am looking forward to hearing about it.
Posted on 2003-09-07 15:58:46 by V Coder
X+Y+Z = S + 2C

S =: XOR( X, XOR( Y, Z ) )
C =: OR( AND( X, Y ), OR( AND( Y, Z ), AND( Z, X ) ) )

No carry is required and this should be possible with MMX/SSE2 registers.
Posted on 2003-12-31 14:11:10 by bitRAKE
bitRAKE,

Thanks. A search for "x+y+y=s+2c" yields that this is a carry save adder. It converts a 3 input addition to a two input addition (which is apparently useful for multiplication, but I am trying to understand how that is implemented too).

I am also trying to figure out how a carry save adder would be implemented to add two BCD (or packed BCD) numbers.

Thanks.
Posted on 2004-01-06 22:39:41 by V Coder
Split the bits of the BCD digits and preform the binary operations in parallel.
Posted on 2004-02-04 15:32:39 by bitRAKE
bitRAKE,

can you give me an example code for carry save adder being used to add two long numbers?

Thanks.
Posted on 2004-05-16 03:54:44 by V Coder
No, I haven't developed a method to generate the algorithm, yet.
Posted on 2004-05-16 10:01:15 by bitRAKE

Split the bits of the BCD digits and preform the binary operations in parallel.

bitRAKE, that about you think? :)
I don't known how split the bits of the BCD digits ... :)
Preform the binary operations in parallel:
``````
model flat

codeseg
startupcode
mov eax,085555555h	; x
mov ebx,00B555555h	; y
; eax=x
; ebx=y
mov ecx,eax
and ecx,ebx
or eax,ebx
mov edx,ecx
mov ebx,edx
not ecx
and ecx,eax
rept 31
and ebx,eax
or edx,ebx
endm
xor ecx,edx
; ecx=x+y

exitcode
end
``````

Hmm...
Posted on 2004-05-16 12:10:46 by Nexo
Nexo, I got 90AAAAAAh for an answer in ECX. How does this equal x+y? Furthermore, where is the carry flag for multi-percision use? I don't understand the numbering scheme used - what is the value put into EBX (00B555555h) in decimal notation?

BCD digits require four bits each. If we spread those bits across four registers, such that bit zero of each register is a component of the least significant BCD digit, we may operate on all BCD bits of similar magnitude in parallel. Not that this will help - I am just brain-storming out loud.
Posted on 2004-05-16 19:19:29 by bitRAKE
Hi, BitRAKE. I am just kill my bored time. This souce is little joke :) 85555555h+0B555555h=90AAAAAh is right. Because this applicable ONLY for add of BINARY numbers. "the carry flag for multi-percision use" is use for carry between BITS. "X+Y+Z = S + 2C". Is expression ONLY for BINARY numbers? If we spread bits of BCD across four registers, we will have BINARY numbers (digits of BCD). We can add those BINARY numbers with the carry flag for multi-percision use between bits of BCD digit. But I don't see as we will do carry between BCD DIGITS with "X+Y+Z = S + 2C".

I am only try show with this source that BITS carry is useless and low performance. Because simple "ADD" is faster.

Good luck.
Posted on 2004-05-17 10:55:51 by Nexo
Nexo, oh. :) It does make sense, but I was thinking maybe a multiply instruction could be used to do multiple adds in some way?
Posted on 2004-05-17 11:22:17 by bitRAKE

Do you have a MMX packed BCD addition working?
Yes, but the propagating the carries really limit what can be done - as you have found.

The 32 bit general registers version operated as fast in practice as it did in theory (by adding the cycles manually): no amount of optimization was possible. I last tried doing an MMX version October 12 last year, but it did not work. I finally got back to it September 1 and got it working about 70% as fast as general registers. Nothing that seems to be an obvious improvement works since MMX works on parallel bytes, words, dwords, but not nibbles. Plus 8 extra (MMX) instructions to get the carry another one to propagate it...

So yes... I have found that MMX packed BCD has its challenges!!!
Posted on 2004-09-02 20:27:59 by V Coder