I would like to double a packed BCD number.
For example, 1234 decimal can be represented in 32 bits in any of these ways:
$000004D2 binary, written in hexadecimal
$01020304 BCD (each digit stored in a byte)
or $00001234 packed BCD (each digit stored in a nibble)
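To make the three representations concrete, here is a minimal C sketch that builds the byte-per-digit and nibble-per-digit forms from a binary value. The function names are made up for illustration and are not from any routine in this thread.

```c
#include <stdint.h>

/* Packed BCD: one decimal digit per nibble, up to 8 digits in 32 bits.
   Digits beyond the eighth are silently dropped in this sketch. */
uint32_t to_packed_bcd(uint32_t n)
{
    uint32_t bcd = 0;
    for (int shift = 0; shift < 32 && n != 0; shift += 4) {
        bcd |= (n % 10u) << shift;
        n /= 10u;
    }
    return bcd;
}

/* Unpacked BCD: one decimal digit per byte, up to 4 digits in 32 bits. */
uint32_t to_unpacked_bcd(uint32_t n)
{
    uint32_t bcd = 0;
    for (int shift = 0; shift < 32 && n != 0; shift += 8) {
        bcd |= (n % 10u) << shift;
        n /= 10u;
    }
    return bcd;
}
```

With these, to_unpacked_bcd(1234) gives $01020304 and to_packed_bcd(1234) gives $00001234, while 1234 itself is $000004D2.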
A complete routine for adding two packed BCD numbers is given here http://www.asmcommunity.net/board/viewtopic.php?t=13215
What I want to do now is multiply a packed BCD number by 2. The following code works (and incorporates carry in and out). It simplifies one aspect of the addition algorithm: when a binary number is added to itself, you normally just shift....
Can it be sped up any? Thanks.
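mov ([esi+mda], eax); // eax = packed BCD input
add ($33333333, eax); // bias each digit by 3: bit 3 is now set where digit >= 5
mov (eax, edx);
xor (-1, edx); // invert the biased copy
and ($88888888, edx); // edx = 8 in each digit that is < 5
shr (1, ebx); // carry in: ebx bit 0 -> CF
rcl (1, eax); // eax = 2*eax + carry in, so each digit now has a bias of 6
rcl (1, ebx); // carry out: CF -> ebx bit 0
sub (edx, eax); // digits < 5: take 8 off ...
shr (2, edx);
add (edx, eax); // ... and put 2 back, which removes the bias of 6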
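For readers who do not use HLA, here is a rough C model of the routine above, written step for step. It is only a sketch: the function name is made up, and the carry is passed through a pointer (0 or 1 on entry and exit) where the original keeps it in bit 0 of ebx.

```c
#include <stdint.h>

/* Hypothetical C model of the HLA doubling routine. */
uint32_t bcd_double_v1(uint32_t x, unsigned *carry)
{
    uint32_t t = x + 0x33333333u;          /* bit 3 of a nibble is set iff digit >= 5 */
    uint32_t lt5 = ~t & 0x88888888u;       /* 8 in each nibble whose digit is < 5 */
    unsigned carry_out = t >> 31;          /* the bit that rcl shifts out of eax */
    uint32_t r = (t << 1) | (*carry & 1u); /* rcl: 2x + 66666666h + carry in */
    r -= lt5;                              /* take 8 off the small digits ... */
    r += lt5 >> 2;                         /* ... and put 2 back: net -6 */
    *carry = carry_out;
    return r;
}
```

Digits of 5 and above keep their bias of 6, which is what pushes their doubled value past 16 and into the next nibble.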
Okay, but I'm going to write this in a normal way:
mov ecx,[esi+mda]
lea eax,[ecx+0x33333333]
and eax,0x88888888
shr eax,1
add ecx,eax
shr eax,1
add eax,ecx
The '0x33' hex notation belongs to C (or other HLLs).
Almost all assemblers have two hex notations, "33h" and "$33": 33h is what an asm programmer writes in asm source; $33 is what online disassemblers use for display.
mov ecx, [esi+mda]
lea eax, [ecx+33333333h]
and eax, 88888888h
shr eax, 1
add ecx, eax
shr eax, 1
add eax, ecx
Ah, sorry, I had just been using NASM and grown accustomed to writing 0x for some reason :P
Thanks guys. But it does not work. With input 123456h, the output is eax=1234BCh and ecx=12349Ah. The result should be 246912h. Please also check the other links. I like the brevity of your routine though, so I'm looking at it further. Do you have a source for it, or other material on packed BCD arithmetic? Thanks.
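The figures above can be reproduced with a short C model. This is only a sketch with made-up names: first_attempt mirrors the seven-instruction routine as posted, and bcd_double_ref is a slow digit-by-digit doubler that can serve as a reference when checking any of the faster variants.

```c
#include <stdint.h>

/* Model of the routine as first posted: it adds the correction
   but never doubles the input. */
uint32_t first_attempt(uint32_t x)
{
    uint32_t m = (x + 0x33333333u) & 0x88888888u; /* 8 where digit >= 5 */
    m >>= 1;        /* shr eax,1 */
    x += m;         /* add ecx,eax */
    m >>= 1;        /* shr eax,1 */
    return m + x;   /* add eax,ecx */
}

/* Reference: double one decimal digit at a time.
   *carry is 0 or 1 on entry and on exit. */
uint32_t bcd_double_ref(uint32_t x, unsigned *carry)
{
    uint32_t r = 0;
    unsigned c = *carry & 1u;
    for (int shift = 0; shift < 32; shift += 4) {
        unsigned d = ((x >> shift) & 0xFu) * 2u + c;
        c = (d >= 10u);
        if (c)
            d -= 10u;
        r |= (uint32_t)d << shift;
    }
    *carry = c;
    return r;
}
```

For 123456h the first function returns 1234BCh, which is the input plus 6 in the two digits that are 5 or more, and the reference returns 246912h.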
Ah, you're right, I missed an instruction!
I now also made it return the value in ECX to save one byte and one shift instruction.
mov ecx, [esi+mda]
lea eax, [ecx+33333333h]
and eax, 88888888h
add ecx,ecx
add ecx,eax
shr eax,2
sub ecx,eax
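Here is how I read the corrected routine, as a C sketch (the function name is hypothetical). A digit of 5 or more doubles to 10 or more, so it needs 6 added to wrap past 16 and carry into the next nibble. The mask holds 8 in each such digit, and adding 8 then subtracting 2 is the same as adding 6.

```c
#include <stdint.h>

/* Hypothetical C model of the seven-instruction doubler.
   No carry in, and the carry out of the top digit is lost. */
uint32_t bcd_double(uint32_t x)
{
    uint32_t ge5 = (x + 0x33333333u) & 0x88888888u; /* 8 where digit >= 5 */
    uint32_t r = x + x;  /* add ecx,ecx */
    r += ge5;            /* add ecx,eax : +8 */
    r -= ge5 >> 2;       /* shr eax,2 / sub ecx,eax : -2 */
    return r;
}
```

bcd_double(0x123456) returns 0x246912, the value asked for earlier in the thread.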
That is great. Four instructions fewer than mine! How did you do it? BTW, one thing is missing though: how do we get the carry when doubling eight-digit numbers?
Another question. How would you extend this to a generic add for packed BCD please? Also with carry?
Do you have any other links on the topic please? Thanks.
I got it. This handles the carry...
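mov ecx, [esi+mda]
lea eax, [ecx+33333333h]
and eax, 88888888h ; SF = bit 31, set when the top digit is >= 5
lea ecx, [ecx*2+ebx] ; double and add carry in (ebx must be 0 or 1); lea leaves the flags alone
; add ecx,ecx
sets bl ; carry out = SF from the and above
add ecx,eax
shr eax,2
sub ecx,eax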
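A C sketch of the same idea, for checking the carry handling. The function name is made up, and the carry lives in a plain variable here where the assembly keeps it in ebx. The carry out is simply the top bit of the "digit >= 5" mask, which is what sets bl picks up from the sign flag.

```c
#include <stdint.h>

/* Hypothetical C model of the doubler with carry in and carry out.
   *carry must be 0 or 1 on entry; it is 0 or 1 on exit. */
uint32_t bcd_double_carry(uint32_t x, unsigned *carry)
{
    uint32_t ge5 = (x + 0x33333333u) & 0x88888888u; /* 8 where digit >= 5 */
    uint32_t r = x * 2u + (*carry & 1u);            /* lea ecx,[ecx*2+ebx] */
    *carry = ge5 >> 31;                             /* sets bl: top digit >= 5 */
    r += ge5;                                       /* +8 */
    r -= ge5 >> 2;                                  /* -2 */
    return r;
}
```

For example, doubling 50000000 gives 00000000 with a carry out of 1, and doubling 99999999 with a carry in of 1 gives 99999999 with a carry out of 1.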
Still, how would you extend this to generic packed BCD add please. Thanks.
I think I did it in about the same way as you did, but with reversed logic.
Maybe you can add like this?
mov eax,[valuea]
mov ecx,[valueb]
mov edx,eax
adc eax,ecx
rcl ebx,1
xor ecx,edx
mov edx,eax
add eax,66666666h
adc ebx,0
xor eax,ecx
shr ebx,1
rcr eax,3
and eax,22222222h
lea eax,[eax*2+eax]
add edx,eax
shl eax,2
; Result in EDX, upper bit of EBX cleared
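My reading of the add routine above, as a C sketch. The function name and the 64-bit intermediates are my own choices for illustration; the assembly does the same job with the carry flag and ebx. The idea is to add in binary, add 66666666h to a copy, and XOR that copy with a^b to expose the carry into each nibble. Every digit that carried gets 6 added to the binary sum.

```c
#include <stdint.h>

/* Hypothetical C model of the packed BCD add.
   a and b must be valid packed BCD; *carry is 0 or 1 on entry and exit. */
uint32_t bcd_add(uint32_t a, uint32_t b, unsigned *carry)
{
    uint64_t sum = (uint64_t)a + b + (*carry & 1u);      /* adc eax,ecx */
    uint64_t biased = (sum & 0xFFFFFFFFu) + 0x66666666u; /* add eax,66666666h */
    /* For valid BCD inputs at most one of the two adds carries out of bit 31. */
    unsigned carry_out = (unsigned)((sum >> 32) | (biased >> 32));
    /* Bit 4k+4 of (biased ^ a ^ b) is the decimal carry out of digit k;
       the carry out of the top digit sits in bit 32. */
    uint64_t carries = ((uint64_t)carry_out << 32) | ((uint32_t)biased ^ a ^ b);
    uint32_t fix = (uint32_t)(carries >> 3) & 0x22222222u; /* 2 in each digit that carried */
    fix *= 3u;                                             /* 6 in each digit that carried */
    *carry = carry_out;
    return (uint32_t)sum + fix;
}
```

As a check, 1234 + 8766 comes out as 00010000 with no carry, and 99999999 + 1 comes out as 00000000 with a carry of 1.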
The doubler with carry can be implemented like this:
mov ecx, [esi+mda]
lea eax,[ecx+33333333h]
adc ecx,ecx
and eax,88888888h
add ecx,eax
shr eax,2
sub ecx,eax
shl eax,3
; Result in ECX
I'm afraid I don't know of any literature about BCD numerical processing.
I combined the two techniques above to produce:
mov ecx, [esi+mda]
lea eax,[ecx+33333333h]
adc ecx,ecx
rcr eax,2
and eax,22222222h
lea eax,[eax*2+eax]
add ecx,eax
shl eax,2
However, it is one quarter the speed of
mov ecx, [esi+mda]
lea eax,[ecx+33333333h]
adc ecx,ecx
and eax,88888888h
add ecx,eax
shr eax,2
sub ecx,eax
shl eax,3
and
mov ecx, [esi+mda]
lea eax, [ecx+33333333h]
and eax, 88888888h
lea ecx, [ecx*2+ebx]
; add ecx,ecx
sets bl
add ecx,eax
shr eax,2
sub ecx,eax
Apparently the rcr has to wait on the flags produced by the adc, which slows it down, and we don't even use this carry! Changing rcr to shr brings it almost up to speed.
mov ecx, [esi+mda]
lea eax,[ecx+33333333h]
adc ecx,ecx
shr eax,2
and eax,22222222h
lea eax,[eax*2+eax]
add ecx,eax
shl eax,2
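The shr variant can be modelled in C like this (a sketch with a made-up function name). It moves the "digit >= 5" bit down to bit 1 of each nibble and multiplies by 3 to get the 6 directly, so a single add applies the correction. The final shl eax,2 in the assembly leaves bit 30 of the correction word in the carry flag, and the model returns that bit as the carry out.

```c
#include <stdint.h>

/* Hypothetical C model of the shr / multiply-by-3 doubler.
   *carry is 0 or 1 on entry and on exit. */
uint32_t bcd_double_x3(uint32_t x, unsigned *carry)
{
    uint32_t fix = ((x + 0x33333333u) >> 2) & 0x22222222u; /* 2 where digit >= 5 */
    fix *= 3u;                                  /* lea eax,[eax*2+eax] : 6 */
    uint32_t r = x * 2u + (*carry & 1u) + fix;  /* adc ecx,ecx / add ecx,eax */
    *carry = (fix >> 30) & 1u;                  /* bit that shl eax,2 puts in CF */
    return r;
}
```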
Maybe you could just keep the code I posted, then :P Or is any of the other variants faster?
Consider the four routines above (Sep 8) to be labelled a, b, c and d, in the order posted. They rank as follows:
1st place: <b> (Sephiroth3 - Sep 7, with carry in the carry flag) and <c> (V Coder - Sep 6, with carry added in ebx and sets bl, after Sephiroth3 - Sep 6), calculation in 5 seconds (both approx. 5.75 seconds, truncated to 5). Total with startup and save = 8.5 seconds
2nd place: <d> (V Coder - Sep 8, with carry in the carry flag, after Sephiroth3 - Sep 7), calculation in 7 seconds
3rd place: <not posted> (V Coder - a variation of the initial Sep 2 code using setc bl), calculation in 7 seconds
4th place: <initial code> (V Coder - Sep 2), calculation in 10 seconds
5th place: <a> (V Coder - Sep 8, with carry in the carry flag, after Sephiroth3 - Sep 7), calculation in 25 seconds
Total time with startup and save was measured with Randy's howlong.
With a terminal dataset twice as large, b and c completed the calculation in 23 seconds (4 times as long), and the initial code took 44 seconds. According to howlong, b took 25.7 seconds and c 26.1 seconds. b is therefore just under twice as fast as the original routine. Furthermore, all things being equal, b should save about a minute every hour over c.
Update:
Arithmetic (add, sub...), logic (and, or..., which zero the carry) and shift (shr, rcl...) instructions all affect the carry flag; lea, dec and inc are the exceptions. So none of the others can be used until the carry left by routines a, b or d has been read. Loops that deal with long numbers therefore need lea and dec to test for the termination condition.
By contrast, the routine with sets bl puts the carry away in a register, so that the faster add (index)/sub (number of iterations) can be used to test for the loop condition.
But this is not enough to edge <c> ahead of <b>.
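To show what the loop over a long number looks like when the carry is kept in a register, here is a C sketch. The function name and the word order (least significant word first) are assumptions made for the example; the thread does not say how the data at [esi+mda] is laid out.

```c
#include <stddef.h>
#include <stdint.h>

/* Doubles an n-word packed BCD number in place.
   Assumed layout: words[0] holds the 8 least significant digits.
   Returns the carry out of the most significant word (0 or 1). */
unsigned bcd_double_long(uint32_t *words, size_t n)
{
    unsigned carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t x = words[i];
        uint32_t ge5 = (x + 0x33333333u) & 0x88888888u; /* 8 where digit >= 5 */
        uint32_t r = x * 2u + carry;                    /* double, add carry in */
        carry = ge5 >> 31;                              /* top digit >= 5 */
        words[i] = r + ge5 - (ge5 >> 2);                /* +6 where digit >= 5 */
    }
    return carry;
}
```

Because the carry sits in an ordinary variable between iterations, the loop counter can be updated and tested with any instruction the compiler likes, which is the same freedom the sets bl version gives the hand-written loop.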