; by Stepan Polovnikov
; al-char ;; xx30. xx39 | xx41. xx46
aam 10h ;; 0300. 0309 | 0401. 0406
aad 9 ;; 001B. 0024 | 0025. 002A
sub al, 1Bh ;; 0000. 0009 | 000A. 000F
Optimization of a size of the code.
Posted on 2002-02-23 02:56:49 by Nexo
Ha!
I finally can see it :)
Stepan is a very good asm programmer and my friend.
We spend a lot of time exchanging ideas and procs and given each
other puzzles to solve (BTW we both havilly use tbls).
The thing is that couple months ago he said that long ago he had
found solution how to convert HEX symbol to HEX digit unbranched way
by 6 clocks, and asked me if I was able to figure it out or may be write
faster.
The funny thing is that I found 4 clocks solution right away a sent it
to him. After that I didn't see his original solution until now - until
this very post of yours :)
(BTW when your first appiered here I thought that it may be Stepan
fooling me around :) )
My solution was:

cmp al,41h
sbb cl,cl
sub al,'A'-0Ah
and cl,7
add al,cl

Stepa, eto ne ti chasom durish menya zdes?
Posted on 2002-02-23 07:09:00 by The Svin
What? :cool: We together work in one department in research institute of resources of computer technology.
Five more years back at university I saw his library of operation with hexadecimal digits.
Some lines:
macro _Hex2Asc
cmp al, 10 ;; cf=1 [00:09] cf=0 [0A:0F]
ifndef optSize
sbb ah, ah ;; FF:FF-00:00
add al, '0' +7 ;; 37:40-41:45
and ah, 7 ;; 07:07-00:00
sub al, ah ;; 30:39-41:45
else
sbb al, 69h ;; 96:9F-A1:A5
das ;; 30:39-41:45
endif
endm

macro _Asc2Hex
ifndef optSize
cmp al, 'A' ;; cf=1 [30:39] cf=0 [41:45]
sbb ah, ah ;; FF:FF-00:00
add al, 10-'A' ;; F9:02-0A:0F
and ah, 7 ;; 07:07-00:00
add al, ah ;; 00:09-0A:0F
else
aam 10h ;; 0300:0309-0401:0406
aad 9 ;; 001B:0024-0025:002A
sub al, 1Bh ;; 0000:0009-000A:000F
endif
endm

It is the best code what I ever saw.
Posted on 2002-02-23 11:08:50 by Nexo
Say hellow from Shaska The Svin to him when you see him next
time.
Principles of the code shown was independnlty developed
by several developers over the world, which often happend
'cause usually we rare share knowlidge and communicate less
then rest of programmers world :)
For example working on replacing slow div instruction I figured out
how to do it with fraction mul. In couple years I read it Anger.hlp
where Fog called it Terje Mathisen method :)
I guessed we invented the same wheel :)
As to HEX to ASCII, there is something that might be
interest for you, I wrote it several years ago and still
find it usefull
Last three years I also see algos used my mask,
I have no idea if they took it from me or figured out independently
only in two of the algos the authors mentioned me :)



OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
fastadw proc
mov edx,eax
shl eax,4
and edx,0FFFF0000h
mov ebx,eax
shr edx,12
and eax,0ff0h
shr bh,4
shr al,4
and ebx,0f0f00h
mov ecx,edx
shl ebx,8
add eax,06060606h
and edx,0ff0h
add ebx,eax
shr ch,4
mov eax,ebx
shr dl,4
and ebx,10101010h
and ecx,0f0f00h
shr ebx,4
shl ecx,8
sub eax,ebx
add edx,06060606h
shl ebx,3
add eax,2a2a2a2ah
add ecx,edx
add eax,ebx
mov edx,ecx
bswap eax
;################################################
and ecx,10101010h
shr ecx,4
sub edx,ecx
shl ecx,3
add edx,2a2a2a2ah
add edx,ecx
bswap edx
ret
fastadw endp

fastadwsz proc FORCENOFRAME
;dest:DWORD

call fastadw
mov ecx,dest
mov dword ptr [ecx],edx
mov dword ptr [ecx+4],eax
mov byte ptr [ecx+8],0
ret 4
fastadwsz endp

fastadws proc FORCENOFRAME
; dest:DWORD
call fastadw
mov ecx,dest
mov dword ptr [ecx],edx
mov dword ptr [ecx+4],eax
ret 4
fastadws endp

fastadwszsp proc FORCENOFRAME
;dest :DWORD
call fastadw
mov ebx,dest
mov dword ptr [ebx],edx
mov byte ptr [ebx+4],' '
mov dword ptr [ebx+5],eax
mov byte ptr [ebx+9],0
ret 4
fastadwszsp endp

OPTION PROLOGUE:DefaultOption
OPTION EPILOGUE:DefaultOption
Posted on 2002-02-23 13:41:19 by The Svin
As illustration what I meant here is some articale I found
in inet (compare it with one of Stepa's algo)



cmp al,10 ; if x < 10, set CF = 1
sbb al,69h ; 0-9: 96h .. 9Fh, A-F: A1h..A6h
das ; 0-9: subtr. 66h -> 30h-39h,
; A-F: subtr. 60h -> 41h..46h

Gem writer: Norbert Juffa
last updated: 1998-03-16


To crearify Stepan's name in the issue :) I can say that the same
method I used in DOS progs dated 1997 when I had no idea who the hell is Norbert Juffa, The Gem writer :)
As for contest it was a couple months ago as I said, why Stepan
didn't say that he wrote already the simular code?
Maybe he was too disappointed that I figured out it in a fly :)
Srtange men we are, asm programmers - all I can say :)
Posted on 2002-02-23 14:00:00 by The Svin
Alex,

I have been watching this discussion with some interest. I have an algorithm I am working on in
PowerBASIC inline assembler that must convert a byte to string and have a leading comma which is then appended to the end of and existing buffer.

The way I am doing it is to fill up 4 bytes with padding, the comma then the ascii number and writing the DWORD to the end of the existing written data.


--,9 1 digit
-,99 2 digits
,199 3 digits

When I wrote this about 2 years ago, I mixed inline asm with high level functions and it works fine but i think I can get a measurable speed increase by doing the byte to ascii conversion on the fly and this is where the algos that have been discussd here appeal to me.

Let me know what you think in terms of putting an ascii number into a DWORD memory location by doing high speed conversions on the fly.

Regards,

hutch@movsd.com
Posted on 2002-02-23 18:06:28 by hutch--
It seems to be possible do fast, but I don't get the task clear.
What's need to be done?
In terms of:
Input? Output?
Posted on 2002-02-23 19:18:43 by The Svin
Alex,

Sorry, I should have made more sense of it. The input is a BYTE in AL, the output I need is the string data for the ascii number in decimal written into a DWORD in normal string order as follows,


--,9 1 digit
-,99 2 digits
,199 3 digits

I can add the padding and comma myself. I post strip the padding in one pass which I am satisfied with, what I am trying to gain is not farming out the BYTE to an external ascii number conversion.

I am using for an algo that outputs byte data as a sequence of DB statements,

DB 1,2,3,4,5, etc ....

Hope this makes sense.

Regards,

hutch@movsd.com
Posted on 2002-02-23 22:47:50 by hutch--
Why you do not make the elementary sequential optimization of commands? The command " and edx, 0FFFF0000h " is absolutely not necessary.

shld edx,eax,20
shl eax, 4
mov ebx, eax
mov ecx, edx
and eax, 0FF0h
and edx, 0FF0h
shr bh, 4
shr ch, 4
shr al, 4
shr dl, 4
and ebx, 0F0F00h
and ecx, 0F0F00h
shl ebx, 8
shl ecx, 8
add edx, ecx
add eax, ebx
lea ecx,
lea ebx,
shr ecx, 4
shr ebx, 4
and ecx, 1010101h
and ebx, 1010101h
sub edx, ecx
sub eax, ebx
lea edx,
lea eax,
bswap edx
bswap eax

Relative decrease of clock ticks with 33 up to 24 (137.5 %).
But here it is a lot of delays for PPro (usage of registers of a various digit capacity).
Usage of the old scheme and MMX will give decrease of clock ticks with 33 up to 9 (336 %)
n09 dq 00909090909090909h
n07 dq 00707070707070707h
n37 dq 03737373737373737h

bswap eax
shld ecx, eax, 32-4
and ecx, 0F0F0F0Fh
and eax, 0F0F0F0Fh
movd mm1, eax
movd mm0, ecx
punpcklbw mm0, mm1
movq mm1, mm0
pcmpgtb mm1,
paddb mm0,
pandn mm1,
psubb mm0, mm1
movq , mm0
Posted on 2002-02-24 03:09:27 by Nexo
The Svin, Stepan spoke, that this code
cmp al, 10
sbb al, 69h
das
Was taken from Dr. Dobb`s Journal August 1990 (Letters).
Really this code it was mentioned in some source file somewhere 1984.
;)
Posted on 2002-02-24 04:29:42 by Nexo
Nexo, you can also do a qword in 9 cycles on Athlon. :)
http://www.asmcommunity.net/board/index.php?topic=3801

Cycle per cycle the Athlon beats any IA-32 chip - 24 instructions in 9 cycles is pretty good! :) To do this the loop alignment must be ((($ + 15) AND -16) + 7). Otherwise, it can take up to 3 cycles more per iterration!
Posted on 2002-02-24 09:53:34 by bitRAKE
Hi, bitRAKE.

I have made some analysis of the code for processor AthlonXP.
; movq mm4,
; movq mm5,
; movq mm6,
; movq mm7,

movq mm0, ; +1 clk
add eax, 8
add edx, 16
dec ecx ; +1 clk
movq mm1, mm0
psrlq mm0,4 ; +1 clk
pand mm1, mm4
pand mm0, mm4
paddb mm1, mm5 ; +1 clk
paddb mm0, mm5
movq mm3, mm1 ; +1 clk
movq mm2, mm0 ; +1 clk
pcmpgtb mm3, mm6 ; +1 clk
pcmpgtb mm2, mm6
psubusb mm3, mm7 ; +2 clk
psubusb mm2, mm7
paddb mm1, mm3 ; +2 clk
paddb mm0, mm2
movq mm2, mm0 ; +2 clk
punpckhbw mm0, mm1; +1 clk
punpcklbw mm2, mm1; +1 clk
movq , mm0
movq , mm2; +2 clk
Total 17 clock ticks.

Below my alternative of the code follows.
; movq mm4,
; movq mm5,
; movq mm6,
; movq mm7,

movq mm0, ; +1 clk
add eax, 8
add edx, 16
dec ecx ; +1 clk
movq mm1, mm0
psrlq mm0,4 ; +1 clk
pand mm1, mm4
pand mm0, mm4
movq mm3, mm1 ; +1 clk
movq mm2, mm0
pcmpgtb mm3, mm5 ; +1 clk
pcmpgtb mm2, mm5 ; +1 clk
paddb mm1, mm6
paddb mm0, mm6
pandn mm3, mm7 ; +1 clk
pandn mm2, mm7
psubb mm1, mm3 ; +2 clk
psubb mm0, mm2
movq mm2, mm0 ; +2 clk
punpckhbw mm0, mm1; +1 clk
punpcklbw mm2, mm1; +1 clk
movq , mm0
movq , mm2; +2 clk
Total 15 clock ticks.

My code has less dependences of registers and as a result 2 clock ticks are released.
Posted on 2002-02-24 11:39:49 by Nexo
Nexo, having tested both, they perform the same? Both have the alignment problem as well. They are 72 bytes long - the decode bandwidth is at maximum when performing at 9 clock cycles. It will not get faster than that without getting smaller. There are 6 groups of 3 instructions, and 3 groups of two instructions. Here is a picture of the decode part:
Posted on 2002-02-24 12:24:12 by bitRAKE
After the instructions are decoded, then the processor rearranges them to reduce dependancies. Your algo is technically better, but the processor does that for me.
Posted on 2002-02-24 12:28:42 by bitRAKE
bitRAKE, I tested only linear execution (without cycle). I made real measurements of execution of the code. They were confirmed with programs " VTune (TM) Performance Analyzer " and " AMD CodeAnalyst ".
1. You in one group have dependences which do not allow to be executed to group for one clock tick. Even it is not possible.
2. In the document " AMD Athlon Processor x86 Code Optimization Guide " (pages 289-292) can be seen, that MMX commands are executed for two clock ticks in the block of execution!
3. AMD the processor has physically two blocks of execution MMX (AMD Athlon Processor Technical Brief, page 3), therefore on two clock ticks can be executed two MMX commands if is not present dependence.
4. Three commands are simultaneously executed only in blocks of execution of integer operations (IEU/ALU), but MMX commands are executed in blocks FADD/MMX and FMUL/MMX.

Therefore yours 9 clock ticks are erratic.
Posted on 2002-02-24 14:17:31 by Nexo
Hi, hutch--.
As a first approximation my code of your task looks so:
; Input: al-number
; Output: eax-chars
movzx eax, al
cmp eax, 10
jb @@ 1
mov ecx, '-,'shl 16
cmp eax, 100
jb @@ 10
mov edx, 51EB851Fh
mov ebx, eax
mul edx
shr edx, 5
imul eax, edx,-100
lea ecx,
add eax, ebx
shl ecx, 16
@@ 10:
mov edx, 0CCCCCCCDh
mov ebx, eax
mul edx
shr edx, 3
imul eax, edx,-10
shl edx, 8
add eax, ebx
lea ecx,
add eax, ecx
bswap eax
ret
@@ 1:
shl eax, 24
add eax, ' 0,--'
ret
Posted on 2002-02-24 14:18:36 by Nexo
Hutch, I have absolutely forgotten about the best alternative :) , that it is possible to use the table with a size in 1 Kilobytes.
movzx eax, al
mov eax,
Posted on 2002-02-24 14:43:10 by Nexo
Hope this makes sense.

Last question: unsigned? (in your example was 199,
I assume unsigned. Right?
Posted on 2002-02-24 15:28:02 by The Svin
Alex, Nexo,

A friend of mine, Jibz dug up an algo he had which more or less did the job perfectly, it took the BYTE data in AL and wrote the output to which saved me doing the register transfer I had done when interfacing with a number of HLL routines.

This is the code below, the slight notation difference is because it is PowerBASIC inline asm.


' @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

! push ecx

! xor ecx, ecx
more1:
! aam
! push eax
! inc ecx
! shr ax, 8
! jnz more1
more2:
! pop eax
! add al, &H30
! stosb
! loop more2

! mov al, "," ; formatting
! stosb

! pop ecx

' @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

he also gave me another one that he thought was probably faster but the quantum leap in performance was done by this one. It is plugged in the middle of a very messy algo that loads a file, sequentialy reads through it, converts each byte to its ascii equivalent, formats it in rows of 16 ascii bytes and trims the end of the buffer at the correct length.

When I get the time, I would like to rewrite the complete algo as it was based on a different set of asumptions when I first wrote it and I think there are still some performance gains left in doing so.

Regards & thanks to both for your effort.

hutch@movsd.com
Posted on 2002-02-24 16:44:11 by hutch--

Therefore yours 9 clock ticks are erratic.
Yes, I have read the same material many times - I don't believe all that I have read. I tested 4,000 times loop with average of 9 cycles - I don't think this is 'erratic'? I will do a more thorough test - I need to know what is really taking place within the CPU. I have many questions that I must answer for my own code analysis program. The picture above is just the decode, not the execution - I know those groupings can not execute together. Please read the sentence before the picture! Do you know why it must be aligned at 16+7 to get this speed?
Instruction Control Unit
The instruction control unit (ICU) is the control center for the AMD Athlon processor. The ICU controls the following resources?the centralized in-flight reorder buffer, the integer scheduler, and the floating-point scheduler. In turn, the ICU is responsible for the following functions?MacroOP dispatch, MacroOP retirement, register and flag dependency resolution and renaming, execution resource management, interrupts, exceptions, and branch mispredictions.

The ICU takes the three MacroOPs per cycle from the early decoders and places them in a centralized, fixed-issue reorder buffer. This buffer is organized into 24 lines of three MacroOPs each. The reorder buffer allows the ICU to track and monitor up to 72 in-flight MacroOPs (whether integer or floating-point) for maximum instruction throughput. The ICU can simultaneously dispatch multiple MacroOPs from the reorder buffer to both the integer and floating-point schedulers for final decode, issue, and execution as OPs. In addition, the ICU handles exceptions and manages the retirement of MacroOPs.
Above is the part that interests me.

Also if you look at page 231, Table 4, you will see that all three execution units can process MMX instructions - this is in document 22007J - August 2001. So you are quite plainly wrong about only 2 MMX instructions executing at once. :)
Posted on 2002-02-24 18:37:50 by bitRAKE