One quote from 7-th issue of ASSEMBLY  PROGRAMMING JOURNAL (Dec 99 - Feb 00), http://asmjournal.freeservers.com/issues/apj_7.txt
; MMX ltostr by Cecchinel Stephan
; Summary:     Convert long  value to an ASCII string
; Compatibility: MMX
; Notes:  Converts a number in EAX to an 8 bytes hexadecimal
;         string at 
;         14 clocks on a Celeron-333

Sum1  dd 0x30303030, 0x30303030
Comp1 dd 0x09090909, 0x09090909
Mask1 dd 0x0f0f0f0f, 0x0f0f0f0f
bswap eax
movq  mm2,   
movq  mm3,   
movq  mm4,   
movq  mm5,   mm3
psubb mm5,   mm4
movd  mm0,   eax
movq  mm1,   mm0
psrlq mm0,   4
pand  mm0,   mm2
pand  mm1,   mm2
punpcklbw mm0,mm1
movq  mm1,   mm0
pcmpgtb mm0, mm4
pand  mm0,   mm5
paddb mm1,   mm3
paddb mm1,   mm0
movq  , mm1
Apparently, this is not a MASM code, so I've rewritten the proc. The core portion
bswap eax
. . .
paddb mm(1), mm(0)
took only 9 clocks on my Athlon !!!
Furthermore, I've reordered the instructs to reduce the read-write dependencies and MMX reg-isters stalls, so code became messy, but now to output a string representation of dword in EAX into mm(1) register costs ~7 clocks only !!!
.model flat, stdcall
option casemap:none		; case sensitive
option PROLOGUE:None	; suppress prologue generation 
option EPILOGUE:None	; suppress epilogue generation 

Bias	  dq 3030303030303030h
Sieve	  dq 0f0f0f0f0f0f0f0fh
DHBarrier dq 0909090909090909h
; Convert dword to Hex string representation
; Declarations:
; MASM  - MMXdw2hex proto :dword, :dword
; C(++) - void MMXdw2hex(unsigned int, char*)
MMXdw2hex proc num:DWORD, buff:DWORD
	mov   eax, 
	movq  mm(3), Bias
	movq  mm(2), Sieve
	bswap eax
	movq  mm(5), mm(3)
	movq  mm(4), DHBarrier
	movd  mm(0), eax
	psubb mm(5), mm(4)
	movq  mm(1), mm(0)
	psrlq mm(0), 4
	pand  mm(1), mm(2)
	pand  mm(0), mm(2)
	punpcklbw mm(0),mm(1)
	movq  mm(1), mm(0)
	pcmpgtb mm(0), mm(4)
	pand  mm(0), mm(5)
	paddb mm(1), mm(3)
	paddb mm(1), mm(0)
	mov   eax,  
	movq  qword ptr , mm(1)
	mov   byte  ptr , 0
	ret 8
MMXdw2hex endp
The proc takes 51h bytes in .code and 18h bytes in .data only.
Now, in some code
. . .
MMXdw2hex proto :dword, :dword
. . .
. . .
num	dd 123456ABh
. . .
buff	db 9 dup(0)
. . .
. . .
	invoke MMXdw2hex, num, addr buff
. . .
it costs for invoke 14 clocks only to output sz "123456ab" into buff, all PUSHs/CALL/RET accounting!!!
1. You may modify/extend the proc to format the output as "123456AB", "123456ABh", "0x123456ab" etc.
2. To handle sdword (signed int), insert the following code
somewhere between  mov eax, and bswap eax instructs in dw2hex to obtain abs(num) in EAX and use SF for conditional formatting negative value representation (none of the proc's instructs affects it).
3. The proc isn't safe, because it destroys the contents of 6 MMX registers and (on Pentium) the whole FPU state. On Athlon you may freely use FPU/MMX instructs concurrently without any penalty or data loss. More over, it is very difficult to make it safe, as far as it doesn't know, do you need to preserve the current environment, and, if so, which registers should be preserved (saving/restoring the whole FPU/MMX engine state is VERY expensive, on Athlon ~300 clocks as per doc - I didn't check it in real code as it doesn't bother me). So the proper instructs should be issued before/after dw2hex call if you need to save/restore registers. For mm's situation isn't very bad - invoke MMXdw2hex toge
Posted on 2001-02-05 22:21:00 by DVA