I have written myself a primitive "itoa" function (NASM) which actually works (or so it seems so far) :shock: But it uses both EAX, EBX, ECX and EDX. I'm not sure if that is a problem or not, but I was wondering what the "conventions" for using registers were. Can you just use any register at any time, or will that be frowned upon? Should I use local variables instead whenever possible (for example instead of ECX as a loop counter in my code below)?

Here is the specific code, but my question is general. And please bear with me if the code sucks - I have just started out and don't know very much yet.. Just hacking along with what I've learned so far. Suggestions for improvement are very welcome, I'm eager to learn. ;)



; Function _itoa (int, char*)
; ------------------------------
; Converts 32-bit integer to ASCII
; Input:    - Arg1: An integer (dword) to convert to ascii
;          - Arg2: 32-bit pointer to a 32 bytes (or more) string buffer to store the result in     
;
; Note 1: Push arguments on stack in reverse order (Arg2 first)
; Note 2: The ASCII string will contain the number represented in binary with all leading zeroes
_itoa:
    %define    src_integer dword     ; 1st argument
    %define    dst_buffer  dword     ; 2nd argument
   
    push ebp
    mov ebp, esp                        ; Stack frame
   
    push eax
    push ebx
    push ecx
    push edx
   
    ; Conversion loop   
    mov edx, dst_buffer                ; Pointer to string buffer             
    mov eax, src_integer                ; Integer to convert
    mov ecx, 32                        ; Repeat once for each bit
conv_loop:   
    mov ebx, eax
    and ebx, 80000000h                  ; Get leftmost bit in ebx - sets ZF 
    jnz addone     
    mov , byte 30h                ; Leftmost bit is 0 - write to buffer
    jmp next
  addone:
    mov , byte 31h                ; Leftmost bit is 1 - write to buffer
  next:
    shl eax, 1                          ; Next bit in integer
    inc edx                            ; Next byte in buffer
    dec ecx   
    jnz conv_loop
   
    ; Cleanup
    pop edx
    pop ecx
    pop ebx
    pop eax
   
    pop ebp
    ret (2 * 4)                        ; 2 arguments * 4 bytes

Posted on 2010-04-29 12:48:03 by !me
!me,

Along with fine guides about optimization, Agner Fog's site contains document about calling conventions used in various compilers/OSes. For example, stdcall calling conventions, common for Win32 API, allow to use eax, ecx, edx and FPU/MMX/XMM registers freely while preserving others.

You can use registers as you wish if you have preserved those specified in calling conventions you wish to adhere to.

As about your code, for a first try it's not bad (only quite straightforward).

Stack frame is not necessary for small functions like this, you may use dword to access arguments (it makes those mov instructions slightly longer, but at a benefit of absent prolog/epilog code).

Your if/then/else construction is not necessary, it can be reduced to something like

        xor    ebx, ebx        ; not necessarily right before the following
        shl    eax, 1          ; shift leftmost bit into CF
        adc    ebx, '0'        ; bl = '0'+CF


You may use calculated binary digit right now, or defer store until ebx contains four digits (counter in ecx helps to detect when this occurs).

ecx and edx are updated synchronously, this can be used to eliminate one of the updates. For example, you can count from 0 to 31 (easy way to detect 31->32 transition is to test lower 5 bits of ecx for being zero) and use as the address of memory to store digit.
Posted on 2010-04-29 13:38:00 by baldr
Thank you for taking time replying :) That link was great, I'll read it all as soon as I get time.

As for the _itoa function, I got most of your suggestions to work, but couldn't make it work without the stack frame.. Never learned how to do that, and it seems like I did something very wrong.

Here is what I currently have:



; Function _itoa (int, char*)
; ------------------------------
; Converts 32-bit integer to ASCII
; Input:    - Arg1: An integer (dword) to convert to ascii
;          - Arg2: 32-bit pointer to a 32 bytes (or more) string buffer to store the result in     
;
; Note 1: Push arguments on stack in reverse order (Arg2 first)
; Note 2: The ASCII string will contain the number represented in binary
_itoa:
    %define    src_integer dword     ; 1st argument
    %define    dst_buffer  dword     ; 2nd argument
                       
    push    eax
    push    ebx
    push    ecx
    push    edx
   
    ; Conversion loop   
    mov    edx, dst_buffer                ; Pointer to string buffer             
    mov    eax, src_integer                ; Integer to convert
    xor    ecx, ecx                        ; Clear loop counter
conv_loop:
    xor    ebx, ebx                        ; Clear EBX
    shl    eax, 1                          ; Shift leftmost bit into CF 
    adc    ebx, '0'                        ; Add with CF => BL contains byte to write to buffer     
    mov    , bl                ; Write to buffer
    inc    ecx                            ; Next loop iteration
    cmp    ecx, 32
    jne    conv_loop
   
    ; Cleanup
    pop    edx
    pop    ecx
    pop    ebx
    pop    eax
   
    ret    (2 * 4)                        ; 2 arguments * 4 bytes


Posted on 2010-04-29 14:14:06 by !me
!me,

There is a catch in frameless functions: relative (to esp) offsets of arguments/local variables change if you use stack. Those pushes move stack pointer 16 bytes down, now arguments are at dword .
Posted on 2010-04-29 14:27:59 by baldr
I thought so, actually.. IMO it gets a bit messy then, I think I'll just keep my stack frame for now, unless there is a major performance hit or something

Thanks much for your help!
Posted on 2010-04-29 15:06:57 by !me
!me,

There is another technique to manipulate bits: let's assume that ebx contains 0Bh (arbitrary 4-bit pattern, 1011). Multiply it by magic number (it has every 7th bit set):


        imul    eax, ebx, 00000000001000000100000010000001b
; now eax == 0162C58Bh == 00000001011000101100010110001011b
        and    eax, 0x01010101
; now eax == 01000101h == 00000001000000000000000100000001b
        or      eax, '0000'
; now eax == 31303131h


Though there is a little problem al contains binary digit for the least significant bit in bl. If we store eax as is, displayed digits will have reverse order. To restore proper order, bswap instruction can be used.

BTW, that magic number can be easily built using expression (1 SHL ((32/7)*7) - 1)/(1 SHL 7 - 1), where 7 is the period.
Posted on 2010-04-29 16:59:26 by baldr
Now that is cool! I love hacks like that :P

Does it have any advantages over the other algorithm (except from the added coolness-factor)?
Posted on 2010-04-29 18:07:20 by !me
Your mileage may vary. That depends on many factors, not the last of them being the CPU make/model.

This approach shows possibility of pseudo-SIMD operation on plain i386, four bits are converted in parallel.

Google for "bithack", you may learn much from this. Google is your friend. ;-)
Posted on 2010-04-30 09:01:22 by baldr
I'd say a rule of thumb is that you use registers for the most frequently modified values.
Registers are fastest, local variables are effectively cache variables, so they will have a bit more latency.
How many registers you use exactly, depends on the specific routine. Since in most cases, only eax, ecx and edx are considered 'volatile', all other registers have to be preserved by the callee (and likewise, if your routine uses eax, ecx and edx, it has to preserve those when it calls another function).
So it then depends on the cost of storing and restoring the registers vs the cost of using local variables (which are preserved when you call another function).
Posted on 2010-04-30 11:55:18 by Scali