I have written myself a primitive "itoa" function (NASM) which actually works (or so it seems so far) :shock: But it uses both EAX, EBX, ECX and EDX. I'm not sure if that is a problem or not, but I was wondering what the "conventions" for using registers were. Can you just use any register at any time, or will that be frowned upon? Should I use local variables instead whenever possible (for example instead of ECX as a loop counter in my code below)?
Here is the specific code, but my question is general. And please bear with me if the code sucks - I have just started out and don't know very much yet.. Just hacking along with what I've learned so far. Suggestions for improvement are very welcome, I'm eager to learn. ;)
Here is the specific code, but my question is general. And please bear with me if the code sucks - I have just started out and don't know very much yet.. Just hacking along with what I've learned so far. Suggestions for improvement are very welcome, I'm eager to learn. ;)
; Function _itoa (int, char*)
; ------------------------------
; Converts 32-bit integer to ASCII
; Input: - Arg1: An integer (dword) to convert to ascii
; - Arg2: 32-bit pointer to a 32 bytes (or more) string buffer to store the result in
;
; Note 1: Push arguments on stack in reverse order (Arg2 first)
; Note 2: The ASCII string will contain the number represented in binary with all leading zeroes
_itoa:
%define src_integer dword ; 1st argument
%define dst_buffer dword ; 2nd argument
push ebp
mov ebp, esp ; Stack frame
push eax
push ebx
push ecx
push edx
; Conversion loop
mov edx, dst_buffer ; Pointer to string buffer
mov eax, src_integer ; Integer to convert
mov ecx, 32 ; Repeat once for each bit
conv_loop:
mov ebx, eax
and ebx, 80000000h ; Get leftmost bit in ebx - sets ZF
jnz addone
mov , byte 30h ; Leftmost bit is 0 - write to buffer
jmp next
addone:
mov , byte 31h ; Leftmost bit is 1 - write to buffer
next:
shl eax, 1 ; Next bit in integer
inc edx ; Next byte in buffer
dec ecx
jnz conv_loop
; Cleanup
pop edx
pop ecx
pop ebx
pop eax
pop ebp
ret (2 * 4) ; 2 arguments * 4 bytes
!me,
Along with fine guides about optimization, Agner Fog's site contains document about calling conventions used in various compilers/OSes. For example, stdcall calling conventions, common for Win32 API, allow to use eax, ecx, edx and FPU/MMX/XMM registers freely while preserving others.
You can use registers as you wish if you have preserved those specified in calling conventions you wish to adhere to.
As about your code, for a first try it's not bad (only quite straightforward).
Stack frame is not necessary for small functions like this, you may use dword to access arguments (it makes those mov instructions slightly longer, but at a benefit of absent prolog/epilog code).
Your if/then/else construction is not necessary, it can be reduced to something like
You may use calculated binary digit right now, or defer store until ebx contains four digits (counter in ecx helps to detect when this occurs).
ecx and edx are updated synchronously, this can be used to eliminate one of the updates. For example, you can count from 0 to 31 (easy way to detect 31->32 transition is to test lower 5 bits of ecx for being zero) and use as the address of memory to store digit.
Along with fine guides about optimization, Agner Fog's site contains document about calling conventions used in various compilers/OSes. For example, stdcall calling conventions, common for Win32 API, allow to use eax, ecx, edx and FPU/MMX/XMM registers freely while preserving others.
You can use registers as you wish if you have preserved those specified in calling conventions you wish to adhere to.
As about your code, for a first try it's not bad (only quite straightforward).
Stack frame is not necessary for small functions like this, you may use dword to access arguments (it makes those mov instructions slightly longer, but at a benefit of absent prolog/epilog code).
Your if/then/else construction is not necessary, it can be reduced to something like
xor ebx, ebx ; not necessarily right before the following
shl eax, 1 ; shift leftmost bit into CF
adc ebx, '0' ; bl = '0'+CF
You may use calculated binary digit right now, or defer store until ebx contains four digits (counter in ecx helps to detect when this occurs).
ecx and edx are updated synchronously, this can be used to eliminate one of the updates. For example, you can count from 0 to 31 (easy way to detect 31->32 transition is to test lower 5 bits of ecx for being zero) and use as the address of memory to store digit.
Thank you for taking time replying :) That link was great, I'll read it all as soon as I get time.
As for the _itoa function, I got most of your suggestions to work, but couldn't make it work without the stack frame.. Never learned how to do that, and it seems like I did something very wrong.
Here is what I currently have:
As for the _itoa function, I got most of your suggestions to work, but couldn't make it work without the stack frame.. Never learned how to do that, and it seems like I did something very wrong.
Here is what I currently have:
; Function _itoa (int, char*)
; ------------------------------
; Converts 32-bit integer to ASCII
; Input: - Arg1: An integer (dword) to convert to ascii
; - Arg2: 32-bit pointer to a 32 bytes (or more) string buffer to store the result in
;
; Note 1: Push arguments on stack in reverse order (Arg2 first)
; Note 2: The ASCII string will contain the number represented in binary
_itoa:
%define src_integer dword ; 1st argument
%define dst_buffer dword ; 2nd argument
push eax
push ebx
push ecx
push edx
; Conversion loop
mov edx, dst_buffer ; Pointer to string buffer
mov eax, src_integer ; Integer to convert
xor ecx, ecx ; Clear loop counter
conv_loop:
xor ebx, ebx ; Clear EBX
shl eax, 1 ; Shift leftmost bit into CF
adc ebx, '0' ; Add with CF => BL contains byte to write to buffer
mov , bl ; Write to buffer
inc ecx ; Next loop iteration
cmp ecx, 32
jne conv_loop
; Cleanup
pop edx
pop ecx
pop ebx
pop eax
ret (2 * 4) ; 2 arguments * 4 bytes
!me,
There is a catch in frameless functions: relative (to esp) offsets of arguments/local variables change if you use stack. Those pushes move stack pointer 16 bytes down, now arguments are at dword .
There is a catch in frameless functions: relative (to esp) offsets of arguments/local variables change if you use stack. Those pushes move stack pointer 16 bytes down, now arguments are at dword .
I thought so, actually.. IMO it gets a bit messy then, I think I'll just keep my stack frame for now, unless there is a major performance hit or something
Thanks much for your help!
Thanks much for your help!
!me,
There is another technique to manipulate bits: let's assume that ebx contains 0Bh (arbitrary 4-bit pattern, 1011). Multiply it by magic number (it has every 7th bit set):
Though there is a little problem — al contains binary digit for the least significant bit in bl. If we store eax as is, displayed digits will have reverse order. To restore proper order, bswap instruction can be used.
BTW, that magic number can be easily built using expression (1 SHL ((32/7)*7) - 1)/(1 SHL 7 - 1), where 7 is the period.
There is another technique to manipulate bits: let's assume that ebx contains 0Bh (arbitrary 4-bit pattern, 1011). Multiply it by magic number (it has every 7th bit set):
imul eax, ebx, 00000000001000000100000010000001b
; now eax == 0162C58Bh == 00000001011000101100010110001011b
and eax, 0x01010101
; now eax == 01000101h == 00000001000000000000000100000001b
or eax, '0000'
; now eax == 31303131h
Though there is a little problem — al contains binary digit for the least significant bit in bl. If we store eax as is, displayed digits will have reverse order. To restore proper order, bswap instruction can be used.
BTW, that magic number can be easily built using expression (1 SHL ((32/7)*7) - 1)/(1 SHL 7 - 1), where 7 is the period.
Now that is cool! I love hacks like that :P
Does it have any advantages over the other algorithm (except from the added coolness-factor)?
Does it have any advantages over the other algorithm (except from the added coolness-factor)?
Your mileage may vary. That depends on many factors, not the last of them being the CPU make/model.
This approach shows possibility of pseudo-SIMD operation on plain i386, four bits are converted in parallel.
Google for "bithack", you may learn much from this. Google is your friend. ;-)
This approach shows possibility of pseudo-SIMD operation on plain i386, four bits are converted in parallel.
Google for "bithack", you may learn much from this. Google is your friend. ;-)
I'd say a rule of thumb is that you use registers for the most frequently modified values.
Registers are fastest, local variables are effectively cache variables, so they will have a bit more latency.
How many registers you use exactly, depends on the specific routine. Since in most cases, only eax, ecx and edx are considered 'volatile', all other registers have to be preserved by the callee (and likewise, if your routine uses eax, ecx and edx, it has to preserve those when it calls another function).
So it then depends on the cost of storing and restoring the registers vs the cost of using local variables (which are preserved when you call another function).
Registers are fastest, local variables are effectively cache variables, so they will have a bit more latency.
How many registers you use exactly, depends on the specific routine. Since in most cases, only eax, ecx and edx are considered 'volatile', all other registers have to be preserved by the callee (and likewise, if your routine uses eax, ecx and edx, it has to preserve those when it calls another function).
So it then depends on the cost of storing and restoring the registers vs the cost of using local variables (which are preserved when you call another function).