Getting tired of searching for an effective algorithm which calculates the number of digits of a 64-Bit-unsigned-integer but
finding only long-winded binary-to-string-to-binary conversion routines I wrote my own one.
The decimal part uses the coprocessor using this simple formula:
numdigits = int(log(number64))+1
The hexadecimal part uses the powerful BSR ("bit scan reverse") instruction, which can also be considered as binary logarithm.
Input: EAX=Pointer qword ptr
Output: EAX=number of digits
Changed: EAX, EDX, FPU(S0...S2)
Written for TASM, 32Bit.
Thanks for any comments or suggestions.
finding only long-winded binary-to-string-to-binary conversion routines I wrote my own one.
The decimal part uses the coprocessor using this simple formula:
numdigits = int(log(number64))+1
The hexadecimal part uses the powerful BSR ("bit scan reverse") instruction, which can also be considered as binary logarithm.
Input: EAX=Pointer qword ptr
Output: EAX=number of digits
Changed: EAX, EDX, FPU(S0...S2)
Written for TASM, 32Bit.
Thanks for any comments or suggestions.
fpu_ctrlword record fpucw_$1:3,fpucw_ic:1,fpucw_rc:2,fpucw_pc:2,fpucw_ie:1,fpucw_$2:1,fpucw_pm:1,fpucw_um:1,fpucw_om:1,fpucw_zm:1,fpucw_dm:1,fpucw_im:1
;number64 dq 12345678901234
number64 dq 01fh
main proc @type_main argc:dword,argv:dword,argenv:dword
mov eax,offset number64
fninit
;----Start Decimal part----------------------------------------------------------------------------
push dx ;make room for FPU control word
fnstcw ;save FPU control word
push word ptr ;create a copy of FPU control word
setflag word ptr,3 shl fpucw_rc ;set FPU rounding mode: truncate
fldcw ;set the FPU rounding moude
fld1 ;"1"
fldlg2 ;log(2) = 1/log(2,10)
fild qword ptr ;load number64
ftst ;test if number64 equals to zero
fnstsw ax
sahf ;set processor flags to FPU result for evaluation
.if zero?
fcompp ;throw away s0 and s1, leave "1"
.else
fyl2x ;s1 * log(2,s0) = log(2,s0) / log(2,10) = log(s0)
fadd ;= log(s0)+1
.endif
fistp word ptr
xor eax,eax
pop ax ;load EAX with num of digits
fldcw ;restore FPU control word
pop dx ;throw away FPU control word space
;----End Decimal part----------------------------------------------------------------------------
mstr s1,<"The number %I64u contains %d decimal digits.",10>
call printf,offset s1,dword ptr number64+4 dword ptr number64,eax
mov eax,offset number64
;----Start Hexadecimal part----------------------------------------------------------------------------
xor edx,edx ;initialize digit counter value in case number64 equals to zero
bsr edx,dword ptr ;search for the first set bit in HIGH_DWORD
.if zero? ;no bit was set
bsr edx,dword ptr ;search the first set bit in LOW_DWORD
.else
add edx,32
.endif
xchg eax,edx
shr eax,2
inc eax
;----End Hexadecimal part----------------------------------------------------------------------------
mstr s2,<"The number %I64Xh contains %d hexadecimal digits.">
call printf,offset s2,dword ptr number64+4 dword ptr number64,eax
endmain: ret
endp
the number of digits of a 64-Bit-unsigned-integer
numdigits = int(log(number64))+1
numdigits = int(log(number64))+1
One small word of caution: You can load only a 64-Bit-signed-integer to the FPU. Your procedure should thus be restricted to a 63-Bit-unsigned-integer if you want to use it only for unsigned integers.
If you intend to use it for 64-Bit-signed-integers, be sure to use "fabs" before you attempt to compute the log.
That's fully correct. I overlooked this case. But assuming that such a large 64Bit unsigned number will never be committed to the procedure and a programmer pays attention to the argument I think "fabs" can be omitted for code space.
Funny, you can load 128 bits to the gpu and do math and return a meaningful result via a texture.
Am I missing something?
Am I missing something?
Funny, you can load 128 bits to the gpu and do math and return a meaningful result via a texture.
Am I missing something?
Yes, certainly. You are in the wrong thread. But if you are able to program your "gpu" with this "texture" please post it here to suggest a better solution or leave this thread. Thank you.
I didn't see a DO NOT DISTURB sign on the door, sorry about that.
Actually its becoming quite common to use the gpu for general purpose processing - its a powerful number cruncher, and allows for 1x128 bit, 2x64 bit or 4x32bit parallel processing - you have this muscle that you never flex unless you're playing games, its a waste - search for GPGPU.
Thank you, have a nice day.
Actually its becoming quite common to use the gpu for general purpose processing - its a powerful number cruncher, and allows for 1x128 bit, 2x64 bit or 4x32bit parallel processing - you have this muscle that you never flex unless you're playing games, its a waste - search for GPGPU.
Thank you, have a nice day.
Are you serious? If I get you right, a user shall buy a multi-monster-high-end-PC with a super-3D-power-wasting-multitexture-water-cooled graphic card just only to run a program which calculates the number of digits from a shitty integer? That's like breaking a butterfly on a wheel. But thanks for the proposal.
TasmDev, it's true that for one simple calculation you don't need anything like GPGPU, super-uber parallelism, SSE78 optimization, etc. But what you asked for was a fast way to do this (or at least that's what I understand from your first post) and Homer suggested one. If you want to "calculate the number of digits from a shitty integer" then you can very well use the "long-winded binary-to-string-to-binary conversion routines". Please state clearly what speeds are you aiming at so you won't get frustrated next time someone suggests you a solution.
How about something like this:
Just an idea. May be faster, may be slower, may be optimized, may lead to tricks
Homer: GPUs can only handle single-precision floats, only 2 recent ones added double-precision floats. You can't even do bit-fiddling ops like XOR/AND/OR. Maybe, just maybe that functionality is exposed/emulated in CUDA, but GL_NV_fragment_program4's doc only rejects that notion.
Whoops, it's documented in GL_NV_gpu_program4 - it has xor/and/or/not instructions
mov edx,dword ptr ; load hi-dword
mov eax,dword ptr ; load lo-dword
.if edx==0
cmp eax,10
jl _numDigitsIs1
cmp eax,100
jl _numDigitsIs2
cmp eax,1000
jl _numDigitsIs3
cmp eax,10000
jl _numDigitsIs4
...
.else
.. here in a bit more complicated way compare to 64-bit values
.endif
Just an idea. May be faster, may be slower, may be optimized, may lead to tricks
Homer: GPUs can only handle single-precision floats, only 2 recent ones added double-precision floats. You can't even do bit-fiddling ops like XOR/AND/OR. Maybe, just maybe that functionality is exposed/emulated in CUDA, but GL_NV_fragment_program4's doc only rejects that notion.
Whoops, it's documented in GL_NV_gpu_program4 - it has xor/and/or/not instructions
Shader constants are one or more 128 bit float, typically addressed as 4xreal4.
The trick then is to structure operations to take advantage of this ability, and to feed crafted data to the shader.
The trick then is to structure operations to take advantage of this ability, and to feed crafted data to the shader.
But what you asked for was a fast way to do this (or at least that's what I understand from your first post) and Homer suggested one.
Quite the contrary, the GPU would not be fast for this at all. Way too much latency to seriously consider it for doing non-parallel work.
Using the slow-assed div instruction and dividing by 10 until the quotient is 0 would be many times faster that trying to shuttle data over to the GPU, running a shader, and then shuttling a result back.
I'm sure that GPGPU is all cool and shit, but come on... be realistic... it aint cool this time.. it would be a novel solution with no practical value at all
You are using the BSR for hex numbers. Why don't you use that procedure also for decimal numbers and use the result to pick the number of decimal digits from a look-up table which should take you less time to prepare with your calculator than the time it took to write your initial algo.
I doubt you can devise a faster procedure. AND it would work for the full range of unsigned 64-bit integers.
I doubt you can devise a faster procedure. AND it would work for the full range of unsigned 64-bit integers.
You are using the BSR for hex numbers. Why don't you use that procedure also for decimal numbers and use the result to pick the number of decimal digits from a look-up table which should take you less time to prepare with your calculator than the time it took to write your initial algo.
Consider that a 10-bit number can range from 512..1023
..so is a 10-bit number 3 digits, or 4? How will a lookup table help while still fitting on a practical storage medium (a 1TB drive wouldn't be big enough, would it?)
Note the relationship between base-2 and base-16 that makes his hex techique possible.
Okay I wrote this based on Ultrano's suggestion. The code is written in BASM:
I compiled and tested it in Delphi 6 for 32-bit DWORD values.
Function GetDigitCount (Number : DWORD) : DWORD; Assembler; Register;
asm
(* EAX = Number *)
PUSH EDI
MOV EDI , 10
MOV EDX , EAX
XOR EAX , EAX
MOV ECX , 10 (* The number that EDX has to be compared to *)
MOV EBX , 1 (* The number of digits that have to be returned *)
@@Loop:
CMP EDX , ECX
MOV EAX , EBX
JL @@End
IMUL ECX , 10
INC EBX
DEC EDI
JNZ @@Loop
@@End:
POP EDI
end;
I compiled and tested it in Delphi 6 for 32-bit DWORD values.
MASM
.data
??pow10 textequ <1>
POW10TABLE dq ??pow10
rept 19
??pow10 CatStr ??pow10,<0>
%dq ??pow10
endm
.code
UInt64Log10 proc uses ebx edi esi uint64:qword
mov eax,dword ptr uint64
mov edx,dword ptr uint64+4
lea ebx,POW10TABLE
.if edx
bsr ecx,edx
add ecx,32+1
imul ecx,ecx,1233
shr ecx,12
mov esi,eax
mov edi,edx
sub esi,
sbb edi,
@@: shr edi,31
sub ecx,edi
mov eax,ecx
ret
.endif
bsr ecx,eax
add ecx,1
imul ecx,ecx,1233
shr ecx,12
mov edi,eax
sub edi,
jmp @B
UInt64Log10 endp
UInt64Log2 proc uint64:qword
mov edx,dword ptr uint64+4;hi
mov eax,dword ptr uint64;lo
cmp edx,1; if (edx<1) { carry=1 } else { carry=0 }
cmovnc eax,edx
sbb edx,edx; set to -1 if carry
bsr eax,eax
and edx,-32
lea eax,
ret
UInt64Log2 endp
Here's a procedure using BSR and a lookup table. Fast and efficient.
- lpsrc is the memory address of the qword
- The number of digits in the decimal representation of the qword is returned in EAX
- Usage: invoke Ddigits, ADDR qword_source
- The source qword is not modified
I have tested it with all possible values of 2^{n}, 2^{n}-1, 10^{n} and 10^{n}-1, and all returned values were correct.
- lpsrc is the memory address of the qword
- The number of digits in the decimal representation of the qword is returned in EAX
- Usage: invoke Ddigits, ADDR qword_source
- The source qword is not modified
Ddigits proc uses ebx lpsrc :DWORD
.data
ALIGN 4
digits db 1, 1, 1, 1, 2, 2, 2, 3
db 3, 3, 4, 4, 4, 4, 5, 5
db 5, 6, 6, 6, 7, 7, 7, 7
db 8, 8, 8, 9, 9, 9,10,10
db 10,10,11,11,11,12,12,12
db 13,13,13,13,14,14,14,15
db 15,15,16,16,16,16,17,17
db 17,18,18,18,19,19,19,19
lo_limit dd -1,-1,-1,9,-1,-1,63h,-1
dd -1,3E7h,-1,-1,-1,270Fh,-1,-1
dd 1869Fh,-1,-1,0F423Fh,-1,-1,-1,98967Fh
dd -1,-1,5F5E0FFh,-1,-1,3B9AC9FFh,-1,-1
hi_limit dd -1,540bE3FFh,-1,-1,4876E7FFh,-1,-1,0D4A50FFFh
dd -1,-1,-1,4E729FFFh,-1,-1,107A3FFFh,-1
dd -1,0A4C67FFFh,-1,-1,-1,6FC0FFFFh,-1,-1
dd 5D89FFFFh,-1,-1,0A763FFFFh,-1,-1,-1,89E7FFFFH
dd 2,2,8,10h,17h,40h,80h,0E8h
dd 200h,400h,800h,918h,2000h,4000h,5AF3h,10000h
dd 20000h,38D7Eh,80000h,100000h,200000h,2386F2h,800000h,1000000h
dd 1634578h,4000000h,8000000h,0DE0B6B3h,20000000h,40000000h,80000000h,8AC72304h
.code
mov ebx,lpsrc
mov eax,
mov edx,
bsr ecx,edx ;check high dword first
.if ZERO? ;only a 32-bit number
bsr ecx,eax
.if ZERO?
xor ebx,ebx
.else
movzx ebx,digits
.if eax > lo_limit
inc ebx
.endif
.endif
.else
add ecx,32
movzx ebx,digits
cmp edx,hi_limit
.if ZERO?
.if eax > lo_limit
inc ebx
.endif
.elseif !CARRY?
inc ebx
.endif
.endif
mov eax,ebx
ret
Ddigits endp
I have tested it with all possible values of 2^{n}, 2^{n}-1, 10^{n} and 10^{n}-1, and all returned values were correct.
That's the best solution I have ever seen!
You could have done it easily by yourself if you had put your mind to it after I gave you the clue about a lookup table.
Chalk it up to experience for your next challenge. ;)
Chalk it up to experience for your next challenge. ;)
Funny, you can load 128 bits to the gpu and do math and return a meaningful result via a texture.
Am I missing something?
Neither nVidia nor ATi support 128 bit floats, as far as I know.
Their hardware normally works with 32-bit floats, but with GPGPU tasks they can process 64-bit precision at reduced performance.
I also don't know of any textures formats that would support 128 bit float.
4x real4 is not the same as 1x real16. You couldn't do a 128 bit log operation on the data.
A32R32G32B32 pixel format used as a vertex texture , and indexed via a pixelshader.