Hello everybody! I found this nice forum and have read some posts from this great community. Sadly, my first post here starts with a question :lol:
As the topic says, I am looking for two routines: one to convert UTF-8 to ASCII and one to convert ASCII back to UTF-8. I know this can be done with the Win32 API calls MultiByteToWideChar() and WideCharToMultiByte(). So far so good ;)
Is there any faster way to do it? Thanks in advance.
Essentially, you can simply ignore any code point above 0x7F (or above 0xFF if you want full 8-bit "extended" ASCII), though the actual implementation is a little more complex.
If the string/text only has ASCII-based characters (i.e. Latin letters/numbers), no conversion will be needed. However, if you want to support extended Latin characters (0x80 - 0xFF), it involves a little more checking, since UTF-8 encodes each of them as two bytes.
Check out some Wiki info on the subject; the tables there are pretty good.
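For the reverse direction (8-bit text to UTF-8), the extra checking boils down to one branch. Below is a minimal C sketch, not a definitive implementation. It assumes "8-bit ASCII" means ISO-8859-1, where byte values 0x80 - 0xFF equal the Unicode code points (a Windows ANSI code page such as 1252 differs in 0x80 - 0x9F). The function name latin1_to_utf8 is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: encodes a NUL-terminated ISO-8859-1 string as UTF-8.
   dst must have room for 2 * strlen(src) + 1 bytes.
   Returns the number of bytes written, excluding the terminator. */
size_t latin1_to_utf8(const unsigned char *src, unsigned char *dst)
{
    size_t n = 0;
    assert(src != NULL && dst != NULL);
    for (; *src != 0; ++src) {
        if (*src < 0x80) {
            dst[n++] = *src;                              /* 7-bit ASCII: copy as-is */
        } else {
            dst[n++] = (unsigned char)(0xC0 | (*src >> 6));   /* lead byte 110000xx */
            dst[n++] = (unsigned char)(0x80 | (*src & 0x3F)); /* continuation 10xxxxxx */
        }
    }
    dst[n] = 0;
    return n;
}
```

Plain ASCII passes through untouched, so only bytes with the high bit set cost anything extra.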
Here is a quick-n-dirty (i.e. unoptimized for code clarity) example of converting UTF-8 to ASCII. This code pertains to 7-bit ASCII (0x00 - 0x7F); converting to 8-bit ASCII would involve a little more code.
;Note: NASM Syntax
;* UTF-8 to ASCII *
;Assumes a well-formed, null-terminated UTF-8 String
Conversion:
mov esi,UTF_String ;Pointer to UTF-8 String (Source)
mov edi,ASCII_String ;Pointer to ASCII String (Destination)
.convert:
mov al,BYTE [esi] ;Retrieve UTF-8 Byte
inc esi ;Increment the UTF-8 String Pointer by one byte
cmp al,0x80 ;Is the Byte within 7-Bit ASCII?
jb .ASCII ;If so (unsigned compare), process byte as ASCII...
;... * Otherwise, skip this UTF-8 Sequence *
shl al,1 ;Discard the highest bit (already accounted for by "inc esi" above)
.count:
shl al,1 ;Shift the next Sequence Indication bit into the Carry Flag
jnc .convert ;Bit clear? Sequence is skipped, continue processing UTF-8 String
inc esi ;Bit set: increment UTF-8 Pointer past one continuation byte
jmp .count ;Continue processing Sequence bits...
;* Process ASCII Byte *
.ASCII:
mov BYTE [edi],al ;Store the byte to the ASCII String
inc edi ;Increment the ASCII String Pointer by one byte
cmp al,0x00 ;Have we reached the null-terminator in the string?
je .end ;If so, end conversion
jmp .convert ;Continue Conversion...
;* End of Conversion *
.end:
ret ;Return from function
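As for the "little more code" needed for the 8-bit case, here is a rough C sketch of one way it could look. It assumes the 8-bit target is ISO-8859-1, so only the two-byte sequences starting with 0xC2 or 0xC3 are kept; everything above U+00FF and any malformed bytes are simply dropped. The name utf8_to_latin1 is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: converts NUL-terminated UTF-8 to an 8-bit
   (ISO-8859-1) string. Code points above U+00FF and malformed bytes
   are dropped. dst needs strlen(src) + 1 bytes.
   Returns the number of bytes written, excluding the terminator. */
size_t utf8_to_latin1(const unsigned char *src, unsigned char *dst)
{
    size_t n = 0;
    assert(src != NULL && dst != NULL);
    while (*src != 0) {
        unsigned char c = *src++;
        if (c < 0x80) {
            dst[n++] = c;                       /* 7-bit ASCII: copy as-is */
        } else if ((c == 0xC2 || c == 0xC3) && (*src & 0xC0) == 0x80) {
            /* U+0080..U+00FF: 110000xx 10xxxxxx -> xxxxxxxx */
            dst[n++] = (unsigned char)(((c & 0x03) << 6) | (*src++ & 0x3F));
        } else {
            /* Anything else: skip the continuation bytes that follow */
            while ((*src & 0xC0) == 0x80)
                ++src;
        }
    }
    dst[n] = 0;
    return n;
}
```

Skipping by testing each following byte for the 10xxxxxx pattern, rather than trusting the length bits of the lead byte, keeps the loop from running past the terminator on truncated input.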
If you have no need/goal for optimization, API calls will do.
Hope this gives you some idea of what to do :)
@SpooK:
Thanks for the very fast reply :) We will try to convert the source to work with FASM, and maybe we will be able to optimize it. Btw, the API stuff is too slow for us :)
I know this can be done with the Win32 API calls MultiByteToWideChar() and WideCharToMultiByte() !
You could convert from UTF-8 to UTF-16 or vice versa by shifting and ORing bits. See the table in the Wikipedia article mentioned by SpooK or the ready-made C function at the official Unicode site; the UTF and BOM FAQ is also worth reading.
After that, you could use WideCharToMultiByte with CP_ACP for converting UTF-16 to ASCII (more precisely, to the system ANSI code page). For the reverse conversion (ASCII to UTF-8), call MultiByteToWideChar first, then convert UTF-16 to UTF-8 with your own function.
SpooK's code will work only for English text, so it's fast but incorrect, because English is not the only language in the world.
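To illustrate the shifting and ORing mentioned above, here is a minimal C sketch of UTF-8 to UTF-16. It is not the function from the Unicode site; the name utf8_to_utf16 is made up, and it assumes well-formed input (no validation of overlong forms or bad lead bytes).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: converts NUL-terminated, well-formed UTF-8 to
   UTF-16 code units. No validation is done (assumption: input is valid).
   dst needs room for one unit per input byte plus a terminator.
   Returns the number of code units written, excluding the terminator. */
size_t utf8_to_utf16(const unsigned char *src, uint16_t *dst)
{
    size_t n = 0;
    assert(src != NULL && dst != NULL);
    while (*src != 0) {
        uint32_t cp = *src++;
        int extra = 0;                       /* continuation bytes to read */
        if (cp >= 0xF0)      { cp &= 0x07; extra = 3; }
        else if (cp >= 0xE0) { cp &= 0x0F; extra = 2; }
        else if (cp >= 0xC0) { cp &= 0x1F; extra = 1; }
        /* Each continuation byte contributes its low 6 bits */
        while (extra-- > 0 && (*src & 0xC0) == 0x80)
            cp = (cp << 6) | (*src++ & 0x3Fu);
        if (cp >= 0x10000) {
            cp -= 0x10000;                   /* split into a surrogate pair */
            dst[n++] = (uint16_t)(0xD800 | (cp >> 10));
            dst[n++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        } else {
            dst[n++] = (uint16_t)cp;
        }
    }
    dst[n] = 0;
    return n;
}
```

The reverse (UTF-16 to UTF-8) is the same idea run backwards: rebuild the code point from the surrogate pair if there is one, then emit 1 to 4 bytes depending on its size.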