Hello everybody! I found this nice forum and have read some posts from this great community. Sadly, my first post here starts with a question  :lol:

As the topic says, I am looking for two routines: one to convert UTF-8 to ASCII and one to convert ASCII back to UTF-8. I know this can be done with the Win32 API calls MultiByteToWideChar() and WideCharToMultiByte(). So far so good  ;)

Is there any faster way to do it? Thanks in advance.
Posted on 2006-08-14 18:06:14 by Ralf
Essentially, you can simply ignore any character above 0x7F (or above 0xFF if you want full 8-bit ASCII), though the actual implementation is a little more complex.

If the string/text only has ASCII-based characters (i.e. Latin letters/numbers), no conversion will be needed. However, if you want to support extended Latin characters (0x80 - 0xFF), it involves a little more checking.
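
The "no conversion needed" case can be detected with a single pass over the bytes. Here is a minimal C sketch of that check; the function name utf8_is_plain_ascii is made up for illustration, and it assumes a NUL-terminated string.

```c
/* Hypothetical helper (name is an assumption, not from any library).
   Returns 1 if every byte of the NUL-terminated string is below 0x80,
   i.e. the UTF-8 text is already plain 7-bit ASCII and can be used as-is. */
int utf8_is_plain_ascii(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;

    while (*p != 0) {
        if (*p >= 0x80)   /* lead or continuation byte of a multi-byte sequence */
            return 0;
        p++;
    }
    return 1;
}
```

If it returns 1, the buffer can be copied or used directly; otherwise one of the conversion routines is needed.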

Check out the Wikipedia info on the subject; the tables there are pretty good.

Here is a quick-n-dirty (i.e. unoptimized for code clarity) example of converting UTF-8 to ASCII. This code pertains to 7-bit ASCII (0x00 - 0x7F), converting 8-bit ASCII would involve a little more code.


;Note: NASM Syntax

;* UTF-8 to ASCII *
Conversion:
  mov esi,UTF_String  ;Pointer to UTF-8 String (Source)
  mov edi,ASCII_String ;Pointer to ASCII String (Destination)

.convert:
  mov al,BYTE [esi] ;Retrieve UTF-8 Byte
  inc esi          ;Increment the UTF-8 String Pointer by one byte
  cmp al,0x80      ;Is the Byte within 7-Bit ASCII?
  jb .ASCII        ;If so (unsigned compare), process byte as ASCII...

  ;... * Otherwise, skip this UTF-8 Sequence *
  ;Note: assumes well-formed UTF-8 (a truncated Sequence could skip past the null-terminator)
  shl al,1    ;Drop the highest bit (that byte is already accounted for by "inc esi" above)
.count:
  shl al,1    ;Shift the next UTF-8 Sequence Indication bit into the Carry Flag
  jnc .convert ;If it is clear, the Sequence has been skipped; continue processing UTF-8 String
  inc esi      ;Increment UTF-8 Pointer by one byte for Sequence bypassing
  jmp .count  ;Continue processing Sequence bits...

  ;* Process ASCII Byte *
.ASCII:
  mov BYTE [edi],al ;Store the byte to the ASCII String
  inc edi          ;Increment the ASCII String Pointer by one byte
  cmp al,0x00      ;Have we reached the null-terminator in the string?
  je .end          ;If so, end conversion
  jmp .convert    ;Continue Conversion...

  ;* End of Conversion *
.end:
  ret ;Return from function
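
For the 8-bit case mentioned above, here is a rough C sketch of the extra checking involved. It assumes "8-bit ASCII" means ISO-8859-1 (Latin-1), where bytes 0x80 - 0xFF map directly to U+0080 - U+00FF; a Windows code page would need a lookup table instead. The function name utf8_to_latin1 and the choice of '?' as a replacement character are my own assumptions.

```c
#include <stddef.h>

/* Hypothetical helper: converts NUL-terminated UTF-8 to Latin-1 (ISO-8859-1).
   Characters above U+00FF and malformed bytes are replaced with '?'.
   dst must have room for strlen(src) + 1 bytes (the output is never longer
   than the input). Returns the number of bytes written, excluding the NUL. */
size_t utf8_to_latin1(const char *src, char *dst)
{
    const unsigned char *s = (const unsigned char *)src;
    unsigned char *d = (unsigned char *)dst;

    while (*s != 0) {
        unsigned char c = *s++;

        if (c < 0x80) {
            *d++ = c;                          /* 7-bit ASCII: copy as-is */
        } else if ((c == 0xC2 || c == 0xC3) && (*s & 0xC0) == 0x80) {
            /* 2-byte sequence encoding U+0080..U+00FF: merge the payload bits */
            *d++ = (unsigned char)(((c & 0x03) << 6) | (*s & 0x3F));
            s++;
        } else {
            *d++ = '?';                        /* not representable in Latin-1 */
            while ((*s & 0xC0) == 0x80)        /* skip remaining continuation bytes */
                s++;
        }
    }
    *d = 0;
    return (size_t)(d - (unsigned char *)dst);
}
```

Unlike the 7-bit routine, this one emits a placeholder instead of silently dropping characters, which makes lost data easier to spot.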


If you have no need/goal for optimization, API calls will do.

Hope this gives you some idea of what to do :)
Posted on 2006-08-14 21:41:57 by SpooK
@SpooK:
Thanks for the very fast reply :) We will try to convert the source to work with FASM, and maybe we will be able to optimize it. By the way, the API stuff is too slow for us :)
Posted on 2006-08-14 23:05:20 by Ralf

I know this can be done with the Win32 API calls MultiByteToWideChar() and WideCharToMultiByte() !

You could convert from UTF-8 to UTF-16 or vice versa by shifting and ORing bits. See the table in the Wikipedia article mentioned by SpooK or the ready-made C function at the official Unicode site; the UTF and BOM FAQ is also worth reading.
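
To illustrate the shifting and ORing, here is a minimal C sketch of the UTF-8 to UTF-16 direction. It is not the function from the Unicode site; the name utf8_to_utf16 is made up, and it assumes well-formed, NUL-terminated input (no validation of continuation bytes or overlong forms).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes NUL-terminated UTF-8 into UTF-16 code units.
   Assumes well-formed input. dst must have room for strlen(src) + 1 units.
   Returns the number of units written, excluding the zero terminator. */
size_t utf8_to_utf16(const char *src, uint16_t *dst)
{
    const unsigned char *s = (const unsigned char *)src;
    size_t n = 0;

    while (*s != 0) {
        uint32_t cp;

        if (s[0] < 0x80) {                        /* 0xxxxxxx */
            cp = s[0];
            s += 1;
        } else if ((s[0] & 0xE0) == 0xC0) {       /* 110xxxxx 10xxxxxx */
            cp = ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
            s += 2;
        } else if ((s[0] & 0xF0) == 0xE0) {       /* 1110xxxx + 2 continuation bytes */
            cp = ((uint32_t)(s[0] & 0x0F) << 12)
               | ((uint32_t)(s[1] & 0x3F) << 6)
               | (s[2] & 0x3F);
            s += 3;
        } else {                                  /* 11110xxx + 3 continuation bytes */
            cp = ((uint32_t)(s[0] & 0x07) << 18)
               | ((uint32_t)(s[1] & 0x3F) << 12)
               | ((uint32_t)(s[2] & 0x3F) << 6)
               | (s[3] & 0x3F);
            s += 4;
        }

        if (cp >= 0x10000) {                      /* needs a surrogate pair */
            cp -= 0x10000;
            dst[n++] = (uint16_t)(0xD800 | (cp >> 10));
            dst[n++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        } else {
            dst[n++] = (uint16_t)cp;
        }
    }
    dst[n] = 0;
    return n;
}
```

Real-world input should be validated first, since a truncated sequence would make this sketch read past the end of the string.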

After that, you could use WideCharToMultiByte with CP_ACP for converting UTF-16 to ASCII. For the reverse conversion (ASCII to UTF-8), call MultiByteToWideChar first, then convert UTF-16 to UTF-8 with your own function.
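
The "own function" for the UTF-16 to UTF-8 step could look roughly like the C sketch below. The name utf16_to_utf8 is hypothetical, and the sketch assumes zero-terminated input with correctly paired surrogates (an unpaired surrogate is simply encoded as three bytes rather than rejected).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encodes zero-terminated UTF-16 as UTF-8.
   dst must have room for 3 bytes per input unit plus 1 for the NUL.
   Returns the number of bytes written, excluding the NUL. */
size_t utf16_to_utf8(const uint16_t *src, char *dst)
{
    unsigned char *d = (unsigned char *)dst;

    while (*src != 0) {
        uint32_t cp = *src++;

        /* Combine a high/low surrogate pair into one code point */
        if (cp >= 0xD800 && cp <= 0xDBFF && *src >= 0xDC00 && *src <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (uint32_t)(*src - 0xDC00);
            src++;
        }

        if (cp < 0x80) {                          /* 1 byte */
            *d++ = (unsigned char)cp;
        } else if (cp < 0x800) {                  /* 2 bytes */
            *d++ = (unsigned char)(0xC0 | (cp >> 6));
            *d++ = (unsigned char)(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {                /* 3 bytes */
            *d++ = (unsigned char)(0xE0 | (cp >> 12));
            *d++ = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            *d++ = (unsigned char)(0x80 | (cp & 0x3F));
        } else {                                  /* 4 bytes */
            *d++ = (unsigned char)(0xF0 | (cp >> 18));
            *d++ = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            *d++ = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            *d++ = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    *d = 0;
    return (size_t)(d - (unsigned char *)dst);
}
```

For text that starts out as plain ASCII, only the 1-byte branch is ever taken, so the output is byte-for-byte the same as the input.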

SpooK's code will work only for English text, so it is fast but incorrect, since English is not the only language in the world.
Posted on 2006-08-19 22:56:38 by Peter