How do I recognise Unicode or ANSI strings in a buffer?
Posted on 2010-01-23 04:16:10 by dcskm4200
Unicode characters obey a certain binary encoding; checking for that encoding is about the best you can do.
Posted on 2010-01-23 05:58:28 by Homer
Homer, maybe you should rename your "binary exercise #2" thread to something more descriptive?
Posted on 2010-01-23 12:42:34 by r22
How do I recognise Unicode or ANSI strings in a buffer?
"That depends".

First, unicode is a pretty vague term - there are several different encodings... detecting UTF-8 is going to be very different from detecting UTF-16 or UCS-2. Also, are we talking arbitrary buffers/strings, or detecting the encoding of a file? Some unicode text files are encoded with a BOM.
Posted on 2010-01-24 09:10:18 by f0dder

How do I recognise Unicode or ANSI strings in a buffer?


IsTextUnicode if you want to use the Windows API
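Something along these lines - a minimal C sketch, with a throwaway sample buffer just for illustration (link against Advapi32.lib):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Sample buffer: UTF-16LE text, which is what Win32 calls "Unicode". */
    WCHAR sample[] = L"Hello, world";
    INT tests = IS_TEXT_UNICODE_UNICODE_MASK;  /* which tests to run; receives the ones that passed */

    /* IsTextUnicode returns nonzero if the buffer is LIKELY to be UTF-16LE text. */
    if (IsTextUnicode(sample, sizeof(sample) - sizeof(WCHAR), &tests))
        printf("Looks like Unicode (UTF-16LE); tests passed: 0x%X\n", tests);
    else
        printf("Does not look like Unicode text\n");

    return 0;
}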
Posted on 2010-01-24 10:36:42 by donkey
Thanks all, especially donkey.
Posted on 2010-01-24 20:14:53 by dcskm4200
I wouldn't trust that api - Microsoft's implementation of unicode is based on some very bad ideas, such as
The number of characters in the string is odd. A string of odd length cannot (by definition) be Unicode text.


This is a misleading and incorrect statement.
Unicode certainly does support 3-byte sequences, rendering the notion of "16-bit Unicode" a relic.
The following code assumes we already tested the first byte and determined that it's greater than 128 (80h).
Anything less would be a single byte of plain American ASCII.
Anything bigger MAY indicate the beginning of a Unicode (or UTF-8-encoded Unicode) multibyte sequence.
Here's my test for a pure Unicode sequence, returning 0, 2 or 3 (the number of bytes in the Unicode codepoint).


;Method:    Parser.IsUnicodeChar
;Purpose:   Determine if the given value in eax is a possible unicode character
;Returns:   EAX = FALSE (not unicode), 2 (unicode length) or 3 (unicode length)
Method Parser.IsUnicodeChar,uses esi,dChar
    SetObject esi
    mov eax,dChar
    .if eax>0010FFFFh
        ;Its not unicode
        xor eax,eax
    .elseif eax>=00100000h
        ;Supplementary Private Use Area-B
        mov eax,3
    .elseif eax>=000F0000h
        ;Supplementary Private Use Area-A
        mov eax,3
    .elseif eax>=000E0000h
        ;Supplementary Special-purpose Plane SSP
        mov eax,3
    .elseif eax>=00040000h
        ;currently unassigned
        ;It is NOT unicode
        return FALSE
    .elseif eax>=00030000h
        ;Tentatively designated as the Tertiary Ideographic Plane (TIP), but no characters have been assigned to it yet.
        mov eax,3
    .elseif eax>=00020000h
        ;Supplementary Ideographic Plane SIP
        mov eax,3
    .elseif eax>=00010000h
        ;Supplementary Multilingual Plane SMP
        mov eax,3
    .else
        ;Basic Multilingual Plane BMP
        mov eax,2
    .endif
MethodEnd


And do note that unicode codepoints within UTF-8 (or UTF-16) are ENCODED - depending on the value of the codepoint, it can take four bytes to encode a codepoint whose raw value is three bytes long (or even five bytes in the theoretical extended scheme), etc...
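To make the length rules concrete, here's a small C sketch of my own (not taken from any existing code) that simply reports how many bytes the classic UTF-8 scheme spends on a given codepoint value:

#include <stdio.h>
#include <stdint.h>

/* Bytes used by classic (pre-RFC 3629) UTF-8 for a given codepoint value.
   Returns 0 if the value doesn't fit the scheme at all. */
static int utf8_encoded_length(uint32_t cp)
{
    if (cp < 0x80)       return 1;   /* plain ASCII, stored as-is   */
    if (cp < 0x800)      return 2;
    if (cp < 0x10000)    return 3;
    if (cp < 0x200000)   return 4;   /* covers every assigned plane */
    if (cp < 0x4000000)  return 5;   /* theoretical extension only  */
    if (cp < 0x80000000) return 6;   /* theoretical extension only  */
    return 0;
}

int main(void)
{
    uint32_t samples[] = { 0x41, 0x3B1, 0x4E2D, 0x10400 };
    for (int i = 0; i < 4; i++)
        printf("U+%04X -> %d UTF-8 byte(s)\n",
               (unsigned)samples[i], utf8_encoded_length(samples[i]));
    return 0;
}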

Have a nice day :)


(ps r22: I've done a lot of work with detecting various text encodings 'on the fly' lately, perhaps I'll write a thread dedicated to that topic in the near future, if anyone is interested)
Posted on 2010-01-28 07:33:24 by Homer
Saying that Microsoft's Unicode is 'wrong' is a stretch, if you ask me.
There's unicode, and then there's unicode.
What you describe is UTF-8, which is one possible way to encode unicode characters.
However, the Win32 API does not use this; it uses the UCS-2 encoding instead. In UCS-2, all odd-length strings are indeed non-Unicode. So as far as the Win32 API is concerned, they are correct (the API cannot handle UTF-8 encoding anyway, so why should they support detecting it?).

UCS-2 is now obsolete, yes... it is superseded by UTF-16 (where the above statement still holds: all strings are of even length), which extends UCS-2 in a similar way to how UTF-8 extends ANSI character sets.
But can you blame Microsoft, even say they are 'wrong', when their unicode implementation dates back to the early days of Windows NT? Back then all unicode was 16-bit anyway.
They support UTF-8 in their .NET framework anyway (which isn't bogged down by Windows NT legacy). But obviously they can't just swap the Win32 API over to another encoding; it would break all existing Unicode applications (clearly you realize that UTF-8 is not compatible with the Win32 API?).
Knowing whether a string is Unicode or not is not enough; you need to know the actual encoding.
For UCS-2 or UTF-16, your code is as 'wrong' as Microsoft's is for UTF-8.

For more info on different versions of unicode and different encodings, see http://en.wikipedia.org/wiki/Unicode
Posted on 2010-01-28 07:56:46 by Scali
What I stated is that unicode codepoints can be 3 bytes long - eastern languages such as chinese and russian are using these.

UTF-8 IS NOT UNICODE. It came later; it can be thought of as a compression scheme for Unicode codepoints (people looked at the old pure 16-bit Unicode and said "gee, look at all those zeroes..."). It happens to have the property that American ASCII bytes are legal and left unencoded, but Unicode codepoints are cleverly encoded, and the scheme can be extended at any time to handle longer (new) Unicode codepoints, which is precisely what I was describing...

utf-8 never contains 'pure' unicode codepoints!

Pure Unicode is not encoded; its codepoint values are exactly the values in the byte sequence.
UTF-8-encoded Unicode is just that - it's a binary packing scheme.
So if we have some utf-8 bytestream, we have to work hard to extract the unicode codepoint values from it.

;U-00000000 U-0000007F: 0xxxxxxx
;U-00000080 U-000007FF: 110xxxxx 10xxxxxx
;U-00000800 U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
;U-00010000 U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
;U-00200000 U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
;U-04000000 U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

You can see in the above table how UTF-8 encodes Unicode codepoints of up to 31 bits (so 5- and 6-byte encoded sequences have been defined, at least in theory).
The x's contain the bits which, when gathered together, form our Unicode codepoint value.
The 1's and 0's are there for sanity checking, and allow us to decode unicode from utf-8 (and detect bad encodings).
Note that the number of 1's in the high bits of the high order byte tells us how many bytes are in this encoded sequence, and how much work we need to do to get a U-value in the plain.

It's worth mentioning that only the first four rows of this table are in common use; the last two are reserved for expansion... but even so, I count 21 x's on the fourth row, and 21 bits won't fit into 16 whichever way you slice it.
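As a rough illustration of how those marker bits are used in practice (my own sketch in C, handling only the first four rows of the table), decoding one encoded sequence back to its codepoint looks like this:

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* Decode one UTF-8 sequence starting at s[0] (first four table rows only).
   Returns the codepoint and stores the sequence length in *len,
   or returns 0xFFFFFFFF and sets *len to 0 on a malformed sequence. */
static uint32_t utf8_decode(const unsigned char *s, size_t avail, int *len)
{
    uint32_t cp;
    int n;

    if (s[0] < 0x80)                { cp = s[0];        n = 1; }  /* 0xxxxxxx */
    else if ((s[0] & 0xE0) == 0xC0) { cp = s[0] & 0x1F; n = 2; }  /* 110xxxxx */
    else if ((s[0] & 0xF0) == 0xE0) { cp = s[0] & 0x0F; n = 3; }  /* 1110xxxx */
    else if ((s[0] & 0xF8) == 0xF0) { cp = s[0] & 0x07; n = 4; }  /* 11110xxx */
    else                            { *len = 0; return 0xFFFFFFFF; }

    if ((size_t)n > avail) { *len = 0; return 0xFFFFFFFF; }

    /* Each continuation byte must look like 10xxxxxx and contributes 6 bits. */
    for (int i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80) { *len = 0; return 0xFFFFFFFF; }
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    *len = n;
    return cp;
}

int main(void)
{
    /* U+4E2D (a CJK ideograph) followed by plain ASCII 'A'. */
    const unsigned char buf[] = { 0xE4, 0xB8, 0xAD, 0x41 };
    int len;
    uint32_t cp = utf8_decode(buf, sizeof(buf), &len);
    printf("First codepoint: U+%04X (%d encoded bytes)\n", (unsigned)cp, len);
    return 0;
}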

Posted on 2010-01-28 08:26:01 by Homer
And having demonstrated the utf-8 encoding scheme, I see no point in utf-16 at all.
utf-8 is infinitely expandable, and has a smaller encoding than utf-16 for the larger codepoint values (utf-16 is based strongly on the historical notion of byte-pairs, potentially leading to a lot of junk bytes in the bytestream for mid-valued codepoints).
utf-8 does everything that utf-16 does, generally using fewer bits.

Posted on 2010-01-28 08:37:20 by Homer
What I stated is that unicode codepoints can be 3 bytes long - eastern languages such as chinese and russian are using these.


Which is incorrect. UTF-8 encoded unicode can have characters of 3 bytes long. UCS-2 or UTF-16 cannot.
But even then, it depends on the version of Unicode supported. For example, although Java supports UTF-8, it supports an older version of Unicode: its char datatype is 16 bits, and as such it will not handle 3-byte characters.

Unicode is a concept, just like the ASCII codes of characters.
Although technically all ASCII characters fit into 7 bits, the most common encoding uses 8 bits.
It's like a number. The number 23423 is an 'entity', so to speak. Storing it in a 16-bit word is a certain binary representation of that number. Storing it in a 32-bit word is a different binary representation of that same number. So what is the length of that number? Is it 2 bytes? Is it 4 bytes? It doesn't make sense... In theory it could be less than 2 bytes as well, if you so choose.
Likewise, while there is only one code for each unicode character, there are various ways to represent these codes in binary. One is not more 'right' than another. You cannot speak of THE length of a unicode character or string in terms of bytes, as it depends on the encoding used.
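To make that concrete with a quick sketch of my own: the single codepoint U+20AC (the euro sign) has three different byte representations under three different encodings, yet it is the same character in all of them:

#include <stdio.h>
#include <stddef.h>

int main(void)
{
    /* One abstract codepoint, U+20AC (euro sign), three binary representations. */
    const unsigned char utf8[]    = { 0xE2, 0x82, 0xAC };        /* 3 bytes */
    const unsigned char utf16le[] = { 0xAC, 0x20 };              /* 2 bytes */
    const unsigned char utf32le[] = { 0xAC, 0x20, 0x00, 0x00 };  /* 4 bytes */

    printf("UTF-8   :");
    for (size_t i = 0; i < sizeof utf8; i++)    printf(" %02X", utf8[i]);
    printf("\nUTF-16LE:");
    for (size_t i = 0; i < sizeof utf16le; i++) printf(" %02X", utf16le[i]);
    printf("\nUTF-32LE:");
    for (size_t i = 0; i < sizeof utf32le; i++) printf(" %02X", utf32le[i]);
    printf("\n");
    return 0;
}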

And as was pointed out earlier, UTF-8 is not supported by the Win32API, so if we are to assume that the question was about detecting unicode or ansi strings in Windows, UTF-8 is completely irrelevant, and the IsTextUnicode() API is the right answer.
Posted on 2010-01-28 08:38:19 by Scali
And having demonstrated the utf-8 encoding scheme, I see no point in utf-16 at all.


The point of UTF-16 is that it's an extension of UCS-2, as UTF-8 is an extension of ASCII. Where all ASCII characters in a UTF-8 string can be freely read by any ASCII routine, any UCS-2 characters in UTF-16 can be read by any UCS-2 routine.
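A small sketch of my own in C to show what that extension looks like in practice: BMP codepoints are stored as a single 16-bit unit, exactly as UCS-2 stores them, while anything above U+FFFF becomes a surrogate pair:

#include <stdio.h>
#include <stdint.h>

/* Encode one codepoint as UTF-16; returns 1 or 2 code units, 0 if out of range. */
static int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp <= 0xFFFF) {                 /* BMP: identical to UCS-2 */
        out[0] = (uint16_t)cp;
        return 1;
    }
    if (cp <= 0x10FFFF) {               /* supplementary planes: surrogate pair */
        cp -= 0x10000;                  /* 20 bits remain */
        out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
        out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate  */
        return 2;
    }
    return 0;
}

int main(void)
{
    uint16_t units[2];
    int n = utf16_encode(0x1D11E, units);  /* U+1D11E MUSICAL SYMBOL G CLEF */
    printf("%d code unit(s):", n);
    for (int i = 0; i < n; i++) printf(" %04X", units[i]);
    printf("\n");                          /* expected: D834 DD1E */
    return 0;
}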
Posted on 2010-01-28 08:41:41 by Scali
That's true - Unicode codepoints are an abstraction; the encoding determines the actual byte length.
Therefore, the statement that Unicode characters must be of even length is ONLY true of UCS-2 and NOT true of Unicode itself; therein lies the ambiguity of the statement made in the MSDN documentation for that particular API.
Misleading, and incorrect, unless we're strictly talking about UCS-2 encoding, which is not actually stated.

With a little work, it's possible to determine exactly which of the encoding schemes named above is being employed, with a relatively high degree of confidence. For example, UTF-8 will never contain the bytes FE or FF at all; UTF-16 will never contain the code units FFFE or FFFF (the byte order mark aside); UCS-2 will typically contain a lot of zero bytes; and UTF-16 will contain the occasional zero in the second byte of a pair. We can develop a fairly solid set of rules to crunch through, even if there's no BOM, to make an intelligent decision about each codepoint in the bytestream, and can even detect a change of encoding midstream.
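For the sake of illustration, here's a very rough C sketch of my own along those lines - a real detector would apply many more tests and validate actual UTF-8 sequences, but the flavour is this:

#include <stdio.h>
#include <stddef.h>

enum guess { GUESS_UNKNOWN, GUESS_UTF8, GUESS_UTF16LE, GUESS_UTF16BE, GUESS_ANSI };

/* Very rough encoding guess for a buffer: BOM first, then byte statistics.
   A real detector would also validate UTF-8 sequences, look at which byte
   positions the zeroes fall in, score codepage statistics, and so on. */
static enum guess guess_encoding(const unsigned char *p, size_t n)
{
    size_t zeros = 0, high = 0, fe_ff = 0;

    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF) return GUESS_UTF8;
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE) return GUESS_UTF16LE;
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF) return GUESS_UTF16BE;

    for (size_t i = 0; i < n; i++) {
        if (p[i] == 0x00) zeros++;
        if (p[i] >= 0x80) high++;
        if (p[i] == 0xFE || p[i] == 0xFF) fe_ff++;
    }

    if (zeros > n / 4) return GUESS_UTF16LE;  /* lots of zero bytes: likely 16-bit text      */
    if (fe_ff > 0)     return GUESS_ANSI;     /* FE/FF never occur in UTF-8                  */
    if (high > 0)      return GUESS_UTF8;     /* could still be ANSI; validate to be sure    */
    return GUESS_ANSI;                        /* pure 7-bit ASCII reads the same either way  */
}

int main(void)
{
    const unsigned char sample[] = { 'H', 0, 'i', 0, '!', 0 };  /* "Hi!" as UTF-16LE, no BOM */
    printf("guess = %d\n", guess_encoding(sample, sizeof(sample)));
    return 0;
}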

Microsoft's api can return a lot of false positives for an arbitrary sequence of bytes.
Posted on 2010-01-28 08:51:23 by Homer

That's true - Unicode codepoints are an abstraction; the encoding determines the actual byte length.
Therefore, the statement that Unicode characters must be of even length is ONLY true of UCS-2 and NOT true of Unicode itself; therein lies the ambiguity of the statement made in the MSDN documentation for that particular API.
Misleading, and incorrect, unless we're strictly talking about UCS-2 encoding, which is not actually stated.


As I say, that's the result of legacy. UCS-2 was the first encoding for unicode. UCS-2 is actually a 'backronym'.
See the original unicode 88 standard document: http://www.unicode.org/history/unicode88.pdf
While they clearly describe what we now know as UCS-2 (they specifically mention 16 bits), they only describe it as 'unicode' or 'wide-body ASCII'. Only after alternative encodings came along was the name UCS-2 devised to refer to the original 16-bit encoding.
Therefore, when Microsoft originally designed their Unicode API, Unicode was indeed equivalent to UCS-2. It's too late to rename the APIs now.
I'm sure the MSDN will explain this somewhere, but it's common knowledge that within the context of the Win32 API, unicode and UCS-2 are equivalent. I think it will be more confusing for most people if they start to use UCS-2 or UTF-16 everywhere. People are used to referring to it as unicode, nothing else.

Gives me the same feeling as the 'redefinition' of things like KB and MB into KiBi and MeBi stuff.
I am so used to referring to powers-of-two with KB, MB etc, that I find it disturbing that people have now 'redefined' KB to mean 1000 bytes rather than 1024 (which it always has, ever since computers were still made of wood), and insist that I must now start using KiBi to refer to something that I've referred to as KB all my life. Sure, by THEIR definition, probably 99% of everything ever written about computers and storage is 'wrong' when they use KB... but it was right at the time of writing. People should leave well enough alone.

Microsoft's api can return a lot of false positives for an arbitrary sequence of bytes.


Microsoft's API wasn't meant to detect anything other than UCS-2. That should be clear from the fact that it can only say whether a buffer is 'unicode' or not, which, as I've explained, means UCS-2 within this context.
What's the use of having a function that says "This is unicode" without saying WHICH encoding it is? Clearly you can only use the function if you want to know whether a certain string is ANSI or UCS-2. If it can be anything else, you should be shot if you even try to call this function in the first place.
Posted on 2010-01-28 08:59:10 by Scali
My personal belief is that it started as a marketing ploy: the numbers look bigger, so it must be better.
And I'd like to petition Microsoft to stop using the term Unicode to refer to something it isn't, or at least rename their API to "IsPossiblyUnicode" :P (actually, the documentation does state "Determines if a buffer is LIKELY to contain a form of Unicode text.")

And sure, that's great if you used a Microsoft api to generate the strings in the first place, but if the data is alien, the api is not suitable. Perhaps "IsNotAmericanAscii" would better describe this api.

Debugging Microsoft's Notepad while it opens a file is an eye opener. It jumps through hoops to determine the text encoding: it doesn't make any assumptions, it runs lots of tests (the more data the better), I don't think it calls that IsTextUnicode API even once, and it handles very well the case of switching encodings midstream (which happens if you paste data between text files that have different encodings, as a lot of programmers are known to do).
Posted on 2010-01-28 09:10:00 by Homer
My personal belief is that it started as a marketing ploy: the numbers look bigger, so it must be better.
And I'd like to petition Microsoft to stop using the term Unicode to refer to something it isn't, or at least rename their API to "IsPossiblyUnicode" :P


My personal belief is that you like to fight windmills.

And sure, that's great if you used a Microsoft api to generate the strings in the first place, but if the data is alien, the api is not suitable. Perhaps "IsNotAmericanAscii" would better describe this api.


What exactly do you expect? How can Microsoft build an API for text encodings that didn't even exist? There is always the option of 'alien data'. It's impossible to make an API that catches every possible encoding, now and in the future.
Personally I think you're doing something very wrong if you somehow managed to get hold of a string without knowing how it is encoded in the first place.
Aside from that, Microsoft is quite clear, both in the description of the API, and the return values themselves, that there are various levels of accuracy in the detection, and that the function should certainly not be assumed to be 100% reliable in all cases. They even give an example of how it can go wrong.
They also say "Included in Windows NT 3.5 and later"... NT 3.5, that was September 1994, to put things in perspective.
Posted on 2010-01-28 09:14:22 by Scali

And having demonstrated the utf-8 encoding scheme, I see no point in utf-16 at all.


strlen() speed :lol:
Posted on 2010-01-28 09:58:16 by SpooK

IsTextUnicode if you want to use the Windows API


Hi Homer, Scali

I only suggested he use IsTextUnicode in the context of Windows and meant it to be used for detecting Unicode strings for use with the API; that's why I had the qualifying statement. Obviously if he wants to use a different Unicode encoding scheme he would be required to find a different, more flexible route. For the odd-numbered byte length, Microsoft's documentation needs a few footnotes but is in essence correct within the scope of their definition of Unicode support. The documentation for Windows Unicode support clearly states:

While Unicode-enabled functions in Windows use UTF-16, it is also possible to work with data encoded in UTF-8 or UTF-7, which are supported in Windows as multibyte character set code pages.


It has always been Microsoft's habit to bend a definition to match the capabilities of its software. If you want to work in Windows, you have to live with Microsoft's idiosyncrasies. Unicode in the Windows API
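For example, bringing UTF-8 data into the UTF-16 world the API expects goes through MultiByteToWideChar with the CP_UTF8 code page - a minimal sketch of my own, with error handling mostly omitted:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* UTF-8 input: the euro sign (E2 82 AC is U+20AC) followed by "10". */
    const char utf8[] = "\xE2\x82\xAC" "10";
    WCHAR wide[32];

    /* Convert UTF-8 to UTF-16 so the wide ("W") Win32 functions can consume it. */
    int n = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wide, 32);
    if (n > 0)
        printf("Converted %d wide characters (including the terminator)\n", n);
    else
        printf("Conversion failed: %lu\n", GetLastError());

    return 0;
}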

As for the size advantage of UTF-8: who seriously cares about size anymore, especially when dealing with text, even when it amounts to a few million wasted bytes? Fixed length is by definition faster to deal with in any number of functions than a variable-length encoding. Any extensibility advantage of UTF-8 is easily negated by surrogate pairs in UTF-16.

Edgar
Posted on 2010-01-28 22:00:46 by donkey
Yea, as I said, it's bound to be in MSDN somewhere. Here it is:
http://msdn.microsoft.com/en-us/library/dd374081(VS.85).aspx
Unicode-enabled functions are described in Conventions for Function Prototypes. These functions use UTF-16 (wide character) encoding, which is the most common encoding of Unicode and the one used for native Unicode encoding on Windows operating systems.


And as I already proved earlier, UCS-2 was the first Unicode encoding; UTF-8 came later, at a time when the Unicode functionality in the Win32 API was already in use.
You can't blame Microsoft for naming their functions 'unicode' when only one encoding of Unicode existed, one which didn't even have a specific name yet. As I already said, UCS-2 is a backronym. Much like CISC, for example: CPU designers didn't know that they were designing CISC CPUs at the time, since the term wasn't even invented yet. When they came up with the RISC philosophy, they named the existing philosophy CISC.
Same here.
Please, grow up. Not everything that Microsoft does has to have some kind of evil background to it. It's a character encoding for crying out loud.

But, Microsoft does explain that unicode is UTF-16 in the Win32API, so I don't see the big deal. 'Unicode' in itself doesn't say anything about encoding, and the encodings aren't interchangeable, so you'd need to read the manual to see what encoding they use. And it's in there.

I find the accusations towards Microsoft rather childish in nature. Saying they're 'wrong', or that they're trying to 'bend a definition'. There WAS no definition at the time. If anything, it's the guys who invented UTF-8 and the UCS-2 names who 'bent the definition'. Just like the guys who invented KiBi, MeBi and all that other nonsense.
Posted on 2010-01-29 02:06:33 by Scali
Interesting that the docs say it's UTF-16, I thought it was UCS-2... wonder how much code out there does simple *2 or /2 when dealing with ascii<>unicode...
Posted on 2010-01-29 02:11:12 by f0dder