Interesting that the docs say it's UTF-16, I thought it was UCS-2... wonder how much code out there does simple *2 or /2 when dealing with ascii<>unicode...


From what I understood, it WAS UCS-2, but it was expanded to UTF-16 in Windows 2000.
Posted on 2010-01-29 05:42:15 by Scali
For good old 7-bit American ASCII, it is safe to make this assumption.
But not for U-codepoints in general, because UTF-16 is similar to UTF-8 in that there is a strict binary encoding of the U-value: for characters above the BMP, it is split across the lower 10 bits of each 16-bit unit of a surrogate pair.
That is to say, we can't just look at a 16-bit value and say "yeah, that's the U-code value" like we can with UCS-2 (which I refer to as 'vanilla unicode').
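To make that concrete, here is a minimal C sketch (the function name is my own, not from any library) of how a code point is reassembled from a surrogate pair: the value above 0xFFFF is split into two 10-bit halves, carried in the low 10 bits of the high and low surrogate respectively.

    #include <stdint.h>

    /* Reassemble a Unicode code point from a UTF-16 surrogate pair.
       'hi' must be a high surrogate (0xD800-0xDBFF) and 'lo' a low
       surrogate (0xDC00-0xDFFF); returns 0xFFFFFFFF on bad input. */
    uint32_t utf16_pair_to_codepoint(uint16_t hi, uint16_t lo)
    {
        if (hi < 0xD800 || hi > 0xDBFF || lo < 0xDC00 || lo > 0xDFFF)
            return 0xFFFFFFFFu;              /* not a valid surrogate pair */
        return 0x10000u + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
    }

Any 16-bit unit outside the 0xD800-0xDFFF range still IS the code point, which is why plain UCS-2 data passes through a UTF-16 decoder unchanged.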
Posted on 2010-01-29 07:24:57 by Homer

Interesting that the docs say it's UTF-16, I thought it was UCS-2... wonder how much code out there does simple *2 or /2 when dealing with ascii<>unicode...


UCS-2 does not support surrogate pairs, UTF-16 does; otherwise they are identical. Supporting only UCS-2 would severely limit the available character sets, though for most APIs it is not an issue: GDI and Uniscribe support supplementary characters.
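As a rough illustration (just a sketch, with a made-up function name), checking whether a UTF-16 string actually needs anything beyond UCS-2 is simply a scan for surrogate code units:

    #include <stddef.h>
    #include <stdint.h>

    /* Returns 1 if the UTF-16 string (len is in 16-bit units, not bytes)
       contains surrogate code units, i.e. characters that UCS-2 cannot
       represent; returns 0 if the string is plain UCS-2. */
    int utf16_needs_surrogates(const uint16_t *s, size_t len)
    {
        for (size_t i = 0; i < len; i++)
            if (s[i] >= 0xD800 && s[i] <= 0xDFFF)
                return 1;
        return 0;
    }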
Posted on 2010-01-29 09:49:00 by donkey

I find the accusations towards Microsoft rather childish in nature. Saying they're 'wrong', or that they're trying to 'bend a definition'. There WAS no definition at the time. If anything, it's the guys who invented UTF-8 and the name 'UCS-2' who 'bent the definition'. Just like the guys who invented 'kibi', 'mebi' and all that other nonsense.


Uniscribe dates from well after the addition of UTF-7 and UTF-8, yet Microsoft did not alter their definition of Unicode, preferring to keep the old one. They did add support for the new encoding schemes, however, and that is what 'bending' the definition means. Because they supported Unicode in the era of version 1, you seem to believe they are exempt from covering the full range of encoding schemes in their definition. Whatever the reasons for the limitations in their software, they continue to bend the definition to fit those limitations.

Oh, and by the way, UTF-8 started development in 1992 (designed for Plan 9 and presented to the X/Open committee as FSS-UTF). Microsoft was completely aware that other Unicode encoding schemes were being developed and chose to support UTF-16, the more mature standard and call it and it alone Unicode. So, I would hope that it's not my 'bent the definition' statement that you're calling 'nonsense', since it is well grounded in the history of Unicode.
Posted on 2010-01-29 10:20:49 by donkey
Uniscribe dates from well after the addition of UTF-7 and UTF-8, yet Microsoft did not alter their definition of Unicode, preferring to keep the old one. They did add support for the new encoding schemes, however, and that is what 'bending' the definition means. Because they supported Unicode in the era of version 1, you seem to believe they are exempt from covering the full range of encoding schemes in their definition. Whatever the reasons for the limitations in their software, they continue to bend the definition to fit those limitations.


That's not what I said at all.
What I said was that 'Unicode', at the time the Win32 API was devised, only existed in the form of UCS-2. They cannot change the names of the APIs now that they are widely in use.

Oh, and by the way, UTF-8 started development in 1992 (designed for Plan 9 and presented to the X/Open committee as FSS-UTF)


They STARTED in 1992. But the first Windows NT with Unicode support was RELEASED in 1993, so clearly Microsoft started on their Unicode API well before UTF-8.

Microsoft was completely aware that other Unicode encoding schemes were being developed and chose to support UTF-16, the more mature standard and call it and it alone Unicode


Incorrect. Microsoft chose UCS-2, which was not known as UCS-2 yet, but only as Unicode, at that time (somewhere well before 1992, so before both UTF-8 and UTF-16 existed).
As I already said, Microsoft expanded it to UTF-16 in Windows 2000. Clearly UTF-8 wasn't really an option with the UCS-2 legacy they built up since 1992.
As I also said, with .NET they DO use UTF-8 as the default text encoding, because they could start with a clean slate there.

So, I would hope that it's not my 'bent the definition' statement that you're calling 'nonsense', since it is well grounded in the history of Unicode.


Nope, it's still nonsense; you got your facts all wrong. Which you wouldn't have done if you had bothered to read my posts, because I've already said everything in this post before.
Posted on 2010-01-29 11:46:46 by Scali

They STARTED in 1992. But the first Windows NT with Unicode support was RELEASED in 1993, so clearly Microsoft started on their Unicode API well before UTF-8.


The standard for Unicode was published in April 1992. It was only finalized earlier that year, so I am assuming that Microsoft was making adjustments right up to the date of release and waited for the standard to be published before documenting it (which would be the sensible way to do it). And we are talking only about documenting it here, not about when they began coding the APIs. They narrowed the scope of the definition (which was known to potentially cover multiple encoding schemes at the time they documented the API) to fit their software's capability; that's what 'bending it' means. I never said that they misrepresented it or were not factual in their interpretation. In fact I said quite the opposite: only that they perhaps needed a footnote, and that the Unicode statements they made (i.e. that Unicode text by definition could not have an odd number of bytes) were within the scope of their definition.
Posted on 2010-01-29 14:22:26 by donkey
The standard for Unicode was published in April 1992.


Then how come I linked to a publication of the Unicode standard from 1988?
And why do various internet resources report that the Unicode 1.0 standard was published in 1991?
Give it up: UTF-8 was NOT in the original Unicode standard, and was only added in a later revision. This means that your entire argument falls apart.
UCS-2 was the ONLY encoding in the original 1988 publication, and also the ONLY encoding in the Unicode 1.0 standard. Also, it was NOT called UCS-2 yet.
The irony of it all... because Windows uses UCS-2/UTF-16, and Windows is about 90% of the entire market, the most common unicode encoding BY FAR is UCS-2/UTF-16.
Statistically, the chance that a unicode string is encoded with UCS-2/UTF-16 is far greater than UTF-8 or any other alternative encoding.
Posted on 2010-01-29 15:30:42 by Scali
I'm not sure where you got that 1988 thing from...

http://www.unicode.org/history/publicationdates.html

This is the timeline according to Unicode.org, the final arbiter on all questions involving the publications, since after all they publish them. Volume 2 was the complete standard; Volume 1 was a working set.
Posted on 2010-01-29 19:01:07 by donkey
I'm not interested in arguments about dates or semantics.
My interest is purely about recognizing unknown encodings with a reasonable degree of accuracy.
I don't think it's viable to assume that 'we should already know which encoding was used', and even in the Windows world there's a wide range of encodings being used by a wider range of authors, so it's not unreasonable to suggest that it's not only possible, but likely that we'll encounter them.

UTF-16 may be the most common, behind American ASCII and UCS-2; I can't say, as I haven't attempted to determine ratios or probabilities. I can only say that I have encountered all the popular encodings, both as file streams and network data streams, and I'm trying to devise an algorithm which can determine which one is being employed, without necessarily having access to the complete data stream.

That being said, I have been starting with the assumption that the data is in fact UTF-8, since this encoding seems to have the strictest binary encoding, with the greatest likelihood of detecting a BAD encoding with the minimum amount of data. UTF-8 and UTF-16 share some properties in their binary encoding, so it is easy to mistake one for the other... but UTF-8's encoding scheme has a definitive lead-byte pattern (a bit pattern which describes the number of bytes in a legal sequence), and so it is indeed possible to distinguish between them. This is useful to me.
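For what it's worth, here is a minimal sketch of that idea in C (my own function name, and deliberately simplified: it only checks the lead/continuation bit patterns, ignoring overlong forms and the surrogate range). The lead byte announces the sequence length, and every continuation byte must match 10xxxxxx, so random or differently-encoded data usually fails after very few bytes:

    #include <stddef.h>

    /* Returns 1 if the buffer looks like well-formed UTF-8, 0 otherwise. */
    int looks_like_utf8(const unsigned char *p, size_t len)
    {
        size_t i = 0;
        while (i < len) {
            size_t extra;
            if (p[i] < 0x80)                extra = 0;  /* 0xxxxxxx: ASCII   */
            else if ((p[i] & 0xE0) == 0xC0) extra = 1;  /* 110xxxxx          */
            else if ((p[i] & 0xF0) == 0xE0) extra = 2;  /* 1110xxxx          */
            else if ((p[i] & 0xF8) == 0xF0) extra = 3;  /* 11110xxx          */
            else return 0;                              /* invalid lead byte */
            if (i + extra >= len) return 0;             /* truncated sequence */
            for (size_t k = 1; k <= extra; k++)
                if ((p[i + k] & 0xC0) != 0x80) return 0; /* must be 10xxxxxx */
            i += extra + 1;
        }
        return 1;
    }

For a partial stream, a truncated trailing sequence would probably be treated as 'still possibly UTF-8' rather than as a hard failure.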

I will initially be concentrating on distinguishing between the various little-endian encodings; I hope to retrofit the algorithm for both LE and BE once I'm satisfied that it is doing a reasonably good job of determining the encoding.
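One crude heuristic for the LE/BE question, assuming mostly-Latin text (again just a sketch, names are mine): count which byte of each 16-bit pair is zero more often.

    #include <stddef.h>

    /* Guess UTF-16 byte order for mostly-Latin/ASCII text: in LE the second
       (odd-offset) byte of each pair tends to be zero, in BE the first one.
       Returns 1 for little-endian, 0 for big-endian, -1 if undecided. */
    int guess_utf16_endianness(const unsigned char *p, size_t len)
    {
        size_t zeros_even = 0, zeros_odd = 0;
        for (size_t i = 0; i + 1 < len; i += 2) {
            if (p[i]     == 0) zeros_even++;
            if (p[i + 1] == 0) zeros_odd++;
        }
        if (zeros_odd > zeros_even) return 1;   /* looks little-endian */
        if (zeros_even > zeros_odd) return 0;   /* looks big-endian    */
        return -1;                              /* can't tell          */
    }

Obviously this falls apart on text that is mostly outside the Latin range, so it can only serve as a tie-breaker, not a detector.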
Posted on 2010-01-29 20:41:22 by Homer

I'm not sure where you got that 1988 thing from...

http://www.unicode.org/history/publicationdates.html

This is the timeline according to Unicode.org, the final arbiter on all questions involving the publications, since after all they publish them. Volume 2 was the complete standard; Volume 1 was a working set.


http://www.unicode.org/history/museum.html <-- as you can see, the first standard proposal was published in 1988. This is what I linked to. Which you apparently didn't even notice. If you want to argue, at least bother to read the posts of the person you're arguing with. No respect.
Even so, this page proves you wrong, because it says the final 1.0 standard was released in 1991, not 1992 (as I already said... obviously Microsoft doesn't really care about which volume was published when. It's about the standard itself, not the books).
Try to be a BIT more meticulous in getting your data right. You don't even quote your own sources properly.

I'm not really sure what you're trying to argue anyway...
Your line of argument sounds like this to me:
June 1992: The second volume of the Unicode 1.0 standard was published. Microsoft only now discovers the glory of Unicode... and although Windows NT is scheduled for release the next year, they think "Hey, this unicode lark is cool. Let's completely rewrite our entire API just a few months before the final release, and add this unicode stuff"

Sounds completely unlikely.

What I'm saying is this:
Unicode proposal is published in 1988. Microsoft (and IBM at the time) started development on Windows NT in November 1989. They had seen the Unicode proposal, and figured it would be a great way to solve the localization problems in legacy OSes. So they adopted Unicode from the get-go, designing their API around using both ASCII and Wide Characters.
When Windows NT was already in the final testing stages, somewhere in 1992, some people proposed the alternative UTF-8 encoding. With Windows NT having already been in development for a few years, and gearing up for the release, it was too late to incorporate UTF-8. The APIs were already finalized, documentation was already being written.

Sounds highly likely.
Posted on 2010-01-30 02:52:33 by Scali
My interest is purely about recognizing unknown encodings with a reasonable degree of accuracy.
I don't think it's viable to assume that 'we should already know which encoding was used', and even in the Windows world there's a wide range of encodings being used by a wider range of authors, so it's not unreasonable to suggest that it's not only possible, but likely that we'll encounter them.


As you say yourself, "a reasonable degree of accuracy". It's never going to be possible to catch them all. Therefore, if you don't know the encoding beforehand, you're basically fighting a lost battle.

But I personally have never been in a situation where I didn't know the encoding used, so I wonder how you could ever get there. It seems that you can only reach such a situation when you once knew the encoding, but decided to ignore it at that point, and it came back to haunt you later.

Let me just throw this out there: PETSCII.
Ever even heard of that? It's the character encoding used by Commodore, originally designed for their PET range. It was used in various other Commodore computers as well, including the C64, the best-selling computer of all time.
If you haven't heard of this encoding before, you wouldn't have created a detection routine for it either. Yet it's perfectly possible to encounter PETSCII strings 'in the wild', given the huge popularity of the C64. QED.
Posted on 2010-01-30 03:00:40 by Scali
But I personally have never been in a situation where I didn't know the encoding used, so I wonder how you could ever get there.
One very real example: text editors. You have to deal with a variety of formats... plain ASCII, OEM codepages, Unicode with BOM, Unicode without BOM... those are the bare minimums; more adventurous editors might want to support EBCDIC and other alien formats as well.
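The BOM check itself is the easy part; a sketch (the enum names are made up) would be something like this, and it's the BOM-less files where the real guessing starts:

    #include <stddef.h>

    enum bom_kind { BOM_NONE, BOM_UTF8, BOM_UTF16_LE, BOM_UTF16_BE };

    /* Identify a byte-order mark at the start of a buffer, if any. */
    enum bom_kind detect_bom(const unsigned char *p, size_t len)
    {
        if (len >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
            return BOM_UTF8;
        if (len >= 2 && p[0] == 0xFF && p[1] == 0xFE)
            return BOM_UTF16_LE;
        if (len >= 2 && p[0] == 0xFE && p[1] == 0xFF)
            return BOM_UTF16_BE;
        return BOM_NONE;
    }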

Yet it's perfectly possible to encounter PETSCII strings 'in the wild', given the huge popularity of the C64. QED.
A lot less likely than encountering a UTF-8 encoded document without a BOM. I doubt you'd bump into PETSCII unless you were specifically looking for C=64 stuff?
Posted on 2010-01-30 10:18:28 by f0dder

But I personally have never been in a situation where I didn't know the encoding used, so I wonder how you could ever get there.
One very real example: text editors. You have to deal with a variety of formats... plain ASCII, OEM codepages, Unicode with BOM, Unicode without BOM... those are the bare minimums; more adventurous editors might want to support EBCDIC and other alien formats as well.


Exactly my point, isn't it?
This text file was once written by someone who DID know what encoding to use. Then it should have been remembered in some way. Either by storing a BOM in the file, or using some other kind of metadata... perhaps even as simple as just a file extension.
Anything that would describe the format.

Obviously opening random text files with random editors is a lost battle, as I already said. You're never going to be able to determine 100% what kind of format the file is.
Besides, that is a VERY specific problem (caused by the user/owner of the text file). It's not a problem that normal applications should ever bump into. Any decent application should support decent file formats which somehow define the encoding of the data properly. And obviously, any user input will also be in a known encoding at all times.

In other words, I doubt that the IsTextUnicode() API function exists solely for people writing text editors.
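For reference, calling it is straightforward (a sketch; the choice of tests here is arbitrary, and the result is a heuristic guess, not a guarantee):

    #include <windows.h>   /* link with Advapi32.lib */

    /* Ask Windows whether a buffer looks like UTF-16 ("Unicode") text. */
    BOOL buffer_looks_unicode(const void *buf, int size_in_bytes)
    {
        INT tests = IS_TEXT_UNICODE_STATISTICS | IS_TEXT_UNICODE_SIGNATURE;
        return IsTextUnicode(buf, size_in_bytes, &tests);
    }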

I doubt you'd bump into PETSCII unless you were specifically looking for C=64 stuff?


I doubt you'd bump into anything other than UTF-16 unless you were specifically looking for non-Windows stuff?
Posted on 2010-01-30 10:30:08 by Scali
In other words, I doubt that the IsTextUnicode() API function exists solely for people writing text editors.
It actually seems like a slightly weird API function to me, anyway... can't think of a lot of (normal) situations where it'd be useful.

I doubt you'd bump into PETSCII unless you were specifically looking for C=64 stuff?


I doubt you'd bump into anything other than UTF-16 unless you were specifically looking for non-Windows stuff?
I've never bumped into UTF-16 (API calls don't really count as "bump into") but I've bumped into "unclassified" UTF-8 on several occasions. Usually related to web browsers or servers not having specified content-type correctly, but not limited to that.
Posted on 2010-01-30 20:20:53 by f0dder
It actually seems like a slightly weird API function to me, anyway... can't think of a lot of (normal) situations where it'd be useful.


That's what I've been saying all along.

I've never bumped into UTF-16 (API calls don't really count as "bump into")


You may have, but never noticed :)
E.g., Visual Studio will save files in UTF-16 format if you use special characters. But obviously it will open them correctly because it knows how to detect ASCII or UTF-16 source files.

but I've bumped into "unclassified" UTF-8 on several occasions. Usually related to web browsers or servers not having specified content-type correctly, but not limited to that.


That's what I've been saying all along. If you don't know the encoding, something must have gone wrong.
Posted on 2010-01-31 04:23:26 by Scali