Heads Up!!

While messing with my new MovieMixer code recently, I discovered what I believe is a bug in the MultiByteToWideChar api.

Basically, the api only writes one Zero Terminator byte to the end of the WIDE output string - and if the buffer was not previously filled with zeros, then the string is not correctly terminated - and this caused me a lot of headaches with 'file not found' errors.

It's worth knowing that OA32's Str2UStr function does not suffer this flaw.
Posted on 2009-09-08 06:35:12 by Homer
Interesting!

Duplicated here, Win7-64 .

#include <windows.h>
#include <stdio.h>

#define lengthof(x) sizeof(x)/sizeof(x[0])

char *source = "hallj !";

int main()
{
WCHAR dest[100];

memset(dest, '@', sizeof(dest));
MultiByteToWideChar(CP_ACP, 0, source, -1, dest, lengthof(dest));

FILE *f = fopen("out.bin", "wb");
fwrite(dest, 1, sizeof(dest), f);
fclose(f);
}


I haven't been bitten by this bug since I tend to stay away from ugly NUL-terminated strings - I use bytelength-without-NUL for conversion, and keep my strings in std::string / std::wstring . But this is pretty nasty, MS should be contacted.

EDIT: as can clearly be seen from the attached screenshot, there isn't any problem, and there's a proper 16-bit NUL terminator - I'm a moron for a couple of hours after getting out of bed :)
Attachments:
Posted on 2009-09-08 10:59:09 by f0dder
MultiByteToWideChar was designed to be interfaced with C/C++ style code. Due to this, it's common practice with dealing with any C string conversion or input routines to have the user '\0' terminate the string. For example scanf/fscanf is regularly used like so (for the same reason):

char * buf;
buf = (char*)malloc( 1025 );

...

// scan string from file stopping at newline and null terminate.
buf[ (int)fscanf( fd, "%1024[^\n]", buf ) ] = '\0';


So it's probably not considered a bug as much as a, "terminate your own string" style situation which is present in a lot of HL languages.
Posted on 2009-09-08 10:59:53 by Synfire
So it's probably not considered a bug as much as a, "terminate your own string" style situation which is present in a lot of HL languages.


It's clearly a bug since the NUL terminator *IS* processed, but only a single byte is output. Let me quote MSDN as well (emphasis mine):

cbMultiByte
Size, in bytes, of the string indicated by the lpMultiByteStr parameter. Alternatively, this parameter can be set to -1 if the string is null-terminated. Note that, if cbMultiByte is 0, the function fails.
If this parameter is -1, the function processes the entire input string, including the null terminator. Therefore, the resulting wide character string has a null terminator, and the length returned by the function includes the terminating null character.

If this parameter is set to a positive integer, the function processes exactly the specified number of bytes. If the provided size does not include a null terminator, the resulting wide character string is not null-terminated, and the returned length does not include the terminating null character.


I updated the code above to printf() the result of MBTWC, and it returned 15 - that's the length of the source string including the NUL char.

EDIT: It's Not A BugTM
Posted on 2009-09-08 11:10:50 by f0dder
PS: I hope you don't write code like the above, it's "somewhat flawed" :) (think of EOF return value... and remember that the return value is number of fields captured, not charlength).
Posted on 2009-09-08 11:19:06 by f0dder

PS: I hope you don't write code like the above, it's "somewhat flawed" :) (think of EOF return value... and remember that the return value is number of fields captured, not charlength).


Actually I do write code like that. The above code is the equivalent of doing a getline with the exception that it truncates to the maximum specified length instead of throwing a failbit error. Also, take a close look at the "regex" I'm using "%1024[^\n]", it effectively matches all characters except a newline meaning that even if the EOF is reached, it will have already made it's first match (either a valid token or '\n' forcing a return of 0). The only real case in which EOF would be returned by using this routine would be if a read error occurred.. So I suppose it is wrong on that behalf and would warrant something more like:

int x; char buf[1025];
buf[ (((x=(int)fscanf( fd, "%1024[^\n]", buf ))>0)?x:0) ]='\0';


However I've never personally had a case where the prior had failed.
Posted on 2009-09-08 17:54:09 by Synfire
So, it's just that Homer set the argument cchMultiByte=strlen(s) , right?
Posted on 2009-09-08 18:02:12 by Ultrano

So, it's just that Homer set the argument cchMultiByte=strlen(s) , right?
That wouldn't include the NUL, then. Notice my code - I use -1 for the length, which according to MSDN: "If this parameter is -1, the function processes the entire input string, including the null terminator"... and results in a single NUL byte, not a NUL word as it's supposed to be.

EDIT: It's Not A BugTM
Posted on 2009-09-08 18:04:32 by f0dder
Ultrano,
Yea, f0dder is right, it's definitely a bug. Given that the documentation states that it should NULL terminate upon the use of a -1 argument. Although generally speaking it's always a good idea to set your own NULL terminators just to be on the safe side.
Posted on 2009-09-08 18:08:05 by Synfire
Synfire: getting EOF result value might be very unlikely, but it's not good practice not to handle it. Furthermore, what's with the (int) cast? - fscanf already returns int. And finally, fscanf returns the number of fields, not bytes, read... so you'll be getting your NUL terminator at index 1, not at the end of the string (at least that's consistent with MSDN, K&R and empirical testing).

PS: the single-line statement might be "cute", but personally I'd introduce an "int result" to keep the code clearer & cleaner. Sure, code increases to two or three lines, but I'm willing to sacrifice that for clarity.
Posted on 2009-09-08 18:16:35 by f0dder
I still see absolutely no problem. That "extra" or "just one" zero-byte is the higher-byte from the last little-endian WORD.
If you specify -1 , then obviously internally cchMultiByte=strlen(s)+1. This "-1" is just sugar on the top, beside which the function makes no assumptions that the input and output string are zero-terminated! The func-output is an array, not a szString.
Posted on 2009-09-08 18:22:01 by Ultrano
Ultrano: perhaps you should re-read what MSDN has to say about the function?

MultiByteToWideChar does not null-terminate an output string if the input string length is explicitly specified without a terminating null character. To null-terminate an output string for this function, the application should pass in -1 or explicitly count the null terminator for the input string.


In other words, passing strlen+1 or -1 as argument, you should get a NUL terminated output string. A 1-byte NUL isn't exactly appropriate for UTF-16 output.

EDIT: ...which is why it's good and correct that MBTWC outputs a 2-byte NUL :P
Posted on 2009-09-08 18:24:49 by f0dder
P.S. it's quite understandable why the output should be an assumed to be a szString: the same func can be used for a wide variety of tasks: 1) bulk-handling of many szStrings, 2) transcode of just parts of a text, 3) transcode into Pascal/Java type of strings.
Posted on 2009-09-08 18:27:41 by Ultrano
WinXP SP2:

.data
z1 db "hello",0
z2 dw 40 dup (0CCCCh)
.code
main proc
invoke MultiByteToWideChar,CP_ACP,0,addr z1,-1,addr z2,40
movzx ecx,z2[10]
print ecx ; ecx = 0 *******
ret
main endp


No problem here...
Posted on 2009-09-08 18:30:00 by Ultrano
I just fired up a XP32 VMware, and it did a word-sized NUL for my sample code above - so clearly the bug has been introduced later. PS: your code is buggy, the last argument is cchWideChar, not cbWideChar :)
Posted on 2009-09-08 18:34:33 by f0dder

Synfire: getting EOF result value might be very unlikely, but it's not good practice not to handle it. Furthermore, what's with the (int) cast? - fscanf already returns int. And finally, fscanf returns the number of fields, not bytes, read... so you'll be getting your NUL terminator at index 1, not at the end of the string (at least that's consistent with MSDN, K&R and empirical testing).

PS: the single-line statement might be "cute", but personally I'd introduce an "int result" to keep the code clearer & cleaner. Sure, code increases to two or three lines, but I'm willing to sacrifice that for clarity.


I typecast as an old habit. When I first started out with C I hated having issues with "converting types" when I knew good and well the values were essentially the same thing. So I got myself into the habit of typecasting any function in which I grab a value from.

As for the reason for the reduced one-liner, it has nothing to do with "cuteness" rather than what's actually generated. Whereas using the extended version (int result) generates a 144 byte <main>, the one-liner generated 134 byte <main> (built with GNU-C). Little optimizations like this are useful. I'm by far a C expert but I try to keep code size down as much as possible no matter what language I'm writing in.
Posted on 2009-09-08 18:39:47 by Synfire
f0dder: then on that attached MBTWC_bug.png .... what are those 2 zeroes doing at 1Ch :) , if that's on another OS?
Posted on 2009-09-08 18:42:20 by Ultrano

I typecast as an old habit. When I first started out with C I hated having issues with "converting types" when I knew good and well the values were essentially the same thing. So I got myself into the habit of typecasting any function in which I grab a value from.
Ugh... that's pretty bad coding practice. When the compiler tells you there's an issue, there's usually a reason for it (C++ and the PlatformSDK is painful, though - less so in C mode). But doing this as a general coding practice is bad.


As for the reason for the reduced one-liner, it has nothing to do with "cuteness" rather than what's actually generated. Whereas using the extended version (int result) generates a 144 byte <main>, the one-liner generated 134 byte <main> (built with GNU-C). Little optimizations like this are useful. I'm by far a C expert but I try to keep code size down as much as possible no matter what language I'm writing in.
You might want to move to a decent compiler, or at least a more recent version of GCC? With VC2008 (and also lot older versions, I expect) introducing the temporary variable results in exactly the same code as doing the one-liner.
Posted on 2009-09-08 18:48:31 by f0dder

f0dder: then on that attached MBTWC_bug.png .... what are those 2 zeroes doing at 1Ch :) , if that's on another OS?
Now I'll be damned - I could've sworn there was only a single one of them... that's what I get for doing code stuff right when I've gotten out of bed :oops:

Sounds like Homer probably *is* passing strlen-without-NUL, unless he's on a OS version that actually is buggy :)
Posted on 2009-09-08 18:52:48 by f0dder
Told ya :)
The beer is strong in us tonight :P

The whole world would collapse if that func was buggy in that way.
Posted on 2009-09-08 18:56:36 by Ultrano