Heads Up!!
While messing with my new MovieMixer code recently, I discovered what I believe is a bug in the MultiByteToWideChar api.
Basically, the api only writes one Zero Terminator byte to the end of the WIDE output string - and if the buffer was not previously filled with zeros, then the string is not correctly terminated - and this caused me a lot of headaches with 'file not found' errors.
It's worth knowing that OA32's Str2UStr function does not suffer this flaw.
While messing with my new MovieMixer code recently, I discovered what I believe is a bug in the MultiByteToWideChar api.
Basically, the api only writes one Zero Terminator byte to the end of the WIDE output string - and if the buffer was not previously filled with zeros, then the string is not correctly terminated - and this caused me a lot of headaches with 'file not found' errors.
It's worth knowing that OA32's Str2UStr function does not suffer this flaw.
Interesting!
Duplicated here, Win7-64 .
I haven't been bitten by this bug since I tend to stay away from ugly NUL-terminated strings - I use bytelength-without-NUL for conversion, and keep my strings in std::string / std::wstring . But this is pretty nasty, MS should be contacted.
EDIT: as can clearly be seen from the attached screenshot, there isn't any problem, and there's a proper 16-bit NUL terminator - I'm a moron for a couple of hours after getting out of bed :)
Duplicated here, Win7-64 .
#include <windows.h>
#include <stdio.h>
#define lengthof(x) sizeof(x)/sizeof(x[0])
char *source = "halløj æøåÆØÅ!";
int main()
{
WCHAR dest[100];
memset(dest, '@', sizeof(dest));
MultiByteToWideChar(CP_ACP, 0, source, -1, dest, lengthof(dest));
FILE *f = fopen("out.bin", "wb");
fwrite(dest, 1, sizeof(dest), f);
fclose(f);
}
I haven't been bitten by this bug since I tend to stay away from ugly NUL-terminated strings - I use bytelength-without-NUL for conversion, and keep my strings in std::string / std::wstring . But this is pretty nasty, MS should be contacted.
EDIT: as can clearly be seen from the attached screenshot, there isn't any problem, and there's a proper 16-bit NUL terminator - I'm a moron for a couple of hours after getting out of bed :)
MultiByteToWideChar was designed to be interfaced with C/C++ style code. Due to this, it's common practice with dealing with any C string conversion or input routines to have the user '\0' terminate the string. For example scanf/fscanf is regularly used like so (for the same reason):
So it's probably not considered a bug as much as a, "terminate your own string" style situation which is present in a lot of HL languages.
char * buf;
buf = (char*)malloc( 1025 );
...
// scan string from file stopping at newline and null terminate.
buf[ (int)fscanf( fd, "%1024[^\n]", buf ) ] = '\0';
So it's probably not considered a bug as much as a, "terminate your own string" style situation which is present in a lot of HL languages.
So it's probably not considered a bug as much as a, "terminate your own string" style situation which is present in a lot of HL languages.
It's clearly a bug since the NUL terminator *IS* processed, but only a single byte is output. Let me quote MSDN as well (emphasis mine):
cbMultiByte
Size, in bytes, of the string indicated by the lpMultiByteStr parameter. Alternatively, this parameter can be set to -1 if the string is null-terminated. Note that, if cbMultiByte is 0, the function fails.
If this parameter is -1, the function processes the entire input string, including the null terminator. Therefore, the resulting wide character string has a null terminator, and the length returned by the function includes the terminating null character.
If this parameter is set to a positive integer, the function processes exactly the specified number of bytes. If the provided size does not include a null terminator, the resulting wide character string is not null-terminated, and the returned length does not include the terminating null character.
Size, in bytes, of the string indicated by the lpMultiByteStr parameter. Alternatively, this parameter can be set to -1 if the string is null-terminated. Note that, if cbMultiByte is 0, the function fails.
If this parameter is -1, the function processes the entire input string, including the null terminator. Therefore, the resulting wide character string has a null terminator, and the length returned by the function includes the terminating null character.
If this parameter is set to a positive integer, the function processes exactly the specified number of bytes. If the provided size does not include a null terminator, the resulting wide character string is not null-terminated, and the returned length does not include the terminating null character.
I updated the code above to printf() the result of MBTWC, and it returned 15 - that's the length of the source string including the NUL char.
EDIT: It's Not A BugTM
PS: I hope you don't write code like the above, it's "somewhat flawed" :) (think of EOF return value... and remember that the return value is number of fields captured, not charlength).
PS: I hope you don't write code like the above, it's "somewhat flawed" :) (think of EOF return value... and remember that the return value is number of fields captured, not charlength).
Actually I do write code like that. The above code is the equivalent of doing a getline with the exception that it truncates to the maximum specified length instead of throwing a failbit error. Also, take a close look at the "regex" I'm using "%1024[^\n]", it effectively matches all characters except a newline meaning that even if the EOF is reached, it will have already made it's first match (either a valid token or '\n' forcing a return of 0). The only real case in which EOF would be returned by using this routine would be if a read error occurred.. So I suppose it is wrong on that behalf and would warrant something more like:
int x; char buf[1025];
buf[ (((x=(int)fscanf( fd, "%1024[^\n]", buf ))>0)?x:0) ]='\0';
However I've never personally had a case where the prior had failed.
So, it's just that Homer set the argument cchMultiByte=strlen(s) , right?
So, it's just that Homer set the argument cchMultiByte=strlen(s) , right?
EDIT: It's Not A BugTM
Ultrano,
Yea, f0dder is right, it's definitely a bug. Given that the documentation states that it should NULL terminate upon the use of a -1 argument. Although generally speaking it's always a good idea to set your own NULL terminators just to be on the safe side.
Yea, f0dder is right, it's definitely a bug. Given that the documentation states that it should NULL terminate upon the use of a -1 argument. Although generally speaking it's always a good idea to set your own NULL terminators just to be on the safe side.
Synfire: getting EOF result value might be very unlikely, but it's not good practice not to handle it. Furthermore, what's with the (int) cast? - fscanf already returns int. And finally, fscanf returns the number of fields, not bytes, read... so you'll be getting your NUL terminator at index 1, not at the end of the string (at least that's consistent with MSDN, K&R and empirical testing).
PS: the single-line statement might be "cute", but personally I'd introduce an "int result" to keep the code clearer & cleaner. Sure, code increases to two or three lines, but I'm willing to sacrifice that for clarity.
PS: the single-line statement might be "cute", but personally I'd introduce an "int result" to keep the code clearer & cleaner. Sure, code increases to two or three lines, but I'm willing to sacrifice that for clarity.
I still see absolutely no problem. That "extra" or "just one" zero-byte is the higher-byte from the last little-endian WORD.
If you specify -1 , then obviously internally cchMultiByte=strlen(s)+1. This "-1" is just sugar on the top, beside which the function makes no assumptions that the input and output string are zero-terminated! The func-output is an array, not a szString.
If you specify -1 , then obviously internally cchMultiByte=strlen(s)+1. This "-1" is just sugar on the top, beside which the function makes no assumptions that the input and output string are zero-terminated! The func-output is an array, not a szString.
Ultrano: perhaps you should re-read what MSDN has to say about the function?
In other words, passing strlen+1 or -1 as argument, you should get a NUL terminated output string. A 1-byte NUL isn't exactly appropriate for UTF-16 output.
EDIT: ...which is why it's good and correct that MBTWC outputs a 2-byte NUL :P
MultiByteToWideChar does not null-terminate an output string if the input string length is explicitly specified without a terminating null character. To null-terminate an output string for this function, the application should pass in -1 or explicitly count the null terminator for the input string.
In other words, passing strlen+1 or -1 as argument, you should get a NUL terminated output string. A 1-byte NUL isn't exactly appropriate for UTF-16 output.
EDIT: ...which is why it's good and correct that MBTWC outputs a 2-byte NUL :P
P.S. it's quite understandable why the output should be an assumed to be a szString: the same func can be used for a wide variety of tasks: 1) bulk-handling of many szStrings, 2) transcode of just parts of a text, 3) transcode into Pascal/Java type of strings.
WinXP SP2:
No problem here...
.data
z1 db "hello",0
z2 dw 40 dup (0CCCCh)
.code
main proc
invoke MultiByteToWideChar,CP_ACP,0,addr z1,-1,addr z2,40
movzx ecx,z2[10]
print ecx ; ecx = 0 *******
ret
main endp
No problem here...
I just fired up a XP32 VMware, and it did a word-sized NUL for my sample code above - so clearly the bug has been introduced later. PS: your code is buggy, the last argument is cchWideChar, not cbWideChar :)
Synfire: getting EOF result value might be very unlikely, but it's not good practice not to handle it. Furthermore, what's with the (int) cast? - fscanf already returns int. And finally, fscanf returns the number of fields, not bytes, read... so you'll be getting your NUL terminator at index 1, not at the end of the string (at least that's consistent with MSDN, K&R and empirical testing).
PS: the single-line statement might be "cute", but personally I'd introduce an "int result" to keep the code clearer & cleaner. Sure, code increases to two or three lines, but I'm willing to sacrifice that for clarity.
I typecast as an old habit. When I first started out with C I hated having issues with "converting types" when I knew good and well the values were essentially the same thing. So I got myself into the habit of typecasting any function in which I grab a value from.
As for the reason for the reduced one-liner, it has nothing to do with "cuteness" rather than what's actually generated. Whereas using the extended version (int result) generates a 144 byte <main>, the one-liner generated 134 byte <main> (built with GNU-C). Little optimizations like this are useful. I'm by far a C expert but I try to keep code size down as much as possible no matter what language I'm writing in.
f0dder: then on that attached MBTWC_bug.png .... what are those 2 zeroes doing at 1Ch :) , if that's on another OS?
I typecast as an old habit. When I first started out with C I hated having issues with "converting types" when I knew good and well the values were essentially the same thing. So I got myself into the habit of typecasting any function in which I grab a value from.
As for the reason for the reduced one-liner, it has nothing to do with "cuteness" rather than what's actually generated. Whereas using the extended version (int result) generates a 144 byte <main>, the one-liner generated 134 byte <main> (built with GNU-C). Little optimizations like this are useful. I'm by far a C expert but I try to keep code size down as much as possible no matter what language I'm writing in.
f0dder: then on that attached MBTWC_bug.png .... what are those 2 zeroes doing at 1Ch :) , if that's on another OS?
Sounds like Homer probably *is* passing strlen-without-NUL, unless he's on a OS version that actually is buggy :)
Told ya :)
The beer is strong in us tonight :P
The whole world would collapse if that func was buggy in that way.
The beer is strong in us tonight :P
The whole world would collapse if that func was buggy in that way.