I'm beginning to understand those "hardcore" assembly programmers who hate compilers... :mad:

I was coding in C++ today, my program was crashing, and after quite some debugging I found out what the problem was: my checksum function was *somehow* returning negative values (it was supposed to use positives only) and that was causing a GPF later.

I scratched my head and wondered: how come this code can return a negative number?


unsigned int checksum(const char* name)
{
    unsigned int sum = 0;
    for( int i = 0; name[i]; i++ )
    {
        sum += (unsigned int)name[i];
    }
    return sum;
}


It couldn't be an integer overflow, because the strings I was feeding it were very short (under 256 chars long). I managed to isolate what kind of strings would produce negative results, and it turned out they all contained characters with codes above 127.

Something fishy was going on there :/ so I took a look at the disassembly, and saw this line:


MOVSX ECX, DWORD PTR


Huh?! :?:

I have a theory: I think the compiler doesn't define a direct conversion from char (signed by default) to unsigned int, but it does define one to int (also signed by default). So this statement...


(unsigned int)name[i]


...was converting the char to int first, and then the int to unsigned int. Since the second conversion changes nothing, in practice the char was being sign-extended as if converted to a signed integer! :mad:

I had to fix the code like this:


unsigned int checksum(const char* name)
{
    unsigned int sum = 0;
    for( int i = 0; name[i]; i++ )
    {
        sum += (unsigned int)((unsigned char)name[i]);
    }
    return sum;
}


It can still be overflowed, but now it needs a string full of 0xFF chars and at least 8,421,505 bytes long to do it. I think I'm safe with my 256 chars limit. ;)

Anyway, is it possible to disable these automatic type conversions? Or at least get a warning when they take place? I'd hate to run into another one like this... :sad:
Posted on 2005-05-30 19:51:29 by QvasiModo
Hm, I'm not sure if there's some obscure thing in The Standard that explains this, but it does smell like a compiler bug. You should state which compiler you use - it's reproducible with vc2003. I would either pass unsigned chars (since signed chars have often been problematic in various circumstances), or rewrite it to something like this:


unsigned int checksum(const void* a_name)
{
    const unsigned char* const name = static_cast<const unsigned char*>( a_name );
    unsigned int sum = 0;

    for( int i = 0; name[i]; i++ )
    {
        sum = sum + static_cast<unsigned>( name[i] );
    }

    return sum;
}

Posted on 2005-05-30 20:14:24 by f0dder
BIG smile,  :mrgreen:


I'm beginning to understand those "hardcore" assembly programmers who hate compilers..


Ve Vill conVert Yoooo !
Posted on 2005-05-30 21:46:13 by hutch--
IMO it is not a bug!

The matter is that your signed character already holds a negative value, and that value is preserved when promoting to integer, regardless of whether the destination integer is signed or unsigned. That peculiarity (signed vs. unsigned) is only taken into account in the operations you subsequently perform on the promoted integer.

Just another example:

    signed int si = -13;                 // mov dword ptr [si], 0FFFFFFF3h
    unsigned int ui = (unsigned int)si;  // mov eax, dword ptr [si] + mov dword ptr [ui], eax


So the value inside ui will be the same as the value inside si:
you cannot expect that the cast operation will take off the negative sign from -13...

Best regards, bilbo

P.S. This is anyway a common programmer's trap: I saw a similar discussion on a Microcontrollers forum,
hxxp://www.cygnal.org/ubb/Forum7/HTML/000056.html
Posted on 2005-05-31 05:39:17 by bilbo
To solve such problems in my C code, I make pointers - at least I know they will be properly copied :). Here's how I'd fix your code:

unsigned long checksum(const char* sname)
{
    unsigned long sum = 0;
    unsigned char *name = (unsigned char*)sname;
    while(*name)
    {
        sum += (unsigned int)(*name); // extra round brackets
        name++;
    }
    return sum;
}


Note that I also use "long" instead of "int", because with some compilers "int" is 2 bytes, while "long" is always 4 bytes.
I use extra round brackets at tricky conversion-bound places, to make sure the compiler understands exactly what I need (and to spare me from learning the rock-paper-scissors logic of C/C++ compilers - which is not always implemented as ANSI states).
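If the worry is int vs. long widths, another option (from the later C99/C++11 `<cstdint>` header, so not available to every 2005 compiler) is an explicitly 32-bit accumulator. A sketch of the same fix; `checksum32` is a hypothetical name, not from the thread:

```cpp
#include <cstdint>

// Same unsigned-char-pointer trick, but with a fixed-width 32-bit sum.
std::uint32_t checksum32(const char* sname)
{
    std::uint32_t sum = 0;
    const unsigned char* name = reinterpret_cast<const unsigned char*>(sname);
    while (*name)
        sum += *name++;  // *name is unsigned char, so no sign extension
    return sum;
}
```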
Posted on 2005-05-31 07:22:19 by Ultrano

BIG smile, :mrgreen:

Ve Vill conVert Yoooo !


I'm scared... :D

As pointed out by bilbo, this is not a bug -- a design flaw at most. Since there seems to be no way to make the compiler at least warn me when it's about to do something unexpected, I guess I'll have to use unsigned chars from now on. :P

BTW, Ultrano, your fix won't work. The compiler will try to convert the signed char into a signed integer first, no matter how many brackets you use... :(
Posted on 2005-05-31 19:51:47 by QvasiModo
What about the absolute value of the signed variable?
Posted on 2005-05-31 20:28:48 by SpooK
OK, maybe because I made a typo:
sum += (unsigned int)(*name);  // ^^"
should be "long".

But if that doesn't work either, try
sum += ((unsigned long)(*name)) & 255;

So, which compiler do you use ???
Posted on 2005-06-01 00:57:12 by Ultrano
If you're going to treat plain char as numbers, you should always strip away any potential sign extension.

Here is the historical reason: the char to int conversion is supposed to use the most efficient conversion. On some machines, like the Digital machines, it was more efficient if char was treated as signed. On other machines, like the IBM mainframes, it was more efficient to treat char as unsigned. The big Unix port to at least three other machines (the original machines were Digital models) pointed out little things like this.

When C began to migrate to microprocessors, there was no signed keyword. So if you wanted (implied) sign extension in your compiler, you made plain char be equivalent to modern signed char. (Even if signed conversion was less efficient than unsigned.)

Another reason - what I remember was that everyone with access to Unix machines wanted to mimic them, which meant mimicking the Digital versions of software.
Posted on 2005-06-01 01:40:08 by tenkey
With vc2003 you can use the /J switch to cl.exe to make char be treated as unsigned, which may help.
Posted on 2005-06-01 05:46:31 by stormix
:shock: impressive, tenkey :)

interesting to know
Posted on 2005-06-01 06:09:08 by HeLLoWorld
Wow! A lot of nice info, thanks to tenkey and stormix!

By the way, also MSVC 6.0 contemplates the switch /J!
Regards, bilbo
Posted on 2005-06-01 06:32:57 by bilbo
Another way to do this (squeezed into one line for extra l33tness :mrgreen:)
for (int i = 0; name[i]; sum += (unsigned char)name[i++]);

Which is basically what tenkey indicated ;)
Posted on 2005-06-01 10:37:01 by Tola

So, which compiler do you use ???

Both VC 6 and GCC 3. Since one of my goals is portability, I'd rather stay away from long and keep using int.

@tenkey: Very interesting! That explains this oddity in the type conversion logic. :)
Posted on 2005-06-01 11:24:21 by QvasiModo

With vc2003 you can use the /J switch to cl.exe to make char be treated as unsigned, which may help.

Never never never ever depend on this; it's for being able to compile broken source code. If you're depending on 'char' being unsigned, you might as well code in assembly, as your code will not be portable.
Posted on 2005-06-01 12:00:23 by f0dder