Hi all,

I am totally puzzled by a weird phenomenon in my code. My debug build is twice as fast as my release build (Visual C++)! What's even stranger: adding a pushad and popad at the start and end of the bottleneck function solves the problem...

Is it possible that a function executes two times slower when the stack starts at another address? The function I'm using is quite big and takes 99% of execution time. The stack frame is ~600 bytes and 16-byte aligned. Any possible explanations?

Thanks,

c0d1f1ed
Posted on 2005-03-31 18:00:32 by C0D1F1ED

Is it possible that a function executes two times slower when the stack starts at another address?

Hmmm, just a wild idea: I remember reading about some cache issue... Two memory lines compete for the same L2 cache lines if their addresses differ by a multiple of 64 KB. This is for the PII, but I think it applies to later processors as well? (Reference: http://www.linuxshowcase.org/2000/2000papers/papers/sears/sears_html/)
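
To make the "differ by a multiple of 64 KB" condition concrete, here is a minimal sketch of the check. The helper name aliases_at_64k is made up for illustration, and whether this kind of aliasing actually matters on a given processor is an assumption to verify, not something this code proves.

```cpp
#include <cstdint>

// Hypothetical helper (not from any library): returns true when two
// addresses have identical low 16 bits, i.e. when they differ by an
// exact multiple of 64 KB (0x10000 bytes).
constexpr bool aliases_at_64k(std::uintptr_t a, std::uintptr_t b)
{
    return ((a ^ b) & 0xFFFFu) == 0;
}
```

You could feed it the address of a stack local and the address of a hot data buffer to see whether the two happen to sit at the same offset within a 64 KB window.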
Posted on 2005-04-01 01:47:19 by f0dder
Indeed it appears to be cache related. If I offset the stack by 64 bytes I get the same performance degradation. My Pentium M has 64-byte cache lines for L1 and L2...

But that makes it even more bizarre. Both caches are 8-way associative, and L1 data cache is 32 kB so I can't possibly be causing cache thrashing. :sad:
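
For reference, a rough sketch of the arithmetic behind that claim, using only the figures above (32 kB, 8-way, 64-byte lines, ~600-byte frame). The struct and function names are invented for illustration, and it assumes the textbook set-index mapping (line number modulo number of sets).

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative description of a set-associative cache.
struct CacheGeometry {
    std::size_t size_bytes;
    std::size_t ways;
    std::size_t line_bytes;
};

// Number of sets = total size / (ways * line size).
constexpr std::size_t num_sets(CacheGeometry c)
{
    return c.size_bytes / (c.ways * c.line_bytes);
}

// Set an address maps to, assuming simple modulo indexing.
constexpr std::size_t set_index(CacheGeometry c, std::uintptr_t addr)
{
    return (addr / c.line_bytes) % num_sets(c);
}

// How many cache lines a block of `size` bytes starting at `start` touches.
constexpr std::size_t lines_spanned(CacheGeometry c, std::uintptr_t start, std::size_t size)
{
    return (start + size - 1) / c.line_bytes - start / c.line_bytes + 1;
}

// L1 data cache figures quoted in this post.
constexpr CacheGeometry kPentiumML1{32 * 1024, 8, 64};
```

With these numbers there are 64 sets, and a contiguous 600-byte frame touches 10 or 11 consecutive lines depending on where it starts, each in a different set. So the frame on its own can't evict itself; any conflict would have to come from other data mapping to the same sets.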
Posted on 2005-04-01 03:15:08 by C0D1F1ED
That does sound quite bizarre :-s. Hm, do debug builds pre-fill the local variables? I recall seeing 0xCCCCCCCC filling at least with VC6. It's been a while since I traced C/C++ code in assembly view, though.

Tried running VTune on the stuff?
Posted on 2005-04-01 05:54:20 by f0dder
I think the 8-way associativity is the key.

I know that on the 486, the 512 cache lines were actually 128 subcaches of 4 cache lines each. (I'd have to hunt down docs to see if this was called 4-way associative.) You could get cache thrashing if you needed more than 4 cache lines in one subcache. Because the cache lines were 16 bytes each, the magic multiple (the distance that degraded performance) was 128 x 16 = 2048 bytes. Since code also went through the same cache, you could change the performance by adding dead code between a subroutine and its caller.
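
The "magic multiple" can be worked out as cache size divided by associativity. Here is a small sketch of that, assuming the 486's on-chip cache is 8 KB (my recollection, not checked against the docs) and using the Pentium M L1 figures given earlier in the thread. The function names are made up for illustration.

```cpp
#include <cstddef>

// Distance in bytes at which two addresses land in the same set:
// the size of one way, i.e. cache size / associativity.
constexpr std::size_t critical_stride(std::size_t cache_bytes, std::size_t ways)
{
    return cache_bytes / ways;
}

// True when two addresses map to the same set, given the line size
// and the critical stride computed above.
constexpr bool same_set(std::size_t a, std::size_t b,
                        std::size_t line_bytes, std::size_t stride)
{
    return (a % stride) / line_bytes == (b % stride) / line_bytes;
}
```

For the 486 this gives 8192 / 4 = 2048, and for the Pentium M L1 it gives 32768 / 8 = 4096. Thrashing would then need more than 4 (or 8) hot lines whose addresses are a multiple of that stride apart.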
Posted on 2005-04-01 17:28:01 by tenkey