Hi all,

I'd like some advice on how to create a stack frame for SSE. It has to be 16-byte aligned, and I'd like to access most locals with a one-byte positive index. So my first attempt looks like this:

prologue:
push ebp
mov ebp, esp
and ebp, 0xFFFFFFF0
sub ebp, 0x110

epilogue:
pop ebp

I see all kinds of different approaches on various sites, so I'm a bit confused about what's actually the best approach. I think one of the problems is making this work for multiple calls. Although currently I'm using this for only one function, and it seems to work as expected, I'm curious what the 'right' way is to do this.
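For readers following along, here is a minimal C model of the register arithmetic in the prologue above. It is only a sketch: the struct and function names are made up for illustration, and it assumes ESP is 4-byte aligned on entry and points at the return address.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t ebp, esp; } Regs;

/* Models the first-attempt prologue. entry_esp is ESP on function entry
   (assumed 4-byte aligned, pointing at the return address). */
Regs first_attempt_prologue(uint32_t entry_esp) {
    assert(entry_esp % 4 == 0);
    Regs r;
    r.esp = entry_esp - 4;   /* push ebp */
    r.ebp = r.esp;           /* mov ebp, esp */
    r.ebp &= 0xFFFFFFF0u;    /* and ebp, 0xFFFFFFF0 */
    r.ebp -= 0x110;          /* sub ebp, 0x110 */
    return r;
}

/* Nonzero when the whole frame [ebp, ebp + 0x110) lies below ESP, i.e. in
   memory that later pushes and calls will reuse. */
int frame_is_below_esp(Regs r) {
    return r.ebp + 0x110 <= r.esp;
}
```

In this model EBP always ends up 16-byte aligned and ESP never moves, so `pop ebp` still finds the saved value. The frame sits entirely below ESP, though, which is presumably why nested calls are a concern.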

Thanks.
Posted on 2004-11-01 16:30:11 by C0D1F1ED
I found this site: A guide to using the Pentium's new multimedia instructions, which seems very helpful, but I'm afraid the problem is more complicated. I don't know the frame size in advance.

This is because it's for dynamically generated code, using SoftWire. Local variables are dynamically created when required. So I was thinking about using ebp + a positive byte-sized index, and when those are all used, continuing with a negative index. But that's problematic when other stack frames have to fit in...

...I'm lost. :cry:
Posted on 2004-11-01 17:02:23 by C0D1F1ED

and ebp, 0xFFFFFFF0
sub ebp, 0x110

You mean, esp here, right?

I'm curious what other methods exist. One could be what I usually do, which is just a reordered version of your code. The one in the linked article doesn't look enough better to me to prefer it.

And, why would that cause any problem with multiple calls? Do you mean you want to create a Pascal-like stack frame?
Posted on 2004-11-02 22:55:33 by Starless
If you are going to go to all the trouble, at least use positive and negative offsets. :P

You want ESP to be lower numerically than EBP (or the lowest used address in the frame). Using positive and negative offsets will gain you 16 slots on the stack, of course.

08n --

070 -- greatest 16 byte slot
.
.
.
000 -- EBP
.
.
.
F80 -- least 16 byte slot

F7n -- ESP
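The slot count follows from the signed 8-bit displacement range of -128..127. A small C check of that arithmetic, as a sketch (the helper names are hypothetical, not from any library):

```c
#include <assert.h>
#include <stdint.h>

/* True when an EBP-relative offset can be encoded as a one-byte displacement. */
int fits_disp8(int32_t off) {
    return off >= INT8_MIN && off <= INT8_MAX;
}

/* Counts the 16-byte slots in [lo, hi) whose start offset fits in disp8.
   lo is assumed to be a multiple of 16. */
int count_disp8_slots(int32_t lo, int32_t hi) {
    assert(lo % 16 == 0);
    int n = 0;
    for (int32_t off = lo; off < hi; off += 16)
        if (fits_disp8(off))
            n++;
    return n;
}
```

With positive offsets only, `count_disp8_slots(0, 256)` finds 8 slots (0 through 112); centering the window with `count_disp8_slots(-128, 128)` finds all 16.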

In this scheme we don't know exactly what the address of EBP is because it is altered for alignment. The AND'ing further lowers the address of EBP by up to 12 bytes and an extra 12 bytes is subtracted from ESP for this extreme instance.

push ebp
lea ebp, [esp - (16*8)]
sub esp, 16*16 + 12 ; assume not aligned
and ebp, 0-16

To restore the stack, ESP needs to be adjusted and then EBP popped.

add esp, 16*16 + 12
pop ebp
ret

To use fewer SSE slots, just reduce the offset to ESP -- not the EBP offset -- "16*16 + 12" at both prologue and epilogue is all that needs to change.
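A rough C model of this prologue and epilogue, for checking the alignment claim. This is a sketch under assumptions: the `lea` is taken to be relative to ESP, ESP is assumed 4-byte aligned after `push ebp`, and all names are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t ebp, esp; } Frame;

/* esp_after_push is the value of ESP right after `push ebp`. */
Frame bitrake_prologue(uint32_t esp_after_push) {
    assert(esp_after_push % 4 == 0);
    Frame f;
    f.ebp = esp_after_push - 16 * 8;          /* lea ebp, [esp - 16*8] (assumed) */
    f.esp = esp_after_push - (16 * 16 + 12);  /* sub esp, 16*16 + 12 */
    f.ebp &= (uint32_t)-16;                   /* and ebp, 0-16 */
    return f;
}

/* The 16 slots span [ebp - 128, ebp + 128). They must be 16-byte aligned,
   start at or above ESP, and end at or below the saved EBP. */
int frame_is_valid(Frame f, uint32_t esp_after_push) {
    return f.ebp % 16 == 0
        && f.ebp - 128 >= f.esp
        && f.ebp + 128 <= esp_after_push;
}

/* add esp, 16*16 + 12: ESP points at the saved EBP again. */
uint32_t bitrake_epilogue_esp(Frame f) {
    return f.esp + (16 * 16 + 12);
}
```

The extreme case is an ESP that is 12 bytes past a 16-byte boundary after the push: the AND then lowers EBP by the full 12 bytes and the lowest slot starts exactly at ESP.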
Posted on 2004-11-03 00:03:25 by bitRAKE
You mean, esp here, right?

No, using ebp makes the encoding one byte shorter.
And, why would that cause any problem with multiple calls? Do you mean you want to create a Pascal-like stack frame?

Well, I need esp to point to free space usable by the next function, don't I?
Posted on 2004-11-04 02:04:33 by C0D1F1ED
To use less SSE slots just reduce the offset to ESP -- not the EBP offset -- "16*16 + 12" at both prologue and epilogue is all that needs to change.

And what if I need more? 8) I'm currently using more than 1024 bytes of stack space for local variables (GP, MMX and SSE) in one of my processing pipelines. The best solution I found so far is to just reserve enough space and generate an error when it's exceeded. Unfortunately, because this code is generated dynamically there is absolutely no guarantee that I have reserved enough. Would it be useful to adjust esp when I'm allocating more variables than expected? I think this would cause serious problems when esp is adjusted inside a loop...

...it might be possible to adjust the instructions in the epilogue. Let me see if that works. :idea:
Posted on 2004-11-04 02:14:26 by C0D1F1ED
Well, in Windows you have to touch every page of the stack to ensure it's committed, and you can grab as much as you want until something blows up! :)

As for accessing multiple sets of temp SSE slots, maybe try moving EBP around. Once you know it is aligned it could just be advanced to a new set of temp slots -- assuming no critical loops need more than 16. One instruction sure beats putting everything in separate frames.

Basically, a 16-slot sliding window is the effect I'm trying to explain. Inside processor intensive loops the window should encompass all used temp SSE slots.
Posted on 2004-11-04 09:59:15 by bitRAKE
I think I got it working now:


prologue:
push ebp
mov ebp, esp
sub ebp, stackSize - 128
lea esp, dword ptr [ebp-128-12]
and ebp, 0xFFFFFFF0

epilogue:
add esp, stackSize + 12
pop ebp

This code is dynamically generated. At the moment when the prologue is encoded, the stack size is unknown, but it's quite easy to keep a pointer to this instruction and update it when more local variables are required. esp always points to free stack space. Subtracting 12 is done to keep it out of the aligned frame. Storing variables starts at ebp - 128 and goes up, to ensure best use of byte-sized offsets, and prefetch the next cache lines easily.
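The update step could look roughly like the following C sketch. It is not SoftWire's API: `CodeBuffer` and the functions are hypothetical, and only the patchable `sub ebp, imm32` (encoded as 81 ED followed by a 32-bit immediate) is shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed-size code buffer; not SoftWire's actual data structure. */
typedef struct {
    uint8_t bytes[64];
    size_t  len;
} CodeBuffer;

void emit8(CodeBuffer *cb, uint8_t b) {
    assert(cb->len < sizeof cb->bytes);
    cb->bytes[cb->len++] = b;
}

/* Overwrites a little-endian 32-bit immediate that was emitted earlier. */
void write_imm32(CodeBuffer *cb, size_t at, uint32_t value) {
    assert(at + 4 <= cb->len);
    for (int i = 0; i < 4; i++)
        cb->bytes[at + i] = (uint8_t)(value >> (8 * i));
}

/* Emits `sub ebp, imm32` (81 ED id) with a zero placeholder and returns the
   buffer offset of the immediate so it can be patched later. */
size_t emit_sub_ebp_placeholder(CodeBuffer *cb) {
    emit8(cb, 0x81);
    emit8(cb, 0xED);
    size_t at = cb->len;
    for (int i = 0; i < 4; i++)
        emit8(cb, 0x00);
    return at;
}

/* Called whenever the frame grows: rewrites the immediate to stackSize - 128. */
void patch_frame_size(CodeBuffer *cb, size_t imm_at, uint32_t stack_size) {
    write_imm32(cb, imm_at, stack_size - 128);
}
```

Storing a buffer offset rather than a raw pointer keeps the patch location valid if the buffer is ever reallocated. The `add esp, stackSize + 12` in the epilogue can be handled the same way, or simply emitted last once the size is final.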

This really seems like the most compact way to do this. Thanks all for the ideas!
Posted on 2004-11-06 18:18:06 by C0D1F1ED
I'm not there yet... :cry:

The problem now is function arguments. I can access them with esp + offset but this offset is not known before the whole function is generated (because the frame size is unknown). Just like the instruction in the prologue I could adjust it every time it's known that the frame has to grow, but since there can be many arguments and many instructions accessing them, this is much harder to manage.

So I'm trying to avoid this. Obviously, function arguments can be read easily before the prologue. Or in other words, if I can grab esp at that point and store it somewhere it can be used later for reading function arguments. So my current idea is to store that pointer as the very first local variable. Dynamic register allocation will make sure that this memory location is actually not used often. 8)

This may sound wacked, but it might actually work. If anyone has better ideas, or just some more wacked inspiration, let me know!
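To make the difference concrete, here is a small C sketch of both ways to compute an argument's address. It assumes 4-byte arguments passed on the stack with the return address at the entry ESP, and it uses the stackSize prologue posted on 2004-11-06; the function names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Address of stack argument `index` (0-based), given ESP at function entry.
   Assumes 4-byte stack arguments and the return address at [entry_esp]. */
uint32_t arg_addr_from_entry_esp(uint32_t entry_esp, uint32_t index) {
    return entry_esp + 4 + 4 * index;
}

/* The same address computed from ESP after the stackSize prologue, where
   esp = entry_esp - 4 (saved ebp) - stack_size - 12. The offset depends on
   stack_size, which is unknown until the whole function is generated. */
uint32_t arg_addr_from_frame_esp(uint32_t frame_esp, uint32_t stack_size,
                                 uint32_t index) {
    return frame_esp + stack_size + 12
         + 4            /* saved ebp */
         + 4            /* return address */
         + 4 * index;
}
```

The first form needs no patching when the frame grows, at the cost of keeping the entry ESP in a register or reloading it from its local slot.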
Posted on 2004-11-07 09:21:09 by C0D1F1ED
The idea that comes to my mind first is to copy the locals relative to ESP and then add this size to the frame size. My thinking is anything is better than two dependent instructions (that is, if I am understanding you correctly).

Are you suggesting function parameters could be accessed like so:

mov edx, ; get old ESP
mov eax, [4*-4] ; get fourth parameter
:?:
Posted on 2004-11-07 09:51:24 by bitRAKE
There's a best and a worst-case scenario. In the best case, the original esp just gets copied to a free register, and this register is used whenever a function argument has to be read. In the worst case, this register has to be written back to memory (because the register allocator decides it is more useful for other variables), and it has to be read back when a function argument has to be accessed.

Anyway, I think this is close to ideal given the restrictions I'm working with. For short functions where performance depends on every instruction, it would only take one extra mov operation and take one of the six available general-purpose registers. In longer functions, where instruction count doesn't matter that much but registers are precious, it avoids leaving us with only five general-purpose registers, and it still minimizes the number of dependent read operations (since most function arguments are accessed close together).

Your idea of copying it to the new stack top sounds like a good solution too. Maybe it's simpler to implement than what I'm trying to do now...

Thanks bitRAKE!
Posted on 2004-11-08 03:31:28 by C0D1F1ED