Hello Everyone,
I am working on a program at the moment that uses SSE2. The following code is from an unrolled loop. Even though OllyDbg does not support SSE2 (yet), I have run the program under the debugger to diagnose the error. It occurs at the marked line below and looks like a store/load error. I thought that this would only slow the code down. However, under OllyDbg it shows up as a "privileged instruction" error. Could anyone help me with this, please?



prefetchnta

movapd xmm1,
mulpd xmm1, xmm2
addpd xmm1,
movapd , xmm1

movapd xmm1, <---------
mulpd xmm1, xmm2
addpd xmm1,
movapd , xmm1

movapd xmm1,
mulpd xmm1, xmm2
addpd xmm1,
movapd , xmm1


Liamo
Posted on 2004-07-15 17:00:27 by Liamo
movapd _must_ be aligned, movupd allows unaligned access (but is slower). Quoting the Intel docs:


The MOVAPD (move aligned packed double-precision floating-point) instruction transfers a 128-bit packed double-precision floating-point operand from memory to an XMM register or vice versa, or between XMM registers. The memory address must be aligned to a 16-byte boundary; if not, a general-protection exception (#GP) is generated.

The MOVUPD (move unaligned packed double-precision floating-point) instruction transfers a 128-bit packed double-precision floating-point operand from memory to an XMM register or vice versa, or between XMM registers. Alignment of the memory address is not required.
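
To make the alignment rule concrete, here is a minimal C sketch that tests whether an address sits on a 16-byte boundary. The names is_aligned16, demo, offset_is_aligned16 and check_offsets are made up for illustration, and the pointer-to-integer cast assumes a flat address space as on mainstream x86 compilers. It only shows the arithmetic behind the rule, not how the CPU performs the check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 when p lies on a 16-byte boundary, 0 otherwise.
   Assumption: converting a pointer to uintptr_t yields the plain address. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical demo buffer, forced onto a 16-byte boundary (C11 _Alignas). */
_Alignas(16) double demo[4];

/* Tests the address that lies byte_offset bytes past the aligned base. */
int offset_is_aligned16(size_t byte_offset)
{
    return is_aligned16((const char *)demo + byte_offset);
}

/* Self-check: base+0 and base+16 are on a boundary, base+8 is not. */
void check_offsets(void)
{
    assert(offset_is_aligned16(0));
    assert(!offset_is_aligned16(8));
    assert(offset_is_aligned16(16));
}
```

So with a 16-byte aligned base, an aligned 128-bit access is fine at offsets 0, 16, 32 and so on, while an offset of 8 falls between two boundaries.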
Posted on 2004-07-15 17:18:23 by f0dder
Thanks for replying, fodder. All data used in my program is aligned to 16 bits. If alignment were the problem, I would have expected the error to occur on the first block in the loop. Thanks again.

Liamo
Posted on 2004-07-15 17:28:26 by Liamo
16 bytes, you mean :) - and since you do an access at an offset of +8, well, that causes an unaligned reference. I haven't worked much with SSE or SSE2 code, but it seems somewhat wasteful that you load from +8, since you've already loaded that data previously. Isn't there some shuffling/unpacking/whatever thingamajig you can use instead? There are plenty of SSE registers for temp usage :)
Posted on 2004-07-15 17:34:47 by f0dder
Thanks again fodder

I will try out the alignment by 8 and get back to you. The data being loaded is from two large vectors of data. Each element, a double, has to be worked on independent of the rest.

Thanks again
Liamo
Posted on 2004-07-15 17:42:09 by Liamo
Sorry for my ignorance if I'm wrong - as I said I haven't worked much with SSE, I haven't looked much at your problem, and it's late night. Now with this disclaimer to avoid evil flames, let me try...

When you do movapd, you get two 64-bit doubles from memory into your XMM register. mulpd then multiplies lane by lane so that, if I understand correctly, you get


mulpd xmm1, xmm2:
xmm1.high = xmm1.high * xmm2.high
xmm1.low = xmm1.low * xmm2.low


The same goes for addpd:


addpd xmm1, [edi]
xmm1.low = xmm1.low + REAL64 ptr [edi]
xmm1.high = xmm1.high + REAL64 ptr [edi+8]


so... if I understand this correctly (which I probably don't :) ), you should be able to do something like:



movapd xmm1, [esi]
mulpd xmm1, xmm2
addpd xmm1, [edi]
movapd [edi], xmm1

and then add 16 to the esi and edi indexes (of course you could unroll this loop further, as there are spare registers and such). If you have only loaded a real64 into the low part of xmm2, you should of course load it into the high part as well.
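
For anyone who prefers intrinsics, the same four-instruction pattern can be sketched in C with `<emmintrin.h>`. This is only a sketch under stated assumptions: the names scale_add and check_scale_add are hypothetical, both arrays are assumed to be 16-byte aligned, and the scalar a is assumed to be the value held in both halves of xmm2. The comments show which instruction each intrinsic roughly corresponds to; the compiler is free to emit something different.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Computes y[i] = x[i] * a + y[i], two doubles per step.
   Assumptions: x and y are 16-byte aligned; a trailing odd element
   (if n is odd) is left untouched by this sketch. */
void scale_add(const double *x, double *y, double a, size_t n)
{
    /* a in both the low and the high lane, playing the role of xmm2 */
    __m128d factor = _mm_set1_pd(a);

    /* i advances by 2 doubles, i.e. 16 bytes, so every access stays aligned */
    for (size_t i = 0; i + 1 < n; i += 2) {
        __m128d v = _mm_load_pd(x + i);          /* movapd xmm1, [esi] */
        v = _mm_mul_pd(v, factor);               /* mulpd  xmm1, xmm2  */
        v = _mm_add_pd(v, _mm_load_pd(y + i));   /* addpd  xmm1, [edi] */
        _mm_store_pd(y + i, v);                  /* movapd [edi], xmm1 */
    }
}

/* Self-check on a small hypothetical data set. */
void check_scale_add(void)
{
    _Alignas(16) double x[4] = {1.0, 2.0, 3.0, 4.0};
    _Alignas(16) double y[4] = {10.0, 20.0, 30.0, 40.0};

    scale_add(x, y, 2.0, 4);

    assert(y[0] == 12.0);
    assert(y[1] == 24.0);
    assert(y[2] == 36.0);
    assert(y[3] == 48.0);
}
```

Stepping the index by one element (8 bytes) instead of two would hand _mm_load_pd an address that is not on a 16-byte boundary on every other iteration, which is the same fault as the one in the original loop.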

There's probably a lot of optimizations to be made to this code, but this is what my brain can handle at this point in space and time ;)
Posted on 2004-07-15 18:13:44 by f0dder
Thanks again fodder

I feel stupid!!! I should have been shifting the index registers by 16 and not 8. Thanks for taking the time to go through the code like that. It's great to be able to talk to someone else about it like this.

Thanks again
Liamo
Posted on 2004-07-15 18:20:40 by Liamo
Heh, and I felt stupid when writing the stuff - thought I had missed something bloody obvious :). Don't worry, this forum exists so we can learn from and help each other. And sometimes the most obvious bugs can be the hardest to spot.

I'm sure some of the clever heads here could speed up your algo a lot, but at least it might work now :P
Posted on 2004-07-15 18:27:36 by f0dder