When issuing a REP MOVSD, it's obviously best if ESI and EDI are 4-byte aligned.

When working with arbitrary memory locations, you can't always ensure this. When dealing with the case where both ESI and EDI are misaligned, and not by the same amount, does it matter which of the registers you align, and does alignment matter at all?
Posted on 2005-08-19 02:11:00 by f0dder
By sheer coincidence, I saw Terje Mathisen write this on CLAX:

For speed you really want to use aligned writes, while unaligned reads
are quite cheap, however this means that you cannot blindly load dwords
since the last one might start with the actual end of the input, then
straddle a cache line & page boundary into illegal territory.

The most elegant solution here is to special-case the situation where
source and destination have the same alignment, since this is fast &
easy, and then handle the general case by both loading and storing
aligned dwords, with a SHRD inside to align the output.
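
The general case described in the quote can be sketched in C roughly as follows. This is only an illustration under stated assumptions, not Terje's actual code: the name copy_shifted and its parameters are made up, the source is passed as the aligned dword array that contains the data, the byte offset into it is 1..3, and the byte order is little-endian as on x86. The shift/OR pair is what a single SHRD does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: store n_dwords aligned dwords to dst, taken from the
   byte stream that starts `shift` bytes (1..3) into the aligned array
   src_words. Performs n_dwords + 1 aligned loads: src_words[0..n_dwords].
   Assumes little-endian byte order. */
void copy_shifted(uint32_t *dst, const uint32_t *src_words,
                  size_t n_dwords, unsigned shift)
{
    assert(shift >= 1 && shift <= 3);    /* shift 0 is the easy special case */

    unsigned lo = shift * 8;             /* bits dropped from the current dword */
    unsigned hi = 32 - lo;               /* bits taken from the next dword */
    uint32_t cur = src_words[0];

    for (size_t i = 0; i < n_dwords; i++) {
        uint32_t next = src_words[i + 1];     /* aligned load */
        dst[i] = (cur >> lo) | (next << hi);  /* SHRD equivalent, aligned store */
        cur = next;
    }
}
```

Because every load is an aligned dword, no load can straddle a cache line or page boundary. A real routine would still need separate handling for the first and last partial dwords.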


...he usually seems to know his stuff. Any comments? :)
Posted on 2005-08-19 02:59:08 by f0dder
I think he's right. You can visualize it: imagine a 4-byte unit that can read only on 4-byte boundaries. To read something unaligned, it has to read twice and combine the results. That's why bytes have to be byte-aligned (1-byte "read unit"), words 2-byte aligned, etc. This is of course very theoretical, and only tests can really show anything useful, but I always stick with the above visualization and it works.

As for the rep movsd - you can ALWAYS do the string move on an aligned address, and then move the remaining "unaligned" bytes with a few additional instructions. It really should be faster.
Posted on 2005-08-19 03:26:08 by ti_mo_n
Re-read my post :) - what Terje basically says is that if one of your strings has to be misaligned, it should be the string being read.
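
A minimal sketch of that idea in C, assuming the goal is simply "aligned writes, possibly unaligned reads". The name copy_dst_aligned is hypothetical, and the dword loop stands in for what REP MOVSD would do in assembly:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy n bytes so that every dword store is 4-byte
   aligned, while the loads are allowed to stay misaligned. */
void *copy_dst_aligned(void *dst, const void *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));

    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Head: single bytes until the destination reaches a 4-byte boundary. */
    while (n > 0 && ((uintptr_t)d & 3) != 0) {
        *d++ = *s++;
        n--;
    }

    /* Body: one dword per step. The store is aligned, the load may not be.
       A fixed-size memcpy is the portable C spelling of such an access. */
    while (n >= 4) {
        uint32_t w;
        memcpy(&w, s, 4);
        memcpy(d, &w, 4);
        d += 4;
        s += 4;
        n -= 4;
    }

    /* Tail: the remaining 0..3 bytes. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

Whether this actually beats aligning the source instead is exactly the open question of the thread, so treat it as something to benchmark rather than a given.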


As for the rep movsd - you can ALWAYS do the string move on an aligned address, and then move the remaining "unaligned" bytes with a few additional instructions. It really should be faster.

Consider address1 = 401001 and address2 = 402003... how would you move bytes so both addresses become aligned?
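
The point of that example can be checked with a couple of lines of C. The helper names are made up for illustration, and the addresses are read as hex here (the low two bits happen to be the same if they are read as decimal). Copying leading bytes advances both pointers by the same amount, so the two can only become aligned together if their low two bits already match:

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero if advancing both addresses by the same byte count can make
   both of them 4-byte aligned, i.e. their low two bits are equal. */
int can_coalign(uintptr_t a, uintptr_t b)
{
    return ((a ^ b) & 3) == 0;
}

/* Number of single bytes (0..3) to copy before p reaches a 4-byte boundary. */
unsigned head_bytes(uintptr_t p)
{
    unsigned n = (unsigned)((4 - (p & 3)) & 3);
    assert(((p + n) & 3) == 0);   /* sanity check: p + n is aligned */
    return n;
}
```

For 0x401001 and 0x402003, head_bytes gives 3 and 1 respectively, and can_coalign returns 0: aligning one pointer leaves the other 2 bytes off.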

Posted on 2005-08-19 03:29:46 by f0dder
Yes, you're right. But there are very few situations (in "win32asm" at least) where you HAVE to write at some specific address. Most of the time you can allocate a larger buffer and do aligned reads/writes. This is supposed to be faster (and that's what I meant with this '1/2/4-byte unit'), and as for the rest, you're (of course) right.

I think the only way to see if the guy is right is to do benchmarks on some platforms. There are lots of things that theoretically should be faster, and they are much slower in practice (and vice versa). :)

/edit
corrected some typos ^^"
Posted on 2005-08-19 03:38:21 by ti_mo_n
From my experience it does not really matter :D

Intel CPUs from the Pentium 1 onwards hide this issue through the use of cache. A read and write sequence on the same (or nearby) locations will cause some cache problems, but if the source and destination are disjoint and relatively far apart, you see no real difference.

In practice this means that reads do not matter, but writes do matter a little.
So he is right, but by a very small percentage... no more than 2-3%. If you have a LOT of data to transfer it might matter more, but for small data it's overkill.
Cache lines and cache issues matter much more.

Beware that on non-Intel CPUs things are COMPLETELY different, since the RISC ones in PDAs, mobile, or embedded devices usually do not perform as well as Intel does and are much more sensitive to such issues.

Posted on 2005-08-19 05:26:14 by BogdanOntanu
f0dder,

You are probably best to write a test piece and alternate the alignment of the source and destination against a timer. Bogdan's advice on trying it on different hardware makes sense as a good solution on Intel may be a disaster on something else. There is not really a good solution as you can align one dynamically but not the other at the same time. I would also investigate incremented pointer code rather than REP MOVSD as it may be more flexible in this situation.
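
A rough sketch of such a test piece in C, under several assumptions: clock() is a coarse timer compared with RDTSC or performance counters, byte_copy is only a placeholder for the routine under test, and all names here are hypothetical. The routine is passed as a function pointer so different copy loops can be compared with the same offsets.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

enum { MAX_LEN = 1 << 16 };

/* 16-byte aligned, so an offset of 0..3 fully determines the misalignment. */
static _Alignas(16) unsigned char src_buf[MAX_LEN + 4];
static _Alignas(16) unsigned char dst_buf[MAX_LEN + 4];

typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Placeholder routine under test: a plain incremented-pointer byte loop. */
void *byte_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n-- > 0)
        *d++ = *s++;
    return dst;
}

/* Time `reps` copies of `len` bytes with source and destination misaligned
   by src_off and dst_off bytes (0..3 each). Returns elapsed CPU seconds,
   or -1.0 if the arguments are out of range. */
double time_copy(copy_fn fn, size_t src_off, size_t dst_off,
                 size_t len, int reps)
{
    assert(fn != NULL);
    if (src_off > 3 || dst_off > 3 || len > MAX_LEN)
        return -1.0;

    clock_t t0 = clock();
    for (int r = 0; r < reps; r++)
        fn(dst_buf + dst_off, src_buf + src_off, len);
    clock_t t1 = clock();

    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

Looping src_off and dst_off over 0..3 gives the 16 alignment combinations; the repetition count needs to be large enough for the differences to rise above timer noise.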
Posted on 2005-08-19 05:46:10 by hutch--

Yes, you're right. But there are very few situations (in "win32asm" at least) where you HAVE to write at some specific address.

Concatenating strings would be an example - you have to write to the end of the string, whether it's aligned or not :)


You are probably best to write a test piece and alternate the alignment of the source and destination against a timer.

I plan on doing that, once I start my performance tuning. I really should brush up my toy OS kernel a bit, as it will allow much more accurate timings (and performance counters) than you can get under Windows...


Bogdan's advice on trying it on different hardware makes sense as a good solution on Intel may be a disaster on something else.

Yup, the only way to do it these days. Fortunately I have an AMD64, two P4s (one of them a Celeron), a PII-350, and a Pentium-M 1.7 laptop that are all available for testing... and a 1.3GHz PIII-Celeron if I don't mind booting the Linux server :)


I would also investigate incremented pointer code rather than REP MOVSD as it may be more flexible in this situation.

Perhaps - REP MOVSD has always been doing pretty well for me, though, only beaten by the more fancy MMX versions (and especially the SSE-MMX versions with cache bypassing... which of course should only be used for massive transfers, or transfers where you're not going to work on the destination shortly afterwards).

I'll definitely try what Terje suggests, he usually knows his stuff, and it sounds like sound advice.
Posted on 2005-08-19 05:57:22 by f0dder
Unaligned writes are slower, because they involve 2 reads. For example, at addr 100h we have the text "ABCDEFGH". If we want to change the "CDEF" part to "XXYY", the cpu will:
r1 = read(100h)
r2 = read(104h)
r1.hiword = 'XX'
r2.loword = 'YY'
write(100h,r1)
write(104h,r2)
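
The same steps as a small C model, for anyone who wants to run it. This only simulates the mental picture above (memory that can be touched one aligned dword at a time); it does not claim to show how any particular CPU handles the store internally, and the function name is made up. It returns the number of aligned reads needed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Model of a 4-byte store to mem + offset when memory is only accessible
   in aligned dwords. Returns the number of aligned reads required:
   0 for an aligned store, 2 for a misaligned one. */
int store_via_aligned_dwords(unsigned char *mem, size_t offset,
                             const unsigned char val[4])
{
    assert(mem != NULL && val != NULL);

    size_t base = offset & ~(size_t)3;    /* enclosing aligned address */
    size_t k = offset & 3;                /* misalignment in bytes */

    if (k == 0) {
        memcpy(mem + offset, val, 4);     /* aligned: write only */
        return 0;
    }

    unsigned char r1[4], r2[4];
    memcpy(r1, mem + base, 4);            /* r1 = read(100h) */
    memcpy(r2, mem + base + 4, 4);        /* r2 = read(104h) */
    memcpy(r1 + k, val, 4 - k);           /* upper part of r1 = 'XX' */
    memcpy(r2, val + (4 - k), k);         /* lower part of r2 = 'YY' */
    memcpy(mem + base, r1, 4);            /* write(100h, r1) */
    memcpy(mem + base + 4, r2, 4);        /* write(104h, r2) */
    return 2;
}
```

With mem holding "ABCDEFGH", offset 2 and val "XXYY", the buffer ends up as "ABXXYYGH" after two reads and two writes, while the same store at offset 4 needs no reads at all.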
Posted on 2005-08-19 05:59:43 by Ultrano
Thanks for putting it into writing, Ultrano - that was my "intuitive idea why writing would be slower", but I didn't manage to get it into words. I only got 2 hours of sleep tonight, as I'm currently turning day and night upside down; will be on nightwatch shift fri-sat-sun for the next three months :)
Posted on 2005-08-19 06:03:36 by f0dder
That's exactly what my imaginative visualization (in my 1st post) was supposed to mean ;)
Posted on 2005-08-19 06:14:42 by ti_mo_n

That's exactly what my imaginative visualization (in my 1st post) was supposed to mean

Well, not exactly... reading isn't so bad, because everything is done in cache lines anyway. Yes, unaligned reads take a little more effort, but still not too bad. The real problem with unaligned writes is that an unaligned write becomes read-combine-write, whereas an aligned write is write-only.
Posted on 2005-08-19 06:19:17 by f0dder