Well, I don't know if someone has already pasted code like this (it's kinda hard to find when searching),
but here we go:
a really fast array copy.


ARR1 DW 4001 DUP(?)
ARR2 DW 4001 DUP(0)
...
...
...
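MOV CX, SIZEOF ARR1 ; byte count (LENGTHOF would give the element count)
MOV SI, OFFSET ARR1 ; we can use LEA, no need though
MOV DI, OFFSET ARR2 ; same here (ES must point at ARR2's segment)
CLD ; let's make SI/DI go up (STD for reverse)
SHR CX, 1 ; bytes -> words, odd bit goes into the carry flag
REP MOVSW ; word copy, si+=2/di+=2 (flags are left untouched)
JNC FINISH ; was the length uneven?
MOVSB ; yes it was, copy the last byte
FINISH: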
...
...
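
For readers more at home in C, here is a rough sketch of the same idea: halve the byte count, copy 16 bits at a time, and finish with one byte if the count was odd. The name `copy_words` is made up for this illustration; it is not part of the original post, and a real program would normally just call `memcpy`.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy n bytes two at a time (like REP MOVSW),
   then one trailing byte if n is odd (like the final MOVSB). */
static void copy_words(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t words = n >> 1;   /* SHR CX,1: number of 16-bit moves */
    int odd = (int)(n & 1);  /* the bit SHR pushes into the carry flag */

    while (words--) {
        uint16_t w;
        memcpy(&w, s, sizeof w);  /* alignment-safe 16-bit load */
        memcpy(d, &w, sizeof w);  /* alignment-safe 16-bit store */
        s += 2;
        d += 2;
    }
    if (odd)
        *d = *s;             /* the leftover byte */
}
```

Calling `copy_words(dst, src, 7)` would do three 16-bit moves and one single-byte move.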


Maybe someone will find a use for it :)
Posted on 2002-12-30 15:57:14 by wizzra
wizzra,

Just a suggestion: do the bulk of the data with MOVSD and the uneven remainder with MOVSB; it should be faster.


cld
mov esi, [Source]
mov edi, [Dest]
mov ecx, [ln]

shr ecx, 2
rep movsd

mov ecx, [ln]
and ecx, 3
rep movsb

Just treat the two arrays as addresses with a known length; you are not bound to use the same data size as the array members.
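
A hedged C rendering of the snippet above may make the arithmetic easier to follow: `len >> 2` is the number of 4-byte moves and `len & 3` is the number of leftover bytes. The function name `copy_dwords` is invented for this sketch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper mirroring "rep movsd" followed by "rep movsb". */
static void copy_dwords(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t dwords = len >> 2;  /* shr ecx, 2 */
    size_t tail = len & 3;     /* and ecx, 3 */

    while (dwords--) {         /* rep movsd */
        uint32_t v;
        memcpy(&v, s, sizeof v);
        memcpy(d, &v, sizeof v);
        s += 4;
        d += 4;
    }
    while (tail--)             /* rep movsb */
        *d++ = *s++;
}
```

The element type of the arrays does not matter here: an array of five 16-bit values is just 10 bytes, so it is copied as two dwords plus two tail bytes.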

Regards,

hutch@movsd.com

NB: Some of the guys in the source code forum in the past have posted even faster ways if you don't mind using MMX or SIMD instructions to do memory copy.
Posted on 2002-12-30 16:11:08 by hutch--
hi Hutch :-)

I guess MOVSD will be much faster in win32asm ;-)
but the one above was referring to 16-bit ASM.
But heck, now we have 2 variations :-)
Posted on 2002-12-31 01:00:31 by wizzra
Sorry for interrupting you guys (yet again) with my silly questions.

It's probably the asm code, most of which I can't seem to understand.
How is this faster than normal array copying?
Could anyone please explain in simple terms why this algo is faster?
Posted on 2002-12-31 02:45:41 by clippy
faster = means fewer instructions for the CPU to perform.

The algo above is really fast [32/16, depending on what you code in]
because the actual copy (byte/word from arr1 -> arr2) is done via 1 instruction!!
Think about an array of size 1000h ...
1000h bytes / 2 (2 = word copy) = 800h word moves, which gives us really fast CPU time.

REP MOVSW is the instruction that performs the copy.

MOVSW: copy one word, destination <- source
  ES:[DI] <- DS:[SI]
  ADD SI,2
  ADD DI,2

REP: repeat while CX<>0, decrementing CX each time (this prefix only works with string instructions)

CLD: clear the direction flag (DF); this ensures SI/DI count upward.

SHR CX,1: halves the count in CX; the bit shifted out lands in the carry flag, so the carry tells us whether the length was uneven. If it was, we copy everything up to the last byte with REP MOVSW and then do one MOVSB for it.

Well, as you can see, it is super fast in 16-bit asm, and even more so under win32 :-)
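
To make the register bookkeeping concrete, here is a toy C model of what REP MOVSW does to CX, SI, DI and memory. It is only a sketch: the `struct regs` layout and the function name `rep_movsw` are invented for illustration, memory is treated as one flat buffer (no DS/ES segments), and the caller has to make sure the buffer is large enough for the offsets used.

```c
#include <stdint.h>

/* Toy register file; the names are illustrative only. */
struct regs {
    uint16_t cx;  /* repeat count */
    uint16_t si;  /* source offset */
    uint16_t di;  /* destination offset */
    int df;       /* direction flag: 0 after CLD, 1 after STD */
};

/* Model of REP MOVSW over a flat byte buffer. */
static void rep_movsw(uint8_t *mem, struct regs *r)
{
    while (r->cx != 0) {                   /* REP: repeat while CX <> 0 */
        uint8_t lo = mem[r->si];           /* MOVSW: read one word ...   */
        uint8_t hi = mem[(uint16_t)(r->si + 1)];
        mem[r->di] = lo;                   /* ... and store it at [DI]   */
        mem[(uint16_t)(r->di + 1)] = hi;
        if (r->df == 0) {                  /* CLD: offsets go up   */
            r->si = (uint16_t)(r->si + 2);
            r->di = (uint16_t)(r->di + 2);
        } else {                           /* STD: offsets go down */
            r->si = (uint16_t)(r->si - 2);
            r->di = (uint16_t)(r->di - 2);
        }
        r->cx--;                           /* REP decrements CX each pass */
    }
}
```

With `cx = 3`, `si = 0`, `di = 16` and `df = 0`, the model copies six bytes and leaves `cx = 0`, `si = 6`, `di = 22`.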
Posted on 2002-12-31 06:30:02 by wizzra
Oh i see. Thanks:)

Btw, how does normal array copying work in compilers?
Don't they use this optimization too?

Also, if I use this as a function in my C code, what type of arrays will it work on?
ints, longs, chars or custom data types??? :confused:
Posted on 2002-12-31 07:26:59 by clippy
"REP MOVSD" as an instruction isn't actually that fast!
When you're copying small chunks of data it is in fact quite slow. When you start moving blocks of data the size of the cache around, the processor switches to a special mode where it picks up and drops cache-sized blocks, and this is very fast. But when ECX is loaded with a value lower than the size of the cache, this special mode isn't used, and things slow down.
It's faster for these smaller block sizes to write your own loop and use EAX. If possible, though, use an MMX register; the 64-bit copying is much faster. As far as I can tell there is no benefit in using the SSE register set, as the memory subsystem doesn't go fast enough!

I guess there may also be bonuses to be had from ensuring that at least one of the two sections of memory being copied from/to is aligned properly for the processor. Although I've never tested that.
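
One way the alignment idea could look in C, purely as an untested-for-speed sketch: copy single bytes until the destination is 4-byte aligned, then copy 32 bits at a time, then mop up the tail. Aligning the destination (rather than the source) is an arbitrary choice made for this example, and `copy_aligned_dst` is a made-up name.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: byte copies until dst is 4-byte aligned,
   then 32-bit copies, then the remaining 0..3 bytes. */
static void copy_aligned_dst(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* head: at most 3 bytes to reach a 4-byte boundary in dst */
    while (len > 0 && ((uintptr_t)d & 3) != 0) {
        *d++ = *s++;
        len--;
    }
    /* body: dst is now aligned; src may still be unaligned */
    while (len >= 4) {
        uint32_t v;
        memcpy(&v, s, sizeof v);
        memcpy(d, &v, sizeof v);
        s += 4;
        d += 4;
        len -= 4;
    }
    /* tail: whatever is left */
    while (len > 0) {
        *d++ = *s++;
        len--;
    }
}
```

Whether this actually helps depends on the processor, so it would need to be measured rather than assumed.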

Mirno
Posted on 2002-12-31 07:58:50 by Mirno
I don't really know what junk a compiler adds (each has its own junk)..
but an array is an array..
If you pass the start address of the array, you should also know its type..
int = 2 bytes [4 in win32]
long = 4 bytes
.....
.....
Do whatever you like as long as it doesn't exceed the array's length.
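
In C terms this boils down to passing a start address plus a byte count, which works for any element type. A small sketch, where the standard `memcpy` stands in for the assembly routine, and `copy_array` and `struct point` are names invented for the example:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical custom data type, just to show it copies like any other. */
struct point {
    int x;
    long y;
    char tag;
};

/* Copy `count` elements of `elem_size` bytes each. The routine only sees
   addresses and a byte count; the element type is irrelevant to it. */
static void copy_array(void *dst, const void *src,
                       size_t count, size_t elem_size)
{
    memcpy(dst, src, count * elem_size);
}
```

For an `int a[3]` the call would be `copy_array(b, a, 3, sizeof a[0])`; letting `sizeof` supply the element size avoids hard-coding 2 or 4 bytes per `int`.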
Posted on 2002-12-31 07:59:11 by wizzra
I must point out that fewer instructions does not in any way mean that the algo will be faster. In fact, optimization for speed *usually* requires more instructions. A knowledge of the pipeline will help quite a bit.
Posted on 2003-01-04 02:57:16 by Asm_Freak
wizzra: that's about as fast as you'll get - for a 286 ;)
As Mirno points out, RISC-style copy loops are faster on anything >= Pentium. Nowadays, it's all about making good use of your cache. To get anywhere close to your maximum memory bandwidth, you need to prefetch and read/write 64 bytes at a time (via MMX regs and movntq, so as not to pollute the cache).

I think this way's pretty much the best you can do for Athlons:


for each 8 KB block:
    manual prefetch - touch one address in each cache line, *backwards*
    (we can't have the CPU generate prefetches of its own, now ;))

    for i = 1 to 128
        load 64 bytes
        nt store 64 bytes

sfence (write combining is cool, but don't forget this ;))


_3x faster than rep movsd_!! (~2000 MB/s vs. ~650 MB/s)
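
The shape of that loop can be sketched in portable C. This shows the structure only and is not the tuned routine: plain `memcpy` stands in for the MMX loads and non-temporal stores, the 64-byte cache-line size is an assumption, the spot where `sfence` would go is only marked in a comment, and the name `block_copy` is invented here. It makes no claim to reproduce the speeds quoted above.

```c
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE 8192  /* 8 KB block, as in the outline */
#define LINE_SIZE  64    /* assumed cache-line size */

/* Hypothetical helper showing the block-prefetch structure only. */
static void block_copy(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (len >= BLOCK_SIZE) {
        /* manual prefetch: read one byte per cache line, backwards;
           the volatile pointer keeps the compiler from dropping the reads */
        const volatile unsigned char *p = s;
        for (size_t off = BLOCK_SIZE; off >= LINE_SIZE; off -= LINE_SIZE)
            (void)p[off - LINE_SIZE];

        /* 128 x 64 bytes = 8 KB; a tuned version would use
           non-temporal stores here instead of memcpy */
        for (size_t i = 0; i < BLOCK_SIZE / LINE_SIZE; i++)
            memcpy(d + i * LINE_SIZE, s + i * LINE_SIZE, LINE_SIZE);

        s += BLOCK_SIZE;
        d += BLOCK_SIZE;
        len -= BLOCK_SIZE;
    }
    /* a version using non-temporal stores would issue sfence here */

    memcpy(d, s, len);   /* remainder smaller than one block */
}
```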

HTH
Jan
Posted on 2003-01-06 05:52:43 by Jan Wassenberg