hi, i know CopyMemory has been discussed a thousand times before, apologies in advance :D

the thing is, i'm looking for a version that will perform well on a Xeon 5150 processor, which is based on Intel's Core microarchitecture.
i guess there isn't any source code on the net yet that deals with optimizing for these processors? or even for general x86-64?

any pointers appreciated, not necessarily source code.

memcpy is likely to be a bottleneck in the CPU part of the app.

i'm also looking for info on any memcpy optimized for recent processors that might perform better than the very generic Win32 or libc functions.
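Before hand-tuning anything, it may be worth timing the library memcpy on the block sizes the app actually uses, to confirm it really is the bottleneck. A minimal sketch, assuming a hypothetical helper named `bench_memcpy`; `clock()` measures processor time and the iteration count is arbitrary, so treat the numbers as rough.

```c
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical micro-benchmark: time `iters` copies of a `len`-byte block
 * with the C library memcpy and return the throughput in MiB/s, or -1.0
 * on allocation failure, len == 0, or a run too short for clock() to see. */
double bench_memcpy(size_t len, size_t iters)
{
    unsigned char *src = malloc(len);
    unsigned char *dst = malloc(len);
    volatile unsigned char sink = 0;   /* keeps the copies observable */
    double result = -1.0;

    if (src != NULL && dst != NULL && len > 0) {
        memset(src, 0xAB, len);
        clock_t start = clock();
        for (size_t i = 0; i < iters; i++) {
            src[i % len] = (unsigned char)i;   /* change the input each pass */
            memcpy(dst, src, len);
            sink = dst[(i * 31) % len];        /* use the output */
        }
        double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
        if (secs > 0.0)
            result = (double)len * (double)iters / (1024.0 * 1024.0) / secs;
    }
    (void)sink;
    free(src);
    free(dst);
    return result;
}
```

For example, `bench_memcpy(8192, 100000)` times the 8 KB default case. Since the same block is copied over and over, it stays in cache, so this measures the warm-cache best case; running a hand-written routine through the same loop shows whether it beats the library at all.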


Posted on 2007-11-14 08:21:02 by HeLLoWorld
please please please please please? :D

Posted on 2007-11-15 18:13:25 by HeLLoWorld
If you are talking about x86-64, you have sixteen 64-bit GPRs. Also, MMX, SSE and SSE2 are guaranteed to be there.

Personally, I'd be happy just using as many of those bigger GPRs as I could... for most situations.
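A sketch of the "use the bigger GPRs" approach, assuming x86-64 and non-overlapping buffers; `copy_gpr64` is a hypothetical name. Each fixed-size 8-byte `memcpy` is the portable way to do an unaligned 64-bit load or store in C, and optimizing compilers such as GCC and Clang typically turn each one into a single register move.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy n bytes through 64-bit general-purpose
 * registers, four quadwords per iteration, then finish the tail byte by
 * byte. Assumes src and dst do not overlap (same contract as memcpy). */
void copy_gpr64(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n >= 32) {
        uint64_t a, b, c, e;
        /* Fixed-size copies avoid alignment and strict-aliasing issues. */
        memcpy(&a, s, 8);
        memcpy(&b, s + 8, 8);
        memcpy(&c, s + 16, 8);
        memcpy(&e, s + 24, 8);
        memcpy(d, &a, 8);
        memcpy(d + 8, &b, 8);
        memcpy(d + 16, &c, 8);
        memcpy(d + 24, &e, 8);
        s += 32;
        d += 32;
        n -= 32;
    }
    while (n > 0) {   /* 0..31 leftover bytes */
        *d++ = *s++;
        n--;
    }
}
```

Loading four registers before storing any of them gives the CPU independent loads to overlap; whether four is the right unroll depth for this chip is something to measure, not assume.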

However, if you want to go deeper with MMX/SSE based optimization, check out THIS link.
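Since SSE2 is always present on x86-64, a straightforward SIMD version moves 64 bytes per iteration through four 128-bit XMM registers. A minimal sketch with a hypothetical name, `copy_sse2`, using unaligned loads and stores so the caller needs no alignment guarantees. On Core 2-era CPUs unaligned 128-bit accesses are noticeably slower than aligned ones, so a tuned version would first copy bytes until the destination is 16-byte aligned and then use `_mm_store_si128`; `_mm_stream_si128` (non-temporal stores) only pays off for copies much larger than the cache, not for ~8 KB chunks that get read again right away.

```c
#include <stddef.h>
#include <string.h>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2_COPY 1
#endif

/* Hypothetical helper: 64 bytes per iteration through XMM registers,
 * unaligned loads/stores (movdqu). Assumes no overlap. Builds without
 * SSE2 fall straight through to the library memcpy. */
void copy_sse2(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
#ifdef HAVE_SSE2_COPY
    while (n >= 64) {
        /* Four independent 16-byte loads, then four stores. */
        __m128i x0 = _mm_loadu_si128((const __m128i *)(s));
        __m128i x1 = _mm_loadu_si128((const __m128i *)(s + 16));
        __m128i x2 = _mm_loadu_si128((const __m128i *)(s + 32));
        __m128i x3 = _mm_loadu_si128((const __m128i *)(s + 48));
        _mm_storeu_si128((__m128i *)(d), x0);
        _mm_storeu_si128((__m128i *)(d + 16), x1);
        _mm_storeu_si128((__m128i *)(d + 32), x2);
        _mm_storeu_si128((__m128i *)(d + 48), x3);
        s += 64;
        d += 64;
        n -= 64;
    }
#endif
    /* Remaining 0..63 bytes (or everything, without SSE2). */
    memcpy(d, s, n);
}
```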
Posted on 2007-11-15 18:43:38 by SpooK
I think only a few here have Core CPUs; the others are happy enough with their Athlons or P4s.
But I've noticed GCC always has its memcpy routines optimized (and processor-dependent). So grab the latest GCC source and look at the 32-bit and 64-bit code for Core, or get it from some Linux distro.
Posted on 2007-11-16 08:30:12 by Ultrano
HeLLoWorld: are you copying huge blocks, or small blocks? Do you need to read or modify the contents after copying? etc.

"The fastest memcpy" depends on what it's going to be used for. For a libc memcpy, I'd probably just rep movsd :)
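For reference, a `rep movsd` memcpy is only a few instructions: `rep movsd` moves ECX/RCX dwords from [ESI/RSI] to [EDI/RDI], and `rep movsb` handles the 0 to 3 leftover bytes. A sketch using GCC/Clang inline-asm syntax (AT&T mnemonics, where `movsl` is `movsd`); `copy_rep_movsd` is a hypothetical name, MSVC would need an `__asm` block instead, and other targets fall back to a byte loop.

```c
#include <stddef.h>

/* Hypothetical helper: memcpy built on the x86 string instructions.
 * Assumes no overlap. The x86 calling conventions guarantee the direction
 * flag is clear on function entry, so the copy runs forward. */
void *copy_rep_movsd(void *dst, const void *src, size_t n)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    void *d = dst;
    const void *s = src;
    size_t dwords = n / 4;
    size_t bytes = n % 4;
    /* "+D"/"+S"/"+c" bind EDI/RDI, ESI/RSI, ECX/RCX and read back the
       advanced pointers, so the second instruction continues where the
       first one stopped. */
    __asm__ volatile("rep movsl"
                     : "+D"(d), "+S"(s), "+c"(dwords)
                     :
                     : "memory");
    __asm__ volatile("rep movsb"
                     : "+D"(d), "+S"(s), "+c"(bytes)
                     :
                     : "memory");
#else
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
#endif
    return dst;
}
```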
Posted on 2007-11-16 08:38:07 by f0dder
thanks SpooK, in case i end up writing SIMD code myself, unlikely as that is!
thanks Ultrano, that's a very good idea... i guess gcc is GPL, but maybe i can get some inspiration.

yeah f0dder, i know it depends... i've already seen memcpy sources that first check the length and branch to different code...
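Those length-checking memcpy sources boil down to dispatching on the size. A sketch with three hypothetical tiers; the cut-offs 16 and 512 are placeholders, not measured values, and real ones would have to come from benchmarking on the target CPU. The library `memcpy` stands in for whatever bulk path (SSE2, `rep movs`) a tuned version would use.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Placeholder thresholds: tune by measurement, not by guessing. */
#define SMALL_COPY_MAX 16
#define MEDIUM_COPY_MAX 512

static void copy_bytes(unsigned char *d, const unsigned char *s, size_t n)
{
    while (n--)
        *d++ = *s++;
}

static void copy_words(unsigned char *d, const unsigned char *s, size_t n)
{
    /* 8 bytes at a time; the fixed-size memcpy keeps it alignment-safe. */
    while (n >= 8) {
        uint64_t w;
        memcpy(&w, s, 8);
        memcpy(d, &w, 8);
        d += 8;
        s += 8;
        n -= 8;
    }
    copy_bytes(d, s, n);
}

/* Hypothetical size-dispatching copy. Assumes no overlap. */
void *copy_dispatch(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    if (n <= SMALL_COPY_MAX)
        copy_bytes(d, s, n);   /* tiny: setup cost of anything fancier dominates */
    else if (n <= MEDIUM_COPY_MAX)
        copy_words(d, s, n);   /* small: register-sized moves */
    else
        memcpy(d, s, n);       /* large: stand-in for a bulk SIMD path */
    return dst;
}
```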

the blocks will likely be small, but the system has to be configurable...

i copy small chunks as i get them, up to a predefined maximum amount each time.

the specs say the typical default for this maximum is something like 8 KB, but the loop is likely to come around again and call memcpy once more: the new source exactly follows the previous source, and the new destination follows the previous destination, though not exactly (it gets aligned to some boundary).
Plus, between the block copies there is some "parsing" of the source each time... this doesn't seem very cache-friendly.
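The copy pattern described above can be sketched as follows: the source is consumed contiguously, at most `max_chunk` bytes per memcpy, and each destination chunk starts at the next multiple of `dst_align` after the previous one ends. The function name and the `dst_align` parameter are hypothetical (the specs only say "some amount"); alignment here is relative to the start of `dst`, which is assumed to be aligned itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical illustration: copy src_len bytes in chunks of at most
 * max_chunk, placing each chunk at the next dst_align boundary.
 * Returns the offset where the next chunk would start, or 0 if dst_cap
 * is too small for the data. */
size_t copy_chunks_aligned(unsigned char *dst, size_t dst_cap,
                           const unsigned char *src, size_t src_len,
                           size_t max_chunk, size_t dst_align)
{
    /* dst_align must be a power of two for the rounding mask below. */
    assert(dst_align != 0 && (dst_align & (dst_align - 1)) == 0);
    assert(max_chunk > 0);

    size_t src_off = 0, dst_off = 0;
    while (src_off < src_len) {
        size_t len = src_len - src_off;
        if (len > max_chunk)
            len = max_chunk;
        if (dst_off + len > dst_cap)
            return 0;
        memcpy(dst + dst_off, src + src_off, len);
        src_off += len;   /* source: next chunk follows exactly */
        /* destination: round the end up to the next dst_align boundary */
        dst_off = (dst_off + len + dst_align - 1) & ~(dst_align - 1);
    }
    return dst_off;
}
```

With `max_chunk = 100` and `dst_align = 16`, chunks land at destination offsets 0, 112 and 224. With the 8 KB default and a power-of-two alignment no larger than 8 KB, every full chunk already ends on a boundary, so only the last, partial chunk leaves padding behind it.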

In that case i don't really think this should be considered a big block copy...

i'm discovering MSVS through daily use... i didn't realize how much it 0wns :D ...
We must develop under MSVS6 (1998) (d'oh) even though the project is performance-oriented!
such is the industry... but we will also need to build it under VS2K5, and from what i've read here there are huge performance differences in the generated code... i'm discovering the debugger, the profiler... omg, that thing from '98 is already a Rolls-Royce! :D and 2k5 is waay better...

so long and thanks for all the shoes.
Posted on 2007-11-16 10:48:49 by HeLLoWorld