I am performing repeated reversal additions on a very large BCD number - several MB worth. The program accesses memory in a forward and reverse manner, and writes the sum to another location. Because of the complexity of the calculations everything is done in general purpose registers, not MMX, but for now that aspect of optimization is not in question.

I am using a 2.8GHz Pentium 4 overclocked to 2.95 GHz, with 512MB @ 800MHz FSB. (I guess 512 KB L2).

When operating from the cache only the program processes 75 GB (37.5 forward + 37.5 in reverse) and 37.5 GB data written (all to cache) in 60 seconds, so average 1.25 GB read/s and .625 GB written/s.
This is just to say that the program is entirely within the Pentium 4's optimal data transfer rate.

However, the program is memory bound when processing numbers larger than the cache because of lack of prefetch.

What is the optimum prefetch scheduling distance in this case please? And please explain how to calculate it.

Also, I have another program which uses MMX, and accesses twice as much memory (75 +75 Read +75 write) in 53 seconds = 2.82 Read/s + 1.41 write/s.

My aim is to have these programs operate at peak speed even when dealing with main memory, ie to cover the latency of accessing the memory with the computational instructions.

Please help in calculating prefetch scheduling distance.

Posted on 2005-01-26 21:18:24 by V Coder
The Intel Optimization manual has a chapter named "Mathematics of Prefetch
Scheduling Distance." I didn't read it but maybe it can help you.
Posted on 2005-01-27 11:53:05 by Dr. Manhattan
I also have a few things on my webpage about it. And I am writing an optimization tutorial on it, because it is a complex topic. I just haven't finished it yet. I wrote a program to compute optimum prefetch distance.

You also might want to tweak your read and write code. 1.25 GB/s can be made faster.
Posted on 2005-01-27 12:42:29 by mark_larson
Thanks. I read the Intel manual - and was no wiser. I hoped that if anyone has used prefetch they would be able to advise based on those figures.

I hope. Thanks again.
Posted on 2005-01-27 15:24:19 by V Coder
yea the Intel manual's appendix on proper prefetch distance blows chunks. They really need to re-do it. And it was what prompted me to start an article on it ( it's not a simple answer), and also write a program to calculate it.
Posted on 2005-01-27 16:47:58 by mark_larson
Thanks Mark. I looked at that page before. It's excellent.

BTW, based on the above system specs, can you make a general recommendation? Thanks.

Or can you give me an approximate lookup time & time to transfer a cache line?

Posted on 2005-01-27 20:13:03 by V Coder
On a P4 I generally use the "prefetchnta" instruction. It prefetches the data into the L1 data cache only. There is also a "prefetcht0" that fetches into all cache levels. It is slower because it has to update all cache levels. That is why I use prefetchnta. Both instructions fetch 2 cache lines. The cache line size on a P4 is 64 bytes. So they fetch 128 bytes. Intel's general recommendation which doesn't work well. Is to put your prefetch at the top of the loop, handle 128 bytes in the loop, and prefetch 128 bytes ahead. That's not the fastest way to do it.

mov ecx,1024
prefetchnta [edi+128] ; prefetch 128 bytes into the L1 cache before we need it.
add eax,[edi]
add eax,[edi+4]
add eax,[edi+8]
;... repeat this so you read 4 bytes at a time up to 128 bytes read.
add eax,[edi+124]

add edi,128 ;we dealt with 128 bytes.
sub ecx,128/4 ; subtract from loop counter the # of dwords handled
jnz looper

So what do you do to find the best way to do it? You can try moving the location of the prefetch instruction. Some locations in the loop will make it execute faster. The second parameter you can change is the prefetch distance. How far away you are getting the data. That is what my program does. It tries all combinations of prefetch distance with all combinations of the location of the prefetch instruction in a section of code. It times them all and tells you which is fastest. So start with something like I have above, and try different combinations manually. You can usually find something that is faster. In general if you do it right you should get anywhere from a 10% to 30% increase in speed.
Posted on 2005-01-27 22:50:02 by mark_larson