Yeah, see above, I check every iteration.

It takes 79 seconds on this machine...

I have one more obvious optimization, but I don't have the registers to do it... I'll need to kill ebp and esp probably!!! By the way, do you prefetch?

Let's see your program.
I didn't want to assume what you meant above. Mine takes 107 seconds, but it's built for the long numbers (mine drops to 54 seconds without prefetch and non-temporal stores :) -- that is why this little test is not good -- all the data is in the cache). I can't see how your program could beat mine on one million digits, but I'll have to test to pull my foot out of my mouth. :) The packed BCD version is a lot of work.
Posted on 2003-05-22 23:47:50 by bitRAKE
How could it drop to 54 seconds without prefetch? What are non-temporal stores?

Okay, post both versions, or pm?
Posted on 2003-05-23 00:32:38 by V Coder

How could it drop to 54 seconds without prefetch? What are non-temporal stores?
Non-temporal stores assume you don't want to read the data after writing it, and there is a big penalty for doing so. Also, they bypass the cache to store directly to memory - this is perfect when we are dealing with numbers larger than the cache size. Prefetch instructions still take up space in the pipeline, and if the data is already in the cache they are worthless.
Posted on 2003-05-23 00:45:44 by bitRAKE
I will go and read further on that.

Is your test program ready for testing?
Posted on 2003-05-23 00:56:28 by V Coder

Is your test program ready for testing?
No, I do not want to post a test program until the packed BCD version is complete. I hope that this weekend will afford me the time - I have been working a lot of overtime at work.
Posted on 2003-05-23 01:06:31 by bitRAKE
But what about the int32 & MMX based routine that is already very fast?

Did you put ISF on that? What does that look like?
Posted on 2003-05-23 06:56:05 by V Coder

What does that look like?
:) Okay here is what it looks like:
Posted on 2003-05-24 14:43:23 by bitRAKE
Right, well that looks hot! :alright:

Why are you waiting to submit it to Wade if it is working? Wade, Eric, and I are waiting with bated breath to see a new king at the top of the table. (I told them to look out for a coder from the win32asm programming forum who will blow away all previous records!) :grin:

I submitted a version that is somewhat faster (prefetch, plus optimized code - I will test on Monday to see which has the greater effect on the Pentium 4), and I am still tweaking the prefetching. I am also working on another optimization. I still need to include save-every-x-seconds, iteration counts, etc... Now, I'm slightly motivated. I obviously also need to get out of console mode, but this will not happen for some months. I still need to experiment with proper Win32 programming, main, etc...

:stupid:
Posted on 2003-05-24 18:29:25 by V Coder

When I first looked at Benjamin Despres' {GPL} code, I found it difficult to understand (I still cannot understand it), so optimizing it was out of the question. It adds pairs of 32 bytes and propagates the carry from dword to dword within MMX (well, we deal with the carry before it reaches the MMX code).
His loop consists of 12 movq, 12 movd, 24 integer instructions, 56 MMX instructions and a couple of jmps.

I decided to write a simple loop to add pairs of 8 bytes instead. My latest routine uses 2 movq [1 from and 1 to memory], 4 movd, 6 MMX instructions, 13 integer instructions (incl. 2 32-bit movs from memory and 2 adds from memory), and 2 branches.
multiply by 4={32 digit add}
8 movq, 16 movd, 24 MMX, 52 integer, 8 jmps...

Eric Goldstein's {PD} code adds pairs of 64 digit numbers, and does some maneuvering of the 128-bit xmm0-7 SSE2 registers to propagate the carry. Casual consideration of the first 13-line section suggests that trivial optimization would allow us to drop a few clocks, but it is clear that he sets up so many registers that much debugging would be needed. I will not try to understand his algorithm.
4 movdqu, 8 movq2dq, 16 movq, 6 movq from memory, 10 movd, 52 SSE2, 71 MMX, 34 integer, 16 integer from memory, 16 mov from memory, 12 mov to memory.

Consider my latest routine again.
multiply by 8={64 digit add}
16 movq, 32 movd, 48 MMX, 104 integer, 16 jmps...

Casual loop unrolling to eliminate jumps should make an interesting speed increase, but not necessarily significant.

I believe that a simpler routine has a greater chance of winning. Unfortunately, my theory is not yet borne out in practice! :grin:


My latest code adds pairs of 32 digit numbers using MMX. Like all my other routines, there are no processor specific instructions apart from MMX. (I also use prefetchnta but I test that routine separately. On my Pentium III notebook, it makes close to no difference.) Therefore my programs will run on any processor back to a Pentium MMX (which I also have).

8 movq, 22 movd, 16 mov, 31 mmx, 28 integer, 1 jmp
multiply by 2={64 digit add}
16 movq, 44 movd, 32 mov, 62 mmx, 48 integer, 2 jmps...
total=204
{compared to Goldstein: 245 instructions, and uses Pentium 4 specific instructions,
and my previous 216 instructions}
{Actually my 2nd submission last week had 90 instructions per loop = 180 instructions per 64 digit add,
and my 3rd submission had 98 instructions per loop = 196 instructions per 64 digit add, but it was about 12-3 % faster...}

Plus I rewrote the code to free up two additional registers - actually one, since I was not using ebp yet. Also, I tried to pair instructions and re-organize the code to eliminate data dependencies. {There is no try. Just do it!} This code should be even faster on the Pentium 4 than my last submission. I will test it on a Pentium 4 tomorrow before re-submission.

Again, I believe that a simpler routine has a greater chance of winning. Again, unfortunately, my theory is not yet borne out in practice! :grin:
Posted on 2003-05-25 10:13:04 by V Coder
bitRAKE, Have you inserted code to test for the palindrome condition every iteration as yet?

I do the comparison within MMX after reversing the digits with bswap, then leave the result in mm5...



pcmpeqd mm0, [esi] // 00s if not equal
pand mm5, mm0 // and adjust mm5


All told, 5 extra instructions - movd, movd, punpckldq (to copy two 32-bit numbers from eax and edx into MMX), pcmpeqd, pand... {Well, actually I have optimized this since...} At the end, mm5 is complemented and if non-zero, at least one digit did not match.

How do you do your test? Or does your program still only add, to maximize speed? {Because memory throughput is the bottleneck, even the comparison does not slow down my code (on a Pentium III).} Tell us...
Posted on 2003-05-25 11:33:30 by V Coder

Well, the average number of additions it takes to reach n digits is 2.41 n. Therefore, the average number of single digit additions for a number whose ultimate length is n is n^2*2.41/2


So, for my routine which reaches 171104 digits after 413280 additions in 103 seconds, that is 35356310640 single digit additions in 1066*1,000,000*103 cycles = 0.32 additions per cycle = 3.10 cycles per single digit addition.


BTW, my latest routine takes 78 seconds (on my Pentium III), that is 35356310640 single digit additions in 1066*1,000,000*78 cycles = 0.43 additions per cycle = 2.35 cycles per single digit addition.
Posted on 2003-05-25 12:00:12 by V Coder

{Your program, V Coder, takes} 83 seconds for 413280 iterations.

It really seems to start bogging down around 200000 digits, but I'll do a run to one million digits later. I hope you're checking for the palindrome during the addition.

Originally posted by bitRAKE
I didn't want to assume what you meant above. Mine takes 107 seconds, but it's built for the long numbers (mine drops to 54 seconds without prefetch and non-temporal stores -- that is why this little test is not good -- all the data is in the cache). I can't see how your program could beat mine on one million digits, but I'll have to test to pull my foot out of my mouth. The packed BCD version is a lot of work.


Now that's a thought. It starts slowing down in a major fashion after about 200,000 digits because we're running to external memory now, not cache...

How much cache is there on an Athlon (different flavours), Pentium III, Pentium IV (different flavours), and what are the memory access latencies and throughputs? I'll check it out tonight... or I may have the info somewhere...

This would determine where and how much the programs slow down on each processor!

Maybe your Athlon does not slow down further because you are already in main memory? Maybe my main memory is very slow (Pentium III notebook with integrated video - Intel 830 chipset... shared memory accessed continually by graphics controller?!?).

(Not that it matters for when I submit it to Wade for testing on his Pentium 4 desktop...)

How do I disable the video card? Will this produce a speed increase on my machine?

Let me test your 54 second code on my machine... That's about 1.62 clocks per single digit addition!
Posted on 2003-05-25 12:05:23 by V Coder
Sorry for the delay - I've just upgraded to an Athlon XP (Barton) 2800+ with dual DDR memory. :) Your program takes only 38 seconds to reach 413280 iterations! :) I need to install some more software and then I'll resume programming. I can't wait to do some more time tests with some more code - I should be able to find all kinds of ways to use the extra L2 cache.
Posted on 2003-05-26 13:56:06 by bitRAKE
Hey, but you didn't say how fast yours runs on that platform, and you didn't answer my questions!

Here's another. The actual number of 32 digit additions to reach 171104 digits/413280 additions is 1105721127. That works out to 35383076064 single digit additions performed (of which only about 35356310640 are significant, according to the previous calculations: the extra additions are additions of zeros, e.g. when a number has only 3 digits we still add 32...)

Anyway, my 1066 MHz Pentium III takes 78 seconds to reach there, which means .426 additions per cycle, equals 75.2 cycles on average to perform 32 digit additions. Now, my loop has 108 instructions... So, theoretically, with perfect pairing, it should take 54 cycles... Obviously not perfect pairing.

Furthermore, the 1.8 GHz Pentium 4 takes 78 seconds as well, which means 127 cycles per 32 digit add performing 108 instructions! Why is the Pentium 4 so slow, even when I have optimized to eliminate dependencies...

The world record is 49 seconds on a 2.4 GHz Pentium 4... Eric does 64 digit additions in 247 instructions in 54 seconds. = 235 cycles to perform 247 instructions!!!

Now, my older program runs on your Athlon at 2 GHz in 38 seconds = 68.7 cycles to perform 108 instructions (yeah I had not done any pairing stuff yet)...

To me, clearly the problem is the platform. If Wade switches to Athlon, all the programs would perform faster, and the ranking would change too!?
Posted on 2003-05-27 06:46:25 by V Coder

Hey, but you didn't say how fast yours runs on that platform, and you didn't answer my questions!

Now, my older program runs on your Athlon at 2.2 GHz I presume (2800+) in 38 seconds = 75.6 cycles to perform 108 instructions (yeah I had not done any pairing stuff yet)...

To me, clearly the problem is the platform. If Wade switches to Athlon, all the programs would perform faster, and the ranking would change too!?
I'm working to create a program to post.

My new CPU only runs at 2GHz. :) And on the old CPU my loop code only took 20 cycles to execute with the data in the cache - that is 20 cycles/16 digits. With memory latencies the loop time jumps to 70 cycles!

The ranks would change, but the Athlon XPs don't have SSE2 or support an 800MHz FSB. As we move to longer numbers the memory bus is more important - these small tests will mean very little to the overall speed of the code.
Posted on 2003-05-27 08:01:05 by bitRAKE

I'm working to create a program to post.

My new CPU only runs at 2GHz. :) And on the old CPU my loop code only took 20 cycles to execute with the data in the cache - that is 20 cycles/16 digits. With memory latencies the loop time jumps to 70 cycles!

The ranks would change, but the Athlon XPs don't have SSE2 or support an 800MHz FSB. As we move to longer numbers the memory bus is more important - these small tests will mean very little to the overall speed of the code.


Oh yeah, Eric's program would not work on the Athlon... It is Pentium 4 specific.

Actually, Wade's Pentium 4 is 2.4 GHz with DDR RAM (I think 266MHz DDR).

The 1.8 GHz Pentium 4 I use has 400MHz RDRAM I think. My 1066MHz Pentium III has a 133MHz SDRAM. They run my program about the same speed up to 413280. Then my notebook falls way behind.

How do we avoid the memory latencies? How long does it take to get 2 cache lines of data in and 1 cache line of data out? Okay, so when you prefetch, does this hold back the memory bus too? Shouldn't it allow you to get data ready before you need it?

To my mind in each loop we need the following to be cached

0. instructions
1. In order number
2. part of number to be reversed
3. prefetch
4. prefetch
5. number to be written
6. local variables

Is it that writing the data out is interrupting the prefetching, or other memory accesses?

Should I try to reposition the prefetch in my code?
Posted on 2003-05-27 09:05:21 by V Coder
Okay, stress!

Pentium III has a 32 byte cache line. Excellent, since my loop handles 32 bytes...
Pentium 4 has a 128 byte cache line. So I need to unroll a further 4 times???!!!
How large is the Athlon cache line?
Posted on 2003-05-27 09:21:12 by V Coder
Athlon cache lines are 64 bytes.

The goal of prefetch instructions is to notify the CPU that data is needed in the future without impeding the other instructions (they are like NOPs, but take decode/execution bandwidth). Prefetches are not needed for non-temporal stores (making this mistake would really slow the code down!).

The best we can hope for is to approach memory bandwidth using prefetch - which still falls short of the throughput of the algorithms! My new motherboard can do 2.5GB/s (actual test code :)). If I fetch 16 bytes in one cycle, how long must I wait to fetch another 16 bytes? Assuming the bus stays saturated and we prefetch far enough in advance and the code is in the cache (but the data is not); then we have to wait 18 cycles to read more data --- it does not matter what the code does -- the processor will not fetch uncached data faster! {Ignore the small numbers in the cache - my new CPU can read data at 32GB/s from the L1 cache.}

Now we know the distance between 16 byte reads (16 bytes because the CPU likes it that way). I'm ignoring the writing of data for now. To know how far ahead to prefetch data we need to know how long it takes data to travel from memory until it is available -- the above throughput calculations say nothing of the latency. A latency test (I should write my own latency test some day; http://www.cpuid.com/cpuz.php can test latency) shows it takes 175 cycles to load from uncached memory to the processor. This seems like a large number, but this time is to load a whole cache line (64 bytes).

A good analogy is to imagine putting little boats in a stream: due to the speed of the stream, your friend doesn't see the boats until 175 seconds later. The throughput is the maximum number of boats the stream can hold, whereas the latency is how long it takes for your friend to see a particular boat from the time you released it.

Now that we know these quantities all we need to determine how far ahead to prefetch is the loop timing. For example, if the loop takes 25 cycles then we need to let the processor know 7 loops back that we need the data for this loop.

Let's create some pseudo code for our algorithm:

; forward numbers
; reverse numbers
; new number

Okay, this is fairly basic, but note each loop reads/writes full cache lines (for Athlons). We know from the throughput figure above that reading 128 bytes takes 19*8 = 152 cycles (again, I ignore the writing of the data). Given this timing, it looks like we only need to prefetch one or two cache lines ahead for each read.

...
Posted on 2003-05-28 17:45:16 by bitRAKE

The goal of prefetch instructions is to notify the CPU that data is needed in the future without impeding the other instructions (they are like NOPs, but take decode/execution bandwidth). Prefetches are not needed for non-temporal stores (making this mistake would really slow the code down!).


Okay I understand a bit about prefetch... But, I have not made any headway with prefetching or non-temporal stores in my program: prefetchnta does not speed up the program any over just waiting for the hardware prefetcher to kick in; writing the data back with movntq slows down the program to 1/4 speed. I conclude just what you said weeks ago, we are at the speed limit of the memory...

I have tried prefetching forward 1 iteration and reverse, backward 1 iteration, also +/-64, up to 128, but no difference. (My loop adds 32 digits). You said on your machine

175 cycles to read 64 bytes from uncached memory;
152 cycles to read 128 bytes from cached memory.

1) How then can your CPU read 32 GB/s from cache?
2) Tell me where I am going wrong here (is this memory going real fast)?:
171104 digits, 413280 additions: = 171101 * 413280 / 2 single digit additions approximately = 35,356,310,640 additions (read, read, write). In doing so, I read two 32 byte blocks and write one 32 byte block 1,105,721,127 times = 106,149,228,192 bytes. On my 1066MHz Pentium III / 133MHz SDRAM system, assuming 34 CPU cycles (at 1066MHz) per 32 byte block, this should take 105 seconds... It takes 78 seconds, and the memory actually runs slower, at 133MHz, so obviously I am calculating something wrong. What is it?
Posted on 2003-05-28 23:28:08 by V Coder
What size data are you testing? Try at least 1 meg blocks.

Cache memory runs a lot faster than main memory and has lower latency (2/3 cycles). Just run a simple test of several movq mmN, [mem] instructions to test the read bandwidth of the cache - I can execute two every cycle.

175 cycles for a 64 byte read from uncached memory to be availible (latency);
152 cycles to read 128 bytes from uncached memory (bandwidth utilization).
Posted on 2003-05-29 00:36:45 by bitRAKE