PADDQ is only SSE2, iirc. MMX/SSE/2 only seem to help with the memory timing, and they take twice as many instructions. But memory is the bottleneck here - the streaming stores along with prefetch help to get closer to memory bandwidth - there is more than enough room for instructions on a fast processor.

Assuming we can read memory at 2,000MB/s, the Athlon can execute two 64-bit loads in one cycle = 16 bytes a cycle read. Therefor, a 125Mhz Athlon can max out the memory bus; or there are 14 of every 16 cycles going to waste on a 1Ghz Athlon that is just reading from memory! Do you see the problem now?

The algorithm requires 2 reads and 1 write for each itteration - this takes 2 cycles on my 1Ghz Athlon, but the processor has to wait on memory for another 30 cycles before it can try that again. This is why packed BCD is needed and we have plenty of cycles to unpack in the registers and propagate the carries in MMX registers. It's also why the algorithm I posted above is so slow.

Just found Eric Goldstein's code - here -.
If that is current fastest, it should be easy to beat.
Posted on 2003-05-10 20:23:41 by bitRAKE
You mad? Only an experienced Assembly coder would look at that gobbledygook and say its easy to beat.

Oops. I forgot. :grin:

But hey, are you going to optimize your code for P4 and leave me behind?

Posted on 2003-05-10 21:52:32 by V Coder

But hey, are you going to optimize your code for P4 and leave me behind?
Not until I get an Opteron. :) I don't have a P4 - I'm optimizing for Athlon.
Posted on 2003-05-10 22:37:09 by bitRAKE
First, one obvious optimization to your code.

You involve the memory (slowing you down) here...

; aligned low order digits
mov [ebp + 0*4], eax
mov [ebp + 1*4], ebx
mov [ebp + 2*4], ecx
mov [ebp + 3*4], edx
movq mm0, [ebp]
movq mm1, [ebp+8]

Check this thread...

To transfer a 64 bit number in edx:eax to mm0
My initial idea...

movd mm0, edx
movd mm1, eax
psllq mm0, 32
por mm0, mm1


movd mm0, eax
movd mm1, edx
punpckldq mm0, mm1

So, your code can be optimized to...

; aligned low order digits
movd mm0,eax
movd mm2,ebx
movd mm1,ecx
movd mm3,edx
punpckldq mm0, mm2
punpckldq mm1, mm3
Posted on 2003-05-10 23:24:46 by V Coder
V Coder, very good thinking. :)
Posted on 2003-05-11 00:02:09 by bitRAKE
I must confess I was impressed by your moving edx:eax to memory and picking it up with movq, {simplified below}.

mov [edi], eax
mov [edi+4], edx
movq mm0, [edi]

I found it so cute that so I used it instead of my original code {V Coder}

movd mm0, edx
movd mm1, eax
psllq mm0, 32
por mm0, mm1

I have now compared the speeds of both, as well as the latest (derived from) {Goldstein}

movd mm0, eax
movd mm1, edx
punpckldq mm0, mm1

Goldstein = 148 seconds to 413280 additions, 205 seconds to 200000 digits (483101 additions)
V Coder = 150 seconds, 206 seconds
bitRAKE = 193 seconds, 266 seconds.

There is about 1% difference between the first two, and 29% difference to the third.

Here is part of your memory bottleneck. You are writing to memory twice...two reads and two writes, instead of two reads and one write. By eliminating 1/4 of the memory accesses, it takes 1/4 less time.

Oops. I just realized that in my {bitRAKE} routine I used your move to memory twice... {V Coder} and {Goldstein} have just one of them corrected.
In {Goldstein 2} I converted both to punpckldq.... I expect {V Coder 2} will take about the same...
The slowest takes 70% longer than the fastest.

Goldstein 2 = 113 seconds to 413280 additions, 156 seconds to 200000 digits (483101 additions)
Goldstein = 148 seconds, 205 seconds
V Coder = 150 seconds, 206 seconds
bitRAKE = 193 seconds, 266 seconds.
Posted on 2003-05-11 00:11:33 by V Coder
V Coder, my working algorithm has a throughput of 4.7 cycles per digit uncached - even for 100 million digits it maintains this speed. This means over 200 million digits a second are added. The packed BCD code isn't working just yet.
Posted on 2003-05-11 01:28:01 by bitRAKE
To tell the truth, I have no clue how fast my working program is in cycles per digit.

How are you calculating yours? I just tried a calc, and came up with a ridiculous figure.

My loop has 15 pairs of instructions (well I hope they are pairing) to add a pair of 8 digits.

Also, since Wade is testing specifically to 413280 additions, how fast does your program reach there? Thanks.

PS. When I first looked at Benjamin Despres {GPL} code, I found it difficult to understand (I still cannot understand it), so optimizing it was out of the question. It adds pairs of 32 bytes and propagates the carry from dword to dword within MMX (well we deal with carry before it reaches the MMX code).
His loop consists of 12 movq , 12 movd, 24 integer instructions , 56 MMX instructions and a couple of jmps.

I decided to write a simple loop to add pairs of 8 bytes instead. My latest routine uses 2 movq [1 from and 1 to memory], 4 movd, 6 MMX instructions, 13 integer instructions (incl. 2 32-bit movs from memory and 2 adds from memory and 2 branches.
multiply by 4={32 digit add}
8 movq, 16 movd, 24 MMX, 52 integer, 8 jmps...

Eric Goldstein's {PD} code adds pairs of 64 digit numbers, and does some manouvering of 128-bit xmm0-7 SSE2 registers to propagate the carry. Casual consideration of the first 13 line section suggests that trivial optimization will allow us to drop a few clocks, however it is clear that he sets up so many registers that much debugging will be needed. I will not try to understand his algorithm.
4 movdqu, 8 movq2dq, 16movq, 6 movq from memory, 10 movd, 52 SSE2, 71 MMX, 34 integer, 16 integer from memory, 16 mov from memory, 12 mov to memory.

Consider my latest routine again.
multiply by 8={64 digit add}
16 movq, 32 movd, 48 MMX, 104 integer, 16 jmps...

Casual loop unrolling to eliminate jumps should make an interesting speed increase, but not necessarily significant.

I believe that a simpler routine has greater channce of winning. Unfortunately, my theory is not yet borne out in practice! :grin:
Posted on 2003-05-11 03:55:21 by V Coder

My memory is being cached to disk. How can I prevent this? It is slowing down the program to 40% of possible speed.

How can I ensure that none of the allocated memory gets swapped to disk? I am using Windows XP.

Okay, the routine should take roughly 4 times as long as the number of digits doubles... Well, the average number of additions it takes to reach n digits is 2.41 n. Therefore, the average number of single digit additions for a number whose ultimate length is n is n^2*2.41/2

So, for my routine which reaches 171104 digits after 413280 additions in 101 seconds, that is 35356310640 single digit additions in 1066*1,000,000*103 cycles = 0.32 additions per cycle = 3.10 cycles per single digit addition.

Anyway, the total number of digits added, as well as the total time should be proportional to the square of the number of digits. I reach 250,000 digits in 219 seconds, so I should reach 1,000,000 digits in 16 * 219 = 3504: less than an hour. But the disk is flashing, and the routine has only reached 800,000 digits after 7083 seconds (almost 2 hours), which is approx 772741102210 single digit additions in 1066*1,000,000*7083 cycles=.10 additions per cycle=9.77 cycles per addition.

The slowdown can only be the disk cache. WHY, WHY!!!!!!!!!!!!!!
Posted on 2003-05-11 04:28:20 by V Coder
V Coder, what is the speed of your memory? How fast can you do a memory copy? I don't think memory getting cached to disk is slowing you down. How much memory do you allocate? I allocate only twice as much memory as the number of digits I'm calculating to.

Present speed 2^20 digits calculated in 1'44"54.52
(over twice as fast!)

I calculate the add speed using a method similar to the program you posted. I add two 2^23 digit numbers and see how many cycles that takes. Then I divide by 2^23 to get the average cycle count. It is important to use a block of memory that will not fit in the cache. I've attached the program, can you tell be if it works? 2.66 cycles per digit is better than mine.
Posted on 2003-05-11 09:26:05 by bitRAKE

Also, since Wade is testing specifically to 413280 additions, how fast does your program reach there? Thanks.
I do not understand what is meant by 'additions' - does he mean itterations? ...or singal digit additions?

I see he means itterations. This is not a good measure of speed due to the high cache usage with such small numbers. For example, my algorithm takes 123,527ms to do 413280 itterations, but this is due to overhead added to increase speed for very large numbers.
Posted on 2003-05-11 09:29:07 by bitRAKE
Well, uhm. I was allocating, uhm, 128MB. :stupid:
Actually, the program was giving errors with the add until I increased the amount allocated... I did not check to reduce it until this morning. I reduced it to 12 MB and XP is still swapping to disk.
I will carry it lower again. Actually, I intend to allocate two heaps of memory, one for the source and the other for the destination.

Your program takes 69003232 cycles - avg 8.2258 clocks on my machine.

I reduced it to 2 MB and brainless XP is still swapping to disk on my 512 MB machine!!!!!

What's the command format for VirtualLock?
Posted on 2003-05-11 10:46:56 by V Coder
I'll work on that virtual lock thing later.

Jason, I've finished my Version 0.5 of my program. Ready for submission...

It finishes 413280 iterations in 92 seconds on a 1.8 GHz Pentium 4 desktop with Windows 2000, and 101 seconds on a 1066MHz Pentium III notebook with Windows XP Home.
It finishes 1000000 digits in 5173 seconds (1 hour 26 minutes) on the desktop, and 12113 seconds (3 hours 21 minutes) on the notebook. The tremendous slowdown of the notebook is probably due to a slower hard drive so that paging more severely affects results.
Posted on 2003-05-12 22:29:06 by V Coder
Perhaps you could make the program prepare for the disk swapping, by making it use the memory in some way while it awaits a key press to start the actual process. Disk swapping should not be included as part of the time of computation, although I am certain that most applications just allocate the memory and starting computing away. I would imagine most systems do not have any problem with disk swapping, especially with 512 MB of RAM. I know my system has no problem... It seems strange that you are having such difficulties with your computer.

Also, I am not responsible for submitting programs to Wade. Please find his email on his website (he changes it a lot, so I have no idea if I have the current email he uses or not), and submit it directly to him.
Posted on 2003-05-13 07:04:24 by Jason
I emailed it to Wade shortly after posting - I used an address I had earlier this year. I'll check his site again to see if it is current. I mentioned it as just as a status report though.

When I last wrote him, he indicated that others had problems debugging a slow down with Windows XP. I got some advice here on the forum on how to allocate the memory so that XP does not swap it out to disk. It seems that XP ignores the virtual memory settings and swaps everything that enters memory anyway. This happens straight through the program. I believe Windows 2000 on the other hand is very bug-free in relation to that. I should send it to my brother who has a slower notebook with 64MB RAM running Windows 2000, and see how that fares.
Posted on 2003-05-13 14:38:59 by V Coder
V Coder, I am running WinXP and I have not seemed to have this problem, yet. Then again, I am not keeping a close eye on what is happening with my applications since I am not currently involved in any speed record attempt, so I may be wrong. If WinXP continually swaps even after the program has made it very aware that it has full intentions of using all of the memory it has allocated, then it doesn't appear to be much you can do about it... however, that is very strange behaviour. I would consider it a bug in the OS.
Posted on 2003-05-13 15:03:26 by Jason
It is a security measure to ensure no program attempts to bring down the system by allocating all the memory - we pay in performance for this security. Only power users and system admins are able to execute programs that allocate huge amounts of memory that are locked into not being swapped!
Posted on 2003-05-13 16:22:38 by bitRAKE
How do I convince the computer that I am a power user or a system administrator? I am the only Administrator account set up on the machine.
Posted on 2003-05-13 21:18:30 by V Coder

How do I convince the computer that I am a power user or a system administrator? I am the only Administrator account set up on the machine.
That is enough to have permission - now you have to use VirtualAlloc and VirtualLock - like in my example in the other thread. WindowsXP is actually smart to function this way as it generally improves multitasking performance (or the appearance of such :) ). Windows can't help the fact that we want the program to have the whole machine to it self, we just have to work to ensure windows understands this.
Posted on 2003-05-13 21:24:53 by bitRAKE
I was thinking to uninstall XP, in favor of 2000...

bitRAKE, so how is your program now? (Yeah I suspect you have other things to do)... I know you can put in a Save file function in 30 minutes max. I have not tried putting in a Load function yet. Should be easy enough.
Posted on 2003-05-13 21:58:59 by V Coder