bitRAKE, so how is your program now? (Yeah I suspect you have other things to do)... I know you can put in a Save file function in 30 minutes max. I have not tried putting in a Load function yet. Should be easy enough.
Your program seems better than my current one, and my test program did not do well on your computer - almost twice the time as on mine. I want to spend my time on the packed BCD version of the algo and try to approach the memory read/write speed. This means hiding all the code in the memory latency. The timing isn't exactly comparable because I'm running 2^20 digits, not 1000000. I prefer to work in powers of two, but I'll do a run with 10^6 digits to see how mine fares. I have almost completed the save function - give me 30 minutes. :)

Also, I have the 1000000 and 2000000 files to ensure my algorithms work exactly. It is nice that my 1 GHz Athlon is matching a 1.8 GHz P4. ;)
Posted on 2003-05-13 22:20:17 by bitRAKE
Here's my save routine...
Again, my apologies. Only HLA... But I'm sure it can be understood.
I made sure that the lengths of the strings are as Wade required - as per the ISF format.

ebx = number of seconds
The digit loop counts backwards in ecx, since the data is stored in little-endian format...

Everything else should be straightforward. Two loops - one counting 70 characters (need to use js not je since I decrement first...)

fileio.openNew( "p196.txt" );

fileio.put(eax, "Checkpoint time ", (type uns32 ebx),nl );
fileio.put(eax, "Initial value: 196",nl);
fileio.put(eax, "Iteration: ", (type uns32 numa),nl);
fileio.put(eax, "Number of digits: ", (type uns32 numd));

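rev_: // start a new line of 70 digits
mov (70, edx);
fileio.put(eax, nl);

revcopy: // write one digit, most significant first
dec (edx);
js rev_;

mov ([esi+ecx],bl); // fetch the next digit
or ($30,bl); // convert it to ASCII
fileio.putc(eax, bl);
dec (ecx);
jne revcopy;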

fileio.close( eax );
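
For readers who do not follow HLA, here is a rough Python sketch of what the save routine does. It only mirrors the header strings visible in the routine above; the function name `format_checkpoint` and its parameters are made up for illustration, and the real ISF layout is whatever Wade specifies.

```python
def format_checkpoint(seconds, iteration, digits_le, width=70):
    """Build checkpoint text from a little-endian list of decimal digits.

    Hypothetical helper: the header strings are copied from the HLA routine,
    everything else is an assumption made for illustration.
    """
    lines = [
        f"Checkpoint time {seconds}",
        "Initial value: 196",
        f"Iteration: {iteration}",
        f"Number of digits: {len(digits_le)}",
    ]
    # Digits are stored least significant first, so walk them backwards
    # and turn each one into ASCII by OR-ing with 0x30.
    text = "".join(chr(0x30 | d) for d in reversed(digits_le))
    # Break the digit string into lines of at most `width` characters.
    lines.extend(text[i:i + width] for i in range(0, len(text), width))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # 13783 stored least significant digit first.
    print(format_checkpoint(12, 4, [3, 8, 7, 3, 1]), end="")
```

The 70-character line width is the same count the edx loop keeps in the assembly version.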

Why do you say, "Your program seems better than my current one"?
Surely you jest. It's in console mode...and the memory access has not been optimized... I still have not dedicated my mind to the task yet.

Why do you prefer to work in powers of two? What is the advantage?

All of my earlier programs used a BCD counter for easy printing... (I posted several versions of such code earlier along with speed comparisons. Put simply, a BCD counter does not affect speed much, but it makes for convenience.... Well, a 32-bit counter is real good too, since you can compare with actual values more easily - one compare).

How long did you take to do 1000000 and 2000000 digits?

If you're matching a 1.8 GHz P4 on your machine, you'll blow away the competition on a 2.4 GHz P4. (Yes V Coder, that's the point!!! :grin: )

I slept little over the weekend fine tuning where I reached, and now I am waiting to be inspired to handle the memory issues. (Actually, understanding MASM code is fairly difficult for me. :stupid: )
Posted on 2003-05-13 22:45:54 by V Coder
The time you have posted for 10^6 numbers is better than mine for 2^20. This might be because of the 48,576 extra digits, though. I prefer base two because it is known the algorithm takes four times as long for twice as many digits, and after all we are working in ASM - base two comes natural - I can count to 2^10 on my fingers. :grin: I downloaded the files from another web site as an external verification of my algorithm - the test program is very good, too.
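
To put a number on the "four times as long for twice as many digits" rule, here is a small Python sketch. It assumes total run time grows with the square of the digit count; `relative_cost` is a made-up name, and real timings will also depend on cache and paging behaviour.

```python
def relative_cost(digits_a, digits_b):
    """Estimated run-time ratio for reaching digits_a versus digits_b.

    Assumes time is proportional to digits squared (an approximation).
    """
    return (digits_a / digits_b) ** 2


if __name__ == "__main__":
    print(relative_cost(2 * 10**6, 10**6))  # doubling the digit count
    print(relative_cost(2**20, 10**6))      # 2^20 digits versus 10^6 digits
```

Under that assumption a 2^20 digit run should take roughly 10% longer than a 10^6 digit run, so the extra 48,576 digits are not negligible when comparing the two timings.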
Posted on 2003-05-13 23:24:35 by bitRAKE
You would be so disgusted, then. The test program I sent to Wade showed a counter that updated every 971 iterations. Initially it updated every 1000 iterations, and proceeded to the next line every 10000, and looped back to the first line every 100000. That is so easy to code... I don't even notice bits!!! ;)

I just read through the routine for VirtualAlloc. I will try to implement it tomorrow some time God willing. No brain work again tonight...

What site did you download the files from, please?
Posted on 2003-05-14 00:45:43 by V Coder
Here is my HLA implementation of VirtualLock...

// GetCurrentProcess; the handle is returned in EAX
w.GetCurrentProcess();
mov (eax, ebx);
// GetProcess Sizes
w.GetProcessWorkingSetSize(ebx, minsz, maxsz);

console.gotoxy(12, 1);
stdout.put("WorkingSetSize ", minsz, " ", maxsz);

add (memtoalloc, minsz); // Increase minimum
add (memtoalloc, maxsz); // and maximum memory
// to be kept active by virtual memory manager
// Reset Process Sizes + memtoalloc
w.SetProcessWorkingSetSize(ebx, minsz, maxsz);

stdout.put(" WorkingSetSize reset to ", minsz, " ", maxsz);

// Allocate RAM; EAX contains address
w.VirtualAlloc(NULL, memtoalloc, $1000, $4); // MEM_COMMIT, PAGE_READWRITE
test (eax, eax);
jne okay;
console.gotoxy(13, 1);
stdout.put("Unable to allocate memory needed.");
jmp exittt;

okay:
// Store memory address
mov (eax, memaddr);
add (8, eax);
mov (eax,high1);
add (memhalf,eax);
mov (eax,high2);

console.gotoxy(13, 1);
stdout.put("Memory allocated at ", memaddr);

// Lock the memory into physical memory
w.VirtualLock(val memaddr, memtoalloc);
test (eax, eax);
jne okay2;
console.gotoxy(14, 1);
stdout.put("Unable to lock memory.");
jmp exittt;

okay2:

The correct syntax should be w.VirtualLock(memaddr, memtoalloc);
But, the program responds unable to lock memory.

There is no such response if I use w.VirtualLock(val memaddr, memtoalloc);
However, there is no improvement in the paging characteristics.

val memaddr may be correct after all.

Yeah, too many status messages, carried over from testing...
Posted on 2003-05-14 19:51:07 by V Coder
What site did you download the 1000000 and 2000000 files from please?
Posted on 2003-05-14 19:59:03 by V Coder

What site did you download the 1000000 and 2000000 files from please?
Posted on 2003-05-14 21:46:31 by bitRAKE
Would you believe I never downloaded any files? I just assumed that my programs have been working since the total number of additions to reach a particular # of digits was constant... even when I changed algorithm to MMX, etc.
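
The consistency check described here (same number of additions to reach a given digit count) can be sketched in a few lines of Python. This uses Python's big integers rather than any of the assembly algorithms in this thread, `iterations_to_digits` is a hypothetical name, and it is only practical for small targets.

```python
def iterations_to_digits(start, target_digits):
    """Count reverse-and-add steps until the number has target_digits digits."""
    n = start
    count = 0
    while len(str(n)) < target_digits:
        n += int(str(n)[::-1])  # add the number to its own digit reversal
        count += 1
    return count


if __name__ == "__main__":
    # 196 -> 887 -> 1675 -> 7436 -> 13783
    print(iterations_to_digits(196, 5))
```

A matching count is a useful sanity check, but it does not prove every digit is right, which is why comparing against an independently produced file is the stronger test.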

I intend to get the 25 million file from Wade that he uses to test; however, he has not responded to my submission as yet.

I am not getting PADDQ to assemble in MASM32... What settings do I need please?
Posted on 2003-05-14 22:25:14 by V Coder
.XMM, iirc.
Posted on 2003-05-15 00:16:15 by bitRAKE
Back after a slight respite from programming.
Have you incorporated ISF load and save, or have you stopped the quest?
I have not got a response from Wade yet. I'll mail him again. Last time he responded quite quickly. Maybe it was lost in a mass purging of mail, or maybe his palindrome/196 filter did not pick it up.
Posted on 2003-05-20 02:05:56 by V Coder
I just emailed him regarding something on his webpage, and he replied immediately. Make sure that you have 196 in the subject line so he can see it - that's what he has told me to do in the past. As far as I know, he sorts through his junk mail manually.
Posted on 2003-05-20 05:33:16 by Jason
I was wrong. And he responded on the 18th... Wade added me to his list, and as I estimated, I am third on his list! Now, once I put in a Load function, which I have put off for more than a week now, he'll be able to update his records...

I know I can speed it up a bit if I unroll the loop, but I'm trying to get rid of the caching problem first.

I just was not aware that I had received his reply since the 18th...

I'm not sure whether he responded before and I deleted his mail in a mass purging (manual)...
Posted on 2003-05-21 04:32:43 by V Coder

Have you incorporated ISF load and save, or have you stopped the quest?
I have a save function complete, but have taken a break for primes. :)
Posted on 2003-05-21 08:42:51 by bitRAKE

I have a save function complete, but have taken a break for primes. :)

That's good, so I can have my moment of glory. :grin:

I'm listed on Wade's page. Third place, as I expected. But this morning, just before I went to work, I unrolled the loop, so I add 32 digit pairs per loop instead of 8. In addition, I moved a missed branch out of the loop to just after it by changing the logic. All in all I got a 20% speed increase on my Pentium III up to the first checkpoint. That should be enough to comfortably reach second place, and only Wade can say if I would reach 1st place. At least until you submit... :rolleyes: :)

You're probably thinking, V Coder, you've got to be kidding! You never unrolled before?!! Well no! I'm with stupid! :stupid: -- And I was just trying to develop algorithms...

My final optimization would be to align the loop on a 16-byte boundary.

I'm busy... I hope to get back to programming the Load function by month end.

Tell me when you have moved on to 64 bit packed BCD...
Posted on 2003-05-21 14:09:15 by V Coder
V Coder, bravo! Do you have a new test proggie? So we can check the timing on other machines. A raw speed test for large numbers as well as a palindrome test would be cool.

I'm making all the loads aligned as well... :)
Posted on 2003-05-21 14:21:38 by bitRAKE
Here's my program as submitted... Well, I made one last change last night (Wade does not have it yet), which corrects a bug when the time crosses 11:59:59 pm. Since the time starts back at 0 seconds, when I subtracted the initial time from the current time I got a negative number, which when printed as an unsigned 32-bit number was 4 billion or whatever... Fixed it by adding 86400 seconds...
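
The midnight fix amounts to the following, shown as a Python sketch rather than the actual HLA. The name `elapsed_seconds` is made up, both arguments are assumed to be seconds since midnight, and the sketch only copes with a single midnight crossing.

```python
SECONDS_PER_DAY = 86400


def elapsed_seconds(start, now):
    """Elapsed time between two seconds-since-midnight readings."""
    diff = now - start
    if diff < 0:
        # The clock wrapped past midnight, so add one full day back.
        diff += SECONDS_PER_DAY
    return diff


if __name__ == "__main__":
    # Started 10 seconds before midnight, read 10 seconds after.
    print(elapsed_seconds(86390, 10))
```

A run that lasts longer than 24 hours would still report the wrong time with this approach; counting days separately would be needed for that.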

(It is also a bit faster than the one I submitted to get listed. I sent the faster one to Wade, and he said it takes 60 seconds instead of 65... but my tests suggest it should have taken <50, since it is 20 percent faster than the first routine on my machine...)

How long do the two checkpoints 413280 & 1000000 take on your machine?

What does your program look like?
Posted on 2003-05-22 21:51:43 by V Coder
What am I supposed to do with the object file? :confused:
Posted on 2003-05-22 22:21:32 by bitRAKE
What object file? :rolleyes:

Oh, that one.

Well, you mean you can't see how fast the program runs from the .obj? Tsk!!!

(Sorry. Fixed.) :grin:

Now, this code will be slower than yours. But bear in mind that I check for palindrome every iteration... How fast does this program go on your machine?
Posted on 2003-05-22 22:34:01 by V Coder
83 seconds for 413280 iterations.

It really seems to start bogging down around 200000 digits, but I'll do a run to one million digits later. I hope you're checking for the palindrome during the addition.
Posted on 2003-05-22 23:00:47 by bitRAKE
Yeah, see above, I check every iteration.
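
As a reference for what "one iteration plus a palindrome check" means, here is a Python sketch of a single reverse-and-add step on a little-endian digit list. It is not the code from either program: the name `reverse_and_add` is hypothetical, and this simple version checks the result after the addition rather than folding the check into the add loop.

```python
def reverse_and_add(digits):
    """One reverse-and-add step.

    digits: decimal digits, least significant first.
    Returns (new_digits, is_palindrome) for the sum.
    """
    n = len(digits)
    out = []
    carry = 0
    for i in range(n):
        # Digit i of the number plus digit i of its reversal.
        s = digits[i] + digits[n - 1 - i] + carry
        out.append(s % 10)
        carry = s // 10
    if carry:
        out.append(carry)
    # Compare the result against its own mirror image.
    is_pal = all(out[i] == out[-1 - i] for i in range(len(out) // 2))
    return out, is_pal


if __name__ == "__main__":
    print(reverse_and_add([6, 9, 1]))  # 196 + 691
    print(reverse_and_add([6, 5]))     # 56 + 65
```

Checking every iteration costs an extra pass in this form, which is the overhead being discussed; an early exit on the first mismatched pair keeps it cheap in practice, since most results differ near the ends.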

It takes 79 seconds on this machine...

I have one more obvious optimization, but I don't have the registers to do it... I'll need to kill ebp and esp probably!!! By the way, do you prefetch?

Let's see your program.
Posted on 2003-05-22 23:09:02 by V Coder