V Coder, I don't think the carries propagate correctly in your code. You have to store to the least significant digits first and carries propagate to the more significant digits - are you doing this? I translated your code as a starting point for my algo, and it had this problem - so I reworked it from scratch. Tracing through the code in a debugger is a real plus to work this stuff out - my program doesn't do any output - I just look at the numbers in a debugger. :)
Posted on 2003-05-05 01:53:00 by bitRAKE

My intention was to write a routine that was faster on my Pentium MMX 166 than the 1 day 18 hours on a Pentium II 266.
http://www.jasondoucette.com/worldrecords.html

Sure beats ( this ) guy's 1 day and 18 hours!
I noticed some traffic to my web site from this board, so I thought I would check it out to see what was going on.

bitRAKE, yes indeed you have beaten my time. But you should be aware of the specifics of my code and hardware so you know exactly what you are comparing your program against. I have a feeling your tools and hardware allow for a better application, as they probably do not have as many restrictions as the tools I programmed that code with. I did explain most of this on my page http://www.jasondoucette.com/worldrecords.html, but I will explain it again here:

(edit: I forgot to mention for those who may not have read the whole thread that I used a Pentium II 266 vs. bitRAKE's 1000 Athlon) The compiler I used was Turbo Pascal 7.0, which comes with the integrated BASM assembler. In its most advanced mode, it can accept 286 instructions. You have to use special coding to be able to access the extended 32-bit registers available on the 386 (you must know the machine code that a real assembler creates, and insert them yourself). In order to be able to access more than 64 Kbytes of data, I had to use 'flat mode' (as Turbo Pascal is not a protected mode compiler), which enables linear 32-bit addressing. To use it, you must set ES to 0, and then you can access 32-bit linear memory like so: mov al, byte ptr es:, with eax being the pointer. This is far slower than just using regular protected mode, but I did not have the tools to use it. In fact, I was not even supposed to be able to get 32-bit addressing on my machine, so getting 'flat mode' to work was a great success in that I can actually use more than 1 megabyte of memory without using EMS (which would be far slower still). Also, as far as hardware is concerned, I do not think the P-II had very much cache at all (I recall the Athlon's that were out at the time had 8 times as much or more, for both L1 and L2). As you are probably aware, the amount of cache is critical for the computations of the first million digits, as RAM need not be accessed for a considerable amount of the time if you have lots of cache - Athlon was really outdoing Intel at this point in time. Also, the program accidentally wrote some 50+ megabytes of data to my considerably slow HD (non SCSI) for the first million. I know I stated that I would re-test the time with this bug removed, but I never did. I guess I knew that my tools were so outdated and insufficient that it was no surprise someone easily made a faster program (like Istv?n Bozsik, who programmed his algorithm identical to mine with current tools).

The reason I have 'published' my results on my webpage, despite my 'primitive' program is because: even though my program was not exactly the fastest any one could make, it was the fastest at the time, and was basically impossible to make it any faster with the tools that I had. The original program that took 196 to 1 and 2 million digits used % and / operations when a simple compare and subtract would do. It was obvious that I could code it much faster. After doing so, I realized that my program on my machine was still far faster than this unoptimized program on supercomputers of yesteryear. This is why I took it to 13 million digits before I lost access to a computer, and passed it on. I really wished that I could have messed with this more with superior tools. I have these tools now, but I do not have the time due to current projects.

Good luck with your programs.
Posted on 2003-05-05 09:45:12 by Jason
Jason, I do not mean to downplay you effort on this endevor. I am aware that my algorithm is behind the curent fastest even with my current tools of MASM, RadASM, and OllyDbg. I used to program on 286's with DEBUG and a set of batch files - I know what your talking about. :)
Posted on 2003-05-05 10:27:45 by bitRAKE
That's okay. I did not think you were downplaying my effort. I just thought I would mention the circumstances of my program in case people are using it as a measuring stick, which is what appears to be the case. :) I realize that some people reading this thread may not realize the difficulties of even getting a program to work with improper tools, let alone be fast, because of the work-arounds that have to be made.

But, it is always nice to compare as long as you know what you are comparing against. I know that I compared my program against the original from John Walker, but I also understood that back then memory was more of any issue than speed, as I can see that the code was specifically made to store two digits per byte, which does nothing but save memory and make the program slower (although I did notice that he could have used no more memory and sped up the program considerably by not using divide/modulus operations).

I beleive that Ian Peter was the first person to use BCD instructions to make a super-fast reversal-addition program, but his program has never been tested against the others, since it ran in Lunix, which Wade doesn't have. I wish he would participate in your discussions, because he did most of the stuff people are trying today years ago, and has never been properly credited. He reached 1 million in 5 hours on a 500 MHz athlon (not sure how much cache it had).

It should also be noted (if it hasn't been already) that beating the speed record to 1 million is different from having the fastest general 196 app, because the use of cache is different in both aspects. I think this is clearly shown on Wade's Software Comparison page.
Posted on 2003-05-05 11:49:28 by Jason
The time I posted above for 2^20 digits is without using prefetch, so the cache is actually slowing the algorithm down - only during the middle 256k/3 is any accessed data in the cache. CPU speed has really outpaced memory speed and so the extra Mhz just go to waste - we can see this in the timings for cached data (2.66 cycles per digit). Maybe with a 400Mhz memory bus my 1Ghz CPU would really be used.

Initially my goal was to minimize the cycles per digit, but this goal will change with the next version to get the right balance between CPU speed and memory latency. Athlon's are quite good on MMX speed, so a packed BCD version is where I'm headed.
Posted on 2003-05-05 12:37:28 by bitRAKE
To speed-up the program, you should try to store the number in normal and reverse order as follows:

4 bytes containing the number in the normal order
4 bytes containing the number in reversed order
etc...

This will perhaps reduce the cache hits.
Of course, reversing the number can be done during the check_for_palindrom phase.

JC
Posted on 2003-05-05 14:31:17 by MCoder

4 bytes containing the number in the normal order
4 bytes containing the number in reversed order
etc...
I like this idea, but not for the reason you mention. Currently, two registers are used to point to two buffers of equal size - source and destination are exchanged each interration. With interleaving only one register is needed, and it saves a prefetch instruction.
Posted on 2003-05-05 14:57:44 by bitRAKE
Jason

I think it is great to have provoked your interest across here (by engaging bitRake in the programming).

If you read my earlier posts, you would note that I started working on the palindrome quest in January 1998 and wrote DOS unreal mode code entirely in assembler using NASM. I stopped working on it simply because I thought my program (a DOS unreal mode version, which on a Pentium 166 MMX machine works faster than yours) was too slow, only to realize (when I accidentally came across Wade's work) that others picked up the quest later. You are recognized as a pioneer of the quest, having started with a slower program than mine on a faster machine, 19 months later... By that time my program world have long crossed 10 million.

My interest in optimizing my program for the Windows environment, therefore, is not to challenge your accomplishment ("Too late, too late," shall be the cry!), but to challenge myself.

I have been dabbling in assembler since the Commodore 64 and 6502 (mid 80s). Those were the days of the fast loader and quick loader, and Vorpal, which required a specially formatted floppy to load programs faster than normal. (The C64 floppy drive [1541] was realllllllll slow.) I disassembled the code of all of the routines to see how they worked. I was working on my own routine which involved caching the data in the Drive internal memory (1 KB - well 2 KB, but as I recall only 1.25 KB were available for data) when one chip in the C 64 spoilt. By the time I got a replacement, another spoilt, possibly in the drive... Then I carried both for repair, and never got back either. I still have the Commodore 1541 Troubleshooting and repair guide, The Anatomy of the 1541 (which contains a disassembly of the drive internal ROM), and the assembly routines for the C 64 and 1541.

I bought a book which had a graphics programming environment for the C 64, entered the binary code (or assembler code, can't remember now - at that time I could understand the hexadecimal code as easily as the assembly instructions they represented). I found that a paint (fill) routine was too slow, and sped it up by a factor of 4. I optimized the dot routine to calculate the location in memory for each pixel.

All of this might have gone beyond hobby, but when I sent the C64 and 1541 to fix I was already in Dental School. When I started work on the palindrome quest I was already working, and I have relegated my computer skills to hobby status. I decided this year to go beyond...

I want to release a program now that 1) is faster than yours was (pointless to go slower really - and I have already passed it anyway); 2) can be listed on the software comparisons page, because I have not really programmed anything of worth as yet, and I think writing software to use up all the idle cycles of a machine is as good a way as any to learn Win32ASM programming, and initiate myself into the ranks of real coders.
Posted on 2003-05-07 00:03:17 by V Coder

To speed-up the program, you should try to store the number in normal and reverse order as follows:

4 bytes containing the number in the normal order
4 bytes containing the number in reversed order
etc...

This will perhaps reduce the cache hits.
Of course, reversing the number can be done during the check_for_palindrom phase.

JC


I'm inclined to disagree that it will brind a speed up... Does this not add an increased level of complexity to the code, since you need the reversed digits some long distance away from the normal digits, especially with 9x million digit numbers?

Originally posted by bitRake
V Coder, I don't think the carries propagate correctly in your code. You have to store to the least significant digits first and carries propagate to the more significant digits - are you doing this? I translated your code as a starting point for my algo, and it had this problem - so I reworked it from scratch.


Which code exactly has the problem? The Jones' routine for four digits or the simple single digit addition?

PS. I reworked my 4 digits addition using little endian format. A couple of other changes too. I'll post soon.

How far have you reached with your code, Rake?

I'm now installing HLA, MASM, etc on my Pentium 1066 notebook, which I got back yesterday.
Posted on 2003-05-07 00:22:49 by V Coder

1) Which code exactly has the problem?

2) How far have you reached with your code, Rake?
1) Jones' routine. Should be fixed with your recent changes to little endian.

2) I'm aiming to meet or surpass existing algorithms, but the code needs some work. It will be taylored to Athlon CPU's instead of P4's. Early testing is looking like 2.5 cycles per digit on uncached data. I will try to get access to an XP with 400 Mhz bus for better speed testing. :)
Posted on 2003-05-07 01:32:29 by bitRAKE
Rake, I'm sure the carries propagate. In fact, I put it in a working 4 digit add palindrome program and it posted the same numbers and my other routine... But I have to do two bswaps because of big endian format... (I made the wrong choice 5 years ago to use big endian format. These things stay with you for life, eh?)

Other users, check Jones' webpage about why the algorithm should work. At any rate the algorithm is not the problem here, if anything it is my coding of it. But once you can understand the code, you can adapt it yourself.

ADD F6F6F6F6 - to propagate carries
AND 30303030: any byte F6-FF, and will leave 30...
SHR , 3: 30 becomes 06
SUB 06 from any byte F6-FF, then
AND 0F0F0F0F to reconvert all bytes to unpacked BCD.
Posted on 2003-05-07 02:50:43 by V Coder
V Coder, thanks for your reply.

This is exactly how I got myself involved in this quest, if you are interested:

In grade 5 we were asked to compute the palindrome of 89 via reversal addition. Years later, in junior high, my brother was thinking back on this, and created a BASIC program that computed the palindrome of every number, just out of curiosity to see which ones resolved. Years later, he was going through old programming stuff, and came across this. It piqued my curiosity, and I searched on the web for information about the quest. Just as the digits of pi are computed more and more every year, I expected the same to be done with this quest. I was very surprised to find that it had only been taken to 1 million digits in 1990, and then to 2 million in 1995, and to a claimed 3.9 million digits in a rec.puzzles archive a few years later...

I wrote up a real-mode Turbo Pascal program very quickly, and timed it to the maximum digit length it could hold within the 64 kb data limit. I extrapolated the data (knowing it would be slightly incorrect due to cache use for larger numbers), and was surprised to find that I would reach 1 million digits within a day. I could not believe that a very slow program that was thrown together in a few minutes could outperform a supercomputer of a few years ago. Therefore, I looked at John Walker's code, and was surprised to see that it was very inefficient (it was obviously thrown together very quickly, with more concerns of memory usage than speed). I knew that I would need ?flat mode? to access more than 64 kb, so I reprogrammed the simple algorithm (it was a basic 1 digit per byte algorithm that added digits one at a time using 2 arrays) into the BASM assembler within Turbo Pascal.

I realized that I could beat the world record in less than a month, and I had access to an idle computer, so it was at this time that I decided to run this program... I had to. After all, the computer was just sitting there, and the program was already made. This was not a quest to make the fastest program. I knew that I did not have the proper tools (protected mode compiler/assembler) or the proper information (assembly language references) to code anything better at the time. It only used basic move, add, compare, and jump instructions that I learned from a single sheet of paper handed to me one day in computer science class for 16-bit 286 instructions. I did have ideas for speeding it such as storing multiple digits per byte with look up tables, but I did not have the time to mess with it - so the code was a small hobby program that I did not intend to spend more than a few hours working on, and I didn't. I expected it to be beaten, as Istvan easily did using the exact same algorithm with a real 32-bit mode assembler. This code certainly does not rank in anywhere near in the best code that I have written.

If you had created your program before mine, then you should have definitely put it into action, for the same reasons that I did. I did not think I was doing any great deed by putting mine into action, although it was somewhat cool to think that I would have the world record. There was actually a period of almost a year between when I lost access to an idle computer, and Wade's email address stopped working in which no one was running the quest - it was stalled at 13 million. Ian Peter had already written an application that was 4 times as fast as mine, but Wade couldn?t use it because he didn?t have linux. I told myself repeatedly to redo the program in windows with a true assembler, but just never did - although it is something I always wanted to do (I still want to do it now :) ). I guess it served its purpose for its time. However, I think it is great that so many people are actively involved once again (It was weird to think that I was potentially the only person interested in it for such a long time). Also, I think it is great to set up challenges for yourself - pushing yourself further will never hurt you, but please remember I was not on a quest for the fastest program (Of course, my resume page may speak highly of my program, but that is its purpose).
Posted on 2003-05-07 10:39:04 by Jason

Rake, I'm sure the carries propagate.
Sorry, it was not your error but mine. I was trying to use that algo for little endian, but like you already know - it works perfectly fine for big endian. :stupid:
Posted on 2003-05-08 21:33:11 by bitRAKE
bitRAKE!!!!! It's working !!!!! My first real Win32ASM program!!!!!

Well, it is console mode, so you'll probably say it does not count, but {where's the smily with the tongue stuck out? oh well, you get the idea. :grin: }...

Now, as you would have seen from my HLA code... I print every 1000 iterations... I have it running simultaneously on my Pentium 166 MMX, and Pentium III 1066. Surprisingly, it only runs slightly faster on my Pentium MMX machine than the unreal mode version. It's at 52 minutes to reach a 200,000 digit number (483101 additions), and should finish 1,000,000 digits in just under a day.

On the Pentium III, 200,000 digits took 3 minutes and 50 seconds. It's crossed 750,000 digits (1611514 additions) in 1:19:22. Should reach 1,000,000 digits (2415836 additions) within 2 hours. I am going to test it on the 1.8 GHz Penitum 4 in the morning. By that time I will adjust the code to allow it to stop after a specified number of additions.

I am using a mod of my algorithm, (along with lea, etc.) I hope to use the Jones' algorithm extended to 8 digits next. You'll still be ahead of me therefore, with 16 digits additions.

Most importantly, I check for the palindrome condition during each addition. At most therefore, I will add once extra, but if I find palindrome, then I use the preveious sum not the current one... This allows it to be more portable to check other things... Also, if the tester puts in a number which is known to yield, I will not be caught out...
Posted on 2003-05-08 22:56:50 by V Coder
I have to contact my former co-workers. In my farewell letter 3 monts ago I told them I was learning Win32ASM programming just to leave no stone unturned. "Whatever your hands find to do, do it with all your might...Attain, attain!" Yes' I'm proud.

Jason, yeven though I am 3 years 9 months behind you in the quest, it feels real good to have run.

Now, tell me something,


Sorry, it was not your error but mine. I was trying to use that algo for little endian, but like you already know - it works perfectly fine for big endian. :stupid:


Does this mean I can pat myself on the back for introducing you to the idea of adding BCD numbers in parallel? And can I boast of having derived an almost identical routine to that of a master programmer? (Your addition routine is slightly different from the one I posted, only MUCHHHHHH more efficient, with ALL the optimization tricks put in. It was like reading a primer on optimization!!!!) Can I say I'm a real good programmer by extension???? :grin:

By the way, bitRake, what are your new times?
Do you have new code or executables for view? That hidden window trick is cool though...
Have you contacted p196.org to submit your code for testing yet?
I plan to do so sometime next week.

I will look into increasing thread priority separately...

I'm glad I reached this far, but I'm not so sure about configuring the program to read ISF files etc... I will write to a plain text file (digits after digits...), and let the converting program donvert. I may read from a plain text file too.

Maybe I should check RDTSC at the start of the program so I could check how many cycles have been wasted on the idle palindrome quest... Nah. I can calculate that! Pentium III at 1066, running for 2 hours = 7,200 seconds * 1,066,000,000 = more than 7 1/2 trillion cycles wasted! YES!!!!!!!!
Posted on 2003-05-08 23:20:22 by V Coder
MCoder originally wrote
To speed-up the program, you should try to store the number in normal and reverse order as follows:

4 bytes containing the number in the normal order
4 bytes containing the number in reversed order
etc...

This will perhaps reduce the cache hits.
Of course, reversing the number can be done during the check_for_palindrom phase.

JC

Has anyone tried it as yet? (I think it would slow things down, but I'm willing to concede at anytime...)


bitRAKE originally wrote[/]
Initially my goal was to minimize the cycles per digit, but this goal will change with the next version to get the right balance between CPU speed and memory latency. Athlon's are quite good on MMX speed, so a packed BCD version is where I'm headed.

Have you started a packed version? I expect that you will need the lookup table MCoder spoke about but you would only be able to do byte by byte addition. Am I correct?


Here is a test program. 1000 Mhz Athlon clocks in at <7 secs for 2^16 digits.
This actually calculates the digits for the number 196.

2^20 digits takes 3 hours, 33 minutes, 23.52 seconds!

Pentium III 1066, 2 hours 41 minutes to reach 1 million digits. 2^20 is 1048576. I'm on a slightly faster machine 1066 vs 1000. . But I use no precaching, etc. (I have nooooooo idea how to do that!!), but mine looks faster !!! :grin: Your routine takes 6920ms on the PIII...
Posted on 2003-05-08 23:34:28 by V Coder
V Coder, I am human. :) You didn't introduce me to the idea of parallel BCD math, but I was not aware of the 196 palidrome quest. :) We programmers barrow from each other and we are a community for it. I am having problems optimizing the packed BCD add in MMX - there is a sense it should be better, but I am unable to see it, yet. The memory accesses are the very fastest possible for the Athlon and as Jason has said already, there is not a way to optimize the algorithm overall - having the itterations dependant on each other.

I'm thinking maybe a different number format could be used?

The memory latencies leave much room for instructions - you will see this as your code runs on the P4 - not the speed increase one would first expect. I'm sure the existing top code is designed for SSE2 - it is quite impressive. No program, yet. What hidden window?
Posted on 2003-05-08 23:41:50 by bitRAKE
bitRAKE
So, you got packed BCD math to work with reverse and add? I could not understand from your post whether you got the MMX going and were trying to optimize it or could not get it with MMX. (I'm thinking that would be more trouble than it's worth, but if you get it working, then I take back the part about me being a real good programmer. I'll settle for extremely lowly amateur -- which I am.)

I hope my :grin:s are taken well, I am just extremely relieved - after spending hundreds of hours of lost sleep, getting a Win32 console mode program doing something is just wow for me.
Posted on 2003-05-09 00:08:03 by V Coder

I hope my :grin:s are taken well
Yes, it has just been a very long week - I am a fool for not being in bed now. :grin: I work towards an MMX only version. Good night and sweet dreams of algorithms. :)

I would advise to have your program output to a file - for checking by the program on Jason's page. I will do the same most likely.
Posted on 2003-05-09 00:35:33 by bitRAKE
It takes over 3 hours. Also, the 413280 iterations takes a few seconds longer too.

And in case you're wondering, the 1066 MHz Pentium III is over 20 percent faster than the 1.8 GHz Pentium 4. (I left my figures at work accidentally on Friday, so I cannot remember the exact ratio...) I am not likely to move from an MMX version to SSE or SSE2, so, tough!

I will soon put in load and save function.

bitRAKE
It seems that Jones' algorithm uses the same number of instructions (five - 5) in the core of the addition as we do (well, his add $F6F6F6F6 is inside the loop, ours is outside to collect several carries separate from the MMX). So, I wonder if any benefit can be derived fom using it? I was thinking only if we have 64 bit add with carry. Or even 128 bit. Let me look at the SSE and SSE2 instruction set again...

Check out PADDQ. This should work on Pentium MMX? or is it only Pentium II, etc? 64 bit - quadword addition. Therefore no need for add, adc at all. Working it in now.
Posted on 2003-05-10 19:08:18 by V Coder