hi all,

I've written a random-number generator function which uses 4 seeds and is partially based on the idea of the Park-Miller generator.
I would like to have your expert opinions/suggestions on making it even more random.
Speed isn't really an issue, randomness is.

Here's the code

pseed PROC s1:DWORD,s2:DWORD,s3:DWORD,s4:DWORD
.data
seed1  dd 0AAAABBBBh
seed2  dd 0CCCCDDDDh
seed3  dd 0EEEEFFFFh
seed4  dd 11112222h

.code
mov eax,s1 ;if s1 = 0 then use default value
.if eax!=0
mov seed1,eax
.endif
mov eax,s2 ;if s2 = 0 then use default value
.if eax!=0
mov seed2,eax
.endif
mov eax,s3 ;if s3 = 0 then use default value
.if eax!=0
mov seed3,eax
.endif
mov eax,s4 ;if s4 = 0 then use default value
.if eax!=0
mov seed4,eax
.endif
ret

pseed ENDP

prand PROC uses ebx base:DWORD ;ebx is callee-saved in Win32, so preserve it
;seed1 = AAAABBBB
;seed2 = CCCCDDDD
;seed3 = EEEEFFFF
;seed4 = 11112222

mov eax,seed1 ;AAAABBBB
mov ebx,seed2 ;CCCCDDDD
mov ecx,seed3 ;EEEEFFFF
mov edx,seed4 ;11112222
;start shifting
xchg ax,bx    ;eax = AAAADDDD, ebx = CCCCBBBB
xchg cx,dx   ;ecx = EEEE2222, edx = 1111FFFF
xchg al,cl   ;eax = AAAADD22, ecx = EEEE22DD
xchg bl,dl   ;ebx = CCCCBBFF, edx = 1111FFBB
push eax   ;AAAADD22
push ecx      ;EEEE22DD
shl eax,8   ;AADD2200
shr ecx,24   ;000000EE
add eax,ecx  ;AADD22EE
mov seed1,eax   ;s1 = AADD22EE
pop ecx   ;EEEE22DD
pop eax   ;AAAADD22
push ecx   ;EEEE22DD
shr eax,24   ;000000AA
push edx   ;1111FFBB
shl edx,8   ;11FFBB00
add edx,eax   ;11FFBBAA
mov seed2,edx    ;s2 = 11FFBBAA
pop edx   ;1111FFBB
shr edx,24   ;00000011
push ebx   ;CCCCBBFF
shl ebx,8   ;CCBBFF00
add ebx,edx   ;CCBBFF11
mov seed3,ebx   ;s3 = CCBBFF11
pop ebx   ;CCCCBBFF
shr ebx,24   ;000000CC
pop ecx   ;EEEE22DD
shl ecx,8   ;EE22DD00
add ecx,ebx  ;EE22DDCC
mov seed4,ecx    ;s4 = EE22DDCC
;start calculating
mov eax,seed1
mov ecx,16587
xor edx,edx
div ecx   ;AADD22EE / 16587, result in eax, remainder in edx
mov ebx,seed2    ;11FFBBAA
xchg ebx,eax ;eax = 11FFBBAA, ebx = quotient
sub eax,ebx  ;11FFBBAA - (s1 / 16587)
mov ecx,edx  ;ecx = remainder (s1 % 16587)
xor edx,edx
mul ecx
mov seed2,eax    ;seed2 = (s2 - (s1 / 16587)) * (s1 % 16587)

mov ecx,29753
xor edx,edx
div ecx ; (s2 / 29753)
mov ebx,seed3  ;CCBBFF11
xchg ebx,eax ;eax = CCBBFF11, ebx = quotient
sub eax,ebx  ;CCBBFF11 - (s2 / 29753)
mov ecx,edx  ;ecx = remainder (s2 % 29753)
xor edx,edx
mul ecx
mov seed3,eax  ;seed3 = (s3 - (s2 / 29753)) * (s2 % 29753)

mov ecx,39744
xor edx,edx
div ecx ; (s3 / 39744)
mov ebx,seed4  ;EE22DDCC
xchg ebx,eax ;eax = EE22DDCC, ebx = quotient
sub eax,ebx  ;EE22DDCC - (s3 / 39744)
mov ecx,edx  ;ecx = remainder (s3 % 39744)
xor edx,edx
mul ecx
mov seed4,eax  ;seed4 = (s4 - (s3 / 39744)) * (s3 % 39744)

mov ecx,59721
xor edx,edx
div ecx ; (s4 / 59721)
mov ebx,seed1  ;AADD22EE
xchg ebx,eax ;eax = AADD22EE, ebx = quotient
sub eax,ebx  ;AADD22EE - (s4 / 59721)
mov ecx,edx  ;ecx = remainder (s4 % 59721)
xor edx,edx
mul ecx
mov seed1,eax  ;seed1 = (s1 - (s4 / 59721)) * (s4 % 59721)
;build the output dword from the low byte of each new seed
shl eax,24       ;low byte of seed1 -> bits 24-31
mov ebx,seed2
shl ebx,24
shr ebx,8        ;low byte of seed2 -> bits 16-23
add eax,ebx
mov ebx,seed3
shl ebx,24
shr ebx,16       ;low byte of seed3 -> bits 8-15
add eax,ebx
mov ebx,seed4
add al,bl        ;low byte of seed4 -> bits 0-7
mov ebx,seed1
xor eax,ebx      ;fold the full new seed1 back in
xor edx,edx
div base     ;base must be nonzero; quotient in eax, remainder in edx
mov eax,edx  ;return the composite value mod base
ret

prand ENDP
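
For reference, here's the Park-Miller "minimal standard" generator that the above is loosely based on - just a minimal sketch (pmseed is a placeholder name, and the seed has to start in 1..2^31-2, never 0):

.data
pmseed dd 1            ;current state, must never become 0

.code
pmrand PROC
    mov eax,pmseed
    mov ecx,16807      ;Park-Miller multiplier
    mul ecx            ;edx:eax = pmseed * 16807
    mov ecx,7FFFFFFFh  ;modulus 2^31 - 1 (prime)
    div ecx            ;edx = (pmseed * 16807) mod (2^31 - 1)
    mov pmseed,edx
    mov eax,edx        ;return the new value in eax
    ret
pmrand ENDP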


Thanks in advance!

Mark Vogels
Posted on 2006-05-31 02:05:42 by white scorpion
Test your random generator with ENT
http://www.fourmilab.ch/random/

Make a small program that writes RANDOM dwords (generated by your procedure) to a file. Then use ENT to test the randomness of the file.
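
Something along these lines would do (untested sketch - assumes your prand is linked in and the masm32 includes are used; "rand.bin" and the buffer size are arbitrary):

.data
fname    db "rand.bin",0

.data?
buffer   dd 1024 dup(?)
written  dd ?
hFile    dd ?

.code
    cld                         ;stosd counts upward
    mov edi,offset buffer       ;fill a 4096-byte buffer with random dwords
    mov esi,1024
fill_loop:
    invoke prand,0FFFFFFFFh     ;near-full dword range
    stosd                       ;store eax at [edi], advance edi by 4
    dec esi
    jnz fill_loop

    invoke CreateFile,addr fname,GENERIC_WRITE,0,NULL,CREATE_ALWAYS,FILE_ATTRIBUTE_NORMAL,NULL
    mov hFile,eax
    invoke WriteFile,hFile,addr buffer,4096,addr written,NULL
    invoke CloseHandle,hFile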

At first glance your algorithm looks like it could work, but it's a little long if speed is one of your concerns; if not, then it's fine.
Posted on 2006-05-31 16:31:41 by r22
Speed isn't a concern for me at the moment, since it will only be used to generate a 1-4 KB file. This is fast enough on a normal computer.

Can you explain to me which values I should preferably get using ENT?
Changing the static numbers changes the result a lot, so which ones should I use?

Here's one result:

Entropy = 7.909002 bits per byte.

Optimum compression would reduce the size
of this 4096 byte file by 1 percent.

Chi square distribution for 4096 samples is 514.38, and randomly
would exceed this value 0.01 percent of the times.

Arithmetic mean value of data bytes is 127.5481 (127.5 = random).
Monte Carlo value for Pi is 3.061583578 (error 2.55 percent).
Serial correlation coefficient is -0.002773 (totally uncorrelated = 0.0).


Here's another one:

Entropy = 7.894121 bits per byte.

Optimum compression would reduce the size
of this 4096 byte file by 1 percent.

Chi square distribution for 4096 samples is 605.00, and randomly
would exceed this value 0.01 percent of the times.

Arithmetic mean value of data bytes is 128.1001 (127.5 = random).
Monte Carlo value for Pi is 3.143695015 (error 0.07 percent).
Serial correlation coefficient is -0.018626 (totally uncorrelated = 0.0).


The first one is close to the random mean (127.5481); the second one is close to pi.
Which is better, and why?

Thanks
Posted on 2006-05-31 23:57:16 by white scorpion

The first one is close to the random mean (127.5481); the second one is close to pi.
Which is better, and why?

Good question :) - I dunno if the arithmetic mean should preferably be equal to 127.5, or just larger than that value.

From the docs, the more random your data is, the closer the Monte Carlo value is to Pi.
Posted on 2006-06-01 06:02:15 by f0dder
I've read that too, but I also saw this:

This quantity measures the extent to which each byte in the file depends upon the previous byte. For random sequences, this value (which can be positive or negative) will, of course, be close to zero. A non-random byte stream such as a C program will yield a serial correlation coefficient on the order of 0.5. Wildly predictable data such as uncompressed bitmaps will exhibit serial correlation coefficients approaching 1.
I've changed some values, which here gave -0.000714 as a result, which would mean it is pretty close to random as well.
At the same time, the other figures like pi and the mean moved further away from their optimal values.
So which one would be better?
Posted on 2006-06-01 10:44:47 by white scorpion
You should run it with a larger sample - 4096 bytes isn't really sufficient. A file size on the order of a meg or bigger will give more accurate results.

Ossa
Posted on 2006-06-01 11:37:24 by Ossa
I understand.
But then I still don't know what kind of output is wanted.



Entropy = 7.620056 bits per byte.

Optimum compression would reduce the size
of this 1048576 byte file by 4 percent.

Chi square distribution for 1048576 samples is 727702.68, and randomly
would exceed this value 0.01 percent of the times.

Arithmetic mean value of data bytes is 126.2168 (127.5 = random).
Monte Carlo value for Pi is 3.180989002 (error 1.25 percent).
Serial correlation coefficient is -0.000042 (totally uncorrelated = 0.0).

Changed some values. how about this?
The serial correlation coefficient looks nice...
Posted on 2006-06-01 12:32:17 by white scorpion

But then I still don't know what kind of output is wanted.


You mean that you don't know what values from ENT are the ideal values for a perfect RNG?

Well here's what a perfect output would be:

    Entropy = 8.000000 bits per character.

    Optimum compression would reduce the size
    of this xxx character file by 0 percent.

    Chi square distribution for xxx samples is yyy.yy, and randomly
    would exceed this value zz.z percent of the times. <not so sure about this one>
 
    Arithmetic mean value of data bytes is 127.5 (127.5 = random).
    Monte Carlo value for Pi is 3.14159265 (error 0.00 percent).
    Serial correlation coefficient is 0.000000 (totally uncorrelated = 0.0).


As a note, the serial correlation coefficient will actually be between (mean - 2*stddev) and (mean + 2*stddev) 95% of the time for excellent samples, where:

mean = -1 / (n - 1)
stddev^2 = n^2 / ((n - 1)^2 * (n - 2))

(from Knuth's Art of Computer Programming: Section 3.3.2 Empirical Tests: Algorithm K - Serial Correlation Coefficient)
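
For example, with n = 1,048,576 that works out to mean = -1/1048575 (about -0.00000095) and stddev of about 0.00098, so 95% of good samples should land roughly within +/-0.002 - your -0.000042 sits comfortably inside that band.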

But you are never going to get exactly that. I guess you want to know how good is "good enough" and I can't really say without a context... for encryption, study the links I gave over in the other forum about cryptographic random number generators - some of the "minimum acceptable" values for various standards are covered there.

Ossa

I'm going to check on the expected value for the Chi-Squared test, but the value produced by your algorithm is alarming me
Posted on 2006-06-01 13:30:56 by Ossa
For the Chi-Squared test, read the ENT page:

If the percentage is greater than 99% or less than 1%, the sequence is almost certainly not random. If the percentage is between 99% and 95% or between 1% and 5%, the sequence is suspect. Percentages between 90% and 95% and 5% and 10% indicate the sequence is "almost suspect".


Your value is most certainly less than 1%. From what I read there and from what I'm reading in Knuth's tome, your algorithm fails the test for randomness. In short: for cryptographic use, this RNG is useless.

Ossa
Posted on 2006-06-01 13:50:34 by Ossa
One cause for the bad chi-square distribution could be how you're creating the file.
Make sure you get rid of the last divide, or else make sure you're passing 0xFFFFFFFF to your function as the max-value parameter.

Another cause for bad chi could be your choice of seed values. Try to randomize them a bit (if you are using the default values).

My 2 cents: use more ADD and XOR instructions and fewer MUL and DIV. And maybe replace the shifts (SHL/SHR) with rotates (ROL/ROR).
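
For the mixing stage, something like this rotate-and-xor idea (just a sketch, rotate counts picked arbitrarily):

    mov eax,seed1
    rol eax,7        ;a rotate keeps all 32 bits in play, SHL/SHR throws bits away
    xor eax,seed2
    add eax,seed3
    rol eax,13
    xor eax,seed4    ;eax = mixed output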
Posted on 2006-06-01 21:10:10 by r22
Thanks guys, this is useful info.
I will post some new results later today.
Posted on 2006-06-02 00:19:06 by white scorpion
Ok, here it goes:

Entropy = 7.999851 bits per byte.

Optimum compression would reduce the size
of this 1048576 byte file by 0 percent.

Chi square distribution for 1048576 samples is 216.70, and randomly
would exceed this value 95.00 percent of the times.

Arithmetic mean value of data bytes is 127.5783 (127.5 = random).
Monte Carlo value for Pi is 3.143795562 (error 0.07 percent).
Serial correlation coefficient is 0.000029 (totally uncorrelated = 0.0).

Using attached code.

This should be sufficient for encryption, or am I wrong?


This is strange:
the bigger the generated file gets, the better the entropy, pi and correlation figures get:

Entropy = 7.999999 bits per byte.

Optimum compression would reduce the size
of this 167772160 byte file by 0 percent.

Chi square distribution for 167772160 samples is 275.12, and randomly
would exceed this value 25.00 percent of the times.

Arithmetic mean value of data bytes is 127.4810 (127.5 = random).
Monte Carlo value for Pi is 3.142472151 (error 0.03 percent).
Serial correlation coefficient is 0.000026 (totally uncorrelated = 0.0).

for a 160 MB file...

Using the -b option on the 160 MB file:

Entropy = 1.000000 bits per bit.

Optimum compression would reduce the size
of this 1342177280 bit file by 0 percent.

Chi square distribution for 1342177280 samples is 11.56, and randomly
would exceed this value 0.01 percent of the times.

Arithmetic mean value of data bits is 0.5000 (0.5 = random).
Monte Carlo value for Pi is 3.142472151 (error 0.03 percent).
Serial correlation coefficient is -0.000021 (totally uncorrelated = 0.0).

Posted on 2006-06-02 14:25:09 by white scorpion
This is strange:
the bigger the generated file gets, the better the entropy, pi and correlation figures get


That is not strange; it is a documented fact.
Paul
Posted on 2006-06-22 11:55:22 by PBrennick
Well, you would think that in a 30-byte file you have less chance of finding the same byte strings than in a 300,000-byte file.
That's why I find this strange.
Posted on 2006-06-22 13:50:21 by white scorpion
But the more random the file is, if you're looking for a decent-sized bit string, there'll be a lot more unequal strings than in a smaller file. Basically, even though there is more of a chance of that particular sequence being created in the large file, the large gap between occurrences is enough to indicate more random properties than a small file with only a few of the possible sequences in it. If that small file has 2 occurrences of the same sequence in, say, 512 KB, as opposed to a 5 MB file with 2 occurrences, the randomness of the smaller file is actually less than that of the large file.

I'm having trouble properly explaining this, because I've got no formal education (self or otherwise) on this subject, I just understand the principle in use here.
Posted on 2006-06-24 19:44:37 by Bobbias
I understand what you mean  ;)
But the program also checks whether it's possible to compress the file.
If you had more string sequences that were equal, then it would be easier to compress it.

But I don't think it would really matter in this case.
The biggest problem I have is finding 4 pseudo-random seeds to use with the function.
The ones i used now are:
1. GetTickCount
2. GlobalMemoryStatus
3. GetCurrentThreadId & GetCurrentProcessId
4. GetDiskFreeSpace

I'm not happy using these seeds, but I don't know of any other values which could be used as input to my function.
Any ideas?
Posted on 2006-06-25 03:32:44 by white scorpion
You could combine some of them, or use a hash function on some of those values and use the resulting hash (which should be nice and random).
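
A rough sketch of the combining idea (the rotate counts are arbitrary, and esi is just a scratch register here):

    invoke GetTickCount
    mov esi,eax
    invoke GetCurrentThreadId
    rol eax,11           ;spread the low, mostly-constant bits around
    xor esi,eax
    invoke GetCurrentProcessId
    rol eax,22
    xor esi,eax          ;esi = one combined seed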
Posted on 2006-06-25 05:18:05 by Bobbias
I've been thinking of using a hash function, but I think it would be useless from a security standpoint.
example:

I have one random function with 4 dword seeds as input.
If I have a hash function which makes a 4-dword hash out of 1 dword, then I would actually only have 1 dword of possible different seeds, thus defeating the strength of the 4 seeds.

Posted on 2006-06-26 00:29:47 by white scorpion