I showed you mine first - aint I a flirt?



Coord2D struct
X REAL4 ?
Y REAL4 ?
Coord2D ends

BB2D struct
TopLeft Coord2D <?>
BottomRight Coord2D <?>
BB2D ends
;=============================================
;A very fast test for intersection of regular 2d rectangles
;=============================================
QUICK_DoTheseBB2DsIntersect proc pBB1, pBB2
mov esi,pBB1
mov edi,pBB2
fld [esi].BB2D.TopLeft.X
fcomp [edi].BB2D.BottomRight.X
__FJLE @F
return FALSE
@@
fld [esi].BB2D.BottomRight.X
fcomp [edi].BB2D.TopLeft.X
__FJGE @F
return FALSE
@@:
fld [esi].BB2D.BottomRight.Y
fcomp [edi].BB2D.TopLeft.Y
__FJLE @F
return FALSE
@@:
fld [esi].BB2D.TopLeft.Y
fcomp [edi].BB2D.BottomRight.Y
__FJGE @F
return FALSE
@@:
return TRUE
QUICK_DoTheseBB2DsIntersect endp
Posted on 2004-11-10 23:20:49 by Homer
I don't turn my head at everything with a pair of legs and the jumpy ones scare me, but I'm drinking tonight and who knows...

Boolean returns should be in the carry flag, imho. No jump is required for this algorithm. :) Subtract and accumulate the negative signs - return the opposite of the sign in the carry flag.

Maybe this code will work, or at the very least you can see what I mean:
	push edi

push esi
mov edi, [esp][4*4]
mov esi, [esp][4*3]

mov ecx, [edi].BB2D.BottomRight.X
sub ecx, [esi].BB2D.TopLeft.X

mov eax, [esi].BB2D.BottomRight.X
sub eax, [edi].BB2D.TopLeft.X

sub edx, [edi].BB2D.TopLeft.Y
mov edx, [esi].BB2D.BottomRight.Y

or eax, ecx

mov ecx, [esi].BB2D.TopLeft.Y
sub ecx, [edi].BB2D.BottomRight.Y

or eax, edx
or eax, ecx
shr eax, 31
pop esi
pop edi
retn 4*2
I image the SSE2 code would be a couple instructions. :)
Posted on 2004-11-11 10:14:29 by bitRAKE
bitrake, maybe you can use the sign flag instead. then you won't do the shr at all, if i'm correct. it will save a clocktick :)
Posted on 2004-11-11 11:15:10 by lifewire
1cycle * 60FPS * 1000 BB2D^2; yeah - good idea to use the sign flag.

...* 1000 users * 50 hours; yeah - good idea to use the sign flag.

...* 10 years; yeah - good idea to use the sign flag.

Sorry, I forgot to reverse the bit to stick with the return type of the original code.
Posted on 2004-11-11 11:34:55 by bitRAKE
I still think my version is preferable due to the early out at each test.
My version will return a boolean response with as few as one floating compare (one test).
Also, the structure declares the type as REAL4 - a simple integer subtraction won't yield a correct answer, but assuming the type was dword integer, your code is perfectly valid, just wants some tests of carry jumping to an early out ie jc/jnc tests.
I do see your point about using carryflag to hold replyvalue, but then my code wouldn't be C/C++ friendly anymore..
Posted on 2004-11-11 21:45:25 by Homer
The same technique works with REAL4 (code is just an example). Yeah, the unpredicted early out takes more time than whole routine. :roll: :lol: (assuming the data within a single cacheline) True, C/C++ friendly is not a goal of mine.
Posted on 2004-11-11 22:08:23 by bitRAKE
Could you elaborate on "unpredicted early out takes more time than the whole routine" please?

You mean by exiting the procedure early, it actually takes more cpu time to execute than performing unnecessary math does?
Posted on 2004-11-11 22:36:56 by Homer
Sure it's faster to execute unbranching code, because the processor is built like a rocket car. We flick the switch and it races down the road at 763MPH - kind of late to be thinking about steering!

The start of the pipe does the prefeching of the instructions, but the processor will not know until the middle of the pipe (or later for Intel) if a branch is truely taken or not. Forward branches are not predicted as being taken the first time, and after that its predictablity is very hard for the processor to predict because of the nature of the algorithm. An unpredicted branch causes the processor to deploy the parachute. Like positioning rockets the processor has a few tricks to prevent the total clearing of the instruction pipe: prefetch all targets, calculate both paths.

Every optimization text I've ever read (including the manuals) advocated the exchange of branches with code if possible. From the AMD manual, "Predicted-taken branches incur only a single-cycle delay to redirect the instruction fetcher to the target instruction. In the event of a mispredict, the minimum penalty is ten cycles." The maximum is much higher!

*I'm speaking of conditional branching throughout the above.

Another example is my checkers engine - it is faster for it to evaluate all the moves than to exclude duplicate moves with a hash table (below a few ply). Which is contrary to everything published on the matter (because they use some conceptual theory based on ply approaching infinity - checkers games are much shorter :)).
Posted on 2004-11-11 23:06:38 by bitRAKE
Tested both procs, and they proved wrong. (please, prove me wrong) I couldn't believe it, so I made a dozen of tests with different arguments. The last two conditional jumps should be exactly the opposite, and also the first two should not include the "or equal" statement. That's because usually we take this: if a rect at (0;0) with size 10x10 is next to a rect with the same dimensions but at position (10;0), they do not overlap. With Homer's code they would be thought to do so. bitRAKE's code didn't pass all tests, so I think it's not a solution. Comparing both negative floats with integer comparisonor subtraction leads to inverse-logic.
Anyway, here's my simple solution, that can be made in MMX/SSE/SSE2, I guess. No jumping here to confuse the conveyer :).Also, no calling/ret - for max speed. Preserve esi and edi if your code needs it.



; First, a float compare macro, that behaves perfectly
; with negative values too

;------[ support macro ]---------\
ufcomp macro var1,var2
mov eax,var1
mov ebx,var2
mov ecx,eax
and ecx,ebx
sar ecx,31
xor eax,ecx
xor ebx,ecx
cmp eax,ebx
endm
;--------------------------------/

;------------[ example usage ]--------------------------------\
; after using this macro, you can use the Jxx as though you
; just compared two integer values

;Example:
f1 real4 -17.6
f2 real4 44.7

ufcomp f1,f2
jle @F
print "f1 is bigger"
@@:
jne @F
print "f1 = f2"
@@:
jge @F
print "f1 is lower"
@@:

The only problem is that 0.0 > -0.0 ^^"
;------------------------------------------------------------/


;-----------[ Rectangle intersection macro ]--------\
DoRectsIntersect macro r1,r2
;push esi
;push edi
mov esi,r1
mov edi,r2
ufcomp [esi].RECT.left,[edi].RECT.right
setge dl
ufcomp [esi].RECT.right,[edi].RECT.left
setle dh
shl edx,16
ufcomp [esi].RECT.bottom,[edi].RECT.top
setle dl
ufcomp [esi].RECT.top,[edi].RECT.bottom
setge dh
;pop edi
test edx,edx
;pop esi
endm

;---------------------------------------------------/

;----------[ example ]-----------------------------------------\
If EDX was zero, then the two rectangles intersect

re1 RECT <9.9,0.0,19.9,10.0>
re2 RECT <0.0,0.0,10.0,10.0>


DoRectsIntersect offset re1,offset re2
.if ZERO?
print "They Intersect!"
.endif

; if you change the 9.9 to 10.0 , the rectangles won't intersect,
; so it works correctly :). I tested this code thoroughly before
; posting, so you can use it safely ;)

;--------------------------------------------------------------/





Btw, I got interested in this topic in the moment it was posted, but I was delayed to reply partly because I had done one of the tests of my code wrong ^^" Sorry for the delay, have fun :)

btw, do you know a good 1:1 replacement of Notepad, with the same shortcut keys - because win2k's Notepad sucks to Win98's one - chops text in lines when copying to clipboard, and sometimes/often doesn't redraw correctly :|
Posted on 2004-11-14 14:16:32 by Ultrano
Speed test:


invoke GetTickCount
push eax

mov ecx,1000000000
@@:
push ecx
DoRectsIntersect offset are1,offset are2
;.if ZERO?
; print "They Intersect! 7"
;.endif
pop ecx
dec ecx
jnz @B

invoke GetTickCount
pop edx
sub eax,edx
print eax

On AthlonXP 2000+ , normal thread priority, this takes 14.4 seconds to execute (70 million intersections/sec)
If I change the "dec ecx" to "dec dword ptr" with the accompanying changes, it takes 12 seconds (83.3 mil tests/s)
Thus, the DoRectsIntersect code executes in less than 20 cycles, with 2+ instructions/cycle
Posted on 2004-11-14 14:41:05 by Ultrano
In my current 2D OpenGL code, it seems OpenGL inverts the Y axis so that 0,0 is in the bottom left screen corner, sorry :oops:

The 2D stuff is viewed as orthographic projection in a 3D environment.

It's something I haven't been worried about too much yet (although it feels weird to draw bottom up lol) and I guess maybe there could be something wrong with my projection matrix ?

hehehe :)

Anyway thanks for showing interest, especially bitRake (you have opened my eyes somewhat, I've a very linear 8bit background)

Ultrano, we are covering some common ground at this point in time, any input you have into the 2D aspects of my opengl dll would be appreciated :)
Posted on 2004-11-14 22:49:03 by Homer
2D - my specialty I guess :-D lol .
In the ogl then my second macro should look like this:


DoRectsIntersect macro r1,r2
;push esi
;push edi
mov esi,r1
mov edi,r2
ufcomp [esi].RECT.left,[edi].RECT.right
setge dl
ufcomp [esi].RECT.right,[edi].RECT.left
setle dh
shl edx,16
ufcomp [esi].RECT.bottom,[edi].RECT.top
setge dl
ufcomp [esi].RECT.top,[edi].RECT.bottom
setle dh
;pop edi
test edx,edx
;pop esi
endm

Funny ogl :| . I remember it was a pain to draw a DIB of a virtual window that was resizable with scrollbar + it was a child of another virtual window but with top-down draw, and the DIB had many elements that depended on top-down drawing :( . Good now I use DDraw, so no such problems arise :-D
Looking at the code I currently develop, there's only a SetBounds function and something that isn't 2D-related but helps a lot in games - message queue. The SetBounds actually makes the Blt functions clip to a specified area - it is really fast, so you might not need any ogl code for that. The message queue is something like this:
invoke NewMessage,MSG_REMOVEINVULNERABILITY,150
where 150 is the timeout in animation frames (with having 50 fps) after which this message will be launched and processed. For instance, this code is usually used after the player's shield gets broken, and he's been set to immortal. And this message makes sure this immortality is just temporary :) .
Posted on 2004-11-15 03:17:36 by Ultrano
The message queue sounds interesting.. my current coding ethic is to create one or more object classes, then an "object manager class" for keeping collections of objects, it would certainly be useful for my objects to post messages to their manager and vice versa :) At the moment I use flags within the object instances to pass messages to/from the object manager instance...
Posted on 2004-11-15 14:47:31 by Homer
Speed test:


invoke GetTickCount
push eax

mov ecx,1000000000
@@:
push ecx
DoRectsIntersect offset are1,offset are2
;.if ZERO?
; print "They Intersect! 7"
;.endif
pop ecx
dec ecx
jnz @B

invoke GetTickCount
pop edx
sub eax,edx
print eax

On AthlonXP 2000+ , normal thread priority, this takes 14.4 seconds to execute (70 million intersections/sec)
If I change the "dec ecx" to "dec dword ptr" with the accompanying changes, it takes 12 seconds (83.3 mil tests/s)
Thus, the DoRectsIntersect code executes in less than 20 cycles, with 2+ instructions/cycle


You made a slight mistake in your math. 12 seconds = 12*1000*1000*2000 ( 2000+ athlon) = 24000000000 cycles for the whole thing. Divide that by 1 billion loops to get the average time in cycles for executing the code.

24000000000 / 1000000000 = 24 cycles.

Also I don't know how accurate it is to be using GetTickCount ( since it only has a resolution in the millisecond range) and then using that to see how many cycles it took to execute your code. You should probably use RDTSC to get more accurate values.
Posted on 2004-11-16 15:09:44 by mark_larson
The AthlonXP2000+ runs on 1.6GHz, not 2GHz. It's called 2000+ because it effectively runs almost all code as if it was an Intel on 2GHz or more :P . In this case you see the code runs as if it was a 3.7 GHz Celeron or PIII. Though, I haven't made exactly this benchmark on my PIII and k6-2 or a Celeron, so I can be wrong with +-10%.
The millisecond mistake is just 0.08% , since this code takes 12-14 secs to execute :)
Btw, tested the proc with manually setting priority in TaskManager to realtime, produces only 300ms speedup ^^"
Posted on 2004-11-16 15:24:34 by Ultrano
The error in GetTickCount must be worse than that, I used it as the basis for my timing and found that it was "jumpy" even though the framerate was high.. after changing to high performance timer, the timing artifacts disappeared completely.
I do admit that I would have exaggerated the timing error within my math, but then so would any application requiring timing based on elapsed times...
I probably should write some benchmarking code myself :)
Posted on 2004-11-16 19:57:22 by Homer
The error in GetTickCount must be worse than that, I used it as the basis for my timing and found that it was "jumpy" even though the framerate was high.. after changing to high performance timer, the timing artifacts disappeared completely.
I do admit that I would have exaggerated the timing error within my math, but then so would any application requiring timing based on elapsed times...
I probably should write some benchmarking code myself :)


Yes it has more error that what Ultrano listed.

Ultrano, look at it this way. You are timing someting that runs in the "cycle" range of time using a timer that times events in the millisecond range of time. There are 1,600,000 cycles in 1 millisecond, and your routine is supposedly running in 20 cycles. See the problem? Even if there was no error in GetTickCount you cannot possibly accurately time something that runs in around 20 cycles. There just isn't enough accuracy.

Ignoring loop overhead, here's a simple method.



LOOP_CNT equ 10000000


xor eax,eax
cpuid

rdtsc
push edx
push eax

mov ecx,LOOP_CNT
looper:
push ecx

;... put code to time here
call Whatever

pop ecx
dec ecx
jnz looper

rdtsc
pop ebx
sub eax,ebx
pop ecx
sbb edx,ecx

mov ecx,LOOP_CNT
div ecx

;EAX holds the number of cycles. You can print that or convert to some other unit of time.

;In my actual code I also subtract out the loop overhead to get a more accurate answer. Adding that is easy.
Posted on 2004-11-16 21:20:25 by mark_larson