Hey guys,

I hope this is the right forum for this topic. It isn't really a normal question so bear with me. :)

Something really odd is happening and I was hoping one of you might have a thought as to what it could be.

I wrote two implementations of Bresenham's line drawing algorithm. One in C and one in assembly. By all accounts the assembly version looks faster when compared to the C routine's disassembly. However, the routine is only sometimes faster. The assembly version is in a separate library with the C calling convention.

If I call both routines in windowed mode (I am using sdl for simplicity's sake), using hard coded coordinates from (0,0) to (399, 599) then the assembly version beats the C version by quite a bit ~114,000 lines per second as opposed to ~55,000 lines per second for the C version.

However, if the same number of lines are drawn using randomly generated coordinates then the two routines are about the same with the C version having a narrow edge (about 1-2k lines extra per second). This seems really strange to me since with either test it is using the same seed on a psuedo-random number generator so all the lines being drawn are exactly the same for each routine. The only difference is the random nature in which things are being called. But, since both routines are being called in the same order for each test one would think that both routines would slow down similarly, but this isn't the case.

Finally, if I run the first test (the one with hard coded coordinates) in full screen mode then the same thing happens as with random coordinates ... the C version slightly edges the asm version.

This is boggling my mind. The thing I first thought of was maybe the rate in full screen mode was because of the display memory surface instead of system memory. So, I manually allocated an 800x600x32 buffer and drew to that. Results were the same as drawing to the normal full screen back buffer. But, even if this were the case how would it explain it doing the same thing in windowed mode when the coordinates are randomized?

Does anybody have a clue as to what the problem might be? I can post the code if anybody needs/want that to look at. But, I don't have a problem with the code ... I have a problem with the varying results that don't seem to correlate with any aspect of hardware/software that I am familiar with.

Thanks for any insight!

- Chris
Posted on 2002-10-11 00:11:16 by bit
Chris,

Just as a rough guess, I would suggest that the random algo is slowing things up enough for the times to be much closer between versions. Just try this, make sure you align the code in the asm version to at least 4 and see if it makes any difference.

Regards,

hutch@movsd.com
Posted on 2002-10-11 02:39:41 by hutch--
Hi Chris,
Post the asm source, the C source and the asm output of the C compiler.
Posted on 2002-10-11 03:15:44 by Maverick
Hey Hutch,

I thought about that, but why wouldn't both versions slow down accordingly? The C version runs a smidge faster without random numbers (which is to be expected), but why would the asm version more than double in speed? Shouldn't they experience the same speed-up and slow-down?

Maverick,

I included the source to both functions, the C version's disassembly, and an already built library file. Let me know if there is anything you would like to see.

- Chris
Posted on 2002-10-11 10:11:08 by bit
The other key thing is that when I place it in full screen mode with hard-coded numbers the asm version runs a little slower than he C. But, in windowed mode is where I get that performance boost.

It has me completely baffled.

- Chris
Posted on 2002-10-11 15:50:18 by bit
I think there is too much branching. :)
How about an inner (and outer ;)) loop like this:
; X increases with each loop


GRADIENT_Y EQU <eax>
COLOR EQU <ebx>
INDEX EQU <ecx>
TEMP EQU <edx>
BUFFER EQU <edi>
PITCH EQU <esi>
DELTA_YoX EQU <ebp>

; INDEX = x2 - x1 ; x2 < x1
; BUFFER = {memory} + y2 - INDEX
; DELTA_YoX = 2^32 * [ ABS(y1-y2) / (x1-x2) ]
; GRADIENT_Y = 0
; PITCH = -pitch if y2 < y1 else pitch
_3:
add GRADIENT_Y, DELTA_YoX
sbb TEMP, TEMP

mov [BUFFER + INDEX*4], COLOR

and TEMP, PITCH
inc INDEX

lea BUFFER, [BUFFER + TEMP]
jne _3




; Y increase / decreases with each loop

GRADIENT_X EQU <eax>
COLOR EQU <ebx>
INDEX EQU <ecx>
BUFFER EQU <edi>
PITCH EQU <esi>
DELTA_XoY EQU <ebp>

; INDEX = x2 - x1 ; x2 < x1
; BUFFER = {memory} + y2 - INDEX
; DELTA_XoY = 2^32 * [ ABS(x1-x2) / (y1-y2) ]
; GRADIENT_X = 0
; PITCH = -pitch if y2 < y1 else pitch
_3:
mov [BUFFER + INDEX*4], COLOR
add BUFFER, PITCH

add GRADIENT_X, DELTA_XoY
adc INDEX, 0
jne _3
You'll also need a version for dx < dy
Added code for dx < dy

Your going to be bound by memory speed.
I think that is what you were testing. :)
Try testing a cache size virtual screen on random data.
Posted on 2002-10-11 19:23:39 by bitRAKE
Great Idea bitRake. I was trying to figure out how to eliminate that one branch in the inner loop, and I think that may be a great way to do it.

But, do you have any idea why the assembly version slows down so much when using random lines, or when placed in full screen mode? That is my main concern in all of this. Something really strange is going on, and I am not sure what. I can understand my routine not being faster than the C version (although that doesn't seem very likely loking at the code). What I don't understand is why the thing runs at almost double the speed when tested in windowed mode with random lines, but a smidge slower using random lines or when placed in full screen mode.

The varying results are driving me nuts! :confused:

- Chris
Posted on 2002-10-11 19:33:55 by bit
Ooops ... I posted before you edited your post. :)

I thought the memory might have been an issue.

But, why does it slow down when the hardcoded line version is tested in full screen? The memory is the same. I have tried using both system memory and video memory. When in windowed mode I get about double the lines per second rate, but when I switch that same code to use full screen the assembly version all of the sudden becomes slower than the C (but not by much).

However, this explanation makes sense for the slowdown when switching to randomized lines. In that case, cahce hits/misses could bring the performance of the assembly version down to that of the C version ... but why doesn't the C version slow down more too?

- Chris
Posted on 2002-10-11 19:41:55 by bit
Think about having 90+ cycles between memory accesses - both should run about the same on random data. I'm not sure why windowed mode is faster, but I doubt it has anything to do with your code. Rather than random data, try running a test of every line from the corners to every pixel in say a 256x256 space - don't use the same corner twice in a row and use relitive offsets for the pixels. And if you cycle the colors it will look pretty. :)
Posted on 2002-10-11 19:59:59 by bitRAKE
No, I don't think my code has much to do with windowed mode being faster. I just wonder why the asm version is faster with hard-coded data, but not with random data when it is the same data for each test!

Maybe' I'll try that test later tonight/tomorrow and see what happens.

- Chris
Posted on 2002-10-11 20:03:02 by bit

No, I don't think my code has much to do with windowed mode being faster. I just wonder why the asm version is faster with hard-coded data, but not with random data when it is the same data for each test!
Oh, thought I answered that - caching.
That is why I suggested to try a virtual screen the size of the cache.

Also, note how a slight change in the algo for cheap anti-aliased effect lines...
;#############################

; Anti-Aliased Versions:

; Blend two pixels, writing the result in the destination
Pixel_Blend MACRO Dest, Source
ENDM

; X increases with each loop

GRADIENT_Y EQU <eax>
COLOR EQU <ebx>
INDEX EQU <ecx>
TEMP EQU <edx>
BUFFER EQU <edi>
PITCH EQU <esi>
DELTA_YoX EQU <ebp>

; INDEX = x2 - x1 ; x2 < x1
; BUFFER = {memory} + y2 - INDEX
; DELTA_Y = 2^32 * [ ABS(y1-y2) / (x1-x2) ]
; GRADIENT_Y = 0
; PITCH = -pitch if y2 < y1 else pitch
_3:
add GRADIENT_Y, DELTA_YoX
sbb TEMP, TEMP

mov COLOR_A, GRADIENT_Y
and COLOR_A, 0FF000000h
or COLOR_A, COLOR
push COLOR_A
xor COLOR_A, 0FF000000h

Pixel_Blend [BUFFER + INDEX*4], COLOR_A

pop COLOR_A
lea TEMP_2, [BUFFER + PITCH]
Pixel_Blend [TEMP_2 + INDEX*4], COLOR_A

and TEMP, PITCH
inc INDEX

lea BUFFER, [BUFFER + TEMP]
jne _3
Posted on 2002-10-11 20:05:01 by bitRAKE
Sorry, I guess I wasn't clear with that last statement. What I really meant was:

I just wonder why the asm version is faster than the C version with hard-coded data, but not faster than the C version with random data when it is the same data for the C version and the Asm version! The caching problem would explain a slowdown from hard-coded lines to random lines since different areas of a 800x600x32bit buffer would be pulled in. But, it doesn't explain why the C version doesn't have the same amount of a slowdown since it has to pull in the same buffer areas during its execution.

Thanks for the AA version. I have a Wu anti-aliased version in C but hadn't yet converted it to assembly. I am still deciding whether or not it is worth doing the conversion. If the assembly version isn't that much faster than the C, then I may not even bother.



My latest benchmarks are as follows ...

Windowed Mode:
- ASM Version Hard-Coded (0, 0)-(799, 599) = 90,909 lines per second
- C Version Hard-Coded (0, 0)-(799, 599) = 40,540 lines per second

Full Screen Mode:
- ASM Version Hard-Coded (0, 0)-(799, 599) = 17,400 lines per second
- C Version Hard-Coded (0, 0)-(799, 599) = 17,045 lines per second

So, I would expect the full screen version to run at 38,222 lines per second. In other words, at the same ratio as in windowed mode. Slow-downs in code are reasonable because of caching and other things going on in the system ... I just expect a similar slowdown for another routine that performs the same task. It just doesn't make sense to me how the assembly version can slow down by 80% and the C version only slows down by 57% when they are both being profiled on the same data set.

Am I making any sense at all? I have been trying to sort this problem out for a few days now and it just isn't meshing well in my mind.

- Chris
Posted on 2002-10-11 22:47:13 by bit
Here are the results of trying the two routines on a 100x100x32bit system allocated buffer.

Windowed Mode:
- ASM Version Hard-Coded (0, 0)-(99, 99) = 1,851,851 lines per second
- C Version Hard-Coded (0, 0)-(99, 99) = 516,351 lines per second

Full Screen Mode:
- ASM Version Hard-Coded (0, 0)-(99, 99) = 1,886,792 lines per second
- C Version Hard-Coded (0, 0)-(99, 99) = 559,179 lines per second

=====

Windowed Mode:
- ASM Version Random Lines = 525,394 lines per second
- C Version Random Lines = 414,078 lines per second

Full Screen Mode:
- ASM Version Random Lines = 525,394 lines per second
- C Version Random = 419,580 lines per second


Now that makes sense!!!

Obviously the first test hits the special cased diagonal line code and that is why it is so fast. But, these numbers are in line with my initial test results that I performed on hard coded lines (only in windowed mode). Those results were Vertical Lines: 15% faster - Horizontal Lines: 37% faster - Diagonal Lines: 55.7% faster - Random Lines: 53.8% faster.

So, it obviously has something to do with memory ... but why/what/how is only one of the routines (the asm version) slowing down tremendously? Why doesn't the C routine do the same? This test proves that my gut feeling on the asm routine being a lot faster was indeed correct. But, why isn't it showing so?

- Chris
Posted on 2002-10-11 23:08:50 by bit
The answer is probably just simplier than you imagined - let me explain each test result in very general terms:

Windowed Mode:
All test are only run for a single line, so we can assume the line is in the
CPU cache - windows is controling the copy of memory to video.
- ASM Version Hard-Coded (0, 0)-(799, 599) = 90,909 lines per second
This version runs as fast as memory will allow (assumed because all memory is in the cache).
- C Version Hard-Coded (0, 0)-(799, 599) = 40,540 lines per second
This version is slower by over half.

Full Screen Mode:
All test are only run for a single line, so we can assume the line is in the
CPU cache, but it doesn't matter because we are writing directly to video memory.
- ASM Version Hard-Coded (0, 0)-(799, 599) = 17,400 lines per second
This version hits memory wall.
- C Version Hard-Coded (0, 0)-(799, 599) = 17,045 lines per second
This version hits memory wall.

Basically, the full screen versions can't go faster than memory allows - there is no logic in calculating a ratio between slow downs of the routines as the slow down is not based on the performance of the code, but directly on the speed of video memory transfer. The assembler routine has much more to loose by waiting around and doing nothing!

Moral of the story:
If you are going to code in asm for speed: Do not wait on memory - keep the CPU busy. And on todays faster processors you have many cycles to be very creative with.

Your second test show that ASM code is always waiting on memory. :)
Posted on 2002-10-11 23:15:31 by bitRAKE
bitRake,

Yeah ... it makes sense when put in those terms. :) I remember reading Michael Abrash's Black Book. In the chapter "Lightning Lines and Dead Cats" he talks about how the performance of the VGA display adapter was causing his ASM routine to be only twice as fast as the C version, but in reality it was actually 4 times as fast. So, that was the first thing I looked at. But, for whatever reason the explanation you just gave didn't "click" for me ... mainly because of what I am going to ask next. I think I was mainly being thrown off by the windowed mode results, but since that was simply a matter of it being in the cache it makes more sense now that the full screen version wasn't achieving those numbers since it was using video memory.

So, the other question I have (and why it didn't click for me) is this:

I created a buffer with malloc() that was the same size as the screen (800x600x32). Then I ran the same tests using that system memory buffer. With the same result as the video memory. I take it this means the ~55,000 lines per second I was getting for both routines is basically just the maximum amount of time needed for memory to be pulled in to cache and written to?

And, a few extra just out of curiousity ...

1 - Since I might actually be wasting a ton of cycles it seems reasonable to try and create an optimized version of the Wu anti-aliasing routine. The extra calculations per pixel could be done while waiting for memory to be pulled in since an 800x600x32 screen is never going to fit in a 64k, or even 128k, cache. What do you think of this plan?

2 - Is it even worth it to rewrite this thing in assembly? And, is it worth it to recode so that branch is out of the inner loop? I am still up in the air. I wonder how many machines are going to have the bottleneck on memory. I suppose the asm version would allow for faster times on the lower end machines.

Any thoughts are appreciated, as was all your guys' help. bitRake ... an extra thanks to you for the clarification I needed to see what was already in front of me. :)


- Chris
Posted on 2002-10-11 23:47:50 by bit
1. That is a good idea. :)

2. Yes, but it is not easy. I suggest studying the memory access characteristics of the CPUs (and Windows) and developing algorithms designed around those limitations. The above line algo has been used for ages, but it can be further optimized for modern processors - be creative.

And you should ask yourself why you need so many lines per second. :)
My GeForce will beat any lines the CPU could draw.
Posted on 2002-10-12 00:08:44 by bitRAKE
There really isn't a need for so many lines. :) This was just a little exercise for fun. However, when I start writing my 3d software engine, simple line tracing is needed for determing the correct scan lines when rendering triangles. But the act of just drawing lines probably won't be needed. Hardware is a nice thing, but there are still a ton of machines that lack a good video card. For instance, my laptop that I do almost all of my work on doesn't have a good one at all.

As for the algo ... I have tried the simple, straight-forward Bresenham's as well as the run-slice version, and symmetric algorithms. All variations I have tried have fallen a little short of the simple one. My guess is that all the variables that need to be maintained can't all be placed in registers and thus a lot of time is spent moving things around. But, I haven't really investigated it completely.

One other thing is that with the little optimization you showed earlier the inner loop should be quite clean, and insanely fast. Especially for the Pentiums and their mis-predicted branching tendencies. :)

- Chris
Posted on 2002-10-12 08:25:02 by bit
Originally posted by bitRAKE
And you should ask yourself why you need so many lines per second. :)
My GeForce will beat any lines the CPU could draw.


One little question, how to instruct a GeForce (or any other gfx-card) to draw lines, plot pixles, etc?
Posted on 2002-10-12 12:25:48 by scientica
Use one of these:
IDirect3DDevice8::DrawIndexedPrimitive
IDirect3DDevice8::DrawIndexedPrimitiveUP
IDirect3DDevice8::DrawPrimitive
IDirect3DDevice8::DrawPrimitiveUP

With one of these:
D3DPT_POINTLIST - Renders the vertices as a collection of isolated points.
D3DPT_LINELIST - Renders the vertices as a list of isolated straight line segments.
D3DPT_LINESTRIP - Renders the vertices as a single polyline.
Posted on 2002-10-12 13:27:38 by bitRAKE
I implemented something similar also, but I didnt profile it and it's very buggy. Anyway here it is, please don't criticize too harshly, I converted it from C directly from GameDev ;)



mov ax,x2
mov bx,x1
sub ax,bx
mov deltax,ax ;Store DELTAX in LOCAL stack RAM
mov ax,y2
mov bx,y1
sub ax,bx
mov deltay,ax ;Store DELTAY in LOCAL stack RAM
mov ax,x1
mov bx,x2
;AX==X1, BX==X2
.IF BX >= AX
mov xinc1,1
mov xinc2,1
.ELSE
mov xinc1,-1
mov xinc2,-1
.endif
mov ax,y1
mov bx,y2
.IF BX >= AX
mov yinc1,1
mov yinc2,1
.ELSE
mov yinc1,-1
mov yinc2,-1
.ENDIF
;Initialized structures
mov ax,deltax
mov bx,deltay
.IF AX >=BX
xor ax,ax
mov xinc1,ax
mov yinc2,ax
mov ax,deltax
mov den,ax
shr ax,1
mov num,ax
mov ax,deltay
mov numadd,ax
mov ax,deltax
mov numpixels,ax
.ELSE
xor ax,ax
mov xinc2,ax
mov yinc1,ax
mov ax,deltay
mov den,ax
shr ax,1
mov num,ax
mov ax,deltax
mov numadd,ax
mov ax,deltay
mov numpixels,ax
.endif
;Precomputation is done, now time to draw line
;EAX==X
;EBX==Y
;ECX==Color
;EDX==NumPixel counter
movzx eax,x1
movzx ebx,y1
movzx ecx,color
lineloop:
invoke DrwDot, eax, ebx, ecx
mov si,num
add si,numadd
mov num,si
.IF SI >=den
sub si,den
mov num,si
add ax,xinc1
add bx,yinc1
inc dx
cmp dx,numpixels
je exitlinelp
jmp lineloop
.endif
add ax,xinc2
add bx,yinc2
inc dx
cmp dx,numpixels
je exitlinelp
jmp lineloop
exitlinelp:
ret
Posted on 2002-10-12 18:12:25 by x86asm