The attachment is a simple project that lets you draw on a window. The drawing looks anti-aliased because it is done on a hidden DC twice the width and height of the window; a piece of code then resamples it down to window size, giving smooth lines.
The project itself is really messy: I don't clean up bitmaps and other GDI objects, it's just for testing.
When you draw on the window, after the drawing procedure has been called 30 times, it counts ticks for one call and shows the result. On my Athlon 1.4 GHz I managed to get 45 cycles per *output* pixel (400x300 output pixels). Any optimizations are welcome.


Resample proc uses edi esi ebx lpSrc:DWORD, lpDest:DWORD, dwWidth:DWORD, dwHeight:DWORD
    mov     esi, lpSrc
    mov     edi, lpDest
    mov     edx, dwWidth            ; note: edx is reloaded at @nextline,
    shr     edx, 1                  ; so these two lines have no effect

    mov     ecx, dwWidth            ; ecx: source width in pixels

    pxor    MM4, MM4                ; MM4: zero, used to unpack bytes to words

    mov     ebx, dwHeight
    shr     ebx, 1                  ; ebx: number of output lines
    ALIGN 16
@nextline:

    mov     edx, ecx
    shr     edx, 2                  ; edx: passes per line, 4 source pixels each
    ALIGN 16
@nextpixel:
    movq    MM0, [esi]              ; MM0: X2 R2 G2 B2-X1 R1 G1 B1
    movq    MM1, [esi+8]            ; MM1: X4 R4 G4 B4-X3 R3 G3 B3
    movq    MM2, [esi+4*ecx]        ; MM2: X6 R6 G6 B6-X5 R5 G5 B5
    movq    MM3, [esi+4*ecx+8]      ; MM3: X8 R8 G8 B8-X7 R7 G7 B7

    movq    MM5, MM0
    movq    MM6, MM2

    punpckhbw MM0, MM4              ; MM0: 00 X2 00 R2-00 G2 00 B2
    punpckhbw MM2, MM4              ; MM2: 00 X6 00 R6-00 G6 00 B6
    punpcklbw MM5, MM4              ; MM5: 00 X1 00 R1-00 G1 00 B1
    punpcklbw MM6, MM4              ; MM6: 00 X5 00 R5-00 G5 00 B5

    paddw   MM0, MM2
    paddw   MM0, MM5
    paddw   MM0, MM6
    psrlw   MM0, 2                  ; sum of 4 pixels / 4

    ; second pixel:

    movq    MM5, MM1
    movq    MM6, MM3

    punpckhbw MM1, MM4              ; MM1: 00 X4 00 R4-00 G4 00 B4
    punpckhbw MM3, MM4              ; MM3: 00 X8 00 R8-00 G8 00 B8
    punpcklbw MM5, MM4              ; MM5: 00 X3 00 R3-00 G3 00 B3
    punpcklbw MM6, MM4              ; MM6: 00 X7 00 R7-00 G7 00 B7

    paddw   MM1, MM3
    paddw   MM1, MM5
    paddw   MM1, MM6
    psrlw   MM1, 2                  ; sum of 4 pixels / 4

    ; now: MM0: 00 XQ 00 RQ-00 GQ 00 BQ ; where Q is the first mixed pixel
    ; now: MM1: 00 XP 00 RP-00 GP 00 BP ; where P is the second mixed pixel
    ; output should be: BQ GQ RQ XQ-BP GP RP XP (in mem)
    ;                   XP RP GP BP XQ RQ GQ BQ (in reg)

    packuswb MM0, MM1
    movq    [edi], MM0


    add     esi, 16
    add     edi, 8
    dec     edx
    jnz     @nextpixel

    lea     esi, [esi+4*ecx]        ; skip the second source line of the pair
    dec     ebx
    jnz     @nextline
    emms
    ret
Resample endp


The procedure takes each block of 4 input pixels, calculates the average and writes it out as one output pixel. It handles 8 input pixels (2 output pixels) on every pass through the loop.

About the numbering: if you have two input lines of pixels, the numbering looks like this:


line1: X X X X 1 2 3 4 X X X
line2: X X X X 5 6 7 8 X X X

Each char is a pixel. The first output pixel (Q) is the average of 1, 2, 5 and 6. The second output pixel (P) is the average of 3, 4, 7 and 8.
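
For checking the MMX routine's output, the same averaging can be written as a plain scalar loop. This is only a rough C sketch, not part of the attached project: the name resample_ref is made up, and it reads dwWidth and dwHeight as the source dimensions in pixels (both even), which is how the loop counters above appear to use them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference for the 2x2 averaging described above.
   src: src_w x src_h 32-bit XRGB pixels; dst: (src_w/2) x (src_h/2) pixels. */
static void resample_ref(const uint32_t *src, uint32_t *dst,
                         unsigned src_w, unsigned src_h)
{
    assert(src_w % 2 == 0 && src_h % 2 == 0);

    for (unsigned y = 0; y < src_h / 2; y++) {
        const uint32_t *row0 = src + (size_t)(2 * y) * src_w; /* upper line */
        const uint32_t *row1 = row0 + src_w;                  /* lower line */

        for (unsigned x = 0; x < src_w / 2; x++) {
            /* The four source pixels of one 2x2 block (e.g. 1, 2, 5, 6). */
            const uint32_t p[4] = { row0[2 * x], row0[2 * x + 1],
                                    row1[2 * x], row1[2 * x + 1] };
            uint32_t out = 0;

            /* Average each 8-bit channel separately: sum, then shift
               right by 2, matching paddw/psrlw in the MMX version. */
            for (int shift = 0; shift < 32; shift += 8) {
                uint32_t sum = 0;
                for (int i = 0; i < 4; i++)
                    sum += (p[i] >> shift) & 0xFF;
                out |= (sum >> 2) << shift;
            }
            dst[(size_t)y * (src_w / 2) + x] = out;
        }
    }
}
```

Running both routines over the same source buffer and comparing the destination buffers should show any difference in the packing or the line stepping.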

Thomas
Posted on 2001-12-31 09:20:11 by Thomas
Thomas, what is the reasoning behind 8 pixels in a 4x2 rectangle versus 4 pixels in a 2x2 square? I did look at the code and couldn't think of how to improve it at this time, beyond prefetching the data or hardcoding the width (not good for a window).
Posted on 2002-01-01 11:59:03 by bitRAKE
I use 8 input pixels at a time because that produces 2 output pixels on every pass through the loop. It's like unrolling a loop twice.
Btw, the routine is fast enough for me, but I wondered how it could be optimized further.

Thomas
Posted on 2002-01-01 12:24:35 by Thomas
That was a brain fart. Of course, you're doing two 2x2 blocks, duh!
You really need to use prefetch when you're working with a data-bound routine. Chapter 5, Athlon x86 Optimization Manual - I'm still trying to figure the stuff out myself.
Posted on 2002-01-01 12:48:03 by bitRAKE
I have played with those docs too, but couldn't get MASM to accept the new instructions.
I use this:


.686
.MMX
.XMM
.K3D


but it says "instruction or register not accepted in current CPU mode" even for a simple instruction without operands like sfence.

How did you set up MASM?

Thomas
Posted on 2002-01-01 13:06:22 by Thomas
I figured it out: the directives are order-dependent. This is the right order:


.686
.MMX
.K3D
.XMM


Thomas
Posted on 2002-01-01 13:20:37 by Thomas
Haven't tried sfence - might not support it, but these seem to work:
.XMM

prefetchnta [ecx+060h]
prefetchnta [edx+060h]
Posted on 2002-01-01 13:39:49 by bitRAKE
All instructions work fine now. I modified the code like this:


[....]
ALIGN 16
@nextpixel:
prefetchnta [esi+8*eax+512] ; added
[....]
packuswb MM0, MM1

add esi, 16
add eax, 2

movntq qword ptr [edi], MM0 ; changed: was movq [edi], MM0

add edi, 8
dec edx
jnz @nextpixel
[....]

lea esi, [esi+4*ecx]
dec ebx
jnz @nextline
sfence ; added
emms


This resulted in an improvement of 5 clocks per output pixel (now 40 cycles/output pixel). Not bad.

Thomas
Posted on 2002-01-01 13:45:54 by Thomas
SSE-specific versions for AMD/Intel could be half the size (12 instructions total) because of the byte-averaging instruction, PAVGB. Maybe smaller? You could use word averaging, but all you'd save is the shift instructions.
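
One thing to watch, assuming the byte-averaging instruction meant here is PAVGB (or 3DNow!'s PAVGUSB): each lane computes (a + b + 1) >> 1, so it rounds up, while the posted routine sums four values and shifts, which truncates. A small scalar C sketch of the two, with made-up function names, just to show the arithmetic:

```c
#include <stdint.h>

/* What one byte lane of a rounding byte average computes (illustrative). */
static uint8_t avg_round_up(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b + 1) >> 1);
}

/* Four values averaged in two stages, as two rounds of byte averaging
   would do it: first each pair, then the two intermediate results. */
static uint8_t avg4_cascaded(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    return avg_round_up(avg_round_up(a, b), avg_round_up(c, d));
}

/* What the posted MMX routine computes per channel: sum, then shift by 2. */
static uint8_t avg4_truncated(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    return (uint8_t)((a + b + c + d) >> 2);
}
```

For the inputs 0, 1, 0, 0 the cascaded version gives 1 where the truncating one gives 0, so a byte-averaging rewrite can come out one step brighter on some pixels. Probably invisible for this use, but the output would not be bit-identical.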
Posted on 2002-01-01 19:39:18 by bitRAKE