Hi there,

I started checking the bandwidth to/from my PCIe card (ATI x800xt), with some disappointing results:


--------------------------------------

;P4 3.2GHz HT, ATI_x800xt PCIe - measurement results:  (LU=LoopUnroll)
;===============================================
;

; Read from videomem (1024x768x32, x800xt)    Write to videomem
; ----------------------------------------    -----------------
;
;mov al,,      2MB, LU=1 :       GB/s         ;mov ,al,      2MB, LU=1 : 1.31 GB/s
;mov al,,      2MB, LU=2 :       GB/s         ;mov ,al,      2MB, LU=2 : 1.31 GB/s
;mov al,,      2MB, LU=4 :       GB/s         ;mov ,al,      2MB, LU=4 : 1.31 GB/s
;mov al,,      2MB, LU=8 :       GB/s         ;mov ,al,      2MB, LU=8 : 1.31 GB/s
;mov al,,      2MB, LU=16:       GB/s         ;mov ,al,      2MB, LU=16: 1.31 GB/s
;
;mov ax,,      2MB, LU=1 :       GB/s         ;mov ,ax,      2MB, LU=1 : 1.31 GB/s
;mov ax,,      2MB, LU=2 :       GB/s         ;mov ,ax,      2MB, LU=2 :      GB/s
;mov ax,,      2MB, LU=4 :       GB/s         ;mov ,ax,      2MB, LU=4 :      GB/s
;mov ax,,      2MB, LU=8 :       GB/s         ;mov ,ax,      2MB, LU=8 :      GB/s
;mov ax,,      2MB, LU=16:       GB/s         ;mov ,ax,      2MB, LU=16:      GB/s
;
;mov eax,,     2MB, LU=1 : 0.008 GB/s         ;mov ,eax,     2MB, LU=1 : 1.31 GB/s
;mov eax,,     2MB, LU=2 :       GB/s         ;mov ,eax,     2MB, LU=2 :      GB/s
;mov eax,,     2MB, LU=4 :       GB/s         ;mov ,eax,     2MB, LU=4 :      GB/s
;mov eax,,     2MB, LU=8 :       GB/s         ;mov ,eax,     2MB, LU=8 :      GB/s
;mov eax,,     2MB, LU=16:       GB/s         ;mov ,eax,     2MB, LU=16:      GB/s
;
;movq mm0,,    2MB, LU=1 : 0.015 GB/s         ;movq ,mm0,    2MB, LU=1 : 1.31 GB/s
;movq mm0,,    2MB, LU=2 :       GB/s         ;movq ,mm0,    2MB, LU=2 :      GB/s
;movq mm0,,    2MB, LU=4 :       GB/s         ;movq ,mm0,    2MB, LU=4 :      GB/s
;movq mm0,,    2MB, LU=8 :       GB/s         ;movq ,mm0,    2MB, LU=8 :      GB/s
;movq mm0,,    2MB, LU=16:       GB/s         ;movq ,mm0,    2MB, LU=16:      GB/s
;
;movaps xmm0,, 2MB, LU=1 : 0.016 GB/s         ;movaps ,xmm0, 2MB, LU=1 : 1.31 GB/s
;movaps xmm0,, 2MB, LU=2 : 0.016 GB/s         ;movaps ,xmm0, 2MB, LU=2 : 1.31 GB/s
;movaps xmm0,, 2MB, LU=4 :       GB/s         ;movaps ,xmm0, 2MB, LU=4 :      GB/s
;movaps xmm0,, 2MB, LU=8 :       GB/s         ;movaps ,xmm0, 2MB, LU=8 :      GB/s
;movaps xmm0,, 2MB, LU=16:       GB/s         ;movaps ,xmm0, 2MB, LU=16:      GB/s
;


; Read from memory                            Write to memory
; ----------------                            ---------------
;
;mov al,,      20MB, LU=1 : 1.14 GB/s         ;mov ,al,      20MB, LU=1 : 1.69 GB/s
;mov al,,      40MB, LU=2 : 1.72 GB/s         ;mov ,al,      40MB, LU=2 : 1.75 GB/s
;mov al,,      80MB, LU=4 : 1.85 GB/s         ;mov ,al,      80MB, LU=4 : 1.73 GB/s
;mov al,,      80MB, LU=8 : 2.09 GB/s         ;mov ,al,      80MB, LU=8 : 1.74 GB/s
;mov al,,      80MB, LU=16: 2.07 GB/s         ;mov ,al,      80MB, LU=16: 1.72 GB/s
;
;mov ax,,      40MB, LU=1 : 2.19 GB/s         ;mov ,ax,      40MB, LU=1 : 1.87 GB/s
;mov ax,,      80MB, LU=2 : 3.09 GB/s         ;mov ,ax,      80MB, LU=2 : 1.86 GB/s
;mov ax,,      80MB, LU=4 : 3.37 GB/s         ;mov ,ax,      80MB, LU=4 : 1.86 GB/s
;mov ax,,      80MB, LU=8 : 3.35 GB/s         ;mov ,ax,      80MB, LU=8 : 1.87 GB/s
;mov ax,,      80MB, LU=16: 3.33 GB/s         ;mov ,ax,      80MB, LU=16: 1.86 GB/s
;
;mov eax,,     80MB, LU=1 : 3.81 GB/s         ;mov ,eax,     80MB, LU=1 : 1.86 GB/s
;mov eax,,     80MB, LU=2 : 3.88 GB/s         ;mov ,eax,     80MB, LU=2 : 1.87 GB/s
;mov eax,,     80MB, LU=4 : 3.90 GB/s         ;mov ,eax,     80MB, LU=4 : 1.87 GB/s
;mov eax,,     80MB, LU=8 : 3.89 GB/s         ;mov ,eax,     80MB, LU=8 : 1.87 GB/s
;mov eax,,     80MB, LU=16: 3.90 GB/s         ;mov ,eax,     80MB, LU=16: 1.86 GB/s
;
;movq mm0,,    80MB, LU=1 : 4.14 GB/s         ;movq ,mm0,    80MB, LU=1 : 1.86 GB/s
;movq mm0,,    80MB, LU=2 : 4.16 GB/s         ;movq ,mm0,    80MB, LU=2 : 1.86 GB/s
;movq mm0,,    80MB, LU=4 : 4.16 GB/s         ;movq ,mm0,    80MB, LU=4 : 1.86 GB/s
;movq mm0,,    80MB, LU=8 : 4.16 GB/s         ;movq ,mm0,    80MB, LU=8 : 1.86 GB/s
;movq mm0,,    80MB, LU=16: 4.16 GB/s         ;movq ,mm0,    80MB, LU=16: 1.86 GB/s
;
;movaps xmm0,, 80MB, LU=1 : 4.74 GB/s         ;movaps ,xmm0, 80MB, LU=1 : 1.85 GB/s
;movaps xmm0,, 80MB, LU=2 : 4.76 GB/s         ;movaps ,xmm0, 80MB, LU=2 : 1.85 GB/s
;movaps xmm0,, 80MB, LU=4 : 4.76 GB/s         ;movaps ,xmm0, 80MB, LU=4 : 1.85 GB/s
;movaps xmm0,, 80MB, LU=8 : 4.76 GB/s         ;movaps ,xmm0, 80MB, LU=8 : 1.85 GB/s
;movaps xmm0,, 80MB, LU=16: 4.76 GB/s         ;movaps ,xmm0, 80MB, LU=16: 1.85 GB/s

--------------------------------------

I must be doing something wrong?!
Reading from videomem (dword) at only 8MB/s !?!
The max. should be 4GB/s to/from video memory ?!

VShader

Posted on 2005-03-28 04:08:04 by VShader

  I had heard that no video card vendor is using the full power of PCIe yet. ATI and Nvidia each did it differently, but neither has a part that can use the full potential of PCIe. I don't know how much of a difference it will make when they do, because most games send all the data for a level to the card when the level starts and then operate on that data on the video card. That's why video game tests comparing AGP, AGP 2x and AGP 4x hardly show any FPS difference between the modes. I think going from AGP to AGP 4x gives an 8% gain in performance, if that. Games were written around the fact that there isn't much bandwidth, so they try to send as much data as they can at the start of each level. As a side note, reads from video memory have always been really expensive, so the fact that your reads from video memory were a lot slower was no surprise to me.
Posted on 2005-03-28 09:37:45 by mark_larson
Even when video card manufacturers start utilizing PCIe properly, readbacks will probably remain slow when operating in 3D mode, as you have to "sync up" the pipelines. Just a guess, of course.

It wouldn't surprise me if the current PCIe cards work more or less as AGP cards right now, just like the early AGP cards were basically PCI cards with a different interface.
Posted on 2005-03-28 10:26:02 by f0dder
Hehe,

my first hope for PCIe was:

6.4 GB/s to main memory + 4 GB/s TO vidmem + (simultaneous) 4 GB/s FROM vidmem
= 14.4 GB/s bandwidth.

Then I realized that in the P4 architecture everything has to go through the (south?)-bridge -> so the max. is then 6.4 GB/s.
(Perhaps the 14.4 GB/s could be theoretically possible on AMD64).

But only 16 MB/s from videomem is disappointing.

I am certainly not complaining about the power of the x800xt, but I hoped to do some crunching with the GPU (pixel shader) and then read the results back. At only 16 MB/s (only 0.4% of the theoretical max.) that idea is moot.
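For reference, the 0.4% figure can be reproduced with a few lines of arithmetic. This is only a sketch: the function name is made up for illustration, and it assumes decimal units (1 GB = 1000 MB) and the 4 GB/s per-direction peak quoted in this thread.

```python
def share_of_peak(measured_mb_s, peak_gb_s):
    """Return measured bandwidth as a fraction of the peak (assumes 1 GB = 1000 MB)."""
    return measured_mb_s / (peak_gb_s * 1000.0)

PEAK_GB_S = 4.0  # per-direction PCIe figure quoted in the thread (assumption)

# Best measured rates from the tables above, converted to MB/s.
read_share = share_of_peak(16.0, PEAK_GB_S)     # movaps read from vidmem
write_share = share_of_peak(1310.0, PEAK_GB_S)  # write to vidmem

print(f"read:  {read_share:.1%} of peak")
print(f"write: {write_share:.1%} of peak")
```

Under the same assumptions, the measured write rate works out to roughly a third of the quoted peak, so the shortfall is almost entirely on the read side.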

Perhaps I can sue someone for misleading marketing? :P


VShader
Posted on 2005-03-28 16:08:59 by VShader
Btw, could you post the source of your test program here? I have an AMD64 with a GeForce 6600 PCIe card, so it might be interesting to see which speeds I get.
Posted on 2005-03-29 07:52:16 by f0dder
>> Btw, could you post the source of your test program here?

Yes, of course.

I put it on my "homepage" (scroll down).

http://rz-home.de/~eugen/

I did it in my 3D-engine (no development for the last half year - going from P200mmx to P4_HT_3.2Ghz+ATIx800xt opens a world of cheap and cool games ... but I think I will start coding again soon).

There is the complete engine with source (newest version).
Full 3D-Clipping and with DSound/DInput
Cleaned up and translated.
I used masm V6.15.8803 for sse2 instructions (but test can run on pentium too)
Search for "bandwidthtests" in the code.
Change parameters in the block "All you have to change" for read/write-tests (CAUTION: max.blocksize!).
Toggle Info-overlay with F2.
Additional info about the engine in the thread "Quickk reference to MMX/SSE/SSE2 ..."
The DebugIt-block is normally empty and gets moved around in the sourcecode for quick debugging.


The read/write instructions are straightforward: unrolled mov (1..4 bytes) / movq (8 bytes) / movaps (16 bytes) instructions with a little help from macros. Perhaps there are faster ways (let me know!), but the P4 has a hardware data prefetcher which kicks in automatically when you read/write in regular patterns, which is the case here.


I am not giving up yet on a faster way of reading back from vidmem via PCIe. If anyone has a clue, please let me know (dma?, special DirectX-functions?, a special way of allocating a buffer in vidmem instead of reading from the backbuffer?).
It would be too cool to utilize the 22 (6 vertex-engines and 16 pixelengines) floating-point-cores in the GPU @ 500MHz with 32GB/s bandwidth (physics!).

Btw: Hi bitrake! This version includes the fully working octree (mmx-accelerated)!! (Hehe, I have to look deeply to understand it in detail myself now ... but it works stable and fast, I think; I have no other code to compare.)
I am planning to keep this octree with (sse2 enhanced)-integer-simd and change the transforming/clipping to sse-float-simd - should be the best combination. It should work to check the octree-spheres in (4x)16bit integer-math but then do the clipping in (4x)float32 and it should be possible to adjust the integer-resolution so the world can be much bigger but still fast and compact mmx(/mmx2)-octree-vf-culling.

VShader
Posted on 2005-03-29 13:09:49 by VShader
"mmx3d_200503.zip" crashes every time on my P4, GF4 Ti 4200, WinXP PRO SP2.

eax=00000000 ebx=00000000 ecx=00000000 edx=00241a11 esi=0000010b edi=00d7f840
eip=0040744b esp=0012ff44 ebp=0012ff84 iopl=0        nv up ei pl zr na po nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000            efl=00000246

        00407436 eb0a            jmp    MMX3D+0x7442 (00407442)
        00407438 0f7f2c8f        movq    qword ptr [edi+ecx*4],mm5
        0040743c 83eb02          sub    ebx,0x2
        0040743f 83c102          add    ecx,0x2
        00407442 83fb01          cmp    ebx,0x1
        00407445 7df1            jge    MMX3D+0x7438 (00407438)
        00407447 0bdb            or      ebx,ebx
        00407449 7504            jnz    MMX3D+0x744f (0040744f)
------>0040744b 0f7e2c8f    movd dword ptr [edi+ecx*4],mm5 ds:0023:00d7f840=????????
        0040744f 4e              dec    esi
        00407450 0ffef7          paddd  mm6,mm7
        00407453 033df44d4100    add    edi,[00414df4]
        00407459 83fe00          cmp    esi,0x0
        0040745c 7dad            jge    MMX3D+0x740b (0040740b)
        0040745e 90              nop
        0040745f 0f6f1424        movq    mm2,qword ptr [esp]
        00407463 83c408          add    esp,0x8
        00407466 90              nop


ChildEBP RetAddr  Args to Child             
0012ff84 00401ebb 0f360040 0f1353a0 00000134 MMX3D+0x744b
0012ff98 00401194 00000005 00000012 0000031a MMX3D+0x1ebb
0012ffb8 00410174 001406a2 7c816d4f 80000001 MMX3D+0x1194
0012fff0 00000000 004100ce 00000000 78746341 MMX3D+0x10174


stack:
000000000012ff44  00 00 b8 08 c7 51 ff 7f - 00 00 84 4b 1a 53 ff 7f  .....Q.....K.S..
000000000012ff54  00 00 8b 4b e2 52 ff 7f - 38 5d 13 0f 40 6e e8 00  ...K.R..8]..@n..
000000000012ff64  40 00 36 0f 7c ff 12 00 - 88 3a 36 0f 2e 01 00 00  @.6.|....:6.....
000000000012ff74  00 00 00 00 ae 18 d3 6b - 11 1a 24 00 98 ff 12 00  .......k..$.....
000000000012ff84  98 ff 12 00 bb 1e 40 00 - 40 00 36 0f a0 53 13 0f  ......@.@.6..S..
000000000012ff94  34 01 00 00 b8 ff 12 00 - 94 11 40 00 05 00 00 00  4.........@.....
000000000012ffa4  12 00 00 00 1a 03 00 00 - 46 02 00 00 60 00 00 00  ........F...`...
000000000012ffb4  24 03 41 00 f0 ff 12 00 - 74 01 41 00 a2 06 14 00  $.A.....t.A.....
000000000012ffc4  4f 6d 81 7c 01 00 00 80 - 8c da da 00 00 b0 fd 7f  Om.|............
Posted on 2005-03-29 18:17:23 by ti_mo_n
ti_mo_n ,

this is in the scanline-filler (.exe was assembled with 32 bit color-depth and mmx-writes to videomem.)

I added (sorry) the necessary include/lib-files and an assemble/link-batchfile on my homepage to "mmx3d_200503.zip" so you now have a realistic chance to successfully assemble and link the program.
Change the sourcecode to 16 bit (search for "BytesPerPixel" -> from 4 to 2) and assemble again.
The program allocates a 100MB block for the memtest ("P4MemTestBuffer") -> you should remove this if you only want the 3D-engine.
Or if you don't want to assemble again take the mmy3d.exe from the version above on my homepage.


VShader
Posted on 2005-03-30 07:07:53 by VShader
;The fastest transferrates so far on my P4,PCIe:
;----------------------------------------------------------
;read vidmem:
;movaps xmm0,, 2MB, LU=2 : 0.016 GB/s
;write vidmem:
;movaps ,xmm0, 2MB, LU=2 : 1.31 GB/s

;read mem:
;movaps xmm0,, 80MB, LU=2 : 4.76 GB/s
;write mem:
;movaps ,xmm0, 80MB, LU=2 : 1.85 GB/s

VShader
Posted on 2005-03-30 07:12:39 by VShader
  I saw that you were using MOVAPS to write data to memory. Look at using MOVNTPS instead if you have a buffer of 1-2 MB or larger. It's faster for large buffer sizes. Experiment with your code and find out at what buffer size MOVNTPS gets faster than MOVAPS. MOVNTPS writes directly to memory and bypasses the cache. It usually gets faster than MOVAPS on my P4 with a 1-2MB buffer size.

  Actually, now that I think about it, using MOVNTPS might not give you a speedup at all, since the buffer is in video memory. I'd be curious to see you run this test with a 4MB buffer and try both MOVAPS and MOVNTPS and see what the speed difference is. Technically, if the buffer is in video memory, the processor shouldn't cache it. But I am not sure how they set it up. If they do cache it, then MOVNTPS should be faster at some point.
Posted on 2005-03-30 10:35:24 by mark_larson
>> Or if you don't want to assemble again take the mmy3d.exe from the version above on my homepage.

You meant "mmx3d.zip"? ..eh sorry, but i really don't have the time to play with this. i just wanted to run it, and see what it does. never mind that. anyways - mmx3d.zip crashes too.

eax=00000058 ebx=00000000 ecx=0b000000 edx=bdc0bdc0 esi=00000003 edi=00d7fa00
eip=004071a0 esp=0012ff54 ebp=0012ff84 iopl=0        nv up ei pl zr na po nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000            efl=00000246

004071a0 66891447        mov    [edi+eax*2],dx        ds:0023:00d7fab0=????


...since i can't just run it and see - i'll leave it alone :]
Posted on 2005-03-30 12:11:46 by ti_mo_n
mark_larson, you are the man!

moving the backbuffer to ram should bring roughly 3 times the speed in the actual engine!
The speed increase for untouched ram with movntps is cool too!

;P4 Prescott 1MB Cache
;
;vidmem write
;-------------
;movaps  ,xmm0, 1MB, LU=4 :      1.31 GB/s
;movntps ,xmm0, 1MB, LU=4 :      1.32 GB/s
;
;movaps  ,xmm0, 0.614MB, LU=4 :  1.32 GB/s  (640x480x16 Buffer in VIDram)
;movntps ,xmm0, 0.614MB, LU=4 :  1.32 GB/s  (640x480x16 Buffer in VIDram)


;mem write
;-------------
;movaps  ,xmm0, 80MB, LU=4 :     1.75 GB/s
;movntps ,xmm0, 80MB, LU=4 :     4.25 GB/s  :-)
;
;movaps  ,xmm0, 0.2MB, LU=4 :    8.47 GB/s  <= Should make sense
;movntps ,xmm0, 0.2MB, LU=4 :    4.22 GB/s
;
;movaps  ,xmm0, 0.4MB, LU=4 :    6.84 GB/s
;movntps ,xmm0, 0.4MB, LU=4 :    4.26 GB/s
;
;movaps  ,xmm0, 0.614MB, LU=4 :  4.38 GB/s  (640x480x16 Buffer in ram)
;movntps ,xmm0, 0.614MB, LU=4 :  4.23 GB/s  (640x480x16 Buffer in ram)
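
To make the table easier to read, here is a small sketch that works out which store wins at each buffer size and where the "roughly 3 times" estimate comes from. The numbers are copied from the write tables above; the variable and function names are invented for illustration.

```python
# Write rates in GB/s copied from the table above, keyed by buffer size in MB.
RAM_WRITE = {
    80.0:  {"movaps": 1.75, "movntps": 4.25},
    0.2:   {"movaps": 8.47, "movntps": 4.22},
    0.4:   {"movaps": 6.84, "movntps": 4.26},
    0.614: {"movaps": 4.38, "movntps": 4.23},
}
# 640x480x16 buffer placed in video memory instead of RAM.
VIDMEM_WRITE_0614 = {"movaps": 1.32, "movntps": 1.32}

def faster_store(rates):
    """Name of the instruction with the higher measured rate."""
    return max(rates, key=rates.get)

def speedup(fast, slow):
    """Ratio of two rates."""
    return fast / slow

for size_mb in sorted(RAM_WRITE):
    rates = RAM_WRITE[size_mb]
    ratio = speedup(rates["movntps"], rates["movaps"])
    print(f"{size_mb:>6} MB: {faster_store(rates)} wins, movntps/movaps = {ratio:.2f}")

# Same 640x480x16 buffer, movaps into RAM versus movaps into vidmem.
ram_vs_vid = speedup(RAM_WRITE[0.614]["movaps"], VIDMEM_WRITE_0614["movaps"])
print(f"640x480x16 buffer, RAM vs vidmem: {ram_vs_vid:.2f}x")
```

On these measurements movaps is ahead while the buffer fits in the 1MB cache, and movntps is ahead for the 80MB buffer. The RAM versus vidmem ratio for the 640x480x16 buffer is a bit over 3.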


movaps into a backbuffer in memory seems to make sense. Transferring the final image to vidmem at 80 Hz eats only 3.78% -> overdraw in (cached) ram is much cheaper.
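
The cost of the final copy can be checked with a short calculation. This is a sketch under assumptions: the percentage is read as a share of the measured vidmem write bandwidth, units are decimal (1 GB = 10^9 bytes), and the function names are invented for illustration.

```python
def frame_bytes(width, height, bytes_per_pixel):
    """Size of one frame in bytes."""
    return width * height * bytes_per_pixel

def copy_share(bytes_per_frame, refresh_hz, write_gb_s):
    """Fraction of the write bandwidth used by copying one frame per refresh."""
    return bytes_per_frame * refresh_hz / (write_gb_s * 1e9)

size = frame_bytes(640, 480, 2)        # 640x480 at 16 bit = 614400 bytes
share = copy_share(size, 80, 1.31)     # 1.31 GB/s is the measured vidmem write rate
share_130 = copy_share(size, 80, 1.30)

print(f"frame size: {size} bytes")
print(f"share at 1.31 GB/s: {share:.2%}")
print(f"share at 1.30 GB/s: {share_130:.2%}")
```

With 1.31 GB/s this gives about 3.75%, and a rate of 1.30 GB/s gives about 3.78%, so the quoted percentage is consistent with the measurements up to rounding of the write rate.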

VShader
Posted on 2005-03-30 13:55:57 by VShader

>> mark_larson, you are the man!



  Just buy me some brownies, if we ever meet ;) ;) ;) heheee

Posted on 2005-03-31 12:54:09 by mark_larson