If you had to fill a contiguous 4 MB block of memory using only 32-bit GPRs, how would you do it?
Posted on 2007-03-31 08:45:14 by XCHG
rep stosd
Posted on 2007-03-31 10:22:13 by JimG

rep stosd

Agreed. It does fairly well with large memory sizes and static data. Imagine something like "repz stosq" :P
Posted on 2007-03-31 10:46:56 by SpooK
Additionally, the Intel manuals state that "rep stosd" is the fastest method to fill large chunks of memory. On P4 and above, there is also "fast string mode", where instructions like "rep stosd" execute even faster if they meet the required criteria.
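
For anyone who wants to try this from C, here is a minimal sketch of a "rep stosd" fill using GCC/Clang inline assembly. The name `fill_dwords` and the plain-loop fallback for non-x86 targets are assumptions made for illustration, not anything taken from the manuals. It relies on the direction flag being clear, which the usual calling conventions guarantee on function entry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: store `value` into `count` consecutive dwords at `dst`. */
static void fill_dwords(uint32_t *dst, uint32_t value, size_t count)
{
    assert(dst != NULL || count == 0);
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    /* EDI/RDI = destination, ECX/RCX = dword count, EAX = fill value.
       Assumes DF = 0 (ascending addresses), as the ABI guarantees. */
    __asm__ volatile("rep stosl"
                     : "+D"(dst), "+c"(count)
                     : "a"(value)
                     : "memory");
#else
    /* Portable fallback with the same effect. */
    while (count--)
        *dst++ = value;
#endif
}
```

Zeroing a 4 MB block would then be `fill_dwords(block, 0, (4u << 20) / 4)`, assuming `block` points to at least 4 MB of writable memory.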
Posted on 2007-03-31 14:50:08 by ti_mo_n
Alrighty. Thank you guys.
Posted on 2007-04-01 09:28:41 by XCHG
Alternatively, you might want to write an MMX store routine, or an SSE routine that uses non-temporal (streaming) stores so you don't pollute the CPU cache.
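
A rough sketch of the cache-bypassing variant, written with SSE2 intrinsics rather than raw assembly so it is easy to test. The name `zero_nontemporal` is made up for this example, and the `__SSE2__` check assumes GCC or Clang; without SSE2 it simply falls back to memset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: zero `len` bytes at `dst`, bypassing the cache
   where SSE2 is available. */
static void zero_nontemporal(void *dst, size_t len)
{
    unsigned char *p = (unsigned char *)dst;
    assert(p != NULL || len == 0);
#if defined(__SSE2__)
    /* Plain byte stores until p is 16-byte aligned, as MOVNTDQ requires. */
    while (len > 0 && ((uintptr_t)p & 15u) != 0) {
        *p++ = 0;
        len--;
    }
    const __m128i zero = _mm_setzero_si128();
    for (; len >= 16; p += 16, len -= 16)
        _mm_stream_si128((__m128i *)p, zero);   /* MOVNTDQ */
    _mm_sfence();   /* order the streaming stores before later stores */
#endif
    if (len > 0)
        memset(p, 0, len);   /* tail, or everything without SSE2 */
}
```

Streaming stores only pay off when the block is large and is not read again right away; for small buffers the ordinary cached path is usually the better choice.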
Posted on 2007-04-03 10:05:29 by f0dder
Thank you, f0dder, but there are two problems here. I don't want to assume that the computer running this program has an FPU, and as I've said before, I am not really good with FPU instructions or anything related to them. For example, this is the best I can do for an MMX ZeroMemory procedure:

    ; void __MMXZeroMemory (void* Destination, DWORD Length)
    PUSH    EAX                                      ; Push the accumulator onto the stack
    PUSH    ECX                                      ; Push the count register onto the stack
    PUSH    EDX                                      ; Push the data register onto the stack
    PUSH    EDI                                      ; Push the destination index onto the stack
    PUSH    EBP                                      ; Push the base pointer onto the stack
    MOV    EBP , ESP                                ; Move the stack pointer to the base pointer
    MOV    ECX , DWORD PTR [EBP + 0x1C]  ; *ECX = The Length parameter
    TEST    ECX , ECX                                ; See if the requested length is zero
    JZ      .EP                                      ; Jump to the end of the procedure if yes
    MOV    EDI , DWORD PTR [EBP + 0x18]  ; *EDI = The Destination parameter
    MOV    EAX , ECX                                ; *EAX = The Length parameter
    XOR    EDX , EDX                                ; Clear the buffer for Byte-moves
    SHR    ECX , 0x00000003                          ; *ECX = Number of QWORDs that we have to move to the Destination
    AND    EAX , 0x00000007                          ; *EAX = Number of parity bytes that we have to move
    TEST    ECX , ECX                                ; See if the number of QWORDs to move is zero
    JZ      .MoveBytes                                ; Start moving bytes (Up to 7 bytes) if yes
    PXOR    MM0 , MM0                                ; The 8-Byte MMX register is zero
    .MoveQWORDs:                                      ; Start zeroing memory 8 bytes at a time
      MOVQ    QWORD PTR [EDI] , MM0            ; Zero the current QWORD
      ADD    EDI , 0x00000008                        ; Move to the next QWORD in the destination
      DEC    ECX                                    ; Decrement the number of QWORDs to move
      JNZ    .MoveQWORDs                            ; Keep moving QWORDs to the destination
    EMMS                                              ; Empty MMX Technology State after the MMX code, so later FPU code works
    TEST    EAX , EAX                                ; See if the number of parity bytes to move is zero
    JZ      .EP                                      ; Jump to the end of the procedure if yes
    .MoveBytes:                                      ; We are left to move up to 7 bytes now
      MOV    BYTE PTR [EDI] , DL              ; Write one byte to the current destination's location
      INC    EDI                                    ; Move to the next byte of the destination
      DEC    EAX                                    ; Decrement the number of bytes to move
      JNZ    .MoveBytes                              ; Keep moving parity bytes while EAX>0
    .EP:                                              ; End of the procedure routine
      POP    EBP                                    ; Restore the base pointer
      POP    EDI                                    ; Restore the destination index
      POP    EDX                                    ; Restore the data register
      POP    ECX                                    ; Restore the count register
      POP    EAX                                    ; Restore the accumulator
    RET    0x08                                      ; Return to the calling procedure
                                                      ; And sweep 2 parameters off the stack
Posted on 2007-04-04 01:37:20 by XCHG
FPU != MMX, even if the original MMX was overlaid on the FPU hardware. If you don't want to assume an FPU, you probably shouldn't be writing 32-bit code anyway :) (unless you want to support _really_ old hardware). The same goes for MMX: you'll be hard-pressed to find a computer today without MMX support.

SSE code should still be optional, though, IMHO. Win2k has SSE code for its ZeroPage() function, btw.
Posted on 2007-04-04 08:11:29 by f0dder
Why does Win2k have SSE code while XP doesn't?

RtlZeroMemory from "XP SP2"'s ntdll.dll:
7C90311B >/$ 57             push    edi
7C90311C  |. 8B7C24 08      mov     edi, [esp+8]
7C903120  |. 8B4C24 0C      mov     ecx, [esp+C]
7C903124  |. 33C0           xor     eax, eax
7C903126  |. FC             cld
7C903127  |. 8BD1           mov     edx, ecx
7C903129  |. 83E2 03        and     edx, 3
7C90312C  |. C1E9 02        shr     ecx, 2
7C90312F  |. F3:AB          rep     stos dword ptr es:[edi]
7C903131  |. 0BCA           or      ecx, edx
7C903133  |. 75 04          jnz     short ntdll.7C903139
7C903135  |. 5F             pop     edi
7C903136  |. C2 0800        retn    8
7C903139  |> F3:AA          rep     stos byte ptr es:[edi]
7C90313B  |. 5F             pop     edi
7C90313C  \. C2 0800        retn    8
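
Read as C, the listing above does roughly the following. This is only a sketch for illustration: the name `zero_dwords_then_bytes` is invented, and the loops stand in for the two "rep stos" instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical C rendering of the listing: dword stores first,
   then the 0-3 leftover bytes. */
static void zero_dwords_then_bytes(void *dst, size_t len)
{
    unsigned char *p = (unsigned char *)dst;
    size_t dwords = len >> 2;   /* shr ecx, 2 */
    size_t tail   = len & 3;    /* and edx, 3 */
    const uint32_t zero = 0;    /* xor eax, eax */

    assert(p != NULL || len == 0);
    while (dwords--) {          /* rep stos dword */
        memcpy(p, &zero, sizeof zero);   /* alignment-safe 4-byte store */
        p += 4;
    }
    while (tail--)              /* rep stos byte */
        *p++ = 0;
}
```

Note that nothing here aligns the destination first, which matches the listing: it goes straight to the dword stores whatever the alignment of the pointer.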

And here are the requirements to be met for the "Fast String Operation" (the fastest way to fill/move chunks of memory, at least according to "The Manuals"):
Initial conditions for "fast string" operations:
• EDI and ESI must be 8-byte aligned for the Pentium III processor. EDI must be 8-byte
  aligned for the Pentium 4 processor.
• String operation must be performed in ascending address order (Direction flag cleared).
• The initial operation counter (ECX) must be equal to or greater than 64.
• Source and destination must not overlap by less than a cache line (64 bytes, Pentium 4 and
  Intel Xeon processors; 32 bytes P6 family and Pentium processors).
• The memory type for both source and destination addresses must be either WB or WC.
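
As a quick illustration, the conditions above that a program can check for itself before a "rep stos" fill could be expressed like this. The function name `fast_string_fill_eligible` is hypothetical. The memory-type condition cannot be tested from ordinary user code, and the overlap condition only matters when there is a source operand, so both are left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical check of the quoted conditions for a `rep stos` fill.
   `count` is the initial ECX value (iterations, not bytes). */
static bool fast_string_fill_eligible(const void *dst, size_t count,
                                      bool direction_flag_set)
{
    if (((uintptr_t)dst & 7u) != 0)   /* EDI must be 8-byte aligned */
        return false;
    if (direction_flag_set)           /* must run in ascending order */
        return false;
    return count >= 64;               /* initial ECX >= 64 */
}
```

Passing this check does not guarantee that the processor actually takes the fast path; it only rules out the cases the quoted list excludes.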

It requires the size of the block to be at least 1 cache line, so it seems to me that it's using burst transactions. If it does, nothing can really be faster (though I believe SSE stores use burst transactions as well?). The downside is that it fills the cache, which amounts to pollution if you don't use the memory immediately after zeroing it.
Posted on 2007-04-04 20:28:02 by ti_mo_n
ZeroPage() is an internal kernel function, not to be confused with RtlZeroMemory.

Also, afaik the 64-bit version of XP does use SSE in the user-mode parts of the system (i.e., RtlZeroMemory). Dunno why they do it this way, but perhaps because XP-32 can (at least theoretically :)) run on non-SSE-capable hardware, whereas all x64 processors are SSE-capable.

Or perhaps because XP64 is based on the Win2003 codebase? *shrug*
Posted on 2007-04-05 03:12:10 by f0dder