Four-F / Opcode (/ others), got a couple minutes to look at this? ;)

There's a topic at the FASM board: a guy has some defective RAM but wants to use it anyway. I guess the idea would be to write a driver that allocates the specific area(s) of (physical) memory that is bad, so the memory manager will not use it... but I dunno which APIs you'd need to use and I'm a bit busy at the moment.

http://board.flatassembler.net/viewtopic.php?p=19026
Posted on 2004-12-06 07:51:31 by f0dder
That's weird. It sounds like he has a failure 100% of the time. The BIOS runs a memory test, and if it finds errors it truncates memory based on the lowest failing address. Memory is truncated by updating the memory map that Int 15h Func E820h returns to the OS. This is for XP and other modern OSes; the older OSes used other BIOS memory functions. The OS uses that function when booting to get the total memory for the system.
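For reference, the query loop the OS loader runs looks roughly like this - just a sketch from memory, not code from any particular loader, and the mmap_buf label is made up for the example:

xor ebx,ebx ;continuation value, 0 = first call
mov di,mmap_buf ;ES:DI -> 20-byte entry buffer (illustrative label)
next_entry:
mov eax,0E820h
mov edx,534D4150h ;'SMAP' signature
mov ecx,20 ;size of one entry
int 15h
jc map_done ;carry set = function unsupported or end of map
cmp eax,534D4150h ;BIOS echoes 'SMAP' on success
jne map_done
;entry at ES:DI: qword base, qword length, dword type (1 = usable RAM)
add di,20
test ebx,ebx ;EBX comes back 0 after the last entry
jnz next_entry
map_done: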

So one of two things is happening.



1) For some reason his BIOS memory test is not finding the error.

2) His BIOS memory test is disabled in CMOS Setup, OR it only runs when the amount of memory in the system changes. There might be a way to force it to run all the time in the BIOS Setup. I would try that first since it's easier. If the BIOS can update E820, it saves you a lot of hassle.
Posted on 2004-12-06 08:41:39 by mark_larson

if it finds errors it truncates memory based on the lowest failing address

So by updating E820 (and running an OS that queries the memory map through this method), only the pages (or whatever the block size is) that fail will be mapped out? Might be worth a shot then.

Might still be worth doing it with a driver, though - the BIOSes I've seen are somewhat slow at memory testing.
Posted on 2004-12-06 10:14:00 by f0dder

if it finds errors it truncates memory based on the lowest failing address

So by updating E820 (and running an OS that queries the memory map through this method), only the pages (or whatever the block size is) that fail will be mapped out? Might be worth a shot then.

Might still be worth doing it with a driver, though - the BIOSes I've seen are somewhat slow at memory testing.


That is why a lot of BIOSes skip the memory test unless there has been a change in the memory size. Usually you can force it to always do the memory test in CMOS setup.

The memory test was one of the things I re-wrote for the Dell Server BIOSes (I'm in the Dell Server BIOS group). I got anywhere from an 8x to 12x speedup depending on memory type. For example, the old memory test on a 32GB system would take 16 minutes. My new code made it run in anywhere from 1.5 to 2 minutes. I do the exact same test, except using SSE2. I also optimized some other things to make them faster, to get the full speedup. I filed a patent on it. I also just finished writing an MP version of the SSE2 memory test. If you are interested in the technical details I'd be happy to go into more detail.

You can also cheat if you don't want to learn how to write a driver (there's a rough sketch of the INT 15h hook from step 1 at the end of this post).

1) Write a TSR that hooks INT 15h, and when E820 is called, return the truncated memory map.

2) Boot to DOS diskette

3) Load TSR

4) do an INT 19h and pop out floppy. This forces it to go to the next boot device which is the hard drive

5) XP (or whatever OS you have) boots, but with the modified INT 15h E820 map.

6) The OS queries E820 and gets the map corrected for the bad DIMM.



I've used that trick a lot in the past to set different things from DOS before booting into the OS.
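In rough FASM-ish form the hook from step 1 could look like this - a minimal sketch only, not tested, with made-up labels, and with the actual clamping of the returned entry left as a comment:

new_int15:
cmp ax,0E820h ;only intercept the memory map function
jne chain
pushf
call far [cs:old_int15] ;let the real BIOS fill in the entry at ES:DI
jc back_to_caller
;ES:DI now holds: qword base, qword length, dword type.
;If this entry covers the bad area, shrink its length here (or change
;its type to 2 = reserved) before handing it back.
back_to_caller:
retf 2 ;return to the caller, keeping the flags the BIOS set
chain:
jmp far [cs:old_int15] ;everything else goes to the original handler

old_int15 dd 0 ;old vector, saved here when the TSR installs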
Posted on 2004-12-06 10:54:55 by mark_larson
How many versions of Windows use E820? Just XP, or 2k as well? NT4?

I thought about the floppy-boot trick, except I had regular int15 memory reporting in mind (which would effectively truncate from the bad block and up, if I'm not mistaken) - but with the E820 method it actually seems worthwhile.

How would you truncate the memory map? By returning additional block(s) with the bad range(s) marked "reserved", or by "cutting up" the memory range?

Your memtest optimizations sound cute :)
Posted on 2004-12-06 11:39:23 by f0dder
But why waste all the other memory? I guess RAM doesn't need to be accessed sequentially, so it should be possible to make a "shield", i.e. mark this memory as reserved - I guess exactly like the OS marks its own memory when it starts up. The kernel's memory isn't moved elsewhere by reallocation ;) or by requests from apps with fewer privileges, and equally this memory would never be handed to any application that is not the kernel itself. So why not use the same concept that protects kernel memory to protect access to this memory, i.e. mark these blocks at startup as READONLY | PRIVILEGEDACCESS | NOACCESS | NODEALLOCATION | NOMOVE | NOUSE, I don't know :). Or is it not possible to use the same concept as for kernel memory?... or maybe my concepts are wrong :)....

Why throw away some MB when you could just mark some 4KB pages?
Posted on 2004-12-06 11:49:09 by rea
> 1) Write a TSR that hooks INT 15h, and when E820 is called, return the
> truncated memory map.
> 2) Boot to DOS diskette
> 3) Load TSR
> 4) do an INT 19h and pop out floppy. This forces it to go to the next boot
> device which is the hard drive

Where is the hooking code to be located, and how is it protected against being overwritten? Just below the 0xA000 segment? This would work for DOS and Win9x. But does this work for XP or Linux as well?
Posted on 2004-12-06 12:25:51 by japheth
> 1) Write a TSR that hooks INT 15h, and when E820 is called, return the
> truncated memory map.
> 2) Boot to DOS diskette
> 3) Load TSR
> 4) do an INT 19h and pop out floppy. This forces it to go to the next boot
> device which is the hard drive

Where is the hooking code to be located, and how is it protected against being overwritten? Just below the 0xA000 segment? This would work for DOS and Win9x. But does this work for XP or Linux as well?


I had assumed that memory under 0xA000 was also preserved under XP (the person with the problem is running XP). I have not written a TSR under XP to do this, so that memory might not be free. I use the floppy boot trick to set different PCI and chipset settings before booting into the OS.


How many versions of Windows use E820? Just XP, or 2k as well? NT4?

I thought about the floppy-boot trick, except I had regular int15 memory reporting in mind (which would effectively truncate from the bad block and up, if I'm not mistaken) - but with the E820 method it actually seems worthwhile.


2K and XP. I don't remember what NT4 uses. There are other memory size functions under INT 15h (such as AX=E801h and AH=88h); the older OSes use those instead. So you could fix them all up.

How would you truncate the memory map? By returning additional block(s) with the bad range(s) marked "reserved", or by "cutting up" the memory range?


No. You simply modify the range that has the usable system memory to return a smaller range (a sketch of what one such entry looks like is at the end of this post). We currently return 11 different ranges on a Server BIOS.

base memory
extended memory
acpi memory
acpi reclaim memory
usb memory - we run USB transactions out of the top of extended memory.
rci - Remote Configuration Interface
reserved memory
local apic
i/o apic
flash chip
memory above 4GB
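To make that concrete, one 20-byte E820h entry looks roughly like this (the numbers are illustrative, not from any real board); truncating just means reporting a smaller length in the usable-RAM entry:

extended_ram: ;illustrative entry, not a real memory map
dq 000100000h ;base = 1MB, start of extended memory
dq 00BD00000h ;length, shrunk so base+length stops below the lowest failing address
dd 1 ;type 1 = usable RAM (2 = reserved)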
Posted on 2004-12-06 16:12:09 by mark_larson
Your memtest optimizations sound cute :)


For those of you who don't want to fall asleep, now's the time to wander away.

The basic memory test did a bunch of different things to test a block of memory. The block size was 64KB. 95% of the total time to test a block was spent writing patterns to the block. So I spent my time optimizing the block reading/comparing and block writing code. The old code used the floating point registers to do writes, and a completely unrolled compare loop to check the values by simply comparing to an ALU register. The code was written on a Pentium when completely unrolling loops like that was the fastest way to do it.

The code would go into protected mode to test a 64KB block. If the block is above 4GB, it would set up protected mode with paging to access memory above 4GB. The pattern writing and reading code did not need to know if paging was on or not.


here's how the pattern writing worked (pseudocode)


push two 32-bits values on the stack
read them off the stack and into a floating point register.
use the floating point store instruction to write it to the 64KB block in a loop
do a wbinvd to make sure the just written patterns go out to memory and you aren't testing the cache
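In rough form that boils down to something like this - a sketch only (not the actual BIOS code), with made-up pattern values and assuming EDI already points at the 64KB block:

push dword 0AAAAAAAAh ;high dword of the test pattern (illustrative value)
push dword 055555555h ;low dword
fild qword [esp] ;load the 8-byte pattern into an x87 register as a 64-bit integer
add esp,8
mov ecx,65536/8 ;64KB block, 8 bytes per store
old_write:
fld st0 ;keep a copy, since fistp pops the register
fistp qword [edi] ;floating point store of the 8-byte pattern
add edi,8
dec ecx
jnz old_write
wbinvd ;flush the caches so the patterns really land in memory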


for the read/compare


get the expected pattern in EAX
go through each memory location in a completely unrolled loop and compare it against eax.


We would test with 4 different patterns. So both routines got called 4 times each with the different patterns.

So what can we do to optimize it? A lot.

I ran some tests where I disabled the L3 cache on our systems before running the old memory test. With the L3 cache off, the code ran 3-4 times faster!!! Why is that? Well wbinvd is an incredibly slow instruction. It has to completely write the L1, L2 and L3 cache to memory, which takes up a lot of time. And you are doing it 4 times for every 64KB in memory!!! Ewwwwwwww. Also our code would get removed from the cache, so when the wbinvd finished you got cache misses on your code! So what if we could write directly to memory bypassing the cache, so we wouldn't have to use the WBINVD instruction? That would give us almost a 4x increase in speed. So I switched to the non-temporal store instructions. I used the SSE2 version "movntdq", since it writes 16 bytes at a time, which is another optimization I did.

Just as a side note, there are other ways to write directly to memory bypassing the cache. You could disable the L2 and L3 caches before running the code and just leave the L1 running. As long as the block you are writing is bigger than the L1 data cache size, you shouldn't have a problem. You can also use the MTRRs to mark memory above 1MB as uncacheable. Then all memory writes would go directly to memory.

So here is my new write code.


get pattern in ALU register into low dword of SSE2 register
blast it to all dwords using PSHUFD
loop through block size of memory addresses writing to memory bypassing the cache by using MOVNTDQ
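A minimal sketch of that, assuming EAX holds the 32-bit pattern and EDI points at the 64KB block (not the actual BIOS code):

movd xmm0,eax ;pattern into the low dword of the SSE2 register
pshufd xmm0,xmm0,0 ;blast it to all four dwords
mov ecx,65536/16 ;64KB block, 16 bytes per store
new_write:
movntdq [edi],xmm0 ;non-temporal store, goes around the cache
add edi,16
dec ecx
jnz new_write
sfence ;make the streaming stores globally visible before reading back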


The new write code is significantly faster than the old write code. The other benefit is that MOVNTDQ is a faster way to write to memory than MOVDQA (if you are dealing with large amounts of memory), because it doesn't have to update the cache when you do the write.


The read and compare code is a lot more complex. It's 10 times as many lines. The write code was about 8 lines. The read/compare code is almost 100 lines of code. I had to do a lot more tricks to get it to run fast. The read/compare code runs faster than the write code.


I converted it to using SSE2 to do the read and compare part of the code. The first optimization I did was to use PCMPEQD to do the compare. The old code did a single 32-bit compare. I did 4 32-bit compares in parallel.


I broke the code up into 2 loops: an inner loop and an outer loop. The inner loop deals with 128 bytes at a time. The reason for that was that I use "prefetchnta" to speed the code up, and the P4 fetches 128 bytes into the L1 cache when executing that instruction.

I found out that putting "prefetchnta" at the top of your loop and prefetching 128 bytes ahead is not the optimum way to do it, even though that is what Intel recommends.



mov ecx,512 ;Handle 512 128 byte blocks
fred_loop:
prefetchnta [edi+128] ;grab 128 bytes ahead of current location
movdqa xmm0,[edi] ;grab 16 bytes into xmm0 register
movdqa xmm0,[edi+16] ;grab 16 bytes into xmm0 register
movdqa xmm0,[edi+32] ;grab 16 bytes into xmm0 register
movdqa xmm0,[edi+48] ;grab 16 bytes into xmm0 register
movdqa xmm0,[edi+64] ;grab 16 bytes into xmm0 register
movdqa xmm0,[edi+80] ;grab 16 bytes into xmm0 register
movdqa xmm0,[edi+96] ;grab 16 bytes into xmm0 register
movdqa xmm0,[edi+112] ;grab 16 bytes into xmm0 register

add edi,128 ;Go to the next 128 byte block

dec ecx ;decrement loop counter
jnz fred_loop


I tried different prefetch distances and different locations of the prefetch instruction in the loop, and a number of them were faster! Finally I wrote my own program to try all combinations of the two with my code. I found out that the optimum prefetch distance was 928 bytes ahead (not 128), and the prefetchnta instruction needed to be almost at the end of my loop!!! Using prefetchnta in that spot with the 928 offset gave me a good speed up.
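So the loop ended up arranged more like this - the same reads as above, just a sketch with the prefetch moved down and the distance opened up to the value those tuning runs found:

mov ecx,512 ;Handle 512 128 byte blocks
fred_loop2:
movdqa xmm0,[edi] ;the same eight 16-byte reads as before
movdqa xmm0,[edi+16]
movdqa xmm0,[edi+32]
movdqa xmm0,[edi+48]
movdqa xmm0,[edi+64]
movdqa xmm0,[edi+80]
movdqa xmm0,[edi+96]
movdqa xmm0,[edi+112]
prefetchnta [edi+928] ;near the end of the loop body, 928 bytes ahead
add edi,128 ;Go to the next 128 byte block
dec ecx ;decrement loop counter
jnz fred_loop2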


Next problem: I would do a "pcmpeqd" to see if each dword in the SSE2 register matches the pattern we are looking for. After doing that you have to do a "pmovmskb". Pmovmskb on a P4 is very slow.



pcmpeqd xmm5,xmm7 ;Equal?
pmovmskb eax,xmm5 ;Grab high bits in EAX
cmp eax,0FFFFh ;all set?
jne compare_failed ;No, exit failure


To get around the slow instruction, I accumulated results for a 4KB block in two registers before doing the comparison. I had two registers with the pattern: one was ANDed with the value read from memory and one was ORed. The ANDed register would collect any bits that had gotten set to 0 but should have been 1. The ORed register would collect any bits that had ended up as 1 when they should have been 0. This gave me a good speed up over doing a PMOVMSKB every loop.

If the check at the end of the block failed, the code would go back to the beginning of the 4KB block and use REP SCASD to find the bad dword.

Accumulating a result in a register to avoid a slow instruction gave me a big speed up.
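A sketch of what that accumulation looks like, with illustrative register assignments (xmm7 = expected pattern, xmm1/xmm2 = accumulators, page_failed a made-up label) and assuming EDI points at the 4KB page:

movdqa xmm1,xmm7 ;AND accumulator starts as the pattern
movdqa xmm2,xmm7 ;OR accumulator starts as the pattern
mov ecx,4096/16 ;one 4KB page per pass
accum_loop:
movdqa xmm5,[edi] ;16 bytes from the page under test
pand xmm1,xmm5 ;collects any bit that dropped to 0
por xmm2,xmm5 ;collects any bit that popped up to 1
add edi,16
dec ecx
jnz accum_loop
pcmpeqd xmm1,xmm7 ;did the AND accumulator survive as the pattern?
pcmpeqd xmm2,xmm7 ;did the OR accumulator survive as the pattern?
pand xmm1,xmm2
pmovmskb eax,xmm1 ;one slow PMOVMSKB per page instead of per read
cmp eax,0FFFFh ;all 16 byte lanes matched?
jne page_failed ;rescan the page dword by dword to locate the bad dword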


As another optimization I also read the data from memory into an XMM register well before I had to use it, to break up dependencies.


When we were in protected mode with paging, not having a page table entry in the TLB would cause a TLB miss and an extra memory access. So I added TLB priming: the inner loop read through 4KB, and outside that loop I would read 4KB ahead to force the next page's entry into the TLB. This is why the outer loop works on 4KB, since the pages are 4KB.



mov ebx,32 ;32 blocks of 128 bytes = one 4KB page
mov esi,edi ;save start of block

if DO_TLB_PRIMING
mov eax,[edi+4096] ;TLB priming
endif ; if DO_TLB_PRIMING

inner_loop:
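;(the 128 byte read/compare body shown above goes here, looping EBX=32
; times to cover this 4KB page before the next priming read)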


TLB priming gave me a good speed up in protected mode with paging.

With all these speed ups for the memory read/compare code, I was able to hit almost 90% of the maximum peak bandwidth of the bus.

You couldn't get the same speed under Windows.


I also added MP support to the code. I would run the memory test on different processors. Remember we are a server group, so our systems generally come with lots of processors ;). Running the memory test on different processors allowed me to really make sure that the test completely used all the bus's bandwidth.
Posted on 2004-12-06 17:11:12 by mark_larson

use the floating point store instruction to write it to the 64KB block in a loop

BIOS programmers like FISTing, eh? ;-)


You could disable the L2 and L3 caches before running the code and just leave the L1 running.

That would be a way to optimize for non-SSE processors, right?

Was an interesting read, thanks for it. 90% of the theoretical bus bandwidth sounds like pretty good use to me - and adding MP support at a BIOS level sounds a bit scary ;)
Posted on 2004-12-07 08:35:27 by f0dder

use the floating point store instruction to write it to the 64KB block in a loop

BIOS programmers like FISTing, eh? ;-)


ROFL, they use those as incentives to keep us around ;)





You could disable the L2 and L3 caches before running the code and just leave the L1 running.

That would be a way to optimize for non-SSE processors, right?

Yep, but you wouldn't get the 8x-12x increase in speed I got with my approach. It would be closer to 4x; in actual testing it was closer to 3.5 times faster. All our processors are P4 and up. We don't have any processors that don't support SSE. The current generation is P4 and the next generation will be the Server version of the Prescott processor (Nocona). We had to add a few things in the BIOS to support it since it has 64-bit support.



Was an interesting read, thanks for it. 90% of the theoretical bus bandwidth sounds like pretty good use to me - and adding MP support at a BIOS level sounds a bit scary ;)


Naw, the problem is you can only push data to the bus so fast using SSE2. On our current generation of product the memory bus has a maximum bandwidth of 6.4 GB/s. The next generation will have double or quadruple that (I forget if it is 12.8 or 25.6). However, an SSE2 read and write isn't going to go any faster. That means that the 90% I am getting on reads is going to be less on the next generation. So how are you going to test even larger memory sizes (generally the max amount of memory we sell on our systems doubles every year) without using MP and using all of the bus? That is why I went with MP: it allows us to scale up with larger and larger memory bandwidths.



f0dder, I wonder if the approach suggested by the people who responded to your posted topic may be used to actually selectively take out that memory address (0b5d9270) instead of just cutting off everything above 190MB as I had to do.


Yes, but the suggestion I made was to truncate the memory. However you can get it to work without truncating by adding an additional memory range. If you read my post above about the ranges, there are currently 2 that matter here: one for conventional memory and one for extended memory (well, and one for above 4GB, which doesn't apply to you). So adding a 3rd range in extended memory would allow you to just skip over the bad dword.

something like this:

start of conventional memory TO end of conventional memory
start of extended memory TO bad dword - 1
bad DWORD + 1 TO end of extended memory
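Expressed as E820h entries that could look roughly like this (all numbers are illustrative; in practice you would probably drop the whole 4KB page containing 0b5d9270 rather than a single dword, since the OS hands out whole pages):

dq 000000000h ;base - conventional memory
dq 00009FC00h ;length - up to the EBDA (typical value)
dd 1 ;type 1 = usable RAM

dq 000100000h ;base - extended memory below the bad page
dq 00B4D9000h ;length - stops at the start of the page holding the bad dword
dd 1

dq 00B5DA000h ;base - extended memory above the bad page
dq 004A16000h ;length - up to the top of RAM (made-up total size)
dd 1 ;the 4KB hole in between is simply never reported, so the OS never uses it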


You might also want to see if your laptop is still under warranty. Or how cheap memory is.
Posted on 2004-12-07 09:17:54 by mark_larson
How many of these ranges can there be?

And another question: why does memory fail? Where does the error come from, such that a byte doesn't return the correct "answer"? Does this type of error mean the memory is close to becoming completely unusable?

And just one last one: is it possible to write such a driver for Windows?
Posted on 2004-12-07 19:39:10 by rea
How many of these ranges can there be?

The interrupt does not define a maximum number. Now, that doesn't mean Windows will correctly handle a large number of ranges. It might assume a maximum of 3 ranges and die after that. The only way you can tell is to try it.

And another question: why does memory fail? Where does the error come from, such that a byte doesn't return the correct "answer"? Does this type of error mean the memory is close to becoming completely unusable?


I'd personally replace the memory stick, in case it gets worse and worse over time.


And just one last one: is it possible to write such a driver for Windows?


I have no idea how to make a driver under Windows to do this. However f0dder posted some excellent driver links under the thread for this on the flatassembler board.
Posted on 2004-12-08 15:20:38 by mark_larson