What options do I have to influence the Level 2 cache besides prefetch? Is there a way to load more than one cache line at once? Can I access the L2 without referencing main memory, i.e. does the L2 have its own address space? Or does it only work by its own logic?

Thanks ahead.
Posted on 2007-08-07 15:24:38 by atcl
L2 is simply a mechanism to accelerate computations. Not all CPUs have the same L2 line size, and, most importantly/obviously, the same L2 size. If you want to prefetch several lines... put several prefetch instructions (or a loop of them).
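Something like this rough C sketch, for example (the helper name and the 64-byte line size are just assumptions - check your CPU's actual line size):

    #include <xmmintrin.h>  /* _mm_prefetch and the _MM_HINT_* constants */
    #include <stddef.h>

    #define CACHE_LINE 64   /* assumed line size */

    /* Issue one prefetch hint per cache line of a block. */
    static void prefetch_block(const void *p, size_t bytes)
    {
        const char *c = (const char *)p;
        for (size_t off = 0; off < bytes; off += CACHE_LINE)
            _mm_prefetch(c + off, _MM_HINT_T1);  /* T1 = roughly "into L2" */
    }

Keep in mind these are only hints - the CPU is free to drop them.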
There's no way you can tell the CPU to limit itself to L2. Consider the case of your thread being switched out for another - *poof* - you can no longer be sure what the L2 holds. Instead of simply forcing the CPU, you can only make the context suitable for L2 access.
In most cases it's easy to make it possible for the CPU to avoid main memory when you read. Writing always involves asynchronous write-through operations to main memory (16-byte aligned).
Study the write-through behavior of your current CPU a bit more, and you can easily make L2 optimizations.
Posted on 2007-08-07 15:58:59 by Ultrano

There's no way you can tell the CPU to limit itself to L2.

There's a trick (used by BIOS developers) that'll let you operate only within cache, without flushing out to RAM. Can't remember the specifics, but probably involves playing with MSRs/MTRRs.

Posted on 2007-08-07 19:13:51 by f0dder
I also remember that trick, though not from BIOS-related literature. If I remember right, it is done just by reading the address space you want to use as "L1 RAM" (what I read was from the time when the L2 was on the motherboard), and then disabling the cache. Yes, it sounds weird, but the reasoning is that disabling the cache doesn't flush it, so the CPU will still use its contents, and at the same time it won't evict entries to load new data, because caching of new data is disabled.

It would be easy to verify, and I'll try to test it to make sure that was the method.
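Very roughly, the idea would be something like this C sketch (ring-0 only, no OS underneath; the names and the 64-byte stride are made up, and whether step 2 really behaves like this is exactly what needs verifying):

    #define CR0_CD (1UL << 30)  /* Cache Disable bit in CR0 */
    #define CR0_NW (1UL << 29)  /* Not Write-through bit in CR0 */

    static inline unsigned long read_cr0(void)
    {
        unsigned long v;
        __asm__ volatile("mov %%cr0, %0" : "=r"(v));
        return v;
    }

    static inline void write_cr0(unsigned long v)
    {
        __asm__ volatile("mov %0, %%cr0" : : "r"(v));
    }

    static void cache_as_ram(volatile const char *region, unsigned long bytes)
    {
        /* 1. Touch every line so the range gets pulled into the cache. */
        for (unsigned long i = 0; i < bytes; i += 64)
            (void)region[i];

        /* 2. Set CD (keep NW clear) so no new lines are allocated;
           the hope is that lines already present keep being served. */
        write_cr0((read_cr0() | CR0_CD) & ~CR0_NW);
    }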
Posted on 2007-08-07 19:45:38 by LocoDelAssembly
@f0dder, @LocoDelAssembly:

Any sources, books, source code, web pages, or keywords to search for? Sounds very interesting!
Posted on 2007-08-08 08:22:45 by atcl
atcl, it seems it is as I said, but unfortunately I can't test it:

Intel says this

But AMD64 says this
Cache Disable (CD) Bit. Bit 30. When CD is cleared to 0, the internal caches are enabled. When CD is set to 1, no new data or instructions are brought into the internal caches. However, the processor still accesses the internal caches when CD=1 under the following situations:

  • Reads that hit in an internal cache cause the data to be read from the internal cache that reported the hit.

  • Writes that hit in an internal cache cause the cache line that reported the hit to be written back to memory and invalidated in the cache.


So, the method depends on CPU brand :(

I can't test it because I have an Athlon64.
Posted on 2007-08-08 10:38:13 by LocoDelAssembly
So...

On an Intel processor, it would be possible to load, let's say, a whole chunk of code into the L2, then disable the cache, and then call procedures or load data by their main-memory addresses and execute/process the code/data at lightning speed?

If the above is possible, the question remains how to load memory blocks that are larger than one cache line. Several prefetches probably won't do the trick, since it cannot be guaranteed that they are loaded at all, or that they won't be overwritten by each other.

With jmp I could probably at least clean the L2 out.
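If plain reads are used instead of prefetches, the loads can't be dropped, so something like this C sketch (my own names, assuming a 64-byte line size and SSE2 for CLFLUSH) should at least guarantee the block gets pulled in - and flushing it line by line might be cleaner than jmp-ing through memory:

    #include <emmintrin.h>  /* _mm_clflush (SSE2) */
    #include <stddef.h>

    #define CACHE_LINE 64   /* assumed line size */

    /* Plain reads, unlike prefetch hints, cannot be dropped, so touching
       one byte per line forces every line of the block into the cache. */
    static void touch_block(const volatile char *p, size_t bytes)
    {
        for (size_t off = 0; off < bytes; off += CACHE_LINE)
            (void)p[off];
    }

    /* CLFLUSH writes back and invalidates one line at a time, which is a
       more direct way to clean a block out than evicting it indirectly. */
    static void flush_block(const void *p, size_t bytes)
    {
        const char *c = (const char *)p;
        for (size_t off = 0; off < bytes; off += CACHE_LINE)
            _mm_clflush(c + off);
    }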
Posted on 2007-08-09 05:50:57 by atcl
pff...

Why don't CPUs let you address the cache as a "scratchpad" or small memory pool via explicit loads/stores? I had that vision years ago - it would be so cool for optimizing, signal processing, etc...

More control is more power. It's ridiculous to struggle along with half a dozen regs when you have half a megabyte of cache floating around that is almost as fast... but a teacher of mine said "yeah, yeah, but no, you know, it's much better to let all this be done automatically, trust me." ...and let the thing thrash itself half of the time? Moron! Look at the console architectures now.

And look at what a pain it is to get a simple *beep*ing memcpy (MEMCPY! sigh) to run at the NORMAL (maximum) speed of your bus on every *beep*ing Pentium/Athlon...

Well, I guess for a general-purpose CPU aimed at HLLs it may make sense most of the time... except in the critical loops where the program spends so much of its time.


d'oh.
:mad:

:)
Posted on 2007-08-14 07:27:47 by HeLLoWorld
atcl, I'm not so sure about caching code, because if I remember right the L2 cache holds data only, so even if you preload the code the CPU will not use it; you have to run the code first and then disable the cache. However, only branchless code has a good chance of getting cached by running it once...

I also remember the term "unified cache", and in that case the preloading should work, but I can't provide you with any more data, sorry...

PS: Note that if you want to force the CPU to cache only certain parts, you can always (supposing there is no underlying OS) use the MTRRs to set UC mode for the parts you don't want cached.
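As a very rough sketch of what that could look like (ring 0, no OS; MSRs 0x200/0x201 are the first variable-range MTRR pair, and I'm assuming a 36-bit physical address width - the manuals also require a longer disable/WBINVD/re-enable sequence around this that the sketch skips):

    #include <stdint.h>

    #define IA32_MTRR_PHYSBASE0 0x200
    #define IA32_MTRR_PHYSMASK0 0x201
    #define MTRR_TYPE_UC        0x00        /* uncacheable memory type */
    #define MTRR_VALID          (1ULL << 11)

    static inline void wrmsr(uint32_t msr, uint64_t val)
    {
        __asm__ volatile("wrmsr" : : "c"(msr),
                         "a"((uint32_t)val), "d"((uint32_t)(val >> 32)));
    }

    /* Mark a naturally aligned, power-of-two sized physical region as UC
       so the CPU will not cache it. */
    static void mtrr_set_uc(uint64_t phys_base, uint64_t size)
    {
        uint64_t mask = ~(size - 1) & ((1ULL << 36) - 1);
        wrmsr(IA32_MTRR_PHYSBASE0, phys_base | MTRR_TYPE_UC);
        wrmsr(IA32_MTRR_PHYSMASK0, mask | MTRR_VALID);
    }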
Posted on 2007-08-16 12:09:55 by LocoDelAssembly
atcl,

So far as references go, depending on what it is you are trying to do, these may be worth a look:


  • The book Inner Loops, by Booth, gets into gory detail on the cache. I got lost halfway into that section of the book, but it may be helpful. Booth's all about the cache.

  • Judy, http://judy.sourceforge.net/, is a C library for implementing sparse dynamic arrays, Judy Arrays, that are supposedly tuned for optimal cache performance. If nothing else, Judy is open-source, so the code might shed some light on what you are trying to do... That is, if Judy works similarly to what you are describing.



My references are, however, best guesses.
Posted on 2007-08-31 15:34:25 by TheAbysmal
Thanks for those links, Abysmal :)

- if you have any links lying around for lock-free algorithms, feel free to post those too :P
Posted on 2007-08-31 15:57:17 by f0dder
f0dder,

Heh... okay, now you're talking about things WAY out of my league.

Though, now my curiosity is piqued.
Posted on 2007-08-31 18:32:48 by TheAbysmal

f0dder,

Heh... okay, now you're talking about things WAY out of my league.

Though, now my curiosity is piqued.


I doubt it. Lock-free algorithms just require more thoughtful design and usually more instructions, but in the end you save cycles.

I read a PDF about a lock-free heap manager a while ago; it looked pretty interesting and sensible.
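To give a tiny taste of the idea, here's a sketch of a lock-free stack push using compare-and-swap (GCC's __sync builtin; a real design like that heap manager also has to deal with the ABA problem, which this ignores):

    struct node {
        struct node *next;
        int          value;
    };

    static struct node *top;  /* shared stack head */

    static void push(struct node *n)
    {
        struct node *old;
        do {
            old     = top;   /* snapshot the current head    */
            n->next = old;   /* link our node in front of it */
            /* The CAS only succeeds if nobody changed 'top' in the
               meantime; otherwise we just retry - no lock needed. */
        } while (!__sync_bool_compare_and_swap(&top, old, n));
    }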
Posted on 2007-08-31 19:19:06 by SpooK