I have tried going through Agner Fog's manual but I still don't seem to be able to make much out of it.
Can someone please explain a little bit, or give me a link?
Posted on 2003-01-16 03:47:37 by clippy
What do you want to know, what they are, or why they are important to performance?

Mirno
Posted on 2003-01-16 05:31:41 by Mirno
Well actually both.
Posted on 2003-01-16 05:36:40 by clippy
Alignment:
Aligning data means putting it at an address which is a multiple of some particular number.
4 byte aligned data will have the bottom 2 bits clear. This is done because the hardware design makes some assumptions in order to reduce the silicon needed. Rather than giving the memory controller byte-granularity access, for instance, 64 bit chunks of data will be fetched with 64 bit alignment. This makes the caching a whole lot simpler: rather than checking whether each cache line is within a particular range, the hardware masks out the bottom 6 bits of the address you are looking for and compares the rest to the addresses it already has in the cache. It also removes the possibility of duplication within a cache, and the need to make the cache track each byte within it (which would take up so much silicon it would make caches unbearable). It also saves on the address bus: since those wires will always be zero, you don't need to bother putting them in the chip!
It is a reasonable design decision, because consecutive data accesses make up the majority of memory requests. Although you as a programmer can access individual bytes, it is only within the processor itself that this is done.
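In code terms (a small C sketch, with made-up addresses, just to show the arithmetic): an address is 4 byte aligned exactly when its bottom 2 bits are clear, which you can test with a mask.

#include <stdio.h>

int main(void)
{
    unsigned addresses[] = { 0x1000, 0x1002, 0x1004, 0x1007 };

    for (int i = 0; i < 4; i++)
    {
        // 4 byte aligned means the bottom 2 bits are zero.
        if ((addresses[i] & 3) == 0)
            printf("0x%X is 4 byte aligned\n", addresses[i]);
        else
            printf("0x%X is not 4 byte aligned\n", addresses[i]);
    }
    return 0;
}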

Caching:
This is basically a method of time saving. The data path to fetch things from actual physical memory is long, very long, possibly hundreds of clocks. So when you get something from memory, rather than use it once and write it back, you store it in some temporary space where it can be got at again quickly.
Caches also help smooth out memory accesses and improve the use of available bandwidth. Rather than every write from the processor resulting in an individual write to memory, writes can be accumulated in the cache and written back as one nicely aligned chunk of data.
This suits programming very well; it's not often that you perform some operation, then leave the result for ages before operating on it again. Data which is used repeatedly in quick succession need only be fetched once into this small, fast, local store. There are usually several levels of cache, each being bigger than the last, but further from the processor, and thus slower.
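As a sketch of what working with the cache looks like (a hedged C example; the array name and sizes are invented for illustration): both functions below read every element once, but the first walks through memory consecutively, so each fetched cache line is fully used, while the second jumps 4096 bytes between accesses and tends to throw lines away before they've been fully used.

#define ROWS 1024
#define COLS 1024

static int table[ROWS][COLS];

// Walks consecutive addresses: every byte of each fetched
// cache line gets used before moving on. Cache friendly.
long sum_row_major(void)
{
    long sum = 0;
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            sum += table[r][c];
    return sum;
}

// Strides 4096 bytes (one row of ints) between accesses:
// each fetched line yields only 4 useful bytes. Cache hostile.
long sum_column_major(void)
{
    long sum = 0;
    for (int c = 0; c < COLS; c++)
        for (int r = 0; r < ROWS; r++)
            sum += table[r][c];
    return sum;
}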
There are problems with caches; these generally arise from data inconsistency, or stale data. When there are two caches (such as the data and code caches) at the same level, modifying one will not modify the other, so the processor must flush and re-fetch the data for the out-of-date cache. This can seriously dent performance, and is most visible in self-modifying code: if the code being modified is already in the cache, the modification happens in the data cache (it looks like data to the processor because it isn't being executed), so the code cache becomes out of date. Execution will then stall for the hundreds of clocks while it is re-fetched. Similar issues exist in multi-processor systems.

Branch target buffer:
The quoted cycle counts for a given instruction are basically lies. The processor does not take 1 clock cycle to execute a mov; it can take as many as 21 depending on the processor. The 1 clock cycle is how long it takes in the execution engine within the processor; the rest of the time is spent fetching and decoding the instruction. However, this fetch and decode is pipelined, so while the mov is in the pipeline, so are the subsequent instructions, and it looks like it only takes 1 clock to process (with a latency for the very first instruction executed).
This pipelining is great whenever the instruction stream is linear, but when the processor comes across a conditional jump, which path should it load the pipeline up with? Indeed, which lines of code should it fetch into the code cache? If it does nothing, there is a stall while it executes the jump and finds out which way it went. If it guesses one way or the other, it could be wrong, which would mean an inefficient use of the cache (it would throw away something it could have kept), it will have eaten memory bandwidth unnecessarily, and it will have to flush the pipeline it loaded with the wrong instructions.
Fortunately help is at hand! If you look at programs you've written, there are very few toss-of-a-coin decisions when jumping; most follow a pattern. Take a loop: this is the very simple pattern of taking the jump back repeatedly. This means on a big loop the processor should guess wrong at most twice: the first time it could get it wrong, and the last time it will probably get it wrong. So the application of a bit of maths and hardware engineering came up with the BTB. The BTB is a table (or series of tables) which stores whether or not a jump was taken (under given circumstances), and chances are, looping being as repetitive as it is, the same pattern of jumps will happen again.
There are some very complex branch prediction algorithms, and each processor manufacturer develops their own. A good branch prediction system gives a huge boost to the system, as it is not nearly so wasteful. Which method of prediction is chosen depends on the silicon needed to implement it, and on how good their mathematicians are.
Basically branch prediction is an analysis of how the real world operates, and giving the processor the best chance to cope with it.
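To make that concrete, here's a hedged C sketch (the function and data are invented for illustration). The "if" below compiles to a conditional jump; run it over sorted data and the branch goes the same way for long stretches, so the predictor does well, while random data gives it nothing to latch onto.

// Count the elements below a threshold. The conditional
// branch inside the loop is cheap when it is predictable.
long count_below(const int *data, int n, int threshold)
{
    long count = 0;
    for (int i = 0; i < n; i++)
    {
        if (data[i] < threshold)  // predictable on sorted data,
            count++;              // near random on shuffled data
    }
    return count;
}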

Mirno
Posted on 2003-01-16 06:53:37 by Mirno
Hi Mirno, thanks a million for writing so much stuff for me.

I do appreciate it, but I must say I don't understand alignment fully.
What exactly do you mean by
4 byte aligned data will have the bottom 2 bits clear
and
This makes the caching a whole lot simpler: rather than checking whether each cache line is within a particular range, the hardware masks out the bottom 6 bits of the address you are looking for and compares the rest to the addresses it already has in the cache.


I pretty much could not understand the whole of the alignment part, though the rest was very nicely written.

Btw, what exactly are 'pipelines'? I read in the Agner Fog manual that code can execute simultaneously in 2 pipelines. Also, what exactly is 'masking bits'?

Would it be possible to post a bit of asm code to explain? (Please comment it, as I am only a beginner in asm.) How can we control all this alignment, BTB and caching in asm code?

Thanks again
Posted on 2003-01-18 06:03:43 by clippy
The reasons for caching and alignment are to make the hardware more effective for the amount of silicon it takes up. Both exist simply to reduce the number of gates needed to actually build the chip (silicon costs money; plus the smaller the chip is, the lower its power needs, and the higher it can clock).

So the reason we align data is that it reduces the number of address bits needed inside the chip to fetch a portion of memory. All memory on the Pentium and above is fetched in blocks of 64 bits.
It cannot pick just any 8 bytes; they must be on a 64 bit boundary (i.e. aligned to an 8 byte address). In doing this, the memory subsystem does not need to pick out individual bytes, so those bits of the address can be removed. This means the address bus is now only 29 bits wide rather than the full 32, which saves on pins going to and from the chip, and on wires within the chip.
This design also simplifies the caches; as I said earlier, data duplication within caches is a pain (one copy gets out of date, and it wastes cache space, which is limited).



#include <stdbool.h>
#include <stdint.h>

struct cacheline {
    unsigned address : 29; // Our 29 bit address where this cache line came from
    bool dirty;            // Is this data different from what is in actual memory?
    uint64_t data;         // Our 64 bits of data
};

struct cacheline MyCache[32];


This is a basic cache. So when you try to access some data, the processor will do something like this:

Remember this is hardware, so it'll be done in parallel!


// Search the cache for the address
for (int a = 0; a < 32; a++)
{
    if (MyCache[a].address == (address >> 3))
    {
        // Cache hit! Shift the wanted byte offset down
        // (8 bits per byte) and mask to return a 32 bit value.
        return (MyCache[a].data >> ((address & 7) * 8)) & 0xFFFFFFFF;
    }
}

// It's not in the cache, so we'll need to read it from memory...


If the data were cached as individual bytes, to get the same thing you'd need 8 times as many addresses, and they'd all need to be the full 32 bits. The search would also need to be 8 times wider, which may not be possible in a single clock cycle, so it would slow down all cache accesses.

But also, as you can see, if a piece of data spans a cache line boundary, then both lines will be needed in the cache. This is why alignment is important: if the data you access is badly placed, it'll take two cache lines rather than one, which in the worst-case scenario could effectively halve the size of the cache.
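To put a number on it (a throwaway C sketch; the address is made up): a 4 byte read starting at byte 6 covers bytes 6 to 9, which live in two different 8 byte chunks, so two cache lines are needed where an aligned read would use one.

#include <stdio.h>

int main(void)
{
    unsigned address = 6;                     // a misaligned 4 byte read
    unsigned first_line = address >> 3;       // chunk the read starts in
    unsigned last_line  = (address + 3) >> 3; // chunk the read ends in

    if (first_line != last_line)
        printf("read spans 2 cache lines (%u and %u)\n", first_line, last_line);
    else
        printf("read fits in cache line %u\n", first_line);
    return 0;
}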

You think of memory as a series of 8 bit chunks, labelled 0 -> (2^32 - 1). The hardware sees it as a series of 64 bit chunks, labelled 0 -> (2^29 - 1). The 3 extra bits that you have refer to which of the 8 bytes inside the hardware's chunk you are looking at.
In future chips we may see data aligned to, say, a 128 bit boundary; many graphics chips already do this because they are heavily memory bandwidth dependent. There, data is fetched from memory at addresses which are a multiple of 16 or 32 (i.e. 128 or 256 bit aligned). In general, if the bottom n bits of an address are zero, it is aligned to a 2^n byte boundary.


Masking is another term for anding. If you want to make sure that only particular bits get through, you apply a mask to stop the others: this is what anding does! Although in hardware, if a wire is guaranteed to be zero (or one), it will simply not be put in the chip.
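Here's the masking from the cache example again, as a standalone C sketch (the address is made up): shifting drops the bits that pick the byte, and the mask keeps only those bits.

#include <stdio.h>

int main(void)
{
    unsigned address = 0x12345;

    unsigned chunk  = address >> 3; // which 64 bit chunk (drops the bottom 3 bits)
    unsigned offset = address & 7;  // which byte within it (keeps the bottom 3 bits)

    printf("address 0x%X = chunk 0x%X, byte %u\n", address, chunk, offset);
    return 0;
}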


Pipelines in hardware are like production lines. It is very difficult to build one robot that can assemble a whole car, so several different, specialised robots are put next to one another and pass the work along.
The same is true of the hardware within a processor: it is very difficult to engineer something complex that will execute in one clock cycle, and it may even be impossible if you want the clock to run at 2GHz. So instead, the hardware is designed to do some of the work, then pass it to the next stage, which will do some more, then pass it on again, etc. Each stage is independent and takes a single clock cycle, so the stages can work concurrently.
In a processor, there is more to do than just "read instruction, execute it!". Instructions have to be fetched and decoded, and the appropriate resources selected (registers, memory, flags). This can take some time, so rather than run the chip at 1MHz and have it all done in one clock, the processor cranks up the clock speed, divides the task into separate discrete units, and has them work in series on the instruction stream.
When Agner Fog describes the dual pipeline architecture of the Pentium series of processors, he means that there are two of these production lines working in parallel. So rather than just one instruction being executed by the processor, two are (provided they don't depend on one another).
In the P6 core (that used in the Pentium Pro, Pentium II, and Pentium III), things are a little different.


As I said, this is all hardware, embedded in the silicon of the processor you are using. You cannot affect the caches or the way they work, but they are part of the machine you use. If you bear in mind the way the machine works, you can get more performance out of it.

What this means to you, as an assembly programmer is this:
#1 Try to put data that is important, and will be used often, on an appropriate memory boundary for the processor you are working on. This will improve memory access speed (see the sketch after this list).

#2 If given the opportunity, do all the work on a given range of memory in one go (or as few goes as possible), then move on to the next. This keeps the data in the cache, and you won't make the processor thrash by repeatedly loading data in and flushing it out.

#3 Try to make the conditional jump patterns predictable. The more consistent you are with the conditional jumping instructions, the easier a time the processor will have guessing which way the code will go.
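As a sketch of #1 (hedged: the 16 byte boundary and sizes are just assumptions for illustration), the usual trick in a language like C is to over-allocate, then round the pointer up to the next boundary with the same add-and-mask arithmetic discussed above. In assembler, the rounding is just an ADD and an AND on the address.

#include <stdint.h>
#include <stdlib.h>

#define ALIGNMENT 16 // assumed boundary; pick what suits your processor

int main(void)
{
    // Over-allocate so there is room to round up to the boundary.
    void *raw = malloc(1024 + ALIGNMENT - 1);
    if (!raw)
        return 1;

    // Round up: add (ALIGNMENT - 1), then mask the low bits clear.
    uintptr_t addr = ((uintptr_t)raw + ALIGNMENT - 1) & ~(uintptr_t)(ALIGNMENT - 1);
    void *aligned = (void *)addr;

    // ... put the important, often-used data at 'aligned' ...

    free(raw); // always free the original pointer, not the rounded one
    return 0;
}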


This, in my opinion, is one of the reasons assembly is a great language to learn. Even if you don't program in it professionally, it teaches you about the hardware, and why you should do things in a particular way. You can then apply this knowledge to the code you write in other languages.

As you are new to assembly language, don't concern yourself with optimisation details. Not aligning data will not break the code; it'll just slow it down. When you are comfortable with coding in assembler, then start to look at why people do things the way they do.

Mirno
Posted on 2003-01-18 11:10:49 by Mirno
Mirno,

Compliments on the technical description of this topic, I wish there was a way to wrap this quality of technical data so that more people could access it.

Regards,

hutch@movsd.com
Posted on 2003-01-18 14:42:31 by hutch--
Will dynamically allocated memory always be properly aligned?
Posted on 2003-01-18 23:58:43 by david
David,

From memory, yes: GlobalAlloc is always 8 byte aligned, OLE strings are 16 byte aligned, and the heap and similar functions are also aligned to some minimum as well.
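(If you want to check this on your own machine, here's a quick C sketch against the Win32 API; if the bottom 3 bits of the returned address are zero, the block is at least 8 byte aligned.)

#include <windows.h>
#include <stdio.h>

int main(void)
{
    void *p = GlobalAlloc(GMEM_FIXED, 100);
    printf("GlobalAlloc returned %p (low 3 bits: %u)\n",
           p, (unsigned)((ULONG_PTR)p & 7));
    GlobalFree(p);
    return 0;
}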

Regards,

hutch@movsd.com
Posted on 2003-01-19 06:10:20 by hutch--
And VirtualAlloc is page aligned (4k on x86). Actually, it's 64k aligned on x86 when you use NULL for the lpAddress parameter (i.e. let Windows find a suitable address).
Posted on 2003-01-19 09:46:29 by f0dder
Hi Mirno,
Sorry for my late reply. Again, I'm out of words to thank you for writing all that.

Although it's clearer to me right now, I'm still pretty sure I can't understand most of it.
Like this C code you posted:


// Search the cache for the address
for (int a = 0; a < 32; a++)
{
    if (MyCache[a].address == (address >> 3))
    {
        // Cache hit! Shift the wanted byte offset down
        // (8 bits per byte) and mask to return a 32 bit value.
        return (MyCache[a].data >> ((address & 7) * 8)) & 0xFFFFFFFF;
    }
}


I have never been able to understand how anding operations really work. I know
that:


a.1=a
a.0=0

and the truth table for and, or, etc., but I don't know how algorithms that use it work.
Like, what exactly does anding the address with 7 do?
Also, as anything anded with 1 gives the same thing, what's the point in anding it with FFFFFFFF??? :confused:

Also, this is from Agner Fog's manual on caching, but I again don't seem to understand it.


The consequence of this is that the cache can hold no more than two or four different data blocks which have the same value in bits 5-11 of the address. You can determine if two addresses have the same set-value by the following method: Strip off the lower 5 bits of each address to get a value divisible by 32. If the difference between the two truncated addresses is a multiple of 4096 (=1000H), then the addresses have the same set-value.

Let me illustrate this by the following piece of code, where ESI holds an address divisible by 32:



AGAIN:  MOV EAX, [ESI]
        MOV EBX, [ESI + 13*4096 + 4]
        MOV ECX, [ESI + 20*4096 + 28]
        DEC EDX
        JNZ AGAIN

The three addresses used here all have the same set-value because the differences between the truncated addresses are multiples of 4096. This loop will perform very poorly on the PPlain and PPro. At the time you read ECX there is no free cache line with the proper set-value, so the processor takes the least recently used of the two possible cache lines, that is the one which was used for EAX, and fills it with the data from [ESI + 20*4096 + 28] and reads ECX. Next, when reading EAX, you find that the cache line that held the value for EAX has now been discarded, so you take the least recently used line, which is the one holding the EBX value, and so on... You have nothing but cache misses and the loop takes something like 60 clock cycles. If the third line is changed to:

MOV ECX, [ESI + 20*4096 + 32]

then we have crossed a 32 byte boundary, so that we do not have the same set-value as in the first two lines, and there will be no problem assigning a cache line to each of the three addresses. The loop now takes only 3 clock cycles (except for the first time).


What exactly is 'set-value'? Why can 2 addresses have the same set-value?
Also, in this line:
MOV EBX, [ESI + 13*4096 + 4]
EBX is being loaded from an address which is 4 more than a multiple of 4096, so how can Agner say that 'the differences between the truncated addresses are multiples of 4096'??? :confused:
Posted on 2003-01-22 04:24:55 by clippy
The truth tables for anding (and the other logical operations) apply to single bits.
A 32 bit value can be thought of as an array of bits:


#include <stdbool.h>

// And together two 32 bit values, treated as arrays of single bits.
void MyAnd(bool result[32], const bool ArgA[32], const bool ArgB[32])
{
    for (int i = 0; i < 32; i++)
        result[i] = ArgA[i] & ArgB[i];
}

So "and eax, 7" masks out all but the bottom 3 bits of eax (7(dec) == 0111(bin)). Hence the term masking.

The reason I wrote the mask value of 0xFFFFFFFF is to demonstrate that the cache is returning a 32 bit value, even though it is 64 bits wide. If this were an optimised C model of the cache, rather than an example bit of code it could be omitted. The idea of working with integers of widths other than the basic 8, 16, 32, or 64 bits is something that needs to be kept in mind when dealing with hardware.

Agner said that bits 5-11 make up the set-value. So adding a value between 0 and 31 will not affect the set-value, as they modify bits 0-4.
setvalue = (Address >> 5) & 0x7F;


13*4096 + 4  = 0x0D004 -> shr 5 = 0x680 -> AND 0x7F = 0
20*4096 + 28 = 0x1401C -> shr 5 = 0xA00 -> AND 0x7F = 0

set-values are both zero!
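If you want to check the arithmetic yourself, here's a throwaway C sketch of that calculation (assuming ESI = 0 for illustration; the three values stay equal for any ESI divisible by 32):

#include <stdio.h>

// Agner's set-value: bits 5-11 of the address.
unsigned setvalue(unsigned address)
{
    return (address >> 5) & 0x7F;
}

int main(void)
{
    printf("%u\n", setvalue(0));              // [ESI]
    printf("%u\n", setvalue(13 * 4096 + 4));  // [ESI + 13*4096 + 4]
    printf("%u\n", setvalue(20 * 4096 + 28)); // [ESI + 20*4096 + 28]
    return 0;
}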

Mirno
Posted on 2003-01-22 05:41:37 by Mirno
Thanks again, Mirno.
I seem to be getting much more of this now.

I will trouble you again in the future:)
Posted on 2003-01-23 03:42:54 by clippy