Hello everyone. I've been looking through the Pentium 4 optimization manual from Intel and I have a few questions about some things I've read.

Alignment is mentioned a lot, and I understand what it is and why it helps, but how exactly do you go about aligning something? I also read about aligning code to x boundaries. How exactly do you go about aligning code? It mentioned using nops or something like mov eax,eax, but how do you know how many of these you need to align to any particular boundary?

There is also a lot of talk about micro-ops, throughput, and latency, but I can't seem to find a list covering all the instructions on the P4. There is a list in the back of the book, but it's not complete and it doesn't give the number of micro-ops.

Does pairing still matter on the P4, or is that just for the P6-type processors? From the way things looked in the book, it's not so much the U/V pipes as it is issue ports and execution units. Are the 4 ports pretty much like the U/V pipes, except there are 4 instead of 2?

I admit that I'm looking at things that are probably a step or two beyond where I'm at as an assembly programmer, but I'm still curious about these things.

Also, I guess while I'm here... how can I actually do anything using the @FileName macro? I can't declare a string that's initialized to the filename because quotes make the string and "@FileName" is exactly that, @FileName, heh.

Thanks for any help, comments, or links :grin:
Posted on 2003-12-09 04:22:19 by AlexEiffel
Assemblers generally have alignment directives. MASM for example has "ALIGN n", where n is the byte boundary to align on.
It will then automatically pad the code (or data) up to the next boundary, so you never have to count the padding bytes yourself. When done in code, it will try to fill the gap with as few 'nop'-type instructions as possible (a lea on eax, for example, is a long one).
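
A minimal sketch of how that looks in a source file, assuming MASM and a 32-bit flat-model build. The names marker, counter and again are made up for illustration. ALIGN cannot ask for more than the alignment of the segment it sits in, so the sketch sticks to 4, which should be safe with the default segments; larger boundaries such as 16 need a segment that is itself aligned at least that much.

```asm
; Sketch only: assumes MASM, 32-bit flat model. All names are illustrative.
.386
.model flat, stdcall

.data
marker  db 1            ; a single byte, so whatever follows would be misaligned
        ALIGN 4         ; assembler pads the data up to the next multiple of 4
counter dd 0            ; this dword now sits on a 4-byte boundary

.code
start:
        mov ecx, 100
        ALIGN 4         ; assembler inserts nop-type padding here if needed
again:                  ; the loop entry now starts on a 4-byte boundary
        add counter, 1
        dec ecx
        jnz again
        ret
end start
```

The assembler knows the current offset at every ALIGN, so it works out the number of padding bytes on its own.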
The P4-list may not be complete, but I think most stuff from the P3-list still applies, more or less. It should do well enough in practice, I'd say. Just pay extra attention to the P4's ability to execute some instructions in 0.5 clks.
As for pairing, that was never really an issue on P6-processors. They can reorder code dynamically, so the exact order doesn't really matter. The ports are important, as you said, but also the decoding. Some instructions are too long to be decoded by the second and third decoder, for example, and then they stall until they can be decoded by the first decoder in the next clk. If you reorder your code a bit, you can get a better decoder throughput. Then again, usually the decoder is faster than the execution units, so if it stalls every now and then, you won't notice it right away, since the decoder can catch up again before the execution units run out of micro-ops.
The Intel Optimization manual (at least the PIII one) explains the architecture in great detail. It will tell you that there are 5 ports, and each of those is unique in some way. I don't recall the exact layout at this moment, but it was something like 2 ports for ALU work (with FPU/MMX/SSE going through those as well), 1 port for loads, and 2 ports for stores (address and data). Anyway, they are all different, and they can all be used at the same time.
Anyway, check the manuals, they should be a good guide.
Posted on 2003-12-09 06:05:38 by Bruce-li
www.intel.com go to developers section for download:

NOTE: The Intel Architecture Software Developer's Manual consists of
three volumes: Basic Architecture, Order Number 243190; Instruction Set
Reference, Order Number 243191; and the System Programming Guide,
Order Number 243192.
Please refer to all three volumes when evaluating your design needs.

As far as FileNames, have you done any study into Iczelion's tutorials?

http://spiff.tripnet.se/~iczelion/

I usually find what I'm looking for there.
Posted on 2003-12-09 07:48:17 by mrgone
Bruce: Thanks for clearing a few things up for me. I'll keep looking through the manuals.

mrgone: Thanks for the reply, but I think maybe I wasn't clear on what I meant with FileNames. I wasn't talking about getting a filename from Windows, but about using the MASM @FileName directive to get the name of the file that it is in. I just can't figure out how to get the result of the directive into a string to use in my programs. I actually have all of the manuals that you mentioned and they are definitely good sources of information.
Posted on 2003-12-10 19:30:03 by AlexEiffel
You need something like:

the_filename db @CatStr(<!">, @FileName, <!">), 0

I've got this working in the past, and it is something like this. I can't try it out at the moment though...

Mirno
Posted on 2003-12-12 04:59:06 by Mirno
Alex,

Have a look at the optimisation manual for a PIV to get the general outline of instruction scheduling. It has a detailed section on preferred instruction usage. What you will find in practice is that you can get very bad stalls on a PIV if you don't get it right. Among other things, reads and writes of different sizes on the same register, like going from DWORD to WORD or BYTE, will generate these stalls.

Data alignment is based on the hardware requirement that the processor needs to take 2 reads to get a DWORD that is not aligned to a 4-byte boundary. Some of the later XMM instructions require a higher alignment (16 bytes) for data in memory.

In practice, sometimes code alignment matters and sometimes it does not. When you can do it without interfering with the instructions before it, you can at times align labels to some advantage.

Instruction scheduling is not exactly the same thing as pairing was on the earlier Intel processors, but it does work in much the same way. The preferred instructions run faster than they did on earlier Intel processors for a given clock frequency, so any problems in instruction sequences generate large time differences.

Instructions like shifts and rotates are relatively slower, and LEA does not have the performance advantage it had on earlier processors from the 486 up.

Just for example, if you wrote code to truncate a value down to the next lower multiple of 4, you used to write code like,


mov eax, oddnumber
shr eax, 2
shl eax, 2

On a PIV it is faster to use adds for the SHL operation so you would write,


mov eax, oddnumber
shr eax, 2
add eax, eax
add eax, eax

PIV code is faster if you get it right, so it's worth the effort to have a good read of the PIV manual set and especially the optimisation manual.

Regards,

Posted on 2003-12-12 08:01:43 by hutch--
quote:
Just for example, If you wrote code to truncate a value to its next lowest boundary of 4, you used to write code like,




mov eax, oddnumber
and eax, NOT 3
Posted on 2003-12-12 08:05:46 by Bruce-li
Cute piece of optimisation but you have missed the point about using shifts. This is what Intel have to say about replacing SHL with adds.

The shift and rotate instructions have a longer latency on the Pentium 4 processor than on previous processor generations. The latency of a sequence of adds will be shorter for left shifts of three or less. Fixed and variable shifts have the same latency. Assembly/Compiler Coding Rule 42. (M impact, M generality) If a shift is on a critical path, replace it by a sequence of up to three adds. If its latency is not critical, use the shift instead because it produces fewer µops.
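
As a rough sketch of what the rule means for a left shift by 3, assuming MASM and a 32-bit flat-model build. The procedure name times8 is made up for illustration, and the flags left behind by the adds may differ from what shl would leave, so this assumes nothing afterwards reads them.

```asm
; Sketch only: times8 is a hypothetical helper, value in and out through eax.
.386
.model flat, stdcall

.code
times8 proc
        add eax, eax    ; eax * 2
        add eax, eax    ; eax * 4
        add eax, eax    ; eax * 8, same value as shl eax, 3
        ret
times8 endp
end
```

Off the critical path, a plain shl eax, 3 is the form the rule recommends, since it is a single instruction.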

Benchmarking shows that Intel actually know what they are talking about.

Regards,
Posted on 2003-12-12 08:52:51 by hutch--
donkey,

The example was to actually round down to the next 4 byte boundary,

and eax, -4 works fine but it missed the point of replacing left shifts with adds.

Regards,

Posted on 2003-12-12 09:10:37 by hutch--
Here's the post Hutch is referring to above :)

Isn't it supposed to be, you want to align to the next DWORD. Bruce-Li's code will just truncate the lower 2 bits.

mov eax, oddnumber

add eax, 3
and eax, NOT 3
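
To make the difference between the two versions concrete, here is a small sketch with worked values in the comments, assuming MASM and a 32-bit flat-model build. The procedure names round_down4 and round_up4 are made up for illustration; both take the value in eax and return it in eax.

```asm
; Sketch only: hypothetical helpers, value in and out through eax.
.386
.model flat, stdcall

.code
round_down4 proc
        and eax, NOT 3      ; clears the low 2 bits: 13 -> 12, 12 -> 12
        ret
round_down4 endp

round_up4 proc
        add eax, 3          ; 13 -> 16, 12 -> 15
        and eax, NOT 3      ; 16 -> 16, 15 -> 12
        ret
round_up4 endp
end
```

A value that is already a multiple of 4 comes back unchanged from both; anything else goes down with the first and up with the second.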


<edit>Sorry, I misread Hutch's post, couldn't see the point of aligning downward so didn't consider it</edit>
Posted on 2003-12-12 09:43:20 by donkey
hutch:


Cute piece of optimisation but you have missed the point about using shifts. This is what Intel have to say about replacing SHL with adds.

quote:
The shift and rotate instructions have a longer latency on the Pentium 4 processor than on previous processor generations. The latency of a sequence of adds will be shorter for left shifts of three or less. Fixed and variable shifts have the same latency. Assembly/Compiler Coding Rule 42. (M impact, M generality) If a shift is on a critical path, replace it by a sequence of up to three adds. If its latency is not critical, use the shift instead because it produces fewer µops.


Benchmarking shows that Intel actually know what they are talking about.



lol!!
but the point here is why would you do a shift in the first place?
a shift of more than 1 will require the shift count to be encoded in the instruction, one byte I think (a shift of 1 place doesn't require it)

BruceLi:


code:
mov eax, oddnumber
and eax, NOT 3



my little noobie self outperforms both solutions by the following masterpiece :grin: :
and al, 11111100b

coz with "and eax,not 3" you'll have to include a full "not 3" dword in your opcode : doh! (to be said in Homer's voice)
with al you just have a one-byte "not 3".

or did I miss something?

donkey:

Isn't it supposed to be, you want to align to the next DWORD. Bruce-Li's code will just truncate the lower 2 bits.

code:
mov eax, oddnumber
add eax,3
and eax, NOT 3


anyway hutch's code did the same...
Posted on 2003-12-12 11:49:54 by HeLLoWorld

donkey:


anyway hutch's code did the same...


Ummmm, no it doesn't. At least not the same thing mine does (not really mine). Mine will always align up to the closest DWORD; Hutch's and Bruce-Li's will just truncate the bottom 2 bits.
Posted on 2003-12-12 11:52:22 by donkey
quote:
my little noobie self outperforms both solutions by the following masterpiece :
and al, 11111100b

coz with "and eax,not 3" you'll have to include a full "not 3" dword in your opcode : doh! (to be said in Homer's voice)
with al you just have a one-byte "not 3".

or did I miss something?


That will work, yes... the problem is that you are using a partial register, and when you use eax again afterwards, you will get a stall...
And the other part is, the and instruction supports sign-extension of an 8-bit immediate, if I'm not mistaken... So if you disassemble the code, and eax, NOT 3 will just store NOT 3 in 1 byte anyway (assuming you use an assembler that writes the shortest possible opcodes; MASM will do this for you automatically).

So, it's a nice idea, good to see someone using their head :)
But I don't think it's an improvement in this case.
Posted on 2003-12-12 11:54:13 by Bruce-li
IMHO, Hutch just wanted to make an example of replacing shifts with additions. So maybe he picked the wrong example; that doesn't invalidate his point, nor does it imply he's not a good programmer, you know ;)
People, let's not get this thread moved to the Crusades, ok? :(

@HelloWorld:
Good thinking :) You've got to move to 32-bit registers though; the 16 and 8 bit ones don't give you size optimizations in most cases, and will slow you down due to stalls. The Intel optimization manual is good reference material for this stuff:

http://www.asmcommunity.net/board/showthread.php?threadid=14740&highlight=intel+free
Posted on 2003-12-12 12:02:24 by QvasiModo
Nobody said that hutch--'s point about the shifts was invalid, hutch-- just felt attacked, as usual. So he felt he had to retaliate, as usual.
And it does imply he's not a good programmer... I mean, a good programmer would never even come up with this example, because he'd know to use an and for that, and shifts would never occur to him.
I'd probably use an example of a multiply replaced by shifts and adds instead, or something.
Posted on 2003-12-12 12:07:01 by Bruce-li
donkey:

Ummmm, no it doesn't. At least not the same thing mine does (not really mine). Mine will always align up to the closest DWORD; Hutch's and Bruce-Li's will just truncate the bottom 2 bits.


I meant "hutch's code does the same as BruceLi's code", obviously...


BruceLi:

That will work yes... the problem is that you are using a partial register, and when using eax again afterwards, you will get a stall...

i must apologise to you then... not for not knowing this and posting anyway, of course, but for thinking to myself "heck! learn to code before you learn the slight differences between pIII and pIV!" :grin: ... anyway, it DID matter in this case to know the pipeline internals, and I'm not familiar with this at all and I was wrong... So what is best of course is to know everything :)


And the other part is, and supports sign-extension if I'm not mistaken..

could you explain what this means to me?


MASM will do this for you automatically

ddddddoh!!!!!!
and if I f_ckin do WANT to produce it? :) nasm rules (although nasm sux for not having ORG)
Posted on 2003-12-12 12:08:29 by HeLLoWorld

quote:
Nobody said that hutch--'s point about the shifts was invalid, hutch-- just felt attacked, as usual. So he felt he had to retaliate, as usual.
And it does imply he's not a good programmer... I mean, a good programmer would never even come up with this example, because he'd know to use an and for that, and shifts would never occur to him.
I'd probably use an example of a multiply replaced by shifts and adds instead, or something.

Then it would be more constructive if you post such an example. :)
Anyway you can't judge someone's programming ability on some quick sample from the top of his head. It's not even like Hutch's code was wrong, it just wasn't the best possible, right?
Posted on 2003-12-12 12:10:43 by QvasiModo

quote:
ddddddoh!!!!!!
and if I f_ckin do WANT to produce it? :) nasm rules (although nasm sux for not having ORG)

:grin: :grin: :grin:
Guess you're right... unless there's some command-line switch to disable that feature (there probably is, but I'm too lazy to check ;) ).

EDIT: You can also hardcode the opcode using DB, but it's kinda cheating you know :grin:
Posted on 2003-12-12 12:12:53 by QvasiModo
quote:
i must apologise to you then...


No problem, I didn't take offence anyway.

quote:
could you explain what this means to me?


Well it's an old trick... For example, if you want to load -1 into eax, I believe that or eax, -1 is the shortest possible way (this is what compilers have been doing lately anyway).
This is because there are multiple forms for encoding immediate operands.
Some instructions can store e.g. -1 as a byte (0xFF), and it is expanded to -1 as a dword (0xFFFFFFFF) by the CPU before it is fed to the execution unit. This means the code size in memory is still small.
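
To put some bytes next to that, here is a small sketch, assuming MASM and a 32-bit flat-model build. The procedure name encodings is made up, and the byte sequences in the comments are what I would expect from an assembler that picks the shortest form; a disassembly of your own build is the real check.

```asm
; Sketch only: expected machine code is shown in the comments.
.386
.model flat, stdcall

.code
encodings proc
        and eax, NOT 3      ; 83 E0 FC        imm8 FC is sign-extended to FFFFFFFC
        or  eax, -1         ; 83 C8 FF        imm8 FF is sign-extended to FFFFFFFF
        mov eax, -1         ; B8 FF FF FF FF  mov has no sign-extended imm8 form
        and al, 0FCh        ; 24 FC           byte form, touches only al
        ret
encodings endp
end
```

So the al form is only one byte shorter than the sign-extended dword form, which has to be weighed against the partial register issue mentioned earlier.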

quote:
ddddddoh!!!!!!
and if I f_ckin do WANT to produce it? nasm rules (although nasm sux for not having ORG)


Why would you ever want a non-optimal encoding of your instructions? :)
Anyway, you can still handcode the opcode in MASM that way, clumsy perhaps, but it's possible :)

quote:
Then it would be more constructive if you post such an example.
Anyway you can't judge someone's programming ability on some quick sample from the top of his head. It's not even like Hutch's code was wrong, it just wasn't the best possible, right?


I think the point of the shifts/adds came across anyway, no need to spam more examples, it's trivial stuff anyway.
And in case you didn't know yet, hutch-- and I go back a long way, I got banned in the past for disagreeing with hutch--... But he can't ban me now, because he got stripped of his administrator rights (for obvious reasons).
So he can just flame me now, and I can handle him, he'll just make himself look like an idiot :)
Posted on 2003-12-12 12:17:31 by Bruce-li

quote:
Why would you ever want a non-optimal encoding of your instructions? :)
Anyway, you can still handcode the opcode in MASM that way, clumsy perhaps, but it's possible :)

Mutant code maybe? (It's the only use I could think of). You're right, it's not very useful... :)
Posted on 2003-12-12 12:22:29 by QvasiModo