I posted 59 tips/tricks you can only do in assembler on a webpage. It is part of a tutorial I wrote showing how to speed up code written in C using assembler. I wanted to show what kinds of fancy tricks you can do in assembler that you can't do in C (or other high level languages). There are a few items on the list that I included because I think they are great tricks, even though you can do them in a high level language. But the vast majority of the 59 items (96% or more) are assembler only. I broke them up into 3 categories: Beginner, Intermediate, and Advanced. If you are looking for tips/tricks on how to make your code run faster, this is a great place to start. I have seen very few up-to-date assembler optimization pages that go into as much detail as I did.

http://www.visionx.com/markl/optimization_tips.htm
Posted on 2004-07-09 20:35:51 by mark_larson
cool :)
Posted on 2004-07-09 23:35:51 by Homer
Nice page. Maybe you could add a trick described in the AMD optimization manual (http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf), page 112.

The trick uses a negative index to save a cmp at the end of a loop. For example:


mov esi, src
xor ecx, ecx                        ; index starts at 0
LoopLbl:
; some code using [esi][ecx]
add byte ptr [esi][ecx], 3          ; operand size assumed to be byte
inc ecx
cmp ecx, MAXSIZE
jne LoopLbl

can be replaced by :


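mov esi, src
mov ecx, -MAXSIZE                   ; index runs from -MAXSIZE up to 0
LoopLbl:
add byte ptr [esi][ecx+MAXSIZE], 3  ; operand size assumed to be byte
inc ecx                             ; sets ZF when ecx reaches 0
jnz LoopLbl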
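
For readers coming from the C side, here is a minimal sketch of the same idea. It is only a model of the two loops above, not code from the AMD manual. The function names are made up for illustration, and the byte-sized elements are an assumption. Whether a given compiler turns the second form into the shorter loop is not guaranteed.

```c
#include <stddef.h>
#include <stdint.h>

/* Straightforward form: the index counts up from 0 and is compared
   against n on every iteration. */
void add3_counting_up(uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        src[i] += 3;
}

/* Negative-index form: point one past the end and let the index run
   from -n up to 0, so the loop only has to test the index against zero. */
void add3_negative_index(uint8_t *src, size_t n)
{
    uint8_t *end = src + n;
    for (ptrdiff_t i = -(ptrdiff_t)n; i != 0; i++)
        end[i] += 3;
}
```

Both functions touch the same elements in the same order; only the loop-exit test differs.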
Posted on 2004-07-09 23:48:14 by Dr. Manhattan



I actually cover getting rid of compares, but the example I show uses DECs not INCs. I think I am going to snag this one also. Thanks for the heads up :)
Posted on 2004-07-10 00:43:49 by mark_larson
Very very nice. I have learnt a lot from just glancing through your site. I'm sure it will come in very handy for me and many other people. Good work. :alright:
Posted on 2004-07-10 03:10:40 by DeX
Hi Mark,

Thanks :alright: Very nice work :)
Posted on 2004-07-10 03:14:17 by Vortex
cute/:P
Posted on 2004-07-10 09:26:41 by krakers
Mark, you are probably aware that some more suggestions and Coding Rules can be found in the IA-32 Intel Architecture Optimization Reference Manual. But your page is easier to read and comes in useful, thank you.

The title should rather be Performance Optimization Tips as it only concerns optimization for speed. Of course, speed of code is the most important criterion, especially among Sunday Assembly coders. Everyday Assembly coders, on the other hand, sometimes prefer other criteria:

    - size of code
    - readability
    - writability
    - reusability
Posted on 2004-07-11 08:23:40 by vit$oft
Nice site, Mark, but I have one question about my 68000 emulator (the program is stored as little endian words, but big endian DWORDs).
To convert from big endian to little endian I need to do a rol eax,16 or a ror eax,16, and your site says to avoid this. Do you guys by any chance have an alternate route? (I need to swap the high and low words of a register.)
Posted on 2004-07-11 09:58:12 by x86asm



Yepper, I am familiar with that manual. I have had all the Intel Optimization manuals going back to the PPro, or maybe P2. They used to give us hard copies at work (plus the 3 processor manuals). And then when they started posting them as PDFs, I downloaded them from Intel's site. And yeah, I need to get the word "Optimization" in the title of the webpage.
Posted on 2004-07-11 10:48:05 by mark_larson



You can't use ROL or ROR to convert from big endian to little endian. The bytes have to be swapped. A ROR or ROL by 16 swaps the two words but leaves the bytes within each word in the same order, so it won't work. On the plus side, there is an instruction that does it: BSWAP. It converts between big endian and little endian. I am going to add that trick to my web page when I get a chance; it is already on my list of things to add. The down side is that BSWAP on a P4 is slower than a ROR or a ROL. It runs in 7 cycles. On the up side, you probably can't do it any faster by breaking it up into multiple instructions on the P4.
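
To make the difference concrete, here is a small C model of the two operations. It is a sketch of what the instructions compute, not a claim about how fast they run. The function names are made up for illustration.

```c
#include <stdint.h>

/* What ROL eax,16 (or ROR eax,16) computes: the two 16-bit halves
   trade places, the bytes inside each half keep their order. */
uint32_t rotate16(uint32_t x)
{
    return (x << 16) | (x >> 16);
}

/* What BSWAP eax computes: all four bytes are reversed. */
uint32_t byteswap32(uint32_t x)
{
    return (x >> 24)
         | ((x >> 8) & 0x0000FF00u)
         | ((x << 8) & 0x00FF0000u)
         | (x << 24);
}
```

With the input 0x12345678, rotate16 gives 0x56781234 while byteswap32 gives 0x78563412.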
Posted on 2004-07-11 10:58:04 by mark_larson




The 68000's program code already has each individual word byte swapped (so each word is already in little-endian format), but I don't do it at the DWORD level, you see. So if a DWORD read is requested by the 68000's program, I need to swap the high and low WORDs. That is why I use ROL: it swaps the words, but not the bytes, as they are already swapped. Yeah, I would like to stay away from the BSWAP instruction. It doesn't seem to be very fast on today's CPUs. Would there be a way to swap the high and low words of a 32-bit register using some other instruction instead of ROL?
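
As a sanity check on this layout, here is a C sketch of the DWORD read described above. The helper and function names are made up for illustration, and the buffer is assumed to hold 68000 data with the bytes of each word already swapped. The second variant shows one possible way to avoid the rotate by placing two 16-bit loads directly. No claim is made about which variant is faster on any particular CPU.

```c
#include <stdint.h>

/* Little-endian 16-bit load, as an x86 host would perform it. */
static uint32_t load16_le(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8);
}

/* Variant 1: one 32-bit little-endian load, then swap the halves
   (this is the ROL eax,16 step). */
uint32_t read_long_rotate(const uint8_t *p)
{
    uint32_t x = load16_le(p) | (load16_le(p + 2) << 16);
    return (x << 16) | (x >> 16);
}

/* Variant 2: two 16-bit loads placed straight into position, no rotate. */
uint32_t read_long_two_loads(const uint8_t *p)
{
    return (load16_le(p) << 16) | load16_le(p + 2);
}
```

For the 68000 long 0x12345678, the pre-swapped buffer holds the bytes 34 12 78 56, and both variants return 0x12345678.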
Posted on 2004-07-11 13:38:27 by x86asm
I thought that newer CPUs must be 'faster' hehe, but it seems that the more new CPUs we get, the slower the instructions get :D
So then I guess buying a new PC is out of the question hehehe
Posted on 2004-07-11 13:46:42 by wizzra



That was because Intel was trying to make it easy to scale their CPU speeds up. They increased the pipeline depth, they increased instruction latencies, and they expanded the branch prediction logic, all with the plan of making it up with much higher clock speeds. Not all instructions got slower. Just some.

These instructions got faster: xor, or, and, not, neg, mov, cmp, test, add, sub, movzx, movsx. You can actually do 4 ALU instructions a cycle from that list if you don't have any dependencies and stick to those instructions. Each of them runs in half a cycle.

Here are some examples (not a comprehensive list) of instructions that got slower going from the P3 to the P4:
shr, shl, sar, rol, ror, rcr, rcl, lea, movq, movaps, movd

The same thing happened with Prescott. Several of the instructions have longer latencies when compared to the earlier P4 cores, usually only by 1 cycle. Here are some examples:

(again this is not comprehensive, there are a lot more)
SIMD: pmaddwd, PMULHx, PMULLx, PMULUDQ, ADDPD, etc
floating point: fabs, fadd, fsub, fmul, etc
ALU: adc, sbb, add, sub, and, or, xor, not, bsf, bsr, etc

A few instructions (about 4 or 5) did get faster on Prescott, but not as many as got faster going from P3 to P4.
Posted on 2004-07-11 14:32:06 by mark_larson
thanks markl :)
Posted on 2004-07-11 16:08:09 by wizzra





You could look at XCHG if you only want to swap the words. Just don't do XCHG with memory, because it does an implicit lock. Unlike ROL/ROR, XCHG can't swap the two halves of a single register, so if both word values are in the same register you'd have to copy one of them to another register first.
Posted on 2004-07-11 17:06:28 by mark_larson
"BSWAP reg32" is one cycle on Athlon's.

x86asm, why are the words pre-swapped? If it was all done at runtime then BSWAP would be more useful.
Posted on 2004-07-11 19:18:23 by bitRAKE

BSWAP is only one cycle on the Athlon?!?! Weird.
I don't know. The emulator is interpreter based, and all of the 68000 opcodes are 16 bits wide with some extension stuff. So it would be easier and faster for the Sega Genesis program to be byte swapped first, and then I can just read the opcode from the array directly, without manipulating the data. Or, as you say, I could use the BSWAP instruction, but a word-wide switch to little endian would require an XCHG AH,AL. How does that instruction perform?
Posted on 2004-07-11 19:56:25 by x86asm

It has been a few years since I did 680x0 and I don't know how the rest of your emulator works, but why do you need to swap the opcode bytes to emulate? I can understand swapping a DWORD, or a WORD offset, but not the opcode bytes. It seems like an unneeded step. XCHG AH,AL is two cycles.

Instead of using XCHG, you could just read two bytes before and BSWAP. :)

mov ax,
xchg ah, al

...replaced by...

mov eax, [-2]
bswap eax

Of course, this trashes the top word, and you might already have the data in a register from the dispatcher -- I would.
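
Here is a C model of that read-two-bytes-before trick, as a sketch only. The function names are made up for illustration, and the data is assumed to be unswapped big-endian 68000 memory with at least two readable bytes in front of the word being fetched.

```c
#include <stdint.h>

/* Model of a 32-bit little-endian load, as "mov eax, [mem]" does on x86. */
static uint32_t load32_le(const uint8_t *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Model of BSWAP: reverse all four bytes. */
static uint32_t byteswap32(uint32_t x)
{
    return (x >> 24)
         | ((x >> 8) & 0x0000FF00u)
         | ((x << 8) & 0x00FF0000u)
         | (x << 24);
}

/* Load the dword that starts two bytes before p, then byte swap it.
   The low 16 bits now hold the big-endian word stored at p; the high
   16 bits hold the two preceding bytes (the "trashed" top word).
   p - 2 must point to readable memory. */
uint32_t eax_after_bswap_load(const uint8_t *p)
{
    return byteswap32(load32_le(p - 2));
}
```

With the bytes AA BB 12 34 in memory and p pointing at the 12, the result is 0xAABB1234, so the low word is the 0x1234 the 68000 program expects.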
Posted on 2004-07-12 00:04:34 by bitRAKE


I'm reversing the byte order in each word, so say we have a word that is 1234h, I'm switching it to 3412h. This is because the 68000 is big endian. These methods probably have very little overhead, though, so I will probably use them instead of byte swapping the code.
Posted on 2004-07-12 08:44:40 by x86asm