Hello, all!

Somewhere in my code I use a CALL in a loop.
Since CALL and RET are VectorPath instructions (AMD's term),
I decided to do some speed optimization here.
Let's say I have something like this:


DrawStringRep: mov EBX, CharClass.Execute[EAX]
and EBX, NOT BDO_NONACTIVE ;reset flag to get real address
mov EDI, [ESP] ; CharString Object
call EBX
jnc DrawStringRep ; loop until CARRY is set

Here I see some operations I can move outside the loop.
Each CALL pushes the return address onto the stack and RET then pops it.
So I considered using some quicker instructions instead:

			push	@F

DrawStringRep: mov EBX, CharClass.Execute[EAX]
and EBX, NOT BDO_NONACTIVE ;reset flag to get real address
mov EDI, [ESP+DWORD] ; CharString Object
jmp EBX ; for example Char_Draw
@@: jnc DrawStringRep
pop ebx ; restore stack
.....
[edit]
qret macro ; quick return, don't clear stack
jmp DWORD PTR [ESP]
endm
[/edit]

Char_Draw proc
....
Char_Draw_Exit: qret
Char_Draw endp
Posted on 2003-09-16 20:11:02 by S.T.A.S.
The idea is great, but the code may be difficult to understand in six months' time. Wouldn't it be easier to move the loop into the called procedure, e.g. like this:



mov EBX, CharClass.Execute[EAX]
and EBX, NOT BDO_NONACTIVE
mov EDI, [ESP]
DrawString: call DWORD PTR [EBX]
; ...


Char_Draw proc

@@: ; ...
jnc @B
ret

Char_Draw endp

CALL and RET are executed only once, so they shouldn't hurt very much.

Regards, Frank
Posted on 2003-09-17 09:48:56 by Frank

CALL and RET are executed only once, so they shouldn't hurt very much.
Regards, Frank

Hi, Frank!
I know that.
In the case you're talking about, I'd use inlined proc instead of your example :)

I'm not able to move that: "and EBX, NOT BDO_NONACTIVE" outside loop.
Also I don't use "call DWORD PTR ", I use "call EBX " instead.

In my case I have several objects: scrolling text, menus, a console. Each of them draws chars on the screen (DX).
Also, each of them uses one common method (that loop) to draw these chars.
But the chars are different objects, so in this loop I call different methods to draw them.
And I call them quicker than an ordinary call (because I use "DirectPath" instructions that are pairable, instead of CALL and RET). Moreover, my procs have no EBP stack frame, so I don't use the enter/leave stuff and avoid useless stack overhead as well.

And I'm not talking about optimizing my loop :)
I'm talking about quick call of some procs.
Or maybe I'm building a wooden bicycle :stupid:
Posted on 2003-09-17 19:26:04 by S.T.A.S.
S.T.A.S., I think I have ridden that wooden bicycle a time or four and this seems familiar. ;) How about making BDO_NONACTIVE bit one, and then putting a NOP at the start of each procedure!? That reduces the code to:
DrawStringRep:	mov	EDI, [ESP]	;  CharString Object

call CharClass.Execute[EAX]
jnc DrawStringRep ; loop until CARRY is set
Posted on 2003-09-17 20:06:54 by bitRAKE
Hi, bitRAKE!
Are you talking about placing a RET at the beginning of my procs, then replacing it with a NOP when I need it?
It's a good idea; I thought the time of such tricks was gone :)

I'm just trying to avoid any modification of the .CODE segment.
So I decided to set the 31st bit of the method's address (BDO_NONACTIVE);
then the code can be executed from one part of my program,
but can't be executed in other cases, all without any modification
(modifying code may sometimes cause data/instruction cache conflicts, I think).

I understand that I can save just a few CPU clock cycles (in the best case), and I'm not sure whether it would have any effect on an Intel CPU, but I considered that better than saving a few bytes.
Anyway, "jmp EBX" should be faster than a call. And a wooden bicycle can't drown :)
Posted on 2003-09-17 21:04:26 by S.T.A.S.
S.T.A.S., let me get off the bike and spin some code...
	ALIGN 4 ; magic ;)

CharacterPROC1 PROC someParam:DWORD
nop ; voodoo ;)
push ebp
mov ebp, esp
add esp, -4 ; reserve 4 bytes for locals
...
You see the NOP makes the least significant bit of the address insignificant. :)

jmp CharacterPROC1 == jmp CharacterPROC1 + 1
; And least significant bit of CharacterPROC1 is zero.
Posted on 2003-09-17 21:26:05 by bitRAKE
Hi, bitRAKE!
Maybe I don't get your idea :(

In my case CharClass.Execute contains the address of some method.
When I need to mark it as "not to execute in some cases", I use "or CharClass.Execute, 40000000h",
then to get the real address I use "and 0BFFFFFFFh". / <- typo error removed

But you gave me another idea. (Maybe you already told me that :confused: )
I can assume that all my procs start with:
align 4 

nop
So one more bit of method's address is free for use :)
(Until now I had 2 of them)
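
The flag arithmetic described in this post can be sketched in C. This is only an illustration of the bit manipulation, not the original code: addresses are modelled as plain 32-bit values, `LOW_TAG_MASK` and all function names are hypothetical, and the sketch assumes that real method addresses never have the `40000000h` bit set and that every proc is aligned to 4 bytes.

```c
#include <assert.h>
#include <stdint.h>

#define BDO_NONACTIVE 0x40000000u /* flag value quoted in the post */
#define LOW_TAG_MASK  0x00000003u /* hypothetical name: the two low bits freed by "align 4" */

/* Mark a method address as "not to execute": or field, 40000000h.
   Assumes the real address never has this bit set. */
static uint32_t mark_nonactive(uint32_t field) { return field | BDO_NONACTIVE; }

/* Test whether the flag is set. */
static int is_nonactive(uint32_t field) { return (field & BDO_NONACTIVE) != 0; }

/* Recover the real address: and field, NOT BDO_NONACTIVE (0BFFFFFFFh). */
static uint32_t real_address(uint32_t field) { return field & ~BDO_NONACTIVE; }

/* Store a 2-bit tag in the low bits of a 4-byte-aligned address. */
static uint32_t set_low_tag(uint32_t addr, uint32_t tag) {
    assert((addr & LOW_TAG_MASK) == 0); /* address must be 4-byte aligned */
    assert(tag <= LOW_TAG_MASK);        /* tag must fit into two bits */
    return addr | tag;
}

/* Read the tag back. */
static uint32_t get_low_tag(uint32_t field) { return field & LOW_TAG_MASK; }

/* Remove the tag to get the aligned address again. */
static uint32_t strip_low_tag(uint32_t field) { return field & ~LOW_TAG_MASK; }
```

For a made-up address such as `00401000h`, `mark_nonactive` gives `40401000h` and `real_address` returns the original value. Low-bit tags need the same kind of masking before the jump, unless the proc starts with padding NOPs so that the tagged address is still a valid entry point.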

And yes, in this case I can remove "and EBX, NOT BDO_NONACTIVE"
just equate BDO_NONACTIVE to 1 :)

			push	@F

DrawStringRep: mov EDI, [ESP+DWORD] ; CharString Object
jmp CharClass.Execute[EAX]
@@: jnc DrawStringRep
pop ebx ; restore stack


A few more posts and I'll be able to remove the whole loop :grin:
But I'm afraid other parts of my code won't allow me to do so
Posted on 2003-09-17 23:52:29 by S.T.A.S.
Yeah, bitRAKE. I must say you're a genius :alright:

In my case your solution will not work because I sometimes have to use another flag, "BDO_INEEDDESTRUCTOR equ 80000000h"
(but I can use 3 NOPs at the start of prog :) )

And the idea is great!
A DWORD contains more data than 2^32 :)
It's really cool :cool:
Now I have at least 4 bits free to use in a dword address.

How many such clever tricks do you know? ;)
Posted on 2003-09-18 18:23:31 by S.T.A.S.

How many such clever tricks do you know? ;)
'Luck' is the term used to describe the skill of people you don't like. :)

No clever tricks - just skillz, baby. :grin:
Posted on 2003-09-18 20:20:26 by bitRAKE

No clever tricks - just skillz, baby. :grin:

Ok, any advice? :rolleyes:
Posted on 2003-09-18 20:31:00 by S.T.A.S.

Ok, any advice? :rolleyes:
I was just playing. :)

I would advise to try the code speed and size optimization challenges!
After that experience you will think about code differently.

No playing - the truth.

Learn to read and understand some of this code: http://www.cybertrails.com/~fys/hugi/hcompo.htm

Search this board and read the algorithms that are here.
Posted on 2003-09-18 20:51:45 by bitRAKE
Thanks for the info, bitRAKE!
Posted on 2003-09-18 21:18:24 by S.T.A.S.
It is said that JMP is harmful to the performance of a processor. Maybe CALL isn't, or is it?
Posted on 2003-10-08 05:28:22 by optimus
it is said that JMP is harmful to the performance of a processor
Yes, if the CPU jumps to code that is not in the instruction cache.
(So we should write a program without any JMP at all :grin: )

If the code is cached, then JMP can execute in parallel with some other instructions,
because it is a simple (DirectPath) instruction that just loads a new value into EIP.

CALL is NOT able to do so.
This instruction does two things: it pushes EIP and changes EIP.
This is always slower :(

Is there anyone who thinks he is programming x86 assembly?..
OR
is he programming some kind of HLL that the CPU translates internally into microOOPS (the native asm of the RISC core)? :-|

Intel says: asm is dead, use our super-optimising C compiler to get REALLY quick code.
(Otherwise code that runs fast on a Celeron 1200MHz (P3 core) will run slowly on a new Celeron 2GHz (P4 core))
This is NOT developing the BEST CPUs.
This is DEVELOPING the BEST marketing model.
"We will sell, you will pay"

There are really not so many people who CAN write REALLY cool code.
They are looking at my code and smiling ;)
They never show us all their secrets.

That is neither bad nor good -- that is life.
The white side of this is: everything can be improved.

Also, I'm able to use this smile :stupid:

PS. no offence to anyone

PPS. every instruction is harmful to the performance of a processor :tongue:
Posted on 2003-10-30 08:12:21 by S.T.A.S.

The white side of this is: everything can be improved.
If one were to build a brute force code generator they would be able to produce the best code for their needs given sufficient time - so this statement is not true in general, but for all practical purposes - it is true. :) All I mean to say is that the instruction and data bit streams the CPU uses are finite and much can be done to reduce the number of execution cycles needed.

Posted on 2003-10-30 08:38:29 by bitRAKE
the instruction and data bit streams the CPU uses are finite
I sometimes expect that "slower" but smaller instructions turn out faster than "faster" but bigger ones :)


much can be done to reduce the number of execution cycles needed.
Taking into account the number of CPU variants, I'm afraid we'll need a JIT profiler/compiler for critical code parts :eek:


I ignore self-modifying code
hmm... maybe 02:40 is not a good time to understand "brute force code generator" :o :^

bitRAKE, as always, one post of yours and one half of my mind is blocked :grin: (I'm not sure right now whether left or right :confused: )
Posted on 2003-10-30 10:46:33 by S.T.A.S.

hmm... may be 02:40 is not a good time to understand "brute force code generator"
The simplest version would consist of a code space that gets incremented until all solutions below a threshold are reached. The code space would be initialized to all zeroes. Then on each iteration an attempt would be made to execute the code space, and a test would be done to check if the desired output is reached. If the code is a solution then it is logged. Then the code space is incremented by one. Ad infinitum...
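
A minimal sketch in C of that simplest version, under loud assumptions: instead of real x86 machine code it searches a made-up five-operation instruction set acting on one 32-bit value, so no exception handling is needed, and it stops at the first (shortest) solution instead of logging all of them. Every name here (`run`, `next_code`, `brute_force`, the `OP_*` codes) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy instruction set, a stand-in for real machine code. */
enum { OP_INC, OP_DEC, OP_DBL, OP_NEG, OP_NOT, OP_COUNT };

/* Execute a candidate program on one input value. */
static uint32_t run(const uint8_t *code, size_t len, uint32_t x) {
    for (size_t i = 0; i < len; i++) {
        switch (code[i]) {
        case OP_INC: x += 1;     break;
        case OP_DEC: x -= 1;     break;
        case OP_DBL: x += x;     break;
        case OP_NEG: x = 0u - x; break;
        case OP_NOT: x = ~x;     break;
        }
    }
    return x;
}

/* Increment the code space by one, treating it as a base-OP_COUNT counter.
   Returns 0 once every program of this length has been visited. */
static int next_code(uint8_t *code, size_t len) {
    for (size_t i = 0; i < len; i++) {
        if (++code[i] < OP_COUNT) return 1;
        code[i] = 0; /* carry into the next position */
    }
    return 0;
}

/* Search programs of length 1..max_len for one that maps every in[i] to out[i].
   Returns the length found (program left in code[]), or 0 if none exists.
   The code[] buffer must hold at least max_len bytes. */
static size_t brute_force(const uint32_t *in, const uint32_t *out, size_t n,
                          uint8_t *code, size_t max_len) {
    assert(code != NULL);
    for (size_t len = 1; len <= max_len; len++) {
        memset(code, 0, len); /* code space starts as all zeroes */
        do {
            size_t i = 0;
            while (i < n && run(code, len, in[i]) == out[i]) i++;
            if (i == n) return len; /* all test cases passed */
        } while (next_code(code, len));
    }
    return 0;
}
```

Given the pairs 0→2, 1→4 and 5→12, the search finds the two-operation program `OP_INC, OP_DBL`, that is 2·(x+1). Even in this toy the number of candidates grows as 5^len, which shows why the code space for real instruction bytes becomes huge after only a few bytes.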

A better idea is to prune the code space as much as possible - this requires a feedback function setup on the exceptions. I also prune Plain_x86/FPU/MMX/SSE instructions depending on the algorithm. The code space can be HUGE for only a few bytes. I am improving my heuristics though. :) My goal is to brute force anything under 1K within a day. Currently, it doesn't really work. :(
Posted on 2003-10-30 20:04:06 by bitRAKE
Originally posted by bitRAKE
Currently, it doesn't really work.

You need a better CPU :)

Ask AMD/Intel to replace mOOPs ROM with RAM :)
Posted on 2003-10-30 21:03:21 by S.T.A.S.