f0dder once wrote that too many people are trying to optimize ExitProcess. In other words, useless optimization. But hey, I optimized a call to ExitProcess.

I started using jump tables again after calling everything through function pointers, and I've learned some stuff regarding call and jmp. Please correct me if I'm wrong on any of this.

Using MASM, if I turn off incremental linking, calls to functions defined in my own program do not get a jump table, just E8 (relative offset). But Windows APIs must use a jump table or function pointer, because the address of the API is not known at assemble time. The PE loader fills in the address at runtime.

Now, a few times here there has been discussion about speed etc., jump table vs function pointer. Both require a memory reference:

call ExitProcess
.
.
jmp [_imp_ExitProcess@4] ;An indirect jmp requiring access to memory

call dword ptr [_imp_ExitProcess@4] ;Calls through an address containing the address of the function
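For reference, the relevant encodings (from the Intel manuals):

E8 rel32     ;direct call, 5 bytes, no memory read to find the target
FF 15 disp32 ;indirect call through a pointer, 6 bytes, reads the pointer from memory
FF 25 disp32 ;indirect jmp through a pointer, 6 bytes, reads the pointer from memory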

But there is a relative version of jmp:

E9 (relative offset)

So I got the crazy idea of copying the addresses in the import table and making my own jump table that uses a relative call to a relative jmp. No memory reference needed:


_DATA SEGMENT
__imp__ExitProcess@4:
db 0e9h
dd 0
_DATA ENDS

mov eax,_imp__ExitProcess@4     ;Get import table address
mov eax,dword ptr[eax]          ;Get API entry point
mov ecx,__imp__ExitProcess@4+1  ;Get address of jmp+1 (the offset field)
mov edi,ecx
add edi,3                       ;edi = jmp+4; the NOT below subtracts one more
sub edi,eax
not edi                         ;NOT x = -x-1, so edi = entry point - (jmp+5)
mov [ecx],edi                   ;Store relative offset at jmp+1

invoke p_imp__ExitProcess@4,0 ;It exits without error!!!


00401327 E8 F4 1C 00 00 call __imp__ExitProcess@4 (403020h)
00403020 E9 F7 37 A7 77 jmp 77E7681C
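As a sanity check, the displacements in that listing work out:

call: 00403020 - (00401327 + 5) = 00001CF4 -> bytes F4 1C 00 00
jmp:  77E7681C - (00403020 + 5) = 77A737F7 -> bytes F7 37 A7 77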

TaDa, a faster call to ExitProcess. If you call an API or DLL in an inner loop, this might be a good optimization.
Posted on 2003-11-10 11:22:02 by ThoughtCriminal


mov ecx, ExitProcess+2
call ecx

I think it should work... At least it looks shorter than your code. :grin: PS: This code requires a jump table. :/
Posted on 2003-11-10 12:01:35 by roticv
First, am I correct in saying that you are trying to statically call the memory address where you expect ExitProcess to be?

I'd like to ask a few questions about this kind of speed optimization.

1) Is the location of API calls always the same? If they are, are you sure they are the same for all Windows operating systems? What I mean is, will the above-mentioned speed optimization work the same on Windows 95 and XP?

2) Does Windows link, for example, Kernel32.dll into the same address space every time? What I mean is, is the memory location of Kernel32.dll the same for all applications?

3) What will happen to your above-mentioned speed optimization when a new version of Windows comes out? Could you not run into a problem if they change the address of ExitProcess in a new version of Kernel32?

_DATA SEGMENT
__imp__ExitProcess@4:
db 0e9h
dd 0
_DATA ENDS


4) What is the dd 0 used for?

5) How safe is this optimization or this kind of programming based on assumptions?

Thanx
Posted on 2003-11-10 13:06:21 by SubEvil


heh, I think you are missing a few things in the code. What ThoughtCriminal is doing is taking the mem-pointer from the import table and patching it into his own jump table. Thus, the answers to your questions are:
1) The location of the import table is always the same. So of course it will work cross-Windows, as Windows itself fills in the import table.
2) No, it doesn't have to, but basically it does. Kernel32 is the first DLL to be loaded in Win9x (if memory serves me right) and subsequent DLLs will be relocated. In Win2k and XP it will most likely also be placed at the same address.
3) Nothing whatsoever.
4) heh, it's an integral part. The 0e9h that precedes it is the opcode for a relative jmp, so the dd 0 is basically just an empty placeholder for the jump's offset.
5) It's hardly based on assumptions - it's as safe as the linking of EXE files goes; it's basically the same idea.

Fake
Posted on 2003-11-10 14:07:14 by Fake51
Nope, that's not it. You got it all wrong. He is just making a new jump table at runtime, with relative jumps.



Anyway, you don't really need an import library. Many assemblers allow you to put imports inside the source. I don't remember if MASM can do that, though...

You could optimize things further by trying something like this:



; Begin with ecx=number of imports
push esi
push edi
cmp ecx,65
mov edi,Table
mov al,235                              ;235 = 0EBh, the short jmp opcode
mov ah,cl
jb not_65_imports
mov ah,64
not_65_imports:
add ah,ah                               ;each table entry is 2 bytes (EB + rel8)
call afterpatchfunction                 ;pushes the address of beforepatchfunction
beforepatchfunction:
;this stub runs when a table entry is called: it finds the call site from the
;return address, looks up the real import in the IAT, patches the caller's
;rel32 and re-runs the (now direct) call
pop eax
mov ecx,[eax-4]
add ecx,eax
sub ecx,Table
mov edx,[WhateverFirstImport+ecx*2]
sub edx,eax
mov [eax-4],edx
sub eax,5
jmp eax
afterpatchfunction:
pop esi                                 ;esi = beforepatchfunction, source for the copy below
table_loop:
dec ecx
sub ah,2
stosw                                   ;emit a 2-byte short jmp table entry
jae table_loop
push ecx
push afterpatchfunction-beforepatchfunction
pop ecx                                 ;ecx = length of the patch stub
rep movsb                               ;copy the stub right after the entries just written
mov ah,beforepatchfunction-afterpatchfunction-2
pop ecx
jz nomoreimports
table_loop2:
sub ah,2
stosw
loop table_loop2
nomoreimports:
pop edi
pop esi


Then you could just call all the APIs with call Table+FunctionNumber*2 (or Table+FunctionNumber*2+afterpatchfunction-beforepatchfunction for imports from 64 and up), and it would automatically correct the address.
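For example, call sites could look like this (just a sketch using the labels above - the function numbers are made up):

call Table+3*2                                           ;import number 3 (one of the first 64)
call Table+70*2+(afterpatchfunction-beforepatchfunction) ;import number 70 (64 and up)

The first call to each entry takes the slow path through the patch routine; every call after that has already been rewritten into a direct call.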
Posted on 2003-11-10 14:18:48 by Sephiroth3
regards!
Posted on 2003-11-10 21:08:54 by jefeng
Maybe I'm missing something, but if you want to dump the jump table, why not just use hutch's L2EXTIA utility? Then when you say:

invoke ExitProcess,eax

you get:

push eax
call _imp__ExitProcess@4

:grin:
Posted on 2003-11-10 21:56:27 by S/390
S/390:

Check the opcode for a call using his utility. I'll bet the first byte is FF, an indirect call needing an access to memory to get the address. What I'm doing does not need a memory access.

.
.
.

Regarding the import table: I'll need to look at a Windows program and not a console one, since a windowed program needs user32.dll. What I do know about kernel32.dll is that the imports are loaded alphabetically. I only used one API, so _imp__ExitProcess@4 would be at the top of the import table. I'll need to look and see what happens when you link with more than one lib. You should just be able to grab the first entry, go down the table (up in memory), and stop when you get to 0 (the table is zero-terminated).
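Something like this should do it (just a sketch, untested - assume esi already holds the address of the first IAT slot and edi points at a writable, executable buffer for the 5-byte stubs):

next_import:
    mov eax,[esi]            ;API entry point filled in by the PE loader
    test eax,eax
    jz done_imports          ;the table is zero-terminated
    mov byte ptr [edi],0E9h  ;relative jmp opcode
    lea ecx,[edi+5]          ;address right after the 5-byte jmp
    sub eax,ecx              ;rel32 = entry point - next instruction
    mov [edi+1],eax
    add esi,4                ;next IAT slot
    add edi,5                ;next stub
    jmp next_import
done_imports: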

Sephiroth3:

A few too many magic numbers (for me) to understand your code.
Posted on 2003-11-11 00:14:55 by ThoughtCriminal
Hehe, well, what I'm doing is making a table of jumps that all go to the same address, which is a function that reads the called address from the instruction that called the function, and replaces it with the offset to an imported function, taken from the IAT. Hmm. I just saw an error. The code is broken for entries 64 and up. But that's easy to repair. Anyway, you get the idea :P The code will now jump directly to the imported function, instead of going to an intermediate address.
Posted on 2003-11-11 11:31:05 by Sephiroth3
The order that you list imports in your executable, whether it be console, GUI, or DLL, is irrelevant. It just may happen that your assembler and/or linker will create the tables in alphabetical order.

On the other hand, the exports of an executable MUST be in alphabetical order.

I know this as I have written import entries by hand - I don't anymore because the macros take care of the creation of the import section.


section '.idata' import data readable writeable

dd 0,0,0,rva kernel_name,rva kernel_table
dd 0,0,0,rva user_name,rva user_table
dd 0,0,0,0,0

kernel_table:
GetCommandLine dd rva _GetCommandLine
ExitProcess dd rva _ExitProcess
dd 0
user_table:
MessageBox dd rva _MessageBoxA
dd 0

kernel_name db 'KERNEL32.DLL',0
user_name db 'USER32.DLL',0

_GetCommandLine dw 0
db 'GetCommandLineA', 0
_ExitProcess dw 0
db 'ExitProcess', 0
_MessageBoxA dw 0
db 'MessageBoxA', 0


With the import macros you get:


section '.idata' import data readable writeable

library kernel, 'KERNEL32.DLL',\
user, 'USER32.DLL'

kernel:
import GetCommandLine, 'GetCommandLineA',\
ExitProcess, 'ExitProcess'

user:
import MessageBox, 'MessageBoxA'


Now I use the .code, .data, and .end macros, which create the code, data, and import sections.

Here is the minipad example that comes with fasm. It has an unordered import table.
Posted on 2003-11-11 12:05:09 by eet_1024
Sephiroth3:

Is the first byte of any call with this method FF? I looked at Hutch's utility and it makes function pointers (indirect calls) into the import table. I figured this out independently, but our syntaxes were very similar.

My little code for copying and adjusting the RVAs would be a run-once thing done at program start, not for every call to a function.
(I'm sure you understand this, but I want to clarify for those who may not.)

eet_1024:
On the other hand, the exports of an executable MUST be in alphabetical order.

Thanks for letting me know. I was wondering if they came in alphabetical order because the libs (i.e. Kernel32.lib) are in alphabetical order. In the case of MASM, I'd guess LINK would enforce the order. But you're using FASM, so maybe the PE loader enforces this??
Posted on 2003-11-11 21:51:30 by ThoughtCriminal
Ok, first post so hi to the forum and please bear with me here.

Isn't this actually quite stupid? You are, for speed reasons, building a new jump table, only this time with relative jumps - it really feels like a kludge. If you were ever in a situation where you needed to speed up that jump, wouldn't something like this do:

;before loop
mov unused_reg, [_imp__Function@X]

;in loop
call unused_reg

OK, sure, one could argue that this isn't an option in most cases, but it's quite a deal less work than building the jump table. Another way would be runtime patching: just create a function that on startup patches itself to use relative addresses instead. That way you would at least get rid of the jump table, and figuring out a good scheme to encode the patch locations would also get around hardwiring every place to patch...

But somehow this all feels like throwing the baby out with the bathwater...
Posted on 2003-11-13 04:14:48 by DrunkenCoder

It all depends on what your optimization needs are. If you have regs to spare, your code is fine. If you don't, and freeing one would negatively impact your algorithm, a relative call might be better than changing the algorithm. But if that call is to an address that cannot be known at assemble time (imports, DLL vtable method entry points, etc.), MASM can only code it as a jump table:

call someAPI ;E8 (offset to the jump table entry)
jmp dword ptr [_imp__someAPI] ;FF25 (address of the pointer to the function's entry point)

Or a function pointer:

call dword ptr [_imp__someAPI] ;FF15 (address of the pointer to the function's entry point). A call through a reg is likewise a call through a pointer.



Both methods need to touch memory. What if you really need speed and touching memory is an unacceptable penalty?

There are two things I know you could do. One is to patch all inner-loop call sites into direct relative calls. That is also extreme optimization. Depending on how you build your program, and for safety, you may need to call VirtualProtect to make the area writable, and again to make it non-writable, and the patching might affect surrounding code. It can be a lot of work and error-prone. What I demonstrated is a basic way to call functions whose entry point cannot be known at assembly time, without touching memory. The optimization gurus here have not posted anything about a fatal flaw in what I'm doing; I take that as a good sign. They are quite good at educating someone who says something that isn't right.

Optimizing around calls to Windows APIs is generally silly - the time the API takes usually cancels out any gain. I used ExitProcess as a simple example. So who needs an optimization like this? I can only guess: optimization freaks, demo coders, DirectX, video decoders? Maybe some of the OOP implementations could use this for inheritance/vtable optimization. VC2003 uses a relative call to a relative jmp for some operator overloads (in some stuff I looked at). I've never seen this type of optimization discussed, so I'm just trying to provide some food for the forum. I hope some have read this, gained a deeper understanding of how this all works, and will run with it and share what they learn.
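As a rough sketch of that call-site patching (not code from my program - call_site, pSomeAPI and oldProtect are made-up names, with oldProtect a dd somewhere in .data, and it's untested):

    ;rewrite one 5-byte call site into a direct E8 call to the API
    invoke VirtualProtect, offset call_site, 5, PAGE_EXECUTE_READWRITE, addr oldProtect
    mov eax,pSomeAPI                 ;API entry point, read from the IAT at startup
    mov ecx,offset call_site+5       ;address of the instruction after the call
    sub eax,ecx                      ;rel32 = entry point - next instruction
    mov byte ptr [call_site],0E8h    ;direct relative call opcode
    mov dword ptr [call_site+1],eax
    invoke VirtualProtect, offset call_site, 5, oldProtect, addr oldProtect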

Oh, BTW, welcome to the forum :)
Posted on 2003-11-14 04:31:58 by ThoughtCriminal
No, my idea involves direct calls. When the call is first made, the address of the function will be loaded from the import table and the offset will be updated, and subsequent calls will then be very fast.
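Roughly, the life of one call site would then be (an illustration of the idea, not actual output):

;first call:  call Table+n*2 -> short jmp -> patch stub: read the IAT, rewrite the caller's rel32, re-run the call
;later calls: the same E8 now goes straight to the API entry point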
Posted on 2003-11-14 11:39:48 by Sephiroth3
I guess we kinda misunderstood each other. When you posted your code, I figured that rather than copying and then patching the imports like me, you were getting the address from the imports to patch the actual call. My simple method uses a jump table, and your more complicated method has no jump table, just a direct call to the correct offset (entry point).
Posted on 2003-11-14 14:10:22 by ThoughtCriminal
The reason PE exports must be alphabetical is because a binary search is used when the PE loader tries to locate an import.

This is a silly optimization anyway - as mentioned in another thread. First of all, you end up dirtying your code pages, which as we all know is a bad idea (not being able to do discard+reload, but having to do pagefile write+pagefile load, and of course not being able to use code sharing when running multiple instances of the same app).

If you "need" this kind of optimization, you're designing your application wrong. Who would do *calls* in their innerloops? This is akin to spending a lot of time optimizing putpixel for use in a spriteblit, whereas you should really spend your time designing the spriteblit properly.

It's a cute trick (even if your implementation lacks a lot), but I hope people won't go "gee wizz, I'm gonna use this in my apps!" without considering the tradeoffs.
Posted on 2004-01-05 10:00:51 by f0dder
To be honest, I have no idea what I'd use this for. I have no interest in making high-performance 3D demos or anything like that. However, if I ever need something like this, I will have some ideas on how to proceed.

I hadn't touched programming for over a month and just needed a project.


The reason PE exports must be alphabetical is because a binary search is used when the PE loader tries to locate an import.

Thank you, that's good to know.
Posted on 2004-01-05 11:27:47 by ThoughtCriminal
(continuing from http://www.asmcommunity.net/board/index.php?topic=16694 )

Hmm, don't think DX would patch its vtable on-the-fly, so it's probably "safe" to 'optimize' the vtable calls to direct calls. However, for <256-byte indices into the vtable, the call can be coded as a three-byte "call [reg+disp8]"... I like guaranteed safe+small better than slight_risk_of_something_unforeseen+not_really_that_much_faster :)
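For reference, such a three-byte vtable call might look like this (a sketch - the interface pointer name and slot number are made up):

mov ecx,pInterface      ;interface pointer
mov eax,[ecx]           ;fetch the vtable
push ecx                ;this pointer, pushed after any other arguments
call dword ptr [eax+8]  ;FF 50 08 - three bytes, calls vtable slot 2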

And would it really matter for DX anyway? The big deal is 3D, and there you (as far as I have understood) tend to send a vertex buffer or similar - so, doing a whole batch of operations per call. Still seems like a silly optimization.

I don't know if there's any source floating around for my stub idea, and I can't really rip out parts of a "pretty private" project I'm working on :). But it's not really too bad to get working, as long as you fix the PE header properly when adding the stub.

EXE that optimizes itself and dumps to disk... nah. If the calls are optimized on the fly, you'll have to dump at an arbitrary point - some calls might not be optimized yet. Then there are all the tedious things about dumping, like having to reconstruct the IT, fixing up the PE header, etc. Besides, static global variables might have been changed, which could cause unpredictable program execution. Also, you will have an EXE that will only work on YOUR computer, with the exact same Windows version, service packs, hotfixes etc. 'Re-optimization' probably wouldn't be feasible.

The best runtime optimization that doesn't cause dirty code pages would probably be running through the import thunks and fixing the FF25 to E9 + whatever padding byte. These thunks should be grouped together, and iirc (could be wrong) will usually be located close to the IAT, which will (should) be dirty anyway, since it needs to be fixed by the PE loader. This still has both CALL+JMP but at least both are direct calls. Of course it doesn't handle the case where the app directly does indirect call through the IAT instead of direct call to the thunk.
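A sketch of that thunk rewrite (untested - assume esi points at one 6-byte FF25 thunk):

mov eax,[esi+2]          ;address of the IAT slot used by the FF25 jmp
mov eax,[eax]            ;API entry point stored in that slot
lea ecx,[esi+5]          ;end of the 5-byte E9 jmp about to be written
sub eax,ecx              ;rel32 = entry point - next instruction
mov byte ptr [esi],0E9h  ;direct relative jmp
mov [esi+1],eax
mov byte ptr [esi+5],90h ;pad the leftover sixth byte with a NOP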


To be honest, I have no idea what I'd use this for.

Heh, okay - I was afraid you thought this was a thing that would make much of a difference ;). It's a fun project anyway, and there's a whole bunch of things to be learnt from it which might be put to use for more productive things.


I hadn't touched programming for over a month and just needed a project.

Play around with PE encryption and advance to compression afterwards - those are fun and interesting (and lots of hairpulling when SOME version of windows refuses to load the output PE while all the rest work, hehe)..
Posted on 2004-01-05 12:00:30 by f0dder

Hmm, don't think DX would patch its vtable on-the-fly, so it's probably "safe" to 'optimize' the vtable calls to direct calls. However, for <256-byte indices into the vtable, the call can be coded as a three-byte "call [reg+disp8]"... I like guaranteed safe+small better than slight_risk_of_something_unforeseen+not_really_that_much_faster

It does not patch it on the fly.

And would it really matter for DX anyway? The big deal is 3D, and there you (as far as I have understood) tend to send a vertex buffer or similar - so, doing a whole batch of operations per call. Still seems like a silly optimization.

From what I know about Direct3D, there is a render loop. The best optimization is to remove function calls; maybe this could be second best.

EXE that optimizes itself and dumps to disk... nah. If the calls are optimized on the fly, you'll have to dump at an arbitrary point - some calls might not be optimized yet. Then there's all the tedious things about dumping, like having to reconstruct IT, fixing up PE header etc. Besides, static global variables might have been changed, which could cause unpredictable program execution. Besides, you will have an exe that will only work on YOUR computer, with the exact same windows version, service packs, hotfixes etc. 'Re-optimization' probably wouldn't be feasible.

I would not optimize on the fly at all - optimize once at program startup, before any of the real program code is used. "Kinda" like the C# JIT compilation process: it runs the jitter the first time the program is started and tunes it for... your computer :grin: With my current code, re-optimization would be easy. It only patches static addresses known at link time, so the addresses never move and the code does not change size.

The best runtime optimization that doesn't cause dirty code pages would probably be running through the import thunks and fixing the FF25 to E9 + whatever padding byte. These thunks should be grouped together, and iirc (could be wrong) will usually be located close to the IAT, which will (should) be dirty anyway, since it needs to be fixed by the PE loader. This still has both CALL+JMP but at least both are direct calls. Of course it doesn't handle the case where the app directly does indirect call through the IAT instead of direct call to the thunk.

My first post in this thread has code to do E8 to E9. A lot simpler than E8 to entry point, as you can see. FF25 is an indirect jmp. Please explain dirty code page. I could build the E9 right after the IAT in a data section.

Play around with PE encryption and advance to compression afterwards - those are fun and interesting (and lots of hairpulling when SOME version of windows refuses to load the output PE while all the rest work, hehe)..

Perhaps I will.....

Thanks for all your input.
Posted on 2004-01-05 21:03:46 by ThoughtCriminal

Please explain dirty code page.

Dirty pages are one of the reasons things like EXE compression are generally bad... http://f0dder.has.it , articles, "packing, data handling, stuff".
Posted on 2004-01-06 00:44:58 by f0dder