Hello,

the following is a fragment of text taken from a book page describing how imported functions are resolved inside PE files:

When the program loader loads your executable and its dependent DLLs, the loader fixes up this one indirect address so that it corresponds to the final load address of the XXXXXX api. The compiler makes this indirect addressing work by generating a jump to the indirect address any time your code calls the imported function. This indirect address is stored in the .idata (or import) section of the executable. If you import through __declspec(dllimport), instead of being an indirect jump, the code is an indirect call, thus saving a couple of instructions per function call.


What I don't understand is the claim that importing through __declspec(dllimport) saves two instructions. Isn't only one instruction saved (the jump that would otherwise be made from the jump array to the .idata location where the loader has substituted the real address of the imported function)? Where is the other saved instruction?


yaa
Posted on 2004-01-04 06:06:02 by yaa
Maybe what he says is about the OriginalFirstThunk and FirstThunk: at run time FirstThunk points to the functions and not to the import table.
I could be wrong though :)
Posted on 2004-01-04 10:25:27 by wizzra
wizzra I did not get your answer. Where is the other saved instruction?

yaa
Posted on 2004-01-04 13:35:01 by yaa
Indirect JMP and CALL are same size - however, the normal method is direct CALL to "thunk" + indirect JMP using the IAT dword. So, for one import you'd get E8xxxxxxxx (5 bytes) + FF25xxxxxxxx (7 bytes). With the declspec, causing the indirect CALL (FF15xxxxxxxx) you only get 7 bytes for a single call. (both are excluding the IAT dword of course, but that's not relevant when comparing the methods, as both require it).

So... for n calls to the same import, the indirect CALL method requires n*7 bytes. The direct CALL + indirect JMP requires (n*5)+7 bytes. Using simple math... solve this equation:

(n*7) > (n*5) + 7

you get n=4 (rounded up - half a call doesn't really make sense ;)). Thus, for fewer than 4 calls to the same import, indirect CALL gives you the smallest executable, while the direct CALL + thunk is better for n>=4.
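The comparison can be checked with a few lines of Python. This is only a sketch: the function name `first_n_where_thunk_wins` and its parameters are made up for illustration. It is run for both the 7-byte figure used above and for 6 bytes, which is what an FF15/FF25 opcode pair plus a 32-bit address actually occupies.

```python
def first_n_where_thunk_wins(indirect_size, direct_size=5, thunk_size=None):
    """Smallest n for which n indirect CALLs take more bytes than
    n direct CALLs plus one shared indirect-JMP thunk."""
    if thunk_size is None:
        thunk_size = indirect_size  # indirect JMP and CALL are the same size
    if indirect_size <= direct_size:
        raise ValueError("the thunk method never wins for these sizes")
    n = 1
    # Keep going while the indirect-CALL method is still no larger.
    while n * indirect_size <= n * direct_size + thunk_size:
        n += 1
    return n

for size in (7, 6):
    n = first_n_where_thunk_wins(size)
    print(f"{size}-byte indirect form: thunk method is smaller from n = {n}")
```

With the 7-byte figure this gives n = 4, matching the result above. With 6 bytes the two methods tie at n = 6 (36 bytes each), and the thunk method only becomes smaller at n = 7.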

Not that this really matters a lot, imo. But for size-limited situations, it can matter. The indirect CALL method should also compress slightly better, since the FF15xxxxxxxx will be the same for every call to the same import, while the direct CALL + thunk method will have different bytes for every call, because the E8xxxxxxxx operand is EIP-relative.
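To illustrate the compression point, here is a small Python sketch that encodes both call forms for three call sites. All addresses are invented for the example and the helper names are hypothetical; only the opcode bytes (E8 and FF 15) come from the discussion above.

```python
import struct

def encode_direct_call(site, target):
    """E8 rel32: the displacement is relative to the end of the 5-byte instruction."""
    rel32 = (target - (site + 5)) & 0xFFFFFFFF
    return b"\xE8" + struct.pack("<I", rel32)

def encode_indirect_call(iat_slot):
    """FF 15 abs32: the operand is the absolute address of the IAT dword."""
    return b"\xFF\x15" + struct.pack("<I", iat_slot)

THUNK = 0x00401800      # assumed address of the FF25 jump thunk
IAT_SLOT = 0x00402000   # assumed address of the import's IAT dword
call_sites = [0x00401000, 0x00401020, 0x00401100]  # assumed call locations

direct = [encode_direct_call(site, THUNK) for site in call_sites]
indirect = [encode_indirect_call(IAT_SLOT) for _ in call_sites]

for site, d, i in zip(call_sites, direct, indirect):
    print(f"{site:08X}: direct {d.hex()}  indirect {i.hex()}")
```

Every indirect CALL comes out as the same byte string, while each direct CALL carries a different displacement, so a compressor has more repetition to work with in the indirect case.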
Posted on 2004-01-04 14:35:40 by f0dder
Still, f0dder, you have not explained where the "couple of instructions" saved by importing through __declspec(dllimport) are.


yaa
Posted on 2004-01-04 17:07:35 by yaa
"couple of instructions" is wrong, it's "couple of bytes". Or, well, you do save the "JMP indirect" since you do "CALL indirect" instead of "CALL direct + JMP indirect".
Posted on 2004-01-04 17:10:29 by f0dder
Wait a minute, you mean 6 bytes, not 7 ;) That makes the number of calls 6.

Hmm... I think you can use 5 bytes per call with no extra bytes per imported function, if you replace all the calls with invalid instructions, and set up an exception handler that will replace the invalid instructions with appropriate direct jumps and calls. :P If speed is not important, it could even simulate the jump or call and not patch it, thereby requiring only 3 or 4 bytes per call :alright: If the program uses many imported functions, then you could save a few bytes with this.
Posted on 2004-01-04 17:22:40 by Sephiroth3

Wait a minute, you mean 6 bytes, not 7 ;) That makes the number of calls 6.

Hehe, blame multitasking and tiredom ;)

As for using invalid instructions, humm... be careful - some instruction might be invalid now but used in the future. UD2 is two bytes - are there any one-byte invalid instructions? Of course you could use int3 0xCC, but this could make debugging hard (which could of course also be a benefit ;)).

If you're really sneaky, you could make import calls two-byte... 0xCC + 1byte index, then handle this in SEH + a LUT. Lots of apps won't need more than 256 imports anyway. Hm, interesting idea, I'll save this as a note for future use ;) :thumbsup: . Of course this would be a bit troublesome (or at least not-as-maintainable-as-normal), but it should be doable.
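A toy model of that lookup, in Python, just to make the encoding concrete. The function name, the table contents and the byte stream are all hypothetical, and this only simulates the decoding step: a real version would live in a Windows exception handler and adjust the thread context, which this sketch does not attempt.

```python
INT3 = 0xCC

def decode_import_call(code, fault_offset, lut):
    """Decode a two-byte call site (0xCC + one-byte index) at fault_offset.

    Returns the looked-up target and the offset of the next instruction."""
    if code[fault_offset] != INT3:
        raise ValueError("not an int3 call site")
    index = code[fault_offset + 1]
    return lut[index], fault_offset + 2

# Hypothetical lookup table: index -> imported function (names stand in for addresses).
lut = ["ExitProcess", "MessageBoxA", "GetModuleHandleA"]

# Hypothetical code bytes: NOP, call import #1, NOP, call import #0.
code = bytes([0x90, 0xCC, 0x01, 0x90, 0xCC, 0x00])

print(decode_import_call(code, 1, lut))
print(decode_import_call(code, 4, lut))
```

A one-byte index caps the table at 256 entries, which is where the "256 imports" limit comes from.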

But this is somewhat off-topic anyway :)
Posted on 2004-01-04 17:30:36 by f0dder
Interesting considerations that you have made guys.

As for my post, it is strange that John Robbins has made such a big mistake .... bytes and instructions aren't exactly the same thing.

yaa
Posted on 2004-01-04 17:49:51 by yaa
Well, it's easy to mix things up when you're writing. Besides, you do save one instruction per call when using indirect CALLs instead - and those can add up, making a "couple of instructions" :)
Posted on 2004-01-04 17:53:02 by f0dder
I have seen a lot of debate over the virtues of direct calls and indirect calls but usually the functions being called are so slow it simply does not matter.

If you really do need fast calls on external code in a DLL, use LoadLibrary, GetProcAddress and FreeLibrary, as you have the direct address to call without any overhead once you have the address.

You can write a DLL with a table of addresses that you get once, and then call whatever you require, if call overhead is a problem with the code design you are using.

Regards,
Posted on 2004-01-04 18:12:50 by hutch--
This is not a speed issue in any way, it's a size issue - loadlib+gpa will be larger no matter what. And you'll have the same CALL overhead as __declspec(dllimport). Getting a table of addrs is in no way better than CALL indirect - at best the same.

If you really cared about speed (which would be silly... sure, I can give examples why if required) you could patch CALL sites in the executable, but this would require a lot of relocations, which would bloat executable size.
Posted on 2004-01-04 18:20:50 by f0dder

If you really cared about speed (which would be silly... sure, I can give examples why if required) you could patch CALL sites in the executable, but this would require a lot of relocations, which would bloat executable size.


It's funny that you should mention that...

The 4th post in the thread is my source.
http://groups.google.co.kr/groups?dq=&hl=ko&lr=&ie=UTF-8&oe=UTF-8&newwindow=1&threadm=2e7c19fd.0401040225.7ad8cee1%40posting.google.com&prev=/groups%3Fhl%3Dko%26lr%3D%26ie%3DUTF-8%26oe%3DUTF-8%26newwindow%3D1%26group%3Dcomp.lang.asm.x86

I've finally made that faster ExitProcess you once said people waste their time making. (Actually it is a faster call to ExitProcess.)

That's my "proof-of-concept" code at the link.

Considering bloat, the nice thing about my system is you only pay for what you use (or add yourself).

Per import called = 4 bytes

RELOC table = 4 bytes per relocation

Call address to modify+function index = 8 bytes

So one function call will cost 16 bytes of overhead.
Secondary call will cost 8 bytes.

I could make some of this smaller, but that's for version 2.0

It looks like the OP had his question answered. Sorry if this was a thread hijack... :sweat:
Posted on 2004-01-04 21:02:43 by ThoughtCriminal
UD2 is two bytes - are there any one-byte invalid instructions?


What about 0f2h (aka int1)?
Posted on 2004-01-05 08:19:04 by roticv
ThoughtCriminal, that method seems *very* tedious... it can be done automatically if you add some stub code and hijack (plus fill out yourself) the IAT. Instead of filling the IAT with the real imports, you fill it with addrs of runtime generated "patch caller" code. It's not that much of a big deal to do, and this way you'll be able to support precompiled executables instead of requiring a lot of tedious manual work in your own apps.

However, I'd say that if you're worrying about CALL speed overhead, you're doing wrong program design - whether this is for DLLs or statically linked code. The gain by doing direct call is pretty small, and when doing it at runtime I'd worry a lot more about dirtying your code pages.

While not very useful for program optimization, this technique is sort of cute and can be used for obfuscation; especially in the scope of exe packers/encrypters where all the code pages will be dirty anyway.

roticv, ah yes - good olde icebp. I'm sorta absent-minded these days :) (you mean 0xF1 though)

PS: btw I think it's hutch who made fun about the "faster ExitProcess", not me. I could be wrong, though.
Posted on 2004-01-05 09:44:40 by f0dder
f0dder-

Yes, it is a 50-50 solution :grin: The assembler does half the work, you do the other half. However, it will only optimize what you tell it to...

However, I'd say that if you're worrying about CALL speed overhead, you're doing wrong program design - whether this is for DLLs or statically linked code. The gain by doing direct call is pretty small, and when doing it at runtime I'd worry a lot more about dirtying your code pages.


When you mentioned a faster ExitProcess a long time ago, you indirectly gave me a project to pursue. It's a proof of concept. I use APIs because it is convenient. My main goal this first time was to get it to work. A better use for this might be the DirectX vtables, a place where performance might matter. As an optimization, it would best be used at program initialization IMO. Part of my complexity is that I'm trying to do a list. I'm thinking of turning it into a function that does one relocation at a time, with some other method of supplying the address to modify and the function to point to.

Oddly enough, I was first thinking of doing something like this as a stub.

it can be done automatically if you add some stub code and hijack (plus fill out yourself) the IAT. Instead of filling the IAT with the real imports, you fill it with addrs of runtime generated "patch caller" code. It's not that much of a big deal to do, and this way you'll be able to support precompiled executables instead of requiring a lot of tedious manual work in your own apps.

Hmmmmm. But I need to get the address of external functions somehow... I'm kind of guessing at what you are talking about, but this sounds like optimize-as-you-go... You wouldn't happen to know where I could find an implementation of this? I was thinking of an exe that optimizes itself, then saves itself minus the stub, or with the stub code disabled. Then you don't have to worry about code pages being dirty, as the next time the exe is run it will load in its optimized state. Maybe use a command line parameter to force re-optimization.

You're just dragging me in deeper :)
Posted on 2004-01-05 11:12:27 by ThoughtCriminal
Posted on 2004-01-05 12:01:00 by f0dder



What about 0f2h (aka int1)?


You mean 0F1h, 0F2h is REPNE.
AFAIK 0F1h is valid on most processors and is pretty well documented. http://www.sandpile.org/post/msgs/20003978.htm
Posted on 2004-01-05 14:04:14 by MazeGen