Hi All,

The subject says a lot about what I'm looking for  ;) ...

After searching the internet for "smallest exe" and the like, I came across competitions and real games written in assembler, in the smallest 32-bit Windows exes ever... amazing.

Even on this forum I'm finding topics on small exes...  :)    but they all use (when running on Windows XP/2003) over 1,000 KB of memory.
(like the sample minimum.exe in the MASM directory)

Any suggestions for a small memory footprint "dummy" application?

Why?
I need a dummy process for a home-made setup on a Windows 2003 server, where I have to preload 200 different user hives to speed up the later logon for these users (yes, on a terminal server). We now use minimum.exe to keep each user hive loaded on the system (when the process is closed, the hive is automatically unloaded), but 200 copies of this small process still use 400MB of memory. Btw, this pre-fetching of user hives does work the way we want, and speeds up logon processing on our server (in some cases we have 2 to 3 interactive logons per second)!

Greetings,
Dabur

PS: In my college days, I took assembler courses on the Intel 8088 and the Motorola 68000. The stuff you guys are doing on the x64/x86 platform is amazing and a lot different (read: complex, for me  :oops:  )
Posted on 2007-03-28 05:57:04 by dabur
You have to understand the difference between "working set size", "private bytes", and a bunch of other figures. The figure you'll be most interested in is "private bytes", since that's the amount of memory that isn't shareable between instances of your process (and thus labeled "private").

Get SysInternals' Process Explorer, it's much better to poke around with than the default taskmgr. And if you want even more detail, check perfmon.msc (start->run->perfmon.msc<enter>).
Posted on 2007-03-28 07:18:02 by f0dder
You didn't say win32 app.
http://stig.servehttp.com/homer/ONEBYTE.zip
Posted on 2007-03-28 09:04:36 by Homer
You're so silly, homer :)

To continue in the same vein, he did say 32bit app though, and the context it's running in is clearly win32. So

1) your app is disqualified because it will run in a 16bit subsystem.
2) you incur the overhead of NTVDM ;)
Posted on 2007-03-28 09:08:02 by f0dder

Yep, don't want to mess with the NTVDM ...

Reference on other forum: http://www.masm32.com/board/index.php?topic=6994.0
Got some results with the code there. Working great.

Thanks for the reply !!  :)
Posted on 2007-03-28 10:30:35 by dabur
Again - the figure you should care about for your purpose is almost exclusively "private bytes". Don't worry too much about the other figures, especially not for your example. A win32 process _always_ needs to include kernel32.dll (unless you want to lose compatibility with a few windows versions).

If you really really really need to shave things tightly, perhaps a native image would be in order... what exactly does minimum.exe need to do? Just keep running until the session is logged off, and it is terminated automatically?


EDIT: I don't know if it's possible to run native exes under normal windows (they might be reserved for drivers and "boot time applications", the ones that run before windows is fully initialized (chkdsk, some defragmenters) - time to check sysinternals/whatever).

I tried creating a (normal) win32 app that only imported from NTDLL.DLL (ie, no kernel32/user32, using Nt* functions) - the memory footprint difference between NTDLL/NtDelayExecution and KERNEL32/SleepEx is extremely small: 132kb private bytes in both, 4.224k Virtual Size in both, 504/512k working set respectively. This is probably because there are a few DLLs that get injected into all win32 processes by default - it might be possible to reduce these (I think the registry KnownDlls has to do with this), but that might affect system stability (with broken apps).

Also note that win2k will _silently refuse_ to load any application that doesn't import from kernel32 (whether directly, or by importing from a dll that ends up importing from kernel32 somewhere down the chain). This is used in some size-sensitive demos that import GDI32:Arc, since that afaik is the shortest import you can have that ends up bringing in kernel32.

Using another subsystem (os/2 or posix) probably won't help - perhaps they have a smaller per-app dependency, but they'll end up launching some subsystem handlers (just like 16bit apps require NTVDM.EXE).

Imho, the best bet so far is probably just depending on kernel32/sleepex with infinite timeout - that's safe, and relatively small. You can add SetProcessWorkingSetSize(GetCurrentProcess(), -1, -1) to trim the process before windows would normally do so, but that still leaves you with the private bytes usage (and possible stuff in the paging file).
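For anyone who wants to poke at these two calls before writing the real thing, here is a minimal sketch of the call sequence using Python's ctypes. It is only an illustration: the function name trim_and_sleep is made up, and a Python interpreter is far too heavy to serve as the dummy process itself, so the actual placeholder would still be a tiny asm or C executable making the same two kernel32 calls.

```python
import sys

INFINITE = 0xFFFFFFFF  # Win32 timeout value meaning "wait forever"


def trim_and_sleep() -> None:
    """Trim this process's working set, then block forever (Windows only)."""
    if sys.platform != "win32":
        raise OSError("this sketch only works on Windows")

    import ctypes
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)

    # Declare the prototypes so handles and sizes are passed at full width.
    kernel32.GetCurrentProcess.restype = wintypes.HANDLE
    kernel32.SetProcessWorkingSetSize.argtypes = [
        wintypes.HANDLE, ctypes.c_size_t, ctypes.c_size_t]
    kernel32.SetProcessWorkingSetSize.restype = wintypes.BOOL
    kernel32.SleepEx.argtypes = [wintypes.DWORD, wintypes.BOOL]
    kernel32.SleepEx.restype = wintypes.DWORD

    # (SIZE_T)-1 for both limits asks Windows to trim as many pages as it can.
    all_ones = ctypes.c_size_t(-1).value
    if not kernel32.SetProcessWorkingSetSize(
            kernel32.GetCurrentProcess(), all_ones, all_ones):
        raise ctypes.WinError(ctypes.get_last_error())

    # Non-alertable sleep with an infinite timeout: this call never returns.
    kernel32.SleepEx(INFINITE, False)
```

Calling trim_and_sleep() on a Windows box parks the process until it is killed, which makes it easy to watch the effect of the trim in Process Explorer.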

If you don't mind being very system specific, and possibly having the app break on a service pack, and don't need win2k compatibility, it might be possible to come up with something that has no DLL dependencies and uses a syscall - that could possibly bring memory requirements down a bit.
Posted on 2007-03-28 17:11:58 by f0dder
Okay, I've gone as far as creating an executable with no imports (still win32 subsystem), directly using system calls. Unfortunately, on my girlfriend's laptop (winxp sp2), with process explorer I can verify that both ntdll.dll and kernel32.dll are injected into the process by the system, leaving me at 128k private bytes, 4meg virtual size, and 496kb working set (124kb private, 372kb sharable, 356kb actually shared).

So, the best solution is probably to use the kernel32/SetProcessWorkingSetSize trick to trim the WS (no point in leaving kernel32.dll out when it's being injected anyway), and then a SleepEx call to sleep indefinitely. With that, I get 132kb private bytes, 4meg virtual size, 160kb working set (100kb private, 60kb sharable, 52kb shared). Unfortunately, setting the PE heap and stack parameters doesn't seem to affect any of this.

Also note that windows itself will do memory trimming, the SetProcessWorkingSetSize trick just makes it do it "right now" rather than "when needed".
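To relate these numbers back to the original question, here is a rough back-of-the-envelope calculation. It assumes all 200 instances behave like the single process measured above and that private bytes simply add up across instances. The 400MB total from the first post was most likely read from Task Manager, so the two figures are not strictly comparable. The function name is made up for this illustration.

```python
# Rough estimate only: assumes every instance looks like the one measured
# above and that private (non-shareable) memory simply adds up.

INSTANCES = 200            # user hives to keep loaded (from the first post)
CURRENT_TOTAL_MB = 400     # reported total for 200 copies of minimum.exe
PRIVATE_BYTES_KB = 132     # per process, for the trimmed SleepEx variant


def total_private_mb(instances: int, private_kb_each: int) -> float:
    """Total private bytes in MB for a number of identical processes."""
    return instances * private_kb_each / 1024


current_per_process_mb = CURRENT_TOTAL_MB / INSTANCES
estimated_total_mb = total_private_mb(INSTANCES, PRIVATE_BYTES_KB)

print(f"today:    ~{current_per_process_mb:.1f} MB per process")
print(f"estimate: ~{estimated_total_mb:.1f} MB private bytes for {INSTANCES} processes")
```

Under those assumptions the private memory for all 200 placeholders comes out at a few tens of megabytes rather than hundreds.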
Posted on 2007-03-28 20:32:06 by f0dder
...long live insomnia.

I've attached what I feel to be the best solution, rigorously commented, fasm style. I'm on my gf's laptop right now, so I don't have my full suite of development tools, fasm+ollydbg were the fastest to grab hold of. Also shows you how to code in a really minimal style, not that it's really any use :)

While you're at it, please do grab Process Explorer and go by that... configure it to show (at least) private bytes, virtual size, working set. Also showing the {Private, Shareable, Shared} subsets of the WS might be beneficial.

You'll notice the Process Explorer can show a lot of really nifty statistics, a good subset of what perfmon.msc supports. You don't need to keep everything in the main window columns, since you can double-click a process to view the various stats. Also note that the lower pane view can be switched between 'handles' and 'dlls'.
Posted on 2007-03-28 21:08:43 by f0dder
f0dder

I noticed an odd result using the masm version.
If it was assembled normally but linked as a console program,
when run it would start as a normal sized console window with 524 k
but when manually minimized the memory usage would drop down
to 36k then later bump up to 80 k.
Even restoring it later retains the 80 k.

Is this also possible with the fasm version ?

Is it possible to create a console program that starts minimized
and gets the reduced memory footprint ?
Posted on 2007-03-29 10:17:26 by dsouza123
Hm, you're saying that linked as a console program, you end up with a smaller footprint than when linked as a GUI program? Did you try my application that does the SetProcessWorkingSetSize() trick?

Also, what memory figures are you reporting? private bytes, working set size, ...?
Posted on 2007-03-29 17:00:22 by f0dder
Working set is what drops to 36k then bumps to 80k.

Those are the results with the original masm code as a console app.


I assembled your code using masm and linked it as a console app.
When run it starts at 116k, then drops to 32k when minimized.
Those are also working set values.
Posted on 2007-03-29 18:35:44 by dsouza123
That is nearly at the level of the System Idle Process which is 16k working set
though the idle process has 0 private bytes and 0 virtual bytes.
Posted on 2007-03-29 18:42:28 by dsouza123
Okay... if you measure working set, then you really need to look at "working set - private"... the shareable/shared values almost don't matter, with regular 4k pages you basically need 4kb of page table memory per 4meg of working set (there's some additional bookkeeping for windows, but this does give you an indication that working set isn't that big a deal).

So, again again :), I must stress that the most important figure to measure is the non-shareable (aka private) memory usage... and that we probably can't gain anything significant when doing a win32 subsystem app, whether it uses API calls or not. So, again, I recommend the SetProcessWorkingSetSize() hack and Sleep(INFINITE) - that's more or less as optimal as you can go. Don't use the code from the masmboard that runs Sleep() in a loop, that's plain silly when you can do it with INFINITE.

The "system idle process" is very special, as (afaik) it's not a real process, but a part of the kernel (ntoskrnl.exe) that "masquerades" as a process.
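The 4kb-per-4meg figure can be checked with a little arithmetic. The sketch below assumes classic 32-bit x86 paging without PAE (4kb pages, 4-byte page table entries). With PAE enabled, or on x64, the entries are 8 bytes, so one page table maps only half as much.

```python
# Assumes 32-bit x86 paging without PAE: 4 KB pages, 4-byte page table entries.
PAGE_SIZE = 4 * 1024
PTE_SIZE = 4

entries_per_table = PAGE_SIZE // PTE_SIZE          # entries that fit in one page
mapped_per_table = entries_per_table * PAGE_SIZE   # memory one page table can map

overhead = PAGE_SIZE / mapped_per_table
print(f"one {PAGE_SIZE // 1024} KB page table maps {mapped_per_table // 2**20} MB")
print(f"page table overhead: {overhead:.2%}")
```

So the mapping cost is roughly a thousandth of the mapped size, which is why shared pages in the working set are cheap per extra process.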

It would still be interesting if one could run "native" apps after windows is started, though... or have the option to not inject kernel32 and ntdll.
Posted on 2007-03-29 18:46:49 by f0dder
Memory amounts


            Private bytes    Working set
Sleep       132k             124k
mSleepC     136k             116k   normal console window
            136k              32k   minimized console


* Sleep     the fasm sleep.exe from the attachment.
* mSleepC   a masm console version of sleep
Posted on 2007-03-29 20:03:58 by dsouza123
Okay, that is weird! - and reproducible here!

I still don't think you should worry about the working set size... private bytes is really where it's at. And if I include user32.dll (necessary for ShowWindow, for your hide trick), WS goes up to 148kb... and unfortunately, the programmatic ShowWindow(GetConsoleWindow(), SW_HIDE) doesn't bring the WS reduction, only manually minimizing it does. Neither CloseWindow() nor DestroyWindow() changes this... and private bytes usage is higher than for the GUI version.

So, stick with the Sleep(INFINITE) GUI variant, unless somebody can find a way to run native images after windows is fully started :)
Posted on 2007-03-30 07:45:59 by f0dder
Working set includes private bytes.

http://msdn2.microsoft.com/en-us/library/ms684891.aspx

Process Working Set

The working set of a program is a collection of those pages in its virtual address space that have been recently referenced. It includes both shared and private data. The shared data includes pages that contain all instructions your application executes, including those in your DLLs and the system DLLs. As the working set size increases, memory demand increases.


http://www.microsoft.com/technet/prodtechnol/windows2000serv/reskit/core/fnec_evl_hkbg.mspx?mfr=true

Process(process_name )\Private Bytes reports bytes allocated exclusively for a specific process; its value tends to rise for a leaking process.

Process(process_name )\Working Set reports the shared and private bytes allocated to a process; its value tends to rise for a leaking process.
Posted on 2007-03-30 09:13:54 by dsouza123
Yeah, the working set includes private bytes - but also shared. Amount of shared almost doesn't matter. So if reducing working set means increasing private bytes, that's a non-optimization, especially in your case with 200 instances.

Working set is basically how much memory windows tries to "keep active" for the application, it's the amount that has been used "recently". It does give an indication of how memory-hungry an application is, but it does include shared memory, so it's not as important as private bytes (imho).
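To make that concrete, here is a small sketch using the Process Explorer readings posted earlier in the thread (all values in KB). It compares naively multiplying the whole working set by 200 instances with counting only the private part. It treats every shareable page as actually shared across instances, which is an optimistic assumption, and the helper name totals_mb is made up for this example.

```python
# Figures in KB, copied from the Process Explorer readings earlier in the thread.
measurements = {
    "no imports, untrimmed": {"working_set": 496, "ws_private": 124, "ws_shareable": 372},
    "trimmed + SleepEx": {"working_set": 160, "ws_private": 100, "ws_shareable": 60},
}

INSTANCES = 200


def totals_mb(m: dict, instances: int) -> tuple:
    """Return (naive total, private-only total) in MB for identical processes."""
    naive = instances * m["working_set"] / 1024
    private_only = instances * m["ws_private"] / 1024
    return naive, private_only


for name, m in measurements.items():
    # The working set is the sum of its private and shareable parts.
    assert m["ws_private"] + m["ws_shareable"] == m["working_set"]
    naive, private_only = totals_mb(m, INSTANCES)
    print(f"{name}: naive {naive:.1f} MB, private part {private_only:.1f} MB")
```

The gap between the two columns is the memory that would be counted 200 times even though it can be backed by a single set of physical pages.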
Posted on 2007-03-30 09:33:56 by f0dder
Fortunately, nothing indicates that reducing working set means increasing private bytes.

The evidence shows reducing the working set doesn't change the private bytes.

Reducing the working set will reduce the physical pages mapped for the process,
freeing up more for other processes, so it will result in a system wide optimization.


Credit where it is due,

f0dder your optimizations with SetProcessWorkingSetSize and
switching to SleepEx with an infinite time period were quite effective at reducing
working set memory requirements (thus optimizing the process's memory requirements)
and very effective at reducing the system calls to sleep,
1 total instead of 4 per minute.
Posted on 2007-03-30 11:41:38 by dsouza123

"Reducing the working set will reduce the physical pages mapped for the process,
freeing up more for other processes, so it will result in a system wide optimization."

True, but the page table entries are only 4k for mapping 4 megabytes... there's a bit more additional bookkeeping than that done by windows, but I don't know how to measure it - but I doubt it's very much.

Are you able to measure any difference in Physical Memory consumption by just reducing working set size? Would be interesting to know. Slightly difficult to measure on a regular box, the scale that Terminal Services brings is more interesting.
Posted on 2007-03-30 11:55:01 by f0dder
Good question.

Perhaps a test of using Process Explorer's system information to measure
loading a large number of the console sleep processes,
minimizing them, then loading a program that uses VirtualLock
and consumes enough memory to fill the remaining physical memory.

Then close the VirtualLock program and all the console sleep processes,
repeat the test this time not minimizing.

Perhaps someone has an easier way to measure it.


Attached a program (modified from code found online)
that will find and minimize all visible console windows,
also able to restore them.

Could be modified to only pick ones with a specific titlebar caption.
Posted on 2007-03-30 22:43:15 by dsouza123