By the way, last night during a terrible headache I found a way to solve the speed-issue on SmAlloc. Though, since this will require 33 bits extra per object, instead of (the current) 1 extra bit per object, I'll probably make it as another lib.

The need is:
ability to SmAlloc() and SmFree() objects for less than 50 cycles.
The environment is:
probably one set of object-sizes (means "all objects are  i.e 24 bytes big"). The objects are 50,000 - 1,000,000 in number. We can malloc and free any of the objects at random.

The solution:
Keep a stack of the free objects from the current MemChunk. SmAlloc2() will take the object from the top of the stack, and decrease the stack-size.
SmFree2() will check in the bit-table whether the object is already free (and return immediately if yes), then add this object to the top of the stack.
When a new MemChunk is created, the "free-stack" is pre-filled with all addresses of objects from this MemChunk.

Posted on 2006-08-11 12:18:29 by Ultrano
Here's the implementation of the idea, mentioned above. Called it "TinyAlloc"... although it certainly isn't smaller than SmAlloc ^^'

Finally, my venture to beat HeapAlloc is complete -  TinyAlloc is 5-7 times faster than HeapAlloc :D. But man, HeapAlloc really impresses me! Good job, Microsoft!

Test machine:
Sempron 2200+ (1800MHz), DDR400 (~2.6GB/s bandwidth), Win2kSP4

TinyAllocNonZero() executes in 71 cycles (or more)
TinyFree() executes in 130 cycles

HeapAlloc() executes in 351 cycles (or more)

Total time for allocation of 300,000  50-byte objects:
TinyAllocNonZero() : 10ms
HeapAlloc(): 70ms

If you write into the memory you allocate (i.e zero-out the whole object), performance drops twice on both TinyFree() and HeapAlloc()

Btw, TinyAlloc can be made 4 times faster by removing the checks, used in TinyFree(). Currently, TnAlloc will not misbehave in the case of freeing the same object twice:
TinyFree(pObj1); // TinyFree sees that this object is already free, and does nothing

TinyAlloc is 6-10 times faster than SmallAlloc, with the same functionality. But for each object, it takes 32 extra bits instead of 1.
TinyAlloc doesn't have SmallAlloc's limit of 65535 objects/MemChunk
Neither TinyAlloc nor SmallAlloc are anything revolutionary :)
BUT, the idea behind TinyAlloc can be used to boost the performance of special cases: particle managers in 3D games, for instance.
Posted on 2006-08-11 17:31:37 by Ultrano
Great minds think alike...

I'm currently in the testing phase of a very similar memory manager system of my own.  I would say its 85-90% identical in structure (looking at your source).  Good work.  If anything you're solution is hinting at some optimizations I can borrow from.

I especially like your bit32 usage for used/not used in the memory pointer stack.  One question tho, is it a known fact that the OS will not provide a memory object with this bit set for OS related purposes (from Virtual Alloc)?  Or is this something you figured out (IE. trial and no error).

Also, I still used HeapAlloc.  What reasons made you choose the VirtualAlloc?  Last I remember, they are all the same in the latest version of the Win OS's.

Great work.
Posted on 2006-08-13 23:24:29 by NaN
HeapAlloc vs. VirtualAlloc... VirtualAlloc gives you 64kb alignment, control of page permissions, possibility to control reserve vs. commit. Last time I checked it was a bit slower than HeapAlloc for small allocs. There's no VirtualReAlloc, so you'd have to code that logic yourself.

VirtualAlloc is, imho, good when you need to allocate large blocks that'll mostly be allocated through your app lifetime... so, pretty nice for your own custom memory managers. If you look at the MSDN documentation for HeapAlloc, it says they're slow on 9x for blocks larger than 4meg.
Posted on 2006-08-14 04:59:06 by f0dder

One question tho, is it a known fact that the OS will not provide a memory object with this bit set for OS related purposes (from Virtual Alloc)?  Or is this something you figured out (IE. trial and no error).

It was a bit hard for me to understand what you mean, so I guessed you mean "does the OS let normal apps get virtual memory with address 0x8xxxxxxx ?" .
Windows reserves the virtual space 0x80000000 up to 0xFFFFFFFF for drivers, and internals. It'll never let you allocate memory there. So, currently it's safe to reserve the last bit in pointers of memory that we allocate.
But in TinyAlloc, in the "free-stack", I don't put pointers. I put a value ObjectID, which is 0,1,2,3,4....(NumMaxObjects-1) .
Posted on 2006-08-14 06:53:41 by Ultrano
Just to be pedantic: what about the /3GB switch in boot.ini?
Posted on 2006-08-14 09:59:31 by f0dder
I hadn't heard of it. But still, it can't pose problems, since to use 3GB, the PE header must have a specific flag:

The /3GB switch allocates 3 GB of virtual address space to an application that uses IMAGE_FILE_LARGE_ADDRESS_AWARE in the process header. This switch allows applications to address 1 GB of additional virtual address space above 2 GB.

(quote from

Actually, since I knew for sure Win2k has the to 2GB limit (we studied it at uni), but this could (would) be changed in later 32-bit OSes (or HALs), I gave it some thought.
Yet, for such a major point to be changed, for years to come it would require only the special apps (that would use it) to switch somehow to such a mode. (and the normal apps would run in a legacy/compatibility layer). Microsoft has always been very concerned about compatibility :) , fortunately.

Since TinyAlloc was (initially) designed for apps that would require millions of objects (thus the alloc/free speed issue), these apps could easily need 2GB+ . So, just in TinyAlloc, I didn't thread on the 2GB limit.
(I kinda wanted to present a particle-manager for 3D games or scientific apps, that would break the current limitations and problems :) . TinyAlloc served as a concept-proof)

I initially planned the stack entries of TinyAlloc to be 33-bit, instead of the current 32-bit (a stack of 32-bit pointers + a bitfield of 'is object allocated'). But while optimizing TinyAlloc, I had to switch to using ObjectID in the stack, instead of ObjectPtr. Since the highest 3 bits would always be 0, then I merged the bitfield with the stack.
Posted on 2006-08-14 10:44:46 by Ultrano
Hi Ultano
If you are using an alignment for storing your objects (which is always a good deal), letís say 4, then you can use bits 0 and 1 to store additional information. Transforming the pointer back is as simple as using the and instruction  ;)


Posted on 2006-08-14 12:30:33 by Biterider
Well, I align only the first object in the internal array on dword-boundary, it's up to the user to make his objects' size a multiple of 4, 8 or 16. For max optimization (aligning the objects at 16-byte boundary) and enabling SSE/SSE2 "movaps" (instead of "movups"), the user can fix-up TinyAlloc's SmNewChunk() to align pFirstObject.
Additional bits aren't necessary... I already have 2-11 bits too much :) (11+ bits if we're allocating less than 1 million objects in one TinyAllocChunk.) And I don't have any idea on what to do with them :)

The 3 bottlenecks in TinyAlloc are:
- TinyAlloc() : finding a chunk to get memory from
- TinyFree() : finding the chunk this object persists in
- TinyFree() : "div" instruction when checking if the object is already free. I could add an
if(!(ObjSize & (ObjSize-1))){ size is power of 2, use SHR }else{ size is arbitrary, use DIV}
but finding CL for the SHR really makes things slower iirc...

In the particle-manager I'm currently designing, those 3 bottlenecks are gone - but only since it's a specialized manager.
Posted on 2006-08-14 13:53:43 by Ultrano