Hello.

Whilst going through the Wiki book I read that multiple stacks can be created and accessed by changing the SS register. I'm not sure how this is done, but I almost took it for granted that changing SS would give you a new stack to use.

Here's the quote:

There can be many stacks present at a time in memory, but there's only one current stack. Each stack is located in memory in its own segment to avoid overwriting other parts of memory. This current stack segment is pointed to by the stack segment (SS) register.


I understand how to use the ESP and EBP registers, but I need some clarification on this. My question is: what are the details involved in creating new stacks? Do you have to define the size of each stack before you can use it, or do they keep growing as you put data in?

Thanks in advance,

- keantoken
Posted on 2007-07-22 09:30:37 by keantoken
the ss segment (and indeed cs, ds, gs and fs) are generally something you shouldn't mess with in 32 bit code... as the 32 bit model is 'flat' mode, where essentially cs=ds=ss.. and changing it (especially in 64 bit environments) can lead to disastrous results...

you haven't mentioned if you're using 16 bit, 8 bit, etc code, or what operating system you're using, so i assumed 32 bit windows... hence my answer...
Posted on 2007-07-22 10:33:16 by evlncrn8
Also, for win32, if you want to create your own stack manually and reference it with ESP, you'll need to modify some FS:xx values... search the board for posts by Maverick if you're interested :)
Posted on 2007-07-22 10:55:32 by f0dder

Also, for win32, if you want to create your own stack manually and reference it with ESP, you'll need to modify some FS:xx values... search the board for posts by Maverick if you're interested :)


Here we go:


Win32_TIB_StackBase  EQU 4
Win32_TIB_StackLimit EQU 8
    [...]
    PUSH    DWORD PTR FS:[Win32_TIB_StackBase]
    PUSH    DWORD PTR FS:[Win32_TIB_StackLimit]
    MOV     EAX,ESP
    MOV     ESP,DWORD PTR [NewStackBottom] ; which e.g. you VirtualAllocated
    MOV     DWORD PTR FS:[Win32_TIB_StackLimit],ESP  ; READ the Note!
    ADD     ESP,NEW_STACK_SIZE ; size of the block you allocated
    MOV     DWORD PTR FS:[Win32_TIB_StackBase],ESP  ; READ the Note!
    PUSH    EAX ; save old ESP
; Get some fun with your new stack (I like it to be BIG!)
    [...]
; And now switch back to the old stack
    POP     ESP ; restore old ESP
    POP     DWORD PTR FS:[Win32_TIB_StackLimit]
    POP     DWORD PTR FS:[Win32_TIB_StackBase]


Note: on Win2K, I simply stored $00000000 to FS:[Win32_TIB_StackLimit] and $7FFF0000* to FS:[Win32_TIB_StackBase], so it worked always, wherever I switched the stack to (in User Space :) ).

But recently I discovered that WinXP (which I find totally repellent) doesn't like it - maybe it depends on a Service Pack, maybe not, I don't know. I didn't bother to check whether some other value worked, and I simply changed the code as above, which restored compatibility with XP. I shall check with that ugly crap called Vista sometime.

*I said $7FFF0000 and not $7FFFFFFF because those 64KB of address space are marked as non-accessible area for the User.

**why do I use $ instead of 0x or ..H for HEX? Because real processors used $ for HEX (a real processor is a 68000, or a 6502, or a 6809, but not a 8080, or a Z80, or a 80286).

:D

***another interesting thing is FS:[Win32_TIB_ExceptionList], where Win32_TIB_ExceptionList EQU 0. Google for SEH (Structured Exception Handling) to know why.
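
As a sketch (FASM syntax; my_handler is a made-up name, and a real handler would inspect the exception record), this is how a frame gets linked into that list:

        push    my_handler              ; EXCEPTION_REGISTRATION.Handler
        push    dword [fs:0]            ; EXCEPTION_REGISTRATION.Next = old chain head
        mov     [fs:0], esp             ; the new head lives right here on our stack

        ; ... code that might fault ...

        pop     dword [fs:0]            ; unlink: restore the previous head
        add     esp, 4                  ; drop the handler pointer

my_handler:     ; (pExceptionRecord, pEstablisherFrame, pContext, pDispatcherContext)
        mov     eax, 1                  ; ExceptionContinueSearch - pass it on
        ret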

****another useful thing is FS:[Win32_TIB_ArbitraryUserPointer], where Win32_TIB_ArbitraryUserPointer EQU $14. You can store there whatever you want. When you start doing multithreading at a certain level, you will discover how useful it can be (of course especially if you store there a pointer to a growable structure, since 32 bits aren't much if you want to do something useful with them in themselves).
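
A sketch (FASM syntax; my_thread_context is a made-up label) of stashing and retrieving a per-thread pointer there:

Win32_TIB_ArbitraryUserPointer = $14

        mov     eax, my_thread_context                        ; address of your own structure
        mov     [fs:Win32_TIB_ArbitraryUserPointer], eax      ; park it in the TIB

        ; ... any time later, in the same thread ...
        mov     esi, [fs:Win32_TIB_ArbitraryUserPointer]      ; ESI -> per-thread context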

Win32_TIB_* is a prefix I used here to make things more readable and to distinguish them from variables, etc.. because all the members pointed to by FS are part of the TIB (Thread Information Block). Here are all the members:


Win32_TIB_ExceptionList          EQU 0x00  ; Pointer to SEH's EXCEPTION_RECORD.
Win32_TIB_StackBase              EQU 0x04  ; Used by functions to check for stack overflow: upper limit (NOTE: don't go beyond $7FFF0000 because the last 64KB are marked as non-accessible area for the User anyway).
Win32_TIB_StackLimit            EQU 0x08  ; Used by functions to check for stack overflow: lower limit.
Win32_TIB_WinNT_SubSystemTib    EQU 0x0C  ; NT-Only
Win32_TIB_WinNT_FiberData        EQU 0x10  ; NT-Only
Win32_TIB_WinNT_Version          EQU 0x10  ; NT-Only
Win32_TIB_Win9x_pvTDB            EQU 0x0C  ; Win9x-Only: TDB
Win32_TIB_Win9x_pvThunkSS        EQU 0x0E  ; Win9x-Only: SS selector used for thunking to 16 bits
Win32_TIB_ArbitraryUserPointer  EQU 0x14  ; Available for application use
Win32_TIB_Self                  EQU 0x18  ; Linear address of the TIB, base of FS segment.
Win32_TIB_WinNT_processID        EQU 0x20  ; NT-Only
Win32_TIB_WinNT_threadID        EQU 0x24  ; NT-Only
Win32_TIB_Win9x_TIBFlags        EQU 0x1C  ; 9x-Only
Win32_TIB_Win9x_Win16MutexCount  EQU 0x1E  ; 9x-Only
Win32_TIB_Win9x_DebugContext    EQU 0x20  ; 9x-Only
Win32_TIB_Win9x_pCurrentPriority EQU 0x24  ; 9x-Only
Win32_TIB_Win9x_pvQueue          EQU 0x28  ; 9x-Only: Message Queue selector
Win32_TIB_pvTLSArray            EQU 0x2C  ; Thread Local Storage array (AFAIK doesn't work for dynamically loaded DLL's)
Win32_TIB_Win9x_pProcess        EQU 0x30  ; 9x-Only: Pointer to owning process database
Win32_TIB_LastError              EQU 0x34  ; As reported by GetLastError()
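
For completeness, a sketch (FASM syntax, assuming the usual Win32 includes so invoke and the kernel32 constants are available) of getting the block that the switch code above loads into ESP; NewStackBottom, NewStackTop and NEW_STACK_SIZE are just placeholder names:

NEW_STACK_SIZE = $100000                        ; 1MB - pick whatever you need

alloc_new_stack:                                ; returns EAX=1 on success, 0 on failure
        invoke  VirtualAlloc, 0, NEW_STACK_SIZE, MEM_RESERVE + MEM_COMMIT, PAGE_READWRITE
        test    eax, eax
        jz      .fail
        mov     [NewStackBottom], eax           ; lowest address: the new StackLimit
        add     eax, NEW_STACK_SIZE
        mov     [NewStackTop], eax              ; top of the block: the new StackBase
        mov     eax, 1
        ret
.fail:  xor     eax, eax
        ret

; in the data section:
NewStackBottom  dd ?
NewStackTop     dd ?
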
Posted on 2007-07-22 12:02:52 by Maverick
Yes, I am using 32-bit code, and no, I am not programming in Windows... Unfortunately, I was unable to post this before Maverick went through all the trouble to make that enormous post... Sorry! *cries*

But I plan to start programming in Windows sometime, so I'll definitely bookmark this! Thanks!

To start off, I'm not programming for anything but the raw PC. Call me crazy, but that's what I'm doing.

I'll show you the code I want to use this in, so you'll get a better idea of what I'm trying to do.

This is a line-drawing algorithm I wrote in FASM that I plan to use in Project FlopNinja:

Draw_Line:     ;This draws a line from one place on the screen to the next. x1,y1,x2,y2,x3,y3... are loaded into the stack NOT in reverse order so that the stack order should be from bottom (most recent) to top, ...y3,x3,y2,x2,y1,x1.
        pop eax  ;eax=y3
        mov esi, eax ;esi=eax=y3. This helps us in the end to know where to stop drawing our line. Watch when the esi register is used next.
        sub esp, 2  ;decrement the stack pointer by two in order to pop y2
        pop ecx  ;ecx=y2
        ;NOTE: stack now looks like x3,x2,y1,x1 and esp is pointing at x2
        sub eax, edx ;y3-y2=eax
        ;---------------------- Done with Y calculation for the moment. Now on to X:
        mov esp, 0  ;put the stack pointer back where it belongs.
        pop edx  ;edx=x3
        sub esi, edx ;esi-edx=esi OR y3-x3=esi. In the end, esi will be used to determine when to stop drawing pixels.
        pop ebx  ;ebx=x2
        ;NOTE: stack now looks like y1,x1 and esp is still pointing at 0. Later, y2 and x2 will be pushed back into the stack so that they can be used to draw the next line. you will see that through all of this mathematical technoblab, their initial values are preserved.
        sub edx, ebx ;x3-x2=edx
        ;---------------------- Done with X calculation for the moment. Now calculating the slope of x3,y3,x2,y2.
        ;eax=y3-y2,
        ;edx=x3-x2,
        ;esi=y3-x3
        div edx  ;(eax/edx=eax)=((y3-y2)/(x3-x2)=eax) Remainder=edx. This means that for every step in y, there are eax steps in x.
        ;Note that ecx and ebx (y2 and x2) remain unchanged. This is our starting position. This means that the line is drawn from x2,y2 to x3,y3 - NOT from x3,y3 to x2,y2.
Line_Loop:
        inc ss  ;switch to new stack segment for loading variable data into for calls so that the original stack data is unchanged.
        push ebx ;push X coordinate into the stack.
        push ecx ;push Y coordinate into the stack.
        jmp Draw_Pixel  ;Draws a pixel on the screen: X and Y coordinates are pushed into the stack NOT in reverse order so that the stack should look like this from bottom (most recent) to top: Y,X.
        sub ebx, eax    ;ebx-eax=ebx. Remember that eax was the answer from the previous division operation.
        sub ecx, edx    ;ecx-edx=ecx. Remember that edx is the remainder from the previous division operation.
        ;---------------------- The code above this line computes the coordinates of the pixels and then draws them. Now, we have to compute when to stop drawing pixels! Note: The placements MUST be exact or it will never stop drawing the line since the end of the line coordinates must match a value calculated before the comparison.
        add esi, ebx    ;Remember: esi=y3-x3. So, we add the current X value and if the values match up, then we've met the end of the line.
        cmp esi, ecx    ;We compare esi with our current Y value, and if the values match up, then we've met the end of our line.
        sub esi, ebx    ;Return esi to original state. Problem: I'm not sure if the instruction needs to be placed just before the instruction, but if it does, this won't work!
        jne Line_Loop    ;If esi and the current Y value are not equal, then we need to draw the next pixel. So we loop.
        ;I'll have to fill in this part when I have the time.

        dec ss          ;Switch back to original stack segment.
        push ebx        ;Put x2 back where it belongs. This could also be done at the beginning of Line_Loop before the instruction, since the
        push ecx        ;Put y2 back where it belongs.
        re sb, esp      ;Return is sb=esp. Returns to the main process if all stack data has been cleared. This goes at the end of the program. Before this the instruction is used to switch back to the original stack segment.
        jmp Draw_Line    ;Loop and draw next line if all stack contents have not been cleared.


Notice that it's not quite complete, and that's because I wasn't sure about a few things, namely what we're discussing right now.

- keantoken
Posted on 2007-07-22 12:44:47 by keantoken
can't really see the point in adjusting the stack to your own area with fs: manipulation, and it won't work in 64 bit either...

Maverick - can you give one good reason why it's useful, and why it'd be better than the /stack: compiler switch? (just curious)
Posted on 2007-07-23 00:49:35 by evlncrn8

Yes, I am using 32-bit code, and no, I am not programming in Windows... Unfortunately, I was unable to post this before Maverick went through all the trouble to make that enormous post... Sorry! *cries*


Don't worry, it can still be useful to others in the battle against the way Microsoft mistreats the stack on Windows. After 20+ years of asm programming (and other languages, not least my own) I deeply realized the importance of the stack, in a world that abuses the heap (which is still of fundamental importance; I simply say that it should be used only when indispensable).

For example, if you need a one megabyte buffer, I wouldn't allocate it from the heap, but just write to the stack below ESP directly. You know that memory there exists (you HAVE to know that, of course), and it's certainly free and available for your use! No need to allocate and subsequently free anything. Also, thinking 'bout the stack as your only source of memory (unless PROVEN to really need the heap) promotes the normal program flow, where functions get called, use their chunk of memory, release it when they return, etc.. in a structured way.
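
As a sketch (FASM syntax; BUF_SIZE is a made-up name) of what that looks like - note it assumes the stack really is committed that deep, which is exactly the Windows problem described in point 1) below:

BUF_SIZE = $100000

        sub     esp, BUF_SIZE           ; "allocate": ESP now sits below the buffer
        mov     edi, esp                ; EDI -> 1MB of scratch space
        ; ... fill and use the buffer ...
        add     esp, BUF_SIZE           ; "free": one instruction and it's gone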

Unfortunately, known compilers don't really support this methodology, so you either have to write your own (for maximum profit, also the language needs to support it in some ways) or code in asm.. or both.

Even more unfortunately, Windows doesn't support this methodology at all. It even goes actively against it (a sign that it's a good methodology :D ).
It does it in the following ways:

1) If you try to hit the stack below ESP directly, an exception will be generated. So in Windows you ain't free to do what I described above. If you disassemble some compiled C code, you'll see many calls to the chkstk function. What it does in short (and what you can do "manually") is touch one page at a time, sequentially, to "validate" it. In Windows you cannot access the stack below ESP randomly; you must do it sequentially, at least a page at a time (thus, of course, randomly within the page, but sequentially as far as still-untouched pages are concerned). The solution is to get rid of the ugly default stack and switch to your own, reserving for it a memory range you've also committed (use e.g. VirtualAlloc for that).
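
A sketch (FASM syntax; probe_stack and PAGE_SIZE are names I made up) of that one-page-at-a-time probing, in case you stay on the default stack:

PAGE_SIZE = $1000

probe_stack:                            ; in: ECX = bytes about to be used below ESP
        mov     eax, esp                ; clobbers EAX, ECX and the flags
.more:
        sub     eax, PAGE_SIZE
        test    dword [eax], 0          ; touching the guard page makes Windows
                                        ; commit it and move the guard down one page
        sub     ecx, PAGE_SIZE
        ja      .more                   ; keep probing until the whole range is covered
        ret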

2) The size of the stack is hardcoded in the PE: what if your needs dynamically change, maybe dramatically? The solution to this problem and to the previous one is to switch the stack yourself. Just changing ESP to the top of your new stack doesn't suffice, because the first time you call a Win32 function, it may be one of those that actively check whether ESP is outside its bounds, and it will kill your thread if it finds that it is. The solution is to update FS:[4] and FS:[8] with your new stack's bounds, as described in my previous post. Now, if you use VisualC, you can also get rid of the automatically generated calls to chkstk via the /Gs2147483647 compiler switch, reducing that silly overhead.

3) SEH (Structured Exception Handling) is a way for our *User* mode programs to intercept and act upon CPU exceptions. Unfortunately Microsoft doesn't know shit, because it has ruined a basic rule followed by all CPU's: the User stack (the one pointed to by ESP) and the System/Supervisor (or whatever name you want to give to the OS) stack must be two totally separate things. If an exception or an IRQ occurs, the CPU will never write data below the User's ESP. But with SEH Windows does, yes, to your User stack! Ideally, I'd always access the area below ESP directly, and update ESP only when I must call some function. So if an exception can be generated (e.g. division by zero), expect my variables or temporary data to be destroyed. Unfortunately AFAIK there ain't a fix to this shameful Windows behaviour (the fix would be to give the SEH Handler its own private stack, but only the OS could do it, because by the time we switch stack in the SEH Handler, Windows has already written a lot of data in our previous/precious stack), so you either first decrement ESP and then write to it, or make sure that no exception can be generated (of course hardware interrupts don't cause this problem because the CPU behaves well: it's Windows that behaves badly, misusing your thread's stack in case of SEH. As you know, SEH doesn't intercept hardware IRQ's anyway, so those don't cause problems, although Microsoft never stops being diabolical).
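
A sketch (FASM syntax) of the two access patterns being contrasted here:

; Risky on Windows: the SEH dispatcher may scribble over anything below ESP
        mov     [esp-4], eax            ; can be destroyed if an exception hits

; Safe: claim the space first, so the dispatcher stays below your data
        sub     esp, 4
        mov     [esp], eax
        ; ...
        add     esp, 4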

There would be many more considerations, but these are the ones that come automatically out of my mind this morning, after a sleepless night (with a 2-year-old daughter and a 2-month-old son, and the heat we're having right now in Sicily, I bet you wouldn't sleep either!).

In any case, if you want my advice, use the stack whenever you can, following the natural evolution of the program path. Use the heap only when absolutely necessary, i.e. for data which has to live beyond the lifetime of a thread, or for some other exceptional uses (e.g. when the stack would be fragmented). I don't say use the heap to share data among threads, because that can be done perfectly well via a thread's stack too, as long as that thread always exists during the sharing. Some allocations fall naturally in the stack, others in the heap. What I say is that statistically the heap is abused: a lot of allocations that are traditionally made on the heap fall naturally on the stack instead. It takes a lot of thinking and experience to start doing it correctly, but when you do, it's enlightening. I end up using the heap VERY little nowadays, often not at all, except for some kinds of programs.

Another thing I'm angry about with Win32 (but unfortunately also most other OS's, including Linux) is virtualization of the address space. In my opinion, especially now on 64bit CPU's, the address space should be shared by all applications (but still read/write protected when necessary, of course! sharing address space doesn't mean sharing all data or access rights). Not only to share data more easily between processes, when you want to do so (hell, some people use files or WM_* for interprocess sharing! my god!!), but also to remove a lot of overhead each time the process is switched, and for many philosophical reasons as well. If you think of it, most of the trouble virii and trojans do is due to the buffer overflow technique. If the address space was shared by all processes, and modules would thus (necessarily) have to load dynamically, thus you (hacker) would never know the address where a module is loaded in that certain PC, and could never exploit buffer overflows to make that computer execute arbitrary code, you could only crash it at most.

Another thing I think sucks is virtual memory, here meant in the sense of swapped to disk, where you have no control of it, but only the OS has. While virtual addresses may be useful for certain tasks (e.g. I used them on the Amiga for debugging purposes to get a copy of write-only hardware chipset registers), other than repeating that IMO the address space should be globally shared (and if insufficient, SIMPLY move to a 64bit CPU.. don't do ugly tricks like the 8086 & Co. 16bit segmented code), I'm not against swappable memory, as long as you as programmer have total control over it (VirtualLock is privileged and anyway not guaranteed). I believe almost all OS's, at least consumer OS's, should be realtime. Things are going in the total opposite way; now with Vista even the video card memory, at the driver level (!), is virtualized!!! This is as ugly as it can be, and explains in part why a Commodore 64 game has better scrolling than a PC game (let alone that vertical blank synchronization is a concept alien to most PC game programmers) or why the mouse pointer on the PC jerks (aliasing anyone? synchronizing, if not in hardware, at least via software interpolation, the mouse coordinate sampling and the video refresh rate?).

Anyway.. enough of my rant for today. ;)
Posted on 2007-07-23 02:31:56 by Maverick
Maverick - can you give one good reason why its useful, and why it'd be better than the /stack: compiler switch? (just curious)


It's the natural way: other approaches may appear to work but, if you think about it, they are simply wrong.

The Win32 way is less flexible (the stack size cannot change), thus it forces you to waste address space you'll never possibly use, or puts you at risk of using more stack than you have (crash!), EFFECTIVELY FORCING YOU MENTALLY AWAY from the natural way the stack should be used; it has much more overhead (continuous, hidden chkstk calls, let alone all the overhead at the KERNEL level to manage guard pages) - moreover, what do you think is faster: managing a heap for each (maybe huge) chunk of memory you temporarily need, or simply writing STRAIGHT to the stack below ESP, with at worst a SUB ESP,xxxxxx??? - and it doesn't follow the philosophy dictated by the natural program flow.

It's hard to explain to someone who hasn't felt the difference. Bad programming practices may not start out as a deliberate choice, but if you don't fix them when you have the chance, you do have the right to feel guilty.


Posted on 2007-07-23 02:42:48 by Maverick

Ideally, I'd always access the area below ESP directly, and update ESP only when I must call some function. So if an exception can be generated (e.g. division by zero), expect my variables or temporary data to be destroyed. Unfortunately AFAIK there ain't a fix to this shameful Windows behaviour

DOS has this "problem" as well... really, if you use the stack, do mark what you're using by updating ESP appropriately. Anything else is just asking for trouble...


If you think of it, most of the trouble virii and trojans do is due to the buffer overflow technique. If the address space was shared by all processes, and modules would thus (necessarily) have to load dynamically, thus you (hacker) would never know the address where a module is loaded in that certain PC, and could never exploit buffer overflows to make that computer execute arbitrary code, you could only crash it at most.

This requires either fixups, meaning that a page can't just be discarded and re-read from executable file in low-memory situation... or use of a delta-register like done with ELF executables. And even with address-space randomization, exploits can still be done, it just requires a bit more work.

As for your manual-stack thing, it probably does have its advantages (like speed), but if you VirtualAlloc() your entire chunk at start, possibly wasting memory... the point of the guard-page approach is to conserve memory, as well as catch stack overruns.

Philosophy aside, looking at things a bit pragmatically, the heap&stack allocation method of Windows works pretty well... and for most applications, it doesn't have any noticeable overhead. Also keep in mind that there are environments like terminal servers, where it's not a good idea to think you own the whole machine.

Not saying things are perfect though, there's a lot of things I'd like to have changed, but a lot of the defaults make sense and work fine for the majority...
Posted on 2007-07-23 08:20:09 by f0dder
Whoa... I think I might have to change my signature and yield to Maverick on that claim...

Um... And I thought for sure someone would point out an error in my code. Did I really not mess up or did just no one read my code?

Anyways, Maverick, I'll keep your monstrous list of complaints in mind when I finally start developing my own OS. Not that I understood most of it as thoroughly as possible, but... Um... I will.

the ss segment (and indeed cs, ds, gs and fs) are generally something you shouldn't mess with in 32 bit code... as the 32 bit model is 'flat' mode, where essentially cs=ds=ss.. and changing it (especially in 64 bit environments) can lead to disastrous results...


Does this pertain to non-Windows Assembly? I now understand I can manually create a new 'stack' inside the currently existing one, but is there still a way to do what I was originally intending?

Keep in mind I'm not doing this under Windows or anything. It's just me and Bochs.

- keantoken
Posted on 2007-07-23 12:00:10 by keantoken

Dear f0dder, although you know how much I esteem and respect you, let me disagree with many of your points:



Ideally, I'd always access the area below ESP directly, and update ESP only when I must call some function. So if an exception can be generated (e.g. division by zero), expect my variables or temporary data to be destroyed. Unfortunately AFAIK there ain't a fix to this shameful Windows behaviour

DOS has this "problem" as well...

It's a problem of Microsoft then. That DOS really sucked isn't news, or is it? :)
Ok, it was useful for many people.. and filled a market gap, but that's a completely different story, ain't it?

really, if you use the stack, do mark what you're using by updating ESP appropriately. Anything else is just asking for trouble...

That's what I said, too, on Win32 of course. Or make sure no exceptions can happen for the length of that code snippet. Or change OS. :)


If you think of it, most of the trouble virii and trojans do is due to the buffer overflow technique. If the address space was shared by all processes, and modules would thus (necessarily) have to load dynamically, thus you (hacker) would never know the address where a module is loaded in that certain PC, and could never exploit buffer overflows to make that computer execute arbitrary code, you could only crash it at most.

This requires either fixups,

Fixups are a good and efficient thing (not the most efficient at load time, but at run time yes). In my dynamic module system I do it this way and it works wonderfully, with inter-module references all fixed up whenever a module gets loaded or relocated.

meaning that a page can't just be discarded and re-read from executable file in low-memory situation...

Which is a non-issue for me, since when I read about code pages getting swapped to a mechanical system, I feel a big stomach ache..
You know, I'm of the real-time apps and OS kind, but I'm not too alien to servers and such (I actually make a living out of it too). I still think if you don't have enough RAM, buy it. Don't mess with the hard disk! ;) Or to say it differently: if you need RAM you've got to get RAM; all the rest has its place and usefulness, but not to the abusive extent we see today (even on OS's where people play realtime action games and wannabe low-latency multimedia apps, your code can get swapped to a mechanical system without your even knowing!).

or use of a delta-register like done with ELF executables. And even with address-space randomization, exploits can still be done, it just requires a bit more work.

A bit more work? How would you make me run arbitrary code if modules get loaded and unloaded dynamically, and if a module gets relocated then all other modules which reference it will have their fixups adjusted to a new location? How will you know what return address to inject? or even the address of the buffer you're overrunning?

When you think that it's a cool thing that Win32 physically shares code pages which haven't been written to in your process, because it's a (rare) example of Win32 efficiency, or that a code page can simply be discarded because the OS knows it will be able to read it again from the locked EXE file, you also have to take into account what a giant monster Windows is (and let's not talk about Vista..), and how little it means in the global picture that some code pages will not consume RAM, when the same model as a whole leads to huge inefficiencies elsewhere.

We are forgetting what computers are meant to do. Vista is just thousands of times bigger than it needs to be, and for sure it doesn't offer thousands of times more useful things than the lamest OS in the world. This race towards "bigger, bigger, bigger" is IMHO sick, and virtual memory is an accomplice of it. Now even the gfx board's memory is virtualized; this is a computer architecture horror story turned into reality. If they invested in reducing the bloat (a 100:1 factor could easily be achieved), you wouldn't need such schizophrenic solutions. The Amiga was an example of efficiency: it could do much more than the PC with orders of magnitude fewer resources. Please don't remind me that Commodore went bankrupt, because first of all that's not a technical argument (and we're talking about technical arguments here), and also because I'm not convinced at all that it was the bloat that made the PC survive, but rather too much luck on its side and too much bad luck and mismanagement on the other sides. Technically speaking the PC has always been inferior to its competitors, and its OS's always lagged behind (when Microsoft introduced preemptive multitasking, the Amiga had already had it for a mere 11 years, and let's not talk about gfx/snd capabilities).

That's why I'm thinking of starting development on a Neo 1973: small devices have less of that crap, of that bloat, and give much more per gram of silicon.

As for your manual-stack thing, it probably does have its advantages (like speed), but if you VirtualAlloc() your entire chunk at start, possibly wasting memory...

Why? You only waste address space, at worst. You can simply reserve as much as could be needed, and commit when necessary. You could use guard pages yourself with some kernel calls, too.
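
As a sketch (FASM syntax, assuming the usual Win32 includes; the sizes and the StackReserveBase name are made up) of that reserve-now/commit-later pattern:

RESERVE_SIZE = $1000000                 ; 16MB of address space - costs no RAM yet
COMMIT_SIZE  = $10000                   ; 64KB actually backed by memory right now

        invoke  VirtualAlloc, 0, RESERVE_SIZE, MEM_RESERVE, PAGE_NOACCESS
        mov     [StackReserveBase], eax
        invoke  VirtualAlloc, eax, COMMIT_SIZE, MEM_COMMIT, PAGE_READWRITE

        ; later, when more is needed, commit the next chunk of the same range:
        ; invoke VirtualAlloc, <address inside the reserved range>, <bytes>, MEM_COMMIT, PAGE_READWRITE

; in the data section:
StackReserveBase dd ?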

the point of the guard-page approach is to conserve memory, as well as catch stack overruns.

I wouldn't consider that a good debugging tool..

Philosophy aside, looking at things a bit pragmatically, the heap&stack allocation method of Windows works pretty well... and for most applications, it doesn't have any noticeable overhead. Also keep in mind that there are environments like terminal servers, where it's not a good idea to think you own the whole machine.

As I already said, virtualization has certainly its merits. But not to the extent it has been abused today on "home" computers.. And anyway it's hard to think that virtual memory is really super useful on a CPU with 32bit address space when four giga of RAM cost so little nowadays.

Not saying things are perfect though, there's a lot of things I'd like to have changed, but a lot of the defaults make sense and work fine for the majority...

The majority doesn't even *really* need a computer, heck! ;)

Posted on 2007-07-23 13:08:20 by Maverick


the ss segment (and indeed cs, ds, gs and fs) are generally something you shouldn't mess with in 32 bit code... as the 32 bit model is 'flat' mode, where essentially cs=ds=ss.. and changing it (especially in 64 bit enviroments) can lead to disasterous results...

Does this pertain to non-Windows Assembly? I now understand I can manually create a new 'stack' inside the currently existing one, but is there still a way to do what I was originally intending?

It pertains to most protected-mode code, since you're suddenly not dealing with "segments" anymore, but selectors into a descriptor table. For most OS'es, the descriptors will cover the entire address space anyway, and the paging mechanism will be used for memory management and permissions instead.

So once you move away from real-mode code, you use the segment registers in a completely different way (notice that the split is between real mode and protected mode, not between 16/32... You can easily use 32-bit registers in real-mode, and there are 16-, 32- and 64-bit protected modes).
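
For the bare-metal case you describe (just you and Bochs), here's a minimal sketch (FASM syntax) of what that flat setup looks like - the descriptor values are the standard flat 4GB ones, and the mode switch itself (lgdt, setting CR0.PE, the far jump, A20, interrupts) is left out:

gdt:    dq      0                           ; null descriptor
        dq      $00CF9A000000FFFF           ; selector 08h: code, base 0, limit 4GB
        dq      $00CF92000000FFFF           ; selector 10h: data/stack, base 0, limit 4GB
gdt_end:

gdt_ptr:
        dw      gdt_end - gdt - 1
        dd      gdt

        ; ... after switching to protected mode with these descriptors ...
        mov     ax, 10h
        mov     ds, ax
        mov     es, ax
        mov     ss, ax                      ; one flat stack segment...
        mov     esp, stack_top              ; ...and the "stack" is just a buffer you chose

stack_space:    rb 4000h                    ; 16KB reserved for the stack
stack_top: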


It's a problem of Microsoft then. That DOS really sucked isn't a new, or is it? :)

Not just Microsoft, but many people that wrote code for DOS... even a few BIOS vendors as well, or so I've heard rumored. IMHO you just shouldn't touch memory below ESP, you're playing safe then, and sub/add don't cost that much :)


Fixups are a good and efficient thing (not the most efficient at load time, but at run time yes). In my dynamic module system I do it this way and it works wonderfully, with inter-module references all fixed up whenever a module gets loaded or relocated.

They take up additional space in the executable image, make it impossible to share code pages between multiple instances of your app (BAD if you're running on a terminal server!), and make it impossible to discard+reread pages. Imho a delta-register is more sane if you really want a non-isolated environment, but this of course sucks on x86 with its very limited number of GP registers.
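
As a sketch (FASM syntax, made-up labels) of the delta-register idea, the classic call/pop trick - note it costs you a register, here EBP:

get_delta:
        call    .here
.here:  pop     ebp                     ; EBP = run-time address of .here
        sub     ebp, .here              ; EBP = difference from the assumed load address

        mov     eax, [ebp + my_value]   ; works no matter where we were loaded
        ret

my_value dd 1234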


Which is a non-issue for me, since when I read about code pages getting swapped to a mechanical system, I feel a big stomach ache..

Heh, well, the point is NOT swapping them, but discard+re-read, which is certainly faster. And being able to demand-load executables instead of mapping in the entire file at load.

On the other hand, on modern systems with all the RAM they have, it might be fine enough to never page or discard code sections. But it's certainly still good to share them.

I do agree that windows by default writes too much to the paging file, even when not necessary... and even if you're on a 1- or 2-gig machine. But that's really the fault of clueless "power users" who don't understand that unused ram==wasted ram, and bitch if their "available memory" drops, since they don't understand topics like filesystem cache etc. A little knowledge is a dangerous thing :)



or use of a delta-register like done with ELF executables. And even with address-space randomization, exploits can still be done, it just requires a bit more work.

A bit more work? How would you make me run arbitrary code if modules get loaded and unloaded dynamically, and if a module gets relocated then all other modules which reference it will have their fixups adjusted to a new location? How will you know what return address to inject? or even the address of the buffer you're overrunning?

Remember that buffer overflows aren't the only way to exploit programs. Other methods are harder, and dynamic load addresses do complicate matters... but it doesn't make it impossible, you "just" look elsewhere.


Vista is just thousands of times bigger than it needs to be, and for sure it doesn't offer thousands of times more useful things than the lamest OS in the world. This race towards "bigger, bigger, bigger" is IMHO sick, and virtual memory is an accomplice of it.

I agree almost fully with this - even with the notion of generic virtual memory (today, when machines have lots of ram). I'm still a fan of virtual address space, though.



As for your manual-stack thing, it probably does have its advantages (like speed), but if you VirtualAlloc() your entire chunk at start, possibly wasting memory...

Why? You only waste address space, at worst. You can simply reserve as much as could be needed, and commit when necessary. You could use guard pages yourself with some kernel calls, too.

Sure, but then you still can't access the stack 100% randomly without suffering at least an exception (yeah, auto-handled, but still a speed hit... and yes, I know windows already does this :) .)



the point of the guard-page approach is to conserve memory, as well as catch stack overruns.

I wouldn't consider that a good debugging tool..

*shrug*, if you overflow the stack by too much, you get an exception with register + memory dump... better than no exception. But of course, you need to overflow the stack a lot, and usually the nasty-bug-induced overflows are smaller.


As I already said, virtualization has certainly its merits. But not to the extent it has been abused today on "home" computers.. And anyway it's hard to think that virtual memory is really super useful on a CPU with 32bit address space when four giga of RAM cost so little nowadays.

I agree that virtual memory isn't as useful nowadays in this way as it used to be (and back then, imho, it still tended to work better when managed by the application itself). For me, virtual address space has always been about isolating processes from each other, though - how would you do that without a mechanism like paging?

I'm talking both about guarding against malicious exploits, as well as buggy code...
Posted on 2007-07-23 14:40:29 by f0dder
It pertains to most protected-mode code, since you're suddenly not dealing with "segments" anymore, but selectors into a descriptor table. For most OS'es, the descriptors will cover the entire address space anyway, and the paging mechanism will be used for memory management and permissions instead.

So once you move away from real-mode code, you use the segment registers in a completely different way (notice that the split is between real mode and protected mode, not between 16/32... You can easily use 32-bit registers in real-mode, and there are 16-, 32- and 64-bit protected modes).


YAY.

I do agree that windows by default writes too much to the paging file, even when not necessary... and even if you're on a 1- or 2-gig machine. But that's really the fault of clueless "power users" who don't understand that unused ram==wasted ram, and bitch if their "available memory" drops, since they don't understand topics like filesystem cache etc. A little knowledge is a dangerous thing


Yeah... I've often wondered myself why Windows has to use a paging file even though there is a full 128MB of RAM left. Note that I have nothing near a state-of-the-art computer. What I've got is a Dell Optiplex GX1 with 384MB of RAM and a 550MHz CPU. Top that off with a GeForce FX 5500 and a SoundBlaster AWE64 Gold (XD) and you've got my supercomputer.

However, if you want people like me to use your software, you either write the software to automatically adjust for the amount of RAM in the system or you make it less RAM hungry. Yes, unused RAM=wasted RAM, but if the programmer isn't paid enough and you want some software I can use, that's what you get. This is why I prefer to be self-motivated on my projects. I don't get lazy that way. That's also why I like Open-Source.

I personally would like to throw in a card for those like me who can't pay for 4GB of RAM let alone actually be able to use it since we can't buy a state-of-the-art computer to accommodate it. You may call 4GB of RAM inexpensive, but I'm afraid my pocketbook's not quite like yours.

That said, I'll try not to jut into this debate any more than I already have, and I'll leave you two to discuss whatever else you may want to discuss. However, if the topic really matters a whole lot, I'd think it would deserve a separate thread.

- keantoken
Posted on 2007-07-23 17:00:00 by keantoken
keantoken: first, sorry for hijacking your thread, hope you did get your questions answered nonetheless. I can perform a "split topic", but unfortunately I don't think I can move individual posts back and forth, so I'll just leave things be - hope that's okay :)

Second, I think you misunderstood my "unused RAM is wasted RAM" - what I meant isn't that programmers should be sloppy about their programming, but that whatever amount of RAM you have in your system should be utilized. Put in a simplified way: "when applications don't need the RAM, use it for filesystem cache, and do be sure to use it."
Posted on 2007-07-24 04:39:41 by f0dder

Gotta write in a hurry:


As I already said, virtualization has certainly its merits. But not to the extent it has been abused today on "home" computers.. And anyway it's hard to think that virtual memory is really super useful on a CPU with 32bit address space when four giga of RAM cost so little nowadays.

I agree that virtual memory isn't as useful nowadays in this way as it used to be (and back then, imho, it still tended to work better when managed by the application itself). For me, virtual address space has always been about isolating processes from each other, though - how would you do that without a mechanism like paging?

I'm talking both about guarding against malicious exploits, as well as buggy code...


I'm not against paging (except that I think that, especially nowadays, there's a better solution, which I will mention below); as I said, I used it on the Amiga MMU to get a copy of write-only hardware registers, and it has proven very valuable for other uses as well.

I am only against abuse of virtual memory (meant as swapped to hard disk without your control), while I think that the (technically equivalent) memory mapped files are a great thing instead (BUT they are demanded by the application!), especially when you have a large address space of course (otherwise it's pretty useless in practice, unfortunately). If the programmer had sufficient control over it, then I wouldn't have anything against virtual-memory-swapped-to-disk either. Of course "sufficient control" goes against the very philosophy of the "Police OS", where no user programs should be allowed any chance to put the system down (but to make this protection possible, the OS in the end puts the system down by itself :D ), and where processes have no rights or guarantees at all (I think that realtime OS should be the norm, especially for home computers, not the very exception). I'd much prefer an OS where if the administrator/owner/user of the computer wants it to be so, a process can hog resources like a pig, because sometimes it's necessary. Theoretically this should be already possible even on Win32, but in practice it is not. I agree that there are some kinds of servers and uses where no user program should be allowed to hog the system resources, not even with maximum privileges, but as I said I'm talking about home computing, where I see much different needs.

About your question, I am in favour of a shared (64bit now, that 32bit started to be too little) address space. I would even go beyond, and get rid of paging, just for performance reasons, and instead (now that we have hundreds millions transistors CPU's it's very doable) base the MMU on a MTRR (Memory Type Range Registers) mechanism, which can be highly paralleled, and extremely efficient.

When process A is going to get its time slice, you (the kernel) will set up at least one MTRR, with the range of allowable addresses, all the other addresses default to protected ones (belonging to the OS or other processes). More MTRR (it's just a bunch of transistors each, few hundreds, for register and comparators, i.e. 0.00001% of total transistors count per MTRR on a last generation CPU) will allow different permissions in different regions of the address space (e.g. allow writes to a video memory buffer, share a region of system memory with other processes, etc..). Honestly I am against using protection as a debugging means but, if one wants, with the same mechanism could specify some read only ranges or non-executable readable range, etc.. with a much finer granularity than paging allows, and thus (I am against using it at release, but it could be very useful at least in debug sessions) catch misbehaving pointers, etc..

While I am against virtual memory the way it's commonly used (swap anything to hard disk without the programmer's control), I think that memory mapped files are a great thing, and they could be implemented with MTRR's as well, by adding a simple translation mechanism to some of them (a few more transistors per MTRR). This has other potential as well, like moving certain key buffers around in different parts of physical memory while you keep them in some special zone of the address space (making an exception to the shared-address-space rule, since sometimes it's useful, e.g. to make a circular buffer appear linear).

I am for the "Keep It Simple" philosophy; 20 years ago I was, and still am, for RISC bigtime, for tiny lightweight solutions wherever possible. Tell me one kind of program that lightweight hardware and a lightweight OS wouldn't execute perfectly well; tell me one thing where e.g. Vista would work better.
There are pros and cons everywhere, but the ability to fetch pages from the EXE file on demand, while good in itself, loses all significance when you compare the two systems as a whole, and gives you nothing.
Posted on 2007-07-24 04:49:32 by Maverick

while I think that the (technically equivalent) memory mapped files are a great thing instead (BUT they are demanded by the application!), especially when you have a large address space of course (otherwise it's pretty useless in practice, unfortunately).

Memory mapped files are great for the lazy programmer, especially on 64-bit architectures where he can be really lazy even when working with huge files (on 32bit systems you still need to operate in a 'chunked' manner), but they do impose a speed hit. Can be useful for other things too, though.


I'd much prefer an OS where if the administrator/owner/user of the computer wants it to be so, a process can hog resources like a pig, because sometimes it's necessary. Theoretically this should be already possible even on Win32, but in practice it is not.

And I don't think this is a good idea - even without malicious code, programming errors could cause a process to run amok and thus requiring a hard power-off. Might not be that bad if you're only running a game or a word processor, but I tend to be running a lot of stuff at once.

Also, if it was possible to really hog the system, everybody and their dog would start abusing this feature even if they didn't need it. A game doesn't need this - although some other changes would be in order.


About your question, I am in favour of a shared (64bit now, that 32bit started to be too little) address space. I would even go beyond, and get rid of paging, just for performance reasons, and instead (now that we have hundreds millions transistors CPU's it's very doable) base the MMU on a MTRR (Memory Type Range Registers) mechanism, which can be highly paralleled, and extremely efficient.

*snip the rest*

Sounds like an interesting approach, does save you from managing the pagetable structures and TLB flushes etc. If there were enough MTRRs and they gave the same possibilities as paging (R/W/X and U/S) plus perhaps a bit more, it could be an interesting solution.

If we ignore the "fetch from RAM" part of x86 paging and just compare TLBs to MTRRs, would it be any faster, though?
Posted on 2007-07-24 05:48:52 by f0dder
I am in favour of a shared (64bit now, that 32bit started to be too little) address space. I would even go beyond, and get rid of paging, just for performance reasons, and instead (now that we have hundreds millions transistors CPU's it's very doable) base the MMU on a MTRR (Memory Type Range Registers) mechanism, which can be highly paralleled, and extremely efficient.

When process A is going to get its time slice, you (the kernel) will set up at least one MTRR, with the range of allowable addresses, all the other addresses default to protected ones (belonging to the OS or other processes). More MTRR (it's just a bunch of transistors each, few hundreds, for register and comparators, i.e. 0.00001% of total transistors count per MTRR on a last generation CPU) will allow different permissions in different regions of the address space (e.g. allow writes to a video memory buffer, share a region of system memory with other processes, etc..).


Without adding anything truly of benefit to this thread, I must admit that I really am in favor of this idea!  If we could only get all the powers that be on board!

(I think that realtime OS should be the norm, especially for home computers, not the very exception)


Maverick, if you don't mind me asking....What OS are you running on your home systems?


Posted on 2007-07-24 15:30:38 by madprgmr


About your question, I am in favour of a shared (64bit now, that 32bit started to be too little) address space. I would even go beyond, and get rid of paging, just for performance reasons, and instead (now that we have hundreds millions transistors CPU's it's very doable) base the MMU on a MTRR (Memory Type Range Registers) mechanism, which can be highly paralleled, and extremely efficient.

*snip the rest*

Sounds like an interesting approach, does save you from managing the pagetable structures and TLB flushes etc. If there were enough MTRRs and they gave the same possibilities as paging (R/W/X and U/S) plus perhaps a bit more, it could be an interesting solution.

If we ignore the "fetch from RAM" part of x86 paging and just compare TLBs to MTRRs, would it be any faster, though?


Throughput-wise, maybe not; but latency-wise, it certainly would.

Please note that most of my ideas get implemented by myself in Verilog on FPGA boards like this one. Memory management is not something I've already implemented anyway, and there are others on the list before it. Mine could be described as a "hardware multithreaded cyclic pipeline processor", i.e. a relatively complex ALU shared by four hardware threads on a simple 16bit RISC architecture where 32bit, 48bit, 64bit and more-bit basic operations (add, sub, logical and shift) are just nbits/16 slower than the native 16bit. Why is it natively 16bit? From simulations it was the best compromise value to use, at least for my code, between number of logic gates and total computing power. When I only need 16 bits for a counter, I don't use 64 bits, like a modern RISC would. This also saves memory, and with a 16bit CPU you don't need to align data on 64bit boundaries (of course the interface to SDRAM is much wider than 16bit!). Anyway, of course I plan to make a commercial system out of this.. which will be called Omega64 (just like the name of my own company, of my website, etc..).

The cyclic pipeline allows all instructions to be executed in one cycle without complications like branch delay slots, etc.. and the whole architecture is much simplified. You get four processors at one fourth of the main clock speed, but they behave (real, total MIPS wise) much better than a single processor at full main clock speed, and the whole design is simplified, as well as the compiler's work, because everything is much more predictable than with those complex, modern superpipelined CPU architectures, which can stall in many different situations and are utterly complex. Also, the ALU permits relatively advanced operations, so the total throughput is really good (I love DSP's!).

Finally, on my FPGAs (I got Altera Cyclone I, II and III chips) it runs at maximum speed (i.e. the limit is imposed by internal RAM, about 250MHz, depending on the exact FPGA model), while I see lotsa open cores of RISC CPUs where a single pipelined processor runs slower than my fourth-speed single thread, MIPS wise (ok, that one is e.g. 32bit, while mine is 16bit, but I get better code density and excellent average performance). I pay a lot of attention to critical paths, but unfortunately all the optimizations I do in my Verilog design to reach this speed would have to be changed in case of an ASIC implementation (every silicon process has its own peculiarities, and FPGAs, although wonderful devices, aren't exactly like an ASIC in several regards). I've lately read a lot about hardware design, and it surprises me how many of my (honestly, for me, original) ideas already existed in the '60s - not from Microsoft, but from Cray..
Also, I've heard of at least two new microcontrollers released recently which use the cyclic pipeline approach. I think it's a natural way to use the higher chip densities of today, with the multithreaded/multiprocessor uses that today are common among programmers, but trying to keep things simple.
Posted on 2007-07-25 08:17:02 by Maverick

(I think that realtime OS should be the norm, especially for home computers, not the very exception)


Maverick, if you don't mind me asking....What OS are you running on your home systems?


Oh well, I'm still a big fan of my Amigas (one has a 68060 CPU, SCSI II, AmigaOS 3.9, etc..). On my PC's, I run Linux (of which I must admit I'm not a real fan, although I prefer it much more than Windows), and since we're all slaves of it in the end, especially if we want to make a living out of this job, I also use Windows2000 Professional SP4, but I try not to inhale. :)


Posted on 2007-07-25 08:19:31 by Maverick
Oh well, I'm still a big fan of my Amigas (one has a 68060 CPU, SCSI II, AmigaOS 3.9, etc..). On my PC's, I run Linux (of which I must admit I'm not a real fan, although I prefer it much more than Windows), and since we're all slaves of it in the end, especially if we want to make a living out of this job, I also use Windows2000 Professional SP4, but I try not to inhale.


I've got the same Windows version, alongside an UbuntuStudio partition. And they're both on my 10GB harddrive! My slave harddrive is 8GB and I store the unprofessional half of my work on it. Though, since you apparently know some things about OS development, you shouldn't be nearly as surprised as everyone else I've told.

I wouldn't be surprised if the larger half of Windows is unoptimized HLL (namely C and its derivatives). At least it works, though. I can't figure out why Linux has so much of a problem with my PC specs. I don't really understand how it works perfectly on a Pentium 4 and not on my Pentium 3. It's all compiled in 386 assembly, isn't it? That logically leads me to the conclusion that it's not my CPU, but something else in my PC. Maybe because I've got ISA slots in it or something like that...? It's certainly not my graphics card, and not my sound card, so what is it?

At first, when I heard all these great success stories about modern Linux on ancient computers, I thought neato, but unfortunately the machine must be REALLY old before Linux'll run right on it. It seems they've only tested it on ancient and modern computers and not in-between. Come on, my compy's only one decade old... *sniff*

But whatever. I'll yield to the REAL commenters now. ;)

- keantoken
Posted on 2007-07-29 06:14:11 by keantoken