Don't worry about the HeapAlloc thing, it didn't insult me - and I can easily see why you got this idea :). For me it's just about using HeapAlloc for generic allocations (since that's sorta what MSDN recommends), and "whatever scheme" (custom manager, pooled strings, whatever) ontop of HeapAlloc or "pretty custom VirtualAlloc" schemes where necessary. This would also involve discouraging people from using Memory Mapped Files for generic memory allocation purposes, for instance :)

As for the scali thing... I generally like the guy, since I know how to deal with him, and why he thinks like he does. He tends to get lost in details, but that's not too strange in the context of assembly, where scali is the type that focuses on getting a small piece of code running *fast*. I think he forgets that a lot of people here write assembly "just because", and don't care too much about speed (or are still young and naïve and think "just because it's asm it's faster", heh).

It's unfortunate that it always seems to end this way and the tone becomes so aggressive, but I must admit that I prefer scali's way to hutch's behaviour. Bullying around, twisting words, spewing non-info. I don't know who's worse, but at least scali doesn't hide behind a mask of niceness. Oh well, it's no secret that I've been pissed at hutch since he called me a virus writer.
Posted on 2004-02-17 11:34:29 by f0dder
I think he forgets that a lot of people here write assembly "just because", and don't care too much about speed (or are still young and naïve and think "just because it's asm it's faster", heh).


I think the issue is more that these people get annoyed when someone tells them there's a way to improve their code, and even go as far as determining for the entire forum that this is not interesting at all.
The only right answers would be: "You're right, thanks for pointing that out, I will use it from now on."
or "You're right, but it's not important to me, I will continue my usual way".
Anything else is just silly.
Posted on 2004-02-17 11:38:49 by Henk-Jan
It would be easier to get either reaction if you were a bit less aggressive, though. But, hey, it *is* easy to become annoyed when people seem ignorant to you.
Posted on 2004-02-17 11:49:58 by f0dder
It would be easier to get either reaction if people would wait to answer until they actually understood what they just read. Cross-referencing info is not a bad idea either... like checking whether it is actually in the Intel manuals before someone spoonfeeds it to you (why should the one providing the info always supply an exact source? It takes time to go through a manual and find out exactly where it is mentioned. Besides, any decent asm programmer should have read the manual anyway, so why can we not just assume it as well-known truth?). Then there would be no denying of facts, and there could exist no argument.
The same goes for the SGI-case.
I think some people are just too confident about their knowledge, and think they don't have to cross-reference. Or that they think that because they don't like someone, that person can never provide any factual information or whatever.

I don't think that way. Facts are completely independent of who mentions them.
Posted on 2004-02-17 11:57:01 by Henk-Jan
OK, since I was gone WTF just happened?

Please don't tell me this is another ASM and C war, geeze I make effective use of both....(C++ I'm learning, albeit slowly :) )

Please explain what happened? I'm quite busy studying and don't have time to read 6 PAGES?!?!
Posted on 2004-02-17 15:33:18 by x86asm
Please explain what happened? I'm quite busy studying and don't have time to read 6 PAGES?!?!


I pointed out that one of the proposed methods (embedding the string in the code) did not work quite like in C (where strings are in the data section), and had some cache disadvantages (as the Intel manuals also point out)... Then someone started a war about it because he took it as a personal insult or something... Nothing interesting anyway.
Posted on 2004-02-17 15:49:07 by Henk-Jan
Henk and f0dder, I respect both you guys as coders, as both of you have offered help when I have asked questions, but I mean, come on, don't you think you're being a little childish here? Saying someone is an idiot because they deliberately screw up cache performance is, I think, going a little too far. There is no point in chasing performance where it is not needed; this principle, which many use in HLLs, can be applied in ASM as well, with MACROS that make ASM easier. Sure, they may screw up the pipelines or that (?) MOESI stuff (hehe, I researched, I wanna become a microprocessor engineer :D), but if it makes ASM code more readable I don't mind taking the hit in performance.
Posted on 2004-02-17 15:57:39 by x86asm
there is no point in chasing performance where it is not needed


There is no point to screw up performance where it is not needed either.

if it makes ASM code more readable I don't mind taking the hit in performance.


That's not the issue here though. There's no difference in ease-of-use, or readability, or anything, for the different methods of storing strings. So it makes no sense to choose an inferior performing macro, since there is no other value to gain whatsoever.
Posted on 2004-02-17 16:10:15 by Henk-Jan
I accept that f0dder made a genuine mistake in asserting the SADD macro wrote to the code section when in fact it writes to the initialised data section. Just an attention to detail matter.

To answer the original question that was asked: a C string has, over many years, been a sequence of characters terminated by an ASCII zero. Unicode strings are technically different. Now MASM handles a C string in the normal manner when it uses the notation,


MyString db "This is a C string",0

Now where the debate has been is whether cache considerations actually matter when data is embedded in the code section.

Anyone who can write assembler knows that data in the code section is normal; the everyday case is an immediate operand.


mov eax, 1

The immediate "1" IS data.

So is a table written in the code section to avoid the fetch delays of reading data from the data section.

String data written in the code section is less common, for historical reasons that follow from the segmented architecture of old DOS code, but the complaints about using the technique in very small files do not pass the test of relevance.

A substantial amount of code never handles processor-intensive algorithms, so there is no point in trying to structure the data in a way that suits a non-existent high-speed algo. Trying to do so is lousy design by a person who simply does not know how and why a program works.

With examples like embedding a control class name before a CreateWindowEx function call, the entire function fits into the code cache so code like,


jmp @F
BtnClass db "BUTTON",0
@@:

in fact does not break the code cache as it is 7 characters long.

Then there is the example of writing string data after the start of the code section but before the entry point label.


.code
MyStringData db "I am written in the code section",0
start:

What is being confused here by our proponent of data section purity is that the data before the start: label is never read into the code cache yet it is clearly data in the code section.

Blanket statements made on misunderstandings of processor performance and a lack of understanding of the PE specifications demonstrate ignorance on the part of the speaker, which could be excused if he confined his ignorance to his own circle, but he has in fact tried to confuse people here with his ignorance.

Since I do put my money where my mouth is: the example in the MASM32 example code called "smallwin", in the example9 directory, is 1536 bytes, and it is a demo of SIZE that was originally aimed at the TASM example WEP, which was 8k in SIZE.

Now note here that the size for a working window IS 1536 bytes, not the minimum 28k of a current C compiler, and without needing to perform unusual entry point design in C code to avoid all of the bloat in the default file layout.

When SIZE matters, assembler programmers have all sorts of wicked little tricks and not adding a data section is simply one of them. It may feel profound when manipulating big bad mannered pigs in C to make noises about cache performance but assembler programmers are not saddled with such piles of crap and can write what they want.

Not that any technical consideration actually matters here; our friend thinks he can win an argument when he is wrong, again by weight of his wit and rhetoric.

Regards,

http://www.asmcommunity.net/board/cryptmail.php?tauntspiders=in.your.face@nomail.for.you&id=2f46ed9f24413347f14439b64bdc03fd
Posted on 2004-02-17 18:58:01 by hutch--
The immediate "1" IS data.


Semantically, yes. Technically it is encoded in the opcode for the instruction, and therefore the CPU considers it as code. And guess what? It ends up in the code cache, not the data cache.
So this is not the same as the string.
See, the difference is that with a string, you don't actually embed the data in opcodes. Instead, you pass a memory address to the CPU... The CPU will pass this address to the data cache, in order to fetch the data... Guess what? The data is not in the data cache yet! It's in the code cache! (or not cached at level 1 at all, in the case of P4).
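
The difference can be sketched in a few lines of MASM (MyString and PrintString are hypothetical names, a sketch and not code from the thread):

```asm
; The immediate: the value 1 is encoded inside the instruction
; bytes themselves, so it travels with the code and ends up in
; the code cache.
    mov eax, 1

; The string: the instruction only carries an address; the bytes
; of the string are fetched separately, through the data cache.
    push offset MyString
    call PrintString
```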

in fact does not break the code cache as it is 7 characters long.


And what exactly is this based on?
Where are the start and end boundaries of your cache line? This is an entirely hollow statement.

Someone needs to read the Intel manuals again....

What is being confused here by our proponent of data section purity is that the data before the start: label is never read into the code cache yet it is clearly data in the code section.


The entire case mentioned here was never discussed until you brought it up in your last post.

Blanket statements made on misunderstandings of processor performance and a lack of understanding or the PE specifications demonstrate ignorance on the part of the speaker which could be excused if he confined his ignorance to his own circle but he has in fact tried to confuse people here with his ignorance.


The PE specifications have absolutely nothing to do with how the CPU cache works. Trying to use them as a crutch for your bruised ego is just sad. And guess what... I can put data in the code in MZ files as well! Which I have only mentioned about 10 times. Let's just keep ignoring that and repeating the same useless crap forever, until people are so numb that they will believe it!

Now note here that the size for a working window IS 1536 bytes, not the minimum 28k of a current C compiler or needing to perform unusual entry pointv design in C code to avoid all of the bloat in the default file layout.


Nice try, but cynica_l can provide you with a Visual C++ project that will compile to a PE file with a working messagebox in about 700 bytes, without any manual tuning whatsoever. Just load the project, and build it. Don't judge tools if you don't know how to use them.

When SIZE matters, assembler programmers have all sorts of wicked little tricks and not adding a data section is simply one of them. It may feel profound when manipulating big bad mannered pigs in C to make noises about cache performance but assembler programmers are not saddled with such piles of crap and can write what they want.


Firstly, this issue was never about ASM vs C, but rather about one version of a MASM macro versus another. Of course, if you need to hide behind this kind of nonsense, that says a lot about you, and about the lack of quality of your macro.
Secondly, would you believe that I can make programs without a data section in C as well? Don't judge tools if you don't know how to use them. So far all you've managed to prove is that you are absolutely useless when it comes to using C.

Not that any technical consideration actually matters here; our friend thinks he can win an argument when he is wrong, again by weight of his wit and rhetoric.


Look who's talking. Even putting actual Intel manual quotes in your face doesn't stop you from denying the issue! You are so wrong, there are no words for it.
Posted on 2004-02-17 19:43:01 by Henk-Jan
I first saw the cTXT macro posted by huh and knew it was pure genius. I will use it with reckless abandon for as long as I use MASM; and FASM has a similar macro, so my recklessness will have no bounds. :)

Sometimes I program using mostly code, and then I'll shift to the other extreme of using mostly data. It is fun to understand the relationship between one and the other. Data is nothing without code and code requires data - it is a beautiful relationship allowing us to delve deeper into the inner workings of the processor.
Posted on 2004-02-17 21:46:07 by bitRAKE
Profundities from a person using a C compiler as a crutch don't cut much ice in the assembler programming area.

When you support stupidity like blanket statements that data in the code section is BAD, you have failed in technical terms with a multitude of working examples.


mov eax, "lluB"
mov ecx, "t*hs"

Now for the dummies,

DATA is DATA is DATA is DATA

Next example,


mov eax, 1

Again for the dummies,

DATA is DATA is DATA is DATA

Another example,


.code
MyString db "DATA is DATA is DATA is DATA",0
start:
; your code

Ain't it a shame that you still don't comprehend the PE specifications or understand how code is read into the code cache.

Again for the dummies,

DATA is DATA is DATA is DATA

Now if you stopped hobbling along with a poor grasp of the C compiler you are using and actually bothered to do some work with the instructions, you would understand the differences.

The range of SSE2 instructions includes non-temporal writes, and some instructions require 16-byte-aligned data. You set the data in memory at the correct alignment and use either the hardware prefetch instructions or another technique of software pretouch; the fastest I have seen there is Lingo's code.

Do you use these performance related instructions with small zero terminated strings ?

Now if you were truly performance oriented, you would allocate memory, align the starting address to a page boundary, write the zero terminated string to it and then try and write some code that would show this AMAZING technical advantage you keep assuming with cache loads.

In high-level code, and that is what API calls are, placing a short string in the same function is fast, safe, reliable code, even if you don't know it.

You may have some need for this neurotic scratching but it does not convince anyone, particularly assembler programmers who already know what cache usage is about and where to use it.

Most people who need them have the Intel manuals, but the application you are trying to force does not follow from the cache technical data in the manuals; what you are trying to inflict is a private interpretation of technical data in a context where you are simply wrong.

Now when you succeed in scraping the egg off your face for having made a fool of yourself again, you can try again and get the same result. Why you bother I simply don't know but I guess it must feel good. :tongue:

Regards,

Posted on 2004-02-17 22:21:32 by hutch--
Yeah, even code is data - data the processor works with to obtain the result - electrons flowing through the hardware at a faster rate in a planned direction.
Posted on 2004-02-17 23:30:27 by bitRAKE
Hm, there's no reason whatsoever to use the


jmp @F
BtnClass db "BUTTON",0
@@:

kind of code style. I'd say that cache performance is rather irrelevant in this case (at the same time, though, I see no reason why you wouldn't put the string in .data) - but anyway, if you're going to put your BtnClass string in the code section, why do it like this? It's plain silly, when you can put it before a procedure entry point or after a ret. This is just *waste*, with no excuse whatsoever.
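
The two placements can be sketched side by side (hypothetical names; a sketch of the styles being compared, not code from masm32):

```asm
; The jmp-over style being criticised: an executed jump exists
; only to skip the embedded data.
    jmp @F
BtnClass1 db "BUTTON",0
@@:

; The same data placed after a ret: still in the code section,
; but nothing ever jumps over it.
SomeProc proc
    ; ... procedure body ...
    ret
BtnClass2 db "BUTTON",0   ; never reached by execution
SomeProc endp
```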

I think it's kinda fun that it's *hutch* bringing in C and trying to turn this into some twisted C-vs-asm thing... especially when he obviously has such a poor understanding of C, and how to use the tools around.

The crap about immediates in opcodes... like, wtf? You can't call this "data in the code section" since, technically, it's code, and goes in the code cache. Putting non-code data pieces there is something different. Do you disagree that this would impair performance a bit? Whether it's relevant for a certain piece of code is another question, but do you disagree it impairs performance?

Then I wonder why you pull in the PE format specification, it has absolutely nothing to do with any of this. And when coupled with what seems to be a bad understanding of the old segmented memory model, well, it gets even sillier.

As for placing data right before the entrypoint - I actually wonder how this will affect things. Parts of MyStringData will obviously be in the same cache line as the code after start - how does this affect the caching? Does this mean the parts of MyStringData in the same cache line as the code will not be put in the data cache? In this example it's clearly irrelevant performance-wise, but it would be a good thing to know for other situations.
Posted on 2004-02-18 01:01:14 by f0dder
Oh, and while I confused SADD with szText, szText *does* do the junky "jmp skipdata" kind of thing, and is used in 70 of the examples in masm32v8 (I only bothered to look at scall.inc and 5 of the asm files, but I think it's a safe assumption it's the same macro used in all).

Btw, if you don't mind the extra few bytes of "jmp @f" here and there, and don't think data cache is relevant, you might as well put the following code sequence in the start of your applications as well (apart from it requiring CPL=0, of course).


mov eax, cr0
or eax, 060000000h ; CD + NW
mov cr0, eax
wbinvd
Posted on 2004-02-18 01:28:26 by f0dder
f0dder,

Basically the same answer: string data directly in registers and immediates loaded into registers are all data. This is among the reasons why the blanket statement breaks down, as it fits neither the hardware nor the executable file specification.

Now I have no doubt you could find places where embedded string data could cause a problem with code caching in terms of cache performance and this is why the technical data is in the Intel manuals but you would really have to go out of your way to write it that way.

Write something like an SSE2 algo and place a large body of unrelated data directly in the middle of it, so that the algo did not fit into cache, and you could get disastrous results.


sse2 algo beginning
big block of data far larger than code cache
sse2 algo end

Now if this was an interdependent pair of loops on either side of the big block of data, each cross jump would involve completely reloading the code cache, which would deliver a serious reduction in the algorithm's performance, but then why would you bother to write code this way?

Another instance is blocks of data using the standard DB sequences which can comfortably reside in the code section but never be executed. Even though it IS in the code section, it is in fact data.

A table at the front of an algorithm is another perfect example of data in the code section. It sits in the code section but is never executed; you jump over it, and it is accessed item by item as data from the code that uses it. Putting the same table in the data section is usually slightly slower because the data is not in the data cache and has to be fetched.
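
The table-in-code arrangement described above might look like this (hypothetical names; a sketch following the description, not code from the thread):

```asm
; A power-of-two lookup table kept beside the algorithm that uses
; it; the table is jumped over once, then read as plain data.
PowerLookup proc
    jmp @F
PowTable dd 1, 2, 4, 8, 16, 32, 64, 128
@@:
    ; assumes eax holds an index 0..7 on entry
    mov eax, [PowTable + eax*4]
    ret
PowerLookup endp
```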

Where you are worried about data and code cache fetches is in code that is not even vaguely related to high-level code. Very high-speed block copies of memory, specialised forms of encryption, the multitude of multimedia-style data processing in both video and sound simultaneously; do that kind of work and you will understand the difference.

There are two reasons why you use small strings embedded in the code section.

1. When you wish to avoid a data section in very small programs.

2. When you wish to enclose a complete working algorithm without committing the exe that may use it to a data section when it is not needed. The perfect example is the string for a control window class.

It is good code design to put a block of high-level code in a separate procedure so you just need to call it for the functionality, and to restrict the size and keep it all in one place, you write the procedure with the control class as embedded data in the code section.
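
Such a self-contained procedure could be sketched as follows (a hypothetical sketch; the WS_* equates from windows.inc and the parameter names are assumed):

```asm
; A complete, callable button-creation procedure carrying its own
; class name, so a module that calls it needs no .data section.
MakeButton proc hParent:DWORD, ctlID:DWORD
    invoke CreateWindowEx, 0, ADDR BtnClass, 0,
           WS_CHILD or WS_VISIBLE, 0, 0, 80, 24,
           hParent, ctlID, 0, 0
    ret
BtnClass db "BUTTON",0    ; embedded data, placed after the ret
MakeButton endp
```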

There are simply no cache issues that arise here, as high-level code is orders of magnitude slower than assembler algorithm code.

There is nothing wrong with your preference to always use the data section, but even though the data may be under 10 bytes, it does commit you to a 512-byte PE section, which is often not needed for a read-only string.

I mentioned the PE specifications for a reason: they do not specify that data cannot be written in the code section because, as you should know, both data and code reside in the same flat segment. You have some control by defining the sections as read, write or execute, or combinations of the three, but if you choose to intermix blocks of code and blocks of data, the PE loader will load and run the file if it gets everything else right.

With DOS code, you would remember that a COM file specifies that both code and data reside in the same segment, as a COM file can only be one 64k segment long, but with EXE files of different memory models you actually use different segments, like CS for code, DS for data and ES for extra data.

It just happens that DOS MZ and Windows PE files are structured differently, yet many seem to forget this difference and try to do things the old way.

Regards,
Posted on 2004-02-18 01:56:06 by hutch--

Basically the same answer: string data directly in registers and immediates loaded into registers are all data.


Now, then tell me, how are you going to pass this "string data directly into registers" to an API function, say, MessageBox?
Posted on 2004-02-18 02:59:19 by Morris

Basically the same answer: string data directly in registers and immediates loaded into registers are all data.

It isn't "string data" though, it's just immediate values. To use it as "a string" (in the sense of passing it to a piece of code that accepts "a string"), you'd have to store it somewhere first. Besides, this is still *code*, and goes entirely in the code cache, without the problems of mixing code/data.


Now I have no doubt you could find places where embedded string data could cause a problem with code caching in terms of cache performance and this is why the technical data is in the Intel manuals but you would really have to go out of your way to write it that way.

If you mean in the sense of putting a text string in your code section, sure. Quite frankly, I don't see this as much of a performance problem, really... okay, sure, there are the cache issues of doing this, but it won't matter when casually passing a string to some speed-insensitive function. The "jmp @f" way of putting data in code is silly, though - it has no advantages at all, and makes stuff both slower & larger. While the speed and size disadvantages of doing this are very small in non-critical code, there's no reason to write bad code when it brings no advantages - this is quite different from "requiring people to write optimal code".


Write something like an sse2 algo and place a large body of unrelated data directly in the middle of it so that the algo did not fit into cache and you could get disasterous results.

As long as you don't execute the data ^_^ and don't reference the data-in-code in the SSE2 algo, this shouldn't be too much of a problem. The code and data caches are set-associative, not a linear chunk of memory. Cache lines, etc. Some of the data would end up being placed in the code cache because of cache line size, though.


Another instance is blocks of data using the standard DB sequences which can comfortably reside in the code section but never be executed. Even though it IS in the code section, it is in fact data.

Sure, and there isn't too much trouble with this. Just place your data before proc entrypoints or after a ret, and you can put data in code sections without having to jmp around. You can even have data-in-code without performance issues, as long as you don't mix data and code in the same cache line. Of course knowing cache line size beforehand is a bit tricky, so you might have to "align 64" to be safe - might as well put performance-critical stuff in .data (for initialized data, and do align it of course) or on the stack.
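
The "align 64" idea can be sketched as follows (64 is an assumed line size, and this assumes the section is declared with at least 64-byte alignment, which the default .code is not):

```asm
; Keeping embedded data and hot code in separate cache lines by
; padding each to an assumed 64-byte line boundary.
    align 64
MyTable db 64 dup (0)   ; hypothetical data block filling one line
    align 64
HotLoop:                ; performance-sensitive code starts on a
    ; ...               ; fresh line, sharing none with the data
```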


A table at the front of an algorithm is another perfect example of data in the code section. It is code than cannot be executed that you jump over and it is accessed item by item as data from the code that uses it. Putting the same table in the data section is usually slightly slower because the data is not in the data cache and has to be fetched.

Well, if scali is right here - and by the stuff he quoted directly from the Intel manuals, he just might be - *this* is the kind of thing that you should worry about, not putting your average trivial strings in the code section. From the information scali posted, I got the idea that
1) stuff that goes in the code cache doesn't go in the data cache - oops!
2) stuff doesn't keep out of the code cache just because you don't execute it; it's about cache lines.
Also, remember that code and data caches are split, and that furthermore they're set-associative and not linear. Modern processors do speculative prefetching, and the P4 can even handle multiple "streams" of data... you still get maximum performance by doing your own prefetching, of course.


There is nothing wrong with your preference to always use the data section, even though it may be under 10 bytes but it does commit you to a 512 byte PE section which is often not needed for a read only string.

512 bytes on disk, 4096 bytes in memory, and some additional PE header space usage; I'm well aware of the implications, and I'm well aware that even for a tiny app, this would all be lost to filesystem cluster size anyway :)


I mentioned the PE specifications for a reason, it does not specify that data cannot be written in the code section because as you should know, both data and code reside in the same flat segment. You have some control by defining the sections as read or write or execute or combinations of the three but if you choose to intermix blocks of code and blocks of data, the PE loader will load and run the file if it gets everything else right.

Yup, but the PE specifications do have read/write/execute flags, which do indicate the intention of making read-only data read-only (which is a quite sound idea, safety-wise). As for DOS code, nothing stops you from setting CS=DS in an exe file, and using ES to access "farther away" segments - I think at least one memory model did it this way. In small enough programs, you could even have CS!=DS, with the difference between the two segments small enough that you could still access the data through the code segment - this has been done more than once to delay crackers.

Just to make it perfectly clear, I'm not really opposed to putting data inside your code section, as long as you don't do this for performance-critical stuff. I don't see much use in doing it (as for self-contained pieces of code, you could always have a SEGMENT+ENDS), though. The thing I *really* oppose is using "jmp @f" to do this, as it's silly - considering you can put the data before proc entry points or after a RET.

Also, while not very relevant in the context of trivial code, cache *is* very important. Even a multi-GHz x86 would be of very little use if you execute the CR0 flag-changing code I posted in the previous post. It also does sound like you have a somewhat wrong idea of how the cache works, but that could just be me.

Anyway, you ought to update the 'literal' macro in masm32 to use SEGMENT+ENDS instead of .data/.code; this way it can be used in other segments, to construct stuff like string tables. It won't change the use of the macro, nor have any side effects - it just extends its use.
Posted on 2004-02-18 03:07:45 by f0dder
Now I have no doubt you could find places where embedded string data could cause a problem with code caching in terms of cache performance and this is why the technical data is in the Intel manuals but you would really have to go out of your way to write it that way.


From what I remember, the only time you can take a performance hit with data in the code segment is after an indirect jump that is not cached, so there is no problem there, and anyone would be hard pressed to find one. In your case Hutch, you use a JMP I think, so there is no cache hit at all, or an insignificant one at best.

Well, if scali is right here - and by the stuff he quoted directly fro m the intel manuals, he just might be - *this* is the kind of thing that you should worry about, not putting your average trivial strings in the code section. From the information scali posted, I got the idea that


Don't know where he quoted the stuff from, he never says. I am quoting from page 2-47 of the optimization manual, Rule # 27.
Posted on 2004-02-18 03:21:18 by donkey
The "literal" macro in the macros.inc file is a modification of an original design by "huh" from Blenheim in New Zealand and it is used by the SADD and CTXT macros depending on where they are called from.


literal MACRO quoted_text:VARARG
LOCAL local_text
.data
local_text db quoted_text,0
align 4
.code
EXITM <local_text>
ENDM

It does in fact write directly to the DATA section, then aligns the exit so the next item written to the data section is aligned at a 4-byte boundary. Committing flat memory model code to segment assumptions is of little use in 32-bit code, where the initialised data section is the right place to write initialised data that you want to both read and write.
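
For illustration, a wrapper in the spirit of SADD might build on literal like this (an assumed sketch, not the actual masm32 source; MB_OK is the usual windows.inc equate):

```asm
; Expand literal() to emit the string into .data, then hand back
; an OFFSET expression usable directly as an invoke argument.
SADD_sketch MACRO quoted_text:VARARG
    EXITM <offset literal(quoted_text)>
ENDM

; Usage: both strings land in .data, 4-byte aligned, and the call
; site receives only their addresses.
    invoke MessageBox, 0, SADD_sketch("Hello"), SADD_sketch("Demo"), MB_OK
```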
Posted on 2004-02-18 03:36:51 by hutch--