Working on a distributed crossword project, I need to generate ASM code for Windows on the fly.
1) I'm currently generating an ASM source file and using MASM to build an OBJ.
I first wanted to include a small assembler in my code to compile the generated file and obtain a DLL file, but I found no easy way to do this: the GEMA assembler seemed very small but doesn't generate DLLs, and the other assemblers are larger than my program! (My client/server GUI takes around 100 KB.)
Does anybody know a small open source assembler that can build a DLL?
2) Then, I realized that I could generate code directly.
The problem in this case is that when I generate code in a data section, the code runs twice as slowly as in a code section!
Is there a way to modify a .code section on Windows?
Thank you for your attention!
JC
You can link the EXE as read/write/executable, but I doubt it will fix the speed problem; you are probably getting some cache pollution from code and data being in the same place.
GoLink is capable of linking DLL files and can use MASM object files; it is 44K (compared to nearly 650K for Link.exe) and requires no external files. The distribution license is open for non-commercial applications. GoAsm, the accompanying assembler, can also make OBJ files, has a syntax very close to MASM and is around 93K (smaller than GEMA and much smaller than the 400K ML.exe), but it is not required if you are using MASM. The combination of GoAsm/GoLink can represent a savings of several hundred K over MASM/LINK.
Using an assembler is definitely not very efficient. It adds many kilobytes to your program, and takes ages to execute.
As for the Linux thing, you have yet to show us something about it which doesn't suck. (My dad just told me it's slow because it's so advanced. Hah)
I have split the flame war out of this topic and moved it to the Heap, a much more appropriate place to discuss it.
Since donkey cut away useful information, I'm going to re-paste it here.
If you're generating code on the fly, be *SURE* to use VirtualAlloc for the buffer. Firstly, it makes sure your code will run on processors with the execute-disable feature (AMD64, the newest P4s), provided you ask for PAGE_EXECUTE_READWRITE protection. Secondly, it keeps the code and data far apart: having code near modified data gives an extreme slowdown. VirtualAlloc allocates on 64 KB boundaries, which is quite fine.
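A minimal sketch of that allocation in C, under a few assumptions: the helper names jit_alloc and jit_free are made up for illustration, and the mmap branch is only a stand-in so the sketch also compiles on non-Windows systems. The Windows branch is the part the advice above is about.

```c
#ifndef _WIN32
#define _DEFAULT_SOURCE          /* exposes MAP_ANONYMOUS on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#ifdef _WIN32
#include <windows.h>
#else
#include <sys/mman.h>            /* non-Windows stand-in (assumption) */
#endif

/* Hypothetical helper: get a private, executable and writable region
   for generated code. Returns NULL on failure. */
static void *jit_alloc(size_t size)
{
    assert(size > 0);
#ifdef _WIN32
    /* Reserve and commit in one call; the base address is aligned to
       the allocation granularity, away from heap data. */
    return VirtualAlloc(NULL, size, MEM_COMMIT | MEM_RESERVE,
                        PAGE_EXECUTE_READWRITE);
#else
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE | PROT_EXEC,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
#endif
}

/* Hypothetical helper: release a region obtained from jit_alloc. */
static void jit_free(void *p, size_t size)
{
#ifdef _WIN32
    (void)size;                  /* MEM_RELEASE requires a size of 0 */
    VirtualFree(p, 0, MEM_RELEASE);
#else
    munmap(p, size);
#endif
}
```

Once the bytes are written, the buffer can be cast to a function pointer of the matching signature and called; the sketch leaves that out because it depends on the target CPU and calling convention.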
As for an assembler, have a look at FASM. It's pretty good at backend work, it's open source (with a very liberal license, unlike crappy GPL), and it shouldn't be too hard to modify it to generate in-memory code. I think the executable is around 64 KB for the console version, and it can output a lot of formats (including PE DLLs) directly without using a linker.
http://www.flatassembler.net
...and I'd still advise against bundling masm with anything. Microsoft tends not to reply to inquiries, but Microsoft Germany has replied and said the license doesn't allow it (check google archive of alt.lang.asm). Any further on the masm licensing can go to /dev/null or http://www.win32asmcommunity.net/board/viewtopic.php?t=20203 .
Sorry about that, f0dder, but it's annoying to have threads hijacked by the never-ending flame war, and the guy deserved a reasonable answer without all of the arguing. I can't split a post, moving only parts of it, so I had to move the whole thing.
F0dder, thank you very much for your valuable information!
It was not my intention to bundle MASM, since it's 4 times bigger than my own code! (My C client/server GUI takes less than 100 KB.)
About FASM, I think it would take more time to integrate into my code than to generate binary code directly.
Should I use VirtualProtect or VirtualLock after I have generated an ASM routine? Since I have no access to the new Athlons, I'd like to know if changing PAGE_EXECUTE_READWRITE to PAGE_EXECUTE leads to a speedup.
MCoder,
Memory is memory, as long as in this instance it is set as EXECUTABLE. What you need to get the swing of is how different processors handle their respective code and data caches. What you can do if you have a limited range of code that you want to create on the fly is create it normally in another test app, copy the code as DATA and then write it to the data section of the app you have in mind. You then select the code you want to write to executable memory at runtime.
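As a rough illustration of the "code as DATA" idea, here is a small C sketch. The snippet table, the names snippet and copy_snippet, and the two x86 byte sequences are all assumptions chosen for the example; real snippets would be taken from the object code of the test app and would have to be position independent (no relative jumps or calls out of the snippet).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A pre-assembled routine stored as plain data. */
typedef struct {
    const unsigned char *bytes;
    size_t len;
} snippet;

/* Example x86 encodings, assembled ahead of time. */
static const unsigned char ret_zero[] = { 0x31, 0xC0, 0xC3 };   /* xor eax,eax ; ret */
static const unsigned char ret_one[]  = { 0xB8, 0x01, 0x00,
                                          0x00, 0x00, 0xC3 };   /* mov eax,1 ; ret */

static const snippet snippets[] = {
    { ret_zero, sizeof ret_zero },
    { ret_one,  sizeof ret_one  },
};

/* Copy snippet number `which` into dst (normally executable memory).
   Returns the number of bytes written, or 0 if it does not fit. */
static size_t copy_snippet(unsigned char *dst, size_t cap, size_t which)
{
    assert(which < sizeof snippets / sizeof snippets[0]);
    if (snippets[which].len > cap)
        return 0;
    memcpy(dst, snippets[which].bytes, snippets[which].len);
    return snippets[which].len;
}
```

The runtime choice is then just an index into the table, and dst would point into a buffer that has been made executable.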
If it is more complex than that, you may need some form of opcode generation that can write directly to executable memory. Where you will notice a problem is the code cache lag from writing the code to running the code so if you can write it first, do something else then run the code, it will probably be heaps faster that way.
Code alignment is trivial on most late model hardware, but it's easy enough to align the start of where you write the code to more or less whatever alignment you want.
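For what it's worth, aligning the write position can be sketched in a few lines of C. The helper name align_up is hypothetical, and the sketch assumes the alignment is a power of two (16 is a common choice for routine entry points).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round p up to the next multiple of align.
   align must be a non-zero power of two. */
static unsigned char *align_up(unsigned char *p, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    uintptr_t a = (uintptr_t)p;
    a = (a + (align - 1)) & ~(uintptr_t)(align - 1);
    return (unsigned char *)a;
}
```

The caller still has to check that the aligned pointer stays inside the buffer, and may want to fill the skipped bytes with NOPs (0x90 on x86) if execution can fall through them.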
LZAsm may be an alternative to FASM. Its syntax is more MASM-like, and its size is 80 KB.
MCoder, VirtualLock doesn't really matter. You should use VirtualAlloc rather than calling VirtualProtect on heap-allocated memory, as the latter would lead to a whole page of heap memory being "deprotected". Furthermore, heap allocations have a fine granularity, which means your generated executable code would sit next to (possibly modified) data, leading to a slowdown.
So, VirtualAlloc it is. I don't think using VirtualProtect after the code is written will cause any speedup, but it would be "a nice thing" to do anyway, since it removes a potential security hole.
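A hedged C sketch of that "write first, then drop write access" step. The helper name jit_publish is hypothetical, and two choices here are mine rather than anything stated in the thread: the region starts out as PAGE_READWRITE, and it ends up as PAGE_EXECUTE_READ instead of PAGE_EXECUTE so the bytes stay readable. The mprotect branch is only a stand-in for non-Windows builds.

```c
#ifndef _WIN32
#define _DEFAULT_SOURCE          /* exposes MAP_ANONYMOUS on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#ifdef _WIN32
#include <windows.h>
#else
#include <sys/mman.h>            /* non-Windows stand-in (assumption) */
#endif

/* Hypothetical helper: copy finished code into a fresh region, then
   make it executable and no longer writable. Returns NULL on failure. */
static void *jit_publish(const unsigned char *code, size_t len)
{
    assert(code != NULL && len > 0);
#ifdef _WIN32
    DWORD old_protect;
    void *p = VirtualAlloc(NULL, len, MEM_COMMIT | MEM_RESERVE,
                           PAGE_READWRITE);
    if (p == NULL)
        return NULL;
    memcpy(p, code, len);
    if (!VirtualProtect(p, len, PAGE_EXECUTE_READ, &old_protect)) {
        VirtualFree(p, 0, MEM_RELEASE);
        return NULL;
    }
    /* Recommended after writing code, even where it is cheap. */
    FlushInstructionCache(GetCurrentProcess(), p, len);
    return p;
#else
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    memcpy(p, code, len);
    if (mprotect(p, len, PROT_READ | PROT_EXEC) != 0) {
        munmap(p, len);
        return NULL;
    }
    return p;
#endif
}
```

The region is released later with VirtualFree(p, 0, MEM_RELEASE) on Windows. Whether the protection change affects speed is not something this sketch can show; its point is only that the page can no longer be overwritten.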
If you can generate binary code on the fly without problems, there's no reason to use an assembler - your code will be faster and more compact. If you need the flexibility an assembler offers, do take a look at fasm. It shouldn't be too hard to integrate, and the license is liberal.
SoftWire is interesting, but it's written in C++. This by itself isn't a problem, but it will "probably" have "some" size impact, and you've mentioned size as an important parameter :)
If it is more complex than that, you may need some form of opcode generation that can write directly to executable memory. Where you will notice a problem is the code cache lag from writing the code to running the code so if you can write it first, do something else then run the code, it will probably be heaps faster that way.
F0dder said that the granularity of VirtualAlloc is 64 KB, so there should not be any problem with direct code generation.
Anyway, I'll follow your advice about generating code in the 'VirtualAlloc'ed heap, then filling the data cache, then executing the generated code.
SoftWire is interesting, but it's written in C++. This by itself isn't a problem, but it will "probably" have "some" size impact, and you've mentioned size as an important parameter :)
As you said, VirtualLock is quite useless; it just keeps the generated code from running in swappable memory.
About the code generation, I need full opcode support, since I use code generation in all my distributed projects (I have finished several already, mostly about programming contests).
My crossword generator uses only 10 or 15 different opcodes, but I cannot predict how many different opcodes I'll need for my future projects.
The interesting fact about SoftWire is that you can type C++ code that directly outputs ASM code (for example: cg->mov(al, *this);). GEMA also has a somewhat similar ability.
As I'm not fond of C++, I'll try to write a C based JIT generator.
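To give an idea of what the C equivalent of such emitter calls might look like, here is a tiny sketch. Everything in it is hypothetical (the codebuf type and the emit_* names are invented for the example), and it covers only three x86 instructions: mov eax, imm32 (B8 id), add eax, imm32 (05 id) and ret (C3).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Output buffer for generated machine code. */
typedef struct {
    unsigned char *buf;
    size_t len;   /* bytes emitted so far */
    size_t cap;   /* total capacity of buf */
} codebuf;

/* Append one byte. */
static void emit8(codebuf *cb, uint8_t b)
{
    assert(cb->len < cb->cap);
    cb->buf[cb->len++] = b;
}

/* Append a 32-bit value in little-endian order, as x86 expects. */
static void emit32(codebuf *cb, uint32_t v)
{
    for (int i = 0; i < 4; i++)
        emit8(cb, (uint8_t)(v >> (8 * i)));
}

/* mov eax, imm32 */
static void emit_mov_eax_imm32(codebuf *cb, uint32_t imm)
{
    emit8(cb, 0xB8);
    emit32(cb, imm);
}

/* add eax, imm32 */
static void emit_add_eax_imm32(codebuf *cb, uint32_t imm)
{
    emit8(cb, 0x05);
    emit32(cb, imm);
}

/* ret */
static void emit_ret(codebuf *cb)
{
    emit8(cb, 0xC3);
}
```

A call sequence such as emit_mov_eax_imm32(&cb, 40); emit_add_eax_imm32(&cb, 2); emit_ret(&cb); writes 11 bytes. Full opcode support would need ModRM/SIB encoding on top of this, which the sketch does not attempt.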
Thank you for your advice!
Anyway, I'll follow your advice about generating code in the 'VirtualAlloc'ed heap, then filling the data cache, then executing the generated code.
VirtualAlloc != HEAP. And don't mix cache with those things :). Read your processor documents... the most important thing to keep in mind is to generate your code once, and execute it later. If you can space out the generation and execution, that is good. If you keep modifying the code, an interpreted approach will likely be better.
As you said, VirtualLock is quite useless, it just avoids that the generated code runs on swappable memory.
It's not useless as such - even though it does not even guarantee that you run unswappable. But most people should avoid it as they don't understand the implications :) (read "Inside Windows 2000").