Hi everybody :)
It's been so long since I've been here (I visit this forum regularly, though passively!).
If some of you remember, 14 months ago I announced that I had designed a Digital Signal Processing library (FFT, Filtering, & IFFT), mainly (but not necessarily) for audio.
Since then, I have kept enhancing it (for example: supporting mono/stereo, supporting more data types, & simplifying usage at the same time). And I started updating the documentation.
The problem I faced is the large number of functions that I had to design to support many data types (16-bit & 32-bit PCM, 32-bit & 64-bit float).
(I made these changes, but still haven't uploaded the new version to my humble web page http://geocities.com/johnkirollos.)
Recently, I started thinking of using self-modifying code (SMC) to reduce the number of functions, by adding 1 argument specifying the data types ==> at the start of the function, the instructions affected by the type are modified (example: choosing between FLD <double_var> & FLD <float_var>).
Here is the current list of functions:
1- 256-pts functions:
FFTSignal_256_fi_f32
FFTSignal_256_fi_f64
FFTSignal_256_i16
FFTSignal_256_i32
FFT2Signals_256_fi_f32
FFT2Signals_256_fi_f64
FFT2Signals_256_i16
FFT2Signals_256_i32
FilterSignal_256_fi_f32
FilterSignal_256_fi_f64
FilterSignal_256_fio_f32
FilterSignal_256_fio_f64
FilterSignal_256_fo_i16
FilterSignal_256_fo_i32
FilterSignal_256_i16
FilterSignal_256_i32
Filter2Signals_256_fi_f32
Filter2Signals_256_fi_f64
Filter2Signals_256_fio_f32
Filter2Signals_256_fio_f64
Filter2Signals_256_fo_i16
Filter2Signals_256_fo_i32
Filter2Signals_256_i16
Filter2Signals_256_i32
IFFT_256_fo_f32
IFFT_256_fo_f64
IFFT_256_i16
IFFT_256_i32
FIR_OverlapSave_256_i16
FIR_OverlapSave_256_i32
2- 1024-pts functions:
FFTSignal_1024_fi_f32
FFTSignal_1024_fi_f64
FFTSignal_1024_i16
FFTSignal_1024_i32
FFT2Signals_1024_fi_f32
FFT2Signals_1024_fi_f64
FFT2Signals_1024_i16
FFT2Signals_1024_i32
FilterSignal_1024_fi_f32
FilterSignal_1024_fi_f64
FilterSignal_1024_fio_f32
FilterSignal_1024_fio_f64
FilterSignal_1024_fo_i16
FilterSignal_1024_fo_i32
FilterSignal_1024_i16
FilterSignal_1024_i32
Filter2Signals_1024_fi_f32
Filter2Signals_1024_fi_f64
Filter2Signals_1024_fio_f32
Filter2Signals_1024_fio_f64
Filter2Signals_1024_fo_i16
Filter2Signals_1024_fo_i32
Filter2Signals_1024_i16
Filter2Signals_1024_i32
IFFT_1024_fo_f32
IFFT_1024_fo_f64
IFFT_1024_i16
IFFT_1024_i32
FIR_OverlapSave_1024_i16
FIR_OverlapSave_1024_i32
As you see, the list is BIG! I guess using SMC would reduce the number of functions (& hence the code size) by more than 75%.
So, is SMC generally used for such a purpose? I have never used SMC, so should I go for it or not?
I think "runtime code generation" is a better term for this than SMC (which is more like patching a few locations). SMC also tends to do modifications multiple times, while RCG tends to construct procedures at the beginning of the program.
It's a good method when you have a lot of similar routines... You might want to have a look at http://softwire.sourceforge.net/
Thanks fOdder for clarifying the concept :)
So.. what I'm thinking of is not a bad thing to do.. I can't wait to see the outcome of going RCG :D
I downloaded SoftWire to see if it can help!
So.. what I'm thinking of is not a bad thing to do..
Indeed not - the thing to avoid is constantly modifying + calling code, that is pretty foo.
Also, be sure to allocate buffers with VirtualAlloc and the PAGE_EXECUTE_READWRITE protection type (possibly setting to PAGE_EXECUTE_READ or PAGE_EXECUTE when the code is constructed); if you don't do this, you will get protection errors on AMD64 or P4-64 computers.
fOdder Wrote:
Also, be sure to allocate buffers with VirtualAlloc and the PAGE_EXECUTE_READWRITE protection type (possibly setting to PAGE_EXECUTE_READ or PAGE_EXECUTE when the code is constructed); if you don't do this, you will get protection errors on AMD64 or P4-64 computers.
What I intend to do is just replace an fld with an fild, for example. That is, I'm not going to add new functions at runtime, just modify a couple of bytes in an already existing function (most probably the current one). A typical DSP function takes thousands of clocks to execute, so I guess adding a few cycles to make these modifications won't hurt, especially if the modified parts are far from the modification point..
Or am I missing something?
If you only need to patch some bytes (ie, no opcode length change), softwire is probably overkill; I would then suggest having a "template" routine, which you copy to VirtualAlloc'ed memory and patch modifications.
You could construct all the needed routines at program startup, or create them on-the-fly as necessary. If there are many routines and you won't always need all of them, I would probably create them on-the-fly but keep each routine after it's constructed.
You could call the DSP routines via pointer indirection; initially, the pointers would be to stub functions that create the necessary routine, then change the pointer to point to the newly created routine... that ought to work fairly well.
so I guess adding a few cycles to make these modifications won't hurt, especially if the modified parts are far from the modification point..
Testing will show :) - but mixing data/code and pipeline flushes are typically pretty expensive.
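The pointer-indirection idea above could look roughly like the following in C. This is only a sketch under assumptions: all names (lazy_fn, scale_table, build_count) are made up for illustration, and the "real" routine is ordinary compiled code standing in for one that would be copied from a template into VirtualAlloc'ed memory and patched.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names throughout; this is not the library's real API. */
typedef void (*lazy_fn)(float *table, size_t count);

void scale_table_stub(float *table, size_t count);

/* Callers always go through this pointer. It starts out at the stub. */
lazy_fn scale_table = scale_table_stub;

/* Counts how often the routine was "built" (for illustration only). */
int build_count = 0;

/* Stand-in for the routine that would be generated from a template. */
void scale_table_real(float *table, size_t count)
{
    assert(table != NULL);
    for (size_t i = 0; i < count; ++i)
        table[i] *= 2.0f;
}

/* The first call lands here: build once, repoint, then forward the call. */
void scale_table_stub(float *table, size_t count)
{
    ++build_count;                  /* the one-time construction step */
    scale_table = scale_table_real; /* later calls skip the stub      */
    scale_table(table, count);
}
```

Calling scale_table(t, n) twice runs the stub only on the first call; after that the pointer goes straight to the constructed routine, so the construction cost is paid once per routine.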
Thanks again fOdder,
I'll start by making some experiments & speed testing. Then other questions may arise.
Hi again,
After some searching on this board (which is a treasure, BTW :)), I made a test lib to investigate the effect of SMC on my idea.
The library contains 2 functions: SinTable_real4() & SinTable_real8(). The 1st replaces a table of real4 values by their sines (Array = sin(Array)); the 2nd does the same with real8.
Here are the C-style function declarations:
SinTable_real4(dword pTable, dword iCount);
SinTable_real8(dword pTable, dword iCount);
My idea was to combine the 2 functions into 1 for the following benefits:
1- Reducing code size.
2- Having only 1 function whose name is simply SinTable() (I named it SinTable_SMC() in my test lib).
To achieve this, an extra argument is passed to the function:
SinTable_SMC(dword pTable, dword iCount, dword iType);
iType=1 (choosing real4), or 2 (choosing real8).
This function has a piece of code at its start that writes the instructions affected by the data type. In this example function, they are:
fld mem, fstp mem, & add eax,<size> (incrementing the pointer to the next element).
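To make the size of the patch concrete, here is a small C model of it. It assumes the loop body addresses the element through [eax] and uses fsin, which may not match the attached lib; the offsets and names (sin_body_template, patch_sin_body) are hypothetical. In 32-bit x86 encoding, fld/fstp on a real4 operand use opcode byte D9 and on a real8 operand use DD, so only 3 bytes differ between the two variants. The sketch patches a plain byte buffer and does not execute it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed loop body, real4 variant (32-bit x86 encoding):
     D9 00      fld  dword ptr [eax]
     D9 FE      fsin
     D9 18      fstp dword ptr [eax]
     83 C0 04   add  eax, 4                                   */
const unsigned char sin_body_template[9] = {
    0xD9, 0x00, 0xD9, 0xFE, 0xD9, 0x18, 0x83, 0xC0, 0x04
};

/* Offsets of the type-dependent bytes inside the template. */
enum { OFF_FLD = 0, OFF_FSTP = 4, OFF_STEP = 8, BODY_SIZE = 9 };

/* Copies the template into out and patches the 3 type-dependent bytes.
   itype follows the post: 1 = real4, 2 = real8.
   Returns 0 on success, -1 for an unknown type (out is left untouched). */
int patch_sin_body(unsigned char *out, int itype)
{
    assert(out != NULL);
    if (itype != 1 && itype != 2)
        return -1;
    memcpy(out, sin_body_template, BODY_SIZE);
    unsigned char fpu_opcode = (itype == 1) ? 0xD9 : 0xDD; /* m32 vs m64 */
    out[OFF_FLD]  = fpu_opcode;            /* fld  dword/qword ptr [eax] */
    out[OFF_FSTP] = fpu_opcode;            /* fstp dword/qword ptr [eax] */
    out[OFF_STEP] = (itype == 1) ? 4 : 8;  /* element size for add eax   */
    return 0;
}
```

To actually run the patched bytes, the buffer would have to live in executable memory, as discussed earlier in the thread.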
So, I tested the speed of both the normal & SMC functions, for several values of iCount (array length) on a P4 CPU.
My conclusion is:
---------------
Modifying the code adds an overhead of about 1000 clocks.
For example, with array length=200:
SinTable_real4() ----> average=26,344 clocks
SinTable_SMC() with iType=1 ----> average=27,486 clocks
For an array length of 1000, the difference is still around 1000 clocks.
So, may we say that SMC costs about 1000 clocks on a P4? Would this value depend on the distance between the patching & patched code? And is it better or worse on previous processors?
Then, to minimize the number of times the function has to modify the code, I used a global variable called iLastType that holds 1 if the function is currently set up for real4, & 2 if real8.
So if, for example, I call SinTable_SMC() with iType=1, iType is compared to iLastType; if they're equal, there is no need for code modification. => For example, when calling it 100 consecutive times with the same data type, only the 1st execution suffers the 1000-clock overhead.
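The iLastType check can be modelled in C as below. This is a sketch, not the code from the test lib: the function name SinTable_SMC_guard and the counter patch_count are made up, the initial value 0 ("never patched yet") is an assumption, and the actual rewriting of fld/fstp/add is replaced by a counter so the effect of the guard is visible.

```c
#include <assert.h>

/* Hypothetical C model of the guard described above. */
int iLastType = 0;   /* assumed: 0 = never patched, 1 = real4, 2 = real8 */
int patch_count = 0; /* how many times the code would have been rewritten */

void SinTable_SMC_guard(int iType)
{
    assert(iType == 1 || iType == 2);
    if (iType != iLastType) {
        ++patch_count;      /* stands in for rewriting fld/fstp/add */
        iLastType = iType;
    }
    /* ... the type-independent processing would run here ... */
}
```

With this guard, 100 consecutive calls with iType=1 patch once, and switching to iType=2 costs one more patch, so the roughly 1000-clock penalty is paid per type change rather than per call.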
So, in what cases could this approach be most beneficial?
I think the answer could be the following conditions:
- The number of functions needed would otherwise be large:
Consider supporting word, dword, real4, real8, i.e. 4 functions. (In my DSP lib, each function deals with more than 1 type => large set of functions - see the list at the start of this topic.)
- The needed code modification is minor compared to the function itself, so that the patching code is small compared to the whole function:
For example: fld --> lengthy processing on the FPU stack (independent of type) --> fst.
=> only fld & fst need modification.
- The function takes >>1000 clocks (for P4).
- The type need not be changed frequently:
For example, an FFT algorithm is usually performed in a loop to process a long audio signal => modification occurs only in the 1st loop iteration.
About the possible disadvantages:
1- Code is writable => the virtual memory system treats it as data (swapped to disk rather than discarded). But I think this won't have a pronounced effect.
2- Some bad pointer bugs can't be easily discovered (no exception is raised on writing to code by mistake) since the code is writable. But I think we can minimize this by not making the whole code writable, just the SMC functions.
3- Such a function can't be called from 2 different threads simultaneously (unless both threads are using the same data types, i.e. the same code).
(I think this is not a big problem, since you usually won't use the same code in both GUI & worker threads.)
Finally, these are my conclusions after very brief research, experimentation, & testing (2 days). So what do you think about it? If someone has experience with this issue, or if something is wrong/imprecise in my idea, assumptions, & conclusions, comments would be helpful.
Note: I attached the lib (a RadASM project) - I didn't include the speed tests because they're in Sphinx C--.
Sorry for this long thread! I hope it could be useful to others :)
Nice research, and your conclusions seem okay :)
Disadvantages 2 and 3 can be avoided with a just-in-time/caching scheme; #2 would be avoided by VirtualAlloc/write-and-patch/VirtualProtect, and #3 would be avoided by some smart CriticalSection protection of the *first* time a routine is called (ie, when the thunk generates the real code).
fOdder Said:
#3 would be avoided by some smart CriticalSection protection of the *first* time a routine is called (ie, when the thunk generates the real code).
Thanks again, fOdder :)
But the bold part in the quote just needs to be changed to "*first* time a routine is called with a code change request (different data type(s))".
So, I think I'm going to try the approach on some "real" function..
Not to burst your bubble after all the research you have done, but have you considered using macros to do this instead? MASM has a very powerful macro facility that lets you do things like specifying a variable type and then doing conditional or non-conditional assembly on it. Plus it's a lot easier than this approach.
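As a rough analogue of the macro approach, written in C rather than MASM (so this only illustrates the idea, not MASM's conditional assembly syntax), one macro body can be expanded once per element type. The names DEFINE_SQUARE_TABLE, SquareTable_real4 and SquareTable_real8 are hypothetical, and squaring is just a placeholder operation.

```c
#include <assert.h>
#include <stddef.h>

/* One macro body expands into one function per element type: short
   source, but every variant still ends up in the binary. */
#define DEFINE_SQUARE_TABLE(NAME, TYPE)              \
    void NAME(TYPE *table, size_t count)             \
    {                                                \
        assert(table != NULL);                       \
        for (size_t i = 0; i < count; ++i)           \
            table[i] = table[i] * table[i];          \
    }

DEFINE_SQUARE_TABLE(SquareTable_real4, float)   /* real4 variant */
DEFINE_SQUARE_TABLE(SquareTable_real8, double)  /* real8 variant */
```

The source stays small, but the library still contains one function per type, which is the code-size side of the trade-off being discussed.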
Just an idea. What if you use two procedures? One is the main procedure, and the other is used to set up the main procedure.
Example: SinTable_set(type) and SinTable_SMC(ptable,iCount)
In SinTable_set(type) you modify the SinTable_SMC code. You can set the section writable or not after the procedure modification. This can be useful in cases where the type is not going to be modified for a while.
Example: changing type because of user preferences, PC resources (processor characteristics, available memory), etc.
This idea can be used in cases like this:
SinTable_set(1)
SinTable_SMC(...)
SinTable_SMC(...)
SinTable_SMC(...)
SinTable_set(2)
SinTable_SMC(pTable_1, iCount_1)
SinTable_SMC(pTable_2, iCount_2)
...
...
SinTable_SMC(pTable_n, iCount_n)
SinTable_set(x)
...
As the example shows, this idea cannot be applied in all cases (but it can be useful in others).
This was just an idea (that can be considered or not). It is similar to yours, but you don't have to pass the data type to the main procedure in every call (and you don't have to do data type checking in it). But it brings programming problems like: what was the last data type used? A programmer using the lib can call the procedure with the wrong data type if he is distracted (I hope this idea doesn't confuse things further).
Regards.
Kecol.-
Mark, I think he needs all the functions in the lib, and that is why he is not considering the use of macros.
Mark, I think he needs all the functions in the lib, and that is why he is not considering the use of macros.
The macros would be in the library as well.
Mark Larson Said:
But have you considered using macros to do this instead? MASM has a very powerful macro facility that lets you do things like specifying a variable type and then doing conditional or non-conditional assembly on it. Plus it's a lot easier than this approach.
I'm already using macros in the current design & this has 4 disadvantages (in my case):
1- Library code size is large. Many functions are almost identical except for a few instructions or constant values.
2- Many functions with names that are hard to remember.
3- The FFT length parameter is currently a compile-time constant. I had to do this due to the limited number of registers. So, I had to make a function for each FFT length (64, 256, 1024, ...) ==> the number of functions becomes huge, & so does the library size!
To demonstrate the lack of registers, consider (if you have time :)) this macro invocation taken from a loop body:
Butterfly_LastLevel_RealOut ,,\
,,\
,,,
It uses 5 registers out of the 7 available. EAX & ECX are also used for other purposes in the loop.
(In this macro, I could have avoided using EBX & added its value to EBP instead, but EBX is also used for loop termination (when EBX==0) ==> a CMP instruction is saved.)
4- The number of channels (Mono, Stereo, ...) is a parameter that the user passes to the function. This adds complexity & consequently registers/mem vars are required in loops => again, speed loss! So, if the number of channels is determined at compile-time, we can free some registers for other use ==> we can gain some speed-up that may outweigh the speed loss from SMC, especially if we don't modify the code all the time.
Concerning how easy this approach is compared to using macros: I made a list of all the parts that would be affected by using SMC. They are all just a single instruction or a single constant, and there are few of them.
But I still don't want to jump to conclusions so fast, especially since I have never used SMC before!
Just an idea. What if you use two procedures? One is the main procedure, and the other is used to set up the main procedure.
Example: SinTable_set(type) and SinTable_SMC(ptable,iCount)
In SinTable_set(type) you modify the SinTable_SMC code. You can set the section writable or not after the procedure modification. This can be useful in cases where the type is not going to be modified for a while.
I think your idea is good, Kecol. But as you said, it has a downside in that it leaves the duty of checking whether modification is needed to the function's user. So, he'd either call the 2 functions as a pair all the time, or check 1st:
IF(type changed)
SinTable_Set(x)
ENDIF
SinTable_SMC(...)
I'll keep it in mind anyway :)
Hi. If you are willing to sacrifice code size, maybe you can use pointers to functions, i.e. you have a table of all available functions and a table of the functions used for only one data type, and you initialize that table at startup or when needed.
Ex:
allfuncs:
SinTable4
SinTable8
CosTable4
CosTable8
...
singlyfuncs: ; used in code
SinTable dd SinTable4
CosTable dd CosTable8
...
Code always uses functions from singlyfuncs:
invoke SinTable, ...
And based on user input you can change the contents of the singlyfuncs table (i.e. which functions the pointers refer to).
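In C terms, the two tables might be sketched like this. The function bodies are toy placeholders (halving instead of sin/cos) and every name here (table_fn, HalveTable4, HalveTable8, select_type) is hypothetical; the point is only that the calling code uses one pointer, and a selector swaps what it points to.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*table_fn)(void *table, size_t count);

/* "allfuncs": every type-specific variant lives in the binary. */
void HalveTable4(void *table, size_t count)
{
    assert(table != NULL);
    float *p = table;                     /* real4 elements */
    for (size_t i = 0; i < count; ++i)
        p[i] *= 0.5f;
}

void HalveTable8(void *table, size_t count)
{
    assert(table != NULL);
    double *p = table;                    /* real8 elements */
    for (size_t i = 0; i < count; ++i)
        p[i] *= 0.5;
}

/* "singlyfuncs": the single pointer that the calling code uses. */
table_fn HalveTable = HalveTable4;

/* Re-initialise the pointer for a data type: 1 = real4, 2 = real8.
   Returns 0 on success, -1 for an unknown type (pointer unchanged). */
int select_type(int itype)
{
    switch (itype) {
    case 1:  HalveTable = HalveTable4; return 0;
    case 2:  HalveTable = HalveTable8; return 0;
    default: return -1;
    }
}
```

Compared with SMC, this keeps all variants in the library (no size saving), but switching types is just a pointer store and the code pages stay read-only.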