Hi,
i am looking for a quick and nice solution to scan the code section for any data in it.
sometimes when i debug (ollydbg) / disasm a file i can see from time to time data insde the disassembled code section which is displayed as byte: DB XX.
any ideas?
Posted on 2003-11-17 10:50:13 by wizzra
Sometimes when a debugger doesn't understand an instruction for instance (if it started dissasembling in the middle of an instruction) then you will see that. It is actually code. Maybe it's a protected mode instruction not interpreted by Olly.
Posted on 2003-11-17 11:00:23 by mrgone
the problem is:


The assembler fills in these gaps by some random (or irrelevant) value since these locations are never executed. The problem is that there is no way to distinguish data from gaps within instructions. If the normal process of disassembly is allowed to take its own course by treating these gaps as genuine code, the opcode alignment may be destroyed. Once misaligned, there is no way to recover and we may get unreliable disassembly. Thus it is absolutely essential to prevent the processing of such gaps. We do this by identifying the basic blocks in the code section. Each basic block constitutes a valid address range in the .text section. This is achieved by making one extra pass of code analysis on the .text section. Thus our disassembler is a two pass disassembler.


mrgone,
every byte sequence can produce assembly, even if it doesn't make sense for use in our proggy.
problem is to loacte that 'data' and skip it to the next code segment.
i am in need to find a good way to do it.

normal assembly instructions (add,inc..etc) are easy to pass using disasm engine, but i guess when we get at conditional or unconditional jumps, we should store the address and 'virtually' trace if next code is references whitin the program.
any ideas for algorithm?
Posted on 2003-11-17 11:33:55 by wizzra
Hi, wizzra. :)

I think Olly uses an heuristic algorithm to figure out where each function begins and ends, and wich parts of the code never get to be executed. I don't think there's an easy way to do that, and in any case it's not bulletproof. :(

Probably the simplest way is to "mark" target addresses of jumps as valid code addresses. That way if you're decoding an instruction and you find it "overlaps" an address known to be a jump target, then most likely the dissassembler misaligned somewhere. That should work fine for most programs anyway, the only problem would be anti-dissassembly tricks.
Posted on 2003-11-17 11:46:58 by QvasiModo
start from entrypoint and disassemble from there, following Jcc and CALLs. That's the only way to really do it, and it can be fooled by false conditional jumps etc.
Posted on 2003-11-17 13:36:55 by f0dder
That's what I thought, but you said it better. ;)
Anyway, is that really the only way to do it? :(
Posted on 2003-11-17 13:55:20 by QvasiModo
such task require a smart recursive procedure than...
hm..



here is some useless code with no meaning what so ever,
trying to find a way to determine data location.
analysis using jxx tracing [recursive(?)]

0: mov eax,ebx
1: xor ecx,edx
2: jmp 5
3: jmp 11
4: cmp eax,ecx
5: jz 3
6: cmp eax,edx
7: jb 16
8: db 'h'
9: db 'i'
10: db 0
- entry point-
11: inc eax
12: add ecx,eax
13: cmp edx,ecx
14: jg 16
15: dec eax
16: xor eax,eax

analyze from offset 0:

start:
offset(2)->offset(5)
/\
offset(3) offset(7)
| /\
offset(14) (16) [ data offset: 8 ]
/\ |
(16) (15) offset(14)
/\
(16) (15)

jxx accurds at: 2,5,3,7,14


need to find a good algorithm to create code blocks (start->end offsets) or data block(start->end offsets)
Posted on 2003-11-17 14:40:35 by wizzra
I think recursive is a bad idea... for a decent-sized app, you risk using very large amounts of stack space. Probably smarter to make it linear, and add "entrypoints" to some list - everytime you reach a branch, add the branch location to the "entrypoint list" (if not there already). Once you reach a section boundary (or invalid instruction / ret with no matching call / etc), and there's no more non-visited "entrypoints", you're done.
Posted on 2003-11-17 14:46:06 by f0dder
hi Fodder,
when doing linearyou mean by:

0: ...
1: ...
2. jmp x ; save x position
3. ...
4. ...
5. call y ; save y position
6. ...

and trace just down, without following any branch address,
and for each address to ask if it's already in my 'entry point' list ?
thus, to locate subroutines (with signaure: push ebp, mov ebp, esp..) i can mark only call x location.
but than if that is the way you was reffering, what happens to all of the 'code' before EP how would you reach it than?
cuz i know that with recursive you can follow all flows, but stack could explode :(
i guess i need a visual example, would be nice to see example.
thnx
-wizzra
Posted on 2003-11-23 13:03:15 by wizzra
Once you reach the "end" of an "entrypoint", you start disassembling the next "entrypoint" in your list, and thus, eventually, you should have covered all code in the program. Of course this will not disassemble "dead code", but that's not too important anyway, in my opinion.
Posted on 2003-11-23 13:10:34 by f0dder
The following is why is generally impossible to distinguish data between instructions:


1. cmp eax,[404000]
2. jnge label1 ; jump is ALWAYS taken
3. BYTE "HI" ; 48H dec eax, 49H dec ecx
4. jmp fakelabel1
5.label1:
6. ...
7. jmp label2
8.fakelabel1:
9. BYTE "THERE" ; 54H push esp, 48H dec eax, 45H inc ebp, 52H push edx, 45H inc ebp
10. jmp fakelabel2
11.label2:
12. ...
13. jmp label3
14.fakelabel2:
15. ; fake-text-or-code

Line 2. This code is anti-disasm - this jump is always taken, but from disassembler's point of view is impossible to get it, because it doesn't know what value will be at [404000] at run-time.

Line 3. Text "HI" seems to be valid instructions, so it can't be recognized as text.

Line 4. Offset fakelabel1 is authenticated by this jump, which is never taken.

Line 9. Text "THERE" will be recognized also as instructions.

Line 10. Offset fakelabel2 is authenticated by this jump, which is never taken.

Line 15. Again the same...
Posted on 2003-11-25 03:36:01 by MazeGen
Wow, good examply of anti-disasm, MazeGen. :)
But I wonder if softs like WDASM and IDA can deal with it anyway... IMHO Wizzra should use the simple algo described by F0dder, since ot'll work for most cases and disassembling code that was meant not to be disassembled it a fairly complicated task, likely to involve either heuristics or simulation of execution. Since he's working in a debugger/disassembler, perhaps it's just better to let the user handle this special cases, making the disasm engine more configurable.
Posted on 2003-11-25 18:06:20 by QvasiModo
That actually wasn't a good example of anti-disasm, as (assuming the comments are correct, I didn't bother to check) the generated opcodes don't affect the flow of the code - having multi-byte garbage that messes up valid instructions later is a better example of anti-disasm.

But it's true that it takes some work to handle anti-disasm code, and IMO it is a better idea to focus on making a dynamic and interactive disassembler, so the user can correct these problems himself. Work on heuristics can be done later on, or the user can run deobfuscation tools of his own.
Posted on 2003-11-25 18:17:16 by f0dder

But I wonder if softs like WDASM and IDA can deal with it anyway...

Try it ;)

IMHO Wizzra should use the simple algo described by F0dder, since ot'll work for most cases and disassembling code that was meant not to be disassembled it a fairly complicated task, likely to involve either heuristics or simulation of execution.

Yeah, but wizzra posted a question similar to my answer on another board, so that I answered it here.

Since he's working in a debugger/disassembler, perhaps it's just better to let the user handle this special cases, making the disasm engine more configurable.

Yeah, this is, of course, the best solution.

That actually wasn't a good example of anti-disasm, as (assuming the comments are correct, I didn't bother to check) the generated opcodes don't affect the flow of the code - having multi-byte garbage that messes up valid instructions later is a better example of anti-disasm.

Well, it was an anti-disasm model example in sense of difficulty in recognizing data between code. IMHO it's one of the most difficult tasks while coding disasm or debug engine.

But it's true that it takes some work to handle anti-disasm code, and IMO it is a better idea to focus on making a dynamic and interactive disassembler, so the user can correct these problems himself. Work on heuristics can be done later on, or the user can run deobfuscation tools of his own.

Do you think there is any way how to get, with 100% assurance, the value at [404000] at run-time, when we disassemble at disasm-time, even using heuristics? I'm asking because I'm just exploring it.

Thanks for comments, guys.
Posted on 2003-11-26 14:48:56 by MazeGen


Try it ;)

Ok, I will. (I'm curious) :)
Do you think there is any way how to get, with 100% assurance, the value at [404000] at run-time, when we disassemble at disasm-time, even using heuristics? I'm asking because I'm just exploring it.

F0dder probably know better than me, but IMHO it's possible by simulation of code execution. But just as long as the value in [4040000] does not depend on other factors that can't be simulated (like timers, I/O or user input, etc).
Posted on 2003-11-26 16:24:16 by QvasiModo
hi all,
thnx guys for the ideas.
so walking branches & marking adresss as valid is a 'quick' sulotion, so i guess it somehow reduces few 'data' parts.
simulating code without knowing the register's values and on the fly data isn't much of a good sulotion is it ? ;)
i guess when it comes to heruistic it looks for 'bad' commands, 'weird' opcodes which usually doesn't fit or will not be created by 'normal' compiler.

let's not forget Zen when checking if code is data ;D
Posted on 2003-11-27 00:30:00 by wizzra
well, i'v done some hand analyse, with a 'stack' based idea for return addersses.
do you think its too much for basic analyse or you would suggest a simplier algo?
below analyse is fully analysed ..just a snippet.



=================================
assume entry point: ( i know its a function ;P)
=================================
0040116C . 55 PUSH EBP v
0040116D . 8BEC MOV EBP,ESP v
0040116F . 817D 0C 840000>CMP DWORD PTR SS:[EBP+C],84 v
00401176 . 75 05 JNZ SHORT DESPATCH.0040117D *
00401178 . E9 A3000000 JMP DESPATCH.00401220
0040117D > 837D 0C 01 CMP DWORD PTR SS:[EBP+C],1 v
00401181 . 75 16 JNZ SHORT DESPATCH.00401199 *
00401183 . FF75 14 PUSH DWORD PTR SS:[EBP+14]
00401186 . FF75 10 PUSH DWORD PTR SS:[EBP+10]
00401189 . FF75 0C PUSH DWORD PTR SS:[EBP+C]
0040118C . FF75 08 PUSH DWORD PTR SS:[EBP+8]
0040118F . E8 B7000000 CALL DESPATCH.0040124B
00401194 . E9 87000000 JMP DESPATCH.00401220
00401199 > 817D 0C 110100>CMP DWORD PTR SS:[EBP+C],111 v
004011A0 . 75 13 JNZ SHORT DESPATCH.004011B5 *
004011A2 . FF75 14 PUSH DWORD PTR SS:[EBP+14] v
004011A5 . FF75 10 PUSH DWORD PTR SS:[EBP+10] v
004011A8 . FF75 0C PUSH DWORD PTR SS:[EBP+C] v
004011AB . FF75 08 PUSH DWORD PTR SS:[EBP+8] v
004011AE . E8 CA000000 CALL DESPATCH.0040127D
004011B3 . EB 6B JMP SHORT DESPATCH.00401220
004011B5 > 837D 0C 0F CMP DWORD PTR SS:[EBP+C],0F v
004011B9 . 75 1C JNZ SHORT DESPATCH.004011D7 *
004011BB . FF75 14 PUSH DWORD PTR SS:[EBP+14] v
004011BE . FF75 10 PUSH DWORD PTR SS:[EBP+10] v
004011C1 . FF75 0C PUSH DWORD PTR SS:[EBP+C] v
004011C4 . FF75 08 PUSH DWORD PTR SS:[EBP+8] v
004011C7 . E8 F8000000 CALL DESPATCH.004012C4 *
004011CC . B8 00000000 MOV EAX,0 v
004011D1 . C9 LEAVE v
004011D2 . C2 1000 RETN 10 v
004011D7 > 837D 0C 10 CMP DWORD PTR SS:[EBP+C],10 v
004011DB . 75 1B JNZ SHORT DESPATCH.004011F8 *
004011DD . FF75 14 PUSH DWORD PTR SS:[EBP+14] v
004011E0 . FF75 10 PUSH DWORD PTR SS:[EBP+10] v
004011E3 . FF75 0C PUSH DWORD PTR SS:[EBP+C] v
004011E6 . FF75 08 PUSH DWORD PTR SS:[EBP+8] v
004011E9 . E8 14010000 CALL DESPATCH.00401302 *
004011EE . 0BC0 OR EAX,EAX v
004011F0 . 75 2E JNZ SHORT DESPATCH.00401220 *
004011F2 . C9 LEAVE v
004011F3 . C2 1000 RETN 10 v
004011F8 > 837D 0C 02 CMP DWORD PTR SS:[EBP+C],2 v
004011FC . 75 1C JNZ SHORT DESPATCH.0040121A *
004011FE . FF75 14 PUSH DWORD PTR SS:[EBP+14] v
00401201 . FF75 10 PUSH DWORD PTR SS:[EBP+10] v
00401204 . FF75 0C PUSH DWORD PTR SS:[EBP+C] v
00401207 . FF75 08 PUSH DWORD PTR SS:[EBP+8] v
0040120A . E8 32010000 CALL DESPATCH.00401341 *
0040120F . B8 00000000 MOV EAX,0 v
00401214 . C9 LEAVE v
00401215 . C2 1000 RETN 10 v
00401218 . EB 06 JMP SHORT DESPATCH.00401220
0040121A > 837D 0C 05 CMP DWORD PTR SS:[EBP+C],5 v
0040121E . 75 00 JNZ SHORT DESPATCH.00401220 *
00401220 > FF75 14 PUSH DWORD PTR SS:[EBP+14] v
00401223 . FF75 10 PUSH DWORD PTR SS:[EBP+10] v
00401226 . FF75 0C PUSH DWORD PTR SS:[EBP+C] v
00401229 . FF75 08 PUSH DWORD PTR SS:[EBP+8] v
0040122C . E8 69010000 CALL <JMP.&USER32.DefWindowProcA> *-api
00401231 . C9 LEAVE v
00401232 . C2 1000 RETN 10 v
00401341 /$ 55 PUSH EBP v
00401342 |. 8BEC MOV EBP,ESP v
00401344 |. 6A 00 PUSH 0 v
00401346 |. E8 85000000 CALL <JMP.&USER32.PostQuitMessage>
0040134B |. C9 LEAVE v
0040134C \. C2 1000 RETN 10 v
00401302 /$ 55 PUSH EBP v
00401303 |. 8BEC MOV EBP,ESP v
00401305 |. EB 14 JMP SHORT DESPATCH.0040131B *
00401307 |. 50 6C 65 61 73>ASCII "Please Confirm E"
00401317 |. 78 69 74 00 ASCII "xit",0
0040131B |> 6A 04 PUSH 4 v
0040131D |. 68 00304000 PUSH DESPATCH.00403000 v
00401322 |. 68 07134000 PUSH DESPATCH.00401307 v
00401327 |. FF75 08 PUSH DWORD PTR SS:[EBP+8] v
0040132A |. E8 9B000000 CALL <JMP.&USER32.MessageBoxA> *-api
0040132F |. 83F8 07 CMP EAX,7 v
00401332 |. 75 09 JNZ SHORT DESPATCH.0040133D *
00401334 |. B8 00000000 MOV EAX,0 v
00401339 |. C9 LEAVE v
0040133A |. C2 1000 RETN 10 v
0040133D |> C9 LEAVE v
0040133E \. C2 1000 RETN 10 v


///////////////////////////////////////////////////////////////
Analysis
///////////////////////////////////////////////////////////////
walked Addr | ret adderss | removed | counter of removed
///////////////////////////////////////////////////////////////

;v - marked address as Code
;* - jmp/call

0040116C
0040116D
0040116F
00401176 - 00401176 + opcodeSize
0040117D
00401181 - 00401181 + opcodeSize
00401199
004011A0 - 004011A0 + opcodeSize
004011B5
004011B9 - 004011B9 + opcodeSize [x] 9
004011D7
004011DB - 004011DB + opcodeSize [x] 4
004011F8
004011FC - 004011FC + opcodeSize [x] 2
0040121A
0040121E - 0040121E + opcodeSize [x] 1
00401220
00401223
00401226
00401229
0040122C
00401231
00401232 - ret
if(ret)
{
if(.Size())
pop_back()
}
for(..)
{
if(.size())
if (addr==foundAddr)
{
pop_back();
break;
}
}
004011FE
00401201
00401204
00401207
0040120A - 0040120A + opcodeSize [x] 3
00401341
00401342
00401344
00401346
0040134B
0040134C - ret
if(ret)
{
if(.Size())
pop_back()
}
for(..)
...
}
0040120F
00401214
00401215 - ret
if(ret)
{
if(.Size())
pop_back()
}
for(..)
...
}
004011DD
004011E0
004011E3
004011E6
004011E9 - 004011E9 + opcodeSize [x] 6
00401302
00401303
00401305
0040131B
0040131D
00401322
00401327
0040132A
0040132F
00401332 - 00401332 + opcodeSize [x] 5
0040133D
0040133E
if(ret)
{
if(.Size())
pop_back()
}
for(..)
...
}
00401334
00401339
0040133A
if(ret)
{
if(.Size())
pop_back()
}
for(..)
...
}
004011EE
004011F0 - 004011F0 + opcodeSize [x] 7
004011F2
004011F3 - ret
if(ret)
{
if(.Size())
pop_back()
}
for(..)
...
}
004011BB
004011BE
004011C1
004011C4
004011C7 - 004011C7 + opcodeSize [x] 8
004011CC
004011D1
004011D2 - ret
if(ret)
{
if(.Size())
pop_back()
}
for(..)
...
}
004011A2
004011A5
004011A8
004011AB
004011AE - 004011AE + opcodeSize
....
....
....
....
Posted on 2003-11-27 02:15:50 by wizzra
Hi Wizzra,
I didn't fully understand, how your analysis works.

Anyway, in such case is not difficult to recognize text between code (listed according to importance):

1. There is no entry point at [00401307]
2. Last byte of these no-entry-point-bytes is zero (ASCIIZ string)

Using heuristic, there could be recognized the following points:

3. At [00401307] should be ASCIIZ string for MessageBox
4. All bypassed bytes are within ASCIIZ range
5. Bypassed bytes, interpretted as code, seems to be quite strange:


00401307 50 push eax ; 'P'
00401308 6C insb ; 'l' ; strange: I/O operation at user-level code
00401309 6561 popad gs: ; 'ea' ; strange: 1.with segment prefix 2.after push eax
0040130B 7365 jnb 00401372 ; 'se' ; strange: probably inside instruction
0040130D 20436F and [ebx+6F],al ; ' Co'
00401310 6E outsb ; 'n' ; strange: I/O operation at user-level code
00401311 6669726D2045 imul si,[edx+6D],4520 ; 'firm E' ; strange: 16bit operands
00401317 7869 js 00401382 ; 'xi' ; strange: probably inside instruction
00401319 7400 je 0040131B ; 't',0
Posted on 2003-12-02 08:18:32 by MazeGen
hi,

yes, heruistic is a good way indeed, and there could be lotsa ways to check for 'strange' code, this however makes the disasm engine too much slow.
i came the idea that even if u create an algorithm, it wont be perfect, so i also added the possibility for the user to create his own data segments in code beside an automated algo which will suggest data locations (user can delete/modify them).

the algo above, will scan code untill reaches a branch (jxx / call) and will push (stack based algo) the next instruction's address (same as the cpu does), once upon a ret we pop the address last in the stack and continue from there (LIFO algo), when tracing the braches, you mark all the addresses considers valid, the ones which are suppose to be data will not be marked.
thus an anti-disasm will kick it, also a heuristic should be added indeed.
there is no perfect way, thats why user define - control brings the best of it.

-Wizzta
Posted on 2003-12-02 11:10:54 by wizzra