I am making an app that will search for and extract all Unicode strings in some EXEs. The problem is: how should I recognize the characters of a Unicode string? I know the chars are a word long and the string ends with a double-byte NULL, which leaves 2^16 - 1 combinations for all characters. But how do I know which word is a character and which one is just some dummy data? Let's say I'm not interested in exotic scripts such as kanji; I only need English.
If you don't care about speed, the solution should be sorta simple... select a minimum string length (perhaps 5-6 chars; you could make it user-configurable), then repeat the next section until you hit filelength - minimumpatternlength.
First, skip until you hit a char that's in your alphabet (a-z, A-Z, 0-9, and a bunch of other chars). Then check whether the next byte is 0; if not, continue skipping until you hit a char that's in your alphabet. If it is a zero, continue the alphabet+zero-byte checking until you satisfy the minimum pattern length. (A sketch of that loop follows the table-building code below.)
It's a very trivial algorithm, but it works pretty well in practice, and unless you're scanning huge files (or lots of them), it should be fast enough :)
Here's some simpleton code to get you started:
.data
alphabet DWORD (256/32) dup (0) ; boolean table for alphabet - we need 256 bits
goodalpha db "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789", 0
;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
; CODE section
;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.code
ENTRY32:
; build bool-array from alphabet
mov esi, offset goodalpha
@@buildloop:
movzx eax, byte ptr [esi] ; get ASCII value from alphabet (zero-extend so all of EAX is a valid bit index)
test al, al ; NUL terminator?
jz @@donebuild ; if so, exit the loop
inc esi ; point to next char
bts [alphabet], eax ; set bit/bool value to TRUE
jmp @@buildloop
@@donebuild:
; do a couple of tests
mov eax, 'a'
bt [alphabet], eax ; CF=1: 'a' is in the alphabet
mov eax, '@'
bt [alphabet], eax ; CF=0: '@' is not
mov eax, '5'
bt [alphabet], eax ; CF=1: '5' is in the alphabet
You could extend the table to 2^16 entries (8192 bytes) to cover the full Unicode alphabet; the fetch would then be a "movzx eax, word ptr ..." to get a Unicode char from your buffer, etc...
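To connect the pieces, here's a rough sketch of the scan loop itself using the alphabet table above. It's only a sketch of the approach described earlier, not tested code: the register assignments (esi pointing at the file buffer, ecx holding the remaining byte count) and the MINLEN constant are assumptions of this example, not part of the code above.
MINLEN equ 5                      ; minimum run length - make this user-configurable
; assumptions: esi -> file buffer, ecx = bytes remaining
        xor ebx, ebx              ; ebx = length of the current run, in chars
@@scanloop:
        cmp ecx, 2                ; need a full WORD left to read
        jb @@donescan
        movzx eax, byte ptr [esi] ; low byte of the candidate UTF-16 unit
        bt [alphabet], eax        ; CF=1 if it's in our alphabet
        jnc @@breakrun
        cmp byte ptr [esi+1], 0   ; high byte must be zero for "English" chars
        jnz @@breakrun
        inc ebx                   ; extend the current run
        jmp @@nextword
@@breakrun:
        cmp ebx, MINLEN           ; did the run that just ended qualify?
        jb @@resetrun
        ; candidate string: ebx chars starting at [esi - ebx*2]
        ; report/store it however your app needs
@@resetrun:
        xor ebx, ebx
@@nextword:
        add esi, 2                ; note: stepping by 2 assumes the strings are
        sub ecx, 2                ;  word-aligned; rescan from offset 1 to catch
        jmp @@scanloop            ;  odd-aligned ones
@@donescan:
        cmp ebx, MINLEN           ; don't miss a run ending at the buffer's end
        ; (same report/store as above if it qualifies)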
If you are searching the string table resource, those strings are always Unicode, never ANSI. If you are just scanning the file, you may have a few problems with normal data giving false positives. There is an API function that you can pass the string to that I have found fairly reliable - IsTextUnicode:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_81np.asp
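For completeness, here's a rough sketch of calling it from MASM. The MASM32-style include paths and the sample string are assumptions of this example (adjust for your setup); the function itself is exported by advapi32.dll. A nonzero return means the buffer statistically looks like Unicode text, and passing NULL as lpiResult makes it apply all of its tests.
        .386
        .model flat, stdcall
        option casemap:none
        include \masm32\include\windows.inc
        include \masm32\include\kernel32.inc
        include \masm32\include\advapi32.inc    ; prototype for IsTextUnicode (assumed MASM32 layout)
        includelib \masm32\lib\kernel32.lib
        includelib \masm32\lib\advapi32.lib

.data
sample  dw 'H','e','l','l','o',0                ; a known-good Unicode test string

.code
start:
        ; iSize is in bytes and excludes the terminating NUL word
        invoke IsTextUnicode, addr sample, sizeof sample - 2, NULL
        ; eax != 0 -> the buffer passed the statistical Unicode tests
        invoke ExitProcess, eax
end start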