I am making an app that will search for and extract all Unicode strings in some EXEs. The problem is how to recognize the characters of a Unicode string. I know the chars are word-sized and the string ends with a double-byte NULL, which leaves 2^16 - 1 combinations for all characters. But how do I know which words are characters and which are just dummy data? Let's say I am not interested in exotic scripts such as kanji; I only need English.
Posted on 2004-12-13 18:55:42 by Mikky
If you don't care about speed, the solution should be sorta simple... select a minimum string length (perhaps 5-6 chars, could make it user configurable), then repeat the next section until you hit filelength-minimumpatternlength.

First, skip until you hit a char that's in your alphabet (a-z, A-Z, 0-9, and a bunch of other chars). Then check whether the next byte is 0; if it isn't, continue skipping until you hit a char that's in your alphabet. If it is zero, continue alphabet+zero-byte checking until you satisfy the minimum pattern length.

It's a very trivial algorithm, but it works pretty well in practice, and unless you're scanning huge or lots of files, should be fast enough :)
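The scan described above can be sketched in C roughly as follows. This is a minimal illustration assuming UTF-16LE data and an ASCII-only alphabet; the function names (is_alpha, find_unicode_string) are mine, not from the post:

```c
#include <stddef.h>

/* Is this byte in our string alphabet? (letters and digits only here) */
static int is_alpha(unsigned char c)
{
    return (c >= 'a' && c <= 'z') ||
           (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9');
}

/* Scan buf[0..len-1] for a run of at least minlen UTF-16LE code units
   whose low byte is in the alphabet and whose high byte is zero.
   Returns the byte offset of the first such run, or -1 if none. */
static long find_unicode_string(const unsigned char *buf, size_t len,
                                size_t minlen)
{
    size_t i = 0;
    while (i + 2 * minlen <= len) {
        size_t run = 0;
        while (i + 2 * run + 1 < len &&
               is_alpha(buf[i + 2 * run]) && buf[i + 2 * run + 1] == 0)
            run++;
        if (run >= minlen)
            return (long)i;
        i += (run == 0) ? 1 : 2 * run; /* advance past the failed run */
    }
    return -1;
}
```

Advancing by a single byte when no pair matched keeps the scan from missing strings that start at odd file offsets.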
Posted on 2004-12-13 19:24:08 by f0dder
Here's some simpleton code to get you started:

alphabet DWORD (256/32) dup (0) ; boolean table for alphabet - we need 256 bits
goodalpha db "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789", 0

; CODE section

; build bool-array from alphabet
mov esi, offset goodalpha
@@buildloop:
movzx eax, byte ptr [esi] ; get ASCII value from alphabet (zero-extend into eax)
test eax, eax ; NUL terminator?
jz @@donebuild ; if so, exit the loop
inc esi ; point to next char
bts [alphabet], eax ; set bit/bool value to TRUE
jmp @@buildloop
@@donebuild:

; do a couple of tests - BT copies the tested bit into CF
mov eax, 'a'
bt [alphabet], eax ; CF=1, 'a' is in the alphabet

mov eax, '@'
bt [alphabet], eax ; CF=0, '@' is not

mov eax, '5'
bt [alphabet], eax ; CF=1, '5' is

You could extend the table to 2^16 entries (8192 bytes) to cover the full Unicode alphabet; the test would then do a "movzx eax, word ptr " to get a Unicode char from your buffer, etc...
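In C terms, the widened 2^16-bit table might look something like this. A sketch only; the names (set_char, test_char, build_alphabet) are mine, not from the code above:

```c
#include <stdint.h>
#include <string.h>

/* One bit per UTF-16 code unit: 65536 bits = 8192 bytes. */
static uint32_t alphabet[65536 / 32];

static void set_char(uint16_t c)  { alphabet[c >> 5] |= 1u << (c & 31); }
static int  test_char(uint16_t c) { return (alphabet[c >> 5] >> (c & 31)) & 1; }

/* Mark every char of the "good" alphabet as TRUE in the bit table. */
static void build_alphabet(void)
{
    const char *good =
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "0123456789";
    memset(alphabet, 0, sizeof alphabet);
    for (const char *p = good; *p; p++)
        set_char((uint16_t)(unsigned char)*p);
}
```

The lookup then takes a full 16-bit code unit read from the buffer (the C equivalent of the movzx-from-word-ptr the post mentions), so non-ASCII code units simply test as FALSE.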
Posted on 2004-12-13 19:45:54 by f0dder
If you are searching the string table, those strings are always Unicode, never ANSI. If you are just scanning the whole file, you may have a few problems with normal data giving false positives. There is an API function you can pass the string to that I have found fairly reliable - IsTextUnicode
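For reference, the API is declared as BOOL IsTextUnicode(const VOID *lpv, int iSize, LPINT lpiResult); passing NULL for lpiResult runs all of its tests. As a rough, portable illustration of the kind of statistical check it applies - my own approximation, not the actual API logic - you could count zero high bytes:

```c
#include <stddef.h>

/* Crude stand-in for a statistical UTF-16LE test: in English text
   encoded as UTF-16LE, nearly every odd-offset byte is zero.
   Returns 1 if at least half the high bytes are zero, else 0. */
static int looks_like_utf16le(const unsigned char *buf, size_t len)
{
    size_t pairs = len / 2, zeros = 0, i;
    if (pairs == 0)
        return 0;
    for (i = 0; i < pairs; i++)
        if (buf[2 * i + 1] == 0)
            zeros++;
    return zeros * 2 >= pairs;
}
```

The real IsTextUnicode combines several such heuristics (byte statistics, BOM detection, null-byte distribution), so it is more robust than this single ratio test - but also famously not infallible on short buffers.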

Posted on 2004-12-13 20:51:27 by donkey