Here's a small christmas present, it's a fast procedure to calculate the levenshtein distance of two strings. I'm sure it can be optimized further, I optimized it for Athlon processors.

What's the levenshtein distance?

The levenshtein distance (sometimes called edit distance) of two strings is the minimum number of character deletions, insertions or substitutions required to transform the first string into the second.
For example, the distance of 'asm' 'masm' is 1, since inserting 'm' is enough to make masm from asm. lev('mov', 'xor') is 2, since the middle character is already the same so only 2 characters need to be substituded.

more info..

How do you calculate it?
The algorithm is not hard but requires quite some computation (O(n*m)). Taken from the above link:

Step 1
Set n to be the length of s.
Set m to be the length of t.
If n = 0, return m and exit.
If m = 0, return n and exit.
Construct a matrix containing 0..m rows and 0..n columns.

Step 2
Initialize the first row to 0..n.
Initialize the first column to 0..m.

Step 3
Examine each character of s (i from 1 to n).

Step 4
Examine each character of t (j from 1 to m).

Step 5
If s equals t, the cost is 0.
If s doesn't equal t, the cost is 1.

Step 6
Set cell d of the matrix equal to the minimum of:
a. The cell immediately above plus 1: d + 1.
b. The cell immediately to the left plus 1: d + 1.
c. The cell diagonally above and to the left plus the cost: d + cost.

Step 7
After the iteration steps (3, 4, 5, 6) are complete, the distance is found in cell d.

The test suite
In the attachment you will find a C++ program containing 4 C++ implementations of the algorithm (written by me), and the assembly implementation I created. The program tests each procedure 10,000 times and shows the number of msecs it took. It does tests with various strings and string sizes.

The 5 versions
STL safe & simple
Uses a two dimensional (nested) std::vector from the Standard Template Library to store the matrix values. The memory use is (m+1)*(n+1) integers for the matrix. This version is the 'cleanest C++ way' to implement the algorithm. It's also the slowest :) This is probably mostly because of the memory allocation that is done by vector.

STL optimized version
Although the description of the algorithm says you should have a matrix, this isn't really necessary. Each iteration for one row is only dependent on the previous row. So instead of a matrix, a 'previous' and 'current' row can be used. This only uses (m+1) * 2 integers of memroy, which is far less than (m+1)*(n+1). This version is still relatively clean, it uses a vector but only two rows and offsets to the current & previous part (which are swapped before the next row)

2 dimensional array
The same as 'STL safe & simple', but using a local array on the stack. This saves the memory allocation but uses quite some stack space ((m+1)*(n+1) integers)).

2 row array
The same as 'STL optimized version', but again with a local array and pointers to the previous and current row. The stack usage is now okay, it's also the fastest C implementation most of the time.

Asm version
The assembly version is almost the same as '2 row array', but of course hand optimized and with it's inner loop unrolled once. It's probably a bit hard to read, I'll explain it in a later post. I optimized it for my own Atlhon processor.

Note
The string length is limited to a maximum of 255 characters for both strings, so a fixed stack array could be used. Longer strings would take too much time anyway.

The test
Here are the test results for an Athlon XP 2100+. Each function is executed 10,000 times.
[size=9][COLOR=darkblue]

-----------------------------------------------------
String1: 100 characters
String2: 100 characters
Distance: 0
-----------------------------------------------------
STL safe & simple . . . . . . . 2714 msecs
STL optimized verison . . . . . 1683 msecs
2 dimensional array . . . . . . 1852 msecs
2 row array . . . . . . . . . . 1773 msecs
Asm version (Thomas). . . . . . 651 msecs
-----------------------------------------------------

-----------------------------------------------------
String1: 6 characters
String2: 6 characters
Distance: 1
-----------------------------------------------------
STL safe & simple . . . . . . . 60 msecs
STL optimized verison . . . . . 10 msecs
2 dimensional array . . . . . . 20 msecs
2 row array . . . . . . . . . . 10 msecs
Asm version (Thomas). . . . . . 0 msecs
-----------------------------------------------------

-----------------------------------------------------
String1: 6 characters
String2: 7 characters
Distance: 7
-----------------------------------------------------
STL safe & simple . . . . . . . 70 msecs
STL optimized verison . . . . . 10 msecs
2 dimensional array . . . . . . 20 msecs
2 row array . . . . . . . . . . 0 msecs
Asm version (Thomas). . . . . . 10 msecs
-----------------------------------------------------

-----------------------------------------------------
String1: 52 characters
String2: 82 characters
Distance: 56
-----------------------------------------------------
STL safe & simple . . . . . . . 1312 msecs
STL optimized verison . . . . . 721 msecs
2 dimensional array . . . . . . 761 msecs
2 row array . . . . . . . . . . 721 msecs
Asm version (Thomas). . . . . . 281 msecs
-----------------------------------------------------

-----------------------------------------------------
String1: 255 characters
String2: 255 characters
Distance: 220
-----------------------------------------------------
STL safe & simple . . . . . . . 27129 msecs
STL optimized verison . . . . . 9443 msecs
2 dimensional array . . . . . . 9934 msecs
2 row array . . . . . . . . . . 9774 msecs
Asm version (Thomas). . . . . . 4146 msecs
-----------------------------------------------------

-----------------------------------------------------
String1: 255 characters
String2: 1 characters
Distance: 254
-----------------------------------------------------
STL safe & simple . . . . . . . 2224 msecs
STL optimized verison . . . . . 60 msecs
2 dimensional array . . . . . . 130 msecs
2 row array . . . . . . . . . . 70 msecs
Asm version (Thomas). . . . . . 30 msecs
-----------------------------------------------------

-----------------------------------------------------
String1: 255 characters
String2: 255 characters
Distance: 0
-----------------------------------------------------
STL safe & simple . . . . . . . 26448 msecs
STL optimized verison . . . . . 10035 msecs
2 dimensional array . . . . . . 12047 msecs
2 row array . . . . . . . . . . 10986 msecs
Asm version (Thomas). . . . . . 4096 msecs
-----------------------------------------------------
[/COLOR] [/SIZE]

Thomas
Posted on 2002-12-24 08:09:40 by Thomas
[size=12]

[COLOR=seagreen]All the parameters and local vars are accessed directly
using esp as index register (floating stack pointer), this frees ebp so it can
be used as a general purpose register.[/COLOR]

[COLOR=seagreen]'d' is the array that can hold the current and previous row of the matrix.
It's size is 2*256 integers (2*(max_m+1)). 64 bytes (64/4 dwords) are added as padding so the array can
be aligned to a 64 byte boundary later.
[b]note:[/b] Each row is stored in reversed order. That is, curRow[i] is stored in curRow[m - i].
This allowed some optimizations in the code but can be quite confusing as well[/COLOR]
d textequ <esp+stk_c+0000h> ; int d[2*256+64/4] = 840h bytes

[COLOR=seagreen]curRow is a pointer to the 'current row' part in d.[/COLOR]
curRow textequ <esp+stk_c+0840h> ; int *curRow = 4 bytes
tmp_ecx textequ <esp+stk_c+0844h>
tmp_esi textequ <esp+stk_c+0848h>

STACK_LOCAL_SIZE = 84Ch ; local vars size

param_s textequ <esp+stk_c+STACK_LOCAL_SIZE+04> [COLOR=seagreen]; string 1[/COLOR]
param_t textequ <esp+stk_c+STACK_LOCAL_SIZE+08>[COLOR=seagreen]; string 2[/COLOR]
param_n textequ <esp+stk_c+STACK_LOCAL_SIZE+12>[COLOR=seagreen]; length string 1[/COLOR]
param_m textequ <esp+stk_c+STACK_LOCAL_SIZE+16>[COLOR=seagreen]; length string 2[/COLOR]

stk_c = 0 ; stack correction

levAsm proc C
;sub esp, STACK_LOCAL_SIZE
;sub esp, 4 * 4

;push ebx, ebp, esi, edi
sub esp, STACK_LOCAL_SIZE + 4 * 4
mov [esp][00], edi
mov [esp][04], esi
mov [esp][08], ebp
mov [esp][12], ebx
stk_c = stk_c + 4 * 4

mov esi, [param_s] ; esi = s
mov edi, [param_t] ; edi = t

; align data array on 64 byte boundary
lea eax, [d]
add eax, 62
and eax, NOT 63

[COLOR=seagreen]Instead of a local var, ebx is used as 'prevRow' pointer,
pointing to the 'previous row' part in d[/COLOR]
mov ebx, eax ; ebx = prevRow
add eax, 256 * 4
mov [curRow], eax

[COLOR=seagreen]Check size boundaries[/COLOR]
; check n==0
mov eax, [param_n]
test eax, eax
jz _retM
cmp eax, 255
ja _invalid

; check m==0
mov eax, [param_m]
test eax, eax
jz _retN
cmp eax, 255
ja _invalid

[COLOR=seagreen]Initialize the previous row with 0 ... m[/COLOR]
; for (j=0;j<=m;j++)
; prevRow[j] = j;
mov eax, ebx
mov ecx, [param_m]
_fillPrev:
mov [eax], ecx
add eax, 4
dec ecx
jns _fillPrev

[COLOR=seagreen]This outer loop iterates through all the rows in the matrix.
ecx serves as 'i'.[/COLOR]
; for (i=1;i<=n;i++)
mov ecx, 1
align 16
_for_i:
[COLOR=seagreen]ebp is used to hold the previous result of the inner loop,
which is then used to calucalate the 'left' value of the next iteration.
[/COLOR]
mov eax, [curRow]
mov edx, [param_m]
mov ebp, ecx ; ebp = initial left - 1
mov [tmp_ecx], ecx
[COLOR=seagreen]Since not all rows are available at the same time, their first columns
cannot be initialized with 0...n. This is now done before the current row is used
(curRow[0] = i, i is the current row number).[/COLOR]
mov [eax + 4 * edx], ecx ; curRow[0] = i
mov edi, [param_t] ; edi = t

[COLOR=seagreen]The inner loop uses edx as j and is unrolled once.[/COLOR]
; for(j=1;j<=m;j++)
; {
movzx eax, byte ptr [esi]
mov [tmp_esi], esi

[COLOR=seagreen]If less than 2 iterations, the unrolled loop has no use[/COLOR]
cmp edx, 2
jb _last_byte
align 16
_for_j:
xor ecx, ecx
inc ebp ; add 1 to previous result to get left.
cmp al, [edi] ; s[i-1]==t[j-1] ?
mov esi, [ebx][4 * edx - 4] ; esi = prevRow[j]
cmovne ecx, num1 ; cost(ecx) = 0 or 1

inc esi ; esi = prevRow[j] + 1 = above
[COLOR=seagreen]esi = above value here[/COLOR]
add ecx, [ebx][4 * edx] ; diagonal(ecx) = prevRow[j-1] + cost(ecx)
[COLOR=seagreen]ecx = diagonal value here[/COLOR]

cmp ecx, esi ; diagonal > above
cmova ecx, esi ; ? min = above

[COLOR=seagreen]ecx = min(diagonal, above)[/COLOR]

cmp ecx, ebp ; min > left
mov esi, [curRow]
cmova ecx, ebp ; ? min = left
[COLOR=seagreen]ebp holds the 'left' value. ecx = min(ecx, left)[/COLOR]

[COLOR=seagreen]Store result:[/COLOR]
; curRow[j] = min;
mov [esi][4 * edx - 4], ecx

[COLOR=seagreen]In the second part of the unrolled loop,
the roles of ecx and ebp are switched (ebp holds the minimum, ecx
holds the result of the previous iteration). The rest of the loop is the same,
except for different memory offsets.[/COLOR]
; --- second part of unrolled loop ---
xor ebp, ebp
inc ecx ; add 1 to previous result to get left.
cmp al, [edi+1] ; s[i-1]==t[j-1] ?
mov esi, [ebx][4 * edx - 8] ; esi = prevRow[j]
cmovne ebp, num1 ; cost(ebp) = 0 or 1

inc esi ; esi = prevRow[j] + 1 = above
add ebp, [ebx][4 * edx - 4] ; diagonal(ebp) = prevRow[j-1] + cost(ebp)

cmp ebp, esi ; diagonal > above
cmova ebp, esi ; ? min = above

cmp ebp, ecx ; min > left
mov esi, [curRow]
cmova ebp, ecx ; ? min = left

; curRow[j] = min;
mov [esi][4 * edx - 8], ebp
; }
sub edx, 2
add edi, 2
[COLOR=seagreen]Store the result as the next left value - 1[/COLOR]
; mov ebp, ecx ; next left is current result
[COLOR=seagreen]edx > 2?[/COLOR]
test edx, NOT 1
jnz _for_j
align 16

[COLOR=seagreen]If m is not even, one iteration has yet to be done at the end of
the unrolled loop. Most of the array offsets can be hardcoded in this last iteration.[/COLOR]
_last_byte:
test edx, edx
jz _even_m
;---- last element ---
xor ecx, ecx
inc ebp ; add 1 to previous result to get left.
cmp al, [edi] ; s[i-1]==t[j-1] ?
mov esi, [ebx] ; esi = prevRow[j]
cmovne ecx, num1 ; cost(ecx) = 0 or 1

inc esi ; esi = prevRow[j] + 1 = above
; int diagonal = prevRow[j-1] + cost
add ecx, [ebx][4] ; diagonal(ecx) = prevRow[j-1] + cost(ecx)

cmp ecx, esi ; diagonal > above
cmova ecx, esi ; ? min = above

; int left = curRow[j-1] + 1;
cmp ecx, ebp ; min > left
mov esi, [curRow]
cmova ecx, ebp ; ? min = left

; curRow[j] = min;
mov [esi], ecx

;--- end last element
_even_m:
[COLOR=seagreen]Swap the current and previous rows..[/COLOR]
mov eax, [curRow]
mov esi, [tmp_esi]
mov [curRow], ebx
mov ecx, [tmp_ecx]
mov ebx, eax
; }
inc esi
inc ecx
cmp ecx, [param_n]
jbe _for_i

[COLOR=seagreen]The final result is in arr[n][m],
which is prevRow[m] in our case, = ebx[m-m] = ebx[0]. [/COLOR]
mov eax, [ebx] ; return prevRow[param_m]
jmp _return

_retN:
mov eax, [param_n]
jmp _return
_retM:
mov eax, [param_m]
jmp _return
_invalid:
mov eax, -1
jmp _return
_return:
; pop edi, esi, ebp, ebx
mov edi, [esp][00]
mov esi, [esp][04]
mov ebp, [esp][08]
mov ebx, [esp][12]
;add esp, 4 * 4
;add esp, STACK_LOCAL_SIZE
add esp, STACK_LOCAL_SIZE + 4 * 4
ret
levAsm endp[/size]

Thomas
Posted on 2002-12-24 08:37:57 by Thomas
hehheh... perhaps you should post the timings into the "crusades" :grin: that doesn't even factor in file sizes :tongue:

now if only i could figure out WHY you'd need to use this function... but it still seems cool :alright:

edit: oops... i just read "more info"
:stupid:
Posted on 2002-12-24 09:45:30 by jademtech
Quick test result (with lots of other apps running in teh BG):
The "Asm version (Thomas)" rules... :)

(I'll run it alone and add the result for my celeron 700 MHz in a minute...)

conclusion: assembly is the language for speed and size. :)

The results:
[size=9]-----------------------------------------------------

String1: 100 characters
String2: 100 characters
Distance: 0
-----------------------------------------------------
STL safe & simple . . . . . . . 5673 msecs
STL optimized verison . . . . . 2840 msecs
2 dimensional array . . . . . . 3260 msecs
2 row array . . . . . . . . . . 2970 msecs
Asm version (Thomas). . . . . . 1600 msecs
-----------------------------------------------------

-----------------------------------------------------
String1: 6 characters
String2: 6 characters
Distance: 1
-----------------------------------------------------
STL safe & simple . . . . . . . 110 msecs
STL optimized verison . . . . . 20 msecs
2 dimensional array . . . . . . 85 msecs
2 row array . . . . . . . . . . 21 msecs
Asm version (Thomas). . . . . . 4 msecs
-----------------------------------------------------

-----------------------------------------------------
String1: 6 characters
String2: 7 characters
Distance: 7
-----------------------------------------------------
STL safe & simple . . . . . . . 105 msecs
STL optimized verison . . . . . 20 msecs
2 dimensional array . . . . . . 80 msecs
2 row array . . . . . . . . . . 11 msecs
Asm version (Thomas). . . . . . 9 msecs
-----------------------------------------------------

-----------------------------------------------------
String1: 52 characters
String2: 82 characters
Distance: 56
-----------------------------------------------------
STL safe & simple . . . . . . . 2405 msecs
STL optimized verison . . . . . 1381 msecs
2 dimensional array . . . . . . 1319 msecs
2 row array . . . . . . . . . . 1410 msecs
Asm version (Thomas). . . . . . 690 msecs
-----------------------------------------------------

-----------------------------------------------------
String1: 255 characters
String2: 255 characters
Distance: 220
-----------------------------------------------------
STL safe & simple . . . . . . . 70545 msecs
STL optimized verison . . . . . 18000 msecs
2 dimensional array . . . . . . 24005 msecs
2 row array . . . . . . . . . . 18870 msecs
Asm version (Thomas). . . . . . 10120 msecs
-----------------------------------------------------

-----------------------------------------------------
String1: 255 characters
String2: 1 characters
Distance: 254
-----------------------------------------------------
STL safe & simple . . . . . . . 2775 msecs
STL optimized verison . . . . . 131 msecs
2 dimensional array . . . . . . 974 msecs
2 row array . . . . . . . . . . 125 msecs
Asm version (Thomas). . . . . . 80 msecs
-----------------------------------------------------

-----------------------------------------------------
String1: 255 characters
String2: 255 characters
Distance: 0
-----------------------------------------------------
STL safe & simple . . . . . . . 67979 msecs
STL optimized verison . . . . . 16500 msecs
2 dimensional array . . . . . . 20930 msecs
2 row array . . . . . . . . . . 17629 msecs
Asm version (Thomas). . . . . . 10120 msecs
-----------------------------------------------------
[/size]
Posted on 2002-12-24 11:15:07 by scientica
Wonderful Thomas - another routine for the string library...
Merry Christmas, and thank you.
Posted on 2002-12-24 11:26:51 by bitRAKE
Very nice Thomas,
Results on my 2.0GHz Pentium 4


-----------------------------------------------------
String1: 100 characters
String2: 100 characters
Distance: 0
-----------------------------------------------------
STL safe & simple . . . . . . . 2053 msecs
STL optimized verison . . . . . 961 msecs
2 dimensional array . . . . . . 1002 msecs
2 row array . . . . . . . . . . 1061 msecs
Asm version (Thomas). . . . . . 601 msecs
-----------------------------------------------------

-----------------------------------------------------
String1: 6 characters
String2: 6 characters
Distance: 1
-----------------------------------------------------
STL safe & simple . . . . . . . 70 msecs
STL optimized verison . . . . . 10 msecs
2 dimensional array . . . . . . 10 msecs
2 row array . . . . . . . . . . 10 msecs
Asm version (Thomas). . . . . . 0 msecs
-----------------------------------------------------

-----------------------------------------------------
String1: 6 characters
String2: 7 characters
Distance: 7
-----------------------------------------------------
STL safe & simple . . . . . . . 70 msecs
STL optimized verison . . . . . 10 msecs
2 dimensional array . . . . . . 10 msecs
2 row array . . . . . . . . . . 10 msecs
Asm version (Thomas). . . . . . 0 msecs
-----------------------------------------------------

-----------------------------------------------------
String1: 52 characters
String2: 82 characters
Distance: 56
-----------------------------------------------------
STL safe & simple . . . . . . . 1022 msecs
STL optimized verison . . . . . 471 msecs
2 dimensional array . . . . . . 500 msecs
2 row array . . . . . . . . . . 531 msecs
Asm version (Thomas). . . . . . 250 msecs
-----------------------------------------------------

-----------------------------------------------------
String1: 255 characters
String2: 255 characters
Distance: 220
-----------------------------------------------------
STL safe & simple . . . . . . . 17526 msecs
STL optimized verison . . . . . 6108 msecs
2 dimensional array . . . . . . 6289 msecs
2 row array . . . . . . . . . . 6730 msecs
Asm version (Thomas). . . . . . 3846 msecs
-----------------------------------------------------

-----------------------------------------------------
String1: 255 characters
String2: 1 characters
Distance: 254
-----------------------------------------------------
STL safe & simple . . . . . . . 2543 msecs
STL optimized verison . . . . . 40 msecs
2 dimensional array . . . . . . 70 msecs
2 row array . . . . . . . . . . 40 msecs
Asm version (Thomas). . . . . . 41 msecs
-----------------------------------------------------

-----------------------------------------------------
String1: 255 characters
String2: 255 characters
Distance: 0
-----------------------------------------------------
STL safe & simple . . . . . . . 16954 msecs
STL optimized verison . . . . . 5448 msecs
2 dimensional array . . . . . . 5598 msecs
2 row array . . . . . . . . . . 6259 msecs
Asm version (Thomas). . . . . . 3845 msecs
-----------------------------------------------------
Posted on 2002-12-24 13:32:12 by anon
Intel Compiler version 7.0 results (on 1.333Ghz Athlon TB):
-----------------------------------------------------

String1: 100 characters
String2: 100 characters
Distance: 0
-----------------------------------------------------
STL safe & simple . . . . . . . 3215 msecs
STL optimized verison . . . . . 1772 msecs
2 dimensional array . . . . . . 0 msecs
2 row array . . . . . . . . . . 1012 msecs
Asm version (Thomas). . . . . . 600 msecs
-----------------------------------------------------

-----------------------------------------------------
String1: 6 characters
String2: 6 characters
Distance: 1
-----------------------------------------------------
STL safe & simple . . . . . . . 61 msecs
STL optimized verison . . . . . 10 msecs
2 dimensional array . . . . . . 0 msecs
2 row array . . . . . . . . . . 10 msecs
Asm version (Thomas). . . . . . 0 msecs
-----------------------------------------------------

-----------------------------------------------------
String1: 6 characters
String2: 7 characters
Distance: 7
-----------------------------------------------------
STL safe & simple . . . . . . . 50 msecs
STL optimized verison . . . . . 20 msecs
2 dimensional array . . . . . . 0 msecs
2 row array . . . . . . . . . . 0 msecs
Asm version (Thomas). . . . . . 0 msecs
-----------------------------------------------------

-----------------------------------------------------
String1: 52 characters
String2: 82 characters
Distance: 56
-----------------------------------------------------
STL safe & simple . . . . . . . 1512 msecs
STL optimized verison . . . . . 831 msecs
2 dimensional array . . . . . . 0 msecs
2 row array . . . . . . . . . . 501 msecs
Asm version (Thomas). . . . . . 260 msecs
-----------------------------------------------------

-----------------------------------------------------
String1: 255 characters
String2: 255 characters
Distance: 220
-----------------------------------------------------
STL safe & simple . . . . . . . 31395 msecs
STL optimized verison . . . . . 11947 msecs
2 dimensional array . . . . . . 0 msecs
2 row array . . . . . . . . . . 7191 msecs
Asm version (Thomas). . . . . . 3865 msecs
-----------------------------------------------------

-----------------------------------------------------
String1: 255 characters
String2: 1 characters
Distance: 254
-----------------------------------------------------
STL safe & simple . . . . . . . 1733 msecs
STL optimized verison . . . . . 70 msecs
2 dimensional array . . . . . . 0 msecs
2 row array . . . . . . . . . . 40 msecs
Asm version (Thomas). . . . . . 30 msecs
-----------------------------------------------------

-----------------------------------------------------
String1: 255 characters
String2: 255 characters
Distance: 0
-----------------------------------------------------
STL safe & simple . . . . . . . 30053 msecs
STL optimized verison . . . . . 10986 msecs
2 dimensional array . . . . . . 0 msecs
2 row array . . . . . . . . . . 6059 msecs
Asm version (Thomas). . . . . . 3885 msecs
-----------------------------------------------------
I'm looking at the "2 dimensional array" generated code - the timing doesn't look right...

Attached EXE optimized for P2/P3 MMX:
Posted on 2002-12-24 16:40:30 by bitRAKE
This algorithm can be done with a single row array and the inner loop should have no branches and everything in registers. The Intel generated code is good, but no where near optimal. I'll post my version after some testing...
Posted on 2002-12-27 15:18:03 by bitRAKE
bitRAKE: that would be great, I'm looking forward to see your code.
I'll figure out why the 2 dim. array C version always runs in 0 msecs in your test, seems to be way too fast :).

all: Thanks for testing!

Thomas
Posted on 2002-12-27 17:56:54 by Thomas
Sorry it is taking so long - I haven't forgotten...:tongue:
Posted on 2003-01-02 16:51:36 by bitRAKE
It hangs up in thomas method in my pentium 200.
i guess those cmova.

so, i prefer 2 row array, its slower but more compatible.
Posted on 2003-01-03 01:06:45 by r00t

Sorry it is taking so long - I haven't forgotten...:tongue:


It appears you have ;)

..rocking the boat as usual.. :alright:

NaN
Posted on 2003-08-06 16:13:27 by NaN
I was going to wait until the one year aniversary. :grin:

It really can be done as I've said above - I've worked it out on paper - just haven't gotten around to converting it to code.
Posted on 2003-08-07 16:41:45 by bitRAKE
The paper to look at is optimal alignments in linear space. Particularly figure 1B. If you don't want to recreate the path then that's all you need. If you do I'd recommend NOT using the linear space version & using an NxM backtrace matrix unless the sequences are huge because the matrix version is about twice as fast.


The figure lays out in pseudo-code a 1 x N array method. Getting the north, northeast, east cells all dancing right is a pain but worth the effort if speed is the issue.
Posted on 2003-08-07 17:24:08 by twisteddweeb
Figure I'd post my results:



G:\Release>levenshtein
-- ---------------------------------------------------
String1: 100 characters
String2: 100 characters
Distance: 0
-----------------------------------------------------
STL safe & simple . . . . . . . 1657 msecs
STL optimized verison . . . . . 765 msecs
2 dimensional array . . . . . . 813 msecs
2 row array . . . . . . . . . . 782 msecs
Asm version (Thomas). . . . . . 500 msecs
-----------------------------------------------------

-----------------------------------------------------
String1: 6 characters
String2: 6 characters
Distance: 1
-----------------------------------------------------
STL safe & simple . . . . . . . 62 msecs
STL optimized verison . . . . . 16 msecs
2 dimensional array . . . . . . 0 msecs
2 row array . . . . . . . . . . 15 msecs
Asm version (Thomas). . . . . . 0 msecs
-----------------------------------------------------

-----------------------------------------------------
String1: 6 characters
String2: 7 characters
Distance: 7
-----------------------------------------------------
STL safe & simple . . . . . . . 63 msecs
STL optimized verison . . . . . 0 msecs
2 dimensional array . . . . . . 16 msecs
2 row array . . . . . . . . . . 0 msecs
Asm version (Thomas). . . . . . 0 msecs
-----------------------------------------------------

-----------------------------------------------------
String1: 52 characters
String2: 82 characters
Distance: 56
-----------------------------------------------------
STL safe & simple . . . . . . . 828 msecs
STL optimized verison . . . . . 391 msecs
2 dimensional array . . . . . . 406 msecs
2 row array . . . . . . . . . . 391 msecs
Asm version (Thomas). . . . . . 203 msecs
-----------------------------------------------------

-----------------------------------------------------
String1: 255 characters
String2: 255 characters
Distance: 220
-----------------------------------------------------
STL safe & simple . . . . . . . 13656 msecs
STL optimized verison . . . . . 4906 msecs
2 dimensional array . . . . . . 5031 msecs
2 row array . . . . . . . . . . 5250 msecs
Asm version (Thomas). . . . . . 3172 msecs
-----------------------------------------------------

-----------------------------------------------------
String1: 255 characters
String2: 1 characters
Distance: 254
-----------------------------------------------------
STL safe & simple . . . . . . . 1875 msecs
STL optimized verison . . . . . 31 msecs
2 dimensional array . . . . . . 63 msecs
2 row array . . . . . . . . . . 31 msecs
Asm version (Thomas). . . . . . 31 msecs
-----------------------------------------------------

-----------------------------------------------------
String1: 255 characters
String2: 255 characters
Distance: 0
-----------------------------------------------------
STL safe & simple . . . . . . . 13188 msecs
STL optimized verison . . . . . 4391 msecs
2 dimensional array . . . . . . 4500 msecs
2 row array . . . . . . . . . . 4531 msecs
Asm version (Thomas). . . . . . 3172 msecs
-----------------------------------------------------


just noticed bitrakes optimized version:


-----------------------------------------------------
String1: 100 characters
String2: 100 characters
Distance: 0
-----------------------------------------------------
STL safe & simple . . . . . . . 1657 msecs
STL optimized verison . . . . . 765 msecs
2 dimensional array . . . . . . 813 msecs
2 row array . . . . . . . . . . 782 msecs
Asm version (Thomas). . . . . . 500 msecs
-----------------------------------------------------

-----------------------------------------------------
String1: 6 characters
String2: 6 characters
Distance: 1
-----------------------------------------------------
STL safe & simple . . . . . . . 62 msecs
STL optimized verison . . . . . 16 msecs
2 dimensional array . . . . . . 0 msecs
2 row array . . . . . . . . . . 15 msecs
Asm version (Thomas). . . . . . 0 msecs
-----------------------------------------------------

-----------------------------------------------------
String1: 6 characters
String2: 7 characters
Distance: 7
-----------------------------------------------------
STL safe & simple . . . . . . . 63 msecs
STL optimized verison . . . . . 0 msecs
2 dimensional array . . . . . . 16 msecs
2 row array . . . . . . . . . . 0 msecs
Asm version (Thomas). . . . . . 0 msecs
-----------------------------------------------------

-----------------------------------------------------
String1: 52 characters
String2: 82 characters
Distance: 56
-----------------------------------------------------
STL safe & simple . . . . . . . 828 msecs
STL optimized verison . . . . . 391 msecs
2 dimensional array . . . . . . 406 msecs
2 row array . . . . . . . . . . 391 msecs
Asm version (Thomas). . . . . . 203 msecs
-----------------------------------------------------

-----------------------------------------------------
String1: 255 characters
String2: 255 characters
Distance: 220
-----------------------------------------------------
STL safe & simple . . . . . . . 13656 msecs
STL optimized verison . . . . . 4906 msecs
2 dimensional array . . . . . . 5031 msecs
2 row array . . . . . . . . . . 5250 msecs
Asm version (Thomas). . . . . . 3172 msecs
-----------------------------------------------------

-----------------------------------------------------
String1: 255 characters
String2: 1 characters
Distance: 254
-----------------------------------------------------
STL safe & simple . . . . . . . 1875 msecs
STL optimized verison . . . . . 31 msecs
2 dimensional array . . . . . . 63 msecs
2 row array . . . . . . . . . . 31 msecs
Asm version (Thomas). . . . . . 31 msecs
-----------------------------------------------------

-----------------------------------------------------
String1: 255 characters
String2: 255 characters
Distance: 0
-----------------------------------------------------
STL safe & simple . . . . . . . 13188 msecs
STL optimized verison . . . . . 4391 msecs
2 dimensional array . . . . . . 4500 msecs
2 row array . . . . . . . . . . 4531 msecs
Asm version (Thomas). . . . . . 3172 msecs
-----------------------------------------------------


G:\Release>levenshtein
-----------------------------------------------------
String1: 100 characters
String2: 100 characters
Distance: 0
-----------------------------------------------------
STL safe & simple . . . . . . . 1844 msecs
STL optimized verison . . . . . 844 msecs
2 dimensional array . . . . . . 0 msecs
2 row array . . . . . . . . . . 609 msecs
Asm version (Thomas). . . . . . 735 msecs
-----------------------------------------------------

-----------------------------------------------------
String1: 6 characters
String2: 6 characters
Distance: 1
-----------------------------------------------------
STL safe & simple . . . . . . . 47 msecs
STL optimized verison . . . . . 0 msecs
2 dimensional array . . . . . . 0 msecs
2 row array . . . . . . . . . . 15 msecs
Asm version (Thomas). . . . . . 0 msecs
-----------------------------------------------------

-----------------------------------------------------
String1: 6 characters
String2: 7 characters
Distance: 7
-----------------------------------------------------
STL safe & simple . . . . . . . 47 msecs
STL optimized verison . . . . . 16 msecs
2 dimensional array . . . . . . 0 msecs
2 row array . . . . . . . . . . 0 msecs
Asm version (Thomas). . . . . . 0 msecs
-----------------------------------------------------

-----------------------------------------------------
String1: 52 characters
String2: 82 characters
Distance: 56
-----------------------------------------------------
STL safe & simple . . . . . . . 922 msecs
STL optimized verison . . . . . 437 msecs
2 dimensional array . . . . . . 0 msecs
2 row array . . . . . . . . . . 328 msecs
Asm version (Thomas). . . . . . 313 msecs
-----------------------------------------------------

-----------------------------------------------------
String1: 255 characters
String2: 255 characters
Distance: 220
-----------------------------------------------------
STL safe & simple . . . . . . . 15172 msecs
STL optimized verison . . . . . 5859 msecs
2 dimensional array . . . . . . 0 msecs
2 row array . . . . . . . . . . 4531 msecs
Asm version (Thomas). . . . . . 3656 msecs
-----------------------------------------------------

-----------------------------------------------------
String1: 255 characters
String2: 1 characters
Distance: 254
-----------------------------------------------------
STL safe & simple . . . . . . . 1484 msecs
STL optimized verison . . . . . 31 msecs
2 dimensional array . . . . . . 0 msecs
2 row array . . . . . . . . . . 16 msecs
Asm version (Thomas). . . . . . 31 msecs
-----------------------------------------------------

-----------------------------------------------------
String1: 255 characters
String2: 255 characters
Distance: 0
-----------------------------------------------------
STL safe & simple . . . . . . . 14469 msecs
STL optimized verison . . . . . 4859 msecs
2 dimensional array . . . . . . 0 msecs
2 row array . . . . . . . . . . 3407 msecs
Asm version (Thomas). . . . . . 3671 msecs
-----------------------------------------------------
Posted on 2003-08-07 18:50:18 by Gunner