Wednesday, September 3, 2008

Levenshtein distance

A Levenshtein distance, or "edit distance," is a measure of the similarity of two given strings as embodied in the number of changes--insertions, deletions, or substitutions--required to transform one string into the other.

This is an incredibly useful tool for grouping and ordering strings, and in particular, non-language strings -- we already have a convention for ordering words in English, but if you're staring at 200 protein sequences, alphabetic order doesn't do anything for you.

There's a fast C implementation of the Levenshtein algorithm and though it doesn't seem to have a proper project homepage, it can be found in the Pootle & Translate Toolkit.

I needed to add a wildcard parameter to the Levenshtein algorithm so that the distance between, eg, ABC and AXC is 0, given that 'X' is the specified wildcard character. (Normally, the Levenshtein distance between "ABC" and "AXC" is 1, of course: it takes one substitution to turn one string into the other.)

So I did.

Before I attempted to modify this C version, though, I put together a simple, unoptimized Python version, and the difference in performance was shocking. On my test data set, the C code took 0.3s to run, and the Python version took ... over 3 minutes. It was over 1000x slower.

After adding the couple of if statements and loops necessary for the wildcard code, the C code slowed to 1.2s.

No comments: