Stanford NLP: Minimum Edit Distance


Minimum edit distance allows us to assess how similar two strings are. For example, the word ‘graffe’ has few words that are similar to it such as ‘graf’, ‘graft’, ‘grail’, ‘giraffe’ but which one is the closest. This is useful for computational biology or IR. The minimum edit distance is the minimum number of editing operations such as insertion, deletion and substitution that are needed to transform one into the other.

As shown above, in order to turn ‘intention’ to ‘execution’ we need to do few edition. If each operation has a cost of 1, the distance between the word ‘intention’ and ‘execution’ is 5. The Levenshtein distance is when substitution operation has a cost of 2 which means that so the total distance is now 8. We can use this has an evaluation of how well machine translation/speech recognition works by calculating the distance (cost) of what the machine translation does vs the actual text.

How to find the minimum edit distance?

Intuitively, we would search for a path (sequence of edits) from the start string to the final string. We will start with the initial state (the word we’re transforming), follow by edit operations (insert, delete and substitute). We will then have a goal state (the word we’re trying to get to), by which then we can calculate the path cost (number of edits – what we want to minimise). However, the space of all edit sequences is huge and we can’t afford to navigate naively. There are lots of distinct paths that wind up at the same state and we don’t have to keep track of all of them. We just need to keep track of the shortest path to each of those revised path. For example:

Two strings: X of length n and Y of length m. We will define a distance matrix D(i, j) which is the edit distance between X and Y i.e. the first i characters (n) of X and the first j characters (m) of Y. Therefore the edit distance between X and Y is D(n, m).

Computing Minimum Edit Distance

We can use dynamic programming which combines solutions to subproblems to compute D(n, m). This is a bottom-up approach where we compute D(i, j) for small i, j values and then compute larger D(i, j) based on previously computed smaller values. Below is an example (Levenshtein):

Backtrace for Computing Alignments

We often need to align each character of the two strings to each other by keeping a backtrace. Every time we enter a cell, we need to remember where we came from and when we reach the end, trace back the path from the upper right corner to read off the alignment. The arrows in the graph below (starting from upper right corner) allows you to trace back the path. For example, to go from “inte” to “e”, we could remove all 4 characters and insert a letter “e” but that would cost us a distance of 5. Alternatively, we could just remove “int” and get the remainder character “e”, costing us a distance of 3.

The performance of this algorithm is O(nm) in time as our distance matrix is size n x m. In space is O(nm) and the backtrace is O(n+m) as in the worst case is we have to go n deletion and m insertion.

Weighted Minimum Edit Distance

The reason we add weights to the computation is because in the case of spell correction, some letters are more likely to be mistyped than others. In Biology, certain kinds of deletions or insertions are more likely than others. For example, in the confusion matrix for spelling errors below, the letter “e” is very likely to be confused with “a” or the letter “o” to be confused with “e”.

To implement this, we add a special cost function for deletion and insertion. This is in contrast to Levenshtein where we add 1 for every deletion, 1 for every insertion and 2 for every substitution. For the recurrence relation, we also have the special cost function to measure the actual cost of all three operations. See below:

Leave a Reply

Your email address will not be published. Required fields are marked *