Program to Calculate the Levenshtein Distance in List in Python

Learn via video course
FREE
View all courses
Python Course for Beginners With Certification: Mastering the Essentials
Python Course for Beginners With Certification: Mastering the Essentials
by Rahul Janghu
1000
4.90
Start Learning
Python Course for Beginners With Certification: Mastering the Essentials
Python Course for Beginners With Certification: Mastering the Essentials
by Rahul Janghu
1000
4.90
Start Learning
Topics Covered

Overview

Levenshtein Distance, also known as Edit Distance, is a measure of similarity between two strings of characters. Given two words, the Levenshtein Distance is defined as the number of characters needed to be inserted, deleted, or replaced in one string, to transform it into another string. The algorithm returns a numeric value representing the similarity between the two strings. Less is the Levenshtein distance, more is the similarity.

What is Levenshtein Distance?

Have you ever wondered how word processors like MS Word and Google Docs can identify misspelled words and provide suggestions? These word processors use multiple string algorithms to match each word written by the user against the words in their dictionary. Now, identifying misspelled words can be done using a standard String Matching algorithm. However, to provide suggestions, the processors need to find a list of such words that are close similar to the misspelled word. In simpler words, the processor needs an algorithm that computes the similarity between two words. This can be calculated by finding the Levenshtein Distance between the two strings.

Levenshtein Distance (a.k.a Edit Distance) calculates the distance between two words and returns a number representing how similar the words are. The number represents the total number of edits required to transform one word into another. There are three ways by which the editing can be done:

  • Insertion A new character can be inserted anywhere in the given string. Example: A: "helo" B: "hello" We can insert the character 'l' at position 4 in string A to get string B.
  • Deletion A character can be deleted from anywhere in the given string. Example A: "helloo" B: "hello" We can delete a character at position 6 in string A to get string B.
  • Replacement A character can be replaced with the other character. Example A: "hella" B: "hello" We replace the character 'a' with 'o' at position 5 in string A to get the string B.

Each of these operations adds 1 to the edit distance.

Understanding the Levenshtein Algorithm

Problem Statement

Consider two input strings A and B of length n and m respectively, containing lowercase English letters only. The problem is to find the Levenshtein distance between A and B using the following edits:

  • Insert a character in A
  • Delete a character from A
  • Replace a character in A with any other character.

The cost of each operation is 1.

Algorithm (Using sub-subproblems)

Levenshtein distance between two strings can be found using dynamic programming. The algorithm exhibits the property of overlapping subproblems.

Consider the strings A and B.

  • Case 1: A[1]=B[1]A[1] = B[1]. In this case, we can skip the first position and the edit distance of the whole problem is equal to the edit distance of A[2:n]A[2:n] and B[2:m]B[2:m]. Here S[i:j]S[i:j] means substring of SS from position ii to position jj.
  • Case 2: A[1]B[1]A[1] \neq B[1]. In this case, we will use the three operations mentioned above to find the edit distance.
    • Insert B[1]B[1] in AA: The edit distance is equal to 1 + the edit distance of A[1:n]A[1:n] and B[2:m]B[2:m]. This is because the first character in BB matches the newly added first character in AA.
    • Delete A[1]A[1]: The edit distance is equal to 1 + the edit distance of A[2:n]A[2:n] and B[1:m]B[1:m]. This is because we have simply deleted the first character in AA. The characters left in both strings will contribute to the edit distance of the whole problem.
    • Replace A[1]A[1] with B[1]B[1]: The edit distance will be equal to 1 + the edit distance of A[2:n]A[2:n] and B[2:m]B[2:m]. The edit distance will be the minimum cost of inserting, deleting, or replacing the characters in AA to get the string BB.

The general idea of the algorithm is to transform the prefix of string A[1:i]A[1:i] equal to the prefix of string B[1:j]B[1:j] using the minimum number of insert, delete and replace operations. At every mismatch, we need to apply the three types of edit and choose the operation that minimizes the overall edit distance of the problem.

Example

Let us consider an example where AA = "gate" and BB = "goat". To calculate the Levenshtein distance, we will use the above approach.

  • AA = "gate" and BB = "goat". Since A[1]=B[1]A[1] = B[1], we simply move to the next character.
  • Since A[2]B[2]A[2] \neq B[2], we have three choice. If we perform:
    • Insert: We get AA = "goate" and BB = "goat".
    • Delete: We get AA = "gte" and BB = "goat".
    • Replace: We get AA = "gote" and BB = "goat".

Now if we look at the choice where we insert an element, we see that deleting the last character from AA = "goate" will transform it into the string B. Hence in this case the edit distance will be 22. In the rest cases, we see that the number of operations will be more, so we take the minimum of the three cases. Edit Distance of AA = "gate" and BB = "goat" is 22.

Understand the Array

Now to efficiently compute the result, we need to add memoization in the above algorithm. To do this, we need some states to define a subproblem uniquely. To identify a subproblem, we only need to know the length of the prefix of string AA and string BB. Hence we need two variables ii and jj, to define our dynamic programming states. Let us define DP[i][j]DP[i][j] = Levenshtein distance of string A[1:i]A[1:i] and string B[1:j]B[1:j].

Base Case: In the base case, we can consider that the length of one of the strings is 00. In this case, we will either use only insert operations, if the length of AA is 00, or else we will use only delete operations if the length of string BB is 00. DP[0][j]=jDP[0][j] = j for all jj from 11 to mm. DP[i][0]=iDP[i][0] = i for all ii from 11 to nn.

base-case-of-levenshtein-algorithm

Transitions

Suppose we want to calculate the edit distance of A[0:i]A[0:i] and B[0:j]B[0:j], then

  • Case 1: A[i]==B[j]DP[i][j]=DP[i1][j1]A[i] == B[j] \Rightarrow DP[i][j] = DP[i-1][j-1]
  • Case 2: A[i]B[j]A[i] \neq B[j]. In this case, we need to perform an edit operation.
    • If we perform insert operation then we will take i1i-1 characters from AA and jj characters from BB, so DP[i][j]=1+DP[i1][j]DP[i][j] = 1 + DP[i-1][j].
    • If we perform delete operation then we have not matched the jthj^{th} character in B yet, so we simply carry forward the edit distance calculated till ithi^{th} character of AA and (j1)th(j-1)^{th} character of BB, so DP[i][j]=1+DP[i][j1]DP[i][j] = 1 + DP[i][j-1].
    • If we do a replace, then DP[i][j]=1+DP[i1][j1]DP[i][j] = 1 + DP[i-1][j-1], as we have replaced ithi^{th} character of AA with jthj^{th} character of BB.

transitions-of-levenshtein-algorithm

Finally, we write DP[i][j]=1+min{DP[i1][j],DP[i][j1],DP[i1][j1]}DP[i][j] = 1 + min\{DP[i-1][j], DP[i][j-1], DP[i-1][j-1]\}

Final Output After we calculate all the transitions, the Levenshtein Distance will be equal to DP[N][M]DP[N][M].

Example

Let us look at the DPDP matrix for AA = "gate" and BB = "goat". Here n=m=4n = m = 4. Consider i=0i = 0 and j=0j = 0 be the iterators. When i=1i = 1 and j=1..mj = 1 .. m we calculate the edit distance for A[1]A[1] with all the prefixes of BB. Since A[1]==B[1]A[1] == B[1], DP[1][1]=DP[i1][j1]=0DP[1][1] = DP[i-1][j-1] = 0. Since B[2]B[2] is an extra character, a delete operation is required, DP[1][2]=DP[1][1]+1=2DP[1][2] = DP[1][1] + 1 = 2. In this way, we build the rest of the matrix.

example-building-rest-of-matrix

When i=2i = 2 and j=1..mj = 1 .. m, we calculate the edit distance for A[1:2]A[1:2] with all prefixes of BB.

example-editing-distance-of-matrix

Repeating the same process, we will end up with the whole DPDP matrix.

example-dp-matrix

Implementation

Output

Complexity Analysis of Levenshtein Algorithm

Time Complexity

In the implementation, we have a nmn*m state, and for each transition, we take O(1)O(1) time, hence the time complexity is O(nm)O(n*m) where nn and mm are the lengths of strings AA and BB respectively. Time Complexity: O(nm)O(n*m)

Space Complexity

In the implementation, we use a table to store nmn*m states of the problem, hence the space complexity is O(nm)O(n*m) where nn and mm are the lengths of strings AA and BB respectively. Space Complexity: O(nm)O(n*m)

Calculating the Levenshtein Distance using Enchant

Introduction

Spell Checking in Python is also implemented in the in-built module pyenchant. This module can be used to calculate the Levenshtein distance between two strings. To download this package, run the following command in the terminal:

Algorithm

The following function pyenchant.utils.levenshtein() can be used to calculate the edit distance between two strings. This function is a library implementation of the above-discussed algorithm.

Example

Output

Conclusion

  • Levenshtein distance is used to measure text similarity between two strings. It is equal to the number of edits (insert, delete, replace) required to convert one string into another.
  • The Levenshtein algorithm exhibits the property of overlapping sub-problem, hence dynamic programming is used to calculate the same.
  • Levenshtein algorithm is also available as an inbuilt function in the pyenchant library. When developing a spell checker, can be very helpful.