It is the minimum number of edit operations to transform one document to another. Changes allowed are:
- Add a letter
- Remove a letter
- Replace one letter by another
Document Similarity[]
Given two documents, how similar are they?
- Plagiarism detection
- Checking changes between versions of code
- Answering web search queries more effectively
How do we compute it?[]
Brute Force[]
- Delete all of the first document. Add all of the second document.
- Impossibly Inefficient.
Naive Solution[]
- Make the first character in both documents same. Explore all possible edit operations to do the same. Recursively fix the rest of the document.
- Naive solution is inefficient as the same problem is solved recursively many times.
Dynamic Programming[]
- Making recursive computations efficient by ensuing that each subproblem is computed only once.We store and look up answers to already solved subproblems.
- Quite efficient.
Needleman-Wunsch's Algorithm[]
Consider two strings A[1 ... n] and B[1 ... m]. We define V(i,j) to be the score of alignment between A[1 ... i] and B[1 ... j] and score(A,B) is the score if a character A is aligned with character B.
Best case: V(0,0) = 0 // no score for matching 2 empty strings
Recurrences: For i>0 and j>0: V(0,j) = V(0,j-1) + score(-,B[j]) // insert space j times to make alignment V(i,0) = V(i-1,0) + score(A[i],i) // delete i times to make alignment V(i,j) = max(option1,option2,option3), where option1 = V(i-1,j-1) + score(A[i],B[j]) // score of match or mismatch option2 = V(i-1,j) + score(A[i],-) // delete option3 = V(i,j-1) + score(-,B[j]) // insert
In short, this DP algorithm concentrates on three possibilities for the last pair of characters, which must either be a match/mismatch, a deletion, or an insertion. Although we do not know which one is the best, we can try all possibilities while avoiding re-computation of overlapping subproblems (i.e. basically a DP technique).
With a simple cost function where a match gets a +2 point and mismatch, insert, delete all get a -1 point, the detail of string alignment score of A = 'ACAATCC' and B = 'AGCATGC' is shown in the figure below. The alignment score is 7 (bottom right).
String Alignment Example for A = 'ACAATCC' and B = 'AGCATGC' (score = 7)
Follow the dashed (red) arrows from the bottom right cell to reconstruct the solution. Diagonal arrow means a match or a mismatch (e.g. the last 'C'). Vertical arrow means a deletion (e.g. .. CAT.. to ..C_A..). Horizontal arrow means an insertion (e.g. A_C.. to AGC..).
A = 'A_CAAT[C]C' B = 'AGC_AT[G]C'
As we need to fill in all the entries in the table of n X m matrix and each entry can be computed in O(1), the time complexity is O(nm). The space complexity is O(nm) = the size of the DP table.
Variations[]
- Interested in only the meaning of the document
- Focus on words
- Documents are near if they overlap in many words
- Order in which words occur may not matter
- Useful for topic based web search
- Can have dictionary of "similar" words