It is the minimum number of edit operations to transform one document to another. Changes allowed are:

- Add a letter
- Remove a letter
- Replace one letter by another

## Document SimilarityEdit

Given two documents, how similar are they?

- Plagiarism detection
- Checking changes between versions of code
- Answering web search queries more effectively

## How do we compute it?Edit

### Brute ForceEdit

- Delete all of the first document. Add all of the second document.
- Impossibly Inefficient.

### Naive SolutionEdit

- Make the first character in both documents same. Explore all possible edit operations to do the same. Recursively fix the rest of the document.
- Naive solution is inefficient as the same problem is solved recursively many times.

### Dynamic Programming Edit

- Making recursive computations efficient by ensuing that each subproblem is computed only once.We store and look up answers to already solved subproblems.
- Quite efficient.

## Needleman-Wunsch's Algorithm Edit

Consider two strings A[1 ... n] and B[1 ... m]. We define *V(i,j)* to be the score of alignment between A[1 ... i] and B[1 ... j] and *score(A,B)* is the score if a character A is aligned with character B.

Best case: V(0,0) = 0 // no score for matching 2 empty strings

Recurrences: For i>0 and j>0: V(0,j) = V(0,j-1) + score(-,B[j]) // insert space j times to make alignment V(i,0) = V(i-1,0) + score(A[i],i) // delete i times to make alignment V(i,j) = max(option1,option2,option3), where option1 = V(i-1,j-1) + score(A[i],B[j]) // score of match or mismatch option2 = V(i-1,j) + score(A[i],-) // delete option3 = V(i,j-1) + score(-,B[j]) // insert

In short, this DP algorithm concentrates on three possibilities for the last pair of characters, which must either be a match/mismatch, a deletion, or an insertion. Although we do not know which one is the best, we can try all possibilities while avoiding re-computation of overlapping subproblems (i.e. basically a DP technique).

With a simple cost function where a match gets a +2 point and mismatch, insert, delete all get a -1 point, the detail of string alignment score of A = 'ACAATCC' and B = 'AGCATGC' is shown in the figure below. The alignment score is 7 (bottom right).

String Alignment Example for A = 'ACAATCC' and B = 'AGCATGC' (score = 7)Follow the dashed (red) arrows from the bottom right cell to reconstruct the solution. Diagonal arrow means a match or a mismatch (e.g. the last 'C'). Vertical arrow means a deletion (e.g. .. CAT.. to ..C_A..). Horizontal arrow means an insertion (e.g. A_C.. to AGC..).

A = 'A_CAAT[C]C' B = 'AGC_AT[G]C'

As we need to fill in all the entries in the table of *n X m *matrix and each entry can be computed in O(1), the time complexity is O(nm). The space complexity is O(nm) = the size of the DP table.

## VariationsEdit

- Interested in only the meaning of the document
- Focus on words
- Documents are near if they overlap in many words
- Order in which words occur may not matter
- Useful for topic based web search
- Can have dictionary of "similar" words