Algorithms and Data Structures Wiki
Advertisement

It is the minimum number of edit operations to transform one document to another. Changes allowed are:

  • Add a letter
  • Remove a letter
  • Replace one letter by another

Document Similarity[]

Given two documents, how similar are they?

  • Plagiarism detection
  • Checking changes between versions of code
  • Answering web search queries more effectively

How do we compute it?[]

Brute Force[]

  • Delete all of the first document. Add all of the second document.
  • Impossibly Inefficient.

Naive Solution[]

  • Make the first character in both documents same. Explore all possible edit operations to do the same. Recursively fix the rest of the document.
  • Naive solution is inefficient as the same problem is solved recursively many times.

Dynamic Programming[]

  • Making recursive computations efficient by ensuing that each subproblem is computed only once.We store and look up answers to already solved subproblems.
  • Quite efficient.

Needleman-Wunsch's Algorithm[]

Consider two strings A[1 ... n] and B[1 ... m]. We define V(i,j) to be the score of alignment between A[1 ... i] and B[1 ... j] and score(A,B) is the score if a character A is aligned with character B.

Best case:
V(0,0) = 0 // no score for matching 2 empty strings
Recurrences: For i>0 and j>0:
V(0,j) = V(0,j-1) + score(-,B[j]) // insert space j times to make alignment
V(i,0) = V(i-1,0) + score(A[i],i) // delete i times to make alignment
V(i,j) = max(option1,option2,option3), where
  option1 = V(i-1,j-1) + score(A[i],B[j]) // score of match or mismatch
  option2 = V(i-1,j) + score(A[i],-) // delete
  option3 = V(i,j-1) + score(-,B[j]) // insert


In short, this DP algorithm concentrates on three possibilities for the last pair of characters, which must either be a match/mismatch, a deletion, or an insertion. Although we do not know which one is the best, we can try all possibilities while avoiding re-computation of overlapping subproblems (i.e. basically a DP technique).

Screen Shot 2017-05-01 at 3.06

With a simple cost function where a match gets a +2 point and mismatch, insert, delete all get a -1 point, the detail of string alignment score of A = 'ACAATCC' and B = 'AGCATGC' is shown in the figure below. The alignment score is 7 (bottom right).

Edit distance

String Alignment Example for A = 'ACAATCC' and B = 'AGCATGC' (score = 7)

Follow the dashed (red) arrows from the bottom right cell to reconstruct the solution. Diagonal arrow means a match or a mismatch (e.g. the last 'C'). Vertical arrow means a deletion (e.g. .. CAT.. to ..C_A..). Horizontal arrow means an insertion (e.g. A_C.. to AGC..).

A = 'A_CAAT[C]C'
B = 'AGC_AT[G]C'

As we need to fill in all the entries in the table of n X m matrix and each entry can be computed in O(1), the time complexity is O(nm). The space complexity is O(nm) = the size of the DP table.

Variations[]

  • Interested in only the meaning of the document
  • Focus on words
  • Documents are near if they overlap in many words
    • Order in which words occur may not matter
    • Useful for topic based web search
    • Can have dictionary of "similar" words
Advertisement