A string is a sequence of characters. A string may represent a name or address, e.g., “John Doe” or “PO Box 12345, Big Town, Wash. 67890.” Alternatively, a string may represent a biological sequence, e.g., a DNA or mRNA sequence. Characters in a string may be indexed by position. In computer-based applications, the indices usually range from zero to one less than the number of characters, which is also called the length. For example, string “abcdefghij” has length 10; the character in position zero is “a,” and the character in position nine is “j.”
The need to identify among a set of strings (called a text list) those that are similar to a string (called a search string) occurs in several contexts. For many applications, a search for texts similar to the search string is more useful than a search for exact matches.
For example, consider a system to locate the medical history for an individual among a large set of histories. Suppose the individual writes his or her name and address on a form. Another person transcribes the name and address into a computer to form a name search string and an address search string. Then the computer performs a search among a set of histories that are indexed by name and address strings. The set of history name strings forms a name text list, and the set of history address strings forms an address text list.
One possible method is to search the name text list for an exact match to the name search string, search the address text list for an exact match to the address search string, and report as search results any history that is an exact match for name and address. The problem with this method is that the name and address strings for the individual's medical history may be slightly different from the name and address search strings—there may be errors in transcription, and there may be variations in expressions of the name and address.
For another example, consider an organization that keeps a mailing list of members. The list may contain multiple references to the same member, often collected by different methods or by different instances of the same method. The multiple references to a member often contain strings that are similar but not exact matches. The organization can reduce costs and member aggravation by identifying and removing multiple references to the same member. This process, which is called deduplication, has many other uses, including detection of duplicate benefit payments by government agencies and aggregation of data about a customer with multiple accounts at a financial institution.
Other examples involve bioinformatics. In biology, similar sequences often correspond to similar functionality. For example, similar DNA sequences in different individuals or species can encode proteins with similar functions. One use of searching for texts similar to a search string therefore arises when the texts are biological sequences corresponding to proteins with known functions and the search string is a sequence whose function is unknown. This type of search could be useful to understand the mechanisms at work in a genetic disease in which a person lacks a known DNA sequence of unknown function.
Some measures of string similarity are called edit distances. A basic edit distance is the minimum number of inserts and deletes needed to convert one string to another. Refer to this measure as “simple edit distance.” Strings “wheat” and “whets” have simple edit distance two, because “wheat” can be converted to “whets” by deleting an “a” and inserting an “s.” Each insert or delete is called an operation. The contribution to the edit distance for an operation is called the operation cost. Edit distances can involve a variety of operations, such as overwriting one character with another; costs that vary by operation, such as insert being twice as expensive as delete; and costs that vary by operation position, such as operations being more expensive at the beginning of a string than at the end.
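The “wheat” to “whets” conversion can be sketched as two string operations. The helper names `delete_at` and `insert_at` and the `costs` table are illustrative assumptions, not part of the original description; the second cost table only illustrates the idea of costs that vary by operation.

```python
def delete_at(s: str, pos: int) -> str:
    """Return s with the character at zero-based index pos removed (one delete)."""
    return s[:pos] + s[pos + 1:]


def insert_at(s: str, pos: int, ch: str) -> str:
    """Return s with ch inserted before zero-based index pos (one insert)."""
    return s[:pos] + ch + s[pos:]


# Convert "wheat" to "whets" with one delete and one insert.
step1 = delete_at("wheat", 3)      # removes the "a"
step2 = insert_at(step1, 4, "s")   # appends the "s"

# Cost of this particular sequence under unit costs (simple edit distance).
unit_costs = {"delete": 1, "insert": 1}
unit_total = unit_costs["delete"] + unit_costs["insert"]

# Hypothetical cost table in which insert is twice as expensive as delete.
weighted_costs = {"delete": 1, "insert": 2}
weighted_total = weighted_costs["delete"] + weighted_costs["insert"]
```

Under the unit-cost table the sequence costs two, matching the simple edit distance given above; under the hypothetical weighted table the same sequence costs three.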
Dynamic programming is a method to compute edit distance. A dynamic programming algorithm can be developed as follows. First, determine an ordered set of subproblems that includes the problem itself, and determine a recurrence that defines each subproblem solution in terms of previous subproblem solutions or a constant. Then determine a process to compute each subproblem solution in order, using the recurrence, which may involve solutions to earlier subproblems. Since the problem itself is a subproblem, this process solves the problem.
For example, a dynamic programming algorithm to compute the simple edit distance between a search string and a text can be developed as follows. Call the search string length m and the text length n. Call the substring consisting of the first i characters of a string the i-substring. Define subproblem S(i,j) to be the simple edit distance between the i-substring of the search string and the j-substring of the text. Then S(0,0) is zero since no operations are needed to convert an empty string to an empty string. For each i from 1 to m, S(i,0) is i since i deletes are needed to convert the i-substring of the search string to an empty string. Likewise, for each j from 1 to n, S(0,j) is j because j inserts are needed to convert an empty string to the j-substring of the text.

For i from 1 to m and j from 1 to n, if character i of the search string is the same as character j of the text (counting characters from one), then S(i,j) is the minimum of S(i−1,j)+1, S(i,j−1)+1, and S(i−1,j−1) because a method to convert the i-substring of the search string to the j-substring of the text using the fewest possible operations is one of the following.

(1) Delete the last character from the i-substring of the search string, then convert the (i−1)-substring of the search string to the j-substring of the text using the fewest possible operations.

(2) Convert the i-substring of the search string to the (j−1)-substring of the text using the fewest possible operations, then insert the last character of the j-substring of the text at the end of the (j−1)-substring of the text.

(3) Since the last character of the i-substring of the search string is the same as the last character of the j-substring of the text, convert the (i−1)-substring of the search string to the (j−1)-substring of the text using the fewest possible operations, then keep the last character of the i-substring of the search string in place to form the j-substring of the text.
For example, if the i-substring of the search string is “appli” and the j-substring of the text is “analysi,” then at least one of the following is a method to convert “appli” to “analysi” using the fewest possible operations.

(1) Delete “i” from “appli” to form “appl,” then convert from “appl” to “analysi” using the fewest possible operations.

(2) Convert “appli” to “analys” using the fewest possible operations, then insert “i” at the end to form “analysi.”

(3) Convert “appl” to “analys” using the fewest possible operations, then keep the “i” from the end of “appli” to form “analysi.”
If the ith character of the search string is not the same as the jth character of the text, then the third option does not exist, so S(i,j) is the minimum of S(i−1,j)+1 and S(i,j−1)+1.
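The base cases and the two recurrence cases described above can be collected into a single definition. The symbols a_i and b_j are notation introduced here only for illustration: a_i stands for character i of the search string and b_j for character j of the text, both counted from one.

```latex
S(i,j) =
\begin{cases}
0, & i = 0,\ j = 0 \\
i, & 1 \le i \le m,\ j = 0 \\
j, & i = 0,\ 1 \le j \le n \\
\min\{\, S(i-1,j) + 1,\ S(i,j-1) + 1,\ S(i-1,j-1) \,\}, & i, j \ge 1,\ a_i = b_j \\
\min\{\, S(i-1,j) + 1,\ S(i,j-1) + 1 \,\}, & i, j \ge 1,\ a_i \ne b_j
\end{cases}
```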
Computing the subproblems in the order S(0,0), S(1,0), . . . , S(m,0), S(0,1), S(1,1), . . . , S(m,1), . . . , S(0,n), S(1,n), . . . , S(m,n) ensures that each subproblem is solved before the solution is used by a recurrence for another subproblem. Note that the m-substring of the search string is the entire search string, and the n-substring of the text is the entire text. So S(m,n) is the simple edit distance between the search string and the text. Hence, solving the sequence of subproblems solves the original problem.
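The computation order described above can be sketched in a few lines. This is a minimal illustration under the stated recurrence, not a definitive implementation; the function name `simple_edit_distance` and the table name `S` are chosen here for illustration.

```python
def simple_edit_distance(search: str, text: str) -> int:
    """Minimum number of inserts and deletes needed to convert search to text."""
    m, n = len(search), len(text)
    # S[i][j] is the distance between the i-substring of the search string
    # and the j-substring of the text.
    S = [[0] * (n + 1) for _ in range(m + 1)]
    # Order: S(0,0), S(1,0), ..., S(m,0), S(0,1), ..., S(m,n).
    for j in range(n + 1):
        for i in range(m + 1):
            if i == 0:
                S[i][j] = j          # j inserts build the j-substring from empty
            elif j == 0:
                S[i][j] = i          # i deletes empty the i-substring
            else:
                best = min(S[i - 1][j] + 1,   # delete last search character
                           S[i][j - 1] + 1)   # insert last text character
                # Strings are zero-indexed in Python, so character i of the
                # search string (counted from one) is search[i - 1].
                if search[i - 1] == text[j - 1]:
                    best = min(best, S[i - 1][j - 1])  # keep matching character
                S[i][j] = best
    return S[m][n]
```

For instance, `simple_edit_distance("wheat", "whets")` evaluates to 2. The sketch stores the whole table for clarity; since each column depends only on itself and the column before it, keeping two columns would suffice when only the final distance is needed.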
It is possible to perform this computation by hand, as follows. Use a grid. The search string letters correspond to rows 1, 2, . . . , m, proceeding from bottom to top. The text letters correspond to columns 1, 2, . . . , n, proceeding from left to right. Use a row zero on the bottom and a column zero on the left. Write the search string up a column to the left of the grid, and write the text along a row below the grid. Each grid cell corresponds to a subproblem: the grid cell in row i and column j corresponds to subproblem S(i,j).

Compute one column at a time, proceeding left to right. Within each column, compute from the bottom to the top. For each border cell, i.e., each cell in row or column zero, simply fill in the value. For other cells, if the search string character on the row of the cell matches the text character on the column of the cell, then write in the cell the minimum of the following values: the value in the neighboring cell below plus one, the value in the neighboring cell to the left plus one, and the value in the neighboring cell diagonally below and left. If the search string and text characters corresponding to the cell row and column do not match, then write in the cell the minimum of the following values: the value in the neighboring cell below plus one, and the value in the neighboring cell to the left plus one. When finished, the value in the top right cell is the edit distance between the search string and the text.

Here is an example, with search string “apple” and text string “proper.”

    e | 5 4 5 6 5 4 5
    l | 4 3 4 5 4 5 6
    p | 3 2 3 4 3 4 5
    p | 2 1 2 3 4 5 6
    a | 1 2 3 4 5 6 7
      | 0 1 2 3 4 5 6
      +--------------
          p r o p e r
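The hand computation can be checked with a short sketch that fills the same grid and lists its rows from top to bottom, with the search string letters as row labels. The names `edit_distance_grid` and `format_grid` are illustrative assumptions, and the one-character column spacing only lines up while every value is a single digit.

```python
def edit_distance_grid(search: str, text: str) -> list:
    """Return grid[i][j] = S(i,j) for the simple edit distance recurrence."""
    m, n = len(search), len(text)
    grid = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(n + 1):            # one column at a time, left to right
        for i in range(m + 1):        # bottom (row zero) to top within a column
            if i == 0 or j == 0:
                grid[i][j] = i + j    # border cells: S(i,0) = i and S(0,j) = j
            else:
                value = min(grid[i - 1][j] + 1,    # cell below plus one
                            grid[i][j - 1] + 1)    # cell to the left plus one
                if search[i - 1] == text[j - 1]:
                    value = min(value, grid[i - 1][j - 1])  # diagonal cell
                grid[i][j] = value
    return grid


def format_grid(search: str, text: str) -> list:
    """Return the grid rows as strings, top row first, labeled by search letters."""
    grid = edit_distance_grid(search, text)
    lines = []
    for i in range(len(search), -1, -1):
        label = search[i - 1] if i > 0 else " "
        lines.append(label + "|" + " ".join(str(v) for v in grid[i]))
    return lines


if __name__ == "__main__":
    for line in format_grid("apple", "proper"):
        print(line)
```

The top right cell of the grid, `edit_distance_grid(search, text)[len(search)][len(text)]`, holds the edit distance between the two strings.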
In this example, the edit distance between “apple” and “proper” is five. (The string “apple” can be converted to the string “proper” by deleting the two characters “a” and “l” and inserting the three characters “r”, “o”, and “r.”)
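One way to recover a concrete sequence of deletes and inserts, rather than only their count, is to walk back through the filled table from S(m,n) to S(0,0). The sketch below is an illustrative addition, not a step described in the text; the name `edit_script` and the tie-breaking order (keep first, then delete, then insert) are assumptions, and other minimum-length sequences may exist.

```python
def edit_script(search: str, text: str) -> list:
    """Return one minimum-length list of ("delete", ch) / ("insert", ch) operations."""
    m, n = len(search), len(text)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(n + 1):
        for i in range(m + 1):
            if i == 0 or j == 0:
                S[i][j] = i + j
            else:
                S[i][j] = min(S[i - 1][j] + 1, S[i][j - 1] + 1)
                if search[i - 1] == text[j - 1]:
                    S[i][j] = min(S[i][j], S[i - 1][j - 1])

    ops = []
    i, j = m, n
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and search[i - 1] == text[j - 1]
                and S[i][j] == S[i - 1][j - 1]):
            i, j = i - 1, j - 1                     # keep the matching character
        elif i > 0 and S[i][j] == S[i - 1][j] + 1:
            ops.append(("delete", search[i - 1]))   # value came from the cell below
            i -= 1
        else:
            ops.append(("insert", text[j - 1]))     # value came from the cell to the left
            j -= 1
    ops.reverse()   # operations were collected from the end of the strings backward
    return ops
```

For “apple” and “proper,” the returned list contains five operations, two deletes and three inserts, consistent with the distance of five read from the grid.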