A. Technical Field
The present invention is related to text equivalencing, and more particularly, to text equivalencing for human language text, genetic sequence text, or computer language text.
B. Background of the Invention
In many contexts, finding equivalent text is desirable. Equivalent texts are two pieces of text that are intended to be exactly the same, at least one of which contains a misspelling, a typographical error, or a difference in representation, such as a symbolic representation of a word or a different word order. One context where finding equivalent text is desirable is in identifying a song or book by its textual identifiers such as title, artist or author, album name or publisher name, etc. Another such context is in identifying genetic sequences. In these and other contexts, it is desirable to be able to identify equivalent texts because typographical errors or misspellings can occur. In particular, misspellings occur with a high frequency with foreign-sounding text or names, or when the text is the product of voice transcription. Also, there are times when certain words, such as “the,” are omitted or added erroneously, or a character is replaced by a word or vice versa, for example “&” to “and” or “@” to “at.”
In the context of text equivalencing for textual information related to music, there are many ways in which two texts can end up close but not exactly the same. For example, when a song is downloaded, the information can be typed in manually by the user, thus increasing the chance for error. In the context of a software music player, it is generally important to identify the track of music by the textual information. If the textual information cannot be determined, it is difficult to identify the music track.
One method of identifying an equivalent text is by applying a simple set of heuristics to the text and then comparing it to known texts. The simple set of heuristics can overcome problems of variable amounts of white space and omitting or adding words such as “the” at the beginning of an album or book name. However, such heuristics fall short when it comes to typographical errors or misspellings. It is not possible to apply a heuristic for every possible mistyping or misspelling based on every possible pronunciation of every word in the text. The problem is further compounded when the words are not actual language words, but instead are proper names, genetic sequences, acronyms, or computer commands. In those contexts, predicting the mistakes and forming the heuristics is more difficult than when the words are language words.
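The heuristic approach described above, and the point at which it falls short, can be illustrated with a minimal sketch. The function names normalize and find_equivalent, the SYMBOL_WORDS table, the case folding step, and the sample titles are all hypothetical and chosen only for illustration; they are not taken from any particular prior art system.

```python
# Hypothetical symbol-to-word substitutions, based on the "&" and "@" examples.
SYMBOL_WORDS = {"&": "and", "@": "at"}


def normalize(text):
    """Apply a simple set of heuristics to a piece of text."""
    text = text.lower()  # assumption: letter case is not significant
    for symbol, word in SYMBOL_WORDS.items():
        text = text.replace(symbol, " " + word + " ")
    words = text.split()  # discards variable amounts of white space
    if words and words[0] == "the":
        words = words[1:]  # ignore an omitted or added leading "the"
    return " ".join(words)


def find_equivalent(text, known_texts):
    """Return the first known text whose normalized form matches, else None."""
    key = normalize(text)
    for known in known_texts:
        if normalize(known) == key:
            return known
    return None


# Illustrative known texts only.
known = ["The Beatles", "Simon and Garfunkel"]

print(find_equivalent("  beatles ", known))         # white space and "the" handled
print(find_equivalent("Simon & Garfunkel", known))  # symbol substitution handled
print(find_equivalent("The Beetles", known))        # misspelling is not handled
```

In this sketch the first two lookups succeed, while the misspelled third lookup returns no match, because no rule anticipates that particular error. Adding a rule for each such error is the approach that the preceding paragraph describes as impractical.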
What is needed is a system and method of text equivalencing that avoids the above-described limitations and disadvantages. What is further needed is a system and method for text equivalencing that is more accurate and reliable than prior art schemes.