The present invention relates generally to data processing systems, and more particularly, to methods for reducing storage requirements in a database.
Many databases have been implemented that use a person's name as a key for record retrieval. To facilitate searching, these names are sometimes stored using an encoding scheme such as Soundex or Metaphone, whereby fuzzy sound-alike retrievals can be performed. The Soundex algorithm assigns the same code to surnames that sound similar but have different spellings. Soundex codes begin with the first letter of the surname, followed by a three-digit code that represents the first three remaining consonants. Zeros are added to names that do not have enough letters to be coded. In Soundex, consonants that sound alike have the same code. The coding guide is as follows:
1: B, P, F, V;
2: C, S, G, J, K, Q, X, Z;
3: D, T;
4: L;
5: M, N;
6: R.
The letters A, E, I, O, U, Y, H and W are not coded. Names with adjacent letters having the same equivalent number are coded as one letter with a single number. Surname prefixes are generally not used in the Soundex algorithm.
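The coding guide and rules above can be sketched in a few lines of Python. This is a minimal illustration of the rules exactly as stated here, not a reference implementation: the function name `soundex` is an assumption, non-alphabetic characters are simply skipped, and an uncoded letter (including H and W) is treated as separating two letters with the same code, which some published Soundex variants handle differently.

```python
# Build the letter-to-digit table from the coding guide.
SOUNDEX_CODES = {}
for digit, group in {"1": "BPFV", "2": "CSGJKQXZ", "3": "DT",
                     "4": "L", "5": "MN", "6": "R"}.items():
    for letter in group:
        SOUNDEX_CODES[letter] = digit


def soundex(surname: str) -> str:
    """Return the first letter plus three digits, zero-padded (hypothetical helper)."""
    letters = [ch.upper() for ch in surname if ch.isascii() and ch.isalpha()]
    if not letters:
        return ""
    digits = []
    previous = SOUNDEX_CODES.get(letters[0])  # code of the letter just seen
    for ch in letters[1:]:
        code = SOUNDEX_CODES.get(ch)          # None for A, E, I, O, U, Y, H, W
        if code is not None and code != previous:
            digits.append(code)               # adjacent same-code letters count once
        previous = code
    return (letters[0] + "".join(digits) + "000")[:4]


print(soundex("Robert"))  # prints R163
print(soundex("Rupert"))  # prints R163
print(soundex("Lee"))     # prints L000
```

"Robert" and "Rupert" receive the same code, which is the sound-alike grouping the scheme is intended to provide.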
Metaphone is an algorithm for encoding a word so that similar sounding words encode the same. It is similar to Soundex in purpose, but because it applies the basic rules of English pronunciation, it is more accurate. The higher accuracy requires more computational power, as well as more storage capacity. The algorithm reduces an input word to a code of one to eight or more characters using relatively simple phonetic rules for typical spoken English. Metaphone reduces the alphabet to sixteen consonant sounds: B, X, S, K, J, T, F, H, L, M, N, P, R, 0, W, Y, where 0 (zero) represents the "th" sound and X represents the "sh" sound. Metaphone uses the following transformation rules: for doubled letters, except "c", drop the second letter; keep vowels only when they are the first letter.
Additionally, names can be stored in an uppercase alphanumeric version to facilitate searching by partial character matches. When either of these methods is used to facilitate searching, the original mixed case name is also stored for display purposes in the "as originally entered" format. Storing a name in both an uppercase alphanumeric-only version and the original mixed case format may double the storage requirements.
This problem can be best described by an example. In a health provider's network, there typically exists a master person index (MPI) that is used to resolve a name to a single person, given a wide variety of partially complete and potentially different input fields. For example, assume a person's last name is "Mendez-Perez." One operator may input the name as written (i.e., "Mendez-Perez") while another operator may input the name as "Mendez Perez" or as "mendezperez." To facilitate the expected outcome of searching the database for this person, a retrieval key field may be created that contains the uppercase alphabetic characters only; thus "MENDEZPEREZ" would be searched for in the column holding the "squished" representation of the name. In this example, the "squished" representation is formed by converting all letters to uppercase and ignoring any character that is not an uppercase alphabetic character; thus a space or hyphen would be discarded. Once the appropriate record has been found, the mixed case version of the name should be used for display at the operator's console.
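The "squish" rule in this example can be illustrated with a short Python sketch. The function name `squish` is an assumption made for illustration, and the sketch considers only the ASCII letters A through Z.

```python
def squish(name: str) -> str:
    """Uppercase every letter and discard anything that is not A-Z."""
    return "".join(ch.upper() for ch in name if ch.isascii() and ch.isalpha())


# All three operator inputs reduce to the same retrieval key.
for entered in ("Mendez-Perez", "Mendez Perez", "mendezperez"):
    print(squish(entered))  # prints MENDEZPEREZ each time
```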
One solution to the above problem is simply to fetch each record and apply the "squish" rule to the name-as-originally-input column of the database, record by record. This would be a very slow process since the same processing would need to be repeated for every search. Therefore, such a method is not a viable solution. Another approach is to store two columns, one already squished and the other as originally input, thus doubling the storage space needed.
The well-known "zip" and "Huffman" encoding techniques are optimized for, and function on, a long series of subcharacter strings in long textual documents. What is needed is an algorithm that works better for encoding short, common name character sequences where the data must exist in multiple forms for: (1) database searching and (2) display back to the operator.
One alternative is a simple bit map providing upper/lower case flagging information, but that alternative does not provide for reinsertion of the characters removed by a "squish" algorithm; i.e., the bit map indicates only whether each character is to be translated to lower case or copied as is.
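The limitation of a case-only bit map can be seen in the following Python sketch. The helper names `squish`, `case_flags`, and `apply_case_flags` are hypothetical, and the flags are held in a list of booleans rather than packed bits for readability.

```python
from typing import List


def squish(name: str) -> str:
    """Uppercase every letter and discard anything that is not A-Z."""
    return "".join(ch.upper() for ch in name if ch.isascii() and ch.isalpha())


def case_flags(name: str) -> List[bool]:
    """One flag per letter: True if the letter was entered in lower case."""
    return [ch.islower() for ch in name if ch.isascii() and ch.isalpha()]


def apply_case_flags(key: str, flags: List[bool]) -> str:
    """Lower-case the flagged letters of the key; copy the rest as is."""
    return "".join(ch.lower() if low else ch for ch, low in zip(key, flags))


original = "Mendez-Perez"
restored = apply_case_flags(squish(original), case_flags(original))
print(restored)  # prints MendezPerez
```

The letter case is recovered, but the hyphen is not, because the bit map carries no information about characters that the squish step removed.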
This invention attempts to minimize the storage required to keep both forms of the name, one for machine searching and record retrieval consistency, and the other for human display. By using this invention, the space requirements per record can be greatly reduced, thus allowing more records to be stored on the same media, and as a by-product of smaller databases, the information retrieval process can also be sped up.
This invention applies where the data needs to be stored in a compacted or "squished" format to serve as a retrieval key, and the original input data must also be capable of being recreated. This invention applies where the general characteristics of the data to be stored are well known, such that the frequency of exception characters can be predicted in advance in order to assign the most efficient encoding scheme to the data. This invention applies to short strings rather than lengthy texts.
The data to be encoded and stored in the database record is first analyzed to determine its characteristics. If the representation of a person's name is to be encoded in a bit string, then the data will be characterized by uppercase and lowercase alphabetic characters with a few additional characters such as an apostrophe or hyphen. The data analyzed can be a sample of the records to enter and store or the entire data set. The analysis can be performed by a computer software module, or can be done manually, or by a combination of computer processing of the input stream of data and manual analysis to determine trends and characteristics. An encoding scheme is then devised to encode the information input with a bit stream that represents the information. The information input is then compacted to convert the information input into a uniform format (e.g., all uppercase alphabetic characters or all lowercase alphabetic characters). The encoded and compacted information are then stored in a corresponding database record.
When a user wants to retrieve a particular record from the database, the user enters the information and the system compacts it. The compacted information is then used as a key to locate and retrieve the record(s) in the database. The encoded representation of the information is retrieved with the record and is then used to decode the compacted information into the original information input, which is displayed to the user. As a result of this invention, the original information input does not need to be stored in the database record.
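The store-and-retrieve cycle described above can be sketched in Python as follows. The text does not fix a particular bit layout, so the prefix code used here is purely an assumption for illustration: lowercase letters (assumed to be the most frequent case) take one bit, uppercase letters two bits, and the exception characters hyphen, space, and apostrophe take three or four bits. The names `encode`, `decode`, `squish`, and `records` are likewise hypothetical, and the bit stream is shown as a string of "0" and "1" characters rather than packed bytes.

```python
# Hypothetical prefix code (illustration only): shorter codes are given to
# the events assumed to occur most often in personal names.
LOWER, UPPER = "0", "10"
INSERT_CODES = {"-": "110", " ": "1110", "'": "1111"}
CODE_TO_CHAR = {code: ch for ch, code in INSERT_CODES.items()}


def squish(text: str) -> str:
    """Compact a search entry: uppercase letters A-Z only."""
    return "".join(ch.upper() for ch in text if ch.isascii() and ch.isalpha())


def encode(name: str):
    """Return (compacted key, bit string) for a name as originally entered."""
    key, bits = [], []
    for ch in name:
        if "a" <= ch <= "z":
            key.append(ch.upper())
            bits.append(LOWER)
        elif "A" <= ch <= "Z":
            key.append(ch)
            bits.append(UPPER)
        elif ch in INSERT_CODES:
            bits.append(INSERT_CODES[ch])  # character is absent from the key
        else:
            raise ValueError(f"no code assigned for {ch!r}")
    return "".join(key), "".join(bits)


def decode(key: str, bits: str) -> str:
    """Rebuild the originally entered name from the key and the bit string."""
    letters = iter(key)
    out, code = [], ""
    for bit in bits:
        code += bit
        if code == LOWER:
            out.append(next(letters).lower())
        elif code == UPPER:
            out.append(next(letters))
        elif code in CODE_TO_CHAR:
            out.append(CODE_TO_CHAR[code])  # reinsert a removed character
        else:
            continue  # code is still incomplete; read another bit
        code = ""
    return "".join(out)


# Store only the compacted key and the bit string, not the original name.
records = {}
key, bits = encode("Mendez-Perez")
records[key] = bits

# A differently typed query compacts to the same key and finds the record.
query = squish("mendez perez")
print(decode(query, records[query]))  # prints Mendez-Perez
```

Under these assumed codes, "Mendez-Perez" needs 16 bits of encoding alongside its 11-character key, compared with storing the 12-character original as a second column.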