1. Field of the Invention
The present invention relates to database systems and, more particularly, to a system and method for organizing and/or finding data in a database system.
2. Discussion of the Related Art
Computerized database systems have long been used and their basic concepts are well known. A good introduction to database systems may be found in C. J. DATE, INTRODUCTION TO DATABASE SYSTEMS (Addison Wesley, 6th ed. 1994).
In general, database systems are designed to organize, store and retrieve data in such a way that the data in the database is useful. For example, the data, or partitioned sets of the data, may be searched, sorted, organized and/or combined with other data. To a large extent, the usefulness of a particular database system depends on the integrity (i.e., the accuracy and/or correctness) of the data in the database system. Data integrity is affected by the degree of “disorder” in the stored data. Disorder may occur in the form of erroneous or incomplete data such as duplicate data, fragmented data, false data, etc. In many database systems, existing data may be edited and processed from time to time, and as a result, additional errors may be introduced. In some database systems, new data may be introduced. Additionally, as database systems are upgraded with new hardware and/or software, data conversion may be required or additional fields may become necessary. Furthermore, in some applications, the data in the database may simply become outdated over time.
Regardless of the preventative steps taken, some degree of disorder is eventually introduced in conventional database systems. This degree of disorder increases exponentially over time until eventually, the data in a conventional database becomes entirely useless. As a result, even a small degree of disorder eventually affects the integrity of the database system.
Unfortunately, identifying and correcting disorder in the data are often difficult, if not impossible, tasks, particularly in large database systems. Traditionally, such tasks are performed manually, making them time-consuming, expensive, and subject to human error. Furthermore, due to the very nature of the task, much of the disorder may go largely undetected. What is needed is a system and method for organizing data in a database system to overcome these and other associated problems.
The present invention provides a system and method for organizing data in a database system. The present invention derives a distilled database of accurate data from raw data extracted from one or more raw data sources. The raw data is converted from its original format(s) to a numeric format. According to one embodiment of the present invention, the raw data is represented as a vector having numeric elements. Once the raw data is represented numerically, various mathematical operations such as correlation functions, pattern recognition methods, or other similar numeric methods, may be performed on these vectors to determine how content in a particular vector corresponds to other vectors in a “distilled” or reference database. The distilled database is formed from sets of one or more related vectors that are believed to be unique (e.g., orthogonal) with respect to the other sets. These sets represent the best information available from the raw data. After all the raw data has been incorporated into the distilled database, new data may be screened to ensure that new errors are not introduced into the distilled database. The new data may also be evaluated to determine whether it is unique or whether it includes better information than that already present in the distilled database. The new data is added to the distilled database accordingly.
One of the features of the present invention is that raw data is converted into a numeric format based on a number system having an appropriate radix. An appropriate radix is determined according to the type of information included in the raw data. For example, for raw data generally comprised of alpha-numeric characters, an appropriate radix may be greater than or equal to the number of different alpha-numeric characters present in the raw data. Using such a number system allows raw data to be represented numerically, allowing for manipulation through various well-known mathematical operations.
Another feature of the present invention is that the number system may be selected so that the numbers themselves retain semantic significance to the raw data they represent. In other words, the numerals in the number system are selected so that they correspond to the raw data. For example, in the case of raw data comprised of alphanumeric characters, the numerals are selected to correspond to the alphanumeric characters they represent. When the numerals in the number system are subsequently displayed, they appear as the alphanumeric characters they represent.
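By way of illustration only, the radix-based representation described in the two preceding paragraphs may be sketched in Python as follows. The particular alphabet (a space, the ten digits, and the twenty-six uppercase letters, giving a radix of 37) and the names ALPHABET, RADIX, encode and decode are assumptions made for this sketch and are not taken from the description. Each character serves as its own numeral, so a decoded number displays as the text it represents.

```python
import string

# Hypothetical alphabet: space, digits 0-9, letters A-Z.
# Its size (37) is the radix; the position of each character is its digit value.
ALPHABET = " " + string.digits + string.ascii_uppercase
RADIX = len(ALPHABET)
DIGIT_OF = {ch: i for i, ch in enumerate(ALPHABET)}


def encode(text):
    """Interpret text as a base-RADIX number whose numerals are the characters."""
    value = 0
    for ch in text.upper():
        value = value * RADIX + DIGIT_OF[ch]
    return value


def decode(value, width):
    """Render a number as `width` numerals, i.e. as the characters themselves."""
    chars = []
    for _ in range(width):
        value, digit = divmod(value, RADIX)
        chars.append(ALPHABET[digit])
    return "".join(reversed(chars))


if __name__ == "__main__":
    n = encode("SMITH")
    print(n, decode(n, 5))
```

Under these assumptions, equal-length strings compare numerically in the same order as they compare alphabetically, which is one reason a radix at least as large as the character set may be convenient.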
Another feature of the present invention is that once the raw data is represented as vectors in an appropriate number system, the represented data may be efficiently manipulated in the database (e.g., sorted, etc.) using various well-known techniques. Furthermore, various well-known mathematical operations may be performed on the vectors to analyze the data content. These mathematical operations may include correlation functions, eigenvector analyses, pattern recognition methods, and others as would be apparent.
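As a minimal sketch of the vector manipulation just described, the following Python example represents records as fixed-width vectors of numerals, sorts them, and compares them. The description does not specify a particular correlation function; the normalized dot product (cosine similarity) used here is an assumption chosen for illustration, as are the alphabet, the vector width of 8, and the names to_vector and correlation.

```python
import math
import string

# Hypothetical alphabet; the index of each character is its numeric value.
ALPHABET = " " + string.digits + string.ascii_uppercase
DIGIT_OF = {ch: i for i, ch in enumerate(ALPHABET)}


def to_vector(text, width):
    """Pad or truncate text to `width` and map each character to its numeral."""
    padded = text.upper().ljust(width)[:width]
    return [DIGIT_OF[ch] for ch in padded]


def correlation(u, v):
    """Normalized dot product of two vectors (an assumed correlation measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    records = ["SMYTH", "JONES", "SMITH"]
    # Lists of integers sort lexicographically, matching alphabetical order here.
    vectors = sorted(to_vector(r, 8) for r in records)
    print(vectors)
    print(correlation(to_vector("SMITH", 8), to_vector("SMYTH", 8)))
```

Because every element is non-negative, this measure tends to be high for any pair of vectors; a practical system would likely use a more discriminating function.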
Still another feature of the present invention is that the raw data is incorporated into a distilled database. The distilled database represents the best information extracted from the raw data without having any data disorder.
Yet another feature of the present invention is that new data may be compared to the distilled database to determine whether the new data actually includes any new information or content not already present in the distilled database. Any new information not already in the distilled database is added to the distilled database without adding any disorder. In this manner, the integrity of the distilled database may be maintained.
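The screening of new data against the distilled database may be sketched as follows. This is an illustrative sketch under stated assumptions, not the method of the invention: the position-wise match ratio used as the similarity measure, the threshold of 0.8, the vector width of 12, and the names similarity, screen and to_vector are all hypothetical. The sketch only rejects near-duplicates; it does not attempt to decide whether a new record carries better information than an existing one.

```python
import string

# Hypothetical alphabet; the index of each character is its numeric value.
ALPHABET = " " + string.digits + string.ascii_uppercase
DIGIT_OF = {ch: i for i, ch in enumerate(ALPHABET)}


def to_vector(text, width=12):
    """Pad or truncate text to `width` and map each character to its numeral."""
    padded = text.upper().ljust(width)[:width]
    return [DIGIT_OF[ch] for ch in padded]


def similarity(u, v):
    """Fraction of positions at which two vectors hold the same numeral."""
    matches = sum(1 for a, b in zip(u, v) if a == b)
    return matches / max(len(u), len(v))


def screen(distilled, candidate, threshold=0.8):
    """Add candidate to distilled only if no existing vector is too similar.

    Returns True if the candidate was added, False if it was rejected.
    """
    for existing in distilled:
        if similarity(existing, candidate) >= threshold:
            return False
    distilled.append(candidate)
    return True


if __name__ == "__main__":
    distilled = []
    for name in ["SMITH JOHN", "SMYTH JOHN", "JONES MARY"]:
        print(name, screen(distilled, to_vector(name)))
```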
Yet another feature of the present invention is that the raw data may be pre-encoded into an intermediate encoded format prior to, or contemporaneously with, being converted to a numeric format.
Other features and advantages of the invention will become apparent from the following drawings and description.