1. Field of the Invention
The present invention relates to a method for indexing information. In particular, the invention comprises a computerized method of reading information from a data stream, creating one or more signatures that are representations of transitions within an episode, and organizing the signatures into one or more indices for later use in computerized information searches.
2. Background
Many enterprises generate large volumes of information for computerized storage, retrieval, and analysis. Generally, a computer reads the information, or data entries, from a data stream record by record. Each record contains a varying number of items of information. A particular record can contain one item, or several hundred or more individual items. The records, therefore, not only vary in length, but the individual items can vary in length. Usually, the computerized management of data involves storage of the data on a mass storage device like a magnetic disk. This allows later retrieval of the data for analysis. An organization can collect a huge amount of information very quickly; therefore, timely and accurate retrieval of the data requires a good indexing system.
Prior art indices are described as primary or secondary, where primary indices are ordered by the same key as the file, whereas secondary indices refer to a different attribute of the file record. A dense index identifies every record of the file, whereas a sparse index identifies logical sections. A single-level index points directly to the location of the content, whereas a multi-level index accommodates further subdivision of the index at each level, the final level pointing to the location of the content. Static indices are not changed in normal operation, whereas dynamic indices are expected to be altered while an operation is in execution. Regardless of the indexing method, the target of the index reference is always one specific item. That one specific item may be a specific record, the first occurrence of a specific value of a given field, the disk sector on which the data is found, or something else of a singular definition. The target is usually defined on (directed to the index from) individual field(s) of a single record.
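By way of illustration only, the distinction between a dense index and a sparse index can be sketched as follows. The record layout, the function names, and the block size are assumptions made for this example and are not taken from any cited reference.

```python
from bisect import bisect_right

# Hypothetical file: (key, payload) records stored in ascending key order.
records = [(3, "a"), (7, "b"), (12, "c"), (18, "d"), (25, "e"), (31, "f")]


def build_dense_index(recs):
    """Dense index: one entry per record, mapping key -> record position."""
    return {key: pos for pos, (key, _) in enumerate(recs)}


def build_sparse_index(recs, block_size):
    """Sparse index: one entry per block, holding the block's first key
    and the position at which the block starts."""
    return [(recs[pos][0], pos) for pos in range(0, len(recs), block_size)]


def sparse_lookup(recs, sparse, block_size, key):
    """Find the block that could hold `key`, then scan only that block."""
    first_keys = [k for k, _ in sparse]
    i = bisect_right(first_keys, key) - 1
    if i < 0:
        return None  # key is smaller than every indexed key
    start = sparse[i][1]
    for pos in range(start, min(start + block_size, len(recs))):
        if recs[pos][0] == key:
            return pos
    return None


dense = build_dense_index(records)
sparse = build_sparse_index(records, block_size=2)
```

In both cases the lookup resolves to one specific record position, which is the singular target described above.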
Nievergelt, Hinterberger and Sevcik (ACM Transactions on Database Systems, March, 1984) surveyed combinatorial indices. However, their work covered multi-attribute combinations and disclaimed the study of multiple values of a single data attribute in an index.
Additionally, U.S. Pat. No. 5,212,639 shows a method and electronic apparatus for the classification of combinatorial data for the summarization and/or tabulation thereof. The apparatus and the method create a database wherein the data entry, such as a journal entry in accounting, comprises the canonical record. A plurality of data entries to be classified are separate records, each comprised of one or more items having associated quantities and an entry identifier serving as a pointer to the record. Each item contains information including at least an item number, or label, and a quantity. A mapping function is applied to each data entry to assign item indicators for the item numbers paired with the associated quantities. The item indicators for the data entries are sorted into ascending numerical sequence and an n-dimension sparse matrix is selected where “n” is the number of items in the data entry. If the present combination of item indicators is new, a design record is created for the database based upon the sparse matrix and including the item indicators, the associated quantity sums, the total number of data entries summarized in the design record and a pointer (a chain of entry identifiers) to the records of the data entry detail. The quantities for the present data entry are added to the quantity sums and the entry identifier is stored in the pointer chain. After all the data entries have been processed, a search routine can be utilized to review the various design records as desired.
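The summarization described above can be approximated by the following minimal sketch. It is an illustration under stated assumptions, not the implementation of the reference: a dictionary keyed by the sorted combination of item indicators stands in for the n-dimension sparse matrix, the mapping function is assumed to be the identity, and all names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DesignRecord:
    indicators: tuple            # sorted combination of item indicators
    quantity_sums: list          # running sum per indicator position
    entry_count: int = 0         # number of data entries summarized
    entry_chain: list = field(default_factory=list)  # entry identifiers


def item_indicator(item_number):
    # Assumption: identity mapping; the reference's mapping function
    # is not reproduced here.
    return item_number


def summarize(entries):
    """Group data entries by their combination of item indicators."""
    designs = {}  # stands in for the sparse matrix (an assumption)
    for entry_id, items in entries:
        # Sort indicators into ascending sequence, keeping each quantity
        # paired with its indicator.
        pairs = sorted((item_indicator(n), q) for n, q in items)
        key = tuple(ind for ind, _ in pairs)
        rec = designs.get(key)
        if rec is None:
            # New combination: create a design record.
            rec = DesignRecord(key, [0] * len(key))
            designs[key] = rec
        for i, (_, q) in enumerate(pairs):
            rec.quantity_sums[i] += q
        rec.entry_count += 1
        rec.entry_chain.append(entry_id)
    return designs


# Hypothetical journal entries: (entry identifier, [(item number, quantity)]).
entries = [
    ("J1", [(200, -50), (100, 50)]),
    ("J2", [(100, 30), (200, -30)]),
    ("J3", [(100, 10), (300, -10)]),
]
designs = summarize(entries)
```

Here entries J1 and J2 share the combination (100, 200) and are summarized in one design record, while J3 produces a second design record.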
This reference, while representing a substantial advancement, does not teach the utilization of a key number representing the total number of key fields, or items, in the data entry record groups. By utilizing a key number, it is possible, for example, to minimize the amount of memory needed to ultimately store the information. Additionally, the reference teaches the use of pointer chains to navigate through the index. Pointer chains eventually break down when dealing with very large amounts of data. With large amounts of data, the size of the pointer chain grows to an excessive level, and stepping through a very long pointer chain requires a significant amount of processing time. Further, when the pointer chains become large, they require a large amount of Random Access Memory storage, which places further demands on the computer system and computer processing time. Also, the pointer chains require updating and storage each time a new record is added to the chain. Again, as the pointer chains grow in size, this increases system processing time. Furthermore, pointer chains can make error recovery difficult; if one link in the pointer chain fails, then processing stops and must resume at the beginning of the pointer chain. Moreover, the reference teaches separately processing variable length data entry record groups. The reference teaches the maintenance of separate pointer chains for each variable length data entry record group. Thus, all the data entry record groups comprised of two items require separate processing from the data entry record groups comprised of three items. Consequently, the method taught by the reference is more effective when implemented to audit existing fixed length data, rather than to perform real-time management of variable length data.
Furthermore, U.S. Pat. No. 5,390,113 shows a method and electronic apparatus for electronically performing bookkeeping upon a plurality of pre-existing accounting journal entries having at least one account number and at least one data component associated with each account number. First, a chart of accounts having account numbers and opening balances associated with the plurality of journal entries is read electronically. A set of account-section numbers is then created for each account number. The journal entries are electronically read individually and one of the account-section numbers is assigned to each account number. The assigned account-section numbers and associated data components are then sorted in a predetermined order. A design for the predetermined order is identified and compared with stored design records to see if such a design already exists. If not, the new design is stored. If so, the associated data components are added to the accumulated total for each account-section number. A tally representing the number of journal entries summarized is increased by one and an entry identifier is added to a list of data entry record groups. The process is then repeated for each journal entry. This reference also fails to teach the utilization of a key number representing the total number of key fields in the data entry record group. The key number allows for quickly identifying data entry record groups of common size. Without the key number, sorting requires traversing all of the design records regardless of their size. Further, this reference also relies extensively on pointer chains, and requires separate processing for each variably sized data entry record group. Accordingly, the teachings of the reference are more effective when implemented to perform bookkeeping of existing data rather than to perform real-time management of variable length data.
Moreover, U.S. patent application Ser. No. 08/751,74, now abandoned, teaches a method and apparatus for the classification of raw data based on the creation of an index. The method comprises reading a data entry record group from a plurality of data entry record groups, where the data entry record group comprises at least one data entry record with at least one key field containing an item. The method further comprises tallying a key number representing a total number of key fields in the data entry record group, creating an index record having a predetermined number of one or more key fields equal to the key number, mapping each item in the key fields of the data entry record group to generate an item indicator in each of the key fields of the index record, and determining whether each of the item indicators in each of the key fields of the index record exists in a stored index record. If the item indicators in each of the key fields of the index record do not exist in the stored index record, the item indicators are added to the index record along with a pointer enabling location of the data entry record group, and the index record is stored. If the item indicators in each of the key fields of the index record exist in the stored index record, a pointer scheme related to the stored index record is altered to enable location of the data entry record group. This method again relies on pointer chains to traverse the variable length index records and therefore suffers from the other aforementioned difficulties. Additionally, this method requires a key number to allow quick identification of variable size records. Thus, each variable length group of records requires a separate indexing system. Accordingly, the teachings of this reference operate more effectively when implemented upon smaller data sets.
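The role of the key number in the method just described can be illustrated with the following sketch. It is a simplified reading of the steps above, not the implementation of the abandoned application: the class name, the integer code table used as the mapping, and the use of a plain list as the pointer scheme are all assumptions.

```python
from collections import defaultdict


class KeyNumberIndex:
    """Index records partitioned by key number (count of key fields)."""

    def __init__(self):
        self.codes = {}  # item -> integer item indicator (assumed mapping)
        # key number -> {index record (tuple of indicators) -> pointers}
        self.by_key_number = defaultdict(dict)

    def _indicator(self, item):
        # Assign the next integer code the first time an item is seen.
        return self.codes.setdefault(item, len(self.codes) + 1)

    def add(self, group_items, pointer):
        """Index one data entry record group; return True if it is new."""
        key_number = len(group_items)  # tally the key fields
        index_record = tuple(self._indicator(it) for it in group_items)
        stored = self.by_key_number[key_number]
        if index_record not in stored:
            stored[index_record] = [pointer]  # store a new index record
            return True
        stored[index_record].append(pointer)  # alter the pointer scheme
        return False

    def find(self, group_items):
        """Return pointers for an exact match, searching one partition only."""
        try:
            index_record = tuple(self.codes[it] for it in group_items)
        except KeyError:
            return []  # at least one item has never been indexed
        partition = self.by_key_number.get(len(group_items), {})
        return list(partition.get(index_record, []))


idx = KeyNumberIndex()
first = idx.add(["cash", "sales"], 0)
second = idx.add(["cash", "sales", "tax"], 1)
third = idx.add(["cash", "sales"], 2)
```

Because each key number has its own partition, groups of two items and groups of three items are kept in separate structures, which reflects the separate indexing noted above.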
U.S. Pat. No. 6,058,392 teaches a computerized method of organizational indexing, storage, and retrieval of computerized representations of events in the form of data, by creating signatures based upon the occurrence of patterns within the data. The method involves creating data entry record groups from events, where the data entry record groups comprise one or more items. The items are encoded into fixed length binary equivalent item indicators, and are then used to create a signature which is a fixed length coded equivalent of the data entry record group. Various indices are created, including a partial index record and a combination cross-reference index. While representing a substantial improvement over the prior art, this method is not ideally suited to the problem created by network path traversals. The '392 patent prefers sorting the information records, which can remove important information regarding the order of the networks; that order can be quite important in the network path application. Furthermore, the '392 patent does not teach certain aspects of data organization necessary and/or helpful to indexing network path traversals.
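The loss of ordering information caused by sorting can be illustrated with the following sketch. The encoding shown is an assumption chosen for the example (a truncated BLAKE2b hash from the Python standard library) and is not the encoding taught by the '392 patent; the node names are likewise hypothetical.

```python
import hashlib


def item_indicator(item: str) -> bytes:
    """Encode an item as a fixed length (4-byte) binary indicator.
    Assumption: a truncated hash stands in for the actual encoding."""
    return hashlib.blake2b(item.encode("utf-8"), digest_size=4).digest()


def signature(items, sort_items: bool) -> str:
    """Build a fixed length signature for a group of items.
    When sort_items is True, the indicators are sorted first, so the
    original order of the items no longer affects the result."""
    indicators = [item_indicator(it) for it in items]
    if sort_items:
        indicators.sort()
    digest = hashlib.blake2b(b"".join(indicators), digest_size=8)
    return digest.hexdigest()  # always 16 hexadecimal characters


# Hypothetical network path and the same path traversed in reverse.
path = ["router-a", "router-b", "router-c"]
reverse_path = list(reversed(path))

sorted_same = signature(path, True) == signature(reverse_path, True)
ordered_same = signature(path, False) == signature(reverse_path, False)
```

With sorting, the forward and reverse traversals yield the same signature, so the direction of the path cannot be recovered from the index; without sorting, the two traversals are expected to yield different signatures.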
Accordingly, a need exists for an improved method of indexing information for use by computerized information searches.