1. Field of the Invention
The invention relates to a data processing method and data processing apparatus for processing large amounts of data using a computer or other information processing apparatus, and particularly to a method and apparatus for searching for, tabulating and sorting table-format data.
2. Description of the Prior Art
Conventionally, large amounts of data are accumulated, and searching, tabulating and other types of data processing are performed on the accumulated data. This data processing may be done using, for example, a known computer system including a CPU, memory, peripheral interface, a hard disk or other auxiliary storage device, a display, a printer or other output device, a keyboard, a mouse or other input device, and a power supply unit connected via a bus, and particularly as software that can be run on a readily available commercial computer system. In order to perform the aforementioned searching, tabulating or other types of data processing, various types of databases that particularly store large amounts of data are known. Among various types of large amounts of data, there is a particularly strong demand to process data that can be expressed in a table format. FIG. 1 is a diagram showing an example of expressing the data to be processed in a table format. FIG. 1 shows an example wherein the sex, age and occupation data for a large number of people, e.g. 1 million, are stored in a table. In FIG. 1, the horizontal rows in the table, namely the so-called records, consist of the record number, and the sex, age and occupation fields corresponding to the record number. The vertical columns in the table consist of the record number, sex field, age field and occupation field. The table indicates that the person with the record number of “0” has a sex of female, age of 18 and occupation of programmer. In the following explanation, the data such as “Female,” “18” and “Programmer” set in the various fields are called field values. In addition, in the following explanation, unless otherwise indicated, the table-format data consisting of 1 million records shown in FIG. 1 is used as a specific example of a large amount of data.
Whether or not large amounts of data can be searched for or tabulated efficiently depends on the format in which the large amount of data is stored. Conventionally, typical known storage techniques include the so-called “record-sequential” and “field-sequential” storage techniques shown in FIGS. 2A and 2B, respectively.
FIG. 2A and FIG. 2B show a representation of the data storage format on a storage device, e.g. a hard disk. In the case of the record-sequential storage technique in FIG. 2A, a set of the field values of sex, age and occupation for each record number is stored on disk in the order of increasing logical addresses sequentially for each record number. On the other hand, in the case of the field-sequential storage technique in FIG. 2B, for each field, the field values are stored in record number order grouped by field in the direction of increasing logical addresses. To wit, in the example of FIG. 2B, the field values for the sex field corresponding to record numbers “0” through “999999” are arranged in order, and next, the field values for the age field are arranged in record number order, and then the field values for the occupation field are arranged in record number order.
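As a non-limiting illustration, the two layouts of FIG. 2A and FIG. 2B may be modeled as flat lists indexed by a logical address. In the Python sketch below, only record 0 is taken from FIG. 1; records 1 and 2 and the helper names `rs_get` and `fs_get` are hypothetical and introduced solely for explanation.

```python
# Three sample records (sex, age, occupation). Only record 0 comes from
# FIG. 1; records 1 and 2 are made-up values for illustration.
records = [
    ("Female", 18, "Programmer"),
    ("Male", 24, "Student"),
    ("Female", 31, "Teacher"),
]
NUM_FIELDS = 3
NUM_RECORDS = len(records)

# Record-sequential (FIG. 2A): all fields of record 0, then record 1, ...
record_sequential = [value for record in records for value in record]

# Field-sequential (FIG. 2B): all sex values, then all ages, then all
# occupations, each group in record number order.
field_sequential = [record[f] for f in range(NUM_FIELDS) for record in records]


def rs_get(record_number, field_index):
    """Address arithmetic for the record-sequential layout."""
    return record_sequential[record_number * NUM_FIELDS + field_index]


def fs_get(record_number, field_index):
    """Address arithmetic for the field-sequential layout."""
    return field_sequential[field_index * NUM_RECORDS + record_number]
```

Both accessors return the same field value for the same (record, field) coordinates; only the address arithmetic differs.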
In the case of the aforementioned prior art, field values corresponding to all fields for all record numbers are stored as is in a two-dimensional data structure (with the record number as one dimension and the other field values as one dimension). Hereinafter, such a data structure in particular shall be referred to as a “data table.” In the case of the prior art, when searching for and tabulating stored data, this is performed by accessing such a data table.
In addition to the method of storing the value of the fields as field values as is, there is also a known method of converting the values to codes and storing the codes as field values. For example, with respect to the sex field, the value “Male” may be converted to “0” while the value “Female” is converted to “1” and then the values “0” or “1” are stored as the field values instead of “Male” or “Female.” Even in this case, there is no change to the point that the converted codes are stored in a data table as field values.
In the case of searching for and tabulating large amounts of data stored using a data structure of the data table type in the aforementioned prior art, there is a problem in that the processing time for searching and tabulating becomes longer due to the access time required to access such data tables.
In addition, data tables have at least the following intrinsic drawbacks.
(1) The data tables easily become enormous in size and cannot be easily separated (physically) into individual fields. For example, when extracting records in which the sex is “Male,” the age and occupation information is unnecessary, so efficiency could be improved if the table could be separated into a table containing only the sex fields. In the case of the field-sequential storage technique shown in FIG. 2B, while separation into individual fields is simple, when large amounts of data are handled, the size of the data table still becomes enormous, so the actual expansion of a data table into memory or other fast storage device for the purpose of tabulating or searching is difficult.
(2) Data tables cannot be kept in a form with multiple field values sorted simultaneously. For example, in the case of the prior art illustrated in FIG. 2A and FIG. 2B, the field values for the sex field are arranged in record number order in the manner “Female, Male, Female, . . . , Female.” However, when performing searching and tabulating processes, it is typically convenient for them to be arranged in the manner “Female, Female, Female, Male, . . . , Male.” However, in table data, the field values are arranged in a specific matrix order, namely record number order, so sorting the field values on a specific field is not permitted. For this reason, in the case of the prior art, it is not possible to select an arrangement of field values that is convenient for searching and tabulating.
(3) In a data table, identical values appear over and over. For example, in the case of the conventional data table given in FIG. 2A and FIG. 2B, at the time of extracting records wherein the sex is “‘Male’ or ‘Man’” (or namely, record numbers), because the field value “Male” appears many times, it is necessary to perform the matching operation “‘Male’ or ‘Man’” which is the comparison condition with the field value of “Male” many times. A single comparison should be sufficient to make the determination of whether there is a match with identical values.
In order to increase greatly the speed of searching for and tabulating large amounts of data, the object of the present invention is to provide a method of searching for, tabulating and sorting table-format data and an apparatus for implementing said method by providing a data control mechanism that both has the functions of the conventional data table and solves the aforementioned problems with the data structure based on the data table.
In order to achieve the aforementioned object, the method and apparatus for searching for and tabulating table-format data according to the present invention propose a novel data control mechanism that is usable on an ordinary computer system. The data control mechanism according to the present invention comprises, as a general rule, a value control table and an array of pointers to the value control table.
FIG. 3 is a diagram used to explain the principle of the present invention, showing a value control table 10 and an array of pointers to the value control table 20. A value control table 10 is defined to be a table made by assigning, for each field in table-format data, an (integral) field value number to each field value belonging to that field, and the table thus contains the field values corresponding to said field value number arranged in order of the field value numbers (reference number 11) along with a category number (reference number 12) which relates to said field value. An array of pointers to the value control table 20 is defined to be an array containing pointers to the field value numbers of the columns (namely, the fields) in the table-format data, namely to the value control table 10, arranged in order of the record numbers of the table-format data.
By combining the array of pointers to the value control table 20 with the value control table 10, given a certain record number, it is possible to use the array of pointers to the value control table 20 pertaining to the field in question to extract the stored field value number corresponding to that record number, and next, extract the stored field value corresponding to that field value number within the value control table 10, and thus obtain the field value from the record number. Therefore, in the same manner as with a conventional data table, it is possible to access all data (field values) with coordinates consisting of the record number (row) and field (column).
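By way of example only, the two-step lookup described above (record number to field value number, then field value number to field value) may be sketched in Python as follows. The function names, the five-record sample column, and the choice to number the distinct values in sorted order are assumptions made for this illustration and are not prescribed by the foregoing description.

```python
def build_information_block(column):
    """Return (value_table, pointer_array) for one field.

    value_table[n] is the field value whose field value number is n.
    pointer_array[r] is the field value number of record r.
    Numbering the distinct values in sorted order is one possible
    choice, assumed here for illustration.
    """
    value_table = sorted(set(column))
    number_of = {value: n for n, value in enumerate(value_table)}
    pointer_array = [number_of[value] for value in column]
    return value_table, pointer_array


def field_value(value_table, pointer_array, record_number):
    """Record number -> field value number -> field value."""
    field_value_number = pointer_array[record_number]
    return value_table[field_value_number]


# Hypothetical sex field for five records.
sex_column = ["Female", "Male", "Female", "Female", "Male"]
sex_values, sex_pointers = build_information_block(sex_column)
```

In this sketch `sex_values` holds each distinct value once, while `sex_pointers` holds one small integer per record, so any (record, field) coordinate remains addressable as with a conventional data table.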
The data control mechanism according to the present invention, which includes a value control table generated for a certain field within the fields of table-format data and an array of pointers to the value control table, may also be referred to in particular as an “information block” in the following explanation.
While conventional data tables offer the integrated control of all data using the coordinates of the rows corresponding to records and columns corresponding to fields, the information blocks according to the present invention are characterized in that the data are completely separated by column in the table format, namely by field. In this manner, by means of the present invention, large amounts of data are separated by field, so it is possible to load only that data related to those fields required for searching or tabulating into memory or other high-speed storage device, and as a result, the access time to the data is reduced, so the searching and tabulating processes are speeded up, and even extremely large amounts of data can be handled without adversely affecting performance.
In addition, in the case of the information blocks according to the present invention, the field values are stored in the value control table, and the record numbers that indicate the position of the value are associated with the array of pointers to the value control table, so there is no need for the field values to be arranged in record number order. Therefore, data can be sorted on field values such that it is suited to searching and tabulating. Thereby, the determination of whether or not a field value matching the target value is present in the data can be performed at high speed. Furthermore, corresponding field value numbers are assigned to the field values, so even if the field values consist of long data or text strings, they can be handled as integers.
Moreover, by means of the present invention, all of the field value numbers of the value control table 10 correspond to different field values, so the number of comparison operations between a specific value and the field values which are required to extract a record containing a field value having that specific value is no more than the number of possible field values, namely the number of field value numbers, so the number of comparison operations is greatly decreased, thus speeding up searching and tabulating. At this time, while a location is required to store the results of determining whether or not a certain field value matches, for example, the category number 12 can be used as this storage location.
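The reduction in comparison operations may be illustrated by the following Python sketch, in which the search condition is evaluated once per distinct field value and the result is held in an array standing in for the category numbers 12. The sample values, the 0/1 encoding of the match result, and all variable names are hypothetical.

```python
# Hypothetical information block for a sex field with three distinct values.
values = ["Female", "Male", "Man"]   # indexed by field value number
pointers = [0, 1, 0, 2, 1]           # field value number of each record

# Search condition: sex is 'Male' or 'Man'.
targets = {"Male", "Man"}

# One comparison per distinct field value (3 here), regardless of the
# number of records. The result is kept in a category-number-like array.
category = [1 if value in targets else 0 for value in values]

# Per record, only an integer array lookup is needed; no string comparison.
matching_records = [r for r, n in enumerate(pointers) if category[n] == 1]
```

With one million records and a handful of distinct values, the string comparisons stay at a handful, and the per-record work is a single indexed read.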
FIG. 4 shows the information block according to the present invention which comprises a value control table 10 including an array of field values 11 containing the field values, an array of category numbers 12 containing the category numbers, and an array of counts 14 containing the counts. The array of counts 14 contains numbers which indicate a count of the number of times each field value is present within all data in a certain field, or in other words, the number of records which have a stipulated field value. By preparing such an array of counts 14 within the value control table 10, the information “(how many instances of) which data is present?” required at the time of searching or tabulating can be obtained immediately, thus speeding up searching and tabulating.
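As one possible illustration, an array of counts may be prepared in a single pass over the array of pointers to the value control table. The sample data and names in the Python sketch below are hypothetical.

```python
# Hypothetical information block for a sex field.
values = ["Female", "Male"]      # indexed by field value number
pointers = [0, 1, 0, 0, 1]       # field value number of each record

# counts[n] = number of records whose field value number is n.
counts = [0] * len(values)
for field_value_number in pointers:
    counts[field_value_number] += 1

# Once prepared, "how many records are Female?" is a single array read.
female_count = counts[values.index("Female")]
```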
FIG. 5 shows an information block including a value control table 10, array of pointers to the value control table 20 and an array of pointers to records 30. The array of pointers to records 30 is defined to be an array containing, for each field value number, namely each field value, pointers to records that have that field value (corresponding to the record number). The number of pointers contained in the array of pointers to records 30 for each field value matches the corresponding entry in the array of counts 14 in the value control table 10. In addition, an array of start positions 13, which specifies for each field value the starting address of its group of pointers within the array of pointers to records 30, may be provided. By providing such an array of pointers to records 30 in the information block in this manner, a set of records that have a particular field value for a certain field can be extracted quickly. The count (reference number 14) and start positions (reference number 13) of pointers stored in the array of pointers to records 30 are set in the value control table 10, so an advantage is that these values are present in the information block as is and can be used directly at the time of tabulating.
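The relationship among the counts, the start positions and the array of pointers to records may be sketched in Python as follows. The literal arrays are hypothetical sample data for a five-record sex field, and the function name `records_with` is introduced only for this illustration.

```python
# Hypothetical information block for a sex field of five records whose
# field value numbers, in record order, are [0, 1, 0, 0, 1].
values = ["Female", "Male"]          # array of field values
counts = [3, 2]                      # array of counts
starts = [0, 3]                      # array of start positions
record_pointers = [0, 2, 3, 1, 4]    # record numbers grouped by field value


def records_with(field_value_number):
    """Return the record numbers that have the given field value number."""
    begin = starts[field_value_number]
    end = begin + counts[field_value_number]
    return record_pointers[begin:end]
```

Extracting all records with a given field value is then a contiguous slice, with no scan over the records.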
Here follows an explanation of the method of searching for and tabulating table-format data according to the present invention. Note that in the following explanation, the individual field information refers to the aforementioned “information block,” and the field value number-specifying information array refers to the aforementioned “array of pointers to the value control table” while the record identifying information array refers to the aforementioned “array of pointers to records.”
When table-format data is represented as an array of records including a plurality of fields containing field values for each field, the method of extracting from the table-format data the field value corresponding to a specific field and a specific record according to the present invention comprises the steps of:
keeping in a storage device, for each individual field, a value control table containing field values which are located in order of field value number each corresponding to the field value belonging to the specific field, and a field value number-specifying information array containing information that specifies the field value numbers in the order of the records,
acquiring from the field value number-specifying information array the field value number corresponding to the specific record, and
obtaining from the field values stored in the value control table the field value corresponding to the field value number acquired as above.
In addition, with the method of obtaining field values according to the present invention, in order to categorize the field values corresponding to the field value number, category numbers are stored in the value control table corresponding to the field value number, and the category numbers are accessed at the time of obtaining the field value corresponding to the field value number.
When table-format data is represented as an array of records including field values with respect to a field associated with a search condition, a single-search method of searching through said table-format data for field values that match a specific search condition comprises the steps of:
keeping in a storage device, for each individual field, individual field information that includes a value control table containing field values which are located in order of field value numbers each corresponding to the field value belonging to the field associated with the search condition, a field value number-specifying information array containing information that specifies said field value numbers in the order of said records, and a record identification information array storing in exclusive areas for each of said field value numbers one or more pieces of record identification information related to identical field value numbers, wherein said value control table includes, for each of said field value numbers, record identification information-specifying information that indicates the area where said one or more pieces of record identification information related to identical field value numbers are stored in said record identification information array, and
using said record identification information-specifying information corresponding to field value numbers related to field values within the field values contained in said value control table that match said search conditions, to acquire record identification information from said record identification information array that matches said search conditions.
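A minimal sketch of the single-field search, assuming the record identification information-specifying information takes the form of a start position and a count per field value number, might read as follows in Python. The dictionary layout, the sample age data and the function name are hypothetical.

```python
# Hypothetical individual field information for an age field whose values,
# in record order, are [18, 24, 18, 31, 24].
age_block = {
    "values": [18, 24, 31],               # value control table
    "pointers": [0, 1, 0, 2, 1],          # field value number per record
    "counts": [2, 2, 1],                  # records per field value number
    "starts": [0, 2, 4],                  # start of each exclusive area
    "record_pointers": [0, 2, 1, 4, 3],   # record numbers grouped by value
}


def single_field_search(block, predicate):
    """Return record numbers whose field value satisfies predicate."""
    result = []
    # The condition is tested once per distinct field value.
    for n, value in enumerate(block["values"]):
        if predicate(value):
            begin = block["starts"][n]
            end = begin + block["counts"][n]
            # Copy the whole exclusive area for this field value number.
            result.extend(block["record_pointers"][begin:end])
    return result


adults = single_field_search(age_block, lambda age: age >= 24)
```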
In addition, the multiple-field search method according to the present invention comprises the steps of:
keeping in a storage device the result set of records that match the search conditions obtained by the single-field search method according to the present invention,
selecting separate individual field information regarding fields related to separate search conditions,
acquiring from the field value number-specifying information array regarding the separate individual field information the field value number corresponding to the pieces of record identification information that match the search conditions set in the result set,
regarding the separate individual field information, determining whether or not the field values identified by the extracted field value numbers match the separate search conditions, and
regarding the separate individual field information, if the field values identified by the extracted field value numbers match the separate search conditions, extracting the pieces of record identification information corresponding to the field value numbers as pieces of record identification information that match the separate search conditions.
Alternately, as a variation to the multiple-field search method, it is possible to implement a so-called OR search. In more detail, this method comprises the steps of:
keeping in a storage device the result set of records that match the search conditions,
regarding other individual field information related to other search conditions,
using field values within the field values stored in other value control tables that match the search conditions and record identification information-specifying information corresponding to related field values to extract from a record identification information array the records that match the other search conditions, and storing the records that match the search conditions in a specified other record set,
if necessary, regarding still other search conditions, using still other record identification information-specifying information to extract records that match still other search conditions, and repeating the storage of still other result sets, and
obtaining a final result set by eliminating duplicate records from the result sets thus obtained.
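The multiple-field search and its OR variation may be illustrated, without limitation, by the Python sketch below. The function names `refine_and` and `combine_or`, the sample data, and the precomputation of one match flag per distinct field value are assumptions of this illustration.

```python
# Hypothetical second field (sex) for the same five records.
sex_values = ["Female", "Male"]
sex_pointers = [0, 1, 0, 0, 1]


def refine_and(result_set, values, pointers, predicate):
    """Keep only records of result_set that also match another field."""
    # Evaluate the separate search condition once per distinct value.
    flags = [predicate(value) for value in values]
    # For each record already found, look up its field value number in
    # the other field and test the precomputed flag.
    return [r for r in result_set if flags[pointers[r]]]


def combine_or(*result_sets):
    """Merge result sets, eliminating duplicate record numbers."""
    seen = set()
    merged = []
    for result_set in result_sets:
        for r in result_set:
            if r not in seen:
                seen.add(r)
                merged.append(r)
    return merged


# Suppose a single-field search on age returned records 1, 4 and 3,
# and a single-field search on sex == "Female" returned 0, 2 and 3.
age_hits = [1, 4, 3]
female_hits = [0, 2, 3]

both = refine_and(age_hits, sex_values, sex_pointers, lambda v: v == "Female")
either = combine_or(age_hits, female_hits)
```

The AND form only touches records already in the result set, so its cost scales with the size of that set rather than with the whole table.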
When table-format data is represented as an array of records including a plurality of fields containing field values for each field, the method of tabulating the table-format data by each field value according to the present invention comprises the steps of:
if n represents an integer equal to 1 or greater, for each of n fields used in tabulation, keeping in a storage device individual field information including a value control table containing field values for that field corresponding to a field value number that uniquely identifies the field value, which is a field value number that is common to the various fields and has a stipulated order from an initial value, and a field value number-specifying information array containing information that specifies the field value numbers in the order of the records,
if i represents an integer in the range 1 ≤ i ≤ n, for the ith individual information field, the total number of the field value numbers is represented by Ni, ki represents an integer in the range 0 ≤ ki ≤ Ni − 1, M represents an integer equal to 1 or greater, and if m is an integer in the range 1 ≤ m ≤ M, then initializing elements Pm(k1, k2, . . . , ki, . . . , kn) of n-dimensional M data spaces having a size of N1 × N2 × . . . × Ni × . . . × Nn,
for the n individual information fields, when j represents an integer in the range 0 ≤ j ≤ (total number of records) − 1, extracting the respective field value numbers stored in the jth position in each field value number-specifying information array, and when the field value number extracted from the ith individual information field is represented by qi, identifying the elements Pm(q1, q2, . . . , qi, . . . , qn) of the data space, and
processing the identified values of the elements Pm(q1, q2, . . . , qi, . . . , qn).
When table-format data is represented as an array of records including a plurality of fields containing field values for each field, the method of tabulating the table-format data by the category of field values according to the present invention comprises the steps of:
if n represents an integer equal to 1 or greater, for each of n fields used in tabulation, keeping in a storage device individual field information including a value control table containing field values for that field and the category number of the field value corresponding to a field value number that uniquely identifies the field value, which is a field value number that is common to the various fields and has a stipulated order from an initial value, and a field value number-specifying information array containing information that specifies the field value numbers in the order of the records,
if i represents an integer in the range 1 ≤ i ≤ n, for the ith individual information field, the total number of either the field value numbers or the category numbers is represented by Ni, ki represents an integer in the range 0 ≤ ki ≤ Ni − 1, M represents an integer equal to 1 or greater, and if m is an integer in the range 1 ≤ m ≤ M, then initializing elements Pm(k1, k2, . . . , ki, . . . , kn) of n-dimensional M data spaces having a size of N1 × N2 × . . . × Ni × . . . × Nn,
for the n individual information fields, when j represents an integer in the range 0 ≤ j ≤ (total number of records) − 1, extracting the respective field value numbers stored in the jth position in each field value number-specifying information array, and when the field value number extracted from the ith individual information field or the category number stored corresponding to the field value number in the value control table of the ith individual information field is represented by qi, identifying the elements Pm(q1, q2, . . . , qi, . . . , qn) of the data space, and
processing the identified values of the elements Pm(q1, q2, . . . , qi, . . . , qn).
With the method of tabulating counts in particular according to the present invention, M=1 is true, and the step of processing the value of the identified element Pm includes adding 1 to the current value of the element Pm.
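For the count tabulation with M=1, the n-dimensional data space may, as one possible arrangement, be held in a flat list addressed in row-major order. The following Python sketch assumes that arrangement; the function name and the two-field sample data are hypothetical.

```python
def tabulate_counts(pointer_arrays, sizes):
    """Cross-tabulate record counts over n fields.

    pointer_arrays[i][j] is the field value number qi of record j in the
    ith field, and sizes[i] is Ni. The element P(q1, ..., qn) is stored
    at a row-major offset in a flat list of N1 x N2 x ... x Nn zeros.
    """
    total_cells = 1
    for size in sizes:
        total_cells *= size
    space = [0] * total_cells          # initialize every element to 0

    num_records = len(pointer_arrays[0])
    for j in range(num_records):
        offset = 0
        for pointers, size in zip(pointer_arrays, sizes):
            offset = offset * size + pointers[j]
        space[offset] += 1             # add 1 to P(q1, ..., qn)
    return space


# Hypothetical fields: sex (N1 = 2) and age (N2 = 3) for five records.
sex_pointers = [0, 1, 0, 0, 1]
age_pointers = [0, 1, 0, 2, 1]
table = tabulate_counts([sex_pointers, age_pointers], [2, 3])
```

In this sketch the element P(q1, q2) is found at offset q1 × 3 + q2, so `table[0]` is the number of records with sex number 0 and age number 0.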
In addition, with the method of tabulating statistical quantities according to the present invention, the step of processing the value of the identified element Pm comprises: for at least one element Pm among the M elements Pm,
for separate individual field information kept in a storage device, acquiring the field value numbers stored in the jth position in the field value number-specifying information array,
from among the field values stored in the value control table of the separate individual field information, acquiring the field value corresponding to the field value number thus acquired, and updating the value of the element Pm by combining the current value of the element Pm with the field value thus obtained.
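As an illustrative instance of tabulating a statistical quantity, the Python sketch below accumulates the total age for each sex, taking summation as the manner of combining the current element value with the acquired field value. The one-dimensional data space, the choice of summation, and all names and sample data are assumptions of this example.

```python
def tabulate_sum(group_pointers, num_groups, measure_values, measure_pointers):
    """Sum a separate field's values per field value number of a group field."""
    sums = [0] * num_groups                      # initialize the data space
    for j, q in enumerate(group_pointers):
        # Field value number at position j of the separate field ...
        measure_number = measure_pointers[j]
        # ... then the field value from its value control table.
        measure = measure_values[measure_number]
        # Combine the current element value with the field value.
        sums[q] += measure
    return sums


# Hypothetical data: group by sex, accumulate age.
sex_pointers = [0, 1, 0, 0, 1]       # 0 = Female, 1 = Male
age_values = [18, 24, 31]
age_pointers = [0, 1, 0, 2, 1]       # ages 18, 24, 18, 31, 24

total_age_by_sex = tabulate_sum(sex_pointers, 2, age_values, age_pointers)
```

Other statistical quantities, such as a maximum or a sum of squares, would follow the same pattern with a different combining step.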
With the present invention, the information that specifies the field value number may be the field value number itself.
Alternately, in order to implement the so-called multi-answer fields wherein multiple field values are allocated to one field of a certain record, with the present invention, the information that specifies the field value number may be a binary value wherein 1 bit is allocated to each field value number, each bit indicating whether or not the corresponding field value is set for the record.
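A multi-answer field of this kind may be sketched in Python as follows, assuming that bit n of each binary value corresponds to field value number n. The hobby field, its values and the function names are hypothetical.

```python
# Hypothetical multi-answer "hobby" field with three field values.
values = ["Reading", "Music", "Sports"]      # field value numbers 0, 1, 2

# One binary value per record; bit n is 1 when field value number n is set.
bitmaps = [0b101, 0b010, 0b000, 0b111]


def values_of(record_number):
    """Return every field value set for the given record."""
    bits = bitmaps[record_number]
    return [value for n, value in enumerate(values) if (bits >> n) & 1]


def records_with(field_value_number):
    """Return the records for which the given field value number is set."""
    mask = 1 << field_value_number
    return [r for r, bits in enumerate(bitmaps) if bits & mask]
```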
In addition, when table-format data is represented as an array of records including a plurality of fields containing field values for each field, the apparatus for searching for and tabulating the table-format data according to the present invention comprises:
a storage device for keeping, for each individual field, a value control table containing field values for that field corresponding to a field value number that uniquely identifies the field value, which is a field value number that is common to the various fields and has a stipulated order from an initial value, and a field value number-specifying information array containing information that specifies the field value numbers in the order of the records,
means of acquiring from the field value number-specifying information array kept on the storage device the field value number corresponding to the specific record, and
means of obtaining from the field values stored in the value control table kept on the storage device the field value corresponding to the field value number acquired as above.
When table-format data is represented as an array of records including a plurality of fields containing field values for each field, the storage medium upon which is recorded a program for searching for and tabulating the table-format data according to the present invention is recorded with a program characterized in comprising:
a step of keeping in a storage device, for each individual field, a value control table containing field values for that field corresponding to a field value number that uniquely identifies the field value, which is a field value number that is common to the various fields and has a stipulated order from an initial value, and a field value number-specifying information array containing information that specifies the field value numbers in the order of the records,
a step of acquiring from the field value number-specifying information array kept on the storage device the field value number corresponding to the specific record, and
a step of obtaining from the field values stored in the value control table kept on the storage device the field value corresponding to the field value number acquired as above.
The present invention also proposes a sorting method whereby an array of record identification information, e.g. record numbers, specifying records including a plurality of fields containing field values corresponding to fields of information is rearranged on a specific field. With the sorting method according to the present invention, an array of pointers to the value control table is formed wherein, for each record, record identification information is associated with field value number corresponding to the field values of a certain field. Next, for each of the field value numbers, the storage location after reordering said record identification information is defined. Said record identification information is sequentially extracted from the array, and said field value number corresponding to said record identification information thus extracted is determined, the record identification information thus extracted is stored in said storage location according to the record identification information-specifying information corresponding to the field value number thus determined, and the storage location where the record identification information is to be stored is updated in order to store the next record identification information.
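The sorting procedure described above resembles a distribution (counting) sort keyed on the field value number. A minimal Python sketch under that reading follows; the function name, the use of a per-value count to define the storage locations, and the sample data are assumptions of this illustration. When field value numbers are assigned in sorted order of the field values, the output is ordered by field value.

```python
def sort_records(record_numbers, pointers, num_values):
    """Reorder record_numbers by field value number.

    pointers[r] is the field value number of record r. Records with the
    same field value number keep their relative input order.
    """
    # Count how many of the given records fall under each value number.
    counts = [0] * num_values
    for r in record_numbers:
        counts[pointers[r]] += 1

    # Define, for each value number, the first storage location.
    next_slot = [0] * num_values
    for n in range(1, num_values):
        next_slot[n] = next_slot[n - 1] + counts[n - 1]

    # Extract each record in turn, store it, and advance the location.
    output = [None] * len(record_numbers)
    for r in record_numbers:
        n = pointers[r]
        output[next_slot[n]] = r
        next_slot[n] += 1
    return output


# Hypothetical sex field: 0 = Female, 1 = Male.
sex_pointers = [0, 1, 0, 0, 1]
sorted_records = sort_records([4, 3, 2, 1, 0], sex_pointers, 2)
```

Because only integer array accesses are involved, the work in this sketch grows linearly with the number of records plus the number of distinct field values, and no field values are compared during the sort.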
A preferred embodiment of the sorting method according to the present invention comprises the steps of keeping in a storage device individual field information including a value control table containing field values in the order of field value numbers corresponding to field values for a field associated with search conditions, and a field value number-specifying information array containing information that specifies field value numbers in the order of the records, where the value control table further includes record identification information-specifying information that, for each field value number, indicates the area in said record identification information array where said one or more pieces of record identification information regarding identical field value numbers are stored, and is constituted such that record identification information is stored at storage locations according to the record identification information-specifying information.
Moreover, the objects of the present invention may be achieved by an apparatus for implementing the aforementioned methods, a computer-readable storage medium containing a program according to this method, or a computer-loadable program product according to the method in question.