In recent years, large-scale data base services have grown in importance, covering not only secondary information (bibliographic information) such as literature information and patent information, but also primary information (text information). Heretofore, information in such a data base (hereinafter sometimes abbreviated to "DB") has been retrieved by means of key words or classification codes.
Key words are selected from a collection of control terms (called a "thesaurus") by a specialist who assigns them (an operation called "indexing") when information is registered into a data base. A DB searcher, in turn, selects key words from the same thesaurus to perform a search. However, assigning key words is very troublesome work: the indexer must read the document to be registered and then select from the thesaurus a vocabulary suitable for expressing its contents. If the indexing is unsuitable, accurate information cannot be acquired from the data base. Accordingly, there arises a problem in that indexing requires a specialist who has special knowledge of the contents of the documents and is well versed in the vocabulary registered in the thesaurus. Further, there arises a problem in that requested documents cannot be retrieved, or unnecessary documents are mixed into the retrieved documents, if suitable vocabulary according to the thesaurus cannot be designated as key words at the time of searching.
Further, the classification system in the thesaurus always changes with the passage of time. There arises a problem in that key words and classification codes must be always updated.
Further, since indexing takes a long time, new documents must be accumulated in considerable quantities and registered by batch processing. Accordingly, there arises a problem in that the searchable information always lags a certain time behind the present. In such circumstances, with the spread of DBs, there has been a desire for a system in which anyone, not only a DB specialist, can easily conduct both document registration and document retrieval by using free words (also called "non-controlled words") unrestricted by the thesaurus.
With the increase in data base scale, it becomes impossible to describe contents of documents fully in detail by use of only the control words in the thesaurus. Accordingly, the number of documents which can be narrowed down by retrieval using key words is limited to a range from the order of tens to the order of hundreds. To find a target one from the documents, there is no method but a method of reading the contents thereof directly. This causes a serious problem in search efficiency.
Automatic summarizing and automatic indexing have been attempted to counter the problems of the present searching method, which is based on indexing with control words in the thesaurus. However, the problems are not yet substantially solved, because processing Japanese requires various dictionaries owing to its linguistic difficulty.
Omission in search may often arise in the searching process using such free words because of differences in notation or expression, though both a search term as a key word designated by the user and a target term used in the DB have one meaning. For example,
the term " (piano; katakana)" may be represented as " (piyano; katakana)", or
the term " (intafeisu; katakana)" may be represented as " (intafesu; katakana)", " (intafeisu; katakana)" or " (intafesu; katakana)".
Hence, search for desired information often becomes impossible because of fine differences between variations in syllabic notation as described above.
Hereinafter, development into terms different in notation is called "different notation development". Development into other terms by use of a dictionary is called "synonym development". The concept "terms different in notation" is called "different notation".
As means for thoroughly solving these problems, there has been proposed a full text search system in which the searcher can search for the contents of documents on direct reference to the texts of the documents based on free key words (called "free words" or "non-controlled words").
In the following, the proposed system is described with reference to a typical example of construction thereof as shown in FIG. 1.
A search system 101 is connected to a host computer and both receives search requests and transmits search results through a communication circuit. When a search request 107 is issued from the host computer, a search controller 102 accepts it, analyzes it and then sends corresponding search control information 108 to both a term comparator 105 and a query resolver 104. Further, the search controller 102 controls a storage controller 103 so that term data (text data) 111 stored in a search data base 106 are transferred to the term comparator 105.
The term comparator 105 makes a comparison between input term data and preliminarily set search terms (key words). When a matched term is detected, the term comparator 105 sends detection information 110 to the query resolver 104. The query resolver 104 judges whether or not the detection information 110 satisfies a complex condition pertaining to the positional relation, co-existence relation and the like between terms described in the search request. When it satisfies the complex condition, identification information for corresponding document data, as well as the content of the document as a search result 109, is returned to the host computer.
An example of such prior art is described in R. L. Haskin and L. A. Hollaar: "Operational Characteristics of a Hardware-Based Pattern Matcher", ACM Trans. on Database Systems, Vol. 8, No. 1, 1983.
As a term comparison method in the term comparator 213 which is an important part of the term search system 200, a method of retrieving a plurality of terms by one scanning by use of a finite-state automaton is known. A typical example of the method is disclosed in the reading, A. V. Aho and M. J. Corasick: "Efficient String Matching", CACM, Vol. 18, No. 6, 1975.
Two methods for generating an automaton and a string matching method using the automaton are described in the aforementioned reading. In the following, the methods are described.
A first one of the methods (hereinafter referred to as "conventional method 1") will be now described with reference to FIG. 2. The drawing shows automaton state transition for searching term data for a key word " (intafesu; katakana)" given by a user. In the drawing, the circles represent automaton states, and the arrows represent state transitions. Respective characters given to the arrows represent input characters when state transitions corresponding to the arrows occur. In the case of representing negation, for example, in the case of representing characters other than " (n; katakana)" and other than " (i; katakana)", such negation is expressed with a negation symbol " " added to the characters to be denied, for example, as -- {" ", " "}--. The arrow 403 represents a starting state in which state transition starts. Numerical values given to the inside of the circles represent state numbers. The double circle represents an ending state in which comparison of " (intafesu; katakana)" is finished. This method is characterized in that state transitions corresponding to all input characters having a possibility to be inputted are described by an automaton. Therefore, the number of state transitions increases. Accordingly, there arises a problem in that a very large time is required for generating an automaton when the number of key words increases.
In the following, the term comparing operation in the conventional method 1 is described with reference to the drawing. When a character is inputted into the automaton, a token marks the state in which the input character is to be compared. In short, the token is a mark indicating the current state position in the automaton. First, the token is initialized to be placed in the state 0 as a starting state. In this example, the token moves to the state 1 when the input character is " (i; katakana)". When, on the contrary, a character other than " (i; katakana)" enters, the token stays in the state 0. When, on the other hand, the token is in the state 1 and the input character is " (n; katakana)", the token moves to the state 2. When the input character is " (i; katakana)", the token moves to the state 1. When the input character is not " (i; katakana)" and not " (n; katakana)", the token moves to the state 0. When the token is in the state 2 and the input character is " (ta; katakana)", the token moves to the state 3. When the input character is " (i; katakana)", the token moves to the state 1. When the token is in the state 3 and " (fesu; katakana)" enters, the token successively moves to the state 4, the state 5, the state 6 and the state 7. The double circle is given to the state 7, so that comparison of the term " (intafesu; katakana)" is completed.
Because state transitions corresponding to all input characters having a possibility to be inputted must be described in the automaton in the conventional method 1, the number of state transitions increases as the number of key words increases. Accordingly, there arises a problem in that a very large time is required for generating an automaton. Hardware for putting the method into practice has been disclosed in Japanese Patent Unexamined Publication No. Sho-60-105040.
In the following, a second method (hereinafter referred to as "conventional method 2") is described. The conventional method 2 is designed to shorten the time required for generating an automaton, compared with the conventional method 1. The automaton generation time in the conventional method 2 is improved greatly to be one-third as long as that in the conventional method 1. The conventional method 2 has been described in detail in Japanese Patent Unexamined Publication No. Sho-63-311530. The conventional method 2 is described now with reference to FIGS. 3 and 4. FIG. 3 shows state transition in the automaton in the case where the same term " (intafesu; katakana)" as in FIG. 2 is compared. The token is initialized to be placed in the state 0 as a starting state. When a character " (i; katakana)" enters, the token moves to the state 1 after comparison is made in the state 0 in which the token is placed. When, on the contrary, a character other than " (i; katakana)" enters, the token moves to the state 0.
When the token is in the state 1 and a character " (n; katakana)" enters, the token moves to the state 2. When the token is in the state 2 and a character " (ta; katakana)" enters, the token moves to the state 3. When the token is in the state 3 and a character (for example, " (i; katakana)") other than " (fu; katakana)" described in the automaton enters, a failure occurs, and in the conventional method 2 a failure function table as shown in FIG. 4 is consulted. For each state in which the token may be placed, the failure function table stores the number of the state to which the token moves on failure and in which the input character is compared again. In this example, the value 0 corresponding to the current state number 3 is obtained, so that the token moves to the state 0. In the state 0, the input character " (i; katakana)" is compared again, and the token thereby moves to the state 1. The aforementioned function is called a "failure function". When the characters of the string " (ntafesu; katakana)" then enter one by one, the token successively moves to the state 2, the state 3, the state 4, the state 5, the state 6 and the state 7. The double circle is given to the state 7, so that comparison of the term " (intafesu; katakana)" is completed.
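The failure-function scheme described above can be sketched as follows. This is a minimal Python illustration in the style of the Aho and Corasick construction, not a reproduction of the cited publication; the names build_goto_and_failure and search are hypothetical, and the katakana spelling of the key word is assumed. Only the transitions that spell out a key word are stored, and the failure table supplies the state in which a mismatching character is compared again.

```python
from collections import deque


def build_goto_and_failure(keywords):
    """Build the transition (goto) table and the failure function table."""
    goto = [{}]          # goto[state][character] -> next state
    accepts = [set()]    # key words recognized on entering each state
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                accepts.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        accepts[state].add(word)

    failure = [0] * len(goto)
    queue = deque(goto[0].values())   # states next to state 0 fail to state 0
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = failure[state]
            while f and ch not in goto[f]:
                f = failure[f]
            failure[nxt] = goto[f].get(ch, 0)
            accepts[nxt] |= accepts[failure[nxt]]
    return goto, failure, accepts


def search(keywords, text):
    """Return (start offset, key word) for every occurrence in the text."""
    goto, failure, accepts = build_goto_and_failure(keywords)
    state = 0
    hits = []
    for pos, ch in enumerate(text):
        # On a mismatch, follow the failure table and compare the character again.
        while state and ch not in goto[state]:
            state = failure[state]
        state = goto[state].get(ch, 0)
        for word in accepts[state]:
            hits.append((pos - len(word) + 1, word))
    return hits


print(search(["インタフェース"], "インタインタフェース"))
```

In this input the fourth character fails in the state 3, the token falls back to the state 0, and the same character is compared again there, which mirrors the walk-through in the text.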
When, for example, the term " (intafesu; katakana)" is given as a key word, the term may be described in the text by different notation as a term different from the search term designated by the user.
In the text, the term " (intafesu; katakana)" using "- (minus sign)" instead of "-- (prolonged sound symbol)" may be used (this is called "prolonged sound different notation") or the term " (intafesu; katakana)" additionally using "--" may be used (this is called "presence or absence of prolonged sound") or the term " (intafeisu; katakana)" using " (fei; katakana)" instead of " (fe; katakana)" may be used based on difference in pronunciation (this is called "phonetic different notation").
To search for all the terms, nine terms, that is, " (intafesu; katakana)", " (intafesu; katakana)", " (intafeisu; katakana)", " (intafeisu; katakana)", " (intafeisu; katakana)", " (intafesu; katakana)", " (intafesu; katakana)", " (intafesu; katakana)" and " (intafesu; katakana)" formed by combination of these different notations must be all recognized as key words.
The aforementioned example is explained with reference to FIGS. 5 and 6. FIG. 5 is a view of automaton state transition in the case where term data are compared with the nine terms written in different notations.
Comparison is started from the head of the key word, so that the state branches off and leads to another state when there is difference in transition character.
When, for example, " (intafesu; katakana)" and " (intafesu; katakana)" are compared successively from the head of the key word, the two are the same up to " (inta; katakana)" and differ in the next transition character, which is " (fu; katakana)" in one and "--" in the other. Therefore, the state transition branches: the state changes from state 3 to state 22 at the transition character " (fu; katakana)" and from state 3 to state 4 at the transition character "--".
In short, because respective transit states are assigned to different transition characters in a certain state, the automaton is shaped like a tree. FIG. 6 is an explanatory view of a failure function table showing transition in the case where a character not described in the automaton enters. When comparison including different notation is made as described above, there arises a problem in that the number of states increases because the number of key words increases.
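The growth in the number of states can be illustrated with a small Python sketch. The slot alternatives below are hypothetical ASCII stand-ins chosen only so that nine variants result; they do not reproduce the nine katakana terms or the state numbering of FIG. 5. The helper names count_trie_states and expand_variants are likewise assumptions.

```python
from itertools import product


def count_trie_states(keywords):
    """Count the states of the tree-shaped automaton, including state 0."""
    root = {}
    states = 1
    for word in keywords:
        node = root
        for ch in word:
            if ch not in node:
                node[ch] = {}
                states += 1
            node = node[ch]
    return states


def expand_variants(slots):
    """List every key word formed by choosing one alternative per slot."""
    return ["".join(parts) for parts in product(*slots)]


# Hypothetical stand-ins: "-" and "~" play the role of two prolonged-sound
# symbols, and "ei" the role of a phonetic variant.
slots = [["inta"], ["", "-", "~"], ["f"], ["e-", "e~", "ei"], ["su"]]
variants = expand_variants(slots)

print(len(variants), "key words")
print(count_trie_states(variants[:1]), "states for a single key word")
print(count_trie_states(variants), "states for all variants")
```

Because the variants share only their common prefix, each alternative after the branching point needs its own chain of states.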
Further, don't care characters may be used in key words in term retrieval. An example of use of fixed-length don't care characters in key words is explained with reference to FIGS. 7 and 8. FIG. 7 shows automaton state transition in the case where a key word "A?B" including a don't care character "?" having the fixed length of one character is retrieved. FIG. 8 is an explanatory view of a failure function table for indicating the number of state effected by failure in the case where a character not described in the automaton enters.
In this example, an automaton is generated by use of one-byte character codes (JIS codes). "?" is a character symbol which is allowed to satisfy an arbitrary character or symbol. Accordingly, transition based on a don't care character "?" is shown as transition based on all character codes 00-FF in the case where the state 1 in the drawing is a current state. In short, "A?B" shows a designation for retrieving terms composed of a head character "A", an arbitrary character and an ending character "B".
There arises a problem in that the number of states in the automaton increases greatly in spite of such a simple retrieval condition when fixed-length don't care characters enter.
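A rough Python count shows the scale of the problem for the key word "A?B". The sketch assumes, as in the conventional approach, that the don't care character is expanded into all 256 one-byte codes before the tree-shaped automaton is built; the helper names are hypothetical and the state numbering of FIG. 7 is not reproduced.

```python
def count_trie_states(keywords):
    """Count the states of the tree-shaped automaton, including state 0."""
    root = {}
    states = 1
    for word in keywords:
        node = root
        for code in word:          # iterating over bytes yields integer codes
            if code not in node:
                node[code] = {}
                states += 1
            node = node[code]
    return states


def expand_dont_care(prefix, suffix):
    """Expand one fixed-length don't care position into all codes 00-FF."""
    return [prefix + bytes([code]) + suffix for code in range(256)]


keywords = expand_dont_care(b"A", b"B")
print(len(keywords), "expanded key words")
print(count_trie_states(keywords), "states")
```

The count consists of the state 0, one state for "A", 256 states for the arbitrary character and 256 further states for the final "B".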
As a method for solving the problems pertaining to different notations and synonyms, a method has been proposed in Japanese Patent Unexamined Publication No. Sho-62-011932. In this quotation, different notation development is called "different notation generation" and synonym development is called "synonym extraction".
FIG. 9 is a block diagram showing an example of construction of the quotation.
In the construction, a search term written in romaji or katakana is once converted into a term written in katakana by a standard notation. In short, a standardizing process to collect a plurality of notations into one is first carried out by the operation reverse to the different notation generation. On the other hand, a search term written in alphabets is generalized to expression in katakana by borrowed word/kana conversion.
The katakana term thus once standardized is subjected to synonym development by use of a synonym dictionary, so that words synonymous to the input katakana term are sent out as katakana terms. After synonym extraction, the katakana terms are subjected to kana/kanji conversion, kana/borrowed word conversion and kana/romaji conversion to form kanji terms, alphabet terms of foreign origin and romaji terms, respectively.
As described above, the katakana terms as the result of synonym extraction are respectively converted into kanji, romaji, katakana and alphabet terms to thereby carry out different notation development.
In the conventional term search system 101 as shown in FIG. 1, a magnetic disk device capable of storing a large quantity of data is required as a search data base 106 which is one of constituent members of the term search system 101. A general magnetic disk device has a problem in that high-speed data input/output is impossible. On the other hand, a multi-head magnetic disk device in which high-speed data input/output is possible has a problem in that the device is very high in cost.
Therefore, a collective-type magnetic disk unit formed by connecting a plurality of general low-cost small-size magnetic disks to improve data input/output speed has been considered. As one of this type disk units, a "picture data division storage" is disclosed in Japanese Patent Unexamined Publication No. Sho-60-117326.
This unit has a plurality of magnetic disk devices, magnetic disk controllers in the same number as the magnetic disk devices, and a master controller for controlling data transmission between an input/output buffer and an external device. In the master controller, data given from the external device are divided into a quantity not more than the capacity of the input/output buffer. The divided data are successively transferred to the respective magnetic disk controllers. The magnetic disk controllers serve to write the data in corresponding magnetic disk devices, respectively. The master controller gives a seek instruction to magnetic disk controllers corresponding to magnetic disk devices free from the writing operation, to omit the apparent seek time of data-storage magnetic disk devices on and after the second magnetic device to thereby shorten data write/read time.
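The division of data among the magnetic disk devices can be pictured with the following Python sketch. It assumes a simple round-robin assignment of buffer-sized pieces, which is only one reading of "successively transferred"; seek control and timing are not modeled, and the names stripe and reassemble are hypothetical.

```python
def stripe(data, buffer_size, num_disks):
    """Divide data into pieces no larger than the buffer and deal them out."""
    disks = [[] for _ in range(num_disks)]
    for offset in range(0, len(data), buffer_size):
        index = (offset // buffer_size) % num_disks   # assumed round-robin order
        disks[index].append(data[offset:offset + buffer_size])
    return disks


def reassemble(disks):
    """Read the pieces back in the order in which they were written."""
    pieces = []
    longest = max(len(disk) for disk in disks)
    for row in range(longest):
        for disk in disks:
            if row < len(disk):
                pieces.append(disk[row])
    return b"".join(pieces)


disks = stripe(b"abcdefghij", buffer_size=3, num_disks=2)
print(disks)
print(reassemble(disks))
```

While one device is writing its piece, the device that receives the next piece can already be seeking, which is the overlap the master controller exploits.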
When the conventional search system as shown in FIG. 1 is applied to large-capacity database search, the following problems arise.
A first problem is in search time. When, for example, full text search is applied to 20000 documents having a capacity of 20 KB per document, 400 MB data must be scanned.
If data processing is carried out by the steps of: storing the 400 MB text data in the search data base; reading the data at a mean effective speed of about 1 MB/s; and performing comparison in the term comparator at the same speed, it takes about 7 minutes to complete the search. There arises a problem in that the time required for reading the text data is too long for practical use when general magnetic disk devices are used. In short, it is necessary that the reading speed of the search data base storing the text data be improved to the same degree as the processing speed of the term comparator. This is a first object of the invention.
Even if the reading speed of the search data base is improved to the same degree as the processing speed of the term comparator, in other words even if the reading speed is improved to 10 MB/s, it still takes 40 seconds to complete the scanning of the 400 MB text data. Shortening the search time to a practically allowable value on the order of several seconds is a second problem to be solved by the invention.
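The figures above follow from simple arithmetic, reproduced here as a short Python check. The function name scan_seconds is hypothetical, and decimal units (1 MB = 1000 KB) are assumed.

```python
def scan_seconds(num_documents, kb_per_document, mb_per_second):
    """Time needed to read the whole text data base once, in seconds."""
    total_mb = num_documents * kb_per_document / 1000   # assumes 1 MB = 1000 KB
    return total_mb / mb_per_second


slow = scan_seconds(20000, 20, 1)    # general magnetic disk device
fast = scan_seconds(20000, 20, 10)   # reading speed matched to the comparator
print(slow, "s, or about", round(slow / 60, 1), "minutes")
print(fast, "s")
```

A tenfold increase in reading speed therefore still leaves the scan time an order of magnitude above the target of several seconds.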
In respect to the technique for improvement in search speed, a "term search method" has been proposed in JP-A-62-241026. In the "term search method", a "table of distribution of frequency in use of characters" is generated by examining frequency in use of characters from the contents of text (called "data") in advance, in order to improve the processing speed in the process of searching a text data base (called "file") for a designated term.
The proposed method has the steps of: performing test search with reference to the "table of distribution of frequency in use of characters" with use of a character of lowest frequency in the key word designated by the user as a key; and performing comparison on characters before and behind the character if the character satisfies a specific condition.
Further, the JP-A-62-241026 has described that retrieval can be finished without text search in the case where the character of lowest frequency in the key word has a frequency of zero in the "table of distribution of frequency in use of characters".
In short, according to the JP-A-62-241026, the number of wasteful character comparisons can be reduced, thus to attain an effect that the search processing speed is improved.
However, the method is designed to generate a "table of distribution of frequency in use of characters" in the whole of a data base (file) to thereby search the data base for a text file (data) based on the table (refer to the drawing). Accordingly, the method has an effect in efficiency in search processing in the case where a key word pertaining to characters absent in the data base is retrieved. In general, the number of characters absent in the data base is reduced as the scale of the data base increases. There arises a problem in that the effect of the method in search processing disappears.
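The idea of the frequency table can be sketched in Python as follows. This is an assumption-laden illustration rather than the method of JP-A-62-241026 itself: the function names are hypothetical, the table is built over the whole data base, and the "specific condition" is simplified to a direct comparison of the characters around the rarest character.

```python
from collections import Counter


def build_frequency_table(documents):
    """Count how often each character is used in the whole data base."""
    table = Counter()
    for doc in documents:
        table.update(doc)
    return table


def search_with_rarest_char(documents, table, keyword):
    """Use the least frequent key word character as the key for a test search."""
    offset, rare = min(enumerate(keyword), key=lambda item: table[item[1]])
    if table[rare] == 0:
        return []          # the character never occurs, so no text scan is needed
    hits = []
    for doc_id, doc in enumerate(documents):
        pos = doc.find(rare)
        while pos != -1:
            start = pos - offset
            # Compare the characters before and behind the rare character.
            if start >= 0 and doc.startswith(keyword, start):
                hits.append((doc_id, start))
            pos = doc.find(rare, pos + 1)
    return hits


documents = ["full text search", "pattern matching"]
table = build_frequency_table(documents)
print(search_with_rarest_char(documents, table, "text"))
print(search_with_rarest_char(documents, table, "zebra"))
```

The early exit in the second call is the case in which the method pays off; as the data base grows, fewer characters have a frequency of zero and this shortcut applies less often.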
To solve the aforementioned problem in order to enforce efficient search processing to make equivalently high-speed full text search possible is the second object of the invention.
On the other hand, in full text search using free words, a difference in expression may arise between the key word designated by the searcher and the term described in the text though they are the same in meaning. In this case, documents having a different expression form are omitted from search to make it impossible to retrieve target documents. Examples of such terms are synonyms, different-form words (called "different notation words" or "different notation"), and the like.
Examples of words synonymous to " (keisanki; kanji)" are " (densikeisanki; kanji)", " (densanki; kanji)", "Computer", and the like. Examples of different notations with respect to " (konpyuta; katakana)" are " (konpyuta; katakana)", " (konpyuta; katakana)", " (konpyuta; katakana)", " (konpyuta; katakana)", " (konpyuta; katakana)", " (konpyuta; katakana)", " (konpyuta; katakana)", " (konpyuta; katakana)", " (konpyuta; katakana)", " (konpyuta; katakana)", " (konpyuta; katakana)", " (konpyuta; katakana)", " (konpyuta; katakana)", " (konpyuta; katakana)", and the like. Examples of different notations with respect to "Computer" are "computer", "COMPUTER", and the like. To cope with the problem in difference in notation between the key word designated by the user and the term described in the text of the document, the searcher must conduct search after designating all synonyms and different notations. However, it is practically difficult that the searcher designates all different notations, because hundreds of different notations may exist. To solve the problem is a third object of the present invention.
In the conventional method, expected development results in most cases cannot be attained, because information in an original term is changed at the time of the standardizing of notation.
This fact is explained with reference to the rule of partial term conversion
from " (hoo; katakana)" to " (hou; katakana)" to standardize katakana notation. When the conversion rule is applied, the term
" (jouhoo; katakana)"
is correctly standardized to
" (jouhou; katakana)"
However, when the same conversion rule is applied to a given term
" (jouhoon; katakana)"
the term is standardized to a false term
" (jouhoun; katakana)"
This has an influence on both synonym development process after the standardizing process and different notation development process following the synonym development process, so that expected development results cannot be attained.
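The effect can be reproduced with a few lines of Python, using the romanized readings given above as stand-ins for the katakana strings. The rule table and the function name standardize are hypothetical.

```python
# Partial term conversion rule, written with romanized stand-ins for katakana.
RULES = [("hoo", "hou")]


def standardize(term):
    """Apply every conversion rule wherever its left-hand side occurs."""
    for old, new in RULES:
        term = term.replace(old, new)
    return term


print(standardize("jouhoo"))    # the intended standardization
print(standardize("jouhoon"))   # the same rule also fires inside another term
```

Because the rule only sees a character sequence and not the boundaries of the term, it rewrites the second term as well, and the original notation cannot be recovered afterwards.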
One object of the present invention is to attain expected development results without performing the aforementioned standardizing process.
In the aforementioned quotation, the key word synonym development from " (keisanki; kanji)" to " (konpyuta; katakana)" based on a synonym dictionary is made by the steps of: converting once the search key word designated by the user into a katakana term; making the katakana term be subject to synonym development; and then making the resulting term be subject to kana/kanji conversion, kana/romaji conversion and kana/foreign language conversion. Therefore, the synonym dictionary must have an ability of development from katakana term to katakana term. In short, synonyms must be always written in katakana as follows.
Headword: " ( konpyuta; katakana)"
Synonym 1: " (keisanki; katakana)"
Synonym 2: " (jouhoushorishouchi; katakana)"
[" (iu; katakana)" → (" (iu; katakana)", " (yuu; katakana)")]
(a) Conversion rule pertaining to development corresponding to difference in notation between the old style and the new style of kanji.
[" (sei)" → (" ", " ", " ", " ")]
(b) Conversion rule pertaining to development corresponding to difference in notation between declensional kana of kanji.
[" (yomitori)" → (" ", " ")]
Conversion rule pertaining to development into various notations in similar syllables.
[" (pia)" → (" (pia)", " (piya)")]
Development from " (keisanki; kanji)" to " (kompyuta; katakana)" and " (johoshorisochi; kanji)"
Development from " (keisanki; kanji)" to " (denshitakujokeisanki; kanji)"
Development from " (keisanki; kanji)" to " (ofisuotomeishon; katakana)"
This causes a problem in that the scale of the dictionary is enlarged, because output terms having an expression form corresponding to the synonym development must be registered in both the kana/kanji conversion dictionary and kana/borrowed word conversion dictionary. A large number of homonyms exist in the Japanese language. This causes failure in synonym development. For example, the term " (kensaku; katakana)" can be interpreted as " (kensaku; kanji)" or can be interpreted as " (kensaku; kanji)". There arises a problem in that the distinction between the two words cannot be recognized by the synonym dictionary using only katakana notation. Further, there arises a problem in that homonym selection in katakana/kanji conversion after synonym development is made by interactive processing.
Further, a foreign language/kana conversion dictionary, a kana/kanji conversion dictionary and a kana/foreign language conversion dictionary are required for converting the search key word into katakana and converting the katakana word into a suitable-form word after synonym development. There arises a problem in that a great deal of labor is required for generation and maintenance of dictionaries, because a great variety of large-scale dictionaries must be used.
In short, the third object of the invention is to solve the problem in homonyms at the time of kana/kanji conversion and kana/foreign language conversion and solve the problem in generation and maintenance of large-scale dictionary used for the aforementioned conversion.
In the case where hundreds of synonyms and different notations are considered as key words in retrieval, a term comparator for collectively comparing these words is required. When retrieval is made under the consideration of synonyms and different notations with no use of the term comparator, the search time increases by hundreds of times so that it cannot bear practical use. A fourth object of the present invention is to provide a term comparator in which search processing can be made with no reduction of the comparison speed even though hundreds of key words are designated.
In the conventional search method using an automaton, all key words including different notations are listed and developed. Further, an automaton is generated based on the key words. Because the automaton thus generated is shaped like a tree, a very large number of automaton states are required.
In the case of retrieval with don't care character designation, all combinations of character codes allowed by don't care characters are listed and developed into key words. Because the automaton is generated based on the key words, a very large number of automaton states are required similarly to the case of different notation.
As described above, the increase in the number of automaton states causes the increase in automaton generation time and, accordingly, the increase in the capacity of a transition table for storing the automaton, that is, the increase in hardware.
An object of the invention is to provide a search method using an automaton in which the number of states is reduced by describing collectively transition in the automaton in the case where retrieval is made under the consideration of different notations and with designation of don't care characters to thereby shorten the automaton generation time, and in which the capacity of a state transition table is reduced to thereby attain retrieval by compact hardware.
When document data are further successively registered in the text data base, the magnetic disk device which forms the search data base becomes full at a certain point of time. In this case, it is necessary that the storage capacity of the system can be enlarged without losing the stored data. In the case where the capacity of the search text data base is enlarged to a capacity for 100000 texts, that is, a capacity for 4 GB, processing time increases as the storage capacity of the magnetic disk device is enlarged. Accordingly, the original object cannot be attained. Therefore, it is necessary to enlarge the storage capacity without deterioration in search time.
A fifth object of the invention is therefore to provide a search system having an architecture satisfying such a requirement.
In the search data base in the term search system, there are three important factors, namely, large storage capacity, continuous high-speed input/output of a plurality of files, and low cost. A collective-type magnetic disk unit satisfying these factors has been desired.
The conventional technique is designed to shorten data write/read time merely by apparently eliminating the seek time from the access time. In short, no consideration is given to the number of magnetic disk devices required for the data transfer rate needed by an external device. Accordingly, the conventional technique has a problem from the viewpoint of cost performance.
The conventional technique has an effect in that access time can be saved in the case where a file large in size of data such as picture data is stored separately in a plurality of magnetic disk devices. However, the conventional technique has a problem in that the access time becomes equal to the access time in one magnetic disk device in the case where writing/reading is carried out with respect to a file small in data size which can be stored in one magnetic disk device, because seek time cannot be omitted in this case.
Further, in the conventional technique, there is no consideration of continuous writing/reading with respect to a plurality of files. Accordingly, a write/read instruction from a higher-rank apparatus can be processed with respect to only one file. In access to a plurality of files, the file processing with respect to one file must be repeated. There arises a problem in that a large overhead time is required for the repetition.
As one component of the overhead time, there is the processing time required for retrieving information pertaining to magnetic disk device storage position from file identification codes to designate files as targets of access from the higher-rank apparatus.
In a conventional general magnetic disk device, a file identification code is expressed by a file name constituted by a string of character codes such as ASCII codes. Physical storage position must be found through retrieving file management information stored in the file management information area of the magnetic disk device, based on the file name. There arises a problem in that the processing time required for it is large.
An object of the invention is therefore to provide a low-cost collective-type magnetic disk unit which has such a large storage capacity that continuous high-speed input/output with respect to a plurality of files can be attained regardless of the size of the files.
On the other hand, document information is constituted by not only text data but graphic data such as pictorial data, photographic data, and the like. Accordingly, it is necessary to answer the requirement that the retrieved document can be seen in print image. A sixth object of the invention is to provide a search system having an architecture which can answer the requirement.
Further, the text data base is provided to be shared to a plurality of users. For example, it is necessary to make access to the text data base from a conversation-type workstation through LAN (local area network). Accordingly, the search system must have a function connected to the LAN to answer search requests from other workstations. A seventh object of the invention is to provide a full text search system having the aforementioned function.
A final object of the invention is to provide a full text search system which can answer the aforementioned problems.