The present invention relates to a retrieval system having an interface for intuitively retrieving sentences to which secondary data has been added from a database, and to a program for the system. The present invention further relates to a database suited to high-speed retrieval of such sentences through an intuitive interface.
More particularly, the present invention relates to a retrieval system having an intuitively searchable interface for an annotated corpus, which is one example of secondary data added documents.
A sub-string search technique (also known as a full-text search technique) is useful for retrieving sentences, which serve to convey information, from a collection of texts such as newspaper articles or patent specifications. The same technique is applied to the retrieval of HTML documents on the Internet. With this technique, only the character strings in the text displayed by a Web browser are searched, and the other parts of the HTML documents are ignored.
Although this technique allows plural key words to be used in one search and each document to be checked for a match between each key word and its character strings, the order of the key words within a sentence is not considered, because the relationship between the key words in such a query is a simple conjunction.
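The limitation described above can be illustrated with a minimal Python sketch. The function name `conjunctive_match` and the sample sentences are hypothetical and serve only as an illustration; the sketch is not taken from any particular full-text search product.

```python
from typing import List


def conjunctive_match(sentence: str, keywords: List[str]) -> bool:
    """Return True when every key word occurs somewhere in the sentence."""
    lowered = sentence.lower()
    return all(keyword.lower() in lowered for keyword in keywords)


# Hypothetical sample data: the same key words in opposite orders.
sentences = [
    "The dog bit the man.",
    "The man bit the dog.",
]

# Both sentences are returned, because a simple conjunction
# does not look at the order of the key words.
hits = [s for s in sentences if conjunctive_match(s, ["dog", "man"])]
```

Because the test is a plain conjunction, a user who wants only sentences in which "dog" precedes "man" cannot express that condition with this kind of query.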
In this specification, “a document” is a collection of sentences, each delimited by a period or the like, that expresses organized character information such as a newspaper article. An element of “a document” that ends with a period or the like is “a sentence”.
In a standard data interchange format such as Standard Generalized Markup Language (SGML), secondary data can be added to each related sentence as attributes in a tag. Sentences with secondary data added in such a format have two advantages: various types of information can be included in a tag, and data interchange is easy because the sentences are essentially written in text format.
An annotated corpus, which applies these advantages to a corpus by holding not only sentences but also secondary data relating to each sentence, is currently a focus of attention.
A corpus is generally a large computerized collection of linguistic data drawn from various documents such as newspaper articles or screenplays, and is intended to support language description or language analysis. In other words, a corpus is a large collection of examples of everyday usage in the form of electronic character data. In many cases, a corpus is searched by using GREP, and the retrieval result is displayed on screen in Key Word in Context (KWIC) format. A corpus provides a convenient way to run collocation searches that clarify how language expressions are used in practice, and is useful for natural language description or language analysis. A corpus is a collection of documents, and a document is a collection of sentences.
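A KWIC display of the kind mentioned above can be sketched as follows. This is an illustrative Python sketch under assumed conventions: the function name `kwic`, the bracket marking of the key word and the default context width are hypothetical choices, not a format prescribed by any corpus tool.

```python
from typing import List


def kwic(sentences: List[str], keyword: str, width: int = 20) -> List[str]:
    """List every occurrence of keyword with up to `width` characters of context."""
    lines: List[str] = []
    needle = keyword.lower()
    if not needle:
        return lines  # an empty key word would match everywhere
    for sentence in sentences:
        haystack = sentence.lower()
        start = haystack.find(needle)
        while start != -1:
            end = start + len(needle)
            left = sentence[max(0, start - width):start]
            right = sentence[end:end + width]
            # Right-align the left context so the key words line up in a column.
            lines.append(f"{left:>{width}} [{sentence[start:end]}] {right}")
            start = haystack.find(needle, end)
    return lines


for line in kwic(["She is a pretty woman.", "It was pretty cold."], "pretty"):
    print(line)
```

Aligning the key word in one column lets a reader compare the words that occur immediately before and after it, which is the basis of a collocation search.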
As a search method for a corpus, the full-text search method is widely used. For example, Unexamined Patent Publication (Kokai) No. 8-137898 discloses a document retrieval system that expands a key word input by a user into related key words by referring to a concept dictionary and searches a corpus with the related key words so as to improve search accuracy.
The annotated corpus mentioned above is a corpus in which secondary data such as a part of speech or a lemma is added, as attributes in tagged form, to each syntactic unit of a document such as a word, a phrase or a chapter. From the point of view of data input efficiency, most annotated corpuses in use adopt a format that adds secondary data to each word. The format of an annotated corpus is not limited to a tagged form; for example, a format in which items are simply separated by “/” can also be used. Annotated corpuses are widely used in the fields of language study and dictionary compilation. The British National Corpus (BNC) and The Bank of English are known examples of annotated corpuses. The file size of each corpus amounts to a few gigabytes.
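A per-word annotation in the "/"-separated style can be read as shown in the following Python sketch. The field order `word/part-of-speech/lemma`, the tag names and the `Token` type are assumptions made for illustration; actual corpuses such as the BNC define their own formats.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    """One word of a sentence together with its secondary data."""
    word: str
    pos: str    # part of speech
    lemma: str  # dictionary form


def parse_annotated(sentence: str) -> List[Token]:
    """Parse a sentence whose words are written as word/pos/lemma (assumed layout)."""
    tokens = []
    for item in sentence.split():
        word, pos, lemma = item.split("/")
        tokens.append(Token(word, pos, lemma))
    return tokens


# Hypothetical annotated sentence.
tokens = parse_annotated("pretty/ADJ/pretty women/NOUN/woman")
```

With the secondary data held per word in this way, a search can be restricted by part of speech or lemma instead of by surface character strings alone.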
One problem with these corpuses is that the results of searching such a large corpus without utilizing the secondary data (annotation) are often useless because too many matches occur. Therefore, many linguists want to use secondary data included in an annotated corpus, such as parts of speech, in order to limit the number of retrieved sentences.
However, in order to utilize secondary data, a query has to follow a special construction rule called a “Corpus Query Language” (CQL). A user therefore has to learn the practical expressions of each corpus, CQL, UNIX commands, software, programming and so on.
Further, many annotated corpuses in use are not suitable for fast retrieval because their formats were selected from the point of view of data input efficiency. For example, retrieving a phrase composed of more than one word, such as “pretty woman”, from a corpus in which secondary data is added to each word as SGML attributes takes much time.
In one aspect, the present invention relates to a retrieval system having an interface that allows secondary data added documents, such as an annotated corpus, to be searched by intuitive operations without detailed knowledge of the format of each corpus, commands or programming.
In another aspect, the present invention relates to a retrieval system and a database for retrieving secondary data added documents in a relatively short time.
In another aspect, the present invention relates to a retrieval system for a database storing secondary data added documents, said system including:
means for transmitting a graphical user interface (GUI) for searching, having data entry fields configured in a matrix, for display on a user's display;
means for storing retrieval data input in one or more data entry field(s) of the GUI;
means for locating each data entry field in which each datum is input;
means for generating a query comprising query units, each unit being generated by using a set of retrieval data input in one column of data entry fields of said matrix, and each unit corresponding to one element of said document,
in the case that retrieval data is input in more than one column of data entry fields, generating a query so as to retrieve sentences in which the order of the elements is the same as the order of said columns of data entry fields,
in the case that retrieval data is input in only one column of data entry fields, generating a query so as to retrieve sentences having an element corresponding to said retrieval data;
means for interpreting said query and searching said database;
means for transmitting search results for display on a user's display.
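The query generation recited above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the matrix is modelled as a dictionary whose rows are hypothetical attribute names (`word`, `pos`), each column yields one query unit, empty fields are skipped, and a sentence is taken to match when its elements satisfy the units in column order without necessarily being adjacent.

```python
from typing import Dict, List

Matrix = Dict[str, List[str]]  # attribute name (row) -> one cell per column
Unit = Dict[str, str]          # conditions that one sentence element must satisfy
Sentence = List[Dict[str, str]]


def build_query(matrix: Matrix) -> List[Unit]:
    """Turn each column that contains retrieval data into one query unit."""
    n_columns = max((len(cells) for cells in matrix.values()), default=0)
    units: List[Unit] = []
    for col in range(n_columns):
        unit = {attr: cells[col]
                for attr, cells in matrix.items()
                if col < len(cells) and cells[col]}
        if unit:  # columns with no input contribute no unit
            units.append(unit)
    return units


def matches(sentence: Sentence, query: List[Unit]) -> bool:
    """True when elements satisfying the units occur in the order of the columns."""
    position = 0
    for unit in query:
        while position < len(sentence):
            token = sentence[position]
            position += 1
            if all(token.get(attr) == value for attr, value in unit.items()):
                break  # unit satisfied; continue with the next unit
        else:
            return False  # ran out of elements before satisfying this unit
    return True


# Column 1: any adjective; column 2: the word "woman" (hypothetical input).
query = build_query({"word": ["", "woman"], "pos": ["ADJ", ""]})
```

When only one column holds retrieval data, the query has a single unit and `matches` reduces to checking whether any element of the sentence satisfies it, which corresponds to the single-column case above.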
According to the fourth aspect of the present invention, there is provided a program for retrieval from a database storing secondary data added documents, said program including the steps of:
transmitting a GUI for searching, having data entry fields configured in a matrix, for display on a user's display;
storing retrieval data input in one or more data entry field(s) of the GUI;
locating each data entry field in which each datum is input;
generating a query comprising query units, each unit being generated by using a set of retrieval data input in one column of data entry fields of said matrix, and each unit corresponding to one element of said document,
in the case that retrieval data is input in more than one column of data entry fields, generating a query so as to retrieve sentences in which the order of the elements is the same as the order of said columns of data entry fields,
in the case that retrieval data is input in only one column of data entry fields, generating a query so as to retrieve sentences having an element corresponding to said retrieval data;
interpreting said query and searching said database.
As a result, a retrieval of secondary data added documents becomes fast and accurate because elements of sentences are the unit of data input and searching.
Additionally, examples of secondary data added documents other than an annotated corpus include a collection of genetic maps describing information on each part of genes, and a collection of music with songs or scores. The present invention can thus be applied to database retrieval of gene or music information.
The present invention may be implemented with a stand-alone computer including a database, or with a client-server system connected via a network such as the Internet.
In the case of a client-server system, the database can reside in a server and the retrieval program can be installed in a client, or both the database and the retrieval program can reside in the same server and a retrieval interface can be transmitted to a client. It will be appreciated that the present invention is not limited to the above-mentioned variations.