The present invention relates to databases and pertains particularly to dividing sentences into phrases in preparation for using attributes to organize and access documents.
The collection and use of information is important both for individuals and for corporate entities. This is particularly true in certain fields, such as news agencies and publishing companies, where the collection and management of data is essential.
In early data management systems, data was collected and preserved. When needed, data was searched out one article at a time. Such traditional data management lacks structure and is insufficient for a modern society that values efficiency and speed.
In more recent years, the use of computers has greatly increased the efficiency of data management. Data management by computer is generally divided into two systems. In one system, data is sorted by index. In the other system, data is sorted using multiple indexes similar to the use of a bibliographical card index.
When sorting by index, a subjective judgment about the data is made according to existing sorting criteria. Based on this subjective judgment, the data is indexed and stored in a corresponding file. When a particular lot of data is desired, a search is performed by index in an attempt to locate the appropriate data.
One drawback of a single index system is that sorting is done manually, in reliance upon the subjective judgment of an administrator. Data that should be classified under a first category might be misplaced in a second category simply because the administrator failed to recognize the nature of the data. Since any lot of data is generally put under only one category, the lot of data is effectively missing if put under the wrong category by mistake. Therefore, in a single index system it is easy for data to become lost or difficult to retrieve.
In multiple index systems, multiple indexes are used. For example, separate columns can be used to allow sorting by author, log-in date, log-in publication, topic or serial number. The data can then be retrieved using an index for any column.
However, there are also deficiencies with multiple index systems. For example, for a particular lot of data, none of the available columns may satisfy the needs for organization of the data. It may still be difficult, for instance, to define and classify data used by a news agency or a publishing company. If there are seven co-authors of a given article and the column used to index authors allows the entry of at most three authors, then only three of the seven co-authors can be used to index the article; the remaining four authors must be omitted from the entry. A later search for the works of these four authors would not turn up the article. Furthermore, selecting which authors to include in the entry and which to drop requires a subjective judgment.
Key words can be used to index data. For example, to index a target article, keywords can be used such as "Politics", "Related to Crossing the Straits", or "Straits Exchange Foundation". These keywords can be stored with the document, or the database system can perform a full text index through all documents in the database searching for a keyword. However, use of keywords for searching lacks accuracy, since articles may contain the searched key words while the key words have different meanings as used in different articles. Thus searching by key word is often not worth the effort.
In accordance with a preferred embodiment of the present invention, the process of dividing sentences into phrases is automated. The sentence is divided into sub-sentences using statistical analysis. Then, the sub-sentences are divided into phrases, also using statistical analysis.
For example, for each pair of adjacent words in the sentence a metric is calculated which represents a strength of disconnection between the adjacent words. The sentence is divided into sub-sentences at locations in the sentence where the metric exceeds a first threshold.
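The splitting step described above can be sketched as follows. This is a minimal illustration, assuming the disconnection metric is supplied as a function of two adjacent characters; the function and parameter names are not from the source.

```python
def split_sentence(chars, metric, threshold):
    """Split a sentence into sub-sentences at each adjacent pair of
    characters whose disconnection metric exceeds the threshold.
    `metric(a, b)` is assumed to return the strength of disconnection
    between adjacent characters a and b."""
    subs, current = [], [chars[0]]
    for a, b in zip(chars, chars[1:]):
        if metric(a, b) > threshold:
            # Strong disconnection: close the current sub-sentence here.
            subs.append("".join(current))
            current = [b]
        else:
            current.append(b)
    subs.append("".join(current))
    return subs
```

For example, with a metric that is high only between "b" and "c", the string "abcd" splits into "ab" and "cd".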
In the preferred embodiment, the metric is a cutability measure that is calculated as a sum of backward entropy, forward entropy and mutual information.
Forward entropy (FE) of a character Ci which immediately precedes a character Cj in a sentence is calculated using the following equation:

$$FE(C_i) = -\sum_{C_j} P_F(C_j \mid C_i) \log P_F(C_j \mid C_i)$$
where PF(Cj|Ci) is the probability of Cj immediately following Ci.
Backward entropy (BE) of a character Ci which immediately follows a character Cj in a sentence is calculated using the following equation:

$$BE(C_i) = -\sum_{C_j} P_B(C_j \mid C_i) \log P_B(C_j \mid C_i)$$
where PB(Cj|Ci) is the probability of Cj immediately preceding Ci.
Mutual information (MI) of a character Ci that immediately precedes a character Cj in a sentence is calculated using the following equation:

$$MI(C_i, C_j) = \log \frac{P(C_i C_j)}{P(C_i)\, P(C_j)}$$
where P(CiCj) is the probability that Cj comes immediately after Ci,
where P(Ci) is the probability that any character chosen at random in the corpus is Ci, and
where P(Cj) is the probability that any character chosen at random in the corpus is Cj.
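The cutability measure of the preferred embodiment can be sketched as follows. This is an illustrative implementation only, assuming the probabilities are estimated from raw character and adjacent-character (bigram) counts over a corpus; the function names and the absence of smoothing are assumptions, not part of the source.

```python
import math
from collections import Counter

def cutability(corpus, ci, cj):
    """Cutability between adjacent characters ci and cj: the sum of the
    forward entropy of ci, the backward entropy of cj, and the mutual
    information of the pair, all estimated from bigram counts."""
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))
    total_uni = sum(unigrams.values())
    total_bi = sum(bigrams.values())

    # Forward entropy of ci: uncertainty about which character follows ci.
    followers = {b: n for (a, b), n in bigrams.items() if a == ci}
    n_ci = sum(followers.values())
    fe = -sum((n / n_ci) * math.log(n / n_ci) for n in followers.values())

    # Backward entropy of cj: uncertainty about which character precedes cj.
    preceders = {a: n for (a, b), n in bigrams.items() if b == cj}
    n_cj = sum(preceders.values())
    be = -sum((n / n_cj) * math.log(n / n_cj) for n in preceders.values())

    # Mutual information of the adjacent pair (ci, cj).
    p_pair = bigrams[(ci, cj)] / total_bi
    p_ci = unigrams[ci] / total_uni
    p_cj = unigrams[cj] / total_uni
    mi = math.log(p_pair / (p_ci * p_cj)) if p_pair > 0 else float("-inf")

    return fe + be + mi
```

A sentence would then be divided into sub-sentences at each adjacent pair whose cutability exceeds the first threshold.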
In order to divide a sub-sentence into phrases, the occurrence in a corpus of each word combination of the sub-sentence beginning with the first word of the sub-sentence is determined. Likewise, the occurrence in the corpus of each word combination of the sub-sentence beginning with the word immediately following the first word is determined. A word combination of a first number of words starting with the first word (and continuing with adjacent words in the sub-sentence) is then selected to be used as a phrase when the ratio of the occurrence of the word combination of the first number of words starting with the first word to the occurrence of the word combination of the first number of words starting with the word immediately following the first word is greater than the corresponding ratio for any other number of words, provided the first number is less than a predetermined threshold.
A next phrase can be determined in the same way. Specifically, for the next word in the sub-sentence not included in the word combination starting with the first word, the occurrence in the corpus of each word combination of the sub-sentence beginning with that next word is determined, as is the occurrence of each word combination beginning with the word immediately following that next word. A word combination of a second number of words starting with the next word is selected to be used as a phrase when the ratio of the occurrence of the word combination of the second number of words starting with the next word to the occurrence of the word combination of the second number of words starting with the word immediately following the next word is greater than the corresponding ratio for any other number of words, provided the second number is less than the predetermined threshold.
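The phrase-selection procedure above can be sketched as follows. This is a hypothetical illustration: `count` is assumed to map a tuple of words to its occurrence count in the corpus, `max_len` stands in for the predetermined threshold, and the tie handling is an assumption.

```python
def next_phrase_length(words, start, count, max_len=5):
    """Choose the length n (below max_len) of the phrase beginning at
    words[start]: the n that maximizes the ratio of the corpus count of
    the n-word combination starting at `start` to the corpus count of
    the n-word combination starting one word later."""
    best_n, best_ratio = 1, float("-inf")
    for n in range(1, max_len):
        head = tuple(words[start:start + n])              # starts with this word
        shifted = tuple(words[start + 1:start + 1 + n])   # starts one word later
        if len(head) < n or len(shifted) < n:
            break  # ran off the end of the sub-sentence
        denom = count.get(shifted, 0)
        if denom == 0:
            continue
        ratio = count.get(head, 0) / denom
        if ratio > best_ratio:
            best_n, best_ratio = n, ratio
    return best_n

def split_into_phrases(words, count, max_len=5):
    """Repeatedly take the best phrase length starting at the current word."""
    phrases, i = [], 0
    while i < len(words):
        n = next_phrase_length(words, i, count, max_len)
        phrases.append(words[i:i + n])
        i += n
    return phrases
```

For example, if the three-word combination "new york times" occurs in the corpus far more often relative to its one-word-shifted counterpart than any shorter combination does, it is kept together as a single phrase.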
This methodology of dividing sub-sentences into phrases can be used to divide any sentence portion into phrases, including a sentence portion that comprises the entire sentence.
Once sentences have been divided into phrases, these phrases may be checked, if necessary, by a human operator or by some other method such as a syntactic analyzer.
The present invention allows for automated division of sentences into phrases. This can significantly reduce the time and effort needed to generate phrases within a document.