FIG. 1 illustrates relationships between various types of text analyses that are performed on documents. Text analyses include, for example, a morphological analysis (part-of-speech analysis), a syntax analysis (modification analysis), and a semantic analysis. A morphological analysis is a process in which a sentence is segmented into morphemes and part-of-speech information is given to each of the morphemes. Morphemes obtained in a morphological analysis may be treated as words in some cases. A lexical analysis may be performed as part of a morphological analysis. A lexical analysis is a process in which a sentence in a document is segmented into words on the basis of notation.
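The segmentation and part-of-speech assignment described above can be sketched as follows. This is a minimal illustration only: the lexicon contents, the part-of-speech labels, and the longest-match strategy are assumptions made for the example and are not taken from the source, which does not specify how the morphological analysis is implemented.

```python
# Hypothetical lexicon: surface form -> part-of-speech label (assumed values).
LEXICON = {
    "WATASHI": "pronoun",
    "WA": "particle",
    "GAKKOU": "noun",
    "DE": "particle",
    "HATARAITE": "verb",
    "IMASU": "auxiliary verb",
}


def morphological_analysis(sentence):
    """Segment an unspaced sentence into (morpheme, part of speech) pairs.

    Uses greedy longest-match lookup against LEXICON; characters that
    match no entry are emitted one at a time with the label "unknown".
    """
    morphemes = []
    max_len = max(len(entry) for entry in LEXICON)
    pos = 0
    while pos < len(sentence):
        # Try the longest candidate first, then shorter ones.
        for length in range(min(max_len, len(sentence) - pos), 0, -1):
            candidate = sentence[pos:pos + length]
            if candidate in LEXICON:
                morphemes.append((candidate, LEXICON[candidate]))
                pos += length
                break
        else:
            morphemes.append((sentence[pos], "unknown"))
            pos += 1
    return morphemes


result = morphological_analysis("WATASHIWAGAKKOUDEHATARAITEIMASU")
```

Here `result` holds one pair per morpheme, in sentence order, which is the kind of input a later syntax or semantic analysis would consume.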
A syntax analysis is a process in which phrases containing independent words are synthesized on the basis of part-of-speech information of words, and a modification relationship (qualification relationship) between phrases is obtained on the basis of the independent words included in the phrases. A semantic analysis is a process in which meaning relationships between words contained in sentences are analyzed on the basis of, for example, a modification relationship. A semantic analysis result can be used for a process of obtaining the meaning of a synonymous expression or a polysemous expression, or for a process of extracting a word having a similar meaning from among a plurality of words. While a semantic analysis that does not aim at very high accuracy can be performed on the basis of words alone or on the basis of words and pieces of part-of-speech information, using modification relationships increases the accuracy of a semantic analysis. In a semantic analysis, part of the processes of a syntax analysis may be performed.
A semantic analysis uses a result of a morphological analysis of a natural sentence so as to obtain the semantic structure of that natural sentence. Using a semantic structure makes it possible to express, as data treated by computers, what a natural sentence means.
A semantic structure includes, for example, a plurality of nodes respectively representing the concepts of a plurality of words included in a morphological analysis result, and directed arcs connected to the nodes. When an arc is connected to only one node, that arc represents an attribute of the node to which it is connected. When an arc is connected to two nodes, that arc represents the relationship between the two nodes. In some cases, one node is connected to a plurality of arcs. A semantic structure is expressed by a graph structure (directed graph) that is created from, for example, nodes and arcs. FIG. 2 exemplifies a graph structure corresponding to the sentence “WATASHI WA GAKKOU DE HATARAITE IMASU” (“I work for a school”).
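As a rough illustration of such a node-and-arc structure, the following sketch models a directed graph in which a one-node arc is stored as an attribute of its node and a two-node arc links a source node to a target node. The class names and the labels "AGENT", "PLACE", and "PRESENT" are hypothetical; the actual labels used in FIG. 2 are not reproduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    concept: str
    # Arcs connected to this node only, i.e. attributes of the node.
    attributes: list = field(default_factory=list)


@dataclass
class Arc:
    label: str
    source: Node
    target: Node


class SemanticStructure:
    """Directed graph of concept nodes and labeled arcs."""

    def __init__(self):
        self.nodes = []
        self.arcs = []

    def add_node(self, concept, attributes=()):
        node = Node(concept, list(attributes))
        self.nodes.append(node)
        return node

    def add_arc(self, label, source, target):
        arc = Arc(label, source, target)
        self.arcs.append(arc)
        return arc

    def arcs_from(self, node):
        # One node may be connected to a plurality of arcs.
        return [arc for arc in self.arcs if arc.source is node]


# "WATASHI WA GAKKOU DE HATARAITE IMASU" with assumed (hypothetical) labels.
structure = SemanticStructure()
work = structure.add_node("HATARAKU (work)", attributes=["PRESENT"])
me = structure.add_node("WATASHI (I)")
school = structure.add_node("GAKKOU (school)")
structure.add_arc("AGENT", work, me)
structure.add_arc("PLACE", work, school)
```

Storing one-node arcs on the node itself keeps the two arc kinds distinct while still allowing a single node to carry several arcs.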
A semantic analysis defines structures on the basis of, for example, rules, and performs an analysis while combining a plurality of structures as needed. An example of rules used in semantic analyses is case grammar, proposed by Fillmore. According to case grammar, a sentence, for example, is considered to consist of one verb and a plurality of case categories. By repeatedly applying such a rule, a graph structure corresponding to one sentence, as illustrated in FIG. 2, can eventually be generated.
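The idea of one verb combined with a plurality of case categories can be sketched as below. The particle-to-case table and the case names are assumptions chosen for illustration; they are not the rule set of the source, nor a faithful rendering of Fillmore's own case inventory.

```python
# Hypothetical rule table: Japanese particle -> case category (assumed).
PARTICLE_TO_CASE = {
    "WA": "AGENT",
    "DE": "LOCATION",
    "WO": "OBJECT",
}


def build_case_frame(verb, phrases):
    """Apply the particle rule to each (noun, particle) phrase in turn.

    Returns a frame holding the single verb and the case categories
    filled so far; phrases with an unknown particle are skipped.
    """
    frame = {"verb": verb, "cases": {}}
    for noun, particle in phrases:
        case = PARTICLE_TO_CASE.get(particle)
        if case is not None:
            frame["cases"][case] = noun
    return frame


frame = build_case_frame(
    "HATARAITE IMASU",
    [("WATASHI", "WA"), ("GAKKOU", "DE")],
)
```

Each filled case in `frame` corresponds to one arc from the verb node to a noun node, so repeating the rule over all phrases yields the arcs of a sentence-level graph.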
FIG. 3 illustrates an example of a utilization process in which a text analysis result is utilized. A document 311 is compressed by using a compression dictionary 301 and is stored as a compressed document 312. To utilize the compressed document 312, it is first decompressed so that the document 311 is restored, and a morphological analysis and a semantic analysis are then performed on the document 311 by using an analysis dictionary 302 so as to generate a semantic analysis result 313. The semantic analysis result 313 is utilized by an application program or the like.
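The flow of FIG. 3 can be sketched as follows, under simplifying assumptions: the compression dictionary maps each word to an integer code, the analysis dictionary maps each word to a part-of-speech label, and a plain dictionary lookup stands in for the morphological and semantic analyses. The dictionary contents and function names are hypothetical, and the source does not specify the actual coding scheme.

```python
# Stand-in for compression dictionary 301 (assumed word -> code mapping).
COMPRESSION_DICTIONARY = {
    "WATASHI": 0, "WA": 1, "GAKKOU": 2, "DE": 3, "HATARAITE": 4, "IMASU": 5,
}
# Stand-in for analysis dictionary 302 (assumed word -> label mapping).
ANALYSIS_DICTIONARY = {
    "WATASHI": "pronoun", "WA": "particle", "GAKKOU": "noun",
    "DE": "particle", "HATARAITE": "verb", "IMASU": "auxiliary verb",
}


def compress(document):
    """Encode a space-separated document as a list of codes."""
    return [COMPRESSION_DICTIONARY[word] for word in document.split()]


def decompress(codes):
    """Restore the original document from its codes."""
    code_to_word = {code: word for word, code in COMPRESSION_DICTIONARY.items()}
    return " ".join(code_to_word[code] for code in codes)


def analyze(document):
    """Placeholder analysis: look each word up in the analysis dictionary."""
    return [(word, ANALYSIS_DICTIONARY.get(word, "unknown"))
            for word in document.split()]


document = "WATASHI WA GAKKOU DE HATARAITE IMASU"
compressed = compress(document)      # stored form (compressed document 312)
restored = decompress(compressed)    # document 311 restored for utilization
analysis_result = analyze(restored)  # stand-in for analysis result 313
```

Note that in this flow the two dictionaries are separate and the analysis operates only on the restored text, so every utilization requires a full decompression first.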
Regarding this, a technique is known in which, for example, a document is rewritten so that its semantic contents are not changed, and document compression is performed after the rewriting by converting the document into bit strings while referring to a compression table (see Patent Document 1 for example). Also, a technique of obtaining a method of accessing and searching for information via a data communication system is known (see Patent Document 2 for example). A technique is also known that makes it possible to analyze document contents without preparing a dictionary for a natural language (see Patent Document 3 for example).
Patent Document 1: Japanese Laid-open Patent Publication No. 7-160684
Patent Document 2: Japanese Laid-open Patent Publication No. 2008-135023
Patent Document 3: Japanese Laid-open Patent Publication No. 7-129588