FIG. 1 illustrates an example of the relationship among various text analyses for documents. The text analyses include a lexical analysis, a morphological analysis (part-of-speech analysis), a syntax analysis (dependency analysis), a semantic analysis, and the like. The lexical analysis is processing for dividing the sentences in a document into words based on their written form (notation). The morphological analysis is processing for dividing sentences into morphemes and assigning part-of-speech information to each morpheme. A morpheme obtained by the morphological analysis may be handled as a word.
The syntax analysis is processing for assembling clauses, each including an independent word (content word), based on the part-of-speech information of the words, and for finding a dependency relationship (modification relationship) between two clauses based on the independent words included in those clauses. The semantic analysis is processing for determining the meaning of a synonymous expression or a polysemous expression based on the dependency relationship, or processing for extracting synonyms from among a plurality of words. Synonym extraction, which is a practical form of semantic analysis, can be performed based on words alone, or on words and part-of-speech information. Using the dependency relationship further improves the accuracy of the semantic analysis.
In the syntax analysis, for example, structures are defined as rules in a rule base, and the analysis is performed by combining a plurality of structures as needed. Examples of rules used in the syntax analysis are as follows.
S→NP VP (S: sentence, NP: noun phrase, VP: verb phrase)
VP→V S (VP: verb phrase, V: verb, S: clause)
The above-described rules are applied repeatedly, and a tree structure corresponding to a sentence, as illustrated in FIG. 2, is eventually generated. Here, “A” represents an adjective, “N” represents a noun, and “Adv” represents an adverb. The dependency relationship is determined from the generated tree structure. Examples of methods for determining the dependency relationship include a method for determining it from the tree structure of the whole sentence, a method for determining it from a partial tree structure by focusing on a clause or the like, and so on.
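The repeated application of such rules can be sketched as a small top-down parser. The following is a minimal illustration only, not the analysis method of any particular system. The rules S→NP VP and VP→V S are taken from the description above, whereas the rules VP→V and NP→N, the function names, and the part-of-speech-tagged sample sentence are assumptions added so that the example terminates and runs.

```python
# Rule base. The first S rule and the first VP rule come from the text;
# "VP -> V" and "NP -> N" are assumed here so that parsing can terminate.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "VP": [["V", "S"], ["V"]],
    "NP": [["N"]],
}


def parse(symbol, tokens, pos):
    """Yield (tree, next_pos) for every way `symbol` matches tokens[pos:]."""
    if symbol not in GRAMMAR:
        # Terminal symbol: compare against the part-of-speech tag of the token.
        if pos < len(tokens) and tokens[pos][1] == symbol:
            yield (symbol, tokens[pos][0]), pos + 1
        return
    for rhs in GRAMMAR[symbol]:
        yield from expand(symbol, rhs, tokens, pos)


def expand(symbol, rhs, tokens, pos):
    """Apply one rule `symbol -> rhs`, matching its children left to right."""
    partial = [([], pos)]
    for child in rhs:
        partial = [
            (kids + [tree], nxt)
            for kids, cur in partial
            for tree, nxt in parse(child, tokens, cur)
        ]
    for kids, end in partial:
        yield (symbol, *kids), end


def parse_sentence(tokens):
    """Return the first tree that covers the whole sentence, or None."""
    for tree, end in parse("S", tokens, 0):
        if end == len(tokens):
            return tree
    return None


# Hypothetical input: (word, part of speech) pairs from a morphological analysis.
tokens = [("Alice", "N"), ("thinks", "V"), ("Bob", "N"), ("sleeps", "V")]
tree = parse_sentence(tokens)
print(tree)
```

In this sketch each tree node is a tuple whose first element is the symbol, so a dependency relationship could be read off either from the whole tree or from a partial tree such as the embedded S node.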
FIG. 3 illustrates an example of application processing that uses an analysis result of a conventional text analysis. A document 311 is compressed using a compression dictionary 301 and is stored as a compressed document 312. When the application processing is performed, the compressed document 312 is decompressed and the document 311 is restored. The lexical analysis and the syntax analysis are then performed on the document 311 using an analysis dictionary 302, and an analysis result 313 is thereby generated. Next, word information and the like are tabulated using the document 311 and the analysis result 313, and a tabulation result 314 is thereby generated. The analysis result 313 and the tabulation result 314 are utilized by an application program or the like.
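The flow of FIG. 3 can be sketched as follows, under several assumptions. The standard zlib module stands in for compression with the compression dictionary 301, a whitespace split stands in for the lexical analysis with the analysis dictionary 302, and the syntax analysis is omitted. All function names and the sample document are hypothetical.

```python
import zlib
from collections import Counter


def compress_document(text):
    """Stand-in for compression using the compression dictionary 301."""
    return zlib.compress(text.encode("utf-8"))


def restore_document(blob):
    """Decompress the stored data and restore the original document."""
    return zlib.decompress(blob).decode("utf-8")


def lexical_analysis(text):
    """Stand-in for the lexical analysis: split on whitespace, drop punctuation."""
    words = (w.strip(".,").lower() for w in text.split())
    return [w for w in words if w]


def tabulate(words):
    """Tabulate word information; here, only occurrence counts."""
    return Counter(words)


document = "The cat sat. The dog sat."              # document 311
compressed = compress_document(document)            # compressed document 312
restored = restore_document(compressed)             # document 311, restored
analysis_result = lexical_analysis(restored)        # analysis result 313
tabulation_result = tabulate(analysis_result)       # tabulation result 314
print(tabulation_result.most_common(2))
```

As the sketch makes visible, the compressed document has to be fully decompressed before any analysis or tabulation can start, and the analysis operates on the restored text rather than on the compressed data.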
A data compression method is also known in which structure information of structured data is used to reduce the data amount and apply encryption at the same time, so that the data can then be exchanged (see, for example, Patent Document 1). In this data compression method, a compression module separates internal expression data of the structured data into the structure information and content using syntax designation information given in advance, and then compresses the structure information and the content together. The compressed data is delivered from a transmitting-side system to a receiving-side system through a network. A decompression module restores the received compressed data to the internal expression data of the structured data using the syntax designation information.
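The idea of separating structured data into structure information and content before compressing them together can be illustrated roughly as follows. This sketch is not the method of Patent Document 1: XML is assumed as the structured data, JSON and zlib are assumed as the container and compressor, the encryption step is omitted, attributes and tail text are ignored, and all function names are hypothetical.

```python
import json
import zlib
import xml.etree.ElementTree as ET


def separate(root):
    """Split an element tree into structure (tag events) and content (texts)."""
    structure, content = [], []

    def walk(node):
        structure.append(["open", node.tag])
        content.append(node.text or "")
        for child in node:
            walk(child)
        structure.append(["close", node.tag])

    walk(root)
    return structure, content


def compress_structured(root):
    """Compress the structure information and the content together."""
    structure, content = separate(root)
    payload = json.dumps({"structure": structure, "content": content})
    return zlib.compress(payload.encode("utf-8"))


def decompress_structured(blob):
    """Restore the element tree from the compressed data."""
    payload = json.loads(zlib.decompress(blob).decode("utf-8"))
    texts = iter(payload["content"])
    stack, root = [], None
    for kind, tag in payload["structure"]:
        if kind == "open":
            node = ET.Element(tag)
            node.text = next(texts) or None
            if stack:
                stack[-1].append(node)
            else:
                root = node
            stack.append(node)
        else:
            stack.pop()
    return root


original = ET.fromstring("<doc><title>Report</title><body>Hello</body></doc>")
blob = compress_structured(original)        # transmitting side
restored_tree = decompress_structured(blob)  # receiving side
print(ET.tostring(restored_tree, encoding="unicode"))
```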
Patent Document 1: Japanese Laid-open Patent Publication No. 2003-44459