Due to recent advances in information technology and the growing popularity of the Internet, a vast amount of information is now available in digital form on both the Internet and intranets. Such availability of information has provided many opportunities. In the commercial world, for example, online information is an advantageous source of business intelligence that is crucial to a company's survival and adaptability in a highly competitive environment. Unfortunately, a user in this situation is usually faced with too much information and too little useful or actionable knowledge. The processes of extracting and discovering knowledge, or knowledge extraction and discovery, from text documents or similar textual data are thus very important tasks of considerable application potential and impact.
Conventional methods and systems of knowledge extraction and discovery from text documents typically focus on the extraction of information or meta-data from free-text documents. Meta-data, which are condensed and typically semi-structured representations of text content, can be considered the raw form of knowledge and are essentially facts specified in the text documents. Meta-data do not include knowledge that is not mentioned explicitly in the text. In addition, the conventional methods and systems usually extract too much information, and it is a painstaking process for a user to organize the extracted information and discover knowledge from it.
Specifically, in U.S. Pat. No. 6,076,088 and U.S. Pat. No. 6,263,335, both entitled “Information Extraction System and Method using Concept-Relation-Concept (CRC) Triples” by Paik et al., systems are proposed for building subject knowledge bases in the form of Concept-Relation-Concept (CRC) triples from text documents. The systems can acquire new knowledge by automatically identifying new names, events, or concepts from text documents.
In International Patent Publication No. WO 01/01289 entitled “Semantic Processor and Method with Knowledge Analysis of And Extraction from Natural Language Documents” by Tsourikov et al., the use of natural language processing methods is proposed for the extraction of Subject-Action-Object (SAO) triplets from text documents upon a user request. The methods further include normalization of the SAO triplets and their organization into Problem Folders, each of which is named by an Action-Object (AO) portion and contains a list of subjects. In International Patent Publication No. WO 01/82122 entitled “Expanded Search and Display of SAO Knowledge Based Information” by Tsourikov et al., the methods of WO 01/01289 are extended with methods for normalizing SAO triplets through paraphrasing of AOs.
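By way of illustration only, the organization of SAO triplets into Problem Folders described above might be sketched as follows. The names `SAO`, `normalize`, and `build_problem_folders` are hypothetical, and the normalization shown (lowercasing and whitespace collapsing) is merely a placeholder assumption; it is not the normalization disclosed in the cited publications.

```python
from collections import defaultdict
from typing import NamedTuple


class SAO(NamedTuple):
    """A Subject-Action-Object triplet (hypothetical representation)."""
    subject: str
    action: str
    obj: str


def normalize(text: str) -> str:
    # Placeholder normalization (an assumption): lowercase, collapse whitespace.
    return " ".join(text.lower().split())


def build_problem_folders(triplets):
    """Group subjects under a folder named by the Action-Object portion."""
    folders = defaultdict(list)
    for t in triplets:
        name = f"{normalize(t.action)} {normalize(t.obj)}"
        subject = normalize(t.subject)
        if subject not in folders[name]:  # avoid listing a subject twice
            folders[name].append(subject)
    return dict(folders)


triplets = [
    SAO("Laser", "cuts", "steel"),
    SAO("Water jet", "cuts", "Steel"),
    SAO("laser", "cuts", "steel"),
    SAO("Heat", "melts", "ice"),
]
folders = build_problem_folders(triplets)
```

In this sketch, each folder merely regroups facts already stated in the text; no triplet is produced that was not extracted from a document.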
A critical problem associated with the foregoing proposals lies in their shared inability to derive new or hidden knowledge from text documents. Such knowledge is often the critical differentiating factor in gaining an edge over competitors.
There is therefore a need for a method and a system for knowledge extraction and discovery from text documents that address this problem.