1. Technical Field
The present invention relates generally to a system and computer program product for processing documents. More particularly, the present invention relates to a system and computer program product for selecting a structure to represent tabular information.
2. Description of the Related Art
Documents include information in many forms. For example, textual information arranged as sentences and paragraphs conveys information in a narrative form.
Some types of information are presented in a tabular organization. For example, a document can include tables for presenting financial information, organizational information, and generally, any data items that are related to one another through some relationship.
Natural language processing (NLP) is a technique that facilitates exchange of information between humans and data processing systems. For example, one branch of NLP pertains to transforming given content into a human-usable language or form. In such a case, NLP can accept a document whose content is in a computer-specific language or form, and produce a document whose corresponding content is in a human-readable form.
A question and answer system (Q&A system) is an artificial intelligence application executing on data processing hardware. A Q&A system answers questions pertaining to a given subject-matter domain presented in natural language.
Typically, a Q&A system is given access to a collection of domain-specific information, on the basis of which the Q&A system answers questions pertaining to that domain. For example, a Q&A system accesses a body of knowledge about the domain, where the body of knowledge (knowledgebase) can be organized in a variety of configurations. A knowledgebase of a domain can include a structured repository of domain-specific information, such as ontologies, or unstructured data related to the domain, or a collection of natural language documents about the domain. IBM Watson is an example of a Q&A system. (IBM and Watson are trademarks of International Business Machines Corporation in the United States and in other countries.)
A Q&A system can be configured to receive inputs from various sources. For example, the Q&A system may receive, as input over a network, a corpus of electronic documents or other data, data from a content creator, information from one or more content users, and other such inputs from other possible sources of input. Some or all of the inputs to the Q&A system may be routed through network 102. The various computing devices on the network may include access points for content creators and content users. Some of these computing devices may include devices for storing the corpus of data. The network may include local network connections and remote connections, such that the Q&A system may operate in environments of any size, including local and global, e.g., the Internet. Additionally, the Q&A system can be configured to serve as a front-end system that makes available a variety of knowledge extracted from or represented in documents, network-accessible sources, and/or structured data sources. In this manner, some processes populate the Q&A system with input interfaces to receive knowledge requests and respond accordingly.
A content creator creates content in a document for use as part of a corpus of data with the Q&A system. The document may include any file, text, article, or source of data for use in the Q&A system. Content users input questions to the Q&A system, which the Q&A system answers using the content in the corpus of data. When a process evaluates a given section of a document for semantic content, the process can use a variety of conventions to query that document through the Q&A system. One convention is to send the query to the Q&A system as a well-formed question. Semantic content is content based on the relation between signifiers, such as words, phrases, signs, and symbols, and what they stand for, that is, their denotation or connotation. In other words, semantic content is content that interprets an expression, such as by using NLP.
The process sends well-formed questions (e.g., natural language questions) to the Q&A system. The Q&A system interprets the question and provides a response to the content user containing one or more answers to the question. The Q&A system can also provide the response to users as a ranked list of answers.
As an example, the IBM Watson™ Q&A system receives an input question, parses the question to extract the major features of the question, uses the extracted features to formulate queries, and applies those queries to the corpus of data. Based on the application of the queries to the corpus of data, the Q&A system generates a set of hypotheses, or candidate answers, to the input question by searching the corpus of data for portions that have some potential for containing a valuable response to the input question.
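The candidate generation step described above can be illustrated with a minimal Python sketch. It is not the IBM Watson implementation: the function names `extract_features` and `generate_candidates`, the stopword list, and the sample corpus are all assumptions introduced here, and simple keyword overlap stands in for the query formulation the text describes.

```python
import re

# Hypothetical stopword list; a real system would use far richer parsing.
STOPWORDS = {"a", "an", "the", "is", "of", "in", "what", "which",
             "who", "was", "for", "to", "and"}


def extract_features(question):
    """Return the lowercase content words of the question, in order.

    This stands in for extracting the 'major features' of a question.
    """
    tokens = re.findall(r"[a-z0-9]+", question.lower())
    return [t for t in tokens if t not in STOPWORDS]


def generate_candidates(question, corpus):
    """Return the corpus passages that share at least one feature term
    with the question, preserving corpus order."""
    features = set(extract_features(question))
    candidates = []
    for passage in corpus:
        passage_terms = set(re.findall(r"[a-z0-9]+", passage.lower()))
        if features & passage_terms:
            candidates.append(passage)
    return candidates


# Illustrative data only.
corpus = [
    "Total revenue in 2012 was 4.2 million dollars.",
    "The company was founded in Ohio.",
    "Employees receive annual training.",
]
candidates = generate_candidates("What was the revenue in 2012?", corpus)
```

In this sketch only the first passage survives as a candidate, because it is the only one that shares the terms "revenue" or "2012" with the question.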
The IBM Watson™ Q&A system then performs deep analysis, using a variety of reasoning algorithms, on the language of the input question and the language used in each of the portions of the corpus of data found during the application of the queries. There may be hundreds or even thousands of reasoning algorithms applied, each of which performs a different analysis, e.g., comparisons, and generates a score. For example, some reasoning algorithms may look at the matching of terms and synonyms within the language of the input question and the found portions of the corpus of data. Other reasoning algorithms may look at temporal or spatial features in the language, while others may evaluate the source of the portion of the corpus of data and evaluate its veracity.
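As a hedged illustration of what two such reasoning algorithms might look like, the following Python sketch defines one scorer for term and synonym matching and one for a temporal feature (four-digit years). The function names, the synonym table, and the scoring rules are assumptions made for this example and are not drawn from the IBM Watson system.

```python
import re

YEAR_PATTERN = r"\b(?:19|20)\d{2}\b"


def tokenize(text):
    """Return the set of lowercase alphanumeric tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def term_match_score(question, passage, synonyms=None):
    """Fraction of question terms found in the passage, where a term also
    counts as found if one of its listed synonyms appears."""
    synonyms = synonyms or {}
    q_terms = tokenize(question)
    p_terms = tokenize(passage)
    if not q_terms:
        return 0.0
    hits = 0
    for term in q_terms:
        alternatives = {term} | set(synonyms.get(term, ()))
        if alternatives & p_terms:
            hits += 1
    return hits / len(q_terms)


def temporal_score(question, passage):
    """1.0 if every year mentioned in the question also appears in the
    passage; 0.0 otherwise, including when the question has no year."""
    years = set(re.findall(YEAR_PATTERN, question))
    if not years:
        return 0.0
    return 1.0 if years <= set(re.findall(YEAR_PATTERN, passage)) else 0.0


# Illustrative data only; the synonym table is hypothetical.
passage = "Total sales in 2012 were 4.2 million dollars."
with_synonyms = term_match_score("revenue 2012", passage, {"revenue": ["sales"]})
without_synonyms = term_match_score("revenue 2012", passage)
year_match = temporal_score("revenue 2012", passage)
```

Each scorer looks at only one aspect of the question and passage, which mirrors the idea that every reasoning algorithm contributes a separate score from its own area of focus.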
The scores obtained from the various reasoning algorithms indicate the extent to which the potential response is inferred by the input question based on the specific area of focus of that reasoning algorithm. Each resulting score is then weighted against a statistical model. The statistical model captures how well the reasoning algorithm performed at establishing the inference between two similar passages for a particular domain during the training period of the IBM Watson™ Q&A system. The statistical model may then be used to summarize a level of confidence that the IBM Watson™ Q&A system has regarding the evidence that the potential response, i.e., the candidate answer, is inferred by the question. This process may be repeated for each of the candidate answers until the IBM Watson™ Q&A system identifies candidate answers that surface as being significantly stronger than others and thus generates a final answer, or a ranked set of answers, for the input question. More information about the IBM Watson™ Q&A system may be obtained, for example, from the IBM Corporation website, IBM Redbooks, and the like. For example, information about the IBM Watson™ Q&A system can be found in Yuan et al., “Watson and Healthcare,” IBM developerWorks, 2011, and in “The Era of Cognitive Systems: An Inside Look at IBM Watson and How it Works” by Rob High, IBM Redbooks, 2012.
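The weighting and ranking step can be sketched as follows. This example assumes a logistic model as the statistical model, which is only one plausible choice and is not stated in the text; the weights, bias, scorer names, and candidate scores are hypothetical values chosen for illustration, not trained parameters.

```python
import math


def confidence(scores, weights, bias=0.0):
    """Combine per-algorithm scores into a single confidence in (0, 1)
    using a weighted sum passed through a logistic function."""
    z = bias + sum(weights[name] * value for name, value in scores.items())
    return 1.0 / (1.0 + math.exp(-z))


def rank_candidates(candidate_scores, weights, bias=0.0):
    """Return (answer, confidence) pairs sorted from most to least confident."""
    ranked = [
        (answer, confidence(scores, weights, bias))
        for answer, scores in candidate_scores.items()
    ]
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


# Hypothetical weights, standing in for values learned during training.
weights = {"term_match": 2.0, "temporal": 1.0}
bias = -1.5

# Hypothetical per-algorithm scores for two candidate answers.
candidate_scores = {
    "Ohio": {"term_match": 0.5, "temporal": 0.0},
    "4.2 million dollars": {"term_match": 1.0, "temporal": 1.0},
}

ranked = rank_candidates(candidate_scores, weights, bias)
```

With these assumed values, the candidate "4.2 million dollars" ranks first because both of its scores are higher and both weights are positive. In a trained system the weights would reflect how reliable each reasoning algorithm proved to be for the domain.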