Today's enterprises make decisions based on analyzing information from massive and heterogeneous databases or sources. More and more aspects of controlling machines or technical installations are driven by data, and as a result more and more operators need access to data.
The challenges of building an industrial-grade question-answering (QA) system are manifold, not only because of the domain specificity of the underlying knowledge bases, but also because of the user interaction with the system, which needs to cover a wide range of queries.
The most pressing challenge is run-time performance on commodity hardware. For example, an acceptable speed may be defined as computing the answer representation within 800 ms.
The system should also be scalable, in that the response time should not grow in proportion to the size of the data being accessed.
Enterprise data is often heterogeneous and dynamic, and frequently unstructured. A QA system needs to integrate the underlying sources and accommodate their changing nature. Part of the integration process may include offering unified semantics for the data.
It is estimated that up to 80% of all information is unstructured data. In general, therefore, the data to be searched includes unstructured data as well as structured data.
For searching unstructured data and structured data, a so-called common index structure may be used. That is, the unification of (primarily) unstructured data is accomplished by the traditional approach of an inverted term index that is built separately for each data source.
More precisely, any given data object (e.g., a document) is represented by splitting it into its corresponding term features (e.g., single words) and assigning a weight to each feature according to some feature weighting method, for example one based on the occurrence of the term within the document and within the entire document collection (e.g., word frequency, inverse document frequency).
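The representation described above can be illustrated with a minimal sketch. It assumes whitespace tokenization, raw term counts, and a plain logarithmic inverse document frequency. The function names `tokenize` and `build_inverted_index` and the sample documents are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase whitespace tokens stand in for "term features".
    return text.lower().split()


def build_inverted_index(docs):
    """Map each term to {doc_id: tf-idf weight} for a single data source."""
    term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}

    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    n_docs = len(docs)
    index = defaultdict(dict)
    for doc_id, counts in term_counts.items():
        for term, tf in counts.items():
            # Assumed weighting: term frequency times log inverse document frequency.
            idf = math.log(n_docs / doc_freq[term])
            index[term][doc_id] = tf * idf
    return dict(index)


# Hypothetical two-document collection.
docs = {"d1": "pump pressure high", "d2": "pump valve closed"}
index = build_inverted_index(docs)
```

Under this weighting, a term that occurs in every document of the collection (here, "pump") receives a weight of zero, whereas a term confined to one document keeps a positive weight.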
For retrieval purposes, any given query is mapped onto the inverted (e.g., single-word) index, and the resultant document references (e.g., document identifiers) are merged and ranked by a given ranking measure (e.g., cosine similarity or the PageRank algorithm). For structured data, the retrieval process is integrated in the same way.
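The retrieval step can likewise be sketched as follows. This is a simplified example that assumes raw term-frequency vectors and cosine similarity as the ranking measure. The names `build_index` and `search` and the sample documents are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Return (postings, norms) built from raw term-frequency vectors."""
    postings = defaultdict(dict)  # term -> {doc_id: term frequency}
    norms = {}                    # doc_id -> Euclidean length of its vector
    for doc_id, text in docs.items():
        counts = Counter(text.lower().split())
        for term, tf in counts.items():
            postings[term][doc_id] = tf
        norms[doc_id] = math.sqrt(sum(tf * tf for tf in counts.values()))
    return postings, norms


def search(query, postings, norms):
    """Map the query onto the index, merge the postings, rank by cosine similarity."""
    q_counts = Counter(query.lower().split())
    q_norm = math.sqrt(sum(c * c for c in q_counts.values()))

    # Merge step: accumulate the dot product for every document that is hit.
    dots = defaultdict(float)
    for term, q_tf in q_counts.items():
        for doc_id, d_tf in postings.get(term, {}).items():
            dots[doc_id] += q_tf * d_tf

    ranked = [(doc_id, dot / (q_norm * norms[doc_id])) for doc_id, dot in dots.items()]
    # Highest score first; ties are broken by document identifier.
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical three-document collection.
docs = {
    "d1": "pump pressure high",
    "d2": "pump valve closed",
    "d3": "valve valve open",
}
postings, norms = build_index(docs)
results = search("pump pressure", postings, norms)
```

Only documents that share at least one term with the query are scored, which is the main practical benefit of looking up postings by term instead of scanning every document.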
An actual unification process between unstructured and structured data repositories or knowledge bases, however, is not conducted. That is, the different repositories use their separate index structures. Moreover, computation, weighting, or ranking between repositories and overarching all of them is left out of the basic unification process. That is, this process unifies the data only through separate index structures, focusing on the traditional (e.g., inverted) term index structure.
Accordingly, it is an object to improve the unification of unstructured data objects and structured data objects.