The exemplary embodiment relates to the management and use of documents. It finds particular application in connection with the enrichment of data with information which allows both structured and unstructured (textual) data to be analyzed with common forms of analysis.
Frequently, business data sources contain structured as well as unstructured data. Structured data may include quantitative information about business objects, while the unstructured data may include textual information related to these business objects. Examples of structured data include tables in which defined hierarchical relationships exist between different parts of the data. For example, a table of a database generally includes fields corresponding to the column headings in a conventional table, which have a predefined relationship to the content of each column. Unstructured data is textual data which is expressed in a natural language (“free text”) and in which no such structure exists (or is, at best, very limited). It may include text which results from interactions with customers or suppliers, such as e-mails, scanned and OCR-ed (optical character recognition processed) mail, survey questionnaires, transcripts of phone calls, notes of meetings, and so forth. For example, technical centers often maintain databases of fault/repair logs, containing both structured information (about the hardware, the product components, the date of intervention, and the technicians involved) and verbatim comments.
Generally, the two parts of the mixed-data environment are kept isolated and are utilized separately. In analysis of the data, reports and statistical analyses rely only on the quantitative (structured) part, using data mining techniques, while the textual part is often exploited by traditional information retrieval engines using keyword searching techniques. No real link is made between the textual part and the quantitative part of the data.
Users of the data have an interest in mixed data modeling for a variety of purposes. One reason for the lack of a global analysis/exploitation of the data is that the unstructured text uses different expressions to refer to the business objects and may refer to them only generically.
For example, a customer may e-mail a service engineer at AB Company about a malfunction of his printer, model AB100, indicating simply: “my new printer is not working.” The service engineer is able to determine the printer model from records in the structured business data. For example, the database may include tables which list all the engineer's customers, the corresponding printer models, and when they were purchased. The engineer is then able to respond to the customer's e-mail and may store the e-mail in a database file of customer service requests. However, AB Company may wish to generate a report of the number of service calls for each of its printer models. Although this information may exist in the company's database as a whole, the company has no way of extracting the information in an automated fashion from both the structured and unstructured (textual) data.
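By way of illustration only, the manual lookup performed by the service engineer in this example may be sketched as follows. The table contents, the field names, and the function `service_calls_per_model` are hypothetical and are not taken from any actual system. The sketch further assumes that the sender's e-mail address is a reliable key into the structured customer table and that each customer owns a single printer, neither of which can generally be relied upon in practice.

```python
from collections import Counter

# Hypothetical structured data: one row per customer purchase.
CUSTOMERS = [
    {"email": "pat@example.com", "model": "AB100", "purchased": "2007-03-14"},
    {"email": "lee@example.com", "model": "AB200", "purchased": "2006-11-02"},
    {"email": "kim@example.com", "model": "AB100", "purchased": "2007-01-20"},
]

# Hypothetical unstructured data: stored service-request e-mails.
# Note that none of the message bodies names a printer model.
SERVICE_EMAILS = [
    {"sender": "pat@example.com", "body": "my new printer is not working"},
    {"sender": "lee@example.com", "body": "The printer jams on every page."},
    {"sender": "kim@example.com", "body": "Printer will not power on."},
    {"sender": "unknown@example.com", "body": "My printer is broken."},
]


def service_calls_per_model(customers, emails):
    """Count service e-mails per printer model by joining on sender address."""
    # Assumption: one printer per customer, keyed by e-mail address.
    model_by_email = {row["email"]: row["model"] for row in customers}
    counts = Counter()
    for message in emails:
        model = model_by_email.get(message["sender"])
        # Messages that cannot be linked to a structured record are
        # counted separately rather than silently dropped.
        counts[model if model is not None else "UNRESOLVED"] += 1
    return dict(counts)


report = service_calls_per_model(CUSTOMERS, SERVICE_EMAILS)
print(report)  # {'AB100': 2, 'AB200': 1, 'UNRESOLVED': 1}
```

The unresolved entry illustrates the difficulty noted above: where the text offers only a generic reference (“my printer”) and no dependable key links it to the structured records, the report cannot be completed automatically.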
One solution is to re-structure the free-text part by reducing it to a list of controlled keywords, using entity extraction, classification, and clustering techniques. Once re-structured in the form of extra features whose values belong to finite, known sets, the textual part can be integrated into the structured part of the database, and standard methods can then be applied for analytic purposes. This can be a lengthy process which is operator intensive.
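A minimal sketch of such a re-structuring step is given below, for illustration only. Simple pattern matching against a hand-built vocabulary is used here as a stand-in for the entity extraction, classification, and clustering techniques mentioned above. The vocabulary `CONTROLLED_KEYWORDS`, the feature name `issue_keywords`, and both functions are hypothetical.

```python
import re

# Hypothetical controlled vocabulary: each keyword (a value from a finite,
# known set) is associated with patterns that trigger it in free text.
CONTROLLED_KEYWORDS = {
    "paper_jam": [r"\bjam(s|med|ming)?\b"],
    "no_power": [r"\bpower\b"],
    "not_working": [r"\bnot working\b", r"\bbroken\b"],
}


def extract_keywords(text):
    """Reduce free text to a sorted list of controlled keywords."""
    lowered = text.lower()
    return sorted(
        keyword
        for keyword, patterns in CONTROLLED_KEYWORDS.items()
        if any(re.search(pattern, lowered) for pattern in patterns)
    )


def restructure(record):
    """Return a copy of the record with the text reduced to an extra feature."""
    enriched = dict(record)
    enriched["issue_keywords"] = extract_keywords(record["comment"])
    return enriched


row = restructure(
    {"model": "AB100", "comment": "My new printer is not working; paper jams daily."}
)
print(row["issue_keywords"])  # ['not_working', 'paper_jam']
```

Even in this reduced form, the vocabulary and its patterns must be authored and maintained by hand, which suggests why the approach can be operator intensive.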