Increasingly large numbers of specialized applications are developed by enterprise business users in response to situational business needs. Such applications often require access to information derived by combining data in corporate databases, content management systems, and other IT-managed repositories with data from the desktop, the Web, and other sources typically outside IT control. Web 2.0-inspired enterprise data mashup technologies, like IBM's InfoSphere MashupHub (MashupHub), have been developed to meet the data processing and data integration requirements of such applications. MashupHub, which originated from the Damia research project at IBM, provides visual facilities for quickly and easily creating data mashups that filter, join, aggregate, and otherwise transform feeds published from a wide variety of sources into new feeds that can be consumed by AJAX and other types of web applications.
An important class of enterprise mashup scenarios involves feeds derived from data created primarily for human consumption, such as email, calendars, blogs, wikis, and web feeds. In such feeds, the data needed to perform mashup operations is often buried within swaths of unstructured element and attribute text. Consider a scenario where an account representative would like to get quick current events updates on the customer accounts he or she is preparing to visit. The representative's customer account information is available in a spreadsheet on a desktop. The representative would like to join this data with relevant news from popular business news feeds available on the Web. Unfortunately, business feeds on the Web often have company references buried within unstructured text in a description or title field of the feed. For example, a Reuters business feed entry titled “Aston Martin expects 2009 sales to slow: report” identifies the company “Aston Martin” as the subject of the business news represented by the feed entry. This company information must be extracted from the text and added to the news feed as a structured attribute before the feed can be successfully joined with the corresponding account information in the spreadsheet.
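The extract-then-join step in this scenario can be illustrated with a minimal Python sketch. It assumes a simple dictionary lookup against the company names already present in the spreadsheet; the function names, the account rows, and the second feed entry are hypothetical and chosen only for illustration. It is not the extraction technique used by MashupHub or any particular product.

```python
import re

# Hypothetical rows standing in for the representative's spreadsheet.
accounts = [
    {"company": "Aston Martin", "contact": "J. Smith"},
    {"company": "Acme Corp", "contact": "R. Jones"},
]

# Feed entries; only the first title is taken from the scenario above,
# the second is invented to show a non-matching entry.
feed = [
    {"title": "Aston Martin expects 2009 sales to slow: report"},
    {"title": "Markets close mixed on light trading"},
]


def annotate_company(entry, company_names):
    """Return a copy of the entry with a structured 'company' attribute.

    The attribute is None when no known company name occurs in the title.
    """
    annotated = dict(entry)
    annotated["company"] = None
    for name in company_names:
        pattern = r"\b" + re.escape(name) + r"\b"
        if re.search(pattern, entry["title"], re.IGNORECASE):
            annotated["company"] = name
            break
    return annotated


def join_on_company(feed_entries, account_rows):
    """Annotate each feed entry, then join it with its account row."""
    by_company = {row["company"]: row for row in account_rows}
    names = list(by_company)
    joined = []
    for entry in feed_entries:
        annotated = annotate_company(entry, names)
        account = by_company.get(annotated["company"])
        if account is not None:
            joined.append({**annotated, "contact": account["contact"]})
    return joined


result = join_on_company(feed, accounts)
```

Here the join succeeds only for entries whose title mentions a known account; without the added "company" attribute there would be no common key between the feed and the spreadsheet.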
Information extraction technology can be a critical enabler in such scenarios, providing various types of text annotators for discovering entities, relationships, and other attributes that can be exploited by mashup operations. Current mashup technologies can typically make direct use of information extraction technology made available as web services. These services can be called from within a data mashup execution flow to annotate unstructured text within the data feed. There are significant efficiency concerns with this approach, however: (1) potentially large portions of feed text need to be transferred between the data mashup and the web service; (2) there might be many calls to the service for each execution of the data mashup, perhaps one or more per feed entry; and (3) there is often significant network latency involved with web service calls. In addition to the performance concerns, exposing sensitive company data such as email messages or call center records to an external web service can lead to security and privacy issues. Because information extraction is essential to enabling this class of data mashups, the technology should be tightly integrated into the system.
Even ignoring the performance and security concerns, there are other drawbacks to relying exclusively on external annotation services. One drawback is that the annotators provided by such services are generic and not necessarily tuned to work well in specific mashup environments. For example, a feed can join with more sources if it is annotated with more specific attributes such as street (e.g. “650 Harry Road”), city (e.g. “San Jose”), and state (e.g. “CA”), versus more general ones such as location (e.g. “650 Harry Road, San Jose, Calif.”). Writing annotators that work with high specificity and low noise requires careful tuning of annotation rules. Moreover, annotators tuned for feeds must deal intelligently with markup. This requirement might mean ignoring HTML tags or exploiting XML element and attribute data (perhaps of parent or sibling nodes) to achieve greater precision and recall.
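The difference between a general location annotation and specific street, city, and state attributes can be sketched as follows. This is a deliberately narrow illustration, assuming a single regular-expression rule, a hypothetical two-entry state normalization table, and naive tag stripping; the names TAG, ADDRESS, STATE_CODES, and annotate_address are invented for this sketch, and a production annotator would need far more carefully tuned rules.

```python
import re

# Naive markup removal: drop anything that looks like an HTML/XML tag.
TAG = re.compile(r"<[^>]+>")

# One illustrative rule: "<number> <Name ...> <street type>, <City>, <State>".
ADDRESS = re.compile(
    r"(?P<street>\d+\s+[A-Z][A-Za-z]*(?:\s+[A-Z][A-Za-z]*)*"
    r"\s+(?:Road|Street|Avenue|Drive))"
    r",\s*(?P<city>[A-Z][A-Za-z]*(?:\s+[A-Z][A-Za-z]*)*)"
    r",\s*(?P<state>[A-Z]{2}|Calif\.)"
)

# Hypothetical normalization table so that different spellings of a state
# produce the same join key.
STATE_CODES = {"Calif.": "CA", "CA": "CA"}


def annotate_address(text):
    """Return street, city, and state attributes, or None if no match."""
    plain = TAG.sub("", text)  # ignore markup before applying the rule
    match = ADDRESS.search(plain)
    if match is None:
        return None
    state = match.group("state")
    return {
        "street": match.group("street"),
        "city": match.group("city"),
        "state": STATE_CODES.get(state, state),
    }


annotation = annotate_address(
    "Visit us at <b>650 Harry Road</b>, San Jose, Calif. for details."
)
```

With separate attributes, the same feed entry can be joined with one source keyed on city and another keyed on state, which a single location string would not support without further parsing.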
Yet another drawback is that the set of annotators provided by external services is fixed and hence cannot be extended with new annotators that target a particular installation, feed source, or mashup application. For example, a semiconductor company may need to extract information about Field-Programmable Gate Array (FPGA) users' performance requirements from articles in the technical press, a task that no pre-built library is likely to accomplish. Even if a remote text annotation service supports customized annotators and dictionaries, it is hard to share such customization efforts. The reasons are twofold: first, users of such web services are unlikely to share the same scenario or data sources; second, companies need to protect their intellectual property and are unlikely to have their customized annotators and dictionaries stored at a third party.
Thus, there are deficiencies in the current art as it relates to the effective and efficient exploitation of information extraction in data processing systems. These deficiencies are particularly evident in the context of data mashup systems, which often deal with data feeds derived from unstructured data sources. What is needed is a data processing system that provides efficient and extensible information extraction capabilities.