1. Field
Embodiments of the invention relate to the use of federation services and transformation services to perform extract, transform, and load (ETL) of unstructured information and associated metadata.
2. Description of the Related Art
Extraction, transformation and loading of structured data stored primarily in relational databases are described, for example, in the following references: (1) Squire, C., “Data Extraction and Transformation for the Data Warehouse”, ACM Proceedings of SIGMOD, Intl. Conference on Management of Data, Vol. 24, No. 1, Mar. 1, 1995, pp. 446-447 (“Squire” hereinafter) and (2) White, C., “Managing Data Transformations”, BYTE, Vol. 22, No. 12, Dec. 1, 1997, pp. 53-54 (“White” hereinafter).
Structured information (also referred to as “structured data”) may be described as including “alphanumeric values easily classified by specific attributes . . . [including values such as] . . . name, zip code, account balance, transaction number etc.”, as described in Kugel, R., “Unstructured Information Management”, Intelligent Enterprise, December 2003 (“Kugel” hereinafter). According to Kugel, structured information forms only 10-20% of enterprise information.
Unstructured information (also referred to as “unstructured data” or “native content” or “content”) comprises the other 80-90% of all enterprise information. Unstructured information may be described as computerized information that does not have a structure that is easily readable by a computer. Unstructured information includes, for example, Binary Large OBjects (BLOBs) such as multimedia, e-mails, memos, white papers, etc.
Today's complex business environment is subject to increasing regulation. Compliance requirements demand that corporations retain documents and e-mails for seven years in case of an audit. While governance control becomes more stringent, the competitive playing field becomes more level. Companies face greater competition and, thus, need to make faster and better informed decisions in order to sustain growth. It is imperative that companies gain a unified view of their customer data in order to stay competitive, while improving productivity and reducing costs.
The unstructured information may be stored in a content repository. A content repository may be described as software, firmware, hardware, or any combination thereof, that manages the storage of the unstructured information.
Currently, techniques for content management and federation are described in, for example, U.S. Pat. No. 6,643,663, issued on Nov. 4, 2003, to Dabney et al.; U.S. Pat. No. 6,804,674, issued on Oct. 12, 2004, to Hsiao et al.; and U.S. Pat. No. 6,910,040, issued on Jun. 21, 2005, to Emmick et al. Techniques for content transformation are described in, for example, U.S. Pat. No. 7,016,963, issued on Mar. 21, 2006, to Judd et al. In addition, techniques for ETL of structured data residing in relational databases are described in, for example, U.S. Pat. No. 7,051,334, issued on May 23, 2006, to Porter et al.
Furthermore, enterprises gain from unifying structured and unstructured information. The time and effort to implement new applications that require combined data types should be minimized. According to Gilbert, Mark and Friedman, Ted, “The New Data Integration Frontier: Unifying Structured and Unstructured Data”, Gartner, Mar. 31, 2006 (“Gilbert” hereinafter), cost reduction over time can be significant if a common data integration infrastructure is deployed across the spectrum of data types.
Thus, there is a need in the art for techniques that extract, transform and load unstructured information and associated metadata.