Data extraction is the process of retrieving structured or unstructured data from data sources for further processing or storage. Typical unstructured data sources include web pages, emails, documents, PDFs, scanned text, mainframe reports, and spool files. Though the Web is best known as a vast repository of shared documents, it also contains a significant amount of structured data covering a wide range of topics, including product, financial, public-record, scientific, hobby-related, and government data. Structured data on the Web shares many similarities with the kind of data traditionally managed by commercial database systems, but it also has some unusual characteristics of its own: it is embedded in textual Web pages and must be extracted prior to use; there is no centralized data design as there is in a traditional database; and, unlike traditional databases that focus on a single domain, it covers everything.
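To make concrete the point that structured data is embedded in textual Web pages and must be extracted before use, the following is a minimal sketch using only Python's standard-library HTML parser. The page fragment, the use of itemprop attributes as field markers, and the names ItempropExtractor and extract_record are assumptions made for illustration; real pages vary widely in how, and whether, they mark up such fields.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; the markup and values are invented for illustration.
SAMPLE_HTML = """
<div class="store">
  <h1 itemprop="name">Example Store</h1>
  <p>Call us at <span itemprop="telephone">555-0100</span>.</p>
  <p>Open <span itemprop="openingHours">Mo-Fr 09:00-17:00</span></p>
</div>
"""


class ItempropExtractor(HTMLParser):
    """Collects the text of every element that carries an itemprop attribute."""

    def __init__(self):
        super().__init__()
        self.record = {}
        self._current = None  # name of the field currently being read, if any

    def handle_starttag(self, tag, attrs):
        prop = dict(attrs).get("itemprop")
        if prop:
            self._current = prop
            self.record[prop] = ""

    def handle_data(self, data):
        # Text outside a marked element is ordinary page prose and is ignored.
        if self._current is not None:
            self.record[self._current] += data

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the current field, so nested
        # markup inside a field is not handled by this sketch.
        self._current = None


def extract_record(html):
    """Return a dict of field name to text for the marked-up fields in html."""
    parser = ItempropExtractor()
    parser.feed(html)
    parser.close()
    return {name: text.strip() for name, text in parser.record.items()}


record = extract_record(SAMPLE_HTML)
```

Here record holds the store name, telephone number, and opening hours as separate fields, while the surrounding prose ("Call us at", "Open") is discarded. The result is a record with no schema beyond what the page happened to provide, which reflects the absence of a centralized data design noted above.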
The business need for structured and unstructured extraction is well known. Applications range from augmenting search results and providing rich results to properties, to targeted advertising, where extracted information can be used to improve both the user experience and the targeting of advertisements.
Currently, however, every design or development team that requires structured data from the web (e.g., review ratings, store hours of operation, store phone numbers, hotel photos) has to develop its own workflow execution mechanisms from scratch for retrieving and processing that data. Instead of concentrating on the business problem at hand, the team has to worry about peripheral problems such as how to orchestrate the whole workflow, including, but not limited to, designing its own workflow model, developing custom ways of stringing together components using scripts, gaining access to web data, validating workflow output on a continuous basis, and transferring data between distributed and stand-alone systems.
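The kind of hand-rolled glue described above might look like the following sketch, in which a team strings together fetch, extract, and validate components itself. It is illustrative only: the stage names, the canned input text, and the required fields are all hypothetical, and fetch deliberately returns fixed text rather than accessing the web.

```python
from typing import Any, Callable, List

# A stage is any callable that takes the previous stage's output.
Stage = Callable[[Any], Any]


def run_workflow(stages: List[Stage], payload: Any) -> Any:
    """Pass the payload through each stage in order and return the final result."""
    for stage in stages:
        payload = stage(payload)
    return payload


def fetch(url: str) -> str:
    # Stand-in for web access: returns canned text instead of making a request.
    return "Store hours: 9-5 | Phone: 555-0100"


def extract(text: str) -> dict:
    # Split "Key: value | Key: value" text into a dict with lower-cased keys.
    fields = {}
    for part in text.split("|"):
        key, _, value = part.partition(":")
        fields[key.strip().lower()] = value.strip()
    return fields


def validate(record: dict) -> dict:
    # Reject records that lack any field this hypothetical team requires.
    missing = [k for k in ("store hours", "phone") if not record.get(k)]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return record


result = run_workflow([fetch, extract, validate], "http://example.com/store")
```

Even this toy version has to settle a workflow model (an ordered list of callables), a data hand-off convention between stages, and a validation policy, and each team that builds its own ends up making those choices independently.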