The information technology revolution of the past few decades has produced various advances, including the digitization of massive amounts of data and widespread access to computational devices. However, while rich digital information may be accessible, it is often difficult to manipulate and analyze such data.
Information is available in various document types such as text files, log files, spreadsheets, and webpages. These documents allow their creators flexibility in storing and organizing hierarchical data by combining presentation and formatting with the underlying data model. However, such flexibility can cause difficulty when attempting to extract the underlying data for tasks such as data processing, querying, altering a presentation view, or transforming data to another storage format. The foregoing has led to the development of various domain-specific technologies for data extraction. For instance, scripting languages have been designed to support string processing in text files. Moreover, spreadsheet systems allow users to write macros using a built-in library of string and numerical functions, or to write arbitrary scripts in various programming languages. Further, some web technologies can be used to extract data from webpages; however, such technologies rely on knowing the underlying schema of the webpages.
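By way of illustration only, the string-processing approach mentioned above can be sketched in a scripting language as follows. The log format, the pattern `LOG_LINE`, and the function `extract_records` are hypothetical and are assumed solely for this example; they are not drawn from any particular system. The sketch shows that such a script works only when its author already knows the layout of the text being processed.

```python
import re

# Hypothetical log format, assumed for illustration only:
#   <YYYY-MM-DD> <LEVEL> <free-text message>
LOG_LINE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<level>[A-Z]+) (?P<message>.*)$"
)


def extract_records(text):
    """Return one dict per line matching the assumed format; skip other lines."""
    records = []
    for line in text.splitlines():
        match = LOG_LINE.match(line)
        if match:
            records.append(match.groupdict())
    return records


sample = "2024-01-05 ERROR disk full\n-- rotated --\n2024-01-06 INFO restarted"
records = extract_records(sample)
```

In this sketch, any line that deviates from the assumed layout is silently skipped, so a change in the document's formatting may require the pattern to be rewritten by someone familiar with regular expressions.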
Conventional programmatic solutions to data extraction can be problematic for various reasons. For instance, conventional solutions are commonly domain-specific and rely upon knowledge of, or expertise in, different technologies for different document types. Further, conventional solutions typically rely on understanding the underlying data schema, including data fields that the end user is not interested in extracting and the organization of those fields (some of which may not be visible in the presentation layer, as in the case of webpages). Moreover, conventional solutions oftentimes require knowledge of programming. The reliance upon expertise in different technologies for different document types, together with the need to understand the underlying data schema, can create challenges for programmers. Moreover, the need for programming knowledge can leave end users who lack programming skills unable to use these conventional solutions. As a result, users are oftentimes either unable to leverage access to rich data or have to resort to manual copy-paste, which is both time consuming and error prone.