1. Field of the Invention
Embodiments of the invention described herein pertain to the field of computer system user interfaces. More particularly, but not by way of limitation, one or more embodiments of the invention enable a user interface for parsing unstructured data using pattern recognition wherein parsed data is displayed in a first format and unparsed data is displayed in a second format.
2. Description of the Related Art
There are a number of requirements and/or preferences associated with utilizing unstructured data. Data may be received in a variety of formats that may or may not originate from a common source. When data obtained from different sources has no common structure or format, it must be normalized before it can be utilized.
Current tabular, user-interface-oriented programs provide cumbersome wizard-based solutions that do not allow unstructured text to be easily converted into structured substrings that match desired patterns. Microsoft Excel™ is an example of such a program. There are no solutions that use predefined pattern libraries to apply formatting to both matching and non-matching text, for example by placing a matching string in a separate column while color coding text that fails to match one or more patterns.
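By way of illustration only, the behavior that such programs lack can be sketched as follows. This is a minimal sketch under stated assumptions, not an implementation of any embodiment: the pattern library contents, the function names parse_row and render, and the bracketed "UNMATCHED" marker (standing in for color coding) are all hypothetical.

```python
import re

# Hypothetical pattern library; the names and expressions are illustrative only.
PATTERN_LIBRARY = {
    "phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "zip": re.compile(r"\b\d{5}\b"),
}


def parse_row(text):
    """Split one unstructured string into matched columns and leftover text."""
    columns = {}
    remainder = text
    for name, pattern in PATTERN_LIBRARY.items():
        match = pattern.search(remainder)
        if match:
            # Place the matching substring in its own column and remove it
            # from the text that is still unparsed.
            columns[name] = match.group(0)
            remainder = remainder[:match.start()] + remainder[match.end():]
    # Collapse the whitespace left behind by the removed substrings.
    return columns, " ".join(remainder.split())


def render(text):
    """Show parsed data in one format and unparsed data in a second format."""
    columns, unmatched = parse_row(text)
    cells = [f"{name}={value}" for name, value in columns.items()]
    if unmatched:
        # A real user interface might color code this text; a marker is used here.
        cells.append(f"[UNMATCHED: {unmatched}]")
    return " | ".join(cells)


if __name__ == "__main__":
    print(render("Call 555-123-4567 in 90210 today"))
```

In this sketch each library entry is tried in turn, so the order of the library determines which pattern claims a substring first; a practical system would likely need a more deliberate conflict policy.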
Existing solutions allow legacy file formats to be read; however, these legacy formats are generally delimited by special characters or use fixed-width fields. These file formats are generally related to EDI and the archaic practice of defining custom files for intercompany communications before the advent of XML. Because these file formats are generally specific to a particular customer, reading in multiple files from multiple customers that all use different formats to represent the same type of data defeats these types of solutions.
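The limitation can be seen in a minimal sketch, assuming two hypothetical customers who send the same record, one delimited by a special character and one in fixed-width fields. The function names, the "|" delimiter, and the field widths are illustrative assumptions; the point is that each customer format requires its own reader configuration.

```python
import csv
import io


def read_delimited(line, delimiter="|"):
    """Read one record whose fields are separated by a special character."""
    return next(csv.reader(io.StringIO(line), delimiter=delimiter))


def read_fixed_width(line, widths):
    """Read one record whose fields occupy fixed column widths."""
    fields, pos = [], 0
    for width in widths:
        # Slice out the field and drop the padding used to fill the column.
        fields.append(line[pos:pos + width].strip())
        pos += width
    return fields


if __name__ == "__main__":
    # The same data as sent by two hypothetical customers.
    customer_a = "ACME|Widget|12"
    customer_b = "ACME".ljust(10) + "Widget".ljust(10) + "12"
    print(read_delimited(customer_a))
    print(read_fixed_width(customer_b, (10, 10, 2)))
```

Neither reader can handle the other customer's file without being told the delimiter or the widths in advance, which is the per-customer burden described above.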
U.S. Pat. No. 6,668,254 to Matson et al., relates to a method and system for importing data comprising the downloading of product data from different sources and in different formats; processing the downloaded data by at least comparing it with data downloaded and stored in a product database; and reviewing the results of the comparison to detect differences in the data, the differences potentially being errors. The system and methods further comprise [converting] the downloaded data from its supplier specific format into a standard format; comparing the downloaded data in the standard format with a previously downloaded data set saved in the standard format; categorizing the product data based on the results of the second comparison; and processing each category of data independently to automatically update the product database.
Specifically, “as an alternative or in addition to simple differential analysis, the data load technician can use many other tools to gain insight into the contents of the latest supplier data file. In fact, the input data should be subjected to significant review before proceeding with the import process, especially for data from new or unreliable suppliers. These tools include, but are not limited to, viewing the file in a text editor, loading relational data into a database such as Oracle and executing various retrievals, and analyzing the data in an Excel spreadsheet.”
U.S. Pat. No. 6,718,336 to Saffer describes a data import system that enables access to data of multiple types from multiple data sources of different formats and provides an interface for importing data into a data analysis system. The interface enables a user to customize the formatting of the data as the data is being imported into a data analysis system.
Specifically, “If the user selects the define format option, a format editor is presented for the user to define the format of the structured text. If the user selects the unstructured text option (FIG. 9e), the user is presented with options for identifying the unstructured text.”
U.S. Patent Application Publication 2005060324 to Johnson et al., describes a “System and method for creation and maintenance of a rich content or content-centric electronic catalog”. The system and method disclosed are directed toward transforming catalog data from multiple supplier sources to a standardized rich content catalog either by the suppliers themselves or by a third party using the system and method of the present invention. Incoming raw catalog data content is cleansed and normalized using an extensive knowledge base of patterns and incoming schemas are appended to the cleansed and normalized data. The resulting rich content catalogs are published for user browsing and data syndication.
Specifically, “the underlying framework for the invention is based on an extensive and extensible knowledge base of over 200,00[0] patterns covering an extremely broad range of 44,000 families of goods and services. This knowledge base can be used to load any database (e.g., Oracle, Sybase, DB2, Access, etc) or any spreadsheet (e.g., Excel), as well as to output XML, EDI, or any other standard format.”
U.S. Patent Application Publication 20030182287 to Parlanti et al. describes an “interface for an electronic spreadsheet and a database management system”. The invention is directed to a generalized interface between an electronic spreadsheet program, such as Microsoft Excel, and any data provider supported by Microsoft Universal Data Access (UDA), such as an Open Database Connectivity (ODBC) driver, for a Database Management System (DBMS) such as DB2/400.
Specifically, “The interface reads a profile file (.ini) and interprets the instructions in this file to add commands to the Excel Menu bar. This profile file also contains instructions on the sequence of SQL statements to be performed for each Command added and embedded these in the SQL database.”
U.S. Patent Application Publication 20030061226 to Bowman et al., describes a “data loader for handling imperfect data and supporting multiple servers and data sources”. A “wizard-based” data loader handles imperfect data and supports multiple servers and data sources. The structures that represent the hierarchical model for the data are defined and created as the backbone for the model using spreadsheets, multiple relational database tables, and other sources of data that may reside on one or more servers.
Specifically, “the wizard-based data loader is a tool that permits ordinary business or domain experts to create templates that load data from existing sources of data that are both internal and external to an organization. The data loading mechanism provides three fundamental capabilities: the creation of structural hierarchies, the loading of information into those hierarchies, and the linking of data across hierarchies. The automated data loader allows the user to automate data loads so that data loading tasks can be scheduled to run automatically at a regular intervals and scheduled times.”
U.S. Patent Application Publication 2002004835 to Pepin et al., describes a “method and apparatus for enabling bulk loading of data”. A system and method for processing information performs actions associated with rules to modify, adjust, calculate and massage data to comport with downstream handling requirements. In one example, bulk uploads from a supplier are treated in accordance with column headings to perfect data to be imported into a marketplace. The system also permits the storage of the rules to process later uploads with similar data structures.
Specifically, “The Supplier User performs the inventory management function by selecting this application object. The user specifies the source of the inventory data, which can be in multiple formats (csv, excel, tab delimited, xml). The User identifies the source and the data is processed by the service.”
The Adeptia Product comprises a data integration capability that includes support for complex data formats and transformation. The product comprises a data transformation engine that allows any-to-any mapping between different data formats. Complex data processing functions, such as string, math, and conditional operations as well as DB and XML file look-up, are included. Data can be aggregated from multiple sources. Supported data formats include XML-DTD, XSD, Hierarchical, attributes, enumerated values, ASCII Text/Flat, Fixed-length, EDI, AL3, Excel files, and SQL-compliant relational databases such as Oracle, Sybase, DB2, Informix, and MySQL.
The Autonomy Product comprises technology that automatically reads, categorizes, hyperlinks, and personalizes large volumes of unstructured data, and delivers highly targeted, personalized content.
The Stylus Studio Product allows for the generation of match patterns when importing EDI data. The product comprises a utility named Convert to XML, which works on any legacy data input file, for example, text files, comma separated values (CSV), tab separated values, binary data, EDI files, or any other flat file format. Stylus Studio can also read dozens of different file encodings and understand various data types.
For at least the limitations described above, there is a need for an apparatus and method for parsing unstructured data.