1. Field of the Invention
The invention is directed towards a system, software and method for managing the extraction and processing of unstructured, semi-structured and structured data.
2. Description of the Related Art
The Internet and other networks contain vast amounts of structured, semi-structured and unstructured data. Structured data is data that can be interpreted according to a schema. Unstructured data has no specific format and may not follow any specific rules. Semi-structured data is data that has some aspects of structured data and some aspects of unstructured data. Examples of unstructured data include text, video, sound and images.
Searching the Internet and other networks for data is time consuming and often results in retrieval of an abundance of unstructured data. Moreover, Internet content is updated and changed constantly, thus making it increasingly difficult to monitor for changes to key data in a user-friendly and efficient manner. A user may perform searches and queries on the Internet to gather data. However, the data retrieved may be unstructured and may require a certain amount of processing before the data is ready to be used by the user. Furthermore, the collected and processed data may be out-of-date unless the user periodically updates the collected data with additional searches of the Internet.
Recent innovations include processing tools to construct structured representations of the large amounts of retrieved unstructured data. These tools include natural language processors (NLPs), which further include data extraction engines. Some of these data extraction engines incorporate statistical processing tools, and may include Bayesian theory and/or rule-based learning approaches to extracting key data from unstructured data. Processing the data via NLPs and other types of processing engines is often necessary to transform the unstructured data into a structured data format. The data may be stored in a structured format inside a database, for ready access.
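By way of illustration only, a rule-based approach to extracting key data from unstructured text may be sketched as follows. The patterns, the function name extract_record and the sample sentence are hypothetical and are not taken from any particular extraction engine; a practical engine would likely use far richer rules or statistical models.

```python
import re

# Hypothetical rules: these patterns are illustrative assumptions only.
DATE_RULE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{1,2}, \d{4}\b"
)
PERSON_RULE = re.compile(r"\b(?:Dr|Mr|Ms|Mrs)\. [A-Z][a-z]+(?: [A-Z][a-z]+)?")


def extract_record(text: str) -> dict:
    """Build a structured record (a dict) from unstructured text."""
    return {
        "dates": DATE_RULE.findall(text),     # every substring matching the date rule
        "people": PERSON_RULE.findall(text),  # every titled name matching the person rule
    }


if __name__ == "__main__":
    sample = "Dr. Jane Smith reported the outbreak on March 3, 2004."
    print(extract_record(sample))
```

The resulting record has named fields, so it may be stored in a database and queried without reprocessing the original text.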
A relational database is well known in the art as a type of database that provides easy access to semi-structured and/or structured data. As data is processed, certain pieces of data, e.g., people and dates, may be identified, captured and processed for future use. For example, the extensible markup language (XML) may be used to syntactically describe the structure of the data. The structured data may be stored in an XML database, allowing future searching and retrieval and avoiding the need to repeat processing efforts to regenerate the relevant data or structure. Alternatively, in keeping with the relational example, information expressed in XML may be parsed and stored in a relational database, likewise allowing future searching and retrieval without repeated processing.
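As a minimal sketch of parsing XML and storing the result in a relational database, the following example uses the Python standard library modules xml.etree.ElementTree and sqlite3. The XML layout, the table name records and the helper functions are assumptions chosen for illustration and do not describe any particular system.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical XML layout: element and attribute names are assumptions.
SAMPLE_XML = """
<records>
  <record source="news">
    <person>Jane Smith</person>
    <date>2004-03-03</date>
  </record>
  <record source="report">
    <person>John Doe</person>
    <date>2004-03-05</date>
  </record>
</records>
"""


def store_records(xml_text: str, conn: sqlite3.Connection) -> int:
    """Parse the XML and insert one row per <record>; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records (source TEXT, person TEXT, date TEXT)"
    )
    root = ET.fromstring(xml_text.strip())
    rows = [
        (rec.get("source"), rec.findtext("person"), rec.findtext("date"))
        for rec in root.iter("record")
    ]
    conn.executemany("INSERT INTO records VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def find_by_person(conn: sqlite3.Connection, person: str) -> list:
    """Retrieve stored rows for a person without reparsing the XML."""
    cur = conn.execute(
        "SELECT source, date FROM records WHERE person = ?", (person,)
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # in-memory database for the sketch
    store_records(SAMPLE_XML, conn)
    print(find_by_person(conn, "Jane Smith"))
```

Once the rows are stored, later searches run against the table directly, which is the sense in which repeated processing is avoided.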
A data analyst or user must constantly monitor data sources, e.g., the Internet, for new and updated data. The constant monitoring of data can require large amounts of time and manpower. A user may require updated data to recognize or realize various types of concerns, e.g., important trends, global epidemics, etc., which are constantly changing throughout the world. Furthermore, because search engines offer an abundance of unstructured data, the searching process may be overwhelming to the user.
Finding data efficiently is important to the welfare and lives of people throughout the world. Users rely heavily on data from the Internet and from private databases, which may also be accessible over the Internet. Some of these databases are maintained by third-party data providers that organize data by categories, e.g., LexisNexis®. The data obtained over the Internet and from third-party data providers may be unstructured, semi-structured and/or structured; in any case, the data may require further processing before it can be meaningfully displayed to or used by a user.