In general, public information sources, such as the Internet, present challenges for information retrieval. The volume of information available via the Internet grows daily, and search engine technologies have scaled dramatically to keep up with such growth. Conventionally, search engines, such as those provided by Yahoo, Google, and others, utilize data collection technologies, such as spiders, bots, and web crawlers, which are software applications that access web pages and trace hypertext links in order to generate an index of web page information. The data collected by such software applications is typically stored as pre-processed data on which search engines may operate to perform searches and to retrieve information.
Additionally, a vast amount of data exists that is not accessible via the public Internet (e.g., “dark web” data, internal data, internal application data, private data, subscription database data, other data sources, or any combination thereof). Such data can often be searched via private access interfaces, private search tools, other application program interfaces, or any combination thereof. Such information may be segregated from other information sources, requiring multiple interfaces, multiple protocols, multiple formats, and different database drivers to access the data. Accordingly, information retrieval can be complicated by the variety of data sources.
In general, software has been developed to provide “on-line analytical processing” (OLAP) for collecting, managing, processing, and presenting multidimensional data for analysis and management purposes. Typically, an OLAP interface is customized for each data source. However, such custom interfaces do not readily leverage existing, distributed data sources without significant coding. Further, when an OLAP interface is designed to access multiple data sources, either the data sources are pre-processed to build a searchable index or the data sources are commonly controlled such that their data structures are uniform. Unfortunately, a large amount of data, for example, unstructured data and semi-structured data that is stored on the Internet, may include useful information that cannot readily be correlated to data from other sources.
Conventionally, multiple data sources may include data related by a common attribute. However, such data may not be readily correlated due to differences in data structures, in attribute labels, in data formats, in the data entered, or any combination thereof. For example, in one data source, a company name may be tagged or labeled using the term “company,” while another data source may label the data using other terms, such as “business name,” “co.,” “corporation,” other labels, or any combination thereof. Further, even when two data sources have common labels, such as “company,” the data entered in a first data source may be entered differently from the data entered in a second data source, making it difficult to correlate the data without manual intervention.
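By way of illustration only, the label and data-entry mismatches described above may be sketched as follows. The synonym table, the list of legal suffixes, and the function names are hypothetical and are assumed solely for this example; they are not drawn from any particular data source or system.

```python
import re

# Hypothetical synonym table (an assumption for this sketch): maps the
# labels used by different data sources to one canonical attribute label.
LABEL_SYNONYMS = {
    "company": "company",
    "business name": "company",
    "co.": "company",
    "corporation": "company",
}

# Hypothetical legal suffixes that are ignored when comparing company names.
LEGAL_SUFFIXES = {"inc", "incorporated", "corp", "corporation",
                  "co", "company", "llc", "ltd"}


def canonical_label(label):
    """Return the canonical label, or the cleaned label if it is unknown."""
    cleaned = label.strip().lower()
    return LABEL_SYNONYMS.get(cleaned, cleaned)


def normalize_company(value):
    """Lowercase the name, drop punctuation, and strip trailing legal suffixes."""
    words = re.sub(r"[^a-z0-9 ]", " ", value.lower()).split()
    while words and words[-1] in LEGAL_SUFFIXES:
        words.pop()
    return " ".join(words)


def normalize_record(record):
    """Rewrite one record so that labels and company values are comparable."""
    normalized = {}
    for label, value in record.items():
        key = canonical_label(label)
        normalized[key] = normalize_company(value) if key == "company" else value
    return normalized


# Two records that refer to the same company but differ in label and entry.
source_a = {"company": "Acme, Inc."}
source_b = {"Business Name": "ACME Incorporated"}

same_company = (normalize_record(source_a)["company"]
                == normalize_record(source_b)["company"])
```

In this sketch, both records reduce to the same canonical label and value, so they can be correlated. The tables, however, must be authored and maintained by hand for each new label or entry convention, which is one form of the manual intervention noted above.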