The growth of electronic data and media has created a significant engineering problem: how to search and analyse data so as to extract useful knowledge or information from potentially large, disparate and distributed data sources. The generation of data including events, records, logs, indicators, audit information and sensor information is increasingly automated and pervasive across technical and societal infrastructure. A result of pervasive automated data generation is large quantities of complex data known as “big data”.
Big data refers to sets of data that are typically voluminous and potentially highly complex in terms of any or all of: the structure or format of the data; the distributed nature of the data; the interrelationships between data elements; or the disparity between data items, their structure and/or the mechanisms employed for their storage.
Big data presents particular challenges in the search, transfer, analysis, use and visualisation of data to meet specific needs without extraneous, irrelevant or missing data. Traditional data storage, search and analysis tools, such as relational or object-oriented databases, can be ineffective or inefficient due to the number, size and complexity of data stores and data items stored. The problems can be particularly acute in certain fields where the volume and complexity of data captured and stored can be high. Such fields include, inter alia: meteorology; genomics; connectomics; information technology networks; social networks and societal infrastructure including healthcare data, education data, energy and water supply data, public safety and policing data, traffic, transport and behavioural information. In particular, multiple data sources spanning numerous such fields present a considerable challenge in determining the meaning and usefulness of data sources for specific data processing, data analysis or data visualisation applications.
Existing approaches to data search and analysis depend on proactive search and selection of data items provided by data sources based on defined criteria. For example, web search technologies include data source indexing, data matching such as regular expression matching for data search, and search result ranking to provide a set of search results with a proposed order of relevance. Examples of such technology include regular expression search techniques such as the RE2 library and information ranking such as the PageRank algorithm, both of Google (Google and PageRank are trademarks or registered trademarks of Google Inc.). Such an approach to data search can be effective for structured web page data, where each document is organised in accordance with a well-specified and conventional markup language having references, or links, between documents. However, the approach requires a comprehensive index of structured data sources and depends on result ranking techniques tied to the structure of the information to identify data relevance. PageRank, for example, is dependent on references between web pages.
Thus web search techniques suffer from a dependency on stable data sources suitable for indexing, where the data sources conform to a known and readily parsable structure. Such approaches are not suitable for very large, complex, distributed and disparate data sources. Such approaches also do not provide for the integration of multiple large, complex and disparate data sources to satisfy a data dependency of a software service.
To address these challenges, Crespo et al. developed Semantic Overlay Networks (SONs) for searching peer-to-peer networks where data is distributed with no control over network structure or data source location (“Semantic Overlay Networks for P2P Systems”, Crespo and Garcia-Molina, Proceedings of the 29th VLDB Conference, Berlin, 2003). Nodes in a peer-to-peer network are assigned to one or more SONs based on the content of documents at each node. A SON manifests as a set of links between nodes, each link being a triple (ni, nj, l) where ni and nj are connected nodes and l is a string. Traditional peer-to-peer networks are established by a single overlay network where l is constant. In contrast, a SON provides for multiple different l such that a node can be connected to a set of neighbours through an l1 link, and to a potentially different set of neighbours through an l2 link. A classification hierarchy of concepts is used to determine the links, l, between nodes. Documents stored at peer nodes in the network are classified into concepts in a hierarchy of concepts. The classification of documents determines the classification of nodes storing the documents. A search based on a query is conducted by classifying the query using the classification hierarchy to identify one or more SONs to which the query is directed. Nodes in each identified SON apply the query to documents. In this way, SONs provide for the searching of peers in a peer-to-peer network while avoiding searching by peers that do not belong to a SON relevant to the query.
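The SON arrangement described above can be illustrated with a minimal Python sketch. It is not the implementation of Crespo et al.: the node names, documents and concept labels are hypothetical, the classification hierarchy is flattened to plain labels for brevity, and classification of documents and queries is assumed to have been done already.

```python
# Hypothetical documents held at each peer, already classified into concepts.
NODE_DOCS = {
    "n1": [("rock", "guitar riffs"), ("jazz", "bebop solos")],
    "n2": [("rock", "drum patterns")],
    "n3": [("jazz", "swing standards")],
}


def node_concepts(node):
    """A node takes on the concepts of the documents it stores."""
    return {concept for concept, _ in NODE_DOCS[node]}


def build_links(nodes):
    """Return links as triples (ni, nj, l), one per concept l shared by ni and nj."""
    links = set()
    ordered = sorted(nodes)
    for i, ni in enumerate(ordered):
        for nj in ordered[i + 1:]:
            for label in node_concepts(ni) & node_concepts(nj):
                links.add((ni, nj, label))
    return links


def son_members(label):
    """Nodes that belong to the SON for concept `label`."""
    return {n for n in NODE_DOCS if label in node_concepts(n)}


def search(query_concept, keyword):
    """Apply the query only at nodes in the SON the query was classified into."""
    hits = []
    for node in sorted(son_members(query_concept)):
        for concept, text in NODE_DOCS[node]:
            if concept == query_concept and keyword in text:
                hits.append((node, text))
    return hits


links = build_links(NODE_DOCS)
rock_hits = search("rock", "drum")
```

In this sketch node n1 is linked to n2 through a "rock" link and to n3 through a "jazz" link, and a query classified as "rock" is never applied at n3.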
The approach of Crespo et al. has considerable disadvantages. While defining SONs for peers can improve search efficiency, the dependence on a predefined and common classification hierarchy can reduce search effectiveness. The classification hierarchy defines how nodes in a peer-to-peer network are linked, and is also the basis for identifying which nodes should be targeted for a classified query. Thus, to be effective, the classification hierarchy must reflect all possible nodes (and the documents stored at nodes) and all possible queries. Further, each SON is associated with a single concept in a classification hierarchy. Where a group of nodes share multiple concepts in common, multiple SONs are generated. Thus, using the approach of Crespo et al., it is not possible to transfer from one SON to another SON unless classification of a query also identifies the other SON. This problem is exacerbated by the requirement, in Crespo et al., that peers and queries are classified according to a classification hierarchy. Where a query is classified as a first concept in a branch of a classification hierarchy, and the query might also be somewhat relevant to a second concept in another, different branch of the hierarchy, such second concept will correspond to a separate SON and will not be searched despite the relevance to the query. Crespo et al. require precise classification of queries and peers to identify a peer for searching. Crespo et al. only contemplate imprecise classification along a common branch of the classification hierarchy, which is tantamount to requiring precise classification of ancestor classifications, so restricting the extent of a search considerably to only precisely relevant classes or ancestor classes. Yet further, the approach of Crespo et al. requires a sharing of the classification hierarchy by all peers in the network. New peers are required to request, receive and store the classification hierarchy. Crespo et al. is accordingly susceptible to multiple varying versions of a classification hierarchy, leading to potentially ineffective search.
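The cross-branch limitation can be made concrete with a short Python sketch. It models only the behaviour as characterised above, in which a search extends from the query's concept to its ancestors and no further. The hierarchy, the concept names and the peer identifiers are all hypothetical.

```python
# Hypothetical classification hierarchy: child -> parent (None marks the root).
PARENT = {
    "root": None,
    "science": "root",
    "health": "root",
    "genomics": "science",
    "oncology": "health",
}

# Hypothetical SON membership: concept -> peers assigned to that SON.
SON_PEERS = {
    "science": {"p1"},
    "genomics": {"p2"},
    "health": {"p3"},
    "oncology": {"p4"},
}


def branch_to_root(concept):
    """Return the concept followed by each of its ancestors up to the root."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def peers_searched(query_concept):
    """Peers reached when a search covers only the query's class and its ancestors."""
    peers = set()
    for concept in branch_to_root(query_concept):
        peers |= SON_PEERS.get(concept, set())
    return peers


reached = peers_searched("genomics")
```

Here a query classified as "genomics" reaches p1 and p2 only. Peer p4, in the "oncology" SON on a different branch, is never searched even if its documents are somewhat relevant to the query.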
Thus there is a need to provide for the identification of appropriate data sources from a complex set of data sources to satisfy a data dependency requirement of a software service without the above-described disadvantages.