The present invention relates to a method of data analysis and in particular to a method of analysing a selected sub-set of data.
The expansion of the internet and its development as a mass consumer resource have caused it to evolve from a distributed data system into a distributed information store (information here meaning data which can be used to answer questions), in which information is created and processed by millions of people using the internet every day.
Information mediation and distribution services such as Twitter and Facebook, and RSS feed aggregators such as Google Reader, enable users to construct subscriptions to particular topics or sites of interest. The results of these information feeds can then be searched or browsed by a user. This approach provides the advantage that the collected content was of interest to the user at the time that the information was obtained. However, the user's search preferences may have changed between the time that the information feed was established and the time that the information is reviewed or analysed, and only the information that was searched for is retained.
An alternative approach is to attempt to gather all information and store it, or index it, in a massive index, as is done by a search engine such as Google. This allows a user to establish subscriptions as queries over the main index, which can be widened at any time to cover any other information that has been stored in the index. Clearly, such an approach is infeasible for all but the most powerful enterprises. Google could potentially provide sophisticated query capability on its centralized hardware; however, the Google infrastructure (previously Map-Reduce, now Caffeine) is specialised to provide a particular style of query capability in a massively scalable but cost-effective way (see, for example, M Stonebraker “SQL databases v. NoSQL databases”, Communications of the ACM, Volume 53 Issue 4 Apr. 2010).
This means that queries that intensively cross-reference the results of previous queries (such as recursively constraining and expanding result sets based on the information held in the previous result set) are unlikely to be scalable using such an indexing infrastructure. In particular, unless a query writer is aware of the physical distribution of data and results, queries cannot be constructed in a way that enables them to process and move data and results most efficiently during the query answering process. Such problems are likely to be more significant if results need to be cross-referenced with proprietary information that must be loaded onto the infrastructure (in some cases it may not be possible to upload such data, for example in order to comply with data protection regulation, or for reasons of privacy or national security). In addition, multiple sources of data may exist that are not licensed for public consumption and may be held privately on other infrastructures. Subscriptions to these information sources are made by humans via various interfaces, for example Twitter RSS feeds or tools such as Co-Tweet. These subscriptions bring data from the large scale internet stores and communities into a local data processing unit where it may be efficiently processed.
According to a first aspect of the present invention there is provided a system comprising: a first data store; a communications interface to one or more further data stores; a resource allocation manager; a plurality of data subscriptions; and one or more user terminals; the system being configured, in use, such that: the resource allocation manager selects one or more data subscriptions from the plurality of data subscriptions; and the selected one or more data subscriptions cause data to be selected from the one or more further data stores and transferred to the first data store.
The system may further comprise a plurality of user agents, each user agent being associated with one or more of the plurality of data subscriptions, the user agents being configured to, in use, transmit a bid for system resources for the one or more associated data subscriptions to the resource allocation manager. The system resources may comprise data storage capacity and/or processing means capacity. The system may further comprise one or more user terminals for analysing the data associated with the selected one or more data subscriptions and transferred to the first data store.
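By way of illustration only, the bidding arrangement described above might be sketched as follows. The text does not specify how the resource allocation manager chooses between bids, so the greedy ranking by bid amount per unit of storage used here is an assumption, as are the names `Bid`, `select_subscriptions`, `amount` and `storage_needed`, and the requirement that `storage_needed` is positive.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Bid:
    """A bid sent by a user agent for one data subscription (hypothetical shape)."""
    subscription_id: str
    amount: float        # value the user agent offers for the resources
    storage_needed: int  # storage units requested; assumed to be positive


def select_subscriptions(bids: List[Bid], storage_capacity: int) -> List[str]:
    """Accept bids greedily, best bid-per-storage-unit first (an assumed policy)."""
    ranked = sorted(bids, key=lambda b: b.amount / b.storage_needed, reverse=True)
    selected: List[str] = []
    remaining = storage_capacity
    for bid in ranked:
        # A bid is accepted only if its storage request still fits.
        if bid.storage_needed <= remaining:
            selected.append(bid.subscription_id)
            remaining -= bid.storage_needed
    return selected


bids = [
    Bid("news-feed", amount=10.0, storage_needed=5),
    Bid("sports-feed", amount=6.0, storage_needed=2),
    Bid("archive-feed", amount=4.0, storage_needed=8),
]
selected = select_subscriptions(bids, storage_capacity=8)
```

With a capacity of eight storage units, the two subscriptions offering the most per unit are selected and the third is declined because it no longer fits. Processing capacity could be handled in the same way by adding a second constrained resource.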
The system may be configured, in use, to periodically update the data associated with the selected one or more data subscriptions and transferred to the first data store.
The one or more data subscriptions may be constructed from data relating to a social media network entity. The data relating to a social media network entity may comprise one or more further social media entities. The one or more data subscriptions may further comprise one or more keywords.
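As a non-limiting sketch, a data subscription of this kind could be represented as shown below. The class name `DataSubscription`, its field names, the example account names and the matching rule (an item is selected if it originates from the entity or one of its further entities and, where keywords are given, contains at least one of them) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataSubscription:
    """Hypothetical representation of a subscription built from social media data."""
    entity: str                                                 # the social media network entity
    further_entities: List[str] = field(default_factory=list)  # entities related to it
    keywords: List[str] = field(default_factory=list)          # optional keyword filter

    def matches(self, author: str, text: str) -> bool:
        # The item must come from the entity or one of its further entities.
        if author not in {self.entity, *self.further_entities}:
            return False
        # With no keywords, every item from those entities is selected.
        if not self.keywords:
            return True
        lowered = text.lower()
        return any(keyword.lower() in lowered for keyword in self.keywords)


subscription = DataSubscription(
    entity="@example_user",
    further_entities=["@example_follower"],
    keywords=["storage"],
)
from_follower = subscription.matches("@example_follower", "New Storage pricing announced")
from_stranger = subscription.matches("@someone_else", "storage news")
off_topic = subscription.matches("@example_user", "lunch plans")
```

Here only the first item would be selected: the second comes from an entity outside the subscription, and the third contains none of the keywords.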
The present invention addresses the issue of maintaining the subscriptions of users to large scale data resources in the face of changes over time in the relevance of, and interest in, these subscriptions, and of the opportunities for retrieval of items that may be of interest to the human user of the system in the future but are unknown to the user at the time when the item is retrieved. Whereas a complete mirror of all large scale resources would achieve this, the resources available to the user are constrained, and the present invention balances the acquisition of information against the cost of gathering it from, for example, the internet, storing it and answering queries from it.
According to a second aspect of the present invention there is provided a method of analysing data, the method comprising the steps of: i) defining a plurality of data subscriptions; ii) a resource allocation manager selecting one or more of the plurality of data subscriptions; iii) selecting data associated with the selected one or more data subscriptions; and iv) transferring the data selected in step iii) to a first data store.
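Purely as an illustrative sketch of steps i) to iv), the method might be expressed as follows. The representation of a subscription as a predicate over records, of the data stores as in-memory lists, and of the resource allocation manager as a `choose` callback are assumptions, as are the names `analyse`, `Record` and the example data.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]


def analyse(
    subscriptions: Dict[str, Callable[[Record], bool]],
    choose: Callable[[List[str]], List[str]],
    further_stores: List[List[Record]],
    first_store: List[Record],
) -> List[Record]:
    # Step ii): the resource allocation manager selects subscriptions by name.
    selected = choose(sorted(subscriptions))
    # Step iii): select the data associated with the selected subscriptions.
    matched = [
        record
        for store in further_stores
        for record in store
        if any(subscriptions[name](record) for name in selected)
    ]
    # Step iv): transfer the selected data to the first data store.
    first_store.extend(matched)
    return matched


# Step i): define a plurality of data subscriptions (hypothetical examples).
subscriptions = {
    "storage": lambda record: "storage" in record["text"],
    "sport": lambda record: "sport" in record["text"],
}
further_stores = [
    [{"text": "cloud storage prices"}, {"text": "sport results"}],
    [{"text": "storage arrays"}],
]
first_store: List[Record] = []
transferred = analyse(subscriptions, lambda names: ["storage"], further_stores, first_store)
```

In this example the stand-in resource allocation manager selects only the "storage" subscription, so the two matching records are copied into the first data store and the "sport" record is left in the further data store.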
According to a third aspect of the present invention there is provided a tangible data carrier for use in a computing device, the data carrier comprising computer executable code which, in use, performs a method as described above.