It is known to submit queries to a database that return deterministic results, where there is no ambiguity in either the search term or the retrieved result. For example, a deterministic result may be returned by an SQL query such as SELECT name FROM customers WHERE customer_id='123456', where the search terms, namely customers and customer_id, are defined.
Furthermore, a search query may be a complex search or a nested search query, which utilises conjunctions and disjunctions to assemble a query. For example, a typical nested query in an SQL based environment may be SELECT * FROM customers WHERE ((credit_rating='good' AND payment_history='excellent') OR (credit_rating='excellent')). In this example, a user may wish to identify all customers who have either an excellent credit rating, or a good credit rating together with an excellent payment history, and has combined several search constraints, namely credit rating and payment history, to form a search query through the use of logical connectives such as conjunction and disjunction. In this way queries can be nested or structured in a hierarchical manner to create complex search queries that query many objects having a common root. Thus, the complex search query is a convenient way of referring to a single hierarchy of connected search constraints and search phrases. In the following description “search constituent” is used to indicate any constituent within the query structure, from an individual search term (word or phrase), through the search constraints, to the entire complex query.
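By way of illustration only, the hierarchy of search constituents in the example query above may be sketched as a small tree structure. The class names Constraint, And and Or and the functions constituents and matches are hypothetical names introduced for this sketch and do not appear in the description; the sketch assumes simple equality constraints and deterministic evaluation against a single record.

```python
from dataclasses import dataclass
from typing import Iterator, Tuple, Union


@dataclass(frozen=True)
class Constraint:
    """A leaf search constraint such as credit_rating = 'good' (hypothetical type)."""
    field: str
    value: str


@dataclass(frozen=True)
class And:
    """Conjunction of child constituents."""
    children: Tuple["Node", ...]


@dataclass(frozen=True)
class Or:
    """Disjunction of child constituents."""
    children: Tuple["Node", ...]


Node = Union[Constraint, And, Or]


def constituents(node: Node) -> Iterator[Node]:
    """Yield every search constituent, from the leaves up to the whole query."""
    if isinstance(node, (And, Or)):
        for child in node.children:
            yield from constituents(child)
    yield node


def matches(node: Node, row: dict) -> bool:
    """Deterministically evaluate the query against one record."""
    if isinstance(node, Constraint):
        return row.get(node.field) == node.value
    results = (matches(child, row) for child in node.children)
    return all(results) if isinstance(node, And) else any(results)


# The nested example query: (good AND excellent history) OR (excellent rating)
query = Or((
    And((Constraint("credit_rating", "good"),
         Constraint("payment_history", "excellent"))),
    Constraint("credit_rating", "excellent"),
))
```

In this sketch the example query has five search constituents: three individual constraints, the conjunction, and the complete query formed by the disjunction.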
Subjectivity or uncertainty in the searched or queried data or material will result in non-absolute or non-deterministic results. This is particularly apparent when there may be some uncertainty in the content of the data. In particular, searching multimedia data, such as audio recordings, usually gives rise to non-deterministic results due to the uncertainties involved in methods of searching audio data, such as word recognition. Uncertainties in word or pattern recognition often require results to be calculated according to their probable relevance or likelihood of match, given the uncertainties in the models used.
Such non-deterministic results are typically expressed as scores on a numerical scale for each of a set of variables, and execution of a complex search query is the process of extracting or obtaining those scores for the specified variables. The use of numerical scores allows alternative implementations of complex queries, which may be expressed as weighted combinations, so that if credit_rating and payment_history are scores on a numeric scale, the combined score would be expressed as 0.7*credit_rating+0.3*payment_history. For the example application this process is known as “credit scoring”, but there are many wider applications. A further alternative is to introduce non-linear functions into the process, so that the combined example score would be 1/(1+exp(intercept+0.7*credit_rating+0.3*payment_history)), where “intercept” is a further heuristic parameter to be determined. In this form, one statistical method allowing automatic selection of the parameters (intercept, 0.7, 0.3) is known as “logistic regression”, and the input variables (credit_rating, payment_history) are known as “predictors”. The word “probability” should be interpreted to mean any such score on a numerical scale, whether or not it strictly obeys the mathematical definition of probability.
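The two combined scores described above may be sketched as follows. This is a minimal illustration only: the function names weighted_score and logistic_score and the default intercept of 0.0 are assumptions made for the sketch, while the weights 0.7 and 0.3 are the example values given in the text. The sign of the exponent follows the formula exactly as written above; many statistical texts instead place a minus sign before the linear term, which reverses the direction of the score.

```python
import math


def weighted_score(credit_rating: float, payment_history: float,
                   w_credit: float = 0.7, w_payment: float = 0.3) -> float:
    """Linear weighted combination of two numerical scores."""
    return w_credit * credit_rating + w_payment * payment_history


def logistic_score(credit_rating: float, payment_history: float,
                   intercept: float = 0.0,  # assumed default, not from the text
                   w_credit: float = 0.7, w_payment: float = 0.3) -> float:
    """Non-linear combination, using the formula as written in the text:
    1 / (1 + exp(intercept + w_credit*credit_rating + w_payment*payment_history)).
    """
    linear = intercept + w_credit * credit_rating + w_payment * payment_history
    return 1.0 / (1.0 + math.exp(linear))


if __name__ == "__main__":
    # Example predictor values on an arbitrary numerical scale (illustrative only).
    print(weighted_score(0.9, 0.5))
    print(logistic_score(0.9, 0.5))
```

The logistic form always yields a value strictly between 0 and 1, which is one reason such a score may loosely be referred to as a “probability” in the sense used above.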
It is known to combine such numerical scores with deterministic information and to allow such deterministic information to modify the weightings. In statistics, such information may be represented as a “factor” (taking one of a discrete set of values) among the predictors, and the modification of the weightings corresponds to “interactions” among the predictors, whether those are discrete or continuous. For example, in credit rating the applicant's gender may be included either in isolation, effectively providing a different “intercept” for men and women, or in such a way that all the parameters are different for men and for women.
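The two ways of including a discrete factor described above may be sketched as follows. All parameter values, the factor levels "level_a" and "level_b", and the function names are hypothetical and chosen purely for illustration; in practice such parameters would be selected by a statistical method such as logistic regression. The score uses the same functional form as the example formula in the preceding paragraph.

```python
import math
from typing import Dict

# Hypothetical parameters: a factor that only shifts the intercept.
SHARED_WEIGHTS = {"credit_rating": 0.7, "payment_history": 0.3}
INTERCEPT_BY_LEVEL = {"level_a": -0.5, "level_b": 0.25}

# Hypothetical parameters: a full interaction, where every parameter
# differs for each level of the factor.
PARAMS_BY_LEVEL = {
    "level_a": {"intercept": -0.5, "credit_rating": 0.8, "payment_history": 0.2},
    "level_b": {"intercept": 0.25, "credit_rating": 0.6, "payment_history": 0.4},
}


def _score(intercept: float, weights: Dict[str, float],
           predictors: Dict[str, float]) -> float:
    """Logistic-style score in the form used in the text."""
    linear = intercept + sum(weights[name] * predictors[name] for name in weights)
    return 1.0 / (1.0 + math.exp(linear))


def score_factor_in_isolation(level: str, predictors: Dict[str, float]) -> float:
    """The factor selects a different intercept; the weights are shared."""
    return _score(INTERCEPT_BY_LEVEL[level], SHARED_WEIGHTS, predictors)


def score_full_interaction(level: str, predictors: Dict[str, float]) -> float:
    """The factor selects an entirely separate parameter set."""
    params = PARAMS_BY_LEVEL[level]
    weights = {k: v for k, v in params.items() if k != "intercept"}
    return _score(params["intercept"], weights, predictors)
```

For identical continuous predictors, the two factor levels produce different scores in both variants; in the first variant the difference arises solely from the intercept.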
The use of non-deterministic searching and matching is a powerful tool when analysing data. However, it is often difficult for a user to resolve or understand the non-deterministic nature of the query results quickly and efficiently. In particular, a complex search query with nested queries may produce a non-deterministic result which has a relevancy score for each of the nested queries. As a complex query may have a nested structure which potentially runs into several tens or hundreds of fields, the user would be presented with a result that may have relevancy scores, or likelihoods of matches, for each of the terms in the nested structure, requiring a large amount of human interaction and assimilation for the user to fully understand the results of the search query. Therefore, there is a need for the user to better understand the data presented to them, in a manner that facilitates their understanding of the results and improves man-machine interaction. In particular, there is a need for the user to be able to easily identify which of the results having a probable relevance are the most likely to be absolutely relevant, i.e. those hits or results that relate to the terms or objects for which the user is searching.
Additionally, due to the non-deterministic nature of the results, a search query will return some results that are calculated as having a high relevance but are in fact irrelevant and, conversely, some results that are calculated as irrelevant but are in fact relevant. This is particularly an issue for searches of non-textual analogue material, such as audio, video or any other signal (such as radar), where there are uncertainties involved in the pattern matching algorithms used.
To assign an absolute relevance to a result (i.e. to turn a non-deterministic result into an absolute result) requires the result to be verified in some manner. Human interaction can help determine whether the result is correct; however, this may require a user to check the entire source that contains the hit. For example, if a complex nested search has returned a match to an audio source, and the user wishes to see whether the source is indeed relevant, they would be required to listen to the entire source to determine its relevancy. This is clearly inefficient and, in the case where the source may be several minutes or hours in length, time consuming. Therefore, there is a requirement for a system which allows users to quickly and efficiently determine the relevancy of a hit and assign an absolute relevance if required.
To mitigate at least some of these and other problems in the prior art, there is provided, according to a first aspect of the invention, apparatus for analysing non-deterministic results of a search query of data representing analogue information, such as audio data, comprising: a processor and a user interface, the processor being operably in communication with a plurality of data sources, preferably audio data sources, or databases representing the content thereof and adapted to communicate with the user interface which enables the user to query one or more data sources for the presence of search constituents within the data, wherein the processor is adapted to determine the non-deterministic likelihood of occurrence of the search constituent within at least part of a searched data source for a user query and the user interface is adapted to present to the user the search results in a form comprising two or more portlets from: a portlet presenting the overall search results (such as search strings) against part or all of the search query structure for one or more data source(s); a portlet presenting the data source (such as by source name) of one or more data source(s); a portlet presenting a data source filter tree for selecting currently active source(s); a portlet presenting the hit(s) of the search phrase(s) for a data source; a portlet presenting the hit location(s) within a data source, and wherein at least one of the portlets presents the user with information related to the probability of the relevance of a selected data source to the search query and/or parts of the search query, and the user interface further enabling the user to select and inspect at least part of the searched data source(s) for the presence of the search constituents.
According to another aspect of the invention there is provided a data file comprising core data and associated metadata, wherein the metadata comprises deterministic results of a complex search query resulting from human intervention with the data so as to assign the deterministic result to the data.
According to yet another aspect of the invention there is provided a method of analysing source data relevance for a complex search query, comprising the steps of constructing a complex search query of two or more search phrases, terms and/or constraints, searching a plurality of data sources according to the complex search query, determining a probable relevance of at least part of a data source for the search query, presenting the probable relevance of the data source to a user, enabling the user to determine directly the relevance of the data source for the search query, and enabling a user to edit the probable relevance of the data source based on user interaction with the data source.
According to yet another aspect of the invention there is provided a method of analysing plural data sources, said data sources comprising core data and metadata, wherein at least some of the metadata comprises deterministic relevance results of a complex search query, said deterministic relevance results being determined through human intervention with said data sources, which previously had non-deterministic relevance results for the relevance of a match of the source with the complex search query. The metadata may also include other information related to the data source; for example, within a call centre environment the metadata may include agent and customer identifiers.
According to yet another aspect of the invention there are provided associated methods for defining sets of tags or labels and for assigning one or more tag(s) from one or more set(s) to some or all of the data sources or to portions within some or all of the data sources. The tags may be defined to be mutually exclusive within each set, so that at most one tag can be assigned from the set, or may be allowed to co-exist. The assignment of tags may be: fully automatic, based on ranges of the scores associated with the non-deterministic search results (alone or in combination with source metadata); fully manual, based on inspection of each data source; or a combination of these approaches. Once assigned the tags may be used, alone or in conjunction with search results and/or metadata associated with the data source, to select subsets of the material for further processing.
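One possible arrangement for assigning tags from a mutually exclusive tag set may be sketched as follows, combining fully automatic assignment based on score ranges with manual assignment that takes precedence. The tag names, the score thresholds, the source names and the function names are all hypothetical and serve only to illustrate the approach; the precedence of manual over automatic tags is likewise an assumption of this sketch.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical mutually exclusive tag set: (lower bound inclusive,
# upper bound exclusive, tag). The ranges do not overlap, so at most
# one tag from the set can be assigned to a given score.
RELEVANCE_TAGS: List[Tuple[float, float, str]] = [
    (0.8, float("inf"), "likely relevant"),
    (0.4, 0.8, "needs review"),
    (0.0, 0.4, "likely irrelevant"),
]


def auto_tag(score: float,
             ranges: List[Tuple[float, float, str]]) -> Optional[str]:
    """Return the tag whose score range contains the score, if any."""
    for low, high, tag in ranges:
        if low <= score < high:
            return tag
    return None


def assign_tags(scores: Dict[str, float],
                ranges: List[Tuple[float, float, str]],
                manual: Optional[Dict[str, str]] = None) -> Dict[str, Optional[str]]:
    """Combine automatic tags with manual tags; a manual tag, assigned
    after inspection of a data source, overrides the automatic one."""
    manual = manual or {}
    return {source: manual.get(source, auto_tag(score, ranges))
            for source, score in scores.items()}


def select_by_tag(tags: Dict[str, Optional[str]], wanted: str) -> List[str]:
    """Select the subset of data sources carrying a given tag."""
    return sorted(source for source, tag in tags.items() if tag == wanted)


# Illustrative scores for three hypothetical audio sources.
scores = {"call_001.wav": 0.92, "call_002.wav": 0.55, "call_003.wav": 0.10}
tags = assign_tags(scores, RELEVANCE_TAGS,
                   manual={"call_002.wav": "likely relevant"})
subset = select_by_tag(tags, "likely relevant")
```

In this sketch the selected subset contains the first source, tagged automatically from its score, and the second source, tagged manually following inspection.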
According to yet another aspect of the invention there is provided support for multiple people to work on the same project, including methods for exporting and importing the project as a whole and methods for re-connecting a project with data sources when either the data has been moved or the address through which the data is accessed has changed.
According to yet another aspect of the invention there is provided apparatus for defining deterministic results of a non-deterministic search, comprising: a processor and a user interface, the processor being operably in communication with a plurality of data sources or databases representing the content thereof and adapted to communicate with the user interface which enables the user to query the content of the data sources, wherein the processor is adapted to determine the probable relevance of at least part of a searched data source for a user query and the user interface is adapted to present to the user the search results; the user interface further enabling the user to select and inspect at least part of the searched data source to assign a deterministic relevance result for at least part of the user query to said searched data source and/or to assign one or more tags from predefined tag sets to at least part of said searched data source.
Other aspects and features of the invention will become apparent from the following description and the appended claims.