The present invention relates to analysing data files containing representations of natural language to identify unspecified high value items.
Database technology is known in which information is supplied to users in the form of text-based files, in preference to the more traditional organisation of data in numerical and tabular form. Several facilities are available on the Internet, commonly referred to as “search engines”, which assist in the location of information. For the present purposes, it will be assumed that information represents a selection of data files, selected from a very large volume of available data files, which are of particular interest to a user.
The majority of known databases perform what has become known as “free text” searching, in which a user specifies words which they believe are contained within the target data files (that is, the information of interest) as a mechanism for instructing a database supplier to retrieve files of interest. Problems with this technique are well known to users of the available search engines, particularly over the Internet. A simple enquiry can generate hundreds of thousands of “hits”, the majority of which will tend to be totally irrelevant to the user's needs. Furthermore, other relevant files may be missed simply because they do not contain the specific chosen words. Thus, in the present context, engines are known for providing a level of filtering of available data but the provision of high value information to users by technical means presents a considerable problem.
Many data files may be classified with reference to categories, and technical solutions have been put forward by the present applicant for the association of incoming data files with categories so as to facilitate the identification of information. However, a further problem arises in that particular types of information may often be of interest to users, but the characteristic which actually makes the items of interest is difficult to determine with reference to the incoming data file itself.
The work performed by the present inventors has been directed towards the identification of information relating to companies, financial transactions, etc., although the procedures identified herein have much wider application. Thus, many users of the service provided by the present applicant under the trade mark “PROFOUND” consider up-to-date information in connection with companies of interest to be extremely valuable. However, when data is first received by the PROFOUND system, it is not known which companies will actually be of interest.
In order to facilitate subsequent searching and to enhance the availability of information of interest, it is recognised that any data files containing information relating to any companies are potentially of interest to users of the system in the future. However, the actual data files being processed would tend to include only references to the actual company names, without any pointers stating something to the effect that “this is a company”.
Data items of this type are referred to herein as unspecified high value items: unspecified in that it is their characteristic rather than their content which is of interest, and of high value in that there is a high probability that users will identify an interest in files containing references to this item. The present application therefore addresses the problem of identifying files containing unspecified high value items using technical means, thereby allowing a large number of files to be processed in realistic time-scales and at realistic costs.
According to a first aspect of the present invention, there is provided apparatus arranged to receive data files from data sources and to categorise said data files to facilitate searching in response to user requests, wherein said data files contain unspecified high value items, comprising identifying means configured to identify occurrences of unspecified candidate items in preferred contexts for a preferred specified category and to identify occurrences of unspecified candidate items in non-preferred contexts; and processing means configured to process said preferred occurrences with said non-preferred occurrences for each candidate item and to select a candidate item as a high value item.
In a preferred embodiment, a first transmission means is included for continually supplying input data files from a plurality of sources.
Preferably, a second transmission means is included for supplying information to users in response to user requests.
According to a second aspect of the present invention, there is provided a method of analysing data files containing representations of a natural language to identify unspecified high value items, comprising steps of identifying occurrences of unspecified candidate items in contexts for a preferred specified category; identifying occurrences of unspecified candidate items in contexts for a non-preferred specified category; processing said preferred occurrences with said non-preferred occurrences for each candidate item; and selecting a candidate item as a high value item in response to said processing step.
In a preferred embodiment, occurrences of unspecified candidate terms are identified for a plurality of non-preferred categories. The preferred category may represent companies and non-preferred categories may include place names and personal names.
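By way of illustration only, the method of the second aspect might be sketched as follows. The context patterns (a company suffix such as "Ltd", a title such as "Mr", the phrase "travelled to"), the `margin` parameter and all function names are assumptions introduced for this sketch and are not taken from the present disclosure; a practical system would be expected to use a much richer set of contexts.

```python
import re
from collections import defaultdict

# A candidate item: one or more capitalised words (assumed form).
CANDIDATE = r"(?P<item>[A-Z][a-z]+(?: [A-Z][a-z]+)*)"

# Hypothetical contexts for the preferred category (companies).
PREFERRED = [
    re.compile(CANDIDATE + r" (?:Ltd|Inc|plc)\b"),
    re.compile(r"\bshares in " + CANDIDATE),
]

# Hypothetical contexts for non-preferred categories.
NON_PREFERRED = {
    "personal name": [re.compile(r"\b(?:Mr|Mrs|Dr) " + CANDIDATE)],
    "place name": [re.compile(r"\b(?:located in|travelled to) " + CANDIDATE)],
}


def count_occurrences(texts):
    """Count, per candidate, occurrences in preferred and non-preferred contexts."""
    preferred = defaultdict(int)
    non_preferred = defaultdict(int)
    for text in texts:
        for pattern in PREFERRED:
            for match in pattern.finditer(text):
                preferred[match.group("item")] += 1
        for patterns in NON_PREFERRED.values():
            for pattern in patterns:
                for match in pattern.finditer(text):
                    non_preferred[match.group("item")] += 1
    return preferred, non_preferred


def select_high_value(texts, margin=1):
    """Select candidates whose preferred count exceeds the non-preferred
    count by at least `margin` (an assumed selection rule)."""
    preferred, non_preferred = count_occurrences(texts)
    return sorted(
        item
        for item, count in preferred.items()
        if count - non_preferred[item] >= margin
    )


texts = [
    "Acme Ltd announced results.",
    "Investors bought shares in Acme.",
    "Mr Smith travelled to Paris.",
    "Smith Ltd is unrelated.",
    "Mr Smith spoke.",
]
print(select_high_value(texts))
```

In this sketch "Smith" appears once in a company context but twice in a personal-name context, so it is not selected, which is the purpose of processing the preferred occurrences with the non-preferred occurrences.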
In a preferred embodiment, a plurality of processes are performed to remove candidates to produce a refined list of high value items. Identified occurrences may result in score values being increased and the processing steps may involve the processing of the score values. The score values may be increased non-linearly so as to restrain the scores within a predetermined maximum value. Similar entries may be identified and one or more of the similar entries may be removed in response to a score comparison. Similar entries may represent situations in which a first entry is the same as a second entry with an extension added thereto.
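The score handling described above might, purely as an example, be sketched as follows. The particular non-linear update (moving a fixed fraction of the remaining distance towards the maximum), the values of `MAX_SCORE` and `STEP`, the treatment of an extension as a trailing word, and the rule that the lower-scoring of two similar entries is removed are all assumptions made for this sketch; the disclosure states only that scores are restrained within a predetermined maximum and that removal follows a score comparison.

```python
MAX_SCORE = 100.0  # assumed predetermined maximum
STEP = 0.25        # assumed fraction of the remaining headroom added per occurrence


def bump(score, max_score=MAX_SCORE, step=STEP):
    """Increase a score non-linearly so that it cannot exceed max_score."""
    return score + (max_score - score) * step


def prune_extensions(scores):
    """Where one entry equals another with an extension added, keep only
    the higher-scoring entry (on a tie, the shorter entry is kept)."""
    kept = dict(scores)
    names = sorted(scores, key=len)
    for shorter in names:
        for longer in names:
            if longer == shorter or not longer.startswith(shorter + " "):
                continue
            if shorter in kept and longer in kept:
                loser = shorter if scores[shorter] < scores[longer] else longer
                del kept[loser]
    return kept


# Each identified occurrence raises the score by a diminishing amount.
score = 0.0
for _ in range(3):
    score = bump(score)
print(score)

scores = {
    "Acme": 40.0,
    "Acme Holdings": 10.0,
    "Globex": 5.0,
    "Globex Corporation": 30.0,
}
print(prune_extensions(scores))
```

With an update of this kind, early occurrences of a candidate contribute most to its score and later occurrences contribute progressively less, so that a very frequent candidate cannot dominate the subsequent comparisons.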