The field of the invention relates to computer-implemented methods for determining the similarity of the contents of text documents. In particular, the field of the invention relates to applications of such methods to searching large databases, such as hypertext documents on the Internet, and to automatically producing personalized content without requiring a user or users to provide or update personalized searching criteria.
There are known techniques for comparing text files to determine similarity and difference of content, and for providing personalized content to a user. A common difficulty of present techniques is that they often return inappropriate matches or miss desired matches. For example, the list of documents retrieved by a typical Internet search engine usually contains a large fraction of irrelevant items and misses a large fraction of relevant items. Although sophisticated searching techniques employing artificial intelligence and the like can improve the quality of searching, these techniques usually require a lengthy training phase. Additionally, many present searching systems require the user to manually select personalized search criteria. This process is laborious, time-consuming, and often requires the user to learn cryptic searching syntax. It also has the disadvantage that, as the user's interests change over time, the personalized criteria must be manually modified.
In summary, conventional systems for searching large databases lack simple and fast techniques for comparing documents quickly and determining their similarity with high accuracy. Conventional methods also fail to provide simple database searching techniques that can identify documents of interest to a user or users without requiring the user or users to manually provide personalized searching criteria, and without requiring the user or users to update such criteria.
Accordingly, there is a need for a method for accurately and quickly comparing the contents of text documents and determining their similarity. It would also be desirable to provide such a method which may be used to implement an improved database searching system. What is also needed is a database searching system that automatically identifies documents of interest to a user or users without requiring the user or users to specify any search criteria. There is also a need for a system which provides automatic updates to the search criteria without requiring direct user intervention.