A citation index indexes citations contained in articles, linking the articles with the cited works. Citation indexing was originally designed for information retrieval. A citation index allows navigation backward in time by following the list of cited articles and forward in time by tracking which subsequent articles cite any given article.
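The backward and forward navigation described above can be illustrated with a minimal sketch. The class name `CitationIndex`, its method names and the article identifiers are hypothetical, chosen only for illustration; this is not the data structure of any particular citation index.

```python
from collections import defaultdict


class CitationIndex:
    """Toy citation index: stores 'citing -> cited' links in both directions."""

    def __init__(self):
        self._cites = defaultdict(set)     # article -> works it cites
        self._cited_by = defaultdict(set)  # article -> later works citing it

    def add_citation(self, citing, cited):
        """Record that `citing` contains a reference to `cited`."""
        self._cites[citing].add(cited)
        self._cited_by[cited].add(citing)

    def backward(self, article):
        """Navigate backward in time: the works this article cites."""
        return sorted(self._cites.get(article, ()))

    def forward(self, article):
        """Navigate forward in time: subsequent works citing this article."""
        return sorted(self._cited_by.get(article, ()))


# Hypothetical article identifiers, for illustration only.
index = CitationIndex()
index.add_citation("paper_1997", "paper_1979")
index.add_citation("paper_1997", "paper_1965")
index.add_citation("paper_1998", "paper_1997")
```

Here `index.backward("paper_1997")` lists the two earlier works, while `index.forward("paper_1997")` lists the later paper that cites it. The number of entries returned by `forward` is the citation count that is often used as a rough indication of an article's importance.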
The rate of production of scientific literature continues to increase, making it time-consuming for researchers to stay current. Most published scientific research appears in paper documents such as scholarly journals or conference proceedings, and there is a considerable delay between the completion of research and the availability of the publication. As a result, the World Wide Web ("the web") has become an important distribution medium for scientific research. An increasing number of authors are making new research available on the web in the form of preprints or technical reports which can be downloaded and printed. These web publications are often available before any corresponding printed publication in journals or conference proceedings. In order to keep current on research, especially in rapidly advancing fields, a researcher can search the web to download papers as soon as they are written. Literature available on the web is easier to access than printed literature. The web, however, has its own limitations for searching. It lacks a standardized organization, publications themselves tend to be poorly organized (each institution or researcher may have its own organizational scheme), and publications are often spread throughout the web. A researcher could spend large amounts of time solely on searching for, downloading and printing papers in order to find those publications that may be important. However, web-based literature can be read and processed by autonomous agents far more easily than printed documents. Agents can search the web and thereby provide an automated means to find, download and judge the relevance of web publications. The present invention concerns just such an agent. An assistant agent is a computer program which automatically performs some task on behalf of a user.
One method for finding relevant and important publications on the web is to use a combination of web search engines and manual web browsing. Web search engines such as AltaVista (http://altavista.digital.com) index the text contained on web pages, allowing users to find information with keyword search. Some research publications on the web are made available in HTML (HyperText Markup Language) format, making the text of these papers searchable with web search engines. However, most of the published research papers on the web are in PostScript form (which preserves the formatting of the original), rather than HTML. The text of these papers is not indexed by search engines such as AltaVista, so researchers must instead locate pages which contain links to these papers, e.g. by searching for a paper by title or author name. Another limitation of web search engines is that they typically use only word frequency information to find relevant web pages, although other types of information are potentially useful, e.g. papers which cite the same earlier papers may be related.
In the following text, reference will be made to several publications in the open literature. These publications are herein incorporated by reference.
The present invention benefits from three areas of prior work. The first involves citation indexing, which indexes the citations made between academic articles. See, for example, Chapters 1 to 3 and Chapter 10 of the book by E. Garfield, entitled "Citation Indexing: Its Theory and Application in Science, Technology and Humanities", ISI Press, Philadelphia, 1979.
The second concerns semantic distance measures between text documents. Research in this area is directed towards finding quantifiable and useful measures of similarity or relatedness between bodies of text.
The third involves web, interface and assistant software agents. Several papers have addressed the problem of locating "interesting" web pages. Examples include articles by M. Pazzani, J. Muramatsu and D. Billsus, entitled "Syskill & Webert: Identifying interesting web sites" in Proceedings of the National Conference on Artificial Intelligence (AAAI96), 1996; by F. Menczer, entitled "Arachnid: Adaptive retrieval agents choosing heuristic neighborhoods for information discovery" in Machine Learning: Proceedings of the Fourteenth International Conference, pp. 227-235, 1997; by M. Balabanovic, entitled "An adaptive web page recommendation service" in Proceedings of the First International Conference on Autonomous Agents, ACM Press, New York, pp. 378-385, 1997; and by A. Moukas, entitled "Amalthaea: Information discovery and filtering using a multiagent evolving ecosystem" in Proceedings of the Conference on Practical Applications of Agents and Multiagent Technology, 1996. This includes work which uses learning techniques based on user feedback.
In citation indexing, references contained in articles are used to give credit to previous work in the literature and provide a link between the "citing" and "cited" articles. A citation index, such as Garfield, supra, indexes the citations that an article makes, linking the articles with the cited works. Citation indexes were originally designed mainly for information retrieval, as referenced by E. Garfield in an article entitled "The concept of citation indexing: A unique and innovative tool for navigating the research literature" Current Contents, Jan. 3, 1994. The citation links allow navigating the literature in unique ways. Papers can be located independent of language and words in the title, keywords or document. A citation index allows navigation backward in time (the list of cited articles) and forward in time (subsequent articles which cite the current article). Citation indexing can be a powerful tool for literature search, in particular:
a. A citation index allows finding out where and how often a particular article is cited in the literature, thus providing an indication of the importance of the article. Older articles may define methodology or set the research agenda. Newer articles may respond to or build upon the original article.
b. Citations can help to find other publications which may be of interest. Using citation information in addition to keyword information should allow the identification of more relevant literature.
c. The context of citations in citing publications may be helpful in judging the important contributions of a cited paper.
d. A citation index can provide detailed analyses of research trends and identify emerging areas of science.
The Institute for Scientific Information (ISI)® (Institute for Scientific Information, 1997) produces multi-disciplinary citation indexes, which are used to provide several commercial services for searching scientific periodicals. One ISI service is Keywords Plus®, which adds citation information to the indexing of an article. Specifically, in addition to the title, author-supplied keywords, and abstract, Keywords Plus adds indexing terms which are derived from the titles of cited papers. As a user browses through papers in the ISI databases, bibliographic coupling allows navigation by locating papers which share one or more references.
Another commercial citation index is the legal database offered by the West Group (KeyCite). This database indexes case law as opposed to scientific research articles.
Compared to the current commercial citation indexes, the citation indexing performed by using the present invention has the following limitations: it does not cover the significant journals as comprehensively and it cannot distinguish subfields as accurately, e.g. it will not disambiguate two authors with the same name.
The present invention, Autonomous Citation Indexing (ACI), has significant advantages over traditional citation indexing:
a) No manual effort is required for indexing, resulting in a corresponding reduction in cost and increase in availability. We believe that this can be very important.
b) ACI facilitates literature search based on the context of citations: given a particular paper of interest, an ACI system can display the context of how the paper is cited in subsequent publications. The context of citations can be very important for both literature search and evaluation.
Because ACI does not require human indexers, very significant further benefits result:
a) ACI allows creating more up-to-date databases which can avoid lengthy journal publication delays, because it is not necessary to limit the amount of literature indexed due to human resource requirements. In many areas preprints and conference papers are available long before any corresponding journal publication.
b) ACI offers the potential for broader coverage of the literature, as opposed to indexing only a select set of journals. The Science Citation Index (SCI) has been repeatedly criticized for not indexing certain literature.
ACI can improve scientific communication, and facilitates an increased rate of scientific dissemination and feedback.
R. D. Cameron, in an article entitled "A universal citation database as a catalyst for reform in scholarly communication", Technical Report CMPT TR 95-07, School of Computing Science, Simon Fraser University (1995), has proposed a "universal [Internet-based] bibliographic and citation database linking every scholarly work ever written". He describes a system in which all published research would be available to and searchable by any scholar with Internet access. Also, citation links between those documents would be recorded and available as search criteria. Such a database would be highly "comprehensive and up-to-date", making it a powerful tool for academic literature research, and for the production of statistics as with traditional citation indexes.
One important difference between Cameron's vision of a universal citation database and the present invention is that the present invention does not require any extra effort on the part of authors beyond placement of their work on the web. The present invention automatically creates the citation database from downloaded documents whereas Cameron has proposed a system whereby authors or institutions must provide citation information in a specific format.
Another area of prior work is the use of semantic distance measures. Given a set of documents (essentially text strings), there has been much interest in distance (or the inverse, similarity) measurements between documents. Most of the known distance measures between bodies of text rely on models of similarity of groups of letters in the text. One type of text distance measure is the string distance or edit distance, which considers distance as the amount of difference between strings of symbols. For example, the Levenshtein distance, as described in an article entitled "Binary codes capable of correcting spurious insertions and deletions of ones (original in Russian)", Russian Problemy Peredachi Informatsii 1, 12-25 (1965), is a well known early edit distance where the difference between two text strings is simply the number of insertions, deletions, or substitutions of letters required to transform one string into another. A more recent and sophisticated example is an algorithm called LikeIt as described by P. N. Yianilos in Technical Report No. 97-093, NEC Research Institute, entitled "The LikeIt intelligent string comparison facility" (1997) and by P. N. Yianilos, in an article entitled "Data structures and algorithms for nearest neighbor search in general metric spaces", in Proceedings of the 4th ACM-SIAM Symposium on Discrete Algorithms, pp. 311-321 (1993) and in U.S. Pat. No. 5,978,797 to Yianilos and assigned to the same assignee as the present invention, where a string distance is based on an algorithm that tries to "build an optimal weight matching of the letters and multigraphs (groups of letters)".
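The edit distance described above can be sketched with the standard dynamic-programming recurrence. This is a minimal illustration of the Levenshtein distance only, with unit cost for each insertion, deletion or substitution; it does not represent the LikeIt algorithm, and the function name `levenshtein` is chosen here for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or
    substitutions needed to transform string `a` into string `b`."""
    # Keep the shorter string in the inner loop to limit memory use.
    if len(a) < len(b):
        a, b = b, a

    # previous[j] = distance between the processed prefix of `a` and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # transforming a[:i] into "" takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]
```

For example, `levenshtein("kitten", "sitting")` evaluates to 3 (two substitutions and one insertion). A measure of this kind is useful for matching variant spellings of the same string, since small typographical differences yield small distances.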
Another type of text string distance is based on statistics of words which are common to sets of documents, especially as part of a corpus of a large number of documents. One commonly used form of this measure, based on word frequencies, is known as term frequency times inverse document frequency (TFIDF) as described by G. Salton and C. Yang, in an article entitled "On the specification of term values in automatic indexing", in the Journal of Documentation 29, pp 351-372 (1973). Consider a dictionary of all of the words (terms) in a corpus of documents. In some systems, very common words, sometimes called stop words, such as "the", "a", and so forth are ignored. Also, sometimes only the stems of words are considered instead of complete words. An often used stemming heuristic introduced by M. F. Porter, in an article entitled "An algorithm for suffix stripping", Program 14, pp 130-137 (1980), tries to return the same stem from several forms of the same word, e.g. "walking", "walk", "walked", all become simply "walk". In a document $d$, the frequency of each word stem $s$ is $f_{ds}$ and the number of documents having stem $s$ is $n_s$. In document $d$ the highest term frequency is called $f_{d_{max}}$. In one such TFIDF scheme, as described in an article by G. Salton and C. Buckley, entitled "Term weighting approaches in automatic text retrieval", in Technical Report 87-881, Department of Computer Science, Cornell University (1987), a word weight $w_{ds}$ is calculated as:
$$w_{ds} = \frac{\left(0.5 + 0.5\,\frac{f_{ds}}{f_{d_{max}}}\right)\log\frac{N_D}{n_s}}{\sqrt{\sum_{j \in d}\left(0.5 + 0.5\,\frac{f_{dj}}{f_{d_{max}}}\right)^{2}\left(\log\frac{N_D}{n_j}\right)^{2}}}$$
where $N_D$ is the total number of documents. In order to find the distance between two documents, a dot product of the two word vectors for those documents is calculated.
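The word-vector comparison can be sketched as follows. The weighting used here is an assumption: the augmented term frequency (0.5 + 0.5 f/f_max) multiplied by log(N_D/n_s) and normalized to unit length, which is one scheme commonly attributed to Salton and Buckley. The function names and the pre-stemmed token lists are hypothetical. Note that with normalized vectors a larger dot product indicates more closely related documents.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token (stem) lists. Returns one {stem: weight} dict
    per document, normalized to unit length."""
    n_docs = len(docs)
    # n_s: number of documents containing each stem.
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in docs:
        counts = Counter(tokens)  # f_ds for this document
        if not counts:
            vectors.append({})
            continue
        f_max = max(counts.values())  # highest term frequency in the document
        raw = {
            stem: (0.5 + 0.5 * f / f_max) * math.log(n_docs / doc_freq[stem])
            for stem, f in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in raw.values()))
        # A document whose stems all occur in every document has zero norm.
        vectors.append({s: (w / norm if norm else 0.0) for s, w in raw.items()})
    return vectors


def dot_product(u, v):
    """Dot product of two sparse word vectors."""
    return sum(weight * v.get(stem, 0.0) for stem, weight in u.items())


# Hypothetical, already stemmed and stop-word-filtered documents.
docs = [
    ["citation", "index", "web"],
    ["citation", "index", "agent"],
    ["protein", "fold", "web"],
]
vectors = tfidf_vectors(docs)
```

In this toy corpus, `dot_product(vectors[0], vectors[1])` is larger than `dot_product(vectors[0], vectors[2])`, because the first two documents share two stems while the first and third share only one.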
A third type of semantic distance measure is one in which knowledge about document components or structure is used. In the case of research publications, for example, citations of papers by other papers have been used to create citation indexes (as described above) which can be used to gauge document relatedness as described by G. Salton, in an article entitled "Automatic indexing using bibliographic citations", Journal of Documentation 27, pp 98-110 (1971).
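One simple way to gauge relatedness from citations is to compare the reference lists of two papers. The sketch below is only an illustration: the function names and reference identifiers are hypothetical, and normalizing the overlap by the size of the union (a Jaccard-style ratio) is an assumption made here, not a measure taken from the cited work.

```python
def shared_references(refs_a, refs_b):
    """References cited by both papers."""
    return set(refs_a) & set(refs_b)


def coupling_strength(refs_a, refs_b):
    """Fraction of the combined reference list that both papers cite
    (0.0 = no common references, 1.0 = identical reference lists)."""
    a, b = set(refs_a), set(refs_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical reference identifiers, for illustration only.
paper_x = ["garfield79", "salton71", "porter80"]
paper_y = ["garfield79", "salton71", "cameron95"]
paper_z = ["levenshtein65"]
```

Here `paper_x` and `paper_y` share two of their four distinct references, giving a strength of 0.5, whereas `paper_x` and `paper_z` share none. Unlike word-based measures, this comparison does not depend on the language or vocabulary of the papers themselves.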
The third area of prior work is assistant agents. The present invention can be viewed as an assistant agent. Assistant agents are agents designed to assist the user with the use of software systems. These agents may perform tasks on behalf of the user, making interaction with the software system easier and/or more efficient. Many web based assistant agents have been constructed to help the user find interesting and relevant World Wide Web pages more quickly and easily. Many of them such as Moukas, supra; Balabanovic, supra; Menczer, supra; Pazzani et al, supra and those described in an overview of several agents in an article by P. Edwards et al., "Exploiting learning technologies for World Wide Web agents", in IEE Colloquium on Intelligent World Wide Web Agents, Digest No.:97/118 (1997), learn from user feedback in an environment of word vector features to find more relevant web pages. Interesting changes to known relevant web pages are learned by the "Do-I-Care" agent as described by B. Starr et al, in an article entitled "Do-I-Care: Tell me what's changed on the web", in Proceedings of the AAAI Spring Symposium on Machine Learning in Information Access Technical Papers (1996). This system also allows the agent to learn from the feedback of another user. Although it does no learning, the heuristic web agent "CiFi" as described in an article by S. Loke et al, entitled "CiFi: An intelligent agent for citation finding on the World-Wide Web", in Technical Report 96/4, Department of Computer Science, University of Melbourne (1996), tries to find citations to a specified paper on the World Wide Web.