The present invention relates to autonomous citation indexing. Specifically, an autonomous citation indexing system is completely automatic, autonomously extracts citations, identifies identical citations which occur in different formats, and identifies the context of citations in the body of the articles.
A citation index indexes citations contained in articles, linking the articles with the cited works. Citation indexing was originally designed for information retrieval. A citation index allows navigation backward in time by following the list of cited articles and forward in time by tracking which subsequent articles cite any given article.
The rate of production of scientific literature continues to increase, making it time consuming for researchers to stay current. Most published scientific research appears in paper documents such as scholarly journals or conference proceedings. There is a considerable time delay between the completion of research and the availability of the publication. Thus, the World Wide Web ("the web") has become an important distribution medium for scientific research. An increasing number of authors are making new research available on the web in the form of preprints or technical reports which can be downloaded and printed. These web publications are often available before any corresponding printed publication in journals or conference proceedings. In order to keep current on research, especially in rapidly advancing fields, a researcher can search the web to download papers as soon as they are written. Literature available on the web is easier to access. The web, however, does have its own limitations for searching. The web lacks a standardized organization, publications themselves tend to be poorly organized (each institution or researcher may have its, his or her own organizational scheme), and publications are often spread throughout the web. A researcher could spend large amounts of time solely for searching, downloading and printing papers to read in order to find those publications that may be important. However, web based literature can be read and processed by autonomous agents far more easily than printed documents. Agents can search the web and thereby provide an automated means to find, download and judge the relevance of web publications. The present invention concerns just such an agent. An assistant agent is a computer program which automatically performs some task on behalf of a user.
One method for finding relevant and important publications on the web is to use a combination of web search engines and manual web browsing. Web search engines such as AltaVista (http://altavista.digital.com) index the text contained on web pages, allowing users to find information with keyword search. Some research publications on the web are made available in HTML (HyperText Markup Language) format, making the text of these papers searchable with web search engines. However, most of the published research papers on the web are in PostScript form (which preserves the formatting of the original), rather than HTML. The text of these papers is not indexed by search engines such as AltaVista, so researchers must instead locate pages which contain links to these papers, e.g. by searching for a paper by title or author name. Another limitation of the web search engines is that they typically use only word frequency information to find relevant web pages, although other types of information are potentially useful, e.g. papers which contain citations of common earlier papers may be related.
In the following text, reference will be made to several publications in the open literature. These publications are herein incorporated by reference.
The present invention benefits from three areas of prior work. The first is citation indexing, which indexes the citations made between academic articles. See, for example, Chapters 1 to 3 and Chapter 10 of the book by E. Garfield, entitled "Citation Indexing: Its Theory and Application in Science, Technology and Humanities", ISI Press, Philadelphia, 1979.
The second concerns semantic distance measures between text documents. Research in this area is directed towards finding quantifiable and useful measures of similarity or relatedness between bodies of text.
The third comprises web, interface and assistant software agents. Several papers have addressed the problem of locating "interesting" web pages. Examples include articles by M. Pazzani, J. Muramatsu and D. Billsus, entitled "Syskill and Webert: Identifying interesting web sites" in Proceedings of the National Conference on Artificial Intelligence (AAAI96), 1996; by F. Menczer, entitled "Arachnid: Adaptive retrieval agents choosing heuristic neighborhoods for information discovery" in Machine Learning: Proceedings of the Fourteenth International Conference, pp. 227-235, 1997; by M. Balabanovic, entitled "An adaptive web page recommendation service" in Proceedings of the First International Conference on Autonomous Agents, ACM Press, New York, pp. 378-385, 1997; and by A. Moukas, entitled "Amalthaea: Information discovery and filtering using a multiagent evolving ecosystem" in Proceedings of the Conference on Practical Applications of Agents and Multiagent Technology, 1996. This includes work which uses learning techniques based on user feedback.
In citation indexing, references contained in articles are used to give credit to previous work in the literature and provide a link between the "citing" and "cited" articles. A citation index, such as that described by Garfield, supra, indexes the citations that an article makes, linking the articles with the cited works. Citation indexes were originally designed mainly for information retrieval, as referenced by E. Garfield in an article entitled "The concept of citation indexing: A unique and innovative tool for navigating the research literature", Current Contents, Jan. 3, 1994. The citation links allow navigating the literature in unique ways. Papers can be located independent of language and words in the title, keywords or document. A citation index allows navigation backward in time (the list of cited articles) and forward in time (subsequent articles which cite the current article). Citation indexing can be a powerful tool for literature search, in particular:
a. A citation index allows finding out where and how often a particular article is cited in the literature, thus providing an indication of the importance of the article. Older articles may define methodology or set the research agenda. Newer articles may respond to or build upon the original article.
b. Citations can help to find other publications which may be of interest. Using citation information in addition to keyword information should allow the identification of more relevant literature.
c. The context of citations in citing publications may be helpful in judging the important contributions of a cited paper.
d. A citation index can provide detailed analyses of research trends and identify emerging areas of science.
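The backward and forward navigation described above can be illustrated with a pair of mappings. The following is only a minimal sketch in Python and not the implementation of the present invention; the class name `CitationIndex`, its method names and the article labels are hypothetical and chosen for illustration.

```python
from collections import defaultdict


class CitationIndex:
    """Toy citation index linking each article with the works it cites."""

    def __init__(self):
        self.cites = defaultdict(set)     # article -> works it cites (backward in time)
        self.cited_by = defaultdict(set)  # work -> articles citing it (forward in time)

    def add_citation(self, citing, cited):
        self.cites[citing].add(cited)
        self.cited_by[cited].add(citing)

    def backward(self, article):
        """Works cited by the given article."""
        return sorted(self.cites.get(article, ()))

    def forward(self, article):
        """Subsequent articles which cite the given article."""
        return sorted(self.cited_by.get(article, ()))

    def citation_count(self, article):
        """How often the article is cited, a rough indication of importance."""
        return len(self.cited_by.get(article, ()))


index = CitationIndex()
index.add_citation("B 1995", "A 1990")
index.add_citation("C 1997", "A 1990")
index.add_citation("C 1997", "B 1995")
```

In this sketch, `index.forward("A 1990")` lists the later articles citing the 1990 article, while `index.backward("C 1997")` lists the works that the 1997 article cites.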
The Institute for Scientific Information (ISI)® (Institute for Scientific Information, 1997) produces multi-disciplinary citation indexes, which are used to provide several commercial services for searching scientific periodicals. An ISI service is the Keywords Plus® service, which adds citation information to the indexing of an article. Specifically, in addition to the title, author-supplied keywords, and abstract, Keywords Plus adds additional indexing terms which are derived from the titles of cited papers. As a user browses through papers in the ISI databases, bibliographic coupling allows navigation by locating papers which share one or more references.
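Bibliographic coupling, as mentioned above, amounts to finding papers whose reference lists intersect. The following Python sketch illustrates the idea under simple assumptions: the function name `coupled_papers`, the ranking by number of shared references, and the paper and reference labels are all hypothetical, and nothing here is taken from the ISI services.

```python
def coupled_papers(references, paper):
    """Return other papers ranked by the number of references shared with `paper`."""
    own = references[paper]
    shared = {
        other: len(own & refs)
        for other, refs in references.items()
        if other != paper and own & refs
    }
    # Most shared references first; ties broken alphabetically for stable output.
    return sorted(shared.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical data: each paper mapped to the set of works it references.
references = {
    "P1": {"R1", "R2", "R3"},
    "P2": {"R2", "R3"},
    "P3": {"R3", "R4"},
    "P4": {"R5"},
}
coupled = coupled_papers(references, "P1")
```

Here `coupled` contains P2 (two shared references) followed by P3 (one shared reference); P4 shares no reference with P1 and is omitted.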
Another commercial citation index is the legal database offered by the West Group (KeyCite). This database indexes case law as opposed to scientific research articles.
Compared to the current commercial citation indexes, the citation indexing performed by using the present invention has the following limitations: it does not cover the significant journals as comprehensively and it cannot distinguish subfields as accurately, e.g. it will not disambiguate two authors with the same name.
The present invention, Autonomous Citation Indexing (ACI), has significant advantages over traditional citation indexing:
a) No manual effort is required for indexing, resulting in a corresponding reduction in cost and increase in availability. We believe that this can be very important.
b) ACI facilitates literature search based on the context of citations: given a particular paper of interest, an ACI system can display the context of how the paper is cited in subsequent publications. The context of citations can be very important for both literature search and evaluation.
Because ACI does not require human indexers, very significant further benefits result:
a) ACI allows creating more up-to-date databases which can avoid lengthy journal publication delays, because it is not necessary to limit the amount of literature indexed due to human resource requirements. In many areas preprints and conference papers are available long before any corresponding journal publication.
b) ACI offers the potential for broader coverage of the literature, as opposed to indexing only a select set of journals. The Science Citation Index (SCI) has been repeatedly criticized for not indexing certain literature.
ACI can improve scientific communication, and facilitates an increased rate of scientific dissemination and feedback.
R. D. Cameron, in an article entitled "A universal citation database as a catalyst for reform in scholarly communication", Technical Report CMPT TR 95-07, School of Computing Science, Simon Fraser University (1995), has proposed a "universal [Internet-based] bibliographic and citation database linking every scholarly work ever written". He describes a system in which all published research would be available to and searchable by any scholar with Internet access. Also, citation links between those documents would be recorded and available as search criteria. Such a database would be highly "comprehensive and up-to-date", making it a powerful tool for academic literature research, and for the production of statistics as with traditional citation indexes.
One important difference between Cameron's vision of a universal citation database and the present invention is that the present invention does not require any extra effort on the part of authors beyond placement of their work on the web. The present invention automatically creates the citation database from downloaded documents whereas Cameron has proposed a system whereby authors or institutions must provide citation information in a specific format.
Another area of prior work is the use of semantic distance measures. Given a set of documents (essentially text strings), there has been much interest in distance (or the inverse, similarity) measurements between documents. Most of the known distance measures between bodies of text rely on models of similarity of groups of letters in the text. One type of text distance measure is the string distance or edit distance, which considers distance as the amount of difference between strings of symbols. For example, the Levenshtein distance, as described by V. I. Levenshtein in an article entitled "Binary codes capable of correcting spurious insertions and deletions of ones (original in Russian)", Russian Problemy Peredachi Informatsii 1, 12-25 (1965), is a well known early edit distance where the difference between two text strings is simply the number of insertions, deletions, or substitutions of letters required to transform one string into another. A more recent and sophisticated example is an algorithm called LikeIt as described by P. N. Yianilos in Technical Report No. 97-093, NEC Research Institute, entitled "The LikeIt intelligent string comparison facility" (1997) and by P. N. Yianilos, in an article entitled "Data structures and algorithms for nearest neighbor search in general metric spaces", in Proceedings of the 4th ACM-SIAM Symposium on Discrete Algorithms, pp. 311-321 (1993) and in U.S. Pat. No. 5,978,797 to Yianilos and assigned to the same assignee as the present invention, where a string distance is based on an algorithm that tries to "build an optimal weight matching of the letters and multigraphs (groups of letters)".
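The Levenshtein distance described above can be computed with a standard dynamic programming recurrence. The following Python sketch is a common textbook formulation offered for illustration only; it is not the LikeIt algorithm, and the function name `levenshtein` is hypothetical.

```python
def levenshtein(a, b):
    """Number of single-letter insertions, deletions or substitutions turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free when letters match
            ))
        previous = current
    return previous[-1]
```

For example, transforming "walk" into "walked" requires two insertions, so the distance between those two strings is 2.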
Another type of text string distance is based on statistics of words which are common to sets of documents, especially as part of a corpus of a large number of documents. One commonly used form of this measure, based on word frequencies, is known as term frequency times inverse document frequency (TFIDF) as described by G. Salton and C. Yang, in an article entitled "On the specification of term values in automatic indexing", in the Journal of Documentation 29, pp 351-372 (1973). Consider a dictionary of all of the words (terms) in a corpus of documents. In some systems, very common words, sometimes called stop words, such as "the", "a", and so forth are ignored. Also, sometimes only the stems of words are considered instead of complete words. An often used stemming heuristic introduced by M. F. Porter, in an article entitled "An algorithm for suffix stripping", Program 14, pp 130-137 (1980), tries to return the same stem from several forms of the same word, e.g. "walking", "walk", "walked", all become simply "walk". In a document d, the frequency of each word stem s is $f_{ds}$ and the number of documents having stem s is $n_s$. In document d the highest term frequency is called $f_{d\max}$. In one such TFIDF scheme, as described in an article by G. Salton and C. Buckley, entitled "Term weighting approaches in automatic text retrieval", in Technical Report 87-881, Department of Computer Science, Cornell University (1987), a word weight $w_{ds}$ is calculated as:

$$w_{ds} = \frac{\left(0.5 + 0.5\,\frac{f_{ds}}{f_{d\max}}\right)\left(\log\frac{N_D}{n_s}\right)}{\sqrt{\sum_{j \in d}\left(\left(0.5 + 0.5\,\frac{f_{dj}}{f_{d\max}}\right)^{2}\left(\log\frac{N_D}{n_j}\right)^{2}\right)}} \qquad (1)$$

where $N_D$ is the total number of documents. In order to find the distance between two documents, a dot product of the two word vectors for those documents is calculated.
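The weighting of equation (1) and the dot product comparison can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes that the denominator of equation (1) is the square root of the sum of squared unnormalized weights (so that each document vector has unit length), it omits stop word removal and stemming, and the function names `tfidf_vector` and `similarity` as well as the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tfidf_vector(doc_stems, doc_freq, n_docs):
    """Normalized weights w_ds for one document, following equation (1)."""
    counts = Counter(doc_stems)      # f_ds for each stem s in the document
    f_max = max(counts.values())     # f_dmax, the highest term frequency
    raw = {
        s: (0.5 + 0.5 * f / f_max) * math.log(n_docs / doc_freq[s])
        for s, f in counts.items()
    }
    # Assumed normalization: square root of the sum of squared raw weights.
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {s: (w / norm if norm else 0.0) for s, w in raw.items()}


def similarity(vec_a, vec_b):
    """Dot product of two word vectors; larger values mean more shared weighted stems."""
    return sum(w * vec_b.get(s, 0.0) for s, w in vec_a.items())


# Hypothetical corpus of already-stemmed documents.
corpus = [
    ["citation", "index", "web"],
    ["citation", "agent", "web", "web"],
    ["neural", "network", "web"],
]
doc_freq = Counter(s for doc in corpus for s in set(doc))  # n_s for each stem
vectors = [tfidf_vector(doc, doc_freq, len(corpus)) for doc in corpus]
```

In this toy corpus the stem "web" occurs in every document, so its logarithmic factor is zero and it contributes nothing to any dot product; the first two documents are related only through the shared stem "citation".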
A third type of semantic distance measure is one in which knowledge about document components or structure is used. In the case of research publications, for example, citations of papers by other papers have been used to create citation indexes (as described above) which can be used to gauge document relatedness as described by G. Salton, in an article entitled "Automatic indexing using bibliographic citations", Journal of Documentation 27, pp 98-110 (1971).
The third area of prior work is assistant agents. The present invention can be viewed as an assistant agent. Assistant agents are agents designed to assist the user with the use of software systems. These agents may perform tasks on behalf of the user, making interaction with the software system easier and/or more efficient. Many web based assistant agents have been constructed to help the user find interesting and relevant World Wide Web pages more quickly and easily. Many of them such as Moukas, supra; Balabanovic, supra; Menczer, supra; Pazzani et al, supra and those described in an overview of several agents in an article by P. Edwards et al., "Exploiting learning technologies for World Wide Web agents", in IEE Colloquium on Intelligent World Wide Web Agents, Digest No.: 97/118 (1997), learn from user feedback in an environment of word vector features to find more relevant web pages. Interesting changes to known relevant web pages are learned by the "Do-I-Care" agent as described by B. Starr et al, in an article entitled "Do-I-Care: Tell me what's changed on the web", in Proceedings of the AAAI Spring Symposium on Machine Learning in Information Access Technical Papers (1996). This system also allows the agent to learn from the feedback of another user. Although it does no learning, the heuristic web agent "CiFi" as described in an article by S. Loke et al, entitled "CiFi: An intelligent agent for citation finding on the World-Wide Web", in Technical Report 96/4, Department of Computer Science, University of Melbourne (1996), tries to find citations to a specified paper on the World Wide Web.
An Autonomous Citation Index autonomously creates a citation index from literature in electronic format (printed literature can be converted to electronic form using optical character recognition (OCR)). An ACI system autonomously locates new articles, extracts citations, identifies citations to the same article which occur in different formats, and identifies the context of citations in the body of articles. The viability of autonomous citation indexing depends on the ability to perform these functions accurately.
The ability to recognize variant forms of citations to the same publication is critical to the usefulness of ACI. Without this ability, it would not be possible to group multiple citations to the same publication (e.g. list the context of all citations to a given publication), nor would it be possible to generate statistics on citation frequency (allowing estimation of the importance of articles).
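The passage above does not specify how variant forms are recognized. Purely as an illustration of what such grouping involves, the following Python sketch normalizes each citation to a set of lowercase words and greedily groups citations whose word sets overlap strongly. The word-overlap heuristic, the 0.6 threshold, and the names `normalize` and `group_citations` are all assumptions made for this sketch and are not taken from the present description.

```python
import re


def normalize(citation):
    """Lowercase a citation and reduce it to its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", citation.lower()))


def group_citations(citations, threshold=0.6):
    """Greedily group citations whose word sets overlap by at least `threshold`."""
    groups = []  # each entry: (word set of the first member, list of members)
    for citation in citations:
        words = normalize(citation)
        for rep_words, members in groups:
            smaller = max(1, min(len(words), len(rep_words)))
            if len(words & rep_words) / smaller >= threshold:
                members.append(citation)
                break
        else:
            # No existing group is similar enough, so start a new one.
            groups.append((words, [citation]))
    return [members for _, members in groups]


citations = [
    "E. Garfield. Citation Indexing: Its Theory and Application. ISI Press, 1979.",
    "Garfield, E. (1979) Citation indexing: its theory and application, ISI Press",
    "M. F. Porter. An algorithm for suffix stripping. Program 14, 1980.",
]
groups = group_citations(citations)
```

With the citations grouped in this way, the size of each group gives a citation count for the corresponding publication, and the contexts of all members of a group can be listed together.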
Finding articles can be accomplished by searching the World Wide Web, monitoring mailing lists or newsgroups for announcements of new articles, or by direct links with publishers. Once familiar with ACI systems, researchers may send notification of new papers directly, allowing these papers to be indexed almost immediately. Journal papers are increasingly being made available online at journal web sites. Journals typically charge for access to online papers, and as such one way to index these papers would be to make agreements with the publishers. An ACI system is likely to be beneficial to publishers, because users can be directed to the journal's home page, increasing subscriptions. Pay-per-view agreements may be beneficial for users that do not wish to subscribe to the journals in order to view a single article.
Finding and extracting citations and the context of citations, from articles in electronic form, is relatively simple. For example, the citations are often contained in a list at the end of the article, the list of citations is typically formatted such that citation identifiers, vertical spacing, or indentation can be used to delineate individual citations, and the context of citations is typically marked in a consistent format, e.g. using identifiers such as "3", "[7]", "[Minsky 92]", or "Williams (1991)". By using a number of rules to account for common variations, these tasks can be performed accurately. When searching for the context of citations, variant forms of citation identifiers, such as listing all authors or only the first author, or varying use of initials between the references section and the main text, can be handled using regular expressions.
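As an illustration of how regular expressions can locate citation identifiers and their context, the following Python sketch matches three of the identifier styles named above. It is a simplified example rather than the rule set of the present invention: the pattern `CITATION_MARKER`, the function `citation_contexts`, the fixed 40-character context window and the sample text are hypothetical, and bare numeric identifiers such as "3" are not handled.

```python
import re

# Hypothetical pattern covering three identifier styles from the text.
CITATION_MARKER = re.compile(
    r"\[\d+(?:\s*,\s*\d+)*\]"                    # numeric, e.g. [7] or [3, 7]
    r"|\[[A-Z][A-Za-z]+(?: et al\.)? \d{2,4}\]"  # bracketed author and year, e.g. [Minsky 92]
    r"|[A-Z][A-Za-z]+(?: et al\.)? \(\d{4}\)"    # author with year in parentheses, e.g. Williams (1991)
)


def citation_contexts(text, window=40):
    """Yield (identifier, surrounding text) for each citation identifier in text."""
    for match in CITATION_MARKER.finditer(text):
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        yield match.group(0), text[start:end]


body = (
    "Frames were introduced in [Minsky 92] and later refined [7]. "
    "Williams (1991) reports a similar result."
)
contexts = list(citation_contexts(body))
markers = [marker for marker, _ in contexts]
```

Each identifier found in the body can then be matched against the identifiers in the reference list, so that the surrounding text is stored as the context of the corresponding citation.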
The present invention provides an autonomous citation indexing system which indexes literature made available in electronic format, such as PostScript files, on the World Wide Web. Papers, journals, authors and so forth are ranked by the number of citations. The invention extends current capabilities by allowing interactive browsing of the literature. That is, given a particular paper of interest, it is possible to display the context of how the paper is cited in subsequent publications. The context may contain a brief summary of the paper, another author's response to the paper, or a subsequent work which builds upon the original article. Papers may be located by keyword search or by citation links. Papers related to a given paper can be located using common citation information or word vector similarity.
The concept underlying the present invention is generally as follows. Given a set of broad topic keywords, web search engines and heuristics are used to locate and download papers which are potentially relevant to the given topic. The downloaded papers are parsed to extract semantic features, including citations, citation context, and word frequency information. Citations to identical papers are identified. This information is stored in a database which can be searched by keyword, or browsed by following citation links. It is also possible to find papers similar to a paper of interest by using common citation information or word vector similarity.
A principal object of the present invention is therefore the provision of a computer implemented citation indexing system which extracts citations from articles and identifies syntactic variants of citations to the same work.
Another object of the present invention is the provision of a computer implemented citation indexing system which extracts citations from articles and extracts the context of the citation in the work.
A further object of the present invention is the provision of a computer implemented citation indexing system which combines the use of automatic citation indexing and keyword indexing.
Further and still other objects of the invention will be more clearly understood when the following description is read in conjunction with the accompanying drawing.