With the advent of the printing press, typesetting, typewriting machines, and computer-implemented word processing and storage, the amount of information generated by mankind has risen dramatically and at an ever-quickening pace. As a result, there is a continuing and growing need to collect, store, identify, track, classify, and catalogue this growing sea of information for retrieval and distribution, and an entire area of study called “information science” has emerged. One popular existing form of cataloguing and classifying information, e.g., books and other writings, is the Dewey Decimal System. Beyond classifying information, information science involves the study of how organizations and people, e.g., researchers, interact in moving bodies of science and research forward.
In the area of scholarly and scientific writing, a sophisticated process and convention called “bibliographic citation” has emerged for documenting research and supporting materials and for organizing fields of study. Such scientific writings include, among other things, books, articles published in journals, magazines or other periodicals, and papers presented, submitted and published by society, industry and professional organizations, such as in proceedings and transactions publications. To facilitate the widespread distribution of information published in scholarly writings, and thereby to move bodies of study forward more efficiently and effectively, scholars and scientists use bibliographic citation to recognize the prior work of others, or even of themselves, on which the advancements set forth in their writings are based. “Citations” included in any particular work or body of work collectively form a “bibliography” and are used to identify sources of information relied on or considered by the author, and to give the reader a way to confirm the accuracy of the content and a direction for further study. A “bibliography” may refer either to a complete or selective list or compilation of writings specific to an author, publisher or given subject, or to a list or compilation of writings relied on or considered by an author in preparing a particular work, such as a paper, article, book or other informational object.
Citations briefly describe and identify each cited writing as a source of information or reference to an authority. Citations and bibliographies follow particular formatting conventions to enhance consistency in interpreting the information. Each citation typically includes the following information: full title; author name(s); publication data, including publisher identity, volume, edition and other data; and date and location of publication. However, author names usually appear in an abbreviated form, such as an initial rather than a full first or middle name (e.g., J. Smith), or suffer naturally from commonality with other authors, such as having a common first name, a common last name, or both (e.g., John Smith). This results in a latent ambiguity as to the actual identity of the author. There have been many attempts to disambiguate author information, i.e., to establish a single semantic interpretation for, in this case, author identity. Each writing or paper may have one or more authors and represents an authorship for each author or co-author. As used herein, each authorship instance represents the contribution of an individual author. Accordingly, if a paper has three co-authors, then there will be three distinct “authorships” associated with that paper. For purposes of the descriptions contained herein, for a paper identified as “1” having co-authors A, B, and C, the authorships associated, respectively, with the co-authors would be identified as A1, B1, and C1, for use in linking authorships, or citations representing authorships, with particular authors and with the bibliographies of given authors.
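The authorship notation described above can be illustrated with a short sketch. This is only a minimal illustration written in Python under stated assumptions: the class name Authorship, its fields, and the helper authorships_for are hypothetical names introduced here for explanation and do not come from any existing system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Authorship:
    """One author's contribution to one paper (hypothetical structure)."""
    author_label: str  # e.g. "A", "B", "C"
    paper_id: str      # e.g. "1"

    @property
    def key(self) -> str:
        # Author label followed by paper identifier, e.g. "A1".
        return f"{self.author_label}{self.paper_id}"


def authorships_for(paper_id, co_authors):
    """Return one Authorship per co-author of the given paper."""
    return [Authorship(label, paper_id) for label in co_authors]


# Paper "1" with co-authors A, B, and C yields three distinct authorships.
paper_1 = authorships_for("1", ["A", "B", "C"])
print([a.key for a in paper_1])  # ['A1', 'B1', 'C1']
```

Treating each authorship as its own record, rather than attaching a list of names to a paper, keeps open the question of which real person stands behind each name until it is resolved.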
Two areas of scientific study directed to measuring and analyzing science and scientific publications are “scientometrics” and “bibliometrics,” which are based on the early works of Vannevar Bush and more recently on the works of, among others, Eugene Garfield, founder of the Institute for Scientific Information (“ISI”). Bibliometrics concerns analyzing content and associated information of books and other publications, which may be referred to as informational objects. Such analysis may then be used to identify and/or quantify, confirm or reject relationships among informational objects, e.g., author entities, or academic journal citations, to create links among the informational objects. Other applications for bibliometrics include: creating word relationships to populate a thesaurus; measuring frequency of terms (individual words, groups of words, or word roots or meanings); identifying relationships of texts using grammar, semantic and syntax rules, and other techniques to create useful tools and resources.
Efforts have been undertaken to define relationships and the evolution of science within particular fields to give some coherent structure to the business of science. See, for example, Eugene Garfield, Mapping The Structure Of Science (Chapter 8), Citation Indexing: Its Theory and Application in Science, Technology, and Humanities, John Wiley & Sons, Inc. NY, pp. 98-147, 1979; and The Geography Of Science: Disciplinary And National Mappings, by Henry Small and Eugene Garfield, J. Inform. Sci., 11:147-159 (1985). ISI's Science Citation Index (“SCI”) was created as a citation index of the world's leading journals of science and technology and has proven to be a powerful bibliometric resource. SCI has been used to map the progress and development of science by using factors that measure the importance of scientific journals. The study of science based on examining citations and bibliographies to infer associations may be referred to as “citation analysis.” For instance, SCI has been used to show that certain fundamental journals are central to hard science, while in areas such as the humanities or social sciences there is no such relationship.
In support of the pursuits of science and research, databases, database management tools, citation management and analysis tools, research authoring tools, and other powerful tools and resources have been developed and used for the benefit of researchers and scientists. These tools and resources may be available to users in an online environment, over the Internet or some other computer network, and may be in the form of a client-server architecture, central and/or local database, application service provider (ASP), or other environment for effectively communicating and accessing electronic databases and software tools. Examples of such tools and resources are Thomson Scientific's Web of Science™ (WoS), Web of Knowledge™ (WoK), and Researchsoft™ suite of publishing solutions, including EndNote™, EndNoteWeb™, ProCite™, Reference Manager™, and RefViz™, as well as solutions such as Scholar One's Manuscript Central™. A longstanding problem associated with these databases and tools has been inaccurate identification and attribution of authorship due to, among other things, author name ambiguity, which may be a result of incomplete information (e.g., an abbreviated name with initials), incorrect information (e.g., misspellings), and common or identical information (e.g., same name, same spelling). Name ambiguity that causes paper and citation records to be linked with the wrong author entities results in inaccuracies that diminish the integrity, reliability and performance of resources and tools, including document and information search and retrieval, database integration, and research formation.
Techniques used to help build out databases and confirm database information include extraction and sorting, such as parsing of data from sentence or word structures, performed on electronic documents to extract information from papers and citations for further processing. Prior extraction techniques may include linking techniques such as Bayesian-based techniques as described in Automatic Extraction And Linking Of Person Names In Legal Text, Christopher Dozier and Robert Haschart, In Proceedings of RIAO 2000 (Recherche d'Information Assistee par Ordinateur), 12-14 Apr. 2000, Paris, France, pp. 1305-1321. See also HistCite™: A Software Tool for Informetric Analysis of Citation Linkage, Eugene Garfield, Soren Paris, and Wolfgang Stock, Information Wissenschaft & Praxis, 57(8):391-400, November/December 2006.
Relational links may be established based on “citations” and such links may be used in searching for materials and analyzing the relative merit of resources. By linking informational objects, such as papers, through citations and citation indices, e.g., WoS, users can search forward using a known article to identify and access more recent publications that cite the known article and are related to the same subject matter.
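The forward search described above can be sketched as an inverted lookup over reference lists. The following Python sketch is illustrative only and does not represent how WoS or any other citation index is implemented; the paper identifiers (P1 to P4) and the function names build_forward_index and search_forward are hypothetical.

```python
from collections import defaultdict


def build_forward_index(references):
    """Invert 'paper -> papers it cites' into 'paper -> papers citing it'."""
    cited_by = defaultdict(set)
    for citing, cited_list in references.items():
        for cited in cited_list:
            cited_by[cited].add(citing)
    return cited_by


def search_forward(cited_by, paper_id):
    """Return the papers that cite paper_id, in sorted order."""
    return sorted(cited_by.get(paper_id, set()))


# Hypothetical reference lists: each key cites the papers in its list.
references = {
    "P2": ["P1"],
    "P3": ["P1", "P2"],
    "P4": ["P3"],
}

cited_by = build_forward_index(references)
print(search_forward(cited_by, "P1"))  # ['P2', 'P3']
```

Starting from the known article P1, the lookup returns the later papers P2 and P3 that cite it; repeating the lookup on those results would follow the citation chain further forward.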
Citation analysis can be applied across databases such as WoS and WoK to determine the acceptance, following, and impact of specific publications and authors and may be used, for example, in screening reference materials, validating research, establishing interaction among authors or institutions, and in deliberating an author's tenure review. Although citation analysis has been used for years, ever-increasing computing power and information management techniques are making it more useful and widespread. One highly beneficial use of citation analysis is to associate works of authorship with individual authors. Also, integrating new publications into an existing database of papers and other works often begins with an existing list of known authors. For example, assume an existing list of authors includes an entry for John Smith, Professor at University of Alabama, and assume further that a subsequent article lists “J. Smith” from “U. of Al.” as an author or co-author. Known systems might automatically associate the article with the known John Smith at University of Alabama who appears on the existing list of authors. However, such a system would not know of or consider the case of a “Jane Smith” who recently became a professor at University of Alabama. Also, such a system might not have a way of detecting a mismatch or the likelihood of a mismatch, e.g., where the citation has an incorrect abbreviation in either the author name or the school/institution name, such as a typographical error in which the school should have been “U. of Az.” for University of Arizona, at which the real author, Jeff Smith, is a professor.
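The ambiguity in the example above can be made concrete with a short sketch. This Python fragment is a simplified illustration under stated assumptions, not a description of any known system: the helper names and the matching rule (same last name and same first initial) are hypothetical, and every listed author is assumed to be at University of Alabama.

```python
def last_name(name):
    """Last whitespace-separated token, lowercased, with periods removed."""
    return name.replace(".", "").split()[-1].lower()


def first_initial(name):
    """First character of the name, uppercased."""
    return name.strip()[0].upper()


def candidate_authors(citation_name, known_authors):
    """Return every known author compatible with an abbreviated citation name."""
    return [
        author
        for author in known_authors
        if last_name(author) == last_name(citation_name)
        and first_initial(author) == first_initial(citation_name)
    ]


# Hypothetical list of known authors at the same institution.
known = ["John Smith", "Jane Smith", "Mary Jones"]
print(candidate_authors("J. Smith", known))  # ['John Smith', 'Jane Smith']
```

Because more than one candidate is returned, a system that simply takes the first match would attribute the article to John Smith without ever considering Jane Smith.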
“Writings” and “papers,” which as used herein shall refer to both “hard” and “soft” electronic documents, are now widely created, edited, maintained, archived, catalogued and researched in whole or in part electronically. The Internet and other networks and intranets facilitate electronic distribution of and access to such information. The advent of databases, database management systems and search languages, and in particular relational databases, e.g., DB2 and others developed by IBM, Oracle, Sybase, Microsoft and others, has provided powerful research and development tools and environments in which to further advance all areas of science and the study of science. Companies and institutions have created electronic databases and associated services, such as SCI, WoS, and WoK, that are specifically designed to help organize and harness the vast array of knowledge.
“Clustering” is a method of identifying a subset of items that are sufficiently similar to be joined by a relational link and thereby form a “cluster.” A dendrogram is a graphical representation of links between data objects forming a cluster tree. If the linking of the data objects grows weaker the farther up the cluster tree one goes, then one could assign a threshold degree of relatedness such that the tree is severed at some level, resulting in individual groups of connected or linked data objects forming a plurality of clusters of data objects. There are several known techniques for clustering data objects, including single link, average link, and complete link. For instance, in a database of articles including: Article 1 with author “J. Smith at Univ. of Ala.”; Article 2 with co-author “Jeff Smith at Univ. of Al.”; and Article 3 with co-author “J. S. Smith at Univ. of Alabama,” a sufficient link may have been formed based on the name similarity and the school similarity to form a cluster representing author “Jeff S. Smith” of the University of Alabama. This may be in conjunction with a known list of authors or professors including a “Jeff Smith” at the University of Alabama. Because papers often do not include full names, because professors do change positions and schools, and because typographical errors do occur, relying heavily on last name and first initial could introduce significant risk of error in the database and in bibliographies generated by using such databases and systems. What is needed is a way to more accurately link or associate authorships with individual authors.
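The threshold cut described above can be sketched for the single-link case, where severing the tree at a given similarity is equivalent to joining every pair of records whose similarity meets the threshold and taking the resulting connected groups. The Python sketch below is illustrative only. The similarity function is a deliberately crude assumption (last name plus first initial, and a prefix match on the abbreviated school name), which is exactly the kind of weak evidence cautioned against above. The fourth record, with “Univ. of Az.”, is a hypothetical addition, and the function names are likewise hypothetical.

```python
from itertools import combinations


def _last_token(text):
    """Last whitespace-separated token, lowercased, with periods removed."""
    return text.replace(".", "").lower().split()[-1]


def similarity(a, b):
    """Toy similarity in [0, 1]: half from the name, half from the school."""
    name_a, school_a = a
    name_b, school_b = b
    same_last = _last_token(name_a) == _last_token(name_b)
    same_initial = name_a[0].lower() == name_b[0].lower()
    name_score = 1.0 if (same_last and same_initial) else 0.0
    x, y = _last_token(school_a), _last_token(school_b)
    school_score = 1.0 if (x.startswith(y) or y.startswith(x)) else 0.0
    return 0.5 * name_score + 0.5 * school_score


def single_link_clusters(records, threshold):
    """Group record indices linked by any chain of pairs at or above threshold."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent pointers to the group representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if similarity(records[i], records[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


records = [
    ("J. Smith", "Univ. of Ala."),        # Article 1
    ("Jeff Smith", "Univ. of Al."),       # Article 2
    ("J. S. Smith", "Univ. of Alabama"),  # Article 3
    ("J. Smith", "Univ. of Az."),         # hypothetical fourth record
]

print(single_link_clusters(records, 0.75))  # [[0, 1, 2], [3]]
print(single_link_clusters(records, 0.5))   # [[0, 1, 2, 3]]
```

At the stricter threshold the three Alabama records form one cluster and the Arizona record stands alone; lowering the threshold merges all four, which shows how the choice of cut level alone can decide whether two different authors are conflated.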