With the advent of the printing press, typesetting, typewriting machines, and computer-implemented word processing and storage, the amount of information generated by mankind has risen dramatically and at an ever-quickening pace. As a result, there is a continuing and growing need to accurately collect, store, identify, track, classify, and catalogue this growing sea of information for retrieval and distribution.
In the area of scholarly and scientific research and writing, a sophisticated process and convention for documenting research and supporting materials and for organizing fields of study has emerged called “bibliographic citation.” Such scientific writings include, among other things, books, articles published in journals, magazines or other periodicals, manuscripts, and papers presented, submitted and published by society, industry and professional organizations such as in proceedings and transactions publications. To facilitate the widespread distribution of information published in scholarly writings and to move bodies of study forward more efficiently and effectively, scholars and scientists use bibliographic citation to recognize the prior work of others, or even themselves, on which advancements set forth in their writings are based. “Citations” or “cited references,” as included in any particular work or body of work, are used herein to refer broadly to cited references, bibliographic or other reference data, that collectively form in-text citations, footnotes, endnotes, and bibliographies and are used to identify sources of information relied on or considered by the author and to give the reader a way to confirm accuracy of the content and direction for further study. A “bibliography” may refer to either a complete or selective list or compilation of writings specific to an author, publisher or given subject, or it may refer to a list or compilation of writings relied on or considered by an author in preparing a particular work, such as a paper, article, book or other informational object.
A citation briefly describes and identifies a cited writing as a source of information or reference to an authority. Citations and bibliographies follow particular formatting conventions to enhance consistency in interpreting the information. Each citation typically includes the following information: full title, author name(s), publication data, including publisher identity, volume, edition and other data, and date and location of publication. Given the formatting requirements and numerous fields associated with each citation, and given that there are tens and in some cases hundreds of citations in a given paper, the work required of authors in accurately identifying and citing so many references in a given paper or project presents a substantial problem and burden in the publishing and research processes. Even when an author or collaborator of a paper or project has cited certain papers in prior documents, there exists the problem of efficiently and accurately recalling such papers for citing in future papers. What is needed is a system that allows an author to work in a word processor environment while providing effective access to citation information without leaving the word processor application.
In addition, many fields in a citation are inherently ambiguous, making it very difficult for an author to accurately represent a citation in a paper or work in process. For example, the need to recall the exact title of a paper to be cited, or the author of such a paper, makes the process inherently problematic. Systems presently available to researchers and authors do not provide an effective means to identify and generate citations based on an author's “rough” or approximate citation information or even topical information in the text of a working document.
Also, author names are most usually represented in a citation in an abbreviated form, such as an initial rather than full first or middle names (e.g., J. Smith), or suffer naturally from commonality with other authors, such as having a common first name, a common last name, or both (e.g., John Smith). This results in a latent ambiguity as to the actual identity of the author or the paper for which a citation is sought during the authoring process. There have been many attempts to disambiguate author and other citation information. A system and method for disambiguating information is disclosed in U.S. Pat. No. 7,953,724, issued May 31, 2011, entitled Method and System for Disambiguating Information Objects, which is hereby incorporated by reference herein in its entirety.
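By way of illustration only, the following minimal sketch shows how abbreviating given names to initials may collapse distinct authors into a single citation form. The function names `abbreviate` and `ambiguous_groups` and the sample names are hypothetical and are introduced here solely for illustration; the sketch does not represent the disambiguation method of the patent referenced above.

```python
from collections import defaultdict


def abbreviate(full_name):
    """Reduce every name part except the last to an initial (illustrative convention)."""
    parts = full_name.split()
    initials = " ".join(p[0].upper() + "." for p in parts[:-1])
    return f"{initials} {parts[-1]}".strip()


def ambiguous_groups(names):
    """Group full names by abbreviated form and keep only the forms shared by several authors."""
    groups = defaultdict(list)
    for name in names:
        groups[abbreviate(name)].append(name)
    return {form: members for form, members in groups.items() if len(members) > 1}


# Hypothetical author list: two distinct authors share the form "J. Smith".
authors = ["John Smith", "Jane Smith", "Mary Ann Jones"]
print(ambiguous_groups(authors))
```

In this sketch, any abbreviated form that maps to more than one full name marks a citation whose author cannot be identified from the name field alone.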
In support of the pursuits of science and research, databases, database management tools, citation management and analysis tools, research authoring tools, and other powerful tools and resources have been developed and used for the benefit of scholars, researchers, and scientists. These tools and resources may be available to users in an online environment, over the Internet or some other computer network, and may be in the form of a client-server architecture, central and/or local database, application service provider (ASP), or other environment for effectively communicating and accessing electronic databases and software tools. Examples of such tools and resources are Thomson Reuters Scientific's Web of Science™ (WoS), Web of Knowledge™ (WoK), and ResearchSoft™ suite of publishing solutions, including EndNote™, EndNoteWeb™, Reference Manager™, and Manuscript Central™.
A longstanding problem in the authoring and publication process has been accurately entering citation information in papers during creation and the time-consuming and tedious process of manually verifying the accuracy of the citations prior to publication. Small but critical errors, such as incomplete information and incorrect information (e.g., misspellings and typographical errors), cause the author and publisher to lose credibility and cause the reader to waste effort searching for referenced material that is incorrectly cited in the document. What is needed is a system that enables authors to identify, select and insert accurate citation information directly into a document while in the word processor application.
One particularly problematic aspect of the authoring process arises when an author desires to present in a paper a technical or other position that is supported by prior research but does not recall the prior paper that supports the statement. Techniques for textual analysis, including those based on natural language processing, IDF (inverse document frequency), and TF-IDF (term frequency-inverse document frequency), are known and have been used to help discern meaning from the text presented and to associate such text with relationships, concepts, and documents based on, for example, a scored relevance. Such techniques may include extraction and sorting, such as parsing of data from sentence or word structures, performed on electronic documents to extract information from papers and citations for further processing.
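By way of illustration only, the following minimal sketch shows one way a scored relevance based on TF-IDF might be used to associate a passage of text with candidate documents. The function names `tokenize` and `tf_idf_rank`, the smoothed IDF formula, and the sample titles are assumptions introduced here for illustration and do not represent any particular known system.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on any non-alphanumeric character."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()


def tf_idf_rank(query, documents):
    """Return indices of documents with a nonzero score, highest TF-IDF score first."""
    doc_tokens = [tokenize(d) for d in documents]
    n_docs = len(documents)

    # Document frequency: the number of documents containing each term.
    df = Counter()
    for tokens in doc_tokens:
        df.update(set(tokens))

    query_terms = set(tokenize(query))
    scored = []
    for index, tokens in enumerate(doc_tokens):
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term in tf:
                # Smoothed IDF (an assumed variant); rarer terms weigh more.
                idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0
                score += (tf[term] / len(tokens)) * idf
        scored.append((score, index))

    # Sort by descending score, breaking ties by original position.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [index for score, index in scored if score > 0]


# Hypothetical titles standing in for a collection of citable papers.
papers = [
    "Gene expression in yeast under heat stress",
    "Citation analysis of physics journals",
    "Heat transfer in turbulent pipe flow",
]
print(tf_idf_rank("yeast gene expression", papers))
```

In this sketch, a passage from a working document serves as the query, and the ranked indices identify the candidate papers whose text is most relevant to that passage.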
“Writings,” “manuscripts,” and “papers,” as used herein shall refer to both “hard” documents and “soft” electronic documents and shall be used interchangeably and given the broadest collective meaning. Such works of authorship are now widely created, edited, maintained, archived, catalogued and researched in whole or in part electronically. The Internet and other networks and intranets facilitate electronic distribution of and access to such information. The advent of databases, database management systems and search languages and in particular relational databases, e.g., DB2 and others developed by IBM, Oracle, Sybase, Microsoft and others, has provided powerful research and development tools and environments in which to further advance all areas of science and the study of science. There are companies and institutions that have created electronic databases and associated services, such as WoS and WoK, that are specifically designed to help organize and harness the vast array of knowledge.
Reference validation tools are available within the context of manuscript creation, submission, approval, proofing, and production processes. Many citation or reference databases, which may be referred to herein as authority databases, have become available via web service connections. However, these tools are not presently well integrated with authoring applications and require substantial manual intervention and confirmation. Thousands of papers and manuscripts are submitted to reviewers and publishers daily by authors and many of the submissions include malformed references. To catch and correct these errors, the current path to publication usually includes a manual reference validation step consisting of checks for style and content accuracy. The validation task may be performed by a variety of roles, most commonly by a copy editor or a production editor, but also possibly by a typesetter. With papers and manuscripts commonly containing dozens (or hundreds) of cited references, the validation process is tedious and time-consuming, and adds significant costs to the publication process, having been estimated to account for up to 60% of a publisher's correction and formatting effort. What is needed is a system that effectively identifies and recommends citation data for inclusion by authors in papers, which citations are accurate and uniformly conform to a desired style.
Existing systems are known that provide, e.g., validation of XML tags and schema in a document/citation and that provide enhanced data with document/citation or document/citation records. Such systems may provide scholarly metadata and linking, e.g., separate topical fields, abstract fields, etc., and the creation of a Digital Object Identifier (DOI) or unique digital identifier for a specific scholarly work, for example a URL. A DOI may be used to identify content objects in a digital environment. Entities operating over digital networks are assigned DOI “names,” and have associated with them “current” information, including address information. Name information does not change but other information, e.g., address, may change over time. A DOI system provides a framework for managing the following: identification, content, metadata, links, and media.
Improved methods, systems, and software for automatically processing literary citations are needed to provide an enhanced user (author) experience and to more efficiently and effectively facilitate accuracy of inserted citations, as well as to identify and recommend for selection a set of citations for inclusion in a document based on a limited set of textual and/or citation data, such as that provided by an author within a document.