With the advent of the printing press, typesetting, typewriters, and computer-implemented word processing and storage, the amount of information generated by mankind has risen dramatically and at an ever-quickening pace. As a result, there is a continuing and growing need to accurately collect, store, identify, track, classify, and catalogue this growing sea of information for retrieval and distribution.
In the area of scholarly and scientific research and writing, a sophisticated process and convention called “bibliographic citation” has emerged for documenting research and supporting materials and for organizing fields of study. Such scientific writings include, among other things, books, articles published in journals, magazines or other periodicals, manuscripts, and papers presented, submitted and published by society, industry and professional organizations, such as in proceedings and transactions publications. To facilitate the widespread distribution of information published in scholarly writings, and thereby move bodies of study forward more efficiently and effectively, scholars and scientists use bibliographic citation to recognize the prior work of others, or even their own prior work, on which the advancements set forth in their writings are based. The terms “citations” and “cited references,” as included in any particular work or body of work, are used herein to refer broadly to cited references and bibliographic or other reference data that collectively form in-text citations, footnotes, endnotes, and bibliographies, and that are used to identify sources of information relied on or considered by the author and to give the reader a way to confirm the accuracy of the content and direction for further study. A “bibliography” may refer either to a complete or selective list or compilation of writings specific to an author, publisher or given subject, or to a list or compilation of writings relied on or considered by an author in preparing a particular work, such as a paper, article, book or other informational object.
Citations briefly describe and identify each cited writing as a source of information or reference to an authority. Citations and bibliographies follow particular formatting conventions to enhance consistency in interpreting the information. Each citation typically includes the following information: full title, author name(s), publication data, including publisher identity, volume, edition and other data, and date and location of publication. Given the formatting requirements and the numerous fields associated with each citation, and given that a single paper may contain tens or in some cases hundreds of citations, the likelihood of misspellings and typographical errors presents a substantial problem in the publishing and research processes. With the possible exception of the title, most of the fields are inherently ambiguous. For example, the volume, page and date fields of a given reference are not particularly helpful in the event of an error. In the title, one missing or misspelled letter in one word of a string of words still leaves usable information; by contrast, a missing or erroneous character in a date or volume makes the rest of the data largely useless or at least unreliable. Also, author names usually appear in an abbreviated form, with initials rather than full first or middle names (e.g., J. Smith), or suffer naturally from commonality with other authors, such as a common first name, a common last name, or both (e.g., John Smith). This results in a latent ambiguity as to the actual identity of the author. There have been many attempts to disambiguate author and other citation information. A system and method for disambiguating information is disclosed in U.S. Ser. No. 11/799,768, filed May 2, 2007, entitled Method and System for Disambiguating Information Objects, which is owned by the assignee of the present application and is hereby incorporated herein by reference.
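The contrast between a misspelled title and an erroneous volume number can be illustrated with a minimal sketch. It is not part of the disclosed system: the second catalogue title, the volume values, and the helper names `similarity` and `best_match` are assumptions made for illustration, and the comparison uses the `difflib` module of the Python standard library rather than any particular matching technique.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a similarity ratio between 0.0 and 1.0, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(query, candidates):
    """Return the candidate string most similar to the query."""
    return max(candidates, key=lambda c: similarity(query, c))


# Hypothetical catalogue of known titles (the second entry is invented).
known_titles = [
    "Method and System for Disambiguating Information Objects",
    "Method and System for Indexing Information Objects",
]

# One letter is missing from "System", yet the title is still recoverable.
misspelled_title = "Method and Sytem for Disambiguating Information Objects"
recovered = best_match(misspelled_title, known_titles)

# A one-character error in a volume number gives no such foothold:
# the typed value is equally close to several plausible volumes.
typed_volume = "43"
volume_scores = {v: similarity(typed_volume, v) for v in ["42", "48", "13"]}

print(recovered)
print(volume_scores)
```

In this sketch the misspelled title still resolves to the intended catalogue entry, while every candidate volume receives the same score, so string similarity alone cannot indicate which volume was meant.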
In support of the pursuits of science and research, databases, database management tools, citation management and analysis tools, research authoring tools, and other powerful tools and resources have been developed and used for the benefit of scholars, researchers, and scientists. These tools and resources may be available to users in an online environment, over the Internet or some other computer network, and may be in the form of a client-server architecture, central and/or local database, application service provider (ASP), or other environment for effectively communicating and accessing electronic databases and software tools. Examples of such tools and resources are Thomson Scientific's Web of Science™ (WoS), Web of Knowledge™ (WoK), and ResearchSoft™ suite of publishing solutions including EndNote™, EndNoteWeb™, ProCite™, Reference Manager™, and RefViz™, as well as solutions such as Scholar One's Manuscript Central™. A longstanding problem in the publication process has been accurately entering citation information in papers during creation, together with the time-consuming and tedious process of manually verifying the accuracy of each citation prior to publication. Small but critical errors, such as incomplete information and incorrect information (e.g., misspellings and typographical errors), cause the author and publisher to lose credibility and cause the reader to waste effort searching for referenced material that is incorrectly cited in the document.
Techniques used to help build out databases and confirm database information include extraction and sorting, such as parsing data from sentence or word structures. These techniques are performed on electronic documents to extract information from papers and citations for further processing.
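By way of illustration only, parsing a citation string into fields might be sketched as follows. The citation layout, the field names, and the function `parse_citation` are hypothetical; real citation styles vary widely, and a practical extractor would need to handle many more patterns than this single regular expression.

```python
import re

# Hypothetical layout: "Author, Title, Journal, vol. N, pp. A-B, YYYY."
CITATION_PATTERN = re.compile(
    r"^(?P<author>[^,]+),\s*"
    r"(?P<title>.+?),\s*"
    r"(?P<journal>[^,]+),\s*"
    r"vol\.\s*(?P<volume>\d+),\s*"
    r"pp\.\s*(?P<first_page>\d+)-(?P<last_page>\d+),\s*"
    r"(?P<year>\d{4})\.?$"
)


def parse_citation(text):
    """Return a dict of citation fields, or None if the text is malformed."""
    match = CITATION_PATTERN.match(text.strip())
    return match.groupdict() if match else None


# Invented example citations, one well formed and one missing fields.
good = "J. Smith, A Study of Citation Errors, Journal of Examples, vol. 12, pp. 34-56, 2007."
bad = "J. Smith, A Study of Citation Errors, 2007"

print(parse_citation(good))
print(parse_citation(bad))
```

A citation that does not fit the assumed layout yields `None`, which is one simple way to flag a reference for further processing or manual review.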
“Writings,” “manuscripts,” and “papers,” as used herein, shall refer to both “hard” documents and “soft” electronic documents and shall be used interchangeably and given the broadest collective meaning. Such works of authorship are now widely created, edited, maintained, archived, catalogued and researched in whole or in part electronically. The Internet and other networks and intranets facilitate electronic distribution of and access to such information. The advent of databases, database management systems and search languages, and in particular relational databases, e.g., DB2 and others developed by IBM, Oracle, Sybase, Microsoft and others, has provided powerful research and development tools and environments in which to further advance all areas of science and the study of science. Companies and institutions have created electronic databases and associated services, such as WoS and WoK, that are specifically designed to help organize and harness the vast array of knowledge.
Thousands of papers and manuscripts are submitted to reviewers and publishers daily by authors, and many of the submissions include malformed references. To catch and correct these errors, the current path to publication usually includes a manual reference validation step consisting of checks for style and content accuracy. The validation task may be performed by people in a variety of roles, most commonly a copy editor or a production editor, but possibly also a typesetter. With papers and manuscripts commonly containing dozens (or hundreds) of cited references, the validation process is tedious and time-consuming and adds significant cost to the publication process, having been estimated to account for up to 60% of a publisher's correction and formatting effort.
Recent developments have provided a significant opportunity to develop reference validation tools within the context of manuscript creation, submission, approval, proofing, and production processes. Many reference databases, which may be referred to herein as authority databases, have become available via web service connections. It is now possible to efficiently pull or extract reference lists from a manuscript or electronic document into XML. Also, processes used in the paper creation and submission process may be extended into the production stage of the publishing process to provide a complementary, coordinated and efficient overall process.
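As a rough sketch of what placing an extracted reference list into XML could look like, the following uses the `xml.etree.ElementTree` module of the Python standard library. The element names (`references`, `reference`, `author`, and so on), the sample data, and the function `references_to_xml` are assumptions chosen for illustration and do not reflect any particular schema.

```python
import xml.etree.ElementTree as ET


def references_to_xml(references):
    """Serialize a list of field dictionaries as an XML reference list."""
    root = ET.Element("references")
    for number, ref in enumerate(references, start=1):
        # Each reference keeps its position in the list as an id attribute.
        node = ET.SubElement(root, "reference", id=str(number))
        for field, value in ref.items():
            ET.SubElement(node, field).text = value
    return ET.tostring(root, encoding="unicode")


# Invented reference data; keys must be valid XML element names.
extracted = [
    {"author": "J. Smith", "title": "A Study of Citation Errors", "year": "2007"},
    {"author": "A. Jones", "title": "Notes on Reference Lists", "year": "2005"},
]

xml_text = references_to_xml(extracted)
print(xml_text)
```

Once the references are in a structured form such as this, each field can be addressed individually, for example when comparing it with a record obtained from an authority database.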
Robust and accurate reference mark-up and validation tools are needed to effectively reduce the cost and burden associated with validating references prior to publication.
Existing efforts and systems directed to “validation, XML, scholarly meta data, etc.” include the following. XML validation presently applies to validating the XML tags and schema in a document, not to validating the associated data. Scholarly metadata and linking refers to, for example, the creation of a Digital Object Identifier (DOI) or other unique digital identifier (for example, a URL) for a specific scholarly work. A DOI may be used to identify content objects in a digital environment. Entities operating over digital networks are assigned DOI “names” and have associated with them “current” information, including address information. Name information does not change, but other information, e.g., an address, may change over time. A DOI system provides a framework for managing the following: identification, content, metadata, links, and media.
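The distinction between validating XML structure and validating the data it carries can be sketched as follows. This is an illustration under stated assumptions: the in-memory `AUTHORITY` dictionary stands in for an authority database, the DOI string and field values are made up, and the function names are hypothetical. Only well-formedness is checked here, since the Python standard library does not perform schema validation.

```python
import xml.etree.ElementTree as ET

# Hypothetical authority record keyed by a made-up DOI.
AUTHORITY = {"10.1000/example.123": {"year": "2007", "volume": "12"}}


def is_well_formed(xml_text):
    """Structural check only: does the text parse as XML?"""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def data_mismatches(xml_text, authority):
    """Return the names of fields that disagree with the authority record."""
    ref = ET.fromstring(xml_text)
    record = authority.get(ref.findtext("doi", default=""))
    if record is None:
        return ["doi"]  # no authority record found for this identifier
    return [name for name, expected in record.items()
            if ref.findtext(name) != expected]


# The year has two digits transposed, but the markup itself is fine.
citation_xml = (
    "<reference><doi>10.1000/example.123</doi>"
    "<volume>12</volume><year>2070</year></reference>"
)

print(is_well_formed(citation_xml))
print(data_mismatches(citation_xml, AUTHORITY))
```

The reference passes the structural check even though its year is wrong; only the comparison against the authority record exposes the error.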