To those involved in the scientific enterprise, whether in its mechanisms, support or the application of its products, the ability to navigate both past and current published literature is essential. To accomplish this, effective methods are required to locate, organize, link and summarize scientific ideas and findings as they emerge, within the context of those that constitute their basis. Currently, more than 1 million scientific papers are published each year in more than 25 thousand journals. The cumulative total exceeds 20 million publications in the biomedical sciences alone, and increases by 2 to 4 thousand new documents per day. Furthermore, the landscape of scientific research continues to increase in complexity due to a trend towards progressive discipline specialization. At such a scale it has become a major challenge for researchers to stay up to date in their own fields, let alone all of science. Research information overload threatens the ability of scientists and the related community to efficiently construct knowledge and to conduct research collectively.
Scientific, medical and technical journals have, in the past, enabled the navigation of scientific literature by organizing it according to topic and quality on a system-wide scale. Besides a small number of multi-disciplinary journals which publish papers of broad interest, the vast majority of journals are highly specialized in their content, and restrict themselves to a particular field. New journals develop in response to discoveries which branch fields into ever more specialized areas. Journals can be ranked according to a variety of metrics; a prime example of these is the Journal Impact Factor, which is calculated as the average number of citations received per paper published in a given journal during the two preceding years. Individual articles thus tend to be judged, by proxy, according to the status of their journal of publication. Scientific journals have played a pivotal role in the development of modern science by additionally providing a mechanism for workers to i) disseminate new results and ideas, ii) register a date of priority for new discoveries and iii) validate and improve the quality of impending publications by way of peer review. Indeed, dissemination and registration of scientific results originally motivated the development of journals. Over the past century a dramatic expansion in journal number, commensurate with increasing numbers of scientists and papers published each year, and combined with a virtually universal transition from paper to electronic publishing, has shifted the consumption of journal articles from a subscription-driven to a search-driven activity. In biomedicine, for example, the National Center for Biotechnology Information PubMed abstract database mixes journal content across topic and impact level, allowing search nets to be cast broadly. As such, the ability to find relevant publications within any field of study has become dependent on the skilled operation and capacity of tools for filtering large literature databases.
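The Journal Impact Factor calculation described above can be illustrated with a minimal sketch. It assumes that citations are counted in a single year and that they refer to papers the journal published during the two preceding years; the function name, parameter names and figures are hypothetical and chosen only for illustration.

```python
def journal_impact_factor(citations_received, papers_published):
    """Average number of citations per paper.

    citations_received: citations counted in the year of interest to papers
        the journal published during the two preceding years (assumption).
    papers_published: number of papers the journal published in those two years.
    """
    if papers_published <= 0:
        raise ValueError("papers_published must be a positive number")
    return citations_received / papers_published


# Hypothetical journal: 400 papers over two years, cited 1200 times in total.
example_jif = journal_impact_factor(citations_received=1200, papers_published=400)
print(example_jif)  # 3.0
```

Because the result is an average over all of a journal's papers, it says nothing about how many citations any individual article received, which is why judging an article by its journal is only a proxy.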
In terms of finding important and relevant information, the World Wide Web presents an apparently similar problem, scaled up by greater than two orders of magnitude. Here, filtering a network of billions of linked documents has been handled via the development of powerful internet-wide search engines. Querying results not only in the instant retrieval of hyperlinks to all indexed documents harboring the query terms, but also in a ranked list, organized by relevance. The calculation of relevance is accomplished using algorithms able to estimate the relative importance of the retrieved documents based on the structural features of the hyperlink network, as exemplified by the PageRank algorithm, as well as other factors. When applied to scientific literature databases, search engines are effective for certain goals; for example, using a search engine it is trivial to locate every article that contains the word “protein”, or to retrieve the most recently published set of articles. Experienced users can construct advanced filters using Boolean operators; however, the success of the search is contingent on the terms, synonyms and syntax that are used. Thus, searching for documents in this manner may be hit or miss, depending on the skill of the searcher.
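The idea of estimating importance from link structure, as referred to above, can be illustrated with a simplified sketch in the style of PageRank. This is a textbook power-iteration version, not the implementation used by any particular search engine; the damping value of 0.85, the fixed iteration count and the toy link graph are assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate the relative importance of documents from their links.

    links: dict mapping each document to the list of documents it links to.
    Returns a dict mapping each document to a score; the scores sum to 1.
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every document receives a small baseline share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Pass this document's score on to the documents it links to.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A document with no outgoing links spreads its score evenly.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Toy graph (hypothetical): documents A and B both link to C.
toy_links = {"A": ["C"], "B": ["C"], "C": []}
scores = pagerank(toy_links)
top_document = max(scores, key=scores.get)
print(top_document)  # C
```

In this toy graph, C ranks highest because it is the only document that receives links. The score reflects position in the link network only and carries no information about the order in which the documents were produced.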
What is needed is an improved system and method that addresses some of these significant challenges and limitations in the prior art. As will be described in more detail, the problem of finding relevant scientific information differs from a typical internet query in several aspects, one of which is the temporal dependence of later produced works on those that precede them. Published scientific ideas and findings tend to be cumulative, and theoretically can be arranged into continuous lines of inquiry, which begin at a primordial expression of the concept or theory and branch as new insights expand the territory that is open for investigation. Therefore, context is necessary to evaluate the merits of an article toward the collective pursuit of its stated objectives. It is often desired, for example, to trace publications along the development of a particular line of research. This is especially the case when one seeks to obtain a broad perspective on work in a particular topic, summarize the context of a new idea or finding, identify open research problems, or map important discoveries in a field and understand their intellectual roots. Search engines are innately unsuited to such analyses, since a ranked list of relevant material is not what is sought, but rather an ordered historical narrative. Although it is feasible to construct a single search engine query that will return the majority of publications contributing to a particular topic of research, this is challenging without prior knowledge of which specific documents must be found. Moreover, the complexity of such a query precludes its use as a routine approach, leading instead to a strategy comprising iterative search and query refinement cycles. Considering the diverse works contributing to the progress of a field, and the number of technical synonyms used to describe their results, approaches such as this are not only tedious and time-consuming, but also unlikely to be exhaustive.
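The contrast drawn above between a ranked list and an ordered historical narrative can be made concrete with a small sketch. It assumes that citation links and publication years are available for each article, follows citations backwards from a starting article, and returns the works found in chronological order. The function name, data layout and example records are hypothetical, and the sketch is an illustration of the concept rather than a description of any particular system.

```python
from collections import deque


def trace_lineage(article, cites, year):
    """Return an article and every work it builds on, oldest first.

    cites: dict mapping an article to the list of articles it cites.
    year: dict mapping each article to its year of publication.
    """
    seen = {article}
    queue = deque([article])
    while queue:
        current = queue.popleft()
        for earlier in cites.get(current, []):
            if earlier not in seen:
                seen.add(earlier)
                queue.append(earlier)
    # Order by publication year so the result reads as a history.
    return sorted(seen, key=lambda a: (year[a], a))


# Hypothetical records: D cites B and C, which both cite A; E also cites A.
cites = {"D": ["B", "C"], "C": ["A"], "B": ["A"], "E": ["A"]}
year = {"A": 1990, "B": 1998, "C": 2003, "D": 2010, "E": 2005}
lineage = trace_lineage("D", cites, year)
print(lineage)  # ['A', 'B', 'C', 'D']
```

Article E is absent from the result because it lies on a different branch from the same root, which shows how a line of inquiry differs from the set of all documents matching a topic.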
The issue is made worse by the fact that, for an individual to stay current on any topic, searches must be repeated as often as possible. Published review-style articles serve as the classic resource for overviews of any topic or subfield; as such they are a natural starting point to identify important citations and appreciate their context. However, it can require months of research for review authors to compile an authoritative list of citations from which an article's discussions and concepts will be synthesized. Further, review articles tend to rapidly age with the progression of a field, often becoming partially out of date by their time of publication.
Scientific publications communicate significant information beyond the topical, or explicit, knowledge contained within each work. These latent signals, termed meta-knowledge (“knowledge about knowledge”), include information used to situate an article within its many relevant settings, and rely on the depth of a reader's understanding of related work, as well as social contexts and norms in scientific research. Scholarly articles each reference dozens of publications from a variety of fields that they build upon in some aspect; however, their primary advances are usually only pertinent to one or two areas of research. Experienced workers in these areas are able to quickly appreciate the relevance of a study and appraise its impact based on a constellation of indicators, both general and specific. These include the publication history and reputation of the authors, author teams and collaborative networks of teams.
Likewise, the research focus and prestige of an article's institute of origin, and even the institute's geographical location, can flag it as potentially important. In the context of an article's intended field, the date of publication locates it within the framework of understanding current to that time period, the assumptions of the authors being similar to those of their contemporaries, and the meaning of the results deciphered in light of what was known. The methods and materials used in a study signal informed readers to classify it with past works that offered similar types of evidence in support of their rationale and interpretations. While the identity of an article's publishing journal can suggest its impact, much more specific measures are available.
The simplest and best known of these is the number of citations a publication has received over its lifetime. Where citations are slow to accrue, newer, more immediate metrics have been proposed that exploit digital footprints of knowledge consumption on the internet; these include measurements of article readership (e.g. article HTML pageviews) and references in social media (e.g. the number of Twitter posts pointing to a particular publication). Similarly, the connected nature of the internet allows the consolidation of traditional, but decentralized, indicators of publication impact and author status, including awards and grants stemming from particular works.
While the true merit of a publication can only be ascertained by studying the explicit knowledge it contains, scientific meta-knowledge can provide vital cues about which documents are relevant and important, in the face of the tremendous body of literature that now exists. Given the widespread adoption of social applications on the internet, ranging from wiki-style encyclopaedias and answer engines to specialized social networks, it is now possible to envision powerful socially-driven alternatives to scientific literature search engines. In particular, it should be feasible to invert the workflow of published knowledge discovery from an activity characterized by repetitive search and context reconstruction to one of context- and data-driven browsing and article selection.