The digitization of scientific data is well known. For example, authors prepare manuscripts and presentations using a collection of text, graphics, and data processing software. Likewise, consumers of scientific data regularly download documents from publishers' websites, search for content in databases, and share data with their colleagues electronically, often in an entirely paperless fashion.
In addition to the above, dozens of commercial and academic research groups are actively working on ways to use software to analyze this rapidly expanding corpus of data, both to provide facile information retrieval and to build decision support systems that ensure new research makes the best possible use of all available prior art. However, despite the near-complete migration from paper to computers, the style in which scientists express their results has barely changed since the dawn of scientific publishing. In particular, ideas and facts are expressed as terse paragraphs of text (kept as short as possible to minimize printing costs) and as stripped-down diagrams that often summarize vast numbers of individual data points in a form that can be visualized statically by a scientific peer. This style of communication has remained consistent because it is effective for its primary purpose, but it also presents a major hurdle to computer software that attempts to perform data mining operations on published results.
In the case of biological assays, that is, experiments designed to measure the effects of introduced substances on a model of a biological system or disease process, the protocols are typically described in one or more textual paragraphs. The target biology, the proteins or cells, the measurement system, the preparation process, etc., are all described using information-rich jargon that allows other scientists to understand the conditions and the purpose of the assay. This comprehension process, however, requires expert knowledge and is quite time consuming. While one scientist may read and understand dozens of published assay descriptions, this approach does not scale to large-scale analysis, e.g., clustering assays into groups after generating pairwise metrics, or searching databases for related assays. Therefore, an improved process and system for identifying and labeling or annotating scientific technical data would be desirable.
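By way of illustration only, the following minimal sketch shows the kind of large-scale analysis referred to above, assuming each assay has already been reduced to a set of machine-readable annotation labels. The assay names, the label strings, the choice of Jaccard similarity as the pairwise metric, and the threshold-based grouping are all assumptions made for this example and are not taken from any particular system.

```python
from itertools import combinations

# Hypothetical annotation sets; the names and labels are illustrative only.
assays = {
    "assay_A": {"target:kinase", "format:cell-based", "readout:luminescence"},
    "assay_B": {"target:kinase", "format:cell-based", "readout:fluorescence"},
    "assay_C": {"target:protease", "format:biochemical", "readout:fluorescence"},
}


def jaccard(a, b):
    """Jaccard similarity: shared labels divided by all distinct labels."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def pairwise_similarity(annotated):
    """Return {(name_x, name_y): similarity} for every unordered pair."""
    names = sorted(annotated)
    return {
        (x, y): jaccard(annotated[x], annotated[y])
        for x, y in combinations(names, 2)
    }


def group_by_threshold(annotated, threshold):
    """Group assays whose similarity meets the threshold (single linkage)."""
    parent = {name: name for name in annotated}

    def find(name):
        # Follow parent links to the group representative, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for (x, y), score in pairwise_similarity(annotated).items():
        if score >= threshold:
            parent[find(x)] = find(y)

    groups = {}
    for name in annotated:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in groups.values())


if __name__ == "__main__":
    for pair, score in pairwise_similarity(assays).items():
        print(pair, round(score, 2))
    print(group_by_threshold(assays, 0.5))
```

Once assays are expressed as label sets, a comparison of this kind takes a few set operations per pair, whereas the same comparison over free-text protocol descriptions would require an expert to read each one.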