When a researcher searches for articles or other documents relevant to a research topic, the literature search often returns an overwhelming number of documents. To narrow down this set, a reviewer may manually screen the abstracts of hundreds or even thousands of documents to identify the small subset that is potentially relevant for further analysis. For example, the systematic review and meta-analysis of the scientific literature related to a medical problem are the foundation of a field known as evidence-based medicine, or EBM. To identify a core set of articles (dozens to thousands) for an EBM review, a literature search often begins with thousands of citations, from which the investigators must screen and select the potentially relevant articles. An average EBM evidence report may address one to five key questions and require the manual screening of 5,000 abstracts to find about 100 articles that meet inclusion criteria.
The manual document screening process, however, suffers from several drawbacks. One drawback is that manual screening is prone to human error. One source of error is a human reviewer's tendency to consider only a fraction of the information in any one document or abstract. The manual document screening process is also often tedious and fatiguing. To reduce the effects of these problems, the screening task is typically done by a team of several people experienced in scientific and methodological issues. The team approach, however, often produces inconsistent results because each team member performs differently. To combat this inconsistency, and because topics vary from project to project, team training may be employed in an effort to improve the consistency of the results. Duplicate manual screening may also be used to combat errors and inconsistencies. Training and duplicate screening, however, may undesirably increase, or even double, the cost of screening a group of documents.
The manual screening process is also time-consuming, as even an experienced reviewer may spend thirty seconds or more screening a single abstract. At that rate, screening five thousand abstracts requires about five person-days.
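The five-person-day figure above can be checked with a short calculation. The sketch below is illustrative only: the function name `screening_person_days` and the eight-hour working day are assumptions introduced here, not values stated in the text.

```python
def screening_person_days(n_abstracts, seconds_per_abstract=30, hours_per_day=8):
    """Estimate manual screening effort in person-days.

    hours_per_day=8 is an assumed working day; the text gives only
    the per-abstract time and the abstract count.
    """
    total_hours = n_abstracts * seconds_per_abstract / 3600
    return total_hours / hours_per_day


# 5,000 abstracts at 30 seconds each is about 41.7 hours of screening.
single_pass = screening_person_days(5000)
print(f"{single_pass:.1f} person-days")  # prints "5.2 person-days"

# Duplicate screening repeats the whole pass, doubling the effort.
duplicate_pass = 2 * single_pass
```

Under these assumptions a single pass takes slightly more than five working days, and duplicate screening brings the total to more than ten.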
The manual screening process is also inflexible. If a review question or criterion changes mid-project, the documents or abstracts must be manually re-screened, which introduces additional fatigue, error, and reviewer boredom.
Computers can be used to aid the screening process. However, conventional document review algorithms rely on keywords, and keywords alone often represent poorly the complex information sought, such as the complex information in medical abstracts and technical documents. More advanced document search algorithms, such as those found in the fields of natural language processing (NLP) and information extraction (IE), improve on keyword searches but still suffer from deficiencies. A number of special-purpose products are commercially available to extract entities and “facts” from text documents, such as ThingFinder™ by InXight, Extractor™ by NetOwl, and products by the Autonomy™ company, and most of these products can be customized with catalogues of domain-specific entities and with special rules to extract specific information. Among other problems, however, these products are not user-friendly; they must be pre-configured by experts and are thus too inflexible to be applied to the problem of large-scale document screening. Moreover, these products do not “learn”; that is, they do not become better at finding relevant documents as they are used.
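The weakness of keyword-only screening can be illustrated with a toy example. The sketch below is a hypothetical illustration, not any product's algorithm: the function `keyword_screen`, the two sample abstracts, and the keyword choice are all invented here for demonstration.

```python
import re


def keyword_screen(abstract, keywords):
    """Return True if any keyword appears as a whole word in the abstract."""
    words = set(re.findall(r"[a-z]+", abstract.lower()))
    return any(keyword in words for keyword in keywords)


# Hypothetical abstracts; both describe the same clinical topic.
abstracts = {
    "A": "Randomized trial of aspirin for prevention of myocardial infarction.",
    "B": "Aspirin therapy reduced heart attack incidence in a controlled study.",
}

keywords = {"infarction"}
selected = [name for name, text in abstracts.items()
            if keyword_screen(text, keywords)]
print(selected)  # prints ['A']
```

Abstract B is just as relevant as abstract A but is missed because it uses a synonym. Adding more keywords recovers B only at the cost of admitting unrelated documents that happen to share those words.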
Moreover, whether keyword, NLP, or IE technology is used to make a first cut and identify an initial set of documents for further consideration, the initial set is typically so large that it is neither effective nor cost-efficient to evaluate each document in the set individually to produce a desired subset of relevant documents.
Accordingly, there exists a need for systems and methods that may transform what is now a labor-intensive manual process or a limited computer-aided process into an efficient, flexible process capable of handling complex information in a large number of documents. There is a need for systems and methods that accurately and efficiently classify documents the user has not yet seen, and that improve their classification of documents as the user learns more about the information provided in the documents.