Electronic documents, or passages of text otherwise stored electronically (such as text stored directly on web pages accessible via the internet), can contain large amounts of information, either in a single source or across multiple sources. This is particularly relevant to the review of large volumes of electronic documents, whether originally created in electronic form or converted into electronic form, where particular types of passages or groups of text have to be identified. For example, it could be necessary to search through a document or a number of documents to identify passages related to specific contract language in a legal due diligence exercise.
Prior art solutions range from simple word searches in text to slightly more sophisticated methods capable of searching text based on the characteristics of the text or of its tokens, or on the tokens contained within the text of headers. One such example is shown in the paper entitled “Identifying Sections in Scientific Abstracts Using Conditional Random Fields” by Hirohata et al. (2008). In this paper, conditional random fields are used to determine section headings in the abstracts of scientific papers and then to pull out the specific text following those headings. For example, one could use the Hirohata method to extract the Conclusions from a group of scientific papers in order to quickly review the conclusions drawn across a plurality of papers without having to manually search through the text of each paper.
However, the Hirohata method is heavily dependent on the proper labelling of sections within abstracts and, in the example above, specifically requires that a heading named “Conclusion” be present in order to pull out the sentence or sentences following this heading. Applying the teachings of Hirohata more broadly would still require the search to be based on a particular pre-defined token or feature of the text.
There is accordingly a need in the art for an improved method and system capable of identifying passages in electronic documents.