Document review is an activity frequently undertaken in the legal field during the discovery phase of litigation. Typically, document review requires reviewers to assess the relevance of documents to a particular topic as an initial step. Document reviews can be conducted manually by human reviewers, automatically by a machine, or by a combination of human reviewers and a machine. As the number of documents to review increases, efficient methods of review are needed to reduce the cost and time spent on review. Identifying duplicate and near duplicate documents can reduce both cost and time by reducing the number of documents to review.
For instance, near duplicate documents can include emails whose threads of text can subsume earlier versions of the same thread. Generally, the most recent reply is located at the top of the document, while the older replies are listed below it. To spare a user from reviewing every email in a thread, only the most recent email, which includes all of the replies, need be reviewed. Alternatively, only original documents need be reviewed.
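As an illustration only, the thread subsumption described above can be sketched in Python. The sketch assumes that each reply quotes the earlier messages verbatim below the new text, so that an earlier email appears as a substring of a later one once whitespace and case are normalized. The function names `normalize` and `most_inclusive_emails`, the document identifiers, and the sample emails are hypothetical and are not drawn from any particular review system; real email threads with quote markers or edited replies would need a more tolerant comparison.

```python
from __future__ import annotations


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return " ".join(text.lower().split())


def most_inclusive_emails(emails: dict[str, str]) -> list[str]:
    """Return the IDs of emails whose text is not contained in another email.

    Assumption: an earlier email in a thread appears verbatim inside later
    replies. Exact duplicates are resolved by keeping the smallest ID.
    """
    normalized = {eid: normalize(body) for eid, body in emails.items()}
    keep = []
    for eid, body in normalized.items():
        subsumed = any(
            other_id != eid
            and body in other
            # A strictly longer email subsumes this one; for identical
            # text, only the email with the smaller ID is retained.
            and (len(other) > len(body) or other_id < eid)
            for other_id, other in normalized.items()
        )
        if not subsumed:
            keep.append(eid)
    return keep


# Hypothetical thread: e3 is the most recent reply and quotes e2 and e1.
sample = {
    "e1": "Can we meet Monday?",
    "e2": "Tuesday works better.\n\nCan we meet Monday?",
    "e3": "Fine, Tuesday.\n\nTuesday works better.\n\nCan we meet Monday?",
    "e4": "Unrelated note about billing.",
}
to_review = most_inclusive_emails(sample)
```

Under these assumptions, only the most recent email of the thread and the unrelated email remain for review. The pairwise comparison is quadratic in the number of emails, which is acceptable for a sketch but would likely need indexing or hashing for a large collection.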
Thus, there remains a need for a system and method for efficiently and effectively identifying duplicate and near duplicate documents to reduce costs and time spent reviewing documents.