Similar document search is a useful feature for content management, content search, and content recommendation. In many scenarios, an application needs to find, from a very large collection of documents, the set of documents that are similar to a target document.
The typical solution is to calculate the similarity between each candidate document and the target document, one by one, and then return the candidates that satisfy the similarity criteria. For example, keywords may be looked up one by one in an inverted index, and the intersection of the document sets returned for the individual keywords is taken as the similar document set. However, as the number of candidate documents grows, the time cost of this method becomes significant, and it becomes hard to finish the calculation within the time limit given by a user. Moreover, the method falls short in both performance and accuracy.
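The inverted-index approach described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the implementation of any particular system: the function names `build_inverted_index` and `similar_by_intersection`, the whitespace tokenization, and the three sample documents are all hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each keyword to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: lowercase whitespace tokenization stands in for real keyword extraction.
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def similar_by_intersection(index, keywords):
    """Look up the keywords one by one and intersect the resulting document sets."""
    result = None
    for keyword in keywords:
        postings = index.get(keyword.lower(), set())
        # The first keyword seeds the result; each later keyword narrows it.
        result = set(postings) if result is None else result & postings
    return result if result is not None else set()


# Hypothetical sample collection.
docs = {
    "d1": "solar panel efficiency report",
    "d2": "solar panel installation guide",
    "d3": "wind turbine efficiency report",
}
index = build_inverted_index(docs)
matches = similar_by_intersection(index, ["solar", "panel"])
```

In this sketch every keyword requires its own lookup and intersection, so the work grows with both the number of keywords and the size of the document sets involved. A document that lacks even one keyword is dropped from the result entirely, which is one plausible reading of the accuracy concern raised above.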