1. Field
This disclosure is generally related to analysis of document similarities. More specifically, this disclosure is related to identifying similar documents based on meaningful entities extracted from the documents as well as user input.
2. Related Art
Modern workers often deal with large numbers of documents; some are self-authored, some are received from colleagues via email, and some are downloaded from websites. Many of these documents are related to one another because a user may modify an existing document to generate a new one. For example, a worker may generate an annual report by combining a number of previously generated monthly reports. Similarly, when email users correspond back and forth about a related topic, their messages often share similar words or combinations of words. For example, conversations discussing local weather may all include words like “rain,” “snow,” or “wind.”
Accordingly, some document-similarity calculation methods compare the occurrences of meaningful words, defined as “entities,” in order to derive similarities between messages or conversations. Other methods estimate document similarity by detecting a sequence of operations performed when the document is generated. However, such approaches do not consider comparing documents based on different document-similarity calculation methods. Furthermore, the density of entities in a document is often too low for similarity calculations based on semantic entities to be reliable.
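By way of illustration only, the entity-occurrence comparison described above can be sketched as follows. This is a minimal sketch under stated assumptions and is not the method of any particular prior system: the fixed entity vocabulary, the function names, and the choice of cosine similarity as the comparison measure are all hypothetical and are introduced here purely for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical entity vocabulary (assumption for illustration);
# a practical system would obtain entities from an entity extractor.
ENTITIES = {"rain", "snow", "wind"}


def entity_counts(text, entities=ENTITIES):
    """Count occurrences of known entities in the text, ignoring case."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t in entities)


def cosine_similarity(a, b):
    """Cosine similarity between two sparse entity-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        # At least one document contains no entities; report 0 by convention.
        return 0.0
    return dot / (norm_a * norm_b)


def document_similarity(doc_a, doc_b):
    """Similarity of two documents based only on shared entity occurrences."""
    return cosine_similarity(entity_counts(doc_a), entity_counts(doc_b))
```

In this sketch, a document that contains none of the listed entities always receives a similarity of zero, and a document that contains only one or two entities produces a score that swings sharply with each additional occurrence. Both effects illustrate why a low entity density makes such calculations unreliable.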