1. Technical Field
One or more embodiments relate generally to systems and methods of organizing electronic documents. More specifically, one or more embodiments relate to systems and methods of organizing electronic documents by topic.
2. Background and Relevant Art
The advent of computer technology has led to an increase in communication using various forms of electronic documents. More specifically, advances in computer technology have allowed users to easily generate, duplicate, and communicate electronic text documents. Examples of electronic text documents include computer data files comprising free-form text, such as responses to survey questions, e-commerce customer reviews, electronic messages (e.g., email), or social media posts (e.g., tweets). Additionally, the development of computer technology has enabled users to organize electronic text documents using various techniques. Conventional techniques of organizing electronic text documents, however, often break down when users attempt to organize large numbers of electronic text documents in a meaningful way. Accordingly, conventional systems and methods of organizing electronic text documents typically present several disadvantages.
To illustrate, conventional systems of organizing electronic text documents are generally expensive and/or require significant human effort. For example, many conventional methods rely on human reviewers to manually read and classify each electronic text document by assigning one or more predetermined topics (e.g., codes, labels, tags, categories, etc.) to each electronic text document. Having a human reviewer read through and classify each electronic text document consumes a significant amount of time and resources, especially when the number of electronic text documents is on the order of tens or hundreds of thousands or more.
In an effort to reduce the amount of time and resources needed to manually review each electronic text document, some conventional systems attempt to organize electronic text documents using a classification algorithm. Most conventional classification algorithms, however, require training using a set of manually classified electronic text documents, which can take significant time and incur substantial expense. Moreover, even when conventional systems employ a classification algorithm, the classification algorithm is often static and limited in flexibility, which frequently leads to the inaccurate classification of electronic text documents. More specifically, most conventional classification algorithms are limited to predetermined topics and cannot adapt to emergent or novel topics (i.e., topics that are present within the electronic text documents, but are never identified because they are not included in the predetermined topics). Thus, given the limitation of static predetermined topics and the inability to identify emergent topics, conventional systems are usually rigid, inflexible, and prone to error.
Furthermore, conventional systems of organizing electronic text documents can result in the incorrect organization of electronic text documents due to poor handling of various features of written human language. In particular, conventional systems are often incapable of handling polysemy (i.e., a word having many meanings) and synonymy (i.e., multiple words having the same meaning). As an example of polysemy, the word “bed” can mean a piece of furniture upon which a person sleeps or the bottom of a lake, river, sea, or other body of water. As such, many conventional methods of organizing electronic text documents fail to differentiate between multiple meanings of individual words (e.g., such approaches may organize electronic text documents referring to a person's bed in the same grouping as electronic text documents referring to a lake bed).
As an example of synonymy, the words “couch” and “sofa” can both mean a piece of furniture upon which two or more people can sit. Conventional systems, however, often fail to classify two electronic text documents together based on the sharing of synonyms. Rather, conventional systems often classify the two electronic text documents in separate groupings. Consequently, conventional systems are often incapable of effectively handling various features of written human language, which leads to the inaccurate classification of electronic text documents.
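By way of illustration only, the following minimal sketch shows how a purely word-overlap-based comparison can exhibit both shortcomings described above. The Jaccard similarity measure, the function names, and the four sample documents are assumptions chosen for this example and are not drawn from any particular conventional system.

```python
import re


def tokens(text):
    """Lowercase a document and return its set of word tokens."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity between the token sets of two documents."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


# Hypothetical sample documents (illustrative only).
doc_couch = "The couch arrived damaged"
doc_sofa = "My sofa was broken on delivery"
doc_bed = "The bed frame squeaks every night"
doc_lake = "The lake bed was covered in silt"

# Synonymy: both documents concern damaged seating furniture,
# yet they share no words, so the overlap score is zero.
synonym_score = jaccard(doc_couch, doc_sofa)

# Polysemy: the documents concern unrelated subjects, yet the
# shared surface word "bed" gives them a nonzero overlap score.
polysemy_score = jaccard(doc_bed, doc_lake)

print(f"couch vs. sofa: {synonym_score:.2f}")
print(f"bed vs. lake bed: {polysemy_score:.2f}")
```

In this sketch, the two documents that discuss the same subject using synonyms receive a lower similarity score than the two documents that discuss unrelated subjects using the same word, so a grouping based on such scores would tend to separate the former and combine the latter.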
Accordingly, there are a number of considerations to be made in organizing electronic text documents.