Computers and computing systems have affected nearly every aspect of modern living. Computers are generally involved in work, recreation, healthcare, transportation, entertainment, household management, etc. One particular area of computer technology that has seen exponential growth is large-scale data processing. Technologies have sought to find more efficient and more accurate ways of dealing with tremendous volumes of data. For example, Information Retrieval (IR) systems such as search engines and Question Answering (QA) systems have been broadly implemented to retrieve information.
One exemplary source of large amounts of information is electronic messages, which are sent for work and personal correspondence. Electronic messages, however, pose unique challenges to data management and IR. For example, data within electronic messages is generally pertinent to, and readily accessible by, only the parties included in the communication threads. Additionally, the resulting data tends to comprise small cells of information that have low information density and lack context, making them difficult to analyze for some machine learning approaches that rely on large bodies of data to provide reliable accuracy.
Implementing a computer system that is capable of intelligently processing conversational data is associated with several significant technical problems. For example, many conventional systems suffer from a lexical gap. A lexical gap exists when statements composed of different words share the same meaning. For example, the questions “how to get rid of stuffy nose?” and “how to prevent a cold?” are both associated with the same concept and resulting answer, but the two questions are composed of significantly different words. While human minds are easily able to identify the common solution to both questions, computer-based systems are presented with significant technical challenges in identifying the commonality.
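By way of illustration only, the lexical gap can be made concrete with a simple word-overlap measure. The following is a minimal sketch, not a description of any particular conventional system; the helper names `tokens` and `jaccard` are hypothetical, and Jaccard similarity is assumed here merely as one common surface-level comparison.

```python
import re


def tokens(text):
    """Lowercase the text and return the set of its word tokens."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity between the word sets of two statements."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty statements are treated as identical
    return len(ta & tb) / len(ta | tb)


q1 = "how to get rid of stuffy nose?"
q2 = "how to prevent a cold?"

shared = tokens(q1) & tokens(q2)  # only the function words "how" and "to"
overlap = jaccard(q1, q2)         # 2 shared words out of 10 distinct words
print(sorted(shared), overlap)
```

Under these assumptions, the two questions share only the words “how” and “to”, so a system that compares statements by surface words alone would score them as largely unrelated even though they call for the same answer.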
An additional technical challenge that is presented to computer systems relates to polysemy. Polysemy occurs when a word takes on different senses as the context changes. For example, the word “apple” may refer to a “computer company” or a type of “fruit” according to its context. Similar to the lexical gap problem, this is a problem that human minds are naturally able to overcome, but computer-based systems have significant challenges in distinguishing between the meanings of words based on context.
Another technical challenge relates to word order within statements. For example, two questions sometimes express totally different meanings even though they contain the same words. The sentences “does the cat run faster than a rat?” and “does the rat run faster than a cat?” comprise the same words but have very different meanings.
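By way of illustration only, the word-order problem can be shown with an order-insensitive representation. The following is a minimal sketch under stated assumptions: the helper names `bag_of_words` and `bigrams` are hypothetical, and a bag-of-words count is assumed merely as one example of a representation that discards word order.

```python
import re
from collections import Counter


def words(text):
    """Lowercase the text and return its word tokens in order."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(text):
    """Order-insensitive representation: word -> number of occurrences."""
    return Counter(words(text))


def bigrams(text):
    """Order-sensitive representation: list of adjacent word pairs."""
    w = words(text)
    return list(zip(w, w[1:]))


s1 = "does the cat run faster than a rat?"
s2 = "does the rat run faster than a cat?"

same_bag = bag_of_words(s1) == bag_of_words(s2)  # identical word counts
same_bigrams = bigrams(s1) == bigrams(s2)        # adjacent pairs differ
print(same_bag, same_bigrams)
```

In this sketch the two sentences are indistinguishable as bags of words, while a representation that retains adjacency, such as the word pairs above, separates them.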
Yet another technical challenge relates to data sparsity. When training a computer system to properly identify context, conventional systems utilize large data sets. In some cases, though, a large dataset may not be available, or a large dataset may dilute the actual desired dataset. As such, it would be desirable to provide systems and methods that are capable of accurately relying upon small datasets.
In view of the above, there exists a need to have an IR system that retrieves content from electronic messages and that quickly and accurately analyzes and stores the information for later use. In particular, there is a need for systems and methods of accomplishing this task despite the low information density and sparse context associated with electronic messages. The subject matter claimed herein is not limited to embodiments that solve any disadvantages or that operate only in environments such as those described above. Rather, this background is only provided to illustrate one exemplary technology area where some embodiments described herein may be practiced.