The background description provided herein is for the purpose of generally presenting the context of the disclosure. Work of the presently named inventors, to the extent it is described in this background section, as well as aspects of the description that may not otherwise qualify as prior art at the time of filing, are neither expressly nor impliedly admitted as prior art against the present disclosure.
Most early approaches to automated text processing were based on pattern recognition, morphological decomposition, syntactic analysis, and rule-based processing. For example, a rule for analyzing text could have the general “if . . . then . . . else” structure. However, these approaches are, in general, limited and computationally inefficient.
Today, artificial intelligence (AI) systems widely utilize machine-learning (ML) models to analyze large sets of data, including text-based data. An ML model in general relies on a corpus of prior knowledge to predict future outcomes. For example, an ML model can be trained with data tuples listing the height, weight, and age of a professional basketball player, along with the number of points he or she scores per season. The height, weight, and age in this example can correspond to features, and the number of points can correspond to a label. Once trained, the ML model can predict how many points a basketball player of a certain height, weight, and age will score per season. The efficiency and accuracy of the predictions an ML model yields depend to a large extent on how well the model is trained, what features the model is configured to recognize, and how the model is otherwise parameterized. In the example above, the data tuples can include additional features such as the month in which the basketball player was born, but such a feature will likely increase the complexity of the model without improving the predictions. A less efficient model, even when capable of yielding generally accurate predictions, can require more computing power, more memory, and more time to execute.
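By way of illustration only, the feature-and-label example above can be sketched in Python as a small least-squares regression. This sketch is not part of the disclosure and does not represent any particular model described herein. The function names (`made_up_label`, `solve`, `train`, `predict`), the player measurements, and the linear rule used to generate the labels are all hypothetical and were invented for this sketch.

```python
# Hypothetical training tuples: (height_cm, weight_kg, age_years).
# The numbers are invented for illustration and describe no real players.
FEATURES = [
    (198.0, 95.0, 24.0),
    (206.0, 110.0, 29.0),
    (185.0, 82.0, 22.0),
    (211.0, 118.0, 31.0),
    (193.0, 88.0, 27.0),
    (201.0, 100.0, 25.0),
]


def made_up_label(height, weight, age):
    """Invented linear rule, used only to generate toy points-per-season labels."""
    return 4.0 * height + 3.0 * weight - 10.0 * age


LABELS = [made_up_label(*row) for row in FEATURES]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def train(features, labels):
    """Fit ordinary least squares; returns (feature means, coefficients)."""
    means = [sum(col) / len(col) for col in zip(*features)]
    # Design matrix: a bias column of ones plus mean-centered features.
    design = [[1.0] + [v - mu for v, mu in zip(row, means)] for row in features]
    k = len(design[0])
    # Normal equations: (X^T X) w = X^T y
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, labels)) for i in range(k)]
    return means, solve(xtx, xty)


def predict(model, row):
    """Predict the label for one (height, weight, age) tuple."""
    means, coef = model
    centered = [v - mu for v, mu in zip(row, means)]
    return coef[0] + sum(w * v for w, v in zip(coef[1:], centered))


model = train(FEATURES, LABELS)
prediction = predict(model, (200.0, 100.0, 26.0))
print(f"Predicted points per season: {prediction:.1f}")
```

Because the toy labels follow the invented linear rule exactly, the fitted model should return a value close to 840 for the input above. An uninformative feature such as birth month could be appended to each tuple in the same way; doing so would enlarge the system of equations to be solved without necessarily improving the predictions.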
One of the areas in which ML models could potentially be used is the processing of invoices, which tend to be structured and worded in a recognizable manner, especially when the invoices pertain to the same industry. However, the large number of random variables, or “dimensions,” that may be expected in an invoice (even when the industry is known) makes automatic processing of these invoices technically difficult. Other difficulties include differences in the way similar activities are described or coded, complex relationships between dimensions that cannot be easily expressed in terms of rules, and deviations from common practices and formal guidelines. Meanwhile, an inefficient ML model can require excessive processing power, memory, and time, as discussed above.