For financial institutions and professionals, conducting the “Know Your Customer” (KYC) process is a crucial task at the point of sale. The KYC solution sits in parallel to all the data exchange and transactional systems. These systems help financial institutions avoid the risk of, and vulnerability to, fraud by customers. Thus, data collected or extracted from the documents that customers provide for KYC must be highly accurate. The text data in such documents may include spelling mistakes, punctuation errors, and junk characters. These errors degrade the accuracy of information extraction from documents for the purpose of KYC.
One conventional technique uses taxonomy-based or dictionary-based approaches to correct such erroneous data. However, this conventional technique is not scalable and has several limitations: the dictionary or taxonomy list must be updated continuously; the context in which a word appears is not considered; numerical data in financial tables is hard to correct; and noise characters may not be identified without knowledge of the context and semantic correctness.
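For illustration only, the dictionary-based approach described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of any particular conventional system: the vocabulary, the function names `correct_token` and `correct_text`, and the similarity cutoff are all hypothetical, and Python's standard `difflib` module is used as a stand-in for whatever matching method such a system might apply.

```python
import difflib

# Hypothetical vocabulary (assumption). A real dictionary would be far larger
# and would have to be updated continuously.
VOCABULARY = ["account", "address", "balance", "customer", "statement"]


def correct_token(token, vocabulary=VOCABULARY, cutoff=0.8):
    """Return the closest dictionary word, or the token unchanged if none is close enough."""
    matches = difflib.get_close_matches(token.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def correct_text(text):
    """Correct each whitespace-separated token independently of its neighbours."""
    return " ".join(correct_token(token) for token in text.split())


if __name__ == "__main__":
    # "1O,5OO" is a number with the letter O typed in place of zero.
    print(correct_text("custmer acount 1O,5OO"))
```

In this sketch the misspelled words are mapped to their nearest dictionary entries, while the corrupted numerical value passes through unchanged because no dictionary entry resembles it. Each token is also corrected in isolation, so the surrounding context plays no role in the result.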
Another conventional technique proposes a method for generating text using machine learning algorithms, for example, a Recurrent Neural Network (RNN). The RNN is used to generate text, and the text is validated using a generative adversarial network. The RNN initially generates random noise, which may be corrected based on feedback from the generative adversarial network. This process repeats until proper text with semantic meaning is generated. However, this conventional technique only generates text from noise and does not correct data based on its semantic context.
Yet another conventional technique builds a conditional sequence generative adversarial net comprising two adversarial sub-models: a generative model (generator), which translates the source sentence into the target sentence as traditional Neural Machine Translation (NMT) models do, and a discriminative model (discriminator), which discriminates the machine-translated target sentence from the human-translated one. However, this conventional technique generates text word by word and validates the generated text using the generative adversarial network; it likewise does not correct any errors in the generated text.
Another conventional technique uses the teacher forcing algorithm, which trains recurrent networks by supplying observed sequence values as inputs during training and using the network's own one-step-ahead predictions to perform multi-step sampling. However, this conventional technique merely uses a different algorithm to validate the output sequence generated using the RNN. In this approach, the semantic context is not considered. The algorithm validates the generated data against the actual data, but does not correct the data.
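For illustration only, the difference between teacher-forced training and multi-step sampling can be sketched as follows. This is a minimal sketch under stated assumptions: a scalar one-step-ahead predictor with a single weight `w` stands in for the recurrent network, and the function names `train_teacher_forced` and `sample_free_running`, the learning rate, and the example sequence are all hypothetical.

```python
def train_teacher_forced(sequence, epochs=200, lr=0.1):
    """Fit w in x[t+1] ~ w * x[t], always feeding the observed x[t] as the input."""
    w = 0.0
    for _ in range(epochs):
        for t in range(len(sequence) - 1):
            observed_input = sequence[t]  # ground-truth value, not the model's own output
            target = sequence[t + 1]
            prediction = w * observed_input
            # Gradient step on 0.5 * (prediction - target) ** 2 with respect to w.
            w -= lr * (prediction - target) * observed_input
    return w


def sample_free_running(w, start, steps):
    """Multi-step sampling: each one-step-ahead prediction becomes the next input."""
    outputs = []
    current = start
    for _ in range(steps):
        current = w * current
        outputs.append(current)
    return outputs


if __name__ == "__main__":
    observed = [1.0, 0.5, 0.25, 0.125]  # hypothetical training sequence
    w = train_teacher_forced(observed)
    print(w, sample_free_running(w, observed[0], 3))
```

During training the predictor only ever sees observed values, and each prediction is compared against the actual next value. Nothing in either function alters or corrects the observed sequence itself, which is the limitation noted above.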