Unstructured data is widespread across many industries and across the Internet. Techniques exist for processing unstructured data into a more useful form. For text, the principal group of such techniques is known as Natural Language Processing (NLP).
A standard NLP technique for structuring data for analysis is tokenisation. Tokenisation breaks a text document down into words, phrases, or other meaningful elements called tokens. The number of occurrences of each token in the document can be used, along with hand-written rules, to guess the nature of the document and the topics it covers.
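By way of illustration only, the generic approach described above might be sketched as follows. The rule table `TOPIC_RULES`, the function names `tokenise` and `guess_topics`, and the `min_hits` threshold are hypothetical and chosen for this example; they are not taken from any particular prior-art system.

```python
import re
from collections import Counter

def tokenise(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

# Hypothetical hand-written rules: each topic is tied to a few keywords.
TOPIC_RULES = {
    "finance": {"bank", "loan", "interest"},
    "medicine": {"patient", "dose", "symptom"},
}

def guess_topics(text, min_hits=2):
    """Return topics whose keywords occur at least min_hits times in total."""
    counts = Counter(tokenise(text))
    scores = {
        topic: sum(counts[word] for word in keywords)
        for topic, keywords in TOPIC_RULES.items()
    }
    # Highest-scoring topics first; topics below the threshold are dropped.
    ranked = sorted(scores.items(), key=lambda item: -item[1])
    return [topic for topic, score in ranked if score >= min_hits]
```

For example, `guess_topics("The bank raised the interest rate on the loan.")` counts three finance keywords and no medicine keywords, so only the finance topic passes the threshold. Because the rules are fixed keyword lists, a sketch of this kind has no knowledge of any particular field beyond what is written into the table by hand.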
The disadvantage of such a generic approach is that it processes documents broadly and therefore produces less accurate outcomes.
There is a desire for an improved data processing method and system which can be optimised for specific fields.
It is an object of the present invention to provide a method and system of processing data using an augmented natural language processing engine which overcomes the disadvantages of the prior art, or at least provides a useful alternative.