In computing, Natural Language Processing (NLP) is a field concerned with the interactions between computers and human (i.e., natural) languages. Natural language generation systems convert information from computer databases into readable human language. The term “natural language” is used to distinguish human languages from computer languages (e.g., C++ or Java). NLP may be used for both text and speech recognition, although, over time, work on speech processing has evolved into a separate field. In NLP, information extraction is a type of information retrieval whose purpose is to automatically extract structured information from unstructured machine-readable documents. A broad goal of information extraction is to allow computation to be done on previously unstructured data. A more specific goal is to allow logical reasoning to draw inferences based on the logical content of the input data. A typical use of information extraction is to scan a set of documents written in a natural language and populate a database with the extracted information. More specifically, information extraction includes tasks such as named entity recognition, terminology extraction, and relationship extraction. Named entity recognition locates and classifies atomic elements in text into predefined categories such as the names of persons, organizations, and locations, expressions of time, quantities, and so on.
Data transactions between business partners often include unstructured data such as invoices or purchase orders. To process such unstructured data automatically, complex business entities need to be identified. Examples of such complex business entities include products, business partners, and purchase orders that are stored in a supplier relationship management system. Both structured records in the enterprise system and text data describe these complex entities. Analyzing and integrating documents in a supplier relationship management system is typically a manual process. For example, an agent checks for a purchase order identifier (ID) in an invoice. If such an ID is found, the agent associates the document with the structured data for the purchase order in the supplier relationship management system and checks whether the purchase order corresponds to the invoice. If no ID is found, the agent creates an invoice in the system and manually enters the relevant information. However, automatically identifying, within the invoice, the ID and the associated data stored with the purchase order in the structured data could save time and reduce expenses and human error.
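The manual workflow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of any particular system: the `PO-<digits>` identifier format, the `PurchaseOrder` record, the in-memory `PURCHASE_ORDERS` table, the status strings, and the supplier-name check used to decide whether an order "corresponds" to an invoice are all hypothetical. The `unknown_po` outcome is likewise an added assumption for IDs that appear in the text but not in the structured data.

```python
import re
from dataclasses import dataclass

# Assumption: purchase order IDs look like "PO-" followed by digits.
PO_ID_PATTERN = re.compile(r"\bPO-\d+\b")


@dataclass
class PurchaseOrder:
    """Hypothetical structured record for a purchase order."""
    po_id: str
    supplier: str
    total: float


# Stand-in for the structured data held in the enterprise system.
PURCHASE_ORDERS = {
    "PO-10042": PurchaseOrder("PO-10042", "Acme Corp", 1250.00),
}


def process_invoice(text: str) -> dict:
    """Mimic the agent's steps: find an ID, link it, check correspondence."""
    match = PO_ID_PATTERN.search(text)
    if match is None:
        # No ID found: the invoice would have to be entered by hand.
        return {"status": "manual_entry", "po_id": None}

    po_id = match.group(0)
    order = PURCHASE_ORDERS.get(po_id)
    if order is None:
        # ID present in the text but absent from the structured data.
        return {"status": "unknown_po", "po_id": po_id}

    # Crude correspondence check (assumption): the supplier named on the
    # purchase order must also appear somewhere in the invoice text.
    corresponds = order.supplier.lower() in text.lower()
    return {"status": "linked" if corresponds else "mismatch", "po_id": po_id}
```

Calling `process_invoice("Invoice from Acme Corp for order PO-10042")` under these assumptions yields a `linked` status, while an invoice with no recognizable ID falls through to `manual_entry`, which is the case the passage identifies as costly.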
Identification of entities from unstructured text to create machine-readable knowledge has been investigated for several decades. There are many approaches in this area, such as named entity recognition. Three main techniques have been employed in the past for identifying entities: 1) rules describing the patterns of occurrence of entities; 2) machine learning techniques that identify the best-matching feature combination on the basis of training data; and 3) lookup of predefined entities in a domain-specific dictionary. However, these techniques neither link extracted data to structured data nor map relationships in the structured data to relationships implicit in the text.
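Two of the three techniques listed above, rules and dictionary lookup, are simple enough to illustrate with a minimal Python sketch. The date pattern, the dictionary entries, and the function names are assumptions chosen for illustration; the machine learning technique is omitted because it would require training data. Note that both functions return only text spans with labels, which reflects the limitation stated above: nothing ties a match to a record in structured data.

```python
import re

# Technique 1 (rules): a hypothetical pattern for ISO-style dates.
DATE_RULE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

# Technique 3 (dictionary): a hypothetical domain-specific dictionary
# mapping lowercase terms to entity categories.
ENTITY_DICTIONARY = {
    "acme corp": "ORGANIZATION",
    "berlin": "LOCATION",
}


def find_by_rule(text: str) -> list:
    """Return (span, label) pairs for every match of the date rule."""
    return [(m.group(0), "DATE") for m in DATE_RULE.finditer(text)]


def find_by_dictionary(text: str) -> list:
    """Return (span, label) for the first occurrence of each dictionary term."""
    lowered = text.lower()
    found = []
    for term, label in ENTITY_DICTIONARY.items():
        start = lowered.find(term)
        if start != -1:
            # Slice the original text so the original casing is preserved.
            found.append((text[start:start + len(term)], label))
    return found
```

For a sentence such as "Acme Corp shipped from Berlin on 2009-03-15.", the rule finds the date and the dictionary finds the organization and the location, but neither result identifies which supplier record or purchase order the sentence refers to.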