1. Field of Art
This disclosure relates generally to the field of information extraction (IE) technologies and specifically to the field of named entity recognition (NER) for extracting entity values from texts, with applications to information leakage prevention and detection systems.
2. Description of the Related Art
In general, named entity recognition is an information extraction task that seeks to identify atomic elements in texts and classify them into pre-defined categories such as personal names, personal identifiers (such as SSNs), organizations, addresses, email addresses, account numbers, phone numbers, credit card numbers, dates, time expressions, monetary values, etc. These pre-defined data categories are usually referred to as named entities, or entities for short.
Known named entity recognition technologies have shortcomings. For example, the capabilities of existing named entity recognition technologies are insufficient to address the following query problem. Assume that (1) entities are pre-defined; (2) millions of instances of these pre-defined entities are stored in a database system or other data storage such as an MS Excel spreadsheet; and (3) multiple entities are relevant and their instances can be presented in a tabular format, much like data records. Then, for any given text, current named entity recognition technologies are unable to (1) identify the pre-defined entity instances within the text and map the identified instances to data records; or (2) verify the extracted entity records based on validation methods and matching rules.
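By way of illustration only, the query problem stated above can be sketched in a few lines of Python. This sketch is not the method of this disclosure or of any existing engine. The record table, the field names, the function name `match_records`, and the `min_fields` matching rule are all hypothetical, and a naive substring scan of this kind would not scale to millions of stored instances.

```python
import re
from typing import Dict, List

# Hypothetical table of pre-defined entity instances, one dict per data record.
RECORDS: List[Dict[str, str]] = [
    {"name": "Jane Doe", "ssn": "123-45-6789", "phone": "555-0100"},
    {"name": "John Roe", "ssn": "987-65-4321", "phone": "555-0199"},
]


def normalize(s: str) -> str:
    """Collapse whitespace runs and lowercase so matching ignores layout and case."""
    return re.sub(r"\s+", " ", s).strip().lower()


def match_records(
    text: str, records: List[Dict[str, str]], min_fields: int = 2
) -> List[Dict[str, str]]:
    """Return, per record, the fields whose stored values occur in the text.

    A record is reported only if at least `min_fields` of its values are
    found. This threshold is an assumed, deliberately simple matching rule.
    """
    haystack = normalize(text)
    hits = []
    for record in records:
        # Keep only the fields of this record whose values appear in the text.
        found = {
            field: value
            for field, value in record.items()
            if normalize(value) in haystack
        }
        if len(found) >= min_fields:
            hits.append(found)
    return hits


if __name__ == "__main__":
    sample = "Please update Jane Doe, SSN 123-45-6789, in payroll. Call 555-0199."
    print(match_records(sample, RECORDS))
```

In this sketch, the sample text yields a partial record for the first row (name and SSN), while the second row is discarded because only its phone number appears. This reflects the two parts of the query problem: mapping identified instances to data records, and verifying them against a matching rule.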
In an attempt to address these shortcomings, conventional named entity recognition engines have been developed that apply linguistic grammar-based techniques as well as statistical models. However, the direct application of existing named entity recognition technologies to the query problem described above results in several disadvantages. For example, such technologies have poor accuracy. Specifically, the false positive rate is more than 15% even for the best named entity recognition engine, and the false negative rate is even worse. Another problem is slow processing speed due to the nature of natural language text processing. Yet another problem is that conventional named entity recognition technologies are language dependent. This limits the flexibility and portability of such technologies because they are designed around specific individual written languages.
Hence, the present art lacks a system and a method for extracting entity values from texts. Moreover, the present art lacks a system and method for extracting entity values from text in an information leakage prevention and detection system.