A significant portion of an organization's knowledge base is contained in text within unstructured sources, such as word processing documents and electronic mail. To facilitate analysis of this text, text analysis tools have been developed to extract specific features (e.g., sentences, paragraphs, clauses, entities) from unstructured text sources. These tools may also assign types to the extracted features using pre-defined catalogues of recognized terms. The utility of these tools is therefore strongly linked to the quality and relevance of the catalogues.
For example, a conventional text analysis tool may extract text entities such as people, places, organizations, dates and countries. The tool may employ a generic catalogue which allows it to identify general entity types without requiring setup or manual configuration. To enhance the quality and relevance of the extracted text entities, users may manually generate custom catalogues for extracting custom entities such as project names, internal document names, domain-specific terminology and numbers. Generation and maintenance of these custom catalogues can be costly and error-prone.
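By way of illustration only, catalogue-based extraction of the kind described above may be sketched as follows. The sketch assumes that a catalogue is a simple mapping from a recognized term to an entity type and that matching is exact and case-sensitive. The catalogue contents, the type labels and the function name extract_entities are hypothetical and are not taken from any particular text analysis tool.

```python
import re

# Hypothetical generic catalogue: recognized term -> entity type.
GENERIC_CATALOGUE = {
    "Germany": "COUNTRY",
    "Acme Corp": "ORGANIZATION",
}

# Hypothetical custom catalogue, maintained manually by a user.
CUSTOM_CATALOGUE = {
    "Project Falcon": "PROJECT_NAME",
}


def extract_entities(text, *catalogues):
    """Return (term, type) pairs for catalogue terms found in text, in order of appearance."""
    merged = {}
    for catalogue in catalogues:
        merged.update(catalogue)  # later catalogues override earlier ones
    if not merged:
        return []
    # Try longer terms first so that a longer term wins over a shorter one it contains.
    terms = sorted(merged, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b")
    return [(m.group(0), merged[m.group(0)]) for m in pattern.finditer(text)]
```

In this sketch, applying only the generic catalogue to the sentence "Acme Corp launched Project Falcon in Germany." yields the organization and the country but not the project name. The project name is extracted only once the manually maintained custom catalogue is also supplied, which illustrates the dependence on custom catalogues noted above.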
Systems are desired which provide improved extraction of text entities while addressing shortcomings in conventional approaches. For example, systems are desired which exhibit reduced reliance on custom catalogues.