The following relates generally to methods, and to apparatus and articles of manufacture therefor, for defining finite-state networks for marking, tagging, characterizing, or indexing input data recognized by the networks as intelligible natural language data. Such marked data may subsequently be further processed using natural language applications, such as categorization, language identification, and search. In one embodiment, the finite-state marking networks are applied to corrupted language data to mark intelligible language data therein. Text-based errors may be introduced into language data, for example, when image-based documents or audio-based files are processed to identify characters or words therein using, for example, OCR (optical character recognition) or voice-to-text applications. Such errors may arise from character recognition errors that introduce misspellings, rendering a word, or the sentence of which it forms part, unintelligible. These errors hamper subsequent search or analysis of the textual data using natural language processing applications.
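By way of illustration only, the marking of intelligible language data within corrupted text may be sketched as follows. The sketch assumes a small word list compiled into a deterministic finite-state acceptor (a trie) and whitespace tokenization; the names `build_acceptor`, `accepts`, and `mark_intelligible`, and the sample lexicon, are hypothetical and are not taken from the described embodiments, which may define their marking networks differently.

```python
from typing import Dict, Iterable, List, Set, Tuple

# An acceptor is a list of per-state transition tables plus a set of final states.
Acceptor = Tuple[List[Dict[str, int]], Set[int]]


def build_acceptor(words: Iterable[str]) -> Acceptor:
    """Compile a word list into a deterministic acceptor (state 0 is the start state)."""
    transitions: List[Dict[str, int]] = [{}]
    finals: Set[int] = set()
    for word in words:
        state = 0
        for ch in word:
            if ch not in transitions[state]:
                transitions.append({})
                transitions[state][ch] = len(transitions) - 1
            state = transitions[state][ch]
        finals.add(state)
    return transitions, finals


def accepts(acceptor: Acceptor, token: str) -> bool:
    """Return True if the acceptor reaches a final state after consuming the token."""
    transitions, finals = acceptor
    state = 0
    for ch in token:
        next_state = transitions[state].get(ch)
        if next_state is None:
            return False
        state = next_state
    return state in finals


def mark_intelligible(acceptor: Acceptor, text: str) -> List[Tuple[str, bool]]:
    """Pair each whitespace-delimited token with a flag: recognized or not."""
    return [(tok, accepts(acceptor, tok.lower())) for tok in text.split()]


# Hypothetical lexicon and OCR-style corrupted input ("c4t" stands in for "cat").
lexicon = build_acceptor(["the", "cat", "sat"])
marked = mark_intelligible(lexicon, "The c4t sat")
```

In this sketch, tokens flagged `True` would be passed on to later processing, while tokens flagged `False` would be treated as unintelligible.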
Once a corpus is processed using finite-state natural language technology, the data may be indexed for the purpose of querying information in the corpus. An index is generally a data structure that may be used to optimize the querying of information by, for example, indexing the location of key terms found in a corpus. Queries may be simple or complex. For example, a simple query may be used to search for the presence of two terms in a corpus, while a complex query, used for example in a Database Management System (DBMS), may be defined using a specialized language called a query language. Generally, facilities for creating indices and queries are developed separately and have different properties.
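As a minimal sketch of the kind of index and simple query described above, the following assumes a positional inverted index built over whitespace-tokenized, lowercased documents. The function names `build_index` and `docs_with_both`, and the three-document sample corpus, are hypothetical and serve only to illustrate the general idea.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Maps each term to the (document id, token position) pairs where it occurs.
Index = Dict[str, List[Tuple[str, int]]]


def build_index(docs: Dict[str, str]) -> Index:
    """Record the location of every term found in the corpus."""
    index: Index = defaultdict(list)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term].append((doc_id, position))
    return index


def docs_with_both(index: Index, first: str, second: str) -> Set[str]:
    """Simple query: ids of documents in which both terms are present."""
    docs_first = {doc_id for doc_id, _ in index.get(first, [])}
    docs_second = {doc_id for doc_id, _ in index.get(second, [])}
    return docs_first & docs_second


# Hypothetical sample corpus.
corpus = {
    "d1": "finite state networks mark text",
    "d2": "an index speeds up text search",
    "d3": "finite state index",
}
index = build_index(corpus)
hits = docs_with_both(index, "finite", "index")
```

Here the query is answered by intersecting two posting lists rather than by scanning the corpus, which is the optimization an index is intended to provide.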