A number of techniques have been proposed for providing linguistic services. For example, it has been proposed to provide software running in a dedicated server computer attached to a network, so that a linguistic service such as foreign language comprehension assistance is available on the network. Other proposed linguistic services include tokenization, tagging, morphological analysis, language identification, disambiguation, idiom recognition, contextual dictionary lookup, terminology extraction, and noun-phrase extraction, and it has been proposed to provide these services in multiple languages. It has further been proposed to use an object-oriented design, such as a design written in the C++ programming language, in a portable, robust, extensible architecture for both standalone and client-server implementations.
In one proposal, a version of Xerox Linguistic Development Architecture (XeLDA), client code running on a client machine provides a request to the network for delivery to a server machine; server code running on the server machine receives the request, causes execution of appropriate software modules to perform the requested service, and produces a result, which is then provided to the network for delivery to the client machine. The server code for this version of XeLDA has an input adapter for retrieving and extracting data from the request before services are performed and an output adapter for modifying or formatting the results of the services before providing them to the network. The client code includes service stubs supporting a user application.
Faith, R. and Martin, B., A Dictionary Server Protocol, The Internet Society, October 1997, pp. 1-11, disclose a TCP transaction-based query-response protocol that allows a client to access dictionary definitions from a set of natural language dictionary databases. The server protocol is an interface between programs and the dictionary databases. Commands and replies are composed of encoded characters, and each command consists of a command word followed by zero or more parameters. The parameters can include databases, strategies, and words. A response can be a status response indicating the server's response to the last command received or a textual response sent as a series of successive lines of textual matter. If an OPTION MIME command has been given, all textual responses are prefaced by a MIME header. Although the protocol could have been extended to specify searches over databases with certain attributes, this would needlessly complicate parsing and analysis, and the classification system could restrict the types of databases that can be used. In the future, extensions to the protocol may be provided to allow a client to request binary encodings. Also, standard extensions should be proposed to allow the client to request certain content types or encodings. Given a database with sufficient mark-up information, it may be possible to generate output in a variety of different formats, the use of which may be explored as extensions to the protocol. Commands beginning with the letter "X" are reserved for experimental extensions.
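By way of illustration only, the command and status-response structure described above might be handled along the following lines. This is a minimal sketch and not part of the cited protocol document: the names `Command`, `parse_command`, and `parse_status` are hypothetical, and the use of `shlex` only approximates the protocol's quoting rules. The sketch assumes, as in the published protocol, that a status response begins with a numeric code.

```python
import shlex
from typing import List, NamedTuple, Tuple


class Command(NamedTuple):
    word: str          # command word, e.g. DEFINE or MATCH
    params: List[str]  # zero or more parameters (database, strategy, word)


def parse_command(line: str) -> Command:
    """Split a command line into a command word and its parameters."""
    # shlex approximates the protocol's quoting rules; it is not an exact match.
    tokens = shlex.split(line)
    if not tokens:
        raise ValueError("empty command line")
    return Command(tokens[0].upper(), tokens[1:])


def parse_status(line: str) -> Tuple[int, str]:
    """Split a status response into its numeric code and free text."""
    code, _, text = line.partition(" ")
    return int(code), text
```

For instance, `parse_command('MATCH * prefix "natural language"')` yields the command word `MATCH` with a database, a strategy, and a word as its three parameters.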
The invention addresses basic problems that arise in providing linguistic services upon request. A multitude of different services could be requested including some that are not yet available, and the data on which a service will be requested cannot be known in advance and could be in any of a large number of languages. The changing set of available services, the possibly large set of supported languages, and the unpredictability of data make it difficult to produce a linguistic services system that remains useful over an extended time.
The invention is based on the discovery of new techniques for providing linguistic services that alleviate these problems. The new techniques involve requests for a new linguistic service, which is "new" in the sense that the techniques have made it newly available. The requests each identify the new linguistic service and indicate linguistic data on which the service is to be performed. The new techniques also involve operations that respond to requests by performing the new linguistic service on linguistic data.
Each of the new techniques relates to an executable, sometimes referred to herein as a "service executable", that can be executed in response to a request and that, when executed, performs the new linguistic service. More specifically, the new techniques relate to the production of service executables from human-readable code, sometimes referred to herein as "linguistic source code" or simply "source code".
The new techniques treat linguistic services hierarchically, allowing a programmer to write source code for a new linguistic service based on a hierarchical descendant relationship with an ancestor service for which source code already exists. The ancestor service may, for example, be a less specified linguistic service or it may be a proto-service that serves only as an ancestor of one or more linguistic services within a hierarchy. Therefore, source code for the descendant can be produced by modifying the preexisting source code for the ancestor. Then, a service executable can be produced from the source code. When executed in response to a request that identifies the new linguistic service, the service executable performs the new linguistic service on the indicated linguistic data.
The new techniques alleviate the problems described above, because they make it relatively easy to add a new linguistic service by further specifying or otherwise modifying preexisting source code for an ancestor service.
Some of the new techniques can be implemented with object-oriented programming. For example, preexisting source code in an object-oriented programming language can define a class for a proto-service, referred to herein as a "top-level service class". The top-level service class can include a service identifier whose value can identify one of the descendant linguistic services, parameters that are common to the descendants, and a default execute method that can be further specified to perform any of the descendants.
Parameters of the top-level service class can include input parameters providing information needed to obtain linguistic data on which a linguistic service is performed. For example, one input parameter, referred to herein as an "input format" parameter, could indicate the format and character set of linguistic data to be processed, thus making it relatively easy to add a new input format or a new input character set. Another input parameter, referred to herein as a "data access" parameter, could include data for accessing the linguistic data, such as the linguistic data itself, a file name, a URL, or another type of access data, thus making it relatively easy to add a new way of accessing linguistic data. A related input parameter, referred to herein as a "data position" parameter, could indicate the portion of the linguistic data to be processed, such as a starting position and a number of characters. Yet another input parameter, referred to herein as an "input language" parameter, could indicate the natural language of the linguistic data or could have a value indicating that the language is not known, making it relatively easy to take into account a new input language.
Similarly, parameters of the top-level service class can include result parameters providing information needed to return results of performing a linguistic service on linguistic data. For example, one result parameter, referred to herein as a "result format" parameter, could indicate the format and character set in which results are returned or could have a value indicating that the results should be returned as an unformatted object; this would make it relatively easy to add a new results format or a new character set for results, thus making it relatively easy for the client to handle results.
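By way of illustration only, a top-level service class with the input and result parameters described above might be sketched as follows. The class name `Service`, the field names, and the default values are hypothetical and are chosen only to mirror the parameters named in the text; `None` is assumed here to stand for "language not known" and "return an unformatted object".

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Service:
    """Hypothetical top-level service class (a proto-service)."""
    service_id: str = "proto-service"          # identifies a descendant service
    input_format: str = "text/plain; charset=utf-8"   # format and character set
    data_access: str = ""                      # the data itself, a file name, a URL...
    data_position: Optional[Tuple[int, int]] = None   # (start, number of characters)
    input_language: Optional[str] = None       # None: language not known
    result_format: Optional[str] = None        # None: return an unformatted object

    def execute(self):
        """Default execute method, to be further specified by descendants."""
        raise NotImplementedError("a descendant service must specify execute()")
```

Because every parameter common to the descendants lives in one place, a descendant only needs to add what is specific to its own service.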
Starting with preexisting source code for a top-level service class or for another ancestor service class, source code can be obtained that defines a descendant class for a linguistic service, referred to herein as a "lower-level service class". The lower-level service class can include a service identifier identifying the linguistic service provided by the class, fields for parameters that are specific to the linguistic service, and methods for responding to those parameters. For example, for a lower-level service class that provides translation services or other services that respond to linguistic data in a first natural language by providing results in a second natural language different than the first, a result language parameter could indicate the natural language of results. Also, for a lower-level service class that provides dictionary lookup services on untokenized text data, a set of module type parameters could indicate types of linguistic modules that are employed, such as a type of tokenizer, a type of morpho-syntactic analyzer, a type of syntactic disambiguator, and a type of dictionary lookup.
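By way of illustration only, lower-level service classes might be derived from an ancestor as sketched below. All names (`Service`, `TranslationService`, `DictionaryLookupService`, and the field names) are hypothetical, and the ancestor is reduced to two fields to keep the sketch short.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Service:
    """Hypothetical ancestor class, reduced to two common fields."""
    service_id: str = "proto-service"
    input_language: Optional[str] = None


@dataclass
class TranslationService(Service):
    """Descendant that returns results in a second natural language."""
    service_id: str = "translation"
    result_language: str = "en"        # service-specific parameter


@dataclass
class DictionaryLookupService(Service):
    """Descendant that performs dictionary lookup on untokenized text."""
    service_id: str = "dictionary-lookup"
    tokenizer_type: str = "default"        # module type parameters
    analyzer_type: str = "default"
    disambiguator_type: str = "default"
    lookup_type: str = "default"
```

Each descendant overrides the service identifier and adds only the fields specific to its service; the common parameters are inherited unchanged.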
A lower-level service class can have a specialized execute method for performing the new linguistic service by creating and calling associated methods of appropriate objects. For example, a data retrieval object can obtain linguistic data in accordance with the data access parameter. Then a content extraction object can extract textual content in accordance with the input format parameter. A language identification object can identify the language of the linguistic data in accordance with the input language parameter. One or more service module objects can perform the new linguistic service on the part of the linguistic data indicated by the data position parameter. Finally, if the result format parameter's value indicates a format and character set, a result conversion object can convert the results of the new linguistic service in accordance with the format and character set indicated by the result format parameter.
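By way of illustration only, the sequence of operations in a specialized execute method might be sketched as follows, with plain functions standing in for the objects named above. Every function name is hypothetical, and each step is a toy stand-in: the data access parameter is assumed to hold the text itself, language identification is a one-line heuristic, and the service performed is whitespace tokenization.

```python
import json
import re
from typing import List, Optional, Tuple


def retrieve(data_access: str) -> str:
    # Toy data retrieval: the data access parameter holds the data itself.
    return data_access


def extract(raw: str, input_format: str) -> str:
    # Toy content extraction: strip tags when the input format is HTML.
    return re.sub(r"<[^>]+>", "", raw) if input_format == "html" else raw


def identify_language(text: str, input_language: Optional[str]) -> str:
    # Use the stated language if known; otherwise apply a toy heuristic.
    if input_language is not None:
        return input_language
    return "fr" if re.search(r"\b(le|la|les)\b", text.lower()) else "en"


def tokenize(text: str, language: str) -> List[str]:
    # Toy service module: whitespace tokenization, ignoring the language.
    return text.split()


def execute(data_access: str, input_format: str = "text",
            data_position: Optional[Tuple[int, int]] = None,
            input_language: Optional[str] = None,
            result_format: Optional[str] = None):
    raw = retrieve(data_access)
    text = extract(raw, input_format)
    language = identify_language(text, input_language)
    if data_position is not None:          # (start, number of characters)
        start, length = data_position
        text = text[start:start + length]
    tokens = tokenize(text, language)
    if result_format is None:              # return an unformatted object
        return tokens
    if result_format == "json":            # toy result conversion
        return json.dumps({"language": language, "tokens": tokens})
    raise ValueError(f"unsupported result format: {result_format}")
```

Calling `execute("<p>le chat dort</p>", input_format="html", data_position=(3, 9))` runs the steps in the order described and returns the tokens of the selected portion as an unformatted list.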
A processor can accordingly respond to a request for a linguistic service by creating an instance of the lower-level service class that provides the requested service. The request can include the information necessary to create the lower-level service instance. The lower-level service instance can be transferred between machines and, when received by a server, can cause a service executable to perform the requested service in accordance with the parameters.
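By way of illustration only, transferring a lower-level service instance between machines might be sketched as follows. JSON is assumed as the wire encoding purely for the sake of the example; the names `TokenizeService`, `REGISTRY`, `to_wire`, and `from_wire` are hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class TokenizeService:
    """Hypothetical lower-level service instance carried by a request."""
    data_access: str
    input_language: Optional[str] = None
    service_id: str = "tokenize"

    def execute(self):
        return self.data_access.split()


# The server maps each service identifier to the class that provides it.
REGISTRY = {"tokenize": TokenizeService}


def to_wire(service) -> bytes:
    """Client side: encode the instance's parameters for transport."""
    return json.dumps(asdict(service)).encode("utf-8")


def from_wire(payload: bytes):
    """Server side: recreate the instance identified by the request."""
    fields = json.loads(payload.decode("utf-8"))
    cls = REGISTRY[fields["service_id"]]
    return cls(**fields)
```

On receipt, the server can call `from_wire(payload).execute()` to perform the requested service in accordance with the transferred parameters.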
The data retrieval object can create a specialized instance of an input data class. The data retrieval object can use the data access parameter to create the specialized input data instance with parameters and methods appropriate for the linguistic data being accessed. To add a capability to access a new type of linguistic data, all that is necessary is to add source code to the data retrieval object so that it can create a specialized input data instance that can access the new type of linguistic data. The methods of the specialized input data instance can be implemented to retrieve linguistic data in parts, referred to herein as "chunks".
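By way of illustration only, a data retrieval factory that creates specialized input data instances and retrieves data in chunks might be sketched as follows. The class names, the `make_input` function, and the dictionary layout of the data access parameter are hypothetical.

```python
from typing import Iterator


class InputData:
    """Hypothetical ancestor input data class."""
    def chunks(self, size: int = 4096) -> Iterator[str]:
        raise NotImplementedError


class StringInput(InputData):
    """The data access parameter holds the linguistic data itself."""
    def __init__(self, text: str):
        self.text = text

    def chunks(self, size: int = 4096) -> Iterator[str]:
        for i in range(0, len(self.text), size):
            yield self.text[i:i + size]


class FileInput(InputData):
    """The data access parameter holds a file name."""
    def __init__(self, path: str):
        self.path = path

    def chunks(self, size: int = 4096) -> Iterator[str]:
        with open(self.path, encoding="utf-8") as handle:
            while True:
                chunk = handle.read(size)
                if not chunk:
                    break
                yield chunk


def make_input(data_access: dict) -> InputData:
    """Create the specialized instance suited to the data access parameter."""
    kind, value = data_access["kind"], data_access["value"]
    if kind == "string":
        return StringInput(value)
    if kind == "file":
        return FileInput(value)
    raise ValueError(f"unsupported data access kind: {kind}")
```

Supporting a new type of linguistic data, such as a URL, would then amount to adding one more specialized class and one more branch in `make_input`.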
The content extraction object can create a specialized instance of an input data extraction class to extract textual content from the chunks. The content extraction object can use the input format parameter to create the specialized input data extraction instance with fields and methods appropriate for the chunks being retrieved. To add a capability to access linguistic data in a new format, all that is necessary is to add source code to the content extraction object so that it can create a specialized input data extraction instance that can extract textual content from chunks of linguistic data in the new format.
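By way of illustration only, a content extraction factory that selects a specialized extractor from the input format parameter might be sketched as follows. The class names and the `make_extractor` function are hypothetical. The HTML case relies on the incremental `feed` method of Python's standard `html.parser.HTMLParser`, so that markup split across chunk boundaries can still be handled.

```python
from html.parser import HTMLParser
from typing import Iterable, Iterator, List


class Extractor:
    """Hypothetical ancestor input data extraction class."""
    def extract(self, chunks: Iterable[str]) -> Iterator[str]:
        raise NotImplementedError


class PlainTextExtractor(Extractor):
    def extract(self, chunks: Iterable[str]) -> Iterator[str]:
        yield from chunks              # plain text is already textual content


class _TextCollector(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)        # keep text, drop markup


class HtmlExtractor(Extractor):
    def extract(self, chunks: Iterable[str]) -> Iterator[str]:
        parser = _TextCollector()
        for chunk in chunks:
            parser.feed(chunk)         # incomplete tags are buffered
            if parser.parts:
                yield "".join(parser.parts)
                parser.parts.clear()
        parser.close()                 # flush any buffered remainder
        if parser.parts:
            yield "".join(parser.parts)


def make_extractor(input_format: str) -> Extractor:
    """Create the specialized instance suited to the input format parameter."""
    if input_format == "text":
        return PlainTextExtractor()
    if input_format == "html":
        return HtmlExtractor()
    raise ValueError(f"unsupported input format: {input_format}")
```

Supporting a new input format would then amount to adding one more specialized extractor class and one more branch in `make_extractor`.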
The new techniques can also treat result conversion objects hierarchically, allowing a programmer to write source code for a format definition class for a new format and character set based on a hierarchical descendant relationship with an ancestor conversion class for which source code already exists. The ancestor conversion class may, for example, be a less specified class or it may be a proto-class that serves only as an ancestor of one or more conversion classes within a hierarchy. Therefore, source code for the descendant can be produced by modifying the preexisting source code for the ancestor. Then, a conversion executable can be produced from the source code. When executed in response to a request that indicates the new format and character set, the conversion executable creates an instance of the new format definition class that converts results of a linguistic service accordingly.
The result conversion object produced by the execute method of the new linguistic service can be a specialized instance of a pivot format class, which provides a representation of a document. The pivot format object can have an export method that uses a format definition object to convert the specialized pivot format instance into an object containing the result converted to the format and character set indicated by the result format parameter.
To add a capability to convert results to a new format, all that is necessary is to add source code to an ancestor format definition class to obtain a specialized format definition class that can convert to the new format.
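By way of illustration only, a pivot format object with an export method that delegates to a format definition object might be sketched as follows. All class names are hypothetical, and the pivot representation is reduced to a list of (token, lemma) pairs for the sake of the example.

```python
from dataclasses import dataclass
from typing import List, Tuple
from xml.sax.saxutils import escape


class FormatDefinition:
    """Hypothetical ancestor format definition class."""
    charset = "utf-8"

    def render(self, pivot: "Pivot") -> str:
        raise NotImplementedError

    def encode(self, pivot: "Pivot") -> bytes:
        # Apply the character set after the format has been rendered.
        return self.render(pivot).encode(self.charset)


class TabSeparatedFormat(FormatDefinition):
    def render(self, pivot: "Pivot") -> str:
        return "\n".join(f"{token}\t{lemma}" for token, lemma in pivot.entries)


class Latin1XmlFormat(FormatDefinition):
    """A new format and character set added as a descendant."""
    charset = "iso-8859-1"

    def render(self, pivot: "Pivot") -> str:
        rows = "".join(
            f"<entry><token>{escape(token)}</token>"
            f"<lemma>{escape(lemma)}</lemma></entry>"
            for token, lemma in pivot.entries
        )
        return f"<results>{rows}</results>"


@dataclass
class Pivot:
    """Toy pivot representation: (token, lemma) pairs."""
    entries: List[Tuple[str, str]]

    def export(self, fmt: FormatDefinition) -> bytes:
        return fmt.encode(self)
```

The pivot object itself does not change when a format is added; only a new descendant of `FormatDefinition` is written.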
The new techniques can also treat communication methods hierarchically, allowing a programmer to write source code for a new communication method based on a hierarchical descendant relationship with an ancestor communication class for which source code already exists. The ancestor communication class may, for example, be a less specified class or it may be a proto-class that serves only as an ancestor of one or more communication classes within a hierarchy. Therefore, source code for the descendant can be produced by modifying the preexisting source code for the ancestor. Then, a communication executable can be produced from the source code. When executed in response to a request that indicates the new communication method, the communication executable communicates accordingly.
Two proto-classes can be ancestors of other communication classes: a top-level client-side class for execution by a processor at a client machine and a top-level server-side class for execution by a processor at a server. In addition to allowing implementation of most client-server communication protocols, the new techniques can also be implemented in a standalone application by modifying the top-level client-side class to obtain a lower-level client-side class for directly providing a lower-level service instance as an input to the client processor while executing a service executable. In that case, a counterpart instance of the top-level server-side class is not necessary. The lower-level client-side instance can provide a direct link to an executable that would otherwise be executed at the server, avoiding the need for transport of requests and results over a network.
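By way of illustration only, the contrast between a client-server arrangement and a standalone arrangement might be sketched as follows. All names are hypothetical; the "network" is simulated by serializing the request to bytes and handing it to a server-side object in the same process, and requests are plain dictionaries to keep the sketch short.

```python
import json
from typing import Callable, Dict, List


def tokenize(request: dict) -> List[str]:
    # Toy service executable: whitespace tokenization.
    return request["data"].split()


EXECUTABLES: Dict[str, Callable[[dict], List[str]]] = {"tokenize": tokenize}


class ServerSide:
    """Hypothetical server-side class: decode, execute, encode."""
    def handle(self, payload: bytes) -> bytes:
        request = json.loads(payload.decode("utf-8"))
        result = EXECUTABLES[request["service_id"]](request)
        return json.dumps(result).encode("utf-8")


class ClientSide:
    """Hypothetical top-level client-side class."""
    def call(self, request: dict) -> List[str]:
        raise NotImplementedError


class LoopbackClient(ClientSide):
    """Client-server case: the request is serialized as if sent over a network."""
    def __init__(self, server: ServerSide):
        self.server = server

    def call(self, request: dict) -> List[str]:
        reply = self.server.handle(json.dumps(request).encode("utf-8"))
        return json.loads(reply.decode("utf-8"))


class DirectClient(ClientSide):
    """Standalone case: the executable is invoked in-process."""
    def call(self, request: dict) -> List[str]:
        return EXECUTABLES[request["service_id"]](request)
```

A user application written against `ClientSide.call` would obtain the same result from either descendant; only the transport differs.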
The new techniques are advantageous because they allow abstract definition of linguistic services. In addition, a new service can be added quickly and easily by providing new source code that is a modified version of preexisting source code for an ancestor service.
The following description, the drawings, and the claims further set forth these and other aspects, objects, features, and advantages of the invention.