The following application is related to the present application.
U.S. patent application entitled "SYSTEM, METHOD, AND PRODUCT FOR DYNAMICALLY PROPAGATING TRANSLATIONS IN A TRANSLATION-MEMORY SYSTEM," Ser. No. 09/085,468, naming as inventor Jonathan Clark, assigned to the assignee of the present invention and filed concurrently herewith.
1. Field of the Invention
The present invention relates generally to language translation and, more particularly, to translation-memory tools.
2. Related Art
Technology-driven industries have increasingly relied upon translation, localization and related services to bring products to the global markets. The need to quickly and efficiently create foreign language versions of products has increased dramatically as global competition increases, upgrades are developed and released more frequently, and the time in which products become obsolete decreases.
Historically, the expertise of language translators, engineers and publishers has been utilized to translate a document from a source language to a target language. More recently, advances in computer software and hardware have enabled the growth of processor-based language translation tools. Traditionally, two types of language translation tools have been available: machine-translation tools and translation-memory tools (also referred to herein as translation memory systems).
Generally, machine translation tools use natural language translation techniques to perform language translation; accordingly, they are also referred to as natural language translation tools. Machine-translation tools perform in-depth morphological, grammatical, syntactical and some semantical analysis of text written in a source language. The machine translation tool then attempts to parse the source language into a target language using extensive glossaries and a complex set of linguistic rules. However, despite the many types of machine-translation tools that so far have been developed, there are a number of limitations that have prevented machine-translation tools from being fully successful.
First, machine-translation systems are expensive to set up, operate and maintain. Furthermore, machine translation typically falls short of publication-grade quality, even when operating under optimal conditions. As a result, machine translation has proven effective only when used to translate tightly controlled input text. Producing such controlled input, however, is itself time consuming and expensive because it generally requires careful planning.
Translation memory tools are software programs that recycle existing translations provided by a human translator-operator. Conventional translation memory tools generally utilize well-known text search and replace methodologies to perform language translation. For each file in a group of files, referred to herein as a project, translation memory tools contain a database of text strings that are to be translated. The user-operator searches for a particular string throughout a text file and, for each occurrence, replaces the found string with a translated text. Generally, the translation memory tools are utilized when the input files include text having substantial duplication of text strings, such as in technical texts, or when upgrades are performed.
Although translation memory tools overcome the computational burdens of machine translation tools, there are a number of problems with translation memory tools that compromise their effectiveness in today's rapidly changing global markets. One such drawback to conventional translation memory tools is that typically they are completely manual; that is, the operator must provide all of the target language translations. Unfortunately, the time involved in providing such translations is extensive, making it difficult to translate a document efficiently and cost-effectively. To reduce this burden, some conventional translation memory tools provide techniques to address multiple occurrences of a given text string in the file being translated. However, these systems still require the operator to manually address each occurrence of the text string. In addition, the operator must perform the same functions in each of the files in a project.
Another drawback to conventional translation memory tools is that the integrity of the translation is dependent upon each operator entry. This drawback makes such systems sensitive to inconsistent translations provided by the same user-operator over time as well as by different translator-operators. Furthermore, this drawback often yields a translated text which is either incorrect, misleading or at least inconsistent with itself.
What is needed, therefore, is a translation memory tool that accurately translates text quickly and efficiently and is not sensitive to variations in the source language or to different operators.
To overcome these and other drawbacks of conventional language translation systems, the present invention, in one embodiment, is an aligner for a translation memory system. The aligner associates words, phrases, or other characters to be translated (referred to herein as "translatable source segments") with previous translations of such words, phrases, or other characters (referred to herein as "corresponding target segments"), if a previous translation exists. The translatable source segments, and their identifying attributes (such as, for example, formatting information) are extracted from one or more source files. The corresponding target segments, and their identifying attributes, are extracted from one or more target files. If a previous translation does not exist for a translatable source segment, the aligner copies the translatable source segment and its attributes to create a corresponding target segment and its attributes.
The aligner associates each extracted translatable source segment with a corresponding target segment based upon commonality of attributes of the segments. In one implementation, the aligner also associates the source and target segments based upon their relative locations in their respective files.
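The association described above can be illustrated with a minimal sketch. It is not the disclosed implementation: the `Segment` type, the tuple representation of attributes, and the `align` function are hypothetical names chosen for illustration. The sketch assumes that segments sharing the same attributes are paired in file order (their relative location), and that a source segment with no counterpart is copied to form its own target segment.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    text: str
    attributes: tuple  # hypothetical identifying attributes, e.g. ("menu",)


def align(source_segments, target_segments):
    """Pair each source segment with a target segment sharing its attributes."""
    # Queue the target segments per attribute tuple, preserving file order.
    by_attributes = defaultdict(deque)
    for target in target_segments:
        by_attributes[target.attributes].append(target)

    pairs = []
    for source in source_segments:
        queue = by_attributes[source.attributes]
        if queue:
            # Same attributes: take the earliest unpaired target (relative location).
            target = queue.popleft()
        else:
            # No previous translation: copy the source text and attributes.
            target = Segment(source.text, source.attributes)
        pairs.append((source, target))
    return pairs
```

For example, aligning the source segments "File" and "Open" (both with menu attributes) and "Help" (a button) against the targets "Fichier" and "Ouvrir" (both menu) pairs the first two in order and leaves "Help" paired with a copy of itself, ready to be translated.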
In one embodiment, the aligner stores each translatable source segment and its attributes with its corresponding target segment in a source-target pair record of a source-target pair database. Each source-target pair record may include a propagation flag identifying whether the corresponding target segment stored in the source-target pair record is to be propagated to other occurrences of the associated translatable source segment in the source-target pair database. Each source-target pair record may also include a pointer to a page of an occurrence book having pages, wherein each page includes pointers to a common translatable source-segment in records of the source-target pair database.
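One possible shape for such a record, and for the occurrence book it points into, is sketched below. The field names (`propagate`, `page`), the list-of-lists representation of the book, and both functions are assumptions made for illustration; the text above does not prescribe a layout, and the propagation behavior shown is only one plausible reading of the propagation flag.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class PairRecord:
    source: str
    target: str
    propagate: bool = False     # propagation flag
    page: Optional[int] = None  # index of this source segment's occurrence-book page


def build_occurrence_book(database: List[PairRecord]) -> List[List[int]]:
    """Create one page per distinct source segment; each page lists the
    indices of the records that hold that source segment."""
    book: List[List[int]] = []
    page_of: Dict[str, int] = {}
    for index, record in enumerate(database):
        if record.source not in page_of:
            page_of[record.source] = len(book)
            book.append([])
        record.page = page_of[record.source]
        book[record.page].append(index)
    return book


def propagate(database: List[PairRecord], book: List[List[int]], index: int) -> None:
    """Copy the target of record `index` to every occurrence of the same
    source segment, but only if that record's propagation flag is set."""
    record = database[index]
    if not record.propagate or record.page is None:
        return
    for other in book[record.page]:
        database[other].target = record.target
```

Keeping the occurrences on a shared page means that a propagation touches only the records listed there, rather than scanning the whole database for matching source text.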
In one embodiment, the aligner assigns a unique identifier to each translatable source segment. The aligner generates such unique identifier based on the attributes of the translatable source segment. The aligner may also generate such unique identifier based on the location of each translatable source segment in its source file. The aligner may associate each translatable source segment with one corresponding target segment by matching their unique identifiers.
The aligner may include a project identifier that selects the source and target files from files in a file system. The file system may be local, or it may be remote. In one embodiment, the project identifier identifies legacy files, if any, associated with one or more of the source and target files. Generally, legacy files refer to previous translations of source files including translatable source segments. In one implementation, legacy files refer to a source file and a corresponding target file wherein one or more translatable source segments have been translated and stored in a corresponding target segment.
In one embodiment, the aligner includes a parser-extractor that extracts each translatable source segment and its attributes from a source file, and also extracts each corresponding target segment from a target file. The parser-extractor may include a syntactic customizer that generates a customized syntactical description of the format of a file type that is the file type of a source file. In one implementation, the customized syntactical description includes a syntactic rule for identifying the source segments in the source files and the target segments in the target files. The syntactic rule may be in a BNF form. The customized syntactical description may also include a tagged syntactical element for uniquely identifying source and target segments. In one implementation, the tagged syntactical element includes a tag that is an extension to a conventional BNF notation.
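As a rough illustration of a tagged syntactic rule, the sketch below uses a regular expression with named groups in place of a BNF grammar. The resource-file format, the rule shown in the comment, and the tag names `id` and `text` are all hypothetical; the actual tag extension to BNF notation is not specified in this section.

```python
import re

# Hypothetical file format: one entry per line, e.g.  IDS_OPEN = "Open file"
# A BNF-like rule with illustrative tags might read:
#     entry ::= <id:identifier> "=" <text:string>
# Here the tags are approximated by named groups of a regular expression.
ENTRY_RULE = re.compile(r'^\s*(?P<id>[A-Za-z_]\w*)\s*=\s*"(?P<text>[^"]*)"\s*$')


def extract_segments(lines):
    """Return the tagged parts of every line that matches the entry rule."""
    segments = []
    for line_number, line in enumerate(lines, start=1):
        match = ENTRY_RULE.match(line)
        if match:
            segments.append({
                "id": match.group("id"),      # identifying attribute
                "text": match.group("text"),  # translatable segment
                "line": line_number,          # location in the file
            })
    return segments
```

Because the same rule can be applied to a source file and to its target file, the value captured under the `id` tag gives both sides a common attribute on which segments can later be matched.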
In one embodiment, the parser-extractor of the aligner includes means for parsing the source files to generate the translatable source segments and their attributes; means for parsing the target files to generate the corresponding target segments and their attributes; means for extracting the translatable source segments and their attributes; means for extracting the corresponding target segments and their attributes; means for storing the translatable source segments and their attributes in a source segment and attribute list; and means for storing the corresponding target segments and their attributes in a target segment and attribute list. The parser-extractor may also include means for identifying a pre-existing target file corresponding to each source file, and means for generating a target file when the pre-existing target file does not exist.
In one embodiment, the parser-extractor also includes a conflict resolver that determines whether the attribute identifier of each translatable source segment and each corresponding target segment is a unique attribute identifier and, if not, assigns a unique attribute identifier. In some implementations, the unique attribute identifier includes hashed representations of identifying attributes of each translatable source segment and each corresponding target segment. In some implementations, the parser-extractor employs morpho-syntactic analysis to identify the source and target segments.
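A conflict resolver of this kind might be sketched as follows. The use of SHA-1, the truncation to twelve hexadecimal characters, and the ordinal suffix appended to repeated identifiers are assumptions made for illustration; the text above states only that identifiers may include hashed representations of identifying attributes.

```python
import hashlib
from collections import Counter


def attribute_id(attributes):
    """Hash a sequence of attribute strings into a short identifier."""
    joined = "\x1f".join(attributes)  # unit separator keeps fields distinct
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()[:12]


def resolve_conflicts(attribute_lists):
    """Return one identifier per segment, made unique within the list.

    The first segment with a given hash keeps the plain hash; later
    segments with the same hash get an ordinal suffix (-1, -2, ...).
    """
    seen = Counter()
    identifiers = []
    for attributes in attribute_lists:
        base = attribute_id(attributes)
        ordinal = seen[base]
        seen[base] += 1
        identifiers.append(base if ordinal == 0 else f"{base}-{ordinal}")
    return identifiers
```

If the source and target files list their segments in the same order, running the resolver separately on each file gives the n-th duplicate on both sides the same suffix, so the resolved identifiers still match.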
In one embodiment, the invention is a method for associating translatable source segments extracted from one or more source files having a first format with corresponding target segments extracted from one or more target files having the first format. The method includes the steps of: (1) determining identifying attributes of each translatable source segment; (2) generating a unique attribute identifier for each translatable source segment based upon its identifying attributes; (3) determining identifying attributes of each corresponding target segment; (4) generating a unique attribute identifier for each corresponding target segment based upon its identifying attributes; (5) comparing the unique attribute identifiers of the translatable source segments and corresponding target segments; and (6) associating a translatable source segment with a corresponding target segment when they have the same unique attribute identifier.
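The six steps can be traced in a short sketch. The dictionary layout of a segment and the choice of a sorted tuple of attribute items as the identifier are assumptions for illustration, and the sketch presumes that identifiers are already unique within each file.

```python
def unique_id(attributes):
    """Derive an identifier from a dict of attributes, independent of key order."""
    return tuple(sorted(attributes.items()))


def associate(source_segments, target_segments):
    """Return (source text, target text) pairs whose identifiers are equal."""
    # Steps (1)-(2): read each source segment's attributes and derive its identifier.
    sources = {unique_id(seg["attributes"]): seg for seg in source_segments}
    # Steps (3)-(4): do the same for each target segment.
    targets = {unique_id(seg["attributes"]): seg for seg in target_segments}
    # Steps (5)-(6): compare identifiers and associate the segments that match.
    return [
        (sources[key]["text"], targets[key]["text"])
        for key in sources
        if key in targets
    ]
```

For instance, a source segment "Open" with attributes `{"type": "menu", "id": "101"}` is associated with a target segment "Ouvrir" carrying the same attributes, while a source segment whose attributes appear in no target segment is left unassociated.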
Step (1) of such method may include the steps of: (a) identifying a first type of file of the first format; and (b) searching for identifying attributes based on a syntactical description of the first type of file. In one implementation, step (1) may include the steps of: (a) identifying a first type of file of the first format; (b) customizing a syntactical description of the first type of file; and (c) searching for identifying attributes based on the customized syntactical description. In this implementation, step (1)(b) may include the step of tagging a syntactical element with a tag that is an extension to a conventional BNF notation.
Step (2) of such method may include the step of further generating the unique attribute identifier of each translatable source segment based upon its location in a source file. Step (4) may include the step of further generating the unique attribute identifier of each corresponding target segment based upon its location in a target file.
Such method may also include the step of storing each translatable source segment and its attributes with its corresponding target segment in a source-target pair record of a source-target pair database. Also, such method may include the step of storing in each source-target pair record a propagation flag identifying whether the corresponding target segment stored in the source-target pair record is to be propagated to the corresponding target field of other occurrences of the associated translatable source segment in the source-target pair database. Such method may further include the step of storing in each source-target pair record a pointer to a page of an occurrence book comprising pages, each page comprising pointers to the same translatable source-segment in records of the source-target pair database.
In one embodiment, the invention is a computer system having a central processing unit (CPU), an operating system, a memory unit, and an aligner. The aligner cooperates with the CPU and the operating system to associate translatable source segments extracted from one or more source files having a first format with corresponding target segments extracted from one or more target files having the first format, such association being based upon commonality of attributes of the segments. In one implementation, such computer system includes means for determining identifying attributes of each translatable source segment; means for generating a unique attribute identifier for each translatable source segment based upon its identifying attributes; means for determining identifying attributes of each corresponding target segment; means for generating a unique attribute identifier for each corresponding target segment based upon its identifying attributes; means for comparing the unique attribute identifiers of the translatable source segments and corresponding target segments; means for associating a translatable source segment with a corresponding target segment when they have the same unique attribute identifier; means for storing each translatable source segment and its attributes with its corresponding target segment in a source-target pair record of a source-target pair database; means for storing in each source-target pair record a propagation flag identifying whether the corresponding target segment stored in the source-target pair record is to be propagated to other occurrences of the associated translatable source segment in the source-target pair database; and means for storing in each source-target pair record a pointer to a page of an occurrence book comprising pages, each page comprising pointers to the same translatable source-segment in records of the source-target pair database.
In one embodiment, the invention is storage media that contains software that, when executed on an appropriate computing system having a CPU, an operating system, and a memory unit, performs a method to associate translatable source segments extracted from one or more source files having a first format with corresponding target segments extracted from one or more target files having the first format, such association being based upon commonality of attributes of the segments. Such method includes the steps of: (1) determining identifying attributes of each translatable source segment; (2) generating a unique attribute identifier for each translatable source segment based upon its identifying attributes; (3) determining identifying attributes of each corresponding target segment; (4) generating a unique attribute identifier for each corresponding target segment based upon its identifying attributes; (5) comparing the unique attribute identifiers of the translatable source segments and corresponding target segments; (6) associating a translatable source segment with a corresponding target segment when they have the same unique attribute identifier; storing each translatable source segment and its attributes with its corresponding target segment in a source-target pair record of a source-target pair database; storing in each source-target pair record a propagation flag identifying whether the corresponding target segment stored in the source-target pair record is to be propagated to other occurrences of the associated translatable source segment in the source-target pair database; and storing in each source-target pair record a pointer to a page of an occurrence book comprising pages, each page comprising pointers to the same translatable source-segment in records of the source-target pair database.
In one embodiment, the invention is a computer program product for use with an appropriate computing system having a CPU and a memory unit. The computer program product includes a computer usable medium having embodied therein computer readable program code method steps. Such steps associate translatable source segments extracted from one or more source files having a first format with corresponding target segments extracted from one or more target files having the first format, such association being based upon commonality of attributes of the segments. The steps may include: (1) determining identifying attributes of each translatable source segment; (2) generating a unique attribute identifier for each translatable source segment based upon its identifying attributes; (3) determining identifying attributes of each corresponding target segment; (4) generating a unique attribute identifier for each corresponding target segment based upon its identifying attributes; (5) comparing the unique attribute identifiers of the translatable source segments and corresponding target segments; and (6) associating a translatable source segment with a corresponding target segment when they have the same unique attribute identifier.
Significantly, the present invention enables the translation memory tool to recycle existing translations (performed by human or machine translators) in projects where substantial duplication exists and when upgrades are performed. Advantageously, the invention may operate upon any known, or to be developed, type of file, file format, or character format.