The present invention relates to processing information content, and more particularly, to combining and/or separating segments of this content to simplify and otherwise facilitate translation and other processing functions associated with the content. Over the past few decades, opportunities for international relationships have expanded at a staggering rate. Many factors have contributed to this expansion: improved transportation capabilities, advances in communication and media technologies, opening of once inaccessible cultures, among others. More recently, the Internet (the World Wide Web, in particular) has provided seemingly unlimited access to international audiences. The Internet represents a massive global business opportunity, and has provided the means for a wide range of businesses to deploy a multilingual and multicultural marketing presence, thereby increasing revenue, improving customer loyalty and reinforcing brand recognition.
As information becomes available globally, the role of translators has shifted away from simple transcription of text into a target language. Translators have always had to pay close attention to the attributes and linguistic idiosyncrasies of the target culture, as well as understand and adapt to these differences. Now, however, translators must also ensure the timely deployment of the translated content to the designated site. Translation can be made more efficient with greater flexibility in software functionality and the ability to save previous translations for future use. Traditionally, translators worked with hard copy documents, from which they had the flexibility to translate content at any suitable level. Thus, translators had the ability to look at an entire document and translate it without confines. The increased need for efficient content translation has motivated numerous companies to develop tools that automate at least part of the translation process.
To increase the overall speed of content translation, tools have been developed to save translations in some type of memory (referred to herein as "translation memory" or "TM"), so that the tool can make automatic substitutions, and the translator will not have to consider further instances of those translations. The TM provides a record of pairs of units of translation that have already been translated. A "unit of translation" is a segment of content that has been delineated by any of several criteria, as is discussed in more detail herein. Each associated pair in the TM includes a unit of translation from the content in the source language (i.e., the language of the content that is to be translated), and the corresponding translation unit from content in the target language (i.e., the language into which the source content is being translated). In order to populate the TM, prior art translation methods segment content into sentences (or other syntactic units, e.g., words, phrases, etc.) based on predetermined criteria so that the translator can focus on translating one sentence (or other syntactic unit) at a time.
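The pairing of source and target units described above can be pictured as a simple lookup table. The following is a minimal sketch only, not the implementation of any particular tool: the class name `TranslationMemory`, its method names, and the use of exact-match lookup are all assumptions made for illustration.

```python
from typing import Dict, List, Optional


class TranslationMemory:
    """Hypothetical in-memory TM: maps a source unit to its stored target unit."""

    def __init__(self) -> None:
        self._pairs: Dict[str, str] = {}

    def store(self, source_unit: str, target_unit: str) -> None:
        # Record one associated pair (source unit -> target unit).
        self._pairs[source_unit.strip()] = target_unit.strip()

    def lookup(self, source_unit: str) -> Optional[str]:
        # Exact-match lookup only (an assumption); fuzzy matching is out of scope here.
        return self._pairs.get(source_unit.strip())

    def pretranslate(self, source_units: List[str]) -> List[Optional[str]]:
        # Substitute stored translations; None marks units still to be translated.
        return [self.lookup(unit) for unit in source_units]


tm = TranslationMemory()
tm.store("Buongiorno.", "Good morning.")
result = tm.pretranslate(["Buongiorno.", "Come stai?"])
```

In this sketch, a unit that has already been translated is substituted automatically, while an unseen unit is left for the translator.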
However, differences between the source language and the target language create difficulties in translating directly from one language to another within the constraints of the particular segments chosen. Such differences may include, but are not limited to, differences in grammatical structure, differences in idiomatic expressions, and punctuation differences. Further, segments that are spatially adjacent in the source document may not necessarily be best suited as adjacent in the target content. Content generally cannot be translated word for word, sentence for sentence, paragraph for paragraph, because of these language differences. Another consideration is that competent, efficient translation is typically not deterministic. For example, three translators operating on the same content may well produce three different translations, each of which would be technically correct. Any type of segmentation tool that segments the content based on a rigid set of criteria will force a translator to approach translation of the content on a word for word (etc.) basis.
Flexibility in content segmentation is important because translators must be able to account for the differences in language structures. For instance, translating content sentence by sentence may populate a translation memory with more specific entries. Storing more specific entries in translation memory is useful because doing so increases the likelihood that future translation instances will make use of those entries. However, as described herein, a sentence-to-sentence translation may not be accurate, depending on the languages being used in the translation. For example, the following sentences in Italian:
Per quanto riguarda la Banca Centrale Europea, un euro debole può essere un problema soltanto se aumenta l'inflazione. Però, a 2.3%, l'inflazione nella zona euro è ancora abbastanza modesta.
would be translated as a single sentence in English:
Yet as far as the ECB is concerned, a weak euro is only really a problem if it pushes up inflation; and at 2.3%, inflation in the euro zone is still rather modest. (The Economist, Sep. 23-29, 2000, p. 89)
On the other hand, although translating an entire paragraph as a unit may be more accurate, it can be inefficient for translators because doing so will populate the translation memory with entries that are unlikely to be used again.
An additional problem with content segmentation is determining the sentence boundaries. Typically, a period denotes a sentence end. Yet, if a word within a sentence is abbreviated and uses a period (e.g., "Mr."), the period following the abbreviation could be interpreted as a sentence end and the sentence would thus be segmented at that point. Likewise, some languages such as Thai do not even use period punctuation.
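The abbreviation problem can be illustrated with a short sketch. This is one common mitigation, not a method taken from the present disclosure: the abbreviation list `ABBREVIATIONS` and the function `split_sentences` are hypothetical, and the approach still fails for languages that do not mark sentence ends with punctuation.

```python
import re
from typing import List

# Hypothetical abbreviation list; a real list would be language-specific.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "e.g.", "i.e."}


def split_sentences(text: str) -> List[str]:
    """Split on terminal punctuation, skipping tokens that are known abbreviations."""
    sentences: List[str] = []
    current: List[str] = []
    for token in text.split():
        current.append(token)
        ends_with_stop = re.search(r"[.!?]$", token) is not None
        if ends_with_stop and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        # Trailing text without terminal punctuation becomes its own segment.
        sentences.append(" ".join(current))
    return sentences


parts = split_sentences("Mr. Rossi arrived late. The meeting had started.")
```

A splitter that treated every period as a sentence end would instead cut the first sentence after "Mr."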
It is an object of the present invention to substantially overcome the above-identified disadvantages and drawbacks of the prior art.
The present invention provides a method of and system for splitting and merging blocks of information content (e.g., textual blocks) so as to simplify and expedite a translator's task in converting content from one language to another. The method and system of the present invention are referred to herein, in general, as "Split/Merge." The textual information to be translated from one language to another is referred to herein as "content." The Split/Merge method and system allow a user (i.e., a translator) to decide, in real time, the level at which he or she wishes to translate content. The translator has the ability to "split" a paragraph into separate sentences, allowing for individual translation of each sentence. Thus, the translation memory contains entries at the sentence level, which are more likely to be repeated than entire paragraphs. In addition, the translator can "merge" selected sentences together to form a single segment for translation. Furthermore, the translator can "merge" all sentences of a paragraph into a single textual "chunk," as well as merge all of the paragraphs into a larger textual "chunk." This split/merge functionality provides flexibility for source material that is not suitable for sentence-by-sentence translation.
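The split and merge operations described above can be sketched as two small list operations. This is an illustrative sketch under stated assumptions, not the Split/Merge implementation itself: the function names `split_block` and `merge_segments` are hypothetical, and splitting on the literal delimiter ". " is a deliberate simplification.

```python
from typing import List


def split_block(paragraph: str, delimiter: str = ". ") -> List[str]:
    """Split a paragraph into sentence-level segments (naive, delimiter-based)."""
    pieces = paragraph.split(delimiter)
    # Restore the period removed by split() on every piece except the last.
    return [p + "." for p in pieces[:-1]] + [pieces[-1]]


def merge_segments(segments: List[str], start: int, end: int) -> List[str]:
    """Merge segments[start:end] into one segment, leaving the others unchanged."""
    if end - start < 2:
        return list(segments)  # nothing to merge
    merged = " ".join(segments[start:end])
    return segments[:start] + [merged] + segments[end:]


paragraph = "One. Two. Three."
sentences = split_block(paragraph)              # sentence-level units
partly_merged = merge_segments(sentences, 0, 2) # first two sentences as one unit
chunk = merge_segments(sentences, 0, 3)         # whole paragraph as one chunk
```

Because merging operates on an arbitrary range, the translator in this sketch could choose the granularity per passage rather than once for the whole document.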
The utility of the Split/Merge invention may be exploited in a translation system such as Idiom's WorldServer. In general, WorldServer is a Web-based application that enables enterprises to manage their content while leveraging established Web architecture, content management and workflow systems. A translator uses WorldServer to determine what content he or she needs to translate. The translator can either export the content needing translation to a third party editing tool, or use the Translation Workbench to perform the actual translation. A translator can be an individual contributor, including users that are adapting but not translating content and reviewers who review content.
The Split/Merge feature of the present invention provides value for translators by giving them greater flexibility in deciding how to segment content before performing the translation. In addition, increased flexibility in segmentation will populate the TM with more reusable entries.
The foregoing and other objects are achieved by the invention which in one aspect comprises a method of identifying one or more source units of translation in a block of source content, so as to segment the block of content into the one or more source units of translation. The method includes selecting one or more delineating characteristics of the source content in addition to lexical characteristics. The method further includes determining instances of the delineating characteristics in the block of source content, and identifying one or more pairs of the instances within the text. The method also includes, for each pair of instances of the delineating characteristics, associating a first instance of the pair with a first boundary of a source unit of translation, and associating a second instance of the pair with a second boundary of the source unit of translation.
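The steps of this aspect (determining instances, pairing them, and associating each pair with the two boundaries of a unit) can be sketched as follows. This is a minimal sketch assuming the delineating characteristic can be expressed as a regular expression; the function names and the choice to treat the start and end of the block as instances are assumptions for illustration.

```python
import re
from typing import List, Tuple


def find_instances(content: str, pattern: str) -> List[int]:
    """Return offsets of delineating instances, including block start and end."""
    offsets = [0] + [m.end() for m in re.finditer(pattern, content)]
    if offsets[-1] != len(content):
        offsets.append(len(content))
    return offsets


def units_from_pairs(content: str, pattern: str) -> List[Tuple[int, int]]:
    """Pair consecutive instances; each pair gives (first boundary, second boundary)."""
    offsets = find_instances(content, pattern)
    return [(a, b) for a, b in zip(offsets, offsets[1:]) if content[a:b].strip()]


content = "Alpha; beta. Gamma"
boundaries = units_from_pairs(content, r"[.;]\s*")
units = [content[a:b].strip() for a, b in boundaries]
```

Swapping the pattern changes the delineating characteristic without changing the pairing logic.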
Another embodiment of the invention further includes identifying one or more target units of translation in a block of target content, and assigning associations among the source units of translation in the block of source content and the target units of translation in the block of target content.
Another embodiment of the invention further includes translating content in the source units of translation to the associated target units of translation.
In another embodiment of the invention, the delineating characteristics include syntactic characteristics. The method further includes determining pairs of instances of syntactic characteristics of the source content.
In another embodiment of the invention, the delineating characteristics include formatting characteristics. The method further includes determining pairs of instances of formatting characteristics of the source content.
In another embodiment of the invention, the formatting characteristics include HTML code markers.
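As an illustration of HTML code markers serving as delineating characteristics, the sketch below treats a pair of opening and closing block-level tags as the two boundaries of a unit. It uses Python's standard `html.parser` module; the class name `MarkerSegmenter` and the tag set `BLOCK_TAGS` are assumptions chosen for illustration, and nested block-level tags are not handled.

```python
from html.parser import HTMLParser
from typing import List

# Assumed set of tags treated as unit boundaries; chosen for illustration only.
BLOCK_TAGS = {"p", "li", "h1", "h2", "td"}


class MarkerSegmenter(HTMLParser):
    """Collect the text between each opening/closing pair of block-level tags."""

    def __init__(self) -> None:
        super().__init__()
        self.units: List[str] = []
        self._buffer: List[str] = []
        self._depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._depth += 1  # first boundary of a unit

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self._depth > 0:
            self._depth -= 1  # second boundary of the unit
            text = "".join(self._buffer).strip()
            if text:
                self.units.append(text)
            self._buffer = []

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


segmenter = MarkerSegmenter()
segmenter.feed("<h1>Title</h1><p>First <b>bold</b> paragraph.</p><p>Second.</p>")
```

Inline tags such as `<b>` are not in the assumed tag set, so their text stays inside the enclosing unit.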
In another embodiment of the invention, the delineating characteristics include conceptual characteristics. The method further includes determining pairs of instances of conceptual characteristics of the source content.
In another embodiment of the invention, the conceptual characteristics include spatial adjacency.
In another embodiment of the invention, the delineating characteristics include sound-based characteristics. The method further includes determining pairs of instances of sound-based characteristics of the source content.
In another embodiment of the invention, the sound-based characteristics include voice inflections.
In another embodiment of the invention, the delineating characteristics include one or more markers manually inserted by a user. The method further includes determining pairs of instances of markers within the source content.
Another embodiment of the invention further includes translating the one or more source units of translation into a target language so as to form target units of translation, and merging the target units of translation into one or more blocks of target content.
In another embodiment of the invention, the source units of translation are characterized by a first adjacency pattern. The method further includes merging the target units of translation so as to follow the first adjacency pattern.
In another embodiment of the invention, the source units of translation are characterized by a first adjacency pattern. The method further includes merging the target units of translation so as to follow a second adjacency pattern different from the first adjacency pattern.
In another embodiment of the invention, at least one of the source units of translation corresponds with two or more target units of translation.
In another embodiment of the invention, two or more of the source units of translation correspond with a single target unit of translation.
In another embodiment of the invention, each one of the source units of translation corresponds with a single target unit of translation.
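The three correspondence patterns described above (one source unit to several target units, several source units to one target unit, and one to one) can be represented as associations between groups of unit indices. The following is a hypothetical sketch; the `Association` type and the `classify` function are names introduced here, and both index lists are assumed to be non-empty.

```python
from typing import List, Tuple

# Each association links a group of source-unit indices to a group of
# target-unit indices (hypothetical representation; lists assumed non-empty).
Association = Tuple[List[int], List[int]]


def classify(association: Association) -> str:
    """Name the correspondence pattern of one source/target association."""
    source_ids, target_ids = association
    if len(source_ids) == 1 and len(target_ids) == 1:
        return "one-to-one"
    if len(source_ids) == 1:
        return "one-to-many"
    if len(target_ids) == 1:
        return "many-to-one"
    return "many-to-many"


# Two Italian sentences rendered as a single English sentence.
italian_to_english: Association = ([0, 1], [0])
pattern = classify(italian_to_english)
```

The Italian example given earlier, in which two source sentences become a single English sentence, falls under the many-to-one pattern.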
Another embodiment of the invention further includes merging the target units of translation into a hierarchical structure.
Another embodiment of the invention further includes providing one or more predetermined hierarchy criteria. The characteristics of the hierarchical structure are defined by the predetermined hierarchy criteria.
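One way to picture merging under predetermined hierarchy criteria is to group translated units by a caller-supplied criterion. This is a two-level sketch only, under stated assumptions: the `Unit` layout (a paragraph index plus translated text), the function `merge_hierarchically`, and the use of a grouping function as the hierarchy criterion are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical unit layout: (paragraph index, translated text).
Unit = Tuple[int, str]


def merge_hierarchically(units: List[Unit],
                         criterion: Callable[[Unit], int]) -> Dict[int, List[str]]:
    """Group target units under parent nodes chosen by the hierarchy criterion."""
    tree: Dict[int, List[str]] = {}
    for unit in units:
        # The criterion decides which parent node this unit belongs to.
        tree.setdefault(criterion(unit), []).append(unit[1])
    return tree


target_units: List[Unit] = [(0, "A."), (0, "B."), (1, "C.")]
by_paragraph = merge_hierarchically(target_units, lambda unit: unit[0])
```

A different criterion function would produce a different hierarchy from the same target units.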
In another aspect, the invention comprises a system for computer-assisted identification of one or more source units of translation in a block of source content, so as to segment the block of content into the one or more source units of translation. The system includes a user interface for allowing a user to select one or more delineating characteristics of the source content in addition to lexical characteristics. The system further includes a content processor for determining instances of the delineating characteristics in the block of source content, and identifying one or more pairs of the instances. The system also includes, for each pair of instances of the delineating characteristics, a segment processor for associating a first instance of the pair with a first boundary of a source unit of translation. The segment processor also associates a second instance of the pair with a second boundary of the source unit of translation.
In another embodiment of the invention, the content processor further identifies one or more target units of translation in a block of target content. The content processor also assigns associations among the source units of translation in the block of source content and the target units of translation in the block of target content.
In another embodiment of the invention, the content processor further translates content in the source units of translation to the associated target units of translation.
In another embodiment of the invention, the delineating characteristics include syntactic characteristics, and the content processor further determines pairs of instances of syntactic characteristics of the source content.
In another embodiment of the invention, the delineating characteristics include document formatting characteristics, and the content processor further determines pairs of instances of document formatting characteristics of the source content.
In another embodiment of the invention, the document formatting characteristics include HTML code.
In another embodiment of the invention, the delineating characteristics include conceptual characteristics, and the content processor further determines pairs of instances of conceptual characteristics of the source content.
In another embodiment of the invention, the conceptual characteristics include spatial adjacency.
In another embodiment of the invention, the delineating characteristics include sound-based characteristics, and the content processor further determines pairs of instances of sound-based characteristics of the source content.
In another embodiment of the invention, the sound-based characteristics include voice inflections.
In another embodiment of the invention, the delineating characteristics include one or more markers manually inserted by a user, and the content processor further determines pairs of instances of markers within the source content.
In another embodiment of the invention, the segment processor further translates the source units of translation into a target language so as to form target units of translation, and merges the target units of translation into one or more blocks of target content.
In another embodiment of the invention, the source units of translation are characterized by a first adjacency pattern, and the segment processor further merges the target units of translation so as to follow the first adjacency pattern.
In another embodiment of the invention, the source units of translation are characterized by a first adjacency pattern, and the segment processor further merges the target units of translation so as to follow a second adjacency pattern different from the first adjacency pattern.
In another embodiment of the invention, at least one of the source units of translation corresponds with two or more target units of translation.
In another embodiment of the invention, two or more of the source units of translation correspond with a single target unit of translation.
In another embodiment of the invention, each one of the source units of translation corresponds with a single target unit of translation.
In another embodiment of the invention, the segment processor further merges the target units of translation into a hierarchical structure.
In another embodiment of the invention, the segment processor further receives one or more predetermined hierarchy criteria, and the characteristics of the hierarchical structure are defined by the predetermined hierarchy criteria.
In another aspect, the invention comprises a system for computer-assisted identification of one or more source units of translation in a block of source content, so as to segment the block of content into the one or more source units of translation. The system includes means for allowing a user to select one or more delineating characteristics of the source content in addition to lexical characteristics. The system also includes means for determining one or more pairs of instances of the delineating characteristics in the block of source content. The system further includes, for each pair of instances of the delineating characteristics, means for associating a first instance of the pair with a first boundary of a source unit of translation, and means for associating a second instance of the pair with a second boundary of the source unit of translation.
In another aspect, the invention comprises a method of dynamically selecting one or more segmentation criteria used to identify source units of translation in a block of source content, wherein the segmentation criteria identify delineation characteristics of the source content for defining boundaries of the source units of translation. The method includes providing two or more source segmentation criteria associated with the block of source content. The method also includes selecting one of the source segmentation criteria from the two or more segmentation criteria as an initial source criterion, and using the initial source criterion for defining boundaries of the source units of translation. The method further includes dynamically selecting, as a function of one or more external factors, subsequent source segmentation criteria from the two or more source segmentation criteria, as the boundaries of the source units of translation are defined.
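The dynamic selection described in this aspect can be sketched as a loop that defines one unit at a time and re-selects the criterion before each subsequent unit. This is an illustrative sketch only: the two criteria in `CRITERIA`, the function `segment_dynamically`, and the callback `choose_next` (which stands in for the external factors) are hypothetical, as is the rule that content beginning with "Note:" is kept together up to the next blank line.

```python
import re
from typing import Callable, Dict, List

# Two hypothetical criteria: split at sentence ends, or at blank lines.
CRITERIA: Dict[str, str] = {
    "sentence": r"(?<=[.!?])\s+",
    "paragraph": r"\n\s*\n",
}


def segment_dynamically(content: str, initial: str,
                        choose_next: Callable[[str, str], str]) -> List[str]:
    """Define units one at a time, re-selecting the criterion after each unit."""
    units: List[str] = []
    criterion, rest = initial, content.strip()
    while rest:
        # Define one unit with the currently selected criterion.
        parts = re.split(CRITERIA[criterion], rest, maxsplit=1)
        units.append(parts[0].strip())
        rest = parts[1].strip() if len(parts) > 1 else ""
        # Re-select the criterion from an external factor before the next unit.
        criterion = choose_next(criterion, rest)
    return units


def choose_next(current: str, remaining: str) -> str:
    # Hypothetical external factor: keep "Note:" passages together as one unit.
    return "paragraph" if remaining.startswith("Note:") else "sentence"


text = "First point. Second point. Note: keep this. As one unit.\n\nLast one."
units = segment_dynamically(text, "sentence", choose_next)
```

In practice the callback could instead consult translator input or markup in the source content, which are among the external factors named in the embodiments.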
Another embodiment of the invention further includes providing two or more target segmentation criteria associated with a block of target content. The method further includes selecting one of the target segmentation criteria from the two or more target segmentation criteria as an initial target criterion, and using the initial target criterion for defining boundaries of the target units of translation. The method also includes dynamically selecting, as a function of one or more external factors, subsequent target segmentation criteria from the two or more target segmentation criteria, as the boundaries of the target units of translation are defined. The method also includes assigning associations among the source units of translation in the block of source content and the target units of translation in the block of target content.
In another embodiment of the invention, the one or more external factors include the associations among the source units of translation in the block of source content and the target units of translation in the block of target content.
In another embodiment of the invention, the one or more external factors include input from a user translating from the source units of translation to the target units of translation.
In another embodiment of the invention, the one or more external factors include data relating to characteristics of the source content.
In another embodiment of the invention, the data relating to characteristics of the source content includes HTML code.
In another aspect, the invention comprises a system for computer-assisted dynamic selection of one or more segmentation criteria used to identify source units of translation in a block of source content. The segmentation criteria identify delineation characteristics of the source content for defining boundaries of the source units of translation. The system includes a user interface for providing two or more source segmentation criteria associated with the block of source content, and for selecting one of the source segmentation criteria from the two or more segmentation criteria as an initial source criterion. The system also includes a content processor for using the initial source criterion for defining boundaries of the source units of translation. The system further includes a segment processor for dynamically selecting, as a function of one or more external factors, subsequent source segmentation criteria from the two or more source segmentation criteria, as the boundaries of the source units of translation are defined.
In another embodiment of the invention, the user interface further provides two or more target segmentation criteria associated with a block of target content, and selects one of the target segmentation criteria from the two or more target segmentation criteria as an initial target criterion. The content processor further uses the initial target criterion for defining boundaries of the target units of translation. The segment processor further dynamically selects, as a function of one or more external factors, subsequent target segmentation criteria from the two or more target segmentation criteria, as the boundaries of the target units of translation are defined. The segment processor further assigns associations among the source units of translation in the block of source content and the target units of translation in the block of target content.
In another embodiment of the invention, the one or more external factors include the associations among the source units of translation in the block of source content and the target units of translation in the block of target content.
In another embodiment of the invention, the one or more external factors include input from a user translating from the source units of translation to the target units of translation.
In another embodiment of the invention, the one or more external factors include data relating to characteristics of the source content.
In another embodiment of the invention, the data relating to characteristics of the source content includes HTML code.
In another aspect, the invention comprises a system for computer-assisted dynamic selection of one or more segmentation criteria used to identify source units of translation in a block of source content. The segmentation criteria identify delineation characteristics of the source content for defining boundaries of the source units of translation. The system includes means for providing two or more source segmentation criteria associated with the block of source content, and for selecting one of the source segmentation criteria from the two or more segmentation criteria as an initial source criterion. The system further includes means for using the initial source criterion for defining boundaries of the source units of translation. The system also includes means for dynamically selecting, as a function of one or more external factors, subsequent source segmentation criteria from the two or more source segmentation criteria, as the boundaries of the source units of translation are defined.
Another embodiment of the invention further includes means for providing two or more target segmentation criteria associated with a block of target content. The system further includes means for selecting one of the target segmentation criteria from the two or more target segmentation criteria as an initial target criterion, and using the initial target criterion for defining boundaries of the target units of translation. The system also includes means for dynamically selecting, as a function of one or more external factors, subsequent target segmentation criteria from the two or more target segmentation criteria, as the boundaries of the target units of translation are defined. The system also includes means for assigning associations among the source units of translation in the block of source content and the target units of translation in the block of target content.