1. Field of the Invention
Embodiments of the present invention generally relate to content translation and, more particularly, to a method and apparatus for extracting localizable content from an article.
2. Description of the Related Art
In today's global marketplace, digital content written in one language, for example English, must be translated and localized to make it accessible to readers of other cultures and languages. Translation is a literal, word-for-word conversion of source content into a target language. Localization adapts source content for a specific region or language by adding locale-specific content and translating text as needed. Localization does not require word-for-word matching of the source content, but rather provides content that has the same connotation, or meaning, as the source content. For example, “Like father, like son” is an English phrase. Localized for Chinese culture, this phrase may read, after word-for-word translation from Chinese back into English, “Tigers do not breed dogs.” In some cases, however, localization may include a word-for-word translation.
Digital content is typically created in a content management system, such as ADOBE® CQ, which is based on the Java Content Repository (JCR) standard. Content authored using the JCR standard has a specific format, although when a reader views the content, it is typically in the form of a HyperText Markup Language (HTML) page. The content is referred to as an article or page, for example, a dynamic Portable Document Format (PDF) document. The page has components that make up the content of the page, for example, “bodycontent”, “legaltext”, and the like. Each component is stored as a node in the JCR. Each node has properties that are also stored in the JCR. The properties include information about the component, such as “datelastmodified”, “lastmodifiedby”, “description”, “title”, and the like. Not all properties are applicable for localization. For example, the “datelastmodified” and “lastmodifiedby” properties may not need to be localized, but the “description” and “title” properties may be ones that an author wants to have translated or localized when displaying the page in another locale. Currently, when digital content is created using the JCR standard, the entire HTML file is sent to translators for localization. Sending the entire HTML file wastes time because the localizable properties must then be identified manually.
Therefore, there is a need for a method and apparatus for extracting localizable content from an article.
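The node-and-property structure described above, and the distinction between localizable and non-localizable properties, can be illustrated with a minimal sketch. It does not use the JCR API; the page is modeled as a plain Python dictionary, and the property values, the `LOCALIZABLE_PROPERTIES` set, and the `extract_localizable` function are all hypothetical names introduced only for illustration.

```python
# Hypothetical in-memory stand-in for a page whose components are stored as nodes.
# Component and property names come from the text above; the values are made up.
page = {
    "bodycontent": {
        "title": "Getting started",
        "description": "How to set up the product.",
        "datelastmodified": "2012-01-15",
        "lastmodifiedby": "jsmith",
    },
    "legaltext": {
        "title": "Legal notice",
        "description": "Terms that apply to this page.",
        "datelastmodified": "2012-01-10",
        "lastmodifiedby": "legal",
    },
}

# Assumption: the author has designated only these properties as localizable.
LOCALIZABLE_PROPERTIES = {"title", "description"}


def extract_localizable(page, localizable=LOCALIZABLE_PROPERTIES):
    """Return, per component, only the properties designated as localizable."""
    result = {}
    for component, properties in page.items():
        kept = {
            name: value
            for name, value in properties.items()
            if name in localizable
        }
        if kept:  # skip components with nothing to localize
            result[component] = kept
    return result


localizable_content = extract_localizable(page)
```

Under these assumptions, `localizable_content` holds only the “title” and “description” values of each component, which is the portion a translator would need, while “datelastmodified” and “lastmodifiedby” are left out.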