A word breaker (also referred to as a morphological analyzer) is an automated system which receives words and outputs morphemes. For example, given a word, a word breaker is able to identify combinations of one or more morphemes which may make up that word. A morpheme is the smallest grammatical unit in a language. An example of a word and its constituent morphemes is the word “feeling”, which may comprise a single morpheme “feeling” where the word is used as a noun, and which may comprise two morphemes “feel” and “ing” where the word is used as a verb.
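The input and output of a word breaker may be illustrated with a minimal sketch. The sketch assumes a small hand-written morpheme list (`TOY_MORPHEMES`) and a hypothetical function `break_word`, neither of which is taken from any particular word breaker; it simply enumerates every way of splitting a word into listed morphemes and does not reflect how the morphemes themselves would be obtained.

```python
from typing import List, Set

# Hypothetical toy morpheme list, for illustration only.
TOY_MORPHEMES: Set[str] = {"feel", "ing", "feeling", "s"}


def break_word(word: str, morphemes: Set[str] = TOY_MORPHEMES) -> List[List[str]]:
    """Return every way of splitting `word` into a sequence of listed morphemes."""
    if not word:
        # The empty string has exactly one segmentation: no morphemes at all.
        return [[]]
    results: List[List[str]] = []
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        if prefix in morphemes:
            # Keep this prefix and segment the remainder recursively.
            for rest in break_word(word[i:], morphemes):
                results.append([prefix] + rest)
    return results


if __name__ == "__main__":
    print(break_word("feeling"))
```

For the word “feeling” the sketch yields both candidate combinations, the two morphemes “feel” and “ing” and the single morpheme “feeling”. Choosing between them depends on how the word is used, which this sketch does not attempt.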
Existing word breakers are often created through supervised learning, where examples of words and their morphemes are annotated by human judges. This makes word breakers expensive and time-consuming to produce, especially for highly inflectional languages such as Turkish. Another option is to use lexical data and linguistic rules. However, lexical data and linguistic rules are often unavailable, depending on the language involved.
Word breakers are extremely useful for many applications including, but not limited to, information retrieval, machine translation and speech processing. In particular, word breakers are useful when processing morphologically rich languages such as Finnish, German, Turkish and Arabic.
The embodiments described below are not limited to implementations which solve any or all of the disadvantages of known word breakers and/or ways of building word breakers.