Systems for compiling word usage frequencies are desirable for prioritizing the words to be learned by a language student. A tool that enables a student or teacher to determine which words are the most used in a language would allow such words to be taught and learned before less important words. In order to determine which words are most used, a student or teacher may look to public information sources, such as news services, and to other written documents created in the language by native users of the language. In order to determine usage frequency from such documents, the student or teacher needs a method to determine the usage frequency of each character and word.
Systems to support language study by determining word and character usage frequency must be able to analyze written words both in languages that use an alphabet, referred to herein as Latin-based languages, and in languages that use graphics, referred to herein as Sino-Tibetan languages. As used herein, a “word” comprises one or more “characters,” and a character comprises either a letter of an alphabet in a Latin-based language or a graphic in a Sino-Tibetan language. Words and characters may be encoded in Unicode, a universal coding scheme for storing the characters of the world's major languages.
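By way of illustration only, the following Python sketch shows one way that word and character usage counts might be tallied over Unicode text. The function names count_characters and count_words are hypothetical and do not appear in any prior art program discussed herein. The sketch assumes that words are separated by whitespace or punctuation; that assumption holds for most alphabetic text but not for text written without spaces between words, where a run of graphics would be counted as a single word unless a separate segmentation step is applied.

```python
import re
from collections import Counter


def count_characters(text):
    """Tally each letter or graphic in the text.

    str.isalpha() is true for Unicode letters, including CJK graphics,
    so whitespace, digits, and punctuation are skipped.
    """
    return Counter(ch for ch in text if ch.isalpha())


def count_words(text):
    """Tally each word, treating a word as an unbroken run of letters.

    Assumption: words are delimited by non-letter characters. Text written
    without spaces between words needs a segmentation step that is not
    shown here.
    """
    # [^\W\d_]+ matches runs of Unicode word characters that are not
    # digits or underscores, that is, runs of letters.
    return Counter(re.findall(r"[^\W\d_]+", text.lower()))


if __name__ == "__main__":
    print(count_words("The cat saw the dog").most_common(1))
    print(count_characters("学习 学习").most_common(2))
```

Because Python strings are sequences of Unicode code points, the same two functions handle alphabetic letters and graphics without any language-specific branches.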
The use of vocabulary builders is known in the prior art. For example, speech-recognition software, such as Dragon NaturallySpeaking® by ScanSoft® and ViaVoice™ by IBM®, includes vocabulary building programs. One such vocabulary building program is a vocabulary optimizer program that refines a language model by scanning documents present in the folder labeled My Documents and/or e-mail on the user's computer. The language model at the time of installation includes default statistics regarding the probability that a given word will be used in the context of other words that precede it in a group of text. The vocabulary optimizer program adjusts the default statistics to reflect the contents of the user's documents.
Another known vocabulary building program is a vocabulary addition program that adds words from a user's documents to a vocabulary list, allowing the user to select specific documents or the contents of entire folders from locations accessible by the user's computer. The user has the option of displaying a list of the words, drawn from all the selected documents, that are not in the current vocabulary, along with the number of times each word is used. The words are presented in alphabetical order or in order of decreasing usage frequency. The user can then select which words from the list will be added to a vocabulary file. The user is also informed of the total number of documents processed, the total number of words processed, and the number of words found that were not present in the program's dictionary.
The vocabulary optimizer program makes no provision for allowing the user to view the statistics regarding word usage frequency. The user cannot direct the vocabulary optimizer program to scan documents in any locations other than the My Documents folder. In addition, the vocabulary optimizer program does not scan documents that are older than 90 days or documents that are less than 512 bytes in size, and the user is not permitted to adjust these parameters.
The vocabulary addition program only reports the frequency of usage for words that are not already in the vocabulary addition program's dictionary or in an associated dictionary. The vocabulary addition program does not provide the user with usage frequency statistics for each individual document. It lacks the ability to calculate frequency of usage ratios or percentages. It does not allow the user to sort results by increasing frequency of usage. It does not track frequency of usage across multiple sessions of scanning.
Both the prior art vocabulary optimizer program and the prior art vocabulary addition program lack the ability to scan websites to collect frequency of usage data. Neither program allows the user to limit which resources are scanned based on the number of words they contain.
Therefore, a need exists for a system to determine usage frequency for each word in a list of resources to guide a user regarding which words are the most important to learn.
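As a minimal sketch of the kind of output such a system might produce, and not a description of any particular implementation, the following Python example tallies word usage for each resource in a list and then combines the tallies into an overall ranking with usage percentages. The function name usage_frequency and the resource names are hypothetical, and splitting on whitespace is a simplifying assumption.

```python
from collections import Counter


def usage_frequency(resources):
    """Tally word usage per resource and across all resources.

    resources: mapping of a (hypothetical) resource name to its text.
    Returns (per_resource, ranked), where per_resource maps each name to
    a Counter of its words and ranked is a list of
    (word, count, percent_of_all_words) in order of decreasing count.
    """
    # Assumption: words are separated by whitespace.
    per_resource = {
        name: Counter(text.lower().split())
        for name, text in resources.items()
    }

    # Combine the per-resource tallies into one overall tally.
    total = Counter()
    for counts in per_resource.values():
        total.update(counts)

    grand_total = sum(total.values())
    ranked = [
        (word, n, 100.0 * n / grand_total)
        for word, n in total.most_common()
    ]
    return per_resource, ranked


if __name__ == "__main__":
    docs = {"a": "the cat the dog", "b": "the bird"}
    per_resource, ranked = usage_frequency(docs)
    for word, n, pct in ranked:
        print(f"{word}\t{n}\t{pct:.1f}%")
```

Retaining the per-resource tallies alongside the combined ranking keeps statistics for each individual resource available, and reversing the ranked list gives the words in order of increasing usage frequency.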