The invention relates generally to computer systems, and deals more particularly with program tools to gather web pages containing audio files of different languages and transcribe the audio files for a search engine repository or other use.
Web search engines such as “Google.com” and “Yahoo.com” are well known today. The user can specify key words for a search, and the search engine will search its repository of web pages and files for those web pages or files which include the key words. Alternatively, the user can specify a subject category such as “golf”, and the search engine will search its repository of existing web pages and files for those web pages or files which were previously classified or indexed by the search engine into the specified subject category.
Periodically, content gathering tools, called web crawlers or spiders, send out requests to other web sites to identify and download their web pages for storage in the search engine's repository. The web crawler goes to an initial web site specified by an administrator or identified by some other means. Some crawlers identify every page at the web site by navigating through the web site, and then download a copy of every web page to a storage repository of a search engine. This type of web crawler does not filter the web pages; it does not conduct any key word searching of the web pages that it identifies and downloads. Other web crawlers search text within the web pages for those web pages containing key words. The web crawler then downloads to the search engine repository a copy of only those web pages containing the key words. The search engine may index the web pages downloaded by either or both types of content gathering tools. A subsequent user of the search engine can then request all web pages in certain categories or conduct a key word search of the web pages in the repository, as described above. Both types of content gathering tools, after completing their investigation into the initial web site, can go to other web sites referenced by the initial web site or identified by some other means.
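By way of illustration only, the two known types of content gathering tool described above can be sketched in Python as a single routine. The sketch is not taken from any particular crawler. The names `crawl`, `fetch`, `keywords` and `SITE` are hypothetical, and the web site is simulated by an in-memory dictionary so that the example runs without network access. When `keywords` is omitted the routine behaves as the unfiltered type and stores every page it reaches; when `keywords` is supplied it stores only pages whose text contains at least one key word.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTextParser(HTMLParser):
    """Collects link targets and text content from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.text_parts.append(data)


def crawl(start_url, fetch, keywords=None):
    """Return {url: html} for the pages to store in the repository.

    fetch    -- hypothetical callable mapping a URL to HTML text, or None
    keywords -- None for the unfiltered crawler; otherwise only pages whose
                text contains at least one key word are stored
    """
    repository = {}
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = LinkAndTextParser()
        parser.feed(html)
        text = " ".join(parser.text_parts).lower()
        if keywords is None or any(k.lower() in text for k in keywords):
            repository[url] = html
        # Links are followed whether or not the page itself was stored.
        for href in parser.links:
            target = urljoin(url, href)
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return repository


# Simulated web site (assumption: stands in for real HTTP requests).
SITE = {
    "http://example.test/": '<a href="/golf.html">Sports</a> <a href="/news.html">News</a> Welcome',
    "http://example.test/golf.html": "<p>Golf scores and tips</p>",
    "http://example.test/news.html": '<p>Local news</p><a href="/">Home</a>',
}

all_pages = crawl("http://example.test/", SITE.get)
golf_pages = crawl("http://example.test/", SITE.get, keywords=["golf"])
```

In this sketch `all_pages` holds all three simulated pages, while `golf_pages` holds only the page whose text mentions golf. Both calls visit the same pages; only the decision about what to store differs.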
Some web pages reference or include audio files, alone or associated with a video file. It is also known for a content gathering program, when encountering a web page referencing or including an audio file, to invoke voice recognition software to attempt to transcribe the audio file into text so that the audio file can be indexed and searched by key words. See “Speechbot: An Experimental Speech-Based Search Engine for Multimedia Content on the Web” by Van Thong, et al., published in IEEE Transactions on Multimedia, Volume 4, Issue 1, March 2002, pages 88-96. See also US 2003/0050784 A1 to Hoffberg et al. However, in some cases, difficulties have arisen in determining the language of the audio file, and therefore what voice recognition software to use and how to appropriately configure it for an accurate transcription.
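For illustration, the step of recognizing that a web page references or includes an audio file might be sketched as follows. This is an assumption-laden example and not the method of the cited references: the name `find_audio_references`, the list `AUDIO_EXTENSIONS` and the choice of HTML tags inspected are all hypothetical, and a real content gathering program could use other signals such as the content type reported by the server.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumption: an illustrative, non-exhaustive list of audio file extensions.
AUDIO_EXTENSIONS = (".mp3", ".wav", ".ogg", ".m4a")


class AudioReferenceParser(HTMLParser):
    """Collects absolute URLs of audio files referenced by one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.audio_urls = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag in ("audio", "source", "embed"):
            candidate = attributes.get("src")
        elif tag == "a":
            candidate = attributes.get("href")
        else:
            return
        if not candidate:
            return
        absolute = urljoin(self.base_url, candidate)
        path = urlparse(absolute).path.lower()
        # An <audio> element is audio by definition; other tags are
        # accepted only when the path ends in a known audio extension.
        if tag == "audio" or path.endswith(AUDIO_EXTENSIONS):
            if absolute not in self.audio_urls:
                self.audio_urls.append(absolute)


def find_audio_references(base_url, html):
    """Return the audio file URLs referenced by the page, in page order."""
    parser = AudioReferenceParser(base_url)
    parser.feed(html)
    return parser.audio_urls


page = (
    '<p>Interview</p><audio src="clips/talk.ogg"></audio>'
    '<a href="/media/news.MP3">listen</a><a href="about.html">about</a>'
)
audio_urls = find_audio_references("http://example.test/page/index.html", page)
```

The sketch deliberately stops before transcription. The URLs it returns do not by themselves indicate the language spoken in each file, which is the point at which the difficulty noted above arises.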
Accordingly, an object of the present invention is to determine a language of an audio file referenced by or included in a web page, so that the proper voice recognition software can be employed to transcribe the audio file.