The information revolution has generated vast amounts of data. Originally, this data consisted mostly of textual information stored in text-based files on a computer. Finding a particular piece of data was accomplished by searching all files on the computer for a text string.
Today, this data manifests itself not only in files, but in data streams as well. Data streams have no particular beginning or end. These streams are often used to represent video and audio information in digital computer formats.
An increasing variety of computer applications have been created, each storing its data in a different format. For example, a word processor stores data in a different format than does a spreadsheet or a drawing editor. Some applications allow for file searching by including summary information within the data files. Application-specific search engines then retrieve this summary information and compare it with search terms without having to search the entire file. For example, Microsoft Word allows a user to store information about a document such as the author, title, revision date, and keywords. Searches for particular keywords or authors can be performed by reading the summary information from many Word document files until summary information matching the search terms is found.
Unfortunately, such summary information is specific to Microsoft Word document files and cannot be used to search for keywords in other kinds of files, such as another manufacturer's word processor or a graphics file. As more and more new applications become available, each may include its own specific summary information, or no summary information at all. Thus searching for information is more difficult as new applications become available.
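The summary-information search described above, and its blind spot for files that carry no summary, can be illustrated with a minimal Python sketch. The `Summary` and `DocumentFile` types, the `search_by_keyword` function, and the sample file names are all hypothetical; they do not reflect the actual on-disk format of Microsoft Word or any other application.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Summary:
    # Hypothetical summary block an application might store with a document.
    author: str = ""
    title: str = ""
    keywords: set = field(default_factory=set)


@dataclass
class DocumentFile:
    name: str
    body: bytes
    # None models a file format that stores no summary information at all.
    summary: Optional[Summary] = None


def search_by_keyword(files, keyword):
    """Return names of files whose summary lists the keyword.

    Only the summary is compared; the file body is never read.
    """
    wanted = keyword.lower()
    hits = []
    for f in files:
        if f.summary is None:
            continue  # no summary, so this search cannot see the file
        if wanted in {k.lower() for k in f.summary.keywords}:
            hits.append(f.name)
    return hits


files = [
    DocumentFile("report.doc", b"...", Summary("A. Author", "Q3 Report", {"budget", "forecast"})),
    DocumentFile("chart.bmp", b"BM...", None),
    DocumentFile("memo.doc", b"...", Summary("B. Writer", "Memo", {"Budget"})),
]

print(search_by_keyword(files, "budget"))
```

In this sketch the graphics file is skipped no matter what it depicts, because it offers no summary for the search to compare against.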
Non-Textual Files Difficult to Search--FIG. 1
FIG. 1 shows several different kinds of files stored on a computer. Text file 12 can be a simple ASCII file with textual characters, or it could be a document file in a specific format such as Novell's WordPerfect or Microsoft's Word. These files contain mostly textual information, and a search can be performed by searching for a text string or keywords within the file. While text file 12 has been a common file type, other kinds of files are becoming more common as the graphics and multimedia capabilities of computers improve.
Graphics file 14 contains one or more pictures. These pictures can be formatted as a grid of pixels or a bit-map, or in a vector format that describes the outlines of shapes within the picture. Graphics file 14 is basically non-textual, and searching for a particular graphics file on a computer is difficult. Sometimes a viewer is used to display small "thumbnails" of many pictures, allowing the user to visually search for a desired picture. Occasionally the file name for the picture is descriptive, or summary information such as keywords is stored by a specific graphics program in a proprietary format.
Video file 16 contains a sequence of pictures that are rapidly displayed by a video or movie application program. Video file 16 often requires a large amount of storage space, and the sequence of frames or pictures is often compressed. Again, searching is difficult since the content of video file 16 is a series of images, not text. Audio file 18 contains binary data that is sent to an audio or sound card on a computer to activate speakers. Many different types of audio files are used, such as files that generate sound by controlling a music synthesizer, or files that directly control audio output to the speakers, such as a voice-clip file. Since audio file 18 contains binary values representing audio intensity at a point in time, or the frequency of the sound generated, little or no text is contained in audio file 18. The lack of text in audio file 18, as well as in graphics file 14 and video file 16, makes text-based searching unfruitful.
Often two or more file types are combined for multimedia. Multimedia file 20 contains audio, video, and textual information in a single file. Searching of the textual information is possible, although searching for a video or audio sequence is still difficult unless the text happens to describe the audio or video.
Not only are the different kinds of non-textual files difficult to search, but each manufacturer can also develop its own specific format for the file content. Hundreds of file formats are common today in what has become a virtual tower of Babel of proprietary data formats.
Searching Across Networks--FIG. 2
The array of file formats complicates searching, and the problem is compounded because computers are often linked together in wide-reaching networks. Data is no longer stored in centralized mainframe computers. Instead, data is distributed among many desktop personal computers (PCs) and even on portable PCs that are not always connected to a computer network. Larger networks such as the Internet or organization-wide intranets often use search engines that build databases of the available data.
FIG. 2 shows a network with distributed content and a search database. Network 28 connects several different computers together allowing each to remotely access files on another system. Files are created and stored on different computer systems connected to network 28. For example, PC 30 stores one or more files 22, while PC 31 stores file 24. Other files 22 are stored on other computers attached to network 28, such as UNIX workstation 32, Apple Macintosh Computer 34, server 38, and laptop PC 36. Laptop PC 36 is only occasionally connected to network 28. UNIX workstation 32 uses radically different file formats than does Apple Macintosh Computer 34, or PC 30.
Each of the files 22 can be textual files or other kinds of files such as graphics or multimedia files. A search engine operating on server 38 builds a search database by reading a list of files on each PC or computer attached to network 28, and perhaps extracting the most-commonly used words in a text document as keywords.
Search database 26 is constructed by the search engine on server 38. Information about each of files 22, 24 is stored in database 26. Information stored in database 26 can include file names and keywords extracted from files 22, 24. The location of each file 22, 24 is also stored in database 26. Thus a centralized search database 26 is constructed even though the content, files 22, 24, is located in various locations on network 28.
When a user performs a search for a particular keyword or file name on network 28, the search engine on server 38 searches through database 26. The information in database 26 is the information that is actually searched rather than the files themselves. Database 26 must be frequently updated since files 22, 24 may change frequently and even be deleted.
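The centralized approach described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the nested dictionary stands in for the computers on network 28, the host labels "pc30" and "pc31" and the file names are hypothetical, and keyword extraction is reduced to picking the most frequently used words. The sketch also shows why such a database goes stale, since a search consults the database and never the files themselves.

```python
import re
from collections import Counter


def extract_keywords(text, top_n=2):
    """Pick the most frequently used words in a text as its keywords."""
    words = re.findall(r"[a-z]+", text.lower())
    return {word for word, _ in Counter(words).most_common(top_n)}


def build_database(network, top_n=2):
    """Crawl every host and map each keyword to the (host, file name) locations."""
    database = {}
    for host, files in network.items():
        for name, text in files.items():
            for keyword in extract_keywords(text, top_n):
                database.setdefault(keyword, set()).add((host, name))
    return database


def search(database, keyword):
    """Look up a keyword in the database only; the files are not touched."""
    return sorted(database.get(keyword.lower(), set()))


# Hypothetical stand-in for files distributed across a network.
network = {
    "pc30": {"budget.txt": "budget budget forecast budget forecast totals"},
    "pc31": {"trip.txt": "travel travel hotel travel hotel taxi"},
}

database = build_database(network)
fresh = search(database, "budget")

# The file is deleted on its host, but the database is not told.
del network["pc30"]["budget.txt"]
stale = search(database, "budget")

# Only a full re-crawl brings the database back in step with the files.
database = build_database(network)
after_recrawl = search(database, "budget")

print(fresh, stale, after_recrawl)
```

Here `stale` still points at a file that no longer exists, and the result is corrected only after `build_database` crawls every host again.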
Out-of-date search databases 26 are a nuisance, as Internet users can attest. Often a search is performed and a link followed only to get a "DNS Server Not Found" error when the file or site no longer exists.
While search databases have been useful, they generate additional network traffic since the search engine must constantly crawl the network to update file information. This additional network traffic reduces performance of other, more important network tasks. As the number of computers and files on a network increases, greater network bandwidth is required for updating search databases.
What is desired is a file-search system that does not generate additional network traffic for its maintenance. A fully up-to-date search system is desired. It is desired to eliminate the centralized search database. It is desired to search files that are distributed across a network without using a centralized search database. It is desired to search for content within a file regardless of the file type or the computer system the file resides on. It is further desired to bind search information with the file itself. It is desired to search for non-textual files such as graphics, video, audio, and multimedia files. It is desired to bind search information with these non-textual files regardless of file format or the type of computer system the file is located on.