Every word processing program has a "find" feature, and it takes some time for the feature to find what is sought. This invention relates generally to searching for and finding a particular group of characters, called a "string", within a larger string or an entire file. The invention has particular benefits for computers executing applications that manipulate files containing text written in a particular language, including word processing application programs and programming languages.
In a computer that is executing a typical word processing program, the user can type a key command for finding a string within a file. For example, with the Sprint word processor executing on an industry standard architecture personal computer, the user can press the F7 function key, and the legend "Forward search:" appears on the screen. The user types, say, "the", and the result from the user's point of view is that the cursor of the program moves to a point on the screen where the word "the" appears. Concealed from the user is the detailed process within the computer whereby the search is accomplished. When the user types "the", the application program causes the computer to scan through the file, searching for a match between what was typed and successive three-character portions of the file.
Well known in the prior art is the technique almost universally employed in word processing applications and in programming languages. First, the computer inspects successive locations in the file for an occurrence of the letter t. When a t is found, the next location in the file is tested to see if it is an h. If it is, then the next location is tested to see if it is an e. If so, then the search is finished. (In common slang, there has been a "hit".) If the h or e test fails, then the search must continue. The next occurrence of a t is found, and the h test is performed again. If the h test is satisfied, then the e test is performed again, and so on.
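The character-by-character technique just described can be sketched as follows. This is a minimal illustration only, written in Python for readability; the function name naive_find and its parameters are hypothetical and do not come from any particular word processing program.

```python
def naive_find(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    if not pattern:
        return 0  # an empty pattern trivially matches at the start
    first = pattern[0]
    last_start = len(text) - len(pattern)
    i = 0
    while i <= last_start:
        # Step 1: inspect successive locations for the first character (e.g. "t").
        if text[i] == first:
            # Step 2: test the following locations against the rest ("h", then "e").
            j = 1
            while j < len(pattern) and text[i + j] == pattern[j]:
                j += 1
            if j == len(pattern):
                return i  # every character matched: a "hit"
        # A test failed, so resume the scan at the next location.
        i += 1
    return -1  # end of file reached without a hit
```

For example, naive_find("that is the thing", "the") skips the t of "that", because the e test fails there, and reports the hit at the word "the".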
In many word processing applications a broader search is available, namely a search that accepts a match that is insensitive to case. In such a search, for example, the user may type "the" and the program will announce a "hit" if it finds, say, "The". For such a search, the procedure followed by the computer again involves inspection of successive locations in the file, but each location is tested not only to see if it is a t but also to see if it is a T. The h and e comparisons likewise require tests to see if the candidate position contains a case-equivalent character, i.e., H for h or E for e.
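The case-insensitive variant can be sketched in the same way. The sketch below is an illustration under stated assumptions, not any program's actual routine: the name naive_find_caseless is hypothetical, and the text is assumed to be plain ASCII, so that every letter has exactly one lower-case and one upper-case form. Each location is tested against both forms, which is the extra comparison described above.

```python
def naive_find_caseless(text: str, pattern: str) -> int:
    """Return the index of the first case-insensitive match, or -1 (ASCII assumed)."""
    # Precompute both case forms of every pattern character, e.g. ("t", "T").
    forms = [(c.lower(), c.upper()) for c in pattern]
    n = len(forms)
    for i in range(len(text) - n + 1):
        j = 0
        # Two tests per location: is it the lower-case form, or the upper-case form?
        while j < n and (text[i + j] == forms[j][0] or text[i + j] == forms[j][1]):
            j += 1
        if j == n:
            return i  # all positions matched in one case or the other
    return -1
```

With this sketch, a search for "the" in "See The cat" reports a hit at the word "The", at the cost of up to two comparisons per location instead of one.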
Where the file to be searched is small--only a few hundred or a few thousand characters in length--the search time need not be unduly long. With modern processor speeds and RAM (random access memory) response times, a search of several thousand positions can be finished by the time the user's finger has lifted from the final keystroke of the search command. Even where the file is on disk rather than in RAM, the search time is small when compared to other delays. Even the time to redraw the screen to show where the desired text was found may be comparable in length to the search time.
Where the file to be searched is large, however, search times become significant. For about a decade, users of personal computers have found it unremarkable to edit and manipulate text files of a megabyte or more in length. Searching in megabyte-sized files can take tens of seconds or even minutes. For an indication of the importance of this delay, consider that the time required to search a large file for a string (group of characters) is something software reviewers often measure when comparing text processing application programs.
Two general approaches have been employed in the prior art to cut down on the execution time of text searches. The first is to make judicious use of assembly language for such searches, even if the rest of the application is written in a higher level language. It is well known that tight coding in assembly language offers speed enhancements over compiled languages such as FORTRAN, Pascal, and C, and offers even greater enhancements over interpreted languages such as BASIC and dBase.
A second approach is to speed up the task by moving the data to be searched onto faster hardware. Inspecting successive positions within a file goes slowly for floppy disks, somewhat faster for hard disks, and quite quickly for RAM. Thus, in text processing applications, if the available RAM is large enough, the file may be copied to a RAM disk or to RAM within the space allocated to the application by DOS (disk operating system). The search is then performed in RAM, where it goes more quickly than it would from disk.
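The second approach can be illustrated by a short sketch that copies the whole file into RAM before searching. The function name find_in_file_via_ram is hypothetical, and the sketch assumes the file fits in available memory; it is not drawn from any particular application.

```python
def find_in_file_via_ram(path: str, pattern: bytes) -> int:
    """Return the byte offset of the first occurrence of pattern in the file, or -1."""
    # Copy the entire file from disk into RAM in one read.
    # Assumption: the file is small enough to fit in available memory.
    with open(path, "rb") as f:
        data = f.read()
    # The scan now touches only the in-memory copy; bytes.find returns -1 if absent.
    return data.find(pattern)
```

The disk is read once, sequentially, and every subsequent comparison is made against RAM rather than against the disk.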
Neither of these approaches is quite satisfactory for the range of real-life text manipulations which users have become accustomed to in the last decade. Even with very fast CPUs (central processing units) and recently released word processing programs, a search in a large file can take many tens of seconds. The time required for the search is much greater (as will be seen below, it is generally doubled) where the match is to be case-insensitive.