The present invention relates generally to computer systems for processing strings of data, and also to a parallel string processor for a minicomputer and a method of searching strings of bits and bytes for the presence of a desired keyword.
Prior art computers and microprocessors process data strings one byte at a time. One of the most frequently occurring processing tasks is to attempt to locate one or more control characters in a data string. Prior art systems compare the data one byte at a time to the control or reference characters, which are loaded into a CPU (central processing unit) register. After a byte is compared, the data string is rotated one byte so that the next byte in the data string is compared, continuing until all bytes are compared.
The foregoing is a time-consuming procedure and utilizes a substantial amount of computer time as numerous repetitions of the comparison process are required to check each byte sequentially in order to determine whether it contains the reference character. For example, there may be only one "carriage return" (CR) character found per 80-character line, but all 80 characters must be compared one at a time. If a data string has 512 bytes, and each byte is separately compared to the control character, the comparison must be executed 512 times. Thus, a need exists for reducing the amount of time required to compare strings of data and find control characters embedded in the strings.
If eight bytes of data are compared simultaneously, the number of comparisons necessary is reduced by a factor of 8 relative to a single-byte comparison, resulting in a substantial reduction of processing time. Since the comparison procedure is commonly executed numerous times in any given program, a significant saving in processing time may be achieved by this simultaneous comparison of a number of bytes.
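The saving described above can be illustrated in software. The following is a minimal sketch only, not the hardware arrangement of the invention: it counts comparison steps for a byte-at-a-time scan and for a scan that tests eight bytes per step using a well-known word-level zero-byte test. The function names, the padding scheme and the 512-byte test string are assumptions made for illustration.

```python
LOW = 0x0101010101010101   # 0x01 in each of the eight byte lanes
HIGH = 0x8080808080808080  # 0x80 in each of the eight byte lanes


def scan_bytewise(data: bytes, target: int):
    """Compare one byte per step; return (index or -1, steps used)."""
    steps = 0
    for i, b in enumerate(data):
        steps += 1
        if b == target:
            return i, steps
    return -1, steps


def has_zero_byte(x: int) -> bool:
    """True if any of the eight bytes packed in x is zero."""
    return ((x - LOW) & ~x & HIGH) != 0


def scan_blockwise(data: bytes, target: int):
    """Test eight bytes per step; return (index or -1, steps used)."""
    pattern = int.from_bytes(bytes([target]) * 8, "little")
    filler = bytes([target ^ 0xFF])  # padding byte that can never match
    steps = 0
    for start in range(0, len(data), 8):
        block = data[start:start + 8].ljust(8, filler)
        steps += 1
        # XOR turns every byte equal to the target into a zero byte.
        if has_zero_byte(int.from_bytes(block, "little") ^ pattern):
            return start + block.index(target), steps
    return -1, steps


# Hypothetical 512-byte string whose only carriage return is the last byte.
line = b"A" * 511 + b"\r"
print(scan_bytewise(line, 0x0D))   # (511, 512)
print(scan_blockwise(line, 0x0D))  # (511, 64)
```

In this sketch the 512 single-byte steps become 64 eight-byte steps, which is the factor of 8 stated above. Locating the matching byte inside the final block is done with an ordinary byte search and is not counted as a step.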
In other computer applications it is desirable to have the capability to search long strings of bytes for the presence of a selected pattern of bytes. One relatively well-known application of this kind arises in the word processing context, where one can search a portion of text for a particular word or phrase. For example, one may want to find each occurrence of the word "country" within a particular document so that the word "county" can be substituted therefor. Alternatively, one may want to find each occurrence of a misspelling such as "coutny" so that it can be replaced with the proper spelling "county." These are known as search and replace operations. Search and replace operations are also used in connection with the automatic spelling check features offered by many commercially available word processing programs.
In word processing programs and other programs in which words and letters are used, each letter of the alphabet as well as each symbol such as an asterisk or hyphen is represented as a unique string of eight 1 or 0 logic bits, also known as a byte. In order to determine whether two byte strings represent the same word, the corresponding bits in each byte are compared to determine whether they are the same. If all of the bits in the two byte strings are identical, the two byte strings represent the same word.
A portion of text can be thought of, and is represented, as a long, continuous string of bytes, one byte for each character appearing in the portion of text. This byte string is referred to as the "character string." To determine whether a particular word, or "keyword," appears in a portion of text, current string processors typically start at the beginning of the character string and compare the first byte of the keyword with the first byte of the character string. If these two bytes match (the first letter of the keyword matches the first letter in the portion of text), then the processor compares the second byte in the keyword to the second byte in the character string. If these two bytes match, then the processor compares the next pair of bytes in the two strings, and so on. If all of the respective bytes in the two strings match, the processor has found an occurrence of the keyword in the portion of text.
However, the keyword does not usually appear as the first word in the portion of text being searched. Consequently, at some point one of the bytes of the keyword will fail to match the corresponding byte in the character string. In this case, the character string is shifted one byte relative to the keyword so that the first byte of the keyword is compared to the second byte of the character string. If these two bytes match, then the second byte of the keyword is compared to the third byte of the character string, and so on. If any pair of bytes does not match, then the character string is again shifted one byte relative to the keyword so that the first byte of the keyword is now compared to the third byte of the character string. This general process repeats, usually until all occurrences of the keyword in the portion of text have been found.
As an example, let the keyword be the word "the" and the character string be "that time is the essence." Initially, as described above and set forth below, the byte representing the "t" in "the" will be compared to the byte representing the "t" in "that":
______________________________________
Data String:  that time is the essence
Keyword:      the
______________________________________
This comparison will yield a match, and so the byte representing the "h" in "the" will be compared with the byte representing the "h" in "that." These bytes will also match, and so the byte representing the "e" in "the" will be compared to the byte representing the "a" in "that." These bytes will not match, and so the character string will be shifted one byte with respect to the keyword so that the first byte of "the" will be compared with the second byte in the character string. The relative positions of the keyword and the character string are set forth below, and the "3" above the letter "t" in the word "that" is the number of comparisons that were required to determine whether or not there was a match at that position.
______________________________________
Comparisons:  3
Data String:  that time is the essence
Keyword:       the
______________________________________
Only one comparison will be needed at this point to determine that the keyword is not present at this portion of the character string, and the character string will be shifted again:
______________________________________
Comparisons:  31
Data String:  that time is the essence
Keyword:        the
______________________________________
This process will continue to repeat until the match is found, at which point the character string will have been shifted to the position set forth below:
______________________________________
Comparisons:  31121211111114
Data String:  that time is the essence
Keyword:                   the
______________________________________
Note that this particular example required 21 comparisons to find the keyword "the" in the character string "that time is the essence." In particular, four comparisons were required even where the keyword matched the same word in the character string (the blank space required one comparison).
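The shift-and-compare procedure walked through above can be sketched as follows. This is an illustrative reconstruction, not code from any particular string processor. The function name `naive_search` is hypothetical, and the trailing blank is modeled as a fourth keyword character so that the final alignment costs four comparisons, as in the example.

```python
def naive_search(text: str, keyword: str):
    """Shift-and-compare search.

    Returns (index of first match or -1, comparisons made at each alignment).
    """
    counts = []
    for start in range(len(text) - len(keyword) + 1):
        made = 0
        for offset, key_char in enumerate(keyword):
            made += 1
            if text[start + offset] != key_char:
                break  # mismatch: give up on this alignment and shift
        else:
            # Every keyword character matched at this alignment.
            counts.append(made)
            return start, counts
        counts.append(made)
    return -1, counts


# Assumption: the blank after "the" is treated as part of the keyword.
index, counts = naive_search("that time is the essence", "the ")
print(index)                         # 13
print("".join(map(str, counts)))     # 31121211111114
print(sum(counts))                   # 21
```

The per-alignment counts printed by this sketch are the digits of the "Comparisons" row above, and their sum is the 21 comparisons noted in the text.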
In other computer applications it is desirable to test bit patterns for the presence of a particular bit string. Examples of such applications are encryption and decryption algorithms used to scramble and unscramble binary information to protect it from unauthorized reception. Such algorithms are often used in the intelligence field to protect highly classified information from being intercepted and used by foreign countries having adverse interests. These algorithms are also used by corporations to safeguard their valuable commercial information and trade secrets.
In general, these encryption and decryption algorithms may perform search and replace operations similar to those described above in connection with word processing programs. In addition, it would be desirable to be able to perform operations on strings of binary information that are not an integral number of bytes long, for example, a string of five bits. Processors such as those described above in connection with word processing programs do not have this capability, since they shift strings eight bits, or one byte, at a time. Even if such processors had the capability to shift strings of data one bit at a time, their use as described above on strings of bits would be even slower due to the large number of comparisons that would be necessary. As an example, assume that the bit string "11001110011011" is to be searched for the presence of the keyword "1101." Initially, as described above, the first bit of the keyword would be compared to the first bit in the bit string as set forth below:
______________________________________
Bit String:   11001110011011
Keyword:      1101
______________________________________
The processor would need to make four comparisons before it could determine that the four bits in the keyword do not match the first four bits in the bit string. Again, as described above, the processor would then shift the bit string relative to the keyword string as set forth below and compare the respective bits again:
______________________________________
Comparisons:  4
Bit String:   11001110011011
Keyword:       1101
______________________________________
Again, the "4" above the first "1" in the bit string means that four comparisons were required in order to determine that the keyword did not match. After the keyword was shifted as shown above, two comparisons would be required to test the next portion of the bit string. As shown below, 23 comparisons would be needed to find the portion of the bit string that matched the keyword.
______________________________________
Comparisons:  4211342114
Bit String:   11001110011011
Keyword:               1101
______________________________________
A greater number of comparisons is required overall in bit searching than in byte or character searching, since a pair of bits, each having one of two possible values, is more likely to match than a pair of letters, each having one of 26 possible values. Thus, a processor performing operations on bit strings in this manner would incur an unduly large amount of computing overhead.
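The likelihood argument above can be made concrete with a rough estimate. The following is an idealized sketch, assuming (as the original text does not state) that every symbol is drawn independently and uniformly from an alphabet of k possible values and that the keyword is m symbols long. At a given alignment, comparison number j+1 is made only if the first j symbols all matched, which happens with probability k^{-j}.

```latex
\[
  E[\text{comparisons per alignment}]
    = \sum_{j=0}^{m-1} k^{-j}
    = \frac{1 - k^{-m}}{1 - k^{-1}}
\]
\[
  \text{bits } (k = 2,\ m = 4):\quad
    1 + \tfrac{1}{2} + \tfrac{1}{4} + \tfrac{1}{8} = 1.875
\]
\[
  \text{letters } (k = 26,\ m = 3):\quad
    1 + \tfrac{1}{26} + \tfrac{1}{676} \approx 1.040
\]
```

Under these assumptions each alignment costs nearly twice as many comparisons for bits as for letters, and a bit-level search must also test eight times as many alignments per byte of data. Real text and real bit patterns are not uniformly random, so these figures indicate the trend only.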