As Internet communications become increasingly widespread and more users communicate by electronic mail (email), mailbox search has become an important form of data search. Mailbox searches are typically based on mailbox indices; that is, all of a user's mail is typically searched by way of a mailbox index.
One existing method for establishing mailbox indices is as follows. Mailbox indices are generally established in the form of inverted indices. For example, suppose there are three mail files named doc_id1, doc_id2, and doc_id3, all of which contain the phrase “hello my world.” The inverted index records, which store the mappings from keywords to mail files, are then as shown below:
hello -> doc_id1, doc_id2, doc_id3,
my -> doc_id1, doc_id2, doc_id3,
world -> doc_id1, doc_id2, doc_id3;
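As a minimal sketch of how records of this kind could be built, the following Python example maps each keyword to the mail files containing it. The function name `build_inverted_index`, the whitespace tokenization, and the in-memory `mails` dictionary are assumptions made for illustration; they are not taken from any particular mailbox system.

```python
from collections import defaultdict


def build_inverted_index(mails):
    """Map each keyword to the sorted list of mail IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in mails.items():
        # Assumption: keywords are simply whitespace-separated, lowercased words.
        for keyword in text.lower().split():
            index[keyword].add(doc_id)
    return {keyword: sorted(doc_ids) for keyword, doc_ids in index.items()}


# The three example mail files, each containing the same phrase.
mails = {
    "doc_id1": "hello my world",
    "doc_id2": "hello my world",
    "doc_id3": "hello my world",
}

inverted_index = build_inverted_index(mails)
for keyword, doc_ids in inverted_index.items():
    print(keyword, "->", ", ".join(doc_ids))
```

Each keyword ends up associated with all three mail files, which corresponds to the three records listed above.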
The inverted index records described above are stored in an inverted index file. The offset position and length of each inverted index record in the inverted index file are recorded, and the offset position is written into a dictionary file in the manner described below:
{“hello”: {“file_path”:“/xxx/inverted_index_file”, “offset”:0}};
Suppose that a user searches for mail that includes “hello.” The keyword is first looked up in the dictionary file, which yields the address “/xxx/inverted_index_file” and the offset “0.” The inverted index file is then opened and the record at offset “0” is read, so that the three pieces of mail {doc_id1, doc_id2, doc_id3} are fetched.
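The lookup path described above can be sketched as follows. This is only an illustration under stated assumptions: records are encoded as JSON, an in-memory byte string stands in for the inverted index file, and the dictionary stores a length alongside each offset so that a record can be read back. The names `write_index` and `lookup` are hypothetical.

```python
import json


def write_index(records):
    """Store records back to back; return (file_bytes, dictionary).

    Assumption: each record is the JSON-encoded list of mail IDs, and the
    dictionary keeps both the offset and the length of each record.
    """
    blob = bytearray()
    dictionary = {}
    for keyword, doc_ids in records.items():
        payload = json.dumps(doc_ids).encode("utf-8")
        dictionary[keyword] = {"offset": len(blob), "length": len(payload)}
        blob += payload
    return bytes(blob), dictionary


def lookup(keyword, blob, dictionary):
    """Find the keyword in the dictionary, then read its record at that offset."""
    entry = dictionary.get(keyword)
    if entry is None:
        return []
    start = entry["offset"]
    record = blob[start:start + entry["length"]]
    return json.loads(record.decode("utf-8"))


doc_ids = ["doc_id1", "doc_id2", "doc_id3"]
records = {"hello": doc_ids, "my": doc_ids, "world": doc_ids}
index_file, dictionary = write_index(records)

print(dictionary["hello"]["offset"])
print(lookup("hello", index_file, dictionary))
```

Only the dictionary entry and one slice of the file are touched per keyword, which is why the offsets must stay accurate.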
However, when new mail is added, the inverted index file needs to be updated in order to ensure the completeness of search results. For example, a new piece of mail, doc_id4, is added. This piece of mail also contains “hello my world,” a total of three keywords. Thus, at this point, the inverted index records need to be updated as follows:
hello -> doc_id1, doc_id2, doc_id3, doc_id4,
my -> doc_id1, doc_id2, doc_id3, doc_id4,
world -> doc_id1, doc_id2, doc_id3, doc_id4;
If the updated inverted index records are saved to the inverted index file, the record for “hello” becomes longer, so the original storage locations of the two inverted index records that follow it, “my -> doc_id1, doc_id2, doc_id3, doc_id4” and “world -> doc_id1, doc_id2, doc_id3, doc_id4,” need to be changed within the inverted index file. At the same time, the corresponding offset values in the dictionary file need to be revised.
Therefore, with the method described above, other related data content of the inverted index file needs to be shifted whenever a new piece of mail is added.
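The shifting effect can be illustrated with a small Python sketch. It assumes that records are stored back to back in keyword order and that each record is the JSON-encoded list of mail IDs; the helper `layout` is hypothetical and only computes where each record would start.

```python
import json


def layout(records):
    """Return {keyword: (offset, length)} for records stored back to back."""
    offsets = {}
    position = 0
    for keyword, doc_ids in records.items():
        # Assumption: a record is the JSON-encoded list of mail IDs.
        length = len(json.dumps(doc_ids).encode("utf-8"))
        offsets[keyword] = (position, length)
        position += length
    return offsets


keywords = ("hello", "my", "world")
old_records = {k: ["doc_id1", "doc_id2", "doc_id3"] for k in keywords}
# Adding doc_id4, which contains all three keywords, lengthens every record.
new_records = {k: ids + ["doc_id4"] for k, ids in old_records.items()}

before = layout(old_records)
after = layout(new_records)

# Records whose starting offset changed must be moved within the file,
# and their dictionary entries must be rewritten.
moved = [k for k in keywords if before[k][0] != after[k][0]]
print(moved)
```

Under these assumptions the first record stays at offset 0, while every record after it moves, which matches the update cost described above.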
Existing mailbox searches using mail indices as described above typically require keyword searches of entire inverted index files. As the scale of mail data expands, mailbox servers may have hundreds of millions of subscribers and billions of individual mail messages. Storing such large volumes of data requires large amounts of hard disk I/O resources, making it difficult or even impossible to index mailboxes quickly. Furthermore, the storage costs of vast quantities of mail are very high for mailbox servers, and large quantities of storage resources can be tied up as a result.