With the unprecedented increase in the Internet usage and many new emerging applications, the relative volume of semi-structured and unstructured data managed by database management systems is steadily growing. E-mail, the largest growing data type, HTML and XML data, directory data, product catalogs, customer information data, and free text fields are only a few examples of character data that is being produced, stored and managed every day. The management of text data is a problem of increasing importance.
Users that query character or string data often query it with partial strings for several reasons: 1) power users try to reduce their typing effort, 2) normal users may not know the full content of a text field, for example the NAME field may contain last name or both first and last name, 3) because formats to store data change over time and new data is recorded in a different format compared to the old, for example, a data entry operator may have entered a phone number in the ### ### #### format before retirement, while the new data entry operator enters phone numbers in the 1-###-###-#### format. In addition, first names may not consistently precede last names in name fields. And 4) typographical errors may have been made while entering string data.
Both SQL query language for relational data and XPath query language for XML data allow users to pose complex pattern-based regular expression queries. The requirement is the efficient indexing mechanisms to evaluate such queries. Although the problem of approximate string matching has been studied, as it is surveyed in the literature, these results have not been utilized much in relational database management systems (DBMS). A common approach adopted by commercial database systems is moving the functionality into object-relation extenders to support approximate string matching. However, by moving the function into an extender, efficiency of query execution is compromised. This comes from the inability of query optimizers to handle object-relational extensions well during query optimization. As a result, grossly inefficient query execution plans may be selected.
SQL, perhaps the most widely used database query language, supports a LIKE clause against character fields. Through the LIKE clause, UNIX style, wildcard queries may be specified. Two special characters' and ‘%’ may be used to specify any single character match, and match against any substring, respectively.
To illustrate, consider a relational database table, shown in FIG. 1, consisting of two fields and seven rows. The SQL query SELECT Code, Airport WHERE Airport LIKE ‘%Calif%’ run against the table would return the row:
<SJC San Jose, Calif.>
And the SQL query SELECT Code, Airport WHERE Airport LIKE ‘an %’ would return two rows:
<SJC San Jose, Calif.>
<SFO San Francisco>
One technique for implementing partial string queries, such as the SQL LIKE statement, is the use of n-gram based indexing. N-gram based indexing suggests itself, particularly because it can be applied to text in non-English languages (about 50% of database usage is outside the US). It has been previously observed that it is not necessary for an n-gram index to completely and correctly filter the text based on the query. So far as the filtering ensures that no correct results are lost, and a large fraction of the non-qualifying rows are filtered out, the actual wildcard pattern may be checked against the non-filtered data returned.
For example, consider the SQL query SELECT Code, Airport WHERE Airport LIKE ‘% York %’ would return:
<JFK John F Kennedy, New York>
<LGA La Guardia, New York>
By realizing that the n-gram “or” is a substring of the query string, “York”, and if the gram “or” indexed the attribute Airport, then the index could be used for partial query evaluation, returning the following three rows:
<SJC San Jose, Calif.>
<JFK John F Kennedy, New York>
<LGA La Guardia, New York>
Subsequent to identification, the wildcard may be evaluated on three rows to filter the one row which was a false hit. At the price of the index access, access of three rows and three evaluations of the regular expression, access and evaluation of the regular expression on the rest of the data may be avoided. It is easy to visualize examples where significant savings may be achieved.
The problem of supporting wildcard queries through n-gram indexing has been previously examined. However, these previous attempts at the problem were approached from the point of view of indexing a huge document collection (like the world wide web). The size of the index created was only limited by the disk space available at the server. In the database context, by contrast, there is contention for disk space by permanent and auxiliary data (temporary sort files, indexes, scratch pads, space for maintenance operations) so one has to be thrifty about space usage for any persistent data structures like indexes. The problem reduces to identifying a valuable set of n-grams to be used for indexing.
Accordingly, what is needed is a method and system for selecting a set of optimal n-grams for indexing string data in a DBMS system under space constraints introduced by the system. The present invention addresses such a need.