1. Field
Embodiments relate generally to databases, particularly to database queries.
2. Background
Computers are very powerful tools for storing and providing access to vast amounts of information. Databases are a common mechanism for storing information on computer systems while providing easy access to users. A typical relational database is an organized collection of related information stored as “records” having “fields” of information. As an example, a database of employees may have a record for each employee where each record contains fields designating specifics about the employee, such as name, home address, salary, and the like.
Databases commonly organize data in the form of tables, each table having a number of rows and columns. Each row in a table generally has a data value associated with each of the columns, this intersection of rows and columns commonly called a cell. Between the actual physical database itself (i.e., the data actually stored on a storage device) and the users of the system, a database management system or DBMS is typically provided as a software cushion or layer. In essence, the DBMS shields the database user from knowing or even caring about the underlying hardware-level details.
One purpose of a database system is to answer queries requesting information from the database. A query may be defined as a logical expression over the data and the data relationships set forth in the database, and execution of a query results in the identification of a subset of the database. In operation, for instance, the execution of a request for data from a relational DBMS is typically issued by a client system as one or more Structured Query Language or “SQL” queries for retrieving particular data from database tables located on a server. For example, in response to a request for all employees making over $25,000, the database system typically returns the list of employee names whose salary is greater than $25,000, where a table named “employees” may be defined to include information about employees of a particular organization.
A query usually involves a request for the data contained in one or more cells of any rows that meet a particular condition. This condition often involves the comparison of the values of cells in a column to some other value to determine whether the row associated with the compared cell meets the condition. For example, the condition may involve searching for one or more phrases within text data stored within the database. Each phrase may include one or more terms separated from one another by a space. A database engine or DBMS may employ a searching algorithm that tries to find the closest match for a phrase by matching pairs of words in each row of a database table. To improve the accuracy of search results, the searching algorithm may rank each result according to its relevance to the search criteria.
The rank or relevance of a particular search result may be determined based on a number of different factors. Such factors include, but are not limited to, the proximity of the various phrase terms within a predefined space of each other and the degree to which an exact match for the phrase was returned. For searches involving a single phrase with multiple terms, search results may be ranked according to the number of phrase terms that were returned in each result. Additionally, for searches involving multiple phrases, the relative complexity of the phrases, measured by phrase length (i.e., the number of terms in each phrase), may be an important factor. For example, depending on a given application, a result containing a match for a phrase having a higher number of terms may be preferred over a result with a match for a different phrase having fewer terms.
There are multiple ways to determine the relative complexity of a phrase based on the number of terms the phrase contains. One way is to simply count the number of terms in the phrase. This approach produces a linear correlation between the number of terms in the phrase and the determined complexity of the phrase. However, in order to produce an exponential correlation, in which greater weight is given to a higher number of terms, the searching algorithm may use, for example, N-ary summation to calculate the result. For example, a phrase containing three terms may be given more weight (e.g., by a factor of 2) than a phrase containing two terms, and a phrase containing four terms may be given much more weight (e.g., by a factor of 8) than a phrase containing three terms.
Further, if results comprising partial matches are to be ranked, one factor used by the searching algorithm may be how close a particular match is to a given search phrase. For example, for a search phrase ‘a b c d e’, a row may include a partial match ‘a c e’ and another row may include a partial match ‘a b c’. In this case, both partial matches have the same number of terms. However, the match ‘a b c’ may be considered more relevant than the match ‘a c e’ due to the proximity of the matching terms with respect to the original search phrase. If only the full search phrase match is to be returned, the ranking could be pre-computed for all the rows to be returned. However, if the result should contain partial phrase matches as well, the ranking must be computed for every such partial match.
N-ary summation, also referred to as sigma (Σ) summation, for a value N is calculated by summing the numbers from one to the value N. For example, the value N is equivalent to the number of terms in a phrase for determining the complexity of the phrase. To compute an N-ary summation using traditional methods, the ranking algorithm must iterate N times and perform a number of operations with each iteration. A search for a set of phrases within an enterprise database system may yield millions of documents, and ranking millions of documents based on the number of terms of a phrase using such traditional methods can become computationally expensive.