The user-friendly keyword search paradigm that has proven successful for searching the unstructured content of text databases and the Web is also attractive as a means of searching structured and semi-structured data. This invention addresses the problem of how to apply keyword search to structured and semi-structured data. The fundamental obstacle is that keyword search relies on matching query keywords against unstructured data whose semantics is lexically defined, whereas the semantics of structured and semi-structured data is largely defined by its schema or other metadata rather than by its lexical content.
Existing approaches to enabling keyword search on structured and semi-structured data use ad hoc heuristics to automate the identification of semantic content in database schemas and allow this content to contribute to keyword matches. The combination of keyword matches arising from the new content extracted from schemas and from the existing structured content is then used to reformulate the keyword query into a query in the database's native query language and to retrieve results. These approaches suffer from the following problems:
1. The heuristics used to extract semantic content from schemas typically make naive assumptions about the properties of schemas, which can result in extracted content that leads to poor precision and recall.
2. The structured queries into which the keyword queries are transformed do not support ranking search results according to a relevancy score, and therefore require the creation of new mechanisms for relevancy calculation rather than leveraging the highly evolved methods used by full-text search engines.
3. Structurally distinct data cannot be composed to represent the semantics of compound concepts.
4. Content is not linguistically well-formed and does not support searches that specify the order and proximity of query keywords as a means to improve precision.
5. Coded data and other lexically incoherent structured data is not addressed.
6. No accommodation is made for the case where the database contains both structured and unstructured content.
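For illustration only, the general style of the existing approaches described above may be sketched as follows. This is a minimal sketch under stated assumptions, not a reproduction of any particular prior system: the function `build_query`, the `employee` table, its columns, and the matching heuristic (a keyword equal to a column name is treated as a schema match; any other keyword must appear in some stored value) are all hypothetical. SQLite is used only because it is readily available.

```python
import sqlite3


def build_query(conn, table, keywords):
    """Reformulate a keyword query into SQL using a naive schema heuristic.

    Hypothetical heuristic (an assumption for illustration):
    - a keyword equal to a column name selects that column;
    - any other keyword must occur as a substring of some column value.
    """
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    by_lower = {c.lower(): c for c in columns}

    selected, conditions, params = [], [], []
    for kw in keywords:
        if kw.lower() in by_lower:
            selected.append(by_lower[kw.lower()])  # schema match
        else:
            # value match: the keyword may appear in any column
            like = " OR ".join(f"CAST({c} AS TEXT) LIKE ?" for c in columns)
            conditions.append(f"({like})")
            params.extend([f"%{kw}%"] * len(columns))

    select = ", ".join(selected) if selected else "*"
    where = " AND ".join(conditions) if conditions else "1"
    return f"SELECT {select} FROM {table} WHERE {where}", params


# Hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [("Ann Lee", "Sales", 50000), ("Bob Ray", "Sales", 60000), ("Cy Poe", "Legal", 70000)],
)

# Keyword query "salary sales": one schema match, one value match.
sql, params = build_query(conn, "employee", ["salary", "sales"])
rows = conn.execute(sql, params).fetchall()
print(sql)
print(rows)
```

The rows returned by the generated query form an unordered set with no relevancy score attached, which is the situation described in the second problem listed above; the substring matching likewise carries no notion of keyword order or proximity.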