Search engines discover and store information about documents such as web pages, typically deriving that information from the textual content of the documents. The documents are sometimes retrieved by a crawler or an automated browser, which may follow links in a document or on a website. Conventional crawlers typically analyze documents as flat text files, examining words and their positions (e.g., in titles, headings, or special fields). Data about analyzed documents may be stored in an index database for use in later queries. A query may include a single word or a combination of words.
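By way of illustration only, the word-and-position indexing described above can be sketched as a small inverted index. The function names (tokenize, build_index, search), the field names, and the sample documents are hypothetical and are not drawn from any particular search engine; the sketch assumes that a multi-word query matches only documents containing every query word.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Map each word to a list of (doc_id, field, position) postings.

    `documents` maps a document id to a dict of field name -> text.
    The field names used here are illustrative assumptions.
    """
    index = defaultdict(list)
    for doc_id, fields in documents.items():
        for field, text in fields.items():
            for position, word in enumerate(tokenize(text)):
                index[word].append((doc_id, field, position))
    return index

def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    doc_sets = [{doc_id for doc_id, _, _ in index.get(w, [])} for w in words]
    return set.intersection(*doc_sets)

# Hypothetical sample documents, each already reduced to flat text fields.
documents = {
    "doc1": {"title": "Search engines", "body": "A crawler follows links."},
    "doc2": {"title": "Crawlers", "body": "Search results are ranked."},
}
index = build_index(documents)

print(search(index, "search"))         # single-word query
print(search(index, "crawler links"))  # combination of words
```

Each posting records only a word, the field it appeared in, and its position, which is the kind of flat, text-level information a conventional index retains.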
The usefulness of a search engine depends on the relevance of the result set it returns. While there may be a large number of documents that include a particular word or phrase, some pages may be more relevant, popular, or authoritative than others. Thus, many search engines employ a variety of methods to rank the results. Some search engines rely on pre-programmed keywords that are predefined and/or hierarchically ordered. Other search engines generate the index automatically by analyzing the texts they locate.
Traditional search engines such as the ones discussed above retrieve document contents and index them as plain text. Different types of documents are typically treated alike, as collections of plain text. Thus, relationships between metadata/data defined in a document, as well as non-textual object-related data, are lost at crawl time. This loss of information, especially for documents that define structured data/metadata, prevents filtering and/or display of search results based on data/metadata structure.
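As a purely illustrative sketch of this loss, consider two hypothetical records whose metadata fields (author, reviewer) are assumptions made for the example. Once each record is flattened to plain text, a search for a name can no longer distinguish the role in which the name appears, whereas the structured form still supports filtering by field.

```python
def flatten(record):
    """Collapse a structured record into plain text, discarding field names."""
    parts = []
    for value in record.values():
        if isinstance(value, dict):
            parts.append(flatten(value))
        else:
            parts.append(str(value))
    return " ".join(parts)

# Hypothetical documents that define structured metadata.
records = [
    {"title": "Annual report", "metadata": {"author": "Smith", "reviewer": "Jones"}},
    {"title": "Design notes", "metadata": {"author": "Jones", "reviewer": "Smith"}},
]

# Plain-text view: the relationship between "Jones" and its field is gone.
flat = [flatten(r) for r in records]
text_hits = [i for i, text in enumerate(flat) if "Jones" in text]

# Structured view: the same data can be filtered by the author field.
author_hits = [i for i, r in enumerate(records)
               if r["metadata"]["author"] == "Jones"]

print(flat)
print(text_hits)    # both records mention "Jones" somewhere
print(author_hits)  # only the second record has "Jones" as author
```

The plain-text search matches both records, while the field-based filter matches only the record authored by Jones, which is the distinction that is unavailable once structure is discarded at crawl time.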