Notwithstanding the significant advances made in the past decades, electronic document technology continues to suffer from a number of disadvantages preventing users from fully realizing the benefits that may flow from advances in computing and related technology.
For example, a Web page that satisfies a given search expression typically includes constituents that do not satisfy the search expression. In many cases, a small proportion of the page's total content will be relevant to the search. If the user's goal is information that corresponds to the search expression, then delivering the entire Web page to the user entails a waste of download bandwidth and a waste of screen real estate. It also presents the user with the task of finding the relevant constituents within the Web page. Highlighting search terms on the page eases this task only slightly. The problem of presenting search results on mobile devices is especially acute.
Standard Web search engines return links to Web pages. Various search engines handle search requests that specify categories or instances of sub-document constituents. These may be called “sub-document” search engines. Some sub-document search engines are limited to returning text constituents. Other sub-document search engines return constituents that belong to non-text categories, but are limited to non-text categories that can be characterized by very simple markup properties. Some sub-document search engines use string-based algorithms to determine which constituents to extract. Other sub-document search engines use tree-based algorithms that examine very simple properties of markup trees. Yet other sub-document search engines support highly expressive languages for specifying constituents. None of these sub-document search engines effectively exploits the inter-relationships of sub-document constituents, as these inter-relationships are reflected in document tree structures (or other document graph structures) and document layout structures.
Various search engines handle search requests that specify proximity relationships. Some search engines are fundamentally limited to string-based proximity relationships. Other search engines recognize constituent boundaries in order to ignore these boundaries. Other search engines recognize when search terms occur within the same constituent. None of these search engines effectively exploits structural proximity relationships that are based on properties of the tree structures (or other graph structures) and layout structures of documents.
Co-occurrences of search terms within documents are evidence that the search terms are mutually relevant. Moreover, relevance is transitive. Current systems use learning algorithms that leverage these principles to produce responses to search requests in which, in some cases, the response does not include any of the words contained in the request. These systems require a learning process.
The very limited download bandwidth and screen real estate associated with mobile devices have motivated the creation of the WAP (Wireless Application Protocol) network. Because building a WAP site is labor intensive, the WAP network remains extremely small in comparison to the World Wide Web, and has correspondingly less to offer users. For purposes of search, the World Wide Web is a vastly more powerful resource than the WAP network.
Limited download bandwidth and limited screen real estate have also motivated the creation of browsers that reformat HTML files for presentation on mobile devices. These mobile browsers reformat content so that horizontal scrolling is reduced. They may introduce page breaks into tall pages. They may remove or replace references to large files. They may replace fonts. They may offer distinctive user interfaces. Similar functionality is also offered by server transcoders that intercept user requests for HTML files. Such a server transcoder may be applied to reformat Web pages that satisfy search criteria. Current mobile browsers and server transcoders offer at most very rudimentary content extraction facilities, based on limited ranges of simple criteria.
Another limitation of current technology involves false hits for complex search expressions. Suppose that a given Web page contains a constituent N1 that contains a single occurrence of the term haydn but doesn't contain the term boccherini. Suppose further that the page contains a constituent N2 that contains a single occurrence of the term boccherini but doesn't contain the term haydn. And suppose that the page contains just this one occurrence of haydn and just this one occurrence of boccherini. Now suppose that a user searches the Web with the intention of finding information that pertains to both haydn and boccherini. While the Web page contains occurrences of both haydn and boccherini, the page may or may not satisfy the user's search request. Whether it does depends in part on the characteristics of N1 and N2, and on the relationship of these constituents within the Web page. Current technology is unable to use the correspondence of search expressions to sub-page constituents to reduce the incidence of false hits.
Similarly, current technology is unable to use the correspondence of search expressions to sub-page constituents to produce correct sub-page hits for search expressions with irreducible negation. Suppose that the search expression “haydn and not boccherini” is applied to the Web page described in the preceding paragraph. Constituent N1 satisfies this expression, but the page as a whole does not. Given that the user's request can be satisfied with sub-page constituents, systems that are limited to returning entire pages will not provide optimal responses.
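The difference between page-level and constituent-level evaluation in the two preceding examples can be illustrated with a minimal sketch. The sketch is illustrative only: the Constituent class, the matches function, and the sample texts for N1 and N2 are hypothetical names and data introduced here, and simple whitespace tokenization stands in for whatever term matching a real search engine would use.

```python
from dataclasses import dataclass


@dataclass
class Constituent:
    """Hypothetical stand-in for a sub-page constituent such as N1 or N2."""
    name: str
    text: str


def matches(text, required, excluded=()):
    """True if text contains every required term and no excluded term."""
    words = set(text.lower().split())
    return (all(term in words for term in required)
            and not any(term in words for term in excluded))


# Assumed sample page: one occurrence of each term, in separate constituents.
page = [
    Constituent("N1", "notes on haydn"),
    Constituent("N2", "notes on boccherini"),
]
page_text = " ".join(c.text for c in page)

# "haydn and boccherini": the page as a whole matches, but no single
# constituent does, so a page-level hit may be a false hit.
page_hit_and = matches(page_text, ["haydn", "boccherini"])
constituent_hits_and = [c.name for c in page
                        if matches(c.text, ["haydn", "boccherini"])]

# "haydn and not boccherini": the page as a whole fails, but N1 matches.
page_hit_not = matches(page_text, ["haydn"], excluded=["boccherini"])
constituent_hits_not = [c.name for c in page
                        if matches(c.text, ["haydn"], excluded=["boccherini"])]

print(page_hit_and, constituent_hits_and)
print(page_hit_not, constituent_hits_not)
```

In this sketch, the conjunctive expression is satisfied by the page but by no individual constituent, while the expression with negation is satisfied by N1 but not by the page. Whether the page-level hit for the conjunction is actually a false hit still depends on the characteristics of N1 and N2 and on their relationship within the page, which this simplified sketch does not model.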