The continued growth and popularity of the Internet and of company Intranets and Extranets as sources of information has resulted in an information explosion for users. This has led to a demand from users to visually verify search result relevancy through previewing prior to having to download the actual content. This preview functionality is expected to be an integrated part of the overall information search experience. Typically, when a human user is looking for information on the Internet on a particular subject, he or she will use a public search engine such as Google or Yahoo Search.
Generally speaking, a search engine is a program that performs a search based on a user search query (e.g., keyword(s) or a phrase) and sends the search results back to the user. Typically, these result lists include a listing of hyperlinks to the web pages or other documents produced by the search, together with additional information such as the file type of the result document and an excerpt of the text on the page that relates to the keywords entered by the user. Techniques such as a Boolean query language may be used to create a search phrase and to limit and narrow down the number of search hits.
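As an illustration of how a Boolean query narrows down the number of search hits, the following is a minimal sketch only. It assumes a toy in-memory index; the function names `build_index` and `boolean_search`, the parameter names, and the sample documents are hypothetical and are not taken from any particular search engine.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def boolean_search(index, all_ids, must=(), any_of=(), must_not=()):
    """Return sorted ids matching every `must` term (AND), at least one
    `any_of` term when given (OR), and none of the `must_not` terms (NOT)."""
    hits = set(all_ids)
    for term in must:
        hits &= index.get(term.lower(), set())
    if any_of:
        either = set()
        for term in any_of:
            either |= index.get(term.lower(), set())
        hits &= either
    for term in must_not:
        hits -= index.get(term.lower(), set())
    return sorted(hits)


# Hypothetical sample documents, for illustration only.
docs = {
    1: "thumbnail preview of a web page",
    2: "blog search with preview",
    3: "blog archive without images",
}
index = build_index(docs)

print(boolean_search(index, docs, must=["preview"]))                     # [1, 2]
print(boolean_search(index, docs, must=["preview"], must_not=["blog"]))  # [1]
```

Adding the NOT term removes one of the two hits, which is the narrowing effect described above.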
In the case of typical Internet, Intranet or Extranet content such as Extensible HyperText Markup Language (XHTML) and HyperText Markup Language (HTML) files, search results may include a cached version of the content, stored and managed by the search engine, as it was at the time the search engine carried out its content crawling and indexing activity. The cached version of the content may be a full copy of the original content or a stripped-down version of it. Later, in the context of explaining this invention, the concept of “data file” is used to describe various forms of HTML and XHTML formatted data streams, which may be stored in static files or dynamically generated as a response to a query delivered by an appropriate communication protocol such as HyperText Transfer Protocol (HTTP).
Often, search engines cache textual content only, leaving out graphics and other multimedia components. In some cases the cached content contains links to the multimedia objects, and if such linked data is still available online, viewing the cached version means relying on an old version of the content bundled with the currently available graphics. If the associated multimedia objects have changed since indexing, or are not available at all, this approach may significantly degrade the visual aspects of the content layout and its look-and-feel. This method serves neither the users' need for fast information access into long XHTML/HTML files nor the demand for instant discovery of those parts of the content that match the search criteria.
In order to find the matching part of a long XHTML/HTML content file, the user has to manually scroll and read through the content until he or she finds the (possibly highlighted) search term, or alternatively carry out a secondary search using the content search functionality embedded in the Web browser. This process requires additional effort from the user and is cumbersome for long content files such as news, blogs or articles on the Internet as well as in corporate Extranets and Intranets.
In some cases the search listing contains visual presentations (also known as thumbnails) of web pages, still images, or the first frame or multiple frames of video content. In the case of Web document thumbnails, the rectangular upper part of the XHTML/HTML page is rendered as a bitmap and resized in order to create a visual abstract of the upper part of the page. It is well known to those skilled in the art that rendering means processing a document for visual representation. The rendering engine of the web browser essentially processes formatting instructions and converts them into graphical elements, determines the layout, and calculates the overall appearance of the document.
The above-described thumbnail presentation may perform acceptably for web documents whose content is sufficiently short, allowing all of the content in the source XHTML/HTML page to be conveniently rendered into a standard screen size, aspect ratio and resolution available for thumbnail viewing. After the content is rendered into the intended viewing size using a virtual canvas, it is often scaled down according to specified thumbnail dimensions, providing a high-level preview of the web page.
The thumbnail dimensions vary among different services, but as the goal is to provide a visual preview of the upper part of the web page while leaving room for some concurrently visible content on that page, the width of the thumbnail is often less than half of the intended rendered size. These small dimensions combined with a high compression factor of the bitmap image make it difficult to read small text rendered into the thumbnail—only large high-level details are visible and distinguishable.
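The effect of scaling on text legibility can be shown with simple arithmetic. The sketch below is illustrative only: the 1024x768 rendering size, the 256-pixel thumbnail width, and the 12-pixel font height are assumed example values rather than figures from any specific service, and the function names are hypothetical.

```python
def thumbnail_size(render_width, render_height, thumb_width):
    """Scale a rendered viewport down to a thumbnail width, preserving the
    aspect ratio. Returns (width, height, scale factor)."""
    scale = thumb_width / render_width
    return thumb_width, round(render_height * scale), scale


def scaled_text_height(font_px, scale):
    """Height in pixels of text after the viewport has been scaled down."""
    return font_px * scale


# Assumed example values: a 1024x768 virtual canvas and a 256 px wide thumbnail.
width, height, scale = thumbnail_size(1024, 768, 256)
print(width, height, scale)           # 256 192 0.25

# Body text rendered at an assumed 12 px shrinks to about 3 px in the thumbnail.
print(scaled_text_height(12, scale))  # 3.0
```

At roughly three pixels per line of text, individual characters can no longer be resolved even before image compression is applied, which is consistent with only large, high-level details remaining distinguishable.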
While the above-described method works well for short XHTML/HTML content, it has significant shortcomings when content files are long, spanning multiple pages when printed out. The length of such files, for example blogs, is expected to grow as new textual content is often appended at the end of the file. This is a typical situation with news feed services, discussion groups, and blogs, all of which are experiencing significant growth in usage volumes both on the Internet and in corporate Extranets and Intranets.
When such long XHTML/HTML content is paginated, for example in order to print it, it is quite common for a single XHTML/HTML page to span tens of separate pages. In such cases it is evident that providing just the rectangular upper part of the XHTML/HTML page is not sufficient, because the searched keyword may be located outside the preview area. When previews are provided with search term highlighting or other context-sensitive enhancements, such partial previews may completely miss the relevant content that the search was originally targeted at. For the end user, this kind of partial content presentation causes several usability issues when previews are used to enhance search results.
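The shortcoming can be made concrete with a small calculation. The sketch below is illustrative only and assumes that the vertical pixel offset of a keyword in the fully rendered document is already known; the function names, the 1000-pixel page height, the 768-pixel preview height, and the document dimensions are all hypothetical example values.

```python
import math


def page_of_offset(y_offset, page_height):
    """Return the 1-based page on which a vertical pixel offset falls."""
    return y_offset // page_height + 1


def visible_in_top_preview(y_offset, preview_height):
    """True if the offset lies inside a preview of the upper part only."""
    return 0 <= y_offset < preview_height


# Assumed example values, in pixels of the rendered document.
page_height = 1000    # height of one paginated slice
doc_height = 23500    # total rendered height of a long blog page
keyword_y = 17250     # where the matching search term was rendered

print(math.ceil(doc_height / page_height))     # 24 pages in total
print(page_of_offset(keyword_y, page_height))  # 18
print(visible_in_top_preview(keyword_y, 768))  # False
```

In this example the match lies on page 18 of 24, far below the region covered by a preview of the upper part of the page.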
One of the typical ways to share search findings in Internet and Intranet environments is to send a bookmark to other users. This allows the other users to directly open a document that another user has reviewed and found to contain relevant and interesting data. These bookmarks are often links to the document file as a whole rather than accurate pointers to the interesting sections of the document. This document-level link accuracy causes considerable additional effort for long XHTML/HTML documents when the content is previewed and screened by other users. To locate the relevant part of a long document, the other users need either to scroll and browse through the document to find the relevant keywords, or to find the appropriate position with a secondary, browser-based string-search functionality.
When paginating and previewing long XHTML/HTML documents, visual accuracy and the capability to reproduce the original layout characteristics are key features needed to provide a good user experience. Because the original XHTML/HTML content typically does not contain pagination information such as page breaks, the preview generation process should be able to define and enforce pagination logic that divides long XHTML/HTML content into logical, readable slices emulating typical per-page printing behavior. However, as XHTML/HTML content may have specific style definitions for printing purposes, emulating printer behavior alone is not sufficient. The system should be able to accurately reproduce the visual aspects of the XHTML/HTML content just as it would be viewed through a browser.
The system should also be able to uniquely identify and mark these paginated preview pages for page-level bookmarking and content sharing purposes. Enabling direct access into an area of XHTML/HTML content containing search keywords or other unique identifiers improves accessibility and discoverability of information content.
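One conceivable way to mark paginated preview pages for page-level bookmarking is sketched below. This is an assumption made purely for illustration: the text above does not prescribe any identifier format, and the `page=N` URL fragment convention, the function names, and the example URL are all hypothetical.

```python
from urllib.parse import parse_qs, urlsplit, urlunsplit


def page_bookmark(url, page):
    """Build a page-level bookmark by storing the page number in the URL
    fragment (hypothetical `page=N` convention)."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(fragment=f"page={page}"))


def parse_page_bookmark(bookmark):
    """Split a bookmark into (document URL, page number or None)."""
    parts = urlsplit(bookmark)
    values = parse_qs(parts.fragment).get("page")
    page = int(values[0]) if values else None
    return urlunsplit(parts._replace(fragment="")), page


# Hypothetical document URL, for illustration only.
bookmark = page_bookmark("http://intranet.example/blog/archive.html", 18)
print(bookmark)
# http://intranet.example/blog/archive.html#page=18

print(parse_page_bookmark(bookmark))
# ('http://intranet.example/blog/archive.html', 18)
```

A recipient of such a bookmark could be taken directly to the preview page containing the relevant content instead of to the top of the document.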