The Internet is a worldwide system of computer networks in which a user at any computer can, with the proper permissions, obtain information from any other computer. Using the Internet, a user has access to millions of pages of information. Typically, Internet browsers are used to access information on the Internet. For personal computers (PC), the most popular browsers are Microsoft Internet Explorer, Opera and Firefox. One can also browse the Internet with a smaller device, such as a cellular telephone, using a micro browser such as Opera.
Web or Internet browsers are software programs that help a user navigate the Internet and access text, graphics, hyperlinks, audio, video, and other multimedia. Browsers operate by translating or interpreting hypertext mark-up language (HTML), which is the code embedded in web pages that tells the respective page how to look and behave. Browsers read this code and display the web page accordingly.
A typical Internet user visits a fairly small number of websites on a regular basis. Most websites attempt to make their pages appeal to the widest possible audience. As a result of this, average users are generally only interested in thoroughly reading some subset of the content, even on their most frequently visited sites.
On the personal computer and other large-screen devices, this form of content overload still provides a generally satisfying user experience. Users become accustomed to loading large pages and scanning over the content until they reach the sections of interest. However, on devices with more limited form factors, such as mobile telephones and personal digital assistants, this type of casual “surfing” does not provide an acceptable user experience. This is because download times are often long, data transfer is costly, and, more importantly, it is very difficult to navigate pages to find the items of interest due to the small screen size.
Web pages are often an aggregation of a number of small information items. These information items often occupy a small, specific area (for example, a small rectangular region) of the entire page and usually focus on a specific subject. Users could, for example, view a commercial news web site and identify many visual information items, which are also known as segments. These segments include items such as the main headline, the navigation bar, the company logo, the secondary headline, world news, local news, etc. These segments are important because content owners put a great deal of effort into grouping items to make the content easy to read and navigate. Even though this microstructure of the web page can be easily discerned by the naked eye, not all web pages explicitly identify and specify segments in a way that can be used by computer programs.
Information items are useful in a number of applications. For example, one application provides an identification mechanism for information items. Another application involves the automatic identification of static and dynamic information items in web resources. Static information items exhibit negligible or no change over time, while dynamic information items may change frequently over time. For example, on a news web page, static information items may include the navigation bar, the company logo, the search bar, copyright notices, etc., while dynamic information items include the main headline, the secondary headlines, and individual news stories. The ability to serve only the dynamic information items (deltas) to mobile clients can decrease the download times of web content, as well as reduce resource usage of mobile terminals and congestion in wireless networks.
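The distinction between static and dynamic information items can be illustrated with a minimal sketch. It assumes that each information item already has a stable identifier and that a digest of its previous content is available; the names digest, classify_items, and dynamic_delta are hypothetical and are not taken from any existing system.

```python
import hashlib


def digest(content: str) -> str:
    """Return a SHA-256 digest of an item's content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def classify_items(previous_digests, current_items):
    """Label each item 'static' if its content is unchanged, else 'dynamic'.

    previous_digests maps a (hypothetical) item identifier to the digest of
    the content seen on an earlier visit; current_items maps the same
    identifiers to the content seen now. Items with no earlier digest are
    treated as dynamic.
    """
    labels = {}
    for item_id, content in current_items.items():
        unchanged = previous_digests.get(item_id) == digest(content)
        labels[item_id] = "static" if unchanged else "dynamic"
    return labels


def dynamic_delta(current_items, labels):
    """Keep only the dynamic items, i.e. the delta to send to a client."""
    return {i: c for i, c in current_items.items() if labels[i] == "dynamic"}


# Illustrative data only: two snapshots of the same news page.
yesterday = {"logo": "<img src='logo.png'>", "headline": "<h1>Markets rise</h1>"}
today = {"logo": "<img src='logo.png'>", "headline": "<h1>Storm hits coast</h1>"}

previous = {item_id: digest(content) for item_id, content in yesterday.items()}
labels = classify_items(previous, today)
print(dynamic_delta(today, labels))
```

In this sketch a single comparison decides the label; a practical system would presumably observe an item over several visits before treating it as static.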
Existing segment identification algorithms have two main inputs: the web resource and device constraints. The output from segment identification algorithms is one or more segments. The web resource is usually an HTML page or any resource that can be transcoded to extensible HTML (XHTML) or HTML. Device constraints are a set of parameters, such as the granularity of a segment (expressed as its byte size or the visual area it occupies), the total number of segments, etc. Segments are themselves web resources (in the form of sub-documents) that can be interpreted and displayed by any browser.
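The inputs and outputs described above can be sketched as a small interface. The following is an illustration only, under the assumption that the web resource has already been split into top-level HTML blocks; DeviceConstraints and pack_segments are hypothetical names, and the greedy packing is a stand-in for whatever analysis a real algorithm performs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DeviceConstraints:
    max_segment_bytes: int  # segment granularity, expressed as byte size
    max_segments: int       # upper bound on the total number of segments


def pack_segments(blocks: List[str], constraints: DeviceConstraints) -> List[str]:
    """Greedily pack consecutive blocks into segments within the byte limit.

    Each returned segment is wrapped as a small stand-alone sub-document.
    A single block larger than the limit becomes its own oversized segment,
    because this sketch never splits a block.
    """
    segments: List[str] = []
    current = ""
    for block in blocks:
        candidate = current + block
        too_big = len(candidate.encode("utf-8")) > constraints.max_segment_bytes
        if current and too_big:
            segments.append(current)  # close the segment that is full
            current = block           # start a new one with this block
        else:
            current = candidate
    if current:
        segments.append(current)
    if len(segments) > constraints.max_segments:
        raise ValueError("segment count exceeds the device constraint")
    return [f"<html><body>{s}</body></html>" for s in segments]


blocks = ["<p>aaaa</p>", "<p>bbbb</p>", "<p>cccc</p>"]
for sub_document in pack_segments(blocks, DeviceConstraints(25, 4)):
    print(sub_document)
```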
The core functionality of segment identification algorithms consists of an analysis phase on the web resource, where a combination of the mark-up tags, structural aspects of the document, and layout styles is used for segment identification. Mark-up tags convey important grouping and positioning semantics. For example, in an HTML document, all of the children of a TABLE, LIST, FORM, or PARAGRAPH element have higher cohesion than the children of a BODY element, and an embedded OBJECT in the document is conceptually a sub-document that is loosely bound to the rest of the content. The document structure, on the other hand, conveys structural information that is also useful for segmentation. Two nodes of the tree that share a common parent are better candidates to be placed in the same segment than two nodes that share only a common grandparent. Segment identification algorithms also employ cascading style sheets (CSS) styles, such as borders and background colors, for segment identification. For example, all children of a node can be placed in a segment if that node has a BORDER style.
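A tag-based analysis of this kind can be sketched with Python's standard html.parser module. The sketch assumes well-formed mark-up in which every cohesive element is explicitly closed; the COHESIVE_TAGS set and the CohesiveSegmenter class are hypothetical choices made for illustration and do not reproduce any particular existing algorithm.

```python
from html.parser import HTMLParser

# Hypothetical choice of tags whose children are treated as highly cohesive.
COHESIVE_TAGS = {"table", "ul", "ol", "form", "p"}


class CohesiveSegmenter(HTMLParser):
    """Collect the text of each outermost cohesive element as one segment."""

    def __init__(self):
        super().__init__()
        self.segments = []  # list of (tag, text) pairs
        self._open = []     # currently open cohesive tags
        self._text = []     # text gathered for the segment in progress

    def handle_starttag(self, tag, attrs):
        if tag in COHESIVE_TAGS:
            self._open.append(tag)

    def handle_endtag(self, tag):
        if self._open and tag == self._open[-1]:
            self._open.pop()
            if not self._open:  # the outermost cohesive element has closed
                self.segments.append((tag, " ".join(self._text)))
                self._text = []

    def handle_data(self, data):
        # Text directly under BODY (outside any cohesive element) is ignored.
        if self._open and data.strip():
            self._text.append(data.strip())


page = (
    "<body><h1>Site</h1>"
    "<ul><li>Home</li><li>News</li></ul>"
    "<p>Story <b>text</b></p></body>"
)
segmenter = CohesiveSegmenter()
segmenter.feed(page)
segmenter.close()
print(segmenter.segments)
```

Nested cohesive elements, such as a paragraph inside a list item, stay within the segment of their outermost cohesive ancestor, which mirrors the idea that nodes sharing a nearer common ancestor belong together.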
There are currently several limitations with existing segment identification algorithms. Relying only upon the mark-up tags and the structure of a web resource has various drawbacks. Such an approach fails to account for the microstructure of the document. Taking the microstructure into account improves segmentation. In addition, nodes in the document tree that are far apart, and nodes that have a weak semantic binding, can appear as neighbors when rendered by the browser. Similarly, bad HTML, which is endemic on the Internet today, compounds the problem of segment identification because the structure that the author is trying to convey is open to interpretation.
Currently, segment identification algorithms do not fully utilize the vast amount of information that is generated by the browser during document layout. Further geometrical analysis of the layout trees can reveal whether segments are aligned on the left, top, right, or bottom. For a given segment, it is possible to detect the neighboring segments on both the X and Y axes, as well as the distance between the corresponding edges. The greater the distance between the segment edges, the smaller their chances of being merged. Furthermore, merging segments based on alignments will yield better results than merging based on distance thresholds or constraints such as visual size or byte size. In addition, shapes of segments can be deduced from layout trees. For example, two segments are better merge candidates if they have similar shapes and dimensions or if they form a polygon that meets the device constraints (such as display size). It is always desirable to merge segments that share similar backgrounds.
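The geometric checks mentioned above, namely edge alignment and the distance between neighboring edges, can be sketched on plain rectangles. The sketch assumes that each segment's layout box is available as a left/top/width/height rectangle in pixels; Box, shared_alignments, vertical_gap, and the one-pixel tolerance are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned layout rectangle of a segment, in pixels."""
    left: float
    top: float
    width: float
    height: float

    @property
    def right(self) -> float:
        return self.left + self.width

    @property
    def bottom(self) -> float:
        return self.top + self.height


def shared_alignments(a: Box, b: Box, tolerance: float = 1.0) -> set:
    """Return the edges ('left', 'top', 'right', 'bottom') on which a and b align."""
    edges = ("left", "top", "right", "bottom")
    return {e for e in edges if abs(getattr(a, e) - getattr(b, e)) <= tolerance}


def vertical_gap(a: Box, b: Box) -> float:
    """Distance between the upper box's bottom edge and the lower box's top edge.

    Returns 0.0 when the boxes touch or overlap on the Y axis.
    """
    upper, lower = (a, b) if a.top <= b.top else (b, a)
    return max(0.0, lower.top - upper.bottom)


# Illustrative boxes: a headline directly above a story of the same width.
headline = Box(left=10, top=0, width=300, height=40)
story = Box(left=10, top=50, width=300, height=200)
print(sorted(shared_alignments(headline, story)), vertical_gap(headline, story))
```

Two boxes that share left and right edges and are separated by a small vertical gap, as in this example, would be natural merge candidates under the alignment-based reasoning described above.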
There are two types of constraints that are used for detecting segments: hard constraints and soft constraints. Hard constraints are imposed by the device, such as display size and memory. Hard constraints impose restrictions on detecting new segments, combining two or more segments to form a new segment, or refining an existing segment, and may result in segments that do not conform to the expectations of the end user. Soft constraints are device agnostic and allow for the discovery of the natural structure of the web resource, such as the number of segments.
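One way to picture the difference is to treat hard constraints as a filter and soft constraints as a preference. The following sketch is an assumption about how the two could be combined, not a description of an existing algorithm; HardConstraints, Candidate, and choose_segmentation are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HardConstraints:
    display_width: int   # pixels
    display_height: int  # pixels
    memory_bytes: int    # largest segment the device can hold


@dataclass
class Candidate:
    width: int
    height: int
    size_bytes: int


def violates_hard(segment: Candidate, hard: HardConstraints) -> bool:
    """A segment is rejected outright if it exceeds any device limit."""
    return (
        segment.width > hard.display_width
        or segment.height > hard.display_height
        or segment.size_bytes > hard.memory_bytes
    )


def choose_segmentation(
    segmentations: List[List[Candidate]],
    hard: HardConstraints,
    preferred_count: int,
) -> Optional[List[Candidate]]:
    """Drop segmentations that break a hard constraint, then pick the one
    whose segment count is closest to the soft preference."""
    feasible = [
        s for s in segmentations
        if not any(violates_hard(c, hard) for c in s)
    ]
    if not feasible:
        return None
    return min(feasible, key=lambda s: abs(len(s) - preferred_count))


hard = HardConstraints(display_width=240, display_height=320, memory_bytes=4096)
two_parts = [Candidate(200, 100, 1000), Candidate(200, 100, 1000)]
one_wide = [Candidate(400, 100, 2000)]  # wider than the display
four_parts = [Candidate(100, 50, 500) for _ in range(4)]
best = choose_segmentation([one_wide, four_parts, two_parts], hard, preferred_count=2)
print(len(best))
```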
Web content authors employ a number of techniques to achieve a desired layout, as well as the look and feel for a web resource. For example, web page authors use transparent images in order to adjust the spacing between content. This results in large segments with very little content, or empty segments with no content. In addition, there are no constructs in HTML to determine whether an image was used for spacing purposes. This implies that the image must be downloaded regardless of whether it is actually visible to the user. In a second technique, HTML content authors often do not specify the title of a paragraph with a header element. Instead, they mark the content with a bold or font tag to create the effect of a title. The paragraph is usually rendered below the title and positioned so as to create a grouping effect. When such content is segmented, the title and the content may end up in different segments.
In a third technique, HTML table layout is used to display content in a tabular fashion to the user. Even though tables support a notion of column headers, these mechanisms are rarely used to specify column headers for tabular content. In practice, the column headers might end up in an entirely different table, as a cell within a table, or be aligned with the data, thereby ending up in separate segments. In a fourth technique, authors use a combination of absolute positioning, alignment, text styling, etc. to display content as a list. When such content is segmented, the list items may end up in independent segments.
The segmentation of web resources, when performed on a mobile device, presents a unique set of problems in terms of memory and performance. Due to these concerns, a segmentation algorithm on a mobile device should be able to work on the data stream, processing data as it arrives rather than once all of the content has been downloaded. Segment identification algorithms currently output one or more segmented web resources. Outputting segment identifiers instead, which are high-level descriptions of segments that can be used to uniquely identify and extract a segment from the web resource, saves computational and power resources on mobile devices.
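Both ideas, processing the data stream as it arrives and emitting identifiers rather than sub-documents, can be sketched with Python's standard html.parser module, whose feed method accepts input in arbitrary chunks. The path-style identifier format, the TARGETS set, and the StreamingIdentifier class are hypothetical, and the sketch assumes that every non-void element is explicitly closed.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they must not stay on the path.
VOID = {"br", "img", "hr", "input", "meta", "link"}


class StreamingIdentifier(HTMLParser):
    """Emit a path-style identifier as soon as a target element opens."""

    TARGETS = {"table", "ul", "form"}  # hypothetical segment roots

    def __init__(self):
        super().__init__()
        self.identifiers = []
        self._path = []      # (tag, index) for each open element
        self._counts = [{}]  # per-depth counters of child tags seen so far

    def handle_starttag(self, tag, attrs):
        siblings = self._counts[-1]
        index = siblings.get(tag, 0)
        siblings[tag] = index + 1
        if tag in VOID:
            return  # counted as a child, but never pushed on the path
        self._path.append((tag, index))
        self._counts.append({})
        if tag in self.TARGETS:
            steps = (f"{t}[{i}]" for t, i in self._path)
            self.identifiers.append("/" + "/".join(steps))

    def handle_endtag(self, tag):
        if self._path and self._path[-1][0] == tag:
            self._path.pop()
            self._counts.pop()


page = (
    "<html><body><ul><li>a</li></ul><p>x<br></p>"
    "<ul><li>b</li></ul></body></html>"
)
parser = StreamingIdentifier()
for start in range(0, len(page), 7):  # simulate data arriving in 7-byte chunks
    parser.feed(page[start:start + 7])
parser.close()
print(parser.identifiers)
```

Because only the current path and a few counters are kept, the memory used by this sketch grows with the nesting depth of the document rather than with its total size.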