The present invention relates generally to markup languages and more particularly to automatically abstracting markup language documents.
The explosion of incompatible non-PC devices that can access markup language documents from sources like the Internet has created tremendous opportunities and challenges. One reason for the incompatibility of these devices is their diverse capabilities. For example, a network administrator might be accessing the content of a Web page with a technologically advanced server computer while, at the same time, a stock broker is accessing the same page with a pager having minimal computing power, little memory and a small, low-resolution screen. The two users have different needs and preferences, yet both are trying to get information from the same source using very different devices.
Not only do devices have diverse capabilities and users diverse interests, but the connections to the devices often have significantly different characteristics as well. The network administrator might be accessing the Web page through a T1 line, while the stock broker is accessing the page over the air at 14.4 Kbits per second.
Neither the network administrator nor the stock broker wants to wait through a download only to find out that the information in the page is not what they are looking for. Both of them want to gain access instantaneously. This creates major problems for content providers, service providers and device manufacturers.
It should be apparent from the foregoing that there is a need to quickly provide an indication regarding information in a Web page or other markup language document. Time is of the essence. People with different interests using different types of devices and connections still want to access their desired information expediently.
The present invention provides methods and apparatus to automatically create a hyperlinked abstract of a markup language document. The abstract can be considered as a summarized version of the document. It occupies less bandwidth than the document, and can be transmitted to a user much faster, even if the user's computing system and connection are not very sophisticated. Through the abstract, the user can quickly become aware of the coverage of the document. If more detailed information is preferred, the user can follow the hyperlinks to access the corresponding materials in the document.
In one embodiment, the document is parsed to create a syntax tree, with one or more levels and one or more nodes at each level. Each node of the tree is analyzed statistically to collect information, which can be used to create an annotated syntax tree.
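As an illustration only, the parsing and statistical annotation described above can be sketched in Python using the standard `html.parser` module. The `Node` and `TreeBuilder` classes, the `parse` and `annotate` functions, and the particular statistics collected (text length, hyperlink count, child count) are assumptions made for this sketch; the description does not specify which statistics are gathered.

```python
from html.parser import HTMLParser


class Node:
    """One node of the syntax tree (hypothetical structure for this sketch)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""
        self.stats = {}


class TreeBuilder(HTMLParser):
    # Elements that never have a closing tag, so we must not descend into them.
    VOID = {"br", "img", "hr", "meta", "link", "input"}

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in self.VOID:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore stray closers.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def annotate(node):
    """Attach simple statistics to every node, bottom-up (assumed statistics)."""
    text_len = len(node.text)
    links = 1 if node.tag == "a" else 0
    for child in node.children:
        annotate(child)
        text_len += child.stats["text_len"]
        links += child.stats["links"]
    node.stats = {
        "text_len": text_len,
        "links": links,
        "children": len(node.children),
    }
    return node


def parse(markup):
    """Parse a markup string and return the annotated syntax tree."""
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    return annotate(builder.root)
```

Calling `parse` on a page yields a tree whose root carries totals for the whole document, while each inner node carries the totals for its own subtree. Those per-node figures are the kind of information later steps could use for classification and grouping.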
Based on the analysis, information at each node can be classified to create a classified tree. In one embodiment, a node can be in one of seven categories. Information at each classified node can also be represented in the syntax of a language that can be understood by an output device. Then, the tree is summarized.
The summarization step can be performed heuristically. One heuristic is based on an input from a user. Note that the heuristics can be embedded into software programs or hardware circuits.
In one embodiment, the summarization step includes grouping. The invention groups a predetermined number of nodes together, and may give this set of nodes a group-name. Due to grouping, the numbers of levels (renamed as group-levels) and nodes (renamed as group-nodes) in the tree are reduced. Each group encapsulates more information than any one of its member nodes.
This grouping process can depend on the output device and the connection to the output device. This grouping process can also depend on the class a node belongs to, and user preferences.
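A minimal sketch of such grouping follows, assuming that the predetermined group size is looked up from a device profile. The `GROUP_SIZE_BY_DEVICE` table, its values, the `GroupNode` class and the naming rule for groups are all hypothetical; they only illustrate how the group size might vary with the output device.

```python
from dataclasses import dataclass, field

# Hypothetical table: smaller devices get larger groups, hence fewer group-nodes.
GROUP_SIZE_BY_DEVICE = {"pager": 4, "phone": 3, "desktop": 2}


@dataclass
class GroupNode:
    name: str
    size: int = 0
    members: list = field(default_factory=list)


def group_nodes(nodes, device):
    """Bundle consecutive nodes into group-nodes of a device-dependent size."""
    group_size = GROUP_SIZE_BY_DEVICE[device]
    groups = []
    for start in range(0, len(nodes), group_size):
        members = nodes[start:start + group_size]
        # Assumed naming rule: first member name, or a range when there are several.
        if len(members) == 1:
            name = members[0].name
        else:
            name = f"{members[0].name} .. {members[-1].name}"
        groups.append(GroupNode(name, sum(m.size for m in members), members))
    return groups
```

For five sibling nodes, the `"desktop"` entry above yields three group-nodes, while the `"pager"` entry yields two, so the more constrained device receives a shorter abstract. A fuller version could also take the node class and user preferences into account, as the description suggests.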
Moreover, across every group-level, each group-node should be of similar importance, such that the variance in size across group-nodes at a group-level is low. A high variance at a group-level can imply that at least one of the group-nodes is occupying significantly more space than the others. That group-node can then be split into smaller group-nodes, which are considered to be at the same group-level as the original group-nodes with low variance. This can be done recursively until the variance among group-nodes at the group-level is low.
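The recursive splitting described above can be sketched as follows. The dictionary layout of a group-node (`name`, `size`, `children`), the use of population variance, and the `max_variance` threshold are assumptions for illustration; the description does not state how "low" variance is defined.

```python
from statistics import pvariance


def balance_level(level, max_variance):
    """Split the largest group-node until size variance at this level is low.

    `level` is a list of dicts with keys "name", "size" and "children".
    """
    sizes = [node["size"] for node in level]
    if len(level) < 2 or pvariance(sizes) <= max_variance:
        return level
    largest = max(level, key=lambda node: node["size"])
    if not largest["children"]:
        # The dominant group-node cannot be split further; stop here.
        return level
    i = level.index(largest)
    # The children take the place of the split node at the same group-level.
    new_level = level[:i] + largest["children"] + level[i + 1:]
    return balance_level(new_level, max_variance)


def leaf(name, size):
    """Helper for building example group-nodes without children."""
    return {"name": name, "size": size, "children": []}
```

For example, a level with sizes 10, 12 and 100 has a high variance because of the third group-node. If that node has children of sizes 30, 35 and 35, splitting it yields a level with sizes 10, 12, 30, 35 and 35, whose population variance is about 123.4, so a threshold of 200 is met after one split.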
The summarized tree occupies less bandwidth than the original document. It can therefore be transmitted to a user more quickly, giving the user a prompt indication regarding information in the document.
After summarization, the tree can be modified by an output-specific filter, and can then be sent to an output device. The output-specific filter can depend on the device, the connection to the device and the user preference.
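One way to picture an output-specific filter is as a function that renders the group-nodes of the summarized tree according to a device profile. The profile keys (`max_items`, `max_label_chars`, `markup`), the `href` field on each group-node and the two output syntaxes below are hypothetical choices for this sketch, not details taken from the description.

```python
from html import escape


def render_abstract(group_nodes, profile):
    """Render group-nodes as a hyperlinked abstract for one output device."""
    lines = []
    for node in group_nodes[: profile["max_items"]]:
        # Truncate labels so they fit the assumed width of the device screen.
        label = node["name"][: profile["max_label_chars"]]
        if profile["markup"] == "html":
            href = escape(node["href"], quote=True)
            lines.append(f'<a href="{href}">{escape(label)}</a>')
        else:
            # Plain-text fallback for devices without markup support.
            lines.append(f"{label} -> {node['href']}")
    return "\n".join(lines)
```

With a profile such as `{"max_items": 2, "max_label_chars": 10, "markup": "text"}`, a pager would receive only the first two entries with shortened labels, while a desktop profile with `"markup": "html"` would receive every entry as an anchor element. The connection speed and user preference could be folded into the same profile.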
Other aspects and advantages of the present invention will become apparent from the following detailed description, which, when taken in conjunction with the accompanying drawings, illustrates by way of example the principles of the invention.