The present invention relates generally to electronic tree-structured data, available over a network such as the Internet, for entities available at one or more Web sites. More particularly, the invention relates to an automated system and associated method for building a comprehensive database of a configurable entity (that is, an entity offered with different options) that is available from one or more Web sites, while removing redundancies.
The World Wide Web (WWW) comprises an expansive network of interconnected computers upon which businesses, governments, groups, and individuals throughout the world maintain inter-linked computer files known as Web pages. One of the key features of the Web is the wide array and large amount of information available to users. With the vast number of WWW sites and the potentially large amount of data available from any given site, redundancy of information is commonplace. This redundancy can limit the effectiveness of searches by simply overwhelming the search engine and/or its user.
In contradistinction, a limited display of data at Web sites, either by design or by constraint, often limits the usefulness and utility of the information. Often the Web site contains only static, predefined information, making it difficult to extract the needed information from the site. Thus, there is a need among individual users, as well as e-businesses, for a data mining tool that can filter the information available on the Web, removing redundancy and extracting data that has been purposely made difficult to decode, decipher or analyze.
The introduction of Web content based on the Extensible Markup Language (XML) has spawned immense growth in the number of publicly available documents that contain tree-structured data. Each tree structure has a root from which branches, nodes and leaves may emanate. Inherently, tree structures allow information to be represented in very fine detail and data to be logically grouped, often into entities known as subtrees. Also inherent in the tree structure of XML is the potential for eliminating redundancy with an appropriate algorithm. Removing the redundancy can greatly enhance the viability and utility of the data. Subtree extraction can also make otherwise undecipherable information obvious to the user.
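By way of illustration only, the root, nodes and leaves of such a tree can be enumerated with a few lines of Python. The XML fragment, its element names and the helper `leaves` are hypothetical and are not taken from any actual Web site or from the invention itself; the sketch merely shows how one published configuration can be viewed as a tree.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML for one published computer configuration (illustrative only).
PAGE = """
<computer model="X100">
  <disk>20GB</disk>
  <memory>128MB</memory>
  <price>999</price>
</computer>
""".strip()

def leaves(node, path=""):
    """Yield (path, text) for every leaf element at or below `node`."""
    here = f"{path}/{node.tag}"
    children = list(node)
    if not children:
        # A leaf: an element with no child elements.
        yield here, (node.text or "").strip()
    for child in children:
        # Each child is the root of a subtree of its own.
        yield from leaves(child, here)

root = ET.fromstring(PAGE)
leaf_list = list(leaves(root))
for path, text in leaf_list:
    print(path, text)
```

Each path printed runs from the root to one leaf, so a group of leaves sharing a common prefix corresponds to a subtree.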
For example, a situation may exist where an individual or a corporation wishes to build a comprehensive database of some configurable entity published on the Internet, but the publisher makes only a portion of the configuration data visible at any one time. A common example might be that of a computer, which exemplifies an entity that can be configured with many different options (hard disk drives and memory boards of different capacities, CPUs of different clock speeds, etc.). Some computer manufacturers provide a limited interface for browsing this configuration data on their Web site. Pages of the Web site often display only static, predefined configurations for their computers. In particular, the page might display the configuration and price of a specific desktop computer pre-configured with a given hard disk drive and a specific amount of memory. If users wish to see the price of another configuration of the same computer, they must enter different values into a query form or follow a different hyperlink, and then wait for the new configuration and price data to appear.
The process of sequentially requesting the configuration and price information of each pre-configured computer inhibits, and perhaps eliminates, the possibility of a timely comparative analysis of the entire product portfolio of the computer manufacturer. Having a comprehensive database of all the possible configurations of the manufacturer's computers would be extremely useful and even profitable to an end-user. In many cases the pages displaying the various configurations are very similar in structure and content, and both can be viewed as instances of a tree-like data structure.
There is therefore a need, heretofore unsatisfied, for an adaptive mechanism and corresponding process for retrieving individual tree data structures from the pages of one or more Web sites, processing them, and then merging them locally. In the case of the various computer configurations, the user, with the aid of an appropriate tool, would be able to logically connect the different variations of the same computer, create a better tree data structure and, ultimately, be able to deduce the appropriate underlying features. The tree structure could be either the presentation itself or some other tree data structure extracted from the presentation (e.g., a price data structure).
The present invention addresses these and other data analysis needs by incorporating a method for merging tree data structures that contain redundant data into more tractable tree data structures from which those redundancies have been removed. Advantageously, Web users are able to retrieve information stored on one or more Web pages, available from one or more Web sites, and to merge the data locally. While Web site owners may have good reasons not to make their product database easily extractable and therefore display only a limited view of the data at a time, the present invention describes a mechanism for bypassing this restriction.
The system and associated method of the present invention provide a generalized, automated method for merging and pruning data trees. The resulting tree structure or the data extracted from the tree structure can be the end product. More specifically, a feature of the present system is to automate the process of collecting information from one or more Web sites and converting the raw data into a logically fashioned, non-redundant tree structure.
The present system provides several features and advantages among which are the following:
It provides a means of retrieving sets of individual Web pages from Web sites and locally merging the data.
It enables the user to obtain a logical tree data structure from which redundancies have been removed.
It enables the user to bypass the built-in restrictions in product databases to effectively mine the data for information.
It permits comparative analysis of the data that would otherwise be difficult or impossible.
Briefly, the foregoing and other features and advantages of the present invention are realized by a system and associated method for automating a method of extracting and reducing tree-structured data from one or more Web pages, residing at one or more Web sites. The system and method include:
A MERGE feature that determines how two matching nodes are to be integrated. Specifically, the MERGE feature is used as a child, or subordinate, node to describe how matching parent nodes are to be combined. In addition, the MERGE feature has an attribute that specifies what is to be done to the output when two tree nodes match.
A MATCH feature that is used to describe how and when two nodes match or overlap. In particular, the MATCH feature includes an attribute that specifies the matching condition.
A UNIQUE functionality that specifies that duplicates potentially generated by MERGE nodes are to be removed. This functionality results in the removal of duplicate values generated by the function of selecting information at a given level in the tree structure from a matching document. The removal of duplicate elements results in a set of unique values or elements that still contain the common values.
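The interplay of the three features above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the `Node` class, the functions `match` and `merge`, and the sample data are hypothetical, the matching condition is assumed to be equality of tag and value, and the features are modelled as plain functions rather than as nodes and attributes of a specification document.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    tag: str
    value: str = ""
    children: list = field(default_factory=list)

def match(a, b):
    """MATCH sketch: assumed condition is equal tag and equal value."""
    return a.tag == b.tag and a.value == b.value

def merge(a, b):
    """MERGE sketch: combine two matching nodes, merging children recursively."""
    if not match(a, b):
        raise ValueError("only matching nodes can be merged")
    merged = Node(a.tag, a.value, [])
    for child in a.children + b.children:
        for i, existing in enumerate(merged.children):
            if match(existing, child):
                # UNIQUE sketch: fold a duplicate into the node already present
                # instead of emitting it a second time.
                merged.children[i] = merge(existing, child)
                break
        else:
            # No matching sibling so far: keep this child as a new branch.
            merged.children.append(child)
    return merged

# Two hypothetical pages describing the same computer with different disks.
page_a = Node("computer", "X100", [Node("disk", "20GB"), Node("memory", "128MB")])
page_b = Node("computer", "X100", [Node("disk", "40GB"), Node("memory", "128MB")])

combined = merge(page_a, page_b)
summary = [(c.tag, c.value) for c in combined.children]
print(summary)
```

In this sketch the shared memory option appears once in the merged tree, while the two distinct disk options are both retained under the single computer node.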
When used to evaluate tree data at Web sites, the system of the present invention will transform the information into a new tree structure with most, if not all, redundancy removed. Users employing the system of the present invention will be able to logically group and reduce data by removing redundancies and, thus, obtain, as an end product, the resulting tree structure or the data generated from the tree structure. The system is implementable in a local computer and may be employed by businesses and other users who need its capabilities in the field of data analysis.