Since the creation of the World Wide Web in the early 1990s, the size of the Internet has exploded a thousand-fold. People are now interconnected, not by means of direct face-to-face interaction, but through virtual communication channels. This technological revolution has fundamentally changed the way people live.
A parallel development with the World Wide Web is the “Information Technology Age” that presents a stunning variety of online information resources ranging from product information to academic papers. These elements have enabled the exponential growth of Electronic Commerce that capitalizes on the convenience and low cost which the Internet delivers.
There are several million or more online vendors on the World Wide Web. Current comparison shopping or price comparison search engines can, in response to an online buyer's or user's query, retrieve from different online competitors search results that are somewhat relevant to the requested products and their desired prices. Even so, the buyer or user can be confronted with an endless sea of information. Sometimes, the buyer or user receives a “failure page” of search results because the search engines have missed the Websites of other online multilingual vendors, located in the rest of the Internet-connected countries (currently numbering 245), that sell exactly what was requested. Furthermore, although information about products and vendors is easily accessible on the Web, buyers or users must still take part manually in every stage of the buying process.
The potential of the Internet for transforming the present mode of e-commerce into a truly global ensemble marketplace is largely unrealized today, and electronic purchases are still non-automated. Buying on the Internet is far from being simple, efficient, or enjoyable. Search engines and centralized directory services are insufficient for locating the products the online buyer wants and the merchants willing to sell such products or services. Furthermore, the typical online purchase procedure is mostly manually driven and requires the buyer to enter all terms and keywords for which he or she wants to search. Therefore, a prospective buyer is faced with a daunting task, with responsibility for collecting and interpreting information about merchants and products, making decisions about them, and ultimately entering purchase and payment information. As a result, the user or buyer is easily overloaded with information and lacks the time and expertise to process it.
In order of complexity, there are two imperfect strategies presently adopted and implemented to partially automate an online catalog price comparison process as follows:
(1) Non real-time approach
(2) Real-time hard-coded wrappers approach
The non real-time approach is the simplest way to implement a price comparison agent. Its implementation involves manually collecting all necessary information from the Web, and then writing a separate HTML file for each item of the search results in order to visually display the search results.
The benefits of the above are obvious—easy implementation and short searching time. Notwithstanding those benefits, there are three main undesirable drawbacks. Firstly, as the price comparison is done manually, maintaining a large wrapper repository becomes very costly, particularly in view of the continuing growth of the Internet. Secondly, great effort must be invested to keep the price and other information up-to-date. Lastly, the size of the database required to store and coordinate all of the above information is extremely large.
The real-time hard-coded wrappers approach is an alternative to the non real-time approach. Instead of fetching the items directly as in the non real-time approach, the real-time approach tries to generalize the HTML page into a specific format. To perform this extraction task, a customized wrapper procedure named pcwrapHLRT (a programming acronym) is invoked. FIG. 1 provides an example of the pertinent portion of the program that has one “while” loop. In this example, the algorithm behind the creation of a wrapper is to confine the target data on the HTML page by a pair of delimiters. The pcwrapHLRT procedure works because the site exhibits a uniform formatting convention: product items are rendered in bold, whereas prices are in italics. The procedure operates by scanning the HTML document for the particular strings {“<B>”, “</B>”, “<I>”, “</I>”} that identify the text fragments to be extracted. These strings are identified by pcwrapHLRT as li, ri, lp and rp, respectively. The notation lk (k∈{i, p}) indicates that the string delimits the left-hand edge of an attribute to be extracted, whereas rk indicates a right delimiter. Other possible attributes to be extracted by a wrapper are product names, graphics, terms and conditions, etc.
When an HTML page is given, pcwrapHLRT sequentially scans the entire page starting from the top of the page. The outer loop checks whether there are additional model number and price pairs to extract by searching for the delimiter “<B>” in the non-scanned portion of the page. As long as the beginning of a model number is found, the inner loop is invoked to extract the appropriate page sub-strings.
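The scanning behavior described above can be illustrated with a minimal Python sketch. It is not the program of FIG. 1; the function name extract_pairs, the sample page, and the default delimiter arguments are assumptions made for illustration, and only the left/right delimiter scanning described in the text is covered.

```python
from typing import List, Tuple


def extract_pairs(page: str,
                  li: str = "<B>", ri: str = "</B>",
                  lp: str = "<I>", rp: str = "</I>") -> List[Tuple[str, str]]:
    """Return (item, price) pairs found between the given delimiters.

    Hypothetical helper: li/ri bracket the item, lp/rp bracket the price,
    mirroring the delimiter names used in the text.
    """
    pairs: List[Tuple[str, str]] = []
    pos = 0  # start of the portion of the page not yet scanned
    while True:
        # Outer loop: is there another item in the unscanned portion?
        if page.find(li, pos) == -1:
            break
        fields: List[str] = []
        # Inner loop: extract each attribute in order (item, then price).
        for left, right in ((li, ri), (lp, rp)):
            begin = page.find(left, pos)
            if begin == -1:
                return pairs  # incomplete record; stop scanning
            begin += len(left)
            end = page.find(right, begin)
            if end == -1:
                return pairs  # unterminated attribute; stop scanning
            fields.append(page[begin:end].strip())
            pos = end + len(right)
        pairs.append((fields[0], fields[1]))
    return pairs


# Assumed sample page that follows the bold-item / italic-price convention.
SAMPLE_PAGE = (
    "<HTML><BODY>"
    "<B>Model A100</B> <I>$199.00</I><BR>"
    "<B>Model B200</B> <I>$249.50</I>"
    "</BODY></HTML>"
)

if __name__ == "__main__":
    for item, price in extract_pairs(SAMPLE_PAGE):
        print(item, price)
```

The sketch relies entirely on the page keeping to the assumed bold and italic convention; any text that happens to sit between the same delimiters would be extracted as well.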
Few Websites publish their formatting conventions. Thus, the designer of an information-gathering system using pcwrapHLRT must manually construct such a wrapper for each resource. Unfortunately, this hard-coding process is tedious and error-prone, as a common HTML page may consist of several thousand lines of code. Moreover, most sites periodically change their formatting conventions, which usually breaks a wrapper.
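The fragility of a hard-coded wrapper can be seen in a short Python sketch. The helper extract_between and both sample pages are hypothetical; the redesigned markup is only an assumed example of a formatting change.

```python
from typing import List


def extract_between(page: str, left: str, right: str) -> List[str]:
    """Collect every substring enclosed by the left and right delimiters."""
    results: List[str] = []
    pos = 0
    while True:
        begin = page.find(left, pos)
        if begin == -1:
            break
        begin += len(left)
        end = page.find(right, begin)
        if end == -1:
            break
        results.append(page[begin:end])
        pos = end + len(right)
    return results


# Assumed page before and after a hypothetical site redesign.
OLD_PAGE = "<B>Model A100</B> <I>$199.00</I>"
NEW_PAGE = "<STRONG>Model A100</STRONG> <EM>$199.00</EM>"

# The delimiters hard-coded for the old layout find nothing on the new one.
items_before = extract_between(OLD_PAGE, "<B>", "</B>")
items_after = extract_between(NEW_PAGE, "<B>", "</B>")
```

Here the wrapper does not fail loudly; it simply returns no items for the redesigned page, so the break may go unnoticed until someone inspects the results.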
Another disadvantage of pcwrapHLRT is that searching is only moderately fast, as the agents have to contact the vendor Website upon receiving a request from the user. Because this kind of wrapper is only partially automated, extra administrative work must be performed to manually analyze the format of the HTML page in order to determine the wrapper.