1. Field of the Invention
The present invention relates to a retrieval technique applied to an open network environment that involves a plurality of semi-structured documents and search engines. In particular, the present invention relates to an integrated retrieval scheme that manages location data, document structure data, item data, presentation style data, etc., to provide a unified interface for retrieving required information item by item from a plurality of semi-structured documents, irrespective of differences among their locations, document structures, and elements and among the input forms of search engines.
2. Description of the Prior Art
Increasing performance and decreasing cost of personal computers, improvements in network technology, and the growth of inexpensive network providers are vitalizing open networks, in particular the Internet. Many information providers employ HTML (hypertext markup language), a description language for hypertext that makes content creation easy, to transmit various kinds of information to users through the open networks. The number of information providers is increasing due to an explosive increase in information consumers. As a result, various kinds of information accumulate in the networks, and it is required to efficiently provide each consumer with the necessary information from among the accumulated pieces of information.
Consumers want to retrieve desired information collectively from across information sources. This is hardly achievable, because information accumulated in the open networks is mostly in HTML documents that differ from one another in structure, presentation style, and search format.
Information retrieval apparatuses called search engines are widely used to retrieve HTML documents scattered over the network. Here, "search engine" is a generic term for a system that retrieves certain information through an input form. FIG. 1 shows an information retrieval technique according to a prior art that uses a URL search engine. A URL search engine is a search engine that returns URLs as a search result in response to a query consisting of keywords or conditional terms. For example, suppose a user is interested in "a PC of 100,000 yen or below." The user enters keywords into a URL search engine. FIG. 2 shows an example of a URL search engine according to a prior art. The URL search engine 900 has a keyword index 910, registered in advance, that contains keywords and the locations, i.e., URLs, of HTML documents spread over the networks. A search processor 930 searches the keyword index 910 for the keywords entered by the user and returns a list of URLs and outlines, each URL indicating the location of an HTML document that contains the entered keywords or their synonyms. Returning to FIG. 1, the user accesses the returned HTML documents one by one to find the necessary information. In this way, when the location of the HTML documents holding the necessary information is unknown, the user must first find the locations of HTML documents that may contain the necessary information by a wide document search, and then inspect each HTML document in the obtained URL list, which takes much time and labor. In addition, the prior arts are incapable of collectively retrieving information from across a plurality of HTML documents.
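By way of illustration only, the behavior of a keyword index such as the keyword index 910 can be sketched as an inverted index from words to URLs. The function names, URLs, and page text below are assumptions made for this sketch and do not appear in FIG. 2; synonym handling is omitted.

```python
from collections import defaultdict


def build_keyword_index(documents):
    """Map each lowercase word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in documents.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, keywords):
    """Return the sorted URLs that contain every entered keyword."""
    hits = [index.get(k.lower(), set()) for k in keywords]
    return sorted(set.intersection(*hits)) if hits else []


# Hypothetical pages; the URLs and text are illustrative only.
documents = {
    "http://shop-a.example/pc.html": "desktop PC 98000 yen",
    "http://shop-b.example/list.html": "notebook PC 148000 yen",
    "http://news.example/pc.html": "PC prices fall below 100000 yen",
}
index = build_keyword_index(documents)
results = search(index, ["PC", "yen"])
```

In this sketch all three pages match "PC" and "yen", including the news page and the page priced above 100,000 yen. The result is only a list of locations, so the price condition still has to be checked by opening each page.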
The prior arts can find the locations of HTML documents that contain given keywords and their synonyms, but are unable to collect information item by item by collectively retrieving the information contained in HTML documents. The prior arts are also unable to set conditions on search results; for example, they are unable to filter search results by date. Furthermore, because each URL search engine provides its search interface as its own input form, users must take each engine's individual input form into account and access the URL search engines one by one.
More particularly, HTML documents used by on-line shops in electronic commerce frequently present product information such as names and prices as a list, in table or itemized form, in which each entry is one meaningful cluster of data. There are demands to retrieve information collectively from among these HTML documents of on-line shops. For example, a user may want to retrieve information about shops that offer the lowest price for a specific product. In this case, the user enters the name, maker, category, etc., of the product as keywords. Then, the prior art of FIG. 1 provides the user with the locations of HTML documents related to the keywords. The user accesses the HTML documents one by one to check whether they offer the product under preferable conditions. The prior art of FIG. 1, however, searches the full text of each HTML document for the entered keywords without considering the elements that form the HTML document, and therefore tends to retrieve a lot of data irrelevant to the user. Accordingly, the user must spend much time and labor to find the necessary information from among the HTML documents retrieved by the prior art.
The prior arts are incapable of retrieving information from a given HTML document item by item. For example, they are unable to extract the price, image, maker, etc., of a given product from an HTML document containing a product information table. They are unable to extract the name, phone number, address, etc., of each shop from an HTML document containing shop information in itemized form. They are also unable to set conditions such as a date to filter results retrieved from HTML documents.
There is a conventional technique that creates a hypothetical database by mapping the internal structure of each document and the relationships between documents into unique models, to extract itemized pieces of information. This technique was disclosed by N. Ashish and C. A. Knoblock in "Semi-automatic wrapper generation for internet information sources," Proceedings of Cooperative Information Systems, 1997. This technique treats as meaningful information a portion of an HTML document that is marked by specific tags, such as the TITLE tag, or by attributes such as size, color, and typestyle (e.g., bold and italic), and extracts such information automatically. This technique covers the case in which a minimum cluster of certain information is described in one HTML document and a plurality of such HTML documents are described in the same format. This technique is effective, for example, when weather information for each region is described in a different HTML document. However, this technique does not take into account the case in which information is described as a list, such as a table or itemized list, within one HTML document. Accordingly, this technique cannot be applied to that case.
J. Hammer, H. Garcia-Molina, J. Cho, R. Aranha, and A. Crespo disclosed another technique in "Extracting semistructured information from the web," Workshop on Management of Semistructured Data, 1997. This technique creates a hypothetical database by employing a unique OEM data model and manages the relationship between the database and various information sources, thereby retrieving information from heterogeneous web sources in an integrated manner. To manage this relationship, the technique employs a template file that depends on the HTML tag description rules of each HTML document. However, in this technique, a modification to an HTML document affects the hypothetical database, and a modification to the hypothetical database in turn affects applications. Accordingly, this technique needs much labor for management and maintenance of the system.
There are no standards for the HTML descriptions used to provide information such as that on products handled by on-line shops. Namely, each on-line shop uses its own HTML document format, as explained below.
HTML documents prepared by on-line shops have different document structures. For example, a shop A employs the TABLE tag to describe products in table format, while a shop B employs the UL tag to itemize products in list format.
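The structural difference between shop A and shop B can be illustrated with a short sketch, offered only as an example. It assumes hypothetical page fragments and a hypothetical " / " separator in shop B's list entries; the class and function names are likewise invented for illustration. The same extractor is told which tag delimits an entry for each shop.

```python
from html.parser import HTMLParser


class RowExtractor(HTMLParser):
    """Collect text grouped by a row tag, e.g. 'tr' for tables or 'li' for lists."""

    def __init__(self, row_tag, cell_tag=None):
        super().__init__()
        self.row_tag, self.cell_tag = row_tag, cell_tag
        self.rows, self._row, self._in_cell = [], None, False

    def handle_starttag(self, tag, attrs):
        if tag == self.row_tag:
            self._row = []
            # Without a cell tag, the whole row body counts as one cell.
            self._in_cell = self.cell_tag is None
        elif tag == self.cell_tag and self._row is not None:
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == self.row_tag and self._row is not None:
            self.rows.append(self._row)
            self._row, self._in_cell = None, False
        elif tag == self.cell_tag:
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self._row is not None and data.strip():
            self._row.append(data.strip())


def extract(html, row_tag, cell_tag=None):
    """Parse html and return one list of cell texts per row."""
    parser = RowExtractor(row_tag, cell_tag)
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical fragments for shop A (TABLE) and shop B (UL).
shop_a = "<table><tr><td>PC-100</td><td>98000 yen</td></tr></table>"
shop_b = "<ul><li>PC-100 / 98000 yen</li></ul>"
rows_a = extract(shop_a, "tr", "td")
rows_b = [row[0].split(" / ") for row in extract(shop_b, "li")]
```

Both shops yield the same itemized entry once the per-shop structure (row tag, cell tag, separator) is supplied as data rather than hard-coded.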
The HTML documents of on-line shops employ different presentation styles even for the same product. For example, yen, thousand yen, ten-thousand yen, dollars, etc., are used as price units depending on the shop. Some shops use double-byte characters to express prices and others employ single-byte characters for the same purpose.
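As a minimal sketch of reconciling such presentation styles, the following converts prices written in yen, thousand yen, or ten-thousand yen, in single-byte or double-byte characters, into an integer yen amount. The unit table and the function name are assumptions made for illustration; dollars are left out because converting them would require an exchange rate that the text does not give.

```python
import re
import unicodedata

# Multipliers for the yen-based units named above (illustrative table).
UNIT_TO_YEN = {"yen": 1, "thousand yen": 1_000, "ten-thousand yen": 10_000}


def to_yen(price_text):
    """Convert a price string such as '9.8 ten-thousand yen' to integer yen."""
    # NFKC folds double-byte (full-width) digits, letters, and commas into ASCII.
    text = unicodedata.normalize("NFKC", price_text)
    text = text.replace(",", "").strip().lower()
    match = re.fullmatch(r"([0-9]+(?:\.[0-9]+)?)\s*(.+)", text)
    if not match or match.group(2) not in UNIT_TO_YEN:
        raise ValueError(f"unrecognized price style: {price_text!r}")
    return round(float(match.group(1)) * UNIT_TO_YEN[match.group(2)])


prices = [to_yen(p) for p in ("98,000 yen", "98 thousand yen",
                              "9.8 ten-thousand yen", "９８０００ yen")]
```

All four spellings reduce to the same amount, which is what makes prices from different shops comparable.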
The HTML documents of on-line shops have different data elements even for the same product. For example, a product is represented with only the name thereof, or the name and model number thereof, or the maker, name, and model number thereof depending on shops. To get necessary information from HTML documents gathered by the conventional retrieval techniques, users must extract pieces of information from the documents and compare them with one another. It takes a long time and much labor to retrieve necessary data from them.
In addition, the search engines used to search open networks for required information differ from one another in the types of information they handle, performance, and fees, and therefore users must choose among them depending on the situation. To do so, the users must know the location and the interface of each search engine individually.
First, it is difficult to find and manage the locations of search engines. The users must individually manage the locations of search engines with the use of, for example, bookmarks. This is hard to achieve in an environment in which the user works on a terminal other than the user's own, such as a mobile environment.
Second, the search interfaces that search engines provide as input forms are not unified. Many search engines employ their own input forms, whose structures are not unified. Accordingly, the users must learn separate systems, operation sequences, and schemes when handling different search engines. It is hard for the users to know which search engine is effective for a certain search item. It is also hard for the users to conditionally process information contained in retrieved HTML documents.
Third, retrieving information through search engines is inefficient. The users must handle several search engines until they get the required information, which involves many search operations.
Fourth, the search engines return search results in different item presentation styles, character codes, etc., and it is difficult for the users to compare the search results with one another.
To resolve the heterogeneity among the search engines, Jumon World Seek at http://member.nifty.ne.jp/jumon has disclosed a technique of preparing a common search interface for URL search engines, which are one kind of search engine, managing relationships between the common search interface and the individual interfaces of the URL search engines, converting a search request for the common search interface into search requests for the search engines, and executing the search requests for the search engines. This technique provides a common search interface that employs a single text box to handle the URL search engines. In practice, there are not only URL search engines but also various other search engines. To use such a variety of search engines, this technique has the following problems:
(1) Necessity of Considering a Plurality of Input Items
Some search engines employ the simplest input form, with a single text box for entering keywords to search. To narrow the information to retrieve, some search engines allow the users to enter search conditions such as an area and an industry field in addition to keywords. However, the technique mentioned above is incapable of achieving such a narrowing search operation because it does not support a plurality of input items.
(2) Necessity of Coping with a Variety of Input Forms
To properly enter search conditions, some search engines employ several kinds of input form objects, such as text boxes for text input, radio buttons for selecting one among several items, and select boxes or check boxes for selecting some among several items. The technique mentioned above is incapable of coping with data entry objects other than a text box because it supports only a single text box.
(3) Reconstruction of Application
When search engines are added to, corrected in, or deleted from the common search interface, the technique mentioned above must correct the common search interface and reconstruct the corresponding applications.
In this way, the conventional technique mentioned above is incapable of coping with a variety of search engines and needs a lot of time and labor to design, maintain, and manage.
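To make problems (1) and (2) concrete, the following sketch describes the input forms of two hypothetical search engines as data and builds an engine-specific request from one common query. The engine names, URLs, field names, and option codes are all invented for illustration and do not describe any actual search engine.

```python
from urllib.parse import urlencode

# Hypothetical form descriptions: field names, object types, and options are invented.
ENGINE_FORMS = {
    "engine_x": {
        "url": "http://engine-x.example/search",
        "fields": {"keyword": {"name": "q", "type": "text"}},
    },
    "engine_y": {
        "url": "http://engine-y.example/find",
        "fields": {
            "keyword": {"name": "kw", "type": "text"},
            "area": {
                "name": "area",
                "type": "select",
                "options": {"Tokyo": "13", "Osaka": "27"},
            },
        },
    },
}


def build_request(engine, query):
    """Translate a common query dict into one engine's form submission URL."""
    form = ENGINE_FORMS[engine]
    params = []
    for item, value in query.items():
        field = form["fields"].get(item)
        if field is None:
            continue  # this engine has no input object for the item
        if field["type"] in ("select", "radio", "checkbox"):
            value = field["options"][value]  # map the label to the form's option value
        params.append((field["name"], value))
    return form["url"] + "?" + urlencode(params)


query = {"keyword": "ramen", "area": "Tokyo"}
request_x = build_request("engine_x", query)
request_y = build_request("engine_y", query)
```

A common interface limited to a single text box can express only the request for engine_x; the area condition accepted by engine_y's select box has no place in it.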
An object of the present invention is to provide an integrated retrieval scheme capable of retrieving required information from a plurality of semi-structured documents, such as HTML documents, that are scattered over open networks and have different document structures, presentation styles, and information elements, converting the retrieved information into a unified form for each user, and returning the information in the unified form to the user.
Another object of the present invention is to provide an integrated retrieval scheme capable of individually managing the input form objects of each search engine serving open networks to resolve differences among the search engines, generating search requests specific to the search engines according to a user's search request, and executing search operations with respect to the search engines in an open network environment that includes many search engines.
Still another object of the present invention is to provide an integrated retrieval scheme capable of managing the location, document structure, and item attributes of each HTML document and extracting required information item by item from different HTML documents that differ arbitrarily in location, document structure, and attributes.
In order to accomplish the objects, an aspect of the present invention provides an apparatus for retrieving data contained in a plurality of semi-structured documents over open networks, comprising: a unit for storing meta data for each of the semi-structured documents, the meta data including items to be extracted from the semi-structured documents and item data used to conditionally retrieve the items; a unit for retrieving data scattered among the semi-structured documents for entered query according to the meta data, and preparing a collective search result; and a unit for outputting the search result in a prescribed single format that is specific to each user.
Another aspect of the present invention provides an apparatus for retrieving data contained in a plurality of semi-structured documents over open networks, comprising: (a) a unit for storing location data about the location of each of the semi-structured documents, document structure data about the structure of each of the semi-structured documents, used to delimit document into items to be extracted, attribute data about the attributes of each of the items to be extracted, used to conditionally retrieve the items, and style conversion data used to convert item presentation styles of the user and item presentation styles of the semi-structured documents from one into another; (b) a unit for finding, according to the location data, the location of a semi-structured document that contains all search items specified in an entered query that consists of the search items and search conditions; (c) a unit for converting, if necessary, item presentation styles of the entered query into item presentation styles of the search item in location found semi-structured documents according to the style conversion data, and forming queries for the location found semi-structured documents; (d) a unit for transmitting the queries provided by the unit (c) to the found locations and acquiring the semi-structured documents; (e) a unit for extracting item data from the acquired semi-structured documents according to the document structure data, selecting the extracted item data, if necessary, according to the attribute data for the search condition, and preparing a search result; and (f) a unit for converting, if necessary, item presentation styles of the search result into the item presentation styles of each user according to the style conversion data.
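The flow through units (b) to (f) can be pictured with the following sketch, given only as an illustration under stated assumptions. The source table, item names, and price styles are hypothetical, and fetching and structural extraction (units (d) and (e)) are replaced by fixed rows so that the example stays self-contained.

```python
# All metadata below is hypothetical; fetching and extraction are stubbed with fixed rows.
SOURCES = {
    "http://shop-a.example/pc.html": {
        "items": {"name", "price"},
        "rows": [{"name": "PC-100", "price": "98 thousand yen"}],
    },
    "http://shop-b.example/list.html": {
        "items": {"name", "price", "maker"},
        "rows": [{"name": "PC-200", "price": "148000 yen", "maker": "Acme"}],
    },
    "http://shop-c.example/names.html": {
        "items": {"name"},
        "rows": [{"name": "PC-300"}],
    },
}


def price_to_yen(text):
    """Style conversion for the two price styles used in this sketch."""
    number, unit = text.split(" ", 1)
    return int(number) * {"yen": 1, "thousand yen": 1000}[unit]


def retrieve(search_items, max_price_yen):
    """Return rows from every source offering all needed items, at or below the limit."""
    wanted = set(search_items) | {"price"}  # the condition needs the price item too
    results = []
    for url, meta in SOURCES.items():
        if not wanted <= meta["items"]:  # location lookup: source must hold all items
            continue
        for row in meta["rows"]:  # stands in for acquisition and extraction
            yen = price_to_yen(row["price"])  # convert the source style to one style
            if yen <= max_price_yen:  # apply the search condition to the item
                results.append({"name": row["name"], "price_yen": yen, "source": url})
    return sorted(results, key=lambda r: r["price_yen"])


cheap = retrieve(["name", "price"], 100_000)
```

Here the source without a price item is never contacted, and the remaining rows are filtered and returned in one presentation style regardless of how each shop wrote its prices.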
Still another aspect of the present invention provides an apparatus for retrieving data through search engines over open networks, comprising: (aa) a unit for storing location data about the location of each search engine, essential input item data specifying essential input items required by an input form of each search engine, document structure data about the structure of each HTML document, used to delimit document into items to be extracted, attribute data about the attributes of the items to be extracted, used to conditionally retrieve the items, and style conversion data used to convert item presentation styles of a user and item presentation styles of each HTML document from one into another; (bb) a unit for finding, according to the location data, the location of a search engine that contains all search items specified in an entered query that consists of the search items and search conditions; (cc) a unit for selecting, according to the essential input item data, search engine to be searched from among the location found search engines, the search engine of which the essential input item satisfy the specified search condition; (dd) a unit for determining an optimum retrieval pattern for each of the selected search engines according to a matrix table and converting the entered query into queries for the selected search engines accordingly, the matrix table defining combination between the search items and search conditions and the items and essential input items of each search engine; (ee) a unit for converting, if necessary, item presentation styles of the queries provided by the unit (dd) into item presentation styles of the search item in selected search engines according to the style conversion data; (ff) a unit for transmitting the queries provided by the unit (ee) to the found locations and acquiring HTML documents; (gg) a unit for extracting item data from the acquired HTML document serving as a first search result according to the structure data, selecting the extracted item data, if necessary, according to the attribute data for the search condition on the basis of corresponding retrieval pattern and preparing a second search result; and (hh) a unit for converting, if necessary, item presentation styles of the second search result into item presentation styles of each user according to the style conversion data.
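One possible reading of units (bb) to (dd) is sketched below. It is an assumption-laden illustration, not the matrix table itself: the engine metadata is hypothetical, and the "retrieval pattern" is simplified to a split between conditions sent through the engine's input form and conditions applied afterwards to the extracted items.

```python
# Hypothetical engine metadata: which items each engine returns, which inputs its
# form accepts, and which of those inputs are essential (must be filled in).
ENGINES = {
    "engine_x": {"items": {"name", "price"},
                 "inputs": {"keyword"},
                 "essential": {"keyword"}},
    "engine_y": {"items": {"name", "price"},
                 "inputs": {"keyword", "area"},
                 "essential": {"keyword", "area"}},
    "engine_z": {"items": {"name"},
                 "inputs": {"keyword"},
                 "essential": {"keyword"}},
}


def plan(search_items, conditions):
    """Pick usable engines and split conditions into form inputs and local filters."""
    plans = {}
    for name, meta in ENGINES.items():
        if not set(search_items) <= meta["items"]:
            continue  # engine cannot return every requested item
        if not meta["essential"] <= set(conditions):
            continue  # the query cannot fill an essential input
        send = {k: v for k, v in conditions.items() if k in meta["inputs"]}
        local = {k: v for k, v in conditions.items() if k not in meta["inputs"]}
        plans[name] = {"send": send, "filter_locally": local}
    return plans


plans = plan(["name", "price"], {"keyword": "PC", "max_price": 100_000})
```

With this query only engine_x is selected: engine_y requires an area that the query does not supply, and engine_z does not return prices. The price limit, which engine_x's form cannot accept, is kept for filtering the first search result.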
Still another aspect of the present invention provides an apparatus for extracting data item by item from arbitrary HTML document over open networks, comprising: (aaa) a unit for storing a template for each HTML document according to document structure data about the structure of the HTML document used to delimit document into items to be extracted, the template stipulating at least item name to be extracted and prescribed text extraction style data of item group to be extracted from the HTML document; (bbb) a unit for analyzing a template corresponding to acquired HTML document; and (ccc) a unit for comparing the acquired HTML documents with corresponding template by scanning the acquired HTML document, and extracting item data of the items matching the text extraction style data of the template, so as to prepare a search result.
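A template of this kind might, for example, name the items and show the text surrounding them. The sketch below assumes a hypothetical template notation in which item names appear in braces; the notation, the function names, and the sample page are illustrative only. The template is compiled into a pattern, and the acquired document is scanned for every match.

```python
import re


def compile_template(template):
    """Turn a template such as '<li>{name}: {price}</li>' into a regex with named groups."""
    parts = re.split(r"\{(\w+)\}", template)
    # Even indexes are literal markup, odd indexes are item names.
    pattern = "".join(
        re.escape(part) if i % 2 == 0 else f"(?P<{part}>.+?)"
        for i, part in enumerate(parts)
    )
    return re.compile(pattern)


def extract_items(html, template):
    """Scan the document and return one dict of item data per template match."""
    return [m.groupdict() for m in compile_template(template).finditer(html)]


# Hypothetical template and page fragment.
template = "<li>{name}: {price}</li>"
html = "<ul><li>PC-100: 98000 yen</li><li>PC-200: 148000 yen</li></ul>"
items = extract_items(html, template)
```

Because the extraction rule lives in the template rather than in program code, a change in a shop's page layout would call for a new template only.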
Still another aspect of the present invention provides a method of retrieving data contained in a plurality of semi-structured documents over open networks, comprising the steps of: retrieving data scattered among semi-structured documents for entered query according to meta data about each of the semi-structured documents and preparing a collective search result, the meta data including items to be extracted from the semi-structured documents and item data used to conditionally retrieve the items; and outputting the search result in a prescribed single format that is specific to each user.
Still another aspect of the present invention provides a method of retrieving data contained in a plurality of semi-structured documents over open networks, comprising the steps of: (a) finding, according to location data that specifies the location of each of the semi-structured documents, the location of a semi-structured document that contains all search items specified in an entered query that consists of the search items and search conditions; (b) converting, if necessary, item presentation styles of the entered query into item presentation styles of the search item in location found semi-structured documents according to style conversion data and forming queries for the location found semi-structured documents, the style conversion data being used to convert item presentation styles of a user and item presentation styles of the semi-structured documents from one into another; (c) transmitting the queries provided by the step (b) to the found locations and acquiring the semi-structured documents; (d) extracting item data from the acquired semi-structured documents according to document structure data, selecting the extracted item data, if necessary, according to attribute data for the search condition and preparing a search result, the document structure data specifying the structure of each of the semi-structured documents and being used to delimit document into items to be extracted, the attribute data specifying the attributes of each item to be extracted and being used to conditionally retrieve the items; and (e) converting, if necessary, item presentation styles of the search result into the item presentation styles of each user according to the style conversion data.
Still another aspect of the present invention provides a method of retrieving data through search engines over open networks, comprising the steps of: (aa) finding, according to location data that specifies the location of each search engine, the location of a search engine that contains all search items specified in an entered query that consists of the search items and search conditions; (bb) selecting, according to essential input item data that specifies essential input items required by an input form of each search engine, search engine to be searched from among the location found search engines, the search engine of which the essential input item satisfy the specified search condition; (cc) determining an optimum retrieval pattern for each of the selected search engines according to a matrix table and converting the entered query into queries for the selected search engines accordingly, the matrix table defining combination between the search items and search conditions and the items and essential input items of each search engine; (dd) converting, if necessary, item presentation styles of the queries provided by the step (cc) into item presentation styles of the search item in selected search engines according to style conversion data that is used to convert item presentation styles of a user and item presentation styles of each HTML document from one into another; (ee) transmitting the queries obtained by the step (dd) to the found location and acquiring HTML documents; (ff) extracting item data from the acquired HTML document serving as first search result according to document structure data, selecting, if necessary, the extracted item data according to attribute data for the searching condition on the basis of corresponding retrieval pattern, and preparing a second search result, the document structure data specifying the structure of each HTML document and being used to delimit document into items to be extracted, the attribute data specifying the attributes of the items to be extracted and being used to conditionally retrieve the items; and (gg) converting, if necessary, item presentation styles of the second search result into item presentation styles of each user according to the style conversion data.
Still another aspect of the present invention provides a method of extracting data item by item from arbitrary HTML document over open networks, comprising the steps of: (aaa) analyzing a template corresponding to acquired HTML document, the template for each HTML document being set according to document structure data that specifies the structure of each HTML document and is used to delimit document into items to be extracted, the templates stipulating at least item name to be extracted and prescribed text extraction style data of item group to be extracted from the corresponding HTML document; and (bbb) comparing the acquired HTML documents with corresponding template by scanning the acquired HTML document, and extracting item data of the items matching the text extraction style data of the template, so as to prepare a search result.
Still another aspect of the present invention provides a computer readable recording medium recording a program for causing the computer to execute processing for retrieving data contained in a plurality of semi-structured documents over open networks, the processing including: a process for retrieving the data scattered among semi-structured documents for entered query according to meta data about each of the semi-structured documents and preparing a collective search result, the meta data including items to be extracted from the semi-structured documents and item data used to conditionally retrieve the items; and a process for outputting the search result in a prescribed single format that is specific to each user.
Still another aspect of the present invention provides a computer readable recording medium recording a program for causing the computer to execute processing for retrieving data involved in a plurality of semi-structured documents over open networks, the processing including: (a) a process for finding, according to location data that specifies the location of each of the semi-structured documents, the location of a semi-structured document that contains all search items specified in an entered query that consists of the search items and search conditions; (b) a process for converting, if necessary, item presentation styles of the entered query into item presentation styles of the search item in location found semi-structured documents according to style conversion data and forming queries for the location found semi-structured documents, the style conversion data being used to convert item presentation styles of a user and item presentation styles of the semi-structured documents from one into another; (c) a process for transmitting the queries provided by the process (b) to the found locations and acquiring the semi-structured documents; (d) a process for extracting item data from the acquired semi-structured documents according to document structure data, selecting the extracted item data, if necessary, according to attribute data for the search condition and preparing a search result, the document structure data specifying the structure of each of the semi-structured documents and being used to delimit document into items to be extracted, the attribute data specifying the attributes of each item to be extracted and being used to conditionally retrieve the items; and (e) a process for converting, if necessary, item presentation styles of the search result into the item presentation styles of each user according to the style conversion data.
Still another aspect of the present invention provides a computer readable recording medium recording a program for causing the computer to execute processing for retrieving data through search engines over the open networks, the processing including: (aa) a process for finding, according to location data that specifies the location of each search engine, the location of a search engine that contains all search items specified in an entered query that consists of the search items and search conditions; (bb) a process for selecting, according to essential input item data that specifies essential input items required by an input form of each search engine, search engine to be searched from among the location found search engines, the search engine of which the essential input item satisfy the specified search condition; (cc) a process for determining an optimum retrieval pattern for each of the selected search engines according to a matrix table and converting the entered query into queries for the selected search engines accordingly, the matrix table defining combination between the search items and search conditions and the items and essential input items of each search engine; (dd) a process for converting, if necessary, item presentation styles of the queries provided by the process (cc) into item presentation styles of the search item in selected search engines according to style conversion data that is used to convert item presentation styles of a user and item presentation styles of each HTML document from one into another; (ee) a process for transmitting the queries obtained by the process (dd) to the found location and acquiring HTML documents; (ff) a process for extracting item data from the acquired HTML document serving as first search result according to document structure data, selecting, if necessary, the extracted item data according to attribute data for the searching condition on the basis of corresponding retrieval pattern, and preparing a second search result, the document structure data specifying the structure of each HTML document and being used to delimit document into items to be extracted, the attribute data specifying the attributes of the items to be extracted and being used to conditionally retrieve the items; and (gg) a process for converting, if necessary, item presentation styles of the second search result into item presentation styles of each user according to the style conversion data.
Still another aspect of the present invention provides a computer readable recording medium recording a program for causing the computer to execute processing for extracting data item by item from arbitrary HTML documents over open networks, the processing including: (aaa) a process for analyzing a template corresponding to acquired HTML document, the template for each HTML document being set according to document structure data that specifies the structure of each HTML document and is used to delimit document into items to be extracted, the templates stipulating at least item name to be extracted and prescribed text extraction style data of item group to be extracted from the corresponding HTML document; and (bbb) a process for comparing the acquired HTML documents with the corresponding template by scanning the acquired HTML document, and extracting item data of the items matching the text extraction style data of the template, so as to prepare a search result.
Other and further objects and features of the present invention will become apparent from the following description taken in conjunction with the accompanying drawings.