The present invention relates to the field of data processing, and particularly to a software system and associated method for use with a search engine, to search data maintained in systems that are linked together over an associated network such as the Internet. More specifically, this invention pertains to a computer software product for dynamically associating keywords encountered in abstracts or summaries of a search result set, with domain-specific search engine queries, in order to retrieve resources pertaining to the keywords within the context of a current information sphere.
The World Wide Web (WWW) is comprised of an expansive network of interconnected computers upon which businesses, governments, groups, and individuals throughout the world maintain inter-linked computer files known as web pages. Users navigate these pages by means of computer software programs commonly known as Internet browsers. Due to the vast number of WWW sites, many web pages have a redundancy of information or share a strong likeness in either function or title. The vastness of the unstructured WWW causes users to rely primarily on Internet search engines to retrieve information or to locate businesses. These search engines use various means to determine the relevance of a user-defined search to the information retrieved.
The authors of web pages provide information known as metadata within the body of the hypertext markup language (HTML) document that defines the web pages. A computer software product known as a web crawler systematically accesses web pages by sequentially following hypertext links from page to page. The crawler indexes the pages for use by the search engines using information about a web page as provided by its address or Uniform Resource Locator (URL), metadata, and other criteria found within the page. The crawler is run periodically to update previously stored data and to append information about newly created web pages. The information compiled by the crawler is stored in a metadata repository or database. The search engines search this repository to identify matches for the user-defined search rather than attempt to find matches in real time.
A typical search engine has an interface with a search window where the user enters an alphanumeric search expression or keywords. The search engine sifts through available web sites for the user's search terms, and returns the search results in the form of HTML pages. Each search result includes a list of individual entries that have been identified by the search engine as satisfying the user's search expression. Each entry or "hit" includes a hyperlink that points to a Uniform Resource Locator (URL) location or web page.
In addition to the hyperlink, certain search result pages include a short summary or abstract that describes the content of the URL location. Typically, search engines generate this abstract from the file at the URL, and only provide acceptable results for URLs that point to HTML format documents. For URLs that point to HTML documents or web pages, a typical abstract includes a combination of values selected from HTML tags. These values may include text from the web page's "title" tag, from what are referred to as "annotations" or "meta tag values" such as "description," "keywords," etc., from "heading" tag values (e.g., H1, H2 tags), or from some combination of the content of these tags.
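By way of illustration only, the following minimal sketch shows one way an abstract of this kind could be assembled from the "title" tag, the "description" meta tag value, and the H1 and H2 heading values. The class name `AbstractExtractor`, the function `build_abstract`, the " | " separator, and the 200-character limit are assumptions made for this example and are not taken from any particular search engine.

```python
from html.parser import HTMLParser


class AbstractExtractor(HTMLParser):
    """Collects the title, meta description/keywords, and H1/H2 text of a page."""

    def __init__(self):
        super().__init__()
        self.fields = {"title": "", "description": "", "keywords": "", "headings": []}
        self._current = None  # tag whose text is currently being captured

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = (attrs.get("name") or "").lower()
        if tag == "meta" and name in ("description", "keywords"):
            self.fields[name] = attrs.get("content") or ""
        elif tag in ("title", "h1", "h2"):
            self._current = tag
            if tag != "title":
                self.fields["headings"].append("")

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.fields["title"] += data
        elif self._current in ("h1", "h2"):
            self.fields["headings"][-1] += data


def build_abstract(html_text, max_len=200):
    """Join title, description, and headings into one short abstract string."""
    parser = AbstractExtractor()
    parser.feed(html_text)
    f = parser.fields
    parts = [f["title"].strip(), f["description"].strip()]
    parts += [h.strip() for h in f["headings"]]
    return " | ".join(p for p in parts if p)[:max_len]
```

A page that lacks these tags yields an empty abstract under this sketch, which is consistent with the observation above that acceptable abstracts are generally produced only for HTML documents.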
However, for one HTML parent page with links to multiple different relevant non-HTML documents that satisfy the user's search criteria, the search result may include multiple identical URLs, one for each relevant non-HTML document. Each of these identical URLs points to the same HTML parent page, and each may include an identical abstract that is descriptive of the parent HTML page. As a result, the search results in redundant abstracts that can be practically useless, distracting, and time consuming to review.
More specifically, the popularity of domain-specific portal sites that act as gateways to very specialized information sources has grown concurrently with the WWW, both in complexity and volume of data. The term "portal" is generally synonymous with gateway, and is typically used to refer to a WWW site which is intended to be a major starting site or an anchor site for web users. Current leading general-purpose portal sites include: Yahoo!(copyright), Excite(copyright), Netscape(copyright), Lycos(copyright), Cnet(copyright), and MSN The Microsoft Network(copyright). However, while such portal sites attempt to serve as gateways to a wide variety of general-purpose information, specialized portals have also been gaining popularity in recent years.
Specialized portal sites, such as the jCentral(copyright), xCentral, etc., attempt to focus on a particular domain that appeals to a target audience. By limiting the scope of their operation, the belief is that specialized portal sites will be able to present information of greater relevance to their target audience.
For example, in a portal site such as jCentral(copyright) that caters to users interested in learning more about the Java programming language and related topics, the users are allowed to conduct a search by querying the portal database. The portal database is a vast repository of pre-collected, indexed, and summarized information, typically gathered from the WWW using automated crawling tools. When a user enters a query, the portal's search engine attempts to match the keywords specified by the user with summarized metadata that have been previously extracted from the documents stored in the repository, and then returns an ordered list of potential candidate matches relevant to the user's query.
Typically, the search engine will return a result set for a search query including a URL and a text-based abstract of the original resource. Sometimes, users are able to control the length of the abstract. For instance, the HotBot(copyright) site at URL: http://www.hotbot.com, provides the choice of having only a list of URLs displayed as the search result, the URL with a brief abstract, or a comprehensive abstract.
However, since the abstract is usually generated on the server side, a resulting problem is the inability of the users to obtain more detailed information pertaining to domain-specific terms that appear in the abstract, without issuing a separate query with the relevant term as the new keyword. By so doing, the user might become distracted and distanced from the original search result. Moreover, the conventional search engines do not provide the capability to allow users to dynamically conduct an automatic search based on keywords that appear in an abstract or summary. Rather, the full text of the abstract or summary is displayed to the user.
There is currently no adequate mechanism by which search engines allow the user to dynamically interface with the search abstract, such as by selecting a term of interest in the abstract to obtain more information about this term within the context of the domain being queried. The need for such a mechanism has heretofore remained unsatisfied.
The abstract keywords association system and associated method of the present invention satisfy this need. In accordance with one embodiment, the abstract keywords association system allows the user to dynamically interface with the search abstract. The user selects a term of interest in the abstract, and the abstract keywords association system automatically provides the user with additional information about this term within the context of the domain being queried. This permits the user to consider more information and to better judge the usefulness of the resource and search result.
The abstract keywords association system of the present invention provides several features and advantages, among which are the following:
The ability to automatically detect and select keywords from abstracts of search result items, by using a domain-specific dictionary of keywords.
The ability to select and generate an optimal query string for a particular keyword. This comprises the steps of building a complex Boolean query string, and calibrating the quantity of the search result set to a manageable size.
A method to dynamically link domain-specific terms encountered in abstract summaries of web resources returned in response to search engine queries, to new queries that retrieve resources specific to keywords in the context of the current information domain. The positions at which a hyperlink is inserted are marked using specific markup tags.
The ability to update, remove, change, or add inserted hyperlinks, when a related domain-specific dictionary changes.
A synchronization mechanism to keep the stored query information up to date. This involves the detection of changes in the summary metadata, as well as changes in the usage pattern of the search engines used, which leads to the creation of a new query string.
A mechanism that controls the abstract keywords association system based on the user's input and events.
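By way of illustration only, the keyword detection and hyperlink marking features listed above can be sketched as follows. The sketch assumes a plain-text abstract, a small in-memory dictionary, whole-word case-insensitive matching, an anchor tag with a `kw` class as the markup, and a placeholder search address `http://portal.example/search?q=`. All of these, and the function names `detect_keywords` and `mark_abstract`, are hypothetical choices for the example rather than details given in this description.

```python
import re
from urllib.parse import quote_plus


def _pattern(dictionary):
    # Longest terms first so multi-word terms win over their prefixes.
    terms = sorted(dictionary, key=len, reverse=True)
    body = "|".join(re.escape(t) for t in terms)
    return re.compile(r"\b(" + body + r")\b", re.IGNORECASE)


def detect_keywords(abstract, dictionary):
    """Return the dictionary terms found in the abstract, in first-seen order."""
    if not dictionary:
        return []
    canonical = {t.lower(): t for t in dictionary}
    found = []
    for match in _pattern(dictionary).finditer(abstract):
        term = canonical[match.group(1).lower()]
        if term not in found:
            found.append(term)
    return found


def mark_abstract(abstract, dictionary,
                  search_url="http://portal.example/search?q="):
    """Wrap every occurrence of a dictionary term in a hyperlink to a new query."""
    if not dictionary:
        return abstract
    canonical = {t.lower(): t for t in dictionary}

    def link(match):
        term = canonical[match.group(1).lower()]
        return '<a class="kw" href="{}{}">{}</a>'.format(
            search_url, quote_plus(term), match.group(1))

    return _pattern(dictionary).sub(link, abstract)
```

Because the inserted links carry a distinguishing class attribute in this sketch, they can later be located and rewritten or removed when the dictionary changes, without disturbing the remaining text of the abstract.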
The foregoing and other features and advantages of the present invention are realized by an abstract keywords association system for use with a search engine and a search engine repository to dynamically associate a keyword encountered in an abstract of a search result set with a domain-specific query. In this system, a local query database stores the domain-specific query, and a synchronization unit synchronizes the search engine repository and the local query database.
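By way of illustration only, one possible form of the synchronization between the search engine repository and the local query database is sketched below. The sketch models both stores as in-memory mappings keyed by URL and detects a changed abstract by comparing SHA-256 digests. The digest comparison, the record layout, and the names `find_stale`, `synchronize`, and `process` are assumptions made for the example and are not prescribed by this description.

```python
import hashlib


def digest(abstract):
    """Fingerprint of an abstract, used to detect changes cheaply."""
    return hashlib.sha256(abstract.encode("utf-8")).hexdigest()


def find_stale(repository, local_db):
    """Return URLs whose abstract is new or has changed since last processing."""
    stale = []
    for url, abstract in repository.items():
        entry = local_db.get(url)
        if entry is None or entry["digest"] != digest(abstract):
            stale.append(url)
    return stale


def synchronize(repository, local_db, process):
    """Bring local_db in line with repository.

    process(url, abstract) is a hypothetical callback that rebuilds the
    stored queries for one abstract and returns them.
    """
    for url in find_stale(repository, local_db):
        abstract = repository[url]
        local_db[url] = {
            "digest": digest(abstract),
            "queries": process(url, abstract),
        }
    # Drop records for URLs that no longer exist in the repository.
    for url in list(local_db):
        if url not in repository:
            del local_db[url]
    return local_db
```

Only the abstracts reported by `find_stale` are reprocessed, so a periodic run of this sketch touches a small part of the repository when few abstracts have changed.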
A query builder builds a search query from a query template using the search engine repository. A summary marker incorporates the search query with the keyword in the abstract of the search result item. A keyword detector generates a list of keywords included in a domain-specific dictionary. A search result calibration manager calibrates the number of the query result items. A search result item buffer receives a request for processing an abstract metadata item from the synchronization unit. The request includes a Uniform Resource Locator (URL) and a corresponding abstract, wherein the query builder uses a (URL, keyword) pair to build the domain-specific query from the query template. The summary marker updates the abstract corresponding to the URL in the search engine repository, and marks and inserts the domain-specific query for all occurrences of the keyword.
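By way of illustration only, the cooperation of the query builder and the search result calibration manager can be sketched as follows. The Boolean syntax (quoted terms joined by AND and OR), the default limit of 50 result items, and the `count_hits` callback that reports how many items a query would return are assumptions made for the example. An actual search engine may use a different query syntax and a different calibration strategy.

```python
def build_query(keyword, required=(), any_of=()):
    """Build a Boolean query string: keyword AND required... AND (a OR b ...)."""
    parts = ['"{}"'.format(keyword)]
    parts += ['"{}"'.format(term) for term in required]
    query = " AND ".join(parts)
    if any_of:
        alternatives = " OR ".join('"{}"'.format(term) for term in any_of)
        query += " AND (" + alternatives + ")"
    return query


def calibrate(keyword, narrowing_terms, count_hits, max_results=50):
    """Add narrowing terms one at a time until the result set is small enough.

    count_hits(query) is a hypothetical callback returning the number of
    result items the search engine would return for the query.
    """
    required = []
    query = build_query(keyword)
    for term in narrowing_terms:
        if count_hits(query) <= max_results:
            break
        required.append(term)
        query = build_query(keyword, required)
    return query
```

In this sketch the narrowing terms would typically be drawn from the context of the abstract in which the keyword occurs, so that the calibrated query stays within the current information domain. If the narrowing terms are exhausted before the limit is reached, the most restrictive query built so far is returned.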
The abstract keywords association system of the present invention enables users to read and learn more on specific terms encountered in abstract summaries of web resources returned by domain-specific search engines. The system allows the user to dynamically probe the information presented, and thus obtain the desired detail. This permits the user to gather and access information faster and with greater convenience.
The abstracts presented by the abstract keywords association system contain dynamic data associated with keywords derived from the domain-specific dictionary. The dynamic data represents pointers, links, or URLs to external data repositories. As a result, the retrieved data is always current and up to date.
If a meta search engine were used, the search results for the keywords could fall into several different categories, for example, books related to the keyword, reviews from other users about the keyword, and links to web sites.