The present invention relates to Internet-based information retrieval. More particularly, the present invention relates to systems and methods for concept-based Internet searching.
The Web has blossomed as a means of access to a variety of information by remote individuals. The Web is an open system in that virtually any individual or organization with a computer connected to a telephone line may use the Web to present information concerning almost any subject. To accomplish this, the Web utilizes a body of software, a set of protocols, and a set of defined conventions for presenting and providing information over the Web. Hypertext and multimedia techniques allow users to gain access to information available via the Web.
Users typically operate personal computers (PCs) executing browser software to access information stored by an information provider computer. The user's computer is commonly referred to as a client, and the information provider computer is commonly referred to as a Web server. The browser software executing on the user's computer requests information from Web servers using a defined protocol. One protocol by which the browser software specifies information for retrieval and display from a Web server is known as Hypertext Transfer Protocol (HTTP). HTTP is used by the Web server and the browser software executing on the user's computer to communicate over the Internet.
Web servers often operate using the UNIX operating system, or some variant of the UNIX operating system. Web servers transmit information requested by the browser software to the user's computer. The browser software displays this information on the user's computer display in the form of a Web page. The Web page may display a variety of text and graphic materials, and may include links that provide for the display of additional Web pages. A group of Web pages provided by a common entity, and generally through a common Web server, forms a Web site.
A specific location of information on the Internet is designated by a Uniform Resource Locator (URL). A URL is a string expression representing a location identifier on the Internet or on a local Transmission Control Protocol/Internet Protocol (TCP/IP) computer system. The location identifier generally specifies the location of a server on the Internet, the directory on the server where specific files containing information are found, and the names of the specific files containing information. Certain default rules apply so that the specific file names, and even the directory containing the specific files, need not be specified. Thus, if a user knows that specific information desired by the user is located at a location pointed to by a URL, the user may enter the URL on the user's computer in conjunction with execution of the browser software to obtain the desired information from a particular Web server. Users, or the browser software executing on the user's computer, must always at a minimum know the Internet address portion of the URL for a particular Web server.
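By way of illustration only, the decomposition of a URL into a server location, a directory, and a file name may be sketched as follows. The function name describe_url and the example.com addresses are hypothetical and are not part of the original disclosure; the sketch assumes Python's standard urllib.parse module.

```python
from urllib.parse import urlsplit
import posixpath


def describe_url(url: str) -> dict:
    """Split a URL into its server, directory, and file name parts."""
    parts = urlsplit(url)
    # Default rule: an empty path is treated as the server's root directory.
    path = parts.path or "/"
    directory, filename = posixpath.split(path)
    return {
        "scheme": parts.scheme,
        "server": parts.hostname,
        "directory": directory,
        # Empty when the URL relies on the server's default file.
        "file": filename,
    }


full = describe_url("http://www.example.com/products/widgets/index.html")
bare = describe_url("http://www.example.com")
```

In the second call only the Internet address portion is supplied, so the directory falls back to the root and the file name is left empty for the server to resolve.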
However, often the user does not know the URL of a site containing desired information. Even if the user once knew the proper URL, the user may have forgotten, mistyped, or otherwise garbled a URL for a specific location, as URLs can often be lengthy strings with a variety of special characters. To allow increased ease in locating Web sites containing desired information, search engines identifying Web sites likely to contain the desired information are widely available. A search engine using a well-constructed search often allows a user to quickly and accurately locate Web sites with desired information. Due to the multiplicity of Web sites, and indeed due to the unstructured nature of the Web, a poorly constructed search may make locating a Web site with the desired information virtually impossible.
An inability of a user to quickly and easily locate a Web site poses difficulties with respect to some commercial uses of the Web. Commercial entities have found the Web a useful medium for the advertisement and sale of goods and services. A variety of commercial entities have created home pages for the commercial entity as a whole, and for particular products sold and marketed by the commercial entity. The effectiveness of advertising in such a way on the Web is dependent on users accessing a commercial entity's Web site and viewing the information located there. The user must undertake two critical actions for this to occur. The user must first access a commercial entity's Web site, and then the user must actually view the material displayed there. A user who desires to view a Web page advertising or selling a particular product, but who is a poor Web searcher, may represent a lost sale of the product.
The huge amounts of poorly accessible information frustrate consumers, analysts and content providers alike. Existing navigation devices often fail to connect people and content, limiting the growth of Web-based information services and e-commerce.
What is needed is an improved method that allows a user to easily obtain information via the Web. The method should allow a user to use natural language, and to search based on concepts rather than strict Boolean strings.
The present invention addresses these needs by providing a system, method and article of manufacture for concept-based information selection. The raw text of information is retrieved from various sources on a network, such as the Internet, and compiled. Preferably, the information retrieval and compilation is performed continuously. The text is parsed into components such as by identifying an event, a time, a location, and/or a participant associated with information in the text. Elements of information are extracted from the components and cataloged. The cataloged information is matched with user-specific parameters.
In one embodiment of the present invention, the user-specific parameters are extracted from a user query. Preferably, the user query is entered in natural language. In another embodiment of the present invention, the matched information is routed to an information cache specific to a user so that the user can retrieve the information for viewing. Preferably, the text is parsed into components by identifying at least one of an event, a time, a location, and a participant associated with information in the text.
According to another embodiment of the present invention, a system, method and article of manufacture are provided for incorporating concept-based retrieval within Boolean search engines. Initially, textual information is retrieved from a data source utilizing a network. The textual information is then segmented into a plurality of phrases, which are then scanned for patterns of interest. For each pattern of interest found, a corresponding event structure is built. Event structures that provide information about essentially the same incident are then merged.
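By way of illustration only, the segmenting, scanning, building, and merging steps described above may be sketched as follows. The acquisition pattern, the Event fields, and the rule that two events describe the same incident when their kind, buyer, and target agree are all assumptions made for this sketch; sentence splitting is used as a crude stand-in for true noun-group and verb-group segmentation.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical pattern of interest: "<Buyer> acquired <Target> [on <Month> <day>]".
ACQUISITION = re.compile(
    r"(?P<buyer>[A-Z]\w+) (?:acquired|bought) (?P<target>[A-Z]\w+)"
    r"(?: on (?P<time>\w+ \d{1,2}))?"
)


@dataclass
class Event:
    kind: str
    buyer: str
    target: str
    time: Optional[str] = None
    sources: List[str] = field(default_factory=list)


def segment(text: str) -> List[str]:
    """Split text into sentence-like phrases."""
    return [p.strip() for p in re.split(r"[.;!?]\s*", text) if p.strip()]


def build_events(text: str, source: str) -> List[Event]:
    """Scan each phrase for the pattern and build one event per match."""
    events = []
    for phrase in segment(text):
        m = ACQUISITION.search(phrase)
        if m:
            events.append(
                Event("acquisition", m["buyer"], m["target"], m["time"], [source])
            )
    return events


def merge(events: List[Event]) -> List[Event]:
    """Merge events that describe the same incident (assumed: same key fields)."""
    merged = {}
    for e in events:
        key = (e.kind, e.buyer, e.target)
        if key in merged:
            kept = merged[key]
            kept.time = kept.time or e.time  # fill in a missing time if known
            kept.sources.extend(e.sources)
        else:
            merged[key] = Event(e.kind, e.buyer, e.target, e.time, list(e.sources))
    return list(merged.values())


events = build_events("Acme acquired Widgetco on March 3. Shares rose.", "a")
events += build_events("Analysts said Acme bought Widgetco.", "b")
result = merge(events)
```

Here the two source texts report the same acquisition, so the merged result holds a single event that keeps the date from the first source and records both sources.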
In one embodiment of the present invention, at least one phrase includes a noun group. Optionally, at least one phrase includes a verb group. In a further embodiment, a user interface is provided that allows a user to provide the search request. Further, the merged event structures may be stored in an information cache for later retrieval.
According to yet another embodiment of the present invention, a system, method and article of manufacture are provided for allowing concept-based information searching. Textual information from various sources is collected utilizing a network. The textual information is parsed to create topic-specific information packets, which are stored in an information cache. A query is received from a user, which, as mentioned above, may be input in natural language. The information packets in the information cache are matched with the user query. Matching information packets are formatted for display to a user and output.
In one embodiment of the present invention, the query is converted into an internal query form that is used to find matching information in the information cache. In another embodiment of the present invention, if the user query is not understood, a network search engine is executed and used to perform a search of information sources utilizing the user query. Information matching the user query is output to the user. In yet another embodiment of the present invention, the formatted information includes a hyperlink to the original source of the textual information.
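By way of illustration only, converting a natural language query into an internal query form, matching it against cached information packets, and falling back to a network search engine when the query is not understood may be sketched as follows. The question templates, the dictionary layout of the packets, and the fallback_search callable are hypothetical and are assumed only for this sketch.

```python
import re
from typing import Callable, Dict, List, Optional

# Hypothetical templates mapping question forms to an internal query form.
TEMPLATES = [
    (re.compile(r"who (?:acquired|bought) (?P<target>\w+)\??$", re.I),
     lambda m: {"event": "acquisition", "target": m["target"].lower()}),
    (re.compile(r"what did (?P<buyer>\w+) (?:acquire|buy)\??$", re.I),
     lambda m: {"event": "acquisition", "buyer": m["buyer"].lower()}),
]


def to_internal(query: str) -> Optional[Dict[str, str]]:
    """Return the internal query form, or None if the query is not understood."""
    q = query.strip()
    for pattern, build in TEMPLATES:
        m = pattern.match(q)
        if m:
            return build(m)
    return None


def answer(query: str, cache: List[dict],
           fallback_search: Callable[[str], list]) -> list:
    """Match cached packets, or defer to a search engine for unknown queries."""
    internal = to_internal(query)
    if internal is None:
        return fallback_search(query)
    return [p for p in cache
            if all(p.get(k) == v for k, v in internal.items())]


cache = [{
    "event": "acquisition",
    "buyer": "acme",
    "target": "widgetco",
    "source_url": "http://news.example.com/1",  # link back to the original source
}]


def stub_search(q: str) -> list:
    # Stand-in for a real network search engine.
    return ["fallback:" + q]


hits = answer("Who acquired Widgetco?", cache, stub_search)
misses = answer("weather tomorrow", cache, stub_search)
```

Each packet carries a source_url field so that formatted output can include a hyperlink to the original source of the textual information.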
Advantageously, the present invention efficiently connects people and content by providing answers to users' questions on large collections of dynamic, free-text information.
The present invention provides some dramatic benefits to a range of applications. As a web site tool, the present invention provides single-step, question and answer searches for information, very high precision information retrieval, and smooth migration of search to wireless PDAs, wireless phones, and other small devices.
In addition, the present invention can provide custom news services and automated information routing. In this mode, users post persistent queries, that is, long-standing information requests that the system continuously monitors and satisfies as new sources provide relevant information.
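By way of illustration only, persistent queries that are checked against each newly arriving item and routed to a user-specific cache may be sketched as follows. The Router class, its method names, and the representation of a query as a dictionary of required field values are hypothetical and are assumed only for this sketch.

```python
from collections import defaultdict
from typing import Dict, List


class Router:
    """Holds long-standing queries and routes new items to matching users."""

    def __init__(self) -> None:
        self.queries: Dict[str, List[dict]] = defaultdict(list)
        self.caches: Dict[str, List[dict]] = defaultdict(list)

    def post(self, user: str, query: dict) -> None:
        """Register a persistent query as a set of required field values."""
        self.queries[user].append(query)

    def ingest(self, item: dict) -> List[str]:
        """Check a new item against every persistent query and route it."""
        delivered = []
        for user, queries in self.queries.items():
            if any(all(item.get(k) == v for k, v in q.items()) for q in queries):
                self.caches[user].append(item)
                delivered.append(user)
        return delivered


router = Router()
router.post("alice", {"participant": "acme"})
router.post("bob", {"location": "paris"})
delivered = router.ingest(
    {"event": "merger", "participant": "acme", "location": "tokyo"}
)
```

Because the queries remain registered after each call to ingest, every later item from any source is checked against them without the user asking again.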
Further, as an information router, the present invention provides real-time monitoring of news and automated alerts to business intelligence and marketing staffs. When combined with Open Agent Architecture (OAA) technology, the present invention also provides news summaries through multiple modalities, such as e-mail, speech, or custom Web homepages.