1. Field of the Invention
This invention relates to the field of content retrieval. In particular, the invention relates to a computer system and method for dynamically identifying and retrieving content distributed over the Internet.
2. Description of the Related Art
The Internet comprises a vast number of computers and computer networks that are interconnected through communication links. The interconnected computers exchange information using various services, such as electronic mail, Gopher, and the World Wide Web ("WWW"). The WWW service allows a server computer system to send graphical Web pages of information to a remote client computer system. The remote client computer system can then display the Web pages. Each resource (e.g., computer or Web page) of the WWW is uniquely identifiable by a Uniform Resource Locator ("URL"). To view a specific Web page, a client computer system specifies the URL for that Web page in a request (e.g., a HyperText Transfer Protocol ("HTTP") request). The request is forwarded to the Web server that supports that Web page. When that Web server receives the request, it sends that Web page to the client computer system. When the client computer system receives that Web page, it typically displays the Web page using a browser. A browser is a special-purpose application program that effects the requesting of Web pages and the display of Web pages.
Currently, Web pages are typically defined using HyperText Markup Language ("HTML"). HTML provides a standard set of tags that define how a Web page is to be displayed. When a user indicates to the browser to display a Web page, the browser sends a request to the server computer system to transfer to the client computer system an HTML document that defines the Web page. When the requested HTML document is received by the client computer system, the browser displays the Web page as defined by the HTML document. The HTML document contains various tags that control the displaying of text, graphics, controls and other features. The HTML document may contain URLs of other Web pages available on that server computer system or other server computer systems.
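By way of illustration only, the following sketch shows how the URLs contained in an HTML document might be listed using Python's standard library. The class name LinkCollector, the function extract_links, the sample page, and the example.com addresses are assumptions introduced here and do not appear in the specification.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the URLs referenced by anchor tags in an HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url  # URL of the page being parsed
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative references against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html_text, base_url):
    """Return the absolute URLs of all hypertext links in html_text."""
    collector = LinkCollector(base_url)
    collector.feed(html_text)
    return collector.links


# Hypothetical page used only for this example.
SAMPLE_PAGE = """
<html><body>
  <h1>News</h1>
  <a href="/sports/index.html">Sports</a>
  <a href="http://other.example/weather.html">Weather</a>
</body></html>
"""

links = extract_links(SAMPLE_PAGE, "http://www.example.com/index.html")
for link in links:
    print(link)
```

The first link refers to a page on the same server and is resolved against the page's URL; the second already names a different server and is kept as written.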
The WWW is made up of millions of 'web sites', with each site having a number of HTML pages. Each HTML page usually has a number of web objects, such as graphics, text and 'hyper text' references (URLs) to other HTML pages. There is a need to identify and retrieve dynamically updated content from diverse network sources.
The invention comprises systems and methods to facilitate the collection and distribution of information over a computer network. This invention solves several information management problems, such as the marking of content distributed over a network, the instant display of current information distributed over a network, and the retrieval of information at a browser without an intermediary step to save the information. As such, the invention enables customized aggregation of content distributed over a network in real-time.
This invention enables users to simultaneously view not only their favorite web sites, but their favorite parts of their favorite web sites, all within a single window. Individual users may employ the invention to collect portions or individual web pages which may be located at any web site. Corporate web designers and site managers can use this invention to mark and collect content from their own corporate intranet or from anywhere on the web. Information aggregators may use this invention to collect web-based information from one web site or from many web sites and 're-purpose' that content in a completely new form.
The invention may also be used to 'post process' the results of any search engine to display only 'quality' or 'desired' information, thereby eliminating a need for additional user mouse clicks, and simplifying the search process while improving the quality of search results.
The invention is equally applicable to the collection and re-purposing of XML net objects as well as audio objects such as MP3. The invention has applications on the Internet as well as in conventional communications systems, such as voice telephony, and in broadband communications.
Embodiments of the invention include a recursive scripting language, or "Content Collection Language" (CCL), for identifying and accessing objects distributed over the Internet. In embodiments of the invention, short scripts written in the scripting language are used in place of URLs. Unlike URLs, which are designed for referencing static data, scripts written in the Content Collection Language may point to 'dynamic' data that is constantly updated. A CCL statement can be used just like a URL.
Embodiments of the invention include a feature extraction object used for identifying similar information objects. The invention makes it possible to divide and sort page contents from several pages into groups sharing similar attributes, which are contained in a Feature Extraction object. In this way information brokers and publishers can aggregate information from several sources into a new information object.
The invention includes systems and methods for reducing a web page to its smallest network objects and creating a Feature Extraction 'tag' or 'web fingerprint' of the object; this tag may be referenced again to find the object in the future. In embodiments of the invention, Feature Extraction uses 'fuzzy logic' to ensure that targeted content is identified and collected after a source page has been updated with fresh information or graphics.
As such, feature extraction may be used to perform any one or more of the following:
Divide any web page into its smallest parts or "atoms".
Given any desired object or its containers, generate a symbolic 'Internet fingerprint' that is persistent and may be used as an alias pointing to the target object.
Use the Internet fingerprint to find the desired object even though the static URLs on its page have changed.
Provide a resilient and robust 'fingerprint' that can work well with missing rules.
Build a Feature Extraction tag of a target that is descriptive of its results and behavior (i.e., better knowledge representation).
Produce a tag that will be consistent with the page being examined and the target object type over a vast majority of site/page combinations.
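The capabilities listed above may be sketched as follows, for illustration only. The sketch divides a page into atoms, builds a fingerprint from structural attributes of a target atom, and later locates the best-matching atom on an updated page. Jaccard set similarity is used here as a simple stand-in for the 'fuzzy logic' of the invention; the attribute set, the 0.5 threshold, the sample pages, and all names (AtomParser, fingerprint, find_by_fingerprint) are assumptions and are not taken from the specification.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # tags with no closing tag


class AtomParser(HTMLParser):
    """Divides an HTML page into atoms, one per element."""

    def __init__(self):
        super().__init__()
        self.open_atoms = []  # elements currently open, outermost first
        self.atoms = []       # every element seen, in document order

    def handle_starttag(self, tag, attrs):
        path = "/".join([a["tag"] for a in self.open_atoms] + [tag])
        atom = {"tag": tag, "path": path, "attrs": dict(attrs), "text": ""}
        self.atoms.append(atom)
        if tag not in VOID_TAGS:
            self.open_atoms.append(atom)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag, tolerating sloppy markup.
        for i in range(len(self.open_atoms) - 1, -1, -1):
            if self.open_atoms[i]["tag"] == tag:
                del self.open_atoms[i:]
                break

    def handle_data(self, data):
        if self.open_atoms and data.strip():
            self.open_atoms[-1]["text"] += data.strip()


def split_into_atoms(html_text):
    parser = AtomParser()
    parser.feed(html_text)
    parser.close()
    return parser.atoms


def fingerprint(atom):
    """Describe an atom by its structure rather than by its changing content."""
    features = {"tag:" + atom["tag"], "path:" + atom["path"]}
    for name, value in atom["attrs"].items():
        features.add("attr:" + name)
        if name in ("id", "class"):
            features.add(f"{name}={value}")
    features.add("has_text" if atom["text"] else "no_text")
    return frozenset(features)


def similarity(a, b):
    """Jaccard similarity; missing or extra features lower the score gradually."""
    return len(a & b) / len(a | b)


def find_by_fingerprint(html_text, target, threshold=0.5):
    """Return the atom that best matches target, or None if nothing is close."""
    atoms = split_into_atoms(html_text)
    best = max(atoms, key=lambda a: similarity(fingerprint(a), target), default=None)
    if best is None or similarity(fingerprint(best), target) < threshold:
        return None
    return best


OLD_PAGE = ('<html><body><div id="top"><h2 class="headline">Markets rally</h2>'
            '</div><p>Footer</p></body></html>')
NEW_PAGE = ('<html><body><div><h2 class="headline" style="color:red">'
            'Storm nears coast</h2></div><p>Footer</p></body></html>')

# Mark the headline on the old page, then find it again after the page changes.
target = fingerprint(next(a for a in split_into_atoms(OLD_PAGE) if a["tag"] == "h2"))
found = find_by_fingerprint(NEW_PAGE, target)
print(found["text"])
```

Because the fingerprint records structure rather than content, the headline atom is still located after its text has been replaced and a new attribute has been added. A production fingerprint would presumably weigh many more attributes than this sketch does.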
The invention provides 'version control' of the attribute tags as information changes and new types of Internet standards are adopted. By using this approach to version control, an information aggregation system can save an attribute tag and continue to update and extend its capturing and publishing system. While the system evolves, the older attribute tag will continue to point back accurately to the desired information recorded earlier. Web publishers can use feature extraction tags as an 'alias' to information on a page, which allows them to identify and modify other areas on the page while maintaining the alias intact.
The invention is natural-language neutral, so that the software used to create an alias to an English information object can also be used to mark information objects in any other language.
The list of attributes in a feature extraction object can be extended to include fuzzy patterns produced by a limited dictionary. For example, a limited dictionary for a university could include the terms: instructor, text, or fee. If these words are found within the context of a pattern match they can be included in the feature extraction attribute list. This part of the invention allows the user to develop extremely specific feature extraction objects for vertical subject domains in addition to the very general or horizontal approach used without domain dictionaries.
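A minimal sketch of the dictionary extension described above follows, for illustration only. The dictionary terms are those of the university example; the function names, the "term:" prefix, the base attributes, and the sample text are assumptions introduced here.

```python
import re

# Limited dictionary for the university domain, using the terms from the example.
UNIVERSITY_TERMS = frozenset({"instructor", "text", "fee"})


def dictionary_attributes(matched_text, dictionary):
    """Return one attribute per dictionary term found in the matched text."""
    words = set(re.findall(r"[a-z]+", matched_text.lower()))
    return sorted("term:" + word for word in words & dictionary)


def extend_attributes(base_attributes, matched_text, dictionary):
    """Append domain-specific attributes to a general attribute list."""
    return list(base_attributes) + dictionary_attributes(matched_text, dictionary)


# Hypothetical attributes and text for an atom found by a pattern match.
attributes = extend_attributes(
    ["tag:td", "has_text"],
    "Instructor: J. Smith. Lab fee: $40",
    UNIVERSITY_TERMS,
)
print(attributes)
```

This sketch matches whole words only, so a plural such as "fees" would not be recognized; a fuzzier comparison could be substituted where that matters.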
In embodiments of the invention, the feature tag may be used to accurately 'rank' information objects within a collection of objects in a database or on a page. For example, a page can be divided into information objects, and the user will be shown only the 'most important' objects on the page. A search engine can use this ability to do a standard lexical search and subsequently return only the most important information objects of the search results. For example, links returned by a search engine may be examined using the Feature Extraction technology of this invention to parse each search result page into atoms and subsequently score the page for its quality content. Depending on the content score, different Feature Extraction objects are used to collect data from the page. In one embodiment, a page with a high 'headline' score will be parsed and displayed using a headline capture process. A page with a high text score may be displayed using an 'article' capture object. A page with a high graphic score may be displayed by use of a graphic capture object.
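The score-then-capture flow described above may be sketched as follows, for illustration only. The scoring rules (counting heading tags, image tags, and paragraphs of at least 20 words), the atom layout, and all function names are assumptions; the specification does not state how scores are computed.

```python
HEADLINE_TAGS = ("h1", "h2", "h3")
MIN_ARTICLE_WORDS = 20  # assumed cut-off for a paragraph to count as article text


def score_page(atoms):
    """Count the headline, text, and graphic atoms found on a page."""
    scores = {"headline": 0, "text": 0, "graphic": 0}
    for atom in atoms:
        if atom["tag"] in HEADLINE_TAGS:
            scores["headline"] += 1
        elif atom["tag"] == "img":
            scores["graphic"] += 1
        elif atom["tag"] == "p" and len(atom["text"].split()) >= MIN_ARTICLE_WORDS:
            scores["text"] += 1
    return scores


def capture_headlines(atoms):
    return [a["text"] for a in atoms if a["tag"] in HEADLINE_TAGS]


def capture_article(atoms):
    return [a["text"] for a in atoms if a["tag"] == "p"]


def capture_graphics(atoms):
    return [a["src"] for a in atoms if a["tag"] == "img"]


# One capture object per content type.
CAPTURE_OBJECTS = {
    "headline": capture_headlines,
    "text": capture_article,
    "graphic": capture_graphics,
}


def capture_page(atoms):
    """Score the page, then collect data with the matching capture object."""
    scores = score_page(atoms)
    # On a tie, the earlier entry in this tuple wins.
    kind = max(("headline", "text", "graphic"), key=lambda k: scores[k])
    return kind, CAPTURE_OBJECTS[kind](atoms)


# Hypothetical atoms from a search result page.
page_atoms = [
    {"tag": "h2", "text": "Markets rally"},
    {"tag": "h2", "text": "Storm nears coast"},
    {"tag": "img", "text": "", "src": "chart.gif"},
    {"tag": "p", "text": "Short blurb."},
]

kind, captured = capture_page(page_atoms)
print(kind, captured)
```

Keeping the capture objects in a lookup table means a new content type can be supported by adding one scoring rule and one capture function.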
The invention provides a method and system for collecting and displaying information that has been collected from the Internet. Some embodiments are divided into a 'collection system' and a 'display system'.
The collection system allows a user to use a web browser application to navigate to a web page on the Internet and then 'mark' content. When the user desires to capture an object from the web, the user enters 'navigation' mode. In navigation mode, as the user clicks on hypertext links on a page, the invention process records every action. The user continues to navigate until reaching the page that contains the desired target content. Once the desired content is visible within the browser, the user can click on a 'stop recording' button. The system then displays a number of selections to the user, who may select 'Text Article', 'Images', 'Numeric Table' or 'Headlines' from the target page. If Headlines are selected, all of the headlines are collected from the page and displayed on the preview page. If Images are selected, all of the images from the page are collected and displayed on the preview page. These and other embodiments are described in greater detail herein.
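The navigation-mode recording described above may be sketched as follows, for illustration only. The four capture selections are those named in the text; the NavigationRecorder class, its method names, the layout of the recorded result, and the example.com addresses are assumptions introduced here.

```python
from dataclasses import dataclass, field

# Selections offered to the user once recording stops.
CAPTURE_CHOICES = ("Text Article", "Images", "Numeric Table", "Headlines")


@dataclass
class NavigationRecorder:
    """Records the links a user follows while in navigation mode."""

    start_url: str
    recording: bool = True
    actions: list = field(default_factory=list)

    def click(self, link_url):
        """Record a hypertext link click; ignored once recording has stopped."""
        if self.recording:
            self.actions.append(link_url)

    def stop(self, selection):
        """Stop recording and return the path to the target plus the selection."""
        if selection not in CAPTURE_CHOICES:
            raise ValueError(f"unknown selection: {selection!r}")
        self.recording = False
        return {
            "start": self.start_url,
            "path": list(self.actions),
            "capture": selection,
        }


# The user navigates from a home page to a page of headlines, then stops.
recorder = NavigationRecorder("http://www.example.com/")
recorder.click("http://www.example.com/news/")
recorder.click("http://www.example.com/news/today.html")
recording = recorder.stop("Headlines")
print(recording)
```

The returned record holds what a collection system would need in order to replay the navigation later and collect the selected kind of content from the target page.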