The present invention is in the field of digital network information gathering from network servers and pertains more particularly to methods and apparatus for providing and operating a networked system of machines dedicated to performing automated data gathering, processing, and presentation of such data.
The information network known as the World Wide Web (WWW), which is a subset of the well-known Internet, is arguably the most complete source of publicly accessible information available. Anyone with a suitable Internet appliance such as a personal computer with a standard Internet connection may connect to the Internet and navigate to many thousands of information pages (termed web pages) stored on Internet-connected servers for the purpose of garnering information and initiating transactions with hosts of such servers and pages.
Information travels over the Internet network through many connected computers known in the art as nodes. Internet nodes include any hosted machines dedicated to performing a service such as file serving, data storing, data routing, and so on. Such nodes are generally loosely associated with each other only by uniform resource locator (URL) addressing and mapped network paths.
Some data initiated by or requested by users is not protected from interception by network-connected nodes and may therefore be observed by third parties, owing to the publicly shared nature of bandwidth over the Internet. However, various means for protecting data from being observed by third parties are established and routinely practiced by entities hosting pluralities of nodes connected to the Internet. Such methods include the use of firewall technology, secure servers, and private sub-networks connected to the Internet network.
Many companies doing business on the Internet host semi-private data networks comprising a plurality of computer nodes dedicated to the provision of proprietary information and related data. Certain authorized users such as those working for the company or those having password access and/or active and verifiable accounts with the company may access such data. For example, a large company may host a plurality of file servers, including connected data storage systems wherein users may search for and access data stored for the purpose by the company. Such sub-nets, as they are often termed, use the Internet as a connective wide area network (WAN) and the data travels through shared bandwidth connections. Although a user may be protected from third-party interception of data sent or requested, the user must generally navigate to each URL where data is available. If a search engine is provided to assist a user in searching for specific data made available by the company, it is limited to searching only the nodes hosted by the company or data from third-party nodes that is made available through cooperative URL linking or posting.
An information gathering, summarization and presentation system known to the inventor and described in the related patent application listed under the cross-reference section uses an Internet portal and software suite to allow users to request and obtain data including Web-page summaries containing specific data found by using a unique scripting method supplied by a knowledge worker. In some embodiments such data may also be pushed to a user subscribing to the service.
A service such as that described above requires a considerable amount of processing power in order to service a very large client base in terms of job processing. A desired goal is to automate such an information gathering and presentation service so as to be wholly or largely transparent to individual users. Prior art network architectures possess neither the processing power nor the dedicated cross-communication capabilities that would be required for such a service to be wholly automated and able to serve a mass clientele.
What is clearly needed is a dedicated and hierarchical network of cooperating computer nodes that is adapted to fulfill a very large number of automatically scheduled and user-initiated data requests in a wholly automated and transparent fashion. Such a networked system would be scalable, in that it could be easily expanded by adding machines according to user demand. Such a system would save users and service providers much of the time and labor associated with obtaining optimum and efficient results from an information gathering and presentation service.
In a preferred embodiment of the present invention a data-gathering and reporting system for collecting data from a wide area network (WAN) is provided, comprising a database stored in a data repository; a first server having access to the database and organizing data-gathering work assignments from data in the database; a hierarchical network of distributor servers having a highest level connected to the first server and expanding to a lowest level, with distributor servers at different levels connected by data links and distributing work assignments to lower levels on demand from the distributor servers at lower levels; a plurality of gatherer servers connected by data links to the lowest level of the hierarchy of distributor servers and to the WAN, the lowest level of distributor servers distributing work assignments to the gatherer servers on demand from the gatherer servers, the gatherer servers accomplishing the work assignments distributed by the distributor servers and queueing data collected from the WAN as a result of the work assignments; a hierarchical network of collector servers having a lowest level connected to the gatherer servers and contracting to a highest level, the gatherer servers communicating data collected to the lowest level of collector servers, with collector servers at different levels connected by data links and delivering collected data to higher levels; and one or more filing servers connected to the highest level of collector servers, the filing servers communicating with the database in the data repository, the collector servers delivering collected data to the one or more filing servers, and the filing servers writing the collected data to the database.
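As an illustration only, the overall flow described above (work assignments passing down a distributor hierarchy on demand, and collected data passing up a collector hierarchy to a filing step) might be sketched as follows. This is a minimal single-process sketch, not the system of the invention: the names Server, demand_work, and deliver, the job labels, and the in-memory list standing in for the database are all assumptions made for the example, and the collector levels are collapsed so that a record goes straight to the top collector.

```python
from collections import deque


class Server:
    """One node in a hierarchy; `parent` is the next server toward the top."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.queue = deque()


def demand_work(server):
    """Return one assignment, demanding it from higher levels when the local queue is empty."""
    if not server.queue and server.parent is not None:
        assignment = demand_work(server.parent)
        if assignment is not None:
            server.queue.append(assignment)
    return server.queue.popleft() if server.queue else None


def deliver(server, record):
    """Pass a collected record up the collector hierarchy to its top-level queue.

    Simplification: intermediate collector queues are skipped.
    """
    while server.parent is not None:
        server = server.parent
    server.queue.append(record)


# Distribution side: first server -> mid-level distributor -> lowest-level distributor.
first_server = Server("first")
first_server.queue.extend(["job-1", "job-2", "job-3"])  # hypothetical assignments
low_distributor = Server("dist-low", parent=Server("dist-mid", parent=first_server))

# Collection side: lowest-level collector -> top-level collector feeding the filing step.
top_collector = Server("coll-top")
low_collector = Server("coll-low", parent=top_collector)

database = []  # stand-in for the data repository

# A gatherer demands work, "collects" data, and hands the result to the collectors.
while (job := demand_work(low_distributor)) is not None:
    deliver(low_collector, {"job": job, "data": f"collected for {job}"})

# The filing step drains the top collector and writes to the database stand-in.
while top_collector.queue:
    database.append(top_collector.queue.popleft())
```

In this sketch only the first server and the filing step touch the stored data; the intermediate servers see nothing but queued assignments and records.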
In one important embodiment the WAN is the Internet, and data is collected from WEB servers on the Internet. Also in a preferred embodiment, gating of work assignments and data between one server and another in the distributor server hierarchy is accomplished by the one server having a queue with an adjustable threshold and demanding data or work assignments from the other server when the queue level falls to the threshold. Latency and database writing efficiency may be adjusted by adjusting queue thresholds among servers, and the server power and capacity required in a system are adjusted by scaling the number of servers and the number of hierarchical levels of servers.
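The threshold-gated demand described above can be illustrated with a short sketch. It is a sketch under stated assumptions rather than a definitive implementation: the class name DemandQueue, the batch_size parameter, and the make_source helper are hypothetical and do not come from this specification. The point shown is that a server asks its upstream neighbor for more work only when its own queue level has fallen to the adjustable threshold.

```python
from collections import deque


class DemandQueue:
    """A server-side queue that pulls work from an upstream source on demand."""

    def __init__(self, upstream, threshold, batch_size):
        self.upstream = upstream      # callable: upstream(n) -> list of up to n items
        self.threshold = threshold    # adjustable low-water mark
        self.batch_size = batch_size  # assumed: number of items demanded per refill
        self.items = deque()
        self.demands = 0              # count of demands issued upstream

    def refill_if_needed(self):
        # Demand more work only when the queue level has fallen to the threshold.
        if len(self.items) <= self.threshold:
            self.demands += 1
            self.items.extend(self.upstream(self.batch_size))

    def take(self):
        """Return the next item, or None when no work is available."""
        self.refill_if_needed()
        return self.items.popleft() if self.items else None


def make_source(total):
    """Hypothetical upstream server holding `total` numbered assignments."""
    pending = deque(range(total))

    def pull(n):
        return [pending.popleft() for _ in range(min(n, len(pending)))]

    return pull


queue = DemandQueue(make_source(10), threshold=2, batch_size=4)
taken = [queue.take() for _ in range(10)]
```

Raising the threshold keeps more work staged locally at the cost of holding assignments longer; lowering it does the reverse. That is one way to read the latency adjustment mentioned above.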
In some embodiments priority is assigned to work assignments, and work assignments and collected data are gated from server to server according to assigned priority as well as by need. Also in some embodiments work assignments are expressed in a markup language, allowing all information required to fill an assignment to be encapsulated such that only the one or more filing servers need be connected to the database.
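By way of a hedged example, priority gating and markup encapsulation might look like the following. The element and attribute names (assignment, priority, subscriber, url), the convention that a lower number means higher priority, and the PriorityGate class are assumptions made for this sketch; the specification does not fix a particular markup schema. Because each assignment carries everything needed to fill it, the gate never consults a database.

```python
import heapq
import itertools
import xml.etree.ElementTree as ET


def encode_assignment(subscriber, url, priority):
    """Encapsulate everything a gatherer needs in one markup document."""
    root = ET.Element("assignment", {"priority": str(priority)})
    ET.SubElement(root, "subscriber").text = subscriber
    ET.SubElement(root, "url").text = url
    return ET.tostring(root, encoding="unicode")


def decode_assignment(markup):
    """Recover the assignment fields from the markup alone."""
    root = ET.fromstring(markup)
    return {
        "priority": int(root.get("priority")),
        "subscriber": root.findtext("subscriber"),
        "url": root.findtext("url"),
    }


class PriorityGate:
    """Releases assignments by priority (assumed: lower number first)."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker keeps arrival order within a priority

    def put(self, markup):
        priority = decode_assignment(markup)["priority"]
        heapq.heappush(self._heap, (priority, next(self._order), markup))

    def get(self):
        return heapq.heappop(self._heap)[2] if self._heap else None


gate = PriorityGate()
gate.put(encode_assignment("alice", "http://example.com/a", 5))
gate.put(encode_assignment("bob", "http://example.com/b", 1))
gate.put(encode_assignment("carol", "http://example.com/c", 5))
order = [decode_assignment(gate.get())["subscriber"] for _ in range(3)]
```

Here the on-demand request from "bob" is released ahead of the two lower-priority assignments, which keep their arrival order.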
In a preferred embodiment the system is associated with an Internet subscription server, and the work assignments are for collecting data from WEB pages associated with individual subscribers. In this case some work assignments may be automatically scheduled for individual subscribers and some assignments may be on demand from individual subscribers.
In another aspect of the invention a data-gathering and reporting system for collecting WEB summaries from the Internet for individual subscribers to a Portal subscription system is provided, comprising a plurality of gatherer servers each connected to the Internet, to an ascending hierarchy of work request distribution servers, and to an ascending hierarchy of collector servers; a work request generator at the top of the hierarchy of distribution servers, generating work requests for collecting WEB summaries; and a filing server at the top of the hierarchy of collector servers, the filing server connected to and writing data to a database. Flow is by work requests from the work request generator down the hierarchy of distribution servers to the gatherer servers, where work requests are accomplished by gathering WEB summaries from Internet servers according to the work requests, and by data collected from the gatherer servers up the hierarchy of collector servers to the filing server. Flow is gated on demand down the hierarchy of distribution servers by each server from a previous server in the direction of flow.
In this system gating of work assignments and data between one distribution server and another is accomplished by the one server having a queue with an adjustable threshold and demanding data or work assignments from the other server when the queue level falls to the threshold. Latency and database writing efficiency are adjusted by adjusting queue thresholds among servers, and the server power and capacity required in a system are adjusted by scaling the number of servers and the number of hierarchical levels of servers. In some cases priority may be assigned to work assignments, and work assignments and collected data may be gated from server to server according to assigned priority as well as by need. Also in a preferred embodiment work assignments are expressed in a markup language, allowing all information required to fill an assignment to be encapsulated such that only the one or more filing servers need be connected to the database.
In another aspect of the invention methods are provided for practicing the invention using the system of the invention. In the embodiments of the invention taught below in enabling detail, for the first time a scalable and very efficient system for gathering large amounts of data on the Internet is provided, where the data collected may be directed by work assignments in small increments. There are many advantages. For example, the system of the invention relieves the user of the necessity of navigating the clutter of the Internet to find what is needed on a daily basis. It also provides immediate access for the user to information from multiple sources, because information is gathered on behalf of a user continuously. Various second-level services may also be provided, such as access from wireless Internet appliances.