A portion of the disclosure recited in the specification contains material which is subject to copyright protection. Specifically, a Microfiche Appendix in accordance with 37 CFR Section 1.96 is included that lists source code instructions for a process by which the present invention is practiced in a computer system. The Microfiche Appendix comprises 13 sheets of microfiche containing 377 frames, or pages, of source code. The copyright owner has no objection to the facsimile reproduction of the specification as filed in the Patent and Trademark Office. Otherwise all copyright rights are reserved.
This invention relates in general to data processing in networked computers and, more specifically, to an object-oriented approach for handling digital information in large distributed networks such as an Intranet or the Internet.
The evolution of computer systems and computer networks is one of accelerating growth. Individual computer systems are equipped with vast amounts of storage and tremendous processing power. Even as these computing resources have increased, the physical size of the systems has decreased, along with the cost of such systems, so that the number of computer systems has grown dramatically.
Not only has the number of computer systems increased at an astounding rate, but improvements in telecommunications have contributed to making massive worldwide networks such as the Internet a reality. However, these advances in computer systems and computer network technology make an overwhelming amount of information available, so much so that it is difficult for a user to extract desired information from the network or to use the network efficiently to accomplish many tasks. Although the Internet has made it very easy both to publish information and to access it, through content servers and browsers, respectively, this opportunity has given rise to thousands of information publishers and millions of information consumers.
This phenomenal growth is making it increasingly difficult for an information consumer to hook up with the publishers that are of interest. A second problem is that the continuous exchange of information in the form of data or algorithms between the myriad types of computer systems, or platforms, which are foreign to one another and mutually incompatible, means that a user of the Internet needs to be aware of compatibility issues such as which types of applications run on which platform, what data formats can be used with which applications, etc. A third problem with the Internet is that, although computer systems have acquired resources to be able to run very large, resource-intensive application programs, this type of "fat" resident application, or client, on a user's machine does not fit today's Internet paradigm where restricted bandwidth makes it impractical to transfer large amounts of data or programs. Also, the sheer volume of information and number of users means that storage space on the various server computers making up the Internet, and used to perform routing of information on the Internet, is at a premium.
Brute-force "keyword" search engines already prove incapable of effectively solving the problems of the Internet. A whole plethora of "push" technologies is emerging that attempts to provide solutions to this problem within some spectrums. A few "publish-subscribe" solutions exist, but these demand a fair amount of infrastructure at both the publisher and consumer ends. Each of the shortcomings of these approaches is discussed in turn.
Keyword Search
The inefficient data search and retrieval of the popular keyword search engines is illustrated by the following example of performing a simple search to locate job candidates.
Assume, as was done in an actual test case, that the user of a computer system on the Internet wants to locate and hire computer programmers. Using traditional Internet search technology, the user might go to a website such as AltaVista, Yahoo!, HotBot, etc., and enter a search query for "programmer available." This search on AltaVista in February 1998 produced 166 documents matching the query. However, the vast majority of these documents are useless in accomplishing the goal of finding a job candidate. For example, many of the documents are outdated. Others merely use the phrase "programmer available" in ways other than to identify an actual job candidate. Some of the documents are from "dead" links which no longer exist and are inaccessible. Many of the documents were duplicate documents resulting from peculiarities in the way the database is compiled and maintained.
Many of the documents in the search results would not be useful even if they identified an available programmer candidate. This is because the candidates are from different places in the world and many of the documents are old, meaning the programmers are probably not available anymore or have moved. Of course, the search can be refined by adding additional keywords, such as the specific type of programming language skill desired, region, restricting the documents to a specific timeframe, etc. However, since the only tool available to the user to refine the search is to add keywords, or to place relational conditions on the keywords, a second full-text search of the entirety of documents on the Internet would yield many of the same problems as in the previous search, along with new problems introduced by unpredictable uses of the additional or modified text phrases in the free-form format of documents on the Internet.
Another limitation with the full-text search engines available on the Internet today is that much of the information on the Internet exists in "dynamic" web pages which are created in response to specific one-time requests or actions by human users or automated signals. Even the so-called "static" web pages are updated frequently, or are left on the Internet long after they cease to be supported or cease to be valid or relevant. Since the search engines compile a database based on "robots" or "spiders" visiting sites on the Internet at repeated time intervals, many of their results are unrepeatable or outdated and invalid. Also, the spiders are not able to discover all possible web pages such as pages that might be included in a resume database that is not published in the form of one web page per resume. Still further problems exist with keyword search engines in that use of the text language is not fully standardized. An example of this is that many people use the spelling "programers" instead of "programmers" with two 'm's.
The second problem with the Internet, that of compatibility issues between platforms, programs and data types, is encountered by a user of today's Internet whenever a user tries to obtain software, and sometimes data, from the Internet. Although the Internet has provided a wealth of commercial (and free) software, utilities, tools, etc., much of this software requires a great deal of effort on the part of the user to get it running correctly, or is of little or no value because of incompatibility problems that must be solved at the user's time and expense.
For example, when a user downloads a piece of software, they must know about their computer, operating system, compression/decompression utility required, etc. in order to determine whether the software being downloaded is going to be usable in the first place. Keeping track of proper versions of the software and utilities further complicates matters. This is especially true when the software obtained is designed to work with data of a certain type, such as where the software is used to access multimedia files of a certain provider's format, is a new driver for hardware from a specific manufacturer, etc. This makes it difficult for would-be manufacturers of third party "value-added" utilities to produce software that can be used with other software, data or hardware made by another manufacturer. Thus, although today's Internet is successful in making available a plethora of software, utilities, tools, drivers and other useful programs, and can usually adequately deliver the software to a user, it fails in providing a uniform and seamless environment that eliminates significant compatibility problems essential to allowing a user to easily obtain added functionality.
The third shortcoming of the Internet is the relatively poor ability of the Internet to download large amounts of digital information which make up the data and programs of interest to a user. Today, because of improvements in storage capacity and processing power, a typical user runs applications that are resource-intensive and thus require large amounts of data and large programs to manipulate the data. For example, it is not unusual for a user to download a demonstration program on the order of 10 to 20 megabytes. Such a download through a 28.8 kbit/sec modem might take 3-6 hours depending on the speed of the user's overall connection to the Internet, server overload, number of server "hops" to connect with the download server, etc. Thus, although the trend in computer systems has been toward larger-and-larger application programs which manipulate huge data structures, this trend is incompatible with a network such as the Internet which is rather limited in the speed with which it can handle the demands of the many millions of users trying to utilize it.
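The download figures above can be checked with simple arithmetic. The following is a minimal sketch, not part of the original disclosure; it assumes 1 megabyte = 1,000,000 bytes and ignores protocol overhead, and the helper name `transfer_seconds` and the 25% efficiency figure are illustrative assumptions rather than measurements.

```python
def transfer_seconds(size_bytes: int, link_bits_per_sec: float,
                     efficiency: float = 1.0) -> float:
    """Seconds needed to move size_bytes over a link.

    efficiency is the fraction of the nominal link rate actually achieved
    (an assumed parameter standing in for server load, hops, and so on).
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return size_bytes * 8 / (link_bits_per_sec * efficiency)


MODEM_BPS = 28_800  # nominal 28.8 kbit/sec modem rate from the text

for megabytes in (10, 20):
    size = megabytes * 1_000_000  # assumption: decimal megabytes
    ideal_hours = transfer_seconds(size, MODEM_BPS) / 3600
    degraded_hours = transfer_seconds(size, MODEM_BPS, efficiency=0.25) / 3600
    print(f"{megabytes} MB: {ideal_hours:.2f} h at nominal rate, "
          f"{degraded_hours:.2f} h at 25% of nominal")
```

At the nominal rate the transfer takes roughly 0.8 to 1.5 hours. Multi-hour figures such as those quoted above correspond to an effective throughput of about a quarter of nominal, which is plausible on a congested connection but is only an assumption here.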
"Push"
The approach of finding out what information a user desires and "pushing" this information to the user by sending it over the network to the user's computer from time-to-time is epitomized by the application PointCast, available at http://www.pointcast.com/. The application program requires the user to specify areas of interest such as categories of news (e.g., politics, business, movies, etc.), specific sports teams, stocks, horoscope, etc. The user's interests are then recorded at a PointCast server site. Periodically the user is sent, or "pushed," the specific information from PointCast's server site to the user's computer. The information is compiled and maintained by PointCast although other companies may be involved.
Although "push" technology such as PointCast has the advantage that the user can be updated automatically about specific categories of information, this approach is not very flexible and does not provide much improvement in obtaining information other than providing a tailored version of the daily newspaper. Drawbacks with "push" technology include the inability of the user to specify arbitrary information (the user must pick from a list); there is no mechanism for the user to obtain information from outside of the "push" provider's server site; and the user cannot upload the user's own information for distribution.
"Push" technology provides uniformity across platforms and data types but it does so only by limiting the user to a single application front end and to information controlled by a single main entity. In this sense, "push" technology thwarts the usefulness of a universal interactive network like the Internet and transforms it into a non-interactive traditional information model, such as radio or television.
Because the pushed information comes from a server or servers controlled by a single entity, push technology fails to create a standardized market for information objects, information processing products and information services. Instead, the push approach pits different push providers against each other for user share. The push approach, unlike the Internet paradigm, is not an open approach and, in fact, is contrary to what many view as the exciting and valuable qualities of the Internet.
Publish-Subscribe
The Publish-Subscribe approach provides a more powerful information exchange system than the "push" approach. The Publish-Subscribe approach allows a user, in a similar manner to the "push" approach, to specify the type of information to which the user wishes to subscribe. However, the user is not strictly limited to categories presented by an application program front-end. Rather, a typical Publish-Subscribe approach allows a user to specify more general types of information such as by using a plain-text subject description.
In a Publish-Subscribe network, publishers provide information that is freely available for all users. Publishers can make any generalized type of information available and identify such information by a mechanism such as the "subject" line. With publishers publishing information and identifying the information by subject, and subscribers subscribing to information identified by subject, processes within the Publish-Subscribe network perform the task of matching up the subscription requests with the available published information and setting up resources in the form of, for example, "channels," so that the transfer of information can take place. However, Publish-Subscribe has been successful only in relatively small, proprietary networks where the publishers and subscribers are aware of the types of information in the network and agree on how to identify the information. Since Publish-Subscribe is limited in how the types of information are specified, as by a plain-text subject header, for example, ensuring that a proper match takes place introduces problems similar to those discussed above with the keyword search engines. So far, the Publish-Subscribe approach has failed to be scaled up to be suitable for larger, more generalized networks such as a large intranet or the Internet because the Publish-Subscribe model fails to provide efficient mechanisms allowing simple and versatile unrestricted data subscriptions and efficient, organized and robust distribution of information.
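The subject-matching limitation described above can be illustrated with a toy broker. This is a minimal sketch under stated assumptions, not a description of any actual Publish-Subscribe product: the class `SubjectBroker`, its methods, and the exact-string matching rule are all hypothetical and chosen only to show why plain-text subjects inherit the keyword-search problem.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class SubjectBroker:
    """Toy broker that matches publications to subscriptions by exact subject text."""

    def __init__(self) -> None:
        # Each subject string acts as a "channel" holding delivery callbacks.
        self._channels: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, subject: str, deliver: Callable[[str], None]) -> None:
        self._channels[subject].append(deliver)

    def publish(self, subject: str, body: str) -> int:
        """Deliver body to every subscriber of subject; return the delivery count."""
        subscribers = self._channels.get(subject, [])
        for deliver in subscribers:
            deliver(body)
        return len(subscribers)


broker = SubjectBroker()
inbox: List[str] = []
broker.subscribe("programmer available", inbox.append)

matched = broker.publish("programmer available", "resume A")
missed = broker.publish("programer available", "resume B")  # one 'm': no match

print(matched, missed, inbox)
```

The second publication is silently lost because publisher and subscriber did not agree on the exact subject text, which is workable in a small proprietary network and fragile on an open one.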
Further, Publish-Subscribe technology relies on custom front-ends that have specific features designed by a single manufacturer. For this reason, Publish-Subscribe technology limits the user in the types of information, and distribution of information, that are possible. Similar to "push" technology, Publish-Subscribe does not provide an "open" information architecture allowing unrelated commercial entities to provide information items, information processing products and information services in a compatible and mutually beneficial manner.
Prior Art Information Processing Models
FIG. 2A shows the prior art models for information processing in networked computer systems.
In FIG. 2A, conventional distributed client/server applications and their manner of processing data are illustrated in diagram form. The execution of a "singleton" application is shown at 160. This represents an application program executing on the user's local computer, such as a desktop computer where almost all of the data and executed instructions reside on the user's computer. The application program has a large amount of data associated with it and typically performs all of its processing on the local data, which may be copies of remote data. The computer is hooked up to a network represented by lines 162. The network, in the singleton application case, is used only to access a common database 163 that might be shared among several users as, for example, in a workgroup. An example of a singleton application is a database program such as would be common in the late 1980s. The common database 163 can be modified by various database application programs executing at the various user computers. Such updates or modifications are typically made through a limited set of commands such as Get/Set illustrated at 164. Data can also be routed through routing hardware 178 to remote data store servers such as 182 via additional network connections such as 180.
Later models of information processing make more use of the network so that more data can be present at remote locations. As shown at 170 the evolved model has a user operating a client at a local computer, or workstation, as before. However, much of the data 173 now resides in a remote location as, for example, at the user's server. Communication with the server (not shown) is via a local network such as an Ethernet, indicated by line 172. Naturally, various copies of data will exist and some copies of the data will necessarily exist at the user's desktop or workstation such as in the user's random access memory (RAM), cache memory or disk, in local files that the user may have designated or that may be automatically created by the client application. However, the model from the user's and the client's point of view is that remote data is being accessed.
Get/Set operations 174 can be performed on the data 173. The data is often obtained from, and synchronized with, other remote databases through one or more networks and associated routing hardware, indicated by networks 176 and 180, routing hardware 178 and remote data store server 182. Additional networks and servers can be used to transfer data to and from the user's local server database as indicated by additional network 184 and data store server 186.
Note that a property of the approaches shown in FIG. 2A is that the processing entity, namely clients 190 and 192, resides in the user""s local computer system. Also, as is typical with traditional information processing models, each client is specific and dedicated to processing data of certain types and to performing specific limited tasks. In other words, the dozens of processing applications created by different software manufacturers are incompatible with each other in that they cannot, without considerable effort, be made to process a data structure created by a foreign application program.
From the above discussion, it is apparent that a system that provides for data searching and manipulation on the Internet in an efficient, effective and intuitive manner would be most welcome. The system should provide an environment and architecture that is adaptable for different applications and customizable by information providers and consumers to suit the many types of commercial, recreational, educational and other uses that the Internet fulfills. The system should also operate uniformly across the various computer platforms connected to the Internet and so provide a basis for uniform development of algorithms, or computer programs, to make feasible a third-party value-added market for the Internet. The system should operate efficiently within the boundaries of the Internet's limited bandwidth so as to make today's Internet a truly valuable resource.
The invention provides a mechanism for augmenting processing of information objects that are transferred among processors within a network. In general, the processing is performed by a process, or processor, (called a "robot") at any point in the network where an information object is transferred, or where the object resides. By allowing processing at source, destination and at "interim" points between the source and destination, the ability to add functionality, services, control and management of objects and object transfers is greatly enhanced. As an example, an object that is a notice of a book that is available for purchase can be published in the system. An interim process, or "robot," detects the book information object and adds text of a review of the book. The resulting augmented object is then available to users of the system, perhaps at an increased cost to the subscribers.
The robots can reside at any point in the system. For example, a robot can be local to an end-user's computer, can reside on a content source server, or can be on another computer, processor, storage location or device on the network. Any type of processing can be performed by the robots. For example, access rights can be maintained so that certain attributes and values of information objects are restricted on a per user, per machine, chronological or other basis. Robots can use conditions that trigger specific processing when satisfied by attribute/value pairs within a specific object or by other, external conditions. The processing can include one or more objects, other information processing, software or hardware control functions, etc. Information can be appended to objects. Statistics on object use, publication, subscription or transfers can be compiled. Groups of robots can operate in cooperation. Robots can share information.
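The book-review example can be sketched as a condition/action pair applied to an object in transit. This is an illustrative sketch only and is not taken from the Microfiche Appendix: the names `Robot`, `route`, and `InfoObject`, the attribute names `type` and `review`, and the placeholder review text are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

InfoObject = Dict[str, str]  # an information object as attribute/value pairs


@dataclass
class Robot:
    """A process that augments objects whose attributes satisfy a condition."""
    condition: Callable[[InfoObject], bool]
    action: Callable[[InfoObject], InfoObject]

    def process(self, obj: InfoObject) -> InfoObject:
        # Objects that do not satisfy the condition pass through unchanged.
        return self.action(obj) if self.condition(obj) else obj


def route(obj: InfoObject, robots: List[Robot]) -> InfoObject:
    """Pass obj through each robot in order, as if hopping from source to destination."""
    for robot in robots:
        obj = robot.process(obj)
    return obj


# Hypothetical interim robot: appends a review to any book notice it sees.
reviewer = Robot(
    condition=lambda o: o.get("type") == "book-notice",
    action=lambda o: {**o, "review": "(review text added by interim robot)"},
)

book = {"type": "book-notice", "title": "Example Title"}
other = {"type": "weather"}

augmented = route(book, [reviewer])
untouched = route(other, [reviewer])
print(augmented)
print(untouched)
```

The action returns a new dictionary instead of mutating its input, so the published object and the augmented object remain distinct. That is one possible design choice; the text does not say whether augmentation happens in place.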
In the preferred embodiment, the use of augmenting processes occurs within an overall system architecture and protocol that uses a Network Addressable Semantically Interpretable Byte-Set (NASIB). A NASIB is defined as any series of bytes that can be semantically interpreted by a pre-existing algorithm given sufficient context. A NASIB by this definition could thus be a traditional data structure or even an algorithm. The human client in a traditional client/server paradigm becomes another NASIB process with a higher degree of foreign (possibly human) interactivity than other servers. All processing is then viewed as a distributed task whose completion involves transfer and consumption of relevant NASIBs across processes in response to voluntary accesses as well as involuntary distribution. Each process and its NASIBs are uniquely addressable and fetchable across connected networks (an example of such a network would be the Internet or a company intranet).
To achieve this goal, however, the architecture employs a plethora of unique concepts, apparatus and mechanisms:
1. Encapsulating each information unit is a unique specification system using various types of Network Addressable Semantically Interpretable Byte-Sets (NASIBs).
2. NASIB-style processing involves movement of objects and consumption of resources. Processes that participate in such processing are called NASIB processes.
NASIBs have the intrinsic property of being network addressable and serializable.
Any processing entity is network addressable if it can be made available to any requesting algorithm residing on a distributed connected network based on its name, which could be a simple string value or a number identifier. The name in this case needs to contain at least the following two components:
Network Process Identifier: A substring which when handed to an appropriate protocol handler can provide the means to route data packets to that specific process and communication port as identified by the name within the combined name space of the distributed network.
Process Name Space Identifier: A substring which when handed to an appropriate process can provide the means to retrieve a specific set of bytes as identified by this name within the combined name space of that process.
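The two name components can be illustrated with a small parser. The text does not define a concrete name syntax, so the `host:port/local/path` form used here is purely an assumption, as are the names `NasibName` and `parse_name` and the example address.

```python
from typing import NamedTuple


class NasibName(NamedTuple):
    network_process_id: str  # routes packets to a specific process and port
    name_space_id: str       # selects a byte set within that process's name space


def parse_name(name: str) -> NasibName:
    """Split a name at the first '/' into its two components.

    Assumed syntax: '<host>:<port>/<local identifier>'. The source does not
    prescribe this format; it is used here only for illustration.
    """
    process_id, separator, local_id = name.partition("/")
    if not separator or not process_id or not local_id:
        raise ValueError(f"name needs both components: {name!r}")
    return NasibName(process_id, local_id)


parsed = parse_name("server.example:7000/jobs/programmers/42")
print(parsed.network_process_id)
print(parsed.name_space_id)
```

A protocol handler would consume the first component to reach the process, and the process itself would resolve the second component within its own name space.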
Any processing entity is serializable if, within the same language framework, any of its instances can be converted into a string of bytes that can be stored as an instance of just one underlying language data type and, further, can be recreated back from this one instance to an entity of the original type.
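The serializability property amounts to a round trip through a single underlying data type. The sketch below uses Python `bytes` as that type and JSON as the encoding; both choices, along with the `JobNotice` class and its fields, are assumptions made for illustration and not the encoding used by the invention.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class JobNotice:
    """Hypothetical information object used only to demonstrate the round trip."""
    title: str
    region: str

    def to_bytes(self) -> bytes:
        # Convert the instance into one underlying data type: a byte string.
        return json.dumps(asdict(self), sort_keys=True).encode("utf-8")

    @classmethod
    def from_bytes(cls, data: bytes) -> "JobNotice":
        # Recreate an entity of the original type from the stored bytes.
        return cls(**json.loads(data.decode("utf-8")))


original = JobNotice(title="programmer", region="anywhere")
stored = original.to_bytes()
restored = JobNotice.from_bytes(stored)
print(type(stored).__name__, restored == original)
```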
Formal means of encapsulating the state of a NASIB and interacting with it through state-based Equivalency Events and Event Interest Requests (interest in an equivalency).
External Subsystems:
3. Transport and persistence implementations in each NASIB process are looked upon as external subsystems with generic parameterization and switchability being possible. Any physical implementation of a NASIB server usually would possess at least the following external subsystems.
4. Ability to be network addressable as an "outer algorithm."
5. Ability to serve and store any arbitrary byte sets, once identified, within its name space.
6. Ability to garbage collect its content and archive it for later contextual recall based on some predefined policies.
7. Ability to persist and re-create its complete execution state and/or data state to a permanent computer data medium using direct or indirect means.
Pure Data Source/Sink processes, i.e., those that only publish or provide generic retrieve capabilities, are also seen as external to the system. Anyone who can facilitate a byte buffer following a predefined format on a recognizable transport can become a publisher or retriever.
All communication between processes follows four request-reply messaging paradigms, each differentiated from the others on the basis of the life cycle of the transport task:
8. Synchronous Request: Requesting process transmits request message set and waits for reply message set. The transport task expires when a successful or unsuccessful request is sent back from the serving process. No trace of the request or reply is left.
9. Asynchronous Request: Requesting process transmits request message set, sets up a callback rendezvous algorithm and then continues with processing. The transport task expires when a successful or unsuccessful request message set is sent back from the serving process using what is available, until such time the request is cached. The waiting algorithm on the requesting process is then activated. No trace of the request or reply is then left.
10. Asynchronous Cached Request: Requesting process transmits request message set, sets up a callback rendezvous algorithm and then continues with processing. The transport task is persisted by the serving process until the expiration of the cache specification. The cache specification can be in two units.
11. Quality and Quantity of Reply Specification Set
12. Relative or Absolute Time Specification
and one push request paradigm:
13. Broadcast Request: Requesting process broadcasts request message sets to one or multiple receivers. Successful transmission (and optional acknowledgement) of the request is considered to be the expiration of the transport task.
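The life-cycle differences among the paradigms listed above can be sketched in a single process. This is a simplified illustration, not the protocol itself: there is no real transport, the function and class names are hypothetical, the cached request models only a reply-count limit (one reading of the "Quantity of Reply" specification), and the worker thread is joined solely to keep the example deterministic.

```python
import threading
from typing import Callable, List


def serve(request: str) -> str:
    """Stand-in for the serving process; the reply format is hypothetical."""
    return f"reply to {request}"


def synchronous_request(request: str) -> str:
    # The requester blocks until the reply arrives; nothing is retained afterwards.
    return serve(request)


def asynchronous_request(request: str, callback: Callable[[str], None]) -> threading.Thread:
    # The requester registers a callback and continues; the callback fires on reply.
    worker = threading.Thread(target=lambda: callback(serve(request)))
    worker.start()
    return worker


class CachedRequest:
    """Serving-side record persisted until an assumed reply-count limit is reached."""

    def __init__(self, callback: Callable[[str], None], max_replies: int) -> None:
        self.callback = callback
        self.remaining = max_replies

    def deliver(self, reply: str) -> bool:
        """Forward a reply unless the cache specification has expired."""
        if self.remaining <= 0:
            return False
        self.remaining -= 1
        self.callback(reply)
        return True


def broadcast_request(request: str, receivers: List[Callable[[str], None]]) -> int:
    # Push to every receiver; the task ends once transmission completes.
    for receive in receivers:
        receive(request)
    return len(receivers)


sync_reply = synchronous_request("status")

async_replies: List[str] = []
worker = asynchronous_request("status", async_replies.append)
worker.join()  # joined here only so the example is deterministic

cached_replies: List[str] = []
cached = CachedRequest(cached_replies.append, max_replies=2)
delivered = [cached.deliver(f"update {i}") for i in range(3)]

received: List[str] = []
sent = broadcast_request("notice", [received.append, received.append])

print(sync_reply, async_replies, delivered, sent)
```

A time-based cache specification could be modeled the same way by comparing against a deadline instead of decrementing a counter.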
Implementation System Assumptions
The logical design of a NASIB server can be implemented in multiple hardware and software systems and in combinations thereof. The logical design is not closely tied to any specific piece of hardware or software. However, implementations of such systems assume certain well-published advances in software technologies representing necessary syntactic or semantic capabilities:
Dynamic Algorithmic Linking: Ability to incorporate new algorithms into virtual address space at run-time.
Basic Polymorphism: Provide basic polymorphism in various types.
Network Socket Interfaces: Provide basic network interfaces and link abstractions, such as sockets for interprocess communications.
Multi-Threading: Ability to run independently schedulable multiple threads of execution in one process which can access the same virtual address space. Also, related synchronization mechanisms like locking are assumed.
Serialization: Capability to serialize a memory object into a persistable string of bytes and vice versa.
Many commercial programming languages meet all these requirements; for example, C++ and Java.
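Of the capabilities listed above, dynamic algorithmic linking is perhaps the least familiar. The sketch below shows one analogue of it, resolving a callable by name at run time with Python's standard `importlib` module. Python is used here only for illustration (the text names C++ and Java), and the helper name `load_algorithm` is hypothetical.

```python
import importlib
from typing import Any, Callable


def load_algorithm(module_name: str, function_name: str) -> Callable[..., Any]:
    """Resolve a callable by name at run time instead of at import time."""
    module = importlib.import_module(module_name)  # loads the module on demand
    algorithm = getattr(module, function_name)
    if not callable(algorithm):
        raise TypeError(f"{module_name}.{function_name} is not callable")
    return algorithm


# The module and function names arrive as data, e.g. carried inside a message.
sqrt = load_algorithm("math", "sqrt")
root = sqrt(16.0)
print(root)
```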
Utility of the Invention
As mentioned earlier, networking technologies have made possible seamless communication between distributed processes running on disparate stations (operating system and CPU), connected through physical or virtual transport schemes. Such networks result in network data streams composed of data structures and algorithms associated with each other through external application semantics. The user picks the necessary applications and then manipulates data in the same process or in a remote process. The invention has multiple applications in multiple application domains, some of which have existed for over two years and some which have become evident only recently, like the Internet and World Wide Web (WWW):
A distributed framework for separation of data structures, algorithms and the association between them.
Specifying a means of dissecting the global name space of knowledge into knowledge domains where all interacting parties agree upon the common nature of data structures, algorithms and their associations. Chaining such domains together creates a unified network wide knowledge domain.
Specifying a canonical description system for self-describing network messages (chunks of raw transport packets) which make it possible for the messages to carry data structures and algorithms anywhere over a connected network.
Facilitating a singular network communication protocol for the meaningful transmission of these canonical messages between NASIBs-aware processes.
Facilitating creation of knowledge domain manager processes (NASIB servers) that provide services built around this canonical knowledge system to knowledge-access applications (NASIB clients).
Service applications built using NASIB architecture to provide the following basic utilities:
Ability to delineate and hook up with an external data structure and algorithm stream.
Ability to selectively call up any global data structure and algorithm and meaningfully manipulate or execute them.
Ability to facilitate changes in data definitions and values to be automatically propagated across all relevant components of the system, by providing for a default-system-enforced synchronization policy.
Multiprocessing virtual address spaces, involving selective memory synchronizations.
Because of the store and forward nature and dynamically mutable characteristics of the system, it is an ideal framework for structured communication applications like Typed Messaging and Structured Workflow. (See FIG. 3.)
An important goal of the NASIB System is to facilitate a singular or limited suite of computer applications, providing global data structure and algorithm browsing and hosting mechanisms. Presumably, using such an application and employing such a protocol, a user can browse and execute data structures and algorithms and combinations thereof (indirectly representing discrete forms of the original applications) to selectively and meaningfully interact with any knowledge domain on the network, a prime example of that being the WWW.