1. Field of the Invention
This invention relates to computer file systems. More particularly, this invention relates to an improved semantically based system for dynamically organizing XML files or files with markup tags in a context sensitive manner, so as to enable a shared federated repository to be browsed in a manner that is intuitive to a user.
2. Description of the Related Art
It has been recognized that static, hierarchical systems of organizing documents are inadequate to efficiently meet the needs of computer users attempting to access increasingly vast amounts of dynamically changing information. Conventional file systems are simply too unwieldy to deal with this information load in a way that is convenient to the user. They have become increasingly impractical for efficient document management.
A relational database is an alternative to a file system as a repository for documents, and many databases today provide some support for documents such as extensible markup language (XML) documents. The XML document type declaration contains, or points to, markup declarations that provide a grammar for a class of documents. This grammar is known as a document type definition (DTD). However, the flexible document type declaration of an XML document, while easily represented as a graph, does not map naturally or efficiently into a flat static table. Moreover, the structured query language (SQL) interface of databases is not as commonly used in software applications as is the conventional file system interface. Furthermore, management of large databases often requires a skilled administrator.
Another approach to controlling the information explosion involves attaching metadata to documents. For example, using the MPEG-7 standard it is possible to attach attributes to video data.
It is proposed in the document Semantic File Systems. D. Gifford, P. Jouvelot, M. Sheldon, and J. O'Toole Jr., In Proc. 13th ACM Symposium on Operating Systems Principles, October 1991, pp. 16–25, to provide access to documents using queries. Virtual directories are created, each pointing to files that satisfy a query. The concepts presented in this document provide a foundation for the invention disclosed herein.
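The notion of a virtual directory that points to the files satisfying a query can be illustrated with a minimal sketch. The catalog, the attribute names, and the function virtual_directory below are hypothetical and are chosen only for illustration; they are not taken from the cited document, which describes its own mechanisms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEntry:
    path: str
    attributes: tuple  # tuple of (name, value) pairs describing the file


# Hypothetical catalog of files and their attributes (illustrative data only).
CATALOG = [
    FileEntry("/home/a/report.txt", (("author", "smith"), ("type", "report"))),
    FileEntry("/home/b/notes.txt", (("author", "jones"), ("type", "notes"))),
    FileEntry("/home/a/memo.txt", (("author", "smith"), ("type", "memo"))),
]


def virtual_directory(query, catalog=CATALOG):
    """Return the sorted paths of all files whose attributes match every query term."""
    listing = []
    for entry in catalog:
        attrs = dict(entry.attributes)
        if all(attrs.get(name) == value for name, value in query.items()):
            listing.append(entry.path)
    return sorted(listing)


# Each query behaves like a directory whose contents are computed on demand.
print(virtual_directory({"author": "smith"}))
print(virtual_directory({"author": "smith", "type": "memo"}))
```

In this sketch the listing is recomputed on every call, so a file added to the catalog appears in the matching virtual directories without any explicit filing step.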
The document Presto: An Experimental Architecture for Fluid Interactive Document Spaces, Paul Dourish, W. Keith Edwards, Anthony LaMarca, and Michael Salisbury, ACM Transactions on Computer-Human Interaction, 6(2) 1999, and the document Using Properties for Uniform Interaction in the Presto Document System, Paul Dourish, W. Keith Edwards, Anthony LaMarca, and Michael Salisbury, in Proceedings of the ACM Symposium on User Interface Software and Technology, UIST '99, Asheville, N.C., 1999, together disclose a document management system that emphasizes the attributes of documents being retrieved, while retaining some structural aspects of conventional file systems. The system is driven by attributes that are manually attached to the files, or extracted from the files using a filter. The attributes, including content, may be arbitrarily defined by different users and their numbers extended, such that different users have entirely different views of the document space. This arrangement has the drawback of requiring the user to execute a separate application using a separate interface.
Another approach is taken in the document, Integrating Content-Based Access Mechanisms with Hierarchical File Systems, B. Gopal and U. Manber, Operating Systems Design and Implementation (OSDI), 1999. This document proposes to extend the file system interface, wherein users are able to create their own name spaces based on queries, path names, or combinations thereof. This approach has a drawback in that interoperability with existing applications is difficult.
Retrieval of information from federated data repositories is a field of increasing importance. A federated data repository typically comprises heterogeneous data distributed across an enterprise.
Distributed file systems, for example the Andrew File System (AFS), and the Network File System (NFS), provide a measure of information sharing. Hierarchical trees from different sources are exposed to the user by gluing the different tree structures side-by-side. The distributed file system can then be used to share information from the different sources. Thus, information from separate sources is typically presented side by side.
Peer-to-peer communication over data networks, realized, for example, in the currently popular Napster and Gnutella systems, embodies a powerful concept for sharing and exchanging information over the Internet. Nevertheless, these systems utilize proprietary, specially tailored interfaces, and rely heavily on file naming for locating files.
Current peer-to-peer file sharing services allow for search, but do not support browsing well. Moreover, the search in such systems is typically based on file names, which lack necessary information, and may even be misleading. This is because naming conventions are not consistently enforced in such systems, and users are free to invent file names, which may have little relation to the file content.
Internet search engines typically employ search indices. However, these indices are not context sensitive, and irrelevant information is often returned from them on query. For example, when searching for a song called “Let It Be”, a free text search invariably retrieves many unrelated documents in which the search text appears. Moreover, updating of the indices is limited by the efficiency of offline Web crawling. Thus, Internet search indices cannot be relied upon to be up-to-date. Typically, a snapshot of the indices in the system is taken from time to time, and serves as the basis for the query results. Ongoing changes in existing documents, and document additions and deletions, are not visible on query until the next scheduled index build.
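The lack of context sensitivity in free text searching can be illustrated with a small sketch. The three XML documents, the tag names, and the functions free_text_search and tagged_search below are hypothetical and serve only to contrast a search over all text with a search restricted to a particular markup context; the sketch is not a description of any particular search engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical documents (illustrative data only).
DOCS = {
    "song.xml": "<song><title>Let It Be</title><artist>The Beatles</artist></song>",
    "blog.xml": "<post><title>On patience</title>"
                "<body>Sometimes you should just let it be.</body></post>",
    "review.xml": "<review><subject>Garden tools</subject>"
                  "<body>If the hedge is healthy, let it be.</body></review>",
}


def free_text_search(phrase, docs=DOCS):
    """Match the phrase anywhere in the text of a document, ignoring markup."""
    phrase = phrase.lower()
    hits = []
    for name, xml_text in docs.items():
        root = ET.fromstring(xml_text)
        text = " ".join(root.itertext()).lower()
        if phrase in text:
            hits.append(name)
    return sorted(hits)


def tagged_search(root_tag, field_tag, phrase, docs=DOCS):
    """Match the phrase only as the exact content of field_tag in a root_tag document."""
    hits = []
    for name, xml_text in docs.items():
        root = ET.fromstring(xml_text)
        if root.tag != root_tag:
            continue
        for element in root.iter(field_tag):
            if (element.text or "").strip().lower() == phrase.lower():
                hits.append(name)
                break
    return sorted(hits)


print(free_text_search("let it be"))                # all three documents match
print(tagged_search("song", "title", "Let It Be"))  # only the song matches
```

Under these assumptions, the free text search returns the blog post and the review along with the song, whereas the search that takes the markup context into account returns only the song.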
Wide Area Information Servers (WAIS) is an arrangement that is intended to help users locate information over networks. This represents a unified interface that relies on natural language questions, and employs indexing. However, once again, the information in the indices is not context sensitive. Conventional free text searching and ranking is performed. WAIS is disclosed in further detail in the document, An Information System for Corporate Users: Wide Area Information Servers, B. Kahle, et al., ONLINE, Vol. 15, No. 5, pp. 56–60, September 1991.
Distributed Lightweight Directory Access Protocol (LDAP) and X.500 (International Standard ISO/IEC 9594-1) services, when queried, may cause one server to refer the client to a second server. Alternatively, the first server may query the second server, should it be unable to respond to the query. LDAP queries inherently force the user to specify attributes and values in constructing the query. Thus, these services are suitable only when there is ample advance knowledge of the subject.
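The two behaviors described above, in which a server either refers the client to a second server or queries the second server itself, can be sketched as follows. The class DirectoryServer, its methods, and the sample entries are hypothetical; the sketch does not use any real LDAP library or wire protocol, and it assumes that the chain of peer servers contains no cycles.

```python
class DirectoryServer:
    """Hypothetical in-memory directory server (not a real LDAP implementation)."""

    def __init__(self, name, entries, peer=None):
        self.name = name
        self.entries = entries  # list of attribute dictionaries
        self.peer = peer        # another DirectoryServer, or None

    def _local_search(self, attribute, value):
        return [e for e in self.entries if e.get(attribute) == value]

    def search_with_referral(self, attribute, value):
        """Answer locally, or hand the client a referral to the peer server."""
        found = self._local_search(attribute, value)
        if found or self.peer is None:
            return ("result", found)
        return ("referral", self.peer)

    def search_with_chaining(self, attribute, value):
        """Answer locally, or query the peer on the client's behalf."""
        found = self._local_search(attribute, value)
        if found or self.peer is None:
            return found
        return self.peer.search_with_chaining(attribute, value)


def client_search(server, attribute, value, max_hops=8):
    """Client that follows referrals itself, up to max_hops servers."""
    for _ in range(max_hops):
        kind, payload = server.search_with_referral(attribute, value)
        if kind == "result":
            return payload
        server = payload  # follow the referral to the next server
    return []


second = DirectoryServer("second", [{"cn": "Alice", "ou": "research"}])
first = DirectoryServer("first", [{"cn": "Bob", "ou": "sales"}], peer=second)

print(client_search(first, "cn", "Alice"))         # client follows the referral
print(first.search_with_chaining("cn", "Alice"))   # first server asks the second
```

In both variants the caller must name an attribute and a value, which reflects the point made above that such services presuppose advance knowledge of the subject.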
Bio-informatics is an exemplary rapidly growing field in which the above-noted difficulties in information retrieval from a variety of unrelated sources arise frequently, and are often stumbling blocks to research. A life sciences company typically has data stored in a cluster of repositories, and in multiple formats. Conventionally, it would be necessary to tailor a specific application in order to deal with the extraction and combination of data from all the sources. Furthermore, it is not uncommon for an external organization to gain access to several such clusters from several such companies. Using conventional technology, it is indeed daunting to try to establish a coherent system for convenient access to all the data. What is needed is a friendly facility from which to query all the different data sources at once, and obtain combined results.
More generally, in the field of retrieval of information from federated repositories or the Internet, there is no efficient manner in which to combine several distinct resource repositories in a way that is convenient to use and is capable of supplying a meaningful response to a query in a dynamic environment. In such environments, not only do the resources themselves change with time, but the participating sites also vary.