1. Field of the Invention
This invention relates to computer file systems. More particularly, this invention relates to an improved semantically based system for dynamically organizing XML files or files with markup tags in a context sensitive manner.
2. Description of the Related Art
It has been recognized that static, hierarchical systems of organizing documents are inadequate to efficiently meet the needs of computer users attempting to access increasingly vast amounts of dynamically changing information. Conventional file systems are simply too unwieldy to deal with this information load in a way that is convenient to the user. They have become increasingly impractical for efficient document management.
A relational database is an alternative to a file system as a repository for documents, and many databases today provide some support for documents such as extensible markup language (XML) documents. The XML document type declaration contains, or points to, markup declarations that provide a grammar for a class of documents. This grammar is known as a document type definition (DTD). However, the flexible document type declaration of an XML document, while easily represented as a graph, does not map naturally or efficiently into a flat static table. Moreover, the structured query language (SQL) interface of databases is not as commonly used in software applications as is the conventional file system interface. Furthermore, management of large databases often requires a skilled administrator.
Another approach to controlling the information explosion involves attaching metadata to documents. For example, using the MPEG-7 standard it is possible to attach attributes to video data.
It is proposed in the document Semantic File Systems, D. Gifford, P. Jouvelot, M. Sheldon, and J. O'Toole Jr., in Proc. 13th ACM Symposium on Operating Systems Principles, October 1991, pp. 16-25, to provide access to documents using queries. Virtual directories are created, each pointing to files that satisfy a query. The concepts presented in this document provide a foundation for the invention disclosed herein.
The document Presto: An Experimental Architecture for Fluid Interactive Document Spaces, Paul Dourish, W. Keith Edwards, Anthony LaMarca, and Michael Salisbury, ACM Transactions on Computer-Human Interaction, 6(2) 1999, and the document Using Properties for Uniform Interaction in the Presto Document System, Paul Dourish, W. Keith Edwards, Anthony LaMarca, and Michael Salisbury, in Proceedings of the ACM Symposium on User Interface Software and Technology, UIST '99, Asheville, N.C., 1999, together disclose a document management system that emphasizes the attributes of documents being retrieved, while retaining some structural aspects of conventional file systems. The system is driven by attributes that are manually attached to the files, or extracted from the files using a filter. The attributes, including content, may be arbitrarily defined by different users and their numbers extended, such that different users have entirely different views of the document space. This arrangement has the drawback of requiring the user to execute a separate application using a separate interface.
Another approach is taken in the document, Integrating Content-Based Access Mechanisms with Hierarchical File Systems, B. Gopal and U. Manber, Operating Systems Design and Implementation (OSDI), 1999. This document proposes to extend the file system interface, wherein users are able to create their own name spaces based on queries, path names, or combinations thereof. This approach has a drawback in that interoperability with existing applications is difficult.
It is a primary advantage of some aspects of the present invention that existing applications using the conventional file system applications programming interface (API) are supported.
It is still another advantage of some aspects of the invention that file organization dynamically accommodates changes in the document space.
It is a further advantage of some aspects of the invention that different users may see files organized in a different fashion, and that a given user is able to see the files organized in different ways.
It is yet another advantage of some aspects of the invention that a user can quickly determine what information is contained in a repository of files in a given context.
These and other advantages of the present invention are attained by a file system, which in a preferred embodiment exploits attributes encoded in an XML document. The file system presents a dynamic directory structure to the user, and breaks the conventional tight linkage between sets of files and the physical directory structure, thus allowing different users to see files organized in a different fashion. The dynamic structure is based upon content, which is extracted according to attributes defined by the XML structure.
In a preferred embodiment of the invention, an XML-aware file system (XMLFS) combines the interface of a conventional file system with the organizational power of information retrieval to provide a repository for XML documents. It provides a solution for organizing, searching and browsing collections of XML documents. The semi-structured nature of documents that comply with the XML standard implies that XML documents readily include metadata. Because of its popularity, XML appears to be an ideal format for innovation that results in sensibly ordering an ever-growing amount of information.
To the user, the XML-aware file system appears to be a completely conventional standard file system, and it supports any existing application that employs a standard file system applications programming interface. In addition, in some embodiments, since the XML-aware file system is built upon an existing file system, it can exploit existing support facilities, for example backup facilities.
In an important departure from the view presented by traditional hierarchical file systems, instead of showing files organized in a static directory structure, the XML-aware file system shows files organized in a dynamic hierarchy which is constructed on-the-fly. The user of the XML-aware file system is informed by the directory path as to what content is relevant at a particular instance in time. A directory path in the XML-aware file system is a sequence of attributes and values, and the contents of a directory are all of the XML documents that have the attributes and values named in the path. In other words, a directory path in the XML-aware file system reflects a query for a set of documents matching a set of constraints. As the path is being incrementally constructed, the user of the file system browses through a set of documents that match a partial query.
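The idea of a directory path acting as a query can be illustrated with a minimal Python sketch. It is not the disclosed implementation. The `DOCS` table, the function names `parse_path` and `list_directory`, and the convention that a path ending in a bare attribute lists the available values are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical metadata already extracted from XML documents:
# document name -> {attribute: value}
DOCS: Dict[str, Dict[str, str]] = {
    "a.xml": {"author": "smith", "year": "1999"},
    "b.xml": {"author": "smith", "year": "2000"},
    "c.xml": {"author": "jones", "year": "1999"},
}


def parse_path(path: str) -> Tuple[List[Tuple[str, str]], Optional[str]]:
    """Split /attr/value/attr/value[/attr] into constraints and a pending attribute."""
    parts = [p for p in path.split("/") if p]
    pairs = list(zip(parts[0::2], parts[1::2]))
    # An odd number of components means the last attribute has no value yet.
    pending = parts[-1] if len(parts) % 2 == 1 else None
    return pairs, pending


def list_directory(path: str, docs: Dict[str, Dict[str, str]] = DOCS) -> List[str]:
    """Return the entries of the virtual directory named by `path`."""
    pairs, pending = parse_path(path)
    # Keep only documents that satisfy every attribute/value constraint so far.
    matches = {
        name: attrs
        for name, attrs in docs.items()
        if all(attrs.get(a) == v for a, v in pairs)
    }
    if pending is not None:
        # Partial query: show the values the pending attribute takes.
        return sorted({attrs[pending] for attrs in matches.values() if pending in attrs})
    # Complete query: show attributes that can refine it, then the matching documents.
    used = {a for a, _ in pairs}
    remaining = sorted({a for attrs in matches.values() for a in attrs if a not in used})
    return remaining + sorted(matches)
```

In this sketch, each additional path component narrows the set of matching documents, which mirrors browsing a partial query as the path is built.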
The invention provides a computer implemented method of information retrieval, including the steps of retrieving structural information of memorized XML documents according to a document type declaration that corresponds to each of the documents, retrieving elements of the documents, attributes and values of the elements. The method further includes generating a multilevel inverted index from the structural information, the elements, the attributes and the values, and accepting a specification from a user that has members comprising at least one of the elements, the attributes or the values. The method further includes extracting data from the index responsive to the specification, wherein the data complies with at least one of the members. The method further includes displaying virtual directory paths of corresponding ones of the documents, wherein the directory paths each comprise a sequence of the members, and wherein contents of directories that are identified in the directory paths comprise selected ones of the documents possessing the specification.
According to an aspect of the method, the index includes a structural section that has postings of the structural information, and a words section that has postings of the values, wherein the values are words.
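An index with a structural section and a words section might be sketched as follows in Python. This is a simplified illustration under stated assumptions: structure is derived from the document instances with the standard `xml.etree.ElementTree` parser rather than from a document type declaration, the index has only two flat sections, and the names `build_index` and `_walk` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from typing import Dict, Set, Tuple

Structural = Dict[str, Set[str]]           # context path -> document ids
Words = Dict[Tuple[str, str], Set[str]]    # (context path, word) -> document ids


def _walk(elem: ET.Element, prefix: str, doc_id: str,
          structural: Structural, words: Words) -> None:
    """Record postings for one element, its attributes, and its children."""
    path = f"{prefix}/{elem.tag}"
    structural[path].add(doc_id)
    for word in (elem.text or "").split():
        words[(path, word.lower())].add(doc_id)
    for name, value in elem.attrib.items():
        attr_path = f"{path}/@{name}"
        structural[attr_path].add(doc_id)
        for word in value.split():
            words[(attr_path, word.lower())].add(doc_id)
    for child in elem:
        _walk(child, path, doc_id, structural, words)


def build_index(docs: Dict[str, str]) -> Tuple[Structural, Words]:
    """Build the structural and words sections from document id -> XML text."""
    structural: Structural = defaultdict(set)
    words: Words = defaultdict(set)
    for doc_id, text in docs.items():
        _walk(ET.fromstring(text), "", doc_id, structural, words)
    return structural, words


# Hypothetical sample repository
SAMPLE = {
    "a.xml": '<book lang="en"><author>Smith</author><title>File Systems</title></book>',
    "b.xml": '<book lang="fr"><author>Jones</author><title>Index Files</title></book>',
}
STRUCTURAL, WORDS = build_index(SAMPLE)
```

A specification such as "author is Smith" would then be answered by looking up `("/book/author", "smith")` in the words section, while the structural section answers which documents contain a given element or attribute at all.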
One aspect of the method includes arranging the directory paths in a hierarchy that is constructed in conformance with the specification. Arranging can be accomplished by extracting a document identifier from one of the postings of the values, extracting an offset of a context from the one of the postings of the values, and extracting an entry length of the context from the one of the postings of the values.
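As an illustration of how a posting carrying a document identifier, a context offset, and an entry length could be used to arrange directory paths, consider the following Python sketch. It is an assumption-laden simplification: the `Posting` type, the function names, and the regular expression (which recognizes only flat elements without attributes or child elements) are hypothetical, and offsets are counted in characters.

```python
import re
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Posting(NamedTuple):
    doc_id: str   # document identifier
    offset: int   # where the enclosing context starts in the document text
    length: int   # entry length of that context, in characters


# Simplification: matches only flat elements such as <author>Smith</author>.
FLAT_ELEMENT = re.compile(r"<(\w+)>([^<]*)</\1>")


def index_words(docs: Dict[str, str]) -> Dict[str, List[Posting]]:
    """Map each word to postings that locate its enclosing context."""
    words: Dict[str, List[Posting]] = defaultdict(list)
    for doc_id, text in docs.items():
        for m in FLAT_ELEMENT.finditer(text):
            posting = Posting(doc_id, m.start(), m.end() - m.start())
            for word in m.group(2).lower().split():
                words[word].append(posting)
    return words


def directory_paths(word: str, words: Dict[str, List[Posting]],
                    docs: Dict[str, str]) -> Dict[str, List[str]]:
    """Group the documents containing `word` under /context/word paths."""
    paths: Dict[str, List[str]] = defaultdict(list)
    for p in words.get(word, []):
        # The offset and entry length recover the context from the document.
        context = docs[p.doc_id][p.offset:p.offset + p.length]
        tag = FLAT_ELEMENT.match(context).group(1)
        paths[f"/{tag}/{word}"].append(p.doc_id)
    return dict(paths)


# Hypothetical sample repository
SAMPLE = {
    "a.xml": "<book><author>Smith</author><title>Smith on Files</title></book>",
    "b.xml": "<book><author>Jones</author><title>Indexing</title></book>",
}
WORD_POSTINGS = index_words(SAMPLE)
```

Here the same word yields different directory paths depending on the context in which it occurs, so a document can appear under both an author directory and a title directory.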
According to one aspect of the method, the documents are written in a markup language.
According to still another aspect of the method, the documents are XML documents.
An additional aspect of the method includes noting changes in a composition of a repository of the documents, and updating the index responsive to the changes.
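One way to note changes in a repository and update an index accordingly is sketched below in Python. The approach shown, comparing content digests between synchronizations, is an assumption and not necessarily the mechanism of the invention. The `IncrementalIndex` class is hypothetical, the repository is modeled as an in-memory mapping, and documents are tokenized by whitespace rather than parsed as XML.

```python
import hashlib
from typing import Dict, List, Set


class IncrementalIndex:
    """Word index that is brought up to date by comparing content digests."""

    def __init__(self) -> None:
        self.postings: Dict[str, Set[str]] = {}  # word -> document ids
        self.digests: Dict[str, str] = {}        # document id -> content digest

    def _remove(self, doc_id: str) -> None:
        # Drop the document from every posting list, pruning empty lists.
        for word in list(self.postings):
            self.postings[word].discard(doc_id)
            if not self.postings[word]:
                del self.postings[word]

    def _add(self, doc_id: str, text: str) -> None:
        for word in text.lower().split():
            self.postings.setdefault(word, set()).add(doc_id)

    def sync(self, repository: Dict[str, str]) -> List[str]:
        """Apply additions, modifications and deletions; return affected ids."""
        changed: List[str] = []
        for doc_id in list(self.digests):
            if doc_id not in repository:          # document was deleted
                self._remove(doc_id)
                del self.digests[doc_id]
                changed.append(doc_id)
        for doc_id, text in repository.items():
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if self.digests.get(doc_id) != digest:  # new or modified document
                self._remove(doc_id)
                self._add(doc_id, text)
                self.digests[doc_id] = digest
                changed.append(doc_id)
        return sorted(changed)
```

Only documents whose content actually changed are re-indexed, so the directory views derived from the index can follow the repository without a full rebuild.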
According to another aspect of the method, the specification includes a partial query and a complete query.
According to yet another aspect of the method, a portion of the specification is stated as a path name by the user.
The invention provides a computer software product, including a computer-readable medium in which computer program instructions are stored, which instructions, when read by a computer, cause the computer to perform the steps of retrieving structural information of memorized documents according to a document type declaration that corresponds to each of the documents, retrieving elements, attributes and values of the elements, generating a multilevel inverted index from the structural information, the elements, the attributes and the values, and accepting a specification from a user that comprises at least one of the elements, the attributes and the values. Responsive to the specification, the computer further performs the steps of extracting data from the index, associating the data with corresponding ones of the documents, and displaying the corresponding ones of the documents as virtual directory paths, wherein the directory paths each comprise a sequence of the elements, the attributes and the values, and wherein contents of directories that are identified in the directory paths comprise selected ones of the documents possessing the specification.
According to an aspect of the computer software product, the index includes a structural section that has postings of the structural information, and a words section that has postings of the values, wherein the values are words.
An additional aspect of the computer software product includes arranging the directory paths in a hierarchy that is constructed in conformance with the specification. Arranging the directory paths can include extracting a document identifier from one of the postings of the values, extracting an offset of a context from the one of the postings of the values, and extracting an entry length of the context from the one of the postings of the values.
According to one aspect of the computer software product, the documents are written in a markup language.
According to another aspect of the computer software product, the documents are XML documents.
A further aspect of the computer software product includes noting changes in a composition of a repository of the documents, and updating the index responsive to the changes.
According to yet another aspect of the computer software product, the specification includes a partial query and a complete query.
According to still another aspect of the computer software product, the specification is stated as a path name by the user.
According to one aspect of the computer software product, the specification is issued via a file system applications programming interface.
According to another aspect of the computer software product, the instructions define a file system engine that issues calls to an operating system.
The invention provides a computer implemented information retrieval system for presenting a semantically dependent directory structure of XML files to a user, including a file system engine that receives a file request via a file system application programming interface and issues file system calls to an operating system, wherein the file request specifies a file content of memorized files. The system also includes an XML parser linked to the file system engine that retrieves structural information of XML documents, the XML parser further retrieving at least one of elements, attributes and respective values thereof from the XML documents. The system also includes an indexer linked to the XML parser for constructing an inverted index of the elements, the attributes and the respective values thereof, wherein responsive to the file request, the file system engine retrieves postings of the inverted index that satisfy requirements of the file request, and returns directory paths to the file system application programming interface of selected ones of the XML documents corresponding to the postings.
According to an aspect of the information retrieval system, the inverted index includes a structural section that has postings of the structural information, and a words section that has postings of words of the XML documents.
According to yet another aspect of the information retrieval system, the postings of the structural information and the postings of words comprise a document identifier of one of the XML documents, an offset of a context in the one XML document, and an entry length in the context of the one XML document.
Still another aspect of the information retrieval system includes an XML analyzer for updating the inverted index, wherein the XML analyzer analyzes additions to the memorized files.
According to an additional aspect of the information retrieval system, the XML parser retrieves the structural information from document type declarations of the XML documents.
According to one aspect of the information retrieval system, the file request includes a partial query and a complete query.
According to another aspect of the information retrieval system, a portion of the file request is a path name.
According to a further aspect of the information retrieval system, the repository of the XML documents can be a networked file system.
The invention provides a computer implemented information retrieval system for presenting a semantically dependent directory structure of document files to a user, wherein documents of the document files are written in a markup language, including a file system engine that receives a file request via a file system application programming interface and issues file system calls to an operating system, wherein the file request specifies a file content of memorized files. The system includes a parser of the markup language, linked to the file system engine, that retrieves structural information of the documents, the parser further retrieving at least one of elements, attributes and respective values thereof from the documents. The system includes an indexer, linked to the parser, for constructing an inverted index of the elements and the attributes and the respective values thereof, wherein responsive to the file request, the file system engine retrieves postings of the inverted index that satisfy requirements of the file request, and returns directory paths to the file system application programming interface of selected ones of the documents corresponding to the postings.
According to an aspect of the information retrieval system, the inverted index includes a structural section that has postings of the structural information, and a words section that has postings of words of the documents.
According to one aspect of the information retrieval system, the postings of the structural information and the postings of words include a document identifier of one of the documents, an offset of a context in the one document, and an entry length of the context in the one document.
Another aspect of the information retrieval system includes an analyzer for updating the inverted index, wherein the analyzer analyzes additions to the memorized files.
According to a further aspect of the information retrieval system, the parser retrieves the structural information from document type declarations of the documents.
According to yet another aspect of the information retrieval system, the file request includes a partial query and a complete query.
According to still another aspect of the information retrieval system, a portion of the file request is a path name.
According to an additional aspect of the information retrieval system, a repository of the documents can be a networked file system.