1. Technical Field
The present disclosure relates generally to full text indexing and searching applied to distributed file systems.
2. Description of Background Art
The volume of information contained within a single file system has increased dramatically since file systems were first designed and implemented. Whereas early file systems managed tens of megabytes of data, today's distributed file systems often encompass tens of terabytes. This represents a million fold increase, and the end is not in sight. Consider the following:
    The storage capacity of a 3.5″ disk drive is projected to increase from today's 250 gigabytes to 25 terabytes.
    A single file server typically exports the file systems resident on many disk drives.
    Global file systems, now being deployed, will wrap hundreds or thousands of file servers into a single virtual file server.
The volume of data contained within a file system will be so enormous that information may be lost within the file system! Current file systems provide only a very limited ability to locate particular information contained in the system's files.
Over time, the file system application programming interface (“API”) of the earliest file system implementations has been enhanced to include new functionalities. And, it is time once again to extend the definition and capabilities of a file system.
Distributed Data Service (“DDS”) is a distributed file system that integrates industry standard file servers (Unix, Linux, Windows, Mac) into highly distributed, multi-protocol virtual file servers of vast proportions. A single DDS virtual file server may encompass hundreds of petabytes. Fundamental concepts underlying a DDS virtual file server are disclosed in U.S. Pat. Nos. 5,611,049, 5,892,914, 6,026,452, 6,205,475, 6,366,952(B2), 6,505,241(B2) and 6,804,706(B2). All of the immediately preceding United States patents are hereby incorporated by reference as though fully set forth here.
DDS global file systems accessible via a DDS virtual file server encompass entities that might not normally be thought of as files, so when describing DDS global file systems the term object is often used to denote a superset class which includes what is conventionally identified as a file.
The object related definitions are:
Object
    A named entity represented within a namespace to which a connection can be established for the purpose of reading or writing data. The most common type of object is a file, but other types include:
        directories, domains, and other containers,
        live video feeds,
        application programs, and
        shared memory.
Object system
    A provider of objects. For example, a file system (a type of object system) contains a collection of files and it provides a service through which its content may be accessed.
Provider
    A synonym for object system.
Namespace
    A set of names in which all names are unique. All objects within an object system have at least one name, and the complete set of all names for all objects comprises the object system's namespace.
DDS constructs virtual file servers from heterogeneous collections of industry standard file servers. A single DDS virtual file server provides a highly distributed file service, perhaps incorporating thousands of geographically dispersed file servers. As stated previously, DDS is also capable of providing remote access to objects other than files, such as live video feeds. Accordingly, the term “object” is generally used throughout this document to denote a file, a data stream, or some other data entity that might stretch the definition of “file”.
The DDS architecture provides a framework for highly distributed, multi-protocol file caching. FIG. 1 illustrates a basic structure for a DDS cache module referred to by the general reference character 20. The DDS cache module 20 may be installed on file servers, client workstations, and intermediate network nodes such as routers, switches, and dedicated file caching appliances.
DDS implements a file level cache 22 layered above all data sources. A data source is usually a file system, either local or remote, but it could be, for example, a real time data stream. When appropriate, each DDS cache module 20 automatically caches whatever data is being accessed through its file level cache 22 regardless of the source of data.
Individual file level caches 22, using both local RAM and local disk for data storage, may vary dramatically in size. Some DDS cache modules 20, perhaps within switches and routers, may implement only RAM based caching. Other DDS cache modules 20 in high capacity locations might be configured with 16 gigabytes of RAM and a terabyte or more of disk storage.
Although some current distributed file system implementations employ callback mechanisms to synchronously invalidate file images cached “just below” the multiple client processes accessing a shared file, all processes remain unaware of the consistency operations. When a process reads a file, the response includes the most recently written data (the modification), and whatever consistency operations were required to ensure the currency of the response remain hidden from file system clients.
Sprite, CIFS, and NFSv4 each implement consistency callback mechanisms as described above. Therefore, since cache consistency is maintained through private communications between the server and the client components of these distributed file systems, it is impossible for one process to detect another process's modification of a shared file except by reading the file. Consequently, detecting shared file modifications when using distributed file systems such as Sprite, CIFS, and NFSv4 requires use of a polling loop that periodically reads the shared file.
All DDS cache modules 20 maintain the consistency of cached images via origin file server callbacks. Files which are in use or have been recently used are registered with the origin file server, and may receive a callback (at the onset of a concurrent write sharing condition) to invalidate or flush the cached file image. DDS incorporates a consistency disconnect-reconnect mechanism, described in U.S. Pat. No. 5,946,690 (“the '690 patent”), whereby a cached file image, including the file's metadata, may be disconnected from the origin file server and then, at a later time (weeks, months, years), reconnected and revalidated. This is an essential mechanism for implementing high capacity, long term (persistent) caches. The '690 patent is hereby incorporated by reference.
A DDS cache module 20, illustrated in FIG. 1, includes five major components:
File Level Cache 22: The file level cache 22 consists of a large number of channels (10,000 to 100,000). Each channel is a data structure that contains (or has pointers to) consistent data/metadata images of a (usually) remote source object. Channels also contain data structures that track client access patterns, and measure the rate of consumption by clients and the rate of replenishment from the origin file server.
    Channels are managed on a least recently used (“LRU”) basis and are identified by an object id. A simple hash mechanism allows an incoming file system request to be connected to the appropriate channel within a microsecond or two. Background processes strive to ensure that, for well-mannered clients, channels are primed such that incoming requests are immediately serviced with no need to block while waiting to fetch data from downstream (closer to the origin file server).
    Just before a channel is reassigned to a new object, its contents are written to a disk cache, not illustrated in FIG. 1, if one exists at this DDS cache module 20. Disk caches are also managed on an LRU basis.
    The file level cache 22 also incorporates a redirector 24. All cache misses are passed on to the redirector 24, even when the source object resides within a local file system.
Source Provider Interface Routines 32: For any given file with multiple active clients distributed about the network, there exists a tree structured hierarchy of DDS cache modules 20 rooted at a DDS Server Terminator Site. The DDS Server Terminator Site communicates directly with the origin server for a file. The Source Provider Interface Routines 32 (“SPIRs 32”) interface one or more local file systems to the DDS cache module 20, e.g. NTFS, UFS, RDR . . . .
    When a DDS cache module 20 is the DDS Server Terminator Site for a file, the file level cache 22 accesses the file via one of the SPIRs 32.
Client Intercept Routines 42: A set of client intercept routines 42 provides industry standard local and remote file services directly to clients. The DDS cache module 20 with which a client communicates directly via one of the client intercept routines 42 is the DDS Client Terminator Site for that client. FIG. 1 depicts a DDS cache module 20 configured with three client intercept routines 42: UFS, CIFS, and NFS. Unmodified Windows clients communicating directly with this DDS cache module 20, for example, may use the CIFS protocol to access file data sourced from a Unix file server that is remote from this DDS cache module 20, or for which this DDS cache module 20 is the file's DDS Server Terminator Site. Local processes running on the system hosting this DDS cache module 20 may access the same file data via the UFS (Unix File System) client intercept routine 42.
    Each file's metadata is represented within file level cache 22 as a discriminated union: the discriminator identifies ‘UFS’ as the source file system and the union contains the file's metadata as formatted by the UFS source provider routine on the file's origin server. When a particular DDS cache module 20 services NFS or UFS requests, no protocol translation is necessary. However, the CIFS client intercept routine 42 must be configured with a UFS to CIFS translation module so ‘UFS’ files may be accessed via the CIFS protocol.
    Note that local clients may use the UFS interface to access remote files via the DDS cache module 20.
DDS Client Code 52: When a file level cache 22 requires additional file data and the DDS cache module 20 is not the DDS Server Terminator Site for the file, the file level cache 22 invokes DDS Client code 52 to fetch missing file data.
    To access missing file data, the DDS Client code 52 generates and dispatches a network request expressed in a DDS protocol directed toward the file's DDS Server Terminator Site.
DDS Server Code 62: DDS Server code 62 receives requests dispatched by the DDS Client code 52 at an upstream DDS cache module 20, i.e. a DDS cache module 20 which is, or is closer to, the DDS Client Terminator Site. The DDS Server code 62 implements the DDS protocol.
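The channel management described above (hash lookup by object id, LRU reassignment, and write-out to a disk cache before reuse) can be illustrated with a minimal Python sketch. The class name ChannelCache, the fetch_downstream callable standing in for the redirector 24, and the use of a plain dictionary as the disk cache are all assumptions made for illustration; they are not taken from any DDS implementation.

```python
from collections import OrderedDict


class ChannelCache:
    """Toy file level cache: a fixed pool of channels keyed by object id.

    Hypothetical sketch only. `fetch_downstream` stands in for the
    redirector, and `disk_cache` is a plain dict standing in for an
    optional local disk cache.
    """

    def __init__(self, capacity, fetch_downstream, disk_cache=None):
        self.capacity = capacity
        self.fetch_downstream = fetch_downstream
        self.disk_cache = disk_cache
        # Hash table of channels; insertion order doubles as LRU order
        # (least recently used first).
        self.channels = OrderedDict()

    def lookup(self, object_id):
        if object_id in self.channels:
            # Hit: mark the channel as most recently used.
            self.channels.move_to_end(object_id)
            return self.channels[object_id]

        # Miss: every miss is passed on to the redirector stand-in.
        image = self.fetch_downstream(object_id)

        if len(self.channels) >= self.capacity:
            # Reassign the least recently used channel, first writing
            # its contents to the disk cache if one exists.
            old_id, old_image = self.channels.popitem(last=False)
            if self.disk_cache is not None:
                self.disk_cache[old_id] = old_image

        self.channels[object_id] = image
        return image
```

A Python dictionary gives the constant-time lookup by object id that the text attributes to a simple hash mechanism; a production cache would also need locking and background priming, which this sketch omits.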
DDS Protocol
The DDS protocol is a remote file access protocol, providing functionality comparable to NFS and/or CIFS. It is designed to efficiently stream file data and metadata into large RAM/disk caches distributed throughout a network, and to maintain the consistency of cached images at a level which approaches that of a local cache. The DDS protocol transfers and caches images of files and objects from many different sources (UFS, VxFS, NTFS file systems, video cameras, . . . ) with no “image degrading” translations between the source object and its cached image. Protocol translation is always performed at DDS Client Terminator Sites, and is required only for heterogeneous (with respect to the origin file server) clients.
The DDS protocol, as currently implemented, consists of five operations:
DDS_CONNECT: This operation connects to an existing file, directory, or file system, and optionally creates a new file or directory if it doesn't already exist. If successful, this operation returns a file handle. This operation supplies the functionality required by the NFS operations mount, lookup, create, and mkdir.
DDS_NAME: This operation manipulates names in various ways. It supplies the functionality required by the NFS operations link, symlink, rename, remove, and rmdir.
DDS_LOAD: This operation loads data and metadata. The request includes flags (DDS_CC_SITE_READING, DDS_CC_SITE_WRITING) that inform downstream DDS cache modules 20 what types of operations will be performed upon data/metadata images cached at the DDS cache module 20. These flags are used by DDS's distributed consistency mechanism to keep track of the types of file activities occurring at various DDS cache modules 20.
    A single load or flush request may specify multiple file segments, and each segment may be up to 4 gigabytes in length.
    The response to a load or flush request includes flags (DDS_CC_SUSTAIN_DIR_PROJECTION and DDS_CC_SUSTAIN_FILE_PROJECTION) that indicate whether the returned data/metadata may be cached or whether it must be discarded immediately after responding to the current client request.
    The DDS_LOAD operation supplies the functionality required by the NFS operations statfs, getattr, setattr, read, write, readdir, and readlink.
DDS_FLUSH: This operation flushes modified data downstream towards the DDS Server Terminator Site. A flush level specifies how far the flush should propagate.
    Currently available flush levels are:
        DDS_FLUSH_TO_NOWHERE: Don't flush
        DDS_FLUSH_TO_CCS: Flush to Consistency Control Site (“CCS”)
        DDS_FLUSH_TO_SITE_DISK: First DDS cache module 20 with disk cache
        DDS_FLUSH_TO_SITE_STABLE_RAM: First DDS cache module 20 with stable RAM
        DDS_FLUSH_TO_SERVER_DISK: Flush all the way
    A basic concept of DDS is that DDS projects the source file system at the DDS Server Terminator Site into distant DDS cache modules 20. Consequently, an image of data present in an upstream DDS cache buffer is identical to that in an internal file system buffer at the DDS Server Terminator Site. After a write operation modifies a file system buffer (either local or remote), performance is enhanced if the buffer is asynchronously written to the server's disk at the DDS Server Terminator Site. However, file modifications are safeguarded when they're synchronously written to disk or some other form of stable storage. Flush levels allow both the client and the DDS Server Terminator Site to express their level of paranoia regarding file consistency. The most paranoid of the client and the DDS Server Terminator Site prevails.
DDS_FSCTL: DDS_FSCTL implements various file system control operations. These various file system control operations include:
    fs_sync: Commands all downstream DDS cache modules 20 to flush all modified file data from this DDS cache module 20 and this file or file system to whatever level is specified by the flush level parameter.
    fs_ping: Pings for the status of specified file systems at downstream DDS cache modules 20. Usually, the fs_ping request specifies all file systems currently being accessed through downstream DDS cache modules 20 regardless of the file system's origin server. Downstream DDS cache modules 20 respond immediately with status indications for each specified file system.
        Upstream DDS cache modules 20 use fs_ping (often referred to as a fast ping) to detect, within a few seconds, partitioning events that isolate DDS cache modules 20 from remote file systems. Fast ping rates (typically set from 500 to 3000 milliseconds) are specified as mount parameters when each file system is mounted. For a set of file systems accessed through the same downstream DDS cache module 20, the most aggressive rate determines the fast ping rate for that DDS cache module 20.
    fs_callback: Pings the root of the specified file system at the next downstream DDS cache module 20. The downstream DDS cache module 20 doesn't respond until the timeout period (specified in the request, typically 5 to 30 minutes) expires or a consistency event occurs (on any file in the specified file system). Occurrence of a consistency event requires that a cached file image at the upstream DDS cache module 20 (and DDS cache modules 20 further upstream) be recalled or invalidated. Upstream DDS cache modules 20 use fs_callback (often referred to as a slow ping) to register with the downstream DDS cache module 20 and provide a means for the delivery of asynchronous notifications.
        When a slow ping is received, it is possible that multiple notifications are queued and waiting to be forwarded upstream. To handle such events expeditiously, the slow ping response can transmit multiple notifications.
The three preceding file system control operations provide the functionality required to ensure the integrity of file modifications, to implement cache consistency, and to quickly detect network partition events that compromise cache consistency.
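The fast ping rate selection described for fs_ping can be illustrated with a short Python sketch. It assumes that the “most aggressive” rate means the shortest interval in milliseconds; the function name fast_ping_intervals and the (module, interval) input format are hypothetical and chosen only for illustration.

```python
def fast_ping_intervals(mounts):
    """Pick one fast ping interval per downstream cache module.

    `mounts` is an iterable of (downstream_module, interval_ms) pairs,
    one pair per mounted file system. Assumption: the most aggressive
    rate is the shortest interval, so the minimum wins.
    """
    intervals = {}
    for module, interval_ms in mounts:
        current = intervals.get(module, interval_ms)
        intervals[module] = min(current, interval_ms)
    return intervals
```

For example, two file systems mounted through the same downstream module with intervals of 3000 and 500 milliseconds would, under this assumption, both be covered by a single 500 millisecond fast ping.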
The DDS protocol facilitates efficient transfers by allowing a single DDS_LOAD or DDS_FLUSH request to specify an array of file segments, each ranging in size up to 4 gigabytes, as targets of the request.
DDS_LOAD and DDS_FLUSH requests include flags that indicate whether the requesting DDS cache module 20 shares memory (DDS_LOAD_COMMON_MEMORY) or disk (DDS_LOAD_COMMON_DISK) with the downstream DDS cache module 20. Whenever data is being passed between DDS cache modules 20 with a common memory, pointers to the data are returned rather than the data itself.
A distributed consistency mechanism, an integral component of the DDS protocol and its implementation, enables a file's consistency control site (CCS—only exists when there's a concurrent write sharing condition present) to dynamically relocate itself as necessary ensuring that it is always positioned as far upstream from the DDS Server Terminator Site as possible but still able to monitor and coordinate all file writing operations.
The DDS protocol endeavors, with a minimum number of operations, to provide all the functions described above, and to thereby implement a superset of the functionality provided by all remote file access protocols. The protocol employs discriminated unions to virtualize the file object metadata that flows through and is cached within the DDS layer. Metadata is represented in its native format, and a discriminator identifies the format whenever the metadata is referenced by a client intercept routine 42 in the course of responding to a file access request. This virtualization of metadata is the means that enables DDS to transparently service file access requests from unmodified client workstations regardless of the homogeneity/heterogeneity of the client with respect to the origin file server.
For example, in the process of responding to an NFS request, the NFS client intercept routine 42 must access the file's metadata. When the discriminator identifies the metadata format as NFS or UFS, an NFS client intercept routine (“CIR”) can easily interpret the metadata and generate its response. However, when the metadata format is NTFS, an NFS CIR requires the services of an NTFS to UFS translation module in order to respond to the request.
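The discriminated union approach can be sketched in Python as a tagged record plus a table of translation modules. The sketch is illustrative only: the Metadata class, the ntfs_to_ufs function, and every field name (EndOfFile, ReadOnly, size, mode) are assumptions invented for the example rather than actual DDS, NTFS, or UFS structures.

```python
from dataclasses import dataclass


@dataclass
class Metadata:
    fmt: str      # discriminator naming the native metadata format
    fields: dict  # metadata kept in its native (untranslated) form


def ntfs_to_ufs(fields):
    """Hypothetical NTFS to UFS translation module (invented fields)."""
    return {
        "size": fields["EndOfFile"],
        "mode": 0o444 if fields["ReadOnly"] else 0o644,
    }


# Translation modules available to this NFS client intercept routine.
TRANSLATORS = {"NTFS": ntfs_to_ufs}


def nfs_getattr(meta):
    """Answer an NFS attribute request from cached metadata."""
    if meta.fmt in ("NFS", "UFS"):
        # Homogeneous case: the native format is usable directly.
        return meta.fields
    translate = TRANSLATORS.get(meta.fmt)
    if translate is None:
        raise ValueError("no translation module for " + meta.fmt)
    # Heterogeneous case: translate only at the client terminator site.
    return translate(meta.fields)
```

Keeping the metadata in native form and translating only on demand mirrors the text's point that translation is needed solely for clients that are heterogeneous with respect to the origin file server.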
DDS Domain Hierarchies
U.S. Pat. No. 6,847,968 B2 (“the '968 patent”) discloses the methods employed by DDS cache modules 20 to organize themselves into a hierarchy of domains. The '968 patent is hereby incorporated by reference as though fully set forth here.
FIG. 2 illustrates a DDS virtual file server for Inca Technology. As depicted, an inca domain 102 contains an eng domain 112, a sales domain 114, a corp domain 116 and a mrkt domain 118. The eng domain 112 is non-atomic, which means it contains other domains (sub-domains). In this case the sub-domains are a bob domain 122, a joe domain 124, a pat domain 126, and a svrail domain 128. Three of these domains are atomic domains: the bob domain 122, joe domain 124 and pat domain 126 are all file servers, but svrail domain 128 is a non-atomic domain. The sales domain 114, also an atomic domain, consists of the resources being exported by a single file server. These resources include the exported file system 32A, the exported file system 32B and the exported file system 32C.
FIG. 3 illustrates a user's view of the inca domain 102. In this illustration the user is employing Microsoft's Explorer program to navigate through his computer's file space. Note that the DDS global file system has been mapped to (connected to) the host computer's X: drive. As depicted, the X: drive contains two folders: inca_mail and Internet. inca_mail contains a private namespace and users must have the proper credentials to open that folder and view its contents. However, Internet is a public namespace and is open to all users.
Internet contains two top level domains: com and edu. The com directory contains cnn, hp, ibm, and inca. And, finally, the inca directory contains the inca domain tree depicted in FIG. 2. The cnn, hp, ibm, and inca directories are each the root of a company's domain tree. As FIG. 3 illustrates, when a domain tree root directory is opened (inca, in this case), the next level sub-domains (corp, eng, marketing, sales) appear.
Comparing FIG. 2's inca domain 102 with the inca folder in FIG. 3 clarifies the relationship between domains and folders: they are essentially the same thing. They are both resource containers that may recursively contain other resource containers and/or resource objects. A folder is a visual representation for a domain, and there may be other representations.
For example, the visual representation for a company's domain tree might be the corporate icon (with the well known filename logo.icon stored in the root directory of the domain tree).
FIG. 3 also depicts how DDS binds the shared resources of several companies (cnn, hp, ibm, and inca) into a single namespace enabling a user to seamlessly navigate across company boundaries.
The DDS namespace consists of two layers:
    Filesystem Namespace: namespace defined by individual exported file systems. This layer is defined by the file systems (UFS, NTFS, EXT2FS, . . . ) containing the resource objects being exported through DDS.
    Network Namespace: namespace consisting of DDS domain names. These names can usually be converted to an IP address using industry standard name resolution services such as domain name system (“DNS”).
The Network Namespace ties together the disjunct namespaces of all the individual exported file systems to create a single namespace. DDS employs the existing network name resolution infrastructure to construct the Network Namespace. This results in the binding of exported file systems into the reference framework with which users and system administrators are already familiar.
FIG. 3 depicts the com directory as containing only four sub-directories. In reality, the com directory would contain the root directories of millions of company level domain trees. A single DDS virtual file server, encompassing the complete Internet namespace (gov, org, edu, mil, . . . ) and multiple private namespaces, may encompass hundreds, or even thousands, of petabytes.
This massive amount of data demands improved mechanisms for navigating through the global file system's namespace and for locating content of interest. Obviously, valuable content that cannot be located is actually valueless.
DDS—a Step Beyond the Internet
The Internet as it exists today is an instance of a read only (mostly?) distributed file system on the same order of magnitude as what the DDS global file system will become. Today, Internet users routinely employ search engines to locate content of interest. These search engines appear to work quite well, but one should consider that users generally aren't aware of relevant content that a search fails to reveal.
The DDS global file system requires a search mechanism substantially faster and more efficient than the currently deployed Internet search engines. Recognize that DDS provides a file access service, complete with consistency guarantees. The Internet, by comparison, is an electronic distribution system for published content. Its content, once published, is unlikely to be modified. Furthermore, when an object is modified, a generous “grace period” is acceptable to allow the new content to migrate to distant access points (web proxy cache sites).
Even after most proxy cache sites have loaded the latest version of an object, it may be hours or even days before a web crawler fetches a copy to feed into an indexing engine. So, new content (and modifications to existing content) may not show up in search results for several days.
In contrast, the DDS global file system supports collaboration between individuals and groups. Whenever a document is created or modified, other users often need to be aware of the changes as quickly as possible. This gives rise to a requirement that DDS, to the maximum extent possible, index new and modified content in real time such that a search performed a few seconds after the creation of a new document will locate that document if it does, in fact, satisfy the criteria of the search.
DDS provides the functionality required to enable unmodified industry standard workstations to access remote files using their native CIFS or NFS implementations. DDS virtual file servers receive NFS or CIFS requests and service them from cached data when possible and, when valid cached data is not present, DDS issues requests directed towards origin file servers to fetch the requested data.
Although the DDS protocol is highly streamlined and simplified in comparison with CIFS and NFS, it provides essentially the same capabilities. After all, the DDS protocol is designed to enable client access to files residing on very remote file servers. DDS implements the functionality provided by the file system APIs provided by Linux, Unix, Windows, and other major operating systems.
Using a DDS, NFS, CIFS, UFS, NTFS . . . API, an application establishes a connection to a content object through a series of operations:
    1. Connect to a directory.
    2. Enumerate the directory's contents.
    3. Connect to the target object or to a directory believed to contain the target:
        If connected to the target object: DONE.
        If a sub-directory appears to contain the target object: return to step 2.
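The connect, enumerate, and descend procedure above can be sketched against a local file system using Python's standard os.scandir call. The function name find_object is hypothetical, and the sketch makes one simplifying assumption: lacking any hint about which sub-directory “appears to contain” the target, it visits every sub-directory, which is exactly the exhaustive scan that becomes impractical at scale.

```python
import os


def find_object(root, target_name):
    """Locate an object by name using only directory enumeration.

    Returns the path of the first entry named `target_name` found
    under `root`, or None if no such entry exists.
    """
    pending = [root]  # directories still to be enumerated
    while pending:
        directory = pending.pop()
        # Connect to the directory and enumerate its contents.
        with os.scandir(directory) as entries:
            for entry in entries:
                if entry.name == target_name:
                    return entry.path  # connected to the target: done
                if entry.is_dir(follow_symlinks=False):
                    # A sub-directory might contain the target.
                    pending.append(entry.path)
    return None
```

The cost of this search grows with the number of directories visited, not with the size of the answer, which is why name based navigation alone does not scale to very large namespaces.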
Using the preceding method, a user can laboriously navigate throughout a file system and explore its content. However, discovering content by scanning directories becomes inadequate when individual file systems encompass hundreds or thousands of petabytes of data. For such large file systems this method becomes unviable because users just don't live long enough. For large file systems, users (and processes) require more powerful methods for locating content, methods that enable them to quickly and efficiently establish connections to objects of possible interest.
For the Internet, the problem of searching the content of large file systems has already been addressed. Internet search engines accept users' queries and, in general, respond by providing the user with links to Internet objects that appear to contain something which meets a query's criteria. The user then uses the links to easily connect to objects of potential interest so their content may be perused and a final user determination made as to their relevance.
It is noteworthy that Internet search engines actually represent only the most recent instance of at least three (3) successive generations of computer search engines which provide content searching:
Dedicated Mainframe Systems
    The early mainframe search engines (Dialog, Nexis, Lexis . . . ) indexed data residing on storage directly connected to the system hosting the search service. Users at terminals (both local to and remote from the mainframe) queried the system using a very structured Boolean syntax.
Software Applications
    Verity, Lextek . . . .
Internet Search Engines
    Lycos, Alta Vista, Magellan, Inktomi, Google . . . .
Content based retrieval systems (search engines) are generally implemented as two distinct sets of applications:
    Index generation applications, and
    Retrieval applications.
Index generation is performed once on new (or modified) content, and the resultant new or updated indices are subsequently used by retrieval applications in responding to queries. Traditionally, content indexing systems, e.g., full text indexing, employ batch processing to index document collections. Dialog, one of the first mainframe based commercial full text retrieval systems, used nighttime hours for generating an inverted file which indexed its document collections, and during daytime hours provided online document search and retrieval services.
Presently, search engines continue to generate their inverted index structures in batch mode. Internet Web crawlers prowl sites, discover new content, and ship the new content back to index generation sites (which are usually sites also hosting search engines). New content, continuously flowing in from web crawlers, accumulates at the indexing sites. Eventually, the accumulated new content exceeds a threshold, thereby causing it to be forwarded to an indexing engine. The indexing engine processes the new content, extracting and sorting index terms, and then merging the new terms into an inverted file. When invoked, the indexing engine processes all the accumulated new content in a single batch operation.
Although search engine technology has changed over the last thirty-five (35) years progressing from mainframe computer implementations to local area network (“LAN”) implementations, and then from LAN implementations to wide area network (“WAN”) implementations, the indexing component still retains its lineage: indexing is still performed as a batch mode process.
Definition of Terms
There appears to be no consensus about how terms associated with full text retrieval are used. Therefore, to avoid ambiguity some definitions for full text retrieval terms appear below:
Document: An object (file, record, document) within a collection associated with an accession number.
Accession number: A number, often assigned by the retrieval system's index generation software, which uniquely identifies a document within a collection.
Linear file: A collection of documents, concatenated together, often ordered by accession number.
Linear file index: An index into the linear file. Typically, each record in the linear file index consists of an accession number and a pointer to the associated document. In traditional full text retrieval systems such as Dialog, the pointer is a byte offset into the linear file. In Web based full text retrieval systems, the pointer may be a uniform resource locator (“URL”). Records within this file are sorted by accession number.
Index term: A word or phrase extracted from a document during the parsing phase of the index generation process.
Inverted file entry: An index term followed by a pointer to a specific occurrence of the index term within a specific document.
Inverted file record: An index term followed by inverted file entries pointing to each occurrence of the index term within a document collection. The inverted file entries are ordered by <object id, position within document>.
Inverted file: The complete set of inverted file records associated with a document collection.
    Inverted file records may be alphabetically sorted by index term. With Web based full text retrieval systems, the boundary between document collections has blurred. Distinguishing features of various document collections might be nothing more than that all the documents within a collection were indexed as a batch. Individual documents might not be contained within a linear file; they might be geographically scattered about the Web and URLs within linear file index records provide links to these documents.
Inverted file index: An index into the inverted file. Typically, each record in the inverted file index consists of an index term and a pointer (byte offset) to the index term record in the inverted file.
    Records within this file are alphabetically sorted by index term.
In some full text retrieval implementations, inverted file records are actually files. In that case, “inverted file”, as defined above, refers to the set of inverted files, and there is no inverted file index since the containing file system provides the indexing required to locate an inverted file record.
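The relationship among documents, accession numbers, index terms, and inverted file records can be made concrete with a small Python sketch. It is a simplification under stated assumptions: index terms are single lowercased words, a position is a word offset within the document, and the whole structure lives in memory as a dictionary rather than in files. The function names build_inverted_file and search are hypothetical.

```python
import re
from collections import defaultdict


def build_inverted_file(documents):
    """Build an in-memory inverted file.

    `documents` maps accession number -> document text. The result maps
    each index term to its inverted file record: a list of
    (accession number, word position) entries ordered by accession
    number and then by position within the document.
    """
    inverted = defaultdict(list)
    for accession in sorted(documents):
        words = re.finditer(r"\w+", documents[accession].lower())
        for position, match in enumerate(words):
            inverted[match.group()].append((accession, position))
    # Records are sorted alphabetically by index term.
    return dict(sorted(inverted.items()))


def search(inverted, term):
    """Return the accession numbers of documents containing `term`."""
    entries = inverted.get(term.lower(), [])
    return sorted({accession for accession, _ in entries})
```

With documents {1: "the cache is warm", 2: "Cache the file"}, the record for "cache" holds the entries (1, 1) and (2, 0), and a search for "File" returns only accession number 2.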
Although indexing and retrieval isn't the first thing that comes to mind when “file system” is mentioned, file systems do provide fairly complete name based (as opposed to content based) indexing capabilities. When an object is created, a new entry is created in a parent directory. The entry typically contains the object's name and a link to the object's attributes, which are stored within an inode (or equivalent). One of the inode's attributes is an extent map specifying the device addresses (usually expressed as disk block numbers) where the object itself (the object's data) is stored within the file system.
File systems generally use a hierarchical indexing structure that facilitates rapidly adding new entries and removing deleted entries. File system performance directly impacts overall system performance, so the speed at which entries can be created, deleted, and looked up has been a force that has molded all current file systems. In particular, name based indexing, which is fairly anemic when contrasted against content based indexing, is employed by all commonly deployed file systems.
File system developers have consistently and uniformly concluded that file system performance requirements exclude considering content based indexing. They've opted for speed over heavyweight indexing.
However, the file systems landscape has changed considerably over forty years. Consider the following:
    Virtualized global file systems encompassing hundreds of petabytes are on the horizon if they are not already here, e.g. the World Wide Web.
    These virtual file servers will consist of thousands of individual systems.
    Individual systems may have very substantial physical resources:
        Multiple CPUs, 4 GHz and faster, 32 or 64 bit,
        2 to 64 gigabytes of main memory,
        1 to 1000 terabytes of disk memory,
        Multiple GigE or 10 GigE network connections.
Incorporating a full text search capability into existing file system APIs as seamlessly as possible provides both users and processes with enhanced capabilities for locating, identifying and establishing connections based upon file content.