1. The Field of the Invention
The present invention generally relates to storing electronic data and, more specifically, to a system and method for managing data that is electronically stored.
2. State of the Art
The widespread use of electronic data systems to create, transmit, store and retrieve data is self-evident. Computers and computer systems are ubiquitous, with businesses and other groups now being connected by various systems or networks. Users in such systems access and transmit messages (e.g., e-mail) and a wide variety of other data as part of their daily business activities.
At the same time, there is growing reliance on electronic data storage over paper records. Data storage is exploding at an incredible rate as more and more users accept or prefer electronic storage over paper storage. Protocols for e-mail and other documents may lead to storage of a ghost document, a backup document and later the final document. The sender and all recipients will store and retain documents in a variety of ways and, in turn, are often storing the same document or data. At the same time, highly sensitive business information which is electronically transmitted is more difficult to regulate and control.
Electronic storage can simply be the memory of a computer, but can also be servers attached to a network that users can access and use to store documents of all kinds. In the past, tapes, discs and even punch cards have functioned as electronic data storage devices and also as a form of backup. Today, many use tapes, discs or CDs to back up their computer systems. With the explosion of electronic data, not only are the memory capabilities of the computer or system of the user being taxed, but the backup systems are also starting to become cumbersome because the data to be recorded in the backup process is becoming excessive.
The size and indeed the number of networks continue to increase. Some users will be tied to several networks. They could, for example, be able to access the world wide web, be connected to a LAN for their geographic area (e.g., office/floor), and be connected to a system network that could be company wide, involving electronic interconnection of offices in multiple cities.
Storage of data has led to the development of large database servers which can now contain or store 100 gigabytes (GB) to at least 1 terabyte (TB) of records and files. The large quantity of data in different systems makes it difficult to complete a backup during periods of reduced use or non-use (e.g., overnight).
Recently, Storage Area Networks (SANs) have been developed to replace locally attached disk and tape drives, with the result that hundreds of systems are able to share the same disk or tape drive. A SAN is a high-speed special purpose network (or sub-network) that interconnects different kinds of data storage devices with associated data servers on behalf of a larger network of users. A SAN may use a variety of proprietary protocols optimized for accelerating data transfer to a storage medium.
A Network Attached Storage (NAS) system may also be used for data storage. A NAS is a system dedicated to sharing files via Network File System (NFS) and Common Internet File System (CIFS). The CIFS protocol shares files between Windows® based desktops and laptops. However, any Unix-based machine can read from or write to any file system on any other Unix-based machine via NFS, provided it has been configured to effect file sharing. Each system requires dedicated servers, incurring extra cost and extra administration. Given suitable software, the protocol to access the file system is believed to be independent of the platform.
NAS appears to have evolved to offer network shared file systems, resulting in an overall improvement in performance. NAS is a network appliance containing storage that is accessed by either NFS or CIFS. In the commercial environment, SAN, NAS and alternatives such as Solid State Storage are prevalent.
Data is traditionally stored on these systems in units called files. A file may be quite small, containing from a few kilobytes (KB) to megabytes (MB) of data. Files have been adopted as a mechanism for organizing electronic data, leading to directories on the various types of available storage media for locating and retrieving the files. Boolean logic has been adopted in data management systems so that directories may contain other directories, called subdirectories, and subdirectories may contain other subdirectories, leading to an inverted hierarchical storage system. Assigning names to directories has led to the evolution of a sequence of directories within directories identified by a directory path or path name. A file is created, accessed and stored by supplying a path name.
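By way of illustration only, the relationship between nested directories and a path name can be sketched as follows. The dictionary layout, the function name resolve and the sample path are assumptions introduced for this sketch and do not describe any particular file system.

```python
# Hypothetical in-memory model: a directory is a dict, a file is a bytes object.
root = {
    "disk1": {
        "sales": {
            "salesforecast.ppt": b"slides",
        },
    },
}


def resolve(tree, path_name):
    """Walk the path name one directory component at a time and return the entry."""
    node = tree
    for component in path_name.strip("/").split("/"):
        # A component can only be looked up inside a directory that contains it.
        if not isinstance(node, dict) or component not in node:
            raise FileNotFoundError(path_name)
        node = node[component]
    return node
```

In this sketch, each component of the path name selects one level of the hierarchy, so the file can only be reached by supplying the complete sequence of directory names.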
For example, in a computer system with multiple personal computers (PCs) connected by a network, a local disk and a network attached storage (NAS) device, Client A uses a PC to communicate a message (data) to Client B's PC through or over whatever network is in place. The data is stored as the message is broadcast (and sometimes as it is generated) on the NAS, depending on the client application, through a reference to the path name assigned by the sender. The file server then performs an operation on the file system called “mounting.” A mount attaches a named file system to the file system hierarchy at the path name location directory.
By referencing a file that starts with /disk1, the file server can direct the data to the local disk. If a file that starts with /disk2 is referenced, the file server can direct the data to the NAS. Therefore, the file, via the path name, determines where it is physically stored, thus relegating the filing mechanism to the structure of the path name for the lifetime of the file.
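The routing described above, in which the leading portion of the path name determines the physical storage location, might be sketched as shown below. The mount table MOUNTS, the device labels and the function device_for are hypothetical names chosen for this illustration; an actual file server keeps this information in its own mount table.

```python
# Hypothetical mount table: path name location directory -> storage device label.
MOUNTS = {
    "/disk1": "local disk",
    "/disk2": "NAS",
}


def device_for(path_name, mounts=MOUNTS):
    """Return the device whose mount point is the longest matching path prefix."""
    best = None
    for mount_point in mounts:
        # Match whole path components only, so "/disk10" does not match "/disk1".
        if path_name == mount_point or path_name.startswith(mount_point + "/"):
            if best is None or len(mount_point) > len(best):
                best = mount_point
    if best is None:
        raise LookupError(f"no file system mounted for {path_name}")
    return mounts[best]
```

Under these assumptions, device_for("/disk1/report.doc") selects the local disk and device_for("/disk2/mail/msg1") selects the NAS, which shows how the path name fixes the storage location for as long as the file keeps that name.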
When Client A creates the data, the system typically functions to store that data so that the path name is accessible to Client A. Client A can then recover the data. When Client B receives the data or message, the data may again be stored. Client B may store the message or data under a newly assigned path name so that in effect identical data is now being stored in the server but with different path names.
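The redundancy described above can be made concrete with a short sketch that groups stored path names by a digest of their contents. The sample path names and the function names group_by_content and duplicated_paths are assumptions made for this illustration only; the sketch merely exposes the duplication and is not a description of how any existing system behaves.

```python
import hashlib
from collections import defaultdict


def group_by_content(files):
    """Map each SHA-256 content digest to the sorted path names that hold it."""
    groups = defaultdict(list)
    for path_name, data in files.items():
        groups[hashlib.sha256(data).hexdigest()].append(path_name)
    return {digest: sorted(paths) for digest, paths in groups.items()}


def duplicated_paths(files):
    """Return the groups in which one payload is stored under several path names."""
    return [paths for paths in group_by_content(files).values() if len(paths) > 1]


# Hypothetical server contents: path name -> stored bytes.
stored = {
    "/disk2/clientA/sent/forecast.msg": b"Q3 forecast attached",
    "/disk2/clientB/inbox/fwd-forecast.msg": b"Q3 forecast attached",
    "/disk2/clientB/notes.txt": b"call A on Monday",
}
```

Here the message sent by Client A and the copy saved by Client B carry identical bytes, yet each occupies its own path name and therefore its own storage.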
Many companies now store data from a variety of devices for mobile communication and 3G technologies. The advent of new technology has combined data from laptop personal computers (PCs), Personal Digital Assistants (PDAs) and mobile phones. During the regular course of business, software applications and data need to be accessible from this diverse range of platforms, which are remote from the corporate network. This accessibility requires users to store the necessary applications and data locally to do their jobs. Consequently, large amounts of data are being sent and received, leading to the assignment of different path names by multiple recipients. It can be seen that with the proliferation of networked tools (e.g., PDA devices like a Palm Pilot™ product), the storage needs of many system operators are escalating at an alarming rate.
At the same time, the data being transmitted can lead to widespread access to very sensitive confidential information. A lost PC or PDA can allow the finder to access some of a company's most important confidential information. Stated alternately, the proliferation of network tools is making it very difficult for the owner of confidential business information that is accessible on a network to maintain limited distribution and to otherwise effect logical management of sensitive data.
Efficient disk optimization is also difficult to achieve because companies have minimal control over where the data and files are stored. For example, an employee storing a personal collection of mp3 tracks on disk1/music forces a company to incur costs that are unplanned. Accordingly, the true cost of centralized storage, and storage on portable devices, can increase from cents per gigabyte to dollars per gigabyte. In this manner, storage for organizations can quickly become expensive, unmanaged and invisible.
In U.S. Pat. No. 6,615,405, an example of a system for installing software on microprocessor based devices is provided for accessibility over a computer network. The system includes identifying component data associated with a software application using an electronic device. The system generates a first server update algorithm by comparing the component data against data present on a first server and executing the first server algorithm thereby duplicating the component data on the first server. Then the system generates a second server update algorithm by comparing the component data on the first server against data present on a second server and executing the second server algorithm thereby duplicating the component data on the second server. Finally, the system generates a second device update algorithm by comparing the component data on the second server to data present on a second device and installing the software application on the second device by executing the second device update algorithm.
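A loose sketch of the compare-then-duplicate sequence summarized above is given below. It is not taken from the cited patent; the representation of an update as a list of component names, the dictionary stores and the function names build_update and apply_update are all assumptions adopted here purely to illustrate the idea.

```python
def build_update(source, target):
    """Compare source against target and list the components that must be copied."""
    return [name for name, data in sorted(source.items()) if target.get(name) != data]


def apply_update(update, source, target):
    """Execute the update by duplicating each listed component onto the target."""
    for name in update:
        target[name] = source[name]


# Hypothetical stores: component name -> component bytes.
device = {"app.bin": b"v2", "app.cfg": b"cfg"}
first_server = {"app.bin": b"v1"}
second_server = {}
second_device = {}

# Propagate the components along the chain, one comparison per hop.
hops = [(device, first_server), (first_server, second_server), (second_server, second_device)]
for source, target in hops:
    apply_update(build_update(source, target), source, target)
```

After the three hops in this sketch, the second device holds the same components as the originating device, with each hop copying only what the comparison found to be missing or different.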
The current systems continue to follow the Boolean logic by assigning a new path name to each file or sub file so that it can be accessed and retrieved. Such systems do not provide search functionality or easily permit application of policies to regulate or manage the files or data. In other words, there is no known system to eliminate multiple path names or files for the same or equivalent document being stored by different users and in some cases even by the same user. For example, Client A sends a message to Client B. Client B replies by sending a message which attaches the original message sent by Client A. Thus Client A now has the message from Client B plus the original message from his/her/itself, which is now assigned a new path name. The rapid increase in storage requirements can thus be better understood, particularly with the increasing and widespread use of electronic communications and electronic documents. Faxes once produced on paper are now being electronically delivered to users on their PCs.
In short, many businesses are now confronting an explosion of electronic data storage which is expensive to acquire, expensive to maintain and which takes up space. In addition, the site costs go up due to the need for cooling and electrical power.
The problem is complicated by the fact that file names are not standardized; the user selects some component of the path name. The only information that can automatically be determined about a file is from the filename extension and administrative properties. For instance, when an employee creates a PowerPoint presentation “salesforecast.ppt” on a Windows box with a default registry, information available to the employee includes the PowerPoint file format, the size of the file, the file creation date, and the dates last accessed and modified. If additional information is desired pertaining to the file contents, the person accessing the file must rely on information the author may have provided in a file summary, depending on whether the author took the time to create it. If the file summary or an informative filename has not been provided, the person accessing the files may spend a significant amount of time manually searching through a series of files and directories.
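The limited set of properties that can be determined automatically may be illustrated with the following sketch, which reads only the filename extension and the administrative properties reported by the operating system. The function name automatic_properties and the placeholder file are assumptions for this illustration. The creation date is omitted because the meaning of the corresponding operating system field differs between platforms.

```python
import os
import tempfile
from datetime import datetime, timezone


def automatic_properties(path_name):
    """Collect the properties available without opening or interpreting the file."""
    info = os.stat(path_name)

    def as_time(seconds):
        # Convert a POSIX timestamp into a timezone-aware datetime.
        return datetime.fromtimestamp(seconds, tz=timezone.utc)

    return {
        "extension": os.path.splitext(path_name)[1].lower(),
        "size_bytes": info.st_size,
        "modified": as_time(info.st_mtime),
        "accessed": as_time(info.st_atime),
    }


# Hypothetical file created only so that the sketch can be run as written.
with tempfile.TemporaryDirectory() as folder:
    sample = os.path.join(folder, "salesforecast.ppt")
    with open(sample, "wb") as handle:
        handle.write(b"placeholder")
    properties = automatic_properties(sample)
```

Nothing in the resulting dictionary describes what the presentation is about, which is the gap the passage above identifies.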
In view of the above and other related drawbacks and limitations identified in the relevant data management systems, there is a need for a meta-data driven intelligent file system that uses profiles and policies to manipulate files in a manner that no longer relies simply on a path name.