A computer is a machine that has a processor, memory, and other parts. Computer memories contain data. Some of the data are computer programs and other data are information that can be operated on or accessed by a computer program. Most computers contain two types of memory: volatile, which forgets data when not energized, and non-volatile, which does not forget data when not energized.
When a computer is first started, the processor executes a series of computer programs stored in non-volatile memory. In many instances, the processor loads a program before executing it. Loading means copying all or part of the computer program from non-volatile memory into volatile memory. A computer connected to a computer network can also load a computer program from another computer on the network into its own volatile memory. The program or set of programs the computer loads first is usually the operating system. The operating system supplies resources to, and coordinates resource sharing among, all the programs that the computer is running. Some programs are designed to run continuously and perform a task without most users being aware that the program is running. When such a background program is not beneficial to the user, it is often called a virus; when it is beneficial, such as a virus scanner, it is often called a daemon.
A non-volatile memory contains data. The data is meaningless without a computer program to interpret it. Most computers have an operating system that interprets some, or all, of the non-volatile memory contents as a file system. A file system is often thought of as a hierarchical arrangement of directories and files. In reality, most file systems have at least three kinds of files and some bookkeeping information for keeping track of them. The three kinds of files are regular files, directory files, and special files.
Regular files contain data and are neither special files nor directories. Directory files, also called “directories”, contain directory data. Directory data includes information about descendant files. Special files come in many varieties with many uses. For example, some special files can be used for accessing devices, such as hard drives, keyboards, and graphics chips. Other special files can be used for accessing services offered by the operating system, such as pipes. Special files also provide for capabilities like soft linking (also known as symbolic linking).
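On a POSIX-style system, these three kinds of files can be distinguished programmatically by examining the mode bits in a file's status information. A minimal Python sketch (the symbolic-link portion assumes a POSIX system):

```python
import os
import stat
import tempfile

def classify(path):
    """Classify a file as 'regular', 'directory', 'symlink', or 'special'
    based on the mode bits in its lstat() result."""
    mode = os.lstat(path).st_mode
    if stat.S_ISREG(mode):
        return "regular"
    if stat.S_ISDIR(mode):
        return "directory"
    if stat.S_ISLNK(mode):
        return "symlink"
    return "special"   # device nodes, FIFOs (pipes), sockets, etc.

# Demonstrate on a throwaway directory.
with tempfile.TemporaryDirectory() as d:
    f = os.path.join(d, "data.txt")
    open(f, "w").close()
    link = os.path.join(d, "link")
    os.symlink(f, link)                          # a soft link
    print(classify(f), classify(d), classify(link))
```

Soft links show up here as their own category because, as noted above, many systems implement them as a kind of special file.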
Every file in a file system usually has other information associated with it. For clarity, this other information will be called “I-nodes” (short for index nodes) because that is what it is called in the UNIX operating system and its progeny. Other operating systems give it other names. Every file has at least one I-node, and an I-node is associated with only one file. I-nodes are not part of the file data. I-nodes can include data informing the computer where the file data are located, as well as other data such as file permissions, ownership, creation time, and last modification time.
Directory files, mentioned earlier, contain information about descendant files. Oftentimes, a directory's information about a file is a directory entry that pairs a filename with an I-node reference. A user sees the filename and can choose to open the file. When the file is opened, the operating system uses the I-node reference to find the file's first I-node. It then examines the I-node to determine the file type, permissions, and other information. When more than one directory entry, even entries in different directories, contains the same I-node reference, the file data is “hard linked”. A hard link is different from a soft link because soft links are typically implemented using special files while hard links use directory entries.
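A minimal Python sketch, assuming a POSIX file system, showing that two hard-linked directory entries share a single I-node:

```python
import os
import tempfile

# Two directory entries that reference the same I-node are hard links:
# os.stat() reports the same st_ino for both names, and st_nlink counts
# how many directory entries reference that I-node.
with tempfile.TemporaryDirectory() as d:
    original = os.path.join(d, "original")
    with open(original, "w") as fh:
        fh.write("file data")
    alias = os.path.join(d, "alias")
    os.link(original, alias)           # create a second directory entry

    a, b = os.stat(original), os.stat(alias)
    print(a.st_ino == b.st_ino)        # True: same I-node reference
    print(a.st_nlink)                  # 2: two names for one file
```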
As mentioned earlier, computer memory, including non-volatile memory, contains data. As also mentioned earlier, files can have file data. The difference between a file and a memory is that a memory is a physical device that gives little, if any, structure to the data. Without a program that understands the contents, the memory could equally contain files, directories, portions of files, portions of directories, or nothing at all. A file is structured data held in the computer memory, where the structure is defined by people and implemented by a computer program.
For example, a brand new computer hard drive is a non-volatile computer memory containing nothing. A computer can use the hard drive to hold files and directories. The files and directories are structured data held on the hard drive and the structure is implemented by the computer operating system. A computer can also treat the hard drive itself as a single file called a device file or a raw device file.
Metadata is generally defined as data about data; hence file metadata is information about a file. As discussed earlier, a file system can contain regular files, directory files, and special files. Each file has at least one I-node and can contain file data. An I-node can contain data about the file such as file creation date, last modification date, who created it, who can access it, and what kind of file it is. This data can be file metadata. As such, a file system consists of file data and file metadata. A collection of file data and file metadata is called an image. The image of a file system can be used to create a duplicate file system or to recover a file system that is lost or destroyed. Furthermore, the metadata can also contain data that is not normally contained in an I-node, such as a checksum or a digital signature.
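As an illustrative sketch (not any particular imaging tool), the following Python function gathers I-node-style metadata plus a checksum for one file; a collection of such records, together with the file data itself, would constitute a simple image in the sense described above:

```python
import hashlib
import os

def file_record(path):
    """Gather I-node-style metadata plus a checksum for one file."""
    st = os.stat(path)
    with open(path, "rb") as fh:
        digest = hashlib.sha256(fh.read()).hexdigest()
    return {
        "path": path,
        "size": st.st_size,        # metadata normally kept in an I-node
        "mtime": st.st_mtime,      # last modification time
        "mode": st.st_mode,        # permissions and file type
        "sha256": digest,          # not normally stored in an I-node
    }
```

Applying `file_record` to every file in a directory tree would yield the file metadata half of an image; the checksum field illustrates metadata that goes beyond what an I-node ordinarily holds.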
One of the problems often encountered in computer administration is ensuring that a computer memory contains the correct data. The files and other objects in memory, such as those held on a hard drive, can become corrupted. Sources of corruption include: computer viruses; malicious, inattentive, or untrained users; hardware failure or degradation over time; and upgrades to the set of correct files. One way to detect a corrupt file is to compare it to a known good copy. A byte by byte or word by word comparison can detect corruption as well as where in the file the corruption lies.
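The byte by byte comparison can be sketched as follows; `first_difference` is a hypothetical helper that reports where in the data the corruption lies:

```python
def first_difference(a: bytes, b: bytes):
    """Compare two byte strings and return the offset of the first
    mismatch, or None if they are identical. A length difference
    counts as a mismatch at the end of the shorter string."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        return min(len(a), len(b))
    return None

print(first_difference(b"known good", b"known g0od"))  # 7
```

In practice the known good copy and the file under scrutiny would be read from storage in chunks rather than held entirely in memory.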
Another way to detect file corruption is to use a file checksum. A file checksum is a number that is calculated from the data held in the file. Two files with the same contents will have the same checksum. When two files that are different have the same checksum, it is called a collision. Some people are motivated to intentionally create collisions. Historically, many checksum algorithms have been developed to avoid both intentional and unintentional collisions. Unintentional collisions are exceedingly rare. The various checksum algorithms are widely known and available.
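Using Python's standard `hashlib` as one widely available implementation:

```python
import hashlib

def checksum(data: bytes) -> str:
    """SHA-256 checksum of a byte string, rendered as hex."""
    return hashlib.sha256(data).hexdigest()

# Identical contents always give identical checksums...
print(checksum(b"same bytes") == checksum(b"same bytes"))   # True
# ...while different contents give different checksums, except in the
# astronomically unlikely event of a collision.
print(checksum(b"same bytes") == checksum(b"other bytes"))  # False
```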
The advantage of a checksum over a byte by byte comparison is that a checksum is a relatively small datum. For example, one of the popular current checksum algorithms gives a number that is 256 bits long. Checksums are small enough to be treated as file metadata, although current operating systems rarely use them. Byte by byte comparison requires a complete copy of the file under scrutiny, which can be many gigabytes long. Furthermore, a cryptographically secure and signed checksum can be distributed with a file so that anyone can verify that the file is not corrupted. A complete copy of the file, even if received from a trusted source, is not cryptographically secure. A trusted source is a computer or other data source that is known to provide the correct data and to never supply a corrupted copy. Trusted servers, digital signatures, secure checksums, key exchanges, and other fundamentals of secure or trusted data storage and data exchange are well known to those practiced in the arts of secure computing, encryption, and cryptography. Checksums are also commonly used to verify data integrity in other protocols, such as rsync.
Data, such as files and images, can be distributed in a variety of ways. Some of the ways involving transport of physical media are via floppy disk, compact disk, digital video disk, magnetic tape, and disk drive. Other ways involve client-server exchanges over communications networks as standardized by Internet protocols such as HTTP, FTP, TCP, UDP, and others. Data can also be transmitted securely, meaning it is protected from tampering, spoofing, or eavesdropping. More recent developments are peer-to-peer networks (PTP) and TORRENT files (torrents).
In PTP, a client requests data from the PTP network. In a centralized PTP network, such as the original NAPSTER network, the client sends the request to a central index server. The central index server responds by telling the client where to go to get the data. In a decentralized PTP network, the client asks another computer on the network for the data. If that computer does not have the data or know where to get it, then it forwards the request to other computers on the network. The request can be repeatedly forwarded until the data is found or the search is given up. If the data is found, then the client is told where to go to get the data. In both centralized and decentralized PTP networks, the client often receives many different places to get the data from and chooses one of them.
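The forwarding behavior of a decentralized PTP search can be sketched as a bounded flood over a graph of peers; the `network` and `holdings` dictionaries below are hypothetical stand-ins for real network state and messages:

```python
def find_data(network, holdings, start, item, ttl=5):
    """Forward the request from peer to peer until the item is found
    or the time-to-live is exhausted; return the peers holding it."""
    seen, frontier, found = {start}, [start], []
    for _ in range(ttl):
        next_frontier = []
        for node in frontier:
            if item in holdings.get(node, ()):
                found.append(node)
            for peer in network.get(node, ()):
                if peer not in seen:        # forward only to new peers
                    seen.add(peer)
                    next_frontier.append(peer)
        if found:
            return found       # tell the client where to get the data
        frontier = next_frontier
    return found               # empty list: the search was given up

network = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
holdings = {"D": {"song.mp3"}}
print(find_data(network, holdings, "A", "song.mp3"))  # ['D']
```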
TORRENT files (torrents), such as those implemented with the BITTORRENT software, are special implementations of PTP. A torrent client receives many places from which to retrieve all or part of the data. The client can then retrieve different parts of the data from different computers and can do it simultaneously. The central index server mediates the exchange of data between the client and the other computers on the network. In a decentralized architecture, the central index server is replaced by a decentralized database that carries information for mediating the data exchange.
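The piece-wise, simultaneous retrieval can be sketched as follows; `fetch_piece` and the `pieces` dictionary are hypothetical stand-ins for real peer connections:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_piece(peer, pieces, index):
    """Stand-in for a network request to one peer for one piece."""
    return index, pieces[peer][index]

def retrieve(assignments, pieces):
    """assignments maps piece index -> peer chosen for that piece.
    The fetches run concurrently, then the pieces are reassembled."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(fetch_piece, peer, pieces, i)
                   for i, peer in assignments.items()]
        parts = dict(f.result() for f in futures)
    return b"".join(parts[i] for i in sorted(parts))

# Two peers each hold a full copy; the client takes piece 0 from one
# peer and piece 1 from the other, then reassembles the file.
pieces = {"peer1": {0: b"hello ", 1: b"world"},
          "peer2": {0: b"hello ", 1: b"world"}}
print(retrieve({0: "peer1", 1: "peer2"}, pieces))  # b'hello world'
```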
Among the problems that arise in administering computers is ensuring that the correct data are present in the computer memory. The data can be corrupted intentionally or unintentionally. It can be uncorrupted, but out of date. Regardless, if the computer has the wrong data in memory, repairing or updating the data contained in a single computer is a tedious chore that is error-prone. In some computing environments, hundreds, thousands, or even hundreds of thousands of computers must be maintained. The human cost of such maintenance is tremendous. Automatic maintenance is usually restricted to causing all of the computers to load, via a network connection, a common image of the operating system, other programs, and data. A problem computer is shut down and restarted such that it receives a fresh image. Such schemes can be effective, but are limiting.