The invention relates to a multi-threaded system for computationally intensive data management applications.
In a multi-processor system, two or more instruction processors (generally referred to as CPUs) work together to process programs simultaneously. Symmetric multiprocessing, or SMP, involves a multiprocessor computer architecture in which two or more identical processors are connected to a single shared main memory. SMP is the standard processing model in personal computers.
A multi-core processor is an integrated circuit to which two or more processors have been attached for enhanced performance, reduced power consumption, and more efficient simultaneous processing of multiple tasks. Multi-core processing is a growing industry trend as single-core processors rapidly reach the physical limits of possible complexity and speed. Companies that have produced multi-core products include Intel, AMD, ARM, Broadcom, and other chip makers. Multi-core processors are widely used across many application domains, including general-purpose computing, data mining, Web applications, mathematical analysis, embedded applications, networking, digital signal processing, and graphics. Current operating systems such as Microsoft Windows, Linux, Solaris, and FreeBSD are capable of benefiting from multi-core processors.
The amount of performance gained by the use of a multi-core processor is strongly dependent on the software algorithms and implementation. In particular, the possible gains are limited by the fraction of the software that can be parallelized to run on multiple cores simultaneously; this effect is described by Amdahl's law.
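By way of illustration only, Amdahl's law can be stated as S(N) = 1 / ((1 - p) + p / N), where p is the fraction of the run time that can be parallelized and N is the number of cores. The following minimal sketch evaluates that bound; the function name amdahl_speedup and the sample values are assumptions chosen for the example.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speedup when `parallel_fraction` of the run time
    can be spread evenly over `cores` cores (Amdahl's law)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical program in which 90% of the work can run in parallel.
    for n in (1, 2, 4, 8, 64):
        print(n, round(amdahl_speedup(0.9, n), 2))
```

For a program in which 90% of the work is parallelizable, the serial 10% caps the speedup below 10 no matter how many cores are added.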
Software can be run in parallel by dividing a large problem into smaller ones, which are then solved concurrently. The software that addresses each smaller problem can be a process or a thread. Both processes and threads are methods of parallelizing an application. Processes, however, are independent execution units that contain their own state information, use their own address spaces, and interact with each other only via interprocess communication mechanisms, generally managed by the operating system.
By contrast, a thread is a coding construct. A single process might contain multiple threads; all threads within a process share the same state and the same memory space. Threads can communicate with each other directly, because they share the same variables, as well as via interprocess communication mechanisms. Threads have several advantages over processes: a thread is much quicker to create, switching between threads is faster than switching between processes, and communication between threads is simple and fast because they share the address space. A process or a thread can be persistent, meaning that it is always available, or it can be created to perform a certain task and then terminate.
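The shared memory space of threads can be illustrated with a minimal sketch in which several threads update one variable directly. The names counter, add_many, and run_workers are hypothetical, and the lock is included on the assumption that unsynchronized updates to shared data would otherwise race.

```python
import threading

counter = 0                      # one variable, visible to every thread in the process
counter_lock = threading.Lock()  # protects the read-modify-write below


def add_many(times: int) -> None:
    """Increment the shared counter `times` times."""
    global counter
    for _ in range(times):
        with counter_lock:
            counter += 1


def run_workers(num_threads: int = 4, times: int = 10_000) -> int:
    """Start `num_threads` threads that all update the same counter."""
    global counter
    counter = 0
    workers = [threading.Thread(target=add_many, args=(times,))
               for _ in range(num_threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return counter
```

No interprocess communication mechanism is involved here: each thread reads and writes the same variable because all of them share the address space of the process.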
Originally, process creation in Unix was based on two system calls, fork( ) and exec( ). The system call fork( ) creates a copy of the process that invokes it. The image of the new process, referred to as the child, is identical to that of the calling process, sometimes referred to as the parent process, except for a few parameters such as the process identifier (PID). The system call fork( ) creates a process but is not enough to run a new program. To do that, the forked child needs to overwrite its own image with the code and data of the new program. This is done by exec( ).
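The fork( ) and exec( ) sequence described above can be sketched as follows, using Python's os module, which exposes the underlying system calls on Unix-like systems. The helper name run_program and the exit code 127 used when exec( ) fails are assumptions made for the example.

```python
import os
from typing import Sequence


def run_program(argv: Sequence[str]) -> int:
    """Fork a child, replace its image with the program argv[0],
    and return the child's exit code to the parent."""
    pid = os.fork()                  # the child starts as a copy of the caller
    if pid == 0:
        # Child: overwrite this process image with the new program.
        try:
            os.execvp(argv[0], list(argv))
        except OSError:
            os._exit(127)            # exec failed; exit without parent cleanup
    # Parent: wait for the child to terminate and decode its status.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)      # child was terminated by a signal
```

Each call creates a complete new process and then discards its copied image, which is one reason this pattern is costly when used heavily.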
Operating systems provide proprietary thread implementations. Known proprietary thread implementations are those of IBM's AIX, Sun's Solaris, Linux, and Microsoft's Windows NT systems. A portable thread implementation is provided by POSIX (Portable Operating System Interface). The POSIX.4a specification provides a set of Application Program Interfaces (APIs) that allow C programmers to include thread support in their programs. The POSIX standard was defined for the C programming language only. While efforts to define the POSIX standard for other languages are still in progress, programmers writing in other programming languages can use the POSIX standard by using wrappers around C function calls.
The promise of multi-threading is based on the opportunity to leverage multi-core computing platforms to increase performance. This is especially critical in computationally intensive applications. However, many existing applications are not multi-core aware. Some of them make extensive use of fork( ) and exec( ) and only limited use of multi-threading. In addition to being slow, fork( ) and exec( ) consume a great amount of resources. Limited use of threads, the absence of appropriate synchronization between processes and threads, and the fact that software tasks were not designed to run in parallel make it difficult for some applications to benefit from multi-core processors.
Interprocess communication (IPC) mechanisms allow arbitrary processes and threads to exchange data and synchronize execution. IPC may also be referred to as inter-thread communication. The main IPC methods are message queues, signals, sockets, pipes, named pipes, semaphores, and shared memory. In addition to IPC, POSIX threads provide the following methods for synchronization: mutual exclusion (mutex) locks, condition variables, and read-write locks. POSIX threads also specify a synchronization object called a barrier, along with barrier functions. These functions create the barrier, specifying the number of threads that synchronize on it, and cause each thread that finishes its task to wait at the barrier until all the threads have reached it. When the last thread arrives at the barrier, all the threads resume execution.
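The barrier behavior described above can be sketched with Python's threading.Barrier, which is used here as an analogue of the POSIX barrier rather than the POSIX interface itself. The function name run_phases and the event labels are hypothetical.

```python
import threading


def run_phases(num_threads: int = 4) -> list:
    """Each thread records 'work', waits at the barrier, then records 'resume'."""
    barrier = threading.Barrier(num_threads)  # number of synchronizing threads
    events = []
    events_lock = threading.Lock()            # mutex protecting the shared list

    def worker(idx: int) -> None:
        with events_lock:
            events.append(("work", idx))
        barrier.wait()                        # blocks until all threads have arrived
        with events_lock:
            events.append(("resume", idx))

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events
```

Because no thread passes the barrier until the last one arrives, every 'work' entry is recorded before any 'resume' entry, whatever order the threads are scheduled in.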
Other forms of IPC are message passing in Java Remote Method Invocation (RMI), Common Object Request Broker Architecture (CORBA), and others. There is also the Message Passing Interface (MPI), which is a library specification for message passing, proposed as a standard by a broadly based committee of vendors, implementers, and users. Java has built-in thread support for multi-threading synchronization.
In some systems, a process or a thread executes a program in response to commands it receives. Each command is parsed to find its associated executable program. Some applications, for instance the Unix shell, use hash tables to locate an executable program. A hash table uses a hash function. A hash function executes an algorithm that takes a variable-size input, such as a name, and returns a fixed-size string or an integer, which is called a hash value. The hash value is used to find the item that is associated with the input. In many situations, hash tables are more efficient than search trees and many other table lookup structures.
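As a minimal sketch of such a lookup, the following example maps command names to program paths through a small chained hash table. The hash function is a djb2-style string hash chosen for illustration, not the one used by any particular shell, and the names hash_name, CommandTable, and the sample paths are assumptions.

```python
def hash_name(name: str, num_buckets: int) -> int:
    """djb2-style hash: variable-size name in, fixed-size bucket index out."""
    h = 5381
    for byte in name.encode("utf-8"):
        h = (h * 33 + byte) & 0xFFFFFFFF   # keep the value within 32 bits
    return h % num_buckets


class CommandTable:
    """Chained hash table mapping a command name to an executable path."""

    def __init__(self, num_buckets: int = 16) -> None:
        self._buckets = [[] for _ in range(num_buckets)]

    def register(self, name: str, path: str) -> None:
        bucket = self._buckets[hash_name(name, len(self._buckets))]
        for i, (existing, _) in enumerate(bucket):
            if existing == name:
                bucket[i] = (name, path)   # replace an earlier entry
                return
        bucket.append((name, path))

    def lookup(self, name: str):
        """Return the path registered for `name`, or None if absent."""
        bucket = self._buckets[hash_name(name, len(self._buckets))]
        for existing, path in bucket:
            if existing == name:
                return path
        return None
```

A lookup inspects only the entries in one bucket, so its cost stays roughly constant as long as the names are spread evenly over the buckets.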
In a multi-threaded environment, some recommend using processor or thread affinity to increase performance by associating processes or threads with a particular processor or core. This minimizes thread migration and context switching. A context switch is the computing process of storing and restoring the state, referred to as the context, of a CPU such that multiple threads and processes can share a single CPU resource. Processor or thread affinity also improves data locality and reduces cache-coherency traffic among processors or cores.
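Explicit affinity can be sketched as follows on platforms that expose the Linux-style scheduler affinity calls through Python's os module; other platforms provide different interfaces or none. The helper name pin_current_process is an assumption, and the choice of core is left to the caller.

```python
import os


def pin_current_process(cpu: int) -> set:
    """Restrict the caller to a single CPU and return the resulting CPU set."""
    if not hasattr(os, "sched_setaffinity"):
        raise NotImplementedError(
            "CPU affinity calls are not available on this platform")
    os.sched_setaffinity(0, {cpu})          # pid 0 refers to the caller
    return set(os.sched_getaffinity(0))
```

A caller would typically pick the core from the set returned by os.sched_getaffinity(0), since a core outside the allowed set is rejected by the operating system.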
Current operating systems that support multi-core systems have built-in thread affinity: they give a thread a tendency to run where it has run before, keep threads close to their parents, avoid moving threads around, keep data close to the threads that initialize it, group cores according to locality, and assign threads to less loaded cores and core groups. However, program behavior is unpredictable because it changes over time, and this may cause a drop in performance. For instance, data initialized at the beginning of a program by one thread but later used by multiple threads may cause many threads to be allocated to the core where the data initialization took place, while other cores are less loaded. To solve such problems, there is a need to use thread affinity to force the execution of threads on the less loaded cores. Also, if two or more threads use the same data in memory, the threads could be mapped to the same core so that they can share the same cache. The implementation of built-in thread affinity differs among operating systems. As a result, a multi-threaded program with threads designed to run in parallel behaves differently on different operating systems.
Computationally intensive applications can take advantage of multi-core architecture. One such application is data management. Advances in technology have decreased the cost of storage, increased the size of digital data, and increased the rate of data transfer. This has resulted in applications that have to deal with voluminous data. An example of such applications is enterprise data management. Enterprise data management is the development and execution of policies, practices, and procedures that properly manage enterprise data.
Aspects of enterprise data management that are computationally challenging include discovery of data stored in an enterprise network, data categorization (sometimes referred to as classification), and application of enterprise management policies to categorized data. Some data discovery methods utilize Internet Protocol (IP) port scanners. IP port scanners determine the services and devices available in the network and the type of each data source. Categorization of data is based on metadata or on full text search. Categorization rules specify how data is classified into different groups. For instance, document categorization could be based on who owns the documents, their size, and their content. Metadata consists of information that characterizes data and is sometimes referred to as “data about data”. Data categorization methods based on metadata group data according to information extracted from its metadata. A few examples of such information are the time a document was last accessed, its owner, its type, and its size. Categorization based on full text search utilizes search technology. Full text search is used to identify documents that contain specific terms, phrases, or a combination of both. The result of the search is used to categorize data. In addition to categorization, enterprise data management involves formulation of policies to be applied to categorized data. For example, policies could include encrypting sensitive data, auditing data, retaining data, archiving data, deleting data, modifying data access, and modifying read and write permissions.
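Metadata-based categorization rules of the kind described above can be sketched as predicates over a metadata record. Everything in this example is hypothetical: the field names, the category names, and the thresholds are assumptions chosen only to show the shape of such rules.

```python
from dataclasses import dataclass


@dataclass
class DocumentMetadata:
    path: str
    owner: str
    size_bytes: int
    days_since_access: int


# Each rule pairs a category name with a predicate over the metadata.
# The thresholds and the owner name are illustrative assumptions.
RULES = [
    ("stale", lambda m: m.days_since_access > 365),
    ("large", lambda m: m.size_bytes > 100 * 1024 * 1024),
    ("finance", lambda m: m.owner == "finance"),
]


def categorize(meta: DocumentMetadata) -> list:
    """Return every category whose rule matches, or 'uncategorized' if none does."""
    matched = [name for name, rule in RULES if rule(meta)]
    return matched or ["uncategorized"]
```

A policy step could then select an action, such as archiving or encrypting, according to the categories returned for each document. Because each document is categorized independently of the others, this kind of work is a natural candidate for division among threads.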
As technology advances, the computational requirements of data management will increase. Other applications that deal with voluminous data include scientific research, environmental and energy applications, and applications that involve modeling and simulation.
It would thus be highly desirable to provide a general and modular multi-core aware solution to the problem of enterprise data management.
It would additionally be highly desirable if the solution could be used for other computationally intensive applications.