A computer system comprising a single processor is referred to as a uniprocessor (UP) system. Typically, the UP system initially loads an operating system to manage all other program code executed by the processor. A set of core services provided by the operating system is often coded in the operating system's kernel. Such services may include, among other things, providing file-system semantics, input/output (I/O) operations, memory management, network communications, and the like. A kernel may be embodied as a microkernel that organizes the kernel's services in relatively small, configurable modules. In this sense, the Data ONTAP™ operating system, available from Network Appliance, Inc. of Sunnyvale, Calif., is an example of an operating system implemented as a microkernel.
In general, kernel code in a uniprocessor system may be divided into multiple segments of program code. Each program defines a specific set of ordered operations that are performed by the processor. A process refers to an instance of a program being executed by the processor. Accordingly, UP system resources, such as memory address space and processor bandwidth, may be divided among multiple processes. Moreover, the resources may be allocated to both user and kernel processes. Here, a kernel process implements one or more of the kernel's services. In contrast, a user process provides a service, such as word processing, web browsing, image editing, software developing, etc., for a user.
Typically, a process scheduler controls when different processes are executed by the processor. In some implementations, the process scheduler is integrated into the kernel (“kernel-level”). Other implementations code the scheduler outside the kernel (“user-level”). The process scheduler maintains a run queue that governs the order in which processes are executed by the processor. Thus, the scheduler dequeues a process for the processor to execute, then re-enqueues the process after an appropriate time interval elapses or the processor finishes the process's execution.
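The dequeue and re-enqueue cycle described above can be illustrated with a minimal sketch. It assumes a simple round-robin policy with a fixed time interval; the function name `run_round_robin`, the `quantum` parameter, and the process names are hypothetical and chosen only for illustration. Real schedulers may use priorities or other policies.

```python
from collections import deque

def run_round_robin(work, quantum):
    """Simulate a run queue. `work` maps a process name to its remaining work units."""
    run_queue = deque(work.items())
    trace = []
    while run_queue:
        # Dequeue the next process for the processor to execute.
        name, remaining = run_queue.popleft()
        ran = min(quantum, remaining)
        trace.append((name, ran))
        remaining -= ran
        # Re-enqueue the process if its time interval elapsed before it finished.
        if remaining > 0:
            run_queue.append((name, remaining))
    return trace

trace = run_round_robin({"A": 3, "B": 1, "C": 2}, quantum=2)
print(trace)
```

In this sketch, process A is re-enqueued once because its work exceeds one interval, while B and C finish within their first turn.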
In practice, a process in a UP system often may be interrupted (i.e., “stalled”) by the processor so other operations can be performed in the system. For instance, the kernel may recognize an I/O interrupt signal having higher priority than the executing process. In this case, the process may be stalled while the processor performs appropriate I/O operations. Similarly, other process interruptions may include memory page faults, segmentation faults, error handling routines, and so forth.
A process often employs one or more threads to minimize processor stalling and make more efficient use of the processor's bandwidth while the process is being executed. Each thread is an instance of a sequence of the process's program code. For instance, a user-level thread is an instance of a sequence of the process's program code which is not included in the kernel's code. The threads allow the process to keep the processor busy, even in the presence of conventional process interruptions, such as I/O interrupts, page faults, etc. To that end, the process can perform “blocking” operations when an executing thread is interrupted, thereby suspending that thread's execution and beginning execution of a different thread. In this manner, the process can continuously execute thread code for as long as the process is scheduled to execute on the processor.
As used herein, one or more workloads may be multiplexed on an execution vehicle that executes the workloads in an orderly manner. For example, when the workloads are user-level threads, one or more of the user-level threads may be multiplexed on an execution vehicle, such as a kernel process, through which they are executed. Specifically, the process scheduler may allot a time interval for a kernel process to execute, and the kernel process may divide its allotted time among its user-level threads. To that end, a thread scheduler may be used to control when the process's different user-level threads are executed by the processor. The thread scheduler allows the process to continue executing even when one or more threads block, thereby maximizing utilization of the process's scheduled processor during the process's time slice. Thus, in the absence of other processors, the thread scheduler maximizes the system's processor utilization.
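As a rough illustration of multiplexing user-level threads on a single execution vehicle, the following sketch models each thread as a Python generator and each `yield` as a point where the thread would block. This is an assumption made for brevity, not a description of any particular kernel; the names `worker` and `thread_scheduler` and the `time_slice` budget are hypothetical.

```python
from collections import deque

def worker(name, steps):
    """A user-level thread; each yield stands in for a blocking operation."""
    for i in range(steps):
        yield f"{name}:{i}"

def thread_scheduler(threads, time_slice):
    """Run ready threads in turn until the process's allotted time slice is used up."""
    ready = deque(threads)
    log = []
    budget = time_slice
    while ready and budget > 0:
        thread = ready.popleft()
        try:
            log.append(next(thread))  # run the thread until it blocks
            budget -= 1
            ready.append(thread)      # still runnable later; move to the back
        except StopIteration:
            pass                      # thread finished; drop it
    return log

log = thread_scheduler([worker("t1", 2), worker("t2", 3)], time_slice=4)
print(log)
```

When one thread blocks, the scheduler immediately resumes another, so the process keeps the processor busy for its whole time slice.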
Each thread is associated with a thread context that supplies the processor with meta-data information needed to execute the thread. For instance, a thread's associated program-counter value, stack-pointer value, hardware register values, etc. are stored in its thread context. Each thread context is typically stored in a corresponding data structure called a thread control block (TCB). Although threads are associated with their own thread-specific meta-data information, multiple threads may share access to a common process context that contains process-specific meta-data information shared among the threads. The process context may include global variables, memory address spaces, semaphores, etc., defining the process's computing environment. Typically, the process context is stored in a corresponding data structure known as a process control block (PCB).
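The relationship between a TCB and a PCB can be sketched as two small data structures. The field names below, and the dictionary used to stand in for the processor's state, are illustrative assumptions; an actual TCB or PCB layout depends on the operating system and hardware.

```python
from dataclasses import dataclass, field

@dataclass
class ProcessControlBlock:
    pid: int
    global_vars: dict = field(default_factory=dict)  # shared by all threads

@dataclass
class ThreadControlBlock:
    tid: int
    process: ProcessControlBlock  # reference to the shared process context
    program_counter: int = 0
    stack_pointer: int = 0
    registers: dict = field(default_factory=dict)

def switch_thread(cpu, outgoing, incoming):
    """Save the processor state into `outgoing`, then load `incoming`'s saved state."""
    outgoing.program_counter = cpu["pc"]
    outgoing.stack_pointer = cpu["sp"]
    outgoing.registers = dict(cpu["regs"])
    cpu["pc"] = incoming.program_counter
    cpu["sp"] = incoming.stack_pointer
    cpu["regs"] = dict(incoming.registers)

pcb = ProcessControlBlock(pid=1, global_vars={"requests": 0})
t1 = ThreadControlBlock(tid=1, process=pcb, program_counter=100, stack_pointer=0x8000)
t2 = ThreadControlBlock(tid=2, process=pcb, program_counter=200, stack_pointer=0x9000)

cpu = {"pc": 104, "sp": 0x7FF0, "regs": {"r0": 7}}  # t1 is currently running
switch_thread(cpu, outgoing=t1, incoming=t2)
t2.process.global_vars["requests"] += 1  # visible to t1 through the shared PCB
```

Only the per-thread fields are saved and restored on a switch; the process context is reached through a shared reference and is not copied.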
Thread contexts are typically smaller (i.e., “lightweight”) than their associated process contexts, so switching thread contexts generally consumes fewer resources, such as processor bandwidth and memory, than switching process contexts. Furthermore, a new thread is typically not allocated many of the resources, such as “heap” memory, run-time libraries, etc., that are allocated to a new process. In fact, only a TCB and a call stack are typically allocated when a new thread is created; other resources are previously allocated at the time the thread's process is created. Thus, threads may be dynamically created and destroyed using less overhead (i.e., resource consumption) than processes since they not only require less context information, but also require less resource allocation. Notably, threads need not be lighter than processes in every dimension. For example, it is possible to have threads which consume less memory than a process, even if they execute less efficiently and/or require more time to context switch.
While multi-threading techniques may be used to ensure efficient use of the processor in a UP system, threads also facilitate efficient processing in a multiprocessor (MP) system. As used herein, two or more processors in an MP system execute concurrently if they are scheduled to execute program code at substantially the same time. A process scheduler in an MP system traditionally determines which processors execute processes in the scheduler's run queue. For instance, the scheduler may assign a first processor to execute a first process, and at the same time assign a second processor to execute a second process. In this case, the first and second processors may concurrently execute user-level threads multiplexed over their respective first and second processes. Accordingly, thread schedulers associated with the first and second processes may be used to schedule user-level thread execution on the first and second processors, thereby maximizing utilization of both processors' bandwidths.
However, problems often arise in an MP system when concurrently executing processors attempt to access (i.e., read or write) the same data at substantially the same time. For example, a processor may attempt to modify data, while the same data is being accessed or processed by another processor. As a result, modifications made to the data may be known to one processor and not the other, thus causing inaccuracies or errors in one or both of the processors' execution. To prevent such conflicts, synchronization mechanisms are often employed to give processors exclusive access to shared data. Conventional synchronization techniques include, inter alia, global locks, monitors, domains and master/slave approaches.
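The kind of conflict described above can be shown with a deterministic sketch of a "lost update." Rather than relying on real hardware timing, the code below spells out one possible interleaving by hand; the variable names `cpu0` and `cpu1` are hypothetical stand-ins for two processors' private copies of the shared data.

```python
def lost_update_demo():
    shared = {"counter": 0}
    # Both processors read the shared value at substantially the same time.
    cpu0 = shared["counter"]
    cpu1 = shared["counter"]
    # Each increments its own private copy.
    cpu0 += 1
    cpu1 += 1
    # Both write back; the second write overwrites the first.
    shared["counter"] = cpu0
    shared["counter"] = cpu1
    return shared["counter"]

result = lost_update_demo()
print(result)
```

Two increments were performed, yet the counter advances by only one, because neither processor saw the other's modification.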
A unique global lock may be associated with each data structure allocated in an MP system. According to this synchronization technique, a sequence of program code, such as a thread or process, must “acquire” a global lock before it can access data in the lock's associated data structure. When the program code is finished accessing the data, it “releases” the global lock. For example, a thread may acquire a global lock by setting the lock to a first value (e.g., “1”), then later release the lock by setting it to a second value (e.g., “0”). Illustratively, the lock's value may be stored in a special “lock” data structure, although those skilled in the art will appreciate other implementations are possible as well. Broadly stated, a global lock is a type of mutual exclusion (“mutex”) lock since it can only be possessed by one thread or process at a time. Threads or processes may either “block” while waiting for mutex locks to be released or may loop to repeatedly check the status of a lock held by another thread or process. The latter process is called “spinning” and mutex locks that require spinning are known as spin locks.
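A minimal sketch of the acquire and release pattern follows, using Python's standard `threading.Lock` as a stand-in for a global lock associated with one data structure. The class `SharedCounter` and the two helper functions are hypothetical; one helper blocks while waiting for the lock and the other spins by repeatedly polling it.

```python
import threading

class SharedCounter:
    """A data structure paired with its own mutual exclusion lock."""
    def __init__(self):
        self.value = 0
        self.lock = threading.Lock()

def add_blocking(counter, n):
    for _ in range(n):
        with counter.lock:  # block until the lock is acquired, release on exit
            counter.value += 1

def add_spinning(counter, n):
    for _ in range(n):
        # Spin: repeatedly check the lock without blocking.
        while not counter.lock.acquire(blocking=False):
            pass
        try:
            counter.value += 1
        finally:
            counter.lock.release()

counter = SharedCounter()
workers = [
    threading.Thread(target=fn, args=(counter, 1000))
    for fn in (add_blocking, add_spinning, add_blocking, add_spinning)
]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(counter.value)
```

Because every update happens while the lock is held, no increments are lost regardless of how the threads are interleaved.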
While a global lock synchronizes access to an individual data structure, a monitor synchronizes execution of different segments of program code, such as threads. In general, a monitor is a program that manages the interactions of other segments of program code. Typically, a monitor is associated with a monitor lock that can be acquired by only one thread or process at a time. For example, suppose one or more data structures are shared among a set of threads, and each of the threads executes through a common monitor. Since the monitor's lock only allows one of the threads to acquire the monitor at any given time, the lock prevents multiple threads from accessing the shared data structures at the same time. When the threads are scheduled to execute on different processors in an MP system, the monitor can therefore control their concurrent access to the shared data structures.
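One way to picture a monitor is as a class whose methods all run under a single monitor lock. The sketch below uses Python's standard `threading.Condition`, which bundles a lock with wait and notify signaling; the class name `MailboxMonitor` and its methods are hypothetical and are not drawn from any particular system.

```python
import threading

class MailboxMonitor:
    """Shared data that threads may touch only through the monitor's methods."""
    def __init__(self):
        self._cond = threading.Condition()  # monitor lock plus signaling
        self._items = []

    def put(self, item):
        with self._cond:             # acquire the monitor lock
            self._items.append(item)
            self._cond.notify()      # signal one waiting thread

    def take(self):
        with self._cond:
            while not self._items:   # block inside the monitor until data arrives
                self._cond.wait()
            return self._items.pop(0)

box = MailboxMonitor()
received = []
consumer = threading.Thread(
    target=lambda: received.extend(box.take() for _ in range(3))
)
consumer.start()
for i in range(3):
    box.put(i)
consumer.join()
print(received)
```

Callers never handle the lock directly; entering any monitor method acquires it, so only one thread at a time can access the shared list.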
A domain typically protects a larger amount of program code than a monitor. More specifically, a programmer may organize a program's code into a plurality of distinct code segments, or domains, that can execute concurrently in an MP system. However, processes and threads within the same domain usually may not execute concurrently. Further, each instance of a function or data type allocated in a domain is accessible to only one specific thread or process executing in that domain. Each domain is typically associated with its own domain-specific context information and resources. Before a domain can load its context information and execute its processes and threads, the domain first must acquire its associated domain lock. The domain lock prevents multiple instances of the domain's threads and processes from executing concurrently on different processors in the MP system. Synchronization using domains is usually determined at compile-time and is described in more detail in commonly assigned application Ser. No. 09/828,271 entitled “Symmetric Multiprocessor Synchronization Using Migrating Scheduling Domains,” to Rajan et al., now issued as U.S. Pat. No. 7,694,302 on Apr. 6, 2010, which is hereby incorporated by reference as though fully set forth herein.
A master/slave implementation assigns each process that can execute in an MP system to a “master” or “slave” processor. Specifically, program code that cannot execute concurrently in an MP system is assigned to a master processor. Other program code may be executed concurrently in the MP system by one or more slave processors. In this manner, a software programmer controls which threads and processes execute concurrently in the MP system by choosing whether each process or thread executes on a master or slave processor. Thus, unlike the other synchronization techniques previously described, the master/slave approach requires the programmer to predetermine which processors will execute different processes in a software application.
While conventional synchronization techniques enable multiple processes and threads to execute concurrently in an MP system, they are not easily implemented in program code originally designed to execute in a UP system. For instance, converting a UP-coded thread to employ global locks in an MP system requires adding extra code to the thread for acquiring and releasing the locks. Firstly, this additional code may be time consuming to incorporate. Secondly, the extra code adds overhead to the thread even if the thread never executes concurrently with other threads, and therefore may unnecessarily slow the thread's execution. Thirdly, incorporating this additional code into the UP-coded thread may result in the thread consuming an unacceptable amount of memory resources in the MP system. In addition, MP-system synchronization using global locks may lead to “deadlocks.” For example, suppose a first thread in the MP system seeks to acquire global lock A then B, and a second thread seeks to acquire global lock B then A. In this case, the first and second threads block one another. If one of the locks A or B is critical to the system, the entire system can freeze. Programming around such issues can considerably complicate MP development and reduce the efficiency of the MP code.
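The deadlock scenario above, in which one thread acquires lock A then B while another acquires B then A, can be reproduced safely with a small sketch. To avoid actually hanging, the second acquisition uses a timeout, which is an assumption added for demonstration; the barriers and the names `lock_a`, `lock_b`, and `worker` are likewise hypothetical.

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()
both_hold_first = threading.Barrier(2)  # ensures each thread holds its first lock
both_attempted = threading.Barrier(2)   # keeps first locks held until both have tried
outcome = {}

def worker(name, first, second):
    with first:
        both_hold_first.wait()
        # Try the second lock; a real deadlock would wait here forever.
        got = second.acquire(timeout=0.2)
        outcome[name] = got
        if got:
            second.release()
        both_attempted.wait()

t1 = threading.Thread(target=worker, args=("first", lock_a, lock_b))
t2 = threading.Thread(target=worker, args=("second", lock_b, lock_a))
t1.start()
t2.start()
t1.join()
t2.join()
print(outcome)
```

Neither thread obtains its second lock, because each is waiting on a lock the other holds. A common remedy, not shown here, is to require all threads to acquire locks in one agreed order.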
Similarly, converting a UP-coded thread to employ monitors in an MP system also requires additional code to be added to the thread. For example, the thread must be configured to acquire and release monitor locks. Further, the thread must be modified to incorporate signaling and blocking mechanisms typically associated with monitors. In sum, the UP-coded thread would have to be re-written within a monitor paradigm. This process of re-writing and redesigning the thread's code is costly in terms of both time and development resources.
In general, some system architectures align well with domain synchronization techniques, while others do not. Thus, converting a UP-coded thread to implement domains in an MP system may require re-architecting the thread's code to ensure its compatibility with the MP system's architecture. For instance, a programmer typically must divide the original UP-coded program code into individual domains. Although statically allocating blocks of domain code may simplify debugging operations, splitting large amounts of program code into domains is typically arduous. For instance, the UP-coded processes and threads within a domain must be modified in accordance with the domain's synchronization mechanisms, e.g., to properly utilize the domain's associated domain lock, domain context, domain identifiers, etc. The process of re-writing and redesigning the UP threads' code to employ domain synchronization mechanisms typically consumes excessive software-development time and resources. Furthermore, domains may be of limited use in the MP system since each instance of a function or data type allocated in a domain is typically only manipulated by a single thread or process.
Lastly, converting a UP-coded thread to execute in a master/slave configuration requires a programmer to assign the thread to execute on a master or slave processor in the MP system. According to the master/slave approach to concurrency, a UP-coded thread assigned to a slave processor must be rewritten to make it MP safe with respect to all of the other threads in the MP system. Thus, the master/slave approach suffers the disadvantage of having to incorporate conventional synchronization mechanisms into UP thread code assigned to the slave processors.
It is therefore desirable to convert a UP-coded thread for use in an MP system without having to modify the original thread code. The thread should not have to be recoded or augmented with additional code in order to execute in the MP system. Also, the thread should be able to cooperate with other threads and/or processes to access instances of functions and data types allocated in the MP system. Further, the UP thread code should not be “bound” to any particular hardware implementation, and should easily scale when processors are added or removed in the MP system. It is also desirable that the same thread code can function in both the UP and MP systems.