Citation
Synchronization algorithms for real-time systems

Material Information

Title:
Synchronization algorithms for real-time systems
Creator:
Harathi, Krishna, 1957-
Publication Date:
1995
Language:
English
Physical Description:
xii, 165 leaves : ill. ; 29 cm.

Subjects

Subjects / Keywords:
Algorithms ( jstor )
Data models ( jstor )
Deadlines ( jstor )
Multiprocessors ( jstor )
Multiprogramming ( jstor )
Operating systems ( jstor )
Recordings ( jstor )
Scheduling ( jstor )
Semaphores ( jstor )
Simulations ( jstor )
Computer and Information Sciences thesis, Ph. D
Dissertations, Academic -- Computer and Information Sciences -- UF
Parallel processing (Electronic computers) ( lcsh )
Real-time data processing ( lcsh )
Synchronization ( lcsh )
Genre:
bibliography ( marcgt )
non-fiction ( marcgt )

Notes

Thesis:
Thesis (Ph. D.)--University of Florida, 1995.
Bibliography:
Includes bibliographical references (leaves 158-164).
General Note:
Typescript.
General Note:
Vita.
Statement of Responsibility:
by Krishna Harathi.

Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
The University of Florida George A. Smathers Libraries respect the intellectual property rights of others and do not claim any copyright interest in this item. This item may be protected by copyright but is made available here under a claim of fair use (17 U.S.C. §107) for non-profit research and educational purposes. Users of this work have responsibility for determining copyright status prior to reusing, publishing or reproducing this item for purposes other than what is allowed by fair use or other copyright exemptions. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder. The Smathers Libraries would like to learn more about this item and invite individuals or organizations to contact the RDS coordinator (ufdissertations@uflib.ufl.edu) with any additional information they can provide.
Resource Identifier:
021602049 ( ALEPH )
33392850 ( OCLC )

Full Text










SYNCHRONIZATION ALGORITHMS FOR REAL-TIME SYSTEMS


By

KRISHNA HARATHI













A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA


1995

























To my father, Sri. Raghavan Harathi; my mother, Smt. Vasantha Lakshmi Harathi; my wife, Padmashree; and my son, Ankith Nitesh.













ACKNOWLEDGEMENTS


I take this opportunity to express my sincere thanks to Dr. Theodore Johnson, my mentor and chairman of my supervisory committee, for his guidance, encouragement, and patience. He was always there when I went to him for help. I am sure I would not have attempted this work if not for him.

I am grateful to Dr. Richard Newman-Wolfe, who was also the chairman of my master's supervisory committee, for the very many fruitful discussions I had with him about everything, be it academic or otherwise. I am indebted to Dr. Randy Chow for his friendly and thoughtful advice. Dr. Yann-Hang Lee commands my respect for his to-the-point critique. I am fortunate to have associated with Dr. Paul Avery, who provided an external point of view.

Special thanks go to my beloved wife, Padmashree, for her support, encouragement, and the patience that made this work possible. Thanks also go to my son Ankith, who turns one as I finish. A bundle of joy at home is a heaven to go to after a long and hard night's work!
















TABLE OF CONTENTS




ACKNOWLEDGEMENTS ..... iii

LIST OF TABLES ..... vii

LIST OF FIGURES ..... viii

ABSTRACT ..... xi

CHAPTERS

1 INTRODUCTION ..... 1
1.1 Goals ..... 1
1.2 Pessimistic Synchronization ..... 3
1.3 Optimistic Synchronization ..... 3
1.4 Dissertation Structure ..... 4
1.5 Conclusion ..... 5

2 RELATED RESEARCH ..... 6
2.1 Introduction ..... 6
2.2 Real Time Scheduling ..... 6
2.2.1 Introduction ..... 6
2.2.2 Rate Monotonic Scheduling ..... 8
2.2.3 Static Scheduling Algorithms ..... 11
2.2.4 Dynamic Scheduling Algorithms ..... 13
2.3 Priority Inheritance Protocols ..... 14
2.3.1 Priority Inheritance Protocol ..... 14
2.3.2 Priority Ceiling Protocol ..... 15
2.3.3 Priority Ceiling Protocols with Abortable Critical Sections ..... 16
2.3.4 Priority Inheritance Protocols for Multiprocessors ..... 16
2.4 Synchronization in Multiprocessors ..... 18
2.4.1 Pessimistic Synchronization ..... 18
2.4.2 Optimistic Synchronization ..... 22
2.4.3 Synchronization through Message passing ..... 23
2.5 Conclusion ..... 23

3 A PRIORITIZED MULTIPROCESSOR SPIN LOCK ..... 25
3.1 Introduction ..... 25
3.1.1 Previous Work ..... 27
3.2 PR-Lock Algorithm ..... 29
3.2.1 Assumptions ..... 30
3.2.2 Implementation ..... 31
3.3 Correctness of PR-Lock Algorithm ..... 40
3.4 Extensions ..... 47
3.4.1 Multiple Locks ..... 47
3.4.2 Backing Out ..... 48
3.5 Simulation Results ..... 49
3.6 Conclusion ..... 51

4 INTERRUPTIBLE CRITICAL SECTIONS ..... 55
4.1 Introduction ..... 55
4.2 Interruptible Critical Sections ..... 58
4.3 Implementing Interruptible Critical Sections ..... 60
4.3.1 Background ..... 60
4.3.2 Implementation ..... 62
4.3.3 Reducing the Clean-up ..... 65
4.4 System Support ..... 69
4.5 Interruptible Locks ..... 72
4.6 Implementation ..... 73
4.6.1 ICS_ctxsw Routine ..... 75
4.6.2 User-level Entry and Exit ..... 76
4.7 Experimental Performance Results ..... 77
4.8 Analysis ..... 80
4.9 Conclusion ..... 85

5 EXTENDING INTERRUPTIBLE CRITICAL SECTIONS TO MULTIPROCESSORS ..... 90
5.1 Introduction ..... 90
5.2 ICSM with Lock Release (ICSM-R) ..... 93
5.2.1 ICSM-R_ctxsw Routine ..... 95
5.2.2 ICSM-R Client Tasks ..... 97
5.2.3 ICSM-R Performance Analysis ..... 97
5.3 ICSM with Task Kill (ICSM-K) ..... 121
5.3.1 Experimental Performance Results ..... 123
5.4 ICSM with Priority Queue (ICSM-P) ..... 129
5.4.1 Implementation ..... 130
5.4.2 Correctness of the ICSM-P Algorithm ..... 133
5.4.3 Experimental Performance Results ..... 145
5.4.4 ICSM-P Algorithm with Single Word CAS ..... 147
5.5 Conclusions ..... 150

6 CONCLUSIONS ..... 154
6.1 PR-Lock Algorithm ..... 155
6.2 Interruptible Critical Sections on Uniprocessors ..... 156
6.3 Interruptible Critical Sections on Multiprocessors ..... 156
6.4 Final Words ..... 157

REFERENCES ..... 158

BIOGRAPHICAL SKETCH ..... 165


LIST OF TABLES


4.1 Example task description 1 for ICS ..... 82
4.2 Example task description 2 for ICS ..... 83
4.3 Example task description 3 for ICS ..... 85

5.1 Validating cycle time analysis using simulation for ICSM-R ..... 109
5.2 Validating lock utilization analysis using simulation for ICSM-R ..... 110
5.3 Performance comparison of ICSM-P algorithm for lock utilization of 25%, 50%, and 75% ..... 148
5.4 Performance comparison of ICSM-P algorithm for lock utilization of 100% ..... 149


LIST OF FIGURES



3.1 CAS used in the PR-Lock Algorithm ..... 32
3.2 Data Structures used in the PR-Lock Algorithm ..... 35
3.3 Queue data structure used in PR-Lock algorithm ..... 36
3.4 Stages in the acquire_lock operation ..... 36
3.5 The acquire_lock operation procedure ..... 38
3.6 The release_lock operation procedure ..... 40
3.7 Observed queue L before and after a release_lock ..... 43
3.8 Observed queue L before and after an acquire_lock to an empty queue ..... 44
3.9 Observed queue L before and after an acquire_lock to a non-empty queue ..... 44
3.10 A concurrent acquire_lock A' succeeds before A ..... 46
3.11 A concurrent release_lock R succeeds before A ..... 47
3.12 Release_lock R and acquire_lock A' succeed before A ..... 48
3.13 Comparison of lock acquisition times ..... 53
3.14 Comparison of lock release times ..... 54

4.1 Herlihy's non-blocking data structures ..... 66
4.2 Shadow-page ICS ..... 67
4.3 Response time distribution of the non-prioritized semaphore ..... 87
4.4 Response time distribution of the prioritized semaphore ..... 87
4.5 Response time distribution of the Interruptible Critical Section ..... 88
4.6 Response time distribution of the Interruptible Lock ..... 88
4.7 Response time distribution of the Interruptible Critical Section for high lock utilization
4.8 Response time distribution of the Interruptible Lock for high lock utilization

5.1 The context switch routine for ICSM-R ..... 96
5.2 The algorithm for a task using ICSM-R ..... 98
5.3 ICSM-R Performance Results (Tw = 20 milliseconds) ..... 100
5.4 Model of a cycle for a task using LOCK-NR ..... 102
5.5 Model of a cycle for a task using ICSM-R ..... 104
5.6 ICSM-R Cycle Times using analysis for Tc = 10 ..... 111
5.7 ICSM-R Cycle Times using analysis for Tc = 25 ..... 111
5.8 ICSM-R Cycle Times using analysis for Tc = 50 ..... 112
5.9 ICSM-R Cycle Times using analysis for Tc = 75 ..... 112
5.10 ICSM-R Cycle Times using analysis for Tc = 90 ..... 113
5.11 ICSM-R critical section utilization using analysis for Tc = 10 ..... 113
5.12 ICSM-R critical section utilization using analysis for Tc = 25 ..... 114
5.13 ICSM-R critical section utilization using analysis for Tc = 50 ..... 114
5.14 ICSM-R critical section utilization using analysis for Tc = 75 ..... 115
5.15 ICSM-R critical section utilization using analysis for Tc = 100 ..... 115
5.16 ICSM-R effect of multiprogramming on cycle time for Tc = 25 (M = 1) ..... 116
5.17 ICSM-R effect of multiprogramming on cycle time for Tc = 25 (M = 4) ..... 116
5.18 ICSM-R effect of multiprogramming on cycle time for Tc = 25 (M = 8) ..... 117
5.19 ICSM-R effect of multiprogramming on cycle time for Tc = 25 (M = 12) ..... 117
5.20 ICSM-R effect of multiprogramming on cycle time for Tc = 25 (M = 16) ..... 118
5.21 ICSM-R effect of multiprogramming on cycle time for Tc = 50 (M = 1) ..... 118
5.22 ICSM-R effect of multiprogramming on cycle time for Tc = 50 (M = 4) ..... 119
5.23 ICSM-R effect of multiprogramming on cycle time for Tc = 50 (M = 8) ..... 119
5.24 ICSM-R effect of multiprogramming on cycle time for Tc = 50 (M = 12) ..... 120
5.25 ICSM-R effect of multiprogramming on cycle time for Tc = 50 (M = 16) ..... 120
5.26 The global data structures and the ICSM-K algorithm for the high priority task ..... 124
5.27 The ICSM-K algorithm for the low priority task ..... 125
5.28 Response time distribution of LOCK-NK (20% Lock Utilization) ..... 127
5.29 Response time distribution of ICSM-K (20% Lock Utilization) ..... 127
5.30 Response time distribution of LOCK-NK (100% Lock Utilization) ..... 128
5.31 Response time distribution of ICSM-K (100% Lock Utilization) ..... 128
5.32 CAS2 used in the ICSM-P Algorithm ..... 131
5.33 Data Structures used in the ICSM-P Algorithm ..... 132
5.34 The acquire_lock operation procedure for ICSM-P ..... 134
5.35 A successful acquire_lock operation for ICSM-P ..... 135
5.36 The commit_release_lock operation procedure for ICSM-P ..... 135
5.37 Observed queue L before and after a release_lock ..... 139
5.38 Observed queue L before and after a kill_lock ..... 141
5.39 Observed queue L before and after an acquire_lock to an empty queue ..... 142
5.40 Observed queue L before and after an acquire_lock to a non-empty queue ..... 142
5.41 A concurrent acquire_lock A' succeeds before A ..... 144
5.42 A concurrent release_lock R succeeds before A ..... 145
5.43 Release_lock R and acquire_lock A' succeed before A ..... 146
5.44 The acquire_lock ICSM-P with single word CAS ..... 151
5.45 The commit_release_lock ICSM-P with single word CAS ..... 152













Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy

SYNCHRONIZATION ALGORITHMS FOR REAL-TIME SYSTEMS

By

Krishna Harathi

May 1995



Chairman: Dr. Theodore Johnson
Major Department: Computer and Information Sciences


Real-time systems are becoming common, and multiprocessors are being used more often. There is an increasing need for priority synchronization primitives for real-time systems.

In this regard, we have designed a prioritized spin-lock mutual exclusion algorithm, the PR-Lock. The PR-Lock algorithm is characterized by prioritized lock acquisition, a low release overhead, very little bus contention, and well-defined semantics. While other prioritized spin-locks have been proposed, the PR-Lock has superior characteristics.

Current methods for synchronization (including the PR-Lock) in real-time systems are pessimistic, and use blocking to enforce concurrency control. While protocols to bound the blocking of high priority tasks exist, high priority tasks can still be blocked by low priority tasks. In addition, these protocols require a complex interaction with the scheduler.

We present a new approach to synchronization with special applicability to embedded and real-time systems. We propose Interruptible Critical Sections, an optimistic synchronization mechanism, as an alternative to purely blocking methods. Practical optimistic synchronization requires techniques for writing interruptible critical sections, and system support for detecting critical section conflicts.

We show how Interruptible Critical Sections can be used to design algorithms for synchronization in a real-time system. These algorithms vary depending on the environment considered and the techniques used. Our experimental performance results show that these algorithms reduce the variance in the response time of the highest priority task with only a small impact on the performance of the low priority tasks. We also present an analysis which shows that Interruptible Critical Sections improve the schedulability of task sets that have high priority tasks with tight deadlines.

We extend the usage of Interruptible Critical Sections to multiprocessor systems under real-time and non-real-time environments. Our performance evaluation shows that the algorithms perform well, making Interruptible Critical Sections a feasible mechanism for synchronization.
















CHAPTER 1
INTRODUCTION

1.1 Goals



The broad objective of this research was to devise algorithms and data structures for synchronization in real-time systems. Most of the work done in synchronization is not based on priorities, and thus is not suitable for real-time systems. In this regard, we have designed a prioritized spin-lock algorithm, the PR-Lock. We also present a novel optimistic synchronization mechanism named Interruptible Critical Sections, and we show how this mechanism can be used to design algorithms for synchronization in a real-time system.

Real-time systems are becoming common, and multiprocessors are being used more often, so there is an increasing need for priority synchronization primitives for real-time systems [91]. Even on a uniprocessor, mutual exclusion is necessary to protect shared data in an interleaved thread schedule.

This research bridges two areas: real-time systems, and synchronization on uniprocessors as well as multiprocessors. Synchronization in multiprocessors includes concurrency control techniques which guarantee that a process executes an entire code section without being interrupted by another process.
In real-time systems, each process has strict timing constraints and is associated with a priority indicating the urgency of that process [88]. This priority is used by the operating system to order the rendering of services among competing processes.











Normally, the higher the priority of a process, the faster its request for services gets honored.

Synchronization controls access to a shared resource, usually some data structure shared between processes. Synchronization on a shared-memory multiprocessor is an important operation, since an application process's speedup or throughput depends on the operation's efficiency.

When synchronization primitives disregard priorities, a lower priority process may be granted access to a critical section ahead of competing higher priority processes. Thus, lower priority processes may block the execution of a process with a higher priority and a stricter timing constraint [80]. This priority inversion may cause the higher priority process to miss its deadline, leading to the failure of a real-time system. Priority Inheritance and Priority Ceiling protocols belong to a class of protocols that reduce the effect of priority inversion. These protocols bound the amount of time a higher priority process is blocked by a set of lower priority processes, making the execution time of a process more predictable. However, the protocols require prioritized semaphores.

Priority synchronization algorithms ensure that higher priority processes gain access to the critical section before any of the competing lower priority processes. For processes using priority synchronization algorithms, more accurate and predictable execution time estimates can be made. This, in turn, improves the schedulability of a set of processes in a real-time system.

Several software synchronization algorithms exist, usually based on hardware-provided mechanisms. Synchronization algorithms can be classified broadly by the approach they take: pessimistic or optimistic.











1.2 Pessimistic Synchronization



In pessimistic synchronization algorithms, the use of a shared resource is guarded by a lock or regulated by a queue. This approach views processes as existing for the sole purpose of interfering with each other: by means of a lock, the algorithm ensures that no other process is using the resource before a requesting process is allowed to use it.
Most of the current synchronization mechanisms are based on spin-locks [23, 30, 47, 52, 67, 81, 82], or queue-based locks [4, 28, 30, 67]. Current multiprocessor hardware design includes read-modify-write atomic instructions that assist this type of synchronization.

As mentioned earlier, using these types of locks in real-time systems presents the problem of priority inversion, because priority is not taken into consideration. There is a need to modify or redesign these algorithms for use in real-time systems, and not much work has been done in this direction. We discuss what has been done so far in Chapter 2. Having designed synchronization protocols for real-time systems, there remain the issues of proving their correctness, analyzing their performance, weighing their costs, and implementing the algorithms.

1.3 Optimistic Synchronization



A process using an optimistic synchronization algorithm uses the shared resource under the assumption that no other process is interfering with it. In the event that such a conflict does occur, it is detected, and the affected process starts a recovery mechanism. Recovery consists of either re-executing the critical section from the beginning or posting the computation so that another process can complete it.
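As a minimal sketch of this pattern (not from the dissertation), the fragment below reads a shared counter, computes a new value, and attempts to install it with a Compare&Swap, re-executing the work when a conflict is detected; the GCC __atomic builtins stand in for the hardware primitives.

    #include <stdbool.h>

    void optimistic_increment(int *counter)
    {
        for (;;) {
            int seen = __atomic_load_n(counter, __ATOMIC_ACQUIRE); /* optimistic read */
            int result = seen + 1;               /* the critical-section work */
            if (__atomic_compare_exchange_n(counter, &seen, result, false,
                                            __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE))
                break;                           /* no conflict: update committed */
            /* conflict detected: another process intervened; recover by retrying */
        }
    }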











Optimistic synchronization algorithms are suitable for use in real-time systems as they do not cause priority inversion, and there is no need to design any priority inheritance protocols.

Only a small amount of research has been done on optimistic algorithms, and most of it addresses uniprocessors [5, 11, 73]. Processor architectures must provide suitable instructions to support these algorithms [36, 42, 90], and much research remains to formulate and evaluate the support needed [2, 10]. These mechanisms also need to be extended to multiprocessors; here again, there is the issue of what hardware support is needed to implement optimistic algorithms efficiently. We present a discussion of optimistic synchronization algorithms and related issues in Chapter 4.

1.4 Dissertation Structure



The rest of the dissertation is organized as follows. In Chapter 2, we present a survey of the research in related areas, namely real-time scheduling, priority inheritance protocols, and synchronization mechanisms in multiprocessors. More importantly, we show the effect on scheduling of not using priority synchronization algorithms. In Chapter 3, we describe our results on designing priority lock mechanisms based on pessimistic synchronization methods. Chapter 4 describes Interruptible Critical Sections, which rely on optimistic synchronization; we also present an implementation of Interruptible Critical Sections for uniprocessor systems. Interruptible Critical Sections are extended to multiprocessors in Chapter 5. We summarize our research in Chapter 6.










1.5 Conclusion



Supporting real-time systems on uniprocessors and multiprocessors is of growing importance. We argue the necessity of designing priority synchronization algorithms for this purpose, and we demonstrate the viability of such an effort. The design effort is backed by sound theoretical and analytical reasoning, and by practical implementation results.
















CHAPTER 2
RELATED RESEARCH

2.1 Introduction



This chapter introduces previous research areas that are affected by prioritizing synchronization. Priority synchronization algorithms improve the schedulability of a set of processes in a real-time system. If priority is disregarded for synchronization, a lower priority process may block a higher priority process. In this context, we will study what schedulability is and how it is affected by blocking. We also present Priority Inheritance and Priority Ceiling protocols that reduce the effect of blocking. We classify synchronization mechanisms based on their characteristics. Finally, we present various relevant synchronization mechanisms that are currently in use.

2.2 Real Time Scheduling

2.2.1 Introduction



A real-time system is one that executes processes with time constraints [18], where a process (or a task) is defined as an individually schedulable entity. A hard real-time system is a real-time system in which every process's deadline is critical, i.e., the system fails if any one of its processes does not meet its deadline. In a soft real-time system, processes tolerate delays in completion past their deadlines. Usually, there is a penalty associated with the delay in completion: the longer the delay, the greater the penalty. Therefore, the goal of a hard real-time system is to avoid any failure state, and that of a soft real-time system is to minimize the total penalty. In this discussion, we are not concerned with the type of real-time system, since our proposed algorithms do not distinguish between the two.

The fundamental problem in a real-time system is to schedule processes so as to maximize the number of processes that meet their deadlines. The complexity of the problem depends on the complexity of the process model itself. A simplistic model is one in which we only know about the process priority, which is the importance of the process relative to all other processes. At the other end of the spectrum is a model of a process with resource requirements, concurrency constraints and input/output requirements, in addition to computation requirements and timing constraints. A process can be periodic or nonperiodic. A periodic process is one which is invoked at regular intervals of time. A nonperiodic process has an arbitrary arrival time and deadline and when invoked, is expected to execute just once. Processes are also distinguished as preemptable or non-preemptable. A process is preemptable if its execution can be interrupted by other processes at any time and resumed afterwards. A process is non-preemptable if it must run to completion once it starts. Process scheduling can be classified according to the underlying process model.

Another way to classify the scheduling problem, as well as the system, is by when information about a process becomes available. If knowledge about a process is available a priori, then it is a static real-time system, since all the scheduling decisions can be made once, during the initialization phase of the system. As can be imagined, such a system is very inflexible: if a new process has to be accommodated at a future date, the whole system has to be shut down first. The advantage is that there is no overhead during runtime. A dynamic approach determines schedules for processes on the fly and allows processes to be dynamically invoked. Dynamic approaches involve higher run-time costs, but they are flexible and can easily adapt to changes in the environment.

The goal in designing real-time systems is predictability rather than speed. The primary goal of scheduling is to meet the individual process deadlines, not to minimize, say, the response time. Although scheduling is used in many other areas, such as job-shop scheduling, real-time process scheduling is different because of the deadline attached to each process.

Although scheduling is the main issue in real-time systems, there are other concerns as well. Some of them are provisions to include user-written device drivers, a guaranteed minimum interrupt response time, and the ability to tailor the system to specific requirements. In addition, the system should not restrict the usage of the hardware in any undesirable way.

2.2.2 Rate Monotonic Scheduling



The analysis of schedulability is performed using a theory of real-time systems based on rate monotonic scheduling. Rate monotonic scheduling theory provides analytical mechanisms for understanding and predicting the execution timing behavior of real-time systems. The basic theory, introduced in a seminal paper by Liu and Layland [59], gives a rule for assigning priorities to periodic processes and a formula for determining whether a set of periodic processes will all meet their deadlines.

We assume the following notation. We consider N periodic processes T1, T2, T3, ..., TN on a uniprocessor. Let Ei, Di, and Ci represent the execution time, the deadline, and the cycle time (periodicity) of the process Ti. Assume that the numbering of the processes is such that the following relationship holds: C1 <= C2 <= ... <= CN.











The CPU Utilization of a process Ti is the ratio of a process's execution time to its period. The CPU utilization of a set of processes is the sum of the utilizations of the individual processes.


CPU Utilization of a Set of Processes: U(N) = E1/C1 + E2/C2 + ... + EN/CN.




A set of assumptions has been made for the rate monotonic scheduling theory.

* Process switching is instantaneous.

* Processes account for all execution times.

* Process interactions are not allowed.

* Processes become ready to execute precisely at the beginning of their periods.

* Process deadlines are always the start of the next period.

The rate monotonic algorithm assigns priorities to processes based on the processes' cycle times. Processes with shorter periods are assigned higher priorities, and a higher priority process can preempt a lower priority process. The rate monotonic theorem proves that a set of N independent processes scheduled by the rate monotonic algorithm will always meet their deadlines, for all process phasings, if


E1/C1 + E2/C2 + ... + EN/CN = U(N) <= N(2^(1/N) - 1)


Basically, if the utilization of the process set is less than a theoretically determined bound, then the set of processes is guaranteed to meet all of its deadlines.
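The bound is easy to apply mechanically. The sketch below (illustrative, not from the dissertation) computes U(N) from execution times E[] and cycle times C[] and tests it against N(2^(1/N) - 1):

    #include <math.h>

    int rm_schedulable(const double E[], const double C[], int N)
    {
        double U = 0.0;
        for (int i = 0; i < N; i++)
            U += E[i] / C[i];                      /* CPU utilization U(N) */
        return U <= N * (pow(2.0, 1.0 / N) - 1.0); /* rate monotonic bound */
    }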

Given a set of N independent periodic processes scheduled by the rate monotonic algorithm, a particular process Tk, k <= N, will always meet its deadline if

E1/C1 + E2/C2 + ... + Ek/Ck = U(k) <= k(2^(1/k) - 1)


From this result, it can be seen that the only factors that determine the schedulability of process Tk are the utilization of higher priority processes and the utilization of the process Tk itself.
The discussion so far assumes that processes always execute in accordance with their rate monotonic priority. But in practice, because of priority inversion, a higher priority process may be blocked by a lower priority process that is executing a non-preemptable section of code. This blocking effect can be included in the previous result as follows.
Let Bk be the worst case total amount of blocking that a process Tk can incur during any period. It has been shown [80] that all processes will meet their deadlines if



E1/C1 + B1/C1 <= 1(2^(1/1) - 1)

E1/C1 + E2/C2 + B2/C2 <= 2(2^(1/2) - 1)

...

E1/C1 + E2/C2 + ... + Ek/Ck + Bk/Ck <= k(2^(1/k) - 1)

...

E1/C1 + E2/C2 + ... + EN/CN + BN/CN <= N(2^(1/N) - 1)



The inequalities explicitly show how blocking affects the schedulability of a set of processes and why it is desirable to minimize blocking.
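A schedulability test that accounts for blocking simply applies the k-th inequality for every process. The sketch below (illustrative names, processes sorted by cycle time) returns 0 as soon as some Tk can miss its deadline:

    #include <math.h>

    int rm_schedulable_with_blocking(const double E[], const double C[],
                                     const double B[], int N)
    {
        double U = 0.0;
        for (int k = 1; k <= N; k++) {
            U += E[k - 1] / C[k - 1];             /* utilization through Tk */
            if (U + B[k - 1] / C[k - 1] > k * (pow(2.0, 1.0 / k) - 1.0))
                return 0;                         /* Tk may miss its deadline */
        }
        return 1;                                 /* all deadlines are met */
    }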











Process synchronization is a common source of blocking. When more than one process requires mutually exclusive access to a resource, processes must synchronize. If a lower priority process has locked a resource and is then preempted by a higher priority process that executes until it needs to access the same resource, the higher priority process is forced to wait. The higher priority process is blocked.

By using priority synchronization algorithms, this blocking can be reduced or avoided, in proportion to the priority of the processes, which directly improves the schedulability of a set of processes. The blocking as a whole is not reduced; it is shifted from higher priority processes to lower priority processes.

In Section 2.3, we discuss the priority ceiling protocol (PCP) and the priority inheritance protocol (PIP), a class of protocols that reduce the effects of blocking and also prevent mutual deadlock.

2.2.3 Static Scheduling Algorithms

* Static, Preemptive Scheduling on a Uniprocessor for arbitrary tasks.

Horn [40] developed an O(n^2) algorithm for scheduling n tasks, based on the earliest deadline policy: tasks with earlier deadlines and earlier ready times are chosen to run before tasks with later deadlines and ready times. Tasks can have arbitrary ready times and deadlines.

* Static, Preemptive Scheduling on a Multiprocessor for arbitrary tasks.

Horn also described an O(n^3) algorithm to schedule n tasks with arbitrary ready times and deadlines on a multiprocessor. His approach is based on the network flow method and considers only processors with the same speeds. This approach was extended by Martel [64] to processors with different speeds. The complexity of Martel's algorithm is O(m^2 n^4 + n^5), where m is the number of processors.

* Static, Preemptive Scheduling on a Uniprocessor for periodic tasks.

The algorithm for scheduling arbitrary tasks can be applied to periodic tasks by considering the instances of the periodic tasks within the time interval between zero and the least common multiple of the tasks' periods. Horn's and Martel's approaches can also be applied to multiprocessor systems in the same way.

The rate monotonic algorithm described earlier assigns priorities to tasks based on the tasks' cycle times. Tasks with shorter periods are assigned higher priorities, and a higher priority task can preempt a lower priority task. The rate monotonic theorem [59] proves that a set of N independent tasks scheduled by the rate monotonic algorithm will always meet their deadlines, for all task phasings, if

U(N) <= N(2^(1/N) - 1).

Teixeira [93] presented a fixed-priority assignment scheme in which he assumed that the relative deadline of a periodic task can be different from the period of the task.

* Static, Preemptive Scheduling on a Multiprocessor for periodic tasks.

A partition approach is adopted to solve this problem. The main idea is to partition a set of periodic tasks among a minimum number of processors such that each partition of the periodic tasks can be scheduled on one processor according to the earliest deadline scheme or the rate monotonic priority scheme. If the earliest deadline scheme is used, a bin-packing algorithm can be used to determine a suboptimal partition pattern of periodic tasks among multiple processors [20].

* Static, Nonpreemptive Scheduling.

Nonpreemptive scheduling is more difficult than preemptive scheduling, and many nonpreemptive problems have been shown to be NP-Hard. Scheduling nonpreemptable tasks with arbitrary ready times is NP-Hard even in uniprocessor systems. But for some restricted problems, efficient algorithms are available. For example, the earliest deadline algorithm has been shown to be optimal for scheduling a set of tasks with the same ready times [72]. Kise developed an O(n^2) algorithm for the case in which a task has an earlier ready time if and only if it has an earlier deadline [49].

For multiprocessor systems, the nonpreemptive scheduling problem is NP-Hard even when the ready times and deadlines of tasks are the same [98]. A polynomial optimal algorithm is available for scheduling tasks with unit computation time [86].

2.2.4 Dynamic Scheduling Algorithms



Most algorithms that are optimal for static scheduling are not optimal for dynamic scheduling. For multiprocessors, there can be no optimal algorithm for scheduling preemptable processes if the arrival times of the processes are not known in advance [69]. Since run-time cost is an important factor for dynamic scheduling, most static algorithms are not suitable for dynamic scheduling. Hence, there is a need to develop heuristic algorithms for dynamic scheduling.











For uniprocessor systems, it was shown that the earliest deadline algorithm is optimal for scheduling preemptable processes with arbitrary arrival times [21]. Stankovic et al. [87] describe an algorithm that is based on the earliest deadline policy but takes into account the run-time cost. Baker and Su [7] compared four heuristic algorithms that schedule processes according to an order determined by ready time, by deadline, by the average of ready time and deadline, and by both ready time and deadline. They showed that the last two algorithms perform better than the first two.
In multiprocessor systems, Mok et al. [69] have shown that if a set of all possible processes that will ever arrive in a system can be scheduled ahead of time, then the set of processes can also be scheduled at run-time. The drawback of this approach is that the probability of all possible arriving processes being schedulable ahead of time is very low. They have also proved that one successful run-time algorithm is the least laxity algorithm. Locke et al. [60] found that least laxity first and earliest deadline first are two good heuristic policies.

2.3 Priority Inheritance Protocols



In the discussion on rate monotonic scheduling (section 2.2.2), we have seen how blocking affects the schedulability of processes. Here, we give a brief discussion of the Priority Inheritance and Priority Ceiling protocols that reduce blocking [80].

2.3.1 Priority Inheritance Protocol



Priority inheritance prevents a medium priority process from prolonging the actual amount of time that a resource is locked by a lower priority process. Without the inheritance, a medium priority process can preempt the lower priority critical section and prolong the period of blocking of a higher priority process. To avoid this situation, priority inheritance allows the lower priority process to inherit the blocked process's higher priority for the duration of the critical section. Thus, priority inheritance prevents the medium priority process from preempting the critical section, which is now executing at a high priority.
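A hypothetical sketch of this mechanism follows; the task and mutex structures and the function names are illustrative, not taken from the protocols in [80]:

    typedef struct task {
        int base_priority;    /* priority assigned by the scheduler */
        int active_priority;  /* may be raised temporarily by inheritance */
    } task_t;

    typedef struct mutex {
        task_t *holder;       /* NULL when the resource is free */
    } mutex_t;

    /* Called when 'blocked' finds the resource held by a lower priority task. */
    void inherit_priority(mutex_t *m, task_t *blocked)
    {
        if (m->holder != NULL &&
            m->holder->active_priority < blocked->active_priority)
            m->holder->active_priority = blocked->active_priority;
    }

    /* Called when the holder leaves its critical section. */
    void disinherit_priority(task_t *holder)
    {
        holder->active_priority = holder->base_priority;
    }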

However, this basic priority inheritance mechanism has a drawback: if a process shares m resources with lower priority processes, then it can be blocked at most m times per execution period due to process synchronization. This can be illustrated by an example. Suppose a high priority process requires data from several resources that are all currently locked. A low priority process locks one resource; it is then preempted by a slightly higher priority process that in turn locks another resource, and so on. Each blocking process inherits the blocked process's priority and, after its critical section is completed, relinquishes the resource. The high priority process will use that resource and then be forced to wait for the second resource that it needs, and so on. Thus, the high priority process can wait at most m times for the m resources that it needs.

2.3.2 Priority Ceiling Protocol



This protocol solves the above problem, in that a high priority process waiting for m shared resources waits at most once per period, for the duration of the entire critical section. In this mechanism, associated with each resource, in addition to the semaphore or monitor that protects it, is an attribute known as the priority ceiling. This is the highest priority at which a critical section associated with that resource can be executed, i.e., the highest priority of a process that can use the resource. Thus, if a high priority process wishes to use m resources, the ceiling of those resources is set to the priority of this process. So, a medium priority process can never preempt a lower priority process in its critical section. Thus, this rule allows only one process to have locks at any given time. Hence, the high priority process is blocked at most once per period of execution.
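The locking rule itself can be stated compactly. The following hypothetical sketch (names are illustrative) checks a requesting process against the highest ceiling among semaphores locked by other processes:

    /* system_ceiling: the maximum priority ceiling over all semaphores
       currently locked by processes other than the requester. */
    int pcp_may_lock(int requester_priority, int system_ceiling)
    {
        /* If this test fails, the requester blocks and the holder of the
           semaphore with the highest ceiling inherits its priority. */
        return requester_priority > system_ceiling;
    }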

2.3.3 Priority Ceiling Protocols with Abortable Critical Sections

Priority inversion of high priority tasks is reduced further by selectively aborting a low priority task executing within a critical section. Shu et al. [84] proposed an Abort Ceiling Protocol, an extension to the Priority Ceiling Protocol. In this algorithm, an abort ceiling priority is associated with a task. The abort ceiling comes into effect when the task is executing: another task may abort the currently running task and run immediately if its priority is higher than the current abort ceiling. The protocol relies on Interruptible Critical Sections (Chapter 4) to restart the critical section of the aborted task. Also, the protocol assumes static priorities. The Ceiling Abort Protocol [92] proposed by Takada and Sakamura is a similar extension to the Priority Ceiling Protocol. This protocol assigns an abort ceiling priority to the critical section instead, and the critical section is divided into abortable and non-abortable segments. The issue here is to minimize abortion and re-execution overheads.

2.3.4 Priority Inheritance Protocols for Multiprocessors



Priority inheritance protocols have been extended for multiprocessors by Rajkumar et al. in [79]. In the case of multiprocessors, the concept of blocking is generalized to include remote blocking. When a process executing on one processor has to wait for the execution of a process assigned to another processor, it is said to experience remote blocking.











By its very nature, remote blocking is very different from uniprocessor blocking: remote blocking is a function of the execution times of processes on other processors, even in the absence of data sharing. Thus, uniprocessor priority inheritance protocols must be enhanced for multiprocessors.
In the multiprocessor priority ceiling protocol, tasks are assumed to be bound to a processor. Static binding is found to perform better in static as well as dynamic priority scheduling algorithms. A critical section executed by processes on different processors is called a global critical section (GCS). A processor which executes global critical sections is called a synchronization processor and processors which run application processes only are called application processors. A synchronization processor can also be running application tasks.
The priority ceiling of a semaphore S indicates the maximum priority at which a critical section guarded by this semaphore can execute. The priority ceiling of a local semaphore S is defined to be the priority of the highest priority process that may lock the semaphore.
Let the priority of the highest priority process that accesses a global semaphore GSi be denoted by PSi. Then the priority ceiling of a global semaphore GSi is defined such that

* The priority ceiling of GSi is higher than PSi.

* If GSi and GSj are global semaphores and PSi > PSj, then the priority ceiling of GSi is greater than the priority ceiling of GSj.

The multiprocessor priority ceiling protocol that is used in each of the application processors and the synchronization processors is as follows.











* Each application processor runs the priority ceiling protocol on the set of processes that it runs and on the set of local semaphores bound to the application processor.

* Each global critical section on the synchronization processor normally executes at its assigned priority.

* The synchronization processor runs the priority ceiling protocol on the global critical sections, the set of application tasks, and the set of global and local semaphores bound to the synchronization processor.

The multiprocessor priority ceiling protocol prevents deadlocks, and bounds the blocking duration of each process as a function of the critical section duration of other tasks.

2.4 Synchronization in Multiprocessors



Synchronization primitives make programs easier to understand and write, but processors waste time when waiting for locks. Synchronization primitives are used in nearly every parallel program, and lessening synchronization delays is a major goal for efficient parallel program execution. We will study synchronization in the context of both pessimistic and optimistic approaches. We also briefly review synchronization mechanisms in message-passing multiprocessors.

2.4.1 Pessimistic Synchronization



Pessimistic synchronization mechanisms are overly restrictive in their approach. Even if there is no interference with other processes while sharing a resource, a process incurs the overhead of establishing the lock for itself before using the shared resource.











Mechanisms include hardware operations like Test&Set, Compare&Swap, etc., low level primitives like spin-locks, condition variables, etc., and high level primitives like semaphores, monitors, etc. We restrict our discussion to low level synchronization primitives only.

Hardware Synchronization Primitives



Hardware synchronization primitives evolved primarily on shared-memory multiprocessors. Atomic, sequentially consistent loads and stores can be used to build synchronization primitives. Hardware primitives include Test&Set, Fetch&Store, Fetch&Add, and Compare&Swap. These primitives are also called read-modify-write operations.
Test&Set and reset are a pair of basic synchronization primitives [23]. Test&Set is repeatedly executed to get exclusive access to a lock variable before entering a critical section, and the lock is reset to exit from the section. Because a processor repeatedly tests the lock until it is acquired, this may cause excessive network traffic. In contrast, a suspend-lock employs interprocessor interrupts: a processor waits for an interrupt if its first Test&Set fails.
In another scheme [47], a full/empty tag is associated with each word in shared memory. Less general than read-modify-write, this tag is tested before a producer-consumer write or read operation. Only a full word can be read, and only an empty word can be written. When the test succeeds, the read or write operation is performed and the value of the tag is reversed.
An extension to the Test&Set, the Test&Test&Set repeatedly tests a local copy of the lock whenever the first atomic Test&Set of the global lock fails [82]. The local copies are invalidated when the global lock is reset. Every waiting processor then does another Test&Test&Set operation, and only one process will get the lock. This scheme reduces the network traffic associated with the Test&Set. Anderson [4] found that exponential backoff after the first Test&Test&Set failure is effective in reducing contention among processes while acquiring a lock.

Fetch&Op primitives include Fetch&Store (swap) and Fetch&Add [52]. The latter primitive provides for adding an increment to a shared sum.
Compare&Swap [52] compares the contents of a memory location against a given value, and sets a condition code to indicate whether they are equal. If so, it replaces the contents of the memory location with a second given value. Herlihy [34] showed that the Compare&Swap operation is more powerful than the rest of the operations listed, and that Compare&Swap can be used to convert any sequential data object into a concurrent wait-free (Section 2.4.2) data object.
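The semantics can be written out as C pseudocode; in hardware, the whole body executes as a single atomic instruction:

    #include <stdbool.h>

    bool compare_and_swap(int *addr, int expected, int new_value)
    {
        /* begin atomic */
        if (*addr == expected) {
            *addr = new_value;
            return true;      /* condition code: equal, swap performed */
        }
        return false;         /* memory left unchanged */
        /* end atomic */
    }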



Spin Locks

Spin locks are busy-wait constructs in which processes repeatedly test shared variables to determine when they may proceed. Busy-waiting is preferred over scheduler-based blocking when scheduler overhead exceeds the expected waiting time, when processor resources are not needed for other processes, or when scheduler-based blocking is inappropriate or impossible, for example, in the kernel of an operating system.
The simplest mutual exclusion lock employs a polling loop to access a boolean flag that indicates whether the lock is free. Each processor repeatedly executes a Test&Set instruction in an attempt to change the flag from false to true, thereby acquiring the lock. A processor releases the lock by setting it to false. To reduce network traffic, Test&Test&Set and exponential backoff may be employed.
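A sketch of such a lock, with the Test&Test&Set refinement and exponential backoff folded in, is given below (the GCC __atomic builtins stand in for the hardware instructions):

    #include <unistd.h>

    void tas_acquire(volatile char *lock_flag)
    {
        unsigned delay = 1;
        for (;;) {
            while (*lock_flag)             /* test: spin on a cacheable read */
                ;
            /* Test&Set: returns the previous value of the flag. */
            if (!__atomic_test_and_set(lock_flag, __ATOMIC_ACQUIRE))
                return;                    /* flag went false -> true: acquired */
            usleep(delay);                 /* failed: back off exponentially [4] */
            if (delay < 1024)
                delay *= 2;
        }
    }

    void tas_release(volatile char *lock_flag)
    {
        __atomic_clear(lock_flag, __ATOMIC_RELEASE);
    }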











A ticket lock [81] reduces the number of atomic operations to one per lock acquisition. A ticket lock consists of two counters, one containing the number of requests to acquire the lock, and the other the number of times the lock has been released. A processor acquires the lock by performing a Fetch&Increment operation on the request counter and waiting until the result is equal to the value of the release counter. It releases the lock by incrementing the release counter. Processes acquire the lock in FIFO order of their requests.
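A sketch of the ticket lock follows, with Fetch&Increment expressed through the GCC __atomic builtins for illustration:

    typedef struct {
        unsigned next_ticket;   /* number of requests to acquire the lock */
        unsigned now_serving;   /* number of times the lock was released  */
    } ticket_lock_t;

    void ticket_acquire(ticket_lock_t *L)
    {
        /* Fetch&Increment on the request counter takes a ticket. */
        unsigned me = __atomic_fetch_add(&L->next_ticket, 1, __ATOMIC_RELAXED);
        while (__atomic_load_n(&L->now_serving, __ATOMIC_ACQUIRE) != me)
            ;                   /* spin until our ticket comes up: FIFO order */
    }

    void ticket_release(ticket_lock_t *L)
    {
        __atomic_store_n(&L->now_serving, L->now_serving + 1, __ATOMIC_RELEASE);
    }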
The ticket lock can still cause substantial memory and network contention through polling of a common location, and the number of network transactions needed to obtain the lock cannot be bounded, due to the unpredictability of the length of the critical sections. Anderson [4] and Graunke and Thakkar [30] have proposed locking algorithms that achieve a constant bound on the number of remote memory operations in cache-coherent multiprocessors. Each processor uses an atomic operation to obtain the address of a location on which to spin, and each processor spins on a different location in a different cache line. This array-based queuing lock guarantees FIFO ordering of requests and requires space per lock linear in the number of processors.
Mellor-Crummey and Scott [66] devised a list-based queuing lock, the MCS-Lock. It requires atomic Fetch&Store and Compare&Swap instructions. The lock variable maintains the tail of a FIFO queue, and the head of the queue is maintained by the process using the lock. The acquire operation is accomplished with a Fetch&Store operation on the lock variable, and the release with a Compare&Swap. This lock guarantees FIFO ordering of lock acquisitions, processes spin on locally accessible flag variables only, and it requires a constant amount of space per lock. Markatos [63] designed a priority spin-lock algorithm based on the MCS-Lock. Craig [19] refined Markatos's approach to achieve FIFO and priority locks with better space complexities in the case of nested lock acquisitions.
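A condensed sketch of the MCS-Lock is given below, with the GCC __atomic builtins standing in for Fetch&Store and Compare&Swap; note that each process spins only on the flag in its own record:

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct mcs_record {
        struct mcs_record *next;
        volatile bool waiting;
    } mcs_record_t;

    typedef struct { mcs_record_t *tail; } mcs_lock_t;

    void mcs_acquire(mcs_lock_t *L, mcs_record_t *me)
    {
        me->next = NULL;
        me->waiting = true;
        /* Fetch&Store: swap our record in as the new tail of the queue. */
        mcs_record_t *pred = __atomic_exchange_n(&L->tail, me, __ATOMIC_ACQ_REL);
        if (pred != NULL) {        /* queue was non-empty: link in and spin */
            pred->next = me;
            while (me->waiting)
                ;                  /* spin on a locally accessible flag */
        }
    }

    void mcs_release(mcs_lock_t *L, mcs_record_t *me)
    {
        if (me->next == NULL) {
            mcs_record_t *expect = me;
            /* Compare&Swap: if we are still the tail, the queue becomes empty. */
            if (__atomic_compare_exchange_n(&L->tail, &expect, NULL, false,
                                            __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE))
                return;
            while (me->next == NULL)
                ;                  /* a successor is still linking itself in */
        }
        me->next->waiting = false; /* hand the lock to the successor */
    }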











In Chapter 3, we discuss our own implementation of a priority spin-lock, the PR-Lock. The PR-Lock has better lock acquire and release characteristics, and differs from the locks cited above in some important details.

Condition Variables



Condition variables [85] allow conditional blocking inside a critical section. Condition variables can be used to implement monitors and are provided in the Mach [1] operating system. The operations on a condition variable are condition-wait and condition-signal, and a condition variable is associated with a mutex variable.

When a process performs the condition-wait operation on a condition variable, the associated mutex variable is unlocked and the calling process is blocked. When another process executes the condition-signal operation on the same condition variable indicating that the condition may have changed, the associated mutex variable is locked and the blocked process continues. The unblocked process must re-evaluate the condition before proceeding further.
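An illustrative producer-consumer use of this pattern, written with POSIX threads rather than the Mach primitives discussed above, looks as follows:

    #include <pthread.h>

    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
    static int data_ready = 0;

    void consumer(void)
    {
        pthread_mutex_lock(&m);
        while (!data_ready)                /* re-evaluate the condition */
            pthread_cond_wait(&cond, &m);  /* unlocks m, blocks, relocks on wake */
        /* ... consume the data inside the critical section ... */
        pthread_mutex_unlock(&m);
    }

    void producer(void)
    {
        pthread_mutex_lock(&m);
        data_ready = 1;
        pthread_cond_signal(&cond);        /* wake one blocked process */
        pthread_mutex_unlock(&m);
    }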

2.4.2 Optimistic Synchronization



In these types of synchronization mechanisms, processes use a shared resource with the optimistic assumption that no other process will interfere. Care has to be taken, however, when a conflict does occur. In Interruptible Critical Sections, the affected process recovers and restarts the critical section from the beginning [11]. This is the subject of discussion in Chapter 4.
Lock-free synchronization was introduced by Lamport [53]. Lock-free data structures can be further classified as non-blocking and wait-free [34]. Non-blocking algorithms guarantee that some process accessing the data structure will complete an operation in a finite number of steps. Wait-free algorithms ensure that all processes complete their operations within a fixed number of steps.

Herlihy [35] has shown that it is impossible to construct non-blocking implementations of arbitrary concurrent objects with any combination of read, write, and Fetch&Op (where Op can be Store, Increment, Add, etc.) when the number of processes being considered is greater than two. However, there are some universal atomic operations which are capable of implementing arbitrary non-blocking objects, Compare&Swap being one. Methods for automatically converting a sequential implementation of an abstract data type into a wait-free implementation were given by Herlihy, and into a non-blocking implementation by Prakash et al. [78] and Turek et al. [97]. The methodology proposed by Turek et al. also handles wait-free implementations, uses less memory, and accommodates greater concurrency.

2.4.3 Synchronization through Message passing



Another way processes can communicate and synchronize is through message passing. Message passing is a form of synchronization, since a message can be received only after it has been sent. Remote Procedure Calls and Rendezvous are higher level forms of synchronization using message passing. In the context of real-time systems, Goscinski [29] developed two algorithms for mutual exclusion in distributed systems. Johnson and Newman-Wolfe [46] proposed a distributed priority lock based on the PR-Lock (Chapter 3) that has low storage and overhead requirements.

2.5 Conclusion


In this chapter we introduced some of the issues in real-time systems, including scheduling, priority inversion, and synchronization. We presented some of the current practices in scheduling processes on single as well as multiple processors. We presented an analysis of the rate-monotonic scheduling algorithm and the effect of blocking due to priority inversion. We also discussed two protocols for reducing the effect of blocking, namely, priority inheritance and priority ceiling.

We categorized synchronization into two types: pessimistic and optimistic. We cited many examples and techniques illustrating the two mechanisms. Not all of the specific mechanisms are suitable under a given environment: some can be more efficient than others.

Optimistic synchronization algorithms were not previously applied to real-time systems. We show how optimistic synchronization can be effectively used for real-time systems, thereby avoiding the priority inversion problem that is inherent in lock-based synchronization mechanisms.
















CHAPTER 3
A PRIORITIZED MULTIPROCESSOR SPIN LOCK

3.1 Introduction

Mutual exclusion is a fundamental synchronization problem for exclusive access to critical sections or shared resources on multiprocessors [62]. The spin-lock is one of the mechanisms that can be used to provide mutual exclusion on shared-memory multiprocessors [6]. A spin-lock usually is implemented using atomic read-modify-write instructions such as Test&Set or Compare&Swap, which are available on most shared-memory multiprocessors [52]. Busy waiting is effective when the critical section is small and the processor resources are not needed by other processes in the interim. However, a spin-lock is usually not fair, and a naive implementation can severely limit performance due to network and memory contention [4, 27]. A careful design can avoid contention by requiring processes to spin on locally stored or cached variables [66].

In real-time systems, each process has timing constraints and is associated with a priority indicating the urgency of that process [88]. This priority is used by the operating system to order the rendering of services among competing processes. Normally, the higher the priority of a process, the faster its request for services gets honored. When the synchronization primitives disregard the priorities, lower priority processes may block the execution of a process with a higher priority and a stricter timing constraint [79, 80]. This priority inversion may cause the higher priority process to miss its deadline, leading to a failure of the real-time system. Most of the work done in synchronization is not based on priorities, and thus is not suitable for real-time systems. Furthermore, general purpose parallel processing systems often have processes that are "more important" than others (kernel processes, processes that hold many locks, etc.). The performance of such systems will benefit from prioritized access to critical sections.

In this chapter, we present a prioritized spin-lock algorithm, the PR-Lock. The PR-Lock algorithm is suitable for use in systems which either use static-priority schedulers, or use dynamic-priority schedulers in which the relative priorities of existing tasks do not change while blocked (such as Earliest Deadline First [88] or Minimum Laxity [39]). The PR-Lock is a contention-free lock [66], so its use will not create excessive network or memory contention. The PR-Lock maintains a queue of records, with one record for each process that has requested but not yet released the lock. The queue is maintained in sorted order (except for the head record) by the acquire lock operations, and the release lock operation is performed in constant time. As a result, the queue order is maintained by processes that are blocked anyway, and a high priority task does not perform work for a low priority task when it releases the lock. The lock keeps a pointer to the record of the lock holder, which aids in the implementation of priority inheritance protocols [79, 80]. A task's lock request and release are performed at well-defined points in time, which makes the lock predictable. We present a correctness proof, and simulation results which demonstrate the prioritized lock access, the locality of the references, and the improvement over a previously proposed prioritized spin-lock.
We organize this chapter as follows. In Section 3.1.1 we describe previous work in this area and in Section 3.2, we present our algorithm. In Section 3.3 we argue the correctness of our algorithm. In Section 3.4 we discuss an extension to the algorithm presented in Section 3.2. In Section 3.5 we show the simulation results which
compare the performance of the PR-Lock against that of other similar algorithms. In Section 3.6 we conclude this chapter by suggesting some applications and future extensions to the PR-Lock algorithm.

3.1.1 Previous Work

Our PR-Lock algorithm is based on the MCS-Lock algorithm, which is a spin-lock mutual exclusion algorithm for shared-memory multiprocessors [66]. The MCS-Lock grants lock requests in FIFO order, and blocked processes spin on locally accessible flag variables only, avoiding the contention usually associated with busy-waiting in multiprocessors [4, 27]. Each process has a record that represents its place in the lock queue. The MCS-Lock algorithm maintains a pointer to the tail of the lock queue. A process adds itself to the queue by swapping the current contents of the tail pointer for the address of its record. If the previous tail was nil, the process has acquired the lock. Otherwise, the process inserts a pointer to its record in the record of the previous tail, and spins on a flag in its record. The head of the queue is the record of the lock holder. The lock holder releases the lock by resetting the flag of its successor's record. If no successor exists, the lock holder sets the tail pointer to nil using a Compare&Swap instruction.
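For concreteness, the following is a minimal sketch of the MCS-Lock operations just described, written in C with C11 atomics; the type and function names (mcs_node, mcs_acquire, mcs_release) are our own illustration rather than the notation of [66].

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_bool locked;                 /* each waiter spins on its own flag */
} mcs_node;

typedef _Atomic(mcs_node *) mcs_lock;   /* tail pointer of the lock queue */

void mcs_acquire(mcs_lock *L, mcs_node *me)
{
    atomic_store(&me->next, NULL);
    mcs_node *prev = atomic_exchange(L, me);   /* swap in as the new tail */
    if (prev != NULL) {                        /* queue non-empty: must wait */
        atomic_store(&me->locked, true);
        atomic_store(&prev->next, me);         /* link in behind predecessor */
        while (atomic_load(&me->locked))
            ;                                  /* spin on local record only */
    }
}

void mcs_release(mcs_lock *L, mcs_node *me)
{
    mcs_node *succ = atomic_load(&me->next);
    if (succ == NULL) {
        mcs_node *expect = me;
        if (atomic_compare_exchange_strong(L, &expect, NULL))
            return;                            /* queue is now empty */
        while ((succ = atomic_load(&me->next)) == NULL)
            ;                                  /* successor is linking itself in */
    }
    atomic_store(&succ->locked, false);        /* hand the lock to the successor */
}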
Molesky, Shen, and Zlokapa [71] describe a prioritized spin-lock that uses the test-and-set instruction. Their algorithm is based on Burns' fair test-and-set mutual exclusion algorithm [14]. However, this lock is not contention-free.

Markatos and LeBlanc [63] present a prioritized spin-lock algorithm based on the MCS-Lock algorithm. Their acquire-lock algorithm is almost the same as the MCS acquire-lock algorithm, with the exception that Markatos' algorithm maintains a doubly linked list. When the lock holder releases the lock, it searches for the highest priority process in the queue. This process' record is moved to the head
of the queue, and its flag is reset. However, the point at which a task requests or releases a lock is not well defined, and the lock holder might release the lock to a low priority task even though a higher priority task has entered the queue. In addition, the work of maintaining the priority queue is performed when a lock is released. This choice makes the time to release a lock unpredictable, and significantly increases the time to acquire or release a lock (as is shown in Section 3.5). Craig [19] proposes a modification to the MCS lock and to Markatos' lock that substitutes an atomic Swap for the Compare&Swap instruction, and permits nested locks using only one lock record per process. Takada and Sakamura [91] extended queuing spin-locks to be preemptable, so that interrupts can be serviced.
Goscinski [29] developed two algorithms for mutual exclusion for real-time distributed systems. The algorithms are based on token passing. A process requests the critical section by broadcasting its intention to all other processes in the system. One algorithm grants the token based on the priorities of the processes, whereas the other algorithm grants the token to processes based on the remaining time to run the processes. The holder of the token enters the critical section.
The utility of prioritized locks is demonstrated by rate monotonic scheduling theory [59, 80]. Suppose there are N periodic processes T1, T2, T3, ..., TN on a uniprocessor. Let Ei and Ci represent the execution time and the cycle time (periodicity) of the process Ti. We assume that C1 <= C2 <= ... <= CN. Under the assumption that there is no blocking, [59] show that if, for each j,

    E1/C1 + E2/C2 + ... + Ej/Cj <= j(2^(1/j) - 1)

then all processes can meet their deadlines.











Suppose that Bj is the worst case blocking time that process Tj will incur. Then [80] show that all tasks can meet their deadlines if, for each j,

    E1/C1 + E2/C2 + ... + Ej/Cj + Bj/Cj <= j(2^(1/j) - 1)
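As a hedged numeric illustration (the task parameters here are ours, not taken from the cited work): consider a task T1 with E1 = 1 and C1 = 4, so E1/C1 = 0.25, and the j = 1 bound is 1(2^(1/1) - 1) = 1. If a lower priority task can hold a lock that T1 needs for B1 = 2 time units, the test becomes 0.25 + 2/4 = 0.75 <= 1, which still passes; but if the critical section grows to B1 = 4, then 0.25 + 4/4 = 1.25 > 1, and T1 can no longer be guaranteed to meet its deadline.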
Thus, the blocking of a high priority process by a lower priority process has a significant impact on the ability of tasks to meet their deadlines. Much work has been done to bound the blocking due to lower priority processes. For example, the Priority Ceiling protocol [80] guarantees that a high priority process is blocked by a lower priority process for the duration of at most one critical section. The Priority Ceiling protocol has been extended to handle dynamic-priority schedulers [16] and multiprocessors [17, 79].

Our contribution over previous work in developing prioritized contention-free spin-locks ([19] and [63]) is to more directly implement the desired priority queue. Our algorithm maintains a pointer to the head of the lock queue, which is the record of the lock holder. As a result, the PR-Lock can be used to implement priority inheritance [79, 80]. The work of maintaining priority ordering is performed in the acquire_lock operation, when a task is blocked anyway. The time required to release a lock is small and predictable, which reduces the length and the variance of the time spent in the critical section. The PR-Lock has well-defined points in time at which a task joins the lock queue and releases its lock. As a result, we can guarantee that the highest priority waiting task always receives the lock. Finally, we provide a proof of correctness.

3.2 PR-Lock Algorithm

Our PR-Lock algorithm is similar to the MCS-Lock algorithm in that both maintain queues of blocked processes using the Compare&Swap instruction. However, while the MCS-Lock and Markatos' lock maintain a global pointer to the tail of the
queue, the PR-Lock algorithm maintains a global pointer to the head of the queue. In both the MCS-Lock and Markatos' lock, the processes are queued in FIFO order, whereas in the PR-Lock, the queue is maintained in priority order of the processes.

3.2.1 Assumptions

We make the following assumptions about the computing environment.


1. The underlying multiprocessor architecture supports an atomic Compare&Swap instruction. We note that many parallel architectures support this instruction, or a related instruction [35, 74, 99].

2. The multiprocessor has shared-memory with coherent caches, or has locally-stored but globally-accessible shared-memory.

3. Each processor has a record to place in the queue for each lock. In a NUMA architecture, this record is allocated in the local, but globally accessible, memory. This record is not used for any other purpose for the lifetime of the queue. In Section 3.4, we allow the record to be used among many lock queues.

4. The higher the actual number assigned for priority, the higher the priority of a process (we can also assume the opposite).

5. The relative priorities of blocked processes do not change. Acceptable priority assignment algorithms include Earliest Deadline First and Minimum Laxity.

It should be noted that each process pi participating in the synchronization can be associated with a unique processor Pi. We expect that the queued processes will not be preempted, though this is not a requirement for correctness.











3.2.2 Implementation

The PR-Lock algorithm consists of two operations. The acquire_lock operation acquires a designated lock and the release_lock operation releases the lock. Each process uses the acquire_lock and release_lock operations to synchronize access to a resource as follows.


acquire_lock(L, Self)
    critical section
release_lock(L, Self)

The following sub-sections present the required version of Compare&Swap, the needed data structures, and the acquire_lock and release_lock procedures.

The Compare&Swap

The PR-Lock algorithm makes use of the Compare&Swap instruction, the code for which is shown in Figure 3.1. Compare&Swap is often used on pointers to object records, where a record refers to the physical memory space and an object refers to the data within a record. Current is a pointer to a record, Old is a previously sampled value of Current, and New is a pointer to a record that we would like to substitute for *Old (the record pointed to by Old). We compute the record *New based on the object in *Old (or decide to perform the swap based on the object in *Old), so we want to set Current equal to New only if Current still points to the record *Old. However, even if Current points to *Old, it might point to a different object than the one originally read. This will occur if *Old is removed from the data structure, then re-inserted as Current with a new object. This sequence of events cannot be detected by the Compare&Swap and is known as the A-B-A problem.

Following the work of Prakash et al. [77] and Turek et al. [97], we make use of a double-word Compare&Swap instruction [74] to avoid this problem. A counter is
appended to Current and is treated as a part of Current. Thus Current consists of two parts: the value part of Current and the counter part of Current. This counter is incremented every time a modification is made to *Current. Now all the variables Current, Old, and New are twice their original size. This approach reduces the probability of occurrence of the A-B-A problem to acceptable levels for practical applications. If a double-word Compare&Swap is not available, the address and counter can be packed into 32 bits by restricting the possible address range of the lock records.
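A minimal sketch in C of one such packing follows; keeping lock records in a fixed table so that 16 bits of index suffice is our assumption for illustration, as are the field widths.

#include <stdint.h>

/* Pack a record index and a modification counter into one 32-bit
 * word so that a single-word Compare&Swap can cover both.  The
 * counter is bumped on every reuse of a record, so a recycled
 * record compares unequal -- which is what defeats A-B-A. */
#define IDX_BITS 16u
#define IDX_MASK ((1u << IDX_BITS) - 1u)

typedef uint32_t packed_ptr;

static inline packed_ptr pack(uint32_t index, uint32_t ctr)
{
    return (ctr << IDX_BITS) | (index & IDX_MASK);
}

static inline uint32_t unpack_index(packed_ptr p) { return p & IDX_MASK; }
static inline uint32_t unpack_ctr(packed_ptr p)   { return p >> IDX_BITS; }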

We use a version of the Compare&Swap operation in which the current value of the target location is returned in Old if the Compare&Swap fails. The semantics of the Compare&Swap used is given in Figure 3.1. A version of the Compare&Swap instruction that returns only TRUE or FALSE can be used by performing an additional read.


Procedure CAS(structure Pointer *Current, *Old, *New)
/* Assume CAS operates on double words */
atomic {
    if (*Current == *Old) {
        *Current = *New;
        return(TRUE);
    }
    else {
        *Old = *Current;
        return(FALSE);
    }
}

Figure 3.1. CAS used in the PR-Lock Algorithm










Data Structures

The basic data structure used in the PR-Lock algorithm is a priority queue. The lock L contains a pointer to the first record of the queue. The first record of the queue belongs to the process currently using the lock. If there is no such process, then L contains nil.

Each process has a locally-stored but globally-accessible record to insert into the lock queue. If process p inserts record q into the queue, we say that q is p's record and p is q's process. The record contains the process priority, the next-record pointer, a boolean flag Locked on which the process owning the element busy-waits if the lock is not free, and an additional field Data that can be used to store application-dependent information about the lock holder.

The next-record pointer is a double-sized variable: one half is the actual pointer and the other half is a counter used to avoid the A-B-A problem. The counter portion of the pointer is itself divided into two parts: one bit of the counter, called the Dq bit, is used to indicate whether the queuing element is in the queue. The rest of the bits are used as the actual counter. This technique is similar to the one used by Prakash et al. [77] and Turek et al. [97]. Their counter refers to the record referenced by the pointer. In our algorithm, the counter refers to the record that contains the pointer, not the record that is pointed to.

If the Dq bit of a record q is FALSE, then the record is in the queue for a lock L. If the Dq bit is TRUE, then the record is probably not in the queue (for a short period of time, the record might be in the queue with its Dq bit set TRUE). The Dq bit lets the PR-Lock avoid garbage accesses.

Each process keeps the address of its record in a local variable (Self). In addition, each process requires two local pointer variables to hold the previous and the next
queue element for navigating the queue during the enqueue operation (Prev_Node and Next_Node).

The data structures used are shown in Figure 3.2. The Dq bit of the Pointer field is initialized to TRUE, and the Ctr field is initialized to 0 before the record is first used.

A typical queue formed by the PR-Lock algorithm is shown in Figure 3.3. Here L points to the record q0 of the process currently holding the lock. The record q0 has a pointer to the record q1 of the next process, the one having the highest priority among the processes waiting to acquire the lock L. Record q1 points to record q2 of the next higher priority waiting process, and so on. The record qn belongs to the process with the least priority among waiting processes.

Acquire-Lock Operation

The acquire_lock operation is called by a process p̂ before using the critical section or resource guarded by lock L. The parameters of the acquire_lock operation are the lock pointer L and the record q̂ of the process (passed to the local variable Self).

An acquire_lock operation searches for the correct position to insert q̂ into the queue, using Prev_Node and Next_Node to keep track of the current position. In Figure 3.4, Prev_Node and Next_Node are abbreviated to P and N. The records pointed to by P and N are qi and qi+1, belonging to processes pi and pi+1. Process p̂ positions itself so that Pr(pi) >= Pr(p̂) > Pr(pi+1), where Pr is a function which maps a process to its priority. Once such a position is found, q̂ is prepared for insertion by making q̂ point to qi+1. Then, the insertion is committed by making qi point to q̂ using the Compare&Swap instruction. The various stages and the final result are shown in Figure 3.4.


















structure Pointer {
    structure Object *Ptr;
    int31 Ctr;
    boolean Dq;
}

structure Record {
    structure structure-of-data Data;
    boolean Locked;
    integer Priority;
    structure Pointer Next;
}

Shared Variable
    structure Pointer L;

Private Variables
    structure Pointer Self, Prev_Node, Next_Node;
    boolean Success, Failure;
    constant TRUE, FALSE, NULL, MAX_PRIORITY;

[Record layout: Data | Locked | Priority | Next.Ptr | Next.Ctr | Next.Dq]

Figure 3.2. Data Structures used in the PR-Lock Algorithm

























[Figure: the lock queue -- L points to q0 (the lock holder), followed by waiting records q1, q2, ..., qn in priority order]

Figure 3.3. Queue data structure used in PR-Lock algorithm











[Figure: the four stages of inserting q̂ -- Start, Position, Prepare, Commit -- and the final Result, with P and N bracketing the insertion point between qi and qi+1]

Figure 3.4. Stages in the acquire_lock operation











The acquire_lock algorithm is given in Figure 3.5. Before the acquire_lock procedure is called, the Data and the Priority fields of the process' record are initialized appropriately. In addition, the Dq bit of the Next pointer is implicitly TRUE.

The acquire_lock operation begins by assuming that the lock is currently free (the lock pointer L is null). It attempts to change L to point to its own record with the Compare&Swap instruction. If the Compare&Swap is successful, the lock is indeed free, and the process acquires the lock without busy-waiting. In the context of the composite pointer structures that the algorithm uses, a NULL pointer is all zeros.

If the swap is unsuccessful, then the acquiring process traverses the queue to position itself between a higher or equal priority process record and a lower priority process record. Once such a junction is found, Prev_Node will point to the record of the higher priority process and Next_Node will point to the record of the lower priority process. The process first sets its link to Next_Node. Then, it attempts to change the previous record's link to its own record with the atomic Compare&Swap.

If successful, the process sets the Dq flag in its record to FALSE indicating its presence in the queue. The process then busy-waits until its Locked bit is set to FALSE, indicating that it has been admitted to the critical section.

There are three cases for an unsuccessful attempt at entering the queue. Problems are detected by examining the returned value of the failed Compare&Swap, marked F in the algorithm. Note that the returned value is in Next_Node. In addition, a process might detect that it has misnavigated while searching the queue. When we read Next_Node, the contents of the record pointed to by Prev_Node are fixed, because the record's counter is read into Next_Node.














Procedure acquire_lock(L, Self) {
    Success = FALSE;
    do {
        Prev_Node = NULL; Next_Node = NULL;
        if (CAS(&L, &Next_Node, &Self)) {                    /* No lock holder */
            Success = TRUE; Failure = FALSE;                 /* Use lock */
            Self.Ptr->Next.Dq = FALSE;
        }
        else {                                               /* Lock in use */
            Failure = FALSE; Self.Ptr->Locked = TRUE;
            do {
                Prev_Node = Next_Node;
                Next_Node = Prev_Node.Ptr->Next;
                if ((Next_Node.Dq == TRUE)                   /* Dequeued, try again (ii) */
                        or (Prev_Node.Ptr->Priority < Self.Ptr->Priority))    /* (iii) */
                    Failure = TRUE;
                else {
                    if (Next_Node.Ptr == NULL or (Next_Node.Ptr != NULL and
                            Next_Node.Ptr->Priority < Self.Ptr->Priority)) {
                        Self.Ptr->Next.Ptr = Next_Node.Ptr;
                        if (CAS(&(Prev_Node.Ptr->Next), &Next_Node, &Self)) {  /* F */
                            Self.Ptr->Next.Dq = FALSE;
                            while (Self.Ptr->Locked);        /* Busy wait */
                            Success = TRUE;                  /* Then, use lock */
                        }
                        else {
                            if ((Next_Node.Dq == TRUE)       /* Dequeued, try again (ii) */
                                    or (Prev_Node.Ptr->Priority < Self.Ptr->Priority))  /* (iii) */
                                Failure = TRUE;
                            else
                                Next_Node = Prev_Node;       /* Continue from here (i) */
                        }
                    }
                }
            } while (!Success and !Failure);
        }
    } while (!Success);
}


Figure 3.5. The acquire_lock operation procedure











1. A concurrent acquire_lock operation may overtake this acquire_lock operation and insert its own record immediately after Prev_Node, as shown in Figure 3.10. In this case the Compare&Swap will fail at the position marked F in Figure 3.5. The correctness of this operation's position is not affected, so the operation continues from its current position (the line marked i in Figure 3.5).

2. A concurrent release_lock operation may overtake the acquire_lock operation and remove the record pointed to by Prev_Node, as shown in Figure 3.11. In this case, the Dq bit in the link pointer of this record will be TRUE. The algorithm checks for this condition when it scans through the queue and when it tries to commit its modifications. The algorithm detects the situation in the two places marked ii in Figure 3.5. Every time a new record is accessed (by Prev_Node), its link pointer is read into Next_Node and the Dq bit is checked. In addition, if the Compare&Swap fails, the link pointer is saved in Next_Node and the Dq bit is tested. If the Dq bit is TRUE, the algorithm starts from the beginning.

3. A concurrent release_lock operation may overtake the acquire_lock operation and remove the record pointed to by Prev_Node, after which the record is put back into the queue, as shown in Figure 3.12. If the record returns with a priority higher than or equal to Self's priority, then the position is still correct and the operation can continue. Otherwise, the operation cannot find the correct insertion point, so it has to start from the beginning. This condition is tested at the lines marked iii in Figure 3.5.

The spin-lock busy waiting of a process is broken by the eventual release of the lock by the process which is immediately ahead of the waiting process.

























Release-Lock Operation

The release_lock operation is straightforward, and the algorithm is given in Figure 3.6. The process p releasing the lock sets the Dq bit in its record's link pointer to TRUE, indicating that the record is no longer in the queue. Setting the Dq bit prevents any acquire_lock operation from modifying the link. The releasing process copies the address of the successor record, if any, to L. The process then releases the lock by setting the Locked boolean variable in the record of the next waiting process to FALSE. To avoid testing special cases in the acquire_lock operation, the priority of the head record is set to the highest possible priority.

Procedure release_lock(L, Self) {
    Self.Ptr->Next.Dq = TRUE;          /* decisive instruction */
    L = Self.Ptr->Next;                /* Release Lock */
    if (Self.Ptr->Next.Ptr != NULL) {
        L.Ptr->Priority = MAX_PRIORITY;
        L.Ptr->Locked = FALSE;
    }
}

Figure 3.6. The release_lock operation procedure

3.3 Correctness of PR-Lock Algorithm

In this section, we present an informal argument for the correctness properties of our PR-Lock algorithm. We prove that the PR-Lock algorithm is correct by showing that it maintains a priority queue, and that the head of the priority queue is the record of the process that holds the lock. The PR-Lock is decisive-instruction serializable [83]. Both operations of the PR-Lock algorithm have a single decisive instruction. The decisive instruction for the acquire_lock operation is the successful Compare&Swap










and the decisive instruction for the release_lock operation is setting the Dq bit. Corresponding to a concurrent execution C of the queue operations, there is an equivalent (with respect to return values and final states) serial execution Sd such that if operation O1 executes its decisive instruction before operation O2 does in C, then O1 < O2 in Sd. Thus, the equivalent priority queue of a PR-Lock is in a single state at any instant, simplifying the correctness proof (a concurrent data structure that is linearizable but not decisive-instruction serializable might be in several states simultaneously [37]).

We use the following notation in our discussion. PR-Lock ℒ has lock pointer L, which points to the first record in the lock queue (the record of the process that holds the lock). Let there be N processes p1, p2, ..., pN that participate in the lock synchronization for a priority lock ℒ, using the PR-Lock algorithm. As mentioned earlier, each process pi allocates a record qi to enqueue and dequeue. Thus, each process pi participating in the lock access is associated with a queue record qi. Let Pr(pi) be a function which maps a process to its priority, a number between 1 and N. We also define another function Pr(qi) which maps a record belonging to a process pi to its priority.

A priority queue is an abstract data type that consists of

* A finite set Q of elements qi, i = 1...N

* A function Pr : qi → ni, where ni ∈ N. For simplicity, we assume that every ni is unique. This assumption is not required for correctness, and in fact processes of the same priority will obtain the lock in FCFS order.

* Two operations, enqueue and dequeue










At any instant, the state of the queue can be defined as

    Q = (q0, q1, q2, ..., qn)

where Pr(qi) > Pr(qj) whenever 1 <= i < j <= n.

We call q0 the head record of priority queue Q. The head record's process is the current lock holder. Note that the non-head records are totally ordered.

The enqueue operation is defined as

    enqueue((q0, q1, q2, ..., qn), q̂) → (q0, q1, q2, ..., qi, q̂, qi+1, ..., qn)

where Pr(qi) > Pr(q̂) > Pr(qi+1).

The dequeue operation on a non-empty queue is defined as

    dequeue((q0, q1, q2, ..., qn)) → (q1, q2, ..., qn)

where the return value is q0. A dequeue operation on an empty queue is undefined.
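As an illustrative example (the priorities are chosen arbitrarily): if Q = (q0, q1, q2) with Pr(q1) = 9 and Pr(q2) = 4, then enqueueing a record q̂ with Pr(q̂) = 6 yields (q0, q1, q̂, q2), since Pr(q1) > Pr(q̂) > Pr(q2). A subsequent dequeue returns q0 and leaves (q1, q̂, q2), and q1's process becomes the lock holder.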
For every PR-Lock ℒ, there is an abstract priority queue Q. Initially, both ℒ and Q are empty. When a process p̂ with a record q̂ performs the decisive instruction for the acquire_lock operation, Q changes state to enqueue(Q, q̂). Similarly, when a process executes the decisive instruction for a release_lock operation, Q changes state to dequeue(Q).
We show that when we observe ℒ, we find a structure that is equivalent to Q. To observe ℒ, we take a consistent snapshot [15] of the current state of the system memory. Next, we start at the lock pointer L and observe the records following the linked list. If the head record has its Dq bit set and its process has exited the acquire_lock operation, then we discard it from our observation. If we observe the same records in the same sequence in both ℒ and Q, then we say that ℒ and Q are equivalent, and we write ℒ ⇔ Q.













[Figure: before, L → q0 → q1 → q2 → ... → qn; after, L → q1 → q2 → ... → qn]

Figure 3.7. Observed queue ℒ before and after a release_lock

Theorem 1 The representative priority queue Q is equivalent to the observed queue of the PR-Lock ℒ.

Proof We prove the theorem by induction on the decisive instructions, using the following two lemmas.

Lemma 1 If Q ⇔ ℒ before a release_lock decisive instruction, then Q ⇔ ℒ after the release_lock decisive instruction.

Proof: Let Q = (q0, q1, q2, ..., qn) before a release_lock decisive instruction. A release_lock operation is equivalent to a dequeue operation on the abstract queue. By definition,

    dequeue((q0, q1, q2, ..., qn)) → (q1, q2, ..., qn)

The before and after states of ℒ are shown in Figure 3.7. If L points to the record q0 before the release_lock decisive instruction, the release_lock decisive instruction sets the Dq bit in q0 to TRUE, removing q0 from the observable queue. Thus, Q ⇔ ℒ after the release_lock operation. Note that L will point to q1 before the next release_lock decisive instruction. □

Lemma 2 If Q ⇔ ℒ before an acquire_lock decisive instruction, then Q ⇔ ℒ after the acquire_lock decisive instruction.












[Figure: before, L is nil; after, L → q̂]

Figure 3.8. Observed queue ℒ before and after an acquire_lock to an empty queue





[Figure: before, L → q0 → ... → qi → qi+1 → ... → qn; after, L → q0 → ... → qi → q̂ → qi+1 → ... → qn]

Figure 3.9. Observed queue ℒ before and after an acquire_lock to a non-empty queue

Proof: There are two different cases to consider:

Case 1: Q = () before the acquire_lock decisive instruction. The equivalent operation on the abstract queue Q is the enqueue operation. Thus,

    enqueue((), q̂) → (q̂)

If the lock ℒ is empty, q̂'s process executes a successful decisive Compare&Swap instruction to make L point to q̂ and acquires the lock (Figure 3.8). Clearly, Q ⇔ ℒ after the acquire_lock decisive instruction.

Case 2: Q = (q0, q1, q2, ..., qn) before the acquire_lock decisive instruction. The state of the queue Q after the acquire_lock is given by

    enqueue((q0, q1, q2, ..., qn), q̂) → (q0, q1, q2, ..., qi, q̂, qi+1, ..., qn)











The corresponding ℒ before and after the acquire_lock is shown in Figure 3.9. The pointers P and N are the Prev_Node and Next_Node pointers by which q̂'s acquire_lock operation positions its record such that the process observes Pr(qi) > Pr(q̂) > Pr(qi+1). Then, the Next pointer in q̂ is set to the address of qi+1. The Compare&Swap instruction, marked F in Figure 3.5, attempts to make the Next pointer in qi point to q̂. If the Compare&Swap instruction succeeds, then it is the decisive instruction of q̂'s process, and the resulting queue ℒ is illustrated in Figure 3.9. This is equivalent to Q after the enqueue operation. Since the Compare&Swap succeeds only when qi is in the queue, qi+1 is the successor record, and Pr(qi) > Pr(q̂) > Pr(qi+1), we have Q ⇔ ℒ.

If there are no concurrent operations on the queue, we can observe that P and N are positioned correctly and the Compare&Swap succeeds. If there are other concurrent operations, they can interfere with the execution of an acquire_lock operation, A. There are three possibilities:

Case a: Another acquire_lock A' enqueued its record q' between qi and qi+1, but qi has not yet been dequeued. If Pr(qi) > Pr(q̂) > Pr(qi+1), then q̂'s process will attempt to insert q̂ between qi and qi+1. Process A' has modified qi's next pointer, so q̂'s Compare&Swap will fail. Since qi has not been dequeued, Pr(qi) > Pr(q̂), and q̂'s process should continue its search from qi, which is what happens. If Pr(q') > Pr(q̂), then q̂'s process can skip over q' and continue searching from qi+1, which is what happens. This scenario is illustrated in Figure 3.10.

[Figure: A' inserts q' between qi and qi+1, so A's Compare&Swap at qi fails; A continues from its current position]

Figure 3.10. A concurrent acquire_lock A' succeeds before A

Case b: A release_lock operation R overtakes A and removes qi from the queue (i.e., R has set qi's Dq bit), and qi has not yet been returned to the queue (so its Dq bit is still set). Since qi is not in the lock queue, A is lost and must start searching again. Based on its observations of qi and qi+1, A may have decided to continue searching the queue or to commit its operation. In either case A sees the Dq bit set and fails, so A starts again from the beginning of the queue. This scenario is illustrated in Figure 3.11.

Case c: A release_lock operation R overtakes A and removes qi from the queue, and then qi is put back in the queue by another acquire_lock A'. If A tries to commit its operation, then the pointer in qi has changed, so the Compare&Swap fails. Note that even if qi is pointing to qi+1, the version numbers prevent the decisive instruction from succeeding. If A continues searching, then there are two possibilities based on the new value of Pr(qi). If Pr(q̂) > Pr(qi), A is lost and cannot find the correct place to insert q̂. This condition is detected when the priority of qi is examined (the lines marked iii in Figure 3.5), and operation A restarts from the head of the queue. If Pr(q̂) < Pr(qi), then A can still find a correct place to insert q̂ past qi, and A continues searching. This scenario is illustrated in Figure 3.12.

No matter what interference occurs, A always takes the right action. Therefore, Q ⇔ ℒ after the acquire_lock decisive instruction. □












[Figure: R dequeues the record that P points to, setting its Dq bit, so A restarts from the head of the queue]

Figure 3.11. A concurrent release_lock R succeeds before A

To prove the theorem we use induction. Initially, Q = () and L points to nil, so Q ⇔ ℒ is trivially true. Suppose that the theorem is true before the ith decisive instruction. If the ith decisive instruction is for an acquire_lock operation, Lemma 2 implies that Q ⇔ ℒ after the ith decisive instruction. If the ith decisive instruction is for a release_lock operation, Lemma 1 implies that Q ⇔ ℒ after the ith decisive instruction. Therefore, the inductive step holds, and hence, Q ⇔ ℒ. □

3.4 Extensions

In this section we discuss a couple of simple extensions that increase the utility of the PR-Lock algorithm.

3.4.1 Multiple Locks

As described, a record for a PR-Lock can be used for one lock queue only (otherwise, a process might obtain a lock other than the one it desired). If the real-time system has several critical sections, each with their own locks (which is likely), each process must have a lock record for each lock queue, which wastes space.
Fortunately, a simple extension of the PR-Lock algorithm allows a lock record to be used in many different lock queues. We replace the Dq bit by a Dq string of l bits.











[Figure: R dequeues qi and a concurrent acquire_lock A' re-enqueues it; A restarts if Pr(q̂) > Pr(qi), and continues past qi if Pr(q̂) <= Pr(qi)]

Figure 3.12. Release_lock R and acquire_lock A' succeed before A

If the Dq string evaluates to i > 0 when interpreted as a binary number, then the record is in the queue for lock i. If the Dq string evaluates to 0, then the record is (probably) not in any queue. The acquire_lock and release_lock algorithms carry through by modifying the test for being or not being in queue i appropriately.
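A minimal sketch of this encoding in C is given below; the string width and the helper names are our own, purely illustrative choices.

/* The Dq string occupies the low DQ_BITS bits of the counter field.
 * Zero means "in no queue"; a value i > 0 means "in the queue for
 * lock i".  DQ_BITS = 4 is an assumed width for illustration. */
#define DQ_BITS 4u
#define DQ_MASK ((1u << DQ_BITS) - 1u)

static inline int dq_in_queue(unsigned ctr_field, unsigned lock_id)
{
    return (ctr_field & DQ_MASK) == lock_id;   /* lock_id >= 1 */
}

static inline unsigned dq_set(unsigned ctr_field, unsigned lock_id)
{
    return (ctr_field & ~DQ_MASK) | (lock_id & DQ_MASK);
}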

We note that if a process sets nested locks, a new lock record must be used for each level of nesting. Craig [19] presents a method for reusing the same record for nested locks.

3.4.2 Backing Out

If a process does not obtain the lock after a certain deadline, it might wish to stop waiting and continue processing. The process must first remove its record from the lock queue. To do so, the process follows these steps:











1. Find the preceding record in the lock queue, using the method from the algorithm for the acquire_lock operation. If the process determines that its record is at the head of the lock queue, return with a "lock obtained" value.

2. Set the Dq bit (Dq string) of the process' record to "Dequeued".

3. Perform a Compare&Swap of the predecessor record's next pointer with the process' next pointer. If the Compare&Swap fails, go to 1. If the Compare&Swap succeeds, return with a "lock released" value.

Step 2 fixes the value of the process's successor. If the process removes itself from the queue without obtaining the lock, the Compare&Swap is the decisive instruction. If the Compare&Swap fails, the predecessor might have released the lock, or a third process might have enqueued itself as the predecessor. The process cannot distinguish between these possibilities, so it must re-search the lock queue.
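A sketch of these steps, in the same pseudocode style as Figures 3.5 and 3.6, follows; the procedure name and the phrasing of the search step are our reconstruction, not code from the dissertation's implementation.

Procedure backout(L, Self) {
    while (TRUE) {
        /* Step 1: locate the record preceding Self, using the same
           queue traversal as the acquire_lock operation. */
        if (Self is the head record pointed to by L)
            return("lock obtained");
        Prev_Node = predecessor of Self in the lock queue;
        /* Step 2: fix our successor by marking our record dequeued. */
        Self.Ptr->Next.Dq = TRUE;
        /* Step 3: splice Self out of the queue. */
        if (CAS(&(Prev_Node.Ptr->Next), &Self, &(Self.Ptr->Next)))
            return("lock released");
        /* The CAS failed: the predecessor released the lock, or a
           third process enqueued itself as our predecessor, so we
           must re-search from step 1. */
    }
}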

3.5 Simulation Results

We simulated the execution of the PR-Lock algorithm in PROTEUS, which is a configurable multiprocessor simulator [12]. We also implemented the MCS-Lock and Markatos' lock to demonstrate the difference in the acquisition and release time characteristics.

In the simulation, we use a multiprocessor model with eight processors and a global shared-memory. Each processor has a local cache memory of 2048 bytes. In PROTEUS, the units of execution time are cycles. Each process executes for a uniformly distributed random time, in the range 1 to 35 cycles, before it issues an acquire_lock request. After acquiring the lock, the process stays in the critical section for a fixed number of cycles (150) plus another uniformly distributed random number (1 to 400) of cycles before releasing the lock. This procedure is repeated fifty
times. The average number of cycles taken to acquire a lock by a process is then computed. PROTEUS simulates parallelism by repeatedly executing a processor's program for a time quantum, Q. In our simulations, Q = 10. The priority of a process is set equal to the process/processor number and the lower the number, the higher the priority of a process.
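The per-process workload just described can be summarized by the following sketch in C; uniform_int, work, and the lock entry points are stand-ins for the simulator's primitives.

#define ROUNDS 50

extern int  uniform_int(int lo, int hi);      /* assumed PRNG helper   */
extern void work(int cycles);                 /* burn simulated cycles */
extern void acquire_lock(void *L, void *self);
extern void release_lock(void *L, void *self);

void worker(void *L, void *self)
{
    int round;
    for (round = 0; round < ROUNDS; round++) {
        work(uniform_int(1, 35));             /* non-critical work     */
        acquire_lock(L, self);
        work(150 + uniform_int(1, 400));      /* critical section      */
        release_lock(L, self);
    }
}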

Figure 3.13 shows the average time taken for a process to acquire a lock using the MCS-Lock algorithm, the PR-Lock algorithm, and Markatos' lock algorithm. A process using the MCS-Lock algorithm has to wait in the FIFO queue behind all other processes in every round. However, a process using the PR-Lock algorithm waits for a time that is proportional to the number of higher priority processes. For example, the highest and second highest priority processes wait, on average, for about one critical-section period each. We note that the two highest priority processes have about the same acquire-lock execution time because they alternate in acquiring the lock. Only after both of these processes have completed their execution can the third and fourth highest priority processes obtain the lock. Figure 3.13 clearly demonstrates that the average acquisition time for a lock using the PR-Lock is proportional to the process priorities, whereas the average acquisition time is proportional to the number of processes in the case of the MCS-Lock algorithm. This feature makes the PR-Lock algorithm attractive for use in real-time systems.

The same prioritized lock-acquisition behavior is shown using Markatos' algorithm, but the average time to acquire a lock is 50% greater than when the PR-Lock is used. At first this result is puzzling, because Markatos' lock performs the majority of its work when the lock is released and the PR-Lock performs its work when the lock is acquired. However, the time to release a lock is part of the time spent in the critical section, and the time to acquire a lock depends primarily on time spent in the critical section by the preceding lock holders. Thus, the PR-Lock allows much
faster access to the critical section. As we will see, the PR-Lock also allows more predictable access to the critical section.

Finally, we compared the time required to release a lock using both the PR-Lock and Markatos' lock. These results are shown in Figure 3.14. The time to release a lock using the PR-Lock is small, and is consistent for all of the processes. Releasing a lock using Markatos' lock requires significantly more time. Furthermore, in our experiments a high priority process is required to spend significantly more time releasing a lock than is required for a low priority process. This behavior is a result of the way that the simulation was run. When high priority processes are executing, all low priority processes are blocked in the queue. As a result, many records must be searched when a high priority process releases a lock. Thus, a high priority process does work on behalf of low priority processes. The time required for a high priority process to release its lock depends on the number of blocked processes in the queue. The result is a long and unpredictable amount of time required to release a lock. Since the lock must be released before the next process can acquire the lock, the time required to acquire a lock is also made long and unpredictable.

Most of the time the cache-hit ratio is 95% or higher on each of the processors using the PR-Lock algorithm, and we found average cache-hit ratios in the range of 99.72% to 99.87%. Thus, the PR-Lock generates very little network or memory contention in spite of the processes using busy-waiting.

3.6 Conclusion

In this chapter, we present a priority spin-lock synchronization algorithm, the PR-Lock, which is suitable for real-time shared-memory multiprocessors. The PR-Lock algorithm is characterized by prioritized lock acquisition, a low release overhead, very little bus contention, and well-defined semantics. Simulation results show that











the PR-Lock algorithm performs well in practice. This priority lock algorithm can be used as presented for mutually exclusive access to a critical section or can be used to provide higher level synchronization constructs such as prioritized semaphores and monitors. The PR-Lock maintains a pointer to the record of the lock holder, so the PR-Lock can be used to implement priority inheritance protocols. Finally, the PR-Lock algorithm can be adapted for use as a single-dequeuer, multiple-enqueuer parallel priority queue.

While several prioritized spin-locks have been proposed, the PR-Lock has the following advantages.

* The algorithm is contention-free.

* A higher priority process does not have to work for a lower priority process while releasing a lock. As a result, the time required to acquire and release a lock is fast and predictable.

* The PR-Lock has a well-defined acquire-lock point.

* The PR-Lock maintains a pointer to the process using the lock, which facilitates implementing priority inheritance protocols.

For future work, we are interested in prioritizing access to other operating system structures to make them more appropriate for use in a real-time parallel operating system.
















[Figure: bar chart of average lock-acquisition time (cycles x 100, from 0 to 60) for each of the eight processes under the MCS-Lock, PR-Lock, and Markatos' Lock]

Figure 3.13. Comparison of lock acquisition times







































[Figure: bar chart of average lock-release time (cycles, from 0 to 450) for each of the eight processes under Markatos' Lock and the PR-Lock]

Figure 3.14. Comparison of lock release times
















CHAPTER 4
INTERRUPTIBLE CRITICAL SECTIONS

4.1 Introduction

The scheduling of independent real-time tasks is well understood, as optimal scheduling algorithms have been proposed for periodic and aperiodic tasks on uniprocessor [21, 59] and multiprocessor systems [20, 60, 70]. However, if the tasks communicate through shared critical sections, a low-priority task that holds a lock may block a high priority task that requires the lock, causing a priority inversion. In this chapter, we present a method for real-time synchronization that avoids priority inversions.

We present a new approach to synchronization on uniprocessors with special applicability to embedded and real-time systems. Existing methods for synchronization in real-time systems are pessimistic, and use blocking to enforce concurrency control. While protocols to bound the blocking of high priority tasks exist, high priority tasks can still be blocked by low priority tasks. In addition, these protocols require a complex interaction with the scheduler. We propose interruptible critical sections (i.e., optimistic synchronization) as an alternative to purely blocking methods. Practical optimistic synchronization requires techniques for writing interruptible critical sections, and system support for detecting critical section access conflicts. We discuss our implementation of an interruptible lock on a system running the pSOS+ real-time operating system. Our experimental performance results show that interruptible locks reduce the variance in the response time of the highest priority task with only a small impact on the performance of the low priority tasks. We show how
interruptible critical sections can be combined with the Priority Ceiling Protocol, and present an analysis which shows that interruptible locks improve the schedulability of task sets that have high priority tasks with tight deadlines.

Rajkumar, Sha, and Lehoczky [80] have proposed the Priority Ceiling Protocol (PCP) to minimize the effect of priority inversion. The priority ceiling of a semaphore S is the priority of the highest priority task that will ever lock S. A task may lock a semaphore only if its priority is higher than the priority ceiling of all locked semaphores (except for the semaphores that it has locked). The PCP guarantees that a task will be blocked by a lower priority task at most once during its execution. However, the tasks must have static priorities in order to apply the Priority Ceiling Protocol. In addition, blocking for even the duration of one critical section may be excessive. Rajkumar, Sha, and Lehoczky have extended the Priority Ceiling Protocol to work in a multiprocessor system [79].
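As a brief illustration (the task and semaphore names here are ours): let tasks t1, t2, and t3 have decreasing priorities, and let t1 and t3 both use a semaphore S, so that the priority ceiling of S is t1's priority. If t3 holds S when t2 becomes ready, t2 may preempt t3 but cannot lock any semaphore, because its priority does not exceed the ceiling of S; thus t2 is delayed by at most the single critical section that t3 is executing.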
Blocking-based synchronization algorithms have been extended to work with dynamic priority schedulers. Baker [8] presents a pre-allocation based synchronization algorithm that can manage resources with multiple instances. A task's execution is delayed until the scheduler can guarantee that the task can execute without blocking a higher priority task. Tripathi and Nirkhe [95], and Faulk and Parnas [24] also discuss pre-allocation based scheduling methods. Chen and Lin [16] extend the Priority Ceiling Protocol to permit dynamically-assigned priorities. Chen and Lin [17] extend the protocol in [16] to account for multiple resource instances.
Previous approaches to real-time synchronization suffer from several drawbacks. First, a high-priority task might be forced to wait for a low-priority task to complete a critical section. Mercer and Tokuda [68] note that the blocking of high-priority tasks must be kept to a minimum in order to ensure the responsiveness of the real-time system. If tasks can have delayed release times [57], a high priority task might not be
able to block for the duration of a critical section and still be guaranteed to meet its deadlines. Jeffay [43] discusses the additional feasibility conditions required if tasks have preemption constraints. Second, dynamic-priority scheduling algorithms are feasible with much higher CPU utilizations than static-priority scheduling algorithms [59], and dynamic-priority schedulers might be required for aperiodic tasks. The simple Priority Ceiling Protocol of Rajkumar, Sha, and Lehoczky [80] can be applied to static-priority schedulers only. The dynamic-priority synchronization protocols [8, 16, 17] are complex, and must be closely integrated with the scheduling algorithm.

We present a different approach to synchronization, one which guarantees that a high-priority task never waits for a low-priority task at a critical section. We introduce the idea of an Interruptible Critical Section (ICS), which is a critical section protected by optimistic concurrency control instead of by blocking. A task calculates its modifications to the shared data structure, then attempts to commit its modification. If a higher priority task previously committed a conflicting modification, the lower priority task fails to commit, and must try again (as in optimistic concurrency control [9]). Otherwise, the task succeeds, and continues its work. The synchronization algorithms are not tied to the scheduling algorithm, simplifying the design of the real-time operating system.

A purely optimistic approach to synchronization can starve low priority tasks, leading to poor performance (i.e. low schedulability). We show how to combine ICS with locking, to create interruptible locks. Interruptible locks can be used in conjunction with the PCP to provide schedulability guarantees for the low priority tasks. We present an analysis of periodic tasks that use interruptible locks with the Priority Ceiling Protocol.
We present our implementation of ICS and interruptible locks on the pSOS+ real-time operating system, and show that we can reduce the maximum response time of
a high priority task. Our implementation of interruptible locks is realized through a small amount of code, and did not require a modification of the pSOS+ kernel (although it did make use of a kernel call-out routine). We note that pSOS+ does not provide priority inheritance.

Interruptible critical sections are best applied in embedded or real-time operating systems to improve the schedulability of the highest priority tasks. An operating system for embedded systems will of necessity provide the flexibility required to implement an ICS (as does pSOS+). In such an environment, high priority tasks can enter an ICS without making a system call, thus avoiding the associated overhead. An ICS cannot reserve resources for a process, but it can co-exist with the blocking algorithms [8, 17, 80], which can. For example, an ICS can be used to communicate with a high-priority device driver: low priority tasks submit requests to the device driver through the ICS, and the device is serviced by a high priority driver which obtains its commands through the ICS. In Section 4.8, we provide examples of task sets that cannot be guaranteed to meet their deadlines using the Priority Ceiling Protocol, but are feasible if interruptible locks are used.

4.2 Interruptible Critical Sections

We build our optimistic synchronization methods on Restartable Atomic Sequences (RAS) [11]. A RAS is a section of code that is re-executed from the beginning if a context switch occurs while a process is executing in the code section. The re-execution of a RAS is enforced by the kernel context-switch mechanism. If the kernel detects that the process program counter is within a RAS on a context switch, the kernel sets the program counter to the start of the RAS. Bershad et al. show that an RAS implementation of an atomic test-and-set has better performance than a hardware
test-and-set on many architectures, and is much faster than kernel-level synchronization [11].
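To illustrate, a sketch of a RAS-based test-and-set follows, in C; how the region is registered with the kernel is system-specific, so the registration is only indicated by comments, and the function name is ours.

int ras_test_and_set(volatile int *lock)
{
    int old;
    /* ---- begin restartable atomic sequence (registered with the
            kernel; a context switch inside it restarts it here) ---- */
    old = *lock;       /* read the flag                            */
    *lock = 1;         /* final, committing write of the sequence  */
    /* ---- end restartable atomic sequence -------------------- */
    return old;        /* 0 means the caller acquired the lock     */
}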
We note that the idea of scheduler support for critical sections is well established. In 4.3BSD UNIX, a system call that is interrupted by a signal is restarted using the longjmp mechanism [56]. Anderson et al. [5] argue that the operating system support for parallel threads should recognize that a preempted thread is executing in a critical section, and execute the preempted thread until the thread exits the critical section. In addition, Moss and Kohler coded several of the run-time support calls of the Trellis/Owl language so that they could be restarted if interrupted [73].

The simple mechanism described in [11] is too crude for our purposes, because there is no guarantee that a conflicting operation was performed when other processes had control of the CPU. The unnecessary re-executions are not a problem for the critical sections described in [11], because those critical sections are very short and a re-execution is unlikely. In addition, the authors of [11] did not need to consider the predictability required by real-time systems. If the critical section execution occupies a large fraction of a time slice, then a context switch is far more likely. To guarantee progress, a process that is interrupted in its critical section execution should be restarted only if a conflicting operation was executed. We call a region of code that is protected in this manner an interruptible critical section (ICS). Restarting a critical section only if a conflicting operation was performed improves real-time schedulability, because a low priority task can experience restarts only from higher priority tasks that share a critical section, instead of from all higher priority tasks.
We indicate an interruptible critical section by explicitly declaring it so.

interruptible-critical-section {
    stmt1;
    ...
    stmtn;
}

As an example, we can implement a shared stack as an ICS by using the following code.

struct stack-elem {
    data item;
    struct stack-elem *next;
} *sp;

push(elem)
    struct stack-elem *elem;
{
    interruptible-critical-section {
        elem->next = sp;
        sp = elem;
    }
}

struct stack-elem *pop()
{
    struct stack-elem *temp;
    interruptible-critical-section {
        temp = sp;
        if (sp != NULL)
            sp = sp->next;
    }
    return(temp);
}



4.3 Implementing Interruptible Critical Sections

4.3.1 Background

The techniques used to write interruptible critical sections are based on the ideas developed for non-locking concurrent data structures. Herlihy [35] introduces the idea of non-blocking concurrent objects. An algorithm for a non-blocking object provides the guarantee that one of the processes that accesses the object makes progress in
a finite number of steps. Herlihy provides a method for implementing non-blocking objects that swaps in the new value of the object in a single write. Our methods are similar to an extension of Herlihy's work proposed by Turek, Shasha, and Prakash [97].
In the context of real-time synchronization, non-blocking shared objects are desirable because a high priority task will not be blocked by a low priority task. In a uniprocessor system, only one process at a time will access the shared data structures. We can take advantage of the serial but interruptible access to simplify the specification of the existing non-blocking techniques, and to improve on their efficiency.
In an interruptible critical section, a process can perform only one write that is visible to other processes. Furthermore, the globally visible write must be the last instruction in the protected region. Therefore, a process that is executing an ICS records its updates in a private buffer (the commit buffer). The final write commits the updates that are recorded in the buffer by setting a commit flag. Any subsequent process that executes the ICS performs the updates and clears the commit flag.

This approach to optimistic synchronization is discussed by Alemany and Felten [2] and by Bershad [10]. In this chapter, we discuss the following implementation details that do not appear in the previous work.

" Efficient implementation in a uniprocessor system.

" How to perform the bulk of the ICS processing outside of the kernel.

* How to share commit buffers among processes.

* How to use Herlihy's small-object protocol [35] to minimize the number of writes

that must be placed in the commit buffer.


* How to apply optimistic synchronization to real-time systems.











e An analysis of interruptible locks in a system of periodic tasks.

4.3.2 Implementation

In the following discussion, we assume that if a process experiences a context switch while executing an ICS, the process re-executes from the start of the ICS when it regains control of the CPU (as in [11]). In section 4.4, we discuss the modification necessary to permit re-execution only when a conflicting operation commits. The modification is minor, but the fully general algorithm would confuse the current discussion.

In [97], Turek et al. propose a method for transforming locking data structures into non-blocking data structures. The key to the transformation is to post a continuation instead of a lock. The continuation contains the modifications that the process intends to perform. If a process attempts to post a continuation but is blocked (because a continuation is already posted), the 'blocked' process performs the actions listed in the continuation, removes the continuation, then re-attempts to post its own continuation. As a result, a blocked process can unblock itself.

Although Turek's approach simplifies the process of writing a critical section, a direct translation of Turek's algorithm can require a high priority process to perform the work of many low priority processes that have posted but not yet performed their actions. An easy modification of Turek's approach results in a simple algorithm which guarantees that a high priority process does the work for at most one low priority process. We present an algorithm of an ICS based on this approach here. We note that one can write an ICS by a rather different approach, the details of which are contained in [44].

Every shared concurrent object has a single commit record, and a flag indicating whether the commit record is valid or invalid. When a process starts executing a
critical section, it checks to see if a previous operation left an unexecuted commit record (the flag is valid). If so, the process executes the writes indicated by the commit record, then sets the flag to invalid. The process then performs its operation, recording all intended writes in the commit record. For the decisive instruction, the process sets the flag to valid. A typical critical section has the following form.

struct commit-record-element {
    word *lhs, rhs;
} commit-record[MAX];
boolean valid;

critical-section()
{
    interruptible-critical-section {
        if (valid) {
            instruction = 0;
            while (instruction < MAX and
                   commit-record[instruction].lhs != NULL) {
                *(commit-record[instruction].lhs) = commit-record[instruction].rhs;
                instruction = instruction + 1;
            }
            valid = FALSE;
        }
        calculate modifications;
        load modifications into commit-record;
        valid = TRUE;
    }
}



For example, the following code inserts a record in a doubly linked list. Other list operations are similar.

struct list-elem {
    data item;
    struct list-elem *forward, *backward;
} *head;

struct commit-record-element {
    word *lhs, rhs;
} commit-record[2];
boolean valid;

insert(elem)
    struct list-elem *elem;
{
    struct list-elem *prev, *next;
    interruptible-critical-section {
        if (valid) {
            instruction = 0;
            while (instruction < 2 and
                   commit-record[instruction].lhs != NULL) {
                *(commit-record[instruction].lhs) = commit-record[instruction].rhs;
                instruction = instruction + 1;
            }
            valid = FALSE;
        }
        prev = NULL; next = head;
        while (not found_position(next)) {
            prev = next; next = next->forward;
        }
        // Found the insertion point
        elem->forward = next; elem->backward = prev;
        if (prev == NULL)
            commit-record[0].lhs = &head;
        else
            commit-record[0].lhs = &(prev->forward);
        commit-record[0].rhs = elem;
        if (next != NULL) {
            commit-record[1].lhs = &(next->backward);
            commit-record[1].rhs = elem;
        }
        else
            commit-record[1].lhs = NULL;
        valid = TRUE;
    }
}



The transformation from a blocking-based critical section to an ICS is straightforward. The cleanup phase is inserted at the beginning of the critical section. Whenever a write is performed into global data in the blocking-based critical section, the write is recorded in the commit record in the ICS. The last statement of the ICS is to set valid to TRUE. If operations perform few writes, then a high priority task performs at most a few instructions on behalf of a low priority task. Further, the costs balance because the high priority task leaves the commit record for a different task to execute. In a blocking-based approach, the high priority task would incur a context
switch, thus costing the context switch overhead and also overhead due to cache line invalidations.

4.3.3 Reducing the Clean-up

If the critical section requires a small modification (or can be broken into several sections, each requiring only a small modification), then the basic approach allows a low priority operation to block a high priority operation for only a short period. If an operation performs a substantial modification and the number of modifications that an operation commits might vary widely, then a high priority operation might spend a substantial amount of time performing a low priority operation's updates to the data structure.

In [33], Herlihy proposes a 'shadow-page' method for implementing a non-blocking concurrent data structure. An operation calculates its modifications to the data structure in a set of privately allocated (shadow) records, then links its records into the data structure with its decisive instruction. The process is illustrated in Figure 4.1. The blocks in the data structure marked 'g' are replaced by the shadow blocks. An operation performs its decisive instruction by swapping the anchor pointer from the current root to the shadow root. The blocks that are removed from the data structure are garbage collected by the successful operation and are (eventually) made available to other operations. We note that the decisive instruction must always be to swap the anchor, in order to ensure serializability in a parallel system.

The most complicated part of Herlihy's protocol is managing the garbage-collected records. The protocols are complex, and require O(P^2) space, where P is the number of processes that access the shared object. We can take advantage of the serial access to the data structure in the ICS to simplify the implementation and reduce the space overhead.




























Figure 4.1. Herlihy's non-blocking data structures

The process of implementing a shadow-page ICS is illustrated in Figure 4.2. A process obtains the records it needs to prepare its modifications from a global stack of records. The global record stack provides the records for all operations that use records of the size it stores. When a process obtains a record from the global stack, it does not remove the record from the stack. Instead, the modifications are made to records while they are still on the stack. A local variable, current, keeps track of the last allocated record from the record stack. Another pair of local variables, g-head and g-tail, keep track of the records to be removed from the data structure. To commit the modification to the data structure, the operation must remove the records it used from the stack of global records, add the garbage records to the global stack, and adjust a pointer in the data structure. These three modifications can be performed using a regular commit record.

Before listing the procedures to implement the shadow-page ICS, we note a couple of details. First, every record in the data structure must contain enough additional space to thread a list through it, whether the garbage list or the global record stack.













Figure 4.2. Shadow-page ICS

Second, the critical instruction of the operation is to declare that the commit record is valid. As a result, the commit record can contain instructions to change any links in the data structure. As an example, in Figure 4.2, a link from the root instead of the anchor is modified.

We assume that every record has a field next that is used to thread the global record and the garbage lists through the nodes. The procedure for acquiring a new record is

record *getbuf(record **current)
{
    record *temp

    temp = *current
    *current = (*current)->next
    return temp
}
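Note that getbuf advances only the local current pointer; the global pool pointer is updated only when the commit record is applied, so an interrupted operation leaves the global record stack unchanged.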

The procedure to declare that a node is garbage is given by

garbage(record *elem, record **g_head, record **g_tail)
{
    if (*g_tail == NULL)
        *g_tail = elem
    elem->next = *g_head
    *g_head = elem
}

A typical critical section is given by












struct commit_record_element {
    word *lhs, rhs;
} commit_record[3];
boolean valid;
Global record *pool;

critical_section()
{
    record *current, *g_head, *g_tail

    restartable {
        // Clean-up phase
        if (valid)
            instruction = 0
            while (instruction < 3 and
                   commit_record[instruction].lhs != NULL)
                *(commit_record[instruction].lhs) =
                    commit_record[instruction].rhs
                instruction++
            valid = FALSE

        // Initialize the list pointers
        current = pool
        g_head = g_tail = NULL

        Compute the modifications to the data structure
        using the getbuf and garbage procedures

        // Prepare the commit record
        commit_record[0].lhs = &(g_tail->next)
        commit_record[0].rhs = current
        commit_record[1].lhs = &pool
        commit_record[1].rhs = g_head
        commit_record[2].lhs = critical_link
        commit_record[2].rhs = critical_link_value

        valid = TRUE    // commit your update
    }
}



The shadow-page ICS requires that a high priority operation perform at most three writes on behalf of a low priority operation when the shared data structure is a tree. Arbitrary graph structures might require more updates, but the technique applies similarly. Since a high priority operation does not perform its own clean-up, the costs balance, and again the high priority task avoids the context switch overhead.

The space requirements for a shadow-page ICS are independent of the number of competing processes, as the global pool must be initialized with enough records to allow the data structure to reach its maximum size, plus the number of records in the largest modification. Furthermore, the global pool can be shared among several data structures (in which case they must share a commit record). The linked list that is threaded through the data structure imposes an O(1) space penalty on every node in the data structure.

4.4 System Support

If an interruptible critical section is to be efficient, then a process executing one should be restarted only if a conflicting operation occurs. Thus, information about critical section executions must be transmitted to the kernel. In this section, we describe a simple and efficient method of providing kernel-level support for interruptible critical sections.

An operating system must have a small context switch overhead to achieve good performance. Thus, the context-switch time support for an ICS must be kept to a minimum. However, we would like to make the mechanism as flexible as possible. In addition, we would like to avoid making kernel traps to set up a request for critical section entry.

To be efficient, information about conflicting executions must be passively transmitted to the kernel. With every critical section, we associate an execution count, cs_count. When a process enters a critical section, it reads cs_count into a local variable, process_count. When the process completes the critical section, it increments cs_count. Thus, the kernel can detect that a conflicting operation occurred when the process_count of the switched-in process differs from the cs_count of the critical section being executed. We use this mechanism as the basis of our context switch support for the ICS.

Every critical section has a control block with the following information

1. The starting and ending location of the critical section code.

2. The cs_count.

3. Additional structures necessary to implement interruptible critical sections.

Every process that executes interruptible critical sections has a block of memory in user/kernel space that contains the following variables.

1. A flag that is set if the process is executing an interruptible critical section.

2. A pointer to the critical section control block.

3. The process_count.

On a context switch, the kernel executes the following code before giving control to the switched-in process.

If the next process to run is executing an ICS
    Find the critical section control block
    If the program counter of the next process to run
            is inside the ICS
        If cs_count != process_count
            Set the program counter of the next process to run
                to the start of the ICS

To take advantage of the kernel mechanism, the process loads cs_count into process_count before entering the ICS, and increments cs_count before exiting the ICS. The following is a first attempt at writing the entry and exit code for an ICS.











// Entry code
Make the process' control block point to
    the critical section control block
Set the flag in the process' control block
process_count = cs_count
BeginICS: ....              // start of the ICS

// Exit code
EndICS: cs_count++
Reset the flag in the process' control block


The problem with the above entry and exit code is that it does not cooperate with the code that implements the interruptible critical section. The ICS expects that the last instruction in the restartable region sets valid to TRUE. If valid is set before cs_count is incremented, then an incorrect execution can result (either an operation is executed twice, or a committed operation is ignored). If cs_count is incremented before valid is set, then a process can cause itself to restart. We can avoid these race conditions by having a single write that both commits the operation and increments cs_count. With each critical section, we associate a second execution counter, aux_count, that normally has the same value as cs_count. The last instruction in the restartable region increments cs_count. A process can detect that an operation has recently committed by testing aux_count and cs_count for equality. If they are different, the process performs the writes of the previous operation. The process signals that all of the updates are performed by setting aux_count = cs_count.

There is one remaining problem. When two operations execute concurrently, they interfere when they record their writes in the commit record. If the system uses strict priority scheduling, a high priority operation will overwrite the concurrent lower priority process' updates to the commit record, then force the lower priority process to restart. If the executions of the two operations can be interleaved, then they must have their own commit records to record their updates. But then, when an operation commits, it must indicate which commit record contains the update. This is done by incrementing cs_count by the commit record index when committing.

The new exit code is

// Exit code
EndICS: cs_count += process_number
Reset the flag in the process' control block

The code in the ICS to detect and perform a committed operation's updates is

index = cs_count - aux_count
if (index != 0)
    instruction = 0
    while (instruction < n and
           commit_record[instruction][index].lhs != NULL)
        *(commit_record[instruction][index].lhs) =
            commit_record[instruction][index].rhs
        instruction++
    aux_count = cs_count

where n is the number of entries in a commit record.

As an optimization, we note that an operation that queries the data and performs no updates does not need to force other operations to restart. Queries can be implemented using the same methods as for updating operations, except that queries do not modify cs_count.

4.5 Interruptible Locks

Interruptible critical sections let high priority operations execute quickly at the expense of making low priority operations execute slowly. If too many tasks are allowed to enter a critical section without blocking, low priority tasks might experience an excessive number of restarts, increasing their response time and decreasing the schedulability of the task set. We can reduce the unpredictability of the low priority operations by letting only the highest priority tasks execute the critical section without locking. Moderate to low priority tasks must acquire a semaphore to execute in the critical section. As our results section shows, this greatly improves the predictability of a set of real-time tasks. Furthermore, tasks that must acquire a semaphore can be required to use the priority ceiling protocol. Our analysis section shows that a combination of interruptible locks and the priority ceiling protocol improves the schedulability of the low priority tasks.

The entry and exit code is changed to the following (we assume here that lower priority numbers mean lower priorities).

// Entry code
Make the process' control block point to
    the critical section control block
Set the flag in the process' control block
process_count = cs_count
if (my priority < cutoff_priority) P(S)
BeginICS: ....              // start of the ICS

// Exit code
EndICS: cs_count += process_number
if (my priority < cutoff_priority) V(S)
Reset the flag in the process' control block


Interruptible locks also reduce the space requirements for an ICS with multitasking processes. Since the processes which set a lock will not execute concurrently, they can share a commit record. In a typical use of interruptible locks, only one very high priority process will be able to interrupt the lock, so only two commit records are needed.

4.6 Implementation

We implemented ICS support in a VMEexec [76] system development environment with a pSOS+ [75] real-time, multi-tasking operating system kernel. The VMEexec system consists of a host VMEmodule running the SYSTEM V/68 operating system and a set of VMEmodule target processors running the pSOS+ kernel. In our configuration, we have six MVME147 VMEmodules based on the Motorola MC68030, with 4 MB of shared memory on each module. One VMEmodule is used as the host processor running SYSTEM V/68, and the rest are real-time target processors running the pSOS+ kernel. For the experiments described in this chapter, we made use of only one of the target processors.

pSOS+ is a real-time, multi-tasking kernel that supports multiprocessors. It provides a rich set of system services including task management, shared-memory regions, synchronous/asynchronous signals, semaphores, and messages. One particular feature that pSOS+ supports is user-written routines that can be called at the start of a task, during a context switch, and at the end of a task. This feature allows us to implement ICS support without modifying the kernel.

We use two data structures to implement the ICS: one for the critical section and one for each task that uses the critical section. The global lock structure consists of a critical section identifier, a counter that tracks the number of times the critical section has been executed, and the critical section bounds.


struct ICS_Lstruct {
    int  id;         /* ID of this critical section */
    int  cs_count;   /* Global Execution Count */
    char *cs_low;    /* CS Low Address */
    char *cs_high;   /* CS High Address */
};

The structure local to a task consists of the copy of the ICS execution count, a count of the number of times the critical section is retried on any invocation (for statistics), a pointer to the ICS_Lstruct, and a flag to indicate that the task is entering the critical section.












struct ICS_Tstruct {
    int process_count;         /* Local Execution Count */
    struct ICS_Lstruct *ilp;   /* Interruptible Lock Record Pointer */
    int icount;                /* Interrupt Count of a task */
    int flag;                  /* Flag = ID => In CS; = 0 => Not */
};



The ICS implementation code consists of two parts: the ICSctxsw routine, which provides the ICS lock mechanism, and the ICSclient task, which uses the ICS mechanism. We have already discussed the algorithms used by the ICSclient task in Section 4.4.

4.6.1 ICSctxsw Routine

The ICSctxsw routine is integrated with the pSOS+ kernel as a user-written routine that is called during a context switch. The call occurs at the point where the context of the switched-out task has been completely saved, and before the context of the switched-in task is loaded. pSOS+ provides the addresses of the Task Control Blocks (TCBs) of both the switched-in task and the switched-out task in machine registers. The TCB contains all the context of a task, including the Program Counter (PC). ICSctxsw can reset the PC in the TCB of a switched-in task, if required. pSOS+ provides a set of eight software-defined user registers (UREGs) that a task can access in the TCB. User register 0, UREG0, is used to contain the address of the ICS_Tstruct of a task using the ICS.

ICSctxsw()
{
    struct tcb *in_tcb;
    struct ICS_Tstruct *tlp;

    load in_tcb from the machine register;
    tlp = Get UREG0 from in_tcb;

    if (tlp != NULL && tlp->flag == LOCKID)
    {
        if (in_tcb->currpc >= tlp->ilp->cs_low &&
            in_tcb->currpc <= tlp->ilp->cs_high)
        {
            /* A conflicting execution occurred: force a restart */
            if (tlp->process_count != tlp->ilp->cs_count)
                in_tcb->currpc = tlp->ilp->cs_low;
        }
    }
}

ICSctxsw checks whether the program counter (PC) of the task about to be run is within the critical section region, and if so, applies the restart criterion. If the criterion is met, the task's PC is reset to the beginning of the critical section, forcing it to re-execute the critical section. Otherwise, the task is allowed to continue.

4.6.2 User-level Entry and Exit

The ICS entry and exit code that is used in conjunction with the ICSctxsw routine must (in general) be kernel calls, because the task control block is modified. To permit user-level synchronization, the entry and exit calls must be designed so that bad parameters cannot be passed.

Instead of storing the critical section ICS control blocks (ICS_Lstruct) in arbitrary locations, they are stored in an array in kernel space. Registering an ICS requires a call to fill in one of the control blocks. In the task ICS control block (ICS_Tstruct), we store the index of the control block of the ICS that is being executed instead of a pointer to it (or 0 if no ICS is being executed). In the context switch routine (ICSctxsw), the index to the ICS control block is used in place of the reference. If the number of allowed ICS control blocks is a power of 2, then bounds checking can be done by masking out the high order bits of the index in the task ICS control block, as the sketch below illustrates. An invalid index causes no problems, since the PC won't be in the specified range.
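As a minimal sketch of this bounds check, assume MAX_ICS control blocks (a power of 2) kept in a kernel-space array; the names MAX_ICS, ics_table, and lookup_ics are illustrative, not taken from the implementation.

    #define MAX_ICS 16                       /* must be a power of 2 */

    struct ICS_Lstruct ics_table[MAX_ICS];   /* registered ICS control blocks */

    struct ICS_Lstruct *lookup_ics(unsigned index)
    {
        /* Masking keeps any index in bounds without a compare-and-branch.
           A garbage index merely selects some valid slot, which is harmless:
           the saved PC will not fall within that critical section's bounds. */
        return &ics_table[index & (MAX_ICS - 1)];
    }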











4.7 Experimental Performance Results

We tested the performance of interruptible locks on a shared priority queue. There are three low priority enqueue tasks (of equal priority) and a single high priority dequeue task. This experiment corresponds to several computational tasks providing data for a high priority I/O task. All four tasks are started under the control of a low priority parent task. The parent and the tasks communicate through message queues.

We compared four types of mechanisms.

1. Interruptible Critical Sections: All tasks immediately enter the ICS.

2. Interruptible Locks: The enqueuing tasks set and release a semaphore; the dequeuing task does not.

3. Non-prioritized Semaphore Locks: All of the tasks acquire a semaphore before entering the critical section. The semaphore lock is granted on a FCFS basis.

4. Prioritized Semaphore Locks: Same as the above, but the semaphore is granted on a priority basis.


Parameters. In the first experiment, each task performs 10,000 enqueue (dequeue) operations, but we stop collecting statistics after any task completes. Each enqueue task spins for 7 ticks (about 70 ms), then executes a 1 tick critical section. The time quantum for a task is 2 ticks. The dequeue task sleeps for 10 ticks before entering a 1 tick critical section.

We collect the time to execute a critical section, and we create histograms that show the frequency that the critical section execution takes a particular amount of time. The performance of the non-prioritized semaphore is shown in Figure 4.3.











The dequeue operation sometimes experiences a long delay, and the time to execute enqueue operations is moderate. Since pSOS+ offers prioritized semaphores, a fairer comparison should use them. This data is shown in Figure 4.4. There is a slight decrease in the dequeue and enqueue response times, but the dequeue operation still experiences a long delay a few times. In Figure 4.5, we show the time to execute the enqueue and dequeue critical sections using interruptible critical sections. The dequeue operation is always performed without delay, and the enqueue operations perform as well as when using the prioritized semaphores. In Figure 4.6, we use interruptible locks. The performance is comparable to the interruptible critical sections.

Using an ICS can cause poor performance among low priority tasks if the critical sections have a high utilization. In the second experiment, the enqueue task spins for 2 ticks instead of 7 ticks, and then executes a 4 tick critical section. The dequeue task sleeps for 20 ticks and executes a 1 tick critical section. These parameters are selected to exaggerate the conflicts among the tasks, to show the restart problems that using an ICS can cause.

Figure 4.7 shows the time to execute the enqueue and dequeue critical section using interruptible critical sections. We note that the scale on this chart is nonlinear. The dequeue operation is again performed without delay, but the enqueue operation can take an extremely long time to execute. In contrast, Figure 4.8 shows the usage of interruptible locks in which the time to execute an enqueue operation is moderate.

Comparing the interruptible lock and the prioritized semaphore implementations for the critical sections, we find that the interruptible lock eliminates the delay in executing the high priority critical section, while adding only a small delay (in this case about 20%) to the time to execute a low priority critical section.











The significance of these experiments is not that the average response time of the high priority dequeue operation is reduced, but that the response times of the dequeue operations become predictable. In the low-conflict experiment, the dequeue operation usually completes immediately, but on occasion requires 5 ticks when a prioritized semaphore is used. This unpredictability might cause timing problems. We note that the priority ceiling protocol would provide the same performance as the prioritized semaphore does (since there are no other critical sections), but at the cost of a more complex and expensive scheduler and synchronization mechanism. Interruptible locks always allow the dequeue operation to finish immediately, even under a very high load.

To test the overhead of using interruptible critical sections, we ran experiments to time the overhead in the context switch code and in the ICS entry code. In the first experiment, we enter and exit an (empty) critical section 10,000 times. We found that acquiring and releasing a semaphore adds 57 ticks to the program execution time. Entering and exiting an ICS requires 67 ticks if the entry and exit code is contained in a system call, and 1 tick if the entry and exit code are user code. In the second experiment, we force 10,000 context switches. We find that the unavoidable context switches by themselves require 58 ticks, and the ICS callout code adds 9 ticks of overhead. These numbers are for the current implementation of ICS and interruptible locks. An implementation that is more tightly integrated into the kernel will require less overhead. For example, if the context-switch code is part of the kernel, then there is no need for the callout routine overhead and we would have faster access to the program counter.











4.8 Analysis

Hard real-time systems require timing guarantees, and for this reason one typically considers periodic task sets. In this section, we show how to analyze a periodic task set that uses an ICS or an interruptible lock for synchronization, and derive worst-case response times.

The set of tasks is {τ_i}. We use the convention that τ_i has a higher priority than τ_j if i < j. Each task τ_i has a period T_i and a worst case execution time C_i. An instantiation of task τ_i is released at the beginning of its period, and has for a deadline the release of the next instantiation of the task. If all tasks can always complete before their deadlines, then the task set is feasible. A real-time system with periodic tasks is typically scheduled using the Rate Monotonic algorithm, which gives static preemptive priority to tasks with shorter periods. Rate Monotonic scheduling is well studied, and the feasibility of a task set can be exactly characterized. Let r_i be the worst case response time of task τ_i. If the tasks do not access critical sections, then r_i is the fixed point of the following recursive equation [48]:

    r_i = C_i + Σ_{j<i} ⌈r_i/T_j⌉ C_j                                  (4.1)

The deadline of the task can be less than the task period, perhaps significantly less. If the task deadlines are shorter than the task periods, then the Deadline Monotonic algorithm is the optimal static scheduler [58].
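Equation 4.1 is usually solved by iterating to a fixed point. The following is a minimal C sketch under our own naming (response_time is not from the dissertation); tasks are indexed in priority order with index 0 the highest, and in practice the iteration is abandoned once r exceeds the task's deadline.

    #include <math.h>

    /* Fixed-point iteration for equation 4.1: C[] holds worst case
       execution times, T[] holds periods, for tasks 0..i in priority order. */
    double response_time(int i, const double C[], const double T[])
    {
        double r = C[i], prev;

        do {
            prev = r;
            r = C[i];
            for (int j = 0; j < i; j++)
                r += ceil(prev / T[j]) * C[j];   /* preemptions by task j */
        } while (r != prev);                     /* stop at the fixed point */

        return r;
    }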

If the tasks can access critical sections and thus experience blocking, then the maximum blocking time is added to the above r_i value. If the Priority Ceiling Protocol is used, a task will block for at most the duration of one critical section. If the tasks use interruptible locks, then there is an additional re-execution component that must be added to the response times. We next compute the time wasted due to re-executions of critical sections.

We assume that interruptible locks are used in conjunction with the Priority Ceiling Protocol. So, for each critical section, there is a (possibly empty) set of tasks that enter the critical section without blocking, and another (possibly empty) set of tasks that acquire a semaphore before entering the critical section.

The tasks are described by their periods T_i, their execution times (in the absence of concurrency) C_i, the set of critical sections that they access Z_i, the execution times in their critical sections b_{i,z}, and their deadlines D_i. We assume that task priorities are assigned according to the length of D_i. Let Z be the set of critical sections (Z = ∪_i Z_i), and for z ∈ Z, let τ^z_1, τ^z_2, ..., τ^z_{n_z} be the set of tasks that access z, in order of priority. Of the n_z tasks, the l_z highest priority tasks enter z without blocking, and the remaining n_z − l_z acquire a semaphore using the Priority Ceiling Protocol. We define I(z) to be the highest numbered task that enters z without blocking (i.e., τ_{I(z)} = τ^z_{l_z}).

Let us first consider a couple of simple examples. Suppose that we have a set of three tasks that each access the semaphore z. The characteristics of the tasks are listed in Table 4.1. Note that this task set cannot be guaranteed to meet its deadlines if the Priority Ceiling Protocol is used, as C_1 + max_i(b_{i,z}) > D_1.












Table 4.1. Example task description 1 for ICS

task   T_i     C_i      Z_i   b_{i,z}   D_i
τ_1    10 ms   2.5 ms   z     1 ms      3 ms
τ_2    15      5        z     1         10
τ_3    30      4        z     1         28



If the semaphore z is protected by an ICS, then task τ_1 can always meet its deadline because C_1 < D_1. The question is whether the remaining tasks can meet their deadlines. To analyze task τ_2, we observe that every time τ_1 interrupts τ_2, the result can be a re-execution of z. So, to determine the response time r_2 of τ_2, we modify equation 4.1 to incorporate the re-execution time.

    r_2 = C_2 + ⌈r_2/T_1⌉ (C_1 + b_{2,z})                              (4.2)

By solving equation 4.2, we find that r_2 = 8.5 < D_2. To determine r_3, we observe that both τ_1 and τ_2 can cause a re-execution in τ_3. However, an execution of τ_1 can either cause a re-execution in τ_2 or a re-execution in τ_3, but not both. In either case, r_3 is increased by one execution of z. Therefore, the formula for r_3 is

    r_3 = C_3 + ⌈r_3/T_1⌉ (C_1 + b_{2,z}) + ⌈r_3/T_2⌉ (C_2 + b_{3,z})  (4.3)

We find that r_3 = 26.5 < D_3, so τ_3 can always meet its deadline. We conclude that the task set in Table 4.1 is feasible if synchronization is done with an ICS, but is infeasible if the synchronization is done using the Priority Ceiling Protocol.

We can observe a general method for computing worst-case response times of the tasks, if all critical sections are protected by ICSs only. We define

    b(z; j, i) = max{ b_{k,z} : j < k ≤ i, z ∈ Z_k }    if z ∈ Z_j
               = 0                                       otherwise











Thus, b(z; j, i) is the longest critical section that can be interrupted by τ_j and increase the response time of τ_i.

Next, we need to determine the increase in the response time of task τ_i caused by executions of τ_j. If τ_j causes a re-execution of critical section z, then only one critical section must be re-executed because of τ_j. If τ_k and τ_l are executing z when τ_j executes z (j < k < l), then both τ_k and τ_l must re-execute z. However, τ_l would need to re-execute anyway, because of τ_k's execution. Furthermore, all tasks with a lower priority than τ_k have their response time increased by τ_k's re-execution time. Therefore, the increase in τ_i's response time due to an execution of τ_j is



    b_tot(j, i) = max_{z∈Z} b(z; j, i)

Then, the response time of task τ_i is the solution of

    r_i = C_i + Σ_{j<i} ⌈r_i/T_j⌉ (C_j + b_tot(j, i))                  (4.4)
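In terms of the fixed-point iteration sketched after equation 4.1, only the inner loop changes: each preemption is also charged with the re-execution term (btot(j, i) here is a hypothetical helper implementing b_tot):

    r += ceil(prev / T[j]) * (C[j] + btot(j, i));   /* equation 4.4 */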

Table 4.2. Example task description 2 for ICS

task   T_i     C_i      Z_i    b_{i,z}   D_i      r_i
τ_1    20 ms   2.5 ms   X      1 ms      5.5 ms   2.5 ms
τ_2    20      2.5      Y      1         5.5      5.0
τ_3    30      5        X      1         15       11
τ_4    40      4        Y      1         25       16
τ_5    50      4        X, Y   1, 1      30       29
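As a check, consider τ_3 in Table 4.2. We have b_tot(1, 3) = b_{3,X} = 1 (τ_1 and τ_3 share X) and b_tot(2, 3) = 0 (τ_2 accesses only Y), so equation 4.4 gives r_3 = 5 + ⌈11/20⌉(2.5 + 1) + ⌈11/20⌉(2.5 + 0) = 11, matching the table.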











To incorporate a mixed system that uses both the ICS and the PCP techniques, we need to modify equation 4.4 to account for blocking and the restricted interruption. In particular, τ_j can cause re-executions only if it does not block before executing critical section z.

We define

    b_p(z; j, i) = max{ b_{k,z} : j < k ≤ i, z ∈ Z_k }    if z ∈ Z_j and j ≤ I(z)
                 = 0                                       otherwise



    b_ptot(j, i) = max_{z∈Z} b_p(z; j, i)

Since all tasks that access z numbered larger than I(z) use the Priority Ceiling Protocol, we must calculate the worst-case execution time of a critical section that can be locked, BP(z). In particular, we must account for the possible re-executions of z.


    BP(z) = max{ ⌈r_j/T_i⌉ b_{j,z} : z ∈ Z_i ∩ Z_j, i ≤ I(z) < j }    if I(z) > 0
          = 0                                                          if I(z) = 0

    B(i) = max{ BP(z) : z can block τ_i }


Then, the response time of task τ_i is the solution of

    r_i = C_i + B(i) + Σ_{j<i} ⌈r_i/T_j⌉ (C_j + b_ptot(j, i))          (4.5)











In Table 4.3, we show an example analysis of a task set. The column labeled r_i (ICS) shows the worst-case response times when interruptible critical sections are used for synchronization, and the column labeled r_i (ICS + PCP) shows the worst-case response times when interruptible locks are used. For the interruptible locks, we assume that tasks τ_1 and τ_2 never block, but tasks τ_3 through τ_8 acquire a lock using the Priority Ceiling Protocol. We observe that the task set cannot meet its deadlines if only the PCP is used. Furthermore, task τ_8 cannot be guaranteed to meet its deadline when only the ICS is used. However, a combination of interruptible locks and the priority ceiling protocol lets all tasks meet their deadlines. We observe that interruptible locks penalize intermediate-priority tasks, due to the possible blocking waits, B(i). However, the response time of high priority tasks is reduced because of the reduced rate of critical section interruptions.


Table 4.3. Example task description 3 for ICS

task   T_i      C_i    Z_i   b_{i,z}   D_i      r_i (ICS)   r_i (ICS + PCP)
τ_1    25 ms    3 ms   X     1 ms      6.5 ms   3 ms        3 ms
τ_2    25       3      Y     1         6.5      6           6
τ_3    30       3      X     1         15       10          12
τ_4    30       3      Y     1         20       14          16
τ_5    30       3      X     1         30       18          19
τ_6    30       3      Y     1         30       22          22
τ_7    100      3      X     1         80       49          45
τ_8    100      3      Y     1         80       86          48


4.9 Conclusion

We have presented methods for implementing interruptible critical sections (ICS), and for using them with interruptible locks. Interruptible critical sections use optimistic concurrency control instead of pessimistic concurrency control. If a process that is executing an ICS is interrupted and a conflicting operation commits, the conflicted process restarts its execution from the beginning of the critical section. In a real-time system, interruptible critical sections prevent priority inversion. In addition, the ICS mechanism is independent of the scheduling algorithm. We show how several recent ideas in non-blocking and uniprocessor synchronization can be synthesized to provide low-overhead interruptible critical sections.

We show how an ICS can be implemented in practice, and discuss our ICS implementation in the pSOS+ operating system. We find that the use of a prioritized semaphore can lead to unpredictable execution times of high priority tasks, while the use of an ICS allows the high priority task to always complete quickly.

Using interruptible critical sections alone can cause too many critical section re-executions, making low-priority tasks unschedulable. Interruptible critical sections can be used with locks to create interruptible locks. We show that when an interruptible lock is used, a low-priority task never blocks a high priority task, and the low priority tasks experience only a small degradation in execution time. Interruptible locks are appropriate when very time sensitive tasks must communicate with lower priority tasks through shared-memory data structures.
We present an analysis of a hard real-time periodic task set that synchronizes using interruptible critical sections. We show that if the highest priority tasks have very tight deadlines, then interruptible critical sections can improve the schedulability of the task set. Interruptible locks can be used in conjunction with the priority ceiling protocol. We show that using interruptible locks with PCP can improve schedulability over using ICS or PCP alone.












Figure 4.3. Response time distribution of the non-prioritized semaphore (FCFS semaphore locks; frequency vs. time to execute the critical section, in ticks, for Dequeue and Enqueue)

Figure 4.4. Response time distribution of the prioritized semaphore

Figure 4.5. Response time distribution of the Interruptible Critical Section

Figure 4.6. Response time distribution of the Interruptible Lock



Full Text

PAGE 1

SYNCHRONIZATION ALGORITHMS FOR REAL-TIME SYSTEMS By KRISHNA HARATHI A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY UNIVERSITY OF FLORIDA 1995

PAGE 2

To my father, Sri. Raghavan Harathi, my mother, Smt. Vasantha Lakshmi Harathi, my wife, Padmashree and my son, Ankith Nitesh.

PAGE 3

ACKNOWLEDGEMENTS I take this opportunity to express my sincere thanks to Dr. Theodore Johnson, my mentor and chairman of my supervisory committee, for his guidance, encouragement, and patience. He was always there when I went to him for help. I am sure I would not have attempted this work, if not for him. I am grateful to Dr. Richard NewmanWolfe, who was also the chairman of my master's supervisory committee, for the very many fruitful discussions I had with him about everything, be it academic or otherwise. I am indebted to Dr. Randy Chow for his friendly and thoughtful advice. Dr. Yann-Hang Lee commands my respect for his to-the-point critique. I am fortunate to associate with Dr. Paul Avery for his external points of view. Special thanks go to my beloved wife, Padmashree, for her support, encouragement, and the patience that made this work possible. Thanks also go to my son Ankith who turns one, when I am done. A bundle of joy at home is a heaven to go to after a long and hard night's work! iii

PAGE 4

TABLE OF CONTENTS ACKNOWLEDGEMENTS iii LIST OF TABLES vii LIST OF FIGURES viii ABSTRACT xi CHAPTERS 1 INTRODUCTION 1 1.1 Goals 1 1.2 Pessimistic Synchronization 3 1.3 Optimistic Synchronization 3 1.4 Dissertation Structure 4 1.5 Conclusion 5 2 RELATED RESEARCH 6 2.1 Introduction 6 2.2 Real Time Scheduling 6 2.2.1 Introduction 6 2.2.2 Rate Monotonic Scheduling 8 2.2.3 Static Scheduling Algorithms 11 2.2.4 Dynamic Scheduling Algorithms 13 2.3 Priority Inheritance Protocols 14 2.3.1 Priority Inheritance Protocol 14 2.3.2 Priority Ceiling Protocol 15 2.3.3 Priority Ceiling Protocols with Abortable Critical Sections . . 16 2.3.4 Priority Inheritance Protocols for Multiprocessors 16 2.4 Synchronization in Multiprocessors 18 2.4.1 Pessimistic Synchronization 18 2.4.2 Optimistic Synchronization 22 2.4.3 Synchronization through Message passing 23 2.5 Conclusion 23 iv

PAGE 5

3 A PRIORITIZED MULTIPROCESSOR SPIN LOCK 25 3.1 Introduction 25 3.1.1 Previous Work 27 3.2 PR-Lock Algorithm 29 3.2.1 Assumptions 30 3.2.2 Implementation 31 3.3 Correctness of PR-Lock Algorithm 40 3.4 Extensions 47 3.4.1 Multiple Locks 47 3.4.2 Backing Out 48 3.5 Simulation Results 49 3.6 Conclusion 51 4 INTERRUPTIBLE CRITICAL SECTIONS 55 4.1 Introduction 55 4.2 Interruptible Critical Sections 58 4.3 Implementing Interruptible Critical Sections 60 4.3.1 Background 60 4.3.2 Implementation 62 4.3.3 Reducing the Clean-up 65 4.4 System Support 69 4.5 Interruptible Locks 72 4.6 Implementation 73 4.6.1 ICSctxsw Routine 75 4.6.2 User-level Entry and Exit 76 4.7 Experimental Performance Results 77 4.8 Analysis 80 4.9 Conclusion 85 5 EXTENDING INTERRUPTIBLE CRITICAL SECTIONS TO MULTIPROCESSORS 90 5.1 Introduction 90 5.2 ICSM with Lock Release (ICSM-R) 93 5.2.1 ICSM-Rctxsw Routine 95 5.2.2 ICSM-Rclient Tasks 97 5.2.3 ICSM-R Performance Analysis 97 5.3 ICSM with Task Kill (ICSM-K) 121 5.3.1 Experimental Performance Results 123 5.4 ICSM with Priority Queue (ICSM-P) 129 5.4.1 Implementation 130 5.4.2 Correctness of the ICSM-P Algorithm 133 5.4.3 Experimental Performance Results 145 5.4.4 ICSM-P algorithm with single word CAS 147 5.5 Conclusions 150 v

PAGE 6

6 CONCLUSIONS 154 6.1 PR-Lock Algorithm 155 6.2 Interruptible Critical Sections on Uniprocessors 156 6.3 Interruptible Critical Sections on Multiprocessors 156 6.4 Final Words 157 REFERENCES 158 BIOGRAPHICAL SKETCH 165 VI

PAGE 7

LIST OF TABLES 4.1 Example task description 1 for ICS 82 4.2 Example task description 2 for ICS 83 4.3 Example task description 3 for ICS 85 5.1 Validating cycle time analysis using simulation for ICSM-R 109 5.2 Validating lock utilization analysis using simulation for ICSM-R ... 110 5.3 Performance comparison of ICSM-P algorithm for lock utilization of 25%, 50%, and 75% 148 5.4 Performance comparison of ICSM-P algorithm for lock utilization of 100% 149 vii

PAGE 8

LIST OF FIGURES 3.1 CAS used in the PR-Lock Algorithm 32 3.2 Data Structures used in the PR-Lock Algorithm 35 3.3 Queue data structure used in PR-Lock algorithm 36 3.4 Stages in the acquireJock operation 36 3.5 The acquireJock operation procedure 38 3.6 The release Jock operation procedure 40 3.7 Observed queue C before and after a releaseJock 43 3.8 Observed queue C before and after an acquireJock to an empty queue 44 3.9 Observed queue C before and after an acquireJock to a non-empty queue 44 3.10 A concurrent acquire Jock A ' succeeds before A 46 3.11 A concurrent releaseJock R succeeds before A 47 3.12 ReleaseJock R and acquireJock J 'succeed before A 48 3.13 Comparison of lock acquisition times 53 3.14 Comparison of lock release times 54 4.1 Herlihy's non-blocking data structures 66 4.2 Shadow-page ICS 67 4.3 Response time distribution of the non-prioritized semaphore 87 4.4 Response time distribution of the prioritized semaphore 87 4.5 Response time distribution of the Interruptible Critical Section .... 88 4.6 Response time distribution of the Interruptible Lock 88 viii

PAGE 9

4.7 Response time distribution of the Interruptible Critical Section for high lock utilization 89 4.8 Response time distribution of the Interruptible Lock for high lock utilization 89 5.1 The context switch routine for ICSM-R 96 5.2 The algorithm for a task using ICSM-R 98 5.3 ICSM-R Performance Results (Tw = 20 milli seconds) 100 5.4 Model of a cycle for a task using LOCK-NR 102 5.5 Model of a cycle for a task using ICSM-R 104 5.6 ICSM-R Cycle Times using analysis for Tc = 10 Ill 5.7 ICSM-R Cycle Times using analysis for Tc = 25 Ill 5.8 ICSM-R Cycle Times using analysis for Tc = 50 112 5.9 ICSM-R Cycle Times using analysis for Tc = 75 112 5.10 ICSM-R Cycle Times using analysis for Tc = 90 113 5.11 ICSM-R critical section utilization using analysis for Tc = 10 113 5.12 ICSM-R critical section utilization using analysis for Tc = 25 114 5.13 ICSM-R critical section utilization using analysis for Tc = 50 114 5.14 ICSM-R critical section utilization using analysis for Tc = 75 115 5.15 ICSM-R critical section utilization using analysis for Tc = 100 .... 115 5.16 ICSM-R effect of multiprogramming on cycle time for Tc = 25 (M = 1)116 5.17 ICSM-R effect of multiprogramming on cycle time for Tc = 25 (M = 4)116 5.18 ICSM-R effect of multiprogramming on cycle time for Tc = 25 (M = 8)117 5.19 ICSM-R effect of multiprogramming on cycle time for Tc = 25 (M = 12) 117 5.20 ICSM-R effect of multiprogramming on cycle time for Tc = 25 (M = !6) 118 5.21 ICSM-R effect of multiprogramming on cycle time for Tc = 50 (M = 1)118 ix

PAGE 10

5.22 ICSM-R effect of multiprogramming on cycle time for Tc = 50 (M = 4)119 5.23 ICSM-R effect of multiprogramming on cycle time for Tc = 50 (M = 8)119 5.24 ICSM-R effect of multiprogramming on cycle time for Tc = 50 (M = 12) 120 5.25 ICSM-R effect of multiprogramming on cycle time for Tc = 50 (M — 16) 120 5.26 The global data structures and the ICSM-K algorithm for the high priority task 124 5.27 The ICSM-K algorithm for the low priority task 125 5.28 Response time distribution of LOCK-NK (20% Lock Utilization) ... 127 5.29 Response time distribution of ICSM-K (20% Lock Utilization) .... 127 5.30 Response time distribution of LOCK-NK (100% Lock Utilization) . . 128 5.31 Response time distribution of ICSM-K (100% Lock Utilization) ... 128 5.32 CAS2 used in the ICSM-P Algorithm 131 5.33 Data Structures used in the ICSM-P Algorithm 132 5.34 The acquire Jock operation procedure for ICSM-P 134 5.35 A successful acquireJock operation for ICSM-P 135 5.36 The commit_release Jock operation procedure for ICSM-P 135 5.37 Observed queue C before and after a releaseJock 139 5.38 Observed queue C before and after a killJock 141 5.39 Observed queue C before and after an acquireJock to an empty queuel42 5.40 Observed queue C before and after an acquireJock to a non-empty queue 142 5.41 A concurrent acquireJock A ' succeeds before A 144 5.42 A concurrent releaseJock R succeeds before A 145 5.43 ReleaseJock R and acquireJock A ' succeed before A 146 5.44 The acquireJock ICSM-P with single word CAS 151 5.45 The commit_releaseJock ICSM-P with single word CAS 152

PAGE 11

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy SYNCHRONIZATION ALGORITHMS FOR REAL-TIME SYSTEMS By Krishna Harathi May 1995 Chairman: Dr. Theodore Johnson Major Department: Computer and Information Sciences Real time systems are becoming common and multiprocessors are also being used more often. There is an increasing need for building priority synchronization primitives for real-time systems. In this regard, we have designed a prioritized spin-lock mutual exclusion algorithm, the PR-Lock. The PR-Lock algorithm is characterized by a prioritized lock acquisition, a low release overhead, very little bus-contention, and well-defined semantics. While other prioritized spin-locks have been proposed, the PR-Lock has superior characteristics. Current methods for synchronization (including the PRLock) in real-time systems are pessimistic, and use blocking to enforce concurrency control. While protocols to bound the blocking of high priority tasks exist, high priority tasks can still be blocked by low priority tasks. In addition, these protocols require a complex interaction with the scheduler. We present a new approach to synchronization with special applicability to embedded and real-time systems. We propose Interruptible Critical Sections, an optimistic xi

PAGE 12

synchronization mechanism as an alternative to purely blocking methods. Practical optimistic synchronization requires techniques for writing interruptible critical sections, and system support for detecting critical section conflicts. We show how Interruptible Critical Sections can be used to design algorithms for synchronization in a real-time system. These algorithms vary depending on the environment considered and the techniques used. Our experimental performance results show that these algorithms reduce the variance in the response time of the highest priority task with only a small impact on the performance of the low priority tasks. We also present an analysis which shows that Interruptible Critical Sections improve the schedulability of task sets that have high priority tasks with tight deadlines. We extend the usage of Interruptible Critical Sections to multiprocessor systems under real-time and non-real time environments. Our performance evaluation shows that the algorithms perform well, making the Interruptible Critical Sections a feasible mechanism for synchronization. Xll

PAGE 13

CHAPTER 1 INTRODUCTION 1.1 Goals The broad objective of this research was to devise algorithms and data structures for synchronization on real-time systems. Most of the work done in synchronization is not based on priorities, and thus is not suitable for real-time systems. In this regard, we have designed a prioritized spin-lock algorithm, the PR-Lock. We also present a novel optimistic synchronization mechanism named Interruptible Critical Sections. We show how this mechanism can be used to design algorithms for synchronization in a real-time system. Real time systems are becoming common and multiprocessors are also being used more often. There is an increasing need for building priority synchronization primitives for real-time systems [91]. Even on a uniprocessor, mutual exclusion is necessary to protect shared data in an interleaved thread schedule. There are two areas of research that this research is trying to bridge: real-time systems and synchronization in uniprocessors as well as multiprocessors. Synchronization in multiprocessors includes concurrency control techniques which guarantee that a process executes an entire code section without being interrupted by another process. In real-time systems, each process has strict timing constraints and is associated with a priority indicating the urgency of that process [88]. This priority is used by the operating system to order the rendering of services among competing processes. 1

PAGE 14

2 Normally, the higher the priority of a process, the faster its request for services gets honored. Synchronization controls access to a shared resource, usually some data structure shared between processes. Synchronization on a shared-memory system of multiprocessors is an important operation, since an application process' speedup or throughput depends on the operation's efficiency. When synchronization primitives disregard priorities, a lower priority process can be granted access to a critical section among a set of competing processes. Thus, lower priority processes may block the execution of a process with a higher priority and a stricter timing constraint [80]. This priority inversion may cause the higher priority process to miss its deadline, leading to the failure of a real-time system. Priority Inheritance and Priority Ceiling protocols belong to a class of protocols that reduces the effect of priority inversion. These protocols bound the amount of time a higher priority process is blocked by a set of lower priority processes, making the execution time of a process more predictable. However, the protocols require prioritized semaphores. Priority synchronization algorithms ensure that higher priority processes gain access to the critical section before any of the competing lower priority processes. For processes using priority synchronization algorithms, more accurate, optimistic and predictable execution time estimates can be made. This, in turn, improves the schedulability of a set of processes in a real-time system. There are several software synchronization algorithms in existence, usually based on some hardware provided mechanisms. Synchronization algorithms can be classified broadly on the approach they take: pessimistic or optimistic.

PAGE 15

3 1.2 Pessimistic Synchronization In pessimistic synchronization algorithms, the use of a shared resource is guarded by a lock or regulated by a queue. This approach views processes as being in existence for the sole purpose of interfering with each other. Thus, by means of a lock, care is taken so that no other process is using the resource before allowing a requesting process to use it. Most of the current synchronization mechanisms are based on spin-locks [23, 30, 47, 52, 67, 81, 82], or queue-based locks [4, 28, 30, 67]. Current multiprocessor hardware design includes read-modify-write atomic instructions that assist this type of synchronization. As mentioned earlier, using these types of locks for real-time systems presents the problem of priority inversion because priority is not taken into consideration. There is a need to modify or redesign these algorithms for use in real-time systems. Not much work has been done in this direction. We discuss what has been done so far in chapter 2. Having designed synchronization protocols for real-time systems, there remain the issues of proving their correctness, analyzing their performance, weighing their costs, and implementing the algorithms. 1.3 Optimistic Synchronization A process using optimistic synchronization algorithms uses the shared resource with the assumption that no other process may be interfering with it. In the event that such a conflict occurs, it is detected, and the affected process starts a recovery mechanism. Recovery includes either re-executing the critical section all over again or preposting the computation for any other process to complete.

PAGE 16

4 Optimistic synchronization algorithms are suitable for use in real-time systems as they do not cause priority inversion, and there is no need to design any priority inheritance protocols. Only a small amount of research has been done on optimistic algorithms. Most of the work on optimistic synchronization algorithms is done on uniprocessors [5, 11, 73]. Also, processor architectures should have suitable instructions to support these algorithms [36, 42, 90], and much research needs to be done to formulate and evaluate the support needed [2, 10]. Also, these mechanisms need to be extended for multiprocessors. Here again, there is the issue of what hardware support is needed to implement these optimistic algorithms efficiently. We present a discussion on optimistic synchronization algorithms and related issues in chapter 4. 1.4 Dissertation Structure The rest of the dissertation is organized as follows. In chapter 2, we present a survey of the research in related areas, namely, real-time scheduling, priority inheritance protocols, and synchronization mechanisms in multiprocessors. More importantly, we show the effect on scheduling of not using priority synchronization algorithms. In Chapter 3, we describe the current results achieved on designing priority lock mechanisms based on pessimistic synchronization methods. Chapter 4 describes Interruptible Critical Sections that rely on optimistic synchronization. We also present an implementation of Interruptible Critical Sections for uniprocessor systems. Interruptible Critical Sections are extended to multiprocessors in Chapter 5. We summarize our research in Chapter 6.

PAGE 17

5 1.5 Conclusion There is a growing importance in supporting real-time systems on uniprocessors and multiprocessors. We argue the necessity for designing priority synchronization algorithms for this purpose. We demonstrate the viability of such an effort. The design effort is backed by sound theoretical and analytical reasoning, and practical implementation results.

PAGE 18

CHAPTER 2 RELATED RESEARCH 2.1 Introduction This chapter introduces previous research areas that are affected by prioritizing synchronization. Priority synchronization algorithms improve the schedulability of a set of processes in a real-time system. If priority is disregarded for synchronization, a lower priority process may block a higher priority process. In this context, we will study what schedulability is and how it is affected by blocking. We also present Priority Inheritance and Priority Ceiling protocols that reduce the effect of blocking. We classify synchronization mechanisms based on their characteristics. Finally, we present various, relevant synchronization mechanisms that are currently in use. 2.2 Real Time Scheduling 2.2.1 Introduction A real-time system is one that executes processes with time constraints [18], where a process (or a task) is defined as an individually schedulable entity. A hard real-time system is a real-time system in which every process's deadline is critical, i.e., the system fails if any one of its processes does not meet its deadline. In a soft realtime system, processes tolerate any possible delay in completion past their deadline. Usually, there is a penalty associated with the delay in completion, and longer the delay, the greater is the penalty. Therefore, the goal of a hard real-time system is to 6

PAGE 19

7 avoid any failure state, and that of the soft real-time system is to minimize the total penalty. In this discussion, we are not concerned with the type of real-time system, since our proposed algorithms do not distinguish between the two. The fundamental problem in a real-time system is to schedule processes so as to maximize the number of processes that meet their deadlines. The complexity of the problem depends on the complexity of the process model itself. A simplistic model is one in which we only know about the process priority, which is the importance of the process relative to all other processes. At the other end of the spectrum is a model of a process with resource requirements, concurrency constraints and input/output requirements, in addition to computation requirements and timing constraints. A process can be periodic or nonperiodic. A periodic process is one which is invoked at regular intervals of time. A nonperiodic process has an arbitrary arrival time and deadline and when invoked, is expected to execute just once. Processes are also distinguished as preemptable or non-preemptable. A process is preemptable if its execution can be interrupted by other processes at any time and resumed afterwards. A process is non-preemptable if it must run to completion once it starts. Process scheduling can be classified according to the underlying process model. Another way to classify the scheduling problem as well as the system is when information about a process is available. If knowledge about a process is available a priori then it is a static real-time system since all the scheduling decisions can be done once only during the initialization phase of the system. As can be imagined, such a system is very inflexible and if a new process has to be accommodated at a future date, the whole system has to be first shut down. The advantage is that there is no overhead during runtime. A dynamic approach determines schedules for processes on the fly and allows processes to be dynamically invoked. Dynamic approaches involve

PAGE 20

8 higher run-time costs, but they are flexible and can easily adapt to changes in the environment. The goal in designing real-time systems is predictability rather than speed. The primary goal of scheduling is to meet the individual process deadlines, not to minimize, say, the response time. Although scheduling is used in many areas like job shop scheduling, etc., real-time process scheduling is different because of the deadlines of each process. Although scheduling is the main issue in real-time systems, there are other concerns as well. Some of them are provisions to include user written device drivers, guaranteed minimum interrupt response time, ability to tailor the system to specific requirements, etc. In addition, the system should not restrict the usage of hardware in any undesirable way. 2.2.2 Rate Monotonic Scheduling The analysis of the schedulability will be performed by using a theory of realtime systems which is based on rate monotonic scheduling theory. Rate monotonic scheduling theory provides analytical mechanisms for understanding and predicting the execution timing behavior of real-time systems. The basic theory, introduced in a seminal paper written by Liu and Layland [59], gives us a rule for assigning priorities to periodic process and a formula for determining if a set of periodic processes will all meet their deadlines. We assume the following notation. We consider N periodic processes 7\, T2, T3, T N on a uniprocessor. Let A, and C, represent the execution time, the deadline and the cycle time (periodicity) of the process T t . Assume that the numbering of the processes is such that the following relationship holds: d < C 2 < ... < C N .

PAGE 21

9 The CPU Utilization of a process T; is the ratio of a process's execution time to its period. The CPU utilization of a set of processes is the sum of the utilizations of the individual processes. CPU Utilization of a Set of Processes U(N) = Ei/Ci + E 2 /C 2 + ... + E N /C N . A set of assumptions have been made for the rate monotonic scheduling theory. • Process switching is instantaneous. • Processes account for all execution times. • Process interactions are not allowed. • Processes become ready to execute precisely at the beginning of their periods. • Process deadlines are always the start of the next period. The rate monotonic algorithm assigns priorities to processes based on the process's cycle times. Processes with shorter periods are assigned higher priorities and a higher priority process can preempt a lower priority process. The rate monotonic theorem proves that a set of N independent processes scheduled by the rate monotonic algorithm will always meet their deadlines, for all process phasings, if Exld + E 2 /C 2 + ... + E N /C N = U(N) < N(2^ N 1) Basically, if the utilization of the process set is less than a theoretically determined bound, then the set of processes is guaranteed to meet all of its deadlines. Given a set of N independent periodic processes scheduled by the rate monotonic algorithm, a particular process, T k , k
PAGE 22

10 Ei Id + E 2 /d + ... + E k IC k = U(k) < k(2^ k 1) From this result, it can be seen that the only factors that determine the schedulability of process Tk are the utilization of higher priority processes and the utilization of the process Tjt itself. The discussion so far assumes that processes always execute in consistence with their rate monotonic priority. But in practice, because of priority inversion, a higher priority process may be blocked by a lower priority process that is executing a nonpreemptable section of code. This blocking effect can be included in the previous result as follows. Let Bk be the worst case total amount of blocking that a process 7* can incur during any period. It has been shown [80] that all processes will meet their deadlines if Ei/d + Bt/TtK 1(2^ -1) Ei/d + E 2 /d + B 2 /C 2 < 2(2 1 ' 2 1) Ei/d + E 2 /d + + E k /d + B k IC k < k(2^ k 1) E l /d + E 2 /d + ... + E N /C N + B N /C N < N{2 l ' N 1) The inequalities explicitly show how blocking affects the schedulability of a set of processes and why it is desirable to minimize blocking.


Process synchronization is a common source of blocking. When more than one process requires mutually exclusive access to a resource, the processes must synchronize. If a lower priority process has locked a resource and is then preempted by a higher priority process that executes until it needs to access the same resource, the higher priority process is forced to wait: the higher priority process is blocked. By using priority synchronization algorithms, this blocking can be reduced or avoided, in proportion to the priority of the processes, which directly improves the schedulability of a set of processes. The blocking as a whole is not reduced; it is shifted from higher priority processes to lower priority processes. In Section 2.3, we discuss the priority ceiling protocol (PCP) and the priority inheritance protocol (PIP), a class of protocols that reduce the effects of blocking and also prevent mutual deadlock.

2.2.3 Static Scheduling Algorithms

• Static, Preemptive Scheduling on a Uniprocessor for arbitrary tasks. Horn [40] developed an $O(n^2)$ algorithm for scheduling $n$ tasks, based on the earliest deadline policy: tasks with earlier deadlines and earlier ready times are chosen to run before tasks with later deadlines and ready times. Tasks can have arbitrary ready times and deadlines.

• Static, Preemptive Scheduling on a Multiprocessor for arbitrary tasks. Horn also described an $O(n^3)$ algorithm to schedule $n$ tasks with arbitrary ready times and deadlines on a multiprocessor. His approach is based on the network flow method and considers only processors with the same speed. This approach was extended by Martel [64] to processors with different speeds; the complexity of Martel's algorithm is $O(m^2 n^4 + n^5)$, where $m$ is the number of processors.


• Static, Preemptive Scheduling on a Uniprocessor for periodic tasks. The algorithms for scheduling arbitrary tasks can be applied to periodic tasks by considering the instances of the periodic tasks within the time interval between zero and the least common multiple of the tasks' periods. Horn's and Martel's approaches can also be applied to multiprocessor systems in the same way. The rate monotonic algorithm described earlier assigns priorities to tasks based on the tasks' cycle times. Tasks with shorter periods are assigned higher priorities, and a higher priority task can preempt a lower priority task. The rate monotonic theorem [59] proves that a set of $N$ independent tasks scheduled by the rate monotonic algorithm will always meet their deadlines, for all task phasings, if $U(N) \le N(2^{1/N} - 1)$. Teixeira [93] presented a fixed-priority assignment scheme in which the relative deadline of a periodic task can differ from the task's period.

• Static, Preemptive Scheduling on a Multiprocessor for periodic tasks. A partitioning approach is adopted to solve this problem. The main idea is to partition a set of periodic tasks among a minimum number of processors such that each partition of the periodic tasks can be scheduled on one processor according to the earliest deadline scheme or the rate monotonic priority scheme. If the earliest deadline scheme is used, a bin-packing algorithm can be used to determine a suboptimal partition of the periodic tasks among multiple processors [20].


• Static Nonpreemptive Scheduling. Nonpreemptive scheduling is more difficult than preemptive scheduling, and many nonpreemptive problems have been shown to be NP-hard. Scheduling nonpreemptable tasks with arbitrary ready times is NP-hard even on uniprocessor systems. For some restricted problems, however, efficient algorithms are available. For example, the earliest deadline algorithm has been shown to be optimal for scheduling a set of tasks with the same ready times [72]. Kise developed an $O(n^2)$ algorithm for the case in which a task has an earlier ready time if and only if it has an earlier deadline [49]. For multiprocessor systems, nonpreemptive scheduling is NP-hard even when the ready times and deadlines of all tasks are the same [98]. A polynomial optimal algorithm is available for scheduling tasks with unit computation time [86].

2.2.4 Dynamic Scheduling Algorithms

Most algorithms that are optimal for static scheduling are not optimal for dynamic scheduling. For multiprocessors, there can be no optimal algorithm for scheduling preemptable processes if the arrival times of the processes are not known in advance [69]. Since run-time cost is an important factor in dynamic scheduling, most static algorithms are not suitable for dynamic scheduling. Hence, there is a need to develop heuristic algorithms for dynamic scheduling.


For uniprocessor systems, it has been shown that the earliest deadline algorithm is optimal for scheduling preemptable processes with arbitrary arrival times [21]. Stankovic et al. [87] describe an algorithm that is based on the earliest deadline policy but takes the run-time cost into account. Baker and Su [7] compared four heuristic algorithms that schedule processes in an order determined by ready time, by deadline, by the average of ready time and deadline, and by both ready time and deadline; they showed that the last two algorithms perform better than the first two. In multiprocessor systems, Mok et al. [69] have shown that if the set of all possible processes that will ever arrive in a system can be scheduled ahead of time, then the set of processes can also be scheduled at run-time. The drawback of this approach is that the probability of all possible arriving processes being schedulable ahead of time is very low. They also proved that the least laxity algorithm is a successful run-time algorithm. Locke et al. [60] found that least laxity first and earliest deadline first are two good heuristic policies.

2.3 Priority Inheritance Protocols

In the discussion of rate monotonic scheduling (Section 2.2.2), we saw how blocking affects the schedulability of processes. Here, we give a brief discussion of the Priority Inheritance and Priority Ceiling protocols, which reduce blocking [80].

2.3.1 Priority Inheritance Protocol

Priority inheritance prevents a medium priority process from prolonging the actual amount of time that a resource is locked by a lower priority process.


Without inheritance, a medium priority process can preempt the lower priority process's critical section and prolong the blocking of a higher priority process. To avoid this situation, priority inheritance allows the lower priority process to inherit the blocked process's higher priority for the duration of the critical section. Thus, priority inheritance prevents the medium priority process from preempting the critical section, which is now executing at a high priority. However, this basic priority inheritance mechanism has a drawback: if a process shares $m$ resources with lower priority processes, then it can be blocked up to $m$ times per execution period due to process synchronization. This can be illustrated by an example. Suppose a high priority process requires data from several resources that are all currently locked. A low priority process locks one resource; it is then preempted by a slightly higher priority process that in turn locks another resource, and so on. Each blocking process inherits the blocked process's priority and, after its critical section is completed, relinquishes the resource. The high priority process will use the first resource, then be forced to wait for the second resource that it needs, and so on. Thus, the high priority process can wait up to $m$ times for the $m$ resources that it needs.
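
The inheritance rule itself is small. The following C fragment is a minimal sketch of just that rule, under the assumption of a hypothetical scheduler that calls these hooks on its block and release paths; the Task and PIMutex types are illustrative, not taken from any particular system.

    #include <stddef.h>

    typedef struct Task { int base_priority; int priority; } Task;
    typedef struct { Task *holder; /* NULL when the lock is free */ } PIMutex;

    /* Called when 'waiter' blocks on a held lock: the holder inherits
       the waiter's priority if it is higher. */
    void inherit_on_block(PIMutex *m, const Task *waiter)
    {
        if (m->holder != NULL && m->holder->priority < waiter->priority)
            m->holder->priority = waiter->priority;
    }

    /* Called when the holder leaves its critical section: it reverts
       to its original (uninherited) priority. */
    void disinherit_on_release(Task *holder)
    {
        holder->priority = holder->base_priority;
    }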


2.3.2 Priority Ceiling Protocol

This protocol solves the above problem: a high priority process waiting for $m$ shared resources waits at most once per period, for the duration of a single critical section. In this mechanism, associated with each resource, in addition to the semaphore or monitor that protects it, is an attribute known as the priority ceiling. This is the highest priority at which a critical section associated with that resource can execute, i.e., the highest priority of any process that can use the resource. Thus, if a high priority process wishes to use $m$ resources, the ceiling of those resources is set to the priority of this process, so a medium priority process can never preempt a lower priority process inside its critical section. This rule allows only one process to hold locks at any given time. Hence, the high priority process is blocked at most once per period of execution.

2.3.3 Priority Ceiling Protocols with Abortable Critical Sections

Priority inversion of high priority tasks can be reduced further by selectively aborting a low priority task executing within a critical section. Shu et al. [84] proposed the Abort Ceiling Protocol, an extension to the Priority Ceiling Protocol. In this algorithm, an abort ceiling priority is associated with each task; the abort ceiling comes into effect when the task is executing. Another task may abort the currently running task and run immediately if its priority is higher than the current abort ceiling. The protocol relies on Interruptible Critical Sections (Chapter 4) to restart the critical section of the aborted task, and it assumes static priorities. The Ceiling Abort Protocol [92] proposed by Takada and Sakamura is a similar extension to the Priority Ceiling Protocol. This protocol instead assigns an abort ceiling priority to the critical section, and the critical section is divided into abortable and non-abortable segments. The issue here is to minimize abortion and re-execution overheads.

2.3.4 Priority Inheritance Protocols for Multiprocessors

Priority inheritance protocols have been extended to multiprocessors by Rajkumar et al. [79]. In the case of multiprocessors, the concept of blocking is generalized to include remote blocking. When a process executing on one processor has to wait for the execution of a process assigned to another processor, it is said to experience remote blocking.


By its very nature, remote blocking differs from uniprocessor blocking: remote blocking is a function of the execution times of processes on other processors, even in the absence of data sharing. Thus, uniprocessor priority inheritance protocols must be enhanced for multiprocessors. In the multiprocessor priority ceiling protocol, tasks are assumed to be bound to a processor; static binding has been found to perform better under both static and dynamic priority scheduling algorithms. A critical section executed by processes on different processors is called a global critical section (GCS). A processor that executes global critical sections is called a synchronization processor, and processors that run application processes only are called application processors. A synchronization processor may also run application tasks. The priority ceiling of a semaphore S indicates the maximum priority at which a critical section guarded by this semaphore can execute. The priority ceiling of a local semaphore S is defined to be the priority of the highest priority process that may lock the semaphore. Let the priority of the highest priority process that accesses a global semaphore GS be denoted by PS. Then the priority ceiling of a global semaphore GS is defined such that

• the priority ceiling of GS is higher than PS, and

• if $GS_i$ and $GS_j$ are global semaphores and $PS_i > PS_j$, then the priority ceiling of $GS_i$ is greater than the priority ceiling of $GS_j$.

The multiprocessor priority ceiling protocol, as used on each of the application processors and the synchronization processors, is as follows.


• Each application processor runs the priority ceiling protocol on the set of processes that it runs and the set of local semaphores bound to that application processor.

• Each global critical section on the synchronization processor normally executes at its assigned priority.

• The synchronization processor runs the priority ceiling protocol on the global critical sections, the set of application tasks, and the set of global and local semaphores bound to the synchronization processor.

The multiprocessor priority ceiling protocol prevents deadlocks and bounds the blocking duration of each process as a function of the critical section durations of other tasks.

2.4 Synchronization in Multiprocessors

Synchronization primitives make programs easier to understand and write, but processors waste time when waiting for locks. Synchronization primitives are used in nearly every parallel program, and lessening synchronization delays is a major goal for efficient parallel program execution. We will study synchronization in the context of both pessimistic and optimistic approaches. We will also briefly review synchronization mechanisms in message-passing multiprocessors.

2.4.1 Pessimistic Synchronization

Pessimistic synchronization mechanisms are overly restrictive in their approach. Even if there is no interference from other processes while sharing a resource, a process incurs the overhead of establishing a lock for itself before using the shared resource.


Mechanisms include hardware operations such as Test&Set and Compare&Swap, low level primitives such as spin-locks and condition variables, and high level primitives such as semaphores and monitors. We restrict our discussion to low level synchronization primitives.

Hardware Synchronization Primitives

Hardware synchronization primitives evolved primarily on shared-memory multiprocessors. Atomic, sequentially consistent loads and stores can be used to build synchronization primitives. Hardware primitives include Test&Set, Fetch&Store, Fetch&Add, and Compare&Swap; these primitives are also called read-modify-write operations. Test&Set and reset form a pair of basic synchronization primitives [23]. Test&Set is repeatedly executed to get exclusive access to a lock variable before entering a mutually exclusive section, and lock reset is used to exit from the section. Because a processor repeatedly tests the lock until it is acquired, this may cause excessive network traffic. In contrast, a suspend-lock employs interprocessor interrupts: a processor waits for an interrupt if its first Test&Set fails. In another scheme [47], a full/empty tag is associated with each word in shared memory. Less general than read-modify-write, this tag is tested before a producer-consumer write or read operation. Only a full word can be read and only an empty word can be written. When the test succeeds, the read or write operation is performed and the value of the tag is reversed.
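
For reference, the basic Test&Set lock described above can be rendered in modern C using C11 atomics; this sketch stands in for the hardware instruction and omits the suspend-lock and full/empty-tag variants.

    #include <stdatomic.h>

    typedef struct { atomic_flag flag; } SpinLock;  /* init: ATOMIC_FLAG_INIT */

    void spin_lock(SpinLock *l)
    {
        /* Test&Set is repeatedly executed until the lock is acquired. */
        while (atomic_flag_test_and_set_explicit(&l->flag, memory_order_acquire))
            ;  /* busy-wait */
    }

    void spin_unlock(SpinLock *l)
    {
        /* Lock reset is used to exit the critical section. */
        atomic_flag_clear_explicit(&l->flag, memory_order_release);
    }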


An extension to the Test&Set, the Test&Test&Set repeatedly tests a local copy of the lock whenever the first atomic Test&Set of the global lock fails [82]. The local copies are invalidated when the global lock is reset, and every waiting processor then performs another Test&Test&Set operation; only one process will get the lock. This scheme reduces the network traffic associated with the Test&Set. Anderson [4] found that exponential backoff after the first Test&Test&Set failure is effective in reducing contention among processes acquiring a lock. Fetch&Op primitives include Fetch&Store (swap) and Fetch&Add [52]; the latter primitive adds an increment to a shared sum. Compare&Swap [52] compares the contents of a memory location against a given value and sets a condition code to indicate whether they are equal; if so, it replaces the contents of the memory location with a second given value. Herlihy [34] showed that the Compare&Swap operation is more powerful than the other operations listed: Compare&Swap can be used to convert any sequential data object into a concurrent wait-free (Section 2.4.2) data object.

Spin Locks

Spin locks are busy-wait constructs in which processes repeatedly test shared variables to determine when they may proceed. Busy-waiting is preferred over scheduler-based blocking when scheduler overhead exceeds the expected waiting time, when processor resources are not needed for other processes, or when scheduler-based blocking is inappropriate or impossible, for example, in the kernel of an operating system. The simplest mutual exclusion lock employs a polling loop to access a boolean flag that indicates whether the lock is free. Each processor repeatedly executes a Test&Set instruction in an attempt to change the flag from false to true, thereby acquiring the lock; a processor releases the lock by setting the flag to false. To reduce network traffic, Test&Test&Set and exponential backoff may be employed.
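
A sketch of the Test&Test&Set idea with exponential backoff is shown below (again in C11 atomics; the delay loop and its bound are illustrative placeholders).

    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct { atomic_bool locked; } TTASLock;

    void ttas_lock(TTASLock *l)
    {
        unsigned delay = 1;
        for (;;) {
            /* Spin on the (locally cached) value first ... */
            while (atomic_load_explicit(&l->locked, memory_order_relaxed))
                ;
            /* ... then attempt the atomic Test&Set. */
            if (!atomic_exchange_explicit(&l->locked, true, memory_order_acquire))
                return;
            /* Exponential backoff after a failed attempt [4]. */
            for (volatile unsigned i = 0; i < delay; i++)
                ;
            if (delay < 1024)
                delay <<= 1;
        }
    }

    void ttas_unlock(TTASLock *l)
    {
        atomic_store_explicit(&l->locked, false, memory_order_release);
    }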


A ticket lock [81] reduces the number of atomic operations to one per lock acquisition. A ticket lock consists of two counters, one containing the number of requests to acquire the lock and the other the number of times the lock has been released. A processor acquires the lock by performing a Fetch&Increment operation on the request counter and waiting until the result is equal to the value of the release counter; it releases the lock by incrementing the release counter. Processes acquire the lock in FIFO order of their requests. The ticket lock can still cause substantial memory and network contention through polling of a common location, so it is not possible to obtain the lock with a bounded expected number of network transactions, due to the unpredictability of the length of the critical sections. Anderson [4] and Graunke and Thakkar [30] have proposed locking algorithms that achieve a constant bound on the number of remote memory operations in cache-coherent multiprocessors. Each processor uses an atomic operation to obtain the address of a location on which to spin, and each processor spins on a different location in a different cache line. This array-based queuing lock guarantees FIFO ordering of requests but requires space per lock linear in the number of processors. Mellor-Crummey and Scott [66] devised a list-based queuing lock, the MCS-Lock. It requires atomic Fetch&Store and Compare&Swap instructions. The lock variable maintains the tail of a FIFO queue, and the head of the queue is maintained by the process using the lock. The acquire operation is accomplished with a Fetch&Store operation on the lock variable, and the release with a Compare&Swap. This lock guarantees FIFO ordering of lock acquisitions, processes spin only on locally accessible flag variables, and it requires a constant amount of space per lock; a sketch appears below. Markatos [63] designed a priority spin-lock algorithm based on the MCS-Lock.
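
The sketch below renders the MCS-Lock in C11 atomics. It is a paraphrase of the published algorithm [66], not the code of this dissertation; the QNode and MCSLock names are ours.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct QNode {
        _Atomic(struct QNode *) next;
        atomic_bool locked;
    } QNode;

    typedef struct { _Atomic(QNode *) tail; } MCSLock;

    void mcs_acquire(MCSLock *l, QNode *me)
    {
        atomic_store(&me->next, NULL);
        atomic_store(&me->locked, true);
        QNode *prev = atomic_exchange(&l->tail, me);   /* swap in our record */
        if (prev != NULL) {
            atomic_store(&prev->next, me);             /* link behind the old tail */
            while (atomic_load(&me->locked))
                ;  /* spin on our own locally accessible flag */
        }
    }

    void mcs_release(MCSLock *l, QNode *me)
    {
        QNode *succ = atomic_load(&me->next);
        if (succ == NULL) {
            QNode *expected = me;
            if (atomic_compare_exchange_strong(&l->tail, &expected, NULL))
                return;                                /* no successor: lock is free */
            while ((succ = atomic_load(&me->next)) == NULL)
                ;  /* a successor is enqueuing; wait for its link */
        }
        atomic_store(&succ->locked, false);            /* hand the lock to it */
    }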


Craig [19] refined Markatos' approach to achieve FIFO and priority locks with better space complexity in the case of nested lock acquisitions. In Chapter 3, we discuss our own implementation of a priority spin-lock, the PR-Lock. Our PR-Lock has better acquire and release lock characteristics, and differs from the above cited locks in some important details.

Condition Variables

Condition variables [85] allow conditional blocking inside a critical section. Condition variables can be used to implement monitors and are provided in the Mach [1] operating system. The operations on a condition variable include condition-wait and condition-signal. A condition variable is associated with a mutex variable. When a process performs the condition-wait operation on a condition variable, the associated mutex variable is unlocked and the calling process is blocked. When another process executes the condition-signal operation on the same condition variable, indicating that the condition may have changed, the associated mutex variable is locked and the blocked process continues. The unblocked process must re-evaluate the condition before proceeding further.
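
The standard usage pattern can be made concrete with POSIX threads, whose pthread_cond_wait and pthread_cond_signal follow the condition-wait and condition-signal semantics just described; the ready flag here is an illustrative shared condition.

    #include <pthread.h>

    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
    int ready = 0;

    void consumer(void)
    {
        pthread_mutex_lock(&m);
        while (!ready)                  /* re-evaluate after every wakeup */
            pthread_cond_wait(&c, &m);  /* unlocks m, blocks, relocks m */
        /* ... use the shared resource ... */
        pthread_mutex_unlock(&m);
    }

    void producer(void)
    {
        pthread_mutex_lock(&m);
        ready = 1;
        pthread_cond_signal(&c);        /* the condition may have changed */
        pthread_mutex_unlock(&m);
    }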


2.4.2 Optimistic Synchronization

In these synchronization mechanisms, processes use a shared resource with the optimism of non-interference; care must be taken, however, if there is a conflict. In Interruptible Critical Sections, the affected process recovers and restarts the critical section from the beginning [11]; this is the subject of Chapter 4. Lock-free synchronization was introduced by Lamport [53]. Lock-free data structures can be further classified as non-blocking and wait-free [34]. Non-blocking algorithms guarantee that some process accessing the data structure will complete an operation in a finite number of steps. Wait-free algorithms ensure that all processes complete their operations within a fixed number of steps. Herlihy [35] has shown that it is impossible to construct non-blocking implementations of arbitrary concurrent objects with any combination of read, write, and Fetch&Op (where Op can be Store, Increment, Add, etc.) when more than two processes are considered. However, there are universal atomic operations that are capable of implementing arbitrary non-blocking objects, Compare&Swap being one. Methods for automatically converting a sequential implementation of an abstract data type into a wait-free implementation were given by Herlihy, and into non-blocking implementations by Prakash et al. [78] and Turek et al. [97]. The methodology proposed by Turek et al. also handles wait-free implementation, uses less memory, and accommodates greater concurrency.

2.4.3 Synchronization through Message Passing

Another way processes can communicate and synchronize is through message passing. Message passing is a form of synchronization, since a message can be received only after it has been sent. Remote procedure calls and rendezvous are higher level forms of synchronization using message passing. In the context of real-time systems, Goscinski [29] developed two algorithms for mutual exclusion in distributed systems. Johnson and Newman-Wolfe [46] proposed a distributed priority lock based on the PR-Lock (Chapter 3) that has low storage and overhead requirements.
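
As a concrete taste of the optimistic style of Section 2.4.2, the following C11 sketch pushes a node onto a lock-free stack: the operation is prepared against a snapshot and committed with Compare&Swap, retrying only on interference. (This simple push ignores the A-B-A problem, which the version counters of Chapter 3 address.)

    #include <stdatomic.h>

    typedef struct Node { struct Node *next; int value; } Node;
    typedef struct { _Atomic(Node *) top; } Stack;

    void push(Stack *s, Node *n)
    {
        Node *old = atomic_load_explicit(&s->top, memory_order_relaxed);
        do {
            n->next = old;              /* prepare against the snapshot */
        } while (!atomic_compare_exchange_weak_explicit(
                     &s->top, &old, n,
                     memory_order_release, memory_order_relaxed));
    }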


2.5 Conclusion

In this chapter we introduced some of the issues in real-time systems, including scheduling, priority inversion, and synchronization. We presented some of the current practices in scheduling processes on single as well as multiple processors. We presented an analysis of the rate monotonic scheduling algorithm and the effect of blocking due to priority inversion. We also discussed two protocols for reducing the effect of blocking, namely priority inheritance and priority ceiling. We categorized synchronization into two types, pessimistic and optimistic, and cited many examples and techniques illustrating the two mechanisms. Not all of the specific mechanisms are suitable in a given environment: some can be more efficient than others. Optimistic synchronization algorithms have not previously been applied to real-time systems. We show how optimistic synchronization can be used effectively in real-time systems, thereby avoiding the priority inversion problem that is inherent in lock-based synchronization mechanisms.


CHAPTER 3
A PRIORITIZED MULTIPROCESSOR SPIN LOCK

3.1 Introduction

Mutual exclusion is a fundamental synchronization problem for exclusive access to critical sections or shared resources on multiprocessors [62]. The spin-lock is one of the mechanisms that can be used to provide mutual exclusion on shared-memory multiprocessors [6]. A spin-lock is usually implemented using atomic read-modify-write instructions such as Test&Set or Compare&Swap, which are available on most shared-memory multiprocessors [52]. Busy waiting is effective when the critical section is small and the processor resources are not needed by other processes in the interim. However, a spin-lock is usually not fair, and a naive implementation can severely limit performance due to network and memory contention [4, 27]. A careful design can avoid contention by requiring processes to spin on locally stored or cached variables [66]. In real-time systems, each process has timing constraints and is associated with a priority indicating the urgency of that process [88]. This priority is used by the operating system to order the rendering of services among competing processes. Normally, the higher the priority of a process, the faster its request for services is honored. When the synchronization primitives disregard priorities, lower priority processes may block the execution of a process with a higher priority and a stricter timing constraint [79, 80]. This priority inversion may cause the higher priority process to miss its deadline, leading to a failure of the real-time system.


Most of the work done on synchronization is not based on priorities, and thus is not suitable for real-time systems. Furthermore, general purpose parallel processing systems often have processes that are "more important" than others (kernel processes, processes that hold many locks, etc.); the performance of such systems will benefit from prioritized access to critical sections. In this chapter, we present a prioritized spin-lock algorithm, the PR-Lock. The PR-Lock algorithm is suitable for use in systems that either use static-priority schedulers, or use dynamic-priority schedulers in which the relative priorities of existing tasks do not change while blocked (such as Earliest Deadline First [88] or Minimum Laxity [39]). The PR-Lock is a contention-free lock [66], so its use will not create excessive network or memory contention. The PR-Lock maintains a queue of records, with one record for each process that has requested but not yet released the lock. The queue is maintained in sorted order (except for the head record) by the acquire_lock operations, and the release_lock operation is performed in constant time. As a result, the queue order is maintained by processes that are blocked anyway, and a high priority task does not perform work for a low priority task when it releases the lock. The lock keeps a pointer to the record of the lock holder, which aids in the implementation of priority inheritance protocols [79, 80]. A task's lock request and release are performed at well-defined points in time, which makes the lock predictable. We present a correctness proof, and simulation results that demonstrate the prioritized lock access, the locality of the references, and the improvement over a previously proposed prioritized spin-lock. We organize this chapter as follows. In Section 3.1.1 we describe previous work in this area, and in Section 3.2 we present our algorithm. In Section 3.3 we argue the correctness of our algorithm. In Section 3.4 we discuss an extension to the algorithm presented in Section 3.2.


In Section 3.5 we present simulation results that compare the performance of the PR-Lock against that of other similar algorithms. In Section 3.6 we conclude this chapter by suggesting some applications and future extensions of the PR-Lock algorithm.

3.1.1 Previous Work

Our PR-Lock algorithm is based on the MCS-Lock algorithm, a spin-lock mutual exclusion algorithm for shared-memory multiprocessors [66]. The MCS-Lock grants lock requests in FIFO order, and blocked processes spin only on locally accessible flag variables, avoiding the contention usually associated with busy-waiting on multiprocessors [4, 27]. Each process has a record that represents its place in the lock queue. The MCS-Lock algorithm maintains a pointer to the tail of the lock queue. A process adds itself to the queue by swapping the current contents of the tail pointer for the address of its record. If the previous tail was nil, the process has acquired the lock. Otherwise, the process inserts a pointer to its record in the record of the previous tail, and spins on a flag in its record. The head of the queue is the record of the lock holder. The lock holder releases the lock by resetting the flag of its successor record. If no successor exists, the lock holder sets the tail pointer to nil using a Compare&Swap instruction. Molesky, Shen, and Zlokapa [71] describe a prioritized spin-lock that uses the test-and-set instruction. Their algorithm is based on Burns' fair test-and-set mutual exclusion algorithm [14]; however, this lock is not contention-free. Markatos and LeBlanc [63] present a prioritized spin-lock algorithm based on the MCS-Lock algorithm. Their acquire lock algorithm is almost the same as the MCS acquire lock algorithm, with the exception that Markatos' algorithm maintains a doubly linked list. When the lock holder releases the lock, it searches for the highest priority process in the queue.


That process's record is moved to the head of the queue, and its flag is reset. However, the point at which a task requests or releases the lock is not well defined, and the lock holder might release the lock to a low priority task even though a higher priority task has entered the queue. In addition, the work of maintaining the priority queue is performed when a lock is released. This choice makes the time to release a lock unpredictable, and significantly increases the time to acquire or release a lock (as is shown in Section 3.5). Craig [19] proposes a modification to the MCS lock and to Markatos' lock that substitutes an atomic Swap for the Compare&Swap instruction and permits nested locks using only one lock record per process. Takada and Sakamura [91] extended queuing spin-locks to be preemptable for servicing interrupts. Goscinski [29] developed two algorithms for mutual exclusion in real-time distributed systems. The algorithms are based on token passing: a process requests the critical section by broadcasting its intention to all other processes in the system. One algorithm grants the token based on the priorities of the processes, whereas the other grants the token based on the remaining time to run the processes. The holder of the token enters the critical section. The utility of prioritized locks is demonstrated by rate monotonic scheduling theory [59, 80]. Suppose there are $N$ periodic processes $T_1, T_2, T_3, \ldots, T_N$ on a uniprocessor. Let $E_i$ and $C_i$ represent the execution time and the cycle time (periodicity) of the process $T_i$. We assume that $C_1 < C_2 < \ldots < C_N$. Under the assumption that there is no blocking, [59] show that if, for each $j$,

$\sum_{i=1}^{j} E_i/C_i \le j(2^{1/j} - 1)$

then all processes can meet their deadlines.


Suppose that $B_j$ is the worst case blocking time that process $T_j$ will incur. Then [80] show that all tasks can meet their deadlines if, for each $j$,

$\sum_{i=1}^{j} E_i/C_i + B_j/C_j \le j(2^{1/j} - 1)$

Thus, the blocking of a high priority process by a lower priority process has a significant impact on the ability of tasks to meet their deadlines. Much work has been done to bound the blocking due to lower priority processes. For example, the Priority Ceiling protocol [80] guarantees that a high priority process is blocked by a lower priority process for the duration of at most one critical section. The Priority Ceiling protocol has been extended to handle dynamic-priority schedulers [16] and multiprocessors [17, 79]. Our contribution over previous work on prioritized contention-free spin-locks ([19] and [63]) is to implement the desired priority queue more directly. Our algorithm maintains a pointer to the head of the lock queue, which is the record of the lock holder. As a result, the PR-Lock can be used to implement priority inheritance [79, 80]. The work of maintaining the priority ordering is performed in the acquire_lock operation, when a task is blocked anyway. The time required to release a lock is small and predictable, which reduces the length and the variance of the time spent in the critical section. The PR-Lock has well-defined points in time at which a task joins the lock queue and releases its lock; as a result, we can guarantee that the highest priority waiting task always receives the lock. Finally, we provide a proof of correctness.

3.2 PR-Lock Algorithm

Our PR-Lock algorithm is similar to the MCS-Lock algorithm in that both maintain queues of blocked processes using the Compare&Swap instruction.


However, while the MCS-Lock and Markatos' lock maintain a global pointer to the tail of the queue, the PR-Lock algorithm maintains a global pointer to the head of the queue. In both the MCS-Lock and Markatos' lock, the processes are queued in FIFO order, whereas in the PR-Lock, the queue is maintained in priority order of the processes.

3.2.1 Assumptions

We make the following assumptions about the computing environment.

1. The underlying multiprocessor architecture supports an atomic Compare&Swap instruction. We note that many parallel architectures support this instruction or a related one [35, 74, 99].

2. The multiprocessor has shared memory with coherent caches, or has locally-stored but globally-accessible shared memory.

3. Each processor has a record to place in the queue for each lock. In a NUMA architecture, this record is allocated in the local, but globally accessible, memory. This record is not used for any other purpose for the lifetime of the queue. In Section 3.4, we allow the record to be shared among many lock queues.

4. The higher the number assigned as a priority, the higher the priority of a process (the opposite convention could equally be assumed).

5. The relative priorities of blocked processes do not change. Acceptable priority assignment algorithms include Earliest Deadline First and Minimum Laxity.

It should be noted that each process $p_i$ participating in the synchronization can be associated with a unique processor $P_i$. We expect that the queued processes will not be preempted, though this is not a requirement for correctness.


3.2.2 Implementation

The PR-Lock algorithm consists of two operations: the acquire_lock operation acquires a designated lock, and the release_lock operation releases the lock. Each process uses the acquire_lock and release_lock operations to synchronize access to a resource as follows:

    acquire_lock(L, r)
    critical section
    release_lock(L)

The following subsections present the required version of Compare&Swap, the needed data structures, and the acquire_lock and release_lock procedures.

The Compare&Swap

The PR-Lock algorithms make use of the Compare&Swap instruction, the code for which is shown in Figure 3.1. Compare&Swap is often used on pointers to object records, where a record refers to the physical memory space and an object refers to the data within a record. Current is a pointer to a record, Old is a previously sampled value of Current, and New is a pointer to a record that we would like to substitute for *Old (the record pointed to by Old). We compute the record *New based on the object in *Old (or decide to perform the swap based on the object in *Old), so we want to set Current equal to New only if Current still points to the record *Old. However, even if Current points to *Old, it might point to a different object than the one originally read. This will occur if *Old is removed from the data structure, then re-inserted as Current with a new object. This sequence of events cannot be detected by the Compare&Swap and is known as the A-B-A problem. Following the work of Prakash et al. [77] and Turek et al. [97], we make use of a double-word Compare&Swap instruction [74] to avoid this problem.


    Procedure CAS(structure Pointer *Current, *Old, *New)
    /* Assume CAS operates on double words */
    atomic {
        if (*Current == *Old) {
            *Current = *New;
            return TRUE;
        } else {
            *Old = *Current;
            return FALSE;
        }
    }

Figure 3.1. CAS used in the PR-Lock algorithm

A counter is appended to Current and treated as a part of Current. Thus Current consists of two parts: the value part of Current and the counter part of Current. This counter is incremented every time a modification is made to *Current. Now all of the variables Current, Old, and New are twice their original size. This approach reduces the probability of occurrence of the A-B-A problem to acceptable levels for practical applications. If a double-word Compare&Swap is not available, the address and counter can be packed into 32 bits by restricting the possible address range of the lock records. We use a version of the Compare&Swap operation in which the current value of the target location is returned in Old if the Compare&Swap fails. The semantics of the Compare&Swap used are given in Figure 3.1. A version of the Compare&Swap instruction that returns only TRUE or FALSE can be used by performing an additional read.
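
In modern C, the packed form of this pointer/counter pair can be sketched as follows; the 32/32 split is an illustrative assumption, and C11's compare-exchange conveniently matches the semantics of Figure 3.1 by writing the observed value back into *old on failure.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdatomic.h>

    typedef _Atomic uint64_t PackedPtr;   /* [ 32-bit address | 32-bit counter ] */

    static inline uint64_t pack(uint32_t addr, uint32_t ctr)
    {
        return ((uint64_t)addr << 32) | ctr;
    }

    static inline bool cas_packed(PackedPtr *cur, uint64_t *old, uint64_t new_val)
    {
        /* On failure the observed value is stored into *old, as in Figure 3.1. */
        return atomic_compare_exchange_strong(cur, old, new_val);
    }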


Data Structures

The basic data structure used in the PR-Lock algorithm is a priority queue. The lock L contains a pointer to the first record of the queue. The first record of the queue belongs to the process currently holding the lock; if there is no such process, then L contains nil. Each process has a locally-stored but globally-accessible record to insert into the lock queue. If process p inserts record q into the queue, we say that q is p's record and p is q's process. The record contains the process priority, the next-record pointer, a boolean flag Locked on which the process owning the record busy-waits if the lock is not free, and an additional field Data that can be used to store application-dependent information about the lock holder. The next-record pointer is a double sized variable: one half is the actual pointer and the other half is a counter used to avoid the A-B-A problem. The counter portion is itself divided into two parts: one bit, called the Dq bit, indicates whether the queuing record is in the queue, and the remaining bits are used as the actual counter. This technique is similar to the one used by Prakash et al. [77] and Turek et al. [97]; however, their counter refers to the record referenced by the pointer, whereas in our algorithm the counter refers to the record that contains the pointer, not the record that is pointed to. If the Dq bit of a record q is FALSE, then the record is in the queue for a lock L. If the Dq bit is TRUE, then the record is probably not in the queue (for a short period of time, the record might be in the queue with its Dq bit set TRUE). The Dq bit lets the PR-Lock avoid garbage accesses. Each process keeps the address of its record in a local variable (Self).


In addition, each process requires two local pointer variables, Prev_Node and Next_Node, to hold the previous and next queue records while navigating the queue during the enqueue operation. The data structures used are shown in Figure 3.2. The Dq bit of the Pointer field is initialized to TRUE, and the Ctr field is initialized to 0 before the record is first used. A typical queue formed by the PR-Lock algorithm is shown in Figure 3.3. Here L points to the record $q_0$ of the process currently holding the lock. The record $q_0$ has a pointer to the record $q_1$ of the process with the highest priority among the processes waiting to acquire the lock L. Record $q_1$ points to record $q_2$ of the next higher priority waiting process, and so on; the record $q_n$ belongs to the process with the lowest priority among the waiting processes.

Acquire_lock Operation

The acquire_lock operation is called by a process p before using the critical section or resource guarded by lock L. The parameters of the acquire_lock operation are the lock pointer L and the record q of the process (passed in the local variable Self). An acquire_lock operation searches for the correct position at which to insert q into the queue, using Prev_Node and Next_Node to keep track of the current position. In Figure 3.4, Prev_Node and Next_Node are abbreviated to P and N. The records pointed to by P and N are $q_i$ and $q_{i+1}$, belonging to processes $p_i$ and $p_{i+1}$. Process p positions itself so that $Pr(p_i) > Pr(p) > Pr(p_{i+1})$, where $Pr$ is a function that maps a process to its priority. Once such a position is found, q is prepared for insertion by making q point to $q_{i+1}$. Then the insertion is committed by making $q_i$ point to q using the Compare&Swap instruction. The various stages and the final result are shown in Figure 3.4.


    structure Pointer {
        structure Object *Ptr;
        int31 Ctr;
        boolean Dq;
    }

    structure Record {
        structure structure-of-data Data;
        boolean Locked;
        integer Priority;
        structure Pointer Next;
    }

    Shared Variable
        structure Pointer L;

    Private Variables
        structure Pointer Self, Prev_Node, Next_Node;
        boolean Success, Failure;
        constant TRUE, FALSE, NULL, MAX_PRIORITY;

    Record layout: Data | Locked | Priority | Next.Ptr | Next.Ctr | Next.Dq

Figure 3.2. Data structures used in the PR-Lock algorithm


Figure 3.3. Queue data structure used in the PR-Lock algorithm

Figure 3.4. Stages in the acquire_lock operation (Start, Position, Prepare, Commit, Result)


The acquire_lock algorithm is given in Figure 3.5. Before the acquire_lock procedure is called, the Data and Priority fields of the process's record are initialized appropriately; in addition, the Dq bit of the Next pointer is implicitly TRUE. The acquire_lock operation begins by assuming that the lock is currently free (the lock pointer L is null). It attempts to change L to point to its own record with the Compare&Swap instruction. If the Compare&Swap is successful, the lock is indeed free, so the process acquires the lock without busy-waiting. In the context of the composite pointer structures that the algorithm uses, a NULL pointer is all zeros. If the swap is unsuccessful, the acquiring process traverses the queue to position itself between a higher or equal priority process record and a lower priority process record. Once such a junction is found, Prev_Node will point to the record of the higher priority process and Next_Node will point to the record of the lower priority process. The process first sets its link to Next_Node. Then it attempts to change the previous record's link to its own record with the atomic Compare&Swap. If successful, the process sets the Dq flag in its record to FALSE, indicating its presence in the queue. The process then busy-waits until its Locked bit is set to FALSE, indicating that it has been admitted to the critical section. There are three cases for an unsuccessful attempt at entering the queue. Problems are detected by examining the value returned by the failed Compare&Swap marked F in the algorithm; note that the returned value is in Next_Node. In addition, a process might detect that it has misnavigated while searching the queue. When we read Next_Node, the contents of the record pointed to by Prev_Node are fixed, because the record's counter is read into Next_Node.


    Procedure acquire_lock(L, Self) {
        Success = FALSE;
        do {
            Prev_Node = NULL; Next_Node = NULL;
            if (CAS(*L, &Next_Node, &Self)) {            /* No lock holder */
                Success = TRUE; Failure = FALSE;
                Self.Ptr->Next.Dq = FALSE;               /* Use lock */
            } else {                                     /* Lock in use */
                Failure = FALSE;
                Self.Ptr->Locked = TRUE;
                do {
                    Prev_Node = Next_Node;
                    Next_Node = Prev_Node.Ptr->Next;
                    if ((Next_Node.Dq == TRUE)                                /* ii */
                        or (Prev_Node.Ptr->Priority < Self.Ptr->Priority))    /* iii */
                        Failure = TRUE;                  /* Dequeued; try again */
                    else if (Next_Node.Ptr == NULL
                             or Next_Node.Ptr->Priority < Self.Ptr->Priority) {
                        Self.Ptr->Next.Ptr = Next_Node.Ptr;
                        if (CAS(&(Prev_Node.Ptr->Next), &Next_Node, &Self)) { /* F */
                            Self.Ptr->Next.Dq = FALSE;
                            while (Self.Ptr->Locked) ;   /* Busy-wait */
                            Success = TRUE;              /* Then use lock */
                        } else if ((Next_Node.Dq == TRUE)                     /* ii */
                                   or (Prev_Node.Ptr->Priority < Self.Ptr->Priority)) /* iii */
                            Failure = TRUE;              /* Dequeued; try again */
                        else
                            Next_Node = Prev_Node;                            /* i */
                    }
                } while (!Success and !Failure);
            }
        } while (!Success);
    }

Figure 3.5. The acquire_lock operation procedure


1. A concurrent acquire_lock operation may overtake this acquire_lock operation and insert its own record immediately after Prev_Node, as shown in Figure 3.10. In this case the Compare&Swap will fail at the position marked F in Figure 3.5. The correctness of this operation's position is not affected, so the operation continues from its current position (the line marked i in Figure 3.5).

2. A concurrent release_lock operation may overtake the acquire_lock operation and remove the record pointed to by Prev_Node, as shown in Figure 3.11. In this case, the Dq bit in the link pointer of this record will be TRUE. The algorithm checks for this condition when it scans through the queue and when it tries to commit its modifications; it detects the situation in the two places marked ii in Figure 3.5. Every time a new record is accessed (via Prev_Node), its link pointer is read into Next_Node and the Dq bit is checked. In addition, if the Compare&Swap fails, the link pointer is saved in Next_Node and the Dq bit is tested. If the Dq bit is TRUE, the algorithm starts from the beginning.

3. A concurrent release_lock operation may overtake the acquire_lock operation and remove the record pointed to by Prev_Node, and the record is then put back into the queue, as shown in Figure 3.12. If the record returns with a priority higher than or equal to Self's priority, then the position is still correct and the operation can continue. Otherwise, the operation cannot find the correct insertion point, so it has to start from the beginning. This condition is tested at the lines marked iii in Figure 3.5.

The spin-lock busy-waiting of a process is broken by the eventual release of the lock by the process immediately ahead of the waiting process.


    Procedure release_lock(L, Self) {
        Self.Ptr->Next.Dq = TRUE;               /* Mark record dequeued */
        L = Self.Ptr->Next;                     /* Release lock */
        if (Self.Ptr->Next.Ptr != NULL) {
            L.Ptr->Priority = MAX_PRIORITY;
            L.Ptr->Locked = FALSE;
        }
    }

Figure 3.6. The release_lock operation procedure

Release_lock Operation

The release_lock operation is straightforward; the algorithm is given in Figure 3.6. The process p releasing the lock sets the Dq bit in its record's link pointer to TRUE, indicating that the record is no longer in the queue. Setting the Dq bit prevents any acquire_lock operation from modifying the link. The releasing process copies the address of the successor record, if any, to L. The process then releases the lock by setting the Locked boolean variable in the record of the next waiting process to FALSE. To avoid testing special cases in the acquire_lock operation, the priority of the head record is set to the highest possible priority.

3.3 Correctness of the PR-Lock Algorithm

In this section, we present an informal argument for the correctness properties of our PR-Lock algorithm. We prove that the PR-Lock algorithm is correct by showing that it maintains a priority queue, and that the head of the priority queue is the process that holds the lock. The PR-Lock is decisive-instruction serializable [83]. Both operations of the PR-Lock algorithm have a single decisive instruction.


The decisive instruction for the acquire_lock operation is the successful Compare&Swap, and the decisive instruction for the release_lock operation is the setting of the Dq bit. Corresponding to a concurrent execution C of the queue operations, there is an equivalent (with respect to return values and final states) serial execution $S_d$ such that if operation $O_1$ executes its decisive instruction before operation $O_2$ does in C, then $O_1 < O_2$ in $S_d$. Thus, the equivalent priority queue of a PR-Lock is in a single state at any instant, simplifying the correctness proof (a concurrent data structure that is linearizable but not decisive-instruction serializable might be in several states simultaneously [37]). We use the following notation in our discussion. The PR-Lock $\mathcal{L}$ has lock pointer L, which points to the first record in the lock queue (the record of the process that holds the lock). Let there be $N$ processes $p_1, p_2, \ldots, p_N$ that participate in the lock synchronization for a priority lock $\mathcal{L}$ using the PR-Lock algorithm. As mentioned earlier, each process $p_i$ allocates a record to enqueue and dequeue. Thus, each process $p_i$ participating in the lock access is associated with a queue record $q_i$. Let $Pr(p_i)$ be a function that maps a process to its priority, a number between 1 and $N$; we also define $Pr(q_i)$, which maps a record belonging to a process $p_i$ to its priority. A priority queue is an abstract data type that consists of

• a finite set Q of elements $q_i$, $i = 1 \ldots N$;

• a function $Pr : q_i \rightarrow n_i$, where $n_i \in \mathcal{N}$ (for simplicity, we assume that every $n_i$ is unique; this assumption is not required for correctness, and in fact processes of the same priority obtain the lock in FCFS order); and

• two operations, enqueue and dequeue.


At any instant, the state of the queue can be defined as $(q_0, q_1, q_2, \ldots, q_n)$, where $q_1 <_Q q_2 <_Q \ldots <_Q q_n$ and $q_i <_Q q_j$ implies $Pr(q_i) > Pr(q_j)$. We call $q_0$ the head record of priority queue Q. The head record's process is the current lock holder. Note that the non-head records are totally ordered. The enqueue operation is defined as

$enqueue((q_0, q_1, q_2, \ldots, q_n), q) \rightarrow (q_0, q_1, \ldots, q_i, q, q_{i+1}, \ldots, q_n)$, where $Pr(q_i) \ge Pr(q) > Pr(q_{i+1})$.

The dequeue operation on a non-empty queue is defined as

$dequeue((q_0, q_1, q_2, \ldots, q_n)) \rightarrow (q_1, q_2, \ldots, q_n)$,

where the return value is $q_0$. A dequeue operation on an empty queue is undefined. For every PR-Lock $\mathcal{L}$, there is an abstract priority queue Q. Initially, both $\mathcal{L}$ and Q are empty. When a process p with a record q performs the decisive instruction for the acquire_lock operation, Q changes state to enqueue(Q, q). Similarly, when a process executes the decisive instruction for a release_lock operation, Q changes state to dequeue(Q). We show that when we observe $\mathcal{L}$, we find a structure that is equivalent to Q. To observe $\mathcal{L}$, we take a consistent snapshot [15] of the current state of the system memory. Next, we start at the lock pointer L and observe the records following the linked list. If the head record has its Dq bit set and its process has exited the acquire_lock operation, then we discard it from our observation. If we observe the same records in the same sequence in both $\mathcal{L}$ and Q, then we say that $\mathcal{L}$ and Q are equivalent, and we write $\mathcal{L} \Leftrightarrow Q$.


Figure 3.7. Observed queue $\mathcal{L}$ before and after a release_lock

Theorem 1 The representative priority queue Q is equivalent to the observed queue of the PR-Lock $\mathcal{L}$.

Proof. We prove the theorem by induction on the decisive instructions, using the following two lemmas.

Lemma 1 If $Q \Leftrightarrow \mathcal{L}$ before a release_lock decisive instruction, then $Q \Leftrightarrow \mathcal{L}$ after the release_lock decisive instruction.

Proof. Let $Q = (q_0, q_1, q_2, \ldots, q_n)$ before the release_lock decisive instruction. A release_lock operation is equivalent to a dequeue operation on the abstract queue. By definition,

$dequeue((q_0, q_1, q_2, \ldots, q_n)) \rightarrow (q_1, q_2, \ldots, q_n)$

The before and after states of $\mathcal{L}$ are shown in Figure 3.7. If L points to the record $q_0$ before the release_lock decisive instruction, the decisive instruction sets the Dq bit in $q_0$ to TRUE, removing $q_0$ from the observable queue. Thus, $Q \Leftrightarrow \mathcal{L}$ after the release_lock operation. Note that L will point to $q_1$ before the next release_lock decisive instruction.

Lemma 2 If $Q \Leftrightarrow \mathcal{L}$ before an acquire_lock decisive instruction, then $Q \Leftrightarrow \mathcal{L}$ after the acquire_lock decisive instruction.


Figure 3.8. Observed queue $\mathcal{L}$ before and after an acquire_lock on an empty queue

Figure 3.9. Observed queue $\mathcal{L}$ before and after an acquire_lock on a non-empty queue

Proof. There are two cases to consider.

Case 1: Q = () before the acquire_lock decisive instruction. The equivalent operation on the abstract queue Q is the enqueue operation; thus, $enqueue((), q) \rightarrow (q)$. If the lock $\mathcal{L}$ is empty, q's process executes a successful decisive Compare&Swap instruction to make L point to q, and acquires the lock (Figure 3.8). Clearly, $Q \Leftrightarrow \mathcal{L}$ after the acquire_lock decisive instruction.

Case 2: $Q = (q_0, q_1, q_2, \ldots, q_n)$ before the acquire_lock decisive instruction. The state of the queue Q after the acquire_lock is given by

$enqueue((q_0, q_1, q_2, \ldots, q_n), q) \rightarrow (q_0, q_1, q_2, \ldots, q_i, q, q_{i+1}, \ldots, q_n)$


The corresponding $\mathcal{L}$ before and after the acquire_lock is shown in Figure 3.9. The pointers P and N are the Prev_Node and Next_Node pointers by which q's acquire_lock operation positions its record such that the process observes $Pr(q_i) > Pr(q) > Pr(q_{i+1})$. Then the Next pointer in q is set to the address of $q_{i+1}$. The Compare&Swap instruction, marked F in Figure 3.5, attempts to make the Next pointer in $q_i$ point to q. If the Compare&Swap instruction succeeds, then it is the decisive instruction of q's process, and the resulting queue $\mathcal{L}$ is illustrated in Figure 3.9; this is equivalent to Q after the enqueue operation. The Compare&Swap succeeds only when $q_i$ is still in the queue and its Next pointer is unchanged; if it fails, one of three kinds of interference has occurred.

Case a: A concurrent acquire_lock operation A' overtakes A and inserts its record q' between $q_i$ and $q_{i+1}$, so A's Compare&Swap fails. The correctness of A's position is not affected: q's process can skip over q' and continue searching from its current position, which is what happens. This scenario is illustrated in Figure 3.10.

Case b: A release_lock operation R overtakes A and removes $q_i$ from the queue (i.e., R has set $q_i$'s Dq bit), and $q_i$ has not yet been returned to the queue. Since $q_i$ is not in the lock queue, A is lost and must start searching again.

Figure 3.10. A concurrent acquire_lock A' succeeds before A

Based on its observation of $q_i$'s link pointer, A finds that the Dq bit is set and fails, so A starts again from the beginning of the queue. This scenario is illustrated in Figure 3.11.

Case c: A release_lock operation R overtakes A and removes $q_i$ from the queue, and $q_i$ is then put back into the queue by another acquire_lock A'. If A tries to commit its operation, the pointer in $q_i$ has changed, so the Compare&Swap fails; note that even if $q_i$ is pointing to $q_{i+1}$, the version numbers prevent the decisive instruction from succeeding. If A continues searching, there are two possibilities, based on the new value of $Pr(q_i)$. If $Pr(q) > Pr(q_i)$, A is lost and cannot find the correct place to insert q. This condition is detected when the priority of $q_i$ is examined (the lines marked iii in Figure 3.5), and operation A restarts from the head of the queue. If $Pr(q) \le Pr(q_i)$, then A can still find a correct place to insert q past $q_i$, and A continues searching. This scenario is illustrated in Figure 3.12.

No matter what interference occurs, A always takes the right action. Therefore, $Q \Leftrightarrow \mathcal{L}$ after the acquire_lock decisive instruction.


Figure 3.11. A concurrent release_lock R succeeds before A

To prove the theorem, we use induction. Initially, Q = () and L points to nil, so $Q \Leftrightarrow \mathcal{L}$ is trivially true. Suppose that the theorem is true before the $i$th decisive instruction. If the $i$th decisive instruction is for an acquire_lock operation, Lemma 2 implies that $Q \Leftrightarrow \mathcal{L}$ after the $i$th decisive instruction. If the $i$th decisive instruction is for a release_lock operation, Lemma 1 implies that $Q \Leftrightarrow \mathcal{L}$ after the $i$th decisive instruction. Therefore, the inductive step holds, and hence $Q \Leftrightarrow \mathcal{L}$.

3.4 Extensions

In this section we discuss two simple extensions that increase the utility of the PR-Lock algorithm.

3.4.1 Multiple Locks

As described, a record for a PR-Lock can be used for one lock queue only (otherwise, a process might obtain a lock other than the one it desired). If the real-time system has several critical sections, each with its own lock (which is likely), each process must have a lock record for each lock queue, which wastes space. Fortunately, a simple extension of the PR-Lock algorithm allows a lock record to be used in many different lock queues. We replace the Dq bit by a Dq string of $l$ bits.


Figure 3.12. Release_lock R and acquire_lock A' succeed before A

If the Dq string evaluates to $i > 0$ when interpreted as a binary number, then the record is in the queue for lock $i$. If the Dq string evaluates to 0, then the record is (probably) not in any queue. The acquire_lock and release_lock algorithms carry through by modifying the test for being or not being in queue $i$ appropriately. We note that if a process sets nested locks, a new lock record must be used for each level of nesting. Craig [19] presents a method for reusing the same record for nested locks.

3.4.2 Backing Out

If a process does not obtain the lock after a certain deadline, it might wish to stop waiting and continue processing. The process must first remove its record from the lock queue. To do so, the process follows these steps:


1. Find the preceding record in the lock queue, using the method from the acquire_lock algorithm. If the process determines that its record is at the head of the lock queue, return with a "lock obtained" value.

2. Set the Dq bit (Dq string) of the process's record to "dequeued".

3. Perform a Compare&Swap of the predecessor record's next pointer with the process's next pointer. If the Compare&Swap fails, go to step 1. If the Compare&Swap succeeds, return with a "lock released" value.

Step 2 fixes the value of the process's successor. If the process removes itself from the queue without obtaining the lock, the Compare&Swap is the decisive instruction. If the Compare&Swap fails, the predecessor might have released the lock, or a third process might have enqueued itself as the predecessor. The process cannot distinguish between these possibilities, so it must re-search the lock queue.

3.5 Simulation Results

We simulated the execution of the PR-Lock algorithm in PROTEUS, a configurable multiprocessor simulator [12]. We also implemented the MCS-Lock and Markatos' lock to demonstrate the differences in acquisition and release time characteristics. In the simulation, we use a multiprocessor model with eight processors and a global shared memory; each processor has a local cache of 2048 bytes. In PROTEUS, the unit of execution time is the cycle. Each process executes for a uniformly distributed random time, in the range 1 to 35 cycles, before it issues an acquire-lock request. After acquiring the lock, the process stays in the critical section for a fixed number of cycles (150) plus another uniformly distributed random number of cycles (1 to 400) before releasing the lock. This procedure is repeated fifty times.
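
In outline, each simulated process behaves as in the following sketch; work and the lock interface are illustrative stand-ins for the PROTEUS primitives, and rand_range models the uniform delays described above.

    #include <stdlib.h>

    extern void work(int cycles);                 /* consume simulated cycles */
    extern void acquire_lock(void *L, void *self);
    extern void release_lock(void *L, void *self);

    static int rand_range(int lo, int hi)
    {
        return lo + rand() % (hi - lo + 1);
    }

    void simulated_process(void *L, void *self)
    {
        for (int round = 0; round < 50; round++) {
            work(rand_range(1, 35));              /* local work before request */
            acquire_lock(L, self);
            work(150 + rand_range(1, 400));       /* inside the critical section */
            release_lock(L, self);
        }
    }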


3.5 Simulation Results

We simulated the execution of the PR-Lock algorithm in PROTEUS, a configurable multiprocessor simulator [12]. We also implemented the MCS-Lock and Markatos' lock to demonstrate the difference in the acquisition and release time characteristics. In the simulation, we use a multiprocessor model with eight processors and a global shared memory. Each processor has a local cache memory of 2048 bytes. In PROTEUS, the units of execution time are cycles. Each process executes for a uniformly distributed random time, in the range 1 to 35 cycles, before it issues an acquire-lock request. After acquiring the lock, the process stays in the critical section for a fixed number of cycles (150), plus another uniformly distributed random number of cycles (1 to 400), before releasing the lock. This procedure is repeated fifty times. The average number of cycles taken by a process to acquire the lock is then computed. PROTEUS simulates parallelism by repeatedly executing a processor's program for a time quantum, Q. In our simulations, Q = 10. The priority of a process is set equal to the process/processor number; the lower the number, the higher the priority of the process.

Figure 3.13 shows the average time taken for a process to acquire a lock using the MCS-Lock algorithm, the PR-Lock algorithm, and Markatos' lock algorithm. A process using the MCS-Lock algorithm has to wait in the FIFO queue behind all other processes in every round. However, a process using the PR-Lock algorithm waits for a time that is proportional to the number of higher priority processes. For example, the highest and second highest priority processes each wait, on average, about one critical section period. We note that the two highest priority processes have about the same acquire-lock execution time because they alternate in acquiring the lock. Only after both of these processes have completed their execution can the third and fourth highest priority processes obtain the lock. Figure 3.13 clearly demonstrates that the average acquisition time using the PR-Lock is proportional to a process' priority, whereas with the MCS-Lock algorithm it is proportional to the number of processes. This feature makes the PR-Lock algorithm attractive for use in real-time systems.

The same prioritized lock-acquisition behavior is shown by Markatos' algorithm, but the average time to acquire a lock is 50% greater than when the PR-Lock is used. At first this result is puzzling, because Markatos' lock performs the majority of its work when the lock is released, and the PR-Lock performs its work when the lock is acquired. However, the time to release a lock is part of the time spent in the critical section, and the time to acquire a lock depends primarily on the time spent in the critical section by the preceding lock holders. Thus, the PR-Lock allows much faster access to the critical section.


As we will see, the PR-Lock also allows more predictable access to the critical section.

Finally, we compared the time required to release a lock using both the PR-Lock and Markatos' lock. These results are shown in Figure 3.14. The time to release a lock using the PR-Lock is small, and is consistent across all of the processes. Releasing a lock using Markatos' lock requires significantly more time. Furthermore, in our experiments a high priority process spends significantly more time releasing a lock than a low priority process does. This behavior is a result of the way that the simulation was run. When high priority processes are executing, all low priority processes are blocked in the queue. As a result, many records must be searched when a high priority process releases a lock. Thus, a high priority process does work on behalf of low priority processes. The time required for a high priority process to release its lock depends on the number of blocked processes in the queue. The result is a long and unpredictable amount of time required to release a lock. Since the lock must be released before the next process can acquire it, the time required to acquire a lock is also made long and unpredictable.

Most of the time, the cache-hit ratio is 95% or higher on each of the processors using the PR-Lock algorithm, and we found an average cache-hit range of 99.72% to 99.87%. Thus, the PR-Lock generates very little network or memory contention in spite of the processes using busy-waiting.

3.6 Conclusion

In this chapter, we present a priority spin-lock synchronization algorithm, the PR-Lock, which is suitable for real-time shared-memory multiprocessors. The PR-Lock algorithm is characterized by prioritized lock acquisition, a low release overhead, very little bus contention, and well-defined semantics. Simulation results show that the PR-Lock algorithm performs well in practice.


This priority lock algorithm can be used as presented for mutually exclusive access to a critical section, or it can be used to provide higher-level synchronization constructs such as prioritized semaphores and monitors. The PR-Lock maintains a pointer to the record of the lock holder, so the PR-Lock can be used to implement priority inheritance protocols. Finally, the PR-Lock algorithm can be adapted for use as a single-dequeuer, multiple-enqueuer parallel priority queue. While several prioritized spin-locks have been proposed, the PR-Lock has the following advantages.

• The algorithm is contention free.

• A higher priority process does not have to work for a lower priority process while releasing a lock. As a result, the time required to acquire and release a lock is fast and predictable.

• The PR-Lock has a well-defined acquire-lock point.

• The PR-Lock maintains a pointer to the process using the lock, which facilitates implementing priority inheritance protocols.

For future work, we are interested in prioritizing access to other operating system structures to make them more appropriate for use in a real-time parallel operating system.


Figure 3.13. Comparison of lock acquisition times


Figure 3.14. Comparison of lock release times


CHAPTER 4
INTERRUPTIBLE CRITICAL SECTIONS

4.1 Introduction

The scheduling of independent real-time tasks is well understood, as optimal scheduling algorithms have been proposed for periodic and aperiodic tasks on uniprocessor [21, 59] and multiprocessor systems [20, 60, 70]. However, if the tasks communicate through shared critical sections, a low-priority task that holds a lock may block a high-priority task that requires the lock, causing a priority inversion. In this chapter, we present a method for real-time synchronization that avoids priority inversions.

We present a new approach to synchronization on uniprocessors with special applicability to embedded and real-time systems. Existing methods for synchronization in real-time systems are pessimistic, and use blocking to enforce concurrency control. While protocols to bound the blocking of high priority tasks exist, high priority tasks can still be blocked by low priority tasks. In addition, these protocols require a complex interaction with the scheduler. We propose interruptible critical sections (i.e., optimistic synchronization) as an alternative to purely blocking methods. Practical optimistic synchronization requires techniques for writing interruptible critical sections, and system support for detecting critical section access conflicts. We discuss our implementation of an interruptible lock on a system running the pSOS+ real-time operating system. Our experimental performance results show that interruptible locks reduce the variance in the response time of the highest priority task with only a small impact on the performance of the low priority tasks. We show how interruptible critical sections can be combined with the Priority Ceiling Protocol, and present an analysis which shows that interruptible locks improve the schedulability of task sets that have high priority tasks with tight deadlines.


Rajkumar, Sha, and Lehoczky [80] have proposed the Priority Ceiling Protocol (PCP) to minimize the effect of priority inversion. The priority ceiling of a semaphore S is the priority of the highest priority task that will ever lock S. A task may lock a semaphore only if its priority is higher than the priority ceilings of all locked semaphores (except for the semaphores that it has locked). The PCP guarantees that a task will be blocked by a lower priority task at most once during its execution. However, the tasks must have static priorities in order to apply the Priority Ceiling Protocol. In addition, blocking for even the duration of one critical section may be excessive. Rajkumar, Sha, and Lehoczky have extended the Priority Ceiling Protocol to work in a multiprocessor system [79].

Blocking-based synchronization algorithms have been extended to work with dynamic-priority schedulers. Baker [8] presents a pre-allocation based synchronization algorithm that can manage resources with multiple instances. A task's execution is delayed until the scheduler can guarantee that the task can execute without blocking a higher priority task. Tripathi and Nirkhe [95], and Faulk and Parnas [24], also discuss pre-allocation based scheduling methods. Chen and Lin [16] extend the Priority Ceiling Protocol to permit dynamically-assigned priorities. Chen and Lin [17] extend the protocol in [16] to account for multiple resource instances.

Previous approaches to real-time synchronization suffer from several drawbacks. First, a high-priority task might be forced to wait for a low-priority task to complete a critical section. Mercer and Tokuda [68] note that the blocking of high-priority tasks must be kept to a minimum in order to ensure the responsiveness of the real-time system. If tasks can have delayed release times [57], a high priority task might not be able to block for the duration of a critical section and still be guaranteed to meet its deadlines.


Jeffay [43] discusses the additional feasibility conditions required if tasks have preemption constraints. Second, dynamic-priority scheduling algorithms are feasible at much higher CPU utilizations than static-priority scheduling algorithms [59], and dynamic-priority schedulers might be required for aperiodic tasks. The simple Priority Ceiling Protocol of Rajkumar, Sha, and Lehoczky [80] can be applied to static-priority schedulers only. The dynamic-priority synchronization protocols [8, 16, 17] are complex, and must be closely integrated with the scheduling algorithm.

We present a different approach to synchronization, one which guarantees that a high-priority task never waits for a low-priority task at a critical section. We introduce the idea of an Interruptible Critical Section (ICS), which is a critical section protected by optimistic concurrency control instead of by blocking. A task calculates its modifications to the shared data structure, then attempts to commit its modification. If a higher priority task previously committed a conflicting modification, the lower priority task fails to commit, and must try again (as in optimistic concurrency control [9]). Otherwise, the task succeeds, and continues its work. The synchronization algorithms are not tied to the scheduling algorithm, simplifying the design of the real-time operating system.

A purely optimistic approach to synchronization can starve low priority tasks, leading to poor performance (i.e., low schedulability). We show how to combine an ICS with locking, to create interruptible locks. Interruptible locks can be used in conjunction with the PCP to provide schedulability guarantees for the low priority tasks. We present an analysis of periodic tasks that use interruptible locks with the Priority Ceiling Protocol.

We present our implementation of ICSs and interruptible locks on the pSOS+ real-time operating system, and show that we can reduce the maximum response time of a high priority task.


Our implementation of interruptible locks is realized through a small amount of code, and did not require a modification of the pSOS+ kernel (although it did make use of a kernel call-out routine). We note that pSOS+ does not provide priority inheritance.

Interruptible critical sections are best applied in embedded or real-time operating systems to improve the schedulability of the highest priority tasks. An operating system for embedded systems will of necessity provide the flexibility required to implement an ICS (as pSOS+ does). In such an environment, high priority tasks can enter an ICS without making a system call, thus avoiding the associated overhead. Although an ICS cannot reserve resources for a process (it can, however, co-exist with blocking algorithms [8, 17, 80], which can), an ICS can be used to communicate with a high-priority device driver. Low priority tasks submit requests to the device driver through the ICS, and the device is serviced by a high priority driver which obtains commands through the ICS. In Section 4.8, we provide examples of task sets that cannot be guaranteed to meet their deadlines using the Priority Ceiling Protocol, but are feasible if interruptible locks are used.

4.2 Interruptible Critical Sections

We build our optimistic synchronization methods on Restartable Atomic Sequences (RAS) [11]. A RAS is a section of code that is re-executed from the beginning if a context switch occurs while a process is executing in the code section. The re-execution of a RAS is enforced by the kernel context-switch mechanism. If the kernel detects that the process program counter is within a RAS on a context switch, the kernel sets the program counter to the start of the RAS. Bershad et al. show that a RAS implementation of an atomic test-and-set has better performance than a hardware test-and-set on many architectures, and is much faster than kernel-level synchronization [11].


We note that the idea of scheduler support for critical sections is well established. In 4.3BSD UNIX, a system call that is interrupted by a signal is restarted using the long jump instruction [56]. Anderson et al. [5] argue that operating system support for parallel threads should recognize that a preempted thread is executing in a critical section, and execute the preempted thread until the thread exits the critical section. In addition, Moss and Kohler coded several of the run-time support calls of the Trellis/Owl language so that they could be restarted if interrupted [73].

The simple mechanism described in [11] is too crude for our purposes, because there is no guarantee that a conflicting operation was performed while other processes had control of the CPU. The unnecessary re-executions are not a problem for the critical sections described in [11], because those critical sections are very short and a re-execution is unlikely. In addition, the authors of [11] did not need to consider the predictability required by real-time systems. If the critical section execution occupies a large fraction of a time slice, then a context switch is far more likely. To guarantee progress, a process that is interrupted in its critical section execution should be restarted only if a conflicting operation was executed. We call a region of code that is protected in this manner an interruptible critical section (ICS). Restarting a critical section only if a conflicting operation was performed improves real-time schedulability, because a low priority task can experience restarts only from higher priority tasks that share a critical section, instead of from all higher priority tasks. We indicate an interruptible critical section by explicitly declaring it so.


    interruptible_critical_section {
        stmt1;
        ...
        stmtn;
    }

As an example, we can implement a shared stack as an ICS by using the following code.

    struct stack_elem {
        data item;
        struct stack_elem *next;
    } *sp;

    push(elem)
    struct stack_elem *elem;
    {
        interruptible_critical_section {
            elem->next = sp;
            sp = elem;              /* the decisive, globally visible write */
        }
    }

    struct stack_elem *pop()
    {
        struct stack_elem *temp;

        interruptible_critical_section {
            temp = sp;
            if (sp != NULL)
                sp = sp->next;      /* the decisive, globally visible write */
        }
        return (temp);
    }

4.3 Implementing Interruptible Critical Sections

4.3.1 Background

The techniques used to write interruptible critical sections are based on the ideas developed for non-locking concurrent data structures. Herlihy [35] introduces the idea of non-blocking concurrent objects. An algorithm for a non-blocking object provides the guarantee that one of the processes that accesses the object makes progress in a finite number of steps.


Herlihy provides a method for implementing non-blocking objects that swaps in the new value of the object in a single write. Our methods are similar to an extension of Herlihy's work proposed by Turek, Shasha, and Prakash [97]. In the context of real-time synchronization, non-blocking shared objects are desirable because a high priority task will not be blocked by a low priority task.

In a uniprocessor system, only one process at a time will access the shared data structures. We can take advantage of the serial but interruptible access to simplify the specification of the existing non-blocking techniques, and to improve on their efficiency. In an interruptible critical section, a process can perform only one write that is visible to other processes. Furthermore, the globally visible write must be the last instruction in the protected region. Therefore, a process that is executing an ICS records its updates in a private buffer (the commit buffer). The final write commits the updates that are recorded in the buffer by setting a commit flag. Any subsequent process that executes the ICS performs the updates and clears the commit flag. This approach to optimistic synchronization is discussed by Alemany and Felton [2] and by Bershad [10]. In this chapter, we discuss the following implementation details that do not appear in the previous work.

• Efficient implementation in a uniprocessor system.

• How to perform the bulk of the ICS processing outside of the kernel.

• How to share commit buffers among processes.

• How to use Herlihy's small-object protocol [35] to minimize the number of writes that must be placed in the commit buffer.

• How to apply optimistic synchronization to real-time systems.

• An analysis of interruptible locks in a system of periodic tasks.


4.3.2 Implementation

In the following discussion, we assume that if a process experiences a context switch while executing an ICS, the process re-executes from the start of the ICS when it regains control of the CPU (as in [11]). In Section 4.4, we discuss the modification necessary to permit re-execution only when a conflicting operation commits. The modification is minor, but the fully general algorithm would confuse the current discussion.

In [97], Turek et al. propose a method for transforming locking data structures into non-blocking data structures. The key to the transformation is to post a continuation instead of a lock. The continuation contains the modifications that the process intends to perform. If a process attempts to post a continuation but is blocked (because a continuation is already posted), the 'blocked' process performs the actions listed in the continuation, removes the continuation, then re-attempts to post its own continuation. As a result, a blocked process can unblock itself. Although Turek's approach simplifies the process of writing a critical section, a direct translation of Turek's algorithm can require a high priority process to perform the work of many low priority processes that have posted but not yet performed their actions. An easy modification of Turek's approach results in a simple algorithm which guarantees that a high priority process does the work of at most one low priority process. We present an algorithm for an ICS based on this approach here. We note that one can write an ICS by a rather different approach, the details of which are contained in [44].


Every shared concurrent object has a single commit record, and a flag indicating whether the commit record is valid or invalid. When a process starts executing a critical section, it checks to see if a previous operation left an unexecuted commit record (the flag is valid). If so, the process executes the writes indicated by the commit record, then sets the flag to invalid. The process then performs its operation, recording all intended writes in the commit record. For the decisive instruction, the process sets the flag to valid. A typical critical section has the following form.

    struct commit_record_element{ word *lhs, rhs } commit_record[MAX]
    boolean valid

    critical_section()
        interruptible_critical_section{
            if (valid)
                instruction=0
                while (instruction < MAX and commit_record[instruction].lhs != NULL)
                    *(commit_record[instruction].lhs) = commit_record[instruction].rhs
                    instruction++
                valid=FALSE
            // perform the operation, recording each global write
            // in commit_record instead of executing it
            valid=TRUE    // the decisive instruction
        }

As an example, the following code inserts an element into a doubly linked list.

    insert(elem)
    list_elem *elem
        list_elem *prev, *next
        interruptible_critical_section{
            if (valid)
                instruction=0
                while (instruction < 2 and commit_record[instruction].lhs != NULL)
                    *(commit_record[instruction].lhs) = commit_record[instruction].rhs
                    instruction++
                valid=FALSE
            prev=NULL; next=head
            while (not found_position(next))
                prev=next; next=next->forward
            // Found the insertion point
            elem->forward=next; elem->backward=prev
            if (prev==NULL)
                commit_record[0].lhs=&head
            else
                commit_record[0].lhs=&(prev->forward)
            commit_record[0].rhs=elem
            if (next != NULL)
                commit_record[1].lhs=&(next->backward)
                commit_record[1].rhs=elem
            else
                commit_record[1].lhs=NULL
            valid=TRUE    // the decisive instruction
        }

The transformation from a blocking-based critical section to an ICS is straightforward. The cleanup phase is inserted at the beginning of the critical section. Whenever a write into global data is performed in the blocking-based critical section, the write is recorded in the commit record in the ICS. The last statement of the ICS sets valid to TRUE. If operations perform few writes, then a high priority task performs at most a few instructions on behalf of a low priority task. Further, the costs balance, because the high priority task leaves its commit record for a different task to execute. In a blocking-based approach, the high priority task would incur a context switch, thus costing the context switch overhead and also overhead due to cache line invalidations.


4.3.3 Reducing the Clean-up

If the critical section requires a small modification (or can be broken into several sections, each requiring only a small modification), then the basic approach allows a low priority operation to block a high priority operation for only a short period. If an operation performs a substantial modification, and the number of modifications that an operation commits might vary widely, then a high priority operation might spend a substantial amount of time performing a low priority operation's updates to the data structure.

In [33], Herlihy proposes a 'shadow-page' method for implementing a non-blocking concurrent data structure. An operation calculates its modifications to the data structure in a set of privately allocated (shadow) records, then links its records into the data structure with its decisive instruction. The process is illustrated in Figure 4.1. The blocks in the data structure marked 'g' are replaced by the shadow blocks. An operation performs its decisive instruction by swapping the anchor pointer from the current root to the shadow root. The blocks that are removed from the data structure are garbage collected by the successful operation and are (eventually) made available to other operations. We note that the decisive instruction must always be to swap the anchor, in order to ensure serializability in a parallel system. The most complicated part of Herlihy's protocol is managing the garbage-collected records. The protocols are complex, and require O(P²) space, where P is the number of processes that access the shared object. We can take advantage of the serial access to the data structure in the ICS to simplify the implementation and reduce the space overhead.


Figure 4.1. Herlihy's non-blocking data structures

The process of implementing a shadow-page ICS is illustrated in Figure 4.2. A process obtains the records it needs to prepare its modifications from a global stack of records. The global record stack provides the records for all operations that use records of the size it stores. When a process obtains a record from the global stack, it does not remove the record from the stack. Instead, the modifications are made to records while they are still on the stack. A local variable, current, keeps track of the last allocated record from the record stack. Another pair of local variables, g_head and g_tail, keeps track of the records to be removed from the data structure. To commit the modification to the data structure, the operation must remove the records it used from the stack of global records, add the garbage records to the global stack, and adjust a pointer in the data structure. These three modifications can be performed using a regular commit record.

Before listing the procedures that implement the shadow-page ICS, we note a couple of details. First, every record in the data structure must contain enough additional space to thread a list through it, whether the garbage list or the global record stack.


Figure 4.2. Shadow-page ICS

Second, the critical instruction of the operation is to declare that the commit record is valid. As a result, the commit record can contain instructions to change any links in the data structure. As an example, in Figure 4.2, a link from the root instead of the anchor is modified.

We assume that every record has a field next that is used to thread the global record and the garbage lists through the nodes. The procedure for acquiring a new record is

    record *getbuf(record **current)
        record *temp
        temp=*current
        *current=(*current)->next
        return temp

The procedure to declare that a node is garbage is given by

    garbage(record *elem, record **g_head, record **g_tail)
        if (*g_tail==NULL)
            *g_tail=elem
        elem->next=*g_head
        *g_head=elem

A typical critical section is given by


    struct commit_record_element{ word *lhs, rhs } commit_record[3]
    boolean valid
    Global record *pool

    critical_section()
        record *current, *g_head, *g_tail
        interruptible_critical_section{
            if (valid)
                instruction=0
                while (instruction < 3 and commit_record[instruction].lhs != NULL)
                    *(commit_record[instruction].lhs) = commit_record[instruction].rhs
                    instruction++
                valid=FALSE
            // Initialize the list pointers
            current=pool
            g_head=g_tail=NULL
            // Compute the modifications to the data structure
            // using the getbuf and garbage procedures
            // Prepare the commit record
            commit_record[0].lhs=&(g_tail->next)    // garbage list joins the stack
            commit_record[0].rhs=current
            commit_record[1].lhs=&pool              // garbage becomes the new top
            commit_record[1].rhs=g_head
            commit_record[2].lhs=critical_link      // splice the update in
            commit_record[2].rhs=critical_link_value
            valid=TRUE    // commit your update
        }

The shadow-page ICS requires that a high priority operation perform at most three writes on behalf of a low priority operation when the shared data structure is a tree. Arbitrary graph structures might require more updates, but the technique applies similarly. Since a high-priority operation does not perform its own clean-up, the costs balance, and again the high priority task avoids the context switch overhead.
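As a concrete instantiation of the placeholder above, the following sketch replaces the value held in a record n whose predecessor in the structure is p; the field names follow the doubly linked list example, and the fragment is meant to be read in place of the "compute the modifications" comment.

    new = getbuf(&current)                  // shadow record for the new value
    new->item = x
    new->forward = n->forward
    garbage(n, &g_head, &g_tail)            // the old record becomes garbage
    // the three committed writes
    commit_record[0].lhs = &(g_tail->next)
    commit_record[0].rhs = current
    commit_record[1].lhs = &pool
    commit_record[1].rhs = g_head
    commit_record[2].lhs = &(p->forward)    // the critical link
    commit_record[2].rhs = new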


The space requirements of a shadow-page ICS are independent of the number of competing processes, as the global pool must be initialized with enough records to allow the data structure to reach its maximum size, plus the number of records in the largest modification. Furthermore, the global pool can be shared among several data structures (in which case they must share a commit record). The linked list that is threaded through the data structure imposes an O(1) penalty on every node in the data structure.

4.4 System Support

If an interruptible critical section is to be efficient, then a process executing one should be restarted only if a conflicting operation occurs. Thus, information about critical section executions must be transmitted to the kernel. In this section, we describe a simple and efficient method of providing kernel-level support for interruptible critical sections.

An operating system must have a small context switch overhead to achieve good performance. Thus, the context-switch time support for an ICS must be kept to a minimum. However, we would like to make the mechanism as flexible as possible. In addition, we would like to avoid making kernel traps to set up a request for critical section entry. To be efficient, information about conflicting executions must be passively transmitted to the kernel.

With every critical section, we associate an execution count, cs_count. When a process enters a critical section, it reads cs_count into a local variable, process_count. When the process completes the critical section, it increments cs_count. Thus, the kernel can detect that a conflicting operation occurred when the process_count of the switched-in process is different from the cs_count of the critical section being executed.


We use this mechanism as the basis of our context switch support for the ICS. Every critical section has a control block with the following information.

1. The starting and ending locations of the critical section code.

2. The cs_count.

3. Additional structures necessary to implement interruptible critical sections.

Every process that executes interruptible critical sections has a block of memory in user/kernel space that contains the following variables.

1. A flag that is set if the process is executing an interruptible critical section.

2. A pointer to the critical section control block.

3. process_count.

On a context switch, the kernel executes the following code before giving control to the switched-in process.

    If the next process to run is executing an ICS
        Find the critical section control block
        If the program counter of the next process to run is inside the ICS
            If cs_count != process_count
                Set the program counter of the next process to run
                to the start of the ICS

To take advantage of the kernel mechanism, the process loads cs_count into process_count before entering the ICS, and increments cs_count before exiting the ICS. The following is a first attempt at writing the entry and exit code for an ICS.


    // Entry code
    Make the process' control block point to the
        critical section control block.
    Set the flag in the process' control block.
    process_count=cs_count
    Begin_ICS:    // start of the ICS

    // Exit code
    End_ICS:
    cs_count++
    Reset the flag in the process' control block.

The problem with the above entry and exit code is that it does not cooperate with the code that implements the interruptible critical section. The ICS expects that the last instruction in the restartable region sets valid to TRUE. If valid is set before cs_count is incremented, then an incorrect execution can result (either an operation is executed twice, or a committed operation is ignored). If cs_count is incremented before valid is set, then a process can cause itself to restart. We can avoid these race conditions by having a single write that both commits the operation and increments cs_count. With each critical section, we associate a second execution counter, aux_count, that normally has the same value as cs_count. The last instruction in the restartable region increments cs_count. A process can detect that an operation has recently committed by testing aux_count and cs_count for equality. If they are different, the process performs the writes of the previous operation. The process signals that all of the updates are performed by setting aux_count=cs_count.

There is one remaining problem. When two operations execute concurrently, they interfere when they record their writes in the commit record. If the system uses strict priority scheduling, a high priority operation will overwrite the concurrent lower priority process' updates to the commit record, then force the lower priority process to restart. If the executions of the two operations can be interleaved, then they must have their own commit records to record their updates. But then, when an operation commits, it must indicate which commit record contains the update.


This is done by incrementing cs_count by the commit record index when committing. The new exit code is

    // Exit code
    End_ICS:
    cs_count+=process_number
    Reset the flag in the process' control block.

The code in the ICS to detect and perform a committed operation's updates is

    index=cs_count-aux_count
    if (index != 0)
        instruction=0
        while (instruction < MAX and commit_record[index][instruction].lhs != NULL)
            *(commit_record[index][instruction].lhs) = commit_record[index][instruction].rhs
            instruction++
        aux_count=cs_count

4.5 Interruptible Locks

Combining an ICS with a lock in this way improves the predictability of a set of real-time tasks. Furthermore, tasks that must acquire a semaphore can be required to use the priority ceiling protocol. Our analysis section shows that a combination of interruptible locks and the priority ceiling improves the schedulability of the low priority tasks. The entry and exit code is changed to the following (we assume here that lower priority numbers mean lower priorities).

    // Entry code
    Make the process' control block point to the
        critical section control block.
    Set the flag in the process' control block.
    process_count=cs_count
    if (my_priority < cutoff_priority)
        P(S)
    Begin_ICS:    // start of the ICS

    // Exit code
    End_ICS:
    cs_count+=process_number
    if (my_priority < cutoff_priority)
        V(S)
    Reset the flag in the process' control block.

Interruptible locks also reduce the space requirements for an ICS with multitasking processes. Since the processes which set the lock will not execute concurrently, they can share a commit record. In a typical use of interruptible locks, only one very high priority process will be able to interrupt the lock, so only two commit records are needed.
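In C, the entry and exit protocol might look as follows. This is a sketch only, using fields like those of the control blocks defined in Section 4.6 and pSOS+-style semaphore calls (sm_p()/sm_v()); the cutoff_priority parameter and the helper names are ours, not a pSOS+ API.

    void ilock_enter(struct ICS_Tstruct *t, struct ICS_Lstruct *cs,
                     int my_priority, int cutoff_priority,
                     unsigned long sem)
    {
        t->ilp  = cs;                    /* point at the CS control block */
        t->flag = cs->id;                /* tell the context-switch hook */
        t->process_count = cs->cs_count;
        if (my_priority < cutoff_priority)
            sm_p(sem, SM_WAIT, 0);       /* low priority tasks serialize */
        /* Begin_ICS: the interruptible critical section body follows */
    }

    void ilock_exit(struct ICS_Tstruct *t, struct ICS_Lstruct *cs,
                    int my_priority, int cutoff_priority,
                    unsigned long sem, int process_number)
    {
        /* End_ICS: the increment commits and names our commit record */
        cs->cs_count += process_number;
        if (my_priority < cutoff_priority)
            sm_v(sem);
        t->flag = 0;
    }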


4.6 Implementation

We implemented ICS support in a VMEexec [76] system development environment with the pSOS+ [75] real-time, multi-tasking operating system kernel. The VMEexec system consists of a host running the VMEmodule-based SYSTEM V/68 operating system and a set of VMEmodule target processors running the pSOS+ kernel. In our configuration, we have six MVME147 VMEmodules based on the Motorola MC68030, with 4 Mb of shared memory on each module. One VMEmodule is used as a host processor running SYSTEM V/68, and the rest are real-time target processors running the pSOS+ kernel. For the experiments described in this chapter, we made use of only one of the target processors.

pSOS+ is a real-time, multi-tasking kernel that supports multiprocessors. It provides a rich set of system services, including task management, shared-memory regions, synchronous/asynchronous signals, semaphores, and messages. One particular feature that pSOS+ supports is user-written routines that can be called at the start of a task, during a context switch, and at the end of a task. This feature allows us to implement ICS support without modifying the kernel.

We use two data structures to implement the ICS: one for the critical section and one for each task that uses the critical section. The global lock structure consists of a critical section identifier, a counter that tracks the number of times the critical section has been executed, and the critical section bounds.

    struct ICS_Lstruct {
        int  id;         /* ID of this critical section */
        int  cs_count;   /* Global Execution Count */
        char *cslow;     /* CS Low Address */
        char *cshigh;    /* CS High Address */
    }

The structure local to a task consists of a copy of the ICS execution count, a count of the number of times the critical section is retried on any invocation (for statistics), a pointer to the ICS_Lstruct, and a flag to indicate that the task is entering the critical section.


    struct ICS_Tstruct {
        int process_count;        /* Local Execution Count */
        struct ICS_Lstruct *ilp;  /* Interruptible Lock Record Pointer */
        int icount;               /* Interrupt Count of a task */
        int flag;                 /* Flag = ID => In CS; = 0 => Not */
    }

The ICS implementation code consists of two parts: the ICSctxsw routine, which provides the ICS lock mechanism, and the ICSclient task, which uses the ICS mechanism. We have already discussed the algorithms used by the ICSclient task in Section 4.4.

4.6.1 ICSctxsw Routine

The ICSctxsw routine is integrated with the pSOS+ kernel as a user-written routine that is called during a context switch. The call occurs at the point where the context of the switched-out task has been completely saved, and before the context of the switched-in task is loaded. pSOS+ provides the addresses of the Task Control Blocks (TCBs) of both the switched-in task and the switched-out task in machine registers. The TCB contains all the context of a task, including the Program Counter (PC). ICSctxsw can reset the PC in the TCB of a switched-in task, if required. pSOS+ provides a set of eight software-defined user registers (USRs) that a task can access in the TCB. User register 0, U_REG0, is used to contain the address of the ICS_Tstruct of a task using the ICS.

    ICSctxsw()
    {
        struct tcb *in_tcb;
        struct ICS_Tstruct *tlp;

        load in_tcb from the machine register;
        tlp = Get U_REG0 from in_tcb;

        if (tlp != NULL && tlp->flag == LOCKID) {
            if (in_tcb->currpc >= tlp->ilp->cslow &&
                in_tcb->currpc <= tlp->ilp->cshigh) {
                if (tlp->process_count != tlp->ilp->cs_count)
                    in_tcb->currpc = tlp->ilp->cslow;
            }
        }
    }

ICSctxsw checks whether the program counter (PC) of the task about to be run is within the critical section region, and if so, applies the criterion for resetting the PC to the beginning of the critical section. If the criterion is met, the task is forced to re-execute the critical section. Otherwise, the task is allowed to continue.

4.6.2 User-level Entry and Exit

The ICS entry and exit code that is used in conjunction with the ICSctxsw routine must (in general) be kernel calls, because the task control block is modified. To permit user-level synchronization, the entry and exit calls must be designed so that bad parameters cannot be passed. Instead of storing the critical section control blocks (ICS_Lstruct) in arbitrary locations, they are stored in an array in kernel space. Registering an ICS requires a call to fill in one of the control blocks. In the task ICS control block (ICS_Tstruct), we store the index of the control block of the ICS that is being executed instead of a pointer to it (or 0 if no ICS is being executed). In the context switch routine (ICSctxsw), the index of the ICS control block is used in place of the reference. If the number of allowed ICS control blocks is a power of 2, then bounds checking can be done by masking out the high order bits of the index in the task ICS control block. An invalid index causes no problems, since the PC won't be in the specified range.
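A sketch of the masked bounds check follows, assuming a table of NUM_ICS control blocks with NUM_ICS a power of 2; the ics_index field and the routine name are illustrative.

    #define NUM_ICS 8                         /* must be a power of 2 */
    struct ICS_Lstruct ics_table[NUM_ICS];    /* array in kernel space */

    void ICS_check(struct tcb *in_tcb, struct ICS_Tstruct *tlp)
    {
        int index = tlp->ics_index & (NUM_ICS - 1);   /* cheap bounds check */
        struct ICS_Lstruct *ilp = &ics_table[index];

        /* An invalid index is harmless: the PC of the switched-in task
           will not fall within that control block's bounds, so the
           restart test below simply fails. */
        if (in_tcb->currpc >= ilp->cslow && in_tcb->currpc <= ilp->cshigh &&
            tlp->process_count != ilp->cs_count)
            in_tcb->currpc = ilp->cslow;
    }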


4.7 Experimental Performance Results

We tested the performance of interruptible locks on a shared priority queue. There are three low priority enqueue tasks (of equal priority) and a single high priority dequeue task. This experiment corresponds to several computational tasks providing data for a high priority I/O task. All four tasks are started under the control of a low priority parent task. The parent and the tasks communicate through message queues. We compared four types of mechanisms.

1. Interruptible Critical Sections: All tasks immediately enter the ICS.

2. Interruptible Locks: The enqueuing tasks set and release a semaphore; the dequeuing task does not.

3. Non-prioritized Semaphore Locks: All of the tasks acquire a semaphore before entering the critical section. The semaphore lock is granted on a FCFS basis.

4. Prioritized Semaphore Locks: Same as the above, but the semaphore is granted on a priority basis.

Parameters

In the first experiment, each task performs 10,000 enqueue (dequeue) operations, but we stop collecting statistics after any task completes. Each enqueue task spins for 7 ticks (about 70 ms), then executes a 1-tick critical section. The time quantum for a task is 2 ticks. The dequeue task sleeps for 10 ticks before entering a 1-tick critical section. We collect the time to execute a critical section, and we create histograms that show the frequency with which the critical section execution takes a particular amount of time.
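The shape of the two task bodies is sketched below; spin(), sleep_ticks(), enter_cs()/leave_cs(), and the queue operations are stand-ins for the actual pSOS+ code (enter_cs()/leave_cs() is whichever of the four mechanisms is being measured), and the tick counts are the first experiment's parameters.

    void enqueue_task(void)              /* three instances, low priority */
    {
        int i;
        for (i = 0; i < 10000; i++) {
            spin(7);                     /* ~70 ms of local work */
            enter_cs();                  /* 1-tick critical section: */
            enqueue_op();                /*   insert into the shared queue */
            leave_cs();
        }
    }

    void dequeue_task(void)              /* one instance, high priority */
    {
        int i;
        for (i = 0; i < 10000; i++) {
            sleep_ticks(10);             /* periodic, I/O-style activation */
            enter_cs();                  /* 1-tick critical section: */
            dequeue_op();                /*   remove from the shared queue */
            leave_cs();
        }
    }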


The performance of the non-prioritized semaphore is shown in Figure 4.3. The dequeue operation sometimes experiences a long delay, and the time to execute enqueue operations is moderate. Since pSOS+ offers prioritized semaphores, a fairer comparison should use them. This data is shown in Figure 4.4. There is a slight decrease in the dequeue and enqueue response times, but the dequeue operation still experiences a long delay a few times. In Figure 4.5, we show the time to execute the enqueue and dequeue critical sections using interruptible critical sections. The dequeue operation is always performed without delay, and the enqueue operations perform as well as when using the prioritized semaphores. In Figure 4.6, we use interruptible locks. The performance is comparable to the interruptible critical sections.

Using an ICS can cause poor performance among low priority tasks if the critical sections have a high utilization. In the second experiment, the enqueue task spins for 2 ticks instead of 7 ticks, and then executes a 4-tick critical section. The dequeue task sleeps for 20 ticks and executes a 1-tick critical section. These parameters are selected to exaggerate the conflicts among the tasks, to show the restart problems that using an ICS can cause. Figure 4.7 shows the time to execute the enqueue and dequeue critical sections using interruptible critical sections. We note that the scale on this chart is nonlinear. The dequeue operation is again performed without delay, but the enqueue operation can take an extremely long time to execute. In contrast, Figure 4.8 shows the use of interruptible locks, in which the time to execute an enqueue operation is moderate. Comparing the interruptible lock and the prioritized semaphore implementations of the critical sections, we find that the interruptible lock eliminates the delay in executing the high priority critical section, while adding only a small delay (in this case about 20%) to the time to execute a low priority critical section.


The significance of these experiments is not that the average response time of the high priority dequeue operation is reduced, but that the response times of the dequeue operations become predictable. In the low-conflict experiment, the dequeue operation usually completes immediately, but on occasion requires 5 ticks when a prioritized semaphore is used. This unpredictability might cause timing problems. We note that the priority ceiling protocol would provide the same performance as the prioritized semaphore does (since there are no other critical sections), but at the cost of a more complex and expensive scheduler and synchronization mechanism. Interruptible locks always allow the dequeue operation to finish immediately, even under a very high load.

To test the overhead of using interruptible critical sections, we ran experiments to time the overhead in the context switch code and in the ICS entry code. In the first experiment, we enter and exit an (empty) critical section 10,000 times. We found that acquiring and releasing a semaphore adds 57 ticks to the program execution time. Entering and exiting an ICS requires 67 ticks if the entry and exit code is contained in a system call, and 1 tick if the entry and exit code are user code. In the second experiment, we force 10,000 context switches. We find that the unavoidable context switches by themselves require 58 ticks, and the ICS callout code adds 9 ticks of overhead. These numbers are for the current implementation of the ICS and interruptible locks. An implementation that is more tightly integrated into the kernel will require less overhead. For example, if the context-switch code is part of the kernel, then there is no need for the callout routine overhead, and we would have faster access to the program counter.


4.8 Analysis

Hard real-time systems require timing guarantees, and for this reason one typically considers periodic task sets. In this section, we show how to analyze a periodic task set that uses an ICS or an interruptible lock for synchronization, and derive worst-case response times.

The set of tasks is {τ_i}. We use the convention that τ_i has a higher priority than τ_j if i < j. Each task i has a period T_i and a worst-case execution time C_i. An instantiation of task i is released at the beginning of its period, and has as its deadline the release of the next instantiation of the task. If all tasks can always complete before their deadlines, then the task set is feasible. A real-time system with periodic tasks is typically scheduled using the Rate Monotonic algorithm, which gives static preemptive priority to tasks with shorter periods. Rate Monotonic scheduling is well studied, and the feasibility of a task set can be exactly characterized. Let r_i be the worst-case response time of task i. If the tasks do not access critical sections, then r_i is the fixed point of the following recursive equation [48]

    r_i = C_i + Σ_{j<i} ⌈r_i / T_j⌉ C_j                         (4.1)

Unfortunately, it is not always realistic to assume that a task is released for execution at the start of its period. For example, most static-priority task schedulers are implemented using tick scheduling: a periodic clock interrupt polls the task set and performs a context switch if a task with a higher priority than the current one is ready for execution. The release of a task can also be blocked by external events, such as the arrival of a message from a communicating task in a distributed system [57]. The task set might be subject to release jitter [94], possibly due to tick scheduling or due to waiting for external events. Tindell, Burns, and Wellings show how to modify equation 4.1 to account for release jitter. In each of these cases, the deadline of the task can be less than the task period, perhaps significantly less.


If the task deadlines are shorter than the task periods, then the Deadline Monotonic algorithm is the optimal static scheduler [58]. If the tasks can access critical sections and thus experience blocking, then the maximum blocking time is added to the above r_i value. If the Priority Ceiling Protocol is used, a task will block for at most the duration of one critical section. If the tasks use interruptible locks, then there is an additional re-execution component that must be added to the response times.

We next compute the time wasted due to re-executions of critical sections. We assume that interruptible locks are used in conjunction with the Priority Ceiling Protocol. So, for each critical section, there is a (possibly empty) set of tasks that enter the critical section without blocking, and another (possibly empty) set of tasks that acquire a semaphore before entering the critical section. The tasks are described by their periods T_i, their execution times (in the absence of concurrency) C_i, the sets of critical sections that they access Z_i, the execution times in their critical sections b_{i,z}, and their deadlines D_i. Hence, we assume that task priorities are assigned according to the length of D_i. Let Z be the set of critical sections (Z = ∪Z_i), and for z ∈ Z, let τ^z_1, τ^z_2, ..., τ^z_{n_z} be the set of tasks that access z, in order of priority. Of the n_z tasks, the l_z highest priority tasks enter z without blocking, and the remaining n_z - l_z acquire a semaphore using the Priority Ceiling Protocol. We define I(z) to be the highest numbered task that enters z without blocking (i.e., τ_{I(z)} = τ^z_{l_z}).

Let us first consider a couple of simple examples. Suppose that we have a set of three tasks that each access the semaphore z. The characteristics of the tasks are listed in Table 4.1. Note that this task set cannot be guaranteed to meet its deadlines if the Priority Ceiling Protocol is used, as C_1 + max_i(b_{i,z}) > D_1.


Table 4.1. Example task description 1 for ICS

    task   T_i     C_i     Z_i   b_{i,z}   D_i
    τ_1    10 ms   2.5 ms  z     1 ms      3 ms
    τ_2    15      5       z     1         10
    τ_3    30      4       z     1         28

If the semaphore z is protected by an ICS, then task τ_1 can always meet its deadline, because C_1 < D_1. The question is whether the remaining tasks can meet their deadlines. To analyze task τ_2, we observe that every time τ_1 interrupts τ_2, the result can be a re-execution of z. So, to determine the response time r_2 of τ_2, we modify equation 4.1 to incorporate the re-execution time:

    r_2 = C_2 + ⌈r_2 / T_1⌉ (C_1 + b_{2,z})                     (4.3)

Solving, r_2 = 8.5 < D_2, so τ_2 meets its deadline. Extending the same recurrence to τ_3, we find that r_3 = 26.5 < D_3, so τ_3 can always meet its deadline. We conclude that the task set in Table 4.1 is feasible if synchronization is done with an ICS, but is infeasible if the synchronization is done using the Priority Ceiling Protocol.

We can observe a general method for computing the worst-case response times of the tasks, if all critical sections are protected by ICSs only. We define

    b(z; j, i) = max_{j<k≤i} { b_{k,z} : τ_k accesses z }

Thus, b(z; j, i) is the longest critical section that can be interrupted by τ_j and increase the response time of τ_i. Next, we need to determine the increase in the response time of task τ_i caused by executions of τ_j. If τ_j causes a re-execution of critical section z, then only one critical section must be re-executed because of τ_j. If τ_k and τ_l are executing z when τ_j executes z (j < k < l), then both τ_k and τ_l must re-execute z. However, τ_l would need to re-execute anyway, because of τ_k's execution. Furthermore, all tasks with a lower priority than τ_k have their response time increased by τ_k's re-execution time. Therefore, the increase in τ_i's response time due to an execution of τ_j is

    btot(j, i) = max_{z∈Z} b(z; j, i)

Then, the response time of task τ_i is the solution of

    r_i = C_i + Σ_{j<i} ⌈r_i / T_j⌉ (C_j + btot(j, i))          (4.4)
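Equation 4.4 is easily solved by iterating from r_i = C_i until the value stops changing (the iteration is monotone; in practice one also stops once r_i exceeds D_i, since the task set is then infeasible). The following sketch, loaded with the Table 4.1 data, reproduces r_2 = 8.5 and r_3 = 26.5; setting every btot term to zero recovers equation 4.1.

    #include <math.h>
    #include <stdio.h>

    #define N 3
    double T[N] = { 10, 15, 30 };    /* periods (ms)          */
    double C[N] = { 2.5, 5, 4 };     /* execution times (ms)  */
    double btot[N][N] = {            /* btot[j][i] for j < i  */
        { 0, 1, 1 },
        { 0, 0, 1 },
        { 0, 0, 0 },
    };

    double response_time(int i)
    {
        double r = C[i], next;
        int j;

        for (;;) {
            next = C[i];
            for (j = 0; j < i; j++)
                next += ceil(r / T[j]) * (C[j] + btot[j][i]);
            if (next == r)           /* fixed point reached */
                return r;
            r = next;
        }
    }

    int main(void)
    {
        int i;
        for (i = 0; i < N; i++)
            printf("r_%d = %.1f ms\n", i + 1, response_time(i));
        return 0;
    }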

To incorporate a mixed system that uses both the ICS and the PCP techniques, we need to modify equation 4.4 to account for blocking and the restricted interruption. In particular, τ_j can cause re-executions only if it does not block before executing critical section z. We define bp(z; j, i) analogously to b(z; j, i), except that τ_j contributes only if it enters z without blocking (that is, only if j ≤ I(z)). The worst-case response time of τ_i then includes a blocking term B(i) for the semaphore waits, in addition to the re-execution terms of equation 4.4.

In Table 4.3, we show an example analysis of a task set. The column labeled r_i (ICS) shows the worst-case response times when interruptible critical sections are used for synchronization, and the column labeled r_i (ICS + PCP) shows the worst-case response times when interruptible locks are used. For the interruptible locks, we assume that tasks τ_1 and τ_2 never block, but tasks τ_3 through τ_8 acquire a lock using the Priority Ceiling Protocol. We observe that the task set cannot meet its deadlines if only the PCP is used. Furthermore, task τ_8 cannot be guaranteed to meet its deadline when only the ICS is used. However, a combination of interruptible locks and the priority ceiling protocol lets all tasks meet their deadlines. We observe that interruptible locks penalize intermediate-priority tasks, due to the possible blocking waits, B(i). However, the response time of high priority tasks is reduced because of the reduced rate of critical section interruptions.

Table 4.3. Example task description 3 for ICS

    task   T_i      C_i    Z_i   b_{i,z}   D_i      r_i (ICS)   r_i (ICS + PCP)
    τ_1    25 ms    3 ms   X     1 ms      6.5 ms   3 ms        3 ms
    τ_2    25       3      Y     1         6.5      6           6
    τ_3    30       3      X     1         15       10          12
    τ_4    30       3      Y     1         20       14          16
    τ_5    30       3      X     1         30       18          19
    τ_6    30       3      Y     1         30       22          22
    τ_7    100      3      X     1         80       49          45
    τ_8    100      3      Y     1         80       86          48

4.9 Conclusion

We have presented methods for implementing interruptible critical sections (ICSs), and for using them with interruptible locks. Interruptible critical sections use optimistic concurrency control instead of pessimistic concurrency control. If a process that is executing an ICS is interrupted and a conflicting operation commits, the conflicted process restarts its execution from the beginning of the critical section.


In a real-time system, interruptible critical sections prevent priority inversion. In addition, the ICS mechanism is independent of the scheduling algorithm. We show how several recent ideas in non-blocking and uniprocessor synchronization can be synthesized to provide low-overhead interruptible critical sections. We show how an ICS can be implemented in practice, and discuss our ICS implementation in the pSOS+ operating system. We find that the use of a prioritized semaphore can lead to unpredictable execution times for high priority tasks, while the use of an ICS allows the high priority task to always complete quickly.

The use of interruptible critical sections alone can cause too many critical section re-executions, making low-priority tasks unschedulable. Interruptible critical sections can be used with locks to create interruptible locks. We show that when an interruptible lock is used, a low-priority task never blocks a high priority task, and the low priority tasks experience only a small degradation in execution time. Interruptible locks are appropriate when very time-sensitive tasks must communicate with lower priority tasks through shared-memory data structures.

We present an analysis of a hard real-time periodic task set that synchronizes using interruptible critical sections. We show that if the highest priority tasks have very tight deadlines, then interruptible critical sections can improve the schedulability of the task set. Interruptible locks can be used in conjunction with the priority ceiling protocol. We show that using interruptible locks with the PCP can improve schedulability over using the ICS or the PCP alone.


Figure 4.3. Response time distribution of the non-prioritized semaphore

Figure 4.4. Response time distribution of the prioritized semaphore


Figure 4.5. Response time distribution of the Interruptible Critical Section

Figure 4.6. Response time distribution of the Interruptible Lock

Figure 4.7. Response time distribution of the Interruptible Critical Section for high lock utilization

Figure 4.8. Response time distribution of the Interruptible Lock for high lock utilization


CHAPTER 5
EXTENDING INTERRUPTIBLE CRITICAL SECTIONS TO MULTIPROCESSORS

5.1 Introduction

Mutual exclusion is a significant problem for sharing a resource in a real-time shared-memory multiprocessor system. Most lock-based synchronization mechanisms can block a high priority task that requires the lock while a low priority task is holding the lock, causing priority inversion. In this chapter, we present algorithms that extend Interruptible Critical Sections to the shared-memory multiprocessor environment (ICSM).

Takada and Sakamura [91] proposed algorithms that extend queuing spin-locks to be preempted for servicing interrupts. They address the conflicting issue of servicing a pending interrupt while holding a lock. Rajkumar et al. extended the Priority Ceiling Protocol, which limits priority inversion on a uniprocessor, to multiprocessors [79]. Shu et al. [84] proposed an Abort Ceiling Protocol, an extension to the Priority Ceiling Protocol. In this algorithm, an abort ceiling priority is associated with a task. The abort ceiling comes into effect when the task is executing. Another task may abort the currently running task and run immediately if its priority is higher than the current abort ceiling. The protocol relies on Interruptible Critical Sections to restart the critical section of the aborted task. Also, the protocol assumes static priorities. The Ceiling Abort Protocol [92] proposed by Takada and Sakamura is a similar extension to the Priority Ceiling Protocol. This protocol instead assigns an abort ceiling priority to the critical section. Also, the critical section is divided into abortable and non-abortable segments.


McCann et al. [65] conclude that preempting processors in a coordinated way is critical to response times when using critical sections. Anderson et al. [3] show that a naive implementation of spin-locks can delay not only the processor waiting for a lock, but also other processors doing work. They suggest an Ethernet-style backoff scheme or a queue-based algorithm for reducing the cost of spin-waiting. Anderson et al. [5] argue that the operating system should recognize that a preempted thread is executing in a critical section, and execute the preempted thread until the thread exits the critical section. Alemany and Felton [2] consider implementation issues of non-blocking concurrent objects on shared-memory multiprocessors. They show how the resources wasted by non-blocking operations that fail, and the cost of data copying required by a non-blocking implementation, can be reduced by relying on operating system support.

Interruptible Critical Sections are an optimistic synchronization mechanism for uniprocessors with applicability to embedded and real-time systems. Techniques using interruptible critical sections improve the schedulability of task sets that have high priority tasks with tight deadlines. In this chapter, we extend interruptible critical sections to shared-memory multiprocessors under progressively more complex environments. Although the methodology for writing interruptible critical sections remains the same as discussed in Chapter 4, the algorithms differ in the techniques used to detect critical section access conflicts on a multiprocessor. We discuss the algorithms as implemented on a multiprocessor system running the pSOS+ real-time operating system. We study the various parameters under which interruptible critical sections outperform other lock-based algorithms using our implementation, simulation, and analytical modeling. We also present a formal proof showing that one of the algorithms is indeed correct.


In this chapter, we extend interruptible critical sections to multiprocessor systems. The main issue is how to detect a conflict among tasks spread across multiple processors in using the shared critical section. In the uniprocessor case, we were able to detect the conflict during the context switch, but this is no longer possible in a multiprocessor. To resolve this issue, we let the critical section be protected by a global lock mechanism. Depending on how we handle the lock, we present the following algorithms.

• ICSM with Lock Release (ICSM-R). In a multiprogrammed multiprocessor, the task owning the lock may experience a context-switch. In such an event, tasks on other processors waiting for the lock will be blocked for a significant amount of time, much larger than the critical section execution time. This bottleneck becomes more pronounced as the number of processors and the degree of multiprogramming increase. In the ICSM-R algorithm, the lock is released from the old task during a context switch. When the task is re-scheduled on the processor, the lock is re-acquired, the test for conflict is performed, and the critical section is restarted, if necessary. This solution avoids the problem of detecting a conflict.

• ICSM with Task Kill (ICSM-K). A common scenario in a multiprocessor real-time system is a high priority Server task servicing the requests of several low priority Client tasks. In this model, we assume that there is only one high priority task among a set of tasks using the critical section. The low priority tasks synchronize their use of the critical section with a semaphore. The high priority task sets a kill flag before it uses the critical section. If another task is already executing in the critical section, the set state of the kill flag is detected and the critical section is re-executed. This solution detects a conflict by using shared memory and the atomic Compare&Swap instruction.


• ICSM with Priority Queue (ICSM-P). Here, we consider a set of tasks with varying priorities that execute on a multiprocessor. Instead of a spin-lock, we protect the critical section using a priority queue lock, based on the mechanism of the PR-Lock (Chapter 3). First, a task enqueues in the priority queue as in the PR-Lock. When a higher priority task finds that the critical section is busy with a lower priority task, it interrupts the lower priority task at the head of the queue and uses the critical section. The lower priority task detects the failure during the release of the lock, and retries the lock acquisition.

In the following sections, we discuss the implementation and the performance of each of these algorithms.
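All three variants build on the same optimistic core as the uniprocessor ICS of Chapter 4: execute the critical section against a snapshot, and detect at commit time whether a conflicting commit intervened. The following minimal C sketch of that skeleton is ours, for illustration only; the shared counter cs_count, the body callback, and the GCC atomic builtin (standing in for the 68030 Compare&Swap used in the actual implementations) are assumptions, not part of the algorithms presented below.

/* Illustrative sketch of the optimistic ICS skeleton (not the pSOS+
 * implementation).  cs_count is a hypothetical shared execution
 * counter that is bumped by every committed critical section. */
static volatile int cs_count;

static void ics_execute(void (*body)(void))
{
    int snapshot;
    do {
        snapshot = cs_count;   /* remember the state we started from */
        body();                /* run optimistically on private data */
        /* Commit succeeds only if no other task committed meanwhile;
         * the test-and-increment is performed atomically. */
    } while (!__sync_bool_compare_and_swap(&cs_count, snapshot, snapshot + 1));
}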


5.2 ICSM with Lock Release (ICSM-R)

In a multiprogrammed multiprocessor system, existing lock-based mechanisms hold the lock for a task even during a context-switch. The duration of the context-switch depends on the time quantum and the degree of multiprogramming. This duration can be high, resulting in wasted lock utilization and, more importantly, an increase in the average response time for tasks using the lock. In this section, we study the possibility of releasing the lock held by a task during a context-switch by proposing the ICSM-R algorithm. Once the lock is released during a context-switch, it is re-acquired when the task executes in its next quantum. This results in low lock utilization, but there is a penalty of restarting the critical section if some other task has committed in the critical section in the interim. The critical section response time will be low if the number of restarts of the critical section is small. We study the parameters and conditions under which the ICSM-R algorithm performs better than an algorithm that does not release the lock during a context-switch.

We implemented ICSM-R with spin-locks using the Test&Set instruction. Two data structures are needed to implement ICSM-R: one for the critical section, and one for each task that uses the critical section. The global lock structure consists of a critical section identifier, a counter that tracks the number of times the critical section has been executed, and the critical section bounds. It also contains the address of the global spin-lock variable.

struct ICS_Lstruct {
    int  id;         /* ID of this critical section */
    int  cs_count;   /* Global Execution Count */
    char *cstop;     /* CS Top Address including the spin-lock */
    char *cslow;     /* CS Low Address excluding the spin-lock */
    char *cshigh;    /* CS High Address */
    int  gcount;     /* Shared Counter */
    int  *glock;     /* Spin-Lock Address */
}

The structure local to a task consists of a copy of the ICS execution count, a count of the number of times the critical section is retried on any invocation (for statistics), a pointer to the ICS_Lstruct, and a flag to indicate that the task is entering the critical section.

struct ICS_Tstruct {
    int process_count;         /* Local Execution Count */
    struct ICS_Lstruct *ilp;   /* Interruptible Lock Record Pointer */
    int icount;                /* Interrupt Count of a task */
    int flag;                  /* Flag = ID => In CS; = 0 => Not */
}


The ICS implementation code consists of two parts: the ICSM-Rctxsw routine, which provides the ICS lock mechanism, and the ICSM-Rclient task, which uses the ICS mechanism.

5.2.1 ICSM-Rctxsw Routine

The ICSM-Rctxsw routine is shown in Figure 5.1. It is integrated with the pSOS+ kernel as a user-written routine that is called during a context-switch. The call occurs at the point where the context of the switched-out task has been completely saved, and before the context of the switched-in task is loaded. pSOS+ provides the addresses of the Task Control Blocks (TCBs) of both the switched-in task and the switched-out task in machine registers. The TCB contains all the context of a task, including the Program Counter (PC). ICSM-Rctxsw can reset the PC in the TCB of a switched-in task, if required. pSOS+ provides a set of eight software-defined user registers that a task can access in the TCB. User register 0, U_REG0, is used to contain the address of the ICS_Tstruct of a task using the ICS mechanism.

ICSM-Rctxsw first checks if the program counter (PC) of the old task about to be switched out is within the critical section region, and if so, it releases the spin-lock by setting the lock variable to zero. Next, ICSM-Rctxsw checks if the new task about to be switched in is within the critical section region. If so, it attempts to re-acquire the spin-lock without spinning. If successful, ICSM-Rctxsw checks if there was a conflicting operation in the interim. If so, it sets the task's PC to re-execute the critical section. If there is no conflict, the task is allowed to continue from where it left the ICS when it was switched out. If the attempt to re-acquire the lock by the ICSM-Rctxsw routine is unsuccessful, the task is made to restart from the point of acquiring the lock for the critical section.


ICSM-Rctxsw()
{
    struct tcb *in_tcb, *out_tcb;
    struct ICS_Tstruct *tlp;

    load out_tcb from the machine register;
    tlp = Get U_REG0 from out_tcb;
    if (tlp != NULL && tlp->flag == LOCK_ID) {
        /* Old task is inside the CS: release the spin-lock */
        if (out_tcb->currpc >= tlp->ilp->cslow &&
            out_tcb->currpc < tlp->ilp->cshigh)
            *(tlp->ilp->glock) = 0;
    }

    load in_tcb from the machine register;
    tlp = Get U_REG0 from in_tcb;
    if (tlp != NULL && tlp->flag == LOCK_ID) {
        if (in_tcb->currpc >= tlp->ilp->cslow &&
            in_tcb->currpc < tlp->ilp->cshigh) {
            status = Test&Set(tlp->ilp->glock);
            if (status == SUCCESS) {
                /* Conflicting commit in the interim: restart the CS */
                if (tlp->process_count != tlp->ilp->cs_count)
                    in_tcb->currpc = tlp->ilp->cslow;
            } else {
                /* Lock taken by another task: restart from acquisition */
                in_tcb->currpc = tlp->ilp->cstop;
            }
        }
    }
}

Figure 5.1. The context switch routine for ICSM-R


5.2.2 ICSM-Rclient Tasks

We choose to implement a shared global counter. On each processor, we run four tasks. Among these four tasks, only one is the ICSM-Rclient task that increments the shared counter; the rest are dummy tasks. All four tasks are started under the control of a low priority parent task. pSOS+ Message Queues are used for communication between the tasks and the parent. The algorithm for an ICSM-Rclient task is presented in Figure 5.2.

5.2.3 ICSM-R Performance Analysis

First, we analyzed the performance of the ICSM-R algorithm as implemented, by conducting a set of experiments. We compared the performance of ICSM-R with a spin-lock algorithm that does not release the lock during a context-switch (LOCK-NR). As we are limited by the number of processors available for the implementation, and to better understand the implications of the various parameters, we constructed an analytical model. We validated the analytical model by using discrete event simulation. We present the results of the experiments and the analysis in the following sub-sections.

Experimental Performance Results

We analyzed the performance of the ICSM-R algorithm by conducting an experiment. The goal of the experiment is to increment a global counter until a target value is reached. A spin-lock critical section is used to protect the global counter from multiple updates.


Parent()
{
    Setup the global ICS_Lstruct ilr;
    Startup all the tasks;
    while (all the tasks are not done) {
        collect the timing and retries from each run of a task;
    }
    Report the parameters and statistics;
}

ICSM-Rclient()
{
    struct ICS_Tstruct tlr;

    Setup the local ICS_Tstruct structure tlr;
    Setup U_REG0 to point to the local ICS_Tstruct;
    while (!Done) {
        Work for Tw cycles;
        tlr.flag = LOCK_ID;
ics_top:
        do {
            status = Test&Set(tlr.ilp->glock);
        } while (status == FAILURE);
ics_start:
        tlr.process_count = tlr.ilp->cs_count;
        tlr.ilp->gcount++;
        if (tlr.ilp->gcount == TARGET_COUNT)
            done = TRUE;
        Idle for CS_IDLE cycles of time;
        *(tlr.ilp->glock) = 0;
        tlr.ilp->cs_count++;
ics_end:
        tlr.flag = 0;
        Report the time taken and retry count to parent;
    }
    Clear U_REG0;
    Report done to parent;
}

Figure 5.2. The algorithm for a task using ICSM-R


We used a Motorola 68030 multiprocessor system with shared memory, running the pSOS+ kernel. We compared the performance of ICSM-R with a spin-lock algorithm that does not release the lock during a context-switch (LOCK-NR). Each processor executes four tasks, where each task works for T_w units of time. One task among the four enters the critical section, staying in it for T_c time units. We varied the number of processors from one to four. In our experiment, T_w and T_c are random variables uniformly distributed within 20% about a selected mean. We set T_w to 20 milliseconds (ms). The time quantum for a task is 20 ms, for processor sharing among the four tasks. We varied the critical section execution time T_c to reflect various load conditions, with lock utilizations of 1.25%, 6%, and 10% per processor.

The performance results are shown in Figure 5.3. Experimental results confirm that as the number of processors increases, ICSM-R performs better than LOCK-NR. When using a small critical section, the performance improvement is small, as there is only a small probability that a context-switch occurs during a critical section. As the critical section size increases, this probability is higher, and the task releasing the lock during a context-switch helps tasks running on other processors acquire the lock faster, thereby avoiding wasted cycles of spinning. Note that the elapsed time can increase as the number of processors is increased beyond a certain limit. This can be attributed to the fact that the cost of synchronization can exceed the speedup that can be achieved by an increase in the number of processors. The elapsed times for the LOCK-NR algorithm increase as the number of processors is increased from 3 to 4, as shown in Figure 5.3.


Figure 5.3. Elapsed times of ICSM-R (Lock Released) and LOCK-NR (Lock Not Released) versus the number of processors, for critical sections of 1 ms and 5 ms.

Analytical Performance of ICSM-R

In this section, we derive the performance of the ICSM-R algorithm and the LOCK-NR algorithm using an analytical model. For the analytical model, we consider N processors running M tasks each. Each task on a processor works for T_w units of time. In addition, one out of the M tasks on a processor requests the service of a critical section that takes T_c time units to execute. Each processor is shared among the M tasks using a time quantum of T_q time units in a round-robin fashion. The cycle time of a task is composed of the work time and the critical section time, if any, of the task. We are interested in estimating the cycle time of the task using the critical section. We use the following notation for the analysis:

N : Number of processors
M : Number of tasks per processor
T_w : Work time for each task
T_q : Time quantum for processor sharing
T_c : Critical section execution time
R_p : Critical section utilization per processor
R : Total critical section utilization
X : The total work time in a cycle
Z : The total time spent waiting for the critical section in a cycle
B : The CPU time spent holding the critical section
E[CxinCs] : Expected number of context-switches while in the critical section
C_N : Cycle time for a task using LOCK-NR
C_I : Cycle time for a task using ICSM-R


Figure 5.4. Model of a cycle for a task using LOCK-NR: work time X, blocking time Z, and critical section holding time B.

Analysis of algorithm LOCK-NR

In this model, the critical section is not released during a context-switch while the critical section is being executed by a task. A cycle of the task using the critical section on a processor is shown in Figure 5.4. A cycle consists of the work time X, Z units of time waiting for the critical section, and B units of time in the critical section. We have

    C_N = X + Z + B                                        (5.1)

    X = M * T_w                                            (5.2)

B depends on T_c, T_q, and the number of context switches that are possible while holding the critical section:

    E[CxinCs] = T_c / T_q

    B = T_c + E[CxinCs] * (M - 1) * T_q
      = T_c + (T_c / T_q) * (M - 1) * T_q
      = M * T_c                                            (5.3)


Assuming that a request for the critical section is uniformly distributed in time, the probability of a conflict in using the critical section is just the percentage of time the critical section is in use. Modeling the critical section as an M/M/1 queue, the expected blocking time Z is given by

    Z = (R_r / (1 - R_r)) * B_r                            (5.4)

where R_r is the utilization due to the other N - 1 tasks that use the critical section, given by

    R_r = ((N - 1) / N) * R

and B_r is the residual life [51] of the lock holding time B for which the task under consideration is blocked. Assuming that the lock holding time is uniformly distributed with mean B and variance σ_b², B_r is given by

    B_r = B / 2 + σ_b² / (2 * B)

As only one task uses the critical section on a processor, the utilization per processor is given by

    R_p = B / (X + Z + B)

Then, the total critical section utilization is

    R = N * R_p = N * B / (X + Z + B)                      (5.5)

X and B can be computed using equations 5.2 and 5.3, respectively. We can compute R using equation 5.5 with iteration, setting the initial value of R to zero. Knowing R and B, Z can be computed, and hence the cycle time C_N.
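As a concrete illustration, the following C sketch (ours, not part of the dissertation's tooling) solves equations 5.2 through 5.5 by fixed-point iteration; the parameter values, the convergence tolerance, and the variance of a ±20% uniform distribution, σ_b² = (0.2B)²/3, are our assumptions.

#include <math.h>
#include <stdio.h>

/* Fixed-point solution of the LOCK-NR model (equations 5.1-5.5).
 * Parameter values, tolerance, and the +/-20% uniform variance are
 * illustrative assumptions. */
int main(void)
{
    double N = 16.0, M = 4.0, Tw = 1000.0, Tc = 50.0;
    double X = M * Tw;                          /* eq. 5.2 */
    double B = M * Tc;                          /* eq. 5.3 */
    double var = (0.2 * B) * (0.2 * B) / 3.0;   /* uniform +/-20% */
    double Br = B / 2.0 + var / (2.0 * B);      /* residual life */
    double R = 0.0, Z = 0.0;
    int i;

    for (i = 0; i < 1000; i++) {                /* iterate eq. 5.5 */
        double Rr = (N - 1.0) / N * R;
        double Rnew;
        Z = Rr / (1.0 - Rr) * Br;               /* eq. 5.4 */
        Rnew = N * B / (X + Z + B);
        if (fabs(Rnew - R) < 1e-9)
            break;
        R = Rnew;
    }
    printf("LOCK-NR cycle time C_N = %.2f, utilization R = %.4f\n",
           X + Z + B, R);
    return 0;
}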


Figure 5.5. Model of a cycle for a task using ICSM-R: work time X, followed by alternating blocking times Z1, Z2, ..., ZF and partial critical section holding times B1, B2, ..., BF.

Analysis of algorithm ICSM-R

In this model, the critical section is released during a context-switch while the critical section is being executed by a task. The critical section is re-acquired and continued by the task during the next quantum. This acquire/release cycle continues until the critical section is completed. A cycle of the task using the critical section on a processor is shown in Figure 5.5. In this case, after the work period X, there is a possible blocking time Z1 spent waiting for the critical section to be free. Then there is the partial critical section holding time B1. At this time the task may experience a context switch, followed by another blocking time represented by Z2. This acquire/release of the critical section continues until the critical section is completed. There is a possibility of restarting the task at the beginning of the critical section whenever the critical section is re-acquired, if some other task has committed in the critical section during the period between the critical section being released and re-acquired. We have


    C_I = X + Z + B                                        (5.6)

    X = M * T_w                                            (5.7)

where Z = Z1 + Z2 + ... + ZF and B = B1 + B2 + ... + BF. B depends on T_c and the number of times a task is restarted from the beginning of the critical section because of a commit by another task:

    B = T_c + T_r * N_r

where T_r is the partial execution time of the critical section before a task is restarted because of a conflict, and N_r is the number of times the task is restarted because of a conflict before it commits. The expected value of T_r is T_c / 2. Given the probability of restart of a task within the critical section as Pr[CSRestart],

    N_r = Σ_{i>0} i * Pr[CSRestart]^i
        = Pr[CSRestart] / (1 - Pr[CSRestart])²

Then,

    B = T_c + (T_c / 2) * Pr[CSRestart] / (1 - Pr[CSRestart])²    (5.8)

where

    Pr[CSRestart] = Pr[CxinCs] * Pr[a commit in the previous (M - 1) * T_q interval]


We can estimate the probability that some other task commits in the previous interval of (M - 1) * T_q by modeling the commits as a Poisson process with an arrival rate

    λ = (N - 1) / (X + B + Z)

Then,

    Pr[a commit in the previous (M - 1) * T_q interval]
        = 1 - Pr[no commit in the previous (M - 1) * T_q interval]
        = 1 - e^(-λ * (M - 1) * T_q)

    Pr[CxinCs] = T_c / T_q    for M > 1
               = 0            otherwise

The blocking time Z is due to the critical section being busy, and to a context switch while in the critical section:

    Z = Pr[CxinCs] * (M - 1) * T_q + (1 + Pr[CxinCs]) * (R_r / (1 - R_r)) * B_r    (5.9)

where R_r is the utilization due to the other N - 1 tasks that use the critical section, given by

    R_r = ((N - 1) / N) * R

and B_r is the residual life of the lock holding time B for which the task under consideration is blocked. Assuming that the lock holding time is uniformly distributed with mean B and variance σ_b², B_r is given by

    B_r = B / 2 + σ_b² / (2 * B)

As only one task uses the critical section on a processor, the utilization per processor is given by

    R_p = B / (X + Z + B)

Then, the total critical section utilization is

    R = N * R_p = N * B / (X + Z + B)                      (5.10)

X can be computed using equation 5.7. We can compute B and R with equations 5.8, 5.9, and 5.10 using iteration, setting the initial value of B to T_c and of R to zero. Knowing R and B, Z can be computed, and hence the cycle time C_I.
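The ICSM-R model couples B, Z, and R; a companion C sketch (again ours, with the same assumed parameter values, tolerance, and variance expression as the LOCK-NR sketch above) solves equations 5.8 through 5.10 jointly:

#include <math.h>
#include <stdio.h>

/* Fixed-point solution of the ICSM-R model (equations 5.6-5.10).
 * Parameter values, tolerance, and the +/-20% uniform variance are
 * illustrative assumptions. */
int main(void)
{
    double N = 16.0, M = 4.0, Tw = 1000.0, Tq = 100.0, Tc = 50.0;
    double X = M * Tw;                                /* eq. 5.7 */
    double B = Tc, R = 0.0, Z = 0.0;
    int i;

    for (i = 0; i < 1000; i++) {
        double PrCx = (M > 1.0) ? Tc / Tq : 0.0;      /* Pr[CxinCs] */
        double lambda = (N - 1.0) / (X + B + Z);      /* commit rate */
        double PrCommit = 1.0 - exp(-lambda * (M - 1.0) * Tq);
        double PrRestart = PrCx * PrCommit;
        double Bnew = Tc + (Tc / 2.0) * PrRestart
                      / ((1.0 - PrRestart) * (1.0 - PrRestart));  /* eq. 5.8 */
        double var = (0.2 * Bnew) * (0.2 * Bnew) / 3.0;
        double Br = Bnew / 2.0 + var / (2.0 * Bnew);
        double Rr = (N - 1.0) / N * R;
        double Znew = PrCx * (M - 1.0) * Tq
                      + (1.0 + PrCx) * Rr / (1.0 - Rr) * Br;      /* eq. 5.9 */
        double Rnew = N * Bnew / (X + Znew + Bnew);               /* eq. 5.10 */
        int converged = fabs(Rnew - R) < 1e-9 && fabs(Bnew - B) < 1e-9;
        B = Bnew; Z = Znew; R = Rnew;
        if (converged)
            break;
    }
    printf("ICSM-R cycle time C_I = %.2f, utilization R = %.4f\n",
           X + Z + B, R);
    return 0;
}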


Validation of Analysis

We validated the analysis by simulation using SIMPACK [25], a discrete event simulation package. We set the values M = 4, T_w = 1000, and T_q = 100. The work time T_w is a random variable uniformly distributed between 800 and 1200, with a mean of 1000. The critical section time is also a uniformly distributed random variable, with a range of 20% either way about the mean. In each experiment, we selected a different T_c, from 10 to 90, to represent a wide range of critical section utilizations. The results comparing the cycle times obtained by simulation and the cycle times computed by analysis are given in Table 5.1. As can be seen, except for a high T_c/T_q ratio, the results are accurate enough for the analysis to be meaningful. The inaccuracy in the analytical model is due to the 1/(1 - R_r) factor present in the computation of Z; this factor goes to infinity as R approaches 100%. A similar conclusion can be drawn for the lock utilization obtained from simulation and analysis, as presented in Table 5.2.

Performance comparison using analysis

We analyzed the performance of the ICSM-R and LOCK-NR algorithms using the model developed in the previous sub-sections. The cycle times of the tasks using the critical section under various critical section execution times are shown in Figures 5.6 through 5.10. The work time T_w is 1000 units, and M is 4. For a critical section to quantum time ratio of up to 50%, the ICSM-R algorithm performs better than the LOCK-NR algorithm. Even for a higher ratio, the ICSM-R algorithm performs better for a small number of processors. We observe that there is a steep transition in the cycle times when the lock utilization reaches 100%, for both algorithms. As expected, the critical section utilization is lower for ICSM-R, as shown in Figures 5.11 through 5.15.

We analyzed the effect of multiprogramming by varying M from 1 to 16, with T_c = T_q / 4 and T_c = T_q / 2. Except for the case of no multiprogramming, ICSM-R always performs better, as shown in Figures 5.16 through 5.25. We conclude that for a low to moderate ratio of critical section execution time to quantum time, it is advantageous to use the ICSM-R algorithm. In general, critical section execution times are very short, and are only a fraction of the quantum times. Thus, the ICSM-R algorithm, which releases the lock during a context-switch, is an attractive alternative to other lock-based algorithms.


Table 5.1. Validating cycle time analysis using simulation for ICSM-R (cycle time)

                  ICSM-R                              LOCK-NR
 Tc  Processors   Simulation  Analysis  Abs.% Diff   Simulation  Analysis  Abs.% Diff
 10      4         4044.30    4044.31     0.00        4044.50    4044.39     0.00
         8         4040.04    4040.01     0.00        4040.64    4040.48     0.00
        12         4049.11    4049.08     0.00        4050.32    4050.12     0.00
        16         4055.16    4054.95     0.00        4056.57    4056.44     0.00
        20         4038.95    4038.11     0.01        4042.03    4041.72     0.02
        24         4039.85    4038.93     0.02        4042.91    4041.62     0.03
        28         4049.14    4048.36     0.02        4053.54    4053.23     0.01
        32         4052.63    4051.64     0.02        4057.69    4057.75     0.00
        36         4055.59    4054.40     0.03        4062.63    4062.16     0.01
        40         4046.91    4045.40     0.04        4055.67    4055.13     0.01
        44         4039.62    4037.89     0.04        4050.62    4049.92     0.02
        48         4049.14    4046.98     0.05        4062.36    4061.72     0.02
        52         4048.51    4046.26     0.06        4064.73    4063.90     0.02
        56         4044.58    4041.87     0.07        4064.06    4062.51     0.04
        60         4046.44    4043.55     0.07        4068.55    4068.77     0.00
        64         4049.33    4046.37     0.07        4076.04    4075.88     0.00
 50      4         4216.49    4219.43     0.07        4220.63    4220.72     0.00
         8         4230.69    4239.95     0.22        4245.23    4249.52     0.10
        12         4254.30    4278.36     0.57        4298.42    4314.26     0.37
        16         4279.15    4318.48     0.92        4370.40    4428.70     1.33
 90      4         4408.98    4425.15     0.34        4413.87    4422.75     0.20
         8         4501.65    4559.00     1.27        4518.83    4584.24     1.45
        12         4675.88    4729.99     1.16        4752.02    5043.99     6.14
        16         5078.08    5077.26     0.02        5767.69    7134.15    23.69


Table 5.2. Validating lock utilization analysis using simulation for ICSM-R (lock utilization, %)

                  ICSM-R                              LOCK-NR
 Tc  Processors   Simulation  Analysis  Abs.% Diff   Simulation  Analysis  Abs.% Diff
 10      4          1.00        0.99       1.00         4.00       3.90       2.50
         8          2.00        2.02       1.00         7.70       7.90       2.59
        12          3.00        3.06       0.99        12.00      11.90       0.83
        16          4.10        4.10       0.00        15.60      15.80       1.28
        20          5.10        5.18       1.57        19.70      19.81       0.56
        24          6.20        6.20       0.00        23.00      23.76       3.30
        28          7.20        7.30       1.39        27.30      27.60       1.10
        32          8.20        8.30       1.22        30.70      31.50       2.60
        36          9.20        9.40       2.17        34.10      35.40       3.81
        40         10.30       10.50       1.94        38.30      39.47       3.05
        44         11.30       11.52       1.95        43.00      43.34       0.79
        48         12.30       12.60       2.44        45.40      47.30       4.18
        52         13.30       13.60       2.26        49.80      51.20       2.81
        56         14.40       14.71       2.15        52.50      55.18       5.10
        60         15.40       15.76       2.34        56.50      59.00       4.42
        64         16.40       16.81       2.50        59.80      62.90       5.18
 50      4          5.00        5.01       0.20        18.70      18.93       1.23
         8         10.60       10.87       2.55        37.50      37.68       0.48
        12         16.40       17.57       7.13        55.70      55.70       0.00
        16         22.50       24.62       8.79        72.50      72.35       0.21
 90      4          9.20        9.09       1.20        32.50      32.52       0.06
         8         20.80       21.70       4.33        63.50      62.88       0.98
        12         35.70       38.80       8.68        90.50      85.73       5.27
        16         54.40       57.70       6.07        99.30     100.00       0.70


Figure 5.6. ICSM-R cycle times using analysis for Tc = 10 (Tq = 100).

Figure 5.7. ICSM-R cycle times using analysis for Tc = 25 (Tq = 100).


Figure 5.8. ICSM-R cycle times using analysis for Tc = 50.

Figure 5.9. ICSM-R cycle times using analysis for Tc = 75.


Figure 5.10. ICSM-R cycle times using analysis for Tc = 90.

Figure 5.11. ICSM-R critical section utilization using analysis for Tc = 10.


Figure 5.12. ICSM-R critical section utilization using analysis for Tc = 25.

Figure 5.13. ICSM-R critical section utilization using analysis for Tc = 50.


Figure 5.14. ICSM-R critical section utilization using analysis for Tc = 75 (Tq = 100).

Figure 5.15. ICSM-R critical section utilization using analysis for Tc = 90.


Figure 5.16. ICSM-R effect of multiprogramming on cycle time for Tc = 25 (M = 1).

Figure 5.17. ICSM-R effect of multiprogramming on cycle time for Tc = 25 (M = 4).


Figure 5.18. ICSM-R effect of multiprogramming on cycle time for Tc = 25 (M = 8).

Figure 5.19. ICSM-R effect of multiprogramming on cycle time for Tc = 25 (M = 12).

Figure 5.20. ICSM-R effect of multiprogramming on cycle time for Tc = 25 (M = 16).

Figure 5.21. ICSM-R effect of multiprogramming on cycle time for Tc = 50 (M = 1).


Figure 5.22. ICSM-R effect of multiprogramming on cycle time for Tc = 50 (M = 4).

Figure 5.23. ICSM-R effect of multiprogramming on cycle time for Tc = 50 (M = 8).

Figure 5.24. ICSM-R effect of multiprogramming on cycle time for Tc = 50 (M = 12).

Figure 5.25. ICSM-R effect of multiprogramming on cycle time for Tc = 50 (M = 16).


5.3 ICSM with Task Kill (ICSM-K)

In a multiprocessor system, we often encounter a synchronization environment in which a high-priority task synchronizes access to a critical section with a set of low-priority tasks. For example, a set of Client tasks enqueue their I/O requests in a shared queue that is serviced by a high priority I/O Server task. In a real-time multiprocessor system, a high priority Monitor task acquires data from an external source, which is in turn analyzed by a set of low-priority Analysis tasks. Both examples mandate that the critical section used to synchronize access to the shared resource should have a fast and predictable response time for the high priority task. The ICSM-K algorithm addresses the issue of synchronization in this environment.

For this implementation of ICS, we assume that there is a single high priority task, with a set of low priority tasks, running on multiple processors. The critical section is implemented using the techniques for writing interruptible critical sections described in Chapter 4. In addition, we assume the availability of a general synchronization mechanism for multiprocessors. The global data structures consist of a lock object L and two semaphores, as shown in Figure 5.26. The lock L is composed of a pointer to the shared object and an integer Kill flag. One semaphore is used by the set of low priority tasks for exclusive access to the critical section. The Kill flag is used by the high priority task to gain access to the critical section if the critical section is busy. The other semaphore is used to return the critical section to the low priority task that was killed by the high priority task.

The algorithm for the high priority task is also presented in Figure 5.26. The high priority task tests the availability of the mutex semaphore. It enters the critical section if the mutex semaphore is free. If not, it sets the Kill flag and enters the critical section anyway.


The Kill flag is an integer variable in which the low order bit is used as the Kill bit, and the high order bits are used as a counter for the Kill bit, to avoid the A-B-A problem when the flag is used with a Compare&Swap instruction, as described in Chapter 3. In the algorithm shown, we set the Kill bit and increment the counter with a single statement that increments the Kill flag by a value of 3. After committing its changes to the shared object, the high priority task leaves the critical section by resetting the Kill flag and making the kills semaphore available. We note that the high priority task can use techniques other than ICS for modifying the shared object, as it is not interrupted in the critical section.

The algorithm for the low priority tasks is presented in Figure 5.27. A low priority task first waits for the mutex semaphore to be free. Then, it takes a snapshot of the lock object L, repeating until the Kill bit is not set. The low priority task enters the critical section and constructs its changes to the shared object in the local variable MyObject. At the end of the critical section, the low priority task prepares to commit its changes to the shared object using the ICS mechanism. The low priority task constructs a new lock object NewL that points to its local MyObject. The Kill flag value of NewL is set to the value that was read before, with the Kill bit reset. The low priority task commits its changes by replacing the lock object L with NewL, using an atomic Compare&Swap instruction. The Compare&Swap ensures that the high priority task has not entered the critical section in the meantime. If the Kill bit is set, the Compare&Swap fails, indicating that the critical section execution was interrupted by the high priority task. In this case, the low priority task waits for the kills semaphore to be free, and the critical section is restarted from the beginning. We used the double Compare&Swap (CAS2) for testing the Kill flag and committing the critical section operation simultaneously.
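The Kill flag encoding can be made concrete with a small C sketch; the names below are ours, and the real algorithm (Figure 5.26) performs these updates directly on the shared Kill flag word.

/* Sketch of the Kill flag encoding (names are ours).  Bit 0 is the
 * Kill bit; bits 1..31 form a counter of kill episodes, so that a
 * Compare&Swap against a snapshot of the flag cannot suffer the
 * A-B-A problem. */
typedef int kill_flag_t;

#define KILL_BIT 0x1

/* With bit 0 clear, adding 3 sets the Kill bit and adds 1 to the
 * counter in a single statement. */
void kill_set(kill_flag_t *f)
{
    *f += 3;
}

/* Clearing only bit 0 leaves the counter intact, so a later snapshot
 * can never be mistaken for one taken before the kill episode. */
void kill_reset(kill_flag_t *f)
{
    *f &= ~KILL_BIT;
}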


To show that the ICSM-K algorithm is correct, observe that there can be a conflict only between the high priority task and a single low priority task. In this case, the ICSM-K algorithm is decisive-instruction serializable [83]. The tasks using the ICSM-K algorithm have a single decisive instruction. The decisive instruction for the high priority task is the setting of the Kill flag. The decisive instruction for the low priority task is the successful Compare&Swap instruction that commits the lock L. Corresponding to a concurrent execution C of a set of operations, there is an equivalent (with respect to return values and final states) serial execution S_d such that if operation O_1 executes its decisive instruction before operation O_2 does in C, then O_1 < O_2 in S_d. If the high priority task commits by setting the Kill flag before the low priority task executes its decisive Compare&Swap instruction, the Compare&Swap fails. In this case, the low priority task correctly waits for the kills semaphore and restarts its critical section. If the high priority task commits after the low priority task executes the Compare&Swap instruction, the Compare&Swap for the low priority task succeeds and there is no critical section conflict.

5.3.1 Experimental Performance Results

We used the ICSM-K synchronization algorithm to implement a shared queue on the Motorola 68030 shared-memory multiprocessor running the pSOS+ kernel. There are three low priority enqueue tasks (of equal priority) and a single high priority dequeue task. Each task runs on an individual processor. We studied the performance of ICSM-K under a moderate lock utilization as well as a high lock utilization. We compared the performance of the ICSM-K algorithm with an algorithm that does not use the Kill flag for the high priority task (LOCK-NK).

In the first experiment, each task performs 10,000 enqueue (dequeue) operations. Each enqueue task works (idles) for 140 ms (milliseconds) and enters a 10 ms critical section, giving a combined lock utilization of 20%. The dequeue task sleeps for 60 ms before entering a 10 ms critical section.


struct Lock {
    struct Object *Ptr;
    int Kill_Flag;
}
struct Lock L = {NULL, 0};
semaphore mutex = 1;
semaphore kills = 0;

ICSM-K-Task-H()
{
    status = sm_p(mutex, No_Wait);
    if (status == SUCCESS) {
        Use Critical Section;
        sm_v(mutex);
    } else {
        L.Kill_Flag += 3;      /* Set kill bit and increment counter */
        Use Critical Section;
        L.Kill_Flag &= ~1;     /* Reset kill bit */
        sm_v(kills);
    }
}

Figure 5.26. The global data structures and the ICSM-K algorithm for the high priority task


ICSM-K-Task-L(struct Object *MyObject)
{
    struct Lock OldL, NewL;

    status = sm_p(mutex, Wait);
    do {    /* cs_restart */
        do {
            OldL = L;
            if ((OldL.Kill_Flag & 1) == TRUE)
                sm_p(kills, Wait);
        } while ((OldL.Kill_Flag & 1) == TRUE);
        Use Critical Section;
        NewL.Ptr = MyObject;
        NewL.Kill_Flag = OldL.Kill_Flag;
        status = cas2(&L, OldL, NewL);    /* cs commit */
    } while (status == FAILURE);
    sm_v(mutex);
}

Figure 5.27. The ICSM-K algorithm for the low priority task
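The cas2 call in Figure 5.27 must atomically compare and replace both words of the lock: the object pointer and the Kill flag. On the Motorola 68030 this is provided by the CAS2 instruction; the following C sketch (our rendering, matching the struct-valued pseudocode above) only states the atomic effect that is assumed.

/* Semantics assumed for cas2 on the two-word struct Lock.  The body
 * sketches the atomic effect; on the 68030 it is a single CAS2
 * instruction, not ordinary C code.  The SUCCESS/FAILURE values are
 * assumptions matching the surrounding pseudocode. */
#define SUCCESS 1
#define FAILURE 0

int cas2(struct Lock *loc, struct Lock old, struct Lock new_value)
{
    /* --- begin atomic --- */
    if (loc->Ptr == old.Ptr && loc->Kill_Flag == old.Kill_Flag) {
        loc->Ptr = new_value.Ptr;
        loc->Kill_Flag = new_value.Kill_Flag;
        return SUCCESS;
    }
    return FAILURE;
    /* --- end atomic --- */
}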


We collected the time to execute the critical section, and we created histograms that show the frequency with which the critical section execution takes a particular amount of time. The performance results of the LOCK-NK algorithm and the ICSM-K algorithm for 20% lock utilization are shown in Figure 5.28 and Figure 5.29, respectively. In a critical section using the LOCK-NK algorithm, the dequeue operation sometimes experiences a delay. By using the ICSM-K algorithm, the high priority dequeue operation is always performed without delay, while the enqueue operations experience only a small degradation.

In the next experiment, the work time of the enqueue tasks is set to 100 ms while the critical section time is extended to 50 ms, to reflect a high lock utilization of 100%. The parameters for the dequeue task remain the same as before. Figure 5.30 and Figure 5.31 show the time to enqueue and dequeue using the LOCK-NK and ICSM-K algorithms. The delay in the dequeue operation is eliminated by ICSM-K, but with a significant increase in the response time of the enqueue operations. This is because of the 100% lock utilization by the enqueue tasks.

A frequently encountered situation in a multiprocessor environment is a high priority task sharing a resource with a set of low priority tasks. Existing synchronization mechanisms block the high priority task from executing the critical section when a low priority task is in the critical section. The critical section execution time for the high priority task then varies depending on the resource utilization of the low priority tasks, resulting in an unpredictable response time for the high priority task. By using the ICSM-K algorithm for sharing a resource, the critical section response time for the high priority task is always predictable, irrespective of the utilization by the low priority tasks, because priority inversion is eliminated. This desirable feature of the ICSM-K algorithm makes it suitable for a real-time multiprocessor system.


Figure 5.28. Response time distribution of LOCK-NK (20% lock utilization).

Figure 5.29. Response time distribution of ICSM-K (20% lock utilization).


Figure 5.30. Response time distribution of LOCK-NK (100% lock utilization).

Figure 5.31. Response time distribution of ICSM-K (100% lock utilization).


5.4 ICSM with Priority Queue (ICSM-P)

Here, we consider a general shared-memory multiprocessor real-time system with tasks of varying priorities accessing a shared critical section. In the ICSM-P algorithm, we allow a task with a higher priority to kill a task with a lower priority that is using the lock. The killed lower priority task has to restart its critical section execution using the ICS mechanism. We use the PR-Lock algorithm described in Chapter 3 to implement the ICSM-P algorithm.

The PR-Lock uses a spin-lock for synchronization. Tasks wait for the lock to be released in a queue. The queue is ordered according to the priorities of the waiting tasks. The task that is currently using the lock is at the head of the queue, and may have a lower priority than the other waiting tasks. A waiting higher priority task can be blocked for at least a critical section execution time using the PR-Lock. This priority inversion can be avoided by using the ICS mechanism. We modify the PR-Lock such that if the first waiting task p_h has a higher priority than the task p_c currently using the critical section, then p_h interrupts the critical section execution of p_c and enters the critical section. Task p_c detects this interruption during its commit-and-release phase and restarts, acquiring the lock again. Thus, each process uses the acquire_lock and commit_release_lock operations to synchronize access to the critical section in the following way.

do {
    acquire_lock(L, r);
    critical section;
    status = commit_release_lock(L, r, Commit_Parameters);
} while (status == FAILURE);

The following sub-sections present the required modifications to the PR-Lock algorithm and the acquire_lock and commit_release_lock procedures.


5.4.1 Implementation

We modify the PR-Lock algorithm to implement the ICSM-P algorithm as follows. We add an Apriority field to the task's lock record structure that contains the task's actual priority. The Priority field in the lock record structure is modified by the PR-Lock algorithm. We use Compare&Swap2, another form of double-word Compare&Swap, in which the addresses of the two words are not consecutive. The semantics of the Compare&Swap2 we use are given in Figure 5.32. The Dq bit, the Dq count used to avoid the A-B-A problem, and the next record pointer can be implemented in a single 32-bit word by using an index for the next record address instead of a pointer, as sketched below.

If too many high priority tasks are allowed to acquire a lock without blocking, low priority tasks might experience an excessive number of restarts, increasing the response time and decreasing the schedulability of the task set (Chapter 4, Section 4.5). In order to limit re-executions, we introduce a minimum kill priority Pr_KILL. A task p_i with priority Pr(p_i) can interrupt a task p_j with priority Pr(p_j) only if Pr(p_i) >= Pr_KILL and Pr(p_i) > Pr(p_j).

Each process keeps the address of its record in a local variable (Self). In addition, each process requires two local pointer variables to hold the previous and the next queue elements for navigating the queue during the enqueue operation (Prev_Node and Next_Node). The data structures used are shown in Figure 5.33. The Dq bit of the Pointer field is initialized to TRUE, and the Ctr field is initialized to 0 before the record is first used.
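A minimal C sketch of that single-word packing follows; the field widths (1 Dq bit, a 15-bit counter, a 16-bit record index) are our illustrative assumption, not the layout used in the measured implementation.

/* Illustrative packing of (Dq bit, A-B-A counter, next-record index)
 * into one 32-bit word, so that a single-word Compare&Swap covers all
 * three fields.  Field widths are assumptions. */
typedef unsigned int packed_ptr;    /* assumed 32 bits wide */

#define DQ_SHIFT  31
#define CTR_SHIFT 16
#define CTR_MASK  0x7FFFu           /* 15-bit counter */
#define IDX_MASK  0xFFFFu           /* 16-bit record index */

packed_ptr pack(int dq, unsigned ctr, unsigned idx)
{
    return ((packed_ptr)(dq & 1) << DQ_SHIFT)
         | ((ctr & CTR_MASK) << CTR_SHIFT)
         | (idx & IDX_MASK);
}

int dq_bit(packed_ptr p)        { return (int)((p >> DQ_SHIFT) & 1u); }
unsigned counter(packed_ptr p)  { return (p >> CTR_SHIFT) & CTR_MASK; }
unsigned index_of(packed_ptr p) { return p & IDX_MASK; }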


Procedure CAS2(structure pointer *C1, *C2, *O1, *O2, *N1, *N2)
/* Assume CAS2 operates on two different words */
atomic {
    if (*C1 == *O1 && *C2 == *O2) {
        *C1 = *N1;
        *C2 = *N2;
        return(TRUE);
    } else {
        *O1 = *C1;
        *O2 = *C2;
        return(FALSE);
    }
}

Figure 5.32. CAS2 used in the ICSM-P Algorithm

Acquire_Lock Operation

The modifications to the PR-Lock acquire_lock operation are marked with M in Figure 5.34. A process p_h enqueues its lock record as in the PR-Lock algorithm. Before spinning on the lock variable, p_h checks whether it is eligible to kill any other process, and whether the process p_c in front of it has a lower priority. If the condition for interruption is satisfied, then process p_h attempts to acquire the lock for itself by using Compare&Swap2 to make the lock L point to its own record. Note that for Pr(p_c) < Pr(p_h), process p_c must be the current lock user. The conditions for a successful Compare&Swap2 are as follows: if the lock L is pointing to the record q_c of process p_c, and q_c (with the Dq bit not set) points to the record q_h of process p_h, then make the lock L point to q_h and set the Dq bit in q_c. This is illustrated in Figure 5.35. If the Compare&Swap2 is successful, then process p_h proceeds to use the critical section.

There are two cases for an unsuccessful attempt to interrupt p_c.

1. Process p_c may commit and release the lock by setting the Dq bit in q_c and making the lock L point to q_h.


structure Pointer {
    structure Object *Ptr;
    int31 Ctr;
    boolean Dq;
}

structure Record {
    structure structure_of_data Data;
    boolean Locked;
    integer Apriority;
    integer Priority;
    structure Pointer Next;
}

Shared Variable
    structure Pointer L;

Private Variables
    structure Pointer Self, Prev_Node, Next_Node;
    boolean Success, Failure;
    constant TRUE, FALSE, NULL, MAX_PRIORITY;
    constant KILL_PRIORITY;

Figure 5.33. Data Structures used in the ICSM-P Algorithm


2. A concurrent acquire_lock operation of another process p_g with Pr(p_g) > Pr(p_h) may overtake the Compare&Swap2 instruction of process p_h. Then, q_c points to the record q_g of process p_g and no longer points to record q_h.

In either case, the Compare&Swap2 instruction of process p_h fails and process p_h spins for the lock as in the PR-Lock. In the first case, process p_h is about to acquire the lock anyway. In the second case, process p_h is no longer the highest priority process among the waiting processes eligible to interrupt process p_c.

Commit_Release_Lock Operation

We use the Compare&Swap instruction to set the Dq bit for release, because of the nature of the acquire_lock operation and because the Dq bit is part of the next record pointer. If the Compare&Swap is successful, then the commit_release_lock operation proceeds to the commit phase. Otherwise, a higher priority process p_h has interrupted the critical section execution before the commit, and the commit_release_lock operation returns a failure status. The failed process re-queues by calling the acquire_lock operation again. We tie the commit phase of the critical section to the release_lock operation of the PR-Lock; i.e., the commit pointer is updated only after the Dq bit is set with a successful Compare&Swap. In this case, the commit_release_lock operation returns a success. The modified commit_release_lock operation is shown in Figure 5.36.

5.4.2 Correctness of the ICSM-P Algorithm

In this section, we present an informal argument for the correctness properties of our ICSM-P algorithm. We prove that the ICSM-P algorithm is correct by showing that it maintains a priority queue, and that the head of the priority queue is the process that holds the lock.


Procedure acquire_lock(L, Self)
{
    Success = FALSE;
    do {
        Prev_Node = NULL; Next_Node = NULL;
        if (CAS(&L, &Next_Node, &Self)) {                      /* No Lock */
            Success = TRUE; Failure = FALSE;                   /* Use Lock */
            Self.Ptr->Priority = MAX_PRIORITY;
            Self.Ptr->Next.Dq = FALSE;
        } else {                                               /* Lock in Use */
            Failure = FALSE;
            Self.Ptr->Locked = TRUE;
            do {
                Prev_Node = Next_Node;
                Next_Node = Prev_Node.Ptr->Next;
                if (Next_Node.Dq == TRUE                       /* Dequeued, Try Again   ii  */
                    or Prev_Node.Ptr->Priority < Self.Ptr->Priority)               /*  iii */
                    Failure = TRUE;
                else {
                    if (Next_Node.Ptr == NULL or
                        (Next_Node.Ptr != NULL and
                         Next_Node.Ptr->Priority < Self.Ptr->Priority)) {
                        Self.Ptr->Next.Ptr = Next_Node.Ptr;
                        if (CAS(&(Prev_Node.Ptr->Next), &Next_Node, &Self)) {      /*  F   */
                            Self.Ptr->Next.Dq = FALSE;
                            if (Self.Ptr->Apriority >= KILL_PRIORITY and           /*  M   */
                                Self.Ptr->Apriority > Prev_Node.Ptr->Apriority) {  /*  M   */
                                Next_Node.Ptr = Self.Ptr;                          /*  M   */
                                Temp_Node = Next_Node; Temp_Node.Dq = TRUE;        /*  M   */
                                if (CAS2(&L, &(Prev_Node.Ptr->Next), &Prev_Node,   /*  M   */
                                         &Next_Node, &Self, &Temp_Node))           /*  M   */
                                    Self.Ptr->Locked = FALSE;                      /*  M   */
                            }                                                      /*  M   */
                            while (Self.Ptr->Locked) ;         /* Busy Wait */
                            Success = TRUE;                    /* Then, use lock */
                        } else {
                            if (Next_Node.Dq == TRUE           /* Dequeued, Try Again   ii  */
                                or Prev_Node.Ptr->Priority < Self.Ptr->Priority)   /*  iii */
                                Failure = TRUE;
                            else
                                Next_Node = Prev_Node;                             /*  i   */
                        }
                    }
                }
            } while (!Success and !Failure);
        }
    } while (!Success);
}

Figure 5.34. The acquire_lock operation procedure for ICSM-P


Figure 5.35. A successful acquire_lock operation for ICSM-P: before, the lock L points to the head record q_c; after the enqueue, q_c points to q_h; after the kill, L points to q_h and the Dq bit of q_c is set.

Procedure commit_release_lock(L, Self, Commit_Parameters)
{
    Status = FAILURE;
    Old_Value = Self.Ptr->Next;
    while (Self.Ptr->Next.Dq == FALSE) {
        New_Value = Old_Value;
        New_Value.Dq = TRUE;
        if (CAS(&Self.Ptr->Next, &Old_Value, &New_Value)) {
            Commit the critical section operation;
            L = Self.Ptr->Next;
            if (Self.Ptr->Next != NULL) {
                L.Ptr->Priority = MAX_PRIORITY;
                L.Ptr->Locked = FALSE;
            }
            Status = SUCCESS;
        }
    }
    return(Status);
}

Figure 5.36. The commit_release_lock operation procedure for ICSM-P


For the correctness argument, we consider the acquire_lock algorithm to consist of two operations: the acquire_lock operation, as in the PR-Lock algorithm, and the kill_lock operation, in which a higher priority process attempts to preempt a lower priority process that is using the lock. The ICSM-P algorithm is decisive-instruction serializable [83]. All operations of the ICSM-P algorithm have a single decisive instruction. The decisive instruction for the acquire_lock operation is the successful Compare&Swap. The decisive instruction for the release_lock operation is the successful Compare&Swap instruction that sets the Dq bit. The decisive instruction for the kill_lock operation is the successful Compare&Swap2 instruction that sets the lock L and the Dq bit simultaneously. Corresponding to a concurrent execution C of the queue operations, there is an equivalent (with respect to return values and final states) serial execution S_d such that if operation O_1 executes its decisive instruction before operation O_2 does in C, then O_1 < O_2 in S_d. Thus, the equivalent priority queue of an ICSM-P lock is in a single state at any instant, simplifying the correctness proof (a concurrent data structure that is linearizable but not decisive-instruction serializable might be in several states simultaneously [37]).

We use the following notation in our discussion. An ICSM-P lock has lock pointer L, which points to the first record in the lock queue (and the record of the process that holds the lock). Let there be N processes p_1, p_2, ..., p_N that participate in the lock synchronization for a priority lock using the ICSM-P algorithm. As mentioned earlier, each process p_i allocates a record to enqueue and dequeue. Thus, each process p_i participating in the lock access is associated with a queue record q_i. Let Pr(p_i) be a function which maps a process to its priority, a number between 1 and N. We also define another function Pr(q_i) which maps a record belonging to a process p_i to its priority.


A priority queue is an abstract data type that consists of

• A finite set Q of elements q_i, i = 1...N.

• A function Pr : q_i → n_i, where n_i ∈ N. For simplicity, we assume that every n_i is unique. This assumption is not required for correctness; in fact, processes of the same priority will obtain the lock in FCFS order.

• Three operations: enqueue, dequeue, and kill.

At any instant, the state of the queue can be defined as Q = (q_0, q_1, q_2, ..., q_n), where q_1 <_Q q_2 <_Q ... <_Q q_n, and q_i <_Q q_j iff Pr(q_i) > Pr(q_j). We call q_0 the head record of priority queue Q. The head record's process is the current lock holder. Note that the non-head records are totally ordered.

The enqueue operation is defined as

    enqueue((q_0, q_1, q_2, ..., q_n), q) → (q_0, q_1, q_2, ..., q_i, q, q_{i+1}, ..., q_n)

where Pr(q_i) > Pr(q) > Pr(q_{i+1}).

The dequeue operation on a queue is defined as

    dequeue((q_0, q_1, q_2, ..., q_n), q_i) → (q_1, q_2, ..., q_n) <SUCCESS>        iff i = 0
                                            → (q_0, q_1, q_2, ..., q_n) <FAILURE>   otherwise

where SUCCESS and FAILURE are return values.


The kill operation on a queue is defined as

    kill((q_0, q_1, q_2, ..., q_n), q_i) → (q_1, q_2, ..., q_n) <SUCCESS>        iff i = 1 and Pr(q_1) > Pr(q_0)
                                         → (q_0, q_1, q_2, ..., q_n) <FAILURE>   otherwise

For every ICSM-P lock, there is an abstract priority queue Q. Initially, both the lock and Q are empty. When a process p with a record q performs the decisive instruction for the acquire_lock operation, Q changes state to enqueue(Q, q). Similarly, when a process executes the decisive instruction for a release_lock operation, Q changes state to dequeue(Q, q). We show that when we observe the lock, we find a structure that is equivalent to Q. To observe the lock, we take a consistent snapshot [15] of the current state of the system memory. Next, we start at the lock pointer L and observe the records following the linked list. If the head record has its Dq bit set and its process has exited the acquire_lock operation, then we discard it from our observation. If we observe the same records in the same sequence in both the lock L and Q, then we say that L and Q are equivalent, and we write L ≅ Q.

Theorem 1  The representative priority queue Q is equivalent to the observed queue L of the PR-Lock.

Proof. We prove the theorem by induction on the decisive instructions, using the following three lemmas.

Lemma 1  If Q ≅ L before a release_lock decisive instruction, then Q ≅ L after the release_lock decisive instruction.


Figure 5.37. Observed queue L before a release_lock (L → q0 → q1 → q2 → ... → qn) and after (L → q1 → q2 → ... → qn).

Proof: Let Q = (q_0, q_1, ..., q_n) before the release_lock decisive instruction. A release_lock operation is equivalent to a dequeue operation on the abstract queue:

    dequeue((q_0, q_1, ..., q_n), q_0) → (q_1, q_2, ..., q_n) <SUCCESS>

Let L point to the record q_0 of process p_0. If there is no other concurrent operation, the release_lock Compare&Swap decisive instruction of p_0 changes the Dq bit in q_0 from FALSE to TRUE, removing q_0 from the observable queue. The before and after states of L are shown in Figure 5.37. In this case, the release_lock operation returns a successful status. Thus, Q ≅ L after the release_lock operation. Note that L will point to q_1 before the next release_lock decisive instruction.

A concurrent kill_lock operation K of process p_1 may overtake the release_lock operation and set the Dq bit in q_0 to TRUE. Thus, Q = (q_1, q_2, ..., q_n) before the release_lock operation. By definition,

    dequeue((q_1, q_2, ..., q_n), q_0) → (q_1, q_2, ..., q_n) <FAILURE>

In this case, the Compare&Swap instruction of the release_lock operation fails and the release_lock operation returns a failed status. Therefore, Q ≅ L after the release_lock operation.


Lemma 2  If Q ≅ L before a kill_lock decisive instruction, then Q ≅ L after the kill_lock decisive instruction.

Proof. Let Q = (q_0, q_1, q_2, ..., q_n) before the kill_lock decisive instruction. A kill_lock operation is equivalent to a kill operation on the abstract queue. By definition,

    kill((q_0, q_1, q_2, ..., q_n), q_i) → (q_0, q_1, q_2, ..., q_n) <FAILURE>

for i ≠ 1. The kill_lock operation of p_i uses, in its Compare&Swap2 instruction, the Prev_Node pointer at which the calling process enqueued. For the Compare&Swap2 instruction to be successful, the lock L should point to the record q_j in Prev_Node, and the next pointer in record q_j must point to the record q_i. This is true only if j = 0 and i = 1. As the Compare&Swap2 instruction is atomic, this condition is not affected by any other concurrent operation. If i ≠ 1, the Compare&Swap2 fails, and this is regarded as the failure of the kill_lock operation. Thus, Q ≅ L in this case.

Next, we consider the case where i = 1. By definition,

    kill((q_0, q_1, q_2, ..., q_n), q_1) → (q_1, q_2, ..., q_n) <SUCCESS>

Let L point to the record q_0 of process p_0. If there is no other concurrent operation, the kill_lock Compare&Swap2 decisive instruction of p_1 changes the lock L to point to q_1 and the Dq bit in q_0 from FALSE to TRUE, removing q_0 from the observable queue. The before and after states of L are shown in Figure 5.38. In this case, the kill_lock operation returns a successful status. Thus, Q ≅ L after the kill_lock operation.

A concurrent release_lock operation R of process p_0 may overtake the kill_lock operation and set the Dq bit in q_0 to TRUE. Thus, Q = (q_1, q_2, ..., q_n) before the


Figure 5.38. Observed queue L before a kill_lock (L → q0 → q1 → q2 → ... → qn) and after (L → q1 → q2 → ... → qn).

kill_lock operation. By definition,

    kill((q_1, q_2, ..., q_n), q_1) → (q_1, q_2, ..., q_n) <FAILURE>

There are two cases to consider. In the first case, process p_0 is successful in setting the Dq bit in q_0 to TRUE. In the second case, in addition to the above, process p_0 changes the lock L to point to q_1. In both cases, the Compare&Swap2 instruction of the kill_lock operation fails and returns a failed status. Therefore, Q ≅ L after the kill_lock operation.

A concurrent acquire_lock operation A of process p may overtake the kill_lock operation and enqueue q between q_0 and q_1. Thus, Q = (q_0, q, q_1, q_2, ..., q_n) before the kill_lock operation. By definition,

    kill((q_0, q, q_1, q_2, ..., q_n), q_1) → (q_0, q, q_1, q_2, ..., q_n) <FAILURE>

In this case, the Compare&Swap2 instruction of the kill_lock operation fails: even though the lock L is pointing to q_0, q_0 no longer points to q_1. Therefore, Q ≅ L after the kill_lock operation.

Lemma 3  If Q ≅ L before an acquire_lock decisive instruction, then Q ≅ L after the acquire_lock decisive instruction.

Proof. There are two different cases to consider:


Figure 5.39. Observed queue L before and after an acquire_lock to an empty queue (before: L empty; after: L → q).

Figure 5.40. Observed queue L before and after an acquire_lock to a non-empty queue (q is inserted between q_i and q_{i+1}).

Case 1: Q = () before the acquire_lock decisive instruction. The equivalent operation on the abstract queue Q is the enqueue operation. Thus,

    enqueue((), q) → (q)

If the lock L is empty, the decisive Compare&Swap on the lock pointer succeeds and L points to q, as shown in Figure 5.39; thus Q ≅ L.

Case 2: Q = (q_0, q_1, ..., q_n) before the acquire_lock decisive instruction. The acquire_lock operation searches the queue for the insertion point, i.e., for consecutive records q_i and q_{i+1} such that


Pr(q_i) > Pr(q) > Pr(q_{i+1}). Then, the Next pointer in q is set to the address of q_{i+1}. The Compare&Swap instruction, marked F in Figure 5.34, attempts to make the Next pointer in q_i point to q. If the Compare&Swap instruction succeeds, then it is the decisive instruction of the acquire_lock operation: q is inserted between q_i and q_{i+1}, as shown in Figure 5.40, and Q ≅ L. If the Compare&Swap fails, a concurrent operation has modified the queue, and there are three cases.

Case a: A concurrent acquire_lock operation A' overtakes A and inserts a record q' between q_i and q_{i+1}, with Pr(q') > Pr(q) > Pr(q_{i+1}); then q's process will attempt to insert q between q_i and q_{i+1}. Process A' has modified q_i's next pointer, so q's Compare&Swap will fail. Since q_i has not been dequeued, Pr(q_i) > Pr(q), and Pr(q') > Pr(q), A's process can skip over q' and continue searching from q_{i+1}, which is what happens. This scenario is illustrated in Figure 5.41.

Case b: A release_lock operation R overtakes A and removes q_i from the queue (i.e., R has set q_i's Dq bit), and q_i has not yet been returned to the queue (its Dq bit is still set). Since q_i is not in the lock queue, A is lost and must start searching again. Based on its observations of q_i and q_{i+1}, A may have decided to continue searching the queue or to commit its operation. In either case A sees the Dq bit set and fails, so A starts again from the beginning of the queue. This scenario is illustrated in Figure 5.42.


Figure 5.41. A concurrent acquire_lock A' succeeds before A: A' inserts q' after q_i, and A continues its search.

Case c: A release_lock operation R overtakes A and removes q_i from the queue, and q_i is then returned to the queue by a subsequent acquire_lock operation A' of its process. If Pr(q) > Pr(q_i), A is lost and cannot find the correct place to insert q. This condition is detected when the priority of q_i is examined (the lines marked iii in Figure 5.34), and operation A restarts from the head of the queue. If Pr(q) < Pr(q_i), then A can still find a correct place to insert q past q_i. This scenario is illustrated in Figure 5.43.

Figure 5.42. A concurrent release_lock R succeeds before A: q_i is removed from the queue, and A restarts from the head of the queue.

In all cases, the acquire_lock decisive instruction either succeeds at a correct insertion point or fails and the search continues, so Q ≅ L after the acquire_lock decisive instruction.

We can now prove the theorem. Initially, both Q and L are empty, so Q ≅ L. Assume that Q ≅ L after the (i-1)th decisive instruction. If the i-th decisive instruction is for an acquire_lock operation, Lemma 3 implies Q ≅ L after the i-th decisive instruction. If the i-th decisive instruction is for a kill_lock operation, Lemma 2 implies Q ≅ L after the i-th decisive instruction. If the i-th decisive instruction is for a release_lock operation, Lemma 1 implies Q ≅ L after the i-th decisive instruction. Therefore, the inductive step holds, and hence Q ≅ L.

5.4.3 Experimental Performance Results

We compared the performance of the MCS-Lock and PR-Lock algorithms with the ICSM-P algorithm. We selected two versions of the ICSM-P algorithm: one with a single process that has the kill ability (ICSM-P1K), and another with two processes that have the kill ability (ICSM-P2K). We again used the Motorola 68030 shared-memory multiprocessor system running the pSOS+ kernel. We ran the experiments with four processors, each processor running one task that uses a critical section with the lock. The tasks are identified by the processor number on which they run, from 1 to 4. Priorities are assigned such that higher task numbers have higher priorities.

Each task works (idles) for a uniformly distributed random time between 20 and 40 ms (milliseconds) before entering a critical section. The critical section time is also a random variable, uniformly distributed within 25% about a selected mean.


Figure 5.43. Release_lock R and acquire_lock A' succeed before A: A restarts if Pr(q_A) > Pr(q_i), and continues if Pr(q_A) <= Pr(q_i).

We varied the critical section time used by each processor from 2.6 to 13 ms, to achieve lock utilizations of 25%, 50%, 75%, and 100%. We stopped the experiment as soon as any task completed 10,000 acquire/release lock operations, and collected the time to execute the critical section. We created histograms that show the frequency with which the critical section execution takes a particular amount of time. The performance results for lock utilizations of 25%, 50%, and 75% are presented in Table 5.3, and the results for 100% lock utilization are presented in Table 5.4.

First, let us consider the results for the lock utilization of 100% only. With the MCS-Lock algorithm, 0.75% of the critical section executions of the highest priority task (task 4) had the worst case response time of 30 ms. The PR-Lock algorithm reduced this worst case response to 20 ms for task 4. This improvement comes at the cost of increasing the frequency of critical section executions that took 30 ms for task 3.


3. There is also a cost of increasing the worst-case response time from 30 ms to 40 ms for tasks 1 and 2. For task 4, the ICSM-P1K algorithm reduced the frequency of critical section executions taking 20 ms from 33.30% to 0.70%. This improvement is achieved at the penalty of a 10 ms increase in the worst-case response time for tasks 2 and 3, and a 20 ms increase for task 1. The ICSM-P2K algorithm reduces the worst-case response time for task 3 from 40 ms to 30 ms. In this case, the worst-case response time for task 1 becomes 100 ms and the worst-case response time for task 2 becomes 60 ms.

There is a similar, but less significant, improvement for the other lock utilizations. One reason for the small improvement at low lock utilization is the inability to measure time accurately in the pSOS+ operating system. The other reason is that the overhead of the ICSM-P algorithm becomes negligible as the lock utilization increases. As shown, the ICSM-P1K algorithm reduces the response time of the highest priority task (task 4), and ICSM-P2K reduces the response times of the high priority tasks 3 and 4. The improvement in response time is more pronounced at higher lock utilizations. As expected, there is an increase in the response time of the low priority tasks when using the ICSM-P algorithm.

5.4.4 ICSM-P Algorithm with Single-Word CAS

We present a more aggressive ICSM-P algorithm that uses a single-word Compare&Swap instead of the two-word Compare&Swap presented before. In the acquire_lock operation, the higher priority task ph tries to set the Dq bit of the lower priority task pc currently using the lock to TRUE, instead of first enqueuing itself in the queue. This operation requires only a single-word Compare&Swap instruction. If the instruction is successful, the record is modified to point to the


Table 5.3. Performance comparison of the ICSM-P algorithm for lock utilizations of 25%, 50%, and 75%

             25% util.       50% util.               75% util.
             Bins (ms)       Bins (ms)               Bins (ms)
Algo  Task    10     20      10     20     30       10     20     30     40     50
MCS    1    97.46   2.54   91.46   8.54   0.07    81.89  18.02   0.09
       2    97.51   2.49   91.19   8.81          82.61  17.31   0.08
       3    97.78   2.22   91.65   8.35          83.01  16.92   0.07
       4    97.92   2.08   91.96   8.14          81.91  18.05   0.04
PR     1    97.50   2.50   89.13  10.80          80.12  16.56   0.92
       2    97.02   2.98   90.88   9.12          83.00  16.56   0.44
       3    97.57   2.43   92.19   7.81          83.80  16.02   0.18
       4    98.02   1.98   92.24   7.76          84.23  15.77
P1K    1    95.05   4.95   82.06  17.33   0.61   65.97  27.39   5.98   0.66
       2    95.08   4.92   84.25  15.75          69.98  26.03   3.78   0.21
       3    95.62   4.38   86.15  13.85          72.96  24.10   2.94
       4    99.63   0.37   99.63   0.37          99.56   0.44
P2K    1    92.64   7.36   75.68  22.41   1.91   53.00  28.82  11.88   4.87   1.43
       2    93.24   6.76   77.00  22.22   0.78   59.46  29.00   8.77   2.77
       3    96.77   3.23   90.90   9.10          87.58  11.68   0.74
       4    99.59   0.41   99.69   0.31          99.48   0.52

(Each entry is the percentage of critical section executions whose response time fell in the given bin; blank entries are zero.)


Table 5.4. Performance comparison of the ICSM-P algorithm for a lock utilization of 100%

              Bins (ms)
Algo  Task    10     20     30     40     50     60     70     80     90    100
MCS    1    63.39  35.74   0.87
       2    63.84  35.31   0.85
       3    64.22  35.22   0.56
       4    64.38  34.87   0.75
PR     1    61.88  35.19   2.57   0.36
       2    63.49  34.84   1.51   0.16
       3    65.41  33.89   0.70
       4    66.70  33.30
P1K    1    31.34  35.89  16.82  11.08   4.63   4.02
       2    38.00  37.92  15.27   7.19   1.62
       3    46.78  35.89  12.51   4.92
       4    99.29   0.71
P2K    1    17.08  18.79   9.04  10.31  10.16   6.12   3.63   3.71   4.19  16.97
       2    31.43  28.18  10.76  12.10  10.93   6.60
       3    84.54  11.18   4.28
       4    99.52   0.48


rest of the queued records. The higher priority task then changes the lock to point to its own record and enters the critical section. The Compare&Swap instruction can fail if task pc sets the Dq bit first, in preparation for a commit. The Compare&Swap instruction of task ph also fails if another task with a higher priority than ph succeeds in setting the Dq bit first. In either case, task ph then queues its record as in the PR-Lock algorithm. The modified algorithm is presented in Figure 5.44; the modifications are marked with the letter M. The commit_release_lock operation is as before, except that a task pc that is unsuccessful in releasing the lock waits until the lock pointer L is changed. This is shown in Figure 5.45. We expect the performance of the modified algorithm to be similar to that of the ICSM-P algorithm discussed before.

5.5 Conclusions

In this chapter, we extended Interruptible Critical Sections (ICS) to multiprocessor systems. We showed how ICS can be implemented under different environments of a multiprocessor system. In the ICSM-R algorithm for a non-real-time multiprocessor system, tasks release the lock during a context switch. The lock is re-acquired by the task during its next quantum, and the critical section is restarted, if necessary. Our experimental and analytical results show that, except for a high ratio of critical section time to quantum time, the ICSM-R algorithm performs better than an algorithm that holds the lock while the task experiences a context switch. To implement the ICSM-R algorithm, a small addition to the context-switch routine of the operating system is necessary.


Procedure acquire_lock(L, Self) {
    Success = FALSE;
    do {
        Prev_Node = NULL; Next_Node = NULL;
        if (CAS(&L, &Next_Node, &Self)) {                         /* No Lock */
            Success = TRUE; Failure = FALSE;                      /* Use Lock */
            Self.Ptr->Priority = MAX_PRIORITY;
            Self.Ptr->Next.Dq = FALSE;
        }
        else if (Next_Node.Ptr->Priority < Self.Ptr->Priority) {             M
            while (Next_Node.Dq == FALSE) {                                  M
                Temp_Node = Next_Node;
                Temp_Node.Dq = TRUE;                                         M
                if (CAS(&Next_Node.Ptr->Next, &Next_Node, &Temp_Node)) {     M
                    Self.Ptr->Next.Ptr = Next_Node.Ptr;                      M
                    Success = TRUE; Failure = FALSE;          /* Use Lock */ M
                    Self.Ptr->Priority = MAX_PRIORITY;                       M
                    Self.Ptr->Next.Dq = FALSE;
                    L.Ptr = Self.Ptr;                                        M
                }                                                            M
            }                                                                M
        }
        else {                                                /* Lock in Use */
            Failure = FALSE;
            Self.Ptr->Locked = TRUE;
            do {
                Prev_Node = Next_Node;
                Next_Node = Prev_Node.Ptr->Next;
                if ((Next_Node.Dq == TRUE)            /* Dequeued, Try Again */ ii
                    or (Prev_Node.Ptr->Priority < Self.Ptr->Priority))          iii
                    Failure = TRUE;
                else {
                    if (Next_Node.Ptr == NULL
                        or (Next_Node.Ptr != NULL
                            and Next_Node.Ptr->Priority < Self.Ptr->Priority)) {
                        Self.Ptr->Next.Ptr = Next_Node.Ptr;
                        if (CAS(&(Prev_Node.Ptr->Next), &Next_Node, &Self)) {   F
                            Self.Ptr->Next.Dq = FALSE;
                            while (Self.Ptr->Locked) ;        /* Busy Wait */
                            Success = TRUE;                   /* Then, use lock */
                        }
                        else {
                            if ((Next_Node.Dq == TRUE)  /* Dequeued, Try Again */ ii
                                or (Prev_Node.Ptr->Priority < Self.Ptr->Priority)) iii
                                Failure = TRUE;
                            else
                                Next_Node = Prev_Node;
                        }
                    }
                }
            } while (!Success and !Failure);
        }
    } while (!Success);
}

Figure 5.44. The acquire_lock of ICSM-P with single-word CAS


Procedure commit_release_lock(L, Self, Commit_Parameters) {
    Status = FAILURE;
    Old_Value = Self.Ptr->Next;
    while (Self.Ptr->Next.Dq == FALSE) {
        New_Value = Old_Value;
        New_Value.Dq = TRUE;
        if (CAS(&Self.Ptr->Next, &Old_Value, &New_Value)) {
            Commit the critical section operation;
            L = Self.Ptr->Next;
            if (L.Ptr != NULL) {
                L.Ptr->Priority = MAX_PRIORITY;
                L.Ptr->Locked = FALSE;
            }
            Status = SUCCESS;
        }
    }
    if (Status == FAILURE)
        while (L == Self.Ptr) ;    /* Wait till the lock pointer L is changed */
    return(Status);
}

Figure 5.45. The commit_release_lock of ICSM-P with single-word CAS
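The essence of Figures 5.44 and 5.45 is the single decisive single-word CAS on the lock holder's Next word. A minimal sketch of just that step in modern C follows; it is our illustration, not the figures' pseudocode, and it assumes the Dq bit is packed into the low-order bit of the Next word, consistent with the single-word layout the variant requires. The names qnode, DQ_BIT, and preempt_holder are ours.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    typedef struct qnode {
        _Atomic uintptr_t next;        /* successor pointer, low bit = Dq */
        int priority;
    } qnode;

    #define DQ_BIT ((uintptr_t)1)

    /* A higher priority task self tries to preempt the current lock holder.
     * One single-word CAS sets the holder's Dq bit, which simultaneously
     * marks the holder dequeued and freezes its successor list. */
    static bool preempt_holder(qnode *holder, qnode *self)
    {
        uintptr_t old = atomic_load(&holder->next);
        if (old & DQ_BIT)                    /* holder is already committing */
            return false;
        if (!atomic_compare_exchange_strong(&holder->next, &old, old | DQ_BIT))
            return false;                    /* lost the race: queue as in PR-Lock */
        atomic_store(&self->next, old);      /* inherit the queued records */
        return true;                         /* caller points L at self and enters */
    }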


The ICSM-K algorithm prevents a possible priority inversion for a high priority task that shares a critical section with a set of low priority tasks. The resulting increase in the response time of the critical section for the low priority tasks depends on their lock utilization. This algorithm is suitable for real-time multiprocessor systems. The ICSM-K algorithm uses an atomic Compare&Swap instruction that is available on most shared-memory multiprocessor systems.

In the ICSM-P algorithm, we consider a real-time shared-memory multiprocessor system and a set of tasks with varying priorities. The ICSM-P algorithm reduces the priority inversion possible for the high priority tasks. There is a penalty of increased response times of the critical section for the low priority tasks; the ICSM-P algorithm can be tuned to limit this increase to a certain degree. The ICSM-P algorithm also uses an atomic Compare&Swap instruction for its implementation.

As in the uniprocessor system, using ICS in a multiprocessor system allows the high priority task to always complete quickly. In a real-time system, ICS prevents priority inversion. In addition, the ICS mechanism is independent of the scheduling algorithm.


CHAPTER 6
CONCLUSIONS

Real-time systems are becoming increasingly important. It is becoming necessary to impose deadlines on responses to every action (for example, message transfers and process executions), be it in the field of computer networks, database systems, simulation, AI, uniprocessor systems, multiprocessor systems, or distributed systems. Even where deadlines are not strictly necessary, we can find applications that can use systems that respond within deadlines.

Synchronization for mutual exclusion is an important issue in real-time systems for the same reasons as in non-real-time systems, in which processes cooperate to use a shared resource in an orderly way. We have shown in Chapter 2 that existing synchronization mechanisms for non-real-time systems are inadequate for this purpose. We have broadly classified synchronization algorithms depending on the characteristics of their primitives: pessimistic and optimistic.

In pessimistic synchronization, a lock or a queue regulates access to a shared resource so that only one process uses the resource at a time. This is the traditional method, but it is overly restrictive for a high priority process in a real-time system. In optimistic synchronization, a process proceeds to use the shared resource with the assumption that no other process is interfering with it. If such interference occurs, it is detected and a recovery action is taken. The recovery action can be either to re-execute the critical section that accesses the resource or to post the process's modification to the resource so that other processes can complete the operation.
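To illustrate the contrast, the optimistic pattern in its smallest form is a compare-and-swap retry loop; the fragment below is our sketch, not an algorithm from this work.

    #include <stdatomic.h>

    /* Optimistic synchronization in miniature: do the critical section work
     * without holding a lock, then attempt to install the result; if another
     * process interfered, the CAS fails and the work is re-executed. */
    static void optimistic_increment(_Atomic long *shared)
    {
        long seen, wanted;
        do {
            seen = atomic_load(shared);     /* observe the shared resource */
            wanted = seen + 1;              /* the critical section's work */
        } while (!atomic_compare_exchange_weak(shared, &seen, wanted));
    }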


We developed synchronization algorithms using each of these techniques that perform well in real-time systems.

6.1 PR-Lock Algorithm

The PR-Lock algorithm presented in Chapter 3 is based on the pessimistic approach. PR-Lock is a prioritized spin-lock synchronization algorithm for shared-memory multiprocessor systems. Processes using the PR-Lock wait in a priority-ordered queue. We have shown that the amount of time to acquire the lock using PR-Lock is inversely proportional to the priority of a process, making PR-Lock suitable for real-time systems. Releasing the lock is a constant-time operation. Priority inversion, in which a higher priority process is blocked by a lower priority process, can still occur using PR-Lock. Simulation results show that the PR-Lock algorithm performs well in practice. The PR-Lock algorithm has the following advantages over other proposed prioritized spin-locks.

• The algorithm is contention free.

• A higher priority process does not have to work for a lower priority process while releasing the lock. As a result, the time to acquire and release the lock is fast and predictable.

• The PR-Lock has a well-defined acquire-lock point.

• The PR-Lock maintains a pointer to the process using the lock, which facilitates implementing priority inheritance protocols.


6.2 Interruptible Critical Sections on Uniprocessors

We presented Interruptible Critical Sections in Chapter 4, an optimistic approach to synchronization that is applicable to real-time systems. We designed two synchronization algorithms for a real-time uniprocessor system. The first algorithm is based on ICS. The other algorithm, the Interruptible Lock, uses a combination of Interruptible Critical Sections and semaphore locks. Both algorithms reduce the variance in the response time of the highest priority task with only a small impact on the performance of low priority tasks. The Interruptible Lock algorithm further improves the performance of low priority tasks over the algorithm based on pure ICS. We also showed how Interruptible Critical Sections can be combined with the Priority Ceiling Protocol, and presented an analysis which shows that interruptible locks improve the schedulability of task sets that have high priority tasks with tight deadlines.

6.3 Interruptible Critical Sections on Multiprocessors

We extended the usage of Interruptible Critical Sections to shared-memory multiprocessors under various environments in Chapter 5. The first environment considered is a non-real-time multiprocessor system. In the ICSM-R algorithm, the lock is released if a process experiences a context switch. When the process regains control of the processor, the lock is re-acquired, and if there has been no conflicting execution of the critical section, the process continues; otherwise, the process restarts the critical section. Our experimental results show that ICSM-R performs well under various load conditions compared to an algorithm that does not release the lock. We also presented an analysis showing that it is advantageous to use ICSM-R, except for a high ratio of critical section time to quantum time.


In the real-time environment, we first consider a single high priority task together with a set of low priority tasks. In the ICSM-K algorithm, the high priority task is allowed to access the critical section unconditionally. Any low priority task concurrently executing the critical section detects this conflict and re-executes the critical section. A semaphore lock is used to regulate access to the critical section among the low priority tasks.

In the ICSM-P algorithm, tasks can have arbitrary priorities. The ICSM-P algorithm is based on the PR-Lock spin-lock algorithm and uses a priority queue. If the highest priority process waiting for the lock in the queue finds that a lower priority process is using the lock, it enters the critical section itself. The low priority process detects this conflict and retries the lock acquisition. Experimental results show that all the algorithms perform well, making the Interruptible Critical Sections mechanism a viable approach to synchronization in real-time systems.

6.4 Final Words

While the idea of synchronizing optimistically is not new, it has not been used in a real-time system environment. Optimistic concurrency control techniques were proposed in database systems long ago, but recovery in a database system is a costly operation. Contrary to expectations, optimistic synchronization is suitable for real-time systems. Critical high priority tasks have faster access to critical sections, and the overall schedulability of a set of processes can be improved. Lower priority processes do not suffer many re-executions because a real-time system is scheduled conservatively, not aggressively.




BIOGRAPHICAL SKETCH

Krishna Harathi was born in Anantapur, a small town in Andhra Pradesh, India, in 1957. He is the son of Sri. Raghavan Harathi and Smt. Vasantha Lakshmi Harathi. He has three younger sisters, Padma, Hema, and Sridevi, and a younger brother, Hari.

His education started in Viveka Vardhini School, Hyderabad, India. For his higher secondary education, he attended the Madras High School in Delhi. He received a Bachelor of Technology degree from Jawaharlal Nehru Technological University, Anantapur, India, in 1979. In 1992, he received a Master of Engineering degree from the University of Florida in Gainesville, Florida.

In 1979, he started working as a Technical Officer at Electronics Corporation of India Ltd., Hyderabad, a company that, among other products, manufactures computers. His areas of professional experience include network planning and control, real-time operating systems, fault-tolerant communication networks, graphics, and database systems. By the time he left the company in 1989 to further his education, he had become a Senior Technical Officer.

His research interests are in the areas of real-time systems, distributed systems, fault-tolerance issues in networks, data compression techniques, and concurrency control issues and nested transactions in distributed databases. His hobbies include gardening, jogging, and automobile maintenance.


I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Theodore J. Johnson, Chairman
Assistant Professor of Computer and Information Sciences

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Richard E. Newman-Wolfe
Assistant Professor of Computer and Information Sciences

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Randy Chow
Professor of Computer and Information Sciences

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Yann-Hang Lee
Associate Professor of Computer and Information Sciences


I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Paul R. Avery
Associate Professor of Physics

This dissertation was submitted to the Graduate Faculty of the College of Engineering and to the Graduate School and was accepted as partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Winfred M. Phillips
Dean, College of Engineering

Karen A. Holbrook
Dean, Graduate School