Group Title: Department of Computer and Information Science and Engineering Technical Reports
Title: The performance of holding versus releasing locks in a multiprogrammed multiprocessor
Permanent Link: http://ufdc.ufl.edu/UF00095331/00001
 Material Information
Title: The performance of holding versus releasing locks in a multiprogrammed multiprocessor
Series Title: Department of Computer and Information Science and Engineering Technical Reports
Physical Description: Book
Language: English
Creator: Johnson, Theodore
Harathi, Krishna
Publisher: Department of Computer and Information Sciences, University of Florida
Place of Publication: Gainesville, Fla.
Copyright Date: 1995
 Record Information
Bibliographic ID: UF00095331
Volume ID: VID00001
Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.

The Performance of Holding Versus Releasing Locks in a

Multiprogrammed Multiprocessor

Theodore Johnson Krishna Harathi
Dept. of Computer and Information Science
University of Florida


Abstract
In a multiprogrammed multiprocessor system, existing lock based mechanisms hold the lock for a
task during a context-switch. In this case, the time that a task holds a lock can be greatly increased,
resulting in a wasted lock utilization and an increase in the average response time for tasks using the
lock. An option for avoiding this problem is to release the lock held by a task during a context-switch.
The lock is reacquired when the task executes in the next quantum. In this paper, we compare the
performance of releasing versus holding a lock during a critical section. We discuss our implementation,
the ICSM-R algorithm, as implemented on a multiprocessor system. Next, we develop an analytical
performance model, and validate it with a simulation. We study the various parameters under which
the ICSM-R algorithm outperforms lock-based algorithms. We find that if the critical section execution
time is 50% or less of the time quantum, ICSM-R has better performance than locking alone, reducing
response times and improving scalability.


1 Introduction


Mutual exclusion is a significant problem for sharing a resource in a multiprogrammed shared-memory

multiprocessor system. Existing lock based synchronization mechanisms hold the lock for a task even after

a context-switch when the task is not executing. In such an event, tasks on other processors waiting for the

lock will be blocked for a length of time much larger than the critical section execution time. As

a result, the lock utilization and the response times are needlessly increased. This bottleneck will be more

pronounced as the number of processors and the degree of multiprogramming are increased. In spite of this

problem, multiprogramming is often used in shared memory multiprocessors, in order to improve processor

utilization or perform background tasks (such as asynchronous I/O).

One way to avoid this problem with multiprogramming is to require a task to release all of its locks
before a context switch. In this paper, we analyze the performance of the ICSM-R algorithm. In the
ICSM-R algorithm, a task releases its locks on a context switch. When the task is re-scheduled on the
processor, the lock is re-acquired and a test for a conflicting critical section access is performed. If a
conflicting access did complete, the task restarts its execution of the critical section. The ICSM-R
algorithm does not block tasks needlessly, but there is a penalty of restarting the critical section due to
conflicting accesses by other tasks. However, the critical section response time will be low if the number
of restarts of the critical section is small. We study the parameters and conditions under which the
ICSM-R algorithm performs better than an algorithm that does not release the lock during a
context-switch.

There has recently been a great deal of interest in the problem of handling critical sections in a multipro-

grammed shared memory multiprocessor. Anderson et al. [2] show that a naive implementation of spin-locks

can not only delay the processor waiting for a lock, but other processors doing work. They suggest an

Ethernet-style backoff scheme or a queue-based algorithm for reducing the cost of spin-waiting. McCann et

al. [11] conclude that preempting processors in a coordinated way is critical to response times while using

critical sections. Anderson et al. [3] argue that the operating system should recognize that a preempted

thread is executing in a critical section, and execute the preempted thread until the thread exits the critical

section. An approach that is related to the ICSM-R algorithm is the use of non-blocking algorithms.

Alemany and Felten [1] consider implementation issues of non-blocking concurrent objects on shared-memory

multiprocessors. They show how the resources wasted by the non-blocking operations that fail and the cost

of data copying required by a non-blocking implementation can be reduced by relying on the operating sys-

tem support. Bershad [4] discusses two approaches for implementing kernel level support for non-blocking

critical sections.

Interest in pre-emptable locks has recently developed in the real time systems community. Takada and

Sakamura [16] proposed algorithms that extend queuing spin-locks to be preempted for servicing interrupts.

They address the conflicting issue of servicing a pending interrupt while holding a lock. Shu et al. [15]

proposed an Abort Ceiling Protocol, an extension to the Priority Ceiling Protocol [14]. In this algorithm, an

abort ceiling priority is associated with a task. Another task may abort the currently running task and run

immediately if its priority is higher than the current abort ceiling. The protocol relies on the Interruptible

Critical Sections to restart the critical section of the aborted task. The Ceiling Abort Protocol [17] proposed

by Takada and Sakamura is a similar extension to the Priority Ceiling Protocol. This protocol assigns an

abort ceiling priority to the critical section instead.

The contribution of this work is to provide a detailed analytical performance model of both the

lock holding and the lock releasing strategies. Some previous works have pointed out the benefit of a lock

releasing strategy [3, 1], but did not provide an analytical model. By using the analytical model, we can

make a detailed comparison of lock holding and of lock releasing, and find that a lock releasing strategy is

better than a lock holding strategy if the critical section execution time is 50% or less of a time quantum.










2 The ICSM-R Algorithm


We introduced the idea of an Interruptible Critical Section (ICS) [9], which is a critical section protected

by optimistic concurrency control instead of by blocking. A task calculates its modifications to the shared

data structure, then attempts to commit its modification. If a higher priority task previously committed

a conflicting modification, the lower priority task fails to commit, and must try again (as in optimistic

concurrency control [5]). Otherwise, the task succeeds, and continues in its work. In [7], the ICS algorithm

is extended to the ICSM-R algorithm, which executes on multiprocessors.

In an interruptible critical section, a process can perform only one write that is visible to other processes.

Furthermore, the globally visible write must be the last instruction in the protected region. Therefore, a

process that is executing an ICS records its updates in a private buffer (the commit buffer). The final write

commits the updates that are recorded in the buffer by setting a commit flag. Any subsequent process that

executes the ICS performs the updates and clears the commit flag.
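
The commit-buffer discipline described above can be illustrated with a small single-threaded sketch. This is only an illustration under stated assumptions, not the implementation from [9]: the names `IcsObject`, `apply_committed` and `run_ics` are hypothetical, and the buffer is modeled as a field that other processes ignore while the commit flag is clear.

```python
from dataclasses import dataclass, field


@dataclass
class IcsObject:
    """Shared object protected by an ICS (hypothetical layout)."""
    data: dict = field(default_factory=dict)
    commit_buffer: dict = field(default_factory=dict)  # ignored while flag is clear
    commit_flag: bool = False                          # set by the one visible write


def apply_committed(obj):
    """Done by any process entering the ICS: perform pending updates, clear flag."""
    if obj.commit_flag:
        obj.data.update(obj.commit_buffer)
        obj.commit_flag = False


def run_ics(obj, compute_updates):
    """Execute one pass through the ICS."""
    apply_committed(obj)
    # Compute the modifications from a snapshot; nothing is visible yet.
    updates = compute_updates(dict(obj.data))
    obj.commit_buffer = dict(updates)
    # Last instruction of the protected region: the single visible write.
    obj.commit_flag = True
```

If a process is interrupted before the final assignment, nothing it wrote is treated as committed, so it can simply restart the section.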

Related approaches to optimistic synchronization are discussed by Alemany and Felten [1] and by Bershad

[4]. However, these protocols require that if a context switch occurs while a task is executing a critical section,

the operating system kernel must either complete the critical section execution or back out of the execution.

Requiring the kernel to perform the execution on behalf of the task has many difficulties. The task must

pass a great deal of information to the kernel to initialize a critical section, there is a great potential for

security problems and kernel corruption, the critical section execution might not terminate, and the lengthy

context switch can cause timing problems. By contrast, the ICSM-R algorithm pushes most of the work of

implementing releasable critical sections to the user level. The code to support ICSM-R is easy to implement,

and we present our implementation results in the next section. Other authors suggest that all tasks in a

program be scheduled for execution simultaneously [11]. However, synchronizing the simultaneous context

switches can be difficult to implement, and does not account for context switches due to background tasks.

We discuss the ICSM-R algorithm as implemented in a VMEexec [13] system development environment

with a pSOS+ [12] real-time, multi-tasking operating system kernel. The VMEexec system consists of a

host running on a VMEmodule driven SYSTEM V/68 operating system and a set of VMEmodule target

processors running the pSOS+ kernel. In our configuration, we have six MVME147 VMEmodules based

on the Motorola MC68030, with shared memory on each module. One VMEmodule is used as a host

processor running the SYSTEM V/68 and the rest are real-time target processors running the pSOS+ kernel.

pSOS+ is a real-time, multi-tasking kernel that supports multi-processors. It provides a rich set of

system services including task management, shared-memory regions, synchronous / asynchronous signals,










semaphores, and messages. One particular feature that pSOS+ supports is user-written routines that can

be called at the start of a task, during a context-switch, and at the end of a task. This feature allows us to

implement ICS support without modifying the kernel.

We implemented ICSM-R with spin-locks using the Test&Set instruction. Two data structures are needed

to implement ICSM-R: one for the critical section, and one for each task that uses the critical section.

The global lock structure consists of a critical section identifier, a counter that tracks the number of

times the critical section has been executed, and the critical section bounds. It also contains the address of

the global spin-lock variable. The structure local to a task consists of the copy of the ICS execution count,

a count of the number of times the critical section is retried on any invocation (for statistics), a pointer to

the ICSLstruct and a flag to indicate that the task is entering the critical section.

The ICS implementation code consists of two parts: the ICSM-R context-switch routine, which provides the ICS

lock mechanism, and the ICSM-R task, which uses the ICS mechanism.

The ICSM-R context-switch routine is integrated with the pSOS+ kernel as a user-written routine that is
called during a context-switch. The call occurs at the point where the context of the switched-out task
has been completely saved, and before the context of the switched-in task is loaded. pSOS+ provides the
addresses of the Task Control Blocks (TCBs) of both the switched-in task and the switched-out task in
machine registers. The TCB contains all the context of a task, including the Program Counter (PC). The
routine can reset the PC in the TCB of a switched-in task, if required.

The routine first checks if the program counter (PC) of the old task about to be switched out is within
the critical section region, and if so, it releases the spin-lock by setting the lock variable to zero. Next,
the routine checks if the new task about to be switched in is within the critical section region. If so, it
attempts to reacquire the spin-lock without spinning. If successful, the routine checks if there was a
conflicting operation in the interim. If so, it sets the task's PC to re-execute the critical section. If there
is no conflict, the task is allowed to continue where it left off when it was switched out. If the attempt to
re-acquire the lock is unsuccessful, the task is made to restart from the point of acquiring the lock for
the critical section.
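
The two data structures and the context-switch logic described above can be sketched as a small simulation. This is a hedged illustration, not the pSOS+ routine itself: the class and field names (`IcsLock`, `IcsTask`, `acquire_pc`, `seen_count`) are hypothetical, addresses are plain integers, and a real implementation would use the Test&Set instruction rather than an ordinary comparison.

```python
from dataclasses import dataclass


@dataclass
class IcsLock:
    """Global per-critical-section structure (field names are hypothetical)."""
    cs_id: int
    exec_count: int      # number of times the critical section has been executed
    cs_start: int        # critical section bounds, as code addresses
    cs_end: int
    lock: int = 0        # spin-lock variable: 0 means free, 1 means held


@dataclass
class IcsTask:
    """Per-task structure, merged here with the saved PC from the TCB."""
    pc: int              # saved program counter
    acquire_pc: int      # address of the lock-acquire code before the section
    seen_count: int = 0  # task's copy of exec_count, taken on entry
    retries: int = 0     # kept for statistics only


def in_critical_section(task: IcsTask, ics: IcsLock) -> bool:
    return ics.cs_start <= task.pc < ics.cs_end


def on_context_switch(old: IcsTask, new: IcsTask, ics: IcsLock) -> None:
    # Outgoing task: release the lock if it was preempted inside the section.
    if in_critical_section(old, ics):
        ics.lock = 0
    # Incoming task: only relevant if it was preempted inside the section.
    if in_critical_section(new, ics):
        if ics.lock == 0:                # one attempt, no spinning
            ics.lock = 1
            if new.seen_count != ics.exec_count:
                # A conflicting access completed in the interim: re-execute.
                new.seen_count = ics.exec_count
                new.retries += 1
                new.pc = ics.cs_start
        else:
            # Lock is held elsewhere: restart from the lock-acquire point.
            new.retries += 1
            new.pc = new.acquire_pc
```

Detecting a conflict by comparing the task's copy of the execution count with the global count is an assumption suggested by the structure fields listed earlier.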

Our discussion of how ICSM-R is implemented is of necessity quite brief. We refer the interested

reader to our other reports for a more detailed discussion [9, 7].










3 ICSM-R Performance Analysis


The ICSM-R protocol has the intuitively good property that a task does not hold a lock while it is idle
due to a context switch. However, a critical section might need to be executed many times due to
conflicts that occur while the task is swapped out. A performance study is needed to determine when
using ICSM-R has better performance than permitting a task to hold its locks while switched out. In
addition, a performance study is necessary to justify the effort of implementing ICSM-R.

We first study the performance of an implementation of the ICSM-R algorithm. We compared the

performance of ICSM-R with a spin-lock algorithm that does not release the lock during a context-switch

(LOCK-NR). As we are limited by the number of processors available for the implementation, and to better

understand the implications of various parameters, we constructed an analytical model. We validated the

analytical model by using discrete event simulation. We present the results of the experiments and the

analysis in the following sub-sections.

3.1 Experimental Performance Results

We implemented the ICSM-R algorithm by integrating it into the pSOS+ kernel, and ran an experiment.
A global counter is protected by a critical section, implemented using a spin-lock. On each processor, we
run four tasks. Among these four tasks, only one is the ICSM-R task that increments the shared counter,
and the rest are dummy tasks. All four tasks are started under the control of a low priority parent task.

We compared the performance of ICSM-R with a spin-lock algorithm that does not release the lock during
a context-switch (LOCK-NR). Each processor executes four tasks, where each task works for Tw amount of
time. One task among the four enters the critical section and stays in it for Tc time units. We varied the
number of processors from one to four. In our experiment, Tw and Tc are random variables uniformly
distributed within 20% of a selected mean. We set the mean of Tw to 20 milliseconds (ms). The time
quantum for processor sharing among the four tasks is 20 ms. We varied the critical section execution
time Tc to reflect various load conditions, with best-case lock utilizations of 1.25%, 6% and 10% per
processor.

The performance results are shown in Figure 1. Experimental results confirm that as the number of processors

increases, ICSM-R performs better than LOCK-NR. When the lock utilization is low, the performance

improvement is little as there is only a small probability that a context-switch occurs during a critical section.

As the lock utilization increases this probability is higher. Releasing the lock during a context-switch helps










tasks running on other processors to acquire the lock faster, thereby avoiding wasted cycles of spinning.

3.2 Analytical Performance of ICSM-R

The implementation results are encouraging, but incomplete. They do not provide a good explanation about

why ICSM-R is better than LOCK-NR, and do not permit extrapolations to new scenarios. To make up for

these deficiencies, we developed an analytical performance model of both ICSM-R and LOCK-NR.

For the analytical model, we consider N number of processors running M tasks each. Each task on a

processor works for Tw units of time. In addition, one out of M tasks on a processor requests the service

of a critical section that takes Tc time units to execute. Each processor is shared among M tasks using a

time quantum of Tq time units in a round-robin fashion. The cycle time of a task is composed of the work

time and the critical section time, if any, of the task. After completing an execution cycle, a task repeats

the cycle. We are interested in estimating the cycle time of the task using the critical section.

We use the following notation for the analysis:


N : Number of processors
M : Number of tasks per processor
Tw : Work time for each task
Tq : Time quantum for processor sharing
Tc : Critical section execution time
Rp : Critical section utilization per processor
R : Total critical section utilization
X : The total work time in a cycle
Z : The total time spent waiting for the critical section in a cycle
B : The CPU time spent holding the critical section
E[Cx in Cs] : Expected number of context-switches while in the critical section
CN : Cycle time for a task using LOCK-NR (the no-release algorithm, also labeled NICSM-R)
CI : Cycle time for a task using ICSM-R













Figure 1: ICSM-R Performance Results (Tw = 20 milliseconds). Three panels, for critical section times of
1, 5, and 10 milliseconds, each plotted against the number of processors (1 to 4), comparing ICSM-R with
the lock not released (LOCK-NR).











X Z B


Figure 2: Model of a cycle for a task using LOCK-NR


3.2.1 Analysis of algorithm LOCK-NR

In this model, the critical section is not released during a context-switch while the critical section is being

executed by a task.

A cycle of the task using the critical section on a processor is shown in Figure 2. A cycle consists of the

work time X, Z units of time waiting for the critical section and B units of time in the critical section.

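We have,

\[ C_N = X + Z + B \tag{1} \]

\[ X = M \, T_w \tag{2} \]

B depends on Tc, Tq and the number of context switches that are possible while holding the critical
section.

\[ E[\text{Cx in Cs}] = \frac{T_c}{T_q} \]

\[
\begin{aligned}
B &= T_c + E[\text{Cx in Cs}] \, (M - 1) \, T_q \\
  &= T_c + \frac{T_c}{T_q} \, (M - 1) \, T_q \\
  &= M \, T_c
\end{aligned}
\tag{3}
\]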
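
As a quick numeric check of equation 3, the following sketch evaluates the unsimplified form and can be compared against M * Tc. The function name is hypothetical, and the sample values (Tc = 10, Tq = 100, M = 4) are taken from the validation setup used later in the paper.

```python
def lock_hold_time_no_release(t_c, t_q, m):
    """Lock holding time B under LOCK-NR, before simplification to M * Tc."""
    expected_switches = t_c / t_q          # E[Cx in Cs]
    # Each context switch inside the section adds (M - 1) quanta of idle holding.
    return t_c + expected_switches * (m - 1) * t_q


# With Tc = 10, Tq = 100 and M = 4 the lock is held for about 40 time units,
# four times the critical section execution time.
b = lock_hold_time_no_release(10, 100, 4)
```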


Assuming that a request for the critical section is uniformly distributed in time, the probability of a
conflict in using the critical section is just the fraction of time the critical section is being used. Making
the approximation that the critical section can be modeled as an M/M/1 queue, the expected blocking
time Z is given by

\[ Z = \frac{R_r}{1 - R_r} \, B_r, \tag{4} \]

where $R_r$ is the utilization of the rest of the N - 1 tasks that use the critical section, given by

\[ R_r = \frac{N - 1}{N} \, R \]










X Z1 B1 Z2 B2 ... ZF BF


Figure 3: Model of a cycle for a task using ICSM-R


and $B_r$ is the residual life [10] of the lock holding time B for which the task under consideration is
blocked. Assuming that the lock holding time is uniformly distributed with a mean B and variance
$\sigma_B^2 = \sigma_{T_c}^2$, $B_r$ is given by

\[ B_r = \frac{B}{2} + \frac{\sigma_B^2}{2B} \]

As only one task uses the critical section on a processor, the utilization per processor is given by

\[ R_p = \frac{B}{X + Z + B} \]

Then, the critical section utilization is

\[
\begin{aligned}
R &= N \, R_p \\
  &= N \, \frac{B}{X + Z + B} \\
  &= N \, \frac{B}{X + \frac{R_r}{1 - R_r} B_r + B}
\end{aligned}
\tag{5}
\]

X and B can be computed using equations 2 and 3, respectively. We can compute R using equation 5

with iteration, setting the initial value of R to be zero. Knowing R and B, Z can be computed, and hence

the cycle time CN.
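
The iteration just described can be sketched as follows. This is a minimal sketch under stated assumptions rather than the authors' program: the function name, the damping factor, the clamp that keeps $R_r$ below 1, and the default of zero variance are all choices made here for illustration.

```python
def lock_nr_model(n, m, t_w, t_c, var_b=0.0, iters=500, damping=0.5):
    """Fixed-point evaluation of equations 1 to 5 for LOCK-NR."""
    x = m * t_w                        # equation 2
    b = m * t_c                        # equation 3
    b_r = b / 2 + var_b / (2 * b)      # residual life of the holding time
    r = 0.0                            # initial utilization, as in the text
    z = 0.0
    for _ in range(iters):
        # Utilization due to the other N - 1 tasks, kept below 1 (assumption).
        r_r = min((n - 1) / n * r, 1.0 - 1e-9)
        z = r_r / (1.0 - r_r) * b_r    # equation 4
        r_new = n * b / (x + z + b)    # equation 5
        r += damping * (r_new - r)     # damped update (assumption)
    return {
        "hold_time": b,
        "blocking_time": z,
        "utilization": r,
        "cycle_time": x + z + b,       # equation 1
    }
```

For example, `lock_nr_model(8, 4, 1000, 10)` gives a utilization of roughly 7.9%, in line with the corresponding LOCK-NR analysis entry in Table 2.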


3.2.2 Analysis of algorithm ICSM-R

In this model, the critical section is released during a context-switch while the critical section is being

executed by a task. The critical section is re-acquired and continued by the task during the next quantum.

This acquire/release cycle is continued till the critical section is completed.

A cycle of the task using the critical section on a processor is shown in Figure 3. In this case, after

the work period X, there is a possible blocking time Z1 spent waiting for the critical section to be free.

Then there is the partial critical section holding time B1. At this time the task may experience a context

switch and another blocking time represented by Z2. This acquire / release critical section is continued till












the critical section is completed. The task may be restarted at the beginning of the critical

section whenever the critical section is re-acquired, if another task committed in the critical section between

the time the critical section was last released and the time it is re-acquired.

We have,

\[ C_I = X + Z + B \tag{6} \]

\[ X = M \, T_w \tag{7} \]

where $Z = Z_1 + Z_2 + \ldots + Z_F$ and $B = B_1 + B_2 + \ldots + B_F$. B depends on Tc and the number of
times a task is restarted from the beginning of the critical section because of a commit by another task.

\[ B = T_c + T_r \, N_r \]

where $T_r$ is the partial execution time of the critical section before a task is restarted because of a
conflict, and $N_r$ is the number of times the task is restarted because of a conflict before it commits. The
expected value of $T_r$ is $T_c / 2$. Given the probability of restart of a task within the critical section as
Pr[CS Restart],

\[
N_r = \sum_{i > 0} i \, \Pr[\text{CS Restart}]^i
    = \frac{\Pr[\text{CS Restart}]}{(1 - \Pr[\text{CS Restart}])^2}
\]
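
The closed form for $N_r$ follows from the standard series identity sum over i > 0 of i p^i = p / (1 - p)^2 for 0 <= p < 1. A short numeric sketch, with hypothetical function names, compares a truncated sum against the closed form:

```python
def restarts_closed_form(p):
    """Closed form p / (1 - p)^2 used for N_r, valid for 0 <= p < 1."""
    return p / (1.0 - p) ** 2


def restarts_partial_sum(p, terms=500):
    """Truncated series: sum of i * p**i for i = 1 .. terms."""
    return sum(i * p ** i for i in range(1, terms + 1))


# For a restart probability of 0.5 both expressions evaluate to about 2.
difference = abs(restarts_closed_form(0.5) - restarts_partial_sum(0.5))
```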

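Then,

\[
B = T_c + \frac{T_c}{2} \cdot \frac{\Pr[\text{CS Restart}]}{(1 - \Pr[\text{CS Restart}])^2}
\tag{8}
\]

where

\[
\Pr[\text{CS Restart}] = \Pr[\text{Cx in Cs}] \cdot
\Pr[\text{A commit in the previous } (M - 1) T_q \text{ interval}]
\]

We can estimate the probability that some other task commits in the previous interval of $(M - 1) T_q$ by
modeling the commits as a Poisson process with an arrival rate

\[ \lambda = \frac{N - 1}{X + B + Z} \]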










Then,

\[
\begin{aligned}
\Pr[\text{A commit in the previous } (M - 1) T_q \text{ interval}]
  &= 1 - \Pr[\text{No commit in the previous } (M - 1) T_q \text{ interval}] \\
  &= 1 - e^{-\lambda (M - 1) T_q}
\end{aligned}
\]

\[
\Pr[\text{Cx in Cs}] =
\begin{cases}
T_c / T_q & \text{for } M > 1 \\
0 & \text{otherwise}
\end{cases}
\]

The blocking time Z is due to the critical section being busy, and to a context switch while in the
critical section.

\[
Z = \Pr[\text{Cx in Cs}] \, (M - 1) \, T_q
  + (1 + \Pr[\text{Cx in Cs}]) \, \frac{R_r}{1 - R_r} \, B_r
\tag{9}
\]

where $R_r$ is the utilization of the rest of the N - 1 tasks that use the critical section, given by

\[ R_r = \frac{N - 1}{N} \, R \]

and $B_r$ is the residual life of the lock holding time B for which the task under consideration is blocked.
Assuming that the lock holding time is uniformly distributed with a mean B and variance
$\sigma_B^2 = \sigma_{T_c}^2$, $B_r$ is given by

\[ B_r = \frac{B}{2} + \frac{\sigma_B^2}{2B} \]

As only one process uses the critical section on a processor, the utilization per processor is given by

\[ R_p = \frac{B}{X + Z + B} \]

Then, the critical section utilization is

\[ R = N \, R_p = N \, \frac{B}{X + Z + B} \tag{10} \]










X can be computed using equation 7. We can compute B and R with equations 8, 9 and 10 using

iteration, setting the initial values of B to be Tc and of R to be zero. Knowing R and B, Z can be computed,

and hence the cycle time CI.
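
The iteration over equations 8, 9 and 10 can be sketched in the same style. Again this is a hedged sketch, not the authors' code: the function name, the damped updates, the clamp on $R_r$, and the default of zero variance are assumptions made for illustration.

```python
import math


def icsm_r_model(n, m, t_w, t_c, t_q, var_b=0.0, iters=500, damping=0.5):
    """Fixed-point evaluation of equations 6 to 10 for ICSM-R."""
    x = m * t_w                                    # equation 7
    p_cx = t_c / t_q if m > 1 else 0.0             # Pr[Cx in Cs]
    b, r, z = float(t_c), 0.0, 0.0                 # initial values from the text
    for _ in range(iters):
        lam = (n - 1) / (x + b + z)                # Poisson commit rate
        p_commit = 1.0 - math.exp(-lam * (m - 1) * t_q)
        p_restart = p_cx * p_commit                # Pr[CS Restart]
        if p_restart >= 1.0:
            raise ValueError("restart probability must stay below 1")
        b_new = t_c + (t_c / 2.0) * p_restart / (1.0 - p_restart) ** 2  # eq. 8
        r_r = min((n - 1) / n * r, 1.0 - 1e-9)     # clamp is an assumption
        b_r = b_new / 2.0 + var_b / (2.0 * b_new)  # residual life
        z_new = (p_cx * (m - 1) * t_q
                 + (1.0 + p_cx) * r_r / (1.0 - r_r) * b_r)              # eq. 9
        r_new = n * b_new / (x + z_new + b_new)    # equation 10
        # Damped updates (assumption) to keep the iteration stable.
        b += damping * (b_new - b)
        z += damping * (z_new - z)
        r += damping * (r_new - r)
    return {
        "hold_time": b,
        "blocking_time": z,
        "utilization": r,
        "cycle_time": x + z + b,                   # equation 6
    }
```

For example, `icsm_r_model(8, 4, 1000, 10, 100)` gives a utilization of roughly 2%, in line with the corresponding ICSM-R analysis entry in Table 2.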


3.2.3 Validation of Analysis

We validated the analysis by simulation using SIMPACK [6], a discrete event simulation package. We set the

values of M = 4, Tw = 1000, and Tq = 100. The work time Tw is a random variable uniformly distributed

between 800 and 1200 with a mean of 1000. The critical section time is also a uniformly distributed random

variable with a range of 20% either way about the mean. In each experiment, we selected a different Tc

from 10 to 90 to represent a wide range of critical section utilizations. The results comparing the cycle

times obtained by simulation and the cycle times computed by analysis are given in Table 1. As can be seen,

except for a high Tc/Tq ratio, the results are reasonably accurate for the analysis to be meaningful. A similar

conclusion can be drawn from the lock utilization obtained from simulation and analysis, as presented in

Table 2.


3.2.4 Performance comparison using analysis

We evaluated the performance of the ICSM-R and LOCK-NR algorithms using the model developed in the

previous sub-sections.

The cycle times of the tasks using the critical section are shown in Figures 4 through 7, with Tc/Tq
ranging from 10% to 90%. The work time Tw is 1000 units, and M is 4. For a critical section to quantum
time ratio of up to 50%, the ICSM-R algorithm performs better than the LOCK-NR algorithm. Even for a
higher ratio, the ICSM-R algorithm performs better for a small number of processors. We observe that
there is a steep transition in the cycle times when the lock utilization reaches 100% for both algorithms.
As expected, the critical section utilization is lower for ICSM-R, as shown in Figures 8 through 11.
Because ICSM-R decreases the lock utilization, it improves the scalability of the application that uses it.

We analyzed the effect of multiprogramming by varying M from 1 to 8, with Tc = Tq/4. Except for the
case of no multiprogramming, ICSM-R always performs better, as shown in Figures 12 through 14.

We conclude that for a low to moderate critical section execution time to quantum time ratio, it is
advantageous to use the ICSM-R algorithm. In general, critical section execution times are very short
and only a fraction of the quantum times. Thus, the ICSM-R algorithm, which releases the lock during a
context-switch, is an attractive alternative to other lock-based algorithms.
























Table 1: Validating cycle time analysis using simulation for ICSM-R

Cycle Time
                  ICSM-R                                LOCK-NR
Tc   Processors   Simulation  Analysis   Abs.% Diff     Simulation  Analysis   Abs.% Diff
10   4            4044.30     4044.31    0.00           4044.50     4044.39    0.00
10   8            4040.04     4040.01    0.00           4040.64     4040.48    0.00
10   12           4049.11     4049.08    0.00           illegible   illegible  0.00
10   16           illegible   illegible  0.00           illegible   illegible  0.00
10   20           illegible   4038.11    0.01           4042.03     4041.72    0.02
10   24           4039.85     4038.93    0.02           4042.91     4041.62    0.03
10   28           4049.14     4048.36    0.02           illegible   illegible  0.01
10   32           illegible   4051.64    0.02           illegible   illegible  0.00
10   36           illegible   illegible  0.03           illegible   illegible  0.01
10   40           4046.91     4045.40    0.04           illegible   illegible  0.01
10   44           4039.62     4037.89    0.04           illegible   4049.92    0.02
10   48           4049.14     4046.98    0.05           illegible   4061.72    0.02
10   52           4048.51     4046.26    0.06           4064.73     illegible  0.02
10   56           4044.58     4041.87    0.07           4064.06     illegible  0.04
10   60           4046.44     4043.55    0.07           4068.55     4068.77    0.00
10   64           4049.33     4046.37    0.07           4076.04     illegible  0.00
50   4            4216.49     4219.43    0.07           4220.63     4220.72    0.00
50   8            4230.69     4239.95    0.22           4245.23     4249.52    0.10
50   12           4254.30     4278.36    0.57           4298.42     4314.26    0.37
50   16           4279.15     4318.48    0.92           4370.40     4428.70    1.33
90   4            4408.98     4425.15    0.34           4413.87     4422.75    0.20
90   8            4501.65     4559.00    1.27           4518.83     4584.24    1.45
90   12           illegible   illegible  1.16           illegible   5043.99    6.14
90   16           5078.08     5077.26    0.02           illegible   7134.15    23.69
























Table 2: Validating lock utilization analysis using simulation for ICSM-R

Lock Utilization (%)
ICSM-R LOCK-NR
Tc Processors Simulation Analysis Abs.% Diff Simulation Analysis Abs.% Diff
10 4 1.00 0.99 1.00 4.00 3.90 2.50
8 2.00 2.02 1.00 7.70 7.90 2.59
12 3.00 3.06 0.99 12.00 11.90 0.83
16 4.10 4.10 0.00 15.60 15.80 1.28
20 5.10 5.18 1.57 19.70 19.81 0.56
24 6.20 6.20 0.00 23.00 23.76 3.30
28 7.20 7.30 1.39 27.30 27.60 1.10
32 8.20 8.30 1.22 30.70 31.50 2.60
36 9.20 9.40 2.17 34.10 35.40 3.81
40 10.30 10.50 1.94 38.30 39.47 3.05
44 11.30 11.52 1.95 43.00 43.34 0.79
48 12.30 12.60 2.44 45.40 47.30 4.18
52 13.30 13.60 2.26 49.80 51.20 2.81
56 14.40 14.71 2.15 52.50 55.18 5.10
60 15.40 15.76 2.34 56.50 59.00 4.42
64 16.40 16.81 2.50 59.80 62.90 5.18
50 4 5.00 5.01 0.20 18.70 18.93 1.23
8 10.60 10.87 2.55 37.50 37.68 0.48
12 16.40 17.57 7.13 55.70 55.70 0.00
16 22.50 24.62 8.79 72.50 72.35 0.21
90 4 9.20 9.09 1.20 32.50 32.52 0.06
8 20.80 21.70 4.33 63.50 62.88 0.98
12 35.70 38.80 8.68 90.50 85.73 5.27
16 54.40 57.70 6.07 99.30 100.00 0.70










4 Conclusions


In a multiprogrammed shared memory multiprocessor, tasks can be needlessly blocked if another task holds
a lock and then is switched out of its CPU. We present the ICSM-R protocol, in which a task releases its
locks when it undergoes a context switch. The advantages of the ICSM-R protocol over other solutions to
this problem are that ICSM-R is more general and is easier to implement. The ICSM-R protocol avoids the
unnecessary blocking, but at the cost of possible re-executions of the critical section.

We evaluate the performance of the ICSM-R protocol in comparison to the no-release algorithm. We first
present experimental results from an implementation, and find that ICSM-R always has better performance.
Next, we develop analytical performance models of ICSM-R and of the no-release case. We validate the
analytical models with a simulation. Using the analytical models, we compare the performance of ICSM-R
and the no-release case. We find that if critical sections are short compared to the time quanta, then using
the ICSM-R protocol gives faster response times and allows a more scalable application than taking no
action on a context switch.


References


[1] J. Alemany and E. W. Felten, Performance Issues in Non-Blocking Synchronization on Shared-Memory Multiprocessors, Proceedings of the 11th Annual ACM Symposium on Principles of Distributed Computing, Vancouver, BC, Canada, 1992, pp. 125-134.

[2] T. E. Anderson, E. D. Lazowska, and H. M. Levy, The Performance Implications of Thread Management Alternatives for Shared-Memory Multiprocessors, IEEE Transactions on Computers, Vol. 38, No. 12, 1989, pp. 1631-1644.

[3] T. E. Anderson and H. M. Levy, Scheduler Activations: Effective Kernel Support for the User Level Management of Parallelism, ACM Transactions on Computer Systems, Vol. 9, No. 1, 1992, pp. 53-79.

[4] B. Bershad, Practical Considerations for Non-Blocking Concurrent Objects, IEEE 13th International Conference on Distributed Computing Systems, Pittsburgh, PA, USA, 1993, pp. 264-273.

[5] P. A. Bernstein, V. Hadzilacos, and N. Goodman, Concurrency Control and Recovery in Database Systems, Addison-Wesley Publishing Company, Reading, MA, USA, 1987.

[6] P. A. Fishwick, SIMPACK: Getting Started with Simulation Programming in C and C++, Technical Report Electronic TR92-022, University of Florida, 1992.

[7] K. Harathi, Synchronization Algorithms for Real Time Systems, Ph.D. Thesis, Dept. of CIS, University of Florida, 1995.

[8] M. Herlihy, A Methodology for Implementing Highly Concurrent Data Objects, ACM Transactions on Programming Languages and Systems, Vol. 15, No. 5, 1993, pp. 745-770.

[9] T. Johnson and K. Harathi, Interruptable Critical Sections, Technical Report TR94-007, Dept. of CIS, University of Florida, 1994. Available at ftp.cis.ufl.edu:cis/tech-reports/tr94/tr94.007.ps.Z.

[10] L. Kleinrock, Queueing Systems, Volume 1: Theory, John Wiley & Sons, New York, NY, USA, 1975.

[11] C. McCann, R. Vaswani, and J. Zahorjan, A Dynamic Processor Allocation Policy for Multiprogrammed Shared-Memory Multiprocessors, ACM Transactions on Computer Systems, Vol. 11, No. 2, 1993, pp. 146-178.

[12] Motorola Inc., pSOS+ RTEID-Compliant Real-Time Kernel User's Manual, Tempe, AZ, USA, 1990.

[13] Motorola Inc., VMEexec User's Guide, Second Edition, Tempe, AZ, USA, 1990.

[14] R. Rajkumar, L. Sha and J. P. Lehoczky, Real-Time Synchronization Protocols for Multiprocessors, IEEE Real-Time Systems Symposium, Huntsville, Alabama, 1988, pp. 259-269.

[15] LihChyun Shu, Michal Young, and Ragunathan Rajkumar, An Abort Ceiling Protocol for Controlling Priority Inversion, Proceedings of the First International Workshop on Real-Time Computing Systems and Applications, Seoul, Korea, 1994, pp. 202-206.

[16] H. Takada and K. Sakamura, Predictable Spin Lock Algorithms with Preemption, Proceedings of the 11th IEEE Workshop on Real-Time Operating Systems and Software, Los Alamitos, CA, USA, 1994, pp. 2-6.

[17] H. Takada and K. Sakamura, Real-Time Synchronization Protocols with Abortable Critical Sections, Proceedings of the First International Workshop on Real-Time Computing Systems and Applications, Seoul, Korea, 1994, pp. 48-52.











Figures 4 through 14 each plot the named quantity against the number of processors (0 to 100) for
ICSM-R and LOCK-NR, with Tq = 100.

Figure 4: ICSM-R Cycle Times using analysis for Tc = 10

Figure 5: ICSM-R Cycle Times using analysis for Tc = 50

Figure 6: ICSM-R Cycle Times using analysis for Tc = 75

Figure 7: ICSM-R Cycle Times using analysis for Tc = 90

Figure 8: ICSM-R critical section utilization using analysis for Tc = 10

Figure 9: ICSM-R critical section utilization using analysis for Tc = 50

Figure 10: ICSM-R critical section utilization using analysis for Tc = 75

Figure 11: ICSM-R critical section utilization using analysis for Tc = 90

Figure 12: ICSM-R effect of multiprogramming on cycle time for Tc = 25 (M = 1)

Figure 13: ICSM-R effect of multiprogramming on cycle time for Tc = 25 (M = 4)

Figure 14: ICSM-R effect of multiprogramming on cycle time for Tc = 25 (M = 8)



