WO2002088938A1 - Apparatus and methods for proportional share scheduling - Google Patents

Apparatus and methods for proportional share scheduling

Info

Publication number
WO2002088938A1
Authority
WO
WIPO (PCT)
Prior art keywords
client
time
clients
queue
scheduler
Application number
PCT/US2002/014020
Other languages
French (fr)
Inventor
Jason Nieh
Christopher Vaill
Original Assignee
The Trustees Of Columbia University In The City Of New York
Application filed by The Trustees Of Columbia University In The City Of New York filed Critical The Trustees Of Columbia University In The City Of New York
Publication of WO2002088938A1 publication Critical patent/WO2002088938A1/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F 9/4806 Task transfer initiation or dispatching
    • G06F 9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F 9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues

Abstract

A proportional share scheduling apparatus and technique for scheduling resources among a plurality of clients, each of which has a proportional resource allocation of the total resources of the CPU. The clients are sorted in a run queue from the client having the largest proportional share allocation to the client having the smallest proportional share allocation (112). Starting from the beginning of the run queue, each client is run for a constant time quantum (130). If a client in the run queue has received more than its proportional resource allocation, the remaining clients in the run queue are skipped, and the clients are run from the beginning of the run queue (120). This process repeats until all clients have received service. Since the clients with the largest proportional share allocations are placed at the beginning of the run queue, they are allowed to receive more service than the clients having smaller proportional resource allocations positioned at the end of the run queue.

Description

APPARATUS AND METHODS FOR PROPORTIONAL SHARE
SCHEDULING
SPECIFICATION
BACKGROUND OF THE INVENTION

This invention relates to process management in computer operating systems, and more particularly to proportional share scheduling.
Proportional share resource management provides a flexible and useful abstraction for multiplexing scarce resources among users and applications. According to such a management scheme, each client, as is known in the art, has an associated weight, and resources are allocated to the clients in proportion to their respective weights. Previously developed proportional sharing mechanisms can be classified into four categories: (a) those that are fast but have weaker proportional fairness guarantees; (b) those that map well to existing scheduler frameworks in current commercial operating systems but have no well-defined proportional fairness guarantees; (c) those that have strong proportional fairness guarantees but higher scheduling overhead; and (d) those that have weaker proportional fairness guarantees and higher scheduling overhead. The four categories described above correspond to round-robin, fair-share, fair queueing, and lottery mechanisms, respectively. Proportional share scheduling, as used herein, has the following meaning: given a set of clients with associated weights, a proportional share scheduler should allocate resources to each client in proportion to its respective weight, i.e., its "proportional resource allocation." Thus the terms "share," "weight," and "proportional resource allocation" are used interchangeably. The process of scheduling a time-multiplexed resource among a set of clients is modeled in two steps: 1) the scheduler orders the clients in a queue; 2) the scheduler runs the first client in the queue for its time quantum, which is defined herein as the maximum time interval the client is allowed to run before another scheduling decision is made. The time quantum is typically expressed in time units of constant size determined by the hardware. As a result, the units of time quanta are referred to herein as "time units" (tu) rather than an absolute time measure, such as seconds.
Based on the above scheduler model, a scheduler can achieve proportional sharing in one of two ways. One way is to adjust the frequency with which a client is selected to run by adjusting the position of the client in the queue so that it ends up at the front of the queue more or less often. Another way is to adjust the size of the time quantum of a client so that it runs longer for a given allocation. The manner in which a scheduler determines how often a client runs and how long a client runs directly affects the accuracy and scheduling overhead of the scheduler. A proportional share scheduler is more accurate if it allocates resources in a manner that is more proportionally "fair". Perfect fairness is defined as an ideal state in which each client has received service exactly proportional to its share. The proportional share of client A is denoted as S_A, and the amount of service received by client A during the time interval (t1, t2) as W_A(t1, t2). Formally, a proportional sharing algorithm achieves perfect fairness for time interval (t1, t2) if, for any client A:

$$W_A(t_1, t_2) = (t_2 - t_1)\,\frac{S_A}{\sum_i S_i} \qquad (1)$$
In an ideal system in which all clients could consume their resource allocations simultaneously, an ideal proportional share scheduler could maintain the above relationship for all time intervals. However, in scheduling a time-multiplexed resource in time units of finite size, it is not possible for a scheduler to be perfectly proportionally fair as defined by Equation (1) for all intervals.
A quantitative measure may be calculated to evaluate how close an algorithm of a proportional sharing mechanism gets to perfect fairness. A variation of Equation (1) may be used to define the service time error E_A(t1, t2) for client A over interval (t1, t2). The service time error is the difference between: (a) the amount of time allocated to the client during interval (t1, t2) under the given algorithm, and (b) the amount of time that would have been allocated under an ideal scheme that maintains perfect fairness for all clients over all intervals. Service time error is computed as:

$$E_A(t_1, t_2) = W_A(t_1, t_2) - (t_2 - t_1)\,\frac{S_A}{\sum_i S_i} \qquad (2)$$
A positive service time error indicates that a client has received more than its ideal share over an interval; a negative error indicates that a client has received less. To be precise, the error E_A measures how much time client A has received beyond its ideal allocation.
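For instance, with three clients A, B, and C having shares 3, 2, and 1 (the running example used in the discussion of particular algorithms below), a schedule that gives client A the first three time units yields, by Equation (2):

$$E_A(0,3) = 3 - 3\cdot\frac{3}{6} = +1.5\ \text{tu}, \qquad E_B(0,3) = 0 - 3\cdot\frac{2}{6} = -1\ \text{tu}$$

so A is 1.5 tu ahead of its ideal allocation while B is a full time unit behind.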
The goal of a proportional share scheduler should be to minimize the allocation error between clients. The effectiveness of different classes of proportional share algorithms in minimizing this allocation error is considered herein.
A first type of proportional share algorithm, "round-robin," is one of the oldest, simplest and most widely used proportional share scheduling algorithms. Clients are placed in a queue and allowed to execute in turn. When all client shares are equal, each client is assigned the same size time quantum. In the weighted round-robin case, each client is assigned a time quantum equal to its share. A client with a larger share, then, effectively gets a larger quantum than a client with a small share. Weighted round-robin (WRR) provides proportional sharing by running all clients with the same frequency but adjusting the size of their time quanta. A more recent variant called deficit round-robin (as discussed in Shreedhar, M. and Varghese, G., "Efficient fair queueing using deficit round-robin," In Proceedings of ACM SIGCOMM '95, Volume 4(3) (Sept. 1995), pp. 231-242, which is incorporated by reference in its entirety herein) has been developed for network packet scheduling with similar behavior to a weighted round-robin CPU scheduler.
WRR is simple to implement and schedules clients in O(1) time. However, it has a relatively weak proportional fairness guarantee, as its service time error can be quite large. Consider an example in which 3 clients A, B, and C have shares 3, 2, and 1, respectively. WRR will execute these clients in the following order of time units: A, A, A, B, B, C. The error in this example gets as low as -1 tu and as high as +1.5 tu. If the shares in the previous example are changed to 3000, 2000, and 1000, the error increases significantly, ranging from -1000 to +1500 tu. A large error range is a major drawback of round-robin scheduling, in which each client gets all service due to it all at once, while other clients get no service. After a client has received all its service, it is well ahead of its ideal allocation (it has a high positive error), and all other clients are behind their allocations (they have low negative errors).
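The error extremes quoted above can be reproduced directly from Equation (2). The following is a minimal, self-contained sketch (not the simulator described later in this document) that replays the six-quantum WRR schedule and prints the service time error of each client after every quantum; the peak values are +1.5 tu for A and -1 tu for B, both at t = 3:

```c
/* Replay the WRR schedule A,A,A,B,B,C for shares 3, 2, 1 and print
 * each client's service time error E_c(0, t), per Equation (2).
 * Illustrative sketch only. */
#include <stdio.h>

int main(void) {
    const char *schedule = "AAABBC";   /* one quantum per letter */
    double share[3] = {3, 2, 1};       /* S_A, S_B, S_C */
    double total = 6, got[3] = {0};

    for (int t = 0; schedule[t] != '\0'; t++) {
        got[schedule[t] - 'A'] += 1.0; /* selected client runs 1 tu */
        for (int c = 0; c < 3; c++) {
            double ideal = (t + 1) * share[c] / total;
            printf("t=%d  E_%c = %+5.2f tu\n", t + 1, 'A' + c,
                   got[c] - ideal);
        }
    }
    return 0;
}
```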
A second type of proportional share algorithm, the "fair-share scheduler," was developed, in part, to respond to a need to provide proportional sharing among groups of users in a manner compatible with a UNIX-style time-sharing framework. In UNIX time-sharing, scheduling is done based on multi-level feedback, with a set of priority queues. Each client has a priority which is adjusted as it executes, and the scheduler executes the client with the highest priority. Fair-share provides proportional sharing among users by adjusting the priorities of a user's clients in a suitable way. Fair-share thus provides proportional sharing by effectively running clients at different frequencies, as opposed to WRR, which only adjusts the size of the clients' time quanta. Fair-share schedulers were compatible with UNIX scheduling frameworks and relatively easy to deploy in existing UNIX environments. Empirical measurements show that these approaches only provide reasonable proportional fairness over relatively large time intervals (as discussed in Essick, R., "An event-based fair share scheduler," In Proceedings of the Winter 1990 USENIX Conference (Berkeley, CA, USA, Jan. 1990), pp. 147-162, USENIX, which is incorporated by reference in its entirety herein). The allocation errors in these approaches can be very large. The priority adjustments done by fair-share schedulers can generally be computed quickly, in O(1) time. In some cases, the schedulers need to do an expensive periodic re-adjustment of all client priorities, which requires O(N) time, where N is the number of clients.
A third type of proportional share algorithm is "fair queueing," which was first proposed for network packet scheduling as Weighted Fair Queueing (WFQ) (as discussed in Demers, A., Keshav, S., and Shenker, S., "Analysis and Simulation of a Fair Queueing Algorithm," In Proceedings of ACM SIGCOMM '89 (Austin, TX, Sept. 1989), pp. 1-12, which is incorporated by reference in its entirety herein). WFQ introduced the idea of a virtual finishing time (VFT) to do proportional sharing scheduling, which builds on the concept of "virtual time." The virtual time of a client is a measure of the degree to which a client has received its proportional allocation relative to other clients. When a client executes, its virtual time advances at a rate inversely proportional to the client's share. In other words, the virtual time of a client A at time t is the ratio of W_A(t) to S_A:
$$VT_A(t) = \frac{W_A(t)}{S_A} \qquad (3)$$
Given a client's virtual time, the client's virtual finishing time (VFT) is defined as the virtual time the client would have after executing for one time quantum. WFQ then schedules clients by selecting the client with the smallest VFT. This is implemented by keeping an ordered queue of clients sorted from smallest to largest VFT, and then selecting the first client in the queue. After a client executes, its VFT is updated and the client is inserted back into the queue. Its position in the queue is determined by its updated VFT. Fair queueing provides proportional sharing by running clients at different frequencies, adjusting the position at which each client is inserted back into the queue; the same size time quantum is used for all clients.
To illustrate how this works, consider again the example in which 3 clients A, B, and C have shares 3, 2, and 1, respectively. Their initial VFTs are then 1/3, 1/2, and 1, respectively. WFQ would then execute the clients in the following order of time units: A, B, A, B, C, A. In contrast to WRR, WFQ's service time error ranges from -5/6 to +1 tu in this example, which is less than the allocation error of -1 to +1.5 tu for WRR. The difference between WFQ and WRR is greatly exaggerated if larger share values are chosen: if the shares are 3000, 2000, and 1000 instead of 3, 2, and 1, WFQ has the same service time error range while WRR's error range balloons to -1000 to +1500 tu.
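A minimal sketch of the WFQ selection rule described above follows; VFTs are kept as integers scaled by the share total so the arithmetic is exact. Note that after three quanta all three VFTs tie, and WFQ permits ties to be broken arbitrarily, so a given implementation may produce a different (equally valid) order than the one listed in the text:

```c
/* WFQ sketch for clients A, B, C with shares 3, 2, 1: run the client
 * with the smallest virtual finishing time (VFT), then advance that
 * VFT by Q/share.  VFTs are scaled by 6 (the share total) to stay
 * integral; this scan breaks ties in favor of the earliest client. */
#include <stdio.h>

int main(void) {
    int share[3] = {3, 2, 1};
    int vft[3]   = {2, 3, 6};  /* initial VFTs 1/3, 1/2, 1, scaled by 6 */
    int inc[3]   = {2, 3, 6};  /* Q/share, scaled by 6 (Q = 1 tu)       */

    for (int t = 0; t < 6; t++) {
        int min = 0;
        for (int c = 1; c < 3; c++)   /* O(N) scan of the queue */
            if (vft[c] < vft[min]) min = c;
        printf("%c ", 'A' + min);
        vft[min] += inc[min];         /* reinsert with updated VFT */
    }
    printf("\n");  /* prints A B A A B C, one valid WFQ order */
    return 0;
}
```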
It has been shown that WFQ guarantees that the service time error for any client never falls below -1, which means that a client can never fall behind its ideal allocation by more than a single time quantum. Fair queueing thus provides stronger proportional fairness guarantees than round-robin or fair-share scheduling. Unfortunately, fair queueing is more difficult to implement, and the time it takes to select a client to execute is O(N) for most implementations, where N is the number of clients. With more complex data structures, it is possible to implement fair queueing such that selection of a client requires O(log N) time. However, the added difficulty of managing complex data structures in kernel space causes most implementers of fair queueing to choose the more straightforward O(N) implementation.

A fourth class of proportional share schedulers, "lottery scheduling," operates such that each client is given a number of tickets proportional to its share. A ticket is then randomly selected by the scheduler, and the client that owns the selected ticket is scheduled to run for a time quantum. Like fair queueing, lottery scheduling provides proportional sharing by running clients at different frequencies; the same size time quantum is typically used for all clients.
Lottery scheduling is somewhat simpler to implement than fair queueing, but has the same high scheduling overhead: O(N) for most implementations, or O(log N) with more complex data structures. However, because lottery scheduling relies on the law of large numbers for providing proportional fairness, its accuracy is much worse than WFQ's, and is also worse than WRR's for smaller share values.
Accordingly, there is a need in the art for a share scheduler having strong proportional fairness guarantees and low scheduling overhead, for multiplexing time-shared resources among a set of clients.
SUMMARY OF THE INVENTION
An object of the present invention is to provide a CPU scheduler which is simple to implement and easy to integrate into existing commercial operating systems. Another object of the present invention is to provide a CPU scheduler which combines the benefits of accurate proportional share resource management with very low overhead. A further object of the present invention is to provide a CPU scheduler which has constant scheduling overhead, even for large numbers of clients.
These and other objects of the invention, which will become apparent with reference to the disclosure herein, are accomplished by a method of proportional share scheduling for scheduling total resource allocation among a plurality of clients. Each one of the clients has a state with a value indicative of a proportional resource allocation of the total resources. The method comprises storing each one of the clients in a run queue in a sorted order beginning with a first client having a largest proportional resource allocation and followed by subsequent clients arranged in order of diminishing proportional resource allocations. A subsequent step is sequentially executing, in the sorted order beginning with the first client having the largest resource allocation, each of the clients in the run queue for a constant time quantum. The technique further includes determining a client virtual time as a measure of a degree to which a next subsequent client in the run queue has received service relative to its respective proportional allocation, and determining a queue virtual time as a measure of a degree to which all of the clients in the run queue have received service relative to the total resource allocation at the next subsequent time quantum. The client virtual time is compared with the queue virtual time. If the client virtual time is less than or equal to the queue virtual time, then the next subsequent client is executed for the constant time quantum. If the client virtual time is greater than the queue virtual time, then the scheduler returns to the beginning of the run queue to execute the first client for the constant time quantum.
Advantageously, the state of each one of the clients further comprises a time counter, the respective value of which is set to the number of time quanta representative of the client's respective proportional resource allocation. The technique may include decrementing the value of the time counter after the client is executed for the constant time quantum. The client virtual time is incremented by an amount comprising a ratio of the constant time quantum to the respective client's proportional resource allocation. The queue virtual time is incremented by an amount comprising a ratio of the constant time quantum to the total resource allocation of all of the clients in the run queue.
Preferably, the state of each one of the clients further comprises an indication of whether the client is runnable. Each of the clients is inserted into the run queue if the client is runnable, and the client is removed from the run queue if the client is not runnable. The state of each one of the clients further comprises a client identification, and the technique further includes recording a reference to the client identification of a next previous client in the run queue and a next subsequent client in the run queue when the client is removed from the run queue. The client virtual time of a client that was previously not runnable is updated when the client subsequently becomes runnable. The technique further includes updating the client virtual time by replacing the previous client virtual time with the greater of the previous client virtual time and the queue virtual time.
In accordance with the invention, an object of providing a proportional share scheduler having high proportional fairness guarantees and low scheduling overhead has been met. Further features of the invention, its nature and various advantages will be more apparent from the accompanying drawings and the following detailed description of illustrative embodiments.
SUMMARY OF THE DRAWINGS
FIG. 1 is a simplified figure of the share scheduler in accordance with the invention.
FIG. 2 is a simplified drawing of the technique, in accordance with the invention.
FIG. 3 is a simplified drawing of a portion of the technique illustrated in FIG. 2, in accordance with the invention.
FIG. 4 is a plot showing the error ranges for a prior art scheduler.
FIG. 5 is a plot showing the error ranges for the novel scheduler, in accordance with the invention.
FIG. 6 is a plot showing the error ranges for the prior art scheduler.
FIG. 7 is a plot showing the error ranges for the novel scheduler, in accordance with the invention.
FIG. 8 is a plot showing the average execution time required by prior art schedulers and the novel scheduler in order to select a client to execute.
FIG. 9 is a time plot showing the scheduling sequences on a prior art scheduler over a time interval.
FIG. 10 is a time plot showing the scheduling sequences on another prior art scheduler over a time interval.
FIG. 11 is a time plot showing the scheduling sequences on the novel scheduler over a time interval, in accordance with the invention.
FIG. 12 is a time plot showing the number of MPEG frames encoded over time for a prior art scheduler.
FIG. 13 is a time plot showing the number of MPEG frames encoded over time for another prior art scheduler.
FIG. 14 is a time plot showing the number of MPEG frames encoded over time for the novel scheduler, in accordance with the invention.
DETAILED DESCRIPTION OF THE EXEMPLARY EMBODIMENTS

In accordance with the invention, a novel scheduler, also referred to as "Virtual-Time Round Robin" (VTRR), is disclosed. As illustrated in FIG. 1, the scheduler 10 and the clients 20 each have several values associated with their respective scheduler and execution states. Two clients are illustrated, i.e., Client A and Client B, but the invention may be used with any number of clients. Each client 20 has five values associated with its execution state: (a) share 22, (b) virtual finishing time 24, (c) time counter 26, (d) id number 28, and (e) run state 30. A client's share 22 defines its resource rights. Each client receives a resource allocation that is directly proportional to its share. Thus the terms "share" and "proportional resource allocation" are used interchangeably herein. A client's virtual finishing time (VFT) 24 is defined as above, e.g., the virtual time of equation (3) that the client would have after executing for one time quantum. Since a client has a VFT 24, it also has an implicit virtual time. A client's VFT 24 advances at a rate proportional to its resource consumption divided by its share. As will be described below, the VFT 24 is used to decide when the scheduler should reset to the first client in the run queue. The time counter 26 tracks the number of quanta the client must receive before the period is over. Thus, a client's time counter 26 ensures that the pattern of allocations is periodic, and that perfect fairness is achieved at the end of each period. A client's id number 28 is a unique client identifier that is assigned when the client is created. A client's run state 30 is an indication of whether or not the client can be executed. A client is runnable if it can be executed, and not runnable if it cannot. For example, for a CPU scheduler, a client would not be runnable if it is blocked waiting for I/O and cannot execute.
In accordance with the invention, the scheduler 10 maintains the following values associated with the scheduler state: (a) time quantum 12, (b) run queue 14, (c) total shares 16, and (d) queue virtual time 18. As discussed above, the time quantum 12 is the duration of a standard time slice assigned to a client to execute. The time quantum 12 is maintained as a constant value for all clients, as will be described below. The run queue 14 is a sorted queue of all runnable clients, ordered from the client having the largest share to the client having the smallest share. Ties can be broken either arbitrarily or using the client id numbers 28, which are unique. The total shares 16 is the sum of the shares of all runnable clients. The queue virtual time (QVT) 18 is a measure of what a client's VFT 24 should be if it has received exactly its proportional share allocation.
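The client and scheduler state enumerated above maps naturally onto two structures. The following sketch is illustrative only; the field names are not taken from the patent, and a kernel implementation would embed this state in its own task structures:

```c
/* Sketch of the VTRR state described above.  Field names are
 * illustrative; virtual times are kept as doubles for brevity. */
enum run_state { NOT_RUNNABLE, RUNNABLE };

struct vtrr_client {
    int    share;              /* proportional resource allocation (22) */
    double vft;                /* virtual finishing time (24)           */
    int    counter;            /* quanta left in this cycle (26)        */
    int    id;                 /* unique client identifier (28)         */
    enum run_state state;      /* run state (30)                        */
    struct vtrr_client *next, *prev;            /* run queue links      */
    struct vtrr_client *last_next, *last_prev;  /* rejoin hints         */
};

struct vtrr_sched {
    int    quantum;            /* constant time quantum Q (12)          */
    struct vtrr_client *queue; /* sorted queue, largest share first (14)*/
    int    total_shares;       /* sum of runnable clients' shares (16)  */
    double qvt;                /* queue virtual time (18)               */
};
```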
The QVT 18 advances whenever any client 20 executes, at a rate inversely proportional to the total shares. If the system time quantum 12 is denoted as Q and the share of client 20 i as S_i, then the QVT 18 is updated as follows:

$$QVT(t + Q) = QVT(t) + \frac{Q}{\sum_i S_i} \qquad (4)$$
The difference between the QVT 18 and a client's virtual time is a measure of whether the respective client 20 has consumed its proportional allocation of resources. If a client's virtual time is equal to the queue virtual time, it is considered to have received its proportional allocation of resources. An earlier virtual time indicates that the client 20 has used less than its proportional share. Similarly, a later virtual time indicates that it has used more than its proportional share. Since the QVT 18 advances at the same rate for all clients 20 on the run queue, the relative magnitudes of the virtual times provide a relative measure of the degree to which each client 20 has received its proportional share of resources.
First, the role of the time counters 26 is explained. In relation to this, a scheduling cycle is defined as a sequence of allocations whose length is equal to the sum of all client shares. For example, for a queue of three clients with shares 3, 2, and 1, a scheduling cycle is a sequence of 6 allocations. The time counter 26 for each client 20 is reset at the beginning of each scheduling cycle to the client's share value, and is decremented every time a client 20 receives a time quantum 12. The novel VTRR scheduler uses the time counters 26 to ensure that perfect fairness is attained at the end of every scheduling cycle. At the end of the cycle, every time counter 26 is zero, meaning that for each client A, the number of quanta received during the cycle is exactly S_A, the client's share value. Clearly, then, each client 20 has received service proportional to its share. In order to guarantee that all counters 26 are zero at the end of the cycle, an invariant is enforced on the queue, called the time counter invariant: it is required that, for any two consecutive clients A and B in the queue, the counter value for B must always be no greater than the counter value for A.
The technique for scheduling clients 20 is illustrated in FIG. 2. An early stage is to insert clients 20 into the run queue in sorted order according to each client's share (step 112). Each time counter is set to a value representative of the client's share value (step 114), as will be described in greater detail below. These steps may be performed consecutively or concurrently. For purposes of this discussion, each of the clients 20 is considered runnable. The initial VFT 24 is also set, as will be described in greater detail below.
Merely for purposes of explanation, the client 20 selected for execution is referred to as the current client and the subsequent client is referred to as the next client (step 120). The current client 20 is executed for one time quantum 12 (step 130). Once the current client 20 has completed execution for one time quantum 12, its time counter 26 is decremented by one (step 132) and its VFT 24 is incremented by the time quantum 12 divided by its share (step 134). If the system time quantum 12 is denoted as Q, the current client as client A, the current client's share as S_A, and the current client's VFT 24 as VFT_A(t), then VFT_A(t) is updated as follows:

$$VFT_A(t + Q) = VFT_A(t) + \frac{Q}{S_A} \qquad (5)$$
If the scheduling cycle is not completed (step 136), the scheduler considers the next client in the run queue. First, the scheduler checks for violation of the time counter invariant: if the time counter value of the next client is greater than the time counter value of the current client (step 140), the scheduler makes the next client the current client (step 150) and executes it for a quantum (step 130). This causes its time counter 26 to be decremented (step 132), preserving the time counter invariant. If the next client's counter 26 is not greater than the current client's counter 26, as determined at step 140, the time counter invariant cannot be violated whether the next client is run or not, so the scheduler makes a decision using virtual time: the scheduler compares the VFT 24 of the next client with the QVT 18 the system would have after the next time quantum 12 a client executes. This comparison is referred to as the VFT inequality (step 142). If we denote the system time quantum 12 as Q, the next client's VFT 24 as VFT_A(t), and its share as S_A, the VFT inequality is true if:

$$VFT_A(t) - QVT(t + Q) < \frac{Q}{S_A} \qquad (6)$$
If the VFT inequality is true, the scheduler selects the next client (step 150) and executes the next client in the run queue for one time quantum 12 (step 130) and the process repeats with the subsequent clients in the run queue. If the scheduler reaches a point in the run queue when the VFT inequality is not true, the scheduler returns to the beginning of the run queue and selects the first client to execute (step 120). At the end of the scheduling cycle, when the time counters 26 of all clients 20 reach zero (step 136), the time counters 26 are all reset to their initial values corresponding to the respective client's share, and the scheduler starts from the beginning of the run queue again to select a client to execute (as indicated by arrow A). Note that throughout this scheduling process, the ordering of clients on the run queue does not change.
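The decision sequence of FIG. 2 can be summarized in code. The following sketch makes one VTRR scheduling decision after a quantum completes; cycle_over() and reset_counters() are assumed helpers (testing whether all counters have reached zero, and restoring them to the share values), and the structures are those sketched earlier:

```c
/* One VTRR scheduling decision (steps 130-150 of FIG. 2).  'cur' is
 * the client that just ran; returns the client to run next. */
struct vtrr_client *vtrr_next(struct vtrr_sched *s,
                              struct vtrr_client *cur)
{
    double Q = s->quantum;

    cur->counter -= 1;                    /* step 132            */
    cur->vft += Q / cur->share;           /* step 134, eq. (5)   */
    s->qvt   += Q / s->total_shares;      /* eq. (4)             */

    if (cycle_over(s)) {                  /* step 136: counters 0 */
        reset_counters(s);                /* back to share values */
        return s->queue;                  /* restart at front     */
    }

    struct vtrr_client *next = cur->next;
    if (next != NULL) {
        /* step 140: time counter invariant check */
        if (next->counter > cur->counter)
            return next;
        /* step 142: VFT inequality, eq. (6), with QVT(t + Q) taken
         * one quantum past the current queue virtual time */
        if (next->vft - (s->qvt + Q / s->total_shares) < Q / next->share)
            return next;
    }
    return s->queue;                      /* reset to front (step 120) */
}
```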
For the example in which three clients A, B, and C have shares 3, 2, and 1, respectively, their initial VFTs 24 are then 1/3, 1/2, and 1, respectively. The novel scheduler would then execute the clients in the following repeating order of time units: A, B, C, A, B, A. In contrast to WRR and WFQ, described above, the scheduler has a maximum allocation error between A and B of 1/3 tu in this example. This allocation error is much improved over WRR, and comparable to WFQ. Since the novel VTRR scheduler simply selects each client in turn to execute according to the sorted order in the run queue, selecting a client for execution can be done in O(1) time.
In the description above, the technique only discussed clients that were runnable, and did not address dynamic considerations. The novel VTRR scheduler distinguishes between clients whose run states 30 are runnable and not runnable. As discussed above, clients 20 whose run states are "runnable" can be selected for execution by the scheduler, while clients 20 whose run states 30 are "not runnable" are not selected. Only runnable clients are placed in the run queue. A client must be created before it can become runnable, and a client becomes not runnable before it is terminated. As a result, client creation and termination have no effect on the run queue.
With continued reference to FIG. 2, when a client becomes runnable, it is inserted into the run queue (indicated by arrow B) so that the run queue remains sorted from largest to smallest share client. Ties can be broken either arbitrarily or using the unique client id numbers 28. The new client's initial VFT 24 is determined (step 114) in the following manner. When a client is created and becomes runnable, it has not yet consumed any resources, so it is neither below nor above its proportional share in terms of resource consumption. As a result, we set the client's implicit virtual time to be the same as the QVT 18. We can then calculate the VFT 24 of a new client A with share S_A as:

$$VFT_A(t) = QVT(t) + \frac{Q}{S_A} \qquad (7)$$
After a client is executed (step 130), it may become not runnable. If the client is the current client and becomes not runnable, it is preempted and another client is selected by the scheduler as described above. As indicated in FIG. 3, the client that is not runnable (step 210) is removed from the run queue (step 212). While the client is not runnable, its VFT 24 is not updated. When the client is removed from the run queue, it records the client id 28 of the client 20 that was before it on the run queue, the last-previous client, and the client that was after it on the run queue, the last-next client (step 212).
When a client that is not runnable becomes runnable again, the novel VTRR scheduler inserts the now runnable client back into the run queue (arrow B of FIGS. 2-3). If the client's references to its last-previous or last-next client are still valid, it can use those references to determine its position in the run queue in constant time. If the last-previous and last-next references are not valid, the scheduler then simply traverses the run queue to find the insertion point for the now runnable client. For clients, such as highly interactive tasks, that frequently leave and rejoin the run queue in the same place, the last-previous and last-next references can often eliminate this sorted insert and its associated cost.
Determining whether the last-previous and last-next references are valid is performed in the exemplary embodiment as follows. The last-previous client reference is valid if such last-previous client has not exited and is runnable, and if the share of the newly-runnable client is no more than that of the last-previous client and no less than that of the client following it in the run queue. If the last-previous client reference is no longer valid, the last-next reference is checked in the same manner. If either client has exited and been deallocated, last-previous and last-next may no longer refer to valid memory regions. To deal with this, a hash table may be kept that stores identifiers of valid clients. Hash function collisions can be resolved by simple replacement, so the table can be implemented as an array of identifiers. A client's identifier is put into the table when it is created, and deleted when the client exits. The last-previous and last-next pointers are not dereferenced, then, unless the identifiers of the last-previous and last-next clients exist in the hash table. As will be discussed below, the use of a hash table was not necessary in the exemplary embodiment.
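A sketch of such a validity table follows. The array size, the modulo hash, and the use of 0 as an "empty" marker are illustrative assumptions; any scheme in which a stale identifier simply fails the lookup would do:

```c
/* Valid-client table as described above: an array of identifiers in
 * which collisions are resolved by simple replacement.  A stale
 * last-previous/last-next pointer is only dereferenced if its
 * client's id is still present here. */
#define VALID_TABLE_SIZE 1024            /* illustrative size     */
static int valid_ids[VALID_TABLE_SIZE]; /* 0 marks an empty slot */

static void id_insert(int id) {          /* on client creation    */
    valid_ids[id % VALID_TABLE_SIZE] = id;
}
static void id_remove(int id) {          /* on client exit        */
    if (valid_ids[id % VALID_TABLE_SIZE] == id)
        valid_ids[id % VALID_TABLE_SIZE] = 0;
}
static int id_is_valid(int id) {
    return valid_ids[id % VALID_TABLE_SIZE] == id;
}
```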
Once the now runnable client has been inserted in the run queue, the client's VFT 24 must be updated (step 114 of FIG. 2). The update is analogous to the VFT initialization used when a new client becomes runnable. The difference is that the algorithm also accounts for the client's original VFT 24 in updating the VFT 24. If the original VFT 24 of a client A is denoted as VFT_A(t'), then the client's VFT is updated as follows:

$$VFT_A(t) = \max\{QVT(t), VFT_A(t')\} \qquad (8)$$
This update must be performed to prevent a client from "saving up" CPU time while it is not runnable. If a client is allowed to keep its previous VFT 24 upon reinsertion, it will have an undesirably large advantage in the selection process. If the client's VFT 24 is set to the QVT 18, this puts the client's virtual time at one quantum behind the QVT 18 (because VFT 24 is the virtual time the client will have at the end of its next quantum). In effect, a client is allowed to save up one quantum worth of CPU time, but no more, when it is not runnable.
The MAX operation of equation (8) is necessary to ensure that a client may not cause systematic allocation errors in its favor. By making itself not runnable and then immediately becoming runnable again whenever it is temporarily ahead of its proportional allocation, a client could gain an unfair advantage if a reinserted client's VFT 24 was simply always set to the QVT 18.
The VFT 24 of a reinserted client can be computed in other ways as well; Equation (8) is one example of an implementation. Alternatively, it would also be possible to allow a client to save up more or less CPU time while sleeping, depending on local conditions. Alternatively, the difference between a client's VFT 24 and the QVT 18 can be tracked, and upon reinsertion of the client the VFT 24 can be set so as to maintain this difference. The initial value of a client's time counter 26 is also set. A client's time counter 26 tracks the number of quanta due to the client before the end of the current scheduling cycle, and is reset at the beginning of each new cycle. We set the time counter 26 of a newly-inserted client to a value which will give it the correct proportion of the remaining quanta in this cycle. The counter 26 for the new client A is computed as:

$$C_A = \frac{S_A}{\sum_i S_i} \sum_i C_i \qquad (9)$$

Note that this is computed before client A is inserted, so S_A is not included in the Σ_i S_i summation. This value is modified by a rule similar to the rule enacted for the VFT 24: we require that a client cannot come back in the same cycle and receive a larger time count than it had previously. Therefore, if the client is being inserted during the same cycle in which it was removed, the time counter 26 is set to the minimum of C_A and the previous time counter value. Finally, to preserve the time counter invariant as discussed above, the time counter value must be restricted to be between the time counter values of the clients before and after the inserted client.
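Pulling the rejoin rules together, a sketch of the bookkeeping performed when a client re-enters the run queue follows. remaining_quanta() (the sum of the counters of the clients already queued) is an assumed helper, the rounding in Equation (9) is an implementation choice, and 'before'/'after' are the rejoining client's neighbors in the sorted queue:

```c
/* Rejoin bookkeeping: eq. (8) for the VFT, then eq. (9) plus the
 * two clamping rules above for the time counter.  Sums exclude the
 * rejoining client itself. */
#include <math.h>

void vtrr_rejoin(struct vtrr_sched *s, struct vtrr_client *a,
                 int same_cycle, int prev_counter,
                 struct vtrr_client *before, struct vtrr_client *after)
{
    /* eq. (8): no saving up CPU time while not runnable */
    if (s->qvt > a->vft)
        a->vft = s->qvt;

    /* eq. (9): proportional share of the quanta left in this cycle */
    int c = (int)round((double)a->share * remaining_quanta(s)
                       / s->total_shares);

    if (same_cycle && c > prev_counter)  /* no larger count than before */
        c = prev_counter;
    if (before != NULL && c > before->counter)  /* counter invariant */
        c = before->counter;
    if (after != NULL && c < after->counter)
        c = after->counter;
    a->counter = c;
}
```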
If a client's share changes, there are two cases to consider based on the run state of the client. If the client is not runnable, no run queue modifications are needed. If the client is runnable and its share changes, the client's position in the run queue may need to be changed. This operation can be simplified by removing the client from the run queue, changing the share, and then reinserting it at arrow B. Removal and insertion can then be performed just as discussed above.
The novel proportional share scheduler has been presented in the context of CPU task scheduling. However, the novel algorithm may also be used to multiplex other time-shared resources such as network bandwidth. The scheduler may also be applied to packet queueing for proportional sharing of a network link. In the context of packet queueing, different terminology is appropriate. A client of a packet queueing discipline is referred to as a flow, rather than a task. A flow that is waiting for service is considered backlogged rather than runnable, and network service is provided in packets rather than time quanta. When a number of flows are backlogged on a single link, the link service must be time-multiplexed between the flows, and the process is very similar to task scheduling except for a few key differences. One of the major differences between packet scheduling and task scheduling is that network packets can be different sizes. In task scheduling, the scheduler can set the size of the time quantum, and allocate as much time to each task as it sees fit. In packet scheduling, however, service must be provided in complete packets, whose size cannot be set by the scheduling system. The novel VTRR scheduler discussed above depends on being able to assign a constant-sized quantum (in packet scheduling terms, this means a constant number of bits) to each client it schedules. Therefore, some mechanism must be employed to provide whole-packet service while simulating a constant number of bits per allocation.
The preferred mechanism used in this context is an adaptation of that introduced by Deficit Round-Robin, Shreedhar and Varghese, discussed above. Each client has one extra piece of state, its deficit value, which is initialized to zero. A constant quantum size Q is chosen, which is a number of bits at least as large as the largest possible packet. When the novel VTRR scheduler selects a client A for service, it transmits as many packets from client A's queue as it can without sending more than Q + deficit_A bits. The difference between this value and the number of bits actually sent becomes the new value of A's deficit. This mechanism provides Q bits of service per allocation on average, and the extra delay added is, in the worst case, equal to the amount of time it takes to transmit one maximum-length packet.
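A sketch of this deficit mechanism follows. The flow and packet types and the queue primitives (peek, dequeue, transmit) are illustrative assumptions; clearing the deficit when the flow's queue empties follows the usual deficit round-robin convention:

```c
/* Serve one allocation to the selected flow: send whole packets
 * without exceeding Q + deficit bits, then carry the unused
 * allowance forward as the new deficit. */
struct packet { long bits; /* plus payload */ };
struct flow   { long deficit; /* plus a per-flow packet queue */ };

extern struct packet *peek(struct flow *f);   /* NULL if queue empty */
extern struct packet *dequeue(struct flow *f);
extern void transmit(struct packet *p);

void serve_flow(struct flow *f, long Q /* bits per allocation */)
{
    long budget = Q + f->deficit;
    struct packet *p;

    while ((p = peek(f)) != NULL && p->bits <= budget) {
        budget -= p->bits;
        transmit(dequeue(f));
    }
    /* Unused allowance becomes the deficit; reset if nothing queued. */
    f->deficit = (peek(f) != NULL) ? budget : 0;
}
```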
The primary function of a scheduler is to select a client to execute when the resource is available. A key benefit of the novel VTRR scheduler is that it can select a client to execute in O(1) time. To do this, the novel VTRR scheduler simply has to maintain a sorted run queue of clients and keep track of its current position in the run queue. Updating the current run queue position and updating a client's VFT 24 are both O(1) time operations. While the run queue needs to be sorted by client shares, the ordering of clients on the run queue does not change in the normal process of selecting clients to execute. This is an important advantage over fair queueing algorithms, in which a client needs to be reinserted into a sorted run queue after each time it executes. As a result, fair queueing has much higher complexity than VTRR, requiring O(N) time to select a client to execute, or O(log N) time if more complex data structures are used (but this is rarely implemented in practice).
When all clients on the run queue have zero counter values, the novel VTRR scheduler resets the counter values of all clients on the run queue. The complete counter reset takes O(N) time, where N is the number of clients. However, this reset is done at most once every N times the scheduler selects a client to execute (and much less frequently in practice). As a result, the reset of the time counters is amortized over many client selections, so that the effective running time using the novel VTRR scheduler is still O(1) time. In addition, the counter resets can be done incrementally on the first pass through the run queue with the new counter values. In addition to selecting a client 20 to execute, a scheduler must also allow clients to be dynamically created and terminated, change run state, and change scheduling parameters such as a client's share. These scheduling operations typically occur much less frequently than client selection. In the novel VTRR scheduler, operations such as client creation and termination can be done in O(1) time since they do not directly affect the run queue. Changing a client's run state from runnable to not runnable can also be done in O(1) time for any reasonable run queue implementation, since all it involves is removing the respective client from the run queue. The scheduling operations with the highest complexity are those that involve changing a client's share assignment and changing a client's run state to runnable. In particular, a client typically becomes runnable after it is created or after an I/O operation that it was waiting for completes. If a client's share changes, the client's position in the run queue may have to change as well. If a client becomes runnable, the client will have to be inserted into the run queue in the proper position based on its share. Using a doubly linked list run queue implementation, insertion into the sorted queue can require O(N) time, where N is the number of runnable clients. Alternatively, a priority queue implementation could be used for the run queue to reduce the insertion cost to O(log N), but this likely does not have better overall performance than a simple sorted list in practice.
Because queue insertion is required much less frequently than client selection in practice, the queue insertion cost is not likely to dominate the scheduling cost. In particular, if only a constant number of queue insertions are required for every N times a client selection is done, then the effective cost of the queue insertions is still only O(1) time. Furthermore, the most common scheduling operation that would require queue insertion is when a client becomes runnable again after it was blocked waiting on a resource. In this case, the insertion overhead can be O(1) time if the last-previous client and last-next client references remain valid at queue insertion time. If the references are valid, then the position of the client on the run queue is already known, so the scheduler does not have to find the insertion point. An alternative implementation allows all queue insertions to be done in O(1) time, if the range of share values is fixed in advance. The idea is similar to priority schedulers, which have a fixed range of priority values and a separate run queue for each priority. Instead of using priorities, a separate run queue can be kept for each share value, with the run queues tracked using an array. The queue corresponding to a client's share can then be found, and the client inserted at the end of the corresponding queue, in O(1) time, as sketched below.
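A sketch of that per-share-queue layout, under the assumption of a fixed maximum share value (MAX_SHARE here is an illustrative bound, and the client type is the one sketched earlier):

```c
/* One run queue per share value, tracked by an array indexed by
 * share, so insertion needs no sorted search. */
#define MAX_SHARE 100                /* assumed fixed share range */

struct share_queue { struct vtrr_client *head, *tail; };
static struct share_queue queues[MAX_SHARE + 1];

void insert_runnable(struct vtrr_client *c)
{
    struct share_queue *q = &queues[c->share];  /* find queue: O(1) */
    c->next = NULL;
    if (q->tail != NULL)
        q->tail->next = c;                      /* append at tail   */
    else
        q->head = c;
    q->tail = c;                                /* O(1) insertion   */
}
```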
The novel CPU scheduler in accordance with the invention may be implemented, for example, in the Linux operating system. According to this embodiment, the Red Hat Linux version 6.1 distribution and the Linux version 2.2.12-20 kernel were used. The Linux scheduling framework for a single CPU is based on a run queue implemented as a single doubly linked list. We first describe how the standard Linux scheduler works, and then discuss the changes we made to implement VTRR in Linux.

The standard Linux scheduler multiplexes a set of clients that can be assigned different priorities. The priorities are used to compute a per-client measure called goodness to schedule the set of clients. Each time the scheduler is called, the goodness value for each client in the run queue is calculated. The client with the highest goodness value is then selected as the next client to execute. In the case of ties, the first client with the highest goodness value is selected. Because the goodness of each client is calculated each time the scheduler is called, the scheduling overhead of the Linux scheduler is O(N), where N is the number of runnable clients. The standard manner in which Linux calculates the goodness for all clients is based on a client's priority and counter. The counter used in Linux is not the same as the time counter value used by the novel VTRR scheduler, but is instead a measure of the remaining time left in a client's time quantum. The standard time unit used in Linux for the counter and time quantum is called a jiffy, which is 10 ms by default. The goodness of a client is its priority plus its counter value. The client's counter is initially set equal to the client's priority, which has a value of 20 by default. Each time a client is executed for a jiffy, the client's counter is decremented. A client's counter is decremented until it drops below zero, at which point the client cannot be selected to execute. As a result, the default time quantum 12 for each client is 21 jiffies, or 210 ms. When the counters of all runnable clients drop below zero, the scheduler resets all the counters to their initial values.
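For comparison with VTRR's O(1) selection, the Linux 2.2 selection loop described above amounts to the following simplified sketch (the real kernel's goodness() also accounts for factors such as address-space affinity, which are omitted here):

```c
/* Simplified Linux 2.2-style selection: scan every runnable client,
 * compute goodness = priority + counter, pick the highest; clients
 * whose counter has dropped below zero are not eligible. */
struct task { int priority, counter; struct task *next; };

struct task *linux_pick(struct task *runqueue)
{
    struct task *best = NULL;
    int best_goodness = -1;

    for (struct task *t = runqueue; t != NULL; t = t->next) {  /* O(N) */
        int g = t->priority + t->counter;
        if (t->counter >= 0 && g > best_goodness) {
            best = t;
            best_goodness = g;
        }
    }
    return best;
}
```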
To implement the novel VTRR scheduler in Linux, the existing scheduling infrastructure was used. For example, the same doubly linked list run queue structure as in the standard Linux scheduler was carried over. However, the run queue was changed to sort the clients from largest share to smallest share. Rather than scanning all the clients when a scheduling decision needs to be made, the novel VTRR scheduler implementation simply picks the next client in the run queue based on the technique discussed above with respect to FIG. 2. A consideration in connection with the Linux scheduler is that the smallest counter value that may be assigned to a client is 1. Consequently, the smallest time quantum 12 a client can have is 2 jiffies. To provide a comparable implementation with the novel VTRR scheduler, the default time quantum 12 is also 2 jiffies, or 20 ms. Such a constraint does not exist if an operating system other than Linux is used.
In addition to the values associated with the client execution state, discussed above, two fields were added to the standard client data structure in Linux: the last-previous and last-next pointers, which were used to optimize run queue insertion efficiency. In the Linux 2.2 kernel, memory for the client data structures is statically allocated, and never reclaimed for anything other than new client data structures. Therefore, in the exemplary embodiment, the VTRR scheduler dereferences the last-next and last-previous pointers to check their validity, as they always refer to some client's data. Thus, the hash table method discussed above was unnecessary in this embodiment.

To demonstrate the effectiveness of the novel VTRR scheduler, its performance was quantitatively measured and compared against other leading approaches from both industrial practice and research. Simulation studies were conducted to compare the proportional sharing accuracy of the VTRR scheduler against both WRR and WFQ. A simulator was used for these studies for two reasons. First, the simulator isolates the impact of the scheduling algorithms themselves and purposefully does not include the effects of other activity present in an actual kernel implementation. Second, the simulator permitted the examination of the scheduling behavior of these different algorithms across hundreds of thousands of different combinations of clients with different share values.
Detailed measurements of real kernel scheduler performance were conducted by comparing the VTRR scheduler implementation against both the standard Linux scheduler and a WFQ scheduler. In particular, comparing against the standard Linux scheduler and measuring its performance is important because of its growing popularity as a platform for server as well as desktop systems.
All of the kernel scheduler measurements were performed on a Gateway 2000 E1400 system with a 433 MHz Intel Celeron CPU, 128 MB RAM, and a 10 GB hard drive. The system was installed with the Red Hat Linux 6.1 distribution running the Linux version 2.2.12-20 kernel. The measurements were done by using a minimally intrusive tracing facility that logs events at significant points in the application and the operating system code. This is performed via a light-weight mechanism that writes timestamped event identifiers into a memory log. The mechanism takes advantage of the high-resolution clock cycle counter available with the Intel CPU to provide measurement resolution at the granularity of a few nanoseconds. Getting a timestamp simply involved reading the hardware cycle counter register, which could be read from user-level or kernel-level code. The cost of the mechanism on the system was measured to be roughly 70 ns per event. The kernel scheduler measurements were performed on a fully functional system to represent a realistic system environment, in which all experiments were performed with all system functions running and the system connected to the network.

A scheduling simulator was used to evaluate the proportional fairness of the novel VTRR scheduler in comparison to two other schedulers, WRR and WFQ. The simulator is a user-space program that measures the service time error, as discussed above in Equation (2). The simulator takes four inputs: the scheduling algorithm, the number of clients N, the total number of shares S, and the number of client-share combinations. The simulator randomly assigns shares to clients and scales the share values to ensure that they sum to S. It then schedules the clients using the specified algorithm as a real scheduler would, and tracks the resulting service time error. The simulator runs the scheduler until the resulting schedule repeats, then computes the maximum (most positive) and minimum (most negative) service time error across the nonrepeating portion of the schedule for the given set of clients and share assignments. The simulator assumes that all clients are runnable at all times. This process of random share allocation and scheduler simulation is repeated for the specified number of client-share combinations. An average highest service time error and average lowest service time error were computed for the specified number of client-share combinations to obtain an "average-case" error range. To measure proportional fairness accuracy, simulations were run for each scheduling algorithm considered on 40 different combinations of N and S. For each set of (N, S), 10,000 client-share combinations were run and the resulting average error ranges were determined. The average service time error ranges for VTRR, WRR, and WFQ are shown in FIGS. 4-7.
FIGS. 4 and 5 show a comparison of the error ranges for WRR versus VTRR, in which FIG. 4 shows the error ranges for WRR and FIG. 5 shows the error ranges for VTRR. FIGS. 4 and 5 show two surfaces plotted on axes of the same scale, representing the maximum and minimum service time error as a function of N and S. Within the range of values of N and S shown, WRR's error range reaches as low as -398 tu and as high as 479 tu (FIG. 4). With the time units expressed in 10 ms jiffies as in Linux, a client under WRR can on average get ahead of its correct CPU time allocation by 4.79 seconds, or behind by 3.98 seconds, which is a substantial amount of service time error. In contrast, FIG. 5 shows that the VTRR scheduler has a much smaller error range than WRR and is much more accurate. Because the error axis is scaled to display the wide range of WRR's error values as shown in FIG. 4, it is difficult to even distinguish the two surfaces for VTRR in FIG. 5. The VTRR scheduler's service time error only ranges from -3.8 to 10.6 tu; this can be seen more clearly in FIGS. 6 and 7.
FIGS. 6 and 7 show a comparison of the error ranges for WFQ versus the VTRR scheduler, in which FIG. 6 shows the error ranges for WFQ and FIG. 7 shows the error ranges for VTRR. As in FIGS. 4 and 5, each graph shows two surfaces plotted on axes of the same scale, representing the maximum and minimum service time error as a function of N and S. The VTRR graph in FIG. 7 includes the same data as the VTRR graph in FIG. 5, but the error axis is scaled more naturally. Within the range of values of N and S shown, WFQ's average error range reaches as low as -1 tu and as high as 2 tu (FIG. 6), as opposed to VTRR's error range from -3.8 to 10.6 tu (FIG. 7). The error ranges for WFQ are smaller than for VTRR, but the difference between WFQ and VTRR is much smaller than the difference between VTRR and WRR. With the time units expressed in 10 ms jiffies as in Linux, a client under WFQ can on average get ahead of its correct CPU time allocation by 20 ms, or behind by 10 ms, while a client under VTRR can get ahead by 106 ms or behind by 38 ms. In both cases, the service time errors are small. In fact, the service time errors are even below the threshold of delay noticeable by most human beings for response time on interactive applications.
The data produced by the simulations confirm that the novel VTRR scheduler has fairness properties that are much better than WRR's, and nearly as good as WFQ's. For the domain of values simulated, the service time error for VTRR falls into an average range almost two orders of magnitude smaller than WRR's error range. While VTRR's error range is not quite as good as WFQ's, even the largest error measured, 10.6 tu, would likely be unnoticeable in most applications, given the size of time unit used by most schedulers. Furthermore, it is shown below that this degree of accuracy is provided at much lower overhead than WFQ.
To evaluate the scheduling overhead of the novel VTRR scheduler, VTRR was implemented in the Linux operating system and the overhead of the novel VTRR implementation was compared against the overhead of both the Linux scheduler and a WFQ scheduler. Several experiments were conducted in which each client executed a simple micro-benchmark which performed a few operations in a while loop. A control program was used to fork a specified number of clients. Once all clients were runnable, the execution time of each scheduling operation that occurred during a fixed time duration of 30 seconds was measured. This was done by inserting a counter and timestamped event identifiers in the Linux scheduling framework. The measurements required two timestamps for each scheduling decision, so variations of 140 ns are possible due to measurement overhead. These experiments were performed on the standard Linux scheduler, WFQ, and the novel VTRR scheduler for 1 client up to 200 clients.
FIG. 8 shows the average execution time required by each scheduler to select a client to execute. For this experiment, the particular implementation details of the WFQ scheduler affect the overhead, so results from two different implementations of WFQ are included. In the first, labeled "WFQ [O(N)]," the run queue is implemented as a simple linked list which must be searched on every scheduling decision. The second, labeled "WFQ [O(log N)]," uses a heap-based priority queue with O(log N) insertion time. Most fair queueing-based schedulers are implemented in the first fashion, due to the difficulty of maintaining complex data structures in the kernel. In the exemplary embodiment, for example, a separate, fixed-length array was necessary to maintain the heap-based priority queue. If the number of clients ever exceeds the length of the array, a costly array reallocation must be performed. An initial array size was chosen which was large enough to contain more than 200 clients, so this additional cost is not reflected in the measurements.
As shown in FIG. 8, the increase in scheduling overhead as the number of clients increases varies a great deal between different schedulers. The novel VTRR scheduler has the smallest scheduling overhead. It requires less than 800 ns to select a client to execute, and the scheduling overhead is essentially constant for all numbers of clients. In contrast, the overhead for Linux and for O(N) WFQ scheduling grows linearly with the number of clients. The Linux scheduler imposes 100 times more overhead than VTRR when scheduling a mix of 200 clients. In fact, the Linux scheduler still spends almost 10 times as long scheduling a single micro-benchmark client as VTRR does scheduling 200 clients. VTRR outperforms Linux and WFQ even for small numbers of clients because the VTRR scheduling code is simpler and hence runs significantly faster. The novel VTRR scheduler performs even better compared to Linux and WFQ for large numbers of clients because it has constant time overhead, as opposed to the linear time overhead of the other schedulers. While O(logN) WFQ has much smaller overhead than Linux or O(N) WFQ, it still imposes significantly more overhead than the novel VTRR scheduler, particularly with large numbers of clients. With 200 clients, O(logN) WFQ has an overhead more than 6 times that of the novel VTRR scheduler. WFQ's more complex data structures require more time to maintain, and the time required to make a scheduling decision still depends on the number of clients, so the overhead would only continue to grow worse as more clients are added. VTRR's scheduling decisions always take the same amount of time, regardless of the number of clients.

The large overhead shown for the Linux scheduler for a single client running the micro-benchmark bears further examination. This overhead is much larger than the single-client overhead of any of the other schedulers tested. In fact, the Linux scheduler incurs more scheduling overhead per client for one client than for two clients running the micro-benchmark. This behavior is counterintuitive, as one would expect less scheduling overhead with only one client to run. The extra overhead is due to the periodic counter recalculations of the Linux scheduler. In Linux, time can be broken into scheduling epochs; an epoch is a period of time in which each client has received one complete quantum. At the end of every epoch, a recalculation of the priorities of all clients is performed. This includes clients that are not runnable, such as those that perform system functions, and there are usually a substantial number of these system clients on a general-purpose machine that are not runnable most of the time. When only one client is runnable, an epoch is the length of the single client's quantum, so the counter recalculation occurs on every scheduling decision. When more clients are runnable, the epoch does not end until all clients have received their allocations, so the recalculation overhead is amortized over a number of scheduling decisions.
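In outline, this epoch-end recalculation corresponds to a fragment of roughly the following form, modeled on the Linux 2.4 scheduler. The names follow that kernel, but the trigger condition is paraphrased and the surrounding logic is abridged:

/* Abridged sketch of the Linux 2.x epoch mechanism discussed above.
 * When every runnable task has exhausted its counter, the scheduler
 * walks EVERY task in the system, runnable or not, to recompute
 * counters. With a single runnable client this loop fires on every
 * scheduling decision, explaining the high single-client overhead;
 * with more runnable clients its cost is amortized. */
if (all_runnable_counters_exhausted) {      /* paraphrased trigger condition */
    struct task_struct *p;

    for_each_task(p)                        /* all tasks, not just runnable ones */
        p->counter = (p->counter >> 1) + NICE_TO_TICKS(p->nice);
}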
Several experiments were conducted to measure the scheduling behavior of the standard Linux scheduler, WFQ, and the novel VTRR scheduler at fine time resolutions. A 30 second workload of five micro-benchmarks was run with different proportional sharing parameters. Using VTRR and WFQ, the five micro-benchmarks were run with shares 1, 2, 3, 4, and 5, respectively. To provide similar proportional sharing behavior using the Linux scheduler, the five micro-benchmarks were run with user priorities 19, 17, 15, 13, and 11, respectively. These translate to internal scheduler priorities of 1, 3, 5, 7, and 9, respectively, which in turn translate into the clients running for 20 ms, 40 ms, 60 ms, 80 ms, and 100 ms time quanta, respectively. The smallest time quantum used is the same for all three schedulers. The mapping between proportional shares and user input priorities is non-intuitive in Linux. The scheduling behavior for this workload appears similar across all of the schedulers when viewed at a coarse granularity: the relative resource consumption rates of the micro-benchmarks are virtually identical to their respective shares.
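Although non-intuitive, the priority-to-quantum mapping above can be summarized arithmetically. The formulas below are inferred from the values given in this experiment (user priorities 19 through 11, internal priorities 1 through 9, 10 ms jiffies); they are a sketch of the relationship, not kernel code:

/* Inferred from the values above: user priority 19 -> internal 1 ->
 * 20 ms quantum, ..., user priority 11 -> internal 9 -> 100 ms
 * quantum, yielding quanta in the desired 1:2:3:4:5 ratio. */
int internal_priority(int user_priority)
{
    return 20 - user_priority;           /* 19 -> 1, 17 -> 3, ..., 11 -> 9 */
}

int quantum_ms(int user_priority)
{
    /* one jiffy is 10 ms; the quantum grows with internal priority */
    return (internal_priority(user_priority) + 1) * 10;
}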
When the measurements are viewed over a shorter time scale of one second, the differences become apparent: the actual scheduling sequences of each scheduler over this interval are shown in FIGS. 9, 10, and 11 for the Linux scheduler, WFQ, and the novel VTRR scheduler, respectively. These measurements were made by sampling a client's execution from within the client, recording multiple high-resolution timestamps each time the client was executed. The Linux scheduler (FIG. 9) is least effective at scheduling the clients evenly and predictably. Both WFQ (FIG. 10) and the novel VTRR scheduler (FIG. 11) do a much better job of scheduling the clients proportionally at a fine granularity; in both cases, there is a clear repeating scheduling pattern every 300 ms.
Linux does not produce a repeating pattern because the order in which it schedules clients changes depending on exactly when the scheduler function is called. Once Linux selects a client to execute, it does not preempt the client even if its goodness drops below that of other clients. Instead, it runs the client until its counter is exhausted or an interrupt or other scheduling event occurs. If a scheduling event occurs, Linux again considers the goodness of all clients; otherwise, it continues running the current client. Since interrupts can cause a scheduling event and can occur at arbitrary times, the resulting order in which clients are scheduled has no repeating pattern. As a result, applications scheduled using WFQ and VTRR receive a more even level of CPU service than if they were scheduled using the Linux scheduler.
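The Linux behavior just described can be modeled by the following sketch, with hypothetical types and helpers standing in for kernel internals:

/* Illustrative model of the dispatch behavior described above; the
 * types and helper functions are hypothetical stand-ins. Once a
 * client is selected, it is not preempted on goodness alone: it runs
 * until its counter is exhausted or a scheduling event (e.g., an
 * interrupt) sets need_resched. Because interrupts arrive at
 * arbitrary times, the resulting schedule has no fixed period. */
struct client_state { int counter; };

extern int  need_resched;                              /* set by interrupts etc. */
extern int  run_one_tick(struct client_state *c);      /* hypothetical: runs one tick */
extern struct client_state *client_with_highest_goodness(void);

void dispatch_once(void)
{
    struct client_state *cur = client_with_highest_goodness();

    while (cur->counter > 0 && !need_resched)
        cur->counter -= run_one_tick(cur);   /* no goodness re-check in here */
    /* only now is the goodness of all clients considered again */
}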
To demonstrate VTRR's efficient proportional sharing of resources on real applications, two experiments were performed, one running multimedia applications and the other running virtual machines. The performance of VTRR versus the standard Linux scheduler and WFQ is shown. One experiment ran multiple MPEG audio encoders with different shares on each of the three schedulers. The encoder test was implemented by running five copies of an MPEG audio encoder. The encoder clients were allotted shares of 1, 2, 3, 4, and 5, and were instrumented with timestamp event recorders in a manner similar to the micro-benchmark programs described above. Each encoder took its input from the same file, but wrote output to its own file. MPEG audio is encoded in chunks called frames, so the instrumented encoder records a timestamp after each frame is encoded, making it easy to observe the effect of resource share on single-frame encoding time. FIGS. 12, 13, and 14 show the number of frames encoded over time for the Linux default scheduler, WFQ, and the novel VTRR scheduler, respectively. The Linux scheduler (FIG. 12) clearly does not provide sharing as fairly as WFQ (FIG. 13) or VTRR (FIG. 14) when viewed over a short time interval. The "staircase" effect indicates that CPU resources are provided in bursts, which, for a time-critical task like audio streaming, can mean extra jitter, resulting in delays and dropouts. The smoother curves of the WFQ and VTRR graphs show that WFQ and VTRR provide fair resource allocation at a much smaller granularity. When analyzed at a fine resolution, some differences in the proportional sharing behavior of the applications under WFQ versus VTRR can be detected, but the difference is far smaller than the clearly visible difference compared with Linux. VTRR trades some precision in instantaneous proportional fairness for much lower scheduling overhead.
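The per-frame instrumentation described above can be sketched as follows; encode_one_frame() and the buffer size are hypothetical stand-ins for the actual encoder internals:

/* Sketch of the per-frame timestamping described above. A timestamp
 * is recorded after each frame and written out only at the end, so
 * the instrumentation itself perturbs the scheduling very little. */
#include <stdio.h>
#include <sys/time.h>

#define MAX_FRAMES 100000

extern int encode_one_frame(void);     /* hypothetical: returns 0 until input ends */

static struct timeval stamp[MAX_FRAMES];
static int nframes;

void encode_and_record(void)
{
    while (nframes < MAX_FRAMES && encode_one_frame() == 0)
        gettimeofday(&stamp[nframes++], NULL);

    for (int i = 0; i < nframes; i++)  /* dumped after encoding finishes */
        printf("%d %ld.%06ld\n", i,
               (long)stamp[i].tv_sec, (long)stamp[i].tv_usec);
}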
Another experiment ran several VMware virtual machines on top of a Linux operating system and compared the performance of applications within the virtual machines when the virtual machines were scheduled using the different schedulers. For this experiment, three virtual machines were run simultaneously with respective shares of 1, 2, and 3. A simple timing benchmark was executed within each virtual machine to measure the relative performance of the virtual machines. The hardware clock cycle counters were used for these measurements, as the standard operating system timing mechanisms within a virtual machine are a poor measure of elapsed time. The experiment was conducted using the standard Linux scheduler, WFQ, and VTRR. The results were similar to those of the previous experiments, with Linux performing the worst in terms of evenly distributing CPU cycles, and VTRR and WFQ providing more comparable scheduling accuracy in proportionally allocating resources.
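Reading the hardware cycle counter from within a guest can be sketched as follows for x86; rdtsc is the standard instruction for this purpose, while do_benchmark_work() is a hypothetical placeholder for the timing benchmark's body:

/* Sketch of cycle-counter-based timing on x86, as described above:
 * guest OS clocks are unreliable inside a virtual machine, but the
 * hardware time-stamp counter keeps advancing, so elapsed cycles
 * give a usable measure of CPU service received. */
extern void do_benchmark_work(void);    /* hypothetical benchmark body */

static inline unsigned long long rdtsc(void)
{
    unsigned int lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((unsigned long long)hi << 32) | lo;
}

unsigned long long cycles_for_work(void)
{
    unsigned long long start = rdtsc();
    do_benchmark_work();
    return rdtsc() - start;             /* CPU cycles consumed */
}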
It will be understood that the foregoing is only illustrative of the principles of the invention, and that various modifications can be made by those skilled in the art without departing from the scope and spirit of the invention.

Claims

1. A proportional share scheduling apparatus for scheduling total resources available for a central processing unit, the apparatus comprising:
(a) a plurality of clients, each one of the clients having an execution state with a value indicative of a proportional resource allocation of said total resources; and
(b) a scheduler having a run queue which stores each one of said clients in a sorted order beginning with a first client having a largest proportional resource allocation and followed by subsequent clients arranged in order of diminishing proportional resource allocations, and which is configured to
(i) in said sorted order beginning with said first client having said largest resource allocation, schedule the execution of one of said clients in said run queue for a constant time quantum;
(ii) determine a client virtual time as a measure of a degree to which a next subsequent client in said run queue has received service relative to its respective proportional allocation;
(iii) determine a queue virtual time as a measure of a degree to which all of said clients in said run queue have received service relative to said total resources, said queue virtual time determined for said next subsequent time quantum;
(iv) compare the client virtual time with the queue virtual time;
(v) execute said next subsequent client in said sorted order for said constant time quantum if said client virtual time is greater than or equal to said queue virtual time; and
(vi) return to said beginning of said run queue to execute said first client for the constant time quantum if said client virtual time is less than said queue virtual time.
2. The apparatus of claim 1, wherein said state of each one of said clients further comprises a value of a time counter, and wherein said scheduler is further configured to set a respective value of a time counter for each of said clients to the number of time quanta representative of said client's respective proportional resource allocation.
3. The apparatus of claim 2, wherein said scheduler is further configured to decrement said value of said time counter after said client is executed for said constant time quantum.
4. The apparatus of claim 1, wherein the scheduler is configured to increment the client virtual time by an amount comprising a ratio of said constant time quantum to said respective client's proportional resource allocation.
5. The apparatus of claim 4, wherein the scheduler is configured to increment the queue virtual time by an amount comprising a ratio of said constant time quantum to said total resource allocation of all of said clients in said run queue.
6. The apparatus of claim 5, wherein said state of each one of said clients further comprises an indication of whether said client is runnable, and wherein said scheduler is configured to insert each of said clients into said run queue if said client is runnable, and to remove said client from said queue if said client is not runnable.
7. The apparatus of claim 6, wherein said state of each one of said clients further comprises a client identification, and wherein said scheduler is configured to record a reference to the client identification of a previous client in said run queue and a next subsequent client in said run queue when said client is removed from said run queue.
8. The apparatus of claim 7, wherein the scheduler is configured to update the client virtual time of said client when said client, having previously been not runnable, subsequently becomes runnable.
9. The apparatus of claim 8, wherein the scheduler is configured to update said client virtual time by replacing said previous client virtual time with the greater of the previous client virtual time and the queue virtual time.
10. A method of proportional share scheduling for scheduling total resources for a central processing unit among a plurality of clients, each one of said clients having a state with a value indicative of a proportional resource allocation of said total resources, the method comprising:
(a) storing each one of said clients in a run queue in a sorted order beginning with a first client having a largest proportional resource allocation and followed by subsequent clients arranged in order of diminishing proportional resource allocations; and
(b) sequentially executing, in said sorted order beginning with said first client having said largest resource allocation, one of said clients in said run queue for a constant time quantum;
(c) determining a client virtual time as a measure of a degree to which a next subsequent client in said run queue has received service relative to its respective proportional allocation;
(d) determining a queue virtual time as a measure of a degree to which all of said clients in said run queue have received service relative to the total resource allocation at said next subsequent time quantum;
(e) comparing the client virtual time with the queue virtual time;
(f) executing said next subsequent client for said constant time quantum if said client virtual time is greater than or equal to said queue virtual time; and
(g) returning to said beginning of said run queue to execute said first client for the constant time quantum if said client virtual time is less than said queue virtual time.
11. The method of claim 10, wherein said state of each one of said clients further comprises a value of a time counter, and wherein said method further comprises setting a respective value of a time counter for each of said clients to the number of time quanta representative of said client's respective proportional resource allocation.
12. The method of claim 11, further comprising decrementing said value of said time counter after said client is executed for said constant time quantum.
13. The method of claim 12, further comprising incrementing the client virtual time by an amount comprising a ratio of said constant time quantum to said respective client's proportional resource allocation.
14. The method of claim 13, further comprising incrementing the queue virtual time by an amount comprising a ratio of said constant time quantum to said total resource allocation of all of said clients in said run queue.
15. The method of claim 14, wherein said state of each one of said clients further comprises an indication of whether said client is runnable, said method further comprising inserting each of said clients into said run queue if said client is runnable, and removing said client from said queue if said client is not runnable.
16. The method of claim 15, wherein said state of each one of said clients further comprises a client identification, said method further comprising recording a reference to the client identification of a next previous client in said run queue and a next subsequent client in said run queue when said client is removed from said run queue.
17. The method of claim 16, further comprising updating the client virtual time of said client when said client, having previously been not runnable, subsequently becomes runnable.
18. The method of claim 17, further comprising updating said client virtual time by replacing said previous client virtual time with the greater of the previous client virtual time and the queue virtual time.
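For illustration only, and not as a limitation of the claims, the scheduling cycle recited in claims 1-5 and 10-14 can be rendered as the following C sketch; the data layout, the dispatch helper, and the use of floating point are assumptions made for clarity:

/* Sketch of the claimed scheduling cycle. Clients sit in a run queue
 * sorted by share, largest first; each runs for a constant quantum Q.
 * After a client runs, its virtual time advances by Q/share (claim 4)
 * and the queue virtual time by Q/total shares (claim 5); the next
 * client runs only if its virtual time is at least the queue virtual
 * time (step (v)), otherwise the scan returns to the head (step (vi)). */
#define MAX_CLIENTS 256
#define Q 1.0                       /* constant time quantum, in tu */

struct client {
    double share;                   /* proportional resource allocation */
    double vt;                      /* client virtual time */
    int    counter;                 /* quanta left this cycle (claims 2-3) */
};

extern void run_for_quantum(struct client *c);   /* hypothetical dispatch */

struct client run_queue[MAX_CLIENTS];   /* sorted by share, descending */
int    nclients;
double total_shares;                /* sum of shares of queued clients */
double queue_vt;                    /* queue virtual time */

void schedule_loop(void)
{
    int i = 0;
    for (;;) {
        struct client *c = &run_queue[i];

        run_for_quantum(c);                 /* step (i) */
        c->vt      += Q / c->share;         /* claim 4 */
        c->counter -= 1;                    /* claim 3 */
        queue_vt   += Q / total_shares;     /* claim 5 */

        /* steps (iv)-(vi): compare the next client's virtual time
         * with the queue virtual time */
        if (i + 1 < nclients && run_queue[i + 1].vt >= queue_vt)
            i = i + 1;                      /* step (v): run the next client */
        else
            i = 0;                          /* step (vi): back to the head */
    }
}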
PCT/US2002/014020 2001-05-01 2002-05-01 Apparatus and methods for proportional share scheduling WO2002088938A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US28787801P 2001-05-01 2001-05-01
US60/287,878 2001-05-01

Publications (1)

Publication Number Publication Date
WO2002088938A1 WO2002088938A1 (en) 2002-11-07

Family

ID=23104755

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2002/014020 WO2002088938A1 (en) 2001-05-01 2002-05-01 Apparatus and methods for proportional share scheduling

Country Status (1)

Country Link
WO (1) WO2002088938A1 (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6359900B1 (en) * 1998-04-09 2002-03-19 Novell, Inc. Method and system for controlling access to a resource

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
BRUNO J. ET AL.: "Move-to-rear list scheduling: a new scheduling algorithm for providing QoS guarantees", ACM PUBLICATION, 1997, pages 63 - 73, XP002953782 *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7603327B2 (en) 2001-07-06 2009-10-13 Computer Associates Think, Inc. System and method for managing object based clusters
EP1683014A2 (en) * 2004-05-20 2006-07-26 Bea Systems, Inc. System and method for application server with self-tuned threading model
EP1683014A4 (en) * 2004-05-20 2008-07-09 Bea Systems Inc System and method for application server with self-tuned threading model
US7657892B2 (en) 2004-05-20 2010-02-02 Bea Systems, Inc. System and method for application server with self-tuned threading model
WO2005116833A1 (en) * 2004-05-21 2005-12-08 Computer Associates Think, Inc. Method and apparatus for dynamic cpu resource management
US7979863B2 (en) 2004-05-21 2011-07-12 Computer Associates Think, Inc. Method and apparatus for dynamic CPU resource management
US7979857B2 (en) 2004-05-21 2011-07-12 Computer Associates Think, Inc. Method and apparatus for dynamic memory resource management
US8255907B2 (en) 2005-09-30 2012-08-28 Ca, Inc. Managing virtual machines based on business priority
US8104033B2 (en) 2005-09-30 2012-01-24 Computer Associates Think, Inc. Managing virtual machines based on business priorty
US8225313B2 (en) 2005-10-19 2012-07-17 Ca, Inc. Object-based virtual infrastructure management
US7483978B2 (en) 2006-05-15 2009-01-27 Computer Associates Think, Inc. Providing a unified user interface for managing a plurality of heterogeneous computing environments
EP2256735A1 (en) * 2009-04-14 2010-12-01 Avid Technology Canada Corp Rendering in a multi-user video editing system
US8527646B2 (en) 2009-04-14 2013-09-03 Avid Technology Canada Corp. Rendering in a multi-user video editing system
US9329745B2 (en) 2009-04-14 2016-05-03 Avid Technology Canada Corp. Rendering in a multi-user video editing system

Similar Documents

Publication Publication Date Title
Caprita et al. Group Ratio Round-Robin: O (1) Proportional Share Scheduling for Uniprocessor and Multiprocessor Systems.
Abeni et al. Resource reservation in dynamic real-time systems
Jeffay et al. Proportional share scheduling of operating system services for real-time applications
US8069444B2 (en) Method and apparatus for achieving fair cache sharing on multi-threaded chip multiprocessors
Buttazzo et al. Soft Real-Time Systems
US6909691B1 (en) Fairly partitioning resources while limiting the maximum fair share
Abeni et al. QoS guarantee using probabilistic deadlines
Jeffay et al. A theory of rate-based execution
Zotkin et al. Job-length estimation and performance in backfilling schedulers
Friedman et al. Fairness and efficiency in web server protocols
US7188174B2 (en) Admission control for applications in resource utility environments
US6442583B1 (en) Multi-system resource capping
US20060280119A1 (en) Weighted proportional-share scheduler that maintains fairness in allocating shares of a resource to competing consumers when weights assigned to the consumers change
JP2004213624A (en) Dynamic thread pool adjusting technique
JP2004213625A (en) Response-time basis workload distribution technique based on program
Mace et al. 2dfq: Two-dimensional fair queuing for multi-tenant cloud services
Checconi et al. QFQ: Efficient packet scheduling with tight guarantees
Wierman et al. Scheduling despite inexact job-size information
Abhaya et al. Performance analysis of EDF scheduling in a multi-priority preemptive M/G/1 queue
US7894347B1 (en) Method and apparatus for packet scheduling
WO2002088938A1 (en) Apparatus and methods for proportional share scheduling
Chen et al. Cluster fair queueing: Speeding up data-parallel jobs with delay guarantees
Shenoy et al. Cello: A disk scheduling framework for next generation operating systems
Zhang et al. A virtual deadline scheduler for window-constrained service guarantees
Keleher et al. Attacking the bottlenecks of backfilling schedulers

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ OM PH PL PT RO RU SD SE SG SI SK SL TJ TM TN TR TT TZ UA UG US UZ VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP