WO2011120019A2 - Fine grain performance resource management of computer systems - Google Patents

Fine grain performance resource management of computer systems

Info

Publication number
WO2011120019A2
Authority
WO
WIPO (PCT)
Prior art keywords
task
rate
processor
clock
performance
Prior art date
Application number
PCT/US2011/030096
Other languages
French (fr)
Other versions
WO2011120019A3 (en)
Inventor
Gary Allen Gibson
Valeri Popescu
Original Assignee
Virtualmetrix, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Virtualmetrix, Inc.
Priority to JP2013501534A (published as JP2013527516A)
Priority to KR1020127027941A (published as KR20130081213A)
Priority to EP11760356.3A (published as EP2553573A4)
Priority to CN2011800254093A (published as CN102906696A)
Publication of WO2011120019A2
Publication of WO2011120019A3

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/26Power supply means, e.g. regulation thereof
    • G06F1/32Means for saving power
    • G06F1/3203Power management, i.e. event-based initiation of a power-saving mode
    • G06F1/3206Monitoring of events, devices or parameters that trigger a change in power modality
    • G06F1/3228Monitoring task completion, e.g. by use of idle timers, stop commands or wait commands
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/26Power supply means, e.g. regulation thereof
    • G06F1/32Means for saving power
    • G06F1/3203Power management, i.e. event-based initiation of a power-saving mode
    • G06F1/3234Power saving characterised by the action undertaken
    • G06F1/324Power saving characterised by the action undertaken by lowering clock frequency
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/26Power supply means, e.g. regulation thereof
    • G06F1/32Means for saving power
    • G06F1/3203Power management, i.e. event-based initiation of a power-saving mode
    • G06F1/3234Power saving characterised by the action undertaken
    • G06F1/329Power saving characterised by the action undertaken by task scheduling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3409Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3409Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment
    • G06F11/3419Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment by assessing time
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3466Performance evaluation by tracing or monitoring
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5094Allocation of resources, e.g. of the central processing unit [CPU] where the allocation takes into account power or heat criteria
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2201/00Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F2201/865Monitoring of software
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/50Indexing scheme relating to G06F9/50
    • G06F2209/501Performance criteria
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/50Indexing scheme relating to G06F9/50
    • G06F2209/507Low-level
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the subject matter described herein relates to systems, methods, and articles for management of performance resources utilized by tasks executing in a processor system.
  • a computing system consists not only of physical resources (processors, memory, peripherals, buses, etc.) but also of performance resources such as processor cycles, clock speed, memory and I/O bandwidth, and main/cache memory space.
  • the performance resources have generally been managed inefficiently or not managed at all.
  • processors are underutilized, consume too much energy and are robbed of some of their performance potential.
  • Many computer systems are capable of dynamically controlling the system and/or processor clock frequency(s). Lowering the clock frequency can dramatically lower the power consumption due to semiconductor scaling effects that allow processor supply voltages to be lowered when the clock frequency is lowered. Thus, being able to reduce the clock frequency, provided the computer system performs as required, can lead to reduced energy consumption, heat generation, etc.
  • processors are able to rapidly enter and exit idle or sleep states where they may consume very small amounts of energy compared to their active state(s).
  • placing processors and/or part or all of a computer system in a sleep state can be used to reduce overall energy consumption, provided the computer system performs as required.
  • Execution of a plurality of tasks by a processor system is monitored. Based on this monitoring, tasks requiring additional performance resources are identified by calculating a progress error and/or one or more progress limit errors for each task. Thereafter, performance resources of the processor system allocated to each identified task are adjusted. Such adjustment can comprise: adjusting a clock rate of at least one processor in the processor system executing the task, adjusting an amount of cache and/or buffers to be utilized by the task, and/or adjusting an amount of input/output (I/O) bandwidth to be utilized by the task.
  • Each task can be selected from a group comprising: a single task, a group of tasks, a thread, a group of threads, a single state machine, a group of state machines, a single virtual machine, and a group of virtual machines, and any combination thereof.
  • the processor can comprise: a single processor, a multi-processor, a processor system supporting multi-threading (e.g., simultaneous or pseudo-simultaneous multithreading, etc.), and/or a multi-core processor.
  • Monitored performance metrics associated with the tasks executing / to be executed can be changed. For example, data transference can initially be monitored and later processor cycles can be monitored.
  • the progress error rate can be equal to a differential between work completed by the task and work to be completed by the task. Alternatively, the progress error rate is equal to a difference between a work completion rate for completed work and an expected work rate for the task.
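The two definitions of progress error given above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the sign convention (positive means the task is behind), and the units are assumptions made for the example.

```python
def progress_error(work_completed: float, work_expected: float) -> float:
    """Differential between work to be completed and work completed.

    Assumed sign convention: a positive result means the task is behind.
    """
    return work_expected - work_completed


def progress_rate_error(work_completed: float, elapsed_s: float,
                        expected_rate: float) -> float:
    """Difference between the expected work rate and the observed completion rate."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return expected_rate - work_completed / elapsed_s
```

Under these assumptions, a task with a positive error is a candidate for additional performance resources, while a zero or negative error suggests its allocation can be maintained or reduced.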
  • Each task can have an associated execution priority and an execution deadline (and such priority and/or deadline can be specified by a scheduler and/or it can be derived / used as part of a rate adaption function or a parameter to a rate adaption function). In such cases, the performance resources of the processor system can be adjusted to enable each identified task to be completed prior to its corresponding execution deadline and according to its corresponding execution priority.
  • Performance resources can be adjusted on a task-by-task basis.
  • Each task can have an associated performance profile that is used to establish the execution priority and the execution deadline for the task.
  • the associated performance profile can specify at least one performance parameter.
  • the performance parameter can, for example, be a cache occupancy quota specifying an initial maximum and/or minimum amount of buffers to be used by the task and the cache occupancy quota can be dynamically adjusted during execution of the task.
  • the cache occupancy quota can be dynamically adjusted based on at least one of: progress error, a cache miss rate for the task, a cache hit rate or any other metrics indicative of performance.
  • the performance parameter can specify initial bandwidth requirements for the execution of the task and such bandwidth requirements can be dynamically adjusted during execution of the task.
  • a processor clock demand rate required by each task can be determined. Based on such determinations, an aggregate clock demand rate based on the determined processor clock demand rate for all tasks can be computed. In response, the processor system clock rate can be adjusted to accommodate the aggregate clock demand rate. In some cases, the processor system clock rate can be adjusted to the aggregate clock demand rate plus an overhead demand rate.
  • the processor clock demand rate can be calculated as a product of a current processor system clock rate with expected execution time for completion of the task divided by a time interval.
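As a rough sketch of the calculation described above, the per-task demand rate is the current clock rate multiplied by the expected execution time and divided by the time interval, and the aggregate is the sum over tasks plus an optional overhead term. The function and parameter names are hypothetical, and units of hertz and seconds are assumed.

```python
def clock_demand_rate(current_clock_hz: float, expected_exec_time_s: float,
                      interval_s: float) -> float:
    """Per-task demand: current clock rate * expected execution time / time interval."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return current_clock_hz * expected_exec_time_s / interval_s


def aggregate_clock_demand(task_demands_hz, overhead_hz: float = 0.0) -> float:
    """Sum of per-task demand rates plus an optional overhead demand rate."""
    return sum(task_demands_hz) + overhead_hz
```

For example, a task expected to need 0.25 s of execution at 800 MHz within each 1 s interval would demand 200 MHz under this formula.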
  • the processor clock demand rate for each task can be updated based on errors affecting performance of the task and, as a result, the aggregate clock demand rate can be updated based on the updated processor clock demand rate for each task.
  • Updating of the processor clock demand rate for each task or the aggregate clock demand rate can use at least one adaptation function to dampen or enhance rapid rate changes.
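The text does not specify a particular adaptation function. One common choice, shown here purely as an assumption, is a proportional update in which a gain below 1 dampens rapid changes and a gain above 1 enhances them; `damped_update` and `gain` are hypothetical names.

```python
def damped_update(previous_rate: float, target_rate: float,
                  gain: float = 0.25) -> float:
    """Move the demand rate toward the target by a fraction `gain` of the gap.

    gain < 1 dampens rapid changes, gain == 1 tracks the target exactly,
    and gain > 1 enhances (overshoots) the change.
    """
    if gain <= 0:
        raise ValueError("gain must be positive")
    return previous_rate + gain * (target_rate - previous_rate)
```

Applied repeatedly with a gain below 1, the rate converges toward the target without reacting fully to a single noisy measurement.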
  • a processor clock rate for each task can be added to the aggregate clock demand rate when the task is ready-to-run as determined by a scheduler or other system component that determines when a task is ready-to-run (such as an I/O subsystem completing an I/O request on which the task is blocked).
  • the aggregate clock demand rate can be calculated over a period of time such that, at times, the processor system clock rate is higher than the aggregate clock demand rate, and at other times, the processor system clock rate is lower than the aggregate clock demand rate.
  • the processor system can include at least two processors and the aggregate clock demand rate can be determined for each of the at least two processors and be based on the processor demand rate for tasks executing using the corresponding processor. In such arrangements, the clock rate for each of the at least two processors can be adjusted separately and accordingly.
  • Each task is allocated physical memory. At least one task can utilize at least one virtual memory space that is mapped to at least a portion of the physical memory.
  • execution of a plurality of tasks by a processor system is monitored to determine at least one monitored value for each of the tasks.
  • the at least one monitored value characterizes at least one factor affecting performance of the corresponding task by the processor system.
  • Each task has an associated task performance profile that specifies at least one performance parameter.
  • For each task, the corresponding monitored value is compared with the corresponding at least one performance parameter specified in the associated task performance profile. Based on this comparing, it is determined, for each of the tasks, whether performance resources utilized for the execution of the task should be adjusted or maintained. Thereafter, performance resources can be adjusted by modifying a processor clock rate for each of the tasks for which it was determined that performance resources allocated to such task should be adjusted and maintaining performance resources for each of the tasks for which it was determined that performance resources allocated to the task should be maintained.
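The compare-then-decide step described above might look like the following sketch. The `Decision` enum, the `tolerance` parameter, and the use of completed versus expected work as the compared quantities are illustrative assumptions, not details taken from the text.

```python
from enum import Enum
from typing import Dict, Tuple


class Decision(Enum):
    ADJUST = "adjust"
    MAINTAIN = "maintain"


def decide(monitored_work: float, expected_work: float,
           tolerance: float = 0.0) -> Decision:
    """Adjust resources when the shortfall against the profile exceeds the tolerance."""
    shortfall = expected_work - monitored_work
    return Decision.ADJUST if shortfall > tolerance else Decision.MAINTAIN


def plan(tasks: Dict[str, Tuple[float, float]],
         tolerance: float = 0.0) -> Dict[str, Decision]:
    """Map each task name to a decision, given (monitored, expected) work pairs."""
    return {name: decide(monitored, expected, tolerance)
            for name, (monitored, expected) in tasks.items()}
```

A caller would then modify the clock rate only for tasks marked `ADJUST` and leave the others unchanged.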
  • the monitored value can characterize an amount of work completed by the task.
  • the amount of work completed by the task can be derived from at least one of: an amount of data transferred when executing the task, a number of processor instructions completed when executing the task, processor cycles, execution time, etc.
  • a current program state is determined for each task and the associated task performance profile specifies two or more program states having different performance parameters.
  • the monitored value can be compared to the performance parameter for the current program state (and what is monitored can be changed (e.g., instructions, data transference, etc.)).
  • At least one performance profile of a task being executed can be modified so that a corresponding performance parameter is changed.
  • the monitored value can be compared to the changed performance parameter.
  • a processor clock demand rate required by each task can be determined. Thereafter, an aggregate clock demand rate can be computed based on the determined processor clock demand rate for all tasks. As a result, the processor system clock rate can be adjusted to accommodate the aggregate clock demand rate.
  • a processor clock demand rate required by a particular task can be dynamically adjusted based on a difference between an expected or completed work rate and at least one progress limiting rate (e.g., a progress limit error, etc.). The processor clock demand rate required by each task can be based on an expected time of completion of the corresponding task.
  • the processor system clock rate can be selectively reduced to a level that does not affect the expected time of completion of the tasks.
  • the processor system clock rate can be set to either of a sleep or idle state until such time that the aggregate clock demand is greater than zero.
  • the processor system clock rate can fluctuate above and below the aggregate clock demand rate during a period of time provided that an average processor system clock rate during the period of time is above the aggregate clock demand rate.
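The averaging condition and the sleep/idle rule above can be checked with a short sketch. The representation of a period as `(rate_hz, duration_s)` segments, the state labels, and the use of "greater than or equal" for the comparison are assumptions made for illustration.

```python
from typing import List, Tuple


def average_clock_rate(segments: List[Tuple[float, float]]) -> float:
    """Time-weighted average clock rate over (rate_hz, duration_s) segments."""
    total_time = sum(duration for _, duration in segments)
    if total_time <= 0:
        raise ValueError("total duration must be positive")
    return sum(rate * duration for rate, duration in segments) / total_time


def satisfies_demand(segments: List[Tuple[float, float]],
                     aggregate_demand_hz: float) -> bool:
    """True if the average clock rate over the period covers the aggregate demand."""
    return average_clock_rate(segments) >= aggregate_demand_hz


def power_state(aggregate_demand_hz: float) -> str:
    """Sleep or idle while there is no demand; otherwise stay active."""
    return "sleep" if aggregate_demand_hz <= 0 else "active"
```

For instance, running at 1 GHz for half the period and 200 MHz for the other half averages 600 MHz, which covers a 500 MHz aggregate demand even though the clock is below that demand part of the time.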
  • the performance profile can specify an occupancy quota limiting a number of buffers a task can utilize.
  • the occupancy quota can be dynamically adjusted based on a difference between an expected and completed work rate and one or more progress limiting rates (e.g., progress limit error, etc.). Other performance metrics from a single source or multiple sources can be used to adjust the occupancy quota.
  • Utilization of bandwidth by an input / output subsystem of the processor system can be selectively controlled so that performance requirements of each task are met.
  • the amount of bandwidth utilized can be dynamically adjusted based on a difference between an expected and completed work rate and one or more progress limiting rates (e.g., progress error, etc.).
  • Other performance metrics (e.g., progress limit error, etc.) can also be used to adjust the amount of bandwidth utilized.
  • a system includes at least one processor, a plurality of buffers, a scheduler module, a metering module, an adaptive clock manager module, a cache occupancy manager module, and an input/output bandwidth manager module.
  • the scheduler module can schedule a plurality of tasks to be executed by the at least one processor (and in some implementations each task has an associated execution priority and/or an execution deadline).
  • the metering module can monitor execution of the plurality of tasks and identify tasks that require additional processing resources.
  • the adaptive clock manager module can selectively adjust a clock rate of the at least one processor when executing a task.
  • the cache occupancy manager module can selectively adjust a maximum amount of buffers to be utilized by a task.
  • the input / output bandwidth manager module can selectively adjust a maximum amount of input/output (I/O) bandwidth to be utilized by a task.
  • Articles of manufacture are also described that comprise computer executable instructions permanently stored on computer readable media, which, when executed by a computer, cause the computer to perform operations herein.
  • computer systems are also described that may include a processor and a memory coupled to the processor. The memory may temporarily or permanently store one or more programs that cause the processor to perform one or more of the operations described herein.
  • managing resources according to performance requirements in such a way as to provide performance guarantees / targets while at the same time using minimal resources can allow a computer system to have greater capacity (because the required resources for each component are minimized).
  • the current subject matter can allow a computer system to require fewer/smaller physical computer resources thereby lowering cost and/or reducing physical size.
  • overall power consumption can be reduced because fewer power consuming resources are needed.
  • information such as aggregate clock rates, progress error and progress limit error can be used to inform a scheduler on which processor to schedule tasks.
  • FIG. 1 is a block diagram of a computer system with performance resource management
  • FIG. 2 is a block diagram of a metering module
  • FIG. 3 is a block diagram of a performance resource manager module
  • FIG. 4 is a diagram illustrating a calendar queue
  • FIG. 5 is a process flow diagram illustrating a technique for processor system performance resource management.
  • FIG. 1 is a simplified block diagram of a computer system including a processor system 10, a management module 106, an I/O (Input / Output) subsystem 108 and a system memory 150.
  • the processor system 10 can include one or more of a central processing unit, a processor, a microprocessor, a processor core and the like.
  • the processor system 10 can comprise a plurality of processors and/or a multi-core processor.
  • the functional elements of the processor system depicted in FIG. 1 can be implemented in hardware or with a combination of hardware and software (or firmware).
  • the processor system 10 can include an instruction cache 104, instruction fetch/branch unit 115, an instruction decode module 125, an execution unit 135, a load/store unit 140, a data cache 145, a clock module 180 for controlling the processor system's clock speed(s), an idle state module 184 for controlling the idle or sleep state of the processor system, a DMA (Direct Memory Access) module 186, a performance management system 105 and a scheduler module 130.
  • the performance management system 105 can include a metering module 110 and a performance resource management module 120.
  • a task context memory which stores the task performance profile for a task, can be incorporated into the system memory 150. In other implementations, the task context memory may be independent of the system memory 150.
  • a task may be referred to as a set of instructions to be executed by the processor system 10.
  • the term task can be interpreted to include a group of tasks (unless otherwise stated).
  • a task can also comprise processes such as instances of computer programs that are being executed, threads of execution such as one or more simultaneously, or pseudo-simultaneously, executing instances of a computer program closely sharing resources, etc. that execute within one or more processor systems 10 (e.g., microprocessors) or virtual machines such as virtual execution environments on one or more processors.
  • a virtual machine (VM) is a software implementation of a machine (computer) that executes programs like a real machine.
  • the tasks can be state machines such as image processors, cryptographic processors and the like.
  • the management module 106 can be part of the computer system coupled to the processing module (for example, a program residing in the system memory 150).
  • the management module 106 can create, and/or retrieve previously created performance profiles from system memory 150 or from storage devices such as hard disk drives, non-volatile memory, etc., and assign task performance profiles that specify task performance parameters to tasks directly or through their task context (a set of data containing the information needed to manage a particular task).
  • the I/O subsystem module 108 can be part of the computer system coupled to the processing module (for example, a program residing in the system memory 150).
  • the I/O subsystem module 108 can control, and/or enable, and/or provide the means for the communication between the processing system and the outside world, possibly a human, storage devices, or another processing system.
  • Inputs are the signals or data received by the system, and outputs are the signals or data sent from it.
  • Storage can be used to store information for later retrieval; examples of storage devices include hard disk drives and non-volatile semiconductor memory. Devices for communication between computer systems, such as modems and network cards, typically serve for both input and output.
  • the performance management system 105 of the processor system 10 can control the allocation of processor performance resources to individual tasks and for the processor system.
  • the performance management system 105 can control the allocation of state machine performance resources to individual tasks executing in the state machine.
  • the management module 106 can control the allocation of resources by determining/controlling the task performance profiles (e.g., through a set of policies/rules, etc.). For example, by controlling the allocation of performance resources to all tasks, each task can be provided with throughput and response time guarantees.
  • processor resources of the processor system 10 and/or a computing system incorporating the processor system 10 that includes the I/O subsystem module 108 and the system memory 150, etc.
  • performance resources are utilized.
  • the minimization of performance resources increases efficiency, lowering energy consumption and requiring fewer/smaller physical computer resources, which results in lowered cost.
  • the minimization of performance resources allocated to each task can enable the processor system 10 to have greater capacity enabling more tasks to run on the system while similarly providing throughput and response time guarantees to the larger number of tasks.
  • Tasks can be assigned performance profiles that specify task performance parameters.
  • task performance parameters include work to be completed (We), time interval (Ti), maximum work to be completed (Wm), and cache occupancy and I/O (Input / Output) bandwidth requirements, as described elsewhere in this document.
  • the time interval can represent a deadline such that the task is expected to complete We work within Ti time.
  • the work to be completed can determine the expected work to be performed by the task when it is scheduled for execution.
  • the maximum work to be completed can specify the maximum work the task may accumulate if, for example, the completion of its expected work is postponed.
  • the time interval can also be utilized by the scheduling module 130 to influence scheduling decisions, such as using the time interval to influence when a task should run or as a deadline (the maximal time allowed for the task to complete its expected work).
  • these parameters can dynamically change with task state such that the performance profile parameters are sets of parameters where each set may be associated with one or more program states and changed dynamically during the task's execution.
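A performance profile with per-state parameter sets, as described in the preceding bullets, could be modeled along these lines. The class and field names are hypothetical, and the accrual rule (pending work grows by We per interval, capped at Wm) is one reading of the "maximum work to be completed" parameter rather than a detail confirmed by the text.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class PerformanceParameters:
    work_expected: float  # We: work expected per time interval
    time_interval: float  # Ti: interval (deadline) in seconds
    work_max: float       # Wm: cap on accumulated, not yet completed work


@dataclass
class TaskPerformanceProfile:
    states: Dict[str, PerformanceParameters]  # one parameter set per program state
    current_state: str

    def parameters(self) -> PerformanceParameters:
        return self.states[self.current_state]

    def enter_state(self, state: str) -> None:
        if state not in self.states:
            raise KeyError(state)
        self.current_state = state


def accrue_work(pending: float, params: PerformanceParameters) -> float:
    """Add one interval's expected work to the backlog, never exceeding Wm."""
    return min(pending + params.work_expected, params.work_max)
```

Switching `current_state` changes which parameter set the metering and resource management logic would compare against.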
  • a scheduler module (as well as related aspects that can be used in connection with the current subject matter) is described in U.S. Patent App. Pub. No. 2009/0055829 A1, the contents of which are hereby fully incorporated by reference.
  • Performance profiles can be assigned to groups of tasks similar to the performance profile for an individual task.
  • tasks that are members of a group share a common performance profile and the performance resource parameters can be derived from that common profile.
  • a subset of the performance parameters can be part of a group performance profile while others are part of individual task performance profile.
  • a task profile can include expected work parameters while the task is a member of a group that shares I/O bandwidth and cache occupancy performance parameters.
  • a multiplicity of groups can exist where tasks are members of one or more groups that specify both common and separate performance profile parameters, where the parameters utilized by the performance resource manager are derived from the various performance profiles (through a set of policies/rules).
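The text leaves the derivation policy open ("through a set of policies/rules"). As one possible policy, assumed here only for illustration, later groups override earlier ones and task-level parameters override all group-level ones; the dictionary keys are hypothetical.

```python
from typing import Dict, Iterable


def effective_parameters(task_profile: Dict[str, float],
                         group_profiles: Iterable[Dict[str, float]]) -> Dict[str, float]:
    """Derive the parameters seen by the resource manager.

    Assumed policy: later groups override earlier groups, and the task's
    own parameters override every group-level parameter.
    """
    merged: Dict[str, float] = {}
    for group in group_profiles:
        merged.update(group)
    merged.update(task_profile)
    return merged
```

A task that supplies only its expected work would then inherit I/O bandwidth and cache occupancy parameters from its groups.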
  • the work can be a measure of data transference, processor instructions completed, or other meaningful units of measure of work done by the processor system 10 or state machine such as image processors, cryptographic processors and the like. As this work can be measured to a fine granularity, the performance resources can be similarly managed to a fine granularity.
  • the processor system 10 can execute instructions stored in the system memory 150 where many of the instructions operate on data stored in the system memory 150.
  • the instructions can be referred to as a set of instructions or program instructions throughout this document.
  • the system memory 150 can be physically distributed in the computer system.
  • the instruction cache 104 can temporarily store instructions from the system memory 150.
  • the instruction cache 104 can act as a buffer memory between system memory 150 and the processor system 10. When instructions are to be executed, they are typically retrieved from system memory 150 and copied into the instruction cache 104. If the same instruction or group of instructions is used frequently in a set of program instructions, storage of these instructions in the instruction cache 104 can yield an increase in throughput because system memory accesses are eliminated.
  • the fetch/branch unit 115 can be coupled to the instruction cache 104 and configured to retrieve instructions from the system memory 150 for storage within the instruction cache 104.
  • the instruction decode module 125 can interpret and implement the instructions retrieved. In one implementation, the decode module 125 can break down the instructions into parts that have significance to other portions of the processor system 10.
  • the execution unit 135 can pass the decoded information as a sequence of control signals, for example, to relevant function units of the processor system 10 to perform the actions required by the instructions.
  • the execution unit can include register files and an Arithmetic Logic Unit (ALU).
  • the actions required by the instructions can include reading values from registers, passing the values to an ALU (not shown) to add them together and writing the result to a register.
  • the execution unit 135 can include a load/store unit 140 that is configured to perform access to the data cache 145.
  • the load/store unit 140 can be independent of the execution unit 135.
  • the data cache 145 can be a high-speed storage device, for example a random-access memory, which contains data items that have been recently accessed from system memory 150, for example.
  • the data cache 145 can be accessed independently of the instruction cache 104.
  • FIG. 2 is a block diagram of a metering module 110.
  • the metering module 110 can measure the work performed or amount of work completed by the currently executing task(s).
  • the metering module 110 can monitor the execution of the task to determine a monitored value related to the amount of work completed for the task.
  • the monitored value related to the amount of work completed can be the actual amount of work completed, a counter value or the like that is proportional to or related to the amount of work completed.
  • one implementation of the metering module 110 can comprise a work completed module 210 (Wc), a work to be completed module 220 (We), a comparator module 230, and an adder module 240.
  • the work completed module 210 can be a work completed counter and the work to be completed module 220 can also be a work to be completed counter.
  • the work to be completed counter can be updated based on the work rate to account for the passage of time.
  • the work to be completed can be calculated by the performance resource manager, for example, when the task is selected for execution on the processor system by the scheduler module 130 informing the performance resource manager of the task selection.
  • the metering module 110 can measure and monitor the work completed by a task that is currently being executed on the processor system 10.
  • One or more tasks can be implemented on the processor system 10 (e.g., processor(s) employing simultaneous or pseudo-simultaneous multi-threading, a multi-processor, etc.).
  • the monitored value of work completed, or information about the amount of work completed, can be measured by the number of instructions completed and can be acquired from the instruction fetch/branch unit 115 as illustrated by the arrow 170 in FIG. 1.
  • the monitored values can also be measured by the amount of data transferred through memory operations and can be acquired from the load/store unit 140 as illustrated by the arrow 165 in FIG. 1.
  • the metering module 110, when used to monitor memory operations (bandwidth), can be configured to only account for memory operations to/from certain addresses (such as a video frame buffer). This configuration can be varied on a task-by-task basis (with the configuration information part of the Task Context or task performance profile). In some implementations, there can be separate metering modules 110 for instruction completion and memory operations depending on specific details of the computer system implementation. These metering modules would be similar to a single metering module 110. As some processing modules 10 handle multiple tasks (threads) simultaneously, the instructions completed information can include information as to which thread had completed certain instructions (typically by tagging the information with thread or process or task identifier(s)).
  • the memory operations information can similarly include this thread identifier in order for the metering module 110 to associate these operations with the correct task.
  • Processing modules 10 which include one or more of a central processing unit, a processor, a microprocessor, a processor core, etc., can include a plurality of metering modules 110 for each such processor.
  • a monitored value related to the work performed or work completed Wc can be measured by counting the accesses to memory, instructions completed, and/or other measurable quantities that are meaningful measurements of work by the currently executing task(s).
  • the monitored values, for example the number of accesses to memory (which can include the size of each access), can be received at the adder module 240, where they are summed and provided to the work completed module 210.
  • the monitored values can also be measured by the memory operations that can be acquired from the load/store unit 140 as illustrated by the arrow 165 in FIG. 1.
  • the work to be completed module 220 can receive a parameter value We related to the amount of work to be completed.
  • the parameter value related to the amount of work to be completed and/or work rate can be a predetermined value that is stored in the task performance profile of a task.
  • the work to be completed parameter value can be the actual amount of work to be completed, a counter value or the like that is proportional to or related to the amount of work to be completed.
  • the parameter value can be a constant parameter or calculated from the work rate to include, for example, work credit which can be calculated to account for the time the task waits to be executed by multiplying the work rate by the passage of time.
  • the work credit can also be calculated continuously, or periodically, such that the work to be done increases with the passage of time at the work rate even while the task is running. This computed work to be done can be limited to being no larger than a maximum work parameter.
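  • As a minimal sketch of the work credit calculation described above, the following assumes the work to be completed is a base parameter plus the work rate multiplied by the elapsed time, capped at the maximum work parameter. The function and argument names, and the choice of units, are hypothetical and are not taken from this document.

```python
def work_to_be_completed(base_work, work_rate, elapsed_time, max_work):
    """Return the work to be completed (We), including accrued work credit.

    All names and units here are illustrative assumptions, e.g. work in
    instructions and work_rate in instructions per time unit.
    """
    credit = work_rate * elapsed_time          # work credit = work rate x passage of time
    return min(base_work + credit, max_work)   # limited by the maximum work parameter


# A task that waited 4 time units at a work rate of 50 units per time unit.
print(work_to_be_completed(1000, 50, 4, 5000))
```

  • Calling the function periodically with a growing elapsed time models the case where the work to be done keeps increasing while the task is running.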
  • the parameter values can be predetermined by the management module 106 during the process of mapping a task to a computer system.
  • the work completed can be compared to the work to be completed by the comparator module 230.
  • the result of this comparison, the progress error, can be a value representing a differential between the work completed and work to be completed and/or between the work completion rate and the work to be completed rate (the expected work rate) by including time in the comparison.
  • One implementation can calculate a progress error based on a task achieving its expected work to be completed, within an expected runtime.
  • a negative progress error, in the above example relation, can indicate that the work completed is greater than the expected work at elapsed time qt.
  • a progress error can be used to allocate or adjust the allocation of performance related resources to tasks as detailed elsewhere in this document.
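  • One possible form of such a progress error can be sketched as follows. The sketch assumes the expected work is prorated linearly over the expected runtime, so that a negative result means the task is ahead of its expected work at the elapsed time. The linear proration and all names are assumptions made for illustration only.

```python
def progress_error(work_completed, work_expected, elapsed, expected_runtime):
    """Return expected work at the elapsed time minus work completed.

    Assumption: expected work accrues linearly over expected_runtime.
    Negative result: the task has completed more than expected so far.
    Positive result: the task is behind its expected work.
    """
    fraction = min(elapsed / expected_runtime, 1.0)   # never expect more than We
    return work_expected * fraction - work_completed


# Halfway through the expected runtime with 600 of 1000 units done: ahead by 100.
print(progress_error(600, 1000, 5, 10))
```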
  • One or more instances of meter modules can be utilized to determine if a task's progress is limited (directly or indirectly) by quantities a meter module may measure, for instance memory accesses or cache miss occurrences (i.e., failed attempts to read or write a piece of data in the buffer, resulting in a main memory access, etc.), by metering those quantities and comparing them to pre-calculated parameters.
  • the progress limit measurement can be achieved by providing the We module 220 of a meter module instance with a value to be compared to the accumulated metered quantity in the Wc module 210.
  • the value supplied to module 220 can be considered a progress limit parameter.
  • a comparator function can then compare the two values, including a comparison with respect to time, to determine if progress is limited by the quantity measured; for example, limited by a certain cache miss rate or memory access rate.
  • the result can be expressed as a progress limit error (note that this result is different than the primary progress error arising from comparing work completed to work to be completed).
  • the progress limit error values can be used to allocate or adjust the allocation of performance related resources to tasks as detailed elsewhere in this document.
  • the progress limit parameters may be part of the task's performance profile.
  • a history of progress error and progress limit error values, from current and previous times a task was executing on the processor system, can be utilized to allocate or adjust the allocation of performance related resources to tasks as detailed elsewhere in this document. These values can be represented, for example, as cumulated progress and progress limit error values or as a series of current and historical values (which may be part of the task's performance profile).
  • the adaptive clock manager module 320 can manage the processor system's clock speed(s) by determining the required clock speed and setting the clock rate of the processor system 10 via the clock control module 180.
  • the processor system's clock speed(s) can be determined by computing the aggregate clock demand rate of the tasks in the computer system.
  • the task demand rate can represent the clock rate demand for task i to complete its expected work, We, within a time interval or deadline Ti.
  • the aggregate demand rate can include demand rates from the ready-to-run tasks while in other implementations the demand rate can include estimated demand rates from not ready-to-run tasks, calculating and/or speculating on when those tasks will be ready to run.
  • the overhead demand rate can be a constant parameter or it can depend on system state such that one or more values for the overhead demand rate is selected depending on system state.
  • the overhead demand rate can be contained in the task demand rate (which then can incorporate the processor system overhead activity on behalf of the task).
  • the overhead demand rate can be predetermined by the management module 106 during the process of mapping a task to a computer system.
  • the expected execution time is the expected time for the task to complete its expected work and can be part of the task's performance profile. In general, the expected execution time can be derived from the previous executions of the task (running on the processor system) and can be a measure of the cumulative time for the task's expected work to be completed. In addition, the expected execution time is typically dependent on the processor system frequency.
  • the task's demand rate can be a minimal clock rate for the task to complete its expected work within its time interval or deadline of Ti.
  • the task demand rate can be part of the task's performance profile.
  • the clock manager module 320 can request the processor run at a clock frequency related to the aggregate demand rate, Ard, making such requests when the value of Ard changes in accordance with certain dependencies described elsewhere in this document.
  • the actual system may only be capable of supporting a set of discrete processor and system clock frequencies, in which case the system is set to a supported frequency such that the processor system frequency is higher than or equal to the aggregate demand rate.
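  • The frequency selection described above can be sketched as follows. The sketch assumes that expected work is expressed in clock cycles, that a task demand rate is the expected cycles divided by the time interval Ti, and that the aggregate demand rate is the sum of the task demand rates plus an overhead demand rate. These formulas, the function names, and the fallback to the highest supported frequency are assumptions for illustration.

```python
def task_demand_rate(expected_cycles, interval):
    """Minimal clock rate for a task to finish expected_cycles within interval."""
    return expected_cycles / interval


def aggregate_demand_rate(task_rates, overhead_rate=0.0):
    """Sum of individual task demand rates plus an optional overhead demand rate."""
    return sum(task_rates) + overhead_rate


def select_frequency(aggregate_rate, supported):
    """Pick the lowest supported frequency that is >= the aggregate demand rate.

    Assumption: if demand exceeds every supported frequency, the highest
    supported frequency is returned.
    """
    for freq in sorted(supported):
        if freq >= aggregate_rate:
            return freq
    return max(supported)


# Rates in cycles per microsecond (numerically equal to MHz).
rates = [task_demand_rate(2_000_000, 10_000), task_demand_rate(3_000_000, 20_000)]
ard = aggregate_demand_rate(rates, overhead_rate=50.0)
print(ard, select_frequency(ard, [200, 400, 800]))
```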
  • multiple clock cycles can be required to change the clock frequency in which case the requested clock rate can be adjusted to account for clock switching time.
  • the progress error and/or progress limit errors can be monitored and the task demand rate updated based on one or more of these values, for example at periodic intervals.
  • the updated task demand rate results in a new aggregate demand rate which can result in changing the processor system's clock as described elsewhere in this document.
  • the progress error and progress limit errors can be used to adjust the demand rate directly or through one or more rate adaption functions implemented by the adaptive clock manager module 320. For example, one rate adaption function can adjust the task demand rate if the error is larger than certain limits, while another adaption function can change the demand rate should the error persist for longer than a certain period of time.
  • the rate adaption function(s) can be used to dampen rapid changes in task and/or aggregate demand rates which may be undesirable in particular processor systems and/or arising from certain tasks and can be system dependent and/or task dependent.
  • the rate adaptation functions can be part of the task's performance profile.
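  • A rate adaption function of the kind described above might look like the following sketch, which combines both example behaviors in one hypothetical class: errors inside a limit are ignored, and an adjustment is made only after the error has persisted for a number of consecutive samples. The proportional gain, the sample-count notion of persistence, and all names are assumptions, not details from this document.

```python
class RateAdapter:
    """Dampened adjustment of a task demand rate from progress error samples."""

    def __init__(self, error_limit, persist_samples, gain):
        self.error_limit = error_limit          # errors within this band are ignored
        self.persist_samples = persist_samples  # consecutive samples required to act
        self.gain = gain                        # rate change per unit of error (assumed)
        self._streak = 0                        # consecutive out-of-band samples seen

    def update(self, demand_rate, progress_error):
        """Return the (possibly adjusted) demand rate for one error sample."""
        if abs(progress_error) <= self.error_limit:
            self._streak = 0
            return demand_rate
        self._streak += 1
        if self._streak < self.persist_samples:
            return demand_rate                  # error has not persisted long enough
        self._streak = 0
        # Positive error (task behind) raises the rate; negative lowers it.
        return max(0.0, demand_rate + self.gain * progress_error)


adapter = RateAdapter(error_limit=10, persist_samples=2, gain=0.5)
rate = 100.0
for error in (5, 40, 40):
    rate = adapter.update(rate, error)
print(rate)
```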
  • the adaptive clock manager module 320 can adjust the aggregate demand rate by adjusting the individual task demand rates to account for the tasks meeting their expected work in their expected time.
  • the processor clock frequency can be adjusted relative to the aggregate demand rate while adjusting the individual task demand rates separately with both adjustments arising from progress error and progress limit error values.
  • the processor clock frequency, the aggregate demand rate, and individual task demand rates can be adjusted in a closed loop form to match the summed work completed of all tasks being considered to their work to be completed.
  • Demand rate adjustments can allow the overhead demand rate to be included in the individual task demand rates and thus be an optional parameter.
  • Minimum and maximum threshold parameters can be associated with the task demand rate. These minimum and maximum threshold parameters can relate to progress error and progress limit error and can be used to limit the minimum and/or maximum task demand rate. In another implementation, thresholds can limit the minimum and maximum processor clock frequency chosen during the execution of the task. The minimum and maximum threshold parameters can be part of the task's performance profile.
  • the adaptive clock manager module 320 can detect when adjusting the processor clock frequency higher does not increase the work completed rate and the requested clock rate can be adjusted down without adversely reducing the rate of work completed. This condition can be detected, for example, by observing a change, or lack thereof, in progress error as processor frequency is changed.
  • the clock manager module 320 can adjust the requested clock rate higher when the task's state changes such that increasing the clock frequency does increase the work completed rate. This detection can be accomplished by setting the processor clock frequency such that the progress error meets certain threshold criteria, and when the error falls below a certain threshold, the clock frequency can be adjusted higher as greater progress is indicated by the reduction in progress error.
  • Certain rate adaption function(s) which can include progress error and/or progress limit error, can be utilized in computing the processor clock frequency. These rate adaption functions can be system and/or task dependent and can be part of the task performance profile.
  • the task demand rate, rate adaption parameters, progress limit parameters, and/or thresholds, etc. can dynamically change with task state such that the performance profile parameters are sets of parameters where each set may be associated with one or more program states and changed dynamically during the execution of the task by the management module 106.
  • the performance profile parameters can also be adjusted directly by the task (rather than by the management module 106).
  • a task's demand rate can be added to the aggregate demand rate when the task becomes ready-to-run which may be determined by the scheduler module 130 (e.g., based on scheduling or other events such as becoming unblocked on I/O operations, etc.) or other subsystems such as the I/O subsystem.
  • This demand rate can initially be specified by, or calculated from, the task's performance profile and can be updated based, for example, on the task's work completion progress over time, updated through a rate adaption function as a function of progress error, and the like.
  • the performance profile can contain one or more task state dependent performance parameters. In such cases, the task demand rate can be updated when these parameters change due to task state, or system state, change and can be further updated while the task is executing on the processor system through rate error adaptation (using the progress error and/or progress limit error in the computation of performance profile parameters).
  • the aggregate demand rate can be recalculated from the individual task demand rates.
  • the new aggregate demand rate can be calculated by subtracting the task's cumulative demand rate at the end of the time interval or of the current execution (when the expected work is completed), whichever is later. This can be done by placing the cumulative demand rate in a time-based queuing system, such as a calendar queue, which presents certain information at a specific time in the future.
  • This implementation reserves the task's demand rate within the aggregate demand rate from the time the task rate is first added until the end of its time interval or until it completes execution, whichever is later.
  • the adaptive clock manager module 320 can utilize a calendar queue for example, Calendar Queue Entry 1 (other calendar queue techniques can be utilized).
  • the adaptive clock manager module 320 can insert a task's cumulative clock demand rate into the location Ti - Rt units in the future, where Ti - Rt is the difference between the time interval Ti and the current real time Rt (for example, the tasks under Calendar Queue Entry N-1).
  • the index can be calculated as MIN(Ti - Rt, MAX_CALENDAR_SIZE - 1), where MAX_CALENDAR_SIZE (N) is the number of discrete time entries of the calendar queue, so that the index never exceeds the last entry.
  • the index can represent a time related value in the future from the current time or real time.
  • a task with Ti > Rt can be reinserted into the calendar queue within a certain threshold.
  • the threshold and the size of the calendar can depend on the system design, precision of the real time clock and the desired time granularity.
  • the calendar queue can be a circular queue such that as the real time advances, the previous current time entry becomes the last entry in the calendar queue.
  • entry 0 becomes the oldest queue entry.
  • the index can take into account the fact that the calendar is a circular queue.
  • the current time index can advance from 0 to N-1 as real time advances. Thus at point N-1 the current time index wraps back to zero.
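  • The calendar queue behavior described above can be sketched as follows. The sketch assumes integer time units, an offset into the queue that is limited so that it never goes past the last entry, and reinsertion of any task whose deadline is still in the future when its entry comes due. It does not model the case where a task finishes execution after its time interval. The class name, method names, and the lower bound of one time unit on the offset are hypothetical.

```python
class DemandCalendar:
    """Circular calendar queue holding reserved task demand rates."""

    def __init__(self, size):
        self.size = size                          # N discrete time entries (N >= 2)
        self.entries = [[] for _ in range(size)]
        self.current = 0                          # current time index, wraps N-1 -> 0
        self.aggregate = 0.0                      # aggregate demand rate

    def _insert(self, task_id, rate, deadline, now):
        # Limit the offset to [1, N-1] so the entry always lies in the future.
        offset = min(max(deadline - now, 1), self.size - 1)
        slot = (self.current + offset) % self.size
        self.entries[slot].append((task_id, rate, deadline))

    def add_task(self, task_id, rate, deadline, now):
        """Reserve the task's demand rate until its deadline."""
        self.aggregate += rate
        self._insert(task_id, rate, deadline, now)

    def tick(self, now):
        """Advance one time unit and process the entry that is now current."""
        self.current = (self.current + 1) % self.size
        due, self.entries[self.current] = self.entries[self.current], []
        for task_id, rate, deadline in due:
            if deadline > now:
                self._insert(task_id, rate, deadline, now)   # beyond range: reinsert
            else:
                self.aggregate -= rate                       # release the reservation


cal = DemandCalendar(4)
cal.add_task("a", 100.0, deadline=3, now=0)
cal.add_task("b", 50.0, deadline=10, now=0)
for t in range(1, 4):
    cal.tick(t)
print(cal.aggregate)
```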
  • the adaptive clock manager module 320 can additionally manage entering into and resuming from the processor system's idle state. Should the aggregate clock demand be zero, the clock manager module 320 can place the processor system into an idle state until such time that the aggregate clock rate is/will be greater than zero. In some processor systems, multiple clock cycles may be required to enter and resume from idle state, in which case the time entering and resuming idle state as well as the requested clock rate upon resuming active state can be adjusted to account for idle enter and resume time (as well as clock switching time).
  • the clock manager module 320 can also be capable of achieving certain aggregate demand rates, over a period of time, by requesting a frequency greater than or equal to the aggregate demand rate and placing the processor system into an idle state such that the average frequency (considering the idle time to have frequency of zero) is equal to or higher than the aggregate demand rate.
  • the processor system 10 has greater energy efficiency executing at higher frequency and is then placed in idle state to satisfy certain aggregate demand rates.
  • the requested rate can be adapted to be higher than the calculated aggregate demand rate to bias placing the processing system in idle state.
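  • The relation between requested frequency, idle time, and aggregate demand rate can be illustrated with a short sketch. It assumes the average frequency over a period is the requested frequency multiplied by the fraction of time spent active, and it ignores idle enter and resume time as well as clock switching time. The function name is hypothetical.

```python
def max_idle_fraction(aggregate_rate, requested_freq):
    """Largest fraction of time the processor can idle while the average
    frequency (idle time counted as zero) still meets aggregate_rate.

    Assumption: average frequency = requested_freq * (1 - idle fraction).
    """
    if requested_freq <= 0:
        raise ValueError("requested_freq must be positive")
    if aggregate_rate >= requested_freq:
        return 0.0                       # no headroom for idling
    return 1.0 - aggregate_rate / requested_freq


# Running at 400 MHz against a 300 MHz aggregate demand leaves 25% idle time.
print(max_idle_fraction(300, 400))
```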
  • the parameters from which the frequency and idle state selection are made can be derived from characterizing the processor system by the management module 106 during the process of mapping task(s) to a computer system.
  • the adaptive clock management module can request the processor system enter idle state by signaling the idle state module 184 to idle the processor system.
  • the idle state can be exited when an event, such as an interrupt from an I/O device or timer, etc., occurs.
  • the aggregate demand rate can be calculated individually for each processor or collectively for all processors or a subset of processors or a combination of these. Some tasks can be assigned to certain processors while others may be free to run on any or a certain set of processors.
  • the aggregate demand rate can be calculated for all processors while observing the restrictions and freedoms each task has to run on a certain processor, including an affinity property where it is desirable to run a task on a particular processor.
  • each processor's clock rate and idle state can be controlled individually.
  • the clock manager module 320 can select a combination of clock rates while idling one or more processors to achieve minimum energy.
  • where the processors' clock rates cannot be controlled individually but the idle states can be, a single clock rate can be chosen while idling one or more processors to achieve minimum energy consumption.
  • the clock rate can be chosen such that the aggregate demand rate for all, or a plurality of subsets of, processors is divided among the processors to achieve certain desired goals, such as maximizing throughput or minimizing task completion times of tasks individually or of parallel computations performed by a plurality of tasks. Interaction with the scheduler module 130 (in the determination of which task(s) execute in which processor) may be necessary to achieve the desired goals.
  • the clock module 180 and idle state module 184 can have interaction with other computer system components, not shown in the drawings. These interactions may be necessary to enable changing the one or more processors' clock speed(s) or idle state(s). For example, changing the processor frequency can require changing the clock speed of busses, peripherals, the clock speed of system memory 150, etc. Similarly, to place the processor in or resume from an idle state, certain busses, peripherals, system memory 150, etc., may require preparation before such state is entered (such as quiescing an I/O device and writing its buffers to system memory) or active state is resumed (such as initializing an I/O device to commence operation(s)).
  • the cache occupancy management module 340 can manage the use of buffer or cache occupancy quotas. These occupancy quotas can be numerical limits of the number of buffers a task may (or should) use.
  • the occupancy quota, Oq, and current occupancy Oc can be additionally stored in the task's performance profile.
  • Cache occupancy can be selectively allocated using, for example, a cache replacement algorithm such as those described in co-pending U.S. Pat. App. Ser. No. 13/072,529 entitled "Control of Processor Cache Memory Occupancy", filed on March 25, 2011 and claiming priority to U.S. Pat. App. Ser. No. 61,341,069, the contents of both of which are hereby incorporated by reference.
  • Occupancy in this case can be characterized as an indication of actual number of buffers being used by a task.
  • a buffer is a memory or region of memory used to temporarily hold data (such as an input/output buffer cache) while it is being moved from one place to another or to allow faster access (such as a processor instruction/data cache).
  • as buffers are allocated to the task, the occupancy counter Oc can be incremented; as buffers are de-allocated from the task, the occupancy counter can be decremented. Whenever the occupancy counter is greater than the occupancy quota (Oc > Oq), the task is exceeding its occupancy quota.
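  • The counter and quota comparison can be sketched as follows, assuming a single quota per task and a counter that never drops below zero. The class and method names are hypothetical.

```python
class OccupancyMeter:
    """Tracks a task's current occupancy (Oc) against its quota (Oq)."""

    def __init__(self, quota):
        self.quota = quota   # Oq: number of buffers the task may (or should) use
        self.count = 0       # Oc: number of buffers currently held by the task

    def allocate(self, n=1):
        """Record n buffers allocated to the task."""
        self.count += n

    def release(self, n=1):
        """Record n buffers de-allocated from the task (floor at zero)."""
        self.count = max(0, self.count - n)

    def over_quota(self):
        """True when Oc > Oq, i.e. the task exceeds its occupancy quota."""
        return self.count > self.quota


meter = OccupancyMeter(quota=2)
meter.allocate(3)
print(meter.over_quota())
```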
  • Occupancy quotas can contain multiple quota parameters such that higher or lower priority is given to comparing the occupancy to these additional quotas.
  • a task's occupancy quota can be part of its performance profile.
  • This performance profile parameter may be statically set, may be dependent on program state, or may be dynamically calculated by the cache occupancy manager. Dynamic occupancy quotas may be adjusted based on the performance of the task, for example meeting its deadline, based on the cache miss information during its execution or feedback from execution in terms of expected work compared to work completed using progress error and/or progress limit errors as described elsewhere in this document.
  • the cache occupancy manager can adjust the occupancy quotas. Such adjustments can be based, for example, on pre-defined / configured limits which in turn can be a combination of system-level configured limits and limits contained in the task's performance profile. In one implementation, the occupancy quota can be adjusted based on the differential between a task's expected work rate and work completed rate, utilizing progress error for instance, or the cache miss rate, or a combination of the two.
  • the computation of the occupancy quota can be made such that the occupancy quota can be increased when a task is below its expected work rate or the cache miss rate is above a certain threshold; conversely, the occupancy quota can be reduced when the task is exceeding its expected work or the cache miss rate is below a certain threshold.
  • This computation can also take progress limiting error values into account, for example, by detecting that the progress is being limited by a factor other than occupancy.
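  • A quota adjustment along these lines can be sketched as follows. The sketch assumes a positive progress error means the task is behind its expected work, uses a fixed step size, gives the increase conditions precedence when the conditions conflict, and clamps the result to configured limits. The step size, the precedence rule, and all parameter names are assumptions; progress limit errors are not modeled.

```python
def adjust_quota(quota, progress_error, miss_rate, *,
                 miss_high, miss_low, step, q_min, q_max):
    """Return an updated occupancy quota.

    Increase when the task is behind (progress_error > 0) or misses are high;
    otherwise decrease when the task is ahead or misses are low.
    """
    if progress_error > 0 or miss_rate > miss_high:
        quota += step
    elif progress_error < 0 or miss_rate < miss_low:
        quota -= step
    return max(q_min, min(quota, q_max))   # keep within configured limits


limits = dict(miss_high=0.10, miss_low=0.01, step=2, q_min=2, q_max=16)
print(adjust_quota(8, 5.0, 0.02, **limits))
```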
  • the cache occupancy management module can control occupancy quotas by setting quotas in the instruction cache 104 and/or data cache 145 if they have occupancy quota control mechanisms, or other buffer / caching components that can be part of, or coupled to, the processing system or computer system, such as a program stored in system memory 150.
  • the cache occupancy parameters can relate to a task (or group of tasks) such that the system allocates occupancy quotas to or on behalf of the task, perhaps keeping track of a task ID utilized by both the cache occupancy management module and the respective I/O subsystems.
  • the quota control mechanisms can be implemented in hardware or software (firmware) or a combination of both.
  • Cache occupancy can include mapping virtual memory to physical memory, using memory management techniques that allow tasks to utilize virtual memory address space(s) which may be separate from physical address space(s).
  • the physical memory in effect acts as a cache allowing a plurality of tasks to share physical memory wherein the total size of the virtual memory space(s) may be larger than the size of physical memory, or larger than the physical memory allocated to one or more tasks, and thus the physical memory, and/or a portion thereof, acts as a "cache".
  • Physical memory occupancy of a task can be managed as described elsewhere in this document.
  • the management module may be a separate module, as in 106, or may be an integral part of one or more operating systems, virtual machine monitors, etc.
  • a multiplicity of caches and/or buffer subsystems can exist and thus there can be several occupancy quota parameters utilized and stored in the task's performance profile.
  • These caches and buffers can be embodied in hardware or software (firmware) or a combination of both.
  • a task's occupancy quota(s) can be modified such that work completed rate is matched to the expected work completed rate in a closed loop form where occupancy can be increased to meet expected work rates and/or decreased when expected work rates are being met or exceeded.
  • the modification of occupancy quota(s) can utilize rate adaption functions which may be task dependent and dependent on task state.
  • Task prioritization relative to occupancy quotas can be utilized to guarantee certain higher priority tasks meet their expected work at the expense of lower priority tasks.
  • the management module 106 can control the overall allocation of occupancy quotas by determining/controlling the maximum and minimum occupancy quotas and/or the maximum and minimum changes allowed to occupancy quotas, etc. (e.g., through a set of policies/rules).
  • the I/O bandwidth management module 360 can manage the utilization of bandwidth (which is a measure of data transference per unit time) by the computer system's input/output subsystem(s). I/O operations performed by tasks, or by an operating system on behalf of a task's I/O request(s) for instance, can be managed as a performance resource by the I/O bandwidth manager to ensure that tasks' performance requirements for I/O operations are met.
  • a task's I/O bandwidth can be part of its performance profile. This performance profile parameter can be statically set (based on, for example, program state), or it can be dynamically calculated, such as by the I/O bandwidth manager. Dynamic I/O bandwidth values can be adjusted based on the performance of the task, for example, meeting its calculated deadline or feedback from execution in terms of expected work rate vs. work completed rate.
  • the I/O bandwidth manager can adjust the I/O bandwidth parameters, within certain configured limits which can be a combination of system-level configured limits and limits contained in the task's performance profile.
  • the I/O bandwidth can be modified utilizing progress error and/or progress limit error values, or the expected I/O rate, or a combination of these.
  • the computation of an I/O bandwidth rate can be made such that the I/O bandwidth may be increased or decreased depending on progress and/or progress limit error values and thresholds. In general, these values and thresholds can be determined to match the task's work completed rate to the work to be completed rate without using I/O bandwidth unnecessarily.
  • a task's work can be the I/O bandwidth rate, in which case the task's primary work is the transference of I/O data at a certain rate.
  • I/O bandwidths can be adjusted such that the work completed rate is matched to the work to be completed rate in a closed loop form; where I/O bandwidths can be increased to meet expected work rates and/or decreased when expected work rates are being exceeded considering progress and progress limit errors.
  • I/O resources can be allocated through I/O bandwidth allocations, managed through the I/O bandwidth manager, in such a way as to provide system performance guarantees. Such guarantees can be that the total I/O bandwidth is not over allocated or that certain tasks receive their I/O bandwidth at the expense of others (depending on a set of policies/rules).
  • the I/O bandwidth management module can control I/O bandwidth by setting bandwidth parameters in the I/O subsystem module 108 for such bandwidth control mechanisms that exist, or other I/O components that may be part of, or coupled to, the processing system or computer system, such as a program stored in system memory 150.
  • the I/O bandwidth parameters can relate to a task (or group of tasks) such that the system allocates bandwidth to or on behalf of the task. In some variations, this can comprise keeping track of a task ID to associate with I/O operations such that the I/O bandwidth management module and the respective I/O subsystems may attribute data transference to a specific task.
  • the I/O bandwidth control mechanisms can be implemented in hardware or software (firmware) or a combination of both.
  • DMA controllers can be utilized. Direct memory access is a feature of modern computers and microprocessors that allows certain hardware subsystems within the computer to access system memory for reading and/or writing independently of the central processing unit. Many hardware systems use DMA including disk drive controllers, graphics cards, network cards, sound cards and Graphics Processing Units (GPUs). DMA can also be used for intra-chip data transfer in multi-core processors, especially in multiprocessor system-on-chips, where each processing element is equipped with a local memory (often called scratchpad memory) and DMA can be used for transferring data between the local memory and the main memory.
  • the I/O bandwidth manager can control I/O bandwidth through bandwidth control mechanisms applied to I/O operations by means of bandwidth shaping.
  • Bandwidth shaping can be accomplished by delaying certain data transference requests until sufficient time has passed to accumulate credit for the transference (where credit is a measure of data that is accumulated over time at a certain rate, representing the bandwidth).
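  • Credit-based bandwidth shaping of this kind can be sketched as follows. The sketch assumes credit accrues linearly at the configured rate, is capped at a maximum (an assumption not stated above), and that a transfer request is allowed only once enough credit exists; otherwise it stays delayed. The class and method names are hypothetical, and time is passed in explicitly rather than read from a clock.

```python
class BandwidthShaper:
    """Delays transfers until enough credit has accumulated over time."""

    def __init__(self, rate, max_credit):
        self.rate = rate              # credit gained per time unit (the bandwidth)
        self.max_credit = max_credit  # assumed cap on accumulated credit
        self.credit = 0.0
        self.last = 0.0               # time of the last credit update

    def _accrue(self, now):
        gained = self.rate * (now - self.last)
        self.credit = min(self.max_credit, self.credit + gained)
        self.last = now

    def try_transfer(self, size, now):
        """Return True and spend credit if the transfer may proceed at `now`;
        return False if it must remain delayed."""
        self._accrue(now)
        if size <= self.credit:
            self.credit -= size
            return True
        return False


shaper = BandwidthShaper(rate=100, max_credit=1000)
print(shaper.try_transfer(500, now=2), shaper.try_transfer(500, now=5))
```

  • A request larger than the credit cap would never be admitted in this sketch, so a practical design would need to split such requests or size the cap accordingly.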
  • the bandwidth management of I/O or data transference operations, including DMA operations, can be implemented in hardware or by software (or firmware).
  • the I/O bandwidth management system can request I/O operation prioritization based on tasks matching their work completed to their work to be completed, taking progress error and progress limit error into account. This can, for example, consider progress and progress limit errors for all tasks of interest such that tasks with greater progress error, within certain progress limit error values, are given priority over tasks with lesser progress error within progress limit error values.
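  • One way to read this prioritization is sketched below. It assumes each task carries a progress error and a progress limit error, that "within certain progress limit error values" means not exceeding a single threshold, and that tasks outside the threshold are ordered after those inside it. The dictionary keys, the threshold, and the function name are hypothetical.

```python
def prioritize_io(tasks, limit_error_max):
    """Order tasks for I/O service: greatest progress error first among tasks
    whose progress limit error is within limit_error_max, then the rest."""
    def by_error(task):
        return task["progress_error"]

    within = [t for t in tasks if t["limit_error"] <= limit_error_max]
    outside = [t for t in tasks if t["limit_error"] > limit_error_max]
    return (sorted(within, key=by_error, reverse=True)
            + sorted(outside, key=by_error, reverse=True))


tasks = [
    {"name": "video", "progress_error": 40, "limit_error": 1},
    {"name": "audio", "progress_error": 90, "limit_error": 9},
    {"name": "net", "progress_error": 60, "limit_error": 2},
]
print([t["name"] for t in prioritize_io(tasks, limit_error_max=5)])
```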
  • the progress error and progress limit errors can be used to adjust a task's I/O bandwidth parameters directly or through one or more rate adaption functions implemented by the I/O bandwidth manager.
  • one rate adaption function can be to only adjust the I/O bandwidth if the error is larger than certain limits, while another adaption function may only change the demand rate should the error persist for longer than a certain period of time.
  • the rate adaption function(s) can be system dependent and/or task dependent.
  • the rate adaptation functions can be part of the task's performance profile.
  • Task prioritization relative to I/O bandwidth can be utilized to guarantee certain higher priority tasks meet their expected work at the expense of lower priority tasks.
  • the management module 106 can control the overall allocation of I/O bandwidth by determining/controlling the maximum and minimum I/O bandwidth and/or bandwidth parameters (e.g., through a set of
  • the scheduler module 130 can select the next task(s) to be executed from its list of tasks based on the task parameters including task priority.
  • the scheduler module 130 can indicate that a higher priority task is ready to the processor system 10.
  • the processor system 10 (or software on the processor system 10) can decide to preemptively switch from the currently running task and run the higher priority task.
  • the scheduler module 130 or software in the processor system can indicate that a higher priority task is to be selected for execution, perhaps replacing a currently running task, in which case the task currently running or executed in the processor system 10 can also be indicated to the performance resource manager 120.
  • the state of the metering module(s) 110 utilized for the currently running task can be saved in the task's context and the metering module is directed to monitor the newly selected task, by the performance resource manager (by updating the modules 210, 220 and the comparator function(s) within the metering module). Additional state in the performance resource manager can be modified similarly as a result of this task switching.
  • scheduling can be assigned on a processor-by-processor basis such that a task on a particular processor can be influenced by progress errors and/or progress limit errors of that task. This can also be done on a thread-by-thread basis for multi-thread systems.
  • FIG. 5 is a process flow diagram illustrating a method 500, in which, at 510, execution of a plurality of tasks by a processor system is monitored. Based on the monitoring, at 520, tasks requiring adjustment of performance resources are identified by calculating at least one of a progress error and a progress limit error for each task.
  • performance resources of the processor system allocated to each identified task are adjusted.
  • the adjusting can include, for example, one or more of: adjusting a clock rate of at least one processor in the processor system executing the task, adjusting an amount of cache and/or buffer to be utilized by the task, and adjusting an amount of input/output (I/O) bandwidth to be utilized by the task.
  • Various implementations of the subject matter described herein may be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
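The credit-based bandwidth shaping mentioned in the list above (requests are delayed until enough credit has accumulated at a rate representing the bandwidth) can be illustrated with a minimal sketch. The class name, field names, the credit cap and the choice to debit credit immediately are all assumptions made for illustration; they are not taken from the described system.

```python
from dataclasses import dataclass


@dataclass
class CreditShaper:
    """Illustrative credit-based shaper; every name here is hypothetical."""
    rate_bytes_per_s: float      # credit accumulation rate, i.e. the bandwidth
    max_credit_bytes: float      # cap on accumulated credit (assumption)
    credit_bytes: float = 0.0    # credit currently available (may go negative)
    last_update_s: float = 0.0   # time of the last credit update

    def _accumulate(self, now_s: float) -> None:
        # Credit grows with elapsed time at the configured rate, up to the cap.
        elapsed = max(0.0, now_s - self.last_update_s)
        self.credit_bytes = min(
            self.max_credit_bytes,
            self.credit_bytes + elapsed * self.rate_bytes_per_s,
        )
        self.last_update_s = now_s

    def delay_for(self, request_bytes: float, now_s: float) -> float:
        """Return how long a transfer request must wait before being issued.

        The request is debited immediately, so the credit may become negative;
        later requests then wait until that debt has been repaid.
        """
        self._accumulate(now_s)
        shortfall = request_bytes - self.credit_bytes
        self.credit_bytes -= request_bytes
        return max(0.0, shortfall / self.rate_bytes_per_s)


shaper = CreditShaper(rate_bytes_per_s=1000.0, max_credit_bytes=2000.0)
first_delay = shaper.delay_for(500.0, now_s=0.0)   # no credit yet, must wait
second_delay = shaper.delay_for(500.0, now_s=1.0)  # credit has accumulated
```

A hardware implementation would more likely use counters and a comparator than floating-point arithmetic; the sketch only shows the accounting.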

Abstract

Execution of a plurality of tasks by a processor system is monitored. Based on this monitoring, tasks requiring adjustment of performance resources are identified by calculating at least one of a progress error or a progress limit error for each task. Thereafter, performance resources of the processor system allocated to each identified task are adjusted. Such adjustment can comprise: adjusting a clock rate of at least one processor in the processor system executing the task, adjusting an amount of cache and/or buffers to be utilized by the task, and/or adjusting an amount of input/output (I/O) bandwidth to be utilized by the task. Related systems, apparatus, methods and articles are also described.

Description

FINE GRAIN PERFORMANCE RESOURCE MANAGEMENT
OF COMPUTER SYSTEMS
CROSS REFERENCE TO RELATED APPLICATION
[0001] The present application claims priority under 35 U.S.C. § 119 to U.S. Provisional Application Serial Number 61/341,170, filed March 26, 2010, entitled "METHOD AND APPARATUS FOR FINE GRAIN PERFORMANCE RESOURCE MANAGEMENT OF COMPUTER SYSTEMS", and to U.S. Provisional Application Serial Number 61/341,069, filed March 26, 2010, entitled "METHOD AND APPARATUS FOR THE CONTROL OF PROCESSOR CACHE MEMORY OCCUPANCY", the disclosures of which are incorporated herein by reference.
TECHNICAL FIELD
[0002] The subject matter described herein relates to systems, methods, and articles for management of performance resources utilized by tasks executing in a processor system.
BACKGROUND
[0003] A computing system not only consists of physical resources (processors, memory, peripherals, buses, etc.) but also performance resources such as processor cycles, clock speed, memory and I/O bandwidth and main/cache memory space. In traditional approaches, the performance resources have generally been managed inefficiently or not managed at all. As a result, processors are underutilized, consume too much energy and are robbed of some of their performance potential.
[0004] Many computer systems are capable of dynamically controlling the system and/or processor clock frequency(s). Lowering the clock frequency can dramatically lower the power consumption due to semiconductor scaling effects that allow processor supply voltages to be lowered when the clock frequency is lowered. Thus, being able to reduce the clock frequency, provided the computer system performs as required, can lead to reduced energy consumption, heat generation, etc. Similarly, many processors, as well as associated interfaces and/or peripherals, are able to rapidly enter and exit idle or sleep states where they may consume very small amounts of energy compared to their active state(s). As with lowering the clock frequency, placing one or more processors and/or part or all of a computer system in a sleep state can be used to reduce overall energy consumption provided the computer system performs as required.
[0005] In practice, conventional power management approaches detect idle times or "use modes" with slow system response when one or more processors can be idled or run at a lower clock speed and thus save energy. Power management based on "use modes" often has too coarse of a granularity to effectively take advantage of all energy reduction opportunities all the time.
SUMMARY
[0006] Execution of a plurality of tasks by a processor system is monitored. Based on this monitoring, tasks requiring additional performance resources are identified by calculating a progress error and/or one or more progress limit errors for each task. Thereafter, performance resources of the processor system allocated to each identified task are adjusted. Such adjustment can comprise: adjusting a clock rate of at least one processor in the processor system executing the task, adjusting an amount of cache and/or buffers to be utilized by the task, and/or adjusting an amount of input/output (I/O) bandwidth to be utilized by the task.
[0007] Each task can be selected from a group comprising: a single task, a group of tasks, a thread, a group of threads, a single state machine, a group of state machines, a single virtual machine, a group of virtual machines, and any combination thereof. The processor can comprise: a single processor, a multi-processor, a processor system supporting multi-threading (e.g., simultaneous or pseudo-simultaneous multithreading, etc.), and/or a multi-core processor.
[0008] Monitored performance metrics associated with the tasks executing / to be executed can be changed. For example, data transference can initially be monitored and later processor cycles can be monitored.
[0009] The progress error rate can be equal to a differential between work completed by the task and work to be completed by the task. Alternatively, the progress error rate is equal to a difference between a work completion rate for completed work and an expected work rate for the task. Each task can have an associated execution priority and an execution deadline (and such priority and/or deadline can be specified by a scheduler and/or it can be derived / used as part of a rate adaption function or a parameter to a rate adaption function). In such cases, the performance resources of the processor system can be adjusted to enable each identified task to be completed prior to its corresponding execution deadline and according to its corresponding execution priority.
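The two formulations of progress error given in this paragraph can be written out as a short sketch. The function names and the sign convention (a positive value means the task is behind) are assumptions chosen for illustration; the text itself only states that the error is a difference between the two quantities.

```python
def progress_error(work_to_be_completed: float, work_completed: float) -> float:
    """Difference between work to be completed and work completed.

    Sign convention (an assumption): positive means the task is behind.
    """
    return work_to_be_completed - work_completed


def progress_rate_error(expected_work_rate: float,
                        work_completed: float,
                        elapsed_time_s: float) -> float:
    """Difference between the expected work rate and the observed completion rate."""
    if elapsed_time_s <= 0:
        raise ValueError("elapsed_time_s must be positive")
    return expected_work_rate - work_completed / elapsed_time_s


# A task expected to have done 100 units has done 80: it is 20 units behind.
behind_by = progress_error(work_to_be_completed=100.0, work_completed=80.0)

# Expected 10 units/s, observed 80 units in 10 s (8 units/s): 2 units/s short.
rate_shortfall = progress_rate_error(10.0, work_completed=80.0, elapsed_time_s=10.0)
```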
[0010] Performance resources can be adjusted on a task-by-task basis. Each task can have an associated performance profile that is used to establish the execution priority and the execution deadline for the task. The associated performance profile can specify at least one performance parameter. The performance parameter can, for example, be a cache occupancy quota specifying an initial maximum and/or minimum amount of buffers to be used by the task and the cache occupancy quota can be dynamically adjusted during execution of the task. The cache occupancy quota can be dynamically adjusted based on at least one of: progress error, a cache miss rate for the task, a cache hit rate or any other metrics indicative of performance.
[0011] The performance parameter can specify initial bandwidth requirements for the execution of the task and such bandwidth requirements can be dynamically adjusted during execution of the task.
[0012] A processor clock demand rate required by each task can be determined. Based on such determinations, an aggregate clock demand rate based on the determined processor clock demand rate for all tasks can be computed. In response, the processor system clock rate can be adjusted to accommodate the aggregate clock demand rate. In some cases, the processor system clock rate can be adjusted to the aggregate clock demand rate plus an overhead demand rate. The processor clock demand rate can be calculated as a product of a current processor system clock rate with expected execution time for completion of the task divided by a time interval. The processor clock demand rate for each task can be updated based on errors affecting performance of the task and, as a result, the aggregate clock demand rate can be updated based on the updated processor clock demand rate for each task. Updating of the processor clock demand rate for each task or the aggregate clock demand rate can use at least one adaptation function to dampen or enhance rapid rate changes. A processor clock rate for each task can be added to the aggregate clock demand rate when the task is ready-to-run as determined by a scheduler or other system component that determines when a task is ready-to-run (such as an I/O subsystem completing an I/O request on which the task is blocked). The aggregate clock demand rate can be calculated over a period of time such that, at times, the processor system clock rate is higher than the aggregate clock demand rate, and at other times, the processor system clock rate is lower than the aggregate clock demand rate.
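As a minimal sketch of the arithmetic in this paragraph, the per-task clock demand rate can be computed as the current clock rate multiplied by the expected execution time and divided by the time interval, and the aggregate as the sum over ready-to-run tasks plus an overhead demand rate. The class and function names below are hypothetical, and the adaptation functions that dampen or enhance rate changes are omitted.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class TaskDemand:
    """Hypothetical per-task inputs to the demand calculation."""
    name: str
    expected_exec_time_s: float  # expected execution time at the current clock rate
    interval_s: float            # time interval within which the work must complete
    ready_to_run: bool = True    # only ready-to-run tasks contribute demand


def clock_demand_rate_hz(current_clock_hz: float, task: TaskDemand) -> float:
    # demand = current clock rate * expected execution time / time interval
    return current_clock_hz * task.expected_exec_time_s / task.interval_s


def aggregate_clock_demand_hz(current_clock_hz: float,
                              tasks: Iterable[TaskDemand],
                              overhead_hz: float = 0.0) -> float:
    # Sum the demand of every ready-to-run task, then add the overhead demand.
    return overhead_hz + sum(
        clock_demand_rate_hz(current_clock_hz, t) for t in tasks if t.ready_to_run
    )


tasks = [
    TaskDemand("decode", expected_exec_time_s=0.25, interval_s=1.0),
    TaskDemand("render", expected_exec_time_s=0.5, interval_s=2.0),
    TaskDemand("blocked_io", expected_exec_time_s=0.5, interval_s=1.0,
               ready_to_run=False),
]
target_hz = aggregate_clock_demand_hz(1e9, tasks, overhead_hz=1e7)
```

With a 1 GHz clock, each of the two ready tasks needs a quarter of the clock, so the processor could in principle run at roughly half its current rate plus overhead.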
[0013] The processor system can include at least two processors and the aggregate clock demand rate can be determined for each of the at least two processors and be based on the processor demand rate for tasks executing using the corresponding processor. In such arrangements, the clock rate for each of the at least two processors can be adjusted separately and accordingly.
[0014] Each task is allocated physical memory. At least one task can utilize at least one virtual memory space that is mapped to at least a portion of the physical memory.
[0015] In another aspect, execution of a plurality of tasks by a processor system is monitored to determine at least one monitored value for each of the tasks. The at least one monitored value characterizes at least one factor affecting performance of the corresponding task by the processor system. Each task has an associated task performance profile that specifies at least one performance parameter. For each task, the corresponding monitored value is compared with the corresponding at least one performance parameter specified in the associated task performance profile. Based on this comparing, it is determined, for each of the tasks, whether performance resources utilized for the execution of the task should be adjusted or whether performance resources utilized for the execution of the task should be maintained. Thereafter, performance resources can be adjusted by modifying a processor clock rate for each of the tasks for which it was determined that performance resources allocated to such task should be adjusted and maintaining performance resources for each of the tasks for which it was determined that performance resources allocated to the task should be maintained.
[0016] The monitored value can characterize an amount of work completed by the task. The amount of work completed by the task can be derived from at least one of: an amount of data transferred when executing the task, a number of processor instructions completed when executing the task, processor cycles, execution time, etc.
[0017] In some variations, a current program state is determined for each task and the associated task performance profile specifies two or more program states having different performance parameters. With such an arrangement, the monitored value can be compared to the performance parameter for the current program state (and what is monitored can be changed (e.g., instructions, data transference, etc.)).
[0018] At least one performance profile of a task being executed can be modified so that a corresponding performance parameter is changed. As a result, the monitored value can be compared to the changed performance parameter.
[0019] A processor clock demand rate required by each task can be determined. Thereafter, an aggregate clock demand rate can be computed based on the determined processor clock demand rate for all tasks. As a result, the processor system clock rate can be adjusted to accommodate the aggregate clock demand rate. A processor clock demand rate required by a particular task can be dynamically adjusted based on a difference between an expected or completed work rate and at least one progress limiting rate (e.g., a progress limit error, etc.). The processor clock demand rate required by each task can be based on an expected time of completion of the corresponding task.
[0020] The processor system clock rate can be selectively reduced to a level that does not affect the expected time of completion of the tasks. The processor system clock rate can be set to either of a sleep or idle state until such time that the aggregate clock demand is greater than zero. The processor system clock rate can fluctuate above and below the aggregate clock demand rate during a period of time provided that an average processor system clock rate during the period of time is above the aggregate clock demand rate.
[0021] The performance profile can specify an occupancy quota limiting a number of buffers a task can utilize. The occupancy quota can be dynamically adjusted based on a difference between an expected and completed work rate and one or more progress limiting rates (e.g., progress limit error, etc.). Other performance metrics from a single source or multiple sources can be used to adjust the occupancy quota.
[0022] Utilization of bandwidth by an input / output subsystem of the processor system can be selectively controlled so that performance requirements of each task are met. The amount of bandwidth utilized can be dynamically adjusted based on a difference between an expected and completed work rate and one or more progress limiting rates (e.g., progress error, etc.). Other performance metrics (e.g., progress limit error, etc.) from a single source or multiple sources can be used to adjust the occupancy quota.
[0023] In a further aspect, a system includes at least one processor, a plurality of buffers, a scheduler module, a metering module, an adaptive clock manager module, a cache occupancy manager module, and an input/output bandwidth manager module. The scheduler module can schedule a plurality of tasks to be executed by the at least one processor (and in some implementations each task has an associated execution priority and/or an execution deadline). The metering module can monitor execution of the plurality of tasks and identify tasks that require additional processing resources. The adaptive clock manager module can selectively adjust a clock rate of the at least one processor when executing a task. The cache occupancy manager module can selectively adjust a maximum amount of buffers to be utilized by a task. The input / output bandwidth manager module can selectively adjust a maximum amount of input/output (I/O) bandwidth to be utilized by a task.
[0024] Articles of manufacture are also described that comprise computer executable instructions permanently stored on computer readable media, which, when executed by a computer, cause the computer to perform operations herein. Similarly, computer systems are also described that may include a processor and a memory coupled to the processor. The memory may temporarily or permanently store one or more programs that cause the processor to perform one or more of the operations described herein.
[0025] The subject matter described herein provides many advantages. For example, optimizing cache/buffer utilization and I/O bandwidth (based on performance requirements) in such a way as to provide performance guarantees / targets while at the same time using minimal resources can allow a computer system to have greater capacity (because the required resources for each component are minimized). In addition, the current subject matter can allow a computer system to require fewer/smaller physical computer resources thereby lowering cost and/or reducing physical size. In addition, overall power consumption can be reduced because fewer power consuming resources are needed. In addition, with multi-processors, information such as aggregate clock rates, progress error and progress limit error can be used to inform a scheduler on which processor to schedule tasks.
[0026] The details of one or more variations of the subject matter described herein are set forth in the accompanying drawings and the description below. Other features and advantages of the subject matter described herein will be apparent from the description and drawings, and from the claims.
DESCRIPTION OF DRAWINGS
[0027] FIG. 1 is a block diagram of a computer system with performance resource management;
[0028] FIG. 2 is a block diagram of a metering module;
[0029] FIG. 3 is a block diagram of a performance resource manager module;
[0030] FIG. 4 is a diagram illustrating a calendar queue; and
[0031] FIG. 5 is a process flow diagram illustrating a technique for processor system performance resource management.
[0032] Like reference symbols in the various drawings indicate like elements.
DETAILED DESCRIPTION
[0033] FIG. 1 is a simplified block diagram of a computer system including a processor system 10, a management module 106, an I/O (Input / Output) subsystem 108 and a system memory 150. Some of the commonly known elements of the processor system and the computer system are not shown in the figure in order to aid understanding of the current subject matter. The processor system 10 can include one or more of a central processing unit, a processor, a microprocessor, a processor core and the like. For example, the processor system 10 can comprise a plurality of processors and/or a multi-core processor. The functional elements of the processor system depicted in FIG. 1 can be implemented in hardware or with a combination of hardware and software (or firmware).
[0034] The processor system 10 can include an instruction cache 104, instruction fetch/branch unit 115, an instruction decode module 125, an execution unit 135, a load/store unit 140, a data cache 145, a clock module 180 for controlling the processor system's clock speed(s), an idle state module 184 for controlling the idle or sleep state of the processor system, a DMA (Direct Memory Access) module 186, a performance management system 105 and a scheduler module 130. The performance management system 105 can include a metering module 110 and a performance resource management module 120. In one implementation, a task context memory, which stores the task performance profile for a task, can be incorporated into the system memory 150. In other implementations, the task context memory may be independent of the system memory 150.
[0035] Throughout this document, a task may be referred to as a set of instructions to be executed by the processor system 10. Although the term task is sometimes referred to singularly, the term task can be interpreted to include a group of tasks (unless otherwise stated). A task can also comprise processes such as instances of computer programs that are being executed, threads of execution such as one or more simultaneously, or pseudo-simultaneously, executing instances of a computer program closely sharing resources, etc. that execute within one or more processor systems 10 (e.g., microprocessors) or virtual machines such as virtual execution environments on one or more processors. A virtual machine (VM) is a software implementation of a machine (computer) that executes programs like a real machine. In some implementations, the tasks can be state machines such as image processors, cryptographic processors and the like.
[0036] The management module 106 can be part of the computer system coupled to the processing module (for example, a program residing in the system memory 150). The management module 106 can create, and/or retrieve previously created performance profiles from system memory 150 or from storage devices such as hard disk drives, non-volatile memory, etc., and assign task performance profiles that specify task performance parameters to tasks directly or through their task context (a set of data containing the information needed to manage a particular task). In some implementations, the management module 106 can control the allocation of resources by determining/controlling the task performance profiles (e.g., through a set of policies/rules, etc.).
[0037] The I/O subsystem module 108 can be part of the computer system coupled to the processing module (for example, a program residing in the system memory 150). The I/O subsystem module 108 can control, and/or enable, and/or provide the means for the communication between the processing system and the outside world, possibly a human, storage devices, or another processing system. Inputs are the signals or data received by the system, and outputs are the signals or data sent from it. Storage can be used to store information for later retrieval; examples of storage devices include hard disk drives and non-volatile semiconductor memory. Devices for communication between computer systems, such as modems and network cards, typically serve for both input and output.
[0038] The performance management system 105 of the processor system 10 can control the allocation of processor performance resources to individual tasks and for the processor system. In some implementations, the performance management system 105 can control the allocation of state machine performance resources to individual tasks executing in the state machine. In other implementations the management module 106 can control the allocation of resources by determining/controlling the task performance profiles (e.g., through a set of policies/rules, etc.). For example, by controlling the allocation of performance resources to all tasks, each task can be provided with throughput and response time guarantees. In addition, by allocating the minimum performance resources to all tasks, a minimal amount of the performance resources of the processor system 10 and/or of a computing system incorporating the processor system 10 (including the I/O subsystem module 108, the system memory 150, etc.) is utilized. In one example, the minimization of performance resources increases efficiency, lowering energy consumption and requiring fewer/smaller physical computer resources, resulting in lowered cost. In another example, the minimization of performance resources allocated to each task can enable the processor system 10 to have greater capacity, enabling more tasks to run on the system while similarly providing throughput and response time guarantees to the larger number of tasks.
[0039] Tasks can be assigned performance profiles that specify task performance parameters. Examples of task performance parameters include work to be completed, We, time interval, Ti, and maximum work to be completed, Wm, cache occupancy and I/O (Input / Output) bandwidth requirements as described elsewhere in this document. The time interval can represent a deadline such that the task is expected to complete We work within Ti time. The work to be completed can determine the expected work to be performed by the task when it is scheduled for execution. The maximum work to be completed can specify the maximum work the task may accumulate if, for example, the completion of its expected work is postponed. The time interval, as well as other performance parameters, can also be utilized by the scheduling module 130 to influence scheduling decisions, such as using the time interval to influence when a task should run or as a deadline (the maximal time allowed for the task to complete its expected work). The work rate, Wr, can be expressed through the relation Wr = We/Ti. In one implementation, these parameters can dynamically change with task state such that the performance profile parameters are sets of parameters where each set may be associated with one or more program states and changed dynamically during the task's execution. One example of a scheduler module (as well as related aspects that can be used in connection with the current subject matter) is described in U.S. Patent App. Pub. No. 2009/0055829 A1, the contents of which are hereby fully incorporated by reference.
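The profile parameters We, Ti and Wm and the relation Wr = We/Ti can be captured in a small data structure. The following is a minimal sketch with hypothetical field and method names; the way postponed work is capped at Wm is one possible reading of the text, offered as an assumption rather than as the described implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PerformanceProfile:
    """Hypothetical container for the task performance parameters."""
    work_expected: float    # We: work to be completed per time interval
    time_interval_s: float  # Ti: interval (deadline) for completing We
    work_max: float         # Wm: maximum work the task may accumulate

    @property
    def work_rate(self) -> float:
        # Wr = We / Ti
        return self.work_expected / self.time_interval_s

    def accumulate(self, pending_work: float) -> float:
        """Add one interval of expected work to postponed work, capped at Wm.

        Capping at Wm is an assumption about how the maximum is applied.
        """
        return min(self.work_max, pending_work + self.work_expected)


# Example: 1000 units of work every 0.5 s, with at most 2500 units outstanding.
profile = PerformanceProfile(work_expected=1000.0, time_interval_s=0.5,
                             work_max=2500.0)
rate = profile.work_rate                 # units of work per second
capped = profile.accumulate(2000.0)      # would be 3000, limited by Wm
```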
[0040] Performance profiles can be assigned to groups of tasks similar to the performance profile for an individual task. In one implementation, tasks that are members of a group share a common performance profile and the performance resource parameters can be derived from that common profile.
[0041] In some variations, a subset of the performance parameters can be part of a group performance profile while others are part of an individual task performance profile. For instance, a task profile can include expected work parameters while the task is a member of a group that shares I/O bandwidth and cache occupancy performance parameters. A multiplicity of groups can exist where tasks are members of one or more groups that specify both common and separate performance profile parameters where the parameters utilized by the performance resource manager are derived from the various performance profiles (through a set of policies/rules).
[0042] The work can be a measure of data transference, processor instructions completed, or other meaningful units of measure of work done by the processor system 10 or state machine such as image processors, cryptographic processors and the like. As this work can be measured to a fine granularity, the performance resources can be similarly managed to a fine granularity.
[0043] The processor system 10 can execute instructions stored in the system memory 150 where many of the instructions operate on data stored in the system memory 150. The instructions can be referred to as a set of instructions or program instructions throughout this document. The system memory 150 can be physically distributed in the computer system. The instruction cache 104 can temporarily store instructions from the system memory 150. The instruction cache 104 can act as a buffer memory between system memory 150 and the processor system 10. When instructions are to be executed, they are typically retrieved from system memory 150 and copied into the instruction cache 104. If the same instruction or group of instructions is used frequently in a set of program instructions, storage of these instructions in the instruction cache 104 can yield an increase in throughput because system memory accesses are eliminated.
[0044] The fetch/branch unit 115 can be coupled to the instruction cache 104 and configured to retrieve instructions from the system memory 150 for storage within the instruction cache 104. The instruction decode module 125 can interpret and implement the instructions retrieved. In one implementation, the decode module 125 can break down the instructions into parts that have significance to other portions of the processor system 10. The execution unit 135 can pass the decoded information as a sequence of control signals, for example, to relevant function units of the processor system 10 to perform the actions required by the instructions. The execution unit can include register files and Arithmetic Logic Unit (ALU). The actions required by the instructions can include reading values from registers, passing the values to an ALU (not shown) to add them together and writing the result to a register. The execution unit 135 can include a load/store unit 140 that is configured to perform access to the data cache 145. In other implementations, the load/store unit 140 can be independent of the execution unit 135. The data cache 145 can be a high-speed storage device, for example a random-access memory, which contains data items that have been recently accessed from system memory 150, for example. In one implementation, the data cache 145 can be accessed independently of the instruction cache 104.
[0045] FIG. 2 is a block diagram of a metering module 110. For explanatory purposes, FIG. 2 will be discussed with reference to FIG. 1. The metering module 110 can measure the work performed or amount of work completed by the currently executing task(s). In one implementation, the metering module 110 can monitor the execution of the task to determine a monitored value related to the amount of work completed for the task. The monitored value related to the amount of work completed can be the actual amount of work completed, a counter value or the like that is proportional to or related to the amount of work completed.
[0046] In general, one implementation of the metering module 110 can comprise a work completed module 210 (Wc), a work to be completed module 220 (We), a comparator module 230, and an adder module 240. The work completed module 210 can be a work completed counter and the work to be completed module 220 can also be a work to be completed counter. The work to be completed counter can be updated based on the work rate to account for the passage of time. The work to be completed can be calculated by the performance resource manager, for example, when the task is selected for execution on the processor system by the scheduler module 130 informing the performance resource manager of the task selection.
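The two counters described here can be modeled in software as a rough sketch: a work completed counter (Wc) that is incremented as work is observed, and a work to be completed counter (We) that grows with the work rate as time passes. The class and method names are hypothetical, and the single comparison shown is only an assumed stand-in for the comparator module, whose exact behavior is not specified in this paragraph.

```python
class WorkMeter:
    """Illustrative software model of the Wc and We counters (names assumed)."""

    def __init__(self, work_rate: float) -> None:
        self.work_rate = work_rate        # Wr, units of work per second
        self.work_to_be_completed = 0.0   # We counter
        self.work_completed = 0.0         # Wc counter

    def advance_time(self, elapsed_s: float) -> None:
        # We is updated from the work rate to account for the passage of time.
        self.work_to_be_completed += self.work_rate * elapsed_s

    def record_work(self, units: float) -> None:
        # Wc is incremented as completed work (instructions, bytes) is observed.
        self.work_completed += units

    def is_behind(self) -> bool:
        # Assumed comparator: the task is behind when Wc has not reached We.
        return self.work_completed < self.work_to_be_completed


meter = WorkMeter(work_rate=2000.0)
meter.advance_time(0.5)      # We becomes 1000 units
meter.record_work(600.0)     # Wc becomes 600 units
behind_midway = meter.is_behind()
meter.record_work(400.0)     # Wc reaches 1000 units
behind_at_end = meter.is_behind()
```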
[0047] The metering module 110 can measure and monitor the work completed by a task that is currently being executed on the processor system 10. One or more tasks can be implemented on the processor system 10 (e.g., processor(s) employing simultaneous or pseudo-simultaneous multi-threading, a multi-processor, etc.). In one implementation the monitored value of work completed or information about the amount of work completed can be measured by the number of instructions completed and can be acquired from the instruction fetch/branch unit 115 as illustrated by the arrow 170 in FIG. 1. The monitored values can also be measured by the amount of data transferred through memory operations and can be acquired from the load/store unit 140 as illustrated by the arrow 165 in FIG. 1. The metering module 110, when used to monitor memory operations (bandwidth), can be configured to only account for memory operations to/from certain addresses (such as a video frame buffer). This configuration can be varied on a task-by-task basis (with the configuration information part of the Task Context or task performance profile). In some implementations, there can be separate metering modules 110 for instruction completion and memory operations depending on specific details of the computer system implementation. These metering modules would be similar to a single metering module 110. As some processing modules 10 handle multiple tasks (threads) simultaneously, the instructions completed information can include information as to which thread had completed certain instructions (typically by tagging the information with thread or process or task identifier(s)). The memory operations information can similarly include this thread identifier in order for the metering module 110 to associate these operations with the correct task. Processing modules 10 which include one or more of a central processing unit, a processor, a microprocessor, a processor core, etc. can include a plurality of metering modules 110 for each such processor.
[0048] A monitored value related to the work performed or work completed Wc can be measured by counting the accesses to memory, instructions completed, and/or other measurable quantities that are meaningful measurements of work by the currently executing task(s). The monitored values, for example the number of accesses to memory, which can include the size of each access, can be received at the adder module 240 where they are summed and provided to the work completed module 210. The monitored values can also be measured by the memory operations that can be acquired from the load/store unit 140 as illustrated by the arrow 165 in FIG. 1. The work to be completed module 220 can receive a parameter value We related to the amount of work to be completed. The parameter value related to the amount of work to be completed and/or work rate can be a predetermined value that is stored in the task performance profile of a task. The work to be completed parameter value can be the actual amount of work to be completed, a counter value or the like that is proportional to or related to the amount of work to be completed. The parameter value can be a constant parameter or calculated from the work rate to include, for example, work credit which can be calculated to account for the time the task waits to be executed by multiplying the work rate by the passage of time. The work credit can also be calculated continuously, or periodically, such that the work to be done increases with the passage of time at the work rate even while the task is running. This computed work to be done can be limited to being no larger than a maximum work parameter. In one implementation, the parameter values can be predetermined by the management module 106 during the process of mapping a task to a computer system.
[0049] The work completed can be compared to the work to be completed by the comparator module 230. The result of this comparison, the progress error, can be a value representing a differential between the work completed and work to be completed and/or between the work completion rate and the work to be completed rate (the expected work rate) by including time in the comparison. One implementation can calculate a progress error based on a task achieving its expected work to be completed, within an expected runtime. For example, the error may be calculated by the relation: Progress Error = ( qt / Qi ) * We - Wc, where qt is the elapsed time since the task started executing and Qi is the expected time to complete the work to be completed, which may be dependent on processor and/or computer system state, such as the processor system clock frequency. A negative progress error, in the above example relation, can indicate that the work completed is greater than the expected work at elapsed time qt. A progress error can be used to allocate or adjust the allocation of performance related resources to tasks as detailed elsewhere in this document.
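The example relation above can be illustrated with a short Python sketch. The function name and argument names are hypothetical and chosen only for illustration; the arithmetic follows the relation Progress Error = ( qt / Qi ) * We - Wc as stated.

```python
def progress_error(qt, qi, we, wc):
    """Progress Error = (qt / Qi) * We - Wc.

    qt: elapsed time since the task started executing
    qi: expected time to complete the work to be completed (must be > 0)
    we: work to be completed
    wc: work completed so far
    """
    if qi <= 0:
        raise ValueError("expected time Qi must be positive")
    return (qt / qi) * we - wc


# Halfway through the expected time, 500 of 1000 work units are expected.
ahead = progress_error(5.0, 10.0, 1000, 600)   # negative: ahead of expected work
behind = progress_error(5.0, 10.0, 1000, 400)  # positive: behind expected work
```

A negative result corresponds to the case described above in which the work completed exceeds the expected work at elapsed time qt.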
[0050] One or more instances of meter modules can be utilized to determine if a task's progress is limited (directly or indirectly) by quantities a meter module may measure, for instance memory accesses or cache miss occurrences (i.e., failed attempts to read or write a piece of data in the buffer resulting in a main memory access, etc.), by metering those quantities and comparing them to pre-calculated parameters. In one implementation, the progress limit measurement can be achieved by providing the We module 220 of a meter module instance with a value to be compared to the accumulated metered quantity in the Wc module 210. The value supplied to module 220 can be considered a progress limit parameter. A comparator function can then compare the two values, including a comparison with respect to time, to determine if progress is limited by the quantity measured; for example, limited by a certain cache miss rate or memory access rate. The result can be expressed as a progress limit error (note that this result is different than the primary progress error arising from comparing work completed to work to be completed). The progress limit error values can be used to allocate or adjust the allocation of performance related resources to tasks as detailed elsewhere in this document. The progress limit parameters may be part of the task's performance profile.
[0051] A history of progress error and progress limit error values, from current and previous times a task was executing on the processor system, can be utilized to allocate or adjust the allocation of performance related resources to tasks as detailed elsewhere in this document. These values can be represented, for example, as cumulated progress and progress limit error values or as a series of current and historical values (which may be part of the task's performance profile).
[0052] The adaptive clock manager module 320 can manage the processor system's clock speed(s) by determining the required clock speed and setting the clock rate of the processor system 10 via the clock control module 180. The processor system's clock speed(s) can be determined by computing the aggregate clock demand rate of the tasks in the computer system. The aggregate clock demand rate, Ard, which represents the cumulated demand rate of all tasks being considered, can be computed as Ard = SUM i = Tasks { Trd[i] } + Ro, where Trd[i] is the task demand rate for task i and Ro is the overhead demand rate of the processor/system not accounted for in the individual tasks' demand rates. The task demand rate can represent the clock rate demand for task i to complete its expected work, We, within a time interval or deadline Ti. In one implementation, the aggregate demand rate can include demand rates from the ready-to-run tasks while in other implementations the demand rate can include estimated demand rates from not ready-to-run tasks, calculating and/or speculating on when those tasks will be ready to run.
[0053] The overhead demand rate can be a constant parameter or it can depend on system state such that one or more values for the overhead demand rate are selected depending on system state. For some implementations, the overhead demand rate can be contained in the task demand rate (which then can incorporate the processor system overhead activity on behalf of the task). In one implementation, the overhead demand rate can be predetermined by the management module 106 during the process of mapping a task to a computer system.
[0054] In cases in which the processor system's clock frequency F is constant while task i is running, the task demand rate can be calculated as the product of the frequency and expected execution time divided by the time interval: Trd[i] = ( F * Qi ) / Ti, where F is the actual clock rate during the task's expected execution time Qi and Ti is the time interval or deadline. The expected execution time is the expected time for the task to complete its expected work and can be part of the task's performance profile. In general, the expected execution time can be derived from the previous executions of the task (running on the processor system) and can be a measure of the cumulative time for the task's expected work to be completed. In addition, the expected execution time is typically dependent on the processor system frequency. The task's demand rate can be a minimal clock rate for the task to complete its expected work within its time interval or deadline of Ti. In another implementation in which the processor system's frequency changes during the task's execution (because the aggregate clock demand rate changes, for instance), the task demand rate can be computed as Trd[i] = SUM j = FrequencyChanges { ( F[j] * Qi[j] ) / Ti }, where the expected execution time is divided into segments, one for each frequency (change) sub-interval. The task demand rate can be part of the task's performance profile.
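The demand rate relations described above can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation; the function names and the choice of units (MHz for frequency, milliseconds for time) are assumptions made for the example.

```python
def task_demand_rate(f, qi, ti):
    """Trd[i] = (F * Qi) / Ti for a constant clock frequency F."""
    return (f * qi) / ti


def task_demand_rate_segmented(segments, ti):
    """Trd[i] = SUM over j of (F[j] * Qi[j]) / Ti.

    segments: list of (F[j], Qi[j]) pairs, one per frequency sub-interval.
    """
    return sum(f * q for f, q in segments) / ti


def aggregate_demand_rate(task_rates, ro=0):
    """Ard = SUM over tasks of Trd[i] + Ro (overhead demand rate)."""
    return sum(task_rates) + ro


# Frequencies in MHz, times in ms (units are an assumption of this sketch).
trd_a = task_demand_rate(1000, 2, 10)                          # 2 ms at 1000 MHz every 10 ms
trd_b = task_demand_rate_segmented([(1000, 1), (500, 2)], 10)  # frequency changed mid-run
ard = aggregate_demand_rate([trd_a, trd_b], ro=50)
```

In this example both tasks demand 200 MHz, and the aggregate demand rate including a 50 MHz overhead is 450 MHz.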
[0055] In one implementation, the clock manager module 320 can request the processor run at a clock frequency related to the aggregate demand rate, Ard, making such requests when the value of Ard changes in accordance with certain dependencies described elsewhere in this document. The actual system may only be capable of supporting a set of discrete processor and system clock frequencies, in which case the system is set to a supported frequency such that the processor system frequency is higher than or equal to the aggregate demand rate. In some processor systems, multiple clock cycles can be required to change the clock frequency, in which case the requested clock rate can be adjusted to account for clock switching time.
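Selecting a supported frequency that is higher than or equal to the aggregate demand rate can be sketched as below. The function name is hypothetical, and the fallback to the highest supported frequency when demand exceeds every supported value is an assumption of this sketch; the text above does not specify that case.

```python
def select_frequency(ard, supported):
    """Return the lowest supported frequency that is >= the aggregate demand rate.

    Assumption: if the demand exceeds every supported frequency,
    fall back to the highest supported frequency.
    """
    if not supported:
        raise ValueError("at least one supported frequency is required")
    candidates = [f for f in sorted(supported) if f >= ard]
    return candidates[0] if candidates else max(supported)


# Example with a hypothetical set of supported frequencies in MHz.
freq = select_frequency(450, [200, 400, 600, 800])  # selects 600
```

Choosing the lowest qualifying frequency is one possible policy; an implementation could instead bias toward a higher frequency to account for clock switching time.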
[0056] During each task's execution, the progress error and/or progress limit errors can be monitored and the task demand rate updated based on one or more of these values, for example at periodic intervals. In one implementation, the updated task demand rate results in a new aggregate demand rate which can result in changing the processor system's clock as described elsewhere in this document. The progress error and progress limit errors can be used to adjust the demand rate directly or through one or more rate adaption functions implemented by the adaptive clock manager module 320. For example, one rate adaption function can adjust the task demand rate if the error is larger than certain limits, while another adaption function can change the demand rate should the error persist for longer than a certain period of time. The rate adaption function(s) can be used to dampen rapid changes in task and/or aggregate demand rates which may be undesirable in particular processor systems and/or arising from certain tasks and can be system dependent and/or task dependent. The rate adaptation functions can be part of the task's performance profile.
[0057] The adaptive clock manager module 320 can adjust the aggregate demand rate by adjusting the individual task demand rates to account for the tasks meeting their expected work in their expected time. In another variation, the processor clock frequency can be adjusted relative to the aggregate demand rate while adjusting the individual task demand rates separately, with both adjustments arising from progress error and progress limit error values. Thus, the processor clock frequency, the aggregate demand rate, and individual task demand rates can be adjusted in a closed loop form so that, summed over all tasks being considered, the work completed matches the work to be completed.
[0058] Demand rate adjustments can allow the overhead demand rate to be included in the individual tasks' demand rates and thus be an optional parameter.
[0059] Minimum and maximum threshold parameters can be associated with the task demand rate. These minimum and maximum threshold parameters can relate to progress error and progress limit error and can be used to limit the minimum and/or maximum task demand rate. In another implementation, thresholds can limit the minimum and maximum processor clock frequency chosen during the execution of the task. The minimum and maximum threshold parameters can be part of the task's performance profile.
[0060] The adaptive clock manager module 320 can detect when adjusting the processor clock frequency higher does not increase the work completed rate and the requested clock rate can be adjusted down without adversely reducing the rate of work completed. This condition can be detected, for example, by observing a change, or lack thereof, in progress error as processor frequency is changed. The clock manager module 320 can adjust the requested clock rate higher when the task's state changes such that increasing the clock frequency higher does increase the work completed rate. This detection can be accomplished by setting the processor clock frequency such that the progress error meets a certain threshold criteria, and when the error falls below a certain threshold, the clock frequency can be adjusted higher as greater progress is indicated by the reduction in progress error. Certain rate adaption function(s), which can include progress error and/or progress limit error, can be utilized in computing the processor clock frequency. These rate adaption functions can be system and/or task dependent and can be part of the task performance profile.
[0061] The task demand rate, rate adaption parameters, progress limit parameters, and/or thresholds, etc. can dynamically change with task state such that the performance profile parameters are sets of parameters where each set may be associated with one or more program states and changed dynamically during the execution of the task by the management module 106. In addition or alternatively, such task demand rate, rate adaptation parameters, progress limit parameters, and/or thresholds, etc. can be adjusted directly by the task (rather than the management module 106).
[0062] A task's demand rate can be added to the aggregate demand rate when the task becomes ready-to-run, which may be determined by the scheduler module 130 (e.g., based on scheduling or other events such as becoming unblocked on I/O operations, etc.) or other subsystems such as the I/O subsystem. This demand rate can initially be specified by, or calculated from, the task's performance profile and can be updated based, for example, on the task's work completion progress over time, updated through a rate adaption function as a function of progress error, and the like. The performance profile can contain one or more task state dependent performance parameters. In such cases, the task demand rate can be updated when these parameters change due to task state, or system state, change and can be further updated while the task is executing on the processor system through rate error adaptation (using the progress error and/or progress limit error in the computation of performance profile parameters).
[0063] In cases in which a task becomes non-runnable (based on, e.g., scheduling or other events such as becoming blocked on I/O operations, etc.), the aggregate demand rate can be recalculated from the individual task demand rates. In another implementation that can have reduced overhead requirements as compared to calculating each individual task demand rate, the new aggregate demand rate can be calculated by subtracting the task's cumulative demand rate at the end of the time interval or current execution (when the expected work is completed), whichever is later, by placing the cumulative demand rate in a time-based queuing system, such as a calendar queue, which presents certain information at a specific time in the future. This implementation reserves the task's demand rate within the aggregate demand rate from the time the task rate is first added until the end of its time interval or until it completes execution, whichever is later.
[0064] The adaptive clock manager module 320 can utilize a calendar queue, for example, Calendar Queue Entry 1 (other calendar queue techniques can be utilized). The adaptive clock manager module 320 can insert a task's cumulative clock demand rate into the location Ti - Rt units in the future, that is, the difference between the time interval Ti and the current real time Rt (for example the tasks under Calendar Queue Entry N-1). As the calendar queue is of finite size, the index can be calculated as MIN(Ti - Rt, MAX_CALENDAR_SIZE - 1) where MAX_CALENDAR_SIZE (N) is the number of discrete time entries of the calendar queue. When the current real time Rt advances to a non-empty calendar location, the clock manager module 320 can subtract from the aggregate demand rate each task's cumulated clock demand rate at that location for which Ti = Rt. This occurs when Ti = Rt at calendar queue entry 0 illustrated in FIG. 4. The index can represent a time related value in the future from the current time or real time. A task with Ti > Rt can be reinserted into the calendar queue within a certain threshold. The threshold and the size of the calendar can depend on the system design, precision of the real time clock and the desired time granularity. The calendar queue can be a circular queue such that as the real time advances, the previous current time entry becomes the last entry in the calendar queue. In the example 400 of FIG. 4, when the real time advances to entry 1, entry 0 becomes the oldest queue entry. The index can take into account the fact that the calendar is a circular queue. The current time index can advance from 0 to N-1 as real time advances. Thus at point N-1 the current time index wraps back to zero.
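The calendar queue behavior described above can be sketched in Python as a circular array of time slots. This is an illustrative sketch under stated assumptions: the class and method names are hypothetical, time advances in whole ticks, the offset is clamped to the last calendar entry (so that a far-future deadline fits the finite queue and is reinserted when reached), and a deadline that has already passed is released on the next tick.

```python
class CalendarQueue:
    """Circular calendar of demand-rate releases (requires size >= 2)."""

    def __init__(self, size):
        self.size = size
        self.slots = [[] for _ in range(size)]
        self.current = 0      # index of the current-time entry
        self.now = 0          # real time Rt, in ticks
        self.aggregate = 0.0  # aggregate demand rate

    def add_task(self, rate, deadline):
        """Reserve the task's rate now and schedule its release at deadline Ti."""
        self.aggregate += rate
        self._insert(rate, deadline)

    def _insert(self, rate, deadline):
        # Clamp the offset so it stays inside the finite calendar.
        offset = min(max(deadline - self.now, 1), self.size - 1)
        index = (self.current + offset) % self.size  # circular indexing
        self.slots[index].append((rate, deadline))

    def tick(self):
        """Advance real time one tick and release rates whose deadline arrived."""
        self.now += 1
        self.current = (self.current + 1) % self.size
        due, self.slots[self.current] = self.slots[self.current], []
        for rate, deadline in due:
            if deadline <= self.now:
                self.aggregate -= rate        # Ti reached: release the rate
            else:
                self._insert(rate, deadline)  # Ti > Rt: reinsert further out
```

Because each task's rate is subtracted only when its slot is reached, the aggregate demand rate is maintained incrementally rather than recomputed from every task.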
[0065] The adaptive clock manager module 320 can additionally manage entering into and resuming from the processor system's idle state. Should the aggregate clock demand be zero, the clock manager module 320 can place the processor system into an idle state until such time that the aggregate clock rate is/will be greater than zero. In some processor systems, multiple clock cycles may be required to enter and resume from idle state, in which case the time entering and resuming idle state, as well as the requested clock rate upon resuming active state, can be adjusted to account for idle enter and resume time (as well as clock switching time).
[0066] The clock manager module 320 can also be capable of achieving certain aggregate demand rates, over a period of time, by requesting a frequency greater than or equal to the aggregate demand rate and placing the processor system into an idle state such that the average frequency (considering the idle time to have a frequency of zero) is equal to or higher than the aggregate demand rate. This can be useful in implementations in which the processor system 10 has greater energy efficiency when executing at a higher frequency and then being placed in idle state to satisfy certain aggregate demand rates. In some implementations, the requested rate can be adapted to be higher than the calculated aggregate demand rate to bias placing the processing system in idle state.
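The averaging described above can be made concrete with a small sketch. Treating idle time as frequency zero, the fraction of a period the processor must remain active at the requested frequency is the aggregate demand rate divided by that frequency. The function name is hypothetical, and idle entry and resume time is ignored here as a simplifying assumption.

```python
def active_fraction(ard, requested_freq):
    """Minimum fraction of a period to run at requested_freq so that the
    average frequency (idle counted as zero) is at least the demand rate.

    Assumption: idle enter/resume and clock switching time are ignored.
    """
    if requested_freq <= 0:
        raise ValueError("requested frequency must be positive")
    if requested_freq < ard:
        raise ValueError("requested frequency must be >= aggregate demand rate")
    return ard / requested_freq


# Running at 600 MHz to satisfy a 300 MHz aggregate demand rate:
run = active_fraction(300, 600)  # active half the time
idle = 1 - run                   # idle for the remainder
```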
[0067] The parameters from which the frequency and idle state selection are made can be derived from characterizing the processor system by the management module 106 during the process of mapping task(s) to a computer system.
[0068] The adaptive clock management module can request the processor system enter idle state by signaling the idle state module 184 to idle the processor system. The idle state can be exited when an event, such as an interrupt from an I/O device or timer, etc., occurs.
[0069] In multiprocessor systems, the aggregate demand rate can be calculated individually for each processor or collectively for all processors or a subset of processors or a combination of these. Some tasks can be assigned to certain processors while others may be free to run on any or a certain set of processors. The aggregate demand rate can be calculated for all the processors while observing the restrictions and freedoms each task has to run on a certain processor, including an affinity property where it is desirable to run a task on a particular processor.
[0070] In one implementation of a multiprocessor system, each processor's clock rate and idle state can be controlled individually. In this case, the clock manager module 320 can select a combination of clock rates while idling one or more processors to achieve minimum energy consumption. In cases in which clock rates may not be adjusted individually, but the idle states may be, a single clock rate can be chosen while idling one or more processors to achieve minimum energy consumption. In another implementation of a multiprocessor system, the clock rate can be chosen such that the aggregate demand rate for all, or a plurality of subsets of, processors is divided among the processors to achieve certain desired goals, such as maximizing throughput or minimizing task completion times of tasks individually or of parallel computations performed by a plurality of tasks. Interaction with the scheduler module 130 (in the determination of which task(s) execute in which processor) may be necessary to achieve the desired goals.
[0071] The clock module 180 and idle state module 184 can have interaction with other computer system components, not shown in the drawings. These interactions may be necessary to enable changing the one or more processors' clock speed(s) or idle state(s). For example, changing the processor frequency can require changing the clock speed of busses, peripherals, the clock speed of system memory 150, etc. Similarly, to place the processor in or resume from an idle state, certain busses, peripherals, system memory 150, etc. may require preparation before such state is entered (such as quiescing an I/O device and writing its buffers to system memory) or active state is resumed (such as initializing an I/O device to commence operation(s)).
[0072] The cache occupancy management module 340 can manage the use of buffer or cache occupancy quotas. These occupancy quotas can be numerical limits of the number of buffers a task may (or should) use. The occupancy quota, Oq, and current occupancy Oc can be additionally stored in the task's performance profile. Cache occupancy can be selectively allocated using, for example, a cache replacement algorithm such as those described in co-pending U.S. Pat. App. Ser. No. 13/072,529 entitled "Control of Processor Cache Memory Occupancy", filed on March 25, 2011 and claiming priority to U.S. Pat. App. Ser. No. 61,341,069, the contents of both applications are hereby incorporated by reference.
[0073] Occupancy in this case can be characterized as an indication of the actual number of buffers being used by a task. A buffer is a memory or region of memory used to temporarily hold data (such as an input/output buffer cache) while it is being moved from one place to another or to allow faster access (such as a processor instruction/data cache). As buffers (or cache blocks/lines) are allocated to a task, the occupancy counter Oc can be incremented; as buffers are de-allocated from the task, the occupancy counter can be decremented. Whenever the occupancy counter is greater than the occupancy quota (Oc > Oq), the task is exceeding its occupancy quota. Exceeding the occupancy quota can cause that task's buffers to be replaced preferentially (cache block/line replacement) or prevent the allocation of new buffers until the task is in compliance with its quota (Oc <= Oq). Occupancy quotas can contain multiple quota parameters such that higher or lower priority is given to comparing the occupancy to these additional quotas.
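The occupancy counter and quota comparison described above can be sketched as follows. The class and method names are hypothetical and serve only to illustrate the bookkeeping; an actual mechanism may be implemented in hardware.

```python
class OccupancyTracker:
    """Tracks a task's buffer occupancy Oc against its quota Oq."""

    def __init__(self, quota):
        self.quota = quota   # Oq
        self.occupancy = 0   # Oc

    def allocate(self):
        """A buffer (or cache block/line) was allocated to the task."""
        self.occupancy += 1

    def release(self):
        """A buffer was de-allocated from the task."""
        if self.occupancy > 0:
            self.occupancy -= 1

    def over_quota(self):
        """True when Oc > Oq, i.e. the task is exceeding its quota."""
        return self.occupancy > self.quota


tracker = OccupancyTracker(quota=2)
for _ in range(3):
    tracker.allocate()
exceeded = tracker.over_quota()  # three buffers held against a quota of two
```

When `over_quota()` holds, a replacement policy could prefer evicting that task's buffers or defer new allocations until the task is back within its quota.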
[0074] A task's occupancy quota can be part of its performance profile. This performance profile parameter may be statically set, may be dependent on program state, or may be dynamically calculated by the cache occupancy manager. Dynamic occupancy quotas may be adjusted based on the performance of the task, for example meeting its deadline, based on the cache miss information during its execution or feedback from execution in terms of expected work compared to work completed using progress error and/or progress limit errors as described elsewhere in this document.
[0075] The cache occupancy manager can adjust the occupancy quotas. Such adjustments can be based, for example, on pre-defined / configured limits which in turn can be a combination of system-level configured limits and limits contained in the task's performance profile. In one implementation, the occupancy quota can be adjusted based on the differential between a task's expected work rate and work completed rate, utilizing progress error for instance, or the cache miss rate, or a combination of the two. In such a variation, the computation of the occupancy quota can be made such that the occupancy quota can be increased when a task is below its expected work rate or the cache miss rate is above a certain threshold; conversely, the occupancy quota can be reduced when the task is exceeding its expected work or the cache miss rate is below a certain threshold. This computation can also take progress limit error values into account, for example, by detecting that the progress is being limited by a factor other than occupancy.
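One possible form of the quota adjustment described above is sketched below. The function name, the fixed step size, and the specific thresholds are all assumptions introduced for illustration; a positive progress error is taken to mean the task is below its expected work, consistent with the example progress error relation given earlier in this document.

```python
def adjust_quota(quota, progress_error, miss_rate, *,
                 step, min_quota, max_quota, miss_high, miss_low):
    """Return an updated occupancy quota, kept within configured limits.

    Increase when the task is behind (progress_error > 0) or the cache
    miss rate is above miss_high; decrease when the task is ahead
    (progress_error < 0) or the miss rate is below miss_low.
    """
    if progress_error > 0 or miss_rate > miss_high:
        quota += step
    elif progress_error < 0 or miss_rate < miss_low:
        quota -= step
    return max(min_quota, min(max_quota, quota))


limits = dict(step=4, min_quota=8, max_quota=64, miss_high=0.20, miss_low=0.02)
grown = adjust_quota(32, 100.0, 0.10, **limits)    # task behind: quota increases
shrunk = adjust_quota(32, -100.0, 0.10, **limits)  # task ahead: quota decreases
```

A fuller implementation could also consult progress limit error values and skip the increase when progress is limited by a factor other than occupancy.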
[0076] The cache occupancy management module can control occupancy quotas by setting quotas in the instruction cache 104 and/or data cache 145 if they have occupancy quota control mechanisms, or other buffer / caching components that can be part of, or coupled to, the processing system or computer system, such as a program stored in system memory 150. The cache occupancy parameters can relate to a task (or group of tasks) such that the system allocates occupancy quotas to or on behalf of the task; perhaps keeping track of a task if utilized by both the cache occupancy management module and the respective I/O subsystems. The quota control mechanisms can be implemented in hardware or software (firmware) or a combination of both.
[0077] Cache occupancy can include mapping virtual memory, memory management techniques allowing tasks to utilize virtual memory address space(s) which may be separate from physical address space(s), to physical memory. The physical memory in effect acts as a cache allowing a plurality of tasks to share physical memory wherein the total size of the virtual memory space(s) may be larger than the size of physical memory, or larger than the physical memory allocated to one or more tasks, and thus the physical memory, and/or a portion thereof, acts as a "cache". Physical memory occupancy of a task can be managed as described elsewhere in this document. The management module may be a separate module, as in 106, or may be an integral part of one or more operating systems, virtual machine monitors, etc.
[0078] A multiplicity of caches and/or buffer subsystems can exist and thus there can be several occupancy quota parameters utilized and stored in the task's performance profile. These caches and buffers can be embodied in hardware or software (firmware) or a combination of both.
[0079] A task's occupancy quota(s) can be modified such that work completed rate is matched to the expected work completed rate in a closed loop form where occupancy can be increased to meet expected work rates and/or decreased when expected work rates are being met or exceeded.
[0080] The modification of occupancy quota(s) can utilize rate adaption functions which may be task dependent and dependent on task state.
[0081] Task prioritization relative to occupancy quotas can be utilized to guarantee certain higher priority tasks meet their expected work at the expense of lower priority tasks. In some implementations, the management module 106 can control the overall allocation of occupancy quotas by determining/controlling the maximum and minimum occupancy quotas and/or the maximum and minimum changes allowed to occupancy quotas, etc. (e.g., through a set of policies/rules).
[0082] The I/O bandwidth management module 360 can manage the computer system's input/output subsystem(s) utilization of bandwidth (which is a measure of data transference per unit time). I/O operations performed by tasks, or by an operating system on behalf of a task's I/O request(s) for instance, can be managed as a performance resource by the I/O bandwidth manager to ensure that tasks' performance requirements for I/O operations are met.
[0083] A task's I/O bandwidth can be part of its performance profile. This performance profile parameter can be statically set (based on, for example, program state), or it can be dynamically calculated, such as by the I/O bandwidth manager. Dynamic I/O bandwidth values can be adjusted based on the performance of the task, for example, meeting its calculated deadline or feedback from execution in terms of expected work rate vs. work completed rate.
[0084] The I/O bandwidth manager can adjust the I/O bandwidth parameters, within certain configured limits which can be a combination of system-level configured limits and limits contained in the task's performance profile. The I/O bandwidth can be modified utilizing progress error and/or progress limit error values, or the expected I/O rate, or a combination of these. The computation of an I/O bandwidth rate can be made such that the I/O bandwidth may be increased or decreased depending on progress and/or progress limit error values and thresholds. In general, these values and thresholds can be determined to match the task's work completed rate to the work to be completed rate without using I/O bandwidth unnecessarily. A task's work can be the I/O bandwidth rate, in which case the task's primary work is the transference of I/O data at a certain rate. A task's I/O bandwidths can be adjusted such that the work completed rate is matched to the work to be completed rate in a closed loop form, where I/O bandwidths can be increased to meet expected work rates and/or decreased when expected work rates are being exceeded, considering progress and progress limit errors.
[0085] I/O resources can be allocated through I/O bandwidth allocations, managed through the I/O bandwidth manager, in such a way as to provide system performance guarantees. Such guarantees can be that the total I/O bandwidth is not over allocated or that certain tasks receive their I/O bandwidth at the expense of others (depending on a set of policies/rules).
[0086] The I/O bandwidth management module can control I/O bandwidth by setting bandwidth parameters in the I/O subsystem module 108 for such bandwidth control mechanisms that exist, or other I/O components that may be part of, or coupled to, the processing system or computer system, such as a program stored in system memory 150. The I/O bandwidth parameters can relate to a task (or group of tasks) such that the system allocates bandwidth to or on behalf of the task. In some variations, this can comprise keeping track of a task ID to associate with I/O operations such that the I/O bandwidth management module and the respective I/O subsystems may attribute data transference to a specific task. The I/O bandwidth control mechanisms can be implemented in hardware or software (firmware) or a combination of both.
[0087] In some implementations, DMA controllers can be utilized. Direct memory access is a feature of modern computers and microprocessors that allows certain hardware subsystems within the computer to access system memory for reading and/or writing independently of the central processing unit. Many hardware systems use DMA including disk drive controllers, graphics cards, network cards, sound cards and Graphics Processing Units (GPUs). DMA can also used for intra-chip data transfer in multi-core processors, especially in multiprocessor system-on-chips, where its processing element is equipped with a local memory (often called scratchpad memory) and DMA can be used for transferring data between the local memory and the main memory.
[0088] The I/O bandwidth manager can control I/O bandwidth through mechanisms that provide a bandwidth control mechanism to I/O operations, through bandwidth shaping. Bandwidth shaping can be accomplished by delaying certain data transference requests until sufficient time has passed to accumulate credit for the transference (where credit is a measure of data that is accumulated over time at a certain rate, representing the bandwidth). The I/O operation or the bandwidth management of data transference, including DMA, operations can be implemented in hardware or by software (or firmware).
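As a non-limiting sketch of the credit-based shaping just described, the following class accumulates credit at a fixed rate and reports how long a request would have to be delayed. The class name `CreditShaper`, the credit cap `max_credit_bytes`, and the use of explicit timestamps are assumptions introduced for illustration.

```python
class CreditShaper:
    """Delay transfers until enough credit (bytes) has accumulated at the configured rate."""

    def __init__(self, rate_bytes_per_s, max_credit_bytes):
        self.rate = rate_bytes_per_s
        self.max_credit = max_credit_bytes  # assumed cap so idle periods cannot bank unlimited credit
        self.credit = 0.0
        self.last_time = 0.0

    def _accumulate(self, now):
        # Credit grows linearly with elapsed time, up to the cap.
        elapsed = max(0.0, now - self.last_time)
        self.credit = min(self.max_credit, self.credit + elapsed * self.rate)
        self.last_time = now

    def try_transfer(self, size_bytes, now):
        """Return 0.0 if the transfer may proceed now (credit is consumed),
        otherwise the number of seconds the request should be delayed."""
        if size_bytes > self.max_credit:
            raise ValueError("request exceeds the maximum credit and could never be sent")
        self._accumulate(now)
        if size_bytes <= self.credit:
            self.credit -= size_bytes
            return 0.0
        return (size_bytes - self.credit) / self.rate
```

For example, with a rate of 100 bytes per second and no accumulated credit, a 50-byte request at time 0.0 would be told to wait 0.5 seconds, and the same request retried at time 0.5 would proceed.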
[0089] A multiplicity of I/O subsystems, or instances of subsystems, devices and interfaces can exist and thus there may be multiple I/O bandwidth parameters utilized and stored in the task's performance profile. These I/O subsystems can be embodied in hardware or software (firmware) or a combination of both.
[0090] Task prioritization relative to I/O bandwidth can be utilized to guarantee certain higher priority tasks meet their expected work at the expense of lower priority tasks. In another implementation, the I/O bandwidth management system can request I/O operation prioritization based on tasks matching their work completed to their work to be completed, taking progress error and progress limit error into account. This can, for example, consider progress and progress limit errors for all tasks of interest such that tasks with greater progress error, within certain progress limit error values, are given priority over tasks with lesser progress error within progress limit error values.
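One possible reading of this prioritization can be sketched as below. The record type `TaskErrors`, the function `order_for_io`, the symmetric bound `limit_bound`, and the decision to place out-of-bound tasks after the eligible ones are all assumptions made for the sketch.

```python
from typing import NamedTuple


class TaskErrors(NamedTuple):
    task_id: str
    progress_error: float        # assumed: larger means further behind
    progress_limit_error: float


def order_for_io(tasks, limit_bound):
    """Return task IDs in I/O priority order.

    Tasks whose progress limit error lies within the bound are ranked by
    descending progress error; the remaining tasks follow in their given order.
    """
    eligible = [t for t in tasks if abs(t.progress_limit_error) <= limit_bound]
    others = [t for t in tasks if abs(t.progress_limit_error) > limit_bound]
    eligible.sort(key=lambda t: t.progress_error, reverse=True)
    return [t.task_id for t in eligible + others]
```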
[0091] The progress error and progress limit errors can be used to adjust a task's I/O bandwidth parameters directly or through one or more rate adaptation functions implemented by the I/O bandwidth manager. For example, one rate adaptation function can be to adjust the I/O bandwidth only if the error is larger than certain limits, while another adaptation function may change the demand rate only should the error persist for longer than a certain period of time. The rate adaptation function(s) can be system dependent and/or task dependent. The rate adaptation functions can be part of the task's performance profile.
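The two example adaptation functions mentioned above might, purely as a sketch, look like the following. The names `deadband` and `PersistenceFilter`, and the use of explicit timestamps in seconds, are hypothetical choices for this illustration.

```python
def deadband(error, limit):
    """Pass the error through only when its magnitude exceeds the limit."""
    return error if abs(error) > limit else 0.0


class PersistenceFilter:
    """Report an error only after it has been nonzero for at least hold_time seconds."""

    def __init__(self, hold_time):
        self.hold_time = hold_time
        self.since = None  # time at which the current nonzero error was first seen

    def update(self, error, now):
        if error == 0.0:
            self.since = None  # error cleared: restart the persistence window
            return 0.0
        if self.since is None:
            self.since = now
        return error if now - self.since >= self.hold_time else 0.0
```

The two can be chained, for example by feeding `deadband(error, limit)` into `PersistenceFilter.update`, so that a rate change is made only for errors that are both large and sustained.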
[0092] Task prioritization relative to I/O bandwidth can be utilized to guarantee certain higher priority tasks meet their expected work at the expense of lower priority tasks. In some implementations, the management module 106 can control the overall allocation of I/O bandwidth by determining/controlling the maximum and minimum I/O bandwidth and/or bandwidth parameters (e.g., through a set of policies/rules).
[0093] The scheduler module 130 can select the next task(s) to be executed from its list of tasks based on the task parameters including task priority. The scheduler module 130 can indicate to the processor system 10 that a higher priority task is ready. The processor system 10 (or software on the processor system 10) can decide to preemptively switch from the currently running task and run the higher priority task. The scheduler module 130 or software in the processor system can indicate that a higher priority task is to be selected for execution, perhaps replacing a currently running task. In that case, the task currently running or executed in the processor system 10 can also be indicated to the performance resource manager 120. When this happens, the state of the metering module(s) 110 utilized for the currently running task can be saved in the task's context, and the performance resource manager directs the metering module to monitor the newly selected task (by updating the modules 210, 220 and the comparator function(s) within the metering module). Additional state in the performance resource manager can be modified similarly as a result of this task switching. In a multi-processor system, scheduling can be assigned on a processor-by-processor basis such that a task on a particular processor can be influenced by progress errors and/or progress limit errors of that task. This can also be done on a thread-by-thread basis for multi-thread systems.
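The saving and restoring of metering state on a task switch can be pictured with the following minimal sketch. The class `MeteringModule`, its two counters, and the `contexts` dictionary are hypothetical stand-ins chosen for illustration; they are not asserted to correspond to any particular numbered module in the figures.

```python
class MeteringModule:
    """Hypothetical stand-in for a metering module that tracks one task at a time."""

    def __init__(self):
        self.work_completed = 0
        self.work_to_be_completed = 0

    def save_state(self):
        return {"work_completed": self.work_completed,
                "work_to_be_completed": self.work_to_be_completed}

    def load_state(self, state):
        self.work_completed = state["work_completed"]
        self.work_to_be_completed = state["work_to_be_completed"]


def switch_task(meter, contexts, current_id, next_id):
    """Save the meter state into the outgoing task's context, then load the incoming task's state."""
    contexts[current_id] = meter.save_state()
    fresh = {"work_completed": 0, "work_to_be_completed": 0}  # assumed default for a new task
    meter.load_state(contexts.get(next_id, fresh))
```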
[0094] FIG. 5 is a process flow diagram illustrating a method 500, in which, at 510, execution of a plurality of tasks by a processor system is monitored. Based on the monitoring, at 520, tasks requiring adjustment of performance resources are identified by calculating at least one of a progress error and a progress limit error for each task. Subsequently, at 530, performance resources of the processor system allocated to each identified task are adjusted. The adjusting can include, for example, one or more of: adjusting a clock rate of at least one processor in the processor system executing the task, adjusting an amount of cache and/or buffer to be utilized by the task, and adjusting an amount of input/output (I/O) bandwidth to be utilized by the task.
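A simplified sketch of steps 520 and 530 of method 500 is given below, using only the clock rate as the adjusted resource. The names `TaskSample`, `identify_tasks` and `adjust_resources`, the fixed `step`, the `tolerance`, and the sign convention (positive progress error means the task is behind) are assumptions for illustration; cache and I/O bandwidth adjustments are omitted.

```python
from dataclasses import dataclass


@dataclass
class TaskSample:
    """One monitoring sample per task (step 510), with hypothetical fields."""
    task_id: str
    work_completed: float
    work_expected: float


def identify_tasks(samples, tolerance):
    """Step 520: return {task_id: progress_error} for tasks needing adjustment."""
    flagged = {}
    for s in samples:
        progress_error = s.work_expected - s.work_completed
        if abs(progress_error) > tolerance:
            flagged[s.task_id] = progress_error
    return flagged


def adjust_resources(flagged, clock_rates, step):
    """Step 530: raise the clock rate for tasks that are behind, lower it for tasks ahead."""
    updated = dict(clock_rates)
    for task_id, progress_error in flagged.items():
        updated[task_id] += step if progress_error > 0 else -step
    return updated
```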
[0095] Various implementations of the subject matter described herein may be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
[0096] These computer programs (also known as programs, software, software applications or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the term "machine-readable medium" refers to any computer program product, apparatus and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
[0097] Although a few variations have been described in detail above, other modifications are possible. For example, the logic flow depicted in the accompanying Figures and described herein does not require the particular order shown, or sequential order, to achieve desirable results. Other implementations may be within the scope of the following claims.

Claims

WHAT IS CLAIMED IS:
1. A method comprising:
monitoring execution of a plurality of tasks by a processor system;
identifying, based on the monitoring, tasks requiring adjustment of performance resources by calculating at least one of a progress error and a progress limit error for each task; and
adjusting performance resources of the processor system allocated to each identified task;
wherein adjusting the performance resources comprises one or more of:
adjusting a clock rate of at least one processor in the processor system executing the task;
adjusting an amount of cache and/or buffers to be utilized by the task; or
adjusting an amount of input/output (I/O) bandwidth to be utilized by the task.
2. A method as in claim 1, wherein the progress error is equal to a differential between work completed by the task and work to be completed by the task.
3. A method as in claim 1 or 2, wherein the progress limit error is equal to a difference between a work completion rate for completed work and an expected work rate for the remainder of the task.
4. A method as in any of the preceding claims, wherein each task is selected from a group comprising: a single task, a group of tasks, a thread, a group of threads, a single state machine, a group of state machines, a single virtual machine, and a group of virtual machines.
5. A method as in any of the preceding claims, wherein the processor comprises a system selected from a group comprising: a single processor, a multi-processor, a processor system supporting simultaneous multi-threading, a multi-core processor.
6. A method as in any of the preceding claims, wherein each task has an associated execution priority and an execution deadline, and wherein the performance resources of the processor system are adjusted to enable each identified task to be completed by its corresponding execution deadline and according to its corresponding execution priority.
7. A method as in any of the preceding claims, wherein the performance resources are adjusted on a task-by-task basis.
8. A method as in any of the preceding claims, wherein each task has an associated performance profile that is used, by a scheduler module, to establish the execution priority and the execution deadline for the task.
9. A method as in claim 8, wherein the associated performance profile specifies at least one performance parameter.
10. A method as in claim 9, wherein the performance parameter is a cache occupancy quota specifying an initial maximum and/or minimum amount of buffers to be used by the task, wherein the cache occupancy quota is dynamically adjusted during execution of the task.
11. A method as in claim 10, wherein the cache occupancy quota is dynamically adjusted based on progress error for the task.
12. A method as in claim 10 or 11, wherein the performance parameter specifies initial bandwidth requirements for the execution of the task, wherein the bandwidth requirements are dynamically adjusted during execution of the task.
13. A method as in any of the preceding claims, further comprising:
determining a processor clock demand rate required by each task; and
computing an aggregate clock demand rate based on the determined processor clock demand rate for all tasks;
wherein the processor system clock rate is adjusted to accommodate the aggregate clock demand rate.
14. A method as in claim 13, wherein the processor system clock rate is adjusted to the aggregate clock demand rate plus an overhead demand rate.
15. A method as in claim 13 or 14, wherein determining a processor clock demand rate is a product of a current processor system clock rate with expected execution time for completion of the task divided within a time interval.
16. A method as in any of claims 13-15, wherein the processor clock demand rate for each task is updated based on progress errors affecting the performance of the task, wherein the aggregate clock demand rate is updated based on the updated processor clock demand rate for each task.
17. A method as in claim 16, wherein the updating of the processor clock demand rate for each task or the aggregate clock demand rate uses at least one adaptation function to dampen or enhance rapid rate changes.
18. A method as in any of claims 13-17, wherein a processor clock rate for each task is added to the aggregate clock demand rate when the task is ready-to-run.
19. A method as in any of claims 13-18, wherein the aggregate clock demand rate is calculated over a period of time such that, at times, the processor system clock rate is higher than the aggregate clock demand rate, and at other times, the processor system clock rate is lower than the aggregate clock demand rate.
20. A method as in any of claims 13-19, wherein the processor system comprises at least two processors, and wherein the aggregate clock demand rate is determined for each of the at least two processors and is based on the processor demand rate for tasks executing using the corresponding processor, and wherein the clock rate for each of the at least two processors is adjusted separately and accordingly.
21. A method as in any of the preceding claims, wherein each task is allocated physical memory, and wherein the method further comprises: enabling at least one task to utilize at least one virtual memory address space, the at least one virtual memory address space being mapped to at least a portion of the physical memory.
22. A method comprising:
monitoring execution of a plurality of tasks by a processor system to determine at least one monitored value for each of the tasks, the at least one monitored value characterizing at least one factor affecting performance of the corresponding task by the processor system, each task having an associated task performance profile that specifies at least one performance parameter; and
comparing, for each of the tasks, the corresponding monitored value with the corresponding at least one performance parameter specified in the associated task performance profile;
determining, for each of the tasks based on the comparing, whether performance resources utilized for the execution of the task should be adjusted or whether performance resources utilized for the execution of the task should be maintained; and
adjusting performance resources by modifying a processor clock rate for each of the tasks for which it was determined that performance resources allocated to such task should be adjusted and maintaining performance resources for each of the tasks for which it was determined that performance resources allocated to the task should be maintained.
23. A method as in claim 22, wherein the monitored value characterizes an amount of work completed by the task.
24. A method as in claim 23, wherein the amount of work completed by the task is derived from at least one of: an amount of data transferred when executing the task, a number of processor instructions completed when executing the task, processor cycles, and execution time.
25. A method as in any of claims 22-24, further comprising:
determining, for each task, a current program state for the task;
wherein the associated task performance profile specifies two or more program states having different performance parameters, and wherein the monitored value is compared to the performance parameter for the current program state.
26. A method as in any of claims 22-25, further comprising:
modifying at least one performance profile of a task being executed so that a corresponding performance parameter is changed;
wherein the monitored value is compared to the changed performance parameter.
27. A method as in any of claims 22-26, further comprising:
determining a processor clock demand rate required by each task;
computing an aggregate clock demand rate based on the determined processor clock demand rate for all tasks; and
adjusting a processor system clock to accommodate the aggregate clock demand rate.
28. A method as in claim 27, further comprising:
dynamically adjusting a processor clock demand rate required by a particular task based on a difference between an expected and completed work rate and at least one progress limiting rate.
29. A method as in claim 28, wherein the processor clock demand rate required by each task is based on an expected time of completion of the corresponding task.
30. A method as in claim 29, further comprising: reducing the processor system clock rate to a level that does not affect the expected time of completion of the tasks.
31. A method as in any of claims 28-30, further comprising:
reducing the processor system clock rate in either of a sleep or idle state until such time that the aggregate clock demand is greater than zero.
32. A method as in any of claims 28-31, wherein the processor system clock rate fluctuates above and below the aggregate clock demand rate during a period of time provided that an average processor system clock rate during the period of time is above or equal to the aggregate clock demand rate.
33. A method as in any of claims 22-32, wherein the performance profile further specifies an occupancy quota influencing an amount of cache and/or buffers a task can utilize.
34. A method as in claim 33, wherein the occupancy quota is dynamically adjusted based on a difference between an expected or completed work rate and at least one progress limiting rate.
35. A method as in any of claims 22-34, wherein utilization of bandwidth by an input/output subsystem of the processor system is controlled so that performance requirements of each task are met.
36. A method as in claim 35, wherein an amount of bandwidth utilized is dynamically adjusted based on a difference between an expected and completed work rate and at least one progress limiting rate.
37. A processor system comprising:
at least one processor;
a plurality of buffers;
a scheduler module to schedule a plurality of tasks to be executed by the at least one processor;
a metering module to monitor execution of the plurality of tasks and to identify tasks that require additional processing resources;
an adaptive clock manager module to selectively adjust a clock rate of the at least one processor when executing a task;
a cache occupancy manager module to selectively adjust a maximum amount of cache and/or buffers to be utilized by a task; and
an input / output bandwidth manager module to selectively adjust a maximum amount of input/output (I/O) bandwidth to be utilized by a task.
PCT/US2011/030096 2010-03-26 2011-03-25 Fine grain performance resource management of computer systems WO2011120019A2 (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
JP2013501534A JP2013527516A (en) 2010-03-26 2011-03-25 Fine-grained performance resource management for computer systems
KR1020127027941A KR20130081213A (en) 2010-03-26 2011-03-25 Fine grain performance resource management of computer systems
EP11760356.3A EP2553573A4 (en) 2010-03-26 2011-03-25 Fine grain performance resource management of computer systems
CN2011800254093A CN102906696A (en) 2010-03-26 2011-03-25 Fine grain performance resource management of computer systems

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US34106910P 2010-03-26 2010-03-26
US34117010P 2010-03-26 2010-03-26
US61/341,170 2010-03-26
US61/341,069 2010-03-26

Publications (2)

Publication Number Publication Date
WO2011120019A2 true WO2011120019A2 (en) 2011-09-29
WO2011120019A3 WO2011120019A3 (en) 2012-01-26

Family

ID=44673905

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2011/030096 WO2011120019A2 (en) 2010-03-26 2011-03-25 Fine grain performance resource management of computer systems

Country Status (5)

Country Link
EP (1) EP2553573A4 (en)
JP (1) JP2013527516A (en)
KR (1) KR20130081213A (en)
CN (1) CN102906696A (en)
WO (1) WO2011120019A2 (en)

Also Published As

Publication number Publication date
KR20130081213A (en) 2013-07-16
CN102906696A (en) 2013-01-30
JP2013527516A (en) 2013-06-27
EP2553573A4 (en) 2014-02-19
EP2553573A2 (en) 2013-02-06
WO2011120019A3 (en) 2012-01-26

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 201180025409.3

Country of ref document: CN

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11760356

Country of ref document: EP

Kind code of ref document: A2

WWE Wipo information: entry into national phase

Ref document number: 2013501534

Country of ref document: JP

NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 2011760356

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 9035/CHENP/2012

Country of ref document: IN

ENP Entry into the national phase

Ref document number: 20127027941

Country of ref document: KR

Kind code of ref document: A