US20130332937A1 - Heterogeneous Parallel Primitives Programming Model - Google Patents

Info

Publication number
US20130332937A1
Authority
US
United States
Prior art keywords
memory
workgroup
hpp
distributed array
kernel
Prior art date
Legal status
Abandoned
Application number
US13/904,791
Inventor
Benedict R. Gaster
Lee W. Howes
Current Assignee
Advanced Micro Devices Inc
Original Assignee
Advanced Micro Devices Inc
Priority date
Filing date
Publication date
Application filed by Advanced Micro Devices Inc filed Critical Advanced Micro Devices Inc
Priority to US13/904,791
Assigned to ADVANCED MICRO DEVICES, INC. Assignors: GASTER, BENEDICT R.; HOWES, LEE W.
Publication of US20130332937A1
Priority to US15/797,702 (US11231962B2)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061 Partitioning or combining of resources
    • G06F9/5072 Grid computing

Definitions

  • Table 5B demonstrates the use of distributed arrays for localized communication, and the use of channels for global communication, in the HPP programming model.
  • the conventional GPGPU solutions limit barrier-based synchronization to enforcing memory consistency and requiring workitems to reach the same program counter (PC).
  • the conventional GPGPU solutions are also limited to cases that do not include divergent control flow, or to divergent control flow in which it is guaranteed that all workitems enter a conditional branch if any one workitem enters it.
  • HPP addresses these limitations by introducing barriers that can be used inside divergent control flow and across work groups.
  • the source code in Table 6A below defines the barrier class and the relevant methods, in one embodiment.
  • a barrier is initialized with a count that represents the number of participants in the barrier.
  • the participants may be workitems.
  • the barrier class also includes skip( ), wait( ), and arrive( ) methods.
  • the wait( ) method blocks any workitem that performs the wait( ) method from continuing execution until the other participants (i.e., workitems) have also taken part.
  • the wait( ) method may be performed by a consumer.
  • the arrive( ) method may be performed by a workitem that participates in the barrier, but does not wait for other workitems.
  • the arrive( )method may be performed by a producer.
  • the skip( ) method may be performed by a workitem that withdraws from further participation in the barrier.
  • the withdrawn workitem does not count against the other participants that have executed a waiting method.
  • the skip( ) method may be used by a workitem that has left the execution loop, such that the remaining workitems may continue synchronizing on the barrier after the workitem leaves.
  • HPP programming model controls the use of barriers to maintain proper execution of a workgroup. For example, replacing the call to a skip( ) method in the else branch, in Table 6C, with wait( ) may be invalid, because it may not be possible to know the number of times someOpaqueLibaryFunction( ) may use the barrier. However, instead of replacing a skip( ) method with a wait( ) method, two barriers may be used in the HPP programming model. This embodiment is shown in Table 6D below:
  • barrier objects may also be used to synchronize dependent kernels.
  • the host may delegate to multiple CPU devices to process the function, as shown in Table 6E below:
  • the function _gpu_sync( ) is an inter work-group barrier operation.
  • the cross work-group variant of HPP's barrier may be implemented using the Global Data Share (GDS) in AMD's HD7970 GPU.
  • GDS is a 64 KB on-chip global memory with barrier functionality across the whole device.
  • the _gpu_sync( ) function may be implemented using the algorithm described above.
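  • The wait( ), arrive( ), and skip( ) semantics described above are not tied to HPP-specific machinery. As a rough analogue only (this is not the HPP barrier class), C++20's std::barrier provides arrive_and_wait( ), arrive( ), and arrive_and_drop( ), which correspond approximately to wait( ), arrive( ), and skip( ). The sketch below uses that mapping; the participant count, the phase count, and the condition under which a workitem withdraws are illustrative assumptions.

    // Hypothetical analogue of HPP's wait()/arrive()/skip() using C++20 std::barrier.
    // arrive_and_wait(), arrive(), and arrive_and_drop() stand in for wait(),
    // arrive(), and skip() respectively; this is not HPP code.
    #include <barrier>
    #include <cstdio>
    #include <thread>
    #include <vector>

    int main() {
        constexpr int participants = 4;      // illustrative workgroup size
        std::barrier sync(participants);     // initialized with the participant count

        auto workitem = [&](int id) {
            for (int phase = 0; phase < 3; ++phase) {
                if (id == 0 && phase == 1) {
                    // Withdraw from further participation, like skip(): the
                    // remaining workitems keep synchronizing without this one.
                    sync.arrive_and_drop();
                    return;
                }
                std::printf("workitem %d reached phase %d\n", id, phase);
                sync.arrive_and_wait();      // like wait(): block until all arrive
            }
        };

        std::vector<std::jthread> group;
        for (int id = 0; id < participants; ++id)
            group.emplace_back(workitem, id);
    }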

Abstract

With the success of programming models such as OpenCL and CUDA, heterogeneous computing platforms are becoming mainstream. However, these heterogeneous systems are low-level, not composable, and their behavior is often implementation defined even for standardized programming models. In contrast, the method and system embodiments for the heterogeneous parallel primitives (HPP) programming model disclosed herein provide a flexible and composable programming platform that guarantees behavior even in the case of developing high-performance code.

Description

    RELATED APPLICATIONS
  • This application is related to the U.S. Provisional Patent Application No. 61/652,772, filed on May 29, 2012, which is incorporated by reference herein in its entirety.
  • BACKGROUND
  • 1. Field of the Invention
  • The present invention relates generally to a programming model for a heterogeneous processor system.
  • 2. Background Art
  • With the success of programming models such as OpenCL and CUDA, heterogeneous computing platforms are becoming mainstream. However, these heterogeneous systems are low-level, not composable, and their behavior is often implementation defined even for standardized programming models.
  • Thus, what is needed are a system and method for a heterogeneous parallel primitives (HPP) programming model that provides a flexible and composable programming platform and guarantees behavior even when developing high-performance code.
  • SUMMARY OF EMBODIMENTS
  • According to an embodiment, a method and system for executing an asynchronous task on a heterogeneous computing platform are provided. An asynchronous task configured to execute on a grid is initialized. An initially unknown result that becomes available during execution is encapsulated. The asynchronous task is executed on the grid. The result is assigned to the asynchronous task when the result becomes available during execution.
  • According to another embodiment, a system for managing memory is provided. A heterogeneous parallel primitives (HPP) platform generates an unbound distributed array in a plurality of memories of different types. Once generated, the distributed array is bound to a kernel that executes a workgroup on a processor in a heterogeneous computing platform. During execution, the bound distributed array is accessed by the workgroup.
  • Further embodiments, features, and advantages of the present invention, as well as the structure and operation of the various embodiments of the present invention, are described in detail below with reference to the accompanying drawings.
  • BRIEF DESCRIPTION OF THE DRAWINGS/FIGURES
  • The accompanying drawings, which are incorporated in and constitute part of the specification, illustrate embodiments of the invention and, together with the general description given above and the detailed description of the embodiment given below, serve to explain the principles of the present invention. In the drawings:
  • FIG. 1 is a block diagram of a heterogeneous parallel primitives execution model, according to an embodiment.
  • FIG. 2 is a block diagram that shows bound and unbound distributed arrays access, according to an embodiment.
  • FIG. 3 is a block diagram of a channel usage flow, according to an embodiment.
  • The invention will now be described with reference to the accompanying drawings. In the drawings, generally, like reference numbers indicate identical or functionally similar elements. Additionally, generally, the left-most digit(s) of a reference number identifies the drawing in which the reference number first appears.
  • DETAILED DESCRIPTION
  • While the present invention is described herein with illustrative embodiments for particular applications, it should be understood that the invention is not limited thereto. Those skilled in the art with access to the teachings provided herein will recognize additional modifications, applications, and embodiments within the scope thereof and additional fields in which the invention would be of significant utility.
  • The embodiment(s) described, and references in the specification to “one embodiment”, “an embodiment”, “an example embodiment”, etc., indicate that the embodiment(s) described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is understood that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.
  • Graphics processing units (GPU) generally comprise multiple processing elements that are ideally suited for executing the same instruction on parallel data streams, as in the case of a single instruction multiple data (SIMD) device, or in data-parallel processing. In many computing models, a central processing unit (CPU) functions as the host or controlling processor and hands-off specialized functions, such as graphics processing, to other processors such as GPUs.
  • Multi-core CPUs, where each CPU has multiple processing cores, offer processing capabilities for specialized functions (e.g., graphics processing) similar to those available on the GPU. One or more of the computation cores of multi-core CPUs or GPUs can be part of the same die (e.g., AMD Fusion™) or, alternatively, in different dies (e.g., Intel Xeon™ with NVIDIA GPU). Recently, hybrid cores having characteristics of both CPU and GPU (e.g., CellSPE™, Intel Larrabee™) have been proposed for general purpose GPU (GPGPU) style computing. The GPGPU style of computing advocates using the CPU to primarily execute control code and to offload performance critical data-parallel code to the GPU. The GPU is primarily used as an accelerator. The combination of multi-core CPUs and GPGPU computing model encompasses both CPU cores and GPU cores as accelerator targets. Many of the multi-core CPU cores have performance that is comparable to GPUs in many areas.
  • Several programming models have been developed for heterogeneous computing platforms that have CPUs and GPUs. These programming models include BrookGPU by Stanford University, the compute unified device architecture (CUDA) by NVIDIA, and OpenCL by an industry consortium named Khronos Group. The OpenCL framework offers a C-like development environment with which users can create applications for the GPU. OpenCL enables the user, for example, to specify instructions for offloading some computations, such as data-parallel computations, to a GPU. OpenCL also provides a compiler and a runtime environment in which code can be compiled and executed within a heterogeneous, or other, computing system.
  • Heterogeneous computing platforms can include multiple CPUs and GPUs. For performance reasons CPUs and GPUs in a heterogeneous computing platform are designed differently and perform different functions. For example, GPUs support wide vectors and substantial register files to optimize throughput computing goals. CPUs are optimized for latency, dedicating logic to caches and out-of-order dependence control.
  • Because of those different functions, heterogeneous computing platforms are difficult to develop efficiently. Particularly, given different functions of CPU and GPU cores, a difficulty arises in developing an efficient programming model for the heterogeneous computing platform.
  • Existing programming models attempt to efficiently program the heterogeneous computing platforms using several programming models. For example, GPU programming models have expanded over recent years to higher levels of flexibility. Both OpenCL and CUDA support heterogeneous computing platforms to some degree. For example, by structuring the programming model as a data-parallel methodology with weak communication guarantees, these programming models ensure that code may execute on varied target platforms. However, conventional programming models have fundamental problems. They lack composability of operations and flexibility in the execution.
  • To ease the composability burden of heterogeneous computing platform development, conventional programming models concentrate on application program interface (API) simplifications. CUDA, for example, includes a simplified API compared to previous graphics-oriented programming environments. Microsoft's C++ AMP design is another example that eases composability by linking the benefits of C++ type safety with GPU programming, as do pragma-based models such as OpenACC.
  • Additionally, conventional programming models follow an inflexible single program multiple data (“SPMD”) model. Example conventional programming models that follow SPMD are OpenGL, CUDA and other low-level GPU intermediate languages. On a GPU, those programming models execute in an SPMD-on-SIMD fashion. This technique is sometimes known as a single instruction multiple thread (“SIMT”) implementation. However, the SIMD model limits the developer's ability to flexibly use the heterogeneous computing platform. For example, OpenCL's memory model does not allow any communication between work groups without the use of atomic operations. OpenCL also does not provide methods that guarantee that memory writes commit to global visibility and provides little or no control of memory ordering. In another example, CUDA offers a partial solution to this issue with a “threadfence” operation. The “threadfence” operation ensures that the workitems within a work group have completed operating on their allocated sections in memory.
  • The SIMD nature of execution leads to other problems. For example, in the SIMD model a workitem is mapped to an individual SIMD lane in a larger hardware thread. The SIMD model then uses execution masks to switch execution between the workitem subsets when control flow diverges. No guarantees of progress can be made in the presence of dependencies between lanes. Because CUDA targets a limited hardware space, programmers can make assumptions about how wide a hardware thread is and how many SIMD lanes are included in the hardware thread. OpenCL, on the other hand, does not allow programmers to make such an assumption.
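  • To make the masking behavior concrete, the following sketch emulates, in plain C++, how an SPMD-on-SIMD implementation can execute both sides of a divergent branch over a fixed-width vector of lanes while a per-lane execution mask selects which lanes commit results. The eight-lane width and the branch condition are illustrative assumptions, not any particular vendor's hardware.

    // Illustrative emulation of SPMD-on-SIMD divergence handling (not a real GPU
    // ISA): both sides of a branch are executed for the whole vector, and a
    // per-lane execution mask decides which lanes commit their results.
    #include <array>
    #include <cstdio>

    constexpr int kLanes = 8;  // assumed hardware-thread width, for illustration

    void kernel(std::array<int, kLanes>& data) {
        std::array<bool, kLanes> mask{};
        for (int lane = 0; lane < kLanes; ++lane)
            mask[lane] = (data[lane] % 2 == 0);       // divergent condition per lane

        for (int lane = 0; lane < kLanes; ++lane)     // "then" path, under the mask
            if (mask[lane]) data[lane] /= 2;

        for (int lane = 0; lane < kLanes; ++lane)     // "else" path, inverted mask
            if (!mask[lane]) data[lane] = 3 * data[lane] + 1;
    }

    int main() {
        std::array<int, kLanes> data{{0, 1, 2, 3, 4, 5, 6, 7}};
        kernel(data);
        for (int v : data) std::printf("%d ", v);
        std::printf("\n");
    }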
  • Conventional programming models also place restrictions on the synchronization barriers. For example, restricting barriers within the divergent control flow is not necessarily a hardware limitation, but a factor of a conventional programming model. In one example, Titanium programming language by NVIDIA prohibits barriers inside any divergent control flows. In another example, SPMD implementations for modern CPUs use the notion of maximum convergence to avoid barriers in a control flow altogether. The notion of maximum convergence guarantees that when two program instances follow the same control path, the programs are guaranteed to execute each program statement concurrently.
  • Further, conventional programming models fail to utilize braided parallelism. Braided parallelism is a combination of data parallelism and task parallelism. Conventional programming models, such as OpenCL and CUDA implement data parallelism. However in addition to data parallelism, task parallelism can also be implemented in a heterogeneous computing platform, as described below.
  • For example, a game engine that implements a heterogeneous computing platform displays many types of parallelism. It includes parallel AI tasks, concurrent workitems for user interfaces, and massive data-parallel particle simulations, to name a few examples. However, even when the components in the game engine exhibit parallelism, the game engine fails to exhibit parallelism in its entirety. In fact, the engine as a whole is not parallel because many of its tasks are generated dynamically.
  • A need for implementing task-graph executions on a GPU is shown by existence of persistent threads. Persistent threads may be used for building scheduling systems within threads and thus circumventing the hardware scheduler. This approach is commonly used to reduce overhead that arises from massively parallel data executions. Persistent threads, however, also demonstrate a need and limitation in conventional programming models for implementing braided parallelism.
  • Conventional heterogeneous computing platforms also lack composability. Conventionally, workitems that process work on a GPU are divided into synchronizable work groups. Those work groups share data. One way to synchronize a work group is by using a barrier that enforces memory consistency and workitem ordering. Conventional barriers, however, are defined to work only within individual work groups and do not enable global synchronization. As a result, conventional barriers are precluded from synchronizing workitems in most divergent control flows.
  • Additionally, many conventional GPGPU programming models expose distinct memory address spaces (also referred to as domains). Prior to processing data by a GPGPU, the data must be moved explicitly in and out of these domains. This poses several issues. First, when loading third-party libraries in and out of the domains, a GPU developer must be aware of the memory spaces of the library's parameters, and may be required to write additional data movement code when the library has unexpected parameters and memory requirements. Second, there is little to no way to enforce how library functions are called and over what width in a work group. This results in an assumption that libraries either execute across an entire work group or on a single workitem. When the library is being executed on the entire work group, the work group may be synchronized using barrier synchronization and share state internally. However, the conventional programming platforms do not support a library that is being executed on a single workitem and explicitly do not support such state sharing.
  • 1. Introduction to Heterogeneous Parallel Primitives Programming Model
  • A heterogeneous parallel primitives (HPP) programming model is designed to solve the above described limitations of conventional heterogeneous computing platforms. HPP is a braided parallel programming model that supports task and data parallelism, and solidifies flexibility and composability concepts that have been lacking in the conventional programming models.
  • In an embodiment, the HPP programming model may be a combination of OpenCL, C++11 and the Concurrency Runtime by Microsoft. For example, HPP adopts the execution model of OpenCL and extends it with braided parallelism, the hosting of the object-oriented C++11 language, and a stricter and more controllable memory model. In an embodiment, HPP may be embedded into C++11 as a library and a device kernel language designed to target both CPU and massively multi-threaded GPU devices.
  • HPP includes three components, a platform model, an execution model and a memory model.
  • Platform model specifies an abstract hardware model, consisting of the host processor coordinating execution and one or more compute units capable of dispatching and executing HPP kernels. To enable support for both data and task parallelism HPP evolves the device model of OpenCL from a single threaded device to a set of explicitly programmable work coordinators capable of launching units of work on the compute cores as seen in FIG. 1, according to an embodiment of the present invention.
  • Execution model defines how HPP programming model is configured on the host and how kernels are executed on the device. Unlike the conventional GPU programming models described above, HPP supports both data-parallelism and task-parallelism as first class execution models.
  • In the execution model, coordinators are single-thread scalar programs. Coordinators perform reads and writes into globally visible memory. The read and writes include atomic operations. Coordinators also perform conditional flows. The conditional flows include iteration. Additionally, coordinators dispatch kernels on the compute units.
  • In an embodiment, coordinators execute on the Coord Schedulers of FIG. 1.
  • In one example, kernels execute on compute units (CUs) and assume an explicitly parallel execution. The term “kernel”, as used herein, refers to a program and/or processing logic that is executed as one or more workitems in parallel having the same code base. Each kernel describes the execution of a single lane of execution called a workitem. When coordinators dispatch a kernel, multiple workitems may execute sharing the same kernel code.
  • In one example, coordinators (scheduling programs that run on the coordinator scheduler entities) execute concurrently with kernels. This enables coordinators to dispatch new kernels while other kernels are concurrently executing.
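  • As a loose illustration of this arrangement (not the HPP coordinator API), the sketch below uses a single scalar control loop that dispatches kernels as asynchronous tasks and keeps dispatching while earlier kernels are still running; the dispatch count and kernel body are placeholders.

    // Illustrative "coordinator": one scalar control thread dispatches kernels
    // (modeled here as async tasks) without waiting for earlier dispatches.
    #include <cstdio>
    #include <future>
    #include <vector>

    static void kernel(int dispatchId, int workitems) {
        for (int w = 0; w < workitems; ++w) { /* per-workitem work would go here */ }
        std::printf("kernel %d finished\n", dispatchId);
    }

    int main() {
        std::vector<std::future<void>> inFlight;
        for (int dispatchId = 0; dispatchId < 4; ++dispatchId) {
            // Conditional control flow and dispatch run on the coordinator;
            // previously dispatched kernels may still be executing.
            inFlight.push_back(std::async(std::launch::async, kernel, dispatchId, 256));
        }
        for (auto& f : inFlight) f.wait();   // join before exiting
    }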
  • In one example, workitems are organized into workgroups of size 1 or more. Collections of workitems within a workgroup are executed in lock-step as part of a vector, called an mvector (machine vector), potentially using predication. The specific length of an mvector is implementation defined and is exposed as a symbolic constant (MVECTOR_SIZE).
  • In one example, memory model defines an abstract memory hierarchy that kernels use. The abstract memory hierarchy works regardless of the actual underlying memory architecture. Unlike the conventional GPGPU models the memory hierarchy is closer to a more traditional shared memory system. For example, scratch pad memories are not exposed explicitly.
  • In one example, HPP programming model also adopts the C++11 memory model for workitems communications. The code snippet in Table 1 shows an HPP application that atomically increments its input in parallel:
  • TABLE 1
    #include <atomic>
    void inc(atomic_int &input, int numOfTimes)
    {
      parallelFor(Range<1>(numOfTimes),
        [input] (Index<1>) _device(hpp) {
          input.add(1);
        });
    }
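  • For comparison, a minimal standard C++11 sketch of the same idea, many threads atomically incrementing a shared counter, is shown below. It uses std::thread and std::atomic in place of HPP's parallelFor; the thread count is an illustrative assumption.

    // Standard C++11 analogue of Table 1: several threads atomically increment a
    // shared counter; std::atomic supplies the C++11 memory model that HPP adopts.
    #include <atomic>
    #include <cstdio>
    #include <thread>
    #include <vector>

    void inc(std::atomic<int>& input, int numOfTimes) {
        std::vector<std::thread> workers;
        for (int t = 0; t < numOfTimes; ++t)
            workers.emplace_back([&input] { input.fetch_add(1); });  // one "workitem" per thread
        for (auto& w : workers) w.join();
    }

    int main() {
        std::atomic<int> counter{0};
        inc(counter, 16);
        std::printf("counter = %d\n", counter.load());
    }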
  • 2. Task and Data Parallelism in a Heterogeneous Parallel Primitives Programming Model
  • HPP programming model enables developers to introduce data and task parallelism. The example below demonstrates in pseudo code how HPP programming model enables programmers to introduce data and task parallelism. Table 2A is a function for multiplying two matrices.
  • TABLE 2A
    void matrixMul(
      int size,
      double * inputA,
      double * inputB,
      double * output)
    {
      for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
          double sum = 0;
          for (int k = 0; k < size; ++k) {
            double a = inputA[i * size + k];
            double b = inputB[k * size + j];
            sum += a * b;
          }
          output[i * size + j] = sum;
        }
      }
    }
  • In Table 2A, the iteration spaces of the outer two "for" loops are independent of each other and can therefore be executed in parallel. One conventional way to parallelize the pseudo code in Table 2A in a data parallel execution is to use size*size workitems, where each workitem executes the inner loop with a corresponding index from the 2D iteration space.
  • In a data-parallel programming model, the algorithm in Table 2A can be parallelized using a parallelFor function. The pseudo code using the parallelFor function is shown in Table 2B.
  • TABLE 2B
    void matrixMul(
      int size,
      Pointer<double> inputA,
      Pointer<double> inputB,
      Pointer<double> output)
    {
      parallelFor(
      Range<2>(size, size),
      [inputA,inputB,output,size] (
        Index<2> index) _device(hpp) {
        unsigned int i = index.getX( );
        unsigned int j = index.getY( );
        double sum = 0;
        for (unsigned int k = 0; k < size; ++k) {
          double a = inputA[i * size + k];
          double b = inputB[k * size + j];
          sum += a * b;
        }
        output[i * size + j] = sum;
      });
    }
  • The implementation in Table 2B is not dissimilar from the data parallel model popularized by OpenMP and the GPGPU programming models. However, unlike conventional programming models, where task parallelism is implemented on CPUs, the HPP programming model includes a task parallel runtime (TPR) that supports data parallelism as a first class citizen.
  • Similar to popular TPRs designed specifically for the CPU, HPP programming model's tasks can be data-parallel. The difference is that in HPP programming model, tasks maintain data-parallel representations much later in the execution process and hence more efficiently map to highly data parallel architectures.
  • In an embodiment, the pseudo code in Table 2B is rewritten into an HPP version in Table 2C. Table 2C uses parallel tasks and a notion of the future, to execute the matrix multiplication described in Table 2B. The future represents data that will be present at some point in the future and hence is a proxy for synchronizing the asynchronous tasks.
  • TABLE 2C
    void matrixMul(
       int size,
       Pointer<double> inputA,
       Pointer<double> inputB,
       Pointer<double> output)
    {
       Task<void, Index<2>> matMul(
           [inputA,inputB,output,size]
             (Index<2> index) _device(hpp) {
             unsigned int i = index.getX( );
             unsigned int j = index.getY( );
             double sum = 0;
             for (unsigned int k = 0; k < size; ++k) {
                double a = inputA[i * size + k];
                double b = inputB[k * size + j];
                sum += a * b;
             }
             output[i * size + j] = sum;
          });
          Future<void> future = matMul.enqueue(
          Range<2>(size, size));
          future.wait( );
    }
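  • As a point of reference only, the same task-plus-future structure can be approximated in standard C++11 with std::async and std::future. This is an analogue of Table 2C, not HPP itself: HPP's Task, Future, and two-dimensional ranges are replaced here by one asynchronous task per output row.

    // Standard C++11 analogue of Table 2C: matrix multiplication expressed as
    // asynchronous tasks whose completion is observed through futures.
    #include <future>
    #include <vector>

    void matrixMul(int size,
                   const double* inputA,
                   const double* inputB,
                   double* output) {
        std::vector<std::future<void>> futures;
        for (int i = 0; i < size; ++i) {
            // One task per output row; std::future<void> is the synchronization proxy.
            futures.push_back(std::async(std::launch::async, [=] {
                for (int j = 0; j < size; ++j) {
                    double sum = 0;
                    for (int k = 0; k < size; ++k)
                        sum += inputA[i * size + k] * inputB[k * size + j];
                    output[i * size + j] = sum;
                }
            }));
        }
        for (auto& f : futures) f.wait();  // corresponds to future.wait() in Table 2C
    }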
  • 3. Tasks
  • In one example, the HPP programming model provides asynchronous tasks that execute on the grid. The difference between HPP tasks and conventional OpenCL tasks is that HPP tasks encode the behavior of an asynchronous agent that can execute like a ConcRT-style task or an OpenCL-style dispatch.
  • Table 3A below includes example pseudo code that defines an HPP task as a template class.
  • TABLE 3A
    template<
       typename ReturnType_,
       typename IndexType_>
    class Task
    {
    public:
       typedef std::vector<ReturnType_> ReturnDataType;
       template< typename FunctionType_ >
       Task( FunctionType_ f );
       template<
          typename T_,
          typename RangeType_>
       auto enqueue(
          RangeType_ r,
          Future<T_> )
       -> Future<ReturnDataType>;
    };
  • In one example, as HPP is an asynchronous tasking model, a developer configures inter-task dependencies. The Future<T> type controls dependencies by encapsulating an initially unknown result that will become available at some later point in the future, as demonstrated in an example in Table 2C, above. Waiting on or assigning from a future waits on completion and gives access to the now-available value.
  • Table 3B is example source code that shows the execution of two tasks. The functionality of the two tasks, f1 and f2, is elided for space and represented as ( . . . ). The futures of tasks f1 and f2 are combined into a single future f3, which is waited upon via the f3.wait( ) function.
  • TABLE 3B
    Future<int> f1 = Task<int>(...).enqueue(...);
    Future<float> f2 = Task<float>(...).enqueue(...);
    auto f3 = f1 && f2;
    f3.wait( );
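  • The f1 && f2 composition in Table 3B is an HPP facility rather than standard C++. Its intent, a future that becomes ready only once both inputs are ready, can be approximated in C++11 as follows; when_both is a helper invented for this sketch, not an HPP or standard-library API.

    // Rough standard C++11 analogue of Table 3B: combine two futures into one that
    // becomes ready only after both complete. when_both is a helper invented for
    // this sketch, not an HPP or standard-library API.
    #include <future>

    template <typename A, typename B>
    std::future<void> when_both(std::shared_future<A> fa, std::shared_future<B> fb) {
        return std::async(std::launch::async, [fa, fb] {
            fa.wait();
            fb.wait();
        });
    }

    int main() {
        std::shared_future<int>   f1 = std::async(std::launch::async, [] { return 42; }).share();
        std::shared_future<float> f2 = std::async(std::launch::async, [] { return 3.5f; }).share();
        auto f3 = when_both(f1, f2);
        f3.wait();  // corresponds to f3.wait() in Table 3B
    }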
  • 4. Distributed Arrays
  • The memory hierarchy of modern computer architectures is complex and explicitly or implicitly exposes different memory levels and localities. An example of explicitly managed scratch pad memory structure is the memory visible in a conventional OpenCL programming model. Another example is an SMP system that has similar properties, such as a NUMA locality. However, without knowledge of cache layout, false sharing is an issue for multi-threaded applications.
  • A class of programming languages called partitioned global address space (PGAS) assumes a single global address space that can be logically partitioned into regions. Each region may be allocated to a particular local processor. In PGAS a window is mapped over parts of the global memory creating local memory regions. Explicit loads and stores move data in and out of those local memory regions. Global memory provides a shared and coherent view of all memory, while scratch pad memories provide “local” disjoint views, internally shared and coherent, on to subsets of the global view.
  • In practice, devices have multiple memories. Example memories are cache memories and on chip global memories. Distributed arrays in HPP programming model generalize the multiple memories into a PGAS abstraction of persistent user managed memory regions. The regions sub-divide memory (i.e., a single unified global memory or regions themselves). Visibility of the memory regions, i.e., memory sharing and coherence, is defined with respect to a region node and its ancestors.
  • One example use case is to abstractly manage OpenCL's workgroup local memory, as shown in FIG. 2, and described in detail below. However, the invention is not limited to this embodiment.
  • In an embodiment, distributed arrays are defined in terms of regions and segments. Regions are accessible entities that may be placed into memory and accessed. A region defines a memory visibility constraint as a layer in the hierarchy. Segments are leaf memory allocations. Leaves are created by distributing a region across a set of nodes in the execution graph. A region may be divided into segments based on the number of subtasks created at the appropriate level of the hierarchy. Unlike a conventional global memory, distributed arrays that are bound to executions are segmented. A bound segment can be accessed from a particular workgroup, but may or may not be accessed by other workgroups.
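  • As a purely illustrative data-structure sketch (the class name, members, and element type below are invented for this example and are not the HPP DistArray interface), a distributed array can be modeled as a region that is distributed into per-workgroup leaf segments, with visibility resolved by ownership:

    // Hypothetical model of regions and segments: a region subdivides its parent,
    // and each leaf segment is bound to one workgroup. Names are illustrative only.
    #include <cstddef>
    #include <memory>
    #include <vector>

    struct Region {
        Region* parent = nullptr;               // nullptr for the root (global) region
        std::vector<std::unique_ptr<Region>> children;
        std::vector<float> storage;             // non-empty only for leaf segments
        int owningWorkgroup = -1;               // -1 means "not bound to a workgroup"

        // Distribute this region into one leaf segment per workgroup.
        void distribute(int workgroups, std::size_t elemsPerSegment) {
            for (int wg = 0; wg < workgroups; ++wg) {
                std::unique_ptr<Region> seg(new Region());
                seg->parent = this;
                seg->owningWorkgroup = wg;
                seg->storage.resize(elemsPerSegment);
                children.push_back(std::move(seg));
            }
        }

        // A leaf segment is visible to its owning workgroup; interior regions
        // (and the root) are shared among their descendants' workgroups.
        bool visibleTo(int workgroup) const {
            return owningWorkgroup == -1 || owningWorkgroup == workgroup;
        }
    };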
  • FIG. 2 is a block diagram 200 that shows memory management using distributed arrays, according to an embodiment of the present invention.
  • Table 4A below includes example pseudo code that defines a distributed array as a template class.
  • TABLE 4A
    template<
       typename T_ = void,
       bool Persistent_ = true,
       template <class Type_> class AccessPattern_ =
          ScatterGather>
    class DistArray
    {
    ...
    };
  • When an instance of distributed array is created, the distributed array is unbound, as illustrated by an unbound distributed array in FIG. 2. Once created, abstract regions and sub-regions in unbound distributed array may be allocated.
  • When the unbound array is passed to a kernel it becomes a bound array, as illustrated by bound distributed array in FIG. 2. In an embodiment, the pseudo code for binding unbound distributed array and matching it with a corresponding kernel argument is shown in Table 4B below:
  • TABLE 4B
    template<
       typename T_ = void,
       template <class Type_> class AccessPattern_ =
          ScatterGather>
    class BoundDistArray
    {
       ...
       Region<T_> getRegion( );
    };
  • Once the bound distributed array is within a kernel, a specific region within the bound distributed array can be accessed using a getRegion( ) function. The getRegion( ) function returns a region in the bound distributed array. The example pseudo code for the returned region is shown in Table 4C below.
  • TABLE 4C
    template <
       typename T_,
       template<typename Type_> class AccessPattern_ =
          StructuredArrayAccess>
    class Region : public AccessPattern_<T_>
    {
       ...
       size_t getRegionSize( );
    };
  • In the example pseudo code in Table 4C, a region's access interface is defined by the parameter AccessPattern. For example, StructuredArrayAccess defines a Fortran array style interface exposing an array class (designated as [ ] in Fortran), along with its members to support array slicing and transformations.
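  • To illustrate what such an access-pattern policy might look like, the sketch below shows a hypothetical StructuredArrayAccess-style policy that supplies element indexing and a simple slice, with Region inheriting from it as in Table 4C. The member names and the use of std::vector for storage are assumptions of this sketch, not the HPP definition.

    // Hypothetical access-pattern policy in the spirit of StructuredArrayAccess:
    // Region inherits from the policy, which supplies indexing and slicing over the
    // region's storage. Purely illustrative; not the HPP interface.
    #include <cstddef>
    #include <vector>

    template <typename Type_>
    class StructuredArrayAccess {
    public:
        Type_& operator[](std::size_t i) { return data_[i]; }
        const Type_& operator[](std::size_t i) const { return data_[i]; }

        // Return a copy of the half-open element range [first, last).
        std::vector<Type_> slice(std::size_t first, std::size_t last) const {
            return std::vector<Type_>(data_.begin() + first, data_.begin() + last);
        }

    protected:
        std::vector<Type_> data_;
    };

    template <typename T_,
              template <typename> class AccessPattern_ = StructuredArrayAccess>
    class Region : public AccessPattern_<T_> {
    public:
        explicit Region(std::size_t n) { this->data_.resize(n); }
        std::size_t getRegionSize() const { return this->data_.size(); }
    };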
  • Example pseudo code for using distributed arrays is shown in Table 4D below.
  • TABLE 4D
    DistArray<float> darray;
    Region<float> region;
    region = darray.allocRegion(darray.getMaxRegionSize( ));
    parallelFor(
       Range<1,1>(
       darray.getTotalSize( ),
          Range<1>(region.getSize( ))),
       darray,
       [region] (
           Index<1> index,
          BoundDistArray<float> a) _device(hpp) {
          a(region)[index.getLocalX( )] += index.getX( );
    });
  • In this example, a single region in the distributed array is allocated using darray.allocRegion(darray.getMaxRegionSize( )) function. Once allocated, the region is bound in the execution of the kernel, using a _device(hpp) function included in pseudo code in Table 4D. The region is accessed within the kernel using the local workgroup ID index for each workitem. This example highlights a key feature of distributed arrays in the HPP programming model. Namely, because coherence is described in terms of ancestors, it is safe to allocate an independent region to each workgroup.
  • In an embodiment, the memory implementation moves regions into on-chip scratch pad memories on the GPU on demand. The memory implementation also performs cache memory prefetching on the CPU. In an embodiment, the memory implementation also moves regions, depending on location in the region tree, into scratch pad memories, or moves a family of regions whose access is known to be limited to a particular CPU or GPU.
  • 5. Channels
  • Although GPU cores may be used for general purpose computing, GPUs are primarily used to process graphics workloads. In an embodiment, graphics workloads are data-flow pipes. For example, a graphics workload may include hull shading, tessellation and domain shading, which can be implemented as a pipe that amplifies or consumes work at each stage. The hull shader specifies tessellation factors for the edges of a triangle such that the tessellator might divide that triangle into many other triangles. An example use case is varying the level of detail of an object based on its distance from the camera: the closer the object is to the viewer, the more detail is needed.
  • Conventional hardware scheduling and memory buffers can handle these workloads efficiently and are optimized to maintain a high level of utilization. The hardware scheduler schedules just enough work for the GPU at each stage to keep the pipeline busy without starvation. Conventional programming models for GPUs, however, do not expose such a capability.
  • Because the hardware is designed to manage pipelines of this sort, the HPP programming model exposes this feature to the developer. To this end, the HPP programming model adopts the concept of communication channels and applies it to dynamic scheduling systems. Given the massively data-parallel nature of GPU dispatches, the usual approach is for the hardware scheduler to issue more work as resources become available. The HPP programming model maintains this approach through channels: rather than relying on blocking reads, the consumer is created at the point of read in a fine-grained fashion. A similar approach is used in various CPU task-oriented runtime systems, such as the agents library that runs on top of Microsoft's Concurrency Runtime.
  • FIG. 3 is a block diagram 300 of a flowchart for data flow in a channel, according to an embodiment of the present invention. In block 1, the basic structure of a kernel, command queue, channel, and scheduling hardware (control processor) is displayed. In block 2 the kernel is enqueued, and it launches workitems in block 3. The launched workitems write into the channel in block 4. The written data is displayed in the work channel in block 5. The control processor detects a launch condition for the channel in block 6 and launches consumer workitems in block 7. The consumer workitems consume the contents of the channel in block 8. At block 9, the process continues as the next set of workitems is written into the channel.
  • This implementation approach differs from a conventional approach that exposes fixed-function and programmable processing stages linked via data queues; that conventional approach, however, does not provide the coordination language and scheduling capabilities of the HPP programming model.
  • The channel interface may be defined by the pseudo code in Table 5A below, according to one embodiment.
  • TABLE 5A
     template<class T_>
     class Channel
     {
     public:
        Channel(size_t);
        template<typename F_>
        void executeWith(
           Coordinator const& coord,
           Range<1> r,
           F_ f);
        size_t size( );
        void write(const T_& v);
     };
  • The executeWith( ) method in Table 5A associates with the channel a coordinator predicate that returns true when the corresponding consumer kernel should be dispatched. Additionally, the channel write( ) method blocks if the channel is full, thus allowing consumers to drain the data stored in the channel before producers continue. In the HPP programming platform, channel data stores are locked into on-chip cache and are therefore limited in size. An advantage is that data passed between producer and consumer retains good locality.
  • In an embodiment, coordinators are control programs that describe when to trigger consumers, as described above. They are expressed in a restricted domain-specific language embedded in C++.
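  • For illustration only, a host-side analogue of this channel and coordinator pairing can be sketched with standard C++ threading primitives. The ToyChannel class and shouldDispatch predicate below are hypothetical stand-ins, not the HPP Channel or Coordinator; they only demonstrate a write( ) that blocks when the channel is full and a predicate that a scheduler could poll before launching the consumer (compare the c->size( ) == numWorkGroups test in Table 5B below).
     #include <condition_variable>
     #include <cstddef>
     #include <deque>
     #include <mutex>

     // Hypothetical bounded channel: write( ) blocks while the channel is full.
     template <typename T>
     class ToyChannel {
     public:
        explicit ToyChannel(std::size_t capacity) : capacity_(capacity) {}

        void write(const T& v) {
           std::unique_lock<std::mutex> lk(m_);
           notFull_.wait(lk, [&] { return q_.size() < capacity_; });
           q_.push_back(v);
        }

        std::size_t size() const {
           std::lock_guard<std::mutex> lk(m_);
           return q_.size();
        }

        // Consumer drains the channel, making room for further writes.
        std::deque<T> drain() {
           std::lock_guard<std::mutex> lk(m_);
           std::deque<T> out;
           out.swap(q_);
           notFull_.notify_all();
           return out;
        }

     private:
        std::size_t capacity_;
        mutable std::mutex m_;
        std::condition_variable notFull_;
        std::deque<T> q_;
     };

     // Coordinator-style predicate: launch the consumer once the channel holds
     // one partial result per producing workgroup.
     bool shouldDispatch(const ToyChannel<int>& c, std::size_t numWorkGroups) {
        return c.size() == numWorkGroups;
     }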
  • The following example in Table 5B, which calculates a global reduction, ties together distributed arrays and channels. For simplicity, the example assumes that the input size is a multiple of the MVECTOR_SIZE variable. A single distributed array is used with two disjoint regions. A single channel is used to store the result of each workgroup's reduction, with a trigger executing a second kernel to reduce the resulting channel data once the channel is full.
  • TABLE 5B
     int channelSize = 32;
     vector<int> input = ... ;
     Channel<int> results(channelSize);
     DistArray<int> darray;
     Region<int> region1; // used in the 1st pass
     Region<int> region2; // used in the 2nd pass
     region1 = darray.allocRegion(MVECTOR_SIZE);
     region2 = darray.allocRegion(channelSize);
     int result;
     results.executeWith(
        // coordinator: dispatch the consumer once every workgroup has
        // written its partial result into the channel
        [=] (Channel<int>* c) -> bool _device(coord) {
           return c->size( ) == numWorkGroups;
        },
        Range<1,1>(channelSize, channelSize),
        darray,
        [&result, region2] (
           Index<1,1> index,
           BoundDistArray<int> a,
           vector<int> v) _device(hpp) {
           int id = index.getLocalX( );
           Segment<int> seg = a(region2);
           seg[id] = v[id];
           seg.barrier( );
           for (int offset = get_local_size(0) / 2; offset > 0;
                offset = offset / 2) {
              if (id < offset) {
                 int other = seg[id + offset];
                 int mine = seg[id];
                 seg[id] = mine + other;
              }
              seg.barrier( );
           }
           if (id == 0) {
              result = seg[0];
           }
        });
     parallelFor(
        Range<1,1>(input.size( ), MVECTOR_SIZE),
        darray,
        [&results, input, region1] (
           Index<1,1> index,
           BoundDistArray<int> a) _device(hpp) {
           // first-pass per-workgroup reduction over region1; each
           // workgroup writes its partial result into the results channel
        });
  • The example in Table 5B demonstrates the use of distributed arrays for localized communication and the use of channels for global communication in the HPP programming model.
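  • For comparison only, the same two-pass structure can be written as plain host C++ with no HPP constructs. The sketch below (the function twoPassReduce is hypothetical, and the "workgroups" run sequentially) reduces each block of the input to a partial sum, collects the partials in a buffer standing in for the results channel, and reduces the partials once all of them are present, mirroring the trigger condition in Table 5B.
     #include <algorithm>
     #include <cstddef>
     #include <numeric>
     #include <vector>

     // First pass: one partial sum per "workgroup"-sized block of the input.
     // Second pass: runs once all partials are present (the analogue of the
     // channel trigger firing when the channel is full).
     int twoPassReduce(const std::vector<int>& input, std::size_t blockSize) {
        std::vector<int> partials; // stands in for the results channel
        for (std::size_t start = 0; start < input.size(); start += blockSize) {
           std::size_t end = std::min(start + blockSize, input.size());
           partials.push_back(std::accumulate(
              input.begin() + static_cast<std::ptrdiff_t>(start),
              input.begin() + static_cast<std::ptrdiff_t>(end), 0));
        }
        return std::accumulate(partials.begin(), partials.end(), 0);
     }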
  • 6. Barriers
  • Coordinating shared data is critical to developing parallel programs that scale. Conventional GPGPU solutions limit synchronization via barrier operations to memory consistency among workitems that reach the same program counter (PC). They are further limited to cases that contain no divergent control flow, or to divergent control flow in which it is guaranteed that all workitems enter a conditional branch if any one workitem enters it.
  • HPP addresses these limitations by introducing barriers that can be used within divergent control flow and across workgroups.
  • The source code in Table 6A below defines the barrier class and the relevant methods, in one embodiment.
  • TABLE 6A
    class Barrier
    {
    public:
       Barrier(size_t count);
       void skip( );
       void wait( );
       void arrive( );
    };
  • In the example above, a barrier is initialized with a count that represents the number of participants in the barrier. In one embodiment, the participants may be workitems. The barrier class also includes skip( ), wait( ), and arrive( ) methods.
  • The wait( ) method blocks the workitem that calls it from continuing execution until the other participants (i.e., workitems) have also taken part. In an embodiment, the wait( ) method may be performed by a consumer.
  • The arrive( ) method may be performed by a workitem that participates in the barrier but does not wait for the other workitems. In an embodiment, the arrive( ) method may be performed by a producer.
  • The skip( ) method may be performed by a workitem that withdraws from further participation in the barrier. The withdrawn workitem does not count against the other participants that have executed a waiting method. In an embodiment, the skip( ) method may be used by a workitem that has left the execution loop, so that the remaining workitems may continue synchronizing on the barrier after it leaves.
  • The methods above allow barriers to be used within control flow. For example, workitems that enter the else or exit branch of the control flow can call the skip( ) method and withdraw from participation. The remaining workitems can then continue iterating, communicating through scratch memory, and waiting on the barrier.
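  • These semantics have close counterparts in standard C++20: std::barrier's arrive_and_wait( ) resembles wait( ), arrive( ) resembles arrive( ), and arrive_and_drop( ) resembles skip( ), in that a dropped thread no longer counts toward later phases. The sketch below uses that analogy only to illustrate the behavior on a CPU; it is not how HPP barriers are implemented.
     #include <barrier>
     #include <cstdio>
     #include <thread>
     #include <vector>

     int main() {
        constexpr int participants = 8;
        std::barrier<> sync(participants);
        std::vector<std::jthread> threads;

        for (int id = 0; id < participants; ++id) {
           threads.emplace_back([id, &sync] {
              if (id < 4) {
                 sync.arrive_and_wait();  // ~ wait(): block until the phase completes
                 std::printf("worker %d passed the barrier\n", id);
                 sync.arrive_and_drop();  // ~ skip(): stop participating
              } else {
                 // Withdraw immediately; the remaining threads can still synchronize.
                 sync.arrive_and_drop();  // ~ skip()
              }
           });
        }
        return 0; // jthreads join on destruction
     }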
  • The example source code for using barriers in a control flow is shown in Table 6B:
  • TABLE 6B
     Barrier b(8);
     // 8 workitems, matching the number of barrier participants
     parallelFor(Range<1>(8), [&b, scratch, x] (Index<1> i) {
        scratch[i.getX( )] = i.getX( );
        if( i.getX( ) < 4 ) {
           for( int j = 0; j < i.getX( ); ++j ) {
              b.wait( );
              x[i.getX( )] += scratch[j+1];
           }
           b.skip( );
        } else {
           b.skip( );
           x[i.getX( )] = 17;
        }
     });
  • By passing barrier objects to functions and skipping elsewhere, those functions can safely synchronize on the barrier without dependencies on external workitems. For example, consider the function in Table 6C below:
  • TABLE 6C
     void someOpaqueLibraryFunction(const int i, Barrier &b);

     Barrier b(8);
     parallelFor(Range<1>(8), [&b, scratch, x] (Index<1> i) {
        scratch[i.getX( )] = i.getX( );
        if( i.getX( ) < 4 ) {
           someOpaqueLibraryFunction(i.getX( ), b);
        } else {
           b.skip( );
           x[i.getX( )] = 17;
        }
     });
  • In addition to allowing barrier objects in control flow, the HPP programming model controls the use of barriers to maintain proper execution of a workgroup. For example, replacing the call to the skip( ) method in the else branch of Table 6C with wait( ) may be invalid, because it may not be possible to know how many times someOpaqueLibraryFunction( ) uses the barrier. Instead of replacing the skip( ) method with a wait( ) method, two barriers may be used in the HPP programming model. This embodiment is shown in Table 6D below:
  • TABLE 6D
     Barrier b(8);
     Barrier b2(8);
     parallelFor(Range<1>(8), [&b, &b2, scratch, x] (Index<1> i) {
        scratch[i.getX( )] = i.getX( );
        if( i.getX( ) < 4 ) {
           someOpaqueLibraryFunction(i.getX( ), b);
           b2.wait( );
        } else {
           b.skip( );
           b2.wait( );
           x[i.getX( )] = 17;
        }
     });
  • In an embodiment, barrier objects may also be used to synchronize dependent kernels. For example, the host may delegate to multiple CPU devices to process the function, as shown in Table 6E below:
  • TABLE 6E
    for(...) {
       parallelFor(Range<1>(N), foo);
    }
  • In Table 6E, implicit synchronization occurs after each invocation of the parallelFor( ) function. The intention is to push the "for loop" onto the respective GPU, reducing the cost of synchronization between the host and the device, as shown in Table 6G below:
  • TABLE 6G
     void foo(Index<1> index, ...) _device(hpp)
     {
        for(...) {
           foo(index, ...);
           _gpu_sync( );
        }
     }
  • In Table 6G, the function _gpu_sync( ) is an inter work-group barrier operation.
  • In an embodiment, the cross-workgroup variant of HPP's barrier may be implemented using the Global Data Share (GDS) in AMD's HD7970 GPU. GDS is a 64 KB on-chip global memory with barrier functionality across the whole device. Additionally, the _gpu_sync( ) function may be implemented using the algorithm described above.
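  • For illustration only, one well-known way to build a device-wide synchronization primitive out of atomic counters is a sense-reversing barrier. The host C++ sketch below (the class name GlobalSyncBarrier is hypothetical) shows that general technique to convey the kind of operation _gpu_sync( ) performs; it is not necessarily the algorithm referred to above.
     #include <atomic>
     #include <cstddef>
     #include <thread>

     // Generic sense-reversing barrier built from atomic counters. The last
     // thread to arrive resets the counter and flips the shared sense,
     // releasing every spinning waiter for the current phase.
     class GlobalSyncBarrier {
     public:
        explicit GlobalSyncBarrier(std::size_t participants)
           : expected_(participants), count_(participants), sense_(false) {}

        // Each caller keeps its own localSense, initialized to false.
        void sync(bool& localSense) {
           localSense = !localSense;
           if (count_.fetch_sub(1, std::memory_order_acq_rel) == 1) {
              count_.store(expected_, std::memory_order_relaxed);
              sense_.store(localSense, std::memory_order_release);
           } else {
              while (sense_.load(std::memory_order_acquire) != localSense) {
                 std::this_thread::yield(); // spin until the phase flips
              }
           }
        }

     private:
        const std::size_t expected_;
        std::atomic<std::size_t> count_;
        std::atomic<bool> sense_;
     };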
  • The Summary and Abstract sections may set forth one or more but not all exemplary embodiments of the present invention as contemplated by the inventor(s), and thus, are not intended to limit the present invention and the appended claims in any way.
  • The present invention has been described above with the aid of functional building blocks illustrating the implementation of specified functions and relationships thereof. The boundaries of these functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternate boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed.
  • The foregoing description of the specific embodiments will so fully reveal the general nature of the invention that others can, by applying knowledge within the skill of the art, readily modify and/or adapt for various applications such specific embodiments, without undue experimentation, without departing from the general concept of the present invention. Therefore, such adaptations and modifications are intended to be within the meaning and range of equivalents of the disclosed embodiments, based on the teaching and guidance presented herein. It is to be understood that the phraseology or terminology herein is for the purpose of description and not of limitation, such that the terminology or phraseology of the present specification is to be interpreted by the skilled artisan in light of the teachings and guidance.
  • The breadth and scope of the present invention should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.

Claims (24)

What is claimed is:
1. A method comprising:
encapsulating an initially unknown result that will become an available result after an asynchronous task is executed;
executing the asynchronous task on a grid; and
assigning the available result to the asynchronous task in response to the result becoming available during the executing.
2. The method of claim 1, further comprising:
using the asynchronous task to enable task and data parallelism in a heterogeneous computing platform.
3. The method of claim 1, further comprising:
declaring the asynchronous task using an object oriented programming language.
4. A system for executing an asynchronous task, comprising:
a heterogeneous computing platform including at least one GPU processor and configured to:
encapsulate an initially unknown result that will become an available result after the asynchronous task is executed;
execute the asynchronous task on a grid; and
assign the available result to the asynchronous task in response to the result becoming available during the execution.
5. The system of claim 4, further comprising a task parallel runtime configured to use the asynchronous task to enable task and data parallelism in the heterogeneous computing platform.
6. The system of claim 4, wherein the heterogeneous computing platform is further configured to declare the asynchronous task using an object oriented programming language.
7. A method comprising:
generating an unbound distributed array in a plurality of memories of different types associated with a heterogeneous computing platform;
binding the distributed array for a kernel configured to execute a workgroup on a processor in the heterogeneous computing platform; and
accessing the distributed array bound to the kernel as the kernel executes the workgroup.
8. The method of claim 7, further comprising:
generalizing the plurality of memories of different types into a persistent global address space (PGAS) abstraction; and
receiving an indication from the kernel for managing a region in the PGAS abstraction.
9. The method of claim 7, wherein a memory in the plurality of memories is a global chip memory.
10. The method of claim 7, wherein the memory is a cache memory.
11. The method of claim 7, further comprising:
allocating a plurality of regions and a plurality of segments within the distributed array.
12. The method of claim 11, wherein the accessing further comprises:
accessing a region in a plurality of regions using a workgroup ID index associated with the workgroup.
13. The method of claim 12, wherein:
the workgroup further comprises a plurality of workitems, and
the workgroup ID index identifies a workitem in the plurality of workitems.
14. The method of claim 11, further comprising:
moving the plurality of regions in the distributed array to the scratch pad memory on the GPU device.
15. The method of claim 7, further comprising:
performing a cache memory prefetching for the distributed array on a CPU.
16. A system comprising:
a heterogeneous parallel primitives (HPP) platform configured to:
generate an unbound distributed array in a plurality of memories of different types;
bind the distributed array to a kernel configured to execute a workgroup on a processor in a heterogeneous computing platform; and
access the distributed array bound to the kernel as the kernel executes the workgroup.
17. The system of claim 16, wherein the HPP platform is further configured to:
generalize the plurality of memories of different types into a persistent global address space (PGAS) abstraction; and
receive an indication from the kernel for managing a region in the PGAS abstraction.
18. The system of claim 16, wherein a memory in the plurality of memories is a global chip memory.
19. The system of claim 16, wherein the memory is a cache memory.
20. The system of claim 16, wherein the HPP platform is further configured to:
allocate a plurality of regions and a plurality of segments within the distributed array.
21. The system of claim 20, wherein the HPP platform is further configured to:
access a region in the plurality of regions using a workgroup ID index associated with the workgroup.
22. The system of claim 21, wherein:
the workgroup further comprises a plurality of workitems, and
the workgroup ID index identifies a workitem in the plurality of workitems.
23. The system of claim 20, wherein the HPP platform is further configured to:
move the plurality of regions in the distributed array to the scratch pad memory on the CPU device.
24. The system of claim 16, wherein the HPP platform is further configured to:
perform a cache memory prefetching for the distributed array on a CPU.
US13/904,791 2012-05-29 2013-05-29 Heterogeneous Parallel Primitives Programming Model Abandoned US20130332937A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US13/904,791 US20130332937A1 (en) 2012-05-29 2013-05-29 Heterogeneous Parallel Primitives Programming Model
US15/797,702 US11231962B2 (en) 2012-05-29 2017-10-30 Heterogeneous parallel primitives programming model

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201261652772P 2012-05-29 2012-05-29
US13/904,791 US20130332937A1 (en) 2012-05-29 2013-05-29 Heterogeneous Parallel Primitives Programming Model

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US15/797,702 Continuation US11231962B2 (en) 2012-05-29 2017-10-30 Heterogeneous parallel primitives programming model

Publications (1)

Publication Number Publication Date
US20130332937A1 true US20130332937A1 (en) 2013-12-12

Family

ID=49716354

Family Applications (2)

Application Number Title Priority Date Filing Date
US13/904,791 Abandoned US20130332937A1 (en) 2012-05-29 2013-05-29 Heterogeneous Parallel Primitives Programming Model
US15/797,702 Active 2034-06-23 US11231962B2 (en) 2012-05-29 2017-10-30 Heterogeneous parallel primitives programming model

Family Applications After (1)

Application Number Title Priority Date Filing Date
US15/797,702 Active 2034-06-23 US11231962B2 (en) 2012-05-29 2017-10-30 Heterogeneous parallel primitives programming model

Country Status (1)

Country Link
US (2) US20130332937A1 (en)

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140189317A1 (en) * 2012-12-28 2014-07-03 Oren Ben-Kiki Apparatus and method for a hybrid latency-throughput processor
US20150187040A1 (en) * 2013-12-27 2015-07-02 Jayanth N. Rao Scheduling and dispatch of gpgpu workloads
US20160139624A1 (en) * 2014-11-14 2016-05-19 Advanced Micro Devices, Inc. Processor and methods for remote scoped synchronization
US20170061566A1 (en) * 2015-08-26 2017-03-02 Intel Corporation Technologies for offloading network packet processing to a gpu
US9690894B1 (en) * 2015-11-02 2017-06-27 Altera Corporation Safety features for high level design
US10083037B2 (en) 2012-12-28 2018-09-25 Intel Corporation Apparatus and method for low-latency invocation of accelerators
US20180300139A1 (en) * 2015-10-29 2018-10-18 Intel Corporation Boosting local memory performance in processor graphics
US10140129B2 (en) 2012-12-28 2018-11-27 Intel Corporation Processing core having shared front end unit
US10223436B2 (en) 2016-04-27 2019-03-05 Qualcomm Incorporated Inter-subgroup data sharing
US10230529B2 (en) * 2015-07-31 2019-03-12 Microsft Technology Licensing, LLC Techniques to secure computation data in a computing environment
US10346195B2 (en) 2012-12-29 2019-07-09 Intel Corporation Apparatus and method for invocation of a multi threaded accelerator
US10379869B2 (en) * 2014-08-26 2019-08-13 International Business Machines Corporation Optimize control-flow convergence on SIMD engine using divergence depth
US10409560B1 (en) * 2015-11-18 2019-09-10 Amazon Technologies, Inc. Acceleration techniques for graph analysis programs
US10769837B2 (en) 2017-12-26 2020-09-08 Samsung Electronics Co., Ltd. Apparatus and method for performing tile-based rendering using prefetched graphics data
US11275586B2 (en) * 2020-05-29 2022-03-15 Advanced Micro Devices, Inc. Task graph generation for workload processing
US20220091880A1 (en) * 2020-09-24 2022-03-24 Advanced Micro Devices, Inc. Fine-grained conditional dispatching
US11429769B1 (en) * 2020-10-30 2022-08-30 Xilinx, Inc. Implementing a hardware description language memory using heterogeneous memory primitives
US11481256B2 (en) 2020-05-29 2022-10-25 Advanced Micro Devices, Inc. Task graph scheduling for workload processing

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3551353B2 (en) * 1998-10-02 2004-08-04 株式会社日立製作所 Data relocation method
US9478062B2 (en) * 2006-09-19 2016-10-25 Imagination Technologies Limited Memory allocation in distributed memories for multiprocessing
US20080196030A1 (en) * 2007-02-13 2008-08-14 Buros William M Optimizing memory accesses for multi-threaded programs in a non-uniform memory access (numa) system
US7921261B2 (en) * 2007-12-18 2011-04-05 International Business Machines Corporation Reserving a global address space
US8224955B2 (en) * 2009-05-07 2012-07-17 International Business Machines Corporation Ensuring affinity at all affinity domains by folding at each affinity level possible for a partition spanning multiple nodes
KR101613971B1 (en) * 2009-12-30 2016-04-21 삼성전자주식회사 Method for transforming program code
US8635626B2 (en) * 2010-12-29 2014-01-21 Sap Ag Memory-aware scheduling for NUMA architectures

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5999987A (en) * 1994-02-11 1999-12-07 International Business Machines Corporation Concurrent processing in object oriented parallel and near parallel
US6182154B1 (en) * 1994-11-21 2001-01-30 International Business Machines Corporation Universal object request broker encapsulater
US8381203B1 (en) * 2006-11-03 2013-02-19 Nvidia Corporation Insertion of multithreaded execution synchronization points in a software program
US20080282064A1 (en) * 2007-05-07 2008-11-13 Michael Norman Day System and Method for Speculative Thread Assist in a Heterogeneous Processing Environment
US20090259996A1 (en) * 2008-04-09 2009-10-15 Vinod Grover Partitioning cuda code for execution by a general purpose processor
US20110246960A1 (en) * 2009-07-25 2011-10-06 Irina Kleingon Methods for software mass production
US20120054770A1 (en) * 2010-08-31 2012-03-01 International Business Machines Corporation High throughput computing in a hybrid computing environment

Cited By (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10083037B2 (en) 2012-12-28 2018-09-25 Intel Corporation Apparatus and method for low-latency invocation of accelerators
US10664284B2 (en) 2012-12-28 2020-05-26 Intel Corporation Apparatus and method for a hybrid latency-throughput processor
US20140189317A1 (en) * 2012-12-28 2014-07-03 Oren Ben-Kiki Apparatus and method for a hybrid latency-throughput processor
US9417873B2 (en) * 2012-12-28 2016-08-16 Intel Corporation Apparatus and method for a hybrid latency-throughput processor
US10255077B2 (en) 2012-12-28 2019-04-09 Intel Corporation Apparatus and method for a hybrid latency-throughput processor
US10140129B2 (en) 2012-12-28 2018-11-27 Intel Corporation Processing core having shared front end unit
US10095521B2 (en) 2012-12-28 2018-10-09 Intel Corporation Apparatus and method for low-latency invocation of accelerators
US10089113B2 (en) 2012-12-28 2018-10-02 Intel Corporation Apparatus and method for low-latency invocation of accelerators
US10346195B2 (en) 2012-12-29 2019-07-09 Intel Corporation Apparatus and method for invocation of a multi threaded accelerator
US10937118B2 (en) * 2013-12-27 2021-03-02 Intel Corporation Scheduling and dispatch of GPGPU workloads
US20150187040A1 (en) * 2013-12-27 2015-07-02 Jayanth N. Rao Scheduling and dispatch of gpgpu workloads
US20190259129A1 (en) * 2013-12-27 2019-08-22 Intel Corporation Scheduling and dispatch of gpgpu workloads
US10235732B2 (en) * 2013-12-27 2019-03-19 Intel Corporation Scheduling and dispatch of GPGPU workloads
US10936323B2 (en) * 2014-08-26 2021-03-02 International Business Machines Corporation Optimize control-flow convergence on SIMD engine using divergence depth
US20190294444A1 (en) * 2014-08-26 2019-09-26 International Business Machines Corporation Optimize control-flow convergence on simd engine using divergence depth
US10379869B2 (en) * 2014-08-26 2019-08-13 International Business Machines Corporation Optimize control-flow convergence on SIMD engine using divergence depth
US20160139624A1 (en) * 2014-11-14 2016-05-19 Advanced Micro Devices, Inc. Processor and methods for remote scoped synchronization
US9804883B2 (en) * 2014-11-14 2017-10-31 Advanced Micro Devices, Inc. Remote scoped synchronization for work stealing and sharing
US10230529B2 (en) * 2015-07-31 2019-03-12 Microsft Technology Licensing, LLC Techniques to secure computation data in a computing environment
US20170061566A1 (en) * 2015-08-26 2017-03-02 Intel Corporation Technologies for offloading network packet processing to a gpu
US10445850B2 (en) * 2015-08-26 2019-10-15 Intel Corporation Technologies for offloading network packet processing to a GPU
US20180300139A1 (en) * 2015-10-29 2018-10-18 Intel Corporation Boosting local memory performance in processor graphics
US10768935B2 (en) * 2015-10-29 2020-09-08 Intel Corporation Boosting local memory performance in processor graphics
US20200371804A1 (en) * 2015-10-29 2020-11-26 Intel Corporation Boosting local memory performance in processor graphics
US20180218094A1 (en) * 2015-11-02 2018-08-02 Altera Corporation Safety features for high level design
US10366190B2 (en) * 2015-11-02 2019-07-30 Altera Corporation Safety features for high level design
US9690894B1 (en) * 2015-11-02 2017-06-27 Altera Corporation Safety features for high level design
US10007748B2 (en) 2015-11-02 2018-06-26 Altera Corporation Safety features for high level design
US10409560B1 (en) * 2015-11-18 2019-09-10 Amazon Technologies, Inc. Acceleration techniques for graph analysis programs
US10223436B2 (en) 2016-04-27 2019-03-05 Qualcomm Incorporated Inter-subgroup data sharing
US10769837B2 (en) 2017-12-26 2020-09-08 Samsung Electronics Co., Ltd. Apparatus and method for performing tile-based rendering using prefetched graphics data
US11275586B2 (en) * 2020-05-29 2022-03-15 Advanced Micro Devices, Inc. Task graph generation for workload processing
US11481256B2 (en) 2020-05-29 2022-10-25 Advanced Micro Devices, Inc. Task graph scheduling for workload processing
US20220091880A1 (en) * 2020-09-24 2022-03-24 Advanced Micro Devices, Inc. Fine-grained conditional dispatching
US11809902B2 (en) * 2020-09-24 2023-11-07 Advanced Micro Devices, Inc. Fine-grained conditional dispatching
US11429769B1 (en) * 2020-10-30 2022-08-30 Xilinx, Inc. Implementing a hardware description language memory using heterogeneous memory primitives

Also Published As

Publication number Publication date
US20180060124A1 (en) 2018-03-01
US11231962B2 (en) 2022-01-25

Similar Documents

Publication Publication Date Title
US11231962B2 (en) Heterogeneous parallel primitives programming model
US11625885B2 (en) Graphics processor with non-blocking concurrent architecture
Gaster et al. Can gpgpu programming be liberated from the data-parallel bottleneck?
McCool Scalable programming models for massively multicore processors
Augonnet et al. StarPU: a unified platform for task scheduling on heterogeneous multicore architectures
US9424099B2 (en) Method and system for synchronization of workitems with divergent control flow
Heller et al. Closing the performance gap with modern c++
Dastgeer et al. Flexible runtime support for efficient skeleton programming on heterogeneous GPU-based systems
Neelima et al. Recent trends in software and hardware for GPGPU computing: a comprehensive survey
Dokulil et al. Implementing the open community runtime for shared-memory and distributed-memory systems
Feinbube et al. Joint forces: from multithreaded programming to GPU computing
Miyoshi et al. FLAT: A GPU programming framework to provide embedded MPI
Zheng et al. HiWayLib: A software framework for enabling high performance communications for heterogeneous pipeline computations
Huynh et al. Mapping streaming applications onto GPU systems
Odajima et al. GPU/CPU work sharing with parallel language XcalableMP-dev for parallelized accelerated computing
Janjic et al. Lapedo: hybrid skeletons for programming heterogeneous multicore machines in Erlang
Aumage et al. Task-based performance portability in hpc
Aldinucci et al. Accelerating sequential programs using FastFlow and self-offloading
Kessler et al. Flexible scheduling and thread allocation for synchronous parallel tasks
Jääskeläinen et al. TCEMC: A co-design flow for application-specific multicores
Danalis et al. Scalable dense linear algebra on heterogeneous hardware
Diehl et al. Shared memory parallelism in Modern C++ and HPX
Schuele Efficient parallel execution of streaming applications on multi-core processors
Chauhan et al. Parallel Computing Models and Analysis of OpenMP Optimization on Intel i7 and Xeon Processors
Yamagiwa et al. Carsh: A commandline execution support for stream-based acceleration environment

Legal Events

Date Code Title Description
AS Assignment

Owner name: ADVANCED MICRO DEVICES, INC., CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:GASTER, BENEDICT R.;HOWES, LEE W.;SIGNING DATES FROM 20130725 TO 20131104;REEL/FRAME:031647/0412

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION