WO2001016750A2 - High-availability, shared-memory cluster - Google Patents

High-availability, shared-memory cluster

Info

Publication number
WO2001016750A2
Authority
WO
WIPO (PCT)
Prior art keywords
shared
memory
shared memory
processing node
smn
Application number
PCT/US2000/024329
Other languages
French (fr)
Other versions
WO2001016750A3 (en)
Inventor
Lynn Parker West
Ted Scardamalia
Original Assignee
Times N Systems, Inc.
Application filed by Times N Systems, Inc.
Priority to AU71136/00A (AU7113600A)
Publication of WO2001016750A2
Publication of WO2001016750A3

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/52: Program synchronisation; Mutual exclusion, e.g. by means of semaphores
    • G06F 9/526: Mutual exclusion algorithms
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00: Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02: Addressing or allocation; Relocation
    • G06F 12/0223: User address space allocation, e.g. contiguous or non contiguous base addressing
    • G06F 12/0284: Multiple user address space allocation, e.g. using different base addresses
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00: Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02: Addressing or allocation; Relocation
    • G06F 12/08: Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F 12/0802: Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F 12/0806: Multiuser, multiprocessor or multiprocessing cache systems
    • G06F 12/0815: Cache consistency protocols
    • G06F 12/0817: Cache consistency protocols using directory methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00: Arrangements for software engineering
    • G06F 8/40: Transformation of program code
    • G06F 8/41: Compilation
    • G06F 8/45: Exploiting coarse grain parallelism in compilation, i.e. parallelism between groups of instructions
    • G06F 8/457: Communication
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005: Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5011: Allocation of resources, e.g. of the central processing unit [CPU] to service a request, the resources being hardware resources other than CPUs, servers and terminals
    • G06F 9/5016: Allocation of resources, e.g. of the central processing unit [CPU] to service a request, the resource being the memory
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/54: Interprogram communication
    • G06F 9/544: Buffers; Shared memory; Pipes
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00: Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02: Addressing or allocation; Relocation
    • G06F 12/08: Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F 12/0802: Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F 12/0806: Multiuser, multiprocessor or multiprocessing cache systems
    • G06F 12/0815: Cache consistency protocols
    • G06F 12/0837: Cache consistency protocols with software control, e.g. non-cacheable data
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2209/00: Indexing scheme relating to G06F 9/00
    • G06F 2209/52: Indexing scheme relating to G06F 9/52
    • G06F 2209/523: Mode

Definitions

  • the invention relates generally to the field of multiprocessor computer systems. More particularly, the invention relates to computer systems that utilize a high-availability, shared-memory cluster.
  • the clustering of workstations is a well-known art. In the most common cases, the clustering involves workstations that operate almost totally independently, utilizing the network only to share such services as a printer, license-limited applications, or shared files.
  • some software packages allow a cluster of workstations to share work.
  • the work arrives, typically as batch jobs, at an entry point to the cluster where it is queued and dispatched to the workstations on the basis of load.
  • message-passing means that a given workstation operates on some portion of a job until communications (to send or receive data, typically) with another workstation is necessary. Then, the first workstation prepares and communicates with the other workstation.
  • MPP Massively Parallel Processor
  • a highly streamlined message-passing subsystem can typically require 10,000 to 20,000 CPU cycles or more.
  • Message-passing parallel processor systems have been offered commercially for years but have failed to capture significant market share because of poor performance and difficulty of programming for typical parallel applications. Message-passing parallel processor systems do have some advantages. In particular, because they share no resources, message-passing parallel processor systems are easier to provide with high-availability features. What is needed is a better approach to parallel processor systems.
  • U.S. Patent Applications 09/273,430, filed March 19, 1999 and PCT/US00/01262, filed January 18, 2000 are hereby expressly incorporated by reference herein for all purposes.
  • U.S. Ser. No. 09/273,430 improved upon the concept of shared memory by teaching the concept which will herein be referred to as a tight cluster.
  • the concept of a tight cluster is that of individual computers, each with its own CPU(s), memory, I/O, and operating system, but for which collection of computers there is a portion of memory which is shared by all the computers and via which they can exchange information.
  • 09/273,430 describes a system in which each processing node is provided with its own private copy of an operating system and in which the connection to shared memory is via a standard bus.
  • the advantage of a tight cluster in comparison to an SMP is "scalability" which means that a much larger number of computers can be attached together via a tight cluster than an SMP with little loss of processing efficiency.
  • That word and usually a number of surrounding words are transferred to that particular processor's cache memory transparently by cache-memory hardware. That word and the surrounding words (if any) are transferred into a portion of the particular processor's cache memory that is called a cache line or cache block.
  • the representation in the cache memory will become different from the value in shared memory. That cache line within that particular processor's cache memory is, at that point, called a "dirty" line.
  • the particular processor with the dirty line when accessing that memory address will see the new (modified) value.
  • Other processors accessing that memory address will see the old (unmodified) value in shared memory. This lack of coherence between such accesses will lead to incorrect results.
  • Modern computers, workstations, and PCs which provide for multiple processors and shared memory, therefore, also provide high-speed, transparent cache coherence hardware to assure that if a line in one cache changes and another processor subsequently accesses a value which is in that address range, the new values will be transferred back to memory or at least to the requesting processor.
  • the expense of the additional hardware is significant and performance is degraded.
  • Caches can be maintained coherent by software provided that sufficient cache-management instructions are provided by the manufacturer. However, in many cases, an adequate arsenal of such instructions are not provided.
  • a "heartbeat" function In a system involving multiple independent computers, a function can be provided such that each of said computers occasionally signals to at least a subset of the other processors an indication that status is operational. Failure to signal this heartbeat is a primary indication that the computer has failed, in either hardware or software. Should a companion processor fail to receive the heartbeat within a specified period of time following the previous received heartbeat signal, said companion processor will execute a verification routine. Should the results of the verification routine indicate computer failure, the system will enter checkpoint restart mode and will restart. The failed computer will be removed from the group upon restart and an operator message will be issued as part of the restart.
  • SMP symmetric multiprocessors
  • such a heartbeat function is normally not applicable, as all the processors are using a single copy of the software, and software failure is the most common failure.
  • the cache of the failed processor may contain dirty, required operating system status, so that recovery is often impossible. Since there is no way to determine from other processors whether recovery is possible, there are no known examples of SMP systems which attempt to recover from processor or memory failures.
  • although heartbeat functionality can be provided in the context of a message-passing system, such systems have performance deficiencies, as discussed above. Therefore, what is also needed is an approach to providing the heartbeat function in the context of a symmetric multiprocessor system.
  • a goal of the invention is to simultaneously satisfy the above-discussed requirements of improving and expanding the tight cluster concept which, in the case of the prior art, are not satisfied.
  • One embodiment of the invention is based on an apparatus, comprising: a shared memory unit; a first processing node coupled to said shared memory unit; and a second processing node coupled to said shared memory unit, wherein the shared memory unit includes a range of addresses that are duplicated.
  • Another embodiment of the invention is based on a method, comprising duplicating a shared address range in a shared memory unit that is coupled to a plurality of processing nodes.
  • Another embodiment of the invention is based on an electronic media, comprising: a computer program adapted to duplicate a shared address range in a shared memory unit that is coupled to a plurality of processing nodes.
  • Another embodiment of the invention is based on a computer program comprising computer program means adapted to perform the step of duplicating a shared address range in a shared memory unit that is coupled to a plurality of processing nodes when said computer program is run on a computer.
  • Another embodiment of the invention is based on a system, comprising a multiplicity of processors, each with some private memory and all sharing some portion of memory, interconnected and arranged such that memory accesses to a first set of address ranges will be to local, private memory whereas memory accesses to a second set of address ranges will be to shared memory, and arranged such that at least a portion of one special range of shared memory is duplicated so that both copies are written by each WRITE to that particular location, and each READ of the secondary (mirrored) section is discarded or precluded.
  • Another embodiment of the invention is based on a system comprised of a multiplicity of processing nodes, at least two shared-memory nodes (SMNs), and means at each processing node to connect each processing node to each of said multiple SMNs.
  • Said system to include means for communicating all Load and Store software instructions to shared memory to all of the essentially-identical shared memory nodes.
  • Said system to include means for assuring that if any of said SMNs fails, the communication means ceases use of that SMN and notifies the attached processing node that the identified SMN has failed. Said system to assure that if atomic operations are performed affecting the SMNs, all secondary SMNs are corrected to contain the same value that the primary SMN contains.
  • FIG. 1 illustrates a block schematic view of a share-as-needed basic system, representing an embodiment of the invention.
  • FIG. 2 illustrates a block schematic view of a share-as-needed highly available system, representing an embodiment of the invention.
  • FIG. 3 illustrates a block schematic view of a compute node dual port PCI adapter, representing an embodiment of the invention.
  • the invention is applicable to systems of the type of Pfister or the type of U.S. Ser. No. 09/273,430 in which each processing node has its own copy of an operating system.
  • the invention is also applicable to other types of multiple processing node systems.
  • a tight cluster is defined as a cluster of workstations or an arrangement within a single, multiple-processor machine in which the processors are connected by a high-speed, low-latency interconnection, and in which some but not all memory is shared among the processors.
  • accesses to a first set of ranges of memory addresses will be to local, private memory but accesses to a second set of memory address ranges will be to shared memory.
  • the invention can include an environment as described in U.S. Ser. No. 09/273,430 in which the second set of address ranges, the set of shared address ranges, includes at least one range which is duplicated. WRITES to this memory range are written to both memories, and READS are read from a first but are ECC checked and, in the event of ECC failure on a READ, are then subsequently read from the second memory. On such a failure, an operator warning is issued.
  • mirroring a portion of shared memory can provide high availability, by which is meant protection from both hardware and software failures.
  • the control of the mirrored memory requires that the operating system extensions and the application subsystem be written to appropriately utilize the shared memory.
  • Each processor can be provided with a very high-speed, low-latency communication means to the shared-memory.
  • the low-latency communication means can include a communication link based on traces, cables and/or optical fiber(s).
  • the low-latency communication means can include hardware (e.g., a circuit), firmware (e.g., flash memory) and/or software (e.g., a program).
  • the communications means can include on-chip traces and/or waveguides. Further, each of these communication means can be duplicated.
  • the invention can include arranging the shared memory in banks.
  • the duplicated data can be segregated.
  • the banks can be provided with separate interfaces. Further, the banks can be provided with duplicated interfaces.
  • Each processor in the system can also be provided with a specific interconnection to the shared memory, arranged such that there are two connections to a particular range of memory addresses, and two banks of shared memory responsive to said particular range of addresses, said range referred to as the mirrored portion of said shared memory.
  • the invention can include providing the duplicated shared memory ranges with separate power supplies. Similarly, the invention can include providing the duplicated interfaces with separate power supplies. In turn, the power supplies can be backed up. The invention can include eliminating all single points of failure in a shared-memory as-needed computing system.
  • a first, and most fundamental, of these features is provided by the nature of the tight cluster system: a multiplicity of processors, each running its own copy of the operating system, each from its own private memory, exchanging information via shared memory.
  • a second feature is the provision of a heartbeat function between the processors of the system.
  • the semaphore range can be used as an aid in failure recovery when accompanied by "heartbeat" mechanisms.
  • this invention teaches the concept that the owning processor write its identification to the semaphore location.
  • when subsequent heartbeat mechanisms indicate that a processor has failed, the processor detecting the heartbeat failure will search for and release semaphores owned by the failed processor.
  • the hardware subsystem, transparently to software, continuously monitors each node and informs at least two nodes when a failure does occur.
  • Normal processing is then suspended for a few milliseconds while the nodes determine which is the failing element and prepare to recover.
  • a third feature is the mirroring of shared memory. Since memory is expensive, only the most important portion(s) of shared memory can be mirrored to reduce cost. Of course, more or all of shared memory can be mirrored. For example, a semaphore region that is used to pass signals between processors can be duplicated in the mirrored-memory region.
  • a fourth feature is that the interconnection between each processor and the mirrored memory is via a separate switch section.
  • the separate switch sections can be separately powered. These sections can also be separately powered from conventional memory.
  • a fifth feature is that the operating system elements that deal with shared memory have no single point of failure.
  • a companion disclosure describes extensions that can clean up semaphores and signaling elements by a process running on a first processor on behalf of a second, failed processor.
  • the other shared resources held by the failed processor are recorded in mirrored memory, and a companion cleanup process can release these facilities.
  • a sixth feature is the provision of sufficient power-supply redundancy to protect from failure of the supply.
  • the protection will be provided by separate supplies for each processor and for the memories and for the separate switch to the mirrored memory.
  • a seventh feature is that the application or subsystem be structured to have no single point of failure. In practice, this can be achieved by snapshot-saving the operation at recurring points, and storing all state information necessary to restore the application or subsystem to that point. Any updates beyond the snapshot must be such that they can be discarded without loss, or must be separately journaled (either into mirrored memory or to other reliable storage means) for full recovery to be possible.
  • the method utilized is to determine which state information is critical to a recovery process, to store that state at specific points, and to journal all asynchronous events after that point. Then, when a failure occurs, the enabler subsystem restores the saved state, factors in the journaled events, and then resumes processing from that point.
  • An eighth feature is at least one multitailed storage facility, each with connections to at least two of the multiple processors in the system and to a disk, said disk to be utilized for all journaled information required by the application for the restoration of operation in the event of a failure.
  • the purpose of the multitailed storage facility (multiple connection) is to access the disk via a second processor should the first fail. Unless the storage facility provides its own redundancy, at least two such facilities, mirrored, are required.
  • the shared elements include a memory and an "atomic memory" in which a Load to a particular location causes not only a Load of that location but also a Store to that location.
  • These shared facilities can be prevented from causing system failure if they are duplexed, and if each Load or Store to a shared facility is passed to each node of the duplexed pair. One is designated the primary and the other, the secondary shared node. Then, if the primary one of the duplexed pair fails, the second one has all of the state information that was in the failing one and thus a switch-over to the backup node allows operation to continue.
  • Each processing node should be provided with connectivity to both of the shared-facility nodes.
  • the data which is Loaded from the secondary shared-facility node must be discarded at the processing node which issues the Load operation.
  • the Loading processor must Store back to an atomic location the data which it acquired from the primary shared node. In this way, the two shared-facility nodes will have the same data at any time that the primary node may fail.
  • the system consists of a shared-memory node which includes an atomic complex and a "doorbell" signaling mechanism by which the processing nodes signal to each other.
  • the hardware subsystem consists of PCI adapters which contain significant intelligence in hardware, with a connection mechanism between each of these PCI adapters and a companion set of PCI adapters in the primary shared memory node.
  • Each processing node is provided with a second PCI adapter or with a second channel out of its single PCI adapter.
  • the second channel is provided with a connection mechanism to the secondary shared-memory node.
  • the hardware subsystem passes information between the various processing nodes and the shared-memory node. In addition, the hardware subsystem continually monitors the node to which it is attached and the link to the companion node, and the hardware adapters continuously pass this information to each other.
  • the hardware subsystem is set to a state in which the hardware subsystem can differentiate which shared-memory node is the primary and which is the secondary.
  • writes are passed to both shared memory nodes.
  • PCI READs are also passed to both, but PCI READ responses from the secondary shared-memory node are discarded.
  • the second node operates in a "stealth" mode, copying information sent by software to the primary PCI adapter but otherwise remaining quiescent at its processing-node interface, unless it determines that the secondary node or the connection to the secondary node is not properly operational, at which point it signals the software at both ends with an interrupt and relies on software to notify an operator that the backup system has failed.
  • Figure 1 is a drawing of an over-all share-as-needed system, showing multiple processor nodes, a single shared-memory node, and individual connection means connecting the processing nodes to the shared-memory node.
  • element 101 shows the processing nodes in the system. There can be multiple such nodes, as Figure 1 shows.
  • Element 102 shows the shared-memory node for the system of figure 1
  • element 103 shows the links from the various processing nodes to the single shared-memory node.
  • Figure 2 shows a drawing of a system showing multiple processor nodes, two shared-memory nodes, and connection means linking each processing node to both shared-memory nodes.
  • element 201 shows the processing nodes in the system.
  • Element 202 shows the primary shared- memory node for the system of figure 2
  • element 203 is the secondary shared-memory node in that system.
  • Element 204 shows the links from the various processing nodes to both the primary and the secondary shared-memory nodes.
  • Figure 3 shows a drawing of a PCI adapter at a processing node, showing multiple link interfaces to the multiple shared-memory nodes.
  • element 301 shows the PCI Bus interface logic
  • element 302 shows the address translator which determines whether a PCI Read or Write Command is intended for shared memory.
  • Element 303 is the data buffers used for passing data to and from the PCI interface
  • element 304 is the various control registers required to manage the operation of a PCI adapter.
  • Elements 305 and 307 are the send-side interfaces to the primary and secondary shared-memory units respectively, and elements 306 and 308 are the corresponding receive-side interfaces to the shared-memory units.
  • Element 309 directs the PCI Read and Write Commands to elements 305 and 307. In addition, element 309 accepts the results of those commands from elements 306 and 308. During normal operation, element 309 performs three functions. First, for ordinary PCI Read commands, it accepts the result from the primary SMN (if received) and, if received correctly, 309 discards the result from the secondary SMN. For ordinary PCI Write commands, 309 accepts the acknowledgements from both SMNs to be sure both are received correctly. For atomic PCI Read commands, element 309 accepts the result from the primary SMN (if received) and, if received correctly, 309 then compares the data to that received from the secondary SMN. If they differ, element 309 issues an atomic Store to the addressed atomic location within the secondary SMN to assure that the two SMNs remain coherent.
  • if the secondary SMN fails, element 309 notifies software, via one of the control registers, that the secondary SMN has failed, and then abandons operations to the secondary SMN.
  • if the primary SMN fails, element 309 notifies software, via one of the control registers, that the primary SMN has failed, abandons operations to the primary SMN, and elevates the interface to the secondary SMN to primary status.
  • when notified by software to switch to the secondary SMN, the adapter of Figure 3, through element 309, abandons operations to the primary SMN and begins operations to the secondary SMN.
  • This process could be done using two different adapters in each processing node.
  • Three SMNs could be used in conjunction with majority logic at the processing node to detect additional failure modes (a brief majority-vote sketch follows at the end of this list).
  • the primary SMN can provide the result of an atomic Read to the secondary SMN.
  • preferred embodiments of the invention can be identified one at a time by testing for the substantially highest performance.
  • the test for the substantially highest performance can be carried out without undue experimentation by the use of a simple and conventional benchmark (speed) experiment.
  • substantially is defined as at least approaching a given state (e.g., preferably within 10% of, more preferably within 1% of, and most preferably within 0.1% of).
  • coupled as used herein, is defined as connected, although not necessarily directly, and not necessarily mechanically.
  • means as used herein, is defined as hardware, firmware and/or software for achieving a result.
  • program, or the phrase computer program, as used herein, is defined as a sequence of instructions designed for execution on a computer system.
  • a program may include a subroutine, a function, a procedure, an object method, an object implementation, an executable application, an applet, a servlet, a source code, an object code, and/or other sequence of instructions designed for execution on a computer system.
  • Practical Applications of the Invention. A practical application of the invention that has value within the technological arts is waveform transformation. Further, the invention is useful in conjunction with data input and transformation (such as are used for the purpose of speech recognition), or in conjunction with transforming the appearance of a display (such as are used for the purpose of video games), or the like. There are virtually innumerable uses for the invention, all of which need not be detailed here.
  • Advantages of the Invention. A system representing an embodiment of the invention can be cost effective and advantageous for at least the following reasons.
  • the invention improves the availability of parallel computing systems.
  • the invention improves the reliability of parallel computing systems.
  • the invention also improves the scalability of parallel computing systems.
  • although the high-availability, shared-memory cluster described herein can be a separate module, it will be manifest that the high-availability, shared-memory cluster may be integrated into the system with which it is associated.
  • all the disclosed elements and features of each disclosed embodiment can be combined with, or substituted for, the disclosed elements and features of every other disclosed embodiment except where such elements or features are mutually exclusive.
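As a brief illustration of the three-SMN variant mentioned in the list above, the following C sketch shows majority voting over three read results at the processing node. It is illustrative only; the patent does not detail the voting hardware, and the function and value names are assumptions.

    /* Majority-vote sketch for a three-SMN configuration: keep any value that
     * at least two of the three shared-memory nodes agree on; a lone dissenter
     * indicates a failed or inconsistent SMN that should be reported. */
    #include <stdbool.h>
    #include <stdio.h>
    #include <stdint.h>

    /* Returns true and sets *out when at least two of the three replies agree. */
    bool majority_read(uint32_t a, uint32_t b, uint32_t c, uint32_t *out)
    {
        if (a == b || a == c) { *out = a; return true; }
        if (b == c)           { *out = b; return true; }
        return false;                  /* no majority: unrecoverable mismatch */
    }

    int main(void)
    {
        uint32_t v;
        if (majority_read(0x5A5A, 0x5A5A, 0xFFFF, &v))   /* third SMN disagrees */
            printf("majority value 0x%04X; dissenting SMN should be reported\n",
                   (unsigned)v);
        return 0;
    }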

Abstract

Methods, systems and devices are described for a high-availability, shared-memory cluster. An apparatus includes a shared memory unit; a first processing node coupled to the shared memory unit; and a second processing node coupled to the shared memory unit. The shared memory unit includes a range of addresses that are duplicated. The methods, systems and devices provide advantages because the speed and scalability of parallel processor systems are enhanced.

Description

HIGH-AVAILABILITY, SHARED-MEMORY CLUSTER
BACKGROUND OF THE INVENTION
1. Field of the Invention
The invention relates generally to the field of multiprocessor computer systems. More particularly, the invention relates to computer systems that utilize a high-availability, shared-memory cluster.
2. Discussion of the Related Art
The clustering of workstations is a well-known art. In the most common cases, the clustering involves workstations that operate almost totally independently, utilizing the network only to share such services as a printer, license-limited applications, or shared files.
In more-closely-coupled environments, some software packages (such as NQS) allow a cluster of workstations to share work. In such cases the work arrives, typically as batch jobs, at an entry point to the cluster where it is queued and dispatched to the workstations on the basis of load.
In both of these cases, and all other known cases of clustering, the operating system and cluster subsystem are built around the concept of message-passing. The term message-passing means that a given workstation operates on some portion of a job until communications (to send or receive data, typically) with another workstation is necessary. Then, the first workstation prepares and communicates with the other workstation.
Another well-known art is that of clustering processors within a machine, usually called a Massively Parallel Processor or MPP, in which the techniques are essentially identical to those of clustered workstations. Usually, the bandwidth and latency of the interconnect network of an MPP are more highly optimized, but the system operation is the same.
In the general case, the passing of a message is an extremely expensive operation; expensive in the sense that many CPU cycles in the sender and receiver are consumed by the process of sending, receiving, bracketing, verifying, and routing the message, CPU cycles that are therefore not available for other operations. A highly streamlined message-passing subsystem can typically require 10,000 to 20,000 CPU cycles or more.
There are specific cases wherein the passing of a message requires significantly less overhead. However, none of these specific cases is adaptable to a general-purpose computer system.
Message-passing parallel processor systems have been offered commercially for years but have failed to capture significant market share because of poor performance and difficulty of programming for typical parallel applications. Message-passing parallel processor systems do have some advantages. In particular, because they share no resources, message-passing parallel processor systems are easier to provide with high-availability features. What is needed is a better approach to parallel processor systems.
There are alternatives to the passing of messages for closely-coupled cluster work. One such alternative is the use of shared memory for inter-processor communication.
Shared-memory systems have been much more successful at capturing market share than message-passing systems because of the dramatically superior performance of shared-memory systems, up to about four-processor systems. In Search of Clusters, Gregory F. Pfister, 2nd ed. (January 1998), Prentice Hall Computer Books, ISBN 0138997098, describes a computing system with multiple processing nodes in which each processing node is provided with private, local memory and also has access to a range of memory which is shared with other processing nodes. The disclosure of this publication in its entirety is hereby expressly incorporated herein by reference for the purpose of indicating the background of the invention and illustrating the state of the art.
However, providing high availability for traditional shared-memory systems has proved to be an elusive goal. The nature of these systems, which share all code and all data, including that data which controls the shared operating systems, is incompatible with the separation normally required for high availability. What is needed is an approach to shared-memory systems that improves availability. Although the use of shared memory for inter-processor communication is a well-known art, prior to the teachings of U.S. Ser. No. 09/273,430, filed March 19, 1999, entitled Shared Memory Apparatus and Method for Multiprocessing Systems, the processors shared a single copy of the operating system. The problem with such systems is that they cannot be efficiently scaled beyond four to eight way systems except in unusual circumstances. All known cases of said unusual circumstances are such that the systems are not good price-performance systems for general-purpose computing.
The entire contents of U.S. Patent Applications 09/273,430, filed March 19, 1999 and PCT/US00/01262, filed January 18, 2000 are hereby expressly incorporated by reference herein for all purposes. U.S. Ser. No. 09/273,430 improved upon the concept of shared memory by teaching the concept which will herein be referred to as a tight cluster. The concept of a tight cluster is that of individual computers, each with its own CPU(s), memory, I/O, and operating system, but for which collection of computers there is a portion of memory which is shared by all the computers and via which they can exchange information. U.S. Ser. No. 09/273,430 describes a system in which each processing node is provided with its own private copy of an operating system and in which the connection to shared memory is via a standard bus. The advantage of a tight cluster in comparison to an SMP is "scalability," which means that a much larger number of computers can be attached together via a tight cluster than an SMP with little loss of processing efficiency.
What is needed are improvements to the concept of the tight cluster. What is also needed is an expansion of the concept of the tight cluster.
Another well-known art is the use of memory caches to improve performance. Caches provide such a significant performance boost that most modern computers use them. At the very top of the performance (and price) range, all of memory is constructed using cache-memory technologies. However, this is such an expensive approach that few manufacturers use it. All manufacturers of personal computers (PCs) and workstations use caches except for the very low end of the PC business, where caches are omitted for price reasons and performance is, therefore, poor. Caches, however, present a problem for shared-memory computing systems: the problem of coherence. As a particular processor reads or writes a word of shared memory, that word and usually a number of surrounding words are transferred to that particular processor's cache memory transparently by cache-memory hardware. That word and the surrounding words (if any) are transferred into a portion of the particular processor's cache memory that is called a cache line or cache block.
If the transferred cache line is modified by the particular processor, the representation in the cache memory will become different from the value in shared memory. That cache line within that particular processor's cache memory is, at that point, called a "dirty" line. The particular processor with the dirty line, when accessing that memory address, will see the new (modified) value. Other processors accessing that memory address will see the old (unmodified) value in shared memory. This lack of coherence between such accesses will lead to incorrect results.
Modern computers, workstations, and PCs which provide for multiple processors and shared memory, therefore, also provide high-speed, transparent cache coherence hardware to assure that if a line in one cache changes and another processor subsequently accesses a value which is in that address range, the new values will be transferred back to memory or at least to the requesting processor. However, the expense of the additional hardware is significant and performance is degraded.
Caches can be maintained coherent by software provided that sufficient cache-management instructions are provided by the manufacturer. However, in many cases, an adequate arsenal of such instructions are not provided.
Moreover, even in cases where the instruction set is adequate, the software overhead is so great that no examples are known of commercially successful machines which use software-managed coherence.
Thus, the existing hardware and software cache coherency approaches are unsatisfactory. What is also needed, therefore, is a better approach to cache coherency.
Another well-known art is that of a "heartbeat" function. In a system involving multiple independent computers, a function can be provided such that each of said computers occasionally signals to at least a subset of the other processors an indication that its status is operational. Failure to signal this heartbeat is a primary indication that the computer has failed, in either hardware or software. Should a companion processor fail to receive the heartbeat within a specified period of time following the previous received heartbeat signal, said companion processor will execute a verification routine. Should the results of the verification routine indicate computer failure, the system will enter checkpoint restart mode and will restart. The failed computer will be removed from the group upon restart and an operator message will be issued as part of the restart.
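The heartbeat mechanism just described can be summarized in a short C sketch. This is only an illustration of the idea; the node count, timeout value, table layout, and verification routine are assumptions and are not taken from the patent.

    /* Minimal heartbeat sketch (illustrative only). Each node periodically
     * writes a timestamp into a table that all nodes can read; a companion
     * node that sees a stale timestamp runs a verification step before
     * declaring failure and entering checkpoint restart. */
    #include <stdio.h>
    #include <time.h>

    #define NODES             4
    #define HEARTBEAT_TIMEOUT 5        /* seconds without a beat => suspect */

    static time_t last_beat[NODES];    /* in a real system this table would
                                          live in the shared-memory region  */

    void send_heartbeat(int node)      /* called periodically by each node  */
    {
        last_beat[node] = time(NULL);
    }

    int verify_failure(int node)       /* placeholder verification routine  */
    {
        (void)node;
        return 1;                      /* assume the suspicion is confirmed */
    }

    void check_companions(int self)
    {
        time_t now = time(NULL);
        for (int n = 0; n < NODES; n++) {
            if (n == self)
                continue;
            if (now - last_beat[n] > HEARTBEAT_TIMEOUT && verify_failure(n)) {
                printf("node %d: node %d failed, entering checkpoint restart\n",
                       self, n);
                /* checkpoint_restart_without(n); -- hypothetical recovery hook */
            }
        }
    }

    int main(void)
    {
        for (int n = 0; n < NODES; n++)
            send_heartbeat(n);
        last_beat[2] = time(NULL) - 10;   /* simulate a node that stopped beating */
        check_companions(0);
        return 0;
    }

In a full system the restart hook would remove the failed computer from the group and issue the operator message described above.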
In symmetric multiprocessors (SMPs), such a heartbeat function is normally not applicable, as all the processors are using a single copy of the software, and software failure is the most common failure. Also, in the event of hardware failure, the cache of the failed processor may contain dirty, required operating system status, so that recovery is often impossible. Since there is no way to determine from other processors whether recovery is possible, there are no known examples of SMP systems which attempt to recover from processor or memory failures.
While the heartbeat functionality can be provided in the context of a message-passing system, such systems have performance deficiencies, as discussed above. Therefore, what is also needed is an approach to providing the heartbeat function in the context of a symmetric multiprocessor system.
SUMMARY OF THE INVENTION
A goal of the invention is to simultaneously satisfy the above-discussed requirements of improving and expanding the tight cluster concept which, in the case of the prior art, are not satisfied.
One embodiment of the invention is based on an apparatus, comprising: a shared memory unit; a first processing node coupled to said shared memory unit; and a second processing node coupled to said shared memory unit, wherein the shared memory unit includes a range of addresses that are duplicated.
Another embodiment of the invention is based on a method, comprising duplicating a shared address range in a shared memory unit that is coupled to a plurality of processing nodes.
Another embodiment of the invention is based on an electronic media, comprising: a computer program adapted to duplicate a shared address range in a shared memory unit that is coupled to a plurality of processing nodes.
Another embodiment of the invention is based on a computer program comprising computer program means adapted to perform the step of duplicating a shared address range in a shared memory unit that is coupled to a plurality of processing nodes when said computer program is run on a computer.
Another embodiment of the invention is based on a system, comprising a multiplicity of processors, each with some private memory and all sharing some portion of memory, interconnected and arranged such that memory accesses to a first set of address ranges will be to local, private memory whereas memory accesses to a second set of address ranges will be to shared memory, and arranged such that at least a portion of one special range of shared memory is duplicated so that both copies are written by each WRITE to that particular location, and each READ of the secondary (mirrored) section is discarded or precluded.
Another embodiment of the invention is based on a system comprised of a multiplicity of processing nodes, at least two shared-memory nodes (SMNs), and means at each processing node to connect each processing node to each of said multiple SMNs. Said system to include means for communicating all Load and Store software instructions to shared memory to all of the essentially-identical shared memory nodes. Said system to include means for assuring that if any of said SMNs fails, the communication means ceases use of that SMN and notifies the attached processing node that the identified SMN has failed. Said system to assure that if atomic operations are performed affecting the SMNs, all secondary SMNs are corrected to contain the same value that the primary SMN contains.
These, and other goals and embodiments of the invention will be better appreciated and understood when considered in conjunction with the following description and the accompanying drawings. It should be understood, however, that the following description, while indicating preferred embodiments of the invention and numerous specific details thereof, is given by way of illustration and not of limitation. Many changes and modifications may be made within the scope of the invention without departing from the spirit thereof, and the invention includes all such modifications.
BRIEF DESCRIPTION OF THE DRAWINGS
A clear conception of the advantages and features constituting the invention, and of the components and operation of model systems provided with the invention, will become more readily apparent by referring to the exemplary, and therefore nonlimiting, embodiments illustrated in the drawings accompanying and forming a part of this specification, wherein like reference characters (if they occur in more than one view) designate the same parts. It should be noted that the features illustrated in the drawings are not necessarily drawn to scale.
FIG. 1 illustrates a block schematic view of a share-as-needed basic system, representing an embodiment of the invention.
FIG. 2 illustrates a block schematic view of a share-as-needed highly available system, representing an embodiment of the invention.
FIG. 3 illustrates a block schematic view of a compute node dual port PCI adapter, representing an embodiment of the invention.
DESCRIPTION OF PREFERRED EMBODIMENTS
The invention and the various features and advantageous details thereof are explained more fully with reference to the nonlimiting embodiments that are illustrated in the accompanying drawings and detailed in the following description of preferred embodiments. Descriptions of well known components and processing techniques are omitted so as not to unnecessarily obscure the invention in detail. The teachings of U.S. Ser. No. 09/273,430 include a system which is a single entity; one large supercomputer. The invention is also applicable to a cluster of workstations, or even a network.
The invention is applicable to systems of the type of Pfister or the type of U.S. Ser. No. 09/273,430 in which each processing node has its own copy of an operating system. The invention is also applicable to other types of multiple processing node systems.
The context of the invention can include a tight cluster as described in U.S. Ser. No. 09/273,430. A tight cluster is defined as a cluster of workstations or an arrangement within a single, multiple-processor machine in which the processors are connected by a high-speed, low-latency interconnection, and in which some but not all memory is shared among the processors. Within the scope of a given processor, accesses to a first set of ranges of memory addresses will be to local, private memory but accesses to a second set of memory address ranges will be to shared memory. The significant advantage to a tight cluster in comparison to a message-passing cluster is that, assuming the environment has been appropriately established, the exchange of information involves a single STORE instruction by the sending processor and a subsequent single LOAD instruction by the receiving processor. The establishment of the environment, taught by U.S. Ser. No.
09/273,430 and more fully by companion disclosures (U.S. Provisional Application Ser. No. 60/220,794, filed July 26, 2000; U.S. Provisional Application Ser. No. 60/220,748, filed July 26, 2000; WSGR 15245-711; WSGR 15245-713; WSGR 15245-715; WSGR 15245-716; WSGR 15245-717; WSGR 15245-718; WSGR 15245-719; and WSGR 15245-720, the entire contents of all of which are hereby expressly incorporated herein by reference for all purposes) can be performed in such a way as to require relatively little system overhead, and to be done once for many, many information exchanges. Therefore, a comparison of 10,000 instructions for message-passing to a pair of instructions for tight-clustering is valid.
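The tight-cluster exchange model can be illustrated with a brief C sketch. The fragment below uses POSIX shm_open and mmap to stand in for the established shared-memory environment; this API choice, the region name, and the single 64-bit slot are assumptions made for illustration, not part of the disclosure. Once the mapping exists, the exchange itself is the single STORE and single LOAD described above.

    /* Sketch of the tight-cluster exchange model: after the shared mapping is
     * established, passing a value is one store by the sender and one load by
     * the receiver. Compile with -lrt on some systems. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define SHM_NAME "/tight_cluster_demo"   /* hypothetical region name */

    int main(void)
    {
        int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0600);
        if (fd < 0) { perror("shm_open"); return 1; }
        if (ftruncate(fd, sizeof(uint64_t)) < 0) { perror("ftruncate"); return 1; }

        volatile uint64_t *slot = mmap(NULL, sizeof(uint64_t),
                                       PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (slot == MAP_FAILED) { perror("mmap"); return 1; }

        *slot = 42;                    /* sender: one STORE into shared memory  */
        uint64_t value = *slot;        /* receiver: one LOAD from shared memory */
        printf("received %llu\n", (unsigned long long)value);

        munmap((void *)slot, sizeof(uint64_t));
        close(fd);
        shm_unlink(SHM_NAME);
        return 0;
    }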
The invention can include an environment as described in U.S. Ser. No. 09/273,430 in which the second set of address ranges, the set of shared address ranges, includes at least one range which is duplicated. WRITES to this memory range are written to both memories, and READS are read from a first but are ECC checked and, in the event of ECC failure on a READ, are then subsequently read from the second memory. On such a failure, an operator warning is issued.
Within such a system, mirroring a portion of shared memory can provide high availability, by which is meant protection from both hardware and software failures. The control of the mirrored memory requires that the operating system extensions and the application subsystem be written to appropriately utilize the shared memory.
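The mirrored-range policy described above (every WRITE hits both copies; READs come from the first copy unless its check fails, in which case the second copy is read and an operator warning is issued) can be sketched in software as follows. The per-word checksum stands in for the hardware ECC, and the bank layout and names are invented for illustration.

    /* Policy sketch for the duplicated (mirrored) shared range. */
    #include <stdio.h>
    #include <stdint.h>

    #define RANGE_WORDS 1024

    typedef struct {
        uint32_t data[RANGE_WORDS];
        uint8_t  check[RANGE_WORDS];     /* stand-in for ECC bits */
    } mirror_bank;

    static mirror_bank primary_bank, secondary_bank;

    static uint8_t checksum(uint32_t w)
    {
        return (uint8_t)(w ^ (w >> 8) ^ (w >> 16) ^ (w >> 24));
    }

    void mirrored_write(size_t idx, uint32_t value)
    {
        primary_bank.data[idx]    = value;             /* WRITE hits both banks */
        primary_bank.check[idx]   = checksum(value);
        secondary_bank.data[idx]  = value;
        secondary_bank.check[idx] = checksum(value);
    }

    uint32_t mirrored_read(size_t idx)
    {
        uint32_t v = primary_bank.data[idx];           /* READ from the primary */
        if (checksum(v) == primary_bank.check[idx])
            return v;
        fprintf(stderr, "operator warning: ECC failure in primary bank, word %zu\n",
                idx);
        return secondary_bank.data[idx];               /* fall back to the mirror */
    }

    int main(void)
    {
        mirrored_write(7, 0xDEADBEEF);
        primary_bank.check[7] ^= 0xFF;                 /* simulate a primary fault */
        printf("read back 0x%08X\n", (unsigned)mirrored_read(7));
        return 0;
    }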
Each processor can be provided with a very high-speed, low-latency communication means to the shared-memory. The low-latency communication means can include a communication link based on traces, cables and/or optical fiber(s). The low-latency communication means can include hardware (e.g., a circuit), firmware (e.g., flash memory) and/or software (e.g., a program). The communications means can include on-chip traces and/or waveguides. Further, each of these communication means can be duplicated.
The invention can include arranging the shared memory in banks. The duplicated data can be segregated. The banks can be provided with separate interfaces. Further, the banks can be provided with duplicated interfaces.
Each processor in the system can also be provided with a specific interconnection to the shared memory, arranged such that there are two connections to a particular range of memory addresses, and two banks of shared memory responsive to said particular range of addresses, said range referred to as the mirrored portion of said shared memory.
The invention can include providing the duplicated shared memory ranges with separate power supplies. Similarly, the invention can include providing the duplicated interfaces with separate power supplies. In turn, the power supplies can be backed up. The invention can include eliminating all single points of failure in a shared-memory as-needed computing system.
The provision of high availability is facilitated by several hardware and software features, the identity and function of which are taught by this invention. A first, and most fundamental, of these features is provided by the nature of the tight cluster system: a multiplicity of processors, each running its own copy of the operating system, each from its own private memory, exchanging information via shared memory. A second feature is the provision of a heartbeat function between the processors of the system. The semaphore range can be used as an aid in failure recovery when accompanied by "heartbeat" mechanisms. When capturing a semaphore, this invention teaches the concept that the owning processor write its identification to the semaphore location. When subsequent heartbeat mechanisms indicate that a processor has failed, the processor detecting the heartbeat failure will search for and release semaphores owned by the failed processor.
This can be implemented if the various nodes are able to determine when any node, processing or shared-facility, fails. In the preferred embodiment, the hardware subsystem, transparently to software, continuously monitors each node and informs at least two nodes when a failure does occur.
Normal processing is then suspended for a few milliseconds while the nodes determine which is the failing element and prepare to recover.
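The owner-tagged semaphore idea can be sketched as follows. The layout (one word per semaphore, zero meaning free, a non-zero node identification meaning owned) and the use of a C11 compare-and-swap are assumptions for illustration; the patent's own atomic complex is described later in the text.

    /* Sketch of owner-tagged semaphores in the mirrored region. */
    #include <stdatomic.h>
    #include <stdio.h>

    #define NUM_SEMAPHORES 64
    #define FREE 0

    static _Atomic unsigned semaphore[NUM_SEMAPHORES];  /* lives in shared memory */

    int acquire(unsigned sem, unsigned node_id)
    {
        unsigned expected = FREE;
        /* the owner writes its identification into the semaphore location */
        return atomic_compare_exchange_strong(&semaphore[sem], &expected, node_id);
    }

    void release(unsigned sem)
    {
        atomic_store(&semaphore[sem], FREE);
    }

    /* Called by the processor that detected a heartbeat failure: release every
     * semaphore still tagged with the failed processor's identification.     */
    void release_semaphores_of(unsigned failed_node)
    {
        for (unsigned s = 0; s < NUM_SEMAPHORES; s++) {
            unsigned owner = atomic_load(&semaphore[s]);
            if (owner == failed_node)
                atomic_compare_exchange_strong(&semaphore[s], &owner, FREE);
        }
    }

    int main(void)
    {
        acquire(3, 2);                 /* node 2 captures semaphore 3        */
        release_semaphores_of(2);      /* node 2 later fails; cleanup runs   */
        printf("semaphore 3 owner after cleanup: %u\n", atomic_load(&semaphore[3]));
        return 0;
    }

Because the semaphore region is duplicated in the mirrored-memory range, this cleanup remains possible even if one shared-memory bank has failed.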
A third feature is the mirroring of shared memory. Since memory is expensive, only the most important portion(s) of shared memory can be mirrored to reduce cost. Of course, more or all of shared memory can be mirrored. For example, a semaphore region that is used to pass signals between processors can be duplicated in the mirrored-memory region.
A fourth feature is that the interconnection between each processor and the mirrored memory is via a separate switch section. The separate switch sections can be separately powered. These sections can also be separately powered from conventional memory.
A fifth feature is that the operating system elements that deal with shared memory have no single point of failure. A companion disclosure describes extensions that can clean up semaphores and signaling elements by a process running on a first processor on behalf of a second, failed processor.
Similarly, in the preferred embodiment the other shared resources held by the failed processor are recorded in mirrored memory, and a companion cleanup process can release these facilities.
A sixth feature is the provision of sufficient power-supply redundancy to protect from failure of the supply. In the preferred embodiment, the protection will be provided by separate supplies for each processor and for the memories and for the separate switch to the mirrored memory.
A seventh feature is that the application or subsystem be structured to have no single point of failure. In practice, this can be achieved by snapshot-saving the operation at recurring points, and storing all state information necessary to restore the application or subsystem to that point. Any updates beyond the snapshot must be such that they can be discarded without loss, or must be separately journaled (either into mirrored memory or to other reliable storage means) for full recovery to be possible.
This can be termed a recovery process. This requirement must be met by an application or an application enabler subsystem running on the basic system.
The method utilized is to determine which state information is critical to a recovery process, to store that state at specific points, and to journal all asynchronous events after that point. Then, when a failure occurs, the enabler subsystem restores the saved state, factors in the journaled events, and then resumes processing from that point.
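The checkpoint-and-journal method can be sketched in a few lines of C. The state structure, journal size, and event type are invented for illustration; only the store/journal/restore/replay sequence reflects the text.

    /* Recovery-model sketch: snapshot critical state, journal asynchronous
     * events after the snapshot, and on failure restore the snapshot and
     * replay the journal. */
    #include <stdio.h>
    #include <string.h>

    typedef struct { long counter; } app_state;        /* hypothetical critical state */
    typedef struct { long delta;   } journal_event;

    static app_state     snapshot;
    static journal_event journal[256];
    static int           journal_len;

    void take_checkpoint(const app_state *live)
    {
        snapshot = *live;                               /* store state at this point */
        journal_len = 0;                                /* journal restarts after it */
    }

    void journal_append(long delta)
    {
        if (journal_len < 256)
            journal[journal_len++].delta = delta;       /* record asynchronous event */
    }

    void recover(app_state *live)
    {
        *live = snapshot;                               /* restore the saved state   */
        for (int i = 0; i < journal_len; i++)
            live->counter += journal[i].delta;          /* factor in journaled events */
    }

    int main(void)
    {
        app_state live = { .counter = 100 };
        take_checkpoint(&live);
        live.counter += 5; journal_append(5);
        live.counter += 7; journal_append(7);
        memset(&live, 0, sizeof live);                  /* simulate a failure        */
        recover(&live);
        printf("recovered counter = %ld\n", live.counter);
        return 0;
    }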
An eighth feature is at least one multitailed storage facility, each with connections to at least two of the multiple processors in the system and to a disk, said disk to be utilized for all journaled information required by the application for the restoration of operation in the event of a failure. The purpose of the multitailed storage facility (multiple connection) is to access the disk via a second processor should the first fail. Unless the storage facility provides its own redundancy, at least two such facilities, mirrored, are required.
Shared-memory systems of the type described in U.S. Patent Application Number 09/273,430, a filing assigned to Times N Systems, Inc., are quite amenable to high-availability design. In a system of this type, each processing node is provided with a full set of privately-owned facilities, including processor, memory, and I/O devices. Therefore, only the most exceptional of failures by one node will affect other nodes in the system. Only those elements which need to be shared are shared in such a system. These include a portion of memory and mechanisms to assure coordination of processing for shared tasks. The design is therefore such that the failure of any single processing node can be prevented from causing system failure. The shared elements include a memory and an "atomic memory" in which a Load to a particular location causes not only a Load of that location but also a Store to that location. These shared facilities can be prevented from causing system failure if they are duplexed, and if each Load or Store to a shared facility is passed to each node of the duplexed pair. One is designated the primary and the other, the secondary shared node. Then, if the primary one of the duplexed pair fails, the second one has all of the state information that was in the failing one and thus a switch-over to the backup node allows operation to continue.
Each processing node should be provided with connectivity to both of the shared-facility nodes. The data which is Loaded from the secondary shared-facility node must be discarded at the processing node which issues the Load operation. To assure correct recovery, competing Load operations to the atomic facility (which may not arrive in the same order at the secondary node as at the primary node) must result in the same data being stored in the corresponding locations in the two shared-facility nodes. Therefore, the Loading processor must Store back to an atomic location the data which it acquired from the primary shared node. In this way, the two shared-facility nodes will have the same data at any time that the primary node may fail.
THE PREFERRED EMBODIMENT
In the preferred embodiment, the system consists of a shared-memory node which includes an atomic complex and a "doorbell" signaling mechanism by which the processing nodes signal to each other. The hardware subsystem consists of PCI adapters which contain significant intelligence in hardware, with a connection mechanism between each of these PCI adapters and a companion set of PCI adapters in the primary shared memory node.
Each processing node is provided with a second PCI adapter or with a second channel out of its single PCI adapter. The second channel is provided with a connection mechanism to the secondary shared-memory node. The hardware subsystem passes information between the various processing nodes and the shared-memory node. In addition, the hardware subsystem continually monitors the node to which it is attached and the link to the companion node, and the hardware adapters continuously pass this information to each other.
Similarly, the hardware subsystem is set to a state in which it can differentiate which shared-memory node is the primary and which is the secondary. Writes are passed to both shared-memory nodes. PCI READs are also passed to both, but PCI READ responses from the secondary shared-memory node are discarded.
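A minimal sketch of this mirroring policy for ordinary (non-atomic) traffic follows: Writes are forwarded to both shared-memory nodes, and Reads are forwarded to both while only the primary's response is used. The helper names `smn_write` and `smn_read` are assumptions for illustration, not part of the disclosure.

```c
#include <stdint.h>

typedef enum { PRIMARY_SMN = 0, SECONDARY_SMN = 1 } smn_id_t;

/* Assumed per-SMN transfer helpers. */
extern void     smn_write(smn_id_t smn, uint32_t addr, uint32_t value);
extern uint32_t smn_read(smn_id_t smn, uint32_t addr);

/* PCI Write: passed to both shared-memory nodes. */
void mirrored_write(uint32_t addr, uint32_t value)
{
    smn_write(PRIMARY_SMN, addr, value);
    smn_write(SECONDARY_SMN, addr, value);
}

/* PCI READ: passed to both, but the secondary's response is discarded. */
uint32_t mirrored_read(uint32_t addr)
{
    uint32_t primary_data   = smn_read(PRIMARY_SMN, addr);
    uint32_t secondary_data = smn_read(SECONDARY_SMN, addr);

    (void)secondary_data;   /* secondary READ response is discarded */
    return primary_data;
}
```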
In the case where two separate PCI adapters are used in the processing node, the second adapter operates in a "stealth" mode, copying information sent by software to the primary PCI adapter but otherwise remaining quiescent at its processing-node interface, unless it determines that the secondary node or the connection to the secondary node is not properly operational, at which point it signals the software at both ends with an interrupt and relies on software to notify an operator that the backup system has failed.
For either case (one dual-connection PCI adapter or two separate adapters), software at the processing node re-programs the hardware to fully activate the secondary connection when the primary connection reports significant problems.
Figure 1 is a drawing of an overall share-as-needed system, showing multiple processing nodes, a single shared-memory node, and individual connection means connecting the processing nodes to the shared-memory node. With reference to figure 1, element 101 shows the processing nodes in the system. There can be multiple such nodes, as figure 1 shows. Element 102 shows the shared-memory node for the system of figure 1, and element 103 shows the links from the various processing nodes to the single shared-memory node. Figure 2 shows a drawing of a system with multiple processing nodes, two shared-memory nodes, and connection means linking each processing node to both shared-memory nodes. With reference to figure 2, element 201 shows the processing nodes in the system. There can be multiple such nodes, as figure 2 shows. Element 202 shows the primary shared-memory node for the system of figure 2, and element 203 is the secondary shared-memory node in that system. Element 204 shows the links from the various processing nodes to both the primary and the secondary shared-memory nodes.
Figure 3 shows a drawing of a PCI adapter at a processing node, showing multiple link interfaces to the multiple shared-memory nodes. With reference to figure 3, element 301 shows the PCI Bus interface logic, and element 302 shows the address translator which determines whether a PCI Read or Write Command is intended for shared memory. Element 303 shows the data buffers used for passing data to and from the PCI interface, and element 304 shows the various control registers required to manage the operation of the PCI adapter. Elements 305 and 307 are the send-side interfaces to the primary and secondary shared-memory units, respectively, and elements 306 and 308 are the corresponding receive-side interfaces to those shared-memory units.
Element 309 directs the PCI Read and Write Commands to elements 305 and 307. In addition, element 309 accepts the results of those commands from elements 306 and 308. During normal operation, element 309 performs three functions. First, for ordinary PCI Read commands, it accepts the result from the primary SMN (if received) and, if that result is received correctly, 309 discards the result from the secondary SMN. Second, for ordinary PCI Write commands, 309 accepts the acknowledgements from both SMNs to be sure both are received correctly. Third, for atomic PCI Read commands, element 309 accepts the result from the primary SMN (if received) and, if that result is received correctly, 309 then compares it to the data received from the secondary SMN. If they differ, element 309 issues an atomic Store to the addressed atomic location within the secondary SMN to assure that the two SMNs remain coherent.
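The atomic-Read handling attributed to element 309 can be sketched as follows. This is a software model of behavior the disclosure assigns to adapter hardware; the reply structure and the helpers `smn_atomic_read` and `smn_atomic_store` are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { PRIMARY_SMN = 0, SECONDARY_SMN = 1 } smn_id_t;

typedef struct {
    bool     received_ok;   /* response arrived and was checked correctly */
    uint32_t data;
} smn_reply_t;

/* Assumed per-SMN atomic transfer helpers. */
extern smn_reply_t smn_atomic_read(smn_id_t smn, uint32_t addr);
extern void        smn_atomic_store(smn_id_t smn, uint32_t addr, uint32_t value);

/* Atomic PCI Read handling modeled on element 309: accept the primary's
 * result and, if the secondary returned different data, issue an atomic
 * Store to the secondary so the two SMNs remain coherent. */
uint32_t element_309_atomic_read(uint32_t addr)
{
    smn_reply_t primary   = smn_atomic_read(PRIMARY_SMN, addr);
    smn_reply_t secondary = smn_atomic_read(SECONDARY_SMN, addr);

    if (primary.received_ok && secondary.received_ok &&
        primary.data != secondary.data) {
        /* Correct the secondary's atomic location with the primary's value. */
        smn_atomic_store(SECONDARY_SMN, addr, primary.data);
    }
    return primary.data;
}
```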
For all of the above operations, if the secondary SMN fails to respond or produces an illegal response, element 309 notifies software, via one of the control registers, that the secondary SMN has failed, and then abandons operations to the secondary SMN. Similarly, for all of the above operations, if the primary SMN fails to respond or produces an illegal response, element 309 notifies software, via one of the control registers, that the primary SMN has failed, abandons operations to the primary SMN, and elevates the interface to the secondary SMN to primary status. When notified by software to switch to the secondary SMN, the adapter of figure 3, through element 309, abandons operations to the primary SMN and begins operations to the secondary SMN.
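The failure handling just described might be modeled as the small state update below. The control-register bit positions, the state fields, and the function name are all assumptions introduced for this sketch and are not specified in the disclosure.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { PRIMARY_SMN = 0, SECONDARY_SMN = 1 } smn_id_t;

/* Assumed bits in one of the adapter control registers (element 304). */
#define CTRL_SECONDARY_SMN_FAILED  (1u << 0)
#define CTRL_PRIMARY_SMN_FAILED    (1u << 1)

typedef struct {
    volatile uint32_t control_reg;   /* read by software for failure notices */
    smn_id_t active_primary;         /* which SMN currently plays the primary role */
    bool     secondary_in_use;       /* whether a backup SMN remains available */
} adapter_state_t;

/* Non-response or an illegal response from an SMN: notify software via the
 * control register, abandon the failed SMN, and promote the secondary to
 * primary status if it was the primary that failed. */
void handle_smn_failure(adapter_state_t *a, smn_id_t failed)
{
    if (failed == SECONDARY_SMN) {
        a->control_reg |= CTRL_SECONDARY_SMN_FAILED;
        a->secondary_in_use = false;          /* abandon the secondary */
    } else {
        a->control_reg |= CTRL_PRIMARY_SMN_FAILED;
        a->active_primary   = SECONDARY_SMN;  /* elevate the secondary */
        a->secondary_in_use = false;          /* no backup remains */
    }
}
```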
This process could be done using two different adapters in each processing node. Three SMNs could be used in conjunction with majority logic at the processing node to detect additional failure modes. The primary SMN can provide the result of an atomic Read to the secondary SMN.
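For the three-SMN variant mentioned above, majority logic at the processing node could compare the three read results and accept any value returned by at least two SMNs; a lone dissenter or non-responder then identifies the suspect SMN. The following sketch is an assumption about how such majority logic might look, not a description of a disclosed circuit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Majority vote over read results from three SMNs: returns true and writes
 * the agreed value if at least two of the three results match; returns
 * false when no two SMNs agree, signaling an additional failure mode. */
bool majority_vote(const uint32_t result[3], uint32_t *agreed)
{
    if (result[0] == result[1] || result[0] == result[2]) {
        *agreed = result[0];
        return true;
    }
    if (result[1] == result[2]) {
        *agreed = result[1];
        return true;
    }
    return false;
}
```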
While not being limited to any particular performance indicator or diagnostic identifier, preferred embodiments of the invention can be identified one at a time by testing for the substantially highest performance. The test for the substantially highest performance can be carried out without undue experimentation by the use of a simple and conventional benchmark (speed) experiment.
The term substantially, as used herein, is defined as at least approaching a given state (e.g., preferably within 10% of, more preferably within 1% of, and most preferably within 0.1% of). The term coupled, as used herein, is defined as connected, although not necessarily directly, and not necessarily mechanically. The term means, as used herein, is defined as hardware, firmware and/or software for achieving a result. The term program or phrase computer program, as used herein, is defined as a sequence of instructions designed for execution on a computer system. A program may include a subroutine, a function, a procedure, an object method, an object implementation, an executable application, an applet, a servlet, a source code, an object code, and/or other sequence of instructions designed for execution on a computer system.
Practical Applications of the Invention
A practical application of the invention that has value within the technological arts is waveform transformation. Further, the invention is useful in conjunction with data input and transformation (such as are used for the purpose of speech recognition), or in conjunction with transforming the appearance of a display (such as are used for the purpose of video games), or the like. There are virtually innumerable uses for the invention, all of which need not be detailed here.
Advantages of the Invention
A system, representing an embodiment of the invention, can be cost effective and advantageous for at least the following reasons. The invention improves the availability of parallel computing systems. The invention improves the reliability of parallel computing systems. The invention also improves the scalability of parallel computing systems.
All the disclosed embodiments of the invention described herein can be realized and practiced without undue experimentation. Although the best mode of carrying out the invention contemplated by the inventors is disclosed above, practice of the invention is not limited thereto. Accordingly, it will be appreciated by those skilled in the art that the invention may be practiced otherwise than as specifically described herein.
For example, although the high-availability, shared-memory cluster described herein can be a separate module, it will be manifest that the high-availability, shared-memory cluster may be integrated into the system with which it is associated. Furthermore, all the disclosed elements and features of each disclosed embodiment can be combined with, or substituted for, the disclosed elements and features of every other disclosed embodiment except where such elements or features are mutually exclusive.
It will be manifest that various additions, modifications and rearrangements of the features of the invention may be made without deviating from the spirit and scope of the underlying inventive concept. It is intended that the scope of the invention as defined by the appended claims and their equivalents cover all such additions, modifications, and rearrangements.
The appended claims are not to be interpreted as including means-plus-function limitations, unless such a limitation is explicitly recited in a given claim using the phrase "means for." Expedient embodiments of the invention are differentiated by the appended subclaims.

Claims

CLAIMS
What is claimed is:
1. An apparatus, comprising: a shared memory unit; a first processing node coupled to said shared memory unit; and a second processing node coupled to said shared memory unit, wherein the shared memory unit includes a range of addresses that are duplicated.
2. The apparatus of claim 1, wherein said shared memory unit includes a first bank and a second bank, data in said second bank duplicating data in said first bank.
3. The apparatus of claim 1, further comprising a first duplicated interface coupled between said first processing node and said shared memory unit and a second duplicated interface coupled between said first processing node and said shared memory unit.
4. The apparatus of claim 1, further comprising a duplicate power supply coupled to said shared memory unit.
5. The apparatus of claim 1, further comprising a multitailed storage facility coupled to said first processing node and said second processing node; and a computer-readable medium coupled to said multitailed storage facility.
6. A computer system comprising the apparatus of claim 1.
7. A method, comprising duplicating a shared address range in a shared memory unit that is coupled to a plurality of processing nodes.
8. The method of claim 7, wherein WRITES to said shared address range are written to a first memory range and a second memory range, and READS are read from said first memory range.
9. The method of claim 8, wherein READS are read from said first address range after error correction code checking.
10. The method of claim 9, wherein, in an event of error correction code failure on a READ, said READ is read from the second address range.
11. The method of claim 10, wherein, in said event, an operator warning is issued.
12. An electronic media, comprising: a computer program adapted to duplicate a shared address range in a shared memory unit that is coupled to a plurality of processing nodes.
13. A computer program comprising computer program means adapted to perform the step of duplicating a shared address range in a shared memory unit that is coupled to a plurality of processing nodes when said computer program is run on a computer.
14. A computer program as claimed in claim 13, embodied on a computer- readable medium.
15. A system, comprising a multiplicity of processors, each with some private memory and all sharing some portion of memory, interconnected and arranged such that memory accesses to a first set of address ranges will be to local, private memory whereas memory accesses to a second set of address ranges will be to shared memory, and arranged such that at least a portion of one special range of shared memory is duplicated so that both copies are written by each WRITE to a particular location in that range, and each READ of the secondary (mirrored) section is discarded or precluded.
16. The system of claim 15, wherein a heartbeat function is provided between the processors using the shared memory.
17. The system of claim 16, wherein a semaphore region is included in the mirrored memory range.
18. The system of claim 17, wherein the switch which connects the shared memory to the processing units is duplexed.
19. The system of claim 16, wherein a processor can clean up semaphores left by a companion processor which has stopped its heartbeat.
20. The system of claim 15, further comprising redundant mass storage.
21. A system comprised of a multiplicity of processing nodes, at least two shared-memory nodes (SMNs), and means at each processing node to connect each processing node to each of said multiple SMNs. Said system to include means for communicating all Load and Store software instructions to shared memory to all of the essentially-identical shared-memory nodes. Said system to include means for assuring that if any of said SMNs fails, the communication means ceases use of that SMN and notifies the attached processing node that the identified SMN has failed. Said system to assure that if atomic operations are performed affecting the SMNs, all secondary SMNs are corrected to contain the same value that the primary SMN contains.
22. The system of claim 21, wherein the means for maintaining correctness are in PCI adapters in the processing nodes.
23. The system of claim 21, wherein the means for assuring that atomic operations remain coherent across the multiple SMNs is performed via SMN-SMN transfer means rather than from the processing node means.
24. The system of claim 21, wherein a sequence number is associated with each transfer from a processing node to the SMNs and wherein failure is defined to include failure to respond after a defined period of time, and also is defined to include mismatch of returned sequence number.
25. The system of claim 21, wherein the atomic operations to all except the primary SMN are delayed and are translated at the processing node to be non-atomic update operations, which then are transferred to the alternative SMNs to assure that the atomic values remain coherent.
PCT/US2000/024329 1999-08-31 2000-08-31 High-availability, shared-memory cluster WO2001016750A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU71136/00A AU7113600A (en) 1999-08-31 2000-08-31 High-availability, shared-memory cluster

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
US15215199P 1999-08-31 1999-08-31
US60/152,151 1999-08-31
US22074800P 2000-07-26 2000-07-26
US22097400P 2000-07-26 2000-07-26
US60/220,748 2000-07-26
US60/220,974 2000-07-26

Publications (2)

Publication Number Publication Date
WO2001016750A2 true WO2001016750A2 (en) 2001-03-08
WO2001016750A3 WO2001016750A3 (en) 2002-01-17

Family

ID=27387201

Family Applications (9)

Application Number Title Priority Date Filing Date
PCT/US2000/024147 WO2001016737A2 (en) 1999-08-31 2000-08-31 Cache-coherent shared-memory cluster
PCT/US2000/024039 WO2001016760A1 (en) 1999-08-31 2000-08-31 Switchable shared-memory cluster
PCT/US2000/024150 WO2001016738A2 (en) 1999-08-31 2000-08-31 Efficient page ownership control
PCT/US2000/024298 WO2001016743A2 (en) 1999-08-31 2000-08-31 Shared memory disk
PCT/US2000/024217 WO2001016741A2 (en) 1999-08-31 2000-08-31 Semaphore control of shared-memory
PCT/US2000/024329 WO2001016750A2 (en) 1999-08-31 2000-08-31 High-availability, shared-memory cluster
PCT/US2000/024248 WO2001016742A2 (en) 1999-08-31 2000-08-31 Network shared memory
PCT/US2000/024216 WO2001016761A2 (en) 1999-08-31 2000-08-31 Efficient page allocation
PCT/US2000/024210 WO2001016740A2 (en) 1999-08-31 2000-08-31 Efficient event waiting

Family Applications Before (5)

Application Number Title Priority Date Filing Date
PCT/US2000/024147 WO2001016737A2 (en) 1999-08-31 2000-08-31 Cache-coherent shared-memory cluster
PCT/US2000/024039 WO2001016760A1 (en) 1999-08-31 2000-08-31 Switchable shared-memory cluster
PCT/US2000/024150 WO2001016738A2 (en) 1999-08-31 2000-08-31 Efficient page ownership control
PCT/US2000/024298 WO2001016743A2 (en) 1999-08-31 2000-08-31 Shared memory disk
PCT/US2000/024217 WO2001016741A2 (en) 1999-08-31 2000-08-31 Semaphore control of shared-memory

Family Applications After (3)

Application Number Title Priority Date Filing Date
PCT/US2000/024248 WO2001016742A2 (en) 1999-08-31 2000-08-31 Network shared memory
PCT/US2000/024216 WO2001016761A2 (en) 1999-08-31 2000-08-31 Efficient page allocation
PCT/US2000/024210 WO2001016740A2 (en) 1999-08-31 2000-08-31 Efficient event waiting

Country Status (4)

Country Link
EP (3) EP1214651A2 (en)
AU (9) AU7108300A (en)
CA (3) CA2382927A1 (en)
WO (9) WO2001016737A2 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6920485B2 (en) 2001-10-04 2005-07-19 Hewlett-Packard Development Company, L.P. Packet processing in shared memory multi-computer systems
US6999998B2 (en) 2001-10-04 2006-02-14 Hewlett-Packard Development Company, L.P. Shared memory coupling of network infrastructure devices
US7254745B2 (en) 2002-10-03 2007-08-07 International Business Machines Corporation Diagnostic probe management in data processing systems
EP1895413A3 (en) * 2006-08-18 2009-09-30 Fujitsu Limited Access monitoring method and device for shared memory

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20040017301A (en) * 2001-07-13 2004-02-26 코닌클리케 필립스 일렉트로닉스 엔.브이. Method of running a media application and a media system with job control
US7685381B2 (en) 2007-03-01 2010-03-23 International Business Machines Corporation Employing a data structure of readily accessible units of memory to facilitate memory access
US7899663B2 (en) 2007-03-30 2011-03-01 International Business Machines Corporation Providing memory consistency in an emulated processing environment
US9442780B2 (en) * 2011-07-19 2016-09-13 Qualcomm Incorporated Synchronization of shader operation
US9064437B2 (en) 2012-12-07 2015-06-23 Intel Corporation Memory based semaphores
WO2014190486A1 (en) * 2013-05-28 2014-12-04 华为技术有限公司 Method and system for supporting resource isolation under multi-core architecture

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3668644A (en) * 1970-02-09 1972-06-06 Burroughs Corp Failsafe memory system
US4414624A (en) * 1980-11-19 1983-11-08 The United States Of America As Represented By The Secretary Of The Navy Multiple-microcomputer processing
EP0372578A2 (en) * 1988-12-09 1990-06-13 Tandem Computers Incorporated Memory management in high-performance fault-tolerant computer system
US5175839A (en) * 1987-12-24 1992-12-29 Fujitsu Limited Storage control system in a computer system for double-writing
US5206952A (en) * 1990-09-12 1993-04-27 Cray Research, Inc. Fault tolerant networking architecture
US5398331A (en) * 1992-07-08 1995-03-14 International Business Machines Corporation Shared storage controller for dual copy shared data
US5495570A (en) * 1991-07-18 1996-02-27 Tandem Computers Incorporated Mirrored memory multi-processor system
US5568609A (en) * 1990-05-18 1996-10-22 Fujitsu Limited Data processing system with path disconnection and memory access failure recognition
US5664089A (en) * 1994-04-26 1997-09-02 Unisys Corporation Multiple power domain power loss detection and interface disable
US5903763A (en) * 1993-03-26 1999-05-11 Fujitsu Limited Method of recovering exclusive control instruction and multi-processor system using the same

Family Cites Families (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4484262A (en) * 1979-01-09 1984-11-20 Sullivan Herbert W Shared memory computer method and apparatus
US4403283A (en) * 1980-07-28 1983-09-06 Ncr Corporation Extended memory system and method
US4725946A (en) * 1985-06-27 1988-02-16 Honeywell Information Systems Inc. P and V instructions for semaphore architecture in a multiprogramming/multiprocessing environment
JPH063589B2 (en) * 1987-10-29 1994-01-12 インターナシヨナル・ビジネス・マシーンズ・コーポレーシヨン Address replacement device
EP0343646B1 (en) * 1988-05-26 1995-12-13 Hitachi, Ltd. Task execution control method for a multiprocessor system with enhanced post/wait procedure
US4992935A (en) * 1988-07-12 1991-02-12 International Business Machines Corporation Bit map search by competitive processors
US5434970A (en) * 1991-02-14 1995-07-18 Cray Research, Inc. System for distributed multiprocessor communication
JPH04271453A (en) * 1991-02-27 1992-09-28 Toshiba Corp Composite electronic computer
US5315707A (en) * 1992-01-10 1994-05-24 Digital Equipment Corporation Multiprocessor buffer system
US5434975A (en) * 1992-09-24 1995-07-18 At&T Corp. System for interconnecting a synchronous path having semaphores and an asynchronous path having message queuing for interprocess communications
DE4238593A1 (en) * 1992-11-16 1994-05-19 Ibm Multiprocessor computer system
US5590308A (en) * 1993-09-01 1996-12-31 International Business Machines Corporation Method and apparatus for reducing false invalidations in distributed systems
US5636359A (en) * 1994-06-20 1997-06-03 International Business Machines Corporation Performance enhancement system and method for a hierarchical data cache using a RAID parity scheme
US6587889B1 (en) * 1995-10-17 2003-07-01 International Business Machines Corporation Junction manager program object interconnection and method
US5940870A (en) * 1996-05-21 1999-08-17 Industrial Technology Research Institute Address translation for shared-memory multiprocessor clustering
US5784699A (en) * 1996-05-24 1998-07-21 Oracle Corporation Dynamic memory allocation in a computer using a bit map index
JPH10142298A (en) * 1996-11-15 1998-05-29 Advantest Corp Testing device for ic device
US5829029A (en) * 1996-12-18 1998-10-27 Bull Hn Information Systems Inc. Private cache miss and access management in a multiprocessor system with shared memory
US5918248A (en) * 1996-12-30 1999-06-29 Northern Telecom Limited Shared memory control algorithm for mutual exclusion and rollback
US6360303B1 (en) * 1997-09-30 2002-03-19 Compaq Computer Corporation Partitioning memory shared by multiple processors of a distributed processing system
EP0908825B1 (en) * 1997-10-10 2002-09-04 Bull S.A. A data-processing system with cc-NUMA (cache coherent, non-uniform memory access) architecture and remote access cache incorporated in local memory

Also Published As

Publication number Publication date
WO2001016761A2 (en) 2001-03-08
AU7113600A (en) 2001-03-26
WO2001016761A3 (en) 2001-12-27
AU7110000A (en) 2001-03-26
WO2001016741A2 (en) 2001-03-08
WO2001016738A9 (en) 2002-09-12
EP1214653A2 (en) 2002-06-19
EP1214652A2 (en) 2002-06-19
CA2382728A1 (en) 2001-03-08
WO2001016742A2 (en) 2001-03-08
AU7108300A (en) 2001-03-26
WO2001016738A3 (en) 2001-10-04
WO2001016743A3 (en) 2001-08-09
AU6949700A (en) 2001-03-26
WO2001016760A1 (en) 2001-03-08
WO2001016743A2 (en) 2001-03-08
WO2001016742A3 (en) 2001-09-20
WO2001016741A3 (en) 2001-09-20
CA2382927A1 (en) 2001-03-08
WO2001016740A3 (en) 2001-12-27
WO2001016750A3 (en) 2002-01-17
WO2001016737A2 (en) 2001-03-08
WO2001016738A2 (en) 2001-03-08
WO2001016738A8 (en) 2001-05-03
WO2001016740A2 (en) 2001-03-08
AU7100700A (en) 2001-03-26
CA2382929A1 (en) 2001-03-08
AU7112100A (en) 2001-03-26
WO2001016737A3 (en) 2001-11-08
AU6949600A (en) 2001-03-26
WO2001016743A8 (en) 2001-10-18
EP1214651A2 (en) 2002-06-19
AU7474200A (en) 2001-03-26
AU7108500A (en) 2001-03-26

Similar Documents

Publication Publication Date Title
Bernick et al. NonStop/spl reg/advanced architecture
JP2500038B2 (en) Multiprocessor computer system, fault tolerant processing method and data processing system
KR940004386B1 (en) System and method for data processing
KR910007762B1 (en) Distributed multiprocess transaction processing system
US5327553A (en) Fault-tolerant computer system with /CONFIG filesystem
US5802265A (en) Transparent fault tolerant computer system
US5958070A (en) Remote checkpoint memory system and protocol for fault-tolerant computer system
US6189111B1 (en) Resource harvesting in scalable, fault tolerant, single system image clusters
US5099485A (en) Fault tolerant computer systems with fault isolation and repair
JP2618073B2 (en) Data processing method and system
US5502728A (en) Large, fault-tolerant, non-volatile, multiported memory
CA2009529C (en) Servicing interrupt requests in a data processing system without using the services of an operating system
Siewiorek Fault tolerance in commercial computers
Kim Highly available systems for database applications
US20050240806A1 (en) Diagnostic memory dump method in a redundant processor
Baker et al. A flexible ServerNet-based fault-tolerant architecture
CA2003337A1 (en) High-performance computer system with fault-tolerant capability
JPH0656587B2 (en) Fault tolerant data processing system
JPH02501603A (en) Lock control method in multiprocessing data system
KR940002340B1 (en) Computer system for multiple operation
JP3595033B2 (en) Highly reliable computer system
WO2001016750A2 (en) High-availability, shared-memory cluster
CN1755660B (en) Diagnostic memory dump method in a redundant processor
US5539875A (en) Error windowing for storage subsystem recovery
Dal Cin et al. Fault tolerance in distributed shared memory multiprocessors

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US US US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

AK Designated states

Kind code of ref document: A3

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US US US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A3

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP

DPE2 Request for preliminary examination filed before expiration of 19th month from priority date (pct application filed from 20040101)