CN100429622C - Dynamic reassignment of data ownership - Google Patents

Dynamic reassignment of data ownership

Publication number
CN100429622C
Authority
CN
China
Prior art keywords
node
data item
specific data
data
second node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CNB2004800215879A
Other languages
Chinese (zh)
Other versions
CN1829962A (en
Inventor
Roger J. Bamford
Sashikanth Chandrasekaran
Angelo Pruscino
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Oracle International Corp
Oracle America Inc
Original Assignee
Oracle International Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Oracle International Corp
Publication of CN1829962A
Application granted
Publication of CN100429622C

Abstract

Various techniques are described for improving the performance of a shared-nothing database system in which at least two of the nodes that are running the shared-nothing database system have shared access to a disk. Specifically, techniques are provided for changing the ownership of data in a shared-nothing database dynamically, based on factors such as which node would be the most efficient owner relative to the performance of a particular operation. Once determined, the ownership of the data may be changed permanently to the new owner, or temporarily for the duration of the particular operation.

Description

Data management method
Technical field
The present invention relates to techniques for managing data in a shared-nothing database system that runs on shared disk hardware.
Background
Multi-processing computer systems typically fall into three categories: shared-everything systems, shared disk systems, and shared-nothing systems. In shared-everything systems, processes on all processors have direct access to all volatile memory devices (hereinafter generally referred to as "memory") and to all non-volatile memory devices (hereinafter generally referred to as "disks") in the system. Consequently, a high degree of wiring between the various computer components is required to provide shared-everything functionality. In addition, there are scalability limits to shared-everything architectures.
In shared disk systems, processors and memories are grouped into nodes. Each node in a shared disk system may itself constitute a shared-everything system that includes multiple processors and multiple memories. Processes on all processors can access all disks in the system, but only the processes on processors that belong to a particular node can directly access the memory within that node. Shared disk systems generally require less wiring than shared-everything systems. Shared disk systems also adapt easily to unbalanced workload conditions because all nodes can access all data. However, shared disk systems are susceptible to coherence overhead. For example, if a first node has modified data and a second node wants to read or modify the same data, then several steps may have to be taken to ensure that the correct version of the data is provided to the second node.
In shared-nothing systems, all processors, memories and disks are grouped into nodes. As in shared disk systems, each node in a shared-nothing system may itself constitute a shared-everything system or a shared disk system. Only the processes running on a particular node can directly access the memories and disks within that node. Of the three general types of multi-processing systems, shared-nothing systems generally require the least amount of wiring between the various system components. However, shared-nothing systems are the most susceptible to unbalanced workload conditions. For example, all of the data to be accessed during a particular task may reside on the disks of a particular node. Consequently, only processes running within that node can be used to perform the work granule, even though processes on other nodes remain idle.
Databases that run on multi-node systems typically fall into two categories: shared disk databases and shared-nothing databases.
Shared disk databases
A shared disk database coordinates work based on the assumption that all data managed by the database system is visible to all processing nodes that are available to the database system. Consequently, in a shared disk database, the server may assign any work to a process on any node, regardless of the location of the disk that contains the data that will be accessed during the work.
Because all nodes have access to the same data, and each node has its own private cache, numerous versions of the same data item may reside in the caches of any number of the nodes. Unfortunately, this means that when one node requires a particular version of a particular data item, the node must coordinate with the other nodes to have that version of the data item shipped to the requesting node. Thus, shared disk databases are said to operate on the concept of "data shipping," where data must be shipped to the node that has been assigned to work on the data.
Such data shipping requests may result in "pings." Specifically, a ping occurs when a copy of a data item that is needed by one node resides in the cache of another node. A ping may require the data item to be written to disk and then read from disk. The disk operations necessitated by pings can significantly reduce the performance of the database system.
Shared disk databases may be run on both shared-nothing and shared disk computer systems. To run a shared disk database on a shared-nothing computer system, software support may be added to the operating system, or additional hardware may be provided, to allow processes to access remote disks.
Shared-nothing databases
A shared-nothing database assumes that a process can access data only if the data is contained on a disk that belongs to the same node as the process. Consequently, if a particular node wants an operation to be performed on a data item that is owned by another node, the particular node must send a request to the other node for the other node to perform the operation. Thus, instead of shipping data between nodes, shared-nothing databases are said to perform "function shipping."
Because any given data item is owned by only one node, only that node (the "owner" of the data) will ever have a copy of the data in its cache. Consequently, there is no need for the type of cache coherency mechanism that is required in shared disk database systems. Further, shared-nothing systems do not suffer the performance penalties associated with pings, because a node that owns a data item is never asked to save a cached version of the data item to disk so that another node can then load the data item into its cache.
Shared-nothing databases may be run on both shared disk and shared-nothing multi-processing systems. To run a shared-nothing database on a shared disk machine, a mechanism may be provided for partitioning the database and assigning ownership of each partition to a particular node.
The fact that only the owning node may operate on a data item means that the workload in a shared-nothing database may become severely unbalanced. For example, in a system of ten nodes, 90% of all work requests may involve data that is owned by one of the nodes. Consequently, that one node is overworked while the computational resources of the other nodes are underutilized. To "rebalance" the workload, a shared-nothing database may be taken offline, and the data (and the ownership thereof) may be redistributed among the nodes. However, this process involves moving potentially huge amounts of data, and may only temporarily solve the workload skew.
Summary of the invention
The invention provides a method for managing data, the method comprising the steps of: maintaining a plurality of persistent data items on persistent storage, wherein the persistent storage is accessible to a plurality of nodes and the persistent data items include a particular data item stored at a particular location on the persistent storage; assigning exclusive ownership of each of the persistent data items to one of the plurality of nodes, wherein a particular node of the plurality of nodes is assigned exclusive ownership of the particular data item; when any node wants an operation that involves the particular data item to be performed, shipping the operation from the node that wants the operation performed to the particular node, so that the particular node performs the operation on the particular data item, because the particular data item is exclusively owned by the particular node; gathering statistics by collecting information about at least one of system performance and workload; and dynamically reassigning ownership of the persistent data items based on the statistics, to improve at least one of system performance and throughput.
The invention also provides a method for managing data, the method comprising the steps of: maintaining a plurality of persistent data items on persistent storage that is accessible to a plurality of nodes, the persistent data items including a particular data item stored at a particular location on the persistent storage; assigning exclusive ownership of each of the persistent data items to one of the nodes, wherein a first node of the plurality of nodes is assigned exclusive ownership of the particular data item; when any node wants an operation that involves the particular data item to be performed, shipping the operation from the node that wants the operation performed to the first node, so that the first node performs the operation on the particular data item, because the particular data item is exclusively owned by the first node; while exclusive ownership of the particular data item is held by the first node, receiving an instruction that requests that an operation be performed on the particular data item; and having the operation performed by a second node that is different from the first node.
Brief description of the drawings
The present invention is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which like reference numerals refer to similar elements, and in which:
Fig. 1 is a block diagram illustrating a cluster that includes two shared disk subsystems, according to an embodiment of the invention; and
Fig. 2 is a block diagram of a computer system on which embodiments of the invention may be implemented.
Detailed description
Various techniques are described hereafter for improving the performance of a shared-nothing database system that includes a shared disk storage system. In the following description, for the purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, that the present invention may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to avoid unnecessarily obscuring the present invention.
Functional overview
Various techniques are described hereafter for improving the performance of a shared-nothing database system in which at least two of the nodes that are running the shared-nothing database system have shared access to a disk. As dictated by the shared-nothing architecture of the database system, each data item is still owned by only one node at any given time. However, the fact that at least some of the nodes that are running the shared-nothing database system have shared access to a disk is exploited to rebalance and recover the shared-nothing database system more efficiently.
Specifically, techniques are provided for changing the ownership of data in a shared-nothing database dynamically, based on factors such as which node would be the most efficient owner relative to the performance of a particular operation. Once determined, the ownership of the data may be changed permanently to the new owner, or temporarily for the duration of the particular operation.
To avoid the overhead associated with conventional ownership reassignment operations, the ownership reassignment may be performed using the techniques described in 50277-2277, which do not require the data that is being reassigned to be moved from the location at which it resides on persistent storage.
Exemplary cluster that includes shared disk systems
Fig. 1 is a block diagram illustrating a cluster 100 on which embodiments of the invention may be implemented. Cluster 100 includes five nodes 102, 104, 106, 108 and 110 that are coupled by an interconnect 130 that allows the nodes to communicate with each other. Cluster 100 includes two disks 150 and 152. Nodes 102, 104 and 106 have access to disk 150, and nodes 108 and 110 have access to disk 152. Thus, the subsystem that includes nodes 102, 104 and 106 and disk 150 constitutes a first shared disk system, while the subsystem that includes nodes 108 and 110 and disk 152 constitutes a second shared disk system.
Cluster 100 is an example of a relatively simple system that includes two shared disk subsystems with no overlapping membership between the shared disk subsystems. Actual systems may be much more complex than cluster 100, with hundreds of nodes, hundreds of shared disks, and many-to-many relationships between the nodes and the shared disks. In such a system, a single node that has access to many disks may, for example, be a member of several distinct shared disk subsystems, where each shared disk subsystem includes one of the shared disks and all nodes that have access to that shared disk.
Shared-nothing database on shared disk system
For the purpose of illustration, it shall be assumed that a shared-nothing database system is running on cluster 100, where the database managed by the shared-nothing database system is stored on disks 150 and 152. Based on the shared-nothing nature of the database system, the data may be segregated into five groups or partitions 112, 114, 116, 118 and 120. Each of the partitions is assigned to a corresponding node. The node assigned to a partition is considered to be the exclusive owner of all data that resides in that partition. In the present example, nodes 102, 104, 106, 108 and 110 respectively own partitions 112, 114, 116, 118 and 120. The partitions 112, 114 and 116 that are owned by the nodes that have access to disk 150 (nodes 102, 104 and 106) are stored on disk 150. Similarly, the partitions 118 and 120 that are owned by the nodes that have access to disk 152 (nodes 108 and 110) are stored on disk 152.
As dictated by the shared-nothing nature of the database system running on cluster 100, any piece of data is owned by at most one node at any given time. In addition, access to the shared data is coordinated by function shipping. For example, in the context of a database system that supports the SQL language, a node that does not own a particular piece of data may cause an operation to be performed on that data by forwarding fragments of SQL statements to the node that does own the piece of data.
Ownership map
To perform function shipping efficiently, all nodes need to know which nodes own which data. Accordingly, an ownership map is established, where the ownership map indicates the data-to-node ownership assignments. During runtime, the various nodes consult the ownership map to route SQL fragments to the correct nodes.
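The routing step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary layout, the function names ship_operation and scan, and the idea of modeling "shipping" as a local function call are all hypothetical; only the node and partition numbers come from Fig. 1.

```python
from typing import Callable, Dict

# Ownership map: partition id -> owning node id (numbers follow Fig. 1).
ownership_map: Dict[int, int] = {112: 102, 114: 104, 116: 106, 118: 108, 120: 110}


def ship_operation(requesting_node: int, partition: int,
                   operation: Callable[[int], str]) -> str:
    """Look up the owner of the partition and let the owner run the operation."""
    owner = ownership_map[partition]
    # A real system would send an SQL fragment over the interconnect;
    # here "shipping" is modeled as a call made on behalf of the owner.
    return operation(owner)


def scan(executing_node: int) -> str:
    # Hypothetical operation that simply reports where it ran.
    return f"scan executed on node {executing_node}"


# Node 102 wants partition 118 scanned; the work is shipped to owner 108.
result = ship_operation(requesting_node=102, partition=118, operation=scan)
print(result)
```

The requesting node never touches the data itself; it only consults the map and forwards the work.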
According to one embodiment, the data-to-node mapping need not be determined at compilation time of an SQL (or any other database access language) statement. Rather, as shall be described in greater detail hereafter, the data-to-node mapping may be established and revised during runtime. Using the techniques described hereafter, when ownership changes from one node that has access to the disk on which the data resides to another node that has access to the disk on which the data resides, the ownership change is performed without moving the data from its persistent location on the disk.
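As a rough illustration of the condition stated above, the following Python sketch changes an owner only when the new owner can also reach the disk that holds the data. The table names and the function reassign_in_place are hypothetical, and returning False for a node without disk access is an assumption made for the sketch; the node, partition and disk numbers follow Fig. 1.

```python
# Which nodes can reach which disk (Fig. 1).
disk_access = {150: {102, 104, 106}, 152: {108, 110}}
# Where each partition is persistently stored.
partition_disk = {112: 150, 114: 150, 116: 150, 118: 152, 120: 152}
# Current ownership map: partition -> owner node.
owner = {112: 102, 114: 104, 116: 106, 118: 108, 120: 110}


def reassign_in_place(partition: int, new_owner: int) -> bool:
    """Change ownership only if no data movement would be needed."""
    disk = partition_disk[partition]
    if new_owner not in disk_access[disk]:
        # The new owner cannot reach the disk, so this cannot be done in place.
        return False
    owner[partition] = new_owner  # only the map changes; partition_disk is untouched
    return True


moved = reassign_in_place(112, 104)    # both nodes share disk 150
refused = reassign_in_place(118, 102)  # node 102 has no access to disk 152
```

Note that a successful reassignment leaves partition_disk unchanged, which is the point of the technique.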
Locking
Locks are structures used to coordinate access to a resource among several entities that have access to the resource. In the case of a shared-nothing database system, there is no need for global locking to coordinate accesses to the user data in the shared-nothing database, since any given piece of data is only owned by a single node. However, since all of the nodes of the shared-nothing database require access to the ownership map, some locking may be required to prevent inconsistent updates to the ownership map.
According to one embodiment, a two-node locking scheme is used when ownership of a piece of data is being reassigned from one node (the "old owner") to another node (the "new owner"). Further, a global locking mechanism may be used to control access to the metadata associated with the shared-nothing database. Such metadata may include, for example, the ownership map.
Bucket-based partitioning
As mentioned above, the data that is managed by the shared-nothing database is partitioned, and the data in each partition is exclusively owned by one node. According to one embodiment, the partitions are established by assigning the data to logical buckets, and then assigning each of the buckets to a partition. Thus, the data-to-node mapping in the ownership map includes a data-to-bucket mapping and a bucket-to-node mapping.
According to one embodiment, the data-to-bucket mapping is established by applying a hash function to the name of each data item. Similarly, the bucket-to-node mapping may be established by applying another hash function to identifiers associated with the buckets. Alternatively, one or both of the mappings may be established using range-based partitioning, or by simply enumerating each individual relationship. For example, one million data items may be mapped to fifty buckets by splitting the namespace of the data items into fifty ranges. The fifty buckets may then be mapped to five nodes by storing a record for each bucket that (1) identifies the bucket and (2) identifies the node currently assigned the bucket.
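The two-level mapping in the example above can be sketched as follows. The sketch assumes, purely for illustration, that data item names are the integers 0 to 999,999 and that buckets are initially handed out round-robin; neither detail is specified in the text.

```python
NUM_ITEMS = 1_000_000
NUM_BUCKETS = 50
NODES = [102, 104, 106, 108, 110]
RANGE_WIDTH = NUM_ITEMS // NUM_BUCKETS  # 20,000 item names per range


def data_to_bucket(item_id: int) -> int:
    """Range partitioning: item ids 0..19999 go to bucket 0, and so on."""
    return item_id // RANGE_WIDTH


# One record per bucket: bucket id -> node currently assigned the bucket.
bucket_to_node = {bucket: NODES[bucket % len(NODES)] for bucket in range(NUM_BUCKETS)}


def owner_of(item_id: int) -> int:
    """Compose the two mappings to find the node that owns a data item."""
    return bucket_to_node[data_to_bucket(item_id)]


print(len(bucket_to_node), "records cover", NUM_ITEMS, "data items")
```

The ownership map holds fifty records rather than one million, which is the size reduction the bucket scheme is meant to provide.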
The use of buckets significantly reduces the size of the ownership map relative to a mapping in which a separate mapping record is stored for each data item. Further, in embodiments where the number of buckets exceeds the number of nodes, the use of buckets makes it relatively easy to reassign ownership of a subset of the data owned by a given node. For example, a new node may be assigned a single bucket from a node that is currently assigned ten buckets. Such a reassignment would simply involve revising the record that indicates the bucket-to-node mapping for that bucket. The data-to-bucket mapping of the reassigned data would not have to be changed.
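A minimal sketch of such a single-record reassignment is shown below. The node names, the item name "orders:42" and the use of zlib.crc32 as the hash function are assumptions chosen for the example.

```python
import zlib

NUM_BUCKETS = 10
# A node that currently owns ten buckets.
bucket_to_node = {bucket: "node_a" for bucket in range(NUM_BUCKETS)}


def data_to_bucket(name: str) -> int:
    """Hash partitioning on the item name (crc32 is an arbitrary stable hash)."""
    return zlib.crc32(name.encode("utf-8")) % NUM_BUCKETS


def reassign_bucket(bucket: int, new_node: str) -> None:
    """Reassignment touches exactly one bucket-to-node record."""
    bucket_to_node[bucket] = new_node


bucket_before = data_to_bucket("orders:42")
reassign_bucket(bucket_before, "node_b")
bucket_after = data_to_bucket("orders:42")  # the data-to-bucket mapping is unchanged
```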
As mentioned above, the data-to-bucket mapping may be established using any one of a variety of techniques, including but not limited to hash partitioning, range partitioning or list values. If range-based partitioning is used and the number of ranges is not significantly greater than the number of nodes, then the database server may employ finer-grained (narrower) ranges to achieve the desired number of buckets, so long as the range key used to partition the data items is a value that will not change (for example, a date). If the range key is a value that could change, then in response to a change to the range key value for a particular data item, the data item is removed from its old bucket and added to the bucket that corresponds to the new value of the data item's range key.
Establishing the initial assignment of ownership
Using the mapping techniques described above, ownership of a single table or index can be shared among multiple nodes. Initially, the assignment of ownership may be random. For example, a user may select the key and partitioning technique (for example, hash, range, list, etc.) for the data-to-bucket mapping, and the partitioning technique for the bucket-to-node mapping, but need not specify the initial assignment of buckets to nodes. The database server may then determine the key for the bucket-to-node mapping based on the key for the data-to-bucket mapping, and create the initial bucket-to-node assignments without regard to the specific data and database objects represented by the buckets.
For example, if the user chooses to partition the object based on key A, then the database server will use key A to determine the bucket-to-node mapping. In some cases, the database server may append extra keys to, or apply a different function to, the key used for the data-to-bucket mapping, as long as the data-to-bucket mapping is preserved. For example, if the object is hash partitioned into four data buckets using key A, then the database server may subdivide each of those four buckets into three buckets (to allow for flexible assignment of buckets to nodes), either by applying a hash function to key B to determine the bucket-to-node mapping, or by simply increasing the number of hash values to 12. If the hash is a modulo function, then the 0th, 4th and 8th bucket-to-node buckets will correspond to the 0th data-to-bucket bucket, the 1st, 5th and 9th bucket-to-node buckets will correspond to the 1st data-to-bucket bucket, and so on.
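The modulo case can be checked with a short sketch. It assumes integer key values and uses hypothetical function names; the bucket counts 4 and 12 are the ones from the example.

```python
DATA_BUCKETS = 4   # data-to-bucket buckets chosen by the user
NODE_BUCKETS = 12  # finer bucket-to-node buckets used by the server


def data_bucket(key: int) -> int:
    return key % DATA_BUCKETS


def node_bucket(key: int) -> int:
    return key % NODE_BUCKETS


def data_bucket_of(node_bucket_id: int) -> int:
    """Recover the data-to-bucket bucket from a bucket-to-node bucket number."""
    return node_bucket_id % DATA_BUCKETS


# Group the 12 fine buckets by the coarse bucket they subdivide.
groups = {
    d: [n for n in range(NODE_BUCKETS) if data_bucket_of(n) == d]
    for d in range(DATA_BUCKETS)
}
print(groups)
```

Because 12 is a multiple of 4, (key mod 12) mod 4 equals key mod 4, so the finer subdivision preserves the original data-to-bucket mapping.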
As another example, if the object is range partitioned on a key A that is of type DATE, then the data-to-bucket mapping could be specified using a function year(date) that returns the year. The bucket-to-node mapping, however, could be computed internally by the database server using a month-and-year(date) function. Each year partition is thereby divided into 12 bucket-to-node buckets. If the database server determines that the data of a particular year is accessed frequently (which will typically be the current year), it can redistribute those 12 buckets among the other nodes.
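A small sketch of the date-based variant follows. Representing a bucket-to-node bucket as a (year, month) tuple and the function names are assumptions made for illustration.

```python
from datetime import date
from typing import Tuple


def data_bucket(d: date) -> int:
    """Data-to-bucket mapping: one bucket per year."""
    return d.year


def node_bucket(d: date) -> Tuple[int, int]:
    """Bucket-to-node mapping: one bucket per (year, month)."""
    return (d.year, d.month)


def data_bucket_of(bucket: Tuple[int, int]) -> int:
    """The year bucket is recoverable from the finer bucket."""
    return bucket[0]


# The 2004 partition splits into twelve bucket-to-node buckets.
buckets_2004 = sorted({node_bucket(date(2004, month, 1)) for month in range(1, 13)})
print(len(buckets_2004))
```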
In both examples given above, given a bucket-to-node bucket #, the database server can uniquely determine the data-to-bucket bucket #. Also in those examples, the user selects the key and partitioning technique for the data-to-bucket mapping. However, in alternative embodiments, the user may not select the key and partitioning technique for the data-to-bucket mapping. Rather, the key and partitioning technique for the data-to-bucket mapping may also be determined automatically by the database server.
According to one embodiment, the database server makes the initial bucket-to-node assignments based on how many buckets should be assigned to each node. For example, nodes with greater capacity may be assigned more buckets. However, in the initial assignments, the decision of which particular buckets should be assigned to which nodes is random.
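One way to sketch such an initial assignment is shown below: the number of buckets per node is proportional to a capacity weight, while the choice of which buckets go to which node is random. The capacity weights, the node names and the rule that the last node receives any remainder are assumptions made for the example.

```python
import random
from collections import Counter
from typing import Dict, Iterable, Optional


def initial_assignment(buckets: Iterable[int], capacity: Dict[str, int],
                       seed: Optional[int] = None) -> Dict[int, str]:
    """Give each node a share proportional to its capacity; pick buckets randomly."""
    rng = random.Random(seed)
    shuffled = list(buckets)
    rng.shuffle(shuffled)  # which buckets a node receives is random
    total = sum(capacity.values())
    nodes = list(capacity)
    assignment: Dict[int, str] = {}
    start = 0
    for index, node in enumerate(nodes):
        if index == len(nodes) - 1:
            count = len(shuffled) - start  # last node takes the remainder
        else:
            count = len(shuffled) * capacity[node] // total
        for bucket in shuffled[start:start + count]:
            assignment[bucket] = node
        start += count
    return assignment


assignment = initial_assignment(range(12), {"big": 2, "small_1": 1, "small_2": 1}, seed=1)
buckets_per_node = Counter(assignment.values())
print(buckets_per_node)
```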
In an alternative embodiment, the database server does take into account which data is represented by a bucket when making the bucket-to-node assignments. For example, assume that the data for a particular table is divided among several buckets. The database server may intentionally assign all of those buckets to the same node, or intentionally distribute ownership of those buckets among many nodes. Similarly, the database server may, in the initial assignment, attempt to assign the buckets associated with tables to the same nodes as the buckets associated with the indexes for those tables. Conversely, the database server may attempt to assign the buckets associated with tables to nodes that are different from the nodes to which the buckets associated with the indexes for those tables are assigned.
Automatic workload-based ownership reassignment
Regardless of how the initial assignments are made, there is no guarantee that the initial assignments will actually result in optimal execution of all operations that the database server will be asked to perform. Therefore, according to one embodiment of the invention, ownership is automatically reassigned based on statistics gathered by monitoring the actual runtime operation of the database system.
For example, in one embodiment, the shared-nothing database system includes a monitoring mechanism that monitors the requests made by nodes and maintains, for each bucket, statistics about how frequently non-owner nodes request operations that involve data from that bucket. According to one embodiment, the ownership map is maintained on persistent storage, while the statistics maintained by the monitoring mechanism are maintained in volatile memory.
Based on the statistics, the database server may determine that a particular non-owner node requests operations that involve data from a particular bucket far more frequently than any other node. Based on this information, the database server may automatically reassign ownership of that bucket to that particular node.
The frequency with which non-owner nodes request operations that involve a bucket is merely one example of the many performance factors that may be taken into account in the ownership reassignment decision. For example, the monitoring mechanism may also track how frequently the owner node performs operations on data from a particular bucket for its own benefit ("self-serving operations"). If the owner node of a particular bucket performs self-serving operations on data from the bucket more frequently than any non-owner node requests operations on the bucket, then the database server may opt not to reassign ownership of the bucket. Even when a non-owner node requests operations on a particular bucket more frequently than the owner node performs self-serving operations on the bucket, the database server may be configured to transfer ownership only when the difference in utilization exceeds a certain threshold.
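The decision rule described above can be sketched as a small in-memory monitor. The class name, its methods and the use of raw request counts as the "utilization" measure are assumptions; the text does not prescribe a particular statistic or threshold.

```python
from collections import Counter, defaultdict
from typing import Dict


class BucketMonitor:
    """Keeps per-bucket request counts in volatile memory (illustrative only)."""

    def __init__(self, bucket_to_node: Dict[int, str], threshold: int) -> None:
        self.bucket_to_node = bucket_to_node
        self.threshold = threshold
        self.counts = defaultdict(Counter)  # bucket -> node -> number of requests

    def record(self, bucket: int, node: str) -> None:
        self.counts[bucket][node] += 1

    def propose_owner(self, bucket: int) -> str:
        owner = self.bucket_to_node[bucket]
        self_serving = self.counts[bucket][owner]
        others = [(n, c) for n, c in self.counts[bucket].items() if n != owner]
        if not others:
            return owner
        busiest, requests = max(others, key=lambda pair: pair[1])
        # Transfer only when the non-owner leads by more than the threshold.
        if requests - self_serving > self.threshold:
            return busiest
        return owner


monitor = BucketMonitor({7: "n1"}, threshold=10)
for node, times in (("n1", 5), ("n2", 20), ("n3", 8)):
    for _ in range(times):
        monitor.record(7, node)
proposed = monitor.propose_owner(7)  # n2 leads the owner by 15 requests
```

With a larger threshold the same counts leave ownership where it is, which avoids reassigning on small differences.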
According to one embodiment, if the monitoring mechanism detects that several nodes access a particular resource, such as a table, at approximately the same frequency, then the database server may distribute ownership of the buckets associated with that table evenly among those nodes. Thus, if a table is heavily accessed, the work of accessing that table is spread more evenly among the available nodes.
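A possible reading of "approximately the same frequency" is sketched below, where nodes within a tolerance of the busiest node share the table's buckets round-robin. The tolerance value, the function name and the round-robin rule are all assumptions; the text does not define them.

```python
from typing import Dict, List


def spread_buckets(table_buckets: List[int], access_counts: Dict[str, int],
                   tolerance: float = 0.2) -> Dict[int, str]:
    """Share a table's buckets among nodes whose access counts are close to the top."""
    top = max(access_counts.values())
    similar = sorted(n for n, c in access_counts.items() if c >= top * (1 - tolerance))
    # Round-robin gives each qualifying node an equal share (within one bucket).
    return {bucket: similar[i % len(similar)] for i, bucket in enumerate(table_buckets)}


spread = spread_buckets([0, 1, 2, 3, 4, 5], {"n1": 100, "n2": 95, "n3": 10})
print(spread)
```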
When based on which node the most frequently the data in this memory paragraph of access adjust the entitlement of memory paragraph when overtime, represent that logically the memory paragraph of related data will trend towards being had by identical node.For example, corresponding to the memory paragraph from the data of particular table will trend towards by with have corresponding to the identical node of the memory paragraph that is based upon the index on this table.
By based on the entitlement of reallocating about the statistical form of actual storage section utilization, the needs of other more complicated distribution decisions have been avoided carrying out.For example, entitlement need not based on the inquiry of considering SQL WHERE clause, JOIN condition or AGGREGATION table (query profile) to be set.Similarly, the user does not need clearly the similarity to the data given transaction.In addition, in the time of on data are present in by the disk of the transference node and the nodes sharing of assigning, the shared-nothing database server can adapt to the unbalance of working load soon or change under the situation that does not have the physical data reallocation.Because by the minimum cost of the reallocation that is present in the data generation on the shared disk, the shared-nothing database server can carry out the change and the measurement performance of data ownership effectively under the situation of not using any external tool.
The balance reallocation
If specific node continues to have its all initial storage sections, and because this specific node data in these memory paragraphs of access and be dynamically allocated some new memory paragraphs are arranged frequently, then this node overwork that may become, thus make the working load of Database Systems become unbalance.Similarly, under the situation of being reallocated without any memory paragraph to this specific node, if many initial storage sections of specific node are reallocated to other nodes, then working load will become unbalance.
Therefore, when the entitlement of memory paragraph is assigned to specific node, may wish the entitlement of different memory paragraphs is redistributed to another node from this node.Similarly, when the entitlement of memory paragraph when specific node is assigned with, may wish the entitlement of different memory paragraphs is reallocated to this specific node.Be carried out this reallocation of the unbalance influence that is used to resist other reallocation, be called " balance reallocation " at this.
Because reassignment operations involve some overhead, it may be desirable to perform balancing reassignments only after a certain imbalance threshold has been crossed. For example, the database server can maintain a "target storage segment count" for each node. If the number of storage segments owned by a particular node falls a predetermined amount below that node's target count, a balancing reassignment can be performed to assign one or more storage segments from other nodes to that node. The nodes from which storage segments are taken can be selected, for example, based on how far the number of storage segments each node owns exceeds its own target count.
Similarly, if the number of storage segments owned by a particular node exceeds that node's target count by a predetermined amount, a balancing reassignment can be performed to assign one or more storage segments from that node to other nodes. The nodes to which storage segments are reassigned can be selected, for example, based on how far the number of storage segments each node owns has fallen below its own target count.
The database server can use a variety of factors to determine which storage segments to reassign during a balancing reassignment. For example, it can select the transferor node's storage segments that are least frequently involved in the operations the transferor node services for itself. Alternatively, it can select the transferor node's storage segments that are most frequently involved in operations requested by the transferee node. According to one embodiment, the database server takes many factors into account, including both how frequently a storage segment is involved in the transferor node's own operations and how frequently it is involved in operations requested by the transferee node.
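The threshold rule and the selection of storage segments described above can be sketched as follows. This is only an illustration under stated assumptions: the names `Node`, `rebalance`, and `access_count`, and the tie-breaking rules, are hypothetical and do not come from the patent. The sketch takes segments from the node that most exceeds its target count, gives them to the node that is furthest below its target count, and picks the segment the transferor node itself uses least often.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    target: int                                  # target storage segment count
    segments: set = field(default_factory=set)   # segments currently owned

    def surplus(self) -> int:
        """Owned segments minus target; negative means below target."""
        return len(self.segments) - self.target


def rebalance(nodes, threshold, access_count):
    """Perform balancing reassignments until no node is `threshold` or more
    segments away from its target. `access_count[(node_name, segment)]` is how
    often that node touched the segment (hypothetical statistics).
    Returns the list of (segment, from_node, to_node) moves."""
    moves = []
    while True:
        donor = max(nodes, key=Node.surplus)     # most over target
        taker = min(nodes, key=Node.surplus)     # most under target
        # Reassignment has overhead, so act only past the imbalance threshold.
        if max(donor.surplus(), -taker.surplus()) < threshold:
            break
        # A move must actually reduce the imbalance (prevents ping-pong).
        if donor.surplus() - taker.surplus() < 2:
            break
        # Pick the donor's segment that the donor itself uses least often.
        seg = min(donor.segments,
                  key=lambda s: (access_count.get((donor.name, s), 0), s))
        donor.segments.remove(seg)
        taker.segments.add(seg)
        moves.append((seg, donor.name, taker.name))
    return moves
```

A fuller policy could also weigh how often the transferee node requests each segment, as the embodiment above suggests; the sketch uses a single factor to keep the rule visible.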
Temporary ownership assignment
The preceding sections described techniques for reassigning ownership of data based on factors such as which nodes requested operations on which storage segments during a monitoring period. However, although reassignments made this way may be optimal overall, the resulting assignment is not optimal for every individual operation. Therefore, according to one embodiment, the database server includes logic for temporarily changing the ownership assignment of storage segments for the duration of a set of one or more operations. After the set of operations completes, the storage segments are reassigned back to their previous owners.
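A minimal sketch of temporary assignment, assuming ownership is tracked in a simple in-memory map: the class `OwnershipMap` and its method names are hypothetical. A context manager records the previous owners, applies the temporary assignment for the duration of a set of operations, and restores the previous owners afterwards, even if an operation fails.

```python
from contextlib import contextmanager


class OwnershipMap:
    """Hypothetical map from storage segment to its owning node."""

    def __init__(self, owners):
        self._owners = dict(owners)

    def owner_of(self, segment):
        return self._owners[segment]

    @contextmanager
    def temporarily_assign(self, assignments):
        """Apply {segment: node} for the duration of the with-block."""
        previous = {seg: self._owners[seg] for seg in assignments}
        self._owners.update(assignments)
        try:
            yield self
        finally:
            # Reassign every segment back to its previous owner.
            self._owners.update(previous)
```

For instance, `with owners.temporarily_assign({"B2": "node2"}): ...` would route operations on B2 to node 2 only while the block runs.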
For example, the database server may change ownership of data only for the duration of large operations. Thus, during the day, all storage segments that correspond to a table may be owned by a single node, so that all requests involving the table are sent to that node during the day. For end-of-day reporting, the same table can be "re-partitioned" so that the queries that retrieve the data for the end-of-day reports can run in parallel.
As another example, the database server may change ownership of data for the duration of recovery operations. Specifically, the database server can temporarily reassign storage segments evenly across all nodes. Each node can then perform roll-forward (redo) recovery and transaction rollback on the storage segments it owns, in parallel with the other nodes. Such ownership reassignment can be used for both database recovery and media recovery. For example, during media recovery, after the most recent backup has been restored, each node can apply redo from the archives in parallel. In these cases, each node that participates in parallel media recovery will need to read the appropriate redo logs and archives.
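The recovery case can be illustrated with a short sketch. It assumes a redo log represented as a list of `(sequence_number, segment, change)` tuples, which is an invented format used only for illustration; `assign_evenly` and `partition_redo` are hypothetical names. Storage segments are spread round-robin over the nodes, and each node receives only the redo records for the segments it temporarily owns, in log order.

```python
from collections import defaultdict


def assign_evenly(segments, nodes):
    """Temporarily spread segments over nodes in round-robin order."""
    return {seg: nodes[i % len(nodes)] for i, seg in enumerate(sorted(segments))}


def partition_redo(redo_log, temp_owner):
    """Group redo records by the node that temporarily owns each segment.
    Records keep their original log order within each node's work list."""
    work = defaultdict(list)
    for record in redo_log:
        _seq, segment, _change = record
        work[temp_owner[segment]].append(record)
    return dict(work)
```

Each node could then apply its own work list independently of the others, which is what allows the roll-forward to proceed in parallel.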
Flexible placement of parallel query slaves on nodes that do not currently own the data
As mentioned above, ownership of data can be changed temporarily for the duration of an operation. Such temporary assignments can be used, for example, to improve the performance of queries that specify parallelizable operations. When an operation is parallelized, it is broken into multiple subtasks that can be executed in parallel with one another. The processes used to perform such subtasks are referred to as parallel query slaves.
When a shared-nothing database system is implemented in a shared-disk environment, the placement of parallel query slaves is not dictated by the physical location of the data. For example, assume that the data in table T (1) resides on a disk shared by two nodes, node 1 and node 2, and (2) belongs to two storage segments B1 and B2 that correspond to partitions of table T. A query that requests a scan of table T can be divided into two subtasks: a scan of the data in B1 and a scan of the data in B2. Assume that node 1 is the owner of both B1 and B2 when the database server is asked to execute the query. Under these circumstances, ownership of B2 can be temporarily reassigned to node 2, so that a slave on node 2 can scan B2 while a slave on node 1 scans B1.
In the example above, ownership of a storage segment was temporarily assigned to another node so that more nodes would take part in processing a parallelizable query. Conversely, it may be desirable to temporarily reassign ownership in a way that reduces the number of nodes involved in processing a query. For example, assume that a query requests that data be scanned by a first set of slaves and then redistributed to a second set of slaves, so that the second set can perform some subsequent operation on the data. If the scan is performed by slaves spread across many nodes, the redistribution required by the query will cause a large amount of inter-node communication. To reduce the amount of inter-node communication, ownership of the partitions of table T can be consolidated.
For example, if B1 is owned by node 1 and B2 is owned by node 2 when the query starts, ownership of table T can be consolidated for the duration of the query by temporarily reassigning B2 from node 2 to node 1. Under these circumstances, both the first and second sets of slaves will be on node 1. Consequently, the redistribution that takes place after the scan involves only communication within node 1 and causes no inter-node traffic.
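The effect of consolidation on inter-node traffic can be shown with a small sketch. The functions `consolidate` and `cross_node_transfers` are hypothetical, and counting one transfer per storage segment is a deliberate simplification of the real cost.

```python
def consolidate(owners, target_node):
    """Return a temporary ownership map with every segment on target_node."""
    return {segment: target_node for segment in owners}


def cross_node_transfers(scan_owners, consumer_node):
    """Count segments whose scan output must cross nodes to reach the
    node that runs the second set of slaves."""
    return sum(1 for node in scan_owners.values() if node != consumer_node)


owners = {"B1": "node1", "B2": "node2"}
before = cross_node_transfers(owners, "node1")                       # B2 crosses nodes
after = cross_node_transfers(consolidate(owners, "node1"), "node1")  # nothing crosses
```

With the ownership from the example above, one segment's output crosses nodes before consolidation and none does afterwards.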
Performing work without ownership
As mentioned above, in a shared-nothing database system only the owner of a data item is permitted to perform tasks that involve the data item. However, according to one aspect of the invention, whether a non-owner of a data item is permitted to operate on the data item depends on the isolation level that the operation requires.
The concept of isolation levels is described in detail, for example, in U.S. Patent No. 5,870,758, entitled "Method And Apparatus For Providing Isolation Levels In A Database System". In the context of database systems, several isolation levels have been defined, including the "read uncommitted" isolation level and the "read committed" isolation level. The read uncommitted isolation level, which does not guard against dirty reads, non-repeatable reads, or phantom reads, is sufficient for some operations, such as data mining and statistical queries. For such operations, large query fragments that operate on data items can run on nodes that do not own those data items.
Unlike the read uncommitted isolation level, the read committed isolation level guards against dirty reads. For operations that require the read committed isolation level, the node that owns a set of data items can flush to disk the committed pages that contain those data items before the operation is performed. After the pages have been flushed to disk, a node that does not own the data items can perform the operation that involves them. Because the committed pages were flushed to disk, a non-owner node that has shared access to the disk can see those committed pages, so the read committed isolation level is maintained. The owner can mark the data items as read-only while the operation executes, until the operation completes.
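The two cases above can be sketched as follows, under simplifying assumptions: the shared disk is modeled as a dictionary, the owner's buffer cache holds committed pages that have not yet been written to disk, and the names `Isolation`, `Owner`, and `read_as_non_owner` are hypothetical. Under read committed, the owner flushes and marks the pages read-only before the non-owner reads; under read uncommitted, the non-owner reads whatever is on disk without asking the owner to flush.

```python
from enum import Enum


class Isolation(Enum):
    READ_UNCOMMITTED = 1
    READ_COMMITTED = 2


class Owner:
    """Hypothetical owner node with committed pages not yet on disk."""

    def __init__(self, disk, unflushed_committed):
        self.disk = disk                                # shared disk: page -> contents
        self.unflushed = dict(unflushed_committed)      # committed, still only in cache
        self.read_only = set()                          # pages frozen for a non-owner

    def flush(self, pages):
        for page in pages:
            if page in self.unflushed:
                self.disk[page] = self.unflushed.pop(page)


def read_as_non_owner(owner, pages, isolation):
    """Read pages from the shared disk on behalf of a non-owner node."""
    if isolation is Isolation.READ_COMMITTED:
        owner.flush(pages)                  # make committed versions visible on disk
        owner.read_only.update(pages)       # owner does not modify them meanwhile
        try:
            return {page: owner.disk[page] for page in pages}
        finally:
            owner.read_only.difference_update(pages)
    # Read uncommitted: no flush is requested from the owner.
    return {page: owner.disk[page] for page in pages}
```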
Using the shared disk for coordinator-slave communication
When a query is parallelized, one node often takes the role of coordinating the various slaves that participate in executing the query. That node, commonly called the "coordinator", typically receives the intermediate results produced by the participating slaves. The amount of data the coordinator receives from the query slaves may be very large. Therefore, according to one aspect of the invention, intermediate results are passed from the query slaves to the coordinator by way of the shared disk if the amount of data to be transferred would otherwise overwhelm the interconnect between the nodes. Specifically, if the amount of intermediate results generated by a slave exceeds a certain threshold, the results are written to a disk shared between the slave and the coordinator, rather than being sent directly from the volatile memory of the node on which the slave resides to the volatile memory of the node on which the coordinator resides. Using the shared disk as an intermediary in this way is particularly useful for blocking operators, where the consumer of an operator must wait until the child operator completes (that is, there is no pipelining between the operators).
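The threshold decision can be sketched as follows. This is an assumption-laden illustration: the shared disk is stood in for by a directory, the interconnect by an in-memory byte string, results are serialized as JSON, and the function names and message format are hypothetical.

```python
import json
import os
import tempfile


def send_result(rows, threshold_bytes, shared_dir):
    """Slave side: ship small results directly, spill large ones to shared disk."""
    payload = json.dumps(rows).encode("utf-8")
    if len(payload) <= threshold_bytes:
        return ("interconnect", payload)          # memory-to-memory transfer
    fd, path = tempfile.mkstemp(dir=shared_dir, suffix=".result")
    with os.fdopen(fd, "wb") as f:
        f.write(payload)
    return ("shared_disk", path)                  # only the location crosses the interconnect


def receive_result(message):
    """Coordinator side: read the result from wherever the slave put it."""
    kind, body = message
    if kind == "interconnect":
        return json.loads(body.decode("utf-8"))
    with open(body, "rb") as f:
        rows = json.loads(f.read().decode("utf-8"))
    os.remove(body)                               # intermediate result is no longer needed
    return rows
```

Because a blocking operator's consumer cannot start until the full result exists anyway, writing the result to disk first adds little latency in that case.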
Hardware overview
FIG. 2 is a block diagram that illustrates a computer system 200 upon which an embodiment of the invention may be implemented. Computer system 200 includes a bus 202 or other communication mechanism for communicating information, and a processor 204 coupled with bus 202 for processing information. Computer system 200 also includes a main memory 206, such as a random access memory (RAM) or other dynamic storage device, coupled to bus 202 for storing information and instructions to be executed by processor 204. Main memory 206 may also be used for storing temporary variables or other intermediate information during execution of instructions by processor 204. Computer system 200 further includes a read only memory (ROM) 208 or other static storage device coupled to bus 202 for storing static information and instructions for processor 204. A storage device 210, such as a magnetic disk or optical disk, is provided and coupled to bus 202 for storing information and instructions.
Computer system 200 may be coupled via bus 202 to a display 212, such as a cathode ray tube (CRT), for displaying information to a computer user. An input device 214, including alphanumeric and other keys, is coupled to bus 202 for communicating information and command selections to processor 204. Another type of user input device is cursor control 216, such as a mouse, a trackball, or cursor direction keys, for communicating direction information and command selections to processor 204 and for controlling cursor movement on display 212. This input device typically has two degrees of freedom in two axes, a first axis (e.g., x) and a second axis (e.g., y), that allow the device to specify positions in a plane.
The invention is related to the use of computer system 200 for implementing the techniques described herein. According to one embodiment of the invention, those techniques are performed by computer system 200 in response to processor 204 executing one or more sequences of one or more instructions contained in main memory 206. Such instructions may be read into main memory 206 from another computer-readable medium, such as storage device 210. Execution of the sequences of instructions contained in main memory 206 causes processor 204 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions to implement the invention. Thus, embodiments of the invention are not limited to any specific combination of hardware circuitry and software.
The term "computer-readable medium" as used herein refers to any medium that participates in providing instructions to processor 204 for execution. Such a medium may take many forms, including but not limited to non-volatile media, volatile media, and transmission media. Non-volatile media include, for example, optical or magnetic disks, such as storage device 210. Volatile media include dynamic memory, such as main memory 206. Transmission media include coaxial cables, copper wire and fiber optics, including the wires that comprise bus 202. Transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infrared data communications.
Common forms of computer-readable media include, for example, a floppy disk, a flexible disk, a hard disk, magnetic tape, or any other magnetic medium, a CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave as described hereinafter, or any other medium from which a computer can read.
Various forms of computer-readable media may be involved in carrying one or more sequences of one or more instructions to processor 204 for execution. For example, the instructions may initially be carried on a magnetic disk of a remote computer. The remote computer can load the instructions into its dynamic memory and send the instructions over a telephone line using a modem. A modem local to computer system 200 can receive the data on the telephone line and use an infrared transmitter to convert the data to an infrared signal. An infrared detector can receive the data carried in the infrared signal, and appropriate circuitry can place the data on bus 202. Bus 202 carries the data to main memory 206, from which processor 204 retrieves and executes the instructions. The instructions received by main memory 206 may optionally be stored on storage device 210 either before or after execution by processor 204.
Computer system 200 also includes a communication interface 218 coupled to bus 202. Communication interface 218 provides a two-way data communication coupling to a network link 220 that is connected to a local network 222. For example, communication interface 218 may be an integrated services digital network (ISDN) card or a modem to provide a data communication connection to a corresponding type of telephone line. As another example, communication interface 218 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN. Wireless links may also be implemented. In any such implementation, communication interface 218 sends and receives electrical, electromagnetic or optical signals that carry digital data streams representing various types of information.
Network link 220 typically provides data communication through one or more networks to other data devices. For example, network link 220 may provide a connection through local network 222 to a host computer 224 or to data equipment operated by an Internet Service Provider (ISP) 226. ISP 226 in turn provides data communication services through the worldwide packet data communication network now commonly referred to as the "Internet" 228. Local network 222 and Internet 228 both use electrical, electromagnetic or optical signals that carry digital data streams. The signals through the various networks and the signals on network link 220 and through communication interface 218, which carry the digital data to and from computer system 200, are exemplary forms of carrier waves transporting the information.
Computer system 200 can send messages and receive data, including program code, through the network(s), network link 220 and communication interface 218. In the Internet example, a server 230 might transmit requested code for an application program through Internet 228, ISP 226, local network 222 and communication interface 218.
The received code may be executed by processor 204 as it is received, and/or stored in storage device 210 or other non-volatile storage for later execution. In this manner, computer system 200 may obtain application code in the form of a carrier wave.
The foregoing describes only preferred embodiments of the invention and is not intended to limit the invention. Those skilled in the art will appreciate that the invention admits of various modifications and variations. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims (16)

1. A method for managing data, the method comprising the steps of:
maintaining a plurality of persistent data items on persistent storage that is accessible to a plurality of nodes, the persistent data items including a particular data item stored at a particular location on the persistent storage;
assigning exclusive ownership of each of the persistent data items to one of the plurality of nodes, wherein a particular node of the plurality of nodes is assigned exclusive ownership of the particular data item;
when any node wishes an operation that involves the particular data item to be performed, the node that desires the operation to be performed ships the operation to the particular node, for the particular node to perform the operation on the particular data item, because the particular data item is exclusively owned by the particular node;
gathering statistics by collecting information about at least one of system performance and workload; and
dynamically reassigning ownership of the persistent data items based on the statistics, to improve at least one of system performance and throughput.
2. The method according to claim 1, wherein the step of collecting information includes monitoring which nodes request operations that involve the persistent data items.
3. The method according to claim 1, wherein the plurality of nodes are nodes of a multi-node database system.
4. The method according to claim 1, wherein the step of dynamically reassigning is performed without receiving user input that specifies the nodes to which the persistent data items are to be reassigned.
5. The method according to claim 3, wherein the step of dynamically reassigning is performed while the multi-node database system continues to process database commands from database applications.
6. The method according to claim 1, wherein the step of dynamically reassigning is performed after a determined time period, based on which nodes requested operations that involve the persistent data items during the determined time period.
7. The method according to claim 6, wherein the step of dynamically reassigning includes reassigning ownership of the particular data item to the node that most frequently requested operations that involve the particular data item.
8. The method according to claim 1, wherein:
the step of dynamically reassigning includes reassigning ownership of the particular data item to a first node; and
the method further comprises the step of, in response to reassigning ownership of the particular data item to the first node, dynamically reassigning ownership of a second data item from the first node to a second node.
9. The method according to claim 8, wherein:
assigning the particular data item to the first node causes a threshold associated with the first node to be exceeded; and
the step of dynamically reassigning ownership of the second data item is performed in response to the threshold being exceeded.
10. A method for managing data, the method comprising the steps of:
maintaining a plurality of persistent data items on persistent storage that is accessible to a plurality of nodes, the persistent data items including a particular data item stored at a particular location on the persistent storage;
assigning exclusive ownership of each of the persistent data items to one of the nodes, wherein a first node of the plurality of nodes is assigned exclusive ownership of the particular data item;
when any node wishes an operation that involves the particular data item to be performed, the node that desires the operation to be performed ships the operation to the first node, for the first node to perform the operation on the particular data item, because the particular data item is exclusively owned by the first node;
while exclusive ownership of the particular data item is held by the first node, receiving an instruction that requests an operation on the particular data item; and
causing the operation to be performed by a second node that is different from the first node.
11. The method according to claim 10, wherein the step of causing the operation to be performed by a second node includes:
temporarily reassigning exclusive ownership of the particular data item to the second node for a duration at least as long as the second node requires to perform a sub-operation of the instruction, wherein the sub-operation involves the particular data item; and
after the duration, automatically reassigning the exclusive ownership of the particular data item back to the first node.
12. The method according to claim 11, wherein the step of temporarily reassigning is performed to allow sub-operations of a parallel operation to be distributed among slaves that reside on a plurality of nodes.
13. The method according to claim 11, wherein the step of temporarily reassigning is performed to allow the operation requested by the instruction to be consolidated on a group of one or more nodes that includes the second node and does not include the first node.
14. The method according to claim 10, wherein:
the step of causing the operation to be performed by a second node includes causing the second node to perform the operation without the second node acquiring exclusive ownership of the particular data item;
the second node is allowed to access the particular data item to perform a sub-operation of the instruction, wherein the sub-operation involves the particular data item; and
after the second node has completed the sub-operation, the second node ceases to be allowed to access the particular data item.
15. The method according to claim 14, further comprising the steps of:
determining that the read uncommitted isolation level applies to the instruction; and allowing the second node to perform the sub-operation on the particular data item without requesting that the first node flush any dirty version of the particular data item to disk.
16. The method according to claim 14, further comprising the steps of:
determining that the read committed isolation level applies to the instruction; and preventing the second node from performing the sub-operation until the first node has flushed any dirty version of the particular data item to disk.
CNB2004800215879A 2003-08-01 2004-07-28 Dynamic reassignment of data ownership Active CN100429622C (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US49201903P 2003-08-01 2003-08-01
US60/492,019 2003-08-01
US10/831,401 2004-04-23

Publications (2)

Publication Number Publication Date
CN1829962A CN1829962A (en) 2006-09-06
CN100429622C true CN100429622C (en) 2008-10-29

Family

ID=36947551

Family Applications (4)

Application Number Title Priority Date Filing Date
CNB2004800215879A Active CN100429622C (en) 2003-08-01 2004-07-28 Dynamic reassignment of data ownership
CNB2004800219070A Active CN100449539C (en) 2003-08-01 2004-07-28 Ownership reassignment in a shared-nothing database system
CNB200480021585XA Active CN100565460C (en) 2003-08-01 2004-07-28 Method for managing data
CN2004800217520A Active CN1829974B (en) 2003-08-01 2004-07-28 Parallel recovery by non-failed nodes

Country Status (1)

Country Link
CN (4) CN100429622C (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7979626B2 (en) * 2008-05-13 2011-07-12 Microsoft Corporation Flash recovery employing transaction log
US8375047B2 (en) * 2010-03-31 2013-02-12 Emc Corporation Apparatus and method for query prioritization in a shared nothing distributed database
CN102521307A (en) * 2011-12-01 2012-06-27 北京人大金仓信息技术股份有限公司 Parallel query processing method for share-nothing database cluster in cloud computing environment
US8799569B2 (en) * 2012-04-17 2014-08-05 International Business Machines Corporation Multiple enhanced catalog sharing (ECS) cache structure for sharing catalogs in a multiprocessor system
CN102968503B (en) * 2012-12-10 2015-10-07 曙光信息产业(北京)有限公司 The data processing method of Database Systems and Database Systems
US9367472B2 (en) * 2013-06-10 2016-06-14 Oracle International Corporation Observation of data in persistent memory
CN103399894A (en) * 2013-07-23 2013-11-20 中国科学院信息工程研究所 Distributed transaction processing method on basis of shared storage pool
US20150293708A1 (en) * 2014-04-11 2015-10-15 Netapp, Inc. Connectivity-Aware Storage Controller Load Balancing
CN107766001B (en) * 2017-10-18 2021-05-25 成都索贝数码科技股份有限公司 Storage quota method based on user group
CN108924184B (en) * 2018-05-31 2022-02-25 创新先进技术有限公司 Data processing method and server
CN110895483A (en) * 2018-09-12 2020-03-20 北京奇虎科技有限公司 Task recovery method and device
US11100086B2 (en) * 2018-09-25 2021-08-24 Wandisco, Inc. Methods, devices and systems for real-time checking of data consistency in a distributed heterogenous storage system
US11874816B2 (en) * 2018-10-23 2024-01-16 Microsoft Technology Licensing, Llc Lock free distributed transaction coordinator for in-memory database participants
CN110134735A (en) * 2019-04-10 2019-08-16 阿里巴巴集团控股有限公司 The storage method and device of distributed transaction log
CN112650561B (en) * 2019-10-11 2023-04-11 金篆信科有限责任公司 Transaction management method, system, network device and readable storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5539883A (en) * 1991-10-31 1996-07-23 International Business Machines Corporation Load balancing of network by maintaining in each computer information regarding current load on the computer and load on some other computers in the network
US5675791A (en) * 1994-10-31 1997-10-07 International Business Machines Corporation Method and system for database load balancing

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5696898A (en) * 1995-06-06 1997-12-09 Lucent Technologies Inc. System and method for database access control
CA2176775C (en) * 1995-06-06 1999-08-03 Brenda Sue Baker System and method for database access administration
US5903898A (en) * 1996-06-04 1999-05-11 Oracle Corporation Method and apparatus for user selectable logging
US5907849A (en) * 1997-05-29 1999-05-25 International Business Machines Corporation Method and system for recovery in a partitioned shared nothing database system using virtual share disks
US6493726B1 (en) * 1998-12-29 2002-12-10 Oracle Corporation Performing 2-phase commit with delayed forget
EP1256225B1 (en) * 2000-02-04 2010-08-18 Listen.Com, Inc. System for distributed media network and meta data server
EP1521184A3 (en) * 2001-06-28 2006-02-22 Oracle International Corporation Partitioning ownership of a database among different database servers to control access to the database

Also Published As

Publication number Publication date
CN1829988A (en) 2006-09-06
CN1829974A (en) 2006-09-06
CN100565460C (en) 2009-12-02
CN1829962A (en) 2006-09-06
CN1829961A (en) 2006-09-06
CN100449539C (en) 2009-01-07
CN1829974B (en) 2010-06-23

Similar Documents

Publication Publication Date Title
JP4614956B2 (en) Dynamic reassignment of data ownership
CN100429622C (en) Dynamic reassignment of data ownership
JP4586019B2 (en) Parallel recovery with non-failing nodes
CN1244055C (en) Non-uniform memory access data handling system with shared intervention support
US5692182A (en) Bufferpool coherency for identifying and retrieving versions of workfile data using a producing DBMS and a consuming DBMS
US5692174A (en) Query parallelism in a shared data DBMS system
US7933882B2 (en) Dynamic cluster database architecture
US7664799B2 (en) In-memory space management for database systems
CN103782295B (en) Query explain plan in a distributed data management system
US10152500B2 (en) Read mostly instances
JP2007501457A (en) Reassign ownership in a non-shared database system
CN100517303C (en) Partitioning ownership of a database among different database servers to control access to the database
JP2007025785A (en) Database processing method, system, and program
US7080075B1 (en) Dynamic remastering for a subset of nodes in a cluster environment
CN112162846B (en) Transaction processing method, device and computer readable storage medium
CN112328700A (en) Distributed database
JP2960297B2 (en) Database system and load distribution control method
US7536422B2 (en) Method for process substitution on a database management system
US20180060236A1 (en) Method and systems for master establishment using service-based statistics
CN101714152B (en) Method for dividing database ownership among different database servers to control access to databases
AU2004262380B2 (en) Dynamic reassignment of data ownership

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CI02 Correction of invention patent application

Correction item: Priority

Correct: 2004.04.23 US 10/831,401

False: Second priority claim missing

Number: 36

Page: The title page

Volume: 22

COR Change of bibliographic data

Free format text: CORRECT: PRIORITY; FROM: MISSING THE SECOND ARTICLE OF PRIORITY TO: 2004.4.23 US 10/831,401

C14 Grant of patent or utility model
GR01 Patent grant