|Publication number||US20070022145 A1|
|Application number||US 11/480,309|
|Publication date||25 Jan 2007|
|Filing date||29 Jun 2006|
|Priority date||23 Nov 2001|
|Also published as||DE60239358D1, EP1466246A1, EP1466246A4, EP1466246B1, US7287047, US8161003, US20030225800, US20090177719, WO2003046721A1|
|Original Assignee||Srinivas Kavuri|
This application claims priority from U.S. Provisional Patent Application No. 60/332,549, entitled “SELECTIVE DATA REPLICATION SYSTEM AND METHOD”, filed Nov. 23, 2001. The entire contents of the Provisional Application 60/332,549 are hereby incorporated herein by reference in their entirety.
A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosures, as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all copyright rights whatsoever.
This application is related to the following pending applications, each of which is hereby incorporated herein by reference in its entirety:
The invention disclosed herein relates generally to data storage in a computer network and more particularly to selectively copying data in a modular data and storage management system.
In the GALAXY storage management system software manufactured by CommVault Systems, Inc. of Oceanport, N.J., storage policies are utilized to direct how data is to be stored. Storage policies present the user with logical buckets for directing their data storage operations such as backup and retrieval. Each client points to a storage policy that allows the user to define how, where, and the duration for which data should be stored at a higher level of abstraction without requiring intimate knowledge or understanding of the underlying storage architecture and technology. The management details of the storage operations are transparent to the user.
Storage policies are thus a logical concept associated with one or more backup data sets with each backup data set being a self-contained unit of information. Each backup data set may contain data from multiple applications and from multiple clients. Within each backup data set are one or more archives which are discrete chunks or “blobs” of data generally relating to a particular application. For example, one archive might contain log files related to a data store and another archive in the same backup data set might contain the data store itself.
Backup systems often have various levels of storage. A primary backup data set, for example, indicates the default destination of storage operations for a particular set of data that the storage policy relates to and is tied to a particular set of drives. These drives are addressed independently of the library or media agent to which they are attached. The primary backup data set might, for example, contain data that is frequently accessed for a period of one to two weeks after it is stored. A storage administrator might find storing such data on a set of drives with fast access times preferable. On the other hand, such fast drives are expensive and once the data is no longer accessed as frequently, the storage administrator might find it desirable to move and copy this data to an auxiliary or secondary backup data set on a less expensive tape library or other device with slower access times. Once the data from the primary backup data set is moved to the auxiliary backup data set, the data can be pruned from the primary backup data set, freeing up drive space for new data.
While existing data storage systems provide a capability to copy data from the primary backup data set to auxiliary backup data sets, this copying procedure is a synchronous operation, meaning generally all data from the primary backup data set must be copied to all auxiliary backup data sets. This process is also called synchronous data replication and is inefficient in terms of data management.
A backup data set will likely contain more than one full backup of data relating to a particular application in addition to incremental or differential backups taken between full backups. For example, a storage administrator might specify for a particular backup data set of a storage policy that a full backup occur once per week with incremental backups occurring daily. If the backup data set were retained for a period of two weeks before being pruned, the backup data set would contain a first full backup of data, F1, with incremental backups I1, I2, I3, I4, I5, I6, and a second full backup F2. If each full backup required one tape and each incremental required half a tape, then 5 tapes would be required to store the data of this exemplary primary backup data set. The auxiliary backup data set would also require 5 tapes when data is transferred from primary to auxiliary backup data set.
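By way of illustration only, the tape arithmetic in the preceding example can be sketched as follows. The function name `tapes_required` and its parameters are hypothetical and are not part of the described system; the sketch merely assumes the per-backup tape costs stated above (one tape per full backup, half a tape per incremental backup).

```python
from fractions import Fraction


def tapes_required(full_backups, incremental_backups,
                   tapes_per_full=Fraction(1),
                   tapes_per_incremental=Fraction(1, 2)):
    """Return the number of tapes consumed by one backup data set.

    Hypothetical helper: the tape costs default to the values assumed
    in the example (1 tape per full, 1/2 tape per incremental).
    """
    return (full_backups * tapes_per_full
            + incremental_backups * tapes_per_incremental)


# Two-week retention: F1, I1..I6, F2
primary = tapes_required(full_backups=2, incremental_backups=6)

# Under synchronous replication the auxiliary set is a full copy,
# so it consumes the same number of tapes as the primary set.
auxiliary = primary

print(primary, auxiliary, primary + auxiliary)  # 5 5 10
```

Exact fractions are used rather than floating point so that half-tape and quarter-tape costs add up without rounding.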
Thus, even though synchronous data replication allows the flexibility to promote any auxiliary backup data set to be primary backup data set since the auxiliary backup data set is a full copy of the primary backup data set, tape consumption is very high. If for some reason, data cannot be copied to one auxiliary backup data set, tapes from the primary backup data set will not be rotated. Thus, users may want to copy only particular backups as their degree of required granularity changes. One prominent scheme in the field illustrating this principle is called “Grandfather, Father, Son” (GFS), in which each of the three represents a different period of time. For example, the son may represent a weekly degree of granularity, the father may represent a monthly degree of granularity, and the grandfather may represent a yearly degree of granularity.
Many users do not wish to copy all backups from the primary backup data set to all auxiliary backup data sets. Over time, the degree of granularity that users require changes and while recent data might need to be restored from any given point in time, less precision is generally required when restoring older data. Consider an exemplary storage scheme where full backups are taken weekly, incremental backups are taken daily, data is pruned after two weeks, full backups require one tape, and incremental backups require half a tape. A storage administrator in this example might require that data stored in the past month be able to be restored at a level of granularity of one day, meaning the data can be restored from any given day in the past month. At this degree of granularity, the incremental backups would be necessary to restore data. If the backup data set contained a first full backup of data, F1, with incremental backups I1, I2, I3, I4, I5, I6, and a second full backup F2, then F1, I1, I2, I3, I4, I5, I6 would be required. If incremental backup I6 is performed at the same time full backup F2 is performed, the tape containing F2 would be unnecessary, since the full backup F2 could be reproduced from F1 and the incremental backups I1-I6. On the other hand, the storage administrator in this example might only require a degree of granularity of one week for data more than one month old, and thus the incremental backups would not be required and the full backups would suffice. In this case, only the tapes containing the full backups F1 and F2 would be required and the three tapes containing incremental backups I1, I2, I3, I4, I5, I6 would be unnecessary.
Another example is a storage policy with three backup data sets called Wkly, Mnthly, and Yrly with different retention criteria. The Wkly backup data set has a retention period of 15 days, the Mnthly backup data set has a retention period of 6 months, and the Yrly backup data set has a retention period of 7 years. Backups in this example are performed every day with a full backup every Friday to the Wkly backup data set. In addition, a full backup is done at the end of each month to the Wkly backup data set. Only the full backup at the end of the week will be copied to the Mnthly backup data set and only the end-of-month full backup will be copied to the Yrly backup data set. Under the assumption that every full backup uses 1 tape and incremental backups require ¼ of a tape, the Wkly backup data set takes up to 6 tapes with at most 3 full backups and 12 incremental backups. These 6 tapes are recycled continually. The Mnthly backup data set takes 26 tapes that are constantly recycled and the Yrly backup data set takes 1 tape per month for 7 years. Thus, the Yrly backup data set requires 84 tapes in total, which are recycled over a long period of time.
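The tape counts in this example can be checked with a short calculation. The sketch below is illustrative only; the variable names are hypothetical, and it assumes that 6 months of weekly full backups corresponds to 26 weeks and that the 84-tape figure refers to the Yrly backup data set alone (12 monthly tapes per year for 7 years).

```python
from fractions import Fraction

TAPES_PER_FULL = Fraction(1)
TAPES_PER_INCREMENTAL = Fraction(1, 4)

# Wkly: at most 3 full backups and 12 incremental backups are retained.
wkly_tapes = 3 * TAPES_PER_FULL + 12 * TAPES_PER_INCREMENTAL

# Mnthly: one weekly full backup per week, retained for 6 months.
# Assumption: 6 months is treated as 26 weeks.
mnthly_tapes = 26 * TAPES_PER_FULL

# Yrly: one end-of-month full backup per month, retained for 7 years.
yrly_tapes = 12 * 7 * TAPES_PER_FULL

print(wkly_tapes, mnthly_tapes, yrly_tapes)  # 6 26 84
```

Because only selected full backups reach the Mnthly and Yrly backup data sets, their tape counts depend on the number of retained full backups rather than on the daily incremental volume.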
Also, sometimes problems occur with bad tapes or holes in data due to hardware or software problems. In these instances, data from the primary backup data set cannot be pruned unless all data is copied to all auxiliary backup data sets which is a highly time intensive process and also requires a large number of tapes.
There is thus a need for a system which enables selective copying of data from the primary backup data set to auxiliary backup data sets, promotes efficient tape rotation, provides the capability to configure any variant of GFS scheme, and which further allows selective pruning of data from the primary backup data set.
The present invention addresses, among other things, the problems discussed above with backup data storage in a computer network.
In accordance with some aspects of the present invention, computerized methods are provided for copying electronic data in a first backup data set, the methods comprising identifying, in the first backup data set, a data item satisfying a selection criterion; and copying to a second backup data set at least a portion of the data item. In some embodiments, the data item may comprise a full backup within a primary backup data set of application data, a full backup within an auxiliary backup data set of application data, a data item associated with a data-specific ID, or other data items.
The selection criterion is a property or characteristic of the first data item used by the invention to select the first data item for copying and other purposes. In some embodiments, the selection criterion comprises a time criterion and identifying the data item comprises comparing a time the data item was stored to the time criterion. In some embodiments, the time criterion comprises a day of a week and identifying the data item comprises comparing a day of the week the data item was stored to the day of the week. In some embodiments, the time criterion comprises a day of a month and identifying the data item comprises comparing a day of the month the data item was stored to the day of the month. In some embodiments, the selection criterion comprises a cycle criterion and identifying the data item comprises comparing a number of cycles occurring since the data item was stored to the cycle criterion. In some embodiments, the cycle criterion comprises a number of full backups performed and identifying the data item comprises comparing a number of full backups performed since the data item was stored to the number of full backups.
In some embodiments, data items satisfying the selection criterion are indicated or otherwise marked or flagged. In some embodiments, indicating that the data item satisfies the selection criterion comprises associating, in a data structure, information with the data item indicating that the data item satisfies the selection criterion. In some embodiments, indicating that the data item satisfies the selection criterion comprises associating, in a matrix, information with the data item indicating that the data item satisfies the selection criterion. In some embodiments, the data item indicated is de-indicated after the data item is copied to the second backup data set. In some embodiments, the data item is de-indicated by removing, in a data structure, information associated with the data item indicating that the data item satisfies the selection criterion. In some embodiments, the data item is de-indicated by removing, in a matrix, information associated with the data item indicating that the data item satisfies the selection criterion. In some embodiments, the data item indicated will not be pruned by a pruning program unless the data item is first de-indicated.
In some embodiments, the data item comprises a full backup of application data.
In some embodiments, copying at least a portion of the data item comprises performing an auxiliary copy of at least a portion of the data item. In some embodiments, the copying of at least a portion of the data item is a restart-able operation.
In one embodiment, the invention provides a system for copying electronic data, the system comprising a first backup data set containing one or more data items; a second backup data set; and a computer, connectable to the first backup data set and the second backup data set; wherein the computer is programmed to identify, in the first backup data set, a first data item satisfying a selection criterion; and to copy at least a portion of the first data item from the first backup data set to the second backup data set.
In one embodiment, the invention provides a computer usable medium storing program code which, when executed on a computerized device, causes the computerized device to execute a computerized method for copying electronic data stored in a first backup data set, the method comprising identifying, in the first backup data set, a data item satisfying a selection criterion; and copying to a second backup data set at least a portion of the data item.
The invention is illustrated in the figures of the accompanying drawings which are meant to be exemplary and not limiting, in which like references are intended to refer to like or corresponding parts, and in which:
Preferred embodiments of the invention are now described with reference to the drawings. As described further below, preferences associated with data-specific storage policies are used to perform selective data replication.
For example, the storage control 104 may move data that has been stored in the first storage media 110 into the second storage media 112 based on certain storage policies 106. In addition, the storage control 104 could move data from the second storage media 112 to any other of the storage media 108 including the Nth storage media 114. Also, the storage control 104 could move data from the first storage media 110 to the Nth storage media 114. Of course, the data could be moved in either direction, and the storage control 104 is capable of moving data generally between storage media 108.
The manager module 124 is an exemplary storage control 104, and includes storage policies 106 which are used to determine how data that the manager module 124 receives from the installed file system 122 will be stored in the storage media. The manager module 124 also includes a data structure called a master map 126 to assist in initial storage decisions in the storage media. In some embodiments, when the software application is directed to store data, the data is sent to the installed file system, and then the manager module 124 accesses the storage policies 106 to determine the appropriate location for storage of the data. The master map 126 includes further information for directing the data to be sent to other devices, for example to other computing devices for further processing or to various storage devices 108.
A media module 128 is a hardware or software module that includes a data index 130 that provides further details of where the data is to be stored in the storage system 116. The data index 130 includes details such as the location of storage media 108, such as magnetic disc media 132 and magnetic tape media 134. The data index 130 is updated with file location information when any data is moved from one storage media 108 to another, such as from the magnetic disc media 132 to the magnetic tape media 134. Of course, additional types and more than one type of storage media 108 could be incorporated into the storage system 116.
For example, the far left column illustrates the storage sequence name followed by a first storage ID in the second column. When applications 102 direct the storage of data, the storage control 104 consults the storage policies 106 to determine an appropriate storage sequence 136 to use for storing the data according to the data-specific ID 146 of the application generating the data. A default storage policy may, for example, include storing the data according to the primary storage sequence 150 where the data would enter the first storage having an ID of 001. The data would be stored at the first storage ID for 13 weeks as indicated in the next column of the storage sequences 136. At this point the data would be moved to a second storage ID 005 where it would be stored for a duration of 26 weeks. After 26 weeks, the data would be moved to a third storage ID 002 for a duration of 52 weeks. This process would continue until the data is stored in an Nth storage ID 004. The primary storage sequence 150, of course, is an exemplary storage sequence, but is explained here as a means to understanding operations of the storage sequences 136. As can be seen from
As previously described, application data is stored according to storage sequences 136 and associated with the data-specific IDs 146 specified in storage logic 138. Thus, for example, a storage sequence 136 might require that a full backup be performed weekly with incremental backups being performed on a daily basis. Further, multiple data-specific IDs 146 might be associated with the same storage sequence, and thus a backup data set might contain numerous full backups and incremental backups from one or more different applications. For example, the primary backup data set 158 contains five full backups A1 160, A2 168, B1 172, B2 178, and A3 180, and six incremental backups A′1 162, A′2 164, A′3 166, A′4 170, B′1 174, and A′5 176.
Each full backup, incremental backup, and other chunk of data stored on a backup data set has a number of characteristics associated with it such as a data-specific ID 146 for the application generating the data, a date the data was stored, the amount of data stored, and other characteristics known in the art which are useful in identifying data. Among other things, these characteristics can be used, as further described herein, to identify discrete individual chunks of data within a backup data set and to perform selective data replication by copying the individual chunks from the backup data set to another backup data set using auxiliary copy and other copying methods known in the art.
Starting with the first data-specific ID 146, step 190, the manager module 124 reviews the data stored on each backup data set to determine whether anything is to be copied for that data-specific ID 146 based on selection criteria of the auxiliary backup data set to which the data is to be copied, step 192. For example, an auxiliary backup data set might specify that data for a particular application, such as Microsoft Exchange data, is to be selectively copied from the primary backup data set according to certain selection criteria.
Each auxiliary backup data set has selection criteria used to decide when to copy which full backup to it. Selection criteria used in selective data replication can be defined either in time or in cycles. Time criteria, for example, can be specified as a given day of the month, every n months, and a starting month, or as a given day of the week, every n weeks, and a starting day of the week. Day of the month in the previous example could take the form of last day of the month. In the case of cycle criteria, a cycle represents the data stored between full backups. For example, a cycle might include a first full backup F1, incremental backups I1, I2, I3, I4, I5, I6, and a second full backup F2. In some embodiments, an application manager keeps track of the cycle number for full backups on a data-specific ID 146 basis. This allows backups to be pruned with a shorter retention on the primary backup data set, even though such pruning leaves no trace from which to determine the number of cycles between the existing full backups and the full backup copied to an auxiliary backup data set. Those skilled in the art will recognize that many other selection criteria could be used to perform selective data replication.
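A time criterion of the "given day of the month, every n months, starting month" form might be represented as in the following sketch. This is an illustration under stated assumptions and not the described implementation: the names `MonthlyCriterion`, `LAST_DAY`, and `date_matches` are hypothetical, and clamping a day such as the 31st to the end of a shorter month is an assumption the text does not specify.

```python
import calendar
from dataclasses import dataclass
from datetime import date

LAST_DAY = 0  # hypothetical sentinel meaning "last day of the month"


@dataclass(frozen=True)
class MonthlyCriterion:
    day_of_month: int    # 1..31, or LAST_DAY
    every_n_months: int  # e.g. 3 for quarterly
    start_year: int
    start_month: int     # 1..12


def date_matches(criterion, d):
    """Return True if date d is a selection date under the criterion."""
    months_elapsed = ((d.year - criterion.start_year) * 12
                      + (d.month - criterion.start_month))
    if months_elapsed < 0 or months_elapsed % criterion.every_n_months != 0:
        return False
    last = calendar.monthrange(d.year, d.month)[1]  # days in d's month
    if criterion.day_of_month == LAST_DAY:
        target = last
    else:
        # Assumption: clamp to the month's length (e.g. 31 -> 30 in April).
        target = min(criterion.day_of_month, last)
    return d.day == target


quarterly = MonthlyCriterion(LAST_DAY, every_n_months=3,
                             start_year=2006, start_month=1)
print(date_matches(quarterly, date(2006, 1, 31)))  # True
print(date_matches(quarterly, date(2006, 2, 28)))  # False
print(date_matches(quarterly, date(2006, 4, 30)))  # True
```

A day-of-week criterion could be expressed analogously by counting elapsed weeks from a starting date instead of elapsed months.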
The manager module 124 then marks or otherwise flags in the master map 126 those jobs that satisfy the selection criteria as jobs to be replicated, step 194. Marking these jobs as such ensures that they will not be pruned before replication can be completed. Often, backup data sets are pruned to promote more efficient tape usage and data storage generally. For example, a storage administrator or a pruning program might prune all backups in a backup data set older than a certain date or according to other useful pruning selection criteria known in the art. When a pruning program searches for data to prune in a backup data set, it first checks to see if data items satisfying the pruning selection criteria are marked to be selectively replicated. If a data item is so marked, then a pruning program will not prune the data item until the data item has been selectively replicated and unmarked accordingly as further described herein.
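The interaction between marking and pruning described above can be sketched as follows. The `Job` class, its field names, and the `prune` function are hypothetical and serve only to illustrate the check; an age cutoff is used as one example of a pruning selection criterion.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Job:
    job_id: str
    stored_on: date
    marked_for_replication: bool = False  # set at step 194, cleared at step 198


def prune(jobs, cutoff):
    """Split jobs into (kept, pruned).

    A job is pruned only if it is older than the cutoff AND is not
    currently marked for selective replication.
    """
    kept, pruned = [], []
    for job in jobs:
        if job.stored_on < cutoff and not job.marked_for_replication:
            pruned.append(job)
        else:
            kept.append(job)
    return kept, pruned


jobs = [
    Job("F1", date(2006, 6, 2), marked_for_replication=True),
    Job("I1", date(2006, 6, 3)),
    Job("F2", date(2006, 6, 30)),
]
kept, pruned = prune(jobs, cutoff=date(2006, 6, 15))
print([j.job_id for j in kept])    # ['F1', 'F2']
print([j.job_id for j in pruned])  # ['I1']
```

Here F1 satisfies the age-based pruning criterion but survives because it is still marked; once it is replicated and unmarked, a later pruning pass would remove it.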
Unlike synchronous data replication where data is replicated archive file by archive file, and thus all backups, incremental backups, differential backups, and other backups are copied to the auxiliary backup data set, with selective data replication the manager module 124 initiates the copy operation on a job-by-job basis to all the necessary backup data sets, step 196, and copies only those full backups satisfying the selection criteria. In some embodiments, this auxiliary copy operation is restart-able since otherwise, the user may not know to restart the operation on the storage policy 106 and still may, for example, have data loss or tapes not being freed due to auxiliary copy failures.
Once the selective data replication of a particular job is complete, the manager module 124 unmarks that job indicating that the data has been successfully copied, step 198, and that job may now be pruned or otherwise manipulated. In some embodiments, the manager module 124 compensates for the same job being replicated to multiple backup data sets by reflecting this status using a matrix data structure or other technique suitable for tracking multiple items and operations in order that jobs may not be pruned before replication to all backup data sets is complete. The manager module 124 checks if there are remaining data-specific IDs 146, step 200, and either returns to step 192 or exits the subroutine, step 202, if there are not.
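The overall flow of steps 190 through 202, including tracking of one job across several destination backup data sets, might be sketched as follows. All names are hypothetical; the "matrix" is modeled here as a mapping from each job to the set of destinations still awaiting a copy, which is only one of many suitable representations, and job identifiers are assumed to be unique.

```python
from collections import defaultdict


def selective_replication(jobs_by_data_id, destinations, satisfies, copy_job):
    """Return the jobs still marked after one pass, keyed by job.

    satisfies(job, dest) -> bool : selection criterion of a destination
    copy_job(job, dest)  -> bool : True if the copy succeeded
    Both callables are supplied by the caller (hypothetical interfaces).
    """
    pending = defaultdict(set)  # job -> destinations not yet copied to
    for data_id, jobs in jobs_by_data_id.items():  # steps 190 and 200
        for job in jobs:                           # step 192: review data
            for dest in destinations:
                if satisfies(job, dest):
                    pending[job].add(dest)         # step 194: mark
        for job in jobs:
            if job not in pending:
                continue
            for dest in sorted(pending[job]):      # step 196: copy
                if copy_job(job, dest):
                    pending[job].discard(dest)     # step 198: unmark
            if not pending[job]:
                del pending[job]                   # fully replicated
    return dict(pending)                           # step 202: exit


def fake_copy(job, dest):
    # Simulated failure of one copy, for illustration only.
    return not (job == "F2" and dest == "aux_monthly")


remaining = selective_replication(
    {"exchange": ["F1", "I1", "F2"]},
    ["aux_weekly", "aux_monthly"],
    lambda job, dest: job.startswith("F"),  # copy full backups only
    fake_copy,
)
print(remaining)  # {'F2': {'aux_monthly'}}
```

A job remains in the returned mapping, and therefore protected from pruning, until every destination that selected it has received a successful copy.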
A time-based example illustrating the process described in
For example, if the selection criterion is a given day of the month every n months, the manager module 124 will copy the most recent successful full backup made to the primary backup data set since selective copy was configured. The manager module 124 copies the first full backup to all the backup data sets irrespective of the criterion. After the first backup is copied, the manager module 124 will try to find the most recent successful full backup as of the given day of the month, searching backwards in time from the current time, and copies that full backup. In some embodiments, if the full backup found is the same as the full backup which was already copied, the manager module 124 will issue a critical event and an alert.
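The backward search for the most recent successful full backup as of a selection date could be sketched as below. The class and function names are hypothetical, and the alert is modeled simply as a returned message rather than an actual critical event.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class FullBackup:
    job_id: str
    completed_on: date
    successful: bool = True


def most_recent_full_as_of(backups, as_of):
    """Return the latest successful full backup on or before as_of, or None."""
    candidates = [b for b in backups
                  if b.successful and b.completed_on <= as_of]
    return max(candidates, key=lambda b: b.completed_on, default=None)


def pick_for_copy(backups, as_of, already_copied_ids):
    """Return (backup_to_copy, alert_message); at most one is not None."""
    found = most_recent_full_as_of(backups, as_of)
    if found is None:
        return None, None
    if found.job_id in already_copied_ids:
        return None, f"ALERT: {found.job_id} was already copied"
    return found, None


backups = [
    FullBackup("F1", date(2006, 6, 2)),
    FullBackup("F2", date(2006, 6, 9), successful=False),
    FullBackup("F3", date(2006, 6, 16)),
    FullBackup("F4", date(2006, 7, 7)),
]
chosen, alert = pick_for_copy(backups, date(2006, 6, 30), {"F1"})
print(chosen.job_id, alert)  # F3 None
```

If F3 had already been copied, the same call would return no backup and an alert message, corresponding to the critical event described above.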
A cycle-based example illustrating the process described in
For example, if the selection criterion is given in cycles as 4 cycles, the manager module 124 will copy the first full backup that happens to the primary backup data set. The manager module 124 records the cycle number for this full backup. The manager module 124 will then try to find the fourth successful full backup going backward in time. If one exists, then that full backup is also copied.
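One possible reading of the cycle-based criterion is sketched below. The text does not fully specify the comparison, so this sketch assumes that the first full backup is always selected and that a later full backup is selected once at least n cycles have elapsed since the last selected one; the function name `select_by_cycles` is hypothetical.

```python
def select_by_cycles(full_backup_cycles, n_cycles):
    """Return the cycle numbers of the full backups selected for copying.

    full_backup_cycles: ascending cycle numbers of successful full backups.
    Assumption: select the first full backup, then each full backup that
    is at least n_cycles after the most recently selected one.
    """
    selected = []
    last_copied = None  # recorded cycle number of the last copied full backup
    for cycle in full_backup_cycles:
        if last_copied is None or cycle - last_copied >= n_cycles:
            selected.append(cycle)
            last_copied = cycle
    return selected


print(select_by_cycles(range(1, 11), 4))                  # [1, 5, 9]
print(select_by_cycles([1, 2, 3, 4, 6, 7, 8, 9, 10], 4))  # [1, 6, 10]
```

The second call shows the effect of a missing cycle 5 (for instance, a failed full backup): the next available full backup is selected instead. Because only the recorded cycle number of the last copied backup is consulted, the selection does not depend on older full backups still being present in the primary backup data set.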
Systems and modules described herein may comprise software, firmware, hardware, or any combination(s) of software, firmware, or hardware suitable for the purposes described herein. Software and other modules may reside on servers, workstations, personal computers, computerized tablets, PDAs, and other devices suitable for the purposes described herein. Software and other modules may be accessible via local memory, via a network, via a browser or other application in an ASP context, or via other means suitable for the purposes described herein. Data structures described herein may comprise computer files, variables, programming arrays, programming structures, or any electronic information storage schemes or methods, or any combinations thereof, suitable for the purposes described herein. User interface elements described herein may comprise elements from graphical user interfaces, command line interfaces, and other interfaces suitable for the purposes described herein. Screenshots presented and described herein can be displayed differently as known in the art to input, access, change, manipulate, modify, alter, and work with information.
While the invention has been described and illustrated in connection with preferred embodiments, many variations and modifications as will be evident to those skilled in this art may be made without departing from the spirit and scope of the invention, and the invention is thus not to be limited to the precise details of methodology or construction set forth above as such variations and modification are intended to be included within the scope of the invention.
|U.S. Classification||1/1, 707/E17.032, 714/E11.123, 707/999.204|
|International Classification||G06F12/00, G06F3/06, G06F17/30, G06F11/14|
|Cooperative Classification||Y10S707/99955, Y10S707/99952, G06F11/1451, G06F11/1448|
|European Classification||G06F11/14A10D2, G06F11/14A10P|