CN103488734A - Data processing method and deduplication engine - Google Patents


Info

Publication number
CN103488734A
CN103488734A (application CN201310425874.7A)
Authority
CN
China
Prior art keywords
data
thread
subtask
file
memory node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310425874.7A
Other languages
Chinese (zh)
Inventor
付旭东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN201310425874.7A priority Critical patent/CN103488734A/en
Publication of CN103488734A publication Critical patent/CN103488734A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/17Details of further file system functions
    • G06F16/174Redundancy elimination performed by the file system
    • G06F16/1748De-duplication implemented within the file system, e.g. based on file segments

Abstract

An embodiment of the invention discloses a data processing method and a deduplication engine. The method is applied to a data backup system that comprises the deduplication engine and a storage node; the data backup system establishes at least two threads in a thread pool in advance, and the deduplication engine divides a data duplicate-checking task into at least one subtask. The method includes: the deduplication engine invokes at least one thread in the thread pool to execute, on data of a first file, the subtasks included in the duplicate-checking task, so as to query whether the storage node contains duplicates of the data of the first file; the invoked threads are released; and, when no duplicate of the data of the first file is found, the data of the first file is sent to the storage node for storage. The method prevents threads from waiting a long time for I/O (input/output) to return, makes full and efficient use of the CPU (central processing unit), and improves data deduplication performance.

Description

Data processing method and deduplication engine
Technical field
The present invention relates to the field of data processing technologies, and in particular to a data processing method and a deduplication engine.
Background
Data deduplication is a data reduction technology generally used in disk-based backup systems. It is intended to reduce the storage capacity consumed in a storage system and is a prominent technology in the storage industry.
Data deduplication records the fingerprints of data blocks and identifies duplicate data according to those fingerprints. During a data backup, a duplicated data block is referenced directly and the reference count in its metadata is updated; otherwise, the new data block is saved and its fingerprint information is recorded.
In the prior art, when data deduplication is applied during backup, each backup data stream occupies one thread until the stream has finished backing up. With this method a single thread cannot make full use of the CPU, so deduplication performance is difficult to improve.
Summary of the invention
Embodiments of the present invention provide a data processing method and a deduplication engine, which can use the CPU efficiently and improve data deduplication performance.
To solve the foregoing technical problem, the embodiments of the present invention disclose the following technical solutions:
According to a first aspect, a data processing method is provided. The method is applied to a data backup system, where the data backup system includes a deduplication engine and a storage node; the data backup system establishes at least two threads in a thread pool in advance; and the deduplication engine divides a data duplicate-checking task into at least one subtask. The method includes:
invoking, by the deduplication engine, at least one thread in the thread pool to execute, on data of a first file, the subtasks included in the duplicate-checking task, so as to query whether the storage node contains duplicates of the data of the first file;
releasing the at least one invoked thread;
and, when no duplicate of the data of the first file is found, sending the data of the first file to the storage node for storage.
With reference to the first aspect, in a first possible implementation, the method further includes:
invoking, by the storage node, a thread in the thread pool to store the data of the first file;
releasing the thread invoked by the storage node after the storing is completed.
With reference to the first aspect and/or the first possible implementation, in a second possible implementation, the dividing, by the deduplication engine, the data duplicate-checking task into at least one subtask includes:
dividing, by the deduplication engine, the duplicate-checking task into three subtasks, which are, in order, a data chunking subtask, a fingerprint calculation subtask, and a duplicate-block query subtask.
The invoking, by the deduplication engine, at least one thread in the thread pool to execute the subtasks of the duplicate-checking task on the data of the first file includes:
executing, by a first thread, the data chunking subtask on the input data of the first file and outputting the processed data to a second thread; executing, by the second thread, the fingerprint calculation subtask on the data processed by the first thread and outputting the processed data to a third thread; and executing, by the third thread, the duplicate-block query subtask on the data processed by the second thread, where the first thread, the second thread, and the third thread are each any thread in the thread pool.
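The three-stage pipeline of this implementation can be sketched as follows: a minimal Python illustration, assuming in-memory queues between the stages. The function names (`chunk_stage`, `fingerprint_stage`, `query_stage`), the SHA-256 fingerprint, and the fixed 4 KiB block size are illustrative choices, not taken from the patent.

```python
import hashlib
import queue
import threading

SENTINEL = None  # marks the end of the data stream

def chunk_stage(data: bytes, out_q: queue.Queue, size: int = 4096):
    # First thread: split the file data into fixed-size blocks.
    for i in range(0, len(data), size):
        out_q.put(data[i:i + size])
    out_q.put(SENTINEL)

def fingerprint_stage(in_q: queue.Queue, out_q: queue.Queue):
    # Second thread: compute a fingerprint for each block.
    while (block := in_q.get()) is not SENTINEL:
        out_q.put((hashlib.sha256(block).hexdigest(), block))
    out_q.put(SENTINEL)

def query_stage(in_q: queue.Queue, index: set, new_blocks: list):
    # Third thread: look up each fingerprint in an in-memory index.
    while (item := in_q.get()) is not SENTINEL:
        fp, block = item
        if fp not in index:          # not a duplicate: remember it
            index.add(fp)
            new_blocks.append(block)

data = b"abcd" * 4096                # 16 KiB of highly redundant data
q1, q2 = queue.Queue(), queue.Queue()
index, new_blocks = set(), []
threads = [
    threading.Thread(target=chunk_stage, args=(data, q1)),
    threading.Thread(target=fingerprint_stage, args=(q1, q2)),
    threading.Thread(target=query_stage, args=(q2, index, new_blocks)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(new_blocks))               # only 1 unique 4 KiB block survives
```

As in the text, each stage runs on its own thread, and distinct files could be fed through the pipeline concurrently while each file's blocks still pass through the stages in order.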
With reference to the first aspect, and/or the first possible implementation, and/or the second possible implementation, in a third possible implementation, the process by which the deduplication engine invokes a thread in the thread pool to execute one subtask of the duplicate-checking task on the data of the first file includes:
generating, by the deduplication engine, the subtask, and adding the subtask to the task queue corresponding to that subtask;
when the subtask in the task queue is scheduled for execution, invoking a thread in the thread pool to execute the subtask, and releasing the thread occupied by the subtask after the subtask completes;
where the subtask is any one of the data chunking subtask, the fingerprint calculation subtask, and the duplicate-block query subtask.
According to a second aspect, a deduplication engine is provided. The deduplication engine is applied to a data backup system, where the data backup system includes the deduplication engine and a storage node, and the data backup system establishes at least two threads in a thread pool in advance. The deduplication engine includes:
a division unit, configured to divide a data duplicate-checking task into at least one subtask;
a thread processing unit, configured to invoke at least one thread in the thread pool, and to release the at least one invoked thread after a duplicate-checking unit executes the subtasks included in the duplicate-checking task;
the duplicate-checking unit, configured to use the thread invoked by the thread processing unit to execute, on data of a first file, the subtasks included in the duplicate-checking task, so as to query whether the storage node contains duplicates of the data of the first file;
a data sending unit, configured to send the data of the first file to the storage node for storage when the duplicate-checking unit finds no duplicate of the data of the first file.
With reference to the second aspect, in a first possible implementation, the division unit is specifically configured to divide the duplicate-checking task into three subtasks, which are, in order, a data chunking subtask, a fingerprint calculation subtask, and a duplicate-block query subtask;
the duplicate-checking unit is specifically configured to execute, by a first thread, the data chunking subtask on the input data of the first file and output the processed data to a second thread; execute, by the second thread, the fingerprint calculation subtask on the data processed by the first thread and output the processed data to a third thread; and execute, by the third thread, the duplicate-block query subtask on the data processed by the second thread, where the first thread, the second thread, and the third thread are each any thread in the thread pool.
With reference to the second aspect and/or the first possible implementation, in a second possible implementation, the duplicate-checking unit includes:
a generating subunit, configured to generate a subtask and add it to the task queue corresponding to that subtask;
an executing subunit, configured to, when the subtask in the task queue is scheduled for execution, use a thread in the thread pool invoked by the thread processing unit to execute the subtask;
the thread processing unit being specifically configured to release the thread occupied by the subtask after the subtask completes;
where the subtask is any one of the data chunking subtask, the fingerprint calculation subtask, and the duplicate-block query subtask.
With reference to the second aspect, and/or the first possible implementation, and/or the second possible implementation, a third possible implementation is also provided.
According to a third aspect, a data backup system is further provided, including a deduplication engine and a storage node, where the data backup system establishes at least two threads in a thread pool in advance.
The deduplication engine is configured to divide a data duplicate-checking task into at least one subtask; invoke at least one thread in the thread pool to execute, on data of a first file, the subtasks included in the duplicate-checking task, so as to query whether the storage node contains duplicates of the data of the first file; release the at least one invoked thread; and, when no duplicate of the data of the first file is found, send the data of the first file to the storage node for storage.
With reference to the third aspect, in a first possible implementation, the storage node is configured to invoke a thread in the thread pool to store the data of the first file, and to release the invoked thread after the storing is completed.
According to a fourth aspect, a deduplication engine is further provided. The deduplication engine is applied to a data backup system, where the data backup system includes the deduplication engine and a storage node, and the data backup system establishes at least two threads in a thread pool in advance. The deduplication engine includes a processor and a memory;
the memory stores program code, and the processor is configured to read the program code in the memory and perform the following steps:
dividing a data duplicate-checking task into at least one subtask;
invoking at least one thread in the thread pool to execute, on data of a first file, the subtasks included in the duplicate-checking task, so as to query whether the storage node contains duplicates of the data of the first file;
releasing the at least one invoked thread;
and, when no duplicate of the data of the first file is found, sending the data of the first file to the storage node for storage.
In the embodiments of the present invention, the duplicate-checking task, which mainly consumes CPU resources, occupies at least one thread of its own. During a data backup this prevents threads from waiting a long time for I/O to return: while one piece of data occupies a thread executing its storage task, another thread can execute the duplicate-checking task on other data. The CPU is thus used fully and efficiently, and data deduplication performance is improved.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings required for the embodiments or the prior-art description are briefly introduced below. Evidently, a person of ordinary skill in the art may further derive other drawings from these drawings without creative effort.
Fig. 1 is a flowchart of a data processing method according to an embodiment of the present invention;
Fig. 2 is a flowchart of another data processing method according to an embodiment of the present invention;
Fig. 3 is a flowchart of a method for chunking the data of the first file according to an embodiment of the present invention;
Fig. 4 is a flowchart of a method for calculating the fingerprint of a first data block according to an embodiment of the present invention;
Fig. 5 is a flowchart of a method for duplicate-checking the fingerprint information of the first data block according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of a deduplication engine according to an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of the duplicate-checking unit according to an embodiment of the present invention.
Embodiments
To help those skilled in the art better understand the technical solutions in the embodiments of the present invention, and to make the foregoing objectives, features, and advantages of the embodiments clearer, the technical solutions in the embodiments are described in further detail below with reference to the accompanying drawings.
In the prior art, the backup of a data stream comprises a data deduplication process and a data storage process. The deduplication process, that is, duplicate checking, comprises the data chunking, fingerprint calculation, and duplicate query sub-processes, and the storage process comprises the sub-process of writing the deduplicated data to disk. Each backup data stream occupies one thread until the stream has finished backing up. During backup, the three main sub-processes of deduplication (chunking, fingerprint calculation, and duplicate query) consume CPU resources, whereas the sub-process of writing deduplicated data to disk consumes almost none. However, that disk-writing sub-process accounts for most of the time of the whole backup flow, and during it the CPU is almost idle. Therefore, the approach of occupying one thread per backup stream until the stream completes makes threads wait a long time for I/O to return, underutilizes the CPU, and prevents deduplication performance from improving. The problem is particularly acute under highly concurrent backup: concurrency becomes insufficient, overall backup performance suffers, and backup performance cannot scale linearly as expected.
On this basis, the embodiments of the present invention provide a new data processing method and deduplication engine in which the duplicate-checking task, which mainly consumes CPU resources, occupies at least one thread of its own. The CPU can thus be used fully and efficiently, concurrency between tasks is enhanced, and data deduplication performance is improved.
Fig. 1 is a flowchart of a data processing method according to an embodiment of the present invention.
The embodiment is applied to a data backup system that includes a deduplication engine and a storage node, where the data backup system establishes at least two threads in a thread pool in advance. The data processing method may include:
Step 101: the deduplication engine divides a data duplicate-checking task into at least one subtask.
The deduplication engine may divide the duplicate-checking task into three subtasks, which are, in order, a data chunking subtask, a fingerprint calculation subtask, and a duplicate-block query subtask.
Here, data chunking means partitioning the data into one or more data blocks; fingerprint calculation means computing a fingerprint for each of the one or more data blocks to obtain its fingerprint information; and duplicate checking means querying each storage node for fingerprint information identical to that of each data block. The chunking, fingerprint calculation, and duplicate-checking methods in the embodiments of the present invention are the same as the corresponding prior-art methods and are not repeated here. This step need not be repeated every time data is processed: after the duplicate-checking task has been divided into the above three subtasks, the subtasks are simply executed in order on each piece of data to be processed.
Of course, in other embodiments the deduplication engine may divide the duplicate-checking task into other subtasks; for example, the division may also include, without being limited to the above three subtasks, a network interconnection subtask and the like.
Step 102: the deduplication engine invokes at least one thread in the thread pool to execute, on the data of a first file, the subtasks included in the duplicate-checking task, so as to query whether the storage node contains duplicates of the data of the first file.
When there is an idle thread in the thread pool, the deduplication engine may invoke at least one thread to perform the duplicate-checking task on the data of a first file that needs to be processed; "the data of the first file" is merely an example and is neither limiting nor specific.
The subtasks of the duplicate-checking task may occupy one thread or several threads. For example, the chunking, fingerprint calculation, and duplicate query subtasks may together occupy a single thread; or the chunking and fingerprint calculation subtasks may occupy one thread while the duplicate query subtask occupies another; or the chunking, fingerprint calculation, and duplicate query subtasks may each occupy its own thread. For the same piece of backup data, chunking, fingerprint calculation, duplicate query, and storage must be executed in order; for different pieces of backup data, these subtasks may be executed in parallel.
In another embodiment, if the chunking, fingerprint calculation, and duplicate query subtasks each occupy a separate thread, the invoking of at least one thread in the thread pool to execute the subtasks of the duplicate-checking task on the data of the first file may specifically include:
executing, by a first thread, the data chunking subtask on the input data of the first file and outputting the processed data to a second thread; executing, by the second thread, the fingerprint calculation subtask on the data processed by the first thread and outputting the processed data to a third thread; and executing, by the third thread, the duplicate-block query subtask on the data processed by the second thread, where the first thread, the second thread, and the third thread are each any thread in the thread pool.
When executing a particular subtask, the deduplication engine may specifically perform the following process:
the deduplication engine generates the subtask and adds it to the task queue corresponding to that subtask;
when the subtask in the task queue is scheduled for execution, a thread in the thread pool is invoked to execute the subtask, and the thread occupied by the subtask is released after the subtask completes;
where the subtask may be any one of the data chunking subtask, the fingerprint calculation subtask, and the duplicate-block query subtask.
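The generate, enqueue, execute, release cycle can be sketched with Python's `concurrent.futures.ThreadPoolExecutor`, which maintains its own work queue and returns each worker to the pool as soon as a submitted task finishes. The subtask bodies below are placeholders, not the patent's algorithms.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

# The executor's internal queue plays the role of the per-subtask task
# queues: a worker thread is occupied only while one subtask runs and
# is returned to the pool when that subtask completes.
pool = ThreadPoolExecutor(max_workers=2)   # "at least two threads"

def chunking_subtask(data: bytes):
    # Placeholder chunking: fixed 4-byte blocks.
    return [data[i:i + 4] for i in range(0, len(data), 4)]

def fingerprint_subtask(blocks):
    # Placeholder fingerprinting with MD5.
    return [hashlib.md5(b).hexdigest() for b in blocks]

blocks = pool.submit(chunking_subtask, b"hello world!").result()
fps = pool.submit(fingerprint_subtask, blocks).result()
pool.shutdown()
print(len(blocks), len(fps))   # 3 3
```

A real engine would keep one queue per subtask type and feed many files' subtasks through the same pool, so that no thread ever idles waiting on storage I/O.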
After the duplicate-checking task has been performed on the data of the first file, that is, after querying whether the storage node contains duplicates of the data of the first file, step 103 is performed regardless of the result, and step 104 is performed when no duplicate of the data of the first file is found.
Step 103: release the at least one invoked thread.
After all subtasks of the duplicate-checking task complete, the threads they occupied are released, regardless of whether duplicates of the data of the first file were found on any storage node. Because duplicate checking is the main consumer of CPU resources and the CPU is almost idle during the subsequent storage process, keeping the threads occupied by duplicate checking independent of the thread occupied by the subsequent storage subtask allows the thread released after duplicate-checking the first file's data to be reused when backing up the data of a second file, in particular when duplicate-checking the second file's data. Idle CPU time is thereby avoided, and the CPU is used fully and efficiently.
Step 104: when no duplicate of the data of the first file is found, send the data of the first file to the storage node for storage.
When no duplicate of the first file's data is found, for example when no storage node holds fingerprint information identical to that of any data block of the first file, or when only some of the first file's data blocks have matching fingerprint information, the unmatched data blocks of the first file and their fingerprint information are sent to the storage node for storage. If fingerprint information matching all data blocks of the first file is found, the data of the first file is not stored again; instead, the reference counts of the already-stored duplicates of the first file's data are simply updated.
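The branch above, store a new block versus bump the reference count of an existing one, can be sketched as a toy single-node store. The class name, its interface, and the SHA-256 fingerprint are hypothetical, not the patent's metadata format.

```python
import hashlib

class DedupStore:
    """Toy single-node store mapping fingerprint -> (refcount, block)."""
    def __init__(self):
        self.blocks = {}

    def put(self, block: bytes) -> bool:
        """Store a block; return True if it was new, False if duplicate."""
        fp = hashlib.sha256(block).hexdigest()
        if fp in self.blocks:
            count, data = self.blocks[fp]
            self.blocks[fp] = (count + 1, data)  # duplicate: bump refcount only
            return False                          # nothing new stored
        self.blocks[fp] = (1, block)              # new block: store it
        return True

store = DedupStore()
stored = [store.put(b) for b in (b"aaaa", b"bbbb", b"aaaa")]
print(stored)  # [True, True, False]: the third block is a duplicate
```

Only the first occurrence of each block consumes storage; later occurrences cost one reference-count update in metadata.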
Steps 105 to 106 are performed when the storage node stores the data.
Step 105: the storage node invokes a thread in the thread pool to store the data of the first file.
In the specific storage process, when there is an idle thread in the thread pool, a thread may be invoked to store the data blocks of the first file that exist on no storage node, together with their fingerprint information.
Step 106: release the thread invoked by the storage node after the storing is completed.
In the embodiments of the present invention, the duplicate-checking task, which mainly consumes CPU resources, and the data storage task, which consumes almost none, are executed by separate threads. The thread occupied by duplicate-checking the first file's data can thus be released once that task completes; while another thread executes the storage task for the first file's data, the released thread can be reused to back up other data, in particular to duplicate-check it. Idle CPU time and long waits for I/O to return are avoided, the CPU is used fully and efficiently, and data deduplication performance is improved.
Fig. 2 is a flowchart of another data processing method according to an embodiment of the present invention.
The method may include:
Step 201: write the data of the first file into a buffer, and chunk the data of the first file to obtain at least one first data block.
In the embodiments of the present invention, a thread pool may be established in the system in advance; before there are pending tasks, all threads are idle and wait to be scheduled. The data backup process is divided into several independent subtasks executed by multiple threads, with one queue per subtask, namely data chunking, fingerprint calculation, duplicate query, and storage. The duplicate-checking task comprises the chunking, fingerprint calculation, and duplicate query subtasks; the data storage task comprises the storage subtask.
When the backup task for the first file's data starts, the data of the first file is first written into a buffer. After the write succeeds, the following steps are performed, as shown in Fig. 3:
Step 301: generate a first chunking task for the first file's data, and add the first chunking task to the chunking task queue.
Step 302: when the first chunking task in the chunking task queue is scheduled for execution, the first chunking task occupies a thread, the first file's data is chunked to obtain at least one first data block, and the thread occupied by the first chunking task is released.
For example, the first file's data to be backed up may first be organized into chunks of at most 9 MB, and step 302 is then performed on each such chunk, dividing it into first data blocks of 4-12 KB. Specifically, an existing fixed-length or variable-length chunking algorithm may be used. Of course, the values "9 MB" and "4-12 KB" are merely examples, and the specific sizes may be set as required. After the chunking completes, or after the thread occupied by the first chunking task is released, step 202 is performed.
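A sketch of the two-level split described above: first into chunks of at most 9 MB, then into fixed-length blocks. The 8 KiB block size sits inside the 4-12 KB range the text mentions, and the helper names are illustrative; a real system might use content-defined (variable-length) chunking instead.

```python
def fixed_length_chunks(data: bytes, block_size: int = 8 * 1024):
    # Fixed-length chunking into 8 KiB blocks (within the 4-12 KB range).
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

CHUNK_LIMIT = 9 * 1024 * 1024   # the 9 MB grouping from the example

def split_file(data: bytes):
    # First group into <= 9 MB chunks, then block each chunk.
    blocks = []
    for start in range(0, len(data), CHUNK_LIMIT):
        blocks.extend(fixed_length_chunks(data[start:start + CHUNK_LIMIT]))
    return blocks

blocks = split_file(b"x" * (20 * 1024))      # a 20 KiB file
print([len(b) for b in blocks])              # [8192, 8192, 4096]
```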
Step 202: perform fingerprint calculation on the at least one first data block to obtain its fingerprint information.
Fingerprint calculation is performed on each first data block to obtain its fingerprint information. As shown in Fig. 4, this process may include:
Step 401: generate a first fingerprint calculation task for the at least one first data block, and add the first fingerprint calculation task to the fingerprint calculation task queue.
Step 402: when the first fingerprint calculation task in the fingerprint calculation task queue is scheduled for execution, the first fingerprint calculation task occupies a thread, fingerprint calculation is performed on the at least one first data block to obtain its fingerprint information, and the thread occupied by the first fingerprint calculation task is released.
Specifically, a strong hash algorithm such as SHA-1 or MD5 may be used to calculate the fingerprints of the one or more first data blocks. After the calculation completes, or after the thread occupied by the first fingerprint calculation task is released, step 203 is performed.
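The fingerprint step amounts to applying a strong hash per block, for example via Python's `hashlib`. SHA-1 is shown because the text names it; any collision-resistant hash serves, and the helper name is illustrative.

```python
import hashlib

def block_fingerprint(block: bytes, algo: str = "sha1") -> str:
    # One strong hash per block; equal blocks yield equal fingerprints.
    return hashlib.new(algo, block).hexdigest()

fp1 = block_fingerprint(b"same block")
fp2 = block_fingerprint(b"same block")
fp3 = block_fingerprint(b"other block")
print(fp1 == fp2, fp1 == fp3)   # True False
```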
Step 203: query each storage node for fingerprint information that duplicates the fingerprint information of at least one first data block.
Each storage node is queried for fingerprint information identical to that of the one or more first data blocks. As shown in Fig. 5, this process further includes:
Step 501: generate a first duplicate-block query task for the fingerprint information of the at least one first data block, and add the first duplicate-block query task to the query task queue.
Step 502: when the first duplicate-block query task in the query task queue is scheduled for execution, the first duplicate-block query task occupies a thread and queries each node for fingerprint information that duplicates the fingerprint information of at least one first data block, and the thread occupied by the first duplicate-block query task is released after the query completes.
In the duplicate-checking process above, the following procedure may also be used so that duplicate fingerprint information can be found quickly:
First, when the cumulative size of consecutive first data blocks most closely approaches or reaches 1 MB, the current run of consecutive first data blocks is taken to form a segment, and the minimum fingerprint among the fingerprint information of the segment's first data blocks is selected as the segment id of that segment. Subsequent first data blocks are divided in the same way in turn; this procedure is called segment division.
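The segment division can be sketched as follows: accumulate consecutive blocks until roughly 1 MB, then take the minimum fingerprint as the segment id (a MinHash-style sampling). The function name and the SHA-1 choice are illustrative.

```python
import hashlib

SEGMENT_LIMIT = 1024 * 1024   # ~1 MB per segment, as in the example

def segment_ids(blocks):
    """Group consecutive blocks into ~1 MB segments; each segment's id
    is the minimum fingerprint among its blocks."""
    ids, fps, size = [], [], 0
    for block in blocks:
        fps.append(hashlib.sha1(block).hexdigest())
        size += len(block)
        if size >= SEGMENT_LIMIT:
            ids.append(min(fps))
            fps, size = [], 0
    if fps:                       # trailing partial segment
        ids.append(min(fps))
    return ids

# 9 MB of 8 KiB blocks -> 9 segments, one id each
blocks = [bytes([i % 256]) * 8192 for i in range(9 * 128)]
print(len(segment_ids(blocks)))   # 9
```

Because the minimum fingerprint of a segment is stable under small changes to the block set, two mostly-identical segments are likely to share a segment id, which is what makes the container similarity analysis below it effective.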
Under this segment division, 9MB of incoming backup data yields 9 segment ids. (After deduplication, the single-instance data stored on disk may likewise be organized into containers of up to 9MB, each container also holding 9 segment ids. Of course, the container size is not fixed at 9MB and may be set according to actual conditions.)
During the duplicate check of step 502, the segment ids of the incoming 9MB of backup data are compared for similarity against the full set of segment ids stored in each node's containers; the 6 containers with the highest number of segment-id hits are selected, and the duplicate-block query is performed within those 6 containers. If no node has a matching container, no container is loaded for the check, and the duplicate check is performed only within the 9MB (or other-sized) data itself.
Neither the number of segments in the division above nor the number of containers selected (6 above) is a fixed value; both can be adjusted for a concrete implementation. The number of containers may be reduced or increased, provided it does not exceed the number of segments. The main purpose of this method is to narrow the duplicate-checking scope down to an acceptable range. The segment ids can of course be dropped altogether in favor of a global duplicate check over the fingerprint information of all data blocks.
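The container-selection step described above (rank containers by segment-id hits, keep at most 6 that have at least one hit) might look like the following sketch; the in-memory data layout is an assumption made for illustration:

```python
def select_containers(incoming_segids, containers, top_k=6):
    """Rank containers by how many of the incoming segment ids they hit,
    keeping at most top_k containers with at least one hit (illustrative).

    containers: dict mapping container id -> iterable of segment ids."""
    incoming = set(incoming_segids)
    scored = [(len(incoming & set(segids)), cid)
              for cid, segids in containers.items()]
    scored = [(hits, cid) for hits, cid in scored if hits > 0]
    scored.sort(reverse=True)  # most hits first
    return [cid for hits, cid in scored[:top_k]]

containers = {"c1": ["a", "b"], "c2": ["x"], "c3": ["a", "b", "c"]}
best = select_containers(["a", "b", "c"], containers, top_k=2)
```

A container with no hits is never loaded, matching the fallback in the text where the check runs only within the incoming data.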
If a duplicate fingerprint of a first data block is found at some storage node, the reference count of the stored duplicate fingerprint information is simply updated. For fingerprint information of first data blocks for which no duplicate is found, step 204 is performed.
Step 204: store the first data blocks absent from every storage node, together with their fingerprint information, to complete the backup.
Specifically, a first store task may be generated and added to the store task queue. When the first store task in the queue is scheduled for execution, it occupies one thread, stores the first data blocks that exist on no node together with their fingerprint information, and may update the corresponding file metadata information. Once storage completes, the thread occupied by the first store task is released and the backup is complete.
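The occupy-a-thread-then-release pattern for queued store tasks can be sketched with a small worker pool; this single-node, in-memory model is purely illustrative and omits fingerprint indexing and metadata updates:

```python
import queue
import threading

task_queue = queue.Queue()
stored = {}                      # fingerprint -> block (stand-in for a node)
lock = threading.Lock()

def store_task(block_fp, block):
    """Store a block absent from every node (illustrative single-node sketch)."""
    with lock:
        stored.setdefault(block_fp, block)

def worker():
    while True:
        task = task_queue.get()
        if task is None:          # sentinel: shut this worker down
            task_queue.task_done()
            break
        store_task(*task)
        task_queue.task_done()    # thread is now free for the next queued task

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for i in range(5):
    task_queue.put((f"fp{i}", b"block"))
for _ in threads:                 # one sentinel per worker
    task_queue.put(None)
task_queue.join()
```

The key property mirrored here is that a worker thread is tied up only while a task actually runs; between tasks it is available to whatever is queued next.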
With this method, a thread never waits a long time for an IO return during data backup: while one piece of data occupies a thread executing a store task, another thread can be invoked for another piece of data to execute a chunking, fingerprint calculation, or duplicate-query task. The CPU is thus fully and efficiently utilized, and deduplication performance improves. With multiple concurrent backup streams, whichever stream can proceed does so as fast as possible, improving concurrent backup performance.
For example, suppose several users back up files within the same time period, so that backup tasks sit in different deduplication stages: file A is being chunked, files B and D are being written to disk as single-instance data, file C is having fingerprints calculated, and files E and F are waiting to start. With the prior-art method, the threads occupied by files B and D idle while waiting for the disk-write IO to return, and their CPU resources cannot be released to other tasks that need computation, so file E cannot start until one of the tasks A to D finishes. With the method of this embodiment of the present invention, once files B and D start their disk writes the IO operations are handed off to the file system, the threads occupied by B and D are released, and files E and F, which urgently need chunking and fingerprint calculation, can start backing up immediately. The whole system thereby works as a pipeline, throughput rises, and backup performance improves.
The above describes the method embodiments of the present invention; the apparatus implementing the method is introduced below.
Figure 6 is a schematic structural diagram of a deduplication engine according to an embodiment of the present invention.
The deduplication engine is applied to a data backup system comprising the deduplication engine and a storage node, where the data backup system establishes at least two threads in a thread pool in advance. The deduplication engine may comprise:
a division unit 601, configured to divide a data duplicate-checking task into at least one subtask;
a thread processing unit 602, configured to invoke at least one thread in the thread pool, and to release the invoked thread(s) after a duplicate-checking unit 603 has executed the subtasks comprised in the data duplicate-checking task;
the duplicate-checking unit 603, configured to occupy a thread invoked by the thread processing unit 602 and execute, on the data of a first file, the subtasks comprised in the data duplicate-checking task, so as to query whether the storage node holds duplicate data of the data of the first file;
a data sending unit 604, configured to send the data of the first file to the storage node for storage when the duplicate-checking unit 603 finds no duplicate of the data of the first file.
After the division unit 601 divides the data duplicate-checking task into at least one subtask, the thread processing unit 602 invokes at least one thread in the thread pool; the duplicate-checking unit 603 occupies the invoked thread(s) and executes, on the data of the first file, the subtasks comprised in the data duplicate-checking task, to query whether the storage node holds duplicate data of the data of the first file; after the duplicate-checking unit 603 has executed those subtasks, the thread processing unit 602 releases the invoked thread(s); and when the duplicate-checking unit 603 finds no duplicate of the data of the first file, the data sending unit 604 sends the data of the first file to the storage node for storage.
Through the above units, the embodiment of the present invention lets the data duplicate-checking task, which mainly consumes CPU resources, occupy independent threads. During data backup, a thread therefore never waits a long time for an IO return: while one piece of data occupies a thread executing a second-stage task, another thread can be occupied for a first-stage task of another piece of data. The CPU is thus fully and efficiently utilized, and deduplication performance improves.
In another embodiment of the present invention, the division unit is specifically configured to divide the data duplicate-checking task into three subtasks, in order: a data chunking subtask, a fingerprint calculation subtask, and a duplicate-block query subtask. The duplicate-checking unit is specifically configured to execute the data chunking subtask on the data of the first file input to a first thread and output the processed data to a second thread; the second thread executes the fingerprint calculation subtask on the data processed by the first thread and outputs the processed data to a third thread; and the third thread executes the duplicate-block query subtask on the data processed by the second thread, where the first thread, the second thread, and the third thread are each any thread in the thread pool.
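The three-subtask hand-off (chunking, then fingerprinting, then duplicate query) can be sketched as three chained stages. Here they run sequentially for clarity, whereas in the engine each stage would occupy its own pool thread; all names and the fixed-size chunker are our assumptions:

```python
import hashlib

# Stand-in for fingerprints already stored on the storage nodes.
known_fps = {hashlib.sha1(b"old block").hexdigest()}

def chunk_stage(data: bytes, size: int = 8):
    """First thread: split the file data into blocks."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def fingerprint_stage(blocks):
    """Second thread: compute a strong-hash fingerprint per block."""
    return [(hashlib.sha1(b).hexdigest(), b) for b in blocks]

def query_stage(fp_blocks):
    """Third thread: keep only blocks whose fingerprints are not yet stored."""
    return [(fp, b) for fp, b in fp_blocks if fp not in known_fps]

new_blocks = query_stage(fingerprint_stage(chunk_stage(b"fresh data!")))
```

Each stage's output is the next stage's input, which is exactly what lets the engine pipeline several files through the pool concurrently.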
In another embodiment, as shown in Figure 7, the duplicate-checking unit 603 may further comprise:
a generating subunit 701, configured to generate a subtask and add the subtask to the task queue corresponding to that subtask;
an executing subunit 702, configured to, when the subtask in the task queue is scheduled for execution, occupy a thread of the thread pool invoked by the thread processing unit to execute the subtask;
the thread processing unit being specifically configured to release the thread occupied by the subtask after the subtask completes;
where the subtask is any one of the data chunking subtask, the fingerprint calculation subtask, and the duplicate-block query subtask.
An embodiment of the present invention further provides another deduplication engine, applied to a data backup system comprising the deduplication engine and a storage node, where the data backup system establishes at least two threads in a thread pool in advance. The deduplication engine comprises a processor and a memory.
The memory stores a segment of program code; the processor is configured to read the program code and perform the following steps:
dividing a data duplicate-checking task into at least one subtask;
invoking at least one thread in the thread pool and executing, on the data of a first file, the subtasks comprised in the data duplicate-checking task, so as to query whether the storage node holds duplicate data of the data of the first file;
releasing the invoked thread(s);
and, when no duplicate of the data of the first file is found, sending the data of the first file to the storage node for storage.
An embodiment of the present invention further provides a data backup system comprising a deduplication engine and a storage node, where the data backup system establishes at least two threads in a thread pool in advance.
The deduplication engine is configured to divide a data duplicate-checking task into at least one subtask; invoke at least one thread in the thread pool and execute, on the data of a first file, the subtasks comprised in the data duplicate-checking task, so as to query whether the storage node holds duplicate data of the data of the first file; release the invoked thread(s); and, when no duplicate of the data of the first file is found, send the data of the first file to the storage node for storage.
In another embodiment, the storage node is configured to invoke a thread in the thread pool to store the data of the first file, and to release the invoked thread after storage finishes.
Those of ordinary skill in the art will recognize that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the particular application and the design constraints of the technical solution. Skilled artisans may implement the described functions differently for each particular application, but such implementations should not be considered beyond the scope of the present invention.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely schematic: the division of the units is only a logical functional division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. Furthermore, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units: they may be located in one place or distributed across multiple network elements. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, may each exist physically on their own, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. On that understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium and comprises instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes any medium that can store program code, such as a USB flash disk, a portable hard drive, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above are only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement that a person familiar with the art can readily conceive within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be determined by the protection scope of the claims.

Claims (10)

1. A data processing method, characterized in that the method is applied to a data backup system, the data backup system comprising a deduplication engine and a storage node, wherein the data backup system establishes at least two threads in a thread pool in advance, and the deduplication engine divides a data duplicate-checking task into at least one subtask; the method comprising:
invoking, by the deduplication engine, at least one thread in the thread pool and executing, on the data of a first file, the subtasks comprised in the data duplicate-checking task, so as to query whether the storage node holds duplicate data of the data of the first file;
releasing the invoked thread(s);
and, when no duplicate of the data of the first file is found, sending the data of the first file to the storage node for storage.
2. The method according to claim 1, characterized in that the method further comprises:
storing, by the storage node, the data of the first file with a thread invoked from the thread pool;
releasing the thread invoked by the storage node after storage finishes.
3. The method according to claim 1 or 2, characterized in that the dividing, by the deduplication engine, of the data duplicate-checking task into at least one subtask comprises:
dividing, by the deduplication engine, the data duplicate-checking task into three subtasks, in order: a data chunking subtask, a fingerprint calculation subtask, and a duplicate-block query subtask;
and the invoking, by the deduplication engine, of at least one thread in the thread pool and executing, on the data of the first file, the subtasks comprised in the data duplicate-checking task comprises:
executing, by the deduplication engine, the data chunking subtask on the data of the first file input to a first thread and outputting the processed data to a second thread; executing, by the second thread, the fingerprint calculation subtask on the data processed by the first thread and outputting the processed data to a third thread; and executing, by the third thread, the duplicate-block query subtask on the data processed by the second thread; wherein the first thread, the second thread, and the third thread are each any thread in the thread pool.
4. The method according to claim 3, characterized in that the process of the deduplication engine invoking one thread in the thread pool and executing, on the data of the first file, one subtask of the data duplicate-checking task comprises:
generating, by the deduplication engine, the subtask and adding the subtask to the task queue corresponding to the subtask;
when the subtask in the task queue is scheduled for execution, invoking one thread in the thread pool to execute the subtask, and releasing the thread occupied by the subtask after the subtask completes;
wherein the subtask is any one of the data chunking subtask, the fingerprint calculation subtask, and the duplicate-block query subtask.
5. A deduplication engine, characterized in that the deduplication engine is applied to a data backup system, the data backup system comprising the deduplication engine and a storage node, wherein the data backup system establishes at least two threads in a thread pool in advance; the deduplication engine comprising:
a division unit, configured to divide a data duplicate-checking task into at least one subtask;
a thread processing unit, configured to invoke at least one thread in the thread pool, and to release the invoked thread(s) after a duplicate-checking unit has executed the subtasks comprised in the data duplicate-checking task;
the duplicate-checking unit, configured to occupy a thread invoked by the thread processing unit and execute, on the data of a first file, the subtasks comprised in the data duplicate-checking task, so as to query whether the storage node holds duplicate data of the data of the first file;
a data sending unit, configured to send the data of the first file to the storage node for storage when the duplicate-checking unit finds no duplicate of the data of the first file.
6. The deduplication engine according to claim 5, characterized in that:
the division unit is specifically configured to divide the data duplicate-checking task into three subtasks, in order: a data chunking subtask, a fingerprint calculation subtask, and a duplicate-block query subtask;
the duplicate-checking unit is specifically configured to execute the data chunking subtask on the data of the first file input to a first thread and output the processed data to a second thread; the second thread executes the fingerprint calculation subtask on the data processed by the first thread and outputs the processed data to a third thread; and the third thread executes the duplicate-block query subtask on the data processed by the second thread; wherein the first thread, the second thread, and the third thread are each any thread in the thread pool.
7. The deduplication engine according to claim 6, characterized in that the duplicate-checking unit comprises:
a generating subunit, configured to generate a subtask and add the subtask to the task queue corresponding to that subtask;
an executing subunit, configured to, when the subtask in the task queue is scheduled for execution, occupy a thread of the thread pool invoked by the thread processing unit to execute the subtask;
the thread processing unit being specifically configured to release the thread occupied by the subtask after the subtask completes;
wherein the subtask is any one of the data chunking subtask, the fingerprint calculation subtask, and the duplicate-block query subtask.
8. A data backup system, characterized by comprising a deduplication engine and a storage node, wherein the data backup system establishes at least two threads in a thread pool in advance;
the deduplication engine being configured to divide a data duplicate-checking task into at least one subtask; invoke at least one thread in the thread pool and execute, on the data of a first file, the subtasks comprised in the data duplicate-checking task, so as to query whether the storage node holds duplicate data of the data of the first file; release the invoked thread(s); and, when no duplicate of the data of the first file is found, send the data of the first file to the storage node for storage.
9. The system according to claim 8, characterized in that:
the storage node is configured to invoke a thread in the thread pool to store the data of the first file, and to release the invoked thread after storage finishes.
10. A deduplication engine, characterized in that the deduplication engine is applied to a data backup system, the data backup system comprising the deduplication engine and a storage node, wherein the data backup system establishes at least two threads in a thread pool in advance; the deduplication engine comprising a processor and a memory;
the memory storing a segment of program code, and the processor being configured to read the program code in the memory and perform the following steps:
dividing a data duplicate-checking task into at least one subtask;
invoking at least one thread in the thread pool and executing, on the data of a first file, the subtasks comprised in the data duplicate-checking task, so as to query whether the storage node holds duplicate data of the data of the first file;
releasing the invoked thread(s);
and, when no duplicate of the data of the first file is found, sending the data of the first file to the storage node for storage.
CN201310425874.7A 2013-09-17 2013-09-17 Data processing method and deduplication engine Pending CN103488734A (en)

Publications (1)

CN103488734A, publication date 2014-01-01

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105447052A (en) * 2014-09-24 2016-03-30 阿里巴巴集团控股有限公司 Data processing method and system
CN106325991A (en) * 2016-08-19 2017-01-11 东软集团股份有限公司 Instruction scheduling method and device for process engine
CN110609807A (en) * 2018-06-15 2019-12-24 伊姆西Ip控股有限责任公司 Method, apparatus, and computer-readable storage medium for deleting snapshot data
CN111159236A (en) * 2019-12-23 2020-05-15 五八有限公司 Data processing method and device, electronic equipment and storage medium
CN111338787A (en) * 2020-02-04 2020-06-26 浙江大华技术股份有限公司 Data processing method and device, storage medium and electronic device
CN113050892A (en) * 2021-03-26 2021-06-29 杭州宏杉科技股份有限公司 Method and device for protecting deduplication data

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6820081B1 (en) * 2001-03-19 2004-11-16 Attenex Corporation System and method for evaluating a structured message store for message redundancy
CN102156703A (en) * 2011-01-24 2011-08-17 南开大学 Low-power consumption high-performance repeating data deleting system
US20120066209A1 (en) * 2010-09-10 2012-03-15 International Business Machines Corporation Electronic mail duplicate detection



Legal Events

C06 / PB01: Publication (application publication date: 2014-01-01)
C10 / SE01: Entry into substantive examination
RJ01: Rejection of invention patent application after publication