WO2015010555A1 - Data blocking method and device - Google Patents

Data blocking method and device

Info

Publication number
WO2015010555A1
WO2015010555A1 PCT/CN2014/082237 CN2014082237W WO2015010555A1 WO 2015010555 A1 WO2015010555 A1 WO 2015010555A1 CN 2014082237 W CN2014082237 W CN 2014082237W WO 2015010555 A1 WO2015010555 A1 WO 2015010555A1
Authority
WO
WIPO (PCT)
Prior art keywords
data block
block
data stream
current data
fingerprint
Prior art date
Application number
PCT/CN2014/082237
Other languages
French (fr)
Chinese (zh)
Inventor
吴俊�
张亮
郭凯
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2015010555A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L1/00 Arrangements for detecting or preventing errors in the information received
    • H04L1/0078 Avoidance of errors by organising the transmitted data in a format specifically designed to deal with errors, e.g. location
    • H04L1/0083 Formatting with frames or packets; Protocol or part of protocol for error control

Definitions

  • The present invention relates to communications technologies, and in particular, to a data blocking method and apparatus.
  • Background
  • Redundant Traffic Elimination (RTE)
  • Existing RTE technology usually uses a method based on Modular Exponential (MODP) to divide the data stream into blocks according to its content.
  • Modular Exponential (MODP)
  • The MODP-based blocking method is implemented with a sliding window. For a given modulus value p (for example, 10010), when the fingerprint value of the sliding window (for example, 1111 0111 0010) modulo p is 0, the data between the end of the sliding window and the end of the previous data block is divided off as one data block.
  • The sliding window then moves toward the end of the data stream by one byte or by a custom length.
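The sliding-window procedure described above can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the polynomial hash in `window_fingerprint`, the one-byte step, and the function names are all assumptions introduced here.

```python
def window_fingerprint(window: bytes) -> int:
    """Toy fingerprint of the window contents (an assumed polynomial hash)."""
    value = 0
    for byte in window:
        value = (value * 257 + byte) % (1 << 32)
    return value


def chunk_boundaries(data: bytes, window_len: int, modulus: int) -> list[int]:
    """Return the end offsets (exclusive) of the blocks cut from data.

    A block ends wherever the fingerprint of the window ending at that
    offset is 0 modulo the modulus; the window advances one byte at a time.
    """
    boundaries = []
    for end in range(window_len, len(data) + 1):
        if window_fingerprint(data[end - window_len:end]) % modulus == 0:
            boundaries.append(end)
    if not boundaries or boundaries[-1] != len(data):
        boundaries.append(len(data))  # trailing data forms the last block
    return boundaries


def split_at(data: bytes, boundaries: list[int]) -> list[bytes]:
    """Cut data into blocks at the given end offsets."""
    blocks, start = [], 0
    for end in boundaries:
        blocks.append(data[start:end])
        start = end
    return blocks
```

Because the cut points depend only on the window contents, inserting bytes early in a stream shifts later boundaries along with the data rather than re-cutting every block.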
  • The fingerprint value of each block is calculated and compared with the fingerprint values of the data blocks stored in the fingerprint dictionary. If the same fingerprint value is found, redundant data content has been detected, and the block is replaced with an identifier in the new data stream to compress the traffic. If the same fingerprint value is not found, the data block and its fingerprint are stored for redundancy detection of subsequent blocks.
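A rough sketch of this dictionary lookup is given below. The use of SHA-1 as the block fingerprint and the `("id", ...)` / `("data", ...)` tuples standing in for the new data stream are assumptions made for illustration; the text does not specify either.

```python
import hashlib


def fingerprint(block: bytes) -> bytes:
    """Fingerprint of a block (SHA-1 is an assumption for this sketch)."""
    return hashlib.sha1(block).digest()


def deduplicate(blocks: list[bytes], fingerprint_dict: dict[bytes, bytes]) -> list[tuple]:
    """Replace blocks whose fingerprint is already known with an identifier."""
    new_stream = []
    for block in blocks:
        fp = fingerprint(block)
        if fp in fingerprint_dict:
            new_stream.append(("id", fp))       # redundant content: send identifier only
        else:
            fingerprint_dict[fp] = block        # remember block for later detection
            new_stream.append(("data", block))  # first occurrence: send the data
    return new_stream
```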
  • The length of the identifier used to replace a data block is fixed, so the longer the replaced data block, the more bandwidth is saved during transmission.
  • However, large data blocks make redundancy detection coarse-grained: identical large blocks rarely repeat, which results in a low redundancy rate.
  • The transmitting end uses the MODP-based blocking method to divide the original data stream into blocks, obtaining the data block sequence 101 (blocks SA, SB, SC, SD, SE, and SF), and replaces each block with its corresponding identifier. For example,
  • the block SA is replaced with an identifier R, resulting in an identifier sequence 102 (referred to as a layer 1 identifier): R , R , R , R and .
  • the identifier sequence 102 is regarded as a general data stream, and the identifier sequence 102 is again partitioned using the MODP-based blocking method, and two data points of the identifier data are obtained.
  • the block is also replaced by a 2-layer identifier sequence 103: R 2 nR 2 2 .
  • the content is the replacement of the identifiers R and
  • the content of the layer 2 identifier is the identifier! ⁇ And! ⁇ Replace with the 2-layer identifier R 2 2 .
  • the layer 2 identifier represents the data blocks SA and SB
  • R 2 2 represents the blocks SC, SD, SE and SF.
  • the data partition is replaced with the upper layer identifier as much as possible to save bandwidth.
  • an embodiment of the present invention provides a data blocking method, including:
  • The i-th modulus value is extended to the left by a preset number of bits to obtain the (i+1)-th modulus value, where i is any natural number greater than or equal to 1 and less than N, and N is the total number of modulus values.
  • n is any natural number greater than or equal to 1 and less than or equal to N:
  • When the fingerprint value of the current data block is not stored in the fingerprint dictionary, storing the current data block in a block dictionary, and storing the fingerprint value of the current data block and the index information of the current data block in the block dictionary in the fingerprint dictionary.
  • When n is greater than 1, writing the current data block into the space occupied by the new data stream so as to overwrite all (n-1)th-layer data blocks included in the current data block in the new data stream; when n is equal to 1, writing the current data block sequentially into the space occupied by the new data stream;
  • When the fingerprint value of the current data block is stored in the fingerprint dictionary and n is greater than 1, writing the fingerprint value of the current data block into the space occupied by the new data stream so as to overwrite the fingerprint values of all (n-1)th-layer data blocks included in the current data block in the new data stream; when the fingerprint value of the current data block is stored in the fingerprint dictionary and n is equal to 1, writing the fingerprint value of the current data block sequentially into the space occupied by the new data stream;
  • Storing the current data block in a block dictionary, and storing the fingerprint value of the current data block and the index information of the current data block in the block dictionary in the fingerprint dictionary, includes:
  • if the current data block includes (n-1)th-layer data blocks, the index information of the current data block in the block dictionary is determined according to the index information of the (n-1)th-layer data blocks included in the current data block;
  • if the current data block does not include an (n-1)th-layer data block, storing the current data block in the block dictionary, and storing the fingerprint value of the current data block and the index information of the current data block in the block dictionary in the fingerprint dictionary.
  • Before the current data block is written into the space occupied by the new data stream, and/or before the fingerprint value of the current data block is written into the space occupied by the new data stream, the method further includes:
  • the method further includes:
  • Finding the location in the new data stream of the (n-1)th-layer data blocks included in the current data block includes:
  • locating, in the location record table, the records of the (n-1)th-layer data blocks included in the current data block in the original data stream
  • the method further includes:
  • the embodiment of the present invention further provides a data blocking device, including:
  • An acquisition module, configured to acquire a fingerprint value of a sliding window on the original data stream, where the initial starting position of the sliding window is the same as the starting position of the original data stream, and the length of the sliding window is a preset length; a modulo module, configured to take the fingerprint value of the sliding window modulo each of the first modulus value to the Nth modulus value, where the i-th modulus value is extended to the left by a preset number of bits to obtain the (i+1)-th modulus value, i is any natural number greater than or equal to 1 and less than N, and N is the total number of modulus values;
  • a blocking module, configured to, for each modulus value, if the result obtained by the modulo module is zero, take the data in the original data stream from the end position of the previous nth-layer data block to the end position of the sliding window as the current data block, where the current data block is an nth-layer data block, and n is any natural number greater than or equal to 1 and less than or equal to N;
  • a writing module, configured to, when the fingerprint value of the current data block generated by the blocking module is not stored in the fingerprint dictionary, store the current data block in a block dictionary, and store the fingerprint value of the current data block and the index information of the current data block in the block dictionary in the fingerprint dictionary; when n is greater than 1, write the current data block into the space occupied by the new data stream so as to overwrite all (n-1)th-layer data blocks included in the current data block in the new data stream; and when n is equal to 1, write the current data block sequentially into the space occupied by the new data stream;
  • the writing module is further configured to: when the fingerprint value of the current data block is stored in the fingerprint dictionary and n is greater than 1, write the fingerprint value of the current data block into the space occupied by the new data stream so as to overwrite the fingerprint values of all (n-1)th-layer data blocks included in the current data block in the new data stream; and when the fingerprint value of the current data block is stored in the fingerprint dictionary and n is equal to 1, write the fingerprint value of the current data block sequentially into the space occupied by the new data stream;
  • a sliding module, configured to, after the writing module writes the current data block generated by the blocking module, or the fingerprint value of the current data block, into the space occupied by the new data stream, slide the sliding window by the preset length on the original data stream to obtain the fingerprint value of the sliding window on the original data stream, so that the modulo module, the writing module, and the sliding module repeat their operations until the starting position of the sliding window slides to the end of the original data stream.
  • When the current data block includes (n-1)th-layer data blocks, the index information of the current data block in the block dictionary is determined according to the index information of the (n-1)th-layer data blocks included in the current data block.
  • the apparatus further includes:
  • a positioning module, configured to find, in the new data stream, the location of the (n-1)th-layer data blocks included in the current data block before the writing module writes the current data block into the space occupied by the new data stream, and/or before the writing module writes the fingerprint value of the current data block into the space occupied by the new data stream.
  • the apparatus further includes:
  • a recording module, configured to, after the blocking module takes the data in the original data stream from the end position of the previous nth-layer data block to the end position of the sliding window as the current data block, record the sequence number of the current data block, the level of the current data block, the location of the current data block in the original data stream, and the location of the current data block in the new data stream, and save them in the location record table;
  • the positioning module is specifically configured to: locate, in the location record table, according to the location of the current data block in the original data stream and the locations of the (n-1)th-layer data blocks in the original data stream, the records of the (n-1)th-layer data blocks included in the current data block in the original data stream; and obtain, from the located records, the location in the new data stream of the (n-1)th-layer data blocks included in the current data block.
  • the apparatus further includes: a restoring module, configured to, in the process of parsing the new data stream into the original data stream, when a fingerprint value is found in the new data stream, search the fingerprint dictionary for the index information, in the block dictionary, of the data block corresponding to the fingerprint value;
  • the restoring module is further configured to: search the block dictionary, according to the index information, for the data block corresponding to the fingerprint value, and replace the fingerprint value with the found data block in the space occupied by the new data stream.
  • The fingerprint value of the same sliding window on the original data stream is taken modulo each of a set of modulus values, that is, the first modulus value to the Nth modulus value, and the original data stream is divided into blocks at different levels according to the modulo results for the different modulus values. Therefore, from the fingerprint value of the same sliding window, the original data stream can be divided into coarse-grained and fine-grained blocks at the same time.
  • For a generated data block, if the fingerprint value of the data block exists in the fingerprint dictionary, whether the fingerprint value of the data block is written sequentially into the space occupied by the new data stream or replaces the fingerprint values of the next-lower-layer data blocks included in the data block is determined according to the level of the data block.
  • Each layer of data blocks is generated directly on the original data stream, and each fingerprint value is the fingerprint value of a data block. It is therefore unnecessary to add an indication bit to a fingerprint value to indicate whether it is the fingerprint value of a data block or the fingerprint value of fingerprint values, which reduces the overhead of the fingerprint values of data blocks and improves the deduplication rate of the new data stream.
  • The generation of each layer of data blocks does not depend on the data blocks of other layers. Therefore, for the fingerprint value of the same sliding window, data blocks of different levels can be generated concurrently.
  • FIG. 1A is a flowchart of a data blocking method according to an embodiment of the present invention.
  • FIG. 1B is a schematic diagram of a sliding window according to an embodiment of the present invention.
  • FIG. 2 is a schematic diagram of a three-level block diagram according to an embodiment of the present invention.
  • FIG. 3 is a schematic structural diagram of a block dictionary and a fingerprint dictionary according to an embodiment of the present invention.
  • FIG. 4 is another schematic structural diagram of a block dictionary and a fingerprint dictionary according to an embodiment of the present invention.
  • FIG. 5 is a schematic structural diagram of a data blocking device according to an embodiment of the present disclosure.
  • FIG. 6 is a schematic structural diagram of another data blocking device according to an embodiment of the present invention.
  • Detailed description
  • The embodiments of the present invention can be applied before the sending end sends a data stream to the receiving end: the sending end divides the original data stream into blocks, forms a new data stream from the generated data blocks, and sends the new data stream to the receiving end, thereby saving transmission bandwidth and improving transmission efficiency.
  • FIG. 1 is a flowchart of a data blocking method according to an embodiment of the present invention.
  • The original data stream is divided into multiple layers of blocks using a set of modulus values: the first modulus value to the Nth modulus value.
  • N is the total number of modulus values
  • N is a preset value
  • N is a natural number greater than 1.
  • the method provided in this embodiment includes:
  • Step 11: Slide the sliding window from the starting position of the original data stream toward the end position of the original data stream. The initial starting position of the sliding window is the same as the starting position of the original data stream.
  • The length of the sliding window is a preset length.
  • Step 12: Obtain the fingerprint value of the sliding window on the original data stream, and take the fingerprint value of the sliding window modulo each of the first modulus value to the Nth modulus value. For each modulus value, if the result is zero, perform step 13; then perform step 14.
  • the relationship between the first modulo value and the Nth modulo value is as follows:
  • The first modulus value p1 is a preset value. Extending p1 to the left by a preset number of bits gives the second modulus value p2, and extending p2 to the left by the preset number of bits gives the third modulus value p3.
  • In general, the i-th modulus value is extended to the left by the preset number of bits to obtain the (i+1)-th modulus value, where i is any natural number greater than or equal to 1 and less than N, and N is the total number of modulus values.
  • In this way, the modulus values p1, p2, p3, p4, ..., pN, which have different numbers of bits, can be obtained.
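One way to read "extended to the left by a preset number of bits" is a left shift, which makes every modulus value a multiple of the previous one. That reading is an assumption of this sketch, as are the function name and the example values.

```python
def modulus_values(p1: int, extend_bits: int, count: int) -> list[int]:
    """Build p1..pN, each obtained from the previous one by a left shift.

    Interpreting the "extension to the left" as a left shift is an
    assumption; it guarantees that p(i) divides p(i+1).
    """
    values = [p1]
    for _ in range(count - 1):
        values.append(values[-1] << extend_bits)
    return values


moduli = modulus_values(16, 2, 3)
print(moduli)
```

Under this reading, a fingerprint that is 0 modulo p(i+1) is necessarily 0 modulo p(i), which matches the nesting of layers described later in the text.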
  • Step 14: Slide the sliding window by a preset length toward the end position of the original data stream, obtain the fingerprint value of the sliding window on the original data stream, and repeat steps 12 to 14 until the starting position of the sliding window slides to the end of the original data stream.
  • The sliding window is slid by a preset length at a time, from the starting position of the original data stream toward its end position.
  • After the fingerprint value of the current sliding window has been taken modulo each of the first to Nth modulus values and step 13 has been performed, the sliding window is slid by the preset length toward the end position of the original data stream.
  • the fingerprint value of the data of the original data stream in the sliding window is called the fingerprint value of the sliding window.
  • the fingerprint value of the data may be an identifier of the data, which is used to identify the data, and the fingerprint values of different data are different.
  • When the fingerprint value is taken modulo the first to Nth modulus values and step 13 is performed, the modulo operations may be carried out one at a time, in order from the first modulus value to the Nth modulus value, with step 13 performed whenever a result is zero.
  • Alternatively, the modulo operations for all N modulus values may be carried out simultaneously, and step 13 then performed, in order from the first modulus value to the Nth modulus value, for each modulus value whose result is zero.
  • If the fingerprint value of the sliding window modulo the nth modulus value is zero, step 13 is performed. Step 13 generates a data block and determines whether the data block or the fingerprint value of the data block is written into the space occupied by the new data stream. Step 13 includes the following operations.
  • Step 13 includes: taking the data in the original data stream from the end position of the previous nth-layer data block to the end position of the sliding window as the current data block, where the current data block is an nth-layer data block, and n is any natural number greater than or equal to 1 and less than or equal to N.
  • A data block is generated from the original data stream: the starting position of the current data block is the end position of the previous nth-layer data block in the original data stream, and the end position of the current data block is the end position of the sliding window.
  • The data block generated for the nth modulus value is called an nth-layer data block.
  • The number of bits by which a modulus value is extended can be customized, and the number of bits by which the current modulus value is extended affects the length of the data blocks of the next layer.
  • Step 13 further includes: when the fingerprint value of the current data block is not stored in the fingerprint dictionary, storing the current data block in the block dictionary, and storing the fingerprint value of the current data block and the index information of the current data block in the block dictionary in the fingerprint dictionary; if n is greater than 1, writing the current data block into the space occupied by the new data stream so as to overwrite all (n-1)th-layer data blocks included in the current data block in the new data stream; otherwise, writing the current data block sequentially into the space occupied by the new data stream.
  • the block dictionary is used to store data blocks generated on the original data stream
  • the fingerprint dictionary is used to store the fingerprint value of the data block stored in the block dictionary and the index information of the data block in the block dictionary.
  • the index information of the data block in the block dictionary may include the length of the data block and the location information of the data block in the block dictionary.
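The two dictionaries could be modelled along the following lines. This is only a sketch: the class and field names are hypothetical, and storing the block dictionary as one contiguous byte buffer addressed by offset and length is an assumption consistent with, but not dictated by, the description of the index information.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexInfo:
    """Index information of a block: where it sits in the block dictionary."""
    offset: int
    length: int


class Dictionaries:
    """Block dictionary (raw block bytes) plus fingerprint dictionary."""

    def __init__(self) -> None:
        self.block_dict = bytearray()
        self.fingerprint_dict: dict[bytes, IndexInfo] = {}

    def store(self, fp: bytes, block: bytes) -> IndexInfo:
        """Append the block and record its fingerprint and index information."""
        info = IndexInfo(offset=len(self.block_dict), length=len(block))
        self.block_dict += block
        self.fingerprint_dict[fp] = info
        return info

    def lookup(self, fp: bytes) -> bytes:
        """Fetch the block that a stored fingerprint refers to."""
        info = self.fingerprint_dict[fp]
        return bytes(self.block_dict[info.offset:info.offset + info.length])
```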
  • After a data block is generated from the original data stream, it is determined whether the fingerprint value of the current data block is stored in the fingerprint dictionary. If the fingerprint dictionary does not store the fingerprint value of the current data block, the current data block is not stored in the new data stream or the block dictionary, and it needs to be written into both. If the fingerprint value of the current data block is stored in the fingerprint dictionary, it can be determined that the new data stream and the block dictionary already store the current data block, and the current data block does not need to be written into them again.
  • The nth modulus value is obtained by extending the (n-1)th modulus value to the left by a fixed number of bits. For the fingerprint value of the same sliding window, if the fingerprint value modulo the nth modulus value is zero, the fingerprint value modulo the (n-1)th modulus value is also zero.
  • Conversely, if the fingerprint value of the sliding window modulo the (n-1)th modulus value is zero, the fingerprint value modulo the nth modulus value is not necessarily zero.
  • Therefore, when an (n-1)th-layer data block is generated, an nth-layer data block is not necessarily generated, but whenever an nth-layer data block is generated, an (n-1)th-layer data block must also have been generated. Consequently, as the sliding window moves from the starting position to the end position of the original data stream, the starting position of the nth-layer data block is no farther from the starting position of the original data stream than the starting position of the (n-1)th-layer data block.
  • The length of an nth-layer data block is greater than or equal to the length of an (n-1)th-layer data block, and an nth-layer data block includes one or more (n-1)th-layer data blocks.
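The nesting of layers can be illustrated with a small sketch. The fingerprints are supplied directly as `(end_offset, fingerprint)` pairs rather than computed from data, and the moduli are assumed to be successive multiples of one another (for example 4, 16, 64); both choices are assumptions made to keep the example short.

```python
def layered_boundaries(fingerprints: list[tuple[int, int]],
                       moduli: list[int]) -> dict[int, list[int]]:
    """Collect, per layer n, the window end offsets where a block ends.

    Layer n gets a boundary wherever the window fingerprint is 0 modulo
    the nth modulus value.
    """
    layers = {n: [] for n in range(1, len(moduli) + 1)}
    for end, fp in fingerprints:
        for n, p in enumerate(moduli, start=1):
            if fp % p == 0:
                layers[n].append(end)
    return layers


windows = [(1, 4), (2, 7), (3, 16), (4, 8), (5, 64), (6, 12)]
layers = layered_boundaries(windows, [4, 16, 64])
print(layers)  # {1: [1, 3, 4, 5, 6], 2: [3, 5], 3: [5]}
```

Every boundary of layer n is also a boundary of layer n-1, so each layer-n block is a concatenation of whole layer n-1 blocks.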
  • When the current data block includes one or more (n-1)th-layer data blocks, to avoid redundant data blocks and fingerprint values of redundant data blocks appearing in the space occupied by the new data stream, the current data block is written into the space occupied by the new data stream so as to overwrite all (n-1)th-layer data blocks included in the current data block in the new data stream.
  • Before the current data block is written, the location of the current data block in the new data stream stores all the (n-1)th-layer data blocks included in the current data block; after the current data block is written, it overwrites all of those (n-1)th-layer data blocks in the new data stream, so that location then stores the current data block.
  • When the level n of the current data block is equal to 1, the current data block is a first-layer data block and its range includes no data blocks of other layers, so the current data block is written directly at the end of the new data stream.
  • Before the current data block is written into the space occupied by the new data stream, the location in the new data stream of the (n-1)th-layer data blocks included in the current data block may be found.
  • The (n-1)th-layer data blocks included in the current nth-layer data block can be located in the new data stream by the following method.
  • When a data block is generated, the correspondence between the sequence number of the data block, the level of the data block, and its location in the original data stream is recorded.
  • This correspondence can be recorded in the location record table.
  • For example, the correspondence recorded for a data block is (No. 3, Level 2, position in the original data stream).
  • The sequence number of a data block indicates the order in which data blocks are generated: the smaller the sequence number, the earlier the data block was generated.
  • After the third data block, or the fingerprint value of the third data block, is written into the new data stream, its position in the new data stream is added to the correspondence recorded for the third data block.
  • The correspondence recorded for the third data block is then (No. 3, Level 2, position in the original data stream, position in the new data stream).
  • The location of a data block, or of the fingerprint value of a data block, in a data stream may be represented by its starting position in the data stream and its length, or by its start and end positions in the data stream.
  • According to the position of the current data block in the original data stream and the positions of the (n-1)th-layer data blocks in the original data stream, the records of the (n-1)th-layer data blocks included in the current data block are located in the location record table. From the located records, the location in the new data stream of the (n-1)th-layer data blocks included in the current data block is obtained.
  • The (n-1)th-layer data blocks included in the current data block are referred to as the (n-1)th-layer data blocks to be replaced.
  • The location recorded for an (n-1)th-layer data block to be replaced in the new data stream may be the location of the data block itself in the new data stream, or the location of its fingerprint value in the new data stream.
  • From the located record, the location in the new data stream of the (n-1)th-layer data block to be replaced, or of its fingerprint value, can be obtained. According to the obtained location, the current nth-layer data block replaces the (n-1)th-layer data block to be replaced, or its fingerprint value, in the new data stream, and the record of the (n-1)th-layer data block to be replaced is deleted from the location record table.
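The record-table lookup and replacement might look roughly like this. The sketch models the new data stream as a Python list of slots and assumes that the slots of the blocks to be replaced are contiguous and sit at the tail of the stream; `Record`, `write_block`, and the slot model are hypothetical names and simplifications, not the patent's layout.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One row of the location record table."""
    seq: int             # order of generation
    level: int           # layer n of the block
    orig_start: int      # position in the original data stream
    orig_end: int
    new_index: int = -1  # slot in the new data stream, -1 until written


def write_block(new_stream: list, records: list, current: Record, payload) -> None:
    """Write payload (block or fingerprint) for current, replacing its children."""
    kids = [r for r in records
            if r.level == current.level - 1
            and current.orig_start <= r.orig_start
            and r.orig_end <= current.orig_end]
    if kids:
        first = min(r.new_index for r in kids)
        last = max(r.new_index for r in kids)
        new_stream[first:last + 1] = [payload]  # overwrite the replaced slots
        for r in kids:
            records.remove(r)                   # delete replaced records
        current.new_index = first
    else:
        new_stream.append(payload)              # layer 1: sequential write
        current.new_index = len(new_stream) - 1
    records.append(current)
```

For instance, with two layer-1 slots `["A", "B"]` recorded for original ranges 0-4 and 4-9, writing a layer-2 block covering 0-9 leaves a single slot holding the layer-2 payload.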
  • Step 13 further includes: when the fingerprint value of the current data block is stored in the fingerprint dictionary, if n is greater than 1, writing the fingerprint value of the current data block into the space occupied by the new data stream so as to overwrite the fingerprint values of all (n-1)th-layer data blocks included in the current data block in the new data stream; otherwise, writing the fingerprint value of the current data block sequentially into the space occupied by the new data stream.
  • If the fingerprint value of the current data block exists in the fingerprint dictionary, it can be determined that the current data block is already stored in the block dictionary and the new data stream. If the fingerprint value of the current data block exists in the fingerprint dictionary and the current data block is a layer 1 data block, the fingerprint value of the current data block is written directly after the last data block of the new data stream.
  • When n is greater than 1, the current data block includes one or more (n-1)th-layer data blocks. Therefore, when the current data block is generated, if its fingerprint value is already stored in the fingerprint dictionary, the fingerprint values of all (n-1)th-layer data blocks within the range of the current data block have already been written into the new data stream. To avoid redundant data blocks and fingerprint values of redundant data blocks in the new data stream, the fingerprint value of the current data block is written into the space occupied by the new data stream so as to overwrite the fingerprint values of all (n-1)th-layer data blocks included in the current data block, thereby saving transmission bandwidth for the new data stream.
  • Before the fingerprint value of the current data block is written, the location of the current data block in the new data stream stores the fingerprint values of all (n-1)th-layer data blocks included in the current data block; after it is written, the fingerprint value of the current data block overwrites those fingerprint values, so that location then stores the fingerprint value of the current data block.
  • Before the fingerprint value of the current data block is written into the space occupied by the new data stream, the location in the new data stream of the (n-1)th-layer data blocks included in the current data block may be found, to determine where the fingerprint value of the current data block is to be written in the space occupied by the new data stream.
  • The fingerprint values of the (n-1)th-layer data blocks within the range of the current nth-layer data block can be located in the new data stream by the following method. According to the position of the current data block in the original data stream and the positions of the (n-1)th-layer data blocks in the original data stream, the records of the (n-1)th-layer data blocks included in the current data block are located in the location record table.
  • The (n-1)th-layer data blocks included in the current data block are referred to as the (n-1)th-layer data blocks to be replaced.
  • From the located records, the location in the new data stream of the fingerprint values of the (n-1)th-layer data blocks to be replaced is obtained.
  • The receiving end can use the block dictionary and the fingerprint dictionary together to parse the new data stream sent by the sending end and recover the original data stream.
  • When a fingerprint value is found in the new data stream, the fingerprint dictionary is searched for the index information, in the block dictionary, of the data block corresponding to the fingerprint value. According to the index information, the data block corresponding to the fingerprint value is looked up in the block dictionary, and the fingerprint value is replaced with the found data block in the space occupied by the new data stream.
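The restoration step can be sketched as below, assuming the receiving end already holds dictionaries consistent with the sender's. The tagged tuples `("fp", ...)` and `("data", ...)` used to tell fingerprints from literal data, and the `(offset, length)` form of the index information, are assumptions of this sketch; the text does not say how the two are distinguished on the wire.

```python
def restore(new_stream: list[tuple[str, bytes]],
            fingerprint_dict: dict[bytes, tuple[int, int]],
            block_dict: bytes) -> bytes:
    """Rebuild the original data from the new data stream.

    Each fingerprint entry is replaced by the block that its index
    information (offset, length) points to in the block dictionary.
    """
    original = bytearray()
    for kind, payload in new_stream:
        if kind == "fp":
            offset, length = fingerprint_dict[payload]
            original += block_dict[offset:offset + length]
        else:
            original += payload  # literal block data is copied through
    return bytes(original)
```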
  • The fingerprint value of the same sliding window on the original data stream is taken modulo each of a set of modulus values, that is, the first modulus value to the Nth modulus value, and the original data stream is blocked at a different level according to the result for each modulus value. Therefore, from the fingerprint value of a single sliding window, the original data stream can be blocked at coarse and fine granularity at the same time.
  • For a generated data block, if its fingerprint value exists in the fingerprint dictionary, the level of the data block determines whether the fingerprint value is written sequentially into the space occupied by the new data stream or replaces the fingerprint values of the next-lower-layer data blocks that the data block includes; if its fingerprint value does not exist in the fingerprint dictionary, the level of the data block determines whether the data block is written sequentially into the space occupied by the new data stream or replaces the next-lower-layer data blocks that it includes.
  • This avoids redundant data blocks and redundant fingerprint values appearing in the new data stream, generates the new data stream from data blocks of different granularities, and replaces existing small data blocks in the new data stream with large data blocks, which improves the deduplication rate of the new data stream.
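A minimal sketch of the multi-modulus test described above, under one loudly stated assumption: each modulus is taken to be a power of two, so that extending a modulus to the left makes it a multiple of the previous one and a boundary at layer n is also a boundary at every lower layer. The source does not spell out the arithmetic, and the values in `MODULI` are hypothetical.

```python
def boundary_layers(fingerprint: int, moduli: list[int]) -> list[int]:
    """Return the 1-based layers n for which fingerprint mod p_n is zero."""
    return [n for n, p in enumerate(moduli, start=1) if fingerprint % p == 0]


# Hypothetical moduli (assumption): 0b100, 0b1000, 0b100000.
MODULI = [4, 8, 32]

for fp in (0b110100, 0b101000, 0b1100000, 0b110101):
    # One window fingerprint is tested against every modulus, so a single
    # window position can close blocks on several layers at once.
    print(f"{fp:07b} -> layers {boundary_layers(fp, MODULI)}")
```

Because the tests for different moduli are independent of each other, they could equally be evaluated in parallel, which matches the remark that blocks of different levels can be generated concurrently.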
  • The data blocks of every layer are generated directly on the original data stream, so every fingerprint value is the fingerprint value of a data block. No indication bit needs to be added to a fingerprint value to indicate whether it is the fingerprint value of a data block or the fingerprint value of other fingerprint values, which reduces the overhead of the fingerprint values and improves the deduplication rate of the new data stream.
  • the generation of each layer of data blocks does not depend on the data blocks of other layers. Therefore, for the fingerprint values of the same sliding window, different levels of data blocks can be generated concurrently.
  • When the fingerprint value of the current data block is not stored in the fingerprint dictionary, whether the current data block includes data blocks of the (n-1)th layer determines whether the current data block is written into the block dictionary.
  • If the current data block includes data blocks of the (n-1)th layer, the fingerprint value of the current data block and the index information of the current data block in the block dictionary are stored in the fingerprint dictionary, where the index information of the current data block in the block dictionary is determined according to the index information of the (n-1)th-layer data blocks included in the current data block. If the current data block does not include a data block of the (n-1)th layer, the current data block is stored in the block dictionary, and the fingerprint value of the current data block and the index information of the current data block in the block dictionary are stored in the fingerprint dictionary.
  • FIG. 2 shows a three-level blocking diagram.
  • The blocks in FIG. 2 use modulus values p1, p2 and p3.
  • Extending p1 to the left by two bits gives p3, 01110.
  • the sliding window slides from left to right on the original data stream, and the fingerprint value in the sliding window is 111010.
  • The fingerprint value of the sliding window is 0 modulo p1, so a block of the first layer is generated, recorded as block 1.
  • The fingerprint value of the sliding window is not 0 modulo p2 or p3.
  • The fingerprint value of block 1 is not yet stored in the fingerprint dictionary.
  • Block 1 is stored in the block dictionary, and the fingerprint value of block 1 and the index information of block 1 in the block dictionary are stored in the fingerprint dictionary.
  • The index information of a block in the block dictionary includes its address, offset, and length in the block dictionary. Block 1 is then written to the new data stream.
  • The address of block 1 in the block dictionary is the location of the entry in the block dictionary that stores the data of block 1.
  • the sliding window continues to slide to the right on the original data stream, and the sliding window has a fingerprint value of 010110.
  • The fingerprint value of the sliding window is 0 modulo p1, so a block of the first layer is generated, recorded as block 2.
  • the value of the fingerprint value of the sliding window is also 0 after modulo p2, and a block of the second layer is generated, which is recorded as block 3.
  • the fingerprint value of the sliding window is not 0 after modulo p3.
  • The fingerprint value of block 2 is not yet stored in the fingerprint dictionary. As shown in FIG. 3, block 2 is stored in the block dictionary, and the fingerprint value of block 2 and the index information of block 2 in the block dictionary are stored in the fingerprint dictionary.
  • Block 2 is written after the last block in the new data stream, that is, at the end of block 1.
  • The fingerprint value of block 3 is not yet stored in the fingerprint dictionary, so block 3 is stored in the block dictionary, and the fingerprint value of block 3 and the index information of block 3 in the block dictionary (address, offset, and length) are stored in the fingerprint dictionary. Block 1 and block 2 of the first layer are then replaced with block 3 of the second layer in the new data stream.
  • the sliding window continues to slide to the right on the original data stream.
  • the fingerprint value of the sliding window is 001110.
  • The fingerprint value of the sliding window is 0 modulo p1, p2 and p3, so a block of the first layer, a block of the second layer, and block 5 of the third layer are generated in succession.
  • The first-layer block and the second-layer block are the same block, so only the first-layer block is stored, denoted block 4.
  • The fingerprint value of block 4 is not yet stored in the fingerprint dictionary. As shown in FIG. 3, block 4 is stored in the block dictionary, and the fingerprint value of block 4 and the index information of block 4 in the block dictionary are saved in the fingerprint dictionary.
  • block 4 is written to the end of the last block in the new data stream, block 1.
  • The fingerprint value of block 5 is not yet stored in the fingerprint dictionary, so block 5 is stored in the block dictionary, and the fingerprint value of block 5 and the start address, offset, and length of block 5 in the block dictionary are saved in the fingerprint dictionary. Block 3 and block 4 of the second layer are then replaced with block 5 of the third layer in the new data stream.
  • The sliding window continues to slide to the right on the original data stream. If block 3 is generated again, its fingerprint value is already stored in the fingerprint dictionary and block 3 is a block of the second layer, so it can be determined that the first-layer data blocks within the range occupied by block 3 have already been replaced with fingerprint values in the new data stream; the fingerprint values of the first-layer blocks within the range of block 3 are replaced with the fingerprint value of block 3.
  • Likewise, if block 5 is generated again, its fingerprint value is already stored in the fingerprint dictionary and block 5 is a block of the third layer, so the second-layer data blocks within its range have already been replaced with fingerprint values; the fingerprint values of block 3 and block 4 of the second layer within the range of block 5 are replaced with the fingerprint value of block 5.
  • If block 1 or block 2 of the first layer is generated again, its fingerprint value is already stored in the fingerprint dictionary and it is a block of the first layer, so its fingerprint value is written directly at the end of the last block of the new data stream.
  • Block 3 of the second layer is composed of block 1 and block 2 of the first layer, which are already stored in the block dictionary, so block 3 need not be stored again in the block dictionary.
  • the offset of the block 1 is 0, the block 2 is stored after the block 1, and the offset of the block 2 is the length of the block 1.
  • the address of the block 3 is the same as the address of the block 1, and is also the same as the address of the block 2, and the offset of the block 3 is the offset of the block 1.
  • the length of the partition 3 is the sum of the length of the partition 1 and the length of the partition 2.
  • Block 5 of the third layer is composed of block 3 and block 4 of the second layer, whose data is already stored in the block dictionary, so block 5 need not be stored again in the block dictionary.
  • the offset of the partition 5 is the offset of the partition 3.
  • the address of the partition 5 is the same as the address of the partition 3, and the length of the partition 5 is the sum of the length of the partition 3 and the length of the partition 4.
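The index arithmetic for blocks 3 and 5 can be sketched as follows. This is an illustration only: the `IndexInfo` type, the helper name, and the byte lengths are hypothetical, and the sketch assumes that the lower-layer blocks share one dictionary address and are stored back to back, as blocks 1 and 2 are in the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexInfo:
    address: int  # entry of the block dictionary that holds the data
    offset: int   # byte offset inside that entry
    length: int   # block length in bytes


def index_from_children(children: list[IndexInfo]) -> IndexInfo:
    """Derive a layer-n index from its contiguous layer-(n-1) blocks."""
    first = children[0]
    expected = first.offset
    for child in children:
        # Assumption: children sit back to back under the same address.
        if child.address != first.address or child.offset != expected:
            raise ValueError("children are not contiguous in the block dictionary")
        expected += child.length
    # Same address and offset as the first child; length is the sum of lengths.
    return IndexInfo(first.address, first.offset, expected - first.offset)


block1 = IndexInfo(address=0, offset=0, length=5)   # hypothetical lengths
block2 = IndexInfo(address=0, offset=5, length=3)   # stored right after block 1
block3 = index_from_children([block1, block2])
block4 = IndexInfo(address=0, offset=8, length=4)   # assumed to follow block 2
block5 = index_from_children([block3, block4])
print(block3)
print(block5)
```

With this layout a higher-layer block costs only a fingerprint dictionary entry, since its bytes are already present in the block dictionary.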
  • FIG. 5 is a schematic structural diagram of a data blocking device according to an embodiment of the present invention.
  • the apparatus provided in this embodiment includes: an acquisition module 56, a modulo module 51, a blocking module 52 and a writing module 53, and a sliding module 50.
  • the obtaining module 56 is configured to obtain a fingerprint value of the sliding window on the original data stream, where the initial starting position of the sliding window is the same as the starting position of the original data stream, and the length of the sliding window is a preset length.
  • the modulo module 51 is configured to use the fingerprint value of the sliding window to perform a modulo operation on each of the first modulus value to the Nth modulus value, where the (i+1)th modulus value is obtained by extending the ith modulus value to the left by a preset number of bits, i is any natural number greater than or equal to 1 and less than N, and N is the total number of modulus values.
  • a blocking module 52, configured to, for each modulus value, if the result of the modulo operation by the modulo module is zero, take the data in the original data stream from the end position of the previous data block of the nth layer to the end position of the sliding window as the current data block, the current data block being a data block of the nth layer, where n is any natural number greater than or equal to 1 and less than or equal to N;
  • the writing module 53 is configured to: when the fingerprint value of the current data block generated by the blocking module is not stored in the fingerprint dictionary, store the current data block in the block dictionary, and store the fingerprint value of the current data block and the index information of the current data block in the block dictionary in the fingerprint dictionary; if n is greater than 1, write the current data block into the space occupied by the new data stream so as to cover all the (n-1)th-layer data blocks that the current data block includes in the new data stream; otherwise, write the current data block sequentially into the space occupied by the new data stream;
  • the writing module 53 is further configured to: when the fingerprint value of the current data block is stored in the fingerprint dictionary, if n is greater than 1, write the fingerprint value of the current data block into the space occupied by the new data stream so as to cover the fingerprint values of all the (n-1)th-layer data blocks that the current data block includes in the new data stream; otherwise, write the fingerprint value of the current data block sequentially into the space occupied by the new data stream;
  • a sliding module 50, configured to: after the writing module writes the current data block generated by the blocking module, or the fingerprint value of the current data block, into the space occupied by the new data stream, slide the sliding window on the original data stream toward the end position by the preset length and acquire the fingerprint value of the sliding window on the original data stream, so that the modulo module 51, the blocking module 52, the writing module 53 and the sliding module 50 repeat their operations until the starting position of the sliding window reaches the end position of the original data stream.
  • the writing module 53 is further configured to: if the current data block includes a data block of the (n-1)th layer, store the fingerprint value of the current data block and the index information of the current data block in the block dictionary in the fingerprint dictionary, where the index information of the current data block in the block dictionary is determined according to the index information of the (n-1)th-layer data blocks included in the current data block.
  • the writing module 53 is further configured to: if the current data block does not include a data block of the (n-1)th layer, store the current data block in the block dictionary, and store the fingerprint value of the current data block and the index information of the current data block in the block dictionary in the fingerprint dictionary.
  • The modulo module takes the fingerprint value of the same sliding window on the original data stream modulo each of a set of modulus values, that is, the first modulus value to the Nth modulus value, and the blocking module blocks the original data stream at a different level according to the result for each modulus value. Therefore, from the fingerprint value of a single sliding window, the original data stream can be blocked at coarse and fine granularity at the same time.
  • If the fingerprint value of a data block exists in the fingerprint dictionary, the writing module determines, according to the level of the data block, whether the fingerprint value is written sequentially into the space occupied by the new data stream or replaces the fingerprint values of the next-lower-layer data blocks that the data block includes; if the fingerprint value does not exist in the fingerprint dictionary, the writing module determines, according to the level of the data block, whether the data block is written sequentially into the space occupied by the new data stream or replaces the next-lower-layer data blocks that it includes.
  • This avoids redundant data blocks and redundant fingerprint values appearing in the new data stream, generates the new data stream from data blocks of different granularities, and replaces existing small data blocks in the new data stream with large data blocks, which improves the deduplication rate of the new data stream.
  • The data blocks of every layer are generated directly on the original data stream, so every fingerprint value is the fingerprint value of a data block. No indication bit needs to be added to a fingerprint value to indicate whether it is the fingerprint value of a data block or the fingerprint value of other fingerprint values, which reduces the overhead of the fingerprint values and improves the deduplication rate of the new data stream.
  • The generation of each layer of data blocks does not depend on the data blocks of other layers. Therefore, from the fingerprint value of the same sliding window, data blocks of different levels can be generated concurrently.
  • the apparatus of FIG. 5 may further include: a positioning module 54 and a recording module 55.
  • a recording module 55, configured to: after the blocking module 52 takes the data in the original data stream from the end position of the previous data block of the nth layer to the end position of the sliding window as the current data block, record the correspondence between the sequence number of the current data block and the level of the current data block, the location of the current data block in the original data stream, and the location of the current data block in the new data stream, and save the correspondence in the location record table;
  • a positioning module 54, configured to: before the writing module 53 writes the current data block into the space occupied by the new data stream, and/or before the writing module 53 writes the fingerprint value of the current data block into the space occupied by the new data stream, find the locations in the new data stream of the (n-1)th-layer data blocks included in the current data block.
  • the positioning module 54 is specifically configured to: according to the location of the current data block in the original data stream and the locations of the (n-1)th-layer data blocks in the original data stream, locate in the location record table the records of the (n-1)th-layer data blocks included in the current data block; and obtain, from the located records, the locations in the new data stream of the (n-1)th-layer data blocks included in the current data block.
  • the apparatus shown in Figures 5 and 6 may further include: a restore module.
  • a restoring module, configured to: in the process of parsing the new data stream into the original data stream, if a fingerprint value is found in the new data stream, search the fingerprint dictionary for the index information, in the block dictionary, of the data block corresponding to the fingerprint value;
  • the restoring module is further configured to: look up, according to the index information, the data block corresponding to the fingerprint value in the block dictionary, and replace the fingerprint value with the found data block in the space that the fingerprint value occupies in the new data stream.
  • the various modules shown in Figures 5 and 6 can be implemented using a processor.
  • the apparatus shown in FIG. 5 and FIG. 6 may further include a storage module, configured to store the block dictionary and the fingerprint dictionary.
  • the storage module can be implemented by using a memory.

Abstract

A data blocking method and device. The method comprises: using the fingerprint value of a sliding window to perform a modulus operation on each of the first modulus value to the Nth modulus value; if the result is zero for a modulus value, taking the data in an original data stream from the end position of the previous data block of the nth layer to the end position of the sliding window as the current data block; when the fingerprint value of the current data block is not stored in a fingerprint dictionary, if n is greater than 1, writing the current data block into the space occupied by a new data stream so as to cover all the (n-1)th-layer data blocks comprised in the current data block, and otherwise writing the current data block into the new data stream in sequence; when the fingerprint value of the current data block is stored in the fingerprint dictionary, if n is greater than 1, writing the fingerprint value of the current data block into the new data stream so as to cover the fingerprint values of all the (n-1)th-layer data blocks comprised in the current data block in the new data stream, and otherwise writing the fingerprint value of the current data block into the space occupied by the new data stream in sequence.

Description

Data blocking method and device

Technical Field

Embodiments of the present invention relate to communications technologies, and in particular to a data blocking method and apparatus.

Background
Repeatedly transmitting the same or similar data wastes network resources. Redundant Traffic Elimination (RTE) technology eliminates redundancy in a protocol-independent way and makes data communication more efficient. Existing RTE technology usually uses a blocking method based on Modular Exponential (MODP) to block the data stream according to the content of the file. The MODP-based blocking method is implemented with a sliding window: for a given modulus value p (for example, 10010), when the fingerprint value of the sliding window (for example, 1111 0111 0010) modulo p is 0, the data between the end of the sliding window and the end position of the previous data block is divided off as a data block. When the fingerprint value of the sliding window modulo p is not 0, the sliding window moves toward the end of the stream by one byte or by a custom length. After a data block is obtained, its fingerprint value is calculated and compared with the fingerprint values of the stored data blocks in the fingerprint dictionary. If the same fingerprint value is found, redundant data content has been found, and the data block is replaced with an identifier in the new data stream to compress the traffic. If the same fingerprint value is not found, the data block and its fingerprint are stored for redundancy detection of subsequent blocks. The identifier that replaces a data block has a fixed length, so the longer the corresponding data block, the more bandwidth is saved during transmission. However, large data blocks mean that redundancy elimination is coarse-grained: identical data blocks rarely recur, so the deduplication rate is low.
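The single-level MODP procedure described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the SHA-1 based fingerprint stands in for whatever rolling hash a real system would use, and the window length and modulus are hypothetical values.

```python
import hashlib


def fingerprint(data: bytes) -> int:
    # Stand-in fingerprint (assumption): SHA-1 digest read as an integer.
    return int.from_bytes(hashlib.sha1(data).digest(), "big")


def modp_chunks(data: bytes, window: int = 4, modulus: int = 8) -> list[bytes]:
    """Cut a block wherever the window fingerprint modulo `modulus` is 0."""
    chunks, start = [], 0
    for end in range(window, len(data) + 1):      # window slides one byte at a time
        if fingerprint(data[end - window:end]) % modulus == 0:
            chunks.append(data[start:end])        # previous block end .. window end
            start = end
    if start < len(data):
        chunks.append(data[start:])               # trailing data without a boundary
    return chunks


def deduplicate(chunks: list[bytes], fp_dict: dict[int, bytes]) -> list[tuple]:
    """Replace blocks whose fingerprint is already known with an identifier."""
    out = []
    for chunk in chunks:
        fp = fingerprint(chunk)
        if fp in fp_dict:
            out.append(("ref", fp))               # redundant content: send identifier
        else:
            fp_dict[fp] = chunk                   # remember for later detection
            out.append(("raw", chunk))
    return out


sample = bytes(range(256)) * 4
known: dict[int, bytes] = {}
first_pass = deduplicate(modp_chunks(sample), known)
second_pass = deduplicate(modp_chunks(sample), known)
```

A larger modulus yields fewer boundaries and therefore longer blocks, which is the granularity trade-off the paragraph above describes.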
At present, a hierarchical blocking scheme has been proposed that achieves coarse-grained and fine-grained redundancy elimination at the same time in order to improve the deduplication rate. The sending end uses the MODP-based blocking method to block the original data stream and obtains data block sequence 101: blocks SA, SB, SC, SD, SE and SF. Each block is replaced with the identifier of that block, yielding identifier sequence 102 (called the layer-1 identifiers). Identifier sequence 102 is then treated as an ordinary data stream and blocked again with the MODP-based blocking method, which yields two data blocks of identifier data; these are likewise replaced, giving layer-2 identifier sequence 103, which consists of two layer-2 identifiers. In terms of data content, the first layer-2 identifier represents data blocks SA and SB, and the second, R2 2, represents blocks SC, SD, SE and SF. In the new data stream sent by the sending end to the receiving end, data blocks are replaced with identifiers of as high a layer as possible to save bandwidth.
However, in the above hierarchical blocking scheme, the layer-2 identifiers are generated by blocking the layer-1 identifiers, so an indication bit must be added to each layer-2 identifier to indicate whether it is the identifier of a block or the identifier of identifiers. Otherwise, the receiving end cannot restore the original data. The above hierarchical blocking scheme therefore increases the overhead of the identifiers and lowers the deduplication rate.

Summary of the Invention

Embodiments of the present invention provide a data blocking method and apparatus to improve the deduplication rate of data blocking. In a first aspect, an embodiment of the present invention provides a data blocking method, including:
11. Obtain a fingerprint value of a sliding window on the original data stream, where the initial starting position of the sliding window is the same as the starting position of the original data stream, and the length of the sliding window is a preset length.
12. Using the fingerprint value of the sliding window, perform a modulo operation on each of the first modulus value to the Nth modulus value; for each modulus value, if the result of the modulo operation is zero, perform steps 13 and 14. The (i+1)th modulus value is obtained by extending the ith modulus value to the left by a preset number of bits, where i is any natural number greater than or equal to 1 and less than N, and N is the total number of modulus values.
13. Take the data in the original data stream from the end position of the previous data block of the nth layer to the end position of the sliding window as the current data block, the current data block being a data block of the nth layer, where n is any natural number greater than or equal to 1 and less than or equal to N:
When the fingerprint value of the current data block is not stored in the fingerprint dictionary, store the current data block in the block dictionary, and store the fingerprint value of the current data block and the index information of the current data block in the block dictionary in the fingerprint dictionary; when n is greater than 1, write the current data block into the space occupied by the new data stream so as to cover all the (n-1)th-layer data blocks that the current data block includes in the new data stream; when n is equal to 1, write the current data block sequentially into the space occupied by the new data stream.
When the fingerprint value of the current data block is stored in the fingerprint dictionary and n is greater than 1, write the fingerprint value of the current data block into the space occupied by the new data stream so as to cover the fingerprint values of all the (n-1)th-layer data blocks that the current data block includes in the new data stream; when the fingerprint value of the current data block is stored in the fingerprint dictionary and n is equal to 1, write the fingerprint value of the current data block sequentially into the space occupied by the new data stream.
14. Slide the sliding window on the original data stream toward the end position by the preset length, obtain the fingerprint value of the sliding window on the original data stream, and repeat steps 12 to 14 until the starting position of the sliding window reaches the end position of the original data stream.
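Steps 11 to 14 can be sketched roughly as below. Everything concrete here is an assumption made for illustration: the SHA-1 stand-in fingerprint, the `(start, end, kind, payload)` entry layout of the new stream, the one-byte default `step`, the final flush of trailing data, and the requirement that each modulus divides the next (so higher-layer boundaries always coincide with lower-layer ones). When a higher-layer block is byte-for-byte the same as the lower-layer block just written, the sketch keeps the lower-layer entry.

```python
import hashlib


def fingerprint(data: bytes) -> int:
    # Stand-in fingerprint (assumption): SHA-1 digest read as an integer.
    return int.from_bytes(hashlib.sha1(data).digest(), "big")


def multilevel_chunk(stream, moduli, window=4, step=1, block_dict=None, fp_dict=None):
    """Return the new stream as a list of (start, end, kind, payload) entries."""
    block_dict = [] if block_dict is None else block_dict  # stored data blocks
    fp_dict = {} if fp_dict is None else fp_dict           # fingerprint -> block_dict index
    new_stream = []
    last_end = [0] * (len(moduli) + 1)   # last_end[n]: end of the previous layer-n block

    def emit(n, start, end):
        block = stream[start:end]
        # Entries of the new stream that lie inside the current block's range.
        covered = [e for e in new_stream if start <= e[0] and e[1] <= end]
        if n > 1 and len(covered) == 1 and covered[0][:2] == (start, end):
            return                       # same block as the lower layer: keep it
        fp = fingerprint(block)
        if fp in fp_dict:
            entry = (start, end, "fp", fp)
        else:
            block_dict.append(block)
            fp_dict[fp] = len(block_dict) - 1
            entry = (start, end, "data", block)
        # n == 1: nothing is covered, so this is a sequential write.
        # n > 1: the covered lower-layer entries are overwritten by one entry.
        new_stream[:] = [e for e in new_stream if e not in covered] + [entry]

    pos = 0
    while pos + window <= len(stream):                 # step 11: window fingerprint
        end = pos + window
        value = fingerprint(stream[pos:end])
        for n, p in enumerate(moduli, start=1):        # step 12: test every modulus
            if value % p == 0:
                emit(n, last_end[n], end)              # step 13: current layer-n block
                last_end[n] = end
        pos += step                                    # step 14: slide the window
    if last_end[1] < len(stream):                      # flush the tail (assumption)
        emit(1, last_end[1], len(stream))
    return new_stream


data = bytes((i * 7 + i // 13) % 251 for i in range(2000))
blocks, fps = [], {}
first = multilevel_chunk(data, [4, 16, 64], block_dict=blocks, fp_dict=fps)
second = multilevel_chunk(data, [4, 16, 64], block_dict=blocks, fp_dict=fps)
restored = b"".join(p if k == "data" else blocks[fps[p]] for _, _, k, p in first)
```

On the second run over the same data every entry is a fingerprint, and the entries are the highest-layer blocks available, which is the effect the method aims for. Unlike the first possible implementation described below, this sketch stores every new block in full rather than deriving index information from lower-layer blocks.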
With reference to the first aspect, in a first possible implementation, storing the current data block in the block dictionary and storing the fingerprint value of the current data block and the index information of the current data block in the block dictionary in the fingerprint dictionary includes:
when the current data block includes a data block of the (n-1)th layer, storing the fingerprint value of the current data block and the index information of the current data block in the block dictionary in the fingerprint dictionary, where the index information of the current data block in the block dictionary is determined according to the index information of the (n-1)th-layer data blocks included in the current data block;
when the current data block does not include a data block of the (n-1)th layer, storing the current data block in the block dictionary, and storing the fingerprint value of the current data block and the index information of the current data block in the block dictionary in the fingerprint dictionary.
With reference to the first aspect, or the first possible implementation of the first aspect, in a second possible implementation, before the current data block is written into the space occupied by the new data stream, and/or before the fingerprint value of the current data block is written into the space occupied by the new data stream, the method further includes:
finding the locations in the new data stream of the (n-1)th-layer data blocks included in the current data block.
With reference to the first aspect, or the first or second possible implementation of the first aspect, in a third possible implementation, after the data in the original data stream from the end position of the previous data block of the nth layer to the end position of the sliding window is taken as the current data block, the method further includes:
recording the correspondence between the sequence number of the current data block and the level of the current data block, the location of the current data block in the original data stream, and the location of the current data block in the new data stream, and saving the correspondence in a location record table;
and finding the locations in the new data stream of the (n-1)th-layer data blocks included in the current data block includes:
according to the location of the current data block in the original data stream and the locations of the (n-1)th-layer data blocks in the original data stream, locating, in the location record table, the records of the (n-1)th-layer data blocks included in the current data block in the original data stream;
obtaining, from the located records of the (n-1)th-layer data blocks, the locations in the new data stream of the (n-1)th-layer data blocks included in the current data block.
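A location record table of this kind might look like the sketch below. The field names, the half-open position ranges, and the sample rows are hypothetical; the source only lists which quantities a record relates.

```python
from dataclasses import dataclass


@dataclass
class Record:
    seq: int         # sequence number of the data block
    layer: int       # level of the data block
    orig_start: int  # location in the original stream, half-open [start, end)
    orig_end: int
    new_pos: int     # location in the new data stream


def find_covered(table: list[Record], layer: int, orig_start: int, orig_end: int) -> list[int]:
    """New-stream locations of the layer-(n-1) blocks inside the current block."""
    return [
        r.new_pos
        for r in table
        if r.layer == layer - 1
        and orig_start <= r.orig_start
        and r.orig_end <= orig_end
    ]


# Hypothetical table: three first-layer blocks already written to the new stream.
table = [
    Record(seq=1, layer=1, orig_start=0, orig_end=5, new_pos=0),
    Record(seq=2, layer=1, orig_start=5, orig_end=8, new_pos=5),
    Record(seq=3, layer=1, orig_start=8, orig_end=12, new_pos=8),
]

# A second-layer block covering [0, 8) of the original stream includes
# the first two records, so their new-stream locations are returned.
covered = find_covered(table, layer=2, orig_start=0, orig_end=8)
print(covered)
```

The positions in the original stream do the matching, and the positions in the new stream tell the writer where to overwrite.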
With reference to the first aspect, or any one of the first to third possible implementations of the first aspect, in a fourth possible implementation, the method further includes:
in the process of parsing the new data stream into the original data stream, when a fingerprint value is found in the new data stream, searching the fingerprint dictionary for the index information, in the block dictionary, of the data block corresponding to the fingerprint value;
looking up, according to the index information, the data block corresponding to the fingerprint value in the block dictionary, and replacing the fingerprint value with the found data block in the space that the fingerprint value occupies in the new data stream.
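The restoration step can be sketched as follows. The tagged `(kind, payload)` tokens are an assumption, since the source does not say how a fingerprint value is told apart from raw data in the new stream; the dictionary layout with `(address, offset, length)` index information follows the description, while the sample contents are made up.

```python
def restore(new_stream, fp_dict, block_dict) -> bytes:
    """Rebuild the original stream from data tokens and fingerprint tokens."""
    out = bytearray()
    for kind, payload in new_stream:
        if kind == "fp":
            # Fingerprint dictionary: fingerprint -> index information.
            address, offset, length = fp_dict[payload]
            # Block dictionary: fetch the bytes the index information points at.
            out += block_dict[address][offset:offset + length]
        else:
            out += payload               # raw data is copied through unchanged
    return bytes(out)


# Hypothetical dictionaries: one stored entry shared by three fingerprints.
block_dict = [b"helloworld"]
fp_dict = {
    "fpA": (0, 0, 5),    # first-layer block "hello"
    "fpB": (0, 5, 5),    # first-layer block "world"
    "fpC": (0, 0, 10),   # second-layer block covering both
}
new_stream = [("data", b"say "), ("fp", "fpC"), ("data", b"!"), ("fp", "fpA")]
print(restore(new_stream, fp_dict, block_dict))
```

Because every fingerprint maps directly to bytes in the block dictionary, the receiver needs no indication bit to distinguish layers.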
In a second aspect, an embodiment of the present invention further provides a data blocking apparatus, including:
an acquisition module, configured to obtain the fingerprint value of a sliding window on an original data stream, where the initial start position of the sliding window is the same as the start position of the original data stream, and the length of the sliding window is a preset length;
a modulo module, configured to take the fingerprint value of the sliding window modulo each of a 1st modulus value to an Nth modulus value, where the (i+1)th modulus value is obtained by extending the ith modulus value to the left by a preset number of bits, i is any natural number greater than or equal to 1 and less than N, and N is the total number of modulus values;
a blocking module, configured to, for each modulus value, if the result obtained by the modulo module is zero, take the data in the original data stream from the end position of the previous layer-n data block to the end position of the sliding window as a current data block, where the current data block is a layer-n data block, and n is any natural number greater than or equal to 1 and less than or equal to N;
a write module, configured to: when the fingerprint dictionary does not store the fingerprint value of the current data block generated by the blocking module, store the current data block in a block dictionary, and store the fingerprint value of the current data block and the index information of the current data block in the block dictionary in the fingerprint dictionary; when n is greater than 1, write the current data block into the space occupied by a new data stream so as to overwrite all layer-(n-1) data blocks that the current data block includes in the new data stream; and when n is equal to 1, write the current data block sequentially into the space occupied by the new data stream;
the write module being further configured to: when the fingerprint dictionary stores the fingerprint value of the current data block and n is greater than 1, write the fingerprint value of the current data block into the space occupied by the new data stream so as to overwrite the fingerprint values of all layer-(n-1) data blocks that the current data block includes in the new data stream; and when the fingerprint dictionary stores the fingerprint value of the current data block and n is equal to 1, write the fingerprint value of the current data block sequentially into the space occupied by the new data stream; and
a sliding module, configured to: after the write module writes the current data block generated by the blocking module, or the fingerprint value of the current data block, into the space occupied by the new data stream, slide the sliding window on the original data stream toward the end position by the preset length and obtain the fingerprint value of the sliding window on the original data stream, so that the modulo module, the write module, and the sliding module repeat their operations until the start position of the sliding window reaches the end position of the original data stream.
With reference to the second aspect, in a first possible implementation, when the current data block includes layer-(n-1) data blocks, the index information of the current data block in the block dictionary is determined according to the index information of the layer-(n-1) data blocks included in the current data block.
With reference to the second aspect, or to the first possible implementation of the second aspect, in a second possible implementation, the apparatus further includes:
a locating module, configured to search for the positions, in the new data stream, of the layer-(n-1) data blocks included in the current data block before the write module writes the current data block into the space occupied by the new data stream, and/or before the write module writes the fingerprint value of the current data block into the space occupied by the new data stream.
With reference to the second aspect, or to the first or second possible implementation of the second aspect, in a third possible implementation, the apparatus further includes:
a recording module, configured to: after the blocking module takes the data in the original data stream from the end position of the previous layer-n data block to the end position of the sliding window as the current data block, record a correspondence between the sequence number of the current data block and the layer of the current data block, the position of the current data block in the original data stream, and the position of the current data block in the new data stream, and save the correspondence in a position record table; and
the locating module is configured to: locate, in the position record table, the records of the layer-(n-1) data blocks that the current data block includes in the original data stream, according to the position of the current data block in the original data stream and the positions of the layer-(n-1) data blocks in the original data stream; and obtain, from the located records of the layer-(n-1) data blocks, the positions in the new data stream of the layer-(n-1) data blocks included in the current data block.
With reference to the second aspect, or to any of the first to third possible implementations of the second aspect, in a fourth possible implementation, the apparatus further includes:
a restoring module, configured to: in the process of parsing the new data stream into the original data stream, when a fingerprint value is found in the new data stream, search the fingerprint dictionary for the index information, in the block dictionary, of the data block corresponding to the fingerprint value; and
the restoring module is further configured to search the block dictionary, according to the index information, for the data block corresponding to the fingerprint value, and to replace the fingerprint value with the found data block in the space that the fingerprint value occupies in the new data stream.
In the data blocking method and apparatus provided by the embodiments of the present invention, the fingerprint value of one sliding window on the original data stream is taken modulo each of a set of modulus values, namely the 1st to Nth modulus values, and the original data stream is divided into blocks at different layers according to the results for the different modulus values. The same sliding-window fingerprint value can therefore drive coarse-grained and fine-grained blocking of the original data stream at the same time. For a generated data block, if its fingerprint value exists in the fingerprint dictionary, the layer of the data block determines whether its fingerprint value is written sequentially into the space occupied by the new data stream or replaces the fingerprint values of the lower-layer data blocks that the data block includes; if its fingerprint value does not exist in the fingerprint dictionary, the layer of the data block determines whether the data block is written sequentially into the space occupied by the new data stream or replaces the lower-layer data blocks that it includes. This prevents the new data stream from containing data blocks, or fingerprint values of data blocks, that are superfluous relative to the original data stream. The new data stream is generated from data blocks of different granularities, with large data blocks replacing the small data blocks already present, which improves the deduplication rate of the new data stream. In this embodiment, the data blocks of every layer are produced directly from the original data stream, and every fingerprint value is the fingerprint value of a data block. No indicator bit needs to be added to a fingerprint value to show whether it is the fingerprint value of a data block or the fingerprint value of fingerprint values, which reduces the overhead of the fingerprint values and further improves the deduplication rate of the new data stream. In addition, because different modulus values produce data blocks at different layers, and the generation of the data blocks of each layer does not depend on the data blocks of other layers, data blocks at different layers can be generated concurrently for the same sliding-window fingerprint value.
Brief Description of the Drawings
FIG. 1A is a flowchart of a data blocking method according to an embodiment of the present invention;
FIG. 1B is a schematic diagram of a sliding window according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of three-layer blocking according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a block dictionary and a fingerprint dictionary according to an embodiment of the present invention;
FIG. 4 is another schematic structural diagram of a block dictionary and a fingerprint dictionary according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a data blocking apparatus according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of another data blocking apparatus according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention are applicable to the case in which, before a sending end sends a data stream to a receiving end, the sending end divides the original data stream into blocks, assembles the generated data blocks into a new data stream, and sends the new data stream to the receiving end, thereby saving transmission bandwidth and improving transmission efficiency.
FIG. 1A is a flowchart of a data blocking method according to an embodiment of the present invention. This embodiment divides the original data stream into blocks at multiple layers by using a set of modulus values, namely the 1st to Nth modulus values, where N is the total number of modulus values, N is a preset value, and N is a natural number greater than 1.
As shown in FIG. 1A, the method provided by this embodiment includes the following steps.
Step 11: Slide the sliding window from the start position of the original data stream toward the end position of the original data stream by a preset length. The initial start position of the sliding window is the same as the start position of the original data stream, and the length of the sliding window is the preset length.
Step 12: Obtain the fingerprint value of the sliding window on the original data stream, and take this fingerprint value modulo each of the 1st to Nth modulus values. For each modulus value, if the result of the modulo operation is zero, perform step 13; then perform step 14.
The 1st to Nth modulus values are related as follows. The 1st modulus value p1 is a preset value. Extending the 1st modulus value to the left by a preset number of bits gives the 2nd modulus value p2, extending the 2nd modulus value to the left by a preset number of bits gives the 3rd modulus value p3, and in general extending the ith modulus value to the left by a preset number of bits gives the (i+1)th modulus value. This yields a set of modulus values p1, p2, ..., pN with different numbers of bits, where i is any natural number greater than or equal to 1 and less than N, and N is the total number of modulus values. For example, extending the modulus value p1 = 1100 to the left by 2 bits gives p2 = 011100, and extending p2 to the left by a further 1 bit gives p3 = 1011100. The values p4, ..., pN are obtained in the same way.
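The extension rule can be illustrated with a short sketch. It treats each modulus value as a string of binary digits, which is an assumption made only for illustration; the helper name `extend_left` is hypothetical, and the prefix bits are chosen simply to reproduce the values p2 and p3 given in the example.

```python
def extend_left(modulus_bits, prefix_bits):
    """Return the modulus value extended to the left by the given prefix bits."""
    if set(modulus_bits + prefix_bits) - {"0", "1"}:
        raise ValueError("expected strings of binary digits")
    return prefix_bits + modulus_bits


p1 = "1100"                  # preset 1st modulus value
p2 = extend_left(p1, "01")   # extended to the left by 2 bits
p3 = extend_left(p2, "1")    # extended to the left by 1 more bit
moduli = [p1, p2, p3]
```

Each modulus value in `moduli` ends with the bits of the previous one, which is the property the later layers rely on.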
Step 14: Slide the sliding window on the original data stream toward the end position by the preset length, obtain the fingerprint value of the sliding window on the original data stream, and repeat steps 12 to 14 until the start position of the sliding window reaches the end position of the original data stream.
As shown in FIG. 1B, the sliding window is slid from the start position of the original data stream toward the end position of the original data stream by the preset length. The fingerprint value of the current sliding window is taken modulo each of the 1st to Nth modulus values and step 13 is performed; the sliding window is then slid toward the end position of the original data stream by the preset length, the fingerprint value of the next sliding window is taken modulo each of the 1st to Nth modulus values, and step 13 is performed again, until the start position of the sliding window reaches the end position of the original data stream. The fingerprint value of the data of the original data stream that lies within the sliding window is called the fingerprint value of the sliding window. The fingerprint value of data may be an identifier of the data that is used to identify the data; different data have different fingerprint values.
When the fingerprint value of the sliding window is taken modulo each of the 1st to Nth modulus values and step 13 is performed, the modulo operations may be carried out one modulus value at a time, in order from the 1st modulus value to the Nth modulus value, with step 13 performed whenever a result is zero. Alternatively, the modulo operations for all N modulus values may be carried out at the same time, and step 13 then performed, in order from the 1st modulus value to the Nth modulus value, for each result that is zero.
If the fingerprint value of the sliding window modulo the nth modulus value is zero, step 13 is performed. Step 13 generates a data block and determines whether the data block or the fingerprint value of the data block is written into the space occupied by the new data stream. Step 13 includes the following operations.
Step 13 includes: taking the data in the original data stream from the end position of the previous layer-n data block to the end position of the sliding window as the current data block, where the current data block is a layer-n data block and n is any natural number greater than or equal to 1 and less than or equal to N.
In other words, if in step 12 the fingerprint value of the sliding window modulo the nth modulus value is zero, a data block is generated from the original data stream. The start position of the current data block is the end position of the previous layer-n data block in the original data stream, and the end position of the current data block is the end position of the sliding window. A data block generated for the nth modulus value is called a layer-n data block. The number of bits by which a modulus value is extended can be chosen freely; the number of bits by which the current modulus value is extended affects the length of the data blocks of the next layer.
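The boundary rule of steps 12 to 14 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the function name `chunk_by_layers`, the callback `zero_layers` (a stand-in for the fingerprint and modulo test, returning the layers whose result is zero), and the toy rule are all hypothetical. Data after the last boundary is left unhandled, since the text above does not describe that case.

```python
def chunk_by_layers(data, window_len, zero_layers):
    """Return (layer, start, end) triples in the order the blocks are generated.

    zero_layers(window) is a caller-supplied stand-in for step 12: it returns
    the layers n whose modulo result is zero for that window.
    """
    prev_end = {}   # layer -> end position of the previous block of that layer
    blocks = []
    # The window advances by its own (preset) length, as in step 14.
    for start in range(0, len(data) - window_len + 1, window_len):
        end = start + window_len
        for n in sorted(zero_layers(data[start:end])):
            # Block runs from the previous layer-n end to the window end.
            blocks.append((n, prev_end.get(n, 0), end))
            prev_end[n] = end
    return blocks


# Toy rule (hypothetical): "b" closes a layer-1 block, "c" closes layers 1 and 2.
def toy_rule(window):
    return {b"b": [1], b"c": [1, 2]}.get(window, [])


blocks = chunk_by_layers(b"abacab", 1, toy_rule)
```

With this toy input the third block generated is a layer-2 block spanning the first two layer-1 blocks.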
Step 13 further includes: when the fingerprint dictionary does not store the fingerprint value of the current data block, storing the current data block in the block dictionary, and storing the fingerprint value of the current data block and the index information of the current data block in the block dictionary in the fingerprint dictionary; if n is greater than 1, writing the current data block into the space occupied by the new data stream so as to overwrite all layer-(n-1) data blocks that the current data block includes in the new data stream, and otherwise writing the current data block sequentially into the space occupied by the new data stream.
The block dictionary stores the data blocks generated from the original data stream. The fingerprint dictionary stores, for each data block saved in the block dictionary, the fingerprint value of the data block and the index information of the data block in the block dictionary. For example, the index information of a data block in the block dictionary may include the length of the data block and the position of the data block in the block dictionary.
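The two dictionaries can be sketched as a pair of in-memory structures. The sketch assumes that the block dictionary is one contiguous byte buffer, that the index information is an (offset, length) pair, and that SHA-1 serves as the fingerprint function; none of these choices is specified above, and the class name `BlockStore` is hypothetical.

```python
import hashlib


class BlockStore:
    def __init__(self):
        self.block_dict = bytearray()   # block dictionary: stored block bytes
        self.fingerprint_dict = {}      # fingerprint -> (offset, length)

    def add(self, block):
        """Store the block if it is new; return (fingerprint, was_stored)."""
        fp = hashlib.sha1(block).digest()   # assumed fingerprint function
        if fp in self.fingerprint_dict:
            return fp, False                # already known: nothing is stored
        self.fingerprint_dict[fp] = (len(self.block_dict), len(block))
        self.block_dict += block
        return fp, True

    def get(self, fp):
        """Look up a block through its index information."""
        offset, length = self.fingerprint_dict[fp]
        return bytes(self.block_dict[offset:offset + length])


store = BlockStore()
fp_first, stored_first = store.add(b"abac")
fp_again, stored_again = store.add(b"abac")   # same content, second time
```

The boolean returned by `add` corresponds to the decision described in the text: a new block is written to the new data stream as data, a known block as its fingerprint value.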
After a data block is generated from the original data stream, it is determined whether the fingerprint dictionary stores the fingerprint value of the current data block. If the fingerprint dictionary does not store the fingerprint value of the current data block, the current data block has not been stored in the new data stream or the block dictionary, and it must be written into both. If the fingerprint dictionary stores the fingerprint value of the current data block, the current data block has already been stored in the new data stream and the block dictionary, and it does not need to be written into either again.
The nth modulus value is obtained by extending the (n-1)th modulus value to the left by a fixed number of bits. For the fingerprint value of one sliding window, if the fingerprint value modulo the nth modulus value is zero, the fingerprint value modulo the (n-1)th modulus value is also zero; the converse does not necessarily hold. Consequently, for the same sliding-window fingerprint value, generating a layer-(n-1) data block does not necessarily generate a layer-n data block, but whenever a layer-n data block is generated, a layer-(n-1) data block has also been generated. As the sliding window moves from the start position of the original data stream toward the end position, the start position of a layer-n data block is therefore the same as, or closer to the start position of the original data stream than, the start position of the layer-(n-1) data block that ends with it; it is never farther away. The length of a layer-n data block is thus greater than or equal to the length of a layer-(n-1) data block, and a layer-n data block includes one or more layer-(n-1) data blocks.
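The nesting of layers can be checked with a small sketch. The text does not state how the zero test is evaluated for a modulus value written as a bit string, so the sketch assumes one reading under which the stated implication holds: the test succeeds when the low-order bits of the fingerprint value equal the bits of the modulus value. That reading, and the names `is_zero` and `zero_layers`, are assumptions made for illustration.

```python
def is_zero(fingerprint, modulus_bits):
    """Assumed test: low-order bits of the fingerprint equal the modulus bits."""
    mask = (1 << len(modulus_bits)) - 1
    return (fingerprint & mask) == int(modulus_bits, 2)


moduli = ["1100", "011100", "1011100"]   # p1, p2, p3 from the example above


def zero_layers(fingerprint):
    """Layers n for which the assumed test succeeds."""
    return [n for n, p in enumerate(moduli, start=1) if is_zero(fingerprint, p)]


layers_p1_only = zero_layers(0b1100)      # matches p1 but not p2
layers_all = zero_layers(0b1011100)       # matches p1, p2 and p3
```

Under this assumption a fingerprint value that passes the test for layer n always passes it for layers 1 to n-1, so every layer-n boundary is also a boundary of each lower layer.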
When the fingerprint dictionary does not store the fingerprint value of the current data block and the layer n of the current data block is greater than 1, the current data block includes one or more layer-(n-1) data blocks. To keep superfluous data blocks and superfluous fingerprint values out of the new data stream, the current data block is written into the space occupied by the new data stream so that it overwrites all layer-(n-1) data blocks that it includes in the new data stream. Before the current data block is written, the position of the current data block in the new data stream holds all layer-(n-1) data blocks that the current data block includes. After the current data block is written, it has overwritten those layer-(n-1) data blocks, so that position holds the current data block.
If the layer n of the current data block is equal to 1, the current data block is a layer-1 data block and its range contains no data blocks of other layers, so the current data block is written directly at the end of the new data stream.
When the layer n of the current data block is greater than 1, the positions in the new data stream of the layer-(n-1) data blocks included in the current data block may be looked up before the current data block is written into the space occupied by the new data stream, so as to determine where in that space the current data block is to be written. The layer-(n-1) data blocks included in the layer-n current data block can be located in the new data stream by the following method.
First, after a data block is generated, the correspondence among the sequence number of the data block, the layer of the data block, and its position in the original data stream is recorded. This correspondence may be recorded in a position record table. For example, after the 3rd data block is generated, because the 3rd data block is a layer-2 data block, the correspondence recorded for it is (sequence number 3, layer 2, position in the original data stream). The sequence number of a data block indicates the order in which data blocks are generated: the smaller the sequence number, the earlier the data block was generated. After the data block or its fingerprint value is written into the new data stream, the position of the data block or its fingerprint value in the new data stream is added to the correspondence. For example, after the 3rd data block or its fingerprint value is written into the new data stream, the position of the 3rd data block or its fingerprint value in the new data stream is added to the correspondence recorded for the 3rd data block, which then becomes (sequence number 3, layer 2, position in the original data stream, position in the new data stream).
The position of a data block or its fingerprint value in a data stream may be expressed by its start position in the data stream together with the length of the data block, or by its start position and end position in the data stream. The recorded position makes it possible to find the data block or its fingerprint value in the data stream.
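A position record of this kind can be sketched as a small data structure. The field names, the class name `BlockRecord`, and the concrete position values are hypothetical; positions are expressed here as (start, length) pairs, one of the two representations mentioned above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class BlockRecord:
    seq: int                                    # order of generation
    layer: int                                  # layer of the data block
    orig_pos: Tuple[int, int]                   # (start, length) in the original stream
    new_pos: Optional[Tuple[int, int]] = None   # filled in after the write


position_table = []

# Record created when the 3rd data block (a layer-2 block) is generated.
# The position values are made up for the illustration.
record = BlockRecord(seq=3, layer=2, orig_pos=(0, 4))
position_table.append(record)

# Added once the block or its fingerprint value is written to the new stream.
record.new_pos = (0, 4)
```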
When the fingerprint dictionary does not store the fingerprint value of the current data block and the layer n of the current data block is greater than 1, the records of the layer-(n-1) data blocks that the current data block includes in the original data stream are located in the position record table according to the position of the current data block in the original data stream and the positions of the layer-(n-1) data blocks in the original data stream. The positions in the new data stream of the layer-(n-1) data blocks included in the current data block are then obtained from the located records.
A layer-(n-1) data block included in the current data block is called a layer-(n-1) data block to be replaced. After a data block is generated from the original data stream, what is written into the new data stream may be either the data block or its fingerprint value. The position in the new data stream that the position record table holds for a layer-(n-1) data block to be replaced may therefore be the position of the data block itself or the position of its fingerprint value. From the located records, the position in the new data stream of each layer-(n-1) data block to be replaced, or of its fingerprint value, is obtained. According to the obtained positions, the layer-n current data block then replaces the layer-(n-1) data blocks to be replaced, or their fingerprint values, in the new data stream, and the records of the layer-(n-1) data blocks to be replaced are deleted from the position record table.
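The locate-and-replace operation can be sketched as follows. This is a simplified model under stated assumptions: the new data stream is a Python list of entries, a position in the new data stream is a list index, each record is a plain dictionary, and the caller decides whether `payload` is a data block or a fingerprint value. The function name `write_layer_block` and the record keys are hypothetical.

```python
def write_layer_block(new_stream, table, layer, start, end, payload):
    """Write payload for a block covering [start, end) of the original stream."""
    # Locate the records of the layer-(n-1) blocks inside the current block.
    covered = [r for r in table
               if r["layer"] == layer - 1 and start <= r["start"] and r["end"] <= end]
    if covered:
        first = min(r["index"] for r in covered)
        last = max(r["index"] for r in covered)
        new_stream[first:last + 1] = [payload]   # overwrite the covered entries
        for r in covered:
            table.remove(r)                      # delete the replaced records
        for r in table:
            if r["index"] > last:
                r["index"] -= last - first       # keep later indexes valid
        index = first
    else:
        new_stream.append(payload)               # nothing to overwrite: append
        index = len(new_stream) - 1
    table.append({"layer": layer, "start": start, "end": end, "index": index})


data = b"abacab"
new_stream, table = [], []
# Blocks in generation order: two layer-1 blocks, one layer-2 block, one layer-1 block.
for layer, start, end in [(1, 0, 2), (1, 2, 4), (2, 0, 4), (1, 4, 6)]:
    write_layer_block(new_stream, table, layer, start, end, data[start:end])
```

In the run above, the layer-2 block covering positions 0 to 4 replaces the two layer-1 entries written before it, and their records are removed from the table.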
Step 13 further includes: when the fingerprint dictionary stores the fingerprint value of the current data block, if n is greater than 1, writing the fingerprint value of the current data block into the space occupied by the new data stream so as to overwrite the fingerprint values of all layer-(n-1) data blocks that the current data block includes in the new data stream, and otherwise writing the fingerprint value of the current data block sequentially into the space occupied by the new data stream.
If the fingerprint value of the current data block exists in the fingerprint dictionary, the current data block has already been stored in the block dictionary and the new data stream. If the fingerprint value of the current data block exists in the fingerprint dictionary and the current data block is a layer-1 data block, the fingerprint value of the current data block is written directly after the end of the last data block in the new data stream.
When the layer n of the current data block is greater than 1, the range of the current data block includes one or more layer-(n-1) data blocks. Therefore, if the fingerprint value of the current data block is already stored in the fingerprint dictionary when the current data block is generated, the fingerprint values of all layer-(n-1) data blocks within its range have already been written into the new data stream. To keep superfluous data blocks and superfluous fingerprint values out of the new data stream, the fingerprint value of the current data block is written into the space occupied by the new data stream so that it overwrites the fingerprint values of all layer-(n-1) data blocks that the current data block includes in the new data stream, which saves transmission bandwidth for the new data stream. Before the fingerprint value of the current data block is written, the position of the current data block in the new data stream holds the fingerprint values of all layer-(n-1) data blocks that the current data block includes. After it is written, it has overwritten those fingerprint values, and that position holds the fingerprint value of the current data block.
When the layer n of the current data block is greater than 1, the position in the new data stream of the layer n-1 data blocks included in the current data block may be looked up before the fingerprint value of the current data block is written, in order to determine where in the space occupied by the new data stream that fingerprint value is to be written. The fingerprint values of the layer n-1 data blocks within the range of the layer-n current data block can be located in the new data stream as follows. Based on the position of the current data block in the original data stream and the positions of the layer n-1 data blocks in the original data stream, the records of the layer n-1 data blocks included in the current data block are located in the above location record table. The layer n-1 data blocks included in the current data block are referred to as the layer n-1 data blocks to be replaced. From the located records, the positions in the new data stream of the fingerprint values of the layer n-1 data blocks to be replaced are obtained. According to those positions, the fingerprint values of the layer n-1 data blocks to be replaced are replaced in the new data stream with the fingerprint value of the current data block, and the records of the layer n-1 data blocks to be replaced are deleted from the location record table.
When the start position of the sliding window reaches the end position of the original data stream, blocking of the original data stream ends, and the new data stream at that point is the substitute for the original data stream. The receiving end can parse the new data stream sent by the sending end, using the block dictionary and the fingerprint dictionary, to recover the original data stream. While parsing, if a fingerprint value is found in the new data stream, the index information in the block dictionary of the data block corresponding to that fingerprint value is looked up in the fingerprint dictionary. The data block is then retrieved from the block dictionary according to the index information, and the fingerprint value is replaced with the retrieved data block in the space that the fingerprint value occupies in the new data stream.
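The location record table lookup and the overwrite of lower-layer fingerprint values described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the names `Record`, `NewStream`, and `write_fingerprint` are hypothetical, the new data stream is modeled as a Python list of tagged entries, and the sketch matches any lower-layer record inside the current block's range, which is a simplification of the "layer n-1" rule.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Record:
    """One row of the (hypothetical) location record table."""
    level: int       # layer n of the block
    orig_start: int  # start offset in the original data stream
    orig_end: int    # end offset (exclusive) in the original data stream
    new_pos: int     # index of the block's entry in the new data stream


@dataclass
class NewStream:
    entries: List[Tuple[str, bytes]] = field(default_factory=list)
    table: List[Record] = field(default_factory=list)

    def write_fingerprint(self, level: int, start: int, end: int, fp: bytes) -> None:
        """Write the fingerprint of the block [start, end) at the given layer."""
        # Locate records of lower-layer blocks that lie inside the current block.
        covered = [r for r in self.table
                   if r.level < level and start <= r.orig_start and r.orig_end <= end]
        if covered:
            first = min(r.new_pos for r in covered)
            last = max(r.new_pos for r in covered)
            # Overwrite the covered entries with the single new fingerprint.
            # The covered entries sit at the tail of the stream, because the
            # current block ends at the current window position.
            self.entries[first:last + 1] = [("fp", fp)]
            # Delete the records of the replaced blocks.
            self.table = [r for r in self.table if r not in covered]
            pos = first
        else:
            # Layer-1 block (or nothing to replace): append sequentially.
            pos = len(self.entries)
            self.entries.append(("fp", fp))
        self.table.append(Record(level, start, end, pos))
```

In this sketch a layer-1 fingerprint is always appended, while a higher-layer fingerprint collapses the entries it covers into one, so the stream never holds both a block's fingerprint and the fingerprints of the blocks it contains.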
In the data blocking method provided by this embodiment, the fingerprint value of one sliding window on the original data stream is taken modulo each of a set of modulus values, namely the 1st through the Nth, and the original data stream is divided into blocks at different layers according to the results for the different modulus values. Coarse-grained and fine-grained blocking of the original data stream can therefore be performed at the same time from the fingerprint value of a single sliding window. For a generated data block whose fingerprint value exists in the fingerprint dictionary, the layer of the data block determines whether its fingerprint value is written sequentially into the space occupied by the new data stream or replaces the fingerprint values of the next-lower-layer data blocks that it includes. For a generated data block whose fingerprint value does not exist in the fingerprint dictionary, the layer of the data block determines whether the data block is written sequentially into the space occupied by the new data stream or replaces the next-lower-layer data blocks that it includes. This avoids redundant data blocks and redundant fingerprint values in the new data stream relative to the original data stream, allows the new data stream to be generated from data blocks of different granularities with large data blocks replacing existing small ones, and improves the redundancy elimination rate of the new data stream. In this embodiment, the data blocks of every layer are produced directly on the original data stream and every fingerprint value is the fingerprint value of a data block, so no indicator bit is needed in a fingerprint value to show whether it is the fingerprint of a data block or a fingerprint of fingerprints. This reduces the overhead of the fingerprint values and further improves the redundancy elimination rate of the new data stream. In addition, because different modulus values produce data blocks of different layers and the generation of each layer's data blocks does not depend on the data blocks of other layers, data blocks of different layers can be generated concurrently from the fingerprint value of the same sliding window.
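The way one window fingerprint can yield blocks at several layers at once can be sketched as follows. This is a simplified illustration, not the embodiment's code: the function name `chunk_multilevel`, the input format of (window end, fingerprint) pairs, and the use of plain integer remainder are assumptions, and the fingerprint computation itself (for example a rolling hash) is left out.

```python
from typing import Iterable, List, Sequence, Tuple

Block = Tuple[int, int, int]  # (layer, start, end) in the original stream, end exclusive


def chunk_multilevel(
    fingerprints: Iterable[Tuple[int, int]],
    moduli: Sequence[int],
) -> List[Block]:
    """Emit blocks for every layer whose modulus divides the window fingerprint.

    `fingerprints` yields (window_end, fingerprint) pairs in stream order;
    `moduli[n - 1]` is the modulus value used for layer n.
    """
    last_end = [0] * len(moduli)  # end of the previous block, tracked per layer
    blocks: List[Block] = []
    for window_end, fp in fingerprints:
        for layer, modulus in enumerate(moduli, start=1):
            # Each layer is tested independently of the others.
            if fp % modulus == 0:
                blocks.append((layer, last_end[layer - 1], window_end))
                last_end[layer - 1] = window_end
    return blocks
```

Because each layer keeps its own "previous block end", the inner loop has no dependency between layers, which is what makes concurrent generation of the layers possible. The sketch does not merge a block that is produced identically at two layers.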
Further, to save storage space in the block dictionary, when the fingerprint value of the current data block is not stored in the fingerprint dictionary, whether to write the current data block into the block dictionary is decided by checking whether the current data block includes layer n-1 data blocks. If it does, only the fingerprint value of the current data block and the index information of the current data block in the block dictionary are stored in the fingerprint dictionary, where the index information is determined from the index information of the layer n-1 data blocks that the current data block includes. If it does not, the current data block is stored in the block dictionary, and its fingerprint value and its index information in the block dictionary are stored in the fingerprint dictionary.
Figure 2 is a schematic diagram of three-layer blocking. The blocking in Figure 2 uses the modulus values p1, p2, and p3, where p1=10. Extending p1 to the left by 1 bit gives p2=110, and extending p1 to the left by two bits gives p3=01110.
The sliding window slides from left to right over the original data stream. When the fingerprint value in the sliding window is 111010, the fingerprint value modulo p1 is 0, which produces a layer-1 block, denoted block 1. At this point the fingerprint value modulo p2 and modulo p3 is not 0. The fingerprint of block 1 is not stored in the fingerprint dictionary, so, as shown in Figure 3, block 1 is stored in the block dictionary, and the fingerprint value of block 1 and the index information of block 1 in the block dictionary are saved in the fingerprint dictionary. The index information of a block in the block dictionary comprises its address in the block dictionary, its offset, and its length. For example, the address of block 1 in the block dictionary is the location of the block dictionary entry that stores the data of block 1. Block 1 is then written to the new data stream.
The sliding window continues to slide to the right over the original data stream, and its fingerprint value becomes 010110. This fingerprint value modulo p1 is 0, which produces a layer-1 block, denoted block 2. The fingerprint value modulo p2 is also 0, which produces a layer-2 block, denoted block 3. The fingerprint value modulo p3 is not 0. The fingerprint of block 2 is not stored in the fingerprint dictionary, so, as shown in Figure 3, block 2 is stored in the block dictionary, and the fingerprint value of block 2 and the start address, offset, and length of block 2 in the block dictionary are saved in the fingerprint dictionary; block 2 is then written after the end of the last block in the new data stream, that is, block 1. After block 3 is generated, its fingerprint is likewise not stored in the fingerprint dictionary, so block 3 is stored in the block dictionary, and the fingerprint value of block 3 and the address, offset, and length of block 3 in the block dictionary are saved in the fingerprint dictionary; the layer-1 blocks 1 and 2 in the new data stream are then replaced with the layer-2 block 3.
The sliding window continues to slide to the right over the original data stream, and its fingerprint value becomes 001110. This fingerprint value modulo p1, p2, and p3 is 0 in each case, which successively produces block 4 of layer 1, block 4 of layer 2, and block 5 of layer 3. Block 4 of layer 1 and block 4 of layer 2 are the same block, so only the layer-1 block is kept, denoted block 4. After block 4 is generated, its fingerprint value is not yet stored in the fingerprint dictionary, so, as shown in Figure 3, block 4 is stored in the block dictionary, and the fingerprint value of block 4 and the start address, offset, and length of block 4 in the block dictionary are saved in the fingerprint dictionary; block 4 is then written after the end of the last block in the new data stream (block 3 at this point, since blocks 1 and 2 have been replaced). After block 5 is generated, its fingerprint value is not yet stored in the fingerprint dictionary, so block 5 is stored in the block dictionary, and the fingerprint value of block 5 and the start address, offset, and length of block 5 in the block dictionary are saved in the fingerprint dictionary; the layer-2 blocks 3 and 4 in the new data stream are then replaced with the layer-3 block 5.
The sliding window continues to slide to the right over the original data stream. If block 3 is produced again, it is determined that the fingerprint value of block 3 is already stored in the fingerprint dictionary and that block 3 is a layer-2 block. It follows that, in the new data stream, the layer-1 data blocks within the range occupied by the current block 3 have already been replaced by fingerprint values, so the fingerprint value of block 3 replaces those of the layer-1 blocks 1 and 2 that lie within the range of block 3 in the original data stream. If block 5 is produced again, it is determined that the fingerprint value of block 5 is already stored in the fingerprint dictionary and that block 5 is a layer-3 block. It follows that, in the new data stream, the layer-2 data blocks within the range occupied by the current block 5 have already been replaced by fingerprint values, so the fingerprint value of block 5 replaces those of the layer-2 blocks 3 and 4 that lie within the range of block 5 in the original data stream. If the layer-1 block 1 or block 2 is produced again, its fingerprint value is already stored in the fingerprint dictionary and it is a layer-1 block, so its fingerprint value is written directly after the end of the last block in the new data stream.
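The three fingerprint values in this walkthrough can be checked mechanically. The sketch below assumes one particular reading that the text does not state outright: a window "hits" modulus value p_i when the low-order bits of the fingerprint equal the bit pattern p_i. Under that reading the three fingerprints behave exactly as described. (Taken as plain binary integers they do not: 010110 mod 110 is 22 mod 6 = 4.) The names `PATTERNS` and `layers_hit` are hypothetical.

```python
from typing import List

# p1, p2, p3 from the Figure 2 example, kept as bit strings.
PATTERNS = {1: "10", 2: "110", 3: "01110"}


def layers_hit(fingerprint_bits: str) -> List[int]:
    """Return the layers whose pattern matches the low-order bits of the fingerprint."""
    return [layer for layer, pattern in sorted(PATTERNS.items())
            if fingerprint_bits.endswith(pattern)]
```

Because each pattern is a left-extension of the previous one, a hit at layer n implies a hit at every lower layer, so every higher-layer boundary is also a lower-layer boundary.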
Further, to save storage space in the block dictionary: as Figure 2 shows, the layer-2 block 3 consists of the layer-1 blocks 1 and 2, so when block 3 is produced its content is already stored in the block dictionary and block 3 need not be stored again. As shown in Figure 4, the offset of block 1 is 0, block 2 is stored immediately after block 1, and the offset of block 2 is the length of block 1. For the layer-2 block 3, the address of block 3 is the same as the address of block 1 (and of block 2), the offset of block 3 is the offset of block 1, and the length of block 3 is the sum of the lengths of blocks 1 and 2. Similarly, the layer-3 block 5 consists of the layer-2 blocks 3 and 4, so its content is already stored in the block dictionary and block 5 need not be stored again. The address of block 5 is the same as the address of block 3, the offset of block 5 is the offset of block 3, and the length of block 5 is the sum of the lengths of blocks 3 and 4.
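The rule for deriving a higher-layer block's index information from the blocks it contains can be sketched as follows. The type `Index` and the function `derive_index` are hypothetical names, and the sketch assumes that the contained blocks are stored contiguously under one block dictionary address, as Figure 4 is described.

```python
from typing import NamedTuple, Sequence


class Index(NamedTuple):
    address: int  # block dictionary entry that holds the data
    offset: int   # start of the block within that entry
    length: int   # number of bytes in the block


def derive_index(children: Sequence[Index]) -> Index:
    """Index of a layer-n block built from its contiguous lower-layer blocks."""
    first = children[0]
    # Same address and offset as the first child; length is the sum of all children.
    return Index(first.address, first.offset, sum(c.length for c in children))
```

With this rule only the lowest-layer data is ever written to the block dictionary; a higher-layer block costs one fingerprint dictionary entry and no additional block storage.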
Figure 5 is a schematic structural diagram of a data blocking apparatus provided by an embodiment of the present invention. As shown in Figure 5, the apparatus provided by this embodiment includes an obtaining module 56, a modulo module 51, a blocking module 52, a writing module 53, and a sliding module 50.
The obtaining module 56 is configured to obtain the fingerprint value of a sliding window on the original data stream, where the initial start position of the sliding window is the same as the start position of the original data stream and the length of the sliding window is a preset length.
The modulo module 51 is configured to take the fingerprint value of the sliding window modulo each of the 1st through Nth modulus values, where the (i+1)th modulus value is obtained by extending the ith modulus value to the left by a preset number of bits, i is any natural number greater than or equal to 1 and less than N, and N is the total number of modulus values.
The blocking module 52 is configured to, for each modulus value for which the result obtained by the modulo module is zero, take the data in the original data stream from the end position of the previous layer-n data block to the end position of the sliding window as the current data block, the current data block being a layer-n data block, where n is any natural number greater than or equal to 1 and less than or equal to N.
The writing module 53 is configured to, when the fingerprint value of the current data block generated by the blocking module is not stored in the fingerprint dictionary, store the current data block in the block dictionary and store the fingerprint value of the current data block and the index information of the current data block in the block dictionary in the fingerprint dictionary; and, if n is greater than 1, write the current data block into the space occupied by the new data stream so as to overwrite all layer n-1 data blocks included in the current data block in the new data stream, or otherwise write the current data block sequentially into the space occupied by the new data stream.
The writing module 53 is further configured to, when the fingerprint value of the current data block is stored in the fingerprint dictionary: if n is greater than 1, write the fingerprint value of the current data block into the space occupied by the new data stream so as to overwrite the fingerprint values of all layer n-1 data blocks included in the current data block in the new data stream, or otherwise write the fingerprint value of the current data block sequentially into the space occupied by the new data stream.
The sliding module 50 is configured to, after the writing module writes the current data block generated by the blocking module, or the fingerprint value of the current data block, into the space occupied by the new data stream, slide the sliding window by the preset length toward the end position of the original data stream and obtain the fingerprint value of the sliding window on the original data stream, so that the modulo module 51, the writing module 53, and the sliding module 50 repeat their operations until the start position of the sliding window reaches the end position of the original data stream.
Optionally, the writing module 53 is further configured to, if the current data block includes layer n-1 data blocks, store the fingerprint value of the current data block and the index information of the current data block in the block dictionary in the fingerprint dictionary, where the index information of the current data block in the block dictionary is determined from the index information of the layer n-1 data blocks included in the current data block.
Optionally, the writing module 53 is further configured to, if the current data block does not include any layer n-1 data block, store the current data block in the block dictionary, and store the fingerprint value of the current data block and the index information of the current data block in the block dictionary in the fingerprint dictionary.
In the data blocking apparatus provided by this embodiment, the modulo module takes the fingerprint value of one sliding window on the original data stream modulo each of a set of modulus values, namely the 1st through the Nth, and the blocking module divides the original data stream into blocks at different layers according to the results for the different modulus values. Coarse-grained and fine-grained blocking of the original data stream can therefore be performed at the same time from the fingerprint value of a single sliding window. For a generated data block whose fingerprint value exists in the fingerprint dictionary, the writing module determines from the layer of the data block whether its fingerprint value is written sequentially into the space occupied by the new data stream or replaces the fingerprint values of the next-lower-layer data blocks that it includes. For a generated data block whose fingerprint value does not exist in the fingerprint dictionary, the writing module determines from the layer of the data block whether the data block is written sequentially into the space occupied by the new data stream or replaces the next-lower-layer data blocks that it includes. This avoids redundant data blocks and redundant fingerprint values in the new data stream relative to the original data stream, allows the new data stream to be generated from data blocks of different granularities with large data blocks replacing existing small ones, and improves the redundancy elimination rate of the new data stream. In this embodiment, the data blocks of every layer are produced directly on the original data stream and every fingerprint value is the fingerprint value of a data block, so no indicator bit is needed in a fingerprint value to show whether it is the fingerprint of a data block or a fingerprint of fingerprints. This reduces the overhead of the fingerprint values and further improves the redundancy elimination rate of the new data stream. In addition, because different modulus values produce data blocks of different layers and the generation of each layer's data blocks does not depend on the data blocks of other layers, data blocks of different layers can be generated concurrently from the fingerprint value of the same sliding window.
As shown in Figure 6, the apparatus in Figure 5 may further include a positioning module 54 and a recording module 55.
The recording module 55 is configured to, after the blocking module 52 takes the data in the original data stream from the end position of the previous layer-n data block to the end position of the sliding window as the current data block, record the correspondence between the sequence number of the current data block and the layer of the current data block, the position of the current data block in the original data stream, and the position of the current data block in the new data stream, and save the correspondence in the location record table.
The positioning module 54 is configured to look up the position in the new data stream of the layer n-1 data blocks included in the current data block, before the writing module 53 writes the current data block into the space occupied by the new data stream and/or before the writing module 53 writes the fingerprint value of the current data block into the space occupied by the new data stream.
Further, the positioning module 54 is specifically configured to locate, in the location record table, the records of the layer n-1 data blocks included in the current data block in the original data stream, based on the position of the current data block in the original data stream and the positions of the layer n-1 data blocks in the original data stream; and to obtain, from the located records, the positions in the new data stream of the layer n-1 data blocks included in the current data block.
To restore the original data stream from the new data stream, the apparatus shown in Figures 5 and 6 may further include a restoring module.
The restoring module is configured to, while parsing the new data stream into the original data stream, look up in the fingerprint dictionary the index information in the block dictionary of the data block corresponding to a fingerprint value found in the new data stream.
The restoring module is further configured to retrieve the data block corresponding to the fingerprint value from the block dictionary according to the index information, and to replace the fingerprint value with the retrieved data block in the space that the fingerprint value occupies in the new data stream.
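The restoring step can be sketched as follows. This is an illustration under assumptions rather than the embodiment's code: the function name `restore` is hypothetical, the new data stream is modeled as a list of tagged entries, and the way a real stream distinguishes a fingerprint value from raw block data is not specified in this section.

```python
from typing import Dict, List, Tuple


def restore(
    new_stream: List[Tuple[str, bytes]],
    fingerprint_dict: Dict[bytes, Tuple[int, int, int]],
    block_dict: Dict[int, bytes],
) -> bytes:
    """Rebuild the original data from entries tagged "fp" or "data".

    fingerprint_dict maps a fingerprint to (address, offset, length);
    block_dict maps an address to the stored bytes of that entry.
    """
    out = bytearray()
    for kind, payload in new_stream:
        if kind == "fp":
            # Look up the index information, then fetch the block data.
            address, offset, length = fingerprint_dict[payload]
            out += block_dict[address][offset:offset + length]
        else:
            # Raw block data is copied through unchanged.
            out += payload
    return bytes(out)
```

A fingerprint for a higher-layer block resolves with a single lookup, because its index information already spans all the lower-layer blocks it contains.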
The modules shown in Figures 5 and 6 may be implemented by a processor. Optionally, the apparatus shown in Figures 5 and 6 may further include a storage module configured to store the block dictionary and the fingerprint dictionary. Optionally, the storage module may be implemented by a memory.
A person of ordinary skill in the art will understand that all or some of the steps of the above method embodiments may be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes any medium that can store program code, such as ROM, RAM, a magnetic disk, or an optical disc.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1、 一种数据分块方法, 其特征在于, 包括: A data blocking method, comprising:
11.获取原数据流上滑动窗口的指纹值, 所述滑动窗口的初始起始位置与所 述原数据流的起始位置相同, 所述滑动窗口的长度为预设长度;  11. Obtain a fingerprint value of a sliding window on the original data stream, where an initial starting position of the sliding window is the same as a starting position of the original data stream, and a length of the sliding window is a preset length;
12.采用所述滑动窗口的指纹值, 分别对第 1个模值至第 N个模值进行取模, 对于每个模值, 如果取值后的值为零均执行歩骤 13和 14, 其中, 将第 i个模值向 左扩展预设位数后得到第 i+1个模值, 所述 i为大于等于 1并且小于所述 N的任一 个自然数, 所述 N为模值的总个数;  12. Using the fingerprint value of the sliding window, respectively modulating the first modulo value to the Nth modulo value, and for each modulo value, if the value after the value is zero, steps 13 and 14 are performed, The i-th modulo value is extended to the left by a preset number of bits to obtain an i+1th modulo value, where i is greater than or equal to 1 and less than any natural number of the N, and the N is a total of modulo values. Number
13.将所述原数据流中从第 n层的上一个数据块的结束位置开始到所述滑动 窗口的结束位置之间的数据作为当前数据块, 所述当前数据块为第 n层数据块; 其中, 所述 n为大于等于 1并且小于等于所述 N的任一个自然数:  13. Data in the original data stream from an end position of a previous data block of the nth layer to an end position of the sliding window as a current data block, where the current data block is an nth layer data block Wherein n is one or more natural numbers that are greater than or equal to 1 and less than or equal to:
在指纹字典中没有存储所述当前数据块的指纹值时, 将所述当前数据块存 储到分块字典中, 将所述当前数据块的指纹值和所述当前数据块在所述分块字 典中的索引信息存储到所述指纹字典中, 当所述 n大于 1时, 在新数据流占用的 空间中写入所述当前数据块, 以覆盖在所述新数据流中所述当前数据块包括的 所有第 n-1层的数据块, 当所述 n等于 1时, 将所述当前数据块顺序写入所述新数 据流占用的空间;  When the fingerprint value of the current data block is not stored in the fingerprint dictionary, storing the current data block into a block dictionary, and the fingerprint value of the current data block and the current data block in the block dictionary The index information in the fingerprint dictionary is stored in the fingerprint dictionary. When the n is greater than 1, the current data block is written in a space occupied by the new data stream to overwrite the current data block in the new data stream. All n-1th data blocks included, when the n is equal to 1, the current data block is sequentially written into the space occupied by the new data stream;
when the fingerprint value of the current data block is stored in the fingerprint dictionary and n is greater than 1, write the fingerprint value of the current data block into the space occupied by the new data stream so as to overwrite the fingerprint values of all the (n-1)th-layer data blocks that the current data block includes in the new data stream; when the fingerprint value of the current data block is stored in the fingerprint dictionary and n is equal to 1, write the fingerprint value of the current data block sequentially into the space occupied by the new data stream;
14. Slide the sliding window along the original data stream toward the end position by the preset length, obtain the fingerprint value of the sliding window on the original data stream, and repeat steps 12 to 14 until the start position of the sliding window reaches the end position of the original data stream.
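For illustration only, the blocking loop of steps 11 to 14 can be sketched in Python. The sketch assumes that every modulus value is a power of two, so that extending a modulus to the left by a preset number of bits yields a larger power of two, and it takes the fingerprint function as a parameter. The names `layered_blocks`, `base_bits`, and `extend_bits` are hypothetical and do not appear in the claims. The dictionary writes of step 13 are omitted, and trailing data shorter than one window is ignored.

```python
from typing import Callable, List, Tuple


def layered_blocks(
    data: bytes,
    window: int,
    base_bits: int,
    extend_bits: int,
    layers: int,
    fingerprint: Callable[[bytes], int],
) -> List[Tuple[int, int, int]]:
    """Return (layer, start, end) for each block, in the order blocks close."""
    # Assumed reading of "extend to the left by a preset number of bits":
    # the modulus of layer n is 2 ** (base_bits + (n - 1) * extend_bits).
    moduli = [1 << (base_bits + i * extend_bits) for i in range(layers)]
    last_end = [0] * layers  # end position of the previous block, per layer
    blocks: List[Tuple[int, int, int]] = []
    pos = 0
    while pos + window <= len(data):
        end = pos + window
        fp = fingerprint(data[pos:end])          # steps 11 and 14
        for n, modulus in enumerate(moduli, start=1):
            if fp % modulus == 0:                # step 12
                # Step 13: the block runs from the previous layer-n cut
                # to the end of the sliding window.
                blocks.append((n, last_end[n - 1], end))
                last_end[n - 1] = end
        pos += window                            # slide by the preset length
    return blocks
```

Under this assumption every layer-(n+1) cut point is also a layer-n cut point, so each higher-layer block is made up of whole lower-layer blocks.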
2. The method according to claim 1, wherein storing the current data block in the block dictionary, and storing the fingerprint value of the current data block and the index information of the current data block in the block dictionary in the fingerprint dictionary, comprises:
when the current data block includes (n-1)th-layer data blocks, storing the fingerprint value of the current data block and the index information of the current data block in the block dictionary in the fingerprint dictionary, where the index information of the current data block in the block dictionary is determined according to the index information of the (n-1)th-layer data blocks included in the current data block;
when the current data block does not include any (n-1)th-layer data block, storing the current data block in the block dictionary, and storing the fingerprint value of the current data block and the index information of the current data block in the block dictionary in the fingerprint dictionary.
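One possible reading of claim 2 is that a block made up of lower-layer blocks is not stored a second time, and its index information simply reuses the index information of those lower-layer blocks. The Python sketch below follows that reading. The class `Dictionaries`, the use of SHA-1 as the fingerprint, and the representation of index information as a list of positions in the block dictionary are all assumptions made for illustration; the sketch also assumes the sub-blocks are already registered and together cover the whole block.

```python
import hashlib
from typing import Dict, List


class Dictionaries:
    """Hypothetical pairing of a block dictionary and a fingerprint dictionary."""

    def __init__(self) -> None:
        self.block_dict: List[bytes] = []        # stored block payloads
        self.fp_dict: Dict[str, List[int]] = {}  # fingerprint -> index info

    @staticmethod
    def fingerprint(block: bytes) -> str:
        # SHA-1 is an assumed stand-in for the fingerprint function.
        return hashlib.sha1(block).hexdigest()

    def register(self, block: bytes, sub_blocks: List[bytes]) -> List[int]:
        if sub_blocks:
            # Block includes (n-1)th-layer blocks: derive its index
            # information from theirs and store no new payload.
            index = [
                i
                for sub in sub_blocks
                for i in self.fp_dict[self.fingerprint(sub)]
            ]
        else:
            # Block includes no lower-layer block: store the payload itself.
            self.block_dict.append(block)
            index = [len(self.block_dict) - 1]
        self.fp_dict[self.fingerprint(block)] = index
        return index
```

With this layout, registering `b"ab"` and `b"cd"` as separate blocks and then `b"abcd"` with those two as sub-blocks leaves only two payloads in the block dictionary.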
3. The method according to claim 1 or 2, wherein before the current data block is written into the space occupied by the new data stream, and/or before the fingerprint value of the current data block is written into the space occupied by the new data stream, the method further comprises:
searching for the position, in the new data stream, of the (n-1)th-layer data blocks included in the current data block.
4. The method according to claim 3, wherein:
after the data in the original data stream from the end position of the previous data block of the nth layer to the end position of the sliding window is used as the current data block, the method further comprises:
recording the correspondence between the sequence number of the current data block and the layer of the current data block, the position of the current data block in the original data stream, and the position of the current data block in the new data stream, and saving the correspondence in a position record table;
searching for the position, in the new data stream, of the (n-1)th-layer data blocks included in the current data block comprises:
locating, in the position record table, the records of the (n-1)th-layer data blocks that the current data block includes in the original data stream, according to the position of the current data block in the original data stream and the positions of the (n-1)th-layer data blocks in the original data stream;
obtaining, from the located records of the (n-1)th-layer data blocks, the positions in the new data stream of the (n-1)th-layer data blocks included in the current data block.
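The lookup in claim 4 can be sketched as a filter over the position record table. The record layout below (a dataclass named `Record` with start and end offsets in the original data stream and a single offset in the new data stream) is a hypothetical one chosen for illustration; the claim only requires that the sequence number, layer, and the two positions be associated with each other.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    seq: int         # sequence number of the block
    layer: int       # layer the block belongs to
    orig_start: int  # start offset in the original data stream
    orig_end: int    # end offset in the original data stream
    new_pos: int     # offset in the new data stream


def sub_block_positions(
    table: List[Record], layer: int, orig_start: int, orig_end: int
) -> List[int]:
    """Positions in the new data stream of the (layer-1) blocks that lie
    inside the span [orig_start, orig_end) of the current block."""
    return [
        r.new_pos
        for r in table
        if r.layer == layer - 1
        and orig_start <= r.orig_start
        and r.orig_end <= orig_end
    ]
```

A linear scan keeps the sketch short; a real table would more likely be indexed by layer and original position.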
5. The method according to claim 1 or 2, further comprising:
in the process of parsing the new data stream back into the original data stream, when a fingerprint value is found in the new data stream, searching the fingerprint dictionary for the index information, in the block dictionary, of the data block corresponding to the fingerprint value;
searching the block dictionary, according to the index information, for the data block corresponding to the fingerprint value, and replacing the fingerprint value, at the position it occupies in the space of the new data stream, with the data block found.
6. A data blocking device, comprising:
an acquisition module, configured to obtain a fingerprint value of a sliding window on the original data stream, where the initial start position of the sliding window is the same as the start position of the original data stream, and the length of the sliding window is a preset length;
a modulo module, configured to take the fingerprint value of the sliding window modulo each of the 1st to Nth modulus values, where the (i+1)th modulus value is obtained by extending the ith modulus value to the left by a preset number of bits, i is any natural number greater than or equal to 1 and less than N, and N is the total number of modulus values;
a blocking module, configured to, for each modulus value, if the result obtained by the modulo module is zero, use the data in the original data stream from the end position of the previous data block of the nth layer to the end position of the sliding window as the current data block, the current data block being an nth-layer data block, where n is any natural number greater than or equal to 1 and less than or equal to N;
a write module, configured to, when the fingerprint value of the current data block produced by the blocking module is not stored in a fingerprint dictionary, store the current data block in a block dictionary, and store the fingerprint value of the current data block and the index information of the current data block in the block dictionary in the fingerprint dictionary; when n is greater than 1, write the current data block into the space occupied by a new data stream so as to overwrite all the (n-1)th-layer data blocks that the current data block includes in the new data stream; and when n is equal to 1, write the current data block sequentially into the space occupied by the new data stream;
the write module being further configured to, when the fingerprint value of the current data block is stored in the fingerprint dictionary and n is greater than 1, write the fingerprint value of the current data block into the space occupied by the new data stream so as to overwrite the fingerprint values of all the (n-1)th-layer data blocks that the current data block includes in the new data stream; and when the fingerprint value of the current data block is stored in the fingerprint dictionary and n is equal to 1, write the fingerprint value of the current data block sequentially into the space occupied by the new data stream;
a sliding module, configured to, after the write module writes the current data block produced by the blocking module, or the fingerprint value of the current data block, into the space occupied by the new data stream, slide the sliding window along the original data stream toward the end position by the preset length and obtain the fingerprint value of the sliding window on the original data stream, so that the modulo module, the write module, and the sliding module repeat their operations until the start position of the sliding window reaches the end position of the original data stream.
7. The device according to claim 6, wherein when the current data block includes (n-1)th-layer data blocks, the index information of the current data block in the block dictionary is determined according to the index information of the (n-1)th-layer data blocks included in the current data block.
8. The device according to claim 6 or 7, further comprising:
a positioning module, configured to, before the write module writes the current data block into the space occupied by the new data stream, and/or before the write module writes the fingerprint value of the current data block into the space occupied by the new data stream, search for the position, in the new data stream, of the (n-1)th-layer data blocks included in the current data block.
9. The device according to claim 8, further comprising:
a recording module, configured to, after the blocking module uses the data in the original data stream from the end position of the previous data block of the nth layer to the end position of the sliding window as the current data block, record the correspondence between the sequence number of the current data block and the layer of the current data block, the position of the current data block in the original data stream, and the position of the current data block in the new data stream, and save the correspondence in a position record table;
the positioning module being specifically configured to locate, in the position record table, the records of the (n-1)th-layer data blocks that the current data block includes in the original data stream, according to the position of the current data block in the original data stream and the positions of the (n-1)th-layer data blocks in the original data stream; and to obtain, from the located records of the (n-1)th-layer data blocks, the positions in the new data stream of the (n-1)th-layer data blocks included in the current data block.
10. The device according to claim 6 or 7, further comprising:
a restoring module, configured to, in the process of parsing the new data stream back into the original data stream, when a fingerprint value is found in the new data stream, search the fingerprint dictionary for the index information, in the block dictionary, of the data block corresponding to the fingerprint value;
the restoring module being further configured to search the block dictionary, according to the index information, for the data block corresponding to the fingerprint value, and to replace the fingerprint value, at the position it occupies in the space of the new data stream, with the data block found.
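The restoration described in claims 5 and 10 can be sketched as follows. The token representation of the new data stream, a list of `("data", bytes)` and `("fp", fingerprint)` pairs, is an assumption made for illustration, as are the function name `restore` and the use of a list of block-dictionary positions as index information; the claims do not specify how fingerprint values are told apart from raw data in the new data stream.

```python
from typing import Dict, List, Tuple, Union

Token = Tuple[str, Union[bytes, str]]


def restore(
    new_stream: List[Token],
    fp_dict: Dict[str, List[int]],
    block_dict: List[bytes],
) -> bytes:
    """Rebuild the original data by replacing every fingerprint token with
    the data block that the two dictionaries resolve it to."""
    out = bytearray()
    for kind, value in new_stream:
        if kind == "fp":
            # Fingerprint dictionary gives the index information; the block
            # dictionary gives the payload(s) at those indexes.
            for i in fp_dict[value]:
                out += block_dict[i]
        else:
            out += value  # raw data is copied through unchanged
    return bytes(out)
```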
PCT/CN2014/082237 2013-07-23 2014-07-15 Data blocking method and device WO2015010555A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201310312216.7 2013-07-23
CN201310312216.7A CN104348571B (en) 2013-07-23 2013-07-23 Deblocking method and device

Publications (1)

Publication Number Publication Date
WO2015010555A1 true WO2015010555A1 (en) 2015-01-29

Family

ID=52392700

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/082237 WO2015010555A1 (en) 2013-07-23 2014-07-15 Data blocking method and device

Country Status (2)

Country Link
CN (1) CN104348571B (en)
WO (1) WO2015010555A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108958572A (en) * 2017-05-25 2018-12-07 腾讯科技(深圳)有限公司 Message data processing method, device, storage medium and computer equipment

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105787107B (en) * 2016-03-22 2018-10-30 南京工程学院 A kind of big data redundant detecting method
CN108256352B (en) * 2018-01-15 2021-10-22 北京安博通科技股份有限公司 Method, device and terminal for automatically packaging web protection feature library
CN111722787B (en) * 2019-03-22 2021-12-03 华为技术有限公司 Blocking method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040143713A1 (en) * 2003-01-22 2004-07-22 Niles Ronald S. System and method for backing up data
CN101706825A (en) * 2009-12-10 2010-05-12 华中科技大学 Replicated data deleting method based on file content types
CN101968796A (en) * 2010-09-09 2011-02-09 北京邮电大学 Method for segmenting bidirectionally and concurrently executed file level variable-length data
CN103019887A (en) * 2012-12-12 2013-04-03 华为技术有限公司 Data backup method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102479245B (en) * 2010-11-30 2013-07-17 英业达集团(天津)电子技术有限公司 Data block segmentation method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040143713A1 (en) * 2003-01-22 2004-07-22 Niles Ronald S. System and method for backing up data
CN101706825A (en) * 2009-12-10 2010-05-12 华中科技大学 Replicated data deleting method based on file content types
CN101968796A (en) * 2010-09-09 2011-02-09 北京邮电大学 Method for segmenting bidirectionally and concurrently executed file level variable-length data
CN103019887A (en) * 2012-12-12 2013-04-03 华为技术有限公司 Data backup method and device

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108958572A (en) * 2017-05-25 2018-12-07 腾讯科技(深圳)有限公司 Message data processing method, device, storage medium and computer equipment
CN108958572B (en) * 2017-05-25 2022-12-16 腾讯科技(深圳)有限公司 Message data processing method, device, storage medium and computer equipment

Also Published As

Publication number Publication date
CN104348571A (en) 2015-02-11
CN104348571B (en) 2018-02-06

Similar Documents

Publication Publication Date Title
KR101710025B1 (en) Joint rewriting and error correction in write-once memories
WO2018076952A1 (en) Method and apparatus for storage and playback positioning of video file
US20170308437A1 (en) Parity protection for data chunks in an object storage system
US20210271557A1 (en) Data encoding, decoding and recovering method for a distributed storage system
WO2015010555A1 (en) Data blocking method and device
JP2007507989A5 (en)
WO2017020576A1 (en) Method and apparatus for file compaction in key-value storage system
WO2014067063A1 (en) Duplicate data retrieval method and device
CN110795272B (en) Method and system for atomic and latency guarantees facilitated on variable-size I/O
CN103929609B (en) A kind of video recording playback method and device
WO2013078644A1 (en) Route prefix storage method and device and route address searching method and device
CN104581406A (en) Network video recording and playback system and method
US20230177013A1 (en) System and method for error-resilient data reduction
US11385794B2 (en) System and method for data compaction and security using multiple encoding algorithms
US11366790B2 (en) System and method for random-access manipulation of compacted data files
CN107229620A (en) The storage method and device of a kind of video data
CN1463441A (en) Trick play for MP3
CN112380383A (en) Efficient fault-tolerant indexing method for real-time video stream data
CN105188075B (en) Voice quality optimization method and device, terminal
US8868429B2 (en) Method and device for storing audio data
CN110381128A (en) A kind of method for uploading and cloud storage model suitable for files in stream media
CN116594572B (en) Floating point number stream data compression method, device, computer equipment and medium
CN102522088B (en) Decoding method and device of audio frequency
US8988258B2 (en) Hardware compression using common portions of data
CN102263606B (en) Channel data coding and decoding method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14829730

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14829730

Country of ref document: EP

Kind code of ref document: A1