CN114518846A - Data processing method and device - Google Patents

Publication number
CN114518846A
Authority
CN
China
Prior art keywords
data
data block
block
check
algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210023708.3A
Other languages
Chinese (zh)
Inventor
魏舒展
赵亚飞
顾隽清
陈亮
董元元
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd filed Critical Alibaba China Co Ltd
Priority to CN202210023708.3A priority Critical patent/CN114518846A/en
Publication of CN114518846A publication Critical patent/CN114518846A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/062Securing storage systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0604Improving or facilitating administration, e.g. storage management
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0614Improving the reliability of storage systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638Organizing or formatting or addressing of data
    • G06F3/064Management of blocks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Techniques For Improving Reliability Of Storages (AREA)

Abstract

The embodiments of this specification provide a data processing method and a data processing device. The data processing method comprises: dividing data to be processed into a plurality of data blocks; dividing the plurality of data blocks into at least two data block sets according to a calculation rule of a preset check algorithm; determining a data block group according to adjacent data block sets among the at least two data block sets; processing the data block group according to the preset check algorithm to obtain an initial check block of the data block group; and determining a target check block of the data to be processed according to the initial check block. The data to be processed can subsequently be recovered based on the initial check block, which ensures the safety of the data to be processed. The initial check block can in turn be recovered based on the target check block, which ensures the safety of the initial check block; meanwhile, the data redundancy ratio of the check blocks is reduced, which further reduces the storage cost of the data.

Description

Data processing method and device
Technical Field
The embodiment of the specification relates to the technical field of data processing, in particular to a data processing method.
Background
With the development of computer technology, erasure code technology, which can minimize the storage overhead of a system while ensuring data reliability, is widely applied in the technical field of data storage in order to ensure data reliability and lower storage cost. For example, when data is stored in a multi-AZ (availability zone) manner, each AZ is an independently managed physical data center; corresponding erasure code data is configured for the user data through erasure code technology, and the user data and the erasure code data are stored in a plurality of separate data centers. In this way, when a single AZ suffers a machine room fault, a network equipment fault, or the like and user data is lost, the user data can still be recovered through the erasure code data, which ensures the security of the user data.
However, in order to ensure the security of erasure code data under the condition of abnormal machine rooms and machines, the designed erasure code data often has a high data redundancy ratio, which increases the storage cost of the data to a certain extent.
Disclosure of Invention
In view of this, the present specification provides a data processing method. One or more embodiments of the present specification also relate to a data processing apparatus, a computing device, a computer-readable storage medium, and a computer program, so as to solve the technical problems in the prior art.
According to a first aspect of embodiments herein, there is provided a data processing method including:
dividing data to be processed into a plurality of data blocks;
dividing the plurality of data blocks into at least two data block sets according to a calculation rule of a preset check algorithm;
determining a data block group according to an adjacent data block set in the at least two data block sets;
processing the data block group according to the preset check algorithm to obtain an initial check block of the data block group;
and determining a target check block of the data to be processed according to the initial check block.
According to a second aspect of embodiments of the present specification, there is provided a data processing apparatus comprising:
a slicing module configured to slice data to be processed into a plurality of data blocks;
a dividing module configured to divide the plurality of data blocks into at least two data block sets according to a calculation rule of a preset check algorithm;
a first determining module configured to determine a data block group according to adjacent data block sets among the at least two data block sets;
a processing module configured to process the data block group according to the preset check algorithm to obtain an initial check block of the data block group; and
a second determining module configured to determine a target check block of the data to be processed according to the initial check block.
According to a third aspect of embodiments herein, there is provided a computing device comprising:
a memory and a processor;
the memory is for storing computer-executable instructions and the processor is for executing the computer-executable instructions, which when executed by the processor implement the steps of the data processing method.
According to a fourth aspect of embodiments herein, there is provided a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the data processing method.
According to a fifth aspect of embodiments herein, there is provided a computer program, wherein the computer program, when executed in a computer, causes the computer to perform the steps of the data processing method.
The data processing method provided by this specification comprises: dividing data to be processed into a plurality of data blocks; dividing the plurality of data blocks into at least two data block sets according to a calculation rule of a preset check algorithm; determining a data block group according to adjacent data block sets among the at least two data block sets; processing the data block group according to the preset check algorithm to obtain an initial check block of the data block group; and determining a target check block of the data to be processed according to the initial check block.
Specifically, the method determines data block sets based on the data blocks obtained after the data to be processed is segmented; it then determines a data block group according to adjacent data block sets among the at least two data block sets, and processes the data block group according to the preset check algorithm to obtain an initial check block of the data block group, which facilitates subsequent recovery of the data to be processed based on the initial check block and ensures the safety of the data to be processed. In addition, the target check block of the data to be processed is determined according to the initial check block, so that the initial check block can subsequently be recovered based on the target check block, which ensures the safety of the initial check block; meanwhile, the data redundancy ratio of the check blocks is reduced, which further reduces the storage cost of the data.
Drawings
FIG. 1 is a diagram illustrating a design of erasure coded data according to an embodiment of the present disclosure;
FIG. 2 is a flow chart of a data processing method provided by an embodiment of the present specification;
FIG. 3 is a schematic diagram of erasure codes in a data processing method according to an embodiment of the present disclosure;
FIG. 4 is a flowchart illustrating a data processing method according to an embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present specification;
FIG. 6 is a block diagram of a computing device according to an embodiment of the present disclosure.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present description. This description may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein, as those skilled in the art will be able to make and use the present disclosure without departing from the spirit and scope of the present disclosure.
The terminology used in the description of the one or more embodiments is for the purpose of describing the particular embodiments only and is not intended to be limiting of the description of the one or more embodiments. As used in one or more embodiments of the present specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It will be understood that, although the terms first, second, etc. may be used in one or more embodiments of this specification to describe various information, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first can also be referred to as a second and, similarly, a second can also be referred to as a first without departing from the scope of one or more embodiments of the present description. The word "if" as used herein may be interpreted as "at the time of …", "when …", or "in response to a determination", depending on the context.
First, the noun terms referred to in one or more embodiments of the present specification are explained.
Erasure code (Erasure Code): a fault-tolerant coding technique. Its basic principle is to fragment the stored data into k parts of original data, generate k + m parts of data from them through a certain check calculation, and restore the original data from any k of the k + m parts. Thus, even if part of the data is lost, the system can still recover the original data.
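As a minimal sketch of this principle, assuming the simplest case m = 1 and substituting bytewise XOR parity for a general Reed-Solomon code, k equal-sized blocks yield k + 1 stored blocks, and any single lost block can be rebuilt from the remaining k. The function names `xor_blocks`, `encode`, and `recover` are illustrative and do not come from the specification.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode(data_blocks):
    """Return the k data blocks followed by one XOR parity block (m = 1)."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def recover(stored, lost_index):
    """Rebuild the block at lost_index from the other k stored blocks."""
    survivors = [b for i, b in enumerate(stored) if i != lost_index]
    return xor_blocks(survivors)


data = [b"abcd", b"efgh", b"ijkl"]   # k = 3 original blocks
stored = encode(data)                # k + m = 4 stored blocks
rebuilt = recover(stored, 1)         # pretend block 1 was lost
```

XOR parity tolerates only one lost block; tolerating m > 1 losses requires a code such as Reed-Solomon, which this sketch does not implement.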
Availability zone (AZ): a physical area within a region whose power supply and network are independent of those of other zones; for example, an AZ may be an independently managed physical data center. Network latency between instances within the same availability zone is lower.
Multi-AZ (multiple Availability Zones): under a multi-AZ scheme, data is distributed across a plurality of independent data centers, which guarantees that the data remains available when a single AZ suffers a machine room or network equipment failure.
Data redundancy ratio: an erasure code algorithm segments data into k data blocks and then encodes the data blocks to generate m redundant check blocks, so as to achieve fault tolerance. The ratio of the total amount of stored data to the original data, that is, (k + m)/k, is the data redundancy ratio of the erasure code.
With the continuous development of computer technology, the storage scale of distributed systems is becoming larger and larger, and device errors in distributed systems are also a significant problem. Therefore, the storage cost and the reliability of data are factors to be considered when designing a distributed system. Erasure codes can minimize the storage overhead of the system while ensuring the reliability of data, so erasure code technology is widely applied in the technical field of storage.
With increasing demands on data reliability by users, many data storage systems support data storage in a multi-AZ manner. Each AZ is an independently managed physical data center. When data is stored in a multi-AZ manner, user data is dispersed into a plurality of independent data centers, so that the data of users can still be accessed when a single AZ encounters a computer room or network equipment failure.
Meanwhile, in order to reduce the storage space occupation of data in multi-AZ storage, erasure codes are widely applied to multi-AZ data storage of a data storage system. When a single AZ has a computer room or network equipment failure, the system can acquire data from other computer rooms and recover abnormal data to provide the data for users. However, in order to ensure the security of erasure code data under the condition of abnormal machine rooms and machines, the erasure code design under multi-AZ often has a higher data redundancy ratio, which increases the storage cost of data to a certain extent.
Referring to FIG. 1, FIG. 1 is a schematic diagram of a design scheme of erasure code data according to an embodiment of the present specification, where: AZ is an availability zone; A1-A10 are data blocks obtained after user data is segmented; P(1-1) and P(2-1) are local check blocks generated from A1-A5 through RS erasure code calculation; P(1-2) and P(2-2) are local check blocks generated from A6-A10 by the same RS erasure code calculation; and X1-X7 are local check blocks generated by RS erasure code calculation over the corresponding copies in AZ1 and AZ2, for example, X1 is calculated from A1 and A6. However, this design has the problem of high data redundancy. As shown in FIG. 1, when the number of local check blocks in a single AZ is set to 1, the system can tolerate the loss of only any 3 copies (assuming that check blocks P(2-1), P(2-2), and X7 do not exist, then when data blocks A1, A2, A6, and A7 are lost, these four data blocks are unrecoverable). Therefore, in order to tolerate any 4 copy errors, two local check blocks need to be stored in each data AZ, so that 11 check blocks must be generated for 10 data blocks, resulting in a high data redundancy ratio, which may be as high as 2.1.
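The figure of 2.1 follows directly from the definition (k + m)/k. The short sketch below reproduces the arithmetic using the block counts named for FIG. 1; the function name `redundancy_ratio` and the variable names are illustrative only.

```python
from fractions import Fraction


def redundancy_ratio(k, m):
    """Data redundancy ratio (k + m) / k for k data blocks and m check blocks."""
    return Fraction(k + m, k)


# FIG. 1 with two local check blocks stored in each data AZ:
data_blocks = 10            # A1-A10
per_az_checks = 2 + 2       # P(1-1), P(2-1) and P(1-2), P(2-2)
x_checks = 7                # X1-X7
ratio = redundancy_ratio(data_blocks, per_az_checks + x_checks)
print(float(ratio))  # 2.1
```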
In view of this, in the present specification, a data processing method is provided, and the present specification simultaneously relates to a data processing apparatus, a computing device, a computer-readable storage medium, and a computer program, which are described in detail one by one in the following embodiments.
Fig. 2 is a flowchart illustrating a data processing method according to an embodiment of the present specification, which specifically includes the following steps.
Step 202: dividing the data to be processed into a plurality of data blocks.
In practical applications, the data processing method can be applied to a data storage system capable of supporting data storage in a multi-AZ (availability zone) manner; for example, the data storage system may be a distributed storage system that supports data storage in a multi-AZ manner.
The data to be processed may be understood as data that needs to be stored, for example, data such as a document or a multimedia file that is sent by a user and needs to be stored. The data block is obtained by splitting the data to be processed, and the data block includes partial data content in the data to be processed, for example, the data to be processed sent by the user is 10MB, and the data block can be split into 10 data blocks, and the size of each data block is 1 MB.
Specifically, the data storage system can segment the data to be processed when receiving the data to be processed, thereby obtaining a plurality of data blocks.
In a specific implementation process, the data storage system can segment the data to be processed based on the calculation rule of the preset check algorithm, so that a plurality of data blocks obtained after segmentation can be dispersedly stored in the data storage center. The specific implementation is as follows.
The dividing of the data to be processed into a plurality of data blocks comprises:
receiving data to be processed sent by a user;
and dividing the data to be processed into a plurality of data blocks based on a calculation rule of a preset check algorithm.
The preset check algorithm may be an algorithm capable of calculating a check block, for example, an erasure code algorithm. In practical applications, the preset check algorithm may be an RS erasure code algorithm (Reed-Solomon erasure code), an array erasure code, a low-density parity-check erasure code, or the like. The following describes the data processing method provided in this specification, taking an RS erasure code algorithm as the preset check algorithm by way of example. Where the preset check algorithm is an array erasure code or a low-density parity-check erasure code, reference may be made to the corresponding description in this specification, which is not repeated here.
Correspondingly, the calculation rule can be understood as the erasure code coding configuration used by the preset check algorithm in each check block calculation. For example, the erasure code coding configuration can be (5,1) or (2,1): (5,1) indicates that the preset check algorithm calculates over 5 data blocks at a time to obtain 1 check block, and (2,1) indicates that the preset check algorithm calculates over 2 data blocks at a time to obtain 1 check block.
The data storage center can store data to be processed and an initial check block and a target check block corresponding to the data to be processed; for example, in the case that the data storage system to which the data processing method is applied can support data storage in a multi-AZ manner, the data storage center may be understood as AZ, a physical data center, a computer room or a database, and the like.
Specifically, when receiving to-be-processed data sent by a user, the data storage system can determine the calculation rule of the preset check algorithm configured in advance in the data storage system, and segment the to-be-processed data based on the calculation rule, so as to obtain a plurality of data blocks.
Taking the application of the data processing method to a data storage scenario as an example, the segmentation of the data to be processed based on the calculation rule of the preset check algorithm is further explained, wherein the data storage system may be a distributed storage system, the data to be processed is a document, the preset check algorithm may be an RS erasure code algorithm, and the calculation rule is the erasure code coding configuration (5,1).
Based on this, when receiving a 10 MB document sent by a user, the distributed storage system determines the erasure code coding configuration of the preconfigured RS erasure code algorithm, which is (5,1). Based on this configuration, it determines that the RS erasure code algorithm generates 1 check block for every 5 data blocks; accordingly, the distributed storage system divides the 10 MB document into 10 data blocks, so that the RS erasure code algorithm can subsequently configure the corresponding check blocks for the 10 MB document sent by the user.
Further, in a specific implementation process, the data storage system can segment the data to be processed based on the number of the data storage centers, so that a plurality of data blocks obtained after segmentation can be dispersedly stored in the data storage centers. The specific implementation is as follows.
The dividing of the data to be processed into a plurality of data blocks comprises:
receiving data to be processed sent by a user, and determining the number of data storage centers;
and dividing the data to be processed into a plurality of data blocks based on the number of the data storage centers.
The data storage center can store data to be processed and an initial check block and a target check block corresponding to the data to be processed; for example, in the case that the data storage system to which the data processing method is applied can support data storage in a multi-AZ manner, the data storage center may be understood as AZ, a physical data center, a computer room or a database, and the like.
Specifically, when receiving data to be processed sent by a user, the data storage system determines the number of data storage centers, and segments the data to be processed based on the number of the data storage centers, so as to obtain a plurality of data blocks.
In the above example, the data storage system may be a distributed storage system, the data storage center is AZ, and the data to be processed is a document.
The user sends a document of size 10MB to the distributed storage system, which determines the number of AZ, which may be 3, after receiving the document sent by the user. And then, the distributed storage system segments the document with the size of 10MB into 10 data blocks according to the number of AZ, wherein the size of each data block can be set according to the actual application scenario, for example, the size of the data block can be 1 MB.
In practical applications, a 10 MB document sent by a user can also be segmented according to a data block size predefined by the distributed storage system. For example, if the predefined data block size in the distributed storage system is 2 MB, the 10 MB document can be divided into 5 data blocks.
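The two segmentation options described for step 202 (a fixed number of blocks, or a predefined block size) can be sketched as follows. This is only an illustration: the helper names `split_into_count` and `split_by_size` are hypothetical, and the specification does not say how a document whose length is not an exact multiple would be handled, so the first helper simply rejects that case.

```python
def split_into_count(data, count):
    """Split data into `count` blocks of equal size (assumes even division)."""
    if count <= 0 or len(data) == 0 or len(data) % count != 0:
        raise ValueError("data length must be a positive multiple of count")
    size = len(data) // count
    return [data[i:i + size] for i in range(0, len(data), size)]


def split_by_size(data, block_size):
    """Split data into blocks of `block_size` bytes; the last may be shorter."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


MB = 1024 * 1024
document = bytes(10 * MB)                      # stand-in for a 10 MB document
blocks_1mb = split_into_count(document, 10)    # 10 blocks of 1 MB each
blocks_2mb = split_by_size(document, 2 * MB)   # 5 blocks of 2 MB each
```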
Step 204: dividing the plurality of data blocks into at least two data block sets according to a calculation rule of a preset check algorithm.
Wherein, a data block set can be understood as a set composed of a specific number of data blocks; in practical application, a preset check algorithm can process a data block set, so as to obtain a check block of the data block set.
Specifically, after dividing the data to be processed into a plurality of data blocks, the data storage system can determine a calculation rule of the preset check algorithm, and divide the plurality of data blocks into at least two data block sets based on the calculation rule.
Continuing the above example, the preset check algorithm may be an RS erasure code algorithm. After the distributed storage system divides a document with a size of 10 MB into 10 data blocks of 1 MB each, it can determine the erasure code configuration of the erasure code algorithm, which may be (5,1), and divide the 10 data blocks into two data block sets based on that configuration, each data block set including 5 data blocks. See FIG. 3, which is a schematic diagram of erasure codes in a data processing method provided in an embodiment of this specification, where: A1-A5 and A6-A10 are the two data block sets obtained by the division; AZ is an availability zone; A1-A10 are data blocks obtained after user data is segmented; P(1-1) and P(1-2) are local check blocks generated by calculating the data block sets using RS erasure codes; X1-X5 are global check blocks generated by RS erasure code calculation over the corresponding copies in AZ1 and AZ2; X6 is a check block generated by calculating X1-X5 using RS erasure codes; and P is a check block generated by calculating P(1-1) and P(1-2) using RS erasure codes.
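The division in step 204 amounts to chunking the block list by the first number of the erasure code configuration. A minimal sketch, assuming the number of blocks is an exact multiple of k and using block labels in place of block contents (`partition_into_sets` is a hypothetical helper name):

```python
def partition_into_sets(blocks, k):
    """Group consecutive blocks into data block sets of k blocks each."""
    if k <= 0 or len(blocks) % k != 0:
        raise ValueError("number of blocks must be a multiple of k")
    return [blocks[i:i + k] for i in range(0, len(blocks), k)]


k, m = 5, 1                                  # erasure code configuration (5,1)
blocks = [f"A{i}" for i in range(1, 11)]     # labels A1..A10
block_sets = partition_into_sets(blocks, k)
print(block_sets)
# [['A1', 'A2', 'A3', 'A4', 'A5'], ['A6', 'A7', 'A8', 'A9', 'A10']]
```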
It should be noted that this specification uses FIG. 3 only as an example to explain the data processing method; parameters such as the number and size of the AZs (availability zones), the data block sets, and the check blocks may be set according to the actual application scenario, which is not limited in this specification.
Step 206: determining a data block group according to adjacent data block sets among the at least two data block sets.
The data block group may be understood as a data block group consisting of a specific number of data block sets.
In a case provided by this specification, the determining a data block group according to an adjacent data block set of the at least two data block sets includes:
and determining at least two adjacent data block sets in the at least two data block sets as a data block group.
Specifically, the data storage system determines at least two adjacent data block sets among the at least two data block sets, and determines these adjacent data block sets as one data block group. That is to say, a data block group includes at least two data block sets that are adjacent to one another; at the same time, each data block set can be used to construct only one data block group, that is, a data block set exists in only one data block group.
Continuing the above example, as shown in FIG. 3, data block set A1-A5 and data block set A6-A10 are adjacent data block sets; based on this, the data storage system forms data block set A1-A5 and data block set A6-A10 into one data block group.
In the embodiments of this specification, at least two adjacent data block sets among the at least two data block sets are determined as one data block group, so that the preset check algorithm can subsequently process the data block group rapidly, which improves the efficiency of determining the initial check block and the target check block.
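This first grouping case can be sketched as taking consecutive data block sets a fixed number at a time, so that no set is shared between groups. The helper name `group_adjacent_sets` and the default of two sets per group are assumptions made for illustration; the specification only requires at least two adjacent sets per group.

```python
def group_adjacent_sets(block_sets, sets_per_group=2):
    """Combine consecutive data block sets into non-overlapping groups."""
    if sets_per_group < 2:
        raise ValueError("a data block group needs at least two sets")
    if len(block_sets) % sets_per_group != 0:
        raise ValueError("number of sets must be a multiple of sets_per_group")
    return [block_sets[i:i + sets_per_group]
            for i in range(0, len(block_sets), sets_per_group)]


block_sets = [
    ["A1", "A2", "A3", "A4", "A5"],      # data block set A1-A5
    ["A6", "A7", "A8", "A9", "A10"],     # data block set A6-A10
]
block_groups = group_adjacent_sets(block_sets)   # one group holding both sets
```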
In another case provided by this specification, the determining a data block group according to an adjacent data block set of the at least two data block sets includes:
determining at least two adjacent data block sets in the at least two data block sets and position information of data blocks in the at least two adjacent data block sets;
and determining the data block group according to the position information of the data blocks in the at least two adjacent data block sets.
Wherein the location information of the data block may be understood as information indicating where the data block is located in the set of data blocks.
Specifically, after determining the data block set, the data storage system can determine at least two adjacent data block sets in the at least two data block sets and the position information of the data blocks included in the at least two adjacent data block sets, and then determine the data block group based on the position information of the data blocks in the at least two adjacent data block sets.
Further, the determining a data block group according to the location information of the data blocks in the at least two adjacent data block sets includes:
determining the corresponding relation between the data block of each data block set in the at least two adjacent data block sets and the data block of other data block sets according to the position information of the data block in the at least two adjacent data block sets;
and determining the data block group according to the corresponding relation.
Continuing the above example, as shown in FIG. 3, data block set A1-A5 and data block set A6-A10 are adjacent data block sets. Based on this, the data storage system can determine, from the position information of the data blocks in each data block set, the correspondence between the data blocks included in data block set A1-A5 and the data blocks included in data block set A6-A10; for example, data block A1 and data block A6 have a correspondence. Then, based on the correspondence, the data storage system constructs a data block group from the two corresponding data blocks, so that the data block group comprises data block A1 and data block A6.
The data processing method provided by the specification determines the data block group based on the corresponding relationship between the data block of each data block set and the data block of other data block sets determined by the position information of the data blocks in at least two adjacent data block sets, so that a subsequent preset check algorithm can rapidly process the data block group, and the efficiency of determining the initial check block and the target check block is improved.
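The position-based grouping described above can be sketched as follows. This is a minimal illustration only: the function name `group_by_position` and the use of string labels in place of real data blocks are assumptions made for the example, not part of the specification.

```python
from typing import List, Sequence, Tuple

def group_by_position(block_sets: Sequence[Sequence[str]]) -> List[Tuple[str, ...]]:
    """Form one data block group per position across adjacent data block sets."""
    if len({len(s) for s in block_sets}) != 1:
        raise ValueError("all data block sets must contain the same number of blocks")
    # zip(*sets) pairs the i-th block of every set, i.e. blocks with equal position.
    return list(zip(*block_sets))

set_1 = ["A1", "A2", "A3", "A4", "A5"]    # data block set A1-A5
set_2 = ["A6", "A7", "A8", "A9", "A10"]   # data block set A6-A10
groups = group_by_position([set_1, set_2])
```

Here `groups` holds five data block groups, the first of which pairs A1 with A6, matching the correspondence in the example.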
Step 208: processing the data block group according to the preset check algorithm to obtain an initial check block of the data block group.
In practical application, when a specific number of data blocks in the data block group are lost, the data blocks can be recovered based on the initial check block, so that a complete data block group is obtained, and the security of the data to be processed is further ensured.
Specifically, after the data block group is determined, the data storage system can perform calculation on the data block group based on the preset check algorithm, so as to generate the initial check block corresponding to the data block group.
In the data processing method provided in this specification, when at least two adjacent data block sets in the at least two data block sets are determined as one data block group, the preset check algorithm is a first check algorithm;
correspondingly, the processing the data block group according to the preset check algorithm to obtain an initial check block of the data block group includes:
determining a data block set in each data block group;
and processing each data block set in each data block group according to the first check algorithm to obtain an initial check block of each data block set in each data block group.
The first check algorithm may be understood as an algorithm capable of processing a data block set to obtain an initial check block. For example, in the case that the data block set includes 5 data blocks, the first check algorithm may be an RS erasure code algorithm capable of calculating the 5 data blocks and obtaining a check block, that is, an RS erasure code algorithm whose erasure code configuration is (5,1). Correspondingly, the initial check block may be understood as the check block calculated by the first check algorithm; for example, it may be the check block obtained after a data block set is processed by the RS erasure code algorithm. In practical applications, the initial check block generated by the RS erasure code algorithm configured as (5,1) can be understood as a local check block. The number of initial check blocks obtained by the first check algorithm may be set according to the actual application scenario, which is not limited in this specification; for example, the number of initial check blocks may be 1.
Specifically, after determining at least two adjacent data block sets in the at least two data block sets as one data block group, the data storage system determines the data block sets in each data block group and calculates each of them according to the first check algorithm, so as to obtain an initial check block of each data block set in each data block group. This facilitates subsequent recovery of the data to be processed based on the initial check block and ensures the security of the data to be processed.
In practical application, the data block set in each data block group is calculated according to the first check algorithm, which may be understood as calculating the data blocks included in the data block set in each data block group according to the first check algorithm, and using the check block obtained by calculating the data blocks as the initial check block of each data block set in each data block group.
Continuing the above example and referring to fig. 3, after determining a data block group from the two sets, namely the data block set A1-A5 and the data block set A6-A10, the distributed storage system obtains the data block sets included in the data block group, that is, the data block set A1-A5 and the data block set A6-A10, and performs calculation on the data blocks in each data block set through RS erasure codes, so as to obtain the local check block P (1-1) of the data block set A1-A5 and the local check block P (1-2) of the data block set A6-A10.
Further, the processing each data block set in each data block group according to the first check algorithm to obtain an initial check block of each data block set in each data block group includes:
determining matrix parameters of the first check algorithm, and constructing a first check matrix according to the matrix parameters;
and calculating each data block set in each data block group according to the first check matrix to obtain an initial check block of each data block set in each data block group.
The matrix parameters may be understood as the parameters required for forming the first check matrix, and they may be linearly independent parameters. In practical applications, the matrix parameters may be mutually independent, linearly independent RS erasure code parameters. Correspondingly, the first check matrix may be understood as a matrix capable of calculating an initial check block of each data block set in each data block group.
Continuing the above example, after determining the data block sets in each data block group, the distributed storage system determines the RS erasure code parameters of the RS erasure code algorithm, which are mutually independent, linearly independent parameters, and constructs, based on these parameters, an RS erasure code matrix for calculating the data block set. The matrix may be < a1, a2, a3, a4, a5> or < b1, b2, b3, b4, b5>, where a1 to a5 respectively represent each row in a matrix, and b1 to b5 respectively represent each column in a matrix.
Then, the distributed storage system multiplies the data block set A1-A5 by < a1, a2, a3, a4, a5> to obtain the local check block P (1-1); that is, P (1-1) is generated from A1-A5 by RS erasure code calculation with the parameters < a1, a2, a3, a4, a5>. Likewise, the data block set A6-A10 is multiplied by < b1, b2, b3, b4, b5> to obtain the local check block P (1-2); that is, P (1-2) is generated from A6-A10 by RS erasure code calculation with the parameters < b1, b2, b3, b4, b5>.
It should be noted that the local check block in each AZ is generated by using a completely independent and linearly non-correlated RS erasure code parameter calculation. And the rank of any sub-matrix of the two groups of coding parameters is not 0.
In the embodiment provided by the present specification, the first check matrix is constructed according to the determined matrix parameters of the first check algorithm, and each data block set in each data block group is calculated according to the first check matrix, so as to obtain the initial check block of each data block set in each data block group. This facilitates subsequent recovery of the data to be processed based on the initial check block and ensures the security of the data to be processed.
In the data processing method provided in this specification, when the data block group is determined according to the position information of the data blocks in the at least two adjacent data block sets, the preset check algorithm is a second check algorithm;
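The multiplication of a data block set by a parameter row can be sketched as a linear combination in GF(2^8), the field commonly used by RS codes. This is a sketch under stated assumptions: the field polynomial 0x11D, the 4-byte stand-in blocks and the concrete coefficient values for < a1..a5> and < b1..b5> are all chosen for illustration; the specification does not fix any of them.

```python
from typing import Sequence

def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x^2 + 1 (0x11D)."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # addition in GF(2^8) is XOR
        a <<= 1
        if a & 0x100:
            a ^= 0x11D           # reduce back into 8 bits
        b >>= 1
    return result

def encode_check_block(blocks: Sequence[bytes], coeffs: Sequence[int]) -> bytes:
    """Return sum_i coeffs[i] * blocks[i], computed byte by byte in GF(2^8)."""
    if len(blocks) != len(coeffs):
        raise ValueError("one coefficient is needed per data block")
    out = bytearray(len(blocks[0]))
    for block, c in zip(blocks, coeffs):
        for i, byte in enumerate(block):
            out[i] ^= gf_mul(c, byte)
    return bytes(out)

# Hypothetical 4-byte data blocks A1..A10 (real blocks would be far larger).
data = [bytes([k, k + 10, k + 20, k + 30]) for k in range(1, 11)]
COEFFS_A = [1, 1, 1, 1, 1]   # assumed values for < a1, a2, a3, a4, a5>
COEFFS_B = [1, 2, 3, 4, 5]   # assumed values for < b1, b2, b3, b4, b5>

p_1_1 = encode_check_block(data[0:5], COEFFS_A)    # local check block of A1-A5
p_1_2 = encode_check_block(data[5:10], COEFFS_B)   # local check block of A6-A10
```

The two coefficient rows were picked so that the ratios b_i / a_i are all distinct, which keeps the two rows linearly independent on every pair of positions.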
correspondingly, the processing the data block group according to the preset verification algorithm to obtain an initial verification block of the data block group includes:
determining matrix parameters of a second check algorithm, and constructing a second check matrix according to the matrix parameters;
and calculating the data block group according to the second check matrix to obtain an initial check block of the data block group.
The data block group comprises two corresponding data blocks. Correspondingly, the second check algorithm can be understood as an algorithm capable of processing the data blocks in the data block group to obtain an initial check block. For example, in the case that the data block group includes 2 data blocks, the second check algorithm may be an RS erasure code algorithm capable of calculating the 2 data blocks and obtaining a check block, that is, an RS erasure code algorithm whose erasure code configuration is (2,1). Correspondingly, the initial check block may be understood as the check block calculated by the second check algorithm; for example, it may be the check block obtained after the data block group is processed by the RS erasure code algorithm. In practical applications, the initial check block generated by the RS erasure code algorithm configured as (2,1) can be understood as a global check block. The number of initial check blocks obtained by the second check algorithm may be set according to the actual application scenario, which is not limited in this specification; for example, the number of initial check blocks may be 1. The matrix parameters can be understood as the parameters required for forming the second check matrix, and they can be linearly independent parameters. The second check matrix can be understood as a matrix that is constructed from the matrix parameters and is capable of calculating the initial check block of the data block group.
Continuing the above example and referring to fig. 3, after determining the data block groups, the distributed storage system determines the RS erasure code parameters of the RS erasure code algorithm and constructs, based on these parameters, an RS erasure code matrix for calculating the data block group. The matrix may be < a1, a2>, where a1 and a2 respectively represent each row in a matrix.
Then, the distributed storage system multiplies the data blocks included in each data block group by < a1, a2> to obtain the global check block of that data block group. Specifically, the data blocks A1 and A6 included in a data block group are multiplied by < a1, a2> to obtain the global check block X1 of the data block group. That is, X1-X5 are generated by RS erasure code calculation over the corresponding copies in AZ1 and AZ2; for example, X1 is generated from A1 and A6.
In the embodiments provided in this specification, a second check matrix is constructed from the determined matrix parameters of the second check algorithm, and the data block group is calculated according to the second check matrix to obtain the initial check block of the data block group. This facilitates subsequent recovery of the data to be processed based on the initial check block and ensures the security of the data to be processed.
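The (2,1) calculation of X1-X5 can be illustrated with a small sketch. It assumes that both coefficients in < a1, a2> equal 1, in which case the linear combination reduces to a byte-wise XOR; the specification leaves the actual parameter values open, and the 4-byte stand-in blocks are hypothetical.

```python
from typing import List

def xor_blocks(x: bytes, y: bytes) -> bytes:
    """Byte-wise XOR of two equally sized blocks (coefficients assumed to be 1)."""
    if len(x) != len(y):
        raise ValueError("blocks in a data block group must have equal length")
    return bytes(p ^ q for p, q in zip(x, y))

# Hypothetical 4-byte stand-ins for the data blocks held in two AZs.
az1 = [bytes([k] * 4) for k in range(1, 6)]     # A1..A5
az2 = [bytes([k] * 4) for k in range(6, 11)]    # A6..A10

# One data block group per position: (A1, A6), (A2, A7), ..., (A5, A10).
global_checks: List[bytes] = [xor_blocks(a, b) for a, b in zip(az1, az2)]  # X1..X5

# With this choice, a lost A1 can be rebuilt from the surviving A6 and X1.
rebuilt_a1 = xor_blocks(az2[0], global_checks[0])
```

With general nonzero coefficients the same structure holds, but the XOR is replaced by multiplication and addition in the finite field of the RS code.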
Step 210: determining a target check block of the data to be processed according to the initial check block.
In practical application, when the initial check block is lost in the process of data recovery for data to be processed, the data recovery for the initial check block can be performed based on the target check block, and the data recovery for the data to be processed is performed based on the recovered initial check block.
In the data processing method provided by the present specification, in a case that each data block set in each data block group is processed according to the first check algorithm to obtain an initial check block of each data block set in each data block group, the determining a target check block of the data to be processed according to the initial check block includes:
and processing the initial check block of each data block set in each data block group according to a second check algorithm to obtain a target check block of each data block group.
The target check block may be understood as a check block obtained by calculating the initial check block of each data block set through a second check algorithm.
Specifically, after the initial check block of each data block set is determined, the data storage system can process the initial check block of each data block set in each data block group based on the second check algorithm, so as to obtain the target check block of each data block group.
Further, the processing the initial check block of each data block set in each data block group according to a second check algorithm to obtain the target check block of each data block group includes:
determining matrix parameters of a second check algorithm, and constructing a second check matrix according to the matrix parameters;
and calculating the initial check block of each data block set in each data block group according to the second check matrix to obtain the target check block of the data block group.
Continuing the above example and referring to fig. 3, after the distributed storage system determines the initial check block of each data block set in the data block group, that is, after determining P (1-1) and P (1-2) in fig. 3, the distributed storage system determines the RS erasure code parameters of the RS erasure code algorithm and constructs, based on these parameters, an RS erasure code matrix for calculating the initial check blocks. The matrix may be < a1, a2>, where a1 and a2 respectively represent each row in a matrix.
Then, the distributed storage system multiplies P (1-1) and P (1-2) by < a1, a2>, so as to obtain a target check block (check block P in fig. 3) of the data block group, that is, the check block P is generated by using RS erasure code calculation consistent with the above for P (1-1) and P (1-2).
In the embodiment of the specification, a second check matrix is constructed through the determined matrix parameters of the second check algorithm; and calculating the initial check block of each data block set in each data block group according to the second check matrix to obtain the target check block of each data block group. The initial check block can be recovered based on the target check block, so that the safety of the initial check block is ensured; meanwhile, the data redundancy ratio of the check block data is reduced, and the storage cost of the data is further reduced.
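As a worked form of the calculation above, the relationship between the local check blocks and the target check block P can be written out explicitly. In this sketch the two (2,1) coefficients are written g_1 and g_2 purely to avoid a clash with the five-element row < a1, a2, a3, a4, a5>; this renaming, and the assumption that all arithmetic takes place in the finite field of the RS code, are not taken from the specification.

```latex
\begin{aligned}
P_{(1\text{-}1)} &= \sum_{i=1}^{5} a_i A_i, \qquad
P_{(1\text{-}2)} = \sum_{i=1}^{5} b_i A_{i+5}, \\
P &= g_1 P_{(1\text{-}1)} + g_2 P_{(1\text{-}2)}, \\
P_{(1\text{-}1)} &= g_1^{-1}\bigl(P - g_2 P_{(1\text{-}2)}\bigr) \qquad (g_1 \neq 0).
\end{aligned}
```

The last line shows why a lost P (1-1) can be rebuilt from P and the surviving P (1-2) without reading any data block.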
In the data processing method provided in this specification, in a case where a second check matrix is constructed according to the matrix parameters of the second check algorithm and the data block group is calculated to obtain an initial check block of the data block group, the determining a target check block of the data to be processed according to the initial check block further includes:
determining matrix parameters of a first check algorithm, and constructing a first check matrix according to the matrix parameters;
and calculating the initial check block of the data block group according to the first check matrix to obtain the target check block of each data block group.
The target check block may be understood as a check block obtained by calculating the initial check block of each data block group through the first check algorithm.
Continuing the above example and referring to fig. 3, after determining the initial check blocks of the data block groups, that is, X1-X5, the distributed storage system determines the RS erasure code parameters of the RS erasure code algorithm, which are mutually independent, linearly independent parameters, and constructs, based on these parameters, an RS erasure code matrix for calculating the initial check blocks. The matrix may be < c1, c2, c3, c4, c5>, where c1 to c5 respectively represent each row in a matrix.
Then, the distributed storage system multiplies the initial check block X1-X5 of the data block group by < c1, c2, c3, c4, c5>, so as to obtain a check block X6, that is, the check block X6 is calculated and generated by the erasure code of the parameter < c1, c2, c3, c4, c5> for X1-X5. The X6 is generated by calculation using a completely independent and linearly uncorrelated RS erasure code parameter.
In specific implementation, the < a1, a2, a3, a4, a5>, < b1, b2, b3, b4, b5> and < c1, c2, c3, c4, c5> are three different sets of matrices, wherein the rank of any sub-matrix of the three sets of coding parameters is not 0; and the parameters in each matrix may be set according to an actual application scenario, which is not specifically limited in this specification.
In the embodiment of the specification, a first check matrix is constructed according to the determined matrix parameters of a first check algorithm; the initial check block of each data block group is calculated according to the first check matrix to obtain a target check block of each data block group, so that the initial check block can be recovered based on the target check block conveniently in the follow-up process, and the safety of the initial check block is ensured; meanwhile, the data redundancy ratio of the check block data is reduced, and the storage cost of the data is further reduced.
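The calculation of X6 from X1-X5, and the later recovery of one lost global check block from X6, can be sketched as below. The GF(2^8) polynomial 0x11D, the coefficient values chosen for < c1..c5>, the 3-byte stand-in blocks and the helper names are assumptions for illustration only.

```python
from typing import Dict, Sequence

def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo 0x11D."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return result

def gf_inv(a: int) -> int:
    """Inverse as a^254; every nonzero element satisfies a^255 = 1."""
    if a == 0:
        raise ZeroDivisionError("0 has no inverse in GF(2^8)")
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result

def combine(blocks: Sequence[bytes], coeffs: Sequence[int]) -> bytes:
    """Linear combination sum_i coeffs[i] * blocks[i] over GF(2^8)."""
    out = bytearray(len(blocks[0]))
    for block, c in zip(blocks, coeffs):
        for j, byte in enumerate(block):
            out[j] ^= gf_mul(c, byte)
    return bytes(out)

def recover_one(lost: int, survivors: Dict[int, bytes], check: bytes,
                coeffs: Sequence[int]) -> bytes:
    """Solve check = sum_i coeffs[i] * X_i for the single missing X_lost."""
    acc = bytearray(check)
    for i, block in survivors.items():
        for j, byte in enumerate(block):
            acc[j] ^= gf_mul(coeffs[i], byte)      # remove the known terms
    inv = gf_inv(coeffs[lost])
    return bytes(gf_mul(inv, v) for v in acc)

COEFFS_C = [1, 4, 5, 16, 17]                               # assumed < c1..c5>
x = [bytes([7 * k, 11 * k, 13 * k]) for k in range(1, 6)]  # stand-ins for X1..X5
x6 = combine(x, COEFFS_C)

survivors = {i: blk for i, blk in enumerate(x) if i != 2}  # pretend X3 is lost
x3 = recover_one(2, survivors, x6, COEFFS_C)
```

Because X6 is a single check block over X1-X5, this path repairs at most one lost block among them; losing more requires the joint decoding described later in the specification.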
Further, in the embodiment provided in this specification, when the data to be processed is lost, data recovery can be performed on the data to be processed based on the target parity chunk and the initial parity chunk, so that the security of the data to be processed is ensured. The specific implementation is as follows.
After the target parity chunk of the data to be processed is determined according to the initial parity chunk, the method further includes:
receiving a data acquisition request sent by a user aiming at the data to be processed;
under the condition that the data block of the data to be processed meets the data recovery condition, performing data recovery on the data to be processed according to the initial check block or the target check block;
and sending the data to be processed after data recovery to the user.
The data obtaining request may be understood as a request for obtaining data to be processed, and the data recovery condition may be set according to an actual application scenario, for example, the data recovery condition may be that a data block or a check block corresponding to the data to be processed is lost.
Specifically, the data storage system judges whether the data to be processed meets a data recovery condition or not under the condition that a data acquisition request sent by a user for the data to be processed is received, if so, performs data recovery on the data to be processed according to an initial check block or a target check block, and sends the data to be processed after the data recovery to the user; if not, the data to be processed is not processed, and the data to be processed is directly sent to the user.
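The read path described above can be sketched as a small control flow. Everything named here (`handle_read`, `toy_recover`, the dictionary layout with `None` marking a lost block, and the single XOR-protected group A1, A6, X1) is hypothetical and only illustrates the check-then-recover-then-return order.

```python
from typing import Callable, Dict, List, Optional

Store = Dict[str, Optional[bytes]]

def handle_read(stored: Store, data_ids: List[str],
                recover: Callable[[Store, List[str]], Dict[str, bytes]]) -> bytes:
    """Return the requested data, repairing lost blocks first if needed."""
    missing = [name for name, blk in stored.items() if blk is None]
    if missing:                                   # data recovery condition met
        stored = {**stored, **recover(stored, missing)}
    return b"".join(stored[name] for name in data_ids)

def toy_recover(stored: Store, missing: List[str]) -> Dict[str, bytes]:
    """Toy recovery for one group (A1, A6, X1), assuming X1 = A1 XOR A6."""
    group = ["A1", "A6", "X1"]
    repaired = {}
    for name in missing:
        others = [stored[n] for n in group if n != name]
        if name not in group or any(o is None for o in others):
            raise RuntimeError(f"cannot recover {name} in this toy example")
        repaired[name] = bytes(p ^ q for p, q in zip(*others))
    return repaired

stored = {"A1": None, "A6": b"\x06\x06", "X1": b"\x07\x07"}   # A1 is lost
result = handle_read(stored, ["A1", "A6"], toy_recover)
```

When nothing is missing, `handle_read` skips recovery and returns the stored data directly, mirroring the "if not" branch in the text.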
In practical application, performing data recovery on the data to be processed through the initial check block can be understood as recovering the data block lost from the data to be processed through the initial check block; performing data recovery on the data to be processed through the target check block can be understood as first recovering the lost initial check block through the target check block, and then recovering the lost data block of the data to be processed based on the recovered initial check block.
Continuing the above example and referring to fig. 3, when the distributed storage system receives a data acquisition request of a user for a document, the distributed storage system determines whether a data block and/or a check block of the document is lost, that is, whether any data block in A1-A10 and/or any check block in (X1-X6, P (1-1), P (1-2)) in fig. 3 is lost.
When a data block is lost, the system first attempts to recover it by using the local check blocks (P (1-1) and P (1-2)) or the global check blocks (X1-X5); if the data block still cannot be recovered, the lost data copy (data block) is recovered by jointly using the local check blocks (P (1-1) and P (1-2)) and the global check blocks (X1-X5).
In practical applications, data recovery through the local check blocks and/or the global check blocks needs to refer to the data blocks that are not lost; that is, the recovery is performed from the local check blocks and/or the global check blocks together with the data blocks that are not lost.
For example, when the data block A1 is lost, the system may recover it based on the non-lost data block A6 and the global check block X1, or by using the non-lost data blocks A2-A5 and the local check block P (1-1).
When the data blocks A1, A2, A6 and A7 are lost, the system can recover the four lost data blocks by using the local parity blocks P (1-1) and P (1-2) to decode jointly with the global parity blocks X1 and X2 based on the data blocks A3-A5 and A8-A10 which are not lost.
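The joint decoding of four lost data blocks can be sketched as solving a small linear system over GF(2^8). The sketch below assumes one byte per block, (2,1) coefficients of 1 for X1 and X2, the polynomial 0x11D and illustrative values for < a1..a5> and < b1..b5>; none of these values come from the specification.

```python
from typing import List, Sequence

def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo 0x11D."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return result

def gf_inv(a: int) -> int:
    """Inverse as a^254 (a must be nonzero)."""
    if a == 0:
        raise ZeroDivisionError("0 has no inverse in GF(2^8)")
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result

def gf_dot(coeffs: Sequence[int], values: Sequence[int]) -> int:
    acc = 0
    for c, v in zip(coeffs, values):
        acc ^= gf_mul(c, v)
    return acc

def gf_solve(matrix: List[List[int]], rhs: List[int]) -> List[int]:
    """Solve matrix * x = rhs over GF(2^8) by Gauss-Jordan elimination."""
    n = len(matrix)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col]), None)
        if pivot is None:
            raise ValueError("coefficients are linearly dependent; cannot decode")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv = gf_inv(aug[col][col])
        aug[col] = [gf_mul(inv, v) for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col]:
                factor = aug[r][col]
                aug[r] = [v ^ gf_mul(factor, p) for v, p in zip(aug[r], aug[col])]
    return [row[n] for row in aug]

def joint_matrix(a: Sequence[int], b: Sequence[int]) -> List[List[int]]:
    """Rows: P (1-1), P (1-2), X1, X2; columns: unknowns A1, A2, A6, A7."""
    return [[a[0], a[1], 0, 0], [0, 0, b[0], b[1]], [1, 0, 1, 0], [0, 1, 0, 1]]

COEFFS_A = [1, 1, 1, 1, 1]                                # assumed < a1..a5>
COEFFS_B = [1, 2, 3, 4, 5]                                # assumed < b1..b5>
blocks = {k: (37 * k + 5) % 256 for k in range(1, 11)}    # one byte per block
p_1_1 = gf_dot(COEFFS_A, [blocks[k] for k in range(1, 6)])
p_1_2 = gf_dot(COEFFS_B, [blocks[k] for k in range(6, 11)])
x1, x2 = blocks[1] ^ blocks[6], blocks[2] ^ blocks[7]

# Move the surviving blocks A3-A5 and A8-A10 to the right-hand side.
rhs = [p_1_1 ^ gf_dot(COEFFS_A[2:], [blocks[3], blocks[4], blocks[5]]),
       p_1_2 ^ gf_dot(COEFFS_B[2:], [blocks[8], blocks[9], blocks[10]]),
       x1, x2]
recovered = gf_solve(joint_matrix(COEFFS_A, COEFFS_B), rhs)   # [A1, A2, A6, A7]
```

If the same coefficients were used for both local check blocks, the four equations would become linearly dependent and `gf_solve` would raise an error, which appears consistent with the requirement that the local parameters of different AZs be linearly independent.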
When a check block is lost, recovery may be based on the data blocks and/or the non-lost check blocks. For example, when the local check block P (1-1) is lost, the system may recover it based on the data blocks A1-A5 that are not lost, or based on the non-lost local check block P (1-2) and the check block P.
When parity blocks are lost concurrently with data blocks, recovery may be based on the non-lost data blocks as well as the non-lost parity blocks.
For example, when data blocks A5, A10, and local parity blocks P (1-1), P (1-2) are lost, the system can recover the lost data blocks and parity blocks by using global parity block X5 in conjunction with parity block P to decode based on the data blocks A1-A4, A6-A9 that are not lost.
When the data chunks A6 and A7 and the global parity chunks X1 and X2 are lost, the system may recover the lost data chunks and parity chunks based on the non-lost data chunks A1-A2, A8-A10, the local parity chunks P (1-2) and the global parity chunks X3-X6.
When the local check blocks P (1-1) and/or P (1-2) are lost, P (1-1) and/or P (1-2) are recovered through the check block P based on the local check blocks that are not lost; if any global check block among X1-X5 is lost, the lost global check block may be recovered through the check block X6 based on the global check blocks that are not lost.
In the embodiment of the present specification, a data acquisition request sent by a user for the data to be processed is received; when a data block of the data to be processed meets the data recovery condition, data recovery is performed on the data to be processed according to the initial check block or the target check block; and the recovered data to be processed is sent to the user. This ensures the security of the data to be processed and improves the user experience.
In the data processing method provided by the present specification, data block sets are determined from the data blocks obtained by segmenting the data to be processed; a data block group is then determined according to adjacent data block sets in the at least two data block sets, and the data block group is processed according to a preset check algorithm to obtain an initial check block of the data block group, which facilitates subsequent recovery of the data to be processed based on the initial check block and ensures the security of the data to be processed. In addition, the target check block of the data to be processed is determined according to the initial check block, so that the initial check block can subsequently be recovered based on the target check block, which ensures the security of the initial check block; meanwhile, the data redundancy ratio of the check block data is reduced, and the storage cost of the data is further reduced.
The data processing method is further described below with reference to fig. 4, taking as an example its application in a scenario of configuring erasure codes for user data. Fig. 4 shows a flowchart of a processing procedure of a data processing method according to an embodiment of the present specification, which specifically includes the following steps.
Step 402: the distributed storage system is capable of receiving user data written by a user.
The user data may be a document, a multimedia file, etc., for example, a document with a size of 10 MB.
Step 404: the system segments user data to obtain a plurality of data blocks.
Specifically, upon receiving the 10MB document sent by the user, the distributed storage system determines the erasure code configuration of the pre-configured RS erasure code algorithm, which is (5,1); that is, the RS erasure code algorithm generates 1 check block for every 5 data blocks. Based on this configuration, the distributed storage system divides the 10MB document into 10 data blocks, so that the RS erasure code algorithm can subsequently generate the corresponding check blocks for the 10MB document sent by the user.
Step 406: the system divides the plurality of data blocks into at least two sets of data blocks.
Specifically, the distributed storage system determines the erasure code configuration of the RS erasure code algorithm, which may be (5,1), and, based on this configuration, places every 5 adjacent data blocks of the 10 data blocks into one set, thereby obtaining the data block set A1-A5 and the data block set A6-A10.
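Steps 404 and 406 can be sketched as follows. The helper names and the zero-padding of the final block are assumptions made for this example; the specification does not say how a document whose size is not a multiple of the block count is handled.

```python
from typing import List

def split_into_blocks(data: bytes, num_blocks: int) -> List[bytes]:
    """Cut data into num_blocks equally sized blocks, zero-padding the tail."""
    block_size = -(-len(data) // num_blocks)          # ceiling division
    padded = data.ljust(block_size * num_blocks, b"\0")
    return [padded[i * block_size:(i + 1) * block_size] for i in range(num_blocks)]

def divide_into_sets(blocks: List[bytes], set_size: int) -> List[List[bytes]]:
    """Put every set_size adjacent blocks into one data block set."""
    if len(blocks) % set_size:
        raise ValueError("block count must be a multiple of the set size")
    return [blocks[i:i + set_size] for i in range(0, len(blocks), set_size)]

document = bytes(10 * 1024 * 1024)            # stand-in for the 10MB document
blocks = split_into_blocks(document, 10)      # A1..A10, 1MB each
block_sets = divide_into_sets(blocks, 5)      # [A1-A5], [A6-A10]
```

The set size of 5 follows from the (5,1) erasure code configuration, which produces one check block per five data blocks.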
Step 408: the system divides the at least two data block sets into data block groups.
Specifically, after obtaining the data block sets, the distributed storage system places at least two adjacent data block sets into one data block group.
For example, the data block sets A1-A5 and A6-A10 are adjacent data block sets, so the data block sets A1-A5 and A6-A10 can be placed into one data block group.
Step 410: global check blocks X1-X5 are generated using data blocks A1-A10, with erasure coding configured as RS erasure coding of (2, 1).
Specifically, after the data block group is obtained, the correspondence between the data blocks in each data block set of the data block group and the data blocks in the other data block sets is first determined; for example, the data block A1 in the data block set A1-A5 corresponds to the data block A6 in the data block set A6-A10.
Secondly, based on the RS erasure code configuration (2,1), the data blocks that correspond to one another are calculated to obtain the global check block for those data blocks. For example, the data block A1 and the data block A6 are calculated to obtain the global check block X1; in the same way, the data blocks A2-A5 and A7-A10 are calculated, yielding the global check blocks X1-X5 of the data blocks A1-A10.
Step 412: constructing the coding parameters of the coding matrices in the different AZs.
Specifically, RS erasure code parameters of the RS erasure code algorithm are determined, and the RS erasure code parameters are completely independent and linearly non-correlated parameters.
An RS erasure code matrix used for calculating the data blocks is then constructed based on the RS erasure code parameters. The RS erasure code matrix can be < a1, a2, a3, a4, a5>, < b1, b2, b3, b4, b5> or < c1, c2, c3, c4, c5>, where a1 to a5, b1 to b5 and c1 to c5 each respectively represent each row in one matrix.
Wherein, the rank of any sub-matrix of all AZ local check block parameters is not 0 (specifically, cauchy matrix generation may be used).
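Since the text mentions Cauchy matrix generation, a minimal sketch of generating three coefficient rows that way is given below. A Cauchy matrix has entries 1 / (x_i + y_j) for distinct, disjoint x and y values, and every square submatrix of it is invertible. The GF(2^8) polynomial 0x11D and the particular x and y values are assumptions for this example.

```python
from typing import List, Sequence

def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo 0x11D."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return result

def gf_inv(a: int) -> int:
    """Inverse as a^254 (a must be nonzero)."""
    if a == 0:
        raise ZeroDivisionError("0 has no inverse in GF(2^8)")
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result

def cauchy_matrix(xs: Sequence[int], ys: Sequence[int]) -> List[List[int]]:
    """Entry (i, j) is 1 / (xs[i] + ys[j]); addition in GF(2^8) is XOR."""
    if len(set(xs) | set(ys)) != len(xs) + len(ys):
        raise ValueError("xs and ys must be distinct and disjoint")
    return [[gf_inv(x ^ y) for y in ys] for x in xs]

# One row per AZ-local code (3 rows), one column per block position (5 columns).
coeffs_a, coeffs_b, coeffs_c = cauchy_matrix([0, 1, 2], [3, 4, 5, 6, 7])
```

The three rows could then play the roles of < a1..a5>, < b1..b5> and < c1..c5>; whether the patented system derives its parameters exactly this way is not stated.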
Step 414: generating the local check blocks P (1-1), P (1-2) and X6 by using the coding parameters of each AZ's local coding matrix to calculate the blocks in each AZ.
Specifically, the distributed storage system multiplies the data block set A1-A5 by < a1, a2, a3, a4, a5> to obtain the local check block P (1-1); that is, P (1-1) is generated from A1-A5 by RS erasure code calculation with the parameters < a1, a2, a3, a4, a5>.
The data block set A6-A10 is multiplied by < b1, b2, b3, b4, b5> to obtain the local check block P (1-2); that is, P (1-2) is generated from A6-A10 by RS erasure code calculation with the parameters < b1, b2, b3, b4, b5>.
Multiplying the initial check block X1-X5 of the data block group by < c1, c2, c3, c4, c5>, thereby obtaining a check block X6, that is, the check block X6 is calculated by the erasure code of the parameter < c1, c2, c3, c4, c5> for X1-X5.
The < a1, a2, a3, a4, a5>, < b1, b2, b3, b4, b5> and < c1, c2, c3, c4, c5> are three different sets of matrices, wherein the rank of any sub-matrix of the three sets of coding parameters is not 0.
Step 416: generating a global check block P from P (1-1) and P (1-2) by (2,1) RS coding.
Specifically, after determining P (1-1) and P (1-2), P (1-1) and P (1-2) can be calculated based on the RS erasure code configured as (2,1) in the erasure code configuration, so as to obtain the global parity check block P corresponding to P (1-1) and P (1-2).
Step 418: when data is lost, attempting to recover the lost data copy by using the local check blocks and/or the global check blocks.
When a data block is lost, an attempt is made to recover the lost data block using the local parity blocks P (1-1) and P (1-2) or the global parity block X1-X5 based on the data block that was not lost. If the data block can still not be repaired, the lost data copy is repaired by using the joint repair mode of the local check blocks P (1-1) and P (1-2) and the global check block X1-X5.
For example, when data blocks A1, A2, A6, A7 are lost, the system may recover the four lost data blocks based on data blocks A3-A5, A8-A10 using the local parity blocks P (1-1), P (1-2) in conjunction with the global parity blocks X1, X2. When the partial check block P (1-1) or P (1-2) is lost, the P (1-1) or P (1-2) can be recovered through the check block P and/or the data block; if any parity chunk in the global parity chunks X1-X5 is lost, the lost parity chunk may be recovered through the parity chunk X6 and/or the data chunk.
In practical applications, the above steps are exemplified by a (19, 10, 3) erasure code configuration (19 denotes the total number of copies, 10 denotes the number of data blocks, and 3 denotes the number of AZs). When the data blocks A1, A2, A6 and A7 are lost, the system can use the local check blocks P (1-1) and P (1-2) to decode jointly with the global check blocks X1 and X2 to recover the four lost data blocks.
Compared with AZ coding that uses two local copies, this scheme has the same copy fault tolerance, tolerating the loss of any 5 copies, but a lower data redundancy ratio: 1.9, versus 2.1 for AZ coding with two local copies. Compared with AZ coding that uses a single local copy, this scheme has higher fault tolerance: AZ coding with a single local copy tolerates the loss of only any 3 copies, whereas this scheme tolerates the loss of any 5 copies, which further ensures data security.
The data processing method provided in this specification proposes an erasure code data coding scheme that reduces the data redundancy ratio in a multi-availability-zone (multi-AZ) environment. By adopting a joint decoding design that combines the local check blocks within each AZ with the global check blocks across AZs, the data repair capability of erasure codes in the multi-AZ environment is improved, and the data redundancy ratio of multi-AZ erasure codes is further reduced.
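The redundancy ratio of 1.9 quoted above follows from counting the blocks named in the example. The short calculation below simply tallies them; the variable names are illustrative, and the 2.1 figure for the comparison scheme is taken from the text rather than derived here.

```python
data_blocks = 10      # A1-A10
global_checks = 5     # X1-X5, one per data block group
local_checks = 2      # P (1-1) and P (1-2)
target_checks = 2     # X6 (over X1-X5) and P (over P (1-1), P (1-2))

total_blocks = data_blocks + global_checks + local_checks + target_checks
redundancy_ratio = total_blocks / data_blocks   # stored blocks per data block
```

The total of 19 stored blocks for 10 data blocks matches the (19, 10, 3) configuration.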
Corresponding to the above method embodiment, the present specification further provides an embodiment of a data processing apparatus, and fig. 5 shows a schematic structural diagram of a data processing apparatus provided in an embodiment of the present specification. As shown in fig. 5, the apparatus includes:
a slicing module 502 configured to slice data to be processed into a plurality of data blocks;
a dividing module 504 configured to divide the plurality of data blocks into at least two data block sets according to a calculation rule of a preset check algorithm;
a first determining module 506 configured to determine a data block group according to an adjacent data block set of the at least two data block sets;
a processing module 508 configured to process the data block group according to the preset check algorithm to obtain an initial check block of the data block group;
a second determining module 510 configured to determine a target check block of the data to be processed according to the initial check block.
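The module chain above can be sketched as a small pipeline. This is a minimal sketch under stated assumptions, not the implementation described by the specification: byte-wise XOR parity stands in for both check algorithms, all adjacent data block sets are treated as a single data block group, and every function name and parameter (`slice_data`, `divide_into_sets`, `xor_blocks`, `encode`, `set_size`) is hypothetical.

```python
from functools import reduce
from typing import List, Tuple


def slice_data(data: bytes, num_blocks: int) -> List[bytes]:
    """Split data into num_blocks equal-sized blocks, zero-padding the tail."""
    block_size = -(-len(data) // num_blocks) or 1  # ceiling division, at least 1 byte
    padded = data.ljust(block_size * num_blocks, b"\x00")
    return [padded[i * block_size:(i + 1) * block_size] for i in range(num_blocks)]


def divide_into_sets(blocks: List[bytes], set_size: int) -> List[List[bytes]]:
    """Divide the blocks into consecutive data block sets of set_size blocks."""
    return [blocks[i:i + set_size] for i in range(0, len(blocks), set_size)]


def xor_blocks(blocks: List[bytes]) -> bytes:
    """Byte-wise XOR of equal-length blocks (stand-in check algorithm)."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def encode(data: bytes, num_blocks: int = 10,
           set_size: int = 5) -> Tuple[List[bytes], List[bytes], bytes]:
    blocks = slice_data(data, num_blocks)              # slicing module
    block_sets = divide_into_sets(blocks, set_size)    # dividing module
    group = block_sets                                 # first determining module (assumed: one group)
    initial_checks = [xor_blocks(s) for s in group]    # processing module
    target_check = xor_blocks(initial_checks)          # second determining module
    return blocks, initial_checks, target_check


blocks, initial_checks, target_check = encode(b"hello world, erasure coding!")
print(len(blocks), len(initial_checks), len(target_check))
```

With the default parameters the sketch yields 10 data blocks, one initial check block per data block set, and one target check block computed from the initial check blocks.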
Optionally, the first determining module 506 is further configured to:
and determining at least two adjacent data block sets in the at least two data block sets as one data block group.
Optionally, the preset check algorithm is a first check algorithm;
accordingly, the processing module 508 is further configured to:
determining a data block set in each data block group;
and processing each data block set in each data block group according to the first check algorithm to obtain an initial check block of each data block set in each data block group.
Optionally, the processing module 508 is further configured to:
determining matrix parameters of the first check algorithm, and constructing a first check matrix according to the matrix parameters;
and calculating each data block set in each data block group according to the first check matrix to obtain an initial check block of each data block set in each data block group.
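Constructing a check matrix from matrix parameters and applying it to a data block set can be illustrated as follows. This is only a sketch under assumptions: the specification does not state the field or the matrix family here, so a Vandermonde-style matrix over GF(2^8) with the reduction polynomial 0x11d is assumed, the matrix parameters are taken to be the numbers of check and data blocks, and the names `gf_mul`, `gf_pow`, `build_check_matrix` and `apply_check_matrix` are hypothetical. No claim is made that this particular matrix gives the fault tolerance described in the specification.

```python
from typing import List


def gf_mul(a: int, b: int) -> int:
    """Multiply two elements of GF(2^8), reducing by x^8 + x^4 + x^3 + x^2 + 1 (0x11d)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return result


def gf_pow(a: int, n: int) -> int:
    """Raise a GF(2^8) element to a non-negative integer power."""
    result = 1
    for _ in range(n):
        result = gf_mul(result, a)
    return result


def build_check_matrix(num_check: int, num_data: int) -> List[List[int]]:
    """Assumed construction: row i holds the powers (i + 1)^0 .. (i + 1)^(num_data - 1)."""
    return [[gf_pow(i + 1, j) for j in range(num_data)] for i in range(num_check)]


def apply_check_matrix(matrix: List[List[int]], data_blocks: List[bytes]) -> List[bytes]:
    """Compute one check block per matrix row as a GF(2^8) linear combination."""
    size = len(data_blocks[0])
    checks = []
    for row in matrix:
        check = bytearray(size)
        for coef, block in zip(row, data_blocks):
            for pos, byte in enumerate(block):
                check[pos] ^= gf_mul(coef, byte)
        checks.append(bytes(check))
    return checks


matrix = build_check_matrix(num_check=2, num_data=5)
data_set = [bytes([i, i + 1]) for i in range(1, 6)]
check_blocks = apply_check_matrix(matrix, data_set)
print(matrix[0])  # [1, 1, 1, 1, 1]: the first row reduces to plain XOR parity
```

The first row of this assumed matrix is all ones, so the first check block equals the byte-wise XOR of the data blocks in the set; further rows weight each data block by a different field element.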
Optionally, the second determining module 510 is further configured to:
and processing the initial check block of each data block set in each data block group according to a second check algorithm to obtain a target check block of each data block group.
Optionally, the second determining module 510 is further configured to:
determining matrix parameters of a second check algorithm, and constructing a second check matrix according to the matrix parameters;
and calculating the initial check block of each data block set in each data block group according to the second check matrix to obtain the target check block of each data block group.
Optionally, the first determining module 506 is further configured to:
determining at least two adjacent data block sets in the at least two data block sets and position information of data blocks in the at least two adjacent data block sets;
and determining the data block group according to the position information of the data blocks in the at least two adjacent data block sets.
Optionally, the first determining module 506 is further configured to:
determining the corresponding relation between the data block of each data block set in the at least two adjacent data block sets and the data block of other data block sets according to the position information of the data block in the at least two adjacent data block sets;
and determining the data block group according to the corresponding relation.
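One possible reading of the position-based correspondence is that data blocks occupying the same position in adjacent data block sets correspond to one another and together form a data block group. The sketch below illustrates only that reading; the function name `group_by_position` and the pairing of A1 with A6 are assumptions, not statements from the specification.

```python
from typing import List, Sequence


def group_by_position(adjacent_sets: Sequence[Sequence[str]]) -> List[List[str]]:
    """Group the blocks that share the same index across adjacent data block sets."""
    if len({len(s) for s in adjacent_sets}) != 1:
        raise ValueError("all data block sets must contain the same number of blocks")
    return [list(column) for column in zip(*adjacent_sets)]


set_1 = ["A1", "A2", "A3", "A4", "A5"]
set_2 = ["A6", "A7", "A8", "A9", "A10"]

groups = group_by_position([set_1, set_2])
print(groups)  # [['A1', 'A6'], ['A2', 'A7'], ['A3', 'A8'], ['A4', 'A9'], ['A5', 'A10']]
```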
Optionally, the preset check algorithm is a second check algorithm;
accordingly, the processing module 508 is further configured to:
determining matrix parameters of a second check algorithm, and constructing a second check matrix according to the matrix parameters;
and calculating the data block group according to the second check matrix to obtain an initial check block of the data block group.
Optionally, the second determining module 510 is further configured to:
determining matrix parameters of a first check algorithm, and constructing a first check matrix according to the matrix parameters;
and calculating the initial check block of the data block group according to the first check matrix to obtain a target check block of the data block group.
Optionally, the segmentation module 502 is further configured to:
receiving data to be processed sent by a user, and determining the number of data storage centers;
and dividing the data to be processed into a plurality of data blocks based on the number of the data storage centers.
Optionally, the data processing apparatus further comprises a receiving module configured to:
receiving a data acquisition request sent by a user aiming at the data to be processed;
under the condition that the data block of the data to be processed meets the data recovery condition, performing data recovery on the data to be processed according to the initial check block or the target check block;
and sending the data to be processed after data recovery to the user.
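The read path handled by the receiving module can be sketched as follows. This is a simplified illustration under assumptions: a single XOR check block stands in for the initial or target check block, the data recovery condition is taken to be "exactly one data block in the set is missing", and the names `xor_blocks`, `recover_set` and `handle_read` are hypothetical.

```python
from functools import reduce
from typing import List, Optional


def xor_blocks(blocks: List[bytes]) -> bytes:
    """Byte-wise XOR of equal-length blocks (stand-in check algorithm)."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def recover_set(blocks: List[Optional[bytes]], check_block: bytes) -> List[bytes]:
    """Repair at most one missing block (None) in a set using its XOR check block."""
    missing = [i for i, block in enumerate(blocks) if block is None]
    if not missing:
        return list(blocks)
    if len(missing) > 1:
        raise ValueError("a single XOR check block can repair only one lost block")
    survivors = [block for block in blocks if block is not None]
    repaired = list(blocks)
    repaired[missing[0]] = xor_blocks(survivors + [check_block])
    return repaired


def handle_read(blocks: List[Optional[bytes]], check_block: bytes) -> bytes:
    """Serve a data acquisition request, recovering a lost block first if needed."""
    return b"".join(recover_set(blocks, check_block))


original = [b"ab", b"cd", b"ef"]
check = xor_blocks(original)
damaged = [b"ab", None, b"ef"]  # the second block is lost
print(handle_read(damaged, check))  # b'abcdef'
```

A loss of more than one block in the same set is beyond what this stand-in can repair; in the scheme described in the specification, such cases are handled by the global check blocks or by joint decoding.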
The data processing apparatus provided in this specification determines data block sets from the data blocks obtained by segmenting the data to be processed, determines a data block group according to adjacent data block sets among the at least two data block sets, and then processes the data block group according to a preset check algorithm to obtain an initial check block of the data block group. This facilitates subsequent recovery of the data to be processed based on the initial check block and safeguards the data to be processed. In addition, the target check block of the data to be processed is determined according to the initial check block, so that the initial check block can subsequently be recovered based on the target check block, which safeguards the initial check block; at the same time, the data redundancy ratio of the check block data is reduced, which further reduces the storage cost of the data.
The above is a schematic configuration of a data processing apparatus of the present embodiment. It should be noted that the technical solution of the data processing apparatus and the technical solution of the data processing method belong to the same concept, and details that are not described in detail in the technical solution of the data processing apparatus can be referred to the description of the technical solution of the data processing method.
FIG. 6 illustrates a block diagram of a computing device 600 provided in accordance with one embodiment of the present specification. The components of the computing device 600 include, but are not limited to, a memory 610 and a processor 620. The processor 620 is coupled to the memory 610 via a bus 630 and a database 650 is used to store data.
Computing device 600 also includes an access device 640, which enables computing device 600 to communicate via one or more networks 660. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the Internet. Access device 640 may include one or more of any type of network interface, wired or wireless (e.g., a Network Interface Card (NIC)), such as an IEEE 802.11 Wireless Local Area Network (WLAN) wireless interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and so forth.
In one embodiment of the present description, the above-described components of computing device 600, as well as other components not shown in FIG. 6, may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device architecture shown in FIG. 6 is for purposes of example only and is not limiting as to the scope of the present description. Those skilled in the art may add or replace other components as desired.
Computing device 600 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), mobile phone (e.g., smartphone), wearable computing device (e.g., smartwatch, smartglasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 600 may also be a mobile or stationary server.
The processor 620 is configured to execute computer-executable instructions which, when executed by the processor 620, implement the steps of the data processing method described above.
The above is an illustrative scheme of a computing device of the present embodiment. It should be noted that the technical solution of the computing device and the technical solution of the data processing method belong to the same concept, and details that are not described in detail in the technical solution of the computing device can be referred to the description of the technical solution of the data processing method.
An embodiment of the present specification further provides a computer-readable storage medium storing computer-executable instructions, which when executed by a processor implement the steps of the data processing method described above.
The above is an illustrative scheme of a computer-readable storage medium of the embodiment. It should be noted that the technical solution of the storage medium belongs to the same concept as the technical solution of the data processing method, and details that are not described in detail in the technical solution of the storage medium can be referred to the description of the technical solution of the data processing method.
An embodiment of the present specification further provides a computer program, wherein when the computer program is executed in a computer, the computer is caused to execute the steps of the data processing method.
The above is an illustrative scheme of a computer program of the present embodiment. It should be noted that the technical solution of the computer program and the technical solution of the data processing method belong to the same concept, and details that are not described in detail in the technical solution of the computer program can be referred to the description of the technical solution of the data processing method.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The computer instructions comprise computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content of the computer-readable medium may be increased or decreased as appropriate according to the requirements of legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunications signals.
It should be noted that, for simplicity of description, the foregoing method embodiments are described as a series of acts, but those skilled in the art should understand that the embodiments are not limited by the described order of acts, because some steps may be performed in other orders or simultaneously. Furthermore, those skilled in the art should also appreciate that the embodiments described in this specification are preferred embodiments, and that the acts and modules involved are not necessarily required by the embodiments of this specification.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to the related descriptions of other embodiments.
The preferred embodiments of the present specification disclosed above are intended only to aid in the description of the specification. The alternative embodiments do not describe every detail exhaustively, nor do they limit the invention to the precise embodiments described. Obviously, many modifications and variations are possible in light of the content of this specification. The embodiments were chosen and described in order to best explain the principles of the embodiments and their practical application, and thereby to enable others skilled in the art to understand and make good use of the specification. The specification is limited only by the claims and their full scope and equivalents.

Claims (14)

1. A method of data processing, comprising:
dividing data to be processed into a plurality of data blocks;
dividing the plurality of data blocks into at least two data block sets according to a calculation rule of a preset check algorithm;
determining a data block group according to an adjacent data block set in the at least two data block sets;
processing the data block group according to the preset check algorithm to obtain an initial check block of the data block group;
and determining a target check block of the data to be processed according to the initial check block.
2. The data processing method of claim 1, wherein the determining a data block group according to an adjacent data block set in the at least two data block sets comprises:
and determining at least two adjacent data block sets in the at least two data block sets as a data block group.
3. The data processing method according to claim 2, wherein the preset check algorithm is a first check algorithm;
correspondingly, the processing the data block group according to the preset check algorithm to obtain an initial check block of the data block group comprises:
determining a data block set in each data block group;
and processing each data block set in each data block group according to the first check algorithm to obtain an initial check block of each data block set in each data block group.
4. The data processing method according to claim 3, wherein the processing each data block set in each data block group according to the first check algorithm to obtain an initial check block of each data block set in each data block group comprises:
determining matrix parameters of the first check algorithm, and constructing a first check matrix according to the matrix parameters;
and calculating each data block set in each data block group according to the first check matrix to obtain an initial check block of each data block set in each data block group.
5. The data processing method of claim 3, wherein the determining the target check block of the data to be processed according to the initial check block comprises:
and processing the initial check block of each data block set in each data block group according to a second check algorithm to obtain a target check block of each data block group.
6. The data processing method according to claim 5, wherein the processing the initial check block of each data block set in each data block group according to the second check algorithm to obtain the target check block of each data block group comprises:
determining matrix parameters of a second check algorithm, and constructing a second check matrix according to the matrix parameters;
and calculating the initial check block of each data block set in each data block group according to the second check matrix to obtain the target check block of each data block group.
7. The data processing method of claim 1, wherein the determining a data block group according to an adjacent data block set in the at least two data block sets comprises:
determining at least two adjacent data block sets in the at least two data block sets and position information of data blocks in the at least two adjacent data block sets;
and determining the data block group according to the position information of the data blocks in the at least two adjacent data block sets.
8. The data processing method of claim 7, wherein the determining the data block group according to the position information of the data blocks in the at least two adjacent data block sets comprises:
determining the corresponding relation between the data block of each data block set in the at least two adjacent data block sets and the data block of other data block sets according to the position information of the data block in the at least two adjacent data block sets;
and determining the data block group according to the corresponding relation.
9. The data processing method of claim 7, wherein the preset check algorithm is a second check algorithm;
correspondingly, the processing the data block group according to the preset check algorithm to obtain an initial check block of the data block group comprises:
determining matrix parameters of a second check algorithm, and constructing a second check matrix according to the matrix parameters;
and calculating the data block group according to the second check matrix to obtain an initial check block of the data block group.
10. The data processing method of claim 9, wherein the determining the target check block of the data to be processed according to the initial check block further comprises:
determining matrix parameters of a first check algorithm, and constructing a first check matrix according to the matrix parameters;
and calculating the initial check block of the data block group according to the first check matrix to obtain a target check block of the data block group.
11. The data processing method according to claim 1, further comprising, after the determining the target check block of the data to be processed according to the initial check block:
receiving a data acquisition request sent by a user aiming at the data to be processed;
under the condition that the data block of the data to be processed meets the data recovery condition, performing data recovery on the data to be processed according to the initial check block or the target check block;
and sending the data to be processed after data recovery to the user.
12. A data processing apparatus comprising:
a slicing module configured to slice data to be processed into a plurality of data blocks;
a dividing module configured to divide the plurality of data blocks into at least two data block sets according to a calculation rule of a preset check algorithm;
a first determining module configured to determine a data block group according to an adjacent data block set of the at least two data block sets;
a processing module configured to process the data block group according to the preset check algorithm to obtain an initial check block of the data block group;
a second determining module configured to determine a target check block of the data to be processed according to the initial check block.
13. A computing device, comprising:
a memory and a processor;
the memory is for storing computer-executable instructions, and the processor is for executing the computer-executable instructions, which when executed by the processor, implement the steps of the data processing method of any one of claims 1 to 11.
14. A computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the steps of the data processing method of any one of claims 1 to 11.
CN202210023708.3A 2022-01-10 2022-01-10 Data processing method and device Pending CN114518846A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210023708.3A CN114518846A (en) 2022-01-10 2022-01-10 Data processing method and device

Publications (1)

Publication Number Publication Date
CN114518846A true CN114518846A (en) 2022-05-20

Family

ID=81596862

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210023708.3A Pending CN114518846A (en) 2022-01-10 2022-01-10 Data processing method and device

Country Status (1)

Country Link
CN (1) CN114518846A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115391093A (en) * 2022-08-18 2022-11-25 江苏安超云软件有限公司 Data processing method and system
CN115391093B (en) * 2022-08-18 2024-01-02 江苏安超云软件有限公司 Data processing method and system
CN115469818A (en) * 2022-11-11 2022-12-13 苏州浪潮智能科技有限公司 Disk array writing processing method, device, equipment and medium
CN115469818B (en) * 2022-11-11 2023-03-24 苏州浪潮智能科技有限公司 Disk array writing processing method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination