CN108319618B - Data distribution control method, system and device of distributed storage system

Info

Publication number
CN108319618B
Authority
CN
China
Prior art keywords
data
fault
tolerant
distribution
domain
Prior art date
Legal status
Active
Application number
CN201710036337.1A
Other languages
Chinese (zh)
Other versions
CN108319618A (en
Inventor
姚文辉
陆靖
吕鹏程
常艳军
朱家稷
Current Assignee
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201710036337.1A
Publication of CN108319618A
Application granted
Publication of CN108319618B

Classifications

    • G06F16/182 Distributed file systems
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • H04L67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]

Abstract

A data distribution control method, system and device for a distributed storage system are disclosed. The distribution control system determines, from a plurality of distribution strategies provided by the distributed storage system, the distribution strategy to be adopted for first data, wherein the plurality of distribution strategies include a strategy of distribution across fault-tolerant domains and a strategy of distribution within one fault-tolerant domain; it then allocates a fault-tolerant domain for the first data and writes the data according to the adopted distribution strategy and the topological relation of the distributed storage system. By adding a distribution-attribute setting for data, the method and device can accommodate the differing requirements of different data.

Description

Data distribution control method, system and device of distributed storage system
Technical Field
The present invention relates to a distributed storage system, and more particularly, to a method, a system, and an apparatus for controlling data distribution of a distributed storage system.
Background
In current large-scale distributed storage systems, to keep data accessible when a problem occurs in a particular fault-tolerant domain, multiple copies of the data are stored across fault-tolerant domains to guard against the data-availability problems caused by the failure of a single fault-tolerant domain. For example, in the Hadoop Distributed File System (HDFS), multiple copies of data are distributed to different racks for storage. In HDFS one rack constitutes a fault-tolerant domain (also called a failure domain), which represents the physical unit of failure; by placing the copies in different racks, the data remains accessible when the power supply of a certain rack or its corresponding switch fails.
Some distributed storage systems are deployed across geographic regions. A power outage or a network-infrastructure failure in a region does occur from time to time, making the data stored by the system in that region inaccessible and, in turn, causing failures in upper-layer applications.
In the related art, data distribution control for a distributed storage system first collects the topology of the system, generates the topological relation among all storage nodes, and automatically divides the system into fault-tolerant domains. When data is created, the number of copies to be written is specified, and the multiple copies are stored in multiple fault-tolerant domains, i.e., distributed across fault-tolerant domains. When a fault-tolerant domain fails and copies of the data are lost, the number of copies can be restored through data replication.
The inventors of the present application have found that the above data distribution control method effectively treats all fault-tolerant domains as equivalent: the inter-domain network bandwidths are assumed to be the same, the unit costs the same, and the inter-domain transmission delays the same. This is not the case in practice. For example, in a data center with a layered network architecture, the network bandwidth between machines under the same rack can reach the full network-card bandwidth with a delay within 0.3 ms, whereas between machines whose traffic passes through an aggregation switch (PSW) the bandwidth drops to about 1/3 of that and the delay approaches 0.5 ms. When the storage system is deployed across regions, the networks between regions are generally rate-limited and offer even less bandwidth, while the longer transmission distance increases the transmission delay roughly in proportion.
The related art applies the same distribution strategy to all data and ignores the differences among data. For example, some data has high availability requirements but a small volume, while other data has low availability requirements but a large volume and demands high read/write throughput. With a single distribution strategy, data with high availability requirements may fail to withstand large-scale faults, data with low availability requirements may not achieve the required read/write speed, and the network is burdened unnecessarily.
In addition, the related art uses a single-level division of fault-tolerant domains, for example dividing domains by rack. This makes it difficult to reflect the differences among fault-tolerant domains at different levels and harms the rationality and effectiveness of data distribution. For example, because of the layered network architecture, the bandwidth between two racks is not necessarily the same as within a rack; traffic may traverse a core switch or even the Internet, and once the bandwidth between two racks is small, data writing slows down. One remedy is to adjust the network architecture so that the bandwidth between any two nodes in the system is the same, but the cost of the network equipment and the architectural change is high, wiring inside the machine room becomes more difficult, the scale the network can support shrinks, and the goal of reducing cost is not achieved.
Disclosure of Invention
In view of this, an embodiment of the present invention provides a data distribution control method for a distributed storage system, including:
determining, from a plurality of distribution strategies provided by the distributed storage system, a distribution strategy to be adopted for first data, wherein the plurality of distribution strategies comprise a strategy of distribution across fault-tolerant domains and a strategy of distribution within one fault-tolerant domain;
and allocating a fault-tolerant domain for the first data and writing the data according to the adopted distribution strategy and the topological relation of the distributed storage system.
An embodiment of the invention also provides a data distribution control system in a distributed storage system, comprising a policy determination module and an allocating and writing module, wherein:
the policy determination module is configured to: determine, from a plurality of distribution strategies provided by the distributed storage system, a distribution strategy to be adopted for first data, wherein the plurality of distribution strategies comprise a strategy of distribution across fault-tolerant domains and a strategy of distribution within one fault-tolerant domain;
the allocating and writing module is configured to: allocate a fault-tolerant domain for the first data and write the data according to the adopted distribution strategy and the topological relation of the distributed storage system.
An embodiment of the invention also provides a distribution control device in a distributed storage system, comprising a processor and a memory, wherein:
the memory is configured to: saving the program code;
the processor is configured to: reading the program code to perform the following data distribution control process:
determining, from a plurality of distribution strategies provided by the distributed storage system, a distribution strategy to be adopted for first data, wherein the plurality of distribution strategies comprise a strategy of distribution across fault-tolerant domains and a strategy of distribution within one fault-tolerant domain;
and allocating a fault-tolerant domain for the first data and writing the data according to the adopted distribution strategy and the topological relation of the distributed storage system.
The above scheme adds a distribution-attribute setting for data and can therefore accommodate the differing requirements of different data. For example, critical data can be distributed across fault-tolerant domains so that read/write service is still available when a single fault-tolerant domain fails, while non-critical data is distributed within one fault-tolerant domain, which avoids additional inter-domain bandwidth cost and improves read/write performance.
Drawings
FIG. 1 is a diagram of a hierarchical network architecture of storage nodes in a distributed storage system according to an embodiment of the present invention;
FIG. 2 is a flow chart of a data distribution control method according to an embodiment of the present invention;
FIG. 3 is a block diagram of a distributed control system in accordance with an embodiment of the present invention;
FIG. 4 is a diagram of a hierarchical network architecture of storage nodes in a distributed storage system according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail below with reference to the accompanying drawings. It should be noted that the embodiments and features of the embodiments in the present application may be arbitrarily combined with each other without conflict.
Example one
Fig. 1 is a hierarchical network architecture of storage nodes in a distributed storage system (also referred to as a system for short in this document) according to this embodiment, which includes the following nodes:
A storage node is the entity that stores data, also called a machine (Machine).
An access switch (ASW) is a switch that provides network access for storage nodes; typically, the machines in one rack are connected to the same access switch.
An aggregation switch (PSW) is an aggregation point for multiple access-layer switches; it handles all traffic from the access-layer devices and provides an uplink to the core layer.
A core switch (DSW) is a switch deployed in the core layer (the network backbone). The core layer provides an optimized, reliable backbone transport structure through high-speed forwarding.
The distributed storage system of this embodiment is deployed across regions, and devices in different regions are connected through a backbone network. A region here refers to a geographic area such as Beijing, Shanghai or Guangzhou, but depending on the network deployment it may also be a country or another geographic division.
Besides the storage nodes, the distributed storage system also comprises a distribution control system and clients. The distribution control system determines the data distribution strategy, allocates fault-tolerant domains, writes data, and receives and executes user instructions, among other functions. A client exchanges signaling with the distribution control system and reads and writes data with the storage nodes.
It should be noted that the network architecture differs among distributed storage systems of different scales and requirements. For example, some distributed storage systems are deployed within a single region, and some levels of the above architecture, such as the aggregation switches, are optional.
The topological relation of the distributed storage system mentioned herein refers to the topological relation of the storage nodes, and includes the above hierarchical network architecture, and also includes attribute information such as names of the storage nodes, storage spaces of the fault-tolerant domains, and the like.
Based on the above network architecture, this embodiment divides the distributed storage system into multiple levels of fault-tolerant domains, which include, in order from the lowest level to the highest:
A machine-level fault-tolerant domain: one machine, or a group of machines accessing the same access switch, constitutes a fault-tolerant domain at this level; each machine may contain multiple disks, to which multiple copies of the same data can be written. A machine-level fault-tolerant domain is denoted Machine in the figure.
An access-level fault-tolerant domain: one or more access switches connected to the same upper-level switch (a PSW or DSW in this embodiment), together with the devices under them, constitute a fault-tolerant domain at this level; one access-level fault-tolerant domain may contain one or more machine-level fault-tolerant domains. An access-level fault-tolerant domain usually corresponds to one or more racks and is denoted Rack in the figure.
An aggregation-level fault-tolerant domain: one or more aggregation switches connected to the same upper-level switch (a DSW in this embodiment), together with the devices under them, constitute a fault-tolerant domain at this level; one aggregation-level fault-tolerant domain may contain one or more access-level fault-tolerant domains. It is denoted Pod in the figure.
A data-center-level fault-tolerant domain: one or more core-layer switches in the same data center, together with the devices under them, constitute a fault-tolerant domain at this level; one data-center-level fault-tolerant domain may contain one or more aggregation-level fault-tolerant domains. It usually corresponds to one computer room, although a data center may contain several computer rooms. It is denoted Zone in the figure, and the core-layer switches in a Zone can access the Internet through optical fiber.
A region-level fault-tolerant domain: a distributed storage system deployed across regions can also divide region-level fault-tolerant domains, each of which may contain one or more data-center-level fault-tolerant domains in the same region. The region-level fault-tolerant domain is not shown in the figure and is denoted Region in the text.
Generally, when data is distributed across fault-tolerant domains on two machines, the higher the level of the fault-tolerant domain that is crossed, the smaller the network bandwidth between the machines and the higher the transmission delay and bandwidth cost, but the stronger the ability of the data to withstand faults; conversely, the lower the level of the crossed fault-tolerant domain, the larger the network bandwidth, the lower the transmission delay and bandwidth cost, but the weaker the defense against faults. For example, the bandwidth between machines within the same Rack (crossing only Machine-level domains) is the largest and the delay the smallest. The bandwidth between machines across Racks but within the same Pod is smaller, with a larger delay. The bandwidth between machines across Pods but within the same Zone is smaller still, typically only about 1/3 of a single machine's network-card bandwidth, with correspondingly higher transmission delay and bandwidth cost. The bandwidth between machines across Zones but within the same Region is typically larger than that between machines across Regions; the bandwidth between machines in different Regions depends on the Internet bandwidth allocated and is far smaller than the sum of the network-card bandwidths of the machines in any one Region of the system.
Herein, a fault-tolerant domain level B that is one level higher than a fault-tolerant domain level A is referred to as the level above A, and level A as the level below B. For example, the level above Machine is Rack and the level below Rack is Machine; the level above Rack is Pod and the level below Pod is Rack, and so on. This also depends on how the fault-tolerant domain hierarchy is divided; for example, if no Pod level is divided, the level above Rack is Zone and the level below Zone is Rack.
The division of the fault-tolerant domain hierarchy depends on the network architecture of the distributed storage system; for example, when the system is not deployed across regions, no region-level fault-tolerant domain is divided. The division may also differ for the same network architecture: based on the network architecture of this embodiment, one or more levels of fault-tolerant domains may be divided, such as only Zone; only Rack and Zone; only Rack, Pod and Zone; or only Machine, Rack and Zone.
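As a purely illustrative sketch (not taken from the embodiment), the multi-level fault-tolerant domain topology described above could be modeled roughly as follows; the class name FaultDomain, its fields and the LEVELS list are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Fault-tolerant domain levels, ordered from low to high, per this embodiment.
LEVELS = ["Machine", "Rack", "Pod", "Zone", "Region"]

@dataclass
class FaultDomain:
    name: str                     # e.g. "RegionA.ZoneA.Rack03.Machine12" (hypothetical naming)
    level: str                    # one of LEVELS
    free_bytes: int = 0           # storage-space attribute kept in the topology
    children: List["FaultDomain"] = field(default_factory=list)
    parent: Optional["FaultDomain"] = None

    def add_child(self, child: "FaultDomain") -> None:
        child.parent = self
        self.children.append(child)

    def ancestor_at(self, level: str) -> Optional["FaultDomain"]:
        """Walk upwards to find the enclosing fault-tolerant domain at the given level."""
        node: Optional[FaultDomain] = self
        while node is not None and node.level != level:
            node = node.parent
        return node
```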
Based on the above architecture and the division of the fault tolerant domain, this embodiment provides a data distribution control method of a distributed storage system, as shown in fig. 2, including:
Step 110: determining, from a plurality of distribution strategies provided by the distributed storage system, a distribution strategy to be adopted for first data, wherein the plurality of distribution strategies comprise a strategy of distribution across fault-tolerant domains and a strategy of distribution within one fault-tolerant domain;
Step 120: allocating a fault-tolerant domain for the first data and writing the data according to the adopted distribution strategy and the topological relation of the distributed storage system.
The first data mentioned herein may be any kind of data stored in the distributed storage system and is not limited to any particular piece of data.
The distribution of the first data refers to how the multiple copies of the first data (currently, typically 3) are placed in the system. For example, suppose the first data has 3 copies. If a strategy of distribution across fault-tolerant domains (referred to as "cross-domain distribution", i.e., distribution over multiple fault-tolerant domains) is adopted, 2 fault-tolerant domains are typically allocated, with 2 copies stored in one fault-tolerant domain and 1 copy in another, so that the data survives the failure of a single fault-tolerant domain and availability is improved. If the strategy of distribution within one fault-tolerant domain ("intra-domain distribution") is adopted, all 3 copies are stored in the same fault-tolerant domain, for example on different disks of one machine, and the data is read and written locally on that machine, which improves read/write speed and saves bandwidth between fault-tolerant domains. Of course, the invention does not exclude other copy counts, such as 2 copies or 4 or more copies, and when cross-domain distribution is used, the number of allocated fault-tolerant domains may also be 3 or more.
In this embodiment, determining the distribution strategy adopted by the first data according to the availability requirement of the first data includes:
if the first data is high-availability data that is required to remain accessible when a single fault-tolerant domain fails, determining that the first data adopts the strategy of distribution across fault-tolerant domains;
if the first data is low-availability data whose access may be suspended, and which can be regenerated, when a fault-tolerant domain fails, determining that the first data adopts the strategy of distribution within one fault-tolerant domain.
The availability requirement of remaining accessible when a single fault-tolerant domain fails means that, once cross-domain distribution is in place, the data can still be accessed when any single fault-tolerant domain fails.
The availability requirement of the first data (whether it is high-availability or low-availability data) may be determined from the system's default configuration. Alternatively, the availability requirement may be customized by users (such as back-office maintainers or business users of the distributed storage system), who may designate certain data as high-availability data and other data as low-availability data; the system then determines the availability requirement of the data according to the user's customization.
In another embodiment, the user may directly specify the distribution policy for the first data, that is, the distribution policy specified by the user for the first data is determined as the distribution policy adopted by the first data.
In yet another embodiment, the user specifies the fault-tolerant domain directly for the first data, in which case, if the user specifies one fault-tolerant domain for the first data, it may be determined that the first data employs a policy that is distributed within one fault-tolerant domain, and if the user specifies multiple fault-tolerant domains for the first data, it may be determined that the first data employs a policy that is distributed across the fault-tolerant domains.
In practical applications, even when data must be distributed across domains, the availability requirements may still differ. The distributed storage system of this embodiment is divided into multiple levels of fault-tolerant domains, so the strategy of distribution across fault-tolerant domains can be subdivided into several strategies, such as distribution across Machines, across Racks, across Pods, across Zones, or across Regions, and the availability, bandwidth and latency offered by distribution across fault-tolerant domains of different levels differ. Therefore, when it is determined that the first data adopts a strategy of distribution across fault-tolerant domains, this embodiment further configures the level of the crossed fault-tolerant domain according to user customization or the system default, so as to better satisfy differentiated data requirements. In an actual system not all levels need to be used; only some of them may be. In addition, the number of allocated fault-tolerant domains defaults to 2; if more are needed, the number can be specified explicitly when the distribution strategy is determined.
As described above, when data is distributed across fault-tolerant domains on two machines, the higher the level of the crossed domain, the smaller the inter-machine bandwidth and the higher the transmission delay and bandwidth cost, but the stronger the ability to withstand faults. For cross-domain high-availability data, the availability requirement can therefore be refined accordingly: high-availability data that must remain accessible when a single Rack fails can adopt the strategy of distribution across Racks; high-availability data that must remain accessible when a single Zone fails can adopt the strategy of distribution across Zones, and so on. Which level of fault-tolerant domain the first data is distributed across may be set by the system default or customized by the user. The system may also predefine several refined availability requirements under high-availability data (still accessible when a single Rack fails, still accessible when a single Zone fails, and so forth), and the user customizes one of them for the first data. Alternatively, the system may define multiple cross-domain distribution strategies with different fault-tolerant domain levels, and the user directly designates one of them for the first data, thereby completing the customization of the level. These approaches can also be combined: for example, if the user has not customized anything, the system by default adopts a strategy of distribution across Zones (it could also be across Racks or another level) for high-availability data, and if the user has customized a level (e.g., specified distribution across Zones), the user-customized strategy takes precedence.
For the intra-domain distribution strategy no level needs to be defined, since no domain is crossed and the data is typically written to a single storage node. It is, however, also possible to specify within which level of fault-tolerant domain the data must stay, for example within one Machine, to guarantee good read/write performance.
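For illustration only, the strategy-determination logic described above (a user-specified strategy, user-specified fault-tolerant domains, or an availability requirement combined with a configurable crossed level) might be sketched as follows; the Strategy enum and the function name are hypothetical.

```python
from enum import Enum
from typing import List, Optional, Tuple

class Strategy(Enum):
    INTRA_DOMAIN = "intra_domain"   # all copies kept within one fault-tolerant domain
    CROSS_DOMAIN = "cross_domain"   # copies spread over multiple fault-tolerant domains

def determine_strategy(is_high_availability: bool,
                       user_strategy: Optional[Strategy] = None,
                       user_domains: Optional[List[str]] = None,
                       cross_level: str = "Zone") -> Tuple[Strategy, str]:
    """Priority: a strategy the user specified directly, then fault-tolerant domains
    the user specified directly, then the availability requirement (system default
    or user customization). Returns (strategy, crossed fault-tolerant domain level)."""
    if user_strategy is not None:
        return user_strategy, cross_level
    if user_domains is not None:
        strategy = Strategy.CROSS_DOMAIN if len(user_domains) > 1 else Strategy.INTRA_DOMAIN
        return strategy, cross_level
    if is_high_availability:
        # high-availability data is distributed across fault-tolerant domains at a
        # level taken from user customization or the system default
        return Strategy.CROSS_DOMAIN, cross_level
    return Strategy.INTRA_DOMAIN, "Machine"
```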
In this embodiment, when allocating a fault-tolerant domain to the first data according to the adopted distribution strategy and the topological relation of the distributed storage system, if the first data adopts the intra-domain distribution strategy, the fault-tolerant domain in which the generator of the first data is located is allocated to it, and the multiple copies of the first data are preferentially written to the same storage node within that domain. Allocating the generator's own fault-tolerant domain makes writing convenient and saves bandwidth resources, and writing all copies to the same storage node gives better read/write performance and makes the data easier to replicate. Here the generator of the first data may be the process that produces it, and that process may be scheduled to run in different fault-tolerant domains.
In this embodiment, when allocating fault-tolerant domains to the first data according to the adopted distribution strategy and the topological relation of the distributed storage system, if the first data adopts a cross-domain distribution strategy and the level of the crossed fault-tolerant domain has been determined, multiple fault-tolerant domains at that level are allocated to the first data, preferably all within the same fault-tolerant domain of the level above. For example, when the first data adopts a strategy of distribution across Racks, 2 Rack-level fault-tolerant domains are allocated, preferably within the same Pod (or, if there is no Pod level, within the same Zone). In this way, while the availability requirement is met, the bandwidth between the allocated fault-tolerant domains is larger and the delay smaller, which facilitates writing and replicating the data.
Strictly speaking, the actual system allocates "storage nodes within the fault-tolerant domain" to the first data. Because this application mainly discusses the distribution of data over fault-tolerant domains, the shorthand "allocating a fault-tolerant domain" is used for brevity; it does not mean that no storage nodes need to be allocated.
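A rough sketch of the allocation step, reusing the hypothetical FaultDomain and Strategy definitions from the earlier sketches: intra-domain data is placed in the generator's own domain, while cross-domain data receives several domains at the determined level, preferably under the same parent domain.

```python
from typing import List

def allocate_domains(topology_root: "FaultDomain",
                     writer_machine: "FaultDomain",
                     strategy: "Strategy",
                     cross_level: str = "Zone",
                     num_domains: int = 2) -> List["FaultDomain"]:
    if strategy is Strategy.INTRA_DOMAIN:
        # intra-domain: use the fault-tolerant domain where the data generator runs
        return [writer_machine]
    # cross-domain: prefer sibling domains under the same parent of the writer's
    # domain at cross_level (larger bandwidth and lower delay between them)
    local = writer_machine.ancestor_at(cross_level)
    candidates: List["FaultDomain"] = []
    if local is not None and local.parent is not None:
        candidates = [d for d in local.parent.children if d.free_bytes > 0]
    if len(candidates) < num_domains:
        # fall back to any domain at that level with free space in the whole topology
        candidates = [d for d in domains_at_level(topology_root, cross_level)
                      if d.free_bytes > 0]
    return candidates[:num_domains]

def domains_at_level(root: "FaultDomain", level: str) -> List["FaultDomain"]:
    found = [root] if root.level == level else []
    for child in root.children:
        found.extend(domains_at_level(child, level))
    return found
```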
In this embodiment, when the first data adopts a strategy of distribution across fault-tolerant domains, fault-tolerant domains are allocated to the first data and the data is written according to the adopted distribution strategy and the topological relation of the distributed storage system in one of the following two ways:
In the simultaneous writing mode, when the number M of normal fault-tolerant domains of the distributed storage system is greater than or equal to the number N of fault-tolerant domains required by the adopted distribution strategy, N fault-tolerant domains are allocated to the first data and written simultaneously; when M is less than N, M fault-tolerant domains are allocated and written simultaneously, and after the number of normal fault-tolerant domains reaches N the first data is distributed over N fault-tolerant domains through data replication, where N and M are positive integers and N is at least 2; or
In the time-shared writing mode, the first data is first written into one allocated fault-tolerant domain and then distributed over N fault-tolerant domains through data replication.
For example, the metadata of an upper-layer application system has a small volume but a high availability requirement and must withstand the failure of a single fault-tolerant domain even during the writing process, so the simultaneous writing mode is used for it. For backup data with a low availability requirement, the time-shared writing mode can be adopted.
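The two writing modes might, purely illustratively, be modeled as follows; the function write_with_mode and its return convention are hypothetical, and background replication is only noted in comments.

```python
from typing import Dict, List

def write_with_mode(data: bytes, normal_domains: List[str], required_n: int,
                    mode: str = "simultaneous") -> Dict[str, bytes]:
    """Illustrative only. Returns a map of domain -> copy actually written now;
    copies still owed are left to background data replication (not modeled here)."""
    placed: Dict[str, bytes] = {}
    if mode == "simultaneous":
        # write into min(M, N) normal domains at once; if M < N the remaining copies
        # are supplemented by data replication once enough domains are normal again
        for domain in normal_domains[:required_n]:
            placed[domain] = data
    elif mode == "time_shared":
        # write into a single allocated domain first, replicate to N domains later
        if normal_domains:
            placed[normal_domains[0]] = data
    return placed

# Example: 3 normal Zones, strategy requires 2 -> both written immediately.
print(write_with_mode(b"metadata", ["ZoneA", "ZoneB", "ZoneC"], required_n=2))
```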
After data distribution is completed, the topology of the distributed storage system may change, for example because a fault-tolerant domain fails. Through safety checks the system can detect whether a copy of the first data is missing and, if so, initiate data replication to supplement the missing copy. Failures of nodes at different levels cause different degrees of harm, so this embodiment handles them differently:
when a copy of the first data is lost due to the failure of a machine-level fault-tolerant domain, data replication is initiated immediately to supplement the missing copy;
when a copy of the first data is lost due to the failure of a fault-tolerant domain at a level other than the machine level, a preset fault waiting time is first allowed to elapse; if the fault has not been cleared when the waiting time expires, data replication is initiated to supplement the missing copy.
For example, when a machine-level fault-tolerant domain becomes abnormal, the data on it is usually unavailable or lost for a long time, so data replication must be initiated immediately during fault handling to recover the stored data urgently. A switch failure, by contrast, is typically a power or traffic anomaly that makes the domain unavailable only briefly, occurs relatively often, and does not lose the data on the machines attached below it, so when the corresponding fault-tolerant domain (Rack, Pod, Zone) is abnormal, data replication is generally started with a delay. Likewise, when a Zone fails because of a backbone-network anomaly, for example damage caused by construction work, or because of an abnormal power outage in an area, data replication may also be delayed.
In addition, when data replication is initiated because a copy of the first data is missing, the source location and target location of the replication are determined first, a corresponding traffic quota is applied for on every level of fault-tolerant domain that the traffic traverses from source to target, and replication starts only after the application succeeds. Taking cross-Zone distribution of the first data as an example, suppose there are 2 copies in ZoneA and 1 copy in ZoneB, and one copy in ZoneA is lost, so one copy must be supplemented. If the copy in ZoneB is chosen as the source of the replication, then to save traffic the target location is preferentially selected within ZoneB as well. Assuming the source and target are on different Racks, before replication starts a corresponding traffic quota is applied for on every level of fault-tolerant domain along the path from source to target, including Machine, Rack and Pod (since the replication traffic passes through the aggregation switch). Applying for traffic quotas effectively controls the network traffic and prevents network congestion from rendering the system unable to work normally.
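A minimal sketch, under assumed wait times and a dictionary-based quota table, of the two recovery mechanisms just described: level-dependent delayed replication and per-level traffic-quota application along the replication path.

```python
from typing import Dict, List

# Hypothetical fault-wait times per fault-tolerant domain level (seconds).
FAULT_WAIT_SECONDS = {"Machine": 0, "Rack": 600, "Pod": 600, "Zone": 1800}

def should_replicate_now(failed_level: str, fault_age_seconds: float) -> bool:
    """Machine-level faults trigger replication immediately; for higher levels the
    system first waits in case the fault clears before the wait time expires."""
    return fault_age_seconds >= FAULT_WAIT_SECONDS.get(failed_level, 0)

def acquire_quota_along_path(quota_left: Dict[str, int],
                             path_domains: List[str],
                             bytes_needed: int) -> bool:
    """Reserve a traffic quota on every fault-tolerant domain that the replication
    traffic traverses from source to target; roll back if any level refuses."""
    granted: List[str] = []
    for name in path_domains:
        if quota_left.get(name, 0) >= bytes_needed:
            quota_left[name] -= bytes_needed
            granted.append(name)
        else:
            for g in granted:                 # roll back partially granted quota
                quota_left[g] += bytes_needed
            return False
    return True
```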
The present embodiment further provides a distribution control system in a distributed storage system, as shown in fig. 3, including a policy determining module 10 and an allocating and writing module 20, where:
the policy determination module 10 is arranged to: determining a distribution strategy adopted by the first data from a plurality of distribution strategies provided by a distributed storage system, wherein the plurality of distribution strategies comprise a strategy distributed across fault-tolerant domains and a strategy distributed in one fault-tolerant domain;
the distribution and writing module 20 is arranged to: and distributing a fault-tolerant domain for the first data and writing the data according to the adopted distribution strategy and the topological relation of the distributed storage system.
Alternatively,
the policy determination module determines a distribution policy to be used by the first data from a plurality of distribution policies provided by the distributed storage system, including:
determining a distribution strategy adopted by the first data according to the availability requirement of the first data; or
Determining a distribution strategy designated by a user for the first data as a distribution strategy adopted by the first data; or
And determining a distribution strategy adopted by the first data according to one or more fault tolerance domains specified by a user for the first data.
Alternatively,
the policy determination module determines a distribution policy adopted by the first data according to the availability requirement of the first data, and the policy determination module includes:
if the first data is high-availability data which is required to be still accessible when a single fault-tolerant domain fails, determining that the first data adopts a strategy distributed across the fault-tolerant domains;
if the first data is low available data which can stop being accessed and regenerated when the fault-tolerant domain fails, determining that the first data adopts a strategy distributed in the fault-tolerant domain;
the availability requirement of the first data is determined according to a default configuration of the system or a user customization or an indication of an external system.
Alternatively,
the distributed storage system is divided into fault-tolerant domains of one or more layers of a machine layer, an access layer, a convergence layer, a data center layer and a region layer.
Alternatively,
the strategy determination module determines a distribution strategy adopted by the first data, and the strategy determination module comprises the following steps: and when determining that the first data adopts a strategy distributed across fault-tolerant domains, determining the hierarchy of the crossed fault-tolerant domains according to system default configuration or user customization.
Alternatively,
when the policy determining module determines that the first data adopts a policy distributed in a fault-tolerant domain, the allocating and writing module allocates the fault-tolerant domain for the first data and writes the data, including: distributing the fault-tolerant domain where the generator of the first data is located for the first data, and preferentially writing a plurality of copies of the first data into the same storage node in the distributed fault-tolerant domain.
Alternatively,
when the policy determining module determines that the first data adopts a policy distributed across fault-tolerant domains, the allocating and writing module allocates the fault-tolerant domains for the first data and writes the data according to the adopted distribution policy and the topological relation of the distributed storage system, including:
distributing N fault-tolerant domains for the first data and writing in simultaneously when the number M of normal fault-tolerant domains of the distributed storage system is larger than or equal to the number N of fault-tolerant domains required by the adopted distribution strategy by adopting a simultaneous writing mode; when M is less than N, M fault-tolerant domains are distributed for the first data and written in at the same time, and after the number of normal fault-tolerant domains reaches N, the first data is distributed to N fault-tolerant domains through data copying, wherein N and M are positive integers, and N is more than or equal to 2; or
And writing the first data into an allocated fault-tolerant domain in a time-sharing writing mode, and distributing the first data to N fault-tolerant domains by data copying.
Alternatively,
the distributed control system further comprises a security check module and a data recovery module, wherein:
the security check module is configured to: when the fault-tolerant domain fault is detected to cause the loss of the copy of the first data, the data recovery module is informed and carries the hierarchy information of the fault-tolerant domain with the fault;
the data recovery module is configured to: and after receiving the notification, determining whether the machine layer has a fault according to the layer information, if so, immediately initiating data copying to supplement the first data missing copy, and if not, firstly passing a set fault waiting time, and if the fault waiting time is up, initiating data copying to supplement the first data missing copy.
Alternatively,
the distributed control system further comprises a security check module and a data recovery module, wherein:
the security check module is configured to: when the fault-tolerant domain fault is detected to cause the loss of the copy of the first data, the data recovery module is informed;
the data recovery module is configured to: and initiating data replication after receiving the notification, determining a source position and a target position of the data replication, applying for a corresponding flow limit on each level of fault-tolerant domain from the source position to the target position, and starting the data replication after the application is successful.
Each module of the distribution control system of this embodiment implements part of the processing of the data distribution control method of this embodiment; the other processing described in the method can likewise be implemented by the corresponding modules and is not repeated here.
The present invention does not restrict how the modules of the distribution control system are actually deployed. They may all be deployed on one entity, such as a metadata server of the distributed storage system, or be deployed separately on several entities; they may share an entity with other functional modules or be deployed independently. The modules may also be divided differently, for example with policy determination and writing combined into a data-writing module and fault-tolerant domain allocation placed in a distribution control module. As long as the functions of the distribution control system are the same, different module divisions and names make no substantial difference, and such systems fall within the protection scope of the present invention.
The present embodiment further provides a distributed control apparatus in a distributed storage system, including a processor and a memory, wherein:
the memory is configured to: saving the program code;
the processor is configured to: reading the program code to perform the following distributed control processing:
determining a distribution strategy adopted by the first data from a plurality of distribution strategies provided by a distributed storage system, wherein the plurality of distribution strategies comprise a strategy distributed across fault-tolerant domains and a strategy distributed in one fault-tolerant domain;
and distributing a fault-tolerant domain for the first data and writing the data according to the adopted distribution strategy and the topological relation of the distributed storage system.
The processor may also execute other processing in the distributed control method of this embodiment, which is not described in detail herein.
Example two
Fig. 4 is a topology diagram of the storage nodes in the distributed storage system of this embodiment, showing storage nodes, access switches and core switches; for the meaning of these nodes, refer to Fig. 1 in the first embodiment. In this embodiment the fault-tolerant domain hierarchy is divided into 3 levels, namely Machine, Rack and Zone. The distributed storage system includes a plurality of Zones, of which ZoneA and ZoneB are shown in the figure, and each Zone includes a plurality of Racks.
Based on the above topology, each process of the data distribution control method of the present embodiment is described below.
The data distribution control method comprises a topological relation establishing process, a data writing process and a data recovery process. Wherein:
the process of establishing the topological relation comprises the following steps:
Step one: a storage node generates topology information and carries it when registering; in addition to the identification of the storage node itself, the topology information includes the identification of its fault-tolerant domains.
In this embodiment, when each storage node is deployed, besides the node's own identification, identification information for the Machine (which may be omitted, since the Machine is the storage node itself), the Rack and the Zone in which the node is located is also generated. This identification information can be encoded in the name of the storage node and reported to the distribution control system during registration, which also lets a user tell the node's position in the topology from its name.
Step two: after receiving the registration information of a storage node, the distribution control system updates the topological relation of the system, adds the node, and updates the content of each affected fault-tolerant domain.
The content of a fault-tolerant domain includes information such as its media space (storage capacity). Besides generating and updating the topological relation of the storage nodes of the whole system, the distribution control system can also accept management commands and edit the automatically generated topology. For example, if the automatically generated topology does not match the actually required one, it can be adjusted by changing the position information of existing nodes through an operation-and-maintenance tool.
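As an illustrative sketch of the naming convention mentioned in step one (the dotted format shown is an assumption, not the embodiment's actual scheme), the fault-tolerant domain identifiers could be encoded in and parsed from a storage node's name as follows.

```python
from typing import Dict

def encode_node_name(region: str, zone: str, rack: str, machine: str) -> str:
    """Encode the fault-tolerant domain identifiers into the node name."""
    return f"{region}.{zone}.{rack}.{machine}"

def parse_node_name(name: str) -> Dict[str, str]:
    """Recover the topology position (used when the node registers)."""
    region, zone, rack, machine = name.split(".")
    return {"Region": region, "Zone": zone, "Rack": rack, "Machine": machine}

# Example with hypothetical identifiers: a node deployed in ZoneA, Rack03.
node = encode_node_name("RegionEast", "ZoneA", "Rack03", "Machine12")
print(parse_node_name(node))   # {'Region': 'RegionEast', 'Zone': 'ZoneA', ...}
```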
the data writing process comprises the following steps:
Step one: the distribution control system determines the distribution strategy to be adopted for each kind of data.
the embodiment divides the data into high available data and low available data according to the availability requirement of the data, and the high available data is divided into two types, the first type of high available data is required to be still accessible when a single Zone fails, and the second type of high available data is required to be still accessible when a single Rack fails. The low availability data can be stopped from being accessed and regenerated when any layer of fault-tolerant domain fails. The system defaults the metadata or configuration information and other data of the upper application system into first high-availability data, adopts a cross-Zone distribution strategy, and the data requires very high availability and has small data traffic. Data such as personal pictures, various transaction logs, server access logs, personal photos and videos in cloud storage in the social software are defaulted to second high-availability data, and the data are not allowed to be lost and unavailable for a long time by adopting a strategy of distribution across Rack. The system defaults data such as intermediate files and the like generated in the parallel operation process of the large-scale data set into low available data, the data can be regenerated through a recalculation method, but the read-write throughput is required to be very high, and the data can be written in the same storage node. The default of the data availability may be embodied in the correspondence between the data type of the system configuration and the distribution policy.
This embodiment also allows users to customize the availability requirement of data: a given type of data can be designated as first-class high-availability, second-class high-availability or low-availability data. For example, critical data that is very important to the user, such as financial or business data, can be designated first-class high-availability data, and unimportant temporary files can be designated low-availability data. For data customized by the user, the availability requirement follows the customization; data that is neither customized by the user nor recognized or configured by the system defaults to second-class high-availability data and is distributed across Racks.
This embodiment also allows a user to specify one or more fault-tolerant domains directly for data. For example, when a distributed storage service is provided, the set of machines storing a user's data is usually called a service cluster (Cluster); the machines of a Cluster may belong to multiple fault-tolerant domains at a certain level or all to the same fault-tolerant domain. According to the Cluster configured by the user, the system can thus realize cross-domain or intra-domain distribution of the written data.
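For illustration only, the default correspondence between data type and distribution strategy, together with the user-customization override described in step one, could be expressed as a simple lookup; the category names and strategy labels below are hypothetical restatements of the examples in the text.

```python
from typing import Dict, Optional

# System-default mapping from data category to distribution strategy.
DEFAULTS: Dict[str, str] = {
    "app_metadata":      "cross_zone",     # first-class high-availability data
    "app_config":        "cross_zone",
    "social_media_file": "cross_rack",     # second-class high-availability data
    "access_log":        "cross_rack",
    "intermediate_file": "intra_machine",  # low-availability, regenerable data
}

def strategy_for(data_type: str, user_override: Optional[str] = None) -> str:
    if user_override is not None:       # user customization takes precedence
        return user_override
    # unidentified or unconfigured data defaults to the cross-Rack strategy
    return DEFAULTS.get(data_type, "cross_rack")
```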
Step two, when first data needs to be stored, the distribution control system distributes a fault-tolerant domain for the first data according to a distribution strategy adopted by the first data and a topological relation of the system;
when allocating the fault-tolerant domain for the first data, in addition to considering the requirement of the distribution policy, it is determined whether there are a sufficient number of fault-tolerant domains to allocate and whether there is sufficient storage space in the fault-tolerant domains according to the topological relation. As mentioned above, this step requires allocating a storage node in the fault-tolerant domain to the first data, which is briefly described as allocating the fault-tolerant domain.
For low-availability data adopting an intra-domain distribution strategy, a fault-tolerant domain where a data generator is located is preferentially allocated to the low-availability data by a distribution control system, or the fault-tolerant domain is called as a local fault-tolerant domain, so that the highest bandwidth of write-in data is ensured.
Step three: the distribution control system writes the first data into the allocated fault-tolerant domain.
During writing, compliance with the distribution strategy need not be enforced strictly. In an actual system, when fault-tolerant domains at some level are abnormal, forcing the copies to be written into different fault-tolerant domains would likely make the write fail for lack of enough normal domains and thus reduce the availability of the system. When the system has enough normal fault-tolerant domains, however, the write must follow the distribution strategy.
For example, suppose the first data adopts the strategy of distribution across Zones. The data is written to multiple Zones essentially to keep it available when the switches corresponding to a single Zone have a problem, so the copies should be written to different Zones as far as possible. Suppose the first data needs 3 copies: when 2 Zones are normal, 2 copies are written in one Zone and 1 copy in the other. If one Zone is currently abnormal and only one Zone is available, all 3 copies may first be written to that one Zone; forcing a write to 2 Zones at this moment would simply fail and reduce availability. After the abnormal Zone recovers, the first data can be copied to another Zone by background processing until the distribution requirement is satisfied.
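A small sketch of this non-forced write behaviour: copies are spread over the wanted number of normal Zones when possible, and otherwise all land in the available Zones, leaving redistribution to background replication. The function and its return format are hypothetical.

```python
from typing import Dict, List

def place_copies(copies: int, normal_zones: List[str], wanted_zones: int = 2) -> Dict[str, int]:
    """Return how many copies go to each Zone. If fewer than wanted_zones are normal,
    all copies land in the available Zones instead of failing the write; background
    replication redistributes them once the failed Zone recovers."""
    zones = normal_zones[:wanted_zones]
    if not zones:
        return {}
    placement: Dict[str, int] = {z: 0 for z in zones}
    for i in range(copies):
        placement[zones[i % len(zones)]] += 1
    return placement

print(place_copies(3, ["ZoneA", "ZoneB"]))   # {'ZoneA': 2, 'ZoneB': 1}
print(place_copies(3, ["ZoneA"]))            # {'ZoneA': 3} until the other Zone recovers
```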
The data recovery process includes:
Step one: the distribution control system monitors changes in the topological relation and promptly updates the media-space information of the storage nodes in fault-tolerant domains at every level.
Through a heartbeat mechanism, the distribution control system can detect which fault-tolerant domains at which level have failed.
Step two: when the distribution control system finds that a fault-tolerant domain failure has caused a copy of the first data to be lost, it initiates a data replication process.
the different fault-tolerant domains have different error types, different occurrence probabilities, different influence ranges and different influences on data reliability and availability. The occurrence probability is very high, such as disk error, ASW crash, PSW packet loss and the like. The embodiment has different processing methods when errors occur in different levels of fault-tolerant domains. If the copy of the first data is lost due to Machine failure, immediately initiating data copying to supplement the copy of the first data; if the failure of the Rack and the Zone causes the loss of the copy of the first data (generally, the data loss cannot be caused), the set failure waiting time is firstly passed, and if the failure waiting time is up, the data copying is started again if the failure is not eliminated.
Step three: the distribution control system looks up the distribution requirement of the first data and the positions of the surviving copies (source positions), and computes a new position (target position) for the copy that needs to be supplemented.
The position of a copy comprises the fault-tolerant domains at every level in which it resides and the storage node on which it is stored. The copy to be supplemented is reallocated to a new position, which becomes the target position of the data replication, while the position of a surviving copy serves as the source position.
When selecting the source and target positions, this embodiment fully considers the network transmission cost between the two points to avoid high-cost data recovery. For example, suppose the system has two Zones, ZoneA and ZoneB, and the data requires 3 copies, two of which were originally written in ZoneA and one in ZoneB. If a Rack failure in ZoneA causes one copy to be lost and the copy in ZoneB is selected as the replication source, then a storage node in ZoneB is preferentially selected as the target position, avoiding cross-Zone traffic during replication, which lowers the cost and speeds up recovery.
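The target-selection preference in step three might be sketched as follows; the node-record format is hypothetical.

```python
from typing import Dict, List, Optional

def choose_target(source_zone: str,
                  candidate_nodes: List[Dict[str, str]]) -> Optional[Dict[str, str]]:
    """candidate_nodes: [{'name': ..., 'zone': ..., 'rack': ...}, ...].
    Prefer a node in the same Zone as the source copy to avoid cross-Zone traffic;
    fall back to another Zone only if the source Zone has no usable node."""
    same_zone = [n for n in candidate_nodes if n["zone"] == source_zone]
    if same_zone:
        return same_zone[0]
    return candidate_nodes[0] if candidate_nodes else None

nodes = [{"name": "m1", "zone": "ZoneB", "rack": "Rack07"},
         {"name": "m2", "zone": "ZoneA", "rack": "Rack02"}]
print(choose_target("ZoneB", nodes))   # picks m1, keeping the new copy inside ZoneB
```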
Step four: the distribution control system applies for the corresponding traffic quota on every level of fault-tolerant domain along the path from the source position to the target position, and starts the data replication after the application succeeds.
By defining total traffic limits between fault-tolerant domains and applying for traffic quotas during data recovery, the network-traffic surge caused by recovery can be effectively suppressed, preventing recovery itself from pushing more fault-tolerant domains into abnormal network states and enlarging the impact of the fault.
EXAMPLE III
This embodiment relates to data distribution control for an upper-layer application system. Taking a user-facing cloud disk system as the example of an upper-layer application system, it illustrates how distribution control is performed when data is stored. The cloud disk system uses a distributed storage system, for example that of embodiment one or embodiment two, as its underlying storage system.
In the application scenario that a user stores data through a cloud disk, the corresponding data distribution control method comprises the following steps:
Step one: the cloud disk system offers users cloud disks of different availability: a high-availability cloud disk and a low-availability cloud disk.
The high-availability cloud disk may be a paid product for which availability metrics, such as fault-recovery time, are promised to the user, while the low-availability cloud disk is free and carries no such promise.
Step two: the cloud disk system determines the availability requirement of the data according to the type of cloud disk chosen by the user, i.e., data stored on a high-availability cloud disk is treated as high-availability data and data stored on a low-availability cloud disk as low-availability data.
When storing data on a cloud disk, the user chooses according to how important availability is: a user who uses the data frequently and cannot tolerate long unavailability may purchase a high-availability cloud disk, while data that is used infrequently and can tolerate long unavailability may be stored on a free cloud disk.
Step three, when writing data into the distributed storage system, the cloud disk system informs the distributed storage system of the availability requirement of the data.
That is, the cloud disk system indicates high-availability data for data to be saved to a high-availability cloud disk, and indicates low-availability data for data to be saved to a free cloud disk.
Step four, the distributed storage system adopts a cross-domain distribution strategy, such as cross-Zone distribution, for the high-availability data, and a strategy of distribution within the same fault-tolerant domain for the low-availability data.
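As an illustration of this mapping, the sketch below ties the cloud disk type chosen by the user to the availability level and distribution strategy passed to the distributed storage system; the enum names and the write-request structure are assumptions, not terms from the embodiment.
```python
# Illustrative sketch only: mapping the cloud disk type selected by the user
# to an availability requirement and a distribution strategy. Names are assumed.
from enum import Enum

class DiskType(Enum):
    HIGH_AVAILABILITY = "high"   # paid disk, availability promised to the user
    FREE = "free"                # free disk, no availability promise

class Availability(Enum):
    HIGH = "high"
    LOW = "low"

class DistributionStrategy(Enum):
    CROSS_DOMAIN = "cross_zone"   # copies spread across fault-tolerant domains
    INTRA_DOMAIN = "single_zone"  # copies kept inside one fault-tolerant domain

def availability_for(disk: DiskType) -> Availability:
    return Availability.HIGH if disk is DiskType.HIGH_AVAILABILITY else Availability.LOW

def strategy_for(avail: Availability) -> DistributionStrategy:
    return (DistributionStrategy.CROSS_DOMAIN if avail is Availability.HIGH
            else DistributionStrategy.INTRA_DOMAIN)

# Example: the cloud disk system tags each write with the availability requirement.
write_request = {
    "data": b"...",
    "availability": availability_for(DiskType.HIGH_AVAILABILITY).value,
    "strategy": strategy_for(Availability.HIGH).value,
}
```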
With the method of this embodiment, users' differing requirements on data availability can be satisfied and corresponding services can be provided.
Although the cloud disk system is taken as an example in the present embodiment, the method of the present embodiment may also be adopted for other data storage service systems.
Example Four
This embodiment also relates to data distribution control for an upper-layer application system. The upper-layer application system in this embodiment is, as an example, a social software system such as DingTalk, and the embodiment illustrates how distribution control is performed when its data is stored. The social software system uses the distributed storage system, such as that of embodiment one or embodiment two, as its underlying storage system.
In the application scenario of social software data storage, the corresponding data distribution control method comprises the following steps:
Step one, the distributed storage system determines the corresponding distribution strategy and/or write mode according to the Service Level Agreement (SLA) requirements of the social software system;
The SLA of the social software system places availability requirements on the data, such as 99.9%, 99.99%, and so on. A higher SLA level requires higher system availability, with shorter downtime and failover time. For an SLA of 99.9% or above, the data is required to remain accessible when a single fault-tolerant domain fails, otherwise the SLA cannot be met, so a strategy of cross-domain distribution, such as cross-Zone distribution, is adopted. When the SLA is 99.9%, the data is written in the time-sharing writing mode; when the SLA requirement is 99.99% or above, the data is written in the simultaneous writing mode.
The distribution strategy and/or write mode corresponding to the SLA of the social software system may be specified by maintenance personnel of the distributed storage system according to the SLA, or a mapping between SLA levels and distribution strategies and/or write modes may be configured in advance so that the setting is applied automatically according to the SLA of the social software system (this corresponds to one of the default configurations described above).
Step two, when the distributed storage system receives the data of the social software system, it looks up the assigned distribution strategy and write mode, allocates fault-tolerant domains for the first data according to the strategy found, and writes the data in the write mode found.
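A minimal sketch of such an SLA-to-configuration lookup follows; the thresholds come from this embodiment, while the function and field names, and the behavior below 99.9%, are assumptions for illustration.
```python
# Illustrative sketch only: deriving distribution strategy and write mode from an SLA.
# Thresholds (99.9%, 99.99%) follow the embodiment; names and the <99.9% case are assumed.
from typing import Tuple

def config_for_sla(sla: float) -> Tuple[str, str]:
    """Return (distribution_strategy, write_mode) for a given SLA fraction, e.g. 0.999."""
    if sla >= 0.9999:
        return ("cross_zone", "simultaneous")   # write all Zones at the same time
    if sla >= 0.999:
        return ("cross_zone", "time_sharing")   # write one Zone first, replicate later
    return ("intra_zone", "time_sharing")       # assumed default below 99.9%

assert config_for_sla(0.9999) == ("cross_zone", "simultaneous")
assert config_for_sla(0.999) == ("cross_zone", "time_sharing")
```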
It should be noted that for other upper-layer application systems, too, the availability requirement of the data may be determined according to the SLA level; this is not limited to the social software system.
Example Five
This embodiment relates to data distribution control of the distributed storage system in a big data real-time computing scenario.
In the application scenario of big data computation, the corresponding data distribution control method comprises the following steps:
Step one, by default the distributed storage system treats the input and output data of big data computation as high-availability data and adopts a cross-domain distribution strategy for them, while the intermediate data generated during the computation is treated by default as low-availability data and a strategy of distribution within one fault-tolerant domain is adopted;
The cross-domain distribution strategy in this embodiment is a cross-Zone distribution strategy. Because intermediate data can be regenerated, this embodiment treats it as low-availability data and distributes it within one fault-tolerant domain only, in order to save space and network traffic. Taking Taobao's online-shopping computation as an example, the input data includes the users' purchase and review data, logistics information provided by logistics companies, and the like; the output data includes information statistically derived from the input data, such as the total transaction amount, the number of transactions, the store with the largest transaction amount, and so on. The intermediate data is data that must be kept temporarily while the output data is computed from the input data; since it can be regenerated, it can be treated as low-availability data.
Step two, the distributed storage system designates fault-tolerant domains for the big data computation to store its data;
Designating the fault-tolerant domains is optional. In a big data real-time computing scenario, for example when real-time computation must be performed for the large-screen live broadcast of Double 11 Taobao transactions, the requirement on computation efficiency is very high, and a distribution strategy that designates specific fault-tolerant domains can then be adopted. For example, all input and output data, regardless of the region of the data source, may be saved into 2 Zones designated in advance. By designating specific fault-tolerant domains, data sources from different regions can be imported into the same fault-tolerant domains and processed centrally, so that bandwidth is used efficiently and the execution efficiency, for example the efficiency of the data computation, is improved.
Step three, for the different kinds of data generated by the big data computation, the distributed storage system selects fault-tolerant domains according to the corresponding distribution strategy and writes the data.
When writing the data, the simultaneous writing mode can be adopted if the real-time requirement on the data is high; otherwise, the time-sharing writing mode can be adopted.
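The data-type classification used in this embodiment can be sketched as below; the type names and the real-time flag are assumptions used only for illustration.
```python
# Illustrative sketch only: default placement for the data generated by a big data job.
# Type names and the real-time flag are assumptions, not terms from the embodiment.
def placement_for(data_kind: str, realtime: bool = False) -> dict:
    if data_kind in ("input", "output"):
        return {
            "availability": "high",
            "strategy": "cross_zone",                       # e.g. 2 designated Zones
            "write_mode": "simultaneous" if realtime else "time_sharing",
        }
    if data_kind == "intermediate":
        # Intermediate data can be regenerated, so keep it in a single fault-tolerant
        # domain to save space and network traffic.
        return {"availability": "low", "strategy": "intra_zone", "write_mode": "time_sharing"}
    raise ValueError(f"unknown data kind: {data_kind}")

print(placement_for("output", realtime=True))
```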
Example Six
This embodiment relates to data distribution control during cloud platform data backup. Data backup is one of the basic functions of a cloud platform; a large amount of data generated by the cloud platform every day, such as business data and operation logs, needs to be backed up.
The distributed storage system that implements the cloud platform's data backup in this embodiment may adopt a layered network architecture as in embodiment one or embodiment two, in which the fault-tolerant domains are divided into the data-center-layer fault-tolerant domain Zone and the access-layer fault-tolerant domain Rack. When backup data of the cloud platform, such as a database snapshot, is saved, it needs to be written into the corresponding fault-tolerant domains of the distributed storage system according to the distribution strategy and the write strategy.
In the application scenario of cloud platform data backup, the corresponding data distribution control method comprises the following steps:
Step one, a cross-Zone distribution strategy is adopted for the database snapshot, with the number of fault-tolerant domains spanned set to 2;
In the data backup scenario of this embodiment, since the imported data traffic is very large, the time-sharing writing mode is adopted, that is, the imported data is first written into the nearby local Zone and then asynchronously copied to the other Zone in the background.
Step two, during the backup process, the state information generated by the backup operation is distributed across domains in the simultaneous writing mode, while the backup data itself is written in the time-sharing mode;
During backup, a kind of data, namely the replication state information, is generated; it records which data have been successfully backed up (once a piece of data is backed up, the source data can be deleted immediately to accommodate more data), which data are currently being backed up, and so on. This state information is indispensable for normal replication and needs higher availability, so this embodiment writes the state information of the replication process in the simultaneous writing mode, that is, writes it to the 2 Zones at the same time, to ensure that data import can continue normally when a single Zone has a problem. Specifically, for the state information generated by the replication process, if both Zones of the system are normal when the data is written, the 2 Zones are written simultaneously. If only one of the two Zones is normal and the other is down due to power failure or network disconnection, then, to ensure that the write succeeds, multiple copies of this metadata are written to the Zone where the data generator is located; after the other Zone returns to normal, one copy of the metadata is replicated to it to complete the distribution.
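A minimal sketch of this degraded-write behavior for the replication state information follows, assuming two Zones and simple helper functions; the helper names and the copy count used in the degraded path are assumptions, not values fixed by the embodiment.
```python
# Illustrative sketch only: simultaneous write of replication state information with a
# degraded path when one Zone is unavailable. Names and the degraded copy count are assumed.
from typing import Dict, List

def store(zone: str, record: bytes) -> None:
    pass  # placeholder for the actual write into the given Zone

def write_state_info(record: bytes, zones: Dict[str, bool], local_zone: str,
                     degraded_copies: int = 2) -> List[str]:
    """Return the list of Zones the record was written to."""
    healthy = [z for z, ok in zones.items() if ok]
    if len(healthy) >= 2:
        # Normal case: write the state record to both Zones at the same time.
        for z in healthy[:2]:
            store(z, record)
        return healthy[:2]
    # Degraded case: keep several copies in the local Zone so the write still succeeds;
    # a copy is replicated to the other Zone later, once it recovers.
    for _ in range(degraded_copies):
        store(local_zone, record)
    return [local_zone]

# Example: ZoneB is unreachable, so the record lands in ZoneA only for now.
written = write_state_info(b"backup-progress", {"ZoneA": True, "ZoneB": False}, "ZoneA")
```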
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments. Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the embodiments of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (19)

1. A data distribution control method of a distributed storage system includes:
determining a distribution strategy adopted by the first data from a plurality of distribution strategies provided by a distributed storage system, wherein the plurality of distribution strategies comprise a strategy distributed across fault-tolerant domains and a strategy distributed in one fault-tolerant domain;
distributing a fault-tolerant domain for the first data and writing the data according to the adopted distribution strategy and the topological relation of the distributed storage system, wherein the method comprises the following steps:
if the first data adopts a strategy of intra-domain distribution, distributing a fault-tolerant domain where a generator of the first data is located for the first data, and preferentially writing a plurality of copies of the first data into the same storage node in the distributed fault-tolerant domain;
if the first data adopts a cross-domain distribution strategy and the level of the cross-fault-tolerant domain is determined, distributing a plurality of fault-tolerant domains for the first data in the determined level, and preferentially distributing the fault-tolerant domains in the same fault-tolerant domain of the previous level;
when first data is written by adopting a strategy of crossing fault-tolerant domain distribution, a simultaneous writing mode is adopted, and when the number M of normal fault-tolerant domains of a distributed storage system is greater than or equal to the number N of fault-tolerant domains required by the adopted distribution strategy, N fault-tolerant domains are distributed for the first data and written simultaneously; when M is less than N, M fault-tolerant domains are distributed for the first data and written in at the same time, and after the number of normal fault-tolerant domains reaches N, the first data is distributed to N fault-tolerant domains through data copying, wherein N and M are positive integers, and N is more than or equal to 2; or
And writing the first data into an allocated fault-tolerant domain in a time-sharing writing mode, and distributing the first data to N fault-tolerant domains by data copying.
2. The method of claim 1, wherein:
determining a distribution strategy adopted by the first data from a plurality of distribution strategies provided by the distributed storage system, wherein the distribution strategies comprise:
determining a distribution strategy adopted by the first data according to the availability requirement of the first data; or
Determining a distribution strategy designated by a user for the first data as a distribution strategy adopted by the first data; or
And determining a distribution strategy adopted by the first data according to one or more fault tolerance domains specified by a user for the first data.
3. The method of claim 2, wherein:
determining a distribution strategy adopted by the first data according to the availability requirement of the first data, wherein the distribution strategy comprises the following steps:
if the first data is high-availability data which is required to be still accessible when a single fault-tolerant domain fails, determining that the first data adopts a strategy distributed across the fault-tolerant domains;
if the first data is low available data which can stop being accessed and regenerated when the fault-tolerant domain fails, determining that the first data adopts a strategy distributed in the fault-tolerant domain;
the availability requirement of the first data is determined according to a default configuration of the system or a user customization or an indication of an external system.
4. The method of claim 1, wherein:
the distributed storage system is divided into fault-tolerant domains of one or more layers of a machine layer, an access layer, a convergence layer, a data center layer and a region layer.
5. The method of claim 1, wherein:
the determining of the distribution strategy adopted by the first data comprises: and when determining that the first data adopts a strategy distributed across fault-tolerant domains, determining the hierarchy of the crossed fault-tolerant domains according to system default configuration or user customization.
6. The method of claim 4, wherein:
after allocating a fault-tolerant domain for the first data and writing data, the method further comprises:
when the copy of the first data is lost due to the fault-tolerant domain failure of the machine layer, immediately initiating data replication to supplement the copy of the first data;
when the fault-tolerant domain faults of other layers except the machine layer cause the loss of the copy of the first data, a set fault waiting time is firstly passed, and after the fault waiting time is up, if the fault is not eliminated, data copying is initiated to supplement the copy of the first data loss.
7. The method of claim 4, wherein:
after allocating a fault-tolerant domain for the first data and writing the data, the method further comprises:
in the data copying process initiated by the missing of the first data copy, determining a source position and a target position of the data copy, applying for a corresponding flow limit on each level of fault-tolerant domain from the source position to the target position, and starting the data copy after the application is successful.
8. The method of claim 1, wherein:
determining a distribution strategy adopted by the first data from a plurality of distribution strategies provided by the distributed storage system, wherein the determination comprises one or more of the following determination modes:
when the first data is data to be stored in a high-availability cloud disk, a cross-domain distribution strategy is adopted; when the first data is data to be stored in the low-availability cloud disk, a strategy of distribution in the same fault-tolerant domain is adopted;
determining a distribution strategy adopted by first data according to a service level agreement related to the first data, wherein when the service level agreement is greater than or equal to 99.9%, the first data is determined to adopt a cross-domain distribution strategy;
when the first data is input and output data of big data calculation, a cross-domain distribution strategy is adopted; and when the first data is intermediate data generated by big data calculation, a strategy of distribution in a fault-tolerant domain is adopted.
9. The method of claim 1, wherein:
when the first data is database backup data, a time-sharing writing mode is adopted; and when the first data is the data of the state information generated during database backup, a simultaneous writing mode is adopted.
10. A data distribution control system in a distributed storage system comprises a strategy determination module and a distribution and writing module, and is characterized in that:
the policy determination module is configured to: determining a distribution strategy adopted by the first data from a plurality of distribution strategies provided by a distributed storage system, wherein the plurality of distribution strategies comprise a strategy distributed across fault-tolerant domains and a strategy distributed in one fault-tolerant domain;
the distribution and write module is configured to: distributing a fault-tolerant domain for the first data and writing the data according to the adopted distribution strategy and the topological relation of the distributed storage system, wherein the method comprises the following steps:
if the first data adopts a strategy of intra-domain distribution, distributing a fault-tolerant domain where a generator of the first data is located for the first data, and preferentially writing a plurality of copies of the first data into the same storage node in the distributed fault-tolerant domain;
if the first data adopts a cross-domain distribution strategy and the level of the cross fault-tolerant domain is determined, distributing a plurality of fault-tolerant domains for the first data in the determined level, and preferentially distributing the first data in the same fault-tolerant domain of the previous level;
when first data is written by adopting a strategy of crossing fault-tolerant domain distribution, a simultaneous writing mode is adopted, and when the number M of normal fault-tolerant domains of a distributed storage system is greater than or equal to the number N of fault-tolerant domains required by the adopted distribution strategy, N fault-tolerant domains are distributed for the first data and written simultaneously; when M is less than N, M fault-tolerant domains are distributed for the first data and written simultaneously, and after the number of normal fault-tolerant domains reaches N, the first data are distributed to the N fault-tolerant domains through data copying, wherein N and M are positive integers, and N is more than or equal to 2; or
And writing the first data into an allocated fault-tolerant domain in a time-sharing writing mode, and distributing the first data to N fault-tolerant domains by data copying.
11. The distributed control system of claim 10, wherein:
the policy determination module determines a distribution policy to be used by the first data from a plurality of distribution policies provided by the distributed storage system, including:
determining a distribution strategy adopted by the first data according to the availability requirement of the first data; or
Determining a distribution strategy designated by a user for the first data as a distribution strategy adopted by the first data; or
And determining a distribution strategy adopted by the first data according to one or more fault tolerance domains specified by a user for the first data.
12. The distributed control system of claim 11, wherein:
the policy determination module determines a distribution policy adopted by the first data according to the availability requirement of the first data, and the policy determination module includes:
if the first data is high-availability data which is required to be still accessible when a single fault-tolerant domain fails, determining that the first data adopts a strategy distributed across the fault-tolerant domains;
if the first data is low available data which can stop being accessed and regenerated when the fault-tolerant domain fails, determining that the first data adopts a strategy distributed in the fault-tolerant domain;
the availability requirement of the first data is determined according to a default configuration of the system or a user customization or an indication of an external system.
13. The distributed control system of claim 10, wherein:
the distributed storage system is divided into fault-tolerant domains of one or more layers of a machine layer, an access layer, a convergence layer, a data center layer and a region layer.
14. The distributed control system of claim 10, wherein:
the strategy determination module determines a distribution strategy adopted by the first data, and the strategy determination module comprises the following steps: and when determining that the first data adopts a strategy distributed across fault-tolerant domains, determining the hierarchy of the crossed fault-tolerant domains according to system default configuration or user customization.
15. The distributed control system of claim 13, wherein:
the distributed control system further comprises a security check module and a data recovery module, wherein:
the security check module is configured to: when the fault-tolerant domain fault is detected to cause the loss of the copy of the first data, the data recovery module is informed and carries the hierarchy information of the fault-tolerant domain with the fault;
the data recovery module is configured to: and after receiving the notification, determining whether the machine layer has a fault according to the layer information, if so, immediately initiating data copying to supplement the first data missing copy, and if not, firstly passing a set fault waiting time, and if the fault waiting time is up, initiating data copying to supplement the first data missing copy.
16. The distributed control system of claim 13, wherein:
the distributed control system further comprises a security check module and a data recovery module, wherein:
the security check module is configured to: when the fault-tolerant domain fault is detected to cause the loss of the copy of the first data, the data recovery module is informed;
the data recovery module is configured to: and initiating data replication after receiving the notification, determining a source position and a target position of the data replication, applying for a corresponding flow limit on each level of fault-tolerant domain from the source position to the target position, and starting the data replication after the application is successful.
17. The distributed control system of claim 10, wherein:
the strategy determining module determines a distribution strategy adopted by the first data from a plurality of distribution strategies provided by the distributed storage system, and comprises one or more of the following determination modes:
when the first data is data to be stored in a high-availability cloud disk, a cross-domain distribution strategy is adopted; when the first data is data to be stored in the low-availability cloud disk, a strategy of distribution in the same fault-tolerant domain is adopted;
determining a distribution strategy adopted by first data according to a service level agreement related to the first data, wherein when the service level agreement is greater than or equal to 99.9%, the first data is determined to adopt a cross-domain distribution strategy;
when the first data is input and output data of big data calculation, a cross-domain distribution strategy is adopted; and when the first data is intermediate data generated by big data calculation, a strategy of distribution in a fault-tolerant domain is adopted.
18. The distributed control system of claim 10, wherein:
the distribution and writing module adopts a time-sharing writing mode when the first data is database backup data; and when the first data is the data of the state information generated during database backup, a simultaneous writing mode is adopted.
19. A distributed control apparatus in a distributed storage system, comprising a processor and a memory, characterized in that:
the memory is arranged to: saving the program code;
the processor is configured to: reading the program code to perform the following data distribution control process:
determining a distribution strategy adopted by the first data from a plurality of distribution strategies provided by a distributed storage system, wherein the plurality of distribution strategies comprise a strategy distributed across fault-tolerant domains and a strategy distributed in one fault-tolerant domain;
distributing a fault-tolerant domain for the first data and writing the data according to the adopted distribution strategy and the topological relation of the distributed storage system, wherein the method comprises the following steps:
if the first data adopts a strategy of intra-domain distribution, distributing a fault-tolerant domain where a generator of the first data is located for the first data, and preferentially writing a plurality of copies of the first data into the same storage node in the distributed fault-tolerant domain;
if the first data adopts a cross-domain distribution strategy and the level of the cross fault-tolerant domain is determined, distributing a plurality of fault-tolerant domains for the first data in the determined level, and preferentially distributing the first data in the same fault-tolerant domain of the previous level;
when first data is written by adopting a strategy of crossing fault-tolerant domain distribution, a simultaneous writing mode is adopted, and when the number M of normal fault-tolerant domains of a distributed storage system is greater than or equal to the number N of fault-tolerant domains required by the adopted distribution strategy, N fault-tolerant domains are distributed for the first data and written simultaneously; when M is less than N, M fault-tolerant domains are distributed for the first data and written simultaneously, and after the number of normal fault-tolerant domains reaches N, the first data are distributed to the N fault-tolerant domains through data copying, wherein N and M are positive integers, and N is more than or equal to 2; or
And writing the first data into an allocated fault-tolerant domain in a time-sharing writing mode, and distributing the first data to N fault-tolerant domains by data copying.
CN201710036337.1A 2017-01-17 2017-01-17 Data distribution control method, system and device of distributed storage system Active CN108319618B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710036337.1A CN108319618B (en) 2017-01-17 2017-01-17 Data distribution control method, system and device of distributed storage system

Publications (2)

Publication Number Publication Date
CN108319618A CN108319618A (en) 2018-07-24
CN108319618B true CN108319618B (en) 2022-05-06

Family

ID=62891713

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710036337.1A Active CN108319618B (en) 2017-01-17 2017-01-17 Data distribution control method, system and device of distributed storage system

Country Status (1)

Country Link
CN (1) CN108319618B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109450668A (en) * 2018-10-16 2019-03-08 深信服科技股份有限公司 A kind of cloud host creation method and relevant apparatus
CN111061423B (en) * 2018-10-17 2023-09-26 杭州海康威视***技术有限公司 Data processing method and device and management node
CN109981741A (en) * 2019-02-26 2019-07-05 启迪云计算有限公司 A kind of maintaining method of distributed memory system
CN110263099B (en) * 2019-06-21 2021-07-13 北京小米移动软件有限公司 Data synchronization flow adjustment method, device, equipment and storage medium
CN111104069B (en) * 2019-12-20 2024-02-06 北京金山云网络技术有限公司 Multi-region data processing method and device of distributed storage system and electronic equipment
CN114553899A (en) * 2022-01-30 2022-05-27 阿里巴巴(中国)有限公司 Storage device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101187931A (en) * 2007-12-12 2008-05-28 浙江大学 Distribution type file system multi-file copy management method
CN101751309A (en) * 2009-12-28 2010-06-23 北京理工大学 Optimized transcript distributing method in data grid
CN102306157A (en) * 2011-07-12 2012-01-04 中国人民解放军国防科学技术大学 Energy-saving-oriented high-reliability data storage method in data center environment
CN103631666A (en) * 2012-08-24 2014-03-12 中兴通讯股份有限公司 Data redundancy fault-tolerance adaptation management device, service device, system and method
CN103778255A (en) * 2014-02-25 2014-05-07 深圳市中博科创信息技术有限公司 Distributed file system and data distribution method thereof
WO2016146023A1 (en) * 2015-03-19 2016-09-22 阿里巴巴集团控股有限公司 Distributed computing system and method

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101739556B1 (en) * 2010-11-15 2017-05-24 삼성전자주식회사 Data storage device, user device and data write method thereof
CN102752381A (en) * 2012-06-28 2012-10-24 北京邮电大学 Multi-movable-duplicate mechanism applied to distributed storage and access method thereof
CN102780763B (en) * 2012-06-29 2015-03-04 华中科技大学 Distributed home subscriber server (HSS) data storage method and distributed HSS data extraction method
US9465649B2 (en) * 2013-04-15 2016-10-11 International Business Machines Corporation Executing distributed globally-ordered transactional workloads in replicated state machines
CN103986694B (en) * 2014-04-23 2017-02-15 清华大学 Control method of multi-replication consistency in distributed computer data storing system
CN105760391B (en) * 2014-12-18 2019-12-13 华为技术有限公司 Method, data node, name node and system for dynamically redistributing data

Also Published As

Publication number Publication date
CN108319618A (en) 2018-07-24

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant