CN114741029A - Data distribution method applied to deduplication storage system and related equipment

Info

Publication number
CN114741029A
Authority
CN
China
Prior art keywords
information
updated
update
data
file data
Prior art date
Legal status
Granted
Application number
CN202210278975.5A
Other languages
Chinese (zh)
Other versions
CN114741029B (en)
Inventor
罗来龙
郭得科
程葛瑶
任棒棒
孙宇辰
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202210278975.5A
Priority claimed from CN202210278975.5A
Publication of CN114741029A
Application granted
Publication of CN114741029B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0602 Interfaces specifically adapted to achieve a particular effect
    • G06F 3/0608 Saving storage space on storage systems
    • G06F 3/0628 Interfaces making use of a particular technique
    • G06F 3/0629 Configuration or reconfiguration of storage systems
    • G06F 3/0638 Organizing or formatting or addressing of data
    • G06F 3/064 Management of blocks
    • G06F 3/0646 Horizontal data movement in storage systems, i.e. moving data in between storage devices or systems
    • G06F 3/0647 Migration mechanisms
    • G06F 3/0652 Erasing, e.g. deleting, data cleaning, moving of data to a wastebasket

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a data distribution method applied to a deduplication storage system and related equipment. The method comprises the following steps: constructing a hypergraph model according to an initial storage file data set that arrives at the deduplication storage system in advance; obtaining distribution information of updated storage file data through a preset hypergraph divider, based on the hypergraph model and the updated storage file data that continuously arrives at the deduplication storage system; and distributing the updated storage file data, according to the distribution information, to the server corresponding to the distribution information among the plurality of servers. The generation time of the distribution scheme is shortened, the access span information is effectively limited, and access efficiency is thereby improved.

Description

Data distribution method applied to deduplication storage system and related equipment
Technical Field
The present application relates to the field of data storage technologies, and in particular, to a data allocation method and related device for a deduplication storage system.
Background
With the advent of social networks, cloud video services, and the like, the volume of data in the big-data era is growing at an unprecedented rate. At the same time, a large amount of duplicate data is written to storage systems. Duplicated data inevitably consumes unnecessary storage space and network transmission, which places a great burden on the storage system, so redundant data needs to be deleted to save space.
Based on the above, distributed storage systems in the prior art eliminate redundant data blocks through deduplication. However, this reduces access efficiency: because fewer block copies are maintained, the blocks of a file may be scattered across several independent servers, so excessive communication rounds are required to retrieve a file, which increases network congestion and input/output resource consumption. Furthermore, current strategies that address this problem rely on rewriting techniques or supplemental data copies and therefore inevitably occupy more space.
Disclosure of Invention
In view of the above, an objective of the present invention is to provide a data allocation method and related apparatus applied to a deduplication storage system, so as to solve the above technical problems.
In view of the above, a first aspect of the present application provides a data allocation method applied to a deduplication storage system, where the deduplication storage system includes a plurality of servers, the method including:
constructing a hypergraph model according to an initial storage file data set which arrives at the deduplication storage system in advance;
obtaining distribution information of updated storage file data through a preset hypergraph divider, based on the hypergraph model and the updated storage file data which continuously arrives at the deduplication storage system;
and distributing the updated storage file data to a server corresponding to the distribution information in the plurality of servers according to the distribution information.
A second aspect of the present application provides a data distribution apparatus applied to a deduplication storage system, including:
the model building module is configured to build a hypergraph model according to an initial storage file data set which arrives at the deduplication storage system in advance;
the hypergraph dividing module is configured to obtain distribution information of the updated storage file data through a preset hypergraph divider based on the hypergraph model and the updated storage file data which continuously arrives at the deduplication storage system;
the distribution module is configured to distribute the updated storage file data to the servers corresponding to the distribution information in the plurality of servers according to the distribution information.
A third aspect of the application provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of the first aspect when executing the program.
From the above, it can be seen that the data allocation method and related device applied to the deduplication storage system provided by the application construct a hypergraph model according to the initial storage file data set that arrives at the deduplication storage system in advance, and convert the problem of minimizing the access time of all storage file data queries into a hypergraph partitioning problem so as to improve access efficiency. The updated storage file data that continuously arrives at the deduplication storage system is allocated, through the hypergraph divider, to the servers corresponding to the allocation information. The storage file data is thus updated while re-partitioning of the hypergraph model is avoided, and the existing allocated regions are incrementally adjusted according to the updated storage file data. This shortens the generation time of the allocation scheme, effectively limits the access span information, avoids network congestion and input/output resource consumption, and at the same time improves access efficiency.
Drawings
In order to more clearly illustrate the technical solutions in the present application or the related art, the drawings needed to be used in the description of the embodiments or the related art will be briefly introduced below, and it is obvious that the drawings in the following description are only embodiments of the present application, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flowchart of a data allocation method applied to a deduplication storage system according to an embodiment of the present application;
FIG. 2 is a diagram illustrating initial storage file data and initial data blocks according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a hypergraph model according to an embodiment of the present application;
FIG. 4-a is a diagram illustrating initial data block allocation according to an embodiment of the present application;
FIG. 4-b is a diagram illustrating another allocation of initial data blocks according to an embodiment of the present application;
FIG. 5-a is a schematic diagram of access span information as the number of stored files increases, under the different algorithms applied by the hypergraph partitioner of the embodiment of the present application;
FIG. 5-b is a schematic diagram of the deduplication rate as the number of stored files increases, under the different algorithms applied by the hypergraph partitioner of the embodiment of the present application;
FIG. 5-c is a schematic diagram of the standard deviation of server load as the number of stored files increases, under the different algorithms applied by the hypergraph partitioner of the embodiment of the present application;
FIG. 5-d is a schematic diagram of allocation policy generation time as the number of stored files increases, under the different algorithms applied by the hypergraph partitioner of the embodiment of the present application;
FIG. 6-a is a schematic diagram of access span information as the training data ratio increases, under the different algorithms applied by the hypergraph partitioner of the embodiment of the present application;
FIG. 6-b is a schematic diagram of access span information as the number of servers increases, under the different algorithms applied by the hypergraph partitioner of the embodiment of the present application;
FIG. 6-c is a schematic diagram of access span information under different server capacities, under the different algorithms applied by the hypergraph partitioner of the embodiment of the present application;
FIG. 7-a is a schematic diagram of access span information for different amounts of updated storage file data, under the different algorithms applied by the hypergraph partitioner of the embodiment of the present application;
FIG. 7-b is a schematic diagram of access span information under different training data ratios, under the different algorithms applied by the hypergraph partitioner of the embodiment of the present application;
FIG. 8 is a schematic structural diagram of a deduplication storage system of an embodiment of the present application;
fig. 9 is a schematic view of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is further described in detail below with reference to the accompanying drawings in combination with specific embodiments.
It should be noted that technical terms or scientific terms used in the embodiments of the present application should have a general meaning as understood by those having ordinary skill in the art to which the present application belongs, unless otherwise defined. The use of "first," "second," and similar terms in the embodiments of the present application do not denote any order, quantity, or importance, but rather the terms are used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that the element or item listed before the word covers the element or item listed after the word and its equivalents, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", and the like are used merely to indicate relative positional relationships, and when the absolute position of the object being described is changed, the relative positional relationships may also be changed accordingly.
In the related art, general deduplication divides the stored file data into a plurality of variable-size blocks and performs deduplication by comparing the fingerprint of each block. Although only one copy of each unique block is retained in the storage system, which greatly reduces the number of data copies, this data reduction method inevitably reduces access efficiency.
However, the related-art policies for the file access problem all result in more space occupation. One class of policies strictly limits the access range to a single server: when the related blocks are separated, some block copies must be supplemented to reconstruct the original file at the server level, and data is migrated according to the data-sharing dependencies between data under the limited access span, with one file as the minimum comparison unit. Under these policies any single file can be retrieved by accessing one server, but more data copies are involved, which results in more space occupation. Another approach uses a rewriting technique that selectively writes duplicate blocks into a new container to relieve the data fragmentation problem, but rewriting sacrifices space to improve data locality and also violates global deduplication.
In addition, it is difficult in the related art to model the access span metric directly with a mathematical expression, because it involves a multivariate relationship between files: a file contains multiple blocks, any of which may be shared by multiple files. The internal data correlation between stored files can be captured and modeled as a weighted hypergraph. However, a traditional hypergraph partitioning scheme is generated on the basis of the whole constructed hypergraph, so it is not suitable for the case of updated stored data: the corresponding hypergraph would have to be reconstructed and the updated hypergraph partitioned again, which inevitably wastes computing resources and causes unacceptable delay in scheme generation. Furthermore, when the previous and updated hypergraph partitioning schemes do not match, block migration is invoked to fit the new allocation policy, and the hypergraph partitioning algorithm may fail.
The embodiment of the application provides a data distribution method applied to a deduplication storage system. A hypergraph model is constructed according to the initial storage file data set that arrives at the deduplication storage system in advance, and the problem of minimizing the access time of all storage file data queries is converted into a hypergraph partitioning problem so as to improve access efficiency. A hypergraph divider allocates the continuously arriving update data blocks of the updated storage file data to the corresponding servers. Through an incremental hypergraph partitioning algorithm, the hypergraph divider updates the storage file data while avoiding re-partitioning the regions of the hypergraph model, and incrementally adjusts the existing allocated regions according to the updated storage file data. This avoids block migration cost, shortens the generation time of the allocation scheme, effectively limits the access span information, and improves access efficiency.
As shown in fig. 1, the method of the present embodiment includes:
step 101, constructing a hypergraph model according to an initial storage file data set which arrives at a deduplication storage system in advance.
In this step, it is computationally expensive to evaluate all possible allocation choices and identify those that best narrow the access scope for a common set of files. Therefore, an attempt is made to intuitively capture the intrinsic dependencies between files and model them as a graph. A regular graph usually describes a group of objects with binary relations, but in a deduplication storage system the data often forms multivariate relations, i.e., more complex one-to-many or many-to-many relationships. In particular, a file is composed of multiple blocks, some of which may be shared by multiple files. If the multivariate relation is simply and forcibly converted into a binary relation when modeling the data relationship with a regular graph, much useful correlation information is lost. Therefore, a hypergraph model is constructed according to the initial storage file data set that arrives at the deduplication storage system in advance, and the problem of minimizing the access time of all storage file data queries is converted into a hypergraph partitioning problem so as to improve access efficiency.
Compared with a regular graph, a hypergraph can accurately describe the relationships between objects with multiple associations; the main difference between the two kinds of graph is the number of vertices associated with one edge. A hyperedge, i.e., an edge in a hypergraph, extends the concept of an edge by allowing more than two vertices to be connected, which makes hypergraph modeling suitable for complex data relationships and for efficient and effective graph analysis.
Step 102, obtaining the distribution information of the updated storage file data through a preset hypergraph divider based on the hypergraph model and the updated storage file data which continuously arrives at the deduplication storage system.
In this step, the hypergraph divider performs its calculation through the Incremental_hMetis algorithm (incremental hypergraph partitioning method), and the hypergraph divider is obtained based on the hMetis algorithm. Specifically, the Incremental_hMetis algorithm program is as follows:
[Algorithm listing provided as Figure BDA0003557318510000061 in the original application.]
The hypergraph divider incrementally updates the existing allocated regions through the Incremental_hMetis algorithm (incremental hypergraph partitioning method) to obtain the allocation information of the updated storage file data. The Incremental_hMetis algorithm avoids re-executing the whole hypergraph partitioning process: the storage file data is updated while re-partitioning of the hypergraph model is avoided, the existing allocated regions are incrementally adjusted according to the updated storage file data, no block migration cost is incurred, and the generation time of the allocation scheme is shortened.
The access span information is the average number of servers involved in executing a query for stored file data, and it directly influences the total communication that must be performed when such a query is executed. As the access span information increases, more rounds of communication are required to retrieve the related data blocks on the distributed servers, which occupies more network resources; especially when the communication network is oversubscribed, the cross-rack communication bandwidth becomes a bottleneck. The hypergraph divider effectively limits the access span information through the Incremental_hMetis algorithm (incremental hypergraph partitioning method), thereby improving access efficiency and minimizing the recomputation resources needed when the hypergraph is partitioned.
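As a concrete illustration of this metric (an illustrative sketch, not code from the application; the function name and data layout are assumptions), the following Python snippet computes the average access span over a set of file queries:

```python
from typing import Dict, Iterable, Mapping

def access_span(files: Mapping[str, Iterable[str]],
                placement: Dict[str, int]) -> float:
    """Average number of distinct servers touched when retrieving each file.

    files:     file id -> fingerprints of the blocks the file contains
    placement: block fingerprint -> index of the server (region) holding it
    """
    spans = [len({placement[b] for b in blocks}) for blocks in files.values()]
    return sum(spans) / len(spans) if spans else 0.0

# Example: f1 spans two servers, f2 spans one, so the access span is 1.5.
files = {"f1": ["b1", "b2"], "f2": ["b2", "b3"]}
placement = {"b1": 0, "b2": 1, "b3": 1}
print(access_span(files, placement))  # 1.5
```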
Step 103, distributing the updated storage file data to the server corresponding to the distribution information among the plurality of servers according to the distribution information.
In this step, the updated storage file data is distributed to the appropriate servers in the distributed deduplication storage system according to the distribution information. The access range of file access queries is minimized without introducing data redundancy, i.e., the access span information is minimized, and access efficiency is improved.
Through the above scheme, a hypergraph model is constructed according to the initial storage file data set that arrives at the deduplication storage system in advance. The constructed hypergraph can accurately describe the relationships between objects with multiple associations, which makes it suitable for modeling complex data relationships and for efficient and effective graph analysis, and the problem of minimizing the access time of all storage file data queries is converted into a hypergraph partitioning problem so as to improve access efficiency. The continuously arriving updated storage file data is allocated to appropriate servers through the hypergraph divider, which performs its calculation through the Incremental_hMetis algorithm (incremental hypergraph partitioning method): the storage file data is updated while re-partitioning of the hypergraph model is avoided, and the existing allocated regions are incrementally adjusted according to the updated storage file data. This avoids block migration cost, shortens the generation time of the allocation scheme, and effectively limits the access span information, thereby improving access efficiency and minimizing the recomputation resources when the hypergraph is partitioned.
In some embodiments, step 101, comprises:
step 1011, obtaining an initial storage file data set which arrives at the deduplication storage system in advance, where the initial storage file data set includes a plurality of initial storage file data, and each of the initial storage file data includes a plurality of initial data blocks.
Step 1012, obtaining correlation information according to the initial data blocks, and constructing a hypergraph model based on the correlation information.
In the above scheme, the initial storage file data is the storage file data that arrives at the deduplication storage system in advance, and each initial storage file includes a plurality of variable-size initial data blocks. The stored initial data blocks are associated with each other through a certain amount of shared content, and each initial data block may be shared by a plurality of initial storage file data. Based on the correspondence between this multivariate relationship and the hypergraph structure, the data correlation (i.e., the correlation information) between the initial data blocks of the initial storage file data in the distributed deduplication storage system is modeled as a hypergraph model. The multivariate relationship between the storage file data is maintained through the hypergraph model, which avoids the information loss that occurs in an ordinary binary correlation graph.
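A minimal construction sketch under stated assumptions (blocks identified by their fingerprints, vertex weight taken as block size and hyperedge weight as file access frequency, as described later for FIG. 3); the names and data layout are illustrative, not taken from the application:

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Block = Tuple[str, int]                 # (fingerprint, size in bytes)
File = Tuple[str, List[Block], int]     # (file id, blocks, access frequency)

def build_hypergraph(initial_files: List[File]):
    """Model the correlation information of the initial data set as a hypergraph.

    Returns:
      vertex_weight: fingerprint -> block size (the associated weight w_v)
      hyperedges:    file id -> set of fingerprints (one hyperedge per file)
      edge_weight:   file id -> access frequency (the hyperedge weight w_e)
    """
    vertex_weight: Dict[str, int] = {}
    hyperedges: Dict[str, Set[str]] = defaultdict(set)
    edge_weight: Dict[str, int] = {}
    for file_id, blocks, freq in initial_files:
        edge_weight[file_id] = freq
        for fp, size in blocks:
            vertex_weight.setdefault(fp, size)   # one vertex per unique block (deduplicated)
            hyperedges[file_id].add(fp)          # the hyperedge keeps the one-to-many relation
    return vertex_weight, dict(hyperedges), edge_weight
```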
In some embodiments, the initial data block is provided with first fingerprint information, and the update storage file data comprises a plurality of update data blocks provided with second fingerprint information;
step 102, comprising:
step 1021, inputting the hypergraph model, the residual storage capacity preset by each server and the updated storage file data into the hypergraph divider.
Step 1022, respectively performing similarity comparison between the second fingerprint information of each update data block and the first fingerprint information of each initial data block to obtain a comparison result.
Step 1023, determining access span information of the update storage file data and an area allocated to each update data block based on the comparison result, and outputting the access span information of the update storage file data and the area allocated to each update data block by the hypergraph divider, taking the access span information and the area allocated to each update data block as the allocation information.
In the above scheme, each initial data block and each update data block has an identifier: the first fingerprint information is the identifier of each initial data block, and the second fingerprint information is the identifier of each update data block.
The hypergraph model, the preset remaining storage capacity of each server, and the updated storage file data are input into the preset hypergraph divider. The second fingerprint information of each update data block of the updated storage file data that continuously arrives at the deduplication storage system is compared with the first fingerprint information of all initial data blocks allocated to each region of the hypergraph model, to obtain a comparison result. The regions of the update data blocks of the updated storage file data are allocated according to the comparison result, the access span information of the updated storage file data is obtained, and the access span information of the updated storage file data and the region allocated to each update data block are output through the hypergraph divider.
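A minimal sketch of the comparison in this step, assuming exact fingerprint equality is the "similar" criterion and that the first fingerprint information is indexed by a dictionary mapping each fingerprint to the region holding the corresponding initial data block (both assumptions for illustration only):

```python
from typing import Dict, List, Tuple

def compare_fingerprints(update_fps: List[str],
                         initial_fp_region: Dict[str, int]) -> Tuple[Dict[str, int], List[str]]:
    """Split the update data blocks into duplicates and unallocated blocks.

    update_fps:        second fingerprint information of each update data block
    initial_fp_region: first fingerprint -> region already holding that initial block
    Returns (duplicates: fingerprint -> existing region, unallocated fingerprints).
    """
    duplicates: Dict[str, int] = {}
    unallocated: List[str] = []
    for fp in update_fps:
        if fp in initial_fp_region:      # comparison result: similar, block already stored
            duplicates[fp] = initial_fp_region[fp]
        else:                            # comparison result: new block, needs a region
            unallocated.append(fp)
    return duplicates, unallocated
```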
In some embodiments, step 1023 includes:
step a1, taking the updated data block with the similar comparison result as a target data block, taking the area where the initial data block corresponding to the second fingerprint information of the target updated data block is located as the area of the target updated data block, updating the preset initial access span information of the updated stored file data to obtain first access span information, updating the updated data block of the updated stored file data, deleting the target updated data block, and obtaining an unallocated updated data block.
Step a2, obtaining the total updated remaining storage capacity of the allocated server based on the target update data block, and obtaining the total size information by each of the unallocated update data blocks.
Step a3, obtaining the area allocated by the unallocated updated data block according to the remaining storage capacity and the size information of the allocated server, updating the first access span information to obtain second access span information, and using the second access span information as the access span information of the updated stored file data.
In this scheme, each update data block whose comparison result is similar is taken as a target update data block, and the region of the hypergraph model whose initial data block has first fingerprint information similar to the second fingerprint information of that target update data block is taken as the region of the target update data block, which indicates that the target update data block is already stored in the current deduplication storage system. The initial access span information preset for the updated storage file data is updated to obtain first access span information (the access span information is increased by 1), and the update data blocks of the updated storage file data are updated by removing the already-allocated target update data blocks, leaving the unallocated update data blocks.
The sum of the sizes of the unallocated update data blocks (i.e., the total size information) and the total remaining capacity of the underutilized servers already involved in the access span of the updated storage file data (i.e., the total updated remaining storage capacity) are then calculated. The regions allocated to the unallocated update data blocks are obtained from these two quantities, and the first access span information is updated to obtain second access span information, which is the final access span information of the updated storage file data.
In some embodiments, step a2, comprises:
acquiring the residual storage capacity of the allocated servers based on the target update data block to obtain the updated residual storage capacity, and summing the updated residual storage capacity of each allocated server to obtain the total updated residual storage capacity;
and calculating the size of each unallocated update data block to obtain size information, and summing each size information to obtain the total size information of the unallocated update data blocks.
In the scheme, the underutilized storage capacity of the server corresponding to the area allocated to each target update data block is calculated, and the total update residual storage capacity is obtained according to the underutilized storage capacity of each corresponding server.
And calculating the size information of each unallocated update data block in the update storage file data, and summing the size information of each unallocated update data block to obtain the total size information of the unallocated update data blocks.
In some embodiments, step a3, comprises:
in response to determining that the total size information is less than or equal to the total updated remaining storage capacity, allocating each of the unallocated updated data blocks to each of the regions according to a load balancing computation process;
in response to determining that the total size information is larger than the total updated remaining storage capacity, dividing the unallocated update data blocks according to the updated remaining storage capacity of the servers corresponding to the allocated update data blocks to obtain divided update data blocks; allocating the divided update data blocks to the areas corresponding to the servers, among those allocated to the allocated update data blocks, that still have updated remaining storage capacity; allocating the unallocated update data blocks that exceed the total updated remaining storage capacity to servers of the deduplication storage system that have not been allocated; and updating the first access span information to obtain the second access span information, and using the second access span information as the access span information of the updated storage file data.
In the above scheme, when the total size information is less than or equal to the total updated remaining storage capacity, the servers corresponding to the areas of the allocated update data blocks of the updated storage file data can store the unallocated update data blocks, and the unallocated update data blocks are allocated, through a load balancing calculation process, to the servers corresponding to the areas of the allocated update data blocks.
Load balancing is a basic principle of a distributed storage system. Unbalanced storage, or even space overflow, inevitably causes data migration, which increases the total bandwidth occupation and aggravates network congestion. Therefore, the server capacity is set to a value somewhat lower than the actual value, so that each involved server retains a sufficient amount of underutilized space resources and efficient data allocation is achieved.
When the total size information is larger than the total updated remaining storage capacity, the unallocated update data blocks are divided according to the remaining capacity, and the divided update data blocks are allocated to those servers, corresponding to the regions of the allocated update data blocks, that can still store them. Once the remaining storage capacity of the servers corresponding to the regions of all allocated update data blocks is full, involving as few additional servers as possible is taken as the allocation rule: to reduce the access span information of the updated storage file data, the remaining parts of the unallocated update data blocks are allocated to servers corresponding to regions other than those of the allocated update data blocks, so that the access span information is minimized and access efficiency is improved.
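Because the Incremental_hMetis listing itself is only reproduced as a figure above, the following Python sketch reconstructs the allocation logic of steps a1 to a3 from the surrounding description. The data layout, the greedy tie-breaking, and the helper names are assumptions; in particular, whole blocks are placed rather than divided, whereas the description above also allows splitting a block to fit the residual capacity.

```python
from typing import Dict, List, Tuple

def allocate_update_file(
    block_sizes: Dict[str, int],     # update block fingerprint -> size
    duplicates: Dict[str, int],      # fingerprints already stored -> their region
    unallocated: List[str],          # fingerprints not yet stored anywhere
    remaining: Dict[int, int],       # region (server) -> remaining storage capacity
) -> Tuple[Dict[str, int], int]:
    """Place one updated file's blocks and return (placement, access span)."""
    placement = dict(duplicates)             # a1: duplicate blocks stay where they are
    involved = set(duplicates.values())      # regions already counted in the access span

    # a2: total remaining capacity of the involved servers vs. total size to place.
    total_remaining = sum(remaining[r] for r in involved)
    total_size = sum(block_sizes[b] for b in unallocated)

    def place(block: str, region: int) -> None:
        placement[block] = region
        remaining[region] -= block_sizes[block]
        involved.add(region)

    if total_size <= total_remaining:
        # a3, first case: everything fits on the already-involved servers;
        # always pick the emptiest one as a simple load-balancing rule.
        for b in sorted(unallocated, key=block_sizes.get, reverse=True):
            place(b, max(involved, key=lambda r: remaining[r]))
    else:
        # a3, second case: fill the involved servers first, then overflow to
        # as few additional servers as possible to keep the span small.
        leftovers: List[str] = []
        for b in sorted(unallocated, key=block_sizes.get, reverse=True):
            fitting = [r for r in involved if remaining[r] >= block_sizes[b]]
            if fitting:
                place(b, max(fitting, key=lambda r: remaining[r]))
            else:
                leftovers.append(b)
        outside = [r for r in remaining if r not in involved]
        for b in leftovers:
            candidates = ([r for r in outside if remaining[r] >= block_sizes[b]]
                          or outside or list(involved))
            place(b, max(candidates, key=lambda r: remaining[r]))

    return placement, len(involved)
```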
In some embodiments, each of the zones corresponds to a server;
step 103, including:
acquiring the number of the servers distributed by the updated data block according to the access span information of the updated storage file data;
determining a server corresponding to each update data block based on the area allocated by the update data block;
and distributing the updated storage file data to the servers corresponding to the distribution information in the plurality of servers according to the number of the servers distributed by the updated data blocks and the servers distributed correspondingly to each updated data block.
In the above scheme, the total number of servers to which all update data blocks of the updated storage file data are allocated is obtained from the access span information of the updated storage file data. Since each region corresponds to one server, the server to which each update data block is allocated can be determined from the region allocated to that block. All update data blocks of the updated storage file data are then allocated to the corresponding servers, based on the total number of servers involved and the server allocated to each update data block, which completes the allocation of the updated storage file data.
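A small sketch of this final dispatch step: it groups the newly allocated update blocks by their assigned region and issues one transfer per server. Here send_to_server is a hypothetical transport hook standing in for the system's actual write path, and duplicate blocks already stored on a server are assumed not to need re-transfer.

```python
from collections import defaultdict
from typing import Callable, Dict

def dispatch_new_blocks(new_placement: Dict[str, int],
                        block_data: Dict[str, bytes],
                        send_to_server: Callable[[int, Dict[str, bytes]], None]) -> int:
    """Ship each newly allocated update block to its assigned server.

    new_placement: fingerprint -> region index, for blocks not yet stored
    block_data:    fingerprint -> raw block content
    Returns the number of servers contacted for this file.
    """
    per_server: Dict[int, Dict[str, bytes]] = defaultdict(dict)
    for fp, region in new_placement.items():
        per_server[region][fp] = block_data[fp]
    for region, blocks in per_server.items():
        send_to_server(region, blocks)   # one communication round per involved server
    return len(per_server)
```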
In some embodiments, step 1012, said building a hypergraph model based on said relevance information comprises:
constructing a vertex model according to the initial data block of each initial storage file data, and constructing an associated weight model according to the vertex model;
constructing a hyper-edge model based on the relevance information and the vertex model of each of the initial data blocks;
constructing a super-edge weight model through the super-edge model, the association weight model and the vertex model;
and obtaining the hypergraph model according to the vertex model, the association weight model, the hyper-edge weight model and the hyper-edge model.
In the above scheme, for example, as shown in FIG. 2, 5 pieces of initial storage file data (f1~f5) are divided into 9 unique initial data blocks (b1~b9). As shown in FIG. 3, in the hypergraph model G = (V, E, W), each vertex model v ∈ V corresponds to one initial data block, each hyperedge model e ∈ E corresponds to one piece of initial storage file data and reflects its access span information over the partitioned blocks, the associated weight model (wv) of a vertex model reflects the data amount of the corresponding initial data block, and the hyperedge weight model (we) indicates the access frequency of the file. Each of these initial data blocks is represented by a vertex model (v1~v9), and each hyperedge model (e1~e5) includes the vertex models corresponding to the initial data blocks of the respective initial storage file data (f1~f5).
As shown in FIG. 4-a and FIG. 4-b, the 5 initial storage file data (f1~f5) are divided into 9 initial data blocks (b1~b9), where the vertex models (v1~v9) in the hypergraph represent the initial data blocks and the hyperedge models (e1~e5) represent the access span information of the initial storage file data over the partitioned blocks in FIG. 3. These initial data blocks are distributed to 4 servers (s1~s4) with different access spans in FIG. 4-a and FIG. 4-b. For the initial storage file data f5, represented by the hyperedge model e5, the access span information in FIG. 4-a is 3 (s2, s3, s4), while in FIG. 4-b it is reduced to 1 (s3). Overall, the average access span information in FIG. 4-a is 14/5 (3, 3, 3, 2, 3 for f1~f5, respectively), but the more rational distribution in FIG. 4-b reduces it to 9/5 (2, 2, 2, 2, 1 for f1~f5, respectively).
The hypergraph model is then optimized. The optimization objective is to determine the placement of the data blocks in the deduplication storage system so as to minimize the access range of file access queries (i.e., to minimize the access span information). The hypergraph model converts this objective into partitioning the vertices of the hypergraph into regions, where each region corresponds to one server, and the access range of file access queries is minimized by minimizing the hyperedge weight of hyperedges that connect vertex models in different regions.
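To make the objective concrete, the following self-contained sketch exhaustively searches the block placements of a toy instance for the one that minimizes the frequency-weighted access span under a per-region capacity constraint. It is only an illustration of the objective on a tiny example; the application itself relies on hMetis-style partitioning rather than brute force, and all names here are assumptions.

```python
from itertools import product
from typing import Dict, List, Optional, Tuple

def best_partition(hyperedges: Dict[str, List[str]],   # file -> blocks it contains
                   sizes: Dict[str, int],               # block -> size (vertex weight w_v)
                   freq: Dict[str, int],                # file -> access frequency (hyperedge weight w_e)
                   n_regions: int,
                   capacity: int) -> Tuple[Optional[Dict[str, int]], float]:
    """Exhaustively find the placement minimizing the weighted access span."""
    blocks = sorted(sizes)
    best, best_cost = None, float("inf")
    for assign in product(range(n_regions), repeat=len(blocks)):
        placement = dict(zip(blocks, assign))
        load = [0] * n_regions
        for b, r in placement.items():           # per-region load (balance constraint)
            load[r] += sizes[b]
        if max(load) > capacity:
            continue
        cost = sum(freq[f] * len({placement[b] for b in bs})   # hotter files count more
                   for f, bs in hyperedges.items())
        if cost < best_cost:
            best, best_cost = placement, cost
    return best, best_cost

# Toy instance: 3 files over 4 unit-size blocks, 2 regions of capacity 2.
hyperedges = {"f1": ["b1", "b2"], "f2": ["b2", "b3"], "f3": ["b3", "b4"]}
sizes = {"b1": 1, "b2": 1, "b3": 1, "b4": 1}
freq = {"f1": 2, "f2": 1, "f3": 1}
placement, cost = best_partition(hyperedges, sizes, freq, n_regions=2, capacity=2)
print(placement, cost)   # grouping {b1,b2} and {b3,b4} gives the minimum weighted span of 5
```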
In some embodiments, for a distributed deduplication storage system with multiple storage nodes, several algorithms used by the hypergraph partitioner are evaluated on real data sets to demonstrate the superior performance of the Incremental_hMetis algorithm (incremental hypergraph partitioning method). The five algorithms are as follows:
the first method is Incremental _ hmtis (Incremental hypergraph segmentation method), which is the augmentation of the hypergraph segmentation method based on hypermap segmentation of hmtis computations.
The second method is hMetis (a hypergraph partitioning software package), a mature hypergraph partitioning package that generates a partitioning scheme based on the fully constructed hypergraph; therefore, it cannot be directly used for data allocation when the stored data is updated.
The third method is HyperHeur (a heuristic data allocation method based on hypergraph partitioning), a heuristic incremental data allocation strategy that does not require the hypergraph or its allocated regions as input.
The fourth method is LOFS (lightweight three-layer hash data storage method), which provides a lightweight three-layer hash method to determine the distribution server of each newly arrived file, and this method strictly limits the access range to one server, thereby ensuring the access efficiency.
The fifth method is Random (random allocation method), which discards deduplication entirely and randomly allocates the unique chunks across the distributed deduplication storage system.
Incremental_hMetis is compared with the other four methods using the following indicators:
Access span information, the optimization objective. When the access range is reduced, access efficiency is ensured and the communication cost of file queries is reduced.
The deduplication ratio, defined as the ratio of the storage space saved by each compared method to the originally occupied space. The deduplication ratio must be less than 1.
Policy generation time. The generation time of each compared method is recorded to verify its time complexity.
The evaluation first decompresses the traces in the real data sets downloaded from the web and divides them into data blocks of unequal size using a content-defined chunking method. Each data block is represented by an MD5 (Message-Digest Algorithm 5) fingerprint. The modeled hypergraph is constructed for each data set based on the pairwise content-containment relationships between the traced decompressed files. The original hypergraph partitioning policy of hMetis is built by default on the first 50% of the files of each data set, with the goal of minimizing the total access range over 10 homogeneous servers. The default total server capacity is set to 1.5 times the total chunk size and is then evenly allocated to the involved servers. The training scale, the number of servers, and the server capacity are also varied to evaluate the performance robustness of Incremental_hMetis, which is compared under the different scenarios.
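A simplified sketch of this preprocessing step: the MD5 fingerprinting matches the description, but the content-defined chunker is replaced by a toy rolling-sum boundary test, so the chunk boundaries are illustrative only; a production system would use something like Rabin fingerprinting.

```python
import hashlib
from typing import List, Tuple

def chunk_and_fingerprint(data: bytes,
                          min_size: int = 2048,
                          max_size: int = 16384,
                          mask: int = 0x1FFF) -> List[Tuple[str, int]]:
    """Split data into variable-size chunks and return (MD5 fingerprint, size) pairs.

    A byte-wise rolling sum stands in for a real content-defined chunking
    hash; a boundary is declared where (hash & mask) == 0 once min_size
    bytes have accumulated, or unconditionally at max_size.
    """
    chunks: List[Tuple[str, int]] = []
    start, rolling = 0, 0
    for i, byte in enumerate(data):
        rolling = ((rolling << 1) + byte) & 0xFFFFFFFF
        length = i - start + 1
        if (length >= min_size and (rolling & mask) == 0) or length >= max_size:
            chunk = data[start:i + 1]
            chunks.append((hashlib.md5(chunk).hexdigest(), len(chunk)))
            start, rolling = i + 1, 0
    if start < len(data):                        # trailing chunk, if any
        chunk = data[start:]
        chunks.append((hashlib.md5(chunk).hexdigest(), len(chunk)))
    return chunks
```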
Large-scale tests of access span, deduplication rate, and policy generation time are performed on two different types of data sets. As shown in FIG. 5-a, 85% of the files are used as the original storage data (i.e., the initial storage file data); as the later files (the updated storage file data) arrive, the access span information of Incremental_hMetis increases only slightly, from 2.28 to 3.09, which shows the accuracy of the generated data allocation policy. In contrast, the Random method keeps its access range at 10, the default number of servers. The access range of the hMetis method is better than that of Incremental_hMetis and HyperHeur, because its allocation scheme is regenerated from the updated hypergraph each time a file arrives. The LOFS method has the smallest access range, which is always fixed at 1, because it places the blocks of a file on one server to facilitate data access queries; this most efficient data access mode comes at the expense of a low deduplication rate.
As shown in FIG. 5-b, the deduplication rate of the LOFS method is about 40% lower than that of the other four compared methods, which completely eliminate redundant data blocks; the deduplication rates of the other four methods remain above 49.8%.
FIG. 5-c illustrates the standard deviation of server load as the stored file data arrives. The Random method yields the lowest load standard deviation because it randomly distributes data blocks across the relevant servers. The load standard deviation of HyperHeur is the highest among the compared methods, gradually decreasing from about 9053 to about 8723, because when storage space becomes scarce, the later-arriving data tends to be allocated to servers with lower space utilization, which facilitates load balancing. In contrast, although its load is low, the LOFS approach becomes increasingly imbalanced as files arrive, because LOFS is trained on the first 85% of the stored files, for which it is most balanced, and the later-arriving stored file data breaks this balance.
Policy generation time: as shown in FIG. 5-d, the policy generation time of hMetis is the longest, because when a new file (i.e., updated storage file data) arrives, the modeled hypergraph must be reconstructed and the partitioning policy recalculated on the newly constructed hypergraph. The second longest is the LOFS policy, which distributes data by performing a three-layer hash calculation on the entire file. In contrast, the Incremental_hMetis, HyperHeur, and Random methods always produce a low and stable policy generation time of only about 1 ms. This verifies the effectiveness of the Incremental_hMetis method.
As shown in FIG. 6-a, access span performance under different variables is illustrated. Specifically, as the proportion of training data increases, the access span of Incremental_hMetis gradually decreases from 4.52 to 2.36, because the more data is trained on, the more complete the captured data correlation of the data set, which improves the allocation accuracy for the subsequent test data. The access spans of the Random and hMetis methods remain stable because their policy generation does not involve a data training process. The LOFS policy aggregates the data blocks of a file together already in the training phase, so its access span is always 1.
As shown in FIG. 6-b, the access span of Incremental_hMetis increases slightly as the number of servers involved increases, compared with the other four methods. This is because the total server capacity is fixed at 1.5 times the total block size of the data set, so the capacity of any single server decreases as the number of servers increases, which limits the aggregation of highly correlated blocks. The access range of the Random method is closely related to the number of servers, while the access range of the LOFS method always remains 1.
The access span performance under different server capacities is shown in FIG. 6-c. An increase in server capacity makes it easier to store more closely related blocks on one server, which the compared methods exploit as an effective way of narrowing the access span information. Note that the access ranges of the Random and LOFS methods are not affected.
Finally, the access span information with and without consideration of file access frequency is evaluated, as shown in FIG. 7-a and FIG. 7-b, where XX_NF denotes one of the five compared algorithms without consideration of access frequency and XX denotes the same algorithm with access frequency considered. The XX methods have a clear advantage over the XX_NF methods. In particular, in FIG. 7-a, when the file storage rate is 85%, ignoring the influence of file heat (i.e., the hyperedge weight in the hypergraph model, representing the access frequency of a file) increases the access span information of Incremental_hMetis from 2.276 to 3.556 and that of hMetis from 2.178 to 3.484. The reason is that both methods tend to narrow the access range of files with higher query frequency, which benefits the overall access range. In contrast, XX_NF treats all files equally, so the data blocks of hot files are not specially aggregated, resulting in relatively high access span information.
Therefore, Incremental_hMetis achieves a data allocation strategy with high access efficiency for the distributed deduplication system. At the same time, a limited access span, a high deduplication rate, and a low policy generation time are achieved for updated/newly arrived data (i.e., updated storage file data). Specifically, Incremental_hMetis reduces storage space by about 40% compared with the LOFS method, reduces access span information by about 1.1% compared with the HyperHeur method, and achieves a policy generation time about 160 times shorter than that of the hMetis method.
It should be noted that the method of the embodiment of the present application may be executed by a single device, such as a computer or a server. The method of the embodiment can also be applied to a distributed scene and completed by the mutual cooperation of a plurality of devices. In such a distributed scenario, one of the multiple devices may only perform one or more steps of the method of the embodiment, and the multiple devices interact with each other to complete the method.
It should be noted that the above describes some embodiments of the present application. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments described above and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Based on the same inventive concept, corresponding to the method of any embodiment, the application also provides a data distribution device applied to the deduplication storage system.
Referring to fig. 8, the data distribution apparatus applied to the deduplication storage system includes:
a model construction module 801 configured to construct a hypergraph model from an initial storage file data set that has previously arrived at a deduplication storage system;
a hypergraph partitioning module 802 configured to obtain allocation information of the updated storage file data through a preset hypergraph partitioner based on the hypergraph model and the updated storage file data continuously arriving at the deduplication storage system;
an allocating module 803 configured to allocate the updated storage file data to a server corresponding to the allocation information among the plurality of servers according to the allocation information.
In some embodiments, model building module 801 includes:
an obtaining unit configured to obtain an initial storage file data set that arrives at the deduplication storage system in advance, the initial storage file data set including a plurality of initial storage file data, wherein each of the initial storage file data includes a plurality of initial data blocks;
and the model building unit is configured to obtain correlation information according to the initial data block and build a hypergraph model based on the correlation information.
In some embodiments, the initial data block is provided with first fingerprint information, and the update storage file data comprises a plurality of update data blocks provided with second fingerprint information;
hypergraph partitioning module 802, comprising:
an input unit configured to input the hypergraph model, a remaining storage capacity preset by each server, and the updated storage file data into the hypergraph divider;
the comparison unit is configured to perform similarity comparison on the second fingerprint information of each update data block and the first fingerprint information of each initial data block respectively to obtain a comparison result;
an allocation unit configured to determine access span information of the update storage file data and an area allocated to each of the update data blocks based on the comparison result, and output the access span information of the update storage file data and the area allocated to each of the update data blocks through the hypergraph divider, taking the access span information and the area allocated to each of the update data blocks as the allocation information.
In some embodiments, a dispensing unit, comprises:
a first processing subunit, configured to take each update data block whose comparison result is similar as a target update data block, take the area where the initial data block corresponding to the second fingerprint information of the target update data block is located as the area of the target update data block, update the initial access span information preset for the updated storage file data to obtain first access span information, and update the update data blocks of the updated storage file data by deleting the target update data blocks, to obtain the unallocated update data blocks;
a calculation processing subunit configured to acquire a total update remaining storage capacity of the servers to which allocation has been made based on the target update data block, and acquire total size information by each of the unallocated update data blocks;
a second processing subunit, configured to obtain, according to the remaining storage capacity of the allocated server and the size information, the area allocated to the unallocated update data block, update the first access span information to obtain second access span information, and use the second access span information as the access span information of the updated storage file data.
In some embodiments, the computing processing subunit is specifically configured to:
acquiring the residual storage capacity of the allocated servers based on the target update data block to obtain the updated residual storage capacity, and summing the updated residual storage capacity of each allocated server to obtain the total updated residual storage capacity;
and calculating the size of each unallocated update data block to obtain size information, and summing each size information to obtain total size information of the unallocated update data blocks.
In some embodiments, the second processing subunit is specifically configured to:
in response to determining that the total size information is less than or equal to the total updated remaining storage capacity, allocating each of the unallocated updated data blocks to each of the regions in accordance with a load balancing calculation process;
in response to determining that the total size information is larger than the total updated remaining storage capacity, dividing the unallocated update data blocks according to the updated remaining storage capacity of the servers corresponding to the allocated update data blocks to obtain divided update data blocks; allocating the divided update data blocks to the areas corresponding to the servers, among those allocated to the allocated update data blocks, that still have updated remaining storage capacity; allocating the unallocated update data blocks that exceed the total updated remaining storage capacity to servers of the deduplication storage system that have not been allocated; and updating the first access span information to obtain the second access span information, and using the second access span information as the access span information of the updated storage file data.
In some embodiments, each of the zones corresponds to a server;
the assignment module 803 is specifically configured to:
acquiring the number of the servers distributed by the update data block according to the access span information of the update storage file data;
determining a server corresponding to each update data block based on the area allocated by the update data block;
and distributing the updated storage file data to the servers corresponding to the distribution information in the plurality of servers according to the number of the servers distributed by the updated data blocks and the servers distributed correspondingly to each updated data block.
In some embodiments, the model building unit is specifically configured to:
constructing a vertex model according to the initial data blocks of each initial storage file data, and constructing an association weight model according to the vertex model;
constructing a hyper-edge model based on the correlation information of each of the initial data blocks and the vertex model;
constructing a hyper-edge weight model through the hyper-edge model, the association weight model and the vertex model;
and obtaining the hypergraph model according to the vertex model, the association weight model, the hyper-edge weight model and the hyper-edge model; an illustrative sketch of such a construction follows below.
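The four models can be pictured as one small data structure: vertices are initial data blocks, each initial storage file contributes one hyperedge over its blocks, and weights hang off the vertices and hyperedges. The Python sketch below uses block size as the vertex weight and the sum of vertex weights as the hyperedge weight; those weight definitions, and all names, are assumptions for illustration and are not stated by the disclosure.

```python
# Hypothetical sketch of the hypergraph model: vertices are initial data blocks,
# each initial storage file contributes one hyperedge over its blocks (the
# correlation information), vertex weights are taken as block sizes, and a
# hyperedge weight as the sum of its vertices' weights.
def build_hypergraph(initial_files):
    """initial_files: dict file_id -> list of (block_fingerprint, block_size)."""
    vertices = {}      # vertex model + association weight model: fingerprint -> weight
    hyperedges = {}    # hyper-edge model: file_id -> set of block fingerprints
    edge_weights = {}  # hyper-edge weight model: file_id -> weight
    for file_id, blocks in initial_files.items():
        edge = set()
        for fingerprint, size in blocks:
            vertices.setdefault(fingerprint, size)  # a block shared by files appears once
            edge.add(fingerprint)
        hyperedges[file_id] = edge
        edge_weights[file_id] = sum(vertices[f] for f in edge)
    return vertices, hyperedges, edge_weights       # together they form the hypergraph model
```

A hypergraph partitioner (the preset hypergraph divider of the embodiments) would then partition this structure across the servers, which is what the allocation information records.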
For convenience of description, the above apparatus is described as being divided into various modules by function, each module being described separately. Of course, when the present application is implemented, the functions of the modules may be implemented in one or more pieces of software and/or hardware.
The apparatus in the foregoing embodiment is used to implement the corresponding data allocation method applied to the deduplication storage system in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Based on the same inventive concept, and corresponding to the method of any of the embodiments described above, the present application further provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the data allocation method applied to the deduplication storage system described in any of the above embodiments.
Fig. 9 is a schematic diagram illustrating a more specific hardware structure of an electronic device according to this embodiment, where the electronic device may include: a processor 901, a memory 902, an input/output interface 903, a communication interface 904, and a bus 905. Wherein the processor 901, the memory 902, the input/output interface 903 and the communication interface 904 enable a communication connection within the device with each other through a bus 905.
The processor 901 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application-Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is configured to execute relevant programs to implement the technical solutions provided in the embodiments of the present specification.
The memory 902 may be implemented in the form of a ROM (Read-Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 902 may store an operating system and other application programs; when the technical solutions provided by the embodiments of the present specification are implemented by software or firmware, the relevant program codes are stored in the memory 902 and called by the processor 901 for execution.
The input/output interface 903 is used for connecting an input/output module to realize information input and output. The input/output module may be configured as a component in the device (not shown in the figure) or may be external to the device to provide a corresponding function. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, and the like, and the output devices may include a display, a speaker, a vibrator, an indicator light, and the like.
The communication interface 904 is used for connecting a communication module (not shown in the figure) to realize communication interaction between the device and other devices. The communication module can realize communication in a wired mode (such as USB, network cable and the like) and also can realize communication in a wireless mode (such as mobile network, WIFI, Bluetooth and the like).
Bus 905 includes a pathway to transfer information between various components of the device, such as processor 901, memory 902, input/output interface 903, and communication interface 904.
It should be noted that although the above-mentioned device only shows the processor 901, the memory 902, the input/output interface 903, the communication interface 904 and the bus 905, in a specific implementation, the device may also include other components necessary for normal operation. In addition, those skilled in the art will appreciate that the above-described apparatus may also include only those components necessary to implement the embodiments of the present description, and not necessarily all of the components shown in the figures.
The electronic device of the foregoing embodiment is used to implement the corresponding data allocation method applied to the deduplication storage system in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which are not described again here.
Based on the same inventive concept, corresponding to any of the above embodiments, the present application also provides a non-transitory computer readable storage medium storing computer instructions for causing the computer to execute the data allocation method applied to the deduplication storage system as described in any of the above embodiments.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
The computer instructions stored in the storage medium of the foregoing embodiment are used to enable the computer to execute the data allocation method applied to the deduplication storage system according to any of the foregoing embodiments, and have the beneficial effects of the corresponding method embodiments, which are not described herein again.
Those of ordinary skill in the art will understand that: the discussion of any embodiment above is meant to be exemplary only, and is not intended to intimate that the scope of the disclosure, including the claims, is limited to these examples; within the context of the present application, technical features in the above embodiments or in different embodiments may also be combined, steps may be implemented in any order, and there are many other variations of the different aspects of the embodiments of the present application described above, which are not provided in detail for the sake of brevity.
In addition, well-known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown in the provided figures for simplicity of illustration and discussion, and so as not to obscure the embodiments of the application. Furthermore, devices may be shown in block diagram form in order to avoid obscuring embodiments of the application, and this also takes into account the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform within which the embodiments of the application are to be implemented (i.e., specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the application, it should be apparent to one skilled in the art that the embodiments of the application can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative instead of restrictive.
While the present application has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, the discussed embodiments may be used with other memory architectures (e.g., dynamic RAM (DRAM)).
The present embodiments are intended to embrace all such alternatives, modifications, and variations that fall within the broad scope of the appended claims. Therefore, any omissions, modifications, substitutions, improvements, and the like that are made without departing from the spirit and principles of the embodiments of the present application are intended to be included within the scope of the present application.

Claims (10)

1. A data allocation method applied to a deduplication storage system, wherein the deduplication storage system comprises a plurality of servers, the method comprising:
constructing a hypergraph model according to an initial storage file data set which arrives at the deduplication storage system in advance;
obtaining allocation information of updated storage file data through a preset hypergraph divider, based on the hypergraph model and the updated storage file data which continuously arrives at the deduplication storage system;
and allocating the updated storage file data to the server corresponding to the allocation information among the plurality of servers according to the allocation information.
2. The method of claim 1, wherein the constructing a hypergraph model according to an initial storage file data set which arrives at the deduplication storage system in advance comprises:
acquiring an initial storage file data set which reaches the deduplication storage system in advance, wherein the initial storage file data set comprises a plurality of initial storage file data, and each initial storage file data comprises a plurality of initial data blocks;
and obtaining correlation information according to the initial data blocks, and constructing the hypergraph model based on the correlation information.
3. The method according to claim 2, wherein each initial data block is provided with first fingerprint information, and the updated storage file data comprises a plurality of update data blocks each provided with second fingerprint information;
the obtaining allocation information of the updated storage file data through a preset hypergraph divider, based on the hypergraph model and the updated storage file data which continuously arrives at the deduplication storage system, comprises:
inputting the hypergraph model, the preset remaining storage capacity of each server and the updated storage file data into the hypergraph divider;
respectively carrying out similarity comparison between the second fingerprint information of each update data block and the first fingerprint information of each initial data block to obtain a comparison result;
determining access span information of the updated storage file data and an area allocated to each update data block based on the comparison result, outputting the access span information of the updated storage file data and the area allocated to each update data block through the hypergraph divider, and taking the access span information and the area allocated to each update data block as the allocation information.
4. The method according to claim 3, wherein the determining the access span information of the updated storage file data and the area allocated to each update data block based on the comparison result comprises:
taking the update data block whose comparison result is similar as a target update data block, taking the area where the initial data block corresponding to the second fingerprint information of the target update data block is located as the area of the target update data block, updating initial access span information preset for the updated storage file data to obtain first access span information, and updating the update data blocks of the updated storage file data by deleting the target update data blocks, so as to obtain unallocated update data blocks;
acquiring a total update remaining storage capacity of the allocated servers based on the target update data blocks, and acquiring total size information from the unallocated update data blocks;
and obtaining the areas allocated to the unallocated update data blocks according to the remaining storage capacity of the allocated servers and the size information, updating the first access span information to obtain second access span information, and using the second access span information as the access span information of the updated storage file data.
5. The method of claim 4, wherein the acquiring a total update remaining storage capacity of the allocated servers based on the target update data blocks and the acquiring total size information from the unallocated update data blocks comprises:
acquiring the remaining storage capacity of each allocated server based on the target update data blocks to obtain an updated remaining storage capacity, and summing the updated remaining storage capacities of the allocated servers to obtain the total update remaining storage capacity;
and calculating the size of each unallocated update data block to obtain size information, and summing the size information to obtain the total size information of the unallocated update data blocks.
6. The method according to claim 5, wherein the obtaining the areas allocated to the unallocated update data blocks according to the remaining storage capacity of the allocated servers and the size information, updating the first access span information to obtain second access span information, and using the second access span information as the access span information of the updated storage file data comprises:
in response to determining that the total size information is less than or equal to the total update remaining storage capacity, allocating the unallocated update data blocks among the areas in accordance with a load balancing calculation process;
in response to determining that the total size information is greater than the total update remaining storage capacity, dividing the unallocated update data blocks according to the update remaining storage capacity of the servers corresponding to the allocated update data blocks to obtain divided update data blocks, allocating the divided update data blocks to the areas corresponding to the allocated servers having update remaining storage capacity, allocating the unallocated update data blocks that exceed the total update remaining storage capacity to servers of the deduplication storage system that have not yet been allocated, updating the first access span information to obtain the second access span information, and using the second access span information as the access span information of the updated storage file data.
7. The method of claim 3, wherein each of the areas corresponds to one server;
the allocating the updated storage file data to the server corresponding to the allocation information among the plurality of servers according to the allocation information comprises:
acquiring the number of servers to which the update data blocks are allocated according to the access span information of the updated storage file data;
determining the server corresponding to each update data block based on the area allocated to that update data block;
and allocating the updated storage file data to the servers corresponding to the allocation information among the plurality of servers according to the number of allocated servers and the server corresponding to each update data block.
8. The method of claim 2, wherein the constructing the hypergraph model based on the correlation information comprises:
constructing a vertex model according to the initial data blocks of each initial storage file data, and constructing an association weight model according to the vertex model;
constructing a hyper-edge model based on the correlation information of each of the initial data blocks and the vertex model;
constructing a hyper-edge weight model through the hyper-edge model, the association weight model and the vertex model;
and obtaining the hypergraph model according to the vertex model, the association weight model, the hyper-edge weight model and the hyper-edge model.
9. A data distribution apparatus for use in a deduplication storage system, comprising:
a model building module, configured to construct a hypergraph model according to an initial storage file data set which arrives at the deduplication storage system in advance;
a hypergraph division module, configured to obtain allocation information of updated storage file data through a preset hypergraph divider, based on the hypergraph model and the updated storage file data which continuously arrives at the deduplication storage system;
and a distribution module, configured to allocate the updated storage file data to the server corresponding to the allocation information among the plurality of servers according to the allocation information.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 8 when executing the program.
CN202210278975.5A 2022-03-21 Data distribution method applied to deduplication storage system and related equipment Active CN114741029B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210278975.5A CN114741029B (en) 2022-03-21 Data distribution method applied to deduplication storage system and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210278975.5A CN114741029B (en) 2022-03-21 Data distribution method applied to deduplication storage system and related equipment

Publications (2)

Publication Number Publication Date
CN114741029A true CN114741029A (en) 2022-07-12
CN114741029B CN114741029B (en) 2024-08-02


Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117235120A (en) * 2023-11-09 2023-12-15 支付宝(杭州)信息技术有限公司 Hypergraph data storage and query method and device with time sequence characteristics


Similar Documents

Publication Publication Date Title
US11789978B2 (en) System and method for load, aggregate and batch calculation in one scan in a multidimensional database environment
CN102968503B (en) The data processing method of Database Systems and Database Systems
KR102499076B1 (en) Graph data-based task scheduling method, device, storage medium and apparatus
CN111966649B (en) Lightweight online file storage method and device capable of efficiently removing weight
US8229968B2 (en) Data caching for distributed execution computing
US12045734B2 (en) Optimizing gradient boosting feature selection
WO2017016423A1 (en) Real-time new data update method and device
CN107450855B (en) Model-variable data distribution method and system for distributed storage
Xie et al. Towards cost reduction in cloud-based workflow management through data replication
CN112948279A (en) Method, apparatus and program product for managing access requests in a storage system
CN107920129A (en) A kind of method, apparatus, equipment and the cloud storage system of data storage
CN114429195A (en) Performance optimization method and device for hybrid expert model training
CN106933882B (en) Big data increment calculation method and device
CN109408722B (en) Community division method and device, computing equipment and storage medium
KR101661475B1 (en) Load balancing method for improving hadoop performance in heterogeneous clusters, recording medium and hadoop mapreduce system for performing the method
CN114741029B (en) Data distribution method applied to deduplication storage system and related equipment
CN114741029A (en) Data distribution method applied to deduplication storage system and related equipment
CN107644086B (en) Spatial data distribution method
US8533423B2 (en) Systems and methods for performing parallel multi-level data computations
US20130144838A1 (en) Transferring files
WO2020019315A1 (en) Computational operation scheduling method employing graphic data, system, computer readable medium, and apparatus
CN113535410B (en) Load balancing method and system for GIS space vector distributed computation
Li Dynamic Load Balancing Method for Urban Surveillance Video Big Data Storage Based on HDFS
CN114595030B (en) Cloud container resource allocation method and system based on auction algorithm
CN117234524B (en) Quantum cloud computing compiling method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant