CN114579046B - Cloud storage similar data detection method and system

Info

Publication number
CN114579046B
Authority
CN
China
Prior art keywords
data
training
vector
cloud storage
semantics
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210070755.3A
Other languages
Chinese (zh)
Other versions
CN114579046A (en)
Inventor
田纹龙
何婷婷
叶旭明
薛晓晔
李瑞轩
万亚平
欧阳纯萍
刘永彬
刘征海
刘洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of South China
Original Assignee
University of South China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of South China
Priority to CN202210070755.3A
Publication of CN114579046A
Application granted
Publication of CN114579046B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/064 Management of blocks
    • G06F3/0641 De-duplication techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/14 Details of searching files based on file metadata
    • G06F16/148 File search processing
    • G06F16/152 File search processing using file content signatures, e.g. hash values
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/174 Redundancy elimination performed by the file system
    • G06F16/1748 De-duplication implemented within the file system, e.g. based on file segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Software Systems (AREA)
  • Library & Information Science (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a cloud storage similar data detection method and system. In a model training stage, the method preprocesses training data to obtain training data blocks; extracts feature vectors of all training data blocks with the MinHash algorithm to obtain first vectors without embedded semantics; and trains a machine learning model on them, obtaining a trained model and a weight matrix between the first vectors and the vectors with embedded context semantics. In a model prediction stage, it processes the prediction data with the trained model, using the same method as for preprocessing the training data, to obtain prediction data blocks; extracts feature vectors of all prediction data blocks with the MinHash algorithm to obtain vectors of the prediction data without embedded semantics; multiplies these vectors by the weight matrix to obtain semantics-embedded vectors of the prediction data; and finds the most similar data block with the Annoy algorithm. The method reduces computational cost, solves the problem of unstable feature-value extraction, and improves detection accuracy.

Description

Cloud storage similar data detection method and system
Technical Field
The invention relates to the technical field of similar data detection, and in particular to a cloud storage similar data detection method and system.
Background
With the development of networks and storage technologies, cloud storage has become widely used in daily life; owing to its reliability and flexibility, people increasingly prefer to pay cloud storage services to keep their data online. However, cloud storage services are flooded with a large amount of redundant data. Such redundant data not only reduces the storage utilization of the cloud storage service provider but also increases the cost of the user's cloud storage service. For this reason, conventional redundant-data deduplication is one of the important technologies widely used in cloud storage today: by identifying and eliminating redundant data blocks, it effectively improves cloud storage utilization and saves users data storage costs. However, conventional deduplication can only distinguish redundant data blocks from non-redundant ones; it cannot identify and eliminate the redundant portions within similar data blocks. Existing similar data detection techniques therefore use the fingerprint values and distribution of data blocks to determine the redundant portions among similar data blocks. These methods, however, are not robust: they are easily disturbed by other factors, such as modification and deletion of data block content or changes in block length, which makes their feature extraction unstable.
Disclosure of Invention
In view of the above problems, the present invention provides a cloud storage similar data detection method, in particular a cloud storage similar data detection method based on block-level semantics, comprising:
a model training stage, whose training steps are:
preprocessing training data to obtain training data blocks;
extracting feature vectors of all training data blocks using the MinHash algorithm to obtain first vectors without embedded semantics;
training a machine learning model on the first vectors to obtain a trained model and a weight matrix between the first vectors and the vectors with embedded context semantics;
and a model prediction stage, whose prediction steps are:
processing the prediction data with the trained model, using the same method as for preprocessing the training data, to obtain prediction data blocks;
extracting feature vectors of all prediction data blocks using the MinHash algorithm to obtain vectors of the prediction data without embedded semantics;
multiplying the vectors of the prediction data without embedded semantics by the weight matrix to obtain semantics-embedded vectors of the prediction data;
and constructing all semantics-embedded vectors into a binary tree with the Annoy algorithm, where each vector is a node of the binary tree, and determining the other nodes closest to the node corresponding to the current data block, thereby finding the data block most similar to the current data block.
Preferably, the step of extracting feature vectors of all training data blocks using the MinHash algorithm includes:
taking a preset number of hash functions, scanning the content of each training data block, computing the hash value corresponding to each hash function, and then averaging the computed hash values to obtain the initial feature value of the training data block;
and scanning the initial feature value with a sliding window, taking the data in the window as a sub-feature value each time the window moves, generating the feature vector corresponding to each sub-feature value through a mapping function between sub-feature values and feature vectors, and finally averaging the feature vectors of all sub-feature values as the feature vector of the data block.
Preferably, the step of preprocessing the training data to obtain training data blocks comprises:
unifying the input training data types into a bit stream;
and dividing the bit stream into a number of training data blocks.
Preferably, training a machine learning model on the first vectors to obtain the weight matrix between the first vectors and the vectors with embedded context semantics specifically comprises:
inputting the first vectors corresponding to the context of a data block into the input layer of the machine learning model, taking the first vector corresponding to the data block itself as the output layer, using the difference between the input layer and the output layer as the loss, and continuously updating the weight matrix, finally obtaining a weight matrix with embedded context information.
Preferably, the weight matrix specifically comprises an output-layer weight matrix and an input-layer weight matrix.
Preferably, in multiplying the vectors of the prediction data without embedded semantics by the weight matrix to obtain the semantics-embedded vectors of the prediction data, the weight matrix used for the matrix multiplication is the output-layer weight matrix of the machine learning model.
Preferably, after the most similar data blocks are found, differential encoding is further used to remove the redundant portions shared by similar data blocks.
According to another aspect of the present invention, a cloud storage similar data detection system is also disclosed, in particular a cloud storage similar data detection system based on block-level semantics, comprising a memory and a processor, the memory storing a computer program;
the processor is configured, when the computer program runs, to execute a cloud storage similar data detection method as described above.
The invention fully considers the contextual relations among data blocks, i.e., the semantic information between them, and provides a cloud storage similar data detection technique based on block-level semantics. Using machine learning for representation learning, it breaks away from the traditional approach to similar-block recognition that relies on extracted hash values: by combining the context of data blocks and embedding semantics into their feature sets, it reduces computational cost, solves the unstable feature-value extraction of the prior art, improves the accuracy of similar data block detection, and improves storage utilization and user experience.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a flow chart of a detection method according to an embodiment of the invention.
Detailed Description
Various exemplary embodiments of the present invention will be described in detail below with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless it is specifically stated otherwise.
Meanwhile, it should be understood that, for convenience of description, the sizes of the respective parts shown in the drawings are not drawn to actual scale.
The following description of at least one exemplary embodiment is merely exemplary in nature and is in no way intended to limit the invention, its application, or uses.
The present invention will be further described in detail below with reference to specific embodiments and with reference to the accompanying drawings, in order to make the objects, technical solutions and advantages of the present invention more apparent.
Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but should be considered part of the specification where appropriate.
In all examples shown and discussed herein, any specific values should be construed as merely illustrative, and not a limitation. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further discussion thereof is necessary in subsequent figures.
In the first embodiment, an example of a similar data detection method based on block-level semantics is described in detail below with reference to FIG. 1.
In the model training stage, the training steps include:
1. preprocessing training data to obtain training data blocks;
2. extracting feature vectors of all training data blocks using the MinHash algorithm to obtain vectors without embedded semantics as initial vectors (i.e., the first vectors);
3. training a machine learning model on the first vectors to obtain a trained model and a weight matrix between the first vectors and the vectors with embedded context semantics.
In the model prediction stage, the prediction steps are:
1. processing the prediction data with the trained model, using the same method as for preprocessing the training data, to obtain prediction data blocks;
2. extracting feature vectors of all prediction data blocks using the MinHash algorithm to obtain vectors of the prediction data without embedded semantics;
3. multiplying the vectors of the prediction data without embedded semantics by the weight matrix to obtain semantics-embedded vectors of the prediction data;
4. constructing all semantics-embedded vectors into a binary tree with the Annoy algorithm, where each vector is a node of the binary tree, and determining the other nodes closest to the node corresponding to the current data block, thereby finding the data block most similar to the current data block (a sketch of this step follows).
In some embodiments, the step in the model training phase of extracting feature vectors of all training data blocks using the MinHash algorithm includes:
taking n hash functions (e.g., n = 80, n = 400, or another chosen number), scanning the content of a training data block, computing the hash value corresponding to each hash function, and then averaging the computed hash values to obtain the initial feature value of the training data block, which reduces the feature deviation caused by differing content within the data block;
and scanning the initial feature value with a sliding window, taking the data in the window as a sub-feature value each time the window moves, generating the feature vector corresponding to each sub-feature value through a mapping function between sub-feature values and feature vectors, and finally averaging the feature vectors of all sub-feature values as the feature vector of the data block (a sketch of these two steps follows).
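A minimal sketch of these two steps, under stated assumptions: the patent fixes neither the n hash functions nor the sub-feature-to-vector mapping, so salted SHA-1 digests, a window over the decimal digits of the initial feature value, and a seeded pseudo-random projection are all illustrative stand-ins:

```python
import hashlib
import numpy as np

def initial_feature_value(block: bytes, n: int = 80) -> int:
    # One hash value per hash function; here the i-th "function" is a
    # salted SHA-1 over the block content (an illustrative choice).
    hashes = [
        int.from_bytes(hashlib.sha1(i.to_bytes(4, "big") + block).digest()[:8], "big")
        for i in range(n)
    ]
    return sum(hashes) // n  # sum and average -> initial feature value

def block_feature_vector(block: bytes, dim: int = 128, window: int = 4) -> np.ndarray:
    digits = str(initial_feature_value(block))
    vectors = []
    # Slide a window over the initial feature value; each window content is
    # a sub-feature value, mapped to a vector by a seeded projection
    # (standing in for the patent's unspecified mapping function).
    for i in range(len(digits) - window + 1):
        sub = digits[i : i + window]
        rng = np.random.default_rng(int(sub))
        vectors.append(rng.standard_normal(dim))
    # Average the sub-feature vectors -> feature vector of the data block.
    return np.mean(vectors, axis=0)
```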
In some embodiments, preprocessing the training data to obtain training data blocks includes:
unifying the input training data types into a bit stream;
and dividing the bit stream into N training data blocks (for example, by fixed-size chunking, as sketched below).
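A minimal sketch of this preprocessing; the patent does not specify the chunking scheme, so fixed-size chunking and the block size are illustrative assumptions:

```python
def preprocess(data: bytes, block_size: int = 4096) -> list[bytes]:
    # Treat any input uniformly as a byte/bit stream and cut it into N
    # training data blocks (the last block may be shorter).
    return [data[i : i + block_size] for i in range(0, len(data), block_size)]
```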
In some embodiments, training a machine learning model on the first vectors to obtain the weight matrix between the first vectors and the vectors with embedded context semantics specifically comprises:
inputting the first vectors corresponding to the context of a data block into the input layer of the machine learning model, taking the first vector corresponding to the data block itself as the output layer, using the difference between the input layer and the output layer as the loss, and continuously updating the weight matrix, finally obtaining a weight matrix with embedded context information.
Specifically, the machine learning network consists of an input layer, an intermediate layer, and an output layer, where the intermediate layer is connected to the input layer and to the output layer through two weight matrices W and U, respectively. The input layer X is the context of the data block, i.e., the initial vectors of the first k and last k data blocks around the current data block; the output layer Y is the initial vector of the current data block; and the intermediate layer represents the semantics-embedded vector (with initial value 0). The input layer projected through W gives hidden1 (hidden1 = X·W), and the output layer mapped back through U gives hidden2 (hidden2 = Y·U⁻¹); both can be seen as semantics-embedded vectors. The difference between hidden1 and hidden2 is therefore taken as the loss, the weight matrices W and U are continuously updated, and finally weight matrices W and U with embedded context information are obtained. With the weight matrix U, only the initial feature vector of a data block need be input to obtain its semantics-embedded feature vector, without inputting any context information. (A training sketch under these definitions follows.)
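A minimal numpy sketch of this training loop, under stated assumptions: the patent's hidden2 = Y·U⁻¹ is approximated here by training U so that Y·U directly matches X·W (same intent, while avoiding the inverse of a non-square matrix); the context size k, embedding dimension, learning rate, and epoch count are illustrative, not taken from the patent:

```python
import numpy as np

def train_semantic_model(init_vecs, k=2, dim=64, lr=0.01, epochs=10):
    """init_vecs: (N, d) array of initial (MinHash) vectors, one per block."""
    n, d = init_vecs.shape
    rng = np.random.default_rng(0)
    W = rng.standard_normal((d, dim)) * 0.01  # input-layer weight matrix
    U = rng.standard_normal((d, dim)) * 0.01  # output-layer weight matrix
    for _ in range(epochs):
        for i in range(k, n - k):
            # Context: mean of the initial vectors of the k blocks before
            # and after the current block (input layer X).
            ctx = np.concatenate([init_vecs[i - k : i], init_vecs[i + 1 : i + k + 1]])
            x = ctx.mean(axis=0)
            y = init_vecs[i]          # current block (output layer Y)
            hidden1 = x @ W           # semantics via the context path
            hidden2 = y @ U           # semantics via the current-block path
            diff = hidden1 - hidden2  # this difference is used as the loss
            # Gradient steps that shrink ||hidden1 - hidden2||^2.
            W -= lr * np.outer(x, diff)
            U += lr * np.outer(y, diff)
    return W, U
```

Minimizing the difference between hidden1 and hidden2 drives the two semantic projections together, so after training, either X·W (with context) or Y·U (without context) yields a semantics-embedded vector.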
In some embodiments, multiplying the vectors of the prediction data without embedded semantics by the weight matrix in the prediction stage to obtain the semantics-embedded vectors further includes:
the weight matrix used for the matrix multiplication is the output-layer weight matrix U of the machine learning model. The weight matrix U obtained through the training process can be reused, or training can be continued on top of the existing model with other data; this avoids repeated computation and reduces the time and computational cost of data deduplication (see the snippet below).
In some embodiments, after the most similar data blocks are found in the prediction stage, differential encoding is further used to remove the redundant portions shared by similar data blocks.
Specifically, the data compression steps are:
(1) Acquire the data blocks of the training data and the semantic model corresponding to the data blocks, and set a compression threshold g.
(2) Extract the parameters of the semantic model corresponding to each data block as that block's compression feature matrix.
(3) Traverse all the data blocks and perform the following operations:
Step one: obtain the compression feature matrix of the current data block.
Step two: traverse the compression feature matrices of all basic (Base) blocks and find the basic block whose compression feature matrix has the minimum distance to the current one.
Step three: if the distance between the two compression feature matrices is larger than the set threshold g, the current data block is not suitable for compression; store it as-is and add its compression feature matrix to those of the basic blocks, i.e., treat it as a new Base block.
Step four: if the distance between the two compression feature matrices is smaller than the set threshold g, compress the current data block: generate a Delta data block with a delta compression algorithm, the Delta block containing only the parts in which the current block differs from the Base block, and add the index of the most similar data block found together with the Delta block into a Delta file.
(4) Through the above steps, the data uploaded by the user is compressed into a Base file and a Delta file whose combined size is smaller than that of the originally uploaded data file, thereby achieving the goal of removing the redundant portions among similar data blocks by differential encoding. (A sketch of this traversal follows.)
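A sketch of this traversal, with illustrative choices the patent leaves open: the distance between compression feature matrices is taken as the Frobenius norm, and `difflib` opcodes stand in for a real delta compression algorithm:

```python
import difflib
import numpy as np

def delta_compress(blocks, features, g):
    """blocks: list of bytes; features: list of np arrays (the compression
    feature matrices); g: compression threshold. Illustrative sketch only."""
    base_blocks, base_feats, base_file, delta_file = [], [], [], []
    for block, feat in zip(blocks, features):
        if base_feats:
            # Step two: find the Base block at minimum feature distance.
            dists = [np.linalg.norm(feat - bf) for bf in base_feats]
            j = int(np.argmin(dists))
            if dists[j] < g:
                # Step four: store only the differing parts plus the index
                # of the most similar Base block.
                ops = difflib.SequenceMatcher(None, base_blocks[j], block).get_opcodes()
                delta = [(tag, i1, i2, block[j1:j2])
                         for tag, i1, i2, j1, j2 in ops if tag != "equal"]
                delta_file.append((j, delta))
                continue
        # Step three: not compressible against any Base block; keep it
        # as-is and register it as a new Base block.
        base_blocks.append(block)
        base_feats.append(feat)
        base_file.append(block)
    return base_file, delta_file
```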
According to another embodiment, a cloud storage similar data detection system is disclosed, in particular a cloud storage similar data detection system based on block-level semantic embedding, comprising a memory and a processor, wherein a computer program is stored in the memory;
and the processor is configured, when running the computer program, to perform a cloud storage similar data detection method based on block-level semantic embedding as in any of the embodiments described above.
The cloud storage similar data detection system based on block-level semantic embedding can run on computing devices such as desktop computers, notebook computers, palmtop computers, and cloud servers. The devices on which it runs may include, but are not limited to, a processor and a memory.
Those skilled in the art will appreciate that the above is merely an example of a cloud storage similar data detection system based on block-level semantic embedding and does not limit it; the system may include more or fewer components than in the example, combine certain components, or use different components. For instance, it may further include input/output devices, network access devices, buses, and so on. The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), another programmable logic device, a discrete gate or transistor logic device, discrete hardware components, or the like. A general-purpose processor may be a microprocessor or any conventional processor; the processor is the control center of the cloud storage similar data detection system based on block-level semantic embedding and connects the various parts of the whole operable system through various interfaces and lines. The memory may be used to store the computer program and/or modules, and the processor implements the various functions of the system by running or executing the computer program and/or modules stored in the memory and invoking data stored in the memory. The memory may mainly include a program storage area and a data storage area. In addition, the memory may include high-speed random access memory and may also include non-volatile memory, such as a hard disk, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, at least one disk storage device, a flash memory device, or another non-volatile solid-state storage device.
The above description covers only the preferred embodiments of the present invention and is not intended to limit the invention; any modifications, equivalents, and improvements made within the spirit and principles of the present invention shall be included in its scope.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises that element.

Claims (8)

1. A cloud storage similar data detection method, in particular a cloud storage similar data detection method based on block-level semantics, characterized by comprising:
a model training stage, whose training steps are:
preprocessing training data to obtain training data blocks;
extracting feature vectors of all training data blocks using the MinHash algorithm to obtain first vectors without embedded semantics;
training a machine learning model on the first vectors to obtain a trained model and a weight matrix between the first vectors and the vectors with embedded context semantics;
and a model prediction stage, whose prediction steps are:
processing the prediction data with the trained model, using the same method as for preprocessing the training data, to obtain prediction data blocks;
extracting feature vectors of all prediction data blocks using the MinHash algorithm to obtain vectors of the prediction data without embedded semantics;
multiplying the vectors of the prediction data without embedded semantics by the weight matrix to obtain semantics-embedded vectors of the prediction data;
and constructing all semantics-embedded vectors into a binary tree with the Annoy algorithm, where each vector is a node of the binary tree, and determining the other nodes closest to the node corresponding to the current data block, thereby finding the data block most similar to the current data block.
2. The cloud storage similar data detection method of claim 1, wherein the step of extracting feature vectors of all training data blocks using the MinHash algorithm comprises:
taking a preset number of hash functions, scanning the content of each training data block, computing the hash value corresponding to each hash function, and then averaging the computed hash values to obtain the initial feature value of the training data block;
and scanning the initial feature value with a sliding window, taking the data in the window as a sub-feature value each time the window moves, generating the feature vector corresponding to each sub-feature value through a mapping function between sub-feature values and feature vectors, and finally averaging the feature vectors of all sub-feature values as the feature vector of the data block.
3. The cloud storage similar data detection method of claim 1, wherein preprocessing the training data to obtain training data blocks comprises:
unifying the input training data types into a bit stream;
and dividing the bit stream into a number of training data blocks.
4. The cloud storage similar data detection method of claim 1, wherein training a machine learning model on the first vectors to obtain the weight matrix between the first vectors and the vectors with embedded context semantics specifically comprises:
inputting the first vectors corresponding to the context of a data block into the input layer of the machine learning model, taking the first vector corresponding to the data block as the output layer of the machine learning model, using the difference between the input layer and the output layer as the loss, and continuously updating the weight matrix, finally obtaining a weight matrix with embedded context information.
5. The cloud storage similar data detection method of claim 4, wherein the weight matrix comprises an output-layer weight matrix and an input-layer weight matrix.
6. The cloud storage similar data detection method of claim 5, wherein, in multiplying the vectors of the prediction data without embedded semantics by the weight matrix to obtain the semantics-embedded vectors of the prediction data,
the weight matrix used for the matrix multiplication is the output-layer weight matrix of the machine learning model.
7. The cloud storage similar data detection method of claim 1, wherein, after the most similar data blocks are found, differential encoding is further used to remove the redundant portions shared by similar data blocks.
8. A cloud storage similar data detection system, in particular a cloud storage similar data detection system based on block-level semantics, comprising a memory and a processor, wherein a computer program is stored in the memory;
and the processor, when executing the computer program, is configured to perform the cloud storage similar data detection method of any one of claims 1-7.
CN202210070755.3A 2022-01-21 2022-01-21 Cloud storage similar data detection method and system Active CN114579046B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210070755.3A CN114579046B (en) 2022-01-21 2022-01-21 Cloud storage similar data detection method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210070755.3A CN114579046B (en) 2022-01-21 2022-01-21 Cloud storage similar data detection method and system

Publications (2)

Publication Number Publication Date
CN114579046A CN114579046A (en) 2022-06-03
CN114579046B true CN114579046B (en) 2024-01-02

Family

ID=81771683

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210070755.3A Active CN114579046B (en) 2022-01-21 2022-01-21 Cloud storage similar data detection method and system

Country Status (1)

Country Link
CN (1) CN114579046B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FI20080534A0 (en) * 2008-09-22 2008-09-22 Envault Corp Oy Safe and selectively contested file storage
CN102158557A (en) * 2011-04-12 2011-08-17 华中科技大学 Security strategy decomposition and verification system in cloud storage environment
CN105338027A (en) * 2014-07-30 2016-02-17 杭州海康威视***技术有限公司 Method, system and device for cloud storage of video data
EP3176717A2 (en) * 2015-12-02 2017-06-07 Panasonic Intellectual Property Management Co., Ltd. Control method, processing apparatus, and non-transitory computer-readable recording medium
CN106776370A (en) * 2016-12-05 2017-05-31 哈尔滨工业大学(威海) Cloud storage method and device based on the assessment of object relevance
CN108287816A (en) * 2017-01-10 2018-07-17 腾讯科技(深圳)有限公司 Point of interest on-line checking, Machine learning classifiers training method and device
CN110472045A (en) * 2019-07-11 2019-11-19 中山大学 A kind of short text falseness Question Classification prediction technique and device based on document insertion
CN111639197A (en) * 2020-05-28 2020-09-08 山东大学 Cross-modal multimedia data retrieval method and system with label embedded online hash
CN112287662A (en) * 2020-10-29 2021-01-29 平安科技(深圳)有限公司 Natural language processing method, device and equipment based on multiple machine learning models
CN112580507A (en) * 2020-12-18 2021-03-30 合肥高维数据技术有限公司 Deep learning text character detection method based on image moment correction

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Image matching method combining the SURF and FLANN algorithms; Zhou Zhiwei; Yuan Fengwei; Zhang Kang; Wu Zhi; Intelligent Computer and Applications (06); full text *
A knowledge subgraph fusion method based on short-text similarity computation; Zheng Zhiyun; Wu Jianping; Li Dun; Liu Yun; Mi Gaoyang; Journal of Chinese Computer Systems (01); full text *
A perceptual-hash target tracking algorithm incorporating dynamic prediction; Chen Youliang; Xiao Gang; Bian Huan; Hu Min; Bulletin of Surveying and Mapping (02); full text *
Cross-modal retrieval algorithm combining hash features and classifier learning; Liu Haoxin; Wu Xiaojun; Yu Jun; Pattern Recognition and Artificial Intelligence (02); full text *

Also Published As

Publication number Publication date
CN114579046A (en) 2022-06-03

Similar Documents

Publication Publication Date Title
CN111461637A (en) Resume screening method and device, computer equipment and storage medium
KR102432600B1 (en) Method and system for detecting duplicated document using vector quantization
CN112328909B (en) Information recommendation method and device, computer equipment and medium
CN111159413A (en) Log clustering method, device, equipment and storage medium
CN113689285B (en) Method, device, equipment and storage medium for detecting user characteristics
CN110825894A (en) Data index establishing method, data index retrieving method, data index establishing device, data index retrieving device, data index establishing equipment and storage medium
CN110969172A (en) Text classification method and related equipment
CN114245896A (en) Vector query method and device, electronic equipment and storage medium
CN111340075B (en) Network data detection method and device for ICS
CN115456043A (en) Classification model processing method, intent recognition method, device and computer equipment
WO2023029350A1 (en) Click behavior prediction-based information pushing method and apparatus
CN109885831B (en) Keyword extraction method, device, equipment and computer readable storage medium
CN107562853A (en) A kind of method that streaming towards magnanimity internet text notebook data is clustered and showed
CN110390011B (en) Data classification method and device
CN113496123A (en) Rumor detection method, rumor detection device, electronic equipment and storage medium
CN114579046B (en) Cloud storage similar data detection method and system
CN116226681A (en) Text similarity judging method and device, computer equipment and storage medium
CN116703659A (en) Data processing method and device applied to engineering consultation and electronic equipment
CN116366603A (en) Method and device for determining active IPv6 address
CN113947185B (en) Task processing network generation method, task processing device, electronic equipment and storage medium
CN115292008A (en) Transaction processing method, device, equipment and medium for distributed system
CN114625315B (en) Cloud storage similar data detection method and system based on meta-semantic embedding
CN112860626A (en) Document sorting method and device and electronic equipment
CN113934842A (en) Text clustering method and device and readable storage medium
CN115686597A (en) Data processing method and device, electronic equipment and storage medium

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant