CN110647585A

CN110647585A - Data deployment system with automatic screening and backup functions

Info

Publication number: CN110647585A
Application number: CN201910906104.1A
Authority: CN
Inventors: 宋仪轩; 阚苏立; 谢可辉
Original assignee: Jiangsu Healthcare Big Data Protection And Development Co Ltd
Current assignee: Jiangsu Healthcare Big Data Protection And Development Co Ltd
Priority date: 2019-09-24
Filing date: 2019-09-24
Publication date: 2020-01-03

Abstract

The invention relates to the technical field of data deployment, and in particular to a data deployment system with automatic screening and backup functions. The system comprises a data screening unit and a data backup unit, which exchange data over the Internet. The data screening unit addresses a situation typical of a big data environment, in which the initial detectors are massive in number while the self bodies are limited: following the requirements of the initial detector screening algorithm, the initial detectors and the self bodies are redundantly replicated and distributed in a reasonable manner, so that an environment suitable for rapidly screening the massive initial detectors is constructed. In the data backup unit, a data channel acquires data from a client and caches the data in the cloud storage unit, and the processed data are stored in virtualized storage, so that cloud backup of the data is achieved and data loss is prevented.

Description

Data deployment system with automatic screening and backup functions

Technical Field

The invention relates to the technical field of data deployment, in particular to a data deployment system with automatic screening and backup functions.

Background

With the advent of the big data era, the volume of data keeps growing, which makes data deployment inefficient. Moreover, because the data carry a large amount of information, they are difficult to retrieve once the data deployment system is damaged, causing serious losses.

Disclosure of Invention

The present invention is directed to a data deployment system with automatic screening and backup functions to address one or more of the deficiencies set forth in the background above.

In order to achieve the above object, the present invention provides a data deployment system with automatic screening and backup functions, which includes a data screening unit and a data backup unit, wherein the data screening unit and the data backup unit implement data interaction through the internet, the data screening unit is used for screening data, and the flow of the data screening unit is as follows:

S11, storing the limited number of self bodies in the memory of each storage and computing node;

S12, checking, according to the matching rule, each detector in the massive initial detector subset held by the storage and computing node against the self bodies;

S13, judging whether each initial detector in the massive initial detector subset can become a candidate mature detector;

S14, sending each candidate mature detector, together with its maximum matching value with the self bodies, to the optimization node.

Preferably, in S12, the BMH2CKMP algorithm is used as the matching rule, and the steps are as follows:

first, modifying the BMH2C algorithm so that it performs forward matching;

second, preprocessing the pattern string to obtain a next array;

third, during pattern matching, if a character mismatch occurs, judging whether the jump value is positive or negative: if positive, jumping forward to obtain the maximum displacement; if negative, looking up the next array and moving the position of j so that i does not backtrack.

Preferably, the next array is defined as follows:

that is, $\mathrm{next}[j] = k > 0$ indicates that $P[0 \ldots k-1] = P[j-k \ldots j-1]$.
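As an illustration of the preprocessing step, the following minimal Python sketch builds a next array under the conventional KMP reading of the property stated above, in which next[j] = k > 0 means that the first k characters of the pattern equal the k characters ending just before position j. The function name `build_next` and the convention next[0] = -1 are assumptions, since the defining formula itself is not reproduced in this text; the forward-jump part of BMH2CKMP is not shown.

```python
def build_next(pattern: str) -> list[int]:
    """Return next[], where next[j] is the length k of the longest proper
    prefix of pattern[:j] that is also a suffix of pattern[:j].
    Assumed convention (not taken from the source): next[0] = -1."""
    nxt = [-1] * len(pattern)
    k = -1  # length of the border currently being extended, or -1
    j = 0   # next[j + 1] is the entry filled in on a successful step
    while j < len(pattern) - 1:
        if k == -1 or pattern[j] == pattern[k]:
            j += 1
            k += 1
            nxt[j] = k
        else:
            k = nxt[k]  # fall back to a shorter border
    return nxt
```

On a mismatch at pattern position j, a matcher using this array would move j to next[j] and leave the text index i where it is, which is what keeps i from backtracking.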

Preferably, the process for judging the candidate maturity detector is as follows:

S21, saving the candidate mature detectors sent to the optimization node by all storage and computing nodes;

S22, sorting the massive candidate mature detectors from small to large by maximum matching degree and constructing a set;

S23, taking out the candidate mature detectors in a loop;

S24, judging whether the number of detectors has reached the value set by the system; if so, looping to S23, and if not, proceeding to the next step;

S25, taking the first candidate mature detector in the set and putting it into the optimized set.
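The optimization flow S21 to S25 can be sketched in Python as follows. This is only one possible reading: the loop condition in S24 is interpreted here as "keep taking candidates until the optimized set reaches the size set by the system", and the names `optimize_candidates` and `limit` are hypothetical.

```python
def optimize_candidates(candidates, limit):
    """candidates: iterable of (detector, max_match) pairs received from
    the storage and computing nodes (S21).
    limit: detector count set by the system (hypothetical parameter)."""
    # S22: sort from small to large by maximum matching degree.
    ordered = sorted(candidates, key=lambda pair: pair[1])
    optimized = []
    # S23 to S25: take the first remaining candidate each time, stopping
    # once the optimized set holds `limit` detectors.
    for detector, _max_match in ordered:
        if len(optimized) >= limit:
            break
        optimized.append(detector)
    return optimized
```

Sorting in ascending order means that the detectors least similar to the self bodies are preferred when the limit is smaller than the number of candidates.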

Preferably, the data backup unit comprises a cloud storage unit and a client unit, and the flow of the cloud storage unit is as follows:

S31, the data channel obtains data from the client and caches the data in the cloud storage unit;

S32, carrying out related processing on the data;

S33, storing the processed data in virtualized storage.

Preferably, in S32, the related processing of the data is performed by a data compression unit, a data encryption unit and a data deduplication unit.

Preferably, the data compression unit adopts an LZHJ algorithm, and the flow of the LZHJ algorithm is as follows:

① the string in the forward buffer is denoted $X_1, X_2, \ldots, X_N$;

② the current matching string is denoted $Y_1, Y_2, \ldots, Y_K$, where $Y_K$ is the last character in the sliding compression window;

③ the current maximum matching length is recorded as $m$, with $N > m$, and $X_1 = Y_1$, $X_2 = Y_2$;

④ $X_1, X_2, \ldots, X_N$ and $Y_1, Y_2, \ldots, Y_K$ are compared sequentially, and the resulting matching length is recorded as $\mathrm{Length} = \max\{\, i \mid X_i = Y_i,\ 2 \le i \le \min(N, K) \,\}$.
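The sequential comparison in step ④ can be illustrated with a short Python sketch. It assumes that "sequential comparison" means comparing character by character from the start and stopping at the first mismatch, so that the result is the length of the common prefix; the function name `match_length` is hypothetical.

```python
def match_length(x: str, y: str) -> int:
    """Count how many leading characters of x and y agree.
    x: string in the forward buffer (X_1 .. X_N)
    y: current candidate match (Y_1 .. Y_K)"""
    length = 0
    for a, b in zip(x, y):  # zip stops after min(N, K) characters
        if a != b:
            break
        length += 1
    return length
```

A compressor would keep a candidate only if the returned length exceeds the current maximum m; that bookkeeping is omitted here.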

Preferably, the data encryption unit adopts RSA algorithm, and the algorithm steps are as follows:

arbitrarily selecting two different large prime numbers p and q, and calculating their product r = p × q;

arbitrarily selecting a large integer e that is relatively prime to (p-1) × (q-1), the integer e serving as the encryption key;

determining the decryption key d;

making the integers r and e public while keeping d secret.

Preferably, the data de-duplication unit is classified based on de-duplication granularity, and the steps are as follows:

first, deduplication at the whole-file level;

second, elimination of redundant file blocks;

third, elimination of redundancy at the byte level.

Preferably, the client unit comprises an application interface module and an operation request module, wherein the application interface module provides the relevant application program to clients that need data backup and recovery, and the operation request module allows a client to send data backup or recovery requests through the application program.

Compared with the prior art, the invention has the beneficial effects that:

1. In the data deployment system with the automatic screening and backup functions, a data screening unit is provided. In view of the massive number of initial detectors and the limited number of self bodies in a big data environment, and following the requirements of the initial detector screening algorithm, the initial detectors and the self bodies are redundantly replicated and distributed in a reasonable manner, so that an environment suitable for rapidly screening massive initial detectors is constructed.

2. In the data deployment system with the automatic screening and backup functions, a data backup unit is arranged, a data channel acquires data from a client and caches the data in a cloud storage unit, processed data are stored in a virtualized storage, data cloud backup is achieved, and data loss is prevented.

3. In the data deployment system with the automatic screening and backup functions, data are processed through the data compression unit, the data encryption unit and the repeated data deletion unit, so that redundant information is reduced, and data processing is accelerated.

Drawings

FIG. 1 is a block diagram of the overall structure of the present invention;

FIG. 2 is a flow chart of a data screening unit of the present invention;

FIG. 3 is a flow chart of a candidate maturity detector of the present invention;

FIG. 4 is a flow chart of the BMH2CKMP algorithm of the present invention;

FIG. 5 is a block diagram of a data backup unit according to the present invention;

FIG. 6 is a flow chart of a cloud storage unit of the present invention;

FIG. 7 is a block diagram of data processing modules of the present invention;

FIG. 8 is a flow chart of the LZHJ algorithm of the present invention;

FIG. 9 is a flow chart of the RSA algorithm of the present invention;

FIG. 10 is a flow chart of the deduplication granularity classification of the present invention;

FIG. 11 is a block diagram of a client unit of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Example 1

The invention provides a data deployment system with automatic screening and backup functions, as shown in fig. 1-4, comprising a data screening unit and a data backup unit, wherein the data screening unit and the data backup unit realize data interaction through the internet, the data screening unit is used for screening data, and the flow of the data screening unit is as follows:

In this embodiment, the data screening unit is based on the Map/Reduce model. Taking advantage of the distributed, concurrent execution on each storage and computing node, and given that the massive initial detector subsets are stored on those nodes, a partition monitoring strategy for the massive initial detectors is designed.

Specifically, in S12, the BMH2CKMP algorithm is used as the matching rule, and the steps are as follows:

preprocessing to obtain a next array from the pattern string;

The next array is defined as follows:

Still further, the specific steps of sending the candidate maturity detector and the maximum matching value with the self body to the optimization node are shown as the following algorithm:

Partition_Selection(detector_subset, self_set)
{
    while (detector_subset still has an unchecked initial detector)
    {
        take out an unchecked initial detector;
        set the maximum matching degree max_match to 0;
        set flag, which indicates whether the detector can become a candidate mature detector, to 1;
        while (self_set still has an unchecked self body)
        {
            check the matching degree m between the initial detector and the self body using the matching rule;
            if (m < M)    // M is the threshold set by the system
            {
                if (m > max_match)
                    max_match = m;
            }
            else
                flag = 0;    // the detector cannot become a candidate mature detector
        }
        if (flag == 1)
            output the candidate mature detector and its maximum matching degree max_match with the self bodies to the optimization node;
    }
}

The algorithm takes out the unchecked initial detectors in a loop and initializes each one, setting its maximum matching degree to 0 and its candidate mature detector flag to 1. The detector is then matched in a loop against the unchecked self bodies in the system: if the matching degree is smaller than the threshold set by the system, the maximum matching degree is updated whenever the new value exceeds it; otherwise the flag is set to 0, marking the detector as unable to become a candidate mature detector. Once a detector is determined to be a candidate mature detector, it is output to the optimization node together with its maximum matching degree max_match with the self bodies, for the next optimization step. The computational cost of the algorithm depends mainly on the number of initial detectors in the big data system and grows linearly with that number.
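A runnable Python sketch of the partition selection loop is given below for illustration. The matching rule used here, `match_degree`, is a hypothetical stand-in (the longest run of agreeing positions between two equal-length strings) and is not the BMH2CKMP rule of the invention; the `threshold` parameter corresponds to the system threshold M. Unlike the pseudocode, the sketch stops checking a detector as soon as it is rejected.

```python
def match_degree(detector: str, self_body: str) -> int:
    """Hypothetical matching rule: length of the longest run of
    consecutive positions at which the two strings agree."""
    best = run = 0
    for a, b in zip(detector, self_body):
        run = run + 1 if a == b else 0
        best = max(best, run)
    return best


def partition_selection(detector_subset, self_set, threshold):
    """Yield (detector, max_match) for every initial detector whose
    matching degree with every self body stays below threshold."""
    for detector in detector_subset:
        max_match = 0
        is_candidate = True  # plays the role of flag = 1
        for self_body in self_set:
            m = match_degree(detector, self_body)
            if m >= threshold:
                is_candidate = False  # flag = 0; early exit is an addition
                break
            max_match = max(max_match, m)
        if is_candidate:
            yield detector, max_match
```

In a Map/Reduce setting, each storage and computing node could run this function over its own detector subset and forward the yielded pairs to the optimization node.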

It is worth noting that the process of determining the candidate maturity detector is as follows:

s23, circularly taking out the detector of the candidate maturity;

Example 2

As a second embodiment of the present invention, the data backup unit is further improved in order to implement data backup. As a preferred embodiment, as shown in FIGS. 5 to 11, the data backup unit includes a cloud storage unit and a client unit, and the flow of the cloud storage unit is as follows:

S32, carrying out related processing on the data;

S33, storing the processed data in virtualized storage.

In S32, the related processing of the data is performed by a data compression unit, a data encryption unit and a data deduplication unit.

In this embodiment, the cloud storage unit is developed on Hadoop. HDFS, the Hadoop distributed file system, is designed to run on inexpensive hardware. It provides the underlying support for data storage in distributed computing, builds data fault tolerance into the software layer, and gives applications high-throughput access to data, so it can be applied to the creation and development of a cloud storage system.

Further, the data compression unit adopts an LZHJ algorithm, and the flow of the LZHJ algorithm is as follows:

② the current matching string is denoted $Y_1, Y_2, \ldots, Y_K$, where $Y_K$ is the last character in the sliding compression window;

Further, the data encryption unit adopts an RSA algorithm, which comprises the following steps:

randomly selecting a large integer e that is relatively prime to (p-1) × (q-1), the integer e serving as the encryption key; choosing e is easy, since, for example, any prime number larger than both p and q can be used;

determining the decryption key d, which can be calculated from e, p and q;

making the integers r and e public while keeping d secret.

A plaintext P (assumed to be an integer less than r) is encrypted into a ciphertext C as follows:

$C = P^e \bmod r$.

The ciphertext C is decrypted into the plaintext P as follows:

$P = C^d \bmod r$.

However, it is computationally infeasible to calculate d from r and e alone, without knowing p and q. Thus anyone can encrypt a plaintext, but only an authorized user who knows d can decrypt the ciphertext.
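The key generation, encryption and decryption steps above can be tried out with the following toy Python sketch. It assumes the standard RSA relation in which d is the modular inverse of e modulo (p-1) × (q-1); the function names and the small primes 61 and 53 are illustrative assumptions only, and textbook RSA of this kind, without padding and with such small primes, offers no real security.

```python
from math import gcd


def make_keys(p: int, q: int, e: int):
    """Return (r, e, d) for primes p and q and an encryption key e."""
    r = p * q
    phi = (p - 1) * (q - 1)
    if gcd(e, phi) != 1:
        raise ValueError("e must be relatively prime to (p-1)*(q-1)")
    d = pow(e, -1, phi)  # modular inverse; requires Python 3.8 or later
    return r, e, d


def encrypt(plain: int, e: int, r: int) -> int:
    return pow(plain, e, r)  # C = P^e mod r


def decrypt(cipher: int, d: int, r: int) -> int:
    return pow(cipher, d, r)  # P = C^d mod r


r, e, d = make_keys(61, 53, 17)  # toy primes, far too small for real use
cipher = encrypt(65, e, r)
assert decrypt(cipher, d, r) == 65
```

Only r and e would be published; p, q and d stay with the party that is allowed to decrypt.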

It should be noted that the deduplication unit is classified based on deduplication granularity, and the steps are as follows:

first, deduplication at the whole-file level;

second, elimination of redundant file blocks;

third, elimination of redundancy at the byte level.

Whole-file deduplication: duplicate data are detected and deleted with the whole file as the unit. The hash value of the whole file is calculated, and the storage system is searched for a file with the same hash value. The advantage of this method is that the calculation is very fast even on ordinary hardware;

File block redundancy elimination: a file is divided into data blocks in one of several ways, and duplicates are detected with the data block as the unit;

Byte-level deduplication: duplicate content is found and deleted at the byte level, and the differing content is typically generated by a differential compression strategy.
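The first two granularities can be illustrated with a brief Python sketch. The choice of SHA-256 as the hash function, the fixed block size, and the function names are assumptions made for the example; the description does not specify them, and byte-level differential compression is not shown.

```python
import hashlib


def file_level_dedup(files: dict[str, bytes]) -> dict[str, str]:
    """Map each file name to the name of the first file seen with the
    same content, using the hash of the whole file as the lookup key."""
    first_seen: dict[str, str] = {}  # digest -> first file name
    result = {}
    for name, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        result[name] = first_seen.setdefault(digest, name)
    return result


def block_level_dedup(data: bytes, block_size: int = 4):
    """Split data into fixed-size blocks and keep each distinct block once.
    Returns (store, recipe): store maps digest -> block, and recipe lists
    the digests in order so the data can be reassembled."""
    store: dict[str, bytes] = {}
    recipe = []
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)
        recipe.append(digest)
    return store, recipe
```

Joining `store[digest]` for each digest in `recipe` restores the original bytes, while repeated blocks occupy storage only once.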

The client unit comprises an application interface module and an operation request module, wherein the application interface module provides the relevant application program to clients that need data backup and recovery, and the operation request module allows a client to send data backup or recovery requests through the application program.

The foregoing shows and describes the general principles, essential features, and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above; the above embodiments and the description merely illustrate preferred embodiments of the present invention and are not intended to limit it. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims

1. A data deployment system with automatic screening and backup functions comprises a data screening unit and a data backup unit, and is characterized in that: the data screening unit and the data backup unit realize data interaction through the internet, the data screening unit is used for screening data, and the flow of the data screening unit is as follows:

2. The data deployment system with automatic screening and backup functions of claim 1, wherein: in S12, the BMH2CKMP algorithm is used as the matching rule, and the steps are as follows:

preprocessing to obtain a next array from the pattern string;

3. The data deployment system with automatic screening and backup functions of claim 2, wherein: the next array is defined as follows:

that is, $\mathrm{next}[j] = k > 0$ indicates that $P[0 \ldots k-1] = P[j-k \ldots j-1]$.

4. The data deployment system with automatic screening and backup functions of claim 1, wherein: the candidate maturity detector determination process is as follows:

s23, circularly taking out the detector of the candidate maturity;

5. The data deployment system with automatic screening and backup functions of claim 1, wherein: the data backup unit comprises a cloud storage unit and a client unit, and the cloud storage unit comprises the following processes:

S32, carrying out related processing on the data;

S33, storing the processed data in virtualized storage.

6. The data deployment system with automatic screening and backup functions of claim 5, wherein: in S32, the related processing of the data is performed by a data compression unit, a data encryption unit and a data deduplication unit.

7. The data deployment system with automatic screening and backup functions of claim 6, wherein: the data compression unit adopts an LZHJ algorithm, and the flow of the LZHJ algorithm is as follows:

① the string in the forward buffer is denoted $X_1, X_2, \ldots, X_N$;

② the current matching string is denoted $Y_1, Y_2, \ldots, Y_K$, where $Y_K$ is the last character in the sliding compression window;

③ the current maximum matching length is recorded as $m$, with $N > m$, and $X_1 = Y_1$, $X_2 = Y_2$;

④ $X_1, X_2, \ldots, X_N$ and $Y_1, Y_2, \ldots, Y_K$ are compared sequentially, and the resulting matching length is recorded as

$\mathrm{Length} = \max\{\, i \mid X_i = Y_i,\ 2 \le i \le \min(N, K) \,\}$.

8. The data deployment system with automatic screening and backup functions of claim 6, wherein: the data encryption unit adopts RSA algorithm, and the algorithm steps are as follows:

determining a decryption key d;

the integers r and e are disclosed, but d is not disclosed.

9. The data deployment system with automatic screening and backup functions of claim 6, wherein: the data de-duplication unit is classified based on de-duplication granularity and comprises the following steps:

first, deduplication at the whole-file level;

second, elimination of redundant file blocks;

third, elimination of redundancy at the byte level.

10. The data deployment system with automatic screening and backup functions of claim 1, wherein: the client unit comprises an application interface module and an operation request module, wherein the application interface module is used for providing related application programs for clients needing data backup and recovery, and the operation request module is used for sending data backup or recovery requests to the clients through the application programs.