CN112632057B - Data management method and system based on big data

Info

Publication number: CN112632057B
Application number: CN202110252566.3A
Authority: CN
Inventors: 郑玮琨; 盘思乐; 梁麟
Original assignee: Shenzhen Institute of Information Technology
Current assignee: Shenzhen Institute of Information Technology
Priority date: 2021-03-09
Filing date: 2021-03-09
Publication date: 2021-05-25
Anticipated expiration: 2041-03-09
Also published as: CN112632057A

Abstract

The invention relates to a data governance method and system based on big data. The method comprises the following steps: step S1: a starting data node initiates data governance and determines a data governance range; step S2: a control node is determined while the data governance range is being determined; step S3: the control node determines data governance subgraphs based on the system graph; step S4: data exchange and sharing are carried out based on the data governance subgraphs. The invention reduces memory and CPU load and improves data access efficiency; it enables data exchange and sharing among the corresponding data nodes of a service system, achieves the aim of data governance, and increases the value of the data. The method provides multi-level data governance with security guarantees, converts centralized data governance into partially random, distributed and secure governance, and achieves efficient data governance through local data exchange and redirection.

Description

Data management method and system based on big data

Technical Field

The invention belongs to the technical field of big data processing, and particularly relates to a data management method and system based on big data.

Background

Data is the foundation and core of big data engineering, and the integrity, timeliness and quality of data are preconditions for every objective. With the support of big data, economic and technological development is moving toward intelligent operation: by integrating diverse data, the operating conditions of every field of production can be monitored, so that safe production management can be improved and optimized. Through real-time acquisition, storage, analysis and comprehensive querying of big data, every industry can capture, discover and analyze information efficiently, extract valuable information economically from large volumes of heterogeneous data, and obtain data support for the integrated management, scheduling, coordination and command of production and operations. However, because organizations, service systems and data platforms differ, many organizations manage their data in isolation: data is not shared, data is duplicated, information is not connected, data is unevenly distributed, and data platforms are unevenly utilized. From the viewpoint of the data platform hardware, which does not see data content, limited device capacity causes some data to be overwritten too quickly, so that critical data is lost prematurely, while other data is retained unconditionally. How to govern big data effectively and serve it so that it satisfies user access requests is the technical problem to be solved. The invention realizes data exchange and sharing among the corresponding data nodes of a service system, achieves the goal of data governance and increases the value of the data. The method provides multi-level data governance with security guarantees, converts centralized data governance into partially random, distributed and secure governance, and achieves efficient data governance through local data exchange and redirection.

Disclosure of Invention

In order to solve the above problems in the prior art, the present invention provides a data governance method and system based on big data, wherein the method comprises:

step S1: a starting data node initiates data governance and determines a data governance range;

step S2: a control node is determined while the data governance range is being determined;

step S3: the control node determines a data governance subgraph based on the system graph;

step S4: data exchange and sharing are performed based on the data governance subgraph.

Further, the starting condition is specifically: the starting condition is met when a start trigger instruction is received.

Further, a timer is set for a specific data node, and when the timed point arrives, the specific data node becomes the starting data node.

Further, the starting data node that meets the starting condition initiates data governance and obtains the data governance range, specifically: the starting data node randomly selects P directly connected data nodes, sends a start instruction to the P data nodes, and places the P data nodes in the current data governance range; each of the P data nodes then repeats the steps of randomly selecting P directly connected data nodes, sending a start instruction and placing them in the current data governance range, until a termination condition is met;

before sending a start instruction, the current data node compares the historical average busy degree of the previous-level data node carried in the received start instruction with its own busy degree, and includes the smaller busy degree and the corresponding data node identifier in the start instruction it sends;

the termination condition is: the number of data nodes in the current data governance range reaches a preset number;

before sending a start instruction, the current data node requests that the P data nodes it has selected be added to the data governance range; if the size of the data governance range would not be exceeded after the P data nodes are added, the P data nodes are allowed to join; if adding the P data nodes would exceed the size of the data governance range, the termination condition is met and all subsequent requests to join the data governance range are refused; the data governance range is the set of data nodes participating in the current round of data governance; the size of the data governance range is the size of that set.

Further, the preset number is 1000.

A data governance system based on big data, the system comprising: a plurality of interconnected data nodes for storing data of a service system; any two data nodes either have or do not have a direct connection; the data nodes and their connections are represented by a system graph; each data node is a node in the system graph.

Further, the data of the service system is big data.

Further, the system comprises a client configured to initiate data access requests and to receive the results of those requests from the service system.

Further, the data of the service system is stored on one or more data nodes.

Further, the data nodes are also used for data transfer and sharing according to data governance instructions.

The invention provides multi-level data governance with security guarantees, converts centralized data governance into partially random, distributed and secure governance, and achieves efficient data governance through local data exchange and redirection. The beneficial effects specifically include: (1) a multi-layer system graph corresponding to the security levels is set up, and hierarchical data governance is realized as a whole, so that data movement and sharing are secured; (2) decentralized governance can be initiated by a data node only when the starting condition is met, and the scope of action is delineated randomly, which improves system security while still achieving basic data governance; (3) distributed control is derived from the global system graph, and the whole system is gradually balanced by quantitatively and repeatedly eliminating the lowest valleys of busy degree; (4) by combining data exchange with full and partial redirection and setting the exchange proportion quantitatively, a local balance is reached inside each subgraph, which reduces the difficulty of data governance.

Drawings

The accompanying drawings are included to provide a further understanding of the invention and constitute a part of this application; they are not to be considered limiting of the invention. In the drawings:

FIG. 1 is a schematic diagram of a big data-based data governance method according to the present invention.

Detailed Description

The present invention will now be described in detail with reference to the drawings and specific embodiments, wherein the exemplary embodiments and descriptions are provided only for the purpose of illustrating the present invention and are not to be construed as limiting the present invention.

The data governance system based on big data according to the invention comprises: a plurality of interconnected data nodes; the data nodes are used for storing data of a service system; the data of the service system is big data; any two data nodes either have or do not have a direct connection;

the data nodes and their connections are represented by a system graph; each data node is a node in the system graph; an edge in the system graph corresponds to two data nodes that are directly connected, and there is no edge between two data nodes that are not directly connected;

a multi-layer system graph is formed for the data governance system based on security level; each layer of the system graph corresponds to one security level; the system graph for a given security level contains only the data nodes and edges that correspond to that level, and data nodes that do not correspond to that level, together with their edges, are invisible at the current security level; in this way hierarchical data governance is realized as a whole, so that data movement and sharing are secured; the security level may also be a trust level;

alternatively, when the communication overhead between two data nodes is smaller than a preset value, a corresponding edge is arranged between the two data nodes, otherwise, the corresponding edge is not arranged;
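By way of illustration only, one layer of the system graph described above might be modelled as follows. The sketch assumes that each data node carries a single integer security level, that a layer contains exactly the nodes of that level, and that edges are created by the communication-overhead rule; the names `DataNode`, `build_layer` and `max_overhead` are hypothetical and do not appear in the original disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataNode:
    node_id: str
    security_level: int  # assumption: one integer level per node


def build_layer(nodes, overhead, level, max_overhead):
    """Return the adjacency map of the system graph layer for one security level.

    nodes: iterable of DataNode
    overhead: dict mapping frozenset({id_a, id_b}) -> communication overhead
    """
    visible = {n.node_id for n in nodes if n.security_level == level}
    graph = {node_id: set() for node_id in visible}
    for pair, cost in overhead.items():
        a, b = tuple(pair)
        # An edge exists only between two visible nodes whose overhead is below the preset value.
        if a in visible and b in visible and cost < max_overhead:
            graph[a].add(b)
            graph[b].add(a)
    return graph


nodes = [DataNode("n1", 1), DataNode("n2", 1), DataNode("n3", 2), DataNode("n4", 1)]
overhead = {
    frozenset({"n1", "n2"}): 3.0,
    frozenset({"n1", "n3"}): 1.0,  # n3 belongs to another level, so this edge is invisible
    frozenset({"n2", "n4"}): 9.0,  # overhead too high, so no edge is created
}
layer1 = build_layer(nodes, overhead, level=1, max_overhead=5.0)
```

In this example the level-1 layer contains n1, n2 and n4, with a single edge between n1 and n2.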

the data nodes are also used for data transfer and sharing according to the data governance instruction;

the system also comprises one or more clients, wherein the clients are used for initiating data access requests and receiving request results of the service system access requests from the service system;

the invention relates to a data management method based on big data, which comprises the following steps:

step S1: when a starting condition is met, the starting data node that meets the starting condition initiates data governance and determines a data governance range;

the starting condition is specifically: the starting condition is met when a start trigger instruction is received; for example, a start instruction is issued manually;

alternatively: a timer is set for a specific data node, and when the timed point arrives, the specific data node becomes the starting data node;

the starting data node that meets the starting condition initiates data governance and obtains the data governance range, specifically: the starting data node randomly selects P directly connected data nodes, sends a start instruction to the P data nodes, and places the P data nodes in the current data governance range; each of the P data nodes then repeats the steps of randomly selecting P directly connected data nodes, sending a start instruction and placing them in the current data governance range, until a termination condition is met;

preferably: before sending a start instruction, the current data node compares the historical average busy degree of the previous-level data node carried in the received start instruction with its own busy degree, and includes the smaller busy degree and the corresponding data node identifier in the start instruction it sends;

the termination condition is: the number of data nodes in the current data governance range reaches a preset number, for example 1000 nodes;

alternatively: the termination condition is that the number of start instructions counted from the starting data node reaches a preset number;

preferably: before sending a start instruction, the current data node requests that the P data nodes it has selected be added to the data governance range; if the size of the data governance range would not be exceeded after the P data nodes are added, the P data nodes are allowed to join; if adding the P data nodes would exceed the size of the data governance range, the termination condition is met and all subsequent requests to join the data governance range are refused; that is, the data governance range is the set of data nodes participating in the current round of data governance, and the size of the data governance range is the size of that set;

preferably: P is equal to 1;

preferably: when a data node has fewer than P directly connected data nodes, all of its directly connected data nodes are selected and requested to join the data governance range;
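The range determination of step S1 can be sketched as a single-process simulation. This is an illustrative sketch under stated assumptions, not the patented implementation: the distributed exchange of start instructions is replaced by a queue, the graph is an adjacency map of sets, and the function name `determine_range` and its parameters are hypothetical. The sketch also stops when no node is left to expand, a case the description does not address.

```python
import random
from collections import deque


def determine_range(graph, start, p, preset_size, rng=None):
    """Grow a governance range from `start` by random neighbor selection."""
    rng = rng or random.Random()
    governed = {start}
    frontier = deque([start])
    while frontier:
        current = frontier.popleft()
        neighbors = sorted(graph[current])
        # With P or fewer direct neighbors, all of them are selected.
        chosen = neighbors if len(neighbors) <= p else rng.sample(neighbors, p)
        new_nodes = [n for n in chosen if n not in governed]
        # Join request: refused if admitting the batch would exceed the preset size.
        if len(governed) + len(new_nodes) > preset_size:
            break  # termination condition met; later requests are refused
        for n in new_nodes:
            governed.add(n)
            frontier.append(n)  # each admitted node repeats the selection step
    return governed


# A path 0-1-2-3-4-5; with p=2 every node selects all of its neighbors.
path = {i: {j for j in (i - 1, i + 1) if 0 <= j < 6} for i in range(6)}
governed = determine_range(path, start=0, p=2, preset_size=4)
```

On the path graph the range grows one node at a time and stops at nodes 0 to 3, because admitting node 4 would exceed the preset size of 4.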

step S2: a control node is determined while the data governance range is being determined; specifically: while the data governance range is being determined, the data node with the lowest busy degree is continuously updated, and the data node with the lowest busy degree in the data governance range is finally selected as the control node; when a data node joins the data governance range, its busy degree is compared with the lowest busy degree recorded for the current data governance range to determine the latest minimum busy degree and the corresponding data node identifier; the Q data nodes with the lowest busy degree and their identifiers are recorded, and one of these Q data nodes is finally selected as the control node; because the control node obtains the control right in a relatively random way, the possibility of tampering is further reduced;
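Recording the Q least busy data nodes while the range grows can be sketched with a bounded heap. The class `LowestBusyTracker`, the function `pick_control_node` and the uniform random choice among the Q candidates are assumptions made for illustration; the description only states that one of the Q nodes is finally selected.

```python
import heapq
import random


class LowestBusyTracker:
    """Keep the Q least busy nodes seen while the governance range grows."""

    def __init__(self, q):
        self.q = q
        self._heap = []  # max-heap via negated busy degree: busiest kept node on top

    def offer(self, node_id, busy):
        if len(self._heap) < self.q:
            heapq.heappush(self._heap, (-busy, node_id))
        elif busy < -self._heap[0][0]:
            # Strictly less busy than the busiest kept candidate: replace it.
            heapq.heapreplace(self._heap, (-busy, node_id))

    def candidates(self):
        """Return (node_id, busy) pairs, least busy first."""
        return sorted(((n, -b) for b, n in self._heap), key=lambda item: item[1])


def pick_control_node(tracker, rng=None):
    """Assumption: the control node is drawn uniformly from the Q candidates."""
    rng = rng or random.Random()
    return rng.choice(tracker.candidates())[0]


tracker = LowestBusyTracker(q=2)
for node_id, busy in [("a", 0.7), ("b", 0.2), ("c", 0.5), ("d", 0.1)]:
    tracker.offer(node_id, busy)  # called each time a node joins the range
control = pick_control_node(tracker)
```

Here the tracker ends up holding nodes d and b, and the control node is one of those two.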

preferably: the starting data node determines the control node while the data governance range is being determined, and also stores the related data;

preferably: the busy degree is a historical average busy degree; the start instruction carries the minimum historical average busy degree along the sending path and the identifier of the data node having that minimum; after a data node sends out a start instruction, it stores the minimum historical average busy degree of the path;
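How a start instruction might carry the path minimum from one level to the next can be sketched as follows. The types `StartInstruction` and `Node`, the field names and the method `forward` are hypothetical; only the behavior (keep the smaller busy degree with its identifier, then store the path minimum locally) is taken from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StartInstruction:
    min_busy: float    # lowest historical average busy degree seen on the path
    min_node_id: str   # identifier of the node that holds that minimum


class Node:
    def __init__(self, node_id, avg_busy):
        self.node_id = node_id
        self.avg_busy = avg_busy
        self.saved_path_min = None  # filled in once an instruction is sent on

    def forward(self, received=None):
        """Build the instruction for the next level and remember its minimum."""
        if received is None or self.avg_busy < received.min_busy:
            outgoing = StartInstruction(self.avg_busy, self.node_id)
        else:
            outgoing = received
        self.saved_path_min = outgoing.min_busy
        return outgoing


start, middle, last = Node("s", 0.6), Node("m", 0.3), Node("l", 0.5)
instruction = last.forward(middle.forward(start.forward()))
```

After the three hops the instruction names node m, whose busy degree of 0.3 is the lowest on the path, and each node has stored the path minimum it sent.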

preferably: the busy degree can be calculated from indicators such as CPU utilization, memory utilization and memory access frequency;
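One possible way to combine these indicators is a weighted sum, averaged over historical samples. The description does not give a formula, so the weights, the normalization of every indicator to the range 0 to 1, and the function names below are all assumptions.

```python
from statistics import fmean


def busy_degree(cpu_util, mem_util, mem_access_rate, weights=(0.5, 0.3, 0.2)):
    """Weighted combination of three indicators, each normalized to [0, 1].

    The weights are illustrative only and should sum to 1.
    """
    w_cpu, w_mem, w_access = weights
    return w_cpu * cpu_util + w_mem * mem_util + w_access * mem_access_rate


def historical_average(samples):
    """Historical average busy degree over (cpu, mem, access) samples."""
    return fmean([busy_degree(*sample) for sample in samples])


history = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
average = historical_average(history)
```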

in the prior art, data governance is usually performed centrally by a fixed node; this approach carries a considerable security risk and also places heavy pressure on the fixed node; in the present invention, neither the starting node nor the control node is fixed, decentralized governance can be initiated by a data node only when the starting condition is met, and the governance range is delineated randomly, which improves system security while still achieving basic data governance;

preferably: the data node ranked (Q+1)-th in ascending order of busy degree is taken as the specific data node and a timer is set for it; when the timed point arrives, the specific data node becomes the starting data node;

alternatively: the data nodes ranked (Q+1)-th to (Q+Q)-th in ascending order of busy degree are taken as candidate specific data nodes and timers are set for them; when the timed point arrives, one of these data nodes is selected as the specific data node and becomes the starting data node; the selection rule is to choose the data node with the lowest access heat at the timed point; alternatively, the selection is made by distributed recommendation among the Q candidate nodes, without the participation of the previous control node or starting node; in this way, the starting node that initiates data governance is set in a relatively random and conditionally permitted manner;
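The candidate-based selection of the next starting data node can be sketched as below, covering only the lowest-access-heat rule. The function name `next_start_node` and the dictionary inputs are hypothetical, and the timer itself is left out: the function is what would run when the timed point arrives.

```python
def next_start_node(busy, access_heat, q):
    """Pick the next starting node among the nodes ranked Q+1 .. 2Q by busy degree.

    busy: node_id -> busy degree; access_heat: node_id -> heat at the timed point.
    Returns None when fewer than Q+1 nodes are known.
    """
    ranked = sorted(busy, key=busy.get)   # ascending busy degree
    candidates = ranked[q:2 * q]          # ranks Q+1 to Q+Q (1-based)
    if not candidates:
        return None
    return min(candidates, key=access_heat.get)


busy = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5}
access_heat = {"a": 7, "b": 3, "c": 9, "d": 4, "e": 1}
chosen = next_start_node(busy, access_heat, q=2)
```

With Q = 2 the candidates are c and d, and d is chosen because its access heat is lower.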

step S3: the control node determines data governance subgraphs based on the system graph; specifically: the control node obtains the control right and the data governance range from the starting data node; obtains the system graph corresponding to the current security level; connects the data nodes in the data governance range according to the connections in the system graph to form a governance graph; and constructs data governance subgraphs in the governance graph centered on the Q data nodes with the lowest busy degree; unlike the prior-art approach of global control or governance started by a single node, distributed control is derived from the global system graph and the whole system is gradually balanced by repeatedly eliminating the lowest valleys of busy degree; in the process, busy degrees are compared to determine whether the control node has been tampered with, which ensures security;
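Forming the governance graph amounts to taking the subgraph of the system graph induced by the data governance range. A minimal sketch, assuming the system graph is an adjacency map of sets and using the hypothetical function name `governance_graph`:

```python
def governance_graph(system_graph, governed):
    """Keep only the governed nodes and the edges that run between them."""
    governed = set(governed)
    return {node: system_graph[node] & governed for node in governed}


system = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
g = governance_graph(system, {"a", "c", "d"})
```

Node b lies outside the range, so both of its edges disappear from the governance graph.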

constructing Q data governance subgraphs in the governance graph centered on the Q data nodes with the lowest busy degree specifically comprises:

step SA1: all data nodes in the governance graph are initialized as unmarked;

step SA2: one of the Q data nodes is taken as the current central data node, and a current data governance subgraph centered on it is initialized;

preferably: the lowest busy degree of the path stored by the current data node is compared with the busy degree of the current control node; if the busy degree of the current control node is not the lowest value, the current control node is determined to have been tampered with and security feedback is performed; the security feedback consists of stopping the data governance process and reporting for manual handling;

step SA3: based on the governance graph, the unmarked data node with the highest busy degree that is directly connected to the central data node is taken as the current data node;

step SA4: it is judged whether the busy degree of the current data governance subgraph would fall within a threshold range after the current data node is added; if it is within the threshold range, go to step SA5; if it is below the threshold range, the current data node is allowed to join the current data governance subgraph and the method returns to step SA3 to continue acquiring and adding data nodes; if it is above the threshold range, the current data node is refused and the method goes to step SA5; the threshold range is a reasonable range for the busy degree of a data node and can be set according to the state of the data nodes in the governance graph;

the busy degree TDGR of the current data governance subgraph is calculated by formula (1);

[Formula (1) is missing from this text.]

wherein: DBi is the busy base value and DGRi is the busy degree of the i-th data node in the data governance subgraph; for example, the busy base value is the size of the storage space of the data node, and the busy degree is the idle rate or the occupancy rate of that storage space; the busy base value is the base quantity of the target object that the busy degree indicator measures;

preferably: when the busy degree is expressed by several types of indicators, a weighted sum of the indicators is taken so that balance among them is satisfied simultaneously;

step SA5: the current data governance subgraph is finished; if all Q data nodes have been processed, go to step SA6; otherwise, return to step SA2;

step SA6: all data governance subgraphs are complete;

the number of data governance subgraphs and the scope of data exchange can be adjusted adaptively by changing the value of Q and the threshold range; when a data node is to be added to the current data governance subgraph and is already in it, the node is not added again and the method continues with the next data node;
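Steps SA1 to SA6 can be sketched as follows. Several points are assumptions rather than statements of the original: formula (1) is not available in this text, so TDGR is taken here to be the busy degree averaged with the busy base value as weight, TDGR = Σ(DBi × DGRi) / Σ DBi; a node whose addition brings TDGR inside the threshold range is treated as admitted before the subgraph is finished; a node is marked when it joins a subgraph; and central nodes are never absorbed into another subgraph. The function names are hypothetical.

```python
def subgraph_busyness(members, base, degree):
    """Assumed stand-in for formula (1): busy degree weighted by busy base value."""
    total_base = sum(base[n] for n in members)
    return sum(base[n] * degree[n] for n in members) / total_base


def build_subgraphs(graph, centers, base, degree, low, high):
    """Build one subgraph per center; `low`..`high` is the threshold range."""
    marked = set()                                    # SA1: all nodes unmarked
    subgraphs = {}
    for center in centers:                            # SA2: next central node
        marked.add(center)
        members = [center]
        while True:
            # SA3: busiest unmarked node directly connected to the center
            candidates = [n for n in graph[center]
                          if n not in marked and n not in centers]
            if not candidates:
                break
            current = max(candidates, key=lambda n: degree[n])
            # SA4: busy degree of the subgraph if `current` joined
            tdgr = subgraph_busyness(members + [current], base, degree)
            if tdgr > high:
                break                                 # refused, go to SA5
            members.append(current)
            marked.add(current)
            if tdgr >= low:
                break                                 # within range, go to SA5
        subgraphs[center] = members                   # SA5: subgraph finished
    return subgraphs                                  # SA6: all subgraphs done


graph = {"c1": {"a", "b", "c"}, "c2": {"c", "d"},
         "a": {"c1"}, "b": {"c1"}, "c": {"c1", "c2"}, "d": {"c2"}}
base = {n: 100 for n in graph}                        # e.g. storage size
degree = {"c1": 10, "c2": 20, "a": 90, "b": 80, "c": 70, "d": 60}  # percent
result = build_subgraphs(graph, ["c1", "c2"], base, degree, low=55, high=65)
```

In this example the subgraph around c1 absorbs a and then b, at which point its assumed TDGR of 60 lies inside the range 55 to 65; the subgraph around c2 absorbs c and d and then runs out of neighbors. Raising or lowering the threshold range changes how many nodes each subgraph takes in.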

step S4: data exchange and sharing are performed based on the data governance subgraphs; specifically: the control node passes the control right to each data governance subgraph; data is exchanged within each data governance subgraph and shared inside it;

the control node passing the control right to a data governance subgraph specifically comprises: the control node sends the control right to the central data node of the data governance subgraph;

the data exchange within a data governance subgraph specifically comprises:

step SB1: a data exchange proportion is allocated among the data nodes according to the busy degree of each data node; specifically, the share of each data node's product of busy base value DBi and busy degree DGRi is taken as its data exchange proportion; the data nodes here include the central data node;

step SB2: storage areas on the data nodes are exchanged to the central data node according to the allocated data exchange proportion, so that the ratio between the products of busy base value DBi and busy degree DGRi of the data nodes finally comes as close as possible to 1:1;
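The allocation in steps SB1 and SB2 can be illustrated numerically. The sketch assumes that the busy degree measures occupancy (not idle rate), that the "load" of a node is the product DBi × DGRi, and that a 1:1 ratio means every node ends at the same load; the function name `exchange_plan` and the notion of a per-node surplus are introduced here for illustration only.

```python
def exchange_plan(base, degree):
    """Return each node's exchange share and its load surplus over an even split."""
    load = {node: base[node] * degree[node] for node in base}
    total = sum(load.values())
    share = {node: value / total for node, value in load.items()}    # SB1
    even = total / len(load)
    surplus = {node: value - even for node, value in load.items()}   # gap to 1:1
    return share, surplus


base = {"center": 100, "a": 100, "b": 200}     # busy base value, e.g. storage size
degree = {"center": 10, "a": 50, "b": 30}      # busy degree in percent
share, surplus = exchange_plan(base, degree)
```

With these figures node b holds half of the total load; nodes a and b are above the even split by 1000 and 2000 units respectively, and the central node is 3000 units below it, so moving those amounts to the central node would equalize the products.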

preferably: a data node comprises a plurality of storage areas; the storage area with the lowest access heat on the data node is selected for exchange, and part of the data in it is moved directly off the current data node; and/or the storage area with the highest access heat on the data node is selected, backed up to the central data node, and part of its accesses are diverted by data redirection to reduce its heat;

forming data sharing inside the data governance subgraph specifically comprises: after storage areas have been exchanged, redirection is set up in the data node, and when an access to an exchanged storage area is received, the access is redirected to the central data node in full or in part; full redirection is set for an exchanged storage area with the lowest access heat, and partial redirection is set for an exchanged storage area with the highest access heat;

the partial redirection specifically comprises: the average busy degree of all storage areas of the data node is calculated and the busy degree of each storage area is monitored; when the busy degree of a storage area is higher than the average busy degree, R consecutive accesses to that storage area are redirected to the central data node, R being such that after the R accesses have been redirected the busy degree of the storage area is lower than (average busy degree × (N-1)/N), wherein N is the number of storage areas in the data node; unlike the prior art's pure data redirection, the method combines data exchange, full redirection and partial redirection, reaches a local balance inside the subgraph by setting the exchange proportion quantitatively, and reduces the difficulty of data governance; by redirecting R consecutive accesses, the heat of the storage area is reduced while the continuity of data access is preserved, so that the responsiveness of the data center does not drop too low; thus, even with multiple storage banks, a continuous cross-bank access is not interrupted;
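The choice of R in partial redirection can be sketched as follows. The description does not say how one redirected access changes the busy degree, so the sketch assumes that each redirected access lowers the busy degree of the storage area by a fixed amount (`per_access_relief`, hypothetical) and that the average is taken before redirection starts.

```python
import math


def redirect_count(area_busy, area, per_access_relief=1.0):
    """Smallest R such that busy - R * relief < average * (N - 1) / N.

    area_busy: storage area id -> busy degree on one data node.
    Returns 0 when the area is not busier than the average.
    """
    n = len(area_busy)
    average = sum(area_busy.values()) / n
    busy = area_busy[area]
    if busy <= average:
        return 0
    target = average * (n - 1) / n
    # R must be strictly greater than (busy - target) / relief.
    return math.floor((busy - target) / per_access_relief) + 1


areas = {"x": 10, "y": 2, "z": 3}
r = redirect_count(areas, "x")
```

For area x the average is 5 and the target is 5 × 2/3, about 3.33; redirecting 7 consecutive accesses brings the busy degree to 3, which is below the target, while 6 would leave it at 4.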

the beneficial effects of the invention include: (1) a multi-layer system graph corresponding to the security levels is set up, and hierarchical data governance is realized as a whole, so that data movement and sharing are secured; (2) decentralized governance can be initiated by a data node only when the starting condition is met, and the scope of action is delineated randomly, which improves system security while still achieving basic data governance; (3) distributed control is derived from the global system graph, and the whole system is gradually balanced by quantitatively and repeatedly eliminating the lowest valleys of busy degree; (4) by combining data exchange with full and partial redirection and setting the exchange proportion quantitatively, a local balance is reached inside each subgraph, which reduces the difficulty of data governance;

as will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It should be understood that the above examples are given only for clarity of illustration and are not intended to limit the embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to enumerate all embodiments here, and obvious variations or modifications derived therefrom remain within the scope of the invention.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, as for the apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.

Those skilled in the art will appreciate that all or part of the steps in the above method embodiments may be implemented by a program to instruct relevant hardware to perform the steps, and the program may be stored in a computer-readable storage medium, which is referred to herein as a storage medium, such as: ROM/RAM, magnetic disk, optical disk, etc.

The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims

1. A data governance method based on big data is characterized by comprising the following steps:

step S4: data exchange and sharing are carried out based on the data governance subgraph;

the starting condition is specifically: the starting condition is met when a start trigger instruction is received;

a timer is set for a specific data node, and when the timed point arrives, the specific data node becomes the starting data node;

the starting data node that meets the starting condition initiates data governance and obtains the data governance range, specifically: the starting data node randomly selects P directly connected data nodes, sends a start instruction to the P data nodes, and places the P data nodes in the current data governance range; each of the P data nodes then repeats the steps of randomly selecting P directly connected data nodes, sending a start instruction and placing them in the current data governance range, until a termination condition is met;

before sending a start instruction, the current data node requests that the P data nodes it has selected be added to the data governance range; if the size of the data governance range would not be exceeded after the P data nodes are added, the P data nodes are allowed to join; if adding the P data nodes would exceed the size of the data governance range, the termination condition is met and all subsequent requests to join the data governance range are refused; the data governance range is the set of data nodes participating in the current round of data governance; the size of the data governance range is the size of that set.

2. The big-data-based data governance method according to claim 1, wherein the preset number is 1000.

3. A data governance system based on big data that employs the big-data-based data governance method of claim 1, the system comprising: a plurality of interconnected data nodes for storing data of a service system; any two data nodes either have or do not have a direct connection; the data nodes and their connections are represented by a system graph; each data node is a node in the system graph.

4. The big-data-based data governance system according to claim 3, wherein the data of the service system is big data.

5. The big-data-based data governance system according to claim 4, wherein the client is configured to initiate a data access request and to receive the result of that request from the service system.

6. The big-data-based data governance system according to claim 5, wherein the data of the service system is stored on one or more data nodes.

7. The big-data-based data governance system according to claim 6, wherein the data nodes are further configured to transfer and share data according to data governance instructions.