CN113918638A - Data processing link determining method, system, device and storage medium - Google Patents

Data processing link determining method, system, device and storage medium

Info

Publication number
CN113918638A
Authority
CN
China
Prior art keywords
link
storage node
graph
target
edge
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111234636.9A
Other languages
Chinese (zh)
Inventor
邓家胜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202111234636.9A priority Critical patent/CN113918638A/en
Publication of CN113918638A publication Critical patent/CN113918638A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/1805 Append-only file systems, e.g. using logs or journals to store data
    • G06F16/1815 Journaling file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9024 Graphs; Linked lists

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method, a system, a device and a storage medium for determining a data processing link, wherein the method comprises the following steps: taking each storage node in a target data system as a vertex of a graph; determining the edge between every two vertices in the graph according to the data processing relation between the two corresponding storage nodes; determining the weight of each edge according to the program log between the two storage nodes it connects; and determining an optimal link from a source storage node to a target storage node in the target data system according to the source storage node, the target storage node and the graph. The embodiment of the invention can screen the most reasonable and efficient data processing link from all the feasible links more intuitively and conveniently.

Description

Data processing link determining method, system, device and storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method, a system, a device, and a storage medium for determining a data processing link.
Background
The data processing link is the carrier of data flow in an application system: the flow of all kinds of data in a data system is realized through data processing links, and a data processing link can be understood as a data processing sequence.
To implement data cleaning, integration and loading in an existing data system, the extract, transform and load (ETL) programs that must be developed between the source-end system and the application models number more than 10 times the number of models. All of these ETL programs are combined into processing links, which supply the data system with its basic lifeblood. Different data processing links realize data circulation with different effects. However, because planning is lacking, or because the rationality of a data processing link is judged with excessive reliance on manual intervention, a wide variety of complicated ETL programs are combined into data processing links in a disorderly manner. The combined data processing links are prone to circular references, which cause system blockage, data confusion, and even a crash of the data system.
Disclosure of Invention
The invention provides a method, a system, a device and a storage medium for determining a data processing link, and mainly aims to determine the most appropriate data processing link in a target data system.
In a first aspect, an embodiment of the present invention provides a method for determining a data processing link, including:
taking each storage node in the target data system as a vertex of a graph, wherein the graph is determined by the vertex, an edge and a weight corresponding to the edge;
for every two vertices in the graph and the two storage nodes in the target data system to which they correspond, determining whether an edge exists between the two vertices according to the data processing relation between the two storage nodes;
if an edge exists between the two vertices in the graph, determining the weight corresponding to that edge according to the program log between the two storage nodes;
determining an optimal link from a source storage node to a target storage node in the target data system according to the source storage node in the target data system, the target storage node in the target data system and the graph.
Preferably, the determining an edge between each two vertices in the graph according to the data processing relationship between each two storage nodes includes:
if one of the two storage nodes can transmit data to the other storage node through a program, determining that a connecting edge exists between the two corresponding vertices in the graph;
and if one storage node cannot transmit data to the other storage node through a program, determining that no connecting edge exists between the two corresponding vertices in the graph.
Preferably, the determining the weight corresponding to the edge according to the program log between each two storage nodes includes:
acquiring the weight corresponding to the edge by applying a preset weight calculation formula to the read volume of the distributed file system between the two storage nodes, the MR data recorded in the program log and the CPU consumption recorded in the program log;
wherein, the preset weight calculation formula is specifically as follows:
Q = HDFS × Q1 + MR × Q2 + CPU consumption × Q3;
wherein Q represents a weight, HDFS is a read volume of the distributed file system, MR represents stored data described in the program log, CPU consumption represents processor consumption described in the program log, Q1 represents a first preset coefficient, Q2 represents a second preset coefficient, and Q3 represents a third preset coefficient.
Preferably, the determining an optimal link from a source storage node to a target storage node in the target data system according to the source storage node in the target data system, the target storage node in the target data system, and the graph includes:
acquiring all feasible links from a source vertex corresponding to the source storage node to a target vertex corresponding to the target storage node in the graph according to the source storage node, the target storage node and the graph, wherein the feasible links are composed of a plurality of edges;
determining the consumption cost corresponding to each feasible link according to the weight corresponding to all edges in each feasible link;
and acquiring the optimal link according to the consumption cost corresponding to each feasible link.
Preferably, the obtaining the optimal link according to the consumption cost corresponding to each feasible link includes:
and acquiring the optimal link according to the consumption cost corresponding to each feasible link, the packet loss rate of each feasible link and the transmission load of each feasible link.
Preferably, the obtaining the optimal link according to the consumption cost corresponding to each feasible link includes:
and acquiring the optimal link according to the consumption cost corresponding to each feasible link and a shortest path method.
Preferably, the determining the consumption cost corresponding to each feasible link according to the weights corresponding to all the edges in each feasible link includes:
and taking the sum of the weights corresponding to all edges in each feasible link as the consumption cost corresponding to each feasible link.
In a second aspect, an embodiment of the present invention provides a data processing link determining system, including:
the vertex module is used for taking each storage node in the target data system as a vertex of a graph, and the graph is determined by the vertex, the edge and the weight corresponding to the edge;
the edge module is used for determining whether an edge between every two vertexes in the graph exists according to a data processing relation between every two storage nodes for every two vertexes in the graph and every two storage nodes in the target system, wherein every two storage nodes correspond to every two vertexes;
the weight module is used for determining the weight corresponding to the edge between every two vertexes according to the program log between every two storage nodes if the edge between every two vertexes exists in the graph;
and the screening module is used for determining the optimal link from the source storage node to the target storage node according to the source storage node in the target data system, the target storage node in the target data system and the graph.
In a third aspect, an embodiment of the present invention provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the data processing link determination method when executing the computer program.
In a fourth aspect, an embodiment of the present invention provides a computer storage medium, where a computer program is stored, and the computer program, when executed by a processor, implements the steps of the data processing link determining method.
The embodiment of the invention provides a method, a system, a device and a storage medium for determining a data processing link. The structural relationship and the data transmission relationship between the storage nodes in a target data system are converted into the vertices and edges of a graph, and the consumption cost of data processing between the storage nodes is converted into the weights of the edges, so that the consumption cost of every feasible link between a source storage node and a target storage node can be calculated and the most appropriate optimal link selected according to that cost. Because the consumption cost between the source storage node and the target storage node is quantified, the most reasonable and efficient data processing link can be screened out from all feasible links more intuitively and conveniently. It is therefore possible to judge whether a data processing link selected in the target data system is reasonable, to correct unreasonable data processing links, and to improve data processing efficiency.
Drawings
Fig. 1 is an application scenario diagram of a data processing link determining method provided in an embodiment of the present invention;
Fig. 2 is a flowchart of a data processing link determining method according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of a specific application of a data processing link determining method according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a data processing link determining system according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Fig. 1 is an application scenario diagram of a data processing link determining method provided in an embodiment of the present invention. As shown in Fig. 1, a source storage node and a target storage node of the target data system are first input at a client and sent to a server. The server receives the source storage node and the target storage node and executes the data processing link determining method on them, so as to determine an optimal link.
It should be noted that the server may be implemented by an independent server or a server cluster composed of a plurality of servers. The client may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, and the like. The client and the server may be connected through bluetooth, USB (Universal Serial Bus), or other communication connection manners, which is not limited in this embodiment of the present invention.
Fig. 2 is a flowchart of a data processing link determining method according to an embodiment of the present invention. As shown in Fig. 2, the method includes:
S210, taking each storage node in the target data system as a vertex of a graph, wherein the graph is determined by the vertex, edges and weights corresponding to the edges;
In the embodiment of the invention, the target data system is the system whose data processing sequence needs to be determined. It may be a banking system, an e-commerce system, an OA system and the like, and can be chosen according to actual needs.
A target data system generally contains many storage nodes. Anything used for storing data may be called a storage node; tables and files, for example, are storage nodes. As a further example, if a try/catch is written into a storage process so that execution rolls back and exits when an exception occurs, a storage node is needed, and the position to which execution rolls back is the position marked by the storage node.
The technical idea of the embodiment of the invention borrows the approach used to solve the shortest path problem in a graph. The storage nodes in the target data system, the data transmission relations between the storage nodes and the data transmission consumption costs between the storage nodes are converted into the vertices, edges and edge weights of a communication link graph, so that finding the optimal link between two storage nodes in the target data system becomes a shortest path problem on the graph, from which the optimal data processing link in the target data system is determined.
It should be noted that the graph in the embodiment of the present invention refers to a communication link graph. The communication link graph includes vertices and edges: a vertex includes a data element and several branches pointing to other subtrees, and an edge is a physical line from one vertex to another vertex with no other exchange vertex in between. In the communication link graph, an edge can only connect two vertices.
To perform this conversion, in the embodiment of the present invention all the storage nodes in the target data system are used as vertices of the graph. Each storage node corresponds to one vertex, and different storage nodes correspond to different vertices. This completes the first step of constructing the graph.
It should be further noted that, in the embodiment of the present invention, the graph may be an undirected graph or a directed graph. If data transmission processing can be performed between two storage nodes in both the forward and the reverse direction, the graph is an undirected graph: data can be transmitted from one vertex to the other and back, so transmission between the two vertices is bidirectional. If data transmission processing can be performed between two storage nodes only along a given direction, the graph is a directed graph: data can be transmitted from one vertex to the other but not back, so transmission between the two vertices is unidirectional.
S220, determining whether an edge between every two vertexes in the graph exists according to a data processing relation between every two storage nodes for every two vertexes in the graph and every two storage nodes in the target system, wherein every two storage nodes correspond to every two vertexes;
After the vertices of the graph are determined, the edges need to be determined. An edge in the graph represents the data transmission relationship between the storage nodes corresponding to two vertices. Specifically, in this embodiment of the present invention, if data can be transmitted between the storage nodes corresponding to two vertices, an edge exists between the two vertices; if data cannot be transmitted between them, no edge exists between the two vertices.
Specifically, for every two storage nodes, whether data can be processed from one storage node to the other is determined in combination with the actual service scenario. For example, in a banking system, data cannot be transmitted directly between a bottom layer data storage node and a front end data storage node, whereas data can be transmitted between the bottom layer data storage node and a middleware storage node, and between the front end data storage node and the middleware storage node. If data can only be processed from one of two storage nodes to the other, and not vice versa, the edge in the graph is directional and can only point from the vertex corresponding to the one storage node to the vertex corresponding to the other.
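As an illustration of steps S210 and S220, the following is a minimal Python sketch that turns a list of storage nodes and a list of "a program can move data from A to B" relations into an adjacency map. The function name `build_graph`, the node names and the `directed` flag are assumptions made for this example and do not appear in the original text.

```python
def build_graph(storage_nodes, transfers, directed=True):
    """Return an adjacency map {vertex: set of reachable neighbours}.

    storage_nodes: iterable of node names; each one becomes one vertex (S210).
    transfers: iterable of (src, dst) pairs meaning some program can move
               data from src to dst; each pair becomes one edge (S220).
    directed: if False, every edge is also added in the reverse direction.
    """
    graph = {node: set() for node in storage_nodes}
    for src, dst in transfers:
        graph[src].add(dst)
        if not directed:
            graph[dst].add(src)
    return graph


# Hypothetical banking-style example: the bottom layer and the front end
# only exchange data through the middleware node.
nodes = ["bottom", "middleware", "front"]
transfers = [("bottom", "middleware"), ("middleware", "front")]

directed_graph = build_graph(nodes, transfers)
undirected_graph = build_graph(nodes, transfers, directed=False)
```

In the directed variant there is no edge from "bottom" to "front", which mirrors the statement that those two nodes cannot exchange data directly.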
S230, if an edge between every two vertexes exists in the graph, determining the corresponding weight of the edge between every two vertexes according to the program log between every two storage nodes;
In the embodiment of the invention, when the target data system runs a program, the program leaves a trace of its execution. The record file or set of files used to record system operation events is the log, which can be divided into event logs and message logs. Logs play an important role in processing historical data, tracing and diagnosing problems, understanding system activity and the like. The program log here is an ETL program log.
The weight is used for quantifying the consumption cost of data processing between different storage nodes, and the data transmission between different vertexes is the data transmission between different storage nodes, and the consumption is different, so the consumption cost of data transmission processing between two storage nodes is represented by the weight.
In the embodiment of the invention, the shortest path calculation problem in the graph is referred to, the consumption cost between two storage nodes is converted into the weight of the edge, so that a complete weighted graph is constructed, and the solution idea of the shortest path in the graph is used as the calculation idea of the optimal data processing link.
In the embodiment of the invention, data transmission is performed by ETL programs. The purpose of an ETL program is to process the data in one storage node into another storage node. An ETL program generates a log during its daily execution, and the log is a piece of text. The storage node that follows "from" is extracted with a regular expression and used as the source vertex, and the storage node that follows "insert" is extracted with a regular expression and used as the target vertex. The log also records data such as the number of MRs and the CPU time consumed.
The weight of each edge is calculated from the data recorded in the ETL program log according to a preset rule. The vertices, the edges and the weights corresponding to the edges are then all determined, and the graph is established.
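The log parsing described above can be sketched as follows. The source does not give the log format, so the sample text, the `mr_jobs=` and `cpu_seconds=` fields, and all names in this Python sketch are assumptions chosen only to make the example runnable.

```python
import re

# Hypothetical log text; a real ETL log will look different.
SAMPLE_LOG = """\
insert overwrite table dw.orders_clean
select order_id, amount from ods.orders_raw
mr_jobs=3
cpu_seconds=12.5
"""

# Storage node after "insert" (optionally "overwrite"/"into" and "table").
INSERT_RE = re.compile(
    r"\binsert\s+(?:overwrite\s+|into\s+)?(?:table\s+)?([\w.]+)", re.IGNORECASE
)
# Storage node after "from".
FROM_RE = re.compile(r"\bfrom\s+([\w.]+)", re.IGNORECASE)
# Assumed metric fields.
MR_RE = re.compile(r"mr_jobs=(\d+)")
CPU_RE = re.compile(r"cpu_seconds=([\d.]+)")


def parse_etl_log(text):
    """Extract source vertex, target vertex and metrics from one log text."""
    target = INSERT_RE.search(text)
    source = FROM_RE.search(text)
    if not (source and target):
        raise ValueError("log has no recognizable insert target or from source")
    mr = MR_RE.search(text)
    cpu = CPU_RE.search(text)
    return {
        "source": source.group(1),
        "target": target.group(1),
        "mr_jobs": int(mr.group(1)) if mr else 0,
        "cpu_seconds": float(cpu.group(1)) if cpu else 0.0,
    }


edge_info = parse_etl_log(SAMPLE_LOG)
```

This sketch only takes the first "from" in the text; a statement with joins or subqueries reads from several storage nodes and would need every match (for example via `FROM_RE.findall`), each giving its own edge into the target.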
S240, determining the optimal link from the source storage node to the target storage node according to the source storage node in the target data system, the target storage node in the target data system and the graph.
Finally, when the optimal link between two storage nodes needs to be calculated, the source vertex corresponding to the source storage node and the target vertex corresponding to the target storage node are found in the graph. All feasible links between the source vertex and the target vertex are calculated, the consumption cost of each feasible link is calculated from the weights of the edges it contains, and the optimal link is screened out from all the feasible links with the shortest path as the selection principle.
In one embodiment, the link with the lowest consumption cost can be directly used as the best link.
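When only the lowest-cost link is needed, it can be found without listing every feasible link. The text refers to a shortest path method without naming one; the Python sketch below uses Dijkstra's algorithm as one possible choice, which assumes all edge weights are non-negative. The function name `cheapest_link` and the three-node example graph are hypothetical.

```python
import heapq


def cheapest_link(graph, source, target):
    """Dijkstra's algorithm on {vertex: {neighbour: weight}}.

    Returns (cost, path); (inf, []) if the target cannot be reached.
    Weights are assumed to be non-negative.
    """
    best = {source: 0.0}      # lowest known cost to each vertex
    prev = {}                 # predecessor on the lowest-cost path
    heap = [(0.0, source)]
    while heap:
        cost, vertex = heapq.heappop(heap)
        if vertex == target:
            break
        if cost > best.get(vertex, float("inf")):
            continue          # stale heap entry
        for nxt, weight in graph.get(vertex, {}).items():
            new_cost = cost + weight
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                prev[nxt] = vertex
                heapq.heappush(heap, (new_cost, nxt))
    if target not in best:
        return float("inf"), []
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return best[target], path[::-1]


# Hypothetical weighted graph: going A -> B -> C is cheaper than A -> C.
example = {"A": {"B": 1, "C": 4}, "B": {"C": 2}, "C": {}}
cost, path = cheapest_link(example, "A", "C")
```

Here `cost` is 3 and `path` is `["A", "B", "C"]`, because the two-hop route (1 + 2) beats the direct edge of weight 4.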
The embodiment of the invention provides a data processing link determining method. The structural relationship and the data transmission relationship between the storage nodes in a target data system are converted into the vertices and edges of a graph, and the consumption cost of data processing between the storage nodes is converted into the weights of the edges, so that the consumption cost of every feasible link between a source storage node and a target storage node can be calculated and the most appropriate optimal link selected according to that cost. Because the consumption cost between the source storage node and the target storage node is quantified, the most reasonable and efficient data processing link can be screened out from all feasible links more intuitively and conveniently. It is therefore possible to judge whether a data processing link selected in the target data system is reasonable, to correct unreasonable data processing links, and to improve data processing efficiency.
On the basis of the foregoing embodiment, preferably, the determining, according to the data processing relationship between each two storage nodes, an edge between each two vertices in the graph includes:
if one storage node can transmit data to another storage node through a program, determining that a connecting edge exists between the two corresponding vertices in the graph;
and if one storage node cannot transmit data to the other storage node through a program, determining that no connecting edge exists between the two corresponding vertices in the graph.
Specifically, in the embodiment of the present invention, if data transmission can be performed between storage nodes corresponding to two vertices, an edge exists between the two vertices, and if data transmission cannot be performed between storage nodes corresponding to two vertices, an edge does not exist between the two vertices.
On the basis of the foregoing embodiment, preferably, the determining, according to the program log between each two storage nodes, the weight corresponding to the edge includes:
acquiring the weight corresponding to the edge by applying a preset weight calculation formula to the read volume of the distributed file system between the two storage nodes, the MR data recorded in the program log and the CPU consumption recorded in the program log.
Specifically, a distributed file system can be viewed from two aspects. From the perspective of a client using it, it is a standard file system that provides a series of application program interfaces for creating, moving, deleting, reading and writing files or directories. From the perspective of its internal implementation, a distributed file system no longer manages a local disk as an ordinary file system does: its file content and directory structure are not stored on the local disk but are transmitted to a remote system over a network. Moreover, a given file is not stored on a single machine but is distributed across a cluster of machines that cooperate to provide the service, which is what distributed storage means.
The MR data recorded in the ETL program log relates to the cleansing program in the ETL process: regardless of the set type within a table, the separators between the elements of a set must be uniform, and during cleansing the separators of the elements in each set are made uniform.
CPU consumption refers to the fact that the CPU has a fixed processing capacity, and processing a task consumes an amount of that capacity that depends on the size of the task.
In the embodiment of the invention, the weight corresponding to an edge is determined from three aspects: the read volume of the distributed file system between the two storage nodes, the MR data and the CPU consumption. These three indexes were selected after a number of tests as the ones that best reflect the consumption cost between two storage nodes, and are the preferred choice for calculating the consumption cost.
On the basis of the foregoing embodiment, preferably, the preset weight calculation formula is specifically as follows:
Q = HDFS × Q1 + MR × Q2 + CPU consumption × Q3;
wherein Q represents a weight, HDFS is a read volume of the distributed file system, MR represents stored data described in the program log, CPU consumption represents processor consumption described in the program log, Q1 represents a first preset coefficient, Q2 represents a second preset coefficient, and Q3 represents a third preset coefficient.
In the embodiment of the invention, the value of Q1 is 0.3, the value of Q2 is 0.1, and the value of Q3 is 0.6.
Specifically, the calculation formula is obtained through experience after multiple tests, and consumption cost between two storage nodes can be accurately judged through the calculation formula.
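The weighted sum above can be written as a short function. This Python sketch uses the coefficient values stated for this embodiment (0.3, 0.1 and 0.6); the function name, the parameter names and the sample inputs are assumptions, and the source does not state the units of the three measurements.

```python
# Coefficients as stated for this embodiment.
Q1, Q2, Q3 = 0.3, 0.1, 0.6


def edge_weight(hdfs_read, mr, cpu_consumption, q1=Q1, q2=Q2, q3=Q3):
    """Q = HDFS * Q1 + MR * Q2 + CPU consumption * Q3."""
    return hdfs_read * q1 + mr * q2 + cpu_consumption * q3


# Hypothetical measurements for one edge (units are not given in the source).
weight = edge_weight(hdfs_read=10.0, mr=5.0, cpu_consumption=20.0)
# 10.0 * 0.3 + 5.0 * 0.1 + 20.0 * 0.6 = 3.0 + 0.5 + 12.0 = 15.5
```

Because the three inputs are summed directly, they are presumably brought to comparable scales before the formula is applied; the source does not describe such a normalization step, so none is shown here.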
On the basis of the foregoing embodiment, preferably, the determining an optimal link from a source storage node to a target storage node in the target data system according to the source storage node in the target data system, the target storage node in the target data system, and the graph includes:
acquiring all feasible links from a source vertex corresponding to the source storage node to a target vertex corresponding to the target storage node in the graph according to the source storage node, the target storage node and the graph, wherein the feasible links are composed of a plurality of edges;
determining the consumption cost corresponding to each feasible link according to the weight corresponding to all edges in each feasible link;
and acquiring the optimal link according to the consumption cost corresponding to each feasible link.
Specifically, a source vertex corresponding to the source storage node and a target vertex corresponding to the target storage node are found in the obtained graph, and all feasible links from the source vertex to the target vertex are found.
In an embodiment, the consumption cost corresponding to each feasible link is determined according to the weights corresponding to all the edges in each feasible link, which may be directly calculating the sum of the weights corresponding to all the edges in each feasible link, and taking the sum as the consumption cost corresponding to each feasible link.
In another embodiment, the consumption cost corresponding to each feasible link is determined from the weights of all the edges in the link as follows. First, the communication accuracy corresponding to each edge in the feasible link is calculated; this is the communication accuracy between the storage nodes corresponding to the two vertices connected by the edge, and it is taken as the weight coefficient of the edge. Generally, the higher the communication accuracy, the more reliable the communication between the two storage nodes and the higher the weight coefficient of the edge; the lower the communication accuracy, the less reliable the communication and the lower the weight coefficient. The communication accuracies and the weights in each feasible link are then combined by weighted summation to obtain the consumption cost of the link, so that the consumption cost is determined from both communication accuracy and communication consumption.
Fig. 3 is a schematic diagram of a specific application of a data processing link determining method according to an embodiment of the present invention. As shown in Fig. 3, vertex 1 is the source vertex and vertex 5 is the target vertex. A line with an arrow indicates a connection between two vertices, the absence of an edge between two vertices indicates that data cannot be transmitted between them, and the number on an arrow indicates the weight. All feasible links from source vertex 1 to target vertex 5, and the consumption cost corresponding to each, are as follows:
Feasible link 1: vertex 1 -> vertex 2 -> vertex 3 -> vertex 4 -> vertex 5, with a corresponding consumption cost of 1 + 1 + 2 + 1 = 5.
Feasible link 2: vertex 1 -> vertex 5, with a corresponding consumption cost of 8.
Feasible link 3: vertex 1 -> vertex 7 -> vertex 6 -> vertex 5, with a corresponding consumption cost of 1 + 2 + 4 = 7.
Feasible link 4: vertex 1 -> vertex 6 -> vertex 5, with a corresponding consumption cost of 5 + 4 = 9.
As can be seen from the above, feasible link 1 has the minimum consumption cost. However, if the target data system only requires that the consumption cost from the source storage node to the target storage node be not greater than 7, both feasible link 1 and feasible link 3 meet the requirement, and a suitable one of the two is selected as the optimal link.
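The enumeration above can be reproduced with a short sketch. Fig. 3 is not reproduced here, so the per-edge weights in `GRAPH` are an assumption: each sum in the worked example is read in path order. The helper names `feasible_links` and `consumption_cost` are hypothetical.

```python
# Directed graph as {vertex: {neighbour: weight}}.
# Assumption: weights are taken from the sums in the worked example.
GRAPH = {
    1: {2: 1, 5: 8, 7: 1, 6: 5},
    2: {3: 1},
    3: {4: 2},
    4: {5: 1},
    5: {},
    6: {5: 4},
    7: {6: 2},
}


def feasible_links(graph, source, target, path=None):
    """Yield every simple path from source to target as a list of vertices."""
    path = (path or []) + [source]
    if source == target:
        yield path
        return
    for nxt in graph.get(source, {}):
        if nxt not in path:  # never revisit a vertex
            yield from feasible_links(graph, nxt, target, path)


def consumption_cost(graph, path):
    """Sum the weights of all edges along the path."""
    return sum(graph[a][b] for a, b in zip(path, path[1:]))


costs = {tuple(p): consumption_cost(GRAPH, p)
         for p in feasible_links(GRAPH, 1, 5)}
# Keep only the links whose cost does not exceed the limit of 7.
within_limit = {p: c for p, c in costs.items() if c <= 7}
```

Under these assumed weights, `costs` holds the four feasible links listed above and `within_limit` holds feasible links 1 and 3.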
Assuming feasible link 1 is taken as the optimal link, mapping the vertices in feasible link 1 to the storage nodes in the target data system gives the optimal link in the embodiment of the present invention: storage node 1 -> storage node 2 -> storage node 3 -> storage node 4 -> storage node 5. This is the optimal data processing link from storage node 1 to storage node 5 in the target data system.
In an actual application scenario, different target data systems focus on different indexes. Some target data systems care about only one or a few indexes and pay little attention to the others, while other target data systems expect every index to meet the requirement. Different target data systems therefore use different methods to select the optimal link from the feasible links.
In the embodiment of the invention, aiming at some target data systems only considering data transmission consumption cost, the feasible link with the minimum consumption cost is directly used as the optimal link, and the most appropriate data processing link is selected for the target data systems.
On the basis of the foregoing embodiment, preferably, the obtaining the optimal link according to the consumption cost corresponding to each feasible link includes:
and acquiring the optimal link according to the consumption cost corresponding to each feasible link, the packet loss rate of each feasible link and the transmission load of each feasible link.
In the embodiment of the present invention, for target data systems that need to consider consumption cost, packet loss rate, and transmission load at the same time, all three aspects of each feasible link are taken into account, and the optimal link is selected from all the feasible links.
It should be noted that the packet loss rate (Loss Tolerance) refers to the ratio of the number of packets lost during a test to the number of packets sent. It is calculated as: packet loss rate = [(input packets - output packets) / input packets] × 100%.
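As a small illustration of this formula, the following sketch computes the packet loss rate as a percentage. The function name `packet_loss_rate` and its parameter names are hypothetical.

```python
def packet_loss_rate(packets_in, packets_out):
    """Return the packet loss rate in percent.

    packets_in:  number of packets sent into the link
    packets_out: number of packets received at the other end
    """
    if packets_in <= 0:
        raise ValueError("packets_in must be positive")
    return (packets_in - packets_out) / packets_in * 100.0


# Hypothetical test run: 10000 packets sent, 9995 received.
rate = packet_loss_rate(10000, 9995)
```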
The packet loss rate is related to the packet length and the packet transmission frequency. Generally, when the traffic of a gigabit network card exceeds 200 Mbps, the packet loss rate is less than five ten-thousandths; when the traffic of a 100-megabit network card exceeds 60 Mbps, the packet loss rate is less than one ten-thousandth. Testing is typically done over a range of throughputs.
When screening the feasible links, the maximum bearable load and the packet loss rate of each feasible link are considered in addition to its consumption cost. Taking feasible link 1 as an example: its consumption cost is the lowest, but its packet loss rate is higher, and the maximum bearable load of its communication links is smaller and does not meet the requirement. Feasible link 3 has a slightly higher consumption cost, but a lower packet loss rate and a larger maximum bearable load, so feasible link 3 is selected as the optimal link from the source storage node to the target storage node in the target data system.
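One possible way to express this screening is sketched below. The text does not specify how the three indexes are combined, so the threshold-then-minimum-cost rule is an assumption. The consumption costs 5 and 7 come from the example above, while the packet loss rates, load values, thresholds, and the names `LinkMetrics` and `select_link` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LinkMetrics:
    name: str
    cost: float       # consumption cost of the feasible link
    loss_rate: float  # packet loss rate in percent (hypothetical values)
    max_load: float   # maximum bearable load, arbitrary units (hypothetical)


def select_link(candidates, max_cost, max_loss_rate, min_load):
    """Return the cheapest link that satisfies all three requirements."""
    qualified = [c for c in candidates
                 if c.cost <= max_cost
                 and c.loss_rate <= max_loss_rate
                 and c.max_load >= min_load]
    return min(qualified, key=lambda c: c.cost) if qualified else None


candidates = [
    LinkMetrics("feasible link 1", cost=5, loss_rate=0.5, max_load=100),
    LinkMetrics("feasible link 3", cost=7, loss_rate=0.01, max_load=500),
]
best = select_link(candidates, max_cost=7, max_loss_rate=0.05, min_load=300)
```

With these illustrative numbers, feasible link 1 is rejected on packet loss rate and load, and feasible link 3 is returned.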
In the embodiment of the invention, aiming at some target data systems only considering data transmission consumption cost, packet loss rate and maximum bearable load, indexes in all aspects are comprehensively considered, feasible links meeting requirements on consumption cost, packet loss rate and maximum bearable load are screened out as optimal links, and the most appropriate data processing links are selected for the target data systems.
On the basis of the foregoing embodiment, preferably, the obtaining the optimal link according to the consumption cost corresponding to each feasible link includes:
and acquiring the optimal link according to the consumption cost corresponding to each feasible link and a shortest path method.
As can be seen from the above, feasible link 1 has the lowest consumption cost. Mapping the vertices in feasible link 1 to the storage nodes in the target data system gives the optimal link in the embodiment of the present invention: storage node 1 -> storage node 2 -> storage node 3 -> storage node 4 -> storage node 5. This is the optimal data processing link from storage node 1 to storage node 5 in the target data system.
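The text does not name a specific shortest path method. As one possibility, the sketch below uses Dijkstra's algorithm, which assumes non-negative edge weights. The graph weights are again an assumption read from the worked example, and the function name `shortest_link` is hypothetical.

```python
import heapq

# Assumed edge weights, read from the worked example of Fig. 3.
graph = {
    1: {2: 1, 5: 8, 7: 1, 6: 5},
    2: {3: 1},
    3: {4: 2},
    4: {5: 1},
    5: {},
    6: {5: 4},
    7: {6: 2},
}


def shortest_link(graph, source, target):
    """Dijkstra's algorithm; returns (cost, path), or (inf, []) if unreachable."""
    queue = [(0, source, [source])]
    settled = set()
    while queue:
        cost, vertex, path = heapq.heappop(queue)
        if vertex == target:
            return cost, path
        if vertex in settled:
            continue
        settled.add(vertex)
        for nxt, weight in graph.get(vertex, {}).items():
            if nxt not in settled:
                heapq.heappush(queue, (cost + weight, nxt, path + [nxt]))
    return float("inf"), []


cost, path = shortest_link(graph, 1, 5)
```

Unlike enumerating every feasible link, this approach avoids visiting all paths, which matters when the graph is large.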
To sum up, an embodiment of the present invention provides a method for determining a data processing link, where a structural relationship and a data transmission relationship between storage nodes in a target data system are converted into nodes and edges in a graph, a consumption cost during data processing between the storage nodes in the target data system is converted into a weight of the edges in the graph, so as to calculate consumption costs of all feasible links between a source storage node and the target storage node, and an optimal link is selected according to the consumption costs. In the embodiment of the invention, the consumption cost between the source storage node and the target storage node is quantized, and the most reasonable and efficient data processing link can be screened out from all feasible links more intuitively and conveniently, so that whether the selected data processing link in the target data system is reasonable or not can be judged, the unreasonable data processing link is corrected, and the data processing efficiency is improved.
After multiple tests, the weights corresponding to the edges are determined according to the reading quantity, the MR data and the CPU consumption quantity of the distributed file system between the two storage nodes, so that the optimal selection in the process of calculating the consumption cost is facilitated, and the selection of the subsequent optimal link is facilitated.
In addition, for some target data systems only considering data transmission consumption cost, the feasible link with the minimum consumption cost is directly used as the optimal link, and the most appropriate data processing link is selected for the target data systems.
And finally, aiming at some target data systems only considering the data transmission consumption cost, the packet loss rate and the maximum bearable load, comprehensively considering indexes of all aspects, screening out a feasible link meeting the requirements of the consumption cost, the packet loss rate and the maximum bearable load as an optimal link, and selecting the most appropriate data processing link for the target data systems.
Fig. 4 is a schematic structural diagram of a data processing link determining system according to an embodiment of the present invention. As shown in Fig. 4, the system includes a vertex module 410, an edge module 420, a weight module 430, and a screening module 440, where:
the vertex module 410 is configured to use each storage node in the target data system as a vertex of a graph, where the graph is determined by the vertex, an edge, and a weight corresponding to the edge;
the edge module 420 is configured to determine, for every two vertices in the graph and every two storage nodes in the target system, whether an edge between every two vertices in the graph exists according to a data processing relationship between every two storage nodes, where every two storage nodes correspond to every two vertices;
the weight module 430 is configured to determine, according to the program log between each two storage nodes, a weight corresponding to an edge between each two vertices in the graph if the edge between each two vertices exists;
the screening module 440 is configured to determine an optimal link from a source storage node to a target storage node in the target data system according to the source storage node in the target data system, the target storage node in the target data system, and the graph.
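The cooperation of the four modules can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names `build_graph` and `optimal_link` are hypothetical, the input `log_weights` stands in for weights already derived from program logs, and the exhaustive search is just one way the screening step could be realised.

```python
def build_graph(nodes, log_weights):
    """Vertex, edge, and weight steps combined.

    nodes:       iterable of storage node identifiers (one vertex each)
    log_weights: {(src, dst): weight} for every pair where src can
                 transmit data to dst (assumed to be derived from logs)
    """
    graph = {node: {} for node in nodes}
    for (src, dst), weight in log_weights.items():
        graph[src][dst] = weight
    return graph


def optimal_link(graph, source, target):
    """Screening step: search all simple paths, keep the cheapest one."""
    best = (float("inf"), None)
    stack = [(0, [source])]
    while stack:
        cost, path = stack.pop()
        if path[-1] == target:
            if cost < best[0]:
                best = (cost, path)
            continue
        for nxt, weight in graph[path[-1]].items():
            if nxt not in path:
                stack.append((cost + weight, path + [nxt]))
    return best


demo = build_graph(["a", "b", "c"],
                   {("a", "b"): 1, ("b", "c"): 1, ("a", "c"): 5})
result = optimal_link(demo, "a", "c")
```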
On the basis of the above embodiment, preferably, the edge module includes a first edge confirming unit and a second edge confirming unit, wherein:
the first edge confirming unit is used for determining that an edge connected between every two vertexes in the graph exists if one storage node in every two storage nodes can transmit data to the other storage node through a program;
the second edge confirming unit is used for determining that no connected edge exists between every two vertexes in the graph if one storage node cannot transmit data to the other storage node through a program.
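A minimal sketch of this edge rule follows. It assumes directed edges, since Fig. 3 draws arrows, and a hypothetical predicate `can_transmit(a, b)` that reports whether a program transmits data from storage node a to storage node b; neither detail is fixed by the text.

```python
def build_edges(nodes, can_transmit):
    """Return the set of directed edges (a, b) for which data can flow a -> b."""
    return {(a, b)
            for a in nodes
            for b in nodes
            if a != b and can_transmit(a, b)}


# Hypothetical data processing relations between three storage nodes.
transfers = {(1, 2), (2, 3)}
edges = build_edges([1, 2, 3], lambda a, b: (a, b) in transfers)
```

Pairs for which the predicate is false simply produce no edge, which corresponds to the second edge confirming unit.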
On the basis of the above embodiment, preferably, the weight module includes a first weighting unit, wherein:
the first weighting unit is used for acquiring the weight corresponding to the edge according to a preset weight calculation formula according to the reading amount of the distributed file system between every two storage nodes, the MR data recorded in the program log and the CPU consumption amount recorded in the program log;
wherein, the preset weight calculation formula is specifically as follows:
Q = HDFS × Q1 + MR × Q2 + CPU consumption × Q3;
wherein Q represents a weight, HDFS is a read volume of the distributed file system, MR represents stored data described in the program log, CPU consumption represents processor consumption described in the program log, Q1 represents a first preset coefficient, Q2 represents a second preset coefficient, and Q3 represents a third preset coefficient.
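The preset weight calculation formula can be written as a short function. The default coefficient values below are purely hypothetical placeholders, since the text only states that Q1, Q2, and Q3 are preset; the function and parameter names are also assumptions.

```python
def edge_weight(hdfs_read, mr_data, cpu_consumption, q1=0.5, q2=0.3, q3=0.2):
    """Q = HDFS * Q1 + MR * Q2 + CPU consumption * Q3.

    hdfs_read:       read volume of the distributed file system
    mr_data:         MR data recorded in the program log
    cpu_consumption: CPU consumption recorded in the program log
    q1, q2, q3:      preset coefficients (default values are hypothetical)
    """
    return hdfs_read * q1 + mr_data * q2 + cpu_consumption * q3


# Hypothetical measurements with explicit integer coefficients.
weight = edge_weight(10, 10, 10, q1=1, q2=2, q3=3)
```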
On the basis of the above embodiment, preferably, the screening module includes a first screening unit, a second screening unit and a third screening unit, wherein:
the first screening unit is used for acquiring all feasible links from a source vertex corresponding to the source storage node to a target vertex corresponding to the target storage node in the graph according to the source storage node, the target storage node and the graph, wherein the feasible links are composed of a plurality of edges;
the second screening unit is used for determining the consumption cost corresponding to each feasible link according to the weight corresponding to all the edges in each feasible link;
the third screening unit is configured to obtain the optimal link according to the consumption cost corresponding to each feasible link.
On the basis of the above embodiment, preferably, the third screening unit includes a fourth screening unit, wherein:
the fourth screening unit is configured to obtain the optimal link according to the consumption cost corresponding to each feasible link, the packet loss rate of each feasible link, and the transmission load of each feasible link.
On the basis of the above embodiment, preferably, the third screening unit includes a fifth screening unit, wherein:
and the fifth screening unit is used for acquiring the optimal link according to the shortest path method and the consumption cost corresponding to each feasible link.
On the basis of the foregoing embodiment, preferably, the determining the consumption cost corresponding to each feasible link according to the weights corresponding to all edges in each feasible link includes:
and taking the sum of the weights corresponding to all edges in each feasible link as the consumption cost corresponding to each feasible link.
The various modules in the data processing link determination system described above may be implemented in whole or in part by software, hardware, and combinations thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
The present embodiment is a system embodiment corresponding to the above method embodiment, and the specific implementation process thereof is the same as the above method implementation process, and please refer to the above method embodiment for details, which is not described herein again.
Fig. 5 is a schematic structural diagram of a computer device according to an embodiment of the present invention, where the computer device may be a server, and its internal structural diagram may be as shown in fig. 5. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a computer storage medium and an internal memory. The computer storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the computer storage media. The database of the computer device is used for storing data generated or acquired during execution of the data processing link determination method, such as source storage nodes, target storage nodes, graphs and the like. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a data processing link determination method.
In one embodiment, a computer device is provided, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and the processor executes the computer program to implement the steps of the data processing link determination method in the above embodiments, the data processing link determination method is as follows:
taking each storage node in the target data system as a vertex of a graph, wherein the graph is determined by the vertex, an edge and a weight corresponding to the edge;
for every two vertexes in the graph and every two storage nodes in the target system, determining whether an edge between every two vertexes in the graph exists or not according to a data processing relation between every two storage nodes, wherein every two storage nodes correspond to every two vertexes;
if an edge between every two vertexes exists in the graph, determining corresponding weight between every two vertexes according to the program log between every two storage nodes;
determining an optimal link from a source storage node to a target storage node in the target data system according to the source storage node in the target data system, the target storage node in the target data system and the graph.
Alternatively, the processor, when executing the computer program, implements the functions of the modules/units in this embodiment of the data processing link determination system as follows:
the vertex module is used for taking each storage node in the target data system as a vertex of a graph, and the graph is determined by the vertex, the edge and the weight corresponding to the edge;
the edge module is used for determining whether an edge between every two vertexes in the graph exists according to a data processing relation between every two storage nodes for every two vertexes in the graph and every two storage nodes in the target system, wherein every two storage nodes correspond to every two vertexes;
the weight module is used for determining the weight corresponding to the edge between every two vertexes according to the program log between every two storage nodes if the edge between every two vertexes exists in the graph;
and the screening module is used for determining the optimal link from the source storage node to the target storage node according to the source storage node in the target data system, the target storage node in the target data system and the graph.
In an embodiment, a computer storage medium is provided, on which a computer program is stored, which when executed by a processor implements the steps of the data processing link determining method in the above embodiments, the data processing link determining method comprising:
taking each storage node in the target data system as a vertex of a graph, wherein the graph is determined by the vertex, an edge and a weight corresponding to the edge;
for every two vertexes in the graph and every two storage nodes in the target system, determining whether an edge between every two vertexes in the graph exists or not according to a data processing relation between every two storage nodes, wherein every two storage nodes correspond to every two vertexes;
if the edges between every two vertexes exist in the graph, determining the corresponding weight of the edges between every two vertexes according to the program log between every two storage nodes;
determining an optimal link from a source storage node to a target storage node in the target data system according to the source storage node in the target data system, the target storage node in the target data system and the graph.
Alternatively, the computer program, when executed by a processor, implements the functions of the modules/units in the embodiment of the data processing link determination system described above, which is as follows:
the vertex module is used for taking each storage node in the target data system as a vertex of a graph, and the graph is determined by the vertex, the edge and the weight corresponding to the edge;
the edge module is used for determining whether an edge between every two vertexes in the graph exists according to a data processing relation between every two storage nodes for every two vertexes in the graph and every two storage nodes in the target system, wherein every two storage nodes correspond to every two vertexes;
the weight module is used for determining the weight corresponding to the edge between every two vertexes according to the program log between every two storage nodes if the edge between every two vertexes exists in the graph;
and the screening module is used for determining the optimal link from the source storage node to the target storage node according to the source storage node in the target data system, the target storage node in the target data system and the graph.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims (10)

1. A method for determining a data processing link, comprising:
taking each storage node in the target data system as a vertex of a graph, wherein the graph is determined by the vertex, an edge and a weight corresponding to the edge;
for every two vertexes in the graph and every two storage nodes in the target data system, determining whether an edge between every two vertexes in the graph exists or not according to a data processing relation between every two storage nodes, wherein every two storage nodes correspond to every two vertexes;
if the edges between every two vertexes exist in the graph, determining the corresponding weight of the edges between every two vertexes according to the program log between every two storage nodes;
determining an optimal link from a source storage node to a target storage node in the target data system according to the source storage node in the target data system, the target storage node in the target data system and the graph.
2. The data processing link determining method according to claim 1, wherein the determining an edge between every two vertices in the graph according to the data processing relationship between every two storage nodes comprises:
if one of every two storage nodes can transmit data to the other storage node through a program, determining that a connected edge exists between every two vertexes in the graph;
and if one storage node cannot transmit data to the other storage node through the program, determining that no connected edge exists between every two vertexes in the graph.
3. The method according to claim 1, wherein determining the weight corresponding to the edge according to the program log between each two storage nodes comprises:
acquiring the weight corresponding to the edge according to a preset weight calculation formula according to the reading amount of the distributed file system between every two storage nodes, the MR data recorded in the program log and the CPU consumption recorded in the program log;
wherein, the preset weight calculation formula is specifically as follows:
Q = HDFS × Q1 + MR × Q2 + CPU consumption × Q3;
wherein Q represents a weight, HDFS is a read volume of the distributed file system, MR represents stored data described in the program log, CPU consumption represents processor consumption described in the program log, Q1 represents a first preset coefficient, Q2 represents a second preset coefficient, and Q3 represents a third preset coefficient.
4. The method of claim 1, wherein determining the optimal link from a source storage node to a target storage node in the target data system based on the source storage node in the target data system, the target storage node in the target data system, and the graph comprises:
acquiring all feasible links from a source vertex corresponding to the source storage node to a target vertex corresponding to the target storage node in the graph according to the source storage node, the target storage node and the graph, wherein the feasible links are composed of a plurality of edges;
determining the consumption cost corresponding to each feasible link according to the weight corresponding to all edges in each feasible link;
and acquiring the optimal link according to the consumption cost corresponding to each feasible link.
5. The method according to claim 4, wherein the obtaining the optimal link according to the consumption cost corresponding to each feasible link comprises:
and acquiring the optimal link according to the consumption cost corresponding to each feasible link, the packet loss rate of each feasible link and the transmission load of each feasible link.
6. The method according to claim 4, wherein the obtaining the optimal link according to the consumption cost corresponding to each feasible link comprises:
and acquiring the optimal link according to the consumption cost corresponding to each feasible link and a shortest path method.
7. The method according to claim 4, wherein the determining the consumption cost corresponding to each feasible link according to the weights corresponding to all edges in each feasible link comprises:
and taking the sum of the weights corresponding to all edges in each feasible link as the consumption cost corresponding to each feasible link.
8. A data processing link determination system, comprising:
the vertex module is used for taking each storage node in the target data system as a vertex of a graph, and the graph is determined by the vertex, the edge and the weight corresponding to the edge;
the edge module is used for determining whether an edge between every two vertexes in the graph exists according to a data processing relation between every two storage nodes for every two vertexes in the graph and every two storage nodes in the target data system, wherein every two storage nodes correspond to every two vertexes;
the weight module is used for determining the weight corresponding to the edge between every two vertexes according to the program log between every two storage nodes if the edge between every two vertexes exists in the graph;
and the screening module is used for determining the optimal link from the source storage node to the target storage node according to the source storage node in the target data system, the target storage node in the target data system and the graph.
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the data processing link determination method according to any one of claims 1 to 7 when executing the computer program.
10. A computer storage medium storing a computer program, the computer program, when executed by a processor, implementing the steps of the data processing link determination method according to any one of claims 1 to 7.
CN202111234636.9A 2021-10-22 2021-10-22 Data processing link determining method, system, device and storage medium Pending CN113918638A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111234636.9A CN113918638A (en) 2021-10-22 2021-10-22 Data processing link determining method, system, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111234636.9A CN113918638A (en) 2021-10-22 2021-10-22 Data processing link determining method, system, device and storage medium

Publications (1)

Publication Number Publication Date
CN113918638A true CN113918638A (en) 2022-01-11

Family

ID=79242486

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111234636.9A Pending CN113918638A (en) 2021-10-22 2021-10-22 Data processing link determining method, system, device and storage medium

Country Status (1)

Country Link
CN (1) CN113918638A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115237727A (en) * 2022-09-21 2022-10-25 云账户技术(天津)有限公司 Method and device for determining most congested sublinks, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108540876A (en) * 2018-03-12 2018-09-14 上海欣诺通信技术股份有限公司 Service path choosing method, SDN controllers, storage medium and electronic equipment
CN111274495A (en) * 2020-01-20 2020-06-12 平安科技(深圳)有限公司 Data processing method and device for user relationship strength, computer equipment and storage medium
CN111585894A (en) * 2020-05-08 2020-08-25 南方电网科学研究院有限责任公司 Network routing method and device based on weight calculation
CN112291365A (en) * 2020-11-11 2021-01-29 平安普惠企业管理有限公司 Access balance processing method and device, computer equipment and storage medium
CN112559831A (en) * 2020-12-24 2021-03-26 平安普惠企业管理有限公司 Link monitoring method and device, computer equipment and medium
CN113128914A (en) * 2021-05-17 2021-07-16 中国建设银行股份有限公司 Path generation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination