CN112256720A - Data cost calculation method, system, computer device and storage medium - Google Patents

Info

Publication number
CN112256720A
Authority
CN
China
Prior art keywords
data
cost
directed acyclic
acyclic graph
calculation method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011132525.2A
Other languages
Chinese (zh)
Other versions
CN112256720B (en
Inventor
陈玉
张茜
凌海挺
刘丽扬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202011132525.2A (patent CN112256720B)
Priority to PCT/CN2020/135737 (patent WO2021174945A1)
Publication of CN112256720A
Application granted
Publication of CN112256720B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/242Query formulation
    • G06F16/2433Query languages
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462Approximate or statistical queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/283Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/02Banking, e.g. interest calculation or account maintenance

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Finance (AREA)
  • Accounting & Taxation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Technology Law (AREA)
  • Strategic Management (AREA)
  • Marketing (AREA)
  • Development Economics (AREA)
  • Fuzzy Systems (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a data cost calculation method based on data lineage. The method comprises: generating a data lineage relationship from SQL statements, or from SQL statements contained in a processing script, wherein the data lineage relationship forms a directed acyclic graph; acquiring statistical information and frequency information on task execution from a data platform and mapping that information onto the directed acyclic graph; calculating the cost of the nodes and the cost of the edges related to target data in the directed acyclic graph; and accumulating the costs of the edges and the nodes to obtain the total cost of the target data. Combining cost calculation with data lineage allows the cost of data to be calculated and displayed at a finer granularity and makes the pricing of data applications more reasonable. It also provides a more detailed and reasonable reference for evaluating data value inside and outside the enterprise, and makes it practical to calculate cost at the finest granularity, so that the cost of each individual record can be quantified. The invention also relates to blockchain technology.

Description

Data cost calculation method, system, computer device and storage medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a data cost calculation method, system, computer device, and storage medium.
Background
Existing data lineage analysis programs and systems are mostly used for tracing data to its source, analysing dependencies and references, and similar purposes; no case of combining them with data cost calculation has been found. Enterprises now process and store ever more data, and big data technology is widely applied. Data processing and storage consume large amounts of resources, yet the corresponding cost cannot be calculated and displayed effectively. Enterprises currently calculate data cost at a coarse granularity, so differences in data cost cannot be reflected at the finer granularity needed for internal management and related decisions.
At present, data cost is mostly calculated from the overall processing workflow and the storage resources occupied, so table-level, field-level, or record-level cost cannot be obtained. Once data cost is clear, data used inside or outside the enterprise can be priced reasonably or settled at cost.
The cost of data can be calculated from the cost of the resources it uses directly, but the other data consumed while processing it should also count toward its cost. Doing so provides more perspectives for evaluating the cost or value of the data.
Disclosure of Invention
Based on the above, the invention provides a data cost calculation method, a system, computer equipment and a storage medium, so that the cost of data can be calculated and displayed in a finer granularity, and meanwhile, the pricing mode of data application can be more reasonable.
In order to achieve the above object, the present invention provides a data cost calculation method based on data lineage, including:
acquiring the SQL statements used in the data processing process, or the scripts used in the data processing process, and generating a data lineage relationship from the SQL statements or from the SQL statements contained in the processing scripts, wherein the data lineage relationship forms a directed acyclic graph;
acquiring statistical information and frequency information on task execution of a data platform, and mapping the statistical information and the frequency information onto the directed acyclic graph;
calculating the cost of the nodes and the cost of the edges related to target data in the directed acyclic graph;
and acquiring the costs of the edges and the nodes, and accumulating them to obtain the total cost of the target data.
Preferably, the statistical information includes resource usage of each task, and the resource usage includes storage usage, CPU usage, and memory usage; the frequency information includes historical execution times and start and stop times of execution of the tasks.
Preferably, a unit price parameter for resource usage is introduced for each data platform, since unit prices differ between data platforms; in the calculation of the data cost, the cost of a node is its storage cost, and the cost of an edge is its CPU and memory cost.
Preferably, the cost of the nodes related to the target data in the directed acyclic graph is calculated as $\sum_i \mathrm{distinct}\{S_i\} + S_k$, where $S_i$ represents the cost of the storage resources occupied by a related node and $S_k$ represents the storage cost of the target data; the cost of the edges related to the target data in the directed acyclic graph is calculated as:
[Formula image BDA0002735612920000021]
where $N_{Lp}$ represents the number of edges related to the target data, $X_{pq}$ represents the cost of the resources consumed by each execution of each processing instruction, and $\mathrm{count}(L_x)$ represents the number of edges in the directed acyclic graph corresponding to each processing instruction.
Preferably, acquiring the costs of the edges and the nodes and accumulating them to obtain the total cost of the target data includes:
[Formula image BDA0002735612920000022]
where $C_k$ represents the total cost of the target data.
Preferably, generating the data lineage relationship from the SQL statements contained in the processing script, with the data lineage relationship forming a directed acyclic graph, includes:
extracting regularized SQL statements from a script file containing SQL code, thereby completing the cleaning of the SQL statements;
and performing lexical analysis on the regularized SQL statements to generate the data lineage relationship, and generating the directed acyclic graph from the data lineage relationship.
Preferably, after the total cost of the target data is obtained, it is uploaded to a blockchain so that the blockchain stores it in encrypted form.
To achieve the above object, the present invention further provides a data cost calculation system based on data lineage, the data cost calculation system comprising:
a data set module, used for acquiring the SQL statements used in the data processing process, or the scripts used in the data processing process, and generating a data lineage relationship from the SQL statements or from the SQL statements contained in the processing scripts, wherein the data lineage relationship forms a directed acyclic graph;
an information module, used for acquiring statistical information and frequency information on task execution of the data platform and mapping them onto the directed acyclic graph;
a first calculation module, used for calculating the cost of the nodes and the cost of the edges related to target data in the directed acyclic graph;
and a second calculation module, used for acquiring the costs of the edges and the nodes and accumulating them to obtain the total cost of the target data.
To achieve the above object, the present invention also provides a computer device comprising a memory and a processor, wherein the memory stores computer readable instructions, and the computer readable instructions, when executed by the processor, cause the processor to execute the steps of the data cost calculation method as described above.
In order to achieve the above object, the present invention further provides a storage medium storing a program file capable of implementing the data cost calculation method as described above.
The invention provides a data cost calculation method, a system, a computer device, and a storage medium. The data cost calculation method acquires the SQL statements used in the data processing process, or the scripts used in the data processing process, and generates a data lineage relationship from the SQL statements or from the SQL statements contained in the processing scripts, the data lineage relationship forming a directed acyclic graph; acquires statistical information and frequency information on task execution of a data platform and maps them onto the directed acyclic graph; calculates the cost of the nodes and the cost of the edges related to target data in the directed acyclic graph; and acquires the costs of the edges and the nodes and accumulates them to obtain the total cost of the target data. By combining cost calculation with data lineage, the method can calculate and display the cost of data at a finer granularity and makes the pricing of data applications more reasonable, so that a more detailed and reasonable reference can be provided for evaluating the value of an enterprise's data.
Drawings
FIG. 1 is a diagram of an implementation environment for a data cost calculation method provided in one embodiment;
FIG. 2 is a block diagram showing an internal configuration of a computer device according to an embodiment;
FIG. 3 is a flow diagram of a method of data cost calculation in one embodiment;
FIG. 4 is a diagram of a directed acyclic graph in one embodiment;
FIG. 5 is a flow diagram that illustrates the computation of nodes and edges in a directed acyclic graph, according to one embodiment;
FIG. 6 is a diagram of a directed acyclic graph in which SQL statements are multiple-input and multiple-output in one embodiment;
FIG. 7 is a schematic diagram of a data cost calculation system in one embodiment;
FIG. 8 is a schematic diagram of a computer apparatus in one embodiment;
FIG. 9 is a schematic diagram of a storage medium in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
It will be understood that, as used herein, the terms "first," "second," and the like may be used herein to describe various elements, but these elements are not limited by these terms. These terms are only used to distinguish one element from another.
Fig. 1 is a diagram of an implementation environment of the data cost calculation method based on data lineage provided in an embodiment. As shown in fig. 1, the implementation environment includes a computer device 110 and a display device 120.
The computer device 110 may be a computer device such as a computer used by a user, on which a data cost calculation system based on data lineage is installed. To perform a calculation, the user runs the data cost calculation method based on data lineage on the computer device 110, and the calculation result is displayed through the display device 120.
It should be noted that the combination of the computer device 110 and the display device 120 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, and the like.
FIG. 2 is a diagram showing an internal configuration of a computer device according to an embodiment. As shown in fig. 2, the computer device includes a processor, a non-volatile storage medium, a memory, and a network interface connected through a system bus. The non-volatile storage medium of the computer device stores an operating system, a database, and computer readable instructions; the database can store control information sequences, and the computer readable instructions, when executed by the processor, enable the processor to implement a data cost calculation method based on data lineage. The processor of the computer device provides calculation and control capability and supports the operation of the whole computer device. The memory of the computer device may store computer readable instructions that, when executed by the processor, cause the processor to perform a data cost calculation method based on data lineage. The network interface of the computer device is used for connecting to and communicating with a terminal. Those skilled in the art will appreciate that the architecture shown in fig. 2 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computing devices to which the disclosed aspects apply; a particular computing device may include more or fewer components than those shown, may combine certain components, or may arrange the components differently.
As shown in FIG. 3, in one embodiment, a data cost calculation method based on data lineage is provided, where data cost refers to the direct or indirect expenditure an enterprise incurs for data acquisition, transmission, expression, storage, search, processing, and the like. The data cost calculation method may be applied to the computer device 110 and the display device 120, and may specifically include the following steps:
step 31, acquiring the SQL statements used in the data processing process, or the scripts used in the data processing process, and generating a data lineage relationship from the SQL statements or from the SQL statements contained in the processing scripts, wherein the data lineage relationship forms a directed acyclic graph.
Specifically, data processing and data volume in a data warehouse resemble a pyramid: processing and storage proceed from the bottom up, and the data volume at the bottom layers and the resources used to process it are much larger than the volume of data finally served. The processing and storage costs of the data at the top of the pyramid alone do not reflect its real production cost; it is more reasonable to include the production and storage costs of the lower-layer data involved in producing it. The cumulative cost of data can therefore be calculated relatively easily on the basis of data lineage. The cumulative cost can be calculated in two ways. The first way is to calculate the overall cost of each node in the lineage and then accumulate the cost recursively, level by level, along the lineage relationships until a termination condition is met. The second way is to calculate the cost of the nodes and the cost of the edges separately in the directed acyclic graph generated from the data lineage relationship, and then accumulate the costs of the edges and nodes related to the calculation target. The present method adopts the second way so that the data cost is calculated correctly. For example, the indexes related to a customer's daily average deposit balance are calculated in the following steps:
step 1, reading data from the local-currency demand account table (A, which stores local-currency demand account numbers and balances), writing it into the local-currency daily average deposit balance table (E), and calculating each customer's daily local-currency demand deposit balance (A -> E);
step 2, reading data from the local-currency time deposit account table (B, which stores local-currency time deposit account numbers and balances), writing it into the local-currency daily average deposit balance table (E), and calculating each customer's daily local-currency time deposit balance (B -> E);
step 3, reading data from the foreign-currency demand account table (C, which stores foreign-currency demand account numbers and balances), writing it into the foreign-currency daily average deposit balance table (F), and calculating each customer's daily foreign-currency demand deposit balance (C -> F);
step 4, reading data from the foreign-currency time deposit account table (D, which stores foreign-currency time deposit account numbers and balances), writing it into the foreign-currency daily average deposit balance table (F), and calculating each customer's daily foreign-currency time deposit balance (D -> F);
step 5, reading data from the local-currency daily average deposit balance table (E, which stores user IDs and local-currency deposit balances) and from the foreign-currency daily average deposit balance table (F, which stores user IDs and foreign-currency deposit balances), writing it into the customer daily average deposit balance table (G, which stores user IDs and deposit balances), and calculating each customer's daily average deposit balance (E -> G, F -> G).
In steps 1-4, the customer account relationship table (Z), which stores the correspondence between user IDs and accounts, must also be read so that customer information is written to the target table at the same time. In each step, the corresponding SQL statement is executed: data is read from the source tables, processed, and written into the target table. Data lineage is obtained by analysing the executed SQL statements to derive the relationships between tables and between fields. These relationships can be stored as a two-dimensional table in which each lineage record describes one relationship, such as field A -> field E, so that a directed acyclic graph (DAG) as shown in fig. 4 can be drawn from multiple lineage records.
Referring further to fig. 4, the nodes in the graph represent stored data, and the connecting lines between the nodes represent the processing of the data. A node may represent a data table, a record, or an individual field, and a directed edge between nodes represents the computing resources occupied by the associated data processing. All edges in the graph are directed and point from the source table or field to the destination table or field. Lineage-based cost calculation covers only the cost of the storage and computing resources used while storing and processing the data; other costs, such as labour, premises, and power, are not considered by the data cost calculation method. It should be noted that the data cost calculation method only uses the data lineage result and is not concerned with how that result is generated; even a manually written lineage result can be used.
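As an illustration of how lineage records stored as a two-dimensional table can be turned into a graph, the following minimal Python sketch builds the table-level DAG for the deposit-balance example and lists every upstream table of G. The edge list is an assumption read off steps 1-5 and the role of table Z; the function names are hypothetical and are not taken from the patent.

```python
from collections import defaultdict

# Table-level lineage records (source, target), one row per relationship.
# Assumed from steps 1-5: Z is read together with A, B (into E) and C, D (into F).
LINEAGE = [
    ("A", "E"), ("B", "E"), ("Z", "E"),
    ("C", "F"), ("D", "F"), ("Z", "F"),
    ("E", "G"), ("F", "G"),
]

def build_parents(records):
    """Map each target node to the set of its direct source nodes."""
    parents = defaultdict(set)
    for source, target in records:
        parents[target].add(source)
    return parents

def upstream(parents, node):
    """Return every node reachable by walking edges backwards from `node`."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for src in parents.get(current, ()):
            if src not in seen:
                seen.add(src)
                stack.append(src)
    return seen

parents = build_parents(LINEAGE)
print(sorted(upstream(parents, "G")))  # ['A', 'B', 'C', 'D', 'E', 'F', 'Z']
```

Because visited nodes are kept in a set, Z is reported once even though it is reachable through both E and F.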
Further, in an embodiment, generating the data lineage relationship from the SQL statements contained in the processing script, with the data lineage relationship forming a directed acyclic graph, specifically includes:
S311, extracting regularized SQL statements from the script file containing SQL code, thereby completing the cleaning of the SQL statements;
further, S311 includes:
S3111, acquiring a script file containing SQL code, and searching for the markers (flags) that identify the SQL code;
preferably, the script file may be a Perl script or a similar script.
S3112, using the markers to filter out irrelevant content in the script file, and retaining the regularized SQL statements.
S312, performing lexical analysis on the regularized SQL statements to generate the data lineage relationship, and generating the directed acyclic graph from the data lineage relationship.
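Steps S311 to S312 can be sketched roughly as follows. This is only an illustration under strong assumptions: the `-- SQL_BEGIN` / `-- SQL_END` markers are hypothetical stand-ins for whatever marker the scripts actually use, and two regular expressions stand in for a real lexical analyser, so the sketch handles only simple `insert into ... select ... from ... join ...` statements.

```python
import re

# A hypothetical Perl-like script with SQL embedded between assumed markers.
SCRIPT = '''
my $dbh = connect();
-- SQL_BEGIN
insert into table_E
select z.cust_id, a.bal
from table_A a
join table_Z z
on a.acct_no = z.acct_no;
-- SQL_END
print "done";
'''

def extract_sql(script):
    """Keep only the text between the assumed markers and normalise whitespace."""
    blocks = re.findall(r"-- SQL_BEGIN\n(.*?)-- SQL_END", script, flags=re.S)
    return [" ".join(block.split()) for block in blocks]

def table_lineage(sql):
    """Return (source, target) pairs for one simple INSERT ... SELECT statement."""
    target = re.search(r"insert\s+into\s+(\w+)", sql, flags=re.I).group(1)
    sources = re.findall(r"(?:from|join)\s+(\w+)", sql, flags=re.I)
    return [(source, target) for source in sources]

statements = extract_sql(SCRIPT)
edges = [edge for sql in statements for edge in table_lineage(sql)]
print(edges)  # [('table_A', 'table_E'), ('table_Z', 'table_E')]
```

Subqueries, comments containing keywords, or multi-target inserts would need a proper SQL parser rather than this pattern matching.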
Step 32, acquiring statistical information and frequency information on task execution of the data platform, and mapping them onto the directed acyclic graph.
The statistical information includes the resource usage of each task, and the resource usage includes storage usage, CPU usage, memory usage, and similar information; the frequency information includes the historical number of executions of each task, the start and stop times of each execution, and similar information.
Specifically, a task of the data platform may be an SQL statement. Each SQL statement corresponds to one or more edges in the directed acyclic graph, and once this mapping is established, the resource usage corresponding to each edge can be looked up during the calculation.
Specifically, the processing cost of the designated data can be aggregated over different time periods according to when the tasks were executed. For example, if a task is executed once per month, the resource usage and cost of the related processing can be totalled per quarter or per half year. The statistical information and the frequency information therefore make the relevant facts about the target data clear and make it easy to calculate the data cost for each time period.
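The per-period aggregation described above might look like the following minimal sketch. The execution log, the task name `X5`, and the per-run cost values are all hypothetical and serve only to show the grouping by quarter.

```python
from collections import defaultdict
from datetime import date

# Hypothetical execution log: (task id, run date, cost of that run in yuan).
RUNS = [
    ("X5", date(2020, 1, 31), 0.065),
    ("X5", date(2020, 2, 29), 0.065),
    ("X5", date(2020, 3, 31), 0.065),
    ("X5", date(2020, 4, 30), 0.070),
]

def cost_by_quarter(runs):
    """Sum the per-run cost of each task per (task, year, quarter)."""
    totals = defaultdict(float)
    for task, run_date, cost in runs:
        quarter = (run_date.month - 1) // 3 + 1
        totals[(task, run_date.year, quarter)] += cost
    return dict(totals)

quarterly = cost_by_quarter(RUNS)
for key in sorted(quarterly):
    print(key, round(quarterly[key], 3))
```

The same grouping key could be changed to half years or months without touching the rest of the calculation.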
Step 33, calculating the cost of the node and the cost of the edge related to the target data in the directed acyclic graph.
Of the two ways of calculating cumulative cost described above, the first may count nodes that are referenced multiple times repeatedly, which can produce a large error; for example, in fig. 4 the cost of node Z may be accumulated once each through node A, node B, node C, and node D. The second way calculates the cost of each node, then the cost of each edge, and finally takes the sum of the two as the cost of the target data. Its result is more accurate, and it is the data cost calculation method provided by the invention.
Further, when data is generated by batch processing in a big data environment, the main resources occupied are storage, CPU, and memory (MEM). Storage is measured in bytes, multiplied by the number of redundant copies; CPU is measured in core-seconds, and memory in MB-seconds. Calculation is relatively simple in a cloud environment, where purchased resources can be converted directly into these units; in a traditional environment, software and hardware costs must be converted into these units in some reasonable way before calculation. In short, unit price parameters for resource usage are introduced per data platform, because unit prices may differ between platforms and the data cost is determined by the technology and hardware used to process and store the data. Furthermore, within one enterprise, a reasonable and uniform pricing scheme for data exchange can be built on the cost of the data.
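The double counting caused by the recursive approach can be seen in a small sketch. The graph below is an assumed table-level reading of the example in which Z feeds both E and F, and every node is given a hypothetical storage cost of 1.0; how often a shared node is repeated in practice depends on how the lineage graph is drawn.

```python
# Direct sources of each table (assumed table-level lineage of the example).
PARENTS = {"G": ["E", "F"], "E": ["A", "B", "Z"], "F": ["C", "D", "Z"]}

# Hypothetical storage cost per node, 1.0 each to make the effect obvious.
STORAGE = {name: 1.0 for name in "ABCDEFGZ"}

def recursive_cost(node):
    """First approach: add up parent costs recursively (shared nodes repeat)."""
    return STORAGE[node] + sum(recursive_cost(p) for p in PARENTS.get(node, []))

def deduplicated_cost(node):
    """Second approach for nodes: visit each upstream node exactly once."""
    seen = {node}
    stack = [node]
    while stack:
        for parent in PARENTS.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return sum(STORAGE[name] for name in seen)

print(recursive_cost("G"))     # 9.0, Z is counted via E and again via F
print(deduplicated_cost("G"))  # 8.0, each of the eight tables counted once
```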
Specifically, suppose for example that the resource costs in the current big data processing environment are as follows:
1000 CPU cores with an annual cost of 1,000,000 yuan: the price per core-second is about 1000000 / 1000 (cores) / (365 × 86400) = 0.0000317 yuan;
5 TB of memory with an annual cost of 500,000 yuan: the price per GB-second is about 500000 / (5 × 1024) / (365 × 86400) = 0.0000030966 yuan;
20 TB of storage with an annual cost of 50,000 yuan: the price per GB per year is about 50000 / (20 × 1024) = 2.4414 yuan.
According to fig. 4, assume that the computing resources used by executing the foregoing SQL (processing instruction) are 2000 core-seconds of CPU and 500 GB-seconds of memory, and that the data of node A occupies 10 GB of storage, node Z 2 GB, and node E 3 GB. The processing and storage cost calculated for this part of the directed acyclic graph is then (CPU unit price) 0.0000317 × 2000 + (memory unit price) 0.0000030966 × 500 + (storage unit price) 2.4414 × (10 + 2 + 3) = 0.0634 + 0.0015483 + 36.621 = 36.6859483 yuan, so the data cost of this part can be calculated accurately and quickly.
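The unit prices above follow from simple division, as the short sketch below shows. It assumes a 365-day year and 1 TB = 1024 GB, as in the figures quoted; the function and variable names are illustrative only.

```python
SECONDS_PER_YEAR = 365 * 86400

def per_second_price(annual_cost, units):
    """Annual cost spread over `units` resources and every second of a year."""
    return annual_cost / units / SECONDS_PER_YEAR

cpu_price = per_second_price(1_000_000, 1000)    # yuan per core-second
mem_price = per_second_price(500_000, 5 * 1024)  # yuan per GB-second
storage_price = 50_000 / (20 * 1024)             # yuan per GB per year

print(f"{cpu_price:.7f}")      # 0.0000317
print(f"{mem_price:.10f}")     # 0.0000030967
print(f"{storage_price:.4f}")  # 2.4414
```

The memory price prints as 0.0000030967 because of rounding to ten decimals; the text truncates the same value to 0.0000030966.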
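The arithmetic of this worked example can be checked with a few lines of Python. The unit prices are the rounded values quoted in the text; the function name and the dictionary of storage sizes are illustrative.

```python
CPU_PRICE = 0.0000317         # yuan per core-second
MEM_PRICE = 0.0000030966      # yuan per GB-second
STORAGE_PRICE = 2.4414        # yuan per GB per year

def partial_cost(cpu_core_seconds, mem_gb_seconds, storage_gb):
    """Processing cost of one SQL run plus storage cost of the nodes involved."""
    processing = CPU_PRICE * cpu_core_seconds + MEM_PRICE * mem_gb_seconds
    storage = STORAGE_PRICE * sum(storage_gb.values())
    return processing + storage

cost = partial_cost(2000, 500, {"A": 10, "Z": 2, "E": 3})
print(round(cost, 7))  # 36.6859483
```

Storage dominates here (36.621 of 36.686 yuan) because the storage price is annual while the processing figures cover a single run.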
Further, in one embodiment, suppose the cost $C_k$ of data node (table) K is to be calculated. The tables from which the data originates (nodes in the DAG) and the resources consumed by the related processing SQL (edges in the DAG) must be obtained through the data lineage. $S_i$ denotes the cost of the storage resources occupied by a related node, and $X$ denotes the resources consumed by the processing SQL that generates the target table. Several SQL statements may generate the target table's data, so $X_p$ denotes the cost of the resources consumed by each SQL statement; each SQL statement is executed multiple times, so $X_{pq}$ denotes the cost of the resources consumed by each execution of each SQL statement. The lineage generated by one SQL statement may correspond to multiple edges in the DAG, and $\mathrm{count}(L_x)$ denotes the number of edges in the DAG corresponding to each SQL statement. Referring to fig. 5, the details are as follows:
331. Calculating the cost of the nodes in the directed acyclic graph.
Specifically, the cost of a node is its storage cost, and according to the above description the node cost is calculated as $\sum_i \mathrm{distinct}\{S_i\} + S_k$, where $S_i$ represents the cost of the storage resources occupied by a related node and $S_k$ represents the storage cost of the target data.
332. Calculating the cost of the edges in the directed acyclic graph.
Specifically, the cost of an edge is its CPU and MEM cost, and according to the above description the edge cost is calculated as:
[Formula image BDA0002735612920000091]
where $N_{Lp}$ represents the number of edges related to the target data, $X_{pq}$ represents the cost of the resources consumed by each execution of each processing instruction, and $\mathrm{count}(L_x)$ represents the number of edges in the directed acyclic graph corresponding to each processing instruction.
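The edge-cost formula survives only as an image reference in this text. One reading that is consistent with the symbol definitions, offered here as an assumption and not as the patent's exact expression, apportions the cost of every execution of processing instruction $p$ by the share of that instruction's edges that lead to the target data:

```latex
\text{edge cost} \;\approx\; \sum_{p} \frac{N_{L_p}}{\mathrm{count}(L_{x_p})} \sum_{q} X_{pq}
```

Under this reading the fraction equals 1 whenever all edges of an instruction lead to the target data, so the full cost of every run is charged to the target; when only some of the edges are relevant, only the corresponding share is charged.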
Step 34, acquiring the costs of the edges and the nodes, and accumulating them to obtain the total cost of the target data.
When the SQL statement is multiple-input, single-output (insert … from …), $N_{Lp}$ equals $\mathrm{count}(L_x)$; when the SQL statement is multiple-input, multiple-output (from … insert … insert …), $N_{Lp}$ is less than $\mathrm{count}(L_x)$.
Accordingly, the total cost of the target data, i.e. the total data cost $C_k$ of node (table) K, is calculated as:
[Formula image BDA0002735612920000092]
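The difference between the two cases can be made concrete by counting edges. In the sketch below, the edge lists are assumptions: one statement reads A and Z and writes E, and another reads hypothetical tables A and B and writes both C and D. As a simplification, an edge is treated as related to the target when it points directly at the target table.

```python
def edge_counts(sql_edges, target):
    """Return (N_Lp, count(L_x)) for one SQL statement and one target table.

    Simplifying assumption: only edges that end at `target` count as related.
    """
    count_lx = len(sql_edges)
    n_lp = sum(1 for _, destination in sql_edges if destination == target)
    return n_lp, count_lx

# Multiple-input, single-output: insert into E ... from A join Z.
single_out = [("A", "E"), ("Z", "E")]

# Multiple-input, multiple-output: from A join B, insert into C, insert into D.
multi_out = [("A", "C"), ("B", "C"), ("A", "D"), ("B", "D")]

print(edge_counts(single_out, "E"))  # (2, 2)
print(edge_counts(multi_out, "C"))   # (2, 4)
```

In the second case only half of the statement's edges concern table C, which is why $N_{Lp}$ is smaller than $\mathrm{count}(L_x)$.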
Further, taking the data processing for node G in fig. 4 as an example, each SQL statement is multiple-input, single-output, and five SQL statements are involved:
A + Z → E is X1:
insert into table_E
select z.cust_id, a.bal
from table_A a
join table_Z z
on a.acct_no = z.acct_no;
Table-level data lineage can be generated from this SQL:
A → E is denoted $L_{AE}$ and Z → E is denoted $L_{ZE}$. The SQL corresponds to the two edges Z → E and A → E in the figure: the cust_id data in table E comes from table Z, and the bal data in table E comes from table A.
For $X_1$, $\mathrm{count}(L_{x1}) = 2$ and $N_{L1} = 2$. By analogy, B + Z → E is $X_2$, C + Z → E is $X_3$, and D + Z → E is $X_4$, each with $\mathrm{count}(L_x) = 2$ and $N_{Lp} = 2$.
E + F → G is $X_5$:
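The derivation of table-level edges from a statement such as $X_1$ can be sketched in Python. This is a deliberately simplified, regex-based illustration rather than the lexical analysis a production lineage tool would perform; the function name `table_lineage` is hypothetical, and the pattern only handles plain `insert into ... select ... from ... join ...` statements.

```python
import re


def table_lineage(sql):
    """Return (source, target) table pairs for a simple insert-select statement."""
    target = re.search(r"insert\s+into\s+(?:table\s+)?(\w+)", sql, re.I).group(1)
    sources = re.findall(r"(?:from|join)\s+(\w+)", sql, re.I)
    # dict.fromkeys keeps the first occurrence of each source and preserves order.
    return [(src, target) for src in dict.fromkeys(sources)]


sql_x1 = """
insert into table_E
select z.cust_id, a.bal
from table_A a
join table_Z z
on a.acct_no = z.acct_no;
"""

edges = table_lineage(sql_x1)
print(edges)       # the two edges A -> E and Z -> E
print(len(edges))  # count(L_x1)
```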
insert into table_G
select nvl(e.cust_id,f.cust_id) as cust_id,
sum(nvl(e.bal,0)+nvl(f.bal,0)) as bal
from table_E e
full outer join table_F f
on e.cust_id=f.cust_id
group by nvl(e.cust_id,f.cust_id);
For $X_5$, $\mathrm{count}(L_{x5}) = 2$ and $N_{L5} = 2$.
The data of table G is derived from tables A, B, C, D, Z, E, and F. Node Z appears multiple times in the DAG, and the storage cost of a node that appears multiple times should be deduplicated when calculating the cost, hence $\mathrm{distinct}\{S_i\}$ with $i \in \{A, B, C, D, Z, E, F\}$. Assuming each SQL statement has been executed 10 times (for example, multiple times per day), $q = 10$. Based on the above information, substituting into the formula gives the total cost $C_G$ of table G:
$$C_G=\sum_{i\in\{A,B,C,D,Z,E,F\}}\mathrm{distinct}\{S_i\}+S_G+\sum_{p=1}^{5}\sum_{q=1}^{10}\frac{2}{2}\,X_{pq}$$
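The table G example can be worked through numerically with a short Python sketch. The storage cost of 1 per table and the run cost of 2 per execution are hypothetical figures, and the sketch assumes run costs are weighted by N_Lp / count(L_x), which equals 2/2 for all five statements here.

```python
# Hypothetical storage cost per table (S_i), identical for simplicity.
storage_cost = {t: 1 for t in "ABCDZEFG"}

# Source tables as they are met when walking the five statements; Z repeats.
upstream_of_g = ["A", "Z", "B", "Z", "C", "Z", "D", "Z", "E", "F"]

# Each of X1..X5 has N_Lp = 2 and count(L_x) = 2, and was executed q = 10 times.
statements = [
    {"edges_to_target": 2, "edges_total": 2, "run_costs": [2] * 10}
    for _ in range(5)
]

storage = sum(storage_cost[t] for t in set(upstream_of_g)) + storage_cost["G"]
processing = sum(
    s["edges_to_target"] / s["edges_total"] * sum(s["run_costs"]) for s in statements
)
c_g = storage + processing
print(storage, processing, c_g)
```

With these figures the seven distinct source tables and G itself contribute 8, and the 50 executions contribute 100.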
Further, in the current big data environment, data is processed at table level, so table-level data cost can be calculated as described above. Field-level cost can be derived from it. For example, if table G in Fig. 4 contains 11 data fields, the cost of table G divided by 11 may be taken as the cost of each field. Alternatively, the cost can be apportioned by storage size: if each record in table G occupies 20 bytes, of which 10 fields store 1 byte each and the remaining field stores 10 bytes, then the 10-byte field accounts for 50% of the storage cost of table G and each of the other fields accounts for 5%. Record-level cost is calculated in a similar manner: if table G contains 100,000 records, the cost per record is $C_G/100000$.
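The field-level and record-level apportioning can be checked with a small Python sketch. The field names `f1` to `f11` and the table cost of 108 are hypothetical.

```python
def field_cost_shares(field_bytes):
    """Share of the table's storage cost carried by each field, proportional to its byte width."""
    total = sum(field_bytes.values())
    return {name: size / total for name, size in field_bytes.items()}


# Ten 1-byte fields plus one 10-byte field: 20 bytes per record.
field_bytes = {f"f{i}": 1 for i in range(1, 11)}
field_bytes["f11"] = 10

shares = field_cost_shares(field_bytes)
print(shares["f11"], shares["f1"])

# Record-level cost: table cost spread evenly over the record count.
c_g = 108.0                      # hypothetical table-level cost
cost_per_record = c_g / 100_000
print(cost_per_record)
```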
In another embodiment, the SQL statement is multiple-in, multiple-out. The multiple-in, multiple-out lineage is shown in Fig. 6, and the related processing SQL is as follows:
From table_A a
join table_B b
On a.id=b.id
Insert into table_C
Select a.id,a.bal+b.bal
Where a.type=1 and b.type=2
Insert into table_D
Select b.id,a.bal+b.bal
Where a.type=3 and b.type=4;
The SQL generates 4 edges, as shown in Fig. 6. Assume the cost of the resources consumed by a single execution of the SQL is $X_P$; then $\mathrm{count}(L_x) = 4$. When the processing cost of node D is calculated, only two edges are associated with node D, namely A → D and B → D, so $N_{Lp} = 2$. Assuming the SQL has been executed 10 times ($q = 10$), substituting into the formula gives the cost of node D after the 10 executions:
$$\sum_{q=1}^{10}\frac{N_{Lp}}{\mathrm{count}(L_x)}\,X_{P}=\sum_{q=1}^{10}\frac{2}{4}\,X_{P}=5\,X_{P}$$
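The node D example can be reproduced with a short Python sketch that counts edges directly from the lineage. It assumes the run cost is split by the ratio of edges reaching D to all edges of the statement; the single-execution cost of 8.0 is a hypothetical figure.

```python
# Edges generated by the multi-insert statement: two targets, two sources each.
edges = [("A", "C"), ("B", "C"), ("A", "D"), ("B", "D")]

count_lx = len(edges)                                 # count(L_x) = 4
n_lp = sum(1 for _, target in edges if target == "D")  # N_Lp = 2

x_p = 8.0                 # hypothetical cost of one execution
run_costs = [x_p] * 10    # q = 10 executions

processing_cost_d = n_lp / count_lx * sum(run_costs)
print(n_lp, count_lx, processing_cost_d)
```

Node C receives the other half of the run cost under the same rule, so the two targets together account for the full cost of the statement.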
According to the above description, steps 1 to 3 describe a data cost calculation method based on data lineage. The method can be applied to cost calculation for table-level and field-level data, and the record-level cost is obtained by averaging the table-level or field-level cost over the number of records. Specifically, a data processing procedure (SQL) corresponds to edges in the graph; because each edge of a batch process corresponds to multiple records in a table, the cost of data in the same table that is processed in multiple batches can be calculated as a mean.
Further, in one embodiment, the same SQL may consume a different amount of resources on each run because the data volume varies. For example, for A → E in Fig. 4, assume the first run costs 10 yuan and processes 10,000 records, and the second run costs 12 yuan and processes 14,000 records; the average processing cost over the 24,000 records is then (10 + 12) / 24000 ≈ 0.00092 yuan per record.
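The per-record average across batches can be verified with a few lines of Python, using the two runs given in the example.

```python
# (cost in yuan, records processed) for each run of the statement behind A -> E.
runs = [(10, 10_000), (12, 14_000)]

total_cost = sum(cost for cost, _ in runs)
total_records = sum(records for _, records in runs)

avg_cost_per_record = total_cost / total_records
print(total_records, avg_cost_per_record)
```

Dividing the summed cost by the summed record count weights each run by its volume, which differs from averaging the two per-run rates.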
In an alternative embodiment, the calculation result of the data-lineage-based data cost calculation method may also be uploaded to a blockchain.
Specifically, summary (digest) information is derived from the calculation result of the data-lineage-based data cost calculation method by hashing the result, for example with the SHA-256 algorithm. Uploading the digest to the blockchain ensures its security and provides fairness and transparency for users. A user can download the digest from the blockchain to verify whether the calculation result has been tampered with. The blockchain referred to in this example is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, in which each block contains information on a batch of network transactions that is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
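The digest step can be sketched in Python with the standard `hashlib` module. The shape of the result record and the helper names `result_digest` and `verify` are hypothetical, and the upload to a blockchain is not shown; only the hashing and the later comparison are illustrated.

```python
import hashlib
import json


def result_digest(result):
    """SHA-256 hex digest of a cost result, serialised deterministically."""
    payload = json.dumps(result, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def verify(result, published_digest):
    """Recompute the digest and compare it with the one retrieved from the chain."""
    return result_digest(result) == published_digest


# Hypothetical calculation result for table G.
result = {"table": "G", "total_cost": 108.0, "currency": "CNY"}
digest = result_digest(result)

tampered = dict(result, total_cost=1.0)
print(len(digest), verify(result, digest), verify(tampered, digest))
```

Sorting the keys before hashing keeps the digest stable regardless of the order in which the fields were assembled.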
The invention provides a data cost calculation method based on data lineage, which comprises: defining a data set and obtaining a directed acyclic graph generated from the data lineage; calculating the cost of the nodes and the cost of the edges related to the target data in the directed acyclic graph; and obtaining the costs of the edges and the nodes and accumulating them to obtain the total cost of the target data. Combining data lineage in this way allows data cost to be calculated and displayed at a finer granularity and makes the pricing of data applications more reasonable. It also provides a more detailed and reasonable reference for evaluating the value of data inside and outside the enterprise, makes it convenient to calculate cost at the finest data granularity, and allows the cost of each piece of data to be quantified. The invention also relates to blockchain technology.
As shown in Fig. 7, the present invention further provides a data cost calculation system based on data lineage, which can be integrated in the computer device 110 and may specifically include a data set module 20, an information module 30, a first calculation module 40, and a second calculation module 50.
The data set module 20 is configured to obtain the SQL statements or the processing scripts used in the data processing process, and to generate data lineage from those SQL statements or from the SQL statements contained in the processing scripts, where the data lineage forms a directed acyclic graph;
the information module 30 is configured to obtain statistical information and frequency information on task execution from the data platform and to map this information onto the directed acyclic graph;
the first calculation module 40 is configured to calculate the costs of the nodes and the edges related to the target data in the directed acyclic graph;
the second calculation module 50 is configured to obtain the costs of the edges and the nodes and accumulate them to obtain the total cost of the target data.
In one embodiment, the statistical information includes resource usage of each task, where the resource usage includes information such as storage usage, CPU usage, and memory usage; the frequency information includes information such as the historical execution times and the start and stop times of execution of the tasks.
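How the statistical and frequency information might feed the per-execution cost can be sketched in Python. This is only an illustration: the `TaskRun` and `UnitPrices` classes, the chosen usage units, and the prices are hypothetical, and the use of per-platform unit prices follows the unit price parameter mentioned in the claims.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    cpu_core_seconds: float   # CPU usage recorded for one execution
    memory_gb_seconds: float  # memory usage recorded for one execution


@dataclass
class UnitPrices:
    per_cpu_core_second: float
    per_memory_gb_second: float


def run_cost(run, prices):
    """Cost of one execution: usage multiplied by the platform's unit prices."""
    return (run.cpu_core_seconds * prices.per_cpu_core_second
            + run.memory_gb_seconds * prices.per_memory_gb_second)


prices = UnitPrices(per_cpu_core_second=0.01, per_memory_gb_second=0.002)
history = [TaskRun(120, 300), TaskRun(150, 400)]  # frequency info: two past runs

costs = [run_cost(r, prices) for r in history]
print(len(history), costs)
```

The length of the history plays the role of the execution count, and each element of `costs` is one per-execution cost that would be attached to the corresponding edges.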
In one embodiment, the first calculation module 40 is configured to calculate costs of nodes and edges associated with the target data in the directed acyclic graph.
In an embodiment, the cost of the nodes in the directed acyclic graph is calculated. Specifically, the cost of a node is its storage cost, and according to the above description the node cost is calculated as $\sum_i \mathrm{distinct}\{S_i\} + S_k$, where $S_i$ denotes the cost of the storage resources occupied by a related node and $S_k$ denotes the storage cost of the target data.
The cost of the edges in the directed acyclic graph is also calculated. Specifically, the cost of an edge is the cost of CPU and memory (MEM), and according to the above description the edge cost is calculated as:
$$\sum_{p}\sum_{q}\frac{N_{Lp}}{\mathrm{count}(L_x)}\,X_{pq}$$
where $N_{Lp}$ denotes the number of edges associated with the target data, $X_{pq}$ denotes the cost of the resources consumed by each execution of each processing instruction, and $\mathrm{count}(L_x)$ denotes the number of edges in the directed acyclic graph corresponding to each processing instruction.
Further, in one embodiment, the second calculation module 50 is configured to obtain the costs of the edges and the nodes and accumulate them to obtain the total cost of the target data.
When the SQL statement is multiple-in, one-out (insert … from …), $N_{Lp}$ and $\mathrm{count}(L_x)$ are equal; when the SQL statement is multiple-in, multiple-out (from … insert … insert …), $N_{Lp}$ is less than $\mathrm{count}(L_x)$.
Accordingly, the total cost of the target data, i.e., the total data cost $C_k$ of node (table) K, can be summarized by the following formula:
$$C_k=\sum_{i}\mathrm{distinct}\{S_i\}+S_k+\sum_{p}\sum_{q}\frac{N_{Lp}}{\mathrm{count}(L_x)}\,X_{pq}$$
In one embodiment, the data cost calculation system further includes a display module (not shown) for displaying the calculation result. The display module may be the display of a desktop computer or the display device of another computer device.
Referring to fig. 8, fig. 8 is a schematic structural diagram of an apparatus according to an embodiment of the present invention. As shown in fig. 8, the apparatus 200 includes a processor 201 and a memory 202 coupled to the processor 201.
The memory 202 stores program instructions for implementing the data-lineage-based data cost calculation method according to any of the above embodiments.
The processor 201 is used to execute program instructions stored by the memory 202.
The processor 201 may also be referred to as a Central Processing Unit (CPU). The processor 201 may be an integrated circuit chip having signal processing capabilities. The processor 201 may also be a general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. A general-purpose processor may be a microprocessor or any conventional processor.
Referring to Fig. 9, Fig. 9 is a schematic structural diagram of a storage medium according to an embodiment of the invention. The storage medium of the embodiment of the present invention stores a program file 301 capable of implementing all the methods described above. The program file 301 may be stored in the storage medium in the form of a software product and includes several instructions that enable a computer device (which may be a personal computer, a server, or a network device) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk, as well as terminal devices such as a computer, a server, a mobile phone, or a tablet.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, apparatus, article, or method that includes the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments. Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.

Claims (10)

1. A data cost calculation method based on data lineage, the data cost calculation method comprising:
acquiring the SQL statements or the processing scripts used in the data processing process, and generating data lineage from the SQL statements or from the SQL statements contained in the processing scripts, wherein the data lineage forms a directed acyclic graph;
acquiring statistical information and frequency information on task execution of a data platform, and mapping the statistical information and the frequency information onto the directed acyclic graph;
calculating the cost of the nodes and the cost of the edges related to target data in the directed acyclic graph;
and acquiring the costs of the edges and the nodes, and accumulating them to obtain the total cost of the target data.
2. The data cost calculation method of claim 1, wherein the statistical information includes resource usage amounts per task, the resource usage amounts including storage usage amounts, CPU usage amounts, and memory usage amounts; the frequency information includes historical execution times and start and stop times of execution of the tasks.
3. The data cost calculation method of claim 2, wherein a unit price parameter for resource usage is introduced for each data platform according to the differences between data platforms; in the calculation of the data cost, the cost of a node is the storage cost, and the cost of an edge is the cost of CPU and memory.
4. The data cost calculation method of claim 1, wherein calculating the cost of the nodes related to the target data in the directed acyclic graph comprises: $\sum_i \mathrm{distinct}\{S_i\} + S_k$, wherein $S_i$ denotes the cost of the storage resources occupied by a related node and $S_k$ denotes the storage cost of the target data;
and calculating the cost of the edges related to the target data in the directed acyclic graph comprises:
$$\sum_{p}\sum_{q}\frac{N_{Lp}}{\mathrm{count}(L_x)}\,X_{pq}$$
wherein $N_{Lp}$ denotes the number of edges associated with the target data, $X_{pq}$ denotes the cost of the resources consumed by each execution of each processing instruction, and $\mathrm{count}(L_x)$ denotes the number of edges in the directed acyclic graph corresponding to each processing instruction.
5. The data cost calculation method of claim 4, wherein obtaining the costs of the edges and the nodes and accumulating them to obtain the total cost of the target data comprises:
$$C_k=\sum_{i}\mathrm{distinct}\{S_i\}+S_k+\sum_{p}\sum_{q}\frac{N_{Lp}}{\mathrm{count}(L_x)}\,X_{pq}$$
wherein $C_k$ denotes the total cost of the target data.
6. The data cost calculation method of claim 1, wherein generating the data lineage from the SQL statements contained in the processing scripts, the data lineage forming a directed acyclic graph, comprises:
extracting regularized SQL statements from a script file containing SQL code, thereby completing the cleaning of the SQL statements;
and performing lexical analysis on the regularized SQL statements to generate the data lineage, and generating the directed acyclic graph from the data lineage.
7. The data cost calculation method of claim 1, wherein after the target total data cost is obtained, the target total data cost is uploaded into a blockchain, so that the blockchain performs encrypted storage on the target total data cost.
8. A data cost calculation system based on data lineage, the data cost calculation system comprising:
a data set module, configured to acquire the SQL statements or the processing scripts used in the data processing process and to generate data lineage from the SQL statements or from the SQL statements contained in the processing scripts, wherein the data lineage forms a directed acyclic graph;
an information module, configured to acquire statistical information and frequency information on task execution of the data platform and to map them onto the directed acyclic graph;
a first calculation module, configured to calculate the cost of the nodes and the cost of the edges related to target data in the directed acyclic graph;
and a second calculation module, configured to acquire the costs of the edges and the nodes and accumulate them to obtain the total cost of the target data.
9. A computer device comprising a memory and a processor, the memory having stored therein computer readable instructions which, when executed by the processor, cause the processor to carry out the steps of the data cost calculation method of any one of claims 1 to 7.
10. A storage medium storing a program file capable of implementing the data cost calculation method according to any one of claims 1 to 7.
CN202011132525.2A 2020-10-21 2020-10-21 Data cost calculation method, system, computer device and storage medium Active CN112256720B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011132525.2A CN112256720B (en) 2020-10-21 2020-10-21 Data cost calculation method, system, computer device and storage medium
PCT/CN2020/135737 WO2021174945A1 (en) 2020-10-21 2020-12-11 Data cost calculation method, system, computer device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011132525.2A CN112256720B (en) 2020-10-21 2020-10-21 Data cost calculation method, system, computer device and storage medium

Publications (2)

Publication Number Publication Date
CN112256720A true CN112256720A (en) 2021-01-22
CN112256720B CN112256720B (en) 2021-08-17

Family

ID=74264461

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011132525.2A Active CN112256720B (en) 2020-10-21 2020-10-21 Data cost calculation method, system, computer device and storage medium

Country Status (2)

Country Link
CN (1) CN112256720B (en)
WO (1) WO2021174945A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114064640A (en) * 2021-11-09 2022-02-18 珠海市新德汇信息技术有限公司 Blood relationship construction method, storage medium and equipment applied to data tracing
CN115511644A (en) * 2022-08-29 2022-12-23 易保网络技术(上海)有限公司 Processing method for target policy, electronic device and readable storage medium

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113868253B (en) * 2021-09-28 2024-04-23 中通服创立信息科技有限责任公司 Data relationship capturing and big data relationship tree construction method
CN114254081B (en) * 2021-12-22 2024-06-04 中冶赛迪信息技术(重庆)有限公司 Enterprise big data search system, method and electronic equipment
CN114090018B (en) * 2022-01-25 2022-05-24 树根互联股份有限公司 Index calculation method and device of industrial internet equipment and electronic equipment
CN114428822B (en) * 2022-01-27 2022-07-29 云启智慧科技有限公司 Data processing method and device, electronic equipment and storage medium
CN117076095B (en) * 2023-10-16 2024-02-09 华芯巨数(杭州)微电子有限公司 Task scheduling method, system, electronic equipment and storage medium based on DAG

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000045293A1 (en) * 1999-01-28 2000-08-03 Universite Pierre Et Marie Curie (Paris Vi) Method for generating multimedia document descriptions and device associated therewith
CN107644073A (en) * 2017-09-18 2018-01-30 广东中标数据科技股份有限公司 A kind of field consanguinity analysis method, system and device based on depth-first traversal
CN108446383A (en) * 2018-03-21 2018-08-24 吉林大学 A kind of data task redistribution method based on geographically distributed data query
CN111694858A (en) * 2020-04-28 2020-09-22 平安科技(深圳)有限公司 Data blood margin analysis method, device, equipment and computer readable storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100153431A1 (en) * 2008-12-11 2010-06-17 Louis Burger Alert triggered statistics collections
CN106991101B (en) * 2016-01-21 2021-02-02 阿里巴巴集团控股有限公司 Data table analysis processing method and device
CN109325078A (en) * 2018-09-18 2019-02-12 拉扎斯网络科技(上海)有限公司 Method and device is determined based on the data blood relationship of structured data
CN111125269B (en) * 2019-12-31 2023-05-02 腾讯科技(深圳)有限公司 Data management method, blood relationship display method and related device
CN111652652B (en) * 2020-06-09 2022-11-22 苏宁云计算有限公司 Cost calculation method and device for calculation platform, computer equipment and storage medium

Also Published As

Publication number Publication date
WO2021174945A1 (en) 2021-09-10
CN112256720B (en) 2021-08-17

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant