CN115328883B - Data warehouse modeling method and system - Google Patents

Data warehouse modeling method and system Download PDF

Info

Publication number
CN115328883B
CN115328883B CN202210749147.5A CN202210749147A CN115328883B CN 115328883 B CN115328883 B CN 115328883B CN 202210749147 A CN202210749147 A CN 202210749147A CN 115328883 B CN115328883 B CN 115328883B
Authority
CN
China
Prior art keywords
key
fields
field
association
path
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210749147.5A
Other languages
Chinese (zh)
Other versions
CN115328883A (en
Inventor
杨立才
邵宏力
刘磊
李云
邓知知
胡超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Kunshan Rural Commercial Bank Co ltd
Original Assignee
Jiangsu Kunshan Rural Commercial Bank Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Kunshan Rural Commercial Bank Co ltd filed Critical Jiangsu Kunshan Rural Commercial Bank Co ltd
Priority to CN202210749147.5A priority Critical patent/CN115328883B/en
Publication of CN115328883A publication Critical patent/CN115328883A/en
Application granted granted Critical
Publication of CN115328883B publication Critical patent/CN115328883B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/211Schema design and management
    • G06F16/212Schema design and management with details for data modelling support
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2228Indexing structures
    • G06F16/2246Trees, e.g. B+trees
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/283Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a data warehouse modeling method and system. Displaying each table and the external key relation among the tables in a visual graph structure form to serve as a table level knowledge graph, operating a minimum spanning tree algorithm based on the table level knowledge graph to obtain each path from the main table to each expected association table, and selecting a path with the shortest path length as an optimal path; acquiring an association mode and an association field relation between a main table and an expected association table according to the optimal path; simultaneously acquiring optional fields in all expected association tables; removing unnecessary fields for modeling of the data warehouse from the optional fields, reserving the required fields, and determining processing rules of the reserved fields; and establishing a table serving as a model in a corresponding data warehouse according to the association mode, the association field relation, the fields and the processing rules. The invention improves modeling quality.

Description

Data warehouse modeling method and system
Technical Field
The invention belongs to the technical field of business intelligence, and particularly relates to a data warehouse modeling method and system.
Background
In the prior art, when data warehouse modeling is performed on table structure data, relationships among tables are manually analyzed in a manner of field naming specification, manual discrimination, compiling a PDM file and the like, and then modeling is performed. Clearly, the prior art relies heavily on manual labor, takes much time and effort, and if field naming is not standard, it is very difficult for analysts to build a data warehouse model for a data table if they are unfamiliar with the data structure, relationships, or lack ER modeling files or description documents. Even if modeling is constructed, the quality of the modeling is difficult to guarantee, and problems such as association errors, incomplete coverage and the like are likely to exist.
Disclosure of Invention
The invention provides a data warehouse modeling method and a system.
In order to solve the technical problems in the prior art, the invention provides a data warehouse modeling method system, which comprises the following steps: function dependency relationship in each table, main key of each table, external key and external key relationship;
Displaying each table and the external key relation among the tables in a visual graph structure form to be used as a table level knowledge graph; the table level knowledge graph comprises nodes and edges, wherein each node represents a table, and each edge represents an external key relation;
Determining a main table and expected association tables expected to be associated with the main table, running a minimum spanning tree algorithm based on a table level knowledge graph, obtaining each path from the main table to each expected association table, and selecting a path with the shortest path length as an optimal path; the path with the shortest path length refers to the path with the least edges between the main table and the expected association table;
According to the optimal path, acquiring an association mode and an association field relation between the main table and the expected association table, wherein the association mode is an external key relation; simultaneously acquiring optional fields in all expected association tables;
Removing unnecessary fields for modeling of the data warehouse from the optional fields, reserving the required fields, and determining processing rules of the reserved fields;
And establishing a table serving as a model in a corresponding data warehouse according to the association mode, the association field relation, the fields and the processing rules.
As a preferred implementation manner, after the table-level knowledge graph is established, calculating the dimension value of the foreign key field, and storing the dimension value in the attribute of the edge of the foreign key relation corresponding to the table-level knowledge graph as the weight of the edge; the dimension value refers to the number of lines of the outer key field after duplication removal; when the path with the shortest path length found by using the minimum spanning tree algorithm is a plurality of paths, the path with the largest sum of path weights is selected as the optimal path.
As a preferred implementation, the process of obtaining the function dependency relationship in the table is as follows: acquiring the table names of all tables in a database and the field names in all tables; for each table, analyzing the characteristics of each field according to the values of the fields in the table; calculating and obtaining the function dependency relationship among the fields in the table as the function dependency relationship in the table according to the table name, the field name and the field value for each table; the features of the fields include qualitative features and quantitative features; the qualitative feature comprises a data type of the field and the quantitative feature comprises a length of the field.
As a preferred embodiment, the procedure for obtaining the foreign key relation is as follows: and identifying the main key of each table according to the function dependency relationship in the table, searching and determining the corresponding external key in other tables according to the characteristics of the main key, and forming an external key relationship between the main key and the external key.
In a preferred embodiment, when searching and determining the foreign key, fields in other tables, which are matched with the data type of the main key and the field length, are used as the foreign key, wherein the fields matched with the data type of the main key and the field length refer to that the data type of the field is the same as the data type of the main key, and the minimum length of the field is greater than or equal to the minimum length of the main key, and the maximum length of the field is less than or equal to the maximum length of the main key.
As a preferred embodiment, two tables are arbitrarily selected from the service main table and all tables which are expected to be associated, a shortest path algorithm is used for calculating all optional shortest paths which are optional between the two tables, and an optional shortest path is used for replacing the foreign key associated path.
Another aspect of the present invention also provides a data warehouse modeling system, including: a processor; a database; and a memory in which a program is stored, a database storing tables,
Wherein when the processor executes the program, the following operations are performed:
Displaying each table and the external key relation among the tables in a visual graph structure form to be used as a table level knowledge graph; the table level knowledge graph comprises nodes and edges, wherein each node represents a table, and each edge represents an external key relation;
Determining a main table and expected association tables expected to be associated with the main table, running a minimum spanning tree algorithm based on a table level knowledge graph, obtaining each path from the main table to each expected association table, and selecting a path with the shortest path length as an optimal path; the path with the shortest path length refers to the path with the least edges between the main table and the expected association table;
According to the optimal path, acquiring an association mode and an association field relation between the main table and the expected association table, wherein the association mode is an external key relation; simultaneously acquiring optional fields in all expected association tables;
Removing unnecessary fields for modeling of the data warehouse from the optional fields, reserving the required fields, and determining processing rules of the reserved fields;
And establishing a table serving as a model in a corresponding data warehouse according to the association mode, the association field relation, the fields and the processing rules.
Compared with the prior art, the invention has the remarkable advantages that:
When the model is built in the business intelligence and data warehouse system, the invention can automatically calculate the association relation between the tables by manually selecting the main table and the expected association table and assist in generating the model. The phenomena of low manual analysis efficiency and high error rate are avoided, and the modeling quality is improved.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
Drawings
FIG. 1 is a schematic flow chart of one embodiment of the present invention.
FIG. 2 is a schematic diagram of a level knowledge graph in the present invention.
Fig. 3 is a schematic diagram of association of a transfer information flow meter according to an embodiment of the invention.
Fig. 4 is a diagram illustrating a path selection in one embodiment of the invention.
Detailed Description
It is easy to understand that various embodiments of the present application can be envisioned by those of ordinary skill in the art without altering the true spirit of the present application in light of the present teachings. Accordingly, the following detailed description and drawings are merely illustrative of the application and are not intended to be exhaustive or to limit or restrict the application. Rather, these embodiments are provided so that this disclosure will be thorough and complete by those skilled in the art. Preferred embodiments of the present application are described in detail below with reference to the attached drawing figures, which form a part of the present application and are used in conjunction with the embodiments of the present application to illustrate the innovative concepts of the present application.
According to the data warehouse modeling method and system, the association relation of the data warehouse model can be generated only by obtaining the foreign key relation through data analysis, the association relation comprises an association (join) mode and an association (join) condition, and the data warehouse model development is assisted.
The invention relates to a data warehouse modeling method system, which comprises the following steps: function dependency relationship in each table, main key of each table, external key and external key relationship;
Displaying each table and the external key relation among the tables in a visual graph structure form to be used as a table level knowledge graph; the table level knowledge graph comprises nodes and edges, wherein each node represents a table, and each edge represents an external key relation;
Determining a main table and expected association tables expected to be associated with the main table, running a minimum spanning tree algorithm based on a table level knowledge graph, obtaining each path from the main table to each expected association table, and selecting a path with the shortest path length as an optimal path; the path with the shortest path length refers to the path with the least edges between the main table and the expected association table;
According to the optimal path, acquiring an association mode and an association field relation between the main table and the expected association table, wherein the association mode is an external key relation; simultaneously acquiring optional fields in all expected association tables;
Removing unnecessary fields for modeling of the data warehouse from the optional fields, reserving the required fields, and determining processing rules of the reserved fields;
And establishing a table serving as a model in a corresponding data warehouse according to the association mode, the association field relation, the fields and the processing rules.
As a preferred implementation manner, after the table-level knowledge graph is established, calculating the dimension value of the foreign key field, and storing the dimension value in the attribute of the edge of the foreign key relation corresponding to the table-level knowledge graph as the weight of the edge; the dimension value refers to the number of lines of the outer key field after duplication removal; when the path with the shortest path length found by using the minimum spanning tree algorithm is a plurality of paths, the path with the largest sum of path weights is selected as the optimal path.
As a preferred implementation, the process of obtaining the function dependency relationship in the table is as follows: acquiring the table names of all tables in a database and the field names in all tables; for each table, analyzing the characteristics of each field according to the values of the fields in the table; calculating and obtaining the function dependency relationship among the fields in the table as the function dependency relationship in the table according to the table name, the field name and the field value for each table; the features of the fields include qualitative features and quantitative features; the qualitative feature comprises a data type of the field and the quantitative feature comprises a length of the field.
As a preferred embodiment, the procedure for obtaining the foreign key relation is as follows: and identifying the main key of each table according to the function dependency relationship in the table, searching and determining the corresponding external key in other tables according to the characteristics of the main key, and forming an external key relationship between the main key and the external key.
In a preferred embodiment, when searching and determining the foreign key, fields in other tables, which are matched with the data type of the main key and the field length, are used as the foreign key, wherein the fields matched with the data type of the main key and the field length refer to that the data type of the field is the same as the data type of the main key, and the minimum length of the field is greater than or equal to the minimum length of the main key, and the maximum length of the field is less than or equal to the maximum length of the main key.
As a preferred embodiment, two tables are arbitrarily selected from the service main table and all tables which are expected to be associated, a shortest path algorithm is used for calculating all optional shortest paths which are optional between the two tables, and an optional shortest path is used for replacing the foreign key associated path.
Another aspect of the present invention also provides a data warehouse modeling system, including: a processor; a database; and a memory in which a program is stored, a database storing tables,
Wherein when the processor executes the program, the following operations are performed:
Displaying each table and the external key relation among the tables in a visual graph structure form to be used as a table level knowledge graph; the table level knowledge graph comprises nodes and edges, wherein each node represents a table, and each edge represents an external key relation;
Determining a main table and expected association tables expected to be associated with the main table, running a minimum spanning tree algorithm based on a table level knowledge graph, obtaining each path from the main table to each expected association table, and selecting a path with the shortest path length as an optimal path; the path with the shortest path length refers to the path with the least edges between the main table and the expected association table;
According to the optimal path, acquiring an association mode and an association field relation between the main table and the expected association table, wherein the association mode is an external key relation; simultaneously acquiring optional fields in all expected association tables;
Removing unnecessary fields for modeling of the data warehouse from the optional fields, reserving the required fields, and determining processing rules of the reserved fields;
And establishing a table serving as a model in a corresponding data warehouse according to the association mode, the association field relation, the fields and the processing rules.
The method and system for constructing a data warehouse model described herein will be described in detail with reference to a specific embodiment. In practice, to facilitate the storage of the results of the calculations obtained by each step, a series of tables are created in the computing system to store the result data of each step. Of course, in actual operation, various tools such as text documents may be used to store the calculation results of the respective steps. As one example, the following series of data tables may be used in the construction of the data warehouse model to store the calculation results of each step:
Table 1 table LIST table TABLES _list;
table 2 field inventory table COLUMNS _list;
table 3 FUNCTION dependent record table function_ DEPENDENCY _tab;
Table 4 foreign key record table fk_info_tab;
TABLE 5 TABLE relationship information TABLE MAP TABLE RELATED;
table 6 field MAPPING information table map_mapping_col_info.
The constructed form template can be pre-placed in a storage device of the system. As shown in fig. 1, the method for constructing a data warehouse model according to the present embodiment includes the following steps:
S1, obtaining table names of all data tables used for constructing a data warehouse, and storing the table names into an obtained table list table.
The list data reading device reads the list of all the tables from the data warehouse, and the list information is stored in the preset storage device to form the list table of the all the data warehouse.
Table 1 shows a list of all tables read from the database.
Table 1 table LIST table TABLES _list (part of the example)
SYS_CODE TABLE_CODE COMMENT
S03 ods.ods_s03_acc_accp Silver-colored cushion cap account
S03 ods.ods_s03_ctr_loan_cont Contract master form
S03 ods.ods_s03_prd_bank_info Bank information
S55 ods.ods_s55_bt_discount_batch Post buying batch
S58 ods.ods_s58_m_ci_customer Customer basic information table
S58 ods.ods_s58_m_ci_person Personal customer information master table
S57 ods.ods_s57_tb_fss_transbook Transfer information flow table
The meanings of the items in Table 1 are as follows:
the sys_code is a service system number, and the service system is each working system used by a certain unit, for example, a certain bank has a loan system, a proxy wage system, and the like, and data in these service systems are stored in a data warehouse in the form of a table.
TABLE_CODE is the English name of the TABLE in the data warehouse.
COMMENT is the Chinese name of each table. The chinese names shown in the component column are for more convenience of illustration, and in practical implementation, the data information of the chinese names need not be included.
S2, obtaining the field of each table and storing the fields in a field LIST table COLUMNS _LIST.
The table data reading device reads meta information in each table from information stored in the data warehouse, and stores the meta information in a field LIST table template preset in the storage device to form a field LIST table COLUMNS _list. A part of the field list table is shown in table 2.
TABLE 2 field inventory Table COLUMNS _LIST (part of the example)
SYS_CODE TABLE_CODE COL_NUM COL_CODE COMMENT
S58 ods.ods_s58_m_ci_person 1 cust_no Customer number
S58 ods.ods_s58m_ci_person 2 cust_name Customer name
S58 ods.ods_s58_m_ci_person 3 cust_eng_name English name of customer
S58 ods.ods_s58_m_ci_person 4 py_name Pinyin name
The meanings of the items in Table 2 are as follows:
SYS_CODE is the service system number;
TABLE_CODE is the English name of the TABLE in the data warehouse;
COL_NUM is a field number;
COL_CODE is a field name;
COMMENT is the Chinese name of each field. The chinese names shown in the component list are for more convenience of illustration, and in practical implementation, the column data information of the chinese names need not be included.
S3, calculating to obtain the function dependency relationship in the table, identifying a main key according to the function dependency relationship in the table, obtaining an external key according to the main key, and forming an external key relationship between the main key and the external key.
And carrying out function dependency analysis on each table in the table list to obtain the function dependency relationship in the table. In the prior art, a plurality of algorithms can calculate and obtain function dependency relations, identify a main key according to the function dependency relations in the table, and obtain an external key according to the main key. The following is one embodiment of this step.
First, for each table, according to the table name, the field name and the field value, the function dependency relationship between each field in the same table is calculated and obtained. In this embodiment, a table with a dimension division FLAG dim_flag of 1 in the table list configuration table analog_conf_tab is obtained, and then a function dependency ANALYSIS is performed on the table requiring the function dependency ANALYSIS, so as to obtain a function dependency relationship in the table. The intra-table FUNCTION dependency relationship is stored in the FUNCTION dependency record table function_ DEPENDENCY _tab shown in table 3. Of course, as another embodiment, the dimension division FLAG dim_flag is not considered or set, and instead, a function dependency analysis is performed on all tables. There are various algorithms in the prior art that can calculate the function dependency.
TABLE 3 FUNCTION dependent record table FUNCTION_ DEPENDENCY _TAB (part of an example)
Table 3 is merely an example of intra-table function dependencies for a portion of the tables in this embodiment.
Next, for each table, the primary key of each table is identified from the intra-table function dependency. In this embodiment, the primary key is a candidate code set, which may be one or more candidate codes, and the present invention is collectively referred to as a primary key. After the function dependence relation calculation is completed, the method selects a relation set of rows with the function dependence deduction LEVEL FD_LEVEL less than or equal to 3 in the table 6, and adopts a method of quickly solving candidate codes to solve the candidate codes. This can limit a certain range and reduce the calculation pressure.
And finally, searching and determining the external keys corresponding to the main key in other tables according to the characteristics of the main key, and forming an external key relation between the main key and the external key. This step is carried out by:
traversing the obtained primary keys in turn by a table relation analysis and generation device in the system, and generating a corresponding bloom filter for the value of each primary key by a Hash method. In particular, for the federated primary key, the data (the value of each primary key) is federated to build the corresponding bloom filter.
For any two tables, such as table a and table B, the fields in table B that are likely to be foreign keys are selected based on the qualitative and quantitative characteristics of the primary keys in table a. For example, the fields that are likely to be foreign keys are filtered out based on the data TYPE col_type in the qualitative feature and the field length in the quantitative feature (max_len being the maximum length of the field, min_len being the minimum length of the field). When the data type of a certain field in the table B is a subset of the data type of the field of the primary key in the table a and the length range of the field is within the length range of the primary key field, the data type of the field can be selected as the field possibly serving as the foreign key. For example, if the primary key value in table a is numeric data with a length of 14-18 bits, the characteristics of the fields in table B that may be foreign keys need to fit into the fields with a minimum length of 14 or more, a maximum length of 18 or less, and a numeric data type.
The data which is likely to be the foreign key field is compared with the bloom filter obtained in S71 to obtain the main foreign key data coincidence rate, and if the coincidence rate is 100%, the data is the finally determined foreign key. In one embodiment, when the primary foreign key data overlap ratio is greater than the threshold value, the primary foreign key data overlap ratio may be used as the final foreign key. For example, considering the data quality problem, if necessary, a certain fault tolerance is considered for the primary foreign key coincidence rate, and when the primary foreign key coincidence rate of the contrast field existing in the bloom filter is greater than 98%, the primary foreign key coincidence rate can be identified as a foreign key. After the foreign key is determined, a main foreign key relation is formed between the main key and the foreign key, and the main foreign key relation information is stored in a table preset in a storage device, so that a foreign key record table is formed. The foreign key record table fk_info_tab is shown in table 4. In this embodiment, table 4 shows no joint foreign key since the joint primary key is not included.
TABLE 4 foreign key record Table FK_INFO_TAB
FK_SYS_CODE S03 S03 S03 S03
FK_TABLE_CODE prd_bank_info prd_bank_info prd_bank_info prd_bank_info
FK_COL_CODE bank_no bank_no bank_no bank_no
SYS_CODE S03 S55 S55 S57
TABLE_CODE acc_accp bt_discount_batch bt_discount_batch tb_fss_transbook
COL_CODE aorg_no s_mbfebankcode s_inacc_banknum payeebankno
DATA_RATE 1.000000 1.000000 1.000000 1.000000
The meaning of each entry in the foreign key record table fk_info_tab shown in table 4 is as follows:
fk_sys_code is the number of the main service system.
FK_TABLE_CODE is the parent TABLE name in the foreign key relationship.
FK_COL_CODE is the name of the parent table primary key field in the foreign key relationship.
SYS_CODE is the business system number of the sub-table in the foreign key relation.
TABLE_CODE is the name of the sub-TABLE in the foreign key relationship.
COL_CODE is the foreign key field name in the foreign key relationship.
Data_rate is the foreign key overlap ratio. I.e. the proportion of foreign key fields present in the primary key field. For example, when the value is 0.99, it is indicated that 99% of the data in the foreign key field exists in the primary key. This value should be 1 when foreign key constraints are present, but in practical applications, sometimes to ensure system performance, foreign key constraints are not established, and dirty data may occur. This value serves the function of tolerating real dirty data to some extent.
And S4, displaying each table and the external key relation among the tables in a visual graph structure form to serve as a table level knowledge graph.
After the external key relation is obtained, the tables in the database and the external key relation among the tables are stored in a graph database preset in a storage device in a graph structure mode, and a visual table level knowledge graph which can be convenient to inquire is formed.
The table level knowledge graph is shown in figure 2. The table level knowledge graph comprises 1 node and 1 side, wherein the circular node represents a table, and each node stores information representing the table and comprises basic meta information and related characteristic information of the table, such as English name, field number, table annotation (Chinese name), table number and the like. In each item of information, other information except the table english name may be used as the preferable addition information, and the node may or may not store the information. The table level knowledge graph only comprises one relation of external keys, and is represented as an edge which is connected with two nodes and is represented by an arrow in fig. 2, FK marked on the edge represents the relation of the external keys, each edge is a directed edge, wherein the node which starts is a table which belongs to a main key, and the node pointed by the arrow is a table which belongs to the external key. And each side also stores external key relation information, such as English name of the main key field, english name of the external key field, and main external key coincidence rate.
Preferably, because the foreign key may be a joint foreign key, the primary key and the foreign key are stored in the edge field storage by adopting a list, and the fields with the same subscript have association, so that the field mapping relation of the joint foreign key is completely stored. If the combined main key consists of two fields of C1 and C2, the external key fields are two fields of C3 and C4, the external key relation consists of a mapping relation of C1-C3 and a mapping relation of C2-C4, and in order to ensure that the corresponding relation is correct, list storage is adopted, and because the list has subscripts, the fields with the same subscripts indicate that the mapping relation exists.
The embodiment may store the obtained foreign key relationship in a database in a storage device, for example, a neo4j graph database, and then display the foreign key relationship in the form of graph data. It should be understood that neo4j is only one type of graph database, and that a common graph database also includes ArangoDB, orientDB, janusGraph or the like graph databases that can exhibit foreign key relationships in the form of graph data.
S5, calculating the dimension value of the foreign key field, and storing the dimension value in the attribute of the edge corresponding to the foreign key relation of the table level knowledge graph as the weight value of the edge.
The dimension value refers to the number of rows of the foreign key field after de-duplication.
In this embodiment, the dimension value of the foreign key field is calculated and stored in the attribute of the edge corresponding to the foreign key relationship in the neo4j graph database, and is used as the weight of the edge.
Preferably, this step may be omitted and the weight defaults to 1.
Preferably, the weight may be the inverse of the dimension of the foreign key field multiplied by 1000000, so as to be suitable for the original minimum spanning tree algorithm, and simplify the calculation. At this time, the smaller the weight, the larger the weight, and the coefficient of 1000000 solves the problem of too long decimal places caused when the data in the table is excessive.
S6, the user selects the appointed business main table and other tables expected to be associated with the main table on the display device.
The business master table and the expected association table have been shown as nodes in a table level knowledge graph.
For example, if it is desired to create a warehouse model that is based on transfer information pipelining and supplements relevant personal client information, the transfer information pipelining table ods, ods_s57_tb_ fss _ transbook may be selected as the master table, and the personal client information master table ods, ods_s58_m_ci_person may be selected as the desired association table.
And S7, processing the service main table and the expected association table selected by the user by using a minimum spanning tree algorithm based on the external key relation displayed in the form of graph data to obtain each path from the service main table to each expected association table. And then selecting a path with the shortest path length as the optimal path, wherein the path with the shortest path length is the path with the least number of edges between the tables.
And in the process of running the minimum spanning tree algorithm, when a plurality of shortest paths exist, selecting the path with the smallest sum of the path weights as the optimal path. I.e., the native minimum spanning tree algorithm, the path of the minimum weight is calculated.
If step S5 is not passed, that is, if the edges in the path have no weight, or the weights are all default to 1, a path with the shortest path length is selected.
In this embodiment, the path is composed of a plurality of edges (foreign key relationships), and each edge (foreign key relationship) is a field reference relationship and is presented as a tree-like relationship.
Fig. 3 shows the association relationship of the transfer information flow water meter, namely the external key relationship. In fig. 3, two paths (regardless of direction) illustrate a partial foreign key relationship of the master transfer information flow table ods.ods_s57_tb_ fss _ transbook, in which,
A first strip: ods_s58_m_ci_person — > ods_s57_tb_ fss _ transbook;
And a second strip:
ods.ods_s57_tb_fss_transbook<——ods.ods_s58_m_ci_customer——>ods.ods_s58_ m_ci_person,
The first path length is 1, the second path length is 2, and the first path is selected for association because the length of the first path is shortest.
S8, according to the optimal path, the association mode (namely, the foreign key relation, one path in the figure) and the association field relation between the service main table and the expected association table are obtained, and meanwhile, the optional field information (field information in the path) of all expected association tables is read from the table 2 and displayed on the display device.
Preferably, directed graph presentation returns can be used in an associative manner. Fig. 4 shows an example of path selection.
Finally, a transfer information flow water meter ods, ods_s57_tb_ fss _ transbook is selected as a main table through a minimum spanning tree algorithm, a personal customer information main table ods, ods_s58_m_ci_person is taken as the shortest path of a desired association table, and the path length is 1. Wherein the optional fields are all the fields (only part of TABLE 2) of TABLE 2 field LIST TABLE COLUMNS _list with table_code being ods.ods_s57_tb_ fss _ transbook, ods.ods_s58_m_ci_person.
S9, the user selects the fields to be reserved and the corresponding field processing rules from the optional fields, and unnecessary fields are removed.
Typically in data warehouse model construction, not all fields are required, so that the fields that need to be reserved can be selected from the optional fields, removing unwanted fields.
The processing rule refers to an ETL processing mode, for example, the processing rule is substring (idno, 1, 6) when the first 6 bits of the identification number field idno are intercepted.
Preferably, since the two tables may contain a plurality of foreign key relations, the two tables can be arbitrarily selected from the service main table and all tables which are expected to be associated, a shortest path algorithm is used for calculating all the shortest paths selectable between the two tables, and the selectable shortest paths are used for replacing foreign key association paths. I.e. adding or replacing the association mode and association field relation between the service main table and the table with which association is desired. For example, when three associated paths of A, C, B, A, D, B, A, C, F and B actually exist, the two paths A, C, B, A, D and B are shortest, and any one of the two paths can be selected.
Preferably, the user can modify and adjust the association condition by himself. Such as: wherein the transaction status field transtatus in the transfer information stream meter ods_s57_tb_ fss _ transbook has both a success (S) and a failure (F), when the service needs to ignore the failed transaction, the association condition may be modified from automatically generated t1.custno=t2.custno to t1.custno=t2.custno and t2.trans-status= 'S'.
Preferably, the user can add the data filtering condition by himself.
The proportion of the external key relation is 99.895%, it is easy to understand that 99.895% of the client numbers in the transfer information flow water meter ods.ods_s57_tb_ fss _ transbook are client numbers in the personal client information main table ods.ods_s58_m_ci_person, namely, a small amount of other clients such as public clients are included, and the data filtering condition T2.cust_ nois not null can be added because a model which takes transfer information flow as a main supplement to related personal client information is expected to be built;
Preferably, if the model needs to be previewed, the preview function starts, an SQL sentence is generated by using a correlation mode, a correlation field relation, a field and a processing rule, and the data warehouse preview model is generated in a browsing way, and when the result does not accord with the expectations, the steps 6, 7, 8 and 9 can be repeated until the model accords with the expectations.
S10, after the steps are finished, the user confirms that the steps are finished, and the generated processing rules are recorded in a table relation information table shown in a table 5 and a field mapping information table shown in a table 6 which are preset in a storage device.
TABLE 5 Table relationship information Table MAP_TABLE_RELATED (partial example)
TABLE 5 TABLE relation information TABLE MAP TABLE RELATED stores path association information, i.e., association manner and association field relation between the service main TABLE and the TABLE to be associated in step S8.
TABLE 5 TABLE relationship information TABLE MAP TABLE RELATED,
Job_code is a JOB number, and rows with equal JOB numbers represent an association relationship between an original table set and an original table contained in one model table.
JOB _ NAME is the JOB NAME,
DATABASE _ ENV is the target DATABASE Schema,
JOIN _ TABLE is the TABLE name,
The TABLE _ ALIAS is a TABLE designation,
Join_type is a correlation, where join_type is 4 correlation, 1 represents an inner connection, 2 represents an outer left connection, 3 represents an outer right connection, and 4 represents an outer full connection;
join_condition is an association rule.
TABLE 6 field MAPPING information table MAP_MAPPING_COL_INFO
Table 6 field MAPPING information table map_mapping_col_info stores the fields to be reserved and the processing rules in step 9.
Table 6 in the field mapping information table,
JOB CODE is the JOB number,
JOB _ NAME is the JOB NAME,
DATABASE _ ENV is the target DATABASE Schema,
TAB _ NAME is the target table NAME,
COL NO is the target table field sequence number,
COL NAME is the destination field NAME,
COL _ component is the target field annotation,
TYPE _ NAME is the TYPE of the target field,
COL LENGTH is the target field LENGTH,
COL _ SCALE is the target field precision,
Is_pk IS a primary key or not,
SOURE TAB NAME as the source table NAME,
SOURE _ COL _ VALUE is the source table field name,
SOURCE _ ALIAS is the SOURCE table field ALIAS,
SOURE TYPE is the source table field TYPE,
SOURE LENGTH is the source table field LENGTH,
SOURE _ SCALE is the source table field precision,
COL_MAPPING is a field MAPPING relationship.
Finally, the system automatically reads tables 5 and 6 and generates job configuration, and builds corresponding data warehouse model tables in the data warehouse.
The structure of each table in the above embodiments is merely an example, and in actual operation, each column data item is not necessarily only each item shown in each table in the above embodiments, and other item data may be also available.
The above description is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto,
Any changes or substitutions that would be easily recognized by those skilled in the art within the technical scope of the present disclosure are intended to be covered by the present invention.
It should be appreciated that in the above description of exemplary embodiments of the invention, various features of the invention are sometimes described in the context of a single embodiment or with reference to a single figure in order to streamline the invention and aid those skilled in the art in understanding the various aspects of the invention. The present invention should not be construed as including the features of the exemplary embodiments that are essential to the patent claims.
Those skilled in the art will appreciate that all or part of the flow of the methods of the embodiments described above may be accomplished by computer programs, which may be stored in a computer readable storage medium, instructing the relevant hardware. Wherein the computer readable storage medium is a magnetic disk, an optical disk, a read-only memory or a random access memory, etc.
It should be understood that the devices, modules, units, components, etc. included in the system of one embodiment of the invention may be adaptively changed to arrange them in a device or system different from the embodiment. The system of the embodiments may include different devices, modules, units or components combined into one device, module, unit or component, or they may be divided into a plurality of sub-devices, sub-modules, sub-units or sub-components.
The apparatus, modules, units, or components of embodiments of the invention may be implemented in hardware, in software running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that embodiments in accordance with the present invention may be implemented in practice using a microprocessor or Digital Signal Processor (DSP). The present invention can also be implemented as a computer program product or a computer readable medium for carrying out a part or all of the methods described herein.

Claims (10)

1. A data warehouse modeling method is characterized by comprising function dependency relations in each table, main keys of each table, external keys and external key relations;
displaying each table and the external key relation among the tables in a visual graph structure form to be used as a table level knowledge graph; the table level knowledge graph comprises nodes and edges, wherein each node represents a table, and each edge represents an external key relation;
Determining a main table and an expected association table expected to be associated with the main table, operating a minimum spanning tree algorithm based on a table level knowledge graph, obtaining each path from the main table to each expected association table, and selecting a path with the shortest path length as an optimal path; the path with the shortest path length is the path with the least edge between the main table and the expected association table;
According to the optimal path, acquiring an association mode and an association field relation between the main table and the expected association table, wherein the association mode is an external key relation; simultaneously acquiring optional fields in all expected association tables;
removing unnecessary fields for modeling of the data warehouse from the optional fields, reserving the required fields, and determining processing rules of the reserved fields;
And establishing a table serving as a model in a corresponding data warehouse according to the association mode, the association field relation, the fields and the processing rules.
2. The data warehouse modeling method as claimed in claim 1, wherein after the table level knowledge graph is established, the dimension value of the foreign key field is calculated, and the dimension value is stored in the attribute of the edge corresponding to the foreign key relation of the table level knowledge graph as the weight of the edge; the dimension value refers to the number of lines of the outer key field after duplication removal;
When the path with the shortest path length found by using the minimum spanning tree algorithm is a plurality of paths, the path with the largest sum of path weights is selected as the optimal path.
3. The data warehouse modeling method of claim 1 or 2, wherein the process of obtaining the intra-table function dependencies is:
acquiring the table names of all tables in a database and the field names in all tables;
For each table, analyzing the characteristics of each field according to the values of the fields in the table; calculating and obtaining the function dependency relationship among the fields in the table as the function dependency relationship in the table according to the table name, the field name and the field value for each table;
The features of the fields include qualitative features and quantitative features; the qualitative feature comprises a data type of the field and the quantitative feature comprises a length of the field.
4. The data warehouse modeling method as claimed in claim 3, wherein the process of obtaining the foreign key relationship is:
And identifying the main key of each table according to the function dependency relationship in the table, searching and determining the corresponding external key in other tables according to the characteristics of the main key, and forming an external key relationship between the main key and the external key.
5. The data warehouse modeling method as claimed in claim 4, wherein,
When searching and determining the foreign key, taking the fields matched with the data type and the field length of the main key in other tables as the foreign key, wherein the fields matched with the data type and the field length of the main key are the same as the data type of the main key, and the minimum length of the fields is larger than or equal to the minimum length of the main key and the maximum length of the fields is smaller than or equal to the maximum length of the main key.
6. The data warehouse modeling method as claimed in claim 1, wherein two tables are arbitrarily selected from the service main table and all tables desired to be associated, wherein a shortest path algorithm is used to calculate all shortest paths selectable between the two tables, and wherein the selected shortest paths are used to replace foreign key associated paths.
7. A data warehouse modeling system, comprising:
a processor; a database; and a memory in which a program is stored, a database storing tables,
Wherein when the processor executes the program, the following operations are performed:
displaying each table and the external key relation among the tables in a visual graph structure form to be used as a table level knowledge graph; the table level knowledge graph comprises nodes and edges, wherein each node represents a table, and each edge represents an external key relation;
Determining a main table and an expected association table expected to be associated with the main table, operating a minimum spanning tree algorithm based on a table level knowledge graph, obtaining each path from the main table to each expected association table, and selecting a path with the shortest path length as an optimal path; the path with the shortest path length is the path with the least edge between the main table and the expected association table;
According to the optimal path, acquiring an association mode and an association field relation between the main table and the expected association table, wherein the association mode is an external key relation; simultaneously acquiring optional fields in all expected association tables;
removing unnecessary fields for modeling of the data warehouse from the optional fields, reserving the required fields, and determining processing rules of the reserved fields;
And establishing a table serving as a model in a corresponding data warehouse according to the association mode, the association field relation, the fields and the processing rules.
8. The data warehouse modeling system of claim 7, wherein after the table level knowledge graph is established, the dimension value of the foreign key field is calculated, and the dimension value is stored in the attribute of the edge of the table level knowledge graph corresponding to the foreign key relationship as the weight of the edge; the dimension value refers to the number of lines of the outer key field after duplication removal;
When the path with the shortest path length found by using the minimum spanning tree algorithm is a plurality of paths, the path with the largest sum of path weights is selected as the optimal path.
9. The data warehouse modeling system of claim 7 or 8, wherein,
The process of obtaining the function dependency relationship in the table is as follows:
acquiring the table names of all tables in a database and the field names in all tables;
For each table, analyzing the characteristics of each field according to the values of the fields in the table; calculating and obtaining the function dependency relationship among the fields in the table as the function dependency relationship in the table according to the table name, the field name and the field value for each table;
The features of the fields include qualitative features and quantitative features; the qualitative feature comprises a data type of a field, and the quantitative feature comprises a length of the field;
The process of obtaining the external key relation is as follows:
For each table, identifying a main key of each table according to the function dependency relationship in the table, searching and determining corresponding external keys in other tables according to the characteristics of the main key, and forming an external key relationship between the main key and the external keys; when searching and determining the foreign key, taking the fields matched with the data type and the field length of the main key in other tables as the foreign key, wherein the fields matched with the data type and the field length of the main key are the same as the data type of the main key, and the minimum length of the fields is larger than or equal to the minimum length of the main key and the maximum length of the fields is smaller than or equal to the maximum length of the main key.
10. The data warehouse modeling system of claim 7, wherein two tables are arbitrarily selected from the service master table and all tables for which association is desired, wherein a shortest path algorithm is used to calculate all shortest paths selectable between the two tables, and wherein the selected shortest paths are used to replace foreign key association paths.
CN202210749147.5A 2022-06-29 2022-06-29 Data warehouse modeling method and system Active CN115328883B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210749147.5A CN115328883B (en) 2022-06-29 2022-06-29 Data warehouse modeling method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210749147.5A CN115328883B (en) 2022-06-29 2022-06-29 Data warehouse modeling method and system

Publications (2)

Publication Number Publication Date
CN115328883A CN115328883A (en) 2022-11-11
CN115328883B true CN115328883B (en) 2024-06-18

Family

ID=83917927

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210749147.5A Active CN115328883B (en) 2022-06-29 2022-06-29 Data warehouse modeling method and system

Country Status (1)

Country Link
CN (1) CN115328883B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115712691A (en) * 2022-11-17 2023-02-24 创新奇智(重庆)科技有限公司 Data relation processing method and system
CN117539869B (en) * 2024-01-08 2024-03-15 北京睿企信息科技有限公司 Data processing system for acquiring data table

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101286151A (en) * 2007-04-13 2008-10-15 国际商业机器公司 Method for establishing multidimensional model and data store mode mappings and relevant system
CN111460047A (en) * 2020-03-09 2020-07-28 平安科技(深圳)有限公司 Method, device and equipment for constructing characteristics based on entity relationship and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111581393B (en) * 2020-04-28 2022-11-25 国家电网有限公司客户服务中心 Construction method of knowledge graph based on customer service data in power industry
CN112115138A (en) * 2020-08-19 2020-12-22 第四范式(北京)技术有限公司 Method, device and equipment for determining association relation between data tables
CN114385764A (en) * 2021-08-25 2022-04-22 广东粤财金融云科技股份有限公司 Reverse analysis method and device for relational database table structure
CN114595294B (en) * 2022-03-11 2022-09-20 北京梦诚科技有限公司 Data warehouse modeling and extracting method and system
CN116226082A (en) * 2022-12-29 2023-06-06 金篆信科有限责任公司 Database model generation method and device, storage medium and electronic equipment

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101286151A (en) * 2007-04-13 2008-10-15 国际商业机器公司 Method for establishing multidimensional model and data store mode mappings and relevant system
CN111460047A (en) * 2020-03-09 2020-07-28 平安科技(深圳)有限公司 Method, device and equipment for constructing characteristics based on entity relationship and storage medium

Also Published As

Publication number Publication date
CN115328883A (en) 2022-11-11

Similar Documents

Publication Publication Date Title
US10860548B2 (en) Generating and reusing transformations for evolving schema mapping
US9916350B2 (en) Automated creation of join graphs for unrelated data sets among relational databases
US10565498B1 (en) Deep neural network-based relationship analysis with multi-feature token model
CN115328883B (en) Data warehouse modeling method and system
WO2020135048A1 (en) Data merging method and apparatus for knowledge graph
US20230229677A9 (en) Artificial intelligence-based property data linking system
CN106104591B (en) Mapping attributes of keyed entities
US20140351241A1 (en) Identifying and invoking applications based on data in a knowledge graph
CA2659743A1 (en) Primenet data management system
WO2016029230A1 (en) Automated creation of join graphs for unrelated data sets among relational databases
US20230281200A1 (en) Method, apparatus, and computer-readable medium to extract a referentially intact subset from a database
CA3099480A1 (en) Systems and methods for generating graph data structure objects with homomorphism
Neto et al. Efficient computation and visualization of multiple density-based clustering hierarchies
Yan et al. Geometrically based linear iterative clustering for quantitative feature correspondence
US9336249B2 (en) Decision tree with just-in-time nodal computations
US20100121837A1 (en) Apparatus and Method for Utilizing Context to Resolve Ambiguous Queries
CN115577147A (en) Visual information map retrieval method and device, electronic equipment and storage medium
US11366833B2 (en) Augmenting project data with searchable metadata for facilitating project queries
EP2911107A1 (en) Graph pruning in hipergraph
WO2020008180A1 (en) Method and system for integrating data sets
Trabelsi et al. Employing Data and Process Mining Techniques for Redundancy Detection and Analystics in Business Processes.
CN116610714B (en) Data query method, device, computer equipment and storage medium
US20220342879A1 (en) Data searching system, device, method and program
CN112115699B (en) Method and system for analyzing data
JP2001312419A (en) Software overlap degree evaluating device and recording medium with recorded software overlap degree evaluating program

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant