CN113010517A

CN113010517A - Data table management method and device

Info

Publication number: CN113010517A
Application number: CN202110250393.1A
Authority: CN
Inventors: 熊文杰; 谢荣良; 叶桂全
Original assignee: Industrial and Commercial Bank of China Ltd ICBC
Current assignee: Industrial and Commercial Bank of China Ltd ICBC
Priority date: 2021-03-08
Filing date: 2021-03-08
Publication date: 2021-06-22
Anticipated expiration: 2041-03-08
Also published as: CN113010517B

Abstract

The invention discloses a data table management method and device. The method comprises the following steps: obtaining subject domain information, table name information, field information and data source information of a data table to be built; searching for the subject domain information of the data table to be built in a table resource pool; if the subject domain information of the data table to be built does not exist in the table resource pool, allowing the data table to be built to be newly added to the table resource pool; if the subject domain information of the data table to be built exists in the table resource pool, extracting the table name information, the field information and the data source information of the historical data table corresponding to the subject domain information in the table resource pool; determining similarity information between the data table to be built and the historical data table according to the table name information, the field information and the data source information of both tables; and managing the data table according to the similarity information. The invention can manage the data in a database and avoid redundancy and resource waste.

Description

Data table management method and device

Technical Field

The invention relates to the technical field of computers, in particular to a data table management method and device.

Background

In large software companies, different teams often store information in separate databases. However, some information is common to many teams, such as user names and mobile phone numbers, so storing it separately in each team's database causes redundancy and wastes resources.

Therefore, a need exists for a data table management scheme that overcomes the above-mentioned problems.

Disclosure of Invention

The embodiment of the invention provides a data table management method, which is used for managing data in a database and avoiding redundancy and resource waste, and comprises the following steps:

obtaining subject domain information, table name information, field information and data source information of a data table to be built;

searching for the subject domain information of the data table to be built in a table resource pool;

if the subject domain information of the data table to be built does not exist in the table resource pool, allowing the data table to be built to be newly added to the table resource pool;

if the subject domain information of the data table to be built exists in the table resource pool, extracting the table name information, the field information and the data source information of the historical data table corresponding to the subject domain information in the table resource pool;

determining similarity information of the data table to be built and the historical data table according to the table name information, the field information and the data source information of the data table to be built and the table name information, the field information and the data source information of the historical data table;

and managing a data table according to the similarity information.

The embodiment of the invention provides a data table management device, which is used for managing data in a database and avoiding redundancy and resource waste, and comprises the following components:

the information acquisition module is used for acquiring the subject domain information, the table name information, the field information and the data source information of the data table to be built;

the information searching module is used for searching the subject domain information of the data table to be built in a table resource pool;

the first information judgment module is used for allowing the data table to be built to be newly added to the table resource pool if the subject domain information of the data table to be built does not exist in the table resource pool;

the second information judgment module is used for extracting the table name information, the field information and the data source information of the historical data table corresponding to the subject domain information in the table resource pool if the subject domain information of the data table to be built exists in the table resource pool;

the similarity determining module is used for determining similarity information between the data table to be built and the historical data table according to the table name information, the field information and the data source information of the data table to be built and the table name information, the field information and the data source information of the historical data table;

and the data table management module is used for managing the data table according to the similarity information.

The embodiment of the invention also provides computer equipment which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein the processor executes the data table management method.

An embodiment of the present invention further provides a computer-readable storage medium, where a computer program for executing the above data table management method is stored in the computer-readable storage medium.

In the embodiment of the invention, the subject domain information, the table name information, the field information and the data source information of the data table to be built are obtained, and the subject domain information is first searched for in the table resource pool. If the subject domain information does not exist in the table resource pool, the data table to be built is allowed to be newly added to the table resource pool. Classifying data tables by subject domain in this way avoids the situation in which names are similar but business meanings differ and the fields are nonetheless judged to be duplicates, which effectively improves the accuracy of the similarity.
If the subject domain information exists in the table resource pool, a further judgment is made according to the table name information, the field information and the data source information: the table name information, the field information and the data source information of the historical data table corresponding to the subject domain information are extracted, the similarity information between the data table to be built and the historical data table is determined from the table name information, the field information and the data source information of both tables, and the data table is managed according to the similarity information. The relationship between the data table to be built and the historical data table is thereby established, so that data in the database is managed and redundancy and resource waste are avoided.

Drawings

To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort. In the drawings:

FIG. 1 is a diagram illustrating a data table management method according to an embodiment of the present invention;

FIGS. 2-3 are schematic diagrams of a data table management method according to an embodiment of the invention;

FIG. 4 is a diagram of a data table management apparatus according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the embodiments of the present invention are further described in detail below with reference to the accompanying drawings. The exemplary embodiments and descriptions of the present invention are provided to explain the present invention, but not to limit the present invention.

In order to manage data in a database and avoid redundancy and resource waste, an embodiment of the present invention provides a data table management method which, as shown in FIG. 1, may include:

step 101, obtaining subject domain information, table name information, field information and data source information of a data table to be built;

step 102, searching the subject domain information of the data table to be built in a table resource pool;

step 103, if the subject domain information of the data table to be built does not exist in the table resource pool, allowing the data table to be built to be newly added to the table resource pool;

step 104, if the subject domain information of the data table to be built exists in the table resource pool, extracting the table name information, the field information and the data source information of the historical data table corresponding to the subject domain information in the table resource pool;

step 105, determining similarity information between the data table to be built and the historical data table according to the table name information, the field information and the data source information of the data table to be built and the table name information, the field information and the data source information of the historical data table;

step 106, managing the data table according to the similarity information.

As shown in FIG. 1, the embodiment of the invention first searches the table resource pool for the subject domain information of the data table to be built. If the subject domain information does not exist in the table resource pool, the data table to be built is allowed to be newly added to the table resource pool. Classifying data tables by subject domain in this way avoids the situation in which names are similar but business meanings differ and the fields are nonetheless judged to be duplicates, which effectively improves the accuracy of the similarity.
If the subject domain information exists in the table resource pool, a further judgment is made according to the table name information, the field information and the data source information: the table name information, the field information and the data source information of the historical data table corresponding to the subject domain information are extracted, the similarity information between the data table to be built and the historical data table is determined from the table name information, the field information and the data source information of both tables, and the data table is managed according to the similarity information. The relationship between the data table to be built and the historical data table is thereby established, so that data in the database is managed and redundancy and resource waste are avoided.

It should be noted that the method and apparatus for managing a data table disclosed in the present invention can be used in the financial field, and can also be used in any field other than the financial field.

In the embodiment, the subject domain information, the table name information, the field information and the data source information of a data table to be built are obtained, the subject domain information of the data table to be built is searched in a table resource pool, if the subject domain information of the data table to be built does not exist in the table resource pool, the data table to be built is allowed to be newly added in the table resource pool, and if the subject domain information of the data table to be built exists in the table resource pool, the table name information, the field information and the data source information of a historical data table corresponding to the subject domain information in the table resource pool are extracted.

In specific implementation, subject domain information and data source information are added to the table structure. When a data table is newly created, the subject domain information and the data source information must be filled in, in addition to the basic information (table name information, field information, and the like). The subject domain information represents the business direction of the data stored in the data table and is used to partition the table resource pool. The data source information identifies a table that already exists in the table resource pool; for example, if the data source of table A is table B, the fields in table A come from table B. After this information is filled in for a new data table, the subject domain to which the data table to be built belongs is determined, that is, the subject domain information of the data table to be built is searched for in the table resource pool. If it does not exist in the table resource pool, the subject domains are different, and the data table to be built is allowed to be newly added to the table resource pool. If it exists in the table resource pool, the subject domains are the same, and the table name information, the field information and the data source information of the historical data table corresponding to the subject domain information in the table resource pool are extracted.
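The table structure and the subject domain lookup described above can be sketched as follows. This is a minimal illustration only: the class names `TableSpec` and `TableResourcePool`, the in-memory dictionary, and the sample table and domain names (including "retail") are assumptions and do not appear in the original.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TableSpec:
    """Metadata filled in when a table is created (illustrative names)."""
    subject_domain: str   # business direction of the stored data
    table_name: str
    fields: list[str]
    # Existing tables in the pool that the fields of this table come from.
    data_source: list[str] = field(default_factory=list)


class TableResourcePool:
    """Hypothetical in-memory pool partitioned by subject domain."""

    def __init__(self) -> None:
        self._by_domain: dict[str, list[TableSpec]] = {}

    def historical_tables(self, subject_domain: str) -> list[TableSpec]:
        # An empty list means the subject domain is not in the pool yet.
        return list(self._by_domain.get(subject_domain, []))

    def add(self, table: TableSpec) -> None:
        self._by_domain.setdefault(table.subject_domain, []).append(table)


pool = TableResourcePool()
pool.add(TableSpec("public", "public_customer_info", ["customer_number"]))

new_table = TableSpec("retail", "important_customer_info",
                      ["customer_number"], ["public_customer_info"])
candidates = pool.historical_tables(new_table.subject_domain)
if not candidates:
    # Subject domain not found: the new table may be added directly.
    pool.add(new_table)
```

When `candidates` is not empty, the table name, field and data source information of each candidate would be passed on to the similarity check instead of adding the table immediately.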

In the embodiment, the similarity information between the data table to be built and the historical data table is determined according to the table name information, the field information and the data source information of the data table to be built and the table name information, the field information and the data source information of the historical data table.

In this embodiment, determining similarity information between the data table to be created and the historical data table according to the table name information, the field information, and the data source information of the data table to be created and the table name information, the field information, and the data source information of the historical data table includes:

performing word splitting processing on the table name information of the data table to be built and the table name information of the historical data table respectively to obtain a first word splitting result corresponding to the data table to be built and a second word splitting result corresponding to the historical data table, and calculating Jaccard coefficients of the first word splitting result and the second word splitting result to obtain first similarity information;

performing word splitting processing on field information of a data table to be built and field information of a historical data table respectively to obtain a third word splitting result corresponding to the data table to be built and a fourth word splitting result corresponding to the historical data table, and calculating Jaccard coefficients of the third word splitting result and the fourth word splitting result to obtain second similarity information;

performing word splitting processing on data source information of a data table to be built and data source information of a historical data table respectively to obtain a fifth word splitting result corresponding to the data table to be built and a sixth word splitting result corresponding to the historical data table, and calculating Jaccard coefficients of the fifth word splitting result and the sixth word splitting result to obtain third similarity information;

accuracy evaluation is carried out on table name information, field information and data source information of the historical data table, and a first weight corresponding to the first similarity information, a second weight corresponding to the second similarity information and a third weight corresponding to the third similarity information are determined according to the accuracy evaluation result;

and determining the similarity information of the data table to be built and the historical data table according to the first similarity information and the corresponding first weight, the second similarity information and the corresponding second weight, and the third similarity information and the corresponding third weight.

In this embodiment, performing accuracy evaluation on table name information, field information, and data source information of the historical data table, and determining a first weight corresponding to the first similarity information, a second weight corresponding to the second similarity information, and a third weight corresponding to the third similarity information according to a result of the accuracy evaluation includes:

accuracy evaluation is carried out on the table name information, the field information and the data source information of the historical data table to obtain a first evaluation array corresponding to the table name information, a second evaluation array corresponding to the field information and a third evaluation array corresponding to the data source information;

respectively removing the highest value and the lowest value of each array in the first evaluation array, the second evaluation array and the third evaluation array;

summing each array with the highest value and the lowest value removed to obtain a first summation result, a second summation result and a third summation result;

and normalizing the first summation result, the second summation result and the third summation result to obtain a first weight corresponding to the first similarity information, a second weight corresponding to the second similarity information and a third weight corresponding to the third similarity information.

In specific implementation, taking the table name information as an example, the table name information of the data table to be built is split into words. For example, "important customer information table" can be split into: important, customer, information; and "customer number" can be split into: customer, number. This yields the first word splitting result corresponding to the data table to be built, denoted set A. The table name information of the historical data table is split in the same way to obtain the second word splitting result corresponding to the historical data table, denoted set B; this result can also be read directly from the database.

The Jaccard coefficient J(A, B) of the first word splitting result (set A) and the second word splitting result (set B) is calculated according to the following formula to obtain the first similarity information S₁:

S₁ = J(A, B) = |A ∩ B| / |A ∪ B|
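The Jaccard coefficient of two word sets is the size of their intersection divided by the size of their union. A minimal sketch is given below; the function name `jaccard`, the handling of two empty sets, and the word set chosen for the historical table name are assumptions made for illustration.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient |A intersect B| / |A union B| of two word sets."""
    union = a | b
    if not union:
        # Assumption: two empty word sets are treated as dissimilar.
        return 0.0
    return len(a & b) / len(union)


# Word sets for two table names that have already been split into words.
A = {"important", "customer", "information"}   # table to be built
B = {"public", "customer", "information"}      # illustrative historical table
s1 = jaccard(A, B)  # 2 shared words out of 4 distinct words
```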

In specific implementation, the second similarity information S₂ corresponding to the field information and the third similarity information S₃ corresponding to the data source information are calculated in the same way. It should be noted that when a table contains a plurality of fields, the Jaccard coefficient must be calculated for each field separately to obtain an array of Jaccard coefficients, and S₂ is the average of that array.
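The per-field averaging can be sketched as below. The text does not say how fields of the two tables are paired, so this sketch assumes that each field of the table to be built is scored against its best-matching field in the historical table. The underscore-based `split_words` tokenizer and the sample field names are also hypothetical.

```python
from __future__ import annotations


def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def split_words(name: str) -> set[str]:
    # Hypothetical tokenizer: identifier-style names split on underscores.
    return set(name.lower().split("_"))


def field_similarity(new_fields: list[str], old_fields: list[str]) -> float:
    """S2 as the mean of per-field Jaccard coefficients.

    Assumption: each new field is compared with every historical field
    and its best score is kept before averaging.
    """
    if not new_fields or not old_fields:
        return 0.0
    scores = [
        max(jaccard(split_words(n), split_words(o)) for o in old_fields)
        for n in new_fields
    ]
    return sum(scores) / len(scores)


s2 = field_similarity(
    ["customer_number", "customer_name"],
    ["customer_number", "phone_number"],
)
```

Here `customer_number` matches exactly (score 1) and `customer_name` shares one of three distinct words with `customer_number` (score 1/3), so the average is 2/3.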

In specific implementation, the first weight Q₁ corresponding to the first similarity information, the second weight Q₂ corresponding to the second similarity information and the third weight Q₃ corresponding to the third similarity information are calculated by a quantitative statistical method. Accuracy evaluation is carried out on the table name information, the field information and the data source information of the historical data table: experts are organized by subject domain, and each expert rates the accuracy (0% to 100%) of the table name information, the field information and the data source information under the existing subject domain information. This yields three arrays, namely a first evaluation array corresponding to the table name information, a second evaluation array corresponding to the field information and a third evaluation array corresponding to the data source information. The highest value and the lowest value are removed from each array to obtain the trimmed first evaluation array C₁, second evaluation array C₂ and third evaluation array C₃. Each trimmed array is summed to obtain the first summation result ∑C₁, the second summation result ∑C₂ and the third summation result ∑C₃. The three summation results are normalized to obtain the first weight Q₁, the second weight Q₂ and the third weight Q₃, such that they satisfy:

Q₁ + Q₂ + Q₃ = 1

wherein

Qᵢ = ∑Cᵢ / (∑C₁ + ∑C₂ + ∑C₃), i = 1, 2, 3

In specific implementation, the first similarity information S is obtained₁And a corresponding first weight Q₁Second similarity information S₂And a corresponding second weight Q₂And third similarity information S₃And a corresponding third weight Q₃And then, determining the similarity information S of the data table to be built and the historical data table according to the following formula:

In the embodiment, the data table is managed according to the similarity information.

In this embodiment, managing a data table according to the similarity information includes: comparing the similarity information with a preset similarity threshold; if the similarity information is larger than a preset similarity threshold, the data table to be built is not allowed to be newly added in a table resource pool; and if the similarity information is less than or equal to a preset similarity threshold, allowing the data table to be newly added in a table resource pool.

In specific implementation, if the similarity information is greater than a preset similarity threshold, the data table to be built and the historical data table are judged to be repeated tables, and the data table to be built is not allowed to be newly added in the table resource pool. And if the similarity information is less than or equal to a preset similarity threshold, namely if the similarity between the data table to be built and all the historical data tables in the subject domain does not exceed the threshold, determining that the data table to be built is a new table, and allowing the new data table to be added in the table resource pool.
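The threshold comparison can be sketched as below: the new table is rejected if its similarity to any historical table in the subject domain is greater than the threshold, and allowed otherwise. The function name `review_new_table`, the table names, the scores and the threshold of 0.7 are hypothetical values chosen for illustration.

```python
def review_new_table(similarities, threshold):
    """Return (allowed, duplicates) for a table to be built.

    similarities maps each historical table name in the subject domain
    to its similarity S with the table to be built.
    """
    # Strictly greater than the threshold counts as a repeated table;
    # a value equal to the threshold is still allowed.
    duplicates = sorted(
        name for name, s in similarities.items() if s > threshold
    )
    return len(duplicates) == 0, duplicates


allowed, duplicates = review_new_table(
    {"public_customer_info": 0.82, "account_ledger": 0.10},
    threshold=0.7,
)
```

In this example `allowed` is `False` because one historical table exceeds the threshold, and `duplicates` names that table so the requester can reuse it instead.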

The embodiment of the invention calculates the similarity of table structures and introduces the concept of the subject domain: data tables are divided by subject domain, similarity is checked according to the subject domain, the table name and the data source, and a table is judged to be a repeated table if the similarity reaches a certain threshold. A data table can therefore be checked when it is newly built, which avoids repeated construction. To solve the problem that a similarity algorithm counts similar fields with different business attributes as similar, a subject domain field is added to the table structure and data tables are divided by business direction, which improves the accuracy of the similarity algorithm. Classifying tables by subject domain avoids the situation in which fields with different business meanings but similar names are judged to be repeated fields, and improves the accuracy of the similarity. In addition, the relationship between tables is established through each table's associated data source. Fields with similar names but consistent sources can be judged as similar fields, and the accuracy of the similarity is improved.

A specific embodiment is given below to illustrate a specific application of the data table management method in the embodiment of the present invention. In this embodiment, as shown in FIGS. 2-3, data table management is performed with 5 data tables under the "public" subject domain. The relevant experts are first organized to review the five data tables, and the results are shown in Table 1:

TABLE 1

Expert	Table name	Field name	Data source
Expert 1	70%	40%	100%
Expert 2	50%	60%	90%
Expert 3	50%	60%	90%
Expert 4	50%	60%	90%
Expert 5	30%	90%	70%

Applying the method described above gives:

C₁ = {50%, 50%, 50%}, ∑C₁ = 1.5;

C₂ = {60%, 60%, 60%}, ∑C₂ = 1.8;

C₃ = {90%, 90%, 90%}, ∑C₃ = 2.7;

Normalization then gives Q₁ = 25%, Q₂ = 30%, Q₃ = 45%.
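The weights above can be reproduced from the Table 1 ratings with a short sketch. The function name `weights_from_evaluations` is hypothetical, the ratings are entered as whole percentages, and the sketch assumes exactly one highest and one lowest rating is dropped from each array.

```python
def weights_from_evaluations(*arrays):
    """Trim the highest and lowest rating of each array, sum, then normalize."""
    trimmed_sums = []
    for scores in arrays:
        if len(scores) < 3:
            raise ValueError("need at least three ratings to trim two")
        ordered = sorted(scores)
        # Assumption: drop one lowest and one highest value.
        trimmed_sums.append(sum(ordered[1:-1]))
    total = sum(trimmed_sums)
    if total == 0:
        raise ValueError("trimmed ratings sum to zero")
    return [s / total for s in trimmed_sums]


# Expert ratings from Table 1, as percentages.
table_name = [70, 50, 50, 50, 30]
field_name = [40, 60, 60, 60, 90]
data_source = [100, 90, 90, 90, 70]

q1, q2, q3 = weights_from_evaluations(table_name, field_name, data_source)
```

The trimmed sums are 150, 180 and 270 (that is, 1.5, 1.8 and 2.7 when written as fractions), and dividing each by their total of 600 gives 25%, 30% and 45%.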

Next, for the newly added table "important customer information table", the table name and field names are split into words and the data source information of the table is read. The result is shown in Table 2:

TABLE 2

In this case, if the table "public customer information table" exists under the subject domain, its table name and field names are split into words, and the result is shown in Table 3:

TABLE 3

As can be seen from the data in tables 2 and 3,

further, the similarity S between the "important customer information table" and the "public customer information table" is:

Based on the same inventive concept, the embodiment of the present invention further provides a data table management apparatus, as described in the following embodiments. Since the principle by which the apparatus solves the problem is similar to that of the data table management method, the implementation of the apparatus can refer to the implementation of the method, and repeated descriptions are omitted.

Fig. 4 is a structural diagram of a data table management apparatus according to an embodiment of the present invention, and as shown in fig. 4, the apparatus includes:

the information obtaining module 401 is configured to obtain subject domain information, table name information, field information, and data source information of a data table to be created;

an information searching module 402, configured to search the subject domain information of the to-be-created data table in a table resource pool;

a first information determining module 403, configured to allow the data table to be created to be newly added to a table resource pool if the subject domain information of the data table to be created does not exist in the table resource pool;

a second information determining module 404, configured to, if the subject domain information of the data table to be created exists in the table resource pool, extract table name information, field information, and data source information of a historical data table corresponding to the subject domain information in the table resource pool;

a similarity determining module 405, configured to determine similarity information between the data table to be created and the historical data table according to the table name information, the field information, and the data source information of the data table to be created, and the table name information, the field information, and the data source information of the historical data table;

and the data table management module 406 is configured to manage a data table according to the similarity information.

In one embodiment, the similarity determination module 405 is further configured to:

In one embodiment, the data table management module 406 is further configured to:

comparing the similarity information with a preset similarity threshold;

if the similarity information is larger than a preset similarity threshold, the data table to be built is not allowed to be newly added in a table resource pool;

and if the similarity information is less than or equal to a preset similarity threshold, allowing the data table to be newly added in a table resource pool.

In summary, in the embodiments of the present invention, the subject domain information, table name information, field information, and data source information of the data table to be built are obtained, and the subject domain information of the data table to be built is searched for in the table resource pool. If that subject domain information does not exist in the table resource pool, the data table to be built is allowed to be newly added to the table resource pool. If it does exist, the table name information, field information, and data source information of the historical data table corresponding to that subject domain information are extracted from the table resource pool; similarity information between the data table to be built and the historical data table is determined from the table name information, field information, and data source information of both tables; and the data table is managed according to the similarity information.

Because the subject domain information is searched first, data tables are classified by subject domain. This avoids the situation in which tables whose subject domain services have different meanings but whose table names are similar are wrongly judged to be repeated, and it effectively improves the accuracy of the similarity determination. When the subject domain information of the data table to be built already exists in the table resource pool, a further judgment based on the table name information, field information, and data source information is needed, which is why the corresponding information of the historical data table is extracted and compared. This effectively establishes the relation between the data table to be built and the historical data table, so that the data in the database can be managed and redundancy and resource waste are avoided.
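The overall flow summarized above might look roughly like the following sketch. Everything beyond the order of the steps is an assumption made for illustration: the `TableSpec` type, the use of `difflib.SequenceMatcher` for table names, Jaccard overlap for fields and data sources, the weighted sum, the example weights, the 0.8 threshold, and the choice to compare against every historical table in the same subject domain are not taken from the source.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class TableSpec:
    """Hypothetical description of a data table (not a type from the source)."""
    subject_domain: str
    table_name: str
    fields: frozenset
    data_sources: frozenset


def jaccard(a: frozenset, b: frozenset) -> float:
    """Set overlap in [0, 1]; an assumed measure, not the patent's."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity(new: TableSpec, old: TableSpec,
               weights=(0.3, 0.5, 0.2)) -> float:
    """Combine name, field and data source similarity with assumed weights."""
    name_sim = SequenceMatcher(None, new.table_name, old.table_name).ratio()
    field_sim = jaccard(new.fields, old.fields)
    source_sim = jaccard(new.data_sources, old.data_sources)
    w1, w2, w3 = weights
    return w1 * name_sim + w2 * field_sim + w3 * source_sim


def may_add(new: TableSpec, pool: list, threshold: float = 0.8) -> bool:
    """Subject domain lookup first, then the similarity check."""
    same_domain = [t for t in pool if t.subject_domain == new.subject_domain]
    if not same_domain:
        # Subject domain not present in the pool: adding is allowed.
        return True
    # Assumption: the new table must stay at or below the threshold
    # against every historical table of the same subject domain.
    return all(similarity(new, old) <= threshold for old in same_domain)
```

With a pool containing one `loans` table, an identical specification in the same subject domain is rejected, while the same specification filed under a different subject domain is allowed, which is the classification effect described above.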

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Finally, it should be noted that the above embodiments are only specific embodiments of the present invention. They are used to illustrate the technical solutions of the present invention, not to limit them, and the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that, within the technical scope of the present disclosure, any person skilled in the art can still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes, or substitutions do not depart from the spirit and scope of the embodiments of the present invention, and they should be construed as being included therein. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A method for data table management, comprising:

and managing a data table according to the similarity information.

2. The data table management method of claim 1, wherein determining similarity information between the data table to be created and the historical data table according to the table name information, the field information and the data source information of the data table to be created and the table name information, the field information and the data source information of the historical data table comprises:

3. The data table management method of claim 2, wherein performing accuracy evaluation on the table name information, the field information, and the data source information of the historical data table, and determining a first weight corresponding to the first similarity information, a second weight corresponding to the second similarity information, and a third weight corresponding to the third similarity information according to a result of the accuracy evaluation, comprises:

4. The method of claim 1, wherein managing a data table according to the similarity information comprises:

comparing the similarity information with a preset similarity threshold;

5. A data table management apparatus, comprising:

6. The data table management apparatus of claim 5, wherein the similarity determination module is further configured to:

7. The data table management apparatus of claim 6, wherein the similarity determination module is further configured to:

8. The data table management apparatus of claim 5, wherein the data table management module is further configured to:

comparing the similarity information with a preset similarity threshold;

9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any one of claims 1 to 4 when executing the computer program.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program for executing the method of any one of claims 1 to 4.