CN113010517A - Data table management method and device - Google Patents


Info

Publication number
CN113010517A
Authority
CN
China
Prior art keywords
information
data table
similarity
built
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110250393.1A
Other languages
Chinese (zh)
Other versions
CN113010517B (en)
Inventor
熊文杰
谢荣良
叶桂全
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202110250393.1A
Publication of CN113010517A
Application granted
Publication of CN113010517B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22: Indexing; Data structures therefor; Storage structures
    • G06F16/2282: Tablespace storage structures; Management thereof
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28: Databases characterised by their database models, e.g. relational or object models
    • G06F16/284: Relational databases
    • G06F16/285: Clustering or classification
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data table management method and device. The method comprises: obtaining subject domain information, table name information, field information and data source information of a data table to be built; searching a table resource pool for the subject domain information of the data table to be built; if the subject domain information of the data table to be built does not exist in the table resource pool, allowing the data table to be built to be added to the table resource pool; if the subject domain information of the data table to be built exists in the table resource pool, extracting the table name information, field information and data source information of the historical data table corresponding to that subject domain information in the table resource pool; determining similarity information between the data table to be built and the historical data table according to the table name information, field information and data source information of both tables; and managing the data table according to the similarity information. The invention can manage the data in a database while avoiding redundancy and wasted resources.

Description

Data table management method and device
Technical Field
The invention relates to the technical field of computers, in particular to a data table management method and device.
Background
In large software companies, different teams often store information in separate databases. However, since some information has generality, such as user name, mobile phone number, etc., storing in respective databases causes redundancy and wastes resources.
Therefore, a need exists for a data table management scheme that overcomes the above-mentioned problems.
Disclosure of Invention
The embodiment of the invention provides a data table management method, which is used for managing data in a database and avoiding redundancy and resource waste, and comprises the following steps:
obtaining subject domain information, table name information, field information and data source information of a data table to be built;
searching a table resource pool for the subject domain information of the data table to be built;
if the subject domain information of the data table to be built does not exist in the table resource pool, allowing the data table to be built to be added to the table resource pool;
if the subject domain information of the data table to be built exists in the table resource pool, extracting the table name information, the field information and the data source information of the historical data table corresponding to the subject domain information in the table resource pool;
determining similarity information of the data table to be built and the historical data table according to the table name information, the field information and the data source information of the data table to be built and the table name information, the field information and the data source information of the historical data table;
and managing a data table according to the similarity information.
The embodiment of the invention provides a data table management device, which is used for managing data in a database and avoiding redundancy and resource waste, and comprises the following components:
the information acquisition module is used for acquiring the subject domain information, the table name information, the field information and the data source information of the data table to be built;
the information searching module is used for searching the subject domain information of the data table to be built in a table resource pool;
the first information judgment module is used for allowing the data table to be built to be added to the table resource pool if the subject domain information of the data table to be built does not exist in the table resource pool;
the second information judgment module is used for extracting the table name information, the field information and the data source information of the historical data table corresponding to the subject domain information in the table resource pool if the subject domain information of the data table to be built exists in the table resource pool;
the similarity determining module is used for determining similarity information between the data table to be built and the historical data table according to the table name information, the field information and the data source information of the data table to be built and the table name information, the field information and the data source information of the historical data table;
and the data table management module is used for managing the data table according to the similarity information.
The embodiment of the invention also provides computer equipment which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein the processor executes the data table management method.
An embodiment of the present invention further provides a computer-readable storage medium, where a computer program for executing the above data table management method is stored in the computer-readable storage medium.
The embodiment of the invention obtains the subject domain information, the table name information, the field information and the data source information of the data table to be built; searches the table resource pool for the subject domain information of the data table to be built; if that subject domain information does not exist in the table resource pool, allows the data table to be built to be added to the table resource pool; if that subject domain information exists in the table resource pool, extracts the table name information, the field information and the data source information of the historical data table corresponding to the subject domain information in the table resource pool; determines similarity information between the data table to be built and the historical data table according to the table name information, the field information and the data source information of both tables; and manages the data table according to the similarity information. Because the subject domain information is searched first, and a table whose subject domain is not yet in the table resource pool is admitted directly, data tables are classified by subject domain. This prevents tables whose business meanings differ but whose names are similar from being misjudged as duplicates, and so effectively improves the accuracy of the similarity.
If the subject domain information of the data table to be built exists in the table resource pool, a further judgment based on the table name information, the field information and the data source information is needed. The table name information, the field information and the data source information of the historical data table corresponding to the subject domain information are extracted from the table resource pool, the similarity information between the data table to be built and the historical data table is determined from this information for both tables, and the data table is managed according to the similarity information. In this way a relationship between the data table to be built and the historical data table is established, the data in the database is managed, and redundancy and resource waste are avoided.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts. In the drawings:
FIG. 1 is a diagram illustrating a data table management method according to an embodiment of the present invention;
FIGS. 2-3 are schematic diagrams of a data table management method according to an embodiment of the invention;
FIG. 4 is a diagram of a data table management apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the embodiments of the present invention are further described in detail below with reference to the accompanying drawings. The exemplary embodiments and descriptions of the present invention are provided to explain the present invention, but not to limit the present invention.
In order to manage data in a database, and avoid redundancy and resource waste, an embodiment of the present invention provides a data table management method, which, as shown in fig. 1, may include:
step 101, obtaining subject domain information, table name information, field information and data source information of a data table to be built;
step 102, searching a table resource pool for the subject domain information of the data table to be built;
step 103, if the subject domain information of the data table to be built does not exist in the table resource pool, allowing the data table to be built to be added to the table resource pool;
step 104, if the subject domain information of the data table to be built exists in the table resource pool, extracting the table name information, the field information and the data source information of the historical data table corresponding to the subject domain information in the table resource pool;
step 105, determining similarity information between the data table to be built and the historical data table according to the table name information, the field information and the data source information of the data table to be built and the table name information, the field information and the data source information of the historical data table;
step 106, managing the data table according to the similarity information.
As shown in FIG. 1, the embodiment of the invention first searches the table resource pool for the subject domain information of the data table to be built. If that subject domain information does not exist in the table resource pool, the data table to be built is allowed to be added to the table resource pool, so that data tables are classified by subject domain. This prevents tables whose business meanings differ but whose names are similar from being misjudged as duplicates, and so effectively improves the accuracy of the similarity.
If the subject domain information of the data table to be built exists in the table resource pool, a further judgment based on the table name information, the field information and the data source information is needed. The table name information, the field information and the data source information of the historical data table corresponding to the subject domain information are extracted from the table resource pool, the similarity information between the data table to be built and the historical data table is determined from this information for both tables, and the data table is managed according to the similarity information. In this way a relationship between the data table to be built and the historical data table is established, the data in the database is managed, and redundancy and resource waste are avoided.
It should be noted that the method and apparatus for managing a data table disclosed in the present invention can be used in the financial field, and can also be used in any field other than the financial field.
In the embodiment, the subject domain information, the table name information, the field information and the data source information of a data table to be built are obtained, the subject domain information of the data table to be built is searched in a table resource pool, if the subject domain information of the data table to be built does not exist in the table resource pool, the data table to be built is allowed to be newly added in the table resource pool, and if the subject domain information of the data table to be built exists in the table resource pool, the table name information, the field information and the data source information of a historical data table corresponding to the subject domain information in the table resource pool are extracted.
In specific implementation, subject domain information and data source information are added to the table structure, so that when a data table is created, the subject domain information and the data source information must be filled in along with the basic information (table name information, field information, and the like). The subject domain information represents the business direction of the data stored in the data table and is used to partition the table resource pool. The data source information identifies a table that already exists in the table resource pool; for example, if the data source of table A is table B, the fields in table A come from table B. When a data table is created, after the corresponding information has been filled in, the subject domain to which the data table to be built belongs is determined; that is, the table resource pool is searched for the subject domain information of the data table to be built. If that subject domain information does not exist in the table resource pool, the subject domains differ, and the data table to be built is allowed to be added to the table resource pool. If that subject domain information exists in the table resource pool, the subject domains are the same, and the table name information, field information and data source information of the historical data table corresponding to the subject domain information are extracted from the table resource pool.
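The table metadata and the subject domain lookup described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `TableSpec`, `TableResourcePool` and `historical_tables`, the in-memory dictionary, and the sample subject domains are all assumptions introduced here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TableSpec:
    """Metadata filled in when a table is created (hypothetical structure)."""
    subject_domain: str                  # business direction of the stored data
    table_name: str
    fields: tuple[str, ...]
    data_sources: tuple[str, ...] = ()   # existing pool tables the fields come from


class TableResourcePool:
    """Hypothetical in-memory pool, partitioned by subject domain."""

    def __init__(self) -> None:
        self._by_domain: dict[str, list[TableSpec]] = defaultdict(list)

    def add(self, spec: TableSpec) -> None:
        self._by_domain[spec.subject_domain].append(spec)

    def historical_tables(self, subject_domain: str) -> list[TableSpec]:
        # .get avoids creating an empty entry for an unseen subject domain
        return list(self._by_domain.get(subject_domain, []))


pool = TableResourcePool()
pool.add(TableSpec("public", "public customer information table",
                   ("customer number", "customer name")))

# A table in a subject domain the pool has not seen is admitted directly.
new_table = TableSpec("retail", "important customer information table",
                      ("customer number",))
same_domain = pool.historical_tables(new_table.subject_domain)
if not same_domain:
    pool.add(new_table)
```

When `historical_tables` returns a non-empty list, those tables are the candidates for the similarity check rather than an immediate rejection.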
In the embodiment, the similarity information between the data table to be built and the historical data table is determined according to the table name information, the field information and the data source information of the data table to be built and the table name information, the field information and the data source information of the historical data table.
In this embodiment, determining similarity information between the data table to be created and the historical data table according to the table name information, the field information, and the data source information of the data table to be created and the table name information, the field information, and the data source information of the historical data table includes:
performing word splitting processing on the table name information of the data table to be built and the table name information of the historical data table respectively to obtain a first word splitting result corresponding to the data table to be built and a second word splitting result corresponding to the historical data table, and calculating Jaccard coefficients of the first word splitting result and the second word splitting result to obtain first similarity information;
performing word splitting processing on field information of a data table to be built and field information of a historical data table respectively to obtain a third word splitting result corresponding to the data table to be built and a fourth word splitting result corresponding to the historical data table, and calculating Jaccard coefficients of the third word splitting result and the fourth word splitting result to obtain second similarity information;
performing word splitting processing on data source information of a data table to be built and data source information of a historical data table respectively to obtain a fifth word splitting result corresponding to the data table to be built and a sixth word splitting result corresponding to the historical data table, and calculating Jaccard coefficients of the fifth word splitting result and the sixth word splitting result to obtain third similarity information;
accuracy evaluation is carried out on table name information, field information and data source information of the historical data table, and a first weight corresponding to the first similarity information, a second weight corresponding to the second similarity information and a third weight corresponding to the third similarity information are determined according to the accuracy evaluation result;
and determining the similarity information of the data table to be built and the historical data table according to the first similarity information and the corresponding first weight, the second similarity information and the corresponding second weight, and the third similarity information and the corresponding third weight.
In this embodiment, performing accuracy evaluation on table name information, field information, and data source information of the historical data table, and determining a first weight corresponding to the first similarity information, a second weight corresponding to the second similarity information, and a third weight corresponding to the third similarity information according to a result of the accuracy evaluation includes:
accuracy evaluation is carried out on the table name information, the field information and the data source information of the historical data table to obtain a first evaluation array corresponding to the table name information, a second evaluation array corresponding to the field information and a third evaluation array corresponding to the data source information;
respectively removing the highest value and the lowest value of each array in the first evaluation array, the second evaluation array and the third evaluation array;
summing each array with the highest value and the lowest value removed to obtain a first summation result, a second summation result and a third summation result;
and normalizing the first summation result, the second summation result and the third summation result to obtain a first weight corresponding to the first similarity information, a second weight corresponding to the second similarity information and a third weight corresponding to the third similarity information.
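The trimming, summing and normalizing steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `trimmed_sum` and `similarity_weights` are hypothetical, scores are given as whole percentages, and exactly one highest and one lowest value are dropped from each array. The sample scores are the expert ratings listed in Table 1 of the description.

```python
def trimmed_sum(scores):
    """Sum the scores after dropping one highest and one lowest value."""
    if len(scores) < 3:
        raise ValueError("need at least three scores to trim both ends")
    ordered = sorted(scores)
    return sum(ordered[1:-1])


def similarity_weights(name_scores, field_scores, source_scores):
    """Return (Q1, Q2, Q3): the trimmed sums normalized to add up to 1."""
    sums = [trimmed_sum(s) for s in (name_scores, field_scores, source_scores)]
    total = sum(sums)
    if total == 0:
        raise ValueError("all trimmed sums are zero; weights are undefined")
    return tuple(s / total for s in sums)


# Expert ratings in percent, one entry per expert.
weights = similarity_weights(
    [70, 50, 50, 50, 30],    # table name accuracy
    [40, 60, 60, 60, 90],    # field accuracy
    [100, 90, 90, 90, 70],   # data source accuracy
)
print(weights)  # (0.25, 0.3, 0.45)
```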
In specific implementation, taking the table name information as an example, the table name information of the data table to be built is split into words. For example, "important customer information table" can be split into: important, customer, information; and "customer number" can be split into: customer, number. This gives the first word-splitting result corresponding to the data table to be built, denoted set A. The table name information of the historical data table is split into words in the same way to obtain the second word-splitting result corresponding to the historical data table; this result can also be read directly from the database. It is denoted set B.
The Jaccard coefficient J(A, B) of the first word-splitting result set A and the second word-splitting result set B is calculated according to the following formula to obtain the first similarity information S1:
$$S_1 = J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
In specific implementation, the second similarity information S2 corresponding to the field information and the third similarity information S3 corresponding to the data source information are calculated in the same way. Note that when a table contains multiple fields, the Jaccard coefficient must be calculated for each field separately, giving an array of Jaccard coefficients; S2 is the average of that array.
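The Jaccard calculation and the per-field averaging can be sketched as below. Several parts are assumptions made for illustration: `split_words` is a toy whitespace splitter (Chinese names would need a real word segmenter, and this splitter keeps the word "table", unlike the example split in the description), and the description does not say how fields of the two tables are paired, so `field_similarity` pairs each new field with its best-matching historical field.

```python
def split_words(name):
    """Toy word splitter: lowercase and split on whitespace (an assumption)."""
    return set(name.lower().split())


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|; defined here as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def field_similarity(new_fields, old_fields):
    """Average of per-field Jaccard coefficients.

    Each field of the new table is paired with its best-matching field in
    the historical table; this pairing rule is an assumption.
    """
    if not new_fields or not old_fields:
        return 0.0
    best = [max(jaccard(split_words(f), split_words(g)) for g in old_fields)
            for f in new_fields]
    return sum(best) / len(best)


s1 = jaccard(split_words("important customer information table"),
             split_words("public customer information table"))
print(s1)  # 0.6  (3 shared words out of 5 distinct words)

s2 = field_similarity(["customer number", "customer name"],
                      ["customer number", "mobile phone number"])
```

Here "customer number" matches exactly (coefficient 1) and "customer name" shares one of three distinct words with "customer number" (coefficient 1/3), so `s2` is their average.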
In specific implementation, the first weight Q1 corresponding to the first similarity information, the second weight Q2 corresponding to the second similarity information and the third weight Q3 corresponding to the third similarity information are calculated by a quantitative statistical method. Accuracy evaluation is carried out on the table name information, the field information and the data source information of the historical data table: experts are organized by subject domain, and each expert rates (from 0% to 100%) the accuracy of the table name information, the field information and the data source information under the existing subject domain information. This yields three arrays: a first evaluation array corresponding to the table name information, a second evaluation array corresponding to the field information and a third evaluation array corresponding to the data source information. The highest value and the lowest value are removed from each of the three arrays, giving the trimmed first evaluation array C1, second evaluation array C2 and third evaluation array C3. Each trimmed array is summed to obtain the first summation result ∑C1, the second summation result ∑C2 and the third summation result ∑C3. The three summation results are normalized to obtain the first weight Q1, the second weight Q2 and the third weight Q3, such that they satisfy:
$$Q_1 + Q_2 + Q_3 = 1$$
wherein
$$Q_i = \frac{\sum C_i}{\sum C_1 + \sum C_2 + \sum C_3}, \quad i = 1, 2, 3$$
In specific implementation, after the first similarity information S1 and its corresponding first weight Q1, the second similarity information S2 and its corresponding second weight Q2, and the third similarity information S3 and its corresponding third weight Q3 have been obtained, the similarity information S of the data table to be built and the historical data table is determined according to the following formula:
$$S = Q_1 S_1 + Q_2 S_2 + Q_3 S_3$$
In the embodiment, the data table is managed according to the similarity information.
In this embodiment, managing a data table according to the similarity information includes: comparing the similarity information with a preset similarity threshold; if the similarity information is larger than a preset similarity threshold, the data table to be built is not allowed to be newly added in a table resource pool; and if the similarity information is less than or equal to a preset similarity threshold, allowing the data table to be newly added in a table resource pool.
In specific implementation, if the similarity information is greater than a preset similarity threshold, the data table to be built and the historical data table are judged to be repeated tables, and the data table to be built is not allowed to be newly added in the table resource pool. And if the similarity information is less than or equal to a preset similarity threshold, namely if the similarity between the data table to be built and all the historical data tables in the subject domain does not exceed the threshold, determining that the data table to be built is a new table, and allowing the new data table to be added in the table resource pool.
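The threshold decision can be sketched as follows, assuming the overall similarity is the weighted sum of the three component similarities. The function names, the threshold of 0.8 and the component scores are illustrative assumptions, not values from the description.

```python
def weighted_similarity(scores, weights):
    """S = Q1*S1 + Q2*S2 + Q3*S3 for one historical table."""
    return sum(q * s for q, s in zip(weights, scores))


def may_add_table(per_table_scores, weights, threshold):
    """Return (allowed, duplicates).

    per_table_scores maps each historical table in the same subject domain
    to its (S1, S2, S3) triple. The new table is allowed only if no
    historical table has a similarity greater than the threshold.
    """
    duplicates = [name for name, scores in per_table_scores.items()
                  if weighted_similarity(scores, weights) > threshold]
    return (not duplicates, duplicates)


weights = (0.25, 0.30, 0.45)   # Q1, Q2, Q3
threshold = 0.8                # hypothetical preset similarity threshold

allowed, duplicates = may_add_table(
    {"public customer information table": (0.6, 0.5, 1.0)},  # illustrative scores
    weights, threshold)

rejected, offenders = may_add_table(
    {"public customer information table": (1.0, 1.0, 1.0)},
    weights, threshold)
```

With these numbers the first call allows the table (similarity about 0.75, not above the threshold) and the second rejects it (similarity about 1.0), listing the historical table it duplicates.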
The embodiment of the invention calculates the similarity of table structures. It introduces the concept of the subject domain, partitions data tables by subject domain, and checks similarity according to the subject domain, the table name, the fields and the data source; if the similarity exceeds a preset threshold, the table is judged to be a duplicate. A data table can therefore be checked when it is created, which avoids repeated construction. Adding a subject domain field to the table structure and partitioning data tables by business direction prevents similarly named fields with different business attributes from being counted as similar, which improves the accuracy of the similarity algorithm. In addition, the relationship between tables is established by associating each table with its data source. Fields with similar names but consistent sources can be judged as similar fields, which further improves the accuracy of the similarity.
A specific embodiment is given below to illustrate a specific application of the data table management method in the embodiment of the present invention. In this embodiment, data table management is performed as shown in FIGS. 2-3. There are five data tables under the "public" subject domain. The relevant experts are first organized to review the five data tables, and the results are shown in Table 1:
TABLE 1
            Table name   Field name   Data source
Expert 1    70%          40%          100%
Expert 2    50%          60%          90%
Expert 3    50%          60%          90%
Expert 4    50%          60%          90%
Expert 5    30%          90%          70%
Applying the method described above (removing the highest and lowest value in each column and summing the rest) gives:
C1 = {50%, 50%, 50%}, ∑C1 = 1.5;
C2 = {60%, 60%, 60%}, ∑C2 = 1.8;
C3 = {90%, 90%, 90%}, ∑C3 = 2.7.
Normalization then gives Q1 = 25%, Q2 = 30%, Q3 = 45%.
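The normalization can be checked by hand, assuming each weight is its trimmed sum divided by the total of the three trimmed sums:

```latex
\begin{aligned}
\sum C_1 + \sum C_2 + \sum C_3 &= 1.5 + 1.8 + 2.7 = 6.0 \\
Q_1 &= \frac{1.5}{6.0} = 0.25 \\
Q_2 &= \frac{1.8}{6.0} = 0.30 \\
Q_3 &= \frac{2.7}{6.0} = 0.45 \\
Q_1 + Q_2 + Q_3 &= 1
\end{aligned}
```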
Next, for the newly added table "important customer information table", the table name and field names are split into words and the data source information of the table is read. The result is shown in Table 2:
TABLE 2
[Figure BDA0002965787550000081: Table 2, an image in the original publication]
Suppose the table "public customer information table" already exists under the same subject domain. Its table name and field names are split into words, and the result is shown in Table 3:
TABLE 3
[Figure BDA0002965787550000082: Table 3, an image in the original publication]
From the data in Tables 2 and 3:
[Figure BDA0002965787550000083: equation image in the original publication]
Further, the similarity S between the "important customer information table" and the "public customer information table" is:
[Figure BDA0002965787550000084: equation image in the original publication]
Based on the same inventive concept, the embodiment of the present invention further provides a data table management apparatus, as described in the following embodiments. Since the principle by which the apparatus solves the problem is similar to that of the data table management method, the implementation of the apparatus can refer to the implementation of the method, and repeated descriptions are omitted.
Fig. 4 is a structural diagram of a data table management apparatus according to an embodiment of the present invention, and as shown in fig. 4, the apparatus includes:
the information obtaining module 401 is configured to obtain subject domain information, table name information, field information, and data source information of a data table to be created;
an information searching module 402, configured to search the subject domain information of the to-be-created data table in a table resource pool;
a first information determining module 403, configured to allow the data table to be created to be newly added to the table resource pool if the subject domain information of the data table to be created does not exist in the table resource pool;
a second information determining module 404, configured to, if the subject domain information of the data table to be created exists in the table resource pool, extract table name information, field information, and data source information of a historical data table corresponding to the subject domain information in the table resource pool;
a similarity determining module 405, configured to determine similarity information between the data table to be created and the historical data table according to the table name information, the field information, and the data source information of the data table to be created, and the table name information, the field information, and the data source information of the historical data table;
and the data table management module 406 is configured to manage a data table according to the similarity information.
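The first four modules can be pictured with a minimal Python sketch. The `TableMeta` record, the dictionary-based resource pool and the function `find_candidates` are hypothetical names chosen for illustration, as are the field and data source values; the embodiment does not prescribe any particular data structure.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class TableMeta:
    """Information gathered by the information obtaining module (401)."""
    subject_domain: str
    table_name: str
    fields: List[str]
    data_sources: List[str]


# Hypothetical table resource pool: subject domain -> historical tables.
ResourcePool = Dict[str, List[TableMeta]]


def find_candidates(pool: ResourcePool, new_table: TableMeta) -> Optional[List[TableMeta]]:
    """Look up the subject domain (402). Return None when it is absent, meaning
    the table may be added directly (403); otherwise return the historical
    tables whose information must be compared (404). An empty list stored for
    a domain is treated like an absent domain, which is an assumption."""
    historical = pool.get(new_table.subject_domain)
    if not historical:
        return None
    return list(historical)


pool: ResourcePool = {
    "public": [TableMeta("public", "public customer information table",
                         ["customer id"], ["crm"])],
}
in_known_domain = TableMeta("public", "important customer information table",
                            ["customer id"], ["crm"])
in_new_domain = TableMeta("private", "private customer table", ["customer id"], ["crm"])

candidates = find_candidates(pool, in_known_domain)
no_candidates = find_candidates(pool, in_new_domain)
```

Here `candidates` holds the one historical table in the "public" domain, while `no_candidates` is `None` because the "private" domain is not yet in the pool.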
In one embodiment, the similarity determination module 405 is further configured to:
performing word splitting processing on the table name information of the data table to be built and the table name information of the historical data table respectively to obtain a first word splitting result corresponding to the data table to be built and a second word splitting result corresponding to the historical data table, and calculating Jaccard coefficients of the first word splitting result and the second word splitting result to obtain first similarity information;
performing word splitting processing on field information of a data table to be built and field information of a historical data table respectively to obtain a third word splitting result corresponding to the data table to be built and a fourth word splitting result corresponding to the historical data table, and calculating Jaccard coefficients of the third word splitting result and the fourth word splitting result to obtain second similarity information;
performing word splitting processing on data source information of a data table to be built and data source information of a historical data table respectively to obtain a fifth word splitting result corresponding to the data table to be built and a sixth word splitting result corresponding to the historical data table, and calculating Jaccard coefficients of the fifth word splitting result and the sixth word splitting result to obtain third similarity information;
accuracy evaluation is carried out on table name information, field information and data source information of the historical data table, and a first weight corresponding to the first similarity information, a second weight corresponding to the second similarity information and a third weight corresponding to the third similarity information are determined according to the accuracy evaluation result;
and determining the similarity information of the data table to be built and the historical data table according to the first similarity information and the corresponding first weight, the second similarity information and the corresponding second weight, and the third similarity information and the corresponding third weight.
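The steps of the similarity determining module can be sketched in Python as follows. The word-splitting rule (lowercasing and splitting on non-word characters and underscores) is an assumption that suits identifier-like names; Chinese names would need a proper word segmenter. The weighted sum used to combine the three similarities is likewise an assumed combination rule, and the table names in the example are hypothetical.

```python
import re
from typing import Iterable, Sequence, Set


def split_words(texts: Iterable[str]) -> Set[str]:
    """Assumed word splitting: lowercase, then split on non-word characters and '_'."""
    words: Set[str] = set()
    for text in texts:
        words.update(w for w in re.split(r"[\W_]+", text.lower()) if w)
    return words


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard coefficient |A & B| / |A | B|; two empty sets are given similarity 0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def weighted_similarity(similarities: Sequence[float], weights: Sequence[float]) -> float:
    """Combine per-dimension similarities with their weights (assumed weighted sum)."""
    if len(similarities) != len(weights):
        raise ValueError("similarities and weights must have the same length")
    return sum(s * w for s, w in zip(similarities, weights))


# Hypothetical table names, used only to exercise the functions.
new_name = split_words(["customer_order_detail"])
old_name = split_words(["customer_order_summary"])
name_similarity = jaccard(new_name, old_name)  # 2 shared words out of 4 distinct words

# Weights Q1, Q2, Q3 from the worked example; the field and source similarities
# (1.0 and 0.0) are placeholders, not values taken from Tables 2 and 3.
overall = weighted_similarity([name_similarity, 1.0, 0.0], [0.25, 0.30, 0.45])
```

Because the word-splitting results are handled as sets, repeated words in a name or field list do not inflate the coefficient.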
In one embodiment, the similarity determination module 405 is further configured to:
accuracy evaluation is carried out on the table name information, the field information and the data source information of the historical data table to obtain a first evaluation array corresponding to the table name information, a second evaluation array corresponding to the field information and a third evaluation array corresponding to the data source information;
respectively removing the highest value and the lowest value of each array in the first evaluation array, the second evaluation array and the third evaluation array;
summing each array with the highest value and the lowest value removed to obtain a first summation result, a second summation result and a third summation result;
and normalizing the first summation result, the second summation result and the third summation result to obtain a first weight corresponding to the first similarity information, a second weight corresponding to the second similarity information and a third weight corresponding to the third similarity information.
In one embodiment, the data table management module 406 is further configured to:
comparing the similarity information with a preset similarity threshold;
if the similarity information is larger than a preset similarity threshold, the data table to be built is not allowed to be newly added in a table resource pool;
and if the similarity information is less than or equal to a preset similarity threshold, allowing the data table to be newly added in a table resource pool.
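The threshold comparison can be written as a small Python check. The function name `may_add_table` and the threshold value 0.8 are hypothetical; applying the rule to every historical table in the subject domain, and rejecting the new table if any one of them exceeds the threshold, is an assumption about how multiple historical tables would be handled.

```python
from typing import Iterable


def may_add_table(similarities: Iterable[float], threshold: float) -> bool:
    """Allow the new table only if no similarity is strictly greater than the
    threshold; a similarity equal to the threshold still allows the table."""
    return all(s <= threshold for s in similarities)


THRESHOLD = 0.8  # hypothetical preset similarity threshold

allowed = may_add_table([0.425, 0.8], THRESHOLD)
rejected = may_add_table([0.425, 0.81], THRESHOLD)
```

The comparison is strict on the rejection side, mirroring the text: only a similarity larger than the threshold blocks the new table.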
In summary, in the embodiments of the present invention, the subject domain information, table name information, field information and data source information of a data table to be built are obtained; the subject domain information of the data table to be built is searched for in a table resource pool; if that subject domain information does not exist in the table resource pool, the data table is allowed to be newly added to the table resource pool; if it does exist, the table name information, field information and data source information of the historical data table corresponding to the subject domain information are extracted from the table resource pool; similarity information between the data table to be built and the historical data table is determined from the table name information, field information and data source information of both tables; and the data table is managed according to the similarity information. Because the subject domain information of the data table to be built is searched for first, and the table is allowed to be added when that subject domain is absent from the table resource pool, data tables are classified by subject domain. This avoids judging tables as duplicates when their names are similar but their subject domains have different business meanings, and effectively improves the accuracy of the similarity.
If the subject domain information of the data table to be built exists in the table resource pool, a further judgment is made according to the table name information, field information and data source information: the table name information, field information and data source information of the historical data table corresponding to the subject domain information are extracted from the table resource pool, the similarity information between the data table to be built and the historical data table is determined from this information for both tables, and the data table is managed according to the similarity information. The relationship between the data table to be built and the historical data table is thereby established effectively for managing the data in the database, and redundancy and wasted resources are avoided.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that the above embodiments are only specific embodiments of the present invention, used to illustrate its technical solutions rather than to limit them, and the protection scope of the present invention is not limited to them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that, within the technical scope disclosed herein, anyone skilled in the art may still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present invention, and they shall be construed as being included therein. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method for data table management, comprising:
obtaining subject domain information, table name information, field information and data source information of a data table to be built;
searching the subject domain information of the data table to be built in a table resource pool;
if the subject domain information of the data table to be built does not exist in the table resource pool, allowing the data table to be newly added in the table resource pool;
if the subject domain information of the data table to be built exists in the table resource pool, extracting the table name information, the field information and the data source information of the historical data table corresponding to the subject domain information in the table resource pool;
determining similarity information of the data table to be built and the historical data table according to the table name information, the field information and the data source information of the data table to be built and the table name information, the field information and the data source information of the historical data table;
and managing a data table according to the similarity information.
2. The data table management method of claim 1, wherein determining similarity information between the data table to be created and the historical data table according to the table name information, the field information and the data source information of the data table to be created and the table name information, the field information and the data source information of the historical data table comprises:
performing word splitting processing on the table name information of the data table to be built and the table name information of the historical data table respectively to obtain a first word splitting result corresponding to the data table to be built and a second word splitting result corresponding to the historical data table, and calculating Jaccard coefficients of the first word splitting result and the second word splitting result to obtain first similarity information;
performing word splitting processing on field information of a data table to be built and field information of a historical data table respectively to obtain a third word splitting result corresponding to the data table to be built and a fourth word splitting result corresponding to the historical data table, and calculating Jaccard coefficients of the third word splitting result and the fourth word splitting result to obtain second similarity information;
performing word splitting processing on data source information of a data table to be built and data source information of a historical data table respectively to obtain a fifth word splitting result corresponding to the data table to be built and a sixth word splitting result corresponding to the historical data table, and calculating Jaccard coefficients of the fifth word splitting result and the sixth word splitting result to obtain third similarity information;
accuracy evaluation is carried out on table name information, field information and data source information of the historical data table, and a first weight corresponding to the first similarity information, a second weight corresponding to the second similarity information and a third weight corresponding to the third similarity information are determined according to the accuracy evaluation result;
and determining the similarity information of the data table to be built and the historical data table according to the first similarity information and the corresponding first weight, the second similarity information and the corresponding second weight, and the third similarity information and the corresponding third weight.
3. The data table management method of claim 2, wherein performing accuracy evaluation on the table name information, the field information, and the data source information of the historical data table, and determining a first weight corresponding to the first similarity information, a second weight corresponding to the second similarity information, and a third weight corresponding to the third similarity information according to a result of the accuracy evaluation, comprises:
accuracy evaluation is carried out on the table name information, the field information and the data source information of the historical data table to obtain a first evaluation array corresponding to the table name information, a second evaluation array corresponding to the field information and a third evaluation array corresponding to the data source information;
respectively removing the highest value and the lowest value of each array in the first evaluation array, the second evaluation array and the third evaluation array;
summing each array with the highest value and the lowest value removed to obtain a first summation result, a second summation result and a third summation result;
and normalizing the first summation result, the second summation result and the third summation result to obtain a first weight corresponding to the first similarity information, a second weight corresponding to the second similarity information and a third weight corresponding to the third similarity information.
4. The method of claim 1, wherein managing a data table according to the similarity information comprises:
comparing the similarity information with a preset similarity threshold;
if the similarity information is larger than a preset similarity threshold, the data table to be built is not allowed to be newly added in a table resource pool;
and if the similarity information is less than or equal to a preset similarity threshold, allowing the data table to be newly added in a table resource pool.
5. A data table management apparatus, comprising:
the information acquisition module is used for acquiring the subject domain information, the table name information, the field information and the data source information of the data table to be built;
the information searching module is used for searching the subject domain information of the data table to be built in a table resource pool;
the first information judgment module is used for allowing the data table to be newly added in the table resource pool if the subject domain information of the data table to be built does not exist in the table resource pool;
the second information judgment module is used for extracting the table name information, the field information and the data source information of the historical data table corresponding to the subject domain information in the table resource pool if the subject domain information of the data table to be built exists in the table resource pool;
the similarity determining module is used for determining similarity information between the data table to be built and the historical data table according to the table name information, the field information and the data source information of the data table to be built and the table name information, the field information and the data source information of the historical data table;
and the data table management module is used for managing the data table according to the similarity information.
6. The data table management apparatus of claim 5, wherein the similarity determination module is further configured to:
performing word splitting processing on the table name information of the data table to be built and the table name information of the historical data table respectively to obtain a first word splitting result corresponding to the data table to be built and a second word splitting result corresponding to the historical data table, and calculating Jaccard coefficients of the first word splitting result and the second word splitting result to obtain first similarity information;
performing word splitting processing on field information of a data table to be built and field information of a historical data table respectively to obtain a third word splitting result corresponding to the data table to be built and a fourth word splitting result corresponding to the historical data table, and calculating Jaccard coefficients of the third word splitting result and the fourth word splitting result to obtain second similarity information;
performing word splitting processing on data source information of a data table to be built and data source information of a historical data table respectively to obtain a fifth word splitting result corresponding to the data table to be built and a sixth word splitting result corresponding to the historical data table, and calculating Jaccard coefficients of the fifth word splitting result and the sixth word splitting result to obtain third similarity information;
accuracy evaluation is carried out on table name information, field information and data source information of the historical data table, and a first weight corresponding to the first similarity information, a second weight corresponding to the second similarity information and a third weight corresponding to the third similarity information are determined according to the accuracy evaluation result;
and determining the similarity information of the data table to be built and the historical data table according to the first similarity information and the corresponding first weight, the second similarity information and the corresponding second weight, and the third similarity information and the corresponding third weight.
7. The data table management apparatus of claim 6, wherein the similarity determination module is further configured to:
accuracy evaluation is carried out on the table name information, the field information and the data source information of the historical data table to obtain a first evaluation array corresponding to the table name information, a second evaluation array corresponding to the field information and a third evaluation array corresponding to the data source information;
respectively removing the highest value and the lowest value of each array in the first evaluation array, the second evaluation array and the third evaluation array;
summing each array with the highest value and the lowest value removed to obtain a first summation result, a second summation result and a third summation result;
and normalizing the first summation result, the second summation result and the third summation result to obtain a first weight corresponding to the first similarity information, a second weight corresponding to the second similarity information and a third weight corresponding to the third similarity information.
8. The data table management apparatus of claim 5, wherein the data table management module is further configured to:
comparing the similarity information with a preset similarity threshold;
if the similarity information is larger than a preset similarity threshold, the data table to be built is not allowed to be newly added in a table resource pool;
and if the similarity information is less than or equal to a preset similarity threshold, allowing the data table to be newly added in a table resource pool.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any one of claims 1 to 4 when executing the computer program.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program for executing the method of any one of claims 1 to 4.
CN202110250393.1A 2021-03-08 2021-03-08 Data table management method and device Active CN113010517B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110250393.1A CN113010517B (en) 2021-03-08 2021-03-08 Data table management method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110250393.1A CN113010517B (en) 2021-03-08 2021-03-08 Data table management method and device

Publications (2)

Publication Number Publication Date
CN113010517A true CN113010517A (en) 2021-06-22
CN113010517B CN113010517B (en) 2024-02-09

Family

ID=76408263

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110250393.1A Active CN113010517B (en) 2021-03-08 2021-03-08 Data table management method and device

Country Status (1)

Country Link
CN (1) CN113010517B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202091A (en) * 2015-05-04 2016-12-07 阿里巴巴集团控股有限公司 A kind of field method to set up and device
WO2017113886A1 (en) * 2015-12-30 2017-07-06 华为技术有限公司 Data cleaning method and device
CN110532273A (en) * 2019-08-30 2019-12-03 北京明略软件***有限公司 The processing method and processing device of tables of data, storage medium, electronic device
CN112241421A (en) * 2019-07-18 2021-01-19 天云融创数据科技(北京)有限公司 Data blood margin determination method and device

Also Published As

Publication number Publication date
CN113010517B (en) 2024-02-09

Similar Documents

Publication Publication Date Title
WO2019214245A1 (en) Information pushing method and apparatus, and terminal device and storage medium
US20200089650A1 (en) Techniques for automated data cleansing for machine learning algorithms
CN111986792B (en) Medical institution scoring method, device, equipment and storage medium
CN110309251B (en) Text data processing method, device and computer readable storage medium
CN112307133A (en) Security protection method and device, computer equipment and storage medium
CN111967521A (en) Cross-border active user identification method and device
CN117520503A (en) Financial customer service dialogue generation method, device, equipment and medium based on LLM model
CN116881430A (en) Industrial chain identification method and device, electronic equipment and readable storage medium
CN111639077A (en) Data management method and device, electronic equipment and storage medium
CN108830302B (en) Image classification method, training method, classification prediction method and related device
CN111125329B (en) Text information screening method, device and equipment
CN110399464B (en) Similar news judgment method and system and electronic equipment
CN112800020A (en) Data processing method and device and computer readable storage medium
CN117390132A (en) Method, system and storage medium for managing data and API
US20160335300A1 (en) Searching Large Data Space for Statistically Significant Patterns
CN113010517B (en) Data table management method and device
CN116503608A (en) Data distillation method based on artificial intelligence and related equipment
JP5506629B2 (en) Quasi-frequent structure pattern mining apparatus, frequent structure pattern mining apparatus, method and program thereof
CN104484330A (en) Pre-selecting method and device of spam comments based on grading keyword threshold combination evaluation
CN115422000A (en) Abnormal log processing method and device
CN114528378A (en) Text classification method and device, electronic equipment and storage medium
WO2018100700A1 (en) Data conversion device and data conversion method
CN113760864A (en) Data model generation method and device
CN114756685A (en) Complaint risk identification method and device for complaint sheet
CN109299260B (en) Data classification method, device and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant