CN110008193B

CN110008193B - Data standardization method and device

Info

Publication number: CN110008193B
Application number: CN201910304451.7A
Authority: CN
Inventors: 刘俊良; 廖华琛; 王怡君; 王双
Original assignee: Chengdu Sefon Software Co Ltd
Current assignee: Chengdu Sefon Software Co Ltd
Priority date: 2019-04-16
Filing date: 2019-04-16
Publication date: 2021-06-18
Anticipated expiration: 2039-04-16
Also published as: CN110008193A

Abstract

The application provides a data standardization method and device. The metadata of a business database is compared in turn with the metadata of each of a plurality of industry standard libraries, and metadata that is the same in both is identified as similar metadata. For difference metadata, that is, metadata in the business database that differs from the industry standard library, the similarity between the data corresponding to the difference metadata and the sample data prestored in the industry standard library is calculated. The metadata corresponding to sample data whose similarity is larger than a preset threshold value is also identified as similar metadata in the industry standard library. Finally, the quantity of metadata identified as similar metadata in each industry standard library is counted, and the industry standard library with the maximum quantity is determined as the industry standard library closest to the business database.

Description

Data standardization method and device

Technical Field

The present application relates to the field of data processing, and in particular, to a data normalization method and apparatus.

Background

With the spread and development of information technology, governments and enterprises are becoming increasingly informatized, and the volume of business data keeps growing. Faced with large amounts of business data, building accurate and normative data models efficiently and quickly has become a trend. However, given the large number of industry standards, establishing the relationship between actual business data and the existing standards by manual identification takes a great deal of time and effort.

Disclosure of Invention

In order to overcome at least one of the deficiencies in the prior art, an object of the present application is to provide a data standardization method applied to a data processing device, where the data processing device has a plurality of industry standard libraries prestored therein, and the industry standard libraries have sample data prestored therein; the method comprises the following steps:

acquiring a business database;

for each industry standard library, comparing the metadata of the industry standard library with the metadata of the business database;

identifying metadata in the industry standard library which is the same as the metadata in the business database as similar metadata;

for difference metadata, that is, metadata in the business database that differs from the metadata in the industry standard library, calculating the similarity between data corresponding to the difference metadata and sample data in the industry standard library, and identifying the metadata corresponding to the sample data with the data similarity exceeding a preset threshold value as similar metadata in the industry standard library;

and counting the quantity of the metadata which are identified as the similar metadata in each industry standard library, and determining the industry standard library with the maximum quantity as the industry standard library which is closest to the business database.

Optionally, the step of calculating similarity between data corresponding to the difference metadata and sample data in the industry standard library includes:

and calculating the similarity between the data corresponding to the difference metadata and the sample data in the industry standard library through an artificial neural network.

Optionally, the method further comprises:

creating a standard information database according to the similar metadata in the closest industry standard library;

and acquiring data corresponding to similar metadata in the closest industry standard library from the business database, and storing the data into the standard information database.

Optionally, the data processing device further includes an industry shared information base, and the method further includes:

comparing the metadata of the industry shared information base with the metadata of the standard information database to determine shared metadata, that is, metadata in the standard information database that is the same as metadata in the industry shared information base;

and creating a shared data table according to the data corresponding to the shared metadata.

Optionally, the method further comprises:

and providing a corresponding interface for each shared data table, so that other equipment acquires the data in the shared data table through the interface.

Optionally, the metadata includes a field name, and the step of identifying metadata in the industry standard library that is the same as the business database as similar metadata includes:

and identifying the field names in the industry standard library that are the same as the field names in the business database as similar metadata.

Optionally, the metadata further includes a table name, a field type, and a field length.

Another objective of the embodiments of the present application is to provide a data standardization apparatus, which is applied to a data processing device, where the data processing device has a plurality of industry standard libraries prestored therein, and the industry standard libraries have sample data prestored therein, and the data standardization apparatus includes an obtaining module, a comparing module, an identifying module, a similarity calculation module, and a statistics module;

the acquisition module is used for acquiring a business database;

the comparison module is used for comparing the metadata of the industry standard library with the metadata of the business database aiming at each industry standard library;

the identification module is used for identifying the metadata in the industry standard library, which is the same as the metadata in the business database, as similar metadata;

the similarity calculation module is used for calculating, for difference metadata in the business database that differs from the metadata in the industry standard library, the similarity between data corresponding to the difference metadata and sample data in the industry standard library, and identifying the metadata corresponding to the sample data with the data similarity exceeding a preset threshold value as similar metadata in the industry standard library;

the statistical module is used for counting the quantity of the metadata marked as the similar metadata in each industry standard library, and determining the industry standard library with the largest quantity as the industry standard library closest to the business database.

Optionally, the comparison module compares the metadata of the industry standard library with the metadata of the business database by:

Optionally, the data normalization apparatus further includes a creation module and a writing module;

the creating module is used for creating a standard information database according to the similar metadata in the closest industry standard library;

and the writing module is used for acquiring data corresponding to similar metadata in the closest industry standard library from the business database and storing the data into the standard information database.

Compared with the prior art, the method has the following beneficial effects:

the embodiments of the application provide a data standardization method and device. The metadata of a business database is compared in turn with the metadata of each of a plurality of industry standard libraries, and metadata that is the same in both is identified as similar metadata. For difference metadata, that is, metadata in the business database that differs from the industry standard library, the similarity between the data corresponding to the difference metadata and the sample data prestored in the industry standard library is calculated. The metadata corresponding to sample data whose similarity is larger than a preset threshold value is also identified as similar metadata in the industry standard library. Finally, the quantity of metadata identified as similar metadata in each industry standard library is counted, and the industry standard library with the maximum quantity is determined as the industry standard library closest to the business database.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.

Fig. 1 is a schematic block diagram of a data processing apparatus according to an embodiment of the present application;

Fig. 2 is a flow chart illustrating steps of a data normalization method according to an embodiment of the present application;

Fig. 3 is a schematic diagram illustrating a comparison between a service data table and an industry standard data table provided in an embodiment of the present application;

Fig. 4 is a schematic structural diagram of a data normalization apparatus according to an embodiment of the present application;

Fig. 5 is a second schematic structural diagram of a data normalization apparatus according to an embodiment of the present application.

Reference numerals: 100 - data processing device; 130 - processor; 120 - memory; 110 - data normalization apparatus; 500 - service data table; 600 - industry standard data table; 1101 - acquisition module; 1102 - comparison module; 1103 - identification module; 1104 - similarity calculation module; 1105 - statistics module; 1106 - creation module; 1107 - writing module.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations.

Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.

Referring to fig. 1, fig. 1 is a block diagram of a data processing apparatus 100 according to an embodiment of the present disclosure, where the data processing apparatus 100 includes a data normalization device 110, a memory 120, and a processor 130.

The memory 120 and the processor 130 are electrically connected to each other, directly or indirectly, to enable data transmission or interaction. For example, the components may be electrically connected to each other via one or more communication buses or signal lines. The data normalization apparatus 110 includes at least one software function module which can be stored in the memory 120 in the form of software or firmware, or solidified in an Operating System (OS) of the data processing device 100. The processor 130 is used for executing executable modules stored in the memory 120, such as software functional modules and computer programs included in the data normalization apparatus 110.

The data processing device 100 may be, but is not limited to, a smart phone, a Personal Computer (PC), a tablet PC, a Personal Digital Assistant (PDA), a Mobile Internet Device (MID), and the like.

The memory 120 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like. The memory 120 is used for storing a program, and the processor 130 executes the program after receiving the execution instruction.

The processor 130 may be an integrated circuit chip having signal processing capabilities. The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor may implement or perform the various methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.

Referring to fig. 2, fig. 2 is a flowchart illustrating steps of a data normalization method applied to the data processing apparatus 100 shown in fig. 1, wherein the data processing apparatus 100 is pre-stored with a plurality of industry standard libraries, and the industry standard libraries are pre-stored with sample data; the individual steps of the data normalization method are described in detail below.

Step S100, acquiring a business database.

Optionally, an industry standard library is a database that records data typical of a particular industry. For example, in one possible example, the industry standard library of the education industry includes data such as student name, student class, student gender, and student score. The industry standard library of the financial industry includes data such as principal, interest rate, depositor name, gender, and age. The data processing device 100 connects to the business database and obtains its metadata, where the metadata of the business database includes a database name, a table name, a field name, and a field type.
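As an illustration of step S100, the following is a minimal sketch of reading table names, field names, and field types from a database. It assumes the business database is a SQLite file, which the application does not state; the function name `read_metadata` and the `person` table are hypothetical.

```python
import sqlite3

def read_metadata(conn):
    """Return {table_name: [(field_name, field_type), ...]} for a SQLite database."""
    metadata = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table_name,) in tables:
        # PRAGMA table_info returns one row per column:
        # (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table_name}")').fetchall()
        metadata[table_name] = [(col[1], col[2]) for col in columns]
    return metadata

# Hypothetical business database with a single table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (age INTEGER, firstname TEXT, lastname TEXT)")
business_metadata = read_metadata(conn)
```

Other database systems expose the same information through their own catalog views, so only the two queries would need to change.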

Step S200, for each industry standard library, comparing the metadata of the industry standard library with the metadata of the business database.

Step S300, identifying the metadata in the industry standard library, which is the same as the metadata in the business database, as similar metadata.

Optionally, for each industry standard library, the data processing device 100 uses the industry standard library as a target industry standard library, and compares the metadata in the business database with the metadata in the target industry standard library to find out the same metadata. The data processing device 100 marks the same metadata as similar metadata. For example, referring to fig. 3, in one possible example, the metadata includes a field name. The service data table 500 includes field names "age", "firstname", and "lastname". The industry standard data table 600 includes field names "age", "number", and "name". The data processing device 100 compares the service data table 500 with the industry standard data table 600; the "age" field names are the same, so the "age" field is marked as similar metadata.
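The exact-match comparison of steps S200 and S300 can be sketched as follows, using the field names from the fig. 3 example. The function name `split_by_name` is hypothetical, and the sketch compares field names only, ignoring table name, field type, and field length.

```python
def split_by_name(business_fields, standard_fields):
    """Split business field names into those also present in the standard
    library (same metadata) and those that are not (difference metadata)."""
    standard = set(standard_fields)
    same = [f for f in business_fields if f in standard]
    difference = [f for f in business_fields if f not in standard]
    return same, difference

business_fields = ["age", "firstname", "lastname"]   # service data table 500
standard_fields = ["age", "number", "name"]          # industry standard data table 600
same, difference = split_by_name(business_fields, standard_fields)
```

Here `same` holds the field marked as similar metadata, and `difference` holds the fields that go on to the data similarity calculation.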

Optionally, to further ensure that the data corresponding to the same metadata in the business database and the industry standard library is also similar, the data processing device 100 performs similarity calculation on the data corresponding to the same metadata in the business database and in the industry standard library, and identifies the metadata with a similarity larger than a preset threshold as similar metadata. For example, referring to fig. 3, the data processing device 100 performs similarity calculation between the data corresponding to the "age" field in the business database and the data corresponding to the "age" field in the industry standard library.

Comparing whether the metadata is the same quickly screens out the metadata that the business database and the industry standard library have in common. However, field names for the same data may differ because they were chosen by different developers; for example, different developers may name the field holding a student's exam result "score" or "achievement". A simple metadata comparison cannot determine whether the two are similar.

Step S400, for difference metadata, that is, metadata in the business database that differs from the metadata in the industry standard library, calculating the similarity between data corresponding to the difference metadata and sample data in the industry standard library, and identifying the metadata corresponding to the sample data with the data similarity exceeding a preset threshold value as similar metadata in the industry standard library.

Optionally, there may be fields in the business database whose field names differ from those in the industry standard library but whose actual data is similar. The data processing device 100 performs similarity calculation between the data corresponding to the difference metadata in the business database and all sample data in the industry standard library, and identifies the metadata corresponding to the sample data with the data similarity exceeding a preset threshold value as similar metadata in the industry standard library.
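The application does not fix a particular similarity measure at this point. Purely to illustrate what "similarity between data" could look like, the sketch below swaps in a simple stand-in: each column is summarized by a small hand-made feature vector and two columns are compared by cosine similarity. The features, the function names, and the sample values are all assumptions and are not taken from the application.

```python
import math

def column_profile(values):
    """Summarize a column as [share of digits, share of letters, scaled mean length]."""
    texts = [str(v) for v in values]
    total_chars = sum(len(t) for t in texts) or 1
    digits = sum(ch.isdigit() for t in texts for ch in t)
    letters = sum(ch.isalpha() for t in texts for ch in t)
    mean_len = sum(len(t) for t in texts) / max(len(texts), 1)
    return [digits / total_chars, letters / total_chars, min(mean_len / 32, 1.0)]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def data_similarity(business_values, sample_values):
    """Stand-in similarity between a business column and standard-library sample data."""
    return cosine_similarity(column_profile(business_values),
                             column_profile(sample_values))

# Made-up values for illustration.
lastname_values = ["Liu", "Wang", "Liao"]
name_samples = ["Zhang", "Chen"]
age_samples = [23, 41, 35]
sim_to_name = data_similarity(lastname_values, name_samples)
sim_to_age = data_similarity(lastname_values, age_samples)
```

With these made-up values the surname column scores far closer to the name samples than to the age samples, which is the behavior the step relies on. A learned model could replace `data_similarity` without changing the callers.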

In one embodiment provided by the present application, the data processing apparatus 100 inputs data corresponding to the difference metadata and all sample data in the industry standard library into an artificial neural network, and calculates a similarity between the data corresponding to each difference metadata and the sample data corresponding to each metadata in the industry standard library. The data processing apparatus 100 identifies metadata corresponding to sample data having a similarity greater than a preset threshold as similar metadata.

In another embodiment provided by the present application, the data processing device 100 sequentially selects target difference metadata from the difference metadata, performs similarity calculation between the data corresponding to the target difference metadata and the sample data corresponding to each metadata in the industry standard library, and identifies the metadata corresponding to sample data with a similarity greater than a preset threshold as similar metadata. Referring again to fig. 3, the difference metadata in the service data table 500 are "lastname" and "firstname". The data processing device 100 calculates the similarity between the data corresponding to the "lastname" field and the sample data corresponding to the "age" field, the "number" field, and the "name" field in the industry standard data table 600, respectively. The data processing device 100 then does the same for the data corresponding to the "firstname" field. Suppose the similarities between the "lastname" field and the "age" field, the "number" field, and the "name" field are 0.2, 0.1, and 0.7 respectively, and the preset similarity threshold is 0.6. The data processing device 100 then identifies the "name" field in the industry standard data table 600 as the similar field corresponding to the "lastname" field.
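The threshold decision in this example can be sketched as follows. The scores 0.2, 0.1, and 0.7 and the threshold 0.6 are the ones given in the text; the function name `match_by_threshold` is hypothetical, and how the scores are computed is left out.

```python
def match_by_threshold(scores, threshold):
    """Return the standard-library fields whose similarity exceeds the threshold."""
    return [field for field, score in scores.items() if score > threshold]

# Similarity of the "lastname" data to each field's sample data in table 600.
lastname_scores = {"age": 0.2, "number": 0.1, "name": 0.7}
matches = match_by_threshold(lastname_scores, threshold=0.6)
```

Only "name" exceeds 0.6, so it is the only field marked as similar metadata for "lastname".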

Step S500, counting the quantity of the metadata marked as the similar metadata in each industry standard library, and determining the industry standard library with the largest quantity as the industry standard library closest to the business database.

Optionally, since the data processing device 100 has a plurality of industry standard libraries prestored therein, it counts the number of metadata marked as similar metadata in each industry standard library, and determines the industry standard library with the largest number of similar metadata as the industry standard library closest to the business database.
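Step S500 amounts to counting and taking a maximum, as in the sketch below. The library names and their similar-metadata sets are made up for illustration, and the application does not say how ties are resolved; this sketch simply keeps the first library with the highest count.

```python
def closest_standard_library(similar_metadata_by_library):
    """Count similar metadata per library and return the library with the most."""
    counts = {name: len(fields)
              for name, fields in similar_metadata_by_library.items()}
    # max() returns the first key with the highest count if there is a tie.
    closest = max(counts, key=counts.get)
    return closest, counts

# Hypothetical result of steps S300 and S400 for two libraries.
similar = {"education": {"age", "name"}, "finance": {"age"}}
closest, counts = closest_standard_library(similar)
```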

Optionally, the data processing device 100 creates a standard information database from similar metadata in the closest industry standard library. The data processing apparatus 100 acquires data corresponding to similar metadata in the closest industry standard library from the business database and stores the data in the standard information database.

Referring to fig. 3 again, the data processing device 100 extracts the "name" field and the "age" field from the industry standard library, creates a standard information database according to the "name" field and the "age" field, and stores the data corresponding to the "age" field and the "lastname" field in the service data table 500 into the standard information database. It should be noted that when the data processing device 100 stores the data in the service data table 500 into the standard information database, it performs corresponding processing if the data type or the data length is different.
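A minimal sketch of creating the standard information database and copying the matched data into it is given here, assuming SQLite as the storage engine. The table names `business` and `standard_info`, the function `build_standard_table`, and the sample rows are hypothetical, and the type or length conversion mentioned above is omitted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE business (age INTEGER, firstname TEXT, lastname TEXT)")
conn.executemany(
    "INSERT INTO business VALUES (?, ?, ?)",
    [(30, "Ann", "Lee"), (41, "Bo", "Chen")],  # made-up rows
)

# Standard field name -> business field that supplies its data (from fig. 3).
field_mapping = {"age": "age", "name": "lastname"}

def build_standard_table(conn, mapping, source="business", target="standard_info"):
    """Create a table named after the standard fields and fill it from the source."""
    standard_cols = ", ".join(f'"{c}"' for c in mapping)
    source_cols = ", ".join(f'"{c}"' for c in mapping.values())
    conn.execute(f'CREATE TABLE "{target}" ({standard_cols})')
    conn.execute(
        f'INSERT INTO "{target}" ({standard_cols}) '
        f'SELECT {source_cols} FROM "{source}"'
    )

build_standard_table(conn, field_mapping)
rows = conn.execute("SELECT age, name FROM standard_info ORDER BY age").fetchall()
```

The mapping keeps the standard field names as the column names of the new table, so downstream consumers see "name" rather than "lastname".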

Optionally, the data processing device 100 further includes an industry shared information base. The metadata of the industry shared information base is compared with the metadata of the standard information database to determine shared metadata, that is, metadata in the standard information database that is the same as metadata in the industry shared information base. The data processing device 100 creates a shared data table from the data corresponding to the shared metadata.

Optionally, for each shared data table, a corresponding interface is provided, so that other devices can access the data in the shared data table through the interface.
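The shared data table and its interface can be sketched as follows. The application does not specify the form of the interface, so a plain function returning copies of the rows stands in for it here; all names and sample rows are hypothetical.

```python
def shared_metadata(standard_fields, shared_base_fields):
    """Fields of the standard information database also present in the shared base."""
    shared_base = set(shared_base_fields)
    return [f for f in standard_fields if f in shared_base]

def build_shared_table(rows, shared_fields):
    """Keep only the shared fields of each row."""
    return [{f: row[f] for f in shared_fields} for row in rows]

def make_interface(shared_table):
    """Return a read-only accessor for one shared data table."""
    def get_rows():
        # Hand out copies so callers cannot modify the shared table.
        return [dict(row) for row in shared_table]
    return get_rows

# Made-up rows of the standard information database.
standard_rows = [{"age": 30, "name": "Lee"}, {"age": 41, "name": "Chen"}]
shared_fields = shared_metadata(["age", "name"], ["name", "id_number"])
shared_table = build_shared_table(standard_rows, shared_fields)
get_shared_rows = make_interface(shared_table)
```

In a deployment the accessor would more likely sit behind a network endpoint, but the separation between the table and its interface stays the same.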

The embodiment of the present application further provides a data normalization apparatus 110, which is applied to the data processing device 100, wherein the data processing device 100 pre-stores a plurality of industry standard libraries, and the industry standard libraries pre-store sample data. Referring to fig. 4, the data normalization apparatus 110 includes an obtaining module 1101, a comparing module 1102, an identifying module 1103, a similarity calculating module 1104, and a statistics module 1105.

The obtaining module 1101 is configured to obtain a business database.

In the present embodiment, the obtaining module 1101 is configured to execute step S100 in fig. 2, and reference may be made to the detailed description of step S100 for a detailed description of the obtaining module 1101.

The comparing module 1102 is configured to compare, for each industry standard library, metadata of the industry standard library with metadata of the business database.

In this embodiment, the comparing module 1102 is configured to perform step S200 in fig. 2, and the detailed description about the comparing module 1102 may refer to the detailed description about step S200.

The identifying module 1103 is configured to identify metadata in the industry standard library that is the same as the metadata in the business database as similar metadata.

In this embodiment, the identification module 1103 is configured to perform step S300 in fig. 2, and the detailed description about the identification module 1103 may refer to the detailed description of step S300.

The similarity calculation module 1104 is configured to calculate, for difference metadata in the business database that differs from the metadata in the industry standard library, a similarity between data corresponding to the difference metadata and sample data in the industry standard library, and identify, in the industry standard library, metadata corresponding to sample data whose data similarity exceeds a preset threshold as similar metadata.

In the present embodiment, the similarity calculation module 1104 is configured to execute step S400 in fig. 2, and reference may be made to the detailed description of step S400 for a detailed description of the similarity calculation module 1104.

The statistics module 1105 is configured to count the number of metadata identified as the similar metadata in each of the industry standard libraries, and determine the industry standard library with the largest number as the industry standard library closest to the business database.

In this embodiment, the statistics module 1105 is configured to execute step S500 in fig. 2, and reference may be made to the detailed description of step S500 for a detailed description of the statistics module 1105.

Optionally, the comparing module 1102 compares the metadata of the industry standard library with the metadata of the business database by:

Referring to fig. 5 again, the data normalization apparatus 110 further includes a creation module 1106 and a writing module 1107.

The creation module 1106 is configured to create a standards information database based on similar metadata in the closest industry standards library.

The writing module 1107 is configured to obtain data corresponding to the similar metadata in the closest industry standard library from the business database, and store the data in the standard information database.

To sum up, the embodiments of the present application provide a data normalization method and apparatus. The metadata of a business database is compared in turn with the metadata of each of a plurality of industry standard libraries, and metadata that is the same in both is identified as similar metadata. For difference metadata, that is, metadata in the business database that differs from the industry standard library, the similarity between the data corresponding to the difference metadata and the sample data prestored in the industry standard library is calculated. The metadata corresponding to sample data whose similarity is larger than a preset threshold value is also identified as similar metadata in the industry standard library. Finally, the quantity of metadata identified as similar metadata in each industry standard library is counted, and the industry standard library with the maximum quantity is determined as the industry standard library closest to the business database.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.

The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, or the part of it that contributes over the prior art, may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The above description is only for various embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the present application, and all such changes or substitutions are included in the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. The data standardization method is characterized by being applied to data processing equipment, wherein a plurality of industry standard libraries are prestored in the data processing equipment, and sample data are prestored in the industry standard libraries; the method comprises the following steps:

acquiring a service database;

for each industry standard library, comparing the metadata of the industry standard library with the metadata of the business database; the metadata includes a field name;

2. The method of claim 1, wherein the step of calculating the similarity between the data corresponding to the difference metadata and the sample data in the industry standard library comprises:

3. The method of data normalization of claim 1, further comprising:

4. The data normalization method of claim 3, wherein the data processing device further comprises an industry shared information base, the method further comprising:

5. The method of claim 4, further comprising:

6. The data normalization method of claim 1, wherein the step of identifying metadata in the industry standard repository that is the same as the business database as similar metadata comprises:

7. The data normalization method of claim 1, wherein the metadata further includes a table name, a field type, and a field length.

8. A data standardization device is applied to data processing equipment, a plurality of industry standard libraries are prestored in the data processing equipment, sample data are prestored in the industry standard libraries, and the data standardization device comprises an acquisition module, a comparison module, an identification module, a similarity calculation module and a statistic module;

the acquisition module is used for acquiring a service database;

the comparison module is used for comparing the metadata of the industry standard library with the metadata of the business database aiming at each industry standard library; the metadata includes a field name;

9. The data normalization apparatus of claim 8, wherein the comparison module compares the metadata of the industry standard library with the metadata of the business database by:

10. The data normalization apparatus of claim 8, wherein the data normalization apparatus further comprises a creation module, a write module;