CN110287191B - Data alignment method and device, storage medium and electronic device

Info

Publication number: CN110287191B
Application number: CN201910557282.8A
Authority: CN
Inventors: 接钧靖; 张毅然; 王建伟
Original assignee: Beijing Mininglamp Software System Co ltd
Current assignee: Beijing Mininglamp Software System Co ltd
Priority date: 2019-06-25
Filing date: 2019-06-25
Publication date: 2021-07-27
Anticipated expiration: 2039-06-25
Also published as: CN110287191A

Abstract

The invention provides a data alignment method and device, a storage medium and an electronic device, wherein the method comprises the following steps: acquiring first field information in a received first data record table and first bag-of-words information corresponding to the field information, wherein the first bag-of-words information is used for representing the probability of the first field information appearing in a database; for each of a plurality of second data record tables in the database, determining a similarity probability between the first data record table and the plurality of second data record tables according to the first field information and the first bag-of-words information, and second field information and second bag-of-words information in the second data record tables; and aligning the first description information of the first data record table with the second description information of the second data record table with the similarity probability exceeding a first threshold value.

Description

Data alignment method and device, storage medium and electronic device

Technical Field

The invention relates to the field of computers, in particular to a data alignment method and device, a storage medium and an electronic device.

Background

With the development of computer technology, more and more attention is being paid to the analysis and mining of relational data in order to obtain data analysis results about that data. However, inconsistent data standards among different data resources cause data quality problems and seriously affect the reliability of the data analysis results.

In the related art, no effective technical solution has yet been proposed for the problem that data cannot be aligned effectively because the standards of different data resources are inconsistent, which affects the reliability of data analysis results.

Disclosure of Invention

The embodiment of the invention provides a data alignment method and device, a storage medium and an electronic device, and at least solves the problem that in the related technology, the data cannot be aligned effectively due to the fact that standards of different data resources are inconsistent, and the reliability of a data analysis result is affected.

According to an embodiment of the present invention, there is provided a data alignment method including: acquiring first field information in a received first data record table and first bag-of-words information corresponding to the field information, wherein the first bag-of-words information is used for representing the probability of the first field information appearing in a database; for each of a plurality of second data record tables in the database, determining a similarity probability between the first data record table and the plurality of second data record tables according to the first field information and the first bag-of-words information, and second field information and second bag-of-words information in the second data record tables; and aligning the first description information of the first data record table with the second description information of the second data record table with the similarity probability exceeding a first threshold value.

Optionally, determining the similarity probability between the first data record table and the plurality of second data record tables according to the first field information and the first bag-of-words information, and the second field information and the second bag-of-words information in the second data record tables, includes:

multiplying the value of the first field information by the first bag-of-words information to obtain the similarity probability, wherein the value of the first field information is a second threshold value when the first field information exists in the second field information in the second data record table, and is a third threshold value when the first field information does not exist in the second field information in the second data record table.

Optionally, aligning the first description information of the first data record table with the second description information of the second data record table with the similarity probability exceeding a first threshold includes:

aligning at least one of the following items of the first data record table: table name information, field information, and data format, with at least one of the following items of the second data record table whose similarity probability exceeds the first threshold: table name information, field information, and data format.

Optionally, before the first field information in the received first data record table and the first bag-of-words information corresponding to the field information are acquired, the method further includes:

receiving the first data record table, wherein the first data record table comprises first field information;

and establishing first bag-of-words information corresponding to the first field information.

According to another embodiment of the present invention, there is also provided a data alignment apparatus including:

the acquisition module is used for acquiring first field information in a received first data record table and first bag-of-words information corresponding to the field information, wherein the first bag-of-words information is used for representing the probability of the first field information appearing in a database;

a first processing module, configured to determine, for each of a plurality of second data record tables in the database, a similarity probability between the first data record table and the plurality of second data record tables according to the first field information and the first bag-of-words information, and second field information and second bag-of-words information in the second data record tables;

and the second processing module is used for aligning the first description information of the first data record table with the second description information of the second data record table of which the similarity probability exceeds a first threshold value.

Optionally, the first processing module is further configured to multiply the value of the first field information and the first bag-of-words information to obtain the similarity probability, where the value of the first field information is a second threshold value when the first field information exists in second field information in the second data record table, and the value of the first field information is a third threshold value when the first field information does not exist in the second field information in the second data record table.

Optionally, the second processing module is further configured to align at least one of the following items of the first data record table: table name information, field information, and data format, with at least one of the following items of the second data record table whose similarity probability exceeds the first threshold: table name information, field information, and data format.

Optionally, the apparatus further comprises: a receiving module, configured to receive the first data record table, where the first data record table includes first field information;

and the third processing module is used for establishing first bag-of-words information corresponding to the first field information.

According to another embodiment of the present invention, there is also provided a storage medium having a computer program stored therein, wherein the computer program is configured to execute the data alignment method according to any one of the above items when the computer program runs.

According to another embodiment of the present invention, there is also provided an electronic apparatus including a memory and a processor, wherein the memory stores a computer program, and the processor is configured to execute the computer program to perform the data alignment method according to any one of the above.

According to the invention, the first field information in the received first data record table and the first bag-of-words information corresponding to the field information are acquired, wherein the first bag-of-words information is used for representing the probability of the first field information appearing in the database; for each of a plurality of second data record tables in the database, a similarity probability between the first data record table and the plurality of second data record tables is determined according to the first field information and the first bag-of-words information, and the second field information and second bag-of-words information in the second data record tables; and the first description information of the first data record table is aligned with the second description information of the second data record table whose similarity probability exceeds a first threshold. This technical solution solves the problem in the related art that data cannot be aligned effectively because the standards of different data resources are inconsistent, which affects the reliability of data analysis results, and further provides a data alignment method that facilitates subsequent analysis of the data resources.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

FIG. 1 is a flow diagram of an alternative data alignment method according to an embodiment of the invention;

FIG. 2 is a block diagram of an alternative data alignment apparatus according to an embodiment of the present invention;

FIG. 3 is another block diagram of an alternative data alignment apparatus according to an embodiment of the present invention.

Detailed Description

The invention will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.

Example 1

An embodiment of the present invention provides a data alignment method, and fig. 1 is a flowchart of the data alignment method according to the embodiment of the present invention, as shown in fig. 1, including:

step S102, acquiring first field information in a received first data record table and first bag-of-words information corresponding to the field information, wherein the first bag-of-words information is used for representing the probability of the first field information appearing in a database;

step S104, for each second data record table in a plurality of second data record tables in the database, determining the similarity probability between the first data record table and the plurality of second data record tables according to the first field information and the first bag-of-words information, and the second field information and the second bag-of-words information in the second data record tables;

and step S106, aligning the first description information of the first data record table with the second description information of the second data record table with the similarity probability exceeding a first threshold value.

According to the invention, the first field information in the received first data record table and the first bag-of-words information corresponding to the field information are acquired, wherein the first bag-of-words information is used for representing the probability of the first field information appearing in the database; for each of a plurality of second data record tables in the database, a similarity probability between the first data record table and the plurality of second data record tables is determined according to the first field information and the first bag-of-words information, and the second field information and second bag-of-words information in the second data record tables; and the first description information of the first data record table is aligned with the second description information of the second data record table whose similarity probability exceeds the first threshold. In other words, bag-of-words information is introduced into the data alignment process.

In an optional embodiment of the present invention, determining, according to the first field information and the first bag-of-words information, and the second field information and the second bag-of-words information in the second data record table, a similarity probability between the first data record table and the plurality of second data record tables includes:

multiplying the value of the first field information by the first bag-of-words information to obtain the similarity probability, wherein the value of the first field information is a second threshold value when the first field information exists in the second field information in the second data record table, and is a third threshold value when the first field information does not exist in the second field information in the second data record table.
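The rule above can be written compactly as follows. This is a hedged reading rather than notation used in the specification: the symbols $S$, $F_1$, $F_2$, $p_1$, $v$, $\tau_2$ and $\tau_3$ are assumptions introduced here, and the sum over fields follows the worked example later in this embodiment, where the per-field results are added.

```latex
S(T_1, T_2) = \sum_{f \in F_1} v(f)\, p_1(f),
\qquad
v(f) =
\begin{cases}
\tau_2, & f \in F_2 \\
\tau_3, & f \notin F_2
\end{cases}
```

Here $F_1$ and $F_2$ denote the field information of the first and second data record tables, $p_1(f)$ is the first bag-of-words information (probability) of field $f$, and $\tau_2$ and $\tau_3$ are the second and third threshold values. With $\tau_2 = 1$ and $\tau_3 = 0$, the score reduces to the sum of the bag-of-words probabilities of the fields shared by both tables.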

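In an optional embodiment of the present invention, aligning the first description information of the first data record table with the second description information of the second data record table whose similarity probability exceeds a first threshold includes:

aligning at least one of the following items of the first data record table: table name information, field information, and data format, with at least one of the following items of the second data record table whose similarity probability exceeds the first threshold: table name information, field information, and data format.

In an optional embodiment of the present invention, before acquiring the first field information in the received first data record table and the first bag-of-words information corresponding to the field information, the method further includes:

receiving the first data record table, wherein the first data record table comprises first field information;

and establishing first bag-of-words information corresponding to the first field information.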

The following explains the data alignment process with reference to an example, but is not intended to limit the technical solution of the embodiment of the present invention, and the technical solution of the example of the present invention is as follows:

step 1, receiving a first data record table, wherein the first data record table comprises first field information;

step 2, establishing first bag-of-words information corresponding to the first field information;

A bag of words can be understood as follows. A dictionary is a collection of words, and any document is composed of a combination of words in the dictionary. Unlike a dictionary, a bag of words not only contains all the words that make up the documents but also assigns each word a probability, which represents the probability that the word appears in the documents produced from the bag of words. Therefore, documents with different topics correspond to different bags of words. For example, the word "Yao Ming" has a high probability of occurring in a sports-themed document and a low probability of occurring in an entertainment-themed document. That is, a bag of words is a concept attached to a topic.
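A minimal sketch of such a bag of words is given below. The specification does not say how the probabilities are estimated, so this sketch assumes that the probability of a word is the fraction of a topic's documents that contain it; the function name `build_bag_of_words` and the sample documents are hypothetical.

```python
from collections import Counter


def build_bag_of_words(documents):
    """Map each word to the fraction of documents that contain it.

    Assumption: a document is a list of word tokens, and document
    frequency stands in for the probability described in the text.
    """
    doc_count = len(documents)
    containing = Counter()
    for doc in documents:
        # Count each word at most once per document.
        containing.update(set(doc))
    return {word: n / doc_count for word, n in containing.items()}


# Hypothetical sports-themed documents, already split into tokens.
sports_documents = [
    ["Yao Ming", "basketball", "score"],
    ["Yao Ming", "rebound"],
    ["team", "score"],
]
sports_bag = build_bag_of_words(sports_documents)
```

Building a second bag from entertainment-themed documents would give "Yao Ming" a lower value, which is the sense in which a bag of words is attached to a topic.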

Specifically, document topics may be generated by a document topic generation model, Latent Dirichlet Allocation (LDA), which may also be referred to as a three-layer Bayesian probability model. The LDA model comprises a three-layer structure of words, topics, and documents. As a generative model, it treats each word of an article as obtained through a process of "selecting a topic with a certain probability and then selecting a word from that topic with a certain probability". The document-to-topic distribution is multinomial, and the topic-to-word distribution is multinomial.

Further, the first data record table may be regarded as a topic, the table name of the first data record table may be regarded as a document, and first bag-of-words information may be established for each piece of field information of the first data record table.
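The two-stage selection described above can be sketched as follows. This illustrates only the generative step (pick a topic, then pick a word from that topic); it is not a full LDA implementation, since it omits the Dirichlet priors and any inference. The function name `generate_document` and all probabilities are hypothetical.

```python
import random


def generate_document(topic_probs, word_probs_by_topic, length, rng):
    """Generate words by sampling a topic, then a word from that topic."""
    topics = list(topic_probs)
    topic_weights = [topic_probs[t] for t in topics]
    words = []
    for _ in range(length):
        # Document-to-topic draw (multinomial over topics).
        topic = rng.choices(topics, weights=topic_weights)[0]
        # Topic-to-word draw (multinomial over that topic's words).
        vocabulary = list(word_probs_by_topic[topic])
        word_weights = [word_probs_by_topic[topic][w] for w in vocabulary]
        words.append(rng.choices(vocabulary, weights=word_weights)[0])
    return words


# Hypothetical distributions for one document.
topic_probs = {"sports": 0.7, "entertainment": 0.3}
word_probs_by_topic = {
    "sports": {"basketball": 0.5, "score": 0.3, "concert": 0.2},
    "entertainment": {"concert": 0.6, "film": 0.3, "basketball": 0.1},
}
document = generate_document(topic_probs, word_probs_by_topic, 6, random.Random(0))
```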

Step 3, for each of a plurality of second data record tables in the database, determining the similarity probability between the first data record table and the plurality of second data record tables according to the first field information and the first bag-of-words information, and the second field information and the second bag-of-words information in the second data record tables;

specifically, the similarity probability may be obtained by multiplying a value of the first field information by the first bag-of-words information, where a value of the first field information is a second threshold value when the first field information exists in second field information in the second data record table, and a value of the first field information is a third threshold value when the first field information does not exist in the second field information in the second data record table;

The similarity probability is described below by taking the first data record table as a motor vehicle registration information table and the second data record table as a motor vehicle violation information table as an example:

motor vehicle registration information table: { motor vehicle 0.9, vehicle 0.8, registration 0.3, information 0.6, record 0.2, registration 0.2, … };

motor vehicle violation information table: { motor vehicle 0.9, vehicle 0.8, violation 0.5, violation 0.4, information 0.6, violation 0.2, overspeed 0.1, … };

It is to be understood that the above numbers represent the bag-of-words information of each piece of field information.

If "vehicle" in the motor vehicle registration information table is present in the motor vehicle violation information table, the second threshold value corresponding to "vehicle" may be 1 or 0.9; if "registration" in the motor vehicle registration information table is not present in the motor vehicle violation information table, the third threshold value corresponding to "registration" may be 0 or 0.7 (that is, 1 - 0.3 = 0.7). The thresholds corresponding to other field information may be determined in the same manner, and details are not repeated here.

Then, after each piece of field information in the motor vehicle registration information table is compared with each piece of field information in the motor vehicle violation information table, a threshold corresponding to each piece of field information in the motor vehicle registration information table is obtained, and these thresholds are added to finally obtain the similarity probability.
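The comparison can be sketched in code as follows. This is one possible reading, not the definitive procedure: it assumes a second threshold of 1 and a third threshold of 0, multiplies each field's value by its bag-of-words probability, and adds the results. The function name `similarity_score` and its parameters are hypothetical, and the entries are English glosses of the example lists, kept as pairs because a gloss may repeat.

```python
def similarity_score(first_bag, second_fields, match_value=1.0, miss_value=0.0):
    """Sum value * probability over the fields of the first table.

    match_value plays the role of the second threshold and miss_value
    the role of the third threshold (both assumed constants here).
    """
    score = 0.0
    for field, probability in first_bag:
        value = match_value if field in second_fields else miss_value
        score += value * probability
    return score


# Bag-of-words information of the registration table, as (field, probability).
registration_bag = [
    ("motor vehicle", 0.9), ("vehicle", 0.8), ("registration", 0.3),
    ("information", 0.6), ("record", 0.2), ("registration", 0.2),
]
# Field information of the violation table.
violation_fields = {"motor vehicle", "vehicle", "violation", "information", "overspeed"}

score = similarity_score(registration_bag, violation_fields)
```

Under these assumptions only "motor vehicle", "vehicle" and "information" contribute. The result is an unnormalized sum, so the first threshold used to judge it would need to be chosen on the same scale.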

Step 4, obtaining a second data record table most similar to the first data record table according to the similarity probability.

The number of the second data record tables most similar to the first data record table may be one or more, and is not limited herein.
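A short sketch of this selection is given below, assuming the similarity of each second data record table has already been computed. The function name `select_similar_tables`, the scores, and the threshold value are hypothetical.

```python
def select_similar_tables(scores, first_threshold):
    """Return (table name, score) pairs above the threshold, best first."""
    selected = [(name, s) for name, s in scores.items() if s > first_threshold]
    return sorted(selected, key=lambda item: item[1], reverse=True)


# Hypothetical similarity results for two tables in the database.
scores = {
    "motor vehicle registration information": 1.1,
    "motor vehicle violation information": 2.6,
}
matches = select_similar_tables(scores, first_threshold=2.0)
```

Returning a list rather than a single table reflects that one or more second data record tables may qualify.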

Specifically, at least one of the following items of the first data record table: table name information, field information, and data format, is aligned with at least one of the following items of the second data record table whose similarity probability exceeds the first threshold: table name information, field information, and data format.

For example, the standard data specification (i.e., the database) includes definitions of two data resources (i.e., second data record tables): motor vehicle registration information and motor vehicle violation information. The existing data resource is relational data provided by an organization under the name "vehicle violation record" (i.e., the first data record table). With the above technical solution, it can be determined that the vehicle violation record table (i.e., the first data record table) should be aligned with the motor vehicle violation information in the standard data specification (i.e., the second data record table whose similarity probability with the first data record table exceeds the first threshold) rather than with the motor vehicle registration information.
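The alignment of description information in this example might look like the sketch below. The specification does not define how individual fields are mapped, so this sketch assumes the simplest rule: adopt the standard table name, keep the incoming fields, and take the standard data format for any field that the standard table also defines. The class `TableDescription`, the function `align_description`, and all field names and formats are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TableDescription:
    table_name: str
    fields: list       # field names
    data_format: dict  # field name -> format name


def align_description(incoming, standard):
    """Rewrite the incoming description against the matched standard one."""
    data_format = {}
    for name in incoming.fields:
        # Prefer the standard format when the field is defined there.
        data_format[name] = standard.data_format.get(
            name, incoming.data_format.get(name)
        )
    return TableDescription(standard.table_name, list(incoming.fields), data_format)


incoming = TableDescription(
    "vehicle violation record",
    ["plate_number", "violation_time", "remark"],
    {"plate_number": "text", "violation_time": "text", "remark": "text"},
)
standard = TableDescription(
    "motor vehicle violation information",
    ["plate_number", "violation_time", "violation_type"],
    {"plate_number": "string", "violation_time": "datetime", "violation_type": "string"},
)
aligned = align_description(incoming, standard)
```

Fields that exist only in the incoming table keep their original format in this sketch; a real deployment would need an explicit policy for them.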

By comparison, although data alignment can be completed fairly accurately by hand, manual alignment consumes a great deal of manpower and becomes very inefficient as the standard data specification grows more complete and more complex.

Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.

In this embodiment, a data alignment apparatus is further provided, and the apparatus is used to implement the foregoing embodiments and preferred embodiments, and the description already made is omitted. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.

Fig. 2 is a block diagram of a data alignment apparatus according to an embodiment of the present invention, as shown in fig. 2, the apparatus including:

an obtaining module 20, configured to obtain first field information in a received first data record table and first bag-of-words information corresponding to the field information, where the first bag-of-words information is used to indicate a probability of occurrence of the first field information in a database;

a first processing module 22, configured to determine, for each of a plurality of second data record tables in the database, a similarity probability between the first data record table and the plurality of second data record tables according to the first field information and the first bag-of-words information, and second field information and second bag-of-words information in the second data record tables;

and a second processing module 24, configured to align the first description information of the first data record table with the second description information of the second data record table whose similarity probability exceeds a first threshold.

In an optional embodiment of the present invention, the first processing module 22 is further configured to multiply the value of the first field information by the first bag-of-words information to obtain the similarity probability, where the value of the first field information is a second threshold value when the first field information exists in the second field information in the second data record table, and is a third threshold value when the first field information does not exist in the second field information in the second data record table.

In an optional embodiment of the present invention, the second processing module 24 is further configured to align at least one of the following items of the first data record table: table name information, field information, and data format, with at least one of the following items of the second data record table whose similarity probability exceeds the first threshold: table name information, field information, and data format.

In an alternative embodiment of the present invention, as shown in fig. 3, the apparatus further includes: a receiving module 26, configured to receive the first data record table, where the first data record table includes first field information;

and a third processing module 28, configured to establish first bag-of-words information corresponding to the first field information.

An embodiment of the present invention further provides a storage medium including a stored program, wherein the program executes any one of the methods described above.

Alternatively, in the present embodiment, the storage medium may be configured to store program codes for performing the following steps:

s1, acquiring first field information in a received first data record table and first bag-of-words information corresponding to the field information, wherein the first bag-of-words information is used for representing the probability of the first field information appearing in a database;

s2, for each of a plurality of second data record tables in the database, determining the similarity probability of the first data record table and the plurality of second data record tables according to the first field information and the first bag-of-word information, and the second field information and the second bag-of-word information in the second data record tables;

and S3, aligning the first description information of the first data record table with the second description information of the second data record table with the similarity probability exceeding a first threshold value.

Optionally, in this embodiment, the storage medium may include, but is not limited to, various media capable of storing program code, such as a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.

Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments and optional implementation manners, and this embodiment is not described herein again.

Embodiments of the present invention also provide an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.

Optionally, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.

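Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:

S1, acquiring first field information in a received first data record table and first bag-of-words information corresponding to the field information, wherein the first bag-of-words information is used for representing the probability of the first field information appearing in a database;

S2, for each of a plurality of second data record tables in the database, determining the similarity probability between the first data record table and the plurality of second data record tables according to the first field information and the first bag-of-words information, and the second field information and the second bag-of-words information in the second data record tables;

and S3, aligning the first description information of the first data record table with the second description information of the second data record table with the similarity probability exceeding a first threshold value.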

It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method of data alignment, comprising:

acquiring first field information in a received first data record table and first bag-of-words information corresponding to the field information, wherein the first bag-of-words information is used for representing the probability of the first field information appearing in a database;

for each of a plurality of second data record tables in the database, determining a similarity probability between the first data record table and the plurality of second data record tables according to the first field information and the first bag-of-words information, and second field information and second bag-of-words information in the second data record tables;

aligning the first description information of the first data record table with the second description information of the second data record table of which the similarity probability exceeds a first threshold value;

determining the similarity probability between the first data record table and the plurality of second data record tables according to the first field information, the first bag-of-words information and the second field information and the second bag-of-words information in the second data record tables, including:

2. The method of claim 1, wherein aligning the first description information of the first data record table with the second description information of the second data record table having the similarity probability exceeding a first threshold comprises:

3. The method according to any one of claims 1 or 2, wherein before acquiring the first field information in the received first data record table and the first bag-of-words information corresponding to the field information, the method further comprises:

4. A data alignment apparatus, comprising:

the second processing module is used for aligning the first description information of the first data record table with the second description information of a second data record table of which the similarity probability exceeds a first threshold;

the first processing module is further configured to multiply the value of the first field information and the first bag-of-words information to obtain the similarity probability, where the value of the first field information is a second threshold value when the first field information exists in second field information in the second data record table, and the value of the first field information is a third threshold value when the first field information does not exist in the second field information in the second data record table.

5. The apparatus of claim 4, wherein the second processing module is further configured to align at least one of the following items of the first data record table: table name information, field information, and data format, with at least one of the following items of the second data record table whose similarity probability exceeds the first threshold: table name information, field information, and data format.

6. The apparatus of claim 4, further comprising:

a receiving module, configured to receive the first data record table, where the first data record table includes first field information;

and the third processing module is used for establishing first bag-of-words information corresponding to the first field information.

7. A storage medium, in which a computer program is stored, wherein the computer program is arranged to perform the method of any of claims 1 to 3 when executed.

8. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and wherein the processor is arranged to execute the computer program to perform the method of any of claims 1 to 3.