CN112597124A - Data field mapping method and device and storage medium

Info

Publication number
CN112597124A
CN112597124A (application CN202011371396.2A)
Authority
CN
China
Prior art keywords
field
mapped
target
training
fields
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202011371396.2A
Other languages
Chinese (zh)
Inventor
刘畅 (Liu Chang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
New H3C Big Data Technologies Co Ltd
Original Assignee
New H3C Big Data Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by New H3C Big Data Technologies Co Ltd
Priority to CN202011371396.2A
Publication of CN112597124A
Legal status: Withdrawn

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/21 - Design, administration or maintenance of databases
    • G06F 16/214 - Database migration support
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 - Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides a data field mapping method, apparatus, and storage medium for addressing the technical problem of low database field mapping efficiency. The method uses a semi-supervised natural language processing model to vectorize the field names of the fields to be mapped in a source database and of the target fields in a target database, yielding field feature vectors, and then uses the model to calculate the similarity between each field to be mapped and each target field so as to predict their mapping relationship. Correspondences corrected manually are added to the training samples, and the model is iteratively retrained to improve prediction accuracy. By assisting field mapping with machine learning, the method reduces manual workload and improves data field mapping efficiency compared with purely manual labeling.

Description

Data field mapping method and device and storage medium
Technical Field
The present disclosure relates to the field of communications and artificial intelligence technologies, and in particular, to a data field mapping method, apparatus, and storage medium.
Background
In research-and-development enterprises, differences in development time or development departments often mean that multiple heterogeneous information systems run simultaneously on different software and hardware platforms. The data sources of these systems are independent and closed to one another, which makes it difficult to exchange, share, and fuse data between systems and creates information silos. As information-based applications continue to deepen, organizations in the digital era must keep top-level control over their data to stay relevant to the market. Data integration logically or physically consolidates data of different sources, formats, and characteristics, thereby providing comprehensive data sharing for the enterprise.
Data mapping is an essential component of large-scale data migration and data integration. It is the mechanism that matches fields in a data source (which may be any input, such as name, phone, or email) with target fields in a data warehouse or other repository. As organizations handle more data sources, types, and formats than ever before, data mapping becomes particularly important: as part of an overall data strategy it can reduce the possibility of data errors and mismatches, aid the data standardization process, and make the intended purpose of the data clearer and easier to understand.
Existing products that integrate data mapping rely mainly on the professional knowledge, business experience, and rules of data experts. This approach is highly subjective and overly dependent on those experts. Manual labeling is also prone to oversights and is expensive, and a manager with insufficient experience or professional knowledge can easily introduce data mapping errors. Because of these deficiencies, database field mapping is inefficient.
Disclosure of Invention
In view of this, the present disclosure provides a data field mapping method, apparatus and storage medium, which are used to solve the technical problem of low database field mapping efficiency.
Based on an embodiment of the present disclosure, the present disclosure provides a data field mapping method, including:
acquiring a field set to be mapped of a data source to be mapped and a target field set of a target data source;
performing word embedding on the fields to be mapped in the field set to be mapped and on the target fields in the target field set, respectively, to obtain a field feature vector for each field;
taking the field feature vectors of the fields to be mapped and of the target fields as the input of a natural language processing model, and calculating the similarity between each field to be mapped and each target field, the natural language processing model being trained on labeled training samples;
and outputting the field to be mapped and the target field with the highest similarity as a prediction result.
Further, the method further comprises:
outputting the prediction result for the user to determine the mapping relation;
and receiving the corrected field to be mapped and target field, adding them to a training sample set as a new labeled sample, and iteratively training the natural language processing model.
Further, the prediction result includes the field to be mapped and the target field with the highest similarity, together with the similarity value.
Further, the method for training the natural language processing model with the labeled training sample set comprises the following steps:
vectorizing the training samples in the labeled training sample set with the natural language processing model to obtain field feature vectors of the source fields and target fields in the samples;
and inputting the field feature vectors of the training samples into the natural language processing model for training; when the precision and recall of the model's predictions do not meet a preset iteration cut-off condition, adjusting the model parameters and continuing iterative training until the cut-off condition is reached.
Further, the training samples in the training sample set are labeled by manual labeling or rule-based labeling; the natural language processing model is Word2Vec or GloVe.
Based on another embodiment of the present disclosure, the present disclosure further provides a data field mapping apparatus, including:
the field acquisition module is used for acquiring a field set to be mapped of a data source to be mapped and a target field set of a target data source;
the vectorization module is used for performing word embedding on the fields to be mapped in the field set to be mapped and on the target fields in the target field set, respectively, to obtain a field feature vector for each field;
the mapping prediction module is used for taking the field feature vectors of the fields to be mapped and of the target fields as the input of a natural language processing model and calculating the similarity between each field to be mapped and each target field, the natural language processing model being trained on labeled training samples;
and the mapping output module is used for outputting the field to be mapped and the target field with the highest similarity as a prediction result.
Further, the apparatus further comprises a training set update module;
the mapping output module outputs the prediction result for the user to determine the mapping relation;
and the training set updating module is used for receiving the corrected field to be mapped and target field, adding them to a training sample set as a new labeled sample, and iteratively training the natural language processing model.
Furthermore, the prediction result output by the mapping output module includes the field to be mapped and the target field with the highest similarity, together with the similarity value.
Further, the apparatus further comprises:
the model iteration training module is used for vectorizing the training samples in the labeled training sample set with the natural language processing model to obtain field feature vectors of the source fields and target fields in the samples, and for inputting the field feature vectors of the training samples into the natural language processing model for training; when the precision and recall of the model's predictions do not meet a preset iteration cut-off condition, the module adjusts the model parameters and continues iterative training until the cut-off condition is reached.
The method uses a semi-supervised natural language processing model to vectorize the field names of the fields to be mapped in the source database and of the target fields in the target database, yielding field feature vectors, and then uses the model to calculate the similarity between each field to be mapped and each target field so as to predict their mapping relationship. Correspondences corrected manually are added to the training samples, and the model is iteratively retrained to improve prediction accuracy. By assisting field mapping with machine learning, the method reduces manual workload and improves data field mapping efficiency compared with purely manual labeling.
Drawings
To illustrate the embodiments of the present disclosure or the technical solutions in the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present disclosure, and those skilled in the art can derive other drawings from them.
FIG. 1 is a flowchart illustrating steps of a data field mapping method according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram illustrating a mapping relationship between a field to be mapped in a data source to be mapped and a target field in a target data source according to an embodiment of the present disclosure;
fig. 3 is a schematic diagram of a process of iteratively training an NLP model used in the embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of a data field mapping apparatus according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of a data field mapping device according to an embodiment of the present disclosure.
Detailed Description
The terminology used in the embodiments of the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the embodiments of the present disclosure. As used in the embodiments of the present disclosure, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. The term "and/or" as used in this disclosure is meant to encompass any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information in the embodiments of the present disclosure, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of embodiments of the present disclosure. Moreover, depending on the context, the word "if" as used herein may be interpreted as "upon", "when", or "in response to determining".
The basic idea of the present disclosure is to introduce machine learning to improve the efficiency of data field mapping during database migration. The scheme provided by the disclosure adopts a semi-supervised artificial intelligence learning model to automatically generate mapping suggestions for the user; the user decides whether to adopt each suggestion provided by the model, and every adoption or correction improves the accuracy of the model's next suggestion. The method and device thus improve data mapping efficiency, reduce manual involvement and the probability of errors, and lower the time and economic cost of database migration.
Fig. 1 is a flowchart illustrating steps of a data field mapping method according to an embodiment of the present disclosure, where the method includes:
step 101, acquiring a field set to be mapped of a data source to be mapped and a target field set of a target data source;
the method is used for establishing the optimal field mapping relationship between the source field and the target field. The number of the data sources to be mapped and the number of the target data sources are not limited in the embodiment of the disclosure, and the mapping processing between a plurality of source fields and a plurality of target fields of a plurality of data sources to be mapped and a plurality of target data sources can be executed simultaneously.
step 102, performing word embedding on the fields to be mapped in the field set to be mapped and on the target fields in the target field set, respectively, to obtain a field feature vector for each field;
step 103, taking the field feature vectors of the fields to be mapped and of the target fields as the input of a natural language processing model, and calculating the similarity between each field to be mapped and each target field, the natural language processing model being trained on labeled training samples;
and step 104, outputting the field to be mapped and the target field with the highest similarity as a prediction result.
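To make this flow concrete, the following minimal sketch walks through steps 101 to 104 under stated assumptions: the embed() function is only a placeholder for the word-embedding step of the trained natural language processing model (for example Word2Vec or GloVe), cosine similarity stands in for the model's similarity calculation, and the field names are illustrative. It is a sketch, not the patent's implementation.

```python
import numpy as np

def embed(field_name: str) -> np.ndarray:
    """Placeholder embedding: a real system would return the field's feature vector
    produced by the trained NLP model (step 102)."""
    rng = np.random.default_rng(abs(hash(field_name)) % (2**32))
    return rng.standard_normal(64)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity, standing in for the model's similarity function (step 103)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

fields_to_map = ["dianhua", "xingming", "youxiang"]   # step 101: source field set (illustrative)
target_fields = ["Tel", "Name", "Email"]              # step 101: target field set (illustrative)

predictions = []
for src in fields_to_map:
    v_src = embed(src)
    scored = [(tgt, cosine(v_src, embed(tgt))) for tgt in target_fields]
    best_tgt, best_sim = max(scored, key=lambda s: s[1])
    predictions.append((src, best_tgt, best_sim))     # step 104: highest-similarity pair

for src, tgt, sim in predictions:
    print(f"{src} -> {tgt} ({sim:.2%})")
```

Because the placeholder embeddings are random, the printed pairs are meaningless here; with a trained model the scores would reflect the semantic closeness of the field names.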
The above steps predict the mapping relationship between the fields to be mapped and the target fields by means of a natural language processing (NLP) model. Because the output of this step may be inaccurate, the accuracy of the model and of the final result needs to be improved continuously. To this end, the method further comprises:
step 105, outputting the prediction result to a user to determine a mapping relation;
in this step, the prediction result may include only the field to be mapped and the target field with the highest similarity, or may predict the field to be mapped and the target field with the highest similarity and their proximity outputs for the user to determine the mapping relationship.
The prediction result comprises a field to be mapped with the highest similarity, a target field and the similarity.
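One possible shape for such a prediction result is sketched below; the use of a Python dataclass and the attribute names are illustrative assumptions rather than anything mandated by the disclosure.

```python
from dataclasses import dataclass

@dataclass
class MappingPrediction:
    field_to_map: str    # source field name
    target_field: str    # predicted target field with the highest similarity
    similarity: float    # similarity value shown to the user for confirmation
```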
And step 106, receiving the corrected field to be mapped and target field, adding them to the training sample set as a new labeled sample, and iteratively training the natural language processing model.
Further, in an embodiment of the present disclosure, the method for training the natural language processing model with the labeled training sample set includes:
vectorizing the training samples in the training sample set, labeled manually or by rules, with the natural language processing model to obtain field feature vectors of the source fields and target fields in the samples;
and inputting the field feature vectors of the training samples into the natural language processing model for training; when the precision and recall of the model's predictions do not meet a preset iteration cut-off condition, adjusting the model parameters and continuing iterative training until the cut-off condition is reached.
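The iterative training described above might be organized as in the following hedged sketch; the MappingModel protocol and its train_one_round and evaluate methods are hypothetical stand-ins for the model's parameter adjustment and for computing precision and recall against the labeled targets.

```python
from typing import List, Protocol, Tuple

class MappingModel(Protocol):
    def train_one_round(self, samples: List[Tuple[str, str]]) -> None: ...
    def evaluate(self, samples: List[Tuple[str, str]]) -> Tuple[float, float]: ...

def fit_mapping_model(model: MappingModel,
                      labeled_samples: List[Tuple[str, str]],
                      min_precision: float = 0.95,
                      min_recall: float = 0.95,
                      max_rounds: int = 100) -> MappingModel:
    """labeled_samples: (source_field, correct_target_field) pairs labeled manually or by rules."""
    for _ in range(max_rounds):
        model.train_one_round(labeled_samples)                 # adjust model parameters
        precision, recall = model.evaluate(labeled_samples)    # compare predictions against labels
        if precision >= min_precision and recall >= min_recall:
            break                                              # preset iteration cut-off condition reached
    return model
```

The loop mirrors the cut-off rule in the text: training continues, with parameter adjustments, until precision and recall both reach the preset thresholds (or a maximum number of rounds is exhausted).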
The data field mapping method provided by the disclosure applies a machine learning model of semi-supervised learning during data mapping, combines a machine learning technology with a traditional manual labeling method, and improves the efficiency of data mapping.
Fig. 2 is a schematic diagram illustrating a mapping relationship between a field to be mapped in a data source to be mapped and a target field in a target data source according to an embodiment of the present disclosure. In the embodiment, the fields to be mapped come from a plurality of data sources, and the mapping relationship between the fields to be mapped and the target fields of the target table in the target data source is established by applying the data field mapping method disclosed by the invention.
Fig. 3 is a schematic diagram of the process of iteratively training the NLP model used in the embodiment of the present disclosure. In the embodiment of the disclosure, semi-supervised machine learning is added to the data mapping process, so that machine learning is combined with the traditional manual labeling method and the efficiency of data mapping is improved. The semi-supervised machine learning technique used in the present disclosure may preferably be a natural language processing technique such as Word2Vec or GloVe.
Before data field mapping is performed with the semi-supervised machine learning technique, a certain number of training samples need to be prepared by manual labeling or rule-based labeling; these training samples form a training sample set used to iteratively train the NLP model. Each training sample includes a source field and the labeled correct target field. The training samples in the training sample set are input into the NLP model for training and evaluation, and model training ends when the precision and recall of the model's predictions, compared against the labels, reach the preset stopping conditions. When the model is used for actual field mapping, the prediction results can be output and displayed: the user can directly confirm correct mapping relationships through the interface and can manually correct wrong field mappings. After correction, the system takes the corrected source field and target field as a new training sample, adds it to the training sample set, and uses the updated set for further iterative training of the NLP model, so that more accurate predictions are provided.
When mapping the fields to be mapped to the target fields, the fields to be mapped and the target fields are first converted into feature vectors, for example by Word2Vec word embedding into word vectors. The semantic similarity function of the Word2Vec model is then called to calculate the similarity between each field to be mapped and each target field. The similarities of all fields to be mapped and target fields are compared, the field to be mapped with the highest similarity to a target field is determined to have a mapping relationship with that target field, and the determined field to be mapped, target field, and similarity probability are output for the user to confirm. For example, taking the field to be mapped "dianhua" and the target field "Tel" in fig. 2 as an example, the output mapping prediction result is:
“dianhua”->“Tel”,99.987%
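As an illustration only, the similarity call and the output format above could look like the following with gensim's Word2Vec; the tiny training corpus and the treatment of field names as single tokens are assumptions, so the printed score will not reproduce the 99.987% of the example.

```python
from gensim.models import Word2Vec

# Hypothetical corpus of tokenized field names; a real system would train on a much
# larger body of field-name text or reuse pretrained embeddings.
corpus = [["dianhua", "Tel", "phone"], ["xingming", "Name"], ["youxiang", "Email"]]
model = Word2Vec(sentences=corpus, vector_size=50, min_count=1, epochs=200, seed=1)

source, target = "dianhua", "Tel"
sim = model.wv.similarity(source, target)   # cosine similarity of the two word vectors
print(f'"{source}"->"{target}",{sim:.3%}')  # mirrors the prediction output format above
```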
Due to the limitations of the model, some fields may be mapped incorrectly and some fields may remain unmapped. The user manually corrects the mispredicted results, the corrections are added to the training sample set, and the model learns from the updated set. After learning finishes, the user can run the mapping again to generate the mapping relationships and verify their correctness; continuous iterative training keeps improving the model's prediction accuracy.
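A brief sketch of this correction feedback loop follows; the retrain() call is a hypothetical placeholder for re-running the iterative training routine (for example the fit_mapping_model sketch above) on the enlarged sample set.

```python
from typing import List, Tuple

def apply_user_corrections(model,
                           training_samples: List[Tuple[str, str]],
                           corrections: List[Tuple[str, str]]):
    """corrections: (source_field, corrected_target_field) pairs confirmed by the user."""
    training_samples.extend(corrections)    # corrected pairs join the training sample set as new labeled samples
    model.retrain(training_samples)         # hypothetical call: iterative retraining on the updated set
    return model
```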
The technical scheme of the disclosure combines purely manual labeling with machine learning, which effectively reduces the user's time and expense in the data mapping stage. With only a small amount of correct manual labeling, the method can automatically provide mapping suggestions for the remaining data; compared with purely manual labeling it achieves semi-automation, visually shows the confidence of each automatic label, and facilitates manual judgment. The method can well replace the traditional approach and reduces the manual workload.
It should be recognized that embodiments of the present disclosure can be realized and implemented by computer hardware, a combination of hardware and software, or by computer instructions stored in a non-transitory computer readable memory. The method may be implemented in a computer program using standard programming techniques, including a non-transitory computer readable storage medium configured with the computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner. Each program may be implemented in a high level procedural or object oriented programming language to communicate with a computer system. However, the program(s) can be implemented in assembly or machine language, if desired. In any case, the language may be a compiled or interpreted language. Furthermore, the program can be run on a programmed application specific integrated circuit for this purpose.
Further, operations of processes described by the present disclosure may be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The processes described in this disclosure (or variations and/or combinations thereof) may be performed under the control of one or more computer systems configured with executable instructions and may be implemented as code (e.g., executable instructions, one or more computer programs, or one or more applications) executing collectively on one or more processors, by hardware, or combinations thereof. The computer program includes a plurality of instructions executable by one or more processors.
Further, the method may be implemented in any type of computing platform operatively connected to a suitable interface, including but not limited to a personal computer, mini computer, mainframe, workstation, networked or distributed computing environment, separate or integrated computer platform, or in communication with a charged particle tool or other imaging device, and the like. Aspects of the disclosure may be embodied in machine-readable code stored on a non-transitory storage medium or device, whether removable or integrated into a computing platform, such as a hard disk, optically read and/or write storage medium, RAM, ROM, or the like, such that it may be read by a programmable computer, which when read by the storage medium or device, is operative to configure and operate the computer to perform the procedures described herein. Further, the machine-readable code, or portions thereof, may be transmitted over a wired or wireless network. The invention described in this disclosure includes these and other different types of non-transitory computer-readable storage media when such media include instructions or programs that implement the steps described above in conjunction with a microprocessor or other data processor. The disclosure also includes the computer itself when programmed according to the methods and techniques described in this disclosure.
Fig. 4 is a schematic structural diagram of a data field mapping apparatus according to an embodiment of the present disclosure, where each functional module in the apparatus may be implemented in a form of a software module. The modules in the device can be executed on one hardware device, or can be implemented and completed by different hardware devices respectively. The apparatus 400 comprises: a field acquisition module 410, a vectorization module 420, a mapping prediction module 430, and a mapping output module 440.
The field obtaining module 410 is configured to obtain a field set to be mapped of a data source to be mapped and a target field set of a target data source;
the vectorization module 420 is configured to perform word embedding processing on the fields to be mapped in the field set to be mapped and the target fields in the target field set respectively to obtain field feature vectors of the fields;
the mapping prediction module 430 is configured to use the field feature vectors of the fields to be mapped and the target fields as input of a natural language processing model, and calculate similarity between each field to be mapped and each target field respectively; the natural language processing model is obtained by training a labeled training sample;
the mapping output module 440 is configured to output the field to be mapped with the highest similarity and the target field as the prediction result.
In order to continuously improve the prediction accuracy of the natural language processing model, the apparatus 400 in an embodiment of the present disclosure further includes a training set updating module.
The mapping output module 440 outputs the prediction result for the user to determine the mapping relationship. The training set updating module is used for receiving the corrected field to be mapped and target field, adding them to the training sample set as a new labeled sample, and iteratively training the natural language processing model.
Preferably, the prediction result output by the mapping output module 440 includes the field to be mapped and the target field with the highest similarity, together with the similarity value.
To implement the prediction of the field mapping, the embodiment of the present disclosure needs to train the natural language processing model first, and therefore the apparatus further includes:
the model iteration training module, which is used for vectorizing the training samples in the labeled training sample set with the natural language processing model to obtain field feature vectors of the source fields and target fields in the samples, and for inputting the field feature vectors of the training samples into the natural language processing model for training; when the precision and recall of the model's predictions do not meet a preset iteration cut-off condition, the module adjusts the model parameters and continues iterative training until the cut-off condition is reached.
Fig. 5 is a schematic structural diagram of a data field mapping device according to an embodiment of the present disclosure. The device 500 includes: a processor 510 such as a Central Processing Unit (CPU), an internal bus 520, a network interface 540, and a computer-readable storage medium 530. The processor 510 and the computer-readable storage medium 530 can communicate with each other through the internal bus 520. The computer-readable storage medium 530 may store a computer program provided by the present disclosure for implementing the data field mapping method provided by the present disclosure; when the computer program is executed by the processor 510, the functions of the steps of the method provided by the present disclosure can be implemented.
The machine-readable storage medium may include Random Access Memory (RAM) and may also include Non-Volatile Memory (NVM), such as at least one disk memory. Additionally, the machine-readable storage medium 530 may also be at least one storage device located remotely from the aforementioned processor. The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
The device provided by the embodiment of the disclosure is based on the same technical concept as the method provided by the embodiment of the disclosure and has the same beneficial effects as the method it adopts, runs, or implements.
The above description is only an example of the present disclosure and is not intended to limit the present disclosure. Various modifications and variations of this disclosure will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims (10)

1. A data field mapping method, the method comprising:
acquiring a field set to be mapped of a data source to be mapped and a target field set of a target data source;
performing word embedding on the fields to be mapped in the field set to be mapped and on the target fields in the target field set, respectively, to obtain a field feature vector for each field;
taking the field feature vectors of the fields to be mapped and of the target fields as the input of a natural language processing model, and calculating the similarity between each field to be mapped and each target field, the natural language processing model being trained on labeled training samples;
and outputting the field to be mapped and the target field with the highest similarity as a prediction result.
2. The method of claim 1, further comprising:
outputting the prediction result for the user to determine the mapping relation;
and receiving the corrected field to be mapped and target field, adding them to a training sample set as a new labeled sample, and iteratively training the natural language processing model.
3. The method of claim 1,
the prediction result comprises the field to be mapped and the target field with the highest similarity, together with the similarity value.
4. The method of claim 1, wherein the natural language processing model is trained by a labeled training sample set by:
vectorizing the training samples in the labeled training sample set with the natural language processing model to obtain field feature vectors of the source fields and target fields in the samples;
and inputting the field feature vectors of the training samples into the natural language processing model for training; when the precision and recall of the model's predictions do not meet a preset iteration cut-off condition, adjusting the model parameters and continuing iterative training until the cut-off condition is reached.
5. The method of claim 1,
the training samples in the training sample set are labeled by manual labeling or rule-based labeling;
the natural language processing model is Word2Vec or GloVe.
6. An apparatus for mapping data fields, the apparatus comprising:
the field acquisition module is used for acquiring a field set to be mapped of a data source to be mapped and a target field set of a target data source;
the vectorization module is used for performing word embedding on the fields to be mapped in the field set to be mapped and on the target fields in the target field set, respectively, to obtain a field feature vector for each field;
the mapping prediction module is used for taking the field feature vectors of the fields to be mapped and of the target fields as the input of a natural language processing model and calculating the similarity between each field to be mapped and each target field, the natural language processing model being trained on labeled training samples;
and the mapping output module is used for outputting the field to be mapped and the target field with the highest similarity as a prediction result.
7. The apparatus of claim 6, further comprising a training set update module;
the mapping output module outputs the prediction result for the user to determine the mapping relation;
and the training set updating module is used for receiving the corrected field to be mapped and target field, adding them to a training sample set as a new labeled sample, and iteratively training the natural language processing model.
8. The apparatus of claim 6,
the prediction result output by the mapping output module comprises the field to be mapped and the target field with the highest similarity, together with the similarity value.
9. The apparatus of claim 6, further comprising:
the model iteration training module is used for vectorizing the training samples in the labeled training sample set with the natural language processing model to obtain field feature vectors of the source fields and target fields in the samples, and for inputting the field feature vectors of the training samples into the natural language processing model for training; when the precision and recall of the model's predictions do not meet a preset iteration cut-off condition, the module adjusts the model parameters and continues iterative training until the cut-off condition is reached.
10. A storage medium on which a computer program is stored, which computer program, when being executed by a processor, is adapted to carry out the functions of the steps of the method according to any one of claims 1 to 5.
CN202011371396.2A 2020-11-30 2020-11-30 Data field mapping method and device and storage medium Withdrawn CN112597124A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011371396.2A CN112597124A (en) 2020-11-30 2020-11-30 Data field mapping method and device and storage medium

Publications (1)

Publication Number Publication Date
CN112597124A true CN112597124A (en) 2021-04-02

Family

ID=75187572

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011371396.2A Withdrawn CN112597124A (en) 2020-11-30 2020-11-30 Data field mapping method and device and storage medium

Country Status (1)

Country Link
CN (1) CN112597124A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107844560A (en) * 2017-10-30 2018-03-27 北京锐安科技有限公司 A kind of method, apparatus of data access, computer equipment and readable storage medium storing program for executing
CN110019474A (en) * 2017-12-19 2019-07-16 北京金山云网络技术有限公司 Synonymous data automatic correlation method, device and electronic equipment in heterogeneous database
CN109284396A (en) * 2018-09-27 2019-01-29 北京大学深圳研究生院 Medical knowledge map construction method, apparatus, server and storage medium
CN111061833A (en) * 2019-12-10 2020-04-24 北京明略软件***有限公司 Data processing method and device, electronic equipment and computer readable storage medium
CN111680506A (en) * 2020-04-28 2020-09-18 北京三快在线科技有限公司 External key mapping method and device of database table, electronic equipment and storage medium
CN111881334A (en) * 2020-07-15 2020-11-03 浙江大胜达包装股份有限公司 Keyword-to-enterprise retrieval method based on semi-supervised learning

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113761285A (en) * 2021-11-09 2021-12-07 中科雨辰科技有限公司 Metadata generation system based on multi-source data
CN114661723A (en) * 2022-03-29 2022-06-24 杭州数梦工场科技有限公司 Data processing method and device and electronic equipment
CN116842021A (en) * 2023-07-14 2023-10-03 恩核(北京)信息技术有限公司 Data dictionary standardization method, equipment and medium based on AI generation technology
CN116842021B (en) * 2023-07-14 2024-04-26 恩核(北京)信息技术有限公司 Data dictionary standardization method, equipment and medium based on AI generation technology

Similar Documents

Publication Publication Date Title
CN112597124A (en) Data field mapping method and device and storage medium
US20150135166A1 (en) Source code generation, completion, checking, correction
CN110795938B (en) Text sequence word segmentation method, device and storage medium
US11016740B2 (en) Systems and methods for virtual programming by artificial intelligence
US11327874B1 (en) System, method, and computer program for orchestrating automatic software testing
CN109977014B (en) Block chain-based code error identification method, device, equipment and storage medium
US20180276105A1 (en) Active learning source code review framework
CN109934227A (en) System for recognizing characters from image and method
CN105786715A (en) Program static automatic analysis method
US11237802B1 (en) Architecture diagram analysis tool for software development
JP6962123B2 (en) Label estimation device and label estimation program
CN114328980A (en) Knowledge graph construction method and device combining RPA and AI, terminal and storage medium
CN111104400A (en) Data normalization method and device, electronic equipment and storage medium
CN109800776A (en) Material mask method, device, terminal and computer readable storage medium
US20220092406A1 (en) Meta-feature training models for machine learning algorithms
CN113705207A (en) Grammar error recognition method and device
US20230351253A1 (en) System and method for performing test data management
CN115909386A (en) Method, equipment and storage medium for completing and correcting pipeline instrument flow chart
CN112287005B (en) Data processing method, device, server and medium
US11636022B2 (en) Server and control method thereof
CN113792132A (en) Target answer determination method, device, equipment and medium
CN111444710B (en) Word segmentation method and word segmentation device
CN113673680A (en) Model verification method and system for automatically generating verification properties through countermeasure network
WO2023045311A1 (en) Resource topology restoration method and apparatus, server, and storage medium
US11830081B2 (en) Automated return evaluation with anomoly detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20210402