CN111061833B - Data processing method and device, electronic equipment and computer readable storage medium

Info

Publication number: CN111061833B
Application number: CN201911260528.1A
Authority: CN
Inventors: 邹韬; 冯磊; 田东华; 叶淑强; 王磊
Original assignee: Beijing Mininglamp Software System Co ltd
Current assignee: Beijing Mininglamp Software System Co ltd
Priority date: 2019-12-10
Filing date: 2019-12-10
Publication date: 2023-03-21
Anticipated expiration: 2039-12-10
Also published as: CN111061833A

Abstract

The application provides a data processing method, a data processing apparatus, an electronic device and a computer-readable storage medium, and relates to the field of natural language data processing. The data processing method comprises: inputting data to be aligned into a word vector model to obtain at least one piece of approximate data for the data to be aligned, the word vector model being obtained using a corpus training library that comprises a plurality of corpora; and acquiring target data corresponding to the at least one piece of approximate data according to a data mapping relationship. The data mapping relationship is a correspondence between a standard data item field and the at least one piece of approximate data, the standard data item field is a benchmarking field meeting a preset rule, and the target data is data having the standard data item field. The data processing method provided by the application can thus benchmark fields automatically, improve benchmarking accuracy, reduce the workload of manual intervention, and thereby achieve rapid data standardization.

Description

Data processing method and device, electronic equipment and computer readable storage medium

Technical Field

The present application relates to the field of natural language data processing, and in particular to a data processing method, an apparatus, an electronic device, and a computer-readable storage medium.

Background

With the continuous development of informatization, data has become an important resource of the new era. Because traditional systems were built independently, and for other historical reasons, data lacks unified standards. This adds cost and time to every use of the data and forces each data consumer to repeat the same preprocessing work. It has therefore become necessary to establish uniform data standards for enterprises and industries, both to constrain the construction of new systems and to standardize the data produced by historical systems.

To standardize table fields as quickly as possible, the relevant departments of an enterprise have to assign several people to go through the business table structure documents one by one, manually pick out the fields that need standardizing, and manually replace them with standard fields, which is highly inefficient. Checking the benchmarked fields requires yet another dedicated person, which wastes further resources. The traditional data standardization process is time-consuming, labor-intensive and error-prone. How to improve the accuracy and recall of benchmarking, reduce the workload of manual intervention, and ultimately improve benchmarking efficiency is therefore a problem that urgently needs to be solved.

Disclosure of Invention

In order to overcome at least the above-mentioned deficiencies in the prior art, one of the objectives of the present application is to provide a data processing method, apparatus, electronic device and computer readable storage medium.

In a first aspect, the present application provides a data processing method, including: inputting data to be aligned into a word vector model to obtain at least one piece of approximate data for the data to be aligned, where the word vector model is obtained using a corpus training library that includes a plurality of corpora; and acquiring target data corresponding to the at least one piece of approximate data according to a data mapping relationship, where the data mapping relationship is a correspondence between a standard data item field and the at least one piece of approximate data, the standard data item field is a benchmarking field meeting a preset rule, and the target data is data having the standard data item field.

In an optional embodiment, the data to be aligned includes a field to be aligned and annotation information to be aligned, where the annotation information is used to explain the field to be aligned, and inputting the data to be aligned into the word vector model to obtain at least one piece of approximate data includes: inputting the annotation information to be aligned into the word vector model; judging whether at least one piece of approximate data corresponding to the annotation information is acquired; if so, executing the step of acquiring target data corresponding to the at least one piece of approximate data according to the data mapping relationship; if not, inputting the field to be aligned into the word vector model; judging whether at least one piece of approximate data corresponding to the field to be aligned is acquired; and if so, executing the step of acquiring the target data corresponding to the at least one piece of approximate data according to the data mapping relationship.

In an optional embodiment, when at least one piece of approximate data corresponding to the field to be aligned is acquired, the method further includes: storing the field to be aligned in an approximate library to update the data mapping relationship, where the approximate library is matched against the standard data item field to obtain the data mapping relationship. When at least one piece of approximate data corresponding to the annotation information to be aligned is acquired, the method further includes: storing the annotation information to be aligned in the approximate library to update the data mapping relationship.

In an optional embodiment, the standard data item field includes a qualifier and a data element, where the qualifier modifies the data element and the data element is the smallest unit describing the standard data item field, and acquiring target data corresponding to the at least one piece of approximate data according to the data mapping relationship includes: acquiring a mapping data item field corresponding to the at least one piece of approximate data according to the data mapping relationship, where the mapping data item field is used to determine the field to be aligned in the data to be aligned; and replacing the field to be aligned with the qualifier and the data element according to the mapping data item field to obtain the target data.

In an optional embodiment, the at least one piece of approximate data includes an approximate field and an approximate identifier, and the data mapping relationship is established by matching the approximate identifier with the data item identifier of the standard data item field, and the method further includes: storing the standard data item field and the approximate field in the corpus training library to update the word vector model.

In a second aspect, the present application provides a data processing apparatus including a query module and a processing module. The query module is configured to input data to be aligned into a word vector model to obtain at least one piece of approximate data for the data to be aligned, where the word vector model is obtained using a corpus training library that includes a plurality of corpora. The processing module is configured to acquire target data corresponding to the at least one piece of approximate data according to a data mapping relationship, where the data mapping relationship is a correspondence between a standard data item field and the at least one piece of approximate data, the standard data item field is a benchmarking field meeting a preset rule, and the target data is data having the standard data item field.

In an optional embodiment, the data to be aligned includes a field to be aligned and annotation information to be aligned, where the annotation information is used to explain the field to be aligned, and the processing module is further configured to: input the annotation information to be aligned into the word vector model; judge whether at least one piece of approximate data corresponding to the annotation information is acquired; if so, execute the step of acquiring target data corresponding to the at least one piece of approximate data according to the data mapping relationship; if not, input the field to be aligned into the word vector model; judge whether at least one piece of approximate data corresponding to the field to be aligned is acquired; and if so, execute the step of acquiring the target data corresponding to the at least one piece of approximate data according to the data mapping relationship.

In an optional embodiment, when at least one piece of approximate data corresponding to the field to be aligned is acquired, the processing module is further configured to store the field to be aligned in an approximate library to update the data mapping relationship, where the approximate library is matched against the standard data item field to obtain the data mapping relationship. When at least one piece of approximate data corresponding to the annotation information to be aligned is acquired, the processing module is further configured to store the annotation information to be aligned in the approximate library to update the data mapping relationship.

In an optional embodiment, the standard data item field includes a qualifier and a data element, where the qualifier modifies the data element and the data element is the smallest unit describing the standard data item field, and the processing module is further configured to: acquire a mapping data item field corresponding to the at least one piece of approximate data according to the data mapping relationship, where the mapping data item field is used to determine the field to be aligned in the data to be aligned; and replace the field to be aligned with the qualifier and the data element according to the mapping data item field to obtain the target data.

In an optional embodiment, the at least one piece of approximate data includes an approximate field and an approximate identifier, and the data mapping relationship is established by matching the approximate identifier with the data item identifier of the standard data item field; the processing module is further configured to store the standard data item field and the approximate field in the corpus training library to update the word vector model.

In a third aspect, the present application provides an electronic device comprising a processor and a memory, the memory storing machine executable instructions executable by the processor to implement the method of any one of the preceding embodiments.

In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, which computer program, when executed by a processor, implements the method according to any of the preceding embodiments.

Compared with the prior art, the present application provides a data processing method, a data processing apparatus, an electronic device and a computer-readable storage medium, and relates to the field of natural language data processing. The data processing method comprises: inputting data to be aligned into a word vector model to obtain at least one piece of approximate data for the data to be aligned, the word vector model being obtained using a corpus training library that comprises a plurality of corpora; and acquiring target data corresponding to the at least one piece of approximate data according to a data mapping relationship. The data mapping relationship is a correspondence between a standard data item field and the at least one piece of approximate data, the standard data item field is a benchmarking field meeting a preset rule, and the target data is data having the standard data item field. The data processing method provided by the application can thus benchmark fields automatically, improve benchmarking accuracy, reduce the workload of manual intervention, and thereby achieve rapid data standardization.

Drawings

To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings used in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and should therefore not be regarded as limiting its scope; those skilled in the art can derive other related drawings from them without inventive effort.

Fig. 1 is a schematic flowchart of a data processing method according to an embodiment of the present application;

Fig. 2 is a schematic flowchart of another data processing method according to an embodiment of the present application;

Fig. 3 is a schematic flowchart of another data processing method according to an embodiment of the present application;

Fig. 4 is a schematic flowchart of another data processing method according to an embodiment of the present application;

Fig. 5 is a schematic flowchart of another data processing method according to an embodiment of the present application;

Fig. 6 is a schematic block diagram of a data processing apparatus according to an embodiment of the present application;

Fig. 7 is a schematic block diagram of an electronic device according to an embodiment of the present application.

Reference numerals: 40 - data processing apparatus; 41 - query module; 42 - processing module; 60 - electronic device; 61 - memory; 62 - processor; 63 - communication interface.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations.

Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making any creative effort belong to the protection scope of the present application.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.

In the description of the present application, it should also be noted that, unless expressly stated or limited otherwise, the terms "disposed," "mounted," "coupled," and "connected" are to be construed broadly. For example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; and it may be direct, indirect through an intervening medium, or an internal communication between two elements. Those of ordinary skill in the art can understand the specific meaning of these terms in the present application according to the specific case.

In current technical solutions, standardizing the various types of data mostly relies on manually screening fields and replacing them with standard fields, which consumes considerable labor, is inefficient, and has a high error rate. In industries such as transportation and public security, historical business systems contain a large number of legacy databases. For historical reasons, the table structures of databases in different regions are inconsistent. For example, the same identity number field is named GMSFZH (citizen identity card number) in some tables and SFZHM (identity card number) in others. The lack of a unified naming standard for table fields creates major obstacles for data governance, subsequent data analysis, mining and presentation, integration with third-party platforms, and similar scenarios. At present, standardizing the fields of business data tables depends mainly on manual benchmarking, but manually benchmarking the many data tables of every region is an enormous task. On the one hand, manual benchmarking consumes a large amount of human resources; on the other hand, different vendors apply different manual benchmarking standards, and the resulting inconsistency in table field definitions makes integration between vendors more difficult.

To solve the above problems and the deficiencies described in the background, an embodiment of the present application provides a data processing method. Referring to Fig. 1, Fig. 1 is a schematic flowchart of the data processing method provided in this embodiment. The data processing method includes the following steps:

S20: inputting the data to be aligned into the word vector model to obtain at least one piece of approximate data for the data to be aligned.

The word vector model is obtained using a corpus training library, which includes a plurality of corpora. For example, in natural language processing (NLP), the corpora in the corpus training library may be collected from, but are not limited to, network resources such as news, microblogs and online libraries on the internet, as well as various open-source corpora; they may also be obtained from desensitized data provided by a client. The specific acquisition method may be determined according to actual needs. It should be understood that the richer and more targeted the data sources used to build the NLP corpus (the corpus training library), the better the quality of the library and of the resulting word vector model. After enough NLP text corpora have been collected, they are processed, because raw corpora must be analyzed and processed before they become a useful resource. The processing may include, but is not limited to, word segmentation, data cleaning, conversion between Simplified and Traditional Chinese, full-width to half-width conversion, and various other normalization steps, producing a normalized NLP text corpus. The word vector model may then be trained on the normalized corpus using a commonly used classical natural language model such as Word2Vec, fastText, ELMo, or Bidirectional Encoder Representations from Transformers (BERT); the specific training method is not limited in the present application.
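
As an illustration of the normalization step only, the following minimal sketch folds full-width characters to half-width and cleans whitespace using the Python standard library. The function name `normalize_corpus_line` is hypothetical and not part of the application; word segmentation and Simplified/Traditional conversion are omitted because they would typically require third-party tools.

```python
import re
import unicodedata


def normalize_corpus_line(text: str) -> str:
    """Normalize one raw corpus line before tokenization (illustrative only)."""
    # NFKC normalization folds full-width letters, digits and the
    # ideographic space into their half-width (ASCII) equivalents.
    text = unicodedata.normalize("NFKC", text)
    # Simple cleaning step: collapse whitespace runs and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


# Full-width field name followed by an ideographic space and a Chinese comment.
print(normalize_corpus_line("ＧＭＳＦＺＨＭ　公民身份号码 "))
```

Normalizing both the training corpus and the incoming fields in the same way is one way to keep lookups consistent; the application itself does not prescribe a particular routine.
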
One example of acquiring approximate data is as follows: when the name of the data to be aligned is "verifier identification card", the approximate data may be "verifier", "identification card number", "verifier identification card number", and so on, or the approximate data acquired directly may be "verifier identification card" itself. The approximate data actually output depends on the precision of the word vector model.
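
To make the idea of querying a word vector model for approximate data concrete, here is a small self-contained sketch that ranks terms by cosine similarity. The three-dimensional vectors in `TOY_VECTORS`, the extra term "vehicle plate number", and the helper names are invented for illustration; a trained model (Word2Vec, fastText, and so on) would supply real vectors, and the application does not specify the similarity measure.

```python
import math

# Hypothetical toy vectors; a trained word vector model would provide these.
TOY_VECTORS = {
    "verifier identification card": [0.90, 0.80, 0.10],
    "verifier identification card number": [0.88, 0.82, 0.15],
    "identification card number": [0.50, 0.90, 0.20],
    "vehicle plate number": [0.10, 0.20, 0.95],
}


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(query, vectors, top_n=3):
    """Return up to top_n (term, similarity) pairs, most similar first.

    The query term itself is excluded here, and an out-of-vocabulary
    query yields an empty list (no approximate data acquired).
    """
    if query not in vectors:
        return []
    q = vectors[query]
    scored = [(t, cosine(q, v)) for t, v in vectors.items() if t != query]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


for term, sim in most_similar("verifier identification card", TOY_VECTORS):
    print(f"{term}: {sim:.3f}")
```
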

S21: acquiring target data corresponding to the at least one piece of approximate data according to the data mapping relationship.

The data mapping relationship is a correspondence between a standard data item field and at least one piece of approximate data, the standard data item field is a benchmarking field that meets a preset rule, and the target data is data having the standard data item field. It can be understood that the data described above may be of various types, including but not limited to Chinese, uppercase or lowercase English, and combinations of Chinese and English.

It can be understood that when the acquired approximate data is identical to the data to be aligned, that approximate data can be used directly to obtain the target data. When no identical approximate data is acquired, N pieces of approximate data similar to the data to be aligned may be acquired instead. N may be determined by a user-defined threshold or in other ways: the similarity of each piece of approximate data is compared with the threshold to filter out those with low similarity, and the N pieces above the threshold are taken. With the data processing method provided by the application, data benchmarking is carried out by a machine, which greatly reduces labor cost, and acquiring approximate data with the word vector model effectively reduces the benchmarking error rate.
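
The selection rule just described (use an exact match directly, otherwise keep the top N candidates above a threshold) can be sketched as follows. The function name, the default threshold of 0.8 and the default N of 3 are assumptions chosen for illustration; the application leaves these values to the user.

```python
import math


def select_approximations(candidates, threshold=0.8, top_n=3):
    """Pick approximate data from (term, similarity) candidates.

    An exact match (similarity 1) is returned alone; otherwise candidates
    strictly above the threshold are kept, best first, up to top_n.
    """
    exact = [c for c in candidates if math.isclose(c[1], 1.0)]
    if exact:
        return exact[:1]
    kept = [c for c in candidates if c[1] > threshold]
    kept.sort(key=lambda c: c[1], reverse=True)
    return kept[:top_n]


candidates = [("identification card number", 0.94),
              ("verifier", 0.71),
              ("verifier identification card number", 0.99)]
print(select_approximations(candidates, threshold=0.8, top_n=2))
```

An empty result corresponds to the case in which no approximate data is acquired.
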

In an optional implementation, to obtain the approximate data described above, and taking as an example the case in which the data to be aligned includes a field to be aligned and annotation information to be aligned, where the annotation information explains the field to be aligned, refer to Fig. 2, which builds on Fig. 1 and is a schematic flowchart of another data processing method provided in an embodiment of the present application. The above S20 may include:

S201: inputting the annotation information to be aligned into the word vector model.

S202: judging whether at least one piece of approximate data corresponding to the annotation information to be aligned is acquired.

If yes, S21 is executed; if not, S203 is executed. It can be understood that the field annotation information is input into the word vector model to predict the top N pieces of approximate data associated with the annotation information to be aligned. If the similarity between a piece of approximate data and the data to be aligned is 1, that approximate data can be used directly, which reduces the amount of data processing. Otherwise, the similarity of each piece of approximate data is compared with a threshold to filter out those with low similarity, and the N pieces above the threshold are taken. When any one or more pieces of approximate data are acquired, the approximate data of the data to be aligned is considered to have been acquired. In the public security industry, a table field name is generally formed by concatenating the capitalized initial letters of the Chinese pinyin, and the annotation information is the corresponding Chinese text. For example, the common standard name of the identity number field is the capitalized pinyin abbreviation GMSFZHM, and its annotation information is "citizen identity card number".

S203: inputting the field to be aligned into the word vector model.

S204: judging whether at least one piece of approximate data corresponding to the field to be aligned is acquired.

If yes, S21 is executed; if not, S22 is executed. It can be understood that if no approximate data can be obtained from the word vector model for the field annotation information, the field information is input into the word vector model to predict the top N pieces of approximate data associated with the data to be aligned. If the similarity between a piece of approximate data and the data to be aligned is 1, that approximate data is used directly; otherwise, the similarity of each piece of approximate data is compared with a threshold to filter out those with low similarity, and the N pieces above the threshold are taken. The relationship between the field to be aligned and the annotation information to be aligned may be, for example, the relationship between a column name and its Comment in a MySQL database. For business reasons, a table field name often does not reveal its meaning at a glance, so the annotation information is used first to obtain the benchmarking result. In some cases, however, the two judgments may instead be performed in parallel, or the field to be aligned may be judged first and the annotation information afterwards; the specific order may be determined according to the actual data benchmarking requirements.
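
The default ordering of steps S201 to S204, with S22 as the fallback, can be sketched as below. The `lookup` callable stands in for the word vector model query and is an assumption of this sketch, as are the function name and the returned labels.

```python
from typing import Callable, List, Optional, Tuple

# A lookup takes a text and returns (approximate term, similarity) pairs.
Lookup = Callable[[str], List[Tuple[str, float]]]


def find_approximations(annotation: Optional[str], field: str, lookup: Lookup):
    """Return (source, approximations), where source is 'annotation',
    'field', or 'manual' when neither query yields approximate data."""
    if annotation:
        hits = lookup(annotation)      # S201 / S202: try the annotation first
        if hits:
            return "annotation", hits
    hits = lookup(field)               # S203 / S204: fall back to the field name
    if hits:
        return "field", hits
    return "manual", []                # S22: hand over to manual benchmarking


# Hypothetical stand-in for the model, keyed by query text.
fake_model = {"SFZHM": [("GMSFZHM", 0.91)]}
print(find_approximations("unrecognized comment", "SFZHM",
                          lambda text: fake_model.get(text, [])))
```

Swapping the two queries, or issuing them in parallel, would implement the alternative orderings mentioned above.
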

S22: performing data benchmarking through manual intervention to obtain the target data.

It can be understood that when approximate data for the data to be aligned cannot be acquired through the word vector model, the target data, that is, data having a standard data item field, may be acquired through manual intervention. For example, if several pieces of approximate data point to the same standard data item field, the corresponding field information of the data to be aligned in the original table is replaced with that standard data item field; if they point to different data item fields, manual analysis may be performed. In some business scenarios, manual intervention can also judge whether the final standard data item field really meets the requirement of the input field (the data benchmarking requirement), which improves benchmarking accuracy. Alternatively, when the annotation information to be aligned is input, the word vector model can, based on the word-to-word relationships learned from the NLP text corpus, predict a series of potential approximate values associated with the input field annotation, together with their similarities; these approximate values cover essentially all possible synonyms of the field annotation. The top N approximate values with the highest similarity are then selected according to a predefined similarity threshold.
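
The agreement check described above (replace automatically only when all approximate data point to the same standard data item field) might look like the following sketch. The mapping contents and the identifier "D001" are hypothetical.

```python
def resolve_standard_field(approximations, mapping):
    """Return the single standard data item field ID that all mapped
    approximations agree on, or None when manual analysis is needed
    (no mapped approximation, or conflicting targets)."""
    targets = {mapping[a] for a in approximations if a in mapping}
    if len(targets) == 1:
        return next(iter(targets))
    return None


# Hypothetical mapping: approximate data -> standard data item field ID.
mapping = {"identity card number": "D001",
           "citizen identity number": "D001",
           "vehicle plate number": "D007"}

print(resolve_standard_field(["identity card number",
                              "citizen identity number"], mapping))
print(resolve_standard_field(["identity card number",
                              "vehicle plate number"], mapping))
```
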

In an optional embodiment, to improve the accuracy of data benchmarking, a possible implementation is provided on the basis of Fig. 2. Referring to Fig. 3, Fig. 3 is a schematic flowchart of another data processing method provided in an embodiment of the present application. The data processing method may further include:

S23: storing the annotation information to be aligned in the approximate library to update the data mapping relationship.

The approximate library is matched against standard data item fields to obtain the data mapping relationship. It should be noted that the approximate library may be accumulated in advance in several ways, for example by training on the data item fields in the standard data item field library together with a massive corpus, or by manual entry. This is not the focus of the present disclosure and is not described further here.

When at least one piece of approximate data corresponding to the field to be aligned is acquired, the data processing method further includes:

S24: storing the field to be aligned in the approximate library to update the data mapping relationship.

For example, a data mapping relationship may be established with an approximation library by a standard data item field library, which is a database that includes at least one standard data item field. The data mapping relationship may be mapped by an identification of a standard data item field with an identification of approximation data in an approximation repository.

The field to be aligned or the annotation information to be aligned is written to the approximate library and associated with a data item field in the standard data item field library: if benchmarking succeeded on the basis of the annotation information, the annotation information is written to the approximate library; otherwise, the field to be aligned is written. The value of the approximate data identifier (ID) may be the same as the ID of the corresponding standard data item field in the standard data item field library. An approximate data record includes an approximate data name and an approximate data ID. Using the approximate data ID, the record with the matching standard data item field ID can be looked up in the standard data item field library to obtain the standard data item name and the standard data item field name, which correspond, in that order, to the field annotation information and the field information of the data to be aligned. It can be understood that building the approximate library and continually updating and supplementing it effectively improves the accuracy of the data mapping relationship, which in turn improves benchmarking accuracy and reduces the benchmarking error rate.
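
One way to picture the ID-based association between the approximate library and the standard data item field library is the in-memory sketch below. The class names, attribute names and the ID "D001" are assumptions made for illustration; the application does not define a storage schema.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class StandardDataItem:
    item_id: str      # ID in the standard data item field library
    name: str         # standard data item name (replaces the annotation)
    field_name: str   # standard data item field name (replaces the field)


class ApproximateLibrary:
    """Maps approximate data names to standard data items via a shared ID."""

    def __init__(self, standard_items: Iterable[StandardDataItem]):
        self._standard: Dict[str, StandardDataItem] = {
            item.item_id: item for item in standard_items}
        self._approx_ids: Dict[str, str] = {}  # approximate name -> approximate ID

    def add(self, approx_name: str, approx_id: str) -> None:
        # The approximate ID must equal an existing standard data item ID.
        if approx_id not in self._standard:
            raise KeyError(f"unknown standard data item ID: {approx_id}")
        self._approx_ids[approx_name] = approx_id

    def lookup(self, approx_name: str) -> Optional[StandardDataItem]:
        approx_id = self._approx_ids.get(approx_name)
        return self._standard[approx_id] if approx_id is not None else None


library = ApproximateLibrary(
    [StandardDataItem("D001", "citizen identity card number", "GMSFZHM")])
library.add("SFZHM", "D001")   # S24: a successfully benchmarked field is stored
print(library.lookup("SFZHM"))
```
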

In an optional embodiment, to obtain the target data, and taking as an example the case in which the standard data item field includes a qualifier and a data element, where the qualifier modifies the data element and the data element is the smallest unit describing the standard data item field, refer to Fig. 4, which builds on Fig. 1 and is a schematic flowchart of another data processing method provided in an embodiment of the present application. The above S21 may include:

S211: acquiring a mapping data item field corresponding to the at least one piece of approximate data according to the data mapping relationship.

The mapping data item field is used to determine the field to be aligned in the data to be aligned. A data element may be an atomic word or a business-specific word that cannot be decomposed further, and a qualifier is a word that modifies the data element. For example, the data item field "auditor identity card" consists of "auditor" and "identity card", where "auditor" is the qualifier and "identity card" is the data element.

S212: replacing the field to be aligned with the qualifier and the data element according to the mapping data item field to obtain the target data.

For example, the field information of the data to be aligned is replaced with the standard data item field. The replacement may be performed as follows: the annotation information and the field information of the data to be aligned are replaced with the data item name and the data item field name of the corresponding standard data item field in the standard data item field library, forming the target data. The target data can be assembled into a new table creation statement so that the business database can be refreshed uniformly. For the manual intervention described above, one possible implementation is to manually create a first data item field consisting of two parts: a first data item name and a first data item field name. The first data item name includes a qualifier and a data element, selected from the qualifier table and from the data element information in the standard data item field library according to the actual meaning of the field in the original table (the field of the data to be aligned). The first data item field name defaults to the capitalized pinyin abbreviation of the data item name. The first data item field is then added to the standard data item field library as a new data item record, in which the first data item ID may be a random ID, the qualifier ID corresponds to the qualifier ID in the qualifier table, and the first data element ID corresponds to the data item ID of the data element. Finally, the original field-related information (the data to be aligned) is replaced with the benchmarking result: both the original table field name and the field annotation information are replaced, that is, the table structure in the database is updated.
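
As a rough illustration of assembling target data into a table creation statement, the sketch below emits MySQL-style DDL in which each column carries the standard field name and the standard data item name as its comment. The helper name, the table name `person` and the column type are hypothetical; the application does not specify the statement format, and identifiers are not validated here.

```python
def build_create_table(table, columns):
    """Build a MySQL-style CREATE TABLE statement.

    columns: iterable of (standard_field_name, column_type, standard_item_name).
    Single quotes in comments are doubled; identifiers are assumed trusted.
    """
    lines = []
    for field_name, column_type, comment in columns:
        escaped = comment.replace("'", "''")
        lines.append(f"  `{field_name}` {column_type} COMMENT '{escaped}'")
    return f"CREATE TABLE `{table}` (\n" + ",\n".join(lines) + "\n);"


# The original column SFZHM has been benchmarked to the standard field GMSFZHM.
print(build_create_table(
    "person",
    [("GMSFZHM", "VARCHAR(18)", "citizen identity card number")]))
```

For an existing table, an equivalent alteration statement could be generated instead; which approach fits depends on how the business database is refreshed.
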

In an optional implementation, in order to improve the degree of approximation between the approximate data and the data to be aligned, on the basis of fig. 1, the approximate data includes, for example, an approximate field and an approximate identifier, and the mapping relationship corresponds to the data item identifier of the standard data item field through the approximate identifier. Referring to fig. 5, fig. 5 is a schematic flow diagram of another data processing method provided in an embodiment of the present application. The data processing method further comprises:

S25, storing the standard data item field and the approximate field into the corpus training library to update the word vector model.

It can be understood that the newly generated standard data item field, approximate field and approximate identifier are added to the corpus training library to provide updated data for subsequent training of the word vector model, and that the corpus training library also dynamically updates its corpuses according to data acquired from other sources.
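By way of non-limiting illustration only, the update of the corpus training library might be sketched as follows. The representation of a corpus entry as a pair of strings and the `retrain` callback standing in for word vector model training are assumptions made for this sketch; the present application does not prescribe either.

```python
from typing import Callable, Sequence

# Hypothetical corpus entry: the standard field and an approximate field
# placed in the same "sentence" so that a trainer sees them together.
Sentence = tuple[str, ...]


def store_and_retrain(
    corpus: list[Sentence],
    standard_field: str,
    approximate_field: str,
    retrain: Callable[[Sequence[Sentence]], None],
) -> bool:
    """Add the pair to the corpus and retrain; return False if already present."""
    sentence = (standard_field, approximate_field)
    if sentence in corpus:
        return False          # nothing new, so the model is not retrained
    corpus.append(sentence)
    retrain(corpus)           # stand-in for updating the word vector model
    return True
```

Retraining is triggered only when the corpus actually changes, which is one possible design choice; a batch or scheduled update would be equally consistent with the description.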

To implement any of the above data processing methods, an embodiment of the present application provides a data processing apparatus. Referring to fig. 6, fig. 6 is a block diagram of the data processing apparatus provided in an embodiment of the present application. The data processing apparatus 40 includes a query module 41 and a processing module 42.

The query module 41 is configured to input the data to be aligned to the word vector model to obtain at least one approximate data of the data to be aligned. The word vector model is obtained using a corpus training library, which includes a plurality of corpuses.

The processing module 42 is configured to obtain target data corresponding to at least one piece of approximate data according to the data mapping relationship. The data mapping relationship is a corresponding relationship between a standard data item field and at least one approximate data, the standard data item field is a benchmarking field which accords with a preset rule, and the target data is data with the standard data item field.

It should be understood that the query module 41 and the processing module 42 may implement the various steps shown in fig. 1 in cooperation.

In an optional embodiment, the data to be aligned includes a field to be aligned and annotation information to be aligned, the annotation information to be aligned being used for explaining the field to be aligned. The processing module 42 is further configured to: input the annotation information to be aligned to the word vector model; judge whether at least one approximate data corresponding to the annotation information to be aligned is acquired; if so, execute the step of acquiring target data corresponding to the at least one approximate data according to the data mapping relationship; if not, input the field to be aligned to the word vector model; judge whether at least one approximate data corresponding to the field to be aligned is acquired; and if so, execute the step of acquiring target data corresponding to the at least one approximate data according to the data mapping relationship. It should be understood that the processing module 42 may implement S201 to S204 described above.
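By way of non-limiting illustration only, the order of the two queries might be sketched as follows. The `Query` callable is a hypothetical stand-in for the word vector model, and the returned source labels are assumptions made for this sketch.

```python
from typing import Callable, Optional

# Hypothetical stand-in for the word vector model:
# input text -> list of approximate data (empty if none is acquired).
Query = Callable[[str], list[str]]


def find_approximate(
    annotation: str, field: str, query: Query
) -> Optional[tuple[str, list[str]]]:
    """Try the annotation information first, then fall back to the field."""
    for source, text in (("annotation", annotation), ("field", field)):
        hits = query(text) if text else []
        if hits:
            # Approximate data acquired: the caller proceeds to the
            # mapping step with these results.
            return source, hits
    return None  # neither query produced approximate data
```

Returning the source alongside the results lets the caller know which of the two inputs produced the match.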

In an optional embodiment, when at least one piece of approximate data corresponding to the field to be aligned is acquired, the processing module 42 is further configured to store the field to be aligned in an approximate library to update the data mapping relationship, where the approximate library is used for matching with the standard data item field to obtain the data mapping relationship. When at least one piece of approximate data corresponding to the annotation information to be aligned is acquired, the processing module 42 is further configured to store the annotation information to be aligned in the approximate library to update the data mapping relationship. It should be understood that the processing module 42 may implement S23 and S24 described above.
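By way of non-limiting illustration only, storing a matched input in the approximate library and refreshing the data mapping relationship might be sketched as follows. Keying the library by a data item identifier and the identifier `D001` are assumptions made for this sketch.

```python
def store_approximate(
    library: dict[str, set[str]], data_item_id: str, text: str
) -> None:
    """Record a matched field or annotation under its standard data item ID."""
    library.setdefault(data_item_id, set()).add(text)


def build_mapping(library: dict[str, set[str]]) -> dict[str, str]:
    """Derive the mapping from stored text to standard data item ID."""
    # If the same text were stored under two IDs, the later one would win;
    # resolving such conflicts is outside the scope of this sketch.
    return {
        text: item_id for item_id, texts in library.items() for text in texts
    }
```

Rebuilding the mapping from the library after each store keeps the two consistent at the cost of recomputation.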

In an optional embodiment, the standard data item field includes a qualifier and a data element, the qualifier is used to modify the data element, and the data element is the smallest unit describing the standard data item field. The processing module 42 is further configured to: acquire a mapping data item field corresponding to the at least one approximate data according to the data mapping relationship, the mapping data item field being used for determining the field to be aligned in the data to be aligned; and replace the field to be aligned with the qualifier and the data element according to the mapping data item field to obtain the target data. It should be understood that the processing module 42 may also implement S211 and S212 described above.
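By way of non-limiting illustration only, the lookup and replacement might be sketched as follows. The `StandardItem` record, the dictionary-based mapping from an approximate identifier to a data item identifier, and joining the qualifier and data element with a space are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StandardItem:
    """Hypothetical standard data item: a qualifier modifying a data element."""
    qualifier: str
    data_element: str


def resolve_item(
    approx_id: str, mapping: dict[str, str], items: dict[str, StandardItem]
) -> Optional[StandardItem]:
    """Follow approximate ID -> data item ID -> standard data item."""
    item_id = mapping.get(approx_id)
    return items.get(item_id) if item_id is not None else None


def replace_field(data: dict[str, str], item: StandardItem) -> dict[str, str]:
    """Return target data whose field is the qualifier plus the data element."""
    target = dict(data)  # leave the input data unchanged
    target["field"] = f"{item.qualifier} {item.data_element}"
    return target
```

A missing identifier yields `None` rather than an exception, so the caller can decide how unmatched data is handled.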

In an optional embodiment, the at least one approximate data includes an approximate field and an approximate identifier, and the mapping relationship corresponds to the data item identifier of the standard data item field through the approximate identifier. The processing module 42 is further configured to store the standard data item field and the approximate field in the corpus training library to update the word vector model. It should be understood that the processing module 42 may also implement S25 described above.

An embodiment of the present application provides an electronic device. As shown in fig. 7, fig. 7 is a schematic block diagram of the electronic device provided in an embodiment of the present application. The electronic device 60 comprises a memory 61, a processor 62 and a communication interface 63. The memory 61, the processor 62 and the communication interface 63 are electrically connected to each other, directly or indirectly, to enable transmission or interaction of data. For example, the components may be electrically connected to each other via one or more communication buses or signal lines. The memory 61 may be used to store software programs and modules, such as program instructions/modules corresponding to the data processing method provided in the embodiments of the present application, and the processor 62 executes the software programs and modules stored in the memory 61 so as to perform various functional applications and data processing. The communication interface 63 may be used for communicating signaling or data with other node devices. In the present application, the electronic device 60 may have a plurality of communication interfaces 63.

The memory 61 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like.

The processor 62 may be an integrated circuit chip having signal processing capabilities. The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), etc.; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc.

The electronic device 60 may implement any of the data processing methods provided herein. The electronic device 60 may be, but is not limited to, a cell phone, a tablet computer, a notebook computer, a server, or other electronic device with processing capabilities.

An embodiment of the present application provides a computer-readable storage medium on which a computer program is stored, and the computer program, when executed by a processor, implements the data processing method according to any one of the foregoing embodiments. The computer-readable storage medium may be, but is not limited to, any of various media that can store program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a PROM, an EPROM, an EEPROM, a magnetic disk or an optical disk.

In summary, the present application provides a data processing method, an apparatus, an electronic device and a computer-readable storage medium, and relates to the field of natural language data processing. The data processing method comprises the following steps: inputting the data to be aligned to a word vector model to obtain at least one approximate data of the data to be aligned, the word vector model being obtained by using a corpus training library that comprises a plurality of corpuses; and acquiring target data corresponding to the at least one approximate data according to the data mapping relationship, wherein the data mapping relationship is a corresponding relationship between a standard data item field and the at least one approximate data, the standard data item field is a benchmarking field which accords with a preset rule, and the target data is data with the standard data item field. It can be understood that the data processing method provided by the application can realize automatic benchmarking of fields, improve the accuracy of benchmarking, reduce the workload of manual intervention, and thereby realize rapid data standardization.

The above describes only various embodiments of the present application, but the protection scope of the present application is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive of within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A method of data processing, the method comprising:

inputting data to be aligned to a word vector model to obtain at least one approximate data of the data to be aligned; the data to be aligned comprises a field to be aligned and information to be aligned, the information to be aligned is used for explaining the field to be aligned, the word vector model is obtained by using a corpus training library, and the corpus training library comprises a plurality of corpuses;

acquiring target data corresponding to the at least one approximate data according to the data mapping relation; the data mapping relation is a corresponding relation between a standard data item field and the at least one approximate data, the standard data item field is a benchmarking field which accords with a preset rule, and the target data is data with the standard data item field;

the inputting the data to be aligned to the word vector model to obtain at least one approximate data of the data to be aligned includes:

inputting the information to be aligned to the word vector model;

judging whether at least one approximate data corresponding to the information to be aligned is acquired;

if yes, executing the step of acquiring target data corresponding to the at least one approximate data according to the data mapping relation;

if not, inputting the field to be aligned into the word vector model;

judging whether at least one approximate data corresponding to the field to be aligned is acquired or not;

and if so, executing the step of acquiring the target data corresponding to the at least one approximate data according to the data mapping relation.

2. The method according to claim 1, wherein when at least one piece of approximate data corresponding to the information to be aligned is acquired, the method further comprises:

storing the information to be aligned to an approximate library so as to update the data mapping relation; the approximate library is used for matching with the standard data item field to acquire the data mapping relation;

when at least one piece of approximate data corresponding to the field to be aligned is acquired, the method further comprises the following steps:

and storing the field to be aligned to the approximate library to update the data mapping relation.

3. The method of claim 1, wherein the standard data item field comprises a qualifier and a data element, the qualifier is used for modifying the data element, the data element is a minimum unit for describing the standard data item field, and the obtaining the target data corresponding to the at least one approximate data according to the data mapping relationship comprises:

acquiring a mapping data item field corresponding to the at least one approximate data according to the data mapping relation; the mapping data item field is used for determining a to-be-aligned field in the to-be-aligned data;

and replacing the field to be aligned with the qualifier and the data element according to the mapping data item field to obtain the target data.

4. The method according to claim 1 or 3, wherein the at least one approximate data comprises an approximate field and an approximate identifier, and the mapping relationship corresponds to the data item identifier of the standard data item field through the approximate identifier, and the method further comprises:

storing the standard data item field and the approximate field to the corpus training library to update the word vector model.

5. A data processing apparatus, comprising a query module and a processing module;

the query module is used for inputting the data to be aligned to the word vector model so as to obtain at least one approximate datum of the data to be aligned; the data to be aligned comprises a field to be aligned and information to be aligned, the information to be aligned is used for explaining the field to be aligned, the word vector model is obtained by using a corpus training library, and the corpus training library comprises a plurality of corpuses;

the processing module is used for acquiring target data corresponding to the at least one approximate data according to the data mapping relation; the data mapping relationship is a corresponding relationship between a standard data item field and the at least one approximate data, the standard data item field is a benchmarking field which accords with a preset rule, and the target data is data with the standard data item field;

the query module is specifically configured to:

judging whether at least one approximate data corresponding to the information to be aligned is acquired;

if not, inputting the field to be aligned into the word vector model;

6. The apparatus of claim 5, wherein the standard data item field comprises a qualifier and a data element, wherein the qualifier is used to modify the data element, wherein the data element is a minimum unit describing the standard data item field, and wherein the processing module is further configured to:

and replacing the field to be aligned with the qualifier and the data element according to the mapping data item field to obtain the target data.

7. The apparatus according to claim 5 or 6, wherein the at least one approximate data comprises an approximate field and an approximate identifier, and the mapping relationship corresponds to the data item identifier of the standard data item field through the approximate identifier;

the processing module is further configured to store the standard data item field and the approximate field in the corpus training library to update the word vector model.

8. An electronic device comprising a processor and a memory, the memory storing machine executable instructions executable by the processor to implement the method of any one of claims 1-4.

9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-4.