CN116304851A - Data standard determining method, apparatus, device, medium and computer program product - Google Patents

Info

Publication number
CN116304851A
Authority
CN
China
Prior art keywords
data
standard
processed
determining
standards
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211499915.2A
Other languages
Chinese (zh)
Inventor
黄天奇
张海军
李甲长
陈石军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202211499915.2A
Publication of CN116304851A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to the technical field of big data and provides a data standard determining method, a data standard determining apparatus, a computer device, a storage medium and a computer program product, which can be applied in particular to the financial field or other related fields. The method and apparatus can improve the efficiency and accuracy of determining the target data standard of data. The method comprises: obtaining data to be processed; determining characteristic data corresponding to the data to be processed; inputting the characteristic data into a pre-trained data standard recognition model, and determining, from pre-stored data standards through the data standard recognition model, candidate data standards of the data to be processed and the matching degrees between the data to be processed and the candidate data standards; and determining the target data standard of the data to be processed from the candidate data standards according to the matching degrees.

Description

Data standard determining method, apparatus, device, medium and computer program product
Technical Field
The present application relates to the field of big data technology, and in particular, to a data standard determining method, an apparatus, a computer device, a storage medium, and a computer program product.
Background
With the development of information technology, whenever a table structure is added or modified, for example at project initiation or during the system development stage, it is necessary to determine whether a corresponding data standard exists for each piece of data (data item, field); for data that must be associated with a data standard, the best-matching target data standard has to be determined from the existing data standards.
In the conventional technology, a professional technician usually selects the target data standard for each piece of data manually. Because thousands of data standards must be matched against a massive number of fields, this requires a great deal of manual work, and the efficiency of determining the target data standard of the data is low.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a data standard determining method, apparatus, computer device, computer readable storage medium, and computer program product.
In a first aspect, the present application provides a method for determining a data standard. The method comprises the following steps:
acquiring data to be processed; the data to be processed is data whose data standard is to be determined;
determining characteristic data corresponding to the data to be processed;
inputting the characteristic data into a pre-trained data standard recognition model, and determining a candidate data standard of the data to be processed and the matching degree between the data to be processed and the candidate data standard from pre-stored data standards through the data standard recognition model;
and determining a target data standard of the data to be processed from the candidate data standards according to the matching degree.
In one embodiment, determining the target data standard of the data to be processed from the candidate data standards according to the matching degree includes:
receiving a determining instruction for the candidate data standards; the determining instruction is an instruction triggered based on the matching degree;
and determining target data standards of the data to be processed from the candidate data standards according to the determining instruction.
In one embodiment, determining the target data standard of the data to be processed from the candidate data standards according to the matching degree includes:
determining candidate data standards with matching degree meeting the preset matching degree condition from the candidate data standards, and taking the candidate data standards as candidate data standards to be selected;
and determining the target data standard of the data to be processed from the candidate data standards to be selected according to the matching degree of the candidate data standards to be selected.
In one embodiment, determining the target data standard of the data to be processed from the candidate data standards includes:
under the condition that at least one matching degree meets the preset matching degree condition, determining a target data standard of the data to be processed from the candidate data standards;
The method further comprises the steps of:
and in the case that none of the matching degrees meets the preset matching degree condition, determining that the data to be processed has no corresponding target data standard.
In one embodiment, the pre-trained data standard recognition model is trained by:
acquiring sample data and a real data standard of the sample data;
dividing sample data and real data standards of the sample data to obtain a training sample set and a verification sample set;
training the data standard recognition model to be trained by using the training sample set to obtain a trained data standard recognition model;
verifying the trained data standard recognition model by using a verification sample set to obtain a verification result;
and under the condition that the verification result is qualified, determining the trained data standard recognition model as a pre-trained data standard recognition model.
In one embodiment, acquiring data to be processed includes:
acquiring data to be preprocessed;
performing data cleaning treatment on the data to be preprocessed to obtain the data to be preprocessed after the data cleaning treatment;
and carrying out data conversion processing on the data to be preprocessed after the data cleaning processing to obtain the data to be processed.
In a second aspect, the present application further provides a data standard determining apparatus. The device comprises:
the data acquisition module is used for acquiring data to be processed; the data to be processed is data whose data standard is to be determined;
the data determining module is used for determining characteristic data corresponding to the data to be processed;
the matching degree determining module is used for inputting the characteristic data into a pre-trained data standard recognition model, and determining candidate data standards of the data to be processed and matching degrees between the data to be processed and the candidate data standards from pre-stored data standards through the data standard recognition model;
and the standard determining module is used for determining the target data standard of the data to be processed from the candidate data standards according to the matching degree.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor which, when executing the computer program, performs the following steps:
acquiring data to be processed; the data to be processed is data whose data standard is to be determined; determining characteristic data corresponding to the data to be processed; inputting the characteristic data into a pre-trained data standard recognition model, and determining a candidate data standard of the data to be processed and the matching degree between the data to be processed and the candidate data standard from pre-stored data standards through the data standard recognition model; and determining a target data standard of the data to be processed from the candidate data standards according to the matching degree.
In a fourth aspect, the present application also provides a computer-readable storage medium. The computer-readable storage medium has stored thereon a computer program which, when executed by a processor, performs the following steps:
acquiring data to be processed; the data to be processed is data whose data standard is to be determined; determining characteristic data corresponding to the data to be processed; inputting the characteristic data into a pre-trained data standard recognition model, and determining a candidate data standard of the data to be processed and the matching degree between the data to be processed and the candidate data standard from pre-stored data standards through the data standard recognition model; and determining a target data standard of the data to be processed from the candidate data standards according to the matching degree.
In a fifth aspect, the present application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the following steps:
acquiring data to be processed; the data to be processed is data whose data standard is to be determined; determining characteristic data corresponding to the data to be processed; inputting the characteristic data into a pre-trained data standard recognition model, and determining a candidate data standard of the data to be processed and the matching degree between the data to be processed and the candidate data standard from pre-stored data standards through the data standard recognition model; and determining a target data standard of the data to be processed from the candidate data standards according to the matching degree.
In the above data standard determining method, apparatus, computer device, storage medium and computer program product, data to be processed is acquired, the data to be processed being data whose data standard is to be determined; characteristic data corresponding to the data to be processed is determined; the characteristic data is input into a pre-trained data standard recognition model; candidate data standards of the data to be processed and the matching degrees between the data to be processed and the candidate data standards are determined from pre-stored data standards through the data standard recognition model; and the target data standard of the data to be processed is determined from the candidate data standards according to the matching degrees. Because the data standard recognition model determines the candidate data standards and the matching degree of each candidate from the massive pre-stored data standards, and the target data standard is then determined from those candidates according to the matching degrees, the efficiency and accuracy of determining the target data standard of the data are improved.
Drawings
FIG. 1 is a flow chart of a data standard determining method in one embodiment;
FIG. 2 is a flow chart of a data standard determining method in another embodiment;
FIG. 3 is a schematic diagram of the internal devices of the terminal in one embodiment;
FIG. 4 is a schematic diagram of the operation units included in the intelligent standard-mapping device in one embodiment;
FIG. 5 is a block diagram showing the structure of a data standard determining apparatus in one embodiment;
FIG. 6 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
In one embodiment, as shown in fig. 1, a data standard determining method is provided, and the method is applied to a terminal for illustration in this embodiment, and includes the following steps:
Step S101, obtaining data to be processed.
In this step, the data to be processed is data whose data standard is to be determined, such as a data item (field).
Specifically, the terminal acquires data to be processed.
Step S102, determining characteristic data corresponding to the data to be processed.
In this step, the feature data may be the table name, field Chinese name, field English name, field data type, field length, field precision, and data standard Chinese name of the data to be processed.
Specifically, the terminal determines feature data corresponding to the data to be processed according to the data to be processed.
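As a minimal sketch of what the feature data of step S102 could look like in practice, the following Python snippet groups the listed attributes into one record and serializes them into a single short text. The class name, attribute names, separator and sample values are all hypothetical and are not taken from the application; the data standard Chinese name is left out here because the application example later treats it as the output of the model.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class FieldFeatures:
    """Feature data for one data item (field); attribute names are hypothetical."""
    table_name: str
    field_cn_name: str
    field_en_name: str
    data_type: str
    length: int
    precision: int

    def to_model_input(self, sep: str = " | ") -> str:
        """Serialize the features, in declaration order, into one short text."""
        return sep.join(str(value) for value in asdict(self).values())


# Invented sample field, used only to show the shape of the record.
features = FieldFeatures("t_customer", "客户名称", "cust_name", "VARCHAR", 60, 0)
model_input = features.to_model_input()
```

Joining the attributes into one string is only one possible encoding; it is chosen here because the application example describes the task as short-text classification.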
Step S103, inputting the characteristic data into a pre-trained data standard recognition model, and determining a candidate data standard of the data to be processed and the matching degree between the data to be processed and the candidate data standard from pre-stored data standards through the data standard recognition model.
In this step, the data standard recognition model may be a BERT (Bidirectional Encoder Representations from Transformers) language model; a data standard may be the data standard with which each data item (field) needs to be associated when a table structure is added or modified, for example at project initiation or during the system development stage; the matching degree may be expressed as a percentage.
Specifically, the terminal inputs the characteristic data into a pre-trained data standard recognition model, the data standard recognition model determines a candidate data standard of the data to be processed and the matching degree between the candidate data standard and the data to be processed from pre-stored data standards, and the terminal obtains the matching degree corresponding to the candidate data standard and the candidate data standard through the data standard recognition model.
Step S104, determining a target data standard of the data to be processed from the candidate data standards according to the matching degree.
In this step, the target data criterion may be the data criterion that best matches the data to be processed.
Specifically, the terminal determines a target data standard of the data to be processed from the candidate data standards according to the matching degree corresponding to the candidate data standards.
In the above data standard determining method, data to be processed is acquired, the data to be processed being data whose data standard is to be determined; characteristic data corresponding to the data to be processed is determined; the characteristic data is input into a pre-trained data standard recognition model; candidate data standards of the data to be processed and the matching degrees between the data to be processed and the candidate data standards are determined from pre-stored data standards through the data standard recognition model; and the target data standard of the data to be processed is determined from the candidate data standards according to the matching degrees. Because the data standard recognition model determines the candidate data standards and the matching degree of each candidate from the massive pre-stored data standards, and the target data standard is then determined from those candidates according to the matching degrees, the efficiency and accuracy of determining the target data standard of the data are improved.
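The flow of steps S103 and S104 can be illustrated with a small Python sketch. It is not the pre-trained data standard recognition model: a plain string-similarity score from the standard library (`difflib.SequenceMatcher`) is swapped in as a stand-in scorer so that the example runs on its own. The list of stored standards, the function names and the `top_k` parameter are assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical pre-stored data standards (names invented for illustration).
STORED_STANDARDS = ["customer name", "customer number", "account balance", "transaction date"]


def matching_degree(feature_text: str, standard: str) -> float:
    """Stand-in scorer returning a percentage; NOT the pre-trained model."""
    return 100.0 * SequenceMatcher(None, feature_text.lower(), standard.lower()).ratio()


def candidate_standards(feature_text, standards=STORED_STANDARDS, top_k=3):
    """Step S103: score every stored standard and keep the top_k candidates."""
    scored = [(standard, matching_degree(feature_text, standard)) for standard in standards]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


def target_standard(candidates):
    """Step S104: pick the candidate with the highest matching degree."""
    return max(candidates, key=lambda pair: pair[1])[0]


candidates = candidate_standards("customer name")
best = target_standard(candidates)
```

In a real deployment the scorer would be replaced by the trained model's output; the ranking and selection logic around it would stay the same.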
In one embodiment, the determining, according to the matching degree, the target data standard of the data to be processed from the candidate data standards in step S104 specifically includes: receiving a determination instruction for candidate data criteria; and determining target data standards of the data to be processed from the candidate data standards according to the determining instruction.
In this embodiment, the determining instruction is an instruction triggered based on the matching degree, for example, an instruction triggered by the user to select the candidate data standard according to the matching degree.
Specifically, after determining candidate data standards of data to be processed and matching degrees corresponding to the candidate data standards, the terminal displays the candidate data standards and the corresponding matching degrees for reference of a user, the user can trigger a determining instruction for selecting the candidate data standards according to the candidate data standards and the corresponding matching degrees, and the terminal receives and responds to the determining instruction and determines a target data standard of the data to be processed from the candidate data standards according to the determining instruction.
According to the technical scheme of the embodiment, the target data standard of the data to be processed is determined from the candidate data standards according to the determining instruction, so that the accuracy of determining the target data standard of the data is improved.
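The interaction described in this embodiment might be sketched as follows, under the assumption that the determining instruction carries a 1-based index into the displayed list. Both function names and the display format are hypothetical.

```python
def format_candidates(candidates):
    """Rank candidates by matching degree and render them for display."""
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    lines = [f"{i}. {name} ({score:.1f}%)" for i, (name, score) in enumerate(ranked, start=1)]
    return lines, ranked


def apply_determination(ranked, selected_index):
    """Resolve a determining instruction (assumed 1-based choice) to a target standard."""
    if not 1 <= selected_index <= len(ranked):
        raise ValueError("selection is outside the displayed candidates")
    return ranked[selected_index - 1][0]
```

For example, `format_candidates([("a", 62.5), ("b", 91.0)])` lists `b` first, and a user choosing entry 2 would select `a` even though its matching degree is lower, which is the point of leaving the final decision to the user.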
In one embodiment, the determining, according to the matching degree, the target data standard of the data to be processed from the candidate data standards in step S104 specifically includes: determining candidate data standards with matching degree meeting the preset matching degree condition from the candidate data standards, and taking the candidate data standards as candidate data standards to be selected; and determining the target data standard of the data to be processed from the candidate data standards to be selected according to the matching degree of the candidate data standards to be selected.
In this embodiment, the preset matching degree condition may be a condition that the matching degree is greater than a preset matching degree threshold, for example, a condition that the matching degree is greater than 80%.
Specifically, after determining the candidate data standards of the data to be processed and the matching degree corresponding to each candidate data standard, the terminal determines, from the candidate data standards, those whose matching degree meets the preset matching degree condition, takes them as the candidate data standards to be selected, and may display them. According to the matching degrees of the candidate data standards to be selected, the terminal may then select the one with the highest matching degree as the target data standard of the data to be processed.
According to the technical scheme, the candidate data standard to be selected is determined, and then the target data standard of the data to be processed is determined from the candidate data standard to be selected, so that the accuracy of determining the target data standard of the data is improved.
In one embodiment, the determining, in step S104, the target data standard of the data to be processed from the candidate data standards specifically includes: under the condition that at least one matching degree meets the preset matching degree condition, determining a target data standard of the data to be processed from the candidate data standards; the method can also determine that the data to be processed has no corresponding target data standard through the following steps: and under the condition that the matching degree does not meet the preset matching degree condition, determining that the data to be processed has no corresponding target data standard.
Specifically, after determining candidate data standards of data to be processed and matching degrees corresponding to the candidate data standards, the terminal judges whether the matching degrees corresponding to the candidate data standards meet preset matching degree conditions, determines target data standards of the data to be processed from the candidate data standards under the condition that at least one matching degree meets the preset matching degree conditions, and determines that the data to be processed does not have the corresponding target data standards under the condition that the matching degrees do not meet the preset matching degree conditions.
According to the technical scheme of this embodiment, not all data to be processed has a corresponding target data standard; for the part that does not, the absence of a corresponding target data standard needs to be identified. The terminal determines that the data to be processed has no corresponding target data standard when none of the matching degrees meets the preset matching degree condition, so such data can be identified, which improves the accuracy and efficiency of determining whether data has a corresponding target data standard.
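The threshold-based embodiments can be summarized in a short Python sketch: shortlist the candidates whose matching degree exceeds the threshold, return the best of them, and return `None` when nothing qualifies. The function name and the dictionary representation are assumptions; the 80% default follows the example threshold given in the text.

```python
from typing import Dict, Optional


def select_target(candidates: Dict[str, float], threshold: float = 80.0) -> Optional[str]:
    """Return the best candidate above the threshold, or None if there is none."""
    # Candidate data standards to be selected: matching degree strictly above the threshold.
    shortlisted = {name: score for name, score in candidates.items() if score > threshold}
    if not shortlisted:
        return None  # the data to be processed has no corresponding target data standard
    return max(shortlisted, key=shortlisted.get)
```

Returning `None` rather than the weakest available candidate keeps fields without a genuine standard from being mapped by force.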
In one embodiment, the pre-trained data standard recognition model is trained by the following method, specifically including: acquiring sample data and a real data standard of the sample data; dividing sample data and real data standards of the sample data to obtain a training sample set and a verification sample set; training the data standard recognition model to be trained by using the training sample set to obtain a trained data standard recognition model; verifying the trained data standard recognition model by using a verification sample set to obtain a verification result; and under the condition that the verification result is qualified, determining the trained data standard recognition model as a pre-trained data standard recognition model.
In this embodiment, the sample data may be sample data to be processed for training; the real data standard of the sample data may be a real target data standard corresponding to the sample data; the training sample set may be a portion of sample data and a true data standard for the portion of sample data; the validation sample set may be another portion of sample data and a true data standard for that portion of sample data; the verification result can be the accuracy, the precision and/or the recall rate used for representing the trained data standard identification model and can be expressed in percentage; the verification pass may be that the verification result is greater than a preset pass threshold.
Specifically, the terminal acquires the sample data and the real data standards of the sample data, and randomly divides them in a ratio such as 7:3 to obtain a training sample set and a verification sample set. The terminal trains the data standard recognition model to be trained by using the training sample set to obtain a trained data standard recognition model, and verifies the trained model by using the verification sample set to obtain a verification result. If the verification result is qualified, the trained data standard recognition model is determined as the pre-trained data standard recognition model; if the verification result is unqualified, the terminal repeats the training with the training sample set until the verification result is qualified.
According to the technical scheme, the training sample set is utilized to train the data standard recognition model, and the verification sample set is utilized to verify the trained data standard recognition model, so that the more accurate pre-trained data standard recognition model is obtained, and the accuracy of determining the target data standard of the data is improved.
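The train-then-verify loop of this embodiment can be sketched in Python as below. The sketch assumes each sample is a `(features, real_standard)` pair, uses accuracy as the verification result, and takes the actual training routine as a caller-supplied `train_once` function that returns a prediction function. The names, the 0.9 pass threshold and the `max_rounds` cap are assumptions added so that the loop always terminates; they do not come from the application.

```python
import random


def split_samples(samples, train_ratio=0.7, seed=0):
    """Randomly divide samples into training and verification sets (7:3 by default)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def accuracy(predict, validation_set):
    """Share of verification samples whose predicted standard equals the real one."""
    if not validation_set:
        raise ValueError("verification sample set is empty")
    correct = sum(1 for features, real in validation_set if predict(features) == real)
    return correct / len(validation_set)


def train_until_qualified(train_once, train_set, validation_set,
                          pass_threshold=0.9, max_rounds=10):
    """Repeat training until the verification result exceeds the pass threshold."""
    for _ in range(max_rounds):
        predict = train_once(train_set)  # hypothetical training routine
        score = accuracy(predict, validation_set)
        if score > pass_threshold:
            return predict, score
    raise RuntimeError("verification result never reached the pass threshold")
```

Precision or recall could be substituted for accuracy in the same place, since the text allows any of the three as the verification result.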
In one embodiment, the acquiring the data to be processed in step S101 specifically includes: acquiring data to be preprocessed; performing data cleaning treatment on the data to be preprocessed to obtain the data to be preprocessed after the data cleaning treatment; and carrying out data conversion processing on the data to be preprocessed after the data cleaning processing to obtain the data to be processed.
In this embodiment, the data to be preprocessed is data that has not yet been preprocessed. The data cleaning processing may refer to resolving data inconsistency by supplementing missing values, smoothing noisy data, deleting outliers, and the like. For example, the data cleaning processing should follow these rules: the mapping relations of the data standards must be up to date, so that retired standards and non-existent standards are avoided; text content must not contain special symbols such as punctuation marks; data to be processed that is missing important information must be deleted or supplemented (for example, a record whose field data type is missing cannot be supplemented and is deleted, whereas a missing field Chinese name can be supplemented according to the field English name); cleaning rules are added according to business meaning and requirements; and spare (reserved) fields need not be taken as data to be processed. The data transformation processing may transform data from one form into another that is better suited to mining and modeling; this step takes the cleaned data set as input and produces the transformed data set. The data transformation processing includes the following three types. Normalization: scaling attribute data so that it falls within a specific interval, by methods including min-max normalization, Z-score normalization, decimal scaling normalization, and the like. Discretization: replacing continuous/numerical data with interval labels or conceptual labels, by methods including binning discretization, clustering discretization, and the like, where binning is divided into equal-width binning, equal-frequency binning, optimal binning, and the like. Attribute construction: building new attributes from the given attributes and adding them to the data set.
Specifically, the terminal acquires data to be preprocessed, performs data cleaning processing on the data to be preprocessed to obtain the data to be preprocessed after the data cleaning processing, and performs data transformation processing on the data to be preprocessed after the data cleaning processing to obtain the data to be processed.
According to the technical scheme, the data is subjected to data cleaning and data transformation, so that the data to be processed meeting the format requirements can be obtained, and the accuracy of the target data standard of the determined data can be improved.
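A few of the cleaning rules and transformation methods named in this embodiment can be sketched in plain Python. The record keys (`data_type`, `field_cn_name`, `field_en_name`) and all function names are hypothetical; the sketch covers only one cleaning rule set, min-max normalization, decimal scaling normalization and equal-width binning, and is not a complete preprocessing pipeline.

```python
import re


def clean_record(record):
    """Return a cleaned copy of a field record, or None if it must be dropped."""
    if not record.get("data_type"):
        return None  # a missing field data type cannot be supplemented
    cleaned = dict(record)
    if not cleaned.get("field_cn_name"):
        # Supplement a missing Chinese name from the English name.
        cleaned["field_cn_name"] = cleaned.get("field_en_name", "")
    # Strip punctuation and other special symbols from the text content.
    cleaned["field_cn_name"] = re.sub(r"[^\w\s]", "", cleaned["field_cn_name"])
    return cleaned


def min_max_normalize(values, low=0.0, high=1.0):
    """Scale values linearly so that they fall within [low, high]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [low for _ in values]
    return [low + (v - lo) * (high - low) / (hi - lo) for v in values]


def decimal_scale_normalize(values):
    """Divide by the smallest power of 10 that brings every |value| below 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / (10 ** j) >= 1:
        j += 1
    return [v / (10 ** j) for v in values]


def equal_width_bins(values, n_bins):
    """Replace each value with the index of its equal-width interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    return [min(int((v - lo) / width), n_bins - 1) for v in values]
```

For instance, `min_max_normalize([10, 20, 30])` maps the values onto `[0.0, 0.5, 1.0]`, and `equal_width_bins([0, 5, 10], 2)` assigns the interval labels `[0, 1, 1]`.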
The data standard determining method provided in the present application is described below through an embodiment, taking the application of the method to a terminal as an example. The main steps include:
First step, the terminal acquires sample data and the real data standards of the sample data.
Second step, the terminal divides the sample data and the real data standards of the sample data to obtain a training sample set and a verification sample set.
Third step, the terminal trains the data standard recognition model to be trained by using the training sample set to obtain a trained data standard recognition model.
Fourth step, the terminal verifies the trained data standard recognition model by using the verification sample set to obtain a verification result.
Fifth step, in the case that the verification result is qualified, the terminal determines the trained data standard recognition model as the pre-trained data standard recognition model.
Sixth step, the terminal acquires the data to be preprocessed.
Seventh step, the terminal performs data cleaning processing on the data to be preprocessed to obtain the data to be preprocessed after the data cleaning processing.
Eighth step, the terminal performs data transformation processing on the data to be preprocessed after the data cleaning processing to obtain the data to be processed.
Ninth step, the terminal determines the characteristic data corresponding to the data to be processed.
Tenth step, the terminal inputs the characteristic data into the pre-trained data standard recognition model, and determines, from pre-stored data standards through the data standard recognition model, candidate data standards of the data to be processed and the matching degrees between the data to be processed and the candidate data standards.
Eleventh step, the terminal receives a determining instruction for the candidate data standards and determines the target data standard of the data to be processed from the candidate data standards according to the determining instruction; or the terminal determines, from the candidate data standards, the candidate data standards whose matching degree meets the preset matching degree condition as candidate data standards to be selected, determines the target data standard of the data to be processed from the candidate data standards to be selected according to their matching degrees when at least one matching degree meets the preset matching degree condition, and determines that the data to be processed has no corresponding target data standard when none of the matching degrees meets the preset matching degree condition.
Here, the data to be processed is data whose data standard is to be determined; the determining instruction is an instruction triggered based on the matching degree.
According to the technical scheme, the to-be-processed data are obtained, the feature data corresponding to the to-be-processed data are determined according to the to-be-processed data, the feature data are input into the data standard identification model, the candidate data standard of the to-be-processed data and the matching degree corresponding to each candidate data standard are determined from the prestored mass data standards through the data standard identification model, and the target data standard of the to-be-processed data is determined from the candidate data standards according to the matching degree, so that the efficiency and the accuracy of determining the target data standard of the data are improved.
The data standard determining method provided by the present application is described below through an application example in which the method is applied to a terminal. As shown in fig. 2, the main steps include:
First, the terminal determines the feature indexes used by the data standard intelligent through-mark model (that is, the model that maps table fields to data standards): the terminal takes the table name, the field Chinese name, the field English name, the field data type, the field length and the field precision as six influence-factor feature indexes, which can be adjusted and supplemented later as needed. The main role of these influence factors is to select, from more than 3000 data standards, the data standard most suitable for through-marking; accordingly, the data standard Chinese name is taken as the output feature index of the model.
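As a non-limiting illustration, the six feature indexes of one field might be held in a record and serialized into a single short text for a text classification model. The class `FieldRecord`, the function `to_model_text`, the separator string and the sample values are hypothetical; the application does not specify how the six indexes are combined into a model input.

```python
from dataclasses import dataclass


@dataclass
class FieldRecord:
    """The six influence-factor feature indexes of one table field."""
    table_name: str
    field_cn_name: str
    field_en_name: str
    data_type: str
    length: int
    precision: int


def to_model_text(record: FieldRecord, sep: str = " [SEP] ") -> str:
    """Join the six feature indexes into one short text (assumed encoding)."""
    parts = [
        record.table_name,
        record.field_cn_name,
        record.field_en_name,
        record.data_type,
        str(record.length),
        str(record.precision),
    ]
    return sep.join(part.strip() for part in parts)


# Hypothetical sample field, for illustration only.
example = FieldRecord("CUST_INFO", "客户名称", "CUST_NAME", "VARCHAR", 60, 0)
print(to_model_text(example))
```

Keeping the indexes in a typed record makes it straightforward to add further indexes later, as the first step allows.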
Second, the terminal processes the existing (stock) through-mark data to form a feature index sample set: based on the metadata management system, the terminal takes the stock records of data through-marking, combines them with related information such as the table name, the field Chinese name and the field English name, processes each stock through-mark record into the values of the corresponding feature indexes, and uses the resulting set of index records as the sample set for the subsequent steps.
Third, the terminal divides the samples (training set, verification set): the terminal splits the index sample set formed in the second step at a ratio of 7:3 into a training sample set and a verification sample set, which are used in the subsequent construction of the intelligent model.
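As a non-limiting illustration, the 7:3 division of the sample set might be sketched as follows. The function name `split_samples`, the random shuffle before splitting and the fixed seed are assumptions of this sketch; the application states only the ratio.

```python
import random
from typing import Sequence, TypeVar

T = TypeVar("T")


def split_samples(
    samples: Sequence[T],
    train_ratio: float = 0.7,
    seed: int = 42,  # fixed seed so that the split is reproducible
) -> tuple[list[T], list[T]]:
    """Shuffle the samples and split them into training and verification sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```

Because the sample counts per standard may be long-tailed, a split stratified by data standard could be preferable in practice; that refinement is not shown here.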
Fourth, the terminal determines the model algorithm: the data standard intelligent through-mark model built in the terminal analyzes and predicts the data standard to be selected in subsequent through-marking based on historical through-mark records and related information. This is essentially a supervised classification problem and, because text content is involved, a short-text multi-class classification problem. A standard automatic mapping model is therefore established based on deep-learning natural language processing technology: after an appropriate amount of manual mapping (to guarantee business accuracy), the mapped table fields and standards are used as the training sample set to train the automatic classification mapping model. Based on past practical experience and theory, the BERT algorithm can be considered for this scenario. The BERT (Bidirectional Encoder Representations from Transformers) model is a two-stage NLP (Natural Language Processing) model: in the first stage, a language model is pre-trained on a massive external corpus; in the second stage, the pre-trained language model is adapted to a specific downstream NLP task through transfer learning (fine-tuning).
Fifth, the terminal performs model training, verification and tuning: using the training sample set and the verification sample set generated in the third step, the terminal trains, verifies and tunes the model, and after multiple iterations determines a data standard intelligent through-mark model that meets actual use requirements. Tuning mainly optimizes the learning effect of the network by adjusting parameters such as the learning rate and the penalty weights in the model network, and the model is trained iteratively until it gradually reaches the expected effect. The model is then tested with the verification sample set, and whether it needs further optimization is judged by checking whether three evaluation indexes computed between the model output values and the actual sample values reach the standard: accuracy > 80%, precision > 80% and recall > 80%.
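As a non-limiting illustration, the three evaluation indexes and the 80% check might be computed as follows. The function names `evaluate` and `reaches_standard`, the use of `None` to mark "no corresponding standard", and the reading of precision and recall as being counted over fields that have a standard are assumptions of this sketch.

```python
from typing import Optional, Sequence


def evaluate(
    actual: Sequence[Optional[str]],
    predicted: Sequence[Optional[str]],
) -> dict[str, float]:
    """Compute accuracy, precision and recall; None means 'no standard'."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and of equal length")
    pairs = list(zip(actual, predicted))
    # Correct judgments over the whole verification set (including 'no standard').
    correct_all = sum(a == p for a, p in pairs)
    # Correct judgments among the fields the model mapped to some standard.
    correct_mapped = sum(a == p for a, p in pairs if p is not None)
    judged_mapped = sum(p is not None for p in predicted)
    truly_mapped = sum(a is not None for a in actual)
    return {
        "accuracy": correct_all / len(pairs),
        "precision": correct_mapped / judged_mapped if judged_mapped else 0.0,
        "recall": correct_mapped / truly_mapped if truly_mapped else 0.0,
    }


def reaches_standard(metrics: dict[str, float], minimum: float = 0.8) -> bool:
    """True only if every index is strictly above the minimum (80% here)."""
    return all(value > minimum for value in metrics.values())
```

If `reaches_standard` returns False, the model would go through another round of tuning before deployment.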
Sixth, the terminal deploys the model to provide service: the terminal deploys and releases the data standard intelligent through-mark model trained and tuned in the fifth step, and provides a real-time calling service through an online interface.
Seventh, the terminal performs the data through-mark operation according to the model recommendation: the terminal obtains the related information entered during table structure maintenance, such as the table name, field Chinese name, field English name, field data type, field length and field precision; when the data is through-marked, the model calculates recommended data standard Chinese names from this information; finally, after receiving the user's instruction selecting a data standard, the terminal adds the selection result to the stock sample set for continuous optimization and improvement of the data standard intelligent through-mark model.
As shown in fig. 3, the terminal may include a data standard management system, a data standard knowledge base management device, an intelligent through-mark model construction device and an intelligent through-mark device.
The data standard management system relies on an existing data standard management system, for example that of a bank. A user can search, query and manage data standards through this system, and the list of data standards to be mapped is defined through it; this standard list determines the range of standards that can be mapped when the final model is applied.
The data standard knowledge base management device provides centralized registration and management of the data resources and software assets used in business system development, and supplies the historical data relating data standards to fields, recorded whenever a business system creates or modifies a table structure, as training samples for model construction, which ensures the accuracy of the historical data.
The intelligent through-mark model construction device builds a multi-class machine learning model for natural language processing based on the BERT algorithm; after the intelligent through-mark model has been trained and verified, it is acquired and loaded by the intelligent through-mark device for use. The operation units contained in the intelligent through-mark device are shown in fig. 4 and may include a modeling initiation operation unit, a data processing operation unit, a model training operation unit and a model evaluation operation unit.
The modeling initiation operation unit covers two parts, index selection and data set construction. Index selection (feature selection): the mapping of a data standard depends on information about the data item (field); based on expert experience, the following can be used as indexes for modeling (Table 1 below):
TABLE 1
[Table image in the original publication: Figure BDA0003966943530000121]
Table name, field Chinese name, field English name, field data type, field length, field precision, and data standard Chinese name.
Data set construction: the modeling data set comprises a training set and a test set, and each record of the data set contains at least seven variables: the data standard Chinese name is the target variable (the "Y variable"), and the table name, field Chinese name, field English name, field data type, field length and field precision are the independent variables (the "X variables"); the data structure is shown in Table 1. The following requirements should also be considered when constructing the data set.
Data set sample diversity: the data set needs to contain all data standards to be mapped, and the samples should cover various subjects and multiple business fields to improve the generalization performance of the model. In addition, some fields have no corresponding data standard, so the data set needs to contain a portion of samples without a corresponding standard; the target variable of such samples can be assigned a specific number to indicate that no corresponding standard exists, and for such fields no standard needs to be recommended during subsequent intelligent through-mark recommendation.
Data set sample magnitude and class balance: in theory, the ideal condition for training the model is that each standard has an adequate number of samples and that the numbers are of similar magnitude. In practice, however, the sample counts per standard are likely to follow a long-tail distribution (in reality some standards correspond to only a few fields), which causes class imbalance. If so, the following two rules should be followed when preparing the data set samples. For samples with a standard mapping, the number of samples of each standard is counted automatically, weight rule parameters are set according to the distribution of the sample counts, a penalty weight is obtained for each standard, and the sample distribution is adjusted automatically according to the penalty weights to keep the classes balanced. For samples without a standard mapping, the quantity is prepared according to the overall real proportion; for example, if the ratio of fields with standards to fields without standards is approximately 3:7, the recommended ratio of samples with standards to samples without standards in the data set is also 3:7, and so on.
The data processing operation unit performs data cleaning and data transformation on the data set, ensuring the effect of the intelligent through-mark model by improving the data quality of the modeling data set.
The model training operation unit trains the BERT language model, performs learning and training for the specific downstream NLP task based on the trained language model, and optimizes the learning effect of the network by adjusting the learning rate and penalty weight parameters in the model network.
The model evaluation operation unit evaluates the model. For the model effect, the ROC curve (receiver operating characteristic curve) and the following indexes can be considered: accuracy = (number of correct judgments / total number of records in the test set); precision = (number of correct judgments / total number of table fields that the model judges to have a corresponding data standard); recall = (number of correct judgments / total number of table fields that actually have a corresponding standard). A good model finds as many of the fields that can be mapped to a data standard as possible (high recall) and maps them as accurately as possible (high precision). These indexes are usually difficult to maximize at the same time, so the threshold and the parameters of the model can be adjusted according to the index that the business side values most in order to optimize the model result.
The intelligent through-mark device connects the intelligent through-mark model with the data standard management system. The input of the model is the related information of the data item to be through-marked (see Table 1), and the output of the model is several possible corresponding standards: if the score (matching degree) of the Top1 (highest-scoring) standard is below the threshold, a null value is output, indicating that there is no matching standard; otherwise, the Top3 (three highest-scoring) standards whose scores (matching degrees) are above the threshold are output, and Top1 is usually the standard that should finally be matched. (Note: the threshold can be understood as a confidence or matching-degree threshold. For example, with the threshold set to 80%, if the confidence that a field is associated with standard A is 95% and the confidence that it is associated with standard B is 75%, the model recommends associating standard A; if the confidence for every standard is below the threshold, the model recommends associating no data standard.)
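Returning to the class-balance rule for samples with a standard mapping described above, the per-standard penalty weight might, as a non-limiting illustration, be derived from the sample counts as follows. The function name `penalty_weights` and the inverse-frequency formula (total samples divided by the number of standards times the count of that standard) are assumptions of this sketch; the application states only that the weights are set according to the distribution of the sample counts.

```python
from collections import Counter
from typing import Hashable, Sequence


def penalty_weights(labels: Sequence[Hashable]) -> dict[Hashable, float]:
    """Give rarer standards a larger weight (assumed inverse-frequency rule)."""
    if not labels:
        raise ValueError("labels must not be empty")
    # Count how many samples each data standard owns.
    counts = Counter(labels)
    total = len(labels)
    n_classes = len(counts)
    # A standard with an average number of samples gets weight 1.0; standards
    # in the long tail get weights above 1.0, frequent ones below 1.0.
    return {label: total / (n_classes * count) for label, count in counts.items()}
```

Such weights could be applied in the training loss or used to resample the data set; samples without a standard mapping are deliberately left out of this sketch because they are prepared according to the real overall proportion.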
According to the technical scheme of this application example, the accuracy of data standard through-mark results is ensured while the efficiency of data standard through-marking is improved; by establishing the intelligent mapping model, the standards are implemented down to the system, table and field levels, the waste of human resources is reduced, and the efficiency and accuracy of determining the target data standard of data are improved.
It should be understood that, although the steps in the flowcharts of the embodiments described above are shown in the sequence indicated by the arrows, these steps are not necessarily performed in that sequence. Unless explicitly stated herein, the execution order of the steps is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; these sub-steps or stages are also not necessarily performed in sequence, but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Based on the same inventive concept, the embodiments of the present application also provide a data standard determining device for implementing the data standard determining method described above. The solution provided by the device is implemented in a manner similar to that described for the method, so for the specific limitations in the one or more embodiments of the data standard determining device provided below, reference may be made to the limitations of the data standard determining method above, which are not repeated here.
In one embodiment, as shown in fig. 5, a data standard determining apparatus is provided, and the apparatus 500 may include:
a data acquisition module 501, configured to acquire data to be processed; the data to be processed is data whose data standard is to be determined;
a data determining module 502, configured to determine feature data corresponding to the data to be processed;
a matching degree determining module 503, configured to input the feature data into a pre-trained data standard recognition model, and determine, from pre-stored data standards, a candidate data standard of the data to be processed and a matching degree between the data to be processed and the candidate data standard through the data standard recognition model;
and the standard determining module 504 is configured to determine a target data standard of the data to be processed from the candidate data standards according to the matching degree.
In one embodiment, the standard determining module 504 is further configured to receive a determining instruction for the candidate data standards, the determining instruction being an instruction triggered based on the matching degree; and to determine the target data standard of the data to be processed from the candidate data standards according to the determining instruction.
In one embodiment, the standard determining module 504 is further configured to determine, from the candidate data standards, a candidate data standard with a matching degree satisfying a preset matching degree condition, as a candidate data standard to be selected; and determining the target data standard of the data to be processed from the candidate data standards to be selected according to the matching degree of the candidate data standards to be selected.
In one embodiment, the standard determining module 504 is further configured to determine the target data standard of the data to be processed from the candidate data standards if at least one of the matching degrees satisfies a preset matching degree condition; the apparatus 500 further comprises: a condition-unsatisfied module, configured to determine that the data to be processed has no corresponding target data standard if none of the matching degrees satisfies the preset matching degree condition.
In one embodiment, the apparatus 500 further comprises: the model training module is used for acquiring sample data and real data standards of the sample data; dividing the sample data and the real data standard of the sample data to obtain a training sample set and a verification sample set; training the data standard recognition model to be trained by using the training sample set to obtain a trained data standard recognition model; verifying the trained data standard recognition model by using the verification sample set to obtain a verification result; and under the condition that the verification result is qualified in verification, determining the trained data standard recognition model as the pre-trained data standard recognition model.
In one embodiment, the data acquisition module 501 is further configured to acquire data to be preprocessed; perform data cleaning processing on the data to be preprocessed to obtain cleaned data to be preprocessed; and perform data conversion processing on the cleaned data to be preprocessed to obtain the data to be processed.
The modules in the above data standard determining apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in, or be independent of, a processor in the computer device in hardware form, or may be stored in a memory in the computer device in software form, so that the processor can call and execute the operations corresponding to the above modules.
It should be noted that the data standard determining method and device provided by the present application can be used for data standard determination in the financial field, and can also be used for data standard determination in any field other than the financial field; the application field of the method and device is not limited.
In one embodiment, a computer device is provided, which may be a terminal, and the internal structure of which may be as shown in fig. 6. The computer device includes a processor, a memory, an input/output interface, a communication interface, a display unit, and an input means. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface, the display unit and the input device are connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The input/output interface of the computer device is used to exchange information between the processor and the external device. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a data standard determination method. The display unit of the computer device is used for forming a visual picture, and can be a display screen, a projection device or a virtual reality imaging device. The display screen can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be a key, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
It will be appreciated by those skilled in the art that the structure shown in fig. 6 is merely a block diagram of some of the structures associated with the present application and is not limiting of the computer device to which the present application may be applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In an embodiment, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, carries out the steps of the method embodiments described above.
In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
It should be noted that, the user information (including, but not limited to, user equipment information, user personal information, etc.) and the data (including, but not limited to, data for analysis, stored data, presented data, etc.) referred to in the present application are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data are required to comply with the related laws and regulations and standards of the related countries and regions.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the various embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, high density embedded nonvolatile Memory, resistive random access Memory (ReRAM), magnetic random access Memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric Memory (Ferroelectric Random Access Memory, FRAM), phase change Memory (Phase Change Memory, PCM), graphene Memory, and the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory, and the like. By way of illustration, and not limitation, RAM can be in the form of a variety of forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), and the like. The databases referred to in the various embodiments provided herein may include at least one of relational databases and non-relational databases. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processors referred to in the embodiments provided herein may be general purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, quantum computing-based data processing logic units, etc., without being limited thereto.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of technical features involves no contradiction, it should be considered to fall within the scope of this description.
The above embodiments represent only a few implementations of the present application. They are described in relative detail, but are not to be construed as limiting the scope of the present application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of the present application, and these fall within the scope of protection of the present application. Accordingly, the scope of protection of the present application shall be subject to the appended claims.

Claims (10)

1. A data standard determining method, the method comprising:
acquiring data to be processed; the data to be processed is data whose data standard is to be determined;
determining characteristic data corresponding to the data to be processed;
inputting the characteristic data into a pre-trained data standard recognition model, and determining a candidate data standard of the data to be processed and the matching degree between the data to be processed and the candidate data standard from pre-stored data standards through the data standard recognition model;
and determining the target data standard of the data to be processed from the candidate data standards according to the matching degree.
2. The method according to claim 1, wherein determining the target data standard of the data to be processed from the candidate data standards according to the matching degree comprises:
receiving a determination instruction for the candidate data standard; the determining instruction is an instruction triggered based on the degree of matching;
and determining the target data standard of the data to be processed from the candidate data standards according to the determining instruction.
3. The method according to claim 1, wherein determining the target data standard of the data to be processed from the candidate data standards according to the matching degree comprises:
determining candidate data standards with matching degree meeting the preset matching degree condition from the candidate data standards, and taking the candidate data standards as candidate data standards to be selected;
and determining the target data standard of the data to be processed from the candidate data standards to be selected according to the matching degree of the candidate data standards to be selected.
4. The method of claim 1, wherein determining the target data standard of the data to be processed from the candidate data standards comprises:
under the condition that at least one matching degree meets a preset matching degree condition, determining the target data standard of the data to be processed from the candidate data standards;
the method further comprises the steps of:
and under the condition that the matching degree does not meet the preset matching degree condition, determining that the data to be processed does not have the corresponding target data standard.
5. The method of claim 1, wherein the pre-trained data standard recognition model is trained by:
acquiring sample data and a real data standard of the sample data;
dividing the sample data and the real data standard of the sample data to obtain a training sample set and a verification sample set;
training the data standard recognition model to be trained by using the training sample set to obtain a trained data standard recognition model;
verifying the trained data standard recognition model by using the verification sample set to obtain a verification result;
and under the condition that the verification result is qualified in verification, determining the trained data standard recognition model as the pre-trained data standard recognition model.
6. The method of claim 1, wherein the acquiring the data to be processed comprises:
acquiring data to be preprocessed;
performing data cleaning treatment on the data to be preprocessed to obtain the data to be preprocessed after the data cleaning treatment;
and carrying out data conversion processing on the data to be preprocessed after the data cleaning processing to obtain the data to be processed.
7. A data standard determining apparatus, the apparatus comprising:
the data acquisition module is used for acquiring data to be processed; the data to be processed is data whose data standard is to be determined;
the data determining module is used for determining characteristic data corresponding to the data to be processed;
the matching degree determining module is used for inputting the characteristic data into a pre-trained data standard recognition model, and determining candidate data standards of the data to be processed and matching degrees between the data to be processed and the candidate data standards from pre-stored data standards through the data standard recognition model;
and the standard determining module is used for determining the target data standard of the data to be processed from the candidate data standards according to the matching degree.
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 6 when the computer program is executed.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.
CN202211499915.2A 2022-11-28 2022-11-28 Data standard determining method, apparatus, device, medium and computer program product Pending CN116304851A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211499915.2A CN116304851A (en) 2022-11-28 2022-11-28 Data standard determining method, apparatus, device, medium and computer program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211499915.2A CN116304851A (en) 2022-11-28 2022-11-28 Data standard determining method, apparatus, device, medium and computer program product

Publications (1)

Publication Number Publication Date
CN116304851A true CN116304851A (en) 2023-06-23

Family

ID=86800152

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211499915.2A Pending CN116304851A (en) 2022-11-28 2022-11-28 Data standard determining method, apparatus, device, medium and computer program product

Country Status (1)

Country Link
CN (1) CN116304851A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination