CN116414815A - Data quality detection method, device, computer equipment and storage medium

Info

Publication number: CN116414815A
Application number: CN202310240043.6A
Authority: CN
Inventors: 陈新辉; 刘映楷; 帅翡芍; 黄泽彬
Original assignee: Industrial and Commercial Bank of China Ltd ICBC
Current assignee: Industrial and Commercial Bank of China Ltd ICBC
Priority date: 2023-03-06
Filing date: 2023-03-06
Publication date: 2023-07-11

Abstract

The application relates to a data quality detection method, a data quality detection device, computer equipment and a storage medium, and relates to the technical field of big data. The method comprises the following steps: determining a field to be detected from the acquired data table to be detected; inputting the field to be detected into a pre-trained field classification model to obtain a candidate quality detection model of the field to be detected and the matching degree between the candidate quality detection model and the field to be detected; determining a target quality detection model from the candidate quality detection models according to the matching degree; and inputting the field to be detected into the target quality detection model to obtain a quality detection result of the data table to be detected. The method can improve the efficiency of data quality detection.

Description

Data quality detection method, device, computer equipment and storage medium

Technical Field

The present invention relates to the field of big data technology, and in particular, to a data quality detection method, apparatus, computer device, storage medium, and computer program product.

Background

With the development of big data technology, using data to enable financial business has gradually become a trend. Big data applied to financial business may suffer from problems such as missing values and garbled characters. Performing data quality detection on the big data allows such problems to be found in time, so that data governance can target them; this ensures data quality, improves the reliability of big data analysis and the accuracy of big data processing, and supports the normal operation of the financial business.

Existing data quality detection generally relies on manual processing and is inefficient. Problems are often found only after the data has been used, which makes timely data governance difficult, and there are further problems such as a long subsequent data maintenance cycle and a cumbersome manual review process.

Therefore, current data quality detection technology suffers from low efficiency.

Disclosure of Invention

In view of the foregoing, it is desirable to provide an efficient data quality detection method, apparatus, computer device, computer-readable storage medium, and computer program product.

In a first aspect, the present application provides a data quality detection method. The method comprises the following steps:

determining a field to be detected from the acquired data table to be detected;

inputting the field to be detected into a pre-trained field classification model to obtain a candidate quality detection model of the field to be detected and the matching degree between the candidate quality detection model and the field to be detected;

determining a target quality detection model from the candidate quality detection models according to the matching degree;

and inputting the field to be detected into the target quality detection model to obtain a quality detection result of the data table to be detected.

In one embodiment, before determining the field to be detected from the acquired data table to be detected, the method further includes:

determining a training sample field and a test sample field from an acquired sample data table;

training the candidate field classification model to be trained through the training sample field to obtain a pre-trained candidate field classification model;

determining the classification accuracy of the pre-trained candidate field classification model according to the test sample field;

and under the condition that the classification accuracy exceeds a preset threshold, determining the pre-trained candidate field classification model as the pre-trained field classification model.

In one embodiment, the determining the training sample field and the test sample field from the acquired sample data table includes:

determining a data table field in the sample data table;

and determining the training sample field from the data table fields, and determining the data table fields except the training sample field as the test sample field.

In one embodiment, the training the candidate field classification model to be trained through the training sample field to obtain a pre-trained candidate field classification model includes:

acquiring a sample label corresponding to the training sample field;

performing grid search processing on the candidate field classification model to be trained to obtain a grid-searched candidate field classification model, wherein the candidate quality detection model that the grid-searched candidate field classification model obtains by classifying the training sample field matches the sample label;

and performing cross-validation processing on the candidate field classification model after grid searching to obtain the pre-trained candidate field classification model.

In one embodiment, the determining the classification accuracy of the pre-trained candidate field classification model according to the test sample field includes:

determining the number of labels of the sample labels corresponding to the test sample field;

inputting the test sample field into the pre-trained candidate field classification model to obtain a candidate quality detection model of the test sample field, and determining the number of models of the candidate quality detection model of the test sample field;

and determining the ratio of the number of models to the number of labels as the classification accuracy.

In one embodiment, the quality detection result includes a data fluctuation check result and a data volume check result; after inputting the field to be detected to the target quality detection model to obtain a quality detection result of the data table to be detected, the method further comprises the following steps:

determining a target area in the data fluctuation check result in response to a selection operation on the data fluctuation check result;

and generating an alarm signal when the data volume check result of the target area is outside a preset range.

In one embodiment, the quality detection result includes a null check result; after inputting the field to be detected to the target quality detection model to obtain a quality detection result of the data table to be detected, the method further comprises the following steps:

determining field missing areas of the data table to be detected according to the null check result;

determining a target area from the field missing areas, and counting the number of field missing areas;

and displaying the target area and the number of field missing areas.
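By way of illustration only, the null check post-processing above might be sketched as follows. The function name `missing_regions`, the representation of a missing area as a row range, and the rule of taking the widest area as the target area are all assumptions made for this sketch; the application does not specify how the target area is chosen.

```python
def missing_regions(values):
    """Group consecutive empty cells into (first_row, last_row) regions."""
    regions, start = [], None
    for i, v in enumerate(values):
        empty = v is None or v == ""
        if empty and start is None:
            start = i                       # a new missing region begins
        elif not empty and start is not None:
            regions.append((start, i - 1))  # the region ended on the previous row
            start = None
    if start is not None:                   # region runs to the end of the column
        regions.append((start, len(values) - 1))
    return regions


column = ["a", "", "", "b", None, "c", ""]
regions = missing_regions(column)

# Assumption: the target area is the widest missing region.
target = max(regions, key=lambda r: r[1] - r[0])
region_count = len(regions)
print(target, region_count)
```

The target area and the count are the two values that the embodiment displays.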

In a second aspect, the present application further provides a data quality detection apparatus. The apparatus comprises:

the field determining module is used for determining a field to be detected from the acquired data table to be detected;

the field identification module is used for inputting the field to be detected into a pre-trained field classification model to obtain a candidate quality detection model of the field to be detected and the matching degree between the candidate quality detection model and the field to be detected;

the model determining module is used for determining a target quality detection model from the candidate quality detection models according to the matching degree;

and the quality detection module is used for inputting the field to be detected into the target quality detection model to obtain a quality detection result of the data table to be detected.

In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor which when executing the computer program performs the steps of:

determining a field to be detected from the acquired data table to be detected;

In a fourth aspect, the present application also provides a computer-readable storage medium. The computer-readable storage medium has stored thereon a computer program which, when executed by a processor, performs the steps of:

determining a field to be detected from the acquired data table to be detected;

In a fifth aspect, the present application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of:

determining a field to be detected from the acquired data table to be detected;

In the data quality detection method, apparatus, computer device, storage medium and computer program product described above, a field to be detected is determined from an acquired data table to be detected; the field to be detected is input into a pre-trained field classification model to obtain candidate quality detection models of the field to be detected and the matching degree between each candidate quality detection model and the field to be detected; a target quality detection model is determined from the candidate quality detection models according to the matching degree; and the field to be detected is input into the target quality detection model to obtain a quality detection result of the data table to be detected. Because the matching degree of each candidate quality detection model is determined together with the candidate quality detection models themselves, a target quality detection model with a higher matching degree to the field to be detected can be selected from the candidates and used to perform data quality detection on the data table to be detected. Manual intervention is thereby avoided and the efficiency of data quality detection is improved.

Drawings

FIG. 1 is a flow chart of a method for detecting data quality in one embodiment;

FIG. 2 is a flow diagram of a training process for a field classification model in one embodiment;

FIG. 3 is a flow chart of a method for detecting data quality in another embodiment;

FIG. 4 is a schematic diagram of a standardized SQL check method set in one embodiment;

FIG. 5 is a flow chart of a training method of the SVM multi-classification model according to an embodiment;

FIG. 6 is a flow chart of a method for detecting data quality in another embodiment;

FIG. 7 is a block diagram showing the structure of a data quality detecting apparatus in one embodiment;

FIG. 8 is an internal structural diagram of a computer device in one embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.

In one embodiment, as shown in FIG. 1, a data quality detection method is provided. This embodiment is described by taking the application of the method to a terminal as an example; it is understood that the method may also be applied to a server, or to a system including the terminal and the server and implemented through interaction between the terminal and the server. In this embodiment, the method includes the following steps:

Step S110, determining a field to be detected from the acquired data table to be detected.

The data table to be detected may be a form requiring data quality detection.

The field to be detected may be a field in a data table to be detected.

In a specific implementation, a data table to be detected can be input to a terminal, and the terminal identifies or performs word segmentation processing on the data table to be detected to obtain at least one field to be detected in the data table to be detected.

In practical application, the terminal can identify text content in the data table to be detected through a neural network, machine learning and other methods to obtain one or more fields to be detected, or perform word segmentation on the text content in the data table to be detected to obtain one or more fields to be detected.
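By way of illustration only, step S110 might be sketched as below for a table supplied as CSV text. The CSV input format, the helper names `fields_to_detect` and `split_field_name`, and the sample column names are assumptions of this sketch; the word segmentation shown is a simple rule-based split of column names, not the neural network or machine learning recognition that the embodiment also permits.

```python
import csv
import io
import re


def split_field_name(name):
    """Split a column name such as 'txnAmount' into lowercase tokens."""
    # Break on underscores, hyphens, and whitespace first.
    parts = re.split(r"[_\-\s]+", name.strip())
    tokens = []
    for part in parts:
        # Then break camelCase / PascalCase runs and digit groups.
        tokens.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [t.lower() for t in tokens if t]


def fields_to_detect(csv_text):
    """Return {column name: list of cell values} for a CSV-formatted table."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    return {name: [row[i] for row in body] for i, name in enumerate(header)}


# Hypothetical table to be detected.
table = "cust_id,txnAmount,amount_ratio\n1001,250.00,0.4\n1002,,0.6\n"
fields = fields_to_detect(table)
tokens = {name: split_field_name(name) for name in fields}
print(tokens)
```

Each column becomes one field to be detected, and its tokens can serve as the input features of the field classification model.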

Step S120, inputting the field to be detected into a pre-trained field classification model to obtain a candidate quality detection model of the field to be detected and the matching degree between the candidate quality detection model and the field to be detected.

The field classification model may be a classification model that classifies the field to be detected to obtain the candidate quality detection model corresponding to the field to be detected, and may specifically be an SVM (Support Vector Machine) model. The SVM model is a supervised learning model in the field of machine learning and is generally used for pattern recognition, classification, and regression analysis.

The candidate quality detection model may be a candidate inspection method. The inspection methods include, but are not limited to, data amount inspection, data type inspection, data encoding inspection, data repetition inspection, null value inspection, and fluctuation inspection.

The matching degree may be the degree of matching between the candidate quality detection model and the field to be detected.

In a specific implementation, each field to be detected may be input into a pre-trained field classification model, the pre-trained field classification model classifies the field to be detected to obtain at least one corresponding candidate quality detection model, and the pre-trained field classification model may further determine a matching degree between each candidate quality detection model and the field to be detected.

In practical application, an SVM model may be trained in advance, and the field to be detected is input into the trained SVM model and classified. Each class corresponds to a set of inspection methods; the set contains all candidate inspection methods for performing data quality detection on the field to be detected, and each candidate inspection method is a candidate quality detection model. For example, when the field to be detected contains "Amount" or "amount", the trained SVM model may classify the field and obtain the corresponding set of inspection methods {data amount check, data type check}. The data amount check and the data type check are then the candidate inspection methods: the data amount check may count the amount of data in the field to be detected, and the data type check may determine the type of data in the field to be detected.

Further, the trained SVM model may also determine the matching degree between each candidate inspection method in the inspection method set and the field to be detected. For example, when the field to be detected contains "amount ratio", the field does not describe a specific amount value and is not suitable for numerical inspection, so the SVM model may determine that the matching degree between the data amount check and the field is 10% and that between the data type check and the field is 20%, both relatively low. When the field to be detected contains an amount together with specific values, the SVM model may determine that the matching degree between the data amount check and the field is 80% and that between the data type check and the field is 90%, both relatively high.

It should be noted that inputting the field to be detected into the pre-trained field classification model may yield the candidate quality detection model of the field to be detected together with a matching degree model corresponding to the candidate quality detection model. The matching degree model may be a preset matching degree calculation formula, and the matching degree between the candidate quality detection model and the field to be detected is then calculated through the matching degree model.
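By way of illustration only, the shape of the output of step S120 might be sketched as below. The rule-based function `classify_field` is a toy stand-in for the pre-trained SVM model, not an SVM; the `Candidate` type and the keyword rules are assumptions of this sketch, and the matching degrees are simply the example values given above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    check: str       # name of the candidate inspection method
    matching: float  # matching degree in [0, 1]


def classify_field(name_tokens):
    """Toy rule-based stand-in for the pre-trained field classification model."""
    tokens = set(name_tokens)
    if "amount" not in tokens:
        return []  # this stand-in only knows about amount-like fields
    if "ratio" in tokens:
        # Not a concrete amount value, so numeric checks match poorly.
        return [Candidate("data amount check", 0.10),
                Candidate("data type check", 0.20)]
    return [Candidate("data amount check", 0.80),
            Candidate("data type check", 0.90)]


for tokens in (["amount", "ratio"], ["txn", "amount"]):
    print(tokens, classify_field(tokens))
```

A trained classifier would replace the keyword rules, but the output (a set of candidate inspection methods, each with a matching degree) would have the same form.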

Step S130, determining a target quality detection model from the candidate quality detection models according to the matching degree.

The target quality detection model may be an inspection method for performing data quality detection on a field to be detected.

In a specific implementation, for at least one candidate quality detection model corresponding to each field to be detected, the candidate quality detection model with the matching degree exceeding the preset threshold value can be determined as the target quality detection model by comparing the matching degree corresponding to each candidate quality detection model with the preset threshold value.

For example, suppose the candidate inspection methods are the data volume check and the data type check, the matching degree between the data volume check and the field to be detected is 10%, the matching degree between the data type check and the field to be detected is 90%, and the matching degree threshold is set to 50%. Only the matching degree of the data type check exceeds the threshold, so the inspection method used for data quality detection of the field to be detected is the data type check; that is, the target quality detection model is the data type check.
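By way of illustration only, the threshold comparison of step S130 might be sketched as below, using the figures from the example above. The function name `select_target_checks` and the dictionary representation of the matching degrees are assumptions of this sketch.

```python
def select_target_checks(matching_degrees, threshold=0.5):
    """Keep the candidate checks whose matching degree exceeds the threshold."""
    return [check for check, degree in matching_degrees.items()
            if degree > threshold]


candidates = {"data volume check": 0.10, "data type check": 0.90}
targets = select_target_checks(candidates, threshold=0.50)
print(targets)  # ['data type check']
```

The comparison is strict, following the wording "exceeding the preset threshold"; a candidate whose matching degree equals the threshold is not selected in this sketch.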

Step S140, inputting the field to be detected into the target quality detection model to obtain a quality detection result of the data table to be detected.

In a specific implementation, after determining a target quality detection model corresponding to each field to be detected in the data table to be detected, each field to be detected can be respectively input into the corresponding target quality detection model, the field to be detected is detected through the target quality detection model to obtain a quality detection result of the field to be detected, and the quality detection results of all the fields to be detected are summarized to obtain a quality detection result of the data table to be detected.

For example, the data table to be detected includes field to be detected 1 and field to be detected 2. After it is determined that the inspection method set corresponding to field 1 is {data type check} and the inspection method set corresponding to field 2 is {data encoding check, data repetition check}, the data type check is performed on field 1, the data encoding check and the data repetition check are performed on field 2, and the results of these checks are summarized to obtain the quality detection result of the data table to be detected.
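By way of illustration only, step S140 for the two-field example above might be sketched as follows. The three check functions are deliberately simple assumptions (numeric parsing, presence of the U+FFFD replacement character, and duplicate counting); the application does not define how each inspection method is implemented, and all names here are hypothetical.

```python
def data_type_check(values):
    """Report whether every non-empty value parses as a number."""
    def is_number(v):
        try:
            float(v)
            return True
        except ValueError:
            return False
    non_empty = [v for v in values if v != ""]
    return {"all_numeric": all(is_number(v) for v in non_empty)}


def data_encoding_check(values):
    """Count values containing U+FFFD, a common sign of garbled text."""
    return {"garbled": sum("\ufffd" in v for v in values)}


def data_repetition_check(values):
    """Count values that repeat an earlier value."""
    return {"duplicates": len(values) - len(set(values))}


CHECKS = {
    "data type check": data_type_check,
    "data encoding check": data_encoding_check,
    "data repetition check": data_repetition_check,
}


def detect_table(table, targets):
    """Run each field's target checks and collect the results per field."""
    return {
        field: {name: CHECKS[name](values) for name in targets.get(field, [])}
        for field, values in table.items()
    }


# Hypothetical table and target inspection method sets.
table = {"field_1": ["12.5", "7", "abc"], "field_2": ["A01", "A02", "A01"]}
targets = {
    "field_1": ["data type check"],
    "field_2": ["data encoding check", "data repetition check"],
}
report = detect_table(table, targets)
print(report)
```

The nested dictionary plays the role of the summarized quality detection result of the data table.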

According to the data quality detection method described above, a field to be detected is determined from an acquired data table to be detected; the field to be detected is input into a pre-trained field classification model to obtain candidate quality detection models of the field to be detected and the matching degree between each candidate quality detection model and the field to be detected; a target quality detection model is determined from the candidate quality detection models according to the matching degree; and the field to be detected is input into the target quality detection model to obtain a quality detection result of the data table to be detected. Because the matching degree of each candidate quality detection model is determined together with the candidate quality detection models themselves, a target quality detection model with a higher matching degree to the field to be detected can be selected from the candidates and used to perform data quality detection on the data table to be detected. Manual intervention is thereby avoided and the efficiency of data quality detection is improved.

In one embodiment, as shown in fig. 2, a training procedure of a field classification model is provided, and before the step S110, the training procedure may specifically further include:

step S102, determining a training sample field and a test sample field from an acquired sample data table;

step S104, training the candidate field classification model to be trained through training sample fields to obtain a pre-trained candidate field classification model;

step S106, determining the classification accuracy of a pre-trained candidate field classification model according to the test sample field;

step S108, when the classification accuracy exceeds a preset threshold, determining the pre-trained candidate field classification model as a pre-trained field classification model.

The sample data table may be a form sample, for example, a plurality of forms that are randomly generated or collected in an actual application.

The training sample field may be a field in the sample data table for performing model training.

The test sample field may be a field in the sample data table used for testing the trained model.

The candidate field classification model may be a model that is a candidate to become the field classification model.

In a specific implementation, the terminal may obtain at least one sample data table, determine at least one sample field for each sample data table, and determine each sample field as either a training sample field or a test sample field. The terminal may further obtain a sample label corresponding to each sample field, where the sample label may be a manually calibrated inspection method set for the sample field; the sample label corresponding to a training sample field is determined as a training sample label, and the sample label corresponding to a test sample field is determined as a test sample label. The training sample field is input into the candidate field classification model to be trained to obtain a classification result of the training sample field, where the classification result may be the inspection method set that the candidate field classification model to be trained identifies for the training sample field. The candidate field classification model to be trained is then trained according to the difference between the training sample label and the classification result of the training sample field, so as to obtain the pre-trained candidate field classification model.

The test sample field is then input into the pre-trained candidate field classification model to obtain a classification result of the test sample field, where the classification result may be the inspection method set that the pre-trained candidate field classification model identifies for the test sample field. The classification accuracy of the pre-trained candidate field classification model is determined according to the test sample label and the classification result of the test sample field, and is compared with a preset threshold. If the classification accuracy exceeds the preset threshold, the pre-trained candidate field classification model is determined as the pre-trained field classification model; otherwise, it is not, and the process returns to step S102 to determine the pre-trained field classification model again.
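By way of illustration only, the control flow of steps S102 to S108 might be sketched as below. The callables `split`, `train` and `evaluate` are hypothetical placeholders for the operations described above, and the `max_rounds` limit is an assumption added so that the sketch always terminates; the application itself only states that the process returns to step S102.

```python
def train_until_accepted(split, train, evaluate, threshold, max_rounds=5):
    """Repeat S102-S106 until the accuracy exceeds the threshold (S108)."""
    for round_no in range(1, max_rounds + 1):
        train_fields, test_fields = split(round_no)  # S102: choose the samples
        model = train(train_fields)                  # S104: train a candidate
        accuracy = evaluate(model, test_fields)      # S106: measure accuracy
        if accuracy > threshold:                     # S108: accept the model
            return model, accuracy, round_no
    return None, None, max_rounds                    # no candidate was accepted


# Stub callables standing in for real splitting, training, and evaluation.
scores = iter([0.70, 0.85, 0.95])
model, accuracy, rounds = train_until_accepted(
    split=lambda r: (["f1", "f2"], ["f3"]),
    train=lambda fields: {"trained_on": tuple(fields)},
    evaluate=lambda m, fields: next(scores),
    threshold=0.90,
)
print(model, accuracy, rounds)
```

With the stub accuracies above, the first two candidates are rejected and the third is accepted.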

For example, fields may be collected from a plurality of data tables as sample fields, an inspection method set may be assigned to each sample field according to its data type, Chinese name, English name, and data structure, and the inspection method set may be used as the sample label of the sample field. All sample fields are randomly divided into a training sample set and a test sample set, and the SVM model is trained based on the training sample set. During model training, a grid search method may be adopted to optimize the parameters of the SVM model to obtain a grid-searched SVM model, and k-fold cross-validation may be adopted to verify the parameter-optimized SVM model to obtain a verified SVM model. Each test sample in the test sample set is input into the verified SVM model, the classification accuracy of the SVM model is determined according to its classification results, and when the accuracy exceeds a preset threshold, the verified SVM model is determined as the pre-trained field classification model.

In this embodiment, a training sample field and a test sample field are determined from an acquired sample data table; the candidate field classification model to be trained is trained through the training sample field to obtain a pre-trained candidate field classification model; the classification accuracy of the pre-trained candidate field classification model is determined according to the test sample field; and the pre-trained candidate field classification model is determined as the pre-trained field classification model when the classification accuracy exceeds a preset threshold. In this way, candidate field classification models are obtained through model training, the pre-trained field classification model is screened out according to classification accuracy, and data quality detection is performed with the pre-trained field classification model, which can improve the efficiency of data quality detection.

In one embodiment, the step S102 may specifically include: determining a data table field in a sample data table; and determining a training sample field from the data table fields, and determining the data table fields except the training sample field as test sample fields.

Wherein the data table field may be a field in a sample data table.

In a specific implementation, text content in the sample data table may be identified through methods such as neural networks and machine learning to obtain the data table fields, or word segmentation may be performed on the text content in the sample data table to obtain the data table fields. From all the data table fields obtained, training sample fields may be determined at random, and the data table fields other than the training sample fields are determined as test sample fields.

In this embodiment, the data table fields in the sample data table are determined, training sample fields are determined from the data table fields, and the data table fields other than the training sample fields are determined as test sample fields. Training sample fields and test sample fields can thus be obtained from the sample data table; the candidate field classification model is trained through the training sample fields, and the field classification model is determined from the candidate field classification models through the test sample fields, which improves the reliability with which the field classification model is determined.
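By way of illustration only, the random division of data table fields into training sample fields and test sample fields might be sketched as below. The 80/20 proportion, the fixed seed, and the field names are assumptions of this sketch; the application only states that the training sample fields are chosen at random and that the remaining fields become test sample fields.

```python
import random


def split_fields(fields, train_fraction=0.8, seed=0):
    """Randomly pick training sample fields; the rest are test sample fields."""
    shuffled = list(fields)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical data table fields.
data_table_fields = ["cust_id", "txn_amount", "amount_ratio", "open_date", "branch"]
train_fields, test_fields = split_fields(data_table_fields)
print(train_fields, test_fields)
```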

In one embodiment, the step S104 may specifically include: acquiring a sample label corresponding to the training sample field; performing grid search processing on the candidate field classification model to be trained to obtain a grid-searched candidate field classification model, wherein the candidate quality detection model that the grid-searched candidate field classification model obtains by classifying the training sample field matches the sample label; and performing cross-validation processing on the grid-searched candidate field classification model to obtain a pre-trained candidate field classification model.

The sample label may be a manually calibrated inspection method set of the training sample field.

In a specific implementation, a sample label corresponding to each training sample field in the plurality of training sample fields can be obtained, grid search is performed on the candidate field classification model to be trained based on the plurality of training sample fields, so as to adjust model parameters, the training sample fields are classified by using the candidate field classification model after parameter adjustment, if the obtained candidate quality detection model is not matched with the sample label, the parameters of the model are adjusted again until the candidate quality detection model obtained by classifying the training sample fields by using the candidate field classification model after parameter adjustment is matched with the sample label, and the candidate field classification model after parameter adjustment is used as the candidate field classification model after grid search. And then, a k-fold cross validation method can be adopted to validate the candidate field classification model after grid searching, so as to obtain a pre-trained candidate field classification model.

For example, sample fields are collected from a plurality of data tables, an inspection method set is manually calibrated for each sample field, and the manually calibrated inspection method set is used as the sample label of the sample field. Training sample fields are randomly determined from the sample fields, and all the determined training sample fields form the training sample set, on which the SVM model is trained. During model training, a grid search method is first adopted to adjust the parameters of the SVM model, and the training sample fields are classified using the parameter-adjusted SVM model. If the inspection method set obtained by classification does not match the manually calibrated inspection method set, the parameters of the SVM model continue to be adjusted; otherwise, the SVM model with the current parameters is taken as the grid-searched SVM model. K-fold cross-validation may then be performed on the grid-searched SVM model to obtain the pre-trained SVM model.

In this embodiment, a sample label corresponding to the training sample field is acquired, grid search processing is performed on the candidate field classification model to be trained to obtain a grid-searched candidate field classification model, and cross-validation processing is performed on the grid-searched candidate field classification model to obtain the pre-trained candidate field classification model. The candidate field classification model is thus obtained through model training, so that the inspection method set corresponding to a field to be detected can be determined directly by the model, which improves the efficiency of data quality detection.
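By way of illustration only, grid search and k-fold cross-validation might be sketched in plain Python as below. Two departures from the text are assumptions of this sketch and should be read as such: a small nearest-neighbour classifier over character n-grams is swapped in for the SVM model so that the example needs no external library, and each grid point is scored by k-fold cross-validation, which merges the two steps that the embodiment performs one after the other. The sample names and the labels (each label standing for one inspection method set) are hypothetical.

```python
from collections import Counter
from itertools import product


def ngrams(text, n):
    """Character n-grams of a field name (the whole name if it is too short)."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)} or {text}


def predict(train, name, n, neighbours):
    """Stand-in classifier: majority label among the most similar names."""
    target = ngrams(name, n)

    def similarity(item):
        grams = ngrams(item[0], n)
        return len(target & grams) / len(target | grams)  # Jaccard similarity

    nearest = sorted(train, key=similarity, reverse=True)[:neighbours]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def cross_val_accuracy(samples, n, neighbours, folds=3):
    """Mean accuracy over `folds` held-out slices (k-fold cross-validation)."""
    scores = []
    for f in range(folds):
        held_out = samples[f::folds]
        train = [s for i, s in enumerate(samples) if i % folds != f]
        hits = sum(predict(train, name, n, neighbours) == label
                   for name, label in held_out)
        scores.append(hits / len(held_out))
    return sum(scores) / folds


def grid_search(samples, grid):
    """Try every parameter combination and keep the best-scoring one."""
    best = None
    for n, neighbours in product(grid["n"], grid["neighbours"]):
        score = cross_val_accuracy(samples, n, neighbours)
        if best is None or score > best[0]:
            best = (score, {"n": n, "neighbours": neighbours})
    return best


samples = [
    ("txn_amount", "amount checks"), ("cust_name", "text checks"),
    ("loan_amount", "amount checks"), ("branch_name", "text checks"),
    ("fee_amount", "amount checks"), ("city_name", "text checks"),
]
best_score, best_params = grid_search(samples, {"n": [2, 3], "neighbours": [1, 3]})
print(best_score, best_params)
```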

In one embodiment, the step S106 may specifically include: determining the number of labels of the sample labels corresponding to the test sample field; inputting the test sample field into a pre-trained candidate field classification model to obtain a candidate quality detection model of the test sample field, and determining the number of models of the candidate quality detection model of the test sample field; and determining the ratio of the number of models to the number of labels as the classification accuracy.

In a specific implementation, the number of sample labels corresponding to the test sample field can be counted to obtain the number of labels, the test sample field can be input into a pre-trained candidate field classification model to obtain a candidate quality detection model of the test sample field determined by the pre-trained candidate field classification model, the number of the candidate quality detection models is counted to obtain the number of models, the ratio of the number of models to the number of labels is used as the classification accuracy, and if the classification accuracy exceeds a preset threshold, the pre-trained candidate field classification model can be determined to be the pre-trained field classification model.

In practical application, assume that the test sample fields contained in a data table are t1, t2 and t3, and that manually calibrated inspection method sets A1, A2 and A3 are obtained for the respective test sample fields, where the numbers of inspection methods in the sets are N1, N2 and N3 respectively. The total number of inspection methods corresponding to the data table, namely the number of labels, is then ΣNi, i = 1, 2, 3. Each test sample field is input into the pre-trained SVM model to obtain inspection method sets B1, B2 and B3 determined by the model for the respective test sample fields, and the numbers of inspection methods in these sets are counted as M1, M2 and M3 respectively. The total number of inspection methods identified by the SVM model for the data table, namely the number of models, is ΣMi, i = 1, 2, 3, and ΣMi / ΣNi is determined as the classification accuracy.
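The ratio described above can be written out as a short calculation. The following is a minimal sketch, assuming that each inspection method set is represented as a Python set keyed by field name; the field names t1 to t3 follow the example, while the inspection method names and the function name `classification_accuracy` are hypothetical.

```python
def classification_accuracy(label_sets, predicted_sets):
    """Return (sum of Mi) / (sum of Ni) over all test sample fields."""
    label_count = sum(len(methods) for methods in label_sets.values())   # sum of Ni
    model_count = sum(len(predicted_sets.get(field, ()))                 # sum of Mi
                      for field in label_sets)
    return model_count / label_count

# Manually calibrated sets A1..A3 (hypothetical inspection method names).
labels = {
    "t1": {"null_check", "type_check"},
    "t2": {"null_check"},
    "t3": {"volume_check", "fluctuation_check"},
}
# Sets B1..B3 returned by the classifier for the same fields.
predicted = {
    "t1": {"null_check", "type_check"},
    "t2": {"null_check"},
    "t3": {"volume_check"},
}

accuracy = classification_accuracy(labels, predicted)
print(accuracy)
```

Here N1 + N2 + N3 = 5 and M1 + M2 + M3 = 4, so the ratio is 0.8. The sketch compares counts only, as the text describes; it does not check that each identified method also appears in the corresponding calibrated set.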

In this embodiment, the number of labels of the sample labels corresponding to the test sample field is determined; the test sample field is input into the pre-trained candidate field classification model to obtain the candidate quality detection model of the test sample field, and the number of models is determined; and the ratio of the number of models to the number of labels is determined as the classification accuracy. A candidate field classification model whose classification accuracy meets a certain condition can thus be determined to be the field classification model, which ensures the accuracy of the inspection method set determined by the field classification model.

In one embodiment, the quality detection results include a data fluctuation check result and a data volume check result; after the step S140, the method may specifically further include: determining a target area in the data fluctuation check result in response to a selected operation for the data fluctuation check result; and generating an alarm signal under the condition that the data quantity inspection result of the target area is out of a preset range.

The data fluctuation check result may be a data fluctuation curve in the data table.

The data quantity checking result can be the data quantity corresponding to each point in the data fluctuation curve.

In a specific implementation, the data fluctuation curve can be displayed on a display of the terminal. The user performs a selection operation on the data fluctuation curve to select a target area from it. After the terminal acquires the target area, it detects whether the data volume corresponding to the target area is within a preset range; if so, no alarm signal needs to be generated, and if not, an alarm signal is generated.

For example, the user selects a section of the data fluctuation curve, the terminal detects whether the data volume corresponding to the section of the curve is within the range of [0, 100], and if the data volume exceeds the range, an alarm is sent.

In this embodiment, by determining a target area in the data fluctuation inspection result in response to a selection operation for the data fluctuation inspection result, and generating an alarm signal when the data amount inspection result of the target area is out of a preset range, the data fluctuation in the data table can be detected, and the data fluctuation is ensured to be within a reasonable range.
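The range check on a selected section of the curve can be sketched as follows. This is an illustration only: it assumes that the curve is available as a list of per-point data volumes, that the selection is given as a start and end index, and that the check is applied to every point in the selection; the function name `volume_alarm` and the sample values are hypothetical, and the range [0, 100] is taken from the example above.

```python
def volume_alarm(volumes, start, end, low=0, high=100):
    """Return True if any point in the selected section is outside [low, high]."""
    target_area = volumes[start:end + 1]  # the section selected by the user
    return any(v < low or v > high for v in target_area)

# Hypothetical data volumes for the points of a data fluctuation curve.
volumes = [20, 45, 130, 80, 60]

print(volume_alarm(volumes, 0, 1))  # section covering the first two points
print(volume_alarm(volumes, 1, 3))  # section that includes the value 130
```

The first selection stays within the range, so no alarm would be raised; the second contains 130, which exceeds the upper bound.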

In one embodiment, the quality detection result includes a null value check result; after the step S140, the method may specifically further include: determining a field missing region of the data table to be detected according to the null value check result; determining a target area from the field missing regions, and counting the number of the field missing regions; and displaying the target area and the number of the field missing regions.

Wherein the null value check result may be an identification for indicating that a field in the data table is a null value.

In a specific implementation, a null value checking result can be displayed on the terminal, the null value checking result can be one or more null value identifiers, each null value identifier corresponds to a field missing region, and a user can select a target region from the field missing region and display the target region. The number of all the field missing regions can be counted, and the number of all the field missing regions can be displayed.

For example, all the null positions in the data table may be highlighted, the total number of all the null positions in the data table may be displayed on the terminal, and the user may select a target area from the data table, and sample and display the null positions for the target area.

In this embodiment, the field missing region of the data table to be detected is determined according to the null value inspection result, the target region is determined from the field missing region, the number of the field missing regions is counted, and the target region and the number of the field missing regions are displayed, so that the data missing condition in the data table can be displayed, and the user can conveniently and timely process the data exception.
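As a rough illustration of this embodiment, the sketch below locates the null positions in a small table, counts them, and lists those that fall inside a user-selected range of rows. The row data, the representation of a null value as `None`, and the function names `find_null_cells` and `nulls_in_region` are assumptions made for the example.

```python
def find_null_cells(rows):
    """Return (row_index, column_name) for every null value in the table."""
    return [(index, column)
            for index, row in enumerate(rows)
            for column, value in row.items()
            if value is None]

def nulls_in_region(null_cells, first_row, last_row):
    """Keep only the null positions inside the selected target area."""
    return [cell for cell in null_cells if first_row <= cell[0] <= last_row]

# Hypothetical data table to be detected.
rows = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": None},
    {"id": None, "amount": None},
    {"id": 4, "amount": 7.5},
]

null_cells = find_null_cells(rows)
total_missing = len(null_cells)               # number of field missing regions
target_area = nulls_in_region(null_cells, 2, 3)
print(total_missing, target_area)
```

A display layer could highlight the positions in `null_cells`, show `total_missing` as the overall count, and show `target_area` for the selected rows.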

In order to facilitate a thorough understanding of embodiments of the present application by those skilled in the art, the following description will be provided in connection with a specific example.

Fig. 3 provides a flow chart of a data quality detection method. According to fig. 3, the data quality detection method may specifically include the following steps:

step S301, defining a standardized SQL (Structured Query Language) inspection method set, as shown in FIG. 4, including but not limited to data volume inspection, data type inspection, data encoding inspection, data repetition inspection, null value inspection and fluctuation inspection;

step S302, acquiring a data quality detection method applicability function based on a standardized SQL inspection method set required by each test sample and a decision function of the verified SVM multi-classification training model; the applicability function of the data quality detection method can be a function for determining the matching degree between the candidate detection method and the field to be detected;

step S303, performing adaptive matching of the standardized SQL inspection method set against the data tables based on the data quality detection method applicability function, so that a standardized SQL inspection method set is matched for each data table and each field; specifically, the matching degree between each candidate inspection method and the field to be detected can be determined through the data quality detection method applicability function, and the inspection method used to check the data quality of the field to be detected is determined from the candidate inspection methods according to the matching degree;

Step S304, submitting the standardized SQL to a Flink (a distributed system) cluster, the Flink cluster carrying out the data quality inspection in combination with a data quality calculation rule; the data quality calculation rule may be a preset calculation rule, for example, for a primary key check, a result of 0 under the data quality calculation rule indicates that the primary key has no duplicates, and a result greater than 0 indicates that duplicate data exists;

step S305, generating a data quality report according to the data quality detection result, and sending out data quality early warning;

step S306, monitoring the preset data quality rules, promptly discovering changes in the data such as fluctuation (year-on-year, period-on-period) and missing values according to the data quality inspection report, and generating a data quality trend report;

step S307, providing scenario-specific solutions according to the data quality trend report for situations such as data increase and decrease, data repetition and data missing, for example evaluating whether data storage is normal, evaluating the influence of the data increment on the storage capacity of the system, and evaluating whether data jitter conforms to historical patterns; the analysis results and suggestions for data quality trend anomalies can be as shown in Table 1;

Trend classification	Early warning classification	Recommendation scheme
Data fluctuation	Warning	Displaying the fluctuation of the data and analyzing whether the data volume is reasonable
Data missing	Warning	Sampling and displaying the missing areas of the data fields, and prompting the total number of missing values
Garbled data	Error	Sampling and displaying garbled records, and prompting the current system file encoding requirement
Data repetition	Error	Sampling and displaying duplicate records, and highlighting the duplicated content

TABLE 1 data quality trend anomaly analysis results and suggestions

Step S308, when the data quality trend report deviates from the actual situation, providing a system supplement function, so that the supplemented scenarios serve as references for subsequent similar scenarios and improve the completeness of the system.
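The primary key rule mentioned for step S304 (a result of 0 means no repetition, a result greater than 0 means that duplicates exist) can be illustrated with a small check. This is only a sketch: Python's built-in `sqlite3` module stands in for the Flink cluster, and the table `account`, its columns and the two SQL statements are hypothetical examples rather than the standardized SQL of the application.

```python
import sqlite3

# In-memory database used as a stand-in execution engine (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO account VALUES (?, ?)",
    [(1, 10.0), (2, 20.0), (2, 25.0), (3, None)],
)

# Hypothetical inspection SQL: each statement returns a single number.
DUPLICATE_KEY_SQL = "SELECT COUNT(*) - COUNT(DISTINCT id) FROM account"
NULL_AMOUNT_SQL = "SELECT COUNT(*) FROM account WHERE amount IS NULL"

def run_check(sql):
    """Execute one inspection statement and return its scalar result."""
    return conn.execute(sql).fetchone()[0]

duplicate_keys = run_check(DUPLICATE_KEY_SQL)
null_amounts = run_check(NULL_AMOUNT_SQL)

# Calculation rule from the text: 0 means no repetition, > 0 means duplicates.
primary_key_ok = duplicate_keys == 0
print(duplicate_keys, null_amounts, primary_key_ok)
```

In this sample the key value 2 appears twice, so the duplicate check returns a value greater than 0 and the primary key check fails.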

Fig. 5 provides a flow chart of a training method of the SVM multi-classification model. According to fig. 5, the training method of the SVM multi-classification model may specifically include:

step S501, collecting a sufficient number of data table fields as experimental data, and labeling each data table field with inspection methods according to its data type, Chinese field name, English field name, table structure constraints and the like, so that it is classified under the appropriate standardized SQL. For example, if a field is of a numeric type, its Chinese name contains the word for "amount" and its English name contains key information such as "Amount", the SQL label for the numeric type check needs to be matched; if a primary key constraint exists in the table structure, the SQL label for the primary key uniqueness check needs to be matched; if a field in the table structure has a non-null constraint, the SQL label for the non-null check needs to be matched; in this way, SQL labels of different classifications are matched to the data table of each sample;

Step S502, preprocessing the experimental data under the various labels to obtain preprocessed sample data, and randomly dividing the preprocessed sample data into a training sample set and a test sample set;

step S503, obtaining an SVM multi-classification training model based on the multi-classification problem types and the training sample set, namely obtaining an SVM model to be trained;

step S504, carrying out parameter optimization on the SVM multi-classification training model by adopting a grid search method to obtain the SVM multi-classification training model after parameter optimization;

step S505, verifying the SVM multi-classification training model after parameter optimization by adopting a k-fold cross verification method to obtain a verified SVM multi-classification training model;

step S506, the test sample set is imported into the verified SVM multi-classification training model, and the accuracy of the verified SVM multi-classification training model is obtained;

step S507, under the condition that the accuracy of the verified SVM multi-classification training model is greater than or equal to a preset value, obtaining the standardized SQL inspection method set required by each test sample in the test sample set based on the verified SVM multi-classification training model; specifically, if a data table is manually labeled as matching 100 SQL inspection methods and 97 of the results obtained through the SVM multi-classification training model are correctly matched, the accuracy is 97%, which is greater than the preset accuracy threshold of 95%, so the matching accuracy is considered to meet the requirement and the model is feasible.
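The labeling rules given as examples in step S501 can be sketched as a small rule function. This is a hedged illustration only: the `FieldMeta` structure, the label names, the list of numeric types and the Chinese keyword "金额" (assumed here to be the word for "amount") are all hypothetical and are not taken from the application.

```python
from dataclasses import dataclass

# Assumed set of numeric column types for the example.
NUMERIC_TYPES = {"INT", "BIGINT", "DECIMAL", "FLOAT", "DOUBLE"}

@dataclass
class FieldMeta:
    name_en: str
    name_cn: str
    data_type: str
    primary_key: bool = False
    not_null: bool = False

def label_field(field):
    """Return the set of SQL inspection labels matched by the example rules."""
    labels = set()
    # Numeric type, and both names carry the "amount" keyword.
    if (field.data_type.upper() in NUMERIC_TYPES
            and "金额" in field.name_cn
            and "amount" in field.name_en.lower()):
        labels.add("numeric_type_check")
    if field.primary_key:
        labels.add("primary_key_uniqueness_check")
    if field.not_null:
        labels.add("non_null_check")
    return labels

amount_labels = label_field(
    FieldMeta("txn_amount", "交易金额", "DECIMAL", not_null=True))
id_labels = label_field(
    FieldMeta("id", "编号", "BIGINT", primary_key=True))
print(amount_labels, id_labels)
```

Labels produced this way would serve as the manually defined targets for the classifier; in the described method the trained SVM model then predicts such label sets for unseen fields.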

The data quality detection method provides a data quality management platform based on a data center. It checks the data volume, data type, data encoding, repetition, null values and the like of the data, automatically generates standardized data quality detection rules by means of machine learning, executes the standardized detection SQL through a Flink cluster to obtain detection results, and issues early warning reports on abnormal results. It also establishes trend analysis of data quality based on changes in data fluctuation, issues early warnings, supports supplementary entry of prediction scenarios, and summarizes the statistical results so that solutions can be provided for more subsequent early warnings. Automatic data quality detection is thereby realized, which solves the problems of high labor cost, heavy workload, high repetition and time consumption in manual detection. Meanwhile, discovering data quality problems provides data management with a means of finding problems and possible ways of solving them, which greatly improves the timeliness of data management. For financial institutions, the method is of far-reaching significance in the field of data management with respect to data accuracy, integrity and robustness, for example in standardizing data entry, ensuring the integrity of data transmission, and evaluating recommendation schemes based on historical problems.

In one embodiment, as shown in fig. 6, a data quality detection method is provided, and the method is applied to a terminal for illustration, and includes the following steps:

step S601, determining a data table field in a sample data table;

step S602, determining a training sample field from the data table fields, and determining the data table fields except the training sample field as test sample fields;

step S603, obtaining a sample label corresponding to the training sample field;

step S604, performing grid search processing on the candidate field classification model to be trained to obtain a candidate field classification model after grid search; the candidate quality detection model that the candidate field classification model after grid search obtains by classifying the training sample field matches the sample label;

step S605, cross-validation processing is carried out on the candidate field classification model after grid searching, and a pre-trained candidate field classification model is obtained;

step S606, determining the classification accuracy of a pre-trained candidate field classification model according to the test sample field;

step S607, under the condition that the classification accuracy exceeds a preset threshold, determining a pre-trained candidate field classification model as a pre-trained field classification model;

Step S608, determining a field to be detected from the acquired data table to be detected;

step S609, inputting the field to be detected into a pre-trained field classification model to obtain a candidate quality detection model of the field to be detected and the matching degree between the candidate quality detection model and the field to be detected;

step S610, determining a target quality detection model from candidate quality detection models according to the matching degree;

step S611, inputting the field to be detected into the target quality detection model to obtain the quality detection result of the data table to be detected.

In a specific implementation, the fields in a sample data table are determined as data table fields, training sample fields are randomly selected from the data table fields, and the data table fields other than the training sample fields are determined as test sample fields. Manually labeled sample labels are obtained for the training sample fields, and grid search is performed on the candidate field classification model to be trained according to the training sample fields and the sample labels, so that the candidate quality detection model obtained when the candidate field classification model after grid search classifies a training sample field matches the sample label. Cross-validation is then performed on the candidate field classification model after grid search, and the pre-trained field classification model is determined according to the classification accuracy. For a data table to be detected, a field to be detected is determined from the data table and input into the pre-trained field classification model to obtain candidate quality detection models and the matching degree corresponding to each candidate quality detection model; a target quality detection model is determined according to the matching degree, and the field to be detected is checked by the target quality detection model to obtain the quality detection result of the data table to be detected.

According to the data quality detection method, through grid search, cross verification and classification accuracy calculation, the reliability of classification of the pre-trained field classification model can be improved, the matching degree of each candidate quality detection model can be determined while the candidate quality detection model is determined, and then the target quality detection model with higher matching degree with the field to be detected in the data table to be detected is determined from the candidate quality detection models, and the target quality detection model is used for carrying out data quality detection on the data table to be detected, so that manual intervention is avoided, and the efficiency of data quality detection is improved.
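The selection of a target quality detection model according to the matching degree can be sketched as below. The text does not fix a selection rule in this passage, so the sketch assumes two plausible options: taking the candidate with the highest matching degree, or keeping every candidate at or above a threshold. The candidate names, the scores and the function name `select_target_models` are hypothetical.

```python
def select_target_models(candidates, min_match=None):
    """Pick target models from a {model_name: matching_degree} mapping.

    With no threshold, return the candidate(s) with the highest matching
    degree; with a threshold, return every candidate at or above it.
    """
    if not candidates:
        return []
    if min_match is None:
        best = max(candidates.values())
        return sorted(name for name, score in candidates.items() if score == best)
    return sorted(name for name, score in candidates.items() if score >= min_match)

# Hypothetical classifier output for one field to be detected.
candidates = {"null_check": 0.91, "type_check": 0.40, "duplicate_check": 0.75}

top_model = select_target_models(candidates)
above_threshold = select_target_models(candidates, min_match=0.7)
print(top_model, above_threshold)
```

A threshold-based rule allows several inspection methods to be applied to the same field, which fits the idea of an inspection method set per field.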

It should be understood that, although the steps in the flowcharts related to the embodiments described above are sequentially shown as indicated by arrows, these steps are not necessarily performed sequentially in the order indicated by the arrows. Unless explicitly stated herein, the steps are not strictly limited to that order of execution and may be executed in other orders. Moreover, at least some of the steps in the flowcharts described in the above embodiments may include a plurality of sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; these sub-steps or stages are not necessarily performed sequentially either, but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

Based on the same inventive concept, the embodiment of the application also provides a data quality detection device for realizing the above related data quality detection method. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitation of one or more embodiments of the data quality detection device provided below may refer to the limitation of the data quality detection method hereinabove, and will not be repeated herein.

In one embodiment, as shown in fig. 7, there is provided a data quality detection apparatus including: a field determination module 710, a field identification module 720, a model determination module 730, and a quality detection module 740, wherein:

a field determining module 710, configured to determine a field to be detected from the acquired data table to be detected;

the field identifying module 720 is configured to input the field to be detected into a pre-trained field classification model, and obtain a candidate quality detection model of the field to be detected and a matching degree between the candidate quality detection model and the field to be detected;

a model determining module 730, configured to determine a target quality detection model from the candidate quality detection models according to the matching degree;

And the quality detection module 740 is configured to input the field to be detected to the target quality detection model, and obtain a quality detection result of the data table to be detected.

In one embodiment, the data quality detection apparatus further includes:

the sample field determining module is used for determining a training sample field and a test sample field from the acquired sample data table;

the candidate model training module is used for training the candidate field classification model to be trained through the training sample field to obtain a pre-trained candidate field classification model;

the accuracy rate determining module is used for determining the classification accuracy rate of the pre-trained candidate field classification model according to the test sample field;

and the classification model determining module is used for determining the pre-trained candidate field classification model as the pre-trained field classification model under the condition that the classification accuracy exceeds a preset threshold.

In one embodiment, the sample field determining module is further configured to determine a data table field in the sample data table; and determining the training sample field from the data table fields, and determining the data table fields except the training sample field as the test sample field.
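The division of data table fields into training and test sample fields can be sketched as a seeded random split. The split ratio, the seed, the field names and the function name `split_fields` are assumptions for illustration; the text only states that training sample fields are selected and the remaining fields become test sample fields.

```python
import random

def split_fields(fields, train_ratio=0.8, seed=0):
    """Randomly pick training fields; the remaining fields become test fields."""
    rng = random.Random(seed)       # fixed seed keeps the split reproducible
    shuffled = list(fields)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical data table fields from a sample data table.
fields = ["id", "txn_amount", "txn_date", "customer_name", "currency"]

train_fields, test_fields = split_fields(fields)
print(train_fields, test_fields)
```

Every field ends up in exactly one of the two groups, so the test sample fields are precisely the data table fields other than the training sample fields.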

In one embodiment, the candidate model training module is further configured to obtain a sample label corresponding to the training sample field; perform grid search processing on the candidate field classification model to be trained to obtain a candidate field classification model after grid search, where the candidate quality detection model that the candidate field classification model after grid search obtains by classifying the training sample field matches the sample label; and perform cross-validation processing on the candidate field classification model after grid search to obtain the pre-trained candidate field classification model.

In one embodiment, the accuracy determining module is further configured to determine a number of labels of the sample labels corresponding to the test sample field; inputting the test sample field into the pre-trained candidate field classification model to obtain a candidate quality detection model of the test sample field, and determining the number of models of the candidate quality detection model of the test sample field; and determining the ratio of the number of models to the number of labels as the classification accuracy.

In one embodiment, the data quality detection apparatus further includes:

A target area determining module for determining a target area in the data fluctuation check result in response to a selected operation for the data fluctuation check result;

and the alarm signal generation module is used for generating an alarm signal under the condition that the data volume check result of the target area is out of a preset range.

In one embodiment, the data quality detection apparatus further includes:

the missing region determining module is used for determining a field missing region of the data table to be detected according to the null value checking result;

the region quantity counting module is used for determining a target region from the field missing regions and counting the quantity of the field missing regions;

and the target area display module is used for displaying the target area and the number of the field missing areas.

The modules in the above data quality detection apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, a processor in the computer device in the form of hardware, or may be stored in a memory in the computer device in the form of software, so that the processor can call them and execute the operations corresponding to the modules.

In one embodiment, a computer device is provided, which may be a terminal, and the internal structure thereof may be as shown in fig. 8. The computer device includes a processor, a memory, an input/output interface, a communication interface, a display unit, and an input means. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface, the display unit and the input device are connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The input/output interface of the computer device is used to exchange information between the processor and the external device. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a data quality detection method. The display unit of the computer device is used for forming a visual picture, and can be a display screen, a projection device or a virtual reality imaging device. The display screen can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be a key, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.

It will be appreciated by those skilled in the art that the structure shown in fig. 8 is merely a block diagram of some of the structures associated with the present application and is not limiting of the computer device to which the present application may be applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.

In an embodiment, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.

In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, carries out the steps of the method embodiments described above.

In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method embodiments described above.

It should be noted that, the user information (including, but not limited to, user equipment information, user personal information, etc.) and the data (including, but not limited to, data for analysis, stored data, presented data, etc.) referred to in the present application are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data are required to comply with the related laws and regulations and standards of the related countries and regions.

Those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program stored on a non-transitory computer-readable storage medium, and that the computer program, when executed, may include the steps of the method embodiments described above. Any reference to memory, database, or other medium used in the various embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. Volatile memory may include random access memory (RAM) or external cache memory, and the like. By way of illustration and not limitation, RAM may take a variety of forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases referred to in the various embodiments provided herein may include at least one of relational databases and non-relational databases. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processors referred to in the embodiments provided herein may be general purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, quantum computing-based data processing logic units, etc., without being limited thereto.

The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered to be within the scope of this description.

The above examples only represent a few embodiments of the present application, which are described in more detail and are not to be construed as limiting the scope of the present application. It should be noted that it would be apparent to those skilled in the art that various modifications and improvements could be made without departing from the spirit of the present application, which would be within the scope of the present application. Accordingly, the scope of protection of the present application shall be subject to the appended claims.

Claims

1. A method for detecting data quality, the method comprising:

determining a field to be detected from the acquired data table to be detected;

2. The method of claim 1, further comprising, prior to determining the field to be detected from the acquired table of data to be detected:

3. The method of claim 2, wherein determining the training sample field and the test sample field from the acquired sample data table comprises:

determining a data table field in the sample data table;

4. The method according to claim 2, wherein training the candidate field classification model to be trained through the training sample field to obtain a pre-trained candidate field classification model comprises:

acquiring a sample label corresponding to the training sample field;

5. The method of claim 2, wherein said determining classification accuracy of said pre-trained candidate field classification model from said test sample field comprises:

6. The method of claim 1, wherein the quality detection results include a data fluctuation check result and a data volume check result; after inputting the field to be detected to the target quality detection model to obtain a quality detection result of the data table to be detected, the method further comprises the following steps:

7. The method of claim 1, wherein the quality detection result comprises a null check result; after inputting the field to be detected to the target quality detection model to obtain a quality detection result of the data table to be detected, the method further comprises the following steps:

and displaying the target area and the number of the field missing areas.

8. A data quality detection apparatus, the apparatus comprising:

9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 7 when the computer program is executed.

10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 7.

11. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 7.