CN111125389A

CN111125389A - Data classification cleaning system and cleaning method based on dynamic progressive sampling

Info

Publication number: CN111125389A
Application number: CN201911305676.0A
Authority: CN
Inventors: 秦永强; 张发恩; 李素莹; 纪双西
Original assignee: Ainnovation Hefei Technology Co ltd
Current assignee: Ainnovation Hefei Technology Co ltd
Priority date: 2019-12-18
Filing date: 2019-12-18
Publication date: 2020-05-08

Abstract

The invention discloses a data classification cleaning system and method. The system comprises: a label sample graph placing module, used for placing a labeled sample graph into each type of data subset in a sample data set; an iterative model training module, used for training a data classification cleaning model with a label data set L, formed by the label sample graphs, as the training sample; a data pseudo label generating module, used for classifying and cleaning the data set to be cleaned based on the data classification cleaning model and for pseudo-labeling each piece of unmarked data so cleaned; and a data screening module, used for screening the resulting pseudo label data set to obtain a pseudo label candidate set S. The iterative model training module is further used for iteratively training the data classification cleaning model with the pseudo label candidate set S and the label data set L as training samples.

Description

Data classification cleaning system and cleaning method based on dynamic progressive sampling

Technical Field

The invention relates to the technical field of data cleaning, in particular to a data classification cleaning system and a cleaning method based on dynamic progressive sampling.

Background

At present, cleaning a picture data set relies mainly either on manual cleaning or on recognition and cleaning by a model trained on a large number of labeled picture samples. Manual cleaning is inefficient and often needs several rounds of checking to reach reasonable accuracy, so it cannot meet users' demand for automatic cleaning of picture data sets. Cleaning methods based on a large number of labeled picture samples still require the pictures to be labeled manually, so the labeling cost is high, the labeling period is long, and the labeling quality is hard to guarantee; the resulting data classification also suffers from low accuracy.

Disclosure of Invention

The invention aims to provide a data classification cleaning system and method based on dynamic progressive sampling to solve the technical problems.

In order to achieve the purpose, the invention adopts the following technical scheme:

The invention provides a data classification cleaning system based on dynamic progressive sampling, including:

a label sample graph placing module, used for letting a user place a labeled sample graph into each type of data subset in the sample data set, each label sample graph representing one data type;

an iterative model training module, connected with the label sample graph placing module and used for initially training a data classification cleaning model with a label data set L, formed by the placed label sample graphs, as the training sample;

a data pseudo label generating module, connected with the iterative model training module and used for inputting a data set to be cleaned into the data classification cleaning model, predicting the data type of the unmarked data in the data set through the data classification cleaning model, and pseudo-labeling each piece of unmarked data according to the prediction to obtain a pseudo label data set;

a data screening module, connected with the data pseudo label generating module and used for screening the pseudo label data set to obtain a pseudo label candidate set S;

the iterative model training module is further connected with the data screening module, and is further used for iteratively training the data classification cleaning model with an extended training data set D, formed by the pseudo label candidate set S and the label data set L, as the training sample;

and the data pseudo label generating module continues to clean the data set based on the data classification cleaning model obtained by iterative training until the classification cleaning of the data set is completed.

As a preferable aspect of the present invention, the data classification cleaning system further includes:

an index data marking module, connected with the data pseudo label generating module and used for marking each piece of unmarked data remaining in the data set as index label data after the data pseudo label generating module has finished pseudo-labeling the unmarked data in the data set;

the index data marking module is also connected with the iterative model training module, and the iterative model training module is used for iteratively training and updating the data classification cleaning model with the extended training data set D and the index label data as training samples;

and the data pseudo label generating module classifies and cleans the data set according to the iteratively updated data classification cleaning model until the classification cleaning of all data in the data set is completed.

The invention also provides a data classification cleaning method based on dynamic progressive sampling, which is realized by applying the data classification cleaning system and comprises the following steps:

step S1, the data classification cleaning system acquires the label sample graphs and places each acquired label sample graph into the corresponding type of data subset of the sample data set;

step S2, the data classification cleaning system takes a label data set L formed by the label sample graphs as the training sample, and initially trains the data classification cleaning model;

step S3, the data classification cleaning system inputs the data set to be cleaned into the data classification cleaning model, predicts the data type of each unmarked data in the data set through the data classification cleaning model, and performs pseudo marking on each unmarked data obtained through prediction to obtain a pseudo marked data set;

step S4, the data classification cleaning system performs data screening on the data in the pseudo label data set to obtain a pseudo label candidate set S;

step S5, the data classification cleaning system takes an extended training data set D formed by the pseudo label candidate set S and the label data set L as a training sample, and iteratively trains the data classification cleaning model;

and step S6, the data classification cleaning system continuously performs data classification cleaning on the data set based on the data classification cleaning model obtained through iterative training until the data classification cleaning process is completed.

When implemented by applying the data classification cleaning system that further includes the index data marking module, the data classification cleaning method comprises the following steps:

step L1, the data classification cleaning system acquires the label sample graphs and places each acquired label sample graph into the corresponding type of data subset of the sample data set;

step L2, the data classification cleaning system takes a label data set L formed by the label sample graphs as the training sample, and initially trains the data classification cleaning model;

step L3, the data classification cleaning system inputs the data set to be cleaned into the data classification cleaning model, predicts the data type of each unlabeled data in the data set through the data classification cleaning model, and performs pseudo labeling on each unlabeled data obtained through prediction to obtain a pseudo-labeled data set;

step L4, the data classification cleaning system performs data screening on the pseudo label data set to obtain a pseudo label candidate set S;

step L5, the data classification cleaning system iteratively trains the data classification cleaning model by taking an extended training data set D formed by the pseudo label candidate set S and the label data set L as a training sample;

step L6, after completing the pseudo-labeling of the unmarked data in the data set, the data classification cleaning system marks each piece of unmarked data remaining in the data set as index label data;

step L7, the data classification cleaning system takes the extended training data set D and each index label data as training samples, and iteratively trains and updates the data classification cleaning model;

and step L8, the data classification cleaning system continues to clean the data set based on the data classification cleaning model obtained by iterative training until the classification cleaning of all data is completed.

With the data classification cleaning system based on dynamic progressive sampling provided by the invention, only one picture in each type of data subset of the data set to be classified needs to be labeled manually. The system trains a model on the labeled sample pictures, uses the trained data classification cleaning model to classify and clean the data set, and automatically marks each piece of unmarked data so cleaned. It then iteratively updates the data classification cleaning model, using the automatically marked data together with the labeled sample pictures as training samples, until the classification cleaning of the data in the data set is completed. The invention greatly reduces the time cost of manual labeling, and improves the accuracy of data classification cleaning by repeatedly classifying, cleaning, and marking the data set.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the embodiments of the present invention will be briefly described below. It is obvious that the drawings described below are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.

FIG. 1 is a schematic structural diagram of a data classification and cleaning system based on dynamic progressive sampling according to a first embodiment of the present invention;

FIG. 2 is a schematic structural diagram of a data classification and cleaning system based on dynamic progressive sampling according to a second embodiment of the present invention;

FIG. 3 is a diagram of the steps of a method for implementing data classification cleaning by using the data classification cleaning system according to the first embodiment of the present invention;

FIG. 4 is a diagram of the steps of a method for implementing data classification cleaning by using the data classification cleaning system according to the second embodiment of the present invention.

Detailed Description

The technical scheme of the invention is further explained by the specific implementation mode in combination with the attached drawings.

The drawings are for illustration only; they are schematic rather than depictions of actual form, and are not to be construed as limiting this patent. To better illustrate the embodiments of the present invention, some parts of the drawings may be omitted, enlarged, or reduced, and do not represent the size of an actual product. Those skilled in the art will understand that certain well-known structures, and their descriptions, may be omitted from the drawings.

The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components. In the description of the present invention, terms such as "upper", "lower", "left", "right", "inner", and "outer", where they indicate an orientation or positional relationship, are based on the orientation or positional relationship shown in the drawings. They are used only for convenience and simplicity of description, and do not indicate or imply that the referenced device or element must have a specific orientation or be constructed and operated in a specific orientation. Such terms are therefore illustrative only and are not to be construed as limiting this patent; those skilled in the art can understand their specific meanings according to the specific situation.

In the description of the present invention, unless otherwise explicitly specified or limited, the term "connected" and the like, where used to indicate a connection relationship between components, is to be understood broadly: the connection may be fixed, detachable, or integral; mechanical or electrical; direct, or indirect through an intervening medium; or it may be a connection between two components through one or more other components, or an interaction between them. Those skilled in the art can understand the specific meanings of the above terms in the present invention according to the specific situation.

Example one

The embodiment of the present invention provides a data classification cleaning system based on dynamic progressive sampling, referring to fig. 1, including:

a label sample graph placing module 1, used for letting a user place a labeled sample graph into each type of data subset in a sample data set, each label sample graph representing one data type;

an iterative model training module 2, connected with the label sample graph placing module 1 and used for initially training a data classification cleaning model with a label data set L, formed by the placed label sample graphs, as the training sample;

a data pseudo label generating module 3, connected with the iterative model training module 2 and used for inputting a data set to be cleaned into the data classification cleaning model, predicting the data type of the unmarked data in the data set through the data classification cleaning model, and pseudo-labeling each piece of unmarked data according to the prediction to obtain a pseudo label data set;

a data screening module 4, connected with the data pseudo label generating module 3 and used for screening the pseudo label data set to obtain a pseudo label candidate set S;

the iterative model training module 2 is also connected with the data screening module 4, and is further used for iteratively training the data classification cleaning model with an extended training data set D, formed by the pseudo label candidate set S and the label data set L, as the training sample;

and the data pseudo label generating module 3 continues to clean the data set based on the data classification cleaning model obtained by iterative training until the classification cleaning of the data set is completed.

In the above technical solution, the method for iteratively training the data classification cleaning model is an existing model training method, and since the model training method is not within the scope of the claimed invention, the specific training process of the data classification cleaning model is not described herein.

The labeling process for a pseudo-labeled data set is briefly as follows:

The system assigns pseudo labels to the data according to the predictions of the data classification cleaning model, and all pseudo-labeled data are treated as suspected label data. The pseudo-labeling of data by the model may be implemented with existing algorithms; since it is not within the scope of the claimed invention, the pseudo-labeling process is not elaborated here.
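The pseudo-labeling step can be illustrated with a minimal sketch. It assumes the model outputs one probability per class for each unmarked item, which the patent does not specify; the function name `pseudo_label`, the item ids, and the class names are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple


def pseudo_label(
    probs: Dict[str, List[float]], classes: List[str]
) -> Dict[str, Tuple[str, float]]:
    """Map each unmarked item id to (pseudo label, confidence).

    Assumption: probs[item_id] holds one probability per entry in `classes`.
    The pseudo label is simply the class with the highest probability.
    """
    labeled = {}
    for item_id, p in probs.items():
        best = max(range(len(classes)), key=lambda i: p[i])
        labeled[item_id] = (classes[best], p[best])
    return labeled


# Hypothetical model outputs for two unmarked pictures.
predictions = {"img_001": [0.9, 0.1], "img_002": [0.4, 0.6]}
result = pseudo_label(predictions, ["cat", "dog"])
```

Keeping the confidence next to the pseudo label is a design choice that makes a later screening step straightforward.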

In addition, the pseudo label candidate set S may be obtained by calculating the confidence that each piece of pseudo label data is genuine label data; the confidence may be calculated by an existing method. Other existing screening methods may of course also be used to screen the pseudo label data and obtain the pseudo label candidate set S.
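One possible form of confidence-based screening is to keep only the most confident pseudo-labeled items. The sketch below is an assumption, not the patent's prescribed method: the name `screen_candidates`, the `keep` parameter, and the top-k rule are all illustrative, and a fixed confidence threshold would be an equally valid choice.

```python
from typing import Dict, Tuple


def screen_candidates(
    pseudo: Dict[str, Tuple[str, float]], keep: int
) -> Dict[str, str]:
    """Return the `keep` most confident pseudo-labeled items as candidate set S.

    Assumption: each value is (pseudo label, confidence in [0, 1]).
    """
    ranked = sorted(pseudo.items(), key=lambda kv: kv[1][1], reverse=True)
    return {item_id: label for item_id, (label, _conf) in ranked[:keep]}


# Hypothetical pseudo label data set with confidences.
pseudo = {"a": ("cat", 0.95), "b": ("dog", 0.55), "c": ("cat", 0.80)}
S = screen_candidates(pseudo, keep=2)
```

Here item `b` is left out of S because its confidence is the lowest of the three.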

This embodiment further provides a data classification cleaning method based on dynamic progressive sampling, implemented by applying the data classification cleaning system provided in the first embodiment. Referring to FIG. 3, the method includes the following steps:

step S1, the data classification cleaning system acquires the label sample graphs and places each acquired label sample graph into the corresponding type of data subset of the sample data set; the data type of each label sample graph is the same as the data type of the data subset into which it is placed;

step S2, the data classification cleaning system takes a label data set L formed by the label sample graphs as the training sample, and initially trains the data classification cleaning model;

step S3, the data classification cleaning system inputs the data set to be cleaned into the data classification cleaning model, predicts the data type of each piece of unmarked data in the data set through the data classification cleaning model, and pseudo-labels each piece of unmarked data according to the prediction to obtain a pseudo label data set;

step S4, the data classification cleaning system performs data screening on the pseudo label data set to obtain a pseudo label candidate set S;

step S5, the data classification cleaning system takes an extended training data set D formed by the pseudo label candidate set S and the label data set L as the training sample, and iteratively trains the data classification cleaning model;

and step S6, the data classification cleaning system continues to perform data classification cleaning on the data set based on the data classification cleaning model obtained by iterative training until the data classification cleaning process is completed.
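The loop formed by steps S2 to S6 can be sketched end to end under strong simplifying assumptions. The patent leaves the model, the screening rule, and the stopping rule open, so everything here is hypothetical: a nearest-centroid classifier on 2-D points stands in for the data classification cleaning model, confidence is derived from distance, and the candidate set grows by `step` items per round as one possible reading of "progressive sampling".

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def train(samples: List[Tuple[Point, str]]) -> Dict[str, Point]:
    """Stand-in model (assumption): one centroid per class."""
    sums: Dict[str, List[float]] = {}
    for (x, y), label in samples:
        s = sums.setdefault(label, [0.0, 0.0, 0.0])
        s[0] += x
        s[1] += y
        s[2] += 1
    return {label: (s[0] / s[2], s[1] / s[2]) for label, s in sums.items()}


def predict(model: Dict[str, Point], p: Point) -> Tuple[str, float]:
    """Return (nearest class, confidence); closer centroids give higher confidence."""
    dists = {
        label: ((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2) ** 0.5
        for label, c in model.items()
    }
    label = min(dists, key=dists.get)
    return label, 1.0 / (1.0 + dists[label])


def clean(
    labeled: List[Tuple[Point, str]], unlabeled: List[Point], step: int = 2
) -> Dict[Point, str]:
    model = train(labeled)  # S2: initial training on label data set L
    keep = step
    while True:
        # S3: pseudo-label every unmarked item with the current model.
        pseudo = [(p, *predict(model, p)) for p in unlabeled]
        # S4: screen by confidence to get candidate set S (top `keep` items).
        pseudo.sort(key=lambda t: t[2], reverse=True)
        candidates = pseudo[:keep]
        if len(candidates) == len(unlabeled):
            # S6: every item has been classified, so cleaning is complete.
            return {p: lab for p, lab, _ in candidates}
        # S5: retrain on the extended set D = L + S.
        model = train(labeled + [(p, lab) for p, lab, _ in candidates])
        keep += step  # enlarge S in the next round (assumed schedule)


# One labeled sample per class, as in step S1.
labeled = [((0.0, 0.0), "a"), ((10.0, 10.0), "b")]
unlabeled = [(1.0, 1.0), (9.0, 9.0), (2.0, 2.0), (8.0, 8.0), (4.0, 4.0)]
result = clean(labeled, unlabeled)
```

Retraining from L plus the current S in each round, rather than accumulating earlier pseudo labels, lets the model revise a pseudo label it assigned in a previous round.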

Example two

The difference between the second embodiment and the first embodiment is that, referring to fig. 2, the data classification and cleaning system based on dynamic progressive sampling provided in the second embodiment further includes:

an index data marking module 5, connected with the data pseudo label generating module 3 and used for marking each piece of unmarked data remaining in the data set as index label data after the data pseudo label generating module 3 has finished pseudo-labeling the unmarked data in the data set;

the index data marking module 5 is also connected with the iterative model training module 2, and the iterative model training module 2 is used for iteratively training and updating the data classification cleaning model with the extended training data set D and the index label data as training samples;

and the data pseudo label generating module 3 performs data classification cleaning on the data set according to the data classification cleaning model updated iteratively until the data classification cleaning process of all data in the data set is completed.

The data classification cleaning system provided by the second embodiment classifies and cleans the data more thoroughly, and therefore achieves a better data classification cleaning effect.

The second embodiment further provides a data classification cleaning method based on dynamic progressive sampling, which is implemented by applying the data classification cleaning system provided in the second embodiment, with reference to fig. 4, and includes the following steps:

step L1, the data classification cleaning system acquires the label sample graphs and places each acquired label sample graph into the corresponding type of data subset of the sample data set;

step L2, the data classification cleaning system takes a label data set L formed by the label sample graphs as the training sample, and initially trains the data classification cleaning model;

step L3, the data classification cleaning system inputs the data set to be cleaned into the data classification cleaning model, predicts the data type of each piece of unmarked data in the data set through the data classification cleaning model, and pseudo-labels each piece of unmarked data according to the prediction to obtain a pseudo label data set;

step L4, the data classification cleaning system performs data screening on the pseudo label data set to obtain a pseudo label candidate set S;

step L5, the data classification cleaning system takes an extended training data set D formed by the pseudo label candidate set S and the label data set L as the training sample, and iteratively trains the data classification cleaning model;

step L6, after completing the pseudo-labeling of the unmarked data in the data set, the data classification cleaning system marks each piece of unmarked data remaining in the data set as index label data;

step L7, the data classification cleaning system takes the extended training data set D and the index label data as training samples, and iteratively trains and updates the data classification cleaning model;

and step L8, the data classification cleaning system continuously performs data cleaning on the data set based on the data classification cleaning model obtained by iterative training until the classification cleaning process of all data is completed.
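The patent does not define "index label data" beyond steps L6 and L7. One plausible reading, stated here purely as an assumption, is that every item left without a pseudo label receives a label unique to that item, such as its own position in the data set, so that it can still take part in training. The function name `mark_index_labels` and the `index_<n>` label format are hypothetical.

```python
from typing import Dict, List


def mark_index_labels(
    all_ids: List[str], pseudo_labeled: Dict[str, str]
) -> Dict[str, str]:
    """L6 (assumed meaning): tag each item that has no pseudo label
    with a unique label derived from its index in the data set."""
    index_labels = {}
    for idx, item_id in enumerate(all_ids):
        if item_id not in pseudo_labeled:
            index_labels[item_id] = f"index_{idx}"
    return index_labels


ids = ["img_0", "img_1", "img_2", "img_3"]
S = {"img_0": "cat", "img_2": "dog"}  # hypothetical candidate set S
leftover = mark_index_labels(ids, S)

# L7: training samples are D plus the index label data
# (the label data set L is omitted here for brevity).
training_set = {**S, **leftover}
```

With this reading, every item in the data set contributes to the next training round, either through a class label or through its own index label.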

It should be understood that the above-described embodiments are merely preferred embodiments of the invention and the technical principles applied thereto. It will be understood by those skilled in the art that various modifications, equivalents, changes, and the like can be made to the present invention. However, such variations are within the scope of the invention as long as they do not depart from the spirit of the invention. In addition, certain terms used in the specification and claims of the present application are not limiting, but are used merely for convenience of description.

Claims

1. A data classification cleaning system based on dynamic progressive sampling, comprising:

2. The data classification cleaning system of claim 1, further comprising:

3. A data classification cleaning method based on dynamic progressive sampling, implemented by applying the data classification cleaning system according to claim 1, characterized by comprising the following steps:

4. A data classification cleaning method based on dynamic progressive sampling, implemented by applying the data classification cleaning system according to claim 2, characterized by comprising the following steps: