CN110490237A - Data processing method, device, storage medium and electronic equipment - Google Patents
- Publication number
- CN110490237A; CN201910713784.5A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The present application discloses a data processing method, an apparatus, a storage medium, and an electronic device. The method includes: obtaining multiple pieces of data that carry the same class label; dividing the multiple pieces of data into a first data set and a second data set; extracting the feature of each piece of data in the first data set and the second data set; obtaining correctness information for the class label of each piece of data in the second data set; training a preset binary classification model according to the correctness information of the class label and the feature of each piece of data in the second data set, to obtain a target model; using the target model and the feature of each piece of data in the first data set to obtain first target data, namely the data in the first data set whose class label is judged to be correct; and obtaining second target data according to the first target data and the data in the second data set whose class label is correct. The application can improve the efficiency of data cleaning.
Description
Technical field
The present application belongs to the field of data technology, and in particular relates to a data processing method, an apparatus, a storage medium, and an electronic device.
Background
Data cleaning is the process of re-examining and verifying data in order to delete erroneous information from a data set. Taking the cleaning of classified picture data as an example, the main task is to check whether the classification label of each picture is correct and to delete the pictures whose classification label is wrong. In the related art, however, data cleaning is inefficient.
Summary of the invention
The embodiments of the present application provide a data processing method, an apparatus, a storage medium, and an electronic device that can improve the efficiency of data cleaning.
An embodiment of the present application provides a data processing method, including:
obtaining multiple pieces of data, the multiple pieces of data carrying the same class label;
dividing the multiple pieces of data into a first data set and a second data set;
extracting the feature of each piece of data in the first data set and the second data set;
obtaining correctness information for the class label of each piece of data in the second data set;
training a preset binary classification model according to the correctness information of the class label of each piece of data in the second data set and the feature of each piece of data, to obtain a target model;
using the target model and the feature of each piece of data in the first data set to obtain first target data, namely the data in the first data set whose class label is judged to be correct;
obtaining second target data according to the first target data and the data in the second data set whose class label is correct.
An embodiment of the present application provides a data processing apparatus, including:
a first obtaining module, configured to obtain multiple pieces of data, the multiple pieces of data carrying the same class label;
a division module, configured to divide the multiple pieces of data into a first data set and a second data set;
an extraction module, configured to extract the feature of each piece of data in the first data set and the second data set;
a second obtaining module, configured to obtain correctness information for the class label of each piece of data in the second data set;
a training module, configured to train a preset binary classification model according to the correctness information of the class label of each piece of data in the second data set and the feature of each piece of data, to obtain a target model;
a third obtaining module, configured to use the target model and the feature of each piece of data in the first data set to obtain first target data, namely the data in the first data set whose class label is judged to be correct;
a processing module, configured to obtain second target data according to the first target data and the data in the second data set whose class label is correct.
An embodiment of the present application provides a storage medium on which a computer program is stored; when the computer program is executed on a computer, it causes the computer to perform the procedure of the data processing method provided by the embodiments of the present application.
An embodiment of the present application also provides an electronic device including a memory and a processor, where the processor performs the procedure of the data processing method provided by the embodiments of the present application by calling the computer program stored in the memory.
In the embodiments of the present application, the electronic device can use a trained binary classification model to perform data cleaning. Because the trained binary classification model can quickly determine which data carry a correct class label, this embodiment can obtain clean data quickly. Compared with the related-art approach of manually browsing the data one by one to check whether the label information is wrong, this embodiment can improve the efficiency of data cleaning.
Brief description of the drawings
The technical solutions and beneficial effects of the present application will become apparent from the following detailed description of specific embodiments, taken in conjunction with the accompanying drawings.
Fig. 1 is a first schematic flowchart of the data processing method provided by the embodiments of the present application.
Fig. 2 is a second schematic flowchart of the data processing method provided by the embodiments of the present application.
Fig. 3 is a third schematic flowchart of the data processing method provided by the embodiments of the present application.
Fig. 4 is a schematic structural diagram of the fourth model provided by the embodiments of the present application.
Fig. 5 to Fig. 7 are schematic scenario diagrams of the data processing method provided by the embodiments of the present application.
Fig. 8 is a fourth schematic flowchart of the data processing method provided by the embodiments of the present application.
Fig. 9 is a schematic structural diagram of the data processing apparatus provided by the embodiments of the present application.
Fig. 10 is a schematic structural diagram of the electronic device provided by the embodiments of the present application.
Fig. 11 is another schematic structural diagram of the electronic device provided by the embodiments of the present application.
Detailed description
Referring to the drawings, in which the same reference numerals denote the same components, the principles of the present application are illustrated as implemented in a suitable computing environment. The following description is based on the illustrated specific embodiments of the application and should not be regarded as limiting other specific embodiments that are not detailed herein.
It can be understood that the executing entity of the embodiments of the present application may be an electronic device such as a smartphone, a tablet computer, a desktop computer, or a server.
Referring to Fig. 1 and Fig. 2, Fig. 1 is a first schematic flowchart of the data processing method provided by the embodiments of the present application, and Fig. 2 is a second schematic flowchart of the same method. The procedure may include:
101. Obtain multiple pieces of data, the multiple pieces of data carrying the same class label.
Data cleaning is the process of re-examining and verifying data in order to delete erroneous information from a data set. Taking the cleaning of classified picture data as an example, the related art mainly performs data cleaning by manual inspection: a person checks whether the classification label of each picture is correct, and the pictures whose classification label is wrong are deleted. In the related art, however, data cleaning is inefficient.
In the embodiments of the present application, for example, the electronic device may first obtain multiple pieces of data that carry the same class label. It can be understood that these are the data that need to be cleaned. For example, the electronic device may obtain a data set that needs data cleaning.
For example, the data that need cleaning are a picture set, and the pictures in the picture set all have the same class label. For example, the class label of the pictures in the picture set is the flower class.
102. Divide the multiple pieces of data into a first data set and a second data set.
For example, after obtaining the data that need cleaning, the electronic device may divide them into a first data set and a second data set.
For example, if the data that need cleaning are 1000 pictures, the electronic device may divide these 1000 pictures into a first data set (for example, a first picture set) and a second data set (for example, a second picture set).
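As an illustration only, the division in 102 can be sketched in Python as follows. The 8:2 proportion, the fixed random seed, and the function name split_data are assumptions made for this sketch; this step itself does not prescribe how the division is performed.

```python
import random


def split_data(items, first_fraction=0.8, seed=0):
    """Randomly divide items into a first set and a smaller second set.

    first_fraction and seed are illustrative assumptions.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * first_fraction)
    return shuffled[:cut], shuffled[cut:]


# 1000 pictures that all carry the same class label.
pictures = [f"P{i}" for i in range(1, 1001)]
first_set, second_set = split_data(pictures)
print(len(first_set), len(second_set))
```

The second set is the one that is later checked by hand, so keeping it the smaller of the two limits the manual work.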
103. Extract the feature of each piece of data in the first data set and the second data set.
For example, after obtaining the first data set and the second data set by division, the electronic device may extract the feature of each piece of data in the first data set and the feature of each piece of data in the second data set.
For example, the electronic device may extract the feature of each picture in the first data set and the feature of each picture in the second data set.
104. Obtain correctness information for the class label of each piece of data in the second data set.
For example, after obtaining the second data set, the electronic device may obtain the correctness information of the class label of each piece of data in the second data set. For example, the electronic device may obtain information indicating whether the class label of each picture in the second data set is correct.
For example, if the second data set contains 200 pictures, manual inspection may first be used to determine whether the class label of each of these 200 pictures is correct. If the class label of a picture is correct, the inspector may use the electronic device to mark the picture with information indicating that the class label is correct, such as the digit "1" or the letter "T". If the class label of a picture is wrong, the inspector may use the electronic device to mark the picture with information indicating that the class label is wrong, such as the digit "0" or the letter "F". In this way, the electronic device obtains the class label correctness information of these 200 pictures.
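A minimal sketch of how such marks might be normalized into correctness information is given below. The marks "1"/"T" and "0"/"F" come from the description above; the function name parse_mark, the tolerance for lowercase letters and integer input, and the sample dictionary are assumptions made for illustration.

```python
CORRECT_MARKS = {"1", "T"}
WRONG_MARKS = {"0", "F"}


def parse_mark(mark):
    """Convert an inspector's mark into True (label correct) or False (label wrong)."""
    text = str(mark).strip().upper()
    if text in CORRECT_MARKS:
        return True
    if text in WRONG_MARKS:
        return False
    raise ValueError(f"unrecognized mark: {mark!r}")


# Hypothetical marks entered by an inspector for four pictures.
marks = {"P1": "1", "P2": "F", "P3": "T", "P4": 0}
correctness = {name: parse_mark(mark) for name, mark in marks.items()}
print(correctness)
```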
105. Train a preset binary classification model according to the correctness information of the class label of each piece of data in the second data set and the feature of each piece of data, to obtain a target model.
For example, after obtaining the correctness information of the class label of each piece of data in the second data set, the electronic device may train the preset binary classification model according to that correctness information and the feature of each piece of data in the second data set, so as to obtain a trained model, namely the target model.
For example, after obtaining the class label correctness information of the 200 pictures in the second data set, the electronic device may input the class label correctness information and the features of these 200 pictures into the preset binary classification model as input data to train the model, so as to obtain the target model.
In one embodiment, for example, if picture Pi is a picture in the second data set, the feature fi of picture Pi and the class label correctness information bi of picture Pi can be expressed in the form <fi, bi>, and <fi, bi> can then serve as one training sample of the binary classification model.
It can be understood that the target model obtained by training is a model that can output, from the feature of a picture, information on whether the class label of the picture is correct.
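The training on samples of the form <fi, bi> can be sketched as below. This is only a stand-in: a small logistic regression fitted by stochastic gradient descent is used as the binary classification model so that the sketch needs nothing beyond the Python standard library. The two-dimensional features, the learning rate, the epoch count, and all function names are assumptions made for illustration.

```python
import math


def train_binary_classifier(samples, epochs=200, lr=0.5):
    """Fit a logistic regression on samples (f_i, b_i) by stochastic gradient descent.

    f_i is a list of floats (the feature); b_i is 1 if the class label is
    correct and 0 if it is wrong.
    """
    dim = len(samples[0][0])
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        for f, b in samples:
            z = bias + sum(w * x for w, x in zip(weights, f))
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - b  # gradient of the log loss with respect to z
            weights = [w - lr * err * x for w, x in zip(weights, f)]
            bias -= lr * err
    return weights, bias


def label_is_correct(model, f):
    """Return True when the model judges the class label of feature f correct."""
    weights, bias = model
    return bias + sum(w * x for w, x in zip(weights, f)) > 0.0


# Toy samples <f_i, b_i>; real features would come from the extraction step.
samples = [([0.9, 0.1], 1), ([0.8, 0.2], 1), ([0.1, 0.9], 0), ([0.2, 0.8], 0)]
target_model = train_binary_classifier(samples)
print(label_is_correct(target_model, [0.85, 0.15]))
print(label_is_correct(target_model, [0.15, 0.85]))
```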
106. Use the target model and the feature of each piece of data in the first data set to obtain first target data, namely the data in the first data set whose class label is judged to be correct.
For example, after obtaining the target model, the electronic device may use the target model and the feature of each piece of data in the first data set to obtain the data in the first data set whose class label is judged to be correct, namely the first target data.
For example, the first data set contains 800 pictures. After obtaining the target model, the electronic device may input the feature of each of these 800 pictures into the target model, which outputs, according to the feature of each picture, information on whether the class label of that picture is correct. The pictures whose class label is judged to be correct are determined as the first target data. It can be understood that the first target data may include multiple pictures.
107. Obtain second target data according to the first target data and the data in the second data set whose class label is correct.
For example, after obtaining the first target data, the electronic device may also obtain the data in the second data set whose class label is correct, and merge the first target data with those data to obtain the second target data. It can be understood that the second target data are the clean data obtained after data cleaning.
For example, in 104 the electronic device obtains the correctness information of the class labels of the 200 pictures in the second data set, and 190 of these 200 pictures have a correct class label. In 106, the electronic device uses the target model to determine that 790 of the 800 pictures in the first data set have a correct class label. The electronic device may then merge the 190 pictures from the second data set with the 790 pictures from the first data set to obtain 980 pictures. These 980 pictures can be regarded as the clean data obtained after data cleaning.
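The arithmetic of this example (790 + 190 = 980) can be reproduced with a short sketch. The picture names and the choice of which pictures are judged wrong are made up for illustration; only the counts follow the example above.

```python
def merge_clean_data(first_judgements, second_correctness):
    """Merge first-set items judged correct with second-set items verified correct."""
    first_target = [name for name, ok in first_judgements.items() if ok]
    second_correct = [name for name, ok in second_correctness.items() if ok]
    return first_target + second_correct


# 800 first-set pictures, of which the model judges 790 correct (hypothetical names).
first_judgements = {f"A{i}": i <= 790 for i in range(1, 801)}
# 200 second-set pictures, of which manual inspection finds 190 correct.
second_correctness = {f"B{i}": i <= 190 for i in range(1, 201)}

second_target = merge_clean_data(first_judgements, second_correctness)
print(len(second_target))
```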
It can be understood that, in this embodiment, the electronic device can use a trained binary classification model to perform data cleaning. Because the trained binary classification model can quickly determine which data carry a correct class label, this embodiment can obtain clean data quickly. Compared with the related-art approach of manually browsing the data one by one to check whether the label information is wrong, this embodiment can improve the efficiency of data cleaning.
In the embodiments of the present application, a binary classification model is introduced into the cleaning of classified data. The known class label correctness information of the data in the second data set, together with the features of those data, is used to train the binary classification model and obtain the target model. The target model is then used to output whether the class label of each piece of data in the first data set is correct. Finally, all the data whose class label is correct are obtained, and these are the clean data.
Referring to Fig. 3, Fig. 3 is a third schematic flowchart of the data processing method provided by the embodiments of the present application. The procedure may include:
201. The electronic device obtains multiple pieces of data, the multiple pieces of data carrying the same class label.
For example, the electronic device may obtain 1000 pictures that need data cleaning. These 1000 pictures carry the same class label; for example, they all have the same manually annotated flower class label. For example, the 1000 pictures are P1, P2, P3, ..., P1000.
202. The electronic device divides the multiple pieces of data into a first data set and a second data set.
For example, after obtaining the 1000 pictures, the electronic device may divide them into a first data set and a second data set. For example, the electronic device may randomly select 800 of the 1000 pictures as the first data set and assign the remaining 200 pictures to the second data set.
After dividing the 1000 pictures into the first data set and the second data set, the electronic device may detect whether its current computing capability is lower than a preset threshold.
If the current computing capability is detected to be lower than the preset threshold, the current computing capability of the electronic device can be considered weak. In this case, the procedure proceeds to 203.
If the current computing capability is detected to be not lower than the preset threshold, the current computing capability of the electronic device can be considered strong. In this case, the procedure proceeds to 204.
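The branch between 203 and 204 can be sketched as follows. Representing the computing capability as a single number and the function name choose_extraction_step are assumptions made for illustration.

```python
def choose_extraction_step(capability, threshold):
    """Return 203 (preset extractor) when capability is below the threshold, else 204."""
    return 203 if capability < threshold else 204


print(choose_extraction_step(0.3, 0.5))  # weaker device, preset feature extraction model
print(choose_extraction_step(0.5, 0.5))  # not lower than the threshold, fourth model
```

A capability exactly equal to the threshold takes the 204 branch, because the condition for 203 is strictly "lower than".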
203. When the computing capability of the electronic device is lower than the preset threshold, the electronic device uses a preset feature extraction model to extract the feature of each piece of data in the first data set and the second data set.
For example, if the electronic device detects that its current computing capability is lower than the preset threshold, it may obtain the preset feature extraction model and use it to extract the feature of each picture in the first data set and the second data set.
In one embodiment, the preset feature extraction model may be obtained as follows:
the electronic device obtains a first model, the first model being a ResNet model trained on ImageNet;
the electronic device trains the ResNet model using the multiple pieces of data to obtain a second model;
the electronic device removes the fully connected layer located at the last layer of the second model to obtain a third model, and determines the third model as the preset feature extraction model.
For example, when the multiple pieces of data that need cleaning are pictures, the electronic device may first obtain the first model, where the first model is a ResNet model trained on ImageNet.
It should be noted that the ImageNet project is a large visual database used for research on visual object recognition software. More than 14 million image URLs have been manually annotated by ImageNet to indicate the objects in the pictures. Since 2010, the ImageNet project has held an annual software competition, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), in which software programs compete to correctly classify and detect objects and scenes.
ResNet (Residual Neural Network) successfully trained a 152-layer neural network by using residual units (ResNet units) and won the ILSVRC 2015 competition. The structure of ResNet greatly accelerates the training of neural networks, and the accuracy of the model is also considerably improved.
That is, ImageNet is an open, free, large-scale picture database containing about 22,000 categories of pictures, and ResNet is a picture classification model trained on the data in ImageNet.
For example, after obtaining the ResNet model, the electronic device may first use the pictures that need cleaning (such as the 1000 pictures above) to train the ResNet model and obtain the second model. After obtaining the second model, the electronic device may remove the fully connected layer located at the last layer of the second model to obtain the third model, and determine the third model as the preset feature extraction model. It should be noted that the last layer of the ResNet model is a fully connected layer whose role in the model is to classify pictures, while the role of the other neural network layers in the ResNet model is to extract features. The neural network layers that remain after the last fully connected layer of the second model is removed can therefore be used as a feature extraction model. In addition, the ResNet model is trained once more on the pictures that need cleaning because ResNet is a fairly general classification model. Training it again on those pictures yields a second model that is better targeted at classifying them, so that the third model extracts the features of the pictures that need cleaning more accurately.
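The idea of dropping the final fully connected layer can be shown with a toy model built from plain Python functions. This is not ResNet and uses no deep learning library: the three layers, their arithmetic, and all names are assumptions made purely to illustrate that the truncated model returns features instead of a classification output.

```python
def make_model(layers):
    """Compose a list of layer functions into a single callable model."""
    def run(x):
        for layer in layers:
            x = layer(x)
        return x
    return run


def scale(x):
    return [2.0 * v for v in x]


def relu(x):
    return [max(0.0, v) for v in x]


def fully_connected(x):
    # Toy classification head: collapses the features into one score.
    return [sum(x)]


second_model_layers = [scale, relu, fully_connected]
second_model = make_model(second_model_layers)

# Third model: every layer except the last fully connected layer.
third_model = make_model(second_model_layers[:-1])

picture = [1.0, -2.0, 3.0]
print(second_model(picture))  # classification-style output
print(third_model(picture))   # feature vector used for the later steps
```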
204. When the computing capability of the electronic device is not lower than the preset threshold, the electronic device obtains a fourth model and uses the fourth model to extract the feature of each piece of data in the first data set and the second data set, where the feature extraction accuracy of the fourth model is higher than that of the preset feature extraction model.
For example, if the electronic device detects that its current computing capability is not lower than the preset threshold, it may obtain the fourth model and use it to extract the feature of each picture in the first data set and the second data set, where the feature extraction accuracy of the fourth model is higher than that of the preset feature extraction model.
For example, the fourth model may be a single model with a more complex structure than the ResNet model used in this embodiment, such as Inception-Resnet-v2. Alternatively, the fourth model may be a fusion (stacking) of multiple models. For example, the structure of the fourth model may be as shown in Fig. 4: the picture data are input to multiple first-level models (Level 1) simultaneously, the features extracted by the first-level models are then used as the input of a second-level model, and the output of the second-level model is finally used as the output feature. Model 1, Model 2, and Model 3 may be common deep learning models such as ResNet, Inception, or MobileNet, while Model 4 may be a simpler conventional machine learning model such as linear regression. Multi-model fusion combines the strengths of several models and extracts features more effectively, so the subsequent cleaning works better. However, it also consumes more resources, and it is suitable for use when the electronic device has sufficient computing capability.
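The two-level data flow of such a stacked model can be sketched with toy functions. The three first-level functions below merely stand in for deep models such as ResNet, Inception, or MobileNet, and the fixed linear map stands in for a fitted linear regression; all names, weights, and numbers are assumptions made for illustration.

```python
def model_1(x):
    return [sum(x) / len(x)]  # toy stand-in for a first-level deep model


def model_2(x):
    return [max(x)]           # toy stand-in for a first-level deep model


def model_3(x):
    return [min(x)]           # toy stand-in for a first-level deep model


def model_4(level1_features, weights=(0.5, 0.25, 0.25), bias=0.0):
    """Second-level model: a fixed linear map standing in for linear regression."""
    return [bias + sum(w * v for w, v in zip(weights, level1_features))]


def stacked_features(x, level1_models, level2_model):
    """Feed x to every first-level model, concatenate, then apply the second level."""
    level1_out = []
    for model in level1_models:
        level1_out.extend(model(x))
    return level2_model(level1_out)


picture = [1.0, 2.0, 3.0]
output_feature = stacked_features(picture, [model_1, model_2, model_3], model_4)
print(output_feature)
```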
In one embodiment, the computing capability of the electronic device may be measured by, for example, the CPU usage, and/or the remaining running memory capacity, and/or the proportion of the remaining running memory capacity in the total running memory capacity.
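One hypothetical way to turn these quantities into a single capability value is sketched below. Taking the smaller of the idle CPU share and the free-memory share is an assumption chosen for illustration; the description does not specify how the quantities are combined, and the inputs are passed in rather than read from the operating system.

```python
def capability_score(cpu_usage, free_memory, total_memory):
    """Return a value in [0, 1]; higher means more spare computing capability.

    cpu_usage is a fraction in [0, 1]; free_memory and total_memory use the
    same unit. The min() combination is an illustrative assumption.
    """
    idle_cpu = 1.0 - cpu_usage
    free_share = free_memory / total_memory
    return min(idle_cpu, free_share)


# Example: 25% CPU usage, 2 GB of 8 GB running memory still free.
print(capability_score(0.25, 2, 8))
```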
205. The electronic device obtains correctness information for the class label of each piece of data in the second data set.
For example, after dividing the data into the first data set and the second data set, the electronic device may obtain the correctness information of the class label of each picture in the second data set.
For example, if the second data set contains 200 pictures, manual inspection may first be used to determine whether the class label of each of these 200 pictures is correct. If the class label of a picture is correct, the inspector may use the electronic device to mark the picture with information indicating that the class label is correct, such as the digit "1" or the letter "T". If the class label of a picture is wrong, the inspector may use the electronic device to mark the picture with information indicating that the class label is wrong, such as the digit "0" or the letter "F". In this way, the electronic device obtains the class label correctness information of these 200 pictures.
206. The electronic device trains a preset binary classification model according to the correctness information of the class label of each piece of data in the second data set and the feature of each piece of data, to obtain a target model.
For example, after obtaining the correctness information of the class label of each picture in the second data set, the electronic device may train the preset binary classification model according to that correctness information and the feature of each picture in the second data set, so as to obtain a trained model, namely the target model.
For example, the preset binary classification model may be a support vector machine (Support Vector Machine, SVM). After obtaining the class label correctness information of the 200 pictures in the second data set, the electronic device may input the class label correctness information and the features of these 200 pictures into the preset SVM model as input data to train the SVM model, so as to obtain the target model.
In one embodiment, for example, if picture Pi is a picture in the second data set, the feature fi of picture Pi and the class label correctness information bi of picture Pi can be expressed in the form <fi, bi>, and <fi, bi> can then serve as one training sample of the SVM model.
It can be understood that the target model obtained by training is a model that can output, from the feature of a picture, information on whether the class label of the picture is correct.
In some embodiments, the preset binary classification model may also be a model such as a multi-layer perceptron (Multi-Layer Perceptron), a decision tree (Decision Tree), or a random forest (Random Forest).
207. The electronic device uses the target model and the feature of each piece of data in the first data set to obtain first target data, namely the data in the first data set whose class label is judged to be correct.
For example, after obtaining the target model, the electronic device may use the target model and the feature of each picture in the first data set to obtain the pictures in the first data set whose class label is judged to be correct, namely the first target data.
For example, the first data set contains 800 pictures. After obtaining the target model, the electronic device may input the feature of each of these 800 pictures into the target model, which outputs, according to the feature of each picture, information on whether the class label of that picture is correct. The pictures whose class label is judged to be correct are determined as the first target data. It can be understood that the first target data may include multiple pictures.
208. The electronic device obtains second target data according to the first target data and the data in the second data set whose class label is correct.
For example, after obtaining the first target data, the electronic device may also obtain the pictures in the second data set whose class label is correct, and obtain the second target data according to the first target data and those pictures. It can be understood that the second target data are the clean pictures obtained after data cleaning.
For example, in 205 the electronic device obtains the correctness information of the class labels of the 200 pictures in the second data set, and 190 of these 200 pictures have a correct class label. In 207, the electronic device uses the target model to determine that 790 of the 800 pictures in the first data set have a correct class label. The electronic device may then merge the 190 pictures from the second data set with the 790 pictures from the first data set to obtain 980 pictures. These 980 pictures can be regarded as the clean pictures obtained after data cleaning.
In some embodiments, the embodiment of the present application can also include:
Electronic equipment is obtained in first data set using the feature of each data in object module and the first data set
Class label is judged as the third target data of mistake;
Electronic equipment deletes the data of class label mistake in third target data and the second data set.
For example, electronic equipment can be by each figure in 800 pictures in the first data set after obtaining object module
The feature of piece is input in the object module, exports the classification mark of the picture according to the feature of each picture by the object module
Label whether correct information, and the picture of the class label mistake determined is determined as third target data.It is understood that
It is that third target data may include plurality of pictures.
After obtaining the third target data, the electronic device can also obtain the pictures in the second data set whose class labels are wrong. The electronic device can then delete the third target data together with those wrongly labeled pictures in the second data set. It can be understood that the third target data and the wrongly labeled pictures in the second data set are the "dirty data" removed by the data cleaning process. A dirty data item is one whose carried class label differs from its actual class. For example, if a picture of a tree is mistakenly labeled with the flower category, that picture is dirty data.
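The deletion step above can be sketched as a partition of both data sets into kept and removed items. This is a minimal sketch rather than the patent's implementation: the function name `partition_clean_dirty`, the dictionary layout, and the predicate `is_label_correct` (which stands in for the trained target model) are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Feature = Sequence[float]


def partition_clean_dirty(
    first_set: Dict[str, Feature],
    second_set_flags: Dict[str, int],
    is_label_correct: Callable[[Feature], bool],
) -> Tuple[List[str], List[str]]:
    """Return (clean, dirty) item names across both data sets.

    first_set maps an item name to its feature; the hypothetical predicate
    is_label_correct stands in for the target model's judgement.
    second_set_flags maps an item name to the manual check result
    (1 = class label correct, 0 = class label wrong).
    """
    clean: List[str] = []
    dirty: List[str] = []
    # First data set: the model decides. Rejected items correspond to the
    # "third target data" in the text.
    for name, feature in first_set.items():
        (clean if is_label_correct(feature) else dirty).append(name)
    # Second data set: the manually assigned flags decide.
    for name, flag in second_set_flags.items():
        (clean if flag == 1 else dirty).append(name)
    return clean, dirty
```

Only the `dirty` list would be deleted; the `clean` list is what remains after cleaning.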
In some embodiments, when the present application divides the data into the first data set and the second data set, the two data sets may satisfy the following condition:
The ratio between the number of data items in the first data set and the number in the second data set is a preset ratio, and the first data set contains more data items than the second data set.
For example, the preset ratio may be 8:2, 9:1, or 7.5:2.5, and the number of data items in the first data set is greater than the number of data items in the second data set.
In another embodiment, the number of data items with correct class labels and the number of data items with wrong class labels in the second data set may both satisfy the following condition: each is greater than or equal to 100. For example, the second data set contains no fewer than 100 pictures whose class labels are correct and no fewer than 100 pictures whose class labels are wrong.
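The two conditions above (a preset ratio with a larger first data set, and at least 100 items of each kind in the second data set) can be sketched as follows. The helper names `split_data` and `second_set_is_usable`, the fixed random seed, and the rounding rule are assumptions for illustration; the text does not specify them.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_data(
    items: Sequence[T],
    ratio: Tuple[float, float] = (8, 2),
    seed: int = 0,
) -> Tuple[List[T], List[T]]:
    """Randomly split items into (first_set, second_set) by a preset ratio."""
    first_part, second_part = ratio
    if first_part <= second_part:
        # The text requires the first data set to be the larger one.
        raise ValueError("the first data set must be larger than the second")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded only for reproducibility
    n_first = round(len(shuffled) * first_part / (first_part + second_part))
    return shuffled[:n_first], shuffled[n_first:]


def second_set_is_usable(correct_flags: Sequence[int], minimum: int = 100) -> bool:
    """Check that at least `minimum` flags are 1 (correct) and at least
    `minimum` flags are 0 (wrong)."""
    n_correct = sum(1 for flag in correct_flags if flag == 1)
    n_wrong = sum(1 for flag in correct_flags if flag == 0)
    return n_correct >= minimum and n_wrong >= minimum
```

With 1000 items and a ratio of 8:2, `split_data` yields 800 and 200 items. The minimum-count check belongs to a separate embodiment, so a second data set with 190 correct and 10 wrong labels would not satisfy it.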
Please refer to Fig. 5 to Fig. 7, which are scenario diagrams of the data processing method provided by the embodiments of the present application.
For example, as shown in Fig. 5, a user currently needs to perform data cleaning on the 1000 pictures in picture set P, all of which are labeled with the same class label. The electronic device can first obtain these 1000 pictures.
The electronic device can then randomly divide the 1000 pictures into a first picture set and a second picture set, as shown in Fig. 6. The first picture set includes 800 pictures and the second picture set includes 200 pictures.
Next, the electronic device can use the preset feature extraction model to extract a feature from each picture in the first picture set and the second picture set. For example, feature F_i is the feature of picture P_i, where i is an integer greater than or equal to 1.
After the feature of each picture is extracted, whether the class label of each of the 200 pictures in the second picture set is correct can be determined by manual inspection. If a picture's class label is correct, the inspector can use the electronic device to mark the picture with information indicating a correct class label, for example the digit "1". If a picture's class label is wrong, the inspector can use the electronic device to mark the picture with information indicating a wrong class label, for example the digit "0". In this way, the electronic device obtains the class-label correctness information for the 200 pictures. For example, 190 of the 200 pictures are marked with the digit "1"; that is, manual inspection found that the class labels of 190 pictures in the second picture set are correct.
After obtaining the class-label correctness information for the 200 pictures in the second picture set, the electronic device can feed that information, together with the features of the 200 pictures, as input data into a preset SVM model to train the SVM model and thereby obtain the target model. For example, if picture P_i is a picture in the second data set, the feature f_i of picture P_i and the class-label correctness information b_i of picture P_i can be written in the form <f_i, b_i>, and <f_i, b_i> can then serve as one training sample for the SVM model.
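To make the role of the <f_i, b_i> samples concrete, the sketch below trains a binary classifier on such pairs. It deliberately swaps the SVM of the example for a one-level decision tree (a decision stump), because decision trees are among the model families the document lists and a stump can be written without third-party libraries. The class name `DecisionStump` and the toy features are assumptions for illustration only.

```python
from typing import Sequence, Tuple

# One training sample: (feature f_i, correctness flag b_i), with b_i in {0, 1}.
Sample = Tuple[Sequence[float], int]


class DecisionStump:
    """One-level decision tree that thresholds a single feature dimension.

    Illustrative stand-in for the preset binary classification model;
    it is not the SVM used in the text's example.
    """

    def fit(self, samples: Sequence[Sample]) -> "DecisionStump":
        ones = sum(b for _, b in samples)
        # Fallback rule: always predict the majority flag.
        self.dim, self.thr = 0, float("-inf")
        self.above_label = 1 if 2 * ones >= len(samples) else 0
        best_errors = min(ones, len(samples) - ones)
        for dim in range(len(samples[0][0])):
            values = sorted({f[dim] for f, _ in samples})
            # Candidate thresholds: midpoints between consecutive values.
            for thr in ((a + b) / 2 for a, b in zip(values, values[1:])):
                for above_label in (0, 1):
                    errors = sum(
                        1
                        for f, b in samples
                        if (above_label if f[dim] > thr else 1 - above_label) != b
                    )
                    if errors < best_errors:
                        best_errors = errors
                        self.dim, self.thr = dim, thr
                        self.above_label = above_label
        return self

    def predict(self, feature: Sequence[float]) -> int:
        """Return 1 if the class label is judged correct, else 0."""
        if feature[self.dim] > self.thr:
            return self.above_label
        return 1 - self.above_label
```

After `fit`, calling `predict` on the feature of each item in the first data set gives the judgement used to select the first target data.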
After obtaining the target model, the electronic device can input the feature of each of the 800 pictures in the first picture set into the target model. The target model outputs, from the feature of each picture, whether that picture's class label is correct, and the pictures whose class labels are determined to be correct are taken as the first target data. For example, the electronic device finally determines that the class labels of 790 pictures in the first picture set are correct.
The electronic device can then merge the 190 pictures in the second picture set whose class labels are correct with the 790 pictures in the first picture set whose class labels are judged to be correct, obtaining 980 pictures. These 980 pictures can be regarded as the clean pictures obtained after data cleaning. For example, as shown in Fig. 7, the electronic device combines the 980 pictures into one picture set.
Please also refer to Fig. 8, which is a fourth flow diagram of the data processing method provided in this embodiment.
In this embodiment, the electronic device can use a trained binary classification model to perform the data cleaning. Because the trained binary classification model can quickly identify the data whose class labels are correct, this embodiment can obtain clean data quickly. Compared with the related-art approach, in which a person browses the data one by one to check whether the label information is wrong, this embodiment removes a large amount of manual work, improves the efficiency of data cleaning, and reduces its cost.
In addition, because this embodiment performs the data cleaning by binary classification, it can achieve accuracy close to that of manual cleaning. Moreover, the cleaning process of the approach provided in this embodiment is traceable, so other personnel can check the quality of the data cleaning by reviewing the cleaning process.
Please refer to Fig. 9, which is a structural diagram of the data processing apparatus provided by the embodiments of the present application. The data processing apparatus 300 may include: a first acquisition module 301, a division module 302, an extraction module 303, a second acquisition module 304, a training module 305, a third acquisition module 306, and a processing module 307.
The first acquisition module 301 is configured to obtain multiple data items, the multiple data items carrying the same class label;
The division module 302 is configured to divide the multiple data items into a first data set and a second data set;
The extraction module 303 is configured to extract the feature of each data item in the first data set and the second data set;
The second acquisition module 304 is configured to obtain correctness information for the class label of each data item in the second data set;
The training module 305 is configured to train a preset binary classification model according to the correctness information of the class label and the feature of each data item in the second data set, to obtain a target model;
The third acquisition module 306 is configured to use the target model and the feature of each data item in the first data set to obtain, from the first data set, first target data whose class labels are judged to be correct;
The processing module 307 is configured to obtain second target data from the first target data and the data in the second data set whose class labels are correct.
In one embodiment, the first acquisition module 301 can be configured to:
When the multiple data items are pictures, obtain a first model, the first model being a ResNet model trained on ImageNet;
Train the ResNet model using the multiple data items to obtain a second model;
Remove the fully connected layer located at the last layer of the second model to obtain a third model, and determine the third model to be the preset feature extraction model;
The extraction module 303 can then be configured to: use the preset feature extraction model to extract the feature of each data item in the first data set and the second data set.
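The idea of removing the last fully connected layer, so that the remaining network outputs a feature instead of class scores, can be shown with a toy network. This is a dependency-free sketch with made-up weights, not a ResNet; in practice a deep-learning framework would be used, and the helper names (`make_linear`, `relu`, `run`, `drop_last_layer`) are assumptions for illustration.

```python
from typing import Callable, List, Sequence

Layer = Callable[[List[float]], List[float]]


def make_linear(weights: Sequence[Sequence[float]], bias: Sequence[float]) -> Layer:
    """Build a fully connected layer computing y = W x + b."""
    def layer(values: List[float]) -> List[float]:
        return [
            sum(w * v for w, v in zip(row, values)) + b
            for row, b in zip(weights, bias)
        ]
    return layer


def relu(values: List[float]) -> List[float]:
    """Element-wise rectified linear activation."""
    return [max(0.0, v) for v in values]


def run(layers: Sequence[Layer], values: List[float]) -> List[float]:
    """Apply the layers in order to the input values."""
    for layer in layers:
        values = layer(values)
    return values


def drop_last_layer(layers: Sequence[Layer]) -> List[Layer]:
    """Return the model without its final layer (the 'third model')."""
    return list(layers[:-1])


# Toy "second model": hidden layer, activation, final fully connected layer.
second_model = [
    make_linear([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    relu,
    make_linear([[1.0, 1.0]], [0.0]),
]
feature_extractor = drop_last_layer(second_model)
```

Running `feature_extractor` on an input returns the activations that fed the removed layer, which is what the text uses as the feature of a picture.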
In one embodiment, the extraction module 303 can be configured to:
When the computing capability of the electronic device is lower than a preset threshold, use the preset feature extraction model to extract the feature of each data item in the first data set and the second data set.
In one embodiment, the extraction module 303 can be configured to:
When the computing capability of the electronic device is not lower than the preset threshold, obtain a fourth model and use the fourth model to extract the feature of each data item in the first data set and the second data set, where the feature extraction accuracy of the fourth model is higher than that of the preset feature extraction model.
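The choice between the two extractors reduces to a single comparison against the preset threshold. In the sketch below, the function name `choose_extractor` and the numeric representation of computing capability are assumptions; the text does not say how capability is measured.

```python
from typing import Callable, List, Sequence

Extractor = Callable[[Sequence[float]], List[float]]


def choose_extractor(
    capability: float,
    threshold: float,
    preset_extractor: Extractor,
    fourth_model: Extractor,
) -> Extractor:
    """Pick the feature extractor from the device's computing capability.

    Below the threshold the preset (lighter) extractor is used; at or above
    it, the more accurate fourth model is used. How `capability` is scored
    is a hypothetical detail not given in the text.
    """
    if capability < threshold:
        return preset_extractor
    return fourth_model
```

A capability exactly equal to the threshold selects the fourth model, matching the "not lower than" wording.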
In one embodiment, the ratio between the number of data items in the first data set and the number in the second data set is a preset ratio, and the first data set contains more data items than the second data set.
In one embodiment, the processing module 307 can also be configured to:
Use the target model and the feature of each data item in the first data set to obtain, from the first data set, third target data whose class labels are judged to be wrong;
Delete the third target data and the data in the second data set whose class labels are wrong.
In one embodiment, the preset binary classification model includes at least a support vector machine, a multilayer perceptron, a decision tree, or a random forest.
An embodiment of the present application provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed on a computer, the computer is caused to execute the flow of the data processing method provided in this embodiment.
An embodiment of the present application also provides an electronic device including a memory and a processor, the processor being configured to execute the flow of the data processing method provided in this embodiment by calling the computer program stored in the memory.
For example, the electronic device can be a mobile terminal such as a tablet computer or a smartphone. Please refer to Fig. 10, which is a structural diagram of the electronic device provided by the embodiments of the present application.
The electronic device 400 may include components such as a display screen 401, a memory 402, and a processor 403. Those skilled in the art will understand that the electronic device structure shown in Fig. 10 does not limit the electronic device, which may include more or fewer components than illustrated, combine certain components, or use a different component arrangement.
The display screen 401 can be used to display information such as pictures and text.
The memory 402 can be used to store application programs and data. The application programs stored in the memory 402 contain executable code and can form various functional modules. The processor 403 executes various functional applications and data processing by running the application programs stored in the memory 402.
The processor 403 is the control center of the electronic device. It connects the parts of the entire electronic device through various interfaces and lines, and it executes the various functions of the electronic device and processes data by running or executing the application programs stored in the memory 402 and calling the data stored in the memory 402, thereby monitoring the electronic device as a whole.
In this embodiment, the processor 403 in the electronic device loads the executable code corresponding to the processes of one or more application programs into the memory 402 according to the following instructions, and the processor 403 runs the application programs stored in the memory 402, thereby executing:
Obtaining multiple data items, the multiple data items carrying the same class label;
Dividing the multiple data items into a first data set and a second data set;
Extracting the feature of each data item in the first data set and the second data set;
Obtaining correctness information for the class label of each data item in the second data set;
Training a preset binary classification model according to the correctness information of the class label and the feature of each data item in the second data set, to obtain a target model;
Using the target model and the feature of each data item in the first data set to obtain, from the first data set, first target data whose class labels are judged to be correct;
Obtaining second target data from the first target data and the data in the second data set whose class labels are correct.
Please refer to Fig. 11. The electronic device 400 may include components such as a display screen 401, a memory 402, a processor 403, an input unit 404, and a power supply 405.
The display screen 401 can be used to display information such as pictures and text.
The memory 402 can be used to store application programs and data. The application programs stored in the memory 402 contain executable code and can form various functional modules. The processor 403 executes various functional applications and data processing by running the application programs stored in the memory 402.
The processor 403 is the control center of the electronic device. It connects the parts of the entire electronic device through various interfaces and lines, and it executes the various functions of the electronic device and processes data by running or executing the application programs stored in the memory 402 and calling the data stored in the memory 402, thereby monitoring the electronic device as a whole.
The input unit 404 can be used to receive input digits, character information, or user characteristic information (such as a fingerprint), and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.
The power supply 405 can be used to supply power to each component.
In this embodiment, the processor 403 in the electronic device loads the executable code corresponding to the processes of one or more application programs into the memory 402 according to the following instructions, and the processor 403 runs the application programs stored in the memory 402, thereby executing:
Obtaining multiple data items, the multiple data items carrying the same class label;
Dividing the multiple data items into a first data set and a second data set;
Extracting the feature of each data item in the first data set and the second data set;
Obtaining correctness information for the class label of each data item in the second data set;
Training a preset binary classification model according to the correctness information of the class label and the feature of each data item in the second data set, to obtain a target model;
Using the target model and the feature of each data item in the first data set to obtain, from the first data set, first target data whose class labels are judged to be correct;
Obtaining second target data from the first target data and the data in the second data set whose class labels are correct.
In one embodiment, the processor 403 can also be configured to: when the multiple data items are pictures, obtain a first model, the first model being a ResNet model trained on ImageNet; train the ResNet model using the multiple data items to obtain a second model; remove the fully connected layer located at the last layer of the second model to obtain a third model, and determine the third model to be the preset feature extraction model;
Then, when extracting the feature of each data item in the first data set and the second data set, the processor 403 can execute: using the preset feature extraction model to extract the feature of each data item in the first data set and the second data set.
In one embodiment, when using the preset feature extraction model to extract the feature of each data item in the first data set and the second data set, the processor 403 can execute: when the computing capability of the electronic device is lower than a preset threshold, using the preset feature extraction model to extract the feature of each data item in the first data set and the second data set.
In one embodiment, the processor 403 can also execute: when the computing capability of the electronic device is not lower than the preset threshold, obtaining a fourth model and using the fourth model to extract the feature of each data item in the first data set and the second data set, where the feature extraction accuracy of the fourth model is higher than that of the preset feature extraction model.
In one embodiment, the ratio between the number of data items in the first data set and the number in the second data set is a preset ratio, and the first data set contains more data items than the second data set.
In one embodiment, the processor 403 can also execute: using the target model and the feature of each data item in the first data set to obtain, from the first data set, third target data whose class labels are judged to be wrong; and deleting the third target data and the data in the second data set whose class labels are wrong.
In one embodiment, the preset binary classification model includes at least a support vector machine, a multilayer perceptron, a decision tree, or a random forest.
In the above embodiments, the description of each embodiment has its own emphasis. For parts not detailed in a given embodiment, refer to the detailed description of the data processing method above, which is not repeated here.
The data processing apparatus provided by the embodiments of the present application and the data processing method in the foregoing embodiments belong to the same concept. Any method provided in the data processing method embodiments can run on the data processing apparatus; its specific implementation is detailed in the data processing method embodiments and is not repeated here.
It should be noted that, for the data processing method described in the embodiments of the present application, those of ordinary skill in the art will understand that all or part of the flow of the method can be completed by a computer program controlling the relevant hardware. The computer program can be stored in a computer-readable storage medium, for example in a memory, and executed by at least one processor; its execution may include the flow of the embodiments of the data processing method. The storage medium can be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
For the data processing apparatus of the embodiments of the present application, the functional modules can be integrated in one processing chip, each module can exist physically on its own, or two or more modules can be integrated in one module. The integrated module can be implemented in hardware or as a software functional module. If the integrated module is implemented as a software functional module and sold or used as an independent product, it can also be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disc.
The data processing method, apparatus, storage medium, and electronic device provided by the embodiments of the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is only intended to help understand the method of the present application and its core idea. Meanwhile, those skilled in the art may, following the idea of the present application, make changes to the specific implementations and the scope of application. In summary, the contents of this specification should not be construed as limiting the present application.
Claims (10)
1. A data processing method, characterized by comprising:
Obtaining multiple data items, the multiple data items carrying the same class label;
Dividing the multiple data items into a first data set and a second data set;
Extracting the feature of each data item in the first data set and the second data set;
Obtaining correctness information for the class label of each data item in the second data set;
Training a preset binary classification model according to the correctness information of the class label and the feature of each data item in the second data set, to obtain a target model;
Using the target model and the feature of each data item in the first data set to obtain, from the first data set, first target data whose class labels are judged to be correct;
Obtaining second target data from the first target data and the data in the second data set whose class labels are correct.
2. The data processing method according to claim 1, characterized in that the method further comprises:
When the multiple data items are pictures, obtaining a first model, the first model being a ResNet model trained on ImageNet;
Training the ResNet model using the multiple data items to obtain a second model;
Removing the fully connected layer located at the last layer of the second model to obtain a third model, and determining the third model to be a preset feature extraction model;
The extracting of the feature of each data item in the first data set and the second data set comprises: using the preset feature extraction model to extract the feature of each data item in the first data set and the second data set.
3. The data processing method according to claim 2, characterized in that using the preset feature extraction model to extract the feature of each data item in the first data set and the second data set comprises:
When the computing capability of the electronic device is lower than a preset threshold, using the preset feature extraction model to extract the feature of each data item in the first data set and the second data set.
4. The data processing method according to claim 3, characterized in that the method further comprises:
When the computing capability of the electronic device is not lower than the preset threshold, obtaining a fourth model and using the fourth model to extract the feature of each data item in the first data set and the second data set, where the feature extraction accuracy of the fourth model is higher than that of the preset feature extraction model.
5. The data processing method according to claim 1, characterized in that the ratio between the number of data items in the first data set and the number in the second data set is a preset ratio, and the first data set contains more data items than the second data set.
6. The data processing method according to claim 1, characterized in that the method further comprises:
Using the target model and the feature of each data item in the first data set to obtain, from the first data set, third target data whose class labels are judged to be wrong;
Deleting the third target data and the data in the second data set whose class labels are wrong.
7. The data processing method according to claim 1, characterized in that the preset binary classification model includes at least a support vector machine, a multilayer perceptron, a decision tree, or a random forest.
8. A data processing apparatus, characterized by comprising:
A first acquisition module, configured to obtain multiple data items, the multiple data items carrying the same class label;
A division module, configured to divide the multiple data items into a first data set and a second data set;
An extraction module, configured to extract the feature of each data item in the first data set and the second data set;
A second acquisition module, configured to obtain correctness information for the class label of each data item in the second data set;
A training module, configured to train a preset binary classification model according to the correctness information of the class label and the feature of each data item in the second data set, to obtain a target model;
A third acquisition module, configured to use the target model and the feature of each data item in the first data set to obtain, from the first data set, first target data whose class labels are judged to be correct;
A processing module, configured to obtain second target data from the first target data and the data in the second data set whose class labels are correct.
9. A storage medium on which a computer program is stored, characterized in that when the computer program is executed on a computer, the computer is caused to execute the method according to any one of claims 1 to 7.
10. An electronic device including a memory and a processor, characterized in that the processor is configured to execute the method according to any one of claims 1 to 7 by calling the computer program stored in the memory.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910713784.5A CN110490237B (en) | 2019-08-02 | 2019-08-02 | Data processing method and device, storage medium and electronic equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910713784.5A CN110490237B (en) | 2019-08-02 | 2019-08-02 | Data processing method and device, storage medium and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110490237A true CN110490237A (en) | 2019-11-22 |
CN110490237B CN110490237B (en) | 2022-05-17 |
Family
ID=68549273
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910713784.5A Active CN110490237B (en) | 2019-08-02 | 2019-08-02 | Data processing method and device, storage medium and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110490237B (en) |
Also Published As
Publication number | Publication date |
---|---|
CN110490237B (en) | 2022-05-17 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||