CN115471714A - Data processing method, data processing device, computing equipment and computer readable storage medium - Google Patents

Data processing method, data processing device, computing equipment and computer readable storage medium

Info

Publication number
CN115471714A
CN115471714A
Authority
CN
China
Prior art keywords
data set
processed
weight distribution
sample weight
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110574231.3A
Other languages
Chinese (zh)
Inventor
张诗杰
朱森华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Cloud Computing Technologies Co Ltd
Original Assignee
Huawei Cloud Computing Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Cloud Computing Technologies Co Ltd filed Critical Huawei Cloud Computing Technologies Co Ltd
Priority to CN202110574231.3A priority Critical patent/CN115471714A/en
Priority to PCT/CN2022/083841 priority patent/WO2022247448A1/en
Publication of CN115471714A publication Critical patent/CN115471714A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present disclosure provide a data processing method, a data processing apparatus, a computing device and a computer-readable storage medium. In the method, a labeled unrelated data set is constructed based on a data set to be processed; the unrelated data set is divided into a first data set with a first sample weight distribution and a second data set with a second sample weight distribution, where the first and second sample weight distributions are determined based on the sample weights of the data items to be processed in the data set to be processed; a classification model is trained based on the first data set and the first sample weight distribution; and the classification model is evaluated based on the second data set and the second sample weight distribution to obtain an evaluation result, the evaluation result indicating the bias significance of the data set to be processed under that sample weight distribution. In this way, embodiments of the present disclosure can determine the bias significance of a data set more accurately.

Description

Data processing method, data processing device, computing equipment and computer readable storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence, and more particularly, to a data processing method, apparatus, computing device, and computer-readable storage medium.
Background
Data set bias is a widespread problem that has a strongly negative impact on machine learning, especially deep learning, and it is difficult to perceive and easy to overlook. In particular, for scenarios with high requirements on model safety, a model trained on a data set containing bias may cause serious accidents in actual use.
At present, data set bias is checked by guesswork or based on experience. Such a scheme consumes a large amount of human effort, is both inefficient and inaccurate, and cannot meet practical requirements.
Disclosure of Invention
Example embodiments of the present disclosure provide a data processing method including a scheme for evaluating a data set bias, which enables more accurate checking of the data set bias.
In a first aspect, a data processing method is provided. The method comprises the following steps: constructing an unrelated data set based on the data set to be processed, wherein the unrelated data set comprises unrelated data items with labels, and the labels of the unrelated data items are determined based on the labels of the data items to be processed in the data set to be processed; dividing the unrelated data set into a first data set and a second data set, wherein the first data set has a first sample weight distribution, the second data set has a second sample weight distribution, and the first sample weight distribution and the second sample weight distribution are determined based on the sample weights of the data items to be processed in the data set to be processed; training the classification model based on the first data set and the first sample weight distribution; and evaluating the classification model based on the second data set and the second sample weight distribution to obtain an evaluation result, wherein the evaluation result indicates the bias significance of the data set to be processed with the sample weight distribution.
In this way, embodiments of the present disclosure can assess the bias significance of a data set more accurately. Such an evaluation scheme, among other things, makes it easier for users to work with the data set.
In some embodiments of the first aspect, the method further comprises: if the evaluation result is greater than a preset threshold, updating the sample weight distribution of the data set to be processed; and repeating the training and the evaluation based on the updated sample weight distribution until the evaluation result is not greater than the preset threshold.
Therefore, the embodiment of the disclosure can update the sample weight distribution of the data set to be processed based on the trained classification model, so as to obtain the recommended sample weight distribution.
In some embodiments of the first aspect, updating the sample weight distribution comprises: updating a portion of the sample weight distribution, such that the second sample weight distribution is updated while the first sample weight distribution is left unchanged.
In some embodiments of the first aspect, the method further comprises: taking the sample weight distribution obtained when the evaluation result is not greater than the preset threshold as the recommended sample weight distribution.
Therefore, embodiments of the present disclosure can update the sample weight distribution while iteratively training the classification model, and can observe how the bias of the data set changes as the sample weight distribution is updated, so that the data set to be processed can be examined iteratively and an effective, highly accurate recommended sample weight distribution can be obtained.
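Purely as an illustrative sketch of this iterative flow (using scikit-learn and synthetic background features; the threshold value, the weight-update rule, and all variable names below are assumptions rather than anything prescribed by the disclosure), the loop might look as follows:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Toy stand-in for the unrelated data set: feature vectors extracted from
# background-only items, with labels inherited from the original
# to-be-processed items. Bias is injected deliberately for the demo.
n = 1000
labels = rng.integers(0, 2, size=n)
features = rng.normal(size=(n, 8))
features[:, 0] += 0.8 * labels              # backgrounds leak the label

sample_weights = np.ones(n)                 # initial sample weight distribution
threshold = 0.55                            # preset bias-significance threshold

for round_idx in range(10):
    # Divide the unrelated data set into a first (training) data set and a
    # second (test) data set; each part inherits the current sample weights.
    split = int(0.9 * n)
    train_x, test_x = features[:split], features[split:]
    train_y, test_y = labels[:split], labels[split:]
    train_w, test_w = sample_weights[:split], sample_weights[split:]

    # Train the classification model with the first sample weight distribution.
    model = LogisticRegression(max_iter=1000)
    model.fit(train_x, train_y, sample_weight=train_w)

    # Evaluate with the second sample weight distribution; the weighted
    # accuracy serves as the evaluation result indicating bias significance.
    bias_score = accuracy_score(test_y, model.predict(test_x),
                                sample_weight=test_w)
    print(f"round {round_idx}: weighted background accuracy = {bias_score:.3f}")

    if bias_score <= threshold:
        break                               # bias is no longer significant
    # Otherwise update the sample weight distribution (here: a random
    # multiplicative perturbation) and repeat training and evaluation.
    sample_weights = np.clip(
        sample_weights * rng.uniform(0.5, 1.5, size=n), 0.1, 10.0)

recommended_weights = sample_weights        # distribution at termination
```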
In some embodiments of the first aspect, the method further comprises: adding data items to or deleting data items from the data set to be processed based on the recommended sample weight distribution, so as to construct an unbiased data set.
In this way, data items can be added to or deleted from the to-be-processed data set based on the recommended sample weight distribution, so that an unbiased data set can be constructed. Further, the unbiased data set may be used to train a more robust, unbiased model for a particular task, thereby meeting practical needs.
In some embodiments of the first aspect, updating the sample weight distribution comprises at least one of: updating the sample weight distribution according to a preset rule, updating the sample weight distribution randomly, obtaining a user's modification of the sample weight distribution to update it, or optimizing the sample weight distribution with a genetic algorithm.
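As a hedged illustration of two of these options (the specific rules below are assumptions and are not prescribed by the disclosure), a preset rule might down-weight samples whose backgrounds the model already classifies correctly, while a random update simply perturbs every weight:

```python
import numpy as np

rng = np.random.default_rng(1)

def update_by_rule(weights, predictions, labels, step=0.1):
    # Hypothetical preset rule: backgrounds that the model classifies
    # correctly are the ones carrying the bias, so reduce their weight
    # and slightly increase the weight of the remaining samples.
    correct = predictions == labels
    return np.where(correct, weights * (1.0 - step), weights * (1.0 + step))

def update_randomly(weights, scale=0.2):
    # Random update: multiplicative perturbation of every sample weight.
    return weights * rng.uniform(1.0 - scale, 1.0 + scale, size=weights.shape)
```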
In some embodiments of the first aspect, constructing the unrelated data set based on the to-be-processed data set comprises: removing, from a target data item to be processed in the data set to be processed, the part associated with the label of the target data item, to obtain the remaining part of the target data item; and constructing an unrelated data item in the unrelated data set using the remaining part, the label of the unrelated data item corresponding to the label of the target data item to be processed.
In some embodiments of the first aspect, the data set to be processed is an image data set, and constructing the unrelated data set based on the data set to be processed comprises: performing image segmentation on a target data item to be processed in the data set to be processed to obtain a background image corresponding to the target data item; and constructing an unrelated data item in the unrelated data set using the background image.
In this way, the background image serves as a proxy for the bias, and a bias check can then be performed on the data set.
In some embodiments of the first aspect, the data items to be processed in the data set to be processed are video sequences, and constructing the unrelated data set based on the data set to be processed comprises: determining a binary image of a video sequence based on gradient information between one frame of the video sequence and the preceding frame; generating a background image of the video sequence based on the binary image; and constructing an unrelated data item in the unrelated data set using the background image of the video sequence.
In this way, the background image corresponding to the video sequence can be obtained by exploiting the similarity between frames of the video sequence and the fact that the background of a video sequence is essentially unchanged.
In some embodiments of the first aspect, the method further comprises: obtaining a class activation map (CAM) by inputting a target unrelated data item into the trained classification model; obtaining a superposition result by superposing the CAM on the target unrelated data item; and displaying the superposition result.
Thus, embodiments of the present disclosure provide a solution for quantitatively evaluating data set bias, such that the significance of the data set bias can be characterized unambiguously and the specific locations where the bias arises can be presented visually. The user can therefore understand the bias of the data set more intuitively and comprehensively. The scheme requires little user involvement, can be carried out automatically, and can improve processing efficiency while ensuring the accuracy of the quantitative bias evaluation.
In a second aspect, a data processing apparatus is provided. The device comprises: a construction unit configured to construct an unrelated data set based on the to-be-processed data set, the unrelated data set including unrelated data items having labels, the labels of the unrelated data items being determined based on the labels of the to-be-processed data items in the to-be-processed data set; a dividing unit configured to divide the independent data set into a first data set and a second data set, the first data set having a first sample weight distribution, the second data set having a second sample weight distribution, the first sample weight distribution and the second sample weight distribution being determined based on sample weights of data items to be processed in the data sets to be processed; a training unit configured to train a classification model based on the first data set and the first sample weight distribution; and an evaluation unit configured to evaluate the classification model based on the second data set and the second sample weight distribution to obtain an evaluation result, the evaluation result indicating the biased significance of the data set to be processed with the sample weight distribution.
In some embodiments of the second aspect, further comprising an updating unit configured to: and if the evaluation result is larger than the preset threshold value, updating the sample weight distribution of the data set to be processed.
In some embodiments of the second aspect, the updating unit is configured to: update a portion of the sample weight distribution, such that the second sample weight distribution is updated while the first sample weight distribution is left unchanged.
In some embodiments of the second aspect, wherein the updating unit is configured to: and taking the sample weight distribution when the evaluation result is not greater than the preset threshold value as a recommended sample weight distribution.
In some embodiments of the second aspect, further comprising an adjustment unit configured to: and adding or deleting the data set to be processed based on the recommended sample weight distribution so as to construct an unbiased data set.
In some embodiments of the second aspect, the updating unit is configured to update the sample weight distribution by at least one of: updating the sample weight distribution according to a preset rule, updating the sample weight distribution randomly, obtaining a user's modification of the sample weight distribution to update it, or optimizing the sample weight distribution with a genetic algorithm.
In some embodiments of the second aspect, wherein the building unit is configured to: removing a part associated with the label of the target data item to be processed from the target data item to be processed of the data set to be processed to obtain a remaining part in the target data item to be processed; and constructing an irrelevant data item in the irrelevant data set by using the residual part, wherein the label of the irrelevant data item corresponds to the label of the target data item to be processed.
In some embodiments of the second aspect, wherein the data set to be processed is an image data set, and wherein the construction unit is configured to: performing image segmentation on a target data item to be processed in a data set to be processed to obtain a background image corresponding to the target data item to be processed; and constructing a piece of extraneous data in the extraneous data set using the background image.
In some embodiments of the second aspect, wherein the data items to be processed in the data set to be processed are video sequences, and wherein the construction unit is configured to: determining a binary image of the video sequence based on gradient information between one frame of image in the video sequence and a frame of image before the one frame of image; generating a background image of the video sequence based on the binary image; and constructing an item of extraneous data in the set of extraneous data using a background image of the video sequence.
In some embodiments of the second aspect, the apparatus further comprises: an update unit configured to: obtain a CAM by inputting a target unrelated data item into the trained classification model; and obtain a superposition result by superposing the CAM on the target unrelated data item; and a display unit configured to display the superposition result.
In a third aspect, there is provided a computing device comprising a processor and a memory having stored thereon instructions for execution by the processor, the instructions when executed by the processor causing the computing device to perform: constructing an unrelated data set based on the data set to be processed, wherein the unrelated data set comprises unrelated data items with labels, and the labels of the unrelated data items are determined based on the labels of the data items to be processed in the data set to be processed; dividing the independent data set into a first data set and a second data set, wherein the first data set has a first sample weight distribution, the second data set has a second sample weight distribution, and the first sample weight distribution and the second sample weight distribution are determined based on the sample weights of the data items to be processed in the data sets to be processed; training the classification model based on the first data set and the first sample weight distribution; and evaluating the classification model based on the second data set and the second sample weight distribution to obtain an evaluation result, wherein the evaluation result indicates the bias significance of the data set to be processed with the sample weight distribution.
In some embodiments of the third aspect, the instructions, when executed by the processor, cause the computing device to implement: and if the evaluation result is larger than the preset threshold value, updating the sample weight distribution of the data set to be processed.
In some embodiments of the third aspect, the instructions, when executed by the processor, cause the computing device to implement: updating a portion of the sample weight distribution, such that the second sample weight distribution is updated while the first sample weight distribution is left unchanged.
In some embodiments of the third aspect, the instructions, when executed by the processor, cause the computing device to: and taking the sample weight distribution when the evaluation result is not greater than the preset threshold value as a recommended sample weight distribution.
In some embodiments of the third aspect, the instructions, when executed by the processor, cause the computing device to: and adding or deleting the data set to be processed based on the recommended sample weight distribution so as to construct an unbiased data set.
In some embodiments of the third aspect, the instructions, when executed by the processor, cause the apparatus to update the sample weight distribution by at least one of: updating the sample weight distribution by adopting a preset rule, updating the sample weight distribution by adopting a random mode, obtaining the modification of the sample weight distribution by a user to update the sample weight distribution, or optimizing the sample weight distribution by a genetic algorithm to update the sample weight distribution.
In some embodiments of the third aspect, the instructions, when executed by the processor, cause the computing device to: removing a part associated with the label of the target data item to be processed from the target data item to be processed of the data set to be processed to obtain a remaining part in the target data item to be processed; and constructing an irrelevant data item in the irrelevant data set by using the residual part, wherein the label of the irrelevant data item corresponds to the label of the target data item to be processed.
In some embodiments of the third aspect, wherein the set of data to be processed is an image data set, and wherein the instructions, when executed by the processor, cause the computing device to perform: performing image segmentation on a target data item to be processed in a data set to be processed to obtain a background image corresponding to the target data item to be processed; and constructing an item of extraneous data in the set of extraneous data using the background image.
In some embodiments of the third aspect, wherein the item of data to be processed in the set of data to be processed is a video sequence, and wherein the instructions, when executed by the processor, cause the computing device to implement: determining a binary image of the video sequence based on gradient information between one frame of image in the video sequence and a frame of image before the one frame of image; generating a background image of the video sequence based on the binary image; and constructing an item of extraneous data in the set of extraneous data using a background image of the video sequence.
In some embodiments of the third aspect, the instructions, when executed by the processor, cause the computing device to: obtaining a CAM by inputting the target-independent data items into the trained classification model; and obtaining a superposition result by superposing the CAM and the target-independent data item; and displaying the superposition result.
In a fourth aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs operations according to the method of the first aspect or any embodiment described above.
In a fifth aspect, a chip or chip system is provided. The chip or system of chips comprises processing circuitry configured to perform operations according to the method of the first aspect or any of the embodiments described above.
In a sixth aspect, a computer program or computer program product is provided. The computer program or computer program product is tangibly stored on a computer-readable medium and comprises computer-executable instructions that, when executed, cause an apparatus to implement operations according to the method of the first aspect or any embodiment described above.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, like or similar reference characters designate like or similar elements, and wherein:
FIG. 1 shows a schematic block diagram of a system 100 according to an embodiment of the present disclosure;
FIG. 2 shows a schematic block diagram of a data set processing module 200 according to an embodiment of the present disclosure;
FIG. 3 shows a schematic diagram of a process 300 for model training module 130 to derive recommended sample weights, according to an embodiment of the disclosure;
FIG. 4 shows a schematic diagram of a scenario 400 in which system 100 is deployed in a cloud environment, according to an embodiment of the present disclosure;
FIG. 5 shows a schematic diagram of a scenario 500 in which the system 100 is deployed in different environments, according to an embodiment of the present disclosure;
FIG. 6 illustrates a schematic structural diagram of a computing device 600, according to an embodiment of the present disclosure;
FIG. 7 shows a schematic flow chart diagram of a data processing method 700 according to an embodiment of the present disclosure;
FIG. 8 shows a schematic flow chart diagram of a process 800 of building an extraneous data item in accordance with an embodiment of the present disclosure;
FIG. 9 shows a schematic diagram of a process 900 of updating a sample weight distribution of a data set to be processed, according to an embodiment of the present disclosure;
fig. 10 shows a schematic block diagram of a data processing device 1000 according to an embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
In describing embodiments of the present disclosure, the term "include" and its variants should be interpreted as open-ended, i.e., "including but not limited to". The term "based on" should be understood as "based at least in part on". The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment". The terms "first", "second", and the like may refer to different or to the same objects. Other explicit and implicit definitions may also be given below.
Artificial Intelligence (AI) uses computers to simulate certain human thought processes and intelligent behaviors. Artificial intelligence has developed along a natural and clear path, from "reasoning" to "knowledge" and then to "learning". It has been widely applied in many industries, such as security, healthcare, transportation, education, and finance.
Machine learning is a branch of artificial intelligence that studies how computers can simulate or implement human learning behavior to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve their own performance. That is, machine learning studies how to improve the performance of specific algorithms through experience.
Deep learning is a machine learning technique based on deep neural network algorithms, characterized mainly by the use of multiple layers of nonlinear transformations to process and analyze data. It is mainly applied to perception and decision-making scenarios in the field of artificial intelligence, such as image and speech recognition, natural language translation, and computer game playing.
Data and algorithms are two important pillars of artificial intelligence, and accordingly, data bias is a key concern in the field. For a particular machine learning task, the data may contain factors that are correlated with the task but are not causal, such as sample imbalance or the presence of artificial markers in the data; such factors can be regarded as data bias.
Data set bias refers to the existence of spurious features in a data set that may be learned by a machine learning model. Taking an image data set as an example, an image may contain information related to the model of the acquisition device, the acquisition parameters and the like, which is irrelevant to the acquisition task. However, because of such acquisition defects, the machine learning model may rely on this information to guess the classification result directly, and no longer learn the image features that are truly related to the target task.
A machine learning model trained on an image data set containing data set bias may fail to learn the training task objectively and truly as expected. As a result, the learned model struggles to complete the target task as expected in the actual use environment and suffers serious performance degradation; even if the performance does not degrade, the reasons behind its predictions may be unacceptable and may even lead to ethical disputes. For example, for a model that predicts whether lipstick is worn, masking the mouth may have little influence on the prediction result, which shows that the model has not actually learned features related to the mouth. As another example, a medical image recognition model may base its prediction on the acquisition site inferred from markers placed by the physician.
One current scheme is to crop out regions that may affect model learning, or to adjust the colors, gray levels and the like of the image data, so as to avoid the influence of data bias on model training. However, it is difficult for this approach to exhaustively cover all biases, and it is labor intensive, requiring a great deal of manpower and time.
In view of this, the embodiments of the present disclosure provide a scheme for quantitatively evaluating a data set bias, so that an influence of the data set bias can be effectively determined, and thus, a data set can be adjusted based on the influence, and it is ensured that the adjusted data set does not have a negative influence on a model due to the data bias.
Fig. 1 shows a schematic structural diagram of a system 100 according to an embodiment of the present disclosure. As shown in Fig. 1, the system 100 includes an Input/Output (I/O) module 110, a data set processing module 120, and a model training module 130. Optionally, the system 100 may further include a model storage module 140 and a data storage module 150. The modules shown in Fig. 1 may communicate with each other.
The input/output module 110 may be used to obtain a set of data to be processed. For example, a set of data to be processed entered by a user may be received.
Optionally, the set of data to be processed may be stored in the data storage module 150. As an example, the data storage module 150 may be a data storage resource corresponding to an Object Storage Service (OBS) provided by a cloud service provider.
The data set to be processed comprises a number of data items to be processed, each of which has a label. In other words, the data set to be processed includes a plurality of labeled data items to be processed.
The label may be labeled manually or may be obtained through machine learning, etc., which is not limited by the present disclosure. Labels may also be referred to as task labels or labeling information or other names, etc., which are not listed one by one herein.
In some examples, the annotation information may be annotated empirically by an annotator for a particular portion of the data item to be processed. Alternatively, the annotation information may be produced by an image recognition model or an annotation model.
For example, for image data including a human face, labels such as sex, age, whether glasses are worn, whether a hat is worn, a size of the human face, and the like may be labeled for the human face part. For example, in a medical image (e.g., an ultrasound acquired image), the detected portion may be labeled with a lesion.
It will be appreciated that the pending data item may comprise a tag related portion and a tag unrelated portion. Taking the above-mentioned face image as an example, assuming that the label is for a face (for example, the face position is marked with the bounding box), the face region in the image is a part related to the label, and the other regions in the image except the face region are parts unrelated to the label. Assuming that the label is eye-specific (e.g., pupil color is noted by "black", "brown", etc.), the eye region in the image is the portion associated with the label, and the other regions in the image outside the eye region are the portions not associated with the label.
The data items to be processed in the data set to be processed may be any type of data, such as images, video, speech, text, etc. For convenience of description, the following description will be made by taking an image as an example.
Embodiments of the present disclosure do not limit the source of the data items to be processed. Taking images as an example, they may be collected from an open data set, captured by different image capturing devices, captured by the same image capturing device at different times, taken as image frames from a video sequence captured by an image capturing device, or any combination of the above.
The input/output module 110 may be implemented as an input module and an output module independent of each other, or as a combined module having both input and output functions. By way of example, it may be implemented as a Graphical User Interface (GUI) or a Command-Line Interface (CLI).
The data set processing module 120 may obtain the data set to be processed from the input/output module 110 or, alternatively, from the data storage module 150. Further, the data set processing module 120 may construct an unrelated data set based on the data set to be processed. The unrelated data set includes unrelated data items having labels, and the labels of the unrelated data items are determined based on the labels of the to-be-processed data items in the to-be-processed data set.
Optionally, the independent data set may be stored in the data storage module 150.
As described above, a data item to be processed has a label and includes a label-related portion and a label-unrelated portion. The part of the data item related to the label may therefore be removed, and only the part unrelated to the label is retained as the unrelated data item, whose label is the label of the data item to be processed. This process may also be referred to as splitting, separating, or by other names, which the present disclosure does not limit.
That is, for a certain to-be-processed data item of the to-be-processed data set (referred to as a target to-be-processed data item), a portion associated with its tag may be removed from the target to-be-processed data item to obtain a remaining portion of the target to-be-processed data item. The remaining portion is then used to construct an unrelated data item in the set of unrelated data, the tag of the unrelated data item corresponding to the tag of the target pending data item.
For example, suppose the data item to be processed is a face image, and the label indicates a face skin color, such as "white". Then the face region in the face image may be removed, and the remaining portion after the face region is removed is taken as the corresponding unrelated data item, and the unrelated data item still has the label "white" of the face skin color.
In some implementations, if the data items to be processed in the data set to be processed are images, the unrelated data items can be derived by image segmentation. The part of the image associated with the label is the foreground region, and the regions of the image other than the foreground region are background regions; the unrelated data items can then be determined from the background regions alone by foreground-background separation.
Specifically, image segmentation is performed on a target to-be-processed data item (target image) in a to-be-processed data set to obtain a background image corresponding to the target image, and then an unrelated data item is constructed by using the background image.
The specific algorithm used in the image segmentation in the embodiments of the present disclosure is not limited, and may be performed by one or more of the following algorithms, for example, and may also be performed by other algorithms: threshold-based image segmentation algorithms, region-based image segmentation algorithms, edge detection-based image segmentation algorithms, wavelet analysis and wavelet transform-based image segmentation algorithms, genetic algorithm-based image segmentation algorithms, active contour model-based image segmentation algorithms, deep learning-based image segmentation algorithms, and the like, wherein the deep learning-based image segmentation algorithms include, but are not limited to: feature encoder based segmentation algorithms, region selective based segmentation algorithms, RNN based segmentation algorithms, upsampling/deconvolution based segmentation algorithms, enhanced feature resolution based segmentation algorithms, feature enhancement based segmentation algorithms, conditional Random Field (CRF)/Markov Random Field (MRF) based segmentation algorithms, and the like.
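As a concrete but non-limiting example of one option in this list (threshold-based segmentation with OpenCV; whether Otsu thresholding actually isolates the label-related foreground depends entirely on the data, so the helper below is only a sketch), an unrelated data item could be built as follows:

```python
import cv2
import numpy as np

def background_item(image_bgr):
    """Sketch: separate foreground from background by Otsu thresholding and
    keep only the background, filling the removed foreground with the mean
    background color so the item no longer contains label-related content."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    _, fg_mask = cv2.threshold(gray, 0, 255,
                               cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    bg_mask = cv2.bitwise_not(fg_mask)
    background = cv2.bitwise_and(image_bgr, image_bgr, mask=bg_mask)
    # Replace the removed foreground region with the mean background color.
    mean_val = cv2.mean(image_bgr, mask=bg_mask)[:3]
    background[fg_mask > 0] = np.array(mean_val, dtype=np.uint8)
    return background
```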
In other implementations, the data items to be processed in the data set to be processed are video sequences. Different data items to be processed may have the same or different durations. For example, the first data item to be processed in the data set is a first video sequence with a length of m1 frames, i.e., it includes m1 frames of images, and the second data item to be processed is a second video sequence with a length of m2 frames, including m2 frames of images, where m1 and m2 may be equal or unequal.
Specifically, video segmentation is performed on a target to-be-processed data item (target video sequence) in the to-be-processed data set to obtain a background image corresponding to the target video sequence, and then an unrelated data item is constructed by using the background image.
The specific algorithm adopted by the video segmentation is not limited in the embodiment of the present disclosure. For example, image segmentation may be performed on each frame of image in the target video sequence, and the segmented background regions of each frame of image may be fused to obtain a background image corresponding to the target video sequence. As another example, a background image corresponding to the target video sequence may be obtained based on a gradient between two adjacent frames in the target video sequence. Specifically, a binary image corresponding to the video sequence may be obtained based on gradient information of the video sequence. A background image of the video sequence is then generated based on the binary image, as described below in connection with fig. 2.
Fig. 2 shows a schematic structural diagram of a data set processing module 200 according to an embodiment of the present disclosure. The data set processing module 200 may be implemented as the data set processing module 120 in fig. 1, and the data set processing module 200 may be configured to determine an unrelated data set based on a to-be-processed data set, where the to-be-processed data item of the to-be-processed data set is a video sequence, and the unrelated data item in the unrelated data set may be a background image corresponding to the video sequence.
As shown in Fig. 2, the data set processing module 200 may include a gradient calculation sub-module 210, a gradient stacking sub-module 220, a thresholding sub-module 230, a morphology processing sub-module 240, and a separation sub-module 250.
The gradient calculation sub-module 210 can be used to calculate gradient information between one frame of image in the target video sequence and the previous frame of image.
For example, assume that the target video sequence includes m1 frames of images, namely frame 0, frame 1, …, and frame m1-1. Gradient information between every two adjacent frames can be calculated; specifically, gradient information between frame 1 and frame 0, between frame 2 and frame 1, …, and between frame m1-1 and frame m1-2 is calculated.
Embodiments of the present disclosure do not limit the specific manner of calculating the gradient information; for example, a frame difference may be calculated. As another example, the gradient of the feature vectors of the two images along a specific dimension (e.g., the time dimension T) can be calculated, so that stationary background parts, such as an image border, can be extracted from the video sequence via motion information. As yet another example, the difference between an image and its grayscale version can be calculated, so that the colored parts of the video frame can be extracted and thus kept out of the foreground, such as colored marks or text added after the video was acquired.
The gradient stacking sub-module 220 may be configured to stack the gradient information obtained by the gradient calculation sub-module 210, so as to obtain a gradient stack map.
The manner in which the gradient stacking sub-module 220 stacks the gradient information may include, but is not limited to, weighted summation (e.g., averaging), taking the maximum, taking the minimum, or the like.
The thresholding sub-module 230 may be used to threshold the gradient stack map obtained by the gradient stacking sub-module 220 to obtain an initial binary map.
Specifically, for each pixel in the gradient stack map, pixels whose values are greater than the threshold are labeled 1 and pixels whose values are less than or equal to the threshold are labeled 0, so as to obtain an initial binary map in which every pixel value is either 1 or 0.
The morphology processing sub-module 240 may perform morphology processing on the initial binary image obtained by the thresholding sub-module 230 to obtain a binary image corresponding to the video sequence.
For example, if the pixel value of a pixel in the initial binary image is 1, but the pixel values of all the neighboring pixels of the pixel are 0, the pixel value of the pixel may be reset to 0.
Illustratively, the morphological processing may include, but is not limited to, morphological dilation, morphological erosion, and the like. For example, the morphology processing sub-module 240 may perform a number of morphological dilations on the initial binary image obtained by the thresholding sub-module 230, and then perform the same number of morphological erosions, thereby obtaining the binary image.
The separation sub-module 250 may obtain a background image corresponding to the video sequence based on the binary image obtained by the morphology processing sub-module 240.
Illustratively, a matting operation can be performed using the binary image to obtain the background image, for example by element-wise (dot) multiplication of the binary mask with the image.
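The following sketch strings the sub-modules of Fig. 2 together under simplifying assumptions (frame differences as the gradient, averaging as the stacking operation, a fixed threshold value, and a 5x5 kernel for the morphology); the function name and parameter values are illustrative only, not prescribed by the disclosure:

```python
import cv2
import numpy as np

def video_background(frames, thresh=25.0, morph_iters=2):
    """Sketch of the Fig. 2 pipeline: frame-difference gradients, averaged
    gradient stack, fixed-value thresholding, dilation followed by erosion,
    and a final matting step that keeps only the static background."""
    assert len(frames) >= 2
    frames = [f.astype(np.float32) for f in frames]

    # Gradient calculation: absolute difference between adjacent frames.
    grads = [np.abs(frames[i] - frames[i - 1]).mean(axis=-1)
             for i in range(1, len(frames))]

    # Gradient stacking: here a simple average over all frame pairs.
    stack = np.mean(grads, axis=0)

    # Thresholding: pixels with strong motion are marked 1 (foreground).
    initial_binary = (stack > thresh).astype(np.uint8)

    # Morphology: dilate then erode the same number of times (a closing
    # operation) to clean up the initial binary map.
    kernel = np.ones((5, 5), np.uint8)
    binary = cv2.dilate(initial_binary, kernel, iterations=morph_iters)
    binary = cv2.erode(binary, kernel, iterations=morph_iters)

    # Separation: element-wise (dot) multiplication with the inverted mask
    # keeps only the static background of the first frame.
    bg_mask = (1 - binary)[..., None].astype(np.float32)
    return (frames[0] * bg_mask).astype(np.uint8)
```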
In this way, the background image corresponding to the video sequence can be obtained by fully considering the similarity of the background between the frame images in the video sequence.
In this way, in the embodiment of the present disclosure, the background image is taken as a representative of the bias, and then the bias check can be performed on the data set. It will be appreciated that if the data set is not biased, then the features of the background image should not have any relationship to the tags associated with the foreground regions.
Assume that the to-be-processed data set includes N to-be-processed data items, and the unrelated data set includes N1 unrelated data items. If every data item to be processed is processed to produce a corresponding unrelated data item, then N1 = N. If only a portion of the data items in the data set to be processed are processed, then N1 < N. It can be understood that the unrelated data set obtained by processing all the data items to be processed contains more unrelated data items, so that the data set to be processed can be analyzed and evaluated more completely and comprehensively.
In one implementation, the constructed unrelated data set may be divided into two parts: a first portion of unrelated data items and a second portion of unrelated data items, where the first portion can be used to train the model and the second portion can be used to test the model. The manner of division is not limited in the embodiments of the present disclosure; as an example, the unrelated data set may be divided into the first part and the second part in a 9:1 ratio.
For example, the set of the first portion of unrelated data items may be referred to as an unrelated training set, and the set of the second portion of unrelated data items may be referred to as an unrelated test set. Alternatively, the set of the first portion of unrelated data items may include an unrelated training set and an unrelated validation set. As an example, the unrelated data set may be divided into an unrelated training set, an unrelated validation set, and an unrelated test set in a preset ratio, for example with the training set accounting for 70% of the items.
For simplicity of description, the first set of independent data items will be referred to hereinafter as a first data set (or training set), and the second set of independent data items will be referred to hereinafter as a second data set (or test set).
In some embodiments, the data set processing module 120 may pre-process the data set to be processed and then construct the unrelated data set based on the pre-processed data set. Pre-processing includes, but is not limited to: cluster analysis, data denoising, and the like.
The model training module 130 may include a training submodule 132 and an evaluation submodule 134.
The training submodule 132 may be used to train the classification model. In particular, the classification model may be trained based on a first portion of the independent data items in the independent data set and a label for each independent data item in the first portion.
As an implementation, the first portion of the independent data items used for training may be the entirety of the independent data set, and as such, more data items may be employed to participate in the training, making the trained classification model more robust. As another implementation, the first portion of unrelated data items used for training may be part of an unrelated data set, which is divided into a first portion of unrelated data items and a second portion of unrelated data items, as described above.
For convenience of the following description, a set of the first part of the independent data items for training is referred to as a training set, and accordingly, the first part of the independent data items may be training items.
It should be noted that the training here may be the training of an initial classification model or may be the updating of a previously trained classification model, wherein the initial classification model may be an untrained classification model. The previously trained classification model may be derived from training an initial classification model. As an example, the training submodule 132 may retrieve an initial classification model or a previously trained classification model from the model storage module 140.
The training submodule 132 may obtain a first portion of the unrelated data items in the set of unrelated data for training and the label for each of the unrelated data items in the first portion from the data set processing module 120 or the data storage module 150. Alternatively, the training sub-module 132 may obtain a first portion of the unrelated data items in the set of unrelated data for training from the data set processing module 120 and obtain the label for each of the first portion of the unrelated data items from the input/output module 110.
Optionally, before training based on the training set (the first portion of the unrelated data items in the unrelated data set), the training sub-module 132 may pre-process the training set, including but not limited to: feature extraction, cluster analysis, edge detection, image denoising and the like. For example, the training data items after feature extraction may be characterized as S-dimensional feature vectors, where S is greater than 1.
It is understood that the model structure of the classification model is not limited in the embodiments of the present disclosure. As an example, the classification model may be a Convolutional Neural Network (CNN) model and may optionally include an input layer, convolutional layers, deconvolution layers, pooling layers, fully-connected layers, an output layer, and the like.
The classification model includes a large number of parameters, which may represent the weights of calculation formulas or calculation factors in the classification model; these parameters can be iteratively updated through training. The classification model also involves hyper-parameters, which guide the construction or training of the classification model, such as the number of training iterations, the learning rate, the batch size, the number of layers of the model, and the number of neurons per layer. The hyper-parameters may be obtained by training the model with a training set, or may be preset parameters that are not updated by training.
For example, the training sub-module 132 may follow an existing training process to train the classification model. As an illustrative description, the training process may be: inputting the training data items of the training set into the classification model, using the labels of the training data as references, obtaining a loss value between the output of the classification model and the corresponding labels with a loss function, and adjusting the parameters of the classification model according to the loss value. The classification model is trained iteratively with the training data items of the training set, and its parameters are continuously adjusted, until the model outputs, for an input training data item, a value that is close to the corresponding label with sufficiently high accuracy, for example until the loss function is minimized or falls below a reference threshold.
The loss function used in training measures how well the classification model has been trained, i.e., it calculates the difference between the model's prediction and the true value. Because the output of the classification model is expected to be as close as possible to the true value (namely, the corresponding label), the parameters of the model can be updated according to the difference between the current model's predicted value and the true value. In each training step, the loss function is used to judge this difference and the model parameters are updated, until the classification model predicts values that are very close to the true values, at which point the model is considered trained.
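For instance, assuming a PyTorch implementation (the framework, the loader format, and all names below are assumptions; the disclosure does not prescribe any of them), the per-sample weighting from the first sample weight distribution can be folded into the loss like this:

```python
import torch
import torch.nn as nn

def train_classifier(model, loader, sample_weight_of, epochs=5, lr=1e-3):
    """Minimal weighted training loop (a sketch only). `loader` is assumed
    to yield (index, image, label) batches so that each sample's weight in
    the first sample weight distribution can be looked up by index."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss(reduction="none")   # per-sample loss
    model.train()
    for _ in range(epochs):
        for idx, x, y in loader:
            logits = model(x)
            per_sample_loss = criterion(logits, y)
            # Weight each sample's loss by its sample weight, then take a
            # weighted average over the batch.
            w = sample_weight_of[idx]
            loss = (per_sample_loss * w).sum() / w.sum()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```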
The "classification model" in the embodiments of the present disclosure may also be referred to as a machine learning model, a convolution classification model, a background classification model, a data bias model, or other names, or may also be referred to simply as a "model", etc., to which the present disclosure is not limited. Alternatively, the trained classification model may be stored in the model storage module 140. In some examples, model storage module 140 may be part of model training module 130.
The evaluation sub-module 134 may be used to evaluate the classification model. In particular, an evaluation result for the trained classification model may be determined based on a second portion of the independent data items in the independent data set and the label of each independent data item in the second portion. The evaluation results may be used to characterize the significance of the data bias of the data set to be processed.
As described above, the set of second portion of unrelated data items may be a test set and, correspondingly, the second portion of unrelated data items may be test data items.
As an example, the evaluation process may include: the test data item is input to the trained classification model, a prediction result about the test data item is obtained, and an evaluation result is determined based on a comparison result of the prediction result and a label of the test data item.
In embodiments of the present disclosure, the evaluation result may include at least one of the following: accuracy, precision, recall, the F1 score, the precision-recall (P-R) curve, the Average Precision (AP) metric, the false alarm rate, and the like.
Specifically, a confusion matrix may be constructed, showing the numbers of positive examples and negative examples, their true values, their predicted values, and the like.
Accuracy refers to the proportion of samples that are classified correctly to the total samples. For example, if the number of test data items in the test set is N2, and the number of prediction results consistent with the label is N21, the accuracy rate may be represented as N21/N2.
Precision is the proportion of samples predicted to be positive that are actually positive. For example, if the number of test data items in the test set is N2, the number of items predicted to be positive is N22, and the number of those N22 items that are labeled positive is N23, the precision may be represented as N23/N22.
Recall is the proportion of actually positive samples that are predicted to be positive. For example, if the number of test data items in the test set is N2, of which N31 are labeled positive, and N32 of those N31 positive examples are predicted to be positive, the recall can be represented as N32/N31.
The P-R curve takes recall as its horizontal axis and precision as its vertical axis. One point on the P-R curve represents: at a certain threshold, the model judges outputs greater than the threshold as positive samples and outputs smaller than the threshold as negative samples, and the recall and precision corresponding to that threshold are plotted. The entire P-R curve is generated by moving the threshold from high to low. The region near the origin represents the precision and recall of the model when the threshold is at its maximum.
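A brief sketch of how such a curve could be traced (sweeping a decision threshold over the model's positive-class scores; the score array, threshold grid, and function name are assumptions for illustration):

```python
import numpy as np

def pr_curve(scores, labels, num_thresholds=50):
    """Trace a P-R curve by moving a decision threshold from high to low.
    `scores` are the model's predicted positive-class probabilities."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    points = []
    for t in np.linspace(scores.max(), scores.min(), num_thresholds):
        pred = (scores > t).astype(int)       # above threshold -> positive
        tp = np.sum((pred == 1) & (labels == 1))
        fp = np.sum((pred == 1) & (labels == 0))
        fn = np.sum((pred == 0) & (labels == 1))
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        points.append((recall, precision))    # horizontal axis: recall
    return points
```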
The F1 metric, also known as the F1 score, is the harmonic mean of precision and recall, i.e., twice the product of precision and recall divided by their sum.
In some embodiments of the present disclosure, the evaluation result may include a positive-example characterization value, such as a first precision and/or a first recall. The first precision represents the proportion of samples predicted to be positive that are actually positive, and the first recall represents the proportion of actually positive samples that are predicted to be positive. The evaluation result may also include a negative-example characterization value, such as a second precision and/or a second recall. The second precision represents the proportion of samples predicted to be negative that are actually negative, and the second recall represents the proportion of actually negative samples that are predicted to be negative.
In some embodiments of the present disclosure, the evaluation result may include a first prediction mean and/or a second prediction mean. The first prediction mean represents the average of the predicted values of the samples that are actually positive. The second prediction mean represents the average of the predicted values of the samples that are actually negative. The evaluation result may also include a mean difference characterizing the gap between the first prediction mean and the second prediction mean; for example, the mean difference may be expressed as the difference between the first prediction mean and the second prediction mean, or as the ratio of the first prediction mean to the second prediction mean, and so on.
It should be understood that the above list gives only some examples of evaluation results; other characterizations may also be used as the evaluation result and are not enumerated in the present disclosure.
Illustratively, the evaluation results may be presented to the user by the input/output module 110. Such as may be presented via a graphical user interface for easy viewing by a user.
As such, by way of the embodiments of the present disclosure, the bias significance of a data set may be characterized in a quantified form. This quantitative evaluation scheme can provide a clear reference for the user and facilitates operations such as adjusting the data set.
In scenarios in which input/output module 110 includes a graphical user interface, input/output module 110 may also visually present a representation of the data set bias via the graphical user interface.
Specifically, a Class Activation Map (CAM) is obtained by inputting the target-independent data items into the trained classification model. A superimposition result is then obtained by superimposing the CAM with the target-independent data item, and the superimposition result is displayed.
A class activation map, i.e., a class activation heat map, allows embodiments of the present disclosure to characterize the regions of interest of the classification model, and in particular to reveal which regions (i.e., the regions the model attends to) lead to bias.
The embodiments of the present disclosure do not limit the specific manner in which the CAM is obtained. For example, the CAM may be obtained by Gradient-weighted Class Activation Mapping (Grad-CAM). For example, the output of the last convolutional layer of the classification model, i.e., the last-layer feature maps, may be extracted, and the extracted feature maps may be weighted and summed to obtain the CAM. Alternatively, the result of the weighted summation may be passed through a Rectified Linear Unit (ReLU) activation function to serve as the CAM. The weights used in the weighted summation may be the weights of the final fully connected layer. As an example, the partial derivatives of the last-layer softmax output of the classification model with respect to all pixels of the last-layer feature maps may be calculated, and the global average over the width and height dimensions may then be taken as the corresponding weights.
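As a hedged illustration only, the following PyTorch sketch computes a Grad-CAM-style map using the gradient of the class score with respect to the last-layer feature maps (one common variant of the scheme described above); the framework choice, the hook-based implementation and the layer handle passed in as `conv_layer` are assumptions of this example.

```python
# A minimal Grad-CAM sketch (assumption: a PyTorch CNN and a handle to its last conv layer).
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_class, conv_layer):
    """Compute a class activation map for one input image of shape (1, C, H, W)."""
    activations, gradients = {}, {}

    def fwd_hook(module, inp, out):
        activations["value"] = out.detach()

    def bwd_hook(module, grad_in, grad_out):
        gradients["value"] = grad_out[0].detach()

    h1 = conv_layer.register_forward_hook(fwd_hook)
    h2 = conv_layer.register_full_backward_hook(bwd_hook)

    model.zero_grad()
    logits = model(image)                       # forward pass
    logits[0, target_class].backward()          # gradients of the class score

    h1.remove()
    h2.remove()

    # Global-average the gradients over width and height to get per-channel weights.
    weights = gradients["value"].mean(dim=(2, 3), keepdim=True)
    # Weighted sum of the last-layer feature maps, followed by ReLU.
    cam = F.relu((weights * activations["value"]).sum(dim=1, keepdim=True))
    # Upsample to the input resolution and normalize to [0, 1].
    cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cam[0, 0]
```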
The embodiment of the present disclosure does not limit the manner in which the CAM is superimposed on the target unrelated data item (e.g., the background image); for example, the superimposition may be performed by weighted summation, and as an example, the CAM and the background image may be weighted equally.
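A small sketch of such an equal-weight superimposition follows; the numpy usage and the simple blue-to-red pseudo-colouring are assumptions made purely for illustration.

```python
# A sketch of overlaying a CAM (values in [0, 1]) on a background image by weighted summation.
import numpy as np

def overlay(cam, background_image, alpha=0.5):
    """cam: (H, W) float in [0, 1]; background_image: (H, W, 3) uint8; alpha=0.5 gives equal weights."""
    heatmap = (np.stack([cam, np.zeros_like(cam), 1.0 - cam], axis=-1) * 255).astype(np.uint8)
    return (alpha * heatmap + (1.0 - alpha) * background_image).astype(np.uint8)
```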
Thus, embodiments of the present disclosure provide a solution for quantitative evaluation and visual presentation of data set bias, such that the significance of the data set bias can be unambiguously characterized and the specific locations where the bias arises can be presented visually. In this way, the user can understand the bias of the data set more intuitively and comprehensively. The scheme requires little user involvement, can run automatically, and improves processing efficiency while preserving the accuracy of the quantitative bias evaluation.
The model training module 130 may also be configured to adjust the set of data to be processed based on the classification model.
In particular, the data set to be processed may have an initial sample weight distribution; correspondingly, the first data set has a first sample weight distribution and the second data set has a second sample weight distribution. For example, assuming that the initial sample weight of a target to-be-processed data item is a, the sample weight of the unrelated data item generated based on that target to-be-processed data item is also a.
Illustratively, the model training module 130 may be configured to derive the recommended sample weight distribution based on iterative training of the classification model, as described below in conjunction with fig. 3.
FIG. 3 shows a schematic diagram of a process 300 for model training module 130 to derive recommended sample weights according to an embodiment of the disclosure.
At 310, a first data set having a first sample weight distribution and a second data set having a second sample weight distribution are determined.
In particular, the independent data set may be constructed based on the data set to be processed, and the independent data set may be divided into the first data set and the second data set, as described in the above embodiments.
For example, the data items to be processed in the data set to be processed may have initial sample weights, that is, the data set to be processed may have an initial sample weight distribution. As an example, the initial sample weights may be input by a user through the input/output module 110. As another example, the initialization sample weight may be determined by an initialization process.
The sample weight may be used to indicate the sampling probability of the data item to be processed. For example, assuming that the sample weight of the i-th data item to be processed is w_i, the sampling probability of the i-th data item to be processed is w_i / (w_1 + w_2 + ... + w_N), i.e., its weight normalized by the sum of the sample weights of all N data items to be processed in the data set.
As an example, the initial sample weight distribution may indicate that the sampling probabilities of the respective to-be-processed data items in the to-be-processed data set are equal. Assuming that the to-be-processed data set includes N to-be-processed data items, and the initial sample weight of each to-be-processed data item is 1, the sampling probability of each to-be-processed data item is initialized to 1/N.
It will be appreciated that once the initial sample weight distribution is determined, the first sample weight distribution and the second sample weight distribution can be determined accordingly.
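As a non-authoritative sketch of how sample weights map to sampling probabilities, the following Python snippet normalizes the weights and draws a weighted sample; the numpy dependency and the function signature are assumptions of this example.

```python
# A sketch of converting sample weights into sampling probabilities and drawing a weighted sample.
import numpy as np

def sample_indices(weights, num_samples, rng=None):
    rng = rng or np.random.default_rng()
    weights = np.asarray(weights, dtype=float)
    probs = weights / weights.sum()          # p_i = w_i / (w_1 + ... + w_N)
    return rng.choice(len(weights), size=num_samples, replace=True, p=probs)

# With N items of weight 1 each, every item is drawn with probability 1/N.
```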
At 320, a first data set is sampled based on the first sample weight distribution and a classification model is trained in an iterative manner.
At 330, the classification model trained at 320 is evaluated based on the second data set to obtain an evaluation result.
For example, the evaluation result may be obtained based on a comparison of the prediction result of the trained classification model for the unrelated data item in the second data set and the label of the unrelated data item. As an example, the unrelated data item may be input to the trained classification model to obtain a prediction result regarding the unrelated data item, and the evaluation result may be determined based on a comparison of that prediction result and the label of the unrelated data item. The evaluation result may include at least one of: accuracy, precision, recall, F1 index, precision-recall curve, average precision index, false alarm rate, missed detection rate, etc. For the evaluation result, reference may be made to the above description, and details are not repeated here.
At 340, it is determined whether the bias significance indicated by the evaluation result is high.
If it is determined at 340 that the evaluation result indicates that the bias significance is high, e.g., the evaluation result is greater than a preset threshold, the process proceeds to 350. Otherwise, if it is determined at 340 that the evaluation result indicates that the bias significance is not high, e.g., the evaluation result is not greater than the preset threshold, the process proceeds to 360.
The preset threshold may be set based on the required processing accuracy of the data set to be processed, the application scenario, and the like. The preset threshold may also depend on the specific meaning of the evaluation result; for example, if the evaluation result includes an accuracy rate, the preset threshold may be set to, for example, 30%, 50%, or another value.
At 350, the sample weight distribution is updated.
Referring to FIG. 3, after 350, execution may continue back at 310 or 320, as indicated by the dashed arrow in FIG. 3.
In one example, execution may continue back at 310, i.e., the first data set and the second data set are reconstructed. Thus, an unrelated data item that belonged to the first data set in the previous cycle may belong to either the first data set or the second data set in the next cycle.
In another example, execution may continue back at 320, i.e., no change has occurred to the unrelated data items in the first data set and the second data set, but the first sample weight distribution and/or the second sample weight distribution is updated.
After 350, the first data set may be resampled based on the updated first sample weight distribution and the classification model retrained in an iterative manner; the retrained classification model is then evaluated based on the second data set to obtain a new evaluation result.
As such, 310 through 350 or 320 through 350 may be performed iteratively until the evaluation result indicates that the bias significance is not high (e.g., the evaluation result is not greater than a preset threshold).
The embodiment of the present disclosure does not limit the specific implementation manner of updating the sample weight distribution.
As an example, the sample weight distribution may be updated in a random manner. For example, the sample weights of some data items to be processed may be randomly updated, for example, the sample weight of a certain data item to be processed is updated from 1 to 2, the sample weight of another data item to be processed is updated from 1 to 3, and so on. It can be appreciated that this random approach has uncertainty, which may make the process of obtaining the recommended sample weight distribution time-consuming.
As another example, the sample weight distribution may be updated using a predetermined rule. For example, the second sample weight distribution may be updated: if the evaluation result indicates that the classification model predicts, for an unrelated data item in the second data set, a label different from that unrelated data item's own label, the sample weight of the unrelated data item may be increased, for example from a1 to a1+1, to 2×a1, or in another manner. In this example, the first sample weight distribution may remain unchanged or may be updated in another way. Optionally, in this example, after the sample weight distribution is updated, the second data set may be swapped with the first data set for the next cycle; for example, in the next cycle the classification model will be trained based on the second data set of the previous cycle and the updated second sample weight distribution.
As another example, the sample weight distribution may be optimized by a genetic algorithm to update it. For example, the sample weight distribution may be used as the initial gene values of a genetic algorithm, and an objective function may be constructed based on the evaluation result obtained at 330, so that the genetic algorithm can optimize the sample weight distribution; the optimized sample weight distribution is the updated sample weight distribution. The embodiments of the present disclosure do not limit the construction of the objective function of the genetic algorithm; for example, if the evaluation result includes the mean difference between positive and negative samples and the accuracy rate, the sum of the mean difference and the accuracy rate may be used as the objective function. It will be appreciated that the objective function may be constructed in other ways, which are not enumerated here.
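The snippet below is a hedged sketch of such a genetic-algorithm update: the current sample weights serve as the initial genes and an externally supplied objective built from the evaluation result is minimized. The population size, mutation scale, selection and crossover schemes are illustrative assumptions rather than values prescribed by the disclosure.

```python
# A hedged sketch of optimizing the sample weight distribution with a simple genetic algorithm.
import numpy as np

def genetic_update(initial_weights, objective, generations=20, pop_size=16,
                   mutation_scale=0.1, rng=None):
    """objective(weights) returns the evaluation-based score to minimize,
    e.g. the sum of the mean difference and the accuracy rate."""
    rng = rng or np.random.default_rng()
    base = np.asarray(initial_weights, dtype=float)
    population = [np.clip(base + rng.normal(0, mutation_scale, base.shape), 1e-3, None)
                  for _ in range(pop_size)]
    for _ in range(generations):
        scores = [objective(w) for w in population]
        order = np.argsort(scores)
        parents = [population[i] for i in order[:pop_size // 2]]        # selection
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = rng.choice(len(parents), size=2, replace=False)
            mask = rng.random(base.shape) < 0.5                         # crossover
            child = np.where(mask, parents[a], parents[b])
            child = child + rng.normal(0, mutation_scale, base.shape)   # mutation
            children.append(np.clip(child, 1e-3, None))
        population = parents + children
    return min(population, key=objective)                               # best individual
```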
Therefore, the embodiment of the disclosure can update the sample weight distribution of the data set to be processed based on the trained classification model, so as to obtain the recommended sample weight distribution, and the process does not need user participation and has high automation degree.
As another example, a user modification to the sample weight distribution may be obtained to update the sample weight distribution. For example, the user may empirically infer what modification is to be made to the sample weight distribution with reference to the evaluation results and/or the displayed overlay results (as described above), and then input the modification through the input/output module 110 to update the sample weight distribution.
Therefore, the method can fully consider the user requirements, and update the sample weight distribution based on the modification of the user, so that the obtained recommended sample weight distribution can better meet the user expectation, and the user satisfaction is improved.
At 360, a recommended sample weight distribution is obtained.
If it is determined at 340 that the evaluation result indicates that the bias significance is not high, for example, the evaluation result is not greater than a preset threshold, the sample weight distribution resulting in the current evaluation result may be taken as the recommended sample weight distribution.
Therefore, the embodiment of the disclosure can update the sample weight distribution based on iteratively training the classification model, and can observe how the bias of the data set changes as the sample weight distribution is updated, so that the data set to be processed can be evaluated iteratively and an effective, highly informative recommended sample weight distribution can be obtained.
The input/output module 110 may also present the recommended sample weight distribution for the user to use as a reference for further adjusting the set of data to be processed. For example, the recommended sample weight distribution is visually presented through a graphical user interface.
For example, the data set processing module 120 may add or delete the data set to be processed based on the obtained recommended sample weight distribution to construct an unbiased data set.
As an example, the data set processing module 120 may copy the to-be-processed data items with large recommended sample weights to expand the number of the to-be-processed data items in the to-be-processed data set. The data set processing module 120 may delete the to-be-processed data items with small recommended sample weights to reduce the number of to-be-processed data items in the to-be-processed data set.
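A minimal sketch of this add/delete adjustment is given below; the `high` and `low` weight thresholds are illustrative assumptions and not values specified by the disclosure.

```python
# A sketch of adjusting the to-be-processed data set according to recommended sample weights.
def adjust_dataset(items, recommended_weights, high=2.0, low=0.5):
    adjusted = []
    for item, w in zip(items, recommended_weights):
        if w < low:
            continue                      # delete items with small recommended weight
        adjusted.append(item)
        if w >= high:
            adjusted.append(item)         # duplicate items with large recommended weight
    return adjusted
```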
As an example, a deletion instruction of the part of the to-be-processed data item by the user may be acquired via the input/output module 110 to delete the part of the to-be-processed data item. Additional data items entered by the user may be obtained via the input/output module 110 for addition to the current set of data to be processed.
For example, the user may add or delete a set of data to be processed based on the recommended sample weight distribution. For example, the user may find other samples similar to the to-be-processed data item with a high weight of the recommended sample to be added to the data set as new data items, thereby implementing data supplementation to the data set. As an example, similar other samples may be other images acquired by the same (or same model of) image acquisition device in similar environments (e.g., care conditions, etc.).
In this way, in the embodiment of the present disclosure, the to-be-processed data set can be added or deleted based on the recommended sample weight distribution, so that an unbiased data set can be constructed. Further, the unbiased data set may be used to train a more robust, unbiased task-specific model.
It is understood that the system 100 shown in fig. 1 can be a system capable of interacting with a user, and the system 100 can be a software system, a hardware system, or a combined software and hardware system.
In some examples, the system 100 may be implemented as, or as part of, a computing device, including but not limited to a desktop, mobile terminal, wearable device, server, cloud server, and the like.
It is understood that the system 100 shown in fig. 1 may be implemented as an artificial intelligence platform (AI platform). The AI platform is a platform that provides a convenient AI development environment and convenient development tools for AI developers and users. Various AI models or AI sub-models for solving different problems may be built into the AI platform, and the AI platform can establish an applicable AI model according to the requirements input by a user. That is, the user only needs to specify his or her requirements in the AI platform, prepare a data set according to the prompts, and upload it to the AI platform; the AI platform can then train, for the user, an AI model that fulfills the user's needs. The AI model in the disclosed embodiments may be used to evaluate the data bias of a to-be-processed data set input by a user.
Fig. 4 shows a schematic diagram of a scenario 400 in which the system 100 is deployed in a cloud environment, according to an embodiment of the present disclosure. In scenario 400, system 100 is deployed entirely in cloud environment 410.
Cloud environment 410 is an entity that utilizes underlying resources in a cloud computing mode to provide cloud services to users. Cloud environment 410 includes a cloud data center 412 and a cloud services platform 414, cloud data center 412 including a large number of infrastructure resources (including computing resources, storage resources, and network resources) owned by a cloud service provider, and cloud data center 412 including computing resources that may be a large number of computing devices (e.g., servers). The system 100 may be deployed independently on a server or a virtual machine in the cloud data center 412, or the system 100 may be deployed in a distributed manner on a plurality of servers in the cloud data center 412, or on a plurality of virtual machines in the cloud data center 412, or on a server and a virtual machine in the cloud data center 412.
As shown in fig. 4, the system 100 may be abstracted by the cloud service provider into an AI development cloud service 424 at the cloud service platform 414 and provided to the user. After the user purchases the cloud service at the cloud service platform 414 (the user may pre-charge and then settle according to the final resource usage), the cloud environment 410 provides the AI development cloud service 424 to the user using the system 100 deployed at the cloud data center 412. When using the AI development cloud service 424, the user may upload a to-be-processed data set or the like through an Application Program Interface (API) or a GUI. The system 100 in the cloud environment 410 receives the to-be-processed data set uploaded by the user, and may perform operations such as data set processing, model training, and data set adjustment. The system 100 may return the evaluation results of the model, the recommended sample weight distribution, etc. to the user through the API or GUI.
In another embodiment of the present application, the system 100 in the cloud environment 410 is abstracted to the AI development cloud service 424 and provided to the user, which may be divided into two parts, for example, a data set bias evaluation cloud service and a data set adjustment cloud service. The user can only purchase the data set bias evaluation cloud service on the cloud service platform 414, the cloud service platform 414 can construct an unrelated data set based on the data set to be processed uploaded by the user, obtain a classification model through training, and return an evaluation result of the classification model to the user, so that the user can know the bias significance of the data set to be processed. The user may also purchase a data set adjustment cloud service further at the cloud service platform 414, and the cloud service platform 414 may perform iterative training on the classification model based on the sample weight distribution, update the sample weight distribution, and return the recommended sample weight distribution to the user, so that the user adds or deletes the to-be-processed data set with reference to the recommended sample weight distribution to construct an unbiased data set.
Fig. 5 shows a schematic diagram of a scenario 500 in which the system 100 is deployed in different environments, according to an embodiment of the present disclosure. In scenario 500, system 100 is distributively deployed in different environments that may include, but are not limited to, at least two of cloud environment 510, edge environment 520, and end computing device 530.
The system 100 may be logically divided into multiple portions, each having a different functionality. For example, as shown in FIG. 1, system 100 includes input/output module 110, data set processing module 120, model training module 130, model storage module 140, and data storage module 150. Portions of system 100 may be deployed in any two or three of end computing device 530, edge environment 520, and cloud environment 510, respectively. The various portions of system 100 deployed in different environments are cooperatively implemented to provide various functionality to a user. For example, in one scenario, the input/output module 110 and the data storage module 150 of the system 100 are deployed in the end computing device 530, the dataset processing module 120 of the system 100 is deployed in the edge computing device of the edge environment 520, and the model training module 130 and the model storage module 140 of the system 100 are deployed in the cloud environment 510. The user sends the pending data set to the input/output module 110 in the terminal computing device 530, and the terminal computing device 530 stores the pending data set to the data storage module 150. The dataset processing module 120 in the edge computing device of the edge environment 520 constructs an independent dataset based on the pending dataset from the terminal computing device 530. Model training module 130 in cloud environment 510 trains classification models based on independent data sets from edge environment 520. Cloud environment 510 may also store the trained classification models to model storage module 140. It should be understood that the present application does not limit what parts of the system 100 are specifically deployed in what environment, and in practical applications, the parts may be deployed adaptively according to the computing power of the terminal computing device 530, the resource occupation of the edge environment 520 and the cloud environment 510, or the specific application requirements.
Edge environment 520 is an environment that includes a set of edge computing devices that are closer to end computing device 530, including but not limited to: edge servers, edge kiosks with computing power, etc. It is understood that system 100 may also be deployed on one edge server of edge environment 520 alone, or may be deployed on multiple edge servers of edge environment 520 in a distributed manner.
The terminal computing device 530 includes, but is not limited to: terminal server, smart mobile phone, notebook computer, panel computer, personal desktop computer, intelligent camera etc.. It is understood that the system 100 can also be deployed solely on one terminal computing device 530, or can be deployed in a distributed fashion across multiple terminal computing devices 530.
Fig. 6 shows a schematic structural diagram of a computing device 600 according to an embodiment of the disclosure. Computing device 600 in fig. 6 may be implemented as a device in cloud environment 510, a device in edge environment 520, or a terminal computing device 530 in fig. 5. It should be understood that the computing device 600 illustrated in fig. 6 may also be considered a cluster of computing devices, i.e., the computing device 600 includes one or more of the aforementioned devices in the cloud environment 510, devices in the edge environment 520, and the end computing device 530.
As shown in fig. 6, computing device 600 includes memory 610, processor 620, communication interface 630, and bus 640, where bus 640 provides for communication of the various components of computing device 600 with one another.
The Memory 610 may be a Read Only Memory (ROM), a Random Access Memory (RAM), a hard disk, a flash memory, or any combination thereof. The memory 610 may store programs; when the programs stored in the memory 610 are executed by the processor 620, the processor 620 and the communication interface 630 are used to perform the processes that can be performed by the various modules in the system 100 as described above. It is to be appreciated that the processor 620 and the communication interface 630 may also be used to perform some or all of the data processing method embodiments described below in this specification. The memory may also store the data sets and classification models. For example, a portion of the storage resources in the memory 610 is divided into a data storage module for storing data sets, such as to-be-processed data sets, unrelated data sets, etc., and a portion of the storage resources in the memory 610 is divided into a model storage module for storing classification models.
Processor 620 may employ a Central Processing Unit (CPU), an Application-Specific Integrated Circuit (ASIC), a Graphics Processing Unit (GPU), or any combination thereof. Processor 620 may include one or more chips. Processor 620 may include an accelerator, such as a Neural Processing Unit (NPU).
Communication interface 630 enables communication between computing device 600 and other devices or communication networks using transceiver modules, such as transceivers. For example, the data may be acquired through the communication interface 630.
Bus 640 may include a path to transfer information between components of computing device 600 (e.g., memory 610, processor 620, communication interface 630).
Fig. 7 shows a schematic flow chart of a data processing method 700 according to an embodiment of the present disclosure. The method 700 shown in fig. 7 may be performed by the system 100.
As shown in FIG. 7, at block 710, an extraneous data set is constructed based on a pending data set, the extraneous data set including extraneous data items having labels, the labels of the extraneous data items being determined based on the labels of the pending data items in the pending data set.
Illustratively, the set of pending data includes a plurality of pending data items, each pending data item having a tag. The pending data item may comprise a tag related part and a tag unrelated part.
In some embodiments, a portion associated with the tag of the target pending data item may be removed from the target pending data item of the pending data set to obtain a remaining portion of the target pending data item. And constructing an irrelevant data item in the irrelevant data set by using the rest part, wherein the label of the irrelevant data item corresponds to the label of the target data item to be processed.
In some embodiments, the data set to be processed is an image data set, that is, the data items to be processed are images. Image segmentation may be performed on a target data item to be processed in the data set to be processed to obtain a background image corresponding to the target data item to be processed. A background image is used to construct an item of extraneous data in the extraneous data set.
Specifically, the part of the image associated with the label is a foreground region, and the other regions of the image except the foreground region are background regions, so that the irrelevant data item can be determined based on only the background regions through foreground-background separation.
In some embodiments, the data items to be processed in the data set to be processed are video sequences. A binary image of the video sequence may be determined based on gradient information between one frame of image in the video sequence and a previous frame of image of the one frame of image. And generating a background image of the video sequence based on the binary image. A background image of the video sequence is then used to construct a piece of the extraneous data item in the extraneous data set.
FIG. 8 shows a schematic flow diagram of a process 800 of building an extraneous data item according to an embodiment of the present disclosure. In particular, fig. 8 shows a process for constructing an unrelated data item based on a data item to be processed (video sequence).
As shown in fig. 8, at block 810, gradient information between two adjacent frames of images in the target video sequence is calculated.
Illustratively, the gradient of the feature vector of two frames of images along the time dimension can be calculated, thereby obtaining gradient information. In this way, static background parts of the video sequence, such as image borders, can be obtained.
At block 820, a gradient overlay is obtained based on the overlay of gradient information.
Illustratively, the gradient information obtained at 810 may be superimposed by weighted summation, or by taking the per-pixel maximum or minimum value, to obtain a gradient superposition map.
At block 830, the gradient stack map is thresholded to obtain an initial binary map.
At block 840, the initial binary image is morphologically processed to obtain a binary image.
Illustratively, the initial binary image is morphologically dilated several times and then morphologically eroded the same number of times, thereby obtaining a binary image.
At block 850, a background image is derived based on the binary image and the background image is treated as an independent data item corresponding to the video sequence.
Illustratively, a matting operation may be performed using the binary image, for example by element-wise (matrix dot) multiplication with the original frame, so as to obtain the background image.
In this way, the background image corresponding to the video sequence can be obtained by considering the similarity between the frame images in the video sequence and the characteristic that the background in the video sequence is basically unchanged.
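For illustration, the following OpenCV/numpy sketch walks through the flow of FIG. 8 — frame-to-frame gradients, superposition, thresholding, morphological processing and matting; the threshold value, kernel size and iteration count are assumptions, and an actual implementation may compute gradients on feature vectors rather than raw pixels.

```python
# A hedged sketch of the background-extraction flow in FIG. 8 for one video sequence.
import cv2
import numpy as np

def extract_background(frames, thresh=25, kernel_size=5, iterations=2):
    """frames: list of grayscale images (H, W) as uint8 from one video sequence."""
    grads = [cv2.absdiff(frames[i], frames[i - 1]) for i in range(1, len(frames))]  # 810
    stacked = np.max(np.stack(grads, axis=0), axis=0)                   # 820: superposition
    _, initial_binary = cv2.threshold(stacked, thresh, 255, cv2.THRESH_BINARY)  # 830
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    dilated = cv2.dilate(initial_binary, kernel, iterations=iterations)  # 840: dilation
    binary = cv2.erode(dilated, kernel, iterations=iterations)           # 840: erosion
    background_mask = (binary == 0).astype(frames[0].dtype)              # static regions
    return frames[0] * background_mask                                   # 850: matting
```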
In addition, the label of the extraneous data item is determined based on the label of the data item to be processed. Specifically, the target to-be-processed data item has a label a, and the target unrelated data item is obtained by processing (such as image segmentation) the target to-be-processed data item, and then the label of the target unrelated data item is also the label a.
At block 720, the unrelated data set is divided into a first data set having a first sample weight distribution and a second data set having a second sample weight distribution, the first and second sample weight distributions being determined based on sample weights of the data items to be processed in the data set to be processed.
The sample weights of the unrelated data items are determined based on the sample weights of the data items to be processed. Specifically, the target to-be-processed data item has a sample weight w, and the target-independent data item is obtained by processing (such as image segmentation) the target to-be-processed data item, so that the sample weight of the target-independent data item is also the sample weight w.
The way in which the first data set and the second data set are divided is not limited in the embodiments of the present disclosure. For example, the division may be performed in a ratio of 9:1. As another example, the division may be performed in a ratio of 1:1. In addition, the first data set may be further divided into a first sub data set and a second sub data set; for example, the ratio of the number of unrelated data items in the first sub data set to the number of unrelated data items in the second sub data set is about 7:3. It is understood that the ratios listed here are merely illustrative and are not intended to limit the embodiments of the present disclosure.
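One possible way to perform such a division is sketched below; the default split fraction and the shuffling scheme are assumptions for illustration only.

```python
# A sketch of dividing the unrelated data set into a first and second data set by a fixed ratio.
import random

def split_dataset(items, first_fraction=0.9, seed=0):
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * first_fraction)
    return shuffled[:cut], shuffled[cut:]   # (first data set, second data set)
```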
At block 730, a classification model is trained based on the first data set and the first sample weight distribution.
In particular, a first data set may be sampled based on a first sample weight distribution, and a classification model may be trained based on the first data set based on labels of unrelated data items in the first data set.
That is, the classification model may be trained using the first data set as a training set. Optionally, prior to training, the first data set may be preprocessed, including but not limited to: feature extraction, cluster analysis, edge detection, image denoising and the like.
The specific structure of the classification model is not limited in the embodiments of the present disclosure; for example, the classification model may be a convolutional neural network including at least a convolutional layer and a fully connected layer.
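For concreteness, a minimal PyTorch sketch of a classification model containing a convolutional layer and a fully connected layer is shown below; the channel counts, pooling size and number of classes are illustrative assumptions.

```python
# A minimal convolutional classification model sketch (architecture details are assumptions).
import torch.nn as nn

class SimpleClassifier(nn.Module):
    def __init__(self, in_channels=3, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1),  # convolutional layer
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)),
        )
        self.classifier = nn.Linear(16 * 8 * 8, num_classes)       # fully connected layer

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))
```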
At block 740, the classification model is evaluated based on the second data set and the second sample weight distribution to obtain an evaluation result indicating a biased significance of the data set to be processed having the sample weight distribution.
That is, the second data set may be used as a test set to obtain the evaluation result. In particular, the evaluation result may be derived based on a comparison between the prediction result of the classification model for the unrelated data item in the second data set and the label of the unrelated data item in the second data set.
As an example, the evaluation result may comprise a first accuracy rate for positive samples in the second data set and a second accuracy rate for negative samples in the second data set.
Therefore, the quantitative representation of the prejudice significance of the data set to be processed can be obtained by constructing the independent data set and training and evaluating the independent data set. Quantitative bias references can be provided, and further adjustment and other operations on the data set to be processed are facilitated.
Illustratively, if the evaluation results obtained at block 740 indicate that the bias is significant (or there is a significant bias), then the sample weight distribution for the set of data to be processed may be updated.
In some embodiments, the sample weight distribution of the data set to be processed is updated if the evaluation result is greater than a preset threshold. Further, the method may then return to 720 to obtain the first data set and the second data set again, and repeat the operations of 730 and 740 until the evaluation result obtained at block 740 indicates that the bias is not significant (or there is no significant bias), e.g., the evaluation result is not greater than the preset threshold. The sample weight distribution in effect when the evaluation result is not greater than the preset threshold may then be taken as the recommended sample weight distribution and output.
The specific way of updating the sample weight distribution in the embodiment of the present disclosure is not limited, and for example, the sample weight distribution may be updated in at least one of the following ways: updating the sample weight distribution by adopting a preset rule, updating the sample weight distribution by adopting a random mode, obtaining the modification of the sample weight distribution by a user to update the sample weight distribution, or optimizing the sample weight distribution by a genetic algorithm to update the sample weight distribution.
In some implementations of the present disclosure, updating the sample weight distribution may update the first sample weight distribution of the first data set, such that upon returning to execution 720, the first sample weight distribution of the first data set in re-execution 720 is updated, and the classification model trained at 730 is also updated.
In another implementation of the present disclosure, updating the sample weight distribution may update a first sample weight distribution of the first data set and update a second sample weight distribution of the second data set. As an example, the sample weight distribution of the dataset to be processed may be updated and the extraneous datasets may be repartitioned. As another example, the sample weight distribution of the data set to be processed may be updated such that the first sample weight distribution and the second sample weight distribution are adaptively updated, but the extraneous data items in the first data set and the second data set are unchanged. Thus, upon returning to execution 720, the first data set in re-execution 720 is updated or the first sample weight distribution of the first data set is updated, and the classification model trained at 730 is also updated.
In another implementation of the present disclosure, updating the sample weight distribution may update the second sample weight distribution of the second data set. Optionally, the first sample weight distribution may remain unchanged. As an example, in this implementation, upon returning to 720, the first data set and the second data set of the previous execution of 720 may be swapped. As such, the first data set when returning to 730 is the second data set of the previous execution. In this way, the data set to be processed can be considered more comprehensively, and the classification model evaluates the bias significance more accurately.
Fig. 9 shows a schematic diagram of a process 900 of updating a sample weight distribution of a data set to be processed according to an embodiment of the present disclosure.
As shown in FIG. 9, at block 910, an unrelated data set is constructed based on the pending data set, the unrelated data set including unrelated data items having tags, the tags of the unrelated data items being determined based on the tags of the pending data items in the pending data set.
At block 920, the unrelated data set is divided into a first data set having a first sample weight distribution and a second data set having a second sample weight distribution, the first and second sample weight distributions being determined based on sample weights of the data items to be processed in the data set to be processed.
At block 930, a classification model is trained based on the first data set and the first sample weight distribution.
At block 940, the classification model is evaluated based on the second data set and the second sample weight distribution to obtain an evaluation result indicating the biased significance of the data set to be processed having the sample weight distribution.
Reference may be made to 710 to 740 described above in connection with fig. 7 for 910 to 940 in fig. 9, respectively, and for brevity, no further description is provided here.
In FIG. 9, at block 950, it is determined whether the evaluation result is greater than a predetermined threshold. If it is determined that the evaluation result is greater than the preset threshold, then 960 is performed. If it is determined that the evaluation result is not greater than the preset threshold, 980 is performed.
At block 960, a second sample weight distribution for the second data set is updated.
As some examples, the sample weights for all of the unrelated data items in the second data set may be updated, or the sample weights for some of the unrelated data items in the second data set may be updated.
As some examples, the second sample weight distribution may be updated based on the prediction of the unrelated data items in the second data set by the classification model at 940.
Specifically, the sample weight of the irrelevant data item in the second data set with correct prediction can be adjusted to be larger, or the sample weight of the irrelevant data item in the second data set with wrong prediction can be adjusted to be smaller. For example, assuming that the sample weight of the first independent data item in the second data set is 2, and the predicted result obtained by inputting the first independent data item in the second data set into the classification model is consistent with the label thereof, the sample weight of the first independent data item in the second data set may be increased, for example, from 2 to 3 or 4 or other values. For example, assuming that the sample weight of the second unrelated data item in the second data set is 2, and the prediction result obtained by inputting the second unrelated data item in the second data set to the classification model is inconsistent with the label of the second unrelated data item, the sample weight of the second unrelated data item in the second data set may be reduced, for example, from 2 to 1.
At block 970, the first data set having the first sample weight distribution is exchanged with the second data set having the updated second sample weight distribution.
It is understood that the first data set after the swap is the second data set in block 920 and the first data set after the swap has the first sample weight distribution that is the second sample weight distribution updated at block 960. The second data set after the swap is the first data set in block 920, and the second data set after the swap has a second sample weight distribution that is the first sample weight distribution in block 920.
After block 970, execution 930 is returned. That is, the classification model is retrained using the first data set after the exchange at 970.
At block 980, a recommended sample weight distribution is output.
Illustratively, the sample weight distribution when the evaluation result is not greater than the preset threshold is taken as the recommended sample weight distribution. Specifically, the recommended sample weight distribution may be determined based on the first sample weight distribution and the second sample weight distribution.
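Putting blocks 930 to 980 together, the following non-authoritative sketch shows one way the train/evaluate/update/swap loop could be organized; `train` and `evaluate` are hypothetical placeholders for the operations described above, and the multiplicative weight-update factors are assumptions of this example.

```python
# A hedged sketch of the loop in FIG. 9; train() and evaluate() are hypothetical placeholders.
def recommend_sample_weights(first_set, w1, second_set, w2, threshold, train, evaluate):
    while True:
        model = train(first_set, w1)                                  # block 930
        result, per_item_correct = evaluate(model, second_set, w2)    # block 940
        if result <= threshold:                                       # block 950 -> 980
            return w1, w2                                             # recommended distribution
        # Block 960: enlarge weights of correctly predicted items in the second set,
        # reduce weights of wrongly predicted ones (one possible update rule).
        w2 = [w * 1.5 if ok else w * 0.5 for w, ok in zip(w2, per_item_correct)]
        # Block 970: swap the two data sets together with their weight distributions.
        first_set, second_set = second_set, first_set
        w1, w2 = w2, w1
```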
In some embodiments of the present disclosure, the regions of interest responsible for the data set bias may be presented visually. In particular, the class activation map may be obtained by inputting the target unrelated data item into the trained classification model. The class activation map is then superimposed on the target unrelated data item to obtain an overlay result, and the overlay result is displayed. As an example, the overlay result may be obtained by a weighted summation of the heat map and the data item, so that by displaying the overlay result it can be visually checked which regions the classification model attends to, these regions of interest being important factors causing the bias.
In some embodiments of the present disclosure, after the recommended sample weight distribution is obtained, the method may optionally further include adjusting the to-be-processed data set based on the recommended sample weight distribution to obtain an unbiased data set.
Illustratively, unbiased datasets can be constructed by adding or deleting to-be-processed datasets.
As an example, the to-be-processed data item with a large recommended sample weight may be copied to expand the number of the to-be-processed data items in the to-be-processed data set. As an example, the to-be-processed data items with small recommended sample weights may be deleted to reduce the number of to-be-processed data items in the to-be-processed data set.
As an example, a deletion instruction of the part of the to-be-processed data item by the user may be acquired to delete the part of the to-be-processed data item. Additional data items entered by the user may be retrieved for addition to the current set of data to be processed.
For example, the user may add or delete a set of data to be processed based on the recommended sample weight distribution. For example, the user may find other samples similar to the to-be-processed data item with a high weight of the recommended sample to be added to the data set as new data items, thereby implementing data supplementation to the data set. As an example, similar other samples may be other images acquired by the same (or same model of) image acquisition device in similar environments (e.g., care conditions, etc.).
In this way, in the embodiment of the present disclosure, the to-be-processed data set can be added or deleted based on the recommended sample weight distribution, so that an unbiased data set can be constructed. Further, the unbiased data set may be used to train a more robust, unbiased model for a particular task.
It is to be understood that the processes described in connection with fig. 7 to 9 in the embodiments of the present disclosure may refer to the functions of the modules and the like described above in connection with fig. 1 to 6, which are not repeated for the sake of brevity.
Fig. 10 shows a schematic block diagram of a data processing device 1000 according to an embodiment of the present disclosure. The apparatus 1000 may be implemented by software, hardware or a combination of both. In some embodiments, the device 1000 may be a software or hardware device that implements some or all of the functionality in the system 100 shown in fig. 1.
As shown in fig. 10, the apparatus 1000 includes a construction unit 1010, a dividing unit 1020, a training unit 1030, and an evaluation unit 1040.
The construction unit 1010 is configured to construct an unrelated data set based on the to-be-processed data set, the unrelated data set including unrelated data items having labels, the labels of the unrelated data items being determined based on the labels of the to-be-processed data items in the to-be-processed data set.
The dividing unit 1020 is configured to divide the unrelated data set into a first data set having a first sample weight distribution and a second data set having a second sample weight distribution, the first and second sample weight distributions being determined based on sample weights of the data items to be processed in the data set to be processed.
The training unit 1030 is configured to train the classification model based on the first data set and the first sample weight distribution.
The evaluation unit 1040 is configured to evaluate the classification model based on the second data set and the second sample weight distribution to obtain an evaluation result, which indicates the biased significance of the data set to be processed with the sample weight distribution.
In some embodiments, the apparatus 1000 may further include an updating unit 1050, an adjusting unit 1060, and a display unit 1070.
The update unit 1050 is configured to: if the evaluation result obtained by the evaluation unit 1040 is greater than the preset threshold, the sample weight distribution of the data set to be processed is updated.
As an example, the updating unit 1050 may be configured to update the portion of the sample weight distribution such that the second sample weight distribution is updated without updating the first sample weight distribution.
In some embodiments, the updating unit 1050 may be configured to update the sample weight distribution by at least one of: updating the sample weight distribution by adopting a preset rule, updating the sample weight distribution by adopting a random mode, and obtaining the modification of a user on the sample weight distribution to update the sample weight distribution or optimizing the sample weight distribution by a genetic algorithm to update the sample weight distribution.
In some embodiments, the updating unit 1050 may be configured to take the sample weight distribution when the evaluation result is not greater than the preset threshold value as the recommended sample weight distribution.
The adjusting unit 1060 is configured to add or delete the to-be-processed data set based on the recommended sample weight distribution to construct an unbiased data set.
The updating unit 1050 is further configured to: obtain a class activation map by inputting the target unrelated data item into the trained classification model; and obtain an overlay result by superimposing the class activation map on the target unrelated data item.
The display unit 1070 is configured to display the recommended sample weight distribution and/or the superimposition result.
In some embodiments, the building unit 1010 may be configured to remove a portion associated with a tag of a target to-be-processed data item from the target to-be-processed data item of the to-be-processed data set to obtain a remaining portion of the target to-be-processed data item; and constructing an unrelated data item in the unrelated data set by using the remaining part, wherein the label of the unrelated data item corresponds to the label of the target data item to be processed.
In some embodiments, the data set to be processed is an image data set, and the constructing unit 1010 may be configured to perform image segmentation on a target data item to be processed in the data set to be processed to obtain a background image corresponding to the target data item to be processed; and constructing an item of extraneous data in the set of extraneous data using the background image.
In some embodiments, the data item to be processed in the data set to be processed is a video sequence, and the constructing unit 1010 may be configured to determine a binary image of the video sequence based on gradient information between one frame of image in the video sequence and a frame of image before the one frame of image; generating a background image of the video sequence based on the binary image; and constructing an item of extraneous data in the set of extraneous data using a background image of the video sequence.
The division of the units in the embodiments of the present disclosure is illustrative, and is only a logical function division, and in actual implementation, there may be another division manner, and in addition, each functional unit in the embodiments of the present disclosure may be integrated in one processor, may also exist alone physically, or may also be integrated in one unit by two or more units. The integrated unit may be implemented in the form of hardware, or may also be implemented in the form of a software functional unit.
The data processing apparatus 1000 shown in fig. 10 can be used to implement the data processing procedures described above in conjunction with fig. 7 to 9.
The present disclosure may also be implemented as a computer program product. The computer program product may include computer-readable program instructions for performing various aspects of the present disclosure. The present disclosure may be implemented as a computer-readable storage medium having stored thereon computer-readable program instructions that, when executed by a processor, cause the processor to perform the above-described data processing procedures.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM) or flash Memory, a Static Random Access Memory (SRAM), a portable Compact disk Read-Only Memory (CD-ROM), a Digital Versatile Disk (DVD), a Memory stick, a floppy disk, a mechanical coding device, such as a punch card or in-groove protrusion structure having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be interpreted as a transitory signal per se, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or an electrical signal transmitted through an electrical wire.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives the computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
Computer-readable program instructions for carrying out operations of the present disclosure may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as programmable logic circuitry, Field-Programmable Gate Arrays (FPGAs), or Programmable Logic Arrays (PLAs), may be personalized by utilizing state information of the computer-readable program instructions, and the electronic circuitry may execute the computer-readable program instructions to implement aspects of the present disclosure.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
These computer-readable program instructions may be provided to a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer readable program instructions.

Claims (22)

1. A method of data processing, comprising:
constructing an unrelated data set based on a to-be-processed data set, wherein the unrelated data set comprises unrelated data items with labels, and the labels of the unrelated data items are determined based on the labels of the to-be-processed data items in the to-be-processed data set;
dividing the unrelated data set into a first data set and a second data set, the first data set having a first sample weight distribution and the second data set having a second sample weight distribution, the first sample weight distribution and the second sample weight distribution being determined based on sample weights of data items to be processed in the data set to be processed;
training a classification model based on the first data set and the first sample weight distribution; and
evaluating the classification model based on the second data set and the second sample weight distribution to obtain an evaluation result indicating the bias significance of the data set to be processed with the sample weight distribution.
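A minimal sketch of the workflow of claim 1, assuming scikit-learn, a logistic-regression classifier, and test accuracy as the bias-significance score; the feature representation of the unrelated data items and every identifier below are illustrative assumptions rather than part of the disclosure.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def evaluate_bias_significance(unrelated_X, labels, sample_weights, seed=0):
    """Train on one half of the unrelated data set, evaluate on the other.

    unrelated_X    : features of the unrelated (e.g. background-only) data items
    labels         : labels inherited from the corresponding to-be-processed items
    sample_weights : per-item weights taken from the to-be-processed data set
    """
    # Divide the unrelated data set into a first (training) and second (evaluation)
    # set, carrying the corresponding portions of the sample weight distribution.
    X_tr, X_te, y_tr, y_te, w_tr, w_te = train_test_split(
        unrelated_X, labels, sample_weights, test_size=0.5, random_state=seed)

    # Train a classification model on the first data set and first weight distribution.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_tr, y_tr, sample_weight=w_tr)

    # Evaluate on the second data set with the second weight distribution.
    score = clf.score(X_te, y_te, sample_weight=w_te)
    return score  # larger score -> more significant bias

# Toy usage: 200 items, 16 background features, binary labels, uniform weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
y = rng.integers(0, 2, size=200)
w = np.ones(200)
print(evaluate_bias_significance(X, y, w))
```

Because the unrelated data items are supposed to carry no information about their inherited labels, an evaluation score well above chance level suggests that the data set to be processed, under the given sample weight distribution, is biased.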
2. The method of claim 1, further comprising:
if the evaluation result is greater than a preset threshold, updating the sample weight distribution of the data set to be processed; and
repeatedly performing the training and the evaluation based on the updated sample weight distribution until the evaluation result is not greater than the preset threshold.
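A sketch of the loop of claim 2 under stated assumptions: the evaluate() routine is a stand-in for the training-and-evaluation step of claim 1, the weights are updated by random perturbation (one of the options later listed in claim 6), and the threshold of 0.6 is an arbitrary illustrative value.

```python
import numpy as np

rng = np.random.default_rng(0)

def evaluate(second_weights):
    # Stand-in for training and evaluating the classification model; in a real
    # run this would be the routine sketched under claim 1.
    return float(np.clip(0.9 - second_weights.std(), 0.0, 1.0))

threshold = 0.6
first_weights = np.ones(100)    # weights of the first (training) data set, kept fixed (claim 3)
second_weights = np.ones(100)   # weights of the second (evaluation) data set

score = evaluate(second_weights)
while score > threshold:
    # Update only the part of the sample weight distribution that belongs to
    # the second data set, as in claim 3, then re-train and re-evaluate.
    second_weights = np.abs(second_weights + rng.normal(scale=0.1, size=second_weights.shape))
    score = evaluate(second_weights)

print("final bias significance:", score, "-> weights kept as recommended (claim 4)")
```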
3. The method of claim 2, wherein updating the sample weight distribution comprises:
updating a portion of the sample weight distribution such that the second sample weight distribution is updated without updating the first sample weight distribution.
4. The method of claim 2 or 3, further comprising:
taking the sample weight distribution obtained when the evaluation result is not greater than the preset threshold as a recommended sample weight distribution.
5. The method of claim 4, further comprising:
adding data items to or deleting data items from the data set to be processed based on the recommended sample weight distribution, so as to construct an unbiased data set.
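One possible reading of claim 5, sketched below: items whose recommended weight is large are duplicated and items whose recommended weight is negligible are dropped; the rounding rule and the 0.5 cut-off are illustrative assumptions.

```python
import numpy as np

def rebalance(items, recommended_weights, drop_below=0.5):
    unbiased = []
    for item, w in zip(items, recommended_weights):
        if w < drop_below:
            continue                                  # delete items with negligible recommended weight
        copies = max(int(round(w)), 1)
        unbiased.extend([item] * copies)              # add extra copies for large weights
    return unbiased

items = ["a", "b", "c", "d"]
weights = np.array([0.2, 1.0, 2.4, 0.9])
print(rebalance(items, weights))   # -> ['b', 'c', 'c', 'd']
```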
6. The method of any of claims 2 to 5, wherein updating the sample weight distribution comprises at least one of:
updating the sample weight distribution using a predetermined rule,
updating the sample weight distribution in a random manner,
obtaining a user modification to the sample weight distribution to update the sample weight distribution, or
optimizing the sample weight distribution by a genetic algorithm to update the sample weight distribution.
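A hedged sketch of the genetic-algorithm option: a population of candidate sample-weight vectors is evolved to minimise a fitness() score that stands in for the training-and-evaluation result of claims 1 and 2. Population size, selection, crossover, and mutation settings are illustrative choices, not taken from the disclosure.

```python
import numpy as np

rng = np.random.default_rng(0)
N_ITEMS, POP, GENERATIONS = 100, 20, 30

def fitness(weights):
    # Stand-in for "train the classification model with these weights and return
    # the evaluation result"; lower means less significant bias.
    return float(np.abs(weights.mean() - 1.0) + weights.std())

population = [np.abs(rng.normal(1.0, 0.2, N_ITEMS)) for _ in range(POP)]
for _ in range(GENERATIONS):
    scored = sorted(population, key=fitness)
    parents = scored[:POP // 2]                           # selection: keep the better half
    children = []
    for _ in range(POP - len(parents)):
        a, b = rng.choice(len(parents), 2, replace=False)
        mask = rng.random(N_ITEMS) < 0.5                  # uniform crossover
        child = np.where(mask, parents[a], parents[b])
        child = np.abs(child + rng.normal(0, 0.05, N_ITEMS))  # mutation
        children.append(child)
    population = parents + children

best = min(population, key=fitness)
print("best bias significance:", fitness(best))
```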
7. The method of any of claims 1 to 6, wherein constructing an unrelated data set based on the pending data set comprises:
removing, from a target to-be-processed data item of the to-be-processed data set, a part associated with the label of the target to-be-processed data item, to obtain a remaining part of the target to-be-processed data item; and
constructing an unrelated data item in the unrelated data set by using the remaining part, wherein the label of the unrelated data item corresponds to the label of the target to-be-processed data item.
8. The method of any of claims 1 to 6, wherein the data set to be processed is an image data set, and wherein constructing the unrelated data set based on the data set to be processed comprises:
performing image segmentation on a target data item to be processed in the data set to be processed to obtain a background image corresponding to the target data item to be processed; and
constructing an unrelated data item in the unrelated data set using the background image.
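A sketch of claim 8, assuming OpenCV: Otsu thresholding stands in for the unspecified image-segmentation step, and the segmented foreground is blacked out so that only the background remains as the unrelated data item.

```python
import cv2
import numpy as np

def background_from_image(image_bgr):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    # Segment foreground vs. background (assumption: bright object on a darker background).
    _, fg_mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    background = image_bgr.copy()
    background[fg_mask > 0] = 0     # remove the label-associated (foreground) part
    return background               # the remaining background forms the unrelated data item

# Toy usage with a synthetic image: a bright square on a dim noisy background.
img = (np.random.rand(64, 64, 3) * 60).astype(np.uint8)
img[20:40, 20:40] = 255
bg = background_from_image(img)
```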
9. The method of any of claims 1 to 6, wherein the data items to be processed in the data set to be processed are video sequences, and wherein constructing the unrelated data set based on the data set to be processed comprises:
determining a binary image of the video sequence based on gradient information between a frame of the video sequence and the frame preceding that frame;
generating a background image of the video sequence based on the binary image; and
constructing an unrelated data item in the unrelated data set using the background image of the video sequence.
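A sketch of claim 9 under stated assumptions: the inter-frame gradient is approximated by an absolute frame difference, the binary image is obtained by thresholding that difference, and the background image is a per-pixel temporal median with moving pixels zeroed out. The threshold value is illustrative.

```python
import numpy as np

def background_from_video(frames, motion_threshold=15):
    """frames: array of grayscale frames with shape (T, H, W), uint8."""
    frames = np.asarray(frames, dtype=np.int16)
    # Binary motion image: True where the gradient between consecutive frames is large.
    diffs = np.abs(frames[1:] - frames[:-1])
    binary = (diffs > motion_threshold).any(axis=0)       # union of moving pixels
    # Background image: per-pixel median over time, with moving pixels removed.
    background = np.median(frames, axis=0).astype(np.uint8)
    background[binary] = 0
    return background    # used to construct the unrelated data item

# Toy usage: 10 frames of static noise plus a moving bright block.
rng = np.random.default_rng(0)
video = rng.integers(0, 50, size=(10, 64, 64), dtype=np.uint8)
for t in range(10):
    video[t, 10 + t:20 + t, 10:20] = 255
bg = background_from_video(video)
```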
10. The method of any of claims 1 to 9, further comprising:
obtaining a class activation map (CAM) by inputting a target unrelated data item into the trained classification model;
obtaining a superposition result by superposing the CAM on the target unrelated data item; and
displaying the superposition result.
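A sketch of claim 10, assuming OpenCV for display: the CAM produced by the trained classification model (its computation is not shown here) is resized, colour-mapped, and superposed on the target unrelated data item; the colour map and the 0.5 blending factor are illustrative choices.

```python
import cv2
import numpy as np

def overlay_cam(image_bgr, cam):
    """image_bgr: (H, W, 3) uint8; cam: (h, w) float map scaled to [0, 1]."""
    cam_resized = cv2.resize(cam.astype(np.float32),
                             (image_bgr.shape[1], image_bgr.shape[0]))
    heatmap = cv2.applyColorMap((cam_resized * 255).astype(np.uint8), cv2.COLORMAP_JET)
    # Superpose the CAM on the unrelated data item and return the result for display.
    return cv2.addWeighted(image_bgr, 0.5, heatmap, 0.5, 0)

# Toy usage with a synthetic image and a synthetic CAM peaked at the centre.
img = np.full((64, 64, 3), 128, dtype=np.uint8)
yy, xx = np.mgrid[0:16, 0:16]
cam = np.exp(-((xx - 8) ** 2 + (yy - 8) ** 2) / 32.0)
result = overlay_cam(img, cam)
cv2.imwrite("cam_overlay.png", result)   # or cv2.imshow(...) to display interactively
```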
11. A data processing apparatus comprising:
a construction unit configured to construct an unrelated data set based on a to-be-processed data set, the unrelated data set including unrelated data items having labels, the labels of the unrelated data items being determined based on the labels of the to-be-processed data items in the to-be-processed data set;
a dividing unit configured to divide the unrelated data set into a first data set and a second data set, the first data set having a first sample weight distribution and the second data set having a second sample weight distribution, the first and second sample weight distributions being determined based on sample weights of data items to be processed in the data set to be processed;
a training unit configured to train a classification model based on the first data set and the first sample weight distribution; and
an evaluation unit configured to evaluate the classification model based on the second data set and the second sample weight distribution to obtain an evaluation result indicating a bias significance of the data set to be processed having the sample weight distribution.
12. The apparatus of claim 11, further comprising an update unit configured to:
update the sample weight distribution of the data set to be processed if the evaluation result is greater than a preset threshold.
13. The apparatus of claim 12, wherein the updating unit is configured to:
update a portion of the sample weight distribution such that the second sample weight distribution is updated without updating the first sample weight distribution.
14. The apparatus according to claim 12 or 13, wherein the updating unit is configured to:
take the sample weight distribution obtained when the evaluation result is not greater than the preset threshold as a recommended sample weight distribution.
15. The apparatus of claim 14, further comprising an adjustment unit configured to:
add data items to or delete data items from the data set to be processed based on the recommended sample weight distribution, so as to construct an unbiased data set.
16. The apparatus according to any of claims 12 to 15, wherein the updating unit is configured to update the sample weight distribution by at least one of:
updating the sample weight distribution using a predetermined rule,
updating the sample weight distribution in a random manner,
obtaining a user modification to the sample weight distribution to update the sample weight distribution, or
optimizing the sample weight distribution by a genetic algorithm to update the sample weight distribution.
17. The apparatus according to any of claims 11 to 16, wherein the construction unit is configured to:
remove, from a target to-be-processed data item of the to-be-processed data set, a part associated with the label of the target to-be-processed data item, to obtain a remaining part of the target to-be-processed data item; and
construct an unrelated data item in the unrelated data set using the remaining part, the label of the unrelated data item corresponding to the label of the target to-be-processed data item.
18. The apparatus of any of claims 11 to 16, wherein the data set to be processed is an image data set, and wherein the construction unit is configured to:
perform image segmentation on a target data item to be processed in the data set to be processed to obtain a background image corresponding to the target data item to be processed; and
construct an unrelated data item in the unrelated data set using the background image.
19. The apparatus according to any of claims 11 to 16, wherein the data items to be processed in the data set to be processed are video sequences, and wherein the construction unit is configured to:
determine a binary image of the video sequence based on gradient information between a frame of the video sequence and the frame preceding that frame;
generate a background image of the video sequence based on the binary image; and
construct an unrelated data item in the unrelated data set using the background image of the video sequence.
20. The apparatus of any of claims 11 to 19, further comprising:
an update unit configured to: obtain a class activation map (CAM) by inputting a target unrelated data item into the trained classification model, and obtain a superposition result by superposing the CAM on the target unrelated data item; and
a display unit configured to display the superimposition result.
21. A computing device comprising a processor and a memory, wherein the processor reads and executes a computer program stored in the memory, causing the computing device to perform the method of any one of claims 1 to 10.
22. A computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, carries out the method according to any one of claims 1 to 10.
CN202110574231.3A 2021-05-25 2021-05-25 Data processing method, data processing device, computing equipment and computer readable storage medium Pending CN115471714A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110574231.3A CN115471714A (en) 2021-05-25 2021-05-25 Data processing method, data processing device, computing equipment and computer readable storage medium
PCT/CN2022/083841 WO2022247448A1 (en) 2021-05-25 2022-03-29 Data processing method and apparatus, computing device, and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110574231.3A CN115471714A (en) 2021-05-25 2021-05-25 Data processing method, data processing device, computing equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN115471714A true CN115471714A (en) 2022-12-13

Family

ID=84229488

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110574231.3A Pending CN115471714A (en) 2021-05-25 2021-05-25 Data processing method, data processing device, computing equipment and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN115471714A (en)
WO (1) WO2022247448A1 (en)

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102915450B (en) * 2012-09-28 2016-11-16 常州工学院 The object region tracking that a kind of online adaptive adjusts
US11392852B2 (en) * 2018-09-10 2022-07-19 Google Llc Rejecting biased data using a machine learning model
US11410041B2 (en) * 2018-11-27 2022-08-09 Wipro Limited Method and device for de-prejudicing artificial intelligence based anomaly detection
US11775863B2 (en) * 2019-05-22 2023-10-03 Oracle International Corporation Enforcing fairness on unlabeled data to improve modeling performance
CN112115963B (en) * 2020-07-30 2024-02-20 浙江工业大学 Method for generating unbiased deep learning model based on transfer learning
CN112508580A (en) * 2021-02-03 2021-03-16 北京淇瑀信息科技有限公司 Model construction method and device based on rejection inference method and electronic equipment

Also Published As

Publication number Publication date
WO2022247448A1 (en) 2022-12-01

Similar Documents

Publication Publication Date Title
US11367271B2 (en) Similarity propagation for one-shot and few-shot image segmentation
CN111476284B (en) Image recognition model training and image recognition method and device and electronic equipment
US11657602B2 (en) Font identification from imagery
US11416772B2 (en) Integrated bottom-up segmentation for semi-supervised image segmentation
US20190279014A1 (en) Method and apparatus for detecting object keypoint, and electronic device
CN108280477B (en) Method and apparatus for clustering images
CN109086811B (en) Multi-label image classification method and device and electronic equipment
CN110910391B (en) Video object segmentation method for dual-module neural network structure
CN111488873B (en) Character level scene text detection method and device based on weak supervision learning
CN111582409A (en) Training method of image label classification network, image label classification method and device
US20220092407A1 (en) Transfer learning with machine learning systems
CN114677565B (en) Training method and image processing method and device for feature extraction network
CN115861462B (en) Training method and device for image generation model, electronic equipment and storage medium
JP2024513596A (en) Image processing method and apparatus and computer readable storage medium
Ayyar et al. Review of white box methods for explanations of convolutional neural networks in image classification tasks
CN113569852A (en) Training method and device of semantic segmentation model, electronic equipment and storage medium
Yadav et al. An improved deep learning-based optimal object detection system from images
Wang et al. End-to-end trainable network for superpixel and image segmentation
CN111161238A (en) Image quality evaluation method and device, electronic device, and storage medium
US20240028828A1 (en) Machine learning model architecture and user interface to indicate impact of text ngrams
CN113223011A (en) Small sample image segmentation method based on guide network and full-connection conditional random field
CN116109907B (en) Target detection method, target detection device, electronic equipment and storage medium
JP2020123329A (en) Allocation of relevance score of artificial neural network
CN116258937A (en) Small sample segmentation method, device, terminal and medium based on attention mechanism
CN115471714A (en) Data processing method, data processing device, computing equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination