CN113033612A - Image classification method and device - Google Patents

Image classification method and device

Info

Publication number
CN113033612A
CN113033612A (application CN202110209031.8A)
Authority
CN
China
Prior art keywords
image
foreground
training
sample
extractor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110209031.8A
Other languages
Chinese (zh)
Inventor
黄高
王朝飞
宋士吉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202110209031.8A
Publication of CN113033612A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/40: Extraction of image or video features
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

An embodiment of the present application provides an image classification method and device. The method includes the following steps: acquiring an object image to be classified, a foreground object extractor, and a final classifier; inputting the object image into the foreground object extractor to obtain a foreground object image; and inputting the foreground object image into the final classifier to obtain a classification result. The final classifier consists of an image feature extractor and a small-sample object classifier. The image feature extractor is obtained by training with foreground object images of known-type objects having a large number of samples as training data; the small-sample object classifier is obtained by training with expanded foreground object images of new-type objects as training data, where the expansion method performs pose transformation on the foreground object images of the new-type objects with a pose transformation generator to obtain expanded foreground object images. The scheme of this embodiment effectively solves the problem of classifying fine-grained images under small-sample conditions and effectively improves the classification accuracy.

Description

Image classification method and device
Technical Field
The present disclosure relates to image recognition technologies, and more particularly, to an image classification method and apparatus.
Background
Fine-grained image classification has wide research demand and application scenarios in both industry and academia. Its purpose is to distinguish different subclasses within one large class, for example different species of birds or different models of cars, and it has been a hot research topic in the field of computer vision in recent years. The rapid development of deep learning has greatly improved the accuracy of fine-grained image classification, but such techniques generally rely on a large number of labeled samples for model training. However, many practical application scenarios, such as mechanical fault detection, medical image recognition, and deep-sea organism recognition, frequently face a shortage of labeled samples because category data are scarce, labeling cost is high, and so on; this is the problem of small-sample fine-grained image classification. This problem is more challenging than either the fine-grained image classification problem or the ordinary small-sample image classification problem, because it inherits the difficulties of both: first, in fine-grained image classification the intra-class variance is large and the inter-class variance is small; second, in small-sample image classification the samples are too few to train a deep learning model.
Disclosure of Invention
The embodiments of the present application provide an image classification method and device, which can effectively solve the problem of classifying fine-grained images under small-sample conditions and effectively improve the classification accuracy.
An embodiment of the present application provides an image classification method, which may include the following steps:
acquiring an object image to be classified, a pre-trained foreground object extractor, and a final classifier;
inputting the object image to be classified into the foreground object extractor, and acquiring a foreground object image of the object image to be classified;
inputting the foreground object image of the object image to be classified into the final classifier, and outputting, by the final classifier, a classification result for the object image to be classified;
wherein the final classifier consists of a pre-trained image feature extractor and a small-sample object classifier;
the image feature extractor is obtained by training with foreground object images of known-type objects having a large number of samples as training data;
the small-sample object classifier is obtained by training with expanded foreground object images of new-type objects as training data; the expansion method includes performing pose transformation on the foreground object images of the new-type objects with a pose transformation generator to obtain expanded foreground object images.
In an exemplary embodiment of the present application, the foreground object extractor may include: a salient object detection model and an image operation module comprising a plurality of image operations;
wherein the image operations may comprise, in sequence, one or more of the following: a pixel-level logic operation, a multiplication operation, a masking operation, and a zoom operation;
the salient object detection model is used for acquiring a saliency map of its input image, the saliency map being a feature map containing the pose of the foreground object; the image operation module is used for performing image processing on the saliency map.
In an exemplary embodiment of the present application, the obtaining the foreground object extractor may include: acquiring the salient object detection model;
the obtaining the salient object detection model may include: training a constructed convolutional neural network with a first training set consisting of images of known-type objects and the saliency maps corresponding to those images, to obtain the salient object detection model.
In an exemplary embodiment of the present application, the method of acquiring the image feature extractor may include:
generating foreground object images of first images based on the first images of known-type objects having a large number of samples and the foreground object extractor, and training a convolutional neural network with the foreground object images to obtain the image feature extractor.
In an exemplary embodiment of the present application, the method for acquiring the small-sample object classifier may include:
generating new samples of the new-type objects based on first images of known-type objects having a large number of samples, second images of new-type objects having a small number of samples, a pre-trained foreground object extractor, and a pre-trained pose transformation generator;
expanding the new samples into the small sample set to constitute an expanded sample set for the new-type objects;
and training with the expanded sample set to obtain the small-sample object classifier.
In an exemplary embodiment of the present application, the generating new samples of the new-type objects based on first images of known-type objects having a large number of samples, second images of new-type objects having a small number of samples, a pre-trained foreground object extractor, and a pre-trained pose transformation generator may include:
setting type labels for the first image set and the second image set respectively, and acquiring the foreground object extractor and the pose transformation generator;
inputting the first image set and the second image set provided with type labels into the pre-trained foreground object extractor respectively, and acquiring a first foreground object image set for the known-type objects and a second foreground object image set for the new-type objects respectively;
inputting the first foreground object image set and the second foreground object image set into the pose transformation generator to transform the pose of the new-type objects into the pose of the known-type objects, and outputting, by the pose transformation generator, pose-transformed images of the new-type objects;
and taking the acquired plurality of pose-transformed images as new samples of the new-type objects.
In an exemplary embodiment of the present application, the method for acquiring the pose transformation generator may include:
extracting a multiplet training set from a second training set consisting of the first images of known-type objects having a large number of samples;
and training a constructed auto-encoder neural network with the multiplet training set to obtain the pose transformation generator.
In an exemplary embodiment of the present application, the multiplet training set comprises a quadruplet training set;
the extracting a multiplet training set from a second training set consisting of the first images of known-type objects having a large number of samples may include:
obtaining a plurality of quadruplet data pairs {A1, A2, B1, B2} from the second training set to form the quadruplet training set;
wherein a quadruplet data pair {A1, A2, B1, B2} satisfies: A1 and A2 come from one known type, B1 and B2 come from another known type, the similarity between the saliency map of A1 and the saliency map of B1 is greater than or equal to a preset similarity threshold, the similarity between the saliency map of A2 and the saliency map of B2 is greater than or equal to the similarity threshold, and the pose change from A1 to A2 is consistent with the pose change from B1 to B2.
In an exemplary embodiment of the present application, the training the constructed auto-encoder network with the multiplet training set to obtain the pose transformation generator may include:
taking {A1, A2, B1} as the input and B2 as the target output of the pose transformation generator, and taking B̂2 as the actual output of the pose transformation generator;
optimizing a preset loss function according to the target output B2, the actual output B̂2, and the class label y of the target output B2, determining that training is finished when the loss value of the loss function meets a preset requirement, and using the trained auto-encoder network as the pose transformation generator.
In an exemplary embodiment of the present application, the loss function may include:
Loss = MSE(B̂2, B2) + λ · CE(B̂2, y)
where Loss is the loss value, MSE(B̂2, B2) is the mean square error between the actual output and the target output, CE(B̂2, y) is the cross-entropy loss obtained by taking B̂2 as the input with class label y, and λ is a hyper-parameter that adjusts the ratio between the mean square error and the cross-entropy loss.
In an exemplary embodiment of the present application, the small-sample object classifier may be a cosine-distance-based classifier.
An embodiment of the present application further provides an image classification apparatus, which may include a processor and a computer-readable storage medium storing instructions, wherein when the instructions are executed by the processor, the image classification method according to any one of the above is implemented.
Compared with the related art, an embodiment of the present application may include: acquiring an object image to be classified, a pre-trained foreground object extractor, and a final classifier; inputting the object image to be classified into the foreground object extractor, and acquiring a foreground object image of the object image to be classified; inputting the foreground object image of the object image to be classified into the final classifier, and outputting, by the final classifier, a classification result for the object image to be classified; wherein the final classifier consists of a pre-trained image feature extractor and a small-sample object classifier; the image feature extractor is obtained by training with foreground object images of known-type objects having a large number of samples as training data; the small-sample object classifier is obtained by training with expanded foreground object images of new-type objects as training data; and the expansion method includes performing pose transformation on the foreground object images of the new-type objects with a pose transformation generator to obtain expanded foreground object images.
Through the scheme of this embodiment, the problem of classifying fine-grained images under small-sample conditions is effectively solved, and the classification accuracy is effectively improved.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the application. Other advantages of the present application may be realized and attained by the instrumentalities and combinations particularly pointed out in the specification and the drawings.
Drawings
The accompanying drawings are included to provide an understanding of the present disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the examples serve to explain the principles of the disclosure and not to limit the disclosure.
FIG. 1 is a flowchart of an image classification method according to an embodiment of the present application;
fig. 2 is a schematic diagram of the basic structure of the Baseline++ method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of the overall architecture and workflow of the FOT method according to an embodiment of the present application;
FIG. 4 is a schematic diagram illustrating a process of processing an image by a foreground object extractor according to an embodiment of the present application;
FIG. 5 is a flowchart of an acquisition method of the small-sample object classifier according to an embodiment of the present application;
FIG. 6 is a flowchart of an acquisition method of the pose transformation generator according to an embodiment of the present application;
FIG. 7 is a schematic diagram of the network structure of the pose transformation generator according to an embodiment of the present application;
FIG. 8 is a schematic diagram of an example of a new type of image generated according to the FOT method in an embodiment of the present application;
fig. 9 is a block diagram of an image classification apparatus according to an embodiment of the present application.
Detailed Description
The present application describes embodiments, but the description is illustrative rather than limiting and it will be apparent to those of ordinary skill in the art that many more embodiments and implementations are possible within the scope of the embodiments described herein. Although many possible combinations of features are shown in the drawings and discussed in the detailed description, many other combinations of the disclosed features are possible. Any feature or element of any embodiment may be used in combination with or instead of any other feature or element in any other embodiment, unless expressly limited otherwise.
The present application includes and contemplates combinations of features and elements known to those of ordinary skill in the art. The embodiments, features and elements disclosed in this application may also be combined with any conventional features or elements to form a unique inventive concept as defined by the claims. Any feature or element of any embodiment may also be combined with features or elements from other inventive aspects to form yet another unique inventive aspect, as defined by the claims. Thus, it should be understood that any of the features shown and/or discussed in this application may be implemented alone or in any suitable combination. Accordingly, the embodiments are not limited except as by the appended claims and their equivalents. Furthermore, various modifications and changes may be made within the scope of the appended claims.
Further, in describing representative embodiments, the specification may have presented the method and/or process as a particular sequence of steps. However, to the extent that the method or process does not rely on the particular order of steps set forth herein, the method or process should not be limited to the particular sequence of steps described. Other orders of steps are possible as will be understood by those of ordinary skill in the art. Therefore, the particular order of the steps set forth in the specification should not be construed as limitations on the claims. Further, the claims directed to the method and/or process should not be limited to the performance of their steps in the order written, and one skilled in the art can readily appreciate that the sequences may be varied and still remain within the spirit and scope of the embodiments of the present application.
In the exemplary embodiments of the present application, the two difficulties raised in the background are analyzed as follows:
For difficulty one, current fine-grained image classification methods usually rely on additional information such as annotation boxes and local region positions to assist in identifying the important regions of an image, or capture similar information with deep learning methods. However, these approaches still require a large number of labeled samples and cannot be applied well in the small-sample image classification setting.
For difficulty two, current small-sample image classification methods often address the shortage of samples through sample expansion, using image cropping, mirror transformation, moderate rotation, and similar operations. This is clearly effective for the general small-sample image classification problem, but its effect is very limited when fine-grained images are the object. To explore the source of this problem, two phenomena were carefully observed. First, images of different categories in fine-grained image classification have similar backgrounds; for example, the living habits of different birds do not differ greatly, and their images usually take woods, water, sky, and so on as the background. Second, the key features for distinguishing categories are all located on the foreground object of the image; for example, local features such as beak, color, and tail feathers are often needed to distinguish birds. The first phenomenon shows that the image background plays a mostly negative role in small-sample fine-grained image classification, while the second phenomenon shows that the image foreground object plays a mostly positive role. Therefore, the mainstream sample expansion methods adopted in current small-sample image classification do not treat the foreground and background of an image differently, and expanding samples directly through whole-image transformations can hardly achieve a good effect, because the negative effect of the background is expanded along with the samples.
Based on the above analysis, the present application provides an image classification method. As shown in fig. 1, the method may include steps S101 to S103:
S101, acquiring an object image to be classified, and a pre-trained foreground object extractor and final classifier;
S102, inputting the object image to be classified into the foreground object extractor, and acquiring a foreground object image of the object image to be classified;
S103, inputting the foreground object image of the object image to be classified into the final classifier, and outputting, by the final classifier, a classification result for the object image to be classified;
wherein the final classifier consists of a pre-trained image feature extractor and a small-sample object classifier;
the image feature extractor is obtained by training with foreground object images of known-type objects having a large number of samples as training data;
the small-sample object classifier is obtained by training with expanded foreground object images of new-type objects as training data; the expansion method includes performing pose transformation on the foreground object images of the new-type objects with a pose transformation generator to obtain expanded foreground object images.
In an exemplary embodiment of the present application, the large-sample object classifier refers to a classifier that can only classify known-type objects for which a large number of samples are available, while the small-sample object classifier is a classifier that can classify new-type objects for which only a small number of samples are available.
In an exemplary embodiment of the present application, the foreground object extractor is configured to extract the foreground object image of the input object image to be classified, the image feature extractor is configured to extract a feature map of the foreground object image, and the small-sample object classifier is configured to perform type classification based on the input feature map.
In the exemplary embodiment of the application, a small-sample fine-grained image classification method based on foreground object pose transformation is provided. The scheme is mainly based on a trained final classification model: after a test picture (i.e., an object image to be classified) of a new-type object passes through the foreground object extractor to obtain its foreground object image, the foreground object image is used as the input of the final classification model; image features are extracted from the foreground object image by the image feature extractor, and the extracted feature map is classified by the small-sample object classifier, so that the class label corresponding to the test picture is obtained. In this way, the problem of classifying fine-grained images under small-sample conditions can be effectively solved, and the classification accuracy effectively improved.
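As a concrete illustration of this inference pipeline, the following sketch uses hypothetical PyTorch code; the argument names foreground_extractor, feature_extractor, and cosine_classifier are assumptions introduced here, not identifiers from the original disclosure.

```python
import torch

def classify_image(image, foreground_extractor, feature_extractor, cosine_classifier):
    """Classify one test image: extract the foreground, extract features, classify.

    All three arguments are assumed to be pre-trained torch.nn.Module objects
    (the names are illustrative, not taken from the patent text).
    """
    with torch.no_grad():
        # Step 1: remove the background and crop/zoom the foreground object.
        foreground = foreground_extractor(image)               # (3, H, W) -> (3, 224, 224)
        # Step 2: embed the foreground image with the CNN trained on known types.
        features = feature_extractor(foreground.unsqueeze(0))  # (1, D)
        # Step 3: score against the small-sample (new-type) classifier.
        logits = cosine_classifier(features)                   # (1, num_new_classes)
    return logits.argmax(dim=1).item()
```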
In an exemplary embodiment of the application, the small-sample image classification method based on foreground pose transformation (FOT for short) proposed in this embodiment is a complete new method formed by adding two new modules, a foreground object extractor and a pose transformation generator, to the small-sample image classification baseline method Baseline++ used as the basic framework. The basic structure of the Baseline++ method is shown in fig. 2 and can be divided into two stages: 1. a training stage (first stage), in which the image feature extractor and a large-sample object classifier are trained with known-type object images that have a sufficient number of samples; 2. a fine-tuning stage (second stage), in which the image feature extractor is kept fixed, the limited new samples are expanded, and the small-sample object classifier is trained with the expanded samples.
In an exemplary embodiment of the present application, the overall architecture of the FOT method proposed in this embodiment may be as shown in fig. 3, and the workflow may include:
(I) Foreground object extraction. The known-type and new-type images are processed by the foreground object extractor to obtain enlarged foreground objects of all images with the background removed; the detailed process can be as shown in fig. 4.
(II) Training on the known types. The known types have a large number of labeled samples; an image feature extraction network based on a convolutional neural network (CNN) and a classifier for the known types are obtained by training on the known-type images (i.e., the foreground object images) processed in the previous step.
(III) Training the pose transformation generator. A qualified multiplet training set (e.g., a quadruplet data set) is constructed on the known types, and the pose transformation generator is trained on this multiplet training set.
(IV) Generating new-type samples. To address the shortage of new-type samples, each new-type sample can be used to find a known-class sample with a similar saliency map, and the pose transformation of that known sample is transferred onto the new-type sample, thereby generating additional new-type samples.
(V) Training the small-sample object classifier. After the generated new-type samples are added to the new types, the new-type classifier is trained with the expanded new-type data set.
(VI) Through the above steps, a complete classifier (i.e., the final classifier) is obtained, which can be formed by sequentially connecting the image feature extractor and the small-sample object classifier. When a new-class test image (i.e., an object image to be classified) is used as input, the corresponding class label is obtained through image feature extraction and new-type classification.
In an exemplary embodiment of the present application, as can be seen from the above steps, the scheme for obtaining the final classifier may include: first, designing a foreground object extractor, which consists of a saliency detection model and several image operations, and extracting foreground object images from the object images of the known types and the new types; training with the acquired known-type foreground object images to obtain a neural-network-based image feature extractor; constructing a quadruplet data set (also called a quadruplet training set) and training an auto-encoder neural network structure with it to obtain the pose transformation generator; performing pose transformation on the new-type images with the pose transformation generator to generate additional samples (i.e., new samples of the new-type objects); training with the expanded data set to obtain the classifier for the new-type object images (i.e., the small-sample object classifier); and sequentially connecting the image feature extractor and the small-sample object classifier to form the final classifier.
In the exemplary embodiments of the present application, the acquisition methods for the components of the final classifier are described in detail below.
In an exemplary embodiment of the present application, the foreground object extractor may include: a salient object detection model and an image operation module comprising a plurality of image operations;
wherein the image operations may comprise, in sequence, one or more of the following: a pixel-level logic operation, a multiplication operation, a masking operation, and a zoom operation;
the salient object detection model is used for acquiring a saliency map of its input image, the saliency map being a feature map containing the pose of the foreground object; the image operation module is used for performing image processing on the saliency map.
In exemplary embodiments of the present application, the foreground object extractor may be composed of a pre-trained salient object detection model (BASNet) f, a pixel-level logic operation σ, a multiplication operation ⊗, a masking operation g, and a zoom operation z, as shown in fig. 4.
In an exemplary embodiment of the present application, the pre-trained salient object detection model adopts the BASNet network model; other salient object detection models may also be adopted.
In an exemplary embodiment of the present application, the obtaining the foreground object extractor may include: acquiring the salient object detection model;
the obtaining the salient object detection model may include: training a constructed convolutional neural network with a first training set consisting of images of known-type objects and the saliency maps corresponding to those images, to obtain the salient object detection model.
In the exemplary embodiment of the present application, in practical applications, a three-channel RGB (red, green, blue) image X is taken as input, and the final foreground object picture Y can be obtained through 5 processing steps:
(I) obtaining the saliency map f(X) of the input image X with the pre-trained BASNet model;
(II) performing a pixel-level logic operation on the saliency map to obtain a single-channel 0-1 logic map σ(f(X)); this operation is carried out by giving a threshold γ: when the mean value of the three channels of a pixel is greater than γ, the pixel value is set to 1, and otherwise it is set to 0;
(III) multiplying the single-channel 0-1 logic map with the original image, so that the input image becomes an image with a black background, σ(f(X)) ⊗ X;
(IV) using a masking operation to crop the image and obtain only the region (e.g., a rectangular region) containing the foreground object, g(σ(f(X)) ⊗ X), where the size and position of the mask frame are automatically read from the logic map of the previous step;
(V) enlarging the region containing the foreground object to a uniform size with the zoom operation, as the input of the subsequent deep learning model, giving z(g(σ(f(X)) ⊗ X)).
In an exemplary embodiment of the present application, combining the above steps, the input-output relationship may be expressed as:
Y = z(g(σ(f(X)) ⊗ X))
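A minimal sketch of these five steps, assuming a PyTorch setting and a simplified BASNet interface that directly returns a saliency map (the real BASNet returns several side outputs), is given below; the threshold value gamma=0.5 and the output size 224 are illustrative assumptions, not values from the patent.

```python
import torch
import torch.nn.functional as F

def extract_foreground(x, basnet, gamma=0.5, out_size=224):
    """Foreground object extraction: Y = z(g(sigma(f(X)) * X)).

    x      : (3, H, W) RGB image tensor with values in [0, 1]
    basnet : pre-trained salient object detection model (assumed to return a saliency map)
    gamma  : threshold of the pixel-level logic operation (value is an assumption)
    """
    with torch.no_grad():
        sal = basnet(x.unsqueeze(0)).squeeze(0)                    # f(X): saliency map
    # Pixel-level logic operation sigma: mean over channels, threshold at gamma.
    logic = (sal.mean(dim=0, keepdim=True) > gamma).float()       # (1, H, W), values in {0, 1}
    # Multiplication: black out the background.
    fg = x * logic                                                 # sigma(f(X)) * X
    # Masking g: crop the bounding rectangle of the foreground pixels.
    ys, xs = torch.nonzero(logic[0], as_tuple=True)
    if ys.numel() == 0:                                            # no salient pixel found; fall back to the full image
        crop = fg
    else:
        crop = fg[:, ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    # Zoom z: enlarge the cropped region to a uniform size.
    y = F.interpolate(crop.unsqueeze(0), size=(out_size, out_size),
                      mode='bilinear', align_corners=False).squeeze(0)
    return y
```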
in an exemplary embodiment of the present application, the method of acquiring the image feature extractor may include:
and generating a foreground target image of the first image based on the first image of the object of the known type of the large sample and the foreground target extractor, and training a convolutional neural network by adopting the foreground target image to obtain the image feature extractor.
In an exemplary embodiment of the present application, the structure of the image feature extractor is typically a convolutional neural network (CNN), and its training data set uses the foreground object images of the known types.
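The following sketch illustrates this first training stage under stated assumptions: a ResNet-18 backbone, an Adam optimizer, and cross-entropy training of the large-sample classifier are choices made here for illustration and are not prescribed by the original text.

```python
import torch
import torch.nn as nn
from torchvision import models

def train_feature_extractor(loader, num_known_classes, epochs=30, lr=1e-3, device='cuda'):
    """Stage-1 training: fit a CNN feature extractor plus a large-sample classifier
    on foreground object images of the known types.  The backbone and the
    hyper-parameters are illustrative assumptions."""
    backbone = models.resnet18(weights=None)
    feat_dim = backbone.fc.in_features
    backbone.fc = nn.Identity()                       # keep the CNN as a pure feature extractor
    classifier = nn.Linear(feat_dim, num_known_classes)
    model = nn.Sequential(backbone, classifier).to(device)

    opt = torch.optim.Adam(model.parameters(), lr=lr)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for fg_images, labels in loader:              # loader yields (foreground image, known-type label)
            fg_images, labels = fg_images.to(device), labels.to(device)
            loss = ce(model(fg_images), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return backbone                                    # the trained image feature extractor
```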
In an exemplary embodiment of the present application, as shown in fig. 5, the method for acquiring the small-sample object classifier may include steps S201 to S203:
S201, generating new samples of the new-type objects based on first images of known-type objects having a large number of samples, second images of new-type objects having a small number of samples, the pre-trained foreground object extractor, and a pre-trained pose transformation generator.
In an exemplary embodiment of the present application, as shown in fig. 6, the method for acquiring the pose transformation generator may include steps S301 to S302:
S301, extracting a multiplet training set from a second training set consisting of the first images of known-type objects having a large number of samples.
In an exemplary embodiment of the present application, the multiplet training set comprises a quadruplet training set;
the extracting a multiplet training set from a second training set consisting of the first images of known-type objects having a large number of samples may include:
obtaining a plurality of quadruplet data pairs {A1, A2, B1, B2} from the second training set to form the quadruplet training set;
wherein a quadruplet data pair {A1, A2, B1, B2} satisfies: A1 and A2 come from one known type, B1 and B2 come from another known type, the similarity between the saliency map of A1 and the saliency map of B1 is greater than or equal to a preset similarity threshold, the similarity between the saliency map of A2 and the saliency map of B2 is greater than or equal to the similarity threshold, and the pose change from A1 to A2 is consistent with the pose change from B1 to B2.
In an exemplary embodiment of the present application, the pose transformation generator is an auto-encoder neural network, the architecture of which is shown in fig. 7. It is a deep learning model, so besides the network structure itself the emphasis is on model training, and its training data set uses a multiplet training set (e.g., a quadruplet training set) constructed from the known types. For this purpose, a saliency-map-based multiplet training set construction algorithm is proposed on the known types (base classes): the training set of the deep learning model is composed of a large number of quadruplet data pairs {A1, A2, B1, B2}, in which A1 and A2 come from one known type, B1 and B2 come from another known type, the saliency map of A1 is similar to that of B1, and the saliency map of A2 is similar to that of B2. Since a saliency map can represent the pose of a foreground object, the pose change from A1 to A2 is then consistent with the pose change from B1 to B2. The formalization is described as follows:
find all {A1, A2, B1, B2} such that A1, A2 ∈ c_i, B1, B2 ∈ c_j (i ≠ j), Distance(S_A1, S_B1) ≤ α, Distance(S_A2, S_B2) ≤ β,
where c_i and c_j denote two different known types, S_A1, S_A2, S_B1, S_B2 denote the saliency maps of A1, A2, B1, B2 respectively, Distance(·) uses the Euclidean distance, and α and β are two thresholds that bound the magnitude of the Euclidean distance.
S302, training the constructed auto-encoder neural network with the multiplet training set to obtain the pose transformation generator.
In an exemplary embodiment of the present application, the training the constructed auto-encoder network with the multiplet training set to obtain the pose transformation generator may include:
taking {A1, A2, B1} as the input and B2 as the target output of the pose transformation generator, and taking B̂2 as the actual output of the pose transformation generator;
optimizing a preset loss function according to the target output B2, the actual output B̂2, and the class label y of the target output B2, determining that training is finished when the loss value of the loss function meets a preset requirement, and using the trained auto-encoder network as the pose transformation generator.
In an exemplary embodiment of the present application, the loss function may include:
Loss = MSE(B̂2, B2) + λ · CE(B̂2, y)
where Loss is the loss value, MSE(B̂2, B2) is the mean square error between the actual output and the target output, CE(B̂2, y) is the cross-entropy loss obtained by taking B̂2 as the input with class label y, and λ is a hyper-parameter that adjusts the ratio between the mean square error and the cross-entropy loss.
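A training-loop sketch for this loss is given below; the generator architecture, the frozen auxiliary classifier used for the cross-entropy term, and all hyper-parameter values are assumptions, since the original text only specifies the form of the loss.

```python
import torch
import torch.nn as nn

def train_pose_generator(generator, aux_classifier, quad_loader, lam=0.1,
                         epochs=50, lr=1e-3, device='cuda'):
    """Train the auto-encoder pose transformation generator on quadruplet data.

    generator      : network mapping (A1, A2, B1) -> B2_hat (architecture assumed)
    aux_classifier : frozen classifier providing class logits for the cross-entropy term
    lam            : hyper-parameter balancing MSE and cross-entropy (value assumed)
    """
    mse, ce = nn.MSELoss(), nn.CrossEntropyLoss()
    opt = torch.optim.Adam(generator.parameters(), lr=lr)
    generator.to(device).train()
    aux_classifier.to(device).eval()

    for _ in range(epochs):
        for a1, a2, b1, b2, y in quad_loader:         # y: class label of B2
            a1, a2, b1, b2, y = (t.to(device) for t in (a1, a2, b1, b2, y))
            b2_hat = generator(a1, a2, b1)            # actual output B2_hat
            # Loss = MSE(B2_hat, B2) + lambda * CE(classifier(B2_hat), y)
            loss = mse(b2_hat, b2) + lam * ce(aux_classifier(b2_hat), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return generator
```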
In an exemplary embodiment of the present application, through training on such data, the pose transformation generator can learn how to transfer the pose change from A1 to A2 onto B1, so as to generate an output B̂2 that is similar to B2.
In an exemplary embodiment of the present application, the generating new samples of the new-type objects based on first images of known-type objects having a large number of samples, second images of new-type objects having a small number of samples, the pre-trained foreground object extractor, and the pre-trained pose transformation generator may include:
setting type labels for the first image set and the second image set respectively, and acquiring the foreground object extractor and the pose transformation generator;
inputting the first image set and the second image set provided with type labels into the pre-trained foreground object extractor respectively, and acquiring a first foreground object image set for the known-type objects and a second foreground object image set for the new-type objects respectively;
inputting the first foreground object image set and the second foreground object image set into the pose transformation generator to transform the pose of the new-type objects into the pose of the known-type objects, and outputting, by the pose transformation generator, pose-transformed images of the new-type objects;
and taking the acquired plurality of pose-transformed images as new samples of the new-type objects.
In an exemplary embodiment of the present application, applying the pose transformation generator to perform pose transformation on new-class samples may follow these steps: for each training sample Z1 of a new type, find a known-type sample X1 whose saliency map is similar to that of Z1, combine it with another sample X2 of the same known type, and take [X1, X2, Z1] together as the input of the pose transformation generator, thereby obtaining a new sample Ẑ2 of Z1 after pose transformation. This process is repeated until a sufficient number of new-class samples is obtained.
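The following sketch mirrors these steps under assumed data structures; the nearest-neighbour matching of saliency maps by Euclidean distance is an illustrative choice consistent with the quadruplet construction above, not a detail fixed by the original text.

```python
import torch

def generate_new_samples(new_samples, known_samples, generator, device='cuda'):
    """Generate extra samples for a new type by transferring known-type pose changes.

    new_samples   : list of (Z1_image, Z1_saliency) for one new class
    known_samples : dict mapping known label -> list of (image, saliency_map)
    generator     : trained pose transformation generator taking (X1, X2, Z1)
    Returns a list of generated images Z2_hat.
    """
    generated = []
    generator.to(device).eval()
    with torch.no_grad():
        for z1_img, z1_sal in new_samples:
            # Find the known-type sample X1 whose saliency map is closest to Z1's.
            best = min(
                ((lbl, img, sal) for lbl, items in known_samples.items() for img, sal in items),
                key=lambda t: torch.norm(t[2].flatten() - z1_sal.flatten()).item(),
            )
            x1_label, x1_img, _ = best
            # Pair X1 with the other samples X2 of the same known type.
            for x2_img, _ in known_samples[x1_label]:
                z2_hat = generator(x1_img.unsqueeze(0).to(device),
                                   x2_img.unsqueeze(0).to(device),
                                   z1_img.unsqueeze(0).to(device))
                generated.append(z2_hat.squeeze(0).cpu())
    return generated
```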
S202, expanding the new samples into the small sample set to constitute an expanded sample set for the new-type objects.
S203, training with the expanded sample set to obtain the small-sample object classifier.
In an exemplary embodiment of the present application, the small-sample object classifier may be a cosine-distance-based classifier; other types of classifiers may also be employed.
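A common form of cosine-distance-based classifier (as popularized by Baseline++) is sketched below; the temperature-like scale factor and its value are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineClassifier(nn.Module):
    """Cosine-distance-based classifier: the logit of class k is the scaled cosine
    similarity between the feature vector and a learnable class weight vector."""

    def __init__(self, feat_dim, num_classes, scale=10.0):   # scale value is an assumption
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.scale = scale

    def forward(self, features):                 # features: (batch, feat_dim)
        f = F.normalize(features, dim=1)         # unit-norm features
        w = F.normalize(self.weight, dim=1)      # unit-norm class weights
        return self.scale * f @ w.t()            # (batch, num_classes) cosine logits

# Example usage in the fine-tuning stage:
# clf = CosineClassifier(feat_dim=512, num_classes=5)
```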
In the exemplary embodiments of the present application, the detailed implementation of the embodiments of the present application is further described below with reference to an implementation case. It should be apparent that the embodiment described in this section is only one embodiment of the present application, not all embodiments. All other embodiments obtained by a person of ordinary skill in the art without inventive work, based on the embodiments of the present application, fall within the protection scope of the embodiments of the present application.
In the exemplary embodiment of the application, taking the fine-grained image classification benchmark data sets CUB Birds, Stanford Dogs, and Stanford Cars as examples, two small-sample fine-grained classification settings are mainly adopted. One is 5-way 1-shot: for 5 new classes (i.e., new types), the training set contains only 1 image per class. The other is 5-way 5-shot: for 5 new classes, the training set contains only 5 images per class. During testing, 15 images per class are selected as test images under both the 5-way 1-shot and the 5-way 5-shot settings.
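To make this evaluation protocol concrete, the following helper (an assumption, not part of the original text) samples one N-way K-shot episode with 15 query images per class, matching the 5-way 1-shot and 5-way 5-shot settings just described.

```python
import random

def sample_episode(images_by_class, n_way=5, k_shot=1, n_query=15, seed=None):
    """Sample one N-way K-shot episode: a support set (training images) and a
    query set (test images) drawn from n_way new classes."""
    rng = random.Random(seed)
    classes = rng.sample(list(images_by_class.keys()), n_way)
    support, query = [], []
    for label in classes:
        imgs = rng.sample(images_by_class[label], k_shot + n_query)
        support += [(img, label) for img in imgs[:k_shot]]   # k_shot training images per class
        query += [(img, label) for img in imgs[k_shot:]]     # 15 test images per class
    return support, query

# Example: 5-way 1-shot and 5-way 5-shot episodes
# support1, query1 = sample_episode(new_class_images, n_way=5, k_shot=1)
# support5, query5 = sample_episode(new_class_images, n_way=5, k_shot=5)
```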
In the exemplary embodiment of the present application, to illustrate the effectiveness of the method of this embodiment, comparison experiments are performed on the three benchmark data sets against two baseline methods, Baseline and Baseline++, four mainstream small-sample image classification methods, MatchingNet, ProtoNet, MAML, and RelationNet, and four state-of-the-art fine-grained small-sample image classification methods, PCM, PABN, CovaMNet, and DN4; the results are shown in Table 1. Table 1 compares the experimental results of the different methods on the three fine-grained image benchmark data sets; the dark gray areas represent the best result in each column, and the light gray areas represent the second-best result in each column.
TABLE 1 (the table contents appear as images in the original publication; they list the 5-way 1-shot and 5-way 5-shot accuracies of each compared method on the CUB Birds, Stanford Dogs, and Stanford Cars data sets)
In the exemplary embodiment of the present application, as can be seen from Table 1, on the basis of the Baseline++ method, the method (FOT) proposed in the embodiment of the present application improves the accuracy by 7.20% and 5.99% on average under the 5-way 1-shot and 5-way 5-shot settings, respectively; it greatly surpasses the current mainstream small-sample image classification methods; and compared with the current state-of-the-art fine-grained small-sample image classification methods, it obtains the best results on both the CUB Birds and Stanford Dogs data sets, while on Stanford Cars it is second only to the DN4 method and still shows strong classification and generalization performance.
In an exemplary embodiment of the present application, as a sample expansion method, the FOT method provided by the embodiment of the present application can also be combined, as an auxiliary method, with current small-sample image classification methods to improve their performance. Table 2 shows the experimental results of combining the FOT method with the four current mainstream small-sample image classification methods.
TABLE 2 (the table contents appear as images in the original publication; they list the results of the four mainstream small-sample image classification methods with and without the FOT module on the benchmark data sets)
In an exemplary embodiment of the present application, as a data expansion method, the FOT method proposed by the embodiment of the present application can be conveniently applied as an auxiliary module to any existing small-sample image classification method. As can be seen from Table 2, the FOT module can effectively improve the performance of the mainstream small-sample image classification methods. Virtually all of the methods listed in Table 1 can be further enhanced by the FOT module.
In the exemplary embodiment of the present application, to further visualize the effect of the method of the embodiment of the present application, fig. 8 shows partial examples of new-type images generated by the FOT method proposed in the embodiment of the present application. Each column represents a different quadruplet-based data pair {X1, X2, Z1, Ẑ2}, where Z1 is an image of a new type, X1 is a known-type image whose saliency map is similar to that of Z1, X2 is another image of the same known type as X1, and Ẑ2 is the actually generated image. It can be seen that Ẑ2 has the category attributes of Z1 (e.g., bird color, key part features, etc.) and the pose of X2.
An embodiment of the present application further provides an image classification device 1, as shown in fig. 9, which may include a processor 11 and a computer-readable storage medium 12. The computer-readable storage medium 12 stores instructions, and when the instructions are executed by the processor 11, any one of the image classification methods described above is implemented.
It will be understood by those of ordinary skill in the art that all or some of the steps of the methods, systems, and functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the components may be implemented as software executed by a processor, such as a digital signal processor or microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data, as is well known to those of ordinary skill in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and includes any information delivery media, as known to those skilled in the art.

Claims (12)

1. An image classification method, the method comprising:
acquiring an object image to be classified, a pre-trained foreground object extractor, and a final classifier;
inputting the object image to be classified into the foreground object extractor, and acquiring a foreground object image of the object image to be classified;
inputting the foreground object image of the object image to be classified into the final classifier, and outputting, by the final classifier, a classification result for the object image to be classified;
wherein the final classifier consists of a pre-trained image feature extractor and a small-sample object classifier;
the image feature extractor is obtained by training with foreground object images of known-type objects having a large number of samples as training data;
the small-sample object classifier is obtained by training with expanded foreground object images of new-type objects as training data; the expansion method comprises performing pose transformation on the foreground object images of the new-type objects with a pose transformation generator to obtain expanded foreground object images.
2. The image classification method according to claim 1, wherein the foreground object extractor comprises: a salient object detection model and an image operation module comprising a plurality of image operations;
wherein the image operations comprise, in sequence, one or more of the following: a pixel-level logic operation, a multiplication operation, a masking operation, and a zoom operation;
the salient object detection model is used for acquiring a saliency map of its input image, the saliency map being a feature map containing the pose of the foreground object; and the image operation module is used for performing image processing on the saliency map.
3. The image classification method according to claim 2, wherein obtaining the foreground object extractor comprises: acquiring the salient object detection model;
the acquiring the salient object detection model comprises: training a constructed convolutional neural network with a first training set consisting of images of known-type objects and saliency maps corresponding to the images, to obtain the salient object detection model.
4. The image classification method according to claim 1, wherein the method for acquiring the image feature extractor comprises:
generating foreground object images of first images based on the first images of known-type objects having a large number of samples and the foreground object extractor, and training a convolutional neural network with the foreground object images to obtain the image feature extractor.
5. The image classification method according to claim 1, wherein the method for acquiring the small-sample object classifier comprises:
generating new samples of the new-type objects based on first images of known-type objects having a large number of samples, second images of new-type objects having a small number of samples, the pre-trained foreground object extractor, and a pre-trained pose transformation generator;
expanding the new samples into the small sample set to constitute an expanded sample set for the new-type objects;
and training with the expanded sample set to obtain the small-sample object classifier.
6. The image classification method according to claim 5, wherein the generating new samples of the new-type objects based on first images of known-type objects having a large number of samples, second images of new-type objects having a small number of samples, the pre-trained foreground object extractor, and the pre-trained pose transformation generator comprises:
setting type labels for the first image set and the second image set respectively, and acquiring the foreground object extractor and the pose transformation generator;
inputting the first image set and the second image set provided with type labels into the pre-trained foreground object extractor respectively, and acquiring a first foreground object image set for the known-type objects and a second foreground object image set for the new-type objects respectively;
inputting the first foreground object image set and the second foreground object image set into the pose transformation generator to transform the pose of the new-type objects into the pose of the known-type objects, and outputting, by the pose transformation generator, pose-transformed images of the new-type objects;
and taking the acquired plurality of pose-transformed images as new samples of the new-type objects.
7. The image classification method according to claim 5 or 6, wherein the method for acquiring the pose transformation generator comprises:
extracting a multiplet training set from a second training set consisting of the first images of known-type objects having a large number of samples;
and training a constructed auto-encoder neural network with the multiplet training set to obtain the pose transformation generator.
8. The image classification method according to claim 7, wherein the multiplet training set comprises a quadruplet training set;
the extracting a multiplet training set from a second training set consisting of the first images of known-type objects having a large number of samples comprises:
obtaining a plurality of quadruplet data pairs {A1, A2, B1, B2} from the second training set to form the quadruplet training set;
wherein a quadruplet data pair {A1, A2, B1, B2} satisfies: A1 and A2 come from one known type, B1 and B2 come from another known type, the similarity between the saliency map of A1 and the saliency map of B1 is greater than or equal to a preset similarity threshold, the similarity between the saliency map of A2 and the saliency map of B2 is greater than or equal to the similarity threshold, and the pose change from A1 to A2 is consistent with the pose change from B1 to B2.
9. The image classification method according to claim 8, wherein the training the constructed auto-encoder network with the multiplet training set to obtain the pose transformation generator comprises:
taking {A1, A2, B1} as the input and B2 as the target output of the pose transformation generator, and taking B̂2 as the actual output of the pose transformation generator;
optimizing a preset loss function according to the target output B2, the actual output B̂2, and the class label y of the target output B2, determining that training is finished when the loss value of the loss function meets a preset requirement, and using the trained auto-encoder network as the pose transformation generator.
10. The image classification method according to claim 9, wherein the loss function comprises:
Loss = MSE(B̂2, B2) + λ · CE(B̂2, y)
wherein Loss is the loss value, MSE(B̂2, B2) is the mean square error between the actual output and the target output, CE(B̂2, y) is the cross-entropy loss obtained by taking B̂2 as the input with class label y, and λ is a hyper-parameter that adjusts the ratio between the mean square error and the cross-entropy loss.
11. The image classification method according to any one of claims 1 to 4, wherein the small-sample object classifier is a cosine-distance-based classifier.
12. An image classification device, comprising a processor and a computer-readable storage medium having instructions stored therein, wherein the instructions, when executed by the processor, implement the image classification method according to any one of claims 1 to 9.
CN202110209031.8A (filed 2021-02-24, priority 2021-02-24): Image classification method and device, Pending, published as CN113033612A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110209031.8A CN113033612A (en) 2021-02-24 2021-02-24 Image classification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110209031.8A CN113033612A (en) 2021-02-24 2021-02-24 Image classification method and device

Publications (1)

Publication Number Publication Date
CN113033612A (en) 2021-06-25

Family

ID=76461165

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110209031.8A Pending CN113033612A (en) 2021-02-24 2021-02-24 Image classification method and device

Country Status (1)

Country Link
CN (1) CN113033612A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114821207A (en) * 2022-06-30 2022-07-29 浙江凤凰云睿科技有限公司 Image classification method and device, storage medium and terminal
CN114821207B (en) * 2022-06-30 2022-11-04 浙江凤凰云睿科技有限公司 Image classification method and device, storage medium and terminal


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination