CN115424060A

CN115424060A - Model training method, image classification method and device

Info

Publication number: CN115424060A
Application number: CN202211037161.9A
Authority: CN
Inventors: 何凤翔
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Wodong Tianjun Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Wodong Tianjun Information Technology Co Ltd
Priority date: 2022-08-26
Filing date: 2022-08-26
Publication date: 2022-12-02

Abstract

The embodiment of the disclosure provides a model training method, which includes: in response to the fact that a small sample learning network comprising a local feature extractor, a semantic feature extractor and a classification discriminator is obtained, an image pair corresponding to the sample image and the query image is built, the image pair is input into the local feature extractor to be subjected to local feature extraction, a first local feature corresponding to the sample image and a second local feature corresponding to the query image are obtained, the first local feature and the second local feature are respectively input into the semantic feature extractor to be subjected to semantic feature extraction, a first semantic feature corresponding to the first local feature and a second semantic feature corresponding to the second local feature are obtained, the first local feature and the first semantic feature, the second local feature and the second semantic feature are respectively input into the classification discriminator, and the small sample learning network is trained on the basis of the annotation information corresponding to the sample image, so that an image classification model is obtained.

Description

Model training method, image classification method and device

Technical Field

The embodiment of the disclosure relates to the technical field of computers and the technical field of internet, in particular to the technical field of artificial intelligence and the technical field of image processing, and particularly relates to a model training method, an image classification method and an image classification device.

Background

With the continuous development and wide application of deep neural networks, deep learning models are rapidly developed in the field of computer vision, however, the development is greatly dependent on massive labeled image samples from related categories. Defect detection in real-world scenarios, such as industrial manufacturing, does not always enable such a large number of image samples to be obtained for the new category of interest due to the high cost. With the continuous development and progress of the neural network, a small sample learning method for fast learning from a small number of labeled images is provided, and at present, the method mainly comprises the following steps: metric learning techniques, semantic alignment techniques, and feature aggregation techniques.

However, metric learning techniques typically aggregate features of all regions in an image, and do not consider whether these image regions are semantically related to the image, thereby reducing the discriminativity of the obtained features; or, the semantic alignment technology focuses on exploring an effective local feature similarity measurement technology between the paired images, and is limited by problems of background clutter, intra-class variation or high calculation cost, and in small sample learning, the scarcity of the marked images makes accurate semantic alignment more difficult; furthermore, feature aggregation techniques lack the perception of image semantics and focus more on image regions that are not task-related.

Disclosure of Invention

The embodiment of the disclosure provides a model training method, an image classification device, electronic equipment and a computer readable medium.

In a first aspect, an embodiment of the present disclosure provides a model training method, including: in response to the acquisition of the annotation information and the query image which comprise the sample image and correspond to the sample image, establishing an image pair which corresponds to the sample image and the query image; constructing a small sample learning network comprising a local feature extractor, a semantic feature extractor and a classification discriminator; inputting the image pair into a local feature extractor for local feature extraction to obtain a first local feature corresponding to the sample image and a second local feature corresponding to the query image; respectively inputting the first local feature and the second local feature into a semantic feature extractor for semantic feature extraction to obtain a first semantic feature corresponding to the first local feature and a second semantic feature corresponding to the second local feature; and respectively inputting the first local feature and the first semantic feature as well as the second local feature and the second semantic feature into a classification discriminator, and training a small sample learning network based on the labeling information corresponding to the sample image to obtain an image classification model.

In some embodiments, the step of inputting the first local feature and the second local feature into a semantic feature extractor for semantic feature extraction to obtain a first semantic feature corresponding to the first local feature and a second semantic feature corresponding to the second local feature includes: respectively performing feature compression on the first local feature and the second local feature through a semantic feature extractor to obtain a compressed first local feature and a compressed second local feature; and performing dimension expansion and semantic feature extraction on the compressed first local features and the compressed second local features respectively through a semantic feature extractor to obtain first semantic features corresponding to the first local features and second semantic features corresponding to the second local features.

In some embodiments, the step of inputting the first local feature and the first semantic feature, and the second local feature and the second semantic feature into a classification discriminator, and training a small sample learning network based on labeling information corresponding to a sample image to obtain an image classification model includes: respectively inputting the first local feature and the first semantic feature as well as the second local feature and the second semantic feature into a classification discriminator to carry out feature aggregation to obtain a first aggregation feature corresponding to the first local feature and a second aggregation feature corresponding to the second local feature; and training the small sample learning network based on the first aggregation characteristic, the second aggregation characteristic and the labeling information corresponding to the sample image to obtain an image classification model.

In some embodiments, the inputting the first local feature and the first semantic feature, and the second local feature and the second semantic feature into the classification discriminator respectively to perform feature aggregation to obtain a first aggregated feature corresponding to the first local feature and a second aggregated feature corresponding to the second local feature, includes: carrying out weight calculation on the first local feature and the first semantic feature through a classification discriminator to obtain a first weight of the first local feature; determining a first aggregation feature corresponding to the first local feature based on the first weight and the first local feature; performing weight calculation on the second local feature and the second semantic feature through a classification discriminator to obtain a second weight of the second local feature; and determining a second aggregation characteristic corresponding to the second local characteristic based on the second weight and the second local characteristic.

In some embodiments, training the small sample learning network based on the first aggregation feature, the second aggregation feature, and the labeling information corresponding to the sample image to obtain an image classification model, including: determining a cross-entropy loss function based on the first aggregation characteristic and the second aggregation characteristic; and performing joint training on the small sample learning network based on the cross entropy loss function and the labeling information corresponding to the sample image to obtain an image classification model.

In a second aspect, an embodiment of the present disclosure provides an image classification method, including: acquiring an image to be classified; and inputting the image to be classified into an image classification model to obtain a classification result of the image to be classified, wherein the image classification model is obtained based on the method of the first aspect.

In a third aspect, an embodiment of the present disclosure provides a model training apparatus, including: the construction module is configured to respond to the acquisition of the annotation information and the query image which comprise the sample image and the sample image, and construct an image pair corresponding to the sample image and the query image; constructing a small sample learning network comprising a local feature extractor, a semantic feature extractor and a classification discriminator; the local feature extraction module is configured to input the image pair into the local feature extractor to perform local feature extraction, so as to obtain a first local feature corresponding to the sample image and a second local feature corresponding to the query image; the semantic feature extraction module is configured to input the first local feature and the second local feature into the semantic feature extractor respectively for semantic feature extraction, so as to obtain a first semantic feature corresponding to the first local feature and a second semantic feature corresponding to the second local feature; and the training module is configured to input the first local feature and the first semantic feature, the second local feature and the second semantic feature into the classification discriminator respectively, and train the small sample learning network based on the labeling information corresponding to the sample image to obtain an image classification model.

In some embodiments, the semantic feature extraction module is further configured to: respectively performing feature compression on the first local feature and the second local feature through a semantic feature extractor to obtain a compressed first local feature and a compressed second local feature; and respectively carrying out dimension expansion and semantic feature extraction on the compressed first local features and the compressed second local features through a semantic feature extractor to obtain first semantic features corresponding to the first local features and second semantic features corresponding to the second local features.

In some embodiments, a training module, comprising: the feature aggregation unit is configured to input the first local feature and the first semantic feature, and the second local feature and the second semantic feature into the classification discriminator respectively for feature aggregation, so as to obtain a first aggregation feature corresponding to the first local feature and a second aggregation feature corresponding to the second local feature; and the training unit is configured to train the small sample learning network based on the first aggregation characteristic, the second aggregation characteristic and the labeling information corresponding to the sample image to obtain an image classification model.

In some embodiments, the feature aggregation unit is further configured to: carrying out weight calculation on the first local feature and the first semantic feature through a classification discriminator to obtain a first weight of the first local feature; determining a first aggregation feature corresponding to the first local feature based on the first weight and the first local feature; performing weight calculation on the second local feature and the second semantic feature through a classification discriminator to obtain a second weight of the second local feature; and determining a second aggregation characteristic corresponding to the second local characteristic based on the second weight and the second local characteristic.

In some embodiments, the training unit is further configured to: determining a cross-entropy loss function based on the first aggregation characteristic and the second aggregation characteristic; and performing joint training on the small sample learning network based on the cross entropy loss function and the labeling information corresponding to the sample image to obtain an image classification model.

In a fourth aspect, an embodiment of the present disclosure provides an image classification apparatus, including: an acquisition module configured to acquire an image to be classified; and the classification module is configured to input the image to be classified into an image classification model to obtain a classification result of the image to be classified, wherein the image classification model is obtained based on the method of the first aspect.

In a fifth aspect, an embodiment of the present disclosure provides an electronic device, including: one or more processors; a storage device having one or more programs stored thereon; when executed by one or more processors, cause the one or more processors to implement a method as described in any one of the embodiments of the first or second aspects.

In a sixth aspect, embodiments of the present disclosure provide a computer-readable medium on which a computer program is stored, which computer program, when executed by a processor, implements a method as described in any of the embodiments of the first or second aspect.

According to the model training method provided by the embodiment of the disclosure, the execution main body firstly responds to the acquisition of labeling information and query images which comprise sample images and sample images, an image pair corresponding to the sample images and the query images is constructed, a small sample learning network comprising a local feature extractor, a semantic feature extractor and a classification discriminator is constructed, then the image pair is input into the local feature extractor for local feature extraction, a first local feature corresponding to the sample images and a second local feature corresponding to the query images are obtained, then the first local feature and the second local feature are respectively input into the semantic feature extractor for semantic feature extraction, a first semantic feature corresponding to the first local feature and a second semantic feature corresponding to the second local feature are obtained, finally the first local feature, the first semantic feature, the second local feature and the second semantic feature are respectively input into the classification discriminator, based on the labeling information corresponding to the sample images, the small sample learning network is trained, an image classification model is obtained, the corresponding salient features can be extracted based on the local features, the labeling information corresponding to the current semantic features can be respectively input into the classification discriminator, the semantic feature classification model training method can further overcome the problem that the accuracy of semantic image classification is not needed, and the task classification related to be further improved.

Drawings

Other features, objects and advantages of the disclosure will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:

FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present disclosure may be applied;

FIG. 2 is a flow diagram of one embodiment of a model training method according to the present disclosure;

FIG. 3 is a flow diagram for one embodiment of semantic feature extraction according to the present disclosure;

FIG. 4 is a flow diagram for one embodiment of training a small sample learning network according to the present disclosure;

FIG. 5 is a flow diagram of one embodiment of performing feature aggregation in accordance with the present disclosure;

FIG. 6 is a flow diagram for one embodiment of obtaining an image classification model according to the present disclosure;

FIG. 7 is a flow diagram for one embodiment of an image classification method according to the present disclosure;

FIG. 8 is a schematic block diagram of one embodiment of a model training apparatus according to the present disclosure;

FIG. 9 is a schematic structural diagram of one embodiment of an image classification apparatus according to the present disclosure;

FIG. 10 is a schematic structural diagram of an electronic device suitable for use in implementing embodiments of the present disclosure.

Detailed Description

The present disclosure is described in further detail below with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant disclosure and are not limiting of the disclosure. It should be noted that, for the convenience of description, only the parts relevant to the related disclosure are shown in the drawings.

It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.

Fig. 1 illustrates an exemplary system architecture 100 to which the model training methods, image classification methods, and apparatus of embodiments of the present disclosure may be applied.

As shown in fig. 1, the system architecture 100 may include

terminal devices

104, 105, 106, a network 107, and

servers

101, 102, 103. The network 107 serves as a medium for providing communication links between the

terminal devices

104, 105, 106 and the

servers

101, 102, 103. The network 107 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.

A user may interact with

servers

101, 102, 103 belonging to the same server cluster via a network 107 via

terminal devices

104, 105, 106 to receive or transmit information or the like. Various applications may be installed on the

terminal devices

104, 105, 106, such as an item presentation application, a data analysis application, a search-type application, and so forth.

The

terminal devices

104, 105, 106 may be hardware or software. When the terminal device is hardware, it may be various electronic devices having a display screen and supporting communication with the server, including but not limited to a smart phone, a tablet computer, a laptop portable computer, a desktop computer, and the like. When the terminal device is software, the terminal device can be installed in the electronic devices listed above. It may be implemented as multiple pieces of software or software modules, or as a single piece of software or software module. And is not particularly limited herein.

The

servers

101, 102, 103 may be servers that provide various services, such as background servers that receive requests sent by terminal devices with which communication connections are established. The background server can receive and analyze the request sent by the terminal device, and generate a processing result.

The

servers

101, 102, and 103 may acquire a sample image, annotation information corresponding to the sample image, and a query image, construct an image pair corresponding to the sample image and the query image, and then construct a small sample learning network including a local feature extractor, a semantic feature extractor, and a classification discriminator. The

servers

101, 102, and 103 may input the image pair into the local feature extractor to perform local feature extraction, to obtain a first local feature corresponding to the sample image and a second local feature corresponding to the query image, and input the first local feature and the second local feature into the semantic feature extractor respectively to perform semantic feature extraction, to obtain a first semantic feature corresponding to the first local feature and a second semantic feature corresponding to the second local feature, and finally input the first local feature, the first semantic feature, the second local feature, and the second semantic feature into the classification discriminator respectively, and train the small sample learning network based on the labeling information corresponding to the sample image, to obtain the image classification model.

And the

servers

101, 102, and 103 may obtain an image to be classified, and input the image to be classified into an image classification model, to obtain a classification result of the image to be classified, where the image classification model is obtained based on the method.

The server may be hardware or software. When the server is hardware, it may be various electronic devices that provide various services to the terminal device. When the server is software, it may be implemented as a plurality of software or software modules that provide various services to the terminal device, or may be implemented as a single software or software module that provides various services to the terminal device. And is not particularly limited herein.

It should be noted that the model training method and the image classification method provided by the embodiments of the present disclosure may be executed by the

servers

101, 102, 103. Accordingly, a model training means and an image classification means are provided in the

servers

101, 102, 103.

It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

With continued reference to FIG. 2, a flow 200 of one embodiment of a model training method according to the present disclosure is shown. The model training method comprises the following steps:

step 210, in response to acquiring the annotation information and the query image including the sample image and corresponding to the sample image, constructing an image pair corresponding to the sample image and the query image.

In this step, an execution subject (for example,

servers

101, 102, 103 in fig. 1) on which the model training method operates may acquire images of multiple categories each corresponding to multiple different images by network reading or the like. The execution main body can group the acquired images, and divide the images into a sample image group and an inquiry image group, wherein the sample image group can include a plurality of sample images of different categories, and perform information annotation on the sample images to obtain annotation information of the sample images, and the sample image group is used as a support set for small sample learning, and can include images of N categories, and each category can include M sample images; the query image group may include a plurality of unlabeled query images, and the query image group serves as a query set for small sample learning.

After the execution subject acquires the sample image group and the query image group, the execution subject can select a sample image from the sample image group, select a query image from the query image group, and combine the sample image and the query image into an image pair.

Step 220, constructing a small sample learning network comprising a local feature extractor, a semantic feature extractor and a classification discriminator.

In this step, the execution agent may use a neural network such as fast R-CNN as a main framework of the model, and then construct a small sample learning network including a local feature extractor, a semantic feature extractor, and a classification discriminator.

The local feature extractor can be used for extracting features of an input image and extracting local features of the input image; the semantic feature extractor can be used for performing semantic analysis on the obtained local features and extracting corresponding semantic features; the classification discriminator can be used for analyzing and processing the input semantic features and local features to generate the classification result of the input image.

Step 230, inputting the image pair into the local feature extractor for local feature extraction, so as to obtain a first local feature corresponding to the sample image and a second local feature corresponding to the query image.

In this step, the local feature extractor may be a conventional Convolutional Neural Network (CNN) after removing the global pooling layer and the subsequent layers thereof, the execution main body may input the constructed image pair into the local feature extractor, and the local feature extractor may perform local feature extraction on the sample image and the query image in the image pair, so that the local feature extractor may perform local feature extraction on the sample image to obtain a first local feature corresponding to the sample image, and the local feature extractor may perform local feature extraction on the query image to obtain a second local feature corresponding to the query image.

The local feature extractor can

To perform feature extraction corresponding to the sample image and the query image, wherein theta ₁ The learnable parameters of the used convolutional neural network are represented, C, H and W respectively represent the channel, the height and the width, and the HW local features with the dimension of C of the sample image and the query image can be obtained by combining the spatial dimensions and can be represented as

And 240, respectively inputting the first local feature and the second local feature into a semantic feature extractor for semantic feature extraction, so as to obtain a first semantic feature corresponding to the first local feature and a second semantic feature corresponding to the second local feature.

In this step, after the execution main body obtains the first local feature corresponding to the sample image and the second local feature corresponding to the query image through the local feature extractor, the first local feature and the second local feature may be respectively input into the semantic feature extractor, the semantic extractor performs semantic analysis and semantic processing on the first local feature according to the current classification task to obtain the first semantic feature corresponding to the first local feature, and the semantic extractor performs semantic analysis and semantic processing on the second local feature according to the current classification task to obtain the second semantic feature corresponding to the second local feature.

And 250, respectively inputting the first local feature and the first semantic feature as well as the second local feature and the second semantic feature into a classification discriminator, and training a small sample learning network based on the labeling information corresponding to the sample image to obtain an image classification model.

In this step, after the execution subject obtains the first semantic feature corresponding to the first local feature and the second semantic feature corresponding to the second local feature, the first local feature and the first semantic feature may be input to a classification discriminator, the second local feature and the second semantic feature may also be input to the classification discriminator, the classification discriminator performs classification processing on the first local feature, the first semantic feature, the second local feature and the second semantic feature, respectively, and a machine learning method is used to train the small sample learning network according to the input labeled information of the sample image, so as to obtain an image classification model.

Referring to FIG. 3, FIG. 3 shows a flow diagram 300 of one embodiment of performing semantic feature extraction, which may include the steps of:

and 310, respectively performing feature compression on the first local feature and the second local feature through a semantic feature extractor to obtain a compressed first local feature and a compressed second local feature.

The semantic feature extractor can comprise a global pooling layer, three full-connection layers and a ReLU layer which are overlapped with each other.

In this step, after the execution main body acquires the first local feature and the second local feature, the first local feature and the second local feature may be input into the semantic feature extractor, the first local feature may be aggregated in a global pooling manner by using a global pooling layer in the semantic feature extractor, and the pooled and aggregated first local feature may be feature-compressed by using three mutually overlapping full link layers and ReLU layers in the semantic feature extractor, the first local feature obtained through the global pooling may be compressed into a compact feature space according to a dimensionality reduction ratio, and the first local feature may be compressed from a dimensionality C to a dimensionality D local feature, where a numeric value of D is smaller than a numeric value of C. As an example, the execution subject may perform feature compression on the first local feature by using a Principal Component Analysis (PCA) to obtain a compressed first local feature, where the principal component analysis is also called principal component analysis, and aims to convert the multi-index into a few comprehensive indexes (i.e., principal components) by using the idea of dimension reduction, where each principal component can reflect most of information of the original variable and the contained information does not overlap with each other.

And the execution main body performs global pooling aggregation on the second local features through a global pooling layer in the semantic feature extractor, performs feature compression on the pooled and aggregated second local features by utilizing three mutually overlapped full-connection layers and a ReLU layer in the semantic feature extractor, can compress the second local features obtained through the global pooling aggregation into a compact feature space according to a dimensionality reduction ratio, can compress the second local features from a dimensionality of C to a dimensionality of D, and the numerical value of D is smaller than the numerical value of C. As an example, the executing entity may perform feature compression on the second local feature by a Principal Component Analysis (PCA) of an image to obtain a compressed second local feature, where the principal component analysis is also called principal component analysis, and aims to convert multiple indexes into a small number of comprehensive indexes (i.e., principal components) by using a dimensionality reduction idea, where each principal component can reflect most information of the original variable and the contained information does not overlap each other.

And 320, performing dimension expansion and semantic feature extraction on the compressed first local feature and the compressed second local feature through a semantic feature extractor respectively to obtain a first semantic feature corresponding to the first local feature and a second semantic feature corresponding to the second local feature.

In this step, after the execution subject obtains the compressed first local feature and the compressed second local feature, the dimension of the compressed first local feature may be further expanded through three mutually overlapped full-link layers and ReLU layers in the semantic feature extractor, that is, the dimension of the compressed first local feature may be expanded to an original dimension, for example, the original dimension is C, the compressed first local feature may be expanded from a compression dimension D to an original dimension C, and a value of D is smaller than a value of C. Then, the executing agent may perform semantic feature extraction on the first local feature expanded to the original dimension through three mutually overlapped fully-connected layers and ReLU layers in the semantic feature extractor, and may implement semantic feature extraction by using the following formula, where the formula may be:

wherein,

and theta ₂ Is a semantic feature extraction parameter, r _j A feature vector representing local features, H and W representing height and width, respectively.

According to the method, the first semantic feature corresponding to the first local feature and the second semantic feature corresponding to the second local feature can be obtained according to the compressed first local feature and the compressed second local feature through a semantic feature obtaining method in the related technology.

In the implementation mode, the image area irrelevant to the current classification task is filtered by compressing the first local characteristic and the second local characteristic, so that the discriminability of the obtained characteristics and the performance of the model are improved.

Referring to fig. 4, fig. 4 shows a flowchart 400 of one embodiment of training a small sample learning network, which may include the steps of:

and step 410, inputting the first local feature and the first semantic feature, and inputting the second local feature and the second semantic feature into a classification discriminator respectively for feature aggregation to obtain a first aggregation feature corresponding to the first local feature and a second aggregation feature corresponding to the second local feature.

In this step, after the execution main body obtains the first semantic feature corresponding to the first local feature and the second semantic feature corresponding to the second local feature, the first semantic feature, the second local feature, and the second semantic feature may be respectively input to a classification discriminator to perform feature aggregation, the first local feature is subjected to feature aggregation by the classification discriminator according to the first semantic feature to obtain a first aggregation feature corresponding to the first local feature, and the second local feature is subjected to feature aggregation by the classification discriminator according to the second semantic feature to obtain a second aggregation feature corresponding to the second local feature.

And step 420, training the small sample learning network based on the first aggregation characteristic, the second aggregation characteristic and the labeling information corresponding to the sample image to obtain an image classification model.

In this step, after the execution subject obtains the first aggregation feature and the second aggregation feature, the first aggregation feature and the second aggregation feature may be classified by the classification discriminator, and the small sample learning network is trained by using a machine learning method according to the label information of the input sample image, so as to obtain an image classification model.

In the implementation mode, the local features are aggregated based on the semantic features, so that the image area related to the current classification task can be strengthened, other image areas are restrained, the perception of image semantics is emphasized, and the accuracy of the aggregated features is improved.

Referring to fig. 5, fig. 5 shows a flow chart 500 of one embodiment of performing feature aggregation, which may include the steps of:

step 510, performing weight calculation on the first local feature and the first semantic feature through the classification discriminator to obtain a first weight of the first local feature.

In this step, after the execution subject acquires the first semantic feature corresponding to the first local feature and the second semantic feature corresponding to the second local feature, the classification discriminator may perform a weight calculation on the first local feature and the first semantic feature, calculate a norm of the first local feature, and define the norm of the local feature as a attention value of the local feature, where the attention value may be used as a first weight of the first local feature.

The executing agent may calculate a norm of the first local feature by the classification discriminator using a formula as follows:

||r _j +f|| ² ＝(||f||+||r _j ||) ² -2||f||||r _j ||(1-cosθ _j )

wherein, theta _j Denotes f and r _j Angle between r and _j representing local features, it will be understood that for two local features r _j And r _i If r is _j Ratio r _i More relevant to the semantic perception feature f, cos θ _j Should be greater than cos θ _i This means that the above formula makes r _j Is greater than r _i Is increased. There are two extremes as follows: 1) If r is _j If the norm increment is | | | f | | |, the norm increment is completely related to the semantic perception characteristic; if r is _j And completely irrelevant to semantic perception characteristics, the norm increment is- | | f |.

Step 520, based on the first weight and the first local feature, a first aggregate feature corresponding to the first local feature is determined.

In this step, after the execution body obtains the first weight of the first local feature, the execution body may calculate a first aggregation feature corresponding to the first local feature according to the first weight.

The executing agent may calculate, by the classification discriminator, a first aggregate feature corresponding to the first local feature by using the following formula:

wherein r is _j + f denotes the norm of the first local feature, i.e. the first weight; h and W represent height and width, respectively.

And 530, performing weight calculation on the second local feature and the second semantic feature through the classification discriminator to obtain a second weight of the second local feature.

In this step, the executing entity may perform a weight calculation on the second local feature and the second semantic feature through the classification discriminator, calculate a norm of the second local feature, and define the norm of the local feature as an attention value of the local feature, which may be used as a second weight of the second local feature.

The executing entity may calculate a norm of the second local feature by the classification discriminator using a formula as follows:

||r _j +f|| ² ＝(||f||+||r _j ||) ² -2||f||||r _j ||(1-cosθ _j )

wherein, theta _j Denotes f and r _j Angle between r _j Representing local features, it will be understood that for two local features r _j And r _i If r is _j Ratio r _i More relevant to the semantic perception feature f, cos θ _j Should be greater than cos θ _i This means that the above formula makes r _j Is greater than r _i Is increased. There are two extremes as follows: 1) If r is _j If the norm increment is | | | f | | |, the norm increment is completely related to the semantic perception characteristic; if r is _j And completely irrelevant to semantic perception characteristics, the norm increment is- | | f |.

And step 540, determining a second aggregation feature corresponding to the second local feature based on the second weight and the second local feature.

In this step, after the execution subject obtains the second weight of the second local feature, the execution subject may calculate a second aggregate feature corresponding to the second local feature according to the second weight.

The executing entity may calculate, by the classification discriminator, a second aggregate feature corresponding to the second local feature by using the following formula:

wherein r is _j + f denotes the norm of the second local feature, i.e. the second weight; h and W represent height and width, respectively.

In the implementation mode, the weighted value of each local feature is determined through the semantic features, so that the image area irrelevant to the current classification task is prevented from being endowed with more weights, the accuracy of the weighted values is improved, the aggregation features are determined by using the semantic features, the image area irrelevant to the current classification task can be inhibited, the image area relevant to the current classification task is highlighted, and the feature aggregation accuracy is improved.

Referring to FIG. 6, FIG. 6 shows a flowchart 600 of one embodiment of obtaining an image classification model, which may include the steps of:

step 610, determining a cross entropy loss function based on the first aggregation characteristic and the second aggregation characteristic.

In this step, after the execution subject obtains the first aggregation characteristic and the second aggregation characteristic, a corresponding cross entropy loss function may be determined according to the first aggregation characteristic and the second aggregation characteristic.

Specifically, after acquiring the first aggregation feature and the second aggregation feature, the executing body may calculate an average feature of the sample image:

wherein,

and representing the aggregation characteristic of the ith sample image belonging to the mth support category in the sample image group, wherein N is the total number of the sample images in the sample image group.

The executing agent may then calculate the probability that the kth query image in the query image set belongs to the mth support category based on the following formula:

wherein d (·,) represents the cosine distance, τ represents the scale factor,

and representing the aggregation characteristic of the Kth sample image belonging to the mth support category in the query image group.

Finally, the execution agent may determine the cross-entropy loss function as follows:

wherein,

the label representing the kth query image, I (-) represents the indicator function, which equals 1 if its argument is true, and equals 0 otherwise.

And step 620, performing joint training on the small sample learning network based on the cross entropy loss function and the labeling information corresponding to the sample images to obtain an image classification model.

In this step, after the execution subject determines the cross entropy loss function, the small sample learning network may be jointly trained by using a machine learning method according to the cross entropy loss function and the label information of the input sample image, and network parameters in the small sample learning network are adjusted to obtain an image classification model.

In the implementation mode, the small sample learning network is subjected to combined training through the cross entropy loss function and the labeling information corresponding to the sample images to obtain the image classification model, and the training effect and accuracy of the image classification model are improved.

Referring to FIG. 7, FIG. 7 shows a flow chart 700 of one embodiment of an image classification method, which may include the steps of:

and step 710, acquiring an image to be classified.

In this step, the execution main body obtains the image to be classified through network reading and the like.

And 720, inputting the image to be classified into the image classification model to obtain a classification result of the image to be classified.

In this step, after the execution subject obtains the image to be classified, the image to be classified may be input into an image classification model, the image classification model processes the image to be classified and outputs a classification result corresponding to the image to be classified, and the classification result may represent image category information of the image to be classified. Wherein the image classification model is obtained based on the model training method of fig. 2 to 6.

According to the image classification device provided by the embodiment of the disclosure, the execution main body firstly obtains the image to be classified, then inputs the image to be classified into the image classification model, and obtains the classification result of the image to be classified, the image classification model is obtained based on the method, and the image to be classified is classified through the image classification model, so that the classification efficiency and accuracy of the image to be classified can be improved.

With further reference to FIG. 8, as an implementation of the methods illustrated in the above figures, the present disclosure provides one embodiment of a model training apparatus. This embodiment of the device corresponds to the embodiment of the method shown in fig. 2.

As shown in fig. 8, the model training apparatus 800 of the present embodiment may include: a construction module 810, a local feature extraction module 820, a semantic feature extraction module 830, and a training module 840.

The constructing module 810 is configured to, in response to acquiring the annotation information and the query image which comprise the sample image and the sample image, construct an image pair corresponding to the sample image and the query image; constructing a small sample learning network comprising a local feature extractor, a semantic feature extractor and a classification discriminator;

a local feature extraction module 820 configured to input the image into the local feature extractor for local feature extraction, so as to obtain a first local feature corresponding to the sample image and a second local feature corresponding to the query image;

a semantic feature extraction module 830, configured to input the first local feature and the second local feature into the semantic feature extractor for semantic feature extraction, so as to obtain a first semantic feature corresponding to the first local feature and a second semantic feature corresponding to the second local feature;

a training module 840 configured to input the first local feature and the first semantic feature, the second local feature and the second semantic feature into the classification discriminator, respectively, and train the small sample learning network based on the labeling information corresponding to the sample image, so as to obtain an image classification model.

In some optional implementations of this implementation, the semantic feature extraction module 830 is further configured to: respectively performing feature compression on the first local feature and the second local feature through a semantic feature extractor to obtain a compressed first local feature and a compressed second local feature; and respectively carrying out dimension expansion and semantic feature extraction on the compressed first local features and the compressed second local features through a semantic feature extractor to obtain first semantic features corresponding to the first local features and second semantic features corresponding to the second local features.

In some optional implementations of this implementation, the training module 840 includes: the feature aggregation unit is configured to input the first local feature and the first semantic feature, and the second local feature and the second semantic feature into the classification discriminator respectively for feature aggregation, so as to obtain a first aggregation feature corresponding to the first local feature and a second aggregation feature corresponding to the second local feature; and the training unit is configured to train the small sample learning network based on the first aggregation characteristic, the second aggregation characteristic and the labeling information corresponding to the sample image to obtain an image classification model.

In some optional implementations of this implementation, the feature aggregation unit is further configured to: carrying out weight calculation on the first local feature and the first semantic feature through a classification discriminator to obtain a first weight of the first local feature; determining a first aggregation feature corresponding to the first local feature based on the first weight and the first local feature; performing weight calculation on the second local feature and the second semantic feature through a classification discriminator to obtain a second weight of the second local feature; and determining a second aggregation characteristic corresponding to the second local characteristic based on the second weight and the second local characteristic.

In some optional implementations of this implementation, the training unit is further configured to: determining a cross entropy loss function based on the first aggregation characteristic and the second aggregation characteristic; and performing joint training on the small sample learning network based on the cross entropy loss function and the labeling information corresponding to the sample image to obtain an image classification model.

In the model training device provided by the above embodiment of the present disclosure, the execution main body first responds to the acquisition of the annotation information and the query image including the sample image and corresponding to the sample image, constructs an image pair including the sample image and the query image, and constructs a small sample learning network including the local feature extractor, the semantic feature extractor, and the classification discriminator, then inputs the image pair into the local feature extractor for local feature extraction, to obtain the first local feature corresponding to the sample image and the second local feature corresponding to the query image, then inputs the first local feature and the second local feature into the semantic feature extractor for semantic feature extraction, to obtain the first semantic feature corresponding to the first local feature and the second semantic feature corresponding to the second local feature, and finally inputs the first local feature, the second local feature and the second semantic feature into the classification discriminator, based on the annotation information corresponding to the sample image, the small sample learning network is trained, to obtain the image classification model, which can extract and obtain the corresponding features based on the local features, can be based on the semantic information corresponding to the current semantic image classification, and can further overcome the problem that the accuracy of the task classification of the semantic image classification is not required by the task classification.

Those skilled in the art will appreciate that the above-described apparatus may also include some other well-known structures, such as processors, memories, etc., which are not shown in fig. 8 in order to not unnecessarily obscure embodiments of the present disclosure.

With further reference to fig. 9, the present disclosure provides one embodiment of an image classification apparatus as an implementation of the methods illustrated in the above figures. This device embodiment corresponds to the method embodiment shown in fig. 7.

As shown in fig. 9, the image classification apparatus 900 of the present embodiment may include: an acquisition module 910 and a classification module 920.

The obtaining module 910 is configured to obtain an image to be classified;

the classifying module 920 is configured to input the image to be classified into an image classification model, so as to obtain a classification result of the image to be classified, where the image classification model is obtained based on the methods shown in fig. 2 to 6.

Those skilled in the art will appreciate that the above-described apparatus may also include some other well-known structure, such as a processor, memory, etc., which is not shown in fig. 9 in order not to unnecessarily obscure embodiments of the present disclosure.

Referring now to FIG. 10, a block diagram of an electronic device 1000 suitable for use in implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a smart screen, a notebook computer, a PAD (tablet computer), a PMP (portable multimedia player), a car terminal (e.g., car navigation terminal), etc., and a fixed terminal such as a digital TV, a desktop computer, etc. The terminal device shown in fig. 10 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.

As shown in fig. 10, the electronic device 1000 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 1001 that may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 1002 or a program loaded from a storage means 1008 into a Random Access Memory (RAM) 1003. In the RAM 1003, various programs and data necessary for the operation of the electronic apparatus 1000 are also stored. The processing device 1001, the ROM 1002, and the RAM 1003 are connected to each other by a bus 1004. An input/output (I/O) interface 1005 is also connected to bus 1004.

Generally, the following devices may be connected to the I/O interface 1005: input devices 1006 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 1007 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage devices 1008 including, for example, magnetic tape, hard disk, and the like; and a communication device 1009. The communications apparatus 1009 may allow the electronic device 1000 to communicate wirelessly or by wire with other devices to exchange data. While fig. 10 illustrates an electronic device 1000 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may be alternatively implemented or provided. Each block shown in fig. 10 may represent one device or may represent multiple devices as desired.

In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication means 1009, or installed from the storage means 1008, or installed from the ROM 1002. The computer program, when executed by the processing device 1001, performs the above-described functions defined in the methods of the embodiments of the present disclosure. It should be noted that the computer readable medium of the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In embodiments of the present disclosure, however, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.

Computer program code for carrying out operations for embodiments of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, smalltalk, C + +, and including conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units described in the embodiments of the present application may be implemented by software or hardware. The described units may also be provided in a processor, and may be described as: a processor includes a construction module, a local feature extraction module, a semantic feature extraction module, and a training module, or a processor includes an acquisition module and a classification module, where the names of these modules do not in some cases constitute a limitation on the module itself.

As another aspect, the present application also provides a computer-readable medium, which may be included in the electronic device; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: in response to the acquisition of the annotation information and the query image which comprise the sample image and correspond to the sample image, establishing an image pair which corresponds to the sample image and the query image; constructing a small sample learning network comprising a local feature extractor, a semantic feature extractor and a classification discriminator; inputting the image pair into a local feature extractor for local feature extraction to obtain a first local feature corresponding to the sample image and a second local feature corresponding to the query image; respectively inputting the first local feature and the second local feature into a semantic feature extractor for semantic feature extraction to obtain a first semantic feature corresponding to the first local feature and a second semantic feature corresponding to the second local feature; and respectively inputting the first local feature and the first semantic feature as well as the second local feature and the second semantic feature into a classification discriminator, and training a small sample learning network based on the labeling information corresponding to the sample image to obtain an image classification model. Or, cause the electronic device to: acquiring an image to be classified; and inputting the image to be classified into an image classification model to obtain a classification result of the image to be classified, wherein the image classification model is obtained based on a model training method.

The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention in the embodiments of the present disclosure is not limited to the specific combination of the above-mentioned features, but also encompasses other embodiments in which any combination of the above-mentioned features or their equivalents is made without departing from the inventive concept as defined above. For example, the above features and (but not limited to) technical features with similar functions disclosed in the embodiments of the present disclosure are mutually replaced to form the technical solution.

Claims

1. A method of model training, the method comprising:

in response to the acquisition of a sample image, annotation information corresponding to the sample image and a query image, constructing an image pair corresponding to the sample image and the query image;

constructing a small sample learning network comprising a local feature extractor, a semantic feature extractor and a classification discriminator;

inputting the image pair into the local feature extractor for local feature extraction to obtain a first local feature corresponding to the sample image and a second local feature corresponding to the query image;

respectively inputting the first local feature and the second local feature into the semantic feature extractor for semantic feature extraction to obtain a first semantic feature corresponding to the first local feature and a second semantic feature corresponding to the second local feature;

and respectively inputting the first local feature, the first semantic feature, the second local feature and the second semantic feature into the classification discriminator, and training the small sample learning network based on the labeling information corresponding to the sample image to obtain an image classification model.

2. The method of claim 1, wherein the inputting the first local feature and the second local feature into the semantic feature extractor for semantic feature extraction to obtain a first semantic feature corresponding to the first local feature and a second semantic feature corresponding to the second local feature respectively comprises:

respectively performing feature compression on the first local feature and the second local feature through the semantic feature extractor to obtain a compressed first local feature and a compressed second local feature;

and performing dimension expansion and semantic feature extraction on the compressed first local features and the compressed second local features respectively through the semantic feature extractor to obtain first semantic features corresponding to the first local features and second semantic features corresponding to the second local features.

3. The method of claim 1, wherein the inputting the first local feature and the first semantic feature, the second local feature and the second semantic feature into the classification discriminator, respectively, and training the small sample learning network based on labeling information corresponding to the sample image to obtain an image classification model comprises:

inputting the first local feature and the first semantic feature, the second local feature and the second semantic feature into the classification discriminator respectively for feature aggregation to obtain a first aggregation feature corresponding to the first local feature and a second aggregation feature corresponding to the second local feature;

and training the small sample learning network based on the first aggregation characteristic, the second aggregation characteristic and the labeling information corresponding to the sample image to obtain an image classification model.

4. The method of claim 3, wherein the inputting the first local feature and the first semantic feature, the second local feature and the second semantic feature into the classification discriminator respectively for feature aggregation to obtain a first aggregated feature corresponding to the first local feature and a second aggregated feature corresponding to the second local feature comprises:

performing weight calculation on the first local feature and the first semantic feature through the classification discriminator to obtain a first weight of the first local feature;

determining a first aggregation feature corresponding to the first local feature based on the first weight and the first local feature;

performing weight calculation on the second local feature and the second semantic feature through the classification discriminator to obtain a second weight of the second local feature;

and determining a second aggregation feature corresponding to the second local feature based on the second weight and the second local feature.

5. The method according to claim 3 or 4, wherein the training the small sample learning network based on the first aggregation feature, the second aggregation feature and the labeling information corresponding to the sample image to obtain an image classification model comprises:

determining a cross-entropy loss function based on the first aggregation characteristic and the second aggregation characteristic;

and performing joint training on the small sample learning network based on the cross entropy loss function and the labeling information corresponding to the sample image to obtain an image classification model.

6. A method of image classification, the method comprising:

acquiring an image to be classified;

inputting the image to be classified into an image classification model to obtain a classification result of the image to be classified, wherein the image classification model is obtained based on the method of the claims 1 to 5.

7. A model training apparatus, the apparatus comprising:

the construction module is configured to respond to the acquisition of the annotation information and the query image which comprise a sample image and a sample image, and construct an image pair corresponding to the sample image and the query image; constructing a small sample learning network comprising a local feature extractor, a semantic feature extractor and a classification discriminator;

a local feature extraction module configured to input the image into the local feature extractor for local feature extraction, so as to obtain a first local feature corresponding to the sample image and a second local feature corresponding to the query image;

a semantic feature extraction module configured to input the first local feature and the second local feature into the semantic feature extractor for semantic feature extraction, so as to obtain a first semantic feature corresponding to the first local feature and a second semantic feature corresponding to the second local feature;

the training module is configured to input the first local feature and the first semantic feature, the second local feature and the second semantic feature into the classification discriminator respectively, and train the small sample learning network based on the labeling information corresponding to the sample image to obtain an image classification model.

8. The apparatus of claim 7, wherein the semantic feature extraction module is further configured to:

and performing dimensionality extension and semantic feature extraction on the compressed first local feature and the compressed second local feature through the semantic feature extractor respectively to obtain a first semantic feature corresponding to the first local feature and a second semantic feature corresponding to the second local feature.

9. The apparatus of claim 7, wherein the training module comprises:

a feature aggregation unit configured to input the first local feature and the first semantic feature, and the second local feature and the second semantic feature into the classification discriminator respectively for feature aggregation, so as to obtain a first aggregation feature corresponding to the first local feature and a second aggregation feature corresponding to the second local feature;

and the training unit is configured to train the small sample learning network based on the first aggregation characteristic, the second aggregation characteristic and the labeling information corresponding to the sample images to obtain an image classification model.

10. The apparatus of claim 9, wherein the feature aggregation unit is further configured to:

11. The apparatus of claim 9 or 10, wherein the training unit is further configured to:

12. An image classification apparatus, the apparatus comprising:

an acquisition module configured to acquire an image to be classified;

a classification module configured to input the image to be classified into an image classification model to obtain a classification result of the image to be classified, wherein the image classification model is obtained based on the method of claims 1 to 5.

13. An electronic device, comprising:

one or more processors;

a storage device for storing one or more programs,

the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-6.

14. A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-6.