CN117541882B - Instance-based multi-view visual fusion transductive zero-shot classification method - Google Patents

Instance-based multi-view visual fusion transductive zero-shot classification method

Info

Publication number
CN117541882B
Authority
CN
China
Prior art keywords
pictures
unseen
view
semantic
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410017127.8A
Other languages
Chinese (zh)
Other versions
CN117541882A (en)
Inventor
汤龙
赵靖涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology
Priority to CN202410017127.8A
Publication of CN117541882A
Application granted
Publication of CN117541882B
Legal status: Active (current)
Anticipated expiration


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/40: Extraction of image or video features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774: Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an instance-based multi-view visual fusion transductive zero-shot classification method, which comprises the following steps: extracting the multi-view visual features of the seen-class pictures; feeding the multi-view visual features and the semantic attributes of the seen-class pictures into a multi-view visual-semantic mapping model, and learning the transformation matrices at the different views with the alternating direction method of multipliers (ADMM); predicting the semantic projections of the unseen-class pictures with the learned transformation matrices; and further extracting the final semantic representations of the unseen-class pictures from the semantic projections and identifying the unseen-class pictures on that basis. The invention realizes the interactive sharing of visual information across different views through a single linear constraint, which simplifies the traditional multi-view information fusion model. Moreover, to further mine the visual-semantic associations hidden in the unseen classes, a self-supervised learning strategy is proposed that uses the consistency among the views to calibrate the semantics of the unseen-class pictures, which can greatly improve zero-shot classification performance.

Description

Instance-based multi-view visual fusion transductive zero-shot classification method
Technical Field
The invention relates to the technical field of image recognition, in particular to an instance-based multi-view visual fusion transductive zero-shot classification method.
Background
In recent years, zero-shot learning (ZSL) has received increasing attention. Unlike conventional pattern recognition, ZSL can recognize samples whose labels never appear during training: by exploiting the inherent associations of semantic attributes among categories, it builds a mapping between visual features and semantic attributes and uses it to classify samples of unseen categories. Most current ZSL methods use only a single visual feature representation; in many practical scenarios, however, visual feature representations from multiple views are available through different channels. For high-resolution images, different feature extractors (SIFT, SURF, PHOG, pre-trained deep networks, etc.) may be used to acquire features. Owing to the variability between views, instance-based multi-view visual data can provide a more comprehensive description than single-view data and, if utilized properly, is expected to greatly improve ZSL performance.
Disclosure of Invention
The invention aims to: provide an instance-based multi-view visual fusion transductive zero-shot classification method that improves the generalization performance of the zero-shot classifier, so as to recognize unseen-class pictures more accurately.
The technical scheme is as follows: the instance-based multi-view visual fusion transductive zero-shot classification method of the invention comprises the following steps:
(1) Extracting the multi-view visual features of the seen-class pictures and the unseen-class pictures;
(2) Feeding the multi-view visual features of the seen-class pictures and the corresponding class semantic attributes into a multi-view visual-semantic mapping model, and learning the transformation matrices at the different views with the alternating direction method of multipliers (ADMM);
(3) Predicting the semantic projections of the unseen-class pictures with the learned transformation matrices;
(4) Further extracting the final semantic representations of the unseen-class pictures from the semantic projections obtained in step (3) and identifying the unseen-class pictures.
Further, the step (1) is specifically as follows: visual features are extracted with ResNet and GoogLeNet pre-trained on the ImageNet database, yielding view A and view B respectively.
Further, the multi-view visual-semantic mapping model in the step (2) is expressed as an optimization problem over the transformation matrices W^(v), v = 1, ..., V, and auxiliary variable matrices, subject to the model's constraint conditions, where: X^(v) ∈ R^(d_v×n) denotes the feature matrix of the seen-class pictures at the v-th view, each column corresponding to one seen-class picture; S ∈ R^(m×n) denotes the class semantic attribute matrix of the seen-class pictures, each column corresponding to one seen-class picture; S̄ denotes the mean matrix of the seen-class semantic attributes, each column of which is the mean vector of all the seen-class semantic attributes; d_v is the dimension of the view feature at the v-th view; m is the dimension of the class semantic attributes; n is the number of seen-class pictures; λ1, ..., λ5 are hyper-parameters; and V is the number of views.
Further, the alternating direction method of multipliers in the step (2) is specifically as follows:
Initialization: initialize the optimization variable matrices and the Lagrange multipliers; set the iteration counter k = 0; choose the convergence thresholds ε1 and ε2 and the penalty parameters ρ1, ρ2 and ρ3.
At each iteration: obtain the first variable block by solving a linear matrix equation, Θ denoting an internal parameter of the ADMM; obtain the second variable block by solving its optimization subproblem; obtain the third variable block by solving its equation; update the fourth and fifth variable blocks by their closed-form formulas; and update the four Lagrange multipliers.
If the primal and dual residuals fall below the convergence thresholds ε1 and ε2, the iteration has converged; otherwise set k = k + 1 and continue the updates. The transformation matrices W^(v), v = 1, ..., V, obtained at convergence are the final result.
Further, the semantic projection of the unseen-class pictures at a single view obtained in the step (3) is S̃^(v) = W^(v) X_u^(v), where X_u^(v) ∈ R^(d_v×n_u) denotes the feature matrix of the unseen-class pictures at the v-th view, each column corresponding to one unseen-class picture, and n_u is the number of unseen-class pictures.
Further, the final semantic representations of the unseen-class pictures extracted in the step (4) are the solution of an optimization problem whose optimization variables are the final semantic representations at each view; the problem involves a diagonal matrix and a hyper-parameter, and the diagonal matrix is calculated from a block matrix assembled from the single-view semantic projections.
Further, the identification of the unseen-class pictures in the step (4) comprises: averaging the final semantic representations of the unseen-class pictures over the views, S̄_u = (1/V) Σ_{v=1}^{V} S_u^(v); and obtaining the class labels of the unseen-class pictures as l = argmax(A_u^T S̄_u), where argmax(·) returns the vector formed by the row index of the largest element of each column of the input matrix, A_u ∈ R^(m×c_u) is the unseen-class semantic attribute matrix, c_u is the number of unseen classes, and l contains the class labels of the identified unseen-class pictures.
The instance-based multi-view visual fusion transductive zero-shot recognition system of the invention comprises:
a data acquisition module for extracting the multi-view visual features of the seen-class pictures and the unseen-class pictures;
a model learning module for feeding the multi-view visual features of the seen-class pictures and the corresponding class semantic attributes into the multi-view visual-semantic mapping model, learning the transformation matrices at the different views with the alternating direction method of multipliers, predicting the semantic projections of the unseen-class pictures with the learned transformation matrices, and further extracting the final semantic representations of the unseen-class pictures from the semantic projections;
and a picture recognition module for classifying the extracted final semantic representations of the unseen-class pictures.
The apparatus of the invention comprises a memory, a processor, and a computer program stored on the memory and executable on the processor; when loaded into the processor, the computer program implements any of the instance-based multi-view visual fusion transductive zero-shot classification methods described above.
The beneficial effects are that: compared with the prior art, the invention has the following notable advantages. The multi-view visual features carry richer, fuller and more comprehensive information about the training samples, which effectively improves the generalization performance of the zero-shot classifier and enables more accurate recognition of unseen-class pictures. Compared with existing zero-shot learning methods, the classification accuracy on unseen-class pictures is improved to a large extent; the method is simple and efficient, and has good application prospects in related fields such as pattern recognition, data mining and computer vision.
Drawings
FIG. 1 is a flow chart of the present invention.
Description of the embodiments
The technical scheme of the invention is further described below with reference to the accompanying drawings.
As shown in fig. 1, an embodiment of the present invention provides an instance-based multi-view visual fusion transductive zero-shot classification method, which comprises the following steps:
(1) Extracting the multi-view visual features of the seen-class pictures and the unseen-class pictures; specifically: visual features are extracted with ResNet (fc9 layer, 2048 dimensions) and GoogLeNet (fc17 layer, 1024 dimensions) pre-trained on the ImageNet database, yielding view A and view B respectively.
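For concreteness, the sketch below shows one way to realize this two-view extraction with torchvision's ImageNet-pretrained models; ResNet-101 is an assumption (the filing does not name the ResNet variant), and the pooled penultimate features stand in for the fc9/fc17 taps.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Standard ImageNet preprocessing, shared by both views.
preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# View A: ResNet (2048-d features); view B: GoogLeNet (1024-d features).
resnet = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1)
resnet.fc = torch.nn.Identity()     # drop the classifier, keep pooled features
googlenet = models.googlenet(weights=models.GoogLeNet_Weights.IMAGENET1K_V1)
googlenet.fc = torch.nn.Identity()
resnet.eval()
googlenet.eval()

@torch.no_grad()
def two_view_features(image_paths):
    batch = torch.stack([preprocess(Image.open(p).convert("RGB"))
                         for p in image_paths])
    x_a = resnet(batch).t()         # view A: (2048, n), one column per picture
    x_b = googlenet(batch).t()      # view B: (1024, n)
    return x_a, x_b
```

The transposes put the features in the column-per-picture layout X^(v) ∈ R^(d_v×n) that the mapping model consumes.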
(2) The multi-view visual features of the seen-class pictures and the corresponding class semantic attributes are fed into the multi-view visual-semantic mapping model, and the transformation matrices at the different views are learned with the alternating direction method of multipliers (ADMM). The multi-view visual-semantic mapping model is expressed as the optimization problem P1: minimize, over the transformation matrices W^(v), v = 1, ..., V, and auxiliary variable matrices, the sum of a loss term and a consistency term, subject to constraints (1.1)-(1.5). Here X^(v) ∈ R^(d_v×n) denotes the feature matrix of the seen-class pictures at the v-th view, each column corresponding to one seen-class picture; S ∈ R^(m×n) denotes the class semantic attribute matrix of the seen-class pictures, each column corresponding to one seen-class picture; S̄ denotes the mean matrix of the seen-class semantic attributes, each column of which is the mean vector of all the seen-class semantic attributes; d_v is the dimension of the view feature at the v-th view; m is the dimension of the class semantic attributes; n is the number of seen-class pictures; λ1, ..., λ5 are hyper-parameters; and V is the number of views. The consistency term keeps the prediction results of all views consistent on the seen-class samples; constraint (1.1) is the single linear constraint that realizes the interactive sharing of visual information across the different views; constraints (1.2)-(1.4) are used to construct a reconstructable subspace in the mapping; and constraint (1.5) is a non-negativity constraint. The inputs of problem P1 are X^(v), S and S̄; the solution variables are the transformation matrices and the auxiliary variable matrices.
For the optimization problem P1, the alternating direction method of multipliers is adopted for the solution, specifically as follows:
Input the training set data X^(v), v = 1, ..., V, S and S̄, and the hyper-parameters λ1, ..., λ5.
Initialization: initialize the optimization variable matrices and the Lagrange multipliers; set the iteration counter k = 0; choose the convergence thresholds ε1 and ε2 and the penalty parameters ρ1, ρ2 and ρ3.
At each iteration: obtain the first variable block by solving a linear matrix equation, Θ denoting an internal parameter of the ADMM; obtain the second variable block by solving its optimization subproblem; obtain the third variable block by solving its equation; update the fourth and fifth variable blocks by their closed-form formulas; and update the four Lagrange multipliers.
If the primal and dual residuals fall below the convergence thresholds ε1 and ε2, the iteration has converged; otherwise set k = k + 1 and continue the updates. The transformation matrices W^(v), v = 1, ..., V, obtained at convergence are the final result.
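The closed-form updates above are specific to problem P1 and appear as formula images in the original filing; the sketch below therefore illustrates the same solver pattern on a simplified stand-in problem, multi-view ridge regression with the cross-view consistency constraint W^(v)X^(v) = Z on the seen-class samples, solved by consensus ADMM. All names (admm_fit, Z, U, rho, lam) are illustrative, not the patent's notation.

```python
import numpy as np

def admm_fit(X_list, S, lam=1.0, rho=1.0, eps=1e-5, max_iter=500):
    """Consensus ADMM for: min_W sum_v ||W_v X_v - S||^2 + lam ||W_v||^2
    subject to W_v X_v = Z for all views v (a simplified stand-in for P1)."""
    m, n = S.shape
    V = len(X_list)
    W = [np.zeros((m, X.shape[0])) for X in X_list]
    Z = np.zeros((m, n))                       # consensus variable
    U = [np.zeros((m, n)) for _ in range(V)]   # scaled Lagrange multipliers
    for _ in range(max_iter):
        # W-step: closed form from the first-order optimality condition.
        for v, X in enumerate(X_list):
            rhs = (2.0 * S + rho * (Z - U[v])) @ X.T
            lhs = (2.0 + rho) * (X @ X.T) + 2.0 * lam * np.eye(X.shape[0])
            W[v] = rhs @ np.linalg.inv(lhs)
        # Z-step: average of the per-view projections plus multipliers.
        Z_old = Z
        Z = sum(W[v] @ X_list[v] + U[v] for v in range(V)) / V
        # Multiplier step and residual-based stopping test.
        r = [W[v] @ X_list[v] - Z for v in range(V)]
        for v in range(V):
            U[v] += r[v]
        primal = np.sqrt(sum(np.linalg.norm(rv) ** 2 for rv in r))
        dual = rho * np.sqrt(V) * np.linalg.norm(Z - Z_old)
        if primal < eps and dual < eps:
            break
    return W
```

Each W-step is the closed-form solution of a regularized least-squares subproblem; the consensus variable Z plays a role analogous to the shared prediction that the single linear constraint (1.1) enforces across views.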
(3) The semantic projections of the unseen-class pictures are predicted with the learned transformation matrices; the semantic projection at a single view is obtained as S̃^(v) = W^(v) X_u^(v), where X_u^(v) ∈ R^(d_v×n_u) denotes the feature matrix of the unseen-class pictures at the v-th view, each column corresponding to one unseen-class picture, and n_u is the number of unseen-class pictures.
(4) The final semantic representations of the unseen-class pictures are further extracted from the semantic projections obtained in step (3), and the unseen-class pictures are identified. The final semantic representations of the unseen-class pictures at each view are the optimization variables of an extraction problem that involves a diagonal matrix and a hyper-parameter; the diagonal matrix is in turn calculated from a block matrix assembled from the single-view semantic projections.
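The extraction problem itself is likewise given as formula images; as a hedged illustration of consistency-based calibration, the sketch below smooths the per-view projections over a neighborhood graph built from their cross-view average. This is a graph-regularized stand-in, not the patent's exact formulation; refine_semantics, gamma and k are illustrative names and hyper-parameters, though the diagonal degree matrix computed from an affinity matrix parallels the diagonal matrix computed from a block matrix described above.

```python
import numpy as np

def refine_semantics(proj_list, k=10, gamma=0.5):
    """Smooth per-view semantic projections S~^(v) of shape (m, n_u) over a
    kNN graph built on their cross-view average: a simplified calibration."""
    S_avg = sum(proj_list) / len(proj_list)
    n_u = S_avg.shape[1]
    # Cosine affinities between unseen pictures, kept only for k neighbors.
    Sn = S_avg / (np.linalg.norm(S_avg, axis=0, keepdims=True) + 1e-12)
    A = Sn.T @ Sn
    np.fill_diagonal(A, 0.0)
    non_neighbors = np.argsort(A, axis=1)[:, :-k]   # all but the k largest
    for i in range(n_u):
        A[i, non_neighbors[i]] = 0.0
    A = np.maximum(A, A.T)                          # symmetrize
    D = np.diag(A.sum(axis=1))                      # diagonal degree matrix
    L = D - A                                       # graph Laplacian
    # Closed form of min_F ||F - S~||^2 + gamma * tr(F L F^T) per view.
    Minv = np.linalg.inv(np.eye(n_u) + gamma * L)
    return [P @ Minv for P in proj_list]
```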
The identification of the unseen-class pictures comprises: averaging the final semantic representations of the unseen-class pictures over the views, S̄_u = (1/V) Σ_{v=1}^{V} S_u^(v); and obtaining the class labels of the unseen-class pictures as l = argmax(A_u^T S̄_u), where argmax(·) returns the vector formed by the row index of the largest element of each column of the input matrix, A_u ∈ R^(m×c_u) is the unseen-class semantic attribute matrix, c_u is the number of unseen classes, and l contains the class labels of the identified unseen-class pictures.
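Steps (3) and (4) can then be sketched end to end as below: project the unseen-class features through the learned W^(v), average across views, and label each picture by the most compatible unseen-class attribute column. The cosine normalization is an assumption here, since the patent's exact scoring formula is given as a formula image.

```python
import numpy as np

def classify_unseen(W_list, Xu_list, A_u):
    """W_list[v]: (m, d_v); Xu_list[v]: (d_v, n_u); A_u: (m, c_u).
    Returns one unseen-class index per picture."""
    proj = [W @ X for W, X in zip(W_list, Xu_list)]   # step (3): S~^(v)
    S_bar = sum(proj) / len(proj)                     # cross-view average
    # Cosine compatibility between class attributes and picture semantics.
    A = A_u / (np.linalg.norm(A_u, axis=0, keepdims=True) + 1e-12)
    S = S_bar / (np.linalg.norm(S_bar, axis=0, keepdims=True) + 1e-12)
    return np.argmax(A.T @ S, axis=0)                 # label per column
```

In a full pipeline, the refinement sketch above would be applied to proj before the averaging.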
To verify the effect and performance of the proposed method, comparison experiments were conducted on three classical zero-shot classification datasets: AwA, CUB and SUN. Table 1 lists the unseen-class recognition accuracies of several existing ZSL methods.
Table 1 Comparison of the recognition results of several methods
Compared with other methods, the instance-based multi-view visual fusion transductive zero-shot classification method of the invention makes full use of the feature information of the different views, has obvious advantages in generalization performance, and achieves a higher level of accuracy in recognizing unseen-class pictures.
The embodiment of the invention also provides an instance-based multi-view visual fusion transductive zero-shot recognition system, comprising:
a data acquisition module for extracting the multi-view visual features of the seen-class pictures and the unseen-class pictures;
a model learning module for feeding the multi-view visual features of the seen-class pictures and the corresponding class semantic attributes into the multi-view visual-semantic mapping model, learning the transformation matrices at the different views with the alternating direction method of multipliers, predicting the semantic projections of the unseen-class pictures with the learned transformation matrices, and further extracting the final semantic representations of the unseen-class pictures from the semantic projections;
and a picture recognition module for classifying the extracted final semantic representations of the unseen-class pictures.
The embodiment of the invention also provides an apparatus comprising a memory, a processor, and a computer program stored on the memory and executable on the processor; when loaded into the processor, the computer program implements any of the instance-based multi-view visual fusion transductive zero-shot classification methods described above.

Claims (7)

1. An instance-based multi-view visual fusion transductive zero-shot classification method, characterized by comprising the following steps:
(1) extracting the multi-view visual features of the seen-class pictures and the unseen-class pictures;
(2) feeding the multi-view visual features of the seen-class pictures and the corresponding class semantic attributes into a multi-view visual-semantic mapping model, and learning the transformation matrices at the different views with the alternating direction method of multipliers (ADMM); the ADMM is specifically as follows:
Initialization: initialize the optimization variable matrices and the Lagrange multipliers; set the iteration counter k = 0; choose the convergence thresholds ε1 and ε2 and the parameters ρ1, ρ2 and ρ3.
At each iteration: obtain the first variable block by solving a linear matrix equation, Θ denoting an internal parameter of the ADMM; obtain the second variable block by solving its optimization subproblem; obtain the third variable block by solving its equation; update the fourth and fifth variable blocks by their closed-form formulas; and update the four Lagrange multipliers.
If the primal and dual residuals fall below the convergence thresholds ε1 and ε2, the iteration has converged; otherwise set k = k + 1 and continue the updates; the transformation matrices W^(v), v = 1, ..., V, obtained at convergence are the final result;
(3) predicting the semantic projections of the unseen-class pictures with the learned transformation matrices;
(4) further extracting the final semantic representations of the unseen-class pictures from the semantic projections obtained in step (3) and identifying the unseen-class pictures.
2. The instance-based multi-view visual fusion transductive zero-shot classification method according to claim 1, characterized in that the step (1) is specifically as follows: visual features are extracted with ResNet and GoogLeNet pre-trained on the ImageNet database, yielding view A and view B respectively.
3. The instance-based multi-view visual fusion transductive zero-shot classification method according to claim 1, characterized in that the multi-view visual-semantic mapping model in the step (2) is expressed as an optimization problem over the transformation matrices W^(v), v = 1, ..., V, and auxiliary variable matrices, subject to the model's constraint conditions, where: X^(v) ∈ R^(d_v×n) denotes the feature matrix of the seen-class pictures at the v-th view, each column corresponding to one seen-class picture; S ∈ R^(m×n) denotes the class semantic attribute matrix of the seen-class pictures, each column corresponding to one seen-class picture; S̄ denotes the mean matrix of the seen-class semantic attributes, each column of which is the mean vector of all the seen-class semantic attributes; d_v is the dimension of the view feature at the v-th view; m is the dimension of the class semantic attributes; n is the number of seen-class pictures; λ1, ..., λ5 are hyper-parameters; and V is the number of views.
4. The instance-based multi-view visual fusion transductive zero-shot classification method according to claim 1, characterized in that the semantic projection of the unseen-class pictures at a single view obtained in the step (3) is S̃^(v) = W^(v) X_u^(v), where X_u^(v) ∈ R^(d_v×n_u) denotes the feature matrix of the unseen-class pictures at the v-th view, each column corresponding to one unseen-class picture, and n_u is the number of unseen-class pictures.
5. The instance-based multi-view visual fusion transductive zero-shot classification method according to claim 1, characterized in that the final semantic representations of the unseen-class pictures extracted in the step (4) are the solution of an optimization problem whose optimization variables are the final semantic representations at each view, the problem involving a diagonal matrix and a hyper-parameter.
6. The instance-based multi-view visual fusion transductive zero-shot classification method according to claim 1, characterized in that the diagonal matrix is calculated from a block matrix assembled from the single-view semantic projections.
7. The instance-based multi-view visual fusion transductive zero-shot classification method according to claim 1, characterized in that the identification of the unseen-class pictures in the step (4) comprises: averaging the final semantic representations of the unseen-class pictures over the views, S̄_u = (1/V) Σ_{v=1}^{V} S_u^(v); and obtaining the class labels of the unseen-class pictures as l = argmax(A_u^T S̄_u), wherein argmax(·) returns the vector formed by the row index of the largest element of each column of the input matrix, A_u ∈ R^(m×c_u) is the unseen-class semantic attribute matrix, c_u is the number of unseen classes, and l contains the class labels of the identified unseen-class pictures.
CN202410017127.8A 2024-01-05 2024-01-05 Instance-based multi-view visual fusion transductive zero-shot classification method Active CN117541882B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410017127.8A CN117541882B (en) 2024-01-05 2024-01-05 Instance-based multi-view visual fusion transductive zero-shot classification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410017127.8A CN117541882B (en) 2024-01-05 2024-01-05 Instance-based multi-view visual fusion transductive zero-shot classification method

Publications (2)

Publication Number Publication Date
CN117541882A (en) 2024-02-09
CN117541882B (en) 2024-04-19

Family

ID=89796173

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410017127.8A Active CN117541882B (en) 2024-01-05 2024-01-05 Instance-based multi-view visual fusion transductive zero-shot classification method

Country Status (1)

Country Link
CN (1) CN117541882B (en)


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11694042B2 (en) * 2020-06-16 2023-07-04 Baidu Usa Llc Cross-lingual unsupervised classification with multi-view transfer learning
CN114037879A * 2021-10-22 2022-02-11 Beijing University of Technology Dictionary learning method and device for zero-shot recognition

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109643384A * 2016-08-16 2019-04-16 Nokia Technologies Ltd. Method and apparatus for zero-shot learning
CN110431565A * 2017-03-06 2019-11-08 Nokia Technologies Ltd. Transductive and/or adaptive maximum-margin zero-shot learning method and system
KR20200130759A * 2019-04-25 2020-11-20 Industry-Academic Cooperation Foundation, Yonsei University Zero-shot recognition apparatus for automatically generating undefined attribute information in a data set, and method thereof
CN111222471A * 2020-01-09 2020-06-02 University of Science and Technology of China Zero-shot training and related classification method based on a self-supervised domain-aware network
CN112801105A * 2021-01-22 2021-05-14 Zhejiang Lab Two-stage zero-shot image semantic segmentation method
CN113361646A * 2021-07-01 2021-09-07 University of Science and Technology of China Generalized zero-shot image recognition method and model based on semantic information retention
CN113902969A * 2021-10-12 2022-01-07 Xidian University Zero-shot SAR target recognition method fusing CNN and image similarity
KR20230078134A * 2021-11-26 2023-06-02 Industry-Academic Cooperation Foundation, Yonsei University Device and method for zero-shot semantic segmentation
CN115424096A * 2022-11-08 2022-12-02 Nanjing University of Information Science and Technology Multi-view zero-shot image recognition method
CN116433977A * 2023-04-18 2023-07-14 State Grid Smart Grid Research Institute Co., Ltd. Unknown-class image classification method and device, computer equipment and storage medium
CN117274726A * 2023-11-23 2023-12-22 Nanjing University of Information Science and Technology Picture classification method and system based on multi-view complementary labels

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A sharing multi-view feature selection method via Alternating Direction Method of Multipliers; Qiang Lin et al.; Neurocomputing, vol. 333, pp. 124-134; 2019-03-14 *
Zero-Shot Learning via Robust Latent Representation and Manifold Regularization; Min Meng et al.; IEEE Transactions on Image Processing, vol. 28, no. 4, pp. 1824-1836; 2019-04 *
Research on image classification based on zero-shot learning; Wang Xinjie; China Masters' Theses Full-text Database, Information Science and Technology; 2020-01-15; I138-2247 *

Also Published As

Publication number Publication date
CN117541882A (en) 2024-02-09


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant