CN115482443A - Image feature fusion and model training method, device, equipment and storage medium - Google Patents

Image feature fusion and model training method, device, equipment and storage medium

Info

Publication number
CN115482443A
CN115482443A
Authority
CN
China
Prior art keywords
image
features
level
layer
image features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211067385.4A
Other languages
Chinese (zh)
Inventor
夏春龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Apollo Zhilian Beijing Technology Co Ltd
Original Assignee
Apollo Zhilian Beijing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Apollo Zhilian Beijing Technology Co Ltd filed Critical Apollo Zhilian Beijing Technology Co Ltd
Priority to CN202211067385.4A priority Critical patent/CN115482443A/en
Publication of CN115482443A publication Critical patent/CN115482443A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Processing (AREA)

Abstract

The disclosure provides an image feature fusion and model training method, device, equipment and storage medium, and relates to the technical field of image processing, in particular to the fields of computer vision, deep learning and the like. The specific implementation scheme is as follows: acquiring image features of different receptive fields; enhancing channel information of high-level image features in the image features of different receptive fields by utilizing bottom-level image features in the image features of different receptive fields to obtain enhanced high-level features after the channel information is enhanced, wherein the receptive field of the high-level image features is larger than the receptive field of the bottom-level image features; performing spatial information enhancement on bottom-level image features in the image features of different receptive fields by utilizing the high-level image features in the image features of different receptive fields to obtain enhanced bottom-level features after the spatial information enhancement; and fusing the enhanced high-level features and the enhanced bottom-level features to obtain fused features. The image feature fusion method can better perform image feature fusion and improve the accuracy of image features.

Description

Image feature fusion and model training method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of image processing technology, and in particular, to the fields of computer vision, deep learning, and the like.
Background
In many fields, fusing features of different receptive fields is an important means of improving performance. For example, in the field of image processing, fusing image features of different receptive fields is an important means of improving the image processing effect.
Disclosure of Invention
The disclosure provides an image feature fusion and model training method, device, equipment and storage medium.
According to a first aspect of the present disclosure, there is provided an image feature fusion method, including:
acquiring image characteristics of different receptive fields;
carrying out channel information enhancement on the high-level image features in the image features of different receptive fields by utilizing the bottom-level image features in the image features of different receptive fields to obtain enhanced high-level features after the channel information enhancement; wherein the receptive field of the high-level image features is larger than the receptive field of the bottom-level image features;
carrying out spatial information enhancement on bottom layer image features in the image features of different receptive fields by utilizing the high layer image features in the image features of different receptive fields to obtain enhanced bottom layer features after the spatial information enhancement;
and fusing the enhanced high-level features and the enhanced bottom-level features to obtain fused features.
According to a second aspect of the present disclosure, there is provided a model training method for image feature fusion, comprising:
obtaining a plurality of sample images and labels corresponding to the sample images;
inputting the sample images into a model for image feature fusion aiming at each sample image to obtain fused features;
performing image processing based on the fused features to obtain an image processing result;
calculating the difference between the image processing result and the label corresponding to the sample image;
adjusting model parameters of the model for image feature fusion based on the difference;
based on the adjusted model parameters and the plurality of sample images, continuing the adjustment process of the model parameters until a preset iteration end condition is met;
taking model parameters obtained when a preset iteration ending condition is met as model parameters after training, and taking a model for image feature fusion including the model parameters after training as a model for image feature fusion after training;
the model for image feature fusion comprises a channel attention network, a space attention network and a fusion network;
the channel attention network is used for enhancing channel information of the high-level image features in the image features of different receptive fields by utilizing the bottom-level image features in the image features of different receptive fields to obtain enhanced high-level features after the channel information is enhanced; wherein the receptive field of the high-level image features is larger than the receptive field of the bottom-level image features;
the spatial attention network is used for enhancing spatial information of bottom layer image features in the image features of different receptive fields by utilizing the high layer image features in the image features of different receptive fields to obtain enhanced bottom layer features after the spatial information is enhanced;
and the fusion network is used for fusing the enhanced high-level features and the enhanced bottom-level features to obtain fused features.
According to a third aspect of the present disclosure, there is provided an image feature fusion apparatus comprising:
the acquisition module is used for acquiring image characteristics of different receptive fields;
the feature enhancement module is used for utilizing the bottom-level image features in the image features of different receptive fields to enhance the channel information of the high-level image features in the image features of different receptive fields, to obtain enhanced high-level features after the channel information is enhanced, wherein the receptive field of the high-level image features is larger than the receptive field of the bottom-level image features; and is further used for utilizing the high-level image features in the image features of different receptive fields to carry out spatial information enhancement on the bottom-level image features in the image features of different receptive fields, to obtain enhanced bottom-level features after the spatial information enhancement;
and the fusion module is used for fusing the enhanced high-level features and the enhanced bottom-level features to obtain fused features.
According to a fourth aspect of the present disclosure, there is provided a model training apparatus for image feature fusion, comprising:
the acquisition module is used for acquiring a plurality of sample images and labels corresponding to the sample images;
a fusion feature obtaining module, configured to input the sample images into a model for image feature fusion, to obtain fused features;
an image processing result obtaining module, configured to perform image processing based on the fused features to obtain an image processing result;
the calculation module is used for calculating the difference between the image processing result and the label corresponding to the sample image;
a training module for adjusting model parameters of the model for image feature fusion based on the difference; based on the adjusted model parameters and the plurality of sample images, continuing the adjustment process of the model parameters until a preset iteration end condition is met; taking model parameters obtained when a preset iteration ending condition is met as trained model parameters, and taking a model for image feature fusion, which comprises the trained model parameters, as a trained model for image feature fusion;
the model for image feature fusion comprises a channel attention network, a spatial attention network and a fusion network;
the channel attention network is used for enhancing channel information of the high-level image features in the image features of different receptive fields by utilizing the bottom-level image features in the image features of different receptive fields to obtain enhanced high-level features after the channel information is enhanced; wherein the receptive field of the high-level image features is larger than the receptive field of the bottom-level image features;
the spatial attention network is used for enhancing spatial information of bottom layer image features in the image features of different receptive fields by utilizing the high layer image features in the image features of different receptive fields to obtain enhanced bottom layer features after the spatial information is enhanced;
and the fusion network is used for fusing the enhanced high-level features and the enhanced bottom-level features to obtain fused features.
According to a fifth aspect of the present disclosure, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first or second aspect.
According to a sixth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of the first or second aspect.
According to a seventh aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method according to the first or second aspect.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a flowchart of an image feature fusion method provided by an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of an image feature fusion method provided by an embodiment of the present disclosure;
FIG. 3 is a flowchart of a training method for a model for image feature fusion provided by an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a model for image feature fusion in an embodiment of the present disclosure;
FIG. 5 is another schematic diagram of a model for image feature fusion in an embodiment of the present disclosure;
FIG. 6 is yet another schematic diagram of a model for image feature fusion in an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of an image feature fusion apparatus provided in an embodiment of the present disclosure;
FIG. 8 is a schematic structural diagram of a model training apparatus for image feature fusion provided in an embodiment of the present disclosure;
fig. 9 is a block diagram of an electronic device for implementing an image feature fusion method or a training method for a model for image feature fusion of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The embodiment of the disclosure provides an image feature fusion method, which includes:
acquiring image characteristics of different receptive fields;
enhancing channel information of high-level image features in the image features of different receptive fields by utilizing bottom-level image features in the image features of different receptive fields to obtain enhanced high-level features after the channel information is enhanced; wherein the receptive field of the high-level image features is larger than the receptive field of the bottom-level image features;
performing spatial information enhancement on bottom layer image features in the image features of different receptive fields by utilizing the high layer image features in the image features of different receptive fields to obtain enhanced bottom layer features after the spatial information enhancement;
and fusing the enhanced high-level features and the enhanced bottom-level features to obtain fused features.
In the embodiment of the disclosure, the channel information enhancement is performed on the high-level image features in the image features of different receptive fields by using the bottom-level image features in the image features of different receptive fields to obtain enhanced high-level features, and the spatial information enhancement is performed on the bottom-level image features in the image features of different receptive fields by using the high-level image features in the image features of different receptive fields to obtain enhanced bottom-level features; the enhanced high-level features and the enhanced bottom-level features are then fused to obtain fused features. Through the interaction of the high-level image features and the bottom-level image features, different receptive field feature information can be better utilized, image feature fusion can be better performed, and the extracted image features are more accurate.
Fig. 1 is a flowchart of an image feature fusion method provided in an embodiment of the present disclosure, and referring to fig. 1, the image feature fusion method provided in an embodiment of the present disclosure may include:
and S101, acquiring image characteristics of different receptive fields.
The receptive field represents the size of the region of the original image to which an image feature is mapped. For example, if the image features are represented by image feature maps, the image features of different receptive fields may be image feature maps of different receptive fields.
In one implementation, the image features of different receptive fields can be extracted by a convolutional neural network, where the receptive field represents the size of the area of the original image to which a pixel point on the feature map output by each layer of the convolutional neural network is mapped back. Put simply, the receptive field is the size of the region of the original image that a point on the feature map corresponds to, that is, the area of the original image that a feature of the convolutional neural network can "see". The image features of different receptive fields of the original image (which can also be understood as the image to be processed) can thus be extracted through the convolutional neural network.
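For illustration only (not part of the original disclosure), the receptive field of a stack of convolution layers can be computed with the standard recurrence r_l = r_{l-1} + (k_l - 1) * j_{l-1}, where k_l is the kernel size and j_{l-1} is the cumulative stride of all preceding layers. A minimal sketch:

```python
# Illustrative helper (hypothetical, not from the disclosure):
# receptive-field size of stacked convolution layers, using the recurrence
# r_l = r_{l-1} + (k_l - 1) * j_{l-1}, where j is the cumulative stride.
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, first layer first."""
    r, j = 1, 1  # a single input pixel, cumulative stride 1
    for kernel, stride in layers:
        r += (kernel - 1) * j
        j *= stride
    return r

# Example: three 3x3 convolutions, the second with stride 2.
print(receptive_field([(3, 1), (3, 2), (3, 1)]))  # -> 9: each output point "sees" a 9x9 input region
```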
S102, channel information enhancement is carried out on the high-level image features in the image features of different receptive fields by utilizing the bottom-level image features in the image features of different receptive fields, and enhanced high-level features after the channel information enhancement are obtained.
S103, carrying out spatial information enhancement on the bottom layer image features in the image features of different receptive fields by utilizing the high layer image features in the image features of different receptive fields to obtain enhanced bottom layer features after the spatial information enhancement.
Wherein the receptive field of the high-level image features is larger than the receptive field of the low-level image features.
And S104, fusing the enhanced high-level features and the enhanced bottom-level features to obtain fused features.
A channel importance coefficient may be calculated based on the underlying image features; and adjusting the high-level image characteristics based on the channel importance coefficient to obtain enhanced high-level characteristics after channel information enhancement. The channel importance coefficient is proportional to the importance of the high-level image feature in the channel dimension, and it is understood that the higher the channel importance coefficient is, the more important the high-level image feature is in the channel dimension.
The spatial importance coefficient may be calculated based on high-level image features; and adjusting the bottom layer image characteristics based on the spatial importance coefficient to obtain enhanced bottom layer characteristics after spatial information enhancement.
The spatial importance coefficient is proportional to the importance of the bottom layer image feature in the spatial dimension, and it is simply understood that the larger the spatial importance coefficient is, the more important the bottom layer image feature is in the spatial dimension.
The enhanced high-level features and the enhanced bottom-level features can be directly added to obtain the fused features. Alternatively, they can first be concatenated along the channel dimension and then aggregated through a convolution.
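As a minimal sketch of the two fusion options just mentioned (element-wise addition, or channel-wise concatenation followed by a convolution), assuming the two enhanced feature maps have already been brought to the same spatial size; the module and parameter names are illustrative, not from the disclosure:

```python
import torch
import torch.nn as nn

class FeatureFuse(nn.Module):
    """Fuse the enhanced high-level and enhanced bottom-level features.

    mode='add'    : direct element-wise addition (channel counts must match).
    mode='concat' : concatenate along the channel dimension, then aggregate
                    with a 1x1 convolution.
    """
    def __init__(self, channels, mode="add"):
        super().__init__()
        self.mode = mode
        if mode == "concat":
            self.aggregate = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, enhanced_high, enhanced_low):
        if self.mode == "add":
            return enhanced_high + enhanced_low
        return self.aggregate(torch.cat([enhanced_high, enhanced_low], dim=1))

# Usage: both enhanced feature maps are assumed to be N x C x H x W.
fuse = FeatureFuse(channels=256, mode="concat")
fused = fuse(torch.randn(1, 256, 64, 64), torch.randn(1, 256, 64, 64))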
In an implementation manner, the image features of different receptive fields include image features of different scales. In this case, fusing the image features of different receptive fields in the embodiment of the present disclosure means: acquiring image features of different scales; enhancing channel information of high-level image features in the image features of different scales by using bottom-level image features in the image features of different scales to obtain enhanced high-level features after the channel information is enhanced, wherein the receptive field of the high-level image features is larger than the receptive field of the bottom-level image features; performing spatial information enhancement on bottom-level image features in the image features of different scales by using the high-level image features in the image features of different scales to obtain enhanced bottom-level features after the spatial information enhancement; and fusing the enhanced high-level features and the enhanced bottom-level features to obtain fused features.
In another implementation, the image features of different receptive fields include image features of different receptive fields at the same scale, e.g., different regions in a feature map at the same scale.
In the prior art, features of different scales are rescaled to the same scale and the corresponding features are then added or concatenated, without considering the effectiveness of the features at different scales; or feature fusion is performed with a channel attention mechanism, without considering the differences between different spatial features; or a large-range self-attention mechanism is used to construct the correlation between features at different spatial positions, without considering the importance of different channels. In contrast, in the embodiment of the disclosure, the channel information enhancement is performed on the high-level image features in the image features of different receptive fields by using the bottom-level image features in the image features of different receptive fields to obtain enhanced high-level features, the spatial information enhancement is performed on the bottom-level image features in the image features of different receptive fields by using the high-level image features in the image features of different receptive fields to obtain enhanced bottom-level features, and the two are then fused, so that different receptive field feature information is better utilized through the interaction of the high-level and bottom-level image features and image feature fusion is better performed.
In an optional embodiment, as shown in fig. 2, before S101, the method may further include:
s201, acquiring an image to be processed.
And acquiring image characteristics of different receptive fields, namely extracting the image characteristics of the different receptive fields of the image to be processed.
After S104, the method may further include:
s202, image classification, image detection and/or image segmentation are carried out based on the fused features, and classification results, detection results and/or segmentation results aiming at the images to be processed are obtained.
Based on the fused features, only one of image classification, image detection, and image segmentation may be performed, or any combination of the three may be performed, for example image classification together with image detection, image classification together with image segmentation, image segmentation together with image detection, or all three.
Specifically, when the images are to be classified, the images to be processed can be obtained, the image features of different receptive fields of the images to be processed are extracted, then, the channel information enhancement is performed on the high-level image features in the image features of different receptive fields by using the bottom-level image features in the image features of different receptive fields, and the enhanced high-level features after the channel information enhancement are obtained; performing spatial information enhancement on bottom layer image features in the image features of different receptive fields by utilizing the high layer image features in the image features of different receptive fields to obtain enhanced bottom layer features after the spatial information enhancement; and fusing the enhanced high-level features and the enhanced bottom-level features to obtain fused features. And classifying the images based on the fused features of the images to be processed.
When the image is to be segmented, the image to be processed can be obtained, the image characteristics of different receptive fields of the image to be processed are extracted, then, the bottom layer image characteristics in the image characteristics of the different receptive fields are utilized to enhance the channel information of the high-level image characteristics in the image characteristics of the different receptive fields, and the enhanced high-level characteristics after the channel information enhancement are obtained; carrying out spatial information enhancement on bottom layer image features in the image features of different receptive fields by utilizing the high layer image features in the image features of different receptive fields to obtain enhanced bottom layer features after the spatial information enhancement; and fusing the enhanced high-level features and the enhanced bottom-level features to obtain fused features. And performing image segmentation based on the fused features of the image to be processed.
When the image is to be detected, the image to be processed can be obtained, the image characteristics of different receptive fields of the image to be processed are extracted, then, the channel information of the high-level image characteristics in the image characteristics of the different receptive fields is enhanced by utilizing the bottom-level image characteristics in the image characteristics of the different receptive fields, and the enhanced high-level characteristics after the channel information is enhanced are obtained; carrying out spatial information enhancement on bottom layer image features in the image features of different receptive fields by utilizing the high layer image features in the image features of different receptive fields to obtain enhanced bottom layer features after the spatial information enhancement; and fusing the enhanced high-level features and the enhanced bottom-level features to obtain fused features. And detecting the image based on the fused features of the image to be processed.
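For illustration only (a hedged sketch, not part of the disclosure), the fused features can feed a downstream task head; the example below shows a simple classification head, one of the three tasks mentioned above, with all names being hypothetical:

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Hypothetical classification head applied to the fused features."""
    def __init__(self, channels, num_classes):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)      # global average pooling over the fused map
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, fused):                    # fused: N x C x H x W
        pooled = self.pool(fused).flatten(1)     # N x C
        return self.fc(pooled)                   # N x num_classes (class logits)

head = ClassificationHead(channels=256, num_classes=10)
logits = head(torch.randn(2, 256, 64, 64))
```

Detection or segmentation heads could be attached to the same fused features in an analogous way.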
According to the embodiment of the disclosure, the channel information enhancement is performed on the high-level image features in the image features of different receptive fields by utilizing the bottom-level image features in the image features of different receptive fields, so as to obtain enhanced high-level features after the channel information enhancement; carrying out spatial information enhancement on bottom layer image features in the image features of different receptive fields by utilizing the high layer image features in the image features of different receptive fields to obtain enhanced bottom layer features after the spatial information enhancement; the enhanced high-level features and the enhanced bottom-level features are fused to obtain fused features, so that different receptive field feature information can be better utilized through interaction of the high-level image features and the bottom-level image features, image feature fusion can be better carried out, the extracted image features are more accurate, and the obtained fused features are utilized for image detection, image classification, image segmentation and the like, so that an image detection result, an image classification result and an image segmentation result are more accurate.
In one implementation, the fused features may be obtained through a model used for image feature fusion. The model training process will be described in detail in the following embodiments and is not repeated here.
In particular, the model for image feature fusion may include a feature extraction network, a channel attention network, a spatial attention network, and a fusion network.
Acquiring image characteristics of different receptive fields, comprising:
acquiring an image to be processed, inputting the image to be processed into a model for image feature fusion, and extracting image features of different receptive fields of the image to be processed through a feature extraction network in the model for image feature fusion;
the method for enhancing the channel information of the high-level image features in the image features of different receptive fields by utilizing the bottom-level image features in the image features of different receptive fields to obtain enhanced high-level features after the channel information is enhanced comprises the following steps:
performing channel information enhancement on the high-level image features in the image features of different receptive fields by using the bottom-level image features in the image features of different receptive fields through a channel attention network in a model for image feature fusion to obtain enhanced high-level features after the channel information enhancement;
Performing spatial information enhancement on the bottom-level image features in the image features of different receptive fields by utilizing the high-level image features in the image features of different receptive fields to obtain enhanced bottom-level features after the spatial information enhancement includes:
performing spatial information enhancement on bottom layer image features in the image features of different receptive fields by using high layer image features in the image features of different receptive fields through a spatial attention network in a model for image feature fusion to obtain enhanced bottom layer features after the spatial information enhancement;
Fusing the enhanced high-level features and the enhanced bottom-level features to obtain fused features includes:
and fusing the enhanced high-level features and the enhanced bottom-level features through a fusion network in the model for fusing the image features to obtain fused features.
Put simply, an image to be processed is input into the model for image feature fusion; through the model, the bottom-level image features in the image features of different receptive fields are used to perform channel information enhancement on the high-level image features, and the high-level image features in the image features of different receptive fields are used to perform spatial information enhancement on the bottom-level image features. On the one hand, the fused features can be obtained directly through the model; on the other hand, the whole process can be realized more conveniently through the model.
After the fused features are obtained by the model for image feature fusion, image classification, image detection and/or image segmentation can be performed based on the fused features, so as to obtain a classification result, a detection result and/or a segmentation result for the image to be processed, which is specifically shown in the embodiment corresponding to fig. 2.
In an alternative embodiment, the channel attention network may include a global pooling layer, a first perception layer, and a first normalization layer;
through a channel attention network in a model for image feature fusion, the method utilizes bottom layer image features in image features of different receptive fields to perform channel information enhancement on high layer image features in the image features of the different receptive fields, and comprises the following steps:
performing global pooling operation on the bottom image characteristics by using a global pooling layer;
utilizing the first perception layer to obtain a column vector consistent with the channel dimension of the high-level image feature through full connection operation based on the result after the global pooling operation;
and normalizing the column vector consistent with the channel dimension of the high-level image feature by using a first normalization layer to obtain a first normalized column vector, wherein a value in the first normalized column vector represents a channel importance coefficient.
The channel importance coefficient is used for adjusting the characteristics of the high-level image so as to obtain enhanced high-level characteristics after channel information is enhanced. The channel importance coefficient is in direct proportion to the importance of the high-level image feature on the channel dimension, and the simple understanding is that the larger the channel importance coefficient is, the more important the high-level image feature is from the channel dimension. It can also be understood that the underlying image features provide channel importance information for the higher level image features.
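For illustration only, a minimal sketch of such a channel attention branch, assuming the perception layer is a fully connected layer and the normalization maps the coefficients into (0, 1) with a sigmoid; the layer choices, dimensions, and names are assumptions, not the disclosed implementation:

```python
import torch
import torch.nn as nn

class ChannelInteraction(nn.Module):
    """Sketch of the channel attention branch: the bottom-level features provide
    channel importance coefficients that re-weight the high-level features."""
    def __init__(self, low_channels, high_channels):
        super().__init__()
        self.global_pool = nn.AdaptiveAvgPool2d(1)                 # global pooling layer
        self.perception = nn.Linear(low_channels, high_channels)   # first perception layer (fully connected)
        self.normalize = nn.Sigmoid()                              # first normalization layer, values in (0, 1)

    def forward(self, low_feat, high_feat):
        # low_feat: N x C_low x H x W, high_feat: N x C_high x h x w
        pooled = self.global_pool(low_feat).flatten(1)             # N x C_low
        coeff = self.normalize(self.perception(pooled))            # N x C_high, channel importance coefficients
        return high_feat * coeff.unsqueeze(-1).unsqueeze(-1)       # enhanced high-level features

ci = ChannelInteraction(low_channels=64, high_channels=256)
enhanced_high = ci(torch.randn(1, 64, 128, 128), torch.randn(1, 256, 32, 32))
```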
The spatial attention network may include a second perception layer and a second normalization layer;
the method for enhancing the spatial information of the bottom layer image features in the image features of different receptive fields by utilizing the high-layer image features in the image features of different receptive fields through the spatial attention network in the model for fusing the image features comprises the following steps:
performing full-connection operation on the high-level image features by using a second perception layer to obtain column vectors consistent with the spatial dimensions of the bottom-level image features;
and normalizing the column vectors consistent with the spatial dimension of the bottom layer image features by using a second normalization layer to obtain second normalized column vectors, wherein values in the second normalized column vectors represent spatial importance coefficients.
The spatial importance coefficient is used for adjusting the bottom layer image characteristics to obtain enhanced bottom layer characteristics after the spatial information is enhanced. The spatial importance coefficient is proportional to the importance of the bottom layer image feature in the spatial dimension, and it is simply understood that the larger the spatial importance coefficient is, the more important the bottom layer image feature is in the spatial dimension. It can also be understood that the higher level image features provide spatial importance information for the underlying image features.
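For illustration only, a minimal sketch of such a spatial attention branch, assuming the high-level features are flattened before the fully connected operation and the normalization is a softmax over spatial positions; these choices, the dimensions, and the names are assumptions, not the disclosed implementation:

```python
import torch
import torch.nn as nn

class SpatialInteraction(nn.Module):
    """Sketch of the spatial attention branch: the high-level features provide
    spatial importance coefficients that re-weight the bottom-level features."""
    def __init__(self, high_channels, high_hw, low_hw):
        super().__init__()
        # second perception layer: fully connected mapping to the spatial
        # dimension (H*W) of the bottom-level features
        self.perception = nn.Linear(high_channels * high_hw, low_hw)
        self.normalize = nn.Softmax(dim=-1)                        # second normalization layer

    def forward(self, high_feat, low_feat):
        n, c, h, w = low_feat.shape
        coeff = self.normalize(self.perception(high_feat.flatten(1)))  # N x (H*W), spatial importance coefficients
        return low_feat * coeff.view(n, 1, h, w)                       # enhanced bottom-level features

si = SpatialInteraction(high_channels=32, high_hw=8 * 8, low_hw=16 * 16)
enhanced_low = si(torch.randn(1, 32, 8, 8), torch.randn(1, 16, 16, 16))
```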
The channel attention network comprises a global pooling layer, a first perception layer and a first normalization layer, and the spatial attention network comprises a second perception layer and a second normalization layer. That is, the global pooling layer, the first perception layer and the first normalization layer realize the channel information enhancement of the high-level image features by means of the bottom-level image features, and the second perception layer and the second normalization layer realize the spatial information enhancement of the bottom-level image features by means of the high-level image features. The complexity of the model is therefore low; feature fusion is realized by the model, and when image classification, image segmentation, image detection and other processes are performed with the fused features, the computational complexity can be reduced, realizing simple and effective feature fusion and image processing.
The embodiment of the present disclosure provides a training method for a model for image feature fusion, as shown in fig. 3, the training method may include:
s301, obtaining a plurality of sample images and labels corresponding to the sample images;
s302, aiming at each sample image, inputting the sample image into a model for image feature fusion to obtain fused features.
S303, processing the image based on the fused features to obtain an image processing result;
s304, calculating the difference between the image processing result and the label corresponding to the sample image;
s305, adjusting model parameters of a model for image feature fusion based on the difference;
s306, based on the adjusted model parameters and the plurality of sample images, continuing to perform the adjustment process of the model parameters until a preset iteration end condition is met;
s307, taking the model parameters obtained when the preset iteration ending conditions are met as the trained model parameters, and taking the model for image feature fusion including the trained model parameters as the trained model for image feature fusion;
the model for image feature fusion comprises a feature extraction network, a channel attention network, a space attention network and a fusion network;
the characteristic extraction network is used for extracting image characteristics of different receptive fields of the image to be processed;
the channel attention network is used for enhancing channel information of high-level image features in the image features of different receptive fields by utilizing bottom-level image features in the image features of different receptive fields to obtain enhanced high-level features after the channel information is enhanced; wherein the receptive field of the high-level image features is larger than the receptive field of the bottom-level image features;
the spatial attention network is used for enhancing spatial information of bottom layer image features in the image features of different receptive fields by utilizing the high layer image features in the image features of different receptive fields to obtain enhanced bottom layer features after the spatial information is enhanced;
and the fusion network is used for fusing the enhanced high-level features and the enhanced bottom-level features to obtain fused features.
The model for image feature fusion can be constructed firstly, the model comprises a feature extraction network, a channel attention network, a space attention network and a fusion network, and then the model is trained by utilizing a plurality of sample images and labels corresponding to the sample images until a preset iteration ending condition is met, so that the trained model is obtained.
Specifically, a sample image is input into a model for image feature fusion, and image features of different receptive fields of an image to be processed are extracted through a feature extraction network in the model for image feature fusion; carrying out channel information enhancement on the high-level image features in the image features of different receptive fields by utilizing the bottom-level image features in the image features of different receptive fields through a channel attention network in a model for image feature fusion to obtain enhanced high-level features after the channel information enhancement; performing spatial information enhancement on bottom layer image features in the image features of different receptive fields by using high layer image features in the image features of different receptive fields through a spatial attention network in a model for image feature fusion to obtain enhanced bottom layer features after the spatial information enhancement; and fusing the enhanced high-level features and the enhanced bottom-level features through a fusion network in the model for fusing the image features to obtain fused features. As shown in fig. 4, the high-level image features are high-level semantic feature maps (Global features), the bottom-level image features are bottom-level semantic feature maps (Local features), spatial Interaction is used to enhance Spatial information of the bottom-level semantic feature maps by using the high-level semantic feature maps, that is, spatial information enhancement is performed on the bottom-level image features in the image features of different receptive fields by using the high-level image features in the image features of different receptive fields, so as to obtain enhanced bottom-level features after Spatial information enhancement; channel Interaction is used for enhancing the information of the Channel dimension of the high-level semantic feature map by using the low-level semantic feature map, namely, the Channel information of the high-level image features in the image features of different receptive fields is enhanced by using the low-level image features in the image features of different receptive fields, so that the enhanced high-level features after the Channel information is enhanced are obtained. Fuse is the fusion after the enhancement of the high-level and the bottom-level features, that is, the enhancement of the high-level features and the enhancement of the bottom-level features are fused to obtain the fused features.
Then, image processing is carried out based on the fused features to obtain an image processing result; calculating the difference between the image processing result and the label corresponding to the sample image; adjusting model parameters of a model for image feature fusion based on the difference; based on the adjusted model parameters and the plurality of sample images, continuing to perform the adjustment process of the model parameters until a preset iteration ending condition is met; and taking the model parameters obtained when the preset iteration ending condition is met as the trained model parameters, and taking the model comprising the trained model parameters as the trained model for image feature fusion.
The model parameters may be adjusted by backpropagating the error gradient, which may also be understood as updating the weights by backpropagating the error gradient.
The preset iteration ending condition may include convergence of a difference between an image processing result for a sample image and a label corresponding to the image processing result, or the preset iteration ending condition includes that the number of iterations reaches a preset number of times, and the preset number of times may be determined according to an actual requirement or experience, for example, 500 times, 600 times, 1000 times, and the like.
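For illustration only, a minimal sketch of the training loop of S301-S307, assuming a classification task with a cross-entropy loss, an SGD optimizer, and an iteration-count stopping condition; `fusion_model`, `head`, and `dataloader` are hypothetical stand-ins, not names from the disclosure:

```python
import torch
import torch.nn as nn

def train(fusion_model, head, dataloader, max_iters=1000, lr=0.01):
    """Sketch: fusion_model produces fused features (S302), head performs the
    image processing on them (S303), and parameters are adjusted from the
    difference to the label (S304-S306)."""
    params = list(fusion_model.parameters()) + list(head.parameters())
    optimizer = torch.optim.SGD(params, lr=lr)
    criterion = nn.CrossEntropyLoss()            # difference between result and label
    it = 0
    while it < max_iters:                        # preset iteration end condition
        for images, labels in dataloader:
            fused = fusion_model(images)         # S302: fused features
            outputs = head(fused)                # S303: image processing result
            loss = criterion(outputs, labels)    # S304: difference to the label
            optimizer.zero_grad()
            loss.backward()                      # back-propagate the error gradient
            optimizer.step()                     # S305: adjust model parameters
            it += 1
            if it >= max_iters:
                break
    return fusion_model, head                    # S307: trained model parameters
```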
In addition, after the training is finished, namely after the trained model for image feature fusion is obtained, the model can be tested. Specifically, a plurality of test images (which may also be referred to as data to be inferred) may be obtained, the trained model for image feature fusion is adjusted based on the test images, specifically, the process refers to the process of training the model for image feature fusion based on the sample images, and if the model for image feature fusion adjusted based on the test data is better than the trained model for image feature fusion, the model for image feature fusion adjusted based on the test data is used as the final model.
In the embodiment of the disclosure, a model for image feature fusion can be trained in advance, and channel information enhancement can be performed on high-level image features in image features of different receptive fields by using bottom-level image features in the image features of different receptive fields through the model to obtain enhanced high-level features after channel information enhancement; carrying out spatial information enhancement on bottom layer image features in the image features of different receptive fields by utilizing the high layer image features in the image features of different receptive fields to obtain enhanced bottom layer features after the spatial information enhancement; the enhanced high-level features and the enhanced bottom-level features are fused to obtain fused features, so that different receptive field feature information can be better utilized through interaction of the high-level image features and the bottom-level image features, image feature fusion can be better performed, the extracted image features are more accurate, the obtained fused features can be further utilized for image detection, image classification, image segmentation and the like, and image detection results, image classification results and image segmentation results are more accurate. In addition, a model for image feature fusion is trained in advance, and when the image feature fusion is to be realized, the feature fusion can be conveniently carried out through the model for image feature fusion.
In the embodiment of the present disclosure, the specific type of the model is not limited. For example, the model may be a convolutional neural network such as a residual network, in which a processing module is replaced by a network layer formed by a feature extraction network, a channel attention network, a spatial attention network and a fusion network, so as to construct the model for image feature fusion. As shown in fig. 5, the model is constructed and its parameters are initialized using a deep residual network (ResNet) structure. The residual network includes an input layer (Input), a feature extraction layer (Stem), processing modules (Block) and an output layer (fc), where the number of Blocks can be multiple, e.g. xN; the residual network is formed by convolutions with a 3 x 3 convolution kernel (conv 3 x 3), the high-level image features are the high-level semantic feature map (GF), and the bottom-level image features are the bottom-level semantic feature map. In the embodiment of the present disclosure, a Block, that is, the module in the dashed frame, is replaced by a network layer formed by a feature extraction network, a channel attention network, a spatial attention network and a fusion network, to construct an initial model. Then, the model is trained to obtain the trained model for image feature fusion.
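For illustration only, a minimal sketch of the fig. 5 skeleton (Input, Stem, Block xN, fc), in which each Block would be replaced by the fusion layer described above; here `FusionBlock` is a simple residual placeholder purely so the skeleton runs, and all names and sizes are assumptions, not the disclosed architecture:

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Placeholder for the network layer that replaces a ResNet Block
    (feature extraction + channel attention + spatial attention + fusion).
    Shown here as a plain residual 3x3 convolution so the sketch is runnable."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.conv(x))

class FusionResNet(nn.Module):
    """Sketch of the skeleton: Input -> Stem -> Block xN -> fc."""
    def __init__(self, channels=64, num_blocks=4, num_classes=10):
        super().__init__()
        self.stem = nn.Sequential(                               # feature extraction layer (Stem)
            nn.Conv2d(3, channels, kernel_size=7, stride=2, padding=3),
            nn.ReLU(inplace=True),
        )
        self.blocks = nn.Sequential(                             # Block xN, each replaced by a fusion layer
            *[FusionBlock(channels) for _ in range(num_blocks)]
        )
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(channels, num_classes)               # output layer (fc)

    def forward(self, x):
        x = self.blocks(self.stem(x))
        return self.fc(self.pool(x).flatten(1))

model = FusionResNet()
logits = model(torch.randn(1, 3, 224, 224))
```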
In an optional embodiment, for each sample image, the label corresponding to the sample image includes a category label, a detection label and/or a segmentation label of the sample image;
s304 may include:
carrying out image classification, image detection and/or image segmentation based on the fusion result to obtain a classification result, a detection result and/or a segmentation result aiming at the sample image;
s305 may include:
differences between the classification results and the class labels, between the detection results and the detection labels, and/or between the segmentation results and the segmentation labels are computed.
In the embodiment of the disclosure, a model for image feature fusion can be obtained by training a plurality of sample images and class labels of the sample images; or, a model for image feature fusion can be obtained by training a plurality of sample images and detection labels of the sample images; or, a model for image feature fusion can be obtained by training a plurality of sample images and segmentation labels of the sample images; or training by utilizing a plurality of sample images and class labels and detection labels of the sample images to obtain a model for image feature fusion; or training by utilizing a plurality of sample images and class labels and segmentation labels of the sample images to obtain a model for image feature fusion; or training by using a plurality of sample images and the detection labels and the segmentation labels of the sample images to obtain a model for image feature fusion, or training by using a plurality of sample images and the class labels, the detection labels and the segmentation labels of the sample images to obtain a model for image feature fusion.
In an implementation manner, a model for image feature fusion can be obtained by training a plurality of sample images and class labels of the sample images, and then image classification can be performed by using the model. Specifically, for each sample image, the sample image is input into a model for image feature fusion, and fused features are obtained. Carrying out image classification based on the fused image to obtain a classification result aiming at the sample image; calculating the difference between the classification result and the class label corresponding to the sample image; adjusting model parameters of a model for image feature fusion based on the difference; based on the adjusted model parameters and the plurality of sample images, continuing the adjustment process of the model parameters until a preset iteration end condition is met; and taking the model parameters obtained when the preset iteration ending conditions are met as the trained model parameters, and taking the model for image feature fusion including the trained model parameters as the trained model for image feature fusion. Inputting a sample image into a model for image feature fusion, wherein obtaining fused features comprises: inputting a sample image into a model for image feature fusion, and extracting image features of different receptive fields of an image to be processed through a feature extraction network in the model for image feature fusion; performing channel information enhancement on the high-level image features in the image features of different receptive fields by using the bottom-level image features in the image features of different receptive fields through a channel attention network in a model for image feature fusion to obtain enhanced high-level features after the channel information enhancement; performing spatial information enhancement on bottom layer image features in the image features of different receptive fields by using high layer image features in the image features of different receptive fields through a spatial attention network in a model for image feature fusion to obtain enhanced bottom layer features after the spatial information enhancement; and fusing the enhanced high-level features and the enhanced bottom-level features through a fusion network in the model for fusing the image features to obtain fused features.
The model for image feature fusion can be trained by utilizing a plurality of sample images and class labels corresponding to the sample images, image feature fusion can be performed on the images to be processed through the model, and then image classification can be performed based on the fused features. In the model, channel information enhancement is carried out on the high-level image features in the image features of different receptive fields by utilizing the bottom-level image features in the image features of different receptive fields to obtain enhanced high-level features after the channel information enhancement; carrying out spatial information enhancement on bottom layer image features in the image features of different receptive fields by utilizing the high layer image features in the image features of different receptive fields to obtain enhanced bottom layer features after the spatial information enhancement; and the enhanced high-level features and the enhanced bottom-level features are fused to obtain fused features, so that the image features can be better fused by better utilizing different receptive field feature information through the interaction of the high-level image features and the bottom-level image features, and the extracted image features are more accurate. In addition, because the model is trained on the basis of the sample images and the class labels of the images, the image classification result can be more accurate by utilizing the model to classify the images.
In another implementation manner, a model for image feature fusion can be obtained by training a plurality of sample images and detection labels of the sample images, and then image detection can be performed by using the model. The specific training process is similar to the process of training the model by using a plurality of sample images and class labels of the sample images, and is different in that image detection is performed based on the fused images to obtain a detection result for the sample images, and the difference between the detection result and the detection label is calculated. After the model is obtained in the embodiment of the present disclosure, the image feature fusion can be performed on the image to be processed through the model, and then the image detection can be performed based on the fused feature. In the model, the bottom layer image characteristics in the image characteristics of different receptive fields are utilized to enhance the channel information of the high layer image characteristics in the image characteristics of different receptive fields, so as to obtain enhanced high layer characteristics after the channel information is enhanced; carrying out spatial information enhancement on bottom layer image features in the image features of different receptive fields by utilizing the high layer image features in the image features of different receptive fields to obtain enhanced bottom layer features after the spatial information enhancement; and the enhanced high-level features and the enhanced bottom-level features are fused to obtain fused features, so that the image features can be better fused by better utilizing different receptive field feature information through the interaction of the high-level image features and the bottom-level image features, and the extracted image features are more accurate. In addition, because the model is trained based on the sample image and the detection label of the image, the image detection is carried out by using the model, and the image detection result can be more accurate.
In another implementation manner, a model for image feature fusion can be obtained by training a plurality of sample images and segmentation labels of the sample images, and then image segmentation can be performed by using the model. The specific training process is similar to the process of training the model by using a plurality of sample images and class labels of the sample images, and is different in that image segmentation is performed on the basis of the fused images, a segmentation result for the sample images is obtained, and a difference between the segmentation result and the segmentation labels is calculated. After the model is obtained in the embodiment of the present disclosure, the image feature fusion can be performed on the image to be processed through the model, and then the image segmentation can be performed based on the fused feature. In the model, channel information enhancement is carried out on the high-level image features in the image features of different receptive fields by utilizing the bottom-level image features in the image features of different receptive fields to obtain enhanced high-level features after the channel information enhancement; carrying out spatial information enhancement on bottom layer image features in the image features of different receptive fields by utilizing the high layer image features in the image features of different receptive fields to obtain enhanced bottom layer features after the spatial information enhancement; and the enhanced high-level features and the enhanced bottom-level features are fused to obtain fused features, so that the image features can be better fused by better utilizing different receptive field feature information through the interaction of the high-level image features and the bottom-level image features, and the extracted image features are more accurate. In addition, because the model is trained based on the sample image and the segmentation label of the image, the image segmentation is carried out by using the model, and the image segmentation result can be more accurate.
In another implementation mode, a model for image feature fusion is obtained by training with a plurality of sample images together with two or three kinds of labels: class labels and detection labels; class labels and segmentation labels; detection labels and segmentation labels; or class labels, detection labels and segmentation labels. The plurality of sample images may accordingly comprise two parts, with the model trained on one part of the sample images and their class labels and another part and their detection labels; or one part with class labels and another with segmentation labels; or one part with detection labels and another with segmentation labels. Alternatively, the plurality of sample images may comprise three parts, with the model trained on one part with class labels, one part with detection labels, and one part with segmentation labels. In the embodiment of the present disclosure, on the one hand, combining training data from a plurality of scenes makes it easier to obtain sufficient sample data and reduces the difficulty of model training; on the other hand, combining data from a plurality of scenes improves the accuracy of model training, so that the fused features obtained through the model are more accurate, and the results of image classification, image detection and image segmentation based on the fused features are more accurate.
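The training flow described in these implementations can be illustrated with a short sketch. The following Python/PyTorch code is only a minimal, hedged illustration of one possible iteration scheme, assuming a classification task with class labels; fusion_model, task_head and data_loader are hypothetical placeholders rather than components named in this disclosure, and detection or segmentation training would substitute the corresponding head and loss.

import torch
import torch.nn as nn

def train(fusion_model, task_head, data_loader, num_epochs=10):
    # fusion_model produces fused features; task_head is a hypothetical task-specific head
    params = list(fusion_model.parameters()) + list(task_head.parameters())
    optimizer = torch.optim.Adam(params, lr=1e-4)
    criterion = nn.CrossEntropyLoss()  # class labels; detection/segmentation would use their own losses
    for epoch in range(num_epochs):    # a fixed epoch count stands in for the preset iteration end condition
        for images, labels in data_loader:
            fused = fusion_model(images)      # fused features from the model for image feature fusion
            result = task_head(fused)         # image processing result (here: classification logits)
            loss = criterion(result, labels)  # difference between the result and the label
            optimizer.zero_grad()
            loss.backward()                   # adjust the model parameters based on the difference
            optimizer.step()
    return fusion_model, task_head

Mixed-label training, as described above, would draw batches from the differently labeled parts of the sample set and add the corresponding loss terms.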
In an alternative embodiment, the channel attention network includes a global pooling layer, a first perception layer, and a first normalization layer;
the global pooling layer is used for performing global pooling operation on the bottom layer image characteristics;
the first perception layer is used for obtaining a column vector consistent with the channel dimension of the high-level image feature through full-connection operation based on the result after the global pooling operation;
the first normalization layer is used for performing normalization operation on the column vectors consistent with the channel dimensions of the high-level image features to obtain first normalized column vectors, and values in the first normalized column vectors represent channel importance coefficients.
The channel attention network may further normalize the first normalized column vector obtained by the first normalization layer through an activation function, so as to map the values of the column vector into the range (0, 1).
A channel importance coefficient is calculated from the bottom-level image features through the channel attention network, and the high-level image features are adjusted based on the channel importance coefficient to obtain the enhanced high-level features after channel information enhancement. In this way, the bottom-level image features in the image features of different receptive fields are used to enhance the channel information of the high-level image features in the image features of different receptive fields, yielding the enhanced high-level features.
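A minimal PyTorch sketch of this channel attention path is given below, assuming a single-layer linear perception layer, batch normalization and a sigmoid activation; the class and argument names are illustrative assumptions rather than the disclosed implementation.

import torch
import torch.nn as nn

class ChannelInteraction(nn.Module):
    # low: bottom-level features (B, C1, H, W); high: high-level features (B, C2, H, W)
    def __init__(self, c_low, c_high):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)  # global pooling layer (GAP)
        self.lp = nn.Linear(c_low, c_high)  # first perception layer: a single full-connection layer
        self.bn = nn.BatchNorm1d(c_high)    # first normalization layer

    def forward(self, low, high):
        w = self.gap(low).flatten(1)            # C1-dimensional column vector per sample
        w = torch.sigmoid(self.bn(self.lp(w)))  # channel importance coefficients in (0, 1)
        return high * w[:, :, None, None]       # enhanced high-level features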
The spatial attention network comprises a second perception layer and a second normalization layer;
the second perception layer is used for carrying out full connection operation on the high-level image features to obtain column vectors consistent with the spatial dimensions of the bottom-level image features;
and the second normalization layer is used for performing normalization operation on the column vectors consistent with the spatial dimensions of the bottom layer image characteristics to obtain second normalized column vectors, and values in the second normalized column vectors represent spatial importance coefficients.
A spatial importance coefficient is calculated from the high-level image features through the spatial attention network, and the bottom-level image features are adjusted based on the spatial importance coefficient; the adjusted features are the features after spatial information enhancement, namely the enhanced bottom-level features. In this way, the high-level image features in the image features of different receptive fields are used to enhance the spatial information of the bottom-level image features in the image features of different receptive fields, yielding the enhanced bottom-level features.
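The spatial attention path can be sketched in the same spirit. Interpreting the full-connection operation over the channel dimension as a 1x1 convolution is an assumption made here for concreteness (it is one common way to realize a per-position fully connected layer); the names are again illustrative.

import torch
import torch.nn as nn

class SpatialInteraction(nn.Module):
    # high: high-level features (B, C2, H, W); low: bottom-level features (B, C1, H, W) with the same spatial size
    def __init__(self, c_high):
        super().__init__()
        self.lp = nn.Conv2d(c_high, 1, kernel_size=1)  # second perception layer: per-position full connection over channels
        self.bn = nn.BatchNorm2d(1)                    # second normalization layer

    def forward(self, low, high):
        m = torch.sigmoid(self.bn(self.lp(high)))  # spatial importance coefficients in (0, 1), shape (B, 1, H, W)
        return low * m                             # enhanced bottom-level features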
As shown in fig. 6, the high-level image features are the high-level semantic feature map (Global features) and the bottom-level image features are the bottom-level semantic feature map (Local features). Spatial Interaction enhances the spatial information of the bottom-level semantic feature map by using the high-level semantic feature map; that is, the high-level image features in the image features of different receptive fields are used to perform spatial information enhancement on the bottom-level image features, obtaining the enhanced bottom-level features after spatial information enhancement. Channel Interaction enhances the channel-dimension information of the high-level semantic feature map by using the bottom-level semantic feature map; that is, the bottom-level image features in the image features of different receptive fields are used to perform channel information enhancement on the high-level image features, obtaining the enhanced high-level features after channel information enhancement. Fuse denotes the fusion of the enhanced high-level and enhanced bottom-level features; that is, the enhanced high-level features and the enhanced bottom-level features are fused to obtain the fused features.
In fig. 6, in (C1, H, W), C1 represents the channel number (Channel), H the height (Height), and W the width (Width) of the bottom-level semantic feature map; in (C2, H, W), C2 represents the channel number, H the height, and W the width of the high-level semantic feature map;
channel Interaction includes GAP, LP, BN and Activation; spatial Interaction includes LP, BN, and Activation. Wherein, BN is batch normalization processing operation, activation represents an Activation function, LP is a perception layer, and GAP is global pooling operation. Concat is a feature splicing operation, conv1x1 is used for adjusting the feature channel dimension, that is, the fused feature is obtained through the splicing operation and the adjustment of the feature channel dimension.
In the Channel Interaction process, GAP, namely the global pooling layer, performs global pooling on the bottom-level feature information, and the result is a C1-dimensional column vector. LP, namely the first perception layer, specifically a linear perception layer, performs a full-connection operation, and its output is a column vector consistent with the channel dimension of the high-level semantic feature map, namely a C2-dimensional column vector. This result is input into BN, namely the first normalization layer, whose normalization operation yields the normalized result without changing the dimension of the output. The result of Activation is a number in (0, 1), which can be seen as an adjustment of the importance of each channel feature of the high-level semantic feature map: the larger the value, the more important the feature.
Similarly, spatial Interaction can also realize the calculation of a Spatial importance coefficient based on high-level image characteristics through LP, BN and Activation; and adjusting the bottom layer image characteristics based on the spatial importance coefficient, wherein the adjusted characteristics are the characteristics after the spatial information is enhanced, and obtaining the enhanced bottom layer characteristics after the spatial information is enhanced. In Spatial Interaction, LP is the second sensing layer, and BN is the second normalization layer.
In the spatial attention network in the embodiment of the present disclosure, a convolution operation is used to aggregate information or reduce dimensionality along the channel dimension. Compared with the conventional maximum or mean pooling over channels, the convolution operation here uses learnable parameters and is more flexible. In addition, the linear perception layer in the channel attention network has only one layer, which reduces the structural complexity.
In the embodiment of the present disclosure, the channel attention network includes the global pooling layer, the first perception layer and the first normalization layer, and the spatial attention network includes the second perception layer and the second normalization layer. That is, the global pooling layer, the first perception layer and the first normalization layer realize channel information enhancement of the high-level image features in the image features of different receptive fields by using the bottom-level image features, and the second perception layer and the second normalization layer realize spatial information enhancement of the bottom-level image features by using the high-level image features. This reduces the complexity of the model; further, when the model is used for feature fusion and the fused features are used for image classification, image segmentation, image detection and other processing, the computational complexity is reduced, achieving simple and effective feature fusion and subsequent image processing.
Corresponding to the image feature fusion method provided in the foregoing embodiment, an embodiment of the present disclosure further provides an image feature fusion device, as shown in fig. 7, which may include:
an obtaining module 701, configured to obtain image features of different receptive fields;
a feature enhancement module 702, configured to perform channel information enhancement on a high-level image feature in image features of different receptive fields by using a bottom-level image feature in the image features of different receptive fields, so as to obtain an enhanced high-level feature after the channel information enhancement, wherein the receptive field of the high-level image features is larger than the receptive field of the bottom-level image features; and configured to perform spatial information enhancement on the bottom-level image features in the image features of different receptive fields by using the high-level image features in the image features of different receptive fields, so as to obtain enhanced bottom-level features after the spatial information enhancement;
and a fusion module 703, configured to fuse the enhanced high-level features and the enhanced bottom-level features to obtain fused features.
Optionally, the obtaining module 701 is specifically configured to obtain an image to be processed, input the image to be processed into a model for image feature fusion, and extract image features of different receptive fields of the image to be processed through a feature extraction network in the model for image feature fusion;
the feature enhancement module 702 is specifically configured to perform, through a channel attention network in the model for image feature fusion, channel information enhancement on a high-level image feature in image features of different receptive fields by using a bottom-level image feature in the image features of different receptive fields, so as to obtain an enhanced high-level feature after the channel information enhancement; performing spatial information enhancement on bottom layer image features in the image features of different receptive fields by using high layer image features in the image features of different receptive fields through a spatial attention network in a model for image feature fusion to obtain enhanced bottom layer features after the spatial information enhancement;
and a fusion module 703, configured to fuse the enhanced high-level features and the enhanced bottom-level features through a fusion network in the model for image feature fusion, to obtain fused features.
Optionally, the channel attention network comprises a global pooling layer, a first perception layer, and a first normalization layer; the spatial attention network comprises a second perception layer and a second normalization layer;
a feature enhancement module 702, specifically configured to perform global pooling on the bottom-layer image features by using a global pooling layer; utilizing the first perception layer to obtain a column vector consistent with the channel dimension of the high-level image feature through full connection operation based on the result after the global pooling operation; normalizing the column vector consistent with the channel dimension of the high-level image feature by using a first normalization layer to obtain a first normalized column vector, wherein a value in the first normalized column vector represents a channel importance coefficient; performing full-connection operation on the high-level image features by using a second perception layer to obtain column vectors consistent with the spatial dimensions of the bottom-level image features; and normalizing the column vectors consistent with the spatial dimension of the bottom layer image features by using a second normalization layer to obtain second normalized column vectors, wherein values in the second normalized column vectors represent spatial importance coefficients.
Corresponding to the model training method for image feature fusion provided in the foregoing embodiment, an embodiment of the present disclosure further provides a model training device for image feature fusion, as shown in fig. 8, which may include:
an obtaining module 801, configured to obtain a plurality of sample images and labels corresponding to the sample images;
a fusion feature obtaining module 802, configured to input the sample images into a model for image feature fusion for each sample image, to obtain fused features;
an image processing result obtaining module 803, configured to perform image processing based on the fused features to obtain an image processing result;
a calculating module 804, configured to calculate a difference between the image processing result and a label corresponding to the sample image;
a training module 805 for adjusting model parameters of the model for image feature fusion based on the difference; based on the adjusted model parameters and the plurality of sample images, continuing the adjustment process of the model parameters until a preset iteration end condition is met; taking model parameters obtained when a preset iteration ending condition is met as model parameters after training, and taking a model for image feature fusion including the model parameters after training as a model for image feature fusion after training;
the model for image feature fusion comprises a channel attention network, a space attention network and a fusion network;
the channel attention network is used for enhancing channel information of high-level image features in the image features of different receptive fields by utilizing bottom-level image features in the image features of different receptive fields to obtain enhanced high-level features after the channel information is enhanced; wherein the receptive field of the high-level image features is larger than the receptive field of the bottom-level image features;
the spatial attention network is used for enhancing spatial information of bottom layer image features in the image features of different receptive fields by utilizing the high layer image features in the image features of different receptive fields to obtain enhanced bottom layer features after the spatial information is enhanced;
and the fusion network is used for fusing the enhanced high-level features and the enhanced bottom-level features to obtain fused features.
Optionally, for each sample image, the label corresponding to the sample image includes a category label, a detection label and/or a segmentation label of the sample image;
an image processing result obtaining module 803, specifically configured to perform image classification, image detection, and/or image segmentation based on the fusion result, so as to obtain a classification result, a detection result, and/or a segmentation result for the sample image;
the calculating module 804 is specifically configured to calculate a difference between the classification result and the class label, a difference between the detection result and the detection label, and/or a difference between the segmentation result and the segmentation label.
Optionally, the channel attention network comprises a global pooling layer, a first perception layer, and a first normalization layer;
the global pooling layer is used for performing global pooling operation on the bottom layer image characteristics;
the first perception layer is used for obtaining a column vector consistent with the channel dimension of the high-level image feature through full-connection operation based on the result after the global pooling operation;
the first normalization layer is used for performing normalization operation on the column vectors consistent with the channel dimensions of the high-level image features to obtain first normalized column vectors, and values in the first normalized column vectors represent channel importance coefficients;
the spatial attention network comprises a second perception layer and a second normalization layer;
the second perception layer is used for carrying out full connection operation on the high-level image features to obtain column vectors consistent with the spatial dimensions of the bottom-level image features;
and the second normalization layer is used for performing normalization operation on the column vectors consistent with the spatial dimensions of the bottom layer image characteristics to obtain second normalized column vectors, and values in the second normalized column vectors represent spatial importance coefficients.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other processing of the personal information of the related users comply with the provisions of relevant laws and regulations and are not contrary to public order and good morals.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 9 illustrates a schematic block diagram of an example electronic device 900 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital processors, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the apparatus 900 includes a computing unit 901, which can perform various appropriate actions and processes in accordance with a computer program stored in a Read Only Memory (ROM) 902 or a computer program loaded from a storage unit 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the device 900 can also be stored. The calculation unit 901, ROM 902, and RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to bus 904.
A number of components in the device 900 are connected to the I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, and the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, optical disk, or the like; and a communication unit 909 such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 901 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The calculation unit 901 performs the respective methods and processes described above, such as an image feature fusion method or a model training method for image feature fusion. For example, in some embodiments, the image feature fusion method or the model training method for image feature fusion may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 900 via ROM 902 and/or communications unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the image feature fusion method or the model training method for image feature fusion described above may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured to perform the image feature fusion method or the model training method for image feature fusion by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims (15)

1. An image feature fusion method, comprising:
acquiring image characteristics of different receptive fields;
carrying out channel information enhancement on the high-level image features in the image features of different receptive fields by utilizing the bottom-level image features in the image features of different receptive fields to obtain enhanced high-level features after the channel information enhancement; wherein the receptive field of the high-level image features is larger than the receptive field of the bottom-level image features;
utilizing the high-level image features in the image features of different receptive fields to perform spatial information enhancement on the bottom-level image features in the image features of different receptive fields to obtain enhanced bottom-level features after the spatial information enhancement;
and fusing the enhanced high-level features and the enhanced bottom-level features to obtain fused features.
2. The method of claim 1, wherein the acquiring image features of different receptive fields comprises:
acquiring an image to be processed, inputting the image to be processed into a model for image feature fusion, and extracting image features of different receptive fields of the image to be processed through a feature extraction network in the model for image feature fusion;
the method for enhancing the channel information of the high-level image features in the image features of different receptive fields by utilizing the bottom-level image features in the image features of different receptive fields to obtain enhanced high-level features after the channel information enhancement comprises the following steps:
performing channel information enhancement on the high-level image features in the image features of different receptive fields by utilizing the bottom-level image features in the image features of different receptive fields through a channel attention network in the model for image feature fusion to obtain enhanced high-level features after the channel information enhancement;
the method for enhancing the spatial information of the bottom image features in the image features of different receptive fields by utilizing the high-level image features in the image features of different receptive fields to obtain enhanced bottom features after the spatial information enhancement comprises the following steps:
performing spatial information enhancement on bottom layer image features in the image features of different receptive fields by utilizing high layer image features in the image features of different receptive fields through a spatial attention network in the model for image feature fusion to obtain enhanced bottom layer features after the spatial information enhancement;
the fusing the enhanced high-level features and the enhanced bottom-level features to obtain fused features, which comprises:
and fusing the enhanced high-level features and the enhanced bottom-level features through the fusion network in the model for image feature fusion to obtain fused features.
3. The method of claim 2, wherein the channel attention network comprises a global pooling layer, a first perception layer, and a first normalization layer;
the channel information enhancement of the high-level image features in the image features of different receptive fields by using the bottom-level image features in the image features of different receptive fields through the channel attention network in the model for image feature fusion includes:
performing global pooling operation on the bottom layer image characteristics by utilizing the global pooling layer;
utilizing the first perception layer to obtain, through a full-connection operation based on the result of the global pooling operation, a column vector consistent with the channel dimension of the high-level image feature;
normalizing the column vector consistent with the channel dimension of the high-level image feature by using the first normalization layer to obtain a first normalized column vector, wherein a value in the first normalized column vector represents a channel importance coefficient;
the spatial attention network comprises a second perception layer and a second normalization layer;
the enhancing spatial information of the bottom layer image features in the image features of different receptive fields by using the high layer image features in the image features of different receptive fields through the spatial attention network in the model for image feature fusion includes:
performing full connection operation on the high-layer image features by using the second perception layer to obtain column vectors consistent with the spatial dimensions of the bottom-layer image features;
and normalizing the column vector consistent with the spatial dimension of the bottom layer image characteristic by using the second normalization layer to obtain a second normalized column vector, wherein a value in the second normalized column vector represents a spatial importance coefficient.
4. A model training method for image feature fusion comprises the following steps:
obtaining a plurality of sample images and labels corresponding to the sample images;
inputting the sample images into a model for image feature fusion aiming at each sample image to obtain fused features;
performing image processing based on the fused features to obtain an image processing result;
calculating the difference between the image processing result and a label corresponding to the sample image;
adjusting model parameters of the model for image feature fusion based on the difference;
based on the adjusted model parameters and the plurality of sample images, continuing to perform the adjustment process of the model parameters until a preset iteration end condition is met;
taking model parameters obtained when a preset iteration ending condition is met as model parameters after training, and taking a model for image feature fusion including the model parameters after training as a model for image feature fusion after training;
the model for image feature fusion comprises a channel attention network, a spatial attention network and a fusion network;
the channel attention network is used for enhancing channel information of the high-level image features in the image features of different receptive fields by utilizing the bottom-level image features in the image features of different receptive fields to obtain enhanced high-level features after the channel information is enhanced; wherein the receptive field of the high-level image features is larger than the receptive field of the bottom-level image features;
the spatial attention network is used for enhancing spatial information of bottom layer image features in the image features of different receptive fields by utilizing the high layer image features in the image features of different receptive fields to obtain enhanced bottom layer features after the spatial information is enhanced;
and the fusion network is used for fusing the enhanced high-level features and the enhanced bottom-level features to obtain fused features.
5. The method according to claim 4, wherein, for each sample image, the corresponding label of the sample image comprises a class label, a detection label and/or a segmentation label of the sample image;
the image processing based on the fused features to obtain an image processing result includes:
performing image classification, image detection and/or image segmentation based on the fusion result to obtain a classification result, a detection result and/or a segmentation result for the sample image;
the calculating the difference between the image processing result and the label corresponding to the sample image comprises:
calculating a difference between the classification result and the class label, a difference between the detection result and the detection label, and/or a difference between the segmentation result and the segmentation label.
6. The method of claim 4, wherein the channel attention network comprises a global pooling layer, a first perception layer, and a first normalization layer;
the global pooling layer is used for performing global pooling operation on the bottom layer image characteristics;
the first perception layer is used for obtaining a column vector consistent with the channel dimension of the high-level image feature through full connection operation based on the result after the global pooling operation;
the first normalization layer is used for performing normalization operation on the column vectors consistent with the channel dimensions of the high-level image features to obtain first normalized column vectors, and values in the first normalized column vectors represent channel importance coefficients;
the spatial attention network comprises a second perception layer and a second normalization layer;
the second perception layer is used for carrying out full connection operation on the high-layer image features to obtain column vectors consistent with the spatial dimensions of the bottom-layer image features;
the second normalization layer is configured to perform normalization operation on the column vectors consistent with the spatial dimensions of the bottom-layer image features to obtain second normalized column vectors, and values in the second normalized column vectors represent spatial importance coefficients.
7. An image feature fusion apparatus comprising:
the acquisition module is used for acquiring image characteristics of different receptive fields;
the characteristic enhancement module is used for utilizing the bottom layer image characteristics in the image characteristics of different receptive fields to enhance the channel information of the high-level image characteristics in the image characteristics of different receptive fields to obtain enhanced high-level characteristics after the channel information is enhanced; wherein the receptive field of the high-level image features is larger than the receptive field of the bottom-level image features; the enhancement device is used for utilizing the high-level image features in the image features of different receptive fields to carry out spatial information enhancement on the bottom-level image features in the image features of different receptive fields to obtain enhanced bottom-level features after the spatial information enhancement;
and the fusion module is used for fusing the enhanced high-level features and the enhanced bottom-level features to obtain fused features.
8. The apparatus according to claim 7, wherein the acquiring module is specifically configured to acquire an image to be processed, input the image to be processed into a model for image feature fusion, and extract image features of different receptive fields of the image to be processed through a feature extraction network in the model for image feature fusion;
the feature enhancement module is specifically configured to perform, through the channel attention network in the model for image feature fusion, channel information enhancement on a high-level image feature in the image features of different receptive fields by using a bottom-level image feature in the image features of different receptive fields, so as to obtain an enhanced high-level feature after the channel information enhancement; performing spatial information enhancement on bottom layer image features in the image features of different receptive fields by utilizing high layer image features in the image features of different receptive fields through a spatial attention network in the model for image feature fusion to obtain enhanced bottom layer features after the spatial information enhancement;
and the fusion module is used for fusing the enhanced high-level features and the enhanced bottom-level features through a fusion network in the model for image feature fusion to obtain fused features.
9. The apparatus of claim 8, wherein the channel attention network comprises a global pooling layer, a first perception layer, and a first normalization layer; the spatial attention network comprises a second perception layer and a second normalization layer;
the feature enhancement module is specifically configured to perform a global pooling operation on the bottom-layer image features by using the global pooling layer; obtain, by using the first perception layer and through a full-connection operation based on the result of the global pooling operation, a column vector consistent with the channel dimension of the high-level image feature; normalize the column vector consistent with the channel dimension of the high-level image feature by using the first normalization layer to obtain a first normalized column vector, wherein a value in the first normalized column vector represents a channel importance coefficient; perform a full-connection operation on the high-level image features by using the second perception layer to obtain a column vector consistent with the spatial dimensions of the bottom-layer image features; and normalize the column vector consistent with the spatial dimensions of the bottom-layer image features by using the second normalization layer to obtain a second normalized column vector, wherein a value in the second normalized column vector represents a spatial importance coefficient.
10. A model training apparatus for image feature fusion, comprising:
the acquisition module is used for acquiring a plurality of sample images and labels corresponding to the sample images;
a fusion feature obtaining module, configured to input the sample images into a model for image feature fusion, to obtain fused features;
an image processing result obtaining module, configured to perform image processing based on the fused features to obtain an image processing result;
the calculation module is used for calculating the difference between the image processing result and the label corresponding to the sample image;
a training module for adjusting model parameters of the model for image feature fusion based on the difference; based on the adjusted model parameters and the plurality of sample images, continuing the adjustment process of the model parameters until a preset iteration end condition is met; taking model parameters obtained when a preset iteration ending condition is met as model parameters after training, and taking a model for image feature fusion including the model parameters after training as a model for image feature fusion after training;
the model for image feature fusion comprises a channel attention network, a space attention network and a fusion network;
the channel attention network is used for enhancing channel information of the high-level image features in the image features of different receptive fields by utilizing the bottom-level image features in the image features of different receptive fields to obtain enhanced high-level features after the channel information is enhanced; wherein the receptive field of the high-level image features is larger than the receptive field of the bottom-level image features;
the spatial attention network is used for enhancing spatial information of bottom layer image features in the image features of different receptive fields by utilizing the high layer image features in the image features of different receptive fields to obtain enhanced bottom layer features after the spatial information is enhanced;
and the fusion network is used for fusing the enhanced high-level features and the enhanced bottom-level features to obtain fused features.
11. The apparatus of claim 10, wherein, for each sample image, the label to which the sample image corresponds comprises a class label, a detection label, and/or a segmentation label of the sample image;
the image processing result obtaining module is specifically configured to perform image classification, image detection and/or image segmentation based on the fusion result to obtain a classification result, a detection result and/or a segmentation result for the sample image;
the calculating module is specifically configured to calculate a difference between the classification result and the class label, a difference between the detection result and the detection label, and/or a difference between the segmentation result and the segmentation label.
12. The apparatus of claim 10, wherein the channel attention network comprises a global pooling layer, a first perception layer, and a first normalization layer;
the global pooling layer is used for performing global pooling operation on the bottom layer image characteristics;
the first perception layer is used for obtaining a column vector consistent with the channel dimension of the high-level image feature through full connection operation based on the result after the global pooling operation;
the first normalization layer is used for performing normalization operation on the column vectors consistent with the channel dimensions of the high-level image features to obtain first normalized column vectors, and values in the first normalized column vectors represent channel importance coefficients;
the spatial attention network comprises a second perception layer and a second normalization layer;
the second perception layer is used for carrying out full connection operation on the high-layer image features to obtain column vectors consistent with the spatial dimensions of the bottom-layer image features;
the second normalization layer is configured to perform normalization operation on the column vectors consistent with the spatial dimensions of the bottom-layer image features to obtain second normalized column vectors, and values in the second normalized column vectors represent spatial importance coefficients.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
14. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-6.
15. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-6.
CN202211067385.4A 2022-09-01 2022-09-01 Image feature fusion and model training method, device, equipment and storage medium Pending CN115482443A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211067385.4A CN115482443A (en) 2022-09-01 2022-09-01 Image feature fusion and model training method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211067385.4A CN115482443A (en) 2022-09-01 2022-09-01 Image feature fusion and model training method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115482443A true CN115482443A (en) 2022-12-16

Family

ID=84421951

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211067385.4A Pending CN115482443A (en) 2022-09-01 2022-09-01 Image feature fusion and model training method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115482443A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116740510A (en) * 2023-03-20 2023-09-12 北京百度网讯科技有限公司 Image processing method, model training method and device

Similar Documents

Publication Publication Date Title
CN113920307A (en) Model training method, device, equipment, storage medium and image detection method
CN115861462B (en) Training method and device for image generation model, electronic equipment and storage medium
CN114882321A (en) Deep learning model training method, target object detection method and device
CN112861885A (en) Image recognition method and device, electronic equipment and storage medium
CN115082920A (en) Deep learning model training method, image processing method and device
CN114187459A (en) Training method and device of target detection model, electronic equipment and storage medium
CN113378712A (en) Training method of object detection model, image detection method and device thereof
EP4123595A2 (en) Method and apparatus of rectifying text image, training method and apparatus, electronic device, and medium
CN112749300A (en) Method, apparatus, device, storage medium and program product for video classification
CN113378855A (en) Method for processing multitask, related device and computer program product
CN115147680B (en) Pre-training method, device and equipment for target detection model
CN113205041A (en) Structured information extraction method, device, equipment and storage medium
CN113705362A (en) Training method and device of image detection model, electronic equipment and storage medium
CN114821063A (en) Semantic segmentation model generation method and device and image processing method
CN114861758A (en) Multi-modal data processing method and device, electronic equipment and readable storage medium
CN114120454A (en) Training method and device of living body detection model, electronic equipment and storage medium
CN115482443A (en) Image feature fusion and model training method, device, equipment and storage medium
CN114399513B (en) Method and device for training image segmentation model and image segmentation
CN115641481A (en) Method and device for training image processing model and image processing
CN115861809A (en) Rod detection and training method and device for model thereof, electronic equipment and medium
CN114973333A (en) Human interaction detection method, human interaction detection device, human interaction detection equipment and storage medium
CN113378774A (en) Gesture recognition method, device, equipment, storage medium and program product
CN113947771A (en) Image recognition method, apparatus, device, storage medium, and program product
CN113610856A (en) Method and device for training image segmentation model and image segmentation
CN113344200A (en) Method for training separable convolutional network, road side equipment and cloud control platform

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination