CN114863179A

CN114863179A - Endoscope image classification method based on multi-scale feature embedding and cross attention

Info

Publication number: CN114863179A
Application number: CN202210542820.8A
Authority: CN
Inventors: 史骏; 张元�; 汪逸飞; 杨皓程; 周泰然; 李想; 郑利平
Original assignee: Hefei University of Technology
Current assignee: Hefei University of Technology
Priority date: 2022-05-18
Filing date: 2022-05-18
Publication date: 2022-08-05
Anticipated expiration: 2042-05-18
Also published as: CN114863179B

Abstract

The invention provides an endoscope image classification method based on multi-scale feature embedding and cross attention, which comprises the following steps: acquiring marked N types of endoscope images; establishing a deep learning network based on multi-scale feature embedding and multi-head cross attention; constructing an endoscope image classifier; and predicting the endoscope image category by utilizing the established classifier. According to the endoscope image classification method, rich semantic information in a deep characteristic diagram and geometric detail information in a shallow characteristic diagram are fused through multi-scale characteristic embedding, semantic information and geometric information ambiguity among different scale characteristic diagrams are eliminated by combining a cross attention mechanism to mine more effective characteristics, and endoscope images are accurately classified, so that a doctor is assisted in diagnosis and interpretation, and the interpretation efficiency is improved.

Description

Endoscope image classification method based on multi-scale feature embedding and cross attention

Technical Field

The invention relates to the field of computer vision, in particular to an image classification technology, and specifically relates to an endoscope image classification method based on multi-scale feature embedding and cross attention.

Background

Endoscopy is the most common way of cancer diagnosis, and endoscopic image classification has important clinical significance in early cancer screening. Traditional cancer diagnosis is mainly carried out through the manual diagnosis under an endoscope by an endoscopist, but in clinical diagnosis, the judgment of the endoscopist on cancers has subjective difference, the workload of an endoscope judgment map is large, and missed diagnosis and misdiagnosis sometimes occur. Therefore, an accurate and efficient endoscopic diagnosis method is needed, which can reduce the reading pressure of the endoscopic doctor and improve the accuracy of endoscopic image classification by using a computer to assist the doctor in reading the endoscopic images.

In recent years, a deep learning framework has attracted wide attention in the field of computer vision, researchers have begun to apply the deep learning framework to various classification tasks, but most of endoscope image classification methods based on deep learning adopt a convolutional neural network model to extract features of an endoscope image in a single scale, and ignore information of other scales, so that the accuracy of endoscope image classification is difficult to improve.

Disclosure of Invention

The invention provides an endoscope image classification method based on multi-scale feature embedding and cross attention in order to make up for the existing technical defects, and aims to integrate abundant semantic information in a deep feature map and geometric detail information in a shallow feature map through multi-scale feature embedding, eliminate semantic information and geometric information ambiguity between feature maps with different scales by combining a cross attention mechanism, mine more effective features and finish accurate classification of endoscope images.

In order to achieve the purpose, the invention adopts the following technical scheme:

according to an embodiment of the invention, the invention provides an endoscope image classification method based on multi-scale feature embedding and cross attention, which comprises the following steps:

step 1, acquiring endoscope image samples of N types of C × H × W, preprocessing the samples to obtain a training set E, wherein E is { E ═ E { (E) ₁ ,E ₂ ,...,E _n ,...,E _N }；

E _n Showing an nth class endoscopic image sample, the nth class having P images in total,

representing a p-th image in the n-th class of preprocessed endoscopic image samples; c denotes an image channel, H denotes an image height, W denotes an image width, and N is 1, 2.., N;

step 2, establishing a deep learning network, processing the sample data set of the endoscope image through a convolution neural network of the deep learning network to output feature maps of different convolution stages, and forming a dimension reduction output feature map after dimension reduction processing on the feature maps of the different convolution stages

i＝1,2,3,4；

Step 3, outputting the characteristic diagram of the dimension reduction

Input to a pre-constructed multi-scaleIn a multi-head cross attention encoder with embedded features, a feature graph U is output after normalization and upsampling processing _n,p ；

Step 4, the characteristic diagram U is processed _n,p Inputting the data to a convolution stage for feature extraction and outputting a feature map D _n,p Feature map D output from the convolution stage _n,p Respectively carrying out global average pooling operation and global maximum pooling operation, and splicing and fusing the obtained feature maps to obtain a result feature map D' _n,p D 'is a feature map' _n,p Inputting the result vector into a full connection layer to obtain an N-dimensional classification result vector;

and 5, constructing an endoscope classifier based on the result vector of the N-dimensional classification to classify the endoscope image.

Further, the step 2 specifically includes:

step 2.1, establishing a deep learning network, wherein the deep learning network comprises the following steps: the system comprises a multi-scale feature extraction module, a multi-scale feature embedding module and a multi-head cross attention encoder module;

step 2.2, constructing a multi-scale feature extraction module:

the multi-scale feature extraction module is composed of four convolutional neural network stages, and sequentially comprises the following steps: a first convolution stage, a second convolution stage, a third convolution stage and a fourth convolution stage;

the p-th image

Inputting the feature map into the multi-scale feature extraction module, and respectively obtaining the feature map output by the first convolution stage through the first convolution stage, the second convolution stage, the third convolution stage and the fourth convolution stage

Feature map output by the second convolution stage

Feature map output by the third convolution stage

Feature map output by the fourth convolution stage

Step 2.3, constructing a multi-scale feature embedding module:

the multi-scale feature embedded module is formed by connecting 4 different embedded layers in parallel, wherein 4 embedded layers correspond to 4 embedded layers

1,2,3,4, each embedded layer comprising a convolution layer and a dimensionality reduction process;

outputting feature maps of four convolution stages

Input into a multi-scale feature embedding module,

i is 1,2,3,4, respectively, and is 2 by convolution kernel ^5-i ×2 ^5-i The convolution layers are respectively output characteristic graphs after dimension reduction processing

i＝1,2,3,4。

Further, the step 3 specifically includes:

step 3.1, constructing a multi-head cross attention encoder with multi-scale feature embedding:

the multi-head cross attention encoder module embedded with the multi-scale features is formed by serially connecting features embedded with 4 convolution stages and L multi-head cross attention encoders;

combining 4 characteristic maps

Inputting the multi-scale features into the multi-scale feature embedding module, respectively carrying out normalization processing through an LN layer, and carrying out normalization processing on feature maps from the angle of channel intersection

Converting, specifically obtaining a channel cross characteristic diagram by using the formula (1)

i＝1,2,3,4：

Transpose (-) in the formula (1) represents the transposition process of the feature map,

represents C ⁱ Size is H ⁱ ·W ⁱ Is generated from the pixel characteristic map of (a),

is represented by H ⁱ ·W ⁱ Size is C ⁱ A pixel feature map of individual channel intersections;

step 3.2, crossing the characteristic diagram of the channel

Performing multi-scale embedding, specifically obtaining a multi-scale feature embedding feature map by using the formula (2)

Concat (-) in equation (2) represents the feature vector splicing operation,

to represent

Channel after multi-scale feature embedding and transpositionA cross feature map;

step 3.3, feature map

As the input of the 1 st multi-headed cross attention encoder module, the output of the c-th multi-headed cross attention encoder module is as the input of the c +1 th multi-headed cross attention encoder module;

any of the c-th multi-headed cross-attention encoder modules comprises: 2 linear transformation layers, M parallel cross attention layers; c 1,2,. said, L;

step 3.4, feature map

inputting i-1, 2,3,4 into the c-th multi-head cross attention encoder module, and inputting the feature map

Respectively with two weight matrices W _m ^K ,W _m ^V Multiplying, and comparing the feature maps

i is 1,2,3,4, and is respectively associated with four weight matrixes

Multiplying and outputting the characteristic diagram K _n,p 、V _n,p 、

1,2,3,4, and the specific formula is shown in formula (3):

in the formula (3), φ (-) represents a normalization function;

step 3.5, embedding the multi-scale features into a module output feature graph K _n,p ,V _n,p ,

i is input into the 1 st multi-head cross attention encoder, K _n,p ,V _n,p ,

inputting the information I to the M-head cross attention layer after linear transformation processing respectively, and inputting the information I to the M-head cross attention layer

Are each independently of K _n,p Multiplying, finally activating by Softmax function, and mixing with V _n,p Multiplying to obtain an output, wherein a specific formula is shown as a formula (4):

in equation (4), ψ (-) is a normalization function, δ (-) is a Softmax function;

step 3.6, attention feature map

Based on the above, dynamically fusing the attention feature maps of different heads to form a new attention feature map, wherein the specific formula is shown in formula (5):

in the formula (5), the reaction mixture is,

is a learnable transformation matrix by

Fusing the multi-head attention feature maps and generating a new attention feature map;

will result in M cross attention layer outputs

i 1,2,3,4, M1, 2, M, and obtaining a characteristic map as shown in formula (6)

c＝1,2,…,L,i＝1,2,3,4：

M in equation (6) is the number of cross-attention slice headers,

represents the ith feature map Q in the c-th multi-head cross attention encoder module ⁱ A profile generated through the mth cross attention layer;

step 3.7, drawing the characteristic diagram after the multi-head cross attention

The i is processed by linear transformation and normalization, and then the output of the multi-head cross attention encoder module is obtained as shown in formula (7)

i＝1,2,3,4:

In the formula (7), δ (-) represents a GeLU function, and σ (-) represents a linear transformation function;

when c is not equal to L, the output of the c multi-head cross attention encoder module is input into the next 1 multi-head cross attention encoder module;

when c is equal to L, the output of the Lth multi-head cross attention encoder module is used

Characteristic diagram using equation (8)

Performing up-sampling processing to obtain a characteristic diagram U _n,p ：

In equation (8), μ (-) represents the upsampling function, φ (-) represents the normalization function, δ (-) represents the ReLU function;

step 3.8, carrying out upsampling processing on the feature map U _n,p Feature map output from convolution stage

Fusing to obtain output U _n,p The specific formula is shown as formula (9):

in the formula (9), the reaction mixture is,

is the feature map output by the fourth convolution stage in the multi-scale feature extraction module.

Further, the step 4 specifically includes:

step 4.1, embedding the multi-scale features into a feature graph U output by the multi-head cross attention encoder _n,p Inputting the data to a convolution stage for feature extraction and outputting a feature map D _n,p

Step 4.2, outputting the feature graph D of the convolution stage _n,p Respectively carrying out global average pooling operation and global maximum pooling operation, and splicing and fusing the obtained feature maps to obtain a result D' _n,p The specific formula is shown as formula (10):

in equation (10), concat (. cndot.) represents the feature vector splicing operation,

is D _n,p The feature map output after global average pooling,

is D _n,p Outputting a feature map after global maximum pooling;

will feature map D' _n,p Inputting the result vector into a full connection layer to obtain the result vector of N-dimensional classification.

Further, the step 5 specifically includes: establishing a cross entropy loss function, inputting a training sample set into the deep learning network for training, and then adopting a back propagation algorithm to carry out optimization solution on the cross entropy loss function, thereby adjusting all parameters in the deep learning network and obtaining the endoscope image classifier for classifying the endoscope images.

Compared with the prior art, the invention has the following advantages:

the invention constructs an endoscope image classification model by using an endoscope image classification method based on multi-scale feature embedding and cross attention. The general convolutional neural network classification depends on semantic information of the deep feature map and ignores geometric detail information of the shallow feature map. According to the method, the deep characteristic diagram rich in semantic information and the shallow characteristic diagram rich in geometric detail information are fused through multi-scale embedding, ambiguity of semantic information and geometric information between characteristic diagrams of different scales is eliminated from the angle of channel intersection, and more effective characteristics are extracted, so that the accuracy of endoscope image classification is improved, a doctor is assisted in diagnosis and film reading, and the pressure of the endoscope doctor in film reading is reduced.

Drawings

FIG. 1 is a flow chart of the method of the present invention;

FIG. 2 is a block diagram of a deep learning network according to the present invention;

FIG. 3 is a schematic diagram of a multi-headed cross attention encoder module according to the present invention.

Detailed Description

For the convenience of understanding, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In this embodiment, an endoscope image classification method based on multi-scale feature embedding and cross attention includes, as shown in fig. 1, the following specific steps:

step 2, as shown in fig. 2, establishing a deep learning network, processing the sample data set of the endoscope image through a convolutional neural network of the deep learning network to output feature maps of different convolutional stages, and forming a dimension-reduced output feature map after dimension reduction processing is performed on the feature maps of the different convolutional stages

i＝1,2,3,4。

step 2.2, constructing a multi-scale feature extraction module:

the p-th image

Feature map output by the second convolution stage

Feature map output by the third convolution stage

Feature map output by the fourth convolution stage

Step 2.3, constructing a multi-scale feature embedding module:

i-1, 2,3,4, each embedded layer comprising a convolutional layer and a dimensionality reduction process;

outputting feature maps of four convolution stages

Input into a multi-scale feature embedding module,

i＝1,2,3,4。

Step 3, outputting the characteristic diagram of the dimension reduction

Inputting the data into a multi-head cross attention encoder embedded with multi-scale features constructed in advance, performing normalization and upsampling processing, and outputting a feature map U _n,p 。

combining 4 characteristic maps

i＝1,2,3,4：

Transpose (-) in the formula (1) represents the transpose processing of the feature map,

represents C ⁱ Each size isH ⁱ ·W ⁱ Is generated from the pixel characteristic map of (a),

step 3.2, crossing the characteristic diagram of the channel

Performing multi-scale embedding, specifically obtaining a multi-scale characteristic embedding characteristic diagram by using a formula (2)

Concat (-) in equation (2) represents the feature vector splicing operation,

to represent

A channel cross feature map after multi-scale feature embedding and conversion;

step 3.3, feature map

as shown in fig. 3, any of the c-th multi-headed cross-attention encoder modules includes: 2 linear transformation layers, M parallel cross attention layers; c 1,2,. said, L;

step 3.4, feature map

i is 1,2,3,4 inputThe c multi-head cross attention encoder module, the feature map

i is 1,2,3,4, and is respectively associated with four weight matrixes

Multiplying and outputting the characteristic diagram K _n,p 、V _n,p 、

i is 1,2,3,4, and the specific formula is shown in formula (3):

in the formula (3), φ (-) represents a normalization function;

i is input into the 1 st multi-head cross attention encoder, K _n,p ,V _n,p ,

Are each independently of K _n,p Multiplying, activating by Softmax function, and mixing with V _n,p Multiplying to obtain an output, wherein a specific formula is shown as a formula (4):

in equation (4), ψ (·) is a normalization function, δ (·) is a Softmax function;

step 3.6, attention feature map

in the formula (5), the reaction mixture is,

is a learnable transformation matrix by

will result in M cross attention layer outputs

i 1,2,3,4, M1, 2, M, and obtaining a characteristic map as shown in formula (6)

c＝1,2,…,L,i＝1,2,3,4：

M in equation (6) is the number of cross-attention slice headers,

represents the ith feature map Q in the c-th multi-head cross attention encoder module ⁱ Passing through the m-th crossA feature map generated by the attention layer;

The output of the multi-head cross attention encoder module is obtained by performing linear transformation and normalization on i 1,2,3 and 4, and then using the formula (7)

i＝1,2,3,4:

Characteristic diagram using equation (8)

Performing up-sampling processing to obtain a characteristic diagram U _n,p ：

Are fused to obtainTo the output U _n,p The specific formula is shown as formula (9):

in the formula (9), the reaction mixture is,

Step 4, the characteristic diagram U is processed _n,p Inputting the data to a convolution stage for feature extraction and outputting a feature map D _n,p Feature map D output from the convolution stage _n,p Respectively carrying out global average pooling operation and global maximum pooling operation, and splicing and fusing the obtained feature maps to obtain a result feature map D' _n,p D 'is a feature map' _n,p Inputting the result vector into a full connection layer to obtain the result vector of N-dimensional classification.

Step 4.1, embedding the multi-scale features into a feature graph U output by the multi-head cross attention encoder _n,p Inputting the data to a convolution stage for feature extraction and outputting a feature map D _n,p ；

is D _n,p The feature map output after global average pooling,

is D _n,p Outputting a feature map after global maximum pooling;

The step 5 specifically comprises: establishing a cross entropy loss function, inputting a training sample set into the deep learning network for training, and then adopting a back propagation algorithm to carry out optimization solution on the cross entropy loss function, thereby adjusting all parameters in the deep learning network and obtaining the endoscope image classifier for classifying the endoscope images.

Establishing a cross entropy loss function shown as a formula (11), inputting a training sample set into the deep learning network for training, and then optimally solving the cross entropy loss function by adopting a back propagation algorithm, so as to adjust all parameters in the deep learning network, thereby obtaining an endoscope image classifier for classifying endoscope images, wherein the cross entropy loss function is as follows:

in the formula (11), wherein C represents the number of classes, p _i Representing the true class of sample i, q _i Represents the prediction class of sample i and CE (p, q) represents the classification penalty on the sample.

It will be evident to those skilled in the art that the embodiments of the present invention are not limited to the details of the foregoing illustrative embodiments, and that the embodiments of the present invention are capable of being embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the embodiments being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. Several units, modules or means recited in the system, apparatus or terminal claims may also be implemented by one and the same unit, module or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.

Finally, it should be noted that the above embodiments are only used for illustrating the technical solutions of the embodiments of the present invention and not for limiting, and although the embodiments of the present invention are described in detail with reference to the above preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions can be made on the technical solutions of the embodiments of the present invention without departing from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method for classifying endoscopic images based on multi-scale feature embedding and cross-attention, the method comprising:

step 1, acquiring endoscope image samples of N types of C × H × W, preprocessing the samples to obtain a training set E, wherein E is { E ═ E { (E) ₁ ，E ₂ ，...，E _n ，...，E _N }；

step 2, establishing a deep learning network, processing the sample data set of the endoscope image through a convolution neural network of the deep learning network to output feature maps of different convolution stages, and enabling the feature maps of the different convolution stages to pass throughForming a dimension reduction output characteristic diagram after dimension reduction treatment

Step 3, outputting the characteristic diagram of the dimension reduction

Inputting the data into a multi-head cross attention encoder embedded with multi-scale features constructed in advance, performing normalization and upsampling processing, and outputting a feature map U _n，p ；

Step 4, the characteristic diagram U is processed _n，p Inputting the data to a convolution stage for feature extraction and outputting a feature map D _n，p Feature map D output from the convolution stage _n，p Respectively carrying out global average pooling operation and global maximum pooling operation, and splicing and fusing the obtained feature maps to obtain a result feature map D' _n，p D 'is a feature map' _n，p Inputting the result vector into a full connection layer to obtain an N-dimensional classification result vector;

2. The endoscopic image classification method according to claim 1, characterized in that said step 2 specifically comprises:

step 2.2, constructing a multi-scale feature extraction module:

the p-th image

Input deviceIn the multi-scale feature extraction module, the feature map output by the first convolution stage is obtained through the first convolution stage, the second convolution stage, the third convolution stage and the fourth convolution stage

Feature map output by the second convolution stage

Feature map output by the third convolution stage

Feature map output by the fourth convolution stage

Step 2.3, constructing a multi-scale feature embedding module:

Each embedded layer comprises a convolution layer and a dimension reduction process;

outputting feature maps of four convolution stages

Input into a multi-scale feature embedding module,

respectively pass through convolution kernels of 2 ^5-i ×2 ^5-i The convolution layers are respectively output characteristic graphs after dimension reduction processing

3. The endoscopic image classification method according to claim 2, characterized in that said step 3 specifically comprises:

combining 4 characteristic maps

is represented by H ⁱ ·W ⁱ Each size is C ⁱ A pixel feature map of individual channel intersections;

step 3.2, crossing the characteristic diagram of the channel

To carry outMulti-scale embedding, specifically obtaining a multi-scale feature embedding feature map by using the formula (2)

Concat (-) in equation (2) represents the feature vector splicing operation,

to represent

A channel cross feature map after multi-scale feature embedding and conversion;

step 3.3, feature map

step 3.4, feature map

Inputting the feature map into the c-th multi-head cross attention encoder module

Respectively with two weight matrices W _m ^K ，W _m ^V Multiplying, and comparing the feature maps

Respectively with four weight matrices

Multiplying and outputting the characteristic diagram K _n，p 、V _n，p 、

The specific formula is shown as formula (3):

in the formula (3), φ (-) represents a normalization function;

step 3.5, embedding the multi-scale features into a module output feature graph K _n，p ，V _n，p ，

Input into the 1 st multi-head cross attention encoder, K _n，p ，V _n，p ，

Respectively processed by linear transformation and input into M-head cross attention layers

Are each independently of K _n，p Multiplying, finally activating by Softmax function, and mixing with V _n，p Multiplying to obtain an output, wherein a specific formula is shown as a formula (4):

step 3.6, attention feature map

in the formula (5), the reaction mixture is,

is a learnable transformation matrix by

will result in M cross attention layer outputs

The characteristic diagram obtained by the formula (6) is reused

M in equation (6) is the number of cross-attention slice headers,

Respectively processed by linear transformation and normalization, and then output of the multi-head cross attention encoder module is obtained as shown in formula (7)

Characteristic diagram using equation (8)

Performing up-sampling processing to obtain a characteristic diagram U _n，p ：

step 3.8, carrying out upsampling processing on the feature map U _n，p Feature map output from convolution stage

Fusing to obtain output U _n，p The specific formula is shown in formula (9):

in the formula (9), the reaction mixture is,

4. The endoscopic image classification method according to claim 3, characterized in that said step 4 specifically comprises:

step 4.1, embedding the multi-scale features into a feature graph U output by the multi-head cross attention encoder _n，p Inputting the data to a convolution stage for feature extraction and outputting a feature map D _n，p

Step 4.2, outputting the characteristic diagram D of the convolution stage _n，p Respectively carrying out global average pooling operation and global maximum pooling operation, and splicing and fusing the obtained feature maps to obtain a result D' _n，p The specific formula is shown as formula (10):

is D _n，p The feature map output after global average pooling,

is D _n，p Outputting a feature map after global maximum pooling;

will feature map D' _n，p Inputting the result vector into a full connection layer to obtain the result vector of N-dimensional classification.

5. The endoscopic image classification method according to claim 4, wherein said step 5 specifically includes: establishing a cross entropy loss function, inputting a training sample set into the deep learning network for training, and then adopting a back propagation algorithm to carry out optimization solution on the cross entropy loss function, thereby adjusting all parameters in the deep learning network and obtaining the endoscope image classifier for classifying the endoscope images.