CN114863179B

CN114863179B - Endoscope image classification method based on multi-scale feature embedding and cross attention

Info

Publication number: CN114863179B
Application number: CN202210542820.8A
Authority: CN
Inventors: 史骏; 张元�; 汪逸飞; 杨皓程; 周泰然; 李想; 郑利平
Original assignee: Hefei University of Technology
Current assignee: Hefei University of Technology
Priority date: 2022-05-18
Filing date: 2022-05-18
Publication date: 2022-12-13
Anticipated expiration: 2042-05-18
Also published as: CN114863179A

Abstract

The invention provides an endoscope image classification method based on multi-scale feature embedding and cross attention, which comprises the following steps: acquiring marked N types of endoscope images; establishing a deep learning network based on multi-scale feature embedding and multi-head cross attention; constructing an endoscope image classifier; and predicting the endoscope image category by utilizing the established classifier. According to the endoscope image classification method, rich semantic information in a deep characteristic diagram and geometric detail information in a shallow characteristic diagram are fused through multi-scale characteristic embedding, semantic information and geometric information ambiguity among different scale characteristic diagrams are eliminated by combining a cross attention mechanism to mine more effective characteristics, and endoscope images are accurately classified, so that a doctor is assisted in diagnosis and interpretation, and the interpretation efficiency is improved.

Description

Endoscope image classification method based on multi-scale feature embedding and cross attention

Technical Field

The invention relates to the field of computer vision, in particular to an image classification technology, and specifically relates to an endoscope image classification method based on multi-scale feature embedding and cross attention.

Background

Endoscopy is the most common way of cancer diagnosis, and endoscopic image classification has important clinical significance in early cancer screening. Traditional cancer diagnosis is mainly carried out through manual diagnosis under an endoscope by an endoscopist, but in clinical diagnosis, the endoscopist has subjective difference in cancer judgment, the work load of an endoscope judgment diagram is large, and missed diagnosis and misdiagnosis sometimes occur. Therefore, there is a need for an accurate and efficient endoscopic diagnosis method, which can reduce the reading pressure of an endoscopic doctor and improve the accuracy of endoscopic image classification by using a computer to assist the doctor in reading the endoscopic images.

In recent years, a deep learning framework has attracted wide attention in the field of computer vision, researchers have begun to apply the deep learning framework to various classification tasks, but most of endoscope image classification methods based on deep learning adopt a convolutional neural network model to extract features of an endoscope image in a single scale, and ignore information of other scales, so that the accuracy of endoscope image classification is difficult to improve.

Disclosure of Invention

The invention provides an endoscope image classification method based on multi-scale feature embedding and cross attention in order to make up for the existing technical defects, and aims to integrate abundant semantic information in a deep feature map and geometric detail information in a shallow feature map through multi-scale feature embedding, eliminate semantic information and geometric information ambiguity between feature maps with different scales by combining a cross attention mechanism, mine more effective features and finish accurate classification of endoscope images.

In order to achieve the purpose, the invention adopts the following technical scheme:

according to the embodiment of the invention, the invention provides an endoscope image classification method based on multi-scale feature embedding and cross attention, which comprises the following steps:

step 1, obtaining endoscope image samples of N types of C × H × W, preprocessing the samples to obtain a training set E, E = { E = } ₁ ,E ₂ ,...,E _n ,...,E _N }；

E _n Showing an nth class endoscopic image sample, the nth class having P images in total,

representing the p image in the endoscope image sample after the n-th type of preprocessing; c denotes an image channel, H denotes an image height, W denotes an image width, N =1,2, · N;

step 2, establishing a deep learning network, processing the sample data set of the endoscope image through a convolution neural network of the deep learning network to output feature maps of different convolution stages, and forming a dimension reduction output feature map after dimension reduction processing on the feature maps of the different convolution stages

i＝1,2,3,4；

Step 3, outputting the characteristic diagram by dimension reduction

Inputting the data into a multi-head cross attention encoder embedded with multi-scale features constructed in advance, performing normalization and upsampling processing, and outputting a feature map U _n,p ；

Step 4, the characteristic diagram U is processed _n,p Inputting the data to a convolution stage for feature extraction and outputting a feature map D _n,p Feature map D output from the convolution stage _n,p Respectively carrying out global average pooling operation and global maximum pooling operation, splicing and fusing the obtained feature maps to obtain a result feature map D' _n,p D 'is a feature map' _n,p Inputting the result vector into a full connection layer to obtain an N-dimensional classification result vector;

and 5, constructing an endoscope classifier based on the result vector of the N-dimensional classification to classify the endoscope image.

Further, the step 2 specifically includes:

step 2.1, establishing a deep learning network, wherein the deep learning network comprises the following steps: the system comprises a multi-scale feature extraction module, a multi-scale feature embedding module and a multi-head cross attention encoder module;

step 2.2, constructing a multi-scale feature extraction module:

the multi-scale feature extraction module is composed of four convolutional neural network stages, and sequentially comprises the following steps: a first convolution stage, a second convolution stage, a third convolution stage and a fourth convolution stage;

the p-th image

Inputting the feature map into the multi-scale feature extraction module, and respectively obtaining the feature map output by the first convolution stage through the first convolution stage, the second convolution stage, the third convolution stage and the fourth convolution stage

Feature map output by the second convolution stage

Feature map output by the third convolution stage

Feature map output by the fourth convolution stage

Step 2.3, constructing a multi-scale feature embedding module:

the multi-scale feature embedded module is formed by connecting 4 different embedded layers in parallel, wherein 4 embedded layers correspond to 4 embedded layers

i =1,2,3,4, each embedded layer containing oneConvolution layer and a dimension reduction treatment;

outputting feature maps of four convolution stages

Input into a multi-scale feature embedding module,

i =1,2,3,4, each 2 by convolution kernel ^5-i ×2 ^5-i The convolution layers are respectively output characteristic graphs after dimension reduction processing

i＝1,2,3,4。

Further, the step 3 specifically includes:

step 3.1, constructing a multi-head cross attention encoder with multi-scale feature embedding:

the multi-head cross attention encoder module embedded with the multi-scale features is formed by serially connecting features embedded with 4 convolution stages and L multi-head cross attention encoders;

combining 4 characteristic maps

Inputting the multi-scale features into the multi-scale feature embedding module, respectively carrying out normalization processing through an LN layer, and carrying out normalization processing on feature maps from the angle of channel intersection

Converting, specifically obtaining a channel cross characteristic diagram by using the formula (1)

i＝1,2,3,4：

Transpose (-) in the formula (1) represents the transpose processing of the feature map,

is represented by C ⁱ Size is H ⁱ ·W ⁱ Is generated from the pixel characteristic map of (a),

is represented by H ⁱ ·W ⁱ Size is C ⁱ A pixel feature map of individual channel intersections;

step 3.2, crossing the characteristic diagram of the channel

Performing multi-scale embedding, specifically obtaining a multi-scale characteristic embedding characteristic diagram by using a formula (2)

Concat (-) in equation (2) represents the feature vector splicing operation,

represent

A channel cross feature map after multi-scale feature embedding and conversion;

step 3.3, feature map

As the input of the 1 st multi-head cross attention encoder module, the output of the c multi-head cross attention encoder module is as the input of the c +1 st multi-head cross attention encoder module;

any of the c-th multi-headed cross-attention encoder modules comprises: 2 linear transformation layers, M parallel cross attention layers; c =1,2, · L;

step 3.4, feature map

i =1,2,3,4 is input to the c-th multi-head cross attention encoder module, which maps the signatures

Respectively with two weight matrices W _m ^K ,W _m ^V Multiplying, and comparing the feature maps

i =1,2,3,4, respectively associated with four weight matrices

Multiplying and outputting the characteristic diagram K _n,p 、V _n,p 、

i =1,2,3,4, the specific formula is shown in formula (3):

in the formula (3), φ (-) represents a normalization function;

step 3.5, embedding the multi-scale features into a module output feature graph K _n,p ,V _n,p ,

i =1,2,3,4 input into the 1 st multi-headed cross attention encoder, K _n,p ,V _n,p ,

i =1,2,3,4 are respectively input to the M-head cross attention layers by linear transformation processing

Are each independently of K _n,p Multiplying, finally activating by Softmax function, and mixing with V _n,p Multiplying to obtain an output, the specific formula is as shown in formula (4)The following steps:

in equation (4), ψ (·) is a normalization function, δ (·) is a Softmax function;

step 3.6, attention feature map

Based on the above, dynamically fusing the attention feature maps of different heads to form a new attention feature map, wherein the specific formula is shown in formula (5):

in the formula (5), the reaction mixture is,

is a learnable transformation matrix by

Fusing the multi-head attention feature maps and generating a new attention feature map;

will result in M cross attention layer outputs

i =1,2,3,4, M =1,2,.. M, and a characteristic diagram is obtained as shown in formula (6)

c＝1,2,…,L,i＝1,2,3,4：

M in equation (6) is the number of cross-attention slice headers,

represents the ith feature map Q in the c-th multi-head cross attention encoder module ⁱ A feature map generated through the mth cross attention layer;

step 3.7, drawing the characteristic diagram after the multi-head cross attention

i =1,2,3,4 is respectively processed by linear transformation and normalization, and then the output of the multi-head cross attention encoder module is obtained as shown in formula (7)

i＝1,2,3,4:

In the formula (7), δ (-) represents a GeLU function, and σ (-) represents a linear transformation function;

when c is not equal to L, the output of the c multi-head cross attention encoder module is input into the next 1 multi-head cross attention encoder module;

when c = L, output of the Lth multi-head cross attention encoder module is used

Characteristic diagram using equation (8)

Performing up-sampling processing to obtain a characteristic diagram U _n,p ：

In equation (8), μ (-) represents the upsampling function, φ (-) represents the normalization function, δ (-) represents the ReLU function;

step 3.8, the characteristic diagram U subjected to the upsampling treatment _n,p Feature map output from convolution stage

Fusing to obtain output U _n,p The specific formula is shown as formula (9):

in the formula (9), the reaction mixture is,

is the feature map output by the fourth convolution stage in the multi-scale feature extraction module.

Further, the step 4 specifically includes:

step 4.1, embedding the multi-scale features into a feature graph U output by the multi-head cross attention encoder _n,p Inputting the data to a convolution stage for feature extraction and outputting a feature map D _n,p

Step 4.2, outputting the feature graph D of the convolution stage _n,p Respectively carrying out global average pooling operation and global maximum pooling operation, splicing and fusing the obtained feature maps to obtain a result D' _n,p The specific formula is shown in formula (10):

in equation (10), concat (. Cndot.) represents the feature vector splicing operation,

is D _n,p The feature map output after global average pooling,

is D _n,p Outputting a feature map after global maximum pooling;

will feature map D' _n,p Inputting the data into a full connection layer to obtain a result vector of N-dimensional classification.

Further, the step 5 specifically includes: establishing a cross entropy loss function, inputting a training sample set into the deep learning network for training, and then adopting a back propagation algorithm to carry out optimization solution on the cross entropy loss function, thereby adjusting all parameters in the deep learning network and obtaining an endoscopic image classifier for classifying endoscopic images.

Compared with the prior art, the invention has the following advantages:

the invention constructs an endoscope image classification model by using an endoscope image classification method based on multi-scale feature embedding and cross attention. The general convolutional neural network classification depends on semantic information of the deep feature map and ignores geometric detail information of the shallow feature map. According to the method, the deep characteristic diagram rich in semantic information and the shallow characteristic diagram rich in geometric detail information are fused through multi-scale embedding, ambiguity of semantic information and geometric information between characteristic diagrams of different scales is eliminated from the angle of channel intersection, and more effective characteristics are extracted, so that the accuracy of endoscope image classification is improved, a doctor is assisted in diagnosis and film reading, and the pressure of the endoscope doctor in film reading is reduced.

Drawings

FIG. 1 is a flow chart of the method of the present invention;

FIG. 2 is a block diagram of a deep learning network according to the present invention;

FIG. 3 is a schematic diagram of a multi-headed cross attention encoder module according to the present invention.

Detailed Description

For the convenience of understanding, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.

In this embodiment, an endoscopic image classification method based on multi-scale feature embedding and cross attention includes the following specific steps, as shown in fig. 1:

step 1, acquiring N types of endoscope image samples of C multiplied by H multiplied by W, preprocessing the sample to obtain a training set E, E = { E = ₁ ,E ₂ ,...,E _n ,...,E _N }；

representing a p-th image in the n-th class of preprocessed endoscopic image samples; c denotes an image channel, H denotes an image height, W denotes an image width, N =1,2, · N;

step 2, as shown in fig. 2, establishing a deep learning network, processing the sample data set of the endoscope image through a convolutional neural network of the deep learning network to output feature maps of different convolutional stages, and forming a dimension-reduced output feature map after dimension reduction processing is performed on the feature maps of the different convolutional stages

i＝1,2,3,4。

step 2.2, constructing a multi-scale feature extraction module:

the p-th image

Feature map output by the second convolution stage

Feature map output by the third convolution stage

Feature map output by the fourth convolution stage

Step 2.3, constructing a multi-scale feature embedding module:

i =1,2,3,4, each embedded layer containing one convolutional layer and one dimension reduction process;

outputting feature maps of four convolution stages

Input into a multi-scale feature embedding module,

i＝1,2,3,4。

Step 3, outputting the characteristic diagram of the dimension reduction

Inputting the data into a multi-head cross attention encoder embedded with multi-scale features constructed in advance, and outputting a feature map after normalization and upsampling processingU _n,p 。

combining 4 characteristic maps

i＝1,2,3,4：

Transpose (-) in the formula (1) represents the transposition process of the feature map,

represents C ⁱ Size is H ⁱ ·W ⁱ The characteristic map of the pixels of (a),

step 3.2, crossing the characteristic diagram of the channel

Performing multi-scale embedding, specifically obtaining a multi-scale feature embedding feature map by using the formula (2)

Concat (-) in equation (2) represents the feature vector splicing operation,

to represent

A channel cross feature map after multi-scale feature embedding and conversion;

step 3.3, feature map

As the input of the 1 st multi-headed cross attention encoder module, the output of the c-th multi-headed cross attention encoder module is as the input of the c +1 th multi-headed cross attention encoder module;

as shown in fig. 3, any of the c-th multi-headed cross-attention encoder modules includes: 2 linear transformation layers, M parallel cross attention layers; c =1,2, · L;

step 3.4, feature map

i =1,2,3,4, respectively associated with four weight matrices

Multiplying and outputtingCharacteristic diagram K _n,p 、V _n,p 、

i =1,2,3,4, the specific formula is shown in formula (3):

in the formula (3), φ (·) represents a normalization function;

Are respectively connected with K _n,p Multiplying, finally activating by Softmax function, and mixing with V _n,p Multiplying to obtain an output, wherein a specific formula is shown as a formula (4):

step 3.6, attention feature map

in the formula (5), the reaction mixture is,

is a learnable transformation matrix by

will result in M cross attention layer outputs

c＝1,2,…,L,i＝1,2,3,4：

M in equation (6) is the number of cross-attention slice headers,

step 3.7, feature diagram after multi-head cross attention

i =1,2,3,4 are respectively subjected to linear transformation processing and normalization processing, and then the output of the multi-head cross attention encoder module is obtained as shown in formula (7)

i＝1,2,3,4:

In equation (7), δ (·) represents a GeLU function, and σ (·) represents a linear transformation function;

when c = L, output of the Lth multi-head cross attention encoder module is output

Characteristic diagram using equation (8)

Performing up-sampling processing to obtain a characteristic diagram U _n,p ：

step 3.8, carrying out upsampling processing on the feature map U _n,p Feature map output from convolution stage

Fusing to obtain output U _n,p The specific formula is shown in formula (9):

in the formula (9), the reaction mixture is,

is a feature map output at the fourth convolution stage in the multi-scale feature extraction module.

Step 4, the characteristic diagram U is processed _n,p Input to a convolutionStage-by-stage feature extraction and output feature map D _n,p Feature map D output from the convolution stage _n,p Respectively carrying out global average pooling operation and global maximum pooling operation, and splicing and fusing the obtained feature maps to obtain a result feature map D' _n,p D 'is a feature map' _n,p Inputting the result vector into a full connection layer to obtain the result vector of N-dimensional classification.

Step 4.1, embedding the multi-scale features into a feature graph U output by the multi-head cross attention encoder _n,p Inputting the data to a convolution stage for feature extraction and outputting a feature map D _n,p ；

Step 4.2, outputting the feature graph D of the convolution stage _n,p Respectively carrying out global average pooling operation and global maximum pooling operation, and splicing and fusing the obtained feature maps to obtain a result D' _n,p The specific formula is shown as formula (10):

is D _n,p The feature map output after global average pooling,

is D _n,p Outputting a feature map after global maximum pooling;

will feature map D' _n,p Inputting the result vector into a full connection layer to obtain the result vector of N-dimensional classification.

The step 5 specifically includes: establishing a cross entropy loss function, inputting a training sample set into the deep learning network for training, and then adopting a back propagation algorithm to carry out optimization solution on the cross entropy loss function, thereby adjusting all parameters in the deep learning network and obtaining the endoscope image classifier for classifying the endoscope images.

Establishing a cross entropy loss function shown as a formula (11), inputting a training sample set into the deep learning network for training, and then optimally solving the cross entropy loss function by adopting a back propagation algorithm, so as to adjust all parameters in the deep learning network, thereby obtaining an endoscope image classifier for classifying endoscope images, wherein the cross entropy loss function is as follows:

in the formula (11), wherein C represents the number of classes, p _i Representing the true class of sample i, q _i Represents the prediction class of sample i, and CE (p, q) represents the classification loss over the sample.

It will be evident to those skilled in the art that the embodiments of the present invention are not limited to the details of the foregoing illustrative embodiments, and that the embodiments of the present invention are capable of being embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the embodiments being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. Several units, modules or means recited in the system, device or terminal claims may also be implemented by one and the same unit, module or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.

Finally, it should be noted that the above embodiments are only used for illustrating the technical solutions of the embodiments of the present invention and not for limiting, and although the embodiments of the present invention are described in detail with reference to the above preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions can be made on the technical solutions of the embodiments of the present invention without departing from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method for classifying endoscopic images based on multi-scale feature embedding and cross-attention, the method comprising:

step 1, obtaining endoscope image samples of N types of C × H × W, preprocessing the samples to obtain a training set E, E = { E = } ₁ ，E ₂ ，...，E _n ，...，E _N }；

representing a p-th image in the n-th class of preprocessed endoscopic image samples; c denotes an image channel, H denotes an image height, W denotes an image width, N =1,2,.., N;

i =1,2,3,4; the method specifically comprises the following steps:

step 2.2, constructing a multi-scale feature extraction module:

the multi-scale feature extraction module is composed of four convolutional neural network stages and sequentially comprises the following steps: a first convolution stage, a second convolution stage, a third convolution stage and a fourth convolution stage;

the p-th image

Feature map output by the second convolution stage

Feature map output by the third convolution stage

Feature map output by the fourth convolution stage

Step 2.3, constructing a multi-scale feature embedding module:

outputting feature maps of four convolution stages

Input into a multi-scale feature embedding module,

i =1,2,3,4, minRespectively subjected to convolution kernel of 2 ^5-i ×2 ^5-i The convolution layers are respectively output characteristic graphs after dimension reduction processing

i＝1，2，3，4；

Step 3, outputting the characteristic diagram of the dimension reduction

Inputting the data into a multi-head cross attention encoder embedded with multi-scale features constructed in advance, performing normalization and upsampling processing, and outputting a feature map U _n，p (ii) a The method specifically comprises the following steps:

4 feature maps are combined

i＝1，2，3，4：

represents C ⁱ Size is H ⁱ ·W ⁱ Is generated from the pixel characteristic map of (a),

step 3.2, crossing the characteristic diagram of the channel

Concat (. Cndot.) in equation (2) represents the feature vector splicing operation,

to represent

A channel cross feature map after multi-scale feature embedding and conversion;

step 3.3, feature map

any of the c-th multi-headed cross-attention encoder modules comprises: 2 linear transformation layers, M parallel cross attention layers; c =1,2, ·, L;

step 3.4, feature map

f =1,2,3,4 is input to the c-th multi-head cross attention encoder module

Respectively with two weight matrices W _m ^K ，W _m ^V Multiplying, and comparing the feature maps

i =1,2,3,4, respectively associated with four weight matrices

Multiplying and outputting the characteristic diagram K _n，p 、V _n，p 、

f =1,2,3,4, the specific formula is shown in formula (3):

in the formula (3), φ (-) represents a normalization function;

step 3.5, embedding the multi-scale features into a module output feature map

Input into the 1 st multi-headed cross attention encoder,

respectively processed by linear transformation and input into M-head cross attention layers

Are respectively connected with K _n，p Multiplying, finally activating by Softmax function, and mixing with V _n，p Multiplying to obtain an output, wherein a specific formula is shown as a formula (4):

in equation (4), ψ (-) is a normalization function, δ (-) is a Softmax function;

step 3.6, attention feature map

On the basis, the attention-inducing feature maps of different heads are dynamically fused to form a new attention feature map, and the specific formula is shown as the formula (5):

in the formula (5), the reaction mixture is,

is a learnable transformation matrix formed by

will result in M cross attention layer outputs

i =1,2,3,4, M =1,2,. So, M, and a characteristic diagram is obtained as shown in formula (6)

c＝1，2，...，L，i＝1，2，3，4：

M in equation (6) is the number of cross-attention slice headers,

represents the ith feature map Q in the c-th multi-head cross attention encoder module ⁱ A profile generated through the mth cross attention layer;

i＝1，2，3，4：

when c = L, output of the Lth multi-head cross attention encoder module is used

Characteristic diagram using equation (8)

Performing up-sampling processing to obtain a characteristic diagram U _n，p ：

step 3.8, the characteristic diagram U subjected to the upsampling treatment _n，p Feature map output from convolution stage

Fusing to obtain output U _n，p The specific formula is shown as formula (9):

in the formula (9), the reaction mixture is,

is a feature map output by a fourth convolution stage in the multi-scale feature extraction module;

step 4, the characteristic diagram U is processed _n，p Inputting the data to a convolution stage for feature extraction and outputting a feature map D _n，p Feature map D output from the convolution stage _n，p Respectively carrying out global average pooling operation and global maximum pooling operation, and splicing and fusing the obtained feature maps to obtain a result feature map D' _n，p D 'is a feature map' _n，p Inputting the result vector into a full connection layer to obtain an N-dimensional classification result vector;

2. The endoscopic image classification method according to claim 1, characterized in that said step 4 specifically comprises:

step 4.1, embedding the multi-scale features into a feature graph U output by the multi-head cross attention encoder _n，p Inputting the data to a convolution stage for feature extraction and outputting a feature map D _n，p ；

Step 4.2, outputting the feature graph D of the convolution stage _n，p Respectively carrying out global average pooling operation and global maximum pooling operation, and splicing and fusing the obtained feature maps to obtain a result D' _n，p The specific formula is shown in formula (10):

is D _n，p The feature map output after global average pooling,

is D _n，p Outputting a feature map after global maximum pooling;

feature map D' _n，p Inputting the result vector into a full connection layer to obtain the result vector of N-dimensional classification.

3. The endoscopic image classification method according to claim 2, characterized in that said step 5 specifically comprises: establishing a cross entropy loss function, inputting a training sample set into the deep learning network for training, and then adopting a back propagation algorithm to carry out optimization solution on the cross entropy loss function, thereby adjusting all parameters in the deep learning network and obtaining the endoscope image classifier for classifying the endoscope images.