CN114863179B - Endoscope image classification method based on multi-scale feature embedding and cross attention

Endoscope image classification method based on multi-scale feature embedding and cross attention

Info

Publication number
CN114863179B
CN114863179B (application CN202210542820.8A)
Authority
CN
China
Prior art keywords
feature
feature map
scale
output
formula
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210542820.8A
Other languages
Chinese (zh)
Other versions
CN114863179A (en)
Inventor
史骏
张元�
汪逸飞
杨皓程
周泰然
李想
郑利平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei University of Technology filed Critical Hefei University of Technology
Priority to CN202210542820.8A priority Critical patent/CN114863179B/en
Publication of CN114863179A publication Critical patent/CN114863179A/en
Application granted granted Critical
Publication of CN114863179B publication Critical patent/CN114863179B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06V 10/764: Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06N 3/045: Computing arrangements based on biological models; neural network architectures; combinations of networks
    • G06N 3/08: Computing arrangements based on biological models; neural networks; learning methods
    • G06V 10/7715: Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; mappings, e.g. subspace methods
    • G06V 10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G16H 50/20: ICT specially adapted for medical diagnosis, medical simulation or medical data mining, e.g. for computer-aided diagnosis based on medical expert systems
    • G06V 2201/03: Indexing scheme relating to image or video recognition or understanding; recognition of patterns in medical or anatomical images

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Public Health (AREA)
  • Pathology (AREA)
  • Epidemiology (AREA)
  • Primary Health Care (AREA)
  • Image Processing (AREA)

Abstract

The invention provides an endoscope image classification method based on multi-scale feature embedding and cross attention, which comprises the following steps: acquiring labeled endoscope images of N classes; establishing a deep learning network based on multi-scale feature embedding and multi-head cross attention; constructing an endoscope image classifier; and predicting the endoscope image class with the constructed classifier. The method fuses the rich semantic information of deep feature maps with the geometric detail information of shallow feature maps through multi-scale feature embedding, and combines a cross-attention mechanism to eliminate the ambiguity between semantic and geometric information across feature maps of different scales and to mine more effective features, so that endoscope images are classified accurately, doctors are assisted in diagnosis and image reading, and reading efficiency is improved.

Description

Endoscope image classification method based on multi-scale feature embedding and cross attention
Technical Field
The invention relates to the field of computer vision, in particular to image classification technology, and more specifically to an endoscope image classification method based on multi-scale feature embedding and cross attention.
Background
Endoscopy is the most common means of cancer diagnosis, and endoscope image classification has important clinical significance in early cancer screening. Traditional cancer diagnosis relies mainly on manual interpretation under the endoscope by an endoscopist; in clinical practice, however, endoscopists differ subjectively in their judgment of cancer, the image-reading workload is heavy, and missed diagnoses and misdiagnoses occasionally occur. Therefore, an accurate and efficient endoscopic diagnosis method is needed, in which a computer assists the doctor in reading endoscope images, thereby reducing the reading burden on endoscopists and improving the accuracy of endoscope image classification.
In recent years, deep learning frameworks have attracted wide attention in the field of computer vision, and researchers have begun to apply them to various classification tasks. However, most deep-learning-based endoscope image classification methods use a convolutional neural network model to extract features of the endoscope image at a single scale and ignore information at other scales, which makes it difficult to improve the accuracy of endoscope image classification.
Disclosure of Invention
To remedy these technical shortcomings, the invention provides an endoscope image classification method based on multi-scale feature embedding and cross attention. It aims to fuse the rich semantic information of deep feature maps with the geometric detail information of shallow feature maps through multi-scale feature embedding and, combined with a cross-attention mechanism, to eliminate the ambiguity between semantic and geometric information across feature maps of different scales, mine more effective features, and accomplish accurate classification of endoscope images.
In order to achieve the purpose, the invention adopts the following technical scheme:
according to the embodiment of the invention, the invention provides an endoscope image classification method based on multi-scale feature embedding and cross attention, which comprises the following steps:
Step 1, acquiring N classes of C × H × W endoscope image samples and preprocessing them to obtain a training set E = {E_1, E_2, ..., E_n, ..., E_N}, where E_n = {E_n^1, E_n^2, ..., E_n^p, ..., E_n^P} denotes the n-th class of endoscope image samples, the n-th class containing P images in total, and E_n^p denotes the p-th image in the preprocessed n-th class of endoscope image samples; C denotes the number of image channels, H the image height, W the image width, and n = 1, 2, ..., N;
Step 2, establishing a deep learning network, processing the endoscope image sample data set with the convolutional neural network of the deep learning network to output feature maps of different convolution stages, and applying dimension-reduction processing to these feature maps to form the dimension-reduced output feature maps T_i^{n,p}, i = 1, 2, 3, 4;
Step 3, inputting the dimension-reduced output feature maps T_i^{n,p} into a pre-constructed multi-head cross-attention encoder with multi-scale feature embedding, performing normalization and upsampling processing, and outputting the feature map U_{n,p};
Step 4, inputting the feature map U_{n,p} into a convolution stage for feature extraction to output the feature map D_{n,p}; performing global average pooling and global max pooling on D_{n,p}, respectively, concatenating and fusing the resulting feature maps to obtain the result feature map D'_{n,p}; and inputting the feature map D'_{n,p} into a fully connected layer to obtain an N-dimensional classification result vector;
Step 5, constructing an endoscope classifier based on the N-dimensional classification result vector to classify endoscope images.
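For readability, the following is a minimal PyTorch-style sketch of how the five steps above might be wired together; the module names (extractor, embedding, encoders, fusion, head) and their interfaces are illustrative assumptions rather than an implementation taken from the patent text, and each sub-module is sketched more concretely in the detailed description below.

```python
import torch
import torch.nn as nn

class EndoscopeClassifier(nn.Module):
    """Schematic wiring of the five steps (hypothetical module names): a
    multi-scale extractor (step 2), a multi-scale feature embedding, a stack
    of L multi-head cross-attention encoders (step 3), an upsampling/fusion
    block and a classification head (step 4)."""

    def __init__(self, extractor, embedding, encoders, fusion, head):
        super().__init__()
        self.extractor = extractor                # four convolution stages -> F1..F4
        self.embedding = embedding                # dimension reduction + channel-cross -> tokens
        self.encoders = nn.ModuleList(encoders)   # L cross-attention encoders
        self.fusion = fusion                      # upsample encoder output, fuse with F4 -> U
        self.head = head                          # conv stage + GAP/GMP + FC -> N logits

    def forward(self, x):                         # x: (B, C, H, W) batch of endoscope images
        feats = self.extractor(x)                 # [F1, F2, F3, F4]
        tokens = self.embedding(feats)            # multi-scale embedded representation
        for encoder in self.encoders:
            tokens = encoder(tokens)
        u = self.fusion(tokens, feats[-1])
        return self.head(u)                       # N-dimensional classification result vector
```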
Further, the step 2 specifically includes:
Step 2.1, establishing a deep learning network, wherein the deep learning network comprises: a multi-scale feature extraction module, a multi-scale feature embedding module and a multi-head cross-attention encoder module;
step 2.2, constructing a multi-scale feature extraction module:
the multi-scale feature extraction module is composed of four convolutional neural network stages, comprising in order: a first convolution stage, a second convolution stage, a third convolution stage and a fourth convolution stage;
The p-th image E_n^p is input into the multi-scale feature extraction module and passes successively through the first, second, third and fourth convolution stages, which output the feature maps F_1^{n,p}, F_2^{n,p}, F_3^{n,p} and F_4^{n,p}, respectively;
Step 2.3, constructing a multi-scale feature embedding module:
The multi-scale feature embedding module is formed by connecting 4 different embedding layers in parallel, the 4 embedding layers corresponding to the 4 feature maps F_i^{n,p}, i = 1, 2, 3, 4, and each embedding layer containing one convolution layer and one dimension-reduction operation;
the feature maps F_i^{n,p} output by the four convolution stages are input into the multi-scale feature embedding module, where each F_i^{n,p}, i = 1, 2, 3, 4, passes through a convolution layer with kernel size 2^{5-i} × 2^{5-i} and, after dimension-reduction processing, the dimension-reduced output feature maps T_i^{n,p}, i = 1, 2, 3, 4, are output respectively.
Further, the step 3 specifically includes:
step 3.1, constructing a multi-head cross attention encoder with multi-scale feature embedding:
The multi-head cross-attention encoder module with multi-scale feature embedding is formed by connecting the feature embeddings of the 4 convolution stages in series with L multi-head cross-attention encoders;
the 4 feature maps T_i^{n,p} are input into the multi-scale feature embedding module and normalized by an LN layer, respectively, and the feature maps T_i^{n,p} are converted from the channel-cross perspective; specifically, the channel-cross feature maps T'_i^{n,p}, i = 1, 2, 3, 4, are obtained with formula (1):

T'_i^{n,p} = Transpose(T_i^{n,p}),  i = 1, 2, 3, 4    (1)

In formula (1), Transpose(·) denotes the transpose operation on a feature map, T_i^{n,p} denotes a pixel feature map with C_i channels of size H_i · W_i, and T'_i^{n,p} denotes the channel-cross pixel feature map with H_i · W_i rows of size C_i;
Step 3.2, multi-scale embedding is performed on the channel-cross feature maps T'_i^{n,p}; specifically, the multi-scale embedded feature map G_{n,p} is obtained with formula (2):

G_{n,p} = Concat(T'_1^{n,p}, T'_2^{n,p}, T'_3^{n,p}, T'_4^{n,p})    (2)

In formula (2), Concat(·) denotes the feature-vector concatenation operation, and G_{n,p} denotes the channel-cross feature map obtained from the T'_i^{n,p} after multi-scale feature embedding and conversion;
Step 3.3, the feature map G_{n,p} serves as the input of the 1st multi-head cross-attention encoder module, and the output of the c-th multi-head cross-attention encoder module serves as the input of the (c+1)-th multi-head cross-attention encoder module;
each c-th multi-head cross-attention encoder module comprises: 2 linear transformation layers and M parallel cross-attention layers; c = 1, 2, ..., L;
Step 3.4, the feature maps T'_i^{n,p}, i = 1, 2, 3, 4, are input into the c-th multi-head cross-attention encoder module; the embedded feature map G_{n,p} is multiplied by the two weight matrices W_m^K and W_m^V, and the feature maps T'_i^{n,p}, i = 1, 2, 3, 4, are multiplied by the four weight matrices W_m^{Q_1}, W_m^{Q_2}, W_m^{Q_3} and W_m^{Q_4}, respectively, outputting the feature maps K_{n,p}, V_{n,p} and Q_i^{n,p}, i = 1, 2, 3, 4, as shown in formula (3):

K_{n,p} = φ(G_{n,p}) W_m^K,  V_{n,p} = φ(G_{n,p}) W_m^V,  Q_i^{n,p} = φ(T'_i^{n,p}) W_m^{Q_i},  i = 1, 2, 3, 4    (3)

In formula (3), φ(·) denotes a normalization function;
Step 3.5, the feature maps K_{n,p}, V_{n,p} and Q_i^{n,p}, i = 1, 2, 3, 4, output by the multi-scale feature embedding module are input into the 1st multi-head cross-attention encoder; after linear transformation, K_{n,p}, V_{n,p} and Q_i^{n,p} are fed into the M parallel cross-attention layers, where each Q_i^{n,p} is multiplied with K_{n,p}, activated by the Softmax function, and multiplied with V_{n,p} to obtain the output, as shown in formula (4):

head_{m,i}^{n,p} = δ(ψ(Q_i^{n,p} · K_{n,p}^T)) · V_{n,p},  i = 1, 2, 3, 4,  m = 1, 2, ..., M    (4)

In formula (4), ψ(·) is a normalization function and δ(·) is the Softmax function;
Step 3.6, on the basis of the attention feature maps head_{m,i}^{n,p}, the attention feature maps of different heads are dynamically fused to form a new attention feature map, as shown in formula (5):

A_{m,i}^{n,p} = head_{m,i}^{n,p} · W_m    (5)

In formula (5), W_m is a learnable transformation matrix, through which the multi-head attention feature maps are fused and a new attention feature map is generated;
The resulting M cross-attention layer outputs A_{m,i}^{n,p}, i = 1, 2, 3, 4, m = 1, 2, ..., M, are combined as shown in formula (6) to obtain the feature maps MCA_i^c, c = 1, 2, ..., L, i = 1, 2, 3, 4:

MCA_i^c = Concat(A_{1,i}^{n,p}, A_{2,i}^{n,p}, ..., A_{M,i}^{n,p})    (6)

In formula (6), M is the number of cross-attention heads, and A_{m,i}^{n,p} denotes the feature map generated from the i-th feature map Q_i in the c-th multi-head cross-attention encoder module through the m-th cross-attention layer;
Step 3.7, the feature maps MCA_i^c, i = 1, 2, 3, 4, obtained after multi-head cross attention are processed by linear transformation and normalization, respectively, and the outputs Z_i^c, i = 1, 2, 3, 4, of the multi-head cross-attention encoder module are obtained as shown in formula (7):

Z_i^c = σ(δ(σ(MCA_i^c))),  i = 1, 2, 3, 4    (7)

In formula (7), δ(·) denotes the GeLU function and σ(·) denotes a linear transformation function;
When c ≠ L, the output of the c-th multi-head cross-attention encoder module is input into the next multi-head cross-attention encoder module;
When c = L, the output Z_{n,p}^L of the L-th multi-head cross-attention encoder module is upsampled using formula (8) to obtain the feature map U'_{n,p}:

U'_{n,p} = δ(φ(μ(Z_{n,p}^L)))    (8)

In formula (8), μ(·) denotes the upsampling function, φ(·) denotes a normalization function, δ(·) denotes the ReLU function, and Z_{n,p}^L denotes the output of the L-th multi-head cross-attention encoder module;
Step 3.8, the upsampled feature map U'_{n,p} is fused with the feature map F_4^{n,p} output by the convolution stage to obtain the output U_{n,p}, as shown in formula (9):

U_{n,p} = U'_{n,p} + F_4^{n,p}    (9)

In formula (9), F_4^{n,p} is the feature map output by the fourth convolution stage of the multi-scale feature extraction module.
Further, the step 4 specifically includes:
Step 4.1, the feature map U_{n,p} output by the multi-head cross-attention encoder with multi-scale feature embedding is input into a convolution stage for feature extraction, and the feature map D_{n,p} is output;
Step 4.2, outputting the feature graph D of the convolution stage n,p Respectively carrying out global average pooling operation and global maximum pooling operation, splicing and fusing the obtained feature maps to obtain a result D' n,p The specific formula is shown in formula (10):
Figure BDA0003650924180000066
in equation (10), concat (. Cndot.) represents the feature vector splicing operation,
Figure BDA0003650924180000067
is D n,p The feature map output after global average pooling,
Figure BDA0003650924180000068
is D n,p Outputting a feature map after global maximum pooling;
will feature map D' n,p Inputting the data into a full connection layer to obtain a result vector of N-dimensional classification.
Further, the step 5 specifically includes: establishing a cross-entropy loss function, inputting the training sample set into the deep learning network for training, and then optimally solving the cross-entropy loss function with a back-propagation algorithm, thereby adjusting all parameters of the deep learning network and obtaining an endoscope image classifier for classifying endoscope images.
Compared with the prior art, the invention has the following advantages:
the invention constructs an endoscope image classification model by using an endoscope image classification method based on multi-scale feature embedding and cross attention. The general convolutional neural network classification depends on semantic information of the deep feature map and ignores geometric detail information of the shallow feature map. According to the method, the deep characteristic diagram rich in semantic information and the shallow characteristic diagram rich in geometric detail information are fused through multi-scale embedding, ambiguity of semantic information and geometric information between characteristic diagrams of different scales is eliminated from the angle of channel intersection, and more effective characteristics are extracted, so that the accuracy of endoscope image classification is improved, a doctor is assisted in diagnosis and film reading, and the pressure of the endoscope doctor in film reading is reduced.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a block diagram of a deep learning network according to the present invention;
FIG. 3 is a schematic diagram of a multi-headed cross attention encoder module according to the present invention.
Detailed Description
For the convenience of understanding, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
In this embodiment, an endoscopic image classification method based on multi-scale feature embedding and cross attention includes the following specific steps, as shown in fig. 1:
Step 1, acquiring N classes of C × H × W endoscope image samples and preprocessing them to obtain a training set E = {E_1, E_2, ..., E_n, ..., E_N}, where E_n = {E_n^1, E_n^2, ..., E_n^p, ..., E_n^P} denotes the n-th class of endoscope image samples, the n-th class containing P images in total, and E_n^p denotes the p-th image in the preprocessed n-th class of endoscope image samples; C denotes the number of image channels, H the image height, W the image width, and n = 1, 2, ..., N;
Step 2, as shown in Fig. 2, establishing a deep learning network, processing the endoscope image sample data set with the convolutional neural network of the deep learning network to output feature maps of different convolution stages, and applying dimension-reduction processing to these feature maps to form the dimension-reduced output feature maps T_i^{n,p}, i = 1, 2, 3, 4.
Step 2.1, establishing a deep learning network, wherein the deep learning network comprises the following steps: the system comprises a multi-scale feature extraction module, a multi-scale feature embedding module and a multi-head cross attention encoder module;
step 2.2, constructing a multi-scale feature extraction module:
the multi-scale feature extraction module is composed of four convolutional neural network stages, and sequentially comprises the following steps: a first convolution stage, a second convolution stage, a third convolution stage and a fourth convolution stage;
The p-th image E_n^p is input into the multi-scale feature extraction module and passes successively through the first, second, third and fourth convolution stages, which output the feature maps F_1^{n,p}, F_2^{n,p}, F_3^{n,p} and F_4^{n,p}, respectively;
Step 2.3, constructing a multi-scale feature embedding module:
The multi-scale feature embedding module is formed by connecting 4 different embedding layers in parallel, the 4 embedding layers corresponding to the 4 feature maps F_i^{n,p}, i = 1, 2, 3, 4, and each embedding layer containing one convolution layer and one dimension-reduction operation;
the feature maps F_i^{n,p} output by the four convolution stages are input into the multi-scale feature embedding module, where each F_i^{n,p}, i = 1, 2, 3, 4, passes through a convolution layer with kernel size 2^{5-i} × 2^{5-i} and, after dimension-reduction processing, the dimension-reduced output feature maps T_i^{n,p}, i = 1, 2, 3, 4, are output respectively.
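A minimal PyTorch sketch of steps 2.2-2.3 is given below. It assumes a simple four-stage convolutional backbone in which each stage halves the spatial resolution, and implements each embedding layer as a convolution whose kernel size and stride equal 2^{5-i}, so that the four dimension-reduced maps T_i^{n,p} share one spatial resolution; the stage widths, the embedding width and the backbone itself are assumptions, not values taken from the patent.

```python
import torch
import torch.nn as nn

class MultiScaleExtractorAndEmbedding(nn.Module):
    """Steps 2.2-2.3 (sketch): four convolution stages F1..F4, then one
    embedding convolution per stage with kernel = stride = 2**(5-i) so that
    the dimension-reduced maps T1..T4 end up at a common spatial size."""

    def __init__(self, in_ch=3, widths=(64, 128, 256, 512), embed_dim=64):
        super().__init__()
        stages, prev = [], in_ch
        for w in widths:  # each assumed stage halves H and W with a stride-2 convolution
            stages.append(nn.Sequential(
                nn.Conv2d(prev, w, kernel_size=3, stride=2, padding=1),
                nn.BatchNorm2d(w),
                nn.ReLU(inplace=True)))
            prev = w
        self.stages = nn.ModuleList(stages)
        # embedding for stage i (1-based) uses kernel = stride = 2**(5-i): 16, 8, 4, 2
        self.embeds = nn.ModuleList([
            nn.Conv2d(widths[i], embed_dim, kernel_size=2 ** (4 - i), stride=2 ** (4 - i))
            for i in range(4)])

    def forward(self, x):
        feats = []                      # F1..F4 from the four convolution stages
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        embeds = [emb(f) for emb, f in zip(self.embeds, feats)]  # T1..T4, same H' x W'
        return feats, embeds
```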
Step 3, inputting the dimension-reduced output feature maps T_i^{n,p} into a pre-constructed multi-head cross-attention encoder with multi-scale feature embedding, and outputting the feature map U_{n,p} after normalization and upsampling processing.
Step 3.1, constructing a multi-head cross attention encoder with multi-scale feature embedding:
The multi-head cross-attention encoder module with multi-scale feature embedding is formed by connecting the feature embeddings of the 4 convolution stages in series with L multi-head cross-attention encoders;
the 4 feature maps T_i^{n,p} are input into the multi-scale feature embedding module and normalized by an LN layer, respectively, and the feature maps T_i^{n,p} are converted from the channel-cross perspective; specifically, the channel-cross feature maps T'_i^{n,p}, i = 1, 2, 3, 4, are obtained with formula (1):

T'_i^{n,p} = Transpose(T_i^{n,p}),  i = 1, 2, 3, 4    (1)

In formula (1), Transpose(·) denotes the transpose operation on a feature map, T_i^{n,p} denotes a pixel feature map with C_i channels of size H_i · W_i, and T'_i^{n,p} denotes the channel-cross pixel feature map with H_i · W_i rows of size C_i;
Step 3.2, multi-scale embedding is performed on the channel-cross feature maps T'_i^{n,p}; specifically, the multi-scale embedded feature map G_{n,p} is obtained with formula (2):

G_{n,p} = Concat(T'_1^{n,p}, T'_2^{n,p}, T'_3^{n,p}, T'_4^{n,p})    (2)

In formula (2), Concat(·) denotes the feature-vector concatenation operation, and G_{n,p} denotes the channel-cross feature map obtained from the T'_i^{n,p} after multi-scale feature embedding and conversion;
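A sketch of formulas (1)-(2) under the notation used above: each dimension-reduced map T_i^{n,p} of shape (C_i, H', W') is flattened and transposed into an (H'·W') × C_i channel-cross map, layer-normalized, and the four maps are concatenated along the channel axis. Applying nn.LayerNorm over the channel dimension after the transpose is an assumption about how the LN layer is used.

```python
import torch
import torch.nn as nn

class ChannelCrossEmbedding(nn.Module):
    """Formulas (1)-(2), sketch: transpose each T_i into an (H'*W') x C_i
    channel-cross map and concatenate the four scales along the channel axis."""

    def __init__(self, channels=(64, 64, 64, 64)):
        super().__init__()
        self.norms = nn.ModuleList([nn.LayerNorm(c) for c in channels])

    def forward(self, embeds):
        # embeds: list of 4 tensors T_i, each (B, C_i, H', W') at a shared H', W'
        hats = []
        for t, ln in zip(embeds, self.norms):
            t_hat = t.flatten(2).transpose(1, 2)   # (B, H'*W', C_i), formula (1)
            hats.append(ln(t_hat))                 # LN over the channel dimension (assumed)
        g = torch.cat(hats, dim=-1)                # (B, H'*W', C_1+...+C_4), formula (2)
        return hats, g
```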
Step 3.3, the feature map G_{n,p} serves as the input of the 1st multi-head cross-attention encoder module, and the output of the c-th multi-head cross-attention encoder module serves as the input of the (c+1)-th multi-head cross-attention encoder module;
as shown in Fig. 3, each c-th multi-head cross-attention encoder module comprises: 2 linear transformation layers and M parallel cross-attention layers; c = 1, 2, ..., L;
Step 3.4, the feature maps T'_i^{n,p}, i = 1, 2, 3, 4, are input into the c-th multi-head cross-attention encoder module; the embedded feature map G_{n,p} is multiplied by the two weight matrices W_m^K and W_m^V, and the feature maps T'_i^{n,p}, i = 1, 2, 3, 4, are multiplied by the four weight matrices W_m^{Q_1}, W_m^{Q_2}, W_m^{Q_3} and W_m^{Q_4}, respectively, outputting the feature maps K_{n,p}, V_{n,p} and Q_i^{n,p}, i = 1, 2, 3, 4, as shown in formula (3):

K_{n,p} = φ(G_{n,p}) W_m^K,  V_{n,p} = φ(G_{n,p}) W_m^V,  Q_i^{n,p} = φ(T'_i^{n,p}) W_m^{Q_i},  i = 1, 2, 3, 4    (3)

In formula (3), φ(·) denotes a normalization function;
Step 3.5, the feature maps K_{n,p}, V_{n,p} and Q_i^{n,p}, i = 1, 2, 3, 4, output by the multi-scale feature embedding module are input into the 1st multi-head cross-attention encoder; after linear transformation, K_{n,p}, V_{n,p} and Q_i^{n,p} are fed into the M parallel cross-attention layers, where each Q_i^{n,p} is multiplied with K_{n,p}, activated by the Softmax function, and multiplied with V_{n,p} to obtain the output, as shown in formula (4):

head_{m,i}^{n,p} = δ(ψ(Q_i^{n,p} · K_{n,p}^T)) · V_{n,p},  i = 1, 2, 3, 4,  m = 1, 2, ..., M    (4)

In formula (4), ψ(·) is a normalization function and δ(·) is the Softmax function;
Step 3.6, on the basis of the attention feature maps head_{m,i}^{n,p}, the attention feature maps of different heads are dynamically fused to form a new attention feature map, as shown in formula (5):

A_{m,i}^{n,p} = head_{m,i}^{n,p} · W_m    (5)

In formula (5), W_m is a learnable transformation matrix, through which the multi-head attention feature maps are fused and a new attention feature map is generated;
The resulting M cross-attention layer outputs A_{m,i}^{n,p}, i = 1, 2, 3, 4, m = 1, 2, ..., M, are combined as shown in formula (6) to obtain the feature maps MCA_i^c, c = 1, 2, ..., L, i = 1, 2, 3, 4:

MCA_i^c = Concat(A_{1,i}^{n,p}, A_{2,i}^{n,p}, ..., A_{M,i}^{n,p})    (6)

In formula (6), M is the number of cross-attention heads, and A_{m,i}^{n,p} denotes the feature map generated from the i-th feature map Q_i in the c-th multi-head cross-attention encoder module through the m-th cross-attention layer;
Step 3.7, the feature maps MCA_i^c, i = 1, 2, 3, 4, obtained after multi-head cross attention are processed by linear transformation and normalization, respectively, and the outputs Z_i^c, i = 1, 2, 3, 4, of the multi-head cross-attention encoder module are obtained as shown in formula (7):

Z_i^c = σ(δ(σ(MCA_i^c))),  i = 1, 2, 3, 4    (7)

In formula (7), δ(·) denotes the GeLU function and σ(·) denotes a linear transformation function;
When c ≠ L, the output of the c-th multi-head cross-attention encoder module is input into the next multi-head cross-attention encoder module;
When c = L, the output Z_{n,p}^L of the L-th multi-head cross-attention encoder module is upsampled using formula (8) to obtain the feature map U'_{n,p}:

U'_{n,p} = δ(φ(μ(Z_{n,p}^L)))    (8)

In formula (8), μ(·) denotes the upsampling function, φ(·) denotes a normalization function, δ(·) denotes the ReLU function, and Z_{n,p}^L denotes the output of the L-th multi-head cross-attention encoder module;
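The following is a minimal sketch of one multi-head cross-attention encoder layer as described in steps 3.4-3.7 above: the concatenated map G_{n,p} supplies the keys and values, each per-scale channel-cross map supplies its own queries, every head computes Softmax(ψ(Q_i K^T)) V as in formula (4), the heads are fused through a learnable linear map as in formulas (5)-(6), and a linear-GeLU-linear block produces the encoder output as in formula (7). The scaled dot-product normalization (division by √d), the fusion of heads by concatenation followed by a projection, and the absence of residual connections are assumptions where the patent text leaves the exact form open.

```python
import math
import torch
import torch.nn as nn

class CrossAttentionEncoderLayer(nn.Module):
    """Sketch of one multi-head cross-attention encoder (steps 3.4-3.7).
    Queries come from each per-scale map T'_i; keys/values come from the
    concatenated multi-scale map G."""

    def __init__(self, dim_per_scale=64, num_scales=4, num_heads=4, head_dim=32):
        super().__init__()
        cat_dim = dim_per_scale * num_scales
        inner = num_heads * head_dim
        self.num_heads, self.head_dim = num_heads, head_dim
        self.to_k = nn.Linear(cat_dim, inner)                     # W^K   (formula 3)
        self.to_v = nn.Linear(cat_dim, inner)                     # W^V   (formula 3)
        self.to_q = nn.ModuleList(                                # W^{Q_i} (formula 3)
            [nn.Linear(dim_per_scale, inner) for _ in range(num_scales)])
        self.fuse = nn.ModuleList(                                # learnable fusion (5)-(6)
            [nn.Linear(inner, dim_per_scale) for _ in range(num_scales)])
        self.mlps = nn.ModuleList([nn.Sequential(                 # linear-GeLU-linear (7)
            nn.Linear(dim_per_scale, dim_per_scale * 4),
            nn.GELU(),
            nn.Linear(dim_per_scale * 4, dim_per_scale)) for _ in range(num_scales)])
        self.norm_g = nn.LayerNorm(cat_dim)
        self.norms = nn.ModuleList([nn.LayerNorm(dim_per_scale) for _ in range(num_scales)])

    def forward(self, hats, g):
        # hats: list of 4 tensors (B, P, C'); g: (B, P, 4*C') with P = H'*W'
        b, p, _ = g.shape
        k = self.to_k(self.norm_g(g)).view(b, p, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.to_v(self.norm_g(g)).view(b, p, self.num_heads, self.head_dim).transpose(1, 2)
        outs = []
        for i, t_hat in enumerate(hats):
            q = self.to_q[i](self.norms[i](t_hat))
            q = q.view(b, p, self.num_heads, self.head_dim).transpose(1, 2)
            attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.head_dim), dim=-1)
            heads = attn @ v                                      # (B, M, P, head_dim), formula (4)
            fused = self.fuse[i](heads.transpose(1, 2).reshape(b, p, -1))  # formulas (5)-(6)
            outs.append(self.mlps[i](fused))                      # formula (7)
        # per-scale outputs plus their concatenation, reusable as input to the next encoder
        return outs, torch.cat(outs, dim=-1)
```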
Step 3.8, the upsampled feature map U'_{n,p} is fused with the feature map F_4^{n,p} output by the convolution stage to obtain the output U_{n,p}, as shown in formula (9):

U_{n,p} = U'_{n,p} + F_4^{n,p}    (9)

In formula (9), F_4^{n,p} is the feature map output by the fourth convolution stage of the multi-scale feature extraction module.
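A sketch of formulas (8)-(9) under the shapes used above: the output tokens of the L-th encoder are reshaped back into a 2-D map, upsampled to the size of F_4^{n,p}, normalized and passed through ReLU, and then fused with F_4^{n,p} by element-wise addition. Bilinear upsampling, the 1×1 projection used to match the channel count of F_4^{n,p}, and addition as the fusion operation are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpsampleFusion(nn.Module):
    """Formulas (8)-(9), sketch: reshape encoder tokens to 2-D, upsample to the
    size of F4, normalize + ReLU, then fuse with F4 by addition."""

    def __init__(self, token_dim=256, f4_channels=512):
        super().__init__()
        self.proj = nn.Conv2d(token_dim, f4_channels, kernel_size=1)  # assumed channel match
        self.bn = nn.BatchNorm2d(f4_channels)

    def forward(self, z, f4):
        # z: (B, P, token_dim) output of the L-th encoder; f4: (B, C4, H4, W4)
        b, p, d = z.shape
        side = int(p ** 0.5)                              # assumes a square token grid
        z = z.transpose(1, 2).reshape(b, d, side, side)
        z = F.interpolate(z, size=f4.shape[-2:], mode="bilinear", align_corners=False)  # mu(.)
        u_prime = F.relu(self.bn(self.proj(z)))           # delta(phi(mu(.))), formula (8)
        return u_prime + f4                               # formula (9), fusion by addition (assumed)
```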
Step 4, inputting the feature map U_{n,p} into a convolution stage for feature extraction to output the feature map D_{n,p}; performing global average pooling and global max pooling on D_{n,p}, respectively, concatenating and fusing the resulting feature maps to obtain the result feature map D'_{n,p}; and inputting the feature map D'_{n,p} into a fully connected layer to obtain the N-dimensional classification result vector.
Step 4.1, the feature map U_{n,p} output by the multi-head cross-attention encoder with multi-scale feature embedding is input into a convolution stage for feature extraction, and the feature map D_{n,p} is output;
Step 4.2, global average pooling and global max pooling are applied to the feature map D_{n,p} output by the convolution stage, respectively, and the resulting feature maps are concatenated and fused to obtain the result D'_{n,p}, as shown in formula (10):

D'_{n,p} = Concat(GAP(D_{n,p}), GMP(D_{n,p}))    (10)

In formula (10), Concat(·) denotes the feature-vector concatenation operation, GAP(D_{n,p}) is the feature map output by D_{n,p} after global average pooling, and GMP(D_{n,p}) is the feature map output by D_{n,p} after global max pooling;
the feature map D'_{n,p} is input into a fully connected layer to obtain the N-dimensional classification result vector.
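A sketch of step 4 / formula (10): the fused map U_{n,p} is passed through one further convolution stage, global average pooling and global max pooling are applied to the result, the two pooled vectors are concatenated, and a fully connected layer maps the concatenation to N classes. The width of the extra convolution stage and the default class count are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassificationHead(nn.Module):
    """Step 4 / formula (10), sketch: convolution stage -> GAP and GMP -> concat -> FC."""

    def __init__(self, in_channels=512, mid_channels=512, num_classes=4):
        super().__init__()
        self.conv_stage = nn.Sequential(           # feature extraction producing D_{n,p}
            nn.Conv2d(in_channels, mid_channels, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(mid_channels),
            nn.ReLU(inplace=True))
        self.fc = nn.Linear(mid_channels * 2, num_classes)

    def forward(self, u):
        d = self.conv_stage(u)                                    # D_{n,p}
        gap = torch.flatten(F.adaptive_avg_pool2d(d, 1), 1)       # GAP(D_{n,p})
        gmp = torch.flatten(F.adaptive_max_pool2d(d, 1), 1)       # GMP(D_{n,p})
        d_prime = torch.cat([gap, gmp], dim=1)                    # formula (10)
        return self.fc(d_prime)                                   # N-dimensional result vector
```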
Step 5, constructing an endoscope classifier based on the N-dimensional classification result vector to classify endoscope images.
The step 5 specifically includes: establishing the cross-entropy loss function shown in formula (11), inputting the training sample set into the deep learning network for training, and then optimally solving the cross-entropy loss function with a back-propagation algorithm, thereby adjusting all parameters of the deep learning network and obtaining an endoscope image classifier for classifying endoscope images; the cross-entropy loss function is:

CE(p, q) = -Σ_{i=1}^{C} p_i log(q_i)    (11)

In formula (11), C denotes the number of classes, p_i denotes the true class of sample i, q_i denotes the predicted class of sample i, and CE(p, q) denotes the classification loss over the sample.
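For step 5, a minimal training-loop sketch that optimizes the cross-entropy loss of formula (11) by back-propagation; the optimizer choice (Adam), the learning rate, the epoch count, and the assumption that `model` and `train_loader` are constructed elsewhere are all illustrative.

```python
import torch
import torch.nn as nn

def train_classifier(model, train_loader, num_epochs=50, lr=1e-4, device="cuda"):
    """Step 5 (sketch): optimize the cross-entropy loss of formula (11) by
    back-propagation to obtain the endoscope image classifier."""
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()                           # CE(p, q) = -sum_i p_i log(q_i)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)     # optimizer choice is assumed
    for epoch in range(num_epochs):
        model.train()
        for images, labels in train_loader:                     # labels in {0, ..., N-1}
            images, labels = images.to(device), labels.to(device)
            logits = model(images)                              # N-dimensional classification result vector
            loss = criterion(logits, labels)
            optimizer.zero_grad()
            loss.backward()                                     # back-propagation
            optimizer.step()
    return model
```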
It will be evident to those skilled in the art that the embodiments of the present invention are not limited to the details of the foregoing illustrative embodiments, and that the embodiments of the present invention are capable of being embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the embodiments being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. Several units, modules or means recited in the system, device or terminal claims may also be implemented by one and the same unit, module or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.
Finally, it should be noted that the above embodiments are only used for illustrating the technical solutions of the embodiments of the present invention and not for limiting, and although the embodiments of the present invention are described in detail with reference to the above preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions can be made on the technical solutions of the embodiments of the present invention without departing from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (3)

1. A method for classifying endoscopic images based on multi-scale feature embedding and cross-attention, the method comprising:
step 1, acquiring N classes of C × H × W endoscope image samples and preprocessing them to obtain a training set E = {E_1, E_2, ..., E_n, ..., E_N}, where E_n = {E_n^1, E_n^2, ..., E_n^p, ..., E_n^P} denotes the n-th class of endoscope image samples, the n-th class containing P images in total, and E_n^p denotes the p-th image in the preprocessed n-th class of endoscope image samples; C denotes the number of image channels, H the image height, W the image width, and n = 1, 2, ..., N;
step 2, establishing a deep learning network, processing the endoscope image sample data set with the convolutional neural network of the deep learning network to output feature maps of different convolution stages, and applying dimension-reduction processing to these feature maps to form the dimension-reduced output feature maps T_i^{n,p}, i = 1, 2, 3, 4; which specifically comprises the following steps:
step 2.1, establishing a deep learning network, wherein the deep learning network comprises: a multi-scale feature extraction module, a multi-scale feature embedding module and a multi-head cross-attention encoder module;
step 2.2, constructing a multi-scale feature extraction module:
the multi-scale feature extraction module is composed of four convolutional neural network stages, comprising in order: a first convolution stage, a second convolution stage, a third convolution stage and a fourth convolution stage;
the p-th image E_n^p is input into the multi-scale feature extraction module and passes successively through the first, second, third and fourth convolution stages, which output the feature maps F_1^{n,p}, F_2^{n,p}, F_3^{n,p} and F_4^{n,p}, respectively;
Step 2.3, constructing a multi-scale feature embedding module:
the multi-scale feature embedding module is formed by connecting 4 different embedding layers in parallel, the 4 embedding layers corresponding to the 4 feature maps F_i^{n,p}, i = 1, 2, 3, 4, and each embedding layer containing one convolution layer and one dimension-reduction operation;
the feature maps F_i^{n,p} output by the four convolution stages are input into the multi-scale feature embedding module, where each F_i^{n,p}, i = 1, 2, 3, 4, passes through a convolution layer with kernel size 2^{5-i} × 2^{5-i} and, after dimension-reduction processing, the dimension-reduced output feature maps T_i^{n,p}, i = 1, 2, 3, 4, are output respectively;
step 3, inputting the dimension-reduced output feature maps T_i^{n,p} into a pre-constructed multi-head cross-attention encoder with multi-scale feature embedding, performing normalization and upsampling processing, and outputting the feature map U_{n,p}; which specifically comprises the following steps:
step 3.1, constructing a multi-head cross attention encoder with multi-scale feature embedding:
the multi-head cross-attention encoder module with multi-scale feature embedding is formed by connecting the feature embeddings of the 4 convolution stages in series with L multi-head cross-attention encoders;
the 4 feature maps T_i^{n,p} are input into the multi-scale feature embedding module and normalized by an LN layer, respectively, and the feature maps T_i^{n,p} are converted from the channel-cross perspective; specifically, the channel-cross feature maps T'_i^{n,p}, i = 1, 2, 3, 4, are obtained with formula (1):

T'_i^{n,p} = Transpose(T_i^{n,p}),  i = 1, 2, 3, 4    (1)

In formula (1), Transpose(·) denotes the transpose operation on a feature map, T_i^{n,p} denotes a pixel feature map with C_i channels of size H_i · W_i, and T'_i^{n,p} denotes the channel-cross pixel feature map with H_i · W_i rows of size C_i;
step 3.2, multi-scale embedding is performed on the channel-cross feature maps T'_i^{n,p}; specifically, the multi-scale embedded feature map G_{n,p} is obtained with formula (2):

G_{n,p} = Concat(T'_1^{n,p}, T'_2^{n,p}, T'_3^{n,p}, T'_4^{n,p})    (2)

In formula (2), Concat(·) denotes the feature-vector concatenation operation, and G_{n,p} denotes the channel-cross feature map obtained from the T'_i^{n,p} after multi-scale feature embedding and conversion;
step 3.3, the feature map G_{n,p} serves as the input of the 1st multi-head cross-attention encoder module, and the output of the c-th multi-head cross-attention encoder module serves as the input of the (c+1)-th multi-head cross-attention encoder module;
each c-th multi-head cross-attention encoder module comprises: 2 linear transformation layers and M parallel cross-attention layers; c = 1, 2, ..., L;
step 3.4, the feature maps T'_i^{n,p}, i = 1, 2, 3, 4, are input into the c-th multi-head cross-attention encoder module; the embedded feature map G_{n,p} is multiplied by the two weight matrices W_m^K and W_m^V, and the feature maps T'_i^{n,p}, i = 1, 2, 3, 4, are multiplied by the four weight matrices W_m^{Q_1}, W_m^{Q_2}, W_m^{Q_3} and W_m^{Q_4}, respectively, outputting the feature maps K_{n,p}, V_{n,p} and Q_i^{n,p}, i = 1, 2, 3, 4, as shown in formula (3):

K_{n,p} = φ(G_{n,p}) W_m^K,  V_{n,p} = φ(G_{n,p}) W_m^V,  Q_i^{n,p} = φ(T'_i^{n,p}) W_m^{Q_i},  i = 1, 2, 3, 4    (3)

In formula (3), φ(·) denotes a normalization function;
step 3.5, the feature maps K_{n,p}, V_{n,p} and Q_i^{n,p}, i = 1, 2, 3, 4, output by the multi-scale feature embedding module are input into the 1st multi-head cross-attention encoder; after linear transformation, K_{n,p}, V_{n,p} and Q_i^{n,p} are fed into the M parallel cross-attention layers, where each Q_i^{n,p} is multiplied with K_{n,p}, activated by the Softmax function, and multiplied with V_{n,p} to obtain the output, as shown in formula (4):

head_{m,i}^{n,p} = δ(ψ(Q_i^{n,p} · K_{n,p}^T)) · V_{n,p},  i = 1, 2, 3, 4,  m = 1, 2, ..., M    (4)

In formula (4), ψ(·) is a normalization function and δ(·) is the Softmax function;
step 3.6, on the basis of the attention feature maps head_{m,i}^{n,p}, the attention feature maps of different heads are dynamically fused to form a new attention feature map, as shown in formula (5):

A_{m,i}^{n,p} = head_{m,i}^{n,p} · W_m    (5)

In formula (5), W_m is a learnable transformation matrix, through which the multi-head attention feature maps are fused and a new attention feature map is generated;
the resulting M cross-attention layer outputs A_{m,i}^{n,p}, i = 1, 2, 3, 4, m = 1, 2, ..., M, are combined as shown in formula (6) to obtain the feature maps MCA_i^c, c = 1, 2, ..., L, i = 1, 2, 3, 4:

MCA_i^c = Concat(A_{1,i}^{n,p}, A_{2,i}^{n,p}, ..., A_{M,i}^{n,p})    (6)

In formula (6), M is the number of cross-attention heads, and A_{m,i}^{n,p} denotes the feature map generated from the i-th feature map Q_i in the c-th multi-head cross-attention encoder module through the m-th cross-attention layer;
step 3.7, the feature maps MCA_i^c, i = 1, 2, 3, 4, obtained after multi-head cross attention are processed by linear transformation and normalization, respectively, and the outputs Z_i^c, i = 1, 2, 3, 4, of the multi-head cross-attention encoder module are obtained as shown in formula (7):

Z_i^c = σ(δ(σ(MCA_i^c))),  i = 1, 2, 3, 4    (7)

In formula (7), δ(·) denotes the GeLU function and σ(·) denotes a linear transformation function;
when c ≠ L, the output of the c-th multi-head cross-attention encoder module is input into the next multi-head cross-attention encoder module;
when c = L, the output Z_{n,p}^L of the L-th multi-head cross-attention encoder module is upsampled using formula (8) to obtain the feature map U'_{n,p}:

U'_{n,p} = δ(φ(μ(Z_{n,p}^L)))    (8)

In formula (8), μ(·) denotes the upsampling function, φ(·) denotes a normalization function, δ(·) denotes the ReLU function, and Z_{n,p}^L denotes the output of the L-th multi-head cross-attention encoder module;
step 3.8, the upsampled feature map U'_{n,p} is fused with the feature map F_4^{n,p} output by the convolution stage to obtain the output U_{n,p}, as shown in formula (9):

U_{n,p} = U'_{n,p} + F_4^{n,p}    (9)

In formula (9), F_4^{n,p} is the feature map output by the fourth convolution stage of the multi-scale feature extraction module;
step 4, inputting the feature map U_{n,p} into a convolution stage for feature extraction to output the feature map D_{n,p}; performing global average pooling and global max pooling on D_{n,p}, respectively, concatenating and fusing the resulting feature maps to obtain the result feature map D'_{n,p}; and inputting the feature map D'_{n,p} into a fully connected layer to obtain an N-dimensional classification result vector;
and step 5, constructing an endoscope classifier based on the N-dimensional classification result vector to classify endoscope images.
2. The endoscopic image classification method according to claim 1, characterized in that said step 4 specifically comprises:
step 4.1, the feature map U_{n,p} output by the multi-head cross-attention encoder with multi-scale feature embedding is input into a convolution stage for feature extraction, and the feature map D_{n,p} is output;
step 4.2, global average pooling and global max pooling are applied to the feature map D_{n,p} output by the convolution stage, respectively, and the resulting feature maps are concatenated and fused to obtain the result D'_{n,p}, as shown in formula (10):

D'_{n,p} = Concat(GAP(D_{n,p}), GMP(D_{n,p}))    (10)

In formula (10), Concat(·) denotes the feature-vector concatenation operation, GAP(D_{n,p}) is the feature map output by D_{n,p} after global average pooling, and GMP(D_{n,p}) is the feature map output by D_{n,p} after global max pooling;
the feature map D'_{n,p} is input into a fully connected layer to obtain the N-dimensional classification result vector.
3. The endoscopic image classification method according to claim 2, characterized in that said step 5 specifically comprises: establishing a cross-entropy loss function, inputting the training sample set into the deep learning network for training, and then optimally solving the cross-entropy loss function with a back-propagation algorithm, thereby adjusting all parameters of the deep learning network and obtaining an endoscope image classifier for classifying endoscope images.
CN202210542820.8A 2022-05-18 2022-05-18 Endoscope image classification method based on multi-scale feature embedding and cross attention Active CN114863179B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210542820.8A CN114863179B (en) 2022-05-18 2022-05-18 Endoscope image classification method based on multi-scale feature embedding and cross attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210542820.8A CN114863179B (en) 2022-05-18 2022-05-18 Endoscope image classification method based on multi-scale feature embedding and cross attention

Publications (2)

Publication Number Publication Date
CN114863179A CN114863179A (en) 2022-08-05
CN114863179B true CN114863179B (en) 2022-12-13

Family

ID=82638829

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210542820.8A Active CN114863179B (en) 2022-05-18 2022-05-18 Endoscope image classification method based on multi-scale feature embedding and cross attention

Country Status (1)

Country Link
CN (1) CN114863179B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116188436B (en) * 2023-03-03 2023-11-10 合肥工业大学 Cystoscope image classification method based on fusion of local features and global features
CN117522884B (en) * 2024-01-05 2024-05-17 武汉理工大学三亚科教创新园 Ocean remote sensing image semantic segmentation method and device and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109034253A (en) * 2018-08-01 2018-12-18 华中科技大学 A kind of chronic venous disease image classification method based on multiscale semanteme feature
CN113378791A (en) * 2021-07-09 2021-09-10 合肥工业大学 Cervical cell classification method based on double-attention mechanism and multi-scale feature fusion
WO2022073452A1 (en) * 2020-10-07 2022-04-14 武汉大学 Hyperspectral remote sensing image classification method based on self-attention context network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111739075B (en) * 2020-06-15 2024-02-06 大连理工大学 Deep network lung texture recognition method combining multi-scale attention

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109034253A (en) * 2018-08-01 2018-12-18 华中科技大学 A kind of chronic venous disease image classification method based on multiscale semanteme feature
WO2022073452A1 (en) * 2020-10-07 2022-04-14 武汉大学 Hyperspectral remote sensing image classification method based on self-attention context network
CN113378791A (en) * 2021-07-09 2021-09-10 合肥工业大学 Cervical cell classification method based on double-attention mechanism and multi-scale feature fusion

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Bi-Modal Learning With Channel-Wise Attention for Multi-Label Image Classification; Peng Li et al.; IEEE Access; 2020-01-07; pp. 2169-3536 *
Image classification method for nutrient-deficient tomato leaves based on attention mechanism and multi-scale feature fusion; Han Xu et al.; Transactions of the Chinese Society of Agricultural Engineering; 2021-09-08; Vol. 37, No. 17, pp. 177-188 *

Also Published As

Publication number Publication date
CN114863179A (en) 2022-08-05

Similar Documents

Publication Publication Date Title
CN114863179B (en) Endoscope image classification method based on multi-scale feature embedding and cross attention
CN112116605B (en) Pancreas CT image segmentation method based on integrated depth convolution neural network
CN111242288B (en) Multi-scale parallel deep neural network model construction method for lesion image segmentation
CN113239954B (en) Attention mechanism-based image semantic segmentation feature fusion method
CN113674253A (en) Rectal cancer CT image automatic segmentation method based on U-transducer
CN113378791B (en) Cervical cell classification method based on double-attention mechanism and multi-scale feature fusion
CN111401156B (en) Image identification method based on Gabor convolution neural network
CN112347908B (en) Surgical instrument image identification method based on space grouping attention model
CN112818969A (en) Knowledge distillation-based face pose estimation method and system
CN109492610B (en) Pedestrian re-identification method and device and readable storage medium
CN114037699B (en) Pathological image classification method, equipment, system and storage medium
CN114782753A (en) Lung cancer histopathology full-section classification method based on weak supervision learning and converter
CN115457311A (en) Hyperspectral remote sensing image band selection method based on self-expression transfer learning
CN115410258A (en) Human face expression recognition method based on attention image
CN114820481A (en) Lung cancer histopathology full-section EGFR state prediction method based on converter
CN114581789A (en) Hyperspectral image classification method and system
CN115331047A (en) Earthquake image interpretation method based on attention mechanism
CN113192076B (en) MRI brain tumor image segmentation method combining classification prediction and multi-scale feature extraction
Girdher et al. Detecting and estimating severity of leaf spot disease in golden pothos using hybrid deep learning approach
CN112926619B (en) High-precision underwater laser target recognition system
CN116486101B (en) Image feature matching method based on window attention
Cai et al. Semi-Supervised Segmentation of Interstitial Lung Disease Patterns from CT Images via Self-Training with Selective Re-Training
CN113408463B (en) Cell image small sample classification system based on distance measurement
CN118015332A (en) Remote sensing image saliency target detection method
CN111339782B (en) Sign language translation system and method based on multilevel semantic analysis

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant