CN114863179A - Endoscope image classification method based on multi-scale feature embedding and cross attention - Google Patents

Endoscope image classification method based on multi-scale feature embedding and cross attention

Info

Publication number
CN114863179A
Authority
CN
China
Prior art keywords
feature
feature map
scale
formula
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210542820.8A
Other languages
Chinese (zh)
Other versions
CN114863179B (en)
Inventor
史骏
张元�
汪逸飞
杨皓程
周泰然
李想
郑利平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei University of Technology filed Critical Hefei University of Technology
Priority to CN202210542820.8A priority Critical patent/CN114863179B/en
Publication of CN114863179A publication Critical patent/CN114863179A/en
Application granted granted Critical
Publication of CN114863179B publication Critical patent/CN114863179B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06V 10/764 — Image or video recognition or understanding using pattern recognition or machine learning; classification, e.g. of video objects
    • G06N 3/045 — Neural network architectures; combinations of networks
    • G06N 3/08 — Neural network learning methods
    • G06V 10/7715 — Feature extraction, e.g. by transforming the feature space; mappings, e.g. subspace methods
    • G06V 10/806 — Fusion of extracted features (combining data at the sensor, preprocessing, feature-extraction or classification level)
    • G06V 10/82 — Image or video recognition or understanding using neural networks
    • G16H 50/20 — ICT specially adapted for computer-aided medical diagnosis, e.g. based on medical expert systems
    • G06V 2201/03 — Recognition of patterns in medical or anatomical images

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Public Health (AREA)
  • Pathology (AREA)
  • Epidemiology (AREA)
  • Primary Health Care (AREA)
  • Image Processing (AREA)

Abstract

The invention provides an endoscope image classification method based on multi-scale feature embedding and cross attention, which comprises the following steps: acquiring labelled endoscope images of N classes; establishing a deep learning network based on multi-scale feature embedding and multi-head cross attention; constructing an endoscope image classifier; and predicting the class of an endoscope image with the trained classifier. The method fuses the rich semantic information of deep feature maps with the geometric detail information of shallow feature maps through multi-scale feature embedding, and combines a cross attention mechanism to resolve the ambiguity between semantic and geometric information across feature maps of different scales, so that more effective features are mined and endoscope images are classified accurately, thereby assisting doctors in diagnosis and image reading and improving reading efficiency.

Description

Endoscope image classification method based on multi-scale feature embedding and cross attention
Technical Field
The invention relates to the field of computer vision, in particular to an image classification technology, and specifically relates to an endoscope image classification method based on multi-scale feature embedding and cross attention.
Background
Endoscopy is the most common means of cancer diagnosis, and endoscopic image classification is of great clinical significance for early cancer screening. Traditional cancer diagnosis relies mainly on manual interpretation of endoscopic images by endoscopists, but in clinical practice their judgments are subject to inter-observer variability, the image-reading workload is heavy, and missed diagnoses and misdiagnoses occur from time to time. An accurate and efficient endoscopic diagnosis method is therefore needed, in which a computer assists doctors in reading endoscopic images, relieving the reading pressure on endoscopists and improving the accuracy of endoscopic image classification.
In recent years, deep learning frameworks have attracted wide attention in the field of computer vision, and researchers have applied them to various classification tasks. However, most deep-learning-based endoscopic image classification methods use a convolutional neural network model to extract features of an endoscopic image at a single scale and ignore information at other scales, which makes it difficult to further improve the accuracy of endoscopic image classification.
Disclosure of Invention
To remedy these shortcomings, the invention provides an endoscope image classification method based on multi-scale feature embedding and cross attention. The method aims to fuse the rich semantic information of deep feature maps with the geometric detail information of shallow feature maps through multi-scale feature embedding, and to combine a cross attention mechanism to resolve the ambiguity between semantic and geometric information across feature maps of different scales, thereby mining more effective features and achieving accurate classification of endoscope images.
In order to achieve the purpose, the invention adopts the following technical scheme:
according to an embodiment of the invention, the invention provides an endoscope image classification method based on multi-scale feature embedding and cross attention, which comprises the following steps:
Step 1, acquiring endoscope image samples of N classes, each of size C × H × W, and preprocessing the samples to obtain a training set E = {E_1, E_2, ..., E_n, ..., E_N}, wherein E_n = {E_n^1, E_n^2, ..., E_n^p, ..., E_n^P} denotes the n-th class of endoscope image samples, the n-th class containing P images in total, and E_n^p ∈ R^(C×H×W) denotes the p-th image in the n-th class of preprocessed endoscope image samples; C denotes the number of image channels, H denotes the image height, W denotes the image width, and n = 1, 2, ..., N;
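For illustration only, the sample acquisition and preprocessing of step 1 could be organized as in the following sketch. The directory layout (one sub-folder per class), the 224 × 224 target size and the normalization statistics are assumptions of this sketch, not values specified by the patent.

```python
import torch
from torchvision import datasets, transforms

# Hypothetical preprocessing for step 1: resize every endoscope image to a fixed
# C x H x W shape and collect the N class folders into a labelled training set E.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

train_set = datasets.ImageFolder('endoscope_images/train', transform=preprocess)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=16, shuffle=True)
```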
Step 2, establishing a deep learning network, processing the endoscope image sample set with the convolutional neural network of the deep learning network to output feature maps at different convolution stages, and performing dimension reduction on these feature maps to form the dimension-reduced output feature maps T_i^{n,p}, i = 1, 2, 3, 4;
Step 3, inputting the dimension-reduced output feature maps T_i^{n,p} into a pre-constructed multi-head cross attention encoder with multi-scale feature embedding, and outputting a feature map U_{n,p} after normalization and upsampling;
Step 4, inputting the feature map U_{n,p} into a convolution stage for feature extraction to output a feature map D_{n,p}, performing a global average pooling operation and a global max pooling operation on D_{n,p} respectively, concatenating and fusing the resulting feature maps to obtain a feature map D'_{n,p}, and inputting D'_{n,p} into a fully connected layer to obtain an N-dimensional classification result vector;
and Step 5, constructing an endoscope image classifier based on the N-dimensional classification result vector to classify the endoscope images.
Further, the step 2 specifically includes:
Step 2.1, establishing a deep learning network comprising: a multi-scale feature extraction module, a multi-scale feature embedding module and a multi-head cross attention encoder module;
step 2.2, constructing a multi-scale feature extraction module:
the multi-scale feature extraction module consists of four convolutional neural network stages, in order: a first convolution stage, a second convolution stage, a third convolution stage and a fourth convolution stage;
the p-th image E_n^p is input into the multi-scale feature extraction module and passes through the first, second, third and fourth convolution stages, yielding the feature map F_1^{n,p} output by the first convolution stage, the feature map F_2^{n,p} output by the second convolution stage, the feature map F_3^{n,p} output by the third convolution stage, and the feature map F_4^{n,p} output by the fourth convolution stage;
Step 2.3, constructing a multi-scale feature embedding module:
the multi-scale feature embedded module is formed by connecting 4 different embedded layers in parallel, wherein 4 embedded layers correspond to 4 embedded layers
Figure BDA0003650924180000036
1,2,3,4, each embedded layer comprising a convolution layer and a dimensionality reduction process;
outputting feature maps of four convolution stages
Figure BDA0003650924180000037
Input into a multi-scale feature embedding module,
Figure BDA0003650924180000038
i is 1,2,3,4, respectively, and is 2 by convolution kernel 5-i ×2 5-i The convolution layers are respectively output characteristic graphs after dimension reduction processing
Figure BDA0003650924180000039
i=1,2,3,4。
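A minimal PyTorch-style sketch of the multi-scale feature embedding module of step 2.3 is given below. The backbone channel counts and the embedding dimension are placeholders; only the kernel-size/stride rule 2^(5-i) is taken from the description.

```python
import torch.nn as nn

class MultiScaleFeatureEmbedding(nn.Module):
    """Four parallel embedding layers, one per convolution stage. Each is a
    convolution with kernel size and stride 2**(5-i), which (for a typical
    backbone whose stages halve the resolution) reduces the four feature maps
    F_1..F_4 to a common, smaller spatial size T_1..T_4."""
    def __init__(self, in_channels=(64, 128, 256, 512), embed_dim=256):
        super().__init__()
        self.embeds = nn.ModuleList([
            nn.Conv2d(c, embed_dim, kernel_size=2 ** (5 - i), stride=2 ** (5 - i))
            for i, c in enumerate(in_channels, start=1)
        ])

    def forward(self, feats):            # feats = [F_1, F_2, F_3, F_4]
        return [embed(f) for embed, f in zip(self.embeds, feats)]   # [T_1, T_2, T_3, T_4]
```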
Further, the step 3 specifically includes:
Step 3.1, constructing a multi-head cross attention encoder with multi-scale feature embedding:
the multi-head cross attention encoder module with multi-scale feature embedding is formed by connecting the multi-scale feature embedding of the 4 convolution stages in series with L multi-head cross attention encoders;
the 4 feature maps T_i^{n,p}, i = 1, 2, 3, 4, are input into the multi-scale feature embedding module, normalized by an LN layer, and converted from the channel-cross perspective, and the channel-cross feature maps T'_i^{n,p}, i = 1, 2, 3, 4, are obtained with formula (1):
T'_i^{n,p} = Transpose(T_i^{n,p})    (1)
in formula (1), Transpose(·) denotes the transposition of a feature map, T_i^{n,p} denotes a pixel feature map with C_i channels of size H_i·W_i each, and T'_i^{n,p} denotes a channel-cross pixel feature map with H_i·W_i rows of size C_i each;
Step 3.2, performing multi-scale embedding on the channel-cross feature maps T'_i^{n,p}, specifically obtaining the multi-scale feature embedding map Z^{n,p} with formula (2):
Z^{n,p} = Concat(T'_1^{n,p}, T'_2^{n,p}, T'_3^{n,p}, T'_4^{n,p})    (2)
in formula (2), Concat(·) denotes the feature-vector concatenation operation, and Z^{n,p} denotes the channel-cross feature map obtained after multi-scale feature embedding and transposition;
Step 3.3, the feature map Z^{n,p} serves as the input of the 1st multi-head cross attention encoder module, and the output of the c-th multi-head cross attention encoder module serves as the input of the (c+1)-th multi-head cross attention encoder module;
any c-th multi-head cross attention encoder module comprises 2 linear transformation layers and M parallel cross attention layers, c = 1, 2, ..., L;
Step 3.4, the feature maps are input into the c-th multi-head cross attention encoder module; the feature map Z^{n,p} is multiplied by two weight matrices W_m^K and W_m^V, and the feature maps T'_i^{n,p}, i = 1, 2, 3, 4, are multiplied by four weight matrices W_m^{Q_i}, respectively, outputting the feature maps K_{n,p}, V_{n,p} and Q_i^{n,p}, i = 1, 2, 3, 4, as shown in formula (3):
K_{n,p} = φ(Z^{n,p}) · W_m^K,  V_{n,p} = φ(Z^{n,p}) · W_m^V,  Q_i^{n,p} = φ(T'_i^{n,p}) · W_m^{Q_i}    (3)
in formula (3), φ(·) denotes a normalization function;
Step 3.5, the feature maps K_{n,p}, V_{n,p} and Q_i^{n,p}, i = 1, 2, 3, 4, output by the multi-scale feature embedding module are input into the 1st multi-head cross attention encoder; after linear transformation, K_{n,p}, V_{n,p} and Q_i^{n,p} are fed into the M cross attention heads, in which each Q_i^{n,p} is multiplied by K_{n,p}, activated by a Softmax function, and finally multiplied by V_{n,p} to obtain the attention output A_i^m, as shown in formula (4):
A_i^m = δ(ψ(Q_i^{n,p} · K_{n,p}^T)) · V_{n,p}    (4)
in formula (4), ψ(·) is a normalization function and δ(·) is the Softmax function;
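Formula (4) has the form of a standard cross attention head. The sketch below assumes ψ(·) is the usual 1/√d scaling, which the patent does not state explicitly; tensor names are illustrative.

```python
import torch

def cross_attention_head(Q_i, K, V):
    """One cross attention head in the spirit of formula (4).

    Q_i: (B, N_i, d) queries from the i-th scale's channel-cross feature map
    K:   (B, N,  d)  keys from the concatenated multi-scale embedding
    V:   (B, N,  d)  values from the concatenated multi-scale embedding
    """
    d = Q_i.size(-1)
    scores = Q_i @ K.transpose(-2, -1) / d ** 0.5   # psi(.): scaling, assumed to be 1/sqrt(d)
    attn = torch.softmax(scores, dim=-1)            # delta(.): Softmax activation
    return attn @ V                                 # weighted sum of the values
```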
Step 3.6, on the basis of the attention feature maps A_i^m, the attention feature maps of different heads are dynamically fused to form new attention feature maps, as shown in formula (5):
Â_i^m = Σ_{m'=1}^{M} W^A_{m,m'} · A_i^{m'}    (5)
in formula (5), W^A is a learnable transformation matrix by which the multi-head attention feature maps are fused to generate new attention feature maps;
the M cross attention layer outputs Â_i^m, i = 1, 2, 3, 4, m = 1, 2, ..., M, are then combined with formula (6) to obtain the feature map O_i^c, c = 1, 2, ..., L, i = 1, 2, 3, 4:
O_i^c = Concat(Â_i^1, Â_i^2, ..., Â_i^M)    (6)
in formula (6), M is the number of cross attention heads, and Â_i^m denotes the feature map generated by the i-th query feature map Q_i through the m-th cross attention layer in the c-th multi-head cross attention encoder module;
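One possible reading of formulas (5) and (6) is the head-mixing sketch below, in which a learnable M × M matrix fuses the per-head attention outputs before they are concatenated; the exact parameterization of the transformation matrix is an assumption.

```python
import torch
import torch.nn as nn

class HeadFusion(nn.Module):
    """Dynamic fusion of the M per-head attention maps followed by their
    concatenation (formula (5)/(6)-style sketch)."""
    def __init__(self, num_heads: int):
        super().__init__()
        self.mix = nn.Parameter(torch.eye(num_heads))            # learnable transformation matrix

    def forward(self, heads):                                    # heads: (B, M, N_i, d)
        fused = torch.einsum('km,bmnd->bknd', self.mix, heads)   # formula (5): mix across heads
        B, M, N, d = fused.shape
        return fused.permute(0, 2, 1, 3).reshape(B, N, M * d)    # formula (6): concatenate the M heads
```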
Step 3.7, the feature maps O_i^c, i = 1, 2, 3, 4, obtained after multi-head cross attention are processed by linear transformation and normalization, and the output Y_i^c, i = 1, 2, 3, 4, of the multi-head cross attention encoder module is obtained with formula (7):
Y_i^c = σ(δ(σ(O_i^c)))    (7)
in formula (7), δ(·) denotes the GeLU function and σ(·) denotes a linear transformation function;
when c ≠ L, the output of the c-th multi-head cross attention encoder module is input into the next multi-head cross attention encoder module;
when c = L, the output Y^L of the L-th multi-head cross attention encoder module is upsampled with formula (8) to obtain the feature map U'_{n,p}:
U'_{n,p} = δ(φ(μ(Y^L)))    (8)
in formula (8), μ(·) denotes the upsampling function, φ(·) denotes a normalization function, and δ(·) denotes the ReLU function;
Step 3.8, the upsampled feature map U'_{n,p} is fused with the feature map F_4^{n,p} output by the convolution stage to obtain the output U_{n,p}, as shown in formula (9):
U_{n,p} = U'_{n,p} + F_4^{n,p}    (9)
in formula (9), F_4^{n,p} is the feature map output by the fourth convolution stage of the multi-scale feature extraction module.
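Steps 3.7 and 3.8 can be sketched as follows. The bilinear interpolation, the square token grid and the element-wise addition used for the fusion of formula (9) are assumptions of this sketch; the patent only specifies upsampling, normalization, ReLU activation and fusion with the stage-4 feature map.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpsampleFuse(nn.Module):
    """Upsample the last encoder output and fuse it with the stage-4 CNN
    feature map (formula (8)/(9)-style sketch)."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.BatchNorm2d(dim)

    def forward(self, y_tokens, f4):
        # y_tokens: (B, h*w, C) tokens from the L-th encoder; f4: (B, C, H4, W4)
        B, N, C = y_tokens.shape
        h = w = int(N ** 0.5)                                    # assumes a square token grid
        y = y_tokens.transpose(1, 2).reshape(B, C, h, w)         # back to a spatial map
        y = F.interpolate(y, size=f4.shape[-2:], mode='bilinear',
                          align_corners=False)                   # mu(.): upsampling
        y = torch.relu(self.norm(y))                             # delta(phi(.)): ReLU after normalization
        return y + f4                                            # fusion with the stage-4 feature map
```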
Further, the step 4 specifically includes:
Step 4.1, inputting the feature map U_{n,p} output by the multi-head cross attention encoder with multi-scale feature embedding into a convolution stage for feature extraction, and outputting a feature map D_{n,p};
Step 4.2, performing a global average pooling operation and a global max pooling operation on the feature map D_{n,p} output by the convolution stage, respectively, and concatenating and fusing the resulting feature maps to obtain the result D'_{n,p}, as shown in formula (10):
D'_{n,p} = Concat(GAP(D_{n,p}), GMP(D_{n,p}))    (10)
in formula (10), Concat(·) denotes the feature-vector concatenation operation, GAP(D_{n,p}) is the feature map output by D_{n,p} after global average pooling, and GMP(D_{n,p}) is the feature map output by D_{n,p} after global max pooling;
the feature map D'_{n,p} is input into a fully connected layer to obtain the N-dimensional classification result vector.
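The classification head of step 4.2 (formula (10)) is straightforward: global average pooling and global max pooling of D_{n,p} are concatenated and fed to a fully connected layer that produces the N-dimensional result vector, as in the sketch below.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Concatenate global average pooling and global max pooling of D_{n,p},
    then classify with a fully connected layer (formula (10)-style sketch)."""
    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.fc = nn.Linear(2 * in_channels, num_classes)

    def forward(self, d):                              # d: (B, C, H, W)
        avg = torch.mean(d, dim=(2, 3))                # global average pooling -> (B, C)
        mx = torch.amax(d, dim=(2, 3))                 # global max pooling     -> (B, C)
        return self.fc(torch.cat([avg, mx], dim=1))    # N-dimensional classification vector
```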
Further, the step 5 specifically includes: establishing a cross entropy loss function, inputting a training sample set into the deep learning network for training, and then adopting a back propagation algorithm to carry out optimization solution on the cross entropy loss function, thereby adjusting all parameters in the deep learning network and obtaining the endoscope image classifier for classifying the endoscope images.
Compared with the prior art, the invention has the following advantages:
the invention constructs an endoscope image classification model by using an endoscope image classification method based on multi-scale feature embedding and cross attention. The general convolutional neural network classification depends on semantic information of the deep feature map and ignores geometric detail information of the shallow feature map. According to the method, the deep characteristic diagram rich in semantic information and the shallow characteristic diagram rich in geometric detail information are fused through multi-scale embedding, ambiguity of semantic information and geometric information between characteristic diagrams of different scales is eliminated from the angle of channel intersection, and more effective characteristics are extracted, so that the accuracy of endoscope image classification is improved, a doctor is assisted in diagnosis and film reading, and the pressure of the endoscope doctor in film reading is reduced.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a block diagram of a deep learning network according to the present invention;
FIG. 3 is a schematic diagram of a multi-headed cross attention encoder module according to the present invention.
Detailed Description
For the convenience of understanding, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In this embodiment, an endoscope image classification method based on multi-scale feature embedding and cross attention includes, as shown in fig. 1, the following specific steps:
Step 1, acquiring endoscope image samples of N classes, each of size C × H × W, and preprocessing the samples to obtain a training set E = {E_1, E_2, ..., E_n, ..., E_N}, wherein E_n = {E_n^1, E_n^2, ..., E_n^p, ..., E_n^P} denotes the n-th class of endoscope image samples, the n-th class containing P images in total, and E_n^p ∈ R^(C×H×W) denotes the p-th image in the n-th class of preprocessed endoscope image samples; C denotes the number of image channels, H denotes the image height, W denotes the image width, and n = 1, 2, ..., N;
Step 2, as shown in fig. 2, establishing a deep learning network, processing the endoscope image sample set with the convolutional neural network of the deep learning network to output feature maps at different convolution stages, and performing dimension reduction on these feature maps to form the dimension-reduced output feature maps T_i^{n,p}, i = 1, 2, 3, 4.
Step 2.1, establishing a deep learning network, wherein the deep learning network comprises the following steps: the system comprises a multi-scale feature extraction module, a multi-scale feature embedding module and a multi-head cross attention encoder module;
step 2.2, constructing a multi-scale feature extraction module:
the multi-scale feature extraction module consists of four convolutional neural network stages, in order: a first convolution stage, a second convolution stage, a third convolution stage and a fourth convolution stage;
the p-th image E_n^p is input into the multi-scale feature extraction module and passes through the first, second, third and fourth convolution stages, yielding the feature map F_1^{n,p} output by the first convolution stage, the feature map F_2^{n,p} output by the second convolution stage, the feature map F_3^{n,p} output by the third convolution stage, and the feature map F_4^{n,p} output by the fourth convolution stage;
Step 2.3, constructing a multi-scale feature embedding module:
the multi-scale feature embedded module is formed by connecting 4 different embedded layers in parallel, wherein 4 embedded layers correspond to 4 embedded layers
Figure BDA0003650924180000089
i-1, 2,3,4, each embedded layer comprising a convolutional layer and a dimensionality reduction process;
outputting feature maps of four convolution stages
Figure BDA00036509241800000810
Input into a multi-scale feature embedding module,
Figure BDA00036509241800000811
i is 1,2,3,4, respectively, and is 2 by convolution kernel 5-i ×2 5-i The convolution layers are respectively output characteristic graphs after dimension reduction processing
Figure BDA0003650924180000091
i=1,2,3,4。
Step 3, inputting the dimension-reduced output feature maps T_i^{n,p} into the pre-constructed multi-head cross attention encoder with multi-scale feature embedding, and outputting a feature map U_{n,p} after normalization and upsampling.
Step 3.1, constructing a multi-head cross attention encoder with multi-scale feature embedding:
the multi-head cross attention encoder module with multi-scale feature embedding is formed by connecting the multi-scale feature embedding of the 4 convolution stages in series with L multi-head cross attention encoders;
the 4 feature maps T_i^{n,p}, i = 1, 2, 3, 4, are input into the multi-scale feature embedding module, normalized by an LN layer, and converted from the channel-cross perspective, and the channel-cross feature maps T'_i^{n,p}, i = 1, 2, 3, 4, are obtained with formula (1):
T'_i^{n,p} = Transpose(T_i^{n,p})    (1)
in formula (1), Transpose(·) denotes the transposition of a feature map, T_i^{n,p} denotes a pixel feature map with C_i channels of size H_i·W_i each, and T'_i^{n,p} denotes a channel-cross pixel feature map with H_i·W_i rows of size C_i each;
Step 3.2, performing multi-scale embedding on the channel-cross feature maps T'_i^{n,p}, specifically obtaining the multi-scale feature embedding map Z^{n,p} with formula (2):
Z^{n,p} = Concat(T'_1^{n,p}, T'_2^{n,p}, T'_3^{n,p}, T'_4^{n,p})    (2)
in formula (2), Concat(·) denotes the feature-vector concatenation operation, and Z^{n,p} denotes the channel-cross feature map obtained after multi-scale feature embedding and transposition;
Step 3.3, the feature map Z^{n,p} serves as the input of the 1st multi-head cross attention encoder module, and the output of the c-th multi-head cross attention encoder module serves as the input of the (c+1)-th multi-head cross attention encoder module;
as shown in fig. 3, any c-th multi-head cross attention encoder module comprises 2 linear transformation layers and M parallel cross attention layers, c = 1, 2, ..., L;
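For orientation, the sketch below assembles one encoder module in the spirit of fig. 3: per-scale query projections, shared key/value projections from the concatenated multi-scale embedding, M cross attention heads with a learnable head-fusion matrix, and a GeLU feed-forward part standing in for the two linear transformation layers. The head splitting, dimensions and layer placement are simplifications, not the patent's exact design.

```python
import torch
import torch.nn as nn

class MultiHeadCrossAttentionEncoder(nn.Module):
    """One of the L encoder modules (fig. 3-style sketch); all names are illustrative."""
    def __init__(self, dim: int, num_heads: int, num_scales: int = 4):
        super().__init__()
        self.num_heads = num_heads
        self.norm = nn.LayerNorm(dim)
        self.q_proj = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_scales)])
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.mix = nn.Parameter(torch.eye(num_heads))      # learnable head-fusion matrix
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, scale_tokens, z):
        # scale_tokens: list of 4 tensors (B, N_i, dim); z: (B, N, dim) concatenated embedding
        B, N, D = z.shape
        H, d = self.num_heads, D // self.num_heads
        k = self.k_proj(self.norm(z)).reshape(B, N, H, d).transpose(1, 2)   # keys   (B, H, N, d)
        v = self.v_proj(self.norm(z)).reshape(B, N, H, d).transpose(1, 2)   # values (B, H, N, d)
        outputs = []
        for i, tokens in enumerate(scale_tokens):
            Ni = tokens.size(1)
            q = self.q_proj[i](self.norm(tokens)).reshape(B, Ni, H, d).transpose(1, 2)
            attn = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)  # per-head attention
            heads = attn @ v                                                  # (B, H, N_i, d)
            heads = torch.einsum('km,bmnd->bknd', self.mix, heads)            # fuse across heads
            o = heads.transpose(1, 2).reshape(B, Ni, D)                       # concatenate the heads
            outputs.append(self.ffn(o))                                       # linear -> GeLU -> linear
        return outputs                                                        # one map per scale
```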
Step 3.4, the feature maps are input into the c-th multi-head cross attention encoder module; the feature map Z^{n,p} is multiplied by two weight matrices W_m^K and W_m^V, and the feature maps T'_i^{n,p}, i = 1, 2, 3, 4, are multiplied by four weight matrices W_m^{Q_i}, respectively, outputting the feature maps K_{n,p}, V_{n,p} and Q_i^{n,p}, i = 1, 2, 3, 4, as shown in formula (3):
K_{n,p} = φ(Z^{n,p}) · W_m^K,  V_{n,p} = φ(Z^{n,p}) · W_m^V,  Q_i^{n,p} = φ(T'_i^{n,p}) · W_m^{Q_i}    (3)
in formula (3), φ(·) denotes a normalization function;
Step 3.5, the feature maps K_{n,p}, V_{n,p} and Q_i^{n,p}, i = 1, 2, 3, 4, output by the multi-scale feature embedding module are input into the 1st multi-head cross attention encoder; after linear transformation, K_{n,p}, V_{n,p} and Q_i^{n,p} are fed into the M cross attention heads, in which each Q_i^{n,p} is multiplied by K_{n,p}, activated by a Softmax function, and finally multiplied by V_{n,p} to obtain the attention output A_i^m, as shown in formula (4):
A_i^m = δ(ψ(Q_i^{n,p} · K_{n,p}^T)) · V_{n,p}    (4)
in formula (4), ψ(·) is a normalization function and δ(·) is the Softmax function;
Step 3.6, on the basis of the attention feature maps A_i^m, the attention feature maps of different heads are dynamically fused to form new attention feature maps, as shown in formula (5):
Â_i^m = Σ_{m'=1}^{M} W^A_{m,m'} · A_i^{m'}    (5)
in formula (5), W^A is a learnable transformation matrix by which the multi-head attention feature maps are fused to generate new attention feature maps;
the M cross attention layer outputs Â_i^m, i = 1, 2, 3, 4, m = 1, 2, ..., M, are then combined with formula (6) to obtain the feature map O_i^c, c = 1, 2, ..., L, i = 1, 2, 3, 4:
O_i^c = Concat(Â_i^1, Â_i^2, ..., Â_i^M)    (6)
in formula (6), M is the number of cross attention heads, and Â_i^m denotes the feature map generated by the i-th query feature map Q_i through the m-th cross attention layer in the c-th multi-head cross attention encoder module;
Step 3.7, the feature maps O_i^c, i = 1, 2, 3, 4, obtained after multi-head cross attention are processed by linear transformation and normalization, and the output Y_i^c, i = 1, 2, 3, 4, of the multi-head cross attention encoder module is obtained with formula (7):
Y_i^c = σ(δ(σ(O_i^c)))    (7)
in formula (7), δ(·) denotes the GeLU function and σ(·) denotes a linear transformation function;
when c ≠ L, the output of the c-th multi-head cross attention encoder module is input into the next multi-head cross attention encoder module;
when c = L, the output Y^L of the L-th multi-head cross attention encoder module is upsampled with formula (8) to obtain the feature map U'_{n,p}:
U'_{n,p} = δ(φ(μ(Y^L)))    (8)
in formula (8), μ(·) denotes the upsampling function, φ(·) denotes a normalization function, and δ(·) denotes the ReLU function;
Step 3.8, the upsampled feature map U'_{n,p} is fused with the feature map F_4^{n,p} output by the convolution stage to obtain the output U_{n,p}, as shown in formula (9):
U_{n,p} = U'_{n,p} + F_4^{n,p}    (9)
in formula (9), F_4^{n,p} is the feature map output by the fourth convolution stage of the multi-scale feature extraction module.
Step 4, the feature map U_{n,p} is input into a convolution stage for feature extraction to output a feature map D_{n,p}; a global average pooling operation and a global max pooling operation are performed on D_{n,p} respectively, the resulting feature maps are concatenated and fused to obtain the feature map D'_{n,p}, and D'_{n,p} is input into a fully connected layer to obtain the N-dimensional classification result vector.
Step 4.1, the feature map U_{n,p} output by the multi-head cross attention encoder with multi-scale feature embedding is input into a convolution stage for feature extraction, and a feature map D_{n,p} is output;
Step 4.2, a global average pooling operation and a global max pooling operation are performed on the feature map D_{n,p} output by the convolution stage, respectively, and the resulting feature maps are concatenated and fused to obtain the result D'_{n,p}, as shown in formula (10):
D'_{n,p} = Concat(GAP(D_{n,p}), GMP(D_{n,p}))    (10)
in formula (10), Concat(·) denotes the feature-vector concatenation operation, GAP(D_{n,p}) is the feature map output by D_{n,p} after global average pooling, and GMP(D_{n,p}) is the feature map output by D_{n,p} after global max pooling;
the feature map D'_{n,p} is input into a fully connected layer to obtain the N-dimensional classification result vector.
And Step 5, constructing an endoscope image classifier based on the N-dimensional classification result vector to classify the endoscope images.
The step 5 specifically comprises: establishing the cross entropy loss function shown in formula (11), inputting the training sample set into the deep learning network for training, and then optimizing the cross entropy loss function with a back propagation algorithm, thereby adjusting all parameters in the deep learning network and obtaining the endoscope image classifier for classifying the endoscope images:
CE(p, q) = −Σ_{i=1}^{C} p_i · log(q_i)    (11)
in formula (11), C denotes the number of classes, p_i denotes the ground-truth probability that a sample belongs to class i, q_i denotes the predicted probability that the sample belongs to class i, and CE(p, q) denotes the classification loss on the sample.
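Step 5 can be sketched as an ordinary supervised training loop. The optimizer choice (Adam), learning rate and number of epochs below are assumptions; the patent only specifies a cross entropy loss optimized by back propagation, and `model` / `train_loader` stand in for the full network of fig. 2 and the preprocessed training set E.

```python
import torch
import torch.nn as nn

def train_classifier(model, train_loader, num_epochs=50, lr=1e-4, device='cuda'):
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()                      # cross entropy loss, formula (11)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(num_epochs):
        model.train()
        for images, labels in train_loader:                # images: (B, C, H, W); labels: class indices
            images, labels = images.to(device), labels.to(device)
            logits = model(images)                         # N-dimensional classification vectors
            loss = criterion(logits, labels)
            optimizer.zero_grad()
            loss.backward()                                # back propagation
            optimizer.step()                               # adjust all network parameters
    return model
```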
It will be evident to those skilled in the art that the embodiments of the present invention are not limited to the details of the foregoing illustrative embodiments, and that the embodiments of the present invention are capable of being embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the embodiments being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. Several units, modules or means recited in the system, apparatus or terminal claims may also be implemented by one and the same unit, module or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.
Finally, it should be noted that the above embodiments are only used for illustrating the technical solutions of the embodiments of the present invention and not for limiting, and although the embodiments of the present invention are described in detail with reference to the above preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions can be made on the technical solutions of the embodiments of the present invention without departing from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (5)

1. A method for classifying endoscopic images based on multi-scale feature embedding and cross-attention, the method comprising:
step 1, acquiring endoscope image samples of N classes, each of size C × H × W, and preprocessing the samples to obtain a training set E = {E_1, E_2, ..., E_n, ..., E_N}, wherein E_n = {E_n^1, E_n^2, ..., E_n^p, ..., E_n^P} denotes the n-th class of endoscope image samples, the n-th class containing P images in total, and E_n^p ∈ R^(C×H×W) denotes the p-th image in the n-th class of preprocessed endoscope image samples; C denotes the number of image channels, H denotes the image height, W denotes the image width, and n = 1, 2, ..., N;
step 2, establishing a deep learning network, processing the endoscope image sample set with the convolutional neural network of the deep learning network to output feature maps at different convolution stages, and performing dimension reduction on these feature maps to form the dimension-reduced output feature maps T_i^{n,p}, i = 1, 2, 3, 4;
step 3, inputting the dimension-reduced output feature maps T_i^{n,p} into a pre-constructed multi-head cross attention encoder with multi-scale feature embedding, and outputting a feature map U_{n,p} after normalization and upsampling;
step 4, inputting the feature map U_{n,p} into a convolution stage for feature extraction to output a feature map D_{n,p}, performing a global average pooling operation and a global max pooling operation on D_{n,p} respectively, concatenating and fusing the resulting feature maps to obtain a feature map D'_{n,p}, and inputting D'_{n,p} into a fully connected layer to obtain an N-dimensional classification result vector;
and step 5, constructing an endoscope image classifier based on the N-dimensional classification result vector to classify the endoscope images.
2. The endoscopic image classification method according to claim 1, characterized in that said step 2 specifically comprises:
step 2.1, establishing a deep learning network comprising: a multi-scale feature extraction module, a multi-scale feature embedding module and a multi-head cross attention encoder module;
step 2.2, constructing a multi-scale feature extraction module:
the multi-scale feature extraction module consists of four convolutional neural network stages, in order: a first convolution stage, a second convolution stage, a third convolution stage and a fourth convolution stage;
the p-th image E_n^p is input into the multi-scale feature extraction module and passes through the first, second, third and fourth convolution stages, yielding the feature map F_1^{n,p} output by the first convolution stage, the feature map F_2^{n,p} output by the second convolution stage, the feature map F_3^{n,p} output by the third convolution stage, and the feature map F_4^{n,p} output by the fourth convolution stage;
step 2.3, constructing a multi-scale feature embedding module:
the multi-scale feature embedding module is formed by connecting 4 different embedding layers in parallel, the 4 embedding layers corresponding to F_i^{n,p}, i = 1, 2, 3, 4, and each embedding layer comprising a convolution layer and a dimension-reduction operation;
the feature maps F_i^{n,p}, i = 1, 2, 3, 4, output by the four convolution stages are input into the multi-scale feature embedding module and passed through convolution layers with kernel size 2^(5-i) × 2^(5-i), respectively, which output the dimension-reduced feature maps T_i^{n,p}, i = 1, 2, 3, 4.
3. The endoscopic image classification method according to claim 2, characterized in that said step 3 specifically comprises:
step 3.1, constructing a multi-head cross attention encoder with multi-scale feature embedding:
the multi-head cross attention encoder module with multi-scale feature embedding is formed by connecting the multi-scale feature embedding of the 4 convolution stages in series with L multi-head cross attention encoders;
the 4 feature maps T_i^{n,p}, i = 1, 2, 3, 4, are input into the multi-scale feature embedding module, normalized by an LN layer, and converted from the channel-cross perspective, and the channel-cross feature maps T'_i^{n,p}, i = 1, 2, 3, 4, are obtained with formula (1):
T'_i^{n,p} = Transpose(T_i^{n,p})    (1)
in formula (1), Transpose(·) denotes the transposition of a feature map, T_i^{n,p} denotes a pixel feature map with C_i channels of size H_i·W_i each, and T'_i^{n,p} denotes a channel-cross pixel feature map with H_i·W_i rows of size C_i each;
step 3.2, performing multi-scale embedding on the channel-cross feature maps T'_i^{n,p}, specifically obtaining the multi-scale feature embedding map Z^{n,p} with formula (2):
Z^{n,p} = Concat(T'_1^{n,p}, T'_2^{n,p}, T'_3^{n,p}, T'_4^{n,p})    (2)
in formula (2), Concat(·) denotes the feature-vector concatenation operation, and Z^{n,p} denotes the channel-cross feature map obtained after multi-scale feature embedding and transposition;
step 3.3, the feature map Z^{n,p} serves as the input of the 1st multi-head cross attention encoder module, and the output of the c-th multi-head cross attention encoder module serves as the input of the (c+1)-th multi-head cross attention encoder module;
any c-th multi-head cross attention encoder module comprises 2 linear transformation layers and M parallel cross attention layers, c = 1, 2, ..., L;
step 3.4, the feature maps are input into the c-th multi-head cross attention encoder module; the feature map Z^{n,p} is multiplied by two weight matrices W_m^K and W_m^V, and the feature maps T'_i^{n,p}, i = 1, 2, 3, 4, are multiplied by four weight matrices W_m^{Q_i}, respectively, outputting the feature maps K_{n,p}, V_{n,p} and Q_i^{n,p}, i = 1, 2, 3, 4, as shown in formula (3):
K_{n,p} = φ(Z^{n,p}) · W_m^K,  V_{n,p} = φ(Z^{n,p}) · W_m^V,  Q_i^{n,p} = φ(T'_i^{n,p}) · W_m^{Q_i}    (3)
in formula (3), φ(·) denotes a normalization function;
step 3.5, the feature maps K_{n,p}, V_{n,p} and Q_i^{n,p} output by the multi-scale feature embedding module are input into the 1st multi-head cross attention encoder; after linear transformation, K_{n,p}, V_{n,p} and Q_i^{n,p} are fed into the M cross attention heads, in which each Q_i^{n,p} is multiplied by K_{n,p}, activated by a Softmax function, and finally multiplied by V_{n,p} to obtain the attention output A_i^m, as shown in formula (4):
A_i^m = δ(ψ(Q_i^{n,p} · K_{n,p}^T)) · V_{n,p}    (4)
in formula (4), ψ(·) is a normalization function and δ(·) is the Softmax function;
step 3.6, on the basis of the attention feature maps A_i^m, the attention feature maps of different heads are dynamically fused to form new attention feature maps, as shown in formula (5):
Â_i^m = Σ_{m'=1}^{M} W^A_{m,m'} · A_i^{m'}    (5)
in formula (5), W^A is a learnable transformation matrix by which the multi-head attention feature maps are fused to generate new attention feature maps;
the M cross attention layer outputs Â_i^m, i = 1, 2, 3, 4, m = 1, 2, ..., M, are then combined with formula (6) to obtain the feature map O_i^c, c = 1, 2, ..., L, i = 1, 2, 3, 4:
O_i^c = Concat(Â_i^1, Â_i^2, ..., Â_i^M)    (6)
in formula (6), M is the number of cross attention heads, and Â_i^m denotes the feature map generated by the i-th query feature map Q_i through the m-th cross attention layer in the c-th multi-head cross attention encoder module;
step 3.7, the feature maps O_i^c, i = 1, 2, 3, 4, obtained after multi-head cross attention are processed by linear transformation and normalization, and the output Y_i^c, i = 1, 2, 3, 4, of the multi-head cross attention encoder module is obtained with formula (7):
Y_i^c = σ(δ(σ(O_i^c)))    (7)
in formula (7), δ(·) denotes the GeLU function and σ(·) denotes a linear transformation function;
when c ≠ L, the output of the c-th multi-head cross attention encoder module is input into the next multi-head cross attention encoder module;
when c = L, the output Y^L of the L-th multi-head cross attention encoder module is upsampled with formula (8) to obtain the feature map U'_{n,p}:
U'_{n,p} = δ(φ(μ(Y^L)))    (8)
in formula (8), μ(·) denotes the upsampling function, φ(·) denotes a normalization function, and δ(·) denotes the ReLU function;
step 3.8, the upsampled feature map U'_{n,p} is fused with the feature map F_4^{n,p} output by the convolution stage to obtain the output U_{n,p}, as shown in formula (9):
U_{n,p} = U'_{n,p} + F_4^{n,p}    (9)
in formula (9), F_4^{n,p} is the feature map output by the fourth convolution stage of the multi-scale feature extraction module.
4. The endoscopic image classification method according to claim 3, characterized in that said step 4 specifically comprises:
step 4.1, inputting the feature map U_{n,p} output by the multi-head cross attention encoder with multi-scale feature embedding into a convolution stage for feature extraction, and outputting a feature map D_{n,p};
step 4.2, performing a global average pooling operation and a global max pooling operation on the feature map D_{n,p} output by the convolution stage, respectively, and concatenating and fusing the resulting feature maps to obtain the result D'_{n,p}, as shown in formula (10):
D'_{n,p} = Concat(GAP(D_{n,p}), GMP(D_{n,p}))    (10)
in formula (10), Concat(·) denotes the feature-vector concatenation operation, GAP(D_{n,p}) is the feature map output by D_{n,p} after global average pooling, and GMP(D_{n,p}) is the feature map output by D_{n,p} after global max pooling;
the feature map D'_{n,p} is input into a fully connected layer to obtain the N-dimensional classification result vector.
5. The endoscopic image classification method according to claim 4, wherein said step 5 specifically includes: establishing a cross entropy loss function, inputting a training sample set into the deep learning network for training, and then adopting a back propagation algorithm to carry out optimization solution on the cross entropy loss function, thereby adjusting all parameters in the deep learning network and obtaining the endoscope image classifier for classifying the endoscope images.
CN202210542820.8A 2022-05-18 2022-05-18 Endoscope image classification method based on multi-scale feature embedding and cross attention Active CN114863179B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210542820.8A CN114863179B (en) 2022-05-18 2022-05-18 Endoscope image classification method based on multi-scale feature embedding and cross attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210542820.8A CN114863179B (en) 2022-05-18 2022-05-18 Endoscope image classification method based on multi-scale feature embedding and cross attention

Publications (2)

Publication Number Publication Date
CN114863179A true CN114863179A (en) 2022-08-05
CN114863179B CN114863179B (en) 2022-12-13

Family

ID=82638829

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210542820.8A Active CN114863179B (en) 2022-05-18 2022-05-18 Endoscope image classification method based on multi-scale feature embedding and cross attention

Country Status (1)

Country Link
CN (1) CN114863179B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116188436A (en) * 2023-03-03 2023-05-30 合肥工业大学 Cystoscope image classification method based on fusion of local features and global features
CN117522884A (en) * 2024-01-05 2024-02-06 武汉理工大学三亚科教创新园 Ocean remote sensing image semantic segmentation method and device and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109034253A (en) * 2018-08-01 2018-12-18 华中科技大学 A kind of chronic venous disease image classification method based on multiscale semanteme feature
CN113378791A (en) * 2021-07-09 2021-09-10 合肥工业大学 Cervical cell classification method based on double-attention mechanism and multi-scale feature fusion
US20210390338A1 (en) * 2020-06-15 2021-12-16 Dalian University Of Technology Deep network lung texture recogniton method combined with multi-scale attention
WO2022073452A1 (en) * 2020-10-07 2022-04-14 武汉大学 Hyperspectral remote sensing image classification method based on self-attention context network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109034253A (en) * 2018-08-01 2018-12-18 华中科技大学 A kind of chronic venous disease image classification method based on multiscale semanteme feature
US20210390338A1 (en) * 2020-06-15 2021-12-16 Dalian University Of Technology Deep network lung texture recogniton method combined with multi-scale attention
WO2022073452A1 (en) * 2020-10-07 2022-04-14 武汉大学 Hyperspectral remote sensing image classification method based on self-attention context network
CN113378791A (en) * 2021-07-09 2021-09-10 合肥工业大学 Cervical cell classification method based on double-attention mechanism and multi-scale feature fusion

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
PENG LI et al.: "Bi-Modal Learning With Channel-Wise Attention for Multi-Label Image Classification", IEEE Access *
韩旭 et al.: "Classification method for tomato leaf nutrient-deficiency images based on attention mechanism and multi-scale feature fusion", 《农业工程学报》 (Transactions of the Chinese Society of Agricultural Engineering) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116188436A (en) * 2023-03-03 2023-05-30 合肥工业大学 Cystoscope image classification method based on fusion of local features and global features
CN116188436B (en) * 2023-03-03 2023-11-10 合肥工业大学 Cystoscope image classification method based on fusion of local features and global features
CN117522884A (en) * 2024-01-05 2024-02-06 武汉理工大学三亚科教创新园 Ocean remote sensing image semantic segmentation method and device and electronic equipment
CN117522884B (en) * 2024-01-05 2024-05-17 武汉理工大学三亚科教创新园 Ocean remote sensing image semantic segmentation method and device and electronic equipment

Also Published As

Publication number Publication date
CN114863179B (en) 2022-12-13

Similar Documents

Publication Publication Date Title
CN114863179B (en) Endoscope image classification method based on multi-scale feature embedding and cross attention
CN108596248B (en) Remote sensing image classification method based on improved deep convolutional neural network
CN112116605B (en) Pancreas CT image segmentation method based on integrated depth convolution neural network
CN104794504B (en) Pictorial pattern character detecting method based on deep learning
CN110276402B (en) Salt body identification method based on deep learning semantic boundary enhancement
CN113239954B (en) Attention mechanism-based image semantic segmentation feature fusion method
CN111401156B (en) Image identification method based on Gabor convolution neural network
CN113378792B (en) Weak supervision cervical cell image analysis method fusing global and local information
CN112149720A (en) Fine-grained vehicle type identification method
CN112347908B (en) Surgical instrument image identification method based on space grouping attention model
CN113378791A (en) Cervical cell classification method based on double-attention mechanism and multi-scale feature fusion
CN113344044A (en) Cross-species medical image classification method based on domain self-adaptation
CN114998647B (en) Breast cancer full-size pathological image classification method based on attention multi-instance learning
CN114530222A (en) Cancer patient classification system based on multiomics and image data fusion
CN114037699B (en) Pathological image classification method, equipment, system and storage medium
CN114820481A (en) Lung cancer histopathology full-section EGFR state prediction method based on converter
Yin et al. Pyramid tokens-to-token vision transformer for thyroid pathology image classification
CN113221948B (en) Digital slice image classification method based on countermeasure generation network and weak supervised learning
CN116758621B (en) Self-attention mechanism-based face expression depth convolution identification method for shielding people
CN113963232A (en) Network graph data extraction method based on attention learning
CN116935044B (en) Endoscopic polyp segmentation method with multi-scale guidance and multi-level supervision
CN115331047A (en) Earthquake image interpretation method based on attention mechanism
CN115329821A (en) Ship noise identification method based on pairing coding network and comparison learning
CN113989528A (en) Hyperspectral image feature representation method based on depth joint sparse-collaborative representation
CN113192076A (en) MRI brain tumor image segmentation method combining classification prediction and multi-scale feature extraction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant