CN114863179A - Endoscope image classification method based on multi-scale feature embedding and cross attention
- Publication number: CN114863179A (application CN202210542820.8A)
- Authority: CN (China)
- Prior art keywords: feature, feature map, scale, formula, output
- Prior art date: 2022-05-18
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications

- G06V10/764: Image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
- G06N3/045: Computing arrangements based on biological models; neural network architectures; combinations of networks
- G06N3/08: Computing arrangements based on biological models; neural networks; learning methods
- G06V10/7715: Processing image or video features in feature spaces; feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; mappings, e.g. subspace methods
- G06V10/806: Fusion, i.e. combining data from various sources at the sensor, preprocessing, feature extraction or classification level, of extracted features
- G06V10/82: Image or video recognition or understanding using pattern recognition or machine learning, using neural networks
- G16H50/20: ICT specially adapted for medical diagnosis, medical simulation or medical data mining, e.g. for computer-aided diagnosis based on medical expert systems
- G06V2201/03: Indexing scheme relating to image or video recognition or understanding; recognition of patterns in medical or anatomical images
Abstract
The invention provides an endoscope image classification method based on multi-scale feature embedding and cross attention, comprising the following steps: acquiring labeled endoscope images of N classes; establishing a deep learning network based on multi-scale feature embedding and multi-head cross attention; constructing an endoscope image classifier; and predicting the endoscope image class with the established classifier. The method fuses the rich semantic information of deep feature maps with the geometric detail information of shallow feature maps through multi-scale feature embedding, and combines a cross attention mechanism to eliminate the ambiguity between semantic and geometric information across feature maps of different scales, mining more effective features and classifying endoscope images accurately, thereby assisting doctors in diagnosis and image reading and improving reading efficiency.
Description
Technical Field
The invention relates to the field of computer vision, in particular to image classification technology, and specifically to an endoscope image classification method based on multi-scale feature embedding and cross attention.
Background
Endoscopy is the most common means of cancer diagnosis, and endoscope image classification has important clinical significance for early cancer screening. Traditional cancer diagnosis relies mainly on manual reading under the endoscope by an endoscopist, but in clinical practice endoscopists differ subjectively in their judgment of cancers, the image-reading workload is heavy, and missed diagnoses and misdiagnoses occur from time to time. An accurate and efficient endoscopic diagnosis method is therefore needed: using a computer to assist doctors in reading endoscope images can reduce the reading pressure on endoscopists and improve the accuracy of endoscope image classification.
In recent years, deep learning frameworks have attracted wide attention in the field of computer vision, and researchers have applied them to various classification tasks. However, most deep-learning-based endoscope image classification methods use a convolutional neural network to extract features from an endoscope image at a single scale and ignore information at other scales, which makes it difficult to improve the accuracy of endoscope image classification.
Disclosure of Invention
To remedy these technical defects, the invention provides an endoscope image classification method based on multi-scale feature embedding and cross attention. It aims to fuse the rich semantic information of deep feature maps with the geometric detail information of shallow feature maps through multi-scale feature embedding, eliminate the ambiguity between semantic and geometric information across feature maps of different scales with a cross attention mechanism, mine more effective features, and accomplish accurate classification of endoscope images.
In order to achieve this purpose, the invention adopts the following technical scheme.

According to an embodiment of the invention, an endoscope image classification method based on multi-scale feature embedding and cross attention comprises the following steps:

Step 1, acquire N classes of C×H×W endoscope image samples and preprocess them to obtain a training set $E = \{E_1, E_2, \ldots, E_n, \ldots, E_N\}$, where $E_n$ denotes the n-th class of endoscope image samples, the n-th class contains P images in total, and $E_n^p$ denotes the p-th preprocessed image of the n-th class; C denotes the image channels, H the image height, W the image width, and $n = 1, 2, \ldots, N$;

Step 2, establish a deep learning network; process the endoscope image sample set with the convolutional neural network of the deep learning network to output feature maps at different convolution stages, and apply dimension reduction to these feature maps to form the dimension-reduced output feature maps $G_i^{n,p}$, $i = 1, 2, 3, 4$;

Step 3, input the dimension-reduced feature maps $G_i^{n,p}$ into a pre-constructed multi-head cross attention encoder with multi-scale feature embedding, and output the feature map $U_{n,p}$ after normalization and upsampling;

Step 4, input the feature map $U_{n,p}$ into a convolution stage for feature extraction and output the feature map $D_{n,p}$; apply global average pooling and global max pooling to $D_{n,p}$ respectively, and splice and fuse the pooled feature maps into the result feature map $D'_{n,p}$; input $D'_{n,p}$ into a fully connected layer to obtain an N-dimensional classification result vector;

Step 5, construct an endoscope image classifier based on the N-dimensional classification result vector to classify endoscope images.
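For orientation, the following is a minimal sketch of the forward pipeline implied by steps 1 to 5, assuming a PyTorch implementation. The four sub-modules (backbone, embed, encoder, head) are placeholders whose internals are specified below; their interfaces (e.g. the `skip=f4` argument) are assumptions for illustration, not part of the patent.

```python
import torch.nn as nn

class EndoscopeClassifier(nn.Module):
    def __init__(self, backbone, embed, encoder, head):
        super().__init__()
        self.backbone = backbone  # step 2: four convolution stages
        self.embed = embed        # step 2: multi-scale feature embedding
        self.encoder = encoder    # step 3: L multi-head cross attention encoders
        self.head = head          # step 4: conv stage + dual pooling + FC

    def forward(self, x):         # x: (B, C, H, W) batch of endoscope images
        f1, f2, f3, f4 = self.backbone(x)   # multi-scale feature maps F_i
        g = self.embed((f1, f2, f3, f4))    # dimension-reduced maps G_i
        u = self.encoder(g, skip=f4)        # feature map U, fused with F_4
        return self.head(u)                 # N-dimensional classification vector
```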
Further, the step 2 specifically includes:

Step 2.1, establish a deep learning network comprising: a multi-scale feature extraction module, a multi-scale feature embedding module, and a multi-head cross attention encoder module;

Step 2.2, construct the multi-scale feature extraction module:

the multi-scale feature extraction module consists of four convolutional neural network stages in sequence: a first, a second, a third, and a fourth convolution stage; the p-th image $E_n^p$ is input into the module and passes through the four stages in turn, yielding the feature map $F_1^{n,p}$ output by the first convolution stage, $F_2^{n,p}$ by the second, $F_3^{n,p}$ by the third, and $F_4^{n,p}$ by the fourth;

Step 2.3, construct the multi-scale feature embedding module:

the multi-scale feature embedding module consists of 4 different embedding layers connected in parallel, one for each of the 4 stage outputs $F_i^{n,p}$, $i = 1, 2, 3, 4$; each embedding layer comprises a convolution layer and a dimension-reduction operation; the four stage outputs $F_i^{n,p}$ are input into the module, convolved with kernels of size $2^{5-i} \times 2^{5-i}$ respectively, and after dimension reduction the module outputs the feature maps $G_i^{n,p}$, $i = 1, 2, 3, 4$.
Further, the step 3 specifically includes:

Step 3.1, construct the multi-head cross attention encoder with multi-scale feature embedding:

the multi-head cross attention encoder module with multi-scale feature embedding is formed by connecting the embedded features of the 4 convolution stages in series with L multi-head cross attention encoders;

the 4 feature maps $G_i^{n,p}$ are input into the multi-scale feature embedding module, normalized by an LN layer respectively, and converted from the channel-cross perspective; the channel-cross feature maps $T_i^{n,p}$ are obtained by formula (1):

$$T_i^{n,p} = \mathrm{Transpose}\big(\mathrm{LN}(G_i^{n,p})\big), \quad i = 1, 2, 3, 4 \quad (1)$$

where Transpose(·) denotes the transposition of a feature map, $\mathrm{LN}(G_i^{n,p})$ is a pixel feature map with $C_i$ channels of size $H_i \cdot W_i$, and $T_i^{n,p}$ is the channel-cross pixel feature map with $H_i \cdot W_i$ rows of size $C_i$;

Step 3.2, perform multi-scale embedding on the channel-cross feature maps $T_i^{n,p}$; the multi-scale embedded feature map $T^{n,p}$ is obtained by formula (2):

$$T^{n,p} = \mathrm{Concat}\big(T_1^{n,p}, T_2^{n,p}, T_3^{n,p}, T_4^{n,p}\big) \quad (2)$$

where Concat(·) denotes the feature-vector splicing operation and $T^{n,p}$ is the channel-cross feature map after multi-scale feature embedding and conversion;

Step 3.3, the feature maps $T^{n,p}$ and $T_i^{n,p}$ serve as the input of the 1st multi-head cross attention encoder module, and the output of the c-th module serves as the input of the (c+1)-th module;

each c-th multi-head cross attention encoder module comprises 2 linear transformation layers and M parallel cross attention layers, $c = 1, 2, \ldots, L$;

Step 3.4, in the c-th multi-head cross attention encoder module, the feature map $T^{n,p}$ is multiplied by two weight matrices $W^K$ and $W^V$, and the feature maps $T_i^{n,p}$, $i = 1, 2, 3, 4$, are multiplied by four weight matrices $W_i^Q$ respectively, outputting the feature maps $K_{n,p}$, $V_{n,p}$ and $Q_i^{n,p}$, as shown in formula (3):

$$K_{n,p} = \phi(T^{n,p} W^K), \quad V_{n,p} = \phi(T^{n,p} W^V), \quad Q_i^{n,p} = \phi(T_i^{n,p} W_i^Q), \quad i = 1, 2, 3, 4 \quad (3)$$

where φ(·) denotes a normalization function;

Step 3.5, the feature maps $K_{n,p}$, $V_{n,p}$ and $Q_i^{n,p}$ output by the multi-scale feature embedding module are input into the 1st multi-head cross attention encoder; after respective linear transformations they are input into the M-head cross attention layers, where each $Q_i^{n,p}$ is multiplied with $K_{n,p}$, activated by the Softmax function, and finally multiplied with $V_{n,p}$ to obtain the output, as shown in formula (4):

$$A_{i,m}^{n,p} = \delta\Big(\psi\big(Q_i^{n,p} W_m^Q (K_{n,p} W_m^K)^{\mathrm{T}}\big)\Big)\, V_{n,p} W_m^V, \quad m = 1, 2, \ldots, M \quad (4)$$

where ψ(·) is a normalization function and δ(·) is the Softmax function;

Step 3.6, on the basis of the attention feature maps $A_{i,m}^{n,p}$, the attention feature maps of different heads are dynamically fused to form new attention feature maps, as shown in formula (5):

$$\tilde{A}_{i,m}^{n,p} = \sum_{m'=1}^{M} w_{m,m'}\, A_{i,m'}^{n,p} \quad (5)$$

where $w = (w_{m,m'})$ is a learnable transformation matrix by which the multi-head attention feature maps are fused to generate new attention feature maps;

the M cross attention layer outputs $\tilde{A}_{i,m}^{n,p}$, $i = 1, 2, 3, 4$, $m = 1, 2, \ldots, M$, are then combined by formula (6) to obtain the feature maps $\mathrm{CA}_i^c$, $c = 1, 2, \ldots, L$, $i = 1, 2, 3, 4$:

$$\mathrm{CA}_i^c = \mathrm{Concat}\big(\tilde{A}_{i,1}^{n,p}, \ldots, \tilde{A}_{i,M}^{n,p}\big) \quad (6)$$

where M is the number of cross attention heads and $\tilde{A}_{i,m}^{n,p}$ denotes the feature map generated by the i-th query map $Q_i^{n,p}$ through the m-th cross attention layer in the c-th multi-head cross attention encoder module;

Step 3.7, the multi-head cross attention feature maps $\mathrm{CA}_i^c$, $i = 1, 2, 3, 4$, are processed by linear transformation and normalization respectively to obtain the output of the multi-head cross attention encoder module, as shown in formula (7):

$$O_i^c = \sigma\Big(\delta\big(\sigma(\phi(\mathrm{CA}_i^c))\big)\Big), \quad i = 1, 2, 3, 4 \quad (7)$$

where δ(·) denotes the GeLU function and σ(·) denotes a linear transformation function;

when c ≠ L, the output of the c-th multi-head cross attention encoder module is input into the next multi-head cross attention encoder module;

when c = L, the output $O^L$ of the L-th multi-head cross attention encoder module is upsampled by formula (8) to obtain the feature map $\hat{U}_{n,p}$:

$$\hat{U}_{n,p} = \mu\Big(\delta\big(\phi(O^L)\big)\Big) \quad (8)$$

where μ(·) denotes the upsampling function, φ(·) the normalization function, and δ(·) the ReLU function;

Step 3.8, the upsampled feature map $\hat{U}_{n,p}$ is fused with the feature map output by the convolution stage to obtain the output $U_{n,p}$, as shown in formula (9):

$$U_{n,p} = \hat{U}_{n,p} + F_4^{n,p} \quad (9)$$

where $F_4^{n,p}$ is the feature map output by the fourth convolution stage in the multi-scale feature extraction module.
Further, the step 4 specifically includes:

Step 4.1, input the feature map $U_{n,p}$ output by the multi-head cross attention encoder with multi-scale feature embedding into a convolution stage for feature extraction, outputting the feature map $D_{n,p}$;

Step 4.2, apply global average pooling and global max pooling to the feature map $D_{n,p}$ output by the convolution stage respectively, and splice and fuse the pooled feature maps into the result $D'_{n,p}$, as shown in formula (10):

$$D'_{n,p} = \mathrm{Concat}\big(\mathrm{GAP}(D_{n,p}), \mathrm{GMP}(D_{n,p})\big) \quad (10)$$

where Concat(·) denotes the feature-vector splicing operation, $\mathrm{GAP}(D_{n,p})$ is the feature map of $D_{n,p}$ after global average pooling, and $\mathrm{GMP}(D_{n,p})$ is the feature map of $D_{n,p}$ after global max pooling;

the feature map $D'_{n,p}$ is then input into a fully connected layer to obtain the N-dimensional classification result vector.
Further, the step 5 specifically includes: establish a cross entropy loss function, input the training sample set into the deep learning network for training, and optimize the cross entropy loss function with a back propagation algorithm, thereby adjusting all parameters in the deep learning network and obtaining the endoscope image classifier used to classify endoscope images.
Compared with the prior art, the invention has the following advantages:

The invention constructs an endoscope image classification model using an endoscope image classification method based on multi-scale feature embedding and cross attention. Conventional convolutional neural network classification relies on the semantic information of deep feature maps and ignores the geometric detail information of shallow feature maps. The method fuses deep feature maps rich in semantic information with shallow feature maps rich in geometric detail through multi-scale embedding, eliminates the ambiguity between semantic and geometric information across feature maps of different scales from the channel-cross perspective, and extracts more effective features, thereby improving the accuracy of endoscope image classification, assisting doctors in diagnosis and image reading, and reducing the reading pressure on endoscopists.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a block diagram of a deep learning network according to the present invention;
FIG. 3 is a schematic diagram of a multi-headed cross attention encoder module according to the present invention.
Detailed Description
For the convenience of understanding, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In this embodiment, an endoscope image classification method based on multi-scale feature embedding and cross attention includes, as shown in FIG. 1, the following specific steps:

Step 1, acquire N classes of C×H×W endoscope image samples and preprocess them to obtain a training set $E = \{E_1, E_2, \ldots, E_n, \ldots, E_N\}$, where $E_n$ denotes the n-th class of endoscope image samples, the n-th class contains P images in total, and $E_n^p$ denotes the p-th preprocessed image of the n-th class; C denotes the image channels, H the image height, W the image width, and $n = 1, 2, \ldots, N$.

Step 2, as shown in FIG. 2, establish a deep learning network; process the endoscope image sample set with the convolutional neural network of the deep learning network to output feature maps at different convolution stages, and apply dimension reduction to these feature maps to form the dimension-reduced output feature maps $G_i^{n,p}$, $i = 1, 2, 3, 4$.

Step 2.1, establish a deep learning network comprising: a multi-scale feature extraction module, a multi-scale feature embedding module, and a multi-head cross attention encoder module;

Step 2.2, construct the multi-scale feature extraction module:

the multi-scale feature extraction module consists of four convolutional neural network stages in sequence: a first, a second, a third, and a fourth convolution stage; the p-th image $E_n^p$ is input into the module and passes through the four stages in turn, yielding the feature map $F_1^{n,p}$ output by the first convolution stage, $F_2^{n,p}$ by the second, $F_3^{n,p}$ by the third, and $F_4^{n,p}$ by the fourth;

Step 2.3, construct the multi-scale feature embedding module:

the multi-scale feature embedding module consists of 4 different embedding layers connected in parallel, one for each of the 4 stage outputs $F_i^{n,p}$, $i = 1, 2, 3, 4$; each embedding layer comprises a convolution layer and a dimension-reduction operation; the four stage outputs $F_i^{n,p}$ are input into the module, convolved with kernels of size $2^{5-i} \times 2^{5-i}$ respectively, and after dimension reduction the module outputs the feature maps $G_i^{n,p}$, $i = 1, 2, 3, 4$.
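A minimal sketch of steps 2.2 and 2.3 follows, assuming a PyTorch implementation. The stage channel widths and the common embedding dimension are illustrative assumptions, as is the use of the $2^{5-i} \times 2^{5-i}$ kernel size as the stride, a standard patch-embedding choice that brings all four stage outputs to a shared spatial grid but is not stated explicitly in the text.

```python
import torch.nn as nn

class MultiScaleEmbedding(nn.Module):
    """Step 2.3: four parallel embedding layers, one per convolution stage.

    Stage i (i = 1..4) is reduced by a conv with kernel 2^(5-i) x 2^(5-i);
    using the kernel size as the stride (an assumption) maps all four stage
    outputs onto a common spatial grid for later concatenation.
    """
    def __init__(self, in_chs=(64, 128, 256, 512), embed_dim=256):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Conv2d(c, embed_dim, kernel_size=2 ** (5 - i), stride=2 ** (5 - i))
            for i, c in zip(range(1, 5), in_chs)
        ])

    def forward(self, feats):                 # feats: (F1, F2, F3, F4)
        # Each output G_i: (B, embed_dim, h, w) on a shared h x w grid
        return [layer(f) for layer, f in zip(self.layers, feats)]
```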
Step 3, input the dimension-reduced feature maps $G_i^{n,p}$ into the pre-constructed multi-head cross attention encoder with multi-scale feature embedding; after normalization and upsampling, output the feature map $U_{n,p}$.

Step 3.1, construct the multi-head cross attention encoder with multi-scale feature embedding:

the multi-head cross attention encoder module with multi-scale feature embedding is formed by connecting the embedded features of the 4 convolution stages in series with L multi-head cross attention encoders;

the 4 feature maps $G_i^{n,p}$ are input into the multi-scale feature embedding module, normalized by an LN layer respectively, and converted from the channel-cross perspective; the channel-cross feature maps $T_i^{n,p}$ are obtained by formula (1):

$$T_i^{n,p} = \mathrm{Transpose}\big(\mathrm{LN}(G_i^{n,p})\big), \quad i = 1, 2, 3, 4 \quad (1)$$

where Transpose(·) denotes the transposition of a feature map, $\mathrm{LN}(G_i^{n,p})$ is a pixel feature map with $C_i$ channels of size $H_i \cdot W_i$, and $T_i^{n,p}$ is the channel-cross pixel feature map with $H_i \cdot W_i$ rows of size $C_i$;

Step 3.2, perform multi-scale embedding on the channel-cross feature maps $T_i^{n,p}$; the multi-scale embedded feature map $T^{n,p}$ is obtained by formula (2):

$$T^{n,p} = \mathrm{Concat}\big(T_1^{n,p}, T_2^{n,p}, T_3^{n,p}, T_4^{n,p}\big) \quad (2)$$

where Concat(·) denotes the feature-vector splicing operation and $T^{n,p}$ is the channel-cross feature map after multi-scale feature embedding and conversion;
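A minimal sketch of formulas (1) and (2), assuming a PyTorch implementation in which each scale has its own LayerNorm and the four channel-cross maps share a spatial grid; the concatenation axis (channels) is an assumption.

```python
import torch
import torch.nn as nn

class ChannelCrossEmbed(nn.Module):
    """Steps 3.1-3.2 (formulas (1)-(2)): LayerNorm each dimension-reduced map
    G_i, flatten and transpose it into the channel-cross view T_i, then
    concatenate the four scales."""

    def __init__(self, dims=(256, 256, 256, 256)):
        super().__init__()
        self.norms = nn.ModuleList([nn.LayerNorm(d) for d in dims])

    def forward(self, g_list):                 # four maps G_i: (B, C_i, h, w)
        tokens = []
        for g, ln in zip(g_list, self.norms):
            t = g.flatten(2).transpose(1, 2)   # formula (1): (B, h*w, C_i)
            tokens.append(ln(t))
        t_cat = torch.cat(tokens, dim=-1)      # formula (2): (B, h*w, sum C_i)
        return t_cat, tokens                   # T^{n,p} and the T_i^{n,p}
```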
Step 3.3, the feature maps $T^{n,p}$ and $T_i^{n,p}$ serve as the input of the 1st multi-head cross attention encoder module, and the output of the c-th module serves as the input of the (c+1)-th module;

as shown in FIG. 3, each c-th multi-head cross attention encoder module comprises 2 linear transformation layers and M parallel cross attention layers, $c = 1, 2, \ldots, L$;

Step 3.4, in the c-th multi-head cross attention encoder module, the feature map $T^{n,p}$ is multiplied by two weight matrices $W^K$ and $W^V$, and the feature maps $T_i^{n,p}$, $i = 1, 2, 3, 4$, are multiplied by four weight matrices $W_i^Q$ respectively, outputting the feature maps $K_{n,p}$, $V_{n,p}$ and $Q_i^{n,p}$, as shown in formula (3):

$$K_{n,p} = \phi(T^{n,p} W^K), \quad V_{n,p} = \phi(T^{n,p} W^V), \quad Q_i^{n,p} = \phi(T_i^{n,p} W_i^Q), \quad i = 1, 2, 3, 4 \quad (3)$$

where φ(·) denotes a normalization function;

Step 3.5, the feature maps $K_{n,p}$, $V_{n,p}$ and $Q_i^{n,p}$ output by the multi-scale feature embedding module are input into the 1st multi-head cross attention encoder; after respective linear transformations they are input into the M-head cross attention layers, where each $Q_i^{n,p}$ is multiplied with $K_{n,p}$, activated by the Softmax function, and finally multiplied with $V_{n,p}$ to obtain the output, as shown in formula (4):

$$A_{i,m}^{n,p} = \delta\Big(\psi\big(Q_i^{n,p} W_m^Q (K_{n,p} W_m^K)^{\mathrm{T}}\big)\Big)\, V_{n,p} W_m^V, \quad m = 1, 2, \ldots, M \quad (4)$$

where ψ(·) is a normalization function and δ(·) is the Softmax function;

Step 3.6, on the basis of the attention feature maps $A_{i,m}^{n,p}$, the attention feature maps of different heads are dynamically fused to form new attention feature maps, as shown in formula (5):

$$\tilde{A}_{i,m}^{n,p} = \sum_{m'=1}^{M} w_{m,m'}\, A_{i,m'}^{n,p} \quad (5)$$

where $w = (w_{m,m'})$ is a learnable transformation matrix by which the multi-head attention feature maps are fused to generate new attention feature maps;

the M cross attention layer outputs $\tilde{A}_{i,m}^{n,p}$, $i = 1, 2, 3, 4$, $m = 1, 2, \ldots, M$, are then combined by formula (6) to obtain the feature maps $\mathrm{CA}_i^c$, $c = 1, 2, \ldots, L$, $i = 1, 2, 3, 4$:

$$\mathrm{CA}_i^c = \mathrm{Concat}\big(\tilde{A}_{i,1}^{n,p}, \ldots, \tilde{A}_{i,M}^{n,p}\big) \quad (6)$$

where M is the number of cross attention heads and $\tilde{A}_{i,m}^{n,p}$ denotes the feature map generated by the i-th query map $Q_i^{n,p}$ through the m-th cross attention layer in the c-th multi-head cross attention encoder module;

Step 3.7, the multi-head cross attention feature maps $\mathrm{CA}_i^c$, $i = 1, 2, 3, 4$, are processed by linear transformation and normalization respectively to obtain the output of the multi-head cross attention encoder module, as shown in formula (7):

$$O_i^c = \sigma\Big(\delta\big(\sigma(\phi(\mathrm{CA}_i^c))\big)\Big), \quad i = 1, 2, 3, 4 \quad (7)$$

where δ(·) denotes the GeLU function and σ(·) denotes a linear transformation function;
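The following is a hedged reconstruction of one of the L encoder modules (steps 3.4 to 3.7) in PyTorch. Where the text is ambiguous, assumptions are made: scaled dot-product attention stands in for the normalization ψ(·), formula (5) is realized as a learnable head-mixing matrix, and K and V are recomputed from the module input in every stacked module.

```python
import torch
import torch.nn as nn

class CrossAttentionEncoder(nn.Module):
    """One of the L encoder modules: shared keys/values from the concatenated
    map T, scale-specific queries from each T_i, learnable head mixing
    (formula (5)), head concatenation (formula (6)), and a feed-forward
    output stage (formula (7))."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.heads, self.dk = heads, dim // heads
        self.w_k = nn.Linear(4 * dim, dim)                   # formula (3): K
        self.w_v = nn.Linear(4 * dim, dim)                   # formula (3): V
        self.w_q = nn.ModuleList([nn.Linear(dim, dim) for _ in range(4)])
        self.head_mix = nn.Parameter(torch.eye(heads))       # formula (5): w
        self.ffn = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim),
                                 nn.GELU(), nn.Linear(dim, dim))  # formula (7)

    def forward(self, t_cat, t_list):
        # t_cat: (B, n, 4*dim) concatenated map T; t_list: four (B, n, dim) T_i
        b, n, _ = t_cat.shape
        k = self.w_k(t_cat).view(b, n, self.heads, self.dk).transpose(1, 2)
        v = self.w_v(t_cat).view(b, n, self.heads, self.dk).transpose(1, 2)
        outs = []
        for i, t_i in enumerate(t_list):
            q = self.w_q[i](t_i).view(b, n, self.heads, self.dk).transpose(1, 2)
            att = torch.softmax(q @ k.transpose(-2, -1) / self.dk ** 0.5, dim=-1)
            a = att @ v                                          # formula (4)
            a = torch.einsum('hm,bmnd->bhnd', self.head_mix, a)  # formula (5)
            a = a.transpose(1, 2).reshape(b, n, -1)              # formula (6)
            outs.append(self.ffn(a))                             # formula (7): O_i
        return outs
```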
when c ≠ L, the output of the c-th multi-head cross attention encoder module is input into the next multi-head cross attention encoder module;

when c = L, the output $O^L$ of the L-th multi-head cross attention encoder module is upsampled by formula (8) to obtain the feature map $\hat{U}_{n,p}$:

$$\hat{U}_{n,p} = \mu\Big(\delta\big(\phi(O^L)\big)\Big) \quad (8)$$

where μ(·) denotes the upsampling function, φ(·) the normalization function, and δ(·) the ReLU function;

Step 3.8, the upsampled feature map $\hat{U}_{n,p}$ is fused with the feature map output by the convolution stage to obtain the output $U_{n,p}$, as shown in formula (9):

$$U_{n,p} = \hat{U}_{n,p} + F_4^{n,p} \quad (9)$$

where $F_4^{n,p}$ is the feature map output by the fourth convolution stage in the multi-scale feature extraction module.
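A minimal sketch of formulas (8) and (9) under the stated assumptions: the token grid is assumed square, `proj` is an assumed 1x1 convolution plus normalization that matches the channel count of $F_4^{n,p}$, and element-wise addition stands in for the fusion, whose exact form the text leaves open.

```python
import torch.nn.functional as F

def upsample_and_fuse(o_last, f4, proj):
    """Formulas (8)-(9): reshape the L-th encoder output tokens back to a
    spatial map, normalize + ReLU + upsample, then fuse with the fourth
    convolution stage output F4 by element-wise addition."""
    b, n, c = o_last.shape
    h = w = int(n ** 0.5)                      # assumes a square token grid
    x = o_last.transpose(1, 2).reshape(b, c, h, w)
    x = F.relu(proj(x))                        # proj: assumed 1x1 conv + norm
    x = F.interpolate(x, size=f4.shape[-2:],   # formula (8): upsampling mu(.)
                      mode='bilinear', align_corners=False)
    return x + f4                              # formula (9): U_{n,p}
```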
Step 4, input the feature map $U_{n,p}$ into a convolution stage for feature extraction and output the feature map $D_{n,p}$; apply global average pooling and global max pooling to $D_{n,p}$ respectively, and splice and fuse the pooled feature maps into the result feature map $D'_{n,p}$; input $D'_{n,p}$ into a fully connected layer to obtain the N-dimensional classification result vector.

Step 4.1, input the feature map $U_{n,p}$ output by the multi-head cross attention encoder with multi-scale feature embedding into a convolution stage for feature extraction, outputting the feature map $D_{n,p}$;

Step 4.2, apply global average pooling and global max pooling to the feature map $D_{n,p}$ output by the convolution stage respectively, and splice and fuse the pooled feature maps into the result $D'_{n,p}$, as shown in formula (10):

$$D'_{n,p} = \mathrm{Concat}\big(\mathrm{GAP}(D_{n,p}), \mathrm{GMP}(D_{n,p})\big) \quad (10)$$

where Concat(·) denotes the feature-vector splicing operation, $\mathrm{GAP}(D_{n,p})$ is the feature map of $D_{n,p}$ after global average pooling, and $\mathrm{GMP}(D_{n,p})$ is the feature map of $D_{n,p}$ after global max pooling;

the feature map $D'_{n,p}$ is then input into a fully connected layer to obtain the N-dimensional classification result vector.
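The dual-pooling head of formula (10) is straightforward; a minimal sketch assuming PyTorch follows.

```python
import torch
import torch.nn as nn

class PoolingHead(nn.Module):
    """Step 4 (formula (10)): dual global pooling, concatenation, and a
    fully connected layer producing the N-dimensional result vector."""

    def __init__(self, channels, num_classes):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)      # global average pooling
        self.gmp = nn.AdaptiveMaxPool2d(1)      # global max pooling
        self.fc = nn.Linear(2 * channels, num_classes)

    def forward(self, d):                       # d: D_{n,p}, (B, channels, h, w)
        d_prime = torch.cat([self.gap(d).flatten(1),
                             self.gmp(d).flatten(1)], dim=1)  # formula (10)
        return self.fc(d_prime)                 # N-dimensional result vector
```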
Step 5, construct an endoscope image classifier based on the N-dimensional classification result vector to classify endoscope images.

Specifically, the cross entropy loss function shown in formula (11) is established, the training sample set is input into the deep learning network for training, and the cross entropy loss function is optimized with the back propagation algorithm, thereby adjusting all parameters in the deep learning network and obtaining the endoscope image classifier used to classify endoscope images:

$$CE(p, q) = -\sum_{i=1}^{C} p_i \log(q_i) \quad (11)$$

where C denotes the number of classes, $p_i$ denotes the ground-truth probability of class i for a sample, $q_i$ denotes the predicted probability of class i, and CE(p, q) denotes the classification loss on the sample.
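A minimal training sketch for step 5, assuming PyTorch: `model` stands for the deep learning network above, `train_loader` yields (image, label) batches, and the optimizer choice, learning rate, and epoch count are illustrative assumptions; the patent only specifies a cross entropy loss optimized by back propagation.

```python
import torch
import torch.nn as nn

def train(model, train_loader, epochs=50, lr=1e-4, device='cuda'):
    model.to(device).train()
    criterion = nn.CrossEntropyLoss()           # formula (11)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            loss = criterion(model(images), labels)
            optimizer.zero_grad()
            loss.backward()                     # back propagation algorithm
            optimizer.step()                    # adjust all network parameters
```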
It will be evident to those skilled in the art that the embodiments of the present invention are not limited to the details of the foregoing illustrative embodiments, and that the embodiments of the present invention are capable of being embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the embodiments being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. Several units, modules or means recited in the system, apparatus or terminal claims may also be implemented by one and the same unit, module or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.
Finally, it should be noted that the above embodiments are only used for illustrating the technical solutions of the embodiments of the present invention and not for limiting, and although the embodiments of the present invention are described in detail with reference to the above preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions can be made on the technical solutions of the embodiments of the present invention without departing from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (5)
1. A method for classifying endoscopic images based on multi-scale feature embedding and cross attention, the method comprising:

step 1, acquiring N classes of C×H×W endoscope image samples and preprocessing them to obtain a training set $E = \{E_1, E_2, \ldots, E_n, \ldots, E_N\}$, where $E_n$ denotes the n-th class of endoscope image samples, the n-th class contains P images in total, and $E_n^p$ denotes the p-th preprocessed image of the n-th class; C denotes the image channels, H the image height, W the image width, and $n = 1, 2, \ldots, N$;

step 2, establishing a deep learning network, processing the endoscope image sample set with the convolutional neural network of the deep learning network to output feature maps at different convolution stages, and applying dimension reduction to these feature maps to form the dimension-reduced output feature maps $G_i^{n,p}$, $i = 1, 2, 3, 4$;

step 3, inputting the dimension-reduced feature maps $G_i^{n,p}$ into a pre-constructed multi-head cross attention encoder with multi-scale feature embedding, and outputting the feature map $U_{n,p}$ after normalization and upsampling;

step 4, inputting the feature map $U_{n,p}$ into a convolution stage for feature extraction and outputting the feature map $D_{n,p}$, applying global average pooling and global max pooling to $D_{n,p}$ respectively, splicing and fusing the pooled feature maps into the result feature map $D'_{n,p}$, and inputting $D'_{n,p}$ into a fully connected layer to obtain an N-dimensional classification result vector;

step 5, constructing an endoscope classifier based on the N-dimensional classification result vector to classify endoscope images.
2. The endoscopic image classification method according to claim 1, characterized in that said step 2 specifically comprises:

step 2.1, establishing a deep learning network comprising: a multi-scale feature extraction module, a multi-scale feature embedding module, and a multi-head cross attention encoder module;

step 2.2, constructing the multi-scale feature extraction module: the module consists of four convolutional neural network stages in sequence: a first, a second, a third, and a fourth convolution stage; the p-th image $E_n^p$ is input into the module and passes through the four stages in turn, yielding the feature map $F_1^{n,p}$ output by the first convolution stage, $F_2^{n,p}$ by the second, $F_3^{n,p}$ by the third, and $F_4^{n,p}$ by the fourth;

step 2.3, constructing the multi-scale feature embedding module: the module consists of 4 different embedding layers connected in parallel, one for each of the 4 stage outputs $F_i^{n,p}$, $i = 1, 2, 3, 4$; each embedding layer comprises a convolution layer and a dimension-reduction operation.
3. The endoscopic image classification method according to claim 2, characterized in that said step 3 specifically comprises:

step 3.1, constructing the multi-head cross attention encoder with multi-scale feature embedding: the module is formed by connecting the embedded features of the 4 convolution stages in series with L multi-head cross attention encoders; the 4 feature maps $G_i^{n,p}$ are input into the multi-scale feature embedding module, normalized by an LN layer respectively, and converted from the channel-cross perspective, obtaining the channel-cross feature maps $T_i^{n,p}$ by formula (1):

$$T_i^{n,p} = \mathrm{Transpose}\big(\mathrm{LN}(G_i^{n,p})\big), \quad i = 1, 2, 3, 4 \quad (1)$$

where Transpose(·) denotes the transposition of a feature map, $\mathrm{LN}(G_i^{n,p})$ is a pixel feature map with $C_i$ channels of size $H_i \cdot W_i$, and $T_i^{n,p}$ is the channel-cross pixel feature map with $H_i \cdot W_i$ rows of size $C_i$;

step 3.2, performing multi-scale embedding on the channel-cross feature maps $T_i^{n,p}$, obtaining the multi-scale embedded feature map $T^{n,p}$ by formula (2):

$$T^{n,p} = \mathrm{Concat}\big(T_1^{n,p}, T_2^{n,p}, T_3^{n,p}, T_4^{n,p}\big) \quad (2)$$

where Concat(·) denotes the feature-vector splicing operation and $T^{n,p}$ is the channel-cross feature map after multi-scale feature embedding and conversion;

step 3.3, the feature maps $T^{n,p}$ and $T_i^{n,p}$ serving as the input of the 1st multi-head cross attention encoder module, and the output of the c-th module serving as the input of the (c+1)-th module; each c-th multi-head cross attention encoder module comprising 2 linear transformation layers and M parallel cross attention layers, $c = 1, 2, \ldots, L$;

step 3.4, in the c-th multi-head cross attention encoder module, multiplying the feature map $T^{n,p}$ by two weight matrices $W^K$ and $W^V$, and multiplying the feature maps $T_i^{n,p}$, $i = 1, 2, 3, 4$, by four weight matrices $W_i^Q$ respectively, outputting the feature maps $K_{n,p}$, $V_{n,p}$ and $Q_i^{n,p}$, as shown in formula (3):

$$K_{n,p} = \phi(T^{n,p} W^K), \quad V_{n,p} = \phi(T^{n,p} W^V), \quad Q_i^{n,p} = \phi(T_i^{n,p} W_i^Q), \quad i = 1, 2, 3, 4 \quad (3)$$

where φ(·) denotes a normalization function;

step 3.5, inputting the feature maps $K_{n,p}$, $V_{n,p}$ and $Q_i^{n,p}$ output by the multi-scale feature embedding module into the 1st multi-head cross attention encoder; after respective linear transformations they are input into the M-head cross attention layers, where each $Q_i^{n,p}$ is multiplied with $K_{n,p}$, activated by the Softmax function, and finally multiplied with $V_{n,p}$ to obtain the output, as shown in formula (4):

$$A_{i,m}^{n,p} = \delta\Big(\psi\big(Q_i^{n,p} W_m^Q (K_{n,p} W_m^K)^{\mathrm{T}}\big)\Big)\, V_{n,p} W_m^V, \quad m = 1, 2, \ldots, M \quad (4)$$

where ψ(·) is a normalization function and δ(·) is the Softmax function;

step 3.6, on the basis of the attention feature maps $A_{i,m}^{n,p}$, dynamically fusing the attention feature maps of different heads to form new attention feature maps, as shown in formula (5):

$$\tilde{A}_{i,m}^{n,p} = \sum_{m'=1}^{M} w_{m,m'}\, A_{i,m'}^{n,p} \quad (5)$$

where $w = (w_{m,m'})$ is a learnable transformation matrix by which the multi-head attention feature maps are fused to generate new attention feature maps;

then combining the M cross attention layer outputs $\tilde{A}_{i,m}^{n,p}$, $i = 1, 2, 3, 4$, $m = 1, 2, \ldots, M$, by formula (6) to obtain the feature maps $\mathrm{CA}_i^c$, $c = 1, 2, \ldots, L$, $i = 1, 2, 3, 4$:

$$\mathrm{CA}_i^c = \mathrm{Concat}\big(\tilde{A}_{i,1}^{n,p}, \ldots, \tilde{A}_{i,M}^{n,p}\big) \quad (6)$$

where M is the number of cross attention heads and $\tilde{A}_{i,m}^{n,p}$ denotes the feature map generated by the i-th query map $Q_i^{n,p}$ through the m-th cross attention layer in the c-th multi-head cross attention encoder module;

step 3.7, processing the multi-head cross attention feature maps $\mathrm{CA}_i^c$, $i = 1, 2, 3, 4$, by linear transformation and normalization respectively to obtain the output of the multi-head cross attention encoder module, as shown in formula (7):

$$O_i^c = \sigma\Big(\delta\big(\sigma(\phi(\mathrm{CA}_i^c))\big)\Big), \quad i = 1, 2, 3, 4 \quad (7)$$

where δ(·) denotes the GeLU function and σ(·) denotes a linear transformation function;

when c ≠ L, inputting the output of the c-th multi-head cross attention encoder module into the next multi-head cross attention encoder module;

when c = L, upsampling the output $O^L$ of the L-th multi-head cross attention encoder module by formula (8) to obtain the feature map $\hat{U}_{n,p}$:

$$\hat{U}_{n,p} = \mu\Big(\delta\big(\phi(O^L)\big)\Big) \quad (8)$$

where μ(·) denotes the upsampling function, φ(·) the normalization function, and δ(·) the ReLU function;

step 3.8, fusing the upsampled feature map $\hat{U}_{n,p}$ with the feature map output by the convolution stage to obtain the output $U_{n,p}$, as shown in formula (9):

$$U_{n,p} = \hat{U}_{n,p} + F_4^{n,p} \quad (9)$$
4. The endoscopic image classification method according to claim 3, characterized in that said step 4 specifically comprises:

step 4.1, inputting the feature map $U_{n,p}$ output by the multi-head cross attention encoder with multi-scale feature embedding into a convolution stage for feature extraction, outputting the feature map $D_{n,p}$;

step 4.2, applying global average pooling and global max pooling to the feature map $D_{n,p}$ output by the convolution stage respectively, and splicing and fusing the pooled feature maps into the result $D'_{n,p}$, as shown in formula (10):

$$D'_{n,p} = \mathrm{Concat}\big(\mathrm{GAP}(D_{n,p}), \mathrm{GMP}(D_{n,p})\big) \quad (10)$$

where Concat(·) denotes the feature-vector splicing operation, $\mathrm{GAP}(D_{n,p})$ is the feature map of $D_{n,p}$ after global average pooling, and $\mathrm{GMP}(D_{n,p})$ is the feature map of $D_{n,p}$ after global max pooling;

inputting the feature map $D'_{n,p}$ into a fully connected layer to obtain the N-dimensional classification result vector.
5. The endoscopic image classification method according to claim 4, wherein said step 5 specifically includes: establishing a cross entropy loss function, inputting the training sample set into the deep learning network for training, and optimizing the cross entropy loss function with a back propagation algorithm, thereby adjusting all parameters in the deep learning network and obtaining the endoscope image classifier used to classify endoscope images.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
| --- | --- | --- | --- |
| CN202210542820.8A (granted as CN114863179B) | 2022-05-18 | 2022-05-18 | Endoscope image classification method based on multi-scale feature embedding and cross attention |
Publications (2)

| Publication Number | Publication Date |
| --- | --- |
| CN114863179A | 2022-08-05 |
| CN114863179B | 2022-12-13 |

Family: ID=82638829; one family application, CN202210542820.8A, filed 2022-05-18 (country: CN, status: Active).
Legal Events

- PB01: Publication
- SE01: Entry into force of request for substantive examination
- GR01: Patent grant