WO2023071680A1 - Training method and apparatus for endoscope image feature learning model and classification model - Google Patents

Training method and apparatus for endoscope image feature learning model and classification model

Info

Publication number
WO2023071680A1
WO2023071680A1 PCT/CN2022/122056 CN2022122056W WO2023071680A1 WO 2023071680 A1 WO2023071680 A1 WO 2023071680A1 CN 2022122056 W CN2022122056 W CN 2022122056W WO 2023071680 A1 WO2023071680 A1 WO 2023071680A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
learning
scale
contrastive
image
Prior art date
Application number
PCT/CN2022/122056
Other languages
English (en)
French (fr)
Inventor
边成
Original Assignee
北京字节跳动网络技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京字节跳动网络技术有限公司
Publication of WO2023071680A1 publication Critical patent/WO2023071680A1/zh

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/0002 Inspection of images, e.g. flaw detection
    • G06T7/0012 Biomedical image inspection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10068 Endoscopic image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning

Definitions

  • The present application relates to the field of artificial intelligence, and in particular to a training method for an endoscope image feature learning model based on contrastive learning, a training method for an endoscope image classification model, an endoscope image classification method, a device, and a computer-readable medium.
  • Colonoscopy is the preferred examination method for the prevention and diagnosis of intestinal cancer, and endoscopic minimally invasive treatment of some early gastrointestinal cancers can achieve the goal of curative resection.
  • Colonoscopy is the process of observing colonic lesions (such as inflammation, tumors, etc.) from the mucosal side by passing an electronic colonoscope through the anus, the rectum, and the sigmoid colon until it reaches the ileocecal region.
  • The ileocecal part is the part where the end of the ileum and the cecum meet, and it is the main anatomical structure of the ileocecal region. Therefore, identification of the ileocecum is very important during endoscopy.
  • Existing research on recognition of the ileocecum is basically based on fully supervised convolutional neural networks, usually using an off-the-shelf convolutional neural network such as ResNet, VGG, or Inception-v3. A few works slightly modify these off-the-shelf models, for example by fine-tuning pre-trained models.
  • The pre-trained models they use are usually obtained from ready-made pre-training on natural images. Due to the differences between medical images and natural images, such pre-trained models cannot learn the characteristics of endoscopic images well.
  • Contrastive learning focuses on learning the common features between instances of the same class and distinguishing the differences between different classes of instances. It does not need to attend to the tedious details of each instance; it only needs to learn to distinguish the data in a feature space at an abstract semantic level, so the model and its optimization become simpler and the generalization ability is stronger. The contrastive loss maximizes the mutual information between positive samples and minimizes the mutual information between negative samples.
  • Contrastive learning has been applied in the medical field. However, such methods only perform contrastive learning at the image level and do not learn features at different levels and different scales.
  • An improved training method for the endoscopic image feature learning model can better learn the abstract, semantic-level features of the images themselves from a large amount of unlabeled data when labeled data is limited.
  • An object of the present disclosure is to provide a method for training an endoscopic image feature learning model based on contrastive learning, a training method for an endoscopic image classification model, an endoscopic image classification method, a device, and a computer-readable medium.
  • An embodiment of the present disclosure provides a training method for an endoscopic image feature learning model based on multi-scale contrastive learning. The method includes: acquiring a first training data set, where the first training data set includes one or more endoscopic images with an object to be identified and one or more endoscopic images without an object to be identified; inputting the first training data set into the endoscopic image feature learning model; and performing unsupervised contrastive learning on the endoscopic image feature learning model based on the first training data set to obtain a trained endoscopic image feature learning model, wherein the endoscopic image feature learning model includes a plurality of contrastive learning submodules, each of which is used to extract feature representations of different scales of the same endoscopic image in the first training data set and to perform contrastive learning based on the extracted feature representations of different scales.
  • the plurality of contrastive learning submodules include M contrastive learning submodules connected in sequence
  • Any one of the M contrastive learning submodules, denoted contrastive learning submodule i, includes: a first encoder and a second encoder with the same structure, and a first mapper module and a second mapper module with the same structure, wherein the output of the first encoder is connected to the input of the first mapper module and the output of the second encoder is connected to the input of the second mapper module. The M first encoders in the M contrastive learning submodules are connected in sequence, and the M second encoders in the M contrastive learning submodules are connected in sequence, where M is an integer greater than or equal to 1 and i ∈ [1, M] (a structural sketch is given below).
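  • For illustration only, the following is a minimal PyTorch sketch of one such two-branch submodule; the class name ContrastiveSubmodule and the assumption that the encoder and mapper are generic nn.Module instances are choices of this sketch, not part of the disclosure.

```python
# Minimal sketch of one contrastive learning submodule i (illustrative names).
# Both branches have identical structure; the encoder changes the feature scale.
import copy
import torch.nn as nn

class ContrastiveSubmodule(nn.Module):
    def __init__(self, encoder: nn.Module, mapper: nn.Module):
        super().__init__()
        self.encoder1 = encoder                  # first branch
        self.encoder2 = copy.deepcopy(encoder)   # second branch, same structure
        self.mapper1 = mapper
        self.mapper2 = copy.deepcopy(mapper)

    def forward(self, h1, h2):
        # h1, h2: features of the two augmented batches from the previous submodule.
        f1, f2 = self.encoder1(h1), self.encoder2(h2)   # features of the i-th scale
        z1, z2 = self.mapper1(f1), self.mapper2(f2)     # mapped features for the loss
        # f1, f2 are passed on to submodule i+1; z1, z2 feed the contrastive loss.
        return f1, f2, z1, z2
```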
  • Inputting the first training data set into the endoscopic image feature learning model includes, during each iterative training: randomly selecting L endoscopic images from the first training data set; performing a first image enhancement on each of the L endoscopic images to obtain L first enhanced endoscopic images corresponding one-to-one to the L endoscopic images, and inputting them to the first encoder of the first contrastive learning submodule in the endoscopic image feature learning model; and performing a second image enhancement on each of the L endoscopic images to obtain L second enhanced endoscopic images corresponding one-to-one to the L endoscopic images, and inputting them to the second encoder of the first contrastive learning submodule in the endoscopic image feature learning model, where L is a positive integer greater than 1.
  • The first image enhancement and the second image enhancement are respectively any two of the following: keeping the image unchanged, cropping, flipping, color transformation, and Gaussian blur (see the sketch below).
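  • As a non-authoritative sketch, the two enhancements could be drawn from such a set of operations using torchvision; the function name two_views and the specific parameter values are assumptions for illustration.

```python
# Sketch of the first/second image enhancement, assuming torchvision transforms.
import random
from torchvision import transforms

candidate_ops = [
    transforms.Lambda(lambda img: img),           # keep unchanged
    transforms.RandomResizedCrop(224),            # crop
    transforms.RandomHorizontalFlip(p=1.0),       # flip
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),   # color transform
    transforms.GaussianBlur(kernel_size=23),      # Gaussian blur
]

def two_views(img):
    # Pick any two of the candidate operations, one per enhanced view.
    op1, op2 = random.sample(candidate_ops, 2)
    return op1(img), op2(img)
```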
  • Performing unsupervised contrastive learning on the endoscopic image feature learning model to obtain a trained endoscopic image feature learning model includes: calculating a joint contrast loss value based on the feature output of each contrastive learning submodule i among the M contrastive learning submodules, and adjusting the parameters of the endoscopic image feature learning model based on the joint contrast loss value until the joint contrast loss function of the endoscopic image feature learning model converges, wherein the joint contrast loss function is the sum of the contrast loss functions based on the outputs of the M contrastive learning submodules.
  • Performing unsupervised contrastive learning on the endoscopic image feature learning model based on the first training data set includes: for any contrastive learning submodule i among the M contrastive learning submodules, using the first encoder and the second encoder included therein to respectively extract L first feature representations of the i-th scale corresponding to the L first enhanced endoscopic images and L second feature representations of the i-th scale corresponding to the L second enhanced endoscopic images; using the first mapper module and the second mapper module included therein to respectively map the L first feature representations of the i-th scale and the L second feature representations of the i-th scale, so as to obtain mapped feature representations of the i-th scale corresponding to the L first enhanced endoscopic images and mapped feature representations of the i-th scale corresponding to the L second enhanced endoscopic images; and calculating the contrast loss value of the contrastive learning submodule i based on the mapped feature representations of the i-th scale corresponding to the L first enhanced endoscopic images and those corresponding to the L second enhanced endoscopic images.
  • The first mapper module in the contrastive learning submodule i includes a first global mapper, and the output of the first encoder in the contrastive learning submodule i is connected to the input of the first global mapper in the contrastive learning submodule i; the second mapper module in the contrastive learning submodule i includes a second global mapper, and the output of the second encoder in the contrastive learning submodule i is connected to the input of the second global mapper in the contrastive learning submodule i.
  • Mapping the L first feature representations of the i-th scale and the L second feature representations of the i-th scale, to obtain the mapped feature representations of the i-th scale corresponding to the L first enhanced endoscopic images and the mapped feature representations of the i-th scale corresponding to the L second enhanced endoscopic images, includes: based on the first global mapper and the second global mapper included in the contrastive learning submodule i, performing global mapping on the L first feature representations of the i-th scale and the L second feature representations of the i-th scale, so as to obtain L globally mapped first feature representations of the i-th scale corresponding to the L first enhanced endoscopic images and L globally mapped second feature representations of the i-th scale corresponding to the L second enhanced endoscopic images.
  • the first global mapper and the second global mapper are two-layer fully connected modules.
  • Calculating the contrast loss value of the contrastive learning submodule i includes: taking, as a pair of positive examples, the two one-to-one corresponding feature representations among the L globally mapped first feature representations of the i-th scale corresponding to the L first enhanced endoscopic images and the L globally mapped second feature representations of the i-th scale corresponding to the L second enhanced endoscopic images, taking the remaining (2L-2) feature representations as negative examples, and calculating the contrastive loss function to obtain the contrast loss value of the contrastive learning submodule i.
  • The first mapper module in the contrastive learning submodule i includes a first global mapper and a first local mapper; the output of the first encoder in the contrastive learning submodule i is simultaneously connected to the input of the first global mapper and the input of the first local mapper in the contrastive learning submodule i. The second mapper module in the contrastive learning submodule i includes a second global mapper and a second local mapper; the output of the second encoder in the contrastive learning submodule i is simultaneously connected to the input of the second global mapper and the input of the second local mapper in the contrastive learning submodule i.
  • Mapping the L first feature representations of the i-th scale and the L second feature representations of the i-th scale, to obtain the mapped feature representations of the i-th scale corresponding to the L first enhanced endoscopic images and the mapped feature representations of the i-th scale corresponding to the L second enhanced endoscopic images, includes: based on the first global mapper and the second global mapper included in the contrastive learning submodule i, performing global mapping on the L first feature representations of the i-th scale and the L second feature representations of the i-th scale to obtain L globally mapped first feature representations of the i-th scale corresponding to the L first enhanced endoscopic images and L globally mapped second feature representations of the i-th scale corresponding to the L second enhanced endoscopic images; and, based on the first local mapper and the second local mapper included in the contrastive learning submodule i, respectively performing local mapping on the L first feature representations of the i-th scale and the L second feature representations of the i-th scale to obtain L locally mapped first feature representations of the i-th scale corresponding to the L first enhanced endoscopic images and L locally mapped second feature representations of the i-th scale corresponding to the L second enhanced endoscopic images.
  • The first global mapper and the second global mapper are two-layer fully connected modules, and the first local mapper and the second local mapper are two-layer 1x1 convolution modules (see the sketch below).
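  • A minimal sketch of these two mapper types in PyTorch follows; the hidden and output dimensions are placeholder assumptions, not values from the disclosure.

```python
# Illustrative global mapper (two fully connected layers) and local mapper
# (two 1x1 convolution layers); dimensions are placeholder assumptions.
import torch.nn as nn

def make_global_mapper(dim, proj_dim=128):
    # Applied to the image-level feature vector of each sample.
    return nn.Sequential(nn.Linear(dim, dim), nn.ReLU(inplace=True),
                         nn.Linear(dim, proj_dim))

def make_local_mapper(dim, proj_dim=128):
    # Applied per spatial position, so the regional layout of the feature
    # map is preserved after mapping.
    return nn.Sequential(nn.Conv2d(dim, dim, kernel_size=1), nn.ReLU(inplace=True),
                         nn.Conv2d(dim, proj_dim, kernel_size=1))
```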
  • Calculating the contrast loss value of the contrastive learning submodule i includes: taking, as a pair of positive examples, the two one-to-one corresponding feature representations among the L globally mapped first feature representations of the i-th scale corresponding to the L first enhanced endoscopic images and the L globally mapped second feature representations of the i-th scale corresponding to the L second enhanced endoscopic images, taking the remaining (2L-2) feature representations as negative examples, and calculating the contrastive loss function to obtain a global contrast loss value; and dividing each of the L locally mapped first feature representations of the i-th scale corresponding to the L first enhanced endoscopic images into S first local feature representations of the i-th scale to obtain (L×S) first local feature representations of the i-th scale, obtaining (L×S) second local feature representations of the i-th scale in the same manner, and calculating a local contrast loss value based on these local feature representations.
  • the contrastive loss function is a noise contrastive estimation loss function InfoNCE.
  • The first encoder and the second encoder are multi-scale Transformer encoder blocks; each multi-scale Transformer encoder block includes one or more multi-head pooling attention modules and one or more multi-layer perceptron modules arranged at intervals, wherein each multi-head pooling attention module and each multi-layer perceptron module is preceded by a layer normalization module (a simplified sketch is given below).
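  • The following is a simplified, non-authoritative sketch of such a block: layer normalization precedes a pooled attention step and an MLP, and the query tokens are spatially pooled so that the output sequence is shorter. The pooling stride, head count, and the omission of the channel doubling used in multi-scale ViT are simplifications of this sketch.

```python
# Simplified multi-scale Transformer encoder block (PyTorch sketch).
import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    def __init__(self, dim, num_heads=8, pool_stride=2):
        super().__init__()
        # dim must be divisible by num_heads.
        self.norm1 = nn.LayerNorm(dim)
        self.pool_q = nn.MaxPool2d(kernel_size=pool_stride, stride=pool_stride)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x, hw):
        # x: (B, N, D) token sequence; hw = (H, W) with H * W == N.
        B, N, D = x.shape
        h, w = hw
        y = self.norm1(x)
        # Pool the normalized tokens on their 2D grid to obtain shorter queries.
        grid = y.transpose(1, 2).reshape(B, D, h, w)
        q = self.pool_q(grid).flatten(2).transpose(1, 2)   # (B, N/4, D)
        out, _ = self.attn(q, y, y, need_weights=False)    # pooled attention
        out = out + q                                      # residual on the pooled path
        out = out + self.mlp(self.norm2(out))
        return out, (h // 2, w // 2)
```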
  • the method according to an embodiment of the present disclosure, wherein the object is the ileocecal portion.
  • Embodiments of the present disclosure also provide a training device for an endoscopic image feature learning model based on multi-scale contrastive learning. The device includes: a training data set acquisition component configured to acquire a first training data set, where the first training data set includes one or more endoscopic images with objects to be identified and one or more endoscopic images without objects to be identified; an input component configured to input the first training data set into the endoscopic image feature learning model; and a training component configured to perform unsupervised contrastive learning on the endoscopic image feature learning model based on the first training data set to obtain a trained endoscopic image feature learning model, wherein the endoscopic image feature learning model includes a plurality of contrastive learning submodules, which are used to extract feature representations of different scales of the same input sample and to perform contrastive learning based on the extracted feature representations of different scales.
  • Embodiments of the present disclosure also provide a training method for an endoscopic image classification model, including: acquiring a second training data set, where the second training data set includes one or more endoscopic images with an object to be identified and one or more endoscopic images without an object to be identified, each endoscopic image being marked with a label indicating whether it includes the object to be identified; and inputting the second training data set into the endoscopic image classification model for training until the target loss function of the endoscopic image classification model converges, so as to obtain a trained endoscopic image classification model, wherein the endoscopic image classification model includes a feature extraction module and a classifier module connected in sequence, and the feature extraction module consists of the M first encoders or the M second encoders in the endoscopic image feature learning model obtained according to the training method based on multi-scale contrastive learning described above, where M is an integer greater than 1.
  • The target loss function of the endoscopic image classification model includes a focal loss function determined based on the final output of the endoscopic image classification model and the annotated labels of the image samples.
  • Embodiments of the present disclosure also provide a training device for an endoscopic image classification model, including: an image acquisition component for acquiring a second training data set, where the second training data set includes one or more endoscopic images with an object to be identified and one or more endoscopic images without an object to be identified, each endoscopic image being marked with a label indicating whether it includes the object to be identified; and a training component for inputting the second training data set into the endoscopic image classification model for training until the target loss function of the endoscopic image classification model converges, so as to obtain a trained endoscopic image classification model, wherein the endoscopic image classification model includes a feature extraction module and a classifier module connected in sequence, and the feature extraction module consists of the M first encoders or the M second encoders in the endoscopic image feature learning model obtained according to the training method based on multi-scale contrastive learning described above, where M is an integer greater than 1.
  • Embodiments of the present disclosure also provide a method for classifying endoscopic images, including: acquiring an endoscopic image to be identified; and obtaining a classification result of the endoscopic image based on a trained endoscopic image classification model, wherein the trained endoscopic image classification model is obtained by the training method for the endoscopic image classification model described above.
  • Embodiments of the present disclosure also provide an endoscopic image classification system, including: an image acquisition component configured to acquire an endoscopic image to be identified; a processing component configured to obtain a classification result of the endoscopic image based on a trained endoscopic image classification model; and an output unit configured to output the classification result of the endoscopic image to be identified, wherein the trained endoscopic image classification model is obtained by the training method for the endoscopic image classification model described above.
  • An embodiment of the present disclosure also provides an electronic device, including a memory and a processor, wherein the memory stores program code readable by the processor, and when the processor executes the program code, any one of the methods described above is performed.
  • Embodiments of the present disclosure also provide a computer-readable storage medium on which computer-executable instructions are stored, the computer-executable instructions being used to execute any one of the methods described above.
  • Fig. 1 shows a schematic diagram of the application architecture of the endoscopic image feature learning model training and the endoscopic image classification method in the embodiment of the present disclosure
  • Fig. 2 shows a schematic diagram of a traditional contrastive learning network architecture based on SimCLR
  • Fig. 3 shows an overall exemplary block diagram of a conventional Vision Transformer model
  • Figure 4 shows a schematic diagram of ViT in Figure 3 flattening the original image into a sequence
  • FIG. 5 shows a schematic diagram of the multi-head pooling attention (MHPA) module in the encoder block of the multi-scale Vision Transformer
  • Figure 6A shows an endoscopic image of the ileocecal region according to an embodiment of the present disclosure
  • Figure 6B shows an endoscopic image of the non-ileocecal region
  • FIG. 7A shows a schematic structure of an endoscopic image feature learning model 700A based on contrastive learning according to an embodiment of the present disclosure
  • Figure 7B shows an embodiment in which the encoder in model 700A is a multi-scale Vision Transformer
  • FIG. 7C shows an example model that further performs local contrastive learning on feature outputs of the same scale, on the basis of the model 700A in FIG. 7A
  • FIG. 8 shows a flow chart of a method 800 for training an endoscopic image feature learning model based on multi-scale contrastive learning according to an embodiment of the present disclosure
  • FIG. 9 shows a more specific exemplary description of the step in FIG. 8 of performing unsupervised contrastive learning on the endoscope image feature learning model based on the first training data set (step S805);
  • Figure 10 illustrates how to calculate the local contrast loss value of the contrastive learning submodule i based on the locally mapped features
  • Fig. 11 depicts a flowchart of the training method of the endoscopic image classification model according to an embodiment of the present disclosure
  • FIG. 12 depicts a flowchart of a method for classifying endoscopic images in an embodiment of the present disclosure
  • Fig. 13 shows a schematic structural diagram of an endoscope image classification system in an embodiment of the present disclosure
  • FIG. 14 shows a training device for an endoscope feature learning model according to an embodiment of the present disclosure
  • Fig. 15 shows a training device for an endoscopic image classification model according to an embodiment of the present disclosure.
  • FIG. 16 shows a schematic diagram of a storage medium according to an embodiment of the present disclosure.
  • any number of different modules may be used and run on the user terminal and/or the server.
  • the modules are illustrative only, and different aspects of the systems and methods may use different modules.
  • This disclosure proposes an endoscope feature learning model based on multi-scale contrastive learning, which extracts features from input endoscopic images at different scales and performs contrastive learning on the basis of the feature representations at different scales, so that the characteristics of endoscopic images can be better learned.
  • FIG. 1 shows a schematic diagram of an application architecture of an endoscopic image feature learning model training and an endoscopic image classification method in an embodiment of the present disclosure, including a server 100 and a terminal device 200 .
  • the terminal device 200 may be a medical device, for example, a user may view endoscopic image classification results based on the terminal device 200 .
  • the terminal device 200 and the server 100 may be connected through the Internet to realize mutual communication.
  • the aforementioned Internet uses standard communication technologies and/or protocols.
  • The network is usually the Internet, but can be any network, including but not limited to a Local Area Network (LAN), Metropolitan Area Network (MAN), Wide Area Network (WAN), mobile network, wired or wireless network, private network, or any combination of virtual private networks.
  • data exchanged over a network is represented using technologies and/or formats including Hyper Text Markup Language (HTML), Extensible Markup Language (XML), and the like.
  • In addition, all or some of the communication links may be encrypted using conventional encryption technologies such as Secure Sockets Layer (SSL), Transport Layer Security (TLS), Virtual Private Network (VPN), and Internet Protocol Security (IPsec).
  • customized and/or dedicated data communication technologies may also be used to replace or supplement the above data communication technologies.
  • the server 100 may provide various network services for the terminal device 200, wherein the server 100 may be a server, a server cluster composed of several servers, or a cloud computing center.
  • The server 100 may include a processor 110 (Central Processing Unit, CPU), a memory 120, an input device 130, an output device 140, etc.
  • the input device 130 may include a keyboard, a mouse, a touch screen, etc.
  • The output device 140 may include a display device, such as a liquid crystal display (Liquid Crystal Display, LCD), a cathode ray tube (Cathode Ray Tube, CRT), and so on.
  • the memory 120 may include Read Only Memory (ROM) and Random Access Memory (RAM), and provides program instructions and data stored in the memory 120 to the processor 110 .
  • the memory 120 can be used to store the program of the training method of the endoscope image feature learning model, the training method of the endoscope image classification model or the endoscope image classification method in the embodiment of the present disclosure.
  • The processor 110 calls the program instructions stored in the memory 120 and, according to the obtained program instructions, executes the steps of any training method of the endoscopic image feature learning model, training method of the endoscopic image classification model, or endoscopic image classification method in the embodiments of the present disclosure.
  • The training method of the endoscopic image feature learning model, the training method of the endoscopic image classification model, and the endoscopic image classification method are mainly executed on the server 100 side; taking the endoscopic image classification method as an example:
  • the terminal device 200 can send the collected endoscopic image of the digestive tract (for example, an image of the ileocecal portion) to the server 100
  • the server 100 can identify the type of the endoscopic image of the digestive tract and return the recognition result to the terminal device 200.
  • the application architecture shown in Figure 1 is described by taking the application on the server 100 side as an example.
  • the method in the embodiment of the present disclosure can also be executed by the terminal device 200.
  • The terminal device 200 can obtain the trained endoscopic image classification model from the server 100 side and use it to identify the type of endoscopic images and obtain a classification result, which is not limited in this embodiment of the present disclosure.
  • Various embodiments of the present disclosure are schematically described by taking the application architecture diagram shown in FIG. 1 as an example.
  • Contrastive learning is a kind of unsupervised learning. It does not require manual labeling of category information; instead, it directly uses the data itself as supervisory information to learn a feature expression of the sample data, which is then used for downstream tasks, for example, the task of classifying ileocecal image types.
  • Representations are learned by making comparisons between input samples. Instead of learning a signal from a single data sample at a time, contrastive learning learns by comparing different samples. Comparisons are made between positive pairs of "similar" inputs and negative pairs of "dissimilar" inputs. Contrastive learning proceeds by simultaneously maximizing the consistency between different transformed views of the same image (e.g. cropping, flipping, color transformation, etc.) and minimizing the consistency between transformed views of different images.
  • FIG. 2 shows a schematic diagram of a traditional SimCLR-based comparative learning network architecture.
  • the traditional SimCLR model architecture consists of two symmetrical branches (Branches). As shown in the figure, the upper and lower branches are respectively symmetrically provided with encoders and nonlinear mappers. SimCLR proposes a way to construct positive and negative examples.
  • Specifically, L images (L is a positive integer greater than 1) are randomly selected to form a batch, and each image x_i is subjected to two random image enhancements (such as cropping, flipping, color transformation, and Gaussian blur) to obtain 2L enhanced images. The transformed data pair ⟨x′_i, x′′_i⟩ of image x_i are mutually positive examples, while x′_i and the other 2L-2 enhanced images are mutually negative examples.
  • the augmented image is projected into the representation space.
  • The upper branch is taken as an example.
  • The enhanced image x′_i is first transformed by the feature encoder (generally using a deep residual network (ResNet) as the model structure, represented here by the function f_θ(·)) into the corresponding feature representation h′_i.
  • The nonlinear projector (Projector) is composed of a two-layer multi-layer perceptron (MLP), represented here by the function g_θ(·), and maps the feature representation h′_i into a vector z′_i in another space.
  • Unsupervised learning of image features is achieved by computing and maximizing the similarity between positive mapped features and minimizing the similarity between negative mapped features.
  • SimCLR uses cosine similarity to calculate the similarity between two enhanced images. For two enhanced images x′_i and x′′_i, the cosine similarity is calculated on their projected (i.e., mapped) representations z′_i and z′′_i.
  • The enhanced pair of images here, such as ⟨x′_i, x′′_i⟩, can be called a pair of positive examples; ideally the similarity between them is high, while the similarity between either of them and the other images in the two batches is low.
  • the loss function of contrastive learning can be defined based on the similarity between positive and negative examples.
  • SimCLR uses the contrastive loss InfoNCE, as shown in the following equation (1):

$$ \mathcal{L}_i = -\log \frac{\exp\left(z_i \cdot z_{j(i)} / \tau\right)}{\sum_{a \in I \setminus \{i\}} \exp\left(z_i \cdot z_a / \tau\right)} \qquad (1) $$

where $z_i$ denotes a feature after nonlinear mapping, $z_{j(i)}$ denotes the positive example corresponding to $z_i$, $z_a$ ranges over all features other than $z_i$ (including positive and negative examples), $I$ denotes the set of all (enhanced) images, $(\cdot)$ denotes the dot product operation, and $\tau$ is the temperature parameter, which is used to prevent falling into a local optimum in the early stage of model training and to help convergence as training proceeds.
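  • A small PyTorch sketch of this loss follows; it assumes the 2L mapped features are stacked so that rows i and i+L are the two views of the same image (this layout, the function name info_nce, and the temperature value are assumptions of the sketch, not of the disclosure).

```python
# Sketch of the InfoNCE loss of equation (1) over 2L mapped features.
import torch
import torch.nn.functional as F

def info_nce(z, temperature=0.1):
    # z: (2L, d); rows i and i + L are the two augmented views of the same image.
    n = z.shape[0]
    z = F.normalize(z, dim=1)                    # dot product becomes cosine similarity
    sim = z @ z.t() / temperature                # (2L, 2L) similarity matrix
    mask = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float('-inf'))   # exclude z_i itself from the denominator
    pos = torch.arange(n, device=z.device).roll(n // 2)   # index of the positive j(i)
    return F.cross_entropy(sim, pos)
```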
  • In computer vision tasks, image features are generally extracted first. This part is the foundation of the entire CV task, because subsequent downstream tasks (such as classification, generation, etc.) are based on the extracted image features, so this part of the network structure is called the backbone network.
  • Traditional contrastive learning models generally employ deep residual networks as encoders to extract image-level features and perform contrastive learning based on the extracted image-level features.
  • This disclosure proposes a new multi-scale contrastive learning model, which acquires feature representations of the same image at different scales and performs contrastive learning based on the feature representations at different scales.
  • Multi-scale image technology is also called multi-resolution technology (MRA), which refers to the use of multi-scale expressions for images and separate processing at different scales.
  • the so-called multi-scale is actually the sampling of different granularities of the signal.
  • different features can be observed at different scales, so as to complete different tasks.
  • There are two main ways to handle multi-scale in vision tasks: the image pyramid and the feature pyramid. Among them, the feature pyramid obtains receptive fields of different sizes through convolution kernels of different sizes and pooling, so as to obtain feature representations at different scales.
  • the embodiment of the present disclosure takes multi-scale Vision Transformer (Multi-scale ViT) as an example, which is used as an exemplary network for obtaining feature representations of different scales of the same input image.
  • the multi-scale Vision Transformer encoder block adds a pooling layer to the traditional Transformer encoder block to pool the input image features into smaller scale features. By cascading multiple multi-scale Vision Transformer encoder blocks, multiple feature representations of different scales can be obtained.
  • Fig. 3 shows an overall exemplary block diagram of a conventional Vision Transformer (ViT) model.
  • ViT divides the original image into a grid of squares, and flattens each square into a single vector by concatenating all pixel channels in the square and then linearly projecting them to the desired input dimension using a linear mapper.
  • ViT is agnostic to the order of the input elements, so a position encoder is further used to add a learnable position embedding to each square vector, so that the model can understand the image structure.
  • Each Vision Transformer encoder block includes multi-head attention (Multi-Head Attention, MHA) modules and multi-layer perceptron (Multi-Layer Perceptron, MLP) modules arranged at intervals, where each multi-head attention module and each multi-layer perceptron module is preceded by a layer normalization module.
  • Figure 4 shows a schematic diagram of ViT in Figure 3 flattening the original image into a sequence.
  • the image input into ViT is a polyp white light image of H ⁇ W ⁇ C, where H and W are the number of pixels in the length and width directions, respectively, and C is the number of channels.
  • N is the length of the final embedding sequence
  • D is the dimension of each vector of the embedding sequence, where each D-dimensional vector represents a feature of a corresponding region, for example, N ⁇ D here correspond to N regions, respectively.
  • a position encoder is used to add position information to the sequence, and the dimensionality of the position-encoded input vector does not change in any way.
  • the sequence after adding the position information can be input into the Transformer encoder for feature extraction.
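  • A minimal sketch of this flattening and position-encoding step in PyTorch follows; the patch size, embedding dimension, and the use of a strided convolution as the linear mapper are illustrative assumptions.

```python
# Patch embedding sketch: flatten each square into a vector, project it linearly,
# and add a learnable position embedding.
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        n = (img_size // patch_size) ** 2        # N squares
        # A strided convolution is equivalent to flattening each square and
        # applying a shared linear projection.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, n, embed_dim))

    def forward(self, x):                             # x: (B, C, H, W)
        x = self.proj(x).flatten(2).transpose(1, 2)   # (B, N, D)
        return x + self.pos_embed                     # dimensionality is unchanged
```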
  • the multi-head attention (MHA) module in the traditional Vision Transformer encoder block is replaced by a multi-head pooling attention (Multi-head Pooling Attention, MHPA) module, by adding a pooling layer to Get smaller scale features.
  • The final output feature is obtained by concatenating the pooled original input feature with the feature further computed by the pooled attention module. Compared to the input size HW × D, the output feature is changed in scale (in this case, smaller) and the dimensionality of each vector is doubled.
  • the multi-scale Vision Transformer can pool the features of the input image into smaller scales.
  • Since each encoder block obtains smaller-scale features from the input features it receives, sequentially connecting multiple multi-scale Vision Transformer encoder blocks yields feature representations of the same input sample image at different scales.
  • the features extracted by these multi-scale ViTs can be connected to downstream task modules for further feature extraction or image recognition or segmentation.
  • the method for training the endoscopic image feature learning model based on contrastive learning in the embodiment of the present application further performs contrastive learning based on features extracted by multi-scale ViT.
  • embodiments of the present disclosure are not limited thereto, and other network architectures can also be used as the backbone network for multi-scale feature extraction, such as Inception, Deeplab-V3 architecture, etc., and the present disclosure is not limited here.
  • FIG. 6A shows an endoscopic image of the ileocecal portion according to an embodiment of the disclosure.
  • the endoscope enters the human body through the natural orifice of the human body or through a small surgical incision to obtain relevant endoscopic images, which are subsequently used for diagnosis and treatment of diseases.
  • Figure 6A shows an image of the ileocecal region captured by an endoscope operating in a white light (WL) imaging mode.
  • Figure 6B shows an endoscopic image of the non-ileocecal region.
  • Fig. 7A shows a schematic structure of an endoscopic image feature learning model 700A based on contrastive learning according to an embodiment of the present disclosure.
  • the structure of the endoscopic image feature learning model 700A according to the embodiment of the present disclosure is similar to the traditional SimCLR-based comparative learning network architecture shown in FIG. 2 , and consists of two completely symmetrical branches.
  • an encoder may be a multi-scale Vision Transformer encoder.
  • each multi-scale Vision Transformer encoder block is composed of alternating multi-head pooling attention (MHPA) modules and multi-layer perceptron (MLP) modules.
  • a pooling layer is added to the MHPA module to further pool the scale of the input data.
  • The encoder block of multi-scale ViT can use the pooling layer to pool the N × D feature sequence into Q × D (Q can be, for example, N/4).
  • In this example the scale of the feature is reduced to 1/4, because in the multi-scale Vision Transformer each encoder block pools the original input features, and the pooled features are spliced with the features further pooled and computed by the attention module, so that the final output feature size is 16 × 2048. It should be understood that in other multi-scale encoders the above concatenation may not be performed, and the scaled feature size may be 16 × 1024 (a shape-only sketch is given below).
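  • As a shape-only illustration (the 64-token input and the 1024-dimensional features are assumptions chosen to reproduce the 16 × 2048 and 16 × 1024 sizes mentioned above, not values fixed by the disclosure):

```python
# Shape-only sketch of pooling a token sequence and optionally concatenating it
# with the attention output; sizes are illustrative.
import torch
import torch.nn as nn

x = torch.randn(1, 1024, 8, 8)              # 64 tokens of dimension 1024 on an 8x8 grid
pool = nn.MaxPool2d(kernel_size=2, stride=2)

pooled = pool(x)                            # (1, 1024, 4, 4)  -> 16 tokens
attn_out = torch.randn_like(pooled)         # placeholder for the pooled-attention output
concat = torch.cat([pooled, attn_out], dim=1)

print(pooled.flatten(2).shape)              # torch.Size([1, 1024, 16])  i.e. 16 x 1024
print(concat.flatten(2).shape)              # torch.Size([1, 2048, 16])  i.e. 16 x 2048
```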
  • The model 700A includes two branches, left and right; each branch includes a plurality of encoders connected in sequence, and the output of each encoder is connected to a mapper module (for example, the global mapper module shown in the figure). Since the two branches have exactly the same structure and perform exactly the same processing on different enhanced versions of the same original image, the model 700A can be structurally divided by function into multiple (for example M, where M is an integer greater than 1) contrastive learning submodules. As shown in FIG. 7A, the endoscopic image feature learning model based on multi-scale contrastive learning includes multiple (for example M, where M is an integer greater than 1) sequentially connected contrastive learning submodules 700A_1-700A_M.
  • Each contrastive learning sub-module includes a pair of first encoder and second encoder with the same structure in the two branches and a pair of first mapper module and second mapper with the same structure respectively connected to the pair of encoders module.
  • the encoder here can be used to extract output features at a different scale than the input features.
  • the encoder here can be a multi-scale Vision Transformer encoder block. It should be understood that the encoder for multi-scale feature extraction according to the embodiments of the present disclosure is not limited thereto, and may also include other architectures capable of achieving the same function, such as Inception, Deeplab-V3 architecture, etc., and the present disclosure is not limited here.
  • The mapper module here can be the nonlinear mapper in the traditional SimCLR-based contrastive learning network architecture shown in Figure 2, which is used to further map the feature representation output by the encoder into a vector in another space.
  • the mapper module here is a global mapper module that maps based on image-level features.
  • the mapper module here can be a two-layer fully-connected layer.
  • the enhanced images X' and X" may also need some pre-processing before being input to the first encoder.
  • the encoder in 700A is a multi-scale Vision Transformer encoder block.
  • The input enhanced images X' and X" are divided into tiles of the same size before input, and these tiles are flattened into one-dimensional vectors and then linearly transformed using a linear mapper to compress the dimension. Subsequently, a position encoder is used to add position information to the sequence. Therefore, on the basis of model 700A, model 700B can also include a sequentially connected linear mapper and position encoder in each of the two branches.
  • multiple multi-scale encoders connected in series can generate feature representations at different scales based on the same input image.
  • The embodiments of the present disclosure perform contrastive learning based on feature representations at different scales, so that a better feature learning effect can be achieved compared with an ordinary contrastive learning model.
  • The contrastive learning here is usually performed at the image level; that is, among the images input to the two branches, different enhanced versions of the same image are used as a pair of positive examples, and the rest of the enhanced images are used as negative examples. Learning proceeds by maximizing the consistency between different transformed views of the same image (e.g. cropping, flipping, color transformation, etc.) and minimizing the consistency between transformed views of different images.
  • the embodiment of the present disclosure also proposes a further embodiment, based on the features of each scale of each contrastive learning sub-module, in addition to performing contrastive learning at the image level, further performing contrastive learning at the region level.
  • FIG. 7C shows an example model in which local contrastive learning is further performed on the basis of the model 700A in FIG. 7A , in addition to global contrastive learning for feature outputs of the same scale.
  • the encoder here can be used as a multi-scale encoder that extracts output features at a different scale than the input features.
  • the encoder here can be a multi-scale Vision Transformer encoder block. It should be understood that the encoder here may also include other architectures capable of realizing the same function, such as Inception, Deeplab-V3 architecture, etc., and this disclosure is not limited here.
  • the global mapper here is a global mapper module that maps based on image-level features.
  • the global mapper module here can be a two-layer fully-connected layer.
  • the local mapper here maps regional features individually at the level of each region.
  • the local mapper here can be a two-layer 1 ⁇ 1 convolutional layer, so that the dimension of the feature map after local mapping remains unchanged.
  • The endoscopic image feature learning model provided according to the embodiments of the present disclosure performs global and local contrast simultaneously on a multi-scale basis, and can therefore learn the features of endoscopic images better than conventional contrastive learning.
  • FIG. 8 shows a flowchart of a method 800 for training a multi-scale contrastive learning-based endoscopic image feature learning model according to an embodiment of the present disclosure.
  • The endoscopic image feature learning model here may be the endoscopic image feature learning model 700A shown in FIG. 7A, the endoscopic image feature learning model 700B shown in FIG. 7B, or the endoscopic image feature learning model 700C shown in FIG. 7C.
  • the method 800 for training an endoscope image feature learning model can be executed by a server, and the server can be the server 100 shown in FIG. 1 .
  • a first training data set is obtained, the first training data set includes one or more endoscopic images with objects to be identified and one or more endoscopic images without objects to be identified .
  • the object here can be the ileocecal.
  • the training process of the endoscope image feature learning model here is an unsupervised pre-training process, which is used to learn the features of the data itself. Therefore, these data sets are not labeled.
  • the first training data set here may be prepared by simulating the long-tailed distribution of image types in the ileocecal region in real situations.
  • The endoscopic images of the ileocecal portion only account for a small proportion, and the rest are all endoscopic images of the non-ileocecal portion, so that the entire training data set exhibits a long-tailed distribution.
  • the first training data set here may be obtained by operating an endoscope, downloaded from a network, or obtained by other means, which is not limited in the embodiments of the present disclosure.
  • the number and ratio of the first training data set according to the training method of the endoscope image feature learning model according to the embodiment of the present disclosure may be adjusted according to the actual situation, which is not limited in the present disclosure.
  • embodiments of the present disclosure may also be applicable to feature learning of images of other digestive tract parts or lesions other than the ileocecal region, such as polyps, and the present disclosure is not limited thereto.
  • any other endoscopic images of the digestive tract may be used to construct a data set and train the endoscopic image feature learning model according to the embodiments of the present disclosure.
  • These endoscopic images may be images acquired by the endoscope in any suitable mode, such as narrow-band light images, autofluorescence images, I-SCAN images, and the like.
  • the above various modal images may also be mixed to construct a data set, which is not limited in the present disclosure.
  • step S803 the first training data set is input into the endoscope image feature learning model.
  • L images are randomly selected from the training data set to form a batch of input images.
  • two image-augmented views are generated for each image by an image augmentation method, and these two augmented views constitute a pair of positive examples.
  • Specifically, L endoscopic images are randomly selected from the first training data set; each of the L endoscopic images is subjected to a first image enhancement to obtain L first enhanced endoscopic images corresponding one-to-one to the L endoscopic images, which are input to the first encoder of the first contrastive learning submodule in the endoscopic image feature learning model; and each of the L endoscopic images is subjected to a second image enhancement to obtain L second enhanced endoscopic images corresponding one-to-one to the L endoscopic images, which are input to the second encoder of the first contrastive learning submodule in the endoscopic image feature learning model.
  • image enhancement here can include cropping, flipping, color transformation, and Gaussian blurring.
  • Those skilled in the art should understand that only one enhancement transformation may be performed, with the original L images and the L enhanced images input into the model; the term "first enhancement" is used here merely for convenience of description, and this first enhancement may also consist of not performing any transformation on the image.
  • The training method of the endoscopic image feature learning model can also perform preprocessing before inputting the enhanced images to the encoder. For example, taking the multi-scale Vision Transformer encoder as an example, after the selected batch of input images is enhanced to obtain two batches of enhanced endoscopic images, each image is divided into blocks of the same size, these tiles are flattened into one-dimensional vectors and then linearly transformed using a linear mapper for dimensionality reduction, and subsequently a position encoder is used to add position information to the sequence.
  • step S805 unsupervised contrastive learning is performed on the endoscope image feature learning model based on the first training data set, so as to obtain a trained endoscope image feature learning model.
  • The endoscopic image feature learning model here may include a plurality of contrastive learning submodules connected in sequence, each of which is used to extract feature representations of different scales of the same endoscopic image in the first training data set and to perform contrastive learning based on the extracted feature representations of different scales.
  • The total joint loss function can be the sum of the contrastive loss functions for the features of the multiple different scales (i.e., one for each contrastive learning submodule).
  • The joint loss function is:

$$ \mathcal{L} = \sum_{i=1}^{M} \mathcal{L}^{(i)} $$

where $\mathcal{L}^{(i)}$ is the contrast loss function of any contrastive learning submodule i, and $M$ is the number of contrastive learning submodules.
  • Performing unsupervised contrastive learning on the endoscopic image feature learning model to obtain a trained endoscopic image feature learning model can include: calculating a joint contrast loss value based on the feature output of each contrastive learning submodule i among the M contrastive learning submodules, and adjusting the parameters of the endoscopic image feature learning model based on the joint contrast loss value until the joint contrast loss function of the endoscopic image feature learning model converges, wherein the joint contrast loss function is the sum of the contrast loss functions based on the outputs of the M contrastive learning submodules, as illustrated in the sketch below.
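  • One training iteration could look like the following sketch; model, two_views, and info_nce are the illustrative names used in the earlier sketches and are assumptions rather than the disclosed implementation.

```python
# Sketch of one training step with the joint contrast loss (sum over submodules).
import torch

def train_step(model, optimizer, batch, two_views, info_nce):
    # batch: an iterable of L endoscopic image tensors.
    views1, views2 = zip(*[two_views(img) for img in batch])
    x1, x2 = torch.stack(views1), torch.stack(views2)
    per_scale = model(x1, x2)          # list of (z1_i, z2_i), one per submodule i = 1..M
    # Joint contrast loss: sum of the per-submodule contrast loss values.
    loss = sum(info_nce(torch.cat([z1, z2], dim=0)) for z1, z2 in per_scale)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```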
  • A more specific exemplary description will now be given of the step of performing unsupervised contrastive learning on the endoscopic image feature learning model based on the first training data set (step S805). As shown in FIG. 9, this unsupervised contrastive learning includes the following sub-steps S901-S905, which are illustrated using one iteration as an example.
  • any one of the M contrastive learning sub-modules will be described as a contrastive learning sub-module i, where i ⁇ [1,M].
  • the scale of the extracted image features is the i-th scale.
  • In step S901, based on any one contrastive learning submodule i among the M contrastive learning submodules, the first encoder and the second encoder included therein are used to respectively extract the L first feature representations of the i-th scale corresponding to the L first enhanced endoscopic images and the L second feature representations of the i-th scale corresponding to the L second enhanced endoscopic images.
  • The first encoder and the second encoder here have exactly the same structure and are used to perform feature extraction on the input features corresponding to the input samples of the first branch and of the second branch, respectively; the scale of the extracted features differs from that of the received features.
  • For example, whatever the scale of the input features received by the first encoder, the first encoder extracts features of a scale different from that of the input features; for example, the output features can be reduced to 1/(2^2) of the input scale. It should be understood that 1/(2^2) here is just an example, and the scaling ratio may be any preset value.
  • the encoder here may use pooling to reduce the feature scale, or any other method that can achieve this technical effect, which is not limited in the present disclosure.
  • the features output by the first encoder in each contrastive learning sub-module will enter the first encoder in the contrastive learning sub-module of the next layer.
  • For example, the output features of the first encoder in the first contrastive learning submodule 1 enter the first encoder in the second contrastive learning submodule 2, which further reduces their scale, and so on.
  • the process of the second encoder is exactly the same as that of the first encoder, and will not be repeated here.
  • the encoder here may be a multi-scale Vision Transformer encoder block, and the process of how to perform feature pooling and feature extraction is well known in the art, and details will not be repeated here.
  • multi-scale feature extraction encoder is not limited thereto, and may also include other architectures capable of achieving the same function, such as Inception, Deeplab-V3 architecture, etc., and the present disclosure is not limited here.
  • In step S903, the first mapper module and the second mapper module included therein are used to respectively map the L first feature representations of the i-th scale and the L second feature representations of the i-th scale, so as to obtain the mapped feature representations of the i-th scale corresponding to the L first enhanced endoscopic images and the mapped feature representations of the i-th scale corresponding to the L second enhanced endoscopic images.
  • As described above, the contrastive learning submodule i receives from the upper layer the feature representations of the two batches of endoscopic images (for example, the L first enhanced endoscopic images and the L second enhanced endoscopic images mentioned above) and performs further feature extraction at a different scale.
  • the output of each encoder is connected to the corresponding mapper for mapping, and the contrastive learning calculates the similarity (such as cosine similarity) on the mapped feature representation.
  • the first mapper module and the second mapper module may only include global mappers, such as the first global mapper and the second global mapper, as in the model 700A in Figure 7A above or as in Figure 7B above Shown in Model 700B.
  • These two global mappers are respectively connected to the outputs of the first encoder and the second encoder for global mapping of the features output by the first encoder and the second encoder on an image-level basis.
  • for example, based on the first global mapper and the second global mapper, the L first feature representations of the i-th scale and the L second feature representations of the i-th scale are respectively subjected to global mapping, so as to obtain the L globally mapped first feature representations of the i-th scale corresponding to the L first enhanced endoscopic images and the L globally mapped second feature representations of the i-th scale corresponding to the L second enhanced endoscopic images.
  • the first encoder or the second encoder can be connected to a local mapper in addition to a global mapper, as shown in model 700C in Figure 7C above. These two local mappers are used to perform local feature mapping on the feature representation received from the encoder.
  • the two local mappers further locally map the features output by the first and second encoders on a region-level basis, respectively.
  • the first local mapper and the second local mapper respectively perform local mapping on the L first feature representations of the i-th scale and the L second feature representations of the i-th scale, so as to obtain the L locally mapped first feature representations of the i-th scale corresponding to the L first enhanced endoscopic images and the L locally mapped second feature representations of the i-th scale corresponding to the L second enhanced endoscopic images.
  • in step S905, the contrastive loss value of contrastive learning submodule i is calculated based on the mapped feature representations corresponding to the L first enhanced endoscopic images and the mapped feature representations corresponding to the L second enhanced endoscopic images.
  • contrastive learning uses a mapper to map the feature representation output from the encoder into a vector in another space, and then calculates the cosine similarity between positive and negative examples on the mapped feature representation. Ideally, the similarity between positive examples will be high, and the similarity between positive and negative examples will be low.
  • One embodiment of the present disclosure performs contrastive learning at the image level only.
  • the mapped global features of a pair of augmented versions of the same image are taken as positive examples, and the mapped global features of other images are taken as negative examples.
  • for example, among the L globally mapped first feature representations of the i-th scale corresponding to the L first enhanced endoscopic images and the L globally mapped second feature representations of the i-th scale corresponding to the L second enhanced endoscopic images, the two feature representations that correspond one-to-one are taken as a pair of positive examples, the remaining (2L-2) feature representations are taken as negative examples, and the contrastive loss function is calculated to obtain the contrastive loss value of contrastive learning submodule i.
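  • A minimal sketch of this image-level contrast, assuming PyTorch: the 2L globally mapped feature vectors are L2-normalised, the two views of the same image form a positive pair, and the remaining (2L-2) vectors act as negatives in an InfoNCE-style loss (the temperature value here is an arbitrary placeholder, not a value taken from the disclosure):

```python
import torch
import torch.nn.functional as F

def global_contrastive_loss(z1, z2, temperature=0.1):
    """z1, z2: (L, D) globally mapped features of the first / second enhanced
    views; rows with the same index form a positive pair."""
    L = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)          # (2L, D)
    sim = z @ z.t() / temperature                               # cosine similarities
    mask = torch.eye(2 * L, dtype=torch.bool, device=z.device)
    sim.masked_fill_(mask, float('-inf'))                       # drop self-similarity
    # index of the positive counterpart for each of the 2L rows
    pos = torch.cat([torch.arange(L, 2 * L), torch.arange(0, L)]).to(z.device)
    return F.cross_entropy(sim, pos)                            # InfoNCE over (2L-2) negatives
```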
  • Region-level contrastive learning regards the features output by the encoder as a collection of features from several regions, and locally maps the features of different regions based on a local mapper.
  • the local mapper here can be two layers of 1x1 convolutional modules. Since the size of the 1x1 convolution kernel is only 1x1, there is no need to consider the relationship between the pixel and the surrounding area, and it will not fuse the features of the surrounding area with the current area.
  • the local features of local regions in a pair of enhanced versions of the same image are used as positive examples, and other regions in the same pair of images, as well as all regions in different images are taken as negative examples.
  • the loss function of each contrastive learning submodule i can be the sum of its local contrastive loss function and its global contrastive loss function, i.e. L^(i) = L_local^(i) + L_global^(i).
  • the total joint loss function can be the sum of the contrastive loss functions of all the contrastive learning submodules, i.e. L = Σ_{i=1}^{M} L^(i).
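  • To make the two mapper types concrete, the following hedged sketch (PyTorch; the hidden and output widths are illustrative, not values from the disclosure) shows a two-layer fully connected global mapper operating on image-level vectors and a two-layer 1x1 convolutional local mapper operating on the region-feature map, together with the summation of per-submodule losses into the joint loss:

```python
import torch.nn as nn

class GlobalMapper(nn.Module):
    """Two-layer fully connected projection head for image-level features."""
    def __init__(self, dim, hidden=512, out=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(inplace=True),
                                 nn.Linear(hidden, out))
    def forward(self, x):          # x: (L, dim)
        return self.net(x)

class LocalMapper(nn.Module):
    """Two-layer 1x1 convolutional projection head: each region (spatial
    position) is mapped independently, without mixing neighbouring regions."""
    def __init__(self, dim, hidden=512, out=128):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(dim, hidden, 1), nn.ReLU(inplace=True),
                                 nn.Conv2d(hidden, out, 1))
    def forward(self, x):          # x: (L, dim, h, w)
        return self.net(x)

# joint loss over M contrastive learning submodules:
# loss = sum(global_losses[i] + local_losses[i] for i in range(M))
```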
  • the following describes in detail how to calculate the local contrast loss value of the contrastive learning sub-module i based on the locally mapped features with reference to FIG. 10 .
  • in step S1001, each of the L locally mapped first feature representations of the i-th scale corresponding to the L first enhanced endoscopic images is divided into a first set of S local feature representations of the i-th scale, so as to obtain a first set of (L×S) local feature representations of the i-th scale.
  • for example, suppose that, for one first enhanced endoscopic image, the first encoder in the first contrastive learning submodule outputs a feature Y1 consisting of a sequence of 1×D region vectors.
  • since the local mapping function is a 1×1 convolution, it is not necessary to consider the relationship between the pixels of the current region and the surrounding regions, and the features of the surrounding regions are not fused with the features of the current region; therefore, the locally mapped feature Y1 still has the same shape as the encoder output.
  • Y1 can therefore be regarded as a collection of local features in which each 1×D vector corresponds to one region.
  • several 1×D vectors may also jointly correspond to one larger region; for example, two 1×D vectors can be taken together as the feature of one larger region, in which case Y1 is regarded as a collection of local features corresponding to half as many regions.
  • the present disclosure does not limit the size of feature divisions (ie, region divisions).
  • in step S1003, in the same manner as the first set of S local feature representations is divided, each of the L locally mapped second feature representations of the i-th scale corresponding to the L second enhanced endoscopic images is divided into a second set of S local feature representations of the i-th scale that correspond one-to-one to the first set of S local feature representations of the i-th scale, so as to obtain a second set of (L×S) local feature representations of the i-th scale.
  • This process is exactly the same as dividing the first S local feature representations, and will not be repeated here.
  • in step S1005, two local feature representations that correspond one-to-one between the first (L×S) local feature representations of the i-th scale and the second (L×S) local feature representations of the i-th scale are taken as a pair of positive examples, the remaining (2×(L×S)-2) local feature representations are taken as negative examples, and the contrastive loss function is calculated to obtain the local contrastive loss value.
  • on this basis, the local contrastive loss value is calculated by taking the local features of corresponding local regions of a pair of enhanced versions of the same image as positive examples, and taking the other regions in the same pair of images, as well as all regions in other different images, as negative examples.
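  • The region-level computation of steps S1001–S1005 can be sketched as follows (PyTorch; it is assumed here that the locally mapped features arrive as (L, D, h, w) maps and that each spatial position is treated as one of the S = h·w regions, and the temperature is again a placeholder):

```python
import torch
import torch.nn.functional as F

def local_contrastive_loss(y1, y2, temperature=0.1):
    """y1, y2: (L, D, h, w) locally mapped features of the two enhanced views.
    Each spatial position is one region, giving S = h*w regions per image."""
    L, D, h, w = y1.shape
    # split each map into S = h*w region vectors and stack both views -> (2*L*S, D)
    r1 = y1.permute(0, 2, 3, 1).reshape(L * h * w, D)
    r2 = y2.permute(0, 2, 3, 1).reshape(L * h * w, D)
    r = F.normalize(torch.cat([r1, r2], dim=0), dim=1)
    sim = r @ r.t() / temperature
    n = r.size(0)                                   # n = 2 * L * S
    sim.masked_fill_(torch.eye(n, dtype=torch.bool, device=r.device), float('-inf'))
    # region k of view 1 pairs with region k of view 2 of the same image;
    # the remaining 2*(L*S)-2 region features act as negatives
    pos = torch.cat([torch.arange(n // 2, n), torch.arange(0, n // 2)]).to(r.device)
    return F.cross_entropy(sim, pos)
```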
  • the endoscopic image feature learning model provided according to the embodiments of the present disclosure performs global and local comparisons on a multi-scale basis, and can better learn the features of the endoscopic image compared with conventional comparison learning.
  • the embodiment of the present disclosure further performs supervised classification training based on the encoder in the trained endoscope image feature learning model.
  • the embodiment of the present disclosure also provides a training method for an endoscope image classification model.
  • the method includes:
  • a second training data set is obtained, the second training data set includes one or more endoscopic images with objects to be identified and one or more endoscopic images without objects to be identified, and each endoscopic image is annotated with a label indicating whether the endoscopic image includes the object to be identified.
  • the second training data set here may be prepared by simulating the long-tailed distribution of image types in the ileocecal region in real situations.
  • the endoscopic images of the ileocecal portion account for only a small proportion, and the rest are endoscopic images of non-ileocecal portions, so that the entire training data set presents a long-tailed distribution.
  • the second training data set here may be obtained by operating an endoscope, or by downloading from a network, or by other means, which is not limited in this embodiment of the present disclosure.
  • the endoscopic image classification model of the embodiment of the present disclosure can also be applied to image classification of digestive tract parts or lesions other than the ileocecal region, such as polyps, which is not limited in the present disclosure.
  • the endoscope images in the second training data set may be images acquired by the endoscope in any suitable mode, such as narrow-band light images, autofluorescence images, I-SCAN images, and the like.
  • the above various modal images may also be mixed to construct a data set, which is not limited in the present disclosure.
  • in step S1103, the second training data set is input into the endoscopic image classification model for training until the target loss function of the endoscopic image classification model converges, so as to obtain the trained endoscopic image classification model.
  • the classification model here is the same as the general classification model in this field, including a feature extraction module and a classifier.
  • the feature extraction module is used to extract image features
  • the classifier is used to perform classification prediction based on the extracted image features; a loss value is then calculated based on the prediction results and the ground-truth labels, and the parameters of the endoscopic image classification model are adjusted based on the loss value until the target loss function converges.
  • the feature extraction module of the endoscope image classification model may be the M first encoders or M second encoders in any one of the above-mentioned trained endoscope feature learning models 700A, 700B or 700C.
  • the target loss function here may be a cross-entropy loss function determined based on the final output result of the endoscope image classification model and the labeling label of the image sample.
  • if the second training data set presents a long-tailed distribution reflecting the real situation, the target loss function can be a focal loss function determined from the final output result of the endoscopic image classification model and the annotated label of the image sample, as shown in Equation (5): L_focal = -(1 - p̂)^γ · log(p̂).
  • here p̂ is the predicted probability distribution and γ ≥ 0 is an adjustable weight.
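  • As a rough illustration of Equation (5), the following sketch (PyTorch; a multi-class setting is assumed and the γ value is chosen arbitrarily, not taken from the disclosure) shows one way a focal loss can be computed from the classifier's output logits and the annotated labels so that the abundant easy samples of a long-tailed data set are down-weighted:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, labels, gamma=2.0):
    """logits: (N, num_classes) classifier outputs; labels: (N,) ground truth.
    The (1 - p)^gamma factor shrinks the loss of well-classified samples,
    which dominate a long-tailed training set."""
    log_p = F.log_softmax(logits, dim=1)
    log_pt = log_p.gather(1, labels.unsqueeze(1)).squeeze(1)   # log prob of the true class
    pt = log_pt.exp()
    return (-(1.0 - pt) ** gamma * log_pt).mean()
```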
  • an embodiment of the present disclosure further provides an endoscopic image classification method.
  • the method includes:
  • step S1201 an endoscopic image to be identified is acquired.
  • for example, if the trained image classification model is trained for ileocecal recognition, the acquired endoscopic image to be recognized is a collected ileocecal image or non-ileocecal image.
  • step S1203 the endoscopic image to be recognized is input into a trained endoscopic image classification model to obtain a classification result of the endoscopic image.
  • FIG. 13 is a schematic structural diagram of an endoscope image classification system 1300 in an embodiment of the present disclosure.
  • the endoscopic image classification system 1300 at least includes an image acquisition component 1301 , a processing component 1302 and an output component 1303 .
  • the image acquisition unit 1301, the processing unit 1302 and the output unit 1303 are related medical devices, which can be integrated into the same medical device or divided into multiple devices that are connected and communicate with each other to form a medical system.
  • the image acquisition unit 1301 can be an endoscope
  • the processing unit 1302 and output unit 1303 can be a computer device in communication with the endoscope, etc.
  • the image acquisition component 1301 is used to acquire the image to be recognized.
  • the processing component 1302 is, for example, configured to execute the method steps shown in FIG. 12 , extract image feature information of the image to be recognized, and obtain a classification result of the image to be recognized based on the feature information of the image to be recognized.
  • the output unit 1303 is used to output the classification result of the image to be recognized.
  • FIG. 14 shows a training device 1400 for an endoscope feature learning model according to an embodiment of the present disclosure, which specifically includes a training data set acquisition component 1401 , an input component 1403 and a training component 1405 .
  • the training data set acquiring component 1401 is used to acquire a first training data set, the first training data set includes one or more endoscope images with objects to be identified and one or more endoscope images without objects to be identified image.
  • the input component 1403 is used for inputting the first training data set into the endoscope image feature learning model.
  • the training component 1405 is configured to perform unsupervised comparative learning on the endoscope image feature learning model based on the first training data set, so as to obtain a trained endoscope image feature learning model.
  • each of the plurality of contrastive learning submodules is used to extract a feature representation of a different scale of the same endoscope image in the first training data set, and to perform contrastive learning based on the extracted feature representations of different scales.
  • the multiple contrastive learning submodules include M contrastive learning submodules connected in sequence, wherein any one contrastive learning submodule i in the M contrastive learning submodules includes: An encoder and a second encoder, and a first mapper module and a second mapper module with identical structures, wherein the output of the first encoder is connected to the input of the first mapper module, The output end of the second encoder is connected to the input end of the second mapper module, wherein, the M first encoders in the M contrastive learning submodules are connected sequentially, and the M contrastive learning submodules M second encoders in the module are connected sequentially, wherein, M is an integer greater than or equal to 1, and i ⁇ [1,M].
  • the input component 1403 randomly selects L endoscopic images from the first training data set, performs first image enhancement on each of the L endoscopic images to obtain L first enhanced endoscopic images corresponding one-to-one to the L endoscopic images and inputs them to the first encoder of the first contrastive learning submodule in the endoscopic image feature learning model, and performs second image enhancement on each of the L endoscopic images to obtain L second enhanced endoscopic images corresponding one-to-one to the L endoscopic images and inputs them to the second encoder of the first contrastive learning submodule in the endoscopic image feature learning model.
  • the first image enhancement and the second image enhancement respectively include any two of the following items: keep unchanged, crop, flip, color transform and Gaussian blur.
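  • A hedged sketch of how such a pair of enhanced views might be produced for every sampled image (torchvision transforms are used here as a stand-in; the crop size, jitter strengths and blur kernel are illustrative placeholders, not values from the disclosure):

```python
from torchvision import transforms

# two independent augmentation pipelines; applying them to the same
# endoscopic image yields the first and second enhanced views
first_aug = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
second_aug = transforms.Compose([
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.GaussianBlur(kernel_size=23),
    transforms.RandomResizedCrop(224),
    transforms.ToTensor(),
])

def make_views(image):
    """Return the two enhanced versions of one endoscopic image."""
    return first_aug(image), second_aug(image)
```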
  • the training component 1405 calculates a joint contrast loss value based on the feature output of each contrast learning sub-module i in the M contrast learning sub-modules, and adjusts the endoscope based on the joint contrast loss value. parameters of the image feature learning model until the joint contrastive loss function of the endoscopic image feature learning model converges.
  • the joint contrast loss function is based on the sum of the contrast loss functions of the output of each contrast learning submodule i in the M contrast learning submodules.
  • the training component 1405 includes a feature extraction subcomponent 1405_1, a mapping subcomponent 1405_3 and a loss value calculation subcomponent 1405_5.
  • the feature extraction subcomponent 1405_1, based on any one contrastive learning submodule i of the M contrastive learning submodules, uses the first encoder and the second encoder included therein to respectively extract the L first feature representations of the i-th scale corresponding to the L first enhanced endoscopic images and the L second feature representations of the i-th scale corresponding to the L second enhanced endoscopic images.
  • the mapping subcomponent 1405_3 uses the first mapper module and the second mapper module included therein to respectively perform mapping processing on the L first feature representations of the i-th scale and the L second feature representations of the i-th scale, so as to obtain the mapped feature representations of the i-th scale corresponding to the L first enhanced endoscopic images and the mapped feature representations of the i-th scale corresponding to the L second enhanced endoscopic images.
  • the loss value calculation subcomponent 1405_5 calculates the contrastive loss value of contrastive learning submodule i based on the mapped feature representations of the i-th scale corresponding to the L first enhanced endoscopic images and the mapped feature representations of the i-th scale corresponding to the L second enhanced endoscopic images.
  • the first encoder and the second encoder in any one contrastive learning submodule i perform feature extraction on the received input at a different scale, so that the scale of the feature representations of the i-th scale extracted by the first encoder and the second encoder in any one contrastive learning submodule i is different from the scales of the feature representations extracted by the first encoders and the second encoders in the remaining (M-1) contrastive learning submodules.
  • the mapping subcomponent 1405_3, based on the first global mapper and the second global mapper included in the contrastive learning submodule i, respectively performs global mapping processing on the L first feature representations of the i-th scale and the L second feature representations of the i-th scale, so as to obtain the L globally mapped first feature representations of the i-th scale corresponding to the L first enhanced endoscopic images and the L globally mapped second feature representations of the i-th scale corresponding to the L second enhanced endoscopic images.
  • the first global mapper and the second global mapper are two-layer fully connected modules.
  • the loss value calculation subcomponent 1405_5 combines the first feature representations of the i-th scale after the L global mappings corresponding to the L first enhanced endoscopic images with the L Among the second feature representations of the i-th scale after the L global mappings corresponding to the two enhanced endoscopic images, the two feature representations that correspond one-to-one are taken as a pair of positive examples, and the remaining (2L-2) feature representations As a negative example, calculate the contrastive loss function to obtain the contrastive loss value of the contrastive learning submodule i.
  • alternatively, the mapping subcomponent 1405_3, based on the first global mapper and the second global mapper included in the contrastive learning submodule i, respectively performs global mapping processing on the L first feature representations of the i-th scale and the L second feature representations of the i-th scale, so as to obtain the L globally mapped first feature representations of the i-th scale corresponding to the L first enhanced endoscopic images and the L globally mapped second feature representations of the i-th scale corresponding to the L second enhanced endoscopic images; and, based on the first local mapper and the second local mapper included in the contrastive learning submodule i, respectively performs local mapping on the L first feature representations of the i-th scale and the L second feature representations of the i-th scale, so as to obtain the L locally mapped first feature representations of the i-th scale corresponding to the L first enhanced endoscopic images and the L locally mapped second feature representations of the i-th scale corresponding to the L second enhanced endoscopic images.
  • the first global mapper and the second global mapper are two-layer fully connected modules.
  • the first local mapper and the second local mapper are two-layer 1x1 convolutional modules.
  • the loss value calculation subcomponent 1405_5 takes, among the L globally mapped first feature representations of the i-th scale corresponding to the L first enhanced endoscopic images and the L globally mapped second feature representations of the i-th scale corresponding to the L second enhanced endoscopic images, the two feature representations that correspond one-to-one as a pair of positive examples and the remaining (2L-2) feature representations as negative examples, and calculates the contrastive loss function to obtain a global contrastive loss value.
  • it then divides each of the L locally mapped first feature representations of the i-th scale corresponding to the L first enhanced endoscopic images into a first set of S local feature representations of the i-th scale, obtaining a first set of (L×S) local feature representations of the i-th scale; in the same manner, it divides each of the L locally mapped second feature representations of the i-th scale corresponding to the L second enhanced endoscopic images into a second set of S local feature representations of the i-th scale that correspond one-to-one to the first set; it takes the one-to-one corresponding local feature representations as pairs of positive examples and the remaining (2×(L×S)-2) local feature representations as negative examples, calculates the contrastive loss function to obtain a local contrastive loss value, and adds the global contrastive loss value and the local contrastive loss value to obtain the contrastive loss value of contrastive learning submodule i.
  • the contrastive loss function is a noise contrastive estimation loss function InfoNCE.
  • for example, the first encoder and the second encoder are multi-scale Transformer encoders; the multi-scale Transformer encoders include one or more multi-head pooling attention modules and one or more multi-layer perceptron modules arranged alternately, where each multi-head pooling attention module and each multi-layer perceptron module is preceded by a normalization module.
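  • A compact sketch of the pooling-attention idea (PyTorch; the pooling stride, head count and the choice to pool only the query path are simplifying assumptions, not the patent's exact configuration):

```python
import torch
import torch.nn as nn

class PoolingAttentionBlock(nn.Module):
    """Normalization -> multi-head attention whose query sequence is pooled,
    so the output sequence is shorter than the input -> normalization -> MLP."""
    def __init__(self, dim, heads=8, pool_stride=2):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.pool_q = nn.AvgPool1d(kernel_size=pool_stride, stride=pool_stride)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):                                      # x: (B, N, D)
        h = self.norm1(x)
        q = self.pool_q(h.transpose(1, 2)).transpose(1, 2)     # (B, N/stride, D)
        out, _ = self.attn(q, h, h)                            # attend from pooled queries
        out = out + q                                          # residual on the pooled path
        return out + self.mlp(self.norm2(out))                 # (B, N/stride, D)
```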
  • for example, the object is the ileocecal region.
  • FIG. 15 shows a training device 1500 for an endoscope image classification model according to an embodiment of the present disclosure, which specifically includes a training data set acquisition component 1501 and a training component 1503 .
  • the training data set acquiring component 1501 is used to acquire a second training data set, the second training data set includes one or more endoscope images with objects to be identified and one or more endoscope images without objects to be identified, and each endoscope image is annotated with a label indicating whether the endoscope image includes the object to be identified.
  • the training component 1503 is used to input the second training data set into the endoscopic image classification model for training until the target loss function of the endoscopic image classification model converges, so as to obtain the trained endoscopic image classification model.
  • the endoscopic image classification model includes a feature extraction module and a classifier module connected in sequence, wherein the feature extraction module is based on the aforementioned endoscopic image feature learning model based on multi-scale contrastive learning M first encoders or M second encoders in the endoscopic image feature learning model obtained by the training method, wherein M is an integer greater than 1.
  • for example, the second training data set presents a long-tailed distribution, and the target loss function of the endoscope image classification model includes a focal loss function determined based on the final output result of the endoscope image classification model and the annotated labels of the image samples.
  • an electronic device in another exemplary embodiment is also provided in the embodiments of the present disclosure.
  • the electronic device in the embodiments of the present disclosure may include a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein, when the processor executes the program, the steps of the endoscopic image feature learning model training method or the endoscopic image recognition method in the above embodiments may be implemented.
  • for example, taking the electronic device as the server 100 in FIG. 1 as an example, the processor in the electronic device is the processor 110 in the server 100, and the memory in the electronic device is the memory 120 in the server 100.
  • FIG. 16 shows a schematic diagram 1600 of a storage medium according to an embodiment of the disclosure.
  • computer-executable instructions 1601 are stored on the computer-readable storage medium 1600 .
  • when the computer-executable instructions 1601 are run by a processor, the method for training an endoscopic image feature learning model based on contrastive learning and the method for classifying endoscopic images according to the embodiments of the present disclosure described with reference to the above figures can be executed.
  • the computer readable storage medium includes, but is not limited to, for example, volatile memory and/or nonvolatile memory.
  • the volatile memory may include, for example, random access memory (RAM) and/or cache memory (cache).
  • the non-volatile memory may include, for example, a read-only memory (ROM), a hard disk, a flash memory, and the like.
  • Embodiments of the present disclosure also provide a computer program product or computer program, the computer program product or computer program including computer instructions stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instruction from the computer-readable storage medium, and the processor executes the computer instruction, so that the computer device executes the method for training the endoscopic image feature learning model based on contrastive learning according to an embodiment of the present disclosure and Classification methods for endoscopic images.


Abstract

一种内窥镜图像特征学习模型、分类模型的训练方法和装置。所述方法包括:获取第一训练数据集,所述第一训练数据集包括一个或多个具有待识别对象的内窥镜图像和一个或多个不具有待识别对象的内窥镜图像;将所述第一训练数据集输入到所述内窥镜图像特征学习模型;以及基于所述第一训练数据集对所述内窥镜图像特征学习模型进行无监督的对比学习,以获得训练完成的内窥镜图像特征学习模型,其中,所述内窥镜图像特征学习模型包括多个对比学习子模块,所述多个对比学习子模块的每一个用于提取所述第一训练数据集中的同一内窥镜图像的不同尺度的特征表示,并基于所提取的不同尺度的特征表示进行对比学习。

Description

内窥镜图像特征学习模型、分类模型的训练方法和装置
本申请要求于2021年10月26日提交的中国专利申请第202111248801.6的优先权,该中国专利申请的全文通过引用的方式结合于此以作为本申请的一部分。
技术领域
本申请涉及人工智能领域,具体涉及一种基于对比学习的内窥镜图像特征学习模型的训练方法、内窥镜图像分类模型的训练方法、内窥镜图像分类方法、装置及计算机可读介质。
背景技术
大多数结直肠癌开始于结直肠内膜表面的赘生物,称为息肉,而有些息肉可以发展为癌症。因此,早期诊断成为直肠癌防治关键一环。肠镜检查是预防和诊断肠道癌症的首选检查方法,部分消化道早期癌内镜下微创治疗可达到治愈性切除的目的。结肠镜是利用电子肠镜经***,经过直肠、乙状结肠,到达回盲部,从黏膜侧观察结肠病变(如炎症、肿瘤等)的过程。回盲部是回肠末端与盲肠互相交接的部位,称回盲部,回盲部是肠管的炎症(如周围炎、憩室炎等)、肿瘤、套叠等疾病的好发部位,而盲肠与阑尾又是回盲部的主要器官。因此在内镜检查过程中,对回盲部的识别至关重要。
为了减轻医生的负担,有一些工作尝试研究使用深度学习的方式自动化地实现对回盲部的识别。然而这些工作仅使用了简单的卷积神经网络,且都是基于全监督的方法,即需要大量标注数据。而现有的内镜影像的标注数据集主要集中于息肉等病变标注,很少有关于回盲部的标注,而单独为这一任务进行大量回盲部的标注是费时费力的。
现有的对回盲部进行识别的研究工作基本基于全监督的卷积神经网络。它们通常使用一个现成的卷积神经网络,如ResNet、VGG、Inceptionv3等。少数工作在这些现成的模型上稍加修改,如使用预训练的模型微调。然而,它们使用的预训练模型通常是基于现成的在自然图像上预训练好的结果,由 于医学图像和自然图像的差异,这类预训练模型无法很好的学习到内镜影像的特征。
近年来,使用基于对比学习的自监督学习来进行预训练的工作取得了巨大的发展。对比学习着重于学习同类实例之间的共同特征,区分非同类实例之间的不同之处。它不需要关注实例上繁琐的细节,只需要在抽象语义级别的特征空间上学会对数据的区分即可,因此模型以及其优化变得更加简单,且泛化能力更强。对比损失可以最大化正样本之间的互信息并最小化负样本之间的互信息。最近,对比学习的思想已经被应用于医学领域。然而,这类方法仅在图像级别进行对比学习的学习,而没有学习到不同尺度下不同级别的特征。
因此,期望一种改进的内窥镜图像特征学习模型的训练方法,在标注数据有限的情况下,能够在大量无标注的数据上更好的学习到影像本身的抽象语义级别的特征。
发明内容
考虑到以上问题而做出了本公开。本公开的一个目的是提供一种基于对比学习的内窥镜图像特征学习模型的训练方法、内窥镜图像分类模型的训练方法、内窥镜图像分类方法、装置及计算机可读介质。
本公开的实施例提供了一种基于多尺度对比学习的内窥镜图像特征学习模型的训练方法,所述方法包括:获取第一训练数据集,所述第一训练数据集包括一个或多个具有待识别对象的内窥镜图像和一个或多个不具有待识别对象的内窥镜图像;将所述第一训练数据集输入到所述内窥镜图像特征学习模型;以及基于所述第一训练数据集对所述内窥镜图像特征学习模型进行无监督的对比学习,以获得训练完成的内窥镜图像特征学习模型,其中,所述内窥镜图像特征学习模型包括多个对比学习子模块,所述多个对比学习子模块的每一个用于提取所述第一训练数据集中的同一内窥镜图像的不同尺度的特征表示,并基于所提取的不同尺度的特征表示进行对比学习。
例如,根据本公开的实施例的方法,其中,所述多个对比学习子模块包括依次连接的M个对比学习子模块,其中,所述M个对比学习子模块中的任意一个对比学习子模块i都包括:结构完全相同的第一编码器和第二编码 器、以及结构完全相同的第一映射器模块和第二映射器模块,其中,所述第一编码器的输出端连接到所述第一映射器模块的输入端,所述第二编码器的输出端连接到所述第二映射器模块的输入端,其中,所述M个对比学习子模块中的M个第一编码器依次连接,所述M个对比学习子模块中的M个第二编码器依次连接,其中,所述M为大于或等于1的整数,所述i∈[1,M]。
例如,根据本公开的实施例的方法,其中,将所述第一训练数据集输入到所述内窥镜图像特征学习模型包括:在每次迭代训练时:从所述第一训练数据集中随机选取L个内窥镜图像,将所述L个内窥镜图像中的每一个进行第一图像增强,得到与所述L个内窥镜图像一一对应的L个第一增强型内窥镜图像,并输入到所述内窥镜图像特征学习模型中第一个对比学习子模块的第一编码器;以及将所述L个内窥镜图像中的每一个进行第二图像增强,得到与所述L个内窥镜图像一一对应的L个第二增强型内窥镜图像,并输入到所述内窥镜图像特征学习模型中第一个对比学习子模块的第二编码器,其中,所述L为大于1的正整数。
例如,根据本公开的实施例的方法,其中,所述第一图像增强和第二图像增强分别包括以下各项中任意两个:保持不变、剪裁、翻转、颜色变换和高斯模糊。
例如,根据本公开的实施例的方法,其中,基于所述第一训练数据集对所述内窥镜图像特征学习模型进行无监督的对比学习,以获得训练完成的内窥镜图像特征学习模型包括:基于所述M个对比学习子模块中的每一个对比学习子模块i的特征输出,计算联合对比损失值,并基于所述联合对比损失值调整所述内窥镜图像特征学习模型的参数,直到所述内窥镜图像特征学习模型的联合对比损失函数收敛,其中,所述联合对比损失函数是基于所述M个对比学习子模块中的每一对比学习子模块i的输出的对比损失函数之和。
例如,根据本公开的实施例的方法,其中,基于所述第一训练数据集对所述内窥镜图像特征学习模型进行无监督的对比学习包括:基于所述M个对比学习子模块中的任意一个对比学习子模块i,利用其中所包括的第一编码器和第二编码器,分别提取与所述L个第一增强型内窥镜图像相对应的L个第i尺度的第一特征表示和与所述L个第二增强型内窥镜图像相对应的L个第i尺度的第二特征表示;利用其中所包括的第一映射器模块和第二映射器 模块,分别对所述L个第i尺度的第一特征表示和所述L个第i尺度的第二特征表示进行映射处理,以得到与所述L个第一增强型内窥镜图像相对应的映射后的第i尺度的特征表示和与所述L个第二增强型内窥镜图像相对应的映射后的第i尺度的特征表示;以及基于与所述L个第一增强型内窥镜图像相对应的映射后的第i尺度的特征表示和与所述L个第二增强型内窥镜图像相对应的映射后的第i尺度的特征表示,计算对比学习子模块i的对比损失值,其中,所述任意一个对比学习子模块i中的第一编码器和第二编码器在不同尺度上对所接收的输入进行特征提取,使得任一个对比学习子模块i中的第一编码器和第二编码器所提取的第i尺度的特征表示与其余(M-1)个对比学习子模块中的第一编码器和第二编码器所提取的特征表示的尺度都不相同。
例如,根据本公开的实施例的方法,其中,所述对比学习子模块i中的第一映射器模块包括第一全局映射器,所述对比学习子模块i中的第一编码器的输出端连接到所述对比学习子模块i中的第一全局映射器的输入端;所述对比学习子模块i中的第二映射器模块包括第二全局映射器,所述对比学习子模块i中的第二编码器的输出端连接到所述对比学习子模块i中的第二全局映射器的输入端。
例如,根据本公开的实施例的方法,其中,利用其中所包括的第一映射器模块和第二映射器模块,分别对所述L个第i尺度的第一特征表示和所述L个第i尺度的第二特征表示进行映射处理,以得到与所述L个第一增强型内窥镜图像相对应的映射后的第i尺度的特征表示和与所述L个第二增强型内窥镜图像相对应的映射后的第i尺度的特征表示包括:基于所述对比学习子模块i中包括的所述第一全局映射器和所述第二全局映射器,分别对所述L个第i尺度的第一特征表示和所述L个第i尺度的第二特征表示进行全局映射处理,以得到与所述L个第一增强型内窥镜图像相对应的L个全局映射后的第i尺度的第一特征表示和与所述L个第二增强型内窥镜图像相对应的L个全局映射后的第i尺度的第二特征表示。
例如,根据本公开的实施例的方法,其中,所述第一全局映射器和所述第二全局映射器是两层的全连接模块。
例如,根据本公开的实施例的方法,其中,基于与所述L个第一增强型 内窥镜图像相对应的映射后的第i尺度的特征表示和与所述L个第二增强型内窥镜图像相对应的映射后的第i尺度的特征表示,计算对比学习子模块i的对比损失值包括:将与所述L个第一增强型内窥镜图像相对应的所述L个全局映射后的第i尺度的第一特征表示和与所述L个第二增强型内窥镜图像相对应的所述L个全局映射后的第i尺度的第二特征表示中一一对应的两个特征表示作为一对正例,其余(2L-2)个特征表示作为负例,计算对比损失函数,以得到对比学习子模块i的对比损失值。
例如,根据本公开的实施例的方法,其中,所述对比学习子模块i中的第一映射器模块包括第一全局映射器和第一局部映射器,所述对比学习子模块i中的第一编码器的输出端同时连接到所述对比学习子模块i中的第一全局映射器的输入端和第一局部映射器的输入端;所述对比学习子模块i中的第二映射器模块包括第二全局映射器和第二局部映射器,所述对比学习子模块i中的第二编码器的输出端同时连接到所述对比学习子模块i中的第二全局映射器的输入端和第二局部映射器的输入端。
例如,根据本公开的实施例的方法,其中,利用其中所包括的第一映射器模块和第二映射器模块,分别对所述L个第i尺度的第一特征表示和所述L个第i尺度的第二特征表示进行映射处理,以得到与所述L个第一增强型内窥镜图像相对应的映射后的第i尺度的特征表示和与所述L个第二增强型内窥镜图像相对应的映射后的第i尺度的特征表示包括:基于所述对比学习子模块i中包括的所述第一全局映射器和所述第二全局映射器,分别对所述L个第i尺度的第一特征表示和所述L个第i尺度的第二特征表示进行全局映射处理,以得到与所述L个第一增强型内窥镜图像相对应的L个全局映射后的第i尺度的第一特征表示和与所述L个第二增强型内窥镜图像相对应的L个全局映射后的第i尺度的第二特征表示;以及基于所述对比学习子模块i中包括的所述第一局部映射器和所述第二局部映射器,分别对所述L个第i尺度的第一特征表示和所述L个第i尺度的第二特征表示进行局部映射,以得到与所述L个第一增强型内窥镜图像相对应的L个局部映射后的第i尺度的第一特征表示和与所述L个第二增强型内窥镜图像相对应的L个局部映射后的第i尺度的第二特征表示。
例如,根据本公开的实施例的方法,其中,所述第一全局映射器和所述 第二全局映射器是两层的全连接模块,所述第一局部映射器和所述第二局部映射器是两层1x1的卷积模块。
例如,根据本公开的实施例的方法,其中,基于与所述L个第一增强型内窥镜图像相对应的映射后的特征表示和与所述L个第二增强型内窥镜图像相对应的映射后的特征表示,计算对比学习子模块i的对比损失值包括:将与所述L个第一增强型内窥镜图像相对应的所述L个全局映射后的第i尺度的第一特征表示和与所述L个第二增强型内窥镜图像相对应的所述L个全局映射后的第i尺度的第二特征表示中一一对应的两个特征表示作为一对正例,其余(2L-2)个特征表示作为负例,计算对比损失函数,以得到全局对比损失值;以及将与所述L个第一增强型内窥镜图像相对应的所述L个局部映射后的第i尺度的第一特征表示中的每一个划分为第一S个第i尺度的局部特征表示,以得到第一(L×S)个第i尺度的局部特征表示;以与划分第一S个局部特征表示相同的方式,将与所述L个第二增强型内窥镜图像相对应的所述L个局部映射后的第i尺度的第二特征表示中的每一个划分为与所述第一S个第i尺度的局部特征表示一一对应的第二S个第i尺度的局部特征表示,以得到第二(L×S)个第i尺度的局部特征表示;将所述第一(L×S)个第i尺度的局部特征表示与所述第二L×S个第i尺度的局部特征表示中一一对应的两个局部特征表示作为一对正例,其余(2×(L×S)-2)个局部特征表示作为负例,计算对比损失函数,以得到局部对比损失值;将所述全局对比损失值与所述局部对比损失值相加,以得到对比学习子模块i的对比损失值。
例如,根据本公开的实施例的方法,其中,所述对比损失函数是噪声对比估计损失函数InfoNCE。
例如,根据本公开的实施例的方法,其中,所述第一编码器和所述第二编码器是多尺度Transformer编码器块,所述多尺度Transformer编码器块包括间隔设置的一个或多个多头池化注意力模块和一个或多个多层感知器模块,其中每个多头池化注意力模块和每个多层感知器模块之前设置有模块标准化模块。
例如,根据本公开的实施例的方法,其中,所述对象是回盲部。
本公开的实施例还提供了还提供了一种基于多尺度对比学习的内窥镜图 像特征学习模型的训练装置,所述装置包括:训练数据集获取部件,用于获取第一训练数据集,所述第一训练数据集包括一个或多个具有待识别对象的内窥镜图像和一个或多个不具有待识别对象的内窥镜图像;输入部件,用于将所述第一训练数据集输入到所述内窥镜图像特征学习模型;训练部件,用于基于所述第一训练数据集对所述内窥镜图像特征学习模型进行无监督的对比学习,以获得训练完成的内窥镜图像特征学习模型,其中,所述内窥镜图像特征学习模型包括多个对比学习子模块,所述多个对比学习子模块用于提取同一输入样本的不同尺度的特征表示,并基于所提取的不同尺度的特征表示进行对比学习。
本公开的实施例还提供了还提供了一种内窥镜图像分类模型的训练方法,包括:获取第二训练数据集,所述第二训练数据集包括一个或多个具有待识别对象的内窥镜图像和一个或多个不具有待识别对象的内窥镜图像,所述内窥镜图像标注有标签,用于指示内窥镜图像是否包括待识别对象;将所述第二训练数据集输入到内窥镜图像分类模型中进行训练,直到所述内窥镜图像分类模型的目标损失函数收敛,以获得训练完成的内窥镜图像分类模型,其中,所述内窥镜图像分类模型包括依次连接的特征提取模块和分类器模块,其中所述特征提取模块是根据前面所述的基于多尺度对比学习的内窥镜图像特征学习模型的训练方法所获得的内窥镜图像特征学习模型中的M个第一编码器或M个第二编码器,其中M是大于1的整数。
例如,根据本公开的实施例的方法,其中,所述第二训练数据集呈长尾分布,所述内窥镜图像分类模型的目标损失函数包括:基于所述内窥镜图像分类模型的最终输出结果与图像样本的标注标签而确定的焦点损失函数。
本公开的实施例还提供了还提供了一种内窥镜图像分类模型的训练装置,包括:图像获取部件,用于获取第二训练数据集,所述第二训练数据集包括一个或多个具有待识别对象的内窥镜图像和一个或多个不具有待识别对象的内窥镜图像,所述内窥镜图像标注有标签,用于指示内窥镜图像是否包括待识别对象;训练部件,将所述第二训练数据集输入到内窥镜图像分类模型中进行训练,直到所述内窥镜图像分类模型的目标损失函数收敛,以获得训练完成的内窥镜图像分类模型,其中,所述内窥镜图像分类模型包括依次连接的特征提取模块和分类器模块,其中所述特征提取模块是根据上述基于多尺 度对比学习的内窥镜图像特征学习模型的训练方法所获得的内窥镜图像特征学习模型中的M个第一编码器或M个第二编码器,其中M是大于1的整数。
本公开的实施例提供了还提供了一种内窥镜图像分类方法,包括:获取待识别的内窥镜图像;基于训练好的内窥镜图像分类模型,获得所述内窥镜图像的分类结果;其中,所述训练好的内窥镜图像特征学习模型是基于上述内窥镜图像分类模型的训练方法所获得的。
本公开的实施例提供了还提供了一种内窥镜图像分类***,包括:图像获取部件,用于获取待识别的内窥镜图像;处理部件,基于训练好的内窥镜图像分类模型,获得所述内窥镜图像的分类结果;输出部件,用于输出待识别的内窥镜图像的分类结果,其中,所述训练好的内窥镜图像特征学习模型是基于根据上述内窥镜图像分类模型的训练方法所获得的。
本公开的实施例还提供了一种电子设备,包括存储器和处理器,其中,所述存储器上存储有处理器可读的程序代码,当处理器执行所述程序代码时,执行根据上述方法中任一项所述的方法。
本公开的实施例还提供了一种计算机可读存储介质,其上存储有计算机可执行指令,所述计算机可执行指令用于执行根据上述方法中任一项所述的方法。
附图说明
为了更清楚地说明本公开实施例的技术方案,下面将对本公开实施例的附图作简单地介绍。明显地,下面描述中的附图仅仅涉及本公开的一些实施例,而非对本公开的限制。
图1示出了本公开实施例中内窥镜图像特征学习模型训练及内窥镜图像分类方法的应用架构示意图;
图2示出了传统的基于SimCLR的对比学习网络架构示意图;
图3示出了一个常规的Vision Transformer模型的一个整体示例性框图;
图4示出了图3中的ViT将原始图像展平成序列的示意图;
图5示出了多尺度Vision Transformer的编码器块中多头池化注意力(MHPA)模块的示意图;
图6A示出了根据本公开实施例的回盲部内窥镜图像;
图6B示出了非回盲部的内窥镜图像;
图7A示出了根据本公开实施例的基于对比学习的内窥镜图像特征学习模型700A的示意性结构;
图7B示出了模型700A中的编码器是多尺度Vision Transformer的一个实施例;
图7C示出了在图7A的模型700A的基础上,针对同一尺度的特征输出进一步进行局部对比学习的示例模型;
图8示出了用于训练根据本公开一个实施例的基于多尺度对比学习的内窥镜图像特征学习模型的方法800的流程图;
图9示出了图8中步骤S803中的基于所述第一训练数据集对所述内窥镜图像特征学习模型进行无监督的对比学习的步骤进行更具体的示例性说明;
图10说明如何基于局部映射后的特征来计算对比学习子模块i的局部对比损失值;
图11描述了本公开实施例的内窥镜图像分类模型的训练方法的流程图;
图12描述本公开实施例中内窥镜图像分类方法的流程图;
图13示出了本公开实施例中一种内窥镜图像分类***的结构示意图;
图14示出了根据本公开实施例的内窥镜特征学习模型的训练装置;
图15示出了根据本公开实施例的内窥镜图像分类模型的训练装置;以及
图16示出了根据本公开的实施例的存储介质的示意图。
具体实施方式
下面将结合附图对本申请实施例中的技术方案进行清楚、完整地描述,显而易见地,所描述的实施例仅仅是本申请的部分实施例,而不是全部的实施例。基于本申请实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,也属于本申请保护的范围。
本说明书中使用的术语是考虑到关于本公开的功能而在本领域中当前广泛使用的那些通用术语,但是这些术语可以根据本领域普通技术人员的意图、先例或本领域新技术而变化。此外,特定术语可以由申请人选择,并且在这种情况下,其详细含义将在本公开的详细描述中描述。因此,说明书中使用的术语不应理解为简单的名称,而是基于术语的含义和本公开的总体描述。
虽然本申请对根据本申请的实施例的***中的某些模块做出了各种引用,然而,任何数量的不同模块可以被使用并运行在用户终端和/或服务器上。所述模块仅是说明性的,并且所述***和方法的不同方面可以使用不同模块。
本申请中使用了流程图来说明根据本申请的实施例的***所执行的操作。应当理解的是,前面或下面操作不一定按照顺序来精确地执行。相反,根据需要,可以按照倒序或同时处理各种步骤。同时,也可以将其他操作添加到这些过程中,或从这些过程移除某一步或数步操作。
为了减轻医生的负担,有一些工作尝试研究使用深度学习的方式自动化的实现对回盲部的识别。然而这些工作仅使用了简单的卷积神经网络,且都是基于全监督的方法,即需要大量标注数据。而现有的内镜影像的标注数据集主要集中于息肉等病变标注,很少有关于回盲部的标注,而单独为这一任务进行大量回盲部的标注是费时费力的此外,现有的内窥镜影像识别模型的研究工作基本基于现成的卷积神经网络,这类模型无法很好的学习到内窥镜影像的特征。
因此,本公开提出了一种基于多尺度对比学习的内窥镜特征学习模型,在不同尺度上对输入的内窥镜影像进行特征提取,并在不同尺度的特征表示的基础上进行对比学习,能够更好地学习到内窥镜影像的特征。
图1示出了本公开实施例中内窥镜图像特征学习模型训练及内窥镜图像分类方法的应用架构示意图,包括服务器100、终端设备200。
终端设备200可以是医疗设备,例如,用户可以基于终端设备200查看内窥镜图像分类结果。
终端设备200与服务器100之间可以通过互联网相连,实现相互之间的通信。可选地,上述的互联网使用标准通信技术和/或协议。互联网通常为因特网、但也可以是任何网络,包括但不限于局域网(Local Area Network,LAN)、城域网(Metropolitan AreaNetwork,MAN)、广域网(Wide Area Network,WAN)、移动、有线或者无线网络、专用网络或者虚拟专用网络的任何组合。在一些实施例中,使用包括超文本标记语言(Hyper Text MarkupLanguage,HTML)、可扩展标记语言(Extensible Markup Language,XML)等的技术和/或格式来代表通过网络交换的数据。此外还可以使用诸如安全套接字层(Secure SocketLayer,SSL)、传输层安全(Transport Layer Security,TLS)、 虚拟专用网络(VirtualPrivate Network,VPN)、网际协议安全(Internet Protocol Security,IPsec)等常规加密技术来加密所有或者一些链路。在另一些实施例中,还可以使用定制和/或专用数据通信技术取代或者补充上述数据通信技术。
服务器100可以为终端设备200提供各种网络服务,其中,服务器100可以是一台服务器、若干台服务器组成的服务器集群或云计算中心。
具体地,服务器100可以包括处理器110(Center Processing Unit,CPU)、存储器120、输入设备130和输出设备140等,输入设备130可以包括键盘、鼠标、触摸屏等,输出设备140可以包括显示设备,如液晶显示器(Liquid Crystal Display,LCD)、阴极射线管(Cathode Ray Tube,CRT)等。
存储器120可以包括只读存储器(ROM)和随机存取存储器(RAM),并向处理器110提供存储器120中存储的程序指令和数据。在本公开实施例中,存储器120可以用于存储本公开实施例中内窥镜图像特征学习模型的训练方法、内窥镜图像分类模型的训练方法或内窥镜图像分类方法的程序。
处理器110通过调用存储器120存储的程序指令,处理器110用于按照获得的程序指令执行本公开实施例中任一种内窥镜图像特征学习模型的训练方法、内窥镜图像分类模型的训练方法或内窥镜图像分类方法的步骤。
例如,本公开实施例中,内窥镜图像特征学习模型的训练方法、内窥镜图像分类模型的训练方法或内窥镜图像分类方法主要由服务器100侧执行,例如,针对内窥镜图像分类方法,终端设备200可以将采集到的消化道的内窥镜图像(例如,回盲部图像)发送给服务器100,由服务器100对消化道的内窥镜图像进行类型识别,并可以将识别结果返回给终端设备200。
如图1所示的应用架构,是以应用于服务器100侧为例进行说明的,当然,本公开实施例中的方法也可以由终端设备200执行,例如终端设备200可以从服务器100侧获得训练好的内窥镜图像分类模型,从而基于该内窥镜图像分类模型,对内窥镜影像进行类型识别,获得分类结果,对此本公开实施例中并不进行限制。
另外,本公开实施例中的应用架构图是为了更加清楚地说明本公开实施例中的技术方案,并不构成对本公开实施例提供的技术方案的限制,当然,对于其它的应用架构和业务应用,本公开实施例提供的技术方案对于类似的问题,同样适用。
本公开各个实施例以应用于图1所示的应用架构图为例进行示意性说明。
首先,为了使本领域技术人员能更清楚地理解本公开的原理,下面对本公开所涉及的一些技术术语以及背景知识进行简要的描述。
对比学习:对比学习属于一种无监督学习,特点是不需要人工标注的类别标签信息,直接利用数据本身作为监督信息,来学习样本数据的特征表达,并用于下游任务,例如,对回盲部影像的类型进行分类的任务。在对比学习中,通过在输入样本之间进行比较来学习表示。对比学习不是一次从单个数据样本中学习信号,而是通过在不同样本之间进行比较来学习。可以在“相似”输入的正例对和“不同”输入的负例对之间进行比较。对比学习通过同时最大化同一图像的不同变换视图(例如剪裁,翻转,颜色变换等)之间的一致性,以及最小化不同图像的变换视图之间的一致性来学习的。简单来说,就是对比学习要做到相同的图像经过各类变换之后,依然能识别出是同一张图像,所以要最大化各类变换后图像的相似度(因为都是同一个图像得到的)。应当理解,广义的对比学习不一定是将同一张图像的不同变换作为“相似”的正例,还可以直接选择相似的两张图片作为的正例,而其余不同图像作为负例。通过这样的对比训练,编码器(encoder)能学习到图像的更高层次的通用特征。
图2示出了传统的基于SimCLR的对比学习网络架构示意图。
如图2所示,传统的SimCLR模型架构由对称的两个分支(Branch)构成,如图所示,上下两个分支分别对称地设置有编码器和非线性映射器。SimCLR提出了一种构建正负例的方式,基本思想是:输入一个批次的L(L为大于1的正整数)张图X=x 1,x 2,x 3,…,x L,以其中的某张图像x i来说,对其进行随机变换(图像增强,例如包括剪裁,翻转,颜色变换和高斯模糊等)得到两幅图x′ i和x″ i,那么一个批次的L张图像X经过增强以后得到两个批次的图像X′和X″,这两个批次X′和X″各自包含L张图像,并且这两个批次的图像中之间一一对应。例如,图像x经过变换后的数据对<x′ i,x″ i>互为正例,而x′ i和其余2L-2个图像都互为负例。在经过变换后,增强图像被投影到表示空间。以上分支为例进行说明,增强图像x′ i首先经过特征编码器Encoder(一般采用深度残差网络(Deep residual network,ResNet)做为模型结构,这里以函数f θ(x)代表),被转换成对应的特征表示h′ i。紧随其后, 是另外一个非线性映射器Non-linear Projector(由两层的多层感知器(multi-layer perceptron,MLP)构成,这里以函数g θ(·)代表),进一步将特征表示h′ i映射成另外一个空间里的向量z′ i。这样,经过g θ(f θ(x))两次非线性变换,就将增强图像投影到了表示空间。下分支的过程类似,在此不做赘述。
此外,本领域技术人员应当理解,也可以只做一个增强变换,并将原始图像和其增强后的版本作为一对正例。
通过计算并最大化正例映射特征之间的相似性,并最小化负例映射特征之间的相似性,可以实现对图像特征的无监督学习。SimCLR中用余弦相似度来计算两个增强的图像之间的相似度,对于两个增强的图像x′ i和x″ i,在其投影(即,映射)表示
z′ i和z″ i上计算余弦相似度。在理想情况下,增强后的一对图像(这里可以称为一对正例,例如<x′ i,x″ i>)之间的相似度会很高,而该对图像和两个批次中的其他图像之间的相似度会较低。
可以基于正例与负例之间的相似度来定义对比学习的损失函数,SimCLR使用了一种对比损失InfoNCE,如下等式(1)所示:
L = -∑_{i∈I} log [ exp(z_i·z_{j(i)}/τ) / ∑_{a∈I\{i}} exp(z_i·z_a/τ) ]        (1)
其中,z i表示经过非线性映射之后的特征,z j(i)表示与z i对应的正例,z a表示除了z i的所有其他特征(包括正例和负例)。I表示所有图像。(·)表示点乘操作。τ表示温度参数,用于在模型训练初期防止陷入局部最优解,并随着模型训练帮助收敛。
通过优化上面这个对比损失函数InfoNCE,可以实现最大化正例之间的相似性,同时最小化负例之间的相似性,在一种无监督的环境下可以学到图像的本质特征。
在神经网络中,尤其是计算机视觉(Computer Vision,CV)领域,一般先对图像进行特征提取,这一部分是整个CV任务的根基,因为后续的下游任务都是基于提取出来的图像特征进行(比如分类,生成等等),所以将这一部分网络结构称为主干网络。如上所述,传统的对比损失模型一般采用深度残差网络作为编码器来提取图像级别的特征,并基于所提取的图像级别的特征进行对比学习。
为了更好地学习内窥镜影像的特征,本公开提出了一种新的多尺度对比学习模型,获取同一图像不同尺度上的特征表示,并基于每个不同尺度的特征表示分别进行对比学习。
多尺度特征:多尺度图像技术也叫做多分辨率技术(MRA),指对图像采用多尺度的表达,并且在不同尺度下分别进行处理。所谓多尺度,实际上就是对信号的不同粒度的采样,通常在不同的尺度下可以观察到不同的特征,从而完成不同的任务。要在多尺度情况下对图像进行处理首先要在多尺度情况下对图像进行表达。视觉任务中处理多尺度主要有两类方式:图像金字塔和特征金字塔。其中特征金字塔通过不同大小的卷积核以及池化,获得不同大小的感受野来获得不同尺度下的特征表示。
以下,本公开实施例以多尺度Vision Transformer(Multi-scale ViT)为例,用作获得同一输入图像的不同尺度的特征表示的示例性网络。多尺度Vision Transformer编码器块在传统的Transformer编码器块的基础上增加了一个池化层,用于将输入图像特征池化为更小的尺度特征。通过级联多个多尺度Vision Transformer的编码器块,便可以得到多个不同尺度的特征表示。
首先,图3示出了一个常规的Vision Transformer(ViT)模型的一个整体示例性框图。在进行编码之前,尺度ViT对原始图像分为方块网格,通过连接一个方块中所有像素通道,然后利用线性映射器将其线性投影到所需的输入维度,将每个方块展平为单个向量。尺度ViT与输入元素的结构无关,因此还进一步需要利用位置编码器在每个方块向量中添加可学***的序列输入进Transformer模型的编码器部分(这里的Transformer编码器由多个Transformer编码器块串行堆叠构成,例如图3所示的m个(m×)Transformer编码器块)用以进行特征提取。每个Vision Transformer编码器块包括间隔设置的一个多头注意力(Multi-head Attention,MHA)模块和一个多层感知器(Multi-Layer Perception,MLP)模块,其中每个多头注意力模块和多层感知器模块之前设置有一个层标准化模块。
图4示出了图3中的ViT将原始图像展平成序列的示意图。
如图4所示,输入ViT的图像是一张H×W×C的息肉白光影像图像,其中H和W分别为长和宽方向上的像素数量,C为通道数量。先将图像分 为方块,再进行展平。假设每个方块的长宽为(P×P),那么方块的数目为N=H×W/(P×P),然后对每个图像方块展平成一维向量,每个向量大小为P×P×C,N个方块总的输入向量变换为N×(P×P×C)。接着利用线性映射器对每个向量都做一个线性变换(即全连接层)来进行矩阵变维(reshape),将维度压缩为D,这里称其为图块嵌入(Patch Embedding),就得到了一个N×D的嵌入序列(embedding vector),N是最终得到的嵌入序列的长度,D是嵌入序列的每个向量的维度,其中,每个D维的向量表示一个相应区域的特征,例如,这里的N×D分别对应于N个区域。随后,用一个位置编码器在序列中加入位置信息,经位置编码后的输入向量的维度并不会发生任何变化。接下来便可以将加入了位置信息以后的序列输入到Transformer编码器中进行特征提取。
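(A brief illustrative sketch of the patch-embedding step described above, assuming PyTorch; the image size, patch size P and embedding dimension D used here are placeholders, and a learnable position embedding stands in for the position encoder:)

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an H x W x C image into P x P patches, flatten each patch and
    linearly project it to a D-dimensional vector, then add a learnable
    position embedding to the resulting N = HW / P^2 token sequence."""
    def __init__(self, img_size=224, patch=16, in_ch=3, dim=768):
        super().__init__()
        self.n_patches = (img_size // patch) ** 2
        # a strided convolution is equivalent to "flatten each patch + linear layer"
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, self.n_patches, dim))

    def forward(self, x):                                   # x: (B, C, H, W)
        tokens = self.proj(x).flatten(2).transpose(1, 2)    # (B, N, D)
        return tokens + self.pos
```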
在多尺度Vision Transformer中,传统的Vision Transformer编码器块中多头的注意力(MHA)模块被替换为多头池化注意力(Multi-head Pooling Attention,MHPA)模块,通过在其中添加池化层以获得更小尺度的特征。
如图5所示,示出了多尺度Vision Transformer的编码器块中多头池化注意力(MHPA)模块的示意图。
对于输入特征序列长度为N的D维输入张量X∈R^(HW×D)(其中H和W分别为长和宽方向上的像素数量),和普通的transformer编码器块一样,都是将
X分别乘以三个变换矩阵W q、W k和W v,以得到对应的三个中间张量Q、K、V∈R^(HW×D)。MHPA模块进一步添加了一个池化层,如图5中的Pool Q、Pool K和Pool V所示,用于将特征表示进行池化,以获得更小尺度的特征。例如,如图5所示,中间张量Q、K、V经过池化后变为更短的序列(例如属于R^((HW/4)×D))。此时,输入特征的尺度从HW变为例如HW/4,每个特征向量的维度D保持不变。接下来,基于池化后的中间张量Q、K、V继续进行一系列处理,最终得到的输出特征是把原始输入特征进行池化以后的特征和进一步经过注意力模块进行池化和注意力计算的特征进行拼接,如图所示,输出特征的尺寸为例如(HW/4)×2D。
与输入尺寸HW×D相比,特征在尺度上发生了变化(这里是变小),并且每个向量的维度变为两倍。通过添加池化层,多尺度Vision Transformer可以将输入图像的特征池化为更小的尺度。
可以理解,由于每个编码器块都将在所接收的输入特征的基础上获取更小尺度的特征,那么依次连接多个多尺度Vision Transformer编码器块将会得到同一输入样本图像在不同尺度上的特征表示。这些多尺度ViT所提取的特征可以接入下游任务模块进行进一步的特征提取或进行图像识别或分割等。例如,本申请实施例的基于对比学习的内窥镜图像特征学习模型的训练方法进一步基于多尺度ViT所提取的特征进行对比学习。
应当注意的是,本公开实施例不限于此,还可以利用其它的网络架构来作为多尺度特征提取的主干网络,例如Inception,Deeplab-V3架构等,本公开在此不做限制。
以下以回盲部影像为例,对本公开实施例的基于多尺度对比学习的内窥镜图像特征学习模型的训练方法进行示意性说明。应当注意,本公开实施例提供的技术方案对于其他内窥镜影像同样适用。
图6A示出了根据本公开实施例的回盲部内窥镜图像。
内窥镜经人体的天然孔道,或者是经手术做的小切口进入人体内,获取关于相关的内窥镜图像,这些影像后续被用于疾病的诊断和治疗。如图6A示出了利用在白光(white light,WL)成像模式下操作的内窥镜所拍摄到的回盲部影像。图6B示出了非回盲部的内窥镜图像。通过和图6B的非回盲部影像进行对比可以看出,相对于其他非回盲部的区域,回盲部具有瓣状皱襞,瓣口呈鱼口状。
图7A示出了根据本公开实施例的基于对比学习的内窥镜图像特征学习模型700A的示意性结构。
如图7A所示,根据本公开实施例的内窥镜图像特征学习模型700A的结构和图2所示的传统的基于SimCLR的对比学习网络架构类似,由完全对称的两个分支组成。
例如,根据本公开的一个实施例的编码器可以是多尺度Vision Transformer编码器。在多尺度ViT中,每个多尺度Vision Transformer编码器块由交替的多头池化注意力(Multi-head Pooling Attention,MHPA)模块模块和多层感知器(MLP)模块构成。MHPA模块中添加有池化层,以将输入数据的尺度进行进一步的池化。例如,多尺度ViT的编码器块可以采用池化层将N×D的特征序列池化为Q×D(Q例如可以是
Figure PCTCN2022122056-appb-000014
)。以 Q为
Figure PCTCN2022122056-appb-000015
为例,对于尺寸为64×1024的输入,经过多尺度Transformer编码器块处理之后,将特征的尺度缩小为1/4,由于多尺度Vision Transformer中,每个编码器块把原始输入特征进行池化以后的特征和进一步经过注意力模块进行池化和注意力计算的特征进行拼接,最终得到的输出特征的尺寸为16×2048。应当理解,在其他多尺度编码器中,可以不进行上述拼接过程,则尺度缩放后的特征尺寸可以是16×1024。
应当理解,多尺度Transformer的结构及其进行提取特征的技术在本领域是公知的,在此不做过多赘述。
如图7A所示,模型700A包括左右两个分支,每个分支包括依次连接的多个编码器,每个编码器的输出端连接到一个映射器模块(例如,图中示出为全局映射模块)。由于这两个分支结构完全相同,并且分别基于同一原始图像的不同增强版本进行完全相同的处理,这里按照功能来对模型700A进行结构划分。例如,可以将模型700A划分为多个(例如,M个,这里的M为大于1的整数)对比学习子模块。如图7A,根据本公开实施例的基于多尺度对比学习的内窥镜图像特征学习模型包括依次连接的多个(例如M个,M为大于1的整数)对比学习子模块700A_1-700A_M。每个对比学习子模块包括两个分支中的一对结构相同的第一编码器和第二编码器及分别与一对编码器连接的一对结构相同的第一映射器模块和第二映射器模块。
应当理解,这里使用的序数词“第一”和“第二”仅仅是为了进行区分,而不进行任何重要性或顺序的限定。例如,这里的“第一编码器”和“第二第二编码器”仅仅是为了区分两个不同分支上的编码器。
例如,这里的编码器可以用于提取与输入特征不同尺度的输出特征。例如,这里的编码器可以是多尺度Vision Transformer编码器块。应当理解,根据本公开实施例的用于多尺度特征提取的编码器不限于此,还可以包括其他能够实现相同功能的架构,例如Inception,Deeplab-V3架构等,本公开在此不做限制。
例如,这里的线性映射器模块可以是图2所示的传统的基于SimCLR的对比学习网络架构中的非线性映射器,用于进一步将编码器输出的特征表示映射成另外一个空间里的向量。例如,这里的映射器模块是一个基于图片级 别的特征进行映射的全局映射器模块。例如,这里的映射器模块可以是两层的全连接层。
此外,应当理解,取决于这里的编码器类型,增强图像X’和X”在输入到第一个编码器之前,还可能需要经过一些预处理。
例如,如图7B所示,示出了700A中的编码器是多尺度Vision Transformer编码器块的一个实施例。如上关于Vision Tranformer的相关背景介绍可知,输入的增强图像X’和X”都在输入之前被分割为相同大小的图块,这些图块被展平为一维向量,接着利用线性映射器进行线性变换,以压缩维度。随后,用一个位置编码器在序列中加入位置信息。因此,在模型700A的基础上,模型700B还可以在两个分支中分别包括依次连接的线性映射器和位置编码器。
如前面介绍的,依次串联连接的多个多尺度编码器可以基于同一输入图像生成不同尺度上的特征表示,本公开实施例基于不同尺度上的特征表示进行对比学习,以使得相对于普通的对比学习模型能实现更好的特征学习效果。但是这里进行对比学习通常都是在图像级别进行的,也就是说,在输入两个分支的图像中,以同一张图像的不同增强版本作为一对正例,其余的增强图像作为负例,通过最大化同一图像的不同变换视图(例如剪裁,翻转,颜色变换等)之间的一致性,以及最小化不同图像的变换视图之间的一致性来学习。
本公开实施例还提出了一个进一步的实施例,在基于每一个对比学习子模块的每个尺度的特征的基础上,除了在图像级别进行对比学习以外,还进一步在区域级别进行对比学习。
图7C示出了在图7A的模型700A的基础上,针对同一尺度的特征输出除了进行全局对比学习以外,进一步进行局部对比学习的示例模型。
同样,这里的编码器可以用于提取与输入特征不同尺度的输出特征的多尺度编码器。例如,这里的编码器可以是多尺度Vision Transformer编码器块。应当理解,这里的编码器还可以包括其他能够实现相同功能的架构,例如Inception,Deeplab-V3架构等,本公开在此不做限制。
如上所述,这里的全局映射器是一个基于图片级别的特征进行映射的全局映射器模块。例如,这里的全局映射器模块可以是两层的全连接层。
这里的局部映射器在每个区域的级别上对区域特征进行单独映射。例如,这里的局部映射器可以是两层的1×1卷积层,使得经过局部映射后的特征图维度保持不变。
如此,根据本公开实施例提供的内窥镜图像特征学习模型在多尺度的基础上同时进行全局和局部的对比,相较于常规的对比学习能够更好地学习到内窥镜图像的特征。
图8示出了用于训练根据本公开一个实施例的基于多尺度对比学习的内窥镜图像特征学习模型的方法800的流程图。例如,这里该内窥镜图像特征学习模型是如上参考图7A所示的内窥镜图像特征学习模型700A、图7B所示的内窥镜图像特征学习模型700B或图7C所示的内窥镜图像特征学习模型700C。例如,该内窥镜图像特征学习模型的训练方法800可以由服务器来执行,该服务器可以是图1中所示的服务器100。
首先,在步骤S801中,获取第一训练数据集,所述第一训练数据集包括一个或多个具有待识别对象的内窥镜图像和一个或多个不具有待识别对象的内窥镜图像。
例如,这里的对象可以是回盲部。根据本公告开的一个实施例,这里的内窥镜图像特征学习模型的训练过程是一个无监督的预训练过程,用于学习数据本身的特征,因此,这些数据集没有标注标签。
例如,这里的第一训练数据集可以是模仿真实情况中回盲部图像类型呈现长尾分布的情况所准备的。例如,在本公开的实施例的一个具体实现方式的第一训练数据集中,回盲部的内窥镜图像只占很小的比例,其余的都是非回盲部的内窥镜图像,使得整个训练数据集呈现一种长尾分布。
例如,这里的第一训练数据集可以是通过操作内窥镜获得的,也可以是通过网络下载的方式获取的,也可以通过其他途径获取的,本公开的实施例对此不作限制。
应当理解,根据本公开实施例的内窥镜图像特征学习模型的训练方法的第一训练数据集的数量和比例可以根据实际情况进行调整,本公开对此不做限制。
应当理解,本公开实施例还可以同样适用于除回盲部以外的其他消化道部位或病灶的影像的特征学习,例如息肉等,本公开对此不作限制。
应当理解,如果要针对其他消化道内窥镜影像进行特征学习,这里也可以采用任何其他消化道内窥镜影像来构建数据集并对根据本公开实施例的内窥镜图像特征学习模型进行训练。这些内窥镜影像可以是内窥镜采取任意合适的模式所获取的影像,例如窄带光影像、自发荧光影像、I-SCAN影像等。例如,还可以将以上各种模态影像混合起来构建数据集,本公开对此不作限制。
在步骤S803中,将所述第一训练数据集输入到所述内窥镜图像特征学习模型。
如上所述,在传统的对比学习中,在每次迭代训练时,随机从训练数据集中选取L张图像构成一个批次的输入图像。对于一个批次中的每张图像,通过图像增强方法对每张图像生成两个图像增强视图,这两个增强视图构成一对正例。当然,也可以对每张图像生成一个增强视图,这个增强视图与原始图像构成一对正例。
在对本公开实施例的内窥镜图像特征学习模型进行训练时也是一样。例如,在每次迭代训练时,从所述第一训练数据集中随机选取L个内窥镜图像,将所述L个内窥镜图像中的每一个进行第一图像增强,得到与所述L个内窥镜图像一一对应的L个第一增强型内窥镜图像,并输入到所述内窥镜图像特征学习模型中第一个对比学习子模块的第一编码器;以及将所述L个内窥镜图像中的每一个进行第二图像增强,得到与所述L个内窥镜图像一一对应的L个第二增强型内窥镜图像,并输入到所述内窥镜图像特征学习模型中第一个对比学习子模块的第二编码器。
例如,这里的图像增强可以包括剪裁、翻转、颜色变换和高斯模糊等。此外,本领域技术人员应当理解,也可以只做一个增强变换,并将原始的L个图像和L个增强后的图像输入到模型中。因此,这里使用第一增强是为了便于描述,实际上这个第一增强也可以包括不对图像做任何变换。
作为一个替代实施例,内窥镜图像特征学***为一维向量,接着利用线性映射器进行线性变换,进行维度压缩。随后,用一个 位置编码器在序列中加入位置信息。
在步骤S805中,基于所述第一训练数据集对所述内窥镜图像特征学习模型进行无监督的对比学习,以获得训练完成的内窥镜图像特征学习模型。
根据本公开的实施例,这里的内窥镜图像特征学习模型可以包括依次连接的多个对比学习子模块,多个对比学习子模块的每一个用于提取所述第一训练数据集中的同一内窥镜图像的一个不同尺度的特征表示,并基于所提取的不同尺度的特征表示进行对比学习。
如本领域技术人员所熟知的,机器学习算法通常依赖于对目标函数最大化或者最小化的过程,常常把最小化的函数称为损失函数。
由于本公开实施例的内窥镜图像特征学习模型的训练方法是基于多个尺度的图像特征中的每一个进行对比学习的,因此,总的联合损失函数可以是基于多个不同尺度特征的对比学习(即,每个对比学习子模块)的对比损失函数之和。
对于M个的对比学习子模块,联合损失函数为:
L = ∑_{i=1}^{M} L^(i)        (2)
其中,L (i)为任意一个对比学习子模块i的对比损失函数,M为对比学习子模块的个数。
例如,根据本公开实施例的内窥镜图像特征学习模型的训练方法中,对所述内窥镜图像特征学习模型进行无监督的对比学习,以获得训练完成的内窥镜图像特征学习模型可以包括:基于所述M个对比学习子模块中的每一个对比学习子模块i的特征输出,计算联合对比损失值,并基于所述联合对比损失值调整所述内窥镜图像特征学习模型的参数,直到所述内窥镜图像特征学习模型的联合对比损失函数收敛,其中,所述联合对比损失函数是基于所述M个对比学习子模块中的每一对比学习子模块i的输出的对比损失函数之和。
下面参考图9,来对步骤S803中的基于所述第一训练数据集对所述内窥镜图像特征学习模型进行无监督的对比学习的步骤进行更具体的示例性说明。
如图9所示,步骤S803中的基于所述第一训练数据集对所述内窥镜图像特征学习模型进行无监督的对比学习包括以下子步骤S901-S905。这些步 骤是以一次迭代过程为例进行说明的。
具体地,对于每个对比学习子模块,除了接收的数据尺度不一样,它们所执行的处理都完全一样,最后的联合损失值仅是每个对比学习子模块的损失值的简单相加。因此,下面以M个对比学习子模块中的任意一个对比学习子模块i来进行说明,其中i∈[1,M]。这里假设任意一个对比学习子模块i,其所提取的图像特征的尺度为第i尺度。
在步骤S901,基于所述M个对比学习子模块中的任意一个对比学习子模块i,利用其中所包括的第一编码器和第二编码器,分别提取与所述L个第一增强型内窥镜图像相对应的L个第i尺度的第一特征表示和与所述L个第二增强型内窥镜图像相对应的L个第i尺度的第二特征表示。
如上所述,这里的第一编码器和第二编码器具有完全相同的结构,用于对分别对应于第一分支的输入样本以及对应于第二分支的输入样本的输入特征进行特征提取,并且所提取的特征的尺度与所接收的特征的尺度不同。举例来说,对于第一个对比学习子模块1,假设其中所包括的第一编码器接收的输入特征为X∈R^(HW×D),该第一编码器提取不同于输入特征的尺度的特征,例如,经过第一编码器编码后,输出的特征可以属于R^((HW/2^2)×D)。应当理解,这里的1/(2^2)仅是示例,尺度缩小比例可以是任意预设的值。例如,这里的编码器可以采用池化的方式来缩小特征尺度,也可以采用任何可以实现此技术效果的其他方法,本公开对此不作限制。每个对比学习子模块中的第一编码器输出的特征会进入到下一层的对比学习子模块中的第一编码器。例如,这里第一个对比学习子模块1中第一编码器的输出特征(尺度为HW/2^2)会进入到第二个对比学习子模块2中的第一编码器,该第一编码器进一步缩小尺度,例如输出尺度为HW/2^4的特征,依次类推。第二编码器的过程与第一编码器完全一样,这里不再赘述。
例如,这里的编码器可以是多尺度Vision Transformer编码器块,其如何进行特征池化以及特征提取的过程在本领域是公知的,在此不做过多赘述。
应当理解,根据本公开实施例的多尺度特征提取的编码器不限于此,还 可以包括其他能够实现相同功能的架构,例如Inception,Deeplab-V3架构等,本公开在此不做限制。
在步骤S903,利用其中所包括的第一映射器模块和第二映射器模块,分别对所述L个第i尺度的第一特征表示和所述L个第i尺度的第二特征表示进行映射处理,以得到与所述L个第一增强型内窥镜图像相对应的映射后的第i尺度的特征表示和与所述L个第二增强型内窥镜图像相对应的映射后的第i尺度的特征表示。
与图2所示的传统的基于SimCLR的对比学习网络架构类似,根据本公开实施例的对比学习子模块i基于从上一层接收的两个批次的内窥镜图像(例如,上文提到的L个第一增强型内窥镜图像和L个第二增强型内窥镜图像的输入)的特征表示,进一步在不同尺度上进行特征提取。每个编码器的输出端连接到相应的映射器进行映射,对比学习便在映射后的特征表示上计算相似度(例如余弦相似度)。
在本公开的一个实施例中,提出仅在图像级别进行对比学习。
在这种情况下,这里的第一映射器模块和第二映射器模块可以仅包括全局映射器,例如第一全局映射器和第二全局映射器,如上图7A中的模型700A或如上图7B中的模型700B所示。这两个全局映射器分别连接到第一编码器和第二编码器的输出,用于在图像级别的基础上对第一编码器和第二编码器输出的特征进行全局映射。
例如,基于第一全局映射器和第二全局映射器,分别对L个第i尺度的第一特征表示和L个第i尺度的第二特征表示进行映射处理,以得到与L个第一增强型内窥镜图像相对应的L个全局映射后的第i尺度的第一特征表示和与L个第二增强型内窥镜图像相对应的L个全局映射后的第i尺度的第二特征表示。
在本公开的另一个实施例中,还提出在图像级别进行对比学习的基础上,进一步在区域级别进行对比学习。
例如,对于对比学习子模块i,第一编码器或第二编码器除了连接到一个全局映射器之外,还可以连接到一个局部映射器,如上图7C中的模型700C所示。这两个局部映射器用于对从编码器接收的特征表示进行局部特征的映射。
在这种情况下,这两个局部映射器进一步分别在区域级别的基础上对第一编码器和第二编码器输出的特征进行局部映射。
例如,基于所述第一局部映射器和所述第二局部映射器,分别对所述L个第i尺度的第一特征表示和所述L个第i尺度的第二特征表示进行局部映射,以得到与所述L个第一增强型内窥镜图像相对应的L个局部映射后的第i尺度的第一特征表示和与所述L个第二增强型内窥镜图像相对应的L个局部映射后的第i尺度的第二特征表示。
在步骤S905,基于与所述L个第一增强型内窥镜图像相对应的映射后的特征表示和与所述L个第二增强型内窥镜图像相对应的映射后的特征表示,计算对比学习子模块i的对比损失值。
如上文所述,对比学习利用映射器将从编码器输出的特征表示映射成另外一个空间里的向量,随后便在映射后的特征表示上计算正例和负例之间的余弦相似度。在理想情况下,正例之间的相似度会很高,正例和负例之间的相似度会较低。
本公开的一个实施例仅在图像级别进行对比学习。在这种情况下,将同一图像的一对增强版本的映射后的全局特征作为正例,其他图像的映射后的全局特征作为负例。
例如,将与所述L个第一增强型内窥镜图像相对应的所述L个全局映射后的第i尺度的第一特征表示和与所述L个第二增强型内窥镜图像相对应的所述L个全局映射后的第i尺度的第二特征表示中一一对应的两个特征表示作为一对正例,其余(2L-2)个特征表示作为负例,计算对比损失函数,以得到对比学习子模块i的对比损失值。
本公开的另一个实施例除了图像级别进行对比学习以外,进一步在区域级别进行对比学习。区域级别的对比学习将编码器输出的特征作为若干区域的特征的集合,基于局部映射器分别对不同区域的特征进行局部映射。
例如,这里的局部映射器可以是两层1x1的卷积模块。由于1x1的卷积核大小只有1x1,所以并不需要考虑像素跟周边区域的关系,也并不会将周边区域与当前区域的特征进行融合。
在这种情况下,将同一图像的一对增强版本的局部区域的局部特征作为正例,同一对图像中的其他区域、以及不同图片中的所有区域都作为负例。
此时,每一个对比损失子模块i的损失函数可以是局部对比损失函数与全局对比损失函数之和:
L^(i) = L_local^(i) + L_global^(i)        (3)
同样,由于本公开实施例的内窥镜图像特征学习模型的训练方法是基于多个尺度的图像特征中的每一个进行全局和局部的对比学习的,因此,总的联合损失函数可以是每个对比学习子模块的对比损失函数之和。
对于M个的对比学习子模块,总的联合损失函数为:
L = ∑_{i=1}^{M} ( L_local^(i) + L_global^(i) )        (4)
其中,L_local^(i)为任意一个对比学习子模块i的局部对比损失函数,L_global^(i)为任意一个对比学习子模块i的全局对比损失函数,M为对比学习子模块的个数。
下面结合图10具体说明如何基于局部映射后的特征来计算对比学习子模块i的局部对比损失值。
在步骤S1001,将与所述L个第一增强型内窥镜图像相对应的所述L个局部映射后的第i尺度的第一特征表示中的每一个划分为第一S个第i尺度的局部特征表示,以得到第一(L×S)个第i尺度的局部特征表示。
例如,假设第一个对比学习子模块中第一编码器针对一个第一增强型内窥镜图像输出特征Y 1(由S个1×D的向量构成,例如Y 1∈R^(S×D))。
如上文步骤S905所述,由于局部映射函数是1*1的卷积,所以并不需要考虑当前区域像素跟周边区域的关系,也并不会将周边区域的特征与当前区域的特征融合,因此,经过局部映射后的特征Y 1仍然属于R^(S×D)。
如本领域技术人员所理解的,对于Y 1∈R^(S×D),其中每一个1×D的向量对应于一个区域,因此,根据本公开的实施例,可以将Y 1当成与S个区域相对应的S个局部特征的集合。
此外,应当理解,多个数量的1×D向量可以对应于一个更大区域,例如,可以将两个1×D的向量作为与一个更大区域相对应的特征,此时,可以将Y 1当成与S/2个区域相对应的局部特征的集合。本公开对特征划分(即区域划分)的尺寸不做限制。
在步骤S1003,以与划分第一S个局部特征表示相同的方式,将与所述L个第二增强型内窥镜图像相对应的所述L个局部映射后的第i尺度的第二特征表示中的每一个划分为与所述第一S个第i尺度的局部特征表示一一对应的第二S个第i尺度的局部特征表示,以得到第二(L×S)个第i尺度的局部特征表示。该过程与划分第一S个局部特征表示完全相同,在此不做赘述。
在步骤S1005,将所述第一(L×S)个第i尺度的局部特征表示与所述第二(L×S)个第i尺度的局部特征表示中一一对应的两个局部特征表示作为一对正例,其余(2×(L×S)-2)个局部特征表示作为负例,计算对比损失函数,以得到局部对比损失值。
基于此,通过将同一图像的一对增强版本的局部区域的局部特征作为正例,同一对图像中的其他区域、以及其他不同图片中的所有区域都作为负例,来计算对比损失值。
如此,根据本公开实施例提供的内窥镜图像特征学习模型在多尺度的基础上进行全局和局部的对比,相较于常规的对比学习能够更好地学习到内窥镜图像的特征。
在内窥镜图像特征学习模型的训练完成之后,本公开实施例进一步基于训练好的内窥镜图像特征学习模型中的编码器来进行有监督的分类训练。
本公开实施例还提供了一种内窥镜图像分类模型的训练方法。参考图11来描述本公开实施例中内窥镜图像分类模型的训练方法的流程图,该方法包括:
步骤S1101中,获取第二训练数据集,所述训练数据集包括一个或多个具有待识别对象的内窥镜图像和一个或多个不具有待识别对象的内窥镜图像, 所述内窥镜图像标注有标签,用于指示内窥镜图像是否包括待识别对象。
例如,这里的第二训练数据集可以是模仿真实情况中回盲部图像类型呈现长尾分布的情况所准备的。例如,在本公开的实施例的一个具体实现方式中,回盲部的内窥镜图像只占很小的比例,其余的都是非回盲部的内窥镜图像,使得整个训练数据集呈现一种长尾分布。
例如,这里的第二训练数据集可以是通过操作内窥镜获得的,也可以是通过网络下载的方式获取的,也可以通过其他途径获取的,本公开的实施例对此不作限制。
应当理解,根据本公开实施例的内窥镜图像分类模型的训练方法的训练数据集的数量和比例可以根据实际情况进行调整,本公开对此不做限制。
应当理解,在内窥镜图像特征学习模型是训练为学习其他类型的内窥镜影像的情况下,本公开实施例的内窥镜图像分类模型还可以同样适用于除回盲部以外的其他消化道部位或病灶的影像分类,例如息肉等,本公开对此不作限制。
应当理解,这里的第二训练数据集中的内窥镜影像可以是内窥镜采取任意合适的模式所获取的影像,例如窄带光影像、自发荧光影像、I-SCAN影像等。例如,还可以将以上各种模态影像混合起来构建数据集,本公开对此不作限制。
在步骤S1103中,将所述第二训练数据集输入到内窥镜图像分类模型中进行训练,直到所述内窥镜图像分类模型的目标损失函数收敛,以获得训练完成的内窥镜图像分类模型。
例如,这里的分类模型和本领域普通的分类模型一样,都是包括特征提取模块和一个分类器,特征提取模块用于提取图像特征,分类器用于基于提取的图像特征进行分类预测,再基于预测的结果和真实标签计算损失值,并基于所述损失值调整所述内窥镜图像分类模型的参数,直到目标损失函数收敛。
例如,这里的内窥镜图像分类模型的特征提取模块可以是上述训练好的内窥镜特征学习模型700A、700B或700C的任何一个中的M个第一编码器或M个第二编码器。
例如,这里的目标损失函数可以是基于所述内窥镜图像分类模型的最终 输出结果与图像样本的标注标签而确定的交叉熵损失函数。
例如,若第二训练数据集基于呈现真实情况的长尾分布,这里的目标损失函数可以是所述内窥镜图像分类模型的最终输出结果与图像样本的标注标签而确定的焦点损失函数,如下等式(5)所示:
L_focal = -(1 - p̂)^γ · log(p̂)        (5)
其中,p̂为预测概率分布,γ≥0,为可调节的权重。
基于通过如上方式训练好的内窥镜图像分类模型,本公开实施例还提供了一种内窥镜图像分类方法。参考图12来描述本公开实施例中内窥镜图像分类方法的流程图,该方法包括:
在步骤S1201中,获取待识别的内窥镜图像。
例如,如果所训练的图像分类模型是针对回盲部识别进行训练的,获取的待识别的内窥镜图像即是采集到的回盲部影像或非回盲部影像。
在步骤S1203中,将所述待识别的内窥镜图像输入到训练好的内窥镜图像分类模型中,以获得所述内窥镜图像的分类结果。
基于上述实施例,参阅图13所示,为本公开实施例中一种内窥镜图像分类***1300的结构示意图。该内窥镜图像分类***1300至少包括图像获取部件1301、处理部件1302和输出部件1303。本公开实施例中,图像获取部件1301、处理部件1302和输出部件1303为相关的医疗器械,可以集成在同一医疗器械中,也可以分为多个设备,相互连接通信,组成一个医疗***来使用等,例如针对消化道疾病诊断,图像获取部件1301可以为内镜,处理部件1302和输出部件1303可以为与内镜相通信的计算机设备等。
具体地,图像获取部件1301用于获取待识别图像。处理部件1302例如用于执行图12所示的方法步骤,提取待识别图像的图像特征信息,并基于待识别的图像的特征信息获得待识别图像的分类结果。输出部件1303用于输出待识别图像的分类结果。
图14示出了根据本公开实施例的内窥镜特征学习模型的训练装置1400,具体包括训练数据集获取部件1401、输入部件1403和训练部件1405。
训练数据集获取部件1401用于获取第一训练数据集,所述第一训练数据集包括一个或多个具有待识别对象的内窥镜图像和一个或多个不具有待识别对象的内窥镜图像。输入部件1403用于将所述第一训练数据集输入到所述内 窥镜图像特征学习模型。训练部件1405用于基于所述第一训练数据集对所述内窥镜图像特征学习模型进行无监督的对比学习,以获得训练完成的内窥镜图像特征学习模型。
例如,其中,所述内窥镜图像特征学习模型包括多个对比学习子模块,所述多个对比学习子模块的每一个用于提取所述第一训练数据集中的同一内窥镜图像的不同尺度的特征表示,并基于所提取的不同尺度的特征表示进行对比学习。
例如,其中,所述多个对比学习子模块包括依次连接的M个对比学习子模块,其中,所述M个对比学习子模块中的任意一个对比学习子模块i都包括:结构完全相同的第一编码器和第二编码器、以及结构完全相同的第一映射器模块和第二映射器模块,其中,所述第一编码器的输出端连接到所述第一映射器模块的输入端,所述第二编码器的输出端连接到所述第二映射器模块的输入端,其中,所述M个对比学习子模块中的M个第一编码器依次连接,所述M个对比学习子模块中的M个第二编码器依次连接,其中,所述M为大于或等于1的整数,所述i∈[1,M]。
例如,所述输入部件1403在每次迭代训练时:从所述第一训练数据集中随机选取L个内窥镜图像,将所述L个内窥镜图像中的每一个进行第一图像增强,得到与所述L个内窥镜图像一一对应的L个第一增强型内窥镜图像,并输入到所述内窥镜图像特征学习模型中第一个对比学习子模块的第一编码器;以及将所述L个内窥镜图像中的每一个进行第二图像增强,得到与所述L个内窥镜图像一一对应的L个第二增强型内窥镜图像,并输入到所述内窥镜图像特征学习模型中第一个对比学习子模块的第二编码器,其中,所述L为大于1的正整数。
例如,其中,所述第一图像增强和第二图像增强分别包括以下各项中任意两个:保持不变、剪裁、翻转、颜色变换和高斯模糊。
例如,其中所述训练部件1405基于所述M个对比学习子模块中的每一个对比学习子模块i的特征输出,计算联合对比损失值,并基于所述联合对比损失值调整所述内窥镜图像特征学习模型的参数,直到所述内窥镜图像特征学习模型的联合对比损失函数收敛。
例如,其中,所述联合对比损失函数是基于所述M个对比学习子模块中 的每一对比学习子模块i的输出的对比损失函数之和。
例如,其中所述训练部件1405包括特征提取子部件1405_1、映射子部件1405_3和损失值计算子部件1405_5。
所述特征提取子部件1405_1基于所述M个对比学习子模块中的任意一个对比学习子模块i,利用其中所包括的第一编码器和第二编码器,分别提取与所述L个第一增强型内窥镜图像相对应的L个第i尺度的第一特征表示和与所述L个第二增强型内窥镜图像相对应的L个第i尺度的第二特征表示。所述映射子部件1405_3利用其中所包括的第一映射器模块和第二映射器模块,分别对所述L个第i尺度的第一特征表示和所述L个第i尺度的第二特征表示进行映射处理,以得到与所述L个第一增强型内窥镜图像相对应的映射后的第i尺度的特征表示和与所述L个第二增强型内窥镜图像相对应的映射后的第i尺度的特征表示。所述损失值计算部件1405_5基于与所述L个第一增强型内窥镜图像相对应的映射后的第i尺度的特征表示和与所述L个第二增强型内窥镜图像相对应的映射后的第i尺度的特征表示,计算对比学习子模块i的对比损失值。
例如,其中,所述任意一个对比学习子模块i中的第一编码器和第二编码器在不同尺度上对所接收的输入进行特征提取,使得任一个对比学习子模块i中的第一编码器和第二编码器所提取的第i尺度的特征表示与其余(M-1)个对比学习子模块中的第一编码器和第二编码器所提取的特征表示的尺度都不相同。
例如,所述映射子部件1405_3基于所述对比学习子模块i中包括的所述第一全局映射器和所述第二全局映射器,分别对所述L个第i尺度的第一特征表示和所述L个第i尺度的第二特征表示进行全局映射处理,以得到与所述L个第一增强型内窥镜图像相对应的L个全局映射后的第i尺度的第一特征表示和与所述L个第二增强型内窥镜图像相对应的L个全局映射后的第i尺度的第二特征表示。
例如,其中,所述第一全局映射器和所述第二全局映射器是两层的全连接模块。
例如,所述损失值计算子部件1405_5将与所述L个第一增强型内窥镜图像相对应的所述L个全局映射后的第i尺度的第一特征表示和与所述L个 第二增强型内窥镜图像相对应的所述L个全局映射后的第i尺度的第二特征表示中一一对应的两个特征表示作为一对正例,其余(2L-2)个特征表示作为负例,计算对比损失函数,以得到对比学习子模块i的对比损失值。
例如,所述映射子部件1405_3基于所述对比学习子模块i中包括的所述第一全局映射器和所述第二全局映射器,分别对所述L个第i尺度的第一特征表示和所述L个第i尺度的第二特征表示进行全局映射处理,以得到与所述L个第一增强型内窥镜图像相对应的L个全局映射后的第i尺度的第一特征表示和与所述L个第二增强型内窥镜图像相对应的L个全局映射后的第i尺度的第二特征表示;以及基于所述对比学习子模块i中包括的所述第一局部映射器和所述第二局部映射器,分别对所述L个第i尺度的第一特征表示和所述L个第i尺度的第二特征表示进行局部映射,以得到与所述L个第一增强型内窥镜图像相对应的L个局部映射后的第i尺度的第一特征表示和与所述L个第二增强型内窥镜图像相对应的L个局部映射后的第i尺度的第二特征表示。
例如,其中,所述第一全局映射器和所述第二全局映射器是两层的全连接模块,所述第一局部映射器和所述第二局部映射器是两层1x1的卷积模块。
例如,所述损失值计算子部件1405_5将与所述L个第一增强型内窥镜图像相对应的所述L个全局映射后的第i尺度的第一特征表示和与所述L个第二增强型内窥镜图像相对应的所述L个全局映射后的第i尺度的第二特征表示中一一对应的两个特征表示作为一对正例,其余(2L-2)个特征表示作为负例,计算对比损失函数,以得到全局对比损失值;以及将与所述L个第一增强型内窥镜图像相对应的所述L个局部映射后的第i尺度的第一特征表示中的每一个划分为第一S个第i尺度的局部特征表示,以得到第一(L×S)个第i尺度的局部特征表示;以与划分第一S个局部特征表示相同的方式,将与所述L个第二增强型内窥镜图像相对应的所述L个局部映射后的第i尺度的第二特征表示中的每一个划分为与所述第一S个第i尺度的局部特征表示一一对应的第二S个第i尺度的局部特征表示,以得到第二(L×S)个第i尺度的局部特征表示;将所述第一(L×S)个第i尺度的局部特征表示与所述第二(L×S)个第i尺度的局部特征表示中一一对应的两个局部特征表示作为一对正例,其余(2×(L×S)-2)个局部特征表示 作为负例,计算对比损失函数,以得到局部对比损失值;将所述全局对比损失值与所述局部对比损失值相加,以得到对比学习子模块i的对比损失值。
例如,其中,所述对比损失函数是噪声对比估计损失函数InfoNCE。
例如,其中,所述第一编码器和所述第二编码器是多尺度Transformer编码器,所述多尺度Transformer编码器包括间隔设置的一个或多个多头池化注意力模块和一个或多个多层感知器模块,其中每个多头注意力模块和多层感知器模块之前设置有模块标准化模块。
例如,其中,所述对象是回盲部。
图15示出了根据本公开实施例的内窥镜图像分类模型的训练装置1500,具体包括训练数据集获取部件1501和训练部件1503。
训练数据集获取部件1501用于获取第二训练数据集,所述第二训练数据集包括一个或多个具有待识别对象的内窥镜图像和一个或多个不具有待识别对象的内窥镜图像,所述内窥镜图像标注有标签,用于指示内窥镜图像是否包括待识别对象。训练部件1503用于将所述第二训练数据集输入到内窥镜图像分类模型中进行训练,直到所述内窥镜图像分类模型的目标损失函数收敛,以获得训练完成的内窥镜图像分类模型。
例如,其中,所述内窥镜图像分类模型包括依次连接的特征提取模块和分类器模块,其中所述特征提取模块是根据前面所述的基于多尺度对比学习的内窥镜图像特征学习模型的训练方法所获得的内窥镜图像特征学习模型中的M个第一编码器或M个第二编码器,其中M是大于1的整数。
例如,其中,所述第二训练数据集呈长尾分布,所述内窥镜图像分类模型的目标损失函数包括:基于所述内窥镜图像分类模型的最终输出结果与图像样本的标注标签而确定的焦点损失函数。
基于上述实施例,本公开实施例中还提供了另一示例性实施方式的电子设备。在一些可能的实施方式中,本公开实施例中电子设备可以包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,其中,处理器执行程序时可以实现上述实施例中内窥镜图像特征学习模型训练方法或内窥镜图像识别方法的步骤。
例如,以电子设备为本公开图1中的服务器100为例进行说明,则该电子设备中的处理器即为服务器100中的处理器110,该电子设备中的存储器 即为服务器100中的存储器120。
本公开的实施例还提供了一种计算机可读存储介质。图16示出了根据本公开的实施例的存储介质的示意图1600。如图16所示,所述计算机可读存储介质1600上存储有计算机可执行指令1601。当所述计算机可执行指令1601由处理器运行时,可以执行参照以上附图描述的根据本公开实施例的基于对比学习的内窥镜图像特征学习模型的训练方法和内窥镜图像分类方法。所述计算机可读存储介质包括但不限于例如易失性存储器和/或非易失性存储器。所述易失性存储器例如可以包括随机存取存储器(RAM)和/或高速缓冲存储器(cache)等。所述非易失性存储器例如可以包括只读存储器(ROM)、硬盘、闪存等。
本公开的实施例还提供了一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括计算机指令,该计算机指令存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机指令,处理器执行该计算机指令,使得该计算机设备执行根据本公开实施例的基于对比学习的内窥镜图像特征学习模型的训练方法和内窥镜图像分类方法。
本领域技术人员能够理解,本公开所披露的内容可以出现多种变型和改进。例如,以上所描述的各种设备或组件可以通过硬件实现,也可以通过软件、固件、或者三者中的一些或全部的组合实现。
此外,虽然本公开对根据本公开的实施例的***中的某些单元做出了各种引用,然而,任何数量的不同单元可以被使用并运行在客户端和/或服务器上。所述单元仅是说明性的,并且所述***和方法的不同方面可以使用不同单元。
本领域普通技术人员可以理解上述方法中的全部或部分的步骤可通过程序来指令相关硬件完成,所述程序可以存储于计算机可读存储介质中,如只读存储器、磁盘或光盘等。可选地,上述实施例的全部或部分步骤也可以使用一个或多个集成电路来实现。相应地,上述实施例中的各模块/单元可以采用硬件的形式实现,也可以采用软件功能模块的形式实现。本公开并不限制于任何特定形式的硬件和软件的结合。
除非另有定义,这里使用的所有术语(包括技术和科学术语)具有与本公开所属领域的普通技术人员共同理解的相同含义。还应当理解,诸如在通 常字典里定义的那些术语应当被解释为具有与它们在相关技术的上下文中的含义相一致的含义,而不应用理想化或极度形式化的意义来解释,除非这里明确地这样定义。
以上是对本公开的说明,而不应被认为是对其的限制。尽管描述了本公开的如果干示例性实施例,但本领域技术人员将容易地理解,在不背离本公开的新颖教学和优点的前提下可以对示例性实施例进行许多修改。因此,所有这些修改都意图包含在权利要求书所限定的本公开范围内。应当理解,上面是对本公开的说明,而不应被认为是限于所公开的特定实施例,并且对所公开的实施例以及其他实施例的修改意图包含在所附权利要求书的范围内。本公开由权利要求书及其等效物限定。

Claims (25)

  1. 一种基于多尺度对比学习的内窥镜图像特征学习模型的训练方法,所述方法包括:
    获取第一训练数据集,所述第一训练数据集包括一个或多个具有待识别对象的内窥镜图像和一个或多个不具有待识别对象的内窥镜图像;
    将所述第一训练数据集输入到所述内窥镜图像特征学习模型;以及
    基于所述第一训练数据集对所述内窥镜图像特征学习模型进行无监督的对比学习,以获得训练完成的内窥镜图像特征学习模型,
    其中,所述内窥镜图像特征学习模型包括多个对比学习子模块,所述多个对比学习子模块的每一个用于提取所述第一训练数据集中的同一内窥镜图像的不同尺度的特征表示,并基于所提取的不同尺度的特征表示进行对比学习。
  2. 根据权利要求1所述的方法,其中,所述多个对比学习子模块包括依次连接的M个对比学习子模块,其中,
    所述M个对比学习子模块中的任意一个对比学习子模块i都包括:结构完全相同的第一编码器和第二编码器、以及结构完全相同的第一映射器模块和第二映射器模块,
    其中,所述第一编码器的输出端连接到所述第一映射器模块的输入端,所述第二编码器的输出端连接到所述第二映射器模块的输入端,
    其中,所述M个对比学习子模块中的M个第一编码器依次连接,所述M个对比学习子模块中的M个第二编码器依次连接,
    其中,所述M为大于或等于1的整数,所述i∈[1,M]。
  3. 根据权利要求2所述的方法,其中,将所述第一训练数据集输入到所述内窥镜图像特征学习模型包括:
    在每次迭代训练时:
    从所述第一训练数据集中随机选取L个内窥镜图像,将所述L个内窥镜图像中的每一个进行第一图像增强,得到与所述L个内窥镜图像一一对应的L个第一增强型内窥镜图像,并输入到所述内窥镜图像特征学习模型中第一个对比学习子模块的第一编码器;以及
    将所述L个内窥镜图像中的每一个进行第二图像增强,得到与所述L个内窥镜图像一一对应的L个第二增强型内窥镜图像,并输入到所述内窥镜图像特征学习模型中第一个对比学习子模块的第二编码器,其中,所述L为大于1的正整数。
  4. 根据权利要求3所述的方法,其中,所述第一图像增强和第二图像增强分别包括以下各项中任意两个:保持不变、剪裁、翻转、颜色变换和高斯模糊。
  5. 根据权利要求3或4所述的方法,其中,基于所述第一训练数据集对所述内窥镜图像特征学习模型进行无监督的对比学习包括:
    基于所述M个对比学习子模块中的任意一个对比学习子模块i,利用其中所包括的第一编码器和第二编码器,分别提取与所述L个第一增强型内窥镜图像相对应的L个第i尺度的第一特征表示和与所述L个第二增强型内窥镜图像相对应的L个第i尺度的第二特征表示;
    利用其中所包括的第一映射器模块和第二映射器模块,分别对所述L个第i尺度的第一特征表示和所述L个第i尺度的第二特征表示进行映射处理,以得到与所述L个第一增强型内窥镜图像相对应的映射后的第i尺度的特征表示和与所述L个第二增强型内窥镜图像相对应的映射后的第i尺度的特征表示;以及
    基于与所述L个第一增强型内窥镜图像相对应的映射后的第i尺度的特征表示和与所述L个第二增强型内窥镜图像相对应的映射后的第i尺度的特征表示,计算对比学习子模块i的对比损失值,
    其中,所述任意一个对比学习子模块i中的第一编码器和第二编码器在不同尺度上对所接收的输入进行特征提取,使得任一个对比学习子模块i中的第一编码器和第二编码器所提取的第i尺度的特征表示与其余(M-1)个对比学习子模块中的第一编码器和第二编码器所提取的特征表示的尺度都不相同。
  6. 根据权利要求2至5中任一项所述的方法,其中,基于所述第一训练数据集对所述内窥镜图像特征学习模型进行无监督的对比学习,以获得训练完成的内窥镜图像特征学习模型包括:
    基于所述M个对比学习子模块中的每一个对比学习子模块i的特征输出，计算联合对比损失值，并基于所述联合对比损失值调整所述内窥镜图像特征学习模型的参数，直到所述内窥镜图像特征学习模型的联合对比损失函数收敛，
    其中,所述联合对比损失函数是基于所述M个对比学习子模块中的每一对比学习子模块i的输出的对比损失函数之和。
  7. 根据权利要求5所述的方法,其中,所述对比学习子模块i中的第一映射器模块包括第一全局映射器,所述对比学习子模块i中的第一编码器的输出端连接到所述对比学习子模块i中的第一全局映射器的输入端;所述对比学习子模块i中的第二映射器模块包括第二全局映射器,所述对比学习子模块i中的第二编码器的输出端连接到所述对比学习子模块i中的第二全局映射器的输入端。
  8. 根据权利要求7所述的方法,其中,利用其中所包括的第一映射器模块和第二映射器模块,分别对所述L个第i尺度的第一特征表示和所述L个第i尺度的第二特征表示进行映射处理,以得到与所述L个第一增强型内窥镜图像相对应的映射后的第i尺度的特征表示和与所述L个第二增强型内窥镜图像相对应的映射后的第i尺度的特征表示包括:
    基于所述对比学习子模块i中包括的所述第一全局映射器和所述第二全局映射器,分别对所述L个第i尺度的第一特征表示和所述L个第i尺度的第二特征表示进行全局映射处理,以得到与所述L个第一增强型内窥镜图像相对应的L个全局映射后的第i尺度的第一特征表示和与所述L个第二增强型内窥镜图像相对应的L个全局映射后的第i尺度的第二特征表示。
  9. 根据权利要求7或8所述的方法,其中,所述第一全局映射器和所述第二全局映射器是两层的全连接模块。
  10. 根据权利要求8所述的方法,其中,基于与所述L个第一增强型内窥镜图像相对应的映射后的第i尺度的特征表示和与所述L个第二增强型内窥镜图像相对应的映射后的第i尺度的特征表示,计算对比学习子模块i的对比损失值包括:
    将与所述L个第一增强型内窥镜图像相对应的所述L个全局映射后的第i尺度的第一特征表示和与所述L个第二增强型内窥镜图像相对应的所述L个全局映射后的第i尺度的第二特征表示中一一对应的两个特征表示作为一对正例，其余(2L-2)个特征表示作为负例，计算对比损失函数，以得到对比学习子模块i的对比损失值。
  11. 根据权利要求5所述的方法,其中,所述对比学习子模块i中的第一映射器模块包括第一全局映射器和第一局部映射器,所述对比学习子模块i中的第一编码器的输出端同时连接到所述对比学习子模块i中的第一全局映射器的输入端和第一局部映射器的输入端;所述对比学习子模块i中的第二映射器模块包括第二全局映射器和第二局部映射器,所述对比学习子模块i中的第二编码器的输出端同时连接到所述对比学习子模块i中的第二全局映射器的输入端和第二局部映射器的输入端。
  12. 根据权利要求11所述的方法,其中,利用其中所包括的第一映射器模块和第二映射器模块,分别对所述L个第i尺度的第一特征表示和所述L个第i尺度的第二特征表示进行映射处理,以得到与所述L个第一增强型内窥镜图像相对应的映射后的第i尺度的特征表示和与所述L个第二增强型内窥镜图像相对应的映射后的第i尺度的特征表示包括:
    基于所述对比学习子模块i中包括的所述第一全局映射器和所述第二全局映射器,分别对所述L个第i尺度的第一特征表示和所述L个第i尺度的第二特征表示进行全局映射处理,以得到与所述L个第一增强型内窥镜图像相对应的L个全局映射后的第i尺度的第一特征表示和与所述L个第二增强型内窥镜图像相对应的L个全局映射后的第i尺度的第二特征表示;以及
    基于所述对比学习子模块i中包括的所述第一局部映射器和所述第二局部映射器,分别对所述L个第i尺度的第一特征表示和所述L个第i尺度的第二特征表示进行局部映射,以得到与所述L个第一增强型内窥镜图像相对应的L个局部映射后的第i尺度的第一特征表示和与所述L个第二增强型内窥镜图像相对应的L个局部映射后的第i尺度的第二特征表示。
  13. 根据权利要求11或12所述的方法,其中,所述第一全局映射器和所述第二全局映射器是两层的全连接模块,所述第一局部映射器和所述第二局部映射器是两层1x1的卷积模块。
  14. 根据权利要求12所述的方法，其中，基于与所述L个第一增强型内窥镜图像相对应的映射后的特征表示和与所述L个第二增强型内窥镜图像相对应的映射后的特征表示，计算对比学习子模块i的对比损失值包括：
    将与所述L个第一增强型内窥镜图像相对应的所述L个全局映射后的第i尺度的第一特征表示和与所述L个第二增强型内窥镜图像相对应的所述L个全局映射后的第i尺度的第二特征表示中一一对应的两个特征表示作为一对正例,其余(2L-2)个特征表示作为负例,计算对比损失函数,以得到全局对比损失值;
    将与所述L个第一增强型内窥镜图像相对应的所述L个局部映射后的第i尺度的第一特征表示中的每一个划分为第一S个第i尺度的局部特征表示,以得到第一(L×S)个第i尺度的局部特征表示;
    以与划分第一S个局部特征表示相同的方式,将与所述L个第二增强型内窥镜图像相对应的所述L个局部映射后的第i尺度的第二特征表示中的每一个划分为与所述第一S个第i尺度的局部特征表示一一对应的第二S个第i尺度的局部特征表示,以得到第二(L×S)个第i尺度的局部特征表示;
    将所述第一(L×S)个第i尺度的局部特征表示与所述第二(L×S)个第i尺度的局部特征表示中一一对应的两个局部特征表示作为一对正例,其余(2×(L×S)-2)个局部特征表示作为负例,计算对比损失函数,以得到局部对比损失值;以及
    将所述全局对比损失值与所述局部对比损失值相加,以得到对比学习子模块i的对比损失值。
  15. 根据权利要求10或14所述的方法,其中,所述对比损失函数是噪声对比估计损失函数InfoNCE。
  16. 根据权利要求2-15任一项所述的方法,其中,所述第一编码器和所述第二编码器是多尺度Transformer编码器块,所述多尺度Transformer编码器块包括间隔设置的一个或多个多头池化注意力模块和一个或多个多层感知器模块,其中每个多头池化注意力模块和每个多层感知器模块之前设置有模块标准化模块。
  17. 根据权利要求1-16任一项所述的方法,其中,所述对象是回盲部。
  18. 一种基于对比学习的内窥镜图像特征学习模型的训练装置,所述装置包括:
    训练数据集获取部件，用于获取第一训练数据集，所述第一训练数据集包括一个或多个具有待识别对象的内窥镜图像和一个或多个不具有待识别对象的内窥镜图像；
    输入部件,用于将所述第一训练数据集输入到所述内窥镜图像特征学习模型;
    训练部件,用于基于所述第一训练数据集对所述内窥镜图像特征学习模型进行无监督的对比学习,以获得训练完成的内窥镜图像特征学习模型,
    其中,所述内窥镜图像特征学习模型包括多个对比学习子模块,所述多个对比学习子模块用于提取同一输入样本的不同尺度的特征表示,并基于所提取的不同尺度的特征表示进行对比学习。
  19. 一种内窥镜图像分类模型的训练方法,包括:
    获取第二训练数据集,所述第二训练数据集包括一个或多个具有待识别对象的内窥镜图像和一个或多个不具有待识别对象的内窥镜图像,所述内窥镜图像标注有标签,用于指示内窥镜图像是否包括待识别对象;
    将所述第二训练数据集输入到内窥镜图像分类模型中进行训练,直到所述内窥镜图像分类模型的目标损失函数收敛,以获得训练完成的内窥镜图像分类模型,
    其中,所述内窥镜图像分类模型包括依次连接的特征提取模块和分类器模块,其中所述特征提取模块是根据权利要求1-17中任一项所述的基于多尺度对比学习的内窥镜图像特征学习模型的训练方法所获得的内窥镜图像特征学习模型中的M个第一编码器或M个第二编码器,其中M是大于1的整数。
  20. 根据权利要求19所述的方法,其中,所述第二训练数据集呈长尾分布,所述内窥镜图像分类模型的目标损失函数包括:基于所述内窥镜图像分类模型的最终输出结果与图像样本的标注标签而确定的焦点损失函数。
  21. 一种内窥镜图像分类模型的训练装置,包括:
    图像获取部件,用于获取第二训练数据集,所述第二训练数据集包括一个或多个具有待识别对象的内窥镜图像和一个或多个不具有待识别对象的内窥镜图像,所述内窥镜图像标注有标签,用于指示内窥镜图像是否包括待识别对象;
    训练部件，将所述第二训练数据集输入到内窥镜图像分类模型中进行训练，直到所述内窥镜图像分类模型的目标损失函数收敛，以获得训练完成的内窥镜图像分类模型，
    其中,所述内窥镜图像分类模型包括依次连接的特征提取模块和分类器模块,其中所述特征提取模块是根据权利要求1-17中任一项所述的基于多尺度对比学习的内窥镜图像特征学习模型的训练方法所获得的内窥镜图像特征学习模型中的M个第一编码器或M个第二编码器,其中M是大于1的整数。
  22. 一种内窥镜图像分类方法,包括:
    获取待识别的内窥镜图像;
    基于训练好的内窥镜图像分类模型,获得所述内窥镜图像的分类结果;
    其中，所述训练好的内窥镜图像分类模型是基于根据权利要求19所述的内窥镜图像分类模型的训练方法所获得的。
  23. 一种内窥镜图像分类***,包括:
    图像获取部件,用于获取待识别的内窥镜图像;
    处理部件,基于训练好的内窥镜图像分类模型,获得所述内窥镜图像的分类结果;
    输出部件,用于输出待识别的内窥镜图像的分类结果,
    其中，所述训练好的内窥镜图像分类模型是基于根据权利要求19所述的内窥镜图像分类模型的训练方法所获得的。
  24. 一种电子设备,包括存储器和处理器,其中,所述存储器上存储有处理器可读的程序代码,当处理器执行所述程序代码时,执行根据权利要求1-17、19-20和22中任一项所述的方法。
  25. 一种计算机可读存储介质,其上存储有计算机可执行指令,所述计算机可执行指令用于执行根据权利要求1-17、19-20和22中任一项所述的方法。
PCT/CN2022/122056 2021-10-26 2022-09-28 内窥镜图像特征学习模型、分类模型的训练方法和装置 WO2023071680A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111248801.6A CN113706526B (zh) 2021-10-26 2021-10-26 内窥镜图像特征学习模型、分类模型的训练方法和装置
CN202111248801.6 2021-10-26

Publications (1)

Publication Number Publication Date
WO2023071680A1 true WO2023071680A1 (zh) 2023-05-04

Family

ID=78646913

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/122056 WO2023071680A1 (zh) 2021-10-26 2022-09-28 内窥镜图像特征学习模型、分类模型的训练方法和装置

Country Status (2)

Country Link
CN (1) CN113706526B (zh)
WO (1) WO2023071680A1 (zh)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113706526B (zh) * 2021-10-26 2022-02-08 北京字节跳动网络技术有限公司 内窥镜图像特征学习模型、分类模型的训练方法和装置
CN114005073B (zh) * 2021-12-24 2022-04-08 东莞理工学院 上肢镜像康复训练、识别方法和装置
CN114332637B (zh) * 2022-03-17 2022-08-30 北京航空航天大学杭州创新研究院 遥感影像水体提取方法、遥感影像水体提取的交互方法
CN114419400B (zh) * 2022-03-28 2022-07-29 北京字节跳动网络技术有限公司 图像识别模型的训练方法、识别方法、装置、介质和设备
CN115115904B (zh) * 2022-06-08 2024-07-09 马上消费金融股份有限公司 基于对比学习的模型训练方法及装置
CN116051486B (zh) * 2022-12-29 2024-07-02 抖音视界有限公司 内窥镜图像识别模型的训练方法、图像识别方法及装置
CN116052061B (zh) * 2023-02-21 2024-02-27 嘉洋智慧安全科技(北京)股份有限公司 事件监测方法、装置、电子设备及存储介质

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109523522B (zh) * 2018-10-30 2023-05-09 腾讯医疗健康(深圳)有限公司 内窥镜图像的处理方法、装置、***及存储介质
CN113034500A (zh) * 2021-05-25 2021-06-25 紫东信息科技(苏州)有限公司 基于多通道结构的消化道内窥镜图片病灶识别***

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109948733A (zh) * 2019-04-01 2019-06-28 深圳大学 消化道内窥镜图像的多分类方法、分类装置及存储介质
US20210327029A1 (en) * 2020-04-13 2021-10-21 Google Llc Systems and Methods for Contrastive Learning of Visual Representations
CN113486990A (zh) * 2021-09-06 2021-10-08 北京字节跳动网络技术有限公司 内窥镜图像分类模型的训练方法、图像分类方法和装置
CN113496489A (zh) * 2021-09-06 2021-10-12 北京字节跳动网络技术有限公司 内窥镜图像分类模型的训练方法、图像分类方法和装置
CN113706526A (zh) * 2021-10-26 2021-11-26 北京字节跳动网络技术有限公司 内窥镜图像特征学习模型、分类模型的训练方法和装置

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI BIN; LI YIN; ELICEIRI KEVIN W.: "Dual-stream Multiple Instance Learning Network for Whole Slide Image Classification with Self-supervised Contrastive Learning", 2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), IEEE, 20 June 2021 (2021-06-20), pages 14313 - 14323, XP034007713, DOI: 10.1109/CVPR46437.2021.01409 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116597163A (zh) * 2023-05-18 2023-08-15 广东省旭晟半导体股份有限公司 红外光学透镜及其制备方法
CN116741372A (zh) * 2023-07-12 2023-09-12 东北大学 一种基于双分支表征一致性损失的辅助诊断***及装置
CN116741372B (zh) * 2023-07-12 2024-01-23 东北大学 一种基于双分支表征一致性损失的辅助诊断***及装置
CN116994076A (zh) * 2023-09-28 2023-11-03 中国海洋大学 一种基于双分支相互学习特征生成的小样本图像识别方法
CN116994076B (zh) * 2023-09-28 2024-01-19 中国海洋大学 一种基于双分支相互学习特征生成的小样本图像识别方法
CN117036832A (zh) * 2023-10-09 2023-11-10 之江实验室 一种基于随机多尺度分块的图像分类方法、装置及介质
CN117036832B (zh) * 2023-10-09 2024-01-05 之江实验室 一种基于随机多尺度分块的图像分类方法、装置及介质
CN117437518A (zh) * 2023-11-03 2024-01-23 苏州鑫康成医疗科技有限公司 基于glnet和自注意力的心脏超声图像识别方法
CN117636064A (zh) * 2023-12-21 2024-03-01 浙江大学 一种基于儿童病理切片的神经母细胞瘤智能分类***
CN117636064B (zh) * 2023-12-21 2024-05-28 浙江大学 一种基于儿童病理切片的神经母细胞瘤智能分类***

Also Published As

Publication number Publication date
CN113706526B (zh) 2022-02-08
CN113706526A (zh) 2021-11-26

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22885565

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE