CN115423790A - Anterior chamber angle image grading method based on visual text fusion - Google Patents

Anterior chamber angle image grading method based on visual text fusion

Info

Publication number
CN115423790A
Authority
CN
China
Prior art keywords
text
visual
anterior chamber
chamber angle
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211138484.7A
Other languages
Chinese (zh)
Inventor
贾西平
黄静琪
关立南
聂栋
崔怀林
廖秀秀
林智勇
马震远
刘海珠
张倩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Polytechnic Normal University
Original Assignee
Guangdong Polytechnic Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Polytechnic Normal University filed Critical Guangdong Polytechnic Normal University
Priority to CN202211138484.7A priority Critical patent/CN115423790A/en
Publication of CN115423790A publication Critical patent/CN115423790A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/0002 Inspection of images, e.g. flaw detection
    • G06T7/0012 Biomedical image inspection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30004 Biomedical image processing
    • G06T2207/30041 Eye; Retina; Ophthalmic

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Radiology & Medical Imaging (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an anterior chamber angle image grading method based on visual text fusion, which comprises the following steps: S1: constructing an anterior chamber angle image dataset; S2: preprocessing the images in the anterior chamber angle image dataset; S3: constructing and initializing a deep neural network model based on visual text fusion; S4: initializing a loss function and an optimizer; S5: training the deep neural network model based on visual text fusion of step S3 with the preprocessed anterior chamber angle image dataset of step S2, and calculating the loss function; S6: updating the network parameters of the deep neural network model based on visual text fusion with the optimizer to obtain the optimal deep neural network model based on visual text fusion; S7: grading anterior chamber angle images with the optimal deep neural network model based on visual text fusion. The invention addresses, to a certain extent, the problem of computer-aided diagnosis and treatment of glaucoma.

Description

Anterior chamber angle image grading method based on visual text fusion
Technical Field
The invention relates to the field of medical image processing, in particular to an anterior chamber angle image grading method based on visual text fusion.
Background
Glaucoma is an irreversible blinding eye disease and a leading cause of blindness. After glaucoma has been diagnosed, it is usually necessary clinically to determine the anterior chamber angle grade by observing the internal structure of the Anterior Chamber Angle (ACA) with the aid of a gonioscope, and then to develop a targeted treatment plan. In recent years, deep learning has achieved many important results in the field of medical image analysis. Automatic classification of anterior chamber angle images based on deep learning offers a high effective rate, good stability and consistency of its conclusions, and the possibility of being deployed in different regions, and therefore has important clinical value.
1. Anterior chamber angle assessment. Cheng et al. used edge detection and arc detection algorithms to achieve a two-stage intelligent analysis of RetCam images; their system automatically classifies closed-angle/open-angle glaucoma by measuring the size of the arc or angle. EyeCam was originally designed to produce wide-angle fundus pictures and was later adapted to record the anterior chamber angle. It is a portable handheld device that, like the gonioscopy considered herein, performs a contact examination on the patient. Baskaran et al. compared the performance of gonioscopy and the EyeCam system and found that the diagnostic results of the two devices in determining whether the angle is closed were substantially consistent. Unlike the objective of this work (distinguishing five levels of the anterior chamber angle), studies of RetCam and EyeCam images are more concerned with whether the angle in the image is closed, and this binary classification task is simpler. In addition, researchers have achieved segmentation of four structures in the anterior chamber angle image, which differs considerably from the classification task here.
2. Glaucoma detection. In recent years, deep learning techniques have achieved a series of important results in glaucoma detection, including the analysis of imaging data such as AS-OCT and fundus examination images, enabling computer-aided diagnosis of glaucoma. Some studies aim at segmenting geometric structures in AS-OCT and fundus images (for optic nerve segmentation or cup/disc segmentation) using deep convolutional neural networks or fully convolutional neural networks. For example, Fang et al. extract image features of AS-OCT with convolutional neural networks and, at the same time, find the final boundary from the probability map with a graph search method, thereby achieving automatic segmentation of retinal layer boundaries in AS-OCT images. Other studies are designed to support glaucoma diagnosis, mainly capturing color information of optic nerve fundus images with convolutional neural network architectures based on deep learning. Liu et al. constructed a deep learning system for the detection of glaucomatous optic neuropathy using the averages over the red, blue and green channels.
In summary, deep learning techniques have achieved a series of important results in glaucoma-related medical image analysis (e.g., fundus images, OCT images), but relatively little work assists the analysis of anterior chamber angle images. Considering that conventional gonioscopy is still widely used in China, it is necessary to study computer-aided analysis of the anterior chamber angle image.
Disclosure of Invention
The invention provides an anterior chamber angle image grading method based on visual text fusion, which realizes automatic grading of an anterior chamber angle image.
In order to solve the technical problems, the technical scheme of the invention is as follows:
an anterior chamber angle image grading method based on visual text fusion comprises the following steps:
s1: constructing an anterior chamber angle image dataset;
s2: pre-processing images in the anterior chamber angle image dataset;
s3: constructing and initializing a deep neural network model based on visual text fusion;
s4: initializing a loss function and an optimizer;
s5: training the deep neural network model based on visual text fusion of step S3 with the preprocessed anterior chamber angle image dataset of step S2, and calculating the loss function;
s6: updating the network parameters of the deep neural network model based on visual text fusion with the optimizer so that they approach or reach their optimal values, minimizing the loss function and finding the optimal network parameters, thereby obtaining the optimal deep neural network model based on visual text fusion;
s7: grading the anterior chamber angle image with the optimal deep neural network model based on visual text fusion.
Preferably, the anterior chamber angle image data set in step S1 includes a plurality of anterior chamber angle images, each of which is labeled with anterior chamber angle grade information and a manually defined text label, wherein a part of the anterior chamber angle images are also labeled with a pixel-level label, wherein:
the anterior chamber angle grading information is divided into five grades, namely N1, N2, N3, N4 and W, according to the anterior chamber angle evaluation system described by Shaffer, where each grade corresponds to a different clinical description;
the text label summarizes the clinical features of each level of anterior chamber angle according to the clinical description of that level; a mapping strategy is defined to map the clinical features of each level of anterior chamber angle into computer-recognizable codes, each level of anterior chamber angle corresponds to one attribute vector, the five levels form an attribute matrix of five A-dimensional vectors, called the text label, and each vector in the attribute matrix represents the text description of one level of anterior chamber angle;
the pixel-level label labels each pixel in the anterior chamber angle image as belonging to one of a Schwalbe line, trabecular meshwork, scleral spur, ciliary body zone, or background structure.
Preferably, the mapping strategy uses sequential coding, specifically:
A attributes are selected to describe the anterior chamber angle images of the various levels, and each level of anterior chamber angle image is represented by an A-dimensional word-level attribute vector that is sequentially encoded into a computer-recognizable attribute vector v = (v_0, …, v_{A-1}); v_0 to v_{A-2} indicate the degree to which each structure of the anterior chamber angle is visible in the anterior chamber angle image, with 0 indicating invisible, 1 partially visible and 2 fully visible; v_{A-1} is the semantic attribute indicating the likelihood of anterior chamber angle closure.
Preferably, the image in the anterior chamber angle image data set is preprocessed in step S2, specifically:
data enhancement operations, including random horizontal mirroring and random salt-and-pepper noise, are applied to each anterior chamber angle image in the dataset, and the images are finally normalized.
Preferably, the deep neural network model based on visual text fusion in step S3 specifically includes:
the deep neural network model based on visual text fusion comprises a visual learning branch, a text learning branch and a main branch; an image from the anterior chamber angle image dataset is input into the main branch, one of the resulting potential visual feature maps is fed into the text learning branch, the text learning branch produces a text feature map from the potential visual feature map and returns it to the main branch, the visual learning branch receives feature information from the main branch and returns visual features to it, and the main branch then performs visual text fusion and outputs the anterior chamber angle image grade.
Preferably, the main branch comprises a visual encoder, a first fusion block, a second fusion block, and a classifier, wherein:
the visual encoder is a ResNet50; its input is an image from the anterior chamber angle image dataset and its outputs are potential visual feature maps at two different scales. One potential visual feature map P_la is input into the text learning branch, and the other potential visual feature map P_vi is input into the first fusion block. The first fusion block also receives the text feature map P_te output by the text learning branch and fuses P_vi with P_te to obtain the visual context information P_F1. P_F1 is sent to the visual learning branch and to the second fusion block; the second fusion block also receives the feature information P_SEG and P_EMB output by the visual learning branch and fuses P_F1 with P_SEG and P_EMB to obtain the aggregated potential feature P_F2. Finally, P_F2 is input into the classifier, which is built as a multilayer perceptron and maps P_F2 to the class distribution to obtain the anterior chamber angle image grade.
Preferably, the first fusion block adopts an attention mechanism, and the fusion process of the first fusion block is as follows:
the potential visual feature map P_vi and the text feature map P_te each model static context information through a 3 × 3 convolution;
for the text feature map P_te, the obtained context information and P_te are combined by a channel splicing operation, followed by two consecutive 1 × 1 convolution operations and then reshaping and averaging operations to obtain the text relation matrix;
the context information obtained from the potential visual feature map P_vi is reshaped to obtain the visual relation matrix;
the text relation matrix is normalized with a Softmax function to obtain an attention weight map, which is multiplied element by element with the visual relation matrix so that the text information guides the visual feature learning and new visual context information is obtained;
the dependency between the features of the visual and text modalities is modeled by element-wise summation, completing the fusion of the potential visual feature map P_vi and the text feature map P_te;
the specific fusion process of the second fusion block is as follows:
P_F2 = GAP(P_F1) ⊕ GAP(P_SEG) ⊕ GAP(P_EMB)
where GAP(·) is the global average pooling operation and ⊕ is the channel splicing operation.
Preferably, the text learning branch is formed by a text encoder built from the res4 residual block of ResNet; its input is the potential feature map P_la from the visual encoder and its output is the text feature map P_te, and the res4 parameters of the visual encoder are shared with the text encoder. The text feature map P_te ∈ R^(C×H×W) is obtained from the text encoder through attribute learning, where C, H and W denote channel, height and width respectively, and the text learning branch applies global average pooling over H and W to learn a global discriminative feature:
g = (1 / (H·W)) · Σ_{i=1}^{H} Σ_{j=1}^{W} p_{i,j}
where p_{i,j} ∈ R^C is extracted from the feature P_te at spatial position (i, j);
the text learning branch also uses a linear layer with parameters W_te to map the text feature into the semantic embedding space and predict the attribute vector v̂ ∈ R^A, which represents the predicted latent semantic information of the A attributes in the anterior chamber angle image I:
v̂ = W_te(g)
where W_te is a linear transformation computed on the input tensor with a 1 × 1 convolution and v̂ is the predicted attribute vector.
Preferably, the visual learning branch comprises a visual decoder, a feature pyramid network, a segmentation sub-module and an embedding sub-module, wherein:
the visual decoder is an up-sampling sub-network using skip connections and a symmetric structure; its input is the feature data from the visual encoder and its output is fed into the feature pyramid network; the output P_FPN of the feature pyramid network is input into the embedding sub-module, and P_FPN is also added to the visual context information P_F1 and fed into the segmentation sub-module; the segmentation sub-module outputs the feature information P_SEG and the embedding sub-module outputs the feature information P_EMB;
the segmentation sub-module consists of a first segmentation block and a second segmentation block; the first segmentation block consists of two 3 × 3 convolutional layers and a ReLU activation function, its input is the sum of the output P_FPN of the feature pyramid network and the visual context information P_F1, and its output is the feature information P_SEG; the second segmentation block consists of three 3 × 3 convolutional layers and a ReLU activation function, its input is P_SEG and its output is the dense prediction Ŷ;
the embedding sub-module comprises a first embedding block and a second embedding block; the first embedding block is formed by five 3 × 3 convolutional layers and maps the multi-scale feature map P_FPN to the embedded feature map P_EMB, which represents discriminative features; the second embedding block consists of a 1 × 1 convolutional layer that maps every pixel in the image to a point in feature space, each point being represented by an embedding vector e_i that expresses the compressed implicit information of the pixel in the image.
Preferably, the loss function in step S3 is specifically:
L_total = α·L_CLS + β·L_SEG + γ·L_EMB + δ·L_TE
where L_total is the total loss function, L_CLS is the loss function of the main branch, L_SEG is the loss function of the segmentation sub-module, L_EMB is the loss function of the embedding sub-module, L_TE is the loss function of the text learning branch, and α, β, γ and δ are the weights of the four loss terms;
the segmentation sub-module loss L_SEG combines a supervised term L_SEG^l, which computes the loss of training samples that carry pixel-level labels, and an unsupervised term L_SEG^u, which computes the loss of samples without pixel-level labels using pseudo-labels; p_i and y_i are respectively the prediction probability and the ground-truth label of pixel i, N_j denotes the number of pixels in structure j, ŷ_i denotes the pseudo-label of pixel i, and m_i indicates whether the pseudo-label of pixel i is used: m_i is set to 1 when the prediction probability score is above the threshold τ;
L_EMB = λ·L_var + ρ·L_dist + ω·L_reg
L_var = (1/C) · Σ_{c=1}^{C} (1/N_c) · Σ_{i=1}^{N_c} [ ||μ_c − e_i|| − δ_v ]_+^2
L_dist = (1/(C·(C−1))) · Σ_{c_A=1}^{C} Σ_{c_B≠c_A} [ 2·δ_d − ||μ_{c_A} − μ_{c_B}|| ]_+^2
L_reg = (1/C) · Σ_{c=1}^{C} ||μ_c||
where e_i is the embedding vector of pixel i, μ_c is the mean embedding vector of class c, i.e. the class center, N_c is the number of pixels of class c, C is the number of classes, δ_v and δ_d are the margins of the variance and distance losses, i.e. respectively the maximum distance accepted within a cluster and the minimum distance by which clusters are pushed apart, || · || is the L2 norm, and [x]_+ = max(0, x);
the text learning branch loss L_TE guides the predicted attribute vector toward the manually defined text label; the similarity score vector is computed as the inner product s_I = v̂ · Vᵀ, where v̂ denotes the predicted A-dimensional attribute vector and V denotes the text label defined manually according to domain knowledge; s_I gives the similarity score of each anterior chamber angle level, and the level with the highest similarity score in s_I is the prediction level of the anterior chamber angle image I.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
1. The requirement on the professional qualification of doctors for anterior chamber angle grading is reduced, which has practical significance for alleviating the shortage of medical experts. An automatic anterior chamber angle image grading system based on deep learning can reduce the medical cost for patients and enables more glaucoma patients to enjoy high-quality medical services at a lower cost.
2. As an auxiliary diagnosis technology for glaucoma, the method provides an important reference for doctors and effectively reduces the instability of diagnostic conclusions. The proposed grading method not only realizes automatic grading of anterior chamber angle images, but also serves as an auxiliary diagnosis and treatment tool for glaucoma, reducing the image-reading workload of ophthalmologists and improving the stability of diagnostic results.
3. A new research direction is provided for other medical images that face challenges similar to the anterior chamber angle image. Such challenges include unclear boundaries of key local structures in the target image, which make recognition difficult for the model, and poor feature separability between adjacent structures. For example, the diagnosis of Diabetic Retinopathy (DR) relies on microaneurysms, hemorrhages, and soft and hard exudates, which are difficult to distinguish because of their similar appearance. The method proposed by the invention is therefore equally applicable to such problems.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention.
Figure 2 is an exemplary graph of multi-modal data associated with an image of the anterior chamber angle provided by an embodiment.
Fig. 3 is an overall framework diagram of the deep neural network model based on visual text fusion according to an embodiment.
Fig. 4 is a network architecture diagram of the visual learning branch.
Fig. 5 is a network architecture diagram of a text learning branch.
Fig. 6 is a network framework diagram of a first fusion block.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;
it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1
The embodiment provides an anterior chamber angle image grading method based on visual text fusion, as shown in fig. 1, including the following steps:
s1: constructing an anterior chamber angle image dataset;
s2: pre-processing images in the anterior chamber angle image dataset;
s3: constructing and initializing a deep neural network model based on visual text fusion;
s4: initializing a loss function and an optimizer;
s5: training the deep neural network model based on visual text fusion of step S3 with the preprocessed anterior chamber angle image dataset of step S2, and calculating the loss function;
s6: updating the network parameters of the deep neural network model based on visual text fusion with the optimizer so that they approach or reach their optimal values, minimizing the loss function and finding the optimal network parameters, thereby obtaining the optimal deep neural network model based on visual text fusion;
s7: grading the anterior chamber angle image with the optimal deep neural network model based on visual text fusion.
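For concreteness, the following is a minimal training-loop sketch covering steps S3 to S7, assuming a PyTorch implementation; the names VTFN, ACADataset and vtfn_loss, and all hyperparameter values, are illustrative assumptions and not part of the original disclosure.

```python
import torch
from torch.utils.data import DataLoader

# VTFN, ACADataset and vtfn_loss are hypothetical names standing in for the
# visual-text fusion network, the preprocessed ACA dataset and the weighted
# total loss; they are not defined in the patent text.
model = VTFN(num_classes=5)                                    # S3: build and initialize the model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)      # S4: initialize the optimizer
loader = DataLoader(ACADataset(split="train"), batch_size=8, shuffle=True)

for epoch in range(100):
    for images, grade, text_label, pixel_label in loader:
        logits, seg, emb, attr = model(images)                 # forward pass through all branches
        loss = vtfn_loss(logits, seg, emb, attr,
                         grade, pixel_label, text_label)       # S5: compute the total loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                                       # S6: update the network parameters

model.eval()                                                   # S7: grade anterior chamber angle images
with torch.no_grad():
    predicted_grade = model(images)[0].argmax(dim=1)
```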
Example 2
This example continues to disclose the following on the basis of example 1:
the anterior chamber angle image dataset (ACA999) in step S1 includes 999 anterior chamber angle images, each labeled with anterior chamber angle grading information and a manually defined text label; 100 of the anterior chamber angle images are additionally labeled with pixel-level labels, wherein:
the anterior chamber angle grading information is divided into five grades, namely N1, N2, N3, N4 and W, according to the anterior chamber angle evaluation system described by Shaffer, where each grade corresponds to a different clinical description; for example, an anterior chamber angle whose degree of openness is N2 corresponds to the clinical description "CBB disappearance";
the text label summarizes the clinical features of each level of anterior chamber angle according to the clinical description of that level, so each category of anterior chamber angle has its corresponding word-level attributes. For example, "CBB disappearance" can be summarized as N2 having an SL structure, a TM structure and an SS structure, having no CBB structure, and possibly having a closed angle. A mapping strategy is defined to map the clinical features of each level of anterior chamber angle into computer-recognizable codes; each level of anterior chamber angle corresponds to one attribute vector, the five levels form an attribute matrix of five A-dimensional vectors, called the text label, and each vector in the attribute matrix represents the text description of one level of anterior chamber angle;
the pixel-level label marks each pixel in the anterior chamber angle image as belonging to one of the Schwalbe Line (SL), Trabecular Meshwork (TM), Scleral Spur (SS), ciliary body zone (CBB) or background structures.
The anterior chamber angle image, the data of the visual modality and the data of the text modality are shown in figure 2. The multi-modal data describe the characteristic information of the four structures from different angles: (1) the anterior chamber angle image provides texture and color information of the iridocorneal angle region; (2) the data of the visual modality provide local information about the four key structures; (3) the data of the text modality provide clinically relevant domain knowledge: the integrity of the four important structures determines the level of the anterior chamber angle. The data of the visual modality and the data of the text modality therefore provide different but complementary information to the model; using multi-modal data not only reduces the uncertainty of the information but also provides more clinical information. It is therefore highly desirable for a computer to learn the complementary information and common features contained in multi-modal data for automatic assessment of the anterior chamber angle.
Although the visual learning branch can learn visual features, it may ignore the influence of the order of the four structures, their degree of visibility and the severity of anterior chamber angle closure on the anterior chamber angle grade. The invention therefore proposes a text learning branch based on attribute learning to learn text features from the text descriptions, breaking these limitations.
According to the anterior chamber angle evaluation system described by Shaffer, the clinical description of each grade of anterior chamber angle is different. As can be seen in figure 2, the basis on which the physician determines the anterior chamber angle grade is that different levels of anterior chamber angle have different clinical characteristics. A mapping strategy φ(·): C → V is defined to map all levels of the anterior chamber angle image into a semantic matrix based on the clinical features, which consist of attribute-specific words corresponding to each level; the invention therefore writes V = φ(C), where V denotes the N_c A-dimensional attribute vectors corresponding to the anterior chamber angle levels in Table 1, as determined by domain knowledge.
TABLE 1 encoding of attribute vectors
Based on the basis, the invention manually summarizes the attributes described as word levels by summarizing the text, selects A attributes to describe the anterior chamber angle images of various levels, and each level of the anterior chamber angle images is composed of a word level attribute of an A dimension
Figure BDA0003853131520000092
Representing, by sequential encoding, encoded into a computer-recognizable attribute vector: v. of 0 ,…,v A-1 (ii) a Since N2 is at risk of atrial angle closure, patients with an atrial angle rating of N2 are recommended to follow-up by the ophthalmologist, and one level of text labels are manually defined a-dimensional attribute vectors based on domain knowledge that are used to guide text coders to learn the underlying semantic information of the atrial angle image. The attribute vectors for all anterior chamber angle levels in Table 1 constitute a set of manually defined text labels, v 0 To v A-2 Indicating the degree to which each structure of the anterior chamber angle is visible in the anterior chamber angle image, 0 indicating invisible, 1 indicating partially visible, 2 indicating fully visible; v. of A-1 Is used as its semantic attribute to indicate the likelihood of anterior chamber angle closure.
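As an illustration of the sequential coding described above, the sketch below builds a hypothetical text-label matrix; the attribute count A = 5 and every concrete value are assumptions chosen for illustration only, since Table 1 is reproduced as an image in the original and is not available here.

```python
import numpy as np

# Word-level attributes in a fixed order: visibility of SL, TM, SS, CBB
# (0 = invisible, 1 = partially visible, 2 = fully visible) plus one semantic
# attribute for the likelihood of angle closure. All values are illustrative
# assumptions, not the patent's Table 1.
LEVELS = ["N1", "N2", "N3", "N4", "W"]
text_labels = np.array([
    #  SL  TM  SS  CBB  closure
    [  2,  2,  2,  1,   0],   # N1: CBB only partially visible (assumed)
    [  2,  2,  2,  0,   1],   # N2: "CBB disappearance", possibly closed angle
    [  2,  2,  1,  0,   1],   # N3 (assumed)
    [  2,  1,  0,  0,   2],   # N4 (assumed)
    [  2,  2,  2,  2,   0],   # W : all four structures fully visible (assumed)
], dtype=np.float32)

assert text_labels.shape == (5, 5)   # five levels, A = 5 attributes per level
```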
In step S2, preprocessing the image in the anterior chamber angle image dataset, specifically:
data enhancement operations, including random horizontal mirroring and random salt-and-pepper noise, are applied to each anterior chamber angle image in the dataset, and the images are finally normalized.
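A minimal preprocessing sketch consistent with the description above, assuming a NumPy/PyTorch pipeline; the noise amount, flip probabilities and normalization statistics are assumed values.

```python
import numpy as np
import torch

def add_salt_pepper(img: np.ndarray, amount: float = 0.01) -> np.ndarray:
    """Randomly set a fraction of pixels to black (pepper) or white (salt)."""
    noisy = img.copy()
    mask = np.random.rand(*img.shape[:2])
    noisy[mask < amount / 2] = 0          # pepper
    noisy[mask > 1 - amount / 2] = 255    # salt
    return noisy

def preprocess(img: np.ndarray) -> torch.Tensor:
    """Random horizontal mirroring, random salt-and-pepper noise, then normalization.
    img is an H x W x 3 uint8 anterior chamber angle image."""
    if np.random.rand() < 0.5:
        img = img[:, ::-1].copy()                       # random horizontal mirror
    if np.random.rand() < 0.5:
        img = add_salt_pepper(img, amount=0.01)         # random salt-and-pepper noise
    x = torch.from_numpy(img).permute(2, 0, 1).float() / 255.0
    mean = torch.tensor([0.5, 0.5, 0.5]).view(3, 1, 1)  # assumed statistics
    std = torch.tensor([0.5, 0.5, 0.5]).view(3, 1, 1)
    return (x - mean) / std                             # normalization
```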
Example 3
This example discloses the following on the basis of example 1 and example 2:
the deep neural network model based on visual text fusion in step S3 is specifically, as shown in fig. 3:
the deep neural network model based on visual text fusion comprises a visual learning branch (Vision learning branch), a text learning branch (Text learning branch) and a main branch (Main branch); an image from the anterior chamber angle image dataset is input into the main branch, one of the resulting potential visual feature maps is fed into the text learning branch, the text learning branch produces a text feature map from the potential visual feature map and returns it to the main branch, the visual learning branch receives feature information from the main branch and returns visual features to it, and the main branch then performs visual text fusion and outputs the anterior chamber angle image grade.
To address the problem that the important visual features in an anterior chamber angle image are distributed over small areas and have fuzzy boundaries, a visual learning branch is constructed to extract the visual features of the image. To address the difficulty of learning and expressing the abstract semantics of text features, a text learning branch is constructed using domain knowledge and attribute learning. To address the difficulty of effectively fusing the data features of the visual and text modalities, the following measures overcome the limitation that the intrinsic relationship between the two modalities is hard to learn: (1) the coarse potential visual features are used as the input of the text learning branch, reducing the gap between the visual and text features; (2) an attention mechanism is used as the main component of the fusion block, embedding the multi-modal data to synthesize common semantic features and thus improving the correlation among the multi-modal data.
In a traditional medical image classification task, a model is usually trained with images and image-level labels, and at inference time the model classifies test-set samples. However, because of the limitations of conventional deep neural networks, such a model often cannot capture the detailed features of the region of interest, its semantic information, or the relationships between structures. To overcome these limitations of computer-aided anterior chamber angle assessment, this embodiment provides a deep neural network model called the Visual Text Fusion Network (VTFN).
The framework designs a visual learning branch based on weak supervised metric learning and a text learning branch based on attribute learning. Through these two branches, the model is able to learn both visual and textual features from multimodal data. Next, the multi-modal features are fused into a common feature. And finally, classifying the anterior chamber angle images of the common features. Thus, the visual learning branch was developed to learn and distinguish the four structures Schwalbe lines, trabecular meshwork, scleral spur and ciliary body zone, the text learning branch was developed to map text descriptions to attribute vectors to emphasize the representation of specific words in important image sub-regions, and the fusion block was developed to fuse multimodal features, improving the intrinsic connection between multimodal data.
The main branch comprises a Visual encoder (Visual encoder), a first Fusion block (Fusion block 1), a second Fusion block (Fusion block 2), and a Classifier (CLS), wherein:
the visual encoder is a ResNet50; its input is an image from the anterior chamber angle image dataset and its outputs are potential visual feature maps at two different scales. One potential visual feature map P_la is input into the text learning branch, and the other potential visual feature map P_vi is input into the first fusion block. The first fusion block also receives the text feature map P_te output by the text learning branch and fuses P_vi with P_te to obtain the visual context information P_F1. P_F1 is sent to the visual learning branch and to the second fusion block; the second fusion block also receives the feature information P_SEG and P_EMB output by the visual learning branch and fuses P_F1 with P_SEG and P_EMB to obtain the aggregated potential feature P_F2. Finally, P_F2 is input into the classifier, which is built as a multilayer perceptron and maps P_F2 to the class distribution to obtain the anterior chamber angle image grade.
Because a simple splicing operation loses the internal correlation among multi-modal data, the invention provides two fusion blocks, through which complementary information in the multi-modal data can be obtained. As shown in fig. 6, the first fusion block is built around an attention mechanism, which takes full advantage of the information of the two modalities by aggregating visual and textual features, reduces the gap between the two modalities and learns their intrinsic relationship. The fusion process of the first fusion block is as follows:
the potential visual feature map P_vi and the text feature map P_te each model static context information through a 3 × 3 convolution;
for the text feature map P_te, the obtained context information and P_te are combined by a channel splicing operation, followed by two consecutive 1 × 1 convolution operations and then reshaping and averaging operations to obtain the text relation matrix;
the context information obtained from the potential visual feature map P_vi is reshaped to obtain the visual relation matrix;
the text relation matrix is normalized with a Softmax function to obtain an attention weight map, which is multiplied element by element with the visual relation matrix so that the text information guides the visual feature learning and new visual context information is obtained;
the dependency between the features of the visual and text modalities is modeled by element-wise summation; at this point the attention mechanism has projected the multi-modal features into a common feature subspace, completing the fusion of the potential visual feature map P_vi and the text feature map P_te.
the mechanism can provide additional supplementary clues according to the information of other modalities, and mutual guidance of multi-modal features is realized.
The specific fusion process of the second fusion block is as follows:
P_F2 = GAP(P_F1) ⊕ GAP(P_SEG) ⊕ GAP(P_EMB)
where GAP(·) is the global average pooling operation and ⊕ is the channel splicing operation.
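A minimal sketch of the second fusion block corresponding to the formula above, assuming PyTorch tensors of shape (B, C, H, W).

```python
import torch
import torch.nn.functional as F

def fusion_block_2(p_f1: torch.Tensor, p_seg: torch.Tensor, p_emb: torch.Tensor) -> torch.Tensor:
    """Global average pooling of each input followed by channel splicing."""
    def gap(t: torch.Tensor) -> torch.Tensor:
        return F.adaptive_avg_pool2d(t, 1).flatten(1)   # (B, C, H, W) -> (B, C)
    return torch.cat([gap(p_f1), gap(p_seg), gap(p_emb)], dim=1)
```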
The text learning branch is formed by a text encoder (Textual encoder) which, as shown in fig. 5, is built from the res4 residual block of ResNet. Its input is the potential feature map P_la from the visual encoder and its output is the text feature map P_te; the res4 parameters of the visual encoder are shared with the text encoder. The text feature map P_te ∈ R^(C×H×W) is obtained from the text encoder through attribute learning, where C, H and W denote channel, height and width respectively. The text learning branch applies global average pooling over H and W to learn a global discriminative feature:
g = (1 / (H·W)) · Σ_{i=1}^{H} Σ_{j=1}^{W} p_{i,j}
where p_{i,j} ∈ R^C is extracted from the feature P_te at spatial position (i, j).
The text learning branch also uses a linear layer with parameters W_te to map the text feature into the semantic embedding space and predict the attribute vector v̂ ∈ R^A, which represents the predicted latent semantic information of the A attributes in the anterior chamber angle image I:
v̂ = W_te(g)
where W_te is a linear transformation computed on the input tensor with a 1 × 1 convolution and v̂ is the predicted attribute vector.
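A sketch of this pooling-and-projection head, assuming PyTorch; the channel count and the number of attributes are placeholder values.

```python
import torch
import torch.nn as nn

class AttributeHead(nn.Module):
    """Global average pooling over H and W, then a 1x1 convolution (the linear
    map W_te) from the C-channel text feature to the A-dimensional attribute vector."""
    def __init__(self, channels: int = 2048, num_attributes: int = 5):
        super().__init__()
        self.w_te = nn.Conv2d(channels, num_attributes, kernel_size=1)

    def forward(self, p_te: torch.Tensor) -> torch.Tensor:
        g = p_te.mean(dim=(2, 3), keepdim=True)   # global average pooling over H, W
        return self.w_te(g).flatten(1)            # predicted attribute vector of length A
```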
Because data from a single modality cause information loss, a text learning branch based on attribute learning is provided; it uses the text-modality domain knowledge provided by ophthalmologists, embeds the learned attribute vectors into the model as text information, and learns text features. Using the intermediate latent feature map as input reduces the difference between the visual and text features and strengthens the intrinsic relationship between them.
The visual learning branch is shown in fig. 4 and comprises a visual decoder (Visual decoder), a Feature Pyramid Network (FPN), a segmentation sub-module (SEG) and an embedding sub-module (EMB), wherein:
the visual decoder is an up-sampling sub-network with a structure similar to U-Net, using skip connections and a symmetric architecture; its input is the feature data from the visual encoder and its output is fed into the feature pyramid network; the output P_FPN of the feature pyramid network is input into the embedding sub-module, and P_FPN is also added to the visual context information P_F1 and fed into the segmentation sub-module; the segmentation sub-module outputs the feature information P_SEG and the embedding sub-module outputs the feature information P_EMB;
the segmentation sub-module consists of a first segmentation block (SEG1) and a second segmentation block (SEG2); the first segmentation block consists of two 3 × 3 convolutional layers and a ReLU activation function, its input is the sum of the output P_FPN of the feature pyramid network and the visual context information P_F1, and its output is the feature information P_SEG; the second segmentation block consists of three 3 × 3 convolutional layers and a ReLU activation function, its input is P_SEG and its output is the dense prediction Ŷ. The segmentation sub-module performs an auxiliary segmentation task that induces the model to pay attention to locally non-salient regions by segmenting the four key structures from the background;
the embedding sub-module comprises a first embedding block (EMB1) and a second embedding block (EMB2); the first embedding block is formed by five 3 × 3 convolutional layers and maps the multi-scale feature map P_FPN to the embedded feature map P_EMB, which represents discriminative features; the second embedding block consists of a 1 × 1 convolutional layer that maps every pixel in the image to a point in feature space, each point being represented by an embedding vector e_i that expresses the compressed implicit information of the pixel in the image. The embedding sub-module, based on metric learning, learns discriminative structural features: metric learning directly learns the mapping of each pixel in the anterior chamber angle image to a point in feature space. On the one hand, the points of all pixels belonging to one category should be close to each other, with a small distance to the cluster center; on the other hand, the points of pixels belonging to the same class form a cluster, and different clusters are far apart.
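A sketch of the two heads of the visual learning branch described above, assuming PyTorch; channel widths, the embedding dimension and the exact placement of the activations are assumptions, and P_FPN and P_F1 are assumed to share the same spatial size.

```python
import torch
import torch.nn as nn

class SegmentationHead(nn.Module):
    """SEG1 (two 3x3 convs + ReLU) produces P_SEG from P_FPN + P_F1; SEG2
    (three 3x3 convs + ReLU) produces the dense prediction over the five
    classes SL, TM, SS, CBB and background."""
    def __init__(self, in_ch: int = 256, mid_ch: int = 128, num_classes: int = 5):
        super().__init__()
        self.seg1 = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 3, padding=1),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.seg2 = nn.Sequential(
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1),
            nn.Conv2d(mid_ch, num_classes, 3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, p_fpn: torch.Tensor, p_f1: torch.Tensor):
        p_seg = self.seg1(p_fpn + p_f1)   # FPN output added to the visual context P_F1
        return p_seg, self.seg2(p_seg)    # feature information P_SEG and dense prediction

class EmbeddingHead(nn.Module):
    """EMB1 (five 3x3 convs) maps P_FPN to the embedded feature map P_EMB;
    EMB2 (one 1x1 conv) maps every pixel to an embedding vector."""
    def __init__(self, in_ch: int = 256, emb_dim: int = 16):
        super().__init__()
        self.emb1 = nn.Sequential(*[nn.Conv2d(in_ch, in_ch, 3, padding=1) for _ in range(5)])
        self.emb2 = nn.Conv2d(in_ch, emb_dim, 1)

    def forward(self, p_fpn: torch.Tensor):
        p_emb = self.emb1(p_fpn)
        return p_emb, self.emb2(p_emb)    # P_EMB and the per-pixel embedding vectors
```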
The loss function in step S3 is specifically:
L_total = α·L_CLS + β·L_SEG + γ·L_EMB + δ·L_TE
where L_total is the total loss function, L_CLS is the loss function of the main branch, L_SEG is the loss function of the segmentation sub-module, L_EMB is the loss function of the embedding sub-module, L_TE is the loss function of the text learning branch, and α, β, γ and δ are the weights of the four loss terms;
the segmentation submodule uses Dice Loss, which treats the segmentation as a pixel-by-pixel classification, and therefore, the task is to classify each pixel in the image into one of five classes: SL, TM, SS, CBB and background, the weakly supervised loss function is calculated according to the following formula:
Figure BDA0003853131520000133
Figure BDA0003853131520000134
Figure BDA0003853131520000141
in the formula (I), the compound is shown in the specification,
Figure BDA0003853131520000142
for calculating the loss incurred by training samples with pixel-level labels,
Figure BDA0003853131520000143
for calculating the loss caused by label data at the unmarked pixel level,
Figure BDA0003853131520000144
and
Figure BDA0003853131520000145
respectively the prediction probability and the true label of the pixel i, N j Indicates the number of pixels in structure j,
Figure BDA0003853131520000146
a pseudo-label representing the pixel i is shown,
Figure BDA0003853131520000147
indicating that a pseudo label, m, for pixel i is used when the predicted probability score is above a threshold τ i 1, placing;
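A sketch of the weakly supervised mechanism described above, assuming a PyTorch implementation; the original's exact Dice formulation is not reproduced here, so a standard soft Dice term and a confidence-masked cross-entropy term are used as stand-ins.

```python
import torch
import torch.nn.functional as F

def dice_loss(probs: torch.Tensor, target_onehot: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Standard soft Dice loss over classes; a stand-in for the patent's formulation."""
    inter = (probs * target_onehot).sum(dim=(0, 2, 3))
    union = probs.sum(dim=(0, 2, 3)) + target_onehot.sum(dim=(0, 2, 3))
    return (1 - (2 * inter + eps) / (union + eps)).mean()

def weakly_supervised_seg_loss(logits: torch.Tensor,
                               pixel_labels: torch.Tensor,
                               has_label: torch.Tensor,
                               tau: float = 0.9) -> torch.Tensor:
    """logits: (B, 5, H, W); pixel_labels: (B, H, W) class indices for labeled
    samples; has_label: (B,) bool mask of samples carrying pixel-level labels."""
    probs = logits.softmax(dim=1)
    loss = logits.new_zeros(())
    if has_label.any():                                      # supervised term (pixel-level labels)
        onehot = F.one_hot(pixel_labels[has_label], 5).permute(0, 3, 1, 2).float()
        loss = loss + dice_loss(probs[has_label], onehot)
    if (~has_label).any():                                   # unsupervised term (pseudo-labels)
        p_u = probs[~has_label]
        conf, pseudo = p_u.max(dim=1)                        # pseudo-label per pixel
        m = (conf > tau).float()                             # m_i = 1 when the score exceeds tau
        ce = F.nll_loss(torch.log(p_u + 1e-8), pseudo, reduction="none")
        loss = loss + (m * ce).sum() / m.sum().clamp(min=1.0)
    return loss
```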
The embedding sub-module uses a discriminative loss function for metric learning, guiding the model to learn discriminative features of each structure in the feature space; it forces the model to map each pixel of the image to an embedding vector in the feature space, represented as a point. In this way, the points of pixels with the same label (same structure) are close to each other in the feature space, each class (structure) forms a corresponding cluster, and different clusters are far apart; this metric learning approach enables the model to recognize different structures within an image, as well as the same structure across different images, and thus to learn discriminative features. The discriminative loss is a weighted sum of three terms:
L_EMB = λ·L_var + ρ·L_dist + ω·L_reg
L_var = (1/C) · Σ_{c=1}^{C} (1/N_c) · Σ_{i=1}^{N_c} [ ||μ_c − e_i|| − δ_v ]_+^2
L_dist = (1/(C·(C−1))) · Σ_{c_A=1}^{C} Σ_{c_B≠c_A} [ 2·δ_d − ||μ_{c_A} − μ_{c_B}|| ]_+^2
L_reg = (1/C) · Σ_{c=1}^{C} ||μ_c||
where e_i is the embedding vector of pixel i, μ_c is the mean embedding vector of class c, i.e. the class center, N_c is the number of pixels of class c, C is the number of classes, δ_v and δ_d are the margins of the variance and distance losses, i.e. respectively the maximum distance accepted within a cluster and the minimum distance by which clusters are pushed apart, || · || is the L2 norm, and [x]_+ = max(0, x). The second formula is the variance term, which applies a pulling force toward the class center on each embedding vector; the third formula is the distance term, which pushes the class centers of different clusters away from each other; the fourth formula is the regularization term.
The text learning branch obtains text features carrying latent semantic information according to the attribute learning method and the domain knowledge, and the potential features synthesized with the visual features are finally used for the anterior chamber angle assessment. The latent feature P_la is mapped to the predicted attribute vector v̂ in the semantic space, and the similarity score s_I against the five manually defined real attribute vectors is computed as the inner product:
s_I = v̂ · Vᵀ
where v̂ denotes the predicted A-dimensional attribute vector and V denotes the text labels manually defined according to domain knowledge. Note that s_I contains the similarity score of each ACA level, and the level with the highest similarity score in s_I is the prediction level of the anterior chamber angle image I; the larger the similarity score between the predicted attribute vector and the real attribute vector of category c, the more likely the image is predicted to be of category c.
As shown in fig. 5, given the latent feature P_la and the manually defined attribute matrix V, attribute learning optimizes the text learning branch loss L_TE so that the predicted attribute vector matches the text label of the corresponding level.
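A sketch of the inner-product scoring and level prediction described above, assuming NumPy; the 5 × A attribute matrix V below reuses the illustrative values from the earlier sketch and is not the patent's Table 1.

```python
import numpy as np

LEVELS = ("N1", "N2", "N3", "N4", "W")

def predict_level(v_hat: np.ndarray, text_labels: np.ndarray) -> str:
    """s_I = v_hat · V^T; the level with the highest similarity score is the prediction."""
    s = text_labels @ v_hat            # similarity score for each anterior chamber angle level
    return LEVELS[int(np.argmax(s))]

# illustrative 5 x A attribute matrix (A = 5 here); values are assumptions
V = np.array([[2, 2, 2, 1, 0],
              [2, 2, 2, 0, 1],
              [2, 2, 1, 0, 1],
              [2, 1, 0, 0, 2],
              [2, 2, 2, 2, 0]], dtype=np.float32)
v_hat = np.array([1.9, 1.8, 2.1, 0.2, 0.9], dtype=np.float32)   # predicted attribute vector
print(predict_level(v_hat, V))                                   # -> "N2"
```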
In the specific implementation, this embodiment performs a comparative experiment against existing reference models (VGG, GoogLeNet, ResNet, CCT, UPS and FixMatch); the experimental results are shown in Table 2, and Table 3 gives the performance of each predicted attribute on the test samples.
TABLE 2 Classification Performance of VTFN and other reference models
Table 3 performance of various predicted attributes of test samples
The same or similar reference numerals correspond to the same or similar parts;
the terms describing positional relationships in the drawings are for illustrative purposes only and are not to be construed as limiting the patent;
it should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to enumerate all embodiments exhaustively here. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the claims of the present invention.

Claims (10)

1. An anterior chamber angle image grading method based on visual text fusion is characterized by comprising the following steps:
s1: constructing an anterior chamber angle image dataset;
s2: pre-processing images in the anterior chamber angle image dataset;
s3: constructing and initializing a deep neural network model based on visual text fusion;
s4: initializing a loss function and an optimizer;
s5: training the deep neural network model based on visual text fusion of step S3 with the preprocessed anterior chamber angle image dataset of step S2, and calculating the loss function;
s6: updating the network parameters of the deep neural network model based on visual text fusion with the optimizer so that they approach or reach their optimal values, minimizing the loss function and finding the optimal network parameters, thereby obtaining the optimal deep neural network model based on visual text fusion;
s7: grading the anterior chamber angle image with the optimal deep neural network model based on visual text fusion.
2. The anterior chamber angle image grading method based on visual text fusion according to claim 1, wherein the anterior chamber angle image data set in step S1 comprises a plurality of anterior chamber angle images, each of which is labeled with anterior chamber angle grading information and a manually defined text label, wherein a part of the anterior chamber angle images are also labeled with a pixel-level label, wherein:
the anterior chamber angle grading information is divided into five grades, namely N1, N2, N3, N4 and W, according to the anterior chamber angle evaluation system described by Shaffer, where each grade corresponds to a different clinical description;
the text label summarizes the clinical features of each level of anterior chamber angle according to the clinical description of that level; a mapping strategy is defined to map the clinical features of each level of anterior chamber angle into computer-recognizable codes, each level of anterior chamber angle corresponds to one attribute vector, the five levels form an attribute matrix of five A-dimensional vectors, called the text label, and each vector in the attribute matrix represents the text description of one level of anterior chamber angle;
the pixel-level label labels each pixel in the anterior chamber angle image as belonging to one of a Schwalbe line, trabecular meshwork, scleral spur, ciliary body zone, or background structure.
3. The anterior chamber angle image classification method based on visual text fusion as claimed in claim 2, wherein the mapping strategy uses sequential coding, specifically:
A attributes are selected to describe the anterior chamber angle images of the various levels, and each level of anterior chamber angle image is represented by an A-dimensional word-level attribute vector that is sequentially encoded into a computer-recognizable attribute vector v = (v_0, …, v_{A-1}); v_0 to v_{A-2} indicate the degree to which each structure of the anterior chamber angle is visible in the anterior chamber angle image, with 0 indicating invisible, 1 partially visible and 2 fully visible; v_{A-1} is the semantic attribute indicating the likelihood of anterior chamber angle closure.
4. The anterior chamber angle image classification method based on visual text fusion according to claim 1, characterized in that the images in the anterior chamber angle image dataset are preprocessed in step S2, specifically:
data enhancement operations, including random horizontal mirroring and random salt-and-pepper noise, are applied to each anterior chamber angle image in the dataset, and the images are finally normalized.
5. The anterior chamber angle image classification method based on visual text fusion as claimed in claim 1, wherein the deep neural network model based on visual text fusion in step S3 is specifically:
the deep neural network model based on visual text fusion comprises a visual learning branch, a text learning branch and a main branch; an image from the anterior chamber angle image dataset is input into the main branch, one of the resulting potential visual feature maps is fed into the text learning branch, the text learning branch produces a text feature map from the potential visual feature map and returns it to the main branch, the visual learning branch receives feature information from the main branch and returns visual features to it, and the main branch then performs visual text fusion and outputs the anterior chamber angle image grade.
6. The anterior chamber angle image grading method based on visual text fusion according to claim 5, wherein the main branch comprises a visual encoder, a first fusion block, a second fusion block, and a classifier, wherein:
the visual encoder is a ResNet50 whose input is an image from the anterior chamber angle image dataset and whose output is two potential visual feature maps at different scales; one potential visual feature map $P_{la}$ is input into the text learning branch, and the other potential visual feature map $P_{vi}$ is input into the first fusion block; the first fusion block also receives the text feature map $P_{te}$ output by the text learning branch, and fuses $P_{vi}$ with $P_{te}$ to obtain the visual context information $P_{F1}$; $P_{F1}$ is sent both to the visual learning branch and to the second fusion block; the second fusion block also receives the feature information $P_{SEG}$ and $P_{EMB}$ output by the visual learning branch, and fuses $P_{F1}$ with $P_{SEG}$ and $P_{EMB}$ to obtain the aggregated latent feature $P_{F2}$; finally, $P_{F2}$ is input into the classifier, which is built as a multilayer perceptron and maps $P_{F2}$ to the class distribution to obtain the anterior chamber angle image grade.
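The following PyTorch sketch illustrates one possible wiring of the main branch in claim 6. The point at which the ResNet50 backbone is split into the two latent maps, the channel widths and the MLP head size are assumptions of this sketch, and the fusion blocks and side branches are passed in as abstract modules.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class MainBranch(nn.Module):
    def __init__(self, fusion1: nn.Module, fusion2: nn.Module,
                 text_branch: nn.Module, visual_branch: nn.Module,
                 num_grades: int = 5):
        super().__init__()
        backbone = resnet50(weights=None)
        # Shallow layers produce the latent map P_la sent to the text branch;
        # the deeper layers produce the second latent map P_vi. Where exactly
        # the backbone is split is an assumption of this sketch.
        self.shallow = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu,
                                     backbone.maxpool, backbone.layer1, backbone.layer2)
        self.deep = nn.Sequential(backbone.layer3, backbone.layer4)
        self.fusion1, self.fusion2 = fusion1, fusion2
        self.text_branch, self.visual_branch = text_branch, visual_branch
        self.classifier = nn.Sequential(nn.Linear(2048 * 3, 512), nn.ReLU(),
                                        nn.Linear(512, num_grades))  # MLP head (width assumed)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        p_la = self.shallow(x)                    # latent visual feature map P_la
        p_vi = self.deep(p_la)                    # latent visual feature map P_vi
        p_te = self.text_branch(p_la)             # text feature map P_te returned to the main branch
        p_f1 = self.fusion1(p_vi, p_te)           # visual context information P_F1
        p_seg, p_emb = self.visual_branch(p_f1)   # features P_SEG, P_EMB from the visual branch
        p_f2 = self.fusion2(p_f1, p_seg, p_emb)   # aggregated latent feature P_F2
        return self.classifier(p_f2)              # anterior chamber angle grade distribution
```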
7. The anterior chamber angle image grading method based on visual text fusion according to claim 6, characterized in that the first fusion block adopts an attention mechanism, and the fusion process of the first fusion block is as follows:
the potential visual feature map $P_{vi}$ and the text feature map $P_{te}$ each model static context information through a 3×3 convolution;
the context information obtained from the text feature map $P_{te}$ is channel-concatenated with $P_{te}$, passed through two consecutive 1×1 convolutions, and then reshaped and averaged to obtain a text relation matrix;
the context information obtained from the potential visual feature map $P_{vi}$ is reshaped to obtain a visual relation matrix;
the text relation matrix is normalized with a Softmax function to obtain an attention weight map, which is multiplied element-wise with the visual relation matrix, so that the text information guides visual feature learning and yields new visual context information;
the dependency between features of the visual and text modalities is modeled by element-wise summation, completing the fusion of the potential visual feature map $P_{vi}$ and the text feature map $P_{te}$;
the specific fusion process of the second fusion block is as follows:
$$P_{F2} = \mathrm{GAP}(P_{F1}) \mathbin{+\!\!+} \mathrm{GAP}(P_{SEG}) \mathbin{+\!\!+} \mathrm{GAP}(P_{EMB})$$
where $\mathrm{GAP}(\cdot)$ is the global average pooling operation and ${+\!\!+}$ is the channel splicing (concatenation) operation.
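A short sketch of the second fusion block formula above, assuming the three inputs are standard (N, C, H, W) feature maps.

```python
# P_F2 = GAP(P_F1) ++ GAP(P_SEG) ++ GAP(P_EMB), where ++ is channel concatenation.
import torch
import torch.nn.functional as F

def second_fusion(p_f1: torch.Tensor, p_seg: torch.Tensor,
                  p_emb: torch.Tensor) -> torch.Tensor:
    # Global average pooling collapses each (N, C, H, W) map to (N, C).
    gap = lambda t: F.adaptive_avg_pool2d(t, 1).flatten(1)
    return torch.cat([gap(p_f1), gap(p_seg), gap(p_emb)], dim=1)  # (N, C1 + C2 + C3)
```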
8. The anterior chamber angle image grading method based on visual text fusion according to claim 7, characterized in that the text learning branch consists of a text encoder composed of the res4 residual block of ResNet; its input is the potential visual feature map $P_{la}$ from the visual encoder, its output is the text feature map $P_{te}$, and the res4 parameters of the visual encoder are shared with the text encoder parameters; the text feature map $P_{te} \in \mathbb{R}^{C \times H \times W}$ is obtained from the text encoder through attribute learning, where C, H and W denote channel, height and width respectively; the text learning branch applies global average pooling over H and W to learn a global discriminative feature:
$$\bar{P}_{te} = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} P_{te}^{(i,j)}$$
where $P_{te}^{(i,j)}$ is extracted from the feature $P_{te}$ at spatial location (i, j);
the text learning branch also utilizes a linear layer with parameter $W_{te}$ to map the text feature into the semantic embedding space, thereby predicting an attribute vector $\hat{V}$ that represents the predicted latent semantic information of the A attributes in the anterior chamber angle image I:
$$\hat{V} = W_{te}\big(\bar{P}_{te}\big)$$
where $W_{te}(\cdot)$ is a linear transformation implemented as a 1×1 convolution over the input tensor, and $\hat{V} \in \mathbb{R}^{A}$ is the predicted attribute vector.
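A sketch of the text learning branch in claim 8, with a res4-style encoder passed in; the encoder's output channel count and the number of attributes A are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class TextBranch(nn.Module):
    def __init__(self, encoder: nn.Module, in_channels: int = 1024, num_attrs: int = 5):
        super().__init__()
        self.encoder = encoder   # res4-style block; parameters assumed shared with the visual encoder
        self.w_te = nn.Conv2d(in_channels, num_attrs, kernel_size=1)  # W_te as a 1x1 convolution

    def forward(self, p_la: torch.Tensor):
        p_te = self.encoder(p_la)                     # text feature map (N, C, H, W)
        pooled = p_te.mean(dim=(2, 3), keepdim=True)  # global average pooling over H and W
        v_hat = self.w_te(pooled).flatten(1)          # predicted attribute vector (N, A)
        return p_te, v_hat
```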
9. The anterior chamber angle image grading method based on visual text fusion according to claim 8, wherein the visual learning branch comprises a visual decoder, a feature pyramid network, a segmentation submodule and an embedding submodule, wherein:
the visual decoder is an up-sampling sub-network with skip connections and a symmetric structure; its input is the feature information from the visual encoder and its output is fed into the feature pyramid network; the output $P_{FPN}$ of the feature pyramid network is input into the embedding submodule, and $P_{FPN}$ is also added to the visual context information $P_{F1}$ and fed into the segmentation submodule; the segmentation submodule outputs the feature information $P_{SEG}$ and the embedding submodule outputs the feature information $P_{EMB}$;
the segmentation submodule consists of a first segmentation block and a second segmentation block; the first segmentation block consists of two 3×3 convolutional layers and a ReLU activation function, its input is the sum of the feature pyramid network output $P_{FPN}$ and the visual context information $P_{F1}$, and its output is the feature information $P_{SEG}$; the second segmentation block consists of three 3×3 convolutional layers and a ReLU activation function, its input is the feature information $P_{SEG}$, and its output is the dense prediction $\hat{Y}$;
the embedding submodule comprises a first embedding block and a second embedding block; the first embedding block consists of five 3×3 convolutional layers and maps the multi-scale feature map $P_{FPN}$ to the embedding feature map $P_{EMB}$ representing discriminative features; the second embedding block consists of a 1×1 convolutional layer and maps every pixel in the image to a point in the feature space, each point being represented by an embedding vector $e_i$ that expresses the compressed implicit information of that pixel in the image.
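A sketch of the segmentation and embedding submodules in claim 9; the channel widths, the number of anatomical classes (four structures plus background) and the embedding dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

def conv3x3(cin: int, cout: int) -> nn.Conv2d:
    return nn.Conv2d(cin, cout, kernel_size=3, padding=1)

class SegmentationSubmodule(nn.Module):
    def __init__(self, channels: int = 256, num_classes: int = 5):
        super().__init__()
        self.block1 = nn.Sequential(conv3x3(channels, channels),
                                    conv3x3(channels, channels), nn.ReLU())        # -> P_SEG
        self.block2 = nn.Sequential(conv3x3(channels, channels),
                                    conv3x3(channels, channels),
                                    conv3x3(channels, num_classes), nn.ReLU())     # -> dense prediction

    def forward(self, fpn_plus_f1: torch.Tensor):
        p_seg = self.block1(fpn_plus_f1)        # input: P_FPN + P_F1
        return p_seg, self.block2(p_seg)

class EmbeddingSubmodule(nn.Module):
    def __init__(self, channels: int = 256, embed_dim: int = 16):
        super().__init__()
        self.block1 = nn.Sequential(*[conv3x3(channels, channels) for _ in range(5)])  # -> P_EMB
        self.block2 = nn.Conv2d(channels, embed_dim, kernel_size=1)  # per-pixel embedding vectors

    def forward(self, p_fpn: torch.Tensor):
        p_emb = self.block1(p_fpn)
        return p_emb, self.block2(p_emb)
```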
10. The anterior chamber angle image grading method based on visual text fusion according to claim 9, characterized in that the loss function in step S3 is specifically:
$$L_{total} = \alpha \cdot L_{CLS} + \beta \cdot L_{SEG} + \gamma \cdot L_{EMB} + \delta \cdot L_{TE}$$
where $L_{total}$ is the total loss function, $L_{CLS}$ is the loss function of the main branch, $L_{SEG}$ is the loss function of the segmentation submodule, $L_{EMB}$ is the loss function of the embedding submodule, $L_{TE}$ is the loss function of the text learning branch, and α, β, γ and δ are the weights of the four loss terms;
the segmentation loss combines a supervised term and a pseudo-label term:
$$L_{SEG} = L_{SEG}^{l} + L_{SEG}^{u}$$
$$L_{SEG}^{l} = -\sum_{j}\frac{1}{N_j}\sum_{i \in j} y_i \log(p_i)$$
$$L_{SEG}^{u} = -\frac{1}{\sum_{i} m_i}\sum_{i} m_i\, \hat{y}_i \log(p_i)$$
where $L_{SEG}^{l}$ computes the loss incurred by training samples with pixel-level labels, $L_{SEG}^{u}$ computes the loss incurred by data without pixel-level labels, $p_i$ and $y_i$ are respectively the prediction probability and the true label of pixel i, $N_j$ denotes the number of pixels in structure j, $\hat{y}_i$ denotes the pseudo-label of pixel i, and $m_i$ indicates whether the pseudo-label of pixel i is used: when the prediction probability score is above a threshold τ, $m_i$ is set to 1;
$$L_{EMB} = \lambda \cdot L_{var} + \rho \cdot L_{dist} + \omega \cdot L_{reg}$$
$$L_{var} = \frac{1}{K}\sum_{k=1}^{K}\frac{1}{N_k}\sum_{i=1}^{N_k}\big[\,\|\mu_k - e_i\| - \delta_v\,\big]_+^{2}$$
$$L_{dist} = \frac{1}{K(K-1)}\sum_{k_A=1}^{K}\sum_{k_B \neq k_A}\big[\,2\delta_d - \|\mu_{k_A} - \mu_{k_B}\|\,\big]_+^{2}$$
$$L_{reg} = \frac{1}{K}\sum_{k=1}^{K}\|\mu_k\|$$
where K is the number of structure clusters, $N_k$ is the number of pixels in cluster k, $e_i$ is the embedding vector of pixel i, $\mu_k$ is the cluster mean vector, i.e. the class center, $\delta_v$ and $\delta_d$ are the margins of the variance and distance losses, i.e. respectively the maximum distance acceptable within a cluster and the minimum distance required between clusters, $\|\cdot\|$ is the L2 norm, and $[x]_+ = \max(0, x)$;
the loss function of the text learning branch is computed from the similarity between the predicted attribute vector and the text label:
$$s_I = \hat{V} \cdot V^{\top}$$
$$L_{TE} = -\log \frac{\exp\!\big(s_I^{(y_I)}\big)}{\sum_{k=1}^{5} \exp\!\big(s_I^{(k)}\big)}$$
where $\hat{V}$ denotes the predicted A-dimensional attribute vector, V denotes the text label defined manually according to domain knowledge, $s_I$ denotes the similarity score of each anterior chamber angle grade, and the grade with the highest similarity score is taken as the predicted grade of the anterior chamber angle image I.
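As one worked example of the losses above, the following sketch implements the embedding loss $L_{EMB}$ with its variance, distance and regularization terms; the margins $\delta_v$, $\delta_d$ and the weights λ, ρ, ω are placeholder values, not the patent's settings.

```python
import torch

def embedding_loss(embeddings: torch.Tensor, labels: torch.Tensor,
                   delta_v: float = 0.5, delta_d: float = 1.5,
                   lam: float = 1.0, rho: float = 1.0, omega: float = 0.001) -> torch.Tensor:
    # embeddings: (P, D) per-pixel embedding vectors; labels: (P,) structure ids.
    classes = labels.unique()
    centers = torch.stack([embeddings[labels == c].mean(0) for c in classes])  # cluster centers mu_k
    l_var = torch.stack([
        (torch.clamp((embeddings[labels == c] - centers[k]).norm(dim=1) - delta_v,
                     min=0) ** 2).mean()
        for k, c in enumerate(classes)]).mean()                   # pull pixels toward their center
    if len(classes) > 1:
        d = torch.cdist(centers, centers)                         # pairwise center distances
        off = ~torch.eye(len(classes), dtype=torch.bool)
        l_dist = (torch.clamp(2 * delta_d - d[off], min=0) ** 2).mean()  # push centers apart
    else:
        l_dist = embeddings.new_tensor(0.0)
    l_reg = centers.norm(dim=1).mean()                            # keep centers near the origin
    return lam * l_var + rho * l_dist + omega * l_reg
```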
CN202211138484.7A 2022-09-19 2022-09-19 Anterior chamber angle image grading method based on visual text fusion Pending CN115423790A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211138484.7A CN115423790A (en) 2022-09-19 2022-09-19 Anterior chamber angle image grading method based on visual text fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211138484.7A CN115423790A (en) 2022-09-19 2022-09-19 Anterior chamber angle image grading method based on visual text fusion

Publications (1)

Publication Number Publication Date
CN115423790A true CN115423790A (en) 2022-12-02

Family

ID=84204639

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211138484.7A Pending CN115423790A (en) 2022-09-19 2022-09-19 Anterior chamber angle image grading method based on visual text fusion

Country Status (1)

Country Link
CN (1) CN115423790A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118135052A (en) * 2024-05-08 2024-06-04 南京邮电大学 Method and system for reconstructing visual stimulus image based on human brain fMRI

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination