CN115423790A - Anterior chamber angle image grading method based on visual text fusion - Google Patents

Anterior chamber angle image grading method based on visual text fusion

Info

Publication number
CN115423790A
Authority
CN
China
Prior art keywords
text
visual
anterior chamber
chamber angle
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211138484.7A
Other languages
Chinese (zh)
Inventor
贾西平
黄静琪
关立南
聂栋
崔怀林
廖秀秀
林智勇
马震远
刘海珠
张倩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Polytechnic Normal University
Original Assignee
Guangdong Polytechnic Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Polytechnic Normal University filed Critical Guangdong Polytechnic Normal University
Priority to CN202211138484.7A priority Critical patent/CN115423790A/en
Publication of CN115423790A publication Critical patent/CN115423790A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/0002 Inspection of images, e.g. flaw detection
    • G06T7/0012 Biomedical image inspection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30004 Biomedical image processing
    • G06T2207/30041 Eye; Retina; Ophthalmic

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Radiology & Medical Imaging (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an anterior chamber angle image grading method based on visual text fusion, which comprises the following steps: S1: constructing an anterior chamber angle image dataset; S2: preprocessing the images in the anterior chamber angle image dataset; S3: constructing and initializing a deep neural network model based on visual text fusion; S4: initializing a loss function and an optimizer; S5: training the deep neural network model based on visual text fusion of step S3 with the preprocessed anterior chamber angle image dataset of step S2, and calculating the loss function; S6: updating the network parameters of the deep neural network model based on visual text fusion with the optimizer to obtain the optimal deep neural network model based on visual text fusion; S7: grading anterior chamber angle images with the optimal deep neural network model based on visual text fusion. The invention addresses, to a certain extent, the problem of computer-aided diagnosis and treatment of glaucoma.

Description

Anterior chamber angle image grading method based on visual text fusion
Technical Field
The invention relates to the field of medical image processing, in particular to an anterior chamber angle image grading method based on visual text fusion.
Background
Glaucoma is an irreversible blinding eye disease and a leading cause of blindness. After glaucoma has been diagnosed, it is usually necessary clinically to determine the anterior chamber angle grade by observing the internal structure of the Anterior Chamber Angle (ACA) with the aid of a gonioscope, and then to develop a targeted treatment plan. In recent years, deep learning has achieved many important results in the field of medical image analysis. Automatic classification of anterior chamber angle images based on deep learning offers a high effective rate, good stability and consistency of its conclusions, and the possibility of being deployed in different regions, and therefore has important clinical value.
1. Anterior chamber angle assessment. Cheng et al. used edge detection and arc detection algorithms to achieve a two-stage intelligent analysis of RetCam images; their system automatically classifies closed-angle/open-angle glaucoma by measuring the size of the arc or angle. EyeCam was originally designed to produce wide-angle fundus pictures and was later adapted to record the anterior chamber angle. It is a portable handheld device that, like the gonioscopy considered herein, performs a contact examination on the patient. Baskaran et al. compared the performance of gonioscopy and the EyeCam system and found that the diagnostic results of the two devices in determining whether the angle is closed were substantially consistent. Unlike the objective of this work (distinguishing five levels of the anterior chamber angle), studies of RetCam and EyeCam images are more concerned with whether the angle in the image is closed, and this binary classification task is simpler. In addition, researchers have achieved segmentation of four structures in the anterior chamber angle image, which differs considerably from the classification task here.
2. Glaucoma detection. In recent years, deep learning techniques have achieved a series of important results in glaucoma detection, including the analysis of imaging data such as AS-OCT and fundus examination images, enabling computer-aided diagnosis of glaucoma. Some studies aim at segmenting geometric structures in AS-OCT and fundus images (for optic nerve segmentation or cup/disc segmentation) using deep convolutional neural networks or fully convolutional neural networks. For example, Fang et al. extract image features of AS-OCT with convolutional neural networks and, at the same time, find the final boundary from the probability map with a graph search method, thereby achieving automatic segmentation of retinal layer boundaries in AS-OCT images. Other studies are designed to support glaucoma diagnosis, mainly capturing color information of optic nerve fundus images with convolutional neural network architectures based on deep learning. Liu et al. constructed a deep learning system for the detection of glaucomatous optic neuropathy using the averages over the red, blue and green channels.
In summary, deep learning techniques have achieved a series of important results in glaucoma-related medical image analysis (e.g., fundus images, OCT images), but relatively little work assists the analysis of anterior chamber angle images. Considering that conventional gonioscopy is still widely used in China, it is necessary to study computer-aided analysis of the anterior chamber angle image.
Disclosure of Invention
The invention provides an anterior chamber angle image grading method based on visual text fusion, which realizes automatic grading of an anterior chamber angle image.
In order to solve the technical problems, the technical scheme of the invention is as follows:
an anterior chamber angle image grading method based on visual text fusion comprises the following steps:
s1: constructing an anterior chamber angle image dataset;
s2: pre-processing images in the anterior chamber angle image dataset;
s3: constructing and initializing a deep neural network model based on visual text fusion;
s4: initializing a loss function and an optimizer;
s5: training the deep neural network model based on visual text fusion of step S3 with the preprocessed anterior chamber angle image dataset of step S2, and calculating the loss function;
s6: updating the network parameters of the deep neural network model based on visual text fusion with the optimizer so that they approach or reach their optimal values, minimizing the loss function and finding the optimal network parameters, thereby obtaining the optimal deep neural network model based on visual text fusion;
s7: grading the anterior chamber angle image with the optimal deep neural network model based on visual text fusion.
Preferably, the anterior chamber angle image data set in step S1 includes a plurality of anterior chamber angle images, each of which is labeled with anterior chamber angle grade information and a manually defined text label, wherein a part of the anterior chamber angle images are also labeled with a pixel-level label, wherein:
the anterior chamber angle grading information is divided into five grades, namely N1, N2, N3, N4 and W, according to the anterior chamber angle evaluation system described by Shaffer, where each grade corresponds to a different clinical description;
the text label summarizes the clinical features of each level of anterior chamber angle according to the clinical description of that level; a mapping strategy is defined to map the clinical features of each level of anterior chamber angle into computer-recognizable codes, each level of anterior chamber angle corresponds to one attribute vector, the five levels form an attribute matrix of five A-dimensional vectors, called the text label, and each vector in the attribute matrix represents the text description of one level of anterior chamber angle;
the pixel-level label labels each pixel in the anterior chamber angle image as belonging to one of a Schwalbe line, trabecular meshwork, scleral spur, ciliary body zone, or background structure.
Preferably, the mapping strategy uses sequential coding, specifically:
A attributes are selected to describe the anterior chamber angle images of the various levels, and each level of anterior chamber angle image is represented by an A-dimensional word-level attribute vector that is sequentially encoded into a computer-recognizable attribute vector v = (v_0, …, v_{A-1}); v_0 to v_{A-2} indicate the degree to which each structure of the anterior chamber angle is visible in the anterior chamber angle image, with 0 indicating invisible, 1 partially visible and 2 fully visible; v_{A-1} is the semantic attribute indicating the likelihood of anterior chamber angle closure.
Preferably, the image in the anterior chamber angle image data set is preprocessed in step S2, specifically:
data enhancement operations, including random horizontal mirroring and random salt-and-pepper noise, are applied to each anterior chamber angle image in the dataset, and the images are finally normalized.
Preferably, the deep neural network model based on visual text fusion in step S3 specifically includes:
the deep neural network model based on visual text fusion comprises a visual learning branch, a text learning branch and a main branch; an image from the anterior chamber angle image dataset is input into the main branch, one of the resulting potential visual feature maps is fed into the text learning branch, the text learning branch produces a text feature map from the potential visual feature map and returns it to the main branch, the visual learning branch receives feature information from the main branch and returns visual features to it, and the main branch then performs visual text fusion and outputs the anterior chamber angle image grade.
Preferably, the main branch comprises a visual encoder, a first fusion block, a second fusion block, and a classifier, wherein:
the visual encoder is a ResNet50; its input is an image from the anterior chamber angle image dataset and its outputs are potential visual feature maps at two different scales. One potential visual feature map P_la is input into the text learning branch, and the other potential visual feature map P_vi is input into the first fusion block. The first fusion block also receives the text feature map P_te output by the text learning branch and fuses P_vi with P_te to obtain the visual context information P_F1. P_F1 is sent to the visual learning branch and to the second fusion block; the second fusion block also receives the feature information P_SEG and P_EMB output by the visual learning branch and fuses P_F1 with P_SEG and P_EMB to obtain the aggregated potential feature P_F2. Finally, P_F2 is input into the classifier, which is built as a multilayer perceptron and maps P_F2 to the class distribution to obtain the anterior chamber angle image grade.
Preferably, the first fusion block adopts an attention mechanism, and the fusion process of the first fusion block is as follows:
the potential visual feature map P_vi and the text feature map P_te each model static context information through a 3 × 3 convolution;
for the text feature map P_te, the obtained context information and P_te are combined by a channel splicing operation, followed by two consecutive 1 × 1 convolution operations and then reshaping and averaging operations to obtain the text relation matrix;
the context information obtained from the potential visual feature map P_vi is reshaped to obtain the visual relation matrix;
the text relation matrix is normalized with a Softmax function to obtain an attention weight map, which is multiplied element by element with the visual relation matrix so that the text information guides the visual feature learning and new visual context information is obtained;
the dependency between the features of the visual and text modalities is modeled by element-wise summation, completing the fusion of the potential visual feature map P_vi and the text feature map P_te;
the specific fusion process of the second fusion block is as follows:
P_F2 = GAP(P_F1) ⊕ GAP(P_SEG) ⊕ GAP(P_EMB)
where GAP(·) is the global average pooling operation and ⊕ is the channel splicing operation.
Preferably, the text learning branch is formed by a text encoder built from the res4 residual block of ResNet; its input is the potential feature map P_la from the visual encoder and its output is the text feature map P_te, and the res4 parameters of the visual encoder are shared with the text encoder. The text feature map P_te ∈ R^(C×H×W) is obtained from the text encoder through attribute learning, where C, H and W denote channel, height and width respectively, and the text learning branch applies global average pooling over H and W to learn a global discriminative feature:
g = (1 / (H·W)) · Σ_{i=1}^{H} Σ_{j=1}^{W} p_{i,j}
where p_{i,j} ∈ R^C is extracted from the feature P_te at spatial position (i, j);
the text learning branch also uses a linear layer with parameters W_te to map the text feature into the semantic embedding space and predict the attribute vector v̂ ∈ R^A, which represents the predicted latent semantic information of the A attributes in the anterior chamber angle image I:
v̂ = W_te(g)
where W_te is a linear transformation computed on the input tensor with a 1 × 1 convolution and v̂ is the predicted attribute vector.
Preferably, the visual learning branch comprises a visual decoder, a feature pyramid network, a segmentation sub-module and an embedding sub-module, wherein:
the visual decoder is an up-sampling sub-network using skip connections and a symmetric structure; its input is the feature data from the visual encoder and its output is fed into the feature pyramid network; the output P_FPN of the feature pyramid network is input into the embedding sub-module, and P_FPN is also added to the visual context information P_F1 and fed into the segmentation sub-module; the segmentation sub-module outputs the feature information P_SEG and the embedding sub-module outputs the feature information P_EMB;
the segmentation sub-module consists of a first segmentation block and a second segmentation block; the first segmentation block consists of two 3 × 3 convolutional layers and a ReLU activation function, its input is the sum of the output P_FPN of the feature pyramid network and the visual context information P_F1, and its output is the feature information P_SEG; the second segmentation block consists of three 3 × 3 convolutional layers and a ReLU activation function, its input is P_SEG and its output is the dense prediction Ŷ;
the embedding sub-module comprises a first embedding block and a second embedding block; the first embedding block is formed by five 3 × 3 convolutional layers and maps the multi-scale feature map P_FPN to the embedded feature map P_EMB, which represents discriminative features; the second embedding block consists of a 1 × 1 convolutional layer that maps every pixel in the image to a point in feature space, each point being represented by an embedding vector e_i that expresses the compressed implicit information of the pixel in the image.
Preferably, the loss function in step S3 is specifically:
L_total = α·L_CLS + β·L_SEG + γ·L_EMB + δ·L_TE
where L_total is the total loss function, L_CLS is the loss function of the main branch, L_SEG is the loss function of the segmentation sub-module, L_EMB is the loss function of the embedding sub-module, L_TE is the loss function of the text learning branch, and α, β, γ and δ are the weights of the four loss terms;
the segmentation sub-module loss L_SEG combines a supervised term L_SEG^l, which computes the loss of training samples that carry pixel-level labels, and an unsupervised term L_SEG^u, which computes the loss of samples without pixel-level labels using pseudo-labels; p_i and y_i are respectively the prediction probability and the ground-truth label of pixel i, N_j denotes the number of pixels in structure j, ŷ_i denotes the pseudo-label of pixel i, and m_i indicates whether the pseudo-label of pixel i is used: m_i is set to 1 when the prediction probability score is above the threshold τ;
L_EMB = λ·L_var + ρ·L_dist + ω·L_reg
L_var = (1/C) · Σ_{c=1}^{C} (1/N_c) · Σ_{i=1}^{N_c} [ ||μ_c − e_i|| − δ_v ]_+^2
L_dist = (1/(C·(C−1))) · Σ_{c_A=1}^{C} Σ_{c_B≠c_A} [ 2·δ_d − ||μ_{c_A} − μ_{c_B}|| ]_+^2
L_reg = (1/C) · Σ_{c=1}^{C} ||μ_c||
where e_i is the embedding vector of pixel i, μ_c is the mean embedding vector of class c, i.e. the class center, N_c is the number of pixels of class c, C is the number of classes, δ_v and δ_d are the margins of the variance and distance losses, i.e. respectively the maximum distance accepted within a cluster and the minimum distance by which clusters are pushed apart, || · || is the L2 norm, and [x]_+ = max(0, x);
the text learning branch loss L_TE guides the predicted attribute vector toward the manually defined text label; the similarity score vector is computed as the inner product s_I = v̂ · Vᵀ, where v̂ denotes the predicted A-dimensional attribute vector and V denotes the text label defined manually according to domain knowledge; s_I gives the similarity score of each anterior chamber angle level, and the level with the highest similarity score in s_I is the prediction level of the anterior chamber angle image I.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
1. The requirement on the professional qualification of doctors for anterior chamber angle grading is reduced, which has practical significance for alleviating the shortage of medical experts. An automatic anterior chamber angle image grading system based on deep learning can reduce the medical cost for patients and enables more glaucoma patients to enjoy high-quality medical services at a lower cost.
2. As an auxiliary diagnosis technology for glaucoma, the method provides an important reference for doctors and effectively reduces the instability of diagnostic conclusions. The proposed grading method not only realizes automatic grading of anterior chamber angle images, but also serves as an auxiliary diagnosis and treatment tool for glaucoma, reducing the image-reading workload of ophthalmologists and improving the stability of diagnostic results.
3. A new research direction is provided for other medical images that face challenges similar to the anterior chamber angle image. Such challenges include unclear boundaries of key local structures in the target image, which make recognition difficult for the model, and poor feature separability between adjacent structures. For example, the diagnosis of Diabetic Retinopathy (DR) relies on microaneurysms, hemorrhages, and soft and hard exudates, which are difficult to distinguish because of their similar appearance. The method proposed by the invention is therefore equally applicable to such problems.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention.
Figure 2 is an exemplary graph of multi-modal data associated with an image of the anterior chamber angle provided by an embodiment.
Fig. 3 is an overall framework diagram of the deep neural network model based on visual text fusion according to an embodiment.
Fig. 4 is a network architecture diagram of the visual learning branch.
Fig. 5 is a network architecture diagram of a text learning branch.
Fig. 6 is a network framework diagram of a first fusion block.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;
it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1
The embodiment provides an anterior chamber angle image grading method based on visual text fusion, as shown in fig. 1, including the following steps:
s1: constructing an anterior chamber angle image dataset;
s2: pre-processing images in the anterior chamber angle image dataset;
s3: constructing and initializing a deep neural network model based on visual text fusion;
s4: initializing a loss function and an optimizer;
s5: training the deep neural network model based on visual text fusion of step S3 with the preprocessed anterior chamber angle image dataset of step S2, and calculating the loss function;
s6: updating the network parameters of the deep neural network model based on visual text fusion with the optimizer so that they approach or reach their optimal values, minimizing the loss function and finding the optimal network parameters, thereby obtaining the optimal deep neural network model based on visual text fusion;
s7: grading the anterior chamber angle image with the optimal deep neural network model based on visual text fusion.
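For concreteness, the following is a minimal training-loop sketch covering steps S3 to S7, assuming a PyTorch implementation; the names VTFN, ACADataset and vtfn_loss, and all hyperparameter values, are illustrative assumptions and not part of the original disclosure.

```python
import torch
from torch.utils.data import DataLoader

# VTFN, ACADataset and vtfn_loss are hypothetical names standing in for the
# visual-text fusion network, the preprocessed ACA dataset and the weighted
# total loss; they are not defined in the patent text.
model = VTFN(num_classes=5)                                    # S3: build and initialize the model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)      # S4: initialize the optimizer
loader = DataLoader(ACADataset(split="train"), batch_size=8, shuffle=True)

for epoch in range(100):
    for images, grade, text_label, pixel_label in loader:
        logits, seg, emb, attr = model(images)                 # forward pass through all branches
        loss = vtfn_loss(logits, seg, emb, attr,
                         grade, pixel_label, text_label)       # S5: compute the total loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                                       # S6: update the network parameters

model.eval()                                                   # S7: grade anterior chamber angle images
with torch.no_grad():
    predicted_grade = model(images)[0].argmax(dim=1)
```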
Example 2
This example continues to disclose the following on the basis of example 1:
the anterior chamber angle image dataset (ACA999) in step S1 includes 999 anterior chamber angle images, each labeled with anterior chamber angle grading information and a manually defined text label; 100 of the anterior chamber angle images are additionally labeled with pixel-level labels, wherein:
the anterior chamber angle grading information is divided into five grades, namely N1, N2, N3, N4 and W, according to the anterior chamber angle evaluation system described by Shaffer, where each grade corresponds to a different clinical description; for example, an anterior chamber angle whose degree of openness is N2 corresponds to the clinical description "CBB disappearance";
the text label summarizes the clinical features of each level of anterior chamber angle according to the clinical description of that level, so each category of anterior chamber angle has its corresponding word-level attributes. For example, "CBB disappearance" can be summarized as N2 having an SL structure, a TM structure and an SS structure, having no CBB structure, and possibly having a closed angle. A mapping strategy is defined to map the clinical features of each level of anterior chamber angle into computer-recognizable codes; each level of anterior chamber angle corresponds to one attribute vector, the five levels form an attribute matrix of five A-dimensional vectors, called the text label, and each vector in the attribute matrix represents the text description of one level of anterior chamber angle;
the pixel-level label marks each pixel in the anterior chamber angle image as belonging to one of the Schwalbe Line (SL), Trabecular Meshwork (TM), Scleral Spur (SS), ciliary body zone (CBB) or background structures.
The anterior chamber angle image, the data of the visual modality and the data of the text modality are shown in figure 2. The multi-modal data describe the characteristic information of the four structures from different angles: (1) the anterior chamber angle image provides texture and color information of the iridocorneal angle region; (2) the data of the visual modality provide local information about the four key structures; (3) the data of the text modality provide clinically relevant domain knowledge: the integrity of the four important structures determines the level of the anterior chamber angle. The data of the visual modality and the data of the text modality therefore provide different but complementary information to the model; using multi-modal data not only reduces the uncertainty of the information but also provides more clinical information. It is therefore highly desirable for a computer to learn the complementary information and common features contained in multi-modal data for automatic assessment of the anterior chamber angle.
Although the visual learning branch can learn visual features, it may ignore the influence of the order of the four structures, their degree of visibility and the severity of anterior chamber angle closure on the anterior chamber angle grade. The invention therefore proposes a text learning branch based on attribute learning to learn text features from the text descriptions, breaking these limitations.
According to the anterior chamber angle evaluation system described by Shaffer, the clinical description of each grade of anterior chamber angle is different. As can be seen in figure 2, the basis on which the physician determines the anterior chamber angle grade is that different levels of anterior chamber angle have different clinical characteristics. A mapping strategy φ(·): C → V is defined to map all levels of the anterior chamber angle image into a semantic matrix based on the clinical features, which consist of attribute-specific words corresponding to each level; the invention therefore writes V = φ(C), where V denotes the N_c A-dimensional attribute vectors corresponding to the anterior chamber angle levels in Table 1, as determined by domain knowledge.
TABLE 1 encoding of attribute vectors
Based on the basis, the invention manually summarizes the attributes described as word levels by summarizing the text, selects A attributes to describe the anterior chamber angle images of various levels, and each level of the anterior chamber angle images is composed of a word level attribute of an A dimension
Figure BDA0003853131520000092
Representing, by sequential encoding, encoded into a computer-recognizable attribute vector: v. of 0 ,…,v A-1 (ii) a Since N2 is at risk of atrial angle closure, patients with an atrial angle rating of N2 are recommended to follow-up by the ophthalmologist, and one level of text labels are manually defined a-dimensional attribute vectors based on domain knowledge that are used to guide text coders to learn the underlying semantic information of the atrial angle image. The attribute vectors for all anterior chamber angle levels in Table 1 constitute a set of manually defined text labels, v 0 To v A-2 Indicating the degree to which each structure of the anterior chamber angle is visible in the anterior chamber angle image, 0 indicating invisible, 1 indicating partially visible, 2 indicating fully visible; v. of A-1 Is used as its semantic attribute to indicate the likelihood of anterior chamber angle closure.
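As an illustration of the sequential coding described above, the sketch below builds a hypothetical text-label matrix; the attribute count A = 5 and every concrete value are assumptions chosen for illustration only, since Table 1 is reproduced as an image in the original and is not available here.

```python
import numpy as np

# Word-level attributes in a fixed order: visibility of SL, TM, SS, CBB
# (0 = invisible, 1 = partially visible, 2 = fully visible) plus one semantic
# attribute for the likelihood of angle closure. All values are illustrative
# assumptions, not the patent's Table 1.
LEVELS = ["N1", "N2", "N3", "N4", "W"]
text_labels = np.array([
    #  SL  TM  SS  CBB  closure
    [  2,  2,  2,  1,   0],   # N1: CBB only partially visible (assumed)
    [  2,  2,  2,  0,   1],   # N2: "CBB disappearance", possibly closed angle
    [  2,  2,  1,  0,   1],   # N3 (assumed)
    [  2,  1,  0,  0,   2],   # N4 (assumed)
    [  2,  2,  2,  2,   0],   # W : all four structures fully visible (assumed)
], dtype=np.float32)

assert text_labels.shape == (5, 5)   # five levels, A = 5 attributes per level
```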
In step S2, preprocessing the image in the anterior chamber angle image dataset, specifically:
data enhancement operations, including random horizontal mirroring and random salt-and-pepper noise, are applied to each anterior chamber angle image in the dataset, and the images are finally normalized.
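A minimal preprocessing sketch consistent with the description above, assuming a NumPy/PyTorch pipeline; the noise amount, flip probabilities and normalization statistics are assumed values.

```python
import numpy as np
import torch

def add_salt_pepper(img: np.ndarray, amount: float = 0.01) -> np.ndarray:
    """Randomly set a fraction of pixels to black (pepper) or white (salt)."""
    noisy = img.copy()
    mask = np.random.rand(*img.shape[:2])
    noisy[mask < amount / 2] = 0          # pepper
    noisy[mask > 1 - amount / 2] = 255    # salt
    return noisy

def preprocess(img: np.ndarray) -> torch.Tensor:
    """Random horizontal mirroring, random salt-and-pepper noise, then normalization.
    img is an H x W x 3 uint8 anterior chamber angle image."""
    if np.random.rand() < 0.5:
        img = img[:, ::-1].copy()                       # random horizontal mirror
    if np.random.rand() < 0.5:
        img = add_salt_pepper(img, amount=0.01)         # random salt-and-pepper noise
    x = torch.from_numpy(img).permute(2, 0, 1).float() / 255.0
    mean = torch.tensor([0.5, 0.5, 0.5]).view(3, 1, 1)  # assumed statistics
    std = torch.tensor([0.5, 0.5, 0.5]).view(3, 1, 1)
    return (x - mean) / std                             # normalization
```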
Example 3
This example discloses the following on the basis of example 1 and example 2:
the deep neural network model based on visual text fusion in step S3 is specifically, as shown in fig. 3:
the deep neural network model based on visual text fusion comprises a visual learning branch (Vision learning branch), a text learning branch (Text learning branch) and a main branch (Main branch); an image from the anterior chamber angle image dataset is input into the main branch, one of the resulting potential visual feature maps is fed into the text learning branch, the text learning branch produces a text feature map from the potential visual feature map and returns it to the main branch, the visual learning branch receives feature information from the main branch and returns visual features to it, and the main branch then performs visual text fusion and outputs the anterior chamber angle image grade.
To address the problem that the important visual features in an anterior chamber angle image are distributed over small areas and have fuzzy boundaries, a visual learning branch is constructed to extract the visual features of the image. To address the difficulty of learning and expressing the abstract semantics of text features, a text learning branch is constructed using domain knowledge and attribute learning. To address the difficulty of effectively fusing the data features of the visual and text modalities, the following measures overcome the limitation that the intrinsic relationship between the two modalities is hard to learn: (1) the coarse potential visual features are used as the input of the text learning branch, reducing the gap between the visual and text features; (2) an attention mechanism is used as the main component of the fusion block, embedding the multi-modal data to synthesize common semantic features and thus improving the correlation among the multi-modal data.
In a traditional medical image classification task, a model is usually trained with images and image-level labels, and at inference time the model classifies test-set samples. However, because of the limitations of conventional deep neural networks, such a model often cannot capture the detailed features of the region of interest, its semantic information, or the relationships between structures. To overcome these limitations of computer-aided anterior chamber angle assessment, this embodiment provides a deep neural network model called the Visual Text Fusion Network (VTFN).
The framework designs a visual learning branch based on weak supervised metric learning and a text learning branch based on attribute learning. Through these two branches, the model is able to learn both visual and textual features from multimodal data. Next, the multi-modal features are fused into a common feature. And finally, classifying the anterior chamber angle images of the common features. Thus, the visual learning branch was developed to learn and distinguish the four structures Schwalbe lines, trabecular meshwork, scleral spur and ciliary body zone, the text learning branch was developed to map text descriptions to attribute vectors to emphasize the representation of specific words in important image sub-regions, and the fusion block was developed to fuse multimodal features, improving the intrinsic connection between multimodal data.
The main branch comprises a Visual encoder (Visual encoder), a first Fusion block (Fusion block 1), a second Fusion block (Fusion block 2), and a Classifier (CLS), wherein:
the visual encoder is a ResNet50; its input is an image from the anterior chamber angle image dataset and its outputs are potential visual feature maps at two different scales. One potential visual feature map P_la is input into the text learning branch, and the other potential visual feature map P_vi is input into the first fusion block. The first fusion block also receives the text feature map P_te output by the text learning branch and fuses P_vi with P_te to obtain the visual context information P_F1. P_F1 is sent to the visual learning branch and to the second fusion block; the second fusion block also receives the feature information P_SEG and P_EMB output by the visual learning branch and fuses P_F1 with P_SEG and P_EMB to obtain the aggregated potential feature P_F2. Finally, P_F2 is input into the classifier, which is built as a multilayer perceptron and maps P_F2 to the class distribution to obtain the anterior chamber angle image grade.
Because a simple splicing operation loses the internal correlation among multi-modal data, the invention provides two fusion blocks, through which complementary information in the multi-modal data can be obtained. As shown in fig. 6, the first fusion block is built around an attention mechanism, which takes full advantage of the information of the two modalities by aggregating visual and textual features, reduces the gap between the two modalities and learns their intrinsic relationship. The fusion process of the first fusion block is as follows:
the potential visual feature map P_vi and the text feature map P_te each model static context information through a 3 × 3 convolution;
for the text feature map P_te, the obtained context information and P_te are combined by a channel splicing operation, followed by two consecutive 1 × 1 convolution operations and then reshaping and averaging operations to obtain the text relation matrix;
the context information obtained from the potential visual feature map P_vi is reshaped to obtain the visual relation matrix;
the text relation matrix is normalized with a Softmax function to obtain an attention weight map, which is multiplied element by element with the visual relation matrix so that the text information guides the visual feature learning and new visual context information is obtained;
the dependency between the features of the visual and text modalities is modeled by element-wise summation; at this point the attention mechanism has projected the multi-modal features into a common feature subspace, completing the fusion of the potential visual feature map P_vi and the text feature map P_te.
the mechanism can provide additional supplementary clues according to the information of other modalities, and mutual guidance of multi-modal features is realized.
The specific fusion process of the second fusion block is as follows:
P_F2 = GAP(P_F1) ⊕ GAP(P_SEG) ⊕ GAP(P_EMB)
where GAP(·) is the global average pooling operation and ⊕ is the channel splicing operation.
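A minimal sketch of the second fusion block corresponding to the formula above, assuming PyTorch tensors of shape (B, C, H, W).

```python
import torch
import torch.nn.functional as F

def fusion_block_2(p_f1: torch.Tensor, p_seg: torch.Tensor, p_emb: torch.Tensor) -> torch.Tensor:
    """Global average pooling of each input followed by channel splicing."""
    def gap(t: torch.Tensor) -> torch.Tensor:
        return F.adaptive_avg_pool2d(t, 1).flatten(1)   # (B, C, H, W) -> (B, C)
    return torch.cat([gap(p_f1), gap(p_seg), gap(p_emb)], dim=1)
```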
The text learning branch is formed by a text encoder (Textual encoder) which, as shown in fig. 5, is built from the res4 residual block of ResNet. Its input is the potential feature map P_la from the visual encoder and its output is the text feature map P_te; the res4 parameters of the visual encoder are shared with the text encoder. The text feature map P_te ∈ R^(C×H×W) is obtained from the text encoder through attribute learning, where C, H and W denote channel, height and width respectively. The text learning branch applies global average pooling over H and W to learn a global discriminative feature:
g = (1 / (H·W)) · Σ_{i=1}^{H} Σ_{j=1}^{W} p_{i,j}
where p_{i,j} ∈ R^C is extracted from the feature P_te at spatial position (i, j).
The text learning branch also uses a linear layer with parameters W_te to map the text feature into the semantic embedding space and predict the attribute vector v̂ ∈ R^A, which represents the predicted latent semantic information of the A attributes in the anterior chamber angle image I:
v̂ = W_te(g)
where W_te is a linear transformation computed on the input tensor with a 1 × 1 convolution and v̂ is the predicted attribute vector.
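A sketch of this pooling-and-projection head, assuming PyTorch; the channel count and the number of attributes are placeholder values.

```python
import torch
import torch.nn as nn

class AttributeHead(nn.Module):
    """Global average pooling over H and W, then a 1x1 convolution (the linear
    map W_te) from the C-channel text feature to the A-dimensional attribute vector."""
    def __init__(self, channels: int = 2048, num_attributes: int = 5):
        super().__init__()
        self.w_te = nn.Conv2d(channels, num_attributes, kernel_size=1)

    def forward(self, p_te: torch.Tensor) -> torch.Tensor:
        g = p_te.mean(dim=(2, 3), keepdim=True)   # global average pooling over H, W
        return self.w_te(g).flatten(1)            # predicted attribute vector of length A
```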
Because data from a single modality cause information loss, a text learning branch based on attribute learning is provided; it uses the text-modality domain knowledge provided by ophthalmologists, embeds the learned attribute vectors into the model as text information, and learns text features. Using the intermediate latent feature map as input reduces the difference between the visual and text features and strengthens the intrinsic relationship between them.
The visual learning branch is shown in fig. 4 and comprises a visual decoder (Visual decoder), a Feature Pyramid Network (FPN), a segmentation sub-module (SEG) and an embedding sub-module (EMB), wherein:
the visual decoder is an up-sampling sub-network with a structure similar to U-Net, using skip connections and a symmetric architecture; its input is the feature data from the visual encoder and its output is fed into the feature pyramid network; the output P_FPN of the feature pyramid network is input into the embedding sub-module, and P_FPN is also added to the visual context information P_F1 and fed into the segmentation sub-module; the segmentation sub-module outputs the feature information P_SEG and the embedding sub-module outputs the feature information P_EMB;
the segmentation sub-module consists of a first segmentation block (SEG1) and a second segmentation block (SEG2); the first segmentation block consists of two 3 × 3 convolutional layers and a ReLU activation function, its input is the sum of the output P_FPN of the feature pyramid network and the visual context information P_F1, and its output is the feature information P_SEG; the second segmentation block consists of three 3 × 3 convolutional layers and a ReLU activation function, its input is P_SEG and its output is the dense prediction Ŷ. The segmentation sub-module performs an auxiliary segmentation task that induces the model to pay attention to locally non-salient regions by segmenting the four key structures from the background;
the embedding sub-module comprises a first embedding block (EMB1) and a second embedding block (EMB2); the first embedding block is formed by five 3 × 3 convolutional layers and maps the multi-scale feature map P_FPN to the embedded feature map P_EMB, which represents discriminative features; the second embedding block consists of a 1 × 1 convolutional layer that maps every pixel in the image to a point in feature space, each point being represented by an embedding vector e_i that expresses the compressed implicit information of the pixel in the image. The embedding sub-module, based on metric learning, learns discriminative structural features: metric learning directly learns the mapping of each pixel in the anterior chamber angle image to a point in feature space. On the one hand, the points of all pixels belonging to one category should be close to each other, with a small distance to the cluster center; on the other hand, the points of pixels belonging to the same class form a cluster, and different clusters are far apart.
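A sketch of the two heads of the visual learning branch described above, assuming PyTorch; channel widths, the embedding dimension and the exact placement of the activations are assumptions, and P_FPN and P_F1 are assumed to share the same spatial size.

```python
import torch
import torch.nn as nn

class SegmentationHead(nn.Module):
    """SEG1 (two 3x3 convs + ReLU) produces P_SEG from P_FPN + P_F1; SEG2
    (three 3x3 convs + ReLU) produces the dense prediction over the five
    classes SL, TM, SS, CBB and background."""
    def __init__(self, in_ch: int = 256, mid_ch: int = 128, num_classes: int = 5):
        super().__init__()
        self.seg1 = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 3, padding=1),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.seg2 = nn.Sequential(
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1),
            nn.Conv2d(mid_ch, num_classes, 3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, p_fpn: torch.Tensor, p_f1: torch.Tensor):
        p_seg = self.seg1(p_fpn + p_f1)   # FPN output added to the visual context P_F1
        return p_seg, self.seg2(p_seg)    # feature information P_SEG and dense prediction

class EmbeddingHead(nn.Module):
    """EMB1 (five 3x3 convs) maps P_FPN to the embedded feature map P_EMB;
    EMB2 (one 1x1 conv) maps every pixel to an embedding vector."""
    def __init__(self, in_ch: int = 256, emb_dim: int = 16):
        super().__init__()
        self.emb1 = nn.Sequential(*[nn.Conv2d(in_ch, in_ch, 3, padding=1) for _ in range(5)])
        self.emb2 = nn.Conv2d(in_ch, emb_dim, 1)

    def forward(self, p_fpn: torch.Tensor):
        p_emb = self.emb1(p_fpn)
        return p_emb, self.emb2(p_emb)    # P_EMB and the per-pixel embedding vectors
```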
The loss function in step S3 is specifically:
L_total = α·L_CLS + β·L_SEG + γ·L_EMB + δ·L_TE
where L_total is the total loss function, L_CLS is the loss function of the main branch, L_SEG is the loss function of the segmentation sub-module, L_EMB is the loss function of the embedding sub-module, L_TE is the loss function of the text learning branch, and α, β, γ and δ are the weights of the four loss terms;
the segmentation submodule uses Dice Loss, which treats the segmentation as a pixel-by-pixel classification, and therefore, the task is to classify each pixel in the image into one of five classes: SL, TM, SS, CBB and background, the weakly supervised loss function is calculated according to the following formula:
Figure BDA0003853131520000133
Figure BDA0003853131520000134
Figure BDA0003853131520000141
in the formula (I), the compound is shown in the specification,
Figure BDA0003853131520000142
for calculating the loss incurred by training samples with pixel-level labels,
Figure BDA0003853131520000143
for calculating the loss caused by label data at the unmarked pixel level,
Figure BDA0003853131520000144
and
Figure BDA0003853131520000145
respectively the prediction probability and the true label of the pixel i, N j Indicates the number of pixels in structure j,
Figure BDA0003853131520000146
a pseudo-label representing the pixel i is shown,
Figure BDA0003853131520000147
indicating that a pseudo label, m, for pixel i is used when the predicted probability score is above a threshold τ i 1, placing;
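A sketch of the weakly supervised mechanism described above, assuming a PyTorch implementation; the original's exact Dice formulation is not reproduced here, so a standard soft Dice term and a confidence-masked cross-entropy term are used as stand-ins.

```python
import torch
import torch.nn.functional as F

def dice_loss(probs: torch.Tensor, target_onehot: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Standard soft Dice loss over classes; a stand-in for the patent's formulation."""
    inter = (probs * target_onehot).sum(dim=(0, 2, 3))
    union = probs.sum(dim=(0, 2, 3)) + target_onehot.sum(dim=(0, 2, 3))
    return (1 - (2 * inter + eps) / (union + eps)).mean()

def weakly_supervised_seg_loss(logits: torch.Tensor,
                               pixel_labels: torch.Tensor,
                               has_label: torch.Tensor,
                               tau: float = 0.9) -> torch.Tensor:
    """logits: (B, 5, H, W); pixel_labels: (B, H, W) class indices for labeled
    samples; has_label: (B,) bool mask of samples carrying pixel-level labels."""
    probs = logits.softmax(dim=1)
    loss = logits.new_zeros(())
    if has_label.any():                                      # supervised term (pixel-level labels)
        onehot = F.one_hot(pixel_labels[has_label], 5).permute(0, 3, 1, 2).float()
        loss = loss + dice_loss(probs[has_label], onehot)
    if (~has_label).any():                                   # unsupervised term (pseudo-labels)
        p_u = probs[~has_label]
        conf, pseudo = p_u.max(dim=1)                        # pseudo-label per pixel
        m = (conf > tau).float()                             # m_i = 1 when the score exceeds tau
        ce = F.nll_loss(torch.log(p_u + 1e-8), pseudo, reduction="none")
        loss = loss + (m * ce).sum() / m.sum().clamp(min=1.0)
    return loss
```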
The embedding sub-module uses a discriminative loss function for metric learning, guiding the model to learn discriminative features of each structure in the feature space; it forces the model to map each pixel of the image to an embedding vector in the feature space, represented as a point. In this way, the points of pixels with the same label (same structure) are close to each other in the feature space, each class (structure) forms a corresponding cluster, and different clusters are far apart; this metric learning approach enables the model to recognize different structures within an image, as well as the same structure across different images, and thus to learn discriminative features. The discriminative loss is a weighted sum of three terms:
L_EMB = λ·L_var + ρ·L_dist + ω·L_reg
L_var = (1/C) · Σ_{c=1}^{C} (1/N_c) · Σ_{i=1}^{N_c} [ ||μ_c − e_i|| − δ_v ]_+^2
L_dist = (1/(C·(C−1))) · Σ_{c_A=1}^{C} Σ_{c_B≠c_A} [ 2·δ_d − ||μ_{c_A} − μ_{c_B}|| ]_+^2
L_reg = (1/C) · Σ_{c=1}^{C} ||μ_c||
where e_i is the embedding vector of pixel i, μ_c is the mean embedding vector of class c, i.e. the class center, N_c is the number of pixels of class c, C is the number of classes, δ_v and δ_d are the margins of the variance and distance losses, i.e. respectively the maximum distance accepted within a cluster and the minimum distance by which clusters are pushed apart, || · || is the L2 norm, and [x]_+ = max(0, x). The second formula is the variance term, which applies a pulling force toward the class center on each embedding vector; the third formula is the distance term, which pushes the class centers of different clusters away from each other; the fourth formula is the regularization term.
The text learning branch obtains text features carrying latent semantic information according to the attribute learning method and the domain knowledge, and the potential features synthesized with the visual features are finally used for the anterior chamber angle assessment. The latent feature P_la is mapped to the predicted attribute vector v̂ in the semantic space, and the similarity score s_I against the five manually defined real attribute vectors is computed as the inner product:
s_I = v̂ · Vᵀ
where v̂ denotes the predicted A-dimensional attribute vector and V denotes the text labels manually defined according to domain knowledge. Note that s_I contains the similarity score of each ACA level, and the level with the highest similarity score in s_I is the prediction level of the anterior chamber angle image I; the larger the similarity score between the predicted attribute vector and the real attribute vector of category c, the more likely the image is predicted to be of category c.
As shown in fig. 5, given the latent feature P_la and the manually defined attribute matrix V, attribute learning optimizes the text learning branch loss L_TE so that the predicted attribute vector matches the text label of the corresponding level.
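A sketch of the inner-product scoring and level prediction described above, assuming NumPy; the 5 × A attribute matrix V below reuses the illustrative values from the earlier sketch and is not the patent's Table 1.

```python
import numpy as np

LEVELS = ("N1", "N2", "N3", "N4", "W")

def predict_level(v_hat: np.ndarray, text_labels: np.ndarray) -> str:
    """s_I = v_hat · V^T; the level with the highest similarity score is the prediction."""
    s = text_labels @ v_hat            # similarity score for each anterior chamber angle level
    return LEVELS[int(np.argmax(s))]

# illustrative 5 x A attribute matrix (A = 5 here); values are assumptions
V = np.array([[2, 2, 2, 1, 0],
              [2, 2, 2, 0, 1],
              [2, 2, 1, 0, 1],
              [2, 1, 0, 0, 2],
              [2, 2, 2, 2, 0]], dtype=np.float32)
v_hat = np.array([1.9, 1.8, 2.1, 0.2, 0.9], dtype=np.float32)   # predicted attribute vector
print(predict_level(v_hat, V))                                   # -> "N2"
```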
In the specific implementation, this embodiment performs a comparative experiment against existing reference models (VGG, GoogLeNet, ResNet, CCT, UPS and FixMatch); the experimental results are shown in Table 2, and Table 3 gives the performance of each predicted attribute on the test samples.
TABLE 2 Classification Performance of VTFN and other reference models
Table 3 performance of various predicted attributes of test samples
The same or similar reference numerals correspond to the same or similar parts;
the terms describing positional relationships in the drawings are for illustrative purposes only and are not to be construed as limiting the patent;
it should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to enumerate all embodiments exhaustively here. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the claims of the present invention.

Claims (10)

1. An anterior chamber angle image grading method based on visual text fusion is characterized by comprising the following steps:
s1: constructing an anterior chamber angle image dataset;
s2: pre-processing images in the anterior chamber angle image dataset;
s3: constructing and initializing a deep neural network model based on visual text fusion;
s4: initializing a loss function and an optimizer;
s5: training the deep neural network model based on visual text fusion of step S3 with the preprocessed anterior chamber angle image dataset of step S2, and calculating the loss function;
s6: updating the network parameters of the deep neural network model based on visual text fusion with the optimizer so that they approach or reach their optimal values, minimizing the loss function and finding the optimal network parameters, thereby obtaining the optimal deep neural network model based on visual text fusion;
s7: grading the anterior chamber angle image with the optimal deep neural network model based on visual text fusion.
2. The anterior chamber angle image grading method based on visual text fusion according to claim 1, wherein the anterior chamber angle image data set in step S1 comprises a plurality of anterior chamber angle images, each of which is labeled with anterior chamber angle grading information and a manually defined text label, wherein a part of the anterior chamber angle images are also labeled with a pixel-level label, wherein:
the anterior chamber angle grading information is divided into five grades, namely N1, N2, N3, N4 and W, according to the anterior chamber angle evaluation system described by Shaffer, where each grade corresponds to a different clinical description;
the text label summarizes the clinical features of each level of anterior chamber angle according to the clinical description of that level; a mapping strategy is defined to map the clinical features of each level of anterior chamber angle into computer-recognizable codes, each level of anterior chamber angle corresponds to one attribute vector, the five levels form an attribute matrix of five A-dimensional vectors, called the text label, and each vector in the attribute matrix represents the text description of one level of anterior chamber angle;
the pixel-level label labels each pixel in the anterior chamber angle image as belonging to one of a Schwalbe line, trabecular meshwork, scleral spur, ciliary body zone, or background structure.
3. The anterior chamber angle image classification method based on visual text fusion as claimed in claim 2, wherein the mapping strategy uses sequential coding, specifically:
A attributes are selected to describe the anterior chamber angle images of the various levels, and each level of anterior chamber angle image is represented by an A-dimensional word-level attribute vector that is sequentially encoded into a computer-recognizable attribute vector v = (v_0, …, v_{A-1}); v_0 to v_{A-2} indicate the degree to which each structure of the anterior chamber angle is visible in the anterior chamber angle image, with 0 indicating invisible, 1 partially visible and 2 fully visible; v_{A-1} is the semantic attribute indicating the likelihood of anterior chamber angle closure.
4. The anterior chamber angle image classification method based on visual text fusion according to claim 1, characterized in that the images in the anterior chamber angle image dataset are preprocessed in step S2, specifically:
data enhancement operations, including random horizontal mirroring and random salt-and-pepper noise, are applied to each anterior chamber angle image in the dataset, and the images are finally normalized.
5. The anterior chamber angle image classification method based on visual text fusion as claimed in claim 1, wherein the deep neural network model based on visual text fusion in step S3 is specifically:
the deep neural network model based on visual text fusion comprises a visual learning branch, a text learning branch and a main branch; an image from the anterior chamber angle image dataset is input into the main branch, one of the resulting potential visual feature maps is fed into the text learning branch, the text learning branch produces a text feature map from the potential visual feature map and returns it to the main branch, the visual learning branch receives feature information from the main branch and returns visual features to it, and the main branch then performs visual text fusion and outputs the anterior chamber angle image grade.
6. The anterior chamber angle image grading method based on visual text fusion according to claim 5, wherein the main branch comprises a visual encoder, a first fusion block, a second fusion block, and a classifier, wherein:
the visual encoder is a ResNet50 whose input is an image from the anterior chamber angle image dataset and whose output is two potential visual feature maps at different scales; one potential visual feature map $P_{la}$ is input into the text learning branch, and the other potential visual feature map $P_{vi}$ is input into the first fusion block; the first fusion block also receives the text feature map $P_{te}$ output by the text learning branch, and fuses $P_{vi}$ with $P_{te}$ to obtain the visual context information $P_{F1}$; $P_{F1}$ is sent both to the visual learning branch and to the second fusion block; the second fusion block also receives the feature information $P_{SEG}$ and $P_{EMB}$ output by the visual learning branch, and fuses $P_{F1}$ with $P_{SEG}$ and $P_{EMB}$ to obtain the aggregated latent feature $P_{F2}$; finally, $P_{F2}$ is input into the classifier, which is built as a multilayer perceptron and maps $P_{F2}$ to the class distribution to obtain the anterior chamber angle image grade.
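The following PyTorch sketch illustrates one possible wiring of the main branch in claim 6. The point at which the ResNet50 backbone is split into the two latent maps, the channel widths and the MLP head size are assumptions of this sketch, and the fusion blocks and side branches are passed in as abstract modules.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class MainBranch(nn.Module):
    def __init__(self, fusion1: nn.Module, fusion2: nn.Module,
                 text_branch: nn.Module, visual_branch: nn.Module,
                 num_grades: int = 5):
        super().__init__()
        backbone = resnet50(weights=None)
        # Shallow layers produce the latent map P_la sent to the text branch;
        # the deeper layers produce the second latent map P_vi. Where exactly
        # the backbone is split is an assumption of this sketch.
        self.shallow = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu,
                                     backbone.maxpool, backbone.layer1, backbone.layer2)
        self.deep = nn.Sequential(backbone.layer3, backbone.layer4)
        self.fusion1, self.fusion2 = fusion1, fusion2
        self.text_branch, self.visual_branch = text_branch, visual_branch
        self.classifier = nn.Sequential(nn.Linear(2048 * 3, 512), nn.ReLU(),
                                        nn.Linear(512, num_grades))  # MLP head (width assumed)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        p_la = self.shallow(x)                    # latent visual feature map P_la
        p_vi = self.deep(p_la)                    # latent visual feature map P_vi
        p_te = self.text_branch(p_la)             # text feature map P_te returned to the main branch
        p_f1 = self.fusion1(p_vi, p_te)           # visual context information P_F1
        p_seg, p_emb = self.visual_branch(p_f1)   # features P_SEG, P_EMB from the visual branch
        p_f2 = self.fusion2(p_f1, p_seg, p_emb)   # aggregated latent feature P_F2
        return self.classifier(p_f2)              # anterior chamber angle grade distribution
```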
7. The anterior chamber angle image grading method based on visual text fusion according to claim 6, characterized in that the first fusion block adopts an attention mechanism, and the fusion process of the first fusion block is as follows:
the potential visual feature map $P_{vi}$ and the text feature map $P_{te}$ each model static context information through a 3×3 convolution;
the context information obtained from the text feature map $P_{te}$ is channel-concatenated with $P_{te}$, passed through two consecutive 1×1 convolutions, and then reshaped and averaged to obtain a text relation matrix;
the context information obtained from the potential visual feature map $P_{vi}$ is reshaped to obtain a visual relation matrix;
the text relation matrix is normalized with a Softmax function to obtain an attention weight map, which is multiplied element-wise with the visual relation matrix, so that the text information guides visual feature learning and yields new visual context information;
the dependency between features of the visual and text modalities is modeled by element-wise summation, completing the fusion of the potential visual feature map $P_{vi}$ and the text feature map $P_{te}$;
the specific fusion process of the second fusion block is as follows:
$$P_{F2} = \mathrm{GAP}(P_{F1}) \mathbin{+\!\!+} \mathrm{GAP}(P_{SEG}) \mathbin{+\!\!+} \mathrm{GAP}(P_{EMB})$$
where $\mathrm{GAP}(\cdot)$ is the global average pooling operation and ${+\!\!+}$ is the channel splicing (concatenation) operation.
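A short sketch of the second fusion block formula above, assuming the three inputs are standard (N, C, H, W) feature maps.

```python
# P_F2 = GAP(P_F1) ++ GAP(P_SEG) ++ GAP(P_EMB), where ++ is channel concatenation.
import torch
import torch.nn.functional as F

def second_fusion(p_f1: torch.Tensor, p_seg: torch.Tensor,
                  p_emb: torch.Tensor) -> torch.Tensor:
    # Global average pooling collapses each (N, C, H, W) map to (N, C).
    gap = lambda t: F.adaptive_avg_pool2d(t, 1).flatten(1)
    return torch.cat([gap(p_f1), gap(p_seg), gap(p_emb)], dim=1)  # (N, C1 + C2 + C3)
```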
8. The anterior chamber angle image grading method based on visual text fusion according to claim 7, characterized in that the text learning branch consists of a text encoder composed of the res4 residual block of ResNet; its input is the potential visual feature map $P_{la}$ from the visual encoder, its output is the text feature map $P_{te}$, and the res4 parameters of the visual encoder are shared with the text encoder parameters; the text feature map $P_{te} \in \mathbb{R}^{C \times H \times W}$ is obtained from the text encoder through attribute learning, where C, H and W denote channel, height and width respectively; the text learning branch applies global average pooling over H and W to learn a global discriminative feature:
$$\bar{P}_{te} = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} P_{te}^{(i,j)}$$
where $P_{te}^{(i,j)}$ is extracted from the feature $P_{te}$ at spatial location (i, j);
the text learning branch also utilizes a linear layer with parameter $W_{te}$ to map the text feature into the semantic embedding space, thereby predicting an attribute vector $\hat{V}$ that represents the predicted latent semantic information of the A attributes in the anterior chamber angle image I:
$$\hat{V} = W_{te}\big(\bar{P}_{te}\big)$$
where $W_{te}(\cdot)$ is a linear transformation implemented as a 1×1 convolution over the input tensor, and $\hat{V} \in \mathbb{R}^{A}$ is the predicted attribute vector.
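A sketch of the text learning branch in claim 8, with a res4-style encoder passed in; the encoder's output channel count and the number of attributes A are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class TextBranch(nn.Module):
    def __init__(self, encoder: nn.Module, in_channels: int = 1024, num_attrs: int = 5):
        super().__init__()
        self.encoder = encoder   # res4-style block; parameters assumed shared with the visual encoder
        self.w_te = nn.Conv2d(in_channels, num_attrs, kernel_size=1)  # W_te as a 1x1 convolution

    def forward(self, p_la: torch.Tensor):
        p_te = self.encoder(p_la)                     # text feature map (N, C, H, W)
        pooled = p_te.mean(dim=(2, 3), keepdim=True)  # global average pooling over H and W
        v_hat = self.w_te(pooled).flatten(1)          # predicted attribute vector (N, A)
        return p_te, v_hat
```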
9. The anterior chamber angle image grading method based on visual text fusion according to claim 8, wherein the visual learning branch comprises a visual decoder, a feature pyramid network, a segmentation submodule and an embedding submodule, wherein:
the visual decoder is an up-sampling sub-network with skip connections and a symmetric structure; its input is the feature information from the visual encoder and its output is fed into the feature pyramid network; the output $P_{FPN}$ of the feature pyramid network is input into the embedding submodule, and $P_{FPN}$ is also added to the visual context information $P_{F1}$ and fed into the segmentation submodule; the segmentation submodule outputs the feature information $P_{SEG}$ and the embedding submodule outputs the feature information $P_{EMB}$;
the segmentation submodule consists of a first segmentation block and a second segmentation block; the first segmentation block consists of two 3×3 convolutional layers and a ReLU activation function, its input is the sum of the feature pyramid network output $P_{FPN}$ and the visual context information $P_{F1}$, and its output is the feature information $P_{SEG}$; the second segmentation block consists of three 3×3 convolutional layers and a ReLU activation function, its input is the feature information $P_{SEG}$, and its output is the dense prediction $\hat{Y}$;
the embedding submodule comprises a first embedding block and a second embedding block; the first embedding block consists of five 3×3 convolutional layers and maps the multi-scale feature map $P_{FPN}$ to the embedding feature map $P_{EMB}$ representing discriminative features; the second embedding block consists of a 1×1 convolutional layer and maps every pixel in the image to a point in the feature space, each point being represented by an embedding vector $e_i$ that expresses the compressed implicit information of that pixel in the image.
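A sketch of the segmentation and embedding submodules in claim 9; the channel widths, the number of anatomical classes (four structures plus background) and the embedding dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

def conv3x3(cin: int, cout: int) -> nn.Conv2d:
    return nn.Conv2d(cin, cout, kernel_size=3, padding=1)

class SegmentationSubmodule(nn.Module):
    def __init__(self, channels: int = 256, num_classes: int = 5):
        super().__init__()
        self.block1 = nn.Sequential(conv3x3(channels, channels),
                                    conv3x3(channels, channels), nn.ReLU())        # -> P_SEG
        self.block2 = nn.Sequential(conv3x3(channels, channels),
                                    conv3x3(channels, channels),
                                    conv3x3(channels, num_classes), nn.ReLU())     # -> dense prediction

    def forward(self, fpn_plus_f1: torch.Tensor):
        p_seg = self.block1(fpn_plus_f1)        # input: P_FPN + P_F1
        return p_seg, self.block2(p_seg)

class EmbeddingSubmodule(nn.Module):
    def __init__(self, channels: int = 256, embed_dim: int = 16):
        super().__init__()
        self.block1 = nn.Sequential(*[conv3x3(channels, channels) for _ in range(5)])  # -> P_EMB
        self.block2 = nn.Conv2d(channels, embed_dim, kernel_size=1)  # per-pixel embedding vectors

    def forward(self, p_fpn: torch.Tensor):
        p_emb = self.block1(p_fpn)
        return p_emb, self.block2(p_emb)
```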
10. The anterior chamber angle image grading method based on visual text fusion according to claim 9, characterized in that the loss function in step S3 is specifically:
$$L_{total} = \alpha \cdot L_{CLS} + \beta \cdot L_{SEG} + \gamma \cdot L_{EMB} + \delta \cdot L_{TE}$$
where $L_{total}$ is the total loss function, $L_{CLS}$ is the loss function of the main branch, $L_{SEG}$ is the loss function of the segmentation submodule, $L_{EMB}$ is the loss function of the embedding submodule, $L_{TE}$ is the loss function of the text learning branch, and α, β, γ and δ are the weights of the four loss terms;
the segmentation loss combines a supervised term and a pseudo-label term:
$$L_{SEG} = L_{SEG}^{l} + L_{SEG}^{u}$$
$$L_{SEG}^{l} = -\sum_{j}\frac{1}{N_j}\sum_{i \in j} y_i \log(p_i)$$
$$L_{SEG}^{u} = -\frac{1}{\sum_{i} m_i}\sum_{i} m_i\, \hat{y}_i \log(p_i)$$
where $L_{SEG}^{l}$ computes the loss incurred by training samples with pixel-level labels, $L_{SEG}^{u}$ computes the loss incurred by data without pixel-level labels, $p_i$ and $y_i$ are respectively the prediction probability and the true label of pixel i, $N_j$ denotes the number of pixels in structure j, $\hat{y}_i$ denotes the pseudo-label of pixel i, and $m_i$ indicates whether the pseudo-label of pixel i is used: when the prediction probability score is above a threshold τ, $m_i$ is set to 1;
$$L_{EMB} = \lambda \cdot L_{var} + \rho \cdot L_{dist} + \omega \cdot L_{reg}$$
$$L_{var} = \frac{1}{K}\sum_{k=1}^{K}\frac{1}{N_k}\sum_{i=1}^{N_k}\big[\,\|\mu_k - e_i\| - \delta_v\,\big]_+^{2}$$
$$L_{dist} = \frac{1}{K(K-1)}\sum_{k_A=1}^{K}\sum_{k_B \neq k_A}\big[\,2\delta_d - \|\mu_{k_A} - \mu_{k_B}\|\,\big]_+^{2}$$
$$L_{reg} = \frac{1}{K}\sum_{k=1}^{K}\|\mu_k\|$$
where K is the number of structure clusters, $N_k$ is the number of pixels in cluster k, $e_i$ is the embedding vector of pixel i, $\mu_k$ is the cluster mean vector, i.e. the class center, $\delta_v$ and $\delta_d$ are the margins of the variance and distance losses, i.e. respectively the maximum distance acceptable within a cluster and the minimum distance required between clusters, $\|\cdot\|$ is the L2 norm, and $[x]_+ = \max(0, x)$;
the loss function of the text learning branch is computed from the similarity between the predicted attribute vector and the text label:
$$s_I = \hat{V} \cdot V^{\top}$$
$$L_{TE} = -\log \frac{\exp\!\big(s_I^{(y_I)}\big)}{\sum_{k=1}^{5} \exp\!\big(s_I^{(k)}\big)}$$
where $\hat{V}$ denotes the predicted A-dimensional attribute vector, V denotes the text label defined manually according to domain knowledge, $s_I$ denotes the similarity score of each anterior chamber angle grade, and the grade with the highest similarity score is taken as the predicted grade of the anterior chamber angle image I.
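As one worked example of the losses above, the following sketch implements the embedding loss $L_{EMB}$ with its variance, distance and regularization terms; the margins $\delta_v$, $\delta_d$ and the weights λ, ρ, ω are placeholder values, not the patent's settings.

```python
import torch

def embedding_loss(embeddings: torch.Tensor, labels: torch.Tensor,
                   delta_v: float = 0.5, delta_d: float = 1.5,
                   lam: float = 1.0, rho: float = 1.0, omega: float = 0.001) -> torch.Tensor:
    # embeddings: (P, D) per-pixel embedding vectors; labels: (P,) structure ids.
    classes = labels.unique()
    centers = torch.stack([embeddings[labels == c].mean(0) for c in classes])  # cluster centers mu_k
    l_var = torch.stack([
        (torch.clamp((embeddings[labels == c] - centers[k]).norm(dim=1) - delta_v,
                     min=0) ** 2).mean()
        for k, c in enumerate(classes)]).mean()                   # pull pixels toward their center
    if len(classes) > 1:
        d = torch.cdist(centers, centers)                         # pairwise center distances
        off = ~torch.eye(len(classes), dtype=torch.bool)
        l_dist = (torch.clamp(2 * delta_d - d[off], min=0) ** 2).mean()  # push centers apart
    else:
        l_dist = embeddings.new_tensor(0.0)
    l_reg = centers.norm(dim=1).mean()                            # keep centers near the origin
    return lam * l_var + rho * l_dist + omega * l_reg
```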
CN202211138484.7A 2022-09-19 2022-09-19 Anterior chamber angle image grading method based on visual text fusion Pending CN115423790A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211138484.7A CN115423790A (en) 2022-09-19 2022-09-19 Anterior chamber angle image grading method based on visual text fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211138484.7A CN115423790A (en) 2022-09-19 2022-09-19 Anterior chamber angle image grading method based on visual text fusion

Publications (1)

Publication Number Publication Date
CN115423790A true CN115423790A (en) 2022-12-02

Family

ID=84204639

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211138484.7A Pending CN115423790A (en) 2022-09-19 2022-09-19 Anterior chamber angle image grading method based on visual text fusion

Country Status (1)

Country Link
CN (1) CN115423790A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118135052A (en) * 2024-05-08 2024-06-04 南京邮电大学 Method and system for reconstructing visual stimulus image based on human brain fMRI

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination