CN112085718B - NAFLD ultrasonic video diagnosis system based on twin attention network - Google Patents
- Publication number
- CN112085718B (application CN202010924390.7A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06T7/0012 — Biomedical image inspection
- G06F18/2451 — Classification techniques relating to the decision surface: linear, e.g. hyperplane
- G06N3/044 — Recurrent networks, e.g. Hopfield networks
- G06N3/045 — Combinations of networks
- G06N3/048 — Activation functions
- G06N3/049 — Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06N3/08 — Learning methods
- G06T2207/10016 — Video; Image sequence
- G06T2207/10132 — Ultrasound image
- G06T2207/20081 — Training; Learning
- G06T2207/20084 — Artificial neural networks [ANN]
- G06T2207/30056 — Liver; Hepatic
Abstract
The invention discloses a NAFLD ultrasonic video diagnosis system based on a twin attention network. The system consists of two structurally identical, weight-sharing twin attention subnetworks and a loss function. Each subnetwork comprises a dual-stream feature extraction module, a linear classification module and a context attention module, and the loss function combines binary cross-entropy loss (BCE), contrast similarity loss (CSL) and contrast difference loss (CDL). By adding the dual-stream feature extraction module to the twin attention network and introducing this loss function, the system achieves 90.56% accuracy, 88.26% specificity and 93.58% sensitivity, providing an efficient and feasible method for NAFLD ultrasound video diagnosis.
Description
Technical Field
The invention relates to the technical field of short video processing, in particular to a NAFLD ultrasonic video diagnosis system based on a twin attention network.
Background
Early screening for non-alcoholic fatty liver disease (NAFLD) helps patients avoid irreversible advanced liver disease, but manual diagnosis of NAFLD from ultrasound video requires physicians to review lengthy recordings, which is both cumbersome and time-consuming in clinical practice. Deep learning can therefore be used to automate NAFLD diagnosis from ultrasound video and improve diagnostic efficiency.
The main challenges in diagnosing NAFLD from ultrasound video are interference from irrelevant information and the poor feature representation caused by the low quality of ultrasound imaging itself.
Disclosure of Invention
In order to solve the problems, the invention provides a NAFLD ultrasonic video diagnosis system based on a twin attention network, so as to realize efficient automatic diagnosis of NAFLD.
The invention adopts the following technical scheme:
a NAFLD ultrasonic video diagnosis system based on a twin attention network is composed of two twin attention subnetworks which are identical in structure and share weight, and a loss function, wherein the twin attention subnetworks are composed of a double-current feature extraction module, a linear classification module and a context attention module, and the loss function is composed of binary cross entropy loss, contrast similarity loss and contrast difference loss.
Further, the dual-stream feature extraction module comprises a sharing module, a classification module and an attention module, and extracts separate features for classification and for attention.
Further, the sharing module extracts the low-level features shared by the classification module and the attention module; the classification module extracts high-level features used to generate the classification; the attention module extracts high-level features used to generate the attention.
Further, for a given video V = {I_t | t = 1, 2, …, T}, the dual-stream feature extraction module provides two feature representations for each frame: the representations of frame I_t are f_cls(I_t; θ_cls, θ) ∈ R^D and f_att(I_t; θ_att, θ) ∈ R^D, where θ denotes the shared parameters, θ_cls and θ_att denote the independent parameters of the classification module and the attention module respectively, I_t is the t-th frame of the video, and T is the number of frames.
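For illustration, the two-branch layout above can be sketched with stand-in linear layers (a minimal NumPy sketch; the dimensions, random weights and ReLU stand-ins are assumptions, not the patented network):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: T frames, flattened frames of size F,
# shared feature width H, branch feature width D.
T, F, H, D = 20, 64, 32, 16

W_shared = rng.standard_normal((F, H)) * 0.1  # theta: shared low-level extractor
W_cls = rng.standard_normal((H, D)) * 0.1     # theta_cls: classification branch
W_att = rng.standard_normal((H, D)) * 0.1     # theta_att: attention branch

def relu(x):
    return np.maximum(x, 0.0)

def dual_stream_features(frames):
    """Return (f_cls, f_att): two D-dimensional representations per frame."""
    shared = relu(frames @ W_shared)  # low-level features shared by both tasks
    return relu(shared @ W_cls), relu(shared @ W_att)

video = rng.standard_normal((T, F))  # stand-in for T ultrasound frames
f_cls, f_att = dual_stream_features(video)
```

The point of the two branches is that f_cls and f_att come from the same shared trunk but are free to specialize for their respective tasks.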
Further, the linear classification module uses a linear classifier to predict the probability that each frame belongs to NAFLD, providing a fine-grained reference for the final diagnosis.
Further, based on the feature f_cls extracted by the dual-stream feature extraction module, the linear classification module learns a linear mapping W ∈ R^{1×D} that converts the feature f_cls into a one-dimensional scalar W f_cls; a sigmoid function then normalizes this scalar to the interval (0, 1) to yield the final probability value, as follows:
p_t = σ(W f_cls(I_t) + b), where b is a constant term and σ denotes the sigmoid function.
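As a sketch, the per-frame probability reduces to a dot product and a sigmoid (the features, W and b here are random illustrative stand-ins, not the trained parameters):

```python
import numpy as np

rng = np.random.default_rng(1)
T, D = 20, 16
f_cls = rng.standard_normal((T, D))  # one classification feature vector per frame
W = rng.standard_normal(D) * 0.1     # the learned linear mapping W (here random)
b = 0.0                              # constant term

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

p = sigmoid(f_cls @ W + b)           # p_t = sigma(W f_cls + b) for every frame t
```

Each entry of p is a frame-level NAFLD probability, later reweighted by the attention distribution.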
Further, the contextual attention module scores the importance of each frame in conjunction with the context for highlighting the discriminative information on key frames and suppressing extraneous information for useless frames.
Further, from the feature vector f_att of each frame, the contextual attention module uses a Bi-LSTM to extract hidden-layer features that contain timing information:
h_t = [h_t^f ; h_t^b], where h_t^f = LSTM(f_att(I_t), h_{t-1}^f; θ^f) is the forward LSTM (t running from 1 to T) and h_t^b = LSTM(f_att(I_t), h_{t+1}^b; θ^b) is the backward LSTM (t running from T to 1). A fully connected layer then learns a linear mapping W_a ∈ R^{1×D/2} from feature to importance, e_t = W_a h_t, and the importance scores of all frames are normalized by the softmax function, as follows:
a_t = exp(e_t) / Σ_{k=1}^{T} exp(e_k)
further, at the end of the system, the classification probability of each frame is weighted and summed according to the attention distribution, and the obtained final probability value is used for representing the diagnosis result of the whole video, wherein the diagnosis result is represented as:
further, the mathematical expression of the loss function L is as follows:
L=LBCE+λ(LSSL+LCDL)
wherein λ is a scaling factor that controls the relative importance of binary cross-entropy loss (BCE), Contrast Similarity Loss (CSL), and Contrast Difference Loss (CDL);
the binary cross entropy loss is based on the prediction probability of each videoWith the true value y, the final loss function can be calculated as follows:
wherein N represents the video frequency in the training set;
the contrast similarity loss is used to represent the similarity of key frame portions between positive and negative sample pairs, and the feature of the key frame portion used for attention generation of each video can be represented as follows:
in addition, cosine similarity is used to measure the similarity between two feature vectors, which can be expressed as:
thus, the contrast similarity loss is calculated as follows:
where P represents the positive and negative sample pair logarithm in a batch.
The contrast difference loss represents the difference of the key-frame portions between positive and negative sample pairs. The classification-branch feature of the key-frame portion of each video can be represented as
F_cls = Σ_{t=1}^{T} a_t f_cls(I_t)
Thus, the contrast difference loss is calculated as follows:
L_CDL = (1/P) Σ_{i=1}^{P} (1 + sim(F_cls^{p,i}, F_cls^{n,i}))
after adopting the technical scheme, compared with the background technology, the invention has the following advantages:
the context attention network effectively solves the problem of irrelevant information interference by introducing an attention mechanism; the negative influence of low quality of ultrasound is relieved to a certain extent by combining time sequence information; the characteristics used for the classification module and the attention module are respectively extracted by adopting different branches in the double-flow characteristic extraction module, so that the expressiveness of the extracted characteristics is effectively improved, and the performance of the system is further improved, and the expressiveness of the system is further improved by combining the double-flow characteristic extraction module with a loss function (namely binary cross entropy loss, contrast similarity loss and contrast difference loss), so that the accuracy of 90.56%, the specificity of 88.26% and the sensitivity of 93.58% are finally obtained.
Drawings
Fig. 1 is a schematic diagram of the twin attention subnetwork structure of the NAFLD ultrasonic diagnostic system of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Example one
The invention discloses a NAFLD ultrasonic video diagnosis system based on a twin attention network. As shown in figure 1, the network consists of two structurally identical, weight-sharing twin attention subnetworks and a loss function; each twin attention subnetwork consists of a dual-stream feature extraction module a, a linear classification module b and a context attention module c, and the loss function consists of binary cross-entropy loss, contrast similarity loss and contrast difference loss.
The dual-stream feature extraction module a comprises a sharing module, a classification module and an attention module, and extracts separate features for classification and for attention.
The sharing module extracts the low-level features shared by the classification module and the attention module, establishing the correlation between the two tasks at the bottom layers while greatly reducing computational cost. The classification module and the attention module then extract the high-level features used to generate the classification and the attention respectively; at this point the features of the different branches are well suited to the requirements of their tasks, which further improves the effect of the subsequent modules.
For a given video V = {I_t | t = 1, 2, …, T}, the dual-stream feature extraction module a provides two feature representations for each frame: the representations of frame I_t are f_cls(I_t; θ_cls, θ) ∈ R^D and f_att(I_t; θ_att, θ) ∈ R^D, where θ denotes the shared parameters, θ_cls and θ_att denote the independent parameters of the classification module and the attention module respectively, I_t is the t-th frame of the video, and T is the number of frames.
The linear classification module b uses a linear classifier to predict the probability that each frame belongs to NAFLD, providing a fine-grained reference for the final diagnosis.
Based on the feature f_cls extracted by the dual-stream feature extraction module a, the linear classification module b learns a linear mapping W ∈ R^{1×D} that converts the feature f_cls into a one-dimensional scalar W f_cls; a sigmoid function normalizes this scalar to the interval (0, 1) to yield the final probability value, as follows:
p_t = σ(W f_cls(I_t) + b), where b is a constant term and σ denotes the sigmoid function.
The context attention module c scores the importance of each frame in conjunction with the context for highlighting the discriminative information on the key frames and suppressing extraneous information for the useless frames.
From the feature vector f_att of each frame, the contextual attention module c uses a Bi-LSTM to extract hidden-layer features that contain timing information:
h_t = [h_t^f ; h_t^b], where h_t^f = LSTM(f_att(I_t), h_{t-1}^f; θ^f) is the forward LSTM (t running from 1 to T) and h_t^b = LSTM(f_att(I_t), h_{t+1}^b; θ^b) is the backward LSTM (t running from T to 1). A fully connected layer then learns a linear mapping W_a ∈ R^{1×D/2} from feature to importance, e_t = W_a h_t, and the importance scores of all frames are normalized by the softmax function, as follows:
a_t = exp(e_t) / Σ_{k=1}^{T} exp(e_k)
at the end of the system, the classification probability of each frame is weighted and summed according to the attention distribution, and the obtained final probability value is used for representing the diagnosis result of the whole video, wherein the diagnosis result is represented as:
after the NAFLD ultrasonic video diagnosis system is constructed, a training process is started to the model, and loss functions used in the training process are divided into the following three parts: binary Cross Entropy Loss (BCEL), Contrast Similarity Loss (CSL), Contrast Dissimilarity Loss (CDL). The binary cross entropy loss acts on the final diagnosis result, the difference between the final diagnosis result and the actual value is measured, and each module is optimized; the contrast similarity loss measures the feature similarity of the key frame part between the positive sample pair and the negative sample pair, and the contrast difference loss measures the feature difference of the key frame part between the positive sample pair and the negative sample pair, so that the selection capability of the model on the key frame is promoted, and the expressiveness of the features is enhanced.
The mathematical expression of the loss function L is as follows:
L = L_BCE + λ(L_CSL + L_CDL)
where λ is a scale factor controlling the relative importance of the binary cross-entropy loss (BCE), the contrast similarity loss (CSL) and the contrast difference loss (CDL);
the binary cross entropy loss is based on the prediction probability of each videoWith the true value y, the final loss function can be calculated as follows:
wherein N represents the video frequency in the training set;
the contrast similarity loss is used to represent the similarity of key frame portions between positive and negative sample pairs, and the feature of the key frame portion used for attention generation of each video can be represented as follows:
in addition, cosine similarity is used to measure the similarity between two feature vectors, which can be expressed as:
thus, the contrast similarity loss is calculated as follows:
where P represents the positive and negative sample pair logarithm in a batch.
The contrast difference loss represents the difference of the key-frame portions between positive and negative sample pairs. The classification-branch feature of the key-frame portion of each video can be represented as
F_cls = Σ_{t=1}^{T} a_t f_cls(I_t)
Thus, the contrast difference loss is calculated as follows:
L_CDL = (1/P) Σ_{i=1}^{P} (1 + sim(F_cls^{p,i}, F_cls^{n,i}))
the data used for training consisted of 520 subjects' liver ultrasound videos, with 260 videos from NAFLD patients and an additional 260 videos belonging to normal samples. Since the input of the training phase is a pair of positive and negative samples, we need to ensure that the positive and negative samples have the same length, so we sample 20 frames of images at equal intervals for all videos. The original resolution of the video is 800 × 600, and the sampling frequency is 31 Hz. After video acquisition, 3 doctors with abundant experience carry out manual annotation at the same time, and the voting results of the 3 doctors are synthesized to finally judge whether the subject suffers from NAFLD.
The evaluation indexes adopted in the embodiment include accuracy, specificity, sensitivity and AUC values. The following results were obtained:
(1) Validity of the dual-stream feature extraction module
ResNet50 is used as the base network, and the twin attention network is improved on this basis: the original feature extraction module of the twin attention network is replaced by the dual-stream feature extraction module, and model performance before and after the replacement is compared to verify the module's superiority. The results are as follows:
TABLE 1 Dual-stream feature extraction module validation results
Method | Accuracy | Specificity | Sensitivity | AUC value
---|---|---|---|---
CAN (ResNet50) | 0.8736 | 0.8322 | 0.9358 | 0.9415
CAN (dual-stream feature extraction module) | 0.8868 | 0.8622 | 0.9207 | 0.9459
As the table shows, compared with the original ResNet50 base network, the dual-stream feature extraction module improves most indicators of the twin attention network. Specifically, the twin attention network using the dual-stream feature extraction module improves accuracy, specificity and AUC value by 1.32%, 3.00% and 0.44% respectively over the original twin attention network. This confirms that the classification and attention modules need different features, and that the two-branch structure of the dual-stream feature extraction module effectively improves the task-specific features and enhances model performance.
(2) Effectiveness of contrast difference loss and contrast similarity loss
The contrast difference loss and the contrast similarity loss measure, respectively, the difference and the similarity of the key-frame portions between positive and negative sample pairs. Given a positive/negative sample pair on top of the dual-stream feature extraction network, the contrast difference loss pushes the key-frame portions of the features required by the classification branch as far apart as possible, so the model can discriminate better; the contrast similarity loss pulls the key-frame portions of the features required by the attention branch as close together as possible, so the model can select key frames better.
In this embodiment, the contrast difference loss and the contrast similarity loss are introduced with different weights into the twin attention network equipped with the dual-stream feature extraction module, and compared against the network without these loss terms. The results are as follows:
table 2 CDL and CSL validity verification results
Method (lambda) | Rate of accuracy | Specificity of | Sensitivity of the composition | AUC value |
0(CAN+BFEM) | 0.8868 | 0.8622 | 0.9207 | 0.9459 |
0.2 | 0.8942 | 0.8700 | 0.9269 | 0.9473 |
0.4 | 0.9056 | 0.8826 | 0.9358 | 0.9521 |
0.6 | 0.8903 | 0.8690 | 0.9192 | 0.9402 |
As shown in table 2, at lower weights all indices improve after adding CDL and CSL, with the best results around λ = 0.4: accuracy, specificity, sensitivity and AUC value improve by 1.88%, 2.04%, 1.51% and 0.62% respectively compared with the group without CDL and CSL.
(3) The effectiveness of the NAFLD ultrasonic video diagnosis system of the invention
Compared with the common twin attention network (CAN), the NAFLD ultrasonic video diagnosis system (SAN) provided by the invention adds the dual-stream feature extraction module (BFEM) and introduces the newly designed loss function (binary cross-entropy loss, contrast difference loss and contrast similarity loss). Table 3 shows the comparison with the common twin attention network CAN.
TABLE 3 SAN superiority verification results
Method | Accuracy | Specificity | Sensitivity | AUC
---|---|---|---|---
CAN | 0.8736 | 0.8322 | 0.9358 | 0.9415
SAN | 0.9056 | 0.8826 | 0.9358 | 0.9521
As shown in table 3, SAN improves accuracy, specificity and AUC by 3.20%, 5.04% and 1.06% respectively compared with CAN, with the same sensitivity.
These results show that the BFEM in the NAFLD ultrasound video diagnosis system SAN provided by the invention effectively extracts the different features required for classification and attention, and that the newly designed CSL and CDL further constrain the feature distribution and enhance feature expressiveness. Combining the two finally yields an accuracy of 90.56%, a specificity of 88.26% and a sensitivity of 93.58%, proving the feasibility and effectiveness of SAN.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.
Claims (6)
1. A NAFLD ultrasonic video diagnosis system based on a twin attention network, characterized in that: the system comprises two structurally identical, weight-sharing twin attention subnetworks and a loss function; each twin attention subnetwork comprises a dual-stream feature extraction module, a linear classification module and a context attention module, and the loss function comprises binary cross-entropy loss BCE, contrast similarity loss CSL and contrast difference loss CDL;
the dual-stream feature extraction module comprises a sharing module, a classification module and an attention module, and extracts separate features for classification and for attention; the sharing module extracts the low-level features shared by the classification module and the attention module; the classification module extracts high-level features used to generate the classification; the attention module extracts high-level features used to generate the attention;
the linear classification module predicts the probability that each frame belongs to the NAFLD by using a linear classifier, and provides fine-grained reference for final diagnosis;
the contextual attention module scores the importance of each frame in conjunction with the context for highlighting discriminative information on key frames and suppressing irrelevant information of useless frames.
2. The NAFLD ultrasound video diagnostic system based on a twin attention network of claim 1, wherein:
for a given video V ═ It1, 2.. T }, the dual-stream feature extraction module providing two feature representations for each frame of the video, each frame ItAre respectively fcls(It;θcls,θ)∈RDAnd fatt(It;θatt,θ)∈RDWhere θ denotes a sharing parameter, θcls,θattIndependent parameters, I, representing the classification module and the attention module, respectivelytIs the T-th frame of the video, and T represents the frame number of the video.
3. The NAFLD ultrasound video diagnostic system based on a twin attention network of claim 1, wherein: based on the feature f_cls extracted by the dual-stream feature extraction module, the linear classification module learns a linear mapping W ∈ R^{1×D} that converts the feature f_cls into a one-dimensional scalar W f_cls; a sigmoid function normalizes this scalar to the interval (0, 1) to yield the final probability value, as follows:
p_t = σ(W f_cls(I_t) + b), where b is a constant term and σ denotes the sigmoid function.
4. The NAFLD ultrasound video diagnostic system based on a twin attention network of claim 1, wherein: from the feature vector f_att of each frame, the contextual attention module uses a Bi-LSTM to extract hidden-layer features containing timing information, h_t = [h_t^f ; h_t^b], where h_t^f = LSTM(f_att(I_t), h_{t-1}^f; θ^f) is the forward LSTM (t running from 1 to T) and h_t^b = LSTM(f_att(I_t), h_{t+1}^b; θ^b) is the backward LSTM (t running from T to 1); a fully connected layer then learns a linear mapping W_a ∈ R^{1×D/2} from feature to importance, e_t = W_a h_t, and the importance scores of all frames are normalized by the softmax function: a_t = exp(e_t) / Σ_{k=1}^{T} exp(e_k).
5. The NAFLD ultrasound video diagnostic system based on a twin attention network of claim 1, wherein: at the end of the system, the classification probabilities of the frames are weighted and summed according to the attention distribution, and the resulting final probability value p̂ = Σ_{t=1}^{T} a_t p_t represents the diagnosis of the whole video.
6. The NAFLD ultrasound video diagnostic system based on a twin attention network of claim 1, wherein: the mathematical expression of the loss function L is as follows:
L = L_BCE + λ(L_CSL + L_CDL);
where λ is a scale factor controlling the relative importance of the binary cross-entropy loss BCE, the contrast similarity loss CSL and the contrast difference loss CDL;
the binary cross-entropy loss is computed from the prediction probability p̂_i of each video and its true label y_i, as follows:
L_BCE = -(1/N) Σ_{i=1}^{N} [ y_i log p̂_i + (1 - y_i) log(1 - p̂_i) ]
where N is the number of videos in the training set;
the contrast similarity loss represents the similarity of the key-frame portions between positive and negative sample pairs; the attention-branch feature of the key-frame portion of each video can be represented as F_att = Σ_{t=1}^{T} a_t f_att(I_t);
in addition, cosine similarity is used to measure the similarity between two feature vectors: sim(u, v) = (u · v) / (‖u‖ ‖v‖);
the contrast similarity loss is calculated as L_CSL = (1/P) Σ_{i=1}^{P} (1 - sim(F_att^{p,i}, F_att^{n,i}));
where P is the number of positive/negative sample pairs in a batch;
the contrast difference loss represents the difference of the key-frame portions between positive and negative sample pairs; the classification-branch feature of the key-frame portion of each video can be represented as F_cls = Σ_{t=1}^{T} a_t f_cls(I_t);
the contrast difference loss is calculated as L_CDL = (1/P) Σ_{i=1}^{P} (1 + sim(F_cls^{p,i}, F_cls^{n,i})).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010924390.7A CN112085718B (en) | 2020-09-04 | 2020-09-04 | NAFLD ultrasonic video diagnosis system based on twin attention network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112085718A CN112085718A (en) | 2020-12-15 |
CN112085718B true CN112085718B (en) | 2022-05-10 |
Family
ID=73732599
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010924390.7A Active CN112085718B (en) | 2020-09-04 | 2020-09-04 | NAFLD ultrasonic video diagnosis system based on twin attention network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112085718B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110335290A (en) * | 2019-06-04 | 2019-10-15 | 大连理工大学 | Twin candidate region based on attention mechanism generates network target tracking method |
CN111291679A (en) * | 2020-02-06 | 2020-06-16 | 厦门大学 | Target specific response attention target tracking method based on twin network |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111354017B (en) * | 2020-03-04 | 2023-05-05 | 江南大学 | Target tracking method based on twin neural network and parallel attention module |
CN111539316B (en) * | 2020-04-22 | 2023-05-05 | 中南大学 | High-resolution remote sensing image change detection method based on dual-attention twin network |
Also Published As
Publication number | Publication date |
---|---|
CN112085718A (en) | 2020-12-15 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |