CN115177262A - Heart sound and electrocardiogram combined diagnosis device and system based on deep learning

Heart sound and electrocardiogram combined diagnosis device and system based on deep learning

Info

Publication number
CN115177262A
CN115177262A (application number CN202210664490.XA)
Authority
CN
China
Prior art keywords
level
fusion
modal
feature
encoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210664490.XA
Other languages
Chinese (zh)
Inventor
***
张浩波
张鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Wuhan Zhongke Medical Technology Industrial Technology Research Institute Co Ltd
Original Assignee
Huazhong University of Science and Technology
Wuhan Zhongke Medical Technology Industrial Technology Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology, Wuhan Zhongke Medical Technology Industrial Technology Research Institute Co Ltd filed Critical Huazhong University of Science and Technology
Priority to CN202210664490.XA priority Critical patent/CN115177262A/en
Publication of CN115177262A publication Critical patent/CN115177262A/en
Pending legal-status Critical Current

Classifications

    • A - HUMAN NECESSITIES
    • A61 - MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B - DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00 - Measuring for diagnostic purposes; Identification of persons
    • A61B5/24 - Detecting, measuring or recording bioelectric or biomagnetic signals of the body or parts thereof
    • A61B5/316 - Modalities, i.e. specific diagnostic methods
    • A61B5/318 - Heart-related electrical modalities, e.g. electrocardiography [ECG]
    • A61B5/33 - Heart-related electrical modalities, e.g. electrocardiography [ECG] specially adapted for cooperation with other devices
    • A - HUMAN NECESSITIES
    • A61 - MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B - DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00 - Measuring for diagnostic purposes; Identification of persons
    • A61B5/24 - Detecting, measuring or recording bioelectric or biomagnetic signals of the body or parts thereof
    • A61B5/316 - Modalities, i.e. specific diagnostic methods
    • A61B5/318 - Heart-related electrical modalities, e.g. electrocardiography [ECG]
    • A61B5/346 - Analysis of electrocardiograms
    • A61B5/349 - Detecting specific parameters of the electrocardiograph cycle
    • A - HUMAN NECESSITIES
    • A61 - MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B - DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B7/00 - Instruments for auscultation
    • A61B7/02 - Stethoscopes
    • A61B7/04 - Electric stethoscopes

Landscapes

  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Cardiology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Heart & Thoracic Surgery (AREA)
  • Medical Informatics (AREA)
  • Engineering & Computer Science (AREA)
  • Surgery (AREA)
  • Animal Behavior & Ethology (AREA)
  • Biomedical Technology (AREA)
  • Public Health (AREA)
  • Veterinary Medicine (AREA)
  • Acoustics & Sound (AREA)
  • Biophysics (AREA)
  • Pathology (AREA)
  • Measurement And Recording Of Electrical Phenomena And Electrical Characteristics Of The Living Body (AREA)

Abstract

The invention belongs to the field of medical signal processing, and particularly relates to a heart sound and electrocardiogram combined diagnosis device and system based on deep learning. The device comprises a dense fusion strategy embedded with a cross-modal region perception module and a multi-scale feature optimization module, so that multi-modal complementary information is fused more effectively and diagnostic accuracy is improved; the cross-modal region perception module fully evaluates the contributions of different modalities and different regions by independently generating a pixel-level weight map for each modality. A collaborative learning strategy, combined with the modality-specific / dense-fusion / modality-specific three-branch encoder architecture, allows the device to be applied both to single-modal scenes containing only heart sound or only ECG signals and to multi-modal scenes in which heart sound and ECG signals are present simultaneously, enhancing the generality of the network.

Description

Heart sound and electrocardiogram combined diagnosis device and system based on deep learning
Technical Field
The invention belongs to the field of medical signal processing, and particularly relates to a heart sound and electrocardio combined diagnosis device and system based on deep learning.
Background
Cardiovascular disease has become the leading threat to human health: deaths caused by cardiovascular disease account for roughly one third of all deaths worldwide, and the prevalence continues to rise year by year. Cardiac auscultation (the phonocardiogram, PCG) and the electrocardiogram (ECG) are two important primary screening means for heart disease, and they arise from different mechanisms. The heart sound signal provides diagnostic information from the perspective of the heart's mechanical motion and is commonly used to detect valve diseases, atrial and ventricular septal defects, and the like; the ECG signal records the potential changes produced on the body surface during each cardiac cycle, provides diagnostic information from the perspective of the heart's electrical activity, and is commonly used to detect disorders of cardiac rhythm and conduction, such as arrhythmia and myocardial ischemia. Heart sound and ECG signals therefore provide diagnostic information from different angles; combining them effectively realizes information complementation and improves the accuracy of primary screening for heart disease. Deep-learning-based diagnosis from heart sound and ECG multi-modal signals can further reduce the influence of subjective differences between physicians and improve both the level and the efficiency of heart disease diagnosis.
At present, research on deep-learning-based diagnosis from heart sound and ECG multi-modal signals is scarce, and the existing work falls mainly into two categories: one manually designs a feature extraction scheme and then classifies with a deep learning or machine learning algorithm, which is time-consuming, labor-intensive, and may lose important information; the other uses a deep learning algorithm as a feature extractor and, after feature dimensionality reduction, classifies with a machine learning algorithm, which makes the application process complex. The existing research also faces two limitations. First, it adopts the early or late direct fusion strategies shown in Fig. 1; fusing only single-stage features makes it difficult to fully mine multi-modal complementary information, and direct fusion ignores the fact that different modalities and different regions contribute differently. For example, as shown in Fig. 2, the S1 and S2 heart sounds of the PCG signal and the P wave, QRS complex, and T wave of the ECG signal contain richer diagnostic information, and the PCG and ECG modalities themselves contribute differently. Second, existing research can only be applied to the multi-modal scene in which both PCG and ECG signals are present, whereas in actual clinical practice, due to limitations of acquisition equipment and other reasons, single-modal scenes with only heart sound or only ECG signals are very common, which greatly limits the application of existing multi-modal methods.
Disclosure of Invention
In view of the above defects and improvement needs of the prior art, the invention provides a heart sound and ECG combined diagnosis device and system based on deep learning, aiming to provide a diagnosis device that better integrates multi-modal complementary information, so that it can be applied both to multi-modal scenes in which heart sound and ECG signals are present simultaneously and to single-modal scenes containing only heart sound or only ECG signals.
In order to achieve the above object, according to one aspect of the present invention, there is provided a heart sound and electrocardiogram combined diagnosis device based on deep learning, configured to use a CRDNet network model to compute, for each equal-length segment obtained by preprocessing the original heart sound and ECG signals, the classification result corresponding to that segment, and to form a diagnosis analysis report from these classification results.
The CRDNet network model comprises: a PCG (phonocardiogram) modality-specific encoder and an ECG (electrocardiogram) modality-specific encoder of identical structure, a dense fusion encoder, and a collaborative decision module.
Each modality-specific encoder performs level-by-level feature extraction on the input equal-length segment, the zeroth level being a spatio-temporal feature, and finally obtains deep-level features; classification based on these deep-level features yields the intra-modal classification result for the current equal-length segment.
The dense fusion encoder fuses, level by level, the corresponding features extracted by the two modality-specific encoders. During each level of fusion it adaptively evaluates the contributions of the different modalities and the different regions of each modality's current equal-length segment to the classification result, generates a pixel-level weight map for the multi-level aggregation feature of each modality, and multiplies each modality's multi-level aggregation feature pixel-wise with its corresponding weight map to obtain the weighted multi-level aggregation feature of that modality. The weighted multi-level aggregation features of the two modalities and the fusion feature produced by the previous level are then fused by convolution to generate the fusion feature of the current level, which is used to obtain the joint classification result for the current equal-length segment. The multi-level aggregation feature of each modality used to generate the zeroth-level fusion feature is the spatio-temporal feature of that modality; the multi-level aggregation feature used to generate every other level's fusion feature is obtained by aggregating the current-level feature extracted by that modality's encoder with the features of the preceding levels.
The collaborative decision module performs a weighted addition of the intra-modal classification results of PCG and ECG and the joint classification result to obtain the final classification result for the current equal-length segment.
Further, the forming of the diagnosis analysis report according to each classification result specifically includes:
and averaging the corresponding final classification results under the input of all the equal-length segments to form a diagnosis analysis report.
Further, the dense fusion encoder additionally performs multi-scale feature optimization on each level of fusion feature it generates, obtaining the optimized fusion feature of the current level, which is used to obtain the joint classification result for the current equal-length segment.
Further, the dense fusion encoder provides different receptive fields by adopting progressive grouping convolution so as to optimize the multi-scale features.
Furthermore, the classification units in the specific modal encoder and the dense fusion encoder both adopt a double-layer LSTM network to replace a global pooling layer for the synthesis of global information.
Further, the CRDNet network model is obtained by training with the following loss function:
L_total = L_joint + λ (L_PCG + L_ECG)
where λ denotes the loss weight coefficient of each modality-specific encoder, L_PCG and L_ECG denote the intra-modal losses of the PCG and ECG modality-specific encoders, and L_joint denotes the joint loss.
Further, the aggregation of the current-level features and the preceding-level features extracted by each modality-specific encoder is implemented as:
f_i^MLP = f_i^PCG + S(f_{i-1}^MLP),  with f_0^MLP = f_ST^PCG
f_i^MLE = f_i^ECG + S(f_{i-1}^MLE),  with f_0^MLE = f_ST^ECG
where f_i^PCG and f_i^ECG are the i-th level modality-specific features extracted by the PCG and ECG modality-specific encoders, f_ST^PCG and f_ST^ECG are the spatio-temporal features extracted by the PCG and ECG modality-specific encoders, f_i^MLP and f_i^MLE are the corresponding multi-level aggregation features, and S is a mapping (down-sampling or a 1×1 convolution) that adjusts the time dimension and channel number of the (i-1)-th level multi-level aggregation feature to match the i-th level modality-specific feature.
Further, the device is also configured to use the CRDNet network model to calculate the intra-modal classification result corresponding to each equal-length segment obtained by preprocessing an original heart sound signal or an original ECG signal alone, and to form a diagnosis analysis report under the single-modal signal from these intra-modal classification results.
The invention also provides a heart sound and electrocardio combined diagnosis system based on deep learning, which comprises the following components:
the client terminal is used for preprocessing the acquired heart sound and electrocardio original signals, respectively dividing the preprocessed heart sound and electrocardio signals into equal-length segments, sending the equal-length segments to the server side, and displaying a diagnosis and analysis report received from the server side to a user;
and the server side is used for calculating to obtain a classification result according to the obtained equal-length fragments by adopting the CRDNet network model, forming a diagnosis analysis report and sending the diagnosis analysis report to the client terminal.
The invention also provides a computer readable storage medium, which comprises a stored computer program, wherein when the computer program is executed by a processor, the computer program controls a device on which the storage medium is located to execute the functions of the heart sound-electrocardio combined diagnosis device based on deep learning.
Generally, by the above technical solution conceived by the present invention, the following beneficial effects can be obtained:
(1) The invention provides a CRDNet-based heart sound and electrocardio combined diagnosis device, which is an end-to-end classification device without manually designing a feature extraction scheme, can directly classify preprocessed original signals, and is convenient and rapid in process.
(2) The PCG and ECG modality-specific encoders extract spatio-temporal features through a spatio-temporal feature extraction unit, exploiting both the spatial characteristics and the temporal information of the input signals; deep-level features are extracted through the residual optimization unit, whose residual block structure effectively alleviates vanishing gradients and facilitates network optimization. A double-layer LSTM network replaces the global pooling layer in the final classification unit, extracting the temporal information of the deep features while avoiding information loss and effectively improving accuracy.
(3) The invention provides a dense fusion encoder embedded with cross-modal region perception and multi-scale feature optimization modules. With the dense feature aggregation module, the fusion of each level of multi-modal features uses not only the same-level features extracted by the PCG and ECG modality-specific encoders but also the features of all preceding levels. Low-level features carry richer detail information and high-level features carry richer semantic information, so the densely connected fusion structure effectively makes the low-level detail and high-level semantic information complementary and fully exploits every level of feature extracted by the modality-specific encoders. The cross-modal region perception module independently generates a pixel-level weight map for each modality, adaptively evaluates the contributions of different modalities and different regions, and thus makes better use of the valuable complementary information in the multi-modal signals. Because the target regions differ in feature scale, the multi-scale feature optimization module uses progressive grouped convolution to provide different receptive fields, enhancing the network's ability to represent multi-scale targets and improving diagnostic accuracy.
(4) The invention proposes a total loss function L_total that combines the intra-modal losses L_PCG and L_ECG with the joint loss L_joint, aiming to solve the training imbalance that exists in multi-modal models. The intra-modal losses drive each modality-specific encoder to extract the more discriminative features of its own modality and ensure that every modality-specific encoder is fully trained; the joint loss guides the modality-specific encoders to learn together, strengthens the feature fusion effect, and makes effective use of the multi-modal complementary information. Meanwhile, the outputs of the modality-specific encoders and the dense fusion encoder form an ensemble decision, which effectively improves the performance of the method.
In general, to overcome the limitations of existing heart sound and ECG multi-modal diagnosis devices, namely a complex application process, fusion strategies in need of improvement, and the inability to operate when a modality is missing, an end-to-end deep-learning classification method that diagnoses heart sound and ECG multi-modal signals directly from the original signals is provided. A dense fusion strategy embedded with cross-modal region perception and multi-scale feature optimization modules is combined with a collaborative learning strategy to better fuse multi-modal complementary information, so that the method can be applied both to multi-modal scenes in which heart sound and ECG signals are present simultaneously and to single-modal scenes containing only heart sound or only ECG signals. Experimental results show that CRDNet outperforms existing multi-modal methods in the multi-modal scene and performs comparably to single-modal models in single-modal scenes. The method is highly general and performs excellently among existing methods.
Drawings
FIG. 1 is a schematic diagram of two feature fusion strategies in a conventional method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the synchronized PCG and ECG signals provided by an embodiment of the present invention;
FIG. 3 is a diagram of a CRDNet framework provided by an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of the modules in the network according to an embodiment of the present invention, where (a) is the spatio-temporal feature extraction unit, (b) is residual block-A, (c) is residual block-B, (d) is the classification unit, (e) is the cross-modal region perception module, and (f) is the multi-scale feature optimization module;
fig. 5 is a ROC curve of CRDNet and a single mode model according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
Example one
A heart sound and ECG combined diagnosis device based on deep learning uses a CRDNet network model to compute, for each equal-length segment obtained by preprocessing the original heart sound and ECG signals, the classification result corresponding to that segment, and forms a diagnosis analysis report from these classification results.
The CRDNet network model comprises: a PCG (phonocardiogram) modality-specific encoder and an ECG (electrocardiogram) modality-specific encoder of identical structure, a dense fusion encoder, and a collaborative decision module.
Each modality-specific encoder performs level-by-level feature extraction on the input equal-length segment, the zeroth level being a spatio-temporal feature, and finally obtains deep-level features; classification based on these deep-level features yields the intra-modal classification result for the current equal-length segment.
The dense fusion encoder fuses, level by level, the corresponding features extracted by the two modality-specific encoders. During each level of fusion it adaptively evaluates the contributions of the different modalities and the different regions of each modality's current equal-length segment to the classification result, generates a pixel-level weight map for the multi-level aggregation feature of each modality, and multiplies each modality's multi-level aggregation feature pixel-wise with its corresponding weight map to obtain the weighted multi-level aggregation feature of that modality. The weighted multi-level aggregation features of the two modalities and the fusion feature produced by the previous level are then fused by convolution to generate the fusion feature of the current level, which is used to obtain the joint classification result for the current equal-length segment. The multi-level aggregation feature of each modality used to generate the zeroth-level fusion feature is the spatio-temporal feature of that modality; the multi-level aggregation feature used to generate every other level's fusion feature is obtained by aggregating the current-level feature extracted by that modality's encoder with the features of the preceding levels.
The collaborative decision module performs a weighted addition of the intra-modal classification results of PCG and ECG and the joint classification result to obtain the final classification result for the current equal-length segment.
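To make the three-branch structure concrete, the following is a minimal PyTorch-style sketch of how the two modality-specific encoders, the dense fusion encoder, and the collaborative decision module could be wired together; the class names, the assumption that each encoder returns its per-level features together with its logits, and the decision weights are illustrative assumptions rather than the patent's reference implementation.

```python
import torch
import torch.nn as nn

class CRDNetSketch(nn.Module):
    """Minimal three-branch wiring: PCG encoder, ECG encoder, dense fusion encoder."""
    def __init__(self, pcg_encoder, ecg_encoder, fusion_encoder, w=(1.0, 1.0, 1.0)):
        super().__init__()
        self.pcg_encoder = pcg_encoder        # assumed to return (per-level features, class logits)
        self.ecg_encoder = ecg_encoder
        self.fusion_encoder = fusion_encoder  # consumes both feature lists, returns joint logits
        self.w = w                            # weights of the collaborative decision

    def forward(self, pcg_segment, ecg_segment):
        pcg_feats, pcg_logits = self.pcg_encoder(pcg_segment)
        ecg_feats, ecg_logits = self.ecg_encoder(ecg_segment)
        joint_logits = self.fusion_encoder(pcg_feats, ecg_feats)
        # Collaborative decision: weighted addition of intra-modal and joint classification results.
        probs = (self.w[0] * torch.softmax(pcg_logits, dim=-1)
                 + self.w[1] * torch.softmax(ecg_logits, dim=-1)
                 + self.w[2] * torch.softmax(joint_logits, dim=-1))
        return probs
```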
Specifically, to better implement the above device, each modality-specific encoder is set to comprise a spatio-temporal feature extraction unit, n residual optimization units, and a first classification unit connected in sequence, and the dense fusion encoder is set to comprise n+1 fusion feature units and a second classification unit connected in sequence.
The time-space feature extraction unit is used for extracting time-space features of each equal-length segment, the n residual error optimization units are used for extracting the features step by step under the input of the time-space features to finally obtain deep-level features, and the first classification unit is used for classifying based on the deep-level features to obtain intra-modal classification results under the input of the current equal-length segments.
Each fusion feature unit adaptively evaluates the contributions of the different modalities and the different regions of each modality's current equal-length segment to the classification result, generates a pixel-level weight map for the received multi-level aggregation feature of each modality, and multiplies each modality's multi-level aggregation feature pixel-wise with its corresponding weight map to obtain the weighted multi-level aggregation feature of that modality. It then fuses the weighted multi-level aggregation features of the two modalities with the optimized fusion feature output by the previous fusion feature unit through a convolution operation to generate the fusion feature of the current level, and performs multi-scale feature optimization on it to obtain the optimized fusion feature of the current level. The second classification unit produces the joint classification result for the current equal-length segment from the fusion feature optimized at the last level. The multi-level aggregation feature of each modality received by the first fusion feature unit is the spatio-temporal feature of that modality; the multi-level aggregation feature received by every other fusion feature unit is obtained by aggregating the current-level feature extracted by the corresponding residual optimization unit with the features extracted by the preceding residual optimization units and the spatio-temporal feature extraction unit.
And the cooperative decision module is used for weighting and adding the intra-modal classification result of the PCG and the ECG and the combined classification result to obtain a final classification result under the condition of inputting the current equal-length segments.
More specifically, regarding data preprocessing, the original heart sound and ECG signals are optionally normalized separately; because heart sound signals are easily disturbed by external noise, they are filtered to remove low-frequency artifacts, baseline drift, and high-frequency noise. The heart sound and ECG signals are then divided into equal-length segments with a certain proportion of overlap, so that the data fed to the deep learning network have equal length.
For example, the original heart sound and ECG signals are optionally normalized with the z-score method, the normalized heart sound signals are filtered with a fifth-order Butterworth band-pass filter covering 25 to 400 Hz while the ECG signals are left unfiltered, and both signals are then divided into segments 1.28 s long with 50% overlap.
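As an illustration of this preprocessing pipeline, the following sketch applies z-score normalization, a fifth-order Butterworth band-pass filter (25 to 400 Hz, heart sound channel only), and segmentation into 1.28 s windows with 50% overlap; the 2000 Hz sampling rate is taken from the data set described later, and the function names are illustrative assumptions.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def zscore(x):
    return (x - np.mean(x)) / (np.std(x) + 1e-8)

def bandpass_pcg(x, fs=2000, low=25.0, high=400.0, order=5):
    # Fifth-order Butterworth band-pass filter, applied forward and backward.
    b, a = butter(order, [low / (fs / 2), high / (fs / 2)], btype="band")
    return filtfilt(b, a, x)

def segment(x, fs=2000, seg_len_s=1.28, overlap=0.5):
    win = int(seg_len_s * fs)
    step = int(win * (1 - overlap))
    return np.stack([x[i:i + win] for i in range(0, len(x) - win + 1, step)])

def preprocess(pcg, ecg, fs=2000):
    pcg = bandpass_pcg(zscore(pcg), fs)   # PCG: normalize, then band-pass filter
    ecg = zscore(ecg)                     # ECG: normalize only (no filtering)
    return segment(pcg, fs), segment(ecg, fs)
```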
Regarding the deep learning model, the CRDNet (Cross-modal Region-aware Dense fusion Network) proposed in this embodiment optionally consists mainly of three parts: a PCG modality-specific encoder, an ECG modality-specific encoder, and a dense fusion encoder, as shown in FIG. 3.
The modality-specific encoder is divided into three parts: a spatio-temporal feature extraction unit, a residual optimization unit, and a classification unit. First, the spatio-temporal feature extraction unit, composed of an LSTM (long short-term memory network) and a CNN (convolutional neural network), integrates the temporal information and local spatial information of the input signal to obtain spatio-temporal features. On this basis, the residual optimization unit, composed of several residual blocks, extracts deep-level features step by step while reducing the time dimension. Finally, the classification unit extracts the context information of the deep-level features for classification; a double-layer LSTM network replaces the global pooling layer in the classification unit, so that the temporal information of the deep features is extracted without information loss, and the classification result of the modality-specific encoder is output.
For example, as shown in (a) of FIG. 4, in this embodiment the spatio-temporal feature extraction unit extracts the local spatial features of the original signal through a CNN branch formed by two groups of convolution, BN normalization, and ReLU activation operations, extracts the temporal features of the original signal through a single-layer LSTM, concatenates the extracted local spatial and temporal features, and integrates them through a further convolution, BN normalization, and ReLU activation. Deep-level features are then extracted step by step by a residual optimization unit consisting of 1 Res-block-A and 15 Res-block-B, with down-sampling applied every two residual blocks through a convolutional layer of stride 2 to progressively reduce the time dimension. As shown in (b) and (c) of FIG. 4, Res-block-A is composed of a convolutional layer Conv, a BN normalization layer, a ReLU activation layer, a Dropout layer, a second convolutional layer Conv, and a skip connection containing max pooling Maxpool; Res-block-B is composed, in order, of a BN normalization layer, a ReLU activation layer, a convolutional layer Conv, a BN normalization layer, a ReLU activation layer, a Dropout layer, a convolutional layer Conv, and a skip connection containing max pooling Maxpool. If the convolutional layer in a residual block performs down-sampling, the max-pooling kernel size and stride are both set to 2 so that the input feature of the residual block is adjusted to the same size as the output feature, enabling the skip connection; if no down-sampling is performed, the max-pooling kernel size and stride are both set to 1 and the input feature of the residual block is connected directly to the output feature. Finally, the classification unit uses a double-layer LSTM in place of a global pooling layer to extract the temporal information of the deep features while avoiding information loss, and completes the classification of the modality-specific encoder with a fully connected layer (Dense) and a softmax activation function, outputting the classification result of the modality-specific encoder; the structure of the classification unit is shown in (d) of FIG. 4.
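The spatio-temporal feature extraction unit described above, a CNN branch of two convolution-BN-ReLU blocks, a single-layer LSTM branch, concatenation, and a final convolution-BN-ReLU stage, can be sketched as follows; layer widths and kernel sizes are illustrative assumptions, not the patent's exact configuration.

```python
import torch
import torch.nn as nn

class SpatioTemporalUnit(nn.Module):
    """CNN branch + single-layer LSTM branch, concatenated and fused by Conv-BN-ReLU."""
    def __init__(self, in_ch=1, cnn_ch=32, lstm_hidden=32, out_ch=64, k=7):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(in_ch, cnn_ch, k, padding=k // 2), nn.BatchNorm1d(cnn_ch), nn.ReLU(),
            nn.Conv1d(cnn_ch, cnn_ch, k, padding=k // 2), nn.BatchNorm1d(cnn_ch), nn.ReLU(),
        )
        self.lstm = nn.LSTM(in_ch, lstm_hidden, num_layers=1, batch_first=True)
        self.fuse = nn.Sequential(
            nn.Conv1d(cnn_ch + lstm_hidden, out_ch, 1), nn.BatchNorm1d(out_ch), nn.ReLU(),
        )

    def forward(self, x):                                # x: (batch, 1, time)
        spatial = self.cnn(x)                            # (batch, cnn_ch, time)
        temporal, _ = self.lstm(x.transpose(1, 2))       # (batch, time, lstm_hidden)
        temporal = temporal.transpose(1, 2)              # (batch, lstm_hidden, time)
        return self.fuse(torch.cat([spatial, temporal], dim=1))
```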
In FIG. 4, Conv denotes a convolutional layer, BN a normalization layer, ReLU an activation layer, LSTM a long short-term memory network, Concat a concatenation operation, Dropout a random discard layer, Maxpool a max pooling layer, and Dense a fully connected layer; w_i^CME and w_i^CMP denote the pixel-level weight maps generated by the cross-modal region perception module for f_i^MLE and f_i^MLP, respectively; Transition layer denotes the transition layer; X_1 to X_4 are the four groups of features obtained by evenly dividing the input feature along the channel dimension, and Y_1 to Y_4 are the corresponding four groups of generated features.
The dense fusion encoder comprises several levels of modules and a classification unit, each module containing a dense feature aggregation unit, a cross-modal region perception unit, and a multi-scale feature optimization unit. The features of each level extracted by the modality-specific encoders are fused level by level. At each level, the dense feature aggregation module aggregates the current-level and preceding-level modality-specific features extracted by the PCG/ECG modality-specific encoder to obtain the PCG/ECG multi-level aggregation features. The cross-modal region perception module adaptively evaluates the contributions of the different modalities and different regions, independently generates a corresponding pixel-level weight map for each modality to highlight information-rich regions, and then fuses the weighted PCG multi-level aggregation feature, the weighted ECG multi-level aggregation feature, and the fusion feature output by the previous-level module to generate the fusion feature of the current level. The multi-scale feature optimization module then uses progressive grouped convolution to provide different receptive fields, enhancing the network's ability to represent multi-scale target regions, and outputs the optimized fusion feature of the current level. After the PCG/ECG features have been fused level by level, the classification unit extracts the context information of the deep features for classification and obtains the classification result of the fusion encoder.
As shown in fig. 3, the dense fusion encoder gradually fuses the features of each level extracted by the specific modality encoder, and in the process of fusing each level, the dense feature aggregation module aggregates the specific modality features of the current level and the previous level extracted by the PCG/ECG specific modality encoder to obtain the multi-level aggregated features of the PCG/ECG. This process can be represented as follows:
f_i^MLP = f_i^PCG + S(f_{i-1}^MLP),  with f_0^MLP = f_ST^PCG
f_i^MLE = f_i^ECG + S(f_{i-1}^MLE),  with f_0^MLE = f_ST^ECG
where f_ST^PCG and f_ST^ECG are the features output by the spatio-temporal feature extraction units of the PCG and ECG modality-specific encoders, f_i^PCG and f_i^ECG are the features output by the i-th residual block of the PCG and ECG encoders, f_i^MLP and f_i^MLE are the i-th level multi-level aggregation features of PCG and ECG, and S is a mapping implemented by down-sampling or a convolution with kernel size 1×1, used to adjust the time dimension and channel number of the (i-1)-th level multi-level aggregation feature to be the same as the feature output by the i-th residual block.
Next, the cross-modal region perception module is used for adaptive fusion of the multi-modal features and consists mainly of two parts: weight-map generation and weighted feature fusion. The weight-map generation adaptively evaluates the contributions of different modalities and different regions and independently generates a pixel-level weight map for each modality to highlight information-rich regions; the weighted PCG multi-level aggregation feature, the weighted ECG multi-level aggregation feature, and the previous-level output feature of the dense fusion encoder are then fused to generate the multi-modal fusion feature of the current level. This process can be expressed as follows:
(w_i^CMP, w_i^CME) = g_1(Concat(f_i^MLP, f_i^MLE))
f_i^CMR = g_2(Concat(w_i^CMP ⊙ f_i^MLP, w_i^CME ⊙ f_i^MLE, f_{i-1}^MFO))
where g_1 and g_2 denote the mappings of the weight-map generation and weighted feature fusion processes, respectively; w_i^CMP and w_i^CME are the pixel-level weight maps generated by the cross-modal region perception module for f_i^MLP and f_i^MLE; ⊙ denotes pixel-level multiplication; f_{i-1}^MFO is the optimized multi-modal fusion feature output by the (i-1)-th level multi-scale feature optimization module; and f_i^CMR is the multi-modal fusion feature output by the i-th level cross-modal region perception module.
And then the multi-scale feature optimization module provides different receptive fields by using progressive grouping convolution, enhances the expression capability of the network on the multi-scale target region, and outputs the multi-mode fusion features optimized at the current level. After PCG/ECG features are fused step by step, context information of deep features is extracted through a classification unit for classification, and a classification result of a fusion encoder is obtained.
Optionally, as shown in (e) of FIG. 4, in the weight-map generation part of the cross-modal region perception module, the PCG and ECG multi-level aggregation features f_i^MLP and f_i^MLE are first concatenated along the channel dimension, and the pixel-level weight map of each modality is then generated through three groups of convolution operations and one sigmoid activation layer: the first convolutional layer reduces the number of channels, the second learns feature importance, and the last, with a 2-channel output combined with the sigmoid activation function, generates the weight map corresponding to each modality. In the weighted feature fusion part, the weighted PCG multi-level aggregation feature, the weighted ECG multi-level aggregation feature, and the previous-level output feature of the dense fusion encoder are first concatenated along the channel dimension, and feature fusion is then carried out through a BN normalization layer, a ReLU activation layer, and a Conv convolutional layer.
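A minimal sketch of such a cross-modal region perception module, weight-map generation from the concatenated aggregation features through three convolutions and a sigmoid, followed by weighted fusion through BN-ReLU-Conv, might look like this; channel sizes and kernel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrossModalRegionPerception(nn.Module):
    def __init__(self, feat_ch, fusion_ch):
        super().__init__()
        # Weight-map generation: reduce channels, learn importance, output one map per modality.
        self.weight_gen = nn.Sequential(
            nn.Conv1d(2 * feat_ch, feat_ch, 3, padding=1),   # reduce the channel number
            nn.Conv1d(feat_ch, feat_ch, 3, padding=1),       # learn feature importance
            nn.Conv1d(feat_ch, 2, 1),                        # 2 channels: one weight map per modality
            nn.Sigmoid(),
        )
        # Weighted feature fusion: BN -> ReLU -> Conv over the concatenated inputs.
        self.fuse = nn.Sequential(
            nn.BatchNorm1d(2 * feat_ch + fusion_ch), nn.ReLU(),
            nn.Conv1d(2 * feat_ch + fusion_ch, fusion_ch, 3, padding=1),
        )

    def forward(self, f_mlp, f_mle, f_prev_fusion):
        w = self.weight_gen(torch.cat([f_mlp, f_mle], dim=1))    # (batch, 2, time)
        w_pcg, w_ecg = w[:, 0:1], w[:, 1:2]                      # pixel-level weight maps
        weighted = torch.cat([w_pcg * f_mlp, w_ecg * f_mle, f_prev_fusion], dim=1)
        return self.fuse(weighted)                               # fusion feature of the current level
```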
Optionally, as shown in (f) of FIG. 4, in the multi-scale feature optimization module the transition layer is a 1×1 convolution or max pooling operation that adjusts the time dimension and channel number of the input feature f_i^CMR to be consistent with the (i+1)-th level PCG and ECG multi-level aggregation features f_{i+1}^MLP and f_{i+1}^MLE, which facilitates the subsequent fusion operations. The input feature is then evenly divided along the channel dimension into four groups X_1 to X_4. The first group X_1 undergoes no convolution, achieving feature reuse, while the other groups are convolved progressively to provide different receptive fields. Finally, a convolutional layer aggregates the generated grouped features Y_1 to Y_4 and outputs the optimized fusion feature f_i^MFO of the current level.
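The progressive grouped convolution described above can be sketched as follows: the input is split into four channel groups, the first group passes through unchanged, each later group is convolved after adding the previous group's output (a Res2Net-style progression is assumed here, since the exact grouping scheme is not spelled out), and a final convolution aggregates the four outputs. The transition layer is omitted for brevity.

```python
import torch
import torch.nn as nn

class MultiScaleFeatureOptimization(nn.Module):
    """Split into 4 channel groups; progressively convolve groups 2-4; aggregate with a final conv."""
    def __init__(self, ch, out_ch):
        super().__init__()
        assert ch % 4 == 0
        g = ch // 4
        self.convs = nn.ModuleList(nn.Conv1d(g, g, 3, padding=1) for _ in range(3))
        self.aggregate = nn.Conv1d(ch, out_ch, 1)

    def forward(self, f_cmr):
        x1, x2, x3, x4 = torch.chunk(f_cmr, 4, dim=1)   # even split along the channel dimension
        y1 = x1                                          # feature reuse, no convolution
        y2 = self.convs[0](x2)
        y3 = self.convs[1](x3 + y2)                      # assumed Res2Net-style progression
        y4 = self.convs[2](x4 + y3)
        return self.aggregate(torch.cat([y1, y2, y3, y4], dim=1))
```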
Regarding model training and prediction, CRDNet is optionally trained with the preprocessed paired heart sound and ECG segments and their corresponding labels. To solve the training imbalance problem in multi-modal models, a collaborative learning strategy is proposed that uses the intra-modal losses L_PCG, L_ECG and the joint loss L_joint together as the total loss function: the intra-modal losses enable each modality-specific encoder to fully learn the discriminative characteristics of its own modality, while the joint loss guides the modality-specific encoders to learn together, makes the fusion features more discriminative, and fully exploits multi-modal complementary information. The total loss function L_total can be expressed as:
L_total = L_joint + λ (L_PCG + L_ECG)
where each term is the classical cross-entropy loss and λ is the loss weight coefficient of the modality-specific encoders. The cross-entropy loss is
L_CE = -(1/N) Σ_{i=1}^{N} Σ_{j=1}^{k} y_ij log P_ij
where N is the number of samples in a batch, k is the number of classes of the task, y_ij is the true probability that sample i belongs to class j (y_ij = 1 if the i-th sample belongs to class j, otherwise y_ij = 0), and P_ij is the predicted probability that sample i belongs to class j.
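A minimal sketch of this collaborative-learning loss, assuming each term is the standard cross-entropy given above and that an optional class-weight vector may be passed in:

```python
import torch.nn.functional as F

def crdnet_loss(pcg_logits, ecg_logits, joint_logits, labels, lam=1.0, class_weights=None):
    """L_total = L_joint + lam * (L_PCG + L_ECG), each term a cross-entropy loss."""
    l_pcg = F.cross_entropy(pcg_logits, labels, weight=class_weights)
    l_ecg = F.cross_entropy(ecg_logits, labels, weight=class_weights)
    l_joint = F.cross_entropy(joint_logits, labels, weight=class_weights)
    return l_joint + lam * (l_pcg + l_ecg)
```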
The collaborative learning strategy, combined with the modality-specific / dense-fusion / modality-specific three-branch encoder architecture, allows the method of this embodiment to be applied to both multi-modal and single-modal scenes. When the trained network is used for prediction, in the case of heart sound and ECG multi-modal signals, decision-level fusion is adopted and the outputs of the dense fusion encoder, the PCG modality-specific encoder, and the ECG modality-specific encoder are combined to determine the classification probability of each segment; in the case of a single heart sound or ECG signal, the missing modality is replaced with zeros and the classification probability of the segment is determined from the output of the modality-specific encoder of the available modality. Finally, the class prediction probability distributions of all segments cut from the same original signal are added to obtain the classification result of the whole heart sound and ECG recording, from which the diagnosis is made.
Optionally, in this embodiment, the loss coefficients of the PCG modality-specific encoder and the ECG modality-specific encoder are both set to 1, and an Adam optimizer is used to optimize the network parameters. When the trained network prediction is used, under the condition of a heart-sound-electrocardio multi-mode signal, the segment prediction probabilities output by the fusion encoder, the PCG encoder and the ECG encoder are added to carry out integrated decision, and the final prediction probability distribution of the segments is obtained. Under the condition of single-mode signals of heart sound and electrocardio, the missing mode is replaced by 0, and the output of the specific mode encoder corresponding to the single mode is the prediction probability distribution of the segment. And finally, adding the prediction probabilities of all the segments of the same original signal after being segmented to obtain a classification result of the whole segment of the heart sound electrocardiosignal, and further making a diagnosis.
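The prediction rule described above, an ensemble of the available branch outputs at segment level followed by summing segment probabilities per recording, can be sketched as follows; in the single-modal case only the corresponding branch's probabilities would be passed in (the zero-filling of the missing modality happens upstream), and the function name is illustrative.

```python
import numpy as np

def predict_record(seg_probs_fusion=None, seg_probs_pcg=None, seg_probs_ecg=None):
    """Each argument: array (num_segments, num_classes) of softmax outputs, or None if unavailable."""
    available = [p for p in (seg_probs_fusion, seg_probs_pcg, seg_probs_ecg) if p is not None]
    seg_probs = np.sum(available, axis=0)    # decision-level fusion over the available branches
    record_probs = seg_probs.sum(axis=0)     # add the probabilities of all segments of the recording
    return int(np.argmax(record_probs))      # class of the whole heart sound / ECG recording
```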
This embodiment provides an end-to-end classification method that requires no manually designed feature extraction scheme: the preprocessed original signals are fed directly into the deep learning network. In particular, it provides a dense fusion strategy embedded with a cross-modal region perception module (which fully evaluates the contributions of different modalities and different regions by independently generating a pixel-level weight map for each modality) and a multi-scale feature optimization module (which provides different receptive fields through progressive grouped convolution and enhances the network's ability to represent target regions of different scales), so that multi-modal complementary information is fused more effectively and diagnostic accuracy is improved. The collaborative learning strategy, combined with the modality-specific / dense-fusion / modality-specific three-branch encoder architecture, allows the method to be applied both to single-modal heart sound or ECG scenes and to multi-modal scenes in which heart sound and ECG signals are present simultaneously, enhancing the generality of the network.
Further, to measure the effect of the heart sound and ECG combined diagnosis method provided by the invention, a comparison with existing methods is first carried out on the "training-a" subset of the PhysioNet/CinC Challenge 2016 public heart sound data set. PhysioNet/CinC Challenge 2016 released only a training set consisting of six subsets, "training-a" through "training-f", of which only the "training-a" subset contains synchronized heart sound and ECG signals, 405 recordings in total. To avoid the influence of severely noisy signals on algorithm evaluation, the remaining 388 groups of heart sound and ECG recordings are used, comprising 116 normal and 272 abnormal heart sound and ECG recordings, with a sampling frequency of 2000 Hz.
The 388 groups of heart sound and ECG data are divided at the recording level into training and test sets for five-fold cross-validation, ensuring that segments from the same heart sound and ECG recording never appear in the training set and the test set at the same time, and keeping the positive-to-negative sample ratio of each fold the same as that of the whole data set. Following the data preprocessing described in the embodiment, the heart sound and ECG data are processed separately, each recording is divided into segments 1.28 s long with 50% overlap, and the paired heart sound and ECG segments are then used to train the deep learning network. Because of class imbalance, the loss weight of the normal and abnormal classes is set to 2.5 during training. Testing follows the model prediction procedure of the embodiment and yields the classification result of each whole heart sound and ECG recording.
In this application scenario, three indices, Sensitivity, Specificity, and average accuracy (Macc), are used to evaluate the performance of the method:
Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
Macc = (Sensitivity + Specificity) / 2
where TP is the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives. The AUC index (the area under the ROC curve) is also calculated to measure the overall performance of the method.
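For reference, a small sketch of these evaluation indices; computing AUC with scikit-learn is shown as one common choice, an assumption not mandated by the text.

```python
def sensitivity(tp, fn):
    return tp / (tp + fn)

def specificity(tn, fp):
    return tn / (tn + fp)

def macc(tp, tn, fp, fn):
    return (sensitivity(tp, fn) + specificity(tn, fp)) / 2

# AUC can be computed from record-level scores, e.g. with scikit-learn:
# from sklearn.metrics import roc_auc_score
# auc = roc_auc_score(y_true, y_score)
```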
The method is first compared with existing methods developed on the same public data set. Table one compares the effect of the proposed method and the existing methods on the same data set. As the table shows, the performance is significantly improved over the existing methods. Ranked by average accuracy, compared with the suboptimal method [4] (J. Li, L. Ke, Q. Du, X. Chen, X. Ding, Multi-modal cardiac function signals classification algorithm based on improved D-S evidence theory, Biomed. Signal Process. Control 71 (2022) 103078), the sensitivity, specificity, and average accuracy of the proposed method are improved by 9.89%, 0.87%, and 5.38%, respectively. Ranked by AUC, compared with the suboptimal method [2] (R. Hettiarachchi, U. Haputhanthri, K. Herath, H. Kariyawasam, S. Munasinghe, K. Wickramasinghe, D. Samarasinghe, A. De Silva, C. U. S. Edussooriya, A Novel Transfer Learning-Based Approach for Screening Pre-existing Heart Diseases Using Synchronized ECG Signals and Heart Sounds, in: Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), IEEE, 2021, 5), the AUC, specificity, and average accuracy of the proposed method are improved by 7.6%, 13.6.5%, and 3.5%, respectively, an excellent result.
Table one: each method compares the effect on PhysioNet/CinC Change 2016 public data set "training-a" subset
Note: * indicates results obtained by reproducing the method of the cited paper; the remaining results are those reported in the papers. [1] Li, Y. Hu, Z. P. Liu, Prediction of cardiac diseases by integrating multi-modal features with machine learning methods, Biomed. Signal Process. Control 66 (2021) 102474. [3] Li, X. Wang, C. Liu, P. Li, Y. Jiao, Integrating multi-domain deep features of electrocardiogram and phonocardiogram for coronary artery disease detection, Comput. Biol. Med. 138 (2021) 104914.
The performance of CRDNet in PCG-only or ECG-only single-modal scenarios was then evaluated and compared with a PCG single-modal model (the PCG modality-specific encoder trained as an independent network on PCG signals) and an ECG single-modal model (the ECG modality-specific encoder trained as an independent network on ECG signals). Table two compares the performance of CRDNet with the single-modal models under different signal input modes, and FIG. 5 shows the corresponding ROC (receiver operating characteristic) curves, where FPR = 1 - Specificity. Combining Table two and FIG. 5, the AUC and average accuracy of CRDNet are slightly lower than those of the PCG single-modal model when only the PCG signal is available; when only the ECG signal is available, CRDNet performs slightly better than the ECG single-modal model, with average accuracy improved by 1.66% and AUC improved from 0.951 to 0.959, showing that the performance of CRDNet under single-modal signals is comparable to that of the single-modal models. When CRDNet uses PCG and ECG multi-modal signals, compared with the PCG single-modal model its average accuracy is improved by 17.20% and its AUC is improved from 0.835 to 0.973; compared with the ECG single-modal model, its average accuracy is improved by 5.17% and its AUC is improved to 0.973. Meanwhile, CRDNet can be applied both to multi-modal scenes in which PCG and ECG are present simultaneously and to single-modal scenes with only PCG or only ECG, showing strong generality.
Table two: performance comparison of CRDNet with Single-mode models at different Signal input modes
In general, CRDNet realizes an end-to-end heart sound and ECG combined diagnosis method that needs no manually designed feature extraction scheme: preprocessing is convenient and fast, the preprocessed original signals are classified directly, and the modality-specific / dense-fusion / modality-specific three-branch encoder architecture combined with the collaborative learning strategy allows the method to be applied both to multi-modal scenes in which heart sound and ECG exist simultaneously and to single-modal heart sound or ECG scenes, giving it strong generality and excellent performance among existing methods. The proposed collaborative learning strategy not only enables each modality-specific encoder, through the intra-modal and joint losses, to extract the more discriminative features of its modality and effectively solves the training imbalance of multi-modal models, but also guides the modality-specific encoders to learn together and strengthens the feature fusion effect. The proposed dense fusion strategy with cross-modal region perception and multi-scale feature optimization modules makes full use of the discriminative features of every level extracted by the modality-specific encoders, accounts for the different contributions of different modalities and different regions during multi-modal feature fusion, enhances the network's ability to represent multi-scale target regions, effectively realizes the complementation of PCG and ECG feature information, and improves diagnostic accuracy. Meanwhile, the outputs of the modality-specific encoders and the dense fusion encoder form an ensemble decision, which effectively improves the performance of the method.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A heart sound and electrocardio combined diagnosis device based on deep learning is characterized by being used for calculating classification results corresponding to isometric segments by adopting a CRDNet network model according to isometric segments obtained by preprocessing original signals of heart sound and electrocardio; forming a diagnosis analysis report according to each classification result;
wherein the CRDNet network model comprises: a PCG (phonocardiogram) modality-specific encoder and an ECG (electrocardiogram) modality-specific encoder which are identical in structure, a dense fusion encoder and a collaborative decision module;
each specific modal encoder is used for carrying out feature progressive extraction on input equal-length segments, the zero level is a space-time feature, finally, deep-level features are obtained, classification is carried out on the basis of the deep-level features, and intra-modal classification results under the condition that the current equal-length segments are input are obtained;
the intensive fusion encoder is used for performing step-by-step fusion on the feature correspondence extracted step by the two specific modal encoders, adaptively evaluating the contributions of different modals and different regions of current equal-length segments of each modality to classification results in the process of each step of fusion, generating a pixel-level weight map for the multi-level aggregation feature of each modality, and performing pixel-level multiplication on the multi-level aggregation feature of each modality and the pixel-level weight map corresponding to the multi-level aggregation feature of each modality to obtain the weighted multi-level aggregation feature of the modality; fusing the weighted multi-level aggregation characteristics of each mode and the fusion characteristics obtained by the previous-level fusion through convolution operation to generate the fusion characteristics of the current level for obtaining a combined classification result under the input of the current equal-length segments; the multi-level aggregation feature of each mode adopted for generating the fusion feature of the zeroth level is the space-time feature corresponding to the mode, and the multi-level aggregation feature of each mode adopted for generating the fusion features of other levels is obtained by aggregating the current-level feature and the previous-level feature of the current-level feature extracted by a specific mode encoder of the mode;
and the cooperative decision module is used for weighting and adding intra-modal classification results of the PCG and the ECG and the combined classification results to obtain a final classification result under the condition of inputting the current equal-length segments.
2. The combined cardiac sound and electrocardiographic diagnostic apparatus according to claim 1, wherein the diagnostic analysis report is formed based on the classification results, and specifically comprises:
and averaging the corresponding final classification results under the input of all the equal-length segments to form a diagnosis analysis report.
3. The integrated diagnostic apparatus for cardioelectric and electrocardiography as claimed in claim 1, wherein the dense fusion encoder further performs multi-scale feature optimization on each level of fusion features generated by the dense fusion encoder to obtain the optimized current level of fusion features for obtaining the combined classification result under the current isometric segment input.
4. The integrated diagnostic apparatus for cardioelectric and electrocardiograph as claimed in claim 3, wherein the dense fusion encoder provides different receptive fields by using progressive block convolution for multi-scale feature optimization.
5. The integrated cardioelectric and electrocardiographic diagnostic apparatus according to claim 1, wherein the classification units in the modality-specific encoder and the dense fusion encoder use a double-layer LSTM network instead of a global pooling layer for global information integration.
6. The integrated cardioelectric/electrocardiographic diagnostic apparatus according to claim 1, wherein the CRDNet network model is obtained by training with the loss function:
$$\mathcal{L} = \mathcal{L}_{joint} + \lambda\left(\mathcal{L}_{PCG} + \mathcal{L}_{ECG}\right)$$
wherein λ represents the loss weight coefficient of each modality-specific encoder, $\mathcal{L}_{PCG}$ and $\mathcal{L}_{ECG}$ represent the intra-modal losses, and $\mathcal{L}_{joint}$ represents the joint loss.
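The loss can be computed as in the following sketch, which follows the textual definitions of claim 6; the use of cross-entropy for the individual terms and the default value of λ are assumptions:

```python
import torch.nn.functional as F

def crdnet_loss(joint_logits, pcg_logits, ecg_logits, labels, lam=0.5):
    """Joint loss plus lambda-weighted intra-modal losses (cross-entropy assumed)."""
    joint = F.cross_entropy(joint_logits, labels)
    intra = F.cross_entropy(pcg_logits, labels) + F.cross_entropy(ecg_logits, labels)
    return joint + lam * intra
```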
7. The combined heart sound and electrocardiogram diagnosis device according to claim 1, wherein the aggregation of the current-level feature extracted by each modality-specific encoder with the previous-level feature of that modality is implemented as follows:
$$f_i^{MLP} = \mathrm{Agg}\!\left(f_i^{PCG},\; \mathcal{C}\!\left(f_{i-1}^{MLP}\right)\right)$$
$$f_i^{MLE} = \mathrm{Agg}\!\left(f_i^{ECG},\; \mathcal{C}\!\left(f_{i-1}^{MLE}\right)\right)$$
in the formulas, $f_i^{PCG}$ and $f_i^{ECG}$ are the level-i modality-specific features extracted by the PCG and ECG modality-specific encoders, respectively, wherein $f_0^{PCG}$ and $f_0^{ECG}$ are the spatio-temporal features extracted by the PCG and ECG modality-specific encoders; $f_i^{MLP}$ and $f_i^{MLE}$ are the level-i multi-level aggregation features of the two modalities; $\mathrm{Agg}(\cdot)$ denotes the aggregation operation; and $\mathcal{C}(\cdot)$ adjusts the time dimension size and the channel number of the level-(i−1) multi-level aggregation feature to match those of the level-i modality-specific feature.
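A sketch of one aggregation step consistent with claim 7, in which the adjustment of the previous level's aggregation feature is modelled by a strided 1x1 convolution and the aggregation itself by concatenation followed by a 1x1 convolution (both operator choices are assumptions):

```python
import torch
import torch.nn as nn

class LevelAggregation(nn.Module):
    """Aggregate the current-level modality feature with the adjusted previous aggregation."""
    def __init__(self, prev_channels, cur_channels, stride):
        super().__init__()
        # Match the previous aggregation feature's channel count and temporal size
        # to the current-level modality-specific feature.
        self.adjust = nn.Conv1d(prev_channels, cur_channels, kernel_size=1, stride=stride)
        self.aggregate = nn.Conv1d(2 * cur_channels, cur_channels, kernel_size=1)

    def forward(self, cur_feat, prev_agg):
        prev = self.adjust(prev_agg)
        t = min(prev.shape[-1], cur_feat.shape[-1])   # guard against off-by-one lengths
        return self.aggregate(torch.cat([cur_feat[..., :t], prev[..., :t]], dim=1))
```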
8. The combined heart sound and electrocardiogram diagnosis device according to claim 1, wherein the device is further configured to compute, using the CRDNet network model, the intra-modal classification results corresponding to the equal-length segments obtained by preprocessing an original heart sound signal or an original electrocardiogram signal alone, and to form a diagnosis analysis report for the single-modality signal from the intra-modal classification results.
9. A heart sound and electrocardiogram combined diagnosis system based on deep learning, characterized by comprising:
a client terminal, used for preprocessing the acquired original heart sound and electrocardiogram signals, dividing the preprocessed heart sound and electrocardiogram signals respectively into equal-length segments, sending the equal-length segments to the server side, and displaying the diagnosis analysis report received from the server side to the user;
a server side, used for computing classification results from the received equal-length segments using the CRDNet network model according to any one of claims 1 to 8, forming a diagnosis analysis report, and sending the diagnosis analysis report to the client terminal.
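The following sketch illustrates the client/server split of claim 9 under assumed choices (Flask as the web framework, a /diagnose endpoint, and a stubbed-out model wrapper); it is not part of the claimed system:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

def run_crdnet_and_average(segments):
    # Placeholder for CRDNet inference followed by averaging over segments (claims 1-2);
    # a real deployment would load the trained model and return its class probabilities.
    return {"classes": [], "note": "model inference stubbed out in this sketch"}

@app.route("/diagnose", methods=["POST"])
def diagnose():
    segments = request.get_json()["segments"]   # preprocessed equal-length PCG/ECG segments
    return jsonify({"report": run_crdnet_and_average(segments)})

# Client side (also a sketch): send segments and display the returned report.
#   import requests
#   report = requests.post("http://server/diagnose", json={"segments": segments}).json()
```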
10. A computer-readable storage medium comprising a stored computer program, wherein, when the computer program is executed by a processor, it controls a device on which the storage medium is located to perform the functions of the heart sound and electrocardiogram combined diagnosis device based on deep learning according to any one of claims 1 to 8.
CN202210664490.XA 2022-06-13 2022-06-13 Heart sound and electrocardiogram combined diagnosis device and system based on deep learning Pending CN115177262A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210664490.XA CN115177262A (en) 2022-06-13 2022-06-13 Heart sound and electrocardiogram combined diagnosis device and system based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210664490.XA CN115177262A (en) 2022-06-13 2022-06-13 Heart sound and electrocardiogram combined diagnosis device and system based on deep learning

Publications (1)

Publication Number Publication Date
CN115177262A true CN115177262A (en) 2022-10-14

Family

ID=83514313

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210664490.XA Pending CN115177262A (en) 2022-06-13 2022-06-13 Heart sound and electrocardiogram combined diagnosis device and system based on deep learning

Country Status (1)

Country Link
CN (1) CN115177262A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115640507A (en) * 2022-12-09 2023-01-24 南京备力医疗科技有限公司 Abnormal data screening method based on electrocardio-heart sound joint analysis
CN115640507B (en) * 2022-12-09 2024-03-15 镜电(南京)科技发展有限公司 Abnormal data screening method based on electrocardiographic and heart sound combined analysis
TWI837003B (en) * 2023-05-29 2024-03-21 國立勤益科技大學 A monitoring method of st change events having time-relevant analysis
CN117349600A (en) * 2023-12-06 2024-01-05 厦门理工学院 Heart sound and heart electricity combined diagnosis method and system based on dual-mode dual input
CN117349600B (en) * 2023-12-06 2024-01-30 厦门理工学院 Heart sound and heart electricity combined diagnosis method and system based on dual-mode dual input

Similar Documents

Publication Publication Date Title
CN108714026B (en) Fine-grained electrocardiosignal classification method based on deep convolutional neural network and online decision fusion
CN111449645B (en) Intelligent classification and identification method for electrocardiogram and heartbeat
CN115177262A (en) Heart sound and electrocardiogram combined diagnosis device and system based on deep learning
CN108511055B (en) Ventricular premature beat recognition system and method based on classifier fusion and diagnosis rules
CN113274031B (en) Arrhythmia classification method based on depth convolution residual error network
CN115281688A (en) Cardiac hypertrophy multi-label detection system based on multi-mode deep learning
CN113080991A (en) Method, system, diagnosis device and storage medium for predicting and diagnosing heart failure based on CNN model and LSTM model
CN116361688A (en) Multi-mode feature fusion model construction method for automatic classification of electrocardiographic rhythms
CN112932498A (en) T wave morphology classification system with strong generalization capability based on deep learning
Xie et al. Intelligent analysis of premature ventricular contraction based on features and random forest
CN113509185A (en) Myocardial infarction classification method based on multi-modal patient information attention modeling
CN115221926A (en) Heart beat signal classification method based on CNN-GRU network model
CN113116300A (en) Physiological signal classification method based on model fusion
CN115944302A (en) Unsupervised multi-mode electrocardiogram abnormity detection method based on attention mechanism
CN115470832A (en) Electrocardiosignal data processing method based on block chain
Zarrabi et al. A system for accurately predicting the risk of myocardial infarction using PCG, ECG and clinical features
CN113855063A (en) Heart sound automatic diagnosis system based on deep learning
CN118197592A (en) Multi-mode arrhythmia classification auxiliary diagnosis system based on data expansion
CN115429284B (en) Electrocardiosignal classification method, system, computer device and readable storage medium
CN115089112B (en) Post-stroke cognitive impairment risk assessment model building method and device and electronic equipment
CN113768514B (en) Arrhythmia classification method based on convolutional neural network and gating circulation unit
Roland et al. An automated system for arrhythmia detection using ECG records from MITDB
Zhou et al. A novel 1-D densely connected feature selection convolutional neural network for heart sounds classification
Balcı A hybrid attention-based LSTM-XGBoost model for detection of ECG-based atrial fibrillation
Lu et al. A New Multichannel Parallel Network Framework for the Special Structure of Multilead ECG

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination