WO2021104099A1 - Multi-modal depression detection method and system based on context perception - Google Patents
- Publication number
- WO2021104099A1 (application PCT/CN2020/129214)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- depression
- text
- acoustic
- channel subsystem
- context
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/66—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition
Definitions
- the present invention relates to the technical field of depression detection, in particular to a multi-modal depression detection method and system based on context perception.
- Deep learning is a branch of machine learning that builds high-level abstractions of data through multiple layers of non-linear transformations. Deep learning algorithms can turn raw data into representations that are easier to learn from.
- For example, some prior work combines a CNN and an LSTM into a new deep network, extracts acoustic features from the speech signal, and uses them for the detection of depression.
- Another example is semantic analysis of the conversation between the doctor and the depression patient, using techniques such as filled-pause extraction, Principal Component Analysis (PCA), and the whitening transform to obtain some text features.
- PCA: Principal Component Analysis
- SVR: Support Vector Regressor (linear)
- the acoustic features used in the prior art are 279-dimensional hand-crafted features, and the text features are 100-dimensional word embedding vectors extracted using the Doc2Vec tool.
- the existing technology mainly has the following problems: in terms of the amount of training data, most existing multi-modal depression detection systems based on speech, text, or images are trained on limited depression data, so their performance is low.
- in terms of feature extraction, existing methods lack verbal information related to topic and context and are therefore inadequate for depression detection, which limits the performance of the final depression detection system; in terms of depression classification modeling, the existing technology does not consider the long-term dependence between speech and text features and the depression diagnosis; in terms of multi-modal fusion, the prior art simply concatenates the subsystem outputs obtained under different modalities or channels and then makes a decision, ignoring the weight relationship between modalities or channels, so performance is limited.
- the purpose of the present invention is to overcome the above-mentioned shortcomings of the prior art and provide a multi-modal depression detection method and system based on context perception.
- a multi-modal depression detection method based on context perception includes the following steps:
- Step S1: construct a training sample set that includes topic information, spectrograms, and corresponding text information;
- Step S2: using a convolutional neural network combined with multi-task learning, perform acoustic feature extraction on the spectrograms of the training sample set to obtain context-aware acoustic features;
- Step S3: using the training sample set, process the word embeddings with a Transformer model to extract context-aware text features;
- Step S4: establish an acoustic channel subsystem for depression detection on the context-aware acoustic features, and a text channel subsystem for depression detection on the context-aware text features;
- Step S5: fuse the outputs of the acoustic channel subsystem and the text channel subsystem to obtain depression classification information.
- the context-aware acoustic features are obtained according to the following steps:
- the convolutional neural network includes an input layer, multiple convolutional layers, multiple fully connected layers, an output layer, and a bottleneck layer located between the last fully connected layer and the output layer.
- the bottleneck layer has fewer nodes than the convolutional layers and the fully connected layers;
- the spectrograms in the training sample set are input to the convolutional neural network, whose output layer contains a depression classification task and a topic labeling task;
- the context-aware acoustic features are extracted from the bottleneck layer of the convolutional neural network.
- the context-aware text features are extracted according to the following steps:
- the Transformer model includes multiple encoders and decoders with self-attention and a softmax layer at the last layer;
- the softmax layer is removed, and the output of the Transformer model is used as the context-aware text feature.
- step S5 includes:
- the outputs of the acoustic channel subsystem and the text channel subsystem are merged to obtain a classification score for depression.
- the depression classification score is obtained by weighting the subsystem outputs, where the weight w_i of the i-th subsystem is a vector of c per-class coefficients and c is the number of depression classes.
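The weighted fusion described above can be illustrated with a minimal sketch. The exact scoring formula is not legible in this text, so the per-class weighted sum below is an assumption, and the names `fuse_scores`, `classify`, the class labels, and all numeric values are hypothetical.

```python
def fuse_scores(subsystem_scores, weights):
    """Per-class weighted sum of subsystem scores (assumed fusion rule).

    subsystem_scores[i][k]: score of class k from subsystem i.
    weights[i][k]: weight of subsystem i for class k (hypothetical layout).
    """
    n_classes = len(subsystem_scores[0])
    fused = [0.0] * n_classes
    for scores, w in zip(subsystem_scores, weights):
        for k in range(n_classes):
            fused[k] += w[k] * scores[k]
    return fused


def classify(fused, labels):
    """Return the label whose fused score is largest."""
    best = max(range(len(fused)), key=fused.__getitem__)
    return labels[best]


# Illustrative values only: two subsystems, four depression classes.
LABELS = ["normal", "mild", "moderate", "severe"]
acoustic = [0.10, 0.50, 0.30, 0.10]
text = [0.05, 0.25, 0.60, 0.10]
weights = [[0.4] * 4, [0.6] * 4]

fused = fuse_scores([acoustic, text], weights)
prediction = classify(fused, LABELS)
```

Here the text channel carries the larger weight, so its preferred class wins even though the acoustic channel prefers a different one.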
- the acoustic channel subsystem and the text channel subsystem are established based on a BLSTM network. The network input of the acoustic channel subsystem is the perceptual linear prediction (PLP) coefficients of multiple consecutive frames together with the context-aware acoustic features, and the output is a depression classification label.
- the network input of the text channel subsystem is text information, and the output is a depression classification label.
- the topic information in the training sample set includes multiple types of identifiers classified based on the content of the conversation between the doctor and the depression patient.
- a multi-modal depression detection system based on contextual perception includes:
- Training sample construction unit: used to construct a training sample set that includes topic information, spectrograms, and corresponding text information;
- Acoustic feature extraction unit: used to extract acoustic features from the spectrograms of the training sample set with a convolutional neural network combined with multi-task learning, obtaining context-aware acoustic features;
- Text feature extraction unit: used to process word embeddings with a Transformer model, using the training sample set, to extract context-aware text features;
- Classification subsystem establishment unit: used to establish an acoustic channel subsystem for depression detection on the context-aware acoustic features, and a text channel subsystem for depression detection on the context-aware text features;
- Classification and fusion unit: used to fuse the outputs of the acoustic channel subsystem and the text channel subsystem to obtain depression classification information.
- the present invention has the advantage of using data enhancement to expand the depression speech and text training data according to the topic information in the free conversation between the doctor and the depression patient, and of using the expanded data for model training;
- verbal information related to depression detection is extracted, including acoustic features that are speaker-independent, highly related to depression, and context-aware, and text features that are highly related to depression and context-aware;
- a depression detection subsystem is established in each of the acoustic channel and the text channel;
- the reinforcement learning method is used to obtain a multi-system fusion framework that achieves robust automatic multi-modal depression detection.
- Fig. 1 is a general framework diagram of a multi-modal depression detection method based on context perception according to an embodiment of the present invention
- Fig. 2 is a flowchart of a multi-modal depression detection method based on context perception according to an embodiment of the present invention
- Figure 3 is a schematic diagram of topic-based data enhancement
- Figure 4 is a schematic diagram of the acoustic feature extraction process based on CNN and multi-task learning
- Figure 5 is a schematic diagram of a text feature extraction process based on a multi-head self-attention mechanism
- Figure 6 is a schematic diagram of reinforcement learning.
- the overall technical solution is as follows: first, a topic-based data enhancement method is used to obtain more topic-related depression speech and text data; then, a CNN combined with multi-task learning is used to extract context-aware acoustic features from the spectrogram, and a Transformer is used to process word embeddings to obtain context-aware text features; next, a BLSTM (bidirectional long short-term memory network) model is used to establish a depression detection subsystem on the context-aware acoustic features and on the context-aware text features, respectively; finally, reinforcement learning is used to make a fusion decision on the outputs of the subsystems to obtain the final depression classification.
- BLSTM: bidirectional long short-term memory network
- the multi-modal depression detection method based on context perception includes the following steps:
- Step S210 Obtain a training sample set with context awareness.
- the training sample set can be expanded based on the original training set to include context perception information.
- the original data set usually only includes the correspondence between speech and text.
- topic labeling is performed on each pair of speech and text data in the existing training set. For example, divide the content of conversations between doctors and patients with depression into 7 topics: whether they are interested, whether they sleep well, whether they feel depressed, whether they feel defeated, self-evaluation, whether they have ever been diagnosed with depression, and whether their parents have ever suffered from depression.
- new training samples can be obtained in this way and spliced together with the original training samples to expand the original data set and construct a new training sample set.
- in this step, by defining multiple topics for the conversation between the doctor and the depression patient and expanding the original training data set by random combination, a richer set of context-aware training samples can be obtained, including topic information, spectrograms, text information, and corresponding classification labels, thereby improving the accuracy of subsequent training.
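One way to picture topic-based data enhancement (data augmentation) by random combination is sketched below. The source does not spell out the combination rule, so this sketch assumes that the topic-labeled segments of a single subject are reordered at random and spliced into new samples that keep the subject's label; `Segment`, `augment`, and the field names are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Segment:
    topic_id: int   # one of the 7 conversation topics (0..6)
    audio: list     # placeholder for the speech frames of this segment
    text: str       # transcript of this segment


def augment(segments, label, n_new, rng):
    """Create n_new samples by randomly reordering one subject's topic segments.

    Assumption: segments come from a single subject, so every new sample
    can keep the same depression label.
    """
    samples = []
    for _ in range(n_new):
        order = segments[:]
        rng.shuffle(order)
        samples.append({
            "topics": [s.topic_id for s in order],
            "audio": [frame for s in order for frame in s.audio],
            "text": " ".join(s.text for s in order),
            "label": label,
        })
    return samples


# Illustrative data only.
segments = [
    Segment(0, [0.1, 0.2], "not really interested in things"),
    Segment(1, [0.3], "sleep has been poor"),
    Segment(2, [0.4, 0.5, 0.6], "feeling down most days"),
]
new_samples = augment(segments, "mild", n_new=5, rng=random.Random(0))
```

Each generated sample carries its topic sequence alongside the spliced audio and text, which matches the sample contents listed above (topic information, speech, text, label).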
- Step S220: extract context-aware acoustic features based on a CNN and multi-task learning.
- CNN: Convolutional Neural Network
- the present invention combines multi-task learning with a CNN for classification network training.
- the input of the CNN is the spectrogram of each training sample, and the CNN includes several convolutional layers and several fully connected layers.
- downsampling after the convolutional layers is performed using, for example, max pooling.
- between the last fully connected layer and the output layer, the embodiment of the present invention inserts a bottleneck layer that contains only a few nodes, for example 39.
- the output layer of the CNN contains two tasks.
- the first task is the classification of depression, for example into multiple categories such as normal, mild, moderate, and severe.
- the second task is the labeling of different topics (topic identification).
- the context-aware acoustic features are extracted from the bottleneck layer of the CNN and are spliced with traditional acoustic features for subsequent classification network training.
- in short, by training the CNN with multi-task learning on these two tasks, the output of the bottleneck layer serves as an acoustic feature with topic-context awareness.
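The layer layout described in this step can be sketched as a forward pass in plain Python. This is only an illustration of the data flow (convolution, max pooling, fully connected layer, 39-node bottleneck, two output heads); the single 3x3 kernel, the hidden size, the random initial weights, and the class `MultiTaskCNN` are assumptions, and training is omitted.

```python
import math
import random


def conv2d_relu(image, kernel):
    """Valid 2-D cross-correlation with one kernel, followed by ReLU."""
    kh, kw = len(kernel), len(kernel[0])
    return [[max(0.0, sum(image[i + a][j + b] * kernel[a][b]
                          for a in range(kh) for b in range(kw)))
             for j in range(len(image[0]) - kw + 1)]
            for i in range(len(image) - kh + 1)]


def max_pool(fmap, size=2):
    """Non-overlapping max pooling (downsampling)."""
    return [[max(fmap[i + a][j + b] for a in range(size) for b in range(size))
             for j in range(0, len(fmap[0]) - size + 1, size)]
            for i in range(0, len(fmap) - size + 1, size)]


def dense(x, weights, biases):
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


def init(rng, n_out, n_in):
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


class MultiTaskCNN:
    """Shared trunk, bottleneck layer, and two task heads (untrained sketch)."""

    def __init__(self, height, width, n_classes=4, n_topics=7,
                 bottleneck=39, hidden=64, seed=0):
        rng = random.Random(seed)
        self.kernel = [[rng.uniform(-0.1, 0.1) for _ in range(3)] for _ in range(3)]
        flat = ((height - 2) // 2) * ((width - 2) // 2)
        self.fc = init(rng, hidden, flat)
        self.bn = init(rng, bottleneck, hidden)
        self.depression_head = init(rng, n_classes, bottleneck)
        self.topic_head = init(rng, n_topics, bottleneck)

    def forward(self, spectrogram):
        fmap = max_pool(conv2d_relu(spectrogram, self.kernel))
        flat = [v for row in fmap for v in row]
        hidden = [max(0.0, v) for v in dense(flat, *self.fc)]
        features = dense(hidden, *self.bn)  # context-aware acoustic features
        depression = softmax(dense(features, *self.depression_head))
        topic = softmax(dense(features, *self.topic_head))
        return features, depression, topic
```

After training on both tasks, only the first return value (the bottleneck activations) would be kept as the acoustic feature; the two softmax heads exist to shape that representation during training.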
- Step S230 extracting context-aware text features based on the multi-head self-attention mechanism.
- a Transformer model based on a multi-head self-attention mechanism is used to analyze the semantics of sentences, so as to extract context-aware text features.
- the input of the Transformer model is traditional word embedding plus topic ID (identification), and its main structure is composed of multiple encoders and decoders containing self-attention, which is the so-called multi-head mechanism.
- because the Transformer model allows direct connections between data units, it can take into account attention information from different positions and better capture long-term dependencies.
- in the embodiment of the present invention, the Transformer model parameters are first pre-trained on a large-scale text corpus (such as Weibo, Wikipedia, etc.) using an unsupervised training method; transfer learning is then used to perform adaptive training on the collected depression text data.
- after training is completed, the last softmax layer in Figure 5 is removed, and the output is used as the context-aware text feature for subsequent depression detection model training.
- the Transformer model can be used to extract robust text features.
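The core operation behind the multi-head self-attention mechanism can be sketched in plain Python. This is a simplified illustration, not the embodiment's Transformer: the learned query, key, and value projections are replaced by identity mappings, the topic ID is assumed to be an embedding added to each word embedding, and mean pooling is assumed for the sentence-level feature. The function names are hypothetical.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


def add_topic(word_embeddings, topic_embedding):
    """Add the topic embedding to every word embedding (assumed combination rule)."""
    return [[w + t for w, t in zip(vec, topic_embedding)] for vec in word_embeddings]


def attention(q, k, v):
    """Scaled dot-product attention for one head; q, k, v are lists of vectors."""
    scale = math.sqrt(len(q[0]))
    out = []
    for qi in q:
        weights = softmax([sum(a * b for a, b in zip(qi, kj)) / scale for kj in k])
        out.append([sum(w * vj[d] for w, vj in zip(weights, v))
                    for d in range(len(v[0]))])
    return out


def multi_head_self_attention(x, n_heads):
    """Split each vector into n_heads slices, attend per slice, concatenate.

    The embedding size must be divisible by n_heads.
    """
    d = len(x[0]) // n_heads
    heads = []
    for h in range(n_heads):
        part = [vec[h * d:(h + 1) * d] for vec in x]
        heads.append(attention(part, part, part))
    return [sum((head[i] for head in heads), []) for i in range(len(x))]


def sentence_feature(token_outputs):
    """Mean-pool token outputs into one text feature vector (assumed pooling)."""
    n = len(token_outputs)
    return [sum(tok[d] for tok in token_outputs) / n
            for d in range(len(token_outputs[0]))]
```

Because every token attends directly to every other token, the distance between two words does not limit how strongly they can interact, which is the property credited above with capturing long-term dependencies.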
- Step S240: establish a depression detection subsystem for the context-aware acoustic features and for the context-aware text features, respectively.
- the embodiment of the present invention adopts a BLSTM-based method to establish the depression classification sub-networks (or subsystems).
- BLSTM can cache the current input and use it in both the preceding and the following computations, implicitly incorporating time information into the model and thereby modeling long-term dependencies.
- the BLSTM network adopted in the embodiment of the present invention has a total of 3 BLSTM layers, and each layer contains 128 nodes.
- for the acoustic channel, the network input is 11 consecutive frames of PLP (perceptual linear prediction) coefficients together with the context-aware acoustic features, and the output is the depression classification label;
- for the text channel, the network input is the context-aware text feature of a training sample, and the output is the depression classification label.
- the BLSTM network is used to establish a depression classification model to capture the long-term dependence of acoustic features or text features with the diagnosis of depression.
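The acoustic-channel input described in this step (11 consecutive PLP frames plus the context-aware acoustic features) can be assembled as in the sketch below. Several details are assumptions rather than statements from the source: the window is centered on the current frame (5 frames on each side), edge frames are repeated at utterance boundaries, the PLP dimension is 13, and a 39-dimensional bottleneck feature is available per frame. `splice_frames` is a hypothetical helper.

```python
def splice_frames(plp, bottleneck, context=5):
    """Build one input vector per frame for the acoustic-channel network.

    For frame t, PLP frames t-context..t+context are concatenated (indices
    are clamped at the utterance edges) and the frame's bottleneck feature
    is appended.
    """
    n = len(plp)
    inputs = []
    for t in range(n):
        window = []
        for offset in range(-context, context + 1):
            idx = min(max(t + offset, 0), n - 1)
            window.extend(plp[idx])
        inputs.append(window + list(bottleneck[t]))
    return inputs


# Illustrative shapes only: 20 frames, 13-dim PLP, 39-dim bottleneck features.
plp = [[float(t)] * 13 for t in range(20)]
bottleneck = [[0.0] * 39 for _ in range(20)]
network_inputs = splice_frames(plp, bottleneck)
```

Under these assumptions each input vector has 11 x 13 + 39 = 182 dimensions, and the resulting sequence would be fed to the 3-layer BLSTM frame by frame.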
- Step S250 Use reinforcement learning to fuse the outputs of the various depression detection subsystems to obtain the final depression classification.
- the embodiment of the present invention adopts a reinforcement learning mechanism to minimize the difference between the final depression prediction result and feedback information of the combined system by adjusting the weight of each subsystem.
- the final depression score is obtained by combining the subsystem scores using the subsystem weights w_i.
- the decision score function L_t of reinforcement learning at time t is defined in terms of the following quantities:
- a_{t-1} represents the feedback at time t-1
- D represents the difference between the actual and predicted results on the development set
- W represents the set of all subsystem weights {w_i}
- C represents the global accuracy on the development set. L_t is therefore summed over all time steps and maximized, and the resulting W* gives the final subsystem weights.
- a hidden Markov model or other models can be used for reinforcement learning.
- the reinforcement learning method is used to automatically adjust the weights of the subsystem score of the acoustic channel and the subsystem score of the text channel, so that they can be organically integrated to perform the final depression classification.
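The idea of adjusting subsystem weights from feedback on a development set can be illustrated with a much simpler stand-in. The sketch below is not the decision score function of the embodiment: it is a stochastic hill-climbing loop that perturbs the weights and keeps a change only when development-set accuracy does not drop. It also uses one scalar weight per subsystem rather than a per-class vector. The names `accuracy` and `tune_weights` and all data are hypothetical.

```python
import random


def accuracy(weights, dev_scores, dev_labels):
    """Fraction of development samples classified correctly after fusion.

    dev_scores[n][i][k]: score of class k from subsystem i for sample n.
    """
    correct = 0
    for per_subsystem, label in zip(dev_scores, dev_labels):
        n_classes = len(per_subsystem[0])
        fused = [sum(w * s[k] for w, s in zip(weights, per_subsystem))
                 for k in range(n_classes)]
        predicted = max(range(n_classes), key=fused.__getitem__)
        correct += int(predicted == label)
    return correct / len(dev_labels)


def tune_weights(dev_scores, dev_labels, steps=200, step_size=0.1, seed=0):
    """Feedback-driven weight search (a stand-in for the reinforcement step)."""
    rng = random.Random(seed)
    n_subsystems = len(dev_scores[0])
    best = [1.0 / n_subsystems] * n_subsystems
    best_acc = accuracy(best, dev_scores, dev_labels)
    for _ in range(steps):
        candidate = [max(0.0, w + rng.uniform(-step_size, step_size)) for w in best]
        total = sum(candidate)
        if total == 0.0:
            continue
        candidate = [w / total for w in candidate]  # keep weights normalized
        acc = accuracy(candidate, dev_scores, dev_labels)
        if acc >= best_acc:  # feedback: keep the change only if it does not hurt
            best, best_acc = candidate, acc
    return best, best_acc
```

A typical call would pass the acoustic and text subsystem scores for each development sample, then reuse the returned weights when fusing scores on new data.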
- after training, the network model can be applied to new data (including topics, speech, text, etc.), using a process similar to training, to predict the depression classification.
- in addition to BLSTM, other models that capture temporal information can also be used.
- the present invention also provides a multi-modal depression detection system based on context perception.
- the system includes: a training sample construction unit, used to construct a training sample set that includes topic information, spectrograms, and corresponding text information; an acoustic feature extraction unit, used to extract acoustic features from the spectrograms of the training sample set with a convolutional neural network combined with multi-task learning, obtaining context-aware acoustic features; a text feature extraction unit, used to process word embeddings with a Transformer model, using the training sample set, to extract context-aware text features; a classification subsystem establishment unit, used to establish an acoustic channel subsystem for depression detection on the context-aware acoustic features and a text channel subsystem for depression detection on the context-aware text features; and a classification fusion unit, used to fuse the outputs of the acoustic channel subsystem and the text channel subsystem to obtain depression classification information.
- the present invention combines the information obtained by the acoustic channel and the text channel to achieve high-precision multi-modal depression detection.
- the main technical content includes: topic-related data enhancement: based on limited depression speech and text data, the topic information in the free conversation between doctors and depression patients is used to expand the depression speech and text training data; robust analysis and extraction of depression-related features: combining transfer learning and the multi-head self-attention mechanism, acoustic and text feature descriptions are extracted that are topic- and context-aware and that reflect the characteristics of depression patients, improving the accuracy of the detection system; BLSTM-based depression classification model: the strong time-series modeling capability of the BLSTM network is used to capture the long-term dependence between acoustic and text information and the depression diagnosis; multi-modal fusion framework: reinforcement learning is used to fuse the depression detection subsystems of the acoustic channel and the text channel.
- the present invention has the following advantages:
- the existing depression detection methods use only limited depression speech and text data; in contrast, the present invention uses a topic-based data enhancement method to expand the original training data set;
- the present invention uses a CNN with multi-task learning to extract acoustic features with topic-context awareness, and uses a Transformer model to extract text features with topic-context awareness. These context-aware features are deep feature descriptions that can improve the robustness of depression detection;
- the existing depression detection modeling technology does not consider the long-term dependence between speech and text features and the depression diagnosis, whereas the present invention uses the BLSTM network to capture this long-term dependence and achieves better performance;
- the existing multi-modal depression detection technology simply concatenates the outputs of different subsystems for decision-making, whereas the present invention adopts a reinforcement learning method to automatically adjust the subsystem score weights of the different channels and make the final classification decision, giving better performance.
- the present invention may be a system, a method and/or a computer program product.
- the computer program product may include a computer-readable storage medium loaded with computer-readable program instructions for enabling a processor to implement various aspects of the present invention.
- the computer-readable storage medium may be a tangible device that holds and stores instructions used by the instruction execution device.
- the computer-readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
- Computer-readable storage media include: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disk read-only memory (CD-ROM), digital versatile disk (DVD), memory stick, floppy disk, and mechanical encoding devices, such as a printer with instructions stored thereon
- RAM: random access memory
- ROM: read-only memory
- EPROM: erasable programmable read-only memory
- SRAM: static random access memory
- CD-ROM: compact disk read-only memory
- DVD: digital versatile disk
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Human Computer Interaction (AREA)
- Signal Processing (AREA)
- Theoretical Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Biophysics (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computing Systems (AREA)
- Epidemiology (AREA)
- Public Health (AREA)
- Child & Adolescent Psychology (AREA)
- Hospice & Palliative Care (AREA)
- Psychiatry (AREA)
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
Claims (10)
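- A multi-modal depression detection method based on context perception, comprising the following steps: Step S1: constructing a training sample set, the training sample set including topic information, spectrograms, and corresponding text information; Step S2: using a convolutional neural network combined with multi-task learning to perform acoustic feature extraction on the spectrograms of the training sample set, obtaining context-aware acoustic features; Step S3: using the training sample set, processing word embeddings with a Transformer model to extract context-aware text features; Step S4: establishing an acoustic channel subsystem for depression detection on the context-aware acoustic features, and establishing a text channel subsystem for depression detection on the context-aware text features; Step S5: fusing the outputs of the acoustic channel subsystem and the text channel subsystem to obtain depression classification information.
- The method according to claim 1, characterized in that the context-aware acoustic features are obtained according to the following steps: constructing a convolutional neural network comprising an input layer, multiple convolutional layers, multiple fully connected layers, an output layer, and a bottleneck layer located between the last fully connected layer and the output layer, the bottleneck layer having fewer nodes than the convolutional layers and the fully connected layers; inputting the spectrograms in the training sample set into the convolutional neural network, the output layer containing a depression classification task and a topic labeling task; extracting the context-aware acoustic features from the bottleneck layer of the convolutional neural network.
- The method according to claim 1, characterized in that the context-aware text features are extracted according to the following steps: constructing a Transformer model that takes word embeddings plus topic identifiers as input, the Transformer model comprising multiple encoders and decoders containing self-attention and a softmax layer located at the last layer; pre-training the Transformer model parameters on an existing text corpus using an unsupervised training method, and then using transfer learning to perform adaptive training on the collected depression text data; after training is completed, removing the softmax layer and using the output of the Transformer model as the context-aware text features.
- The method according to claim 1, characterized in that step S5 comprises: using a reinforcement learning mechanism to adjust the weight of the acoustic channel subsystem and the weight of the text channel subsystem so that the difference between the final depression classification prediction and the feedback information is minimized; fusing the outputs of the acoustic channel subsystem and the text channel subsystem to obtain a depression classification score.
- The method according to claim 1, characterized in that the acoustic channel subsystem and the text channel subsystem are established based on a BLSTM network; the network input of the acoustic channel subsystem is the perceptual linear prediction coefficients of multiple consecutive frames and the context-aware acoustic features, and its output is a depression classification label; the network input of the text channel subsystem is text information, and its output is a depression classification label.
- The method according to claim 1, characterized in that the topic information in the training sample set includes multiple types of identifiers divided according to the content of the conversation between the doctor and the depression patient.
- A multi-modal depression detection system based on context perception, comprising: a training sample construction unit, used to construct a training sample set, the training sample set including topic information, spectrograms, and corresponding text information; an acoustic feature extraction unit, used to perform acoustic feature extraction on the spectrograms of the training sample set using a convolutional neural network combined with multi-task learning, obtaining context-aware acoustic features; a text feature extraction unit, used to process word embeddings with a Transformer model, using the training sample set, to extract context-aware text features; a classification subsystem establishment unit, used to establish an acoustic channel subsystem for depression detection on the context-aware acoustic features and a text channel subsystem for depression detection on the context-aware text features; a classification fusion unit, used to fuse the outputs of the acoustic channel subsystem and the text channel subsystem to obtain depression classification information.
- A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
- A computer device comprising a memory and a processor, the memory storing a computer program executable on the processor, characterized in that the processor, when executing the program, implements the steps of the method according to any one of claims 1 to 7.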
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911198356.X | 2019-11-29 | ||
CN201911198356.XA CN110728997B (zh) | 2019-11-29 | 2019-11-29 | Multi-modal depression detection system based on context perception |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021104099A1 true WO2021104099A1 (zh) | 2021-06-03 |
Family
ID=69225856
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/129214 WO2021104099A1 (zh) | 2019-11-29 | 2020-11-17 | Multi-modal depression detection method and system based on context perception |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110728997B (zh) |
WO (1) | WO2021104099A1 (zh) |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
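CN113627377A (zh) * | 2021-08-18 | 2021-11-09 | 福州大学 | Cognitive radio spectrum sensing method and system based on Attention-Based CNN |
CN113674767A (zh) * | 2021-10-09 | 2021-11-19 | 复旦大学 | Depression state recognition method based on multi-modal fusion |
CN113822192A (zh) * | 2021-09-18 | 2021-12-21 | 山东大学 | Transformer-based multi-modal feature fusion emotion recognition method, device, and medium for persons in custody |
CN114118200A (zh) * | 2021-09-24 | 2022-03-01 | 杭州电子科技大学 | Multi-modal emotion classification method based on attention-guided bidirectional capsule network |
CN114464182A (zh) * | 2022-03-03 | 2022-05-10 | 慧言科技(天津)有限公司 | Fast adaptation method for speech recognition assisted by audio scene classification |
US20220180056A1 (en) * | 2020-12-09 | 2022-06-09 | Here Global B.V. | Method and apparatus for translation of a natural language query to a service execution language |
CN114973120A (zh) * | 2022-04-14 | 2022-08-30 | 山东大学 | Behavior recognition method and system based on multi-modal heterogeneous fusion of multi-dimensional sensor data and surveillance video |
CN115346561A (zh) * | 2022-08-15 | 2022-11-15 | 南京脑科医院 | Depressive mood assessment and prediction method and system based on speech features |
CN115481681A (zh) * | 2022-09-09 | 2022-12-16 | 武汉中数医疗科技有限公司 | Artificial-intelligence-based method for processing breast sampling data |
CN115969381A (zh) * | 2022-11-16 | 2023-04-18 | 西北工业大学 | EEG signal analysis method based on multi-band fusion and spatio-temporal Transformer |
CN117137488A (zh) * | 2023-10-27 | 2023-12-01 | 吉林大学 | Auxiliary depression recognition method based on EEG data and facial expression images |
CN117497140A (zh) * | 2023-10-09 | 2024-02-02 | 合肥工业大学 | Multi-level depression state detection method based on fine-grained prompt learning |
CN117497140B (zh) * | 2023-10-09 | 2024-05-31 | 合肥工业大学 | Multi-level depression state detection method based on fine-grained prompt learning |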
Families Citing this family (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110728997B (zh) * | 2019-11-29 | 2022-03-22 | 中国科学院深圳先进技术研究院 | 一种基于情景感知的多模态抑郁症检测*** |
CN111150372B (zh) * | 2020-02-13 | 2021-03-16 | 云南大学 | 一种结合快速表示学习和语义学习的睡眠阶段分期*** |
CN111329494B (zh) * | 2020-02-28 | 2022-10-28 | 首都医科大学 | 抑郁症参考数据的获取方法及装置 |
CN111581470B (zh) * | 2020-05-15 | 2023-04-28 | 上海乐言科技股份有限公司 | 用于对话***情景匹配的多模态融合学习分析方法和*** |
CN112006697B (zh) * | 2020-06-02 | 2022-11-01 | 东南大学 | 一种基于语音信号的梯度提升决策树抑郁程度识别*** |
CN111798874A (zh) * | 2020-06-24 | 2020-10-20 | 西北师范大学 | 一种语音情绪识别方法及*** |
CN113269277B (zh) * | 2020-07-27 | 2023-07-25 | 西北工业大学 | 基于Transformer编码器和多头多模态注意力的连续维度情感识别方法 |
CN112966429A (zh) * | 2020-08-11 | 2021-06-15 | 中国矿业大学 | 基于WGANs数据增强的非线性工业过程建模方法 |
CN112631147B (zh) * | 2020-12-08 | 2023-05-02 | 国网四川省电力公司经济技术研究院 | 一种面向脉冲噪声环境的智能电网频率估计方法及*** |
CN112768070A (zh) * | 2021-01-06 | 2021-05-07 | 万佳安智慧生活技术(深圳)有限公司 | 一种基于对话交流的精神健康评测方法和*** |
CN112885334A (zh) * | 2021-01-18 | 2021-06-01 | 吾征智能技术(北京)有限公司 | 基于多模态特征的疾病认知***、设备、存储介质 |
CN112818892B (zh) * | 2021-02-10 | 2023-04-07 | 杭州医典智能科技有限公司 | 基于时间卷积神经网络的多模态抑郁症检测方法及*** |
CN113012720B (zh) * | 2021-02-10 | 2023-06-16 | 杭州医典智能科技有限公司 | 谱减法降噪下多语音特征融合的抑郁症检测方法 |
CN115346657B (zh) * | 2022-07-05 | 2023-07-28 | 深圳市镜象科技有限公司 | 利用迁移学习提升老年痴呆的识别效果的训练方法及装置 |
CN116843377A (zh) * | 2023-07-25 | 2023-10-03 | 河北鑫考科技股份有限公司 | 基于大数据的消费行为预测方法、装置、设备及介质 |
CN116965817B (zh) * | 2023-07-28 | 2024-03-15 | 长江大学 | 一种基于一维卷积网络和Transformer的EEG情感识别方法 |
CN116978409A (zh) * | 2023-09-22 | 2023-10-31 | 苏州复变医疗科技有限公司 | 基于语音信号的抑郁状态评估方法、装置、终端及介质 |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2016028495A1 (en) * | 2014-08-22 | 2016-02-25 | Sri International | Systems for speech-based assessment of a patient's state-of-mind |
CN107133481A (zh) * | 2017-05-22 | 2017-09-05 | 西北工业大学 | 基于dcnn‑dnn和pv‑svm的多模态抑郁症估计和分类方法 |
CN107657964A (zh) * | 2017-08-15 | 2018-02-02 | 西北大学 | 基于声学特征和稀疏数学的抑郁症辅助检测方法及分类器 |
JP2018121749A (ja) * | 2017-01-30 | 2018-08-09 | 株式会社リコー | 診断装置、プログラム及び診断システム |
CN109599129A (zh) * | 2018-11-13 | 2019-04-09 | 杭州电子科技大学 | 基于注意力机制和卷积神经网络的语音抑郁症识别方法 |
CN110728997A (zh) * | 2019-11-29 | 2020-01-24 | 中国科学院深圳先进技术研究院 | 一种基于情景感知的多模态抑郁症检测方法和*** |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10204625B2 (en) * | 2010-06-07 | 2019-02-12 | Affectiva, Inc. | Audio analysis learning using video data |
EP3252769B8 (en) * | 2016-06-03 | 2020-04-01 | Sony Corporation | Adding background sound to speech-containing audio data |
US11557311B2 (en) * | 2017-07-21 | 2023-01-17 | Nippon Telegraph And Telephone Corporation | Satisfaction estimation model learning apparatus, satisfaction estimating apparatus, satisfaction estimation model learning method, satisfaction estimation method, and program |
CN107316654A (zh) * | 2017-07-24 | 2017-11-03 | 湖南大学 | 基于dis‑nv特征的情感识别方法 |
GB2567826B (en) * | 2017-10-24 | 2023-04-26 | Cambridge Cognition Ltd | System and method for assessing physiological state |
CN108764010A (zh) * | 2018-03-23 | 2018-11-06 | 姜涵予 | 情绪状态确定方法及装置 |
WO2019225801A1 (ko) * | 2018-05-23 | 2019-11-28 | 한국과학기술원 | 사용자의 음성 신호를 기반으로 감정, 나이 및 성별을 동시에 인식하는 방법 및 시스템 |
CN109389992A (zh) * | 2018-10-18 | 2019-02-26 | 天津大学 | 一种基于振幅和相位信息的语音情感识别方法 |
CN109841231B (zh) * | 2018-12-29 | 2020-09-04 | 深圳先进技术研究院 | 一种针对汉语普通话的早期ad言语辅助筛查*** |
CN110047516A (zh) * | 2019-03-12 | 2019-07-23 | 天津大学 | 一种基于性别感知的语音情感识别方法 |
- 2019-11-29: CN application CN201911198356.XA, patent CN110728997B (zh), status: Active
- 2020-11-17: WO application PCT/CN2020/129214, publication WO2021104099A1 (zh), status: Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN110728997A (zh) | 2020-01-24 |
CN110728997B (zh) | 2022-03-22 |
Legal Events
Code | Title | Description |
---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 20892740; Country of ref document: EP; Kind code of ref document: A1 |
NENP | Non-entry into the national phase | Ref country code: DE |
122 | Ep: pct application non-entry in european phase | Ref document number: 20892740; Country of ref document: EP; Kind code of ref document: A1 |
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 240123) |
122 | Ep: pct application non-entry in european phase | Ref document number: 20892740; Country of ref document: EP; Kind code of ref document: A1 |