WO2021104099A1 - A context-aware multi-modal depression detection method and system - Google Patents

A context-aware multi-modal depression detection method and system

Info

Publication number
WO2021104099A1
Authority
WO
WIPO (PCT)
Prior art keywords
depression
text
acoustic
channel subsystem
context
Prior art date
Application number
PCT/CN2020/129214
Other languages
English (en)
French (fr)
Inventor
苏荣锋
王岚
燕楠
Original Assignee
中国科学院深圳先进技术研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院深圳先进技术研究院 (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences)
Publication of WO2021104099A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/66 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition

Definitions

  • the present invention relates to the technical field of depression detection, in particular to a multi-modal depression detection method and system based on context perception.
  • Deep learning is a newer field of machine learning that models data at a high level of abstraction by composing multiple layers of non-linear transformations; deep learning algorithms make raw data easier to exploit for learning and training in many directions.
  • For example, a CNN and an LSTM have been combined into a new deep network that extracts acoustic features from the speech signal and uses them for depression detection.
  • Another example is semantic analysis of the conversation between the doctor and the depressed patient, using techniques such as filled pause extraction, Principal Components Analysis (PCA), and the whitening transform to obtain text features, which are then combined with a linear Support Vector Regressor (SVR) classifier for depression classification.
  • the acoustic features used in the prior art are hand-crafted 279-dimensional features, and the text features are 100-dimensional word embedding vectors extracted with the Doc2Vec tool.
  • the existing technology mainly has the following problems: in terms of the amount of training data, most existing multi-modal depression detection systems based on speech, text, or images are trained on limited depression data, so their performance is low;
  • in terms of feature extraction, existing methods lack verbal information related to topic and context and are insufficiently expressive in the field of depression detection, which limits the performance of the final depression detection system; in terms of depression classification modeling, the existing technology does not consider the long-term dependency between speech/text features and the depression diagnosis; in terms of multi-modal fusion, the prior art simply concatenates the subsystem outputs obtained from different modalities or channels and then makes a decision, ignoring the relative weights of the modalities or channels, so performance is limited.
  • the purpose of the present invention is to overcome the above-mentioned shortcomings of the prior art and provide a multi-modal depression detection method and system based on context perception.
  • a multi-modal depression detection method based on context perception includes the following steps:
  • Step S1: construct a training sample set, the training sample set including topic information, spectrograms, and corresponding text information;
  • Step S2: use a convolutional neural network, combined with multi-task learning, to extract acoustic features from the spectrograms of the training sample set, obtaining context-aware acoustic features;
  • Step S3: using the training sample set, process word embeddings with a Transformer model to extract context-aware text features;
  • Step S4: establish an acoustic channel subsystem for depression detection from the context-aware acoustic features, and establish a text channel subsystem for depression detection from the context-aware text features;
  • Step S5: fuse the outputs of the acoustic channel subsystem and the text channel subsystem to obtain depression classification information.
  • the context-aware acoustic features are obtained according to the following steps:
  • the convolutional neural network includes an input layer, multiple convolutional layers, multiple fully connected layers, an output layer, and a bottleneck layer located between the last fully connected layer and the output layer;
  • the bottleneck layer has fewer nodes than the convolutional and fully connected layers;
  • the output layer contains the depression classification task and the topic labeling task;
  • the acoustic features of the context perception are extracted from the bottleneck layer of the convolutional neural network.
  • the context-aware text features are extracted according to the following steps:
  • the Transformer model includes multiple encoders and decoders with self-attention and a softmax layer at the last layer;
  • the softmax layer is removed, and the output of the Transformer model is used as the context-aware text feature.
  • step S5 includes:
  • the outputs of the acoustic channel subsystem and the text channel subsystem are merged to obtain a classification score for depression.
  • the classification score for depression is expressed as: Score = Σ_i w_i · S_i
  • where the weight w_i = [λ_1, λ_2, …, λ_c], c is the number of depression classes, and S_i is the output of the i-th subsystem.
  • the acoustic channel subsystem and the text channel subsystem are established based on a BLSTM network; the network input of the acoustic channel subsystem is the perceptual linear prediction coefficients of multiple consecutive frames together with the context-aware acoustic features, and its output is a depression class label;
  • the network input of the text channel subsystem is text information, and its output is a depression class label.
  • the topic information in the training sample set includes multiple types of identifiers classified based on the content of the conversation between the doctor and the depression patient.
  • a multi-modal depression detection system based on contextual perception includes:
  • Training sample construction unit: used to construct a training sample set, the training sample set including topic information, spectrograms, and corresponding text information;
  • Acoustic feature extraction unit: used to extract acoustic features from the spectrograms of the training sample set with a convolutional neural network combined with multi-task learning, obtaining context-aware acoustic features;
  • Text feature extraction unit: used to process word embeddings with a Transformer model using the training sample set, extracting context-aware text features;
  • Classification subsystem establishment unit: used to establish an acoustic channel subsystem for depression detection from the context-aware acoustic features, and to establish a text channel subsystem for depression detection from the context-aware text features;
  • Classification fusion unit: used to fuse the outputs of the acoustic channel subsystem and the text channel subsystem to obtain depression classification information.
  • the present invention has the advantage of using data augmentation to expand the depression speech and text training data according to the topic information in the free doctor-patient conversations, and of using that data for model training;
  • acquiring verbal information relevant to depression detection, including speaker-independent, highly depression-relevant, context-aware acoustic features, and highly depression-relevant, context-aware text features;
  • establishing depression detection subsystems in the acoustic channel and the text channel;
  • using a reinforcement learning method to obtain a multi-system fusion framework, achieving robust automatic multi-modal depression detection.
  • Figure 1 is the overall framework diagram of a context-aware multi-modal depression detection method according to an embodiment of the present invention;
  • Figure 2 is a flowchart of a context-aware multi-modal depression detection method according to an embodiment of the present invention;
  • Figure 3 is a schematic diagram of topic-based data augmentation;
  • Figure 4 is a schematic diagram of the acoustic feature extraction process based on CNN and multi-task learning;
  • Figure 5 is a schematic diagram of the text feature extraction process based on the multi-head self-attention mechanism;
  • Figure 6 is a schematic diagram of reinforcement learning.
  • the overall technical solution is: first, a topic-based data augmentation method is adopted to obtain more topic-related depression speech and text data; then a CNN network combined with multi-task learning is used to extract context-aware acoustic features from the spectrogram, and a Transformer is used to process word embeddings to obtain context-aware text features; next, using the context-aware acoustic features and the context-aware text features respectively, depression detection subsystems are established with a BLSTM (bidirectional long short-term memory network) model; finally, a reinforcement learning method is used to make a fusion decision over the outputs of the subsystems and obtain the final depression classification.
  • the multi-modal depression detection method based on context perception includes the following steps:
  • Step S210 Obtain a training sample set with context awareness.
  • the training sample set can be expanded based on the original training set to include context perception information.
  • the original data set usually only includes the correspondence between speech and text.
  • topic labeling is performed on each pair of speech and text data in the existing training set. For example, divide the content of conversations between doctors and patients with depression into 7 topics: whether they are interested, whether they sleep well, whether they feel depressed, whether they feel defeated, self-evaluation, whether they have ever been diagnosed with depression, and whether their parents have ever suffered from depression.
  • Some new training samples can be obtained in this way; splicing them together with the original training samples expands the original data set into a new training sample set.
  • in this step, by defining the content of multiple topics that the doctor discusses with the depressed patient and expanding the original training data set by random combination, a richer context-aware training sample set can be obtained, including topic information, spectrograms, text information, and the corresponding class labels, thereby improving the accuracy of subsequent training.
  • Step S220 extracting acoustic features with context awareness based on CNN and multi-task learning.
  • the present invention combines multi-task learning with a convolutional neural network (CNN) for classification network training.
  • the input of the CNN network is the spectrogram of each training sample, and the CNN network includes several convolutional layers and several fully connected layers.
  • in the convolutional layers, downsampling is performed using, for example, max pooling.
  • between the last fully connected layer and the output layer, the embodiment of the present invention inserts a bottleneck layer that contains only a small number of nodes, for example 39.
  • the output layer of the CNN network contains two tasks.
  • the first task is the classification of depression, for example, classification into multiple categories such as mild, severe, moderate, and normal.
  • the second task is the labeling of different topics (topic identification).
  • the context-aware acoustic features are extracted from the bottleneck layer of the CNN network, and are spliced with traditional acoustic features for subsequent classification network training.
  • in this step, a CNN and multi-task learning are used: the first task is depression classification and the second task is topic labeling; the output of the network's bottleneck layer is used as the acoustic feature with topic-context-aware characteristics.
  • Step S230 extracting context-aware text features based on the multi-head self-attention mechanism.
  • a Transformer model based on a multi-head self-attention mechanism is used to analyze the semantics of sentences, so as to extract context-aware text features.
  • the input of the Transformer model is traditional word embedding plus topic ID (identification), and its main structure is composed of multiple encoders and decoders containing self-attention, which is the so-called multi-head mechanism.
  • because the Transformer model allows direct connections between data units, the model can take into account attention information from different positions and better capture long-term dependencies.
  • to ensure that the Transformer model is sufficiently trained, in the embodiment of the present invention its parameters are first pre-trained on large-scale text corpora (such as Weibo or Wikipedia) using an unsupervised training method, and transfer learning is then used to perform adaptive training on the collected depression text data.
  • the last softmax layer in Figure 5 is removed, and the output is then used as the text feature, i.e., the extracted context-aware text feature, which is used for subsequent depression detection model training.
  • the Transformer model can be used to extract robust text features.
  • in step S240, depression detection subsystems are established for the context-aware acoustic features and the context-aware text features, respectively.
  • the embodiment of the present invention adopts a BLSTM-based method to establish a depression classification sub-network (or a sub-system).
  • BLSTM can cache the current input and use it in the previous and next computations, implicitly incorporating temporal information into the model and thereby modeling long-term dependencies.
  • the BLSTM network adopted in the embodiment of the present invention has a total of 3 BLSTM layers, and each layer contains 128 nodes.
  • for the acoustic channel, the network input is 11 consecutive frames of PLP (perceptual linear prediction) coefficients together with the context-aware acoustic features, and the output is the depression class label;
  • for the text channel, the network input is the context-aware text features of a training sample, and the output is the depression class label.
  • the BLSTM network is used to establish a depression classification model to capture the long-term dependence of acoustic features or text features with the diagnosis of depression.
  • Step S250 Use reinforcement learning to fuse the outputs of the various depression detection subsystems to obtain the final depression classification.
  • the embodiment of the present invention adopts a reinforcement learning mechanism to minimize the difference between the final depression prediction result and feedback information of the combined system by adjusting the weight of each subsystem.
  • the final score for depression is expressed as: Score = Σ_i w_i · S_i (1), where the weight w_i = [λ_1, λ_2, …, λ_c], c is the number of depression classes, and S_i is the output of the i-th subsystem
  • the decision score function L_t of reinforcement learning at time t is defined as: L_t = W(A_{t-1}) · D - C (2)
  • A_{t-1} represents the feedback at time t-1
  • D represents the difference between the actual and predicted results on the development set
  • W represents the weights {w_i} of all subsystems
  • C represents the global accuracy rate on the development set. Therefore, L_t is summed over all times and maximized, and the resulting W* is the final subsystem weight, expressed as: W* = arg max_W Σ_t L_t (3)
  • a hidden Markov model or other models can be used for reinforcement learning.
  • the reinforcement learning method is used to automatically adjust the weights of the subsystem score of the acoustic channel and the subsystem score of the text channel, so that they can be organically integrated to perform the final depression classification.
  • the trained network models can be used on new data (including topics, speech, text, etc.), following a process similar to training, to make depression classification predictions.
  • besides BLSTM, other models that incorporate temporal information can also be used.
  • the present invention also provides a multi-modal depression detection system based on context perception.
  • the system includes: a training sample construction unit, used to construct a training sample set including topic information, spectrograms, and corresponding text information; an acoustic feature extraction unit, used to extract acoustic features from the spectrograms of the training sample set with a convolutional neural network combined with multi-task learning, obtaining context-aware acoustic features; a text feature extraction unit, used to process word embeddings with a Transformer model using the training sample set, extracting context-aware text features; a classification subsystem establishment unit, used to establish an acoustic channel subsystem for depression detection from the context-aware acoustic features and a text channel subsystem for depression detection from the context-aware text features; and a classification fusion unit, used to fuse the outputs of the acoustic channel subsystem and the text channel subsystem to obtain depression classification information.
  • the present invention combines the information obtained by the acoustic channel and the text channel to achieve high-precision multi-modal depression detection.
  • the main technical content includes: topic-related data augmentation, in which, starting from limited depression speech and text data, the topic information in the free doctor-patient conversations is used to expand the depression speech and text training data; robust analysis and extraction of depression-related features, in which transfer learning and the multi-head self-attention mechanism are combined to extract topic-context-aware acoustic and text feature descriptions that reveal the characteristics of depressed patients, improving the accuracy of the detection system; a BLSTM-based depression classification model, which uses the strong sequence modeling capability of the BLSTM network to capture the long-term dependency between acoustic/text information and the depression diagnosis; and a multi-modal fusion framework, which uses reinforcement learning to fuse the depression detection subsystems of the acoustic channel and the text channel.
  • the present invention has the following advantages:
  • existing depression detection methods use only limited depression speech and text data; in contrast, the present invention uses a topic-based data augmentation method to expand the original training data set;
  • most existing techniques use features that lack topic-context awareness; in contrast, the present invention uses a CNN with multi-task learning to extract acoustic features with topic-context-aware characteristics and uses a Transformer model to extract text features with topic-context-aware characteristics; these are deep feature descriptions that improve the robustness of depression detection;
  • existing depression detection modeling techniques do not consider the long-term dependency between speech/text features and the depression diagnosis; in contrast, the present invention uses a BLSTM network to capture the long-term dependency between acoustic or text features and the depression diagnosis, giving better performance;
  • existing multi-modal depression detection techniques simply concatenate the outputs of different subsystems for decision-making; in contrast, the present invention adopts a reinforcement learning method to automatically adjust the subsystem score weights of the different channels and make the final classification decision, giving better performance.
  • the present invention may be a system, a method and/or a computer program product.
  • the computer program product may include a computer-readable storage medium loaded with computer-readable program instructions for enabling a processor to implement various aspects of the present invention.
  • the computer-readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device.
  • the computer-readable storage medium may include, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing, for example.
  • more specific examples (a non-exhaustive list) of computer-readable storage media include: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital versatile discs (DVD), memory sticks, floppy disks, and mechanical encoding devices such as punched cards or raised structures in grooves with instructions stored thereon, as well as any suitable combination of the foregoing.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Child & Adolescent Psychology (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A context-aware multi-modal depression detection method and system. The method includes: constructing a training sample set, the training sample set including topic information, spectrograms, and corresponding text information; using a convolutional neural network, combined with multi-task learning, to extract acoustic features from the spectrograms of the training sample set, obtaining context-aware acoustic features; using the training sample set, processing word embeddings with a Transformer model to extract context-aware text features; establishing an acoustic channel subsystem for depression detection from the context-aware acoustic features, and establishing a text channel subsystem for depression detection from the context-aware text features; and fusing the outputs of the acoustic channel subsystem and the text channel subsystem to obtain depression classification information. The method can improve the accuracy of depression detection.

Description

A context-aware multi-modal depression detection method and system

Technical Field

The present invention relates to the technical field of depression detection, and in particular to a context-aware multi-modal depression detection method and system.

Background Art
Regarding depression-related feature extraction: early speech-based depression research focused mainly on time-domain features, such as pause duration, recording duration, response time to questions, and speaking rate. It was later found that no single feature carries enough discriminative information to assist clinical diagnosis. As speech signals were studied more deeply, a large number of additional speech features were constructed, and researchers tried various feature combinations in the hope of building classification models that can detect depressed patients. These features include pitch, energy, speaking rate, formants, and Mel-frequency cepstral coefficients (MFCC). Text is another kind of depression-related information "hidden" in the speech signal, and it is relatively easy to obtain from the speech signal. Studies show that depressed patients use noticeably more negative-emotion words and anger words than healthy people, and word-frequency statistics have often been used as the text feature representation. Such features are low-level text features; recently, high-level text features, namely word embedding features, have been preferred for describing depressive states, and common network structures for obtaining word embeddings include skip-gram and CBOW (continuous bag-of-words).
Regarding depression detection with limited depression speech and text data: since it is difficult to collect speech and text data from depressed patients on a large scale, the speech databases available for depression research are generally small, and researchers have so far mostly used relatively simple classification models for depression detection. Traditional speech-based depression detection methods include the Support Vector Machine (SVM), decision trees, and the Gaussian Mixture Model (GMM). Deep learning is a newer field of machine learning that models data at a high level of abstraction by composing multiple layers of non-linear transformations; deep learning algorithms make raw data easier to exploit for learning and training in many directions. For example, a CNN and an LSTM have been combined into a new deep network that extracts acoustic features from the speech signal and uses them for depression detection. As another example, semantic analysis of the conversations between doctors and depressed patients, using techniques such as filled pause extraction, Principal Components Analysis (PCA), and the whitening transform, yields text features that are combined with a linear Support Vector Regressor (SVR) classifier for depression classification. As yet another example, separate LSTM layers first process the acoustic channel and the text channel respectively, the resulting features are then fed into fully connected layers, and the depression class is finally output. The acoustic features used in the prior art are hand-crafted 279-dimensional features, and the text features are 100-dimensional word embedding vectors extracted with the Doc2Vec tool.
In the prior art, detection approaches based on biochemical reagents and on EEG are commonly adopted, while the solutions based on speech, text, or images mostly rely on speech data and perform depression detection on the basis of feature extraction and classification. In short, the prior art mainly has the following problems. In terms of the amount of training data, most existing multi-modal depression detection systems based on speech, text, or images are trained on limited depression data, so their performance is low. In terms of feature extraction, existing methods lack verbal information related to topic and context and are insufficiently expressive in the depression detection domain, which limits the performance of the final detection system. In terms of depression classification modeling, the prior art does not consider the long-term dependency between speech/text features and the depression diagnosis. In terms of multi-modal fusion, the prior art simply concatenates the subsystem outputs obtained from different modalities or channels and then makes a decision, ignoring the relative importance of the modalities or channels, so performance is limited.
Summary of the Invention

The purpose of the present invention is to overcome the above-mentioned shortcomings of the prior art and to provide a context-aware multi-modal depression detection method and system.

According to a first aspect of the present invention, a context-aware multi-modal depression detection method is provided. The method includes the following steps:
Step S1: construct a training sample set, the training sample set including topic information, spectrograms, and corresponding text information;

Step S2: use a convolutional neural network, combined with multi-task learning, to extract acoustic features from the spectrograms of the training sample set, obtaining context-aware acoustic features;

Step S3: using the training sample set, process word embeddings with a Transformer model to extract context-aware text features;

Step S4: establish an acoustic channel subsystem for depression detection from the context-aware acoustic features, and establish a text channel subsystem for depression detection from the context-aware text features;

Step S5: fuse the outputs of the acoustic channel subsystem and the text channel subsystem to obtain depression classification information.
In one embodiment, the context-aware acoustic features are obtained according to the following steps:

construct a convolutional neural network including an input layer, multiple convolutional layers, multiple fully connected layers, an output layer, and a bottleneck layer located between the last fully connected layer and the output layer, the bottleneck layer having fewer nodes than the convolutional and fully connected layers;

input the spectrograms of the training sample set into the convolutional neural network, the output layer containing a depression classification task and a topic labeling task;

extract the context-aware acoustic features from the bottleneck layer of the convolutional neural network.
In one embodiment, the context-aware text features are extracted according to the following steps:

construct a Transformer model whose input is the word embeddings plus topic identifiers, the Transformer model including multiple encoders and decoders with self-attention and a softmax layer as the last layer;

pre-train the Transformer model parameters on existing text corpora using an unsupervised training method, and then use transfer learning to perform adaptive training on the collected depression text data;

after training is complete, remove the softmax layer and take the output of the Transformer model as the context-aware text features.
In one embodiment, step S5 includes:

using a reinforcement learning mechanism to adjust the weight of the acoustic channel subsystem and the weight of the text channel subsystem so that the difference between the final depression classification prediction and the feedback information is minimized;

fusing the outputs of the acoustic channel subsystem and the text channel subsystem to obtain a depression classification score.
In one embodiment, the depression classification score is expressed as:

Score = Σ_i w_i · S_i

where the weights w_i = [λ_1, λ_2, …, λ_c] and c is the number of depression classes.
In one embodiment, the acoustic channel subsystem and the text channel subsystem are established based on a BLSTM network; the network input of the acoustic channel subsystem is the perceptual linear prediction coefficients of multiple consecutive frames together with the context-aware acoustic features, and its output is a depression class label; the network input of the text channel subsystem is text information, and its output is a depression class label.

In one embodiment, the topic information in the training sample set includes multiple type identifiers derived from the content of conversations between doctors and depressed patients.
According to a second aspect of the present invention, a context-aware multi-modal depression detection system is provided. The system includes:

a training sample construction unit: used to construct a training sample set, the training sample set including topic information, spectrograms, and corresponding text information;

an acoustic feature extraction unit: used to extract acoustic features from the spectrograms of the training sample set with a convolutional neural network combined with multi-task learning, obtaining context-aware acoustic features;

a text feature extraction unit: used to process word embeddings with a Transformer model using the training sample set, extracting context-aware text features;

a classification subsystem establishment unit: used to establish an acoustic channel subsystem for depression detection from the context-aware acoustic features, and to establish a text channel subsystem for depression detection from the context-aware text features;

a classification fusion unit: used to fuse the outputs of the acoustic channel subsystem and the text channel subsystem to obtain depression classification information.
Compared with the prior art, the present invention has the following advantages: data augmentation is used to expand the depression speech and text training data according to the topic information in the free doctor-patient conversations, and this data is used for model training; verbal information relevant to depression detection is acquired, including speaker-independent, highly depression-relevant, context-aware acoustic features and highly depression-relevant, context-aware text features; depression detection subsystems are established in the acoustic channel and the text channel, taking into account the topic-context information in the free doctor-patient conversations; and a reinforcement learning method is used to obtain a multi-system fusion framework, achieving robust automatic multi-modal depression detection.
Brief Description of the Drawings

The following drawings merely illustrate and explain the present invention and are not intended to limit its scope, in which:
Figure 1 is the overall framework diagram of a context-aware multi-modal depression detection method according to an embodiment of the present invention;

Figure 2 is a flowchart of a context-aware multi-modal depression detection method according to an embodiment of the present invention;

Figure 3 is a schematic diagram of topic-based data augmentation;

Figure 4 is a schematic diagram of the acoustic feature extraction process based on CNN and multi-task learning;

Figure 5 is a schematic diagram of the text feature extraction process based on the multi-head self-attention mechanism;

Figure 6 is a schematic diagram of reinforcement learning.
Detailed Description

To make the purpose, technical solution, design method, and advantages of the present invention clearer, the invention is described in further detail below through specific embodiments in conjunction with the drawings. It should be understood that the specific embodiments described here are only intended to explain the present invention, not to limit it.

In all examples shown and discussed herein, any specific value should be interpreted as merely illustrative rather than limiting; other instances of the exemplary embodiments may therefore have different values.

Techniques, methods, and devices known to those of ordinary skill in the relevant art may not be discussed in detail, but where appropriate, they should be regarded as part of the specification.
For a further understanding of the present invention, first refer to Fig. 1. The overall technical solution is: first, a topic-based data augmentation method is adopted to obtain more topic-related depression speech and text data; then a CNN network combined with multi-task learning is used to extract context-aware acoustic features from the spectrogram, and a Transformer is used to process word embeddings to obtain context-aware text features; next, using the context-aware acoustic features and the context-aware text features respectively, depression detection subsystems are established with a BLSTM (bidirectional long short-term memory network) model; finally, a reinforcement learning method is used to make a fusion decision over the outputs of the subsystems and obtain the final depression classification.

Specifically, referring to Fig. 2, the context-aware multi-modal depression detection method of an embodiment of the present invention includes the following steps:
Step S210: obtain a context-aware training sample set.

The training sample set can be expanded from the original training set so that it contains context information; the original data set usually only includes the correspondence between speech and text.

Specifically, first, topic labeling is performed on every speech-text pair in the existing training set. For example, the content of the conversations between doctors and depressed patients is divided into 7 topics: whether the patient is interested in things, whether they sleep well, whether they feel down, whether they feel like a failure, their self-evaluation, whether they have ever been diagnosed with depression, and whether their parents ever suffered from depression.
Next, the original training set is expanded as follows:

for the speech and text belonging to each subject in the training set, count the number of unique topics; if this number is greater than or equal to m, take the subject as a candidate for data augmentation, where m is the prescribed minimum number of topics;

for each candidate subject, randomly select n speech-text pairs belonging to that subject as a new combination;

for each new combination, randomly shuffle the order of its speech-text pairs and use the result as a new training sample, as shown in Fig. 3.
New training samples are obtained in this way; splicing them together with the original training samples expands the original data set into a new training sample set.

In this step, by defining the content of multiple topics that the doctor discusses with the depressed patient and expanding the original training data set by random combination, a richer context-aware training sample set can be obtained, including topic information, spectrograms, text information, and the corresponding class labels, thereby improving the accuracy of subsequent training. A minimal sketch of this augmentation is given below.
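The following Python sketch is illustrative only: the patent provides no code, and the helper name, the assumed data layout, and the default values of m, n, and the number of new combinations per subject are all hypothetical.

    import random

    # Each sample is assumed to be a (topic_id, speech_path, text) triple,
    # grouped by subject; this layout is an assumption made for the sketch.
    def augment_by_topic(samples_by_subject, m=4, n=3, new_per_subject=2, seed=0):
        """Recombine and shuffle a subject's topic-labelled speech-text pairs
        to create new training samples (cf. Fig. 3)."""
        rng = random.Random(seed)
        augmented = []
        for subject, pairs in samples_by_subject.items():
            unique_topics = {topic for topic, _, _ in pairs}
            if len(unique_topics) < m:        # keep subjects covering >= m topics
                continue
            for _ in range(new_per_subject):  # new combinations per subject
                combo = rng.sample(pairs, k=min(n, len(pairs)))
                rng.shuffle(combo)            # random order forms a new sample
                augmented.append(combo)
        return augmented

The returned combinations would then be concatenated with the original training set, as described above.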
Step S220: extract context-aware acoustic features based on CNN and multi-task learning.

In traditional methods, the acoustic features used (such as speaking rate, pitch, and pause duration) are all designed from domain-specific human knowledge. Because these traditional features are insufficiently expressive in the depression domain, they impair the accuracy of the final detection results. From a biological point of view, human visual perception proceeds from low-level local perception to high-level global perception, and the convolutional neural network (CNN) mimics exactly this process. In a CNN, after local weight sharing and a series of non-linear transformations, redundant and confusing information in the original visual input is removed, and only the most discriminative information of each local region is retained. That is to say, the features obtained by a CNN contain only the description "common" to different speakers, while individual information is discarded.

To make the final features contain information at different levels, the present invention combines multi-task learning with a CNN for classification network training. Referring to Fig. 4, the input of the CNN is the spectrogram of each training sample, and the network contains several convolutional layers and several fully connected layers. In the convolutional layers, downsampling is performed using, for example, max pooling. Between the last fully connected layer and the output layer, the embodiment of the present invention inserts a bottleneck layer that contains only a small number of nodes, for example 39. The output layer of the CNN contains two tasks: the first task is depression classification, for example into categories such as mild, moderate, severe, and normal; the second task is the labeling of the different topics (topic identification).

Note that, in the embodiment of the present invention, the context-aware acoustic features are extracted from the bottleneck layer of the CNN and concatenated with traditional acoustic features for the subsequent classification network training.

In this step, a CNN and multi-task learning are used, where the first task is depression classification and the second task is topic labeling, and the output of the network's bottleneck layer serves as the acoustic feature with topic-context-aware characteristics. A minimal sketch of such a network follows.
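The sketch below (PyTorch) is a rough illustration of this multi-task CNN; apart from the 39-node bottleneck, all layer counts and sizes are assumptions, since the patent specifies only the overall structure.

    import torch
    import torch.nn as nn

    class MultiTaskCNN(nn.Module):
        def __init__(self, n_depression_classes=4, n_topics=7, bottleneck_dim=39):
            super().__init__()
            self.conv = nn.Sequential(          # convolutional layers with max pooling
                nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.AdaptiveAvgPool2d((4, 4)),
            )
            self.fc = nn.Sequential(            # fully connected layers
                nn.Flatten(), nn.Linear(64 * 4 * 4, 256), nn.ReLU(),
            )
            self.bottleneck = nn.Linear(256, bottleneck_dim)   # 39-node bottleneck
            self.depression_head = nn.Linear(bottleneck_dim, n_depression_classes)
            self.topic_head = nn.Linear(bottleneck_dim, n_topics)

        def forward(self, spectrogram):
            z = self.bottleneck(self.fc(self.conv(spectrogram)))
            # z is the context-aware acoustic feature extracted after training
            return self.depression_head(z), self.topic_head(z), z

    model = MultiTaskCNN()
    x = torch.randn(8, 1, 128, 128)             # a batch of spectrograms
    dep_logits, topic_logits, features = model(x)

Training would minimize a weighted sum of the two cross-entropy losses (depression classification and topic labeling), after which the bottleneck output z is taken as the feature.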
Step S230: extract context-aware text features based on the multi-head self-attention mechanism.

Traditional methods use word embeddings to describe a piece of text; however, such features have difficulty understanding sentence meaning at the semantic level, and on certain depression-related topics they badly lack the associated semantic-emotional representation. The self-attention mechanism imitates the internal process of biological observation and is good at capturing the internal correlations of data or features.

In the embodiment of the present invention, a Transformer model based on the multi-head self-attention mechanism is used to analyze the semantics of sentences and thereby extract context-aware text features. Referring to Fig. 5, the input of the Transformer model is the traditional word embedding plus the topic ID (identifier); its main structure consists of multiple encoders and decoders containing self-attention, i.e., the so-called multi-head mechanism. Because the Transformer model allows direct connections between data units, the model can take into account attention information from different positions and better capture long-term dependencies. In addition, to ensure that the Transformer model is sufficiently trained, in the embodiment of the present invention its parameters are first pre-trained on large-scale text corpora (such as Weibo or Wikipedia) using an unsupervised training method, and transfer learning is then used to perform adaptive training on the collected depression text data. After training, the last softmax layer in Fig. 5 is removed, and the output is used as the text feature, i.e., the extracted context-aware text feature, which is used for the subsequent depression detection model training.

In this step, with word embeddings and topic-context information combined as input, the Transformer model extracts robust text features. A minimal sketch follows.
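As a rough illustration (an encoder-only simplification of the encoder-decoder structure described above; the vocabulary size, model dimensions, and mean-pooling readout are assumptions, not from the patent), the feature extractor might look like:

    import torch
    import torch.nn as nn

    class TopicAwareTextEncoder(nn.Module):
        def __init__(self, vocab_size=30000, n_topics=7, d_model=256,
                     n_heads=8, n_layers=4):
            super().__init__()
            self.word_emb = nn.Embedding(vocab_size, d_model)
            self.topic_emb = nn.Embedding(n_topics, d_model)  # topic ID joins the input
            layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, n_layers)

        def forward(self, token_ids, topic_id):
            x = self.word_emb(token_ids) + self.topic_emb(topic_id).unsqueeze(1)
            h = self.encoder(x)        # multi-head self-attention over the sentence
            return h.mean(dim=1)       # pooled context-aware text feature

    encoder = TopicAwareTextEncoder()
    tokens = torch.randint(0, 30000, (8, 20))  # a batch of 20-token sentences
    topics = torch.randint(0, 7, (8,))
    text_features = encoder(tokens, topics)    # shape (8, 256)

In the patent's pipeline, the model would first be pre-trained on large corpora and adapted to the depression text data before its output is used as the feature.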
Step S240: establish depression detection subsystems for the context-aware acoustic features and the context-aware text features, respectively.

A depression diagnosis is usually determined not by a single frame or a single sentence at one moment, but by the combined information of many utterances over a long time, i.e., the so-called long-term dependency. To capture this long-term dependency, the embodiment of the present invention builds the depression classification sub-networks (or subsystems) on a BLSTM basis. A BLSTM can cache the current input and use it in the previous and the next computation, implicitly incorporating temporal information into the model and thereby modeling long-term dependencies. The BLSTM network adopted in the embodiment of the present invention has 3 BLSTM layers in total, each containing 128 nodes. For the acoustic channel, the corresponding network input is 11 consecutive frames of PLP (perceptual linear prediction) coefficients together with the context-aware acoustic features, and the output is the depression class label; for the text channel, the corresponding network input is the context-aware text features of a training sample, and the output is the depression class label.

In this step, BLSTM networks are used to build the depression classification models so as to capture the long-term dependency between the acoustic or text features and the depression diagnosis, as sketched below.
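A possible shape of one channel subsystem is sketched here; the 3 bidirectional LSTM layers with 128 nodes follow the text, while the input dimensionality (PLP assumed 13-dimensional per frame, plus the 39-dimensional bottleneck feature) and the last-time-step readout are illustrative assumptions.

    import torch
    import torch.nn as nn

    class BLSTMChannel(nn.Module):
        def __init__(self, input_dim, n_classes=4, hidden=128, layers=3):
            super().__init__()
            self.blstm = nn.LSTM(input_dim, hidden, num_layers=layers,
                                 bidirectional=True, batch_first=True)
            self.out = nn.Linear(2 * hidden, n_classes)  # 2x for the two directions

        def forward(self, seq):                 # seq: (batch, time, input_dim)
            h, _ = self.blstm(seq)
            return self.out(h[:, -1])           # class logits from the last step

    # Acoustic channel: 11 frames, each frame = PLP (13) + bottleneck feature (39).
    acoustic_channel = BLSTMChannel(input_dim=13 + 39)
    frames = torch.randn(8, 11, 52)
    logits = acoustic_channel(frames)

The text channel would be an analogous BLSTMChannel whose input is the sequence of context-aware text features of a sample.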
Step S250: use reinforcement learning to fuse the outputs of the depression detection subsystems and obtain the final depression classification.

As the strategy for fusing the information of the multi-modal systems, the embodiment of the present invention adopts a reinforcement learning mechanism that adjusts the weight of each subsystem so as to minimize the difference between the combined system's final depression prediction and the feedback information. The final depression score is expressed as:

Score = Σ_i w_i · S_i    (1)

where the weights w_i = [λ_1, λ_2, …, λ_c], c is the number of depression classes, and S_i corresponds to the i-th subsystem. The decision score function L_t of reinforcement learning at time t is defined as:

L_t = W(A_{t-1}) · D - C    (2)

where A_{t-1} denotes the feedback at time t-1, D the difference between the true and predicted results on the development set, W the weights {w_i} of all subsystems, and C the global accuracy on the development set. Therefore, L_t is summed over all times and maximized, and the resulting W* is the final subsystem weight, expressed as:

W* = arg max_W Σ_t L_t    (3)

In an embodiment of the present invention, a hidden Markov model or other models may be used for the reinforcement learning.

In this step, a reinforcement learning method automatically adjusts the weights of the acoustic channel subsystem score and the text channel subsystem score so that they are organically fused for the final depression classification; a simplified sketch is given below.
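The stand-in sketch below replaces the reinforcement learning search (e.g., an HMM) with a plain random search over the weight simplex that maximizes development-set accuracy; it only illustrates the weighted fusion Score = Σ_i w_i · S_i, and all names and defaults are hypothetical.

    import numpy as np

    def fuse(weights, subsystem_scores):
        """Weighted sum of per-class subsystem scores: Score = sum_i w_i * S_i."""
        return sum(w * s for w, s in zip(weights, subsystem_scores))

    def search_weights(subsystem_scores, labels, trials=1000, seed=0):
        rng = np.random.default_rng(seed)
        best_w, best_acc = None, -1.0
        for _ in range(trials):
            w = rng.dirichlet(np.ones(len(subsystem_scores)))  # weights on the simplex
            pred = fuse(w, subsystem_scores).argmax(axis=1)
            acc = float((pred == labels).mean())               # development-set accuracy
            if acc > best_acc:
                best_w, best_acc = w, acc
        return best_w, best_acc

    # subsystem_scores: one (n_samples, n_classes) score array per channel.
    acoustic_scores = np.random.rand(100, 4)
    text_scores = np.random.rand(100, 4)
    labels = np.random.randint(0, 4, 100)
    w_star, dev_acc = search_weights([acoustic_scores, text_scores], labels)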
It should be understood that although the description here is organized around the training process, in practical applications the trained network models can be used on new data (including topics, speech, text, etc.), following a process similar to training, to make depression classification predictions. In addition, besides BLSTM, other models that incorporate temporal information can also be used.
Correspondingly, the present invention also provides a context-aware multi-modal depression detection system for implementing one or more aspects of the above method. For example, the system includes: a training sample construction unit, used to construct a training sample set including topic information, spectrograms, and corresponding text information; an acoustic feature extraction unit, used to extract acoustic features from the spectrograms of the training sample set with a convolutional neural network combined with multi-task learning, obtaining context-aware acoustic features; a text feature extraction unit, used to process word embeddings with a Transformer model using the training sample set, extracting context-aware text features; a classification subsystem establishment unit, used to establish an acoustic channel subsystem for depression detection from the context-aware acoustic features and a text channel subsystem for depression detection from the context-aware text features; and a classification fusion unit, used to fuse the outputs of the acoustic channel subsystem and the text channel subsystem to obtain depression classification information.
In summary, the present invention combines the information obtained from the acoustic channel and the text channel to achieve high-precision multi-modal depression detection. The main technical content includes: topic-related data augmentation, in which, starting from limited depression speech and text data, the topic information in the free doctor-patient conversations is used to expand the depression speech and text training data; robust analysis and extraction of depression-related features, in which transfer learning and the multi-head self-attention mechanism are combined to extract topic-context-aware acoustic and text feature descriptions that reveal the characteristics of depressed patients, improving the accuracy of the detection system; a BLSTM-based depression classification model, which uses the strong sequence modeling capability of the BLSTM network to capture the long-term dependency between acoustic/text information and the depression diagnosis; and a multi-modal fusion framework, which uses a reinforcement learning method to fuse the depression detection subsystems of the acoustic channel and the text channel.
Compared with the prior art, the present invention has the following advantages:

1) Existing depression detection methods use only limited depression speech and text data; in contrast, the present invention uses a topic-based data augmentation method to expand the original training data set.

2) Most existing techniques use features that lack topic-context awareness; in contrast, the present invention uses a CNN with multi-task learning to extract acoustic features with topic-context-aware characteristics and uses a Transformer model to extract text features with topic-context-aware characteristics; these are deep feature descriptions that improve the robustness of depression detection.

3) Existing depression detection modeling techniques do not consider the long-term dependency between speech/text features and the depression diagnosis; in contrast, the present invention uses a BLSTM network to capture the long-term dependency between acoustic or text features and the depression diagnosis, giving better performance.

4) Existing multi-modal depression detection techniques simply concatenate the outputs of different subsystems for decision-making; in contrast, the present invention adopts a reinforcement learning method to automatically adjust the subsystem score weights of the different channels and make the final classification decision, giving better performance.
It should be noted that although the steps are described above in a specific order, this does not mean that they must be executed in that order; in fact, some of these steps can be executed concurrently, or even in a different order, as long as the required functions are achieved.
The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer-readable storage medium carrying computer-readable program instructions for causing a processor to implement aspects of the present invention.

The computer-readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer-readable storage medium may include, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanical encoding device such as a punched card or a raised structure in a groove with instructions stored thereon, and any suitable combination of the foregoing.
The embodiments of the present invention have been described above. The description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or improvements over the technology in the market, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

  1. A context-aware multi-modal depression detection method, comprising the following steps:
    Step S1: constructing a training sample set, the training sample set including topic information, spectrograms, and corresponding text information;
    Step S2: using a convolutional neural network, combined with multi-task learning, to extract acoustic features from the spectrograms of the training sample set, obtaining context-aware acoustic features;
    Step S3: using the training sample set, processing word embeddings with a Transformer model to extract context-aware text features;
    Step S4: establishing an acoustic channel subsystem for depression detection from the context-aware acoustic features, and establishing a text channel subsystem for depression detection from the context-aware text features;
    Step S5: fusing the outputs of the acoustic channel subsystem and the text channel subsystem to obtain depression classification information.
  2. The method according to claim 1, wherein the context-aware acoustic features are obtained according to the following steps:
    constructing a convolutional neural network including an input layer, multiple convolutional layers, multiple fully connected layers, an output layer, and a bottleneck layer located between the last fully connected layer and the output layer, the bottleneck layer having fewer nodes than the convolutional and fully connected layers;
    inputting the spectrograms of the training sample set into the convolutional neural network, the output layer containing a depression classification task and a topic labeling task;
    extracting the context-aware acoustic features from the bottleneck layer of the convolutional neural network.
  3. The method according to claim 1, wherein the context-aware text features are extracted according to the following steps:
    constructing a Transformer model whose input is the word embeddings plus topic identifiers, the Transformer model including multiple encoders and decoders with self-attention and a softmax layer as the last layer;
    pre-training the Transformer model parameters on existing text corpora using an unsupervised training method, and then using transfer learning to perform adaptive training on the collected depression text data;
    after training is complete, removing the softmax layer and taking the output of the Transformer model as the context-aware text features.
  4. The method according to claim 1, wherein step S5 comprises:
    using a reinforcement learning mechanism to adjust the weight of the acoustic channel subsystem and the weight of the text channel subsystem so that the difference between the final depression classification prediction and the feedback information is minimized;
    fusing the outputs of the acoustic channel subsystem and the text channel subsystem to obtain a depression classification score.
  5. The method according to claim 4, wherein the depression classification score is expressed as:
    Score = Σ_i w_i · S_i
    where the weights w_i = [λ_1, λ_2, …, λ_c], and c is the number of depression classes.
  6. The method according to claim 1, wherein the acoustic channel subsystem and the text channel subsystem are established based on a BLSTM network; the network input of the acoustic channel subsystem is the perceptual linear prediction coefficients of multiple consecutive frames together with the context-aware acoustic features, and its output is a depression class label; the network input of the text channel subsystem is text information, and its output is a depression class label.
  7. The method according to claim 1, wherein the topic information in the training sample set includes multiple type identifiers derived from the content of conversations between doctors and depressed patients.
  8. A context-aware multi-modal depression detection system, comprising:
    a training sample construction unit: used to construct a training sample set, the training sample set including topic information, spectrograms, and corresponding text information;
    an acoustic feature extraction unit: used to extract acoustic features from the spectrograms of the training sample set with a convolutional neural network combined with multi-task learning, obtaining context-aware acoustic features;
    a text feature extraction unit: used to process word embeddings with a Transformer model using the training sample set, extracting context-aware text features;
    a classification subsystem establishment unit: used to establish an acoustic channel subsystem for depression detection from the context-aware acoustic features, and to establish a text channel subsystem for depression detection from the context-aware text features;
    a classification fusion unit: used to fuse the outputs of the acoustic channel subsystem and the text channel subsystem to obtain depression classification information.
  9. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
  10. A computer device comprising a memory and a processor, the memory storing a computer program executable on the processor, wherein the processor, when executing the program, implements the steps of the method according to any one of claims 1 to 7.
PCT/CN2020/129214 2019-11-29 2020-11-17 Context-aware multi-modal depression detection method and system WO2021104099A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911198356.X 2019-11-29
CN201911198356.XA CN110728997B (zh) 2019-11-29 2019-11-29 Context-aware multi-modal depression detection system

Publications (1)

Publication Number Publication Date
WO2021104099A1 (zh)

Family

ID=69225856

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/129214 WO2021104099A1 (zh) 2019-11-29 2020-11-17 Context-aware multi-modal depression detection method and system

Country Status (2)

Country Link
CN (1) CN110728997B (zh)
WO (1) WO2021104099A1 (zh)


Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110728997B (zh) * 2019-11-29 2022-03-22 中国科学院深圳先进技术研究院 Context-aware multi-modal depression detection system
CN111150372B (zh) * 2020-02-13 2021-03-16 云南大学 Sleep stage staging system combining fast representation learning and semantic learning
CN111329494B (zh) * 2020-02-28 2022-10-28 首都医科大学 Method and device for acquiring depression reference data
CN111581470B (zh) * 2020-05-15 2023-04-28 上海乐言科技股份有限公司 Multi-modal fusion learning analysis method and system for dialogue system context matching
CN112006697B (zh) * 2020-06-02 2022-11-01 东南大学 Gradient boosting decision tree depression-degree recognition system based on speech signals
CN111798874A (zh) * 2020-06-24 2020-10-20 西北师范大学 Speech emotion recognition method and system
CN113269277B (zh) * 2020-07-27 2023-07-25 西北工业大学 Continuous dimensional emotion recognition method based on a Transformer encoder and multi-head multi-modal attention
CN112966429A (zh) * 2020-08-11 2021-06-15 中国矿业大学 Non-linear industrial process modeling method based on WGANs data augmentation
CN112631147B (zh) * 2020-12-08 2023-05-02 国网四川省电力公司经济技术研究院 Smart grid frequency estimation method and system for impulsive noise environments
CN112768070A (zh) * 2021-01-06 2021-05-07 万佳安智慧生活技术(深圳)有限公司 Mental health assessment method and system based on conversational interaction
CN112885334A (zh) * 2021-01-18 2021-06-01 吾征智能技术(北京)有限公司 Disease cognition system, device, and storage medium based on multi-modal features
CN112818892B (zh) * 2021-02-10 2023-04-07 杭州医典智能科技有限公司 Multi-modal depression detection method and system based on temporal convolutional neural networks
CN113012720B (zh) * 2021-02-10 2023-06-16 杭州医典智能科技有限公司 Depression detection method fusing multiple speech features under spectral-subtraction denoising
CN115346657B (zh) * 2022-07-05 2023-07-28 深圳市镜象科技有限公司 Training method and device for improving dementia recognition using transfer learning
CN116843377A (zh) * 2023-07-25 2023-10-03 河北鑫考科技股份有限公司 Consumer behavior prediction method, device, equipment, and medium based on big data
CN116965817B (zh) * 2023-07-28 2024-03-15 长江大学 EEG emotion recognition method based on a one-dimensional convolutional network and Transformer
CN116978409A (zh) * 2023-09-22 2023-10-31 苏州复变医疗科技有限公司 Depression state assessment method, device, terminal, and medium based on speech signals


Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10204625B2 (en) * 2010-06-07 2019-02-12 Affectiva, Inc. Audio analysis learning using video data
EP3252769B8 (en) * 2016-06-03 2020-04-01 Sony Corporation Adding background sound to speech-containing audio data
US11557311B2 (en) * 2017-07-21 2023-01-17 Nippon Telegraph And Telephone Corporation Satisfaction estimation model learning apparatus, satisfaction estimating apparatus, satisfaction estimation model learning method, satisfaction estimation method, and program
CN107316654A (zh) * 2017-07-24 2017-11-03 湖南大学 Emotion recognition method based on DIS-NV features
GB2567826B (en) * 2017-10-24 2023-04-26 Cambridge Cognition Ltd System and method for assessing physiological state
CN108764010A (zh) * 2018-03-23 2018-11-06 姜涵予 Emotional state determination method and device
WO2019225801A1 (ko) * 2018-05-23 2019-11-28 한국과학기술원 Method and system for simultaneously recognizing emotion, age, and gender based on a user's voice signal
CN109389992A (zh) * 2018-10-18 2019-02-26 天津大学 Speech emotion recognition method based on amplitude and phase information
CN109841231B (zh) * 2018-12-29 2020-09-04 深圳先进技术研究院 Early AD speech-assisted screening system for Mandarin Chinese
CN110047516A (zh) * 2019-03-12 2019-07-23 天津大学 Gender-aware speech emotion recognition method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016028495A1 (en) * 2014-08-22 2016-02-25 Sri International Systems for speech-based assessment of a patient's state-of-mind
JP2018121749A (ja) * 2017-01-30 2018-08-09 株式会社リコー Diagnostic device, program, and diagnostic system
CN107133481A (zh) * 2017-05-22 2017-09-05 西北工业大学 Multi-modal depression estimation and classification method based on DCNN-DNN and PV-SVM
CN107657964A (zh) * 2017-08-15 2018-02-02 西北大学 Depression auxiliary detection method and classifier based on acoustic features and sparse mathematics
CN109599129A (zh) * 2018-11-13 2019-04-09 杭州电子科技大学 Speech-based depression recognition method based on an attention mechanism and convolutional neural networks
CN110728997A (zh) * 2019-11-29 2020-01-24 中国科学院深圳先进技术研究院 Context-aware multi-modal depression detection method and system

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220180056A1 (en) * 2020-12-09 2022-06-09 Here Global B.V. Method and apparatus for translation of a natural language query to a service execution language
CN113627377A (zh) * 2021-08-18 2021-11-09 福州大学 Cognitive radio spectrum sensing method and system based on attention-based CNN
CN113822192A (zh) * 2021-09-18 2021-12-21 山东大学 Method, device, and medium for detainee emotion recognition based on Transformer multi-modal feature fusion
CN113822192B (zh) * 2021-09-18 2023-06-30 山东大学 Method, device, and medium for detainee emotion recognition based on Transformer multi-modal feature fusion
CN114118200A (zh) * 2021-09-24 2022-03-01 杭州电子科技大学 Multi-modal emotion classification method based on attention-guided bidirectional capsule networks
CN113674767A (zh) * 2021-10-09 2021-11-19 复旦大学 Depression state recognition method based on multi-modal fusion
CN114464182A (zh) * 2022-03-03 2022-05-10 慧言科技（天津）有限公司 Fast speech-recognition adaptation method assisted by audio scene classification
CN114464182B (zh) * 2022-03-03 2022-10-21 慧言科技（天津）有限公司 Fast speech-recognition adaptation method assisted by audio scene classification
CN114973120A (zh) * 2022-04-14 2022-08-30 山东大学 Behavior recognition method and system based on multi-modal heterogeneous fusion of multi-dimensional sensor data and surveillance video
CN114973120B (zh) * 2022-04-14 2024-03-12 山东大学 Behavior recognition method and system based on multi-modal heterogeneous fusion of multi-dimensional sensor data and surveillance video
CN115346561A (zh) * 2022-08-15 2022-11-15 南京脑科医院 Depressive mood assessment and prediction method and system based on speech features
CN115346561B (zh) * 2022-08-15 2023-11-24 南京医科大学附属脑科医院 Depressive mood assessment and prediction method and system based on speech features
CN115481681A (zh) * 2022-09-09 2022-12-16 武汉中数医疗科技有限公司 Artificial intelligence-based method for processing breast sampling data
CN115481681B (зh) * 2022-09-09 2024-02-06 武汉中数医疗科技有限公司 Artificial intelligence-based method for processing breast sampling data
CN115969381A (зh) * 2022-11-16 2023-04-18 西北工业大学 EEG signal analysis method based on multi-band fusion and a spatio-temporal Transformer
CN115969381B (зh) * 2022-11-16 2024-04-30 西北工业大学 EEG signal analysis method based on multi-band fusion and a spatio-temporal Transformer
CN117497140A (зh) * 2023-10-09 2024-02-02 合肥工业大学 Multi-level depression state detection method based on fine-grained prompt learning
CN117497140B (зh) * 2023-10-09 2024-05-31 合肥工业大学 Multi-level depression state detection method based on fine-grained prompt learning
CN117137488B (зh) * 2023-10-27 2024-01-26 吉林大学 Auxiliary depression recognition method based on EEG data and facial expression images
CN117137488A (зh) * 2023-10-27 2023-12-01 吉林大学 Auxiliary depression recognition method based on EEG data and facial expression images

Also Published As

Publication number Publication date
CN110728997A (zh) 2020-01-24
CN110728997B (zh) 2022-03-22

Similar Documents

Publication Publication Date Title
WO2021104099A1 (zh) Context-aware multi-modal depression detection method and system
Shou et al. Conversational emotion recognition studies based on graph convolutional neural networks and a dependent syntactic analysis
Mirheidari et al. Detecting Signs of Dementia Using Word Vector Representations.
Schuller et al. Cross-corpus acoustic emotion recognition: Variances and strategies
Batliner et al. The automatic recognition of emotions in speech
Gu et al. Speech intention classification with multimodal deep learning
Atmaja et al. Two-stage dimensional emotion recognition by fusing predictions of acoustic and text networks using SVM
Wang et al. Learning Mutual Correlation in Multimodal Transformer for Speech Emotion Recognition.
Qin et al. An end-to-end approach to automatic speech assessment for Cantonese-speaking people with aphasia
Harati et al. Speech-based depression prediction using encoder-weight-only transfer learning and a large corpus
Saha et al. Emotion aided dialogue act classification for task-independent conversations in a multi-modal framework
Sechidis et al. A machine learning perspective on the emotional content of Parkinsonian speech
Zhang et al. Deep cross-corpus speech emotion recognition: Recent advances and perspectives
CN115640530A (zh) 一种基于多任务学习的对话讽刺和情感联合分析方法
CN116130092A (zh) 多语言预测模型的训练及阿尔茨海默病预测的方法、装置
Prabhakaran et al. Detecting institutional dialog acts in police traffic stops
Özkanca et al. Multi-lingual depression-level assessment from conversational speech using acoustic and text features
Pérez-Espinosa et al. Using acoustic paralinguistic information to assess the interaction quality in speech-based systems for elderly users
Jia et al. A deep learning system for sentiment analysis of service calls
Wang Research on open oral English scoring system based on neural network
JP6992725B2 (ja) パラ言語情報推定装置、パラ言語情報推定方法、およびプログラム
Johar Paralinguistic profiling using speech recognition
Ryumina et al. Emotional speech recognition based on lip-reading
Akhtiamov et al. Gaze, prosody and semantics: relevance of various multimodal signals to addressee detection in human-human-computer conversations
Ohta et al. Response type selection for chat-like spoken dialog systems based on LSTM and multi-task learning

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20892740

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20892740

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 240123)

122 Ep: pct application non-entry in european phase

Ref document number: 20892740

Country of ref document: EP

Kind code of ref document: A1