CN113255635A - Multi-mode fused psychological stress analysis method - Google Patents

Multi-mode fused psychological stress analysis method

Info

Publication number
CN113255635A
Authority
CN
China
Prior art keywords
sequence
frame
vector
video
facial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110812718.0A
Other languages
Chinese (zh)
Other versions
CN113255635B (en)
Inventor
陶建华
何宇
刘斌
连政
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202110812718.0A priority Critical patent/CN113255635B/en
Publication of CN113255635A publication Critical patent/CN113255635A/en
Application granted granted Critical
Publication of CN113255635B publication Critical patent/CN113255635B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174 Facial expression recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H20/00 ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance
    • G16H20/70 ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance relating to mental therapies, e.g. psychological therapy or autogenous training

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Psychiatry (AREA)
  • Hospice & Palliative Care (AREA)
  • Child & Adolescent Psychology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Signal Processing (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Developmental Disabilities (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computing Systems (AREA)
  • Psychology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Social Psychology (AREA)
  • Epidemiology (AREA)
  • Medical Informatics (AREA)
  • Primary Health Care (AREA)
  • Public Health (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a multi-modal fusion psychological stress analysis method, which comprises the following steps: dividing a long audio-video recording into short audio-video segments containing a face and voice, and framing the short segments to obtain an image sequence and a speech signal; extracting facial features from the image sequence to obtain a face frame sequence; extracting optical flow between adjacent frames of the face frame sequence by an optical flow method to obtain an optical flow sequence; fusing the face frame sequence and the optical flow sequence and applying a linear mapping to obtain a face embedding vector; extracting a region of interest from the image sequence to obtain a region-of-interest sequence and applying a linear mapping to obtain a physiological signal embedding vector; extracting basic acoustic features from the speech signal frame by frame and applying a linear mapping to obtain an acoustic embedding vector; extracting emotional features from the speech signal and the image sequence; fusing the features in the temporal order of the frame sequence to obtain a spatio-temporal feature vector; and inputting the spatio-temporal feature vector into the model and classifying with softmax to obtain the psychological stress level.

Description

Multi-mode fused psychological stress analysis method
Technical Field
This application relates to the field of emotion recognition, and in particular to a multi-modal fusion method for psychological stress analysis.
Background
Facial features comprise facial expression features and facial movement features. Inadvertent facial expressions and movements can reveal a person's true psychological state: micro-expression changes of the eyebrows, mouth, eyes and forehead are related to psychological stress, and the facial movements made while speaking are likewise correlated with psychological stress.
Voice features mainly comprise speech rhythm features, speech spectrum features and voice quality features. The speech rhythm features are expressed acoustically as fundamental frequency, duration and energy parameters; the speech spectrum features include the spectrum, spectral envelope, cepstral coefficients, formants and the like; the voice quality features include fundamental frequency jitter, amplitude jitter (shimmer), harmonics-to-noise ratio, noise-to-harmonics ratio and entropy feature parameters. When a person is under psychological stress, changes in these voice features can reflect that stressed state.
The physiological state includes heart rate, heart rate variability, blood oxygen, blood pressure, respiration rate and the like. Facial rPPG (remote photoplethysmography) signal features can be extracted from video in a non-contact manner to assess a person's physiological state, and these physiological parameters can reflect the person's state under psychological stress.
A person's psychological stress is closely related to their emotional state. Emotional features extracted by a pre-trained dimensional emotion model are used as additional feature input to the stress prediction model, which improves the accuracy of psychological stress prediction.
Publication No. CN110301920B discloses a multi-modal fusion method and device for psychological stress detection. That invention builds attention-enhanced feature matrices for physiological data->text, physiological data->picture, text->physiological data, text->picture, picture->physiological data and picture->text, and obtains a fused feature matrix of the text, picture and physiological data with a feedforward fully connected neural network; it then obtains a fused representation matrix of the three modalities from the importance weights of the text, picture and physiological data together with the fused feature matrix; and finally obtains a stress classification vector reflecting the psychological stress problem from the fused representation matrix of the three modalities and a feedforward fully connected network.
Publication No. CN112155577A discloses a social stress detection method, apparatus, computer device and storage medium. The method comprises: acquiring multi-modal physiological signals of a subject, calibrating them, and storing the calibrated signals as sample data in a multi-modal physiological signal stress database; fusing a deep neural network with a generative adversarial network to construct a social stress detection model; inputting random Gaussian noise into the generator to obtain generated data, adding the generated data to the database, and labeling it as class y = K + 1; increasing the output dimension of the classifier to K + 1 and setting its target for the generated data to class K + 1; training the social stress detection model with the sample data and the generated data; and detecting given multi-modal physiological signals with the trained model to obtain the corresponding stress probability values.
In the prior art, facial expressions, voice, physiological signals and the like are modeled and analyzed separately, and a fusion decision is then made to obtain the psychological stress prediction result.
Disclosure of Invention
In view of the above, the present invention provides a multi-modal fusion psychological stress analysis method, comprising:
S11: dividing a long audio-video recording into short audio-video segments containing a face and voice, and framing the short segments to obtain t image sequences and t frames of speech signal;
S12: performing facial feature extraction on the image sequence with an existing trained neural network to obtain a face frame sequence;
extracting optical flow between adjacent frames of the face frame sequence by an optical flow method to obtain an optical flow sequence;
fusing and flattening the face frame sequence and the optical flow sequence to obtain one-dimensional face vectors, and linearly mapping the one-dimensional face vectors to obtain the face embedding vector Z1;
S13: extracting a region of interest from the image sequence to obtain a region-of-interest sequence, flattening it into one-dimensional vectors of interest, and linearly mapping them to obtain the physiological signal embedding vector Z2;
S14: extracting basic acoustic features of the speech signal frame by frame, forming the basic acoustic features into one-dimensional acoustic vectors, and linearly mapping them to obtain the acoustic embedding vector Z3;
S15: extracting DeepSpectrum, VGGish and eGeMAPS feature vectors from the speech signal;
extracting VGGFace and fau_intensity feature vectors from the image sequence;
concatenating the DeepSpectrum, VGGish and eGeMAPS feature vectors with the VGGFace and fau_intensity feature vectors to obtain an audio-video fusion vector;
inputting the audio-video fusion vector into a pre-trained emotional feature extraction model to obtain t emotional feature vectors Z4 = [Z4T1 … Z4Ti … Z4Tt];
S16: fusing Z1, Z2, Z3 and Z4 in the temporal order of the frame sequence to obtain a spatio-temporal feature vector;
S17: inputting the spatio-temporal feature vector into an improved Transformer encoder model based on a CNN + MLP module, and classifying with softmax to obtain the psychological stress level.
In some embodiments, the specific method for framing the short audio-video segments to obtain t image sequences and t frames of speech signal comprises:
S11-1: inputting an audio-video file that contains the facial expressions and voice of the same person;
S11-2: dividing the long audio-video recording into short audio-video segments containing a face and voice, each of duration T0, with identical start and end time points for the video and audio; each long video carries a psychological stress level label corresponding to the subject;
S11-3: for a segmented short video band with frame rate F, taking t0 frames as the input for extracting a single feature, the band is evenly divided into t sequences, where t = T0 × F / t0;
S11-4: for the segmented short audio band, framing the audio with a frame length of 20-50 ms and a frame shift of half the frame length; within each frame interval the speech signal is regarded as stationary, yielding t frames of speech signal.
In some embodiments, the specific method for deriving the face embedding vector Z1 comprises the following steps:
S12-1: extracting face pictures of equal pixel size from the t0-frame image sequence with an existing trained neural network, and masking the non-face parts to obtain a face frame sequence;
S12-2: extracting optical flow between adjacent frames of the face frame sequence by an optical flow method to obtain an optical flow sequence;
S12-3: fusing the face frame sequence and the optical flow sequence to obtain a face fusion sequence, and dividing each frame of the face fusion sequence into p1 pixel blocks of equal size;
S12-4: flattening each of the p1 equal-size pixel blocks into a one-dimensional vector and normalizing it, yielding p1 one-dimensional face vectors;
S12-5: linearly mapping the p1 one-dimensional face vectors, so that each short video yields t × p1 D-dimensional face embedding vectors, denoted Z1 = [Z1T1 … Z1Ti … Z1Tt].
In some embodiments, the specific method for deriving the physiological signal embedding vector Z2 comprises the following steps:
S13-1: extracting aligned face images of equal pixel size from the t0-frame image sequence;
S13-2: taking the cheek areas of the face as the region of interest and extracting the pixel values inside rectangular boxes on both cheeks to obtain a region-of-interest sequence;
S13-3: dividing each frame of the region-of-interest sequence into p2 sub-regions;
S13-4: converting each sub-region from the RGB color space to the YUV color space;
S13-5: averaging the pixel values of each channel of the p2 YUV sub-regions, flattening the result into a one-dimensional vector, and normalizing it to obtain p2 one-dimensional vectors of interest;
S13-6: linearly mapping the p2 one-dimensional vectors of interest, so that each short video yields t × p2 D-dimensional embedding vectors of interest, denoted Z2 = [Z2T1 … Z2Ti … Z2Tt].
In some embodiments, the specific method for extracting the basic acoustic features of the speech signal frame by frame, forming them into one-dimensional acoustic vectors, and linearly mapping them to obtain the acoustic embedding vector Z3 comprises:
S14-1: windowing each frame of the audio signal to reduce the signal discontinuities at the start and end of the frame and to reduce spectral leakage;
S14-2: extracting the basic acoustic features of the audio signal frame by frame, the basic acoustic features including: fundamental frequency, duration and energy parameters, spectral envelope, cepstral coefficients, formants, fundamental frequency jitter, amplitude jitter (shimmer), harmonics-to-noise ratio, noise-to-harmonics ratio, and entropy feature parameters;
S14-3: forming the basic acoustic features of each audio frame into a one-dimensional vector and normalizing each vector to obtain the one-dimensional acoustic vectors;
S14-4: linearly mapping the one-dimensional acoustic vectors, so that each short video yields t D-dimensional acoustic embedding vectors, denoted Z3 = [Z3T1 … Z3Ti … Z3Tt].
In some embodiments, the specific method for fusing Z1, Z2, Z3 and Z4 in the temporal order of the frame sequence to obtain the spatio-temporal feature vector comprises:
S16-1: feeding the components Z1Ti, Z2Ti, Z3Ti and Z4Ti at corresponding positions of Z1, Z2, Z3 and Z4 into their respective self-attention modules to obtain self-correlation features;
S16-2: feeding the components Z1Ti, Z2Ti, Z3Ti and Z4Ti at corresponding positions of Z1, Z2, Z3 and Z4, in combinations of two and of three components, into the corresponding attention modules to obtain cross-correlation features;
S16-3: concatenating the self-correlation features and the cross-correlation features to obtain the fused frame feature;
S16-4: repeating steps S16-1 to S16-3 to obtain the fused frame features of all frames, and concatenating the fused frame features of all frames to obtain the spatio-temporal feature vector.
In some embodiments, the improved Transformer encoder model based on the CNN + MLP module takes the following form:
the improved Transformer encoder model is a stack of several encoder blocks;
each encoder block consists of a multi-head attention layer, a CNN + MLP module, and normalization layers.
In some embodiments, the input to the improved Transformer encoder model based on the CNN + MLP module further comprises an additional learnable embedding vector Z0. Z0 is independent of the spatio-temporal feature vector and is used to faithfully aggregate the classification information without being biased toward any single feature.
In some embodiments, the psychological stress level is specified as follows:
the psychological stress level is divided into 3 levels: level 1 is no psychological stress, level 2 is slight psychological stress, and level 3 is psychological stress.
In some embodiments, the loss function adopted for training the improved Transformer encoder model based on the CNN + MLP module is:

Loss = 1 − CCC, with CCC = 2ρσ_pre σ_Y / (σ_pre² + σ_Y² + (μ_pre − μ_Y)²)

where:
μ_pre and μ_Y are the mean of the model predictions and the mean of the label values, respectively;
σ_pre² and σ_Y² are the variance of the model predictions and the variance of the label values, respectively;
ρ is the Pearson correlation coefficient.
Compared with the prior art, the technical solution provided by the embodiments of the application has the following advantages:
According to the method provided by the embodiments of the application, a long audio-video recording is divided into short audio-video segments containing a face and voice, and the short segments are framed to obtain an image sequence and a speech signal; facial features are extracted from the image sequence to obtain a face frame sequence; optical flow is extracted between adjacent frames of the face frame sequence by an optical flow method to obtain an optical flow sequence; the face frame sequence and the optical flow sequence are fused and linearly mapped to obtain a face embedding vector; a region of interest is extracted from the image sequence to obtain a region-of-interest sequence, which is linearly mapped to obtain a physiological signal embedding vector; basic acoustic features are extracted from the speech signal frame by frame and linearly mapped to obtain an acoustic embedding vector; emotional features are extracted from the speech signal and the image sequence; the features are fused in the temporal order of the frame sequence to obtain a spatio-temporal feature vector; and the spatio-temporal feature vector is input into the model and classified with softmax to obtain the psychological stress level. The method compensates for the shortcomings caused by the subjectivity of text and picture data, solves some problems inherent in physiology-related data (for example, the physiological data in an extremely excited state and in an extremely stressed state are very similar), and also, to a certain extent, fills the blank window in psychological detection caused by the loss of certain data.
Drawings
Fig. 1 is a flowchart of the multi-modal fusion psychological stress analysis method according to an embodiment of the present invention;
Figs. 2a-2b are flowcharts of facial feature extraction provided by embodiments of the present invention;
Figs. 3a-3b are flowcharts of physiological signal extraction provided by embodiments of the present invention;
Figs. 4a-4b are structural block diagrams of feature fusion provided by embodiments of the present invention;
Figs. 5a-5c are structural block diagrams of the improved Transformer encoder model based on the CNN + MLP module according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
As shown in Fig. 1, the multi-modal fusion psychological stress analysis method provided by the embodiments of the application comprises the following steps:
S11: dividing a long audio-video recording into short audio-video segments containing a face and voice, and framing the short segments to obtain t image sequences and t frames of speech signal;
S11-1: inputting an audio-video file that contains the facial expressions and voice of the same person;
S11-2: dividing the long audio-video recording into short audio-video segments containing a face and voice, each of duration T0, with identical start and end time points for the video and audio; each long video carries a psychological stress level label corresponding to the subject;
In a specific embodiment, a person is speaking in the video: the continuous video frames contain the person's facial expressions and facial movements, the audio contains the person's continuous speech, and the person in the video has a psychological stress level label;
In the specific embodiment, the duration T0 is 5 s and the frame rate is 30 frames/second; if one frame is kept every 2 frames as the picture sequence, t0 is 50 frames;
S11-3: for a segmented short video band with frame rate F, taking t0 frames as the input for extracting a single feature, the band is evenly divided into t sequences, where t = T0 × F / t0;
Specifically, for the segmented short video band with frame rate F, if t0 frames are taken as the input for extracting a single feature, there are t = T0 × F / t0 sequences;
S11-4: for the segmented short audio band, framing the audio with a frame length of 20-50 ms and a frame shift of half the frame length; within each frame interval the speech signal is regarded as stationary, yielding t frames of speech signal;
In the specific embodiment, with a duration T0 of 5 s, a frame length of 50 ms and a frame shift of 25 ms, 39 frames of audio signal are obtained;
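A minimal Python sketch of this segmentation and framing step, assuming OpenCV for frame reading and the 50 ms / 25 ms audio framing of the embodiment above (the function names and parameters are illustrative assumptions, not the patent's reference implementation):

import cv2
import numpy as np

def split_video_frames(video_path, t0=50, frame_step=2):
    # Read a short video segment and group its sampled frames into sequences of t0 frames.
    # frame_step: keep every frame_step-th frame, as in the embodiment above.
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % frame_step == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    # group into t image sequences of t0 frames each (an incomplete tail is dropped)
    return [frames[i:i + t0] for i in range(0, len(frames) - t0 + 1, t0)]

def frame_audio(signal, sr, frame_ms=50, hop_ms=25):
    # Frame a 1-D speech signal with a 50 ms frame length and a 25 ms (half-frame) shift.
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])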
S12: as shown in Figs. 2a and 2b, performing facial feature extraction on the image sequence with an existing trained neural network to obtain a face frame sequence;
extracting optical flow between adjacent frames of the face frame sequence by an optical flow method to obtain an optical flow sequence;
fusing and flattening the face frame sequence and the optical flow sequence to obtain one-dimensional face vectors, and linearly mapping the one-dimensional face vectors to obtain the face embedding vector Z1;
The specific method for obtaining the face embedding vector Z1 comprises the following steps:
S12-1: extracting face pictures of equal pixel size from the t0-frame image sequence with the existing trained neural network, and masking the non-face parts to obtain a face frame sequence;
In a specific embodiment, functions from the dlib library or OpenFace are used to process the video file and obtain the face sequence pictures;
S12-2: extracting optical flow between adjacent frames of the face frame sequence by an optical flow method to obtain an optical flow sequence;
In the specific embodiment, the optical flow between adjacent micro-expression frames is extracted with an optical flow method such as Lucas-Kanade to obtain the optical flow sequence;
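An illustrative sketch of this step: the patent names the Lucas-Kanade method, while the snippet below uses OpenCV's dense Farneback flow as a stand-in so that every face frame gets a per-pixel flow field to fuse with the face frame sequence; the parameter values are assumptions:

import cv2

def optical_flow_sequence(face_frames):
    # face_frames: list of HxWx3 BGR face images (the face frame sequence).
    # Returns a list of HxWx2 flow fields, one per adjacent frame pair.
    flows = []
    prev = cv2.cvtColor(face_frames[0], cv2.COLOR_BGR2GRAY)
    for frame in face_frames[1:]:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # dense Farneback flow: pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags
        flow = cv2.calcOpticalFlowFarneback(prev, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        flows.append(flow)
        prev = gray
    return flows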
S12-3: fusing the face frame sequence and the optical flow sequence to obtain a face fusion sequence, and dividing each frame of the face fusion sequence into p1 pixel blocks of equal size;
In a specific embodiment, the pixel-block matrix of each frame has shape (c, h, w), where H and W are the picture height and width, h and w are the pixel-block height and width, and c is the number of channels (equal to 3); from these conditions the number of pixel blocks is p1 = HW / (hw). If the picture size is 256 × 256, it can be divided into a 16 × 16 grid of pixel blocks, i.e., p1 = 256 blocks of 16 × 16 pixels each;
S12-4: flattening each of the p1 equal-size pixel blocks into a one-dimensional vector and normalizing it, yielding p1 one-dimensional face vectors;
In one embodiment, if the picture sequence has m frames, each frame has p1 one-dimensional vectors Xj = (x1, x2, …, xi, …, x_chw), j = 1, 2, …, p1; normalizing over all frames yields Xj_norm:

Xj_norm = (Xj − μ) / (σ + ε)

where μ and σ are the mean and standard deviation computed over all frames, and ε is a small positive real number added to the denominator to avoid a denominator of 0.
S12-5: the p is1The one-dimensional vector of each face is linearly mapped, so that each short video obtains t × p1A D-dimensional face embedding vector, denoted as Z1=[ Z1T1…Z1Ti… Z1Tt];
In a specific embodiment, the specific expression of linear mapping is as follows:
Figure 217399DEST_PATH_IMAGE006
in the (p, t), p represents the spatial position of a pixel block, namely the number of blocks divided by each frame of picture, and the value range is 1-N1;
(p, t) where t denotes the pixel block temporal position, i.e. the few frames of the picture sequence;
Figure 506429DEST_PATH_IMAGE007
representing a learnable spatiotemporal position code;
Figure 655651DEST_PATH_IMAGE008
is a learnable matrix;
embedding vectors
Figure 4724DEST_PATH_IMAGE009
Inputting a model;
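The linear mapping above can be sketched in PyTorch roughly as follows; the module name, the dimension D, and treating Z1 as t × p1 tokens are assumptions made for illustration, not the patent's reference implementation:

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    # Flatten and normalize pixel blocks elsewhere; here they are linearly mapped to
    # D-dimensional embeddings and a learnable spatio-temporal position encoding is added.
    def __init__(self, block_dim, num_blocks_p, num_frames_t, dim_d):
        super().__init__()
        self.proj = nn.Linear(block_dim, dim_d)                       # learnable matrix E
        self.pos = nn.Parameter(torch.zeros(num_frames_t, num_blocks_p, dim_d))  # e_pos(p, t)

    def forward(self, blocks):
        # blocks: (batch, t, p1, c*h*w) flattened, normalized pixel blocks
        return self.proj(blocks) + self.pos                           # (batch, t, p1, D)

The same pattern can be reused for the region-of-interest and acoustic embeddings Z2 and Z3, with block_dim and num_blocks_p adjusted accordingly.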
S13: as shown in Figs. 3a and 3b, extracting a region of interest from the image sequence to obtain a region-of-interest sequence, flattening it into one-dimensional vectors of interest, and linearly mapping them to obtain the physiological signal embedding vector Z2;
The specific method for obtaining the physiological signal embedding vector Z2 comprises the following steps:
S13-1: extracting aligned face images of equal pixel size from the t0-frame image sequence;
In a specific embodiment, functions from the dlib library or OpenFace are used to process the video file and obtain the aligned face images;
S13-2: taking the cheek areas of the face as the region of interest and extracting the pixel values inside rectangular boxes on both cheeks to obtain a region-of-interest sequence;
In a specific embodiment, the coordinates of the rectangular cheek regions can be determined with the 68-point facial landmark method of the dlib library, and the pixel matrices of the rectangular regions are extracted;
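A minimal sketch of cheek-ROI extraction with dlib's 68-point landmarks; the specific landmark indices used to bound the cheek rectangles are assumptions chosen for illustration, not coordinates specified in the patent:

import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def cheek_rois(frame):
    # Return pixel patches from rectangles over both cheeks of the first detected face.
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    pts = np.array([[p.x, p.y] for p in predictor(gray, faces[0]).parts()])
    # assumed bounds: jawline points 2 / 14 horizontally, nose bridge to nose base vertically
    top, bottom = pts[29][1], pts[33][1]
    left_cheek = frame[top:bottom, pts[2][0]:pts[31][0]]
    right_cheek = frame[top:bottom, pts[35][0]:pts[14][0]]
    return left_cheek, right_cheek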
S13-3: dividing each frame of the region-of-interest sequence into p2 sub-regions;
In a specific embodiment, the size of the region-of-interest blocks differs from the block size used for facial feature extraction; this division is better suited to fine-grained analysis of the rPPG signal;
S13-4: converting each sub-region from the RGB color space to the YUV color space;
S13-5: averaging the pixel values of each channel of the p2 YUV sub-regions, flattening the result into a one-dimensional vector, and normalizing it to obtain p2 one-dimensional vectors of interest;
In one embodiment, if the picture sequence has m frames, each frame has p2 one-dimensional vectors Xj = (x1, x2, …, xi, …, x_cab), j = 1, 2, …, p2; normalizing over all frames yields Xj_norm:

Xj_norm = (Xj − μ) / (σ + ε)

where μ and σ are the mean and standard deviation computed over all frames, and ε is a small positive real number added to the denominator to avoid a denominator of 0;
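An illustrative sketch of steps S13-4 and S13-5 (the sub-region grid size is an assumption):

import cv2
import numpy as np

def roi_channel_means(roi, grid=(4, 4)):
    # Convert a cheek ROI to YUV, split it into grid sub-regions, and return the
    # per-channel mean of each sub-region flattened into one vector.
    yuv = cv2.cvtColor(roi, cv2.COLOR_BGR2YUV).astype(np.float32)
    h, w = yuv.shape[:2]
    gh, gw = grid
    means = []
    for r in range(gh):
        for c in range(gw):
            sub = yuv[r * h // gh:(r + 1) * h // gh, c * w // gw:(c + 1) * w // gw]
            means.append(sub.reshape(-1, 3).mean(axis=0))   # (Y, U, V) means of one sub-region
    return np.concatenate(means)                             # length 3 * gh * gw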
s13-6: p is to be2Linear mapping is carried out on the interested one-dimensional vectors, and then t × p is obtained by each short video2The D-dimensional embedding vector of interest, denoted as Z2=[ Z2T1…Z2Ti… Z2Tt];
In a specific embodiment, the specific expression of linear mapping is as follows:
Figure 577787DEST_PATH_IMAGE014
in (p, t), p represents the spatial position of a pixel block, namely the number of blocks divided by each frame of picture, and the value range is 1-p2
(p, t) where t denotes the pixel block temporal position, i.e. the few frames of the picture sequence;
Figure 208620DEST_PATH_IMAGE015
representing a learnable spatiotemporal position code;
Figure 66855DEST_PATH_IMAGE016
is a learnable matrix;
embedding vectors
Figure 757730DEST_PATH_IMAGE017
S14: extracting basic acoustic features of the voice signal by taking a frame as a unit, forming the basic acoustic features into an acoustic one-dimensional vector, and performing linear mapping on the acoustic one-dimensional vector to obtain an acoustic embedded vector Z3
The specific method comprises the following steps:
s14-1: windowing is carried out on each frame of audio signal, signal discontinuity at the start and the end of the frame is reduced, and frequency spectrum leakage is reduced;
s14-2: extracting basic acoustic features of the audio signal in units of frames, wherein the basic acoustic features include: fundamental frequency, duration and energy parameters, frequency spectrum envelope, cepstrum coefficient, formant, fundamental frequency jitter, amplitude jitter, harmonic noise rate, noise-harmonic ratio and entropy characteristic parameters;
s14-3: forming one-dimensional vectors by the basic acoustic features of each frame of audio signal, and performing normalization processing on each one-dimensional vector to obtain acoustic one-dimensional vectors;
In a specific embodiment, the linear mapping can be written as:

Z3(p,t) = E · X(p,t)_norm + e_pos(p,t)

where, in (p, t), p denotes the position index within a frame, with a value range of 1 to p3;
t denotes the temporal position, i.e., the frame index;
e_pos(p,t) is a learnable spatio-temporal position encoding;
E is a learnable matrix;
and the embedding vectors Z3(p,t) are obtained;
S14-4: linearly mapping the one-dimensional acoustic vectors, so that each short video yields t D-dimensional acoustic embedding vectors, denoted Z3 = [Z3T1 … Z3Ti … Z3Tt];
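A small illustrative sketch of per-frame acoustic feature extraction with librosa; the particular features shown (f0 via pyin, RMS energy, MFCCs) are a subset chosen for illustration, and jitter, shimmer, and harmonics-to-noise ratio would in practice come from a tool such as openSMILE or Praat:

import librosa
import numpy as np

def basic_acoustic_features(wav_path, frame_ms=50, hop_ms=25):
    # Extract a per-frame acoustic feature matrix of shape (frames, features).
    y, sr = librosa.load(wav_path, sr=16000)
    frame_length = int(sr * frame_ms / 1000)
    hop_length = int(sr * hop_ms / 1000)
    f0, _, _ = librosa.pyin(y, fmin=60, fmax=400,
                            frame_length=frame_length, hop_length=hop_length)
    rms = librosa.feature.rms(y=y, frame_length=frame_length, hop_length=hop_length)[0]
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=frame_length, hop_length=hop_length)
    n = min(len(f0), len(rms), mfcc.shape[1])
    feats = np.vstack([np.nan_to_num(f0[:n]), rms[:n], mfcc[:, :n]]).T
    # per-feature z-score normalization with a small epsilon in the denominator
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-8)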
S15: extracting feature vectors of depspectum, vggish and egemps from the voice signal;
extracting vgface and fau _ intensity characteristic vectors from the image sequence;
splicing the deepspread, vggish and egemaps feature vectors and the vgface and face _ intensity feature vectors to obtain audio and video fusion vectors;
extracting a depth spectrum characteristic depspectum from a pre-training CNN network;
the VGG network voice depth feature vggish is extracted by a pre-training VGG network;
the 88-dimensional manual features egemaps are obtained by an opensimle tool;
extracting the VGG network face depth feature vggface by a pre-trained VGG network;
the facial AU feature fau _ intensity is obtained by an openface tool, the audio and video fusion vector is input into a pre-training emotion feature extraction model, and t emotion feature vectors Z are obtained4=[ Z4T1…Z4Ti… Z4Tt];
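A minimal sketch of assembling the per-frame audio-video fusion vector; the individual extractors are stand-ins (feature dimensions and names are assumptions), and only the concatenation step mirrors the description above:

import numpy as np

def audio_video_fusion_vector(deepspectrum, vggish, egemaps, vggface, fau_intensity):
    # Concatenate per-frame audio features (DeepSpectrum, VGGish, eGeMAPS)
    # with per-frame video features (VGGFace, FAU intensities) into one vector.
    parts = [deepspectrum, vggish, egemaps, vggface, fau_intensity]
    return np.concatenate([np.asarray(p, dtype=np.float32).ravel() for p in parts])

# usage sketch: the fused vectors of the t frames are stacked and fed to the
# pre-trained emotion feature extractor to obtain Z4 = [Z4T1 ... Z4Tt]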
S16: as shown in fig. 4a and 4b, the Z is set1、 Z2、 Z3And Z4Fusing according to the time sequence of the frame sequence to obtain a space-time feature vector;
Fig. 4a is a structural block diagram of fusing feature components to obtain the fused frame feature according to an embodiment of the present invention;
Fig. 4b is a structural block diagram of concatenating the fused frame features of all frames to obtain the spatio-temporal feature vector according to an embodiment of the present invention;
the specific method comprises the following steps:
S16-1: feeding the components Z1Ti, Z2Ti, Z3Ti and Z4Ti at corresponding positions of Z1, Z2, Z3 and Z4 into their respective self-attention modules to obtain self-correlation features;
S16-2: feeding the components Z1Ti, Z2Ti, Z3Ti and Z4Ti at corresponding positions of Z1, Z2, Z3 and Z4, in combinations of two and of three components, into the corresponding attention modules to obtain cross-correlation features;
S16-3: concatenating the self-correlation features and the cross-correlation features to obtain the fused frame feature;
S16-4: repeating steps S16-1 to S16-3 to obtain the fused frame features of all frames, and concatenating the fused frame features of all frames to obtain the spatio-temporal feature vector;
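The per-frame fusion of S16 can be sketched as follows; treating each modality component as a sequence of tokens and using one shared attention layer for the pairwise combinations (the triple combinations are omitted for brevity) is an assumption made for this sketch:

import torch
import torch.nn as nn

class FrameFusion(nn.Module):
    # Fuse the per-frame components Z1Ti..Z4Ti into one fused frame feature.
    def __init__(self, dim, heads=4):
        super().__init__()
        self.self_attn = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(4)])
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, z1, z2, z3, z4):
        # each z*: (batch, n_tokens, dim) for one frame Ti
        zs = [z1, z2, z3, z4]
        self_feats = [attn(z, z, z)[0].mean(dim=1)                      # self-correlation
                      for attn, z in zip(self.self_attn, zs)]
        pairs = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
        cross_feats = [self.cross_attn(zs[i], zs[j], zs[j])[0].mean(dim=1)  # cross-correlation
                       for i, j in pairs]
        return torch.cat(self_feats + cross_feats, dim=-1)              # fused frame feature

The fused frame features of all t frames are then concatenated to form the spatio-temporal feature vector fed to the encoder in S17.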
S17: as shown in Figs. 5a-5c, inputting the spatio-temporal feature vector into the improved Transformer encoder model based on the CNN + MLP module, and classifying with softmax to obtain the psychological stress level;
fig. 5a is a structural block diagram of an improved Transformer encoder model based on a CNN + MLP module according to an embodiment of the present invention;
fig. 5b is a block diagram of an encoder block according to an embodiment of the present invention;
fig. 5c is a block diagram of a CNN + MLP module structure according to an embodiment of the present invention;
the improved Transformer encoder model is formed by overlapping a plurality of encoder blocks;
the encoder block consists of a multi-head attention layer, a CNN + MLP module and a normalization layer;
In a specific embodiment, the computation of an encoder block is as follows:

z'_l = MSA(LN(z_{l-1})) + z_{l-1}
z_l = CNN_MLP(LN(z'_l)) + z'_l

where MSA (Multi-Head Self-Attention) is the multi-head attention;
MLP (Multi-Layer Perceptron) is the multi-layer perceptron;
LN (LayerNorm) is layer normalization;
l indexes the encoder blocks, with a value range of 0 to L;
The self-attention computation of a single head a is as follows:

q_a(p,t) = W_Q^a · LN(z_{l-1}(p,t))
k_a(p,t) = W_K^a · LN(z_{l-1}(p,t))
v_a(p,t) = W_V^a · LN(z_{l-1}(p,t))
α_a(p,t) = SM( (q_a(p,t))^T [k_a(0,0), k_a(p',t')] / sqrt(D_h) )
head_a = α_a(p,t)(0,0) · v_a(0,0) + Σ_(p',t') α_a(p,t)(p',t') · v_a(p',t')

where W_Q^a, W_K^a and W_V^a are learnable weight matrices;
LN (LayerNorm) is layer normalization;
z_{l-1} is the output of the previous layer;
a is the index over the attention heads;
in (p, t), p denotes the spatial index and t the temporal index;
(0, 0) denotes the additional position introduced by adding the classification flag bit;
SM is the softmax activation function;
D_h is the dimension of each attention head;
α_a is the attention weight coefficient;
and the multi-head attention MSA is the concatenation of all single heads:
Concat(head_1, head_2, ... head_a ... head_h)
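An illustrative PyTorch sketch of an encoder block built from a multi-head attention layer, a CNN + MLP module and layer normalization, followed by the class token and softmax head; the kernel size, hidden sizes, stack depth and residual arrangement are assumptions rather than the patent's reference design:

import torch
import torch.nn as nn

class CNNMLPBlock(nn.Module):
    # A 1-D convolution over the token sequence followed by a two-layer MLP.
    def __init__(self, dim, hidden=4):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.mlp = nn.Sequential(nn.Linear(dim, hidden * dim), nn.GELU(),
                                 nn.Linear(hidden * dim, dim))

    def forward(self, x):                      # x: (batch, tokens, dim)
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.mlp(x)

class EncoderBlock(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cnn_mlp = CNNMLPBlock(dim)

    def forward(self, z):
        z = z + self.msa(self.ln1(z), self.ln1(z), self.ln1(z))[0]     # z'_l
        return z + self.cnn_mlp(self.ln2(z))                           # z_l

class StressClassifier(nn.Module):
    # Stack of encoder blocks with a learnable class token Z0 and a 3-way softmax head.
    def __init__(self, dim, depth=4, num_classes=3):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))                # Z0
        self.blocks = nn.ModuleList([EncoderBlock(dim) for _ in range(depth)])
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens):                 # tokens: (batch, n_tokens, dim)
        z = torch.cat([self.cls.expand(tokens.size(0), -1, -1), tokens], dim=1)
        for blk in self.blocks:
            z = blk(z)
        return torch.softmax(self.head(z[:, 0]), dim=-1)               # class probabilities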
The input to the improved Transformer encoder model based on the CNN + MLP module further comprises an additional learnable embedding vector Z0; Z0 is independent of the spatio-temporal feature vector and is used to faithfully aggregate the classification information without being biased toward any single feature;
The psychological stress level is specified as follows:
the psychological stress level is divided into 3 levels: level 1 is no psychological stress, level 2 is slight psychological stress, and level 3 is psychological stress;
In a specific embodiment, a one-dimensional vector of length 3 is output through a fully connected layer:
Y = Wc + b
where:
W is a learnable weight matrix;
c is the one-dimensional vector output by the Transformer encoder model;
b is a bias coefficient;
Y is the psychological stress prediction result, a vector of length 3;
The result is classified by the softmax function:

Softmax(x_i) = exp(x_i) / Σ_j exp(x_j)

where x_i is the i-th element of Y, with i ranging from 1 to 3;
Softmax(x_i) outputs the probability that the subject's psychological state belongs to the i-th class;
The loss function adopted for training the improved Transformer encoder model based on the CNN + MLP module is:

Loss = 1 − CCC, with CCC = 2ρσ_pre σ_Y / (σ_pre² + σ_Y² + (μ_pre − μ_Y)²)

where:
μ_pre and μ_Y are the mean of the model predictions and the mean of the label values, respectively;
σ_pre² and σ_Y² are the variance of the model predictions and the variance of the label values, respectively;
ρ is the Pearson correlation coefficient.
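A sketch of a concordance-correlation-based loss consistent with the quantities listed above, assuming the loss takes the common 1 − CCC form built from the means, variances and Pearson correlation:

import torch

def ccc_loss(pred, target, eps=1e-8):
    # 1 - concordance correlation coefficient between predictions and labels.
    mu_pre, mu_y = pred.mean(), target.mean()
    var_pre, var_y = pred.var(unbiased=False), target.var(unbiased=False)
    cov = ((pred - mu_pre) * (target - mu_y)).mean()       # equals rho * sigma_pre * sigma_y
    ccc = 2 * cov / (var_pre + var_y + (mu_pre - mu_y) ** 2 + eps)
    return 1 - ccc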
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, these information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present invention. The word "if" as used herein may be interpreted as "at … …" or "when … …" or "in response to a determination", depending on the context.
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. In other instances, features described in connection with one embodiment may be implemented as discrete components or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. Further, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing may be advantageous.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A multi-modal fusion psychological stress analysis method, characterized by comprising the following steps:
S11: dividing a long audio-video recording into short audio-video segments containing a face and voice, and framing the short segments to obtain t image sequences and t frames of speech signal;
S12: performing facial feature extraction on the image sequence with an existing trained neural network to obtain a face frame sequence;
extracting optical flow between adjacent frames of the face frame sequence by an optical flow method to obtain an optical flow sequence;
fusing and flattening the face frame sequence and the optical flow sequence to obtain one-dimensional face vectors, and linearly mapping the one-dimensional face vectors to obtain the face embedding vector Z1;
S13: extracting a region of interest from the image sequence to obtain a region-of-interest sequence, flattening it into one-dimensional vectors of interest, and linearly mapping them to obtain the physiological signal embedding vector Z2;
S14: extracting basic acoustic features of the speech signal frame by frame, forming the basic acoustic features into one-dimensional acoustic vectors, and linearly mapping them to obtain the acoustic embedding vector Z3;
S15: extracting DeepSpectrum, VGGish and eGeMAPS feature vectors from the speech signal;
extracting VGGFace and fau_intensity feature vectors from the image sequence;
concatenating the DeepSpectrum, VGGish and eGeMAPS feature vectors with the VGGFace and fau_intensity feature vectors to obtain an audio-video fusion vector;
inputting the audio-video fusion vector into a pre-trained emotional feature extraction model to obtain t emotional feature vectors Z4 = [Z4T1 … Z4Ti … Z4Tt];
S16: fusing Z1, Z2, Z3 and Z4 in the temporal order of the frame sequence to obtain a spatio-temporal feature vector;
S17: inputting the spatio-temporal feature vector into an improved Transformer encoder model based on a CNN + MLP module, and classifying with softmax to obtain the psychological stress level.
2. The multi-modal fusion psychological stress analysis method according to claim 1, wherein the specific method for framing the short audio-video segments to obtain t image sequences and t frames of speech signal comprises:
S11-1: inputting an audio-video file that contains the facial expressions and voice of the same person;
S11-2: dividing the long audio-video recording into short audio-video segments containing a face and voice, each of duration T0, with identical start and end time points for the video and audio, each long video carrying a psychological stress level label corresponding to the subject;
S11-3: for a segmented short video band with frame rate F, taking t0 frames as the input for extracting a single feature, evenly dividing the band into t sequences, where t = T0 × F / t0;
S11-4: for the segmented short audio band, framing the audio with a frame length of 20-50 ms and a frame shift of half the frame length, the speech signal being regarded as stationary within each frame interval, to obtain t frames of speech signal.
3. The multi-modal fusion psychological stress analysis method according to claim 2, wherein the specific method for deriving the face embedding vector Z1 comprises the following steps:
S12-1: extracting face pictures of equal pixel size from the t0-frame image sequence with an existing trained neural network, and masking the non-face parts to obtain a face frame sequence;
S12-2: extracting optical flow between adjacent frames of the face frame sequence by an optical flow method to obtain an optical flow sequence;
S12-3: fusing the face frame sequence and the optical flow sequence to obtain a face fusion sequence, and dividing each frame of the face fusion sequence into p1 pixel blocks of equal size;
S12-4: flattening each of the p1 equal-size pixel blocks into a one-dimensional vector and normalizing it, yielding p1 one-dimensional face vectors;
S12-5: linearly mapping the p1 one-dimensional face vectors, so that each short video yields t × p1 D-dimensional face embedding vectors, denoted Z1 = [Z1T1 … Z1Ti … Z1Tt].
4. The multi-modal fusion psychological stress analysis method according to claim 3, wherein the specific method for obtaining the physiological signal embedding vector Z2 comprises the following steps:
S13-1: extracting aligned face images of equal pixel size from the t0-frame image sequence;
S13-2: taking the cheek areas of the face as the region of interest and extracting the pixel values inside rectangular boxes on both cheeks to obtain a region-of-interest sequence;
S13-3: dividing each frame of the region-of-interest sequence into p2 sub-regions;
S13-4: converting each sub-region from the RGB color space to the YUV color space;
S13-5: averaging the pixel values of each channel of the p2 YUV sub-regions, flattening the result into a one-dimensional vector, and normalizing it to obtain p2 one-dimensional vectors of interest;
S13-6: linearly mapping the p2 one-dimensional vectors of interest, so that each short video yields t × p2 D-dimensional embedding vectors of interest, denoted Z2 = [Z2T1 … Z2Ti … Z2Tt].
5. The multi-modal fusion psychological stress analysis method according to claim 4, wherein the specific method for extracting the basic acoustic features of the speech signal frame by frame, forming them into one-dimensional acoustic vectors, and linearly mapping them to obtain the acoustic embedding vector Z3 comprises:
S14-1: windowing each frame of the audio signal;
S14-2: extracting the basic acoustic features of the audio signal frame by frame, the basic acoustic features including: fundamental frequency, duration and energy parameters, spectral envelope, cepstral coefficients, formants, fundamental frequency jitter, amplitude jitter (shimmer), harmonics-to-noise ratio, noise-to-harmonics ratio, and entropy feature parameters;
S14-3: forming the basic acoustic features of each audio frame into a one-dimensional vector and normalizing each vector to obtain the one-dimensional acoustic vectors;
S14-4: linearly mapping the one-dimensional acoustic vectors, so that each short video yields t D-dimensional acoustic embedding vectors, denoted Z3 = [Z3T1 … Z3Ti … Z3Tt].
6. The multi-modal fusion psychological stress analysis method according to claim 5, wherein the specific method for fusing Z1, Z2, Z3 and Z4 in the temporal order of the frame sequence to obtain the spatio-temporal feature vector comprises:
S16-1: feeding the components Z1Ti, Z2Ti, Z3Ti and Z4Ti at corresponding positions of Z1, Z2, Z3 and Z4 into their respective self-attention modules to obtain self-correlation features;
S16-2: feeding the components Z1Ti, Z2Ti, Z3Ti and Z4Ti at corresponding positions of Z1, Z2, Z3 and Z4, in combinations of two and of three components, into the corresponding attention modules to obtain cross-correlation features;
S16-3: concatenating the self-correlation features and the cross-correlation features to obtain the fused frame feature;
S16-4: repeating steps S16-1 to S16-3 to obtain the fused frame features of all frames, and concatenating the fused frame features of all frames to obtain the spatio-temporal feature vector.
7. The multi-modal fusion psychological stress analysis method according to claim 1, wherein the improved Transformer encoder model based on the CNN + MLP module takes the following form:
the improved Transformer encoder model is a stack of several encoder blocks;
each encoder block consists of a multi-head attention layer, a CNN + MLP module, and normalization layers.
8. The multi-modal fusion psychological stress analysis method according to claim 7, wherein the input to the improved Transformer encoder model based on the CNN + MLP module further comprises an additional learnable embedding vector Z0, the Z0 being independent of the spatio-temporal feature vector and used to faithfully aggregate the classification information without being biased toward any single feature.
9. The multi-modal fusion psychological stress analysis method according to claim 8, wherein the psychological stress level is specified as follows:
the psychological stress level is divided into 3 levels: level 1 is no psychological stress, level 2 is slight psychological stress, and level 3 is psychological stress.
10. The multi-modal fusion psychological stress analysis method according to claim 9, wherein the loss function adopted for training the improved Transformer encoder model based on the CNN + MLP module is:

Loss = 1 − CCC, with CCC = 2ρσ_pre σ_Y / (σ_pre² + σ_Y² + (μ_pre − μ_Y)²)

where:
μ_pre and μ_Y are the mean of the model predictions and the mean of the label values, respectively;
σ_pre² and σ_Y² are the variance of the model predictions and the variance of the label values, respectively;
ρ is the Pearson correlation coefficient.
CN202110812718.0A 2021-07-19 2021-07-19 Multi-mode fused psychological stress analysis method Active CN113255635B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110812718.0A CN113255635B (en) 2021-07-19 2021-07-19 Multi-mode fused psychological stress analysis method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110812718.0A CN113255635B (en) 2021-07-19 2021-07-19 Multi-mode fused psychological stress analysis method

Publications (2)

Publication Number Publication Date
CN113255635A true CN113255635A (en) 2021-08-13
CN113255635B CN113255635B (en) 2021-10-15

Family

ID=77180530

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110812718.0A Active CN113255635B (en) 2021-07-19 2021-07-19 Multi-mode fused psychological stress analysis method

Country Status (1)

Country Link
CN (1) CN113255635B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113888453A (en) * 2021-09-27 2022-01-04 邹子杰 Industrial quality inspection image character matching method and device
CN114091466A (en) * 2021-10-13 2022-02-25 山东师范大学 Multi-modal emotion analysis method and system based on Transformer and multi-task learning
CN114343612A (en) * 2022-03-10 2022-04-15 中国科学院自动化研究所 Transfomer-based non-contact respiration rate measurement method
CN117540007A (en) * 2024-01-04 2024-02-09 烟台大学 Multi-mode emotion analysis method, system and equipment based on similar mode completion
CN117725547A (en) * 2023-11-17 2024-03-19 华南师范大学 Emotion and cognition evolution mode identification method based on cross-modal feature fusion network

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160113567A1 (en) * 2013-05-28 2016-04-28 Laszlo Osvath Systems and methods for diagnosis of depression and other medical conditions
GB201816383D0 (en) * 2018-10-08 2018-11-28 Biobeats Group Ltd Multimodal digital therapy and biometric analysis of biometric signals
CN110507335A (en) * 2019-08-23 2019-11-29 山东大学 Inmate's psychological health states appraisal procedure and system based on multi-modal information
CN111513732A (en) * 2020-04-29 2020-08-11 山东大学 Intelligent psychological stress assessment early warning system for various groups of people under epidemic disease condition
CN111738210A (en) * 2020-07-20 2020-10-02 平安国际智慧城市科技股份有限公司 Audio and video based student psychological state analysis method, device, terminal and medium
CN112057059A (en) * 2020-09-14 2020-12-11 中国刑事警察学院 Psychological stress intelligent acquisition, test and analysis system based on multi-modal physiological data
CN112201228A (en) * 2020-09-28 2021-01-08 苏州贝果智能科技有限公司 Multimode semantic recognition service access method based on artificial intelligence
CN113095357A (en) * 2021-03-04 2021-07-09 山东大学 Multi-mode emotion recognition method and system based on attention mechanism and GMN

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160113567A1 (en) * 2013-05-28 2016-04-28 Laszlo Osvath Systems and methods for diagnosis of depression and other medical conditions
GB201816383D0 (en) * 2018-10-08 2018-11-28 Biobeats Group Ltd Multimodal digital therapy and biometric analysis of biometric signals
CN110507335A (en) * 2019-08-23 2019-11-29 山东大学 Inmate's psychological health states appraisal procedure and system based on multi-modal information
CN111513732A (en) * 2020-04-29 2020-08-11 山东大学 Intelligent psychological stress assessment early warning system for various groups of people under epidemic disease condition
CN111738210A (en) * 2020-07-20 2020-10-02 平安国际智慧城市科技股份有限公司 Audio and video based student psychological state analysis method, device, terminal and medium
CN112057059A (en) * 2020-09-14 2020-12-11 中国刑事警察学院 Psychological stress intelligent acquisition, test and analysis system based on multi-modal physiological data
CN112201228A (en) * 2020-09-28 2021-01-08 苏州贝果智能科技有限公司 Multimode semantic recognition service access method based on artificial intelligence
CN113095357A (en) * 2021-03-04 2021-07-09 山东大学 Multi-mode emotion recognition method and system based on attention mechanism and GMN

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
JIAN HUANG et al.: "Efficient Modeling of Long Temporal Contexts for Continuous Emotion Recognition", 2019 8th International Conference on Affective Computing and Intelligent Interaction (ACII) *
SIMONE LEONARDI et al.: "Multilingual Transformer-Based Personality Traits Estimation", Information *
董永峰 et al.: "Model-layer fusion dimensional emotion recognition method based on multi-head attention mechanism", Signal Processing (信号处理) *
陈鹏达: "Research and implementation of a deep-learning-based product recommendation ***", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113888453A (en) * 2021-09-27 2022-01-04 邹子杰 Industrial quality inspection image character matching method and device
CN114091466A (en) * 2021-10-13 2022-02-25 山东师范大学 Multi-modal emotion analysis method and system based on Transformer and multi-task learning
CN114343612A (en) * 2022-03-10 2022-04-15 中国科学院自动化研究所 Transfomer-based non-contact respiration rate measurement method
CN114343612B (en) * 2022-03-10 2022-05-24 中国科学院自动化研究所 Non-contact respiration rate measuring method based on Transformer
CN117725547A (en) * 2023-11-17 2024-03-19 华南师范大学 Emotion and cognition evolution mode identification method based on cross-modal feature fusion network
CN117540007A (en) * 2024-01-04 2024-02-09 烟台大学 Multi-mode emotion analysis method, system and equipment based on similar mode completion
CN117540007B (en) * 2024-01-04 2024-03-15 烟台大学 Multi-mode emotion analysis method, system and equipment based on similar mode completion

Also Published As

Publication number Publication date
CN113255635B (en) 2021-10-15

Similar Documents

Publication Publication Date Title
CN113255635B (en) Multi-mode fused psychological stress analysis method
Anina et al. Ouluvs2: A multi-view audiovisual database for non-rigid mouth motion analysis
Sinha Recognizing complex patterns
Hassanat Visual speech recognition
Dhuheir et al. Emotion recognition for healthcare surveillance systems using neural networks: A survey
Porras et al. DNN-based acoustic-to-articulatory inversion using ultrasound tongue imaging
JP2010256391A (en) Voice information processing device
CN106096642B (en) Multi-mode emotional feature fusion method based on identification of local preserving projection
CN115169507B (en) Brain-like multi-mode emotion recognition network, recognition method and emotion robot
US20060029277A1 (en) Change information recognition apparatus and change information recognition method
Chetty et al. A multilevel fusion approach for audiovisual emotion recognition
CN116825365B (en) Mental health analysis method based on multi-angle micro-expression
Jachimski et al. A comparative study of English viseme recognition methods and algorithms
Kim et al. Vocal tract shaping of emotional speech
Arya et al. Speech based emotion recognition using machine learning
Loizou An automated integrated speech and face imageanalysis system for the identification of human emotions
Joosten et al. Voice activity detection based on facial movement
Petridis et al. Fusion of audio and visual cues for laughter detection
Ochi et al. Learning a Parallel Network for Emotion Recognition Based on Small Training Data
Brooke Computational aspects of visual speech: machines that can speechread and simulate talking faces
KR102564570B1 (en) System and method for analyzing multimodal emotion
Murai et al. Face-to-talk: audio-visual speech detection for robust speech recognition in noisy environment
Belete College of Natural Sciences
Rahul et al. Detecting and Analyzing Depression: A Comprehensive Survey of Assessment Tools and Techniques
Savran et al. Speaker-independent 3D face synthesis driven by speech and text

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant