CN115393927A - Multi-modal emotion emergency decision system based on a multi-stage long short-term memory network - Google Patents

Multi-modal emotion emergency decision system based on a multi-stage long short-term memory network

Info

Publication number: CN115393927A
Application number: CN202210941178.0A
Authority: CN (China)
Prior art keywords: emotion, level, fusion, LSTM, risk
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 戴亚平, 陈奕杉, 廖天睿, 邵帅
Current assignee: Beijing Institute of Technology (BIT)
Original assignee: Beijing Institute of Technology (BIT)
Application filed by Beijing Institute of Technology BIT
Priority to CN202210941178.0A
Publication of CN115393927A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 Classification, e.g. identification
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Psychiatry (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Child & Adolescent Psychology (AREA)
  • Hospice & Palliative Care (AREA)
  • Biomedical Technology (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a multi-modal emotion emergency decision system based on a multi-stage Long Short-Term Memory (LSTM) network, belonging to the field of multi-modal emotion analysis in natural language processing. The system monitors group emotion in an indoor public place, integrates the group emotion atmosphere field of the public place with individual extreme emotions, and performs risk assessment of emergencies in the scene. The implementation method comprises: performing emotion recognition on the audio and image information; establishing a multi-stage LSTM network and performing decision-level fusion that exploits the temporal relevance of the multi-modal information; fusing the outputs of the LSTM stages in the time dimension; performing outlier processing on the fusion result; strengthening the monitoring of individual extreme emotions, constructing an emotion atmosphere field to evaluate the collective emotion of the public place environment as a whole, integrating the prediction result of the emotion atmosphere field with the individual extreme emotional states, and calculating the risk level probability of an emergency in the public place.

Description

Multi-modal emotion emergency decision system based on a multi-stage long short-term memory network
Technical Field
The invention belongs to the field of multi-modal emotion analysis in natural language processing, and particularly relates to a multi-modal emotion emergency decision system based on a long short-term memory (LSTM) network.
Background
Emotion recognition is one of the important research topics in natural language processing and a prerequisite for computers to understand and express emotion. Under the rapid growth of online information, video, which combines image and speech characteristics, contains rich information and is easy to acquire, so it has become a main information source for emotion recognition. Single-modal emotion recognition can only attend to local features, whereas a multi-modal emotion fusion model can integrate information from different modalities, overcome the limitations of a single data form, and analyse more complex emotional states. However, conventional multi-modal emotion fusion methods ignore the dynamic change of emotion and therefore perform poorly in practical applications. Context-encoding networks and attention-mechanism algorithms have only been used for single-modal emotion judgement and do not consider the short-time continuity of the user's emotion change across multiple modalities.
With the development of machine learning and deep learning, audio-visual emotion recognition methods have been applied in various decision support systems. In the field of security monitoring in particular, audio-visual emotion emergency decision systems are widely used for emotion monitoring of crowds in indoor public places and for risk assessment of emergencies in such scenes. In 2019, in the research project on a rail-transit emergency plan system and decision response mechanism based on an emotion atmosphere field, a multi-modal emotion emergency decision system displayed the emotional atmosphere of rail-transit passengers in real time through an emotion atmosphere field model, based on information such as passenger behaviour, facial expressions and sound, and established plan system models for complex groups under different abnormal emotional atmospheres, making an important contribution to rail-transit safety. In the 2022 key technology research project on improving the efficiency of railway safety development in the new period, a description model of the passengers' emotional atmosphere was built from their multi-modal emotional information and abnormal emotional atmospheres were recognized; a case-based safety-guarantee emergency plan system related to abnormal emotional atmospheres was constructed, and a dynamic decision mechanism and safety-guarantee emergency regulation oriented to abnormal emotional atmospheres were established.
However, emergency decision systems based on audio and video information still suffer from insufficient accuracy of the emotion recognition results in user emotion prediction and from insufficient attention of the decision results to individual extreme emotions, and there is still considerable room for improvement in risk assessment of emergencies in public places.
Disclosure of Invention
In view of the technical problems in existing multi-modal emotion recognition and emergency decision systems, the invention aims to provide a multi-modal emotion emergency decision system based on a multi-stage Long Short-Term Memory (LSTM) network, which uses the emotion information of the audio and video modalities to estimate the emotion of groups in an indoor public place, uses the multi-stage LSTM to fuse the emotion information at the decision level, integrates the group emotion atmosphere field of the indoor public place with individual extreme emotions, and performs risk assessment of emergencies in public places.
The purpose of the invention is realized by the following technical scheme:
the disclosed multi-modal emotion emergency decision system based on a multi-stage LSTM fuses the temporal relevance of the multi-modal information through an LSTM network and estimates the emotional state of the user at the current time by combining emotional context information. The LSTM nodes extract the context information of the audio and video modalities at different times and establish a dependency in the time dimension; in addition, a multi-stage LSTM fusion model is built to strengthen the contextual connection between the input subsequences of the LSTM networks at each stage. The system evaluates the collective emotion of the indoor public place environment as a whole by constructing an emotion atmosphere field, makes an independent risk decision for individual extreme emotions by strengthening their monitoring, integrates the prediction result of the emotion atmosphere field with the individual extreme emotional states, and calculates the risk level probability of an emergency in the indoor public place.
The invention discloses a multi-modal emotion emergency decision system based on a multi-stage LSTM, which comprises the following steps:

Step 1: Perform emotion estimation in continuous dimensions on the audio information. Preprocess the collected audio information and extract the audio modal feature sequence over a continuous frame sequence, recorded as X_a ∈ R^(T×d_a), where d_a is the length of the audio feature vector at each time instant and T is the size of the time dimension. Perform emotion classification on the audio modal feature sequence with the VGGish-13-based audio emotion perception model to obtain the emotion bias of the audio in the valence and arousal dimensions over the continuous frame sequence; the emotion estimates of the audio at time t in the valence and arousal dimensions are recorded as Y_t^{a,val} and Y_t^{a,arl}.

Step 2: Perform emotion estimation in continuous dimensions on the face information in the video. Extract the face information of the collected video frame by frame, preprocess the face images, and obtain the preprocessed face image feature sequence, recorded as X_v ∈ R^(T×d_v), where d_v is the length of the face feature vector at each time instant. Perform emotion classification on the preprocessed face image feature sequence with the ResNet-18-based face emotion perception model; the emotion estimates of the face information at time t in the valence and arousal dimensions are recorded as Y_t^{v,val} and Y_t^{v,arl}.

Step 3: Integrate the outputs of the audio emotion perception model and the face emotion perception model, and record the emotion bias of the audio and images over the continuous frame sequence in the valence dimension and the arousal dimension respectively. At time t, the emotion biases in the valence and arousal dimensions are recorded as x_t^{val} = (Y_t^{a,val}, Y_t^{v,val}) and x_t^{arl} = (Y_t^{a,arl}, Y_t^{v,arl}).
Step 4: Establish a multi-stage LSTM network, take the emotion bias of Step 3 as input, and perform decision-level fusion of the multi-modal emotion in the valence and arousal dimensions respectively.
Step 4 comprises the following steps:

Step 4.1: Take the single-dimension emotion bias from Step 3 as input and segment primary subsequences starting from the first frame to obtain the primary subsequences to be processed in that dimension. A primary subsequence comprises a target frame and a number of consecutive frames adjacent to and following the target frame; its length is recorded as timestep. In the multi-stage LSTM network, the input of the first-stage LSTM network consists of a number of such primary subsequences, and the number of sample primary subsequences in a single dimension is recorded as batchsize.

Step 4.2: Input the feature information of each frame in the single dimension into the first-stage LSTM network according to the temporal order of the primary subsequence, and obtain reference feature information through the first-stage LSTM network, as follows: receive the feature information of the t-th frame corresponding to time t, where time t is the current time; receive the hidden state information and cell state information output at time t-1; convert the hidden state information and cell state information output at time t-1 into the target hidden state information and target cell state information at time t-1; input the feature information of the t-th frame together with the target hidden state information and target cell state information at time t-1 into the LSTM module, and output the hidden state information and cell state information at time t through the LSTM node; compute the fused emotion prediction of the LSTM node at time t from its hidden state information at time t, denoted Y_t^{dim,1} with dim ∈ {val, arl}.

Train the primary subsequences together with the target outputs in the corresponding valence and arousal dimensions in the first-stage LSTM network, and keep the two optimal trained models for the valence and arousal dimensions respectively. Test the video collected in real time with the optimal models in the two dimensions. The outputs of the first-stage LSTM network at time t in the valence and arousal dimensions are recorded as Y_t^{val,1} and Y_t^{arl,1}.

Step 4.3: Take the output of Step 3 as input and segment secondary subsequences starting from the frame located timestep/2 frames after the first frame, obtaining the secondary subsequences to be processed in the single dimension. A secondary subsequence comprises a target frame and a number of consecutive frames adjacent to and following the target frame. The length of a secondary subsequence is the same as that of a primary subsequence, namely timestep. In the multi-stage LSTM network, the input of the second-stage LSTM network consists of a number of such secondary subsequences, and the number of sample secondary subsequences in a single dimension is batchsize.

Step 4.4: Input the feature information of each frame into the second-stage LSTM network according to the temporal order of the secondary subsequence, and obtain reference feature information through the second-stage LSTM network, as follows: receive the feature information of the t-th frame corresponding to time t, where time t is the current time; receive the hidden state information and cell state information output at time t-1; convert them into the target hidden state information and target cell state information at time t-1; input the feature information of the t-th frame together with the target hidden state information and target cell state information at time t-1 into the LSTM module, and output the hidden state information and cell state information at time t through the LSTM node; compute the fused emotion prediction of the LSTM node at time t from its hidden state information at time t, denoted Y_t^{dim,2}.

Train the secondary subsequences together with the target outputs in the corresponding valence and arousal dimensions in the second-stage LSTM network, and keep the two optimal trained models for the valence and arousal dimensions respectively. Test the video collected in real time with the optimal models in the two dimensions. The outputs of the second-stage LSTM network at time t in the valence and arousal dimensions are recorded as Y_t^{val,2} and Y_t^{arl,2}.

Step 5: Fuse the output Y_t^{dim,1} of the first-stage LSTM network at time t in the single dimension with the output Y_t^{dim,2} of the second-stage LSTM network in the time dimension to obtain the multi-stage LSTM network emotion fusion result Y_t^{dim} in that dimension. Because the nodes of each LSTM stage hold different hidden state information at the same time, the outputs of the two LSTM stages at different times are fused with different weights: nodes that have acquired long-term memory are given a higher weight, and nodes that have acquired only short-term memory are given a lower weight.

Step 6: Perform outlier processing on the multi-stage LSTM network emotion fusion result Y_t^{dim} at time t in the single dimension. To ensure that the multi-stage LSTM network emotion fusion results from time t-1 to time t+1 satisfy short-time continuity and to avoid large abrupt changes in the fusion result, take the fusion results Y_{t-1}^{dim}, Y_t^{dim} and Y_{t+1}^{dim} from time t-1 to time t+1; when Y_t^{dim} deviates from both Y_{t-1}^{dim} and Y_{t+1}^{dim} beyond the continuity conditions, Y_t^{dim} is identified as an outlier and corrected so that the sequence remains continuous over the three time instants.

Step 7: Carry out the multi-stage LSTM-based multi-modal emotion fusion in the valence dimension and the arousal dimension for each person in the indoor public place environment. The resulting individual emotion fusion results are recorded as {Y_i^t | i = 1, ..., N}, where N is the number of people captured on video in the public place environment, Y_i^t = (Y_i^{t,val}, Y_i^{t,arl}) is the emotion fusion result of person i at time t, and Y_i^{t,val} and Y_i^{t,arl} are the emotion fusion results in the valence and arousal dimensions respectively.

Step 8: Perform atmosphere field estimation, extreme emotion judgement and risk level judgement on the group emotion of the indoor public place according to the individual emotion fusion results.

Step 8 comprises the following steps:

Step 8.1: Divide the individual emotion estimation results into a normal emotional state and an extreme emotional state. For an individual emotion fusion result Y_i^t, when its emotion fusion result Y_i^{t,val} in the valence dimension falls into the extreme range, the emotion of that user is judged to belong to the extreme category, and the emotion fusion result Y_i^t is entered into the extreme emotion list.

Step 8.2: Establish an atmosphere field model, perceive the group emotional atmosphere of the public place from the normal or extreme emotional states of the individuals in the public place group over a period of time, calculate the estimation result of the emotion atmosphere field by integrating the group emotional atmosphere with the individual extreme emotions, and express the estimation result in the two-dimensional emotion model.

The two-dimensional emotion model has two mutually orthogonal dimensions, valence and arousal, and is used to represent the change of emotion intensity in continuous dimensions. The values of the valence and arousal dimensions represent the offset from negative to positive and from calm to excited respectively; both value ranges are [-3, 3], and the coordinates in the two-dimensional space represent different emotions.

Step 8.3: Build a risk model, calculate the risk level from the emotion atmosphere field estimation result, and output the corresponding risk level. When the public place environment is in a medium or high risk state, the system provides the corresponding emergency plan.

The risk model has a ring structure and performs the risk calculation from the emotion offsets in the different dimensions. According to the emotion atmosphere field estimation result, the risk levels are graded as 0 (no risk), 1 (general), 2 (relatively large), 3 (major) and 4 (particularly major). Based on the fused valence and arousal biases, when the valence value in the emotion atmosphere field estimation result is greater than 0, the risk level rank is recorded as level 0. In all other cases rank = ⌈sqrt((Y^{val})^2 + (Y^{arl})^2)⌉, i.e. the distance from the coordinates of the multi-modal emotion fusion result in the two-dimensional emotion model to the origin is calculated and rounded up to give the risk level.
The multi-modal emotion emergency decision system based on the multi-stage LSTM is developed and implemented on the PyCharm platform and comprises an emotion monitoring subsystem, a risk decision subsystem and a user interface.

The emotion monitoring subsystem comprises a sensor information interface module, an emotion estimation module and a result output module. The emotion estimation module calls the audio and face emotion estimation models for single-modal emotion recognition and calls the multi-stage LSTM network for multi-modal emotion fusion.

The sensor information interface module collects video information through a camera and a microphone and monitors the real-time environment of the indoor public place. The module transmits the video information to the user interface and separates the audio and image signals in the video, passing them to the emotion estimation module.

The emotion estimation module performs the corresponding data preprocessing on the received audio and image signals, calls the audio and face emotion estimation models for emotion recognition in the valence and arousal dimensions, and calls the multi-stage LSTM network for multi-modal emotion fusion, obtaining the real-time emotional states of the individuals in the indoor public place group in the valence and arousal dimensions.

The result output module outputs the individual emotion estimation results in the valence and arousal dimensions to the emotion recognition interface module of the risk decision subsystem and to the user interface.

The risk decision subsystem comprises an emotion recognition interface module and a risk decision calculation module; the risk decision calculation module contains the two-dimensional emotion model, the atmosphere field model and the risk model.

The risk decision calculation module estimates the group emotion atmosphere field according to the atmosphere field model, estimates the risk level of real-time events in the indoor public place according to the risk model, and outputs the risk estimation result to the scheduling module and the user interface.

The user interface displays the real-time environmental state of the current indoor public place, the emotion estimation results, the two-dimensional emotion model display and the risk assessment result, realizing real-time emotion monitoring and rapid emergency response for the indoor public place.
Beneficial effects:

1. To realize the multi-modal emotion fusion function, the disclosed multi-modal emotion emergency decision system based on the multi-stage LSTM extracts the context information of the audio and video modalities at different times through the LSTM nodes and establishes the temporal dependency between the emotions of the two modalities.

2. To solve the problem that the input subsequences of a conventional LSTM network lack contextual connection, the disclosed system constructs a multi-stage LSTM network so that every frame of emotion in an input subsequence can obtain state information from its context.

3. The disclosed system improves the efficiency of neural network training and emotion estimation by reducing the length of the subsequences at each stage; because no frame of emotion in an input subsequence loses context state information, the real-time performance of the system is improved while the accuracy of emotion estimation is maintained.

4. The disclosed system performs outlier processing on the emotion fusion result of the multi-stage LSTM network according to the short-time continuity of human emotion, avoiding large abrupt changes in the fusion result.

5. The disclosed system realizes real-time emotion monitoring and rapid emergency response in public places. By strengthening the monitoring of individual extreme emotions, calculating the emotion atmosphere field estimation result from the group emotional atmosphere together with the individual extreme emotions, and judging the risk level of indoor emergencies, it addresses the lack of attention to individual extreme emotions in existing crowd emotion estimation methods.

6. The disclosed system designs a two-dimensional emotion model according to the short-time continuity of human emotion and visually displays the group emotion change of the indoor public place.

7. The disclosed system designs a risk model based on emotion intensity; the algorithm has low time complexity and can realize real-time risk assessment of emergencies in the indoor public place environment.
Drawings
The invention will be further described with reference to the following examples and embodiments, in which:
FIG. 1 is a flow chart of a multi-modal emotion emergency decision system based on a multi-stage LSTM network.
Fig. 2 is a flow chart of personal audio and video emotion estimation proposed by the present invention.
Fig. 3 is a schematic diagram of a single stage LSTM network as proposed by the present invention.
Fig. 4 is a schematic diagram of a multi-stage LSTM network structure proposed by the present invention.
Fig. 5 is a flow chart of risk decision implementation in an embodiment of the present invention.
FIG. 6 is a diagram showing a two-dimensional model of emotion proposed by the present invention.
Fig. 7 is a schematic diagram of a risk model proposed by the present invention.
Fig. 8 is an architecture diagram of an emergency decision system according to the present invention.
FIG. 9 is a schematic diagram of a user interface of an emergency decision system in an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following further describes the present invention with reference to the accompanying drawings and examples.
As shown in fig. 1, the multi-stage LSTM-based multi-modal emotion emergency decision system disclosed in this embodiment includes two main components: personal audio-video emotion estimation and risk decision. The personal audio-video emotion estimation part is shown in fig. 2 and comprises a data acquisition module, an audio emotion recognition module, a face emotion recognition module and a multi-modal emotion recognition module. The system disclosed in this embodiment specifically includes the following steps:

Step 1: Data acquisition. Video information is collected using two sensors, a camera and a microphone, to monitor the real-time environment of the indoor public place. The audio signal collected by the microphone and the frame-by-frame image signal collected by the camera are separated, and emotion recognition is performed on the corresponding signals in Step 2 and Step 3 respectively.
Step 2: Perform single-modal emotion recognition on the audio information. The emotion recognition process based on the audio information comprises signal preprocessing, feature extraction and emotion recognition using the VGGish-13-based audio emotion perception model.
Step 2 comprises the following concrete implementation steps:

Step 2.1: Preprocess the audio signal extracted in Step 1 by zero padding and amplitude normalization (a sketch of this preprocessing is given after Step 2.2). Because the collected audio signals have different lengths, zero padding appends a blank segment to each audio clip so that all clips have the same total number of frames. Amplitude normalization raises the point of maximum amplitude in an audio segment to 0 dB and stretches the other points proportionally. Low-level descriptors such as frame energy, fundamental frequency, short-time jitter parameters and Mel-frequency cepstral coefficients are extracted with the OpenSMILE toolbox and converted into feature vectors through statistical operations such as mean, variance and regression coefficients. The audio is split into frames to obtain audio information over a continuous frame sequence, and the audio modal feature sequence over the continuous frame sequence is extracted and recorded as X_a ∈ R^(T×d_a), where d_a is the length of the audio feature vector at each time instant.

Step 2.2: Perform emotion classification on the audio modal feature sequence with the VGGish-13-based audio emotion perception model. The model comprises 4 VGG blocks; each block contains two convolutional layers and one pooling layer and uses the rectified linear unit as the activation function. The kernel size of the convolutional layers is set to 3 with stride 1, and the stride of the pooling layer is 2. The VGG blocks are followed by 4 fully connected layers and a softmax layer. The feature sequence output in Step 2.1 is used as the input of the audio emotion perception model, and emotion estimation is performed in the valence and arousal dimensions respectively. The emotion estimates of the audio at time t in the valence and arousal dimensions are recorded as Y_t^{a,val} and Y_t^{a,arl}.
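The zero-padding and amplitude-normalization part of Step 2.1 can be sketched as follows; the target length is an assumed parameter, and the subsequent OpenSMILE feature extraction is not reproduced here.

# Sketch of the Step 2.1 preprocessing: zero padding to a common length and
# amplitude normalization to a 0 dB peak. The target length is an assumption.
import numpy as np

def preprocess_audio(signal: np.ndarray, target_len: int) -> np.ndarray:
    # Zero padding: append a silent segment so every clip has the same number of samples.
    if len(signal) < target_len:
        signal = np.concatenate([signal, np.zeros(target_len - len(signal))])
    else:
        signal = signal[:target_len]
    # Amplitude normalization: raise the loudest sample to 0 dB (|x| = 1) and
    # stretch all other samples proportionally.
    peak = np.max(np.abs(signal))
    if peak > 0:
        signal = signal / peak
    return signal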
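A minimal PyTorch sketch of a VGGish-13-style model with the layout stated in Step 2.2 (4 VGG blocks of two 3×3, stride-1 convolutions plus a stride-2 pooling layer, followed by 4 fully connected layers and softmax) is given below; the channel widths, input feature-map size and number of output classes are assumptions for illustration.

# Sketch of a VGGish-13-style audio emotion model as described in Step 2.2.
# Channel widths, input size and class count are illustrative assumptions.
import torch
import torch.nn as nn

def vgg_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1), nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=2, stride=2),
    )

class VGGishEmotion(nn.Module):
    def __init__(self, num_classes=5, feat_hw=(64, 64)):
        super().__init__()
        self.features = nn.Sequential(
            vgg_block(1, 64), vgg_block(64, 128), vgg_block(128, 256), vgg_block(256, 256)
        )
        flat = 256 * (feat_hw[0] // 16) * (feat_hw[1] // 16)  # four poolings halve each side
        self.classifier = nn.Sequential(
            nn.Linear(flat, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 256), nn.ReLU(inplace=True),
            nn.Linear(256, 64), nn.ReLU(inplace=True),
            nn.Linear(64, num_classes),
            nn.Softmax(dim=1),
        )

    def forward(self, x):            # x: (batch, 1, H, W) audio feature map
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)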
Step 3: Perform single-modal emotion recognition on the face information in the images. The emotion recognition process based on the face information comprises face framing and cropping, image preprocessing and emotion recognition using the ResNet-18-based face emotion perception model.
Step 3 comprises the following concrete implementation steps:

Step 3.1: For face framing and cropping, the image signal extracted in Step 1 is searched frame by frame for face information, and the RetinaFace target detection network is used to recognize and crop valid facial regions in a complex environment. For face image preprocessing, bilinear interpolation is used for image scaling, and graying together with gray-level equalization is used to enhance the overall contrast of the face image. The preprocessed face image feature sequence is obtained and recorded as X_v ∈ R^(T×d_v), where d_v is the length of the face feature vector at each time instant.

Step 3.2: Perform emotion classification on the preprocessed images with the ResNet-18-based face emotion perception model. The model consists of 4 residual blocks, and each residual block is repeated twice. The principle of the residual block is given by:

y_k = F(x_k, {W_k}) + h(x_k)

where y_k and x_k are the output and input vector matrices of the k-th layer, F(x_k, {W_k}) is the residual function obtained by model training, and h(x_k) is a linear projection used to match the dimensions of F(x_k, {W_k}) and the input x_k. The residual blocks are followed by an average pooling layer and a fully connected layer. The face image feature sequence output in Step 3.1 is used as the input of the face emotion perception model, and emotion estimation is performed in the valence and arousal dimensions respectively. The emotion estimates of the face information at time t in the valence and arousal dimensions are recorded as Y_t^{v,val} and Y_t^{v,arl}.
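The residual block y_k = F(x_k, {W_k}) + h(x_k) can be sketched in PyTorch as follows; the channel counts and the use of batch normalization are assumptions, and h is realized as a 1×1 projection only when the input and output shapes differ.

# Sketch of the residual block y_k = F(x_k, {W_k}) + h(x_k) from Step 3.2.
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.F = nn.Sequential(                       # residual function F(x_k, {W_k})
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        if stride != 1 or in_ch != out_ch:            # h(x_k): projection to match shapes
            self.h = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )
        else:
            self.h = nn.Identity()
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.F(x) + self.h(x))      # y_k = F(x_k) + h(x_k)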
Step 4: Integrate the outputs of the audio emotion perception model and the face emotion perception model in the valence dimension and the arousal dimension respectively, and record the emotion bias of the audio and images over the continuous frame sequence in each single dimension. For time t, the outputs of the audio and face emotion perception models in the valence dimension are Y_t^{a,val} and Y_t^{v,val}, and the emotion bias of the captured user in the valence dimension at this time is recorded as x_t^{val} = (Y_t^{a,val}, Y_t^{v,val}). Similarly, the emotion bias of the user in the arousal dimension at time t is recorded as x_t^{arl} = (Y_t^{a,arl}, Y_t^{v,arl}).
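A small sketch of this pairing step (Step 4), assuming the per-dimension estimates are held as NumPy arrays of length T; the dictionary layout is an illustrative convention, not prescribed by the patent.

# Pair audio and face estimates frame by frame into a two-channel input per dimension.
import numpy as np

def build_bias_sequences(audio_est: dict, face_est: dict) -> dict:
    # audio_est / face_est: {"val": array of length T, "arl": array of length T}
    return {
        dim: np.stack([audio_est[dim], face_est[dim]], axis=-1)   # shape (T, 2)
        for dim in ("val", "arl")
    }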
Step 5: Establish a multi-stage LSTM network consisting of two single-stage LSTM networks, take the emotion bias of Step 4 as input, and perform decision-level fusion of the multi-modal emotion in the valence and arousal dimensions respectively.
Step 5 comprises the following steps:

Step 5.1: Take the single-dimension emotion bias from Step 4 as input and segment primary subsequences starting from the first frame to obtain the primary subsequences to be processed in that dimension. A primary subsequence comprises a target frame x_t^{dim} and a number of consecutive frames adjacent to and following the target frame, and is recorded as (x_t^{dim}, x_{t+1}^{dim}, ..., x_{t+timestep-1}^{dim}); its length is timestep. In the multi-stage LSTM network, the input of the first-stage LSTM network consists of a number of such primary subsequences, and the number of sample primary subsequences in a single dimension is recorded as batchsize.

Step 5.2: Input the feature information of each frame in the single dimension into the first-stage LSTM network shown in fig. 3 according to the temporal order of the primary subsequence. Through the first-stage LSTM network, the feature information corresponding to the target frame x_t^{dim} and the feature information corresponding to the adjacent frames x_{t+1}^{dim}, x_{t+2}^{dim}, ... are extracted, giving the feature information sequence corresponding to the primary subsequence. At time t, the LSTM node receives the feature information of the t-th frame corresponding to the current time together with the hidden state information h_{t-1} and the cell state information c_{t-1} output at time t-1, and converts them into the target hidden state information and target cell state information at time t-1. The feature information of the t-th frame, the target hidden state information at time t-1 and the target cell state information are input into the LSTM module, which outputs the hidden state information h_t and cell state information c_t at time t. Finally, the fused emotion prediction y_t at time t is calculated from the hidden state information h_t at time t.
The specific working principle is as follows:
f_t = σ(W_xf · x_t + W_hf · h_{t-1} + b_f)          (1)

i_t = σ(W_xi · x_t + W_hi · h_{t-1} + b_i)          (2)

g_t = tanh(W_xg · x_t + W_hg · h_{t-1} + b_g)       (3)

c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t                     (4)

o_t = σ(W_xo · x_t + W_ho · h_{t-1} + b_o)          (5)

h_t = o_t ⊙ tanh(c_t)                               (6)

y_t = W_hy · h_t + b_y                              (7)
where W_xf, W_hf, W_xg, W_hg, W_xi, W_hi, W_xo, W_ho and W_hy are the weight parameters in the LSTM module, b_f, b_g, b_i, b_o and b_y are the bias terms in the LSTM module, and these parameters are obtained through model training.
Equation (1) is the forgetting gate, which receives the memory information and determines which part of the memory to retain and which to forget; the forgetting factor f_t represents the selection weight applied to the target cell state information c_{t-1} output at time t-1.

Equations (2) and (3) form the input gate, which selects the information to be memorized: i_t represents the selection weight of the temporary cell state information g_t at time t, and g_t is the temporary cell state information at time t. In Equation (4), f_t ⊙ c_{t-1} is the information to be discarded and i_t ⊙ g_t is the information to be memorized; the cell state information c_t at time t is obtained from these two parts.

Equations (5) and (6) form the output gate, which outputs the hidden state information h_t at time t; o_t represents the selection weight of the cell state information at time t. Equation (7) calculates the fused emotion prediction y_t from h_t.

Train the primary subsequences together with the target outputs in the corresponding valence and arousal dimensions in the first-stage LSTM network, and keep the two optimal trained models for the valence and arousal dimensions respectively. Test the video collected in real time with the optimal models in the two dimensions. The outputs of the first-stage LSTM network at time t in the valence and arousal dimensions are recorded as Y_t^{val,1} and Y_t^{arl,1}.
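A NumPy sketch of a single LSTM step following equations (1)-(7) is given below; the parameter dictionary layout is illustrative, and the weights would be obtained by training as stated above.

# One LSTM step following equations (1)-(7); p holds the weight matrices and bias terms.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """p holds W_x*, W_h*, b_* for the forget (f), input (i), candidate (g) and output (o)
    gates plus the prediction layer (W_hy, b_y)."""
    f_t = sigmoid(p["W_xf"] @ x_t + p["W_hf"] @ h_prev + p["b_f"])      # (1) forget gate
    i_t = sigmoid(p["W_xi"] @ x_t + p["W_hi"] @ h_prev + p["b_i"])      # (2) input gate
    g_t = np.tanh(p["W_xg"] @ x_t + p["W_hg"] @ h_prev + p["b_g"])      # (3) candidate state
    c_t = f_t * c_prev + i_t * g_t                                      # (4) cell state update
    o_t = sigmoid(p["W_xo"] @ x_t + p["W_ho"] @ h_prev + p["b_o"])      # (5) output gate
    h_t = o_t * np.tanh(c_t)                                            # (6) hidden state
    y_t = p["W_hy"] @ h_t + p["b_y"]                                    # (7) fused emotion prediction
    return y_t, h_t, c_t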
Step 5.3: Take the output of Step 4 as input and segment secondary subsequences starting from the frame located timestep/2 frames after the first frame, obtaining the secondary subsequences to be processed in the single dimension. A secondary subsequence comprises a target frame and a number of consecutive frames adjacent to and following the target frame. The length of a secondary subsequence is the same as that of a primary subsequence, namely timestep. In the multi-stage LSTM network, the input of the second-stage LSTM network consists of a number of such secondary subsequences, and the number of sample secondary subsequences in a single dimension is the same as for the primary subsequences, namely batchsize (the segmentation of both levels is sketched after Step 5.4).

Step 5.4: Input the feature information of each frame into the second-stage LSTM network according to the temporal order of the secondary subsequence. The structure and working principle of the second-stage LSTM network are the same as those of the first-stage LSTM network. Train the secondary subsequences together with the target outputs in the corresponding valence and arousal dimensions in the second-stage LSTM network, and keep the two optimal trained models for the valence and arousal dimensions respectively. Test the video collected in real time with the optimal models in the two dimensions. The outputs of the second-stage LSTM network at time t in the valence and arousal dimensions are recorded as Y_t^{val,2} and Y_t^{arl,2}.
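The primary and secondary subsequence segmentation (Steps 5.1 and 5.3) can be sketched as follows; non-overlapping windows and a second-stage start offset of timestep/2 are assumptions consistent with the description above.

# Window segmentation for both LSTM stages; the stride and offset are assumed.
import numpy as np

def segment(seq: np.ndarray, timestep: int, offset: int = 0):
    """seq: (T, 2) per-dimension emotion bias; returns windows of shape (n, timestep, 2)."""
    windows = [seq[s:s + timestep] for s in range(offset, len(seq) - timestep + 1, timestep)]
    return np.stack(windows) if windows else np.empty((0, timestep, seq.shape[-1]))

# first-stage input:  segment(seq, timestep, offset=0)
# second-stage input: segment(seq, timestep, offset=timestep // 2)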
Step 6: Fuse the output Y_t^{dim,1} of the first-stage LSTM network at time t in the single dimension with the output Y_t^{dim,2} of the second-stage LSTM network in the time dimension to obtain the multi-stage LSTM network emotion fusion result Y_t^{dim} in that dimension, as shown in fig. 4. Because the nodes of each LSTM stage hold different hidden state information at the same time, the outputs of the two LSTM stages at different times are fused with different weights.
Step 6 comprises the following steps:

Step 6.1: To ensure that the fusion result satisfies the short-time continuity of emotion, a corresponding weight is assigned to the fused emotion output by each subsequence of the first-stage and second-stage LSTM networks. The nodes in the first timestep/2 positions of each subsequence only hold short-term memory, so their outputs are given a weight of 0.1; the nodes in the last timestep/2 positions hold both long-term and short-term memory, so their outputs are given a weight of 0.9.

Step 6.2: Fuse the output Y_t^{dim,1} of the first-stage LSTM network and the output Y_t^{dim,2} of the second-stage LSTM network in the single dimension in the time dimension, taking the input time of the first-stage LSTM network as the reference time. For any time t_1, the multi-stage LSTM network emotion fusion result Y_{t_1}^{dim} is the weighted combination of Y_{t_1}^{dim,1} and Y_{t_1}^{dim,2}: the output of the stage whose node at time t_1 lies in the second half of its subsequence, and therefore holds long-term memory, receives the weight 0.9, and the output of the other stage receives the weight 0.1.
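The time-dimension fusion of Step 6 can be sketched as follows; the exact case split by window half is an assumption made to match the rule that the stage holding long-term memory at a given time receives the 0.9 weight.

# Weighted fusion of the two LSTM stages at time t (assumed case split).
def fuse_two_stages(y1, y2, t, timestep):
    """y1, y2: outputs of the first- and second-stage LSTM at time t in one dimension."""
    in_second_half_of_stage1_window = (t % timestep) >= timestep // 2
    if in_second_half_of_stage1_window:
        return 0.9 * y1 + 0.1 * y2   # stage-1 node has accumulated long-term memory here
    return 0.1 * y1 + 0.9 * y2       # otherwise the offset stage-2 node carries the context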
And 7: multi-stage LSTM network emotion fusion result at t moment under single dimension
Figure BDA00037842517600001110
And (5) carrying out outlier processing. In order to ensure that the multi-stage LSTM network emotion fusion result from the time t-1 to the time t +1 meets the short-time continuity and avoid the large-amplitude mutation of the emotion fusion result, the emotion fusion result from the time t-1 to the time t +1 is taken
Figure BDA00037842517600001111
When it is satisfied with
Figure BDA00037842517600001112
And is
Figure BDA00037842517600001113
When the temperature of the water is higher than the set temperature,
Figure BDA00037842517600001114
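A sketch of the outlier handling in Step 7 follows; the threshold value and the mean-of-neighbours correction are assumptions, since the text only specifies that abrupt deviations from both neighbours are suppressed.

# Smooth a fused value that jumps away from both of its temporal neighbours.
def smooth_outlier(y_prev, y_curr, y_next, threshold=0.5):
    if abs(y_curr - y_prev) > threshold and abs(y_curr - y_next) > threshold:
        return 0.5 * (y_prev + y_next)   # assumed correction keeping short-time continuity
    return y_curr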
Step 8: Carry out the multi-stage LSTM-based multi-modal emotion fusion in the valence dimension and the arousal dimension for each person in the indoor public place environment. The resulting individual emotion fusion results are recorded as {Y_i^t | i = 1, ..., N}, where N is the number of people captured on video in the public place environment, Y_i^t = (Y_i^{t,val}, Y_i^{t,arl}) is the emotion fusion result of person i at time t, and Y_i^{t,val} and Y_i^{t,arl} are the emotion fusion results in the valence and arousal dimensions respectively.
Step 9: Carry out the risk decision on the group emotion in the indoor public place according to the individual emotion fusion results, including atmosphere field estimation, extreme emotion judgement and risk level judgement. The implementation flow is shown in fig. 5.
Step 9 is implemented as follows:

Step 9.1: Divide the individual emotion estimation results into a normal emotional state and an extreme emotional state. For an individual emotion fusion result Y_i^t, when its emotion fusion result Y_i^{t,val} in the valence dimension falls into the extreme range, the emotion of that user is judged to belong to the extreme category, and the emotion fusion result Y_i^t is entered into the extreme emotion list.

Step 9.2: Establish an atmosphere field model, perceive the group emotional atmosphere of the public place from the normal or extreme emotional states of the individuals in the public place group over a period of time, calculate the emotion atmosphere field estimation result by integrating the group emotional atmosphere with the individual extreme emotions, and express the estimation result in the two-dimensional emotion model.

The two-dimensional emotion model is shown in fig. 6; it has two mutually orthogonal dimensions, valence and arousal, and is used to represent the change of emotion intensity in continuous dimensions. The values of the valence and arousal dimensions represent the offset from negative to positive and from calm to excited respectively; both value ranges are [-3, 3], and the coordinates in the two-dimensional space represent different emotions.

Step 9.3: Build a risk model, calculate the risk level from the emotion atmosphere field estimation result, and output the corresponding risk level. When the public place environment is in a medium or high risk state, the system provides the corresponding emergency plan.

The risk model is shown in fig. 7; it has a ring structure and performs the risk calculation from the emotion offsets in the different dimensions. According to the emotion atmosphere field estimation result, the risk levels are graded as 0 (no risk), 1 (general), 2 (relatively large), 3 (major) and 4 (particularly major). Based on the fused valence and arousal biases, when the valence value in the emotion atmosphere field estimation result is greater than 0, the risk level rank is recorded as level 0. In all other cases rank = ⌈sqrt((Y^{val})^2 + (Y^{arl})^2)⌉, i.e. the distance from the coordinates of the multi-modal emotion fusion result in the two-dimensional emotion model to the origin is calculated and rounded up to give the risk level.
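The extreme-emotion screening and the ring-shaped risk grading can be sketched as follows; the extreme-valence threshold and the cap at level 4 are assumptions made for illustration.

# Screen extreme individuals and map a fused (valence, arousal) point to a risk level.
import math

def extreme_individuals(results, val_threshold=-2.0):
    """results: {person_id: (valence, arousal)}; returns ids judged extreme (assumed threshold)."""
    return [pid for pid, (val, _) in results.items() if val <= val_threshold]

def risk_level(valence: float, arousal: float) -> int:
    if valence > 0:
        return 0                                    # positive atmosphere: no risk
    rank = math.ceil(math.sqrt(valence ** 2 + arousal ** 2))
    return min(rank, 4)                             # assumed cap at level 4 (particularly major)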
As shown in fig. 8, the disclosed multi-modal emotion emergency decision system based on the multi-stage LSTM is developed and implemented on the PyCharm platform and comprises an emotion monitoring subsystem, a risk decision subsystem and a user interface.

The emotion monitoring subsystem comprises a sensor information interface module, an emotion estimation module and a result output module. The emotion estimation module calls the audio and face emotion estimation models for single-modal emotion recognition and calls the multi-stage LSTM network for multi-modal emotion fusion.

The sensor information interface module collects video information through a camera and a microphone and monitors the real-time environment of the indoor public place. The module transmits the video information to the user interface and separates the audio and image signals in the video, passing them to the emotion estimation module.

The emotion estimation module performs the corresponding data preprocessing on the received audio and image signals, calls the audio and face emotion estimation models for emotion recognition in the valence and arousal dimensions, and calls the multi-stage LSTM network for multi-modal emotion fusion, obtaining the real-time emotional states of the individuals in the indoor public place group in the valence and arousal dimensions.

The result output module outputs the individual emotion estimation results in the valence and arousal dimensions to the emotion recognition interface module of the risk decision subsystem and to the user interface.

The risk decision subsystem comprises an emotion recognition interface module and a risk decision calculation module; the risk decision calculation module contains the two-dimensional emotion model, the atmosphere field model and the risk model.

The risk decision calculation module estimates the group emotion atmosphere field according to the atmosphere field model, estimates the risk level of real-time events in the indoor public place according to the risk model, and outputs the risk estimation result to the scheduling module and the user interface.

The user interface is shown in fig. 9; it displays the real-time environmental state of the current indoor public place, the emotion estimation results, the two-dimensional emotion model display and the risk assessment result, realizing real-time emotion monitoring and rapid emergency response for the indoor public place.
The above detailed description is further intended to illustrate the objects, technical solutions and advantages of the present invention, and it should be understood that the above detailed description is only an example of the present invention and should not be used to limit the scope of the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (5)

1. A multi-modal emotion emergency decision system based on a multi-stage LSTM, characterized in that it comprises the following steps:

step 1: perform emotion estimation in continuous dimensions on the audio information: preprocess the collected audio information, extract the audio modal feature sequence over a continuous frame sequence, perform emotion classification on the audio modal feature sequence with a VGGish-13-based audio emotion perception model, and obtain the emotion bias of the audio in the valence and arousal dimensions over the continuous frame sequence;

step 2: perform emotion estimation in continuous dimensions on the face information in the video: extract the face information of the collected video frame by frame, preprocess the face images to obtain the preprocessed face image feature sequence, perform emotion classification on the preprocessed face image feature sequence with a ResNet-18-based face emotion perception model, and obtain the emotion bias of the face images in the valence and arousal dimensions over the continuous frame sequence;

step 3: integrate the outputs of the audio emotion perception model and the face emotion perception model, and record the emotion biases of the audio and images over the continuous frame sequence in the valence dimension and the arousal dimension respectively;

step 4: establish a multi-stage LSTM network and perform decision-level fusion of the multi-modal emotion in the valence dimension and the arousal dimension respectively;

step 5: fuse the output of the first-stage LSTM network and the output of the second-stage LSTM network in the time dimension; because the nodes of each LSTM stage hold different hidden state information at the same time, assign different weights to the outputs of the two LSTM stages at different times for fusion, and obtain the multi-stage LSTM network emotion fusion result in a single dimension;

step 6: perform outlier processing on the multi-stage LSTM network emotion fusion result in the single dimension;

step 7: according to steps 1 to 6, carry out the multi-stage LSTM-based multi-modal emotion fusion in the valence dimension and the arousal dimension for the people in the indoor public place environment;

step 8: carry out the risk decision on the group emotion in the indoor public place according to the individual emotion fusion results, including atmosphere field estimation, extreme emotion judgement and risk level judgement.
2. The multi-stage LSTM-based multi-modal emotion emergency decision system of claim 1, wherein: in order to improve the efficiency of neural network training and emotion estimation, in step 4, the real-time performance of the system is improved while the accuracy of emotion estimation is guaranteed by reducing the length of each level of subsequence.
3. The multi-stage LSTM-based multi-modal emotion emergency decision system of claim 2, wherein: in order to solve the problem that the conventional LSTM network input subsequence lacks context relation, in step 5, different weights are given to output results of two stages of LSTM networks at different moments for fusion according to the fact that nodes of all stages of LSTM networks at the same moment have different hidden state information, and a multi-stage LSTM network emotion fusion result is obtained; the method comprises the following concrete steps:
step 5.1: in order to ensure that the fusion result meets the short-time continuity of the emotion, corresponding weight is given to the fusion emotion output by each subsequence of the first-level LSTM network and the second-level LSTM network; because the node at the front timing/2 of each subsequence only has short-term memory, the weight of 0.1 is given to the output result, and because the node at the rear timing/2 has long-term memory and short-term memory at the same time, the weight of 0.9 is given to the output result;
step 5.2: fusing, in the time dimension, the output result of the first-level LSTM network in a single dimension [formula FDA0003784251750000021] and the output result of the second-level LSTM network [formula FDA0003784251750000022]: taking the input time of the first-level LSTM network as the reference time, for the emotion fusion result [formula FDA0003784251750000023] at any time t1, the multi-level LSTM network emotion fusion result is [formula FDA0003784251750000024]; when [formula FDA0003784251750000025], the multi-level LSTM network emotion fusion result is [formula FDA0003784251750000026].
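The exact fusion formulas of step 5.2 are present in the source only as formula images; the sketch below therefore only illustrates, under stated assumptions, the 0.1/0.9 weighting of step 5.1: within each subsequence the first timing/2 steps are weighted 0.1 and the remaining steps 0.9, and the two weighted levels are combined by a normalised sum. The function name, the subsequence lengths and the combination rule are assumptions, not the patented formulas.

import torch

def fuse_two_levels(y1: torch.Tensor, y2: torch.Tensor,
                    sub_len1: int = 8, sub_len2: int = 8) -> torch.Tensor:
    # y1, y2: per-step outputs of the first- and second-level LSTMs, shape (time,).
    t = torch.arange(y1.shape[0])
    # Position-dependent weight (claim 3, step 5.1): the first sub_len/2 nodes of a
    # subsequence carry only short-term memory (weight 0.1), the remaining nodes
    # carry both long- and short-term memory (weight 0.9).
    w1 = torch.where((t % sub_len1) < sub_len1 // 2, torch.tensor(0.1), torch.tensor(0.9))
    w2 = torch.where((t % sub_len2) < sub_len2 // 2, torch.tensor(0.1), torch.tensor(0.9))
    # Normalised weighted combination of the two levels; the exact rule of step 5.2
    # is only available as formula images, so this combination is an assumption.
    return (w1 * y1 + w2 * y2) / (w1 + w2)

# Usage with the TwoLevelFusionLSTM sketch above (one sequence, valence dimension):
# fused = fuse_two_levels(y1[0], y2[0], sub_len1=8, sub_len2=8)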
4. The multi-stage LSTM-based multi-modal emotion emergency decision system of claim 3, wherein: in order to avoid large, abrupt changes in the emotion fusion result, in step 6 outlier processing is performed on the multi-level LSTM network emotion fusion result according to the short-time continuity of human emotion (an outlier-processing sketch is given after this claim); the concrete implementation steps are as follows:
taking the emotion fusion results from time t-1 to time t+1 [formula FDA0003784251750000027]; when [formula FDA0003784251750000028] and [formula FDA0003784251750000029] are satisfied, [formula FDA00037842517500000210].
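The outlier conditions of step 6 are likewise only available as formula images in the source; the sketch below is a minimal stand-in built on the same short-time-continuity idea: a fused value that jumps away from both of its neighbours (times t-1 and t+1) by more than a threshold is replaced by the neighbours' mean. Both the threshold and the replacement rule are assumptions, not the patented conditions.

import torch

def smooth_outliers(y: torch.Tensor, threshold: float = 1.0) -> torch.Tensor:
    # y: multi-level LSTM emotion fusion result in one dimension, shape (time,).
    y = y.clone()
    for t in range(1, y.shape[0] - 1):
        # Treat y[t] as an outlier if it deviates from both neighbours by more
        # than `threshold` (assumed criterion) and replace it by their mean.
        if abs(y[t] - y[t - 1]) > threshold and abs(y[t] - y[t + 1]) > threshold:
            y[t] = 0.5 * (y[t - 1] + y[t + 1])
    return y

# Usage: smoothed = smooth_outliers(fused, threshold=1.0)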
5. The multi-stage LSTM-based multi-modal emotion emergency decision system of claim 4, wherein: in step 8, in order to strengthen the monitoring of individual extreme emotions, an atmosphere-field model is established to address the fact that existing crowd-emotion estimation pays little attention to individual extreme emotions; an emotion two-dimensional model is designed according to the short-time continuity of human emotion; and a risk model is designed within the emotion two-dimensional model according to emotion intensity; the concrete implementation steps are as follows:
step 8.1: will be a single personThe emotion estimation result is divided into a normal emotion state and an extreme emotion state, and the single emotion is
Figure FDA00037842517500000211
Emotional fusion results in medium potency dimension
Figure FDA00037842517500000212
And satisfy
Figure FDA00037842517500000213
Then, the emotion of the user is judged to belong to the extreme category, and the emotion is fused into a result Y i t Entering an extreme emotion list;
step 8.2: establishing the atmosphere-field model, perceiving the group emotional atmosphere of the public place from the normal or extreme emotional states of the individuals in the public-place crowd over a period of time, calculating the emotional-atmosphere-field estimation result by combining the group emotional atmosphere with individual extreme emotions, and expressing the estimation result in the emotion two-dimensional model;
the emotion two-dimensional model has two mutually orthogonal dimensions, valence and arousal, and is used to represent changes of emotion intensity in continuous dimensions; the values of the valence and arousal dimensions represent the offset from negative to positive and from calm to excited respectively, both with a value range of [-3, 3], and coordinates in the two-dimensional space spanned by valence and arousal represent different emotions;
step 8.3: establishing the risk model, calculating the risk level according to the emotional-atmosphere-field estimation result, outputting the corresponding risk level, and having the system provide the corresponding emergency plan when the public-place environment is in a medium- or high-risk state;
the risk model has a ring structure and performs the risk calculation based on the emotion offsets in the different dimensions; the risk levels are recorded as level 0 (no risk), level 1 (general), level 2 (relatively large), level 3 (severe) and level 4 (extremely severe) according to the emotional-atmosphere-field estimation result; when the valence value in the emotional-atmosphere-field estimation result is greater than 0, the risk level is recorded as level 0; in the other cases the risk level is calculated from the emotion fusion offsets in valence and arousal [formula FDA0003784251750000031], that is, the distance from the coordinates of the multi-modal emotion fusion result in the emotion two-dimensional model to the origin is calculated, and this distance, rounded up, is recorded as the risk level (a minimal risk-level sketch is given after this claim).
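The risk formula of step 8.3 appears in the source only as a formula image; the sketch below implements the rule as described in the text: level 0 whenever the valence of the atmosphere-field estimate is positive, otherwise the distance from the fused (valence, arousal) point to the origin, rounded up. Clamping the result to level 4 is an added assumption (the distance can reach about 4.24 at the corners of the [-3, 3] range), and the function and label names are illustrative.

import math

RISK_LABELS = {0: "no risk", 1: "general", 2: "relatively large",
               3: "severe", 4: "extremely severe"}

def risk_level(valence: float, arousal: float) -> int:
    # valence, arousal: emotional-atmosphere-field estimate, each in [-3, 3].
    if valence > 0:
        return 0                                # positive group emotion -> no risk
    distance = math.hypot(valence, arousal)     # distance to the origin of the emotion model
    return min(4, math.ceil(distance))          # rounded up; cap at level 4 (assumption)

# Example: a strongly negative, highly aroused atmosphere-field estimate.
level = risk_level(valence=-2.5, arousal=2.5)   # distance ≈ 3.54 -> level 4
print(level, RISK_LABELS[level])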
CN202210941178.0A 2022-08-05 2022-08-05 Multi-modal emotion emergency decision system based on multi-stage long and short term memory network Pending CN115393927A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210941178.0A CN115393927A (en) 2022-08-05 2022-08-05 Multi-modal emotion emergency decision system based on multi-stage long and short term memory network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210941178.0A CN115393927A (en) 2022-08-05 2022-08-05 Multi-modal emotion emergency decision system based on multi-stage long and short term memory network

Publications (1)

Publication Number Publication Date
CN115393927A true CN115393927A (en) 2022-11-25

Family

ID=84118182

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210941178.0A Pending CN115393927A (en) 2022-08-05 2022-08-05 Multi-modal emotion emergency decision system based on multi-stage long and short term memory network

Country Status (1)

Country Link
CN (1) CN115393927A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116521303A (en) * 2023-07-04 2023-08-01 四川易诚智讯科技有限公司 Dynamic display method and system of emergency plan based on multi-source data fusion
CN116521303B (en) * 2023-07-04 2023-09-12 四川易诚智讯科技有限公司 Dynamic display method and system of emergency plan based on multi-source data fusion

Similar Documents

Publication Publication Date Title
CN111275085B (en) Online short video multi-modal emotion recognition method based on attention fusion
Oliver et al. Layered representations for human activity recognition
CN110956953B (en) Quarrel recognition method based on audio analysis and deep learning
CN112699774B (en) Emotion recognition method and device for characters in video, computer equipment and medium
CN111310672A (en) Video emotion recognition method, device and medium based on time sequence multi-model fusion modeling
CN108549841A (en) A kind of recognition methods of the Falls Among Old People behavior based on deep learning
CN111626116B (en) Video semantic analysis method based on fusion of multi-attention mechanism and Graph
CN111626199B (en) Abnormal behavior analysis method for large-scale multi-person carriage scene
CN112766172A (en) Face continuous expression recognition method based on time sequence attention mechanism
CN108563624A (en) A kind of spatial term method based on deep learning
CN112183334B (en) Video depth relation analysis method based on multi-mode feature fusion
CN110232564A (en) A kind of traffic accident law automatic decision method based on multi-modal data
CN116564338A (en) Voice animation generation method, device, electronic equipment and medium
CN116975776A (en) Multi-mode data fusion method and device based on tensor and mutual information
CN115393927A (en) Multi-modal emotion emergency decision system based on multi-stage long and short term memory network
CN115527271A (en) Elevator car passenger abnormal behavior detection system and method
Dissanayake et al. Utalk: Sri Lankan sign language converter mobile app using image processing and machine learning
CN113469023B (en) Method, apparatus, device and storage medium for determining alertness
Abdul-Ameer et al. Development smart eyeglasses for visually impaired people based on you only look once
CN114694254B (en) Method and device for detecting and early warning robbery of articles in straight ladder and computer equipment
Rony et al. An effective approach to communicate with the deaf and mute people by recognizing characters of one-hand bangla sign language using convolutional neural-network
Salam et al. You Only Look Once (YOLOv3): Object Detection and Recognition for Indoor Environment
CN117576279B (en) Digital person driving method and system based on multi-mode data
Yang et al. GME-dialogue-NET: gated multimodal sentiment analysis model based on fusion mechanism
CN111813943B (en) Multitask classification disambiguation method and device based on generative countermeasure network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination