CN115393927A - Multi-modal emotion emergency decision system based on a multi-stage long short-term memory network - Google Patents

Multi-modal emotion emergency decision system based on a multi-stage long short-term memory network

Info

Publication number: CN115393927A
Application number: CN202210941178.0A
Authority: CN (China)
Prior art keywords: emotion, level, fusion, LSTM, risk
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 戴亚平, 陈奕杉, 廖天睿, 邵帅
Current assignee: Beijing Institute of Technology (BIT)
Original assignee: Beijing Institute of Technology (BIT)
Application filed by Beijing Institute of Technology BIT
Priority to CN202210941178.0A
Publication of CN115393927A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 Classification, e.g. identification
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Psychiatry (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Child & Adolescent Psychology (AREA)
  • Hospice & Palliative Care (AREA)
  • Biomedical Technology (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a multi-modal emotion emergency decision system based on a multi-stage Long Short-Term Memory (LSTM) network, belonging to the field of multi-modal emotion analysis in natural language processing. The system monitors group emotion in an indoor public place, integrates the group emotion atmosphere field of the public place with individual extreme emotions, and performs risk assessment of emergencies in the scene. The implementation method comprises: performing emotion recognition on the audio and image information; establishing a multi-stage LSTM network and performing decision-level fusion that exploits the temporal relevance of the multi-modal information; fusing the outputs of the LSTM stages in the time dimension; performing outlier processing on the fusion result; strengthening the monitoring of individual extreme emotions, constructing an emotion atmosphere field to evaluate the collective emotion of the public place environment as a whole, integrating the prediction result of the emotion atmosphere field with the individual extreme emotional states, and calculating the risk level probability of an emergency in the public place.

Description

Multi-modal emotion emergency decision system based on a multi-stage long short-term memory network
Technical Field
The invention belongs to the field of multi-modal emotion analysis in natural language processing, and particularly relates to a multi-modal emotion emergency decision system based on a long short-term memory (LSTM) network.
Background
Emotion recognition is one of the important research topics in natural language processing and a prerequisite for computers to understand and express emotion. Under the rapid growth of online information, video, which combines image and speech characteristics, contains rich information and is easy to acquire, so it has become a main information source for emotion recognition. Single-modal emotion recognition can only attend to local features, whereas a multi-modal emotion fusion model can integrate information from different modalities, overcome the limitations of a single data form, and analyse more complex emotional states. However, conventional multi-modal emotion fusion methods ignore the dynamic change of emotion and therefore perform poorly in practical applications. Context-encoding networks and attention-mechanism algorithms have only been used for single-modal emotion judgement and do not consider the short-time continuity of the user's emotion change across multiple modalities.
With the development of machine learning and deep learning, audio-visual emotion recognition methods have been applied in various decision support systems. In the field of security monitoring in particular, audio-visual emotion emergency decision systems are widely used for emotion monitoring of crowds in indoor public places and for risk assessment of emergencies in such scenes. In 2019, in the research project on a rail-transit emergency plan system and decision response mechanism based on an emotion atmosphere field, a multi-modal emotion emergency decision system displayed the emotional atmosphere of rail-transit passengers in real time through an emotion atmosphere field model, based on information such as passenger behaviour, facial expressions and sound, and established plan system models for complex groups under different abnormal emotional atmospheres, making an important contribution to rail-transit safety. In the 2022 key technology research project on improving the efficiency of railway safety development in the new period, a description model of the passengers' emotional atmosphere was built from their multi-modal emotional information and abnormal emotional atmospheres were recognized; a case-based safety-guarantee emergency plan system related to abnormal emotional atmospheres was constructed, and a dynamic decision mechanism and safety-guarantee emergency regulation oriented to abnormal emotional atmospheres were established.
However, emergency decision systems based on audio and video information still suffer from insufficient accuracy of the emotion recognition results in user emotion prediction and from insufficient attention of the decision results to individual extreme emotions, and there is still considerable room for improvement in risk assessment of emergencies in public places.
Disclosure of Invention
In view of the technical problems in existing multi-modal emotion recognition and emergency decision systems, the invention aims to provide a multi-modal emotion emergency decision system based on a multi-stage Long Short-Term Memory (LSTM) network, which uses the emotion information of the audio and video modalities to estimate the emotion of groups in an indoor public place, uses the multi-stage LSTM to fuse the emotion information at the decision level, integrates the group emotion atmosphere field of the indoor public place with individual extreme emotions, and performs risk assessment of emergencies in public places.
The purpose of the invention is realized by the following technical scheme:
the disclosed multi-modal emotion emergency decision system based on a multi-stage LSTM fuses the temporal relevance of the multi-modal information through an LSTM network and estimates the emotional state of the user at the current time by combining emotional context information. The LSTM nodes extract the context information of the audio and video modalities at different times and establish a dependency in the time dimension; in addition, a multi-stage LSTM fusion model is built to strengthen the contextual connection between the input subsequences of the LSTM networks at each stage. The system evaluates the collective emotion of the indoor public place environment as a whole by constructing an emotion atmosphere field, makes an independent risk decision for individual extreme emotions by strengthening their monitoring, integrates the prediction result of the emotion atmosphere field with the individual extreme emotional states, and calculates the risk level probability of an emergency in the indoor public place.
The invention discloses a multi-modal emotion emergency decision system based on a multi-stage LSTM, which comprises the following steps:

Step 1: Perform emotion estimation in continuous dimensions on the audio information. Preprocess the collected audio information and extract the audio modal feature sequence over a continuous frame sequence, recorded as X_a ∈ R^(T×d_a), where d_a is the length of the audio feature vector at each time instant and T is the size of the time dimension. Perform emotion classification on the audio modal feature sequence with the VGGish-13-based audio emotion perception model to obtain the emotion bias of the audio in the valence and arousal dimensions over the continuous frame sequence; the emotion estimates of the audio at time t in the valence and arousal dimensions are recorded as Y_t^{a,val} and Y_t^{a,arl}.

Step 2: Perform emotion estimation in continuous dimensions on the face information in the video. Extract the face information of the collected video frame by frame, preprocess the face images, and obtain the preprocessed face image feature sequence, recorded as X_v ∈ R^(T×d_v), where d_v is the length of the face feature vector at each time instant. Perform emotion classification on the preprocessed face image feature sequence with the ResNet-18-based face emotion perception model; the emotion estimates of the face information at time t in the valence and arousal dimensions are recorded as Y_t^{v,val} and Y_t^{v,arl}.

Step 3: Integrate the outputs of the audio emotion perception model and the face emotion perception model, and record the emotion bias of the audio and images over the continuous frame sequence in the valence dimension and the arousal dimension respectively. At time t, the emotion biases in the valence and arousal dimensions are recorded as x_t^{val} = (Y_t^{a,val}, Y_t^{v,val}) and x_t^{arl} = (Y_t^{a,arl}, Y_t^{v,arl}).
Step 4: Establish a multi-stage LSTM network, take the emotion bias of Step 3 as input, and perform decision-level fusion of the multi-modal emotion in the valence and arousal dimensions respectively.
Step 4 comprises the following steps:

Step 4.1: Take the single-dimension emotion bias from Step 3 as input and segment primary subsequences starting from the first frame to obtain the primary subsequences to be processed in that dimension. A primary subsequence comprises a target frame and a number of consecutive frames adjacent to and following the target frame; its length is recorded as timestep. In the multi-stage LSTM network, the input of the first-stage LSTM network consists of a number of such primary subsequences, and the number of sample primary subsequences in a single dimension is recorded as batchsize.

Step 4.2: Input the feature information of each frame in the single dimension into the first-stage LSTM network according to the temporal order of the primary subsequence, and obtain reference feature information through the first-stage LSTM network, as follows: receive the feature information of the t-th frame corresponding to time t, where time t is the current time; receive the hidden state information and cell state information output at time t-1; convert the hidden state information and cell state information output at time t-1 into the target hidden state information and target cell state information at time t-1; input the feature information of the t-th frame together with the target hidden state information and target cell state information at time t-1 into the LSTM module, and output the hidden state information and cell state information at time t through the LSTM node; compute the fused emotion prediction of the LSTM node at time t from its hidden state information at time t, denoted Y_t^{dim,1} with dim ∈ {val, arl}.

Train the primary subsequences together with the target outputs in the corresponding valence and arousal dimensions in the first-stage LSTM network, and keep the two optimal trained models for the valence and arousal dimensions respectively. Test the video collected in real time with the optimal models in the two dimensions. The outputs of the first-stage LSTM network at time t in the valence and arousal dimensions are recorded as Y_t^{val,1} and Y_t^{arl,1}.

Step 4.3: Take the output of Step 3 as input and segment secondary subsequences starting from the frame located timestep/2 frames after the first frame, obtaining the secondary subsequences to be processed in the single dimension. A secondary subsequence comprises a target frame and a number of consecutive frames adjacent to and following the target frame. The length of a secondary subsequence is the same as that of a primary subsequence, namely timestep. In the multi-stage LSTM network, the input of the second-stage LSTM network consists of a number of such secondary subsequences, and the number of sample secondary subsequences in a single dimension is batchsize.

Step 4.4: Input the feature information of each frame into the second-stage LSTM network according to the temporal order of the secondary subsequence, and obtain reference feature information through the second-stage LSTM network, as follows: receive the feature information of the t-th frame corresponding to time t, where time t is the current time; receive the hidden state information and cell state information output at time t-1; convert them into the target hidden state information and target cell state information at time t-1; input the feature information of the t-th frame together with the target hidden state information and target cell state information at time t-1 into the LSTM module, and output the hidden state information and cell state information at time t through the LSTM node; compute the fused emotion prediction of the LSTM node at time t from its hidden state information at time t, denoted Y_t^{dim,2}.

Train the secondary subsequences together with the target outputs in the corresponding valence and arousal dimensions in the second-stage LSTM network, and keep the two optimal trained models for the valence and arousal dimensions respectively. Test the video collected in real time with the optimal models in the two dimensions. The outputs of the second-stage LSTM network at time t in the valence and arousal dimensions are recorded as Y_t^{val,2} and Y_t^{arl,2}.

Step 5: Fuse the output Y_t^{dim,1} of the first-stage LSTM network at time t in the single dimension with the output Y_t^{dim,2} of the second-stage LSTM network in the time dimension to obtain the multi-stage LSTM network emotion fusion result Y_t^{dim} in that dimension. Because the nodes of each LSTM stage hold different hidden state information at the same time, the outputs of the two LSTM stages at different times are fused with different weights: nodes that have acquired long-term memory are given a higher weight, and nodes that have acquired only short-term memory are given a lower weight.

Step 6: Perform outlier processing on the multi-stage LSTM network emotion fusion result Y_t^{dim} at time t in the single dimension. To ensure that the multi-stage LSTM network emotion fusion results from time t-1 to time t+1 satisfy short-time continuity and to avoid large abrupt changes in the fusion result, take the fusion results Y_{t-1}^{dim}, Y_t^{dim} and Y_{t+1}^{dim} from time t-1 to time t+1; when Y_t^{dim} deviates from both Y_{t-1}^{dim} and Y_{t+1}^{dim} beyond the continuity conditions, Y_t^{dim} is identified as an outlier and corrected so that the sequence remains continuous over the three time instants.

Step 7: Carry out the multi-stage LSTM-based multi-modal emotion fusion in the valence dimension and the arousal dimension for each person in the indoor public place environment. The resulting individual emotion fusion results are recorded as {Y_i^t | i = 1, ..., N}, where N is the number of people captured on video in the public place environment, Y_i^t = (Y_i^{t,val}, Y_i^{t,arl}) is the emotion fusion result of person i at time t, and Y_i^{t,val} and Y_i^{t,arl} are the emotion fusion results in the valence and arousal dimensions respectively.

Step 8: Perform atmosphere field estimation, extreme emotion judgement and risk level judgement on the group emotion of the indoor public place according to the individual emotion fusion results.

Step 8 comprises the following steps:

Step 8.1: Divide the individual emotion estimation results into a normal emotional state and an extreme emotional state. For an individual emotion fusion result Y_i^t, when its emotion fusion result Y_i^{t,val} in the valence dimension falls into the extreme range, the emotion of that user is judged to belong to the extreme category, and the emotion fusion result Y_i^t is entered into the extreme emotion list.

Step 8.2: Establish an atmosphere field model, perceive the group emotional atmosphere of the public place from the normal or extreme emotional states of the individuals in the public place group over a period of time, calculate the estimation result of the emotion atmosphere field by integrating the group emotional atmosphere with the individual extreme emotions, and express the estimation result in the two-dimensional emotion model.

The two-dimensional emotion model has two mutually orthogonal dimensions, valence and arousal, and is used to represent the change of emotion intensity in continuous dimensions. The values of the valence and arousal dimensions represent the offset from negative to positive and from calm to excited respectively; both value ranges are [-3, 3], and the coordinates in the two-dimensional space represent different emotions.

Step 8.3: Build a risk model, calculate the risk level from the emotion atmosphere field estimation result, and output the corresponding risk level. When the public place environment is in a medium or high risk state, the system provides the corresponding emergency plan.

The risk model has a ring structure and performs the risk calculation from the emotion offsets in the different dimensions. According to the emotion atmosphere field estimation result, the risk levels are graded as 0 (no risk), 1 (general), 2 (relatively large), 3 (major) and 4 (particularly major). Based on the fused valence and arousal biases, when the valence value in the emotion atmosphere field estimation result is greater than 0, the risk level rank is recorded as level 0. In all other cases rank = ⌈sqrt((Y^{val})^2 + (Y^{arl})^2)⌉, i.e. the distance from the coordinates of the multi-modal emotion fusion result in the two-dimensional emotion model to the origin is calculated and rounded up to give the risk level.
The multi-modal emotion emergency decision system based on the multi-stage LSTM is developed and implemented on the PyCharm platform and comprises an emotion monitoring subsystem, a risk decision subsystem and a user interface.

The emotion monitoring subsystem comprises a sensor information interface module, an emotion estimation module and a result output module. The emotion estimation module calls the audio and face emotion estimation models for single-modal emotion recognition and calls the multi-stage LSTM network for multi-modal emotion fusion.

The sensor information interface module collects video information through a camera and a microphone and monitors the real-time environment of the indoor public place. The module transmits the video information to the user interface and separates the audio and image signals in the video, passing them to the emotion estimation module.

The emotion estimation module performs the corresponding data preprocessing on the received audio and image signals, calls the audio and face emotion estimation models for emotion recognition in the valence and arousal dimensions, and calls the multi-stage LSTM network for multi-modal emotion fusion, obtaining the real-time emotional states of the individuals in the indoor public place group in the valence and arousal dimensions.

The result output module outputs the individual emotion estimation results in the valence and arousal dimensions to the emotion recognition interface module of the risk decision subsystem and to the user interface.

The risk decision subsystem comprises an emotion recognition interface module and a risk decision calculation module; the risk decision calculation module contains the two-dimensional emotion model, the atmosphere field model and the risk model.

The risk decision calculation module estimates the group emotion atmosphere field according to the atmosphere field model, estimates the risk level of real-time events in the indoor public place according to the risk model, and outputs the risk estimation result to the scheduling module and the user interface.

The user interface displays the real-time environmental state of the current indoor public place, the emotion estimation results, the two-dimensional emotion model display and the risk assessment result, realizing real-time emotion monitoring and rapid emergency response for the indoor public place.
Beneficial effects:

1. To realize the multi-modal emotion fusion function, the disclosed multi-modal emotion emergency decision system based on the multi-stage LSTM extracts the context information of the audio and video modalities at different times through the LSTM nodes and establishes the temporal dependency between the emotions of the two modalities.

2. To solve the problem that the input subsequences of a conventional LSTM network lack contextual connection, the disclosed system constructs a multi-stage LSTM network so that every frame of emotion in an input subsequence can obtain state information from its context.

3. The disclosed system improves the efficiency of neural network training and emotion estimation by reducing the length of the subsequences at each stage; because no frame of emotion in an input subsequence loses context state information, the real-time performance of the system is improved while the accuracy of emotion estimation is maintained.

4. The disclosed system performs outlier processing on the emotion fusion result of the multi-stage LSTM network according to the short-time continuity of human emotion, avoiding large abrupt changes in the fusion result.

5. The disclosed system realizes real-time emotion monitoring and rapid emergency response in public places. By strengthening the monitoring of individual extreme emotions, calculating the emotion atmosphere field estimation result from the group emotional atmosphere together with the individual extreme emotions, and judging the risk level of indoor emergencies, it addresses the lack of attention to individual extreme emotions in existing crowd emotion estimation methods.

6. The disclosed system designs a two-dimensional emotion model according to the short-time continuity of human emotion and visually displays the group emotion change of the indoor public place.

7. The disclosed system designs a risk model based on emotion intensity; the algorithm has low time complexity and can realize real-time risk assessment of emergencies in the indoor public place environment.
Drawings
The invention will be further described with reference to the following examples and embodiments, in which:
FIG. 1 is a flow chart of a multi-modal emotion emergency decision system based on a multi-stage LSTM network.
Fig. 2 is a flow chart of personal audio and video emotion estimation proposed by the present invention.
Fig. 3 is a schematic diagram of a single stage LSTM network as proposed by the present invention.
Fig. 4 is a schematic diagram of a multi-stage LSTM network structure proposed by the present invention.
Fig. 5 is a flow chart of risk decision implementation in an embodiment of the present invention.
FIG. 6 is a diagram showing a two-dimensional model of emotion proposed by the present invention.
Fig. 7 is a schematic diagram of a risk model proposed by the present invention.
Fig. 8 is an architecture diagram of an emergency decision system according to the present invention.
FIG. 9 is a schematic diagram of a user interface of an emergency decision system in an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following further describes the present invention with reference to the accompanying drawings and examples.
As shown in fig. 1, the multi-stage LSTM-based multi-modal emotion emergency decision system disclosed in this embodiment includes two main components: personal audio-video emotion estimation and risk decision. The personal audio-video emotion estimation part is shown in fig. 2 and comprises a data acquisition module, an audio emotion recognition module, a face emotion recognition module and a multi-modal emotion recognition module. The system disclosed in this embodiment specifically includes the following steps:

Step 1: Data acquisition. Video information is collected using two sensors, a camera and a microphone, to monitor the real-time environment of the indoor public place. The audio signal collected by the microphone and the frame-by-frame image signal collected by the camera are separated, and emotion recognition is performed on the corresponding signals in Step 2 and Step 3 respectively.
Step 2: Perform single-modal emotion recognition on the audio information. The emotion recognition process based on the audio information comprises signal preprocessing, feature extraction and emotion recognition using the VGGish-13-based audio emotion perception model.
Step 2 comprises the following concrete implementation steps:

Step 2.1: Preprocess the audio signal extracted in Step 1 by zero padding and amplitude normalization (a sketch of this preprocessing is given after Step 2.2). Because the collected audio signals have different lengths, zero padding appends a blank segment to each audio clip so that all clips have the same total number of frames. Amplitude normalization raises the point of maximum amplitude in an audio segment to 0 dB and stretches the other points proportionally. Low-level descriptors such as frame energy, fundamental frequency, short-time jitter parameters and Mel-frequency cepstral coefficients are extracted with the OpenSMILE toolbox and converted into feature vectors through statistical operations such as mean, variance and regression coefficients. The audio is split into frames to obtain audio information over a continuous frame sequence, and the audio modal feature sequence over the continuous frame sequence is extracted and recorded as X_a ∈ R^(T×d_a), where d_a is the length of the audio feature vector at each time instant.

Step 2.2: Perform emotion classification on the audio modal feature sequence with the VGGish-13-based audio emotion perception model. The model comprises 4 VGG blocks; each block contains two convolutional layers and one pooling layer and uses the rectified linear unit as the activation function. The kernel size of the convolutional layers is set to 3 with stride 1, and the stride of the pooling layer is 2. The VGG blocks are followed by 4 fully connected layers and a softmax layer. The feature sequence output in Step 2.1 is used as the input of the audio emotion perception model, and emotion estimation is performed in the valence and arousal dimensions respectively. The emotion estimates of the audio at time t in the valence and arousal dimensions are recorded as Y_t^{a,val} and Y_t^{a,arl}.
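The zero-padding and amplitude-normalization part of Step 2.1 can be sketched as follows; the target length is an assumed parameter, and the subsequent OpenSMILE feature extraction is not reproduced here.

# Sketch of the Step 2.1 preprocessing: zero padding to a common length and
# amplitude normalization to a 0 dB peak. The target length is an assumption.
import numpy as np

def preprocess_audio(signal: np.ndarray, target_len: int) -> np.ndarray:
    # Zero padding: append a silent segment so every clip has the same number of samples.
    if len(signal) < target_len:
        signal = np.concatenate([signal, np.zeros(target_len - len(signal))])
    else:
        signal = signal[:target_len]
    # Amplitude normalization: raise the loudest sample to 0 dB (|x| = 1) and
    # stretch all other samples proportionally.
    peak = np.max(np.abs(signal))
    if peak > 0:
        signal = signal / peak
    return signal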
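A minimal PyTorch sketch of a VGGish-13-style model with the layout stated in Step 2.2 (4 VGG blocks of two 3×3, stride-1 convolutions plus a stride-2 pooling layer, followed by 4 fully connected layers and softmax) is given below; the channel widths, input feature-map size and number of output classes are assumptions for illustration.

# Sketch of a VGGish-13-style audio emotion model as described in Step 2.2.
# Channel widths, input size and class count are illustrative assumptions.
import torch
import torch.nn as nn

def vgg_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1), nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=2, stride=2),
    )

class VGGishEmotion(nn.Module):
    def __init__(self, num_classes=5, feat_hw=(64, 64)):
        super().__init__()
        self.features = nn.Sequential(
            vgg_block(1, 64), vgg_block(64, 128), vgg_block(128, 256), vgg_block(256, 256)
        )
        flat = 256 * (feat_hw[0] // 16) * (feat_hw[1] // 16)  # four poolings halve each side
        self.classifier = nn.Sequential(
            nn.Linear(flat, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 256), nn.ReLU(inplace=True),
            nn.Linear(256, 64), nn.ReLU(inplace=True),
            nn.Linear(64, num_classes),
            nn.Softmax(dim=1),
        )

    def forward(self, x):            # x: (batch, 1, H, W) audio feature map
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)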
Step 3: Perform single-modal emotion recognition on the face information in the images. The emotion recognition process based on the face information comprises face framing and cropping, image preprocessing and emotion recognition using the ResNet-18-based face emotion perception model.
Step 3 comprises the following concrete implementation steps:

Step 3.1: For face framing and cropping, the image signal extracted in Step 1 is searched frame by frame for face information, and the RetinaFace target detection network is used to recognize and crop valid facial regions in a complex environment. For face image preprocessing, bilinear interpolation is used for image scaling, and graying together with gray-level equalization is used to enhance the overall contrast of the face image. The preprocessed face image feature sequence is obtained and recorded as X_v ∈ R^(T×d_v), where d_v is the length of the face feature vector at each time instant.

Step 3.2: Perform emotion classification on the preprocessed images with the ResNet-18-based face emotion perception model. The model consists of 4 residual blocks, and each residual block is repeated twice. The principle of the residual block is given by:

y_k = F(x_k, {W_k}) + h(x_k)

where y_k and x_k are the output and input vector matrices of the k-th layer, F(x_k, {W_k}) is the residual function obtained by model training, and h(x_k) is a linear projection used to match the dimensions of F(x_k, {W_k}) and the input x_k. The residual blocks are followed by an average pooling layer and a fully connected layer. The face image feature sequence output in Step 3.1 is used as the input of the face emotion perception model, and emotion estimation is performed in the valence and arousal dimensions respectively. The emotion estimates of the face information at time t in the valence and arousal dimensions are recorded as Y_t^{v,val} and Y_t^{v,arl}.
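The residual block y_k = F(x_k, {W_k}) + h(x_k) can be sketched in PyTorch as follows; the channel counts and the use of batch normalization are assumptions, and h is realized as a 1×1 projection only when the input and output shapes differ.

# Sketch of the residual block y_k = F(x_k, {W_k}) + h(x_k) from Step 3.2.
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.F = nn.Sequential(                       # residual function F(x_k, {W_k})
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        if stride != 1 or in_ch != out_ch:            # h(x_k): projection to match shapes
            self.h = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )
        else:
            self.h = nn.Identity()
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.F(x) + self.h(x))      # y_k = F(x_k) + h(x_k)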
Step 4: Integrate the outputs of the audio emotion perception model and the face emotion perception model in the valence dimension and the arousal dimension respectively, and record the emotion bias of the audio and images over the continuous frame sequence in each single dimension. For time t, the outputs of the audio and face emotion perception models in the valence dimension are Y_t^{a,val} and Y_t^{v,val}, and the emotion bias of the captured user in the valence dimension at this time is recorded as x_t^{val} = (Y_t^{a,val}, Y_t^{v,val}). Similarly, the emotion bias of the user in the arousal dimension at time t is recorded as x_t^{arl} = (Y_t^{a,arl}, Y_t^{v,arl}).
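A small sketch of this pairing step (Step 4), assuming the per-dimension estimates are held as NumPy arrays of length T; the dictionary layout is an illustrative convention, not prescribed by the patent.

# Pair audio and face estimates frame by frame into a two-channel input per dimension.
import numpy as np

def build_bias_sequences(audio_est: dict, face_est: dict) -> dict:
    # audio_est / face_est: {"val": array of length T, "arl": array of length T}
    return {
        dim: np.stack([audio_est[dim], face_est[dim]], axis=-1)   # shape (T, 2)
        for dim in ("val", "arl")
    }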
Step 5: Establish a multi-stage LSTM network consisting of two single-stage LSTM networks, take the emotion bias of Step 4 as input, and perform decision-level fusion of the multi-modal emotion in the valence and arousal dimensions respectively.
Step 5 comprises the following steps:

Step 5.1: Take the single-dimension emotion bias from Step 4 as input and segment primary subsequences starting from the first frame to obtain the primary subsequences to be processed in that dimension. A primary subsequence comprises a target frame x_t^{dim} and a number of consecutive frames adjacent to and following the target frame, and is recorded as (x_t^{dim}, x_{t+1}^{dim}, ..., x_{t+timestep-1}^{dim}); its length is timestep. In the multi-stage LSTM network, the input of the first-stage LSTM network consists of a number of such primary subsequences, and the number of sample primary subsequences in a single dimension is recorded as batchsize.

Step 5.2: Input the feature information of each frame in the single dimension into the first-stage LSTM network shown in fig. 3 according to the temporal order of the primary subsequence. Through the first-stage LSTM network, the feature information corresponding to the target frame x_t^{dim} and the feature information corresponding to the adjacent frames x_{t+1}^{dim}, x_{t+2}^{dim}, ... are extracted, giving the feature information sequence corresponding to the primary subsequence. At time t, the LSTM node receives the feature information of the t-th frame corresponding to the current time together with the hidden state information h_{t-1} and the cell state information c_{t-1} output at time t-1, and converts them into the target hidden state information and target cell state information at time t-1. The feature information of the t-th frame, the target hidden state information at time t-1 and the target cell state information are input into the LSTM module, which outputs the hidden state information h_t and cell state information c_t at time t. Finally, the fused emotion prediction y_t at time t is calculated from the hidden state information h_t at time t.
The specific working principle is as follows:
f_t = σ(W_xf · x_t + W_hf · h_{t-1} + b_f)          (1)

i_t = σ(W_xi · x_t + W_hi · h_{t-1} + b_i)          (2)

g_t = tanh(W_xg · x_t + W_hg · h_{t-1} + b_g)       (3)

c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t                     (4)

o_t = σ(W_xo · x_t + W_ho · h_{t-1} + b_o)          (5)

h_t = o_t ⊙ tanh(c_t)                               (6)

y_t = W_hy · h_t + b_y                              (7)
where W_xf, W_hf, W_xg, W_hg, W_xi, W_hi, W_xo, W_ho and W_hy are the weight parameters in the LSTM module, b_f, b_g, b_i, b_o and b_y are the bias terms in the LSTM module, and these parameters are obtained through model training.
Equation (1) is the forgetting gate, which receives the memory information and determines which part of the memory to retain and which to forget; the forgetting factor f_t represents the selection weight applied to the target cell state information c_{t-1} output at time t-1.

Equations (2) and (3) form the input gate, which selects the information to be memorized: i_t represents the selection weight of the temporary cell state information g_t at time t, and g_t is the temporary cell state information at time t. In Equation (4), f_t ⊙ c_{t-1} is the information to be discarded and i_t ⊙ g_t is the information to be memorized; the cell state information c_t at time t is obtained from these two parts.

Equations (5) and (6) form the output gate, which outputs the hidden state information h_t at time t; o_t represents the selection weight of the cell state information at time t. Equation (7) calculates the fused emotion prediction y_t from h_t.

Train the primary subsequences together with the target outputs in the corresponding valence and arousal dimensions in the first-stage LSTM network, and keep the two optimal trained models for the valence and arousal dimensions respectively. Test the video collected in real time with the optimal models in the two dimensions. The outputs of the first-stage LSTM network at time t in the valence and arousal dimensions are recorded as Y_t^{val,1} and Y_t^{arl,1}.
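A NumPy sketch of a single LSTM step following equations (1)-(7) is given below; the parameter dictionary layout is illustrative, and the weights would be obtained by training as stated above.

# One LSTM step following equations (1)-(7); p holds the weight matrices and bias terms.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """p holds W_x*, W_h*, b_* for the forget (f), input (i), candidate (g) and output (o)
    gates plus the prediction layer (W_hy, b_y)."""
    f_t = sigmoid(p["W_xf"] @ x_t + p["W_hf"] @ h_prev + p["b_f"])      # (1) forget gate
    i_t = sigmoid(p["W_xi"] @ x_t + p["W_hi"] @ h_prev + p["b_i"])      # (2) input gate
    g_t = np.tanh(p["W_xg"] @ x_t + p["W_hg"] @ h_prev + p["b_g"])      # (3) candidate state
    c_t = f_t * c_prev + i_t * g_t                                      # (4) cell state update
    o_t = sigmoid(p["W_xo"] @ x_t + p["W_ho"] @ h_prev + p["b_o"])      # (5) output gate
    h_t = o_t * np.tanh(c_t)                                            # (6) hidden state
    y_t = p["W_hy"] @ h_t + p["b_y"]                                    # (7) fused emotion prediction
    return y_t, h_t, c_t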
Step 5.3: Take the output of Step 4 as input and segment secondary subsequences starting from the frame located timestep/2 frames after the first frame, obtaining the secondary subsequences to be processed in the single dimension. A secondary subsequence comprises a target frame and a number of consecutive frames adjacent to and following the target frame. The length of a secondary subsequence is the same as that of a primary subsequence, namely timestep. In the multi-stage LSTM network, the input of the second-stage LSTM network consists of a number of such secondary subsequences, and the number of sample secondary subsequences in a single dimension is the same as for the primary subsequences, namely batchsize (the segmentation of both levels is sketched after Step 5.4).

Step 5.4: Input the feature information of each frame into the second-stage LSTM network according to the temporal order of the secondary subsequence. The structure and working principle of the second-stage LSTM network are the same as those of the first-stage LSTM network. Train the secondary subsequences together with the target outputs in the corresponding valence and arousal dimensions in the second-stage LSTM network, and keep the two optimal trained models for the valence and arousal dimensions respectively. Test the video collected in real time with the optimal models in the two dimensions. The outputs of the second-stage LSTM network at time t in the valence and arousal dimensions are recorded as Y_t^{val,2} and Y_t^{arl,2}.
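The primary and secondary subsequence segmentation (Steps 5.1 and 5.3) can be sketched as follows; non-overlapping windows and a second-stage start offset of timestep/2 are assumptions consistent with the description above.

# Window segmentation for both LSTM stages; the stride and offset are assumed.
import numpy as np

def segment(seq: np.ndarray, timestep: int, offset: int = 0):
    """seq: (T, 2) per-dimension emotion bias; returns windows of shape (n, timestep, 2)."""
    windows = [seq[s:s + timestep] for s in range(offset, len(seq) - timestep + 1, timestep)]
    return np.stack(windows) if windows else np.empty((0, timestep, seq.shape[-1]))

# first-stage input:  segment(seq, timestep, offset=0)
# second-stage input: segment(seq, timestep, offset=timestep // 2)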
Step 6: Fuse the output Y_t^{dim,1} of the first-stage LSTM network at time t in the single dimension with the output Y_t^{dim,2} of the second-stage LSTM network in the time dimension to obtain the multi-stage LSTM network emotion fusion result Y_t^{dim} in that dimension, as shown in fig. 4. Because the nodes of each LSTM stage hold different hidden state information at the same time, the outputs of the two LSTM stages at different times are fused with different weights.
Step 6 comprises the following steps:

Step 6.1: To ensure that the fusion result satisfies the short-time continuity of emotion, a corresponding weight is assigned to the fused emotion output by each subsequence of the first-stage and second-stage LSTM networks. The nodes in the first timestep/2 positions of each subsequence only hold short-term memory, so their outputs are given a weight of 0.1; the nodes in the last timestep/2 positions hold both long-term and short-term memory, so their outputs are given a weight of 0.9.

Step 6.2: Fuse the output Y_t^{dim,1} of the first-stage LSTM network and the output Y_t^{dim,2} of the second-stage LSTM network in the single dimension in the time dimension, taking the input time of the first-stage LSTM network as the reference time. For any time t_1, the multi-stage LSTM network emotion fusion result Y_{t_1}^{dim} is the weighted combination of Y_{t_1}^{dim,1} and Y_{t_1}^{dim,2}: the output of the stage whose node at time t_1 lies in the second half of its subsequence, and therefore holds long-term memory, receives the weight 0.9, and the output of the other stage receives the weight 0.1.
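The time-dimension fusion of Step 6 can be sketched as follows; the exact case split by window half is an assumption made to match the rule that the stage holding long-term memory at a given time receives the 0.9 weight.

# Weighted fusion of the two LSTM stages at time t (assumed case split).
def fuse_two_stages(y1, y2, t, timestep):
    """y1, y2: outputs of the first- and second-stage LSTM at time t in one dimension."""
    in_second_half_of_stage1_window = (t % timestep) >= timestep // 2
    if in_second_half_of_stage1_window:
        return 0.9 * y1 + 0.1 * y2   # stage-1 node has accumulated long-term memory here
    return 0.1 * y1 + 0.9 * y2       # otherwise the offset stage-2 node carries the context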
And 7: multi-stage LSTM network emotion fusion result at t moment under single dimension
Figure BDA00037842517600001110
And (5) carrying out outlier processing. In order to ensure that the multi-stage LSTM network emotion fusion result from the time t-1 to the time t +1 meets the short-time continuity and avoid the large-amplitude mutation of the emotion fusion result, the emotion fusion result from the time t-1 to the time t +1 is taken
Figure BDA00037842517600001111
When it is satisfied with
Figure BDA00037842517600001112
And is
Figure BDA00037842517600001113
When the temperature of the water is higher than the set temperature,
Figure BDA00037842517600001114
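A sketch of the outlier handling in Step 7 follows; the threshold value and the mean-of-neighbours correction are assumptions, since the text only specifies that abrupt deviations from both neighbours are suppressed.

# Smooth a fused value that jumps away from both of its temporal neighbours.
def smooth_outlier(y_prev, y_curr, y_next, threshold=0.5):
    if abs(y_curr - y_prev) > threshold and abs(y_curr - y_next) > threshold:
        return 0.5 * (y_prev + y_next)   # assumed correction keeping short-time continuity
    return y_curr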
Step 8: Carry out the multi-stage LSTM-based multi-modal emotion fusion in the valence dimension and the arousal dimension for each person in the indoor public place environment. The resulting individual emotion fusion results are recorded as {Y_i^t | i = 1, ..., N}, where N is the number of people captured on video in the public place environment, Y_i^t = (Y_i^{t,val}, Y_i^{t,arl}) is the emotion fusion result of person i at time t, and Y_i^{t,val} and Y_i^{t,arl} are the emotion fusion results in the valence and arousal dimensions respectively.
Step 9: Carry out the risk decision on the group emotion in the indoor public place according to the individual emotion fusion results, including atmosphere field estimation, extreme emotion judgement and risk level judgement. The implementation flow is shown in fig. 5.
Step 9 is implemented as follows:

Step 9.1: Divide the individual emotion estimation results into a normal emotional state and an extreme emotional state. For an individual emotion fusion result Y_i^t, when its emotion fusion result Y_i^{t,val} in the valence dimension falls into the extreme range, the emotion of that user is judged to belong to the extreme category, and the emotion fusion result Y_i^t is entered into the extreme emotion list.

Step 9.2: Establish an atmosphere field model, perceive the group emotional atmosphere of the public place from the normal or extreme emotional states of the individuals in the public place group over a period of time, calculate the emotion atmosphere field estimation result by integrating the group emotional atmosphere with the individual extreme emotions, and express the estimation result in the two-dimensional emotion model.

The two-dimensional emotion model is shown in fig. 6; it has two mutually orthogonal dimensions, valence and arousal, and is used to represent the change of emotion intensity in continuous dimensions. The values of the valence and arousal dimensions represent the offset from negative to positive and from calm to excited respectively; both value ranges are [-3, 3], and the coordinates in the two-dimensional space represent different emotions.

Step 9.3: Build a risk model, calculate the risk level from the emotion atmosphere field estimation result, and output the corresponding risk level. When the public place environment is in a medium or high risk state, the system provides the corresponding emergency plan.

The risk model is shown in fig. 7; it has a ring structure and performs the risk calculation from the emotion offsets in the different dimensions. According to the emotion atmosphere field estimation result, the risk levels are graded as 0 (no risk), 1 (general), 2 (relatively large), 3 (major) and 4 (particularly major). Based on the fused valence and arousal biases, when the valence value in the emotion atmosphere field estimation result is greater than 0, the risk level rank is recorded as level 0. In all other cases rank = ⌈sqrt((Y^{val})^2 + (Y^{arl})^2)⌉, i.e. the distance from the coordinates of the multi-modal emotion fusion result in the two-dimensional emotion model to the origin is calculated and rounded up to give the risk level.
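The extreme-emotion screening and the ring-shaped risk grading can be sketched as follows; the extreme-valence threshold and the cap at level 4 are assumptions made for illustration.

# Screen extreme individuals and map a fused (valence, arousal) point to a risk level.
import math

def extreme_individuals(results, val_threshold=-2.0):
    """results: {person_id: (valence, arousal)}; returns ids judged extreme (assumed threshold)."""
    return [pid for pid, (val, _) in results.items() if val <= val_threshold]

def risk_level(valence: float, arousal: float) -> int:
    if valence > 0:
        return 0                                    # positive atmosphere: no risk
    rank = math.ceil(math.sqrt(valence ** 2 + arousal ** 2))
    return min(rank, 4)                             # assumed cap at level 4 (particularly major)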
As shown in fig. 8, the disclosed multi-modal emotion emergency decision system based on the multi-stage LSTM is developed and implemented on the PyCharm platform and comprises an emotion monitoring subsystem, a risk decision subsystem and a user interface.

The emotion monitoring subsystem comprises a sensor information interface module, an emotion estimation module and a result output module. The emotion estimation module calls the audio and face emotion estimation models for single-modal emotion recognition and calls the multi-stage LSTM network for multi-modal emotion fusion.

The sensor information interface module collects video information through a camera and a microphone and monitors the real-time environment of the indoor public place. The module transmits the video information to the user interface and separates the audio and image signals in the video, passing them to the emotion estimation module.

The emotion estimation module performs the corresponding data preprocessing on the received audio and image signals, calls the audio and face emotion estimation models for emotion recognition in the valence and arousal dimensions, and calls the multi-stage LSTM network for multi-modal emotion fusion, obtaining the real-time emotional states of the individuals in the indoor public place group in the valence and arousal dimensions.

The result output module outputs the individual emotion estimation results in the valence and arousal dimensions to the emotion recognition interface module of the risk decision subsystem and to the user interface.

The risk decision subsystem comprises an emotion recognition interface module and a risk decision calculation module; the risk decision calculation module contains the two-dimensional emotion model, the atmosphere field model and the risk model.

The risk decision calculation module estimates the group emotion atmosphere field according to the atmosphere field model, estimates the risk level of real-time events in the indoor public place according to the risk model, and outputs the risk estimation result to the scheduling module and the user interface.

The user interface is shown in fig. 9; it displays the real-time environmental state of the current indoor public place, the emotion estimation results, the two-dimensional emotion model display and the risk assessment result, realizing real-time emotion monitoring and rapid emergency response for the indoor public place.
The above detailed description is further intended to illustrate the objects, technical solutions and advantages of the present invention, and it should be understood that the above detailed description is only an example of the present invention and should not be used to limit the scope of the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (5)

1. A multi-modal emotion emergency decision system based on a multi-stage LSTM, characterized in that it comprises the following steps:

step 1: perform emotion estimation in continuous dimensions on the audio information: preprocess the collected audio information, extract the audio modal feature sequence over a continuous frame sequence, perform emotion classification on the audio modal feature sequence with a VGGish-13-based audio emotion perception model, and obtain the emotion bias of the audio in the valence and arousal dimensions over the continuous frame sequence;

step 2: perform emotion estimation in continuous dimensions on the face information in the video: extract the face information of the collected video frame by frame, preprocess the face images to obtain the preprocessed face image feature sequence, perform emotion classification on the preprocessed face image feature sequence with a ResNet-18-based face emotion perception model, and obtain the emotion bias of the face images in the valence and arousal dimensions over the continuous frame sequence;

step 3: integrate the outputs of the audio emotion perception model and the face emotion perception model, and record the emotion biases of the audio and images over the continuous frame sequence in the valence dimension and the arousal dimension respectively;

step 4: establish a multi-stage LSTM network and perform decision-level fusion of the multi-modal emotion in the valence dimension and the arousal dimension respectively;

step 5: fuse the output of the first-stage LSTM network and the output of the second-stage LSTM network in the time dimension; because the nodes of each LSTM stage hold different hidden state information at the same time, assign different weights to the outputs of the two LSTM stages at different times for fusion, and obtain the multi-stage LSTM network emotion fusion result in a single dimension;

step 6: perform outlier processing on the multi-stage LSTM network emotion fusion result in the single dimension;

step 7: according to steps 1 to 6, carry out the multi-stage LSTM-based multi-modal emotion fusion in the valence dimension and the arousal dimension for the people in the indoor public place environment;

step 8: carry out the risk decision on the group emotion in the indoor public place according to the individual emotion fusion results, including atmosphere field estimation, extreme emotion judgement and risk level judgement.
2. The multi-stage LSTM-based multi-modal emotion emergency decision system of claim 1, wherein: in order to improve the efficiency of neural network training and emotion estimation, in step 4, the real-time performance of the system is improved while the accuracy of emotion estimation is guaranteed by reducing the length of each level of subsequence.
3. The multi-stage LSTM-based multi-modal emotion emergency decision system of claim 2, wherein: in order to solve the problem that the conventional LSTM network input subsequence lacks context relation, in step 5, different weights are given to output results of two stages of LSTM networks at different moments for fusion according to the fact that nodes of all stages of LSTM networks at the same moment have different hidden state information, and a multi-stage LSTM network emotion fusion result is obtained; the method comprises the following concrete steps:
step 5.1: in order to ensure that the fusion result meets the short-time continuity of the emotion, corresponding weight is given to the fusion emotion output by each subsequence of the first-level LSTM network and the second-level LSTM network; because the node at the front timing/2 of each subsequence only has short-term memory, the weight of 0.1 is given to the output result, and because the node at the rear timing/2 has long-term memory and short-term memory at the same time, the weight of 0.9 is given to the output result;
step 5.2: fusing, in the time dimension, the output result of the first-level LSTM network in a single dimension [formula FDA0003784251750000021] and the output result of the second-level LSTM network [formula FDA0003784251750000022]: taking the input time of the first-level LSTM network as the reference time, for the emotion fusion result [formula FDA0003784251750000023] at any time t1, the multi-level LSTM network emotion fusion result is [formula FDA0003784251750000024]; when [formula FDA0003784251750000025], the multi-level LSTM network emotion fusion result is [formula FDA0003784251750000026].
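The exact fusion formulas of step 5.2 are present in the source only as formula images; the sketch below therefore only illustrates, under stated assumptions, the 0.1/0.9 weighting of step 5.1: within each subsequence the first timing/2 steps are weighted 0.1 and the remaining steps 0.9, and the two weighted levels are combined by a normalised sum. The function name, the subsequence lengths and the combination rule are assumptions, not the patented formulas.

import torch

def fuse_two_levels(y1: torch.Tensor, y2: torch.Tensor,
                    sub_len1: int = 8, sub_len2: int = 8) -> torch.Tensor:
    # y1, y2: per-step outputs of the first- and second-level LSTMs, shape (time,).
    t = torch.arange(y1.shape[0])
    # Position-dependent weight (claim 3, step 5.1): the first sub_len/2 nodes of a
    # subsequence carry only short-term memory (weight 0.1), the remaining nodes
    # carry both long- and short-term memory (weight 0.9).
    w1 = torch.where((t % sub_len1) < sub_len1 // 2, torch.tensor(0.1), torch.tensor(0.9))
    w2 = torch.where((t % sub_len2) < sub_len2 // 2, torch.tensor(0.1), torch.tensor(0.9))
    # Normalised weighted combination of the two levels; the exact rule of step 5.2
    # is only available as formula images, so this combination is an assumption.
    return (w1 * y1 + w2 * y2) / (w1 + w2)

# Usage with the TwoLevelFusionLSTM sketch above (one sequence, valence dimension):
# fused = fuse_two_levels(y1[0], y2[0], sub_len1=8, sub_len2=8)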
4. The multi-stage LSTM-based multi-modal emotion emergency decision system of claim 3, wherein: in order to avoid large, abrupt changes in the emotion fusion result, in step 6 outlier processing is performed on the multi-level LSTM network emotion fusion result according to the short-time continuity of human emotion (an outlier-processing sketch is given after this claim); the concrete implementation steps are as follows:
taking the emotion fusion results from time t-1 to time t+1 [formula FDA0003784251750000027]; when [formula FDA0003784251750000028] and [formula FDA0003784251750000029] are satisfied, [formula FDA00037842517500000210].
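The outlier conditions of step 6 are likewise only available as formula images in the source; the sketch below is a minimal stand-in built on the same short-time-continuity idea: a fused value that jumps away from both of its neighbours (times t-1 and t+1) by more than a threshold is replaced by the neighbours' mean. Both the threshold and the replacement rule are assumptions, not the patented conditions.

import torch

def smooth_outliers(y: torch.Tensor, threshold: float = 1.0) -> torch.Tensor:
    # y: multi-level LSTM emotion fusion result in one dimension, shape (time,).
    y = y.clone()
    for t in range(1, y.shape[0] - 1):
        # Treat y[t] as an outlier if it deviates from both neighbours by more
        # than `threshold` (assumed criterion) and replace it by their mean.
        if abs(y[t] - y[t - 1]) > threshold and abs(y[t] - y[t + 1]) > threshold:
            y[t] = 0.5 * (y[t - 1] + y[t + 1])
    return y

# Usage: smoothed = smooth_outliers(fused, threshold=1.0)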
5. The multi-stage LSTM-based multi-modal emotion emergency decision system of claim 4, wherein: in step 8, in order to strengthen the monitoring of individual extreme emotions, an atmosphere-field model is established to address the fact that existing crowd-emotion estimation pays little attention to individual extreme emotions; an emotion two-dimensional model is designed according to the short-time continuity of human emotion; and a risk model is designed within the emotion two-dimensional model according to emotion intensity; the concrete implementation steps are as follows:
step 8.1: will be a single personThe emotion estimation result is divided into a normal emotion state and an extreme emotion state, and the single emotion is
Figure FDA00037842517500000211
Emotional fusion results in medium potency dimension
Figure FDA00037842517500000212
And satisfy
Figure FDA00037842517500000213
Then, the emotion of the user is judged to belong to the extreme category, and the emotion is fused into a result Y i t Entering an extreme emotion list;
step 8.2: establishing the atmosphere-field model, perceiving the group emotional atmosphere of the public place from the normal or extreme emotional states of the individuals in the public-place crowd over a period of time, calculating the emotional-atmosphere-field estimation result by combining the group emotional atmosphere with individual extreme emotions, and expressing the estimation result in the emotion two-dimensional model;
the emotion two-dimensional model has two mutually orthogonal dimensions, valence and arousal, and is used to represent changes of emotion intensity in continuous dimensions; the values of the valence and arousal dimensions represent the offset from negative to positive and from calm to excited respectively, both with a value range of [-3, 3], and coordinates in the two-dimensional space spanned by valence and arousal represent different emotions;
step 8.3: establishing the risk model, calculating the risk level according to the emotional-atmosphere-field estimation result, outputting the corresponding risk level, and having the system provide the corresponding emergency plan when the public-place environment is in a medium- or high-risk state;
the risk model has a ring structure and performs the risk calculation based on the emotion offsets in the different dimensions; the risk levels are recorded as level 0 (no risk), level 1 (general), level 2 (relatively large), level 3 (severe) and level 4 (extremely severe) according to the emotional-atmosphere-field estimation result; when the valence value in the emotional-atmosphere-field estimation result is greater than 0, the risk level is recorded as level 0; in the other cases the risk level is calculated from the emotion fusion offsets in valence and arousal [formula FDA0003784251750000031], that is, the distance from the coordinates of the multi-modal emotion fusion result in the emotion two-dimensional model to the origin is calculated, and this distance, rounded up, is recorded as the risk level (a minimal risk-level sketch is given after this claim).
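The risk formula of step 8.3 appears in the source only as a formula image; the sketch below implements the rule as described in the text: level 0 whenever the valence of the atmosphere-field estimate is positive, otherwise the distance from the fused (valence, arousal) point to the origin, rounded up. Clamping the result to level 4 is an added assumption (the distance can reach about 4.24 at the corners of the [-3, 3] range), and the function and label names are illustrative.

import math

RISK_LABELS = {0: "no risk", 1: "general", 2: "relatively large",
               3: "severe", 4: "extremely severe"}

def risk_level(valence: float, arousal: float) -> int:
    # valence, arousal: emotional-atmosphere-field estimate, each in [-3, 3].
    if valence > 0:
        return 0                                # positive group emotion -> no risk
    distance = math.hypot(valence, arousal)     # distance to the origin of the emotion model
    return min(4, math.ceil(distance))          # rounded up; cap at level 4 (assumption)

# Example: a strongly negative, highly aroused atmosphere-field estimate.
level = risk_level(valence=-2.5, arousal=2.5)   # distance ≈ 3.54 -> level 4
print(level, RISK_LABELS[level])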
CN202210941178.0A 2022-08-05 2022-08-05 Multi-modal emotion emergency decision system based on multi-stage long and short term memory network Pending CN115393927A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210941178.0A CN115393927A (en) 2022-08-05 2022-08-05 Multi-modal emotion emergency decision system based on multi-stage long and short term memory network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210941178.0A CN115393927A (en) 2022-08-05 2022-08-05 Multi-modal emotion emergency decision system based on multi-stage long and short term memory network

Publications (1)

Publication Number Publication Date
CN115393927A true CN115393927A (en) 2022-11-25

Family

ID=84118182

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210941178.0A Pending CN115393927A (en) 2022-08-05 2022-08-05 Multi-modal emotion emergency decision system based on multi-stage long and short term memory network

Country Status (1)

Country Link
CN (1) CN115393927A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116521303A (en) * 2023-07-04 2023-08-01 四川易诚智讯科技有限公司 Dynamic display method and system of emergency plan based on multi-source data fusion
CN116521303B (en) * 2023-07-04 2023-09-12 四川易诚智讯科技有限公司 Dynamic display method and system of emergency plan based on multi-source data fusion

Similar Documents

Publication Publication Date Title
CN111275085B (en) Online short video multi-modal emotion recognition method based on attention fusion
Oliver et al. Layered representations for human activity recognition
CN110956953B (en) Quarrel recognition method based on audio analysis and deep learning
CN112699774B (en) Emotion recognition method and device for characters in video, computer equipment and medium
CN111310672A (en) Video emotion recognition method, device and medium based on time sequence multi-model fusion modeling
CN108549841A (en) A kind of recognition methods of the Falls Among Old People behavior based on deep learning
CN111626116B (en) Video semantic analysis method based on fusion of multi-attention mechanism and Graph
CN111626199B (en) Abnormal behavior analysis method for large-scale multi-person carriage scene
CN112766172A (en) Face continuous expression recognition method based on time sequence attention mechanism
CN108563624A (en) A kind of spatial term method based on deep learning
CN112183334B (en) Video depth relation analysis method based on multi-mode feature fusion
CN110232564A (en) A kind of traffic accident law automatic decision method based on multi-modal data
CN116564338A (en) Voice animation generation method, device, electronic equipment and medium
CN116975776A (en) Multi-mode data fusion method and device based on tensor and mutual information
CN115393927A (en) Multi-modal emotion emergency decision system based on multi-stage long and short term memory network
CN115527271A (en) Elevator car passenger abnormal behavior detection system and method
Dissanayake et al. Utalk: Sri Lankan sign language converter mobile app using image processing and machine learning
CN113469023B (en) Method, apparatus, device and storage medium for determining alertness
Abdul-Ameer et al. Development smart eyeglasses for visually impaired people based on you only look once
CN114694254B (en) Method and device for detecting and early warning robbery of articles in straight ladder and computer equipment
Rony et al. An effective approach to communicate with the deaf and mute people by recognizing characters of one-hand bangla sign language using convolutional neural-network
Salam et al. You Only Look Once (YOLOv3): Object Detection and Recognition for Indoor Environment
CN117576279B (en) Digital person driving method and system based on multi-mode data
Yang et al. GME-dialogue-NET: gated multimodal sentiment analysis model based on fusion mechanism
CN111813943B (en) Multitask classification disambiguation method and device based on generative countermeasure network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination