CN117556084B - Video emotion analysis system based on multiple modes - Google Patents


Info

Publication number
CN117556084B
CN117556084B
Authority
CN
China
Prior art keywords
analysis
emotion
vector
processor
unit
Prior art date
Legal status
Active
Application number
CN202311812195.5A
Other languages
Chinese (zh)
Other versions
CN117556084A (en)
Inventor
张卫平
张伟
李显阔
王丹
邵胜博
Current Assignee
Global Digital Group Co Ltd
Original Assignee
Global Digital Group Co Ltd
Priority date
Filing date
Publication date
Application filed by Global Digital Group Co Ltd filed Critical Global Digital Group Co Ltd
Priority to CN202311812195.5A priority Critical patent/CN117556084B/en
Publication of CN117556084A publication Critical patent/CN117556084A/en
Application granted granted Critical
Publication of CN117556084B publication Critical patent/CN117556084B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7834Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7837Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content
    • G06F16/784Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content the detected or recognised objects being people
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a video emotion analysis system based on multiple modes, which relates to the field of electric digital data processing and comprises an audio and video acquisition module, an expression recognition module, a voice analysis module and an emotion comprehensive analysis module, wherein the audio and video acquisition module is used for acquiring facial video information and voice information of a user, the expression recognition module is used for analyzing and processing the facial video information, the voice analysis module is used for analyzing and processing the voice information, and the emotion comprehensive analysis module is used for obtaining emotion information of the user based on the video analysis result and the voice analysis result; the system approaches the analysis from two modalities, video information and audio information, and fuses the two analysis results, so that a more accurate emotion result can be obtained.

Description

Video emotion analysis system based on multiple modes
Technical Field
The invention relates to the field of electric digital data processing, in particular to a video emotion analysis system based on multiple modes.
Background
With the development of artificial intelligence, more and more application products for emotional communication are emerging. A prerequisite for such products is that the emotional state of the user can be grasped accurately. Existing emotion analysis systems often analyze only a single modality, or combine the results of several modalities merely in a simple way, so a multi-modal system is required to analyze the user's emotion accurately.
The foregoing discussion of the background art is intended to facilitate an understanding of the present invention only. This discussion is not an acknowledgement or admission that any of the material referred to was common general knowledge.
A number of emotion analysis systems have already been developed. Extensive searching and review of references found, for example, the system disclosed in publication No. CN111222464B, which generally includes: acquiring a physiological signal corresponding to a target user, wherein the physiological signal comprises an electroencephalogram signal and an electromyogram signal; acquiring facial image information corresponding to the target user; inputting the physiological signal and the facial image information respectively into at least one pre-trained target classification model to obtain a physiological signal recognition result and a micro-expression recognition result corresponding to the target user; and determining an emotion analysis result corresponding to the target user based on the physiological signal recognition result and the micro-expression recognition result. However, that system needs to acquire physiological signals, which is more complex than acquiring audio and video information, it does not perform comprehensive multi-modal analysis, and it is prone to misjudging the emotion.
Disclosure of Invention
To address the above deficiencies, the invention aims to provide a video emotion analysis system based on multiple modes.
The invention adopts the following technical scheme:
a video emotion analysis system based on multiple modes comprises an audio and video acquisition module, an expression recognition module, a voice analysis module and an emotion comprehensive analysis module;
the emotion comprehensive analysis module is used for obtaining emotion information of the user based on a video analysis result and a voice analysis result;
the audio and video acquisition module comprises a video acquisition unit, an audio acquisition unit and a synchronous marking unit, wherein the video acquisition unit is used for acquiring facial video information of a user, the audio acquisition unit is used for acquiring voice information of the user, and the synchronous marking unit is used for marking synchronous time points in the video information and the voice information;
the expression recognition module comprises a facial feature extraction unit and an expression analysis unit, wherein the facial feature extraction unit is used for extracting facial features of a user from video information, and the expression analysis unit is used for analyzing emotion of the user based on the facial features;
the voice analysis module comprises a voice feature extraction unit and an intonation analysis unit, wherein the voice feature extraction unit is used for extracting key features in voice information, and the intonation analysis unit is used for analyzing emotion of a user according to the key features;
the emotion comprehensive analysis module comprises a data fusion unit and an emotion judgment unit, wherein the data fusion unit is used for carrying out multi-mode fusion on analysis data of the expression recognition module and analysis data of the voice analysis module, and the emotion judgment unit is used for carrying out judgment analysis on the overall emotion state of the user based on the fused data;
further, the facial feature extraction unit comprises a frame information extraction processor, a face alignment processor, a key point positioning processor and a feature vector processor, wherein the frame information extraction processor is used for sequentially extracting frame information from video information, the face alignment processor is used for acquiring a local facial picture from the frame information, the key point positioning processor is used for acquiring position information of a key point in the facial picture, and the feature vector processor is used for calculating a feature vector according to the position information of the key point;
further, the expression analysis unit comprises a vector analysis processor, a first emotion feature register and a first proofreading analysis processor, wherein the vector analysis processor is used for calculating and processing feature vectors to obtain expression data, the first emotion feature register is used for storing the expression data of each emotion, and the first proofreading analysis processor is used for comparing the calculated expression data with recorded expression data and outputting a first judgment vector;
the first collation analysis processor calculates a first judgment vector Jv1 according to the following formula:
wherein Jv1_i is the i-th element value of the first judgment vector, Jv1 has n elements in total, n is the number of emotions recorded in the first emotion feature register, Ep_1 and Ep_2 are respectively the transverse ratio and the longitudinal ratio of the expression data, and Ep_1(i) and Ep_2(i) are the transverse ratio and the longitudinal ratio for the i-th emotion;
further, the intonation analysis unit includes a second emotion feature register for storing intonation data of each emotion, and a second correction analysis processor for comparing the peak feature vector with the intonation data and outputting a second judgment vector Jv2, specifically expressed as follows:
wherein Jv2_i represents the i-th element value of the second judgment vector, Jv2 has n elements in total, (σ_t(i), σ_h(i)) is the intonation feature vector stored for the i-th emotion, and (σ_t, σ_h) is the intonation feature vector in the corresponding target time period;
further, the data fusion unit comprises a time matching processor and a fusion analysis processor, wherein the time matching processor divides the first judgment vector into a plurality of sets according to the synchronous time point, each set is matched with a corresponding second judgment vector, and the fusion analysis processor analyzes and processes the matched first judgment vector set and second judgment vector set;
the fusion analysis processor performs primary fusion processing on the first judgment vector set according to the following steps of:
wherein Jv1 i ' is the i element value of the first-level fusion vector, N is the number of vectors in the first judgment vector set, and Jv1 i (j) N (i, j) is the sorting value of the ith element value of the jth vector in the first judging vector set in the element value of the present vector;
the fusion analysis processor performs secondary fusion processing according to the following steps to obtain a secondary fusion vector Jv2':
wherein Jv2 i ' is the value of the i-th element in the secondary fusion vector.
The beneficial effects obtained by the invention are as follows:
the system obtains the judgment vectors by independently analyzing the video information and the audio information, and then fuses the judgment vectors to obtain the emotion analysis result under multiple modes, compared with a single mode, the system is more accurate, and the independently analyzed judgment vectors do not directly represent emotion results, but represent the possibility of various emotions, so that the two judgment vectors can be organically fused, and the results are not simply combined.
For a further understanding of the nature and the technical aspects of the present invention, reference should be made to the following detailed description of the invention and the accompanying drawings, which are provided for purposes of reference only and are not intended to limit the invention.
Drawings
FIG. 1 is a schematic diagram of the overall structural framework of the present invention;
FIG. 2 is a schematic diagram of an audio/video acquisition module according to the present invention;
FIG. 3 is a schematic diagram of an expression recognition module according to the present invention;
FIG. 4 is a schematic diagram of a voice analysis module according to the present invention;
FIG. 5 is a schematic diagram of the emotion comprehensive analysis module of the present invention.
Detailed Description
The following embodiments of the present invention are described in terms of specific examples, and those skilled in the art will appreciate the advantages and effects of the present invention from the disclosure herein. The invention is capable of other and different embodiments and its several details are capable of modification and variation in various respects, all without departing from the spirit of the present invention. The drawings of the present invention are merely schematic illustrations, and are not intended to be drawn to actual dimensions. The following embodiments will further illustrate the related art content of the present invention in detail, but the disclosure is not intended to limit the scope of the present invention.
Embodiment one: the embodiment provides a video emotion analysis system based on multiple modes, which comprises an audio and video acquisition module, an expression recognition module, a voice analysis module and an emotion comprehensive analysis module;
the emotion comprehensive analysis module is used for obtaining emotion information of the user based on a video analysis result and a voice analysis result;
the audio and video acquisition module comprises a video acquisition unit, an audio acquisition unit and a synchronous marking unit, wherein the video acquisition unit is used for acquiring facial video information of a user, the audio acquisition unit is used for acquiring voice information of the user, and the synchronous marking unit is used for marking synchronous time points in the video information and the voice information;
the expression recognition module comprises a facial feature extraction unit and an expression analysis unit, wherein the facial feature extraction unit is used for extracting facial features of a user from video information, and the expression analysis unit is used for analyzing emotion of the user based on the facial features;
the voice analysis module comprises a voice feature extraction unit and an intonation analysis unit, wherein the voice feature extraction unit is used for extracting key features in voice information, and the intonation analysis unit is used for analyzing emotion of a user according to the key features;
the emotion comprehensive analysis module comprises a data fusion unit and an emotion judgment unit, wherein the data fusion unit is used for carrying out multi-mode fusion on analysis data of the expression recognition module and analysis data of the voice analysis module, and the emotion judgment unit is used for carrying out judgment analysis on the overall emotion state of the user based on the fused data;
the facial feature extraction unit comprises a frame information extraction processor, a face alignment processor, a key point positioning processor and a feature vector processor, wherein the frame information extraction processor is used for sequentially extracting frame information from video information, the face alignment processor is used for acquiring local facial pictures from the frame information, the key point positioning processor is used for acquiring position information of key points in the facial pictures, and the feature vector processor is used for calculating feature vectors according to the position information of the key points;
the expression analysis unit comprises a vector analysis processor, a first emotion feature register and a first proofreading analysis processor, wherein the vector analysis processor is used for calculating feature vectors to obtain expression data, the first emotion feature register is used for storing the expression data of each emotion, and the first proofreading analysis processor is used for comparing the calculated expression data with recorded expression data and outputting a first judgment vector;
the first collation analysis processor calculates a first judgment vector Jv1 according to the following formula:
wherein Jv1_i is the i-th element value of the first judgment vector, Jv1 has n elements in total, n is the number of emotions recorded in the first emotion feature register, Ep_1 and Ep_2 are respectively the transverse ratio and the longitudinal ratio of the expression data, and Ep_1(i) and Ep_2(i) are the transverse ratio and the longitudinal ratio for the i-th emotion;
the intonation analysis unit comprises a second emotion feature register and a second correction analysis processor, wherein the second emotion feature register is used for storing intonation data of each emotion, and the second correction analysis processor is used for comparing peak feature vectors with the intonation data and outputting second judgment vectors Jv2, and the specific formula is as follows:
wherein Jv2_i represents the i-th element value of the second judgment vector, Jv2 has n elements in total, (σ_t(i), σ_h(i)) is the intonation feature vector stored for the i-th emotion, and (σ_t, σ_h) is the intonation feature vector in the corresponding target time period;
the data fusion unit comprises a time matching processor and a fusion analysis processor, wherein the time matching processor divides a first judgment vector into a plurality of sets according to a synchronous time point, each set is matched with a corresponding second judgment vector, and the fusion analysis processor analyzes and processes the matched first judgment vector set and second judgment vector set;
the fusion analysis processor performs primary fusion processing on the first judgment vector set according to the following formula to obtain a primary fusion vector Jv1':
wherein Jv1_i' is the i-th element value of the primary fusion vector, N is the number of vectors in the first judgment vector set, Jv1_i(j) is the i-th element value of the j-th vector in the first judgment vector set, and n(i,j) is the sorting value of that element value among the element values of the same vector;
the fusion analysis processor performs secondary fusion processing according to the following steps to obtain a secondary fusion vector Jv2':
wherein Jv2 i ' is the value of the i-th element in the secondary fusion vector.
Embodiment two: the embodiment comprises the whole content of the first embodiment, and provides a video emotion analysis system based on multiple modes, which comprises an audio and video acquisition module, an expression recognition module, a voice analysis module and an emotion comprehensive analysis module;
the emotion comprehensive analysis module is used for obtaining emotion information of the user based on a video analysis result and a voice analysis result;
referring to fig. 2, the audio/video acquisition module includes a video acquisition unit, an audio acquisition unit and a synchronization marking unit, wherein the video acquisition unit is used for acquiring facial video information of a user, the audio acquisition unit is used for acquiring voice information of the user, and the synchronization marking unit is used for marking synchronization time points in the video information and the voice information;
referring to fig. 3, the expression recognition module includes a facial feature extraction unit for extracting facial features of a user from video information and an expression analysis unit for analyzing emotion of the user based on the facial features;
referring to fig. 4, the voice analysis module includes a voice feature extraction unit and an intonation analysis unit, wherein the voice feature extraction unit is used for extracting key features in voice information, and the intonation analysis unit analyzes the emotion of the user according to the key features;
referring to fig. 5, the emotion comprehensive analysis module includes a data fusion unit and an emotion judgment unit, where the data fusion unit is configured to perform multi-modal fusion on analysis data of the expression recognition module and analysis data of the voice analysis module, and the emotion judgment unit performs judgment analysis on an overall emotion state of a user based on the fused data;
the facial feature extraction unit comprises a frame information extraction processor, a face alignment processor, a key point positioning processor and a feature vector processor, wherein the frame information extraction processor is used for sequentially extracting frame information from video information, the face alignment processor is used for acquiring local facial pictures from the frame information, the key point positioning processor is used for acquiring position information of key points in the facial pictures, and the feature vector processor is used for calculating feature vectors according to the position information of the key points;
the frame information extraction processor detects frames containing synchronous time point information as basic frames; after each basic frame, it extracts one frame at a fixed frame interval, stores the basic frames and the extracted frames as analysis frames, and sends the analysis frames to the face alignment processor in sequence;
the face alignment processor intercepts a rectangular picture from an analysis frame, wherein the two sides of the rectangular picture are boundary vertical lines of ears, the bottom side of the rectangular picture is a boundary horizontal line of chin, the upper side of the rectangular picture is a boundary horizontal line of eyebrows, and the face alignment processor marks the width and the height of the rectangular picture as w and h respectively;
the process of acquiring the key point position information by the key point positioning processor comprises the following steps:
s1, acquiring edge information of eyes, a mouth, a nose and eyebrows in a rectangular picture;
s2, intersecting the edge information by using a preset intercept line, wherein the intersection point is used as a key point;
s3, reading coordinate information of the key points in the rectangular picture;
the preset stub includes three pieces of information: the key points obtained by the corresponding sectional lines of the eyes, the vertical line and the 0 are the left end point of the eyes, and the two key points obtained by the corresponding sectional lines of the mouth, the vertical line and the 0.5 are the upper end point and the lower end point in the middle of the mouth;
the feature vector processor uses the nose core key points as vector starting points and other key points as vector ending points to calculate feature vectors, and usesRepresenting an ith feature vector;
the facial feature extraction unit sends the feature vector of each analysis frame to the expression analysis unit;
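For illustration only, a minimal sketch of how the feature vectors described above could be computed is given below; the key-point names and coordinates in the sketch are assumptions, since the description only fixes the nose-center key point as the common start point:

```python
import numpy as np

def facial_feature_vectors(keypoints, nose_key="nose"):
    """Compute feature vectors v_i from the nose-center key point to every other key point.

    `keypoints` maps a key-point name to its (x, y) coordinate inside the cropped
    w x h face rectangle, e.g. {"nose": (52, 60), "eye_left": (30, 35)}.
    The key-point names and coordinates are illustrative assumptions.
    """
    origin = np.asarray(keypoints[nose_key], dtype=float)
    vectors = []
    for name, point in keypoints.items():
        if name == nose_key:
            continue
        # v_i = end point (other key point) - start point (nose-center key point)
        vectors.append(np.asarray(point, dtype=float) - origin)
    return vectors
```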
the expression analysis unit comprises a vector analysis processor, a first emotion feature register and a first proofreading analysis processor, wherein the vector analysis processor is used for calculating feature vectors to obtain expression data, the first emotion feature register is used for storing the expression data of each emotion, and the first proofreading analysis processor is used for comparing the calculated expression data with recorded expression data and outputting a first judgment vector;
the vector analysis processor calculates and processes the feature vector according to the following steps:
wherein Ep is 1 And Ep is a 2 To represent two ratios of expression data, respectively referred to as a transverse ratio and a longitudinal ratio, { k 1i And } is a transverse coefficient group, { k 2i Is the longitudinal coefficient group, m isThe number of feature vectors;
the transverse coefficient group and the longitudinal coefficient group are obtained by measuring and counting a large number of face images;
the first collation analysis processor calculates a first judgment vector Jv1 according to the following formula:
wherein Jv1_i is the i-th element value of the first judgment vector, Jv1 has n elements in total, n is the number of emotions recorded in the first emotion feature register, and Ep_1(i) and Ep_2(i) are the transverse ratio and the longitudinal ratio for the i-th emotion;
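The formula for Jv1 is likewise not reproduced; the sketch below only preserves the stated reading of Jv1_i as a likelihood of the i-th emotion, assuming that a smaller distance between the measured ratios and an emotion's recorded ratios yields a larger element value:

```python
import math

def first_judgment_vector(ep1, ep2, emotion_register):
    """Compare the measured expression data with the expression data recorded per emotion.

    `emotion_register` is a list of (Ep_1(i), Ep_2(i)) pairs, one per emotion.
    The inverse-distance score used here is an assumption standing in for the
    patented formula; it only keeps the per-emotion likelihood interpretation.
    """
    jv1 = []
    for ref1, ref2 in emotion_register:
        distance = math.hypot(ep1 - ref1, ep2 - ref2)
        jv1.append(1.0 / (1.0 + distance))  # assumed similarity score
    return jv1
```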
the expression recognition module sends the first judgment vector of each analysis frame to the emotion comprehensive analysis module;
the voice feature extraction unit comprises a peak detection processor and a peak feature processor, wherein the peak detection processor is used for detecting a peak time point from audio data, and the peak feature processor is used for processing according to the interval time of the peak time point and the change of the amplitude at the peak time point to obtain voice features;
the time intervals between adjacent peak time points are denoted Δt, and the amplitude changes at the peak time points are denoted Δh; for each pair of adjacent synchronous time points, the peak feature processor calculates the standard deviations of Δt and Δh, denoted σ_t and σ_h respectively; the period between two adjacent synchronous time points is called the target time period, and the vector (σ_t, σ_h) formed by σ_t and σ_h serves as the intonation feature vector of the corresponding target time period;
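A minimal sketch of this intonation feature vector computation, assuming the peak time points and peak amplitudes of one target time period have already been detected by the peak detection processor:

```python
import numpy as np

def intonation_feature_vector(peak_times, peak_amplitudes):
    """Compute the intonation feature vector (sigma_t, sigma_h) of one target time period.

    sigma_t is the standard deviation of the time intervals between adjacent peak
    time points, and sigma_h the standard deviation of the amplitude changes
    between adjacent peaks; the amplitude-change reading is an assumption.
    """
    peak_times = np.asarray(peak_times, dtype=float)
    peak_amplitudes = np.asarray(peak_amplitudes, dtype=float)
    intervals = np.diff(peak_times)          # time intervals between adjacent peaks
    amp_changes = np.diff(peak_amplitudes)   # amplitude changes between adjacent peaks
    return float(np.std(intervals)), float(np.std(amp_changes))
```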
the intonation analysis unit comprises a second emotion feature register and a second correction analysis processor, wherein the second emotion feature register is used for storing intonation data of each emotion, and the second correction analysis processor is used for comparing peak feature vectors with the intonation data and outputting second judgment vectors Jv2, and the specific formula is as follows:
wherein Jv2_i represents the i-th element value of the second judgment vector, Jv2 has n elements in total, and (σ_t(i), σ_h(i)) is the intonation feature vector stored for the i-th emotion;
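As with Jv1, the formula for Jv2 is not reproduced; the sketch below assumes the same distance-based reading, comparing the measured (σ_t, σ_h) with the intonation data stored per emotion:

```python
import math

def second_judgment_vector(sigma_t, sigma_h, intonation_register):
    """Compare the measured intonation feature vector with the stored per-emotion data.

    `intonation_register` is a list of (sigma_t(i), sigma_h(i)) pairs, one per emotion;
    the inverse-distance score is an assumption, not the patented formula.
    """
    jv2 = []
    for ref_t, ref_h in intonation_register:
        distance = math.hypot(sigma_t - ref_t, sigma_h - ref_h)
        jv2.append(1.0 / (1.0 + distance))  # assumed similarity score
    return jv2
```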
the second judgment vector of each target time period of the voice analysis module is sent to the emotion comprehensive analysis module;
the data fusion unit comprises a time matching processor and a fusion analysis processor, wherein the time matching processor divides a first judgment vector into a plurality of sets according to a synchronous time point, each set is matched with a corresponding second judgment vector, and the fusion analysis processor analyzes and processes the matched first judgment vector set and second judgment vector set;
the fusion analysis processor performs primary fusion processing on the first judgment vector set according to the following formula to obtain a primary fusion vector Jv1':
wherein Jv1_i' is the i-th element value of the primary fusion vector, N is the number of vectors in the first judgment vector set, Jv1_i(j) is the i-th element value of the j-th vector in the first judgment vector set, and n(i,j) is the sorting value of that element value among the element values of the same vector;
the sorting value refers to the sequence number of the element values when sorting from small to large;
the fusion analysis processor performs secondary fusion processing on the primary fusion vector and the second judgment vector according to the following formula to obtain a secondary fusion vector Jv2':
wherein Jv2_i' is the value of the i-th element in the secondary fusion vector;
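The two fusion formulas are not reproduced in the text; the sketch below therefore stands in for them with an assumed sorting-value-weighted average for the primary fusion and an assumed element-wise average for the secondary fusion, while keeping the quantities named above (N, Jv1_i(j), n(i,j)):

```python
import numpy as np

def fuse_judgment_vectors(jv1_set, jv2):
    """Two-stage fusion for one target time period.

    Stage 1 aggregates the N first judgment vectors of the period into a primary
    fusion vector Jv1'; stage 2 combines Jv1' with the matching second judgment
    vector Jv2 into the secondary fusion vector Jv2'. Both formulas below are
    assumptions standing in for the ones not reproduced in the text.
    """
    jv1_set = np.asarray(jv1_set, dtype=float)  # shape (N, n): N vectors, n emotions
    jv2 = np.asarray(jv2, dtype=float)          # shape (n,)
    n_emotions = jv1_set.shape[1]
    # n(i, j): sorting value (1 = smallest) of the i-th element within the j-th vector.
    sorting_values = np.argsort(np.argsort(jv1_set, axis=1), axis=1) + 1
    # Assumed stage 1: mean of the element values weighted by their sorting value.
    jv1_prime = (jv1_set * sorting_values / n_emotions).mean(axis=0)
    # Assumed stage 2: element-wise average of the two modality scores.
    jv2_prime = (jv1_prime + jv2) / 2.0
    return jv1_prime, jv2_prime
```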
the emotion judging unit comprises a data receiving processor and an emotion output processor, wherein the data receiving processor is used for receiving the secondary fusion vector, and the emotion output processor outputs emotion information according to the secondary fusion vector;
the emotion output processor retrieves the element item with the maximum element value from each secondary fusion vector, converts the element item into corresponding emotion, and then arranges the emotion in sequence and outputs the emotion as emotion information;
both i and j appearing above are ordinals used to represent sequence numbers.
The foregoing disclosure is only a preferred embodiment of the present invention and is not intended to limit the scope of the invention, so that all equivalent technical changes made by applying the description of the present invention and the accompanying drawings are included in the scope of the present invention, and in addition, elements in the present invention can be updated as the technology develops.

Claims (2)

1. The video emotion analysis system based on the multiple modes is characterized by comprising an audio and video acquisition module, an expression recognition module, a voice analysis module and an emotion comprehensive analysis module;
the emotion comprehensive analysis module is used for obtaining emotion information of the user based on a video analysis result and a voice analysis result;
the audio and video acquisition module comprises a video acquisition unit, an audio acquisition unit and a synchronous marking unit, wherein the video acquisition unit is used for acquiring facial video information of a user, the audio acquisition unit is used for acquiring voice information of the user, and the synchronous marking unit is used for marking synchronous time points in the video information and the voice information;
the expression recognition module comprises a facial feature extraction unit and an expression analysis unit, wherein the facial feature extraction unit is used for extracting facial features of a user from video information, and the expression analysis unit is used for analyzing emotion of the user based on the facial features;
the voice analysis module comprises a voice feature extraction unit and an intonation analysis unit, wherein the voice feature extraction unit is used for extracting key features in voice information, and the intonation analysis unit is used for analyzing emotion of a user according to the key features;
the emotion comprehensive analysis module comprises a data fusion unit and an emotion judgment unit, wherein the data fusion unit is used for carrying out multi-mode fusion on analysis data of the expression recognition module and analysis data of the voice analysis module, and the emotion judgment unit is used for carrying out judgment analysis on the overall emotion state of the user based on the fused data;
the expression analysis unit comprises a vector analysis processor, a first emotion feature register and a first proofreading analysis processor, wherein the vector analysis processor is used for calculating feature vectors to obtain expression data, the first emotion feature register is used for storing the expression data of each emotion, and the first proofreading analysis processor is used for comparing the calculated expression data with recorded expression data and outputting a first judgment vector;
the first collation analysis processor calculates a first judgment vector Jv1 according to the following formula:
wherein Jv1_i is the i-th element value of the first judgment vector, Jv1 has n elements in total, n is the number of emotions recorded in the first emotion feature register, Ep_1 and Ep_2 are respectively the transverse ratio and the longitudinal ratio of the expression data, and Ep_1(i) and Ep_2(i) are the transverse ratio and the longitudinal ratio for the i-th emotion;
the intonation analysis unit comprises a second emotion feature register and a second correction analysis processor, wherein the second emotion feature register is used for storing intonation data of each emotion, and the second correction analysis processor is used for comparing peak feature vectors with the intonation data and outputting second judgment vectors Jv2, and the specific formula is as follows:
wherein Jv2_i represents the i-th element value of the second judgment vector, Jv2 has n elements in total, (σ_t(i), σ_h(i)) is the intonation feature vector stored for the i-th emotion, and (σ_t, σ_h) is the intonation feature vector in the corresponding target time period;
the data fusion unit comprises a time matching processor and a fusion analysis processor, wherein the time matching processor divides a first judgment vector into a plurality of sets according to a synchronous time point, each set is matched with a corresponding second judgment vector, and the fusion analysis processor analyzes and processes the matched first judgment vector set and second judgment vector set;
the fusion analysis processor performs primary fusion processing on the first judgment vector set according to the following formula to obtain a primary fusion vector Jv1':
wherein Jv1_i' is the i-th element value of the primary fusion vector, N is the number of vectors in the first judgment vector set, Jv1_i(j) is the i-th element value of the j-th vector in the first judgment vector set, and n(i,j) is the sorting value of that element value among the element values of the same vector;
the fusion analysis processor performs secondary fusion processing according to the following steps to obtain a secondary fusion vector Jv2':
wherein Jv2 i ' is the value of the i-th element in the secondary fusion vector.
2. The multi-modality based video emotion analysis system of claim 1, wherein the facial feature extraction unit includes a frame information extraction processor for sequentially extracting frame information from video information, a face alignment processor for acquiring a partial face picture from the frame information, a key point location processor for acquiring position information of a key point in the face picture, and a feature vector processor for calculating a feature vector from the position information of the key point.
CN202311812195.5A 2023-12-27 2023-12-27 Video emotion analysis system based on multiple modes Active CN117556084B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311812195.5A CN117556084B (en) 2023-12-27 2023-12-27 Video emotion analysis system based on multiple modes

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311812195.5A CN117556084B (en) 2023-12-27 2023-12-27 Video emotion analysis system based on multiple modes

Publications (2)

Publication Number Publication Date
CN117556084A (en) 2024-02-13
CN117556084B (en) 2024-03-26

Family

ID=89811171

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311812195.5A Active CN117556084B (en) 2023-12-27 2023-12-27 Video emotion analysis system based on multiple modes

Country Status (1)

Country Link
CN (1) CN117556084B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007098560A1 (en) * 2006-03-03 2007-09-07 The University Of Southern Queensland An emotion recognition system and method
CN114399818A (en) * 2022-01-05 2022-04-26 广东电网有限责任公司 Multi-mode face emotion recognition method and device
CN114724224A (en) * 2022-04-15 2022-07-08 浙江工业大学 Multi-mode emotion recognition method for medical care robot
CN116167015A (en) * 2023-02-28 2023-05-26 南京邮电大学 Dimension emotion analysis method based on joint cross attention mechanism
WO2023139559A1 (en) * 2022-01-24 2023-07-27 Wonder Technology (Beijing) Ltd Multi-modal systems and methods for voice-based mental health assessment with emotion stimulation
CN116883888A (en) * 2023-06-06 2023-10-13 交通银行股份有限公司 Bank counter service problem tracing system and method based on multi-mode feature fusion

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10204625B2 (en) * 2010-06-07 2019-02-12 Affectiva, Inc. Audio analysis learning using video data
US20180160200A1 (en) * 2016-12-03 2018-06-07 Streamingo Solutions Private Limited Methods and systems for identifying, incorporating, streamlining viewer intent when consuming media
CN110677598B (en) * 2019-09-18 2022-04-12 北京市商汤科技开发有限公司 Video generation method and device, electronic equipment and computer storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007098560A1 (en) * 2006-03-03 2007-09-07 The University Of Southern Queensland An emotion recognition system and method
CN114399818A (en) * 2022-01-05 2022-04-26 广东电网有限责任公司 Multi-mode face emotion recognition method and device
WO2023139559A1 (en) * 2022-01-24 2023-07-27 Wonder Technology (Beijing) Ltd Multi-modal systems and methods for voice-based mental health assessment with emotion stimulation
CN114724224A (en) * 2022-04-15 2022-07-08 浙江工业大学 Multi-mode emotion recognition method for medical care robot
CN116167015A (en) * 2023-02-28 2023-05-26 南京邮电大学 Dimension emotion analysis method based on joint cross attention mechanism
CN116883888A (en) * 2023-06-06 2023-10-13 交通银行股份有限公司 Bank counter service problem tracing system and method based on multi-mode feature fusion

Also Published As

Publication number Publication date
CN117556084A (en) 2024-02-13


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant