CN111401147A - Intelligent analysis method and device based on video behavior data and storage medium - Google Patents

Intelligent analysis method and device based on video behavior data and storage medium

Info

Publication number
CN111401147A
Authority
CN
China
Prior art keywords
video
recognition result
data
expression
expression recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010122870.1A
Other languages
Chinese (zh)
Other versions
CN111401147B (en)
Inventor
吴智炜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd filed Critical Ping An Life Insurance Company of China Ltd
Priority to CN202010122870.1A priority Critical patent/CN111401147B/en
Priority claimed from CN202010122870.1A external-priority patent/CN111401147B/en
Publication of CN111401147A publication Critical patent/CN111401147A/en
Application granted granted Critical
Publication of CN111401147B publication Critical patent/CN111401147B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174 Facial expression recognition
    • G06V40/176 Dynamic expression
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Human Computer Interaction (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to artificial intelligence technology and discloses an intelligent analysis method based on video behavior data, which comprises the following steps: receiving a pre-recorded user video and performing a voice extraction operation on the user video to obtain voice data and video data; inputting the video data into a pre-trained expression recognition model to obtain an expression recognition result; inputting the voice data into a pre-trained speech state recognition model to obtain a speech state recognition result; constructing a classification tree according to the speech state recognition result and the expression recognition result to obtain a deep-shallow psychological characteristic set; constructing an objective function according to the deep-shallow psychological characteristic set and solving partial derivatives of the objective function to obtain an offset value; and outputting a psychological state analysis result if the offset value is less than or equal to a preset offset error. The invention also provides an intelligent analysis device based on video behavior data and a computer readable storage medium. The invention can realize an accurate and efficient intelligent analysis function based on video behavior data.

Description

Intelligent analysis method and device based on video behavior data and storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a method and a device for intelligent analysis based on video behavior data and a computer readable storage medium.
Background
The intelligent analysis based on the video behavior data is applied to a plurality of fields at present, for example, in the process of claims collection of insurance companies, a video recording device is used for recording the communication video between business personnel and the personnel to be claimed, then whether the personnel to be claimed have cheating and insurance behaviors is analyzed intelligently, when a policeman examines a criminal, the psychological state of the criminal is analyzed to give psychological attack to the criminal, and the criminal is expected to be in good faith and wait.
At present, the common approach to intelligent analysis based on video behavior data is to record videos capturing micro-expressions, body movements, speaking tone and the like, and to have psychological experts observe and analyze the videos to summarize the psychological state. Although this can achieve the purpose of psychological state identification, it requires a large investment of time and manpower, so its efficiency in fields such as insurance and investigation is low.
Disclosure of Invention
The invention provides an intelligent analysis method and device based on video behavior data and a computer readable storage medium, whose main purpose is to recognize a user's expressions and speech states through models so as to perform intelligent analysis of the user's psychological state.
In order to achieve the above object, the present invention provides an intelligent analysis method based on video behavior data, which includes:
receiving a pre-recorded user video, and executing voice extraction operation on the user video to obtain voice data and video data not including the voice data;
inputting the video data into a pre-trained expression recognition model for expression recognition to obtain an expression recognition result;
inputting the voice data into a pre-trained speech state recognition model for speech state recognition to obtain a speech state recognition result;
and constructing a classification tree according to the speech state recognition result and the expression recognition result, obtaining a deep-shallow psychological characteristic set according to the classification tree, constructing an objective function according to the deep-shallow psychological characteristic set, solving a partial derivative of the objective function to obtain an offset value, feeding back the speech state recognition result and the expression recognition result to a preset user if the offset value is greater than a preset offset error, generating a psychological state analysis result according to the expression recognition result and the speech state recognition result if the offset value is less than or equal to the preset offset error, and outputting the psychological state analysis result.
Optionally, the voice extraction operation includes:
carrying out pre-emphasis operation on the user video;
performing frame division and windowing operation on the user video subjected to the pre-emphasis operation;
and separating voice data from the user video subjected to the framing windowing operation based on a discrete Fourier transform method to obtain the voice data and the video data not comprising the voice data.
Optionally, the intelligent analysis method based on video behavior data further includes training the expression recognition model, where the training includes:
constructing the expression recognition model;
establishing a facial expression library and a comparative expression library;
positioning and cutting a face area of the face expression library according to the expression recognition model to obtain a cut face expression library;
predicting the feature points of the facial expression library by using the expression recognition model, and judging the error between the predicted feature points and the comparison expression library; if the error is larger than a preset error, adjusting the parameters of the expression recognition model and predicting the feature points of the facial expression library again; and if the error is smaller than the preset error, exiting the prediction and finishing the training of the expression recognition model.
Optionally, the deep-shallow psychological characteristic set is obtained by calculating a Gini index of the classification tree by using the Gini index method;
wherein the Gini index method is as follows:
Gini(D, A) = Σ_s (T_s/K)(1 - T_s/K)
Gini(D, A) = 1 - Σ_s (T_s/K)^2
wherein A represents the deep-shallow psychological characteristic set, D represents the set formed by the speech state recognition result and the expression recognition result, T_s represents the data volume of the different label classifications (for example, T_1 represents the data volume of the anger label, and so on), and K represents the data volume of the set formed by the speech state recognition result and the expression recognition result.
Optionally, the constructing an objective function according to the set of deep and shallow psychological features, and solving a partial derivative of the objective function to obtain an offset value includes:
respectively constructing a penalty item and an error function based on the depth shallow psychological characteristic set;
adding the error function and the penalty term to obtain a target function;
solving a first order partial derivative result and a second order partial derivative result of the error function;
and reversely deducing to obtain an offset value in the target function according to the first-order partial derivative result and the second-order partial derivative result.
In addition, in order to achieve the above object, the present invention further provides an intelligent analysis device based on video behavior data, the device including a memory and a processor, the memory storing therein an intelligent analysis program based on video behavior data, the intelligent analysis program based on video behavior data being executable on the processor, and the intelligent analysis program based on video behavior data implementing the following steps when executed by the processor:
receiving a pre-recorded user video, and executing voice extraction operation on the user video to obtain voice data and video data not including the voice data;
inputting the video data into a pre-trained expression recognition model for expression recognition to obtain an expression recognition result;
inputting the voice data into a pre-trained speech state recognition model for speech state recognition to obtain a speech state recognition result;
and constructing a classification tree according to the speech state recognition result and the expression recognition result, obtaining a deep-shallow psychological characteristic set according to the classification tree, inputting the deep-shallow psychological characteristic set into a pre-constructed psychological analysis model to obtain an offset value, feeding back the speech state recognition result and the expression recognition result to a preset user if the offset value is greater than a preset offset error, generating a psychological state analysis result according to the expression recognition result and the speech state recognition result if the offset value is less than or equal to the preset offset error, and outputting the psychological state analysis result.
Optionally, the voice extraction operation includes:
carrying out pre-emphasis operation on the user video;
performing frame division and windowing operation on the user video subjected to the pre-emphasis operation;
and separating voice data from the user video subjected to the framing windowing operation based on a discrete Fourier transform method to obtain the voice data and the video data not comprising the voice data.
Optionally, when executed by the processor, the intelligent analysis program based on video behavior data further implements the following steps: training the expression recognition model, the training comprising:
constructing the expression recognition model;
establishing a facial expression library and a comparative expression library;
positioning and cutting a face area of the face expression library according to the expression recognition model to obtain a cut face expression library;
predicting the feature points of the facial expression library by using the expression recognition model, and judging the error between the predicted feature points and the comparison expression library; if the error is larger than a preset error, adjusting the parameters of the expression recognition model and predicting the feature points of the facial expression library again; and if the error is smaller than the preset error, exiting the prediction and finishing the training of the expression recognition model.
Optionally, the deep-shallow psychological characteristic set is obtained by calculating a Gini index of the classification tree by using the Gini index method;
wherein the Gini index method is as follows:
Gini(D, A) = Σ_s (T_s/K)(1 - T_s/K)
Gini(D, A) = 1 - Σ_s (T_s/K)^2
wherein A represents the deep-shallow psychological characteristic set, D represents the set formed by the speech state recognition result and the expression recognition result, T_s represents the data volume of the different label classifications (for example, T_1 represents the data volume of the anger label, and so on), and K represents the data volume of the set formed by the speech state recognition result and the expression recognition result.
In addition, to achieve the above object, the present invention also provides a computer readable storage medium, on which an intelligent analysis program based on video behavior data is stored, the intelligent analysis program based on video behavior data being executable by one or more processors to implement the steps of the intelligent analysis method based on video behavior data as described above.
According to the method, the user video is recorded in advance, the voice data and the video data not including the voice data are obtained through the voice extraction operation, and the expressions and speech states are recognized by models to obtain the recognition results, so the degree of automation is high and no large investment of time and manpower is required; meanwhile, error analysis is carried out according to the constructed classification tree, so the psychological state analysis can be completed automatically. Therefore, the intelligent analysis method and device based on video behavior data and the computer-readable storage medium of the invention can achieve the purpose of intelligently analyzing the psychological state.
Drawings
Fig. 1 is a schematic flowchart of an intelligent analysis method based on video behavior data according to an embodiment of the present invention;
fig. 2 is a schematic internal structural diagram of an intelligent analysis apparatus based on video behavior data according to an embodiment of the present invention;
fig. 3 is a block diagram illustrating an intelligent analysis program based on video behavior data in an intelligent analysis device based on video behavior data according to an embodiment of the present invention;
fig. 4 is a structural diagram of a speech state recognition model in the intelligent analysis method based on video behavior data according to an embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides an intelligent analysis method based on video behavior data. Fig. 1 is a schematic flow chart of an intelligent analysis method based on video behavior data according to an embodiment of the present invention. The method may be performed by an apparatus, which may be implemented by software and/or hardware.
In this embodiment, the intelligent analysis method based on video behavior data includes:
and S1, receiving the pre-recorded user video, and performing voice extraction operation on the user video to obtain voice data and video data not including the voice data.
Preferably, the pre-recorded user video can be captured in different scenarios. For example, during an insurance company's claim settlement process, the communication video between the service personnel and the claimant is recorded; when the public security system interrogates a suspect, the entire interrogation process is recorded.
Preferably, the performing of a voice extraction operation on the user video to obtain voice data and video data not including the voice data includes: carrying out a pre-emphasis operation on the user video, performing a framing and windowing operation on the user video after the pre-emphasis operation, and separating the voice data from the user video after the framing and windowing operation based on the discrete Fourier transform.
The pre-emphasis operation compensates the voice signal in the user video. The human vocal system suppresses the high-frequency part of speech; in addition, to give the voice energy of the high-frequency part an amplitude similar to that of the low-frequency part, flatten the spectrum of the signal, and keep the same signal-to-noise ratio over the whole band from low frequency to high frequency, the energy of the high-frequency part needs to be boosted. The pre-emphasis operation may be calculated by:
y(n) = x(n) - μx(n-1)
wherein y(n) is the user video after the pre-emphasis operation, x(n) is the user video, n is the sample index of the waveform, μ is the adjustment value of the pre-emphasis operation, and its value range is [0.9, 1.0].
Preferably, the framing and windowing operation removes the overlapping voice parts in the user video. For example, in the recorded communication video between the claim-settlement service personnel and the claimant, the voices of the service personnel and the claimant overlap, so the framing and windowing operation can be used to remove the voice of the service personnel and retain the voice of the claimant. The framing and windowing operation is performed as follows:
[Windowing function formula, given as an image in the original document]
wherein w(n) is the user video after the framing and windowing operation, n is the sample index, and L is the frame length of the user video.
Preferably, the voice data are separated from the user video after the framing and windowing operation based on the discrete Fourier transform, which is calculated as:
S(k) = Σ_{n=0}^{N-1} w(n) e^(-j2πkn/N)
wherein S(k) is the separated voice data, N is the number of points of the discrete Fourier transform, w(n) is the user video after the framing and windowing operation, j is the imaginary unit, and k is the index of the frequency interval of the waveform.
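The following is a minimal NumPy sketch of the pre-emphasis, framing/windowing and discrete-Fourier-transform steps described above. The frame length, hop size and the use of a Hamming window are illustrative assumptions; the patent only specifies a window over the frame length L, not a particular window shape.

```python
import numpy as np

def pre_emphasis(x, mu=0.97):
    """y(n) = x(n) - mu * x(n-1); mu is assumed to lie in [0.9, 1.0]."""
    return np.append(x[0], x[1:] - mu * x[:-1])

def frame_and_window(x, frame_len=400, hop=160):
    """Split the waveform into overlapping frames of length L = frame_len and
    apply a window to each frame (Hamming chosen here as an illustrative window).
    Assumes len(x) >= frame_len."""
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])
    return frames * np.hamming(frame_len)

def dft_spectrum(frames, n_fft=512):
    """Discrete Fourier transform of each windowed frame:
    S(k) = sum_n w(n) * exp(-j*2*pi*k*n/N)."""
    return np.fft.rfft(frames, n=n_fft, axis=1)

# Usage sketch: `audio` is the 1-D audio track extracted from the user video
# (for example with ffmpeg); the resulting spectra can feed the subsequent
# speech/non-speech separation step.
# spectra = dft_spectrum(frame_and_window(pre_emphasis(audio)))
```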
And S2, inputting the video data into a pre-trained expression recognition model for expression recognition to obtain an expression recognition result.
Preferably, the training process of the expression recognition model includes: constructing the expression recognition model; establishing a facial expression library and a comparison expression library; locating and cropping the face region of the facial expression library according to the expression recognition model to obtain a cut facial expression library; predicting the feature points of the cut facial expression library by using the expression recognition model and judging the error between the predicted feature points and the comparison expression library; if the error is larger than a preset error, predicting the feature points of the cut facial expression library again; and if the error is smaller than the preset error, exiting the prediction to obtain the pre-trained expression recognition model.
The JAFFE expression database of the Japanese ATR (Advanced Telecommunications Research Institute International) is a database built specifically for expression recognition research. It contains 213 images of Japanese female faces (resolution: 256 × 256 pixels per image), each annotated with its original expression definition. The database covers 10 subjects in total, each with 7 expressions (normal, also called the neutral face, plus happiness, sadness, surprise, anger, disgust and fear).
Preferably, the facial expression library is created by crawling facial expression images with a web crawler and normalizing the brightness of the captured faces. The facial expression library uses six emotions as labels: happiness, sadness, surprise, anger, disgust and fear. The facial characteristics of each label are different, for example happiness (the face smiles, the corners of the mouth are raised, and the eyes are smaller than in the normal state because the pupils contract when a person is happy) and anger (the main characteristic is enlarged pupils, so the eyes are larger than in the normal state).
In a preferred embodiment of the present invention, the expression recognition model adopts a DCNN (Deep Convolutional Network cascade for Facial Point Detection) model.
The locating and cropping step addresses the problem that a facial expression image covering too large a region interferes with the judgment of facial expression recognition: the first-stage convolutional network model of the DCNN locates the face and crops it by searching for 5 facial feature points (left and right eyes, nose, left and right mouth corners).
Specifically, the first stage of the DCNN consists of three convolutional neural networks, named F1 (whose input is the whole face picture), EN1 (whose input picture contains the eyes and nose) and NM1 (whose input contains the nose and mouth region). For the input facial expression image, F1 outputs a 10-dimensional feature vector (5 feature points); based on this 10-dimensional feature vector, EN1 locates the three feature points of the left eye, right eye and nose; meanwhile, NM1 locates the three feature points of the left mouth corner, right mouth corner and nose, and, after combining the nose feature point located by EN1, a face region picture containing the eyes, nose and mouth is cropped out.
The above operation roughly locates the positions of the 5 facial feature points; the second-stage convolutional neural network model of the DCNN then continues the feature localization with the five predicted feature points as centers. The second stage consists of 10 CNNs, which are used to predict the 5 feature points: each feature point uses two CNNs, and the predictions of the two CNNs are averaged.
The third-stage neural network model of the DCNN crops the face again based on the positions predicted in the second stage. The third-stage model has the same structure as the second stage and also consists of 10 CNNs.
Further, the error is calculated as:
err = ||x - x'|| / l
wherein l is the image width of the facial expression image, x is the vector representation of the 5 feature points of the facial expression library picture, x' is the feature vector of the facial expression library data, and y' is the corresponding facial expression label of the facial expression library.
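As a rough illustration of the training loop described above, the sketch below predicts the 5 feature points, measures the width-normalized error against the comparison expression library, and keeps adjusting the model until the error falls below the preset error. The threshold value, the round limit and the predict()/update() interface of the model are assumptions made purely for illustration.

```python
import numpy as np

def normalized_point_error(pred_pts, reference_pts, img_width):
    """Mean distance between predicted and reference feature points,
    normalized by the image width l."""
    return np.mean(np.linalg.norm(pred_pts - reference_pts, axis=-1)) / img_width

def train_expression_landmarks(model, faces, reference_pts, img_width,
                               preset_error=0.05, max_rounds=100):
    """Illustrative training loop for the feature point predictor: predict,
    compare with the comparison expression library, adjust parameters while
    the error is above the preset error, and stop once it drops below it."""
    for _ in range(max_rounds):
        pred_pts = model.predict(faces)            # (N, 5, 2) array of landmarks
        err = normalized_point_error(pred_pts, reference_pts, img_width)
        if err < preset_error:                     # exit prediction, training finished
            break
        model.update(faces, reference_pts)         # adjust the model parameters
    return model
```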
And S3, inputting the voice data into a pre-trained speech state recognition model for speech state recognition to obtain a speech state recognition result.
Preferably, the speech state recognition model is based on a convolutional-recurrent neural network, and the network structure of the whole speech state recognition model is shown in fig. 4 of the specification.
As shown in fig. 4 of the specification, the speech state recognition model comprises a convolutional layer, a pooling layer, a Permute layer, an LSTM layer and a fully connected layer.
Preferably, the speech state recognition includes: the convolutional layer and the pooling layer receive the voice data and perform convolution processing and pooling processing.
The calculation method of the convolution processing comprises the following steps:
x_j^m = f( Σ_{i∈M_j} x_i^(m-1) * k_ij^m + b_j^m )
wherein x_j^m represents the input of the j-th feature map of the m-th convolutional layer, k_ij^m represents the convolution kernel, b_j^m represents the bias term, * represents the convolution operation, M_j represents the set of feature maps, and f represents the activation function.
The calculation method of the pooling processing is as follows:
x^n = f( β^n · down(x^(n-1)) + b^n )
wherein x^n represents the input feature map of the n-th layer, x^(n-1) represents the output feature map of the (n-1)-th layer, β^n and b^n respectively represent the weight and bias terms, and down represents the down-sampling function from layer n-1 to layer n.
The Permute layer performs dimensional expansion on the data after the convolution processing and the pooling processing, and the LSTM layer and the fully connected layer then compute the speech state recognition result, which uses the same six categories as the expression recognition: happiness, sadness, surprise, anger, disgust and fear.
The calculation method of the LSTM layer is as follows:
i_t = σ(W_i x_t + W_i m_(t-1) + b_i)
f_t = σ(W_f x_t + W_f m_(t-1) + b_f)
o_t = σ(W_o x_t + W_o m_(t-1) + b_o)
c_new = h(W_c x_t + W_c m_(t-1) + b_c)
wherein c_new is the output value of the LSTM layer, i_t, f_t and o_t respectively represent the input gate, output gate and forget gate of the LSTM layer, t is the time step, σ is the sigmoid function, h is the tanh function, W is the weight, b is the bias, and m_(t-1) is the hidden state at time t-1.
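A compact Keras sketch of the convolution, pooling, Permute, LSTM and fully connected structure described above is given below. The filter counts, kernel size, input feature shape and optimizer are illustrative assumptions; only the layer ordering follows the description.

```python
from tensorflow.keras import layers, models

def build_speech_state_model(n_frames=128, n_features=40, n_classes=6):
    """Convolution -> pooling -> Permute -> LSTM -> fully connected, ending in
    a softmax over the 6 speech states (happiness, sadness, surprise, anger,
    disgust, fear)."""
    model = models.Sequential([
        layers.Conv1D(64, kernel_size=3, padding="same", activation="relu",
                      input_shape=(n_frames, n_features)),
        layers.MaxPooling1D(pool_size=2),
        layers.Permute((2, 1)),   # reorder dimensions before the recurrent layer
        layers.LSTM(64),          # gates i_t, f_t, o_t as in the equations above
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```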
S4, constructing a classification tree according to the speech state recognition result and the expression recognition result, and obtaining a deep-shallow psychological characteristic set according to the classification tree.
Preferably, the constructing of the deep-shallow psychological characteristics based on the speech state recognition result and the expression recognition result includes: constructing a classification feature sequence tree according to the speech state recognition result and the expression recognition result, and obtaining the deep-shallow psychological characteristic set according to the classification feature sequence tree.
Preferably, the classification feature sequence tree may adopt a CART tree.
Further, the deep-shallow psychological characteristic set may be obtained from the classification feature sequence tree by the Gini index method, whose calculation formula is as follows:
Gini(D, A) = Σ_s (T_s/K)(1 - T_s/K)
wherein A represents the deep-shallow psychological characteristics, D represents the set formed by the speech state recognition result and the expression recognition result, and T_s indicates a label classification (including happiness, anger and the like; for example, T_1 indicates anger). Further,
Gini(D, A) = 1 - Σ_s (T_s/K)^2
wherein K represents the data volume of the set formed by the speech state recognition result and the expression recognition result.
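To make the Gini computation and the CART construction concrete, the sketch below computes Gini(D) for a set of labels and fits a CART tree with the Gini criterion on a toy encoding of the recognition results. The label values, the integer encoding of the six emotion categories and the use of scikit-learn's DecisionTreeClassifier are illustrative assumptions, not part of the patent text.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def gini_index(labels):
    """Gini(D) = 1 - sum_s (T_s / K)^2, where T_s is the count of samples with
    label s and K is the total number of samples in D."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - float(np.sum(p ** 2))

# D: each row pairs a speech state id with an expression id (0..5 for the six
# emotion categories); y: hypothetical psychological tags used only as a demo.
D = np.array([[0, 0], [1, 1], [3, 3], [4, 4], [2, 0], [5, 5]])
y = np.array([1, 0, 1, 1, 0, 1])

cart = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0).fit(D, y)
print("Gini(D) =", gini_index(y))
print("feature importances:", cart.feature_importances_)
```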
S5, constructing an objective function according to the depth shallow layer psychological characteristic set, and solving a partial derivative of the objective function to obtain an offset value.
Preferably, the constructing an objective function according to the set of deep and shallow psychological features, and solving a partial derivative of the objective function to obtain an offset value includes: and respectively constructing a penalty term and an error function based on the deep shallow psychological characteristic set, adding the error function and the penalty term to obtain an objective function, solving a first-order partial derivative result and a second-order partial derivative result of the error function, and reversely deducing to obtain an offset value in the objective function according to the first-order partial derivative result and the second-order partial derivative result.
Preferably, after the pre-constructed psychological analysis model receives the deep-shallow psychological characteristics, an objective function is constructed based on the deep-shallow psychological characteristics:
y = Σ_{k=1}^{K} f_k(x_i), x_i ∈ deep_show
wherein y is the offset value, deep_show represents the deep-shallow psychological characteristic set, K is the data size of the deep-shallow psychological characteristic set, and f_k(x_i) is the objective function.
Further, the objective function is:
Obj = Σ_i l(x_i) + Σ_i Ω(f_i)
wherein l(x_i) is the error function of the deep-shallow psychological characteristics and Ω(f_i) is the penalty term function, which is intended to improve the accuracy of the evaluation of the invention. Further, the penalty term Ω(f_t) is:
Ω(f_t) = γM + (1/2) Σ_{j=1}^{M} ω_j^2
wherein M is the number of leaf nodes of the CART tree and ω_j is the weight of the j-th leaf node of the CART tree. Further, the error function is:
l ≈ Σ_i [ g_i f_t(x_i) + (1/2) h_i f_t^2(x_i) ]
wherein g_i and h_i are respectively the first-order and second-order partial derivatives of l(x_i).
Combining the above formulas gives the final objective function:
Obj = -(1/2) Σ_{j=1}^{T} G_j^2 / H_j + γT
wherein G_j and H_j respectively denote the accumulated first-order and second-order partial derivatives, T is the penalty term (the number of leaf nodes) and γ is the penalty term coefficient; from this the offset value is obtained.
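The following toy sketch mirrors the second-order construction above: it forms the first- and second-order partial derivatives of a squared-error function, accumulates them, and recovers the offset value by setting the derivative of the objective to zero. The squared-error choice, the single-leaf simplification and the value of γ are assumptions made purely for illustration.

```python
import numpy as np

def offset_from_derivatives(g, h, gamma=1.0):
    """Accumulate G = sum(g_i), H = sum(h_i); the optimal leaf weight is -G/H
    (obtained by setting the partial derivative of the objective to zero) and
    the objective value is -0.5 * G^2 / H + gamma (single-leaf case)."""
    G, H = float(np.sum(g)), float(np.sum(h))
    offset = -G / H
    objective = -0.5 * G * G / H + gamma
    return offset, objective

# Example with the squared error l = 0.5 * (pred - target)^2 over the
# deep-shallow psychological feature set (values below are hypothetical):
pred = np.array([0.2, 0.7, 0.4])
target = np.array([0.0, 1.0, 1.0])
g = pred - target            # first-order partial derivative of l
h = np.ones_like(pred)       # second-order partial derivative of l
print(offset_from_derivatives(g, h))
```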
And S6, judging whether the offset value is greater than a preset offset error, and if so, feeding the speech state recognition result and the expression recognition result back to a professional psychoanalyst for further psychological state analysis.
If the offset value is greater than the preset offset error, the deep-shallow psychological characteristic set has not reached the expected psychological analysis result and the obtained expression recognition result and speech state recognition result are inconsistent, so further analysis by a professional psychoanalyst is required.
And S7, if the offset value is less than or equal to a preset offset error, generating a psychological state analysis result according to the expression recognition result and the speech recognition result, and outputting the psychological state analysis result.
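A small sketch of the S6/S7 decision step follows. The threshold value and the way the two recognition results are packaged into an analysis result are illustrative assumptions.

```python
def analyze_mental_state(offset_value, expression_result, speech_state_result,
                         preset_offset_error=0.1):
    """If the offset value exceeds the preset offset error, return the
    recognition results for expert review; otherwise generate and return a
    psychological state analysis result."""
    if offset_value > preset_offset_error:
        return {"status": "needs_expert_review",
                "expression": expression_result,
                "speech_state": speech_state_result}
    return {"status": "ok",
            "analysis": {"expression": expression_result,
                         "speech_state": speech_state_result}}

# Example: analyze_mental_state(0.05, "happiness", "happiness") -> status "ok"
```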
The invention also provides an intelligent analysis device based on the video behavior data. Fig. 2 is a schematic diagram illustrating an internal structure of an intelligent analysis device based on video behavior data according to an embodiment of the present invention.
In the present embodiment, the intelligent analysis device 1 based on video behavior data may be a PC (personal computer), a terminal device such as a smart phone, a tablet computer, or a mobile computer, or may be a server. The intelligent analysis device 1 based on video behavior data comprises at least a memory 11, a processor 12, a communication bus 13, and a network interface 14.
The memory 11 includes at least one type of readable storage medium, which includes a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, and the like. The memory 11 may in some embodiments be an internal storage unit of the intelligent analysis device 1 based on video behavior data, for example a hard disk of the intelligent analysis device 1 based on video behavior data. The memory 11 may also be an external storage device of the intelligent analysis device 1 based on the video behavior data in other embodiments, such as a plug-in hard disk provided on the intelligent analysis device 1 based on the video behavior data, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and so on. Further, the memory 11 may also include both an internal storage unit and an external storage device of the intelligent analysis apparatus 1 based on video behavior data. The memory 11 may be used not only to store application software installed in the intelligent analysis device 1 based on video behavior data and various types of data, such as codes of the intelligent analysis program 01 based on video behavior data, but also to temporarily store data that has been output or is to be output.
The processor 12 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor or other data Processing chip in some embodiments, and is used for executing program codes stored in the memory 11 or Processing data, such as executing the intelligent analysis program 01 based on video behavior data.
The communication bus 13 is used to realize connection communication between these components.
The network interface 14 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface), typically used to establish a communication link between the apparatus 1 and other electronic devices.
Optionally, the apparatus 1 may further comprise a user interface, which may include a display (Display) and an input unit such as a keyboard (Keyboard); the optional user interface may also include a standard wired interface and a wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like.
Fig. 2 shows only the intelligent analysis apparatus 1 based on video behavior data with the components 11-14 and the intelligent analysis program 01 based on video behavior data, and it will be understood by those skilled in the art that the structure shown in fig. 2 does not constitute a limitation of the intelligent analysis apparatus 1 based on video behavior data, which may include fewer or more components than those shown, or combine some components, or have a different arrangement of components.
In the embodiment of the apparatus 1 shown in fig. 2, the memory 11 stores therein an intelligent analysis program 01 based on video behavior data; the processor 12 executes the intelligent analysis program 01 based on video behavior data stored in the memory 11 to implement the following steps:
step one, receiving a pre-recorded user video, and executing voice extraction operation on the user video to obtain voice data and video data not including the voice data.
Preferably, the pre-recorded user video can be captured in different scenarios. For example, during an insurance company's claim settlement process, the communication video between the service personnel and the claimant is recorded; when the public security system interrogates a suspect, the entire interrogation process is recorded.
Preferably, the performing of a voice extraction operation on the user video to obtain voice data and video data not including the voice data includes: carrying out a pre-emphasis operation on the user video, performing a framing and windowing operation on the user video after the pre-emphasis operation, and separating the voice data from the user video after the framing and windowing operation based on the discrete Fourier transform.
The pre-emphasis operation compensates the voice signal in the user video. The human vocal system suppresses the high-frequency part of speech; in addition, to give the voice energy of the high-frequency part an amplitude similar to that of the low-frequency part, flatten the spectrum of the signal, and keep the same signal-to-noise ratio over the whole band from low frequency to high frequency, the energy of the high-frequency part needs to be boosted. The pre-emphasis operation may be calculated by:
y(n) = x(n) - μx(n-1)
wherein y(n) is the user video after the pre-emphasis operation, x(n) is the user video, n is the sample index of the waveform, μ is the adjustment value of the pre-emphasis operation, and its value range is [0.9, 1.0].
Preferably, the framing and windowing operation removes the overlapping voice parts in the user video. For example, in the recorded communication video between the claim-settlement service personnel and the claimant, the voices of the service personnel and the claimant overlap, so the framing and windowing operation can be used to remove the voice of the service personnel and retain the voice of the claimant. The framing and windowing operation is performed as follows:
[Windowing function formula, given as an image in the original document]
wherein w(n) is the user video after the framing and windowing operation, n is the sample index, and L is the frame length of the user video.
Preferably, the voice data are separated from the user video after the framing and windowing operation based on the discrete Fourier transform, which is calculated as:
S(k) = Σ_{n=0}^{N-1} w(n) e^(-j2πkn/N)
wherein S(k) is the separated voice data, N is the number of points of the discrete Fourier transform, w(n) is the user video after the framing and windowing operation, j is the imaginary unit, and k is the index of the frequency interval of the waveform.
And step two, inputting the video data into an expression recognition model which is trained in advance to perform expression recognition to obtain an expression recognition result.
Preferably, the training process of the expression recognition model includes: constructing the expression recognition model; establishing a facial expression library and a comparison expression library; locating and cropping the face region of the facial expression library according to the expression recognition model to obtain a cut facial expression library; predicting the feature points of the cut facial expression library by using the expression recognition model and judging the error between the predicted feature points and the comparison expression library; if the error is larger than a preset error, predicting the feature points of the cut facial expression library again; and if the error is smaller than the preset error, exiting the prediction to obtain the pre-trained expression recognition model.
The JAFFE expression database of the Japanese ATR (Advanced Telecommunications Research Institute International) is a database built specifically for expression recognition research. It contains 213 images of Japanese female faces (resolution: 256 × 256 pixels per image), each annotated with its original expression definition. The database covers 10 subjects in total, each with 7 expressions (normal, also called the neutral face, plus happiness, sadness, surprise, anger, disgust and fear).
Preferably, the facial expression library is created by crawling facial expression images with a web crawler and normalizing the brightness of the captured faces. The facial expression library uses six emotions as labels: happiness, sadness, surprise, anger, disgust and fear. The facial characteristics of each label are different, for example happiness (the face smiles, the corners of the mouth are raised, and the eyes are smaller than in the normal state because the pupils contract when a person is happy) and anger (the main characteristic is enlarged pupils, so the eyes are larger than in the normal state).
In a preferred embodiment of the present invention, the expression recognition model adopts a DCNN (Deep Convolutional Network cascade for Facial Point Detection) model.
The locating and cropping step addresses the problem that a facial expression image covering too large a region interferes with the judgment of facial expression recognition: the first-stage convolutional network model of the DCNN locates the face and crops it by searching for 5 facial feature points (left and right eyes, nose, left and right mouth corners).
Specifically, the first stage of the DCNN consists of three convolutional neural networks, named F1 (whose input is the whole face picture), EN1 (whose input picture contains the eyes and nose) and NM1 (whose input contains the nose and mouth region). For the input facial expression image, F1 outputs a 10-dimensional feature vector (5 feature points); based on this 10-dimensional feature vector, EN1 locates the three feature points of the left eye, right eye and nose; meanwhile, NM1 locates the three feature points of the left mouth corner, right mouth corner and nose, and, after combining the nose feature point located by EN1, a face region picture containing the eyes, nose and mouth is cropped out.
The above operation roughly locates the positions of the 5 facial feature points; the second-stage convolutional neural network model of the DCNN then continues the feature localization with the five predicted feature points as centers. The second stage consists of 10 CNNs, which are used to predict the 5 feature points: each feature point uses two CNNs, and the predictions of the two CNNs are averaged.
The third-stage neural network model of the DCNN crops the face again based on the positions predicted in the second stage. The third-stage model has the same structure as the second stage and also consists of 10 CNNs.
Further, the error is calculated as:
err = ||x - x'|| / l
wherein l is the image width of the facial expression image, x is the vector representation of the 5 feature points of the facial expression library picture, x' is the feature vector of the facial expression library data, and y' is the corresponding facial expression label of the facial expression library.
And step three, inputting the voice data into a pre-trained speech state recognition model for speech state recognition to obtain a speech state recognition result.
Preferably, the speech state recognition model is based on a convolutional-recurrent neural network, and the network structure of the whole speech state recognition model is shown in fig. 4 of the specification.
As shown in fig. 4 of the specification, the speech state recognition model comprises a convolutional layer, a pooling layer, a Permute layer, an LSTM layer and a fully connected layer.
Preferably, the speech state recognition includes: the convolutional layer and the pooling layer receive the voice data and perform convolution processing and pooling processing.
The calculation method of the convolution processing comprises the following steps:
x_j^m = f( Σ_{i∈M_j} x_i^(m-1) * k_ij^m + b_j^m )
wherein x_j^m represents the input of the j-th feature map of the m-th convolutional layer, k_ij^m represents the convolution kernel, b_j^m represents the bias term, * represents the convolution operation, M_j represents the set of feature maps, and f represents the activation function.
The calculation method of the pooling processing is as follows:
x^n = f( β^n · down(x^(n-1)) + b^n )
wherein x^n represents the input feature map of the n-th layer, x^(n-1) represents the output feature map of the (n-1)-th layer, β^n and b^n respectively represent the weight and bias terms, and down represents the down-sampling function from layer n-1 to layer n.
And the Permute layer performs dimensional expansion on the data after the convolution processing and the pooling processing, and the LSTM layer and the fully connected layer then compute the speech state recognition result, which uses the same six categories as the expression recognition: happiness, sadness, surprise, anger, disgust and fear.
The calculation method of the LSTM layer is as follows:
i_t = σ(W_i x_t + W_i m_(t-1) + b_i)
f_t = σ(W_f x_t + W_f m_(t-1) + b_f)
o_t = σ(W_o x_t + W_o m_(t-1) + b_o)
c_new = h(W_c x_t + W_c m_(t-1) + b_c)
wherein c_new is the output value of the LSTM layer, i_t, f_t and o_t respectively represent the input gate, output gate and forget gate of the LSTM layer, t is the time step, σ is the sigmoid function, h is the tanh function, W is the weight, b is the bias, and m_(t-1) is the hidden state at time t-1.
And fourthly, constructing a classification tree according to the speech state recognition result and the expression recognition result, and obtaining a deep-shallow psychological characteristic set according to the classification tree.
Preferably, the constructing of the deep-shallow psychological characteristics based on the speech state recognition result and the expression recognition result includes: constructing a classification feature sequence tree according to the speech state recognition result and the expression recognition result, and obtaining the deep-shallow psychological characteristic set according to the classification feature sequence tree.
Preferably, the classification feature sequence tree may adopt a CART tree.
Further, the deep-shallow psychological characteristic set may be obtained from the classification feature sequence tree by the Gini index method, whose calculation formula is as follows:
Gini(D, A) = Σ_s (T_s/K)(1 - T_s/K)
wherein A represents the deep-shallow psychological characteristics, D represents the set formed by the speech state recognition result and the expression recognition result, and T_s indicates a label classification (including happiness, anger and the like; for example, T_1 indicates anger). Further,
Gini(D, A) = 1 - Σ_s (T_s/K)^2
wherein K represents the data volume of the set formed by the speech state recognition result and the expression recognition result.
And fifthly, constructing an objective function according to the depth shallow psychological characteristic set, and solving a partial derivative of the objective function to obtain an offset value.
Preferably, the constructing an objective function according to the set of deep and shallow psychological features, and solving a partial derivative of the objective function to obtain an offset value includes: and respectively constructing a penalty term and an error function based on the deep shallow psychological characteristic set, adding the error function and the penalty term to obtain an objective function, solving a first-order partial derivative result and a second-order partial derivative result of the error function, and reversely deducing to obtain an offset value in the objective function according to the first-order partial derivative result and the second-order partial derivative result.
Preferably, after the pre-constructed psychological analysis model receives the deep-shallow psychological characteristics, an objective function is constructed based on the deep-shallow psychological characteristics:
y = Σ_{k=1}^{K} f_k(x_i), x_i ∈ deep_show
wherein y is the offset value, deep_show represents the deep-shallow psychological characteristic set, K is the data size of the deep-shallow psychological characteristic set, and f_k(x_i) is the objective function.
Further, the objective function is:
Obj = Σ_i l(x_i) + Σ_i Ω(f_i)
wherein l(x_i) is the error function of the deep-shallow psychological characteristics and Ω(f_i) is the penalty term function, which is intended to improve the accuracy of the evaluation of the invention. Further, the penalty term Ω(f_t) is:
Ω(f_t) = γM + (1/2) Σ_{j=1}^{M} ω_j^2
wherein M is the number of leaf nodes of the CART tree and ω_j is the weight of the j-th leaf node of the CART tree. Further, the error function is:
l ≈ Σ_i [ g_i f_t(x_i) + (1/2) h_i f_t^2(x_i) ]
wherein g_i and h_i are respectively the first-order and second-order partial derivatives of l(x_i).
Combining the above formulas gives the final objective function:
Obj = -(1/2) Σ_{j=1}^{T} G_j^2 / H_j + γT
wherein G_j and H_j respectively denote the accumulated first-order and second-order partial derivatives, T is the penalty term (the number of leaf nodes) and γ is the penalty term coefficient; from this the offset value is obtained.
And step six, judging whether the offset value is greater than a preset offset error, and if so, feeding the speech state recognition result and the expression recognition result back to a professional psychoanalyst for further psychological state analysis.
If the offset value is greater than the preset offset error, the deep-shallow psychological characteristic set has not reached the expected psychological analysis result and the obtained expression recognition result and speech state recognition result are inconsistent, so further analysis by a professional psychoanalyst is required.
And seventhly, if the bias value is smaller than or equal to a preset bias error, generating a psychological state analysis result according to the expression recognition result and the speech state recognition result, and outputting the psychological state analysis result.
Alternatively, in other embodiments, the intelligent analysis program based on video behavior data may be further divided into one or more modules, and the one or more modules are stored in the memory 11 and executed by one or more processors (in this embodiment, the processor 12) to implement the present invention.
For example, referring to fig. 3, which is a schematic diagram of the program modules of the intelligent analysis program based on video behavior data in an embodiment of the intelligent analysis device based on video behavior data of the present invention: in this embodiment, the intelligent analysis program based on video behavior data may be divided into a data receiving and separating module 10, an expression and speech state recognition module 20, a classification tree construction module 30 and a mental state analysis module 40, which exemplarily:
the data receiving and separating module 10 is configured to: receiving a pre-recorded user video, and executing voice extraction operation on the user video to obtain voice data and video data not including the voice data.
The expression and speech state recognition module 20 is configured to: inputting the video data into a pre-trained expression recognition model for expression recognition to obtain an expression recognition result, and inputting the voice data into a pre-trained speech state recognition model for speech state recognition to obtain a speech state recognition result.
The classification tree construction module 30 is configured to: constructing a classification tree according to the speech state recognition result and the expression recognition result, and obtaining a deep-shallow psychological characteristic set according to the classification tree.
The mental state analysis module 40 is configured to: constructing an objective function according to the deep-shallow psychological characteristic set, solving a partial derivative of the objective function to obtain an offset value, feeding back the speech state recognition result and the expression recognition result to a preset user if the offset value is greater than a preset offset error, generating a psychological state analysis result according to the expression recognition result and the speech state recognition result if the offset value is less than or equal to the preset offset error, and outputting the psychological state analysis result.
The functions or operation steps implemented when the data receiving and separating module 10, the expression and speech state recognition module 20, the classification tree construction module 30, the mental state analysis module 40 and the other program modules are executed are substantially the same as those of the above embodiments and are not described again here.
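As a structural illustration of the module division above, the sketch below wires the four modules into one program object. The method names and the module interfaces are hypothetical; the patent only names the modules and their responsibilities.

```python
class VideoBehaviorAnalysisProgram:
    """Illustrative composition of the four program modules; each injected
    module object is assumed to expose the methods used below."""

    def __init__(self, data_module, recognition_module, tree_module, analysis_module):
        self.data_module = data_module                 # data receiving and separating module 10
        self.recognition_module = recognition_module   # expression and speech state recognition module 20
        self.tree_module = tree_module                 # classification tree construction module 30
        self.analysis_module = analysis_module         # mental state analysis module 40

    def run(self, user_video):
        voice, video = self.data_module.extract(user_video)
        expression = self.recognition_module.recognize_expression(video)
        speech_state = self.recognition_module.recognize_speech_state(voice)
        features = self.tree_module.build(speech_state, expression)
        return self.analysis_module.analyze(features, expression, speech_state)
```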
Furthermore, an embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium has stored thereon an intelligent analysis program based on video behavior data, and the intelligent analysis program based on video behavior data is executable by one or more processors to implement the following operations:
receiving a pre-recorded user video, and executing voice extraction operation on the user video to obtain voice data and video data not including the voice data.
And inputting the video data into a pre-trained expression recognition model for expression recognition to obtain an expression recognition result, and inputting the voice data into a pre-trained speech state recognition model for speech state recognition to obtain a speech state recognition result.
And constructing a classification tree according to the speech state recognition result and the expression recognition result, and obtaining a deep-shallow psychological characteristic set according to the classification tree.
And constructing an objective function according to the deep-shallow psychological characteristic set, solving a partial derivative of the objective function to obtain an offset value, feeding back the speech state recognition result and the expression recognition result to a preset user if the offset value is greater than a preset offset error, generating a psychological state analysis result according to the expression recognition result and the speech state recognition result if the offset value is less than or equal to the preset offset error, and outputting the psychological state analysis result.
It should be noted that the numbering of the above embodiments of the present invention is for description only and does not indicate the relative merits of the embodiments. The terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, apparatus, article, or method. Without further limitation, an element defined by the statement "comprising a ..." does not exclude the presence of other identical elements in the process, apparatus, article, or method that includes the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software together with a necessary general-purpose hardware platform, and certainly can also be implemented by hardware, but in many cases the former is the better implementation. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product that is stored in a storage medium as described above (e.g., ROM/RAM, magnetic disk, optical disc) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention and is not intended to limit the scope of the present invention. All equivalent structural or process modifications made using the contents of this specification and the accompanying drawings, whether applied directly or indirectly in other related technical fields, are likewise included within the scope of the present invention.

Claims (10)

1. An intelligent analysis method based on video behavior data, the method comprising:
receiving a pre-recorded user video, and performing a voice extraction operation on the user video to obtain voice data and video data that does not include the voice data;
inputting the video data into a pre-trained expression recognition model for expression recognition to obtain an expression recognition result;
inputting the voice data into a pre-trained speech state recognition model for speech state recognition to obtain a speech state recognition result;
and constructing a classification tree according to the speech state recognition result and the expression recognition result, obtaining a deep and shallow psychological feature set according to the classification tree, constructing an objective function according to the deep and shallow psychological feature set, solving the partial derivative of the objective function to obtain an offset value, feeding the speech state recognition result and the expression recognition result back to a preset user if the offset value is greater than a preset offset error, and, if the offset value is less than or equal to the preset offset error, generating a psychological state analysis result according to the expression recognition result and the speech state recognition result and outputting the psychological state analysis result.
2. The intelligent analysis method based on video behavior data according to claim 1, wherein performing the voice extraction operation on the user video to obtain the voice data and the video data that does not include the voice data comprises:
performing a pre-emphasis operation on the user video;
performing framing and windowing operations on the user video subjected to the pre-emphasis operation;
and separating voice data from the user video subjected to the framing and windowing operations based on a discrete Fourier transform method, to obtain the voice data and the video data that does not include the voice data.
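Claim 2 names three audio-side steps: pre-emphasis, framing with windowing, and a discrete Fourier transform. Below is a small NumPy sketch of those steps applied to the extracted audio track; the pre-emphasis coefficient, frame length, hop size and Hamming window are conventional defaults, not values taken from the claim.

```python
import numpy as np

def preprocess_audio(signal, sample_rate, alpha=0.97, frame_ms=25, hop_ms=10):
    """Pre-emphasis, framing/windowing and per-frame DFT of an audio signal.
    alpha, frame_ms and hop_ms are conventional defaults, not claim values."""
    # Pre-emphasis: y[n] = x[n] - alpha * x[n-1], boosts the high frequencies.
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + (len(emphasized) - frame_len) // hop_len  # assumes len >= frame_len

    # Framing and windowing with a Hamming window.
    window = np.hamming(frame_len)
    frames = np.stack([
        emphasized[i * hop_len: i * hop_len + frame_len] * window
        for i in range(n_frames)
    ])

    # Discrete Fourier transform of each frame (magnitude spectrum).
    spectra = np.abs(np.fft.rfft(frames, axis=1))
    return frames, spectra

# Example: one second of a synthetic 16 kHz signal.
frames, spectra = preprocess_audio(np.random.randn(16000), sample_rate=16000)
```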
3. The intelligent analysis method based on video behavior data according to claim 1, further comprising training the expression recognition model, the training comprising:
constructing the expression recognition model;
establishing a facial expression library and a comparison expression library;
locating and cropping the face region of the facial expression library according to the expression recognition model to obtain a cropped facial expression library;
predicting feature points of the facial expression library by using the expression recognition model, and determining the error between the predicted feature points and the comparison expression library; if the error is greater than a preset error, adjusting the parameters of the expression recognition model and predicting the feature points of the facial expression library again; if the error is less than the preset error, stopping the prediction, thereby completing the training of the expression recognition model.
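Claim 3 describes an iterative loop: predict feature points for the cropped facial expression library, compare them with the comparison expression library, and adjust the model parameters until the error falls below the preset error. The sketch below assumes a hypothetical model object exposing crop_face(), predict_landmarks() and update(); the error metric, threshold and iteration cap are likewise illustrative.

```python
import numpy as np

def train_expression_model(model, face_library, reference_points,
                           preset_error=0.05, max_iters=1000):
    """Iterative training sketch for the expression recognition model.
    `model` is assumed to expose crop_face(), predict_landmarks() and update();
    preset_error and max_iters are illustrative values."""
    # Locate and crop the face region of every image in the expression library.
    cropped = [model.crop_face(image) for image in face_library]

    for _ in range(max_iters):
        # Predict facial feature points for the cropped expression library.
        predicted = np.stack([model.predict_landmarks(image) for image in cropped])
        # Mean distance between predicted points and the comparison library.
        error = np.mean(np.linalg.norm(predicted - reference_points, axis=-1))
        if error < preset_error:
            break                                   # error small enough: stop
        model.update(predicted, reference_points)   # adjust model parameters
    return model
```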
4. The intelligent analysis method based on video behavior data according to any one of claims 1 to 3, characterized in that the deep and shallow psychological feature set is obtained by calculating the Gini index of the classification tree using a Gini index method;
wherein the Gini index method is as follows:
$$\operatorname{Gini}(D) = 1 - \sum_{s}\left(\frac{T_s}{K}\right)^{2}$$
$$\operatorname{Gini}(D) = 1 - \left(\frac{T_1}{K}\right)^{2} - \left(\frac{T_2}{K}\right)^{2}$$
wherein A represents the deep and shallow psychological feature set, D represents the set formed by the speech state recognition result and the expression recognition result, $T_s$ represents the data volume of each label classification s, $T_1$ represents the data volume of the anger label, $T_2$ represents the data volume of the joy label, and K represents the data volume of the set formed by the speech state recognition result and the expression recognition result.
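As a concrete reading of the Gini index method above, the sketch below computes the Gini impurity of a label set (for example, anger and joy labels drawn from the recognition results) and the weighted Gini index of a candidate split; the label names and the split are illustrative assumptions.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a label collection, e.g. ["anger", "joy", "joy"]."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def gini_index(feature_values, labels):
    """Weighted Gini index of splitting the recognition-result set on one
    candidate feature: group by feature value, then weight each group's Gini."""
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    total = len(labels)
    return sum(len(subset) / total * gini(subset) for subset in groups.values())

# Example: two labels, split by whether a (hypothetical) feature is high or low.
print(gini(["anger", "joy", "joy"]))                              # ~0.444
print(gini_index(["high", "high", "low"], ["anger", "joy", "joy"]))  # ~0.333
```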
5. The intelligent analysis method based on video behavior data according to claim 4, wherein constructing an objective function according to the deep and shallow psychological feature set and solving the partial derivative of the objective function to obtain an offset value comprises:
respectively constructing a penalty term and an error function based on the deep and shallow psychological feature set;
adding the error function and the penalty term to obtain an objective function;
solving a first-order partial derivative result and a second-order partial derivative result of the error function;
and deriving the offset value in the objective function according to the first-order partial derivative result and the second-order partial derivative result.
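To make the derivative step of claim 5 concrete, the sketch below assumes a squared error function over feature-derived targets and an L2 penalty term; the claim itself does not fix the exact form of either. Under these assumptions the offset value follows in closed form from the first- and second-order partial derivative results.

```python
import numpy as np

def solve_offset(targets, lam=1.0):
    """Offset value from first- and second-order partial derivatives, assuming
    objective(w) = sum_i (w - t_i)^2 + lam * w^2 (both choices are assumptions)."""
    targets = np.asarray(targets, dtype=float)
    g = -2.0 * targets.sum()   # first-order partial derivative of the error at w = 0
    h = 2.0 * len(targets)     # second-order partial derivative of the error (constant)
    # Setting d objective / dw = 0 and solving for w yields the offset value.
    return -g / (h + 2.0 * lam)

# Example: three feature-derived targets; the resulting offset would then be
# compared against the preset offset error in the analysis step.
print(solve_offset([0.2, 0.4, 0.9]))   # 0.375
```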
6. An intelligent analysis device based on video behavior data, characterized in that the device comprises a memory and a processor, the memory having stored thereon an intelligent analysis program based on video behavior data, the program being executable on the processor and, when executed by the processor, implementing the following steps:
receiving a pre-recorded user video, and performing a voice extraction operation on the user video to obtain voice data and video data that does not include the voice data;
inputting the video data into a pre-trained expression recognition model for expression recognition to obtain an expression recognition result;
inputting the voice data into a pre-trained speech state recognition model for speech state recognition to obtain a speech state recognition result;
and constructing a classification tree according to the speech state recognition result and the expression recognition result, obtaining a deep and shallow psychological feature set according to the classification tree, inputting the deep and shallow psychological feature set into a pre-constructed psychological analysis model to obtain an offset value, feeding the speech state recognition result and the expression recognition result back to a preset user if the offset value is greater than a preset offset error, and, if the offset value is less than or equal to the preset offset error, generating a psychological state analysis result according to the expression recognition result and the speech state recognition result and outputting the psychological state analysis result.
7. The intelligent analysis device based on video behavior data according to claim 6, wherein performing the voice extraction operation on the user video to obtain the voice data and the video data that does not include the voice data comprises:
performing a pre-emphasis operation on the user video;
performing framing and windowing operations on the user video subjected to the pre-emphasis operation;
and separating voice data from the user video subjected to the framing and windowing operations based on a discrete Fourier transform method, to obtain the voice data and the video data that does not include the voice data.
8. The intelligent analysis device based on video behavior data according to claim 6, wherein the intelligent analysis program based on video behavior data, when executed by the processor, further implements the step of training the expression recognition model, the training comprising:
constructing the expression recognition model;
establishing a facial expression library and a comparison expression library;
locating and cropping the face region of the facial expression library according to the expression recognition model to obtain a cropped facial expression library;
predicting feature points of the facial expression library by using the expression recognition model, and determining the error between the predicted feature points and the comparison expression library; if the error is greater than a preset error, adjusting the parameters of the expression recognition model and predicting the feature points of the facial expression library again; if the error is less than the preset error, stopping the prediction, thereby completing the training of the expression recognition model.
9. The intelligent analysis device based on video behavior data according to any one of claims 6 to 8, characterized in that the deep and shallow psychological feature set is obtained by calculating the Gini index of the classification tree using a Gini index method;
wherein the Gini index method is as follows:
$$\operatorname{Gini}(D) = 1 - \sum_{s}\left(\frac{T_s}{K}\right)^{2}$$
$$\operatorname{Gini}(D) = 1 - \left(\frac{T_1}{K}\right)^{2} - \left(\frac{T_2}{K}\right)^{2}$$
wherein A represents the deep and shallow psychological feature set, D represents the set formed by the speech state recognition result and the expression recognition result, $T_s$ represents the data volume of each label classification s, $T_1$ represents the data volume of the anger label, $T_2$ represents the data volume of the joy label, and K represents the data volume of the set formed by the speech state recognition result and the expression recognition result.
10. A computer-readable storage medium, wherein the computer-readable storage medium has stored thereon a video behavior data-based intelligent analysis program, which is executable by one or more processors to implement the steps of the video behavior data-based intelligent analysis method according to any one of claims 1 to 5.
CN202010122870.1A 2020-02-26 Intelligent analysis method, device and storage medium based on video behavior data Active CN111401147B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010122870.1A CN111401147B (en) 2020-02-26 Intelligent analysis method, device and storage medium based on video behavior data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010122870.1A CN111401147B (en) 2020-02-26 Intelligent analysis method, device and storage medium based on video behavior data

Publications (2)

Publication Number Publication Date
CN111401147A true CN111401147A (en) 2020-07-10
CN111401147B CN111401147B (en) 2024-06-04

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105976809A (en) * 2016-05-25 2016-09-28 中国地质大学(武汉) Voice-and-facial-expression-based identification method and system for dual-modal emotion fusion
CN107292256A (en) * 2017-06-14 2017-10-24 西安电子科技大学 Depth convolved wavelets neutral net expression recognition method based on secondary task
CN109858330A (en) * 2018-12-15 2019-06-07 深圳壹账通智能科技有限公司 Expression analysis method, apparatus, electronic equipment and storage medium based on video
CN109871751A (en) * 2019-01-04 2019-06-11 平安科技(深圳)有限公司 Attitude appraisal procedure, device and storage medium based on facial expression recognition
CN109829388A (en) * 2019-01-07 2019-05-31 平安科技(深圳)有限公司 Video data handling procedure, device and computer equipment based on micro- expression
CN110246506A (en) * 2019-05-29 2019-09-17 平安科技(深圳)有限公司 Voice intelligent detecting method, device and computer readable storage medium
CN110826466A (en) * 2019-10-31 2020-02-21 南京励智心理大数据产业研究院有限公司 Emotion identification method, device and storage medium based on LSTM audio-video fusion

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
李昊璇 et al.: "Speech emotion recognition fusing spectral features of the glottal wave signal" (融合声门波信号频谱特征的语音情感识别), 测试技术学报 (Journal of Test and Measurement Technology), vol. 31, no. 01, 28 February 2017 (2017-02-28), page 8 *
杨勇 et al.: "Research on expression recognition based on MSVR and the Arousal-Valence emotion model" (基于MSVR和Arousal-Valence情感模型的表情识别研究), 重庆邮电大学学报(自然科学版) (Journal of Chongqing University of Posts and Telecommunications, Natural Science Edition), vol. 28, no. 06, 15 December 2016 (2016-12-15), pages 836-843 *

Similar Documents

Publication Publication Date Title
CN106803069B (en) Crowd happiness degree identification method based on deep learning
CN110188615B (en) Facial expression recognition method, device, medium and system
CN109117777A (en) The method and apparatus for generating information
JP2017062781A (en) Similarity-based detection of prominent objects using deep cnn pooling layers as features
CN110580516B (en) Interaction method and device based on intelligent robot
CN113255557B (en) Deep learning-based video crowd emotion analysis method and system
Kallipolitis et al. Affective analysis of patients in homecare video-assisted telemedicine using computational intelligence
CN110705490A (en) Visual emotion recognition method
CN112418059A (en) Emotion recognition method and device, computer equipment and storage medium
Premaladha et al. Recognition of facial expression using haar cascade classifier and deep learning
Gantayat et al. Study of algorithms and methods on emotion detection from facial expressions: a review from past research
Cai et al. Pedestrian detection algorithm in traffic scene based on weakly supervised hierarchical deep model
Panda et al. Feedback through emotion extraction using logistic regression and CNN
Pallavi et al. Retrieval of facial sketches using linguistic descriptors: an approach based on hierarchical classification of facial attributes
CN111401147A (en) Intelligent analysis method and device based on video behavior data and storage medium
Takalkar et al. Improving micro-expression recognition accuracy using twofold feature extraction
CN111401147B (en) Intelligent analysis method, device and storage medium based on video behavior data
US20220180129A1 (en) Fcn-based multivariate time series data classification method and device
Rohini et al. A framework to identify allergen and nutrient content in fruits and packaged food using deep learning and OCR
CN115294621A (en) Expression recognition system and method based on two-stage self-healing network
CN113705328A (en) Depression detection method and system based on facial feature points and facial movement units
CN112668631A (en) Mobile terminal community pet identification method based on convolutional neural network
Gawade et al. Algorithm for safety decisions in social media feeds using personification patterns
CN112132175A (en) Object classification method and device, electronic equipment and storage medium
Bennur et al. Face Mask Detection and Face Recognition of Unmasked People in Organizations

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant