CN108985358A - Emotion recognition method, device, equipment and storage medium - Google Patents

Emotion recognition method, device, equipment and storage medium

Info

Publication number
CN108985358A
CN108985358A (application CN201810694899.XA); granted as CN108985358B
Authority
CN
China
Prior art keywords
session
modal
session information
fusion
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810694899.XA
Other languages
Chinese (zh)
Other versions
CN108985358B (en)
Inventor
林英展
陈炳金
梁川
梁一川
凌光
周超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201810694899.XA priority Critical patent/CN108985358B/en
Publication of CN108985358A publication Critical patent/CN108985358A/en
Application granted granted Critical
Publication of CN108985358B publication Critical patent/CN108985358B/en
Legal status: Active (current)


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/25 - Fusion techniques
    • G06F 18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/30 - Semantic analysis
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 - Speech or voice analysis techniques specially adapted for particular use
    • G10L 25/51 - Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L 25/63 - Speech or voice analysis techniques specially adapted for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Psychiatry (AREA)
  • Hospice & Palliative Care (AREA)
  • Child & Adolescent Psychology (AREA)
  • Acoustics & Sound (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • User Interface Of Digital Computer (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present invention disclose an emotion recognition method, device, equipment and storage medium. The method comprises: determining fused session features of multi-modal session information; and inputting the fused session features of the multi-modal session information into a pre-constructed multi-modal emotion recognition model to obtain emotional features of the multi-modal session information. In the technical solution provided by the embodiments of the present invention, the session features of each modality in the multi-modal session information are merged to obtain fused session features, and the fused session features are input into a single unified multi-modal emotion recognition model for training, so that the final emotion result can be predicted directly, without training a separate recognition model for each modality and then fusing the results of the different models. The sample training process is simplified, and the accuracy of the emotion recognition result is improved.

Description

Emotion recognition method, device, equipment and storage medium
Technical field
Embodiments of the present invention relate to the field of artificial intelligence, and in particular to an emotion recognition method, device, equipment and storage medium.
Background art
With the development of artificial intelligence, intelligent interaction plays an increasingly important role in a growing number of fields. Within intelligent interaction, one important problem is how to identify the user's current emotional state during a multi-modal interaction, so as to provide emotion-level feedback to the entire intelligent interactive system, which can then adjust in time to respond to users in different emotional states and improve the quality of service of the whole interaction.
At present, the mainstream emotion recognition method is shown in Fig. 1, and the overall process is as follows: each modality, such as speech, text and facial-expression images, is modeled independently, and the results of the individual models are finally fused together; according to rules or a machine learning model, the results of the multiple modalities are combined by a fusion decision into a single overall multi-modal emotion recognition result.
Since the same word carries different meanings, and thus expresses different emotional states, in different scenarios, the above method generalizes poorly. In addition, it requires large amounts of data to be collected, is costly, and, because it depends on manual operations, gives poorly controllable results.
Summary of the invention
Embodiments of the present invention provide an emotion recognition method, device, equipment and storage medium, which simplify the sample training process and improve the accuracy of the emotion recognition result.
In a first aspect, an embodiment of the present invention provides an emotion recognition method, the method comprising:
determining fused session features of multi-modal session information;
inputting the fused session features of the multi-modal session information into a pre-constructed multi-modal emotion recognition model to obtain emotional features of the multi-modal session information.
In a second aspect, an embodiment of the present invention further provides an emotion recognition apparatus, the apparatus comprising:
a fusion feature determining module, configured to determine fused session features of multi-modal session information;
an emotional feature determining module, configured to input the fused session features of the multi-modal session information into a pre-constructed multi-modal emotion recognition model to obtain emotional features of the multi-modal session information.
In a third aspect, an embodiment of the present invention further provides a device, the device comprising:
one or more processors; and
a storage apparatus configured to store one or more programs,
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement any emotion recognition method of the first aspect.
In a fourth aspect, an embodiment of the present invention further provides a storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements any emotion recognition method of the first aspect.
In the technical solution provided by the embodiments of the present invention, the session features of each modality in the multi-modal session information are merged to obtain fused session features, and the fused session features are input into a single unified multi-modal emotion recognition model for training, so that the final emotion result can be predicted directly, without training a separate recognition model for each modality and then fusing the results of the different models. The sample training process is thereby simplified, and the accuracy of the emotion recognition result is improved.
Brief description of the drawings
Fig. 1 is a schematic diagram of a prior-art multi-modal emotion recognition approach based on training each modality independently;
Fig. 2A is a flowchart of an emotion recognition method provided in Embodiment 1 of the present invention;
Fig. 2B is a schematic diagram of a learning model based on multi-modal feature fusion to which embodiments of the present invention are applicable;
Fig. 3 is a flowchart of an emotion recognition method provided in Embodiment 2 of the present invention;
Fig. 4 is a structural block diagram of an emotion recognition apparatus provided in Embodiment 3 of the present invention;
Fig. 5 is a structural schematic diagram of a device provided in Embodiment 4 of the present invention.
Detailed description of the embodiments
Embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described here serve only to explain the embodiments of the present invention and do not limit the invention. It should further be noted that, for ease of description, the accompanying drawings show only the parts related to the embodiments of the present invention rather than the entire structure.
Embodiment 1
Fig. 2A is a flowchart of an emotion recognition method provided by Embodiment 1 of the present invention, and Fig. 2B is a schematic diagram of a learning model based on multi-modal feature fusion to which embodiments of the present invention are applicable. This embodiment is suitable for cases where a user's emotion needs to be recognized accurately during a multi-modal interaction. The method may be executed by the emotion recognition apparatus provided by the embodiments of the present invention; the apparatus may be implemented in software and/or hardware, and may be integrated in a computing device. Referring to Figs. 2A and 2B, the method specifically comprises:
S210: determine fused session features of multi-modal session information.
Here, a modality is a term for a manner of interaction, and multi-modal refers to interacting through the integrated use of several means and symbol carriers such as text, images, video, speech and gestures. Correspondingly, multi-modal session information is session information that contains at least two modalities at the same time, for example session information that simultaneously contains the three modalities of speech, text and images.
Fused session features are obtained by merging the session features of the different modalities contained in one piece of session information. Optionally, a deep learning model may be used to determine the fused session features of the multi-modal session information while jointly considering the multiple modal features contained in one piece of session information.
S220: input the fused session features of the multi-modal session information into a pre-constructed multi-modal emotion recognition model to obtain the emotional features of the multi-modal session information.
Here, the multi-modal emotion recognition model is a model established on the basis of artificial intelligence technologies such as speech recognition, intelligent knowledge graphs and text recognition; specifically, it may be obtained in advance by training an initial machine learning model, such as a neural network model, on a sample data set. The emotional features are the multi-modal emotion recognition result and characterize an individual's state toward external things; they may include an emotion type and an emotion intensity. The emotion type may include happiness, anger, sadness, joy and the like, and the emotion intensity characterizes how strong a certain emotion is.
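For concreteness, such a recognition result could be represented as a simple record; the following Python sketch is purely illustrative (the field names and the [0.0, 1.0] intensity range are assumptions of this sketch, not part of the embodiment):

    from dataclasses import dataclass

    # Illustrative emotion types drawn from the description above.
    EMOTION_TYPES = ("happiness", "anger", "sadness", "joy")

    @dataclass
    class EmotionalFeatures:
        """Assumed output record of the multi-modal emotion recognition model."""
        emotion_type: str  # one of EMOTION_TYPES
        intensity: float   # how strong the emotion is, assumed in [0.0, 1.0]

    result = EmotionalFeatures(emotion_type="anger", intensity=0.8)
    print(result)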
Exemplarily, before the fused session features of the multi-modal session information are input into the pre-constructed multi-modal emotion recognition model, the method may further comprise: training an initial machine learning model according to the fused session features of multi-modal session sample information and the emotional features of the multi-modal session sample information, to obtain the multi-modal emotion recognition model.
Specifically, by continuously accumulating session information under various scenarios during interaction, a large number of fused session features of multi-modal session sample information, together with the emotional features of the corresponding multi-modal session sample information, are obtained as a training sample set; each sample is input into a neural network, and after training the multi-modal emotion recognition model is obtained. When the fused session features of a piece of multi-modal session information are input into the multi-modal emotion recognition model, the model can evaluate the input fused session features in combination with its learned parameters and output the corresponding emotional features.
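As a minimal sketch of this training stage, assuming PyTorch and assuming the fused session features arrive as fixed-length vectors paired with annotated emotion-type labels (the dimensions, class count and network shape below are illustrative, not prescribed by the embodiment):

    import torch
    import torch.nn as nn

    FUSED_DIM = 512      # assumed length of a fused session-feature vector
    NUM_EMOTIONS = 4     # e.g. happiness, anger, sadness, joy

    # The single unified multi-modal emotion recognition model: a small
    # feed-forward classifier trained directly on fused session features.
    model = nn.Sequential(
        nn.Linear(FUSED_DIM, 256),
        nn.ReLU(),
        nn.Linear(256, NUM_EMOTIONS),
    )
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    def train_step(fused_batch: torch.Tensor, labels: torch.Tensor) -> float:
        """One gradient step on a batch of (fused features, emotion labels)."""
        optimizer.zero_grad()
        loss = loss_fn(model(fused_batch), labels)
        loss.backward()
        optimizer.step()
        return loss.item()

    # Stand-in random batch; real batches would come from the accumulated,
    # jointly annotated multi-modal session samples described above.
    print(train_step(torch.randn(32, FUSED_DIM),
                     torch.randint(0, NUM_EMOTIONS, (32,))))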
It should be noted that, because the prior art establishes a separate recognition model for each modality and weights the results of the individual models to obtain the final emotion result, it requires a large number of training samples; moreover, the model learned for a single modality may be of poor quality, which ultimately degrades the overall emotion recognition performance. In this embodiment, referring to Fig. 2B, the session features of the modalities in the multi-modal session information are merged directly into fused session features, and only the fused session features need to be input into one unified multi-modal emotion recognition model for training, after which the model can output the final emotional features; the number of training samples is thus greatly reduced compared with the prior art. In addition, because the multi-modal session features are fused, the multi-modal emotion recognition model can learn not only the feature information of each modality but also the feature relations between different modalities, avoiding the prior-art problem that a poorly learned single-modality model degrades the overall emotion recognition result.
This is illustrated with bimodal text-and-speech session information. Suppose a user says the sentence "I just want to buy the Apple X now, I just want that one". If, as in the existing technology, the text modality information and the speech modality information are considered separately, the sentence cannot be confidently labeled as a negative emotion, which ultimately makes the emotion recognition result inaccurate. With the technical solution of this embodiment, however, the information of the user's speech modality is considered together with the text modality information; for example, if the user's voice fluctuates violently while saying this sentence, then by fusing the "text" + "speech" bimodal features, the emotion can finally be accurately recognized as negative.
Furthermore, it should be emphasized that the emotional features of the multi-modal session sample information used in this embodiment are annotated on the multi-modal session information while jointly considering all modalities. This ensures that the annotated emotional state is unambiguous and builds a more accurate data set for the subsequent model training, making the finally obtained multi-modal emotion recognition model more accurate. The prior art, by contrast, annotates each modality independently; because a modality is annotated in isolation, the emotional features of a sentence may not be annotated correctly, so the recognition accuracy of each modality's emotion model is poor, and the subsequent result-fusion stage ultimately suffers.
In the technical solution provided by the embodiments of the present invention, the session features of each modality in the multi-modal session information are merged to obtain fused session features, and the fused session features are input into a single unified multi-modal emotion recognition model for training, so that the final emotion result can be predicted directly, without training a separate recognition model for each modality and then fusing the results of the different models. The sample training process is thereby simplified, and the accuracy of the emotion recognition result is improved.
Embodiment 2
Fig. 3 is a flowchart of an emotion recognition method provided by Embodiment 2 of the present invention. On the basis of Embodiment 1 above, this embodiment further refines the determination of the fused session features of the multi-modal session information. Referring to Fig. 3, the method specifically comprises:
S310: respectively determine vector representations of at least two modalities of session information among speech session information, text session information and image session information.
Exemplarily, the multi-modal session information may include speech session information, text session information and image session information. The vector representation of session information refers to the expression of the session information in a vector space and can be obtained by modeling.
Specifically, characteristic parameters that characterize emotional change may be extracted from the speech session information; the text session information may be segmented into sentences and words to extract keywords; and effective dynamic or static expression features may be extracted from the image session information. These are input into a vector extraction model to obtain the vector representations of the speech session information, the image session information and the text session information. The vector extraction model may be a single collective model that converts speech features, text keywords, image features and the like into corresponding vector representations, or it may be composed of separate sub-models.
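The per-modality extraction might look like the following sketch; the three extractors are placeholders standing in for real acoustic-feature, keyword and expression-feature models (every function and statistic here is an assumption of the sketch, not the embodiment's actual extraction):

    import numpy as np

    def speech_vector(waveform: np.ndarray) -> np.ndarray:
        """Placeholder acoustic features characterizing emotional change:
        mean energy, energy spread and average frame-to-frame fluctuation."""
        return np.array([waveform.mean(), waveform.std(),
                         np.abs(np.diff(waveform)).mean()])

    def text_vector(sentence: str, vocab: dict) -> np.ndarray:
        """Placeholder text features: a bag-of-keywords vector after word
        cutting (here, naive whitespace splitting)."""
        vec = np.zeros(len(vocab))
        for word in sentence.split():
            if word in vocab:
                vec[vocab[word]] += 1.0
        return vec

    def face_vector(frame: np.ndarray) -> np.ndarray:
        """Placeholder static expression features: coarse intensity
        statistics of a face image."""
        return np.array([frame.mean(), frame.std()])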
S320: fuse the vector representations of the at least two modalities of session information to obtain a vector representation of the fused session features of the multi-modal session information.
Specifically, the vector representations of the modalities of session information may be directly spliced, according to certain rules, into one long unified vector that serves as the vector representation of the fused session features of the multi-modal session information, thereby fusing the vector representations of the multiple modalities of session information. Alternatively, the vector representations of the key information parts within each modality's vector representation may be extracted and spliced to obtain the vector representation of the fused session features of the multi-modal session information.
Exemplarily, fusing the vector representations of the at least two modalities of session information may include: sequentially concatenating the vector representations of the at least two modalities of session information according to a preset modality order.
Here, the preset modality order may be a pre-set ordering of the modality inputs and can be modified according to the actual situation; for example, a modality may be added, deleted or inserted, so that the ordering of the modality inputs can be adjusted dynamically.
Specifically, after the vector representations of the modalities of session information corresponding to the input multi-modal session information have been determined, the vector representations of the modalities are directly connected according to the input order of the modalities, thereby fusing the vector representations of the multiple modalities of session information.
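A minimal sketch of this sequential concatenation, assuming each modality's vector is a NumPy array and assuming the order list below (both are illustrative):

    import numpy as np

    # Assumed preset modality order; modalities can be added, deleted or
    # reordered here to adjust the input ordering dynamically.
    MODALITY_ORDER = ["speech", "text", "image"]

    def fuse_by_order(vectors: dict) -> np.ndarray:
        """Concatenate per-modality vectors in the preset order, skipping
        modalities absent from this piece of session information."""
        return np.concatenate([vectors[m] for m in MODALITY_ORDER if m in vectors])

    fused = fuse_by_order({
        "speech": np.array([0.2, 0.7]),
        "text": np.array([1.0, 0.0, 1.0]),
    })
    print(fused)  # [0.2 0.7 1.  0.  1. ]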
Exemplarily, fusing the vector representations of the at least two modalities of session information may also include: respectively extracting nonlinear features of the vector representations of the at least two modalities of session information, and fusing the extracted nonlinear features of the at least two modalities of session information.
Here, the nonlinear features of a vector representation characterize the distinctive part of a vector and may be the non-zero parts of the vector representation. The nonlinear features of the vector representation of one modality of session information refer to the vector representation of the words in that modality from which an emotion can be identified. For example, if the vector representation of one modality of session information is [0, 1, 1, 0, 0], the nonlinear features of that vector representation may be [1, 1].
Specifically, referring to Fig. 2B, in the multi-modal feature fusion layer, the vector representation of each modality of session information may be input into a deep learning model and first passed through one fully connected layer (FCL) operation to extract the nonlinear features of that modality's vector representation, yielding the corresponding hidden-layer vector; the output hidden-layer vectors are then spliced together, fusing the vector representations of the multiple modalities of session information.
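A sketch of such a fusion layer in PyTorch (the hidden size and the ReLU nonlinearity are assumptions of the sketch; the embodiment only specifies one fully connected operation per modality followed by splicing):

    import torch
    import torch.nn as nn

    class MultiModalFusionLayer(nn.Module):
        """One fully connected layer per modality extracts nonlinear features;
        the resulting hidden-layer vectors are spliced into one fused vector."""

        def __init__(self, input_dims: dict, hidden_dim: int = 64):
            super().__init__()
            self.fcls = nn.ModuleDict(
                {name: nn.Linear(dim, hidden_dim) for name, dim in input_dims.items()}
            )

        def forward(self, inputs: dict) -> torch.Tensor:
            # Extract each modality's nonlinear features, then splice the
            # hidden-layer vectors together in a fixed modality order.
            hidden = [torch.relu(self.fcls[name](inputs[name])) for name in self.fcls]
            return torch.cat(hidden, dim=-1)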
S330: input the fused session features of the multi-modal session information into the pre-constructed multi-modal emotion recognition model to obtain the emotional features of the multi-modal session information.
Specifically, the vector representation of the fused session features of the multi-modal session information is input into the pre-constructed multi-modal emotion recognition model; the model can evaluate the input fused session features in combination with its learned parameters and output the corresponding emotional features.
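Putting S320 and S330 together, inference might look like the following continuation of the fusion-layer sketch above (the classifier stands in for the pre-constructed multi-modal emotion recognition model; all dimensions and modality names are assumed):

    import torch
    import torch.nn as nn

    # Reuses MultiModalFusionLayer from the sketch above.
    fusion = MultiModalFusionLayer({"speech": 3, "text": 100, "image": 2}, hidden_dim=64)
    classifier = nn.Sequential(nn.Linear(3 * 64, 256), nn.ReLU(), nn.Linear(256, 4))

    inputs = {
        "speech": torch.randn(1, 3),    # acoustic feature vector
        "text": torch.randn(1, 100),    # keyword feature vector
        "image": torch.randn(1, 2),     # expression feature vector
    }
    fused = fusion(inputs)                     # vector of the fused session features
    logits = classifier(fused)                 # scores over emotion types
    predicted_emotion = logits.argmax(dim=-1)  # index of the recognized emotion type
    print(predicted_emotion)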
In the technical solution provided by the embodiments of the present invention, the vector representations of the modalities of session information in the multi-modal session information are fused to obtain the vector representation of the fused session features of the multi-modal session information, and this vector representation is input into a single unified multi-modal emotion recognition model for training, so that the final emotion result can be predicted directly, without training a separate recognition model for each modality and then fusing the results of the different models. The sample training process is thereby simplified, and the accuracy of the emotion recognition result is improved.
Embodiment 3
Fig. 4 is a structural block diagram of an emotion recognition apparatus provided by Embodiment 3 of the present invention. The apparatus can execute the emotion recognition method provided by any embodiment of the present invention and has the functional modules and beneficial effects corresponding to the executed method. As shown in Fig. 4, the apparatus may comprise:
a fusion feature determining module 410, configured to determine fused session features of multi-modal session information; and
an emotional feature determining module 420, configured to input the fused session features of the multi-modal session information into a pre-constructed multi-modal emotion recognition model to obtain the emotional features of the multi-modal session information.
In the technical solution provided by the embodiments of the present invention, the session features of each modality in the multi-modal session information are merged to obtain fused session features, and the fused session features are input into a single unified multi-modal emotion recognition model for training, so that the final emotion result can be predicted directly, without training a separate recognition model for each modality and then fusing the results of the different models. The sample training process is thereby simplified, and the accuracy of the emotion recognition result is improved.
Exemplarily, the fusion feature determining module 410 may include:
a multi-modal vector determining unit, configured to respectively determine vector representations of at least two modalities of session information among speech session information, text session information and image session information; and
a fused vector determining unit, configured to fuse the vector representations of the at least two modalities of session information to obtain a vector representation of the fused session features of the multi-modal session information.
Optionally, the fused vector determining unit is specifically configured to:
sequentially concatenate the vector representations of the at least two modalities of session information according to a preset modality order.
Optionally, the fused vector determining unit is further configured to:
respectively extract the nonlinear features of the vector representations of the at least two modalities of session information, and fuse the extracted nonlinear features of the at least two modalities of session information.
Exemplarily, the above apparatus may further comprise:
a recognition model determining module, configured to train an initial machine learning model according to the fused session features of multi-modal session sample information and the emotional features of the multi-modal session sample information, to obtain the multi-modal emotion recognition model.
Embodiment 4
Fig. 5 is a structural schematic diagram of a device provided by Embodiment 4 of the present invention; it shows a block diagram of an exemplary device suitable for implementing embodiments of the present invention. The device 12 shown in Fig. 5 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present invention. As shown in Fig. 5, the device 12 takes the form of a general-purpose computing device. The components of the device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 connecting the different system components (including the system memory 28 and the processing units 16).
The bus 18 represents one or more of several kinds of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus and the Peripheral Component Interconnect (PCI) bus.
The device 12 typically comprises a variety of computer-system-readable media. These media may be any available media that can be accessed by the device 12, including volatile and non-volatile media and removable and non-removable media.
The system memory 28 may include computer-system-readable media in the form of volatile memory, such as a random access memory (RAM) 30 and/or a cache memory 32. The device 12 may further include other removable/non-removable, volatile/non-volatile computer-system storage media. By way of example only, the storage system 34 may be used to read and write non-removable, non-volatile magnetic media (not shown in Fig. 5, commonly referred to as a "hard disk drive"). Although not shown in Fig. 5, a magnetic disk drive for reading and writing a removable non-volatile magnetic disk (such as a "floppy disk") and an optical disc drive for reading and writing a removable non-volatile optical disc (such as a CD-ROM, a DVD-ROM or another optical medium) may be provided. In these cases, each drive may be connected to the bus 18 through one or more data media interfaces. The system memory 28 may include at least one program product having a set of (for example, at least one) program modules configured to perform the functions of the embodiments of the present invention.
A program/utility 40 having a set of (at least one) program modules 42 may be stored, for example, in the system memory 28. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules and program data; each or some combination of these examples may include an implementation of a network environment. The program modules 42 generally execute the functions and/or methods of the described embodiments of the present invention.
The device 12 may also communicate with one or more external devices 14 (such as a keyboard, a pointing device and a display 24), with one or more devices that enable a user to interact with the device 12, and/or with any device (such as a network card or a modem) that enables the device 12 to communicate with one or more other computing devices. Such communication may take place through an input/output (I/O) interface 22. Moreover, the device 12 may communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN) and/or a public network such as the Internet) through a network adapter 20. As shown, the network adapter 20 communicates with the other modules of the device 12 through the bus 18. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
The processing unit 16 executes various functional applications and data processing by running the programs stored in the system memory 28, for example implementing the emotion recognition method provided by the embodiments of the present invention.
Embodiment 5
Embodiment 5 of the present invention further provides a computer-readable storage medium on which a computer program (also called computer-executable instructions) is stored; when the program is executed by a processor, the emotion recognition method described in any of the above embodiments can be implemented.
The computer storage medium of the embodiments of the present invention may adopt any combination of one or more computer-readable media. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but is not limited to, an electric, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by, or in connection with, an instruction execution system, apparatus or device.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take various forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium; such a medium can send, propagate or transmit a program for use by, or in connection with, an instruction execution system, apparatus or device.
The program code contained on a computer-readable medium may be transmitted over any appropriate medium, including but not limited to wireless, wire, optical cable, RF, or any suitable combination of the above.
Computer program code for carrying out the operations of the embodiments of the present invention may be written in one or more programming languages or a combination thereof. The programming languages include object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will appreciate that the invention is not limited to the specific embodiments described herein; various obvious changes, readjustments and substitutions can be made by those skilled in the art without departing from the protection scope of the present invention. Therefore, although the embodiments of the present invention have been described in further detail through the above embodiments, the embodiments of the present invention are not limited to the above embodiments, and may also include other equivalent embodiments without departing from the concept of the present invention; the scope of the present invention is determined by the scope of the appended claims.

Claims (12)

1. An emotion recognition method, characterized by comprising:
determining fused session features of multi-modal session information;
inputting the fused session features of the multi-modal session information into a pre-constructed multi-modal emotion recognition model to obtain emotional features of the multi-modal session information.
2. The method according to claim 1, characterized in that determining the fused session features of the multi-modal session information comprises:
respectively determining vector representations of at least two modalities of session information among speech session information, text session information and image session information;
fusing the vector representations of the at least two modalities of session information to obtain a vector representation of the fused session features of the multi-modal session information.
3. The method according to claim 2, characterized in that fusing the vector representations of the at least two modalities of session information comprises:
sequentially concatenating the vector representations of the at least two modalities of session information according to a preset modality order.
4. The method according to claim 2, characterized in that fusing the vector representations of the at least two modalities of session information comprises:
respectively extracting nonlinear features of the vector representations of the at least two modalities of session information;
fusing the extracted nonlinear features of the at least two modalities of session information.
5. The method according to claim 1, characterized in that, before inputting the fused session features of the multi-modal session information into the pre-constructed multi-modal emotion recognition model, the method further comprises:
training an initial machine learning model according to fused session features of multi-modal session sample information and emotional features of the multi-modal session sample information, to obtain the multi-modal emotion recognition model.
6. An emotion recognition apparatus, characterized by comprising:
a fusion feature determining module, configured to determine fused session features of multi-modal session information;
an emotional feature determining module, configured to input the fused session features of the multi-modal session information into a pre-constructed multi-modal emotion recognition model to obtain emotional features of the multi-modal session information.
7. The apparatus according to claim 6, characterized in that the fusion feature determining module comprises:
a multi-modal vector determining unit, configured to respectively determine vector representations of at least two modalities of session information among speech session information, text session information and image session information;
a fused vector determining unit, configured to fuse the vector representations of the at least two modalities of session information to obtain a vector representation of the fused session features of the multi-modal session information.
8. The apparatus according to claim 7, characterized in that the fused vector determining unit is specifically configured to:
sequentially concatenate the vector representations of the at least two modalities of session information according to a preset modality order.
9. The apparatus according to claim 7, characterized in that the fused vector determining unit is further configured to:
respectively extract nonlinear features of the vector representations of the at least two modalities of session information;
fuse the extracted nonlinear features of the at least two modalities of session information.
10. The apparatus according to claim 6, characterized by further comprising:
a recognition model determining module, configured to train an initial machine learning model according to fused session features of multi-modal session sample information and emotional features of the multi-modal session sample information, to obtain the multi-modal emotion recognition model.
11. A device, characterized in that the device comprises:
one or more processors; and
a storage apparatus configured to store one or more programs,
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the emotion recognition method according to any one of claims 1 to 5.
12. A storage medium having a computer program stored thereon, characterized in that the program, when executed by a processor, implements the emotion recognition method according to any one of claims 1 to 5.
CN201810694899.XA (priority date 2018-06-29, filing date 2018-06-29): Emotion recognition method, device, equipment and storage medium - Active - granted as CN108985358B (en)

Priority Applications (1)

CN201810694899.XA (priority date 2018-06-29, filing date 2018-06-29): Emotion recognition method, device, equipment and storage medium

Publications (2)

CN108985358A - published 2018-12-11
CN108985358B - published 2021-03-02 (grant)

Family ID: 64538992

Family Applications (1)

CN201810694899.XA (Active, granted as CN108985358B): Emotion recognition method, device, equipment and storage medium

Country Status (1)

CN: CN108985358B (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8781989B2 (en) * 2008-01-14 2014-07-15 Aptima, Inc. Method and system to predict a data value
CN102930298A (en) * 2012-09-02 2013-02-13 北京理工大学 Audio visual emotion recognition method based on multi-layer boosted HMM
CN104835507A (en) * 2015-03-30 2015-08-12 渤海大学 Serial-parallel combined multi-mode emotion information fusion and identification method
CN105427869A (en) * 2015-11-02 2016-03-23 北京大学 Session emotion autoanalysis method based on depth learning
CN106503805A (en) * 2016-11-14 2017-03-15 合肥工业大学 A kind of bimodal based on machine learning everybody talk with sentiment analysis system and method
CN107705807A (en) * 2017-08-24 2018-02-16 平安科技(深圳)有限公司 Voice quality detecting method, device, equipment and storage medium based on Emotion identification

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111681645A (en) * 2019-02-25 2020-09-18 北京嘀嘀无限科技发展有限公司 Emotion recognition model training method, emotion recognition device and electronic equipment
CN111681645B (en) * 2019-02-25 2023-03-31 北京嘀嘀无限科技发展有限公司 Emotion recognition model training method, emotion recognition device and electronic equipment
CN111816211A (en) * 2019-04-09 2020-10-23 Oppo广东移动通信有限公司 Emotion recognition method and device, storage medium and electronic equipment
CN110083716A (en) * 2019-05-07 2019-08-02 青海大学 Multi-modal affection computation method and system based on Tibetan language
CN110021308A (en) * 2019-05-16 2019-07-16 北京百度网讯科技有限公司 Voice mood recognition methods, device, computer equipment and storage medium
CN112347774A (en) * 2019-08-06 2021-02-09 北京搜狗科技发展有限公司 Model determination method and device for user emotion recognition
CN110390956A (en) * 2019-08-15 2019-10-29 龙马智芯(珠海横琴)科技有限公司 Emotion recognition network model, method and electronic equipment
CN110991427A (en) * 2019-12-25 2020-04-10 北京百度网讯科技有限公司 Emotion recognition method and device for video and computer equipment
CN113128284A (en) * 2019-12-31 2021-07-16 上海汽车集团股份有限公司 Multi-mode emotion recognition method and device
CN111507402A (en) * 2020-04-17 2020-08-07 北京声智科技有限公司 Method, device, medium and equipment for determining response mode
CN112148836A (en) * 2020-09-07 2020-12-29 北京字节跳动网络技术有限公司 Multi-modal information processing method, device, equipment and storage medium
CN112183022A (en) * 2020-09-25 2021-01-05 北京优全智汇信息技术有限公司 Loss assessment method and device
CN112233698A (en) * 2020-10-09 2021-01-15 中国平安人寿保险股份有限公司 Character emotion recognition method and device, terminal device and storage medium
CN112233698B (en) * 2020-10-09 2023-07-25 中国平安人寿保险股份有限公司 Character emotion recognition method, device, terminal equipment and storage medium
CN112418034A (en) * 2020-11-12 2021-02-26 元梦人文智能国际有限公司 Multi-modal emotion recognition method and device, electronic equipment and storage medium
CN114005468A (en) * 2021-09-07 2022-02-01 华院计算技术(上海)股份有限公司 Interpretable emotion recognition method and system based on global working space
WO2023226239A1 (en) * 2022-05-24 2023-11-30 网易(杭州)网络有限公司 Object emotion analysis method and apparatus and electronic device
CN115496226A (en) * 2022-09-29 2022-12-20 中国电信股份有限公司 Multi-modal emotion analysis method, device, equipment and storage based on gradient adjustment

Also Published As

CN108985358B (en) - published 2021-03-02

Similar Documents

Publication Publication Date Title
CN108985358A (en) Emotion identification method, apparatus, equipment and storage medium
JP7432556B2 (en) Methods, devices, equipment and media for man-machine interaction
Zhao et al. Exploring spatio-temporal representations by integrating attention-based bidirectional-LSTM-RNNs and FCNs for speech emotion recognition
CN107657017B (en) Method and apparatus for providing voice service
CN109003624B (en) Emotion recognition method and device, computer equipment and storage medium
CN107481720B (en) Explicit voiceprint recognition method and device
CN109036405A (en) Voice interactive method, device, equipment and storage medium
CN108922564B (en) Emotion recognition method and device, computer equipment and storage medium
CN114694076A (en) Multi-modal emotion analysis method based on multi-task learning and stacked cross-modal fusion
WO2020253509A1 (en) Situation- and emotion-oriented chinese speech synthesis method, device, and storage medium
US10956480B2 (en) System and method for generating dialogue graphs
CN111862977A (en) Voice conversation processing method and system
CN110262665A (en) Method and apparatus for output information
CN112527962A (en) Intelligent response method and device based on multi-mode fusion, machine readable medium and equipment
CN109034203A (en) Training, expression recommended method, device, equipment and the medium of expression recommended models
CN112765971B (en) Text-to-speech conversion method and device, electronic equipment and storage medium
CN112905772A (en) Semantic correlation analysis method and device and related products
CN116933051A (en) Multi-mode emotion recognition method and system for modal missing scene
Wu et al. Speaker personality recognition with multimodal explicit many2many interactions
CN108932943A (en) Command word sound detection method, device, equipment and storage medium
KR102226427B1 (en) Apparatus for determining title of user, system including the same, terminal and method for the same
CN116403601A (en) Emotion recognition model training method, emotion recognition device and storage medium
EP4064031A1 (en) Method and system for tracking in extended reality using voice commmand
US20200234181A1 (en) Implementing training of a machine learning model for embodied conversational agent
CN115640387A (en) Man-machine cooperation method and device based on multi-mode features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant