CN113238654A - Multi-modal based reactive response generation - Google Patents

Multi-modal based reactive response generation

Info

Publication number
CN113238654A
Authority
CN
China
Prior art keywords
text
multimodal
emotion
animation
generating
Prior art date
Legal status
Pending
Application number
CN202110545116.3A
Other languages
Chinese (zh)
Inventor
宋睿华
杜涛
Current Assignee
Individual
Original Assignee
Individual
Priority date
Filing date
Publication date
Application filed by Individual
Priority to CN202110545116.3A
Publication of CN113238654A
Priority to PCT/CN2022/093766 (published as WO2022242706A1)
Legal status: Pending (current)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 - Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/011 - Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 - Querying
    • G06F 16/332 - Query formulation
    • G06F 16/3329 - Natural language query formulation or dialogue systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 - Animation
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/26 - Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • User Interface Of Digital Computer (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The present disclosure provides methods, systems, and apparatus for multi-modal based reactive response generation. Multimodal input data may be obtained. At least one information element may be extracted from the multimodal input data. At least one reference information item may be generated based at least on the at least one information element. Multimodal output data may be generated using at least the at least one reference information item. The multimodal output data may be provided.

Description

Multi-modal based reactive response generation
Background
In recent years, intelligent human-computer interaction systems have been widely applied in an increasing number of scenarios and fields, effectively improving the efficiency of human-computer interaction and optimizing the interaction experience. With the development of Artificial Intelligence (AI) technology, human-computer interaction systems have also advanced in areas such as intelligent conversation systems. For example, intelligent conversation systems already cover application scenarios such as task-oriented dialogue, knowledge question answering, and open-domain conversation, and may be implemented with various techniques such as template-based, retrieval-based, and deep-learning-based approaches.
Disclosure of Invention
This summary is provided to introduce a selection of concepts that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Embodiments of the present disclosure present methods, systems, and apparatus for multi-modal based reactive response generation. Multimodal input data may be obtained. At least one information element may be extracted from the multimodal input data. At least one reference information item may be generated based at least on the at least one information element. Multimodal output data may be generated using at least the at least one reference information item. The multimodal output data may be provided.
It should be noted that one or more of the above aspects include features that are specifically pointed out in the following detailed description and claims. The following description and the annexed drawings set forth in detail certain illustrative features of the one or more aspects. These features are indicative of but a few of the various ways in which the principles of various aspects may be employed and the present disclosure is intended to include all such aspects and their equivalents.
Drawings
The disclosed aspects will hereinafter be described in conjunction with the appended drawings, which are provided to illustrate, but not to limit, the disclosed aspects.
Fig. 1 illustrates an exemplary architecture of a multi-modal based reactive response generation system, according to an embodiment.
FIG. 2 illustrates an exemplary process for multi-modal based reactive response generation, according to embodiments.
FIG. 3 illustrates an example of a smart animated character scene according to an embodiment.
FIG. 4 illustrates an exemplary process of a smart animated character scene according to an embodiment.
FIG. 5 illustrates an exemplary process of intelligent animation generation, according to an embodiment.
FIG. 6 illustrates a flow diagram of an exemplary method for multi-modal based reactive response generation, according to an embodiment.
Fig. 7 illustrates an exemplary apparatus for multi-modal based reactive response generation, according to an embodiment.
Fig. 8 illustrates an exemplary apparatus for multi-modal based reactive response generation, according to an embodiment.
Detailed Description
The present disclosure will now be discussed with reference to various exemplary embodiments. It is to be understood that the discussion of these embodiments is merely intended to enable those skilled in the art to better understand and thereby practice the embodiments of the present disclosure, and is not intended to suggest any limitation on the scope of the present disclosure.
A conventional human-computer interaction system usually uses a single medium as the channel for inputting and outputting information; for example, communication between a human and a machine, or between machines, is performed through only one of text, voice, gestures, etc. Taking the intelligent conversation system as an example, even where it is text-oriented or speech-oriented, it is still centered on text processing or text analysis. Such a system does not consider information about the interaction object beyond text, such as facial expressions and body movements, nor does it consider environmental factors such as sound and light, which leads to several common problems in the interaction process. One problem is that the understanding of information is not sufficiently comprehensive and accurate. In actual communication, people often use tone, facial expressions, body movements, etc. as important channels for expressing or transmitting information, rather than expressing all of their communication content through language text alone. For example, the same sentence spoken with different tones, or accompanied by different facial expressions and body movements, may convey entirely different semantics in different instances. Existing intelligent conversation techniques centered on text processing lack this information, which is important during interaction, so that extracting and applying context information in the conversation becomes very difficult. Another problem is that the expression of information is not sufficiently vivid. Existing intelligent conversation techniques express information mainly through text, and may also convert output text into speech where speech recognition and speech synthesis are supported. However, such information transfer channels remain limited and cannot comprehensively and accurately express intent through language, facial expressions, body movements, etc. as humans do, making it difficult to exhibit vivid and lively anthropomorphic expression. A further problem is that existing intelligent conversation techniques are limited to responding to received conversation messages and are unable to react spontaneously to various environmental factors. For example, existing chat bots only focus on responding to conversation messages from users so as to chat around those messages.
Embodiments of the present disclosure propose a multi-modal based reactive response generation scheme that can be implemented on a variety of intelligent conversation agents and can be widely applied in a variety of scenarios including human-machine interaction. In this context, an intelligent conversation agent may broadly refer to an AI product form capable of generating and presenting information content, providing interactive functions, and the like in a specific application scenario, for example, a chat bot, an intelligent animated character, a virtual anchor, an intelligent car assistant, an intelligent customer service agent, a smart speaker, and the like. In accordance with embodiments of the present disclosure, the intelligent conversation agent can generate multimodal output data based on multimodal input data, wherein the multimodal output data is a response generated in a reactive manner to be presented to the user.
The way in which people communicate naturally is often multimodal. When people communicate with each other, they often comprehensively consider various types of information from the communication partner, such as voice, text, facial expressions, and body movements, while also taking into account the scene, light, sound, and even temperature, humidity, and other information of the environment in which they are located. By comprehensively considering such multi-modal information, a human can more comprehensively, accurately, and quickly understand what the communication partner intends to express. Similarly, when expressing information, humans tend to use multi-modal expressions such as voice, facial expressions, and body movements to convey their intentions more accurately, vividly, and comprehensively.
Based on these observations about human communication, a natural human-computer interaction scheme should also be multimodal. Therefore, embodiments of the present disclosure provide a multi-modal based human-machine interaction approach. In this context, interaction may refer broadly to, for example, the understanding and expression of information, data, content, etc., while human-machine interaction may refer broadly to interaction between an intelligent conversation agent and an interaction object, for example, interaction between an intelligent conversation agent and a human user, interaction among intelligent conversation agents, responses of an intelligent conversation agent to various media content or informational data, and so forth. Compared with existing single-medium-based interaction approaches, the embodiments of the present disclosure have multiple advantages. In one aspect, more accurate information understanding can be achieved. By comprehensively processing multimodal input data including, for example, media content, captured images or audio, chat sessions, external environment data, etc., information can be more comprehensively collected and analyzed, misunderstandings due to information loss are reduced, and thus the deep-level intent of an interaction object is more accurately understood. In another aspect, the expression is more efficient. Information and emotion can be expressed more efficiently by superimposing information from multiple modalities in a variety of ways, such as superimposing the facial expressions and/or body movements or other animation sequences of an avatar on top of speech or text. In a further aspect, the interactive behavior of the intelligent conversation agent becomes more lively. Understanding and expressing multimodal data makes the intelligent conversation agent more anthropomorphic, thereby significantly improving the user experience.
Furthermore, embodiments of the present disclosure may cause the intelligent conversation agent to mimic how a human reacts naturally, i.e., to respond reactively, to multimodal input data such as speech, text, music, video images, and the like. In this context, the reactive response of the intelligent conversation agent is not limited to a reaction to a chat message from, for example, a user, but may also encompass spontaneous reactions to various input data, such as media content, captured images or audio, the external environment, and the like. Taking as an example a scenario in which the intelligent conversation agent serves as an intelligent animated character to provide AI companionship, and assuming that the intelligent conversation agent can accompany a user watching a video through a corresponding avatar, the intelligent conversation agent can not only interact directly with the user but can also respond spontaneously to the content in the video, for example, the avatar can speak, make facial expressions, make body movements, present text, and so on. Thus, the behavior of the intelligent conversation agent will be more humanlike.
Embodiments of the present disclosure provide a general multi-modal based reactive response generation technique, and an intelligent conversation agent can efficiently and quickly obtain multi-modal interaction capability by integrating and applying a multi-modal based reactive response generation system. Through the multi-modal based reactive response generation technique according to embodiments of the present disclosure, multi-modal input data from various media channels can be processed in an integrated manner, and the intent expressed by the multi-modal input data can be interpreted more accurately and effectively. In addition, through this technique, the intelligent conversation agent can provide multi-modal output data via various channels to express overall consistent information, thereby improving the accuracy and efficiency of information expression, making the information expression of the intelligent conversation agent more vivid and interesting, and significantly improving the user experience.
The multi-modal based reactive response generation technique according to embodiments of the present disclosure may be adaptively applied in a variety of scenarios. Based on the input and output capabilities supported by different scenarios, embodiments of the present disclosure can obtain corresponding multimodal input data in different scenarios and output multimodal output data that is appropriate for the particular scenario. Taking the example of a scenario in which an animation is automatically generated for an intelligent conversational agent acting as an intelligent animated character, embodiments of the present disclosure may generate a reactive response including, for example, an animation sequence, for an avatar of the intelligent animated character. For example, in the case where the smart animated character is applied to accompany a user watching a video, the smart animated character can comprehensively process multi-modal input data from video content, captured images or audio, chat sessions, external environment data, etc., perform deep perception and understanding on the multi-modal input data, and accordingly make reasonable reactions through various modalities such as voice, text, animation sequences including facial expressions and/or body movements in an intelligent and dynamic manner, thereby realizing a comprehensive, efficient, and vivid human-computer interaction experience. The perception capability and emotion expression capability of the intelligent animated character are greatly enhanced, and the intelligent animated character becomes more anthropomorphic. This can also become the technical basis for e.g. intelligent animation content authoring by AI techniques.
The application of the embodiments of the present disclosure to the intelligent animated character scenario is merely an example; the embodiments can also be applied in various other scenarios. For example, in a scenario where the intelligent conversation agent is a chat bot that can chat with a user in forms such as voice, text, and video, the multimodal input data processed by embodiments of the present disclosure can include, for example, chat sessions, captured images or audio, ambient environment data, etc., and the multimodal output data provided can include, for example, voice, text, animation sequences, etc. For example, in a scenario where the intelligent conversation agent is a virtual anchor, which may have a corresponding avatar and play and explain predetermined media content to a plurality of users, the multimodal input data may include, for example, the played media content, ambient environment data, etc., and the multimodal output data may include, for example, speech, text, animation sequences of the avatar, etc. For example, in a scenario where the intelligent conversation agent is an intelligent car assistant that can provide assistance or companionship while the user drives a vehicle (e.g., an automobile), the multimodal input data can include, for example, chat sessions, captured images or audio, ambient environment data, etc., and the multimodal output data can include, for example, speech, text, etc. For example, in a scenario where the intelligent conversation agent is an intelligent customer service agent that can provide interactions such as problem resolution and product information to customers, the multimodal input data can include, for example, chat sessions, ambient environment data, etc., and the multimodal output data can include, for example, speech, text, animation, etc. For example, in a scenario where the intelligent conversation agent is a smart speaker in which a voice assistant or chat bot may interact with a user, play audio content, etc., the multimodal input data may include, for example, the played audio content, chat sessions, captured audio, ambient environment data, etc., and the multimodal output data may include, for example, speech. It should be understood that embodiments of the present disclosure may be applied to any other scenario in addition to the exemplary scenarios described above.
Fig. 1 illustrates an exemplary architecture of a multi-modal based reactive response generation system 100, according to an embodiment. The system 100 can enable an intelligent conversation agent to make multimodal based reactive responses in different scenarios. The intelligent conversation agent may be implemented on, or reside on, a terminal device or any device or platform accessible to a user.
The system 100 can include a multimodal data input interface 110 for obtaining multimodal input data. The multimodal data input interface 110 can collect multiple types of input data from multiple data sources. For example, in the case of playing target content to a user, the multimodal data input interface 110 can collect data such as images, audio, bullet-screen files, etc. of the target content. In this context, target content may broadly refer to various media content that is played or presented to a user on a device, such as video content, audio content, pictorial content, textual content, and so forth. For example, where the intelligent conversation agent can chat with the user, the multimodal data input interface 110 can obtain input data regarding the chat session. For example, the multimodal data input interface 110 can capture images and/or audio around the user through a camera and/or microphone on the terminal device. The multimodal data input interface 110 can also obtain ambient environment data from, for example, third-party applications or any other information source. In this context, ambient environment data may broadly refer to various environmental parameters of the real world in which the terminal device or user is located, e.g., data about weather, temperature, humidity, travel speed, etc.
The multimodal data input interface 110 can provide the obtained multimodal input data 112 to a core processing unit 120 in the system 100. The core processing unit 120 provides various core processing capabilities required for reactive response generation. Based on the processing stage and type, the core processing unit 120 may further include a plurality of processing modules, for example, a data integration processing module 130, a scene logic processing module 140, a multimodal output data generation module 150, and the like.
The data integration processing module 130 can extract information of different types of multimodalities from the multimodal input data 112, and the extracted information of the multimodalities can be in the same context under specific scene and time sequence conditions. In one implementation, the data integration processing module 130 may extract one or more information elements 132 from the multimodal input data 112. In this context, an information element may broadly refer to computer-understandable information or a representation of information extracted from raw data. In one aspect, the data integration processing module 130 may extract information elements from the target content included in the multi-modal input data 112, for example, from images, audio, bullet screen files, etc. of the target content. Illustratively, the information elements extracted from the image of the target content may include, for example, character features, text, image light, objects, etc., the information elements extracted from the audio of the target content may include, for example, music, voice, etc., and the information elements extracted from the bullet-screen file of the target content may include, for example, bullet-screen text, etc. In this context, music may broadly refer to singing of a song, instrumental performance, or a combination thereof, and speech may broadly refer to the sound of speech. In one aspect, the data integration processing module 130 can extract information elements, such as message text, from the chat session included with the multimodal input data 112. In one aspect, the data integration processing module 130 may extract information elements, such as object features, from the acquired images comprised by the multimodal input data 112. In one aspect, the data integration processing module 130 may extract information elements such as speech, music, etc. from the captured audio included in the multimodal input data 112. In one aspect, the data integration processing module 130 may extract information elements, such as ambient information, from the ambient data included in the multimodal input data 112.
The scene logic processing module 140 may generate one or more reference information items 142 based at least on the information elements 132. In this context, a reference information item can broadly refer to guidance information generated from various pieces of information for reference by the system 100 in producing multimodal output data. In one aspect, the reference information items 142 can include an emotion tag, which can indicate the emotion with which the multimodal output data is to be presented or on which it is to be based. In another aspect, the reference information items 142 can include an animation tag, which can be used to select an animation to be presented in the event that the multimodal output data is to include an animation sequence. In a further aspect, the reference information items 142 may include comment text, which may be a comment on, for example, the target content, so as to express the intelligent conversation agent's own opinion or evaluation of the target content. In yet another aspect, the reference information items 142 can include chat response text, which can be a response to message text from a chat session. It should be understood that the scene logic processing module 140 may optionally also consider other factors in generating the reference information items 142, such as scene-specific emotions, a preset personality of the intelligent conversation agent, a preset role of the intelligent conversation agent, and the like.
The multimodal output data generation module 150 can utilize at least the reference information items 142 to produce multimodal output data 152. The multimodal output data 152 can include various types of output data, such as speech, text, animation sequences, and the like. The speech included in the multimodal output data 152 may be, for example, speech corresponding to the comment text or the chat response text; the text included in the multimodal output data 152 may be, for example, text corresponding to the comment text or the chat response text; and the animation sequence included in the multimodal output data 152 may be, for example, an animation sequence of the avatar of the intelligent conversation agent. It should be appreciated that the multimodal output data generation module 150 may also optionally consider other factors in generating the multimodal output data 152, such as scene-specific requirements.
The system 100 can include a multimodal data output interface 160 for providing multimodal output data 152. The multimodal data output interface 160 can support the provision or presentation of multiple types of output data to a user. For example, the multimodal data output interface 160 can present text, animated sequences, etc. via a display screen, and can play voice, etc. via a speaker.
It should be understood that the above-described architecture of the multi-modal based reactive response generating system 100 is merely exemplary, and that the system 100 may include more or fewer component units or modules depending on actual application requirements and design. Further, it should be understood that system 100 may be implemented in hardware, software, or a combination thereof. For example, in one case, the multimodal data input interface 110, the core processing unit 120, and the multimodal data output interface 160 can be hardware-based units, e.g., the core processing unit 120 can be implemented by a processor, controller, etc. having data processing capabilities, while the multimodal data input interface 110 and the multimodal data output interface 160 can be implemented by hardware interface units having data input/output capabilities. For example, in one case, the units or modules included in the system 100 may also be implemented by software or programs, so that the units or modules may be software units or software modules. Further, it should be understood that the units and modules comprised by the system 100 may be implemented at a terminal device, or may be implemented at a network device or platform, or may be implemented in part at a terminal device and in part at a network device or platform.
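For illustration only, the following sketch shows one way the three-stage flow of the system 100 could be wired together in code: a data input interface gathers raw multimodal input, the core processing modules 130/140/150 run in sequence, and the result is handed to the output interface. All class and field names here are assumptions introduced for the example; they are not part of the disclosure.

```python
# Minimal sketch of the three-stage flow in system 100 (illustrative names only):
# input interface -> core processing (130/140/150) -> output interface.
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class MultimodalInput:
    """Raw multimodal input data gathered by the data input interface 110."""
    target_content_frames: List[Any] = field(default_factory=list)  # images of target content
    target_content_audio: Any = None                                 # audio of target content
    bullet_screen_text: List[str] = field(default_factory=list)      # bullet-screen comments
    chat_messages: List[str] = field(default_factory=list)           # chat session text
    captured_image: Any = None                                        # camera frame
    captured_audio: Any = None                                        # microphone audio
    ambient: Dict[str, Any] = field(default_factory=dict)             # e.g. {"weather": "rainy"}


class ReactiveResponseSystem:
    """Illustrative wiring of modules 130/140/150 behind interfaces 110/160."""

    def __init__(self, data_integration, scene_logic, output_generator):
        self.data_integration = data_integration   # module 130
        self.scene_logic = scene_logic              # module 140
        self.output_generator = output_generator    # module 150

    def step(self, multimodal_input: MultimodalInput) -> Dict[str, Any]:
        # 130: extract time-aligned information elements from the raw input data.
        elements = self.data_integration.extract(multimodal_input)
        # 140: derive reference information items (emotion tag, animation tag, texts).
        reference_items = self.scene_logic.generate(elements)
        # 150: produce multimodal output data (speech, text, animation sequence).
        return self.output_generator.generate(elements, reference_items)
```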
Fig. 2 illustrates an exemplary process 200 for multi-modal based reactive response generation, according to an embodiment. The steps or processes in process 200 may be performed by corresponding units or modules in a multi-modal based reactive response generating system, such as in fig. 1.
At 210, multimodal input data 212 can be obtained. Illustratively, depending on the application scenario, the multimodal input data 212 can include, for example, at least one of an image of the target content, audio of the target content, a bullet-screen file of the target content, a chat session, a captured image, captured audio, ambient environment data, and the like. For example, in a scenario where target content is present, such as an intelligent animated character scenario or a virtual anchor scenario, data such as images, audio, and bullet-screen files of the target content may be obtained at 210. For example, in a scenario where the intelligent conversation agent supports chat functionality, data regarding a chat session, including chat records in the chat session, may be obtained at 210. For example, in a scenario where the terminal device implementing the intelligent conversation agent has a camera or a microphone, data such as images captured by the camera and audio captured by the microphone may be obtained at 210. For example, in a scenario where the intelligent conversation agent has the capability to obtain ambient environment data, various ambient environment data may be obtained at 210. It should be understood that the multimodal input data 212 is not limited to the exemplary input data described above.
At 220, one or more information elements 222 may be extracted from the multimodal input data 212. Depending on the particular input data included in the multimodal input data 212, corresponding information elements may be extracted from the input data, respectively.
In the case where the multimodal input data 212 includes an image of the target content, character features may be extracted from the image of the target content. Taking as an example the case where the target content is a concert video played on the terminal device, various character features of the singer, such as facial expressions, body movements, clothing colors, and the like, can be extracted from the images of the video. It should be understood that embodiments of the present disclosure are not limited to any particular character feature extraction technique.
Where the multimodal input data 212 includes an image of the target content, text can be recognized from the image of the target content. In one implementation, text may be recognized from an image through text recognition techniques such as Optical Character Recognition (OCR). Still taking the example where the target content is a concert video, some images in the video may contain music information, such as song titles, lyricists, composers, singers, performers, etc., which may therefore be obtained through text recognition. It should be understood that embodiments of the present disclosure are not limited to recognizing text through OCR technology, but may employ any other text recognition technology. Further, the text recognized from the image of the target content is not limited to music information, and may include any other text indicating information related to an event occurring in the image, for example, subtitles, lyrics, and the like.
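As a hedged illustration of this text recognition step, the sketch below recognizes overlaid text (e.g., song information or subtitles) from a single frame of the target content using the open-source pytesseract wrapper around Tesseract OCR; this library choice and the preprocessing steps are assumptions, since the disclosure does not prescribe any particular OCR technique.

```python
# Hypothetical OCR step: recognize overlaid text (e.g. song title, subtitles)
# from a frame of the target content. pytesseract is one possible backend;
# the disclosure does not mandate any specific OCR technique.
import cv2                 # pip install opencv-python
import pytesseract         # pip install pytesseract (requires the tesseract binary)


def recognize_text_from_frame(frame_path: str, lang: str = "eng") -> str:
    """Return the text recognized in a saved video frame (lang is an assumption)."""
    image = cv2.imread(frame_path)
    # Light preprocessing tends to help OCR on video frames with busy backgrounds.
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return pytesseract.image_to_string(binary, lang=lang).strip()
```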
Where the multimodal input data 212 includes an image of the target content, image light can be detected from the image of the target content. Image light may refer to the ambient lighting characteristics within the picture rendered by an image, such as bright, dim, gloomy, flickering, and the like. Still taking the example where the target content is a concert video, assuming that the singer is singing a cheerful song, the stage at the concert venue may employ bright lighting, and thus the image light detected from these images may be bright. It should be understood that embodiments of the present disclosure are not limited to any particular image light detection technique.
Where the multimodal input data 212 includes an image of the target content, objects can be identified from the image of the target content. The identified objects may be, for example, representative objects in the image, objects that appear in prominent or significant positions in the image, objects associated with a person in the image, and the like; e.g., the identified objects may include props, background furnishings, and so forth. Still taking the example where the target content is a concert video, assuming that the singer is singing a song while playing a guitar, the object "guitar" can be identified from the image. It should be understood that embodiments of the present disclosure are not limited to any particular object identification technique.
In the case where the multimodal input data 212 includes audio of the target content, music may be extracted from the audio of the target content. The target content itself may be audio, e.g., a song played to the user on the terminal device, from which the music corresponding to the song may be extracted accordingly. The target content may also be a video, such as a concert video, in which case music may be extracted from the audio contained in the video. In this context, music may broadly include, for example, music played on musical instruments, songs sung by singers, sound effects produced by dedicated equipment or voice actors, and so forth. The extracted music may be background music, foreground music, etc. Further, music extraction may broadly refer to, for example, obtaining a sound file, sound wave data, or the like corresponding to the music. It should be understood that embodiments of the present disclosure are not limited to any particular music extraction technique.
Where the multimodal input data 212 includes audio of the targeted content, speech can be extracted from the audio of the targeted content. In this context, speech may refer to the sound of speech. For example, when the target content includes a conversation, speech, comment, or the like of a person or character, the corresponding voice may be extracted from the audio of the target content. The voice extraction may broadly refer to, for example, obtaining a sound file corresponding to a voice, sound wave data, and the like. It should be understood that embodiments of the present disclosure are not limited to any particular speech extraction technique.
In the case where the multimodal input data 212 includes a bullet screen file of the target content, bullet screen text can be extracted from the bullet screen file of the target content. In some cases, some video playback applications or playback platforms support different viewers of a video to send their own comments, feelings, or the like in the form of a bullet screen, which may be included as bullet screen text in a bullet screen file attached to the video, and thus, the bullet screen text may be extracted from the bullet screen file. It should be understood that embodiments of the present disclosure are not limited to any particular barrage text extraction technique.
Where the multimodal input data 212 includes a chat session, message text can be extracted from the chat session. The message text may include, for example, text of a chat message sent by the intelligent conversation agent, text of a chat message sent by at least one other chat participant, and so forth. In the case where the chat session is performed in a text manner, the message text may be directly extracted from the chat session, and in the case where the chat session is performed in a voice manner, the voice message in the chat session may be converted into the message text through a voice recognition technology. It should be understood that embodiments of the present disclosure are not limited to any particular message text extraction technique.
Where the multimodal input data 212 includes captured images, object features may be extracted from the captured images. Object features may broadly refer to various features of an object appearing in an acquired image, which may include, for example, a person, an object, and so forth. For example, in the case of capturing an image of a computer user through a computer camera, various features about the user, such as facial expressions, limb movements, etc., may be extracted from the image. For example, in the case where an image in front of an automobile is captured by a camera mounted on the automobile, various features regarding, for example, a preceding vehicle, a traffic sign, a roadside building, and the like may be extracted from the image. It should be understood that embodiments of the present disclosure are not limited to extracting the above exemplary object features from the acquired image, but may also extract any other object features. In addition, embodiments of the present disclosure are not limited to any particular object feature extraction technique.
Where the multimodal input data 212 includes captured audio, speech and/or music may be extracted from the captured audio. Similarly to the manner of extracting speech, music, and the like from the audio of the target content described above, speech, music, and the like may be extracted from the captured audio.
Where the multimodal input data 212 includes ambient environment data, ambient environment information can be extracted from the ambient environment data. For example, specific weather information may be extracted from data regarding weather, specific temperature information may be extracted from data regarding temperature, specific speed information may be extracted from data regarding travel speed, and so forth. It should be understood that embodiments of the present disclosure are not limited to any particular ambient information extraction technique.
It should be appreciated that the information elements extracted from the multimodal input data 212 described above are exemplary, and that embodiments of the present disclosure may also extract any other type of information element. Furthermore, the extracted information elements may be in the same context under specific scenarios and timing conditions, e.g. the information elements may be time aligned, and accordingly different combinations of information elements may be extracted at different points in time.
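The following sketch illustrates one possible container for time-aligned information elements, grouping elements extracted from different modalities at nearby timestamps so that downstream modules see them in the same context. The field names and the one-second grouping window are assumptions for illustration.

```python
# Illustrative container for time-aligned information elements: elements extracted
# from different modalities at (or near) the same timestamp are grouped together
# so that downstream modules see them in the same context. Field names are assumptions.
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class InformationElements:
    timestamp: float                                                   # seconds into the target content
    character_features: Dict[str, str] = field(default_factory=dict)  # e.g. {"expression": "smiling"}
    recognized_text: Optional[str] = None                             # OCR result: subtitles, song info, ...
    image_light: Optional[str] = None                                 # e.g. "bright", "dim"
    objects: List[str] = field(default_factory=list)                  # e.g. ["guitar"]
    music: Optional[bytes] = None                                     # extracted music waveform
    speech: Optional[bytes] = None                                    # extracted speech waveform
    bullet_screen_text: List[str] = field(default_factory=list)
    message_text: List[str] = field(default_factory=list)
    ambient_info: Dict[str, str] = field(default_factory=dict)        # e.g. {"weather": "rainy"}


def align_by_time(elements: List[InformationElements], window: float = 1.0):
    """Group elements whose timestamps fall into the same window of `window` seconds."""
    buckets: Dict[int, List[InformationElements]] = {}
    for e in elements:
        buckets.setdefault(int(e.timestamp // window), []).append(e)
    return buckets
```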
At 230, one or more reference information items 232 may be generated based at least on the information element 222.
According to an embodiment of the disclosure, the reference information items 232 generated at 230 may include an emotion tag. The emotion tag may indicate, for example, an emotion type, an emotion level, and the like. Embodiments of the present disclosure may encompass any number of predetermined emotion types, as well as any number of emotion levels defined for each emotion type. Exemplary emotion types may include, for example, happiness, sadness, anger, etc., and exemplary emotion levels may include level 1, level 2, level 3, etc., from low to high emotional intensity. Accordingly, if the emotion tag <happy, level 2> is determined at 230, it indicates that the information elements 222 express a happy emotion as a whole and that the emotion level is a medium level of level 2. It should be understood that the exemplary emotion types, exemplary emotion levels, and their expressions are given above for ease of explanation only, and that any other emotion types, any greater or smaller number of emotion levels, and any other expressions may also be employed by embodiments of the present disclosure.
The emotion expressed by each information element may first be determined individually, and these emotions may then be considered together to determine the final emotion type and emotion level. For example, one or more emotion representations respectively corresponding to one or more of the information elements 222 may first be generated, and then a final emotion tag may be generated based at least on these emotion representations. In this context, an emotion representation may refer to an informational representation of emotion, which may take the form of, for example, an emotion vector, an emotion tag, and the like. The emotion vector may include a plurality of dimensions for representing an emotion distribution, each dimension corresponding to an emotion type, and the value in each dimension indicating a predicted probability or weight of the corresponding emotion type.
Where the information element 222 includes a character feature extracted from an image of the target content, an emotional representation corresponding to the character feature may be generated using, for example, a machine learning model trained in advance. Taking facial expressions in human features as an example, a convolutional neural network model, for example, for facial emotion recognition, may be employed to predict the corresponding emotional representations. Similarly, the convolutional neural network model may also be trained to predict the emotional representation in further combination with other features, such as limb movements, that may be included in the character features. It should be understood that embodiments of the present disclosure are not limited to any particular technique for determining an emotional representation corresponding to a character feature.
In the case where the information element 222 includes text recognized from an image of the target content, taking text that is music information as an example, emotion information corresponding to the music may be retrieved from a pre-established music database based on the music information, thereby forming an emotion representation. The music database may include music information for a large number of pieces of music collected in advance, along with corresponding emotion information, music genre, background knowledge, chat corpus, and the like. The music database may be indexed by various music information such as song title, singer, performer, etc., so that emotion information corresponding to a specific piece of music can be found in the music database based on the music information. Alternatively, the music genre found in the music database may also be used to form the emotion representation, since different music genres typically indicate different emotions. In addition, taking as an example the case where the recognized text is a subtitle of a person's speech in the image, an emotion representation corresponding to the subtitle can be generated using a machine learning model trained in advance. The machine learning model may be, for example, an emotion classification model based on a convolutional neural network. It should be appreciated that embodiments of the present disclosure are not limited to any particular technique for determining an emotion representation corresponding to text recognized from an image of the target content.
Where the information element 222 includes an object identified from an image of the target content, the emotion representation corresponding to the object may be determined based on a pre-established machine learning model or preset heuristic rules. In some cases, objects in the image may also help convey emotion. For example, if an image shows multiple red ornaments arranged on a stage to heighten the atmosphere, these red ornaments identified from the image may help to determine an emotion such as festive or happy. It should be appreciated that embodiments of the present disclosure are not limited to any particular technique for determining an emotion representation corresponding to an object identified from an image of the target content.
Where the information element 222 includes music extracted from the audio of the target content, the emotional representation corresponding to the music may be determined or generated in a variety of ways. In one approach, if music information has been identified, emotion information corresponding to the music may be found from a music database based on the music information, forming an emotion representation. In one approach, a previously trained machine learning model may be utilized to generate an emotional representation corresponding to the music based on a plurality of music features extracted from the music. The musical features may include an audio Average Energy (AE) of the music, expressed as
AE = (1/N) Σ_{t=0}^{N-1} x(t)^2
where x is the discrete audio input signal, t is time, and N is the number of samples of the input signal x. The music features may also include rhythm features extracted from the music in terms of the distribution of the number of beats and/or the beat intervals. Alternatively, the music features may include emotion information corresponding to the music obtained by using the music information. The machine learning model may be trained based on one or more of the music features described above, such that the trained machine learning model is capable of predicting an emotion representation for the music. It should be understood that embodiments of the present disclosure are not limited to any particular technique for determining an emotion representation corresponding to music extracted from the audio of the target content.
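As an illustration of the music features described above, the sketch below computes the average energy AE of a discrete audio signal exactly as defined, plus simple beat-count and beat-interval statistics as rhythm features; the use of librosa for beat tracking is an implementation assumption, not something specified by the disclosure.

```python
# Sketch of the music features mentioned above: the average energy AE of the
# discrete audio signal x, plus simple rhythm statistics from detected beats.
# librosa is used here only as one convenient implementation choice.
import numpy as np
import librosa   # pip install librosa


def average_energy(x: np.ndarray) -> float:
    """AE = (1/N) * sum_t x(t)^2 over the N samples of the discrete signal x."""
    n = len(x)
    return float(np.sum(x.astype(np.float64) ** 2) / n)


def rhythm_features(x: np.ndarray, sr: int) -> dict:
    """Beat count and beat-interval statistics as coarse rhythm features."""
    tempo, beat_frames = librosa.beat.beat_track(y=x, sr=sr)
    beat_times = librosa.frames_to_time(beat_frames, sr=sr)
    intervals = np.diff(beat_times) if len(beat_times) > 1 else np.array([0.0])
    return {
        "tempo_bpm": float(tempo),
        "num_beats": int(len(beat_times)),
        "mean_beat_interval": float(intervals.mean()),
        "std_beat_interval": float(intervals.std()),
    }
```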
Where the information element 222 includes speech extracted from the audio of the target content, a previously trained machine learning model may be utilized to generate an emotional representation corresponding to the speech. It should be appreciated that embodiments of the present disclosure are not limited to any particular technique for determining an emotional representation corresponding to speech extracted from audio of target content.
Where the information element 222 includes bullet-screen text extracted from a bullet-screen file of the target content, a previously trained machine learning model may be utilized to generate an emotion representation corresponding to the bullet-screen text. The machine learning model may be, for example, an emotion classification model based on a convolutional neural network, denoted CNN_sen. Suppose the words in the bullet-screen text are represented as [d_0, d_1, d_2, …]. The emotion vector corresponding to the bullet-screen text can then be predicted by the emotion classification model CNN_sen, expressed as [s_0, s_1, s_2, …] = CNN_sen([d_0, d_1, d_2, …]), where each element of the emotion vector [s_0, s_1, s_2, …] corresponds to one emotion category. It should be appreciated that embodiments of the present disclosure are not limited to any particular technique for determining an emotion representation corresponding to bullet-screen text extracted from a bullet-screen file of the target content.
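The following is a minimal sketch of an emotion classification model in the spirit of CNN_sen: a text CNN that maps a tokenized bullet-screen sentence [d_0, d_1, d_2, …] to an emotion vector [s_0, s_1, s_2, …] with one score per emotion category. The embedding size, kernel widths, and other architecture details are assumptions.

```python
# Minimal text-CNN emotion classifier in the spirit of CNN_sen. Architecture
# details (embedding size, kernel widths, pooling) are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextCNNSentiment(nn.Module):
    def __init__(self, vocab_size: int, num_emotions: int,
                 embed_dim: int = 128, num_filters: int = 64,
                 kernel_sizes=(2, 3, 4)):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes])
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_emotions)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) word indices [d0, d1, d2, ...];
        # sequences are assumed padded to at least the largest kernel size.
        x = self.embedding(token_ids).transpose(1, 2)            # (batch, embed_dim, seq_len)
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        logits = self.fc(torch.cat(pooled, dim=1))
        return F.softmax(logits, dim=1)                           # emotion vector [s0, s1, ...]
```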
Where information element 222 includes message text extracted from a chat session, a previously trained machine learning model can be utilized to generate an emotion representation corresponding to the message text. The machine learning model may be built in a similar manner to the machine learning model described above for generating emotion representations corresponding to barrage text. It should be appreciated that embodiments of the present disclosure are not limited to any particular technique for determining an emotion representation corresponding to message text extracted from a chat session.
Where information element 222 includes object features extracted from the captured image, a previously trained machine learning model may be utilized to generate an emotional representation corresponding to the object features. It should be understood that embodiments of the present disclosure are not limited to any particular technique for determining an emotional representation corresponding to object features extracted from an acquired image.
Where information element 222 includes speech and/or music extracted from the captured audio, an emotional representation corresponding to the speech and/or music may be generated. The emotional representations corresponding to the extracted speech and/or music from the captured audio may be generated in a manner similar to the determination of the emotional representations corresponding to the extracted speech and/or music from the audio of the target content described above. It should be understood that embodiments of the present disclosure are not limited to any particular technique for determining an emotional representation corresponding to speech and/or music extracted from captured audio.
In the case where the information element 222 includes ambient information extracted from the ambient data, the emotional representation corresponding to the ambient information may be determined based on a pre-established machine learning model or a pre-set heuristic rule. Taking the external environment information as "rainy" weather as an example, since people tend to express feelings of slight anxiety in rainy weather, an emotional representation corresponding to the feelings of anxiety can be determined from the external environment information. It should be understood that embodiments of the present disclosure are not limited to any particular technique for determining an emotional representation corresponding to ambient information extracted from ambient data.
After generating one or more emotion representations respectively corresponding to one or more of the information elements 222 as described above, a final emotion tag may be generated based at least on these emotion representations. The final emotion tag may be understood as indicating the overall emotion determined by comprehensively considering the various information elements. The emotion tag may be formed from multiple emotion representations in various ways. For example, in the case where an emotion vector is used as the emotion representation, a plurality of emotion representations may be superimposed to obtain a total emotion vector, and the emotion type and emotion level may be derived from the emotion distribution in the total emotion vector to form the final emotion tag. For example, where emotion tags are employed as the emotion representation, a final emotion tag may be calculated, selected, or determined from a plurality of emotion tags corresponding to a plurality of information elements based on a predetermined rule. It should be understood that embodiments of the present disclosure are not limited to any particular manner of generating an emotion tag based on multiple emotion representations.
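One possible way to superimpose per-element emotion vectors into a total emotion vector and derive a final <emotion type, emotion level> tag is sketched below; the weighting scheme, the level quantization, and the emotion type set are assumptions for illustration rather than the only approach the disclosure allows.

```python
# Superimpose per-element emotion vectors into a total emotion vector and derive
# a final <emotion type, emotion level> tag. Weights, level quantization, and the
# emotion type set are assumptions for illustration.
import numpy as np

EMOTION_TYPES = ["happy", "sad", "angry"]   # illustrative set of emotion types


def combine_emotions(emotion_vectors, weights=None):
    """Weighted superposition of emotion vectors, renormalized to a distribution."""
    vectors = np.asarray(emotion_vectors, dtype=float)            # (num_elements, num_types)
    weights = np.ones(len(vectors)) if weights is None else np.asarray(weights, float)
    total = (weights[:, None] * vectors).sum(axis=0)
    return total / max(total.sum(), 1e-8)


def to_emotion_tag(total_vector, levels=3):
    """Pick the dominant emotion type; map its probability mass to a 1..levels level."""
    idx = int(np.argmax(total_vector))
    level = min(levels, 1 + int(total_vector[idx] * levels))      # e.g. 0.55 -> level 2 of 3
    return EMOTION_TYPES[idx], level


# Example: facial expression, bullet-screen text, and music each contribute a vector.
tag = to_emotion_tag(combine_emotions([[0.7, 0.2, 0.1],
                                       [0.6, 0.3, 0.1],
                                       [0.8, 0.1, 0.1]]))
print(tag)   # ('happy', 3)
```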
It should be appreciated that although the above discussion involves first generating a plurality of emotion representations respectively corresponding to the plurality of information elements and then generating the emotion tag based on these emotion representations at 230, embodiments of the present disclosure may alternatively generate the emotion tag directly based on the plurality of information elements. For example, a machine learning model may be trained in advance to take information elements as input features and predict emotion tags accordingly. The trained model may then be used to generate an emotion tag directly based on the information elements 222.
According to an embodiment of the present disclosure, the reference information items 232 generated at 230 may include an animation tag. In the case where the multimodal output data is to include an animation sequence of the avatar of the intelligent conversation agent, the animation tag may be used to select the animation to be presented. The animation tag may indicate at least one of, or a combination of, for example, facial expression types, body action types, etc., of the avatar. Facial expressions may include, for example, smiling, laughing, blinking, talking, etc., and body movements may include, for example, turning left, waving hands, swaying the body, dancing movements, etc.
At least one information element 222 may be mapped to an animation tag according to predetermined rules. For example, a variety of animation tags may be predefined, and a number of mapping rules from a set of information elements to animation tags may be predefined, wherein a set of information elements may include one or more information elements. Thus, when a set of information elements including one or more information elements is given, a corresponding animation tag may be determined based on one information element or a combination of information elements in the set, with reference to the predefined mapping rules. An exemplary mapping rule is: when the character features extracted from the image of the target content indicate that the character is singing, and the bullet-screen text includes keywords such as "sounds great", "intoxicating", etc., these information elements may be mapped to animation tags such as "close both eyes", "sway body", etc., so that the avatar may exhibit behavior such as listening to the song as if mesmerized. Another exemplary mapping rule is: when the speech extracted from the audio of the target content indicates that people are quarreling, the bullet-screen text includes keywords such as "noisy", "don't want to hear", etc., and the message text extracted from the chat session includes keywords indicating the user's aversion, these information elements may be mapped to animation tags such as "cover ears with hands", "shake head", etc., so that the avatar may exhibit behavior such as not wanting to hear the quarrel. A further exemplary mapping rule is: when the image light detected from the image of the target content indicates rapid changes in brightness, the object recognized from the image of the target content is a guitar, and the music extracted from the audio of the target content is a fast-paced piece, these information elements may be mapped to animation tags such as "strum guitar", "fast-paced dance move", etc., so that the avatar may exhibit behavior such as playing along excitedly with the lively music. It should be understood that only a few exemplary mapping rules are listed above, and embodiments of the present disclosure may define a wide variety of other mapping rules.
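The exemplary mapping rules above can be expressed as a small rule table, as in the following sketch. The predicate fields, keyword lists, and animation tag names are assumptions chosen to mirror the examples; a real system could use any other rule representation.

```python
# Illustrative rule table mapping a set of information elements to animation tags,
# following the exemplary rules above. The element keys, keyword lists, and tag
# names are assumptions meant only to show the shape of such a rule engine.
def _keywords_present(texts, keywords):
    return any(k in t for t in texts for k in keywords)


ANIMATION_RULES = [
    # (predicate over a dict of information elements, resulting animation tags)
    (lambda e: e.get("character_action") == "singing"
               and _keywords_present(e.get("bullet_screen_text", []),
                                     ["sounds great", "intoxicating"]),
     ["close_both_eyes", "sway_body"]),
    (lambda e: e.get("speech_event") == "quarrel"
               and _keywords_present(e.get("bullet_screen_text", []),
                                     ["noisy", "don't want to hear"]),
     ["cover_ears_with_hands", "shake_head"]),
    (lambda e: e.get("image_light") == "rapidly_changing"
               and "guitar" in e.get("objects", [])
               and e.get("music_tempo") == "fast",
     ["strum_guitar", "fast_dance_move"]),
]


def select_animation_tags(elements: dict) -> list:
    """Return every animation tag whose rule predicate matches the given elements."""
    tags = []
    for predicate, animation_tags in ANIMATION_RULES:
        if predicate(elements):
            tags.extend(animation_tags)
    return tags
```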
Further, optionally, animation tags may also be generated further based on emotion tags. For example, emotion tags may be used with information elements to define mapping rules, such that a corresponding animation tag may be determined based on a combination of information elements and emotion tags. Alternatively, a direct mapping rule from emotion tag to animation tag may be defined, so that, after an emotion tag is generated, the defined mapping rule may be referenced to determine the corresponding animation tag based directly on the emotion tag. For example, a mapping rule from emotional tag < sad, level 2 > to animation tag "crying", "wiping tears with hands", etc. may be defined.
According to an embodiment of the present disclosure, the reference information items 232 generated at 230 may include comment text. The comment text may be a comment on, for example, the target content, so as to express the intelligent conversation agent's own opinion or evaluation of the target content. The comment text may be selected from the bullet-screen text of the target content. Illustratively, comment text may be selected from the bullet-screen text using a comment generation model constructed based on a two-tower model. The bullet-screen text of the target content may be temporally aligned with the image and/or audio of the target content, where temporally aligned may mean being at the same time or within the same time period. The bullet-screen text at a particular moment may include multiple sentences, which may be comments by different viewers on the image and/or audio of the target content at that moment or in an adjacent time period. At each moment, the comment generation model may select an appropriate sentence from the corresponding bullet-screen text as the comment text for the image and/or audio of the target content at that moment or in an adjacent time period. For example, a two-tower model may be utilized to determine the degree of matching between sentences in the bullet-screen text of the target content and the image and/or audio of the target content, and the sentence with the highest degree of matching may be selected from the bullet-screen text as the comment text. The comment generation model may include, for example, two two-tower models. For a sentence in the bullet-screen text, one two-tower model may output a first matching score based on the input image of the target content and the sentence, representing the degree of matching between the image and the sentence, while the other two-tower model may output a second matching score based on the input audio of the target content and the sentence, representing the degree of matching between the audio and the sentence. The first and second matching scores may be combined in any manner to arrive at a composite matching score for the sentence. After obtaining composite matching scores for a plurality of sentences of the bullet-screen text, the sentence with the highest matching score can be selected as the comment text for the current image and/or audio. It should be appreciated that the structure of the comment generation model described above is merely exemplary; the comment generation model may also include only one of the two two-tower models, or be based on any other model trained to determine the degree of matching between a sentence of the bullet-screen text and the image and/or audio of the target content.
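A hedged sketch of this comment-text selection is given below: two two-tower matchers score each candidate bullet-screen sentence against the current image features and audio features respectively, and the combined score picks the comment text. The projection layers, cosine-similarity scoring, and the equal weighting of the two scores are assumptions.

```python
# Sketch of comment-text selection with two two-tower matchers: one scores each
# bullet-screen sentence against the current image, the other against the current
# audio; the combined score picks the comment. Encoder internals are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoTowerMatcher(nn.Module):
    """Scores a (content embedding, sentence embedding) pair by cosine similarity."""

    def __init__(self, content_dim: int, text_dim: int, shared_dim: int = 256):
        super().__init__()
        self.content_proj = nn.Linear(content_dim, shared_dim)
        self.text_proj = nn.Linear(text_dim, shared_dim)

    def forward(self, content_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        # content_feat: (n, content_dim), text_feat: (n, text_dim)
        c = F.normalize(self.content_proj(content_feat), dim=-1)
        t = F.normalize(self.text_proj(text_feat), dim=-1)
        return (c * t).sum(dim=-1)                     # cosine similarity per pair


def select_comment(image_feat, audio_feat, sentence_feats, image_matcher, audio_matcher,
                   alpha: float = 0.5):
    """image_feat/audio_feat: 1-D feature tensors; sentence_feats: (n, text_dim).
    Returns the index of the bullet-screen sentence with the highest combined score."""
    n = sentence_feats.shape[0]
    img_scores = image_matcher(image_feat.expand(n, -1), sentence_feats)
    aud_scores = audio_matcher(audio_feat.expand(n, -1), sentence_feats)
    combined = alpha * img_scores + (1 - alpha) * aud_scores
    return int(torch.argmax(combined))
```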
In accordance with an embodiment of the present disclosure, if the intelligent conversation agent is chatting with at least one other chat participant in a chat session, the reference information item 232 generated at 230 may also include chat response text. The other chat participant may be, for example, a user, another intelligent conversation agent, etc. After message text from the other chat participant is obtained, a corresponding chat response text can be generated by a chat engine based at least on the message text.
In one implementation, any general chat engine may be employed to generate the chat response text.
In one implementation, the chat engine can generate chat response text based at least on the emotion tags. For example, the chat engine can be trained to generate chat response text based at least on the input message text and the emotion tags, such that the chat response text is generated at least under the influence of the emotion indicated by the emotion tags.
In one implementation, the intelligent conversation agent may exhibit an emotion continuation characteristic in the chat session, e.g., the response of the intelligent conversation agent is affected not only by the emotion of the currently received message text, but also by the emotional state that the intelligent conversation agent itself is currently in. As an example, assuming that the intelligent conversation agent is currently in a happy emotional state, although the received current message text may carry or cause a negative emotion such as anger, the intelligent conversation agent does not immediately give a response with an angry emotion because of the current message text, but may still maintain the happy emotion or only slightly lower the emotion level of the happy emotion. In contrast, existing chat engines typically determine the emotion type of a response only for the current turn of the conversation or only from the currently received message text, so that the emotion type of the response may change frequently with the received message text; this does not conform to the behavior of humans, who tend to stay in a relatively steady emotional state during chatting rather than changing emotional states frequently. An intelligent conversation agent with the emotion continuation characteristic in the chat session, as provided by embodiments of the present disclosure, is therefore more anthropomorphic. To implement the emotion continuation characteristic in the chat session, the chat engine can generate the chat response text based at least on an emotion representation from an emotion transfer network. The emotion transfer network is used to model dynamic emotion transitions, which can either maintain a steady emotional state or make appropriate adjustments or updates to the emotional state in response to the currently received message text. For example, the emotion transfer network may take as input a current emotion representation, which may be, for example, a vector representation of the current emotional state of the intelligent conversation agent, together with the currently received message text, and output an updated emotion representation. The updated emotion representation contains both information reflecting the previous emotional state and information about the emotional changes that may be caused by the current message text. The updated emotion representation can be further provided to the chat engine, so that the chat engine can generate the chat response text for the current message text under the influence of the received emotion representation.
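A minimal sketch of an emotion-continuation update is given below. It approximates the emotion transfer network's qualitative behavior with a fixed inertia blend between the previous state and the emotion evoked by the current message; this blending rule is an assumption for illustration only, since the actual network would be learned:

```python
import numpy as np

def update_emotion_state(
    prev_emotion: np.ndarray,     # current emotion representation of the agent
    message_emotion: np.ndarray,  # emotion evoked by the received message text
    inertia: float = 0.8,         # how strongly the previous state persists
) -> np.ndarray:
    """Blend the previous emotional state with the message emotion, then renormalize."""
    updated = inertia * prev_emotion + (1.0 - inertia) * message_emotion
    norm = np.linalg.norm(updated)
    return updated / norm if norm > 0 else updated
```

With a high inertia value, a single angry message only slightly lowers a happy state instead of flipping it, which is the continuation behavior described above.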
In one implementation, the chat engine may be trained to enable chatting about the target content, i.e., topics related to the target content may be discussed with another chat participant. Illustratively, the chat engine may be a retrieval-based chat engine built on chat content between people in, for example, forums related to the target content. The construction of the chat engine may include processing in a number of aspects. In one aspect, chat corpora of chat content between people can be crawled from forums related to the target content. In one aspect, a word vector model may be trained for finding possible names of each named entity. For example, word vector techniques can be utilized to find words related to each named entity, and then, optionally, the correct words are retained from the related words, e.g., by manual checking, as possible names of the named entity. In one aspect, keywords can be extracted from the chat corpora. For example, statistics may be computed on the word segmentation results of the related corpus and compared with the statistics of a non-related corpus, so as to find the words with larger term frequency-inverse document frequency (TF-IDF) differences as keywords. In one aspect, a deep retrieval model, which is the core network of the chat engine and may be based on, for example, a deep convolutional neural network, can be trained. The deep retrieval model may be trained using message-reply pairs in the chat corpora as training data. The text in a message-reply pair may include the original sentences or extracted keywords of the message and the reply. In one aspect, an intent detection model may be trained to detect which target content a received message text is specifically associated with, such that a forum associated with that target content may be selected from a plurality of forums. The intent detection model may be a binary classifier, which may specifically be, for example, a convolutional neural network text classification model. The positive examples for the intent detection model may come from chat corpora in forums related to the target content, while the negative examples may come from chat corpora in other forums or from plain text. Through one or more of the processes described above, and possibly any other processes, a retrieval-based chat engine may be constructed that provides chat response text, in response to an input message text, based on the corpus in the forum associated with the target content.
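As one illustration of the keyword extraction aspect, the following sketch keeps the tokens whose TF-IDF in the related corpus most exceeds their TF-IDF in a background corpus. The tokenization is assumed to have been done already (e.g., by a word segmentation step for Chinese text), and the exact TF-IDF variant and threshold are assumptions:

```python
import math
from collections import Counter
from typing import List

def tfidf_scores(docs: List[List[str]]) -> Counter:
    """Corpus-level TF-IDF score per token (term frequency times inverse document frequency)."""
    tf = Counter(tok for doc in docs for tok in doc)
    df = Counter(tok for doc in docs for tok in set(doc))
    n_docs = max(len(docs), 1)
    return Counter({tok: tf[tok] * math.log(n_docs / (1 + df[tok])) for tok in tf})

def extract_keywords(
    related_docs: List[List[str]],      # word-segmented sentences from the related forum
    background_docs: List[List[str]],   # word-segmented sentences from a non-related corpus
    top_k: int = 50,
) -> List[str]:
    """Keep the tokens whose TF-IDF in the related corpus most exceeds the background."""
    related = tfidf_scores(related_docs)
    background = tfidf_scores(background_docs)
    diff = {tok: score - background.get(tok, 0.0) for tok, score in related.items()}
    return [tok for tok, _ in sorted(diff.items(), key=lambda kv: -kv[1])[:top_k]]
```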
It should be appreciated that the above-discussed process of generating reference information items 232 at 230, including, for example, emotion tags, animation tags, comment text, chat response text, and the like, is exemplary, and in other implementations, the process of reference information item generation may also take into account other factors, such as a scene-specific emotion, a preset personality of the intelligent conversation agent, a preset role of the intelligent conversation agent, and the like.
A scene-specific emotion may refer to a predefined emotion preference associated with a particular scene. For example, in some scenes it may be desirable for the intelligent conversation agent to make positive, optimistic responses as much as possible, and thus scene-specific emotions that lead to positive, optimistic responses, e.g., happy, excited, etc., may be preset for these scenes. The scene-specific emotion may include an emotion type, or an emotion type and its emotion level. The scene-specific emotion may be used to influence the generation of the reference information items. In one aspect, in the process of generating emotion tags described above, the scene-specific emotion may be taken as input along with the information elements 222 to jointly generate the emotion tag. For example, the scene-specific emotion can be used as an emotion representation, and this emotion representation can be used, together with the plurality of emotion representations respectively corresponding to the plurality of information elements, to generate the emotion tag. In one aspect, in the above-described process of generating animation tags, the scene-specific emotion may be considered in a manner similar to the emotion tag, e.g., the scene-specific emotion may be used together with information elements to define mapping rules. In one aspect, in the above-described process of generating the comment text, the plurality of sentences in the bullet screen text may be ranked by considering not only the degree of matching between the sentences and the image and/or audio of the target content, but also the degree of matching between the emotion information detected from the sentences and the scene-specific emotion. In one aspect, in the above-described process of generating the chat response text, the scene-specific emotion can be considered in a manner similar to the emotion tag. For example, the chat engine may use the input message text together with the scene-specific emotion, and possibly the emotion tag, to generate the chat response text.
The preset personality of the intelligent conversation agent may refer to personality characteristics preset for the intelligent conversation agent, such as lively, lovely, mild-tempered, excitable, and the like. The responses made by the intelligent conversation agent can be made to conform to the preset personality as much as possible. The preset personality may be used to influence the generation of the reference information items. In one aspect, in the process of generating emotion tags described above, the preset personality may be mapped to a corresponding emotional tendency, and the emotional tendency may be used as input along with the information elements 222 to jointly generate the emotion tag. For example, the emotional tendency may be used as an emotion representation, which may be used, together with the plurality of emotion representations respectively corresponding to the plurality of information elements, to generate the emotion tag. In one aspect, in the above-described process of generating animation tags, the preset personality may be used together with information elements to define the mapping rules. For example, a lively preset personality would favor animation tags with more body movements, a lovely preset personality would favor animation tags with lovely facial expressions, and so on. In one aspect, in the above-described process of generating the comment text, the plurality of sentences in the bullet screen text may be ranked by considering not only the degree of matching between the sentences and the image and/or audio of the target content, but also the degree of matching between the emotion information detected from the sentences and the emotional tendency corresponding to the preset personality. In one aspect, in the above-described process of generating the chat response text, the emotional tendency corresponding to the preset personality may be considered in a manner similar to the emotion tag. For example, the chat engine may use the input message text together with the emotional tendency, and possibly the emotion tag, to generate the chat response text.
The preset role of the intelligent conversation agent may refer to a role to be played by the intelligent conversation agent. Preset roles may be classified according to various criteria, for example, roles such as a girl or a middle-aged man classified according to age and gender, roles such as a teacher, a doctor, or a police officer classified according to profession, and the like. The responses made by the intelligent conversation agent can be made to conform to the preset role as much as possible. The preset role may be used to influence the generation of the reference information items. In one aspect, in the process of generating emotion tags described above, the preset role can be mapped to a corresponding emotional tendency, and the emotional tendency can be used as input along with the information elements 222 to jointly generate the emotion tag. For example, the emotional tendency may be used as an emotion representation, which may be used, together with the plurality of emotion representations respectively corresponding to the plurality of information elements, to generate the emotion tag. In one aspect, in the above-described process of generating animation tags, the preset role can be used together with information elements to define the mapping rules. For example, a girl preset role would favor animation tags with lovely facial expressions, more body movements, etc. In one aspect, in the above-described process of generating the comment text, the plurality of sentences in the bullet screen text may be ranked by considering not only the degree of matching between the sentences and the image and/or audio of the target content, but also the degree of matching between the emotion information detected from the sentences and the emotional tendency corresponding to the preset role. In one aspect, in the above-described process of generating the chat response text, the emotional tendency corresponding to the preset role may be considered in a manner similar to the emotion tag. For example, the chat engine may use the input message text together with the emotional tendency, and possibly the emotion tag, to generate the chat response text. In addition, the training corpus of the chat engine can also include more corpora corresponding to the preset role, so that the chat response text output by the chat engine better conforms to the language characteristics of the preset role.
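The scene-specific emotion, the preset personality, and the preset role can all be folded into emotion tag generation in the same way: each contributes one additional emotion representation that is pooled with the per-element representations before classification. A minimal sketch, where the mean pooling and the classifier interface are assumptions not fixed by the disclosure:

```python
from typing import Callable, List, Optional, Tuple
import numpy as np

EmotionTag = Tuple[str, int]  # (emotion type, emotion level)

def generate_emotion_tag(
    element_representations: List[np.ndarray],
    classifier: Callable[[np.ndarray], EmotionTag],
    scene_emotion: Optional[np.ndarray] = None,         # scene-specific emotion
    personality_tendency: Optional[np.ndarray] = None,  # from the preset personality
    role_tendency: Optional[np.ndarray] = None,         # from the preset role
) -> EmotionTag:
    """Pool all available emotion representations and classify the result."""
    representations = list(element_representations)
    for extra in (scene_emotion, personality_tendency, role_tendency):
        if extra is not None:
            representations.append(extra)
    pooled = np.mean(np.stack(representations), axis=0)
    return classifier(pooled)
```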
According to the process 200, after the reference information items 232 are obtained, multimodal output data 242 can be generated at 240 utilizing at least the reference information items 232. The multimodal output data 242 is data to be provided or presented to a user and may include various types of output data, such as speech of the intelligent conversation agent, text, an animation sequence of the avatar of the intelligent conversation agent, and so forth.
The speech in the multimodal output data may be generated for the comment text, the chat response text, etc. in the reference information items. For example, the comment text, the chat response text, and the like may be converted to corresponding speech by any text-to-speech (TTS) conversion technique. Optionally, the TTS conversion process may be conditioned on the emotion tag, such that the generated speech carries the emotion indicated by the emotion tag.
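A minimal sketch of emotion-conditioned TTS is shown below. The `tts_synthesize` function is a hypothetical placeholder standing in for whatever TTS engine is actually used; the disclosure only requires that the conversion can be conditioned on the emotion tag (and, per the scene-specific requirements discussed later, on a speech rate):

```python
from typing import Optional, Tuple

def tts_synthesize(text: str, style: str, intensity: int, rate: float) -> str:
    """Stand-in for a real TTS engine call; returns a textual description here."""
    return f"<speech style={style} level={intensity} rate={rate}: {text}>"

def text_to_emotional_speech(
    text: str,
    emotion_tag: Optional[Tuple[str, int]] = None,  # e.g. ("sad", 2)
    speech_rate: float = 1.0,                       # scene-specific speech rate preference
) -> str:
    style, intensity = emotion_tag if emotion_tag else ("neutral", 0)
    return tts_synthesize(text, style=style, intensity=intensity, rate=speech_rate)
```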
The text in the multimodal output data may be visual text corresponding to the comment text, the chat response text, etc. in the reference information items. Thus, the comment content, the chat response content, and the like spoken by the intelligent conversation agent can also be presented visually as text. Optionally, the text may be rendered in a predetermined font or with a predetermined presentation effect.
The animation sequence in the multimodal output data may be generated using at least the animation tag and/or the emotion tag in the reference information items. An animation library for the avatar of the intelligent conversation agent may be pre-established. The animation library may include a number of animation templates pre-authored with the avatar of the intelligent conversation agent. Each animation template may include, for example, a plurality of GIF images. Further, the animation templates in the animation library may be indexed with animation tags and/or emotion tags, e.g., each animation template may be labeled with at least one of a corresponding facial expression type, limb action type, emotion level, etc. Thus, when the reference information item 232 generated at 230 includes an animation tag and/or an emotion tag, the animation tag and/or the emotion tag can be utilized to select a corresponding animation template from the animation library. Preferably, after the animation template is selected, time adaptation may be performed on the animation template to form the animation sequence of the avatar of the intelligent conversation agent. The time adaptation is intended to adjust the animation template to match the temporal sequence of the speech corresponding to the comment text and/or the chat response text. For example, the duration of facial expressions, limb actions, etc. in the animation template may be adjusted to match the duration of the speech of the intelligent conversation agent. As an example, images in the animation template relating to mouth opening and closing may be repeated continuously during the period in which the speech of the intelligent conversation agent is played, thereby presenting the visual effect that the avatar is speaking. Further, it should be understood that the time adaptation is not limited to matching the animation template to the temporal sequence of the speech corresponding to the comment text and/or the chat response text, but may also include matching the animation template to the temporal sequence of the one or more extracted information elements 222. For example, assuming that a singer is playing a guitar in the target content, that information elements such as the object "guitar" have been identified from the target content, and that these information elements have been mapped to the "play guitar" animation tag, the selected animation template corresponding to "play guitar" may be repeated continuously during the time period in which the singer is playing the guitar, thereby presenting the visual effect that the avatar is playing a guitar along with the singer in the target content. It should be understood that in different application scenes, the intelligent conversation agent may have different avatars, so that different animation libraries may be pre-established for the different avatars, respectively.
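A minimal sketch of the template selection and time adaptation described above. The template structure (a list of GIF frames with a nominal per-frame duration) and the first-match selection policy are assumptions introduced for the example:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class AnimationTemplate:
    name: str
    animation_tags: List[str]
    emotion_tags: List[str]
    frames: List[str]        # e.g. file paths of the GIF images
    frame_duration: float    # nominal seconds per frame

def select_template(
    library: List[AnimationTemplate],
    animation_tag: Optional[str] = None,
    emotion_tag: Optional[str] = None,
) -> Optional[AnimationTemplate]:
    """Return the first template indexed by the given animation tag or emotion tag."""
    for template in library:
        if animation_tag is not None and animation_tag in template.animation_tags:
            return template
        if emotion_tag is not None and emotion_tag in template.emotion_tags:
            return template
    return None

def adapt_to_duration(template: AnimationTemplate, target_duration: float) -> List[str]:
    """Repeat the template's frames until they cover the target duration (e.g. of the speech)."""
    if not template.frames or template.frame_duration <= 0:
        return []
    frames: List[str] = []
    elapsed = 0.0
    while elapsed < target_duration:
        for frame in template.frames:
            frames.append(frame)
            elapsed += template.frame_duration
            if elapsed >= target_duration:
                break
    return frames
```

The same `adapt_to_duration` call could also be driven by the duration of an extracted information element, such as the guitar-playing time span in the example above.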
It should be appreciated that the above-discussed process of generating the multimodal output data 242 at 240, including, for example, the animation sequence, speech, text, etc., is exemplary, and in other implementations, the process of multimodal output data generation may also take into account other factors, such as scene-specific requirements, i.e., the multimodal output data may be generated further based on scene-specific requirements. Consideration of scene-specific requirements may enable embodiments of the present disclosure to be adaptively applied in a variety of scenes, e.g., multimodal output data suitable for a particular scene may be adaptively output based on the output capabilities supported by different scenes.
Scene-specific requirements may refer to specific requirements of different application scenes of the intelligent conversation agent. Scene-specific requirements may include, for example, the types of multimodal output data supported, a preset speech rate setting, a chat mode setting, etc. associated with a particular scene. In one aspect, different scenes may have different data output capabilities, and thus, the types of multimodal output data supported by different scenes may include outputting only one of speech, an animation sequence, and text, or outputting at least two of speech, an animation sequence, and text. For example, the smart animated character and virtual anchor scenes require that the terminal device be capable of at least supporting the output of images and audio, so the scene-specific requirements may dictate the output of one or more of speech, an animation sequence, and text. For example, a smart speaker scene supports only audio output, and thus the scene-specific requirements may dictate that only speech is output. In one aspect, different scenes may have different speech rate preferences, and thus, the scene-specific requirements may include a preset speech rate setting. For example, since the user can both view images and hear speech in the smart animated character and virtual anchor scenes, the speech rate can be set faster in order to express richer emotions. For example, in the smart speaker and smart car assistant scenes, users often only obtain or focus on speech output, and therefore the speech rate can be set slower, so that users can clearly understand what the intelligent conversation agent intends to express through speech alone. In one aspect, different scenes may have different chat mode preferences, and thus, the scene-specific requirements may include a chat mode setting. For example, in the smart car assistant scene, the chitchat output of the chat engine may be reduced in order not to distract the user too much, since the user may be driving a vehicle. In addition, the chat mode setting may also be associated with the captured image, the captured audio, the external environment data, and the like. For example, the speech output of chat responses generated by the chat engine may be reduced when the captured audio indicates that there is significant noise around the user. For example, when the external environment data indicates that the user is traveling at a relatively fast speed, e.g., driving a vehicle at high speed, the chitchat output of the chat engine may be reduced.
The multimodal output data can be generated at 240 based at least on the scene-specific requirements. For example, when the scene-specific requirements indicate that image output is not supported or that only speech output is supported, the generation of the animation sequence and text may be omitted. For example, when the scene-specific requirements indicate that a faster speech rate is employed, the speech rate of the generated speech may be increased during TTS conversion. For example, when the scene-specific requirements indicate that chat response output is to be reduced under specific conditions, the generation of speech or text corresponding to the chat response text may be limited.
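A minimal sketch of applying scene-specific requirements as an output filter. The field names and the dictionary keys of the output bundle are illustrative assumptions; the disclosure only enumerates supported output types, a speech rate preference, and a chat mode setting as examples:

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class SceneRequirements:
    supports_image_output: bool = True  # animation sequence / on-screen text
    supports_audio_output: bool = True  # speech
    speech_rate: float = 1.0            # preset speech rate preference
    allow_chitchat: bool = True         # chat mode setting

def apply_scene_requirements(output: Dict[str, object], req: SceneRequirements) -> Dict[str, object]:
    """Drop the output modalities that the current scene does not support or allow."""
    filtered = dict(output)
    if not req.supports_image_output:
        filtered.pop("animation_sequence", None)
        filtered.pop("text", None)
    if not req.supports_audio_output:
        filtered.pop("speech", None)
    if not req.allow_chitchat:
        filtered.pop("chat_response_speech", None)
        filtered.pop("chat_response_text", None)
    return filtered

# Example: a smart speaker scene that only supports audio output, at a slightly slower rate.
speaker_scene = SceneRequirements(supports_image_output=False, speech_rate=0.9)
```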
At 250, multimodal output data can be provided. For example, animation sequences, text, and the like are displayed through a display screen, and voice and the like are played through a speaker.
It should be appreciated that process 200 may be performed continuously to obtain multimodal input data continuously and to provide multimodal output data continuously.
FIG. 3 illustrates an example of a smart animated character scene according to an embodiment. In the smart animated character scene of FIG. 3, a user 310 may view a video on a terminal device 320, while an intelligent conversation agent according to embodiments of the present disclosure may, as a smart animated character, accompany the user 310 in viewing the video. The terminal device 320 may include, for example, a display screen 330, a camera 322, a speaker (not shown), a microphone (not shown), and the like. A video 332 as the target content may be presented on the display screen 330. In addition, an avatar 334 of the intelligent conversation agent may also be presented on the display screen 330. The intelligent conversation agent may perform multi-modal based reactive response generation according to embodiments of the present disclosure, and accordingly, the generated multi-modal based reactive response may be provided on the terminal device 320 via the avatar 334. For example, the avatar 334 may make facial expressions, make body movements, speak, etc., in response to the content in the video 332, a chat session with the user 310, captured images and/or audio, obtained external environment data, etc.
FIG. 4 illustrates an exemplary process 400 for a smart animated character scene according to an embodiment. Process 400 illustrates the processing flows, data/information flows, etc., involved in a smart animated character scene such as that of FIG. 3. Further, process 400 may be considered a specific example of process 200 in fig. 2.
In accordance with process 400, multimodal input data can first be obtained that includes, for example, at least one of video, external environment data, captured images, captured audio, chat sessions, and the like. The video serves as the target content, which may in turn include, for example, images, audio, a bullet screen file, and the like. It should be understood that the obtained multimodal input data may be temporally aligned and accordingly have the same context.
Information elements may be extracted from the multimodal input data. For example: character features, text, image light, objects, etc. are extracted from the image of the video; music, voice, etc. are extracted from the audio of the video; bullet screen text is extracted from the bullet screen file of the video; external environment information is extracted from the external environment data; object features are extracted from the captured image; music, voice, etc. are extracted from the captured audio; and message text is extracted from the chat session.
A reference information item may be generated based at least on the extracted information elements, including, for example, at least one of an emotion tag, an animation tag, comment text, and chat response text. The comment text may be generated by the comment generation model 430. The chat response text can be generated by the chat engine 450 and the optional emotion transfer network 452.
The generated reference information items can be utilized to generate multimodal output data including, for example, at least one of an animation sequence, comment speech, displayed comment text, chat response speech, displayed chat response text, and the like. The animation sequence may be generated as described above in connection with FIG. 2. For example, animation selection 410 may be performed in an animation library using the animation tag, the emotion tag, etc. to select an animation template, and animation sequence generation 420 may be performed based on the selected animation template to obtain an animation sequence through the time adaptation performed at animation sequence generation 420. The comment speech may be obtained by performing speech generation 440 (e.g., TTS conversion) on the comment text, and the displayed comment text may be obtained directly from the comment text. Likewise, the chat response speech may be obtained by performing speech generation 460 (e.g., TTS conversion) on the chat response text, and the displayed chat response text may be obtained directly from the chat response text.
The generated multimodal output data may be provided on the terminal device. For example, the animation sequence, the displayed comment text, the displayed chat response text, and the like are presented on the display screen, and the comment speech, the chat response speech, and the like are played through the speaker.
It is to be appreciated that all of the processes, data/information, etc. in process 400 are exemplary and in actual practice, process 400 may only involve one or more of such processes, data/information, etc.
The multi-modal based reactive response generation according to embodiments of the present disclosure may be applied to perform a variety of tasks. The following merely illustrates an exemplary intelligent animation generation task among these tasks. It should be understood that embodiments of the present disclosure are not limited to performing the intelligent animation generation task, but may also be used to perform a variety of other tasks.
FIG. 5 illustrates an exemplary process 500 of intelligent animation generation, according to an embodiment. Process 500 may be considered a specific implementation of process 200 in fig. 2. The smart animation generation of process 500 is a specific application of the multi-modal based reactive response generation of process 200. The intelligent animation generation of process 500 may involve at least one of generation of an animation sequence of the avatar performed in response to the target content, generation of comment speech of the avatar, generation of comment text, and the like.
In process 500, the multimodal input data obtaining step at 210 of FIG. 2 may be embodied as obtaining, at 510, at least one of an image, audio, and a bullet screen file of the target content.
In process 500, the information element extraction step at 220 of fig. 2 may be embodied as extracting at least one information element from an image, audio, or bullet screen file of the target content at 520. For example, character features, text, image light, objects, and the like are extracted from an image of target content, music, voice, and the like are extracted from audio of the target content, bullet-screen text is extracted from a bullet-screen file of the target content, and the like.
In process 500, the reference information item generation step at 230 of FIG. 2 may be embodied as generating at least one of an animation tag, an emotion tag, and a comment text at 530. For example, animation tags, sentiment tags, comment text, and the like may be generated based at least on the at least one information element extracted at 520.
In process 500, the multimodal output data generation step at 240 of FIG. 2 may be embodied as generating, at 540, at least one of an animation sequence of the avatar, comment speech of the avatar, and comment text by utilizing at least one of the animation tag, the emotion tag, and the comment text. Taking the animation sequence as an example, the animation sequence may be generated using at least the animation tag and/or the emotion tag in the manner described above in connection with FIG. 2. Furthermore, the comment speech and the comment text may also be generated in the manner described above in connection with FIG. 2.
In process 500, the multimodal output data providing step at 250 of FIG. 2 may be embodied as providing at least one of the generated animation sequence, comment speech, comment text at 550.
It should be understood that each step in process 500 may be performed in a manner similar to that described above for the corresponding step in fig. 2. Moreover, process 500 may also include any other processing described above with respect to process 200 of fig. 2.
Fig. 6 illustrates a flow diagram of an exemplary method 600 for multi-modal based reactive response generation, according to an embodiment.
At 610, multimodal input data can be obtained.
At 620, at least one information element can be extracted from the multimodal input data.
At 630, at least one reference information item may be generated based at least on the at least one information element.
At 640, multimodal output data can be generated utilizing at least the at least one item of reference information.
At 650, the multimodal output data can be provided.
In one implementation, the multimodal input data may include at least one of: an image of the target content, audio of the target content, a bullet screen file of the target content, a chat session, a captured image, captured audio, and external environment data.
Extracting at least one information element from the multimodal input data may include at least one of: extracting character features from an image of the target content; recognizing text from an image of the target content; detecting image light from an image of the target content; identifying an object from an image of the target content; extracting music from the audio of the target content; extracting a voice from the audio of the target content; extracting a bullet screen text from a bullet screen file of the target content; extracting message text from the chat session; extracting object features from the acquired image; extracting voice and/or music from the collected audio; and extracting the external environment information from the external environment data.
In one implementation, generating at least one reference information item based at least on the at least one information element may include: generating at least one of an emotion tag, an animation tag, comment text, and chat response text based at least on the at least one information element.
Generating the emotion tag based at least on the at least one information element may include: generating one or more emotion representations respectively corresponding to one or more information elements in the at least one information element; and generating the emotion tag based at least on the one or more emotion representations.
The emotion tag may indicate an emotion type and/or an emotion level.
Generating the animation label based at least on the at least one information element may include: mapping the at least one information element to the animation tag according to a predetermined rule.
The animation label may indicate a facial expression type and/or a limb action type.
The animation tag may be generated further based on the emotion tag.
Generating the comment text based at least on the at least one information element may include: selecting the comment text from the bullet screen text of the target content.
Selecting the comment text may include: determining a matching degree between a sentence in the bullet screen text of the target content and the image and/or audio of the target content by using a double-tower model; and selecting the sentence with the highest matching degree from the bullet screen text as the comment text.
Generating the chat response text based at least on the at least one information element may include: generating, by the chat engine, the chat response text based at least on the message text in the chat session.
The chat response text can be generated further based on the emotion tag.
The chat response text can be generated further based on the emotion representation from the emotion transfer network.
In one implementation, the at least one reference information item may be generated further based on at least one of: a scene-specific emotion; the preset personality of the intelligent conversation agent; and the preset role of the intelligent conversation agent.
In one implementation, the multimodal output data may include at least one of: an animation sequence of an avatar of the intelligent conversation agent; voice of the intelligent conversation agent; and text.
Generating the multimodal output data using at least the at least one item of reference information may include: generating voice and/or text corresponding to the comment text and/or the chat response text.
Generating the multimodal output data using at least the at least one item of reference information may include: selecting a corresponding animation template from an animation library of the avatar of the intelligent conversation agent by using the animation tag and/or the emotion tag; and performing time adaptation on the animation template to form an animation sequence of the avatar of the intelligent conversation agent.
The time adaptation may include: adjusting the animation template to match a temporal sequence of speech corresponding to the comment text and/or the chat response text.
In one implementation, the multimodal output data may be generated further based on scene-specific requirements.
The scene-specific requirements may include at least one of: outputting only one of speech, an animation sequence, and text; outputting at least two of speech, an animation sequence, and text; a preset speech rate setting; and a chat mode setting.
In one implementation, the multi-modal based reactive response generation may include intelligent animation generation. Obtaining the multimodal input data may include: obtaining at least one of an image, audio, and a bullet screen file of the target content. Extracting at least one information element from the multimodal input data may include: extracting at least one information element from the image, audio, and bullet screen file of the target content. Generating at least one reference information item based at least on the at least one information element may include: generating at least one of an animation tag, an emotion tag, and comment text based at least on the at least one information element. Generating the multimodal output data using at least the at least one item of reference information may include: generating at least one of an animation sequence of the avatar, comment speech of the avatar, and comment text using at least one of the animation tag, the emotion tag, and the comment text. Providing the multimodal output data may include: providing at least one of the animation sequence, the comment speech, and the comment text.
It should be understood that method 600 may also include any steps/processes for multi-modal based reactive response generation in accordance with embodiments of the present disclosure described above.
Fig. 7 illustrates an exemplary apparatus 700 for multi-modal based reactive response generation, according to an embodiment.
The apparatus 700 may include: a multimodal input data obtaining module 710 for obtaining multimodal input data; a data integration processing module 720, configured to extract at least one information element from the multimodal input data; a scene logic processing module 730 for generating at least one reference information item based at least on the at least one information element; a multimodal output data generating module 740 for generating multimodal output data using at least the at least one item of reference information; and a multimodal output data providing module 750 for providing the multimodal output data.
Furthermore, the apparatus 700 may also include any other modules that perform the steps of the method for multi-modal based reactive response generation according to embodiments of the present disclosure described above.
Fig. 8 illustrates an exemplary apparatus 800 for multi-modal based reactive response generation, according to an embodiment.
The apparatus 800 may include: at least one processor 810; and a memory 820 storing computer executable instructions. When executed, the at least one processor 810 may perform any of the steps/processes of the method for multi-modal based reactive response generation according to embodiments of the present disclosure described above.
Embodiments of the present disclosure provide a multi-modality based reactive response generation system, comprising: a multimodal data input interface for obtaining multimodal input data; a core processing unit configured to extract at least one information element from the multimodal input data, generate at least one item of reference information based at least on the at least one information element, and generate multimodal output data using at least the at least one item of reference information; and a multimodal data output interface for providing the multimodal output data. Furthermore, the multimodal data input interface, the core processing unit, the multimodal data output interface may also perform any relevant steps/processes of the method for multimodal based reactive response generation according to the embodiments of the present disclosure described above. Furthermore, the multi-modal based reactive response generation system may also include any other units and modules for multi-modal based reactive response generation according to embodiments of the present disclosure described above.
Embodiments of the present disclosure propose computer program products for multi-modal based reactive response generation, comprising a computer program to be run by at least one processor for performing any of the steps/processes of the methods for multi-modal based reactive response generation according to embodiments of the present disclosure described above.
Embodiments of the present disclosure may be embodied in non-transitory computer readable media. The non-transitory computer-readable medium may include instructions that, when executed, cause the one or more processors to perform any of the steps/processes of the method for multi-modal based reactive response generation according to embodiments of the present disclosure described above.
It should be understood that all operations in the methods described above are exemplary only, and the present disclosure is not limited to any operations in the methods or the order of the operations, but rather should encompass all other equivalent variations under the same or similar concepts.
In addition, the articles "a" and "an" as used in this specification and the appended claims should generally be construed to mean "one" or "one or more" unless specified otherwise or clear from context to be directed to a singular form.
It should also be understood that all of the modules in the above described apparatus may be implemented in various ways. These modules may be implemented as hardware, software, or a combination thereof. In addition, any of these modules may be further divided functionally into sub-modules or combined together.
The processor has been described in connection with various apparatus and methods. These processors may be implemented using electronic hardware, computer software, or any combination thereof. Whether such processors are implemented as hardware or software depends upon the particular application and the overall design constraints imposed on the system. By way of example, the processor, any portion of the processor, or any combination of processors presented in this disclosure may be implemented as a microprocessor, microcontroller, Digital Signal Processor (DSP), Field Programmable Gate Array (FPGA), Programmable Logic Device (PLD), state machine, gated logic, discrete hardware circuits, and other suitable processing components configured to perform the various functions described in this disclosure. The functionality of a processor, any portion of a processor, or any combination of processors presented in this disclosure may be implemented as software executed by a microprocessor, microcontroller, DSP, or other suitable platform.
Software should be viewed broadly as representing instructions, instruction sets, code segments, program code, programs, subroutines, software modules, applications, software packages, routines, objects, threads of execution, procedures, functions, and the like. The software may reside in a computer readable medium. The computer readable medium may include, for example, a memory, which may be, for example, a magnetic storage device (e.g., hard disk, floppy disk, magnetic strip), an optical disk, a smart card, a flash memory device, a Random Access Memory (RAM), a Read Only Memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), a register, or a removable disk. Although the memory is shown as being separate from the processor in aspects presented in this disclosure, the memory may be located internal to the processor (e.g., a cache or a register).
The above description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein. All structural and functional equivalents to the elements of the various aspects described herein that are known or later come to be known to those of ordinary skill in the art are intended to be encompassed by the claims.

Claims (26)

1. A method for multi-modal based reactive response generation, comprising:
obtaining multimodal input data;
extracting at least one information element from the multimodal input data;
generating at least one reference information item based at least on the at least one information element;
generating multimodal output data using at least the at least one item of reference information; and
providing the multimodal output data.
2. The method of claim 1, wherein the multimodal input data comprises at least one of:
an image of the target content, audio of the target content, a bullet screen file of the target content, a chat session, a captured image, captured audio, and external environment data.
3. The method of claim 2, wherein extracting at least one information element from the multimodal input data comprises at least one of:
extracting character features from an image of the target content;
recognizing text from an image of the target content;
detecting image light from an image of the target content;
identifying an object from an image of the target content;
extracting music from the audio of the target content;
extracting a voice from the audio of the target content;
extracting a bullet screen text from a bullet screen file of the target content;
extracting message text from the chat session;
extracting object features from the acquired image;
extracting voice and/or music from the collected audio; and
extracting external environment information from the external environment data.
4. The method of claim 1, wherein generating at least one reference information item based at least on the at least one information element comprises:
generating at least one of an emotion tag, an animation tag, a comment text, and a chat response text based at least on the at least one information element.
5. The method of claim 4, wherein generating an emotion tag based at least on the at least one information element comprises:
generating one or more emotion representations respectively corresponding to one or more information elements in the at least one information element; and
generating the emotion tag based at least on the one or more emotion representations.
6. The method of claim 5, wherein,
the emotion tag indicates an emotion type and/or emotion level.
7. The method of claim 4, wherein generating an animation tag based at least on the at least one information element comprises:
mapping the at least one information element to the animation tag according to a predetermined rule.
8. The method of claim 7, wherein,
the animation label indicates a facial expression type and/or a limb action type.
9. The method of claim 7, wherein,
the animation tag is generated further based on the emotion tag.
10. The method of claim 4, wherein generating comment text based at least on the at least one information element comprises:
selecting the comment text from the bullet screen text of the target content.
11. The method of claim 10, wherein the selecting the comment text comprises:
determining a matching degree between a sentence in the bullet screen text of the target content and an image and/or audio of the target content by using a double-tower model; and
selecting the sentence with the highest matching degree from the bullet screen text as the comment text.
12. The method of claim 4, wherein generating chat response text based at least on the at least one information element comprises:
generating, by the chat engine, the chat response text based at least on the message text in the chat session.
13. The method of claim 12, wherein,
the chat response text is generated further based on the emotion tag.
14. The method of claim 12, wherein,
the chat response text is generated further based on the emotion representation from the emotion transfer network.
15. The method of claim 1, wherein the at least one reference information item is generated further based on at least one of:
a scene specific emotion;
the preset personality of the intelligent conversation body; and
the preset role of the intelligent conversation body.
16. The method of claim 1, wherein the multi-modal output data comprises at least one of:
an animation sequence of an avatar of the intelligent conversation agent;
voice of the intelligent conversation agent; and
text.
17. The method of claim 4, wherein generating multi-modal output data utilizing at least the at least one item of reference information comprises:
generating voice and/or text corresponding to the comment text and/or the chat response text.
18. The method of claim 4, wherein generating multi-modal output data utilizing at least the at least one item of reference information comprises:
selecting a corresponding animation template from an animation library of the avatar of the intelligent conversation body by using the animation tag and/or the emotion tag; and
performing time adaptation on the animation template to form an animation sequence of the avatar of the intelligent conversation body.
19. The method of claim 18, wherein the time adaptation comprises:
adjusting the animation template to match a temporal sequence of speech corresponding to the comment text and/or the chat response text.
20. The method of claim 1, wherein,
the multimodal output data is further generated based on scene specific requirements.
21. The method of claim 20, wherein the scenario-specific requirements include at least one of:
outputting only one of voice, animation sequence and text;
outputting at least two of speech, animation sequences, and text;
a preset speech rate setting; and
a chat mode setting.
22. The method of claim 1, wherein,
obtaining multimodal input data includes: obtaining at least one of an image, audio and a bullet screen file of the target content,
extracting at least one information element from the multimodal input data comprises: extracting at least one information element from the image, audio and bullet screen files of the target content,
generating at least one reference information item based at least on the at least one information element comprises: generating at least one of an animation tag, an emotion tag, and a comment text based at least on the at least one information element,
generating the multimodal output data using at least the at least one item of reference information comprises: generating at least one of an animation sequence of the avatar, a comment speech of the avatar, and a comment text using at least one of the animation tag, the emotion tag, and the comment text, and
providing the multimodal output data comprises: providing at least one of the animation sequence, the comment speech, and the comment text.
23. A multi-modality based reactive response generation system, comprising:
a multimodal data input interface for obtaining multimodal input data;
a core processing unit configured to: extracting at least one information element from the multimodal input data; generating at least one reference information item based at least on the at least one information element; and generating multimodal output data using at least the at least one item of reference information; and
a multimodal data output interface for providing the multimodal output data.
24. An apparatus for multi-modal based reactive response generation, comprising:
at least one processor; and
a memory storing computer-executable instructions that, when executed, cause the at least one processor to perform the steps of the method of any one of claims 1 to 21.
25. An apparatus for multi-modal based reactive response generation, comprising:
the multi-modal input data acquisition module is used for acquiring multi-modal input data;
the data integration processing module is used for extracting at least one information element from the multi-modal input data;
a scene logic processing module to generate at least one reference information item based at least on the at least one information element;
a multimodal output data generating module for generating multimodal output data using at least the at least one item of reference information; and
a multimodal output data providing module for providing the multimodal output data.
26. A computer program product for multi-modal based reactive response generation, comprising a computer program to be executed by at least one processor for performing the steps of the method according to any of claims 1 to 21.
CN202110545116.3A 2021-05-19 2021-05-19 Multi-modal based reactive response generation Pending CN113238654A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110545116.3A CN113238654A (en) 2021-05-19 2021-05-19 Multi-modal based reactive response generation
PCT/CN2022/093766 WO2022242706A1 (en) 2021-05-19 2022-05-19 Multimodal based reactive response generation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110545116.3A CN113238654A (en) 2021-05-19 2021-05-19 Multi-modal based reactive response generation

Publications (1)

Publication Number Publication Date
CN113238654A true CN113238654A (en) 2021-08-10

Family

ID=77137616

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110545116.3A Pending CN113238654A (en) 2021-05-19 2021-05-19 Multi-modal based reactive response generation

Country Status (2)

Country Link
CN (1) CN113238654A (en)
WO (1) WO2022242706A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113744369A (en) * 2021-09-09 2021-12-03 广州梦映动漫网络科技有限公司 Animation generation method, system, medium and electronic terminal
WO2022242706A1 (en) * 2021-05-19 2022-11-24 宋睿华 Multimodal based reactive response generation
CN115658935A (en) * 2022-12-06 2023-01-31 北京红棉小冰科技有限公司 Personalized comment generation method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106531162A (en) * 2016-10-28 2017-03-22 北京光年无限科技有限公司 Man-machine interaction method and device used for intelligent robot
CN107831905A (en) * 2017-11-30 2018-03-23 北京光年无限科技有限公司 A kind of virtual image exchange method and system based on line holographic projections equipment
CN112379780A (en) * 2020-12-01 2021-02-19 宁波大学 Multi-mode emotion interaction method, intelligent device, system, electronic device and medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110267052B (en) * 2019-06-19 2021-04-16 云南大学 Intelligent barrage robot based on real-time emotion feedback
CN113238654A (en) * 2021-05-19 2021-08-10 宋睿华 Multi-modal based reactive response generation

Also Published As

Publication number Publication date
WO2022242706A1 (en) 2022-11-24

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination