CN115390678B - Virtual human interaction method and device, electronic equipment and storage medium


Info

Publication number
CN115390678B
Authority
CN
China
Prior art keywords
target object
interaction
state
virtual image
sight line
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211326573.4A
Other languages
Chinese (zh)
Other versions
CN115390678A (en)
Inventor
江昊宸
何山
殷兵
刘聪
周良
胡金水
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN202211326573.4A
Publication of CN115390678A
Application granted
Publication of CN115390678B
Active legal status (current)
Anticipated expiration legal status

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F3/013Eye tracking input arrangements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00Animation
    • G06T13/20 3D [Three Dimensional] animation
    • G06T13/205 3D [Three Dimensional] animation driven by audio data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00Animation
    • G06T13/20 3D [Three Dimensional] animation
    • G06T13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2203/00Indexing scheme relating to G06F3/00 - G06F3/048
    • G06F2203/01Indexing scheme relating to G06F3/01
    • G06F2203/011Emotion or mood input determined on the basis of sensed human body parameters such as pulse, heart rate or beat, temperature of skin, facial expressions, iris, voice pitch, brain activity patterns

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The application provides a virtual human interaction method and device, an electronic device and a storage medium. Sight line tracking and emotion analysis are performed on audio and video data of a target object to determine the sight line track characteristics and the emotional state of the target object; the emotional state of the virtual image at the next moment is predicted from the emotional state of the target object and the current emotional state of the virtual image; and the interaction state parameters of the virtual image at the next moment, which include at least the sight line direction, are predicted from the sight line track characteristics of the target object and of the virtual image, the emotional state of the target object, the emotional state of the virtual image at the next moment, and the interaction state parameters of the virtual image at the current moment. With this scheme, sight line interaction between the target object and the virtual image is predicted based on the emotional states of both, so that the virtual image and the target object can interact by sight line under different emotional states, improving the realism of the interaction and the interactive experience.

Description

Virtual human interaction method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a virtual human interaction method and apparatus, an electronic device, and a storage medium.
Background
The digital avatar is a virtual character having a digitized appearance, and unlike the physical robot, the digital avatar exists depending on a display device. In the field of virtual human interaction, the voice of a user is automatically read, analyzed and recognized through an intelligent system, the reply content of a virtual human is decided according to the analysis result, and a character model is driven to generate corresponding voice and actions to realize the interaction between the virtual human and the user.
In the existing interaction between a virtual human and a user, only dialogue interaction is realized; the interaction is not reflected in the sight lines of the virtual human and the user. As a result, the interaction process feels unrealistic, which degrades the user's interactive experience with the virtual human.
Disclosure of Invention
Based on the defects and shortcomings of the prior art, the application provides a virtual human interaction method, a virtual human interaction device, electronic equipment and a storage medium, and the interactivity and the anthropomorphic effect of a virtual digital human in the interaction process can be improved.
The application provides a virtual human interaction method in a first aspect, which comprises the following steps:
determining sight line track characteristics of a target object and an emotional state of the target object by performing sight line tracking processing and emotion analysis processing on audio and video data of the target object;
predicting and determining the emotional state of the virtual image at the next moment according to the emotional state of the target object and the current emotional state of the virtual image corresponding to the target object;
and predicting and determining the interaction state parameters of the virtual image at the next moment according to the sight line track characteristics of the target object, the emotional state of the virtual image at the next moment, the interaction state parameters of the virtual image at the current moment and the sight line track characteristics of the virtual image, wherein the interaction state parameters at least comprise sight line directions.
Optionally, the determining the sight locus characteristic of the target object by performing sight tracking processing on the audio and video data of the target object includes:
extracting target video data from audio and video data of a target object, wherein the target video data is video data in a preset time length before a video frame at the current moment in the audio and video data;
and performing sight tracking processing based on the target video data, and determining sight locus characteristics of the target object.
Optionally, performing gaze tracking processing based on the target video data, and determining a gaze track feature of the target object, includes:
obtaining a first sight track characteristic corresponding to the target video data by extracting sight characteristics corresponding to each video frame in the target video data;
predicting a second sight line track characteristic within a preset time length in the future according to the first sight line track characteristic;
and taking the combination of the first sight line track characteristic and the second sight line track characteristic as the sight line track characteristic of the target object.
Optionally, the determining the sight line track characteristic of the target object and the emotional state of the target object by performing sight line tracking processing and emotion analysis processing on the audio and video data of the target object includes:
inputting a video data stream in audio and video data of a target object into a pre-trained sight tracking network to obtain sight track characteristics of the target object;
and inputting the audio and video data of the target object into a pre-trained emotion analysis network to obtain the emotion state of the target object.
Optionally, the predicting and determining the interaction state parameter of the avatar at the next moment according to the sight line trajectory feature of the target object, the emotional state of the avatar at the next moment, the interaction state parameter of the avatar at the current moment, and the sight line trajectory feature of the avatar, includes:
determining an interaction rule according to the emotional state of the target object and the emotional state of the virtual image at the next moment, wherein the interaction rule at least comprises a sight line interaction rule;
and predicting the interaction state parameters of the virtual image at the next moment according to the sight line track characteristics of the target object, the interaction state parameters of the virtual image at the current moment and the sight line track characteristics of the virtual image based on the interaction rule.
Optionally, based on the interaction rule, predicting the interaction state parameter of the avatar at the next moment from the sight line trajectory feature of the target object, the interaction state parameter of the avatar at the current moment, and the sight line trajectory feature of the avatar, including:
inputting the sight line track characteristic of the target object, the interaction state parameter of the virtual image at the current moment and the sight line track characteristic of the virtual image into a target parameter prediction model to obtain the interaction state parameter of the virtual image at the next moment;
the target parameter prediction model is used for predicting the interaction state parameters of the virtual image according to the interaction rules when the target object is in a first emotion state and the virtual image is in a second emotion state, wherein the first emotion state is the emotion state of the target object, and the second emotion state is the emotion state of the virtual image at the next moment.
Optionally, the method further includes:
judging whether the target object is in a state of interacting with the virtual image corresponding to the target object according to the audio and video data of the target object;
and if the target object is not in the state of interacting with the virtual image corresponding to the target object, determining the interaction state parameter of the virtual image at the next moment as a preset rest state parameter.
Optionally, the predicting and determining an emotional state of the avatar at a next time according to the emotional state of the target object and the current emotional state of the avatar corresponding to the target object includes:
if the target object is in a state of interacting with the virtual image corresponding to the target object, predicting and determining the emotional state of the virtual image at the next moment according to the emotional state of the target object and the current emotional state of the virtual image corresponding to the target object.
Optionally, the interaction state parameters further include an expression parameter and/or a head posture parameter.
A second aspect of the present application provides a virtual human interaction apparatus, including:
the first determining module is used for determining sight line track characteristics of a target object and emotional state of the target object by performing sight line tracking processing and emotion analysis processing on audio and video data of the target object;
the second determining module is used for predicting and determining the emotional state of the virtual image at the next moment according to the emotional state of the target object and the current emotional state of the virtual image corresponding to the target object;
and the interaction state parameter determining module is used for predicting and determining the interaction state parameter of the virtual character at the next moment according to the sight line track characteristic of the target object, the emotional state of the virtual character at the next moment, the interaction state parameter of the virtual human at the current moment and the sight line track characteristic of the virtual character, wherein the interaction state parameter at least comprises the sight line direction.
A third aspect of the present application provides an electronic device, comprising: a memory and a processor;
wherein the memory is connected with the processor and used for storing programs;
the processor is used for realizing the virtual human interaction method by running the program in the memory.
A fourth aspect of the present application provides a storage medium, where a computer program is stored, and when the computer program is executed by a processor, the virtual human interaction method is implemented.
According to the virtual human interaction method, the sight line track characteristics of the target object and the emotional state of the target object are determined by performing sight line tracking processing and emotion analysis processing on audio and video data of the target object; predicting and determining the emotional state of the virtual image at the next moment according to the emotional state of the target object and the current emotional state of the virtual image corresponding to the target object; and predicting and determining the interaction state parameters of the virtual image at the next moment according to the sight line track characteristics of the target object, the emotional state of the virtual image at the next moment, the interaction state parameters of the virtual image at the current moment and the sight line track characteristics of the virtual image, wherein the interaction state parameters at least comprise sight line directions. By adopting the technical scheme, the visual line interactive prediction of the virtual image can be carried out according to the visual line track characteristics of the user based on the emotional state of the target object and the emotional state of the virtual image, so that the visual line interaction of the virtual image and the target object in different emotional states is realized, and the reality and the interactive experience of the interactive process of the virtual image and the target object are improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only embodiments of the present application, and those skilled in the art can obtain other drawings from the provided drawings without creative effort.
Fig. 1 is a schematic flowchart of a virtual human interaction method provided in an embodiment of the present application;
fig. 2 is a schematic flowchart of a process for determining a sight line trajectory feature of a target object according to an embodiment of the present disclosure;
fig. 3 is a schematic processing flow diagram for predicting interaction state parameters of an avatar at a next time according to an embodiment of the present application;
FIG. 4 is a flowchart illustrating another avatar interaction method according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of a virtual human interaction device according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical scheme of the embodiment of the application is suitable for the application scene of the digital virtual human, and is particularly suitable for the application scene of virtual human interaction. By adopting the technical scheme of the embodiment of the application, the sense of reality and the interactive experience of the interactive process of the virtual image and the target object can be improved.
With the continuous development of digital virtual human technology in recent years, a part of technology exploration and application exist in the field of digital virtual human driving, for example, the real-time lip driving of a virtual human can be realized by analyzing the voice and the lip movement rule; by analyzing the relation between the user input command and the corresponding action, the real-time motion driving of the virtual human can be realized. When the digital virtual human carries out interaction, an intelligent system is usually used for automatically reading, analyzing and identifying the voice of a target object which interacts with the digital virtual human, the reply content of the digital virtual human is decided according to the analysis result, and a character model of the digital virtual human is driven to generate corresponding voice and action to realize the interaction between the digital virtual human and the target object.
In the field of virtual human interaction, line-of-sight interaction is inevitably generated when people and people carry out daily dialogue interaction, so the line-of-sight interaction is particularly important for digital virtual human personification.
In view of the defects of the prior art and the problems that the reality of the interaction process of the virtual human and the target object is low and the interaction experience of the target object and the virtual human is influenced, the inventor of the application provides a virtual human interaction method through research and experiments, the method can realize the sight line interaction of the virtual image and the target object in different emotion states, and the reality and the interaction experience of the interaction process of the virtual image and the target object are improved.
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making any creative effort belong to the protection scope of the present application.
An embodiment of the present application provides a virtual human interaction method, which is shown in fig. 1 and includes:
s101, determining sight track characteristics of the target object and emotional state of the target object by performing sight tracking processing and emotion analysis processing on audio and video data of the target object.
Specifically, when the interactive device detects a target object interacting with a virtual human, a video capture component (e.g., a camera) disposed on the interactive device needs to capture a video data stream of the target object, an audio capture component (e.g., a microphone) disposed on the interactive device captures an audio data stream of the target object, and the captured audio data stream and the captured video data stream are used as audio and video data of the target object. The interactive device can be a mobile phone, a computer or an interactive all-in-one machine.
The embodiment determines the sight line track characteristic of the target object and the emotional state of the target object by performing sight line tracking processing and emotion analysis processing on the audio and video data of the target object. The sight tracking processing only needs to use the video data stream, so when the sight tracking processing is performed, video data within a certain time range needs to be extracted from the video data stream of the audio and video data of the target object to serve as target video data for sight tracking, and then sight tracking processing is performed on the target video data to determine sight track characteristics of the target object. The embodiment can determine the sight line track characteristic of the target object by analyzing the sight line direction of the target object in each video frame of the target video data; the sight-line track characteristics of the target object can also be determined by utilizing a pre-trained sight-line tracking network.
To determine the sight line track characteristic of the target object with a pre-trained sight line tracking network, the video data stream in the audio and video data of the target object is input into the pre-trained sight line tracking network to obtain the sight line track characteristic of the target object; alternatively, the target video data extracted from that video data stream may be input into the sight line tracking network. In this embodiment, the sight line tracking network may adopt an autoregressive model. The autoregressive model is trained with pre-collected sample video data that carries sight line track characteristics: the sample video data is input into the autoregressive model, the model determines the sample sight line track characteristic corresponding to the sample video data, a loss function between the sample sight line track characteristic and the sight line track characteristic carried by the sample video data is calculated and used to adjust the model parameters, and the trained autoregressive model is used as the sight line tracking network. The autoregressive model can both determine the sight line track characteristic within the time range of the sample video data and predict the sight line track characteristic for a certain duration in the future, so the sample sight line track characteristic comprises both parts, and the label carried by the sample video data likewise comprises the actual sight line track characteristic within the sample time range and the actual sight line track characteristic for that future duration. Accordingly, the sight line track characteristic of the target object determined by the pre-trained sight line tracking network also includes the sight line track characteristic within the time range of the input video data and the predicted sight line track characteristic for a certain future duration.
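As an illustrative aid only (the embodiment does not provide an implementation), the following sketch shows one way such an autoregressive sight line tracking model could be built. It assumes the per-frame sight line feature is a 3-D direction vector and uses a GRU as the autoregressive model; the framework (PyTorch), names and dimensions are all assumptions, not part of the embodiment.

```python
# Hypothetical sketch of an autoregressive sight-line trajectory model.
# Assumptions: 3-D gaze direction per frame, GRU backbone, MSE training loss.
import torch
import torch.nn as nn

class GazeTrajectoryModel(nn.Module):
    def __init__(self, feat_dim: int = 3, hidden_dim: int = 64):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, feat_dim)

    def forward(self, observed: torch.Tensor, future_steps: int) -> torch.Tensor:
        # observed: (batch, T_obs, feat_dim) sight directions from the video frames
        out, h = self.encoder(observed)
        observed_traj = self.head(out)            # trajectory over the observed window
        step = observed_traj[:, -1:, :]           # seed for the future segment
        future = []
        for _ in range(future_steps):             # roll the model forward autoregressively
            o, h = self.encoder(step, h)
            step = self.head(o)
            future.append(step)
        return torch.cat([observed_traj] + future, dim=1)

def train_step(model, optimizer, sample_frames, labelled_trajectory, future_steps):
    """One parameter update: loss between the predicted and labelled trajectories."""
    optimizer.zero_grad()
    pred = model(sample_frames, future_steps)
    loss = nn.functional.mse_loss(pred, labelled_trajectory)
    loss.backward()
    optimizer.step()
    return loss.item()
```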
For the emotion analysis processing, this embodiment may pre-train an emotion analysis network and input the audio and video data of the target object into that network to obtain the emotional state of the target object. Alternatively, the audio and video data within a recent time range may be taken from the audio and video data of the target object, the audio frames and video frames in it aligned, and the aligned frames input into the emotion analysis network to obtain the emotional state of the target object. The emotion analysis network can be trained with pre-collected sample audio and video data carrying emotional states: the sample audio and video data is input into the emotion analysis network to obtain the corresponding sample emotional state, and a loss function between the sample emotional state and the emotional state carried by the sample data is calculated and used to adjust the parameters of the emotion analysis network.
This embodiment may also analyze the audio data stream and the video data stream in the audio and video data of the target object separately: the audio emotional state of the target object is determined from the audio data stream, the video emotional state is determined from the video data stream, and the final emotional state is obtained from the audio emotional state and the video emotional state. Analyzing the emotion of the target object from video data and from audio data are both prior-art techniques and are not described in detail in this embodiment.
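Where the audio emotional state and the video emotional state are determined separately, the final emotional state can be obtained by fusing the two. The snippet below is a minimal sketch of one such fusion, assuming each modality yields a probability distribution over a small, assumed set of emotion labels; the label set and weighting are illustrative only.

```python
# Hypothetical fusion of audio-based and video-based emotion estimates.
import numpy as np

EMOTIONS = ["neutral", "joy", "anger", "sadness"]  # assumed label set

def fuse_emotions(audio_probs: np.ndarray, video_probs: np.ndarray,
                  audio_weight: float = 0.4) -> str:
    """Weighted fusion of the two emotion distributions into one final state."""
    fused = audio_weight * audio_probs + (1.0 - audio_weight) * video_probs
    return EMOTIONS[int(np.argmax(fused))]

# Example: both modalities lean towards anger, so the fused state is "anger".
print(fuse_emotions(np.array([0.1, 0.1, 0.7, 0.1]),
                    np.array([0.2, 0.1, 0.6, 0.1])))
```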
S102, predicting and determining the emotional state of the virtual image at the next moment according to the emotional state of the target object and the current emotional state of the virtual image corresponding to the target object.
Specifically, the interaction device stores the emotional state of the avatar for each time range. For example, if the emotional state of the avatar is A from a first time to a second time, changes to B at the second time and remains B until the current time, then the interaction device stores emotional state A for the interval from the first time to the second time and emotional state B for the interval from the second time to the current time.
This embodiment predicts the emotional state of the avatar at the next moment from the emotional state of the target object and the current emotional state of the avatar interacting with the target object, where the current emotional state of the avatar is the emotional state at the current moment stored in the interaction device. A correspondence between emotion combinations and next-moment emotional states of the avatar may be established in advance according to how the emotional states of two parties change in real interactions: each combination of an emotional state of the target object and a current emotional state of the avatar is mapped to an emotional state of the avatar at the next moment. For example, the combination of emotional state A of the target object and emotional state A of the avatar may be mapped to one next-moment emotional state of the avatar, while the combination of emotional state B of the target object and emotional state A of the avatar may be mapped to emotional state B. The emotional state of the avatar at the next moment is then obtained by looking up, in this pre-established correspondence, the entry matching the combination of the emotional state of the target object and the current emotional state of the avatar.
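A minimal sketch of such a pre-established correspondence is given below; the emotion labels and table entries are illustrative assumptions, since the embodiment only requires that some correspondence be built from real interaction behaviour.

```python
# Hypothetical lookup table: (target object's emotion, avatar's current emotion)
# -> avatar's emotion at the next moment.
NEXT_AVATAR_EMOTION = {
    ("anger", "joy"):   "soothing",
    ("joy", "joy"):     "joy",
    ("sadness", "joy"): "concern",
}

def predict_next_avatar_emotion(target_emotion: str, avatar_emotion: str) -> str:
    # Keep the avatar's current emotion for combinations not in the table.
    return NEXT_AVATAR_EMOTION.get((target_emotion, avatar_emotion), avatar_emotion)

print(predict_next_avatar_emotion("anger", "joy"))  # -> "soothing"
```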
In addition, this embodiment can train a virtual human emotion analysis network in advance so that it learns how emotional states change. The emotional state of the target object and the current emotional state of the avatar are input into this network to obtain emotion change data of the avatar; the emotion change data is then compared with preset emotion change ranges, and the emotional state of the avatar at the next moment is determined from the comparison result. For example, when the emotion change data falls in a first emotion change range, the next-moment emotional state of the avatar may be determined to be emotional state A; when it falls in a second emotion change range, the next-moment emotional state may be determined to be emotional state B; and so on. To train the virtual human emotion analysis network, this embodiment can use emotion combinations of the target object's emotional state and the avatar's emotional state, labelled with emotion change data, as training samples.
S103, predicting and determining the interaction state parameter of the virtual image at the next moment according to the sight line track characteristic of the target object, the emotion state of the virtual image at the next moment, the interaction state parameter of the virtual image at the current moment and the sight line track characteristic of the virtual image.
Specifically, the interaction device stores, in real time, the sight line direction of the avatar in each rendered frame, and these together form the sight line trajectory feature of the avatar. This embodiment needs to acquire this stored sight line trajectory feature, and its time range must match the time range covered by the acquired sight line trajectory feature of the target object. For example, if the sight line trajectory feature of the target object is determined from the video data stream within one second before the current time, the extracted sight line trajectory feature of the avatar also needs to cover the one second before the current time; and if the sight line trajectory feature of the target object also includes a predicted trajectory for the next second, the embodiment likewise needs to predict the avatar's sight line trajectory for the next second from its trajectory in the second before the current time, and combine the two as the sight line trajectory feature of the avatar. The prediction of the avatar's sight line trajectory over a future time range can be performed in the same way as the prediction of the target object's future sight line trajectory, for example with an autoregressive model.
Under the emotion combination scene formed by the emotional state of the target object and the emotional state of the virtual image at the next moment, the interaction state parameters of the virtual image at the next moment are predicted using the sight line track characteristic of the target object, the sight line track characteristic of the virtual image and the interaction state parameters of the virtual image at the current moment, where the interaction state parameters at least include the sight line direction. That is, given that emotion combination, the sight line direction of the virtual image at the next moment is predicted from the sight line track characteristic of the target object, the sight line track characteristic of the virtual image, and the sight line direction of the virtual image at the current moment. To realize sight line interaction between the virtual image and the target object, the variation of the sight line direction needed for interaction can be predicted from the predicted future segment of the target object's sight line track characteristic and the current sight line direction of the virtual image; the sight line direction of the virtual image is then adjusted by this variation to obtain the sight line direction at the next moment, thereby realizing sight line interaction between the virtual image and the target object. Such sight line interaction improves the sense of reality during the interaction and thus the interactive experience of the target object.
In the present embodiment, different emotion combinations of the target object and the virtual image lead to different predicted interaction state parameters at the next moment, because sight line interaction follows different rules under different emotion combinations. For example, if the emotional state of the target object is anger and the emotional state of the virtual image at the next moment is joy, the sight lines of the virtual image and the target object need to interact in real time; if the emotional state of the target object is anger and the emotional state of the virtual image at the next moment is aggrieved, their sight lines do not interact in real time and may meet only occasionally; and if the emotional state of the target object is joy and the emotional state of the virtual image at the next moment is aggrieved, their sight lines interact neither fully in real time nor only occasionally, but most of the time, with occasional breaks. Therefore, the interaction state parameters of the virtual image at the next moment need to be predicted under the emotion combination scene formed by the emotional state of the target object and the emotional state of the virtual image at the next moment, so that the predicted parameters make the sight line interaction between the virtual image and the target object better fit the current emotion combination scene, improving the sense of reality during the interaction and the interactive experience of the target object.
Furthermore, in this embodiment the interaction state parameters may further include an expression parameter and/or a head pose parameter. During the interaction between the avatar and the target object, the head pose of the avatar changes together with the sight line direction as the emotional state changes, and the expression coefficients change with the head pose and the emotional state. The embodiment may therefore also predict the variation of the expression parameter and/or the head pose parameter, and determine the expression parameter and/or the head pose parameter of the avatar at the next moment from the corresponding parameter at the current moment plus its predicted variation, which further improves the sense of reality and the interactive experience of the target object during the interaction.
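A minimal sketch of how the predicted variations could be applied to the current interaction state parameters is shown below; the parameter encodings (unit sight direction vector, Euler head pose, expression coefficients in [0, 1]) are assumptions for illustration.

```python
# Hypothetical update of the interaction state parameters with predicted offsets.
from dataclasses import dataclass
import numpy as np

@dataclass
class InteractionState:
    sight_dir: np.ndarray    # unit vector for the sight line direction (assumed encoding)
    head_pose: np.ndarray    # yaw, pitch, roll in radians (assumed encoding)
    expression: np.ndarray   # expression coefficients in [0, 1] (assumed encoding)

def next_interaction_state(current: InteractionState,
                           sight_delta: np.ndarray,
                           head_delta: np.ndarray,
                           expr_delta: np.ndarray) -> InteractionState:
    """Add the predicted per-parameter variations to the current state."""
    sight = current.sight_dir + sight_delta
    return InteractionState(
        sight_dir=sight / np.linalg.norm(sight),   # keep the direction normalised
        head_pose=current.head_pose + head_delta,
        expression=np.clip(current.expression + expr_delta, 0.0, 1.0))
```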
According to the introduction, the virtual human interaction method provided by the embodiment of the application determines the sight line track characteristics of the target object and the emotional state of the target object by performing sight line tracking processing and emotion analysis processing on the audio and video data of the target object; predicting and determining the emotional state of the virtual image at the next moment according to the emotional state of the target object and the current emotional state of the virtual image corresponding to the target object; and predicting and determining the interaction state parameters of the virtual image at the next moment according to the sight line track characteristics of the target object, the emotional state of the virtual image at the next moment, the interaction state parameters of the virtual human at the current moment and the sight line track characteristics of the virtual image, wherein the interaction state parameters at least comprise sight line directions. By adopting the technical scheme of the embodiment, the visual line interactive prediction of the virtual image can be carried out according to the visual line track characteristics of the user based on the emotional state of the target object and the emotional state of the virtual image, so that the visual line interaction of the virtual image and the target object in different emotional states is realized, and the reality sense and the interactive experience of the interaction process of the target object and the virtual image are improved.
As an optional implementation manner, referring to fig. 2, another embodiment of the present application discloses that, in step S101, performing gaze tracking processing on audio/video data of a target object to determine a gaze track feature of the target object, including:
s201, extracting target video data from the audio and video data of the target object.
Specifically, to reduce data processing and ensure both the efficiency of sight line tracking and the performance of the interaction device, it is not necessary to extract the complete history of the target object's sight line; only the sight line track within a recent time range is needed. The embodiment therefore extracts the required target video data from the audio and video data of the target object. Because the interaction between the avatar and the target object is real-time and takes place at the current moment, the target video data to be extracted is the video data within a preset duration before the video frame at the current moment (including the current frame) in the audio and video data of the target object.
S202, performing sight tracking processing based on the target video data, and determining sight track characteristics of the target object.
Specifically, in this embodiment, according to target video data extracted from audio and video data of a target object, gaze tracking is performed on the target object in the target video data, that is, according to a change of a gaze direction of the target object in the target video data, a gaze track feature of the target object is determined, which includes the following specific steps:
firstly, a first sight track characteristic corresponding to target video data is obtained by extracting sight characteristics corresponding to each video frame in the target video data.
In this embodiment, each video frame in the target video data is analyzed to determine the sight line characteristic of the target object in that frame, that is, the sight line direction of the target object in the frame. The sight line characteristics are then combined in the order of the video frames into a matrix composed of the sight line characteristics of the target object in all the frames, and this matrix is used as the first sight line track characteristic corresponding to the target video data.
And secondly, predicting a second sight line track characteristic within a preset time length in the future according to the first sight line track characteristic.
The first sight line track characteristic corresponding to the target video data is the sight line track actually produced by the target object within the preset duration up to the current moment; this embodiment further predicts, from the first sight line track characteristic, a second sight line track characteristic for a preset duration in the future.
Thirdly, the combination of the first sight line track characteristic and the second sight line track characteristic is used as the sight line track characteristic of the target object.
In this embodiment, the first sight line track characteristic actually generated by the target object and the predicted second sight line track characteristic need to be spliced, and the spliced sight line track characteristic is used as the sight line track characteristic of the target object.
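A minimal sketch of this splicing step is given below; the 3-D direction encoding and the placeholder linear-extrapolation predictor are assumptions (the embodiment may instead use, for example, the autoregressive model described above for the future segment).

```python
# Hypothetical construction of the target object's sight line track characteristic.
import numpy as np

def build_sight_trajectory_feature(per_frame_sight_dirs, future_steps: int = 25) -> np.ndarray:
    """Stack per-frame sight directions (first feature), predict a future segment
    (second feature) and concatenate the two."""
    first = np.stack(per_frame_sight_dirs)          # (T_obs, 3) observed trajectory
    # Placeholder predictor: linear extrapolation of the last inter-frame motion.
    velocity = first[-1] - first[-2]
    second = np.stack([first[-1] + (i + 1) * velocity for i in range(future_steps)])
    return np.concatenate([first, second], axis=0)  # spliced trajectory feature
```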
Further, when performing emotion analysis on the audio and video data of the target object, data processing can likewise be reduced, and the efficiency of the analysis and the performance of the interaction device ensured, by extracting only the audio and video data within a certain time range. To keep the emotion analysis accurate, the data closest to the current moment should be used, so the audio and video data within a preset duration before the audio/video frame at the current moment (including the current frame) may be extracted from the audio and video data of the target object.
As an alternative implementation manner, referring to fig. 3, another embodiment of the present application discloses that, in step S103, the predicting and determining the interaction state parameter of the avatar at the next time according to the sight line trajectory feature of the target object, the emotional state of the avatar at the next time, the interaction state parameter of the avatar at the current time, and the sight line trajectory feature of the avatar includes:
s301, determining an interaction rule according to the emotional state of the target object and the emotional state of the virtual image at the next moment.
Specifically, in the actual interaction between people, the emotional states are different, and the interaction rules are different, for example, when the emotional states of two people are excited or positive, the line-of-sight interaction between the two people is more frequent, the expression change is richer, and when the emotional states of the two people are negative or sad, the line-of-sight interaction between the two people is less, and the expression change is finer. Therefore, in order to improve the reality of the interaction between the target object and the avatar, the interaction rule of the target object and the avatar at the next moment needs to be determined according to the emotional state of the target object and the emotional state of the avatar at the next moment. According to the embodiment, the interaction rules of the human and the human in different emotional states in the real interaction scene can be summarized, and the interaction rules corresponding to various emotional state combinations are recorded. And then selecting an interaction rule which is consistent with the combination of the emotional state of the target object and the emotional state of the virtual image at the next moment from all the recorded interaction rules as the interaction rule of the target object and the virtual image at the next moment. Wherein the interaction rule at least comprises a sight line interaction rule.
For example, if the emotional state of the target object is anger and the emotional state of the avatar at the next moment is joy, the sight lines of the avatar and the target object need to interact in real time; if the emotional state of the target object is anger and the emotional state of the avatar at the next moment is aggrieved, their sight lines do not interact in real time and may meet only occasionally; and if the emotional state of the target object is joy and the emotional state of the avatar at the next moment is aggrieved, their sight lines interact neither fully in real time nor only occasionally, but most of the time, with occasional breaks. Accordingly, the sight line interaction rule may include the frequency of sight line interaction and where the sight line is displaced when the sight lines do not meet, for example whether the sight line drifts to the face or the body of the target object, and the range of that displacement under different emotion combinations.
In addition, the interaction rules may also include expression interaction rules, for example, when the emotional states of the avatar and the target object are both relatively positive emotional states, the expression change of the avatar may be changed to a relatively large extent, when the emotional states of the avatar and the target object are both relatively negative emotional states, the expression change of the avatar may be changed to a relatively small extent, and when one of the emotional states of the avatar and the target object is relatively negative, the expression change of the avatar may be changed to a medium extent, etc.
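A minimal sketch of how such summarized interaction rules might be recorded and selected is shown below; the rule fields and the numeric values are illustrative assumptions only.

```python
# Hypothetical rule table keyed by (target object's emotion, avatar's next-moment emotion).
from dataclasses import dataclass

@dataclass
class InteractionRule:
    eye_contact_ratio: float   # fraction of time the sight lines should meet
    drift_range_deg: float     # how far the sight line may drift when not meeting
    expression_gain: float     # magnitude of expression changes

INTERACTION_RULES = {
    ("anger", "joy"):       InteractionRule(0.9, 5.0, 1.0),
    ("anger", "aggrieved"): InteractionRule(0.3, 20.0, 0.4),
    ("joy", "aggrieved"):   InteractionRule(0.7, 10.0, 0.6),
}

def select_interaction_rule(target_emotion: str, avatar_next_emotion: str) -> InteractionRule:
    # Fall back to a moderate rule for combinations not explicitly recorded.
    return INTERACTION_RULES.get((target_emotion, avatar_next_emotion),
                                 InteractionRule(0.6, 10.0, 0.7))
```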
S302, based on the interaction rule, the interaction state parameter of the virtual image at the next moment is obtained through prediction by the sight line track characteristic of the target object, the interaction state parameter of the virtual image at the current moment and the sight line track characteristic of the virtual image.
Specifically, after determining the interaction rule of the target object and the virtual image at the next moment, this embodiment, based on the sight line interaction rule, predicts the sight line direction offset of the virtual image at the next moment from the sight line track characteristic of the target object, the sight line track characteristic of the virtual image and the sight line direction of the virtual image at the current moment, and determines the sight line direction at the next moment from the current sight line direction and this offset; and/or, based on the sight line interaction rule, predicts the head pose offset of the virtual image at the next moment from the sight line track characteristic of the target object, the sight line track characteristic of the virtual image and the head pose parameter of the virtual image at the current moment, and determines the head pose parameter at the next moment from the current head pose parameter and this offset; and/or, based on the expression interaction rule, predicts the expression variation of the virtual image at the next moment and determines the expression parameter at the next moment from the expression parameter of the virtual image at the current moment and this variation.
In addition, this embodiment may have parameter prediction models learn the interaction rules of different emotional state combinations, yielding one parameter prediction model per emotional state combination. A target parameter prediction model is then selected from all the parameter prediction models according to the combination of the emotional state of the target object and the emotional state of the virtual image at the next moment. The sight line track characteristic of the target object, the interaction state parameters of the virtual image at the current moment and the sight line track characteristic of the virtual image are input into the target parameter prediction model, which predicts the variation of the interaction state parameters of the virtual image at the next moment; the interaction state parameters at the next moment are then obtained from the interaction state parameters at the current moment and this variation. The target parameter prediction model predicts the interaction state parameters according to the interaction rule that applies when the target object is in a first emotional state and the virtual image is in a second emotional state, where the first emotional state is the emotional state of the target object and the second emotional state is the emotional state of the virtual image at the next moment; the two may be the same or different.
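The sketch below illustrates this per-combination arrangement: one small prediction model per emotion combination, outputting the variation of the interaction state parameters. The network shape and the feature dimensions (flattened trajectories, a 10-dimensional state) are assumptions for illustration.

```python
# Hypothetical per-emotion-combination parameter prediction model.
import torch
import torch.nn as nn

class DeltaPredictor(nn.Module):
    """Predicts the variation of the interaction state parameters for one emotion combination."""
    def __init__(self, traj_dim: int, state_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * traj_dim + state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim))

    def forward(self, target_traj, avatar_traj, current_state):
        x = torch.cat([target_traj, avatar_traj, current_state], dim=-1)
        return self.net(x)

# One model per (target emotion, avatar next-moment emotion) combination.
MODELS = {("anger", "joy"): DeltaPredictor(traj_dim=150, state_dim=10)}

def predict_next_parameters(emotion_pair, target_traj, avatar_traj, current_state):
    model = MODELS[emotion_pair]                   # the selected target parameter prediction model
    delta = model(target_traj, avatar_traj, current_state)
    return current_state + delta                   # interaction state parameters at the next moment
```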
The training process of the parameter prediction model corresponding to the emotional state combination of the target object which is the first emotional state and the virtual image which is the second emotional state is as follows:
the method comprises the steps of obtaining the emotional state of a sample object as a first emotional state, obtaining the emotional state of a virtual image as a second emotional state, and when the sample object is interacted with the virtual object, carrying out sight tracking processing on first video data of the sample object according to the first video data of the sample object to obtain sight track characteristics of the sample object.
And then, acquiring the sight line track characteristic and the interaction state parameter of the avatar at the moment of the first video data, and the interaction state parameter of the avatar at the next moment of the first video data, and calculating the actual variation between the interaction state parameter of the avatar at the moment of the first video data and the interaction state parameter of the avatar at the next moment of the first video data.
The sight line track characteristic of the sample object, together with the sight line track characteristic and the interaction state parameters of the virtual image at the moment of the first video data, are input into the parameter prediction model, which outputs a sample variation; a loss function between the sample variation and the actual variation is then calculated and used to adjust the parameters of the prediction model.
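A minimal sketch of this training step, reusing a delta-style prediction model as above, is shown below; the choice of mean-squared-error as the loss is an assumption, since the embodiment only requires some loss between the sample variation and the actual variation.

```python
# Hypothetical training step for the parameter prediction model.
import torch.nn.functional as F

def train_delta_model(model, optimizer, sample_traj, avatar_traj, state_t, state_t_next):
    """One parameter update for the emotion combination this model belongs to."""
    actual_delta = state_t_next - state_t             # actual variation between the two moments
    predicted_delta = model(sample_traj, avatar_traj, state_t)
    loss = F.mse_loss(predicted_delta, actual_delta)  # assumed loss form
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```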
As an optional implementation manner, referring to fig. 4, another embodiment of the present application discloses that the virtual human interaction method further includes:
s402, judging whether the target object is in a state of interacting with the virtual image corresponding to the target object according to the audio and video data of the target object.
Specifically, this embodiment analyzes the audio and video data of the target object to determine whether the target object is in a state of interacting with its corresponding avatar. For the video data, it is analyzed whether the face of the target object is oriented towards the avatar and whether the sight line is directed at the avatar; for the audio data, it is analyzed whether interactive speech is present. For example, an object merely passing by the interaction device, an object standing in front of the interaction device without speaking, or no object in front of the device while speech of a passer-by is picked up all indicate that the target object is not in a state of interacting with its corresponding avatar.
In this embodiment, an interactive state analysis network may also be trained in advance, and audio and video data of the target object is input into the interactive state analysis network trained in advance to obtain an output result for determining whether the target object is in a state of performing interaction with an avatar corresponding to the target object, where 1 and 0 may be used as the output result, 0 represents that the target object is not in a state of performing interaction with the avatar corresponding to the target object, and 1 represents that the target object is in a state of performing interaction with the avatar corresponding to the target object. The interactive state analysis network can adopt a classification model, the classification model is trained by using sample audio and video data carrying interactive state labels, the parameter of the classification model is adjusted according to a loss function between an output result and the carried interactive state labels, and the trained classification model is used as the interactive state analysis network.
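A minimal sketch of such a binary interaction state classifier is given below; the fused audio-video feature dimension and the network shape are assumptions for illustration.

```python
# Hypothetical interaction state analysis network: 1 = interacting, 0 = not interacting.
import torch
import torch.nn as nn

class InteractionStateNet(nn.Module):
    def __init__(self, feature_dim: int = 256):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(feature_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, av_features: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.classifier(av_features))  # probability of "interacting"

def is_interacting(net: InteractionStateNet, av_features: torch.Tensor) -> int:
    return int(net(av_features).item() > 0.5)  # 1 or 0, matching the output convention above
```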
Further, when analyzing the interaction state from the audio and video data of the target object, data processing can likewise be reduced, and the efficiency of the analysis and the performance of the interaction device ensured, by extracting only the audio and video data within a certain time range. To keep the interaction state analysis accurate, the data closest to the current moment should be used, so the audio and video data within a preset duration before the audio/video frame at the current moment (including the current frame) may be extracted from the audio and video data of the target object.
S403, if the target object is not in the state of interacting with the virtual image corresponding to the target object, determining the interaction state parameter of the virtual image at the next moment as a preset resting state parameter.
If the target object is judged not to be in a state of interacting with its corresponding virtual image, the interaction state parameters of the virtual image at the next moment are set to preset rest state parameters, and the virtual image is driven back to the rest state at the next moment. The rest state may be defined as the virtual image showing no expression, with the head pose and sight line direction facing straight ahead; the rest state parameters may therefore include the head pose parameter, the expression parameter and the sight line direction of the virtual image in the rest state.
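A minimal sketch of such preset rest state parameters and their use is shown below; the concrete encodings (unit sight direction, Euler head pose, 52 expression coefficients) are assumptions for illustration.

```python
# Hypothetical preset rest state used when the target object is not interacting.
import numpy as np

REST_STATE = {
    "sight_dir": np.array([0.0, 0.0, 1.0]),  # sight line straight ahead
    "head_pose": np.zeros(3),                # head facing straight ahead
    "expression": np.zeros(52),              # no expression (assumed 52 coefficients)
}

def choose_next_state(interacting: bool, predicted_state: dict) -> dict:
    # Fall back to the preset rest state parameters when not interacting.
    return predicted_state if interacting else REST_STATE
```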
S404, if the target object is in a state of interacting with the virtual image corresponding to the target object, predicting and determining the next-moment emotional state of the virtual image according to the emotional state of the target object and the current emotional state of the virtual image corresponding to the target object.
Step S401 in fig. 4 is the same as step S101 in fig. 1, and steps S404 to S405 in fig. 4 are the same as steps S102 to S103 in fig. 1, and steps S401 and S404 to S405 are not specifically described again in this embodiment. Step S402 and step S403 may be executed before step S101, before step S102, or before step S103, and this embodiment is not limited.
Corresponding to the above virtual human interaction method, an embodiment of the present application further provides a virtual human interaction device, as shown in fig. 5, the device includes:
the first determining module 100 is configured to determine a sight track characteristic of the target object and an emotional state of the target object by performing sight tracking processing and emotion analysis processing on audio and video data of the target object;
a second determining module 110, configured to predict and determine an emotional state of the avatar at a next time according to the emotional state of the target object and the current emotional state of the avatar corresponding to the target object;
the interaction state parameter determining module 120 is configured to predict and determine an interaction state parameter of the avatar at a next moment according to the sight line trajectory feature of the target object, the emotional state of the avatar at the next moment, the interaction state parameter of the avatar at the current moment, and the sight line trajectory feature of the avatar, where the interaction state parameter at least includes a sight line direction.
In the virtual human interaction device provided by this embodiment of the present application, the first determining module 100 determines the sight line trajectory feature of the target object and the emotional state of the target object by performing sight line tracking processing and emotion analysis processing on the audio and video data of the target object; the second determining module 110 predicts and determines the emotional state of the avatar at the next moment according to the emotional state of the target object and the current emotional state of the avatar corresponding to the target object; and the interaction state parameter determining module 120 predicts and determines the interaction state parameter of the avatar at the next moment according to the sight line trajectory feature of the target object, the emotional state of the avatar at the next moment, the interaction state parameter of the avatar at the current moment, and the sight line trajectory feature of the avatar, where the interaction state parameter at least includes a sight line direction. With this technical solution, sight line interaction of the avatar can be predicted from the sight line trajectory feature of the user on the basis of the emotional state of the target object and the emotional state of the avatar, so that sight line interaction between the avatar and the target object under different emotional states is realized, and the sense of reality and the interactive experience of the interaction between the target object and the avatar are improved.
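To make the data flow between the three modules concrete, the following sketch chains them for a single timestep. The three stage functions are placeholders standing in for the trained networks and prediction models; only the order of the calls and the quantities exchanged between them follow the description above.

```python
# End-to-end sketch of one update cycle of the device described above. The three
# stage functions are placeholders standing in for the trained networks/models;
# only the data flow between the modules follows the description.
def track_gaze_and_emotion(av_clip):
    # First determining module: gaze trajectory feature + user emotional state.
    gaze_traj = [(0.0, 0.0)] * len(av_clip)   # placeholder trajectory
    user_emotion = "neutral"                   # placeholder emotion label
    return gaze_traj, user_emotion

def predict_avatar_emotion(user_emotion, avatar_emotion_now):
    # Second determining module: avatar emotion at the next moment.
    return user_emotion if avatar_emotion_now == "neutral" else avatar_emotion_now

def predict_interaction_params(gaze_traj, avatar_emotion_next, params_now, avatar_gaze_traj):
    # Interaction state parameter determining module: at least a gaze direction.
    return {"gaze_direction": gaze_traj[-1], "emotion": avatar_emotion_next}

# One timestep of the interaction loop.
av_clip = [object()] * 75                       # stand-in for recent A/V frames
avatar_state = {"emotion": "neutral", "params": {"gaze_direction": (0.0, 0.0)}}
avatar_gaze_traj = [(0.0, 0.0)] * 75

gaze_traj, user_emotion = track_gaze_and_emotion(av_clip)
avatar_emotion_next = predict_avatar_emotion(user_emotion, avatar_state["emotion"])
params_next = predict_interaction_params(
    gaze_traj, avatar_emotion_next, avatar_state["params"], avatar_gaze_traj)
avatar_state = {"emotion": avatar_emotion_next, "params": params_next}
```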
As an optional implementation manner, another embodiment of the present application further discloses that the first determining module 100 includes: a video extraction unit and a gaze tracking unit.
The video extraction unit is used for extracting target video data from audio and video data of a target object, wherein the target video data is video data within a preset time length before a video frame at the current moment in the audio and video data;
and the sight tracking unit is used for performing sight tracking processing based on the target video data and determining the sight track characteristic of the target object.
As an optional implementation manner, another embodiment of the present application further discloses that the gaze tracking unit is specifically configured to:
obtaining a first sight track characteristic corresponding to target video data by extracting sight characteristics corresponding to all video frames in the target video data;
predicting a second sight line track characteristic within a preset time length in the future according to the first sight line track characteristic;
and taking the combination of the first sight line track characteristic and the second sight line track characteristic as the sight line track characteristic of the target object.
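The combination of the observed (first) and predicted (second) sight line trajectory features performed by the gaze tracking unit above can be sketched as below; a simple linear extrapolation stands in for the trained trajectory predictor, and the array layout is an assumption of the illustration.

```python
# Sketch of composing the gaze trajectory feature: per-frame gaze features
# observed over the recent window (first feature) plus a predicted continuation
# (second feature). Linear extrapolation stands in for the trained predictor.
import numpy as np

def first_trajectory_feature(gaze_per_frame):
    """gaze_per_frame: (T, 2) array of per-frame gaze angles from the window."""
    return np.asarray(gaze_per_frame, dtype=float)

def second_trajectory_feature(first_feat, future_len=25):
    """Extrapolate the recent gaze velocity over the next `future_len` frames."""
    velocity = first_feat[-1] - first_feat[-2] if len(first_feat) > 1 else np.zeros(2)
    steps = np.arange(1, future_len + 1).reshape(-1, 1)
    return first_feat[-1] + steps * velocity

def gaze_trajectory_feature(gaze_per_frame, future_len=25):
    first = first_trajectory_feature(gaze_per_frame)
    second = second_trajectory_feature(first, future_len)
    return np.concatenate([first, second], axis=0)  # combined trajectory feature

traj = gaze_trajectory_feature([(0.0, 0.0), (0.5, 0.1), (1.0, 0.2)])
print(traj.shape)  # (3 + 25, 2): observed frames plus predicted future frames
```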
As an optional implementation manner, another embodiment of the present application further discloses that the first determining module 100 is specifically configured to:
inputting a video data stream in the audio and video data of the target object into a pre-trained sight line tracking network to obtain the sight line track characteristic of the target object;
and,
inputting the audio and video data of the target object into a pre-trained emotion analysis network to obtain the emotion state of the target object.
As an optional implementation manner, another embodiment of the present application further discloses that the interactive status parameter determining module 120 includes: the device comprises an interaction rule determining unit and a parameter predicting unit.
The interaction rule determining unit is used for determining an interaction rule according to the emotional state of the target object and the emotional state of the virtual image at the next moment, and the interaction rule at least comprises a sight line interaction rule;
and the parameter prediction unit is used for predicting the interaction state parameter of the virtual image at the next moment according to the sight line track characteristic of the target object, the interaction state parameter of the virtual image at the current moment and the sight line track characteristic of the virtual image based on the interaction rule.
As an optional implementation manner, another embodiment of the present application further discloses that the parameter prediction unit is specifically configured to:
inputting the sight line track characteristic of the target object, the interaction state parameter of the virtual image at the current moment and the sight line track characteristic of the virtual image into a target parameter prediction model to obtain the interaction state parameter of the virtual image at the next moment;
the target parameter prediction model is used for predicting interaction state parameters of the virtual image according to interaction rules when the target object is in a first emotion state and the virtual image is in a second emotion state, wherein the first emotion state is the emotion state of the target object, and the second emotion state is the emotion state of the virtual image at the next moment.
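One possible way to realize such per-emotion-pair target parameter prediction models is a dispatch table keyed by the emotion pair, as sketched below; the two example rules and all identifiers are hypothetical and only illustrate the selection mechanism, not any concrete interaction rule from the embodiment.

```python
# Sketch of selecting the target parameter prediction model by the emotion pair
# (user emotion, avatar emotion at the next moment). The per-pair predictors are
# placeholder callables; the dictionary-based dispatch is the illustrated idea.
def follow_gaze(user_gaze_traj, params_now, avatar_gaze_traj):
    # e.g. an "attentive" rule: look where the user is predicted to look next.
    return {**params_now, "gaze_direction": tuple(user_gaze_traj[-1])}

def avert_gaze(user_gaze_traj, params_now, avatar_gaze_traj):
    # e.g. a rule for a calmer avatar: keep the current gaze, look slightly down.
    x, y = params_now["gaze_direction"]
    return {**params_now, "gaze_direction": (x, y - 10.0)}

PREDICTORS = {
    ("happy", "happy"): follow_gaze,
    ("angry", "calm"): avert_gaze,
}

def predict_next_params(user_emotion, avatar_emotion_next,
                        user_gaze_traj, params_now, avatar_gaze_traj):
    model = PREDICTORS.get((user_emotion, avatar_emotion_next), follow_gaze)
    return model(user_gaze_traj, params_now, avatar_gaze_traj)

params = predict_next_params("happy", "happy",
                             [(5.0, 2.0)], {"gaze_direction": (0.0, 0.0)}, [(0.0, 0.0)])
print(params)  # {'gaze_direction': (5.0, 2.0)}
```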
As an optional implementation manner, another embodiment of the present application further discloses that the virtual human interaction apparatus further includes: the device comprises an interaction state judgment module and a rest module.
The interactive state judging module is used for judging whether the target object is in an interactive state with the virtual image corresponding to the target object according to the audio and video data of the target object;
and the resting module is used for determining the interaction state parameter of the virtual image at the next moment as the preset resting state parameter if the target object is not in the state of interacting with the virtual image corresponding to the target object.
As an optional implementation manner, another embodiment of the present application further discloses that the second determining module 110 is specifically configured to:
if the target object is in a state of interacting with the virtual image corresponding to the target object, predicting and determining the emotional state of the virtual image at the next moment according to the emotional state of the target object and the current emotional state of the virtual image corresponding to the target object.
As an optional implementation manner, another embodiment of the present application further discloses that the interaction state parameters further include an expression parameter and/or a head posture parameter.
The virtual human interaction device provided by this embodiment of the present application belongs to the same application concept as the virtual human interaction method provided by the foregoing embodiments of the present application; it can execute the virtual human interaction method provided by any of those embodiments and has the functional modules and beneficial effects corresponding to that method. For technical details not described in detail in this embodiment, reference may be made to the specific processing content of the virtual human interaction method provided in the foregoing embodiments, which is not repeated here.
Another embodiment of the present application further discloses an electronic device, as shown in fig. 6, the electronic device includes:
a memory 200 and a processor 210;
wherein, the memory 200 is connected to the processor 210 for storing programs;
the processor 210 is configured to implement the virtual human interaction method disclosed in any of the above embodiments by running the program stored in the memory 200.
Specifically, the electronic device may further include: a bus, a communication interface 220, an input device 230, and an output device 240.
The processor 210, the memory 200, the communication interface 220, the input device 230, and the output device 240 are connected to each other through a bus. Wherein:
a bus may comprise a path that transfers information between components of a computer system.
The processor 210 may be a general-purpose processor, such as a general-purpose central processing unit (CPU) or a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to control the execution of the programs of the present invention. It may also be a digital signal processor (DSP), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
The processor 210 may include a main processor and may also include a baseband chip, modem, and the like.
The memory 200 stores a program for executing the technical solution of the present invention and may also store an operating system and other key services. Specifically, the program may include program code, and the program code includes computer operating instructions. More specifically, the memory 200 may include a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, a magnetic disk storage, a flash memory, and so forth.
The input device 230 may include a means for receiving data and information input by a user, such as a keyboard, mouse, camera, scanner, light pen, voice input device, touch screen, pedometer, or gravity sensor, among others.
Output device 240 may include equipment that allows output of information to a user, such as a display screen, printer, speakers, etc.
The communication interface 220 may include any apparatus, such as a transceiver, for communicating with another device or a communication network, such as Ethernet, a radio access network (RAN), or a wireless local area network (WLAN).
The processor 210 executes the program stored in the memory 200 and invokes the other devices described above, so as to implement the steps of the virtual human interaction method provided by the embodiments of the present application.
Another embodiment of the present application further provides a storage medium, where a computer program is stored on the storage medium, and when the computer program is executed by a processor, the steps of the virtual human interaction method provided in any of the above embodiments are implemented.
It should be noted that, for simplicity of description, the foregoing method embodiments are described as a series of combinations of acts, but those of ordinary skill in the art will appreciate that the present application is not limited by the described order of acts, as some steps may be performed in other orders or concurrently in accordance with the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments, and that the acts and modules involved are not necessarily required by the present application.
It should be noted that the embodiments in this specification are described in a progressive manner, each embodiment focuses on its differences from the other embodiments, and for identical or similar parts among the embodiments, reference may be made to one another. Since the device embodiment is basically similar to the method embodiment, it is described relatively briefly, and for relevant points, reference may be made to the corresponding description of the method embodiment.
The steps in the method of each embodiment of the present application may be sequentially adjusted, combined, and deleted according to actual needs.
The modules and sub-modules in the device and the terminal of the embodiment of the application can be combined, divided and deleted according to actual needs.
In the several embodiments provided in the present application, it should be understood that the disclosed terminal, apparatus and method may be implemented in other manners. For example, the above-described terminal embodiments are merely illustrative, and for example, the division of a module or a sub-module is only one logical function division, and other division manners may be available in actual implementation, for example, a plurality of sub-modules or modules may be combined or integrated into another module, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.
The modules or sub-modules described as separate components may or may not be physically separate, and the components described as modules or sub-modules may or may not be physical modules or sub-modules, may be located in one place, or may be distributed on a plurality of network modules or sub-modules. Some or all of the modules or sub-modules can be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, each functional module or sub-module in the embodiments of the present application may be integrated into one processing module, or each module or sub-module may exist alone physically, or two or more modules or sub-modules may be integrated into one module. The integrated modules or sub-modules can be implemented in the form of hardware, and can also be implemented in the form of software functional modules or sub-modules.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software unit executed by a processor, or in a combination of the two. The software unit may reside in a random access memory (RAM), a read-only memory (ROM), an electrically programmable ROM, an electrically erasable programmable ROM, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it should also be noted that, in this document, relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between these entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (12)

1. A virtual human interaction method is characterized by comprising the following steps:
the method comprises the steps of determining a sight line track characteristic of a target object and an emotional state of the target object by performing sight line tracking processing and emotion analysis processing on audio and video data of the target object; wherein the sight line track characteristic of the target object comprises: a first sight line track characteristic determined according to sight line characteristics corresponding to video frames in the audio and video data of the target object, and a second sight line track characteristic predicted for a preset time length in the future according to the first sight line track characteristic;
predicting and determining the emotional state of the virtual image at the next moment according to the emotional state of the target object and the current emotional state of the virtual image corresponding to the target object;
and predicting and determining the interaction state parameters of the virtual image at the next moment according to the sight line track characteristics of the target object, the emotional state of the virtual image at the next moment, the interaction state parameters of the virtual image at the current moment and the sight line track characteristics of the virtual image, wherein the interaction state parameters at least comprise sight line directions.
2. The method of claim 1, wherein determining the sight-line trajectory characteristic of the target object by performing sight-line tracking processing on audio and video data of the target object comprises:
extracting target video data from audio and video data of a target object, wherein the target video data is video data in a preset time length before a video frame at the current moment in the audio and video data;
and performing sight tracking processing based on the target video data, and determining sight track characteristics of the target object.
3. The method of claim 2, wherein performing gaze tracking processing based on the target video data to determine gaze trajectory characteristics of the target object comprises:
obtaining a first sight track characteristic corresponding to the target video data by extracting sight characteristics corresponding to each video frame in the target video data;
predicting a second sight line track characteristic within a preset time length in the future according to the first sight line track characteristic;
and taking the combination of the first sight line track characteristic and the second sight line track characteristic as the sight line track characteristic of the target object.
4. The method of claim 1, wherein the determining of the sight line track characteristic of the target object and the emotional state of the target object by performing sight line tracking processing and emotion analysis processing on audio and video data of the target object comprises:
inputting a video data stream in audio and video data of a target object into a pre-trained sight tracking network to obtain sight track characteristics of the target object;
and,
inputting the audio and video data of the target object into a pre-trained emotion analysis network to obtain the emotion state of the target object.
5. The method of claim 1, wherein predicting the interaction state parameter of the avatar at the next time based on the gaze track characteristics of the target object, the emotional state of the avatar at the next time, the interaction state parameter of the avatar at the current time, and the gaze track characteristics of the avatar comprises:
determining an interaction rule according to the emotional state of the target object and the emotional state of the virtual image at the next moment, wherein the interaction rule at least comprises a sight line interaction rule;
and predicting the interaction state parameters of the virtual image at the next moment according to the sight line track characteristics of the target object, the interaction state parameters of the virtual image at the current moment and the sight line track characteristics of the virtual image based on the interaction rule.
6. The method of claim 5, wherein predicting the interaction state parameters of the avatar at the next moment based on the interaction rules from the gaze track characteristics of the target object, the interaction state parameters of the avatar at the current moment, and the gaze track characteristics of the avatar comprises:
inputting the sight line track characteristic of the target object, the interaction state parameter of the virtual image at the current moment and the sight line track characteristic of the virtual image into a target parameter prediction model to obtain the interaction state parameter of the virtual image at the next moment;
the target parameter prediction model is used for predicting interaction state parameters of the virtual image according to interaction rules when the target object is in a first emotion state and the virtual image is in a second emotion state, wherein the first emotion state is the emotion state of the target object, and the second emotion state is the emotion state of the virtual image at the next moment.
7. The method of claim 1, further comprising:
judging whether the target object is in a state of interacting with an avatar corresponding to the target object according to audio and video data of the target object;
and if the target object is not in the state of interacting with the virtual image corresponding to the target object, determining the interaction state parameter of the virtual image at the next moment as a preset resting state parameter.
8. The method of claim 7, wherein the step of predicting and determining the emotional state of the avatar at the next moment according to the emotional state of the target object and the current emotional state of the avatar corresponding to the target object comprises:
and if the target object is in a state of interacting with the virtual image corresponding to the target object, predicting and determining the emotional state of the virtual image at the next moment according to the emotional state of the target object and the current emotional state of the virtual image corresponding to the target object.
9. The method of claim 1, wherein the interaction state parameters further comprise an expression parameter and/or a head pose parameter.
10. A virtual human interaction device is characterized by comprising:
the first determining module is used for determining a sight line track characteristic of a target object and an emotional state of the target object by performing sight line tracking processing and emotion analysis processing on audio and video data of the target object; wherein the sight line track characteristic of the target object comprises: a first sight line track characteristic determined according to sight line characteristics corresponding to video frames in the audio and video data of the target object, and a second sight line track characteristic predicted for a preset time length in the future according to the first sight line track characteristic;
the second determining module is used for predicting and determining the emotional state of the virtual image at the next moment according to the emotional state of the target object and the current emotional state of the virtual image corresponding to the target object;
and the interaction state parameter determining module is used for predicting and determining the interaction state parameter of the virtual image at the next moment according to the sight line track characteristic of the target object, the emotional state of the virtual image at the next moment, the interaction state parameter of the virtual image at the current moment and the sight line track characteristic of the virtual image, wherein the interaction state parameter at least comprises a sight line direction.
11. An electronic device, comprising: a memory and a processor;
wherein the memory is connected with the processor and used for storing programs;
the processor is used for realizing the virtual human interaction method as claimed in any one of claims 1 to 9 by running the program in the memory.
12. A storage medium, characterized in that a computer program is stored thereon, which, when executed by a processor, implements the virtual human interaction method as claimed in any one of claims 1 to 9.
CN202211326573.4A 2022-10-27 2022-10-27 Virtual human interaction method and device, electronic equipment and storage medium Active CN115390678B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211326573.4A CN115390678B (en) 2022-10-27 2022-10-27 Virtual human interaction method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211326573.4A CN115390678B (en) 2022-10-27 2022-10-27 Virtual human interaction method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN115390678A CN115390678A (en) 2022-11-25
CN115390678B true CN115390678B (en) 2023-03-31

Family

ID=84128683

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211326573.4A Active CN115390678B (en) 2022-10-27 2022-10-27 Virtual human interaction method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115390678B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117727303A (en) * 2024-02-08 2024-03-19 翌东寰球(深圳)数字科技有限公司 Audio and video generation method, device, equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109324688A (en) * 2018-08-21 2019-02-12 北京光年无限科技有限公司 Exchange method and system based on visual human's behavioral standard
CN111880659A (en) * 2020-07-31 2020-11-03 北京市商汤科技开发有限公司 Virtual character control method and device, equipment and computer readable storage medium
WO2022048403A1 (en) * 2020-09-01 2022-03-10 魔珐(上海)信息科技有限公司 Virtual role-based multimodal interaction method, apparatus and system, storage medium, and terminal
CN114265543A (en) * 2021-12-29 2022-04-01 神思电子技术股份有限公司 Virtual digital human sight following interaction method
CN114385015A (en) * 2022-01-25 2022-04-22 京东方科技集团股份有限公司 Control method of virtual object and electronic equipment
CN114758381A (en) * 2022-03-28 2022-07-15 长沙千博信息技术有限公司 Virtual digital human video control method based on image recognition

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107765856A (en) * 2017-10-26 2018-03-06 北京光年无限科技有限公司 Visual human's visual processing method and system based on multi-modal interaction
CN107944542A (en) * 2017-11-21 2018-04-20 北京光年无限科技有限公司 A kind of multi-modal interactive output method and system based on visual human
CN108229642A (en) * 2017-12-28 2018-06-29 北京光年无限科技有限公司 Visual human's emotional ability shows output method and system
CN110609620B (en) * 2019-09-05 2020-11-17 深圳追一科技有限公司 Human-computer interaction method and device based on virtual image and electronic equipment
CN113448428B (en) * 2020-03-24 2023-04-25 中移(成都)信息通信科技有限公司 Sight focal point prediction method, device, equipment and computer storage medium
CN112308006A (en) * 2020-11-10 2021-02-02 深圳地平线机器人科技有限公司 Sight line area prediction model generation method and device, storage medium and electronic equipment
CN115206492A (en) * 2021-04-12 2022-10-18 中国科学院深圳先进技术研究院 Emotion recognition capability self-adaptive training method and device based on eye movement feedback

Also Published As

Publication number Publication date
CN115390678A (en) 2022-11-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant