CN107704169B - Virtual human state management method and system - Google Patents


Info

Publication number
CN107704169B
CN107704169B (application CN201710883643.9A)
Authority
CN
China
Prior art keywords
state
virtual human
data
interaction
intention
Prior art date
Legal status
Active
Application number
CN201710883643.9A
Other languages
Chinese (zh)
Other versions
CN107704169A (en)
Inventor
尚小维
李晓丹
Current Assignee
Beijing Virtual Point Technology Co Ltd
Original Assignee
Beijing Guangnian Wuxian Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Guangnian Wuxian Technology Co Ltd
Priority to CN201710883643.9A
Publication of CN107704169A
Application granted
Publication of CN107704169B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/048 Interaction techniques based on graphical user interfaces [GUI]
    • G06F 3/0484 Interaction techniques based on graphical user interfaces [GUI] for the control of specific functions or operations, e.g. selecting or manipulating an object, an image or a displayed text element, setting a parameter value or selecting a range
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G06F 16/3329 Natural language query formulation or dialogue systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The invention provides a virtual human state management method, which comprises the following steps: acquiring multi-modal input data; acquiring the multi-modal interaction history information and current running state of the interaction objects, as well as the current running state of the hardware device; parsing the multi-modal input data to obtain intention data; judging, according to the acquired data and the intention data, the new running state the virtual human is to enter; calling a state adjustment capability interface to enter the new running state and generating multi-modal output data in that state, wherein the multi-modal output data is associated with the virtual human's character, attributes, and skills; and outputting the multi-modal output data through the virtual human's avatar image. With the method and system for managing the virtual human's state, the virtual human can give the user a different interaction experience in each state and its multi-modal output is coordinated, which strengthens the user's visual and sensory engagement and improves the interactive experience.

Description

Virtual human state management method and system
Technical Field
The invention relates to the field of artificial intelligence, in particular to a method and a system for managing the state of a virtual human.
Background
The development of robotic chat interactive systems has been directed at mimicking human conversation. Early well-known chatbot applications, such as the Xiaoi chatbot and Siri on Apple phones, process received input (text or speech) and respond in an attempt to mimic human responses in context.
However, if they are to fully mimic human conversation and enrich the user's interactive experience, these existing chat systems are still far from satisfactory.
Disclosure of Invention
In order to solve the above problems, the present invention provides a virtual human state management method, wherein the virtual human is mounted on a hardware device that has an operating system and supports perception and control, is displayed in a preset area after being started, and has a specific avatar image, persona, social attributes, and skills. The method comprises the following steps:
acquiring multi-modal input data;
acquiring the multi-modal interaction history information and current running state of the interaction objects, as well as the current running state of the hardware device;
analyzing the multi-modal input data to obtain intention data;
judging a new running state to be entered by the virtual human according to the acquired data and the intention data;
calling a state adjustment capability interface to enter the new running state, and generating multi-modal output data in the new state, wherein the multi-modal output data is associated with the character, the attribute and the skill of the virtual human;
and outputting the multi-modal output data through the virtual human's avatar image.
According to one embodiment of the invention, the operating state is classified as: a passive response state, a triggered response state, an active interaction state, an autonomous activity state, a background running state, and an exit application state.
According to one embodiment of the invention, in the step of judging the new running state to be entered by the virtual human,
it is preferentially judged whether the intention data indicate a skill-exhibition intention; if so, the virtual human directly enters the passive response state.
According to one embodiment of the invention, in the step of judging the new running state to be entered by the virtual human,
it is judged whether the intention data indicate starting or stopping a hardware device; if so, the virtual human enters the triggered response state;
generating multimodal output data in the new state comprises:
and calling a capability interface to trigger the starting and stopping of the corresponding hardware equipment, and outputting multi-mode data associated with the intention data.
According to one embodiment of the invention, the virtual human actively enters the active interaction state during interaction with the interaction object, or enters the active interaction state within a preset time of interacting with the interaction object; when in the active interaction state, the virtual human actively initiates a topic or performs a skill exhibition.
According to an embodiment of the invention, the method further comprises,
entering the autonomous activity state and performing a particular skill exhibition in that state when none of the active interaction, passive response, triggered response, and exit states has been entered.
According to an embodiment of the invention, the method further comprises,
entering the background running state after the interaction ends or in a non-interaction state, in which state the display of the virtual human is hidden and the virtual human runs in the background.
According to an embodiment of the invention, the method further comprises: when the time the interactive system spends in the sleep state exceeds a preset duration, or the interaction object issues an exit instruction, causing the virtual human to enter the exit state.
According to another aspect of the invention, there is also provided a storage medium having stored thereon program code executable to perform the method steps of any of the above.
According to another aspect of the present invention, there is provided a virtual human state management system, wherein the virtual human is mounted on a hardware device that has an operating system and supports perception and control, is displayed in a preset area after being started, and has a specific avatar image, persona, social attributes, and skills, the system comprising:
a hardware device, comprising:
an input module for obtaining multimodal input data;
a learning module for acquiring the multi-modal interaction history information and current running state of the interaction objects, as well as the current running state of the hardware device;
and an output module for outputting the multi-modal output data through the virtual human's avatar image.
A cloud server, comprising:
the analysis module is used for analyzing the multi-modal input data to obtain intention data;
the judging module is used for judging a new running state to be entered by the virtual human according to the acquired data and the intention data;
and the calling module is used for calling a state adjustment capability interface to enter the new running state and generating multi-modal output data in the new state, wherein the multi-modal output data is associated with the character, the attribute and the skill of the virtual human.
The virtual human in the state management method and system provided by the invention has multiple states: a passive response state, a triggered response state, an active interaction state, an autonomous activity state, a background running state, and an exit application state. In different states the virtual human can carry out interactions that give the user different experiences, and its multi-modal output is coordinated, which strengthens the user's visual and sensory engagement and improves the interactive experience.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 shows an interaction diagram of a virtual human state management system according to one embodiment of the present invention;
FIG. 2 shows a block diagram of a virtual human state management system according to an embodiment of the present invention;
FIG. 3 shows a block diagram of the virtual human state types in the state management system according to an embodiment of the present invention;
FIG. 4 shows a module block diagram of a virtual human state management system according to one embodiment of the present invention;
FIG. 5 is a diagram of the factors that influence the multi-modal output data in the virtual human state management method according to an embodiment of the present invention;
FIG. 6 is a flowchart of a virtual human state management method according to an embodiment of the present invention;
FIG. 7 shows a further, detailed flowchart of the virtual human state management method according to an embodiment of the present invention;
FIG. 8 shows another flowchart of dialogue interaction in the virtual human state management system according to an embodiment of the present invention; and
FIG. 9 shows a flowchart of communication among a user, a hardware device, and a cloud server according to an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
For clarity, the following explanations are given before the embodiments:
the virtual human mentioned in the invention runs on an intelligent device that supports input/output modules for perception, control, and the like;
it uses a highly realistic 3D virtual character image as its main user interface and has an appearance with distinctive character features;
it supports multi-modal human-computer interaction and has AI capabilities such as natural language understanding, visual perception, touch perception, spoken language output, and emotion and action expression;
its social attributes, personality attributes, character skills, and the like are configurable, so that the user enjoys a smooth, intelligent, and personalized interaction experience.
The cloud server is the terminal that provides the multi-modal interactive virtual human with the processing capability to perform semantic understanding (language semantic understanding, action semantic understanding, emotion computation, and cognitive computation) of the user's interaction requirements, so that interaction with the user is achieved and the user is helped to make decisions.
Various embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
Fig. 1 shows an interaction diagram of a state management system of a virtual human according to an embodiment of the present invention.
As shown in fig. 1, the system includes a user 101, hardware devices (including a display area 1021 and a hardware support device 1022), a virtual human 103, and a cloud server 104. The user 101 interacting with the virtual human 103 can be a real person, another virtual human, or a physical robot, and the interaction of another virtual human or a physical robot with the virtual human is similar to single-user interaction. Thus, only the multi-modal interaction process between the user (a human) and the virtual human is illustrated in fig. 1.
In addition, the hardware devices include a display area 1021 and a hardware support device 1022 (essentially the core processor). The display area 1021 is used to display the image of the virtual human 103, and the hardware support device 1022 cooperates with the cloud server 104 for data processing during the interaction. The virtual human 103 needs a screen carrier for presentation; thus, the display area 1021 may be a PC screen, a projector, a television, a multimedia display screen, a holographic projection, VR, or AR. The multi-modal interaction process provided by the present invention requires a certain level of hardware performance as support, and generally a PC with a host computer is selected as the hardware support device 1022. In fig. 1, the display area 1021 is a PC screen.
The process of interaction between the virtual human 103 and the user 101 in fig. 1 is as follows:
the virtual human is carried in a hardware device which is provided with an operating system and supports perception and control, and the virtual human is displayed in a preset area after being started and has specific image, character setting, social attribute and skill. The virtual human 103 needs to be mounted on a hardware device having an operating system, and in order to match the perception function and the control function of the virtual human, the hardware device also needs to be equipped with a component having the perception function and a component having the control function. In order to improve the interactive experience, the virtual human is displayed in a preset area of the hardware equipment after being started, so that the waiting time of a user is prevented from being too long. In order to enrich the interactive content and improve the interactive feeling, the virtual human also has specific image, character, social attribute and skill.
It should be noted here that the avatar image and outfit of the virtual human 103 are not limited to a single mode. The virtual human 103 may be provided with different images and outfits. Its image is typically a high-detail (high-poly) 3D animated character. Each image of the virtual human 103 can also correspond to different outfits, and the outfits can be categorized by season and occasion. These images and outfits may be stored on the cloud server 104 or on the hardware device and can be called up whenever needed. Operators can regularly upload new images and outfits to the interactive platform, and the user can select a preferred image and outfit as required.
The personality, social attributes, and skills of the virtual human are likewise not limited to a single kind. The virtual human may have multiple personalities, multiple social attributes, and multiple skills. Personalities, social attributes, and skills can be combined freely rather than in a fixed pairing; the user can select and combine them as required.
After this preliminary preparation is completed, the interaction formally starts. First, multi-modal input data is acquired. The multi-modal input data may come from what the user 101 says or from sensing the environment, and may contain information in a variety of modalities, such as text, speech, visual, and perceptual information. The receiving devices for acquiring the multi-modal input data are installed or configured on the hardware device and include a text receiving device for receiving text, a voice receiving device for receiving speech, a camera for receiving visual input, an infrared device for receiving perceptual information, and so on.
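As an illustrative aside that is not part of the original disclosure, the multi-modal input described above could be grouped into a single structure roughly as follows; the field names and types are assumptions made only for this sketch.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MultimodalInput:
    """Illustrative container for one round of multi-modal input data.

    The field names are assumptions for this sketch; the description only
    states that text, speech, visual, and perceptual information are
    collected by receiving devices on the hardware equipment.
    """
    text: Optional[str] = None            # from the text receiving device
    audio: Optional[bytes] = None         # raw speech from the microphone
    video_frame: Optional[bytes] = None   # a frame captured by the camera
    perception: dict = field(default_factory=dict)  # e.g. touch or infrared readings
```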
Then, the multi-modal interaction history information and current running state of the interaction objects are obtained, together with the current running state of the hardware device. The multi-modal interaction history among the interaction objects may correspond to information recorded in previous interactions, such as the lifestyle habits of the user 101. Knowing this history allows the virtual human 103 to form a preliminary judgment about the user 101 before the current interaction and to give the user 101 a better, smoother, and more natural interaction experience.
The virtual human switches between different states on the hardware device and selects a new state according to the current running state, where the current running state includes the current running state of the operating system and the current state of the virtual human 103. The current running state of the hardware device includes the running state of the receiving devices and the current running state of the interaction-related components on the hardware device. The purpose of collecting this information is to understand the virtual human 103 and the hardware device at the start of the interaction, so that decisions can be made in the interaction that follows.
The previously acquired multimodal input data is then parsed to obtain intent data. The multi-modal input data needs to be analyzed and understood on the cloud server side so as to obtain the interaction intention or environment information of the user.
After the intention data is obtained, the new running state the virtual human should enter is judged according to the acquired data and the intention data. The running states include a passive response state, a triggered response state, an active interaction state, an autonomous activity state, a background running state, and an exit application state. These states may have priorities and different entry conditions, and the new state the virtual human should enter is judged according to the priorities and the interaction intention contained in the intention data.
A new running state is entered by calling the state adjustment capability interface, and multi-modal interaction with the user is carried out in the new state, where the multi-modal output data is associated with the virtual human's character, attributes, and skills. After learning which state to enter, the virtual human 103 calls the state adjustment capability interface and enters the new state through it. The multi-modal output data is then generated in the new state. Besides the interaction data, the generated multi-modal output data is associated with the virtual human's own character, social attributes, and skills; virtual humans with different characters, social attributes, and skills may generate different multi-modal output data.
Finally, the multi-modal output data is output through the virtual human's avatar image. After the virtual human generates the multi-modal output data, it needs to be presented to the interaction object. The carrier for outputting the multi-modal output data is the virtual human's image, through which output such as text, speech, and visual recognition results can be presented in an all-around way, so that the interaction object can acquire the interactive information contained in the multi-modal output data quickly and accurately.
In brief, the interaction steps above are: first, acquire the multi-modal input data. Then, obtain the multi-modal interaction history information and current running state of the interaction objects, as well as the current running state of the hardware device. Next, parse the multi-modal input data to obtain intention data. Then, judge the new running state the virtual human should enter according to the acquired data and the intention data. After that, call the state adjustment capability interface to enter the new running state and generate the multi-modal output data in that state, where the multi-modal output data is associated with the virtual human's character, attributes, and skills. Finally, output the multi-modal output data through the virtual human's avatar image.
Fig. 2 shows a block diagram of a virtual human state management system according to an embodiment of the present invention. As shown in fig. 2, the system includes a user 101, a hardware device 102, a display area 1021, a virtual human 103, and a cloud server 104. The hardware device 102 includes a receiving device 102A, a state device 102B, a processing device 102C, and a connection device 102D. The cloud server 104 includes a communication device 1041.
The virtual human state management system provided by the invention requires a communication connection among three parties; that is, a smooth communication channel is established among the user 101, the hardware device 102, and the cloud server 104 so that the interaction between the user 101 and the virtual human 103 can be completed. To accomplish the interaction task, the hardware device 102 and the cloud server 104 are provided with devices and components that support the interaction. The party interacting with the virtual human may be one party or several.
The hardware device includes a receiving device 102A, a state device 102B, a processing device 102C, and a connection device 102D. The receiving device 102A is adapted to receive multi-modal interaction input data. Examples of the receiving device 102A include a keyboard, a cursor control device (mouse), a microphone for voice input, a scanner, a touch function (e.g., a capacitive sensor that detects physical touch), a camera (which detects motion without touch, using visible or invisible wavelengths), and so forth. The hardware device can acquire the multi-modal interaction input data through the input devices mentioned above.
The state device 102B is used to store information about the several states between which the virtual human can switch. In more detail, the state device 102B includes a passive response state unit, a triggered response state unit, an active interaction state unit, an autonomous activity state unit, a background running state unit, and an exit application state unit. Each of these units corresponds to one state, and the interaction data in each state can be stored or processed.
The processing device 102C is configured to process the multi-modal interaction data transmitted by the cloud server during the interaction. The connection device 102D is used to communicate with the cloud server 104: through it the processing device 102C sends call instructions to invoke the virtual human capabilities on the cloud server 104 for parsing the multi-modal interaction input data, and through it the hardware device receives the multi-modal input data preprocessed by the device or the multi-modal output data transmitted by the cloud server.
The communication device 1041 included in the cloud server 104 is used to communicate with the hardware device 102. It maintains communication with the connection device 102D on the hardware device 102, receives instructions from the hardware device 102, and sends out the instructions issued by the cloud server 104; it is the medium of communication between the hardware device 102 and the cloud server 104.
Fig. 3 shows a block diagram of the virtual human state types in the virtual human state management system according to one embodiment of the present invention. As shown in fig. 3, the states of the virtual human 103 include a passive response state, a triggered response state, an active interaction state, an autonomous activity state, a background running state, and an exit application state.
The passive response state has the highest priority. In a preferred embodiment, when the intention data contain a skill-exhibition intention, the virtual human enters the passive response state and generates the corresponding multi-modal output data in that state. As another example, it is judged whether the intention data indicate starting or stopping a hardware device; if so, the virtual human enters the triggered response state. Generating the multi-modal output data in the new state then includes calling a capability interface to trigger the starting or stopping of the corresponding hardware device and outputting the multi-modal data associated with the intention data. The intention may, for example, be to take a photograph, i.e., to instruct the camera to open; the output multi-modal data may then be the photograph or a confirmation question.
The virtual human actively enters the active interaction state during interaction with the interaction object, or within a preset time of interacting with the interaction object; when in the active interaction state, the virtual human actively initiates a topic or performs a skill exhibition. For example, the virtual human actively pushes the day's headline news to the user in the morning.
When none of the active interaction, passive response, triggered response, and exit states has been entered, the virtual human enters the autonomous activity state and performs specific skill exhibitions in that state. In the autonomous activity state the virtual human may, for example, blink or move about freely within the screen space.
After the interaction ends, or in a non-interaction state, the virtual human enters the background running state, in which its display is hidden and it runs in the background. When the time the interactive system spends in the sleep state exceeds a preset duration, or the interaction object issues an exit instruction, the virtual human enters the exit application state.
It should be noted that the virtual human may also choose which state to enter, and how to behave in it, according to the user's history information. For example, if in past interactions the user mostly chose instrumental music when issuing a "play music" request, then when the intention data contain a skill-exhibition intention and the virtual human enters the passive response state, it will be more inclined to present songs with a soft and light melody. Preferably, this history information can be extracted from the user profile data.
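As an illustrative aside that is not part of the original disclosure, the six running states named above can be written down as a simple enumeration. A minimal Python sketch follows; only the passive response state is explicitly described as having the highest priority, so the rest of the numeric ordering is an assumption for illustration.

```python
from enum import Enum


class AvatarState(Enum):
    """The six running states named in the description (ordering partly assumed)."""
    PASSIVE_RESPONSE = 1     # highest priority: entered on a skill-exhibition intention
    TRIGGERED_RESPONSE = 2   # entered on a hardware start/stop intention
    ACTIVE_INTERACTION = 3   # virtual human initiates a topic or a skill exhibition
    AUTONOMOUS_ACTIVITY = 4  # idle behaviour, e.g. blinking or moving about the screen
    BACKGROUND_RUNNING = 5   # display hidden after the interaction ends
    EXIT_APPLICATION = 6     # sleep timeout exceeded or an exit instruction received
```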
Fig. 4 shows a module block diagram of a virtual human state management system according to an embodiment of the present invention. As shown in fig. 4, the system modules include an input module 401, a learning module 402, an analysis module 403, a judging module 404, a calling module 405, and an output module 406.
The input module 401 includes a text collection unit 4011, an audio collection unit 4012, a visual collection unit 4013, and a perception collection unit 4014, and is primarily used to collect the multi-modal input data. The text collection unit 4011 collects text information, the audio collection unit 4012 collects audio information, the visual collection unit 4013 collects visual information, and the perception collection unit 4014 collects perceptual information such as touch.
The learning module 402 includes a history information unit 4021 and a current state unit 4022. The history information unit 4021 acquires the history of interaction between the user 101 and the virtual human 103, so as to obtain information such as the behavioural habits of the user 101. The current state unit 4022 determines the current running state of the virtual human 103.
The output module 406 is used to output the multi-modal output data through the virtual human's avatar image. The input module 401, the learning module 402, and the output module 406 may be configured on the hardware device;
the analysis module 403, the judging module 404, and the calling module 405 may be configured on the cloud server.
The analysis module 403 contains an intention unit 4031, which obtains the intention data by parsing the multi-modal input data. The judging module 404 includes an analysis unit 4041, which judges the new running state the virtual human should enter according to the acquired data and the intention data.
The calling module 405 includes an interface unit 4051 and a new-state unit 4052. The interface unit 4051 is configured to call the state adjustment capability interface to enter the new running state, and the new-state unit 4052 is configured to generate the multi-modal output data in the new state, where the multi-modal output data is associated with the virtual human's character, attributes, and skills.
Fig. 5 shows a diagram of the factors that influence the multi-modal output data in the virtual human state management method according to an embodiment of the present invention.
The generation of multi-modal output data is influenced by various factors, among which the multi-modal input data has the main influence. The multi-modal output data is generated mainly from the interactive information contained in the multi-modal input data and the parsed intention data; based on these, the virtual human 103 can generate corresponding multi-modal output data. However, the generation of multi-modal output data is also affected by many factors other than the multi-modal input data, and fig. 5 shows a schematic diagram of these other influences.
As shown in fig. 5, the influencing factors include personality, social attributes, skills, and the history of interaction with the user. Each influencing factor is represented by an ellipse, and the ellipses intersect one another. Region 501 represents single influencing factors, i.e., the parts where each of the four factors (personality, social attributes, skills, and history information) intersects the multi-modal output data ellipse on its own. Region 502 represents pairs of influencing factors, including the part where social attributes, history information, and the multi-modal output data intersect and the part where personality, history information, and the multi-modal output data intersect. Region 503 represents interacting factors, including the part where social attributes intersect the history information and the part where personality intersects the history information.
The influence of the interacting factors 503 on the generation of the multi-modal output data is indirect: they affect it by interacting with the personality, social attributes, and history information.
Since the generation of the multi-modal output data is affected by several factors of several kinds, the influencing factors shown in fig. 5 need to be taken into account when generating the multi-modal output data, so that output better suited to the actual situation is produced.
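As an illustrative aside that is not part of the original disclosure, the following sketch shows one way the influencing factors of fig. 5 could feed into output generation. The function name, the dictionary structure, and the way history biases a skill exhibition are all assumptions; the description only states that personality, social attributes, skills, and interaction history shape the multi-modal output, with the interacting factors acting indirectly.

```python
def generate_multimodal_output(intent, personality, social_attrs, skills, history):
    """Sketch only: combine intention data with the virtual human's configuration
    and the interaction history to produce multi-modal output data."""
    output = {"text": None, "speech": None, "animation": None}

    if intent.get("type") == "skill_exhibition":
        skill = intent.get("skill", "chat")
        # history acts indirectly, e.g. biasing which variant of a skill is shown
        variant = history.get(f"{skill}_preference", "default")
        output["text"] = f"Showing {skill} ({variant} style)."
        output["animation"] = skills.get(skill, "idle")
    else:
        # personality and social attributes colour the wording of an ordinary reply
        tone = "lively" if personality.get("trait") == "cheerful" else "neutral"
        output["text"] = f"[{tone}] {intent.get('reply', '')}"

    output["speech"] = output["text"]  # the same content rendered as speech
    return output
```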
Fig. 6 shows a flowchart of a virtual human state management method according to an embodiment of the present invention.
In order to manage the state of the virtual human 103, a rigorous yet smooth state management mechanism and a state management method matching that mechanism are required; the invention therefore provides such a method and system for managing the virtual human's state.
First, in step S601, the multi-modal input data is acquired. Interaction between the virtual human 103 and the user 101 entails the transfer, i.e., reception and transmission, of data; the virtual human 103 first needs to receive the multi-modal input data in order to perform semantic understanding, visual analysis, perceptual computation, emotion computation, and other processing on it. Next, in step S602, the multi-modal interaction history information and current running state of the interaction objects, and the current running state of the hardware device, are acquired. The role of this step is to collect the information about the user 101 and the hardware device 102 that is relevant to the interaction and to provide data support for the interaction that follows.
Then, in step S603, the multi-modal input data is parsed to obtain the intention data. Next, in step S604, the new running state the virtual human should enter is judged based on the acquired data and the intention data. In this step it is preferentially judged whether the intention data indicate a skill-exhibition intention; if so, the virtual human directly enters the passive response state. It is also judged whether the intention data indicate starting or stopping a hardware device; if so, the virtual human enters the triggered response state, and generating the multi-modal output data in the new state includes calling a capability interface to trigger the starting or stopping of the corresponding hardware device and outputting the multi-modal data associated with the intention data.
Then, in step S605, the state adjustment capability interface is called to enter the new running state, and the multi-modal output data is generated in that state, where the multi-modal output data is associated with the virtual human's character, attributes, and skills. The running states are classified as: a passive response state, a triggered response state, an active interaction state, an autonomous activity state, a background running state, and an exit application state.
Finally, in step S606, the multi-modal output data is output through the virtual human's avatar image.
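As an illustrative aside that is not part of the original disclosure, steps S601-S606 can be read as one processing loop. The sketch below uses invented object and method names (hardware, cloud, avatar and their methods); decide_new_state stands for the detailed decision flow of fig. 7 discussed next.

```python
def interaction_round(hardware, cloud, avatar):
    """One round of the state management method (S601-S606); names are illustrative."""
    # S601: acquire multi-modal input data from the receiving devices
    inputs = hardware.collect_multimodal_input()

    # S602: acquire the interaction history and the current running states
    #       of the virtual human and of the hardware device
    context = {
        "history": hardware.load_interaction_history(),
        "avatar_state": avatar.current_state,
        "device_state": hardware.current_state(),
    }

    # S603: parse the input on the cloud server to obtain intention data
    intent = cloud.parse_intent(inputs)

    # S604: judge the new running state from the acquired data and the intent
    new_state = cloud.decide_new_state(intent, context)  # see the fig. 7 sketch below

    # S605: call the state adjustment capability interface and generate output
    #       associated with the virtual human's character, attributes, and skills
    avatar.enter_state(new_state)
    output = cloud.generate_output(intent, avatar.profile, new_state)

    # S606: render the multi-modal output through the virtual human's avatar image
    hardware.render(avatar, output)
```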
Fig. 7 shows a further, detailed flowchart of the virtual human state management method according to an embodiment of the present invention.
As shown in fig. 7, in step S701, the intention data is obtained by parsing the multi-modal input data. Then, in step S702, it is judged whether a preset time has been reached. If it has, the virtual human enters the active interaction state in step S703. If it has not, it is judged in step S704 whether the intention data contain a skill-exhibition intention. If they do, the virtual human enters the passive response state in step S705. If they do not, it is judged in step S706 whether the intention data contain a hardware start/stop intention.
If the intention data contain the hardware start/stop intention, the virtual human enters the triggered response state in step S707. If they do not, it is judged in step S708 whether the virtual human is in any of the passive response, triggered response, active interaction, or exit states. If it is in none of them, the virtual human enters the autonomous activity state in step S709.
If the virtual human is in any of the passive response, triggered response, active interaction, or exit states, it is judged in step S710 whether the interaction has ended or no interaction is currently taking place. If so, the virtual human enters the background running state in step S711. If the interaction has not ended, or an interaction is in progress, it is judged in step S712 whether a certain sleep time has been exceeded or an exit instruction has been received. If so, the virtual human enters the exit application state in step S713. If not, the virtual human continues to interact or enters the background running state.
It should be noted that the entry condition for the active interaction state is that the preset time is reached or that the virtual human enters it autonomously during the interaction. In the flow shown in fig. 7, whether the preset time has been reached or the virtual human has entered autonomously needs to be checked in real time; for simplicity, only a single judgment step is shown in fig. 7.
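As an illustrative aside that is not part of the original disclosure, the decision order of fig. 7 can be sketched as a single function. It assumes the AvatarState enumeration from the earlier sketch, and the context flags (preset_time_reached, interaction_finished, sleep_time_exceeded) are invented stand-ins for the checks in steps S702, S708, S710, and S712.

```python
def decide_new_state(intent, ctx):
    """Sketch of the fig. 7 decision flow (S702-S713); all names are illustrative."""
    if ctx["preset_time_reached"]:                        # S702 -> S703
        return AvatarState.ACTIVE_INTERACTION
    if intent.get("skill_exhibition"):                    # S704 -> S705
        return AvatarState.PASSIVE_RESPONSE
    if intent.get("hardware_start_stop"):                 # S706 -> S707
        return AvatarState.TRIGGERED_RESPONSE

    busy = {AvatarState.PASSIVE_RESPONSE, AvatarState.TRIGGERED_RESPONSE,
            AvatarState.ACTIVE_INTERACTION, AvatarState.EXIT_APPLICATION}
    if ctx["avatar_state"] not in busy:                   # S708 -> S709
        return AvatarState.AUTONOMOUS_ACTIVITY

    if ctx["interaction_finished"]:                       # S710 -> S711
        return AvatarState.BACKGROUND_RUNNING
    if ctx["sleep_time_exceeded"] or intent.get("exit"):  # S712 -> S713
        return AvatarState.EXIT_APPLICATION
    return ctx["avatar_state"]                            # otherwise keep interacting
```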
Fig. 8 shows another flowchart of dialogue interaction in the virtual human state management system according to an embodiment of the present invention. As shown, in step S801, the hardware device 102 sends the dialogue content to the cloud server 104 and then waits for a reply. While waiting, the hardware device 102 times how long the data takes to be returned. If no response data is received for a long time, for example longer than a predetermined duration of 5 s, the hardware device 102 may choose to reply locally and generate generic local response data. The virtual human image then outputs an animation matched to the local generic response, and the voice playback device is called to play the speech.
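As an illustrative aside that is not part of the original disclosure, the local-fallback behaviour of fig. 8 can be sketched as a blocking request with the 5-second budget mentioned above. The cloud.respond call, the canned reply text, and the output fields are placeholders, not an actual API of the system.

```python
import queue
import threading

LOCAL_FALLBACK_TEXT = "Sorry, let me think about that for a moment."  # assumed canned reply


def reply_with_fallback(cloud, dialog_content, timeout_s=5.0):
    """Send dialogue content to the cloud; fall back to a generic local reply
    if no response arrives within timeout_s seconds (the fig. 8 behaviour)."""
    result = queue.Queue(maxsize=1)

    def worker():
        try:
            result.put(cloud.respond(dialog_content))  # placeholder cloud call
        except Exception:
            pass  # a failed call is treated the same way as a timeout

    threading.Thread(target=worker, daemon=True).start()
    try:
        return result.get(timeout=timeout_s)           # cloud answered in time
    except queue.Empty:
        return {"text": LOCAL_FALLBACK_TEXT,           # generic local response
                "animation": "thinking",               # matched animation
                "speech": LOCAL_FALLBACK_TEXT}         # played via the voice device
```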
Fig. 9 shows a flowchart of communication between a user, a hardware device, and a cloud server according to an embodiment of the invention.
In order to realize the state management and switching of the avatar 103, a communication connection needs to be established among the user 101, the hardware device 102 and the cloud server 104. The communication connection should be real-time and unobstructed to ensure that the interaction is not affected.
In order to complete the interaction, certain conditions or preconditions need to be met: the virtual human is mounted on a hardware device that has an operating system and supports perception and control, and after being started it is displayed in a preset area and has a specific avatar image, persona, social attributes, and skills.
The precondition on the hardware device 102 is that it runs an operating system compatible with the virtual human 103 and has hardware facilities with perception and control functions. The hardware device 102 should also be provided with a display screen for displaying the image of the virtual human 103.
The interaction of the virtual human 103 is predicated on it having a specific avatar image, persona, social attributes, and skills. The specific image enables more lifelike interaction with the user 101 and helps improve the user 101's recognition of the virtual human 103. Because of the persona and social attribute settings, the virtual human 103 is no longer a cold, unfeeling machine; it has human-like personality and social attributes, so its image is more vivid and closer to a human being. In addition, a virtual human 103 with skills can better fulfill the requests of the user 101, and virtual humans 103 can be divided by skill requirements into virtual humans 103 with different skill attributes.
After the above preparation is completed, as shown in fig. 9, the interaction between the user 101 and the virtual human 103 formally starts. First, the hardware device 102 acquires the multi-modal input data: a communication connection is established between the user 101 and the hardware device 102, and the receiving device on the hardware device 102 receives in real time the multi-modal input data sent by the user 101 or by other devices.
Next, the multi-modal interaction history information and current running state between the virtual human 103 and the interaction objects, and the current running state of the hardware device, are acquired. Meanwhile, the hardware device 102 contacts the cloud server 104 and calls the virtual human capability interface to parse the multi-modal input data and obtain the intention data; parsing the multi-modal input data requires virtual human capabilities, which may include semantic understanding, cognitive computation, and the like. Then, according to the acquired data and the intention data, the new running state the virtual human should enter is judged, and the multi-modal output data is generated in the new state, where the multi-modal output data is associated with the virtual human's character, attributes, and skills.
Finally, a connection is established between the user 101 and the hardware device 102, and the hardware device 102 outputs the multi-modal output data through the image of the virtual human 103. The virtual human 103's image can output various information such as mouth shapes, expressions, and body movements; by combining these output modes, the virtual human 103 can vividly convey the information contained in the multi-modal output data to the user 101, who can intuitively grasp its content.
The virtual human in the state management method and system provided by the invention has multiple states: a passive response state, a triggered response state, an active interaction state, an autonomous activity state, a background running state, and an exit application state. In different states the virtual human can carry out interactions that give the user different experiences, and its multi-modal output is coordinated, which strengthens the user's visual and sensory engagement and improves the interactive experience.
It is to be understood that the disclosed embodiments of the invention are not limited to the particular structures, process steps, or materials disclosed herein but are extended to equivalents thereof as would be understood by those ordinarily skilled in the relevant arts. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to be limiting.
Reference in the specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. Thus, the appearances of the phrase "one embodiment" or "an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment.
Although the embodiments of the present invention have been described above, the above description is only for the convenience of understanding the present invention, and is not intended to limit the present invention. It will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (7)

1. A virtual human state management method, wherein the virtual human is mounted on a hardware device that has an operating system and supports perception and control, is displayed in a preset area after being started, and has a specific avatar image, persona, social attributes, and skills, the virtual human state management method comprising the following steps:
acquiring multi-modal input data;
acquiring the multi-modal interaction history information and current running state of the interaction objects, as well as the current running state of the hardware device;
analyzing the multi-modal input data to obtain intention data;
judging, according to the acquired data and the intention data, a new running state to be entered by the virtual human, wherein the new running state is classified as: a passive response state, a triggered response state, an active interaction state, an autonomous activity state, a background running state, and an exit application state;
calling a state adjustment capability interface to enter the new running state, and generating multi-modal output data in the new running state, wherein the multi-modal output data is associated with the character, the attribute and the skill of the virtual human;
outputting the multi-modal output data through the virtual human's avatar image,
wherein the multi-modal input data is parsed to obtain the intention data, and it is then judged whether a preset time has been reached; if the preset time has been reached, the virtual human enters the active interaction state; if not, it is preferentially judged whether the intention data contain a skill-exhibition intention; if they do, the virtual human enters the passive response state; if they do not, it is judged whether the intention data contain a hardware start/stop intention; if they do, the virtual human enters the triggered response state, and generating the multi-modal output data in the new running state comprises: calling a capability interface to trigger the starting or stopping of the corresponding hardware device, and outputting the multi-modal data associated with the intention data.
2. The virtual human state management method of claim 1, wherein the virtual human actively enters the active interaction state in the process of interacting with the interaction object or enters the active interaction state within a preset time of interacting with the interaction object, and when the virtual human is in the active interaction state, the virtual human actively initiates a topic or performs skill exhibition.
3. The virtual human state management method of claim 1, wherein the method further comprises,
entering the autonomous activity state and performing a particular skill exhibition in the autonomous activity state when none of the active interaction state, the passive response state, the triggered response state, and the exit application state has been entered.
4. The virtual human state management method of claim 1, wherein the method further comprises,
entering the background running state after the interaction ends or in a non-interaction state, in which state the display of the virtual human is hidden and the virtual human runs in the background.
5. The virtual human state management method of claim 1, wherein the method further comprises: when the time the interactive system spends in the sleep state exceeds a preset duration, or the interaction object issues an exit instruction, causing the virtual human to enter the exit application state.
6. A storage medium having stored thereon program code executable to perform the method steps of any of claims 1-5.
7. A virtual human state management system, wherein the virtual human is mounted on a hardware device that has an operating system and supports perception and control, is displayed in a preset area after being started, and has a specific avatar image, persona, social attributes, and skills, the virtual human state management system comprising:
a hardware device, comprising:
an input module for obtaining multimodal input data;
a learning module for acquiring the multi-modal interaction history information and current running state of the interaction objects, as well as the current running state of the hardware device;
an output module for outputting the multi-modal output data through the virtual human's avatar image, wherein the multi-modal input data is parsed to obtain the intention data, and it is then judged whether a preset time has been reached; if the preset time has been reached, the virtual human enters the active interaction state; if not, it is preferentially judged whether the intention data contain a skill-exhibition intention; if they do, the virtual human enters the passive response state; if they do not, it is judged whether the intention data contain a hardware start/stop intention; if they do, the virtual human enters the triggered response state, and generating the multi-modal output data in the new running state comprises: calling a capability interface to trigger the starting or stopping of the corresponding hardware device and outputting the multi-modal data associated with the intention data;
a cloud server, comprising:
the analysis module is used for analyzing the multi-modal input data to obtain intention data;
a judging module for judging, according to the acquired data and the intention data, a new running state to be entered by the virtual human, wherein the new running state is classified as: a passive response state, a triggered response state, an active interaction state, an autonomous activity state, a background running state, and an exit application state;
and the calling module is used for calling a state adjustment capability interface to enter the new running state and generating multi-modal output data in the new running state, wherein the multi-modal output data is associated with the character, the attribute and the skill of the virtual human.
CN201710883643.9A 2017-09-26 2017-09-26 Virtual human state management method and system Active CN107704169B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710883643.9A CN107704169B (en) 2017-09-26 2017-09-26 Virtual human state management method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710883643.9A CN107704169B (en) 2017-09-26 2017-09-26 Virtual human state management method and system

Publications (2)

Publication Number Publication Date
CN107704169A CN107704169A (en) 2018-02-16
CN107704169B true CN107704169B (en) 2020-11-17

Family

ID=61174498

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710883643.9A Active CN107704169B (en) 2017-09-26 2017-09-26 Virtual human state management method and system

Country Status (1)

Country Link
CN (1) CN107704169B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108376067A (en) * 2018-03-08 2018-08-07 腾讯科技(深圳)有限公司 A kind of application operating method and its equipment, storage medium, terminal
CN108595012A (en) * 2018-05-10 2018-09-28 北京光年无限科技有限公司 Visual interactive method and system based on visual human
CN109032328A (en) * 2018-05-28 2018-12-18 北京光年无限科技有限公司 A kind of exchange method and system based on visual human
CN109377797A (en) * 2018-11-08 2019-02-22 北京葡萄智学科技有限公司 Virtual portrait teaching method and device
CN110033776A (en) * 2019-03-08 2019-07-19 佛山市云米电器科技有限公司 A kind of virtual image interactive system and method applied to screen equipment
CN110989900B (en) * 2019-11-28 2021-11-05 北京市商汤科技开发有限公司 Interactive object driving method, device, equipment and storage medium
CN111459452B (en) * 2020-03-31 2023-07-18 北京市商汤科技开发有限公司 Driving method, device and equipment of interaction object and storage medium
CN113672194A (en) * 2020-03-31 2021-11-19 北京市商汤科技开发有限公司 Method, device and equipment for acquiring acoustic feature sample and storage medium
CN111460785B (en) * 2020-03-31 2023-02-28 北京市商汤科技开发有限公司 Method, device and equipment for driving interactive object and storage medium
CN111459450A (en) * 2020-03-31 2020-07-28 北京市商汤科技开发有限公司 Interactive object driving method, device, equipment and storage medium
CN112949622B (en) * 2021-04-08 2023-06-27 苏州大学 Bimodal character classification method and device for fusing text and image

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009158653A1 (en) * 2008-06-27 2009-12-30 Intuitive Automata, Inc. Apparatus and method for assisting in achieving desired behavior patterns
WO2013147360A1 (en) * 2012-03-30 2013-10-03 Lee Seongjong Mobile communication terminal including virtual robot, and pet robot synchronized with same
CN105740948B (en) * 2016-02-04 2019-05-21 北京光年无限科技有限公司 A kind of exchange method and device towards intelligent robot
CN105868827B (en) * 2016-03-25 2019-01-22 北京光年无限科技有限公司 A kind of multi-modal exchange method of intelligent robot and intelligent robot
CN105843118B (en) * 2016-03-25 2018-07-27 北京光年无限科技有限公司 A kind of robot interactive method and robot system
CN106503156B (en) * 2016-10-24 2019-09-03 北京百度网讯科技有限公司 Man-machine interaction method and device based on artificial intelligence
CN106845624A (en) * 2016-12-16 2017-06-13 北京光年无限科技有限公司 The multi-modal exchange method relevant with the application program of intelligent robot and system
CN107133349B (en) * 2017-05-24 2018-02-23 北京无忧创新科技有限公司 One kind dialogue robot system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Software architecture of human-like robot; Tae-Geun Lee et al.; 2009 ICCAS-SICE; 2009-11-13; 5624-5628 *
A component defect management scheme based on robot control software (一种基于机器人控制软件的构件缺陷管理方案); Li Yanyan et al.; Ordnance Industry Automation (兵工自动化); 2017-05-31; 88-91 *

Also Published As

Publication number Publication date
CN107704169A (en) 2018-02-16

Similar Documents

Publication Publication Date Title
CN107704169B (en) Virtual human state management method and system
CN107632706B (en) Application data processing method and system of multi-modal virtual human
CN107340865B (en) Multi-modal virtual robot interaction method and system
CN107894833B (en) Multi-modal interaction processing method and system based on virtual human
US11735182B2 (en) Multi-modal interaction between users, automated assistants, and other computing services
US20190332400A1 (en) System and method for cross-platform sharing of virtual assistants
CN105320726B (en) Reduce the demand to manual beginning/end point and triggering phrase
KR102498811B1 (en) Dynamic and/or context specific hotwords to invoke automated assistants
CN107430853B (en) Locally saving data for voice actions with selective offline capability in a voice-enabled electronic device
CN107294837A (en) Engaged in the dialogue interactive method and system using virtual robot
US11200893B2 (en) Multi-modal interaction between users, automated assistants, and other computing services
CN109086860B (en) Interaction method and system based on virtual human
KR20210008521A (en) Dynamic and/or context-specific hot words to invoke automated assistants
CN107480766B (en) Method and system for content generation for multi-modal virtual robots
CN111538456A (en) Human-computer interaction method, device, terminal and storage medium based on virtual image
CN107808191A (en) The output intent and system of the multi-modal interaction of visual human
CN113703585A (en) Interaction method, interaction device, electronic equipment and storage medium
CN108388399B (en) Virtual idol state management method and system
CN107783650A (en) A kind of man-machine interaction method and device based on virtual robot
CN109359177B (en) Multi-mode interaction method and system for story telling robot
CN110196900A (en) Exchange method and device for terminal
CN114201043A (en) Content interaction method, device, equipment and medium
CN107908385B (en) Holographic-based multi-mode interaction system and method
CN110764618A (en) Bionic interaction system and method and corresponding generation system and method
CN114501054B (en) Live interaction method, device, equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20231018

Address after: 100000 6198, Floor 6, Building 4, Yard 49, Badachu Road, Shijingshan District, Beijing

Patentee after: Beijing Virtual Dynamic Technology Co.,Ltd.

Address before: 100000 Fourth Floor Ivy League Youth Venture Studio No. 193, Yuquan Building, No. 3 Shijingshan Road, Shijingshan District, Beijing

Patentee before: Beijing Guangnian Infinite Technology Co.,Ltd.
