CN117275476A - Digital person interaction method and device, electronic equipment and storage medium

Info

Publication number: CN117275476A
Application number: CN202311207038.1A
Authority: CN (China)
Prior art keywords: text, digital, language model, query text, large language
Other languages: Chinese (zh)
Inventors: 周润民, 张雨仁
Assignee (current and original): Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202311207038.1A
Publication of CN117275476A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/18 Speech classification or search using natural language modelling
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/26 Speech to text systems
    • G10L 2015/225 Feedback of the input speech
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00-G10L 21/00
    • G10L 25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L 25/57 Speech or voice analysis techniques specially adapted for processing of video signals

Landscapes

  • Engineering & Computer Science
  • Computational Linguistics
  • Health & Medical Sciences
  • Audiology, Speech & Language Pathology
  • Human Computer Interaction
  • Physics & Mathematics
  • Acoustics & Sound
  • Multimedia
  • Signal Processing
  • Artificial Intelligence
  • Information Transfer Between Computers

Abstract

The disclosure provides a digital person interaction method and device, an electronic device, and a storage medium, relating to the technical field of artificial intelligence. The implementation scheme is as follows: receiving speech sent by a digital person terminal and performing speech recognition on the speech to obtain a query text; sending the query text to a large language model and receiving the reply result of the query text returned by the large language model; and sending a rendering instruction to a rendering engine based on the reply result, the rendering engine rendering and generating a digital person video stream. By converting the user's speech into a query text and sending it to the large language model, a more natural and accurate reply result can be obtained. Sending a rendering instruction to the rendering engine based on the reply result allows a digital person video stream to be generated, giving the digital person high-fidelity human language understanding and generation capability and thereby improving the user experience.

Description

Digital person interaction method and device, electronic equipment and storage medium
Technical Field
The disclosure relates to the technical field of artificial intelligence, and in particular relates to a digital human interaction method and device, electronic equipment and a storage medium.
Background
Existing digital persons have a limited ability to understand ambiguous questions and, in complex dialogues, cannot use context information to answer them, so their answers are overly mechanical. They also lack personalized expression, which leaves users without emotional resonance when interacting with a digital person.
Disclosure of Invention
The disclosure provides a digital person interaction method and device, an electronic device, and a storage medium.
According to an aspect of the present disclosure, there is provided a digital person interaction method, including: receiving speech sent by a digital person terminal, and performing speech recognition on the speech to obtain a query text; sending the query text to a large language model, and receiving a reply result of the query text returned by the large language model; and sending a rendering instruction to a rendering engine based on the reply result, the rendering engine rendering and generating a digital person video stream.
According to another aspect of the present disclosure, there is provided a digital person interaction method, including: receiving a rendering instruction sent by a digital person central controller, wherein the rendering instruction is generated based on a reply result of a query text fed back by a large language model; sending a text request to a text-to-speech (TTS) service based on the rendering instruction, and receiving audio data generated by the TTS service based on the text request; and rendering and generating a digital person video stream based on the audio data.
According to another aspect of the present disclosure, there is provided a digital person interaction method, including: receiving a query text sent by a digital person central controller; determining a reply result of the query text based on the query text; and sending the reply result to the digital person central controller.
According to another aspect of the present disclosure, there is provided a digital person interaction device, including: a speech recognition module, configured to receive speech sent by a digital person terminal and perform speech recognition on the speech to obtain a query text; a transceiver module, configured to send the query text to a large language model and receive a reply result of the query text returned by the large language model; and a sending module, configured to send a rendering instruction to a rendering engine based on the reply result, the rendering engine rendering and generating a digital person video stream.
According to another aspect of the present disclosure, there is provided a digital person interaction device, including: a receiving module, configured to receive a rendering instruction sent by a digital person central controller, wherein the rendering instruction is generated based on a reply result of a query text fed back by a large language model; a transceiver module, configured to send a text request to a text-to-speech (TTS) service based on the rendering instruction and receive audio data generated by the TTS service based on the text request; and a rendering module, configured to render and generate a digital person video stream based on the audio data.
According to another aspect of the present disclosure, there is provided a digital person interaction device, including: a receiving module, configured to receive a query text sent by a digital person central controller; a determining module, configured to determine a reply result of the query text based on the query text; and a sending module, configured to send the reply result to the digital person central controller.
According to another aspect of the present disclosure, there is provided an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the digital person interaction method according to the embodiments of the above aspects.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the digital person interaction method according to the embodiments of the above aspects.
According to another aspect of the present disclosure, there is provided a computer program product including computer programs/instructions which, when executed by a processor, implement the digital person interaction method according to the embodiments of the above aspects.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
Fig. 1 is a flow chart of a digital person interaction method provided by an embodiment of the present disclosure;
Fig. 2 is a flow chart of another digital person interaction method provided by an embodiment of the present disclosure;
Fig. 3 is a flow chart of another digital person interaction method provided by an embodiment of the present disclosure;
Fig. 4 is a flow chart of another digital person interaction method provided by an embodiment of the present disclosure;
Fig. 5 is a flow chart of another digital person interaction method provided by an embodiment of the present disclosure;
Fig. 6 is a flow chart of another digital person interaction method provided by an embodiment of the present disclosure;
Fig. 7 is a flow chart of another digital person interaction method provided by an embodiment of the present disclosure;
Fig. 8 is a flow chart of another digital person interaction method provided by an embodiment of the present disclosure;
Fig. 9 is a schematic flow chart of user interaction with a digital person provided by an embodiment of the present disclosure;
Fig. 10 is a schematic structural diagram of a digital person interaction device provided by an embodiment of the present disclosure;
Fig. 11 is a schematic structural diagram of another digital person interaction device provided by an embodiment of the present disclosure;
Fig. 12 is a schematic structural diagram of another digital person interaction device provided by an embodiment of the present disclosure;
Fig. 13 is a block diagram of an electronic device for implementing the digital person interaction method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The following describes a digital person interaction method, a digital person interaction device and electronic equipment according to the embodiment of the disclosure with reference to the accompanying drawings.
Artificial intelligence (AI) is the discipline of enabling computers to simulate certain human thinking processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), and it spans both hardware-level and software-level technologies. Artificial intelligence software technologies generally include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, and knowledge graph technologies.
Natural language processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. It is mainly applied to machine translation, public opinion monitoring, automatic summarization, viewpoint extraction, text classification, question answering, text semantic comparison, speech recognition, and the like.
Deep learning (DL) is a research direction in the field of machine learning (ML); it was introduced into machine learning to bring the field closer to its original goal, artificial intelligence. Deep learning learns the inherent regularities and representation hierarchies of sample data, and the information obtained during such learning helps in interpreting data such as text, images, and sounds. Its ultimate goal is to give machines human-like analytical learning abilities, able to recognize text, image, and sound data. Deep learning is a complex machine learning approach whose results in speech and image recognition far exceed those of earlier techniques.
Smart search is a new generation of search engine that incorporates artificial intelligence technology. Besides the traditional functions of fast search and relevance ranking, it can also provide user role registration, automatic identification of user interests, semantic understanding of content, intelligent information filtering, push, and the like.
Speech technology enables computers to listen, see, speak, and feel; it is the development direction of future human-computer interaction, in which voice is expected to become one of the most promising interaction modes, with advantages over other modes. The core of computer speech synthesis is text-to-speech (Text To Speech, TTS) technology.
Machine translation, also known as automatic translation, is the process of using a computer to convert one natural language (the source language) into another natural language (the target language). It is a branch of computational linguistics and one of the goals of artificial intelligence.
Fig. 1 is a flow chart of a digital person interaction method according to an embodiment of the disclosure.
As shown in fig. 1, the digital person interaction method may include:
s101, receiving voice sent by the digital personal terminal, and performing voice recognition on the voice to obtain query text.
It should be noted that, the execution body of the digital person interaction method in the embodiment of the present disclosure may be a hardware device having a data information processing capability and/or software necessary for driving the hardware device to operate. Alternatively, the execution body may include a server, a computer, a user terminal, and other intelligent devices. Optionally, the user terminal includes, but is not limited to, a mobile phone, a computer, an intelligent voice interaction device, etc. Alternatively, the server includes, but is not limited to, a web server, an application server, a server of a distributed system, a server incorporating a blockchain, etc.
In some implementations, the digital personal center can receive voice sent by the digital personal terminal by establishing a connection with the digital personal terminal. Wherein the voice is a recording of the user collected by the digital personal terminal. Alternatively, the digital personal central control may invoke an automatic speech recognition (Automatic Speech Recognition, ASR) service to recognize received speech and convert the speech into query text.
In some implementations, after invoking the ASR service, the digital personal central control may send the speech to the ASR service based on the streaming protocol, and after the ASR service converts the speech to query text, the digital personal central control receives the query text returned by the ASR service based on the streaming protocol.
It is understood that streaming (Stream) is a data transmission manner based on Stream (Stream), which is to transmit data as a continuous Stream. In streaming, data is divided into small packets and transmitted continuously in a certain order, and a receiver can immediately start processing the received data without waiting for the entire data transmission to be completed. The transmission mode has the characteristics of strong real-time performance and low delay, can gradually load data, and provides better user experience.
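As an illustration only, the following minimal Python sketch shows how such a streaming exchange with the ASR service might look from the central controller's side; AsrChannel and its send_audio/finish/iter_partial_texts methods are hypothetical placeholders for the actual streaming-protocol client, not a real library API.

```python
# Hypothetical sketch: streaming speech to an ASR service chunk by chunk.
import wave

CHUNK_FRAMES = 3200  # roughly 200 ms of 16 kHz mono PCM (assumed format)

def speech_to_query_text(asr_channel, wav_path: str) -> str:
    """Push audio incrementally and assemble the streamed transcript."""
    with wave.open(wav_path, "rb") as wav:
        while True:
            chunk = wav.readframes(CHUNK_FRAMES)
            if not chunk:
                break
            asr_channel.send_audio(chunk)   # one small packet of the stream
    asr_channel.finish()                    # signal end of audio
    # Partial results can be consumed as they arrive; here we just join them.
    return "".join(asr_channel.iter_partial_texts())
```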
S102, sending the query text to the large language model, and receiving a reply result of the query text returned by the large language model.
In some implementations, the digital person central controller may send the query text to the large language model through a transmission interface established with the model, and receive the reply result of the query text returned through the same interface. Optionally, the transmission interface between the digital person central controller and the large language model may be established in advance; for example, it may be a WebSocket interface or a hypertext transfer protocol (Hypertext Transfer Protocol, HTTP) interface.
It is understood that a large language model is a language model with large-scale parameters that is capable of generating coherent, semantically rich text. It is a natural language processing model based on deep learning that acquires language understanding and generation capability by pre-training on large amounts of text data. That is, when the query text is sent to the large language model, the model can generate a reply result for it.
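The following is a minimal sketch of the query/reply exchange over such a transmission interface. The llm_channel object and the message schema with "delta" and "done" fields are assumptions for illustration; the patent does not prescribe a message format.

```python
# Hypothetical sketch: querying the large language model over an established
# transmission interface (WebSocket or HTTP wrapped by llm_channel).
import json

def query_llm(llm_channel, query_text: str) -> str:
    llm_channel.send(json.dumps({"type": "query", "text": query_text}))
    parts = []
    while True:
        message = json.loads(llm_channel.recv())  # one streamed fragment
        if message.get("done"):                   # completion marker
            break
        parts.append(message.get("delta", ""))    # accumulate partial reply
    return "".join(parts)
```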
S103, sending a rendering instruction to a rendering engine based on the reply result, and generating a digital person video stream based on the rendering of the rendering engine.
In some implementations, after the digital person central controller obtains the reply result, it may generate a rendering instruction for the digital person video stream based on that result; by sending the rendering instruction to the rendering engine, the engine can render the reply result into a digital person video stream. The digital person video stream is then sent to the digital person terminal through real-time audio-video communication (Real-time Communications, RTC) to realize the interaction between the user and the digital person.
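A minimal sketch of this dispatch step follows; the instruction schema and the engine_channel object are illustrative assumptions.

```python
# Hypothetical sketch: wrapping the reply result in a rendering instruction.
import json
import uuid

def send_render_instruction(engine_channel, reply_text: str) -> str:
    instruction = {
        "instruction_id": str(uuid.uuid4()),  # lets later steps track the job
        "action": "render_digital_person",
        "reply_text": reply_text,             # text the TTS service will voice
    }
    engine_channel.send(json.dumps(instruction))
    return instruction["instruction_id"]
```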
According to the digital person interaction method of the embodiments of the present disclosure, the user's speech is converted into a query text, the query text is sent to the large language model, and the large language model generates the reply result, so a more natural and accurate reply can be obtained and the real-time performance and response speed of the interaction are improved. By receiving the reply result and sending a rendering instruction to the rendering engine based on it, a digital person video stream can be generated, giving the digital person high-fidelity human language understanding and generation capability, providing more natural and intelligent interaction, letting the user receive faster, more immediate replies, and improving the user's personalized experience.
Fig. 2 is a flow chart of a method for digital person interaction according to an embodiment of the disclosure.
As shown in fig. 2, the digital person interaction method may include:
S201, receiving speech sent by the digital person terminal, and performing speech recognition on the speech to obtain a query text.
The relevant content of step S201 can be found in the above embodiments and is not repeated here.
S202, acquiring the locally cached historical dialogue text based on the query text.
In some implementations, to understand the user's context and provide an intelligent interactive experience, the digital person central controller may store each recognized query text locally as historical dialogue text, for example in its memory. When the central controller receives a new query text, it can directly retrieve the local historical dialogue text to provide the user's context to the large language model.
Optionally, historical dialogue texts adjacent in time to the query text may be selected as its context based on time information. For example, if the query text arrives at 15:05 and the cache holds historical dialogue texts from 10:08, 14:59, and 15:03, the texts from 14:59 and 15:03 are retrieved from the local cache as the context of the query text.
Optionally, the historical dialogue texts may instead be semantically matched against the query text, where a higher matching degree indicates greater similarity. By setting a matching-degree threshold, the historical dialogue texts whose matching degree exceeds the threshold can be retrieved as the context of the query text.
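The two selection strategies can be combined, as in the following sketch; the cache layout, the five-minute window, and the similarity threshold are illustrative assumptions.

```python
# Hypothetical sketch: keep a cached turn if it is recent enough or
# lexically similar enough to the current query.
from difflib import SequenceMatcher

RECENCY_WINDOW_S = 5 * 60   # turns within the last five minutes (assumed)
MATCH_THRESHOLD = 0.3       # minimum similarity ratio (assumed)

def select_context(history, query_text, now_ts):
    """history: list of (timestamp_seconds, text) tuples, oldest first."""
    context = []
    for ts, text in history:
        recent = (now_ts - ts) <= RECENCY_WINDOW_S
        similar = SequenceMatcher(None, text, query_text).ratio() >= MATCH_THRESHOLD
        if recent or similar:
            context.append(text)
    return context
```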
S203, inputting the historical dialogue text and the query text into the large language model, and acquiring from the large language model a reply result based on both.
In some implementations, to generate a more accurate reply result, the historical dialogue text may be sent to the large language model together with the query text, so that the model can combine the context information when generating the reply.
In some implementations, the historical dialogue text and the query text may be concatenated according to their time-sequence relationship to obtain a spliced query text, which is input into the large language model to generate the reply result.
For example, suppose the historical dialogue text is "It is cool today." and the current query text is "What preparations should be made for going to the park?" Then the spliced text "It is cool today. What preparations should be made for going to the park?" is input into the large language model, which generates the reply result.
In some implementations, guidance prompt information is instead generated based on the historical dialogue text and the query text and sent to the large language model to obtain the reply result of the query text. Optionally, the historical dialogue text and the query text may be combined according to a set template or format to generate the guidance prompt information for the large language model, which then acquires the reply result based on it.
It will be appreciated that the purpose of the guidance prompt information is to help the large language model understand the query text more accurately and generate the reply result. Providing explicit, clear guidance increases the probability that the model generates a suitable answer and reduces possible errors or ambiguities. In the present disclosure, generating the guidance prompt information from the historical dialogue text and the query text helps the large language model understand the user's context and produce a personalized reply result, thereby providing interaction that is more customized and better meets the user's needs.
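A minimal sketch of composing such guidance prompt information from a fixed template is shown below, reusing the park example above; the template wording is an assumption, not the patent's.

```python
# Hypothetical sketch: build guidance prompt information from a set template.
GUIDANCE_TEMPLATE = (
    "Use the dialogue history to resolve references in the user's question,\n"
    "then answer naturally and concisely.\n"
    "History:\n{history}\n"
    "Question: {query}\n"
)

def build_guidance_prompt(history_texts, query_text):
    history_block = "\n".join(f"- {t}" for t in history_texts)
    return GUIDANCE_TEMPLATE.format(history=history_block, query=query_text)

# Reusing the example above:
prompt = build_guidance_prompt(
    ["It is cool today."],
    "What preparations should be made for going to the park?",
)
```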
S204, sending a rendering instruction to a rendering engine based on the reply result, and generating a digital person video stream based on the rendering of the rendering engine.
The relevant content of step S204 can be found in the above embodiments and is not repeated here.
According to the digital person interaction method of the embodiments of the present disclosure, the user's speech is converted into a query text, the locally cached historical dialogue text is retrieved, and both are sent to the large language model, which generates the reply result; a reply that more naturally incorporates the user's context can thus be obtained, and the real-time performance and response speed of the interaction are improved. By receiving the reply result and sending a rendering instruction to the rendering engine based on it, a digital person video stream can be generated, giving the digital person high-fidelity human language understanding and generation capability, providing more natural and intelligent interaction, letting the user receive faster, more immediate replies, and improving the user's personalized experience.
Fig. 3 is a flow chart of a digital person interaction method according to an embodiment of the disclosure.
As shown in fig. 3, the digital person interaction method may include:
S301, receiving speech sent by the digital person terminal, and performing speech recognition on the speech to obtain a query text.
The relevant content of step S301 can be found in the above embodiments and is not repeated here.
S302, a transmission interface is established with the large language model through an adapter of the large language model.
S303, sending the query text to the large language model through the transmission interface.
In some implementations, before sending the query text to the large language model, the digital person central controller may establish a transmission interface with the model in advance and then send the query text through that interface.
Optionally, the input/output interface protocol of the large language model may be determined through the model's adapter, and the transmission interface established with the model according to that protocol.
Optionally, a WebSocket interface or a hypertext transfer protocol (Hypertext Transfer Protocol, HTTP) interface may be established, through which the digital person central controller transmits the query text to the large language model and the model returns the reply result.
S304, receiving the reply result of the query text returned by the large language model.
In some implementations, the reply result is returned in a streaming mode: the digital person central controller receives it through the transmission interface based on a streaming protocol, obtaining a reply with strong real-time performance and low delay and improving the user's interactive experience.
S305, sending a rendering instruction to the rendering engine based on the reply result, and generating a digital person video stream based on the rendering of the rendering engine.
The relevant content of step S305 can be found in the above embodiments and is not repeated here.
According to the digital person interaction method of the embodiments of the present disclosure, the user's speech is converted into a query text, the query text is sent to the large language model through the transmission interface, and the model generates the reply result and streams it back, so a reply that more naturally incorporates the user's context can be obtained and the real-time performance and response speed of the interaction are improved. By receiving the reply result and sending a rendering instruction to the rendering engine based on it, a digital person video stream can be generated, giving the digital person high-fidelity human language understanding and generation capability, providing more natural and intelligent interaction, letting the user receive faster, more immediate replies, and improving the user's personalized experience.
Fig. 4 is a flow chart of a method for digital human interaction according to an embodiment of the disclosure.
As shown in fig. 4, the digital person interaction method may include:
S401, receiving a rendering instruction sent by the digital person central controller, where the rendering instruction is generated based on the reply result of the query text fed back by the large language model.
In some implementations, the rendering engine performs the rendering of the digital person video stream by receiving the rendering instruction sent by the digital person central controller and acting on it. It will be appreciated that the rendering instruction is generated by the central controller based on the reply result of the query text fed back by the large language model.
S402, sending a text request to a text-to-speech (TTS) service based on the rendering instruction, and receiving audio data generated by the TTS service based on the text request.
In some implementations, after receiving the rendering instruction, the rendering engine generates a text-to-speech request from it and sends the text request to a TTS service, requesting audio data for the reply result.
Optionally, the rendering engine may receive the audio data streamed back by the TTS service based on a streaming protocol, obtaining real-time, low-latency audio data.
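A minimal sketch of this request/stream exchange follows; TtsChannel and its methods are hypothetical placeholders for the actual TTS client, and the audio format parameters are assumptions.

```python
# Hypothetical sketch: the rendering engine requests synthesis and consumes
# the streamed audio chunks.
def synthesize_reply_audio(tts_channel, reply_text: str) -> bytes:
    tts_channel.request(text=reply_text, audio_format="pcm16", sample_rate=16000)
    audio = bytearray()
    for chunk in tts_channel.iter_audio_chunks():  # low-latency partial audio
        audio.extend(chunk)                        # could be played as it lands
    return bytes(audio)
```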
S403, rendering and generating a digital person video stream based on the audio data.
In some implementations, the rendering engine drives the digital person based on an expression-driving blendshape (BS) data sequence, i.e., renders the digital person, for example driving it to speak, move, or change expression, so that it appears more realistic and natural. The BS sequence describes the degree of deformation of the various regions of the digital person's face, reflecting the corresponding facial expression.
Optionally, the rendering engine may generate the expression-driving BS data sequence from the audio data. The audio data may be preprocessed to improve audio quality, and audio features related to facial expression are extracted from it; by mapping these audio features onto the corresponding BS coefficients, the expression-driving BS data sequence is obtained.
Further, based on the BS sequence, action and expression rendering is performed on the digital person to generate the digital person video stream. Optionally, the rendering effect of the corresponding expressions and actions can be achieved by changing the shape and structure of the digital person, yielding a video stream with realistic facial expressions. For example, the rendering engine may render the digital person's lip movements: based on the BS sequence, the corresponding lip effects can be determined, and the digital person video stream generated accordingly.
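As a heavily simplified illustration, the sketch below derives one jaw-open BS coefficient per video frame from frame energy alone; real systems map much richer audio features to many BS coefficients, and all constants here are assumptions.

```python
# Hypothetical sketch: map per-frame audio energy to a jaw-open coefficient.
import array

SAMPLE_RATE = 16000
FPS = 25
FRAME_SAMPLES = SAMPLE_RATE // FPS  # audio samples per video frame

def audio_to_bs_sequence(pcm16: bytes):
    samples = array.array("h", pcm16[: len(pcm16) - len(pcm16) % 2])
    sequence = []
    for i in range(0, len(samples), FRAME_SAMPLES):
        frame = samples[i : i + FRAME_SAMPLES]
        if not frame:
            break
        rms = (sum(s * s for s in frame) / len(frame)) ** 0.5
        jaw_open = min(rms / 8000.0, 1.0)  # crude normalization, assumed scale
        sequence.append({"jawOpen": round(jaw_open, 3)})
    return sequence
```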
According to the digital person interaction method of the embodiments of the present disclosure, the rendering engine receives the rendering instruction sent by the digital person central controller, sends a text request to the TTS service, and requests the service to convert the text into audio data. An expression-driving BS data sequence is then obtained from the audio data to render the digital person and generate a digital person video stream, achieving a vivid visual effect, making the digital person more natural and intelligent, and improving the user experience.
Fig. 5 is a flow chart of a method for digital person interaction according to an embodiment of the present disclosure.
As shown in fig. 5, the digital person interaction method may include:
S501, receiving a rendering instruction sent by the digital person central controller, where the rendering instruction is generated based on the reply result of the query text fed back by the large language model.
S502, sending a text request to a text-to-speech (TTS) service based on the rendering instruction, and receiving audio data generated by the TTS service based on the text request.
S503, rendering and generating a digital person video stream based on the audio data.
The relevant content of steps S501-S503 can be found in the above embodiments and is not repeated here.
S504, sending the digital person video stream to the real-time audio-video (RTC) service.
In some implementations, after the rendering engine generates the digital person video stream, it may send the stream to a real-time audio-video communication (Real-time Communications, RTC) service so that the stream can be transmitted to the digital person terminal for interaction with the user. That is, the RTC service transmits the digital person video to the terminal.
S505, receiving the acquisition request of the digital person terminal and sending the digital person video stream to the terminal.
In some implementations, the digital person terminal actively sends an acquisition request for the digital person video stream to the RTC service in order to obtain the stream and interact with the user. Upon receiving the acquisition request from the terminal, the RTC service transmits the digital person video stream to it.
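The push/pull handoff can be pictured with the following in-process stand-in; a real RTC service uses network transport rather than an in-memory queue, so this is purely illustrative.

```python
# Hypothetical stand-in for the RTC handoff: the rendering engine publishes
# frames, and the terminal's acquisition request subscribes to them.
import queue

class RtcRelay:
    def __init__(self):
        self._streams = {}

    def publish(self, stream_id, frames):
        q = self._streams.setdefault(stream_id, queue.Queue())
        for frame in frames:   # frames pushed by the rendering engine
            q.put(frame)
        q.put(None)            # end-of-stream marker

    def subscribe(self, stream_id):
        """Serve the digital person terminal's acquisition request."""
        q = self._streams[stream_id]
        while (frame := q.get()) is not None:
            yield frame        # frames forwarded to the terminal
```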
According to the digital person interaction method of the embodiments of the present disclosure, the rendering engine receives the rendering instruction sent by the digital person central controller, sends a text request to the TTS service, and requests the service to convert the text into audio data. An expression-driving BS data sequence is then obtained from the audio data to render the digital person and generate a digital person video stream, achieving a vivid visual effect, making the digital person more natural and intelligent, and improving the user experience. By sending the digital person video stream to the RTC service, the stream can be delivered to the digital person terminal, realizing the interaction with the user.
Fig. 6 is a flow chart of a method for digital person interaction according to an embodiment of the disclosure.
As shown in fig. 6, the digital person interaction method may include:
S601, receiving the query text sent by the digital person central controller.
In some implementations, the large language model receives the query text sent by the digital person central controller through the transmission interface established between them. Optionally, that transmission interface may be a WebSocket interface or a hypertext transfer protocol (Hypertext Transfer Protocol, HTTP) interface.
S602, determining a reply result of the query text based on the query text.
In some implementations, after receiving the query text, the large language model determines its semantic information and context information by understanding the text, and generates corresponding answers. Optionally, the model may also select and filter the generated answers to provide the most appropriate and accurate reply result. Optionally, the answers may be evaluated and selected based on indices such as relevance or confidence, with the model choosing the answer of highest relevance or highest confidence as the reply result for the query text.
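A minimal sketch of the select-and-filter step might look as follows; the weighted index and its weights are illustrative assumptions.

```python
# Hypothetical sketch: score candidate answers on a combined
# relevance/confidence index and keep the best one.
def select_best_answer(candidates):
    """candidates: list of dicts with 'text', 'relevance', 'confidence' in [0, 1]."""
    def score(c):
        return 0.6 * c["relevance"] + 0.4 * c["confidence"]
    return max(candidates, key=score)["text"]
```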
S603, sending the reply result to the digital person central controller.
In some implementations, the large language model streams the reply result back to the digital person central controller based on a streaming protocol. Optionally, it sends the reply result through the transmission interface, so that a reply with strong real-time performance and low delay is obtained and the user's interactive experience is improved.
According to the digital person interaction method of the embodiments of the present disclosure, the large language model receives the query text sent by the digital person central controller and generates the reply result, so a more natural and accurate reply can be obtained and the real-time performance and response speed of the interaction are improved. The historical dialogue text and the query text can be combined into guidance prompt information, from which the large language model obtains the context; this lets the model better understand the context information and produce a personalized reply result, providing interaction that is more customized and better meets the user's needs.
Fig. 7 is a flow chart of a method for digital person interaction according to an embodiment of the present disclosure.
As shown in fig. 7, the digital person interaction method may include:
S701, receiving the historical dialogue text sent by the digital person central controller.
It will be appreciated that, for the large language model to understand the user's context and provide an intelligent interactive experience, the digital person central controller can retrieve the locally cached historical dialogue text and send it to the model together with the query text.
In some implementations, the large language model receives the historical dialogue text and the query text sent by the digital person central controller through the transmission interface, so that it can better understand the user's context and generate a more accurate reply result.
S702, obtaining a reply result based on the historical dialogue text and the query text.
In some implementations, the historical dialogue text and the query text are concatenated according to their time-sequence relationship to obtain a spliced query text, which is input into the large language model; the model generates the reply result from the spliced text.
In some implementations, to generate a more accurate reply result, the historical dialogue text and the query text may instead be combined according to a set template or format to generate guidance prompt information for the large language model. The model receives the guidance prompt information sent by the digital person central controller, where the information is obtained by combining the historical dialogue text and the query text, and obtains the reply result based on it.
It will be appreciated that the purpose of the guidance prompt information is to help the large language model understand the query text more accurately and generate the reply result. Providing explicit, clear guidance increases the probability that the model generates a suitable answer and reduces possible errors or ambiguities. In the present disclosure, generating the guidance prompt information from the historical dialogue text and the query text helps the model understand the user's context and produce a personalized reply result, providing interaction that is more customized and better meets the user's needs.
Optionally, the large language model determines the semantic and context information of the guidance prompt information by understanding it, and generates corresponding answers. Optionally, the model may also select and filter the generated answers to provide the most appropriate and accurate reply result. Optionally, the answers may be evaluated and selected based on indices such as relevance or confidence, with the model choosing the answer of highest relevance or highest confidence as the reply result for the query text.
S703, sending the reply result to the digital person central controller.
The relevant content of step S703 can be found in the above embodiments and is not repeated here.
According to the digital person interaction method of the embodiments of the present disclosure, the large language model receives the query text sent by the digital person central controller and generates the reply result, so a more natural and accurate reply can be obtained and the real-time performance and response speed of the interaction are improved. The historical dialogue text and the query text are combined into guidance prompt information, from which the large language model obtains the context; this lets the model better understand the context information and produce a personalized reply result, providing interaction that is more customized and better meets the user's needs.
Fig. 8 is an interaction schematic diagram of a digital person interaction method provided in an embodiment of the disclosure.
As shown in fig. 8, the digital person interaction method may include:
S801, the digital person terminal sends speech to the digital person central controller.
S802, the digital person central controller recognizes the speech to generate a query text.
S803, the digital person central controller sends the query text to the large language model.
S804, the large language model determines a reply result of the query text based on the query text.
S805, the large language model sends the reply result to the digital person central controller.
S806, the digital person central controller sends a rendering instruction to the rendering engine based on the reply result.
S807, the rendering engine renders and generates the digital person video stream based on the rendering instruction.
According to the digital person interaction method of the embodiments of the present disclosure, the digital person central controller converts the user's speech into a query text and sends it to the large language model, which generates the reply result, so a more natural and accurate reply can be obtained and the real-time performance and response speed of the interaction are improved. By receiving the reply result and sending a rendering instruction to the rendering engine based on it, a digital person video stream can be generated, giving the digital person high-fidelity human language understanding and generation capability, providing more natural and intelligent interaction, letting the user receive faster, more immediate replies, and improving the user's personalized experience.
Fig. 9 shows a flow chart of user interaction with a digital person.
S0, a communication connection is established between the digital person terminal and the digital person central controller.
S1, the digital person central controller receives the user's speech sent by the digital person terminal.
S2, the central controller streams the speech to an ASR service based on a streaming protocol, and the ASR service recognizes the speech to obtain a query text.
S3, the ASR service transmits the query text to the central controller based on the streaming protocol.
S4, the central controller caches the query text locally as historical dialogue text.
S5, the guidance prompt information generated from the historical dialogue text and the query text is sent to the large language model to query it; the model obtains a reply result that incorporates the context information based on the guidance prompt information.
S6, the large language model streams the reply result back to the central controller.
S7, the central controller sends a rendering instruction to the rendering engine.
S8, the rendering engine sends a text-to-speech request to the TTS service based on the rendering instruction and the reply result.
S9, the TTS service converts the reply result into audio data and streams it to the rendering engine.
S10, the rendering engine renders and generates a digital person video stream based on the audio data.
S11, the digital person video stream is sent to the RTC service.
S12, the RTC service receives the acquisition request of the digital person terminal and sends the digital person video stream to the terminal, realizing the interaction process with the user.
S13, after obtaining the digital person video stream, the digital person terminal disconnects from the digital person central controller.
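Putting the pieces together, the following sketch traces the S0-S13 flow from the central controller's point of view; every service object here is an assumed stand-in for a component described above, and build_guidance_prompt refers to the earlier prompt sketch.

```python
# Hypothetical end-to-end sketch of one interaction turn (S0-S13).
def handle_user_turn(terminal, asr, llm, engine, rtc, cache):
    speech = terminal.receive_speech()                   # S1: user audio
    query_text = asr.transcribe_stream(speech)           # S2-S3: streaming ASR
    history = cache.select_context(query_text)           # S4: local history
    prompt = build_guidance_prompt(history, query_text)  # S5: guidance prompt
    reply = llm.stream_query(prompt)                     # S5-S6: streamed reply
    cache.append(query_text, reply)                      # keep for next turn
    stream_id = engine.render(reply)                     # S7-S10: TTS + BS
    rtc.publish(stream_id, engine.frames(stream_id))     # S11: push to RTC
    terminal.play(rtc.subscribe(stream_id))              # S12: terminal pulls
```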
Corresponding to the digital person interaction method provided by the above embodiments, an embodiment of the present disclosure further provides a digital person interaction device, and since the digital person interaction device provided by the embodiment of the present disclosure corresponds to the digital person interaction method provided by the above embodiments, implementation of the digital person interaction method described above is also applicable to the digital person interaction device provided by the embodiment of the present disclosure, and will not be described in detail in the following embodiments.
Fig. 10 is a schematic structural diagram of a digital human interaction device according to an embodiment of the present disclosure.
As shown in fig. 10, the digital person interaction device 1000 of the embodiment of the present disclosure includes a speech recognition module 1001, a transceiver module 1002, and a sending module 1003.
The speech recognition module 1001 is configured to receive speech sent by the digital person terminal and perform speech recognition on the speech to obtain a query text.
The transceiver module 1002 is configured to send the query text to a large language model and receive a reply result of the query text returned by the large language model.
The sending module 1003 is configured to send a rendering instruction to a rendering engine based on the reply result, the rendering engine rendering and generating a digital person video stream.
In one embodiment of the present disclosure, the transceiver module 1002 is further configured to: acquire the locally cached historical dialogue text based on the query text; and input the historical dialogue text and the query text into the large language model, which acquires the reply result based on them.
In one embodiment of the present disclosure, the transceiver module 1002 is further configured to: combine the historical dialogue text and the query text to generate the guidance prompt information of the large language model; and input the guidance prompt information into the large language model, which acquires the reply result based on it.
In one embodiment of the present disclosure, the transceiver module 1002 is further configured to: establish a transmission interface with the large language model through the adapter of the large language model; and send the query text to the large language model through the transmission interface.
In one embodiment of the present disclosure, the transceiver module 1002 is further configured to: receive, through the transmission interface, the reply result returned by the large language model in a streaming mode based on a streaming protocol.
According to the digital person interaction device of the embodiments of the present disclosure, the user's speech is converted into a query text, the query text is sent to the large language model, and the model generates the reply result, so a more natural and accurate reply can be obtained and the real-time performance and response speed of the interaction are improved. By receiving the reply result and sending a rendering instruction to the rendering engine based on it, a digital person video stream can be generated, giving the digital person high-fidelity human language understanding and generation capability, providing more natural and intelligent interaction, letting the user receive faster, more immediate replies, and improving the user's personalized experience.
Fig. 11 is a schematic structural diagram of a digital human interaction device according to an embodiment of the present disclosure.
As shown in fig. 11, the digital person interaction device 1100 of the embodiment of the present disclosure includes a receiving module 1101, a transceiver module 1102, and a rendering module 1103.
The receiving module 1101 is configured to receive a rendering instruction sent by the digital person central controller, where the rendering instruction is generated based on the reply result of the query text fed back by the large language model.
The transceiver module 1102 is configured to send a text request to a text-to-speech (TTS) service based on the rendering instruction and receive audio data generated by the TTS service based on the text request.
The rendering module 1103 is configured to render and generate the digital person video stream based on the audio data.
In one embodiment of the present disclosure, the rendering module 1103 is further configured to: generate an expression-driving BS data sequence based on the audio data; and perform action and expression rendering on the digital person based on the BS sequence to generate the digital person video stream.
In one embodiment of the present disclosure, the rendering module 1103 is further configured to: send the digital person video stream to the real-time audio-video (RTC) service, which receives the acquisition request of the digital person terminal and sends the digital person video stream to the terminal.
In one embodiment of the present disclosure, the transceiver module 1102 is further configured to: receive the audio data streamed back by the TTS service based on a streaming protocol.
According to the digital person interaction device of the embodiments of the present disclosure, the rendering engine receives the rendering instruction sent by the digital person central controller, sends a text request to the TTS service, and requests the service to convert the text into audio data. An expression-driving BS data sequence is then obtained from the audio data to render the digital person and generate a digital person video stream, achieving a vivid visual effect, making the digital person more natural and intelligent, and improving the user experience.
Fig. 12 is a schematic structural diagram of a digital human interaction device according to an embodiment of the present disclosure.
As shown in fig. 12, the digital human interaction device 1200 of the embodiment of the disclosure includes a receiving module 1201, a determining module 1202, and a transmitting module 1203.
The receiving module 1201 is configured to receive the query text sent by the digital person central controller.
The determining module 1202 is configured to determine a reply result of the query text based on the query text.
The sending module 1203 is configured to send the reply result to the digital person central controller.
In one embodiment of the present disclosure, the device is further configured to: receive the historical dialogue text sent by the digital person central controller; and acquire the reply result based on the historical dialogue text and the query text.
In one embodiment of the present disclosure, the device is further configured to: receive the guidance prompt information of the large language model sent by the digital person central controller, where the guidance prompt information is obtained by combining the historical dialogue text and the query text; and acquire the reply result based on the guidance prompt information.
In one embodiment of the present disclosure, the sending module 1203 is further configured to: stream the reply result back to the digital person central controller based on a streaming protocol.
According to the digital person interaction device of the embodiments of the present disclosure, the large language model receives the query text sent by the digital person central controller and generates the reply result, so a more natural and accurate reply can be obtained and the real-time performance and response speed of the interaction are improved. The historical dialogue text and the query text are combined into guidance prompt information, from which the large language model obtains the context; this lets the model better understand the context information and produce a personalized reply result, providing interaction that is more customized and better meets the user's needs.
In the technical solution of the present disclosure, the acquisition, storage, and application of the personal information of the users involved all comply with the provisions of relevant laws and regulations and do not violate public order and good customs.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 13 illustrates a schematic block diagram of an example electronic device 1300 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown here, their connections and relationships, and their functions are meant to be exemplary only and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 13, the device 1300 includes a computing unit 1301 that can perform various appropriate actions and processes according to a computer program/instructions stored in a read-only memory (ROM) 1302 or loaded from a storage unit 1308 into a random access memory (RAM) 1303. In the RAM 1303, various programs and data required for the operation of the device 1300 can also be stored. The computing unit 1301, the ROM 1302, and the RAM 1303 are connected to each other through a bus 1304. An input/output (I/O) interface 1305 is also connected to the bus 1304.
Various components in device 1300 are connected to I/O interface 1305, including: an input unit 1306 such as a keyboard, a mouse, or the like; an output unit 1307 such as various types of displays, speakers, and the like; storage unit 1308, such as a magnetic disk, optical disk, etc.; and a communication unit 1309 such as a network card, a modem, a wireless communication transceiver, or the like. The communication unit 1309 allows the device 1300 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 1301 may be any of various general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 1301 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 1301 performs the methods and processes described above, for example the digital person interaction method. In some embodiments, the digital person interaction method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 1308. In some embodiments, some or all of the computer program/instructions may be loaded and/or installed onto the device 1300 via the ROM 1302 and/or the communication unit 1309. When the computer program/instructions are loaded into the RAM 1303 and executed by the computing unit 1301, one or more steps of the digital person interaction method described above may be performed. Alternatively, in other embodiments, the computing unit 1301 may be configured to perform the digital person interaction method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include: implementation in one or more computer programs/instructions that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special or general purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out the methods of the present disclosure may be written in any combination of one or more programming languages. Such program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), the internet, and blockchain networks.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs/instructions running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions of the present disclosure can be achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (29)

1. A digital person interaction method, wherein the method is performed by a digital person central control, the method comprising:
receiving speech sent by a digital person terminal, and performing speech recognition on the speech to obtain a query text;
sending the query text to a large language model, and receiving a reply result of the query text returned by the large language model; and
sending a rendering instruction to a rendering engine based on the reply result, and generating a digital person video stream based on rendering by the rendering engine.
2. The method of claim 1, wherein the sending the query text to a large language model and receiving a reply result of the query text returned by the large language model comprises:
acquiring, based on the query text, a historical dialogue text cached locally; and
inputting the historical dialogue text and the query text into the large language model, and acquiring, by the large language model, the reply result based on the historical dialogue text and the query text.
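A minimal sketch of the locally cached historical dialogue text of claim 2 might look like this; the session-keyed dictionary and the bounded turn window are assumptions, since the claim only states that the history is cached locally:

    # Hypothetical sketch of a local dialogue cache keyed by session.
    from collections import defaultdict, deque

    class DialogueCache:
        def __init__(self, max_turns=20):
            # Keep a bounded number of recent turns per session.
            self._store = defaultdict(lambda: deque(maxlen=max_turns))

        def append(self, session_id, role, text):
            self._store[session_id].append((role, text))

        def history(self, session_id):
            return list(self._store[session_id])

    cache = DialogueCache()
    cache.append("s1", "user", "hello")
    cache.append("s1", "assistant", "hi there")
    print(cache.history("s1"))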
3. The method of claim 2, wherein the inputting the historical dialogue text and the query text into the large language model and the acquiring, by the large language model, the reply result based on the historical dialogue text and the query text comprises:
combining the historical dialogue text and the query text to generate guide prompt information for the large language model; and
inputting the guide prompt information into the large language model, and acquiring, by the large language model, the reply result based on the guide prompt information.
4. The method of claim 1 or 3, wherein the sending the query text to a large language model comprises:
establishing a transmission interface with the large language model through an adapter of the large language model; and
sending the query text to the large language model through the transmission interface.
5. The method of claim 4, wherein the receiving the reply result of the query text returned by the large language model comprises:
receiving, through the transmission interface, the reply result streamed back by the large language model based on a streaming protocol.
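As an illustrative, non-limiting sketch of the flow recited in claims 1-5, the central control could be organized as follows; every class, function, and parameter name here is an assumption, since the claims do not prescribe concrete APIs:

    # Hypothetical sketch of the central control flow of claims 1-5:
    # speech -> query text -> large language model (via an adapter) ->
    # streamed reply -> rendering instruction.
    from dataclasses import dataclass
    from typing import Iterator

    @dataclass
    class RenderInstruction:
        reply_text: str

    class LLMAdapter:
        """Assumed adapter that hides one model's transport details behind
        a uniform transmission interface (claim 4)."""
        def send_query(self, query_text: str) -> Iterator[str]:
            # Stand-in for a streaming call to the model (claim 5).
            for chunk in ["Hello", ", ", "how can I help?"]:
                yield chunk

    def recognize_speech(audio_bytes: bytes) -> str:
        # Placeholder for a real speech recognizer.
        return "what can you do"

    def central_control_turn(audio_bytes: bytes, adapter: LLMAdapter) -> RenderInstruction:
        query_text = recognize_speech(audio_bytes)
        reply = "".join(adapter.send_query(query_text))  # drain the stream
        return RenderInstruction(reply_text=reply)

    print(central_control_turn(b"\x00\x01", LLMAdapter()))

Wrapping each backing model behind the same adapter interface is what would allow the central control to swap large language models without changing the rest of the pipeline.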
6. A digital person interaction method, wherein the method is performed by a rendering engine, the method comprising:
receiving a rendering instruction sent by a digital person central control, wherein the rendering instruction is generated based on a reply result of a query text fed back by a large language model;
sending a text request to a text-to-speech (TTS) service based on the rendering instruction, and receiving audio data generated by the TTS service based on the text request;
rendering and generating a digital person video stream based on the audio data.
7. The method of claim 6, wherein the rendering and generating the digital person video stream based on the audio data comprises:
generating an expression-driving blendshape (BS) sequence based on the audio data; and
performing action and expression rendering on the digital person based on the BS sequence to generate the digital person video stream.
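To make claim 7 concrete, the following sketch derives a BS sequence from audio; the energy-based jaw-open mapping, the 25 fps frame rate, and the sample rate are illustrative assumptions, since real systems typically use a learned audio-to-blendshape model:

    # Hypothetical sketch of deriving an expression-driving blendshape (BS)
    # sequence from audio. The energy-to-jaw-open mapping is a stand-in for
    # a learned model; frame rate and sample rate are assumptions.
    def audio_to_bs_sequence(samples, sample_rate=16000, fps=25):
        """samples: floats in [-1, 1]; returns one dict of blendshape
        weights per video frame."""
        hop = sample_rate // fps  # audio samples per video frame
        frames = []
        for start in range(0, len(samples), hop):
            window = samples[start:start + hop]
            energy = sum(abs(s) for s in window) / max(len(window), 1)
            frames.append({"jawOpen": min(1.0, energy * 4.0)})
        return frames

    bs = audio_to_bs_sequence([0.0, 0.2, -0.3, 0.5] * 4000)
    print(len(bs), bs[0])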
8. The method of claim 6 or 7, wherein after the rendering and generating the digital person video stream based on the audio data, the method further comprises:
sending the digital person video stream to a real-time audio-video (RTC) service; and
receiving an acquisition request for the digital person and sending the digital person video stream to the digital person terminal.
9. The method of claim 6 or 7, wherein the receiving audio data generated by the TTS service based on the text request comprises:
receiving, based on a streaming protocol, the audio data streamed back by the TTS service.
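Taken together, claims 6, 8, and 9 describe a loop the rendering engine could implement as in the following sketch; every service interface shown is an assumed stand-in rather than an API named by the disclosure:

    # Hypothetical sketch of the rendering engine loop: stream audio from
    # the TTS service (claim 9), render frames, and push them to the RTC
    # service (claim 8). Payloads are fake placeholders.
    from typing import Iterator

    def tts_stream(text: str) -> Iterator[bytes]:
        # Stand-in for a TTS service streaming audio chunks back.
        for word in text.split():
            yield word.encode("utf-8")  # fake "audio" payload

    def render_frame(audio_chunk: bytes) -> bytes:
        # Stand-in for rendering one video frame from audio-driven expressions.
        return b"frame:" + audio_chunk

    class RTCService:
        # Stand-in for the real-time audio-video service.
        def push(self, frame: bytes) -> None:
            print("pushed", frame)

    def rendering_engine(reply_text: str, rtc: RTCService) -> None:
        for audio_chunk in tts_stream(reply_text):
            rtc.push(render_frame(audio_chunk))

    rendering_engine("hello digital person", RTCService())

Because audio arrives and frames are pushed chunk by chunk, the terminal can begin playback while the TTS service is still synthesizing the tail of the reply.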
10. A digital person interaction method, wherein the method is performed by a large language model, the method comprising:
receiving a query text sent by a digital person central control;
determining a reply result of the query text based on the query text; and
sending the reply result to the digital person central control.
11. The method of claim 10, wherein the method further comprises:
receiving a historical dialogue text sent by the digital person central control; and
acquiring the reply result based on the historical dialogue text and the query text.
12. The method of claim 11, wherein the method further comprises:
receiving guide prompt information for the large language model sent by the digital person central control, wherein the guide prompt information is obtained by combining the historical dialogue text and the query text; and
acquiring the reply result based on the guide prompt information.
13. The method of claim 10, wherein the sending the reply result to the digital person central control comprises:
streaming the reply result back to the digital person central control based on a streaming protocol.
14. A digital person interaction apparatus, wherein the apparatus is executed by a digital person central control, the apparatus comprising:
a speech recognition module, configured to receive speech sent by a digital person terminal and perform speech recognition on the speech to obtain a query text;
a transceiver module, configured to send the query text to a large language model and receive a reply result of the query text returned by the large language model; and
a sending module, configured to send a rendering instruction to a rendering engine based on the reply result and generate a digital person video stream based on rendering by the rendering engine.
15. The apparatus of claim 14, wherein the transceiver module is further configured to:
acquire, based on the query text, a historical dialogue text cached locally; and
input the historical dialogue text and the query text into the large language model, and acquire, by the large language model, the reply result based on the historical dialogue text and the query text.
16. The apparatus of claim 15, wherein the transceiver module is further configured to:
combine the historical dialogue text and the query text to generate guide prompt information for the large language model; and
input the guide prompt information into the large language model, and acquire, by the large language model, the reply result based on the guide prompt information.
17. The apparatus of claim 14 or 16, wherein the transceiver module is further configured to:
establish a transmission interface with the large language model through an adapter of the large language model; and
send the query text to the large language model through the transmission interface.
18. The apparatus of claim 17, wherein the transceiver module is further configured to:
receive, through the transmission interface, the reply result streamed back by the large language model based on a streaming protocol.
19. A digital person interaction apparatus, wherein the apparatus is executed by a rendering engine, the apparatus comprising:
a receiving module, configured to receive a rendering instruction sent by a digital person central control, wherein the rendering instruction is generated based on a reply result of a query text fed back by a large language model;
a transceiver module, configured to send a text request to a text-to-speech (TTS) service based on the rendering instruction and receive audio data generated by the TTS service based on the text request; and
a rendering module, configured to render and generate a digital person video stream based on the audio data.
20. The apparatus of claim 19, wherein the rendering module is further to:
generate an expression-driving blendshape (BS) sequence based on the audio data; and
perform action and expression rendering on the digital person based on the BS sequence to generate the digital person video stream.
21. The apparatus of claim 19 or 20, wherein the rendering module is further configured to:
send the digital person video stream to a real-time audio-video (RTC) service; and
receive an acquisition request for the digital person and send the digital person video stream to the digital person terminal.
22. The apparatus of claim 19 or 20, wherein the transceiver module is further configured to:
receive, based on a streaming protocol, the audio data streamed back by the TTS service.
23. A digital person interaction apparatus, wherein the apparatus is executed by a large language model, the apparatus comprising:
a receiving module, configured to receive a query text sent by a digital person central control;
a determining module, configured to determine a reply result of the query text based on the query text; and
a sending module, configured to send the reply result to the digital person central control.
24. The apparatus of claim 23, wherein the apparatus is further configured to:
receive a historical dialogue text sent by the digital person central control; and
acquire the reply result based on the historical dialogue text and the query text.
25. The apparatus of claim 24, wherein the apparatus is further configured to:
receive guide prompt information for the large language model sent by the digital person central control, wherein the guide prompt information is obtained by combining the historical dialogue text and the query text; and
acquire the reply result based on the guide prompt information.
26. The apparatus of claim 23, wherein the sending module is further configured to:
stream the reply result back to the digital person central control based on a streaming protocol.
27. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-5, or to perform the method of any one of claims 6-9, or to perform the method of any one of claims 10-13.
28. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-5, or the method of any one of claims 6-9, or the method of any one of claims 10-13.
29. A computer program product comprising a computer program/instructions which, when executed by a processor, implement the method of any one of claims 1-5, or the method of any one of claims 6-9, or the method of any one of claims 10-13.
CN202311207038.1A 2023-09-18 2023-09-18 Digital person interaction method and device, electronic equipment and storage medium Pending CN117275476A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311207038.1A CN117275476A (en) 2023-09-18 2023-09-18 Digital person interaction method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311207038.1A CN117275476A (en) 2023-09-18 2023-09-18 Digital person interaction method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117275476A true CN117275476A (en) 2023-12-22

Family

ID=89220765

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311207038.1A Pending CN117275476A (en) 2023-09-18 2023-09-18 Digital person interaction method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117275476A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117892828A (en) * 2024-03-18 2024-04-16 青岛市勘察测绘研究院 Natural language interaction method, device, equipment and medium for geographic information system
CN117892828B (en) * 2024-03-18 2024-05-31 青岛市勘察测绘研究院 Natural language interaction method, device, equipment and medium for geographic information system

Similar Documents

Publication Publication Date Title
WO2021022992A1 (en) Dialog generation model training method and device, and dialog generation method and device, and medium
US20210280190A1 (en) Human-machine interaction
CN112466302B (en) Voice interaction method and device, electronic equipment and storage medium
WO2017186050A1 (en) Segmented sentence recognition method and device for human-machine intelligent question-answer system
US20220310072A1 (en) Two-pass end to end speech recognition
CN114895817B (en) Interactive information processing method, network model training method and device
CN112214591A (en) Conversation prediction method and device
CN115309877A (en) Dialog generation method, dialog model training method and device
CN116303962B (en) Dialogue generation method, training method, device and equipment for deep learning model
CN111951782A (en) Voice question and answer method and device, computer readable storage medium and electronic equipment
US20230058437A1 (en) Method for human-computer interaction, apparatus for human-computer interaction, device, and storage medium
CN117275476A (en) Digital person interaction method and device, electronic equipment and storage medium
US11216497B2 (en) Method for processing language information and electronic device therefor
CN116821290A (en) Multitasking dialogue-oriented large language model training method and interaction method
CN114207711A (en) System and method for recognizing speech of user
CN114064943A (en) Conference management method, conference management device, storage medium and electronic equipment
CN117636874A (en) Robot dialogue method, system, robot and storage medium
CN112785667A (en) Video generation method, device, medium and electronic equipment
CN115905490B (en) Man-machine interaction dialogue method, device and equipment
CN109002498B (en) Man-machine conversation method, device, equipment and storage medium
KR20210123545A (en) Method and apparatus for conversation service based on user feedback
CN113763925B (en) Speech recognition method, device, computer equipment and storage medium
CN114860910A (en) Intelligent dialogue method and system
CN114490967A (en) Training method of dialogue model, dialogue method and device of dialogue robot and electronic equipment
KR102479026B1 (en) QUERY AND RESPONSE SYSTEM AND METHOD IN MPEG IoMT ENVIRONMENT

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination