CN117727303A - Audio and video generation method, device, equipment and storage medium - Google Patents

Audio and video generation method, device, equipment and storage medium

Info

Publication number
CN117727303A
Authority
CN
China
Prior art keywords
digital person
voice data
reply
information
interactive object
Prior art date
Legal status
Pending
Application number
CN202410175540.7A
Other languages
Chinese (zh)
Inventor
廖少毅
容培淼
董伟
Current Assignee
Yidong Huanqiu Shenzhen Digital Technology Co ltd
Original Assignee
Yidong Huanqiu Shenzhen Digital Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Yidong Huanqiu Shenzhen Digital Technology Co ltd
Priority to CN202410175540.7A
Publication of CN117727303A

Landscapes

  • User Interface Of Digital Computer (AREA)

Abstract

The embodiment of the application discloses a method, a device, equipment and a storage medium for generating audio and video, wherein the method comprises the following steps: acquiring an image containing an interactive object through an image sensor, and acquiring voice data through a voice sensor; performing feature analysis on the image to obtain posture information of the interactive object; acquiring posture information of the digital person according to the posture information of the interactive object; obtaining reply voice data corresponding to the voice data; generating a digital person video according to the reply voice data and the posture information of the digital person, wherein the mouth shape of the digital person in the digital person video matches the reply voice data, and the posture of the digital person in the digital person video matches the posture information of the digital person; and constructing audio and video based on the digital person video and the reply voice data, and outputting the audio and video. By adopting the embodiment of the application, the interactivity between the digital person and the interactive object can be effectively improved, so that the digital person in the played audio and video is more personified.

Description

Audio and video generation method, device, equipment and storage medium
Technical Field
The present invention relates to the field of artificial intelligence, and in particular, to a method, an apparatus, a device, and a storage medium for generating audio and video.
Background
With the rapid development of artificial intelligence technology, human-machine dialogue, once inconceivable, has become a reality, and its forms of presentation have become increasingly diversified. One such form replies to the user's speech through a constructed digital human figure, whose mouth shape changes with the reply content. However, the digital human figures constructed at present are rather stiff: they are neither personalized nor personified and differ greatly from a real human. That is, the interactivity between the constructed digital person and the user is insufficient.
Disclosure of Invention
The technical problem to be solved by the embodiments of the present application is to provide a method, an apparatus, a device, and a storage medium for generating an audio/video, which can effectively improve the interactivity between a digital person and an interactive object, thereby ensuring that the digital person in the played audio/video is more personified.
In a first aspect, an embodiment of the present application provides a method for generating an audio and video, including:
acquiring an image containing an interactive object through an image sensor, and acquiring voice data through a voice sensor; wherein the interactive object refers to an object that interacts with a digital person;
performing feature analysis on the image to obtain posture information of the interactive object;
acquiring the posture information of the digital person according to the posture information of the interactive object; wherein the gesture information of the digital person is matched with the gesture information of the interactive object;
obtaining reply voice data corresponding to the voice data;
generating a digital person video according to the reply voice data and the gesture information of the digital person; the mouth shape of the digital person in the digital person video is matched with the reply voice data, and the gesture of the digital person in the digital person video is matched with the gesture information of the digital person;
and constructing an audio and video based on the digital human video and the reply voice data, and outputting the audio and video.
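Illustratively, the overall flow recited in the first aspect can be organized as in the following Python sketch. Every injected callable (analyze_pose, match_pose, generate_reply_speech, render_video, mux) is a hypothetical placeholder standing in for the corresponding step above, and does not limit how that step is implemented.

from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class AudioVideoPipeline:
    # Each field is a hypothetical callable standing in for one claimed step.
    analyze_pose: Callable[[Any], dict]              # image -> posture information of the interactive object
    match_pose: Callable[[dict], dict]               # object posture -> matched digital-person posture
    generate_reply_speech: Callable[[bytes], bytes]  # voice data -> reply voice data
    render_video: Callable[[bytes, dict], Any]       # (reply speech, digital-person posture) -> digital person video
    mux: Callable[[Any, bytes], Any]                 # (video, reply speech) -> output audio and video

    def run(self, image: Any, voice_data: bytes) -> Any:
        object_pose = self.analyze_pose(image)                 # feature analysis on the captured image
        person_pose = self.match_pose(object_pose)             # posture matched to the interactive object
        reply_speech = self.generate_reply_speech(voice_data)
        video = self.render_video(reply_speech, person_pose)   # mouth shape follows the reply speech
        return self.mux(video, reply_speech)                   # construct and output the audio and video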
In an alternative embodiment, the pose information includes a head feature and an eye feature;
the step of obtaining the gesture information of the digital person according to the gesture information of the interactive object comprises the following steps:
acquiring the line-of-sight orientation of the interactive object according to the eye feature, the head feature, and the correspondence between these features and the line-of-sight orientation;
determining the line-of-sight orientation of the digital person according to the line-of-sight orientation of the interactive object; wherein the line-of-sight orientation of the digital person is consistent with the line-of-sight orientation of the interactive object;
acquiring head features and eye features of the digital person based on the line-of-sight orientation of the digital person; wherein the head features and eye features of the digital person are used to control the line-of-sight orientation of the digital person;
pose information is generated that includes head features and eye features of the digital person.
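Illustratively, the mapping from the interactive object's head and eye features to gaze-related posture information of the digital person may take a form such as the following sketch. The linear correction coefficients and the decision to keep the digital person's line of sight identical to the estimated orientation (rather than, say, mirrored toward the object) are assumptions for illustration only.

def estimate_gaze_direction(head_yaw_pitch, eye_offset):
    # Hypothetical correspondence between head/eye features and line-of-sight
    # orientation: head pose (yaw, pitch in degrees) corrected by a normalized
    # eye offset; the 30/20 degree scale factors are assumed values.
    yaw = head_yaw_pitch[0] + 30.0 * eye_offset[0]
    pitch = head_yaw_pitch[1] + 20.0 * eye_offset[1]
    return yaw, pitch

def digital_person_gaze_pose(object_head_pose, object_eye_offset):
    yaw, pitch = estimate_gaze_direction(object_head_pose, object_eye_offset)
    # Head and eye features that control the digital person's line of sight,
    # kept consistent with the interactive object's line-of-sight orientation.
    return {"head": {"yaw": yaw, "pitch": pitch},
            "eyes": {"yaw": yaw, "pitch": pitch}}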
In an alternative embodiment, the gesture information includes limb characteristics;
the step of obtaining the gesture information of the digital person according to the gesture information of the interactive object comprises the following steps:
determining the limb characteristics of the interactive object as the limb characteristics of the digital person;
pose information is generated that includes limb characteristics of the digital person.
In an alternative embodiment, the gesture information includes limb characteristics;
the step of obtaining the gesture information of the digital person according to the gesture information of the interactive object comprises the following steps:
determining the limb characteristics of the digital person according to the limb characteristics of the interactive object and the reply text data corresponding to the voice data;
pose information is generated that includes limb characteristics of the digital person.
In an alternative embodiment, the gesture information includes an expression feature;
The step of obtaining the gesture information of the digital person according to the gesture information of the interactive object comprises the following steps:
determining the expression characteristics of the digital person according to the expression characteristics of the interactive object and the reply text data corresponding to the voice data;
and generating posture information containing the expression characteristics of the digital person.
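Illustratively, the alternative embodiments above (deriving the digital person's limb features and expression features from the interactive object's features together with the reply text data) could be realized by simple rules such as the following sketch; the keyword lists and feature names are illustrative assumptions rather than features fixed by the application.

def select_limb_feature(object_limb_feature: str, reply_text: str) -> str:
    # Limb feature chosen from the reply content and the object's own action.
    if "welcome" in reply_text.lower():
        return "wave_hand"          # gesture matched to what the reply expresses
    if object_limb_feature == "raise_hand":
        return "nod_and_point"      # limb action that interacts with the object's action
    return object_limb_feature      # otherwise imitate the interactive object

def select_expression_feature(object_expression: str, reply_text: str) -> str:
    # Expression feature chosen from the reply content and the object's expression.
    if "sorry" in reply_text.lower():
        return "apologetic"
    if object_expression == "smile":
        return "smile"              # respond in kind to the object's expression
    return "neutral"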
In an alternative embodiment, the method further comprises:
invoking the selected voice service in response to a selection operation of any one of the plurality of voice services; wherein the plurality of voice services includes a local voice service and a third party voice service;
the obtaining the reply voice data corresponding to the voice data includes:
and acquiring reply voice data corresponding to the voice data through the called voice service.
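Illustratively, the selection between a local voice service and a third-party voice service can be driven by a configuration file, as in the following sketch; the class names and the "voice_service" configuration key are assumptions.

import json

class LocalVoiceService:
    def reply(self, voice_data: bytes) -> bytes:
        raise NotImplementedError("locally deployed ASR + dialogue + TTS")

class ThirdPartyVoiceService:
    def reply(self, voice_data: bytes) -> bytes:
        raise NotImplementedError("remote ASR + dialogue + TTS over the network")

def load_selected_service(config_path: str):
    # The selection operation is recorded in the configuration file.
    with open(config_path, encoding="utf-8") as f:
        selected = json.load(f).get("voice_service", "local")
    return LocalVoiceService() if selected == "local" else ThirdPartyVoiceService()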
In an optional implementation manner, the obtaining the reply voice data corresponding to the voice data includes:
performing text conversion on the voice data to obtain text data corresponding to the voice data;
if question information whose similarity to the text data is greater than a preset threshold exists in the local database, searching for reply information corresponding to the question information in the local database, and determining the reply information as reply text data corresponding to the text data; wherein at least one piece of question information and the reply information corresponding to each piece of question information are stored in the local database;
and performing voice conversion on the reply text data to obtain reply voice data corresponding to the voice data.
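Illustratively, the similarity lookup against the local database could be realized as in the following sketch; the string-similarity measure and the threshold of 0.8 are assumptions, not values fixed by this embodiment.

from difflib import SequenceMatcher

def lookup_reply_text(text_data: str, local_db: dict, threshold: float = 0.8):
    # local_db maps stored question information to its reply information.
    best_question, best_score = None, 0.0
    for question in local_db:
        score = SequenceMatcher(None, text_data, question).ratio()
        if score > best_score:
            best_question, best_score = question, score
    if best_score > threshold:
        return local_db[best_question]   # reply text data for the matched question
    return None                          # caller falls back to the AI large language model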
In an alternative embodiment, the acquiring the voice data by the voice sensor includes:
collecting voice data of an interactive object in real time, and if the time period from the last collection end point to the current system time reaches the preset duration, acquiring a voice fragment collected from the last collection end point to the current system time;
the obtaining the reply voice data corresponding to the voice data includes:
and obtaining the reply voice data corresponding to the voice fragment.
In an alternative embodiment, the acquiring the voice data by the voice sensor includes:
collecting voice data of an interactive object in real time, and if the number of phonemes collected from the last collection end point to the current system time reaches a preset threshold value, obtaining a voice fragment collected from the last collection end point to the current system time;
the obtaining the reply voice data corresponding to the voice data includes:
and obtaining the reply voice data corresponding to the voice fragment.
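Illustratively, the two triggering conditions above (a preset duration elapsed since the last collection end point, or a preset number of collected phonemes) can be combined in one streaming segmenter such as the following sketch; the concrete threshold values are assumptions.

import time

class SpeechSegmenter:
    def __init__(self, max_seconds: float = 1.0, max_phonemes: int = 20):
        self.max_seconds = max_seconds     # assumed preset duration
        self.max_phonemes = max_phonemes   # assumed preset phoneme threshold
        self.last_cut = time.monotonic()   # last collection end point
        self.buffer = bytearray()
        self.phoneme_count = 0

    def feed(self, frame: bytes, phonemes_in_frame: int):
        # Returns a speech segment when either threshold is reached, else None.
        self.buffer.extend(frame)
        self.phoneme_count += phonemes_in_frame
        now = time.monotonic()
        if now - self.last_cut >= self.max_seconds or self.phoneme_count >= self.max_phonemes:
            segment = bytes(self.buffer)   # segment for which reply voice data is obtained
            self.buffer.clear()
            self.phoneme_count = 0
            self.last_cut = now
            return segment
        return None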
In a second aspect, an embodiment of the present application provides an audio/video generating device, where the device includes:
An input unit for acquiring an image containing an interactive object through an image sensor and acquiring voice data through a voice sensor; wherein, the interactive object refers to an object which interacts with a digital person;
the processing unit is used for carrying out feature analysis on the image to obtain the posture information of the interactive object; acquiring the posture information of the digital person according to the posture information of the interactive object; wherein the gesture information of the digital person is matched with the gesture information of the interactive object; obtaining reply voice data corresponding to the voice data; generating a digital person video according to the reply voice data and the gesture information of the digital person; the mouth shape of the digital person in the digital person video is matched with the reply voice data, and the gesture of the digital person in the digital person video is matched with the gesture information of the digital person; constructing an audio/video based on the digital human video and the reply voice data;
and the output unit is used for outputting the audio and video.
In a third aspect, embodiments of the present application provide a computer device including a memory, a communication interface, and a processor, where the memory, the communication interface, and the processor are connected to each other; the memory stores a computer program and the processor invokes the computer program stored in the memory for implementing the method of the first aspect.
In a fourth aspect, embodiments of the present application provide a computer readable storage medium storing a computer program which, when executed by a processor, implements the method of the first aspect described above.
In a fifth aspect, embodiments of the present application provide a computer program product comprising computer program code which, when run on a computer, causes the computer to perform the method of the first aspect described above.
In a sixth aspect, embodiments of the present application provide a computer program comprising computer program code which, when run on a computer, causes the computer to perform the method of the first aspect described above.
In the embodiment of the application, an image containing an interactive object can be acquired through an image sensor, feature analysis is performed on the image to obtain the posture information of the interactive object, and the posture information of the digital person is acquired according to the posture information of the interactive object, the posture information of the digital person matching the posture information of the interactive object. Voice data can also be obtained through a voice sensor, a digital person video is generated according to the reply voice data corresponding to the voice data and the posture information of the digital person, an audio and video is constructed based on the digital person video and the reply voice data, and the audio and video is output. Because the mouth shape of the digital person in the digital person video matches the reply voice data and the posture of the digital person matches the posture information of the digital person, limb interaction between the interactive object and the digital person can be realized, the interactivity between the digital person and the interactive object can be effectively improved, and the digital person in the played audio and video is made more personified.
Drawings
In order to more clearly describe the embodiments of the present invention or the technical solutions in the background art, the drawings required for describing the embodiments of the present invention or the background art are briefly introduced below.
Fig. 1 is a schematic architecture diagram of an audio/video generation system according to an embodiment of the present application;
fig. 2 is a flow chart of a method for generating an audio and video according to an embodiment of the present application;
fig. 3 is a flowchart of another audio/video generation method according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an audio/video generating device according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
Embodiments of the present invention will be described below with reference to the accompanying drawings in the embodiments of the present invention.
The audio and video generation method provided by the embodiment of the application can be applied to any one of the following scenarios: interactive customer-service digital persons, virtual teacher digital persons, virtual live-streaming digital persons, and the like.
Taking an interactive customer-service digital person or a virtual live-streaming digital person as an example, the voice data of the interactive object can be collected and replied to by using artificial intelligence technology, so as to obtain reply voice data corresponding to the voice data. In addition, an image containing the interactive object can be acquired, feature analysis can be performed on the image to obtain the posture information of the interactive object, and the posture information of the digital person can be obtained according to the posture information of the interactive object, where the posture information of the digital person matches the posture information of the interactive object. A digital person video can then be generated according to the reply voice data and the posture information of the digital person, where the mouth shape of the digital person in the digital person video matches the reply voice data and the posture of the digital person matches the posture information of the digital person. An audio and video is constructed based on the digital person video and the reply voice data and played to the interactive object, so that an interactive dialogue with the interactive object can be realized and services such as shopping guidance, product recommendation and question answering can be provided. This enhances the participation of the interactive object and makes the shopping experience more humanized and personalized.
Taking a virtual teacher digital person as an example, a real teacher can input voice; based on the audio and video generation method provided by the embodiment of the application, the voice data of the real teacher (i.e., the interactive object) is collected and replied to. In addition, an image containing the real teacher can be acquired, feature analysis can be performed on the image to obtain the posture information of the real teacher, and the posture information of the digital person can be obtained according to the posture information of the real teacher, where the posture information of the digital person matches the posture information of the real teacher. A digital person video can be generated according to the reply voice data and the posture information of the digital person, where the mouth shape of the digital person in the digital person video is consistent with the mouth shape of the real teacher when outputting the voice, and the posture of the digital person is consistent with the posture of the real teacher when outputting the voice. An audio and video, namely a teaching audio and video, is constructed based on the digital person video and the reply voice data; the audio contained in the teaching audio and video may be a voice recording of the real teacher, or synthesized audio with a designated timbre whose content is consistent with the teacher's speech. Applied to distance education or online training, the virtual teacher digital person can provide a teaching experience with more interactivity and better communication, and can enhance the learner's sense of participation and learning effect.
The audio and video generation method provided by the embodiment of the application can be applied to a client, a server or computer equipment, wherein the client or the server can run in the computer equipment, and the computer equipment comprises, but is not limited to, a mobile phone, a computer, intelligent voice interaction equipment, intelligent household appliances, a vehicle-mounted terminal, an aircraft and the like.
Taking the audio and video generation method as an example, the client can acquire the image containing the interactive object through the image sensor and acquire the voice data through the voice sensor. The client performs feature analysis on the image to obtain the gesture information of the interactive object, and the client obtains the gesture information of the digital person according to the gesture information of the interactive object. In addition, the client can acquire the reply voice data corresponding to the voice data. The client generates a digital person video according to the reply voice data and the gesture information of the digital person, the client constructs an audio and video based on the digital person video and the reply voice data, and the client plays the audio and video.
Taking the example that the audio and video generation method is applied to a server, the client can acquire an image containing an interactive object through an image sensor, acquire voice data through a voice sensor, and send the image and the voice data to the server. The server performs feature analysis on the image to obtain the posture information of the interactive object, and the server obtains the posture information of the digital person according to the posture information of the interactive object. In addition, the server may obtain reply voice data corresponding to the voice data. The server generates a digital person video according to the reply voice data and the gesture information of the digital person, the server constructs an audio and video based on the digital person video and the reply voice data, and the server transmits the audio and video to the client. And the client plays the audio and video.
Taking the example that the audio and video generation method is applied to computer equipment, the computer equipment can acquire images containing interactive objects through an image sensor and acquire voice data through a voice sensor. The computer equipment performs feature analysis on the image to obtain the gesture information of the interactive object, and obtains the gesture information of the digital person according to the gesture information of the interactive object. In addition, the computer device may obtain reply voice data corresponding to the voice data. The computer equipment generates a digital person video according to the reply voice data and the gesture information of the digital person, builds an audio and video based on the digital person video and the reply voice data, and plays the audio and video.
In one example, the computer device may be an intelligent customer service, and any client may obtain voice data submitted by the interactive object through the session interface, and obtain an image of the interactive object when submitting the voice data, and then send the voice data and the image to a certain intelligent customer service. After receiving the voice data and the image, the intelligent customer service can conduct feature analysis on the image to obtain gesture information of an interactive object, obtain gesture information of a digital person according to the gesture information of the interactive object, obtain reply voice data corresponding to the voice data, generate digital person video according to the reply voice data and the gesture information of the digital person, construct audio and video based on the digital person video and the reply voice data, and send the audio and video to the client. After receiving the audio and video sent by the intelligent customer service, the client can play the audio and video in the session interface.
In another example, the computer device may be a chat robot that may obtain voice data submitted by the interactive object through a user interface and obtain an image of the interactive object at the time the voice data was submitted. Then, the chat robot can conduct feature analysis on the image to obtain gesture information of the interaction object, obtain gesture information of the digital person according to the gesture information of the interaction object, obtain reply voice data corresponding to the voice data, generate digital person video according to the reply voice data and the gesture information of the digital person, construct audio and video based on the digital person video and the reply voice data, and play the audio and video on the user interface.
The computer device includes, but is not limited to, at least one of a server (e.g., intelligent customer service), a terminal (e.g., chat robot), etc., that can be configured to perform the method provided by embodiments of the present application. In other words, the audio/video generation method may be performed by software or hardware installed in a terminal device or a server device, where the software may be a blockchain platform or a content distribution platform. The service end includes but is not limited to: a single server, a server cluster, a cloud server or a cloud server cluster, and the like. The server may be an independent server, or may be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), and basic cloud computing services such as big data and artificial intelligence platforms.
The audio and video generation method provided by the embodiment of the application can be widely applied to the fields of intelligent customer service, intelligent medical treatment, financial analysis, staff training, school education and the like.
In particular embodiments of the present application, where user-related data, such as image or voice data, is involved, when embodiments of the present application are applied to particular products or technologies, user permissions or consents need to be obtained, and the collection, use and processing of the related data requires compliance with local laws and regulations and standards.
Referring to fig. 1, fig. 1 is a schematic architecture diagram of an audio/video generation system according to an embodiment of the present application. The audio-visual generation system may generate artificial intelligence (Artificial Intelligence, AI) driven digital persons. The audio/video generation system may include, for example, a voice recognition module, a human body recognition module, a content generation module, and a persona driving module.
The voice recognition module is used for collecting voice data of the interactive object. For example, a digital person may be displayed on the interactive interface and voice data of the interactive object may be collected in real time through the microphone, and the voice recognition module may transmit the collected voice data to the content generation module.
The content generation module is used for generating reply voice data corresponding to the voice data, for example, text conversion can be carried out on the voice data to obtain text data, then the artificial intelligent model is used for generating reply text data corresponding to the text data, and voice conversion is carried out on the reply text data to obtain reply voice data.
For example, the content generation module may include an intelligent voice technology service and an intelligent voice technology service operated by a custom server, and the service in use can be switched through a plug-in configuration file. The intelligent voice technology services mainly provide speech synthesis and speech recognition. The plug-in therefore implements functions including audio recording, audio playing, multithreaded operation, release of recording resources, and various data communication modes. The voice audio is stored in memory in various forms, such as wav (a sound file format), binary array, pulse code modulation (Pulse Code Modulation, PCM) and character string, for use by different program functions. Meanwhile, the blueprint source code with which the MetaHuman SDK receives wav audio is modified: an audio byte-array variable is added to the C++ method that receives the wav audio, and when needed the byte array is converted into wav, so that the MetaHuman SDK can also receive audio data as a binary array or a character string. The running time is thereby reduced, and the interactive object can quickly obtain the content fed back by the digital person.
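Illustratively, the byte-array-to-wav conversion mentioned above could be performed as in the following sketch, which is independent of the actual plug-in code; the sample rate, channel count and sample width are assumptions.

import io
import wave

def pcm_bytes_to_wav(pcm: bytes, sample_rate: int = 16000, channels: int = 1, sample_width: int = 2) -> bytes:
    # Wraps a raw PCM byte array in a wav container held in memory.
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)   # 2 bytes per sample -> 16-bit PCM
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)
    return buf.getvalue()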
And the human body identification module is used for acquiring images containing the interactive objects. For example, an image including the interactive object may be acquired in real time by a camera or an infrared sensor, and the image may be subjected to feature analysis to obtain posture information of the interactive object. The human body recognition module may transmit the gesture information of the interactive object to the character driving module.
And the persona driving module is used for acquiring the gesture information of the digital person so as to generate the audio and video containing the digital person. For example, the pose information of the digital person may be obtained according to the pose information of the interactive object, where the pose information of the digital person matches the pose information of the interactive object. And then generating a digital person video according to the reply voice data output by the content generation module and the gesture information of the digital person, constructing an audio and video based on the digital person video and the reply voice data, and outputting the audio and video.
Illustratively, the persona driving module comprises a mouth-shape driver and a body driver. The mouth-shape driver generates an Unreal Engine (UE) animation sequence from the driving data generated by artificial intelligence, so as to play the animation asset of the digital person. Meanwhile, action recognition based on OpenPose is performed on the camera image to acquire a plurality of key points of the interactive object; the expression and position information of the interactive object is detected according to these key points, and the action and expression of the interactive object are judged, so that the digital person can look at the interactive object or imitate its action by means of the animation assets. The digital person can thus be made to follow the eye movements and mouth movements of the interactive object's face. In addition, according to the interactive content (namely, the reply voice data) received by the digital person, an emotion classification of the digital person, including smiling, confusion, comprehension and the like, can be generated; after the different emotions and semantics of the interactive object are obtained, an interactive response is made, so that the interactive object enjoys interacting with the digital person.
In an alternative embodiment, the content generation module may send the collected voice data to an Automatic Speech Recognition (ASR) interface of the server, which converts the voice data into corresponding text data using an artificial intelligence model; the text data is then submitted to a ChatGPT (Chat Generative Pre-trained Transformer) interface of the server, which generates reply text data corresponding to the text data using an artificial intelligence model; the reply text data is submitted to a Text-To-Speech (TTS) interface of the server, which converts the reply text data into corresponding reply voice data and returns the reply voice data to the content generation module.
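Illustratively, the chained ASR, dialogue and speech-synthesis calls described above could be organized as in the following sketch; the endpoint URLs and JSON field names are assumptions and do not correspond to the server interfaces actually used.

import requests

SERVER = "http://localhost:8000"  # assumed server address

def reply_speech_for(voice_data: bytes) -> bytes:
    # voice data -> text data (ASR)
    text = requests.post(f"{SERVER}/asr", data=voice_data, timeout=10).json()["text"]
    # text data -> reply text data (dialogue model)
    reply_text = requests.post(f"{SERVER}/chat", json={"text": text}, timeout=30).json()["reply"]
    # reply text data -> reply voice data (TTS)
    return requests.post(f"{SERVER}/tts", json={"text": reply_text}, timeout=30).content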
In an alternative embodiment, the audio-video generation system may further include an AI large language model and a local database. For example, the content generation module may perform text conversion on the collected voice data to obtain text data corresponding to the voice data, and then determine whether reply information corresponding to matching question information exists in the local database. If it exists, the reply information is determined as the reply text data corresponding to the text data, and voice conversion is performed on the reply text data to obtain the reply voice data corresponding to the voice data; if it does not exist, the text data is input into the AI large language model to obtain the reply text data corresponding to the text data, and voice conversion is performed on the reply text data to obtain the reply voice data corresponding to the voice data. At least one piece of question information and the reply information corresponding to each piece of question information are stored in the local database.
In an alternative embodiment, the audio-video generation system may further include a character image module. The character image module is used to construct the digital person, for example to create the digital person's hair and clothing. For example, a realistic virtual digital person may be generated with the Unreal Engine; to give the digital person more character, Maya may be used to make the hair and clothing, and model editing may be performed by exporting the character fbx model from the Unreal Engine into the Maya software. In Maya, an interactive groom is created for the hair: starting from perfectly straight hair on the scalp, the hair is "combed" with the grooming tool to create the desired hairstyle. The clothing is made by creating cloth, duplicating the digital person's skeletal mesh as the passive collision body of the cloth, and then adjusting the resistance in the cloth's properties so that the clothing is converted to dynamic simulation. The edited hair and clothing assets are then exported from Maya in the Unreal Engine asset format and imported back into the Unreal Engine. Finally, the hair asset and the skeletal mesh asset of the bound clothing are bound through the Unreal Engine Groom, and their materials are replaced. Optionally, Maya may also be used to make static mesh models of terrain, buildings, roads, vegetation and the like for the scenes in the digital person video. The scene plus the image of the personalized character (i.e., the digital person) constitutes a complete picture. That is, the present application provides realistic digital persons and scenes.
In an alternative embodiment, the audio-video generation system may include a text acquisition module for acquiring text data, for example, text data input by the interactive object in the text input box may be received. The text acquisition module sends the text data to the content generation module, and then the content generation module generates reply text data corresponding to the text data by utilizing the artificial intelligent model, and performs voice conversion on the reply text data to obtain reply voice data.
In this embodiment, not only the voice data of the interactive object may be acquired through the voice recognition module, but also the text data of the interactive object may be acquired through the text acquisition module.
Among other things, AI large language models may include large language models (Large Language Model, LLM) or ChatGPT models, etc.
LLM, also known as a large language model, is an artificial intelligence model that aims to understand and generate human language. An LLM can be trained on a large amount of text data and can perform a wide range of tasks, including text summarization, translation, emotion analysis, intelligent question answering, and the like.
ChatGPT is a chat robot program developed by OpenAI. ChatGPT is a natural language processing tool driven by artificial intelligence technology; it can generate answers based on the patterns and statistics seen in its pre-training phase, and can also interact according to the chat context, much like a human. Its Transformer architecture enables ChatGPT to understand the grammar and semantics of human language by analyzing the input corpus, and to generate fluent and coherent responses accordingly.
Based on the above description, please refer to fig. 2, fig. 2 is a flowchart of a method for generating an audio and video according to an embodiment of the present application, where the method for generating an audio and video may be executed by a computer device such as a server or a terminal; the audio/video generation method shown in fig. 2 includes, but is not limited to, step S201 to step S206, where:
S201, acquiring an image including an interactive object by an image sensor, and acquiring voice data by a voice sensor.
Where an interactive object refers to an object that interacts with a digital person, the interactive object may refer to a user, for example.
The image sensor may include a camera or an infrared sensor, for example, the image sensor may collect a video of an interactive object when submitting voice data, where the video includes one or more images, and the image includes the interactive object.
Wherein the voice sensor may comprise a microphone or the like.
In an alternative embodiment, in order to increase the real-time dialogue speed, a streaming mode may be used for the intelligent interfaces and for the digital person data communication: the digital person interactive animation sequence is generated, and speech recognition and speech synthesis are performed, for a specified phrase or a specified number of words at a time, so that the digital person's speech and animation sequence can be played and the digital person can reply to the dialogue within a short time (for example, 1 second).
In one example, voice data of the interactive object may be collected in real time, if the time period from the last collection end point to the current system time reaches the preset duration, a voice segment collected from the last collection end point to the current system time is obtained, and then reply voice data corresponding to the voice segment may be obtained. Alternatively, the image sensor may also acquire an image of the interactive object at the time of submitting the speech segment.
In another example, voice data of the interactive object may be collected in real time, if the number of phonemes collected from the last collection end point to the current system time reaches a preset threshold, a voice segment collected from the last collection end point to the current system time is obtained, and then reply voice data corresponding to the voice segment may be obtained. Alternatively, the image sensor may also acquire an image of the interactive object at the time of submitting the speech segment.
S202, performing feature analysis on the image to obtain the posture information of the interactive object.
Specifically, feature analysis can be performed on the interactive object in the image to obtain the pose information of the interactive object. The pose information may include one or more of the following: head features and eye features; limb features; expression features. The head features and eye features may be used to characterize the gaze of the interactive object. The limb features may be used to characterize the limb movements of the interactive object. The expression features may be used to characterize the facial expression of the interactive object.
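Illustratively, the feature analysis of step S202 could rely on an off-the-shelf key-point model; the following sketch uses the open-source MediaPipe Pose model, which is only one possible choice, and the derived feature names are assumptions.

import cv2
import mediapipe as mp

def analyze_pose(image_bgr):
    # Extract body key points and summarize them as coarse posture information.
    rgb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)
    with mp.solutions.pose.Pose(static_image_mode=True) as detector:
        result = detector.process(rgb)
    if not result.pose_landmarks:
        return {}
    lm = result.pose_landmarks.landmark
    nose, l_wrist, r_wrist = lm[0], lm[15], lm[16]   # MediaPipe landmark indices
    return {
        "head": {"x": nose.x, "y": nose.y},
        "limb": "raise_hand" if min(l_wrist.y, r_wrist.y) < nose.y else "idle",
    }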
S203, according to the gesture information of the interactive object, the gesture information of the digital person is obtained, and the gesture information of the digital person is matched with the gesture information of the interactive object.
In an optional embodiment, in the case that the gesture information of the interactive object includes the head feature and the eye feature of the interactive object, the line-of-sight orientation of the interactive object may be obtained according to the eye feature, the head feature, and the correspondence between these features and the line-of-sight orientation; the line-of-sight orientation of the digital person is determined according to the line-of-sight orientation of the interactive object, the line-of-sight orientation of the digital person being consistent with that of the interactive object; the head features and eye features of the digital person are acquired based on the line-of-sight orientation of the digital person, the head features and eye features of the digital person being used to control the line-of-sight orientation of the digital person; and gesture information including the head features and eye features of the digital person is generated.
In this embodiment, the digital person can be controlled to make eye contact with the interactive object according to the eye movements, mouth movements and the like of the interactive object's face, so that eye contact between the digital person and the interactive object in the played audio and video is realized.
In an alternative embodiment, in the case that the gesture information of the interactive object includes the limb feature of the interactive object, the limb feature of the interactive object may be determined as the limb feature of the digital person; pose information is generated that includes limb features of the digital person.
In the embodiment, the digital person can be controlled to simulate the limb actions of the interactive object, so that the limb interaction between the digital person and the interactive object in the played audio and video is realized.
In an alternative embodiment, in the case that the gesture information of the interactive object includes the limb characteristics of the interactive object, the limb characteristics of the digital person may be determined according to the limb characteristics of the interactive object and the reply text data corresponding to the voice data; pose information is generated that includes limb features of the digital person.
In this embodiment, the limb actions (such as gestures, body orientations, etc.) of the digital person can be controlled to match with the contents expressed by the reply text data, and meanwhile, the limb actions of the digital person and the limb actions of the interactive objects are controlled to have interaction, so that the limb interaction between the digital person and the interactive objects in the played audio and video is realized.
In an optional implementation manner, in the case that the gesture information of the interactive object includes the expression feature of the interactive object, the expression feature of the digital person may be determined according to the expression feature of the interactive object and the reply text data corresponding to the voice data; pose information including the expressive features of the digital person is generated.
In this embodiment, an emotion classification of the digital person, including smiling, confusion, comprehension and the like, is generated according to the interactive content received by the digital person, and after the different emotions and semantics of the interactive object are obtained, different animations are used for the interactive response, so that the interactive object enjoys interacting with the digital person.
In the above embodiments, the AI-driven digital person can interact with the user in real time through voice and images. It can not only answer questions accurately but also express emotion and attitude; it has an ultra-realistic human appearance and conveys emotional information through facial expressions and actions. Interaction with the digital person is thus rich in emotional resonance, and the human-machine interaction experience is enhanced.
S204, obtaining reply voice data corresponding to the voice data.
In an alternative embodiment, the selected voice service may be invoked in response to a selection operation of any one of the plurality of voice services; wherein the plurality of voice services includes a local voice service and a third party voice service; and acquiring reply voice data corresponding to the voice data through the called voice service.
According to research, Microsoft voice is currently the only available intelligent voice plug-in, and its speech recognition and speech synthesis both require network requests, so its operation takes time and is affected by network conditions and by the plug-in itself. In order to increase the speed of digital person interaction and reduce the factors that slow down driving the digital person, a plug-in is used to call a local intelligent voice technology, and an independent program service is deployed locally, thereby reducing the time consumed by network transmission during speech recognition and speech synthesis. That is, the application invokes the intelligent voice service by creating a C++ plug-in whose functions include recording, calling the intelligent speech recognition and speech synthesis services through an asynchronous communication protocol, and setting voice parameters through a configuration file, i.e., using the local intelligent voice service or a third-party intelligent voice service as the situation requires.
In an alternative embodiment, text conversion may be performed on the voice data to obtain text data corresponding to the voice data. If question information whose similarity to the text data is greater than a preset threshold exists in the local database, the reply information corresponding to the question information is retrieved from the local database and determined as the reply text data corresponding to the text data; at least one piece of question information and the reply information corresponding to each piece of question information are stored in the local database. If no question information whose similarity is greater than the preset threshold exists in the local database, the text data is processed through the intelligent voice service (namely, the artificial intelligence model) to obtain the reply text data corresponding to the text data. After the reply text data is obtained, voice conversion can be performed on the reply text data to obtain the reply voice data corresponding to the voice data.
In this embodiment, the same question submitted by different interactive objects often occurs in daily real-time interaction. To avoid wasting artificial intelligence resources, the embodiment of the application uses the artificial intelligence model together with a local database: when keyword retrieval finds, in the local database, question information whose similarity to the text data is greater than a preset threshold, the reply information corresponding to the question information is looked up in the local database; when no such question information exists in the local database, the reply text data corresponding to the text data is obtained through the intelligent voice service.
S205, generating a digital person video according to the reply voice data and the gesture information of the digital person, wherein the mouth shape of the digital person in the digital person video is matched with the reply voice data, and the gesture of the digital person in the digital person video is matched with the gesture information of the digital person.
S206, constructing an audio and video based on the digital human video and the reply voice data, and outputting the audio and video.
In an alternative implementation, the questions and answers produced during interaction with the interactive object can be stored, and the data volume of the local database can be optimized and enriched through continuous communication with interactive objects, so as to achieve iterative knowledge upgrading of the data.
According to the embodiment of the application, an image containing the interactive object is obtained through the image sensor, feature analysis is performed on the image to obtain the posture information of the interactive object, and the posture information of the digital person is obtained according to the posture information of the interactive object, the posture information of the digital person matching the posture information of the interactive object. Voice data can also be obtained through the voice sensor, a digital person video is generated according to the reply voice data corresponding to the voice data and the posture information of the digital person, an audio and video is constructed based on the digital person video and the reply voice data, and the audio and video is output. Because the mouth shape of the digital person in the digital person video matches the reply voice data and the posture of the digital person matches the posture information of the digital person, limb interaction between the interactive object and the digital person can be realized, the interactivity between the digital person and the interactive object can be effectively improved, and the digital person in the played audio and video is made more personified.
Based on the above description, please refer to fig. 3, fig. 3 is a flowchart of another audio/video generation method provided in the embodiment of the present application, where the audio/video generation method may be executed by a computer device such as a server or a terminal. The audio/video generation method shown in fig. 3 includes, but is not limited to, step S301 to step S308, where:
S301, recording the interactive object through a microphone, and acquiring voice data of the interactive object.
S302, performing text conversion on the voice data to obtain text data corresponding to the voice data.
S303, judging whether question information whose similarity to the text data is greater than a preset threshold exists in the local database.
S304, if such question information exists, searching for the reply information corresponding to the question information in the local database, and determining the reply information as the reply text data corresponding to the text data.
S305, if no such question information exists, processing the text data through the intelligent voice service to obtain the reply text data corresponding to the text data.
S306, performing voice conversion on the reply text data to obtain reply voice data corresponding to the voice data.
S307, generating a digital person video according to the reply voice data, wherein the mouth shape of the digital person in the digital person video is matched with the reply voice data.
S308, constructing an audio and video based on the digital human video and the reply voice data, and outputting the audio and video.
Based on the description of the related embodiments, the embodiments of the present application further provide an audio/video generating apparatus, where the audio/video generating apparatus may perform the operations performed by the computer device shown in fig. 2 or fig. 3. Referring to fig. 4, fig. 4 is a schematic structural diagram of an audio/video generating device according to an embodiment of the present application. As shown in fig. 4, the audio/video generating apparatus may include, but is not limited to, an input unit 401, a processing unit 402, and an output unit 403.
An input unit 401 for acquiring an image containing an interactive object by an image sensor and acquiring voice data by a voice sensor; wherein, the interactive object refers to an object which interacts with a digital person;
a processing unit 402, configured to perform feature analysis on the image to obtain pose information of the interactive object; acquiring the posture information of the digital person according to the posture information of the interactive object; wherein the gesture information of the digital person is matched with the gesture information of the interactive object; obtaining reply voice data corresponding to the voice data; generating a digital person video according to the reply voice data and the gesture information of the digital person; the mouth shape of the digital person in the digital person video is matched with the reply voice data, and the gesture of the digital person in the digital person video is matched with the gesture information of the digital person; constructing an audio/video based on the digital human video and the reply voice data;
And an output unit 403, configured to output the audio and video.
In an alternative embodiment, the pose information includes a head feature and an eye feature;
the processing unit 402 obtains the gesture information of the digital person according to the gesture information of the interactive object, including:
acquiring the line-of-sight orientation of the interactive object according to the eye feature, the head feature, and the correspondence between these features and the line-of-sight orientation;
determining the line-of-sight orientation of the digital person according to the line-of-sight orientation of the interactive object; wherein the line-of-sight orientation of the digital person is consistent with the line-of-sight orientation of the interactive object;
acquiring head features and eye features of the digital person based on the line-of-sight orientation of the digital person; wherein the head features and eye features of the digital person are used to control the line-of-sight orientation of the digital person;
pose information is generated that includes head features and eye features of the digital person.
In an alternative embodiment, the gesture information includes limb characteristics;
the processing unit 402 obtains the gesture information of the digital person according to the gesture information of the interactive object, including:
determining the limb characteristics of the interactive object as the limb characteristics of the digital person;
Pose information is generated that includes limb characteristics of the digital person.
In an alternative embodiment, the gesture information includes limb characteristics;
the processing unit 402 obtains the gesture information of the digital person according to the gesture information of the interactive object, including:
determining the limb characteristics of the digital person according to the limb characteristics of the interactive object and the reply text data corresponding to the voice data;
pose information is generated that includes limb characteristics of the digital person.
In an alternative embodiment, the gesture information includes an expression feature;
the processing unit 402 obtains the gesture information of the digital person according to the gesture information of the interactive object, including:
determining the expression characteristics of the digital person according to the expression characteristics of the interactive object and the reply text data corresponding to the voice data;
and generating posture information containing the expression characteristics of the digital person.
In an alternative embodiment, the processing unit 402 is further configured to invoke the selected voice service in response to a selection operation of any one of the plurality of voice services; wherein the plurality of voice services includes a local voice service and a third party voice service;
The processing unit 402 obtains reply voice data corresponding to the voice data, including:
and acquiring reply voice data corresponding to the voice data through the called voice service.
In an alternative embodiment, the processing unit 402 obtains reply voice data corresponding to the voice data, including:
performing text conversion on the voice data to obtain text data corresponding to the voice data;
if question information whose similarity to the text data is greater than a preset threshold exists in the local database, searching for reply information corresponding to the question information in the local database, and determining the reply information as reply text data corresponding to the text data; wherein at least one piece of question information and the reply information corresponding to each piece of question information are stored in the local database;
and performing voice conversion on the reply text data to obtain reply voice data corresponding to the voice data.
In an alternative embodiment, the input unit 401 acquires voice data through a voice sensor, including:
collecting voice data of an interactive object in real time, and if the time period from the last collection end point to the current system time reaches the preset duration, acquiring a voice fragment collected from the last collection end point to the current system time;
The processing unit 402 obtains reply voice data corresponding to the voice data, including:
and obtaining the reply voice data corresponding to the voice fragment.
In an alternative embodiment, the input unit 401 acquires voice data through a voice sensor, including:
collecting voice data of an interactive object in real time, and if the number of phonemes collected from the last collection end point to the current system time reaches a preset threshold value, obtaining a voice fragment collected from the last collection end point to the current system time;
the processing unit 402 obtains reply voice data corresponding to the voice data, including:
and obtaining the reply voice data corresponding to the voice fragment.
In this embodiment of the present application, the processing unit 402 obtains an image containing the interactive object through the image sensor, performs feature analysis on the image to obtain the posture information of the interactive object, and obtains the posture information of the digital person according to the posture information of the interactive object, where the posture information of the digital person matches the posture information of the interactive object. Voice data can also be obtained through the voice sensor, a digital person video is generated according to the reply voice data corresponding to the voice data and the posture information of the digital person, an audio and video is constructed based on the digital person video and the reply voice data, and the output unit 403 outputs the audio and video. Because the mouth shape of the digital person in the digital person video matches the reply voice data and the posture of the digital person matches the posture information of the digital person, limb interaction between the interactive object and the digital person can be realized, the interactivity between the digital person and the interactive object can be effectively improved, and the digital person in the played audio and video is made more personified.
The embodiment of the application also provides a computer device; please refer to fig. 5, which is a schematic structural diagram of the computer device provided in the embodiment of the application. As shown in fig. 5, the computer device includes at least a processor 501, a memory 502, and a communication interface 503, which may be connected by a bus 504 or in other ways; in this embodiment, connection by the bus 504 is taken as an example. The processor 501 of the embodiment of the present application may execute the operations of the foregoing audio and video generation method by running a computer program stored in the memory 502, for example:
after the communication interface 503 obtains an image containing the interactive object through the image sensor and obtains voice data through the voice sensor, performing feature analysis on the image to obtain posture information of the interactive object;
acquiring the posture information of the digital person according to the posture information of the interactive object; wherein the gesture information of the digital person is matched with the gesture information of the interactive object;
obtaining reply voice data corresponding to the voice data;
generating a digital person video according to the reply voice data and the gesture information of the digital person; the mouth shape of the digital person in the digital person video is matched with the reply voice data, and the gesture of the digital person in the digital person video is matched with the gesture information of the digital person;
Constructing an audio/video based on the digital human video and the reply voice data;
the audio and video are output through the communication interface 503.
In an alternative embodiment, the pose information includes a head feature and an eye feature;
the processor 501 is configured to, when acquiring the pose information of the digital person according to the pose information of the interactive object, perform the following operations:
acquiring the gaze direction of the interactive object according to the eye features, the head features and the correspondence between these features and gaze directions;
determining the gaze direction of the digital person according to the gaze direction of the interactive object; wherein the gaze direction of the digital person is consistent with the gaze direction of the interactive object;
acquiring head features and eye features of the digital person based on the gaze direction of the digital person; wherein the digital person's head features and eye features are used to control the digital person's gaze direction;
pose information is generated that includes head features and eye features of the digital person.
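As an illustration of this gaze-matching step, the sketch below assumes head pose and eye rotation are already available as yaw/pitch angles. The additive combination, the dataclass and the head/eye split used to drive the avatar are assumptions for illustration, not the disclosed correspondence itself.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class Gaze:
    yaw_deg: float    # horizontal line-of-sight angle
    pitch_deg: float  # vertical line-of-sight angle


def object_gaze_from_features(head_yaw: float, head_pitch: float,
                              eye_yaw: float, eye_pitch: float) -> Gaze:
    # Assumed correspondence: final gaze = head orientation + eye-in-head rotation.
    return Gaze(head_yaw + eye_yaw, head_pitch + eye_pitch)


def digital_person_gaze(object_gaze: Gaze) -> Gaze:
    # The embodiment keeps the digital person's gaze consistent with the
    # interactive object's gaze, so the same angles are reused here.
    return Gaze(object_gaze.yaw_deg, object_gaze.pitch_deg)


def head_and_eye_features_for(gaze: Gaze, eye_share: float = 0.3) -> Dict[str, Tuple[float, float]]:
    # Split the target gaze back into head and eye features that control the
    # digital person's gaze direction; the 30/70 split is purely an assumption.
    return {
        "head": (gaze.yaw_deg * (1 - eye_share), gaze.pitch_deg * (1 - eye_share)),
        "eyes": (gaze.yaw_deg * eye_share, gaze.pitch_deg * eye_share),
    }
```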
In an alternative embodiment, the gesture information includes limb characteristics;
the processor 501 is configured to, when acquiring the pose information of the digital person according to the pose information of the interactive object, perform the following operations:
Determining the limb characteristics of the interactive object as the limb characteristics of the digital person;
pose information is generated that includes limb characteristics of the digital person.
In an alternative embodiment, the gesture information includes limb characteristics;
the processor 501 is configured to, when acquiring the pose information of the digital person according to the pose information of the interactive object, perform the following operations:
determining the limb characteristics of the digital person according to the limb characteristics of the interactive object and the reply text data corresponding to the voice data;
pose information is generated that includes limb characteristics of the digital person.
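A rough sketch of combining the interactive object's limb features with the reply text is shown below: the digital person starts by following the interactive object's limbs, and the reply text may override the gesture. The keyword-to-gesture table and the dictionary representation of limb features are assumptions for illustration only.

```python
from typing import Dict

# Hypothetical mapping from reply-text keywords to gestures.
GESTURE_BY_KEYWORD = {
    "welcome": "open_arms",
    "sorry": "hand_on_chest",
    "goodbye": "wave",
}


def digital_person_limb_features(object_limb_features: Dict[str, str],
                                 reply_text: str) -> Dict[str, str]:
    # Start by following the interactive object's limbs (the simpler embodiment above).
    limb_features = dict(object_limb_features)
    # Then let the reply text override the gesture where appropriate.
    lowered = reply_text.lower()
    for keyword, gesture in GESTURE_BY_KEYWORD.items():
        if keyword in lowered:
            limb_features["gesture"] = gesture
            break
    return limb_features
```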
In an alternative embodiment, the gesture information includes an expression feature;
the processor 501 is configured to, when acquiring the pose information of the digital person according to the pose information of the interactive object, perform the following operations:
determining the expression characteristics of the digital person according to the expression characteristics of the interactive object and the reply text data corresponding to the voice data;
and generating posture information containing the expression characteristics of the digital person.
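A comparable sketch for the expression features follows, where the reply text takes priority and the interactive object's expression is otherwise mirrored; the keyword sets and the string labels for expressions are assumptions for illustration.

```python
POSITIVE_WORDS = {"glad", "great", "congratulations", "welcome"}
NEGATIVE_WORDS = {"sorry", "unfortunately", "regret"}


def digital_person_expression(object_expression: str, reply_text: str) -> str:
    words = set(reply_text.lower().split())
    if words & POSITIVE_WORDS:
        return "smile"
    if words & NEGATIVE_WORDS:
        return "concerned"
    # Otherwise mirror the interactive object's expression.
    return object_expression
```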
In an alternative embodiment, the processor 501 is further configured to perform the following operations:
Invoking the selected voice service in response to a selection operation of any one of the plurality of voice services; wherein the plurality of voice services includes a local voice service and a third party voice service;
the processor 501 is configured to, when acquiring the reply voice data corresponding to the voice data, perform the following operations:
and acquiring reply voice data corresponding to the voice data through the called voice service.
In an alternative embodiment, the processor 501 is configured to, when acquiring the reply voice data corresponding to the voice data, perform the following operations:
performing text conversion on the voice data to obtain text data corresponding to the voice data;
if question information whose similarity to the text data is greater than a preset threshold exists in the local database, searching the local database for the reply information corresponding to that question information, and determining the reply information as the reply text data corresponding to the text data; wherein the local database stores at least one piece of question information and the reply information corresponding to each piece of question information;
and performing voice conversion on the reply text data to obtain reply voice data corresponding to the voice data.
In an alternative embodiment, the communication interface 503 is configured to perform the following operations when voice data is acquired by a voice sensor:
collecting voice data of the interactive object in real time, and if the period from the previous collection end point to the current system time reaches a preset duration, acquiring the voice fragment collected between the previous collection end point and the current system time;
the processor 501 is configured to, when acquiring the reply voice data corresponding to the voice data, perform the following operations:
and obtaining the reply voice data corresponding to the voice fragment.
In an alternative embodiment, the communication interface 503 is configured to perform the following operations when voice data is acquired by a voice sensor:
collecting voice data of the interactive object in real time, and if the number of phonemes collected from the previous collection end point to the current system time reaches a preset threshold, acquiring the voice fragment collected between the previous collection end point and the current system time;
the processor 501 is configured to, when acquiring the reply voice data corresponding to the voice data, perform the following operations:
and obtaining the reply voice data corresponding to the voice fragment.
In this embodiment of the present application, the processor 501 obtains an image containing the interactive object through an image sensor, performs feature analysis on the image to obtain the posture information of the interactive object, and obtains the posture information of the digital person according to the posture information of the interactive object, where the two are matched. Voice data can also be acquired through the voice sensor, a digital person video is generated according to the reply voice data corresponding to the voice data and the posture information of the digital person, an audio and video is constructed based on the digital person video and the reply voice data, and the communication interface 503 outputs the audio and video. Because the mouth shape of the digital person in the digital person video is matched with the reply voice data and the posture of the digital person in the digital person video is matched with the posture information of the digital person, limb-level interaction between the interactive object and the digital person can be realized, the interactivity between the digital person and the interactive object is effectively improved, and the digital person in the played audio and video is more personified.
The present application also provides a computer readable storage medium having a computer program stored thereon, which when executed by a processor, implements the steps of any of the method embodiments described above.
The present application also provides a computer program product comprising computer program code which, when run on a computer, causes the computer to perform the steps of any of the method embodiments described above.
The embodiment of the application further provides a chip, which comprises a memory and a processor, wherein the memory is used for storing a computer program, and the processor is used for calling and running the computer program from the memory, so that the device provided with the chip executes the steps in any method embodiment.
The steps in the method of the embodiment of the application can be sequentially adjusted, combined and deleted according to actual needs.
The units in the device of the embodiment of the application can be combined, divided and deleted according to actual needs.

Claims (12)

1. The audio and video generation method is characterized by comprising the following steps:
acquiring an image containing the interactive object through an image sensor, and acquiring voice data through a voice sensor; wherein, the interactive object refers to an object which interacts with a digital person;
performing feature analysis on the image to obtain the posture information of the interactive object;
acquiring the posture information of the digital person according to the posture information of the interactive object; wherein the gesture information of the digital person is matched with the gesture information of the interactive object;
Obtaining reply voice data corresponding to the voice data;
generating a digital person video according to the reply voice data and the gesture information of the digital person; the mouth shape of the digital person in the digital person video is matched with the reply voice data, and the gesture of the digital person in the digital person video is matched with the gesture information of the digital person;
and constructing an audio and video based on the digital human video and the reply voice data, and outputting the audio and video.
2. The method of claim 1, wherein the pose information includes head features and eye features;
the step of obtaining the gesture information of the digital person according to the gesture information of the interactive object comprises the following steps:
acquiring the gaze direction of the interactive object according to the eye features, the head features and the correspondence between these features and gaze directions;
determining the gaze direction of the digital person according to the gaze direction of the interactive object; wherein the gaze direction of the digital person is consistent with the gaze direction of the interactive object;
acquiring head features and eye features of the digital person based on the gaze direction of the digital person; wherein the digital person's head features and eye features are used to control the digital person's gaze direction;
Pose information is generated that includes head features and eye features of the digital person.
3. The method of claim 1, wherein the gesture information comprises limb characteristics;
the step of obtaining the gesture information of the digital person according to the gesture information of the interactive object comprises the following steps:
determining the limb characteristics of the interactive object as the limb characteristics of the digital person;
pose information is generated that includes limb characteristics of the digital person.
4. A method as claimed in claim 3, wherein the gesture information comprises limb characteristics;
the step of obtaining the gesture information of the digital person according to the gesture information of the interactive object comprises the following steps:
determining the limb characteristics of the digital person according to the limb characteristics of the interactive object and the reply text data corresponding to the voice data;
pose information is generated that includes limb characteristics of the digital person.
5. The method of claim 1, wherein the gesture information comprises an expressive feature;
the step of obtaining the gesture information of the digital person according to the gesture information of the interactive object comprises the following steps:
determining the expression characteristics of the digital person according to the expression characteristics of the interactive object and the reply text data corresponding to the voice data;
And generating posture information containing the expression characteristics of the digital person.
6. The method of claim 1, wherein the method further comprises:
invoking the selected voice service in response to a selection operation of any one of the plurality of voice services; wherein the plurality of voice services includes a local voice service and a third party voice service;
the obtaining the reply voice data corresponding to the voice data includes:
and acquiring reply voice data corresponding to the voice data through the called voice service.
7. The method of claim 1, wherein the obtaining reply voice data corresponding to the voice data comprises:
performing text conversion on the voice data to obtain text data corresponding to the voice data;
if question information whose similarity to the text data is greater than a preset threshold exists in the local database, searching the local database for the reply information corresponding to that question information, and determining the reply information as the reply text data corresponding to the text data; wherein the local database stores at least one piece of question information and the reply information corresponding to each piece of question information;
And performing voice conversion on the reply text data to obtain reply voice data corresponding to the voice data.
8. The method of claim 1, wherein the obtaining voice data by a voice sensor comprises:
collecting voice data of the interactive object in real time, and if the period from the previous collection end point to the current system time reaches a preset duration, acquiring the voice fragment collected between the previous collection end point and the current system time;
the obtaining the reply voice data corresponding to the voice data includes:
and obtaining the reply voice data corresponding to the voice fragment.
9. The method of claim 1, wherein the obtaining voice data by a voice sensor comprises:
collecting voice data of the interactive object in real time, and if the number of phonemes collected from the previous collection end point to the current system time reaches a preset threshold, acquiring the voice fragment collected between the previous collection end point and the current system time;
the obtaining the reply voice data corresponding to the voice data includes:
and obtaining the reply voice data corresponding to the voice fragment.
10. An audio/video generation device, characterized in that the device comprises:
an input unit for acquiring an image containing an interactive object through an image sensor and acquiring voice data through a voice sensor; wherein, the interactive object refers to an object which interacts with a digital person;
the processing unit is used for carrying out feature analysis on the image to obtain the posture information of the interactive object; acquiring the posture information of the digital person according to the posture information of the interactive object; wherein the gesture information of the digital person is matched with the gesture information of the interactive object; obtaining reply voice data corresponding to the voice data; generating a digital person video according to the reply voice data and the gesture information of the digital person; the mouth shape of the digital person in the digital person video is matched with the reply voice data, and the gesture of the digital person in the digital person video is matched with the gesture information of the digital person; constructing an audio/video based on the digital human video and the reply voice data;
and the output unit is used for outputting the audio and video.
11. A computer device comprising a memory, a communication interface, and a processor, wherein the memory, the communication interface, and the processor are interconnected; the memory stores a computer program, and the processor invokes the computer program stored in the memory for implementing the method of any one of claims 1 to 9.
12. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 9.
CN202410175540.7A 2024-02-08 2024-02-08 Audio and video generation method, device, equipment and storage medium Pending CN117727303A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410175540.7A CN117727303A (en) 2024-02-08 2024-02-08 Audio and video generation method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410175540.7A CN117727303A (en) 2024-02-08 2024-02-08 Audio and video generation method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117727303A true CN117727303A (en) 2024-03-19

Family

ID=90207352

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410175540.7A Pending CN117727303A (en) 2024-02-08 2024-02-08 Audio and video generation method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117727303A (en)

Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104965426A (en) * 2015-06-24 2015-10-07 百度在线网络技术(北京)有限公司 Intelligent robot control system, method and device based on artificial intelligence
CN108513089A (en) * 2017-02-24 2018-09-07 腾讯科技(深圳)有限公司 The method and device of group's video session
CN109840019A (en) * 2019-02-22 2019-06-04 网易(杭州)网络有限公司 Control method, device and the storage medium of virtual portrait
CN110413841A (en) * 2019-06-13 2019-11-05 深圳追一科技有限公司 Polymorphic exchange method, device, system, electronic equipment and storage medium
CN110728256A (en) * 2019-10-22 2020-01-24 上海商汤智能科技有限公司 Interaction method and device based on vehicle-mounted digital person and storage medium
CN111429924A (en) * 2018-12-24 2020-07-17 同方威视技术股份有限公司 Voice interaction method and device, robot and computer readable storage medium
CN111813491A (en) * 2020-08-19 2020-10-23 广州汽车集团股份有限公司 Vehicle-mounted assistant anthropomorphic interaction method and device and automobile
CN111880659A (en) * 2020-07-31 2020-11-03 北京市商汤科技开发有限公司 Virtual character control method and device, equipment and computer readable storage medium
CN111968631A (en) * 2020-06-29 2020-11-20 百度在线网络技术(北京)有限公司 Interaction method, device, equipment and storage medium of intelligent equipment
CN113760100A (en) * 2021-09-22 2021-12-07 入微智能科技(南京)有限公司 Human-computer interaction equipment with virtual image generation, display and control functions
CN114053696A (en) * 2021-11-15 2022-02-18 完美世界(北京)软件科技发展有限公司 Image rendering processing method and device and electronic equipment
US20220301251A1 (en) * 2021-03-17 2022-09-22 DMLab. CO., LTD. Ai avatar-based interaction service method and apparatus
CN115390678A (en) * 2022-10-27 2022-11-25 科大讯飞股份有限公司 Virtual human interaction method and device, electronic equipment and storage medium
CN115641375A (en) * 2022-12-07 2023-01-24 腾讯科技(深圳)有限公司 Method, device, equipment, medium and program product for processing hair of virtual object
CN117275485A (en) * 2023-11-22 2023-12-22 翌东寰球(深圳)数字科技有限公司 Audio and video generation method, device, equipment and storage medium
CN117370605A (en) * 2022-06-28 2024-01-09 海信视像科技股份有限公司 Virtual digital person driving method, device, equipment and medium
CN117519477A (en) * 2023-11-09 2024-02-06 九耀天枢(北京)科技有限公司 Digital human virtual interaction system and method based on display screen
CN117523088A (en) * 2023-11-03 2024-02-06 西安电子科技大学 Personalized three-dimensional digital human holographic interaction forming system and method

Patent Citations (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104965426A (en) * 2015-06-24 2015-10-07 百度在线网络技术(北京)有限公司 Intelligent robot control system, method and device based on artificial intelligence
CN108513089A (en) * 2017-02-24 2018-09-07 腾讯科技(深圳)有限公司 The method and device of group's video session
CN111429924A (en) * 2018-12-24 2020-07-17 同方威视技术股份有限公司 Voice interaction method and device, robot and computer readable storage medium
CN109840019A (en) * 2019-02-22 2019-06-04 网易(杭州)网络有限公司 Control method, device and the storage medium of virtual portrait
CN110413841A (en) * 2019-06-13 2019-11-05 深圳追一科技有限公司 Polymorphic exchange method, device, system, electronic equipment and storage medium
CN110728256A (en) * 2019-10-22 2020-01-24 上海商汤智能科技有限公司 Interaction method and device based on vehicle-mounted digital person and storage medium
CN111968631A (en) * 2020-06-29 2020-11-20 百度在线网络技术(北京)有限公司 Interaction method, device, equipment and storage medium of intelligent equipment
CN111880659A (en) * 2020-07-31 2020-11-03 北京市商汤科技开发有限公司 Virtual character control method and device, equipment and computer readable storage medium
CN111813491A (en) * 2020-08-19 2020-10-23 广州汽车集团股份有限公司 Vehicle-mounted assistant anthropomorphic interaction method and device and automobile
US20220301251A1 (en) * 2021-03-17 2022-09-22 DMLab. CO., LTD. Ai avatar-based interaction service method and apparatus
CN113760100A (en) * 2021-09-22 2021-12-07 入微智能科技(南京)有限公司 Human-computer interaction equipment with virtual image generation, display and control functions
CN114053696A (en) * 2021-11-15 2022-02-18 完美世界(北京)软件科技发展有限公司 Image rendering processing method and device and electronic equipment
CN116115995A (en) * 2021-11-15 2023-05-16 完美世界(北京)软件科技发展有限公司 Image rendering processing method and device and electronic equipment
CN117370605A (en) * 2022-06-28 2024-01-09 海信视像科技股份有限公司 Virtual digital person driving method, device, equipment and medium
CN115390678A (en) * 2022-10-27 2022-11-25 科大讯飞股份有限公司 Virtual human interaction method and device, electronic equipment and storage medium
CN115641375A (en) * 2022-12-07 2023-01-24 腾讯科技(深圳)有限公司 Method, device, equipment, medium and program product for processing hair of virtual object
CN117523088A (en) * 2023-11-03 2024-02-06 西安电子科技大学 Personalized three-dimensional digital human holographic interaction forming system and method
CN117519477A (en) * 2023-11-09 2024-02-06 九耀天枢(北京)科技有限公司 Digital human virtual interaction system and method based on display screen
CN117275485A (en) * 2023-11-22 2023-12-22 翌东寰球(深圳)数字科技有限公司 Audio and video generation method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
US11765439B2 (en) Intelligent commentary generation and playing methods, apparatuses, and devices, and computer storage medium
KR101925440B1 (en) Method for providing vr based live video chat service using conversational ai
Lee et al. Nonverbal behavior generator for embodied conversational agents
US6526395B1 (en) Application of personality models and interaction with synthetic characters in a computing system
JP2022524944A (en) Interaction methods, devices, electronic devices and storage media
CN107958433A (en) A kind of online education man-machine interaction method and system based on artificial intelligence
CN110427472A (en) The matched method, apparatus of intelligent customer service, terminal device and storage medium
CN111414506B (en) Emotion processing method and device based on artificial intelligence, electronic equipment and storage medium
WO2022170848A1 (en) Human-computer interaction method, apparatus and system, electronic device and computer medium
KR20220129989A (en) Avatar-based interaction service method and apparatus
CN114495927A (en) Multi-modal interactive virtual digital person generation method and device, storage medium and terminal
CN112530218A (en) Many-to-one accompanying intelligent teaching system and teaching method
CN113903067A (en) Virtual object video generation method, device, equipment and medium
CN117523088A (en) Personalized three-dimensional digital human holographic interaction forming system and method
CN117292022A (en) Video generation method and device based on virtual object and electronic equipment
US20220301250A1 (en) Avatar-based interaction service method and apparatus
CN117727303A (en) Audio and video generation method, device, equipment and storage medium
CN115442495A (en) AI studio system
Cerezo et al. Interactive agents for multimodal emotional user interaction
CN117560340B (en) Information interaction method, device and storage medium based on simulated roles
CN116843805B (en) Method, device, equipment and medium for generating virtual image containing behaviors
CN116226411B (en) Interactive information processing method and device for interactive project based on animation
Santos-Pérez et al. AVATAR: an open source architecture for embodied conversational agents in smart environments
CN112634684B (en) Intelligent teaching method and device
JP7496128B2 (en) Virtual person dialogue system, image generation method, and image generation program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination