CN118250523A - Digital human video generation method and device, storage medium and electronic equipment

Info

Publication number: CN118250523A
Application number: CN202410250339.0A
Authority: CN (China)
Other languages: Chinese (zh)
Inventor: 王朋强
Assignee: Beijing Yuanke Fangzhou Technology Co., Ltd.
Filing date: 2024-03-05
Publication date: 2024-06-25
Legal status: Pending
Prior art keywords: information, digital human, determining, identification, digital


Landscapes

  • Processing Or Creating Images (AREA)

Abstract

The application discloses a digital human video generation method, a device, a storage medium, an electronic device and a computer program product. The method comprises: acquiring multimodal information input by a user for a digital human model; determining a recognized content text and a recognized emotion corresponding to the multimodal information; determining action information and voice information of the digital human model according to the recognized content text and the recognized emotion; and generating a digital human video according to the action information, the voice information and the digital human model. The digital human model can thus be driven directly by text, video, voice and the like, without manually adjusting or designing the model's actions, so the method is simple, highly flexible and widely applicable.

Description

Digital human video generation method and device, storage medium and electronic equipment
Technical Field
The present application relates to the field of digital human technology, and in particular to a digital human video generation method, an apparatus, a storage medium, an electronic device and a computer program product.
Background
With the rapid development of Virtual Reality (VR), Augmented Reality (AR) and artificial intelligence, digital human technology has gradually become an important research direction in the field of human-computer interaction. A digital human is a computer-generated virtual character that exhibits appearance, actions and interactive capabilities similar to those of a real human.
Virtual digital human technology generally follows two routes: 2D virtual digital humans and 3D virtual digital humans. Compared with 2D virtual digital humans, 3D virtual digital humans offer better display effects, operability and interactivity; they are widely applied in news broadcasting, intelligent customer service, film production, game development, virtual reality, augmented reality, online social interaction and other fields, and provide users with a more immersive and personalized interactive experience. In the prior art, however, producing a 3D digital human video mostly requires a professional animator to adjust the model manually, or requires a digital human video to be produced in advance for each action and then invoked; such production methods are complex, inflexible and limited in application range.
Disclosure of Invention
The present application aims to solve at least one of the technical problems existing in the prior art. It therefore provides a digital human video generation method, a device, a storage medium, an electronic device and a computer program product that can generate a digital human video without the user manually adjusting or designing the digital human's actions, with high flexibility and a wide application range.
In a first aspect, the present application provides a digital human video generation method, including:
acquiring multimodal information input by a user for a digital human model;
determining a recognized content text and a recognized emotion corresponding to the multimodal information;
determining action information and voice information of the digital human model according to the recognized content text and the recognized emotion;
and generating a digital human video according to the action information, the voice information and the digital human model.
In some embodiments, the digital human video generation method further comprises:
acquiring a plurality of person images of a target person captured by a preset image acquisition array and a preset light source array, wherein different person images are captured at different shooting angles and/or under different illumination colors;
and generating the digital human model corresponding to the target person according to the person images.
In some embodiments, the determining the recognized content text and the recognized emotion corresponding to the multimodal information comprises:
recognizing the multimodal information by using a trained multimodal recognition model, to obtain the recognized content text and the recognized emotion corresponding to the multimodal information.
In some embodiments, the multimodal information comprises video information, text information, audio information and/or image information, and the recognizing the multimodal information by using the trained multimodal recognition model comprises:
determining at least one information type corresponding to the multimodal information;
determining, from the trained multimodal recognition model, a feature extraction module corresponding to each information type, to obtain a target extraction module;
performing feature extraction on the multimodal information of the corresponding information type by using the target extraction module, to obtain a corresponding feature vector;
and performing fusion recognition processing on the feature vectors of all the information types by using a fusion recognition module in the multimodal recognition model, to obtain the recognized content text and the recognized emotion corresponding to the multimodal information.
In some embodiments, the determining the action information and the voice information of the digital human model according to the recognized content text and the recognized emotion comprises:
determining action information of at least one key part of the digital human model according to the recognized content text and the recognized emotion, wherein the key part comprises limbs, a face and/or a head;
and determining the voice information of the digital human model according to the recognized content text.
In some embodiments, the generating a digital human video according to the action information, the voice information and the digital human model comprises:
determining an image frame sequence to be rendered according to the action information and the digital human model;
determining an audio frame sequence to be rendered according to the voice information;
and fusing the image frame sequence and the audio frame sequence to obtain the digital human video.
In a second aspect, the present application provides a digital human video generating apparatus comprising:
an acquisition unit, configured to acquire multimodal information input by a user for a digital human model;
a first determining unit, configured to determine a recognized content text and a recognized emotion corresponding to the multimodal information;
a second determining unit, configured to determine action information and voice information of the digital human model according to the recognized content text and the recognized emotion;
and a video generation unit, configured to generate a digital human video according to the action information, the voice information and the digital human model.
In some embodiments, the digital human video generating apparatus further comprises a model generating unit for:
acquiring a plurality of person images of a target person captured by a preset image acquisition array and a preset light source array, wherein different person images are captured at different shooting angles and/or under different illumination colors;
and generating the digital human model corresponding to the target person according to the person images.
In some embodiments, the first determining unit is specifically configured to:
recognize the multimodal information by using a trained multimodal recognition model, to obtain a recognized content text and a recognized emotion corresponding to the multimodal information.
In some embodiments, the multimodal information comprises video information, text information, audio information and/or image information, and the first determining unit is specifically configured to:
determine at least one information type corresponding to the multimodal information;
determine, from the trained multimodal recognition model, a feature extraction module corresponding to each information type, to obtain a target extraction module;
perform feature extraction on the multimodal information of the corresponding information type by using the target extraction module, to obtain a corresponding feature vector;
and perform fusion recognition processing on the feature vectors of all the information types by using a fusion recognition module in the multimodal recognition model, to obtain a recognized content text and a recognized emotion corresponding to the multimodal information.
In some embodiments, the second determining unit is specifically configured to:
determine action information of at least one key part of the digital human model according to the recognized content text and the recognized emotion, wherein the key part comprises limbs, a face and/or a head;
and determine the voice information of the digital human model according to the recognized content text.
In some embodiments, the video generation unit is specifically configured to:
determine an image frame sequence to be rendered according to the action information and the digital human model;
determine an audio frame sequence to be rendered according to the voice information;
and fuse the image frame sequence and the audio frame sequence to obtain the digital human video.
In a third aspect, the present application provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the digital human video generation method of any of the above.
In a fourth aspect, the present application provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the digital human video generation method of any one of the above when executing the program.
In a fifth aspect, the present application provides a computer program product comprising a computer program which, when executed by a processor, implements the digital human video generation method of any one of the above.
An embodiment of the present application provides a digital human video generation method, a device, a storage medium, an electronic device and a computer program product. Multimodal information input by a user for a digital human model is acquired; a recognized content text and a recognized emotion corresponding to the multimodal information are determined; action information and voice information of the digital human model are determined according to the recognized content text and the recognized emotion; and a digital human video is generated according to the action information, the voice information and the digital human model. In other words, by inputting a piece of text, video, voice or the like, a user can directly drive the digital human model to perform the required actions and produce the required sounds, without manually adjusting or designing the model's actions; the method is therefore simple, highly flexible and widely applicable.
Drawings
The foregoing and/or additional aspects and advantages of the application will become apparent and may be better understood from the following description of embodiments taken in conjunction with the accompanying drawings in which:
Fig. 1 is a schematic flow chart of a digital human video generation method according to an embodiment of the present application;
Fig. 2 is another schematic flow chart of the digital human video generation method according to an embodiment of the present application;
Fig. 3 is a schematic structural diagram of a digital human video generating apparatus according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
Fig. 5 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the application.
The embodiment of the application provides a digital human video generation method, a digital human video generation device, a storage medium, electronic equipment and a computer program product.
Referring to Fig. 1, Fig. 1 is a flowchart of a digital human video generation method according to an embodiment of the present application. The method is applied to an electronic device. The electronic device may be implemented as a user terminal, including an AR device, a VR device, a notebook computer, a tablet computer, a desktop computer, a mobile device (such as a mobile phone, a personal digital assistant or a dedicated messaging device) and the like, or may be implemented as a server. Specifically, the digital human video generation method may include the following steps 101-104:
101. Acquire multimodal information input by a user for the digital human model.
The digital human model may be a 3D digital human model produced from a real person in the real world, or a virtual character model in a virtual world. Once the digital human model has been produced, it can be driven by multimodal information. The multimodal information may include video information, text information, audio information and/or image information. It may be generated by the user directly, for example text typed on a keyboard built into or connected to the electronic device, audio recorded through an audio collector (such as a microphone), or images captured through an image collector (such as a camera); it may be obtained by recognizing collected audio through speech recognition; or it may be obtained by the user from other sources, such as a teaching video or audio downloaded from a public teaching website, or an e-book downloaded from an e-book website. No limitation is imposed here.
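As a concrete illustration only, the multimodal information described above could be held in a simple container before being passed on for recognition. The following Python sketch is an assumption about data layout; the MultimodalInput type and its field names are illustrative and are not defined by the present application:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class MultimodalInput:
        """Any subset of the four information types may be supplied;
        absent modalities are simply left as None."""
        text: Optional[str] = None        # typed text from a keyboard
        audio_path: Optional[str] = None  # recorded speech from a microphone
        image_path: Optional[str] = None  # image captured by a camera
        video_path: Optional[str] = None  # uploaded or downloaded video

        def present_modalities(self):
            """Return the information types actually provided."""
            fields = {"text": self.text, "audio": self.audio_path,
                      "image": self.image_path, "video": self.video_path}
            return [name for name, value in fields.items() if value is not None]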
It should be noted that, when the digital human model is produced from a real person in the real world, it may be modeled manually or modeled automatically from the person appearing in a video provided by the user. For automatic modeling, in order to improve the fidelity of the model and achieve a photo-realistic 3D digital human model, reference is made to Fig. 2, which is another schematic flow chart of the digital human video generation method provided by an embodiment of the present application. The digital human video generation method may further include:
1051. Acquire a plurality of person images of a target person captured by a preset image acquisition array and a preset light source array, wherein different person images are captured at different shooting angles and/or under different illumination colors.
1052. Generate the digital human model corresponding to the target person according to the person images.
The target person may be a real person in a real shooting scene, and the person images may be an image frame sequence obtained by continuously shooting that person. The image acquisition array comprises a plurality of image collectors (such as cameras) arranged in an array; the collectors are placed at different positions in the shooting scene and therefore view objects in the scene from different angles. The light source array comprises a plurality of illumination light sources arranged in an array, which can illuminate objects in the scene with different colors and even different brightnesses. When the digital human model is generated, the three-dimensional spatial positions of the target person can be determined from the person images, the structural data of the digital human model can then be determined from those spatial positions, and the digital human model can be constructed from the structural data, thereby obtaining a digital human model matching the target person.
Specifically, while a real target person is being shot, the person can be instructed to keep changing facial expressions, limb actions and the like to enrich the person images. At the same time, the light source array is controlled to illuminate the target person with different colors and brightnesses while the image acquisition array continuously shoots from different angles. The resulting image frame sequence therefore reflects the overall appearance of the target person as faithfully as possible, so that a digital human model generated from it can be highly consistent with the target person, improving model fidelity and realism and enabling the construction of a photo-realistic 3D digital human model.
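The capture procedure above might be orchestrated as in the following Python sketch; the light-array and camera calls are hypothetical stand-ins for whatever rig-control SDK is actually used, since the application names no specific interfaces:

    def capture_person_images(cameras, lights, light_colors, frames_per_setting=60):
        """Capture the target person under every combination of illumination
        color and camera angle, tagging each frame with its capture settings."""
        frames = []
        for color in light_colors:
            lights.set_color(color)              # hypothetical light-array call
            for camera in cameras:               # each camera = one shooting angle
                for _ in range(frames_per_setting):
                    frames.append({
                        "image": camera.grab_frame(),  # hypothetical camera call
                        "angle": camera.pose,
                        "light_color": color,
                    })
        return frames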
102. Determine the recognized content text and the recognized emotion corresponding to the multimodal information.
The recognized content text and the recognized emotion corresponding to the multimodal information can be obtained through an algorithm or a deep learning model. With continued reference to Fig. 2, step 102 may specifically include:
recognizing the multimodal information by using a trained multimodal recognition model, to obtain the recognized content text and the recognized emotion corresponding to the multimodal information.
The multimodal recognition model comprises at least one deep learning neural network and needs to be trained in advance on a large number of multimodal samples. Further, the multimodal information includes video information, text information, audio information and/or image information; in this case, the step of recognizing the multimodal information by using the trained multimodal recognition model may specifically include:
determining at least one information type corresponding to the multimodal information;
determining, from the trained multimodal recognition model, a feature extraction module corresponding to each information type, to obtain a target extraction module;
performing feature extraction on the multimodal information of the corresponding information type by using the target extraction module, to obtain a corresponding feature vector;
and performing fusion recognition processing on the feature vectors of all the information types by using a fusion recognition module in the multimodal recognition model, to obtain the recognized content text and the recognized emotion corresponding to the multimodal information.
The multimodal information may include at least one of video information, text information, audio information and image information, and the corresponding information types include at least one of video, text, audio and image. The multimodal recognition model mainly consists of two parts: the feature extraction modules and the fusion recognition module. A corresponding feature extraction module can be provided in advance for each information type, with different modules trained on samples of different types: for example, the module for extracting video features is trained on a large number of video samples, the module for extracting text features on a large number of text samples, and the module for extracting image features on a large number of image samples.
When the multimodal information is subsequently processed, only the feature extraction module of each corresponding information type needs to be invoked, yielding feature vectors of the corresponding types, such as audio, text, image and video feature vectors. These feature vectors are then fused and recognized by the fusion recognition module. The fusion scheme is set manually; for example, all extracted feature vectors may be merged into one by simple vector concatenation or by weighted averaging. If the multimodal information contains only one information type, the feature vectors of the other information types are treated as zero during fusion.
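To make this structure concrete, the following PyTorch sketch follows the description above: one extraction module per information type, concatenation-based fusion, and zero vectors for absent types. All layer shapes and head designs are assumptions, since the application does not specify architectures:

    import torch
    import torch.nn as nn

    MODALITIES = ("text", "audio", "image", "video")

    class MultimodalRecognizer(nn.Module):
        def __init__(self, in_dims, feat_dim=256, n_emotions=7):
            super().__init__()
            # One feature extraction module per information type (placeholder MLPs).
            self.extractors = nn.ModuleDict({
                m: nn.Sequential(nn.Linear(in_dims[m], feat_dim), nn.ReLU())
                for m in MODALITIES
            })
            self.feat_dim = feat_dim
            fused_dim = feat_dim * len(MODALITIES)
            # Fusion recognition module: simple concatenation feeding two heads,
            # one for the emotion label and one for a content representation
            # that a downstream text decoder would consume.
            self.emotion_head = nn.Linear(fused_dim, n_emotions)
            self.content_head = nn.Linear(fused_dim, feat_dim)

        def forward(self, inputs):
            # `inputs` maps an information type to a (batch, in_dim) tensor;
            # at least one type is assumed present, and absent types
            # contribute zero vectors, as described above.
            ref = next(iter(inputs.values()))
            feats = []
            for m in MODALITIES:
                if m in inputs:
                    feats.append(self.extractors[m](inputs[m]))
                else:
                    feats.append(torch.zeros(ref.shape[0], self.feat_dim,
                                             device=ref.device))
            fused = torch.cat(feats, dim=-1)    # simple vector concatenation
            return self.content_head(fused), self.emotion_head(fused)

    # Example (dimensions are arbitrary):
    # model = MultimodalRecognizer({"text": 768, "audio": 128, "image": 512, "video": 1024})
    # content_feat, emotion_logits = model({"text": torch.randn(2, 768)})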
103. Determine action information and voice information of the digital human model according to the recognized content text and the recognized emotion.
In some embodiments, referring to Fig. 2, step 103 may specifically include:
determining action information of at least one key part of the digital human model according to the recognized content text and the recognized emotion, wherein the key part comprises limbs, a face and/or a head;
and determining the voice information of the digital human model according to the recognized content text.
Limb actions mainly comprise movements of the limbs; facial actions mainly comprise mouth-shape actions and facial expression actions (involving the eyebrows, eyes and other parts), the latter expressing emotions of the character such as happiness, anger and sorrow; head actions mainly comprise rotational movements of the head, such as turning left or raising the head by 45°. Specifically, the action information of each key part can be determined by a trained deep learning model, with a different model trained for each key part, the action information of each key part being determined by its respective model: for example, mouth-shape actions are determined by a mouth-shape recognition model, limb actions by a limb recognition model, and so on.
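A minimal sketch of this per-part dispatch is given below; the part names and the predict interface are illustrative assumptions rather than components named by the application:

    def determine_action_info(content_text, emotion, part_models):
        """Run one trained model per key part (e.g. mouth shape, limbs, head)
        and collect the resulting action information for the digital human."""
        action_info = {}
        for part, model in part_models.items():
            # Each model maps (recognized text, recognized emotion) to the
            # action information for its own key part.
            action_info[part] = model.predict(content_text, emotion)
        return action_info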
The voice information can be produced by a speech synthesizer. Besides the basic content to be expressed, the synthesized voice can also convey a certain emotion through sound attributes such as pitch and speaking rate, and these attributes can change with the multimodal information input by the user. In that case, the step of determining the voice information of the digital human model according to the recognized content text may specifically be: determining the voice information of the digital human model according to the recognized content text and the recognized emotion. Of course, in other embodiments, the voice information may also be synthesized using the system's default sound attributes.
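An emotion-to-prosody mapping might look like the following sketch. The tts.synthesize interface, the emotion labels and the attribute values are all assumptions; real speech engines expose pitch and rate controls under different names:

    EMOTION_PROSODY = {
        # Assumed mapping from recognized emotion to sound attributes.
        "happy":   {"pitch_shift":  2.0, "rate": 1.10},
        "angry":   {"pitch_shift":  1.0, "rate": 1.20},
        "sad":     {"pitch_shift": -2.0, "rate": 0.85},
        "neutral": {"pitch_shift":  0.0, "rate": 1.00},
    }

    def synthesize_speech(tts, content_text, emotion="neutral"):
        """Synthesize the digital human's voice, modulating pitch and speaking
        rate according to the recognized emotion (falling back to defaults)."""
        prosody = EMOTION_PROSODY.get(emotion, EMOTION_PROSODY["neutral"])
        return tts.synthesize(text=content_text, **prosody)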
104. Generate a digital human video according to the action information, the voice information and the digital human model.
By inputting multimodal information such as a piece of text, video or voice, the user determines the action information (such as limb actions, expression actions and mouth-shape actions) and the voice information of the digital human model. The model is then driven by this action information and voice information, so that it performs model actions consistent with the action information and produces model sounds consistent with the voice information, yielding the digital human video.
In some embodiments, referring to Fig. 2, step 104 may specifically include:
1041. Determine an image frame sequence to be rendered according to the action information and the digital human model, and determine an audio frame sequence to be rendered according to the voice information.
1042. Fuse the image frame sequence and the audio frame sequence to obtain the digital human video.
In the digital human video, the image frame sequence presents the model actions of the digital human model, and the audio frame sequence presents its model sounds; the model actions are consistent with the action information, and the model sounds are consistent with the voice information.
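One common way to perform this fusion is to mux the rendered frames and the synthesized audio with ffmpeg. The Python sketch below does this via a subprocess call; the frame-file pattern, frame rate and codec choices are conventional assumptions rather than requirements of the method:

    import subprocess

    def mux_video(frame_pattern, audio_path, out_path, fps=25):
        """Fuse an image frame sequence (e.g. frames/frame_0001.png, ...) with
        an audio track into the final digital human video."""
        subprocess.run([
            "ffmpeg", "-y",
            "-framerate", str(fps), "-i", frame_pattern,  # image frame sequence
            "-i", audio_path,                             # audio frame sequence
            "-c:v", "libx264", "-pix_fmt", "yuv420p",     # widely compatible video
            "-c:a", "aac", "-shortest",                   # trim to shorter stream
            out_path,
        ], check=True)

    # Example: mux_video("frames/frame_%04d.png", "speech.wav", "digital_human.mp4")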
As can be seen from the above, in the digital human video generation method provided by the embodiment of the present application, multimodal information input by a user for a digital human model is acquired; a recognized content text and a recognized emotion corresponding to the multimodal information are determined; action information and voice information of the digital human model are determined according to the recognized content text and the recognized emotion; and a digital human video is generated according to the action information, the voice information and the digital human model. In other words, by inputting a piece of text, video, voice or the like, the user can directly drive the digital human model to perform the required actions and produce the required sounds, without manually adjusting or designing the model's actions; the method is simple, highly flexible and widely applicable. In addition, a plurality of person images of a target person are acquired by means of the image acquisition array and the light source array, different person images being captured at different shooting angles and/or under different illumination colors, and the digital human model corresponding to the target person is generated from these person images. A digital human model highly consistent with the target person can thus be obtained, improving model fidelity and realism and enabling the construction of a photo-realistic 3D digital human model.
In accordance with the method described in the above embodiments, an embodiment of the present application further provides a digital human video generating apparatus configured to perform the steps of the digital human video generation method. Referring to Fig. 3, Fig. 3 is a schematic structural diagram of a digital human video generating apparatus according to an embodiment of the present application. The digital human video generating apparatus 200 is applied to an electronic device, which may be implemented as a user terminal (including an AR device, a VR device, a notebook computer, a tablet computer, a desktop computer, or a mobile device such as a mobile phone, a personal digital assistant or a dedicated messaging device) or as a server. Specifically, the digital human video generating apparatus 200 includes an acquisition unit 201, a first determining unit 202, a second determining unit 203 and a video generating unit 204, wherein:
the acquisition unit 201 is configured to acquire multimodal information input by a user for a digital human model;
the first determining unit 202 is configured to determine a recognized content text and a recognized emotion corresponding to the multimodal information;
the second determining unit 203 is configured to determine action information and voice information of the digital human model according to the recognized content text and the recognized emotion;
and the video generating unit 204 is configured to generate a digital human video according to the action information, the voice information and the digital human model.
In some embodiments, the digital human video generating apparatus 200 further comprises a model generating unit for:
acquiring a plurality of person images of a target person captured by a preset image acquisition array and a preset light source array, wherein different person images are captured at different shooting angles and/or under different illumination colors;
and generating the digital human model corresponding to the target person according to the person images.
In some embodiments, the first determining unit 202 is specifically configured to:
recognize the multimodal information by using a trained multimodal recognition model, to obtain a recognized content text and a recognized emotion corresponding to the multimodal information.
In some embodiments, the multimodal information comprises video information, text information, audio information and/or image information, and the first determining unit 202 is specifically configured to:
determine at least one information type corresponding to the multimodal information;
determine, from the trained multimodal recognition model, a feature extraction module corresponding to each information type, to obtain a target extraction module;
perform feature extraction on the multimodal information of the corresponding information type by using the target extraction module, to obtain a corresponding feature vector;
and perform fusion recognition processing on the feature vectors of all the information types by using a fusion recognition module in the multimodal recognition model, to obtain a recognized content text and a recognized emotion corresponding to the multimodal information.
In some embodiments, the second determining unit 203 is specifically configured to:
determine action information of at least one key part of the digital human model according to the recognized content text and the recognized emotion, wherein the key part comprises limbs, a face and/or a head;
and determine the voice information of the digital human model according to the recognized content text.
In some embodiments, the video generating unit 204 is specifically configured to:
determine an image frame sequence to be rendered according to the action information and the digital human model;
determine an audio frame sequence to be rendered according to the voice information;
and fuse the image frame sequence and the audio frame sequence to obtain the digital human video.
It should be noted that the specific details of each unit in the digital human video generating apparatus 200 have been described in detail in the embodiments of the digital human video generation method and are not repeated here.
In some embodiments, the digital human video generating apparatus in the embodiments of the present application may be an electronic device, or a component in an electronic device such as an integrated circuit or a chip. The electronic device may be a terminal device, for example a mobile phone, a tablet computer, a notebook computer, a palmtop computer, a vehicle-mounted electronic device, a mobile internet device (MID), an augmented reality (AR)/virtual reality (VR) device, a robot, a wearable device, an ultra-mobile personal computer (UMPC), a netbook or a personal digital assistant (PDA), and may also be a server, a network attached storage (NAS) device, a personal computer (PC), a television (TV), a teller machine, a self-service machine or the like; the embodiments of the present application are not particularly limited in this respect.
In some embodiments, as shown in Fig. 4, an embodiment of the present application further provides an electronic device 300, which includes a processor 301, a memory 302, and a computer program stored in the memory 302 and executable on the processor 301. When executed by the processor 301, the program implements the respective processes of the above digital human video generation method embodiments and achieves the same technical effects; to avoid repetition, details are not repeated here.
Electronic devices in the embodiments of the present application include both mobile and non-mobile electronic devices.
Fig. 5 is a schematic diagram of the hardware structure of an electronic device implementing an embodiment of the present application.
The electronic device 400 includes, but is not limited to: radio frequency unit 401, network module 402, audio output unit 403, input unit 404, sensor 405, display unit 406, user input unit 407, interface unit 408, memory 409, and processor 410.
Those skilled in the art will appreciate that the electronic device 400 may also include a power source (e.g., a battery) for powering the various components; the power source may be logically connected to the processor 410 through a power management system, which manages charging, discharging and power consumption. The electronic device structure shown in Fig. 5 does not constitute a limitation of the electronic device; the electronic device may include more or fewer components than shown, combine certain components, or arrange components differently, which will not be described in detail here.
It should be appreciated that in embodiments of the present application, the input unit 404 may include a graphics processor (Graphics Processing Unit, GPU) 4041 and a microphone 4042, with the graphics processor 4041 processing image data of still pictures or video obtained by an image capture device (e.g., a camera) in a video capture mode or an image capture mode. The display unit 406 may include a display panel 4061, and the display panel 4061 may be configured in the form of a liquid crystal display, an organic light emitting diode, or the like. The user input unit 407 includes at least one of a touch panel 4071 and other input devices 4072. The touch panel 4071 is also referred to as a touch screen. The touch panel 4071 may include two parts, a touch detection device and a touch controller. Other input devices 4072 may include, but are not limited to, a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and so forth, which are not described in detail herein.
The memory 409 may be used to store software programs as well as various data. The memory 409 may mainly include a first storage area storing programs or instructions and a second storage area storing data, wherein the first storage area may store an operating system and the application programs or instructions required for at least one function (such as a sound playing function and an image playing function). Further, the memory 409 may include volatile memory or non-volatile memory, or both. The non-volatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM) or a flash memory. The volatile memory may be a random access memory (RAM), a static RAM (SRAM), a dynamic RAM (DRAM), a synchronous DRAM (SDRAM), a double data rate SDRAM (DDR SDRAM), an enhanced SDRAM (ESDRAM), a synch-link DRAM (SLDRAM) or a direct Rambus RAM (DRRAM). The memory 409 in the embodiments of the present application includes, but is not limited to, these and any other suitable types of memory.
The processor 410 may include one or more processing units. The processor 410 integrates an application processor, which primarily handles operations involving the operating system, user interface, application programs and the like, and a modem processor, which primarily handles wireless communication signals, such as a baseband processor. It will be appreciated that the modem processor may alternatively not be integrated into the processor 410.
An embodiment of the present application further provides a non-transitory computer readable storage medium having a computer program stored thereon. When executed by a processor, the computer program implements the respective processes of the above digital human video generation method embodiments and achieves the same technical effects; to avoid repetition, details are not repeated here.
The processor is the processor in the electronic device described in the above embodiments. The readable storage medium includes a computer readable storage medium, such as a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.
An embodiment of the present application further provides a computer program product comprising a computer program which, when executed by a processor, implements the above digital human video generation method.
It should be noted that, in this document, the terms "comprises", "comprising" or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article or apparatus. Without further limitation, an element preceded by "comprising a …" does not exclude the presence of other identical elements in the process, method, article or apparatus that comprises the element. Furthermore, it should be noted that the scope of the methods and apparatus in the embodiments of the present application is not limited to performing functions in the order shown or discussed; depending on the functions involved, functions may also be performed in a substantially simultaneous manner or in the reverse order. For example, the described methods may be performed in an order different from that described, and various steps may be added, omitted or combined. Additionally, features described with reference to certain examples may be combined in other examples.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by software plus a necessary general-purpose hardware platform, or by hardware, although in many cases the former is preferred. Based on this understanding, the technical solution of the present application, or the part of it contributing to the prior art, may be embodied in the form of a computer software product stored in a storage medium (such as a ROM/RAM, a magnetic disk or an optical disk) and comprising instructions for causing a terminal (which may be a mobile phone, a computer, a server, a network device or the like) to perform the methods of the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above specific embodiments, which are merely illustrative rather than restrictive. In light of the present application, those of ordinary skill in the art may devise many further forms without departing from the spirit of the present application and the scope of the claims, all of which fall within the protection of the present application.
The terms "first", "second" and the like in the description and the claims are used to distinguish similar objects and do not necessarily describe a particular order or sequence. It should be understood that the data so used are interchangeable where appropriate, so that the embodiments of the present application can be implemented in orders other than those illustrated or described herein. Objects distinguished by "first", "second" and the like are usually of one type, and the number of such objects is not limited; for example, the first object may be one or more than one. In addition, in the description and the claims, "and/or" denotes at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the associated objects.
In the description of the present application, "plurality" means two or more.
In the description of the present specification, reference to the terms "one embodiment," "some embodiments," "illustrative embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present application have been shown and described, it will be understood by those of ordinary skill in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the application, the scope of which is defined by the claims and their equivalents.

Claims (10)

1. A digital human video generation method, comprising:
acquiring multimodal information input by a user for a digital human model;
determining a recognized content text and a recognized emotion corresponding to the multimodal information;
determining action information and voice information of the digital human model according to the recognized content text and the recognized emotion;
and generating a digital human video according to the action information, the voice information and the digital human model.
2. The digital human video generation method of claim 1, further comprising:
acquiring a plurality of person images of a target person captured by a preset image acquisition array and a preset light source array, wherein different person images are captured at different shooting angles and/or under different illumination colors;
and generating the digital human model corresponding to the target person according to the person images.
3. The digital human video generation method according to claim 1, wherein the determining the recognized content text and the recognized emotion corresponding to the multimodal information comprises:
recognizing the multimodal information by using a trained multimodal recognition model, to obtain the recognized content text and the recognized emotion corresponding to the multimodal information.
4. The digital human video generation method according to claim 3, wherein the multimodal information comprises video information, text information, audio information and/or image information, and the recognizing the multimodal information by using the trained multimodal recognition model comprises:
determining at least one information type corresponding to the multimodal information;
determining, from the trained multimodal recognition model, a feature extraction module corresponding to each information type, to obtain a target extraction module;
performing feature extraction on the multimodal information of the corresponding information type by using the target extraction module, to obtain a corresponding feature vector;
and performing fusion recognition processing on the feature vectors of all the information types by using a fusion recognition module in the multimodal recognition model, to obtain the recognized content text and the recognized emotion corresponding to the multimodal information.
5. The digital human video generation method according to claim 1, wherein the determining the action information and the voice information of the digital human model according to the recognized content text and the recognized emotion comprises:
determining action information of at least one key part of the digital human model according to the recognized content text and the recognized emotion, wherein the key part comprises limbs, a face and/or a head;
and determining the voice information of the digital human model according to the recognized content text.
6. The digital human video generation method according to any one of claims 1 to 5, wherein the generating the digital human video according to the action information, the voice information and the digital human model comprises:
determining an image frame sequence to be rendered according to the action information and the digital human model;
determining an audio frame sequence to be rendered according to the voice information;
and fusing the image frame sequence and the audio frame sequence to obtain the digital human video.
7. A digital human video generating apparatus, comprising:
an acquisition unit, configured to acquire multimodal information input by a user for a digital human model;
a first determining unit, configured to determine a recognized content text and a recognized emotion corresponding to the multimodal information;
a second determining unit, configured to determine action information and voice information of the digital human model according to the recognized content text and the recognized emotion;
and a video generation unit, configured to generate a digital human video according to the action information, the voice information and the digital human model.
8. A non-transitory computer readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the digital human video generation method of any one of claims 1-6.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the digital human video generation method of any one of claims 1-6.
10. A computer program product comprising a computer program which, when executed by a processor, implements the digital human video generation method of any of claims 1-6.
Application CN202410250339.0A (priority date 2024-03-05, filing date 2024-03-05): Digital human video generation method and device, storage medium and electronic equipment. Status: Pending. Publication: CN118250523A (en).

Priority Applications (1)

Application number: CN202410250339.0A; priority date: 2024-03-05; filing date: 2024-03-05; title: Digital human video generation method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application number: CN202410250339.0A; priority date: 2024-03-05; filing date: 2024-03-05; title: Digital human video generation method and device, storage medium and electronic equipment

Publications (1)

Publication number: CN118250523A; publication date: 2024-06-25

Family

Family ID: 91555602

Family Applications (1)

Application number: CN202410250339.0A (pending); title: Digital human video generation method and device, storage medium and electronic equipment

Country Status (1)

CN: CN118250523A (en)


Legal Events

Code: PB01; description: Publication