WO2022148083A1 - Simulated 3D digital human interaction method, apparatus, electronic device and storage medium - Google Patents

Simulated 3D digital human interaction method, apparatus, electronic device and storage medium

Info

Publication number
WO2022148083A1
Authority
WO
WIPO (PCT)
Prior art keywords
target
digital human
image
simulated digital
user
Application number
PCT/CN2021/123815
Other languages
English (en)
French (fr)
Inventor
杨国基
常向月
陈泷翔
王鑫宇
刘云峰
吴悦
Original Assignee
深圳追一科技有限公司
Application filed by 深圳追一科技有限公司
Publication of WO2022148083A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/048 Interaction techniques based on graphical user interfaces [GUI]
    • G06F 3/0481 Interaction techniques based on graphical user interfaces [GUI] based on specific properties of the displayed interaction object or a metaphor-based environment, e.g. interaction with desktop elements like windows or icons, or assisted by a cursor's changing behaviour or appearance
    • G06F 3/04815 Interaction with a metaphor-based environment or interaction object displayed as three-dimensional, e.g. changing the user viewpoint with respect to the environment or object
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 19/00 Manipulating 3D models or images for computer graphics
    • G06T 19/006 Mixed reality
    • G06T 7/00 Image analysis
    • G06T 7/70 Determining position or orientation of objects or cameras

Definitions

  • the present application relates to the technical field of human-computer interaction, and more particularly, to a method, device, electronic device and storage medium for simulating 3D digital human interaction.
  • the present application proposes a method, device, electronic device and storage medium for simulating 3D digital human interaction.
  • In a first aspect, an embodiment of the present application provides a method for simulating 3D digital human interaction, executed by an electronic device. The method includes: acquiring scene data collected by a collection device; if it is determined according to the scene data that a target user exists in the scene, processing the scene data to obtain the relative position of the target user with respect to the display screen; if the target user is located in a preset area, acquiring a target simulated digital human image corresponding to the relative position, where the target simulated digital human image includes a simulated digital human whose face is oriented toward the target user, and the preset area is an area whose distance from the display screen is less than a preset value; and displaying the target simulated digital human image on the display screen.
  • In a second aspect, an embodiment of the present application provides a simulated 3D digital human interaction apparatus arranged in an electronic device. The apparatus includes a data acquisition module, a position acquisition module, an image acquisition module, and a display module. The data acquisition module is used to obtain the scene data collected by the collection device; the position acquisition module is used to process the scene data to obtain the relative position of the target user with respect to the display screen if it is determined according to the scene data that a target user exists in the scene; the image acquisition module is configured to acquire a target simulated digital human image corresponding to the relative position if the target user is located in a preset area, where the target simulated digital human image includes a simulated digital human whose face is oriented toward the target user and the preset area is an area whose distance from the display screen is less than a preset value; and the display module is used to display the target simulated digital human image on the display screen.
  • In a third aspect, embodiments of the present application provide an electronic device, including: one or more processors; a memory; and one or more sets of computer-readable instructions, where the computer-readable instructions are stored in the memory and configured to be executed by the one or more processors to perform the method of the first aspect above.
  • In a fourth aspect, the embodiments of the present application provide one or more computer-readable storage media storing computer-readable instructions that can be invoked by a processor to execute the method described in the first aspect above.
  • FIG. 1 shows a schematic diagram of an application environment suitable for an embodiment of the present application
  • FIG. 2 shows a schematic flowchart of a method for simulating 3D digital human interaction provided by an embodiment of the present application
  • FIG. 3 shows a schematic flowchart of a method for simulating 3D digital human interaction provided by yet another embodiment of the present application
  • FIG. 4 shows a schematic flowchart of a method for simulating 3D digital human interaction provided by another embodiment of the present application
  • FIG. 5 shows a schematic flowchart of a method for simulating 3D digital human interaction provided by still another embodiment of the present application
  • FIG. 6 shows a schematic flowchart of a method for simulating 3D digital human interaction provided by still another embodiment of the present application
  • FIG. 7 shows a schematic flowchart of a method for simulating 3D digital human interaction provided by yet another embodiment of the present application.
  • FIG. 8 shows a schematic flowchart of a method for simulating 3D digital human interaction provided by still another embodiment of the present application.
  • FIG. 9 shows a structural block diagram of a simulated 3D digital human interaction device provided by an embodiment of the present application.
  • FIG. 10 shows a structural block diagram of an electronic device for executing the method for simulating 3D digital human interaction according to an embodiment of the present application
  • FIG. 11 shows a storage unit according to an embodiment of the present application for storing or carrying computer-readable instructions for implementing the method for simulating 3D digital human interaction according to an embodiment of the present application.
  • 3D digital human: a digital human realized by computer graphics technologies such as 3D modeling and rendering.
  • a video digital human may be generated from coherent photorealistic images.
  • Simulated 3D digital human: a digital human generated by simulated digital human technology that takes into account the spatial position and viewing perspective of the audience relative to the digital human, so that the simulated digital human achieves a realistic three-dimensional effect.
  • a stereoscopic and realistic video digital human can be generated from a plurality of simulated digital human image sequences.
  • For avatars generated based on real human models, that is, digital humans, some actions can be pre-designed and matched with the interactive voice or text content, so as to improve the user's visual experience.
  • Although matching actions with interactive content can make the digital human's action poses close to those of the real model, this approach only matches the interactive content with the actions and does not establish a connection between the user's location and the digital human.
  • In addition, the digital human on the screen is usually fixed in front of the screen.
  • the presentation mode of the digital human in the prior art does not deeply consider the user's behavior, which leads to a low level of fidelity of the displayed image of the digital human, unnatural interaction, and poor user interaction experience.
  • Although a 3D digital human, that is, a digital human realized by computer graphics technologies such as 3D modeling and rendering, can present a three-dimensional visual effect, the digital human effect presented is usually an animation effect, which cannot achieve the realism of live camera footage.
  • Therefore, the inventors of the present application studied how to take greater account of the user's behavior when interacting with a digital human, so as to achieve a natural, anthropomorphic interaction effect. On this basis, the inventors propose a simulated 3D digital human interaction method, apparatus, electronic device and storage medium, so that during human-computer interaction a simulated digital human whose face is oriented toward the target user can be displayed according to the user's position to interact with the target user.
  • The simulated digital human is not only as realistic as a real person captured by a camera, but can also simulate the interaction effect of face-to-face communication between the user and the real model, realizing anthropomorphic natural interaction and improving the user's interactive experience.
  • FIG. 1 shows a schematic diagram of an application environment suitable for the embodiment of the present application.
  • the simulated 3D digital human interaction method provided by the embodiment of the present application can be applied to the interaction system 10 shown in FIG. 1 .
  • the interactive system 10 includes a terminal device 101 and a server 102 .
  • the server 102 and the terminal device 101 are connected through a wireless or wired network to realize data transmission between the terminal device 101 and the server 102 based on the network connection.
  • the transmitted data includes but is not limited to audio, video, text, and images.
  • The server 102 may be a single server, a server cluster or server center composed of multiple servers, a local server, or a cloud server.
  • the server 102 may be used to provide background services for users, and the background services may include, but are not limited to, simulation 3D digital human interaction services, etc., which are not limited herein.
  • the terminal device 101 may be various electronic devices having a display screen and supporting data input, including but not limited to smart phones, tablet computers, laptop computers, desktop computers, wearable electronic devices, and the like.
  • Data input may be performed through the voice module of the terminal device 101 to input voice, the character input module to input characters, the image input module to input images, or the video input module to input video, or through a gesture recognition module installed on the terminal device 101 that enables users to interact by gesture input and other interactive methods.
  • The terminal device 101 may be installed with the computer-readable instructions of a client application (e.g., an APP), and a user may communicate with the server 102 based on those instructions. Specifically, the terminal device 101 can obtain the user's input information and communicate with the server 102 based on the client application's computer-readable instructions; the server 102 can process the received user input information and return corresponding output information to the terminal device 101, and the terminal device 101 can then perform the operation corresponding to the output information.
  • the user's input information may be voice information, screen-based touch operation information, gesture information, action information, etc.
  • the output information may be images, videos, text, audio, etc., which are not limited herein.
  • the client application computer readable instructions can provide human-computer interaction services based on the simulated digital human, and the human-computer interaction services can be different based on different scene requirements.
  • the client-side application computer-readable instructions can be used to provide users with product display information or service guidance in public areas such as shopping malls, banks, and exhibition halls, and different interactive services can be provided for different application scenarios.
  • the terminal device 101 may display a simulated digital human corresponding to the reply information on the display screen of the terminal device 101 or other image output devices connected thereto.
  • the simulated digital human may be a human-like image created according to the user's own or other people's shape, or may be a robot with animation effects, such as a robot in the shape of an animal or a cartoon character.
  • The audio corresponding to the simulated digital human image can be played through the speaker of the terminal device 101 or another audio output device connected to it, and the text or graphics corresponding to the reply information can be displayed on the display screen of the terminal device 101, realizing polymorphic interaction with the user across image, voice, and text.
  • The device for processing user input information can also be set on the terminal device 101, so that the terminal device 101 can interact with the user and realize digital-human-based human-computer interaction without relying on communication with the server 102; in this case, the interactive system 10 may include only the terminal device 101.
  • FIG. 2 is a schematic flowchart of a method for simulating 3D digital human interaction according to an embodiment of the present application, which is applied to the above-mentioned terminal device, and the method includes steps S110 to S140 .
  • Step S110 Acquire scene data collected by the collection device.
  • the collection device may be a device arranged inside the terminal device, or may be a device connected to the terminal device.
  • the collection device may be an image collection device, an infrared sensor, a microphone, a laser ranging sensor, and the like.
  • the image acquisition device may be an ordinary camera, or may be a camera that can acquire spatial depth information, such as a binocular camera, a structured light camera, or a TOF camera.
  • the infrared sensor may be a distance sensor with infrared function, or the like.
  • the image acquisition device can also automatically change the lens angle to acquire images from different angles.
  • the collection device is used to collect scene data in the current scene, where the current scene is the scene where the terminal device is currently located.
  • the scene data may be at least one of visual data, infrared data, and sound data.
  • Step S120 If it is determined according to the scene data that the target user exists in the scene, the scene data is processed to obtain the relative position of the target user and the display screen.
  • After the scene data is acquired, whether a target user exists in the scene can be determined by analyzing the scene data; the target user is a real user in the scene. If it is determined by analyzing the scene data that a target user exists in the scene, the scene data can be processed to obtain the relative position of the target user with respect to the display screen.
  • the relative position is used to represent the positional relationship between the target user and the display screen, and may include information such as relative distance and relative angle between the target user and the display screen.
  • The relative position may be the positional relationship between a key point of the target user and a preset position on the display screen. The key point may be the target user's eyes, face center point, a body part, and so on, and its position can be determined by image detection, processing sensor data, and the like, which is not limited here. The preset position may be the center point of the display screen, the border of the display screen, the display position used to show the simulated digital human image, and so on, which is likewise not limited here.
  • the relative position information of the target user and the collecting device can be obtained, and the relative position of the target user and the display screen can be further determined according to the relative position information.
  • In some embodiments, the positional relationship between the acquisition device and the display screen can be obtained, and the conversion relationship between the camera coordinate system and a space coordinate system can be determined according to that positional relationship, where the space coordinate system takes the position of the display screen as its origin; based on the conversion relationship and the three-dimensional coordinates of the target user, the relative position of the target user with respect to the display screen is determined in the space coordinate system. In this way, a more accurate relative position of the user with respect to the display screen can be obtained.
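  • A minimal sketch of such a camera-to-screen conversion, assuming the camera's pose relative to the screen (rotation R, translation t) is known from a one-time calibration; the numbers and frame conventions are illustrative assumptions, not part of the patent text:

```python
import numpy as np

def camera_to_screen(point_cam, R, t):
    """Convert a 3D point from the camera coordinate system into a
    screen-centered coordinate system, given the camera's pose.

    point_cam: (3,) point observed by the depth camera, in meters.
    R, t: rotation matrix and translation vector expressing the camera's
          pose in the screen coordinate frame (assumed calibrated once).
    """
    point_cam = np.asarray(point_cam, dtype=float)
    return R @ point_cam + t

# Example: camera mounted 0.3 m above the screen center, looking straight out.
R = np.eye(3)                        # no rotation between camera and screen
t = np.array([0.0, 0.3, 0.0])        # camera offset from the screen origin

user_in_cam = np.array([0.5, -0.2, 2.0])   # user 2 m out, 0.5 m to the side
user_screen = camera_to_screen(user_in_cam, R, t)

distance = np.linalg.norm(user_screen)
azimuth = np.degrees(np.arctan2(user_screen[0], user_screen[2]))
print(distance, azimuth)             # relative distance and horizontal angle
```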
  • In other embodiments, the relative position information of the target user with respect to the collecting device can itself be used as the relative position of the target user with respect to the display screen. It can be understood that when the acquisition device is built into the terminal device, or is connected to the terminal device and located close to it, the difference between the two quantities is small, so the relative position of the target user with respect to the acquisition device can be used directly. In this way, there is no need to obtain the positional relationship between the acquisition device and the display screen in advance, the position of the acquisition device can be changed, and flexibility is better.
  • Depending on the type of collection device, the collected scene data also differ. When the scene data is visual data collected by an image acquisition device, the relative position information of the target user with respect to the acquisition device can be obtained by image-based ranging or by analyzing depth image data.
  • When the scene data is infrared data collected by an infrared sensor, whether a target user exists in the scene can be determined by analyzing the infrared data. The infrared sensor emits infrared light, which is reflected when it encounters an obstacle; the sensor measures the intensity of the reflected light, and that intensity decreases as the distance to the obstacle increases. Therefore, whether a target user exists in the scene can be determined by analyzing the infrared data, and when a target user is determined to exist, the relative position information of the target user with respect to the collecting device can be further determined.
  • Similarly, the scene data may be sound data collected by a sound collection device such as a microphone.
  • a preset simulated digital human image in a state to be awakened may be displayed on the display screen.
  • the preset simulated digital human image in the state to be woken up may be a simulated digital human image with the face facing straight ahead.
  • The preset simulated digital human image in the to-be-awakened state can also be a dynamically turning simulated digital human image sequence, that is, a dynamic simulated digital human video, so as to show the user that the simulated digital human can be presented from different angles. For example, it could be a simulated digital human that dynamically turns from 15 degrees to the left to 15 degrees to the right.
  • the preset simulated digital human image or the simulated digital human image sequence may also be a greeting simulated digital human to remind the user to interact.
  • Step S130 If the target user is located in the preset area, acquire the target simulated digital human image corresponding to the relative position.
  • the preset area is a preset area for interacting with the target user in the area.
  • the preset area may be an area whose distance from the display screen is smaller than a preset value.
  • the preset area may also be an area whose distance from the display screen is smaller than a preset value and whose angle with the display screen is smaller than a preset angle. Specifically, it can be determined whether the target user exists in the preset area by comparing the relative position and the preset value.
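  • A minimal sketch of this membership test, assuming the combined distance-plus-angle rule described above; the RelativePosition fields and threshold values are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class RelativePosition:
    distance_m: float      # distance from the display screen, in meters
    angle_deg: float       # horizontal angle off the screen's normal, degrees

def in_preset_area(pos, max_distance_m=5.0, max_angle_deg=60.0):
    """Return True if the target user falls inside the preset interaction
    area. The patent only requires the distance to be below a preset value,
    optionally combined with a preset angle; the defaults are placeholders.
    """
    return pos.distance_m < max_distance_m and abs(pos.angle_deg) < max_angle_deg

print(in_preset_area(RelativePosition(distance_m=2.0, angle_deg=30.0)))   # True
print(in_preset_area(RelativePosition(distance_m=7.5, angle_deg=10.0)))   # False
```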
  • In some embodiments, the area in the scene corresponding to the scene data can be used as the preset area; that is to say, the preset area is the area within which the collection device can collect scene data. If it is determined according to the scene data that a target user exists in the scene, it is then determined that the target user is located in the preset area. In this way, in an area with low traffic density and few users, interaction can be initiated proactively as soon as a target user is detected.
  • In other embodiments, the preset area is smaller than the area covered by the scene data.
  • the preset area may be an area with a distance from the display screen less than 5 meters. In this way, in an area with high crowd density and many users, when a target user is detected, the user's interaction intention can be further determined according to the distance between the user and the display screen.
  • By setting a preset area for interaction, whether to interact with the target user can be decided according to whether the target user is located in the preset area. In this way, on the one hand, interaction can begin without the user having to actively trigger it, making interaction more efficient; on the other hand, the user's interaction intention can be inferred from the preset area, with a user inside the preset area regarded as a user with interaction intention, enabling accurate interaction.
  • For example, if the terminal device is a large-screen device set in a company lobby and the preset area is the company's front desk, there may be multiple users in the lobby at busy times and the terminal device cannot know which user to interact with; when a user is at the front desk, however, that user can be taken as the target user for interaction.
  • The preset simulated digital human model is pre-trained on a plurality of sample images containing a real model together with the reference parameter corresponding to each sample image; according to the input reference parameters, the model outputs the simulated digital human image corresponding to the matching sample image.
  • In some embodiments, the target reference parameter is determined from the preset reference parameters according to the relative position; the target reference parameter is input into the preset simulated digital human model, and the output simulated digital human image is used as the target simulated digital human image.
  • The reference parameter can be used to represent the relative position between the real model in the sample image and the image acquisition device that collected the sample image, and the relative position can be a relative angle or a relative distance. Specifically, please refer to the subsequent embodiments.
  • By contrast, obtaining a three-dimensional 3D digital human through 3D modeling depends heavily on the modeler's prior experience and requires extensive manual adjustment to approximate a real person; obtaining 3D digital humans for different models requires repeating the modeling process, which consumes considerable labor cost.
  • the preset simulated digital human model is a deep learning model obtained through training.
  • The process of obtaining the target simulated digital human image from the simulated digital human model does not require 3D modeling, and the resulting simulated digital human is closer to the real model, with a more realistic effect. This makes it suitable for practical applications in which different real-life models may need to be modeled to obtain simulated digital humans.
  • In some embodiments, the scene data may include a scene image. Human head information in the scene image is identified; the number of users in the scene image is obtained according to the head information; if the number of users is one, the identified user is taken as the target user; and the scene image is processed to obtain the relative position of the target user with respect to the display screen.
  • If the number of users is more than one, whether an interaction instruction input by a user is acquired is monitored; if an interaction instruction input by a user is acquired, the user corresponding to the interaction instruction, that is, the user who input it, is taken as the target user. Specifically, please refer to the subsequent embodiments.
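  • A minimal sketch of this target-user selection logic; `detect_heads` and `get_interaction_instruction` are hypothetical stand-ins for a head-detection model and the instruction-monitoring step, not APIs named in the patent:

```python
def select_target_user(scene_image, detect_heads, get_interaction_instruction):
    heads = detect_heads(scene_image)          # one entry per detected head
    if not heads:
        return None                            # no user in the scene
    if len(heads) == 1:
        return heads[0]                        # single user becomes the target
    # Multiple users: only the user who inputs an interaction instruction
    # is taken as the target user.
    return get_interaction_instruction()

# Example with stub detectors: two heads detected, user "B" issues the
# interaction instruction and is therefore selected.
target = select_target_user(
    scene_image=None,
    detect_heads=lambda img: ["user A", "user B"],
    get_interaction_instruction=lambda: "user B",
)
print(target)   # user B
```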
  • In some embodiments, interaction information can also be obtained; the interaction information is processed to obtain response voice information; and the target reference parameters and the response voice information are input into a preset simulated digital human model to obtain an output image sequence, the image sequence being composed of multiple consecutive frames of the target simulated digital human image. Specifically, please refer to the subsequent embodiments.
  • Step S140 Display the target simulated digital human image on the display screen.
  • the target simulated digital human image can be displayed at the display position of the display screen.
  • the display screen may be the display screen of the terminal device, or other image display devices connected to the terminal device, and the display position may be a preset position for displaying the simulated digital human, or a display position determined according to the relative position.
  • A prompt for human-computer interaction can be given by voice or text, so as to guide the user to further interact. For example, in a bank scenario, the wake-up interface can display: "What help do you need? You can try asking me 'How do I handle the deposit business?'"
  • a video including the target simulated digital human can be generated according to the image sequence, and the video can be displayed on the display screen.
  • For example, a preset simulated digital human with the face facing straight ahead can first be displayed on the display screen; then the target simulated digital human images corresponding to the relative position are obtained, and when the video corresponding to the simulated digital human is synthesized from the image sequence, the effect of the simulated digital human turning naturally toward the user can be realized.
  • In some embodiments, the scene data is collected in real time; if a change in the relative position is detected, a new target simulated digital human image is generated according to the changed relative position, and the new target simulated digital human image is displayed on the display screen.
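  • An illustrative sketch of such a real-time refresh loop; `get_relative_position` and `render_digital_human` are hypothetical stand-ins for the scene-analysis and image-generation steps above, and the polling rate and change threshold are arbitrary choices:

```python
import time

def run_display_loop(get_relative_position, render_digital_human,
                     poll_interval_s=0.1, change_threshold_deg=5.0):
    """Regenerate the simulated digital human whenever the target user's
    relative angle changes noticeably."""
    last_angle = None
    while True:
        angle = get_relative_position()        # e.g. angle in degrees, or None
        if angle is not None and (last_angle is None
                                  or abs(angle - last_angle) > change_threshold_deg):
            render_digital_human(angle)        # new target image for new position
            last_angle = angle
        time.sleep(poll_interval_s)
```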
  • It should be noted that steps S120 and S130 can be performed locally by the terminal device, in the server, or divided between the terminal device and the server; tasks can be allocated as required according to the actual application scenario, which is not limited here.
  • In the simulated 3D digital human interaction method provided by this embodiment, the scene data collected by the collection device is obtained; if it is determined according to the scene data that a target user exists in the scene, the scene data is processed to obtain the relative position of the target user with respect to the display screen; if the target user is located within the preset area, the target simulated digital human image corresponding to the relative position is acquired and displayed on the display screen. This simulates the interaction effect of face-to-face communication between the user and the simulated digital human and realizes anthropomorphic interaction. Compared with traditional methods, in which a virtual image can only interact in a fixed position and with a fixed orientation, the proposed solution improves the flexibility of human-computer interaction, avoids such limitations, and further improves the user's interactive experience.
  • FIG. 3 is a schematic flowchart of a method for simulating 3D digital human interaction according to an embodiment of the present application, which is applied to the above-mentioned terminal device, and the method includes steps S210 to S250 .
  • Step S210 Acquire scene data collected by the collection device.
  • Step S220 If it is determined according to the scene data that the target user exists in the scene, the scene data is processed to obtain the relative position of the target user and the display screen.
  • In some embodiments, the scene data may include a scene image. Human head information in the scene image is identified; the number of users in the scene image is obtained according to the head information; if the number of users is one, the identified user is taken as the target user; and the scene image is processed to obtain the relative position of the target user with respect to the display screen.
  • If the number of users is more than one, whether an interaction instruction input by a user is acquired is monitored; if an interaction instruction input by a user is acquired, the user corresponding to the interaction instruction is taken as the target user.
  • For details, please refer to the subsequent embodiments.
  • Step S230 If the target user is located in the preset area, the target reference parameter is determined from a plurality of preset reference parameters according to the relative position.
  • The reference parameter is used to represent the pose of the real model contained in the sample images used to train the preset simulated digital human model, relative to the image acquisition device that collected the sample images, where the pose may include position information and posture information. Correspondingly, the reference parameter may include at least one of a distance parameter and an angle parameter between the real model and the image acquisition device.
  • the distance parameter is used to characterize the relative distance between the real model and the image capture device
  • the angle parameter is used to characterize the relative angle between the real model and the image capture device.
  • The image acquisition device can be regarded as the eyes of the target user: the target simulated digital human image corresponds to a sample image of the real model acquired by the image acquisition device, so looking at it is like viewing the real model through the image acquisition device.
  • The relative position of the target user with respect to the display screen may not exactly match any of the preset reference parameters. Therefore, by determining the target reference parameter among the preset reference parameters according to the relative position, the pose of the simulated digital human that is closest to the current target user's position can be generated.
  • a mapping relationship between the relative position and a plurality of preset reference parameters may be set, and the target reference parameter corresponding to the relative position is determined according to the mapping relationship.
  • In this way, the requirements on the accuracy of the relative position can be relaxed: on the one hand, the 3D effect of the simulated digital human can still be realized while reducing the requirements on the acquisition device and the power consumed by processing scene data to obtain the relative position; on the other hand, the number of sample images with different reference parameters required to train the preset simulated digital human model can also be reduced.
  • an angle mapping relationship may be preset, and the target reference parameter corresponding to the relative position is determined based on the angle mapping relationship.
  • the angle mapping relationship includes a plurality of angle intervals and an angle parameter corresponding to each angle interval. The angle interval to which the relative position belongs can be determined from the angle mapping relationship, and then the angle parameter corresponding to the angle interval is used as the target reference parameter.
  • a distance mapping relationship may be preset, and the target reference parameter corresponding to the relative position is determined based on the distance mapping relationship.
  • the distance mapping relationship includes a plurality of distance intervals and a distance parameter corresponding to each distance interval. The distance interval to which the relative position belongs can be determined from the distance mapping relationship, and then the distance parameter corresponding to the distance interval is used as the target reference parameter.
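  • A minimal sketch of such an interval-based mapping, shown here for the angle case; the interval edges and preset angle parameters are illustrative assumptions standing in for whatever camera angles the sample images were actually shot from:

```python
import bisect

ANGLE_BOUNDS = [-22.5, -7.5, 7.5, 22.5]            # interval edges, degrees
ANGLE_PARAMS = [-30.0, -15.0, 0.0, 15.0, 30.0]     # one preset parameter per interval

def angle_to_reference_parameter(relative_angle_deg):
    """Map the user's relative angle onto the preset angle parameter of the
    interval it falls into (the distance mapping works the same way)."""
    index = bisect.bisect(ANGLE_BOUNDS, relative_angle_deg)
    return ANGLE_PARAMS[index]

print(angle_to_reference_parameter(-3.0))   # 0.0  -> frontal sample images
print(angle_to_reference_parameter(18.0))   # 15.0 -> sample images shot from one side
```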
  • the target reference parameter corresponding to the relative position may be determined from a plurality of preset reference parameters based on an optimal path solving algorithm.
  • the optimal path solving algorithm may be Dijkstra algorithm, A* algorithm, SPFA algorithm, Bellman-Ford algorithm, Floyd-Warshall algorithm, etc., which are not limited herein.
  • Step S240 Input the target reference parameters into the preset simulated digital human model, and use the output simulated digital human image as the target simulated digital human image.
  • The preset simulated digital human model is pre-trained on multiple sample images containing a real model together with the reference parameter corresponding to each sample image, and the model outputs the simulated digital human image corresponding to the input reference parameters. Specifically, multiple images of the real model under different reference parameters may be collected as sample images by an image collection device, with the reference parameter corresponding to each sample image recorded.
  • Each reference parameter may also correspond to multiple sample images of the real model in different poses. For example, four images of the real model showing four different expressions (such as joy, anger, sorrow, and happiness) may be collected from the same camera perspective as the sample images corresponding to that reference parameter.
  • the simulated digital human model may include a feature generation model and an image generation model, and both the feature generation model and the image generation model are preset deep learning-based models.
  • The feature generation model is used to obtain, from the input reference parameters, the feature parameters of the real model in the corresponding sample image, where the feature parameters are obtained by extracting the facial key points, posture key points, contour key points, and the like of the real model in the image.
  • the image generation model is used to generate a corresponding simulated digital human image according to the characteristic parameters of the real model.
  • Specifically, the target reference parameters can be input into the preset simulated digital human model; the feature generation model obtains the feature parameters of the real model in the sample image corresponding to the target reference parameters, and the image generation model then generates the corresponding simulated digital human image, which is used as the target simulated digital human image.
  • the facing angle of the simulated digital human in the target simulated digital human image is the same as the facing angle of the real model in the sample image corresponding to the target reference parameter.
  • the orientation angle is used to represent the rotation angle of the real model in the sample image relative to the front face.
  • the orientation angle may include at least one of a horizontal angle and a vertical angle.
  • the horizontal angle can be used to characterize the angle of the live model in the horizontal direction.
  • the sample images collected by the collection device located on the left side of the real model and the sample images collected by the collection device located on the right side of the real model correspond to different horizontal angles of the real model.
  • the vertical angle can be used to characterize the angle of the live model in the vertical direction.
  • For example, sample images collected by a collection device placed at a height and shooting downward correspond to a different vertical angle of the real model than images shot straight on.
  • the physical features of the simulated digital human in the target simulated digital human image are the same as the physical features of the real model in the sample image corresponding to the target reference parameters.
  • Body features include facial expression, body shape, action posture, texture and other features.
  • the obtained simulated digital human is as realistic as a real model, and visually it is like watching a real model captured by a camera.
  • In this embodiment, the orientation angle of the simulated digital human in the target simulated digital human image is the same as the orientation angle of the real model in the sample image corresponding to the target reference parameter, and the physical features of the simulated digital human are the same as those of the real model in that sample image. In this way, the current position of the target user relative to the simulated digital human on the display screen is mapped to the position of the image acquisition device relative to the real model when the sample image was collected. By acquiring the target simulated digital human image corresponding to the target reference parameters, the target user gets the visual experience of looking at the real model from the position of the image acquisition device, so the simulated digital human image presents a stereoscopic, realistic 3D effect.
  • For example, when the target user is on the left side of the display screen, the target simulated digital human image includes the left side of the digital human's face, that is, the angle presented after the simulated digital human turns from facing front toward the right; when the target user is directly facing the display screen, the target simulated digital human image includes the digital human's frontal face; and when the target user is on the right side of the display screen, the target simulated digital human image includes the right side of the digital human's face, that is, the angle presented after the simulated digital human turns from facing front toward the left.
  • the simulated digital human image with the face facing the target user will be displayed, so as to realize the effect of face-to-face interaction between the simulated digital human and the target user.
  • Depending on the relative position, the displayed size of the simulated digital human may also differ.
  • Step S250 Display the target simulated digital human image on the display screen.
  • In some embodiments, the scene data is collected in real time; if a change in the relative position is detected, a new target simulated digital human image is generated according to the changed relative position, and the new target simulated digital human image is displayed on the display screen.
  • It should be noted that steps S210 to S240 can be performed locally by the terminal device, in the server, or divided between the terminal device and the server; tasks can be allocated as required according to the actual application scenario, which is not limited here.
  • In the simulated 3D digital human interaction method provided by this embodiment, the scene data collected by the collection device is obtained; if the target user is located in the preset area, the target reference parameter is determined from the preset reference parameters according to the relative position; the target reference parameter is input into the preset simulated digital human model; and the output simulated digital human image is used as the target simulated digital human image and displayed on the display screen. In this way, the presentation angle of the simulated digital human can be made to face the target user, and because the target simulated digital human image is generated from sample images containing a real model, it can achieve a realistic effect close to that of the real model.
  • FIG. 4 is a schematic flowchart of a method for simulating 3D digital human interaction according to an embodiment of the present application, applied to the above-mentioned terminal device, and the method includes steps S310 to S360 .
  • Step S310 Acquire scene data collected by the collection device.
  • Step S320 If it is determined according to the scene data that the target user exists in the scene, the scene data is processed to obtain the relative position of the target user and the display screen.
  • In some embodiments, the scene data may include a scene image. Human head information in the scene image is identified; the number of users in the scene image is obtained according to the head information; if the number of users is one, the identified user is taken as the target user; and the scene image is processed to obtain the relative position of the target user with respect to the display screen.
  • If the number of users is more than one, whether an interaction instruction input by a user is acquired is monitored; if an interaction instruction input by a user is acquired, the user corresponding to the interaction instruction is taken as the target user.
  • For details, please refer to the subsequent embodiments.
  • Step S330 If the target user is located in the preset area, determine the user viewing angle parameter according to the relative position.
  • the user viewing angle parameter is used to represent the viewing angle of the target user toward the preset position of the display screen.
  • The preset position may be a fixed position that does not change, such as the center point or a border of the display screen, or may be the display position of the simulated digital human image, which is not limited here.
  • the target user can be identified by processing the scene data, so as to determine the user's viewing angle parameter.
  • the face of the target user can be detected by an image detection algorithm to determine the position of the target user's eyes, and then the user viewing angle parameters can be determined according to the position of the target user's eyes and the preset position of the display screen.
  • In this way, the power consumption of identifying users who are not in the preset area and computing their viewing angle parameters can be avoided, improving the efficiency of resource utilization.
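  • A minimal sketch of computing the viewing angle parameter from the detected eye position and a preset point on the screen; the coordinate conventions and sample numbers are illustrative assumptions:

```python
import math

def viewing_angle(eye_xyz, screen_point_xyz=(0.0, 0.0, 0.0)):
    """Compute the user's viewing angle toward a preset point on the screen.

    Coordinates are in a screen-centered frame (meters): x to the right,
    y up, z out of the screen toward the user. `eye_xyz` would come from
    face detection plus depth data; the preset point defaults to the
    screen center, but could instead be the digital human's displayed eyes.
    """
    dx = eye_xyz[0] - screen_point_xyz[0]
    dy = eye_xyz[1] - screen_point_xyz[1]
    dz = eye_xyz[2] - screen_point_xyz[2]
    horizontal = math.degrees(math.atan2(dx, dz))   # left/right viewing angle
    vertical = math.degrees(math.atan2(dy, dz))     # up/down viewing angle
    return horizontal, vertical

print(viewing_angle((0.6, -0.1, 1.8)))   # user slightly right of and below center
```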
  • the target display position of the display screen can be determined according to the relative position, and the target display position is the display position of the target simulated digital human image on the display screen; the user viewing angle parameter is determined according to the relative position and the target display position.
  • the corresponding relationship between the preset display position and the relative position of the user may be obtained, and after the relative position is obtained, the display position corresponding to the relative position is used as the target display position according to the corresponding relationship.
  • For example, depending on the target user's relative position, the target display position may be the right area of the display screen or the left area of the display screen.
  • the simulated digital human can also move from the left area of the display screen to the right area of the display screen, as if the target user and the simulated digital human are walking side by side.
  • the target display position may also be the display position of the simulated digital human eyes in the target simulated digital human image. In this way, the viewing angle parameters of the target user looking at the eyes of the simulated digital human can be obtained, thereby realizing the effect that the simulated digital human is looking at the target user like a real person.
  • different relative positions can correspond to different target display positions, so that the simulated digital human is more realistic and vivid.
  • different target display positions can be determined according to different positions of the target user, so as to shorten the distance between the digital human and the target user, and enable more natural interpersonal interaction.
  • Step S340 From the preset multiple reference parameters, determine the target reference parameter according to the user viewing angle parameter.
  • As described above, the image acquisition device can be regarded as the eyes of the target user, because the target simulated digital human image corresponds to a sample image of the real model collected by the image acquisition device; the pose of the real model relative to the image acquisition device thus becomes the pose of the simulated digital human relative to the target user, realizing the effect of the target user watching a real model through the image acquisition device. Specifically, please refer to step S230.
  • Step S350 Input the target reference parameters into the preset simulated digital human model, and use the output simulated digital human image as the target simulated digital human image.
  • Step S360 Display the target simulated digital human image on the display screen.
  • In some embodiments, the scene data is collected in real time; if a change in the relative position is detected, a new target simulated digital human image is generated according to the changed relative position, and the new target simulated digital human image is displayed on the display screen.
  • It should be noted that steps S310 to S350 can be performed locally by the terminal device, in the server, or divided between the terminal device and the server; tasks can be allocated as required according to the actual application scenario, which is not limited here.
  • In the simulated 3D digital human interaction method provided by this embodiment, the scene data collected by the collection device is obtained; if it is determined according to the scene data that a target user exists in the scene, the scene data is processed to obtain the relative position of the target user with respect to the display screen; if the target user is located in the preset area, the user viewing angle parameter is determined according to the relative position, the target reference parameter is determined from the preset reference parameters according to the viewing angle parameter, the target reference parameter is input into the preset simulated digital human model, and the output simulated digital human image is used as the target simulated digital human image and displayed on the display screen. By determining the user's viewing angle parameter and obtaining the target simulated digital human image corresponding to it, the fidelity of the simulated digital human is further increased and the human-computer interaction experience is optimized.
  • FIG. 5 is a schematic flowchart of a method for simulating 3D digital human interaction provided by an embodiment of the present application, which is applied to the above-mentioned terminal device, and the method includes steps S410 to S470 .
  • Step S410 Acquire scene data collected by the collection device.
  • Step S420 If it is determined according to the scene data that the target user exists in the scene, the scene data is processed to obtain the relative position of the target user and the display screen.
  • In some embodiments, the scene data may include a scene image. Human head information in the scene image is identified; the number of users in the scene image is obtained according to the head information; if the number of users is one, the identified user is taken as the target user; and the scene image is processed to obtain the relative position of the target user with respect to the display screen.
  • If the number of users is more than one, whether an interaction instruction input by a user is acquired is monitored; if an interaction instruction input by a user is acquired, the user corresponding to the interaction instruction is taken as the target user.
  • For details, please refer to the subsequent embodiments.
  • Step S430 If the target user is located in the preset area, obtain interaction information.
  • the interaction information may be multimodal information such as voice information, motion information, touch operation information, and the like.
  • the interaction information may be information of preset interaction instructions input by the target user, or may be multimodal information that can be recognized by the terminal device.
  • Step S440 Process the interaction information to obtain response voice information.
  • a corresponding relationship between the interaction instruction and the corresponding response voice information may be preset, and the response voice information is acquired based on the corresponding relationship.
  • For example, when the interaction information is a preset wake-up word, the corresponding response voice information may be "Hello, can I help you with something?".
  • When the interaction information is voice information input by the target user, the voice information can be converted into text through automatic speech recognition (ASR), and then natural language understanding (NLU) is performed on the text to analyze the voice information and obtain response text information according to the analysis result. Further, the response voice information corresponding to the response text information can be obtained through text-to-speech (TTS) technology.
  • ASR: Automatic Speech Recognition
  • NLU: Natural Language Understanding
  • TTS: Text To Speech
  • The natural language understanding operation can be realized by an intent recognition model, which can use machine learning models such as the Recurrent Neural Network (RNN) model, the Convolutional Neural Networks (CNN) model, the Variational Autoencoder (VAE) model, Bidirectional Encoder Representations from Transformers (BERT), and the Support Vector Machine (SVM), which are not limited here.
  • RNN: Recurrent Neural Network
  • CNN: Convolutional Neural Networks
  • VAE: Variational Autoencoder
  • BERT: Bidirectional Encoder Representations from Transformers
  • SVM: Support Vector Machine
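  • A hedged sketch of the response pipeline described above: ASR converts the user's speech to text, NLU maps the text to a reply, and TTS converts the reply back to speech. The engine objects and the preset reply table are hypothetical stand-ins; no specific library API is implied:

```python
# Illustrative instruction -> reply mapping, standing in for the preset
# correspondence between interaction instructions and response voice information.
PRESET_REPLIES = {
    "hello": "Hello, can I help you with something?",   # wake-word example
    "deposit": "Please follow me through the deposit process.",
}

def understand_and_reply(user_text: str) -> str:
    """Minimal 'NLU' by keyword lookup; a real system would use an
    intent-recognition model (RNN / CNN / BERT / SVM, etc.)."""
    for keyword, reply in PRESET_REPLIES.items():
        if keyword in user_text.lower():
            return reply
    return "Sorry, could you repeat that?"

def voice_to_response_voice(audio_bytes, asr_engine, tts_engine):
    """ASR -> NLU -> TTS. `asr_engine` and `tts_engine` are assumed to expose
    transcribe()/synthesize() methods; any concrete engines could be plugged in."""
    user_text = asr_engine.transcribe(audio_bytes)
    reply_text = understand_and_reply(user_text)
    return reply_text, tts_engine.synthesize(reply_text)

print(understand_and_reply("hello there"))   # Hello, can I help you with something?
```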
  • Step S450 Determine the target reference parameter from the preset reference parameters according to the relative position.
  • Please refer to step S230.
  • Step S460 Input the target reference parameters and the response voice information into a preset simulated digital human model to obtain an output image sequence.
  • the image sequence is composed of multiple frames of continuous images of the target simulated digital human, and the action posture or facial expression of the simulated digital human in the images may be continuously changed.
  • Specifically, the semantic information of the response text corresponding to the response voice information, as well as the phoneme information of the voice, can be obtained.
  • According to the target reference parameters, the corresponding orientation angle of the simulated digital human can be determined and a simulated digital human whose face is oriented toward the target user can be obtained; then, according to the response voice information, an image sequence is obtained in which the simulated digital human's action posture or facial expression corresponds to the response voice information.
  • the target simulated digital human image in the image sequence is an image of the simulated digital human whose face faces the target user, and the action state corresponds to the voice information.
  • In some embodiments, the simulated digital human model may include a feature generation model and an image generation model; the target reference parameters are input into the feature generation model to obtain initial feature parameters, which characterize the shape of the real model in the corresponding sample image. At least one of the expression parameters, action parameters, and mouth shape parameters among the initial feature parameters is then adjusted to obtain a parameter sequence containing multiple target feature parameters; based on the image generation model, the target simulated digital human image corresponding to each target feature parameter is obtained, yielding the image sequence corresponding to the parameter sequence.
  • The shape of the real model corresponding to the sample image may include at least one of an orientation angle and physical features; that is to say, the orientation angle and physical features of the simulated digital human obtained from the initial feature parameters may be the same as those of the real model.
  • The preset simulated digital human model may further include an audio visual prediction model, which obtains the feature parameters corresponding to the response voice information from the input response voice information and the initial feature parameters. Through the audio visual prediction model, at least one of the expression parameters, action parameters, and mouth shape parameters among the initial feature parameters can be adjusted to obtain a parameter sequence composed of multiple target feature parameters, so that the external performance of the simulated digital human matches the response voice information.
  • the target simulated digital human image corresponding to each target feature parameter can be obtained, so as to obtain the image sequence corresponding to the parameter sequence.
  • more accurate characteristic parameters of the simulated digital human can be obtained, so that the image of the simulated digital human is more realistic and natural.
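A minimal sketch of this parameter pipeline, assuming stand-in functions for the trained feature generation and audio-visual prediction models and an invented phoneme-to-mouth-shape table, might look as follows; none of these interfaces are prescribed by the embodiment.

```python
# Illustrative pipeline: initial feature parameters -> parameter sequence.
from dataclasses import dataclass, replace
from typing import List

@dataclass(frozen=True)
class FeatureParams:
    orientation_deg: float   # face orientation matching the target user
    expression: str
    action: str
    mouth_shape: str

# Invented phoneme-to-mouth-shape lookup for the response voice.
MOUTH_SHAPE_FOR_PHONEME = {"n": "closed", "i": "spread", "h": "open", "ao": "round"}

def feature_generation_model(target_ref_angle_deg: float) -> FeatureParams:
    # Stand-in for the trained feature generation model.
    return FeatureParams(target_ref_angle_deg, "neutral", "idle", "closed")

def audio_visual_prediction(init: FeatureParams,
                            phonemes: List[str]) -> List[FeatureParams]:
    # Stand-in for the audio-visual prediction model: adjust the mouth-shape
    # parameter per phoneme (and the action parameter on the first frame).
    return [replace(init,
                    action="wave" if i == 0 else init.action,
                    mouth_shape=MOUTH_SHAPE_FOR_PHONEME.get(ph, "closed"))
            for i, ph in enumerate(phonemes)]

init = feature_generation_model(target_ref_angle_deg=15.0)
for target_params in audio_visual_prediction(init, ["n", "i", "h", "ao"]):
    print(target_params)   # each entry would drive the image generation model
```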
For example, when the target user is located on the left side of the screen and the response voice information is determined to be "hello" according to the interaction information, the corresponding target reference parameter can be determined according to the relative position of the user, the sample image can be determined from the target reference parameter, and initial feature parameters with which the simulated digital human faces a direction matching the target user's position can be obtained. Then, according to the response voice information, the action parameters in the initial feature parameters are modified to the action parameters of a waving greeting, and the mouth shape parameters of the simulated digital human are modified to the mouth shape parameters corresponding to "hello", so that multiple target feature parameters corresponding to the response voice information, and then the corresponding continuously changing image sequence, are obtained. In this way, a simulated digital human whose face is facing the user and who waves in greeting can be displayed.
Step S470: Generate and output the video of the simulated digital human according to the image sequence, and play the response voice information synchronously.
After the image sequence is obtained, the multiple target simulated digital human images in the image sequence can be synthesized into a simulated digital human video that matches the response voice information, and while the simulated digital human video is displayed on the display screen, the response voice information is played synchronously. In this way, the simulated digital human is not only displayed at the corresponding angle according to the position of the target user, so as to interact with its face facing the target user, but also has an action state corresponding to the response voice information. Thereby, the fidelity of the simulated digital human can be improved, enhancing the user's human-computer interaction experience.
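As one hedged illustration of the synthesis step (the embodiment does not specify a codec or toolchain), the frame rate can be chosen so that the video duration equals the duration of the response voice, which keeps the synchronous playback aligned; the OpenCV usage below is an assumption.

```python
# Illustrative sketch: write the image sequence to a video whose duration
# matches the response audio, so the voice can be played back in sync.
import cv2
import numpy as np

def frames_to_video(frames, audio_duration_s, out_path="digital_human.mp4"):
    fps = len(frames) / audio_duration_s       # match video length to audio
    h, w = frames[0].shape[:2]
    writer = cv2.VideoWriter(out_path,
                             cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for frame in frames:
        writer.write(frame)                    # BGR uint8 frames
    writer.release()
    return out_path

# Dummy 25-frame sequence standing in for the model's output images.
dummy_frames = [np.full((360, 640, 3), i * 10, np.uint8) for i in range(25)]
print(frames_to_video(dummy_frames, audio_duration_s=1.0))
# A player would then start the response voice track at frame 0.
```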
In some embodiments, the scene data is data collected in real time; if a relative position change is detected, a new target simulated digital human image is generated according to the changed relative position, and the new target simulated digital human image is displayed on the display screen. That is to say, the simulated digital human corresponds not only to the response voice information but also to the real-time relative position of the target user, so the simulated digital human is more flexible and vivid. Specifically, please refer to the subsequent embodiments.
It can be understood that steps S410 to S470 can be performed locally by the terminal device, or performed in the server, or performed by the terminal device and the server through division of labor; according to different actual application scenarios, tasks can be assigned as required, which is not limited here.
In the simulated 3D digital human interaction method provided by this embodiment, the scene data collected by the collection device is acquired; if it is determined according to the scene data that a target user exists in the scene, the scene data is processed to obtain the relative position of the target user and the display screen; if the target user is located in the preset area, the interaction information is acquired; the interaction information is processed to obtain the response voice information; the target reference parameter is determined from the preset multiple reference parameters according to the relative position; the target reference parameter and the response voice information are input into the preset simulated digital human model to obtain the output image sequence; and the video of the simulated digital human is generated and output according to the image sequence while the response voice information is played synchronously. In this way, not only is a simulated digital human whose face is facing the target user displayed according to the position of the target user, but the digital human also has an action state corresponding to the response voice information. Playing the response voice information while the video is displayed further increases the fidelity of the simulated digital human and optimizes the human-computer interaction experience.
Please refer to FIG. 6, which is a schematic flowchart of a simulated 3D digital human interaction method provided by an embodiment of the present application, applied to the above-mentioned terminal device; the method includes steps S510 to S570.
Step S510: Acquire the scene image collected by the collection device.
The collection device may be a common camera, or may be an image collection device for acquiring spatial depth information; for example, the image collection device may be a binocular camera, a structured light camera, a TOF camera or the like. Correspondingly, the scene image may be an ordinary image of the current scene, or may be a depth image containing depth information and color information.
Step S520: Determine whether the target user exists in the scene image.
Whether there is a target user in the scene can be determined by analyzing the scene image. For example, a detection algorithm can be used to identify the human head information in the scene image, and the detection algorithm can be the YOLO (You Only Look Once) algorithm, RCNN, SSD (Single Shot MultiBox Detector) or another algorithm capable of identifying a natural person in an image. In one embodiment, other types of scene data may also be used to determine whether the target user exists in the scene. For a specific description of determining according to the scene data that the target user exists in the scene, please refer to step S120; details are not repeated here.
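As a hedged stand-in for the detectors named above (the embodiment does not require any particular one), a person can be detected in the scene image with the HOG pedestrian detector that ships with stock OpenCV:

```python
# Minimal person-detection sketch using OpenCV's built-in HOG pedestrian
# detector as a stand-in for YOLO/RCNN/SSD-style detectors.
import cv2
import numpy as np

hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

scene_image = np.zeros((480, 640, 3), np.uint8)   # placeholder scene image
boxes, weights = hog.detectMultiScale(scene_image, winStride=(8, 8))

# A target user is considered present if at least one person box is found.
target_user_exists = len(boxes) > 0
print(target_user_exists, len(boxes))
```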
Step S530: If yes, identify the scene image to obtain the three-dimensional coordinates of the target user in the camera coordinate system.
The camera coordinate system takes the position of the collection device as the origin. Depending on the scene images obtained by different collection devices, the scene image can be processed in different ways to identify the target user in the scene image, thereby obtaining the three-dimensional coordinates of the target user in the camera coordinate system. When the image is an image collected by a common camera, the depth information corresponding to the target user in the image can be obtained through a depth estimation algorithm, so as to determine the three-dimensional coordinates. When the image is a depth image, the three-dimensional coordinates of the target user in the camera coordinate system can be calculated from the depth information. For example, when the collection device is a binocular camera, binocular ranging can be used to determine the three-dimensional coordinates corresponding to the target user; when the collection device is a structured light camera, triangular parallax ranging can be used to determine the three-dimensional coordinates corresponding to the target user; when the collection device is a TOF camera, the travel time of a light pulse from the transmitter of the TOF camera to the target object and back to the receiver of the TOF camera can be calculated per pixel, so as to determine the three-dimensional coordinates corresponding to the target user.
In one embodiment, camera calibration of the collection device may also be performed in advance to obtain the camera extrinsic parameters and camera intrinsic parameters of the collection device, and the three-dimensional coordinates of the target user can be acquired accurately in combination with these camera parameters.
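For illustration, and assuming placeholder calibration values, once a per-pixel depth is available (here derived from stereo disparity as an example), the target user's camera-coordinate position follows from pinhole back-projection:

```python
# Back-project a pixel with known depth into camera coordinates using the
# intrinsics (fx, fy, cx, cy) from calibration. All numbers are placeholders.
import numpy as np

def depth_from_stereo(disparity_px, fx, baseline_m):
    # Binocular ranging: depth = focal_length * baseline / disparity.
    return fx * baseline_m / disparity_px

def pixel_to_camera_xyz(u, v, depth_m, fx, fy, cx, cy):
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])

fx = fy = 600.0
cx, cy = 320.0, 240.0                                   # assumed intrinsics
depth = depth_from_stereo(disparity_px=30.0, fx=fx, baseline_m=0.1)  # 2.0 m
print(pixel_to_camera_xyz(400, 260, depth, fx, fy, cx, cy))
```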
Step S540: Obtain the positional relationship between the collection device and the display screen, and determine the conversion relationship between the camera coordinate system and the space coordinate system according to the positional relationship.
The space coordinate system takes the position of the display screen as the origin, and the space coordinate system can be used to represent position coordinates in the real world. The positional relationship between the collection device and the display screen can be acquired in advance, and the coordinates of the image collection device in the space coordinate system can be determined according to this positional relationship, thereby obtaining the conversion relationship between the camera coordinate system and the space coordinate system. The position of the display screen may be a position that does not change, such as the center point of the display screen or a frame of the display screen, or may be the display position for the simulated digital human image, which is not limited here. In some embodiments, the display position of the simulated digital human image can change according to the relative position of the target user and the display screen.
Step S550: Based on the conversion relationship and the three-dimensional coordinates, determine the relative position of the target user and the display screen in the space coordinate system.
The relative position includes at least one of a relative distance and a relative angle. Based on the conversion relationship and the three-dimensional coordinates, the relative position of the target user and the display screen can be determined in the space coordinate system. In this way, a relatively accurate relative position of the target user with respect to the display screen can be obtained.
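A minimal sketch of this conversion, assuming the camera's pose relative to the screen has been measured in advance (the rotation, translation and coordinate values below are placeholders):

```python
# Illustrative conversion from camera coordinates to the screen-origin space
# coordinate system, followed by the relative distance and angle.
import numpy as np

R = np.eye(3)                     # camera axes assumed aligned with the screen
t = np.array([0.0, 0.3, 0.0])     # camera assumed 0.3 m above the screen origin

def camera_to_screen(p_cam):
    return R @ p_cam + t

p_user = camera_to_screen(np.array([0.27, 0.07, 2.0]))   # from step S530
relative_distance = np.linalg.norm(p_user)
# Horizontal angle of the user off the screen normal (z axis), in degrees.
relative_angle = np.degrees(np.arctan2(p_user[0], p_user[2]))
print(relative_distance, relative_angle)
```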
In some embodiments, the eyes of the target user can be identified by a detection algorithm, and the position of the eyes is used as the three-dimensional coordinates of the target user in the camera coordinate system. The conversion relationship between the camera coordinate system and the space coordinate system is then determined according to the positional relationship between the collection device and the display position of the simulated digital human image on the display screen, where the display position may be the position of the eyes of the simulated digital human. In this way, the relative position between the eyes of the target user and the eyes of the simulated digital human on the display screen can be determined in the space coordinate system, so that, according to this relative position, a simulated digital human can be obtained that is not only angled toward the target user but whose eyes also look toward the target user.
Step S560: If the target user is located in the preset area, acquire the target simulated digital human image corresponding to the relative position.
In some embodiments, interaction information can also be acquired; the interaction information can be processed to obtain response voice information; and the target reference parameters and the response voice information can be input into a preset simulated digital human model to obtain an output image sequence, the image sequence being composed of multiple frames of continuous target simulated digital human images. Specifically, please refer to the foregoing embodiments.
Step S570: Display the target simulated digital human image on the display screen.
In some embodiments, the scene data is data collected in real time; if a relative position change is detected, a new target simulated digital human image is generated according to the changed relative position, and the new target simulated digital human image is displayed on the display screen. Specifically, please refer to the subsequent embodiments.
It can be understood that steps S510 to S560 can be performed locally by the terminal device, or performed in the server, or performed by the terminal device and the server through division of labor; according to different actual application scenarios, tasks can be allocated as required, which is not limited here.
In the simulated 3D digital human interaction method provided by this embodiment, the scene data collected by the collection device is acquired; whether a target user exists in the scene image is determined; if so, the scene image is identified to obtain the three-dimensional coordinates of the target user in the camera coordinate system, the positional relationship between the collection device and the display screen is acquired, and the conversion relationship between the camera coordinate system and the space coordinate system is determined according to the positional relationship; based on the conversion relationship and the three-dimensional coordinates, the relative position of the target user and the display screen is determined in the space coordinate system; if the target user is located in the preset area, the target simulated digital human image corresponding to the relative position is acquired; and the target simulated digital human image is displayed on the display screen. By acquiring the positional relationship between the collection device and the display screen, the position of the target user relative to the display screen can be determined more accurately, so that a simulated digital human whose face is accurately oriented toward the target user is obtained according to that position, and the picture fidelity of the virtual image is higher.
Please refer to FIG. 7, which is a schematic flowchart of a simulated 3D digital human interaction method provided by an embodiment of the present application, applied to the above-mentioned terminal device; the method includes steps S610 to S670.
Step S610: Acquire the scene data collected by the collection device.
Step S620: Identify the human head information in the scene image.
The scene data includes a scene image, and the human head information in the scene image can be identified by a detection algorithm. The detection algorithm may be the YOLO (You Only Look Once) algorithm, RCNN, SSD (Single Shot MultiBox Detector) or another algorithm capable of identifying a natural person in an image.
In some embodiments, scene data obtained by sensors may also first be used to judge whether a target user exists in the scene, and the human head information in the scene image is identified only when it is determined that a target user exists. Since judging whether a target user exists from sensor data requires less power consumption than image recognition, performing image recognition only when a target user exists in the scene can reduce the power consumption required for image recognition.
Step S630: Acquire the number of users in the scene image according to the human head information.
By identifying the human head information in the scene image, the number of users in the scene image, that is, the number of users in the current scene, can be determined.
In some embodiments, if the number of users is more than one, it is monitored whether an interaction instruction input by a user is acquired; if an interaction instruction input by a user is acquired, the user corresponding to the interaction instruction is taken as the target user. For example, when multiple target users are located in the preset area, the simulated digital human can be kept in a forward-facing posture and may greet everyone or greet no one; when a certain target user speaks to the digital human, the digital human turns to that target user for interaction. When the user leaves the interaction area, a preset simulated digital human image in the to-be-awakened state can be displayed.
The interaction instruction may be preset multimodal information; specifically, the interaction instruction may be multimodal information such as voice information, an action instruction or a touch operation. The voice information may be voice information containing preset keywords, and the user's interaction intention can be obtained by performing intent recognition on the voice information; the action instruction may be a preset action, gesture or the like used for interaction, such as waving at the screen. This embodiment does not limit this.
As one way, the sound information in the scene can be collected through a microphone, and whether the sound information contains the user's voice information can be determined through human voice detection. In one embodiment, preset keywords may also be detected by an acoustic model, so as to further determine whether an interaction instruction input by the user is acquired.
When the interaction instruction is voice information, as one way, the position of the sound source of the voice information can be determined by means of sound ranging, so that the user at that position is taken as the target user; as another way, the scene image can be processed to recognize the lip movements of multiple users, the user who inputs the interaction instruction can be determined through lip-language recognition, and that user is taken as the target user.
As another way, motion recognition can be performed on the scene image to determine whether there is an action instruction input by a user, and the user who inputs the action instruction is taken as the target user. For example, gesture recognition can be performed on the scene image to detect whether a user is waving toward the screen.
As yet another way, whether a touch operation input by a user is acquired can be detected by the screen sensor, and if so, the user who inputs the touch operation is taken as the target user.
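Taken together, target-user selection among multiple users can be sketched as a small dispatch over the modality of the detected interaction instruction; every helper and field below is a hypothetical stub, not an interface defined by the embodiment.

```python
# Hypothetical dispatch: pick the target user according to the modality of
# the interaction instruction. All helpers and data fields are stubs.
def user_at_sound_source(users, voice_event):       # sound ranging
    return min(users, key=lambda u: abs(u["angle_deg"] - voice_event["angle_deg"]))

def user_with_lip_motion(users):                    # lip-language recognition
    return next((u for u in users if u.get("lips_moving")), None)

def select_target_user(users, instruction):
    if instruction["type"] == "voice":
        return user_at_sound_source(users, instruction) or user_with_lip_motion(users)
    if instruction["type"] in ("gesture", "touch"): # waving / screen sensor
        return instruction["user"]
    return None

users = [{"id": 1, "angle_deg": -20.0},
         {"id": 2, "angle_deg": 15.0, "lips_moving": True}]
print(select_target_user(users, {"type": "voice", "angle_deg": 12.0}))  # user 2
```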
In still other embodiments, when the number of users is more than one, each user in the scene image may be taken as a target user, and multiple first relative positions between the multiple target users and the display screen may be obtained. If the multiple target users are located in the preset area, multiple target avatar images corresponding to the multiple first relative positions are acquired based on the preset avatar model, and the multiple target avatar images are displayed on the display screen so as to interact with the target users respectively. In this way, each target user can interact face to face with an avatar, which improves the efficiency of the interaction.
Step S640: If the number of users is one, take the identified user as the target user.
When the number of users in the scene image is one, that is, when there is one user in the current scene, that user is taken as the target user.
Step S650: Process the scene image to obtain the relative position of the target user and the display screen.
Step S660: If the target user is located in the preset area, acquire the target simulated digital human image corresponding to the relative position.
In some embodiments, interaction information can also be acquired; the interaction information can be processed to obtain response voice information; and the target reference parameters and the response voice information can be input into a preset simulated digital human model to obtain an output image sequence, the image sequence being composed of multiple frames of continuous target simulated digital human images. Specifically, please refer to the foregoing embodiments.
Step S670: Display the target simulated digital human image on the display screen.
In some embodiments, the scene data is data collected in real time; if a relative position change is detected, a new target simulated digital human image is generated according to the changed relative position, and the new target simulated digital human image is displayed on the display screen. Specifically, please refer to the subsequent embodiments.
It can be understood that steps S610 to S660 can be performed locally by the terminal device, or performed in the server, or performed by the terminal device and the server through division of labor; according to different actual application scenarios, tasks can be allocated as required, which is not limited here.
In the simulated 3D digital human interaction method provided by this embodiment, the scene data collected by the collection device is acquired; the human head information in the scene image is identified; the number of users in the scene image is acquired according to the human head information; if the number of users is one, the identified user is taken as the target user; the scene image is processed to obtain the relative position of the target user and the display screen; if the target user is located in the preset area, the target simulated digital human image corresponding to the relative position is acquired; and the target simulated digital human image is displayed on the display screen. By identifying the scene image to determine the number of users in the scene, and displaying a virtual image whose face is oriented toward the user in different ways according to the number of target users in the preset area, the modes of interaction are enriched and the efficiency of human-computer interaction is improved.
Please refer to FIG. 8, which is a schematic flowchart of a simulated 3D digital human interaction method provided by an embodiment of the present application, applied to the above-mentioned terminal device; the method includes steps S710 to S760.
Step S710: Acquire the scene data collected by the collection device.
Here, the scene data is data collected in real time.
Step S720: If it is determined according to the scene data that a target user exists in the scene, process the scene data to obtain the relative position of the target user and the display screen.
In some embodiments, the scene data may include a scene image, and the human head information in the scene image may be identified; the number of users in the scene image may be acquired according to the human head information; if the number of users is one, the identified user is taken as the target user; and the scene image is processed to obtain the relative position of the target user and the display screen. In one embodiment, if the number of users is more than one, it is monitored whether an interaction instruction input by a user is acquired; if an interaction instruction input by a user is acquired, the user corresponding to the interaction instruction is taken as the target user. Specifically, please refer to the foregoing embodiments.
Step S730: If the target user is located in the preset area, acquire the target simulated digital human image corresponding to the relative position.
In some embodiments, interaction information can also be acquired; the interaction information can be processed to obtain response voice information; and the target reference parameters and the response voice information can be input into a preset simulated digital human model to obtain an output image sequence, the image sequence being composed of multiple frames of continuous target simulated digital human images. Specifically, please refer to the foregoing embodiments.
Step S740: Display the target simulated digital human image on the display screen.
Step S750: If a relative position change is detected, generate a new target simulated digital human image according to the changed relative position.
After the target simulated digital human image is displayed on the display screen, the relative position between the target user and the display screen can be detected in real time; if a relative position change is detected, a new target simulated digital human image is generated according to the changed relative position. By detecting changes in the relative position, the corresponding target simulated digital human can be generated according to the user's real-time relative position, so that the simulated digital human faces the target user at every moment and the interaction is more natural and vivid.
In some embodiments, if the change in the relative position does not exceed a preset threshold, the target simulated digital human image displayed on the display screen is not updated. The preset threshold may be at least one of a displacement threshold and a rotation angle threshold. Specifically, the change parameters of the target user whose position has changed within a preset time, relative to the initial relative position, can be determined, the change parameters including a displacement parameter and a rotation angle parameter; only when a change parameter exceeds the corresponding threshold is a new target simulated digital human image generated according to the changed relative position, and otherwise the image displayed on the display screen is not updated. In this way, the orientation of the displayed simulated digital human is adjusted in real time according to changes in the relative position of the target user so that the target simulated digital human interacts more naturally, while the computing power and power consumption of generating the simulated digital human in real time are saved.
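A minimal sketch of this update decision, assuming hypothetical threshold values and position fields:

```python
# Illustrative check: regenerate the target simulated digital human image
# only when the user's movement exceeds the displacement threshold or the
# rotation angle threshold. The threshold values are assumptions.
import numpy as np

DISPLACEMENT_THRESHOLD_M = 0.15
ROTATION_THRESHOLD_DEG = 5.0

def should_update(initial_pos, current_pos, initial_angle_deg, current_angle_deg):
    displacement = np.linalg.norm(np.asarray(current_pos) - np.asarray(initial_pos))
    rotation = abs(current_angle_deg - initial_angle_deg)
    return (displacement > DISPLACEMENT_THRESHOLD_M
            or rotation > ROTATION_THRESHOLD_DEG)

print(should_update((0.2, 0.0, 2.0), (0.25, 0.0, 2.0), 6.0, 8.0))   # False
print(should_update((0.2, 0.0, 2.0), (0.60, 0.0, 2.0), 6.0, 8.0))   # True
```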
In some embodiments, an image sequence including multiple images of the target simulated digital human can be generated according to the relative position of the target user before the change and the relative position after the change; that is, a time-ordered sequence of multiple target simulated digital human images, changing from the previous target simulated digital human image to the image corresponding to the changed relative position, is obtained. A simulated digital human video can then be generated from the image sequence and its timing to present a gradually changing, dynamic simulated digital human. For example, when the relative position of the user and the display screen changes, the viewing angle at which the target user looks at the display screen also changes, and the target user sees the image of the simulated digital human being switched accordingly. In this way, the digital human displayed on the display screen produces an effect like that of a video shot by a camera walking around a real model, presenting the visual effect of a three-dimensional real model.
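For illustration, the gradual transition can be approximated by interpolating the orientation between the relative angle before and after the change; the helper below is an assumption, with each interpolated angle then mapped to a preset reference parameter and rendered into one image.

```python
# Illustrative interpolation of the simulated digital human's orientation,
# producing a time-ordered parameter sequence for a gradually turning human.
import numpy as np

def orientation_sequence(angle_before_deg, angle_after_deg, n_frames=12):
    return np.linspace(angle_before_deg, angle_after_deg, n_frames)

# Each interpolated angle would be matched to the nearest preset reference
# parameter and rendered into one target simulated digital human image.
print(orientation_sequence(-15.0, 15.0, n_frames=6))
```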
In some embodiments, when it is detected that the target user leaves the preset area, the new target simulated digital human image may be a preset simulated digital human image in the to-be-awakened state. At the same time, the state of the terminal device can also be switched to the to-be-awakened state, thereby reducing the power consumption required for real-time interaction.
In one embodiment, when it is detected that the target user leaves the preset area, a simulated digital human image of a preset action, such as waving goodbye, may also be used as the new target simulated digital human image.
Step S760: Display the new target simulated digital human image on the display screen.
In some embodiments, a simulated digital human video generated from an image sequence containing multiple target simulated digital human images may also be displayed.
It can be understood that steps S710 to S750 may be performed locally by the terminal device, or performed in the server, or performed by the terminal device and the server through division of labor; according to different actual application scenarios, tasks may be allocated as required.
In the simulated 3D digital human interaction method provided by this embodiment, the scene data collected by the collection device is acquired; if it is determined according to the scene data that a target user exists in the scene, the scene data is processed to obtain the relative position of the target user and the display screen; if the target user is located in the preset area, the target simulated digital human image corresponding to the relative position is acquired and displayed on the display screen; and if a relative position change is detected, a new target simulated digital human image is generated according to the changed relative position and displayed on the display screen. By detecting the user's position in real time and updating the target simulated digital human image in real time according to the relative position of the user and the display screen, real-time face-to-face interaction between the target user and the simulated digital human is realized.
Please refer to FIG. 9, which shows a structural block diagram of a simulated 3D digital human interaction apparatus 800 provided by an embodiment of the present application. The block diagram shown in FIG. 9 will be described below.
The simulated 3D digital human interaction apparatus 800 includes a data acquisition module 810, a position acquisition module 820, an image acquisition module 830 and a display module 840, wherein: the data acquisition module 810 is used to acquire the scene data collected by the collection device; the position acquisition module 820 is used to, if it is determined according to the scene data that a target user exists in the scene, process the scene data to acquire the relative position of the target user and the display screen; the image acquisition module 830 is used to, if the target user is located in the preset area, acquire the target simulated digital human image corresponding to the relative position, the target simulated digital human image including a simulated digital human whose face is facing the target user, and the preset area being an area whose distance from the display screen is less than a preset value; and the display module 840 is configured to display the target simulated digital human image on the display screen.
In some embodiments, the preset simulated digital human model is a model obtained by training in advance according to multiple sample images containing a real model and the reference parameters corresponding to each of the sample images, and the simulated digital human model is used to output, according to the input reference parameters, the simulated digital human image corresponding to the sample image. The image acquisition module 830 includes a parameter determination sub-module and a parameter input sub-module, wherein the parameter determination sub-module is used to determine a target reference parameter from multiple preset reference parameters according to the relative position, the reference parameter being used to represent the pose of the real model contained in the sample image relative to the image collection device that collected the sample image; and the parameter input sub-module is configured to input the target reference parameter into the preset simulated digital human model and take the output simulated digital human image as the target simulated digital human image.
In some embodiments, the parameter determination sub-module includes a first parameter determination unit and a second parameter determination unit, wherein the first parameter determination unit is configured to determine a user viewing angle parameter according to the relative position, the user viewing angle parameter being used to represent the viewing angle of the target user toward the preset position of the display screen; and the second parameter determination unit is configured to determine the target reference parameter from the preset multiple reference parameters according to the user viewing angle parameter.
In some embodiments, the first parameter determination unit includes a position determination subunit and a viewing angle parameter determination subunit, wherein the position determination subunit is used to determine a target display position on the display screen according to the relative position, the target display position being the display position of the target simulated digital human image on the display screen; and the viewing angle parameter determination subunit is configured to determine the user viewing angle parameter according to the relative position and the target display position.
In some embodiments, the simulated 3D digital human interaction apparatus 800 further includes an interaction information acquisition module and a voice information acquisition module; the interaction information acquisition module is used to acquire interaction information, and the voice information acquisition module is used to process the interaction information to acquire response voice information. The parameter input sub-module includes an image sequence acquisition unit configured to input the target reference parameter and the response voice information into the preset simulated digital human model and obtain an output image sequence, the image sequence being composed of multiple frames of continuous target simulated digital human images. The display module 840 includes a video output unit configured to generate and output the video of the simulated digital human according to the image sequence and play the response voice information synchronously.
In some embodiments, the simulated digital human model includes a feature generation model and an image generation model, and the image sequence acquisition unit includes an initial feature parameter acquisition subunit, a parameter sequence acquisition subunit and an image sequence acquisition subunit. The initial feature parameter acquisition subunit is used to input the target reference parameters into the feature generation model to obtain the initial feature parameters, the initial feature parameters being used to represent the form of the real model corresponding to the sample image; the parameter sequence acquisition subunit is used to adjust, according to the response voice information, at least one of the expression parameters, action parameters and mouth shape parameters of the initial feature parameters to obtain a parameter sequence including multiple target feature parameters; and the image sequence acquisition subunit is used to obtain, based on the image generation model, the target simulated digital human image corresponding to each target feature parameter, so as to obtain the image sequence corresponding to the parameter sequence.
In some embodiments, the orientation angle of the simulated digital human in the target simulated digital human image is the same as the orientation angle of the real model in the sample image corresponding to the target reference parameter. In some embodiments, the physical features of the simulated digital human in the target simulated digital human image are the same as the physical features of the real model in the sample image corresponding to the target reference parameter.
In some embodiments, the position acquisition module 820 includes a judgment sub-module, a coordinate acquisition sub-module, a conversion relationship determination sub-module and a position determination sub-module. The judgment sub-module is used to judge whether a target user exists in the scene image; the coordinate acquisition sub-module is used to, if so, identify the scene image to obtain the three-dimensional coordinates of the target user in the camera coordinate system, wherein the camera coordinate system takes the position of the collection device as the origin; the conversion relationship determination sub-module is used to acquire the positional relationship between the collection device and the display screen and determine the conversion relationship between the camera coordinate system and the space coordinate system according to the positional relationship, where the space coordinate system takes the position of the display screen as the origin; and the position determination sub-module is used to determine, based on the conversion relationship and the three-dimensional coordinates, the relative position of the target user and the display screen in the space coordinate system, the relative position including at least one of a relative distance and a relative angle.
In some embodiments, the position acquisition module 820 further includes an image recognition sub-module, a user number acquisition sub-module and a first processing sub-module, wherein the image recognition sub-module is used to identify the human head information in the scene image, the user number acquisition sub-module is used to acquire the number of users in the scene image according to the human head information, and the first processing sub-module is configured to take the identified user as the target user if the number of users is one.
In some embodiments, the simulated 3D digital human interaction apparatus 800 further includes an instruction monitoring sub-module and a second processing sub-module; the instruction monitoring sub-module is used to monitor, if the number of users is more than one, whether an interaction instruction input by a user is acquired, and the second processing sub-module is configured to take the user corresponding to the interaction instruction as the target user if the interaction instruction input by the user is acquired.
In some embodiments, the scene data is data collected in real time, and after the target simulated digital human image is displayed on the display screen, the simulated 3D digital human interaction apparatus 800 further includes a position detection module and a display update module; the position detection module is used to generate, when a relative position change is detected, a new target simulated digital human image according to the changed relative position, and the display update module is used to display the new target simulated digital human image on the display screen.
Please refer to FIG. 10. The electronic device 900 may be an electronic device capable of running application computer-readable instructions, such as a smart phone, a tablet computer or an electronic book. The electronic device 900 in the present application may include one or more of the following components: a processor 910, a memory 920, and one or more application computer-readable instructions, wherein the one or more application computer-readable instructions may be stored in the memory 920 and configured to be executed by the one or more processors 910, the one or more computer-readable instructions being configured to perform the methods described in the foregoing method embodiments.
The processor 910 may include one or more processing cores. The processor 910 uses various interfaces and lines to connect various parts within the entire electronic device 900, and performs the various functions of the electronic device 900 and processes data by running or executing the instructions, computer-readable instructions, code sets or instruction sets stored in the memory 920 and calling the data stored in the memory 920. In one embodiment, the processor 910 may be implemented in at least one hardware form of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA) and Programmable Logic Array (PLA). The processor 910 may integrate one or a combination of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem and the like, wherein the CPU mainly handles the operating system, the user interface, application computer-readable instructions and the like; the GPU is used for rendering and drawing the display content; and the modem is used for processing wireless communication. It can be understood that the above-mentioned modem may also not be integrated into the processor 910 and may instead be implemented by a communication chip alone.
The memory 920 may include Random Access Memory (RAM) or Read-Only Memory (ROM). The memory 920 may be used to store computer-readable instructions, codes, code sets or instruction sets. The memory 920 may include a computer-readable instruction storage area and a data storage area, wherein the computer-readable instruction storage area can store instructions for implementing the operating system, instructions for implementing at least one function (such as a touch function, a sound playback function and an image playback function) and instructions for implementing the foregoing method embodiments, and the data storage area may also store data created by the electronic device 900 in use and the like.
Please refer to FIG. 11, which shows a structural block diagram of one or more computer-readable storage media provided by an embodiment of the present application. The computer-readable storage medium 1000 stores computer-readable instructions, and the computer-readable instructions can be invoked by a processor to execute the methods described in the above method embodiments. The computer-readable storage medium 1000 may be an electronic memory such as a flash memory, an Electrically-Erasable Programmable Read-Only Memory (EEPROM), an Erasable Programmable Read-Only Memory (EPROM), a hard disk or a ROM. In one embodiment, the computer-readable storage medium 1000 includes a non-transitory computer-readable storage medium. The computer-readable storage medium 1000 has storage space for computer-readable instructions 1010 for performing any of the method steps in the above-described methods; these computer-readable instructions may be read from or written into one or more computer-readable instruction products. The computer-readable instructions 1010 may, for example, be compressed in a suitable form.


Description

仿真3D数字人交互方法、装置、电子设备及存储介质
本申请要求于2021年01月07日提交中国专利局,申请号为2021100196750,申请名称为“仿真3D数字人交互方法、装置、电子设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及人机交互技术领域,更具体地,涉及一种仿真3D数字人交互方法、装置、电子设备及存储介质。
背景技术
近年来,随着科技的进步,智能化的人机交互方式已逐渐成为国内外研究的热点,一些智能设备或者应用中设置有虚拟形象,以通过虚拟形象实现与用户的可视化交互,从而提高用户的人机交互体验。但是当前的大多数场景下,只能基于固定的虚拟形象按照固定的方式与用户进行交互,导致人机交互方式较为局限,无法模拟真实环境下的人与人之间的交互状态。
发明内容
本申请提出了一种仿真3D数字人交互方法、装置、电子设备及存储介质。
第一方面,本申请实施例提供了一种仿真3D数字人交互方法,其特征在于,由电子设备执行,包括:获取采集装置采集的场景数据;若根据所述场景数据确定场景内存在目标用户,则处理所述场景数据以获取所述目标用户与显示屏的相对位置;若所述目标用户位于预设区域内,则获取所述相对位置对应的目标仿真数字人图像,所述目标仿真数字人图像中包括面部朝向所述目标用户的仿真数字人,所述预设区域为与所述显示屏的距离小于预设数值的区域;在所述显示屏上显示所述目标仿真数字人图像。
第二方面,本申请实施例提供了一种仿真3D数字人交互装置,设置于电子设备中,所述装置包括数据采集模块,位置获取模块,图像获取模块,以及显示模块,其中,数据采集模块,用于获取采集装置采集的场景数据;位置获取模块,用于若根据所述场景数据确定场景内存在目标用户,则处理所述场景数据以获取所述目标用户与显示屏的相对位置;图像获取模块,用于若所述目标用户位于预设区域内,则获取所述相对位置对应的目标仿真数字人图像,所述目标仿真数字人图像中包括面部朝向所述目标用户的仿真数字人,所述预设区域为与所述显示屏的距离小于预设数值的区域;显示模块,用于在所述显示屏上显示所述目标仿真数字人图像。
第三方面,本申请实施例提供了一种电子设备,包括:一个或多个处理器;存储器;一个或多个计算机可读指令,其中所述一个或多个计算机可读指令被存储在所述存储器中并被配置为由所述一个或多个处理器执行,所述一个或多个计算机可读指令配置用于执行上述第一方面所述的方法。
第四方面,本申请实施例提供了一个或多个计算机可读取存储介质,所述计算机可读取存储介质中存储有计算机可读指令,所述计算机可读指令可被处理器调用执行如上述第一方面所述的方法。
本申请的一个或多个实施例的细节在下面的附图和描述中提出。基于本申请的说明书、附图以及权利要求书,本申请的其它特征、目的和优点将变得更加明显。
附图说明
为了更清楚地说明本申请实施例中的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1示出了一种适用于本申请实施例的应用环境示意图;
图2示出了本申请一个实施例提供的仿真3D数字人交互方法的流程示意图;
图3示出了本申请又一个实施例提供的仿真3D数字人交互方法的流程示意图;
图4示出了本申请另一个实施例提供的仿真3D数字人交互方法的流程示意图;
图5示出了本申请再一个实施例提供的仿真3D数字人交互方法的流程示意图;
图6示出了本申请还一个实施例提供的仿真3D数字人交互方法的流程示意图;
图7示出了本申请又另一个实施例提供的仿真3D数字人交互方法的流程示意图;
图8示出了本申请又再一个实施例提供的仿真3D数字人交互方法的流程示意图;
图9示出了本申请实施例提供的仿真3D数字人交互装置的结构框图;
图10示出了本申请实施例的用于执行根据本申请实施例的仿真3D数字人交互方法的电子设备的结构框图;
图11示出了本申请实施例的用于保存或者携带实现根据本申请实施例的仿真3D数字人交互方法的计算机可读指令的存储单元。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
术语定义
3D数字人:通过3D建模、渲染等计算机图形学技术现实的数字人。
仿真数字人:通过深度学习模型生成每一帧画质近乎于相机拍摄的逼真图像,数字人如同相机拍摄的真人的效果。在一个实施例中,可以由连贯的逼真图像生成视频数字人。
仿真3D数字人:以仿真数字人技术生成数字人,并考虑到数字人及观众的空间位置及展现视角,通过仿真数字人实现立体逼真的效果。在一个实施例中,可以由多张仿真数字人图像序列生成出立体逼真的视频数字人。
目前,大多数应用虚拟形象的交互场景中,为了提高呈现画面的逼真度,通常会采用根据真人模特生成的虚拟形象,也就是数字人,来与用于进行交互。并且,可以对数字人预先设计一些动作,该动作可以与交互的语音内容或文字内容相配合,以提升用户的观感体验。虽然将动作与交互内容的方式能够使数字人的动作姿态接近真人模特,但是,该种方式仅将交互内容与动作进行配合,没有建立用户的位置与数字人之间的联系。在实际应用时,若用户在播放数字人画面的屏幕相对较偏的区域,即用户相对屏幕中心的角度较大时,该屏幕中的数字人通常还是处于固定地正视于屏幕正前方。而真实生活中,人与人进行交流时,通常是面对面的交流状态,此时数字人的角度显然与真实环境下的人与人之间的交互状态不符。因此,现有技术中的数字人的呈现方式未深入考虑用户的行为,进而导致数字人呈现画面的逼真程度较低,交互不自然,用户的交互体验较差。
虽然3D数字人可呈现立体的视觉效果,但3D数字人是通过3D建模、渲染等计算机图形学技术现实的数字人,呈现的数字人效果通常为动画效果,不能实现相机拍摄真人般的效果。
为改善上述问题,本申请发明人研究如何在用户与数字人进行交互时,更多地考虑用户的行为,以实现自然拟人的交互效果。基于此,发明人提出了仿真3D数字人交互方法、装置、电子设备及介质,以使用户在进行人机交互的过程中,能根据用户的位置,显示面部朝向目标用户的仿真数字人来与用户交互,仿真数字人不但逼真如同相机拍摄的真人,并且可以模拟用户与真人模特面对面交流的交互效果,实现拟人化的自然交互,提高了用户的交互体验。
请参阅图1,图1示出了一种适用于本申请实施例的应用环境示意图。本申请实施例提供的仿真3D数字人交互方法可以应用于如图1所示的交互***10。交互***10包括终端设备101以及服务器102。服务器102与终端设备101之间通过无线或者有线网络连接,以基于该网络连接实现终端设备101与服务器102之间的数据传输,传输的数据包括但不限于音频、视频、文字、图像等。
其中,服务器102可以是单独的服务器,也可以是服务器集群,还可以是多台服务器构成的服务器中心,可以是本地服务器,也可以是云端服务器。服务器102可用于为用户提供后台服务,该后台服务可包括但不限于仿真3D数字人交互服务等,在此不作限定。
在一些实施方式中,终端设备101可以是具有显示屏且支持数据输入的各种电子设备,包括但不限于智能手机、平板电脑、膝上型便携计算机、台式计算机和可穿戴式电子设备等。具体的,数据输入可以是基于智能终端设备101上具有的语音模块输入语音、字符输入模块输入字符、图像输入模块输入图像、视频输入模块输入视频等,还可以是基于智能终端设备101上安装有的手势识别模块,使得用户可以实现手势输入等交互方式。
在一些实施方式中,终端设备101上可以安装有客户端应用计算机可读指令,用户可以基于客户端应用计算机可读指令(例如APP等)与服务器102进行通信。具体地,终端设备101可以获取用户的输入信息,基于终端设备101上的客户端应用计算机可读指令与服务器102进行通信,服务器102可以对接收到的用户输入信息进行处理,服务器102还可以根据该信息返回对应的输出信息至终端设备101,终端设备101可执行输出信息对应的操作。其中,用户的输入信息可以是语音信息、基于屏幕的触控操作信息、手势信息、动作信息等,输出信息可以是图像、视频、文字、音频等,在此不做限定。
在一些实施方式中,客户端应用计算机可读指令可以基于仿真数字人提供人机交互服务,人机交互服务可以基于场景需求的不同而不同。例如,客户端应用计算机可读指令可以用于在商场、银行、展厅等公共区域,向用户提供产品展示信息或服务指引,针对不同的应用场景,可以提供不同的交互服务。
在一些实施方式中,终端设备101在获取与用户输入的信息对应的回复信息后,可以在终端设备101的显示屏或与其连接的其他图像输出设备上显示对应与该回复信息的仿真数字人。其中,仿真数字人可以是根据用户自身或其他人的形态建立的形似真人的形象,也可以是动漫效果式的机器人,例如动物形态或卡通人物形态的机器人。作为一种方式,在显示仿真数字人图像的同时,可以通过终端设备101的扬声器或与其连接的其他音频输出设备播放与仿真数字人图像对应的音频,还可以在终端设备101的显示屏上显示与该回复信息对应的文字或图形,实现在图像、语音、文字等多个方面上与用户的多态交互。
在一些实施方式中,对用户输入信息进行处理的装置也可以设置于终端设备101上,使得终端设备101无需依赖与服务器102建立通信即可实现与用户的交互,实现基于数字人的人机交互,此时交互***10可以只包括终端设备101。
上述的应用环境仅为方便理解所作的示例,本申请实施例不仅局限于上述应用环境。
下面将通过具体实施例对本申请实施例提供的仿真3D数字人交互方法、装置、电子设备及介质进行详细说明。
请参阅图2,图2为本申请一实施例提供的一种仿真3D数字人交互方法的流程示意图,应用于上述终端设备,该方法包括步骤S110至步骤S140。
步骤S110:获取采集装置采集的场景数据。
采集装置可以是设置在终端设备内部的装置,也可以是与终端设备连接的装置。其中,采集装置可以是图像采集装置、红外传感器、麦克风、激光测距传感器等。具体地,图像采集装置可以是普通摄像头,也可以是双目摄像头、结构光摄像头、TOF摄像头等可获取空间深度信息的摄像头。红外传感器可以是具有红外功能的距离传感器等。在一些实施方式中,图像采集装置也自动可以改变镜头角度,从而获取不同角度的图像。
采集装置用于采集当前场景中的场景数据,当前场景为终端设备当前所处的场景。根据采集装置的不同种类,场景数据可为视觉数据、红外数据、声音数据中的至少一种。
步骤S120:若根据场景数据确定场景内存在目标用户,则处理场景数据以获取目标用户与显示屏的相对位置。
在获取到场景数据后,可以通过分析场景数据以判断该场景中是否存在目标用户,目标用户为该场景内的真人用户。若分析场景数据确定场景内存在目标用户时,可以处理场景数据以获取目标用户与显示屏的相对位置。
其中,相对位置用于表征目标用户与显示屏之间的位置关系,可以包括目标用户与显示屏之间的相对距离、相对角度等信息。作为一种方式,相对位置可以是目标用户的关键点与显示屏上预设位置之间的位置关系,其中,关键点可以是目标用户的眼睛、面部中心点、 肢体部位等,可以通过图像检测、处理传感器数据等方式确定关键点的位置,在此不做限定;预设位置可以是显示屏的中心点、显示屏的边框、用于显示仿真数字人图像的显示位置等,在此不做限定。
具体地,当根据所述场景数据确定场景内存在目标用户时,可以获取目标用户与采集装置的相对位置信息,进一步地根据相对位置信息确定目标用户与显示屏的相对位置。
作为一种方式,可以获取采集装置与显示屏的位置关系,根据位置关系确定相机坐标系与空间坐标系的转换关系,其中,空间坐标系以显示屏的位置为原点;基于转换关系和三维坐标,在空间坐标系中确定目标用户与显示屏的相对位置。通过这种方式,可以得到用户相对显示屏的更为准确的相对位置。
作为另一种方式,也可以将目标用户与采集装置的相对位置信息作为目标用户与显示屏的相对位置。可以理解的是,当采集装置为终端设备中内置的装置,或者采集装置与终端设备连接并且距离较近时,相对位置信息和相对位置之间相差较小,可以将目标用户与采集装置的相对位置信息作为目标用户与显示屏的相对位置,从而无需在使用前预先获取采集装置与显示屏的位置关系,采集装置的位置可以变化,灵活性更好。
可以理解的是,根据采集装置的不同,采集得到的场景数据也不同。
作为一种方式,当场景数据为图像采集装置采集的视觉数据时,可以通过分析视觉数据以判断场景中是否存在目标用户。例如,可以通过人脸检测、行人识别等方式确定场景中是否存在目标用户。当确定场景内存在目标用户时,可以通过图像测距或者分析深度图像数据进一步地获取目标用户与采集装置的相对位置信息。
作为另一种方式,当场景数据为红外传感器采集的红外数据时,可以通过分析红外数据以判断场景中是否存在目标用户。具体地,红外传感器可以发送红外光,当红外光遇到障碍物时会发生反射,红外传感器可以获取反射回来的红外光强度,并且该红外光强度与障碍物之间的距离成正比。因此,可以通过分析红外数据确定场景内是否存在目标用户,并在确定场景内存在目标用户时,进一步地确定该目标用户与采集装置的相对位置信息。
作为又一种方式,当场景数据为麦克风等声音采集装置采集的声音数据时,可以通过分析声音数据以判断场景中是否存在目标用户。具体地,可以通过人声检测等方式确定当前场景中是否存在目标用户,若存在,则可以通过声音测距等方式进一步地获取目标用户与采集装置的相对位置信息。
在一些实施方式中,当根据场景数据确定场景内不存在目标用户时,可以在显示屏上显示预设的待唤醒状态的仿真数字人图像。作为一种方式,预设的待唤醒状态的仿真数字人图像可以为面部朝向正前方的仿真数字人图像。作为另一种方式,预设的待唤醒状态的仿真数字人图像也可以是动态地转向的仿真数字人图像序列,即一个动态的仿真数字人视频,以向用户展示仿真数字人可以呈现不同的角度这一特性。例如,可以是从朝向左15度动态变化到朝向右15度的仿真数字人。在一个实施例中,预设的仿真数字人图像或者仿真数字人图像序列,还可以是打招呼的仿真数字人,以提醒用户进行交互。
步骤S130:若目标用户位于预设区域内,则获取相对位置对应的目标仿真数字人图像。
预设区域为预先设置的与该区域内的目标用户进行交互的区域。其中,预设区域可以是与显示屏的距离小于预设数值的区域。在一些实施方式中,预设区域还可以是与显示屏的距离小于预设数值,并且与显示屏的角度小于预设角度的区域。具体地,可以通过比较相对位置和预设数值,来判断预设区域内是否存在目标用户。
作为一种方式,可以将场景数据所对应的场景内的区域作为预设区域,也就是说,预设区域为采集装置可以采集到场景数据的区域,若根据场景数据确定场景内存在目标用户,则判定目标用户位于预设区域内。通过这种方式,可以在人流密度低、用户较少的区域,在检测到目标用户时主动地交互。
作为另一种方式,预设区域为与场景数据的区域相比更小的区域。例如,当采集装置为可以获取距离显示屏10米以内场景数据的传感器时,预设区域可以是与显示屏的距离小 于5米的区域。通过这种方式,可以在人流密度高,用户较多的区域,在检测到目标用户时进一步根据用户距离显示屏的距离,来确定用户的交互意图。
通过设置进行交互的预设区域,可以根据目标用户是否位于预设区域,从而确定是否与该目标用户进行交互,通过这种方式,一方面,可以在用户无感知的情况进行交互,交互更为自然;另一方面,在多人交互场景中,可以根据预设区域进一步地确定用户的交互意图,将预设区域内的用户视为有交互意图的用户,从而准确地进行交互。例如,在终端设备为公司大厅中设置的大屏设备,预设区域为公司前台位置时,在人流量较多的情况下公司大厅中可能存在多个用户,终端设备无法得知与哪个用户进行交互,而当存在用户位于公司前台位置时,则可以将该用户作为目标用户以进行交互。
在一些实施方式中,预设的仿真数字人模型为预先根据包含真人模特的多个样本图像和每个样本图像对应的参考参数训练得到的模型,仿真数字人模型用于根据输入的参考参数,输出样本图像对应的仿真数字人图像。根据相对位置,在预设的多个参考参数中确定目标参考参数;将目标参考参数输入预设的仿真数字人模型,将输出的仿真数字人图像作为目标仿真数字人图像。其中,参考参数可用于表征样本图像包含的真人模特与采集样本图像的图像采集装置的相对位置,相对位置可以是相对角度,也可以是相对距离。具体地,请参阅后续实施例。
可以理解的是,通过3D建模获取立体的3D数字人的过程,非常依赖于建模师人工的先验经验,通过大量地人为的调整来实现与真人接近的3D数字人,获取不同模特对应3D数字人需要重复进行建模过程,耗费大量的人工成本。而预设的仿真数字人模型是通过训练得到的深度学习模型,由仿真数字人模型得到目标仿真数字人图像的过程无需3D建模,得到的仿真数字人也更接近真人模特,效果更加逼真,适用于实际应用中对可能需要对不同真人模特进行建模以获取仿真数字人的情况。
在一些实施方式中,场景数据可以包括场景图像,可以识别场景图像中的人头信息;根据人头信息获取场景图像中用户的数量;若用户的数量为一个,则将所识别到的用户作为目标用户;处理场景图像以获取目标用户与显示屏的相对位置。在一个实施例中,若用户的数量为多个,则监测是否获取到用户输入的交互指令;若获取到用户输入的交互指令,则将交互指令对应的用户作为目标用户。可以理解,交互指令对应的用户,即为输入该交互指令的用户。具体地,请参见后续实施例。
在一些实施方式中,也可以获取交互信息;处理交互信息以获取应答语音信息;将目标参考参数和应答语音信息输入预设的仿真数字人模型,得到输出的图像序列,图像序列由多帧连续的目标仿真数字人图像构成。具体地,请参见后续实施例。
步骤S140:在显示屏上显示目标仿真数字人图像。
在获取目标仿真数字人图像后,可以在显示屏的显示位置显示目标仿真数字人图像。其中,显示屏可以是终端设备的显示屏,也可以是与终端设备相连的其他图像显示装置,显示位置可以是预先设置的显示仿真数字人的位置,也可以是根据相对位置确定的显示位置。在一个实施例中,还可以在显示目标仿真数字人后,通过语音或者文字进行人机交互的提示,以引导用户进一步进行交互。例如,在银行使用场景下,待唤醒界面可以显示“请问您需要什么帮助?您可以试试问我‘如何办理存款业务?’”。
在一些实施方式中,当获取由多帧连续的目标仿真数字人图像构成的图像序列时,可以根据该图像序列生成包含目标仿真数字人的视频,在显示屏上显示该视频。例如,在未检测到目标用户前,在显示屏上可以显示有预设的面部朝向正前方的仿真数字人,获取相对位置对应的目标仿真数字人图像,可以是面部朝向正前方的仿真数字人转向相对位置对应的仿真数字人对应的多张图像序列。根据图像序列合成仿真数字人对应的视频,可以实现仿真数字人自然地转向的效果。
在一些实施方式中,场景数据为实时采集的数据,若检测到相对位置改变,则根据改变后的相对位置生成新的目标仿真数字人图像;在显示屏上显示新的目标仿真数字人图像。具体地,请参见后续实施例。
在一些实施方式中,可以检测目标用户是否离开预设区域,若目标用户已经离开,则显示 预设的待唤醒状态的仿真数字人图像。
可以理解的是,步骤S120和S130可以由终端设备在本地进行,也可以在服务器中进行,还可以由终端设备与服务器分工进行,根据实际应用场景的不同,可以按照需求进行任务的分配,在此不做限定。
本实施例提供的仿真3D数字人交互方法,获取采集装置采集的场景数据,若根据场景数据确定场景内存在目标用户,则处理场景数据以获取目标用户与显示屏的相对位置,若目标用户位于预设区域内,则获取相对位置对应的目标仿真数字人图像,在显示屏上显示目标仿真数字人图像。可以模拟用户与仿真数字人面对面交流的交互效果,实现了拟人化的仿真数字人交互,相较于传统方法只能按照固定方式,比如,虚拟形象只能在固定位置按照固定朝向进行交互,本申请的方案提高了人机交互的灵活性,避免了局限,进而提高了用户的交互体验。
请参阅图3,图3为本申请一实施例提供的一种仿真3D数字人交互方法的流程示意图,应用于上述终端设备,该方法包括步骤S210至步骤S250。
步骤S210:获取采集装置采集的场景数据。
步骤S220:若根据场景数据确定场景内存在目标用户,则处理场景数据以获取目标用户与显示屏的相对位置。
在一些实施方式中,场景数据可以包括场景图像,可以识别场景图像中的人头信息;根据人头信息获取场景图像中用户的数量;若用户的数量为一个,则将所识别到的用户作为目标用户;处理场景图像以获取目标用户与显示屏的相对位置。在一个实施例中,若用户的数量为多个,则监测是否获取到用户输入的交互指令;若获取到用户输入的交互指令,则将交互指令对应的用户作为目标用户。具体地,请参见后续实施例。
步骤S230:若目标用户位于预设区域内,则根据相对位置,在预设的多个参考参数中确定目标参考参数。
其中,参考参数用于表征训练预设的仿真数字人模型所采用的样本图像包含的所述真人模特相对于采集所述样本图像的图像采集装置的位姿,其中,位姿可以包括位置信息和姿态信息,相应地,参考参数可以包括真人模特与图像采集装置的距离参数和角度参数中的至少一个。距离参数用于表征真人模特与图像采集装置的相对距离,角度参数用于表征真人模特与图像采集装置的相对角度。
可以理解的是,在实际应用中,可以将图像采集装置视为目标用户的眼睛,目标仿真数字人图像是根据图像采集装置采集得到的真人模特的样本图像,即实现了仿佛通过图像采集装置观看真人模特般的效果。虽然在训练仿真数字人模型时,可以获取尽可能多的参考参数对应的样本图像,以获取目标用户不同的位置对应的图像,但是在实际应用中,可能出现目标用户与显示屏的相对位置没有一模一样的参考参数的情况。因此,通过根据相对位置在预设的多个参考参数中确定目标参考参数,可以实现生成与当前目标用户的位置最为接近的仿真数字人的位姿。
在一些实施方式中,可以设定相对位置和预设的多个参考参数之间的映射关系,根据映射关系确定相对位置对应的目标参考参数。通过这种方式,一方面可以降低对相对位置的精度的要求,只要确定相对位置大概的范围就可以实现仿真数字人的3D效果,从而降低对采集装置的要求,并减少处理场景数据以获取第一相对位置所需的功耗;另一方面,也可以减少训练预设的仿真数字人图像所需的不同参考参数的样本图像数量。
作为一种方式,当参考参数包括相对角度时,可以预先设定角度映射关系,基于角度映射关系确定相对位置对应的目标参考参数。具体地,角度映射关系包括多个角度区间和每个角度区间对应的角度参数,可以由角度映射关系确定相对位置所属于的角度区间,进而将该角度区间对应的角度参数作为目标参考参数。
作为另一种方式,当参考参数包括相对距离时,可以预先设定距离映射关系,基于距离映射关系确定相对位置对应的目标参考参数。具体地,距离映射关系包括多个距离区间和每个距离区间对应的距离参数,可以由距离映射关系确定相对位置所属于的距离区间,进而将该距离区间对应的距离参数作为目标参考参数。
在一些实施方式中,可以基于最优路径求解算法,在预设的多个参考参数中确定相对位置对应的目标参考参数。其中,最优路径求解算法可以是Dijkstra算法、A*算法、SPFA算法、Bellman-Ford算法和Floyd-Warshall算法等,在此不做限定。
步骤S240:将目标参考参数输入预设的仿真数字人模型,将输出的仿真数字人图像作为目标仿真数字人图像。
其中,预设的仿真数字人模型为预先根据包含真人模特的多个样本图像和每个样本图像对应的参考参数训练得到的模型,仿真数字人模型用于根据输入的参考参数,输出样本图像对应的仿真数字人图像。具体地,可以通过图像采集装置采集不同参考参数对应的包含真人模特的多个图像作为样本图像,并获取每个样本图像所对应的参考参数。在一个实施例中,每一个参考参数还可对应有不同姿态的多个真人模特的样本图像。例如,可通过相同的相机的视角采集包含喜怒哀乐四种表情的真人模特的四张图像作为该参考参数对应的样本图像。
其中,仿真数字人模型可以包括特征生成模型和图像生成模型,特征生成模型和图像生成模型都是预设的基于深度学习的模型。具体地,特征生成模型用于根据输入的参考参数获取该参考参数对应的样本图像中真人模特的特征参数,其中,真人模特的特征参数是通过提取图像中真人模特的面部关键点、姿态关键点、轮廓关键点等得到的特征。图像生成模型,用于根据真人模特的特征参数生成对应的仿真数字人图像。
在获取目标参考参数后,可以将目标参考参数输入预设的仿真数字人模型,通过深度生成模型获取该目标参考参数对应的样本图像中真人模特的特征参数,通过图像生成模型,根据特征参数生成对应的仿真数字人图像,作为目标仿真数字人图像。
作为一种方式,目标仿真数字人图像中仿真数字人的朝向角度与目标参考参数对应的样本图像中的真人模特的朝向角度相同。其中,朝向角度用于表征样本图像中的真人模特相对于正面朝前的旋转角度。在一个实施例中,朝向角度可以包括水平角度和竖直角度中至少一个。水平角度可用于表征水平方向上真人模特的角度。例如,位于真人模特左侧的采集装置,和位于真人模特右侧的采集装置采集得到的样本图像对应真人模特的不同的水平角度。竖直角度可以用于表征上竖直方向上真人模特的角度。例如,位于高处俯拍的采集装置,和位于低处仰拍的采集装置采集得到的样本图像对应真人模特不同的竖直角度。
作为一种方式,目标仿真数字人图像中仿真数字人的体貌特征与目标参考参数对应的样本图像中的真人模特的体貌特征相同。体貌特征包括表情、体形、动作姿态、纹理等特征。通过这种方式,得到的仿真数字人如真人模特般逼真,视觉上如同观看相机拍摄的真人模特。
作为一种方式,目标仿真数字人图像中仿真数字人的朝向角度与目标参考参数对应的样本图像中的真人模特的朝向角度相同,并且,目标仿真数字人图像中仿真数字人的体貌特征与目标参考参数对应的样本图像中的真人模特的体貌特征相同。
通过在预设的多个参考参数中确定相对位置对应的目标参考参数,可以将目标用户相对于显示屏上仿真数字人的当前位置,转换为采集样本图像时的图像采集装置相对于真人模特的位置。通过获取目标参考参数对应的目标仿真数字人图像,可以实现目标用户由图像采集装置的位置看向真人模特的视觉体验,以使仿真数字人图像呈现立体逼真的3D效果。
例如,当目标用户在显示屏的左侧时,目标仿真数字人图像包括数字人左侧的脸,即仿真数字人由正面向右侧转动后对应的角度;当目标用户正对显示屏时,目标仿真数字人包括数字人正脸;当目标用户在显示屏的右侧时,目标仿真数字人图像包括数字人右侧的脸,即仿真数字人由正面向左转动后的角度。根据用户的不同位置,都会显示面部朝向目标用户的仿真数字人图像,从而实现仿真数字人与目标用户面对面交互的效果。又如,当目标用户距离显示屏不同的距离时,目标仿真数字人图像中,仿真数字人的大小尺寸也可以是不同的。
步骤S250:在显示屏上显示目标仿真数字人图像。
在一些实施方式中,场景数据为实时采集的数据,若检测到相对位置改变,则根据改变后的相对位置生成新的目标仿真数字人图像;在显示屏上显示新的目标仿真数字人图像。具体地,请参见后续实施例。
需要说明的是,本实施例中未详细描述的部分可以参考前述实施例,在此不再赘述。
可以理解的是,步骤S210至S240可以由终端设备在本地进行,也可以在服务器中进行,还可以由终端设备与服务器分工进行,根据实际应用场景的不同,可以按照需求进行任务的分配,在此不做限定。
本实施例提供的仿真3D数字人交互方法,获取采集装置采集的场景数据,若目标用户位于预设区域内,则根据相对位置,在预设的多个参考参数中确定目标参考参数,将目标参考参数输入预设的仿真数字人模型,将输出的仿真数字人图像作为目标仿真数字人图像,在显示屏上显示目标仿真数字人图像。通过在预设的多个参考参数中确定目标参考参数,并生成目标参考参数对应的目标仿真数字人,可以使仿真数字人的呈现角度朝向目标用户,并且目标仿真数字人图像是根据包含真人模特的样本图像生成的,可以实现近似真人模特的逼真效果。
请参阅图4,图4为本申请一实施例提供的一种仿真3D数字人交互方法的流程示意图,应用于上述终端设备,该方法包括步骤S310至步骤S360。
步骤S310:获取采集装置采集的场景数据。
步骤S320:若根据场景数据确定场景内存在目标用户,则处理场景数据以获取目标用户与显示屏的相对位置。
在一些实施方式中,场景数据可以包括场景图像,可以识别场景图像中的人头信息;根据人头信息获取场景图像中用户的数量;若用户的数量为一个,则将所识别到的用户作为目标用户;处理场景图像以获取目标用户与显示屏的相对位置。在一个实施例中,若用户的数量为多个,则监测是否获取到用户输入的交互指令;若获取到用户输入的交互指令,则将交互指令对应的用户作为目标用户。具体地,请参见后续实施例。
步骤S330:若目标用户位于预设区域内,则根据所述相对位置确定用户视角参数。
其中,用户视角参数用于表征目标用户朝向显示屏的预设位置的视角。预设位置可以是显示屏的中心点、显示屏的边框等不会发生变化的位置,也可以是用于仿真数字人图像的显示位置,在此不做限定。
具体地,可以通过处理场景数据来识别目标用户,从而确定用户视角参数。例如,可以通过图像检测算法检测目标用户的人脸,从而确定目标用户的眼睛的位置,进而根据目标用户的眼睛的位置和显示屏的预设位置,确定用户视角参数。
通过先判断目标用户位于预设区域内,若目标用户位于预设区域内根据相对位置确定用户视角参数,可以减少对不在预设区域内的目标用户进行识别来获取视角参数所需的功耗,提高了资源的利用效率。
在一些实施方式中,可根据相对位置确定显示屏的目标显示位置,目标显示位置为目标仿真数字人图像在显示屏上的显示位置;根据相对位置和目标显示位置确定用户视角参数。
具体地,可以获取预设的显示位置和用户的相对位置之间的对应关系,在获取相对位置后,根据对应关系将相对位置所对应的显示位置,作为目标显示位置。例如,当目标用户位于显示屏右侧时,目标显示位置为显示屏的右侧区域,而当目标用户位于显示屏左侧时,目标显示位置为显示屏的左侧区域。在一个实施例中,当目标用户由左侧走向右侧时,仿真数字人也可以由显示屏的左侧区域走向显示屏的右侧区域,仿佛是目标用户与仿真数字人并肩走路。在一个实施例中,目标显示位置也可以是目标仿真数字人图像中,仿真数字人眼睛的显示位置。这样,可以获取目标用户看向仿真数字人眼睛的视角参数,从而实现仿真数字人如同真人般注视目标用户的效果。
通过这种方式,不同的相对位置可以对应有不同的目标显示位置,以使仿真数字人更加逼真和生动。特别是在显示屏为大屏幕的情况下,可以根据目标用户不同的位置确定不同的目标显示位置,从而拉近数字人与目标用户之间的距离,可以更为自然的人际交互。
步骤S340:在预设的多个参考参数中,根据用户视角参数确定所述目标参考参数。
通过根据用户视角参数确定目标参考参数,可以将图像采集装置视为目标用户的眼睛,由于,目标仿真数字人图像是根据图像采集装置采集得到的真人模特的样本图像,真人模特相对于图像采集装置的位姿,即为仿真数字人相对于目标用户的位姿,实现了目标用户仿佛 通过图像采集装置观看真人模特般的效果。具体地,请参阅步骤S230。
步骤S350:将目标参考参数输入预设的仿真数字人模型,将输出的仿真数字人图像作为目标仿真数字人图像。
步骤S360:在所述显示屏上显示所述目标仿真数字人图像。
在一些实施方式中,场景数据为实时采集的数据,若检测到相对位置改变,则根据改变后的相对位置生成新的目标仿真数字人图像;在显示屏上显示新的目标仿真数字人图像。具体地,请参见后续实施例。
需要说明的是,本实施例中未详细描述的部分可以参考前述实施例,在此不再赘述。
可以理解的是,步骤S310至S350可以由终端设备在本地进行,也可以在服务器中进行,还可以由终端设备与服务器分工进行,根据实际应用场景的不同,可以按照需求进行任务的分配,在此不做限定。
本实施例提供的仿真3D数字人交互方法,获取采集装置采集的场景数据,若根据场景数据确定场景内存在目标用户,则处理场景数据以获取目标用户与显示屏的相对位置,若目标用户位于预设区域内,则根据所述相对位置确定用户视角参数,在预设的多个参考参数中,根据用户视角参数确定所述目标参考参数,将目标参考参数输入预设的仿真数字人模型,将输出的仿真数字人图像作为目标仿真数字人图像,在所述显示屏上显示所述目标仿真数字人图像。通过确定用户视角参数,从而得到与用于视角参数对应的目标仿真数字人图像,进一步地增加了仿真数字人的逼真度,优化了人机交互体验。
请参阅图5,图5为本申请一实施例提供的一种仿真3D数字人交互方法的流程示意图,应用于上述终端设备,该方法包括步骤S410至步骤S470。
步骤S410:获取采集装置采集的场景数据。
步骤S420:若根据场景数据确定场景内存在目标用户,则处理场景数据以获取目标用户与显示屏的相对位置。
在一些实施方式中,场景数据可以包括场景图像,可以识别场景图像中的人头信息;根据人头信息获取场景图像中用户的数量;若用户的数量为一个,则将所识别到的用户作为目标用户;处理场景图像以获取目标用户与显示屏的相对位置。在一个实施例中,若用户的数量为多个,则监测是否获取到用户输入的交互指令;若获取到用户输入的交互指令,则将交互指令对应的用户作为目标用户。具体地,请参见后续实施例。
步骤S430:若目标用户位于预设区域内,则获取交互信息。
若目标用户位于预设区域内,则可以监测是否获取目标用户输入的交互信息,其中,交互信息可以是语音信息、动作信息、触控操作信息等多模态信息。在一个实施例中,交互信息可以是目标用户输入的预设的交互指令的信息,也可以是能够被终端设备所识别的多模态信息。
可以理解的是,由于实时地监测用户输入的交互信息需要较多的功耗,通过设定预设区域,当检测到预设区域内存在目标用户时才监测是否获取交互信息,可以减少在无目标用户时监测交互信息的浪费的功耗。
步骤S440:处理交互信息以获取应答语音信息。
在一些实施方式中,当交互信息为预设的交互指令的信息时,可以预先设置有交互指令和对应的应答语音信息的对应关系,基于该对应关系获取应答语音信息。例如,当交互信息为预设的唤醒词时,对应的应答语音信息可以是“您好,可以帮助您做些什么吗?”。
在一些实施方式中,当交互信息为目标用户输入的语音信息时,可以通过自动语音识别技术(ASR,Automatic Speech Recognition)将语音信息转换为文本后,对该文本执行自然语言理解操作((Natural Language Understanding,NLU),以实现对语音信息的解析,根据解析的结果获取应答文本信息。进一步地,可以通过文本转语音技术(Text To Speech,TTS)得到应答文本信息对应的应答语音信息。其中,自然语言理解操作可以通过意图识别模型实现,意图识别模型可以采用循环神经网络(Recurrent Neural Network,RNN)模型、卷积神经网络(Convolutional Neural Networks,CNN)模型、变分自编码器(Variational Autoencoder,VAE)模型、变压器的双向编码器表示(Bidirectional Encoder Representations from Transformers,BERT)和支 持向量机(Support Vector Machine,SVM)等机器学习模型,在此不做限定。
步骤S450:根据相对位置,在预设的多个参考参数中确定目标参考参数。
具体地,请参阅步骤S230。
步骤S460:将目标参考参数和应答语音信息输入预设的仿真数字人模型,得到输出的图像序列。
其中,图像序列由多帧连续的目标仿真数字人图像构成,图像中仿真数字人的动作姿态或者面部表情可以是连续变化的。具体地,可以获取应答语音对应的应答文本信息对应的语义信息,以及语音音素信息。根据目标参考参数确定对应的仿真数字人的朝向角度,可以获取面部朝向目标用户的仿真数字人,进而根据应答语音信息可以获取动作姿态或者面部表情与应答语音信息对应的仿真数字人的图像序列。图像序列中的目标仿真数字人图像为面部朝向目标用户,并且动作状态与语音信息相对应仿真数字人的图像。
在一些实施方式中,仿真数字人模型可以包括特征生成模型和图像生成模型,可以将目标参考参数输入特征生成模型以获取初始特征参数,初始特征参数用于表征样本图像对应的真人模特的形态;根据应答语音信息,对初始特征参数的表情参数、动作参数、嘴型参数中至少一个参数进行调整以得到参数序列,参数序列包括多个目标特征参数;基于图像生成模型,获取每个目标特征参数对应的目标仿真数字人图像,以得到参数序列对应的图像序列。
其中,样本图像对应的真人模特的形态可以包括朝向角度和体貌特征中至少一个,也就是说根据初始特征参数得到的仿真数字人的朝向角度和体貌特征都可以与真人模特是相同的。预设的仿真数字人模型还可以包括音频视觉预测模型,可以根据输入的应答语音信息和初始特征参数,获取与应答语音信息对应的特征参数。通过音频视觉预测模型,可以对初始特征参数的表情参数、动作参数、嘴型参数中至少一个参数进行调整以得到多个目标特征参数所组成的参数序列,以使仿真数字人的外在表现与应答语音信息相对应。进而可以基于图像生成模型,获取每个目标特征参数对应的目标仿真数字人图像,以得到参数序列对应的图像序列。通过这种方式,可以获取更精准的仿真数字人的特征参数,使得仿真数字人的形象更加逼真和自然。
例如,当目标用户位于屏幕的左侧,根据交互信息确定应答语音信息为“你好”时,可以根据用户的相对位置确定对应的目标参考参数,进而确定该目标参考参数确定样本图像,得到仿真数字人朝向符合目标用户位置的初始特征参数,进而根据应答语音信息将初始特征参数中的动作参数修改为招手打招呼这一动作的动作参数,并且将仿真数字人的嘴型参数也修改为“你好”所对应的嘴型参数,从而得到动作和嘴型都与应答语音信息对应的多个目标特征参数,进而得到对应连续变化的图像序列。从而可以显示面部朝向用户,招手打招呼的仿真数字人。
步骤S470:根据图像序列生成并输出仿真数字人的视频,同步播放应答语音信息。
在获取得到图像序列后,可以根据应答语音信息,将图像序列中的多个目标仿真数字人图像合成与养大语音信息匹配的仿真数字人的视频,并在显示屏上显示仿真数字人的视频的同时,同步地播放应答语音信息。这样,仿真数字人不仅会根据目标用户的位置显示相应的角度,以实现面部朝向目标用户来进行交互,还可以根据具有与应答语音信息相对应的动作状态。通过这种方式,可以提高仿真数字人的逼真程度,从而提升用户的人机交互体验。
在一些实施方式中,场景数据为实时采集的数据,若检测到相对位置改变,则根据改变后的相对位置生成新的目标仿真数字人图像;在显示屏上显示新的目标仿真数字人图像。也就是说,仿真数字人不仅与应答语音信息对应,也与目标用户的实时的相对位置对应。从而仿真数字人更加灵活和生动。具体地,请参见后续实施例。
需要说明的是,本实施例中未详细描述的部分可以参考前述实施例,在此不再赘述。
可以理解的是,步骤S410至S470可以由终端设备在本地进行,也可以在服务器中进行,还可以由终端设备与服务器分工进行,根据实际应用场景的不同,可以按照需求进行任务的分配,在此不做限定。
本实施例提供的仿真3D数字人交互方法,获取采集装置采集的场景数据;若根据场景数据 确定场景内存在目标用户,则处理场景数据以获取目标用户与显示屏的相对位置;若目标用户位于预设区域内,则获取交互信息;处理交互信息以获取应答语音信息;根据相对位置,在预设的多个参考参数中确定目标参考参数;将目标参考参数和应答语音信息输入预设的仿真数字人模型,得到输出的图像序列;根据图像序列生成并输出仿真数字人的视频,同步播放应答语音信息。这样,不仅会根据目标用户的位置显示面部朝向目标用户的仿真数字人,还可以根据使数字人具有与应答语音信息对应的动作状态,通过显示视频的同时播放应答语音信息,进一步地增加了仿真数字人的逼真度,优化了人机交互体验。
请参阅图6,图6为本申请一实施例提供的一种仿真3D数字人交互方法的流程示意图,应用于上述终端设备,该方法包括步骤S510至步骤S570。
步骤S510:获取采集装置采集的场景图像。
其中,采集装置是可以是普通的摄像头,也可以是获取空间深度信息的图像采集装置。例如,图像采集装置可以是双目摄像头、结构光摄像头、TOF摄像头等。
相应地,场景图像可以是当前场景下的普通的图像,也可以是包含深度信息和彩色信息的深度图像。
步骤S520:判断场景图像内是否存在目标用户。
可以通过分析场景图像以判断场景内是否存在目标用户。例如,可以采集检测算法识别场景图像中的人头信息,检测算法可以是YOLO(You Only Look Once)算法、RCNN、SSD(Single Shot MultiBox Detector)等可以识别图像中自然人的算法进行判断。在一个实施例中,也可以通过其他类型的场景数据来判断场景内是否存在目标用户。根据场景数据确定场景内存在目标用户的具体描述请参阅步骤S120,在此不再赘述。
步骤S530:若是,则识别场景图像以获取目标用户在相机坐标系中的三维坐标。
其中,相机坐标系以采集装置的位置为原点。根据不同采集装置得到的场景图像,可以采用不同的方式对场景图像进行处理,以识别场景图像中的目标用户,从而得到目标用户在相机坐标系中的三维坐标。当图像为普通摄像头采集得到的图像时,可以通过深度估计算法等获取图像中目标用户对应的深度信息,从而确定三维坐标。当图像为深度图像时,可以给深度信息计算目标用户在相机坐标系中的三维坐标。例如,当采集装置为双目摄像头时,可以采用双目测距以确定目标用户对应的三维坐标;当采集装置为结构光摄像头时,可以采用三角视差测距确定目标用户对应的三维坐标;当采集装置为TOF摄像头时,可以计算光脉冲从TOF摄像头的发射器到目标对象,再以像素格式返回到TOF摄像头的接收器的运行,从而确定目标用户对应的三维坐标。
In one embodiment, camera calibration may also be performed on the collection device in advance to obtain the camera extrinsic and intrinsic parameters of the collection device, and the target user's three-dimensional coordinates can be obtained accurately in combination with these camera parameters.
Step S540: acquire the positional relationship between the collection device and the display screen, and determine the transformation relationship between the camera coordinate system and a spatial coordinate system according to the positional relationship.
The spatial coordinate system takes the position of the display screen as its origin and can be used to represent position coordinates in the real world. The positional relationship between the collection device and the display screen may be obtained in advance, the coordinates of the image collection device in the spatial coordinate system are determined according to this positional relationship, and the transformation relationship between the camera coordinate system and the spatial coordinate system is thereby obtained.
The position of the display screen may be a position that does not change, such as the center point of the display screen or its frame, or it may be the display position of the simulated digital human image, which is not limited here. In some implementations, the display position of the simulated digital human image may change with the relative position of the target user and the display screen.
Step S550: determine the relative position of the target user and the display screen in the spatial coordinate system based on the transformation relationship and the three-dimensional coordinates.
The relative position includes at least one of a relative distance and a relative angle. Based on the transformation relationship and the three-dimensional coordinates, the relative position of the target user and the display screen can be determined in the spatial coordinate system. In this way, a fairly accurate first relative position of the target user with respect to the display screen can be obtained.
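For illustration, a minimal sketch of converting camera-frame coordinates into the screen-origin spatial frame and deriving the relative distance and angle follows; the rotation R and offset t describing the camera's pose relative to the screen are assumed values that would come from measuring the device layout.

    import numpy as np

    R = np.eye(3)                     # assumed camera-to-screen rotation
    t = np.array([0.0, -0.10, 0.02])  # assumed camera offset from the screen origin (m)

    def camera_to_screen(xyz_cam):
        """Apply the camera-to-screen transformation relationship."""
        return R @ xyz_cam + t

    xyz_screen = camera_to_screen(np.array([0.48, 0.05, 1.80]))
    distance = float(np.linalg.norm(xyz_screen))                             # relative distance
    angle_deg = float(np.degrees(np.arctan2(xyz_screen[0], xyz_screen[2])))  # relative angle

    in_preset_area = distance < 2.5   # the preset value is assumed to be 2.5 m here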
In some implementations, the target user's eyes may be recognized by a detection algorithm, and the position of the eyes is taken as the target user's three-dimensional coordinates in the camera coordinate system. The transformation relationship between the camera coordinate system and the spatial coordinate system is determined according to the collection device and the display position of the simulated digital human image on the display screen, where the display position may be the position of the simulated digital human's eyes. In this way, the relative position between the target user's eyes and the eyes of the simulated digital human on the display screen can be determined in the spatial coordinate system, so that, according to this relative position, a simulated digital human can be obtained that is not only angled toward the target user but whose eyes also look at the target user.
Step S560: if the target user is located in the preset area, acquire the target simulated digital human image corresponding to the relative position.
In some implementations, interaction information may also be acquired; the interaction information is processed to obtain response voice information; and the target reference parameter and the response voice information are input into the preset simulated digital human model to obtain an output image sequence composed of multiple consecutive frames of target simulated digital human images. For details, refer to the foregoing embodiments.
Step S570: display the target simulated digital human image on the display screen.
In some implementations, the scene data is data collected in real time. If a change in the relative position is detected, a new target simulated digital human image is generated according to the changed relative position and displayed on the display screen. For details, refer to the subsequent embodiments.
It should be noted that, for parts of this embodiment not described in detail, reference may be made to the foregoing embodiments, and details are not repeated here.
It can be understood that steps S510 to S560 may be performed locally by the terminal device, performed in a server, or divided between the terminal device and the server; tasks may be allocated as required for different practical application scenarios, which is not limited here.
In the simulated 3D digital human interaction method provided by this embodiment, a scene image collected by a collection device is acquired; whether a target user exists in the scene image is determined; if so, the scene image is recognized to obtain the target user's three-dimensional coordinates in the camera coordinate system, the positional relationship between the collection device and the display screen is acquired, and the transformation relationship between the camera coordinate system and the spatial coordinate system is determined according to the positional relationship; the relative position of the target user and the display screen is determined in the spatial coordinate system based on the transformation relationship and the three-dimensional coordinates; if the target user is located in the preset area, the target simulated digital human image corresponding to the relative position is acquired and displayed on the display screen. By obtaining the positional relationship between the collection device and the display screen, the position of the target user relative to the display screen can be determined fairly precisely, so that a simulated digital human whose face is accurately oriented toward the target user is obtained according to that position, giving the virtual image higher visual realism.
Referring to FIG. 7, FIG. 7 is a schematic flowchart of a simulated 3D digital human interaction method provided by an embodiment of the present application, applied to the above terminal device; the method includes steps S610 to S670.
Step S610: acquire scene data collected by a collection device.
Step S620: recognize head information in the scene image.
The scene data includes a scene image, and head information in the scene image may be recognized by a detection algorithm. The detection algorithm may be an algorithm capable of recognizing natural persons in images, such as the YOLO (You Only Look Once) algorithm, RCNN, or SSD (Single Shot MultiBox Detector).
In some implementations, whether a target user exists in the scene may also be determined from scene data obtained by a sensor, with the head information in the scene image recognized only after it is determined that a target user exists in the scene. Since judging whether a target user exists from a sensor requires little power while image recognition requires more, performing image recognition only when a target user exists in the scene can reduce the power consumed by image recognition.
Step S630: obtain the number of users in the scene image according to the head information.
By recognizing the head information in the scene image, the number of users in the scene image, that is, the number of users in the current scene, can be determined.
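A minimal sketch of this counting step follows, where detect_heads() is a hypothetical stand-in for a YOLO/RCNN/SSD head detector returning one bounding box per detected head:

    import numpy as np

    def detect_heads(scene_image):
        """Hypothetical detector: image -> list of (x, y, w, h) head boxes."""
        return [(100, 80, 40, 40)]  # stubbed single detection

    scene_image = np.zeros((480, 640, 3), dtype=np.uint8)
    heads = detect_heads(scene_image)
    num_users = len(heads)

    if num_users == 1:
        target_head = heads[0]  # step S640: the sole user becomes the target user
    elif num_users > 1:
        pass  # wait for an interaction instruction to pick the target, as described next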
In some implementations, if there are multiple users, whether an interaction instruction input by a user is acquired is monitored; if an interaction instruction input by a user is acquired, the user corresponding to the interaction instruction is taken as the target user. For example, when multiple target users are located in the preset area, the simulated digital human may keep a front-facing posture and may greet everyone or greet no one; when one target user speaks to the digital human, the digital human turns toward that target user to interact. When the user leaves the interaction area, a preset simulated digital human image in a to-be-awakened state may be displayed.
The interaction instruction may be preset multimodal information. Specifically, the interaction instruction may be multimodal information such as voice information, an action instruction, or a touch operation. The voice information may be voice information containing a preset keyword, and the user's interaction intent may be obtained by performing intent recognition on the voice information; the action instruction may be a preset action or gesture used for interaction, such as waving toward the screen. This embodiment does not limit this.
As one approach, sound information in the scene may be collected through a microphone, and human-voice detection is used to determine whether the sound information contains the user's voice information. In one embodiment, preset keywords may also be detected through an acoustic model to further determine whether an interaction instruction input by the user is acquired. When the interaction instruction is voice information, as one approach, the direction of the voice information's sound source may be determined by acoustic ranging or the like, and the user in that direction is taken as the target user; as another approach, the scene image may be processed to recognize the lip movements of multiple users, the user who input the interaction instruction is determined through lip-reading recognition, and that user is taken as the target user.
As another approach, action recognition may be performed on the scene image to determine whether there is an action instruction input by a user, and the user who input the action instruction is taken as the target user. For example, gesture recognition may be performed on the scene image to detect whether a user is waving toward the screen.
As yet another approach, a screen sensor may be used to detect whether a touch operation input by a user is acquired; if so, the user who input the touch operation is taken as the target user.
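The approaches above can be summarized as a dispatch over the instruction's modality. A minimal sketch follows, in which the candidate records and the locate_* helpers (sound-source localization, lip-movement recognition, gesture detection, touch mapping) are hypothetical stand-ins for the techniques just described:

    def locate_by_sound(direction_deg, candidates):
        """Pick the candidate whose bearing best matches the sound source."""
        return min(candidates, key=lambda c: abs(c["angle"] - direction_deg), default=None)

    def locate_by_lips(candidates):
        """Pick the candidate whose lips are moving (lip-reading recognition)."""
        return next((c for c in candidates if c.get("lips_moving")), None)

    def select_target(instruction, candidates):
        if instruction["type"] == "voice":
            return locate_by_sound(instruction["direction_deg"], candidates)
        if instruction["type"] == "gesture":  # e.g. waving toward the screen
            return next((c for c in candidates if c.get("waving")), None)
        if instruction["type"] == "touch":
            return next((c for c in candidates if c.get("touching_screen")), None)
        return locate_by_lips(candidates)

    users = [{"angle": -20.0, "lips_moving": True}, {"angle": 15.0}]
    target = select_target({"type": "voice", "direction_deg": -25.0}, users)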
In still other implementations, when there are multiple users, each user in the scene image may be taken as a target user, and multiple first relative positions between the multiple target users and the display screen are obtained. If the multiple target users are located in the preset area, multiple target virtual-image pictures corresponding to the multiple first relative positions are obtained based on a preset virtual-image model and displayed on the display screen, so as to interact with the target users respectively. Multiple target virtual images are thus displayed on the display screen to interact with the multiple target users respectively. In this way, each target user can interact face to face with a virtual image, improving the efficiency of interaction.
Step S640: if the number of users is one, take the recognized user as the target user.
When the number of users in the scene image is one, that is, when there is one user in the current scene, that user is taken as the target user.
Step S650: process the scene image to obtain the relative position of the target user and the display screen.
Step S660: if the target user is located in the preset area, acquire the target simulated digital human image corresponding to the relative position.
In some implementations, interaction information may also be acquired; the interaction information is processed to obtain response voice information; and the target reference parameter and the response voice information are input into the preset simulated digital human model to obtain an output image sequence composed of multiple consecutive frames of target simulated digital human images. For details, refer to the foregoing embodiments.
Step S670: display the target simulated digital human image on the display screen.
In some implementations, the scene data is data collected in real time. If a change in the relative position is detected, a new target simulated digital human image is generated according to the changed relative position and displayed on the display screen. For details, refer to the subsequent embodiments.
It should be noted that, for parts of this embodiment not described in detail, reference may be made to the foregoing embodiments, and details are not repeated here.
It can be understood that steps S610 to S660 may be performed locally by the terminal device, performed in a server, or divided between the terminal device and the server; tasks may be allocated as required for different practical application scenarios, which is not limited here.
In the simulated 3D digital human interaction method provided by this embodiment, scene data collected by a collection device is acquired; head information in the scene image is recognized; the number of users in the scene image is obtained according to the head information; if the number of users is one, the recognized user is taken as the target user; the scene image is processed to obtain the relative position of the target user and the display screen; if the target user is located in the preset area, the target simulated digital human image corresponding to the relative position is acquired and displayed on the display screen. By recognizing the scene image to determine the number of users in the scene and displaying a virtual image whose face is oriented toward the users in different ways according to the number of target users in the preset area, the modes of interaction are enriched and the efficiency of human-computer interaction is improved.
Referring to FIG. 8, FIG. 8 is a schematic flowchart of a simulated 3D digital human interaction method provided by an embodiment of the present application, applied to the above terminal device; the method includes steps S710 to S760.
Step S710: acquire scene data collected by a collection device.
The scene data is data collected in real time.
Step S720: if it is determined according to the scene data that a target user exists in the scene, process the scene data to obtain the relative position of the target user and the display screen.
In some implementations, the scene data may include a scene image; head information in the scene image may be recognized; the number of users in the scene image is obtained according to the head information; if the number of users is one, the recognized user is taken as the target user; and the scene image is processed to obtain the relative position of the target user and the display screen. In one embodiment, if there are multiple users, whether an interaction instruction input by a user is acquired is monitored; if an interaction instruction input by a user is acquired, the user corresponding to the interaction instruction is taken as the target user. For details, refer to the foregoing embodiments.
Step S730: if the target user is located in the preset area, acquire the target simulated digital human image corresponding to the relative position.
In some implementations, interaction information may also be acquired; the interaction information is processed to obtain response voice information; and the target reference parameter and the response voice information are input into the preset simulated digital human model to obtain an output image sequence composed of multiple consecutive frames of target simulated digital human images. For details, refer to the foregoing embodiments.
Step S740: display the target simulated digital human image on the display screen.
Step S750: if a change in the relative position is detected, generate a new target simulated digital human image according to the changed relative position.
After the target simulated digital human image is displayed on the display screen, the relative position between the target user and the display screen may be detected in real time; if a change in the relative position is detected, a new target simulated digital human image is generated according to the changed relative position. By detecting changes in the relative position, the corresponding target simulated digital human can be generated according to the user's real-time relative position, so that the simulated digital human's face is oriented toward the target user at every moment and the interaction is more natural and vivid.
In some implementations, if the change in the relative position within a preset time is smaller than a preset threshold, the target simulated digital human image displayed on the display screen is not updated. The preset threshold may be at least one of a displacement threshold and a rotation-angle threshold. Specifically, variation parameters of the target user whose position changed within the preset time, relative to the initial relative position, may be determined, the variation parameters including a displacement parameter and a rotation-angle parameter; if the variation parameters are smaller than the corresponding preset thresholds, no new target simulated digital human image is generated according to the changed relative position, and the image displayed on the display screen is not updated. A new target simulated digital human image is thus acquired only when the target user's position change within the preset time exceeds the preset threshold, so that when the user's pose changes little within the preset time, no new target simulated digital human needs to be determined. This both adjusts the displayed simulated digital human's orientation in real time according to changes in the target user's relative position for more natural interaction, and saves the computing power and energy of generating the simulated digital human in real time.
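A minimal sketch of this update gate follows, with the displacement and rotation thresholds as assumed illustrative values:

    import numpy as np

    DISPLACEMENT_THRESHOLD_M = 0.15
    ROTATION_THRESHOLD_DEG = 10.0

    def should_update(prev_pos, prev_angle_deg, cur_pos, cur_angle_deg):
        """Regenerate the image only if the user moved or turned enough
        within the preset time window."""
        displacement = float(np.linalg.norm(np.asarray(cur_pos) - np.asarray(prev_pos)))
        rotation = abs(cur_angle_deg - prev_angle_deg)
        return displacement >= DISPLACEMENT_THRESHOLD_M or rotation >= ROTATION_THRESHOLD_DEG

    if should_update([0.40, 0.0, 1.8], -12.0, [0.45, 0.0, 1.8], -13.0):
        pass  # generate a new target simulated digital human image
    else:
        pass  # keep the current image, saving compute and power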
In some implementations, an image sequence containing multiple target simulated digital human images may also be generated according to the target user's relative position before the change and the relative position after the change, that is, a time-ordered set of images transitioning from the previous target simulated digital human image to the target simulated digital human image corresponding to the changed relative position. A simulated digital human video may be generated from the image sequence and its timing to present a gradually changing, dynamic simulated digital human. For example, when the user's relative position to the display screen changes, the target user's viewing angle toward the display screen also changes, and the images of the simulated digital human seen by the target user switch accordingly; the digital human displayed on the screen then looks like video shot by a camera walking in a circle around a real model, presenting the visual effect of a three-dimensional real-person model.
In some implementations, when it is detected that the target user leaves the preset area, the new target simulated digital human image may be a preset simulated digital human image in a to-be-awakened state. The state of the terminal device may also be switched to the to-be-awakened state at the same time, reducing the power required for real-time interaction. In one embodiment, when it is detected that the target user leaves the preset area, a simulated digital human image with a preset action, such as waving goodbye, may also be used as the new target simulated digital human image.
Step S760: display the new target simulated digital human image on the display screen.
The new target simulated digital human image is displayed on the display screen. In some implementations, a digital human video generated from an image sequence containing multiple target simulated digital human images may also be displayed.
It can be understood that steps S710 to S750 may be performed locally by the terminal device, performed in a server, or divided between the terminal device and the server; tasks may be allocated as required for different practical application scenarios.
It should be noted that, for parts of this embodiment not described in detail, reference may be made to the foregoing embodiments, and details are not repeated here.
In the simulated 3D digital human interaction method provided by this embodiment, scene data collected by a collection device is acquired; if it is determined according to the scene data that a target user exists in the scene, the scene data is processed to obtain the relative position of the target user and the display screen; if the target user is located in the preset area, the target simulated digital human image corresponding to the relative position is acquired and displayed on the display screen; if a change in the relative position is detected, a new target simulated digital human image is generated according to the changed relative position and displayed on the display screen. By detecting the user's position in real time and updating the target simulated digital human image in real time according to the relative position of the user and the display screen, real-time face-to-face interaction between the target user and the simulated digital human is achieved.
It can be understood that the above examples are merely schematic illustrations of applying the method provided by the embodiments of the present application in one specific scenario and do not limit the embodiments of the present application. Many other applications can also be implemented based on the method provided by the embodiments of the present application.
Referring to FIG. 9, FIG. 9 shows a structural block diagram of a simulated 3D digital human interaction apparatus 800 provided by an embodiment of the present application. The block diagram shown in FIG. 9 is described below; the simulated 3D digital human interaction apparatus 800 includes a data collection module 810, a position acquisition module 820, an image acquisition module 830, and a display module 840, wherein:
the data collection module 810 is configured to acquire scene data collected by a collection device; the position acquisition module 820 is configured to, if it is determined according to the scene data that a target user exists in the scene, process the scene data to obtain the relative position of the target user and a display screen; the image acquisition module 830 is configured to, if the target user is located in a preset area, acquire a target simulated digital human image corresponding to the relative position, the target simulated digital human image including a simulated digital human whose face is oriented toward the target user, and the preset area being an area whose distance from the display screen is less than a preset value; and the display module 840 is configured to display the target simulated digital human image on the display screen.
Further, the preset simulated digital human model is a model trained in advance according to a plurality of sample images containing a real-person model and a reference parameter corresponding to each sample image, and the simulated digital human model is configured to output, according to an input reference parameter, a simulated digital human image corresponding to the sample image. The image acquisition module 830 includes a parameter determination submodule and a parameter input submodule, wherein the parameter determination submodule is configured to determine a target reference parameter among a plurality of preset reference parameters according to the relative position, the reference parameters representing the pose of the real-person model contained in the sample image relative to the image collection device that collected the sample image; and the parameter input submodule is configured to input the target reference parameter into the preset simulated digital human model and take the output simulated digital human image as the target simulated digital human image.
Further, the parameter determination submodule includes a first parameter determination unit and a second parameter determination unit, wherein the first parameter determination unit is configured to determine a user viewing-angle parameter according to the relative position, the user viewing-angle parameter representing the viewing angle of the target user toward a preset position of the display screen; and the second parameter determination unit is configured to determine, among the plurality of preset reference parameters, the target reference parameter according to the user viewing-angle parameter.
Further, the first parameter determination unit includes a position determination subunit and a viewing-angle parameter determination subunit, wherein the position determination subunit is configured to determine a target display position of the display screen according to the relative position, the target display position being the display position of the target simulated digital human image on the display screen; and the viewing-angle parameter determination subunit is configured to determine the user viewing-angle parameter according to the relative position and the target display position.
Further, the simulated 3D digital human interaction apparatus 800 further includes an interaction information acquisition module and a voice information acquisition module. The interaction information acquisition module is configured to acquire interaction information, and the voice information acquisition module is configured to process the interaction information to obtain response voice information. The parameter input submodule includes an image sequence acquisition unit configured to input the target reference parameter and the response voice information into the preset simulated digital human model to obtain an output image sequence composed of multiple consecutive frames of the target simulated digital human image. The display module 840 includes a video output unit configured to generate and output a video of the simulated digital human according to the image sequence and synchronously play the response voice information.
Further, the simulated digital human model includes a feature generation model and an image generation model, and the image sequence acquisition unit includes an initial feature parameter acquisition subunit, a parameter sequence acquisition subunit, and an image sequence acquisition subunit, wherein the initial feature parameter acquisition subunit is configured to input the target reference parameter into the feature generation model to obtain initial feature parameters, the initial feature parameters representing the form of the real-person model corresponding to the sample image; the parameter sequence acquisition subunit is configured to adjust, according to the response voice information, at least one of the expression parameter, action parameter, and mouth-shape parameter of the initial feature parameters to obtain a parameter sequence including a plurality of target feature parameters; and the image sequence acquisition subunit is configured to acquire, based on the image generation model, the target simulated digital human image corresponding to each target feature parameter, so as to obtain the image sequence corresponding to the parameter sequence.
Further, the orientation angle of the simulated digital human in the target simulated digital human image is the same as the orientation angle of the real-person model in the sample image corresponding to the target reference parameter.
Further, the physical appearance features of the simulated digital human in the target simulated digital human image are the same as those of the real-person model in the sample image corresponding to the target reference parameter.
Further, the position acquisition module 820 includes a judgment submodule, a coordinate acquisition submodule, a transformation relationship determination submodule, and a position determination submodule. The judgment submodule is configured to determine whether the target user exists in the scene image; the coordinate acquisition submodule is configured to, if so, recognize the scene image to obtain the target user's three-dimensional coordinates in a camera coordinate system, the camera coordinate system taking the position of the collection device as its origin; the transformation relationship determination submodule is configured to acquire the positional relationship between the collection device and the display screen and determine the transformation relationship between the camera coordinate system and a spatial coordinate system according to the positional relationship, the spatial coordinate system taking the position of the display screen as its origin; and the position determination submodule is configured to determine, based on the transformation relationship and the three-dimensional coordinates, the relative position of the target user and the display screen in the spatial coordinate system, the relative position including at least one of a relative distance and a relative angle.
Further, the position acquisition module 820 further includes an image recognition submodule, a user count acquisition submodule, and a first processing submodule, wherein the image recognition submodule is configured to recognize head information in the scene image; the user count acquisition submodule is configured to obtain the number of users in the scene image according to the head information; and the first processing submodule is configured to, if the number of users is one, take the recognized user as the target user.
Further, the simulated 3D digital human interaction apparatus 800 further includes an instruction monitoring submodule and a second processing submodule. The instruction monitoring submodule is configured to, if there are multiple users, monitor whether an interaction instruction input by a user is acquired; the second processing submodule is configured to, if an interaction instruction input by a user is acquired, take the user corresponding to the interaction instruction as the target user.
Further, the scene data is data collected in real time. After the target simulated digital human image is displayed on the display screen, the simulated 3D digital human interaction apparatus 800 further includes a position detection module and a display update module. The position detection module is configured to, if a change in the relative position is detected, generate a new target simulated digital human image according to the changed relative position; the display update module is configured to display the new target simulated digital human image on the display screen.
Referring to FIG. 10, which shows a structural block diagram of an electronic device provided by an embodiment of the present application. The electronic device 900 may be an electronic device capable of running application computer-readable instructions, such as a smartphone, a tablet computer, or an e-book reader. The electronic device 900 in the present application may include one or more of the following components: a processor 910, a memory 920, and one or more application computer-readable instructions, where the one or more application computer-readable instructions may be stored in the memory 920 and configured to be executed by the one or more processors 910, and the one or more computer-readable instructions are configured to perform the methods described in the foregoing method embodiments.
The processor 910 may include one or more processing cores. The processor 910 connects all parts of the entire electronic device 900 through various interfaces and lines, and performs the various functions of the electronic device 900 and processes data by running or executing instructions, computer-readable instructions, code sets, or instruction sets stored in the memory 920 and invoking data stored in the memory 920. In one embodiment, the processor 910 may be implemented in at least one of the hardware forms of digital signal processing (DSP), field-programmable gate array (FPGA), and programmable logic array (PLA). The processor 910 may integrate one or a combination of a central processing unit (CPU), a graphics processing unit (GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application computer-readable instructions, and so on; the GPU is responsible for rendering and drawing display content; and the modem handles wireless communication. It can be understood that the modem may also not be integrated into the processor 910 and may instead be implemented separately by a communication chip.
The memory 920 may include random access memory (RAM) or read-only memory (ROM). The memory 920 may be used to store computer-readable instructions, code, code sets, or instruction sets. The memory 920 may include an instruction storage area and a data storage area. The instruction storage area may store instructions for implementing the operating system, instructions for implementing at least one function (such as a touch function, a sound playback function, or an image playback function), instructions for implementing the foregoing method embodiments, and the like. The data storage area may also store data created by the electronic device 900 during use, and so on.
Referring to FIG. 11, which shows a structural block diagram of one or more computer-readable storage media provided by an embodiment of the present application. The computer-readable storage medium 1000 stores computer-readable instructions that can be invoked by a processor to perform the methods described in the foregoing method embodiments.
The computer-readable storage medium 1000 may be an electronic memory such as flash memory, electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), a hard disk, or ROM. In one embodiment, the computer-readable storage medium 1000 includes a non-transitory computer-readable storage medium. The computer-readable storage medium 1000 has storage space for computer-readable instructions 1010 that perform any of the method steps in the above methods. These computer-readable instructions may be read from, or written into, one or more computer-readable instruction products. The computer-readable instructions 1010 may, for example, be compressed in an appropriate form.
Finally, it should be noted that the above embodiments are merely intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions recorded in the foregoing embodiments or make equivalent substitutions for some of the technical features therein, and such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (20)

  1. A simulated 3D digital human interaction method, characterized in that it is executed by an electronic device and comprises:
    acquiring scene data collected by a collection device;
    if it is determined according to the scene data that a target user exists in the scene, processing the scene data to obtain a relative position of the target user and a display screen;
    if the target user is located in a preset area, acquiring a target simulated digital human image corresponding to the relative position, the target simulated digital human image comprising a simulated digital human whose face is oriented toward the target user, and the preset area being an area whose distance from the display screen is less than a preset value;
    displaying the target simulated digital human image on the display screen.
  2. The method according to claim 1, characterized in that the target simulated digital human image is acquired based on a preset simulated digital human model; the preset simulated digital human model is a model trained in advance according to a plurality of sample images containing a real-person model and a reference parameter corresponding to each of the sample images, and the simulated digital human model is configured to output, according to an input reference parameter, a simulated digital human image corresponding to the sample image; the acquiring a target simulated digital human image corresponding to the relative position comprises:
    determining a target reference parameter among a plurality of preset reference parameters according to the relative position, the reference parameters representing the pose of the real-person model contained in the sample image relative to the image collection device that collected the sample image;
    inputting the target reference parameter into the preset simulated digital human model, and taking the output simulated digital human image as the target simulated digital human image.
  3. The method according to claim 2, characterized in that the determining a target reference parameter among a plurality of preset reference parameters according to the relative position comprises:
    determining a user viewing-angle parameter according to the relative position, the user viewing-angle parameter representing the viewing angle of the target user toward a preset position of the display screen;
    determining, among the plurality of preset reference parameters, the target reference parameter according to the user viewing-angle parameter.
  4. The method according to claim 3, characterized in that the determining a user viewing-angle parameter according to the relative position comprises:
    determining a target display position of the display screen according to the relative position, the target display position being the display position of the target simulated digital human image on the display screen;
    determining the user viewing-angle parameter according to the relative position and the target display position.
  5. The method according to claim 2, characterized in that, before the acquiring a target simulated digital human image corresponding to the relative position, the method further comprises:
    acquiring interaction information;
    processing the interaction information to obtain response voice information;
    the inputting the target reference parameter into the preset simulated digital human model and taking the output simulated digital human image as the target simulated digital human image comprises:
    inputting the target reference parameter and the response voice information into the preset simulated digital human model to obtain an output image sequence, the image sequence being composed of multiple consecutive frames of the target simulated digital human image;
    the displaying the target simulated digital human image on the display screen comprises:
    generating and outputting a video of the simulated digital human according to the image sequence, and synchronously playing the response voice information.
  6. The method according to claim 5, characterized in that the simulated digital human model comprises a feature generation model and an image generation model, and the inputting the target reference parameter and the response voice information into the preset simulated digital human model to obtain an output image sequence comprises:
    inputting the target reference parameter into the feature generation model to obtain initial feature parameters, the initial feature parameters representing the form of the real-person model corresponding to the sample image;
    adjusting, according to the response voice information, at least one of an expression parameter, an action parameter, and a mouth-shape parameter of the initial feature parameters to obtain a parameter sequence, the parameter sequence comprising a plurality of target feature parameters;
    acquiring, based on the image generation model, the target simulated digital human image corresponding to each of the target feature parameters, so as to obtain the image sequence corresponding to the parameter sequence.
  7. The method according to any one of claims 2-6, characterized in that the orientation angle of the simulated digital human in the target simulated digital human image is the same as the orientation angle of the real-person model in the sample image corresponding to the target reference parameter.
  8. The method according to claim 7, characterized in that the physical appearance features of the simulated digital human in the target simulated digital human image are the same as the physical appearance features of the real-person model in the sample image corresponding to the target reference parameter.
  9. The method according to any one of claims 1-6, characterized in that the scene data comprises a scene image, and the, if it is determined according to the scene data that a target user exists in the scene, processing the scene data to obtain a relative position of the target user and a display screen comprises:
    determining whether the target user exists in the scene image;
    if so, recognizing the scene image to obtain three-dimensional coordinates of the target user in a camera coordinate system, the camera coordinate system taking the position of the collection device as its origin;
    acquiring the positional relationship between the collection device and the display screen, and determining a transformation relationship between the camera coordinate system and a spatial coordinate system according to the positional relationship, the spatial coordinate system taking the position of the display screen as its origin;
    determining, based on the transformation relationship and the three-dimensional coordinates, the relative position of the target user and the display screen in the spatial coordinate system.
  10. The method according to claim 1, characterized in that the scene data comprises a scene image, and the, if it is determined according to the scene data that a target user exists in the scene, processing the scene data to obtain a relative position of the target user and a display screen comprises:
    recognizing head information in the scene image;
    obtaining the number of users in the scene image according to the head information;
    if the number of users is one, taking the recognized user as the target user;
    processing the scene image to obtain the relative position of the target user and the display screen.
  11. The method according to claim 10, characterized in that the method further comprises:
    if the number of users is more than one, monitoring whether an input interaction instruction is acquired;
    if the interaction instruction is acquired, taking the user who input the interaction instruction as the target user.
  12. The method according to claim 1, characterized in that the scene data is data collected in real time, and after the displaying the target simulated digital human image on the display screen, the method further comprises:
    if a change in the relative position is detected, generating a new target simulated digital human image according to the changed relative position;
    displaying the new target simulated digital human image on the display screen.
  13. A simulated 3D digital human interaction apparatus, characterized in that it is provided in an electronic device and comprises:
    a data collection module, configured to acquire scene data collected by a collection device;
    a position acquisition module, configured to, if it is determined according to the scene data that a target user exists in the scene, process the scene data to obtain a relative position of the target user and a display screen;
    an image acquisition module, configured to, if the target user is located in a preset area, acquire a target simulated digital human image corresponding to the relative position, the target simulated digital human image comprising a simulated digital human whose face is oriented toward the target user, and the preset area being an area whose distance from the display screen is less than a preset value;
    a display module, configured to display the target simulated digital human image on the display screen.
  14. The apparatus according to claim 13, characterized in that the target simulated digital human image is acquired based on a preset simulated digital human model; the preset simulated digital human model is a model trained in advance according to a plurality of sample images containing a real-person model and a reference parameter corresponding to each of the sample images; the image acquisition module comprises a parameter determination submodule and a parameter input submodule, wherein:
    the parameter determination submodule is configured to determine a target reference parameter among a plurality of preset reference parameters according to the relative position, the reference parameters representing the pose of the real-person model contained in the sample image relative to the image collection device that collected the sample image;
    the parameter input submodule is configured to input the target reference parameter into the preset simulated digital human model and take the output simulated digital human image as the target simulated digital human image.
  15. The apparatus according to claim 14, characterized in that the parameter determination submodule comprises a first parameter determination unit and a second parameter determination unit;
    the first parameter determination unit is configured to determine a user viewing-angle parameter according to the relative position, the user viewing-angle parameter representing the viewing angle of the target user toward a preset position of the display screen;
    the second parameter determination unit is configured to determine, among the plurality of preset reference parameters, the target reference parameter according to the user viewing-angle parameter.
  16. The apparatus according to claim 15, characterized in that the first parameter determination unit comprises a position determination subunit and a viewing-angle parameter determination subunit;
    wherein the position determination subunit is configured to determine a target display position of the display screen according to the relative position, the target display position being the display position of the target simulated digital human image on the display screen;
    the viewing-angle parameter determination subunit is configured to determine the user viewing-angle parameter according to the relative position and the target display position.
  17. The apparatus according to claim 14, characterized in that the apparatus further comprises:
    an interaction information acquisition module, configured to acquire interaction information;
    a voice information acquisition module, configured to process the interaction information to obtain response voice information;
    the parameter input submodule comprises an image sequence acquisition unit, configured to input the target reference parameter and the response voice information into the preset simulated digital human model to obtain an output image sequence, the image sequence being composed of multiple consecutive frames of the target simulated digital human image;
    the display module comprises a video output unit, configured to generate and output a video of the simulated digital human according to the image sequence and synchronously play the response voice information.
  18. The apparatus according to claim 13, characterized in that the apparatus further comprises:
    a position detection module, configured to, if a change in the relative position is detected, generate a new target simulated digital human image according to the changed relative position;
    a display update module, configured to display the new target simulated digital human image on the display screen.
  19. An electronic device, characterized by comprising:
    one or more processors;
    a memory;
    one or more computer-readable instructions, wherein the one or more computer-readable instructions are stored in the memory and configured to be executed by the one or more processors, the one or more computer-readable instructions being configured to perform the simulated 3D digital human interaction method according to any one of claims 1-12.
  20. One or more computer-readable storage media, characterized in that computer-readable instructions are stored in the computer-readable storage media, and the computer-readable instructions can be invoked by a processor to perform the simulated 3D digital human interaction method according to any one of claims 1-12.
PCT/CN2021/123815 2021-01-07 2021-10-14 仿真3d数字人交互方法、装置、电子设备及存储介质 WO2022148083A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110019675.0 2021-01-07
CN202110019675.0A CN112379812B (zh) 2021-01-07 2021-01-07 仿真3d数字人交互方法、装置、电子设备及存储介质

Publications (1)

Publication Number Publication Date
WO2022148083A1 true WO2022148083A1 (zh) 2022-07-14

Family

ID=74590186

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/123815 WO2022148083A1 (zh) 2021-01-07 2021-10-14 仿真3d数字人交互方法、装置、电子设备及存储介质

Country Status (2)

Country Link
CN (1) CN112379812B (zh)
WO (1) WO2022148083A1 (zh)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112379812B (zh) * 2021-01-07 2021-04-23 深圳追一科技有限公司 仿真3d数字人交互方法、装置、电子设备及存储介质
CN112927260B (zh) * 2021-02-26 2024-04-16 商汤集团有限公司 一种位姿生成方法、装置、计算机设备和存储介质
CN112669846A (zh) * 2021-03-16 2021-04-16 深圳追一科技有限公司 交互***、方法、装置、电子设备及存储介质
CN113031768A (zh) * 2021-03-16 2021-06-25 深圳追一科技有限公司 客服服务方法、装置、电子设备及存储介质
CN113050791A (zh) * 2021-03-16 2021-06-29 深圳追一科技有限公司 交互方法、装置、电子设备及存储介质
CN112800206B (zh) * 2021-03-24 2021-08-24 南京万得资讯科技有限公司 一种基于生成式多轮对话意图识别的骚扰电话屏蔽方法
CN113485633B (zh) * 2021-07-30 2024-02-02 京东方智慧物联科技有限公司 一种内容展示方法、装置、电子设备和非瞬态计算机可读存储介质
CN114115527B (zh) * 2021-10-29 2022-11-29 北京百度网讯科技有限公司 增强现实ar信息显示方法、装置、***及存储介质
CN114356092B (zh) * 2022-01-05 2022-09-09 花脸数字技术(杭州)有限公司 一种基于多模态的数字人信息处理用人机交互***
CN116796478B (zh) * 2023-06-09 2023-12-26 南通大学 天线阵列可视区域的数据展示方法、装置
CN117115321B (zh) * 2023-10-23 2024-02-06 腾讯科技(深圳)有限公司 虚拟人物眼睛姿态的调整方法、装置、设备及存储介质

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9135392B2 (en) * 2012-01-31 2015-09-15 Siemens Product Lifecycle Management Software Inc. Semi-autonomous digital human posturing
US10860752B2 (en) * 2015-08-25 2020-12-08 Dassault Systémes Americas Corp. Method and system for vision measure for digital human models
CN111443853B (zh) * 2020-03-25 2021-07-20 北京百度网讯科技有限公司 数字人的控制方法及装置
CN111443854B (zh) * 2020-03-25 2022-01-18 北京百度网讯科技有限公司 基于数字人的动作处理方法、装置、设备及存储介质
CN111309153B (zh) * 2020-03-25 2024-04-09 北京百度网讯科技有限公司 人机交互的控制方法和装置、电子设备和存储介质
CN111736699A (zh) * 2020-06-23 2020-10-02 上海商汤临港智能科技有限公司 基于车载数字人的交互方法及装置、存储介质

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018206063A (ja) * 2017-06-05 2018-12-27 株式会社東海理化電機製作所 画像認識装置及び画像認識方法
CN111290682A (zh) * 2018-12-06 2020-06-16 阿里巴巴集团控股有限公司 交互方法、装置及计算机设备
CN111880659A (zh) * 2020-07-31 2020-11-03 北京市商汤科技开发有限公司 虚拟人物控制方法及装置、设备、计算机可读存储介质
CN112379812A (zh) * 2021-01-07 2021-02-19 深圳追一科技有限公司 仿真3d数字人交互方法、装置、电子设备及存储介质

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115953521A (zh) * 2023-03-14 2023-04-11 世优(北京)科技有限公司 远程数字人渲染方法、装置及***
CN116563432A (zh) * 2023-05-15 2023-08-08 摩尔线程智能科技(北京)有限责任公司 三维数字人生成方法及装置、电子设备和存储介质
CN116563432B (zh) * 2023-05-15 2024-02-06 摩尔线程智能科技(北京)有限责任公司 三维数字人生成方法及装置、电子设备和存储介质
CN117473880A (zh) * 2023-12-27 2024-01-30 中国科学技术大学 样本数据生成方法及无线跌倒检测方法
CN117473880B (zh) * 2023-12-27 2024-04-05 中国科学技术大学 样本数据生成方法及无线跌倒检测方法

Also Published As

Publication number Publication date
CN112379812A (zh) 2021-02-19
CN112379812B (zh) 2021-04-23

Similar Documents

Publication Publication Date Title
WO2022148083A1 (zh) 仿真3d数字人交互方法、装置、电子设备及存储介质
WO2021043053A1 (zh) 一种基于人工智能的动画形象驱动方法和相关装置
US11595617B2 (en) Communication using interactive avatars
US9479736B1 (en) Rendered audiovisual communication
CN107431635B (zh) 化身面部表情和/或语音驱动的动画化
CN110286756A (zh) 视频处理方法、装置、***、终端设备及存储介质
KR102491140B1 (ko) 가상 아바타 생성 방법 및 장치
CN115909015B (zh) 一种可形变神经辐射场网络的构建方法和装置
US20230047858A1 (en) Method, apparatus, electronic device, computer-readable storage medium, and computer program product for video communication
CN110737335B (zh) 机器人的交互方法、装置、电子设备及存储介质
CN110794964A (zh) 虚拟机器人的交互方法、装置、电子设备及存储介质
CN113436602A (zh) 虚拟形象语音交互方法、装置、投影设备和计算机介质
CN117036583A (zh) 视频生成方法、装置、存储介质及计算机设备
CN116095353A (zh) 基于体积视频的直播方法、装置、电子设备及存储介质
CN112435316B (zh) 一种游戏中的防穿模方法、装置、电子设备及存储介质
CN117370605A (zh) 一种虚拟数字人驱动方法、装置、设备和介质
CN114445529A (zh) 一种基于动作及语音特征的人脸图像动画方法和***
CN112767520A (zh) 数字人生成方法、装置、电子设备及存储介质
CN114979789A (zh) 一种视频展示方法、装置以及可读存储介质
JP7184835B2 (ja) コンピュータプログラム、方法及びサーバ装置
US20220156986A1 (en) Scene interaction method and apparatus, electronic device, and computer storage medium
CN116841391A (zh) 数字人的交互控制方法、装置、电子设备和存储介质
TWI583198B (zh) 使用互動化身的通訊技術
CN116761005A (zh) 智能麦克风、虚拟主播直播方法及相关装置
KR20240095969A (ko) 라이다를 이용하는 미디어 파사드 장치

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21917125

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 16.11.2023)