WO2021083125A1 - Call control method and related products - Google Patents

Call control method and related products

Info

Publication number
WO2021083125A1
WO2021083125A1 (application PCT/CN2020/123910)
Authority
WO
WIPO (PCT)
Prior art keywords
user
model
dimensional face
face
dimensional
Prior art date
Application number
PCT/CN2020/123910
Other languages
English (en)
French (fr)
Inventor
王多民
Original Assignee
Oppo广东移动通信有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Oppo广东移动通信有限公司 filed Critical Oppo广东移动通信有限公司
Priority to EP20881081.2A priority Critical patent/EP4054161A1/en
Publication of WO2021083125A1 publication Critical patent/WO2021083125A1/zh
Priority to US17/733,539 priority patent/US20220263934A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/06 Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L 21/10 Transforming into visible information
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 Animation
    • G06T 13/20 3D [Three Dimensional] animation
    • G06T 13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 19/00 Manipulating 3D models or images for computer graphics
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 Feature extraction; Face representation
    • G06V 40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/174 Facial expression recognition
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification techniques
    • G10L 17/04 Training, enrolment or model building
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/06 Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L 21/10 Transforming into visible information
    • G10L 21/12 Transforming into visible information by displaying time domain information
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M 1/00 Substation equipment, e.g. for use by subscribers
    • H04M 1/57 Arrangements for indicating or recording the number of the calling subscriber at the called subscriber's set
    • H04M 1/575 Means for retrieving and displaying personal data about calling party
    • H04M 1/576 Means for retrieving and displaying personal data about calling party associated with a pictorial or graphical representation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M 1/00 Substation equipment, e.g. for use by subscribers
    • H04M 1/72 Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M 1/724 User interfaces specially adapted for cordless or mobile telephones
    • H04M 1/72403 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality
    • H04M 1/72427 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality for supporting games or graphical animations
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M 1/00 Substation equipment, e.g. for use by subscribers
    • H04M 1/72 Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M 1/724 User interfaces specially adapted for cordless or mobile telephones
    • H04M 1/72448 User interfaces specially adapted for cordless or mobile telephones with means for adapting the functionality of the device according to specific conditions
    • H04M 1/72454 User interfaces specially adapted for cordless or mobile telephones with means for adapting the functionality of the device according to specific conditions according to context-related or environment-related conditions
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M 1/00 Substation equipment, e.g. for use by subscribers
    • H04M 1/72 Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M 1/724 User interfaces specially adapted for cordless or mobile telephones
    • H04M 1/72484 User interfaces specially adapted for cordless or mobile telephones wherein functions are triggered by incoming communication events
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/06 Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L 21/10 Transforming into visible information
    • G10L 2021/105 Synthesis of the lips movements from speech, e.g. for talking heads
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M 2201/00 Electronic components, circuits, software, systems or apparatus used in telephone systems
    • H04M 2201/41 Electronic components, circuits, software, systems or apparatus used in telephone systems using speaker recognition

Definitions

  • This application relates to the field of network and computer technology, and in particular to a call control method and related products.
  • In existing solutions, when a call comes in the terminal only detects the incoming number information, or looks up a picture of the contact associated with that number for display; it merely connects the two ends so that the users exchange voice information.
  • the embodiments of the present application provide a call control method and related products, in order to improve the intelligence and functionality of the call application of the first terminal.
  • an embodiment of the present application provides a call control method, which is applied to a first terminal, and the method includes:
  • displaying a three-dimensional face model of the second user during a voice call between a first user of the first terminal and a second user of a second terminal; determining model driving parameters according to the call voice of the second user, where the model driving parameters include expression parameters and posture parameters; and driving the three-dimensional face model of the second user according to the model driving parameters to display a three-dimensional simulated call animation of the second user.
  • an embodiment of the present application provides a call control device, which is applied to a first terminal, and the device includes a processing unit and a communication unit, where: the processing unit is configured to display a three-dimensional face model of the second user, through the communication unit, during a voice call between the first user of the first terminal and the second user of the second terminal; to determine model driving parameters according to the call voice of the second user, the model driving parameters including expression parameters and posture parameters; and to drive the three-dimensional face model of the second user according to the model driving parameters to display a three-dimensional simulated call animation of the second user, where the three-dimensional simulated call animation presents expression animation information corresponding to the expression parameters and posture animation information corresponding to the posture parameters.
  • an embodiment of the present application provides a first terminal, which includes a processor and a memory, where the memory is used to store one or more programs configured to be executed by the processor, and the one or more programs include instructions for performing the steps of the method described in the first aspect above.
  • an embodiment of the present application provides a chip that includes a processor and a data interface.
  • the processor reads instructions stored in a memory through the data interface, and executes the steps of the method described in the first aspect.
  • an embodiment of the present application provides a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program for electronic data exchange, and the computer program enables a computer to execute some or all of the steps described in the first aspect.
  • the embodiments of the present application provide a computer program product, wherein the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to execute some or all of the steps described in the embodiments of this application.
  • the computer program product may be a software installation package.
  • It can be seen that, when the first user of the first terminal is in a voice call with the second user of the second terminal, the first terminal can display the three-dimensional face model of the second user, determine the model driving parameters from the second user's call voice, and drive the three-dimensional face model of the second user according to the model driving parameters to display the three-dimensional simulated call animation of the second user. Compared with existing solutions, this application presents the calling party's information more comprehensively, including facial expression and head posture information, thereby helping to improve the intelligence and functionality of the call application of the first terminal.
  • Fig. 1 is a schematic structural diagram of a call control system provided by an embodiment of the present application
  • FIG. 2A is a schematic flowchart of a call control method provided by an embodiment of the present application.
  • FIG. 2B is a schematic diagram of a three-dimensional face display interface provided by an embodiment of the present application.
  • FIG. 2C is a schematic diagram of a general three-dimensional face standard model provided by an embodiment of the present application.
  • FIG. 2D is a schematic flowchart of calculating the loss value based on the training data of the parameter extraction model provided by an embodiment of the present application.
  • FIG. 2E is a schematic diagram of a three-dimensional call mode selection interface provided by an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of another call control method provided by an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a first terminal provided by an embodiment of the present application.
  • In general, a terminal can detect the incoming number information when a call comes in, or look up the contact's picture for display, so the user can view basic information about the incoming call; however, during the call it cannot capture the other party's expression, posture and other information. Its functionality is limited and the call process provides little visual information.
  • Please refer to the schematic structural diagram of a call control system shown in FIG. 1, which includes an Internet network, a first terminal, and a second terminal. In other possible embodiments, the system may also include a third terminal, a fourth terminal, or other multiple terminal devices, which is suitable for application scenarios of multi-party calls.
  • the aforementioned terminals include, but are not limited to, devices with communication functions, smart phones, tablet computers, notebook computers, desktop computers, portable digital players, smart bracelets, and smart watches.
  • FIG. 2A is a schematic flowchart of a call control method provided by an embodiment of the present application. As shown in the figure, it includes the following steps:
  • 201 During a voice call between a first user of the first terminal and a second user of the second terminal, display a three-dimensional face model of the second user;
  • one of the users can display the other party's three-dimensional face model. And it is not only suitable for two-party calls, but also for multi-party calls.
  • the end user of one party in the call can display the three-dimensional face models of the other multi-party users.
  • 202 Determine model driving parameters according to the call voice of the second user, where the model driving parameters include expression parameters and posture parameters.
  • a terminal user of one party may obtain the call voice of the peer user, and process it to generate model driving parameters, where the model driving parameters include posture parameters and expression parameters.
  • the model driving parameters are used to drive the three-dimensional face model of the peer user, that is, the aforementioned three-dimensional face model of the second user.
  • the three-dimensional face model of the second user will change dynamically as the model driving parameters change. Different expression parameters correspond to different expressions, such as smiling, laughing, anger, sadness, etc., and different posture parameters generate different postures, so that the continuously changing three-dimensional face model of the second user presents the effect of a three-dimensional animation.
  • It can be seen that, when the first user of the first terminal is in a voice call with the second user of the second terminal, the first terminal can display the three-dimensional face model of the second user, determine the model driving parameters from the second user's call voice, and drive the three-dimensional face model of the second user according to the model driving parameters to display the three-dimensional simulated call animation of the second user. Compared with existing solutions, the present application presents the calling party's information more comprehensively, including facial expression and head posture information, thereby helping to improve the intelligence and functionality of the call application of the first terminal.
  • the displaying the three-dimensional face model of the second user includes: displaying the three-dimensional face model of the second user in a call application interface of the first terminal.
  • the three-dimensional face model of the second user can be directly displayed in the call application interface of the first terminal, and the three-dimensional face model of the second user can be seen intuitively .
  • the displaying the three-dimensional face model of the second user includes: split-screen displaying the call application interface of the first terminal and the three-dimensional face model of the second user.
  • Specifically, the call application interface of the first terminal and the three-dimensional face model of the second user may be displayed simultaneously on separate panes of the split-screen interface; or, the three-dimensional face model of the second user may be displayed on the call application interface of the first terminal, and other applications may be displayed on the other pane of the split screen.
  • the displaying the three-dimensional face model of the second user includes: displaying the three-dimensional face model of the second user on a third terminal connected to the first terminal.
  • the three-dimensional face model of the second user may be displayed on a third terminal connected to the first terminal.
  • the connection mode may be any one or more of a wireless high-fidelity (WiFi) connection, a Bluetooth connection, a mobile data connection, and a hotspot connection.
  • The connected terminals may include a third terminal, a fourth terminal, or other multiple terminals, and these terminals can be connected to the first terminal through any one or more of a wireless high-fidelity (WiFi) connection, a Bluetooth connection, a mobile data connection, and a hotspot connection.
  • the driving the three-dimensional face model of the second user according to the model driving parameters to display the three-dimensional simulated call animation of the second user includes:
  • detecting the call voice of the second user; processing the call voice to obtain the spectrogram of the second user; and inputting the spectrogram of the second user into a driving parameter generation model to generate the model driving parameters,
  • the model driving parameters include expression parameters, and/or posture parameters.
  • Specifically, converting the call voice into the spectrogram of the second user may include the following step: a fast Fourier transform is performed on the call voice, transforming the speech signal from the time domain to the frequency domain to obtain the spectrogram of the second user, as illustrated in the sketch below.
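  • The following is a minimal, illustrative sketch of this spectrogram step, assuming 16 kHz mono audio and a short-time Fourier transform; the frame length, hop size and log compression are assumptions made for illustration, not values fixed by this application.

      import numpy as np

      def spectrogram(voice, frame_len=512, hop=160):
          """Transform a speech signal from the time domain to the frequency
          domain, returning a (frames, frequency_bins) log-magnitude spectrogram."""
          window = np.hanning(frame_len)
          n_frames = 1 + max(0, (len(voice) - frame_len) // hop)
          frames = np.stack([voice[i * hop:i * hop + frame_len] * window
                             for i in range(n_frames)])
          spectrum = np.fft.rfft(frames, axis=1)   # fast Fourier transform per frame
          return np.log1p(np.abs(spectrum))        # log-magnitude spectrogram

      # One second of 16 kHz audio yields a (97, 257) spectrogram with these settings.
      print(spectrogram(np.zeros(16000)).shape)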
  • the model driving parameters include expression parameters and posture parameters. There are three posture parameters, which respectively represent the scaling parameter, the rotation matrix and the translation matrix. Using the expression parameters, the posture parameters, or both, makes the three-dimensional face model of the second user present changes in expression, changes in posture, or changes in both expression and posture.
  • In a possible example, the process of acquiring the training data of the driving parameter generation model includes the following steps: collecting M pieces of audio and obtaining M spectrograms based on the M pieces of audio, where the M pieces of audio are recordings of each of M collection objects reading multiple text libraries aloud in a preset manner; while collecting the M pieces of audio, collecting the three-dimensional face data of each of the M collection objects at a preset frequency to obtain M sets of three-dimensional face data; using the general three-dimensional face standard model as a template to align the M sets of three-dimensional face data, obtaining M sets of aligned three-dimensional face data that have the same vertices and topological structure as the general three-dimensional face standard model; and performing time alignment between the M pieces of audio and the M sets of aligned three-dimensional face data, so that each set of three-dimensional face data corresponds one-to-one, in temporal order, to its piece of audio.
  • Take M = 10 as an example, that is, the collection objects are 10 people (M can also be 50 or 30; the number of people can be adjusted appropriately according to time and resource cost, but should not be fewer than 10).
  • The ten collection objects can cover different genders, age groups, nationalities, skin colors, face shapes, etc.
  • the high-precision dynamic head and face scanning system can be used to collect the three-dimensional face data of the collected object at a preset frequency, and the preset frequency may be 30 frames per second.
  • the collected object reads aloud according to a predefined text sentence library.
  • the text sentence library can include Chinese and English, and can also include other languages, such as Korean, Japanese, French, German, etc.
  • the recording device records the content read by the collection object. Three-dimensional data collection and audio recording are in a quiet environment to avoid introducing noise to the recording.
  • Ten or more predefined text sentence libraries can be selected, and the length of each text sentence library is 12000 characters/words.
  • The audio recording for each of the 10 collection objects can be performed in synchronization with, or simultaneously with, the above-mentioned three-dimensional data collection process, recording the audio of all the text sentence libraries for each collection object. After the collection is completed, the 10 pieces of recorded audio are time-aligned with the 10 sets of three-dimensional face data.
  • The three-dimensional face scan data is smoothed and aligned with the template of the general three-dimensional face standard model. Then, the M spectrograms obtained by processing the M pieces of audio and the M sets of aligned three-dimensional face data are used to train the driving parameter generation model.
  • In this way, the audio of multiple collection objects is collected and aligned in time with the three-dimensional scan data of those collection objects, so that the audio and the three-dimensional scan data correspond exactly, and a rich-vocabulary, multilingual text sentence library is adopted, which helps improve the driving parameter generation model. A sketch of such frame-level time alignment is given below.
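  • The following is an illustrative sketch of pairing each three-dimensional scan frame (captured at the preset frequency of 30 frames per second) with the window of recorded audio covering the same instant; the 16 kHz sample rate and the window length are assumptions made only for illustration.

      import numpy as np

      def align_audio_to_scans(audio, n_scan_frames, sample_rate=16000,
                               scan_fps=30, window=4000):
          """Return one audio window per 3D scan frame, centred on the scan timestamp."""
          aligned = []
          for k in range(n_scan_frames):
              centre = int(round(k * sample_rate / scan_fps))
              start = max(0, centre - window // 2)
              aligned.append(audio[start:start + window])
          return aligned

      # Two seconds of audio aligned to 60 scan frames gives 60 audio windows.
      print(len(align_audio_to_scans(np.zeros(32000), n_scan_frames=60)))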
  • In a possible example, the training process of the driving parameter generation model includes the following steps: inputting the M spectrograms into the driving parameter generation model to generate a first model driving parameter set; fitting and optimizing the M sets of aligned three-dimensional face data against the general three-dimensional face standard model to generate a second model driving parameter set; placing the parameters in the first parameter set and the parameters in the second parameter set in one-to-one correspondence and performing a loss function calculation to obtain a loss function value; and, when the loss function value is less than a preset first loss threshold, determining that training of the driving parameter generation model is complete.
  • That is, the first parameter set generated by inputting the M spectrograms into the driving neural network corresponds one-to-one to the second model driving parameter set generated by fitting and optimizing the M sets of aligned three-dimensional face data against the general three-dimensional face standard model; the former is predicted by the driving neural network, and the latter is generated by the fitting.
  • the convergence degree of the driving neural network can be judged according to the loss function value.
  • the preset loss function threshold value is 5, 10, 7, etc.
  • the loss function threshold value can be set according to the accuracy requirements of different levels adapted to the driving neural network.
  • the driving parameter generation model may be any one of various convolutional neural networks.
  • In this way, the first parameter set generated by inputting the spectrograms into the driving parameter generation model and the second model driving parameter set generated by fitting and optimizing the M sets of aligned three-dimensional face data against the general three-dimensional face standard model are used to calculate the loss function value; the loss function value is then used to judge the convergence of the driving parameter generation model, improving the training effect, as sketched below.
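  • The following is a minimal sketch of this training criterion, assuming a mean-squared-error loss and an illustrative threshold; the application itself only requires a loss function value compared against a preset first loss threshold, so both choices are assumptions.

      import numpy as np

      def loss_value(predicted_params, fitted_params):
          """One-to-one loss between the first (predicted) and second (fitted)
          model driving parameter sets."""
          return float(np.mean((predicted_params - fitted_params) ** 2))

      def training_complete(predicted_params, fitted_params, first_loss_threshold=5.0):
          return loss_value(predicted_params, fitted_params) < first_loss_threshold

      # Example with random stand-in data: 100 samples, 53 parameters each (both
      # numbers are placeholders, not values defined by this application).
      rng = np.random.default_rng(0)
      pred = rng.normal(size=(100, 53))
      fit = rng.normal(size=(100, 53))
      print(loss_value(pred, fit), training_complete(pred, fit))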
  • In a possible example, before displaying the three-dimensional face model of the second user during the voice call between the first user of the first terminal and the second user of the second terminal, the method further includes: acquiring the face image of the second user; inputting the face image of the second user into a pre-trained parameter extraction model to obtain the identity parameters of the second user; inputting the identity parameters into a general three-dimensional face standard model to obtain the three-dimensional face model of the second user; and storing the three-dimensional face model of the second user.
  • the pre-trained parameter extraction model is likewise a neural network.
  • The model has the function of generating, from an input face image, the parameters corresponding to that image, such as identity parameters, expression parameters, and posture parameters. Therefore, by inputting a face image containing the second user into the pre-trained parameter extraction model, the identity parameters of the second user can be obtained, as sketched below.
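  • A minimal, illustrative sketch of this preparation step follows; the function names, the 53-dimensional identity vector and the linear identity-basis form are assumptions used only to make the flow concrete, not interfaces defined by this application.

      import numpy as np

      def extract_identity_params(face_image, parameter_extraction_model):
          """Run the pre-trained parameter extraction model on a face image and
          keep only its identity parameters."""
          identity, expression, pose = parameter_extraction_model(face_image)
          return identity

      def build_face_model(identity, mean_shape, identity_basis):
          """Instantiate the contact's personal 3D face model from the general
          standard model (mean shape plus identity basis)."""
          return mean_shape + np.tensordot(identity, identity_basis, axes=1)

      # Stand-in data: a dummy extractor, a 500-vertex mean shape, 53 identity bases.
      def dummy_extractor(image):
          return np.zeros(53), np.zeros(29), np.zeros(6)

      mean_shape = np.zeros((500, 3))
      identity_basis = np.zeros((53, 500, 3))
      identity = extract_identity_params(np.zeros((224, 224, 3)), dummy_extractor)
      stored_model = build_face_model(identity, mean_shape, identity_basis)
      print(stored_model.shape)  # (500, 3) vertices of the stored 3D face model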
  • the general three-dimensional face standard model is a three-dimensional face model in a natural expression state obtained from multiple three-dimensional face models, and it may include N key point annotations, where N may be equal to 106, 100, 105, etc. (only some key points are marked symbolically in the figure).
  • In the general three-dimensional face standard model, S_i represents the orthogonal basis of the i-th face identity, B_i represents the orthogonal basis of the i-th face expression, α_i represents the i-th face identity parameter, β_i represents the i-th face expression parameter, f, pr and t are the face pose parameters, respectively representing the scaling parameter, the rotation matrix and the translation matrix, and a projection matrix is used to project the model onto the image plane.
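  • The equation these symbols compose is not reproduced in this text; a standard 3DMM-style formulation consistent with the definitions above, assuming that \bar{S} denotes the mean shape of the general three-dimensional face standard model and P the projection matrix, would read:

      S = \bar{S} + \sum_i \alpha_i S_i + \sum_i \beta_i B_i
      V = f \cdot P \cdot pr \cdot S + t

    where S is the reconstructed three-dimensional face and V is its projection onto the image plane under the pose parameters f, pr and t.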
  • Optionally, the three-dimensional face model of the second user can be associated with other information of the second user for storage, so that the three-dimensional face model of the second user can be obtained through that associated information; or the three-dimensional face model of the second user can be stored separately and obtained when an instruction for obtaining the three-dimensional face model of the second user is input.
  • the other information of the second user may include any one or more of remarks, name, phone number, identification number, nickname, picture, and social account.
  • In this way, the first terminal may obtain the face image of the second user, or pre-store the face image of the second user, generate the three-dimensional face model of the second user according to that face image, and store the three-dimensional face model of the second user.
  • In a possible example, the process of collecting the training data of the parameter extraction model includes the following steps: collecting X face region images, and annotating each of the X face region images with N key points to obtain X face region images annotated with N key points; inputting the X face region images annotated with N key points into the primary parameter extraction model to generate X sets of parameters, inputting the X sets of parameters into the general three-dimensional face standard model to generate X three-dimensional face standard models, and performing N-key-point projection on the X three-dimensional face standard models to obtain X face region images projected with N key points; collecting Y sets of three-dimensional face scan data, smoothing the Y sets of three-dimensional face scan data, using the general three-dimensional face standard model as a template to align the Y sets of three-dimensional face scan data, and fitting and optimizing the aligned Y sets of three-dimensional face scan data against the general three-dimensional face standard model to obtain Y sets of general three-dimensional face standard model parameters.
  • the X face region images may be about 1 million collected face image data sets that are uniformly distributed in various poses, include comprehensive races, comprehensive and uniformly distributed age groups, balanced gender ratios, and cover a wide range of face shapes.
  • A face detection algorithm is used to detect the faces in the 1 million face images and obtain and crop the face regions; an N-point face key point annotation algorithm is then used to perform key point detection on the cropped face regions, obtaining cropped face image data annotated with N key points.
  • Each vertex of the three-dimensional face model corresponds to a different coordinate and number, for example, 55 corresponds to the right eye corner, and 65 corresponds to the left eye corner.
  • The two-dimensional data is the X face images (with their N key point annotations); the ternary data is the Y face images, their N key point annotations, and the Y sets of general three-dimensional face standard model parameters.
  • The embodiment of the present application uses a combination of face images (two-dimensional) and three-dimensional scan data to train the parameter extraction model, which makes training data easier to obtain while also enlarging the amount of training data, improving both training efficiency and training accuracy.
  • the training process of the parameter extraction model includes the following steps: input the training data of the parameter extraction model into the parameter extraction model, and calculate the loss value between each data. When the loss value is less than the second loss threshold, the training process of the parameter extraction model is completed.
  • Specifically, the training data of the parameter extraction model is input into the parameter extraction model, and a loss value between the parameters, a 3D loss value and a 2D loss value are calculated; together these constitute the loss value.
  • The loss value between the parameters is calculated between the generated parameters and the Y sets of general three-dimensional face standard model parameters. The 3D loss value is calculated by generating a three-dimensional face model from the general three-dimensional face standard model parameters, annotating the model with N key points, and computing the loss between the key points. The 2D loss value is calculated by inputting the face image into the parameter extraction model to obtain the corresponding parameters, inputting those parameters into the general three-dimensional face standard model to obtain a three-dimensional face model, performing a two-dimensional projection of that model's N key points to obtain a face region image projected with N key points, and computing the error between the face region image annotated with N key points and the face region image projected with N key points to obtain the two-dimensional error value.
  • the face region images labeled with N key points include both X face region images with N key points and Y face images with N key points.
  • The face region images projected with N key points include both the X face region images projected with N key points and the Y face region images projected with N key points that are obtained based on the Y face images annotated with N key points. A sketch of these loss terms is given below.
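  • The following is an illustrative sketch of the three loss terms described above; the array shapes, the use of mean squared / mean Euclidean error and the equal weighting of the terms are assumptions made only for illustration.

      import numpy as np

      def parameter_loss(pred_params, fitted_params):
          # Loss between predicted parameters and the fitted standard-model parameters.
          return float(np.mean((pred_params - fitted_params) ** 2))

      def loss_3d(pred_kpts, scan_kpts):
          # Loss between the N annotated 3D key points of the two models, shape (N, 3).
          return float(np.mean(np.linalg.norm(pred_kpts - scan_kpts, axis=-1)))

      def loss_2d(projected_kpts, annotated_kpts):
          # Error between the N projected key points and the N annotated image key points, shape (N, 2).
          return float(np.mean(np.linalg.norm(projected_kpts - annotated_kpts, axis=-1)))

      def total_loss(pred_params, fitted_params, pred_kpts3d, scan_kpts3d,
                     proj_kpts2d, annot_kpts2d):
          return (parameter_loss(pred_params, fitted_params)
                  + loss_3d(pred_kpts3d, scan_kpts3d)
                  + loss_2d(proj_kpts2d, annot_kpts2d))

      # Stand-in example with N = 106 key points.
      N = 106
      print(total_loss(np.zeros(53), np.zeros(53),
                       np.zeros((N, 3)), np.zeros((N, 3)),
                       np.zeros((N, 2)), np.zeros((N, 2))))  # 0.0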
  • In a possible example, before displaying the three-dimensional face model of the second user, the method further includes: detecting that the screen state of the first terminal is bright and that the first terminal stores the three-dimensional face model of the second user; or detecting that the screen state of the first terminal is bright and that the first terminal stores the three-dimensional face model of the second user, displaying the three-dimensional call mode selection interface, and detecting a three-dimensional call mode start instruction entered by the user through the three-dimensional call mode selection interface; or detecting that the distance between the first terminal and the first user is greater than a preset distance threshold, displaying the three-dimensional call mode selection interface on the first terminal, and detecting a three-dimensional call mode start instruction entered by the user through the three-dimensional call mode selection interface.
  • Specifically, when the first terminal is in a bright-screen state and the first terminal stores a three-dimensional face model of the second user, the three-dimensional face model of the second user can be displayed automatically during the voice call between the first user of the first terminal and the second user of the second terminal;
  • alternatively, when the terminal is in a bright-screen state, or the distance sensor detects that the distance between the first terminal and the first user is greater than a preset distance threshold, a three-dimensional call mode selection interface is displayed as shown in FIG. 2E, and after a three-dimensional call mode start instruction entered in the three-dimensional call mode selection interface is detected, the three-dimensional face model of the second user is displayed as shown in FIG. 2B.
  • the distance between the first terminal and the first user may be the distance between the first terminal and the first user's ear, or the distance between the first terminal and another part of the first user's body.
  • In a possible example, the method further includes: if a three-dimensional call mode exit instruction is detected, terminating the display of the three-dimensional face model of the second user and/or terminating the determination of model driving parameters based on the call voice of the second user; and if it is detected that the distance between the first terminal and the first user is less than the distance threshold, terminating the determination of model driving parameters based on the call voice of the second user. A sketch of this mode-selection logic is given below.
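  • The following is one possible, simplified reading of these conditions expressed as decision logic; the function name, its inputs and the returned mode labels are hypothetical and only illustrate how the triggers described above could be combined.

      def choose_call_mode(screen_on, has_face_model, distance_m,
                           distance_threshold_m, start_instruction_received,
                           exit_instruction_received):
          if exit_instruction_received:
              return "voice_only"              # exit instruction: stop the 3D animation
          if distance_m < distance_threshold_m:
              return "voice_only"              # terminal close to the user: stop driving the model
          if screen_on and has_face_model:
              if start_instruction_received:
                  return "3d_animation"        # drive and display the 3D face model
              return "show_3d_mode_selection"  # offer the 3D call mode selection interface
          return "voice_only"

      print(choose_call_mode(True, True, 0.5, 0.1, True, False))  # 3d_animation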
  • During a voice call between the first user of the first terminal and the second user of the second terminal, the three-dimensional face model of the second user is displayed.
  • Converting the call voice into the spectrogram of the second user may include the following step: a fast Fourier transform is performed on the call voice, transforming the speech signal from the time domain to the frequency domain to obtain the spectrogram of the second user.
  • the driving parameter generation model is a pre-trained neural network. Inputting the spectrogram of the second user into the model driving parameter generation model will generate the model driving parameters.
  • the model driving parameters include expression parameters, and/or posture parameters. There are three posture parameters, which respectively represent the zoom scale parameter, the rotation matrix and the translation matrix.
  • the model driving parameters may be in multiple groups, and different groups correspond to different types of expressions or postures.
  • the model driving parameters are input into the three-dimensional face model of the second user, so that as the model driving parameters change, different expressions or postures are presented, thereby presenting an animation effect.
  • Said driving the three-dimensional face model of the second user according to the model driving parameters to display the three-dimensional simulated call animation of the second user includes: inputting the model driving parameters into the three-dimensional face model of the second user, and driving the three-dimensional face model of the second user to change dynamically, where the dynamic changes include changes in facial expression and/or changes in posture, as sketched below.
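  • A minimal sketch of applying one frame of driving parameters to the stored face model follows; the blendshape form of the expression term and the array shapes are assumptions chosen to match the symbols defined earlier, not an implementation prescribed by this application.

      import numpy as np

      def drive_face_model(neutral, expression_basis, expression,
                           scale, rotation, translation):
          """Blend expression offsets onto the neutral shape, then apply the
          posture (scaling, rotation matrix, translation)."""
          shape = neutral + np.tensordot(expression, expression_basis, axes=1)  # (V, 3)
          return scale * shape @ rotation.T + translation

      # Stand-in data: a 500-vertex neutral face, 29 expression blendshapes,
      # and an identity posture (unit scale, no rotation, no translation).
      V, K = 500, 29
      vertices = drive_face_model(np.zeros((V, 3)), np.zeros((K, V, 3)),
                                  np.zeros(K), 1.0, np.eye(3), np.zeros(3))
      print(vertices.shape)  # (500, 3) driven vertices for this animation frame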
  • In this way, the corresponding spectrogram is generated from the call voice of the second user, the spectrogram is input into the pre-trained neural network to obtain the model driving parameters, and the model driving parameters are then used to drive the three-dimensional face model of the second user, so that it presents different postures and expressions according to the different model driving parameters and achieves the effect of an animation. The operation is convenient, and the obtained model driving parameters are highly accurate.
  • FIG. 4 is a schematic diagram of the functional unit structure of a call control apparatus 400 provided by an embodiment of the present application, which is applied to the first terminal.
  • The apparatus includes a processing unit and a communication unit.
  • Specifically, the call control apparatus 400 includes a processing unit 410 and a communication unit 420, wherein:
  • the processing unit 410 is configured to display a three-dimensional face model of the second user, through the communication unit 420, during a voice call between the first user of the first terminal and the second user of the second terminal; to determine model driving parameters according to the call voice of the second user, the model driving parameters including expression parameters and posture parameters; and to drive the three-dimensional face model of the second user according to the model driving parameters to display the three-dimensional simulated call animation of the second user, where the three-dimensional simulated call animation presents expression animation information corresponding to the expression parameters and posture animation information corresponding to the posture parameters.
  • In a possible example, in the aspect of displaying the three-dimensional face model of the second user, the processing unit 410 is specifically configured to display the three-dimensional face model of the second user in the call application interface of the first terminal.
  • In a possible example, in the aspect of displaying the three-dimensional face model of the second user, the processing unit 410 is specifically configured to display, in a split screen, the call application interface of the first terminal and the three-dimensional face model of the second user.
  • In a possible example, in the aspect of displaying the three-dimensional face model of the second user, the processing unit 410 is specifically configured to display the three-dimensional face model of the second user on a third terminal connected to the first terminal.
  • In the aspect of determining the model driving parameters according to the call voice of the second user, the processing unit 410 is specifically configured to detect the call voice of the second user, process the call voice to obtain the spectrogram of the second user, and input the spectrogram of the second user into a driving parameter generation model to generate the model driving parameters.
  • the model driving parameters include expression parameters, and/or posture parameters.
  • In a possible example, in the aspect of acquiring the training data of the driving parameter generation model, the processing unit 410 is specifically configured to: collect M pieces of audio and obtain M spectrograms based on the M pieces of audio, where the M pieces of audio are recordings of each of M collection objects reading multiple text libraries aloud in a preset manner; while collecting the M pieces of audio, collect the three-dimensional face data of each of the M collection objects at a preset frequency to obtain M sets of three-dimensional face data; use the general three-dimensional face standard model as a template to align the M sets of three-dimensional face data, obtaining M sets of aligned three-dimensional face data that have the same vertices and topological structure as the general three-dimensional face standard model; and perform time alignment between the M pieces of audio and the M sets of aligned three-dimensional face data, so that each set of three-dimensional face data corresponds one-to-one, in temporal order, to its piece of audio.
  • In a possible example, in the aspect of training the driving parameter generation model, the processing unit 410 is specifically configured to: input the M spectrograms into the driving neural network model to generate a first model driving parameter set; fit and optimize the M sets of aligned three-dimensional face data against the general three-dimensional face standard model to generate a second model driving parameter set; place the parameters in the first parameter set and the parameters in the second parameter set in one-to-one correspondence and calculate the loss function to obtain the loss function value; and, when the loss function value is less than the preset first loss threshold, determine that training of the driving parameter generation model is complete.
  • the processing unit 410 displays the three-dimensional face model of the second user during a voice call between the first user of the first terminal and the second user of the second terminal. It is also used to obtain the face image of the second user; input the face image of the second user into a pre-trained parameter extraction model to obtain the identity parameters of the second user; The parameters are input to a general three-dimensional face standard model to obtain the three-dimensional face model of the second user; and the three-dimensional face model of the second user is stored.
  • In a possible example, in the aspect of collecting the training data of the parameter extraction model, the processing unit 410 is specifically configured to: collect X face region images, and annotate each of the X face region images with N key points to obtain X face region images annotated with N key points; input the X face region images annotated with N key points into the primary parameter extraction model to generate X sets of parameters, input the X sets of parameters into the general three-dimensional face standard model to generate X three-dimensional face standard models, and perform N-key-point projection on the X three-dimensional face standard models to obtain X face region images projected with N key points; collect Y sets of three-dimensional face scan data, smooth the Y sets of three-dimensional face scan data, use the general three-dimensional face standard model as a template to align the Y sets of three-dimensional face scan data, and fit and optimize the aligned Y sets of three-dimensional face scan data against the general three-dimensional face standard model to obtain Y sets of general three-dimensional face standard model parameters.
  • In a possible example, in the aspect of training the parameter extraction model, the processing unit 410 is specifically configured to input the training data of the parameter extraction model into the parameter extraction model and calculate the loss values between the data; when the loss values between the data are less than the second loss threshold, the training process of the parameter extraction model is completed.
  • In a possible example, before displaying the three-dimensional face model of the second user, the processing unit 410 is further configured to: detect that the screen state of the first terminal is bright and that the first terminal stores the three-dimensional face model of the second user; or detect that the screen state of the first terminal is bright and that the first terminal stores the three-dimensional face model of the second user, display the three-dimensional call mode selection interface, and detect the three-dimensional call mode start instruction entered by the user through the three-dimensional call mode selection interface; or detect that the distance between the first terminal and the first user is greater than a preset distance threshold and that the first terminal stores the three-dimensional face model of the second user, display the three-dimensional call mode selection interface, and detect the three-dimensional call mode start instruction entered by the user through the three-dimensional call mode selection interface.
  • the call control apparatus 400 may further include a storage unit 430 for storing the program codes and data of the electronic equipment.
  • the processing unit 410 may be a processor
  • the communication unit 420 may be a transceiver
  • the storage unit 430 may be a memory.
  • FIG. 5 is a schematic structural diagram of a first terminal 500 provided by an embodiment of the present application.
  • the first terminal 500 includes an application processor 510, a memory 520, a communication interface 530, and one or more programs 521, where:
  • the one or more programs 521 are stored in the aforementioned memory 520 and are configured to be executed by the aforementioned application processor 510, and the one or more programs 521 include instructions for executing the following steps:
  • displaying a three-dimensional face model of the second user during a voice call between the first user of the first terminal and the second user of the second terminal; determining model driving parameters according to the call voice of the second user, where the model driving parameters include expression parameters and posture parameters; and driving the three-dimensional face model of the second user according to the model driving parameters to display the three-dimensional simulated call animation of the second user.
  • In a multi-party call, the terminal of one end user can display the three-dimensional face models of the other multi-party users, and can generate model driving parameters based on the voices of the other multi-party terminal users, which are used to drive the three-dimensional face models of those users to display their three-dimensional simulated call animations. This improves the visualization and functionality of the call process and makes it convenient for one terminal user to capture changes in the expressions and postures of the other multi-party terminal users in real time during a call.
  • In a possible example, in the aspect of displaying the three-dimensional face model of the second user, the one or more programs 521 specifically include instructions for performing the following operation: displaying the three-dimensional face model of the second user in the call application interface of the first terminal.
  • In a possible example, the one or more programs 521 include instructions for executing the following step: split-screen displaying the call application interface of the first terminal and the three-dimensional face model of the second user.
  • In a possible example, the one or more programs 521 include instructions for executing the following step: displaying the three-dimensional face model of the second user on a third terminal connected to the first terminal.
  • In the aspect of driving the three-dimensional face model of the second user according to the model driving parameters to display the three-dimensional simulated call animation of the second user, the one or more programs 521 include instructions for performing the following steps: detecting the call voice of the second user and processing it to obtain the spectrogram of the second user; and inputting the spectrogram of the second user into the driving parameter generation model to generate the model driving parameters, where the model driving parameters include expression parameters and posture parameters.
  • In a possible example, the one or more programs 521 include instructions for performing the following steps: collecting M pieces of audio and obtaining M spectrograms based on the M pieces of audio, where the M pieces of audio are recordings of each of M collection objects reading multiple text libraries aloud in a preset manner; while collecting the M pieces of audio, collecting the three-dimensional face data of each of the M collection objects at a preset frequency to obtain M sets of three-dimensional face data; using the general three-dimensional face standard model as a template to align the M sets of three-dimensional face data, obtaining M sets of aligned three-dimensional face data that have the same vertices and topological structure as the general three-dimensional face standard model; and performing time alignment between the M pieces of audio and the M sets of aligned three-dimensional face data, so that each set of three-dimensional face data in the M sets corresponds one-to-one, in temporal order, to its piece of audio in the M pieces of audio; wherein the training data of the driving parameter generation model includes the M spectrograms and the M sets of aligned three-dimensional face data.
  • In a possible example, the one or more programs 521 include instructions for performing the following steps: inputting the M spectrograms into the driving neural network model to generate a first model driving parameter set; fitting and optimizing the M sets of aligned three-dimensional face data against the general three-dimensional face standard model to generate a second model driving parameter set; placing the parameters in the first parameter set and the parameters in the second parameter set in one-to-one correspondence and calculating the loss function to obtain the loss function value; and, when the loss function value is less than the preset first loss threshold, completing the training of the driving parameter generation model.
  • In a possible example, the one or more programs 521 include instructions for performing the following steps: acquiring the face image of the second user; inputting the face image of the second user into a pre-trained parameter extraction model to obtain the identity parameters of the second user; inputting the identity parameters of the second user into a general three-dimensional face standard model to obtain the three-dimensional face model of the second user; and storing the three-dimensional face model of the second user.
  • In a possible example, the one or more programs 521 include instructions for performing the following steps: collecting X face region images, and annotating each of the X face region images with N key points to obtain X face region images annotated with N key points; inputting the X face region images annotated with N key points into the primary parameter extraction model to generate X sets of parameters, inputting the X sets of parameters into the general three-dimensional face standard model to generate X three-dimensional face standard models, and performing N-key-point projection on the X three-dimensional face standard models to obtain X face region images projected with N key points; collecting Y sets of three-dimensional face scan data, smoothing the Y sets of three-dimensional face scan data, using the general three-dimensional face standard model as a template to align the Y sets of three-dimensional face scan data, and fitting and optimizing the aligned Y sets of three-dimensional face scan data against the general three-dimensional face standard model to obtain Y sets of general three-dimensional face standard model parameters, where the parameters include any one or more of the identity parameters, expression parameters, and posture parameters; wherein the Y sets of three-dimensional face scan data correspond to Y collection objects, the face image of each of the Y collection objects is collected to obtain Y face images, and N key point annotation is performed on the Y face images to obtain Y face images annotated with N key points; the training data of the parameter extraction model includes any one or more of: the X face region images annotated with N key points, the X face region images projected with N key points, the Y sets of general three-dimensional face standard model parameters, the Y face images, and the Y face images annotated with N key points.
  • In a possible example, the one or more programs 521 include instructions for executing the following steps: inputting the training data of the parameter extraction model into the parameter extraction model and calculating the loss values between the data; when the loss values between the data are less than the second loss threshold, completing the training process of the parameter extraction model.
  • In a possible example, the one or more programs 521 include instructions for executing the following steps: detecting that the screen state of the first terminal is bright and that the first terminal stores the three-dimensional face model of the second user; or detecting that the screen state of the first terminal is bright and that the first terminal stores the three-dimensional face model of the second user, displaying a three-dimensional call mode selection interface, and detecting the three-dimensional call mode start instruction entered by the user through the three-dimensional call mode selection interface; or detecting that the distance between the first terminal and the first user is greater than a preset distance threshold, displaying the three-dimensional call mode selection interface, and detecting the three-dimensional call mode start instruction entered by the user through the three-dimensional call mode selection interface.
  • the processor 510 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so on.
  • the processor 510 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), or PLA (Programmable Logic Array).
  • the processor 510 may also include a main processor and a coprocessor.
  • the main processor is a processor used to process data in the awake state, also called a CPU (Central Processing Unit); the coprocessor is a low-power processor used to process data in the standby state.
  • the processor may be integrated with a GPU (Graphics Processing Unit, image processor), and the GPU is used to render and draw content that needs to be displayed on the display screen.
  • the processor 510 may further include an AI (Artificial Intelligence) processor, and the AI processor is used to process computing operations related to machine learning.
  • the memory 520 may include one or more computer-readable storage media, which may be non-transitory.
  • the memory 520 may also include a high-speed random access memory and a non-volatile memory, such as one or more magnetic disk storage devices and flash memory storage devices.
  • the memory 520 is at least used to store the following computer program, where the computer program is loaded and executed by the processor 510 to implement relevant steps in the call control method disclosed in any of the foregoing embodiments.
  • the resources stored in the memory 520 may also include an operating system and data, etc., and the storage mode may be short-term storage or permanent storage.
  • the operating system may include Windows, Unix, Linux, etc.
  • the data may include, but is not limited to, terminal interaction data, terminal device signals, and so on.
  • the first terminal 500 may further include an input/output interface, a communication interface, a power supply, and a communication bus.
  • The structure shown does not constitute a limitation on the first terminal 500, which may include more or fewer components.
  • the first terminal includes hardware structures and/or software modules corresponding to each function.
  • the present application can be implemented in the form of hardware or a combination of hardware and computer software. Whether a certain function is executed by hardware or computer software-driven hardware depends on the specific application and design constraint conditions of the technical solution. Professionals and technicians can use different methods for each specific application to implement the described functions, but such implementation should not be considered beyond the scope of this application.
  • the embodiment of the present application may divide the first terminal into functional units according to the foregoing method examples.
  • each functional unit may be divided corresponding to each function, or two or more functions may be integrated into one processing unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit. It should be noted that the division of units in the embodiments of the present application is illustrative, and is only a logical function division, and there may be other division methods in actual implementation.
  • An embodiment of the present application also provides a computer storage medium, wherein the computer storage medium stores a computer program for electronic data exchange, and the computer program enables a computer to execute part or all of the steps of any method recorded in the above method embodiments.
  • the above-mentioned computer includes a first terminal.
  • the embodiments of the present application also provide a computer program product.
  • the above-mentioned computer program product includes a non-transitory computer-readable storage medium storing a computer program.
  • the above-mentioned computer program is operable to cause a computer to execute part or all of the steps of any method described in the above-mentioned method embodiments.
  • the computer program product may be a software installation package, and the computer includes the first terminal.
  • the disclosed device may be implemented in other ways.
  • the terminal embodiments described above are only illustrative.
  • the division of the above-mentioned units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical or other forms.
  • the units described above as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • if the above integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable memory.
  • the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, can be embodied in the form of a software product stored in a memory.
  • the software product includes a number of instructions to enable a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods of the various embodiments of the present application.
  • the aforementioned memory includes: a USB flash drive, Read-Only Memory (ROM), Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or other media that can store program code.
  • the program can be stored in a computer-readable memory, which may include: a flash disk, Read-Only Memory (ROM), Random Access Memory (RAM), a magnetic disk, an optical disk, etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • Environmental & Geological Engineering (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Computer Graphics (AREA)
  • Computer Hardware Design (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Processing Or Creating Images (AREA)

Abstract

本申请实施例公开了一种通话控制的方法及相关产品,所述方法包括:在第一终端的第一用户与第二终端的第二用户进行语音通话的过程中,显示第二用户的三维人脸模型;根据第二用户的通话语音确定模型驱动参数;根据模型驱动参数驱动第二用户的三维人脸模型,以显示第二用户的三维模拟通话动画。本申请实施例,不仅能显示第二用户的三维人脸模型,还能根据第二用户的通话语音确定上述模型的模型驱动参数,以模型驱动参数驱动第二用户的三维人脸模型呈现动画效果。相较于现有打电话仅能静态显示呼叫用户的身份信息的技术,本申请能够更全面的呈现通话对端用户的信息,包括表情和头部姿态信息,从而有利于提高第一终端的通话应用的智能性和功能性。

Description

通话控制方法及相关产品 技术领域
本申请涉及网络和计算机技术领域,尤其涉及了一种通话控制方法及相关产品。
背景技术
随着信息技术的飞速发展,智能终端、智能***越来越普及。给人们的生活、生产方式带来了诸多改变。足不出户便能与千里之外的人打接电话,语音通话或者视频通话。
但在通话的过程中,一般而言,终端仅在来电时检测呼入的号码信息,或搜索号码联系人的图片进行显示等。只是将两端用户的语音信息连接起来,进行语音信息的交互。
发明内容
本申请实施例提供一种通话控制方法以及相关产品,以期提高第一终端的通话应用的智能性和功能性。
第一方面,本申请实施例提供一种通话控制的方法,应用于第一终端,所述方法包括:
在所述第一终端的第一用户与第二终端的第二用户进行语音通话的过程中,显示所述第二用户的三维人脸模型;
根据所述第二用户的通话语音确定模型驱动参数,所述模型驱动参数包括表情参数和姿态参数;
根据所述模型驱动参数驱动所述第二用户的三维人脸模型,以显示所述第二用户的三维模拟通话动画,所述三维模拟通话动画呈现有与所述表情参数对应的表情动画信息,以及呈现有与所述姿态参数对应的姿态动画信息。
第二方面,本申请实施例提供一种通话控制方法装置,应用于第一终端,所述装置包括处理单元和通信单元,其中,
所述处理单元,用于在所述第一终端的第一用户与第二终端的第二用户通过所述通信单元进行语音通话的过程中,显示所述第二用户的三维人脸模型;以及用于根据所述第二用户的通话语音确定模型驱动参数,所述模型驱动参数包括表情参数和姿态参数;以及用于根据所述模型驱动参数驱动所述第二用户的三维人脸模型,以显示所述第二用户的三维模拟通话动画,所述三维模拟通话动画呈现有与所述表情参数对应的表情动画信息,以及呈现有与所述姿态参数对应的姿态动画信息。
第三方面,本申请实施例提供一种第一终端,其特征在于,包括处理器、存储器,所述存储器用于存储一个或多个程序,并且被配置由所述处理器执行,所述程序包括用于执行如上述第一方面所述的方法中的步骤的指令。
第四方面,本申请实施例提供了一种芯片,该芯片包括处理器与数据接口,该处理器通过该数据接口读取存储器上存储的指令,执行如上述第一方面所述的方法中的步骤。
第五方面,本申请实施例提供了一种计算机可读存储介质,其中,上述计算机可读存储介质存储用于电子数据交换的计算机程序,其中,上述计算机程序使得计算机执行如本申请实施例第一方面中所描述的部分或全部步骤。
第六方面,本申请实施例提供了一种计算机程序产品,其中,上述计算机程序产品包括存储了计算机程序的非瞬时性计算机可读存储介质,上述计算机程序可操作来使计算机执行如本申请实施例第一方面中所描述的部分或全部步骤。该计算机程序产品可以为一个软件安装包。
可以看出,本申请实施例中,第一终端的第一用户在与第二终端的第二用户语音通话时,第一终端可以显示第二用户的三维人脸模型,并根据第二用户的通话语音确定上述模 型的模型驱动参数,以及根据模型驱动参数驱动第二用户的三维人脸模型,以显示第二用户的三维模拟通话动画。相较于现有打电话仅能静态显示呼叫用户的身份信息的技术,本申请能够更全面的呈现通话对端用户的信息,包括表情和头部姿态信息,从而有利于提高第一终端的通话应用的智能性和功能性。
附图说明
为了更清楚地说明本申请实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1是本申请实施例提供的一种通话控制***的结构示意图;
图2A是本申请实施例提供的一种通话控制方法的流程示意图;
图2B是本申请实施例提供的一种三维人脸显示界面的示意图;
图2C是本申请实施例提供的一种通用三维人脸标准模型的示意图;
图2D是本申请实施例提供的一种根据参数提取模型的训练数据计算Loss值的流程示意图;
图2E是本申请实施例提供的一种三维通话选择界面的示意图;
图3是本申请实施例提供的另一种通话控制方法的流程示意图;
图4是本申请实施提供的通话控制方法装置的功能单元构成的示意图;
图5是本申请实施提供的一种第一终端的结构示意图。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
本申请的说明书和权利要求书及上述附图中的术语“第一”、“第二”等是用于区别不同对象,而不是用于描述特定顺序。此外,术语“包括”和“具有”以及它们任何变形,意图在于覆盖不排他的包含。例如包含了一系列步骤或单元的过程、方法、***、产品或设备没有限定于已列出的步骤或单元,而是可选地还包括没有列出的步骤或单元,或可选地还包括对于这些过程、方法、产品或设备固有的其他步骤或单元。
在本文中提及“实施例”意味着,结合实施例描述的特定特征、结构或特性可以包含在本申请的至少一个实施例中。在说明书中的各个位置出现该短语并不一定均是指相同的实施例,也不是与其它实施例互斥的独立的或备选的实施例。本领域技术人员显式地和隐式地理解的是,本文所描述的实施例可以与其它实施例相结合。
目前,在通话方面,当有通话呼入时,一般终端可在来电时检测呼入号码信息,或搜索号码联系人的图片进行显示等,能查看到来电的基本信息,而不能在通话过程中更好地捕捉对方的表情以及姿态等信息,功能单一,通话过程的可视化程度低。
针对上述问题,本申请实施例提供一种通话控制的方法,应用于终端。下面结合附图进行详细介绍。
首先,请参看图1所示的一种通话控制***的结构示意图,包括互联网络,第一终端以及第二终端,在其他可能的实施例中也可以包括第三终端,第四终端,或者其他多个终端设备,适用于多方通话的应用场景。
上述终端包括但不限于带通讯功能的设备、智能手机、平板电脑、笔记本电脑、台式 电脑、便携式数字播放器、智能手环以及智能手表等。
本申请实施例的技术方案可以基于图1举例所示架构的通话控制***或其形变架构来具体实施。
首先参见图2A,图2A是本申请实施例提供的一种通话控制方法的流程示意图,如图所示,包括如下步骤:
201,在所述第一终端的第一用户与第二终端的第二用户进行语音通话的过程中,显示所述第二用户的三维人脸模型;
具体的,在用户通话的过程中,无论是通话呼入或者呼出,其中一方用户都能显示对方的三维人脸模型。而且不仅适用于两方通话,多方通话也同样适用。通话中的一方终端用户能显示其他多方用户的三维人脸模型。
202,根据所述第二用户的通话语音确定模型驱动参数,所述模型驱动参数包括表情参数和姿态参数;
具体的,在通话过程中,一方终端用户可以获取对端用户的通话语音,并且对其进行处理,生成模型驱动参数,所述模型驱动参数包括姿态参数,表情参数。所述模型驱动参数为驱动所述对端用户的三维人脸模型,也即前述第二用户的三维人脸模型。
203,根据所述模型驱动参数驱动所述第二用户的三维人脸模型,以显示所述第二用户的三维模拟通话动画,所述三维模拟通话动画呈现有与所述表情参数对应的表情动画信息,以及呈现有与所述姿态参数对应的姿态动画信息。
具体的,可以理解为所述第二用户的三维人脸模型会随着模型驱动参数的变化而产生动态变化,不同的参数对应不同的表情,比如微笑、大笑、生气、悲伤、愤怒等,以及根据不同的姿态参数生成不同的姿态,以此使得不断变化的所述第二用户的三维人脸模型,呈现出三维动画的效果。
可以看出,本申请实施例中,第一终端的第一用户在与第二终端的第二用户语音通话时,第一终端可以显示第二用户的三维人脸模型,并根据第二用户的通话语音确定上述模型的模型驱动参数,以及根据模型驱动参数驱动第二用户的三维人脸模型,以显示第二用户的三维模拟通话动画。相较于现有打电话仅能显示静态呼叫用户的身份信息的技术,本申请能够更全面的呈现通话对端用户的信息,包括表情和头部姿态信息,从而有利于提高第一终端的通话应用的智能性和功能性。
在一个可能的示例中,所述显示所述第二用户的三维人脸模型,包括:在所述第一终端的通话应用界面中显示所述第二用户的三维人脸模型。
具体的,如图2B所示,所述第二用户的三维人脸模型可以直接在所述第一终端的通话应用界面中显示,能很直观的看到所述第二用户的三维人脸模型。
在一个可能的示例中,所述显示所述第二用户的三维人脸模型,包括:分屏显示所述第一终端的通话应用界面和所述第二用户的三维人脸模型。
具体的，当所述第一终端处于分屏状态时，可以同时在分屏界面上分别显示所述第一终端的通话应用界面和所述第二用户的三维人脸模型；或者，
可以在所述第一终端的通话应用界面中显示所述第二用户的三维人脸模型,在分屏的另外一个界面上显示其他应用。
在一个可能的示例中,所述显示所述第二用户的三维人脸模型,包括:在与所述第一终端连接的第三终端上显示所述第二用户的三维人脸模型。
具体的,所述第二用户的三维人脸模型可以在与所述第一终端连接的第三终端上显示。其中,所述连接方式可以为无线高保真WiFi连接,蓝牙连接,移动数据连接,热点Hot连接中的任意一种或多种。
可以看出,所述第二用户的三维人脸模型多样化的显示方式,不仅增加通话过程的趣 味性,分屏显示或者在与所述第一终端连接的第三终端上显示,更便于终端用户对终端的使用,同时,通过连接终端的显示屏进行所述第二用户的三维人脸模型的显示,还能提升显示效果。所述连接终端可以包括第三终端,还可以包括第四终端,或者其他多个终端。这些终端与所述第一终端都可以通过无线高保真WiFi连接,蓝牙连接,移动数据连接,热点Hot连接中的任意一种或多种进行连接。
在一个可能的示例中,所述根据所述模型驱动参数驱动所述第二用户的三维人脸模型,以显示所述第二用户的三维模拟通话动画,包括:
检测所述第二用户的通话语音,对所述通话语音进行处理,得到所述第二用户的语谱图;将所述第二用户的语谱图输入驱动参数生成模型,生成模型驱动参数,所述模型驱动参数包括表情参数,和/或姿态参数。
具体的,在所述第一终端的第一用户与第二终端的第二用户进行语音通话的过程中,获取所述第二终端的第二用户的语音通话,将所述通话语音进行处理,转换成所述第二用户的语谱图,可以包括如下步骤,比如所述第二用户的通话语音的长度为t,将其分为m帧,则每帧长为n=t/m。对其做快速傅里叶变换,将语音信号从时域变换到频域,得到所述第二用户的语谱图。将所述第二用户的语谱图输入预先训练好的驱动参数生成模型,则会生成模型驱动参数。所述模型驱动参数包括表情参数,和姿态参数。姿态参数包括三个,分别表示缩放尺度参数,旋转矩阵和平移矩阵,利用表情参数或姿态参数,或者表情参数和姿态参数,会使得所述第二用户的三维人脸模型呈现出表情或姿态的变化,或者表情和姿态的变化。
可见,利用第二用户的通话语音,以及预先训练好的驱动参数生成模型生成驱动参数,并且根据驱动参数对第二用户三维人脸模型的进行驱动,驱动效果较好,操作方法简明清晰。
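A minimal sketch, assuming NumPy, a mono waveform, and a Hann window (none of which are fixed by the text), of splitting the call speech into m frames and applying an FFT per frame to obtain the spectrogram that is fed to the driving-parameter generation model:

```python
import numpy as np

def call_speech_spectrogram(speech: np.ndarray, num_frames: int) -> np.ndarray:
    """Frame the speech (length t) into num_frames frames of length n = t / num_frames,
    FFT each frame (time domain -> frequency domain), return log-magnitude spectra."""
    frame_len = len(speech) // num_frames
    frames = speech[:frame_len * num_frames].reshape(num_frames, frame_len)
    window = np.hanning(frame_len)                  # assumed window, not specified in the text
    spectra = np.fft.rfft(frames * window, axis=1)  # fast Fourier transform per frame
    return np.log1p(np.abs(spectra))                # shape: (num_frames, frame_len // 2 + 1)

# illustrative usage: 1 s of 16 kHz audio split into 100 frames
spec = call_speech_spectrogram(np.random.randn(16000).astype(np.float32), num_frames=100)
```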
在一个可能的示例中,所述驱动参数生成模型的训练数据的获取过程包括以下步骤:采集M份音频,根据所述M份音频得到M份语谱图,所述M份音频为M个采集对象中每一个采集对象按照预设方式朗读多个文本库的录音音频;在采集所述M份音频时,按照预设频率采集所述M个采集对象中每一个采集对象的三维人脸数据,得到M组三维人脸数据;以所述通用三维人脸标准模型为模板,对齐所述M组三维人脸数据,得到M组对齐后的三维人脸数据,所述M组对齐后的三维人脸数据与所述通用人脸三维标准模型具有相同的顶点以及拓扑结构;将所述M份音频与所述M组对齐后的三维人脸数据进行时间的对齐校准,使得所述M组三维人脸数据中的每一组三维人脸数据与所述M份音频中对应的每一份音频在时间序列上一一对应;其中,所述驱动参数生成模型的训练数据,包括M份语谱图与所述M组对齐后的三维人脸数据。
具体的,以M=10为例,即采集对象为10人时,(M也可以为50,30,人数的选取可以根据时间以及资源成本适当的进行调整,但不得低于10人。)这十个采集对象可以包含不同性别,年龄段,国籍,肤色,脸型等。
其中,可以使用高精度头面部动态扫描***对采集对象按照预设的频率进行面部三维数据的采集,预设的频率可以是30帧/秒。在进行面部三维数据的采集过程中,采集对象按照预先定义好的文本句库进行朗读,文本句库可以包括中文和英文,也可以包括其他语言,比如韩文,日文,法文,德文等。同时录音设备对采集对象所朗读的内容进行录制。三维数据采集以及音频录制处于安静环境中,避免对录音引进噪音。预先定义好的文本句库可以选取10个及以上,每个文本句库的长度为12000字/词。对10个采集对象中的每一个采集对象进行三维数据的采集,可以与上述三维数据的采集过程同时也可以异时,录制每个采集对象的所有文本句库的音频。采集完成后,将所述10份录音音频与所述10组三维人脸数据进行时间对齐。
更进一步的,对人脸三维扫描数据进行平滑处理以及与通用三维人脸标准模型的模板进行对齐。然后,将M份音频进行处理得到的M份语谱图,根据M份语谱图与所述M组对齐后的三维人脸数据,得到所述模型驱动参数生成模型。
可见,采集多个采集对象的音频,并且与采集对象的三维扫描数据在时间上进行对齐,使得音频与三维扫描数据完全对应起来,而且采取词汇丰富且多语种的文本句库,提高驱动参数生成模型训练数据的准确性以及应用场景。
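A minimal sketch, assuming a 30 frames-per-second scan rate (the example rate in the text) and an assumed 16 kHz recording, of aligning each 3D face scan frame with its corresponding slice of the recorded audio so that the two modalities correspond one-to-one on the time axis:

```python
import numpy as np

def align_audio_to_scans(audio: np.ndarray, sample_rate: int = 16000, scan_fps: int = 30):
    """Return one audio slice per 3D scan frame, aligned on the time axis."""
    samples_per_scan = sample_rate // scan_fps
    num_scans = len(audio) // samples_per_scan
    return [audio[i * samples_per_scan:(i + 1) * samples_per_scan] for i in range(num_scans)]

# illustrative usage: 2 s of audio -> one slice per scan frame
slices = align_audio_to_scans(np.random.randn(32000).astype(np.float32))
```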
在一个可能的示例中,以所述驱动参数生成模型的训练过程包括以下步骤:将所述M份语谱图输入驱动参数生成模型,生成第一模型驱动参数集;将所述M组对齐后的三维人脸数据与所述通用人脸三维标准模型进行拟合优化,生成第二模型驱动参数集;将所述第一参数集中的参数与所述第二参数集中的参数一一对应,进行损失函数计算,得到损失函数值;当所述损失函数值小于预设第一损失阈值时,训练完成模型驱动参数生成模型。
具体的,由于M组语谱图与M组对齐后的三维人脸数据一一对应,所以将所述M组语谱图输入所述驱动神经网络后,生成的所述第一参数集,与将所述M组对齐后的三维人脸数据与所述通用人脸三维标准模型进行拟合优化,生成的第二模型驱动参数集,也存在对应关系,前者为驱动神经网络预测生成,后者为真值标注,对两者进行损失函数,便能根据损失函数值,判断驱动神经网络的收敛程度。比如预设损失函数阈值为5,10,7等,所述损失函数阈值的设定可以根据适配于所述驱动神经网络不同层次的精准度要求。
其中,所述驱动参数生成模型可以为卷积神经网络中的任意一种。
可见,将语谱图输入驱动参数生成模型生成的第一参数集,与将M组对齐后的三维人脸数据与所述通用人脸三维标准模型进行拟合优化,生成第二模型驱动参数集,来计算损失函数值,并且通过损失函数值,来判断驱动参数生成模型的收敛情况,提高模型训练效果。
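A minimal sketch, assuming PyTorch and a mean-squared-error loss (the concrete loss form and network architecture are not fixed by the text), of comparing the parameter set predicted from the spectrograms with the ground-truth set obtained by fitting the aligned 3D data, and checking convergence against the first loss threshold:

```python
import torch

def driving_param_loss(predicted: torch.Tensor, fitted: torch.Tensor) -> torch.Tensor:
    """One-to-one loss between predicted and fitted (ground-truth) driving parameters."""
    return torch.nn.functional.mse_loss(predicted, fitted)

# illustrative shapes: 32 samples, 64 expression + 3 pose parameters each
predicted = torch.randn(32, 67, requires_grad=True)
fitted = torch.randn(32, 67)
loss = driving_param_loss(predicted, fitted)
first_loss_threshold = 5.0          # one of the example thresholds mentioned in the text (5, 7, 10)
training_done = loss.item() < first_loss_threshold
```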
在一个可能的示例中,所述在所述第一终端的第一用户与第二终端的第二用户进行语音通话的过程中,显示所述第二用户的三维人脸模型之前,所述方法还包括:获取所述第二用户的人脸图像;将所述第二用户的人脸图像输入预先训练的参数提取模型,得到所述第二用户的身份参数;将所述第二用户的身份参数输入通用三维人脸标准模型,得到所述第二用户的三维人脸模型;存储所述第二用户的三维人脸模型。
具体的,所述预先训练好的参数提取模型也为诸多神经网络中的一种。该模型具有根据输入的人脸图像,生成与输入的人脸图像对应的参数的功能。比如身份参数,表情参数,姿态参数等。因此将包含第二用户的人脸图像输入预先训练的参数提取模型,便能得到所述第二用户的身份参数。
其中,如图2C所示,所述通用三维人脸标准模型为根据多个三维人脸模型得到的一个处于自然表情状态下的三维人脸模型,可以包括N个关键点标注,N可以等于106,100,105(图中仅象征性的进行了关键点的标注)。将所述第二用户的身份参数输入通用三维人脸标准模型,得到所述第二用户的三维人脸模型,具体可以如下式所示:
S = S̄ + Σ_i α_i·S_i + Σ_i β_i·B_i
V = f·pr·Π·S + t
其中，S̄ 为平均人脸，S_i 表示第i个人脸身份正交基，B_i 表示第i个人脸表情正交基，α_i 表示第i个人脸身份参数，β_i 表示第i个人脸表情参数。f、pr、t 分别是人脸姿态参数，分别表示缩放尺度参数、旋转矩阵和平移矩阵，Π 为投影矩阵。将所述第二用户的身份参数输入所述通用三维人脸标准模型，进行参数拟合，与身份正交基作用，便能生成所述第二用户的三维人脸模型。
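A minimal sketch, assuming NumPy and illustrative basis sizes (the numbers of identity and expression bases are not given in the text), of the two formulas above: building the face shape from the mean face plus the identity and expression bases, then applying the pose parameters f, pr, t and the projection Π:

```python
import numpy as np

def reconstruct_face(mean_face, id_bases, expr_bases, alpha, beta, f, pr, proj, t):
    """S = mean_face + sum_i alpha_i * S_i + sum_i beta_i * B_i;  V = f * pr * proj * S + t."""
    S = mean_face + np.tensordot(alpha, id_bases, axes=1) + np.tensordot(beta, expr_bases, axes=1)
    return f * (pr @ (proj @ S)) + t       # pose: scale, rotation, projection, translation

# illustrative dimensions: 1000 vertices, 50 identity bases, 20 expression bases
num_v = 1000
V = reconstruct_face(mean_face=np.random.randn(3, num_v),
                     id_bases=np.random.randn(50, 3, num_v),
                     expr_bases=np.random.randn(20, 3, num_v),
                     alpha=np.random.randn(50), beta=np.random.randn(20),
                     f=1.0, pr=np.eye(3), proj=np.eye(3), t=np.zeros((3, 1)))
```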
并且将所述第二用户的三维人脸模型可以与所述第二用户的其他信息关联起来进行 存储,当输入第二用户的其他信息时,便能获取所述第二用户的三维人脸模型。或者,也可以将所述第二用户的三维人脸模型单独进行存储,当输入第二用户的三维人脸模型的获取指令时,便能获取所述第二用户的三维人脸模型。所述第二用户的其他信息可以包括备注信息,姓名,电话,身份编号,昵称,图片,社交账号中的任意一种或多种。
当然,也可以由第一终端获取所述第二用户的人脸图像,或者预先存储第二用户的人脸图像,再跟进第二用户的人脸图像生成第二用户的三维人脸模型,存储第二用户的三维人脸模型。
可见,预先将第二用户的三维人脸模型存储在第一终端里,在进行第二用户的三维人脸模型驱动时,能简化流程,简化技术操作。
在一个可能的示例中,所述参数提取模型的训练训练数据的采集过程包括以下步骤:采集X张人脸区域图像,并对所述X张人脸区域图像中的每一张人脸区域图像标注N个关键点,得到X张N个关键点标注的人脸区域图像;将所述X张标注N个关键点的人脸区域图像输入初级参数提取模型,生成X组参数,将所述X组参数输入所述通用三维人脸标准模型,生成X个三维人脸标准模型,将所述X个三维人脸标准模型进行N个关键点投影,得到X张N个关键点投影的人脸区域图像;采集Y组三维人脸扫描数据,平滑处理所述Y组三维人脸扫描数据,以所述通用人脸三维标准模型为模板,对齐所述Y组三维人脸扫描数据,将对齐后的所述Y组三维人脸扫描数据将所述Y组三维人脸扫描数据输入与所述通用人脸三维标准模型进行拟合优化,得到Y组通用三维人脸标准模型参数,所述参数包含所述身份参数,表情参数,姿态参数中的任意一种或多种;其中,所述Y组三维人脸扫描数据对应Y个采集对象,采集Y个采集对象中每一个采集对象的人脸图像,得到Y张人脸图像;对所述Y张人脸图像进行N个关键点标注,得到Y张N个关键点标注的人脸图像;所述参数提取模型的训练数据包括:将所述X张N个关键点标注的人脸区域图像、X张N个关键点投影的人脸区域图像、Y组通用三维人脸标准模型参数,Y张人脸图像以及Y张N个关键点标注的人脸图像中的任意一个或多个输入所述神经网络模型,训练得到参数提取模型。
具体的,所述X张人脸区域图像可以为采集的大约100万张各姿态分布均匀、人种包含全面、年龄段全面且均匀分布、性别比例平衡以及脸型涵盖广泛的人脸图像数据集。在采集到这些人脸图像之后,使用人脸检测算法对这100万张人脸图像进行人脸检测,得到人脸区域,进行剪裁,然后再使用N点人脸关键点标注算法对剪裁得到的人脸区域进行关键点检测,得到剪裁后具有N个关键点标注的人脸图像数据。
另外,使用高精度头面部动态扫描***采集Y组三维扫描数据,比如Y=30000,对组三维人脸扫描数据进行平滑处理,并与通用人脸三维标准模型的模板进行对齐,生成具有相同拓扑与顶点语义的三维人脸数据。然后使用这些对齐的三维人脸数据与通用三维人脸标准模型进行拟合优化,逐步迭代,就能得到Y组通用三维人脸标准模型参数。三维人脸模型的每个顶点对应不同的坐标与编号,比如55对应于右眼角,65对应于左眼角。
另外,在采集Y组三维人脸扫描数据时,采集这Y个采集对象的人脸图像,便得到Y张人脸图像。比如Y=30000,N=106,再对所述30000张人脸图像进行N个关键点标注,得到30000张106个关键点标注的人脸图像。
由此得到了由X张人脸图像(二维)加上三元数据(Y张人脸图像,N个关键点,Y组通用三维人脸标准模型参数)共同构成所述参数提取模型的训练数据。
可见,相较于三维数据,二维图像更容易获得,因此会降低训练的难度和训练成本,因此本申请实施例采用人脸图像(二维)与三维扫描数据相结合,来训练参数提取模型,使得训练数据更容易获取的同时,还增加了大量的训练数据。提高了训练效率的同时,还提高了训练的精准程度。
在一个可能的示例中,所述参数提取模型的训练过程包括以下步骤:将所述参数提取 模型的训练数据输入所述参数提取模型,计算各数据间的损失值,当所述各数据间的损失值小于所述第二损失阈值时,完成所述参数提取模型的训练过程。
如图2D所示,将所述参数提取模型的训练数据输入所述参数提取模型,分别计算参数间的loss值,计算3Dloss值,计算2Dloss值。loss值即为损失值。
其中,参数间的loss值即为计算所述Y组通用三维人脸标准模型参数各个参数间的损失值;计算3Dloss值即为根据所述通用三维人脸标准模型参数生成三维人脸模型,对模型进行N个关键点标注,以此来计算关键点间的损失值;计算2Dloss值即为人脸图像输入参数提取模型,得到对应的参数,将参数输入通用三维人脸标准模型,得到有三维人脸模型,将三维人脸模型进行N个关键点标注的二维投影,即得到N个关键点投影的人脸区域图像,计算所述标注N个关键点的人脸区域图像与投影N个关键点的人脸区域图像之间的误差值,得到二维误差值。标注N个关键点的人脸区域图像既包括X张标注N个关键点的人脸区域图像,又包括Y张N个关键点标注的人脸图像。投影N个关键点的人脸区域图像既包括X张N个关键点投影的人脸区域图像,又包括根据Y张N个关键点标注的人脸图像得到的Y张N个关键点投影的人脸区域图像。
当根据所述参数间的损失值,所述2D损失值,所述3D损失值得到的最终的损失值,小于所述第二损失阈值时,则完成对所述参数提取模型的训练。
可见,通过多组数据来计算数据间的Loss值,并且以Loss值来判断参数提取模型训练的收敛情况,能有效提高模型的训练效果。
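A minimal sketch, assuming NumPy, N = 106 keypoints, and a weak-perspective projection (the exact projection is not spelled out in the text), of the 2D loss: the model's keypoints are projected into the image plane and compared with the keypoints annotated on the face image:

```python
import numpy as np

def keypoint_2d_loss(annotated_kp, model_kp_3d, f, rotation, translation_2d):
    """Mean distance between annotated 2D keypoints and projected model keypoints.

    annotated_kp: (N, 2) keypoints labelled on the face image.
    model_kp_3d:  (N, 3) the same N keypoints taken from the reconstructed 3D face.
    """
    projected = f * (model_kp_3d @ rotation.T)[:, :2] + translation_2d  # assumed weak perspective
    return float(np.mean(np.linalg.norm(projected - annotated_kp, axis=1)))

# illustrative usage with N = 106 keypoints
loss_2d = keypoint_2d_loss(np.random.rand(106, 2), np.random.rand(106, 3),
                           f=1.0, rotation=np.eye(3), translation_2d=np.zeros(2))
```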
在一个可能的示例中,所述显示所述第二用户的三维人脸模型之前,所述方法还包括:检测到所述第一终端的屏幕状态为亮屏,且所述第一终端存储有所述第二用户的三维人脸模型;或者,检测到所述第一终端的屏幕状态为亮屏,且所述第一终端存储有所述第二用户的三维人脸模型,显示三维通话模式选择界面,检测到用户通过所述三维通话模式选择界面录入的三维通话模式开启指令;或者,检测到所述第一终端与所述第一用户的距离大于预设距离阈值,且所述第一终端存储有所述第二用户的三维人脸模型时,显示三维通话模式选择界面;检测到用户通过所述三维通话模式选择界面录入的三维通话模式开启指令。
具体的,可以为,当所述第一终端处于亮屏状态,并且所述第一终端存储有所述第二用户的三维人脸模型,则在所述第一终端的第一用户与第二终端的第二用户进行语音通话的过程中,能自动显示所述第二用户的三维人脸模型;
也可以为,在所述终端处于亮屏状态,或者通过距离传感器检测到所述第一终端与所述第一用户的距离大于预设距离阈值时,如图2E所示,显示三维通话模式选择界面,并且在检测到三维通话模式选择界面录入的三维通话模式开启指令后,如图2B所示,显示所述第二用户的三维人脸模型。所述第一终端与所述第一用户的距离既可以是所述第一终端与所述第一用户耳朵之间的距离,也可以是所述第一终端与所述第一用户身体其他部位间的距离。
可见,多种第二用户三维人脸模型的触发条件或者方式,使得终端功能更加便捷。
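A minimal sketch of the trigger variants described above (all names are illustrative; the distance would come from the first terminal's proximity sensor):

```python
def should_display_3d_face(screen_on: bool, model_stored: bool,
                           distance_to_user: float, distance_threshold: float,
                           enable_instruction_received: bool) -> bool:
    """Decide whether to display the second user's 3D face model during the call."""
    if screen_on and model_stored:
        return True  # variant 1: display automatically
    if model_stored and distance_to_user > distance_threshold:
        # variant 2: the 3D call mode selection interface is shown first, and the model
        # is displayed once the enable instruction entered by the user is detected
        return enable_instruction_received
    return False
```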
在一个可能的示例中,在所述根据所述模型驱动参数驱动所述第二用户的三维人脸模型之后,所述方法还包括:若检测到三维通话模式退出指令,则终止显示所述第二用户的三维人脸模型,和/或,终止根据所述第二用户的通话语音确定模型驱动参数;若检测到所述第一终端与所述第一用户的距离小于所述距离阈值时,终止根据所述第二用户的通话语音确定模型驱动参数。
可见,在检测到三维通话模式退出指令,则终止显示所述第二用户的三维人脸模型,和/或,终止根据所述第二用户的通话语音确定模型驱动参数,能降低不必要的资源消耗,运转更高效。
301、在所述第一终端的第一用户与第二终端的第二用户进行语音通话的过程中，显示所述第二用户的三维人脸模型；
如上述201所述,在此不再赘述。
302、检测所述第二用户的通话语音,对所述通话语音进行处理,得到所述第二用户的语谱图;
具体的,在所述第一终端的第一用户与第二终端的第二用户进行语音通话的过程中,获取所述第二终端的第二用户的语音通话,将所述通话语音进行处理,转换成所述第二用户的语谱图,可以包括如下步骤,比如所述第二用户的通话语音的长度为t,将其分为m帧,则每帧长为n=t/m。对其做快速傅里叶变换,将语音信号从时域变换到频域,得到所述第二用户的语谱图。
303、将所述第二用户的语谱图输入驱动参数生成模型,生成模型驱动参数,所述驱动参数包括表情参数,和姿态参数;
具体的,驱动参数生成模型为预先训练好的神经网络。将所述第二用户的语谱图输入模型驱动参数生成模型,则会生成模型驱动参数。所述模型驱动参数包括表情参数,和/或姿态参数。姿态参数包括三个,分别表示缩放尺度参数,旋转矩阵和平移矩阵。
304、根据所述模型驱动参数驱动所述第二用户的三维人脸模型,以显示所述第二用户的三维模拟通话动画,所述三维模拟通话动画呈现有与所述表情参数对应的表情动画信息,以及呈现有与所述姿态参数对应的姿态动画信息。
具体的,所述模型驱动参数可以为多组,不同组别对应不同类别的表情或者姿态。具体的,将模型驱动参数输入所述第二用户的三维人脸模型,使之随着模型驱动参数的变化,呈现出不同的表情或者姿态,以此呈现出动画的效果。
其中,所述根据所述模型驱动参数驱动所述第二用户的三维人脸模型,以显示所述第二用户的三维模拟通话动画,包括:将所述模型驱动参数输入所述第二用户的三维人脸模型,驱动所述第二用户的三维人脸模型进行动态变化,所述动态变化包括表情的变化,和/或姿态的变化。
可以看出,根据所述第二用户的通话语音生成对应的语谱图,在将所述语谱图输入预先训练好的神经网络,得到模型驱动参数,再利用所述模型驱动参数来驱动所述第二用户的三维人脸模型,使之根据不同的模型驱动参数呈现出不同的姿态,表情,实现动画的效果。操作便捷,获得的模型驱动参数准确度高。
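A minimal sketch of the per-frame loop that ties steps 302-304 together, assuming the hypothetical helpers sketched earlier in this description (spectrogram extraction, the driving-parameter generation model, face reconstruction) and a renderer provided by the terminal:

```python
def drive_face_animation(call_audio_chunks, face_model, extract_spectrogram,
                         generate_driving_params, reconstruct_face, render_frame):
    """Per-frame loop: speech chunk -> spectrogram -> driving parameters -> updated 3D face."""
    for chunk in call_audio_chunks:                     # streamed call speech of the second user
        spec = extract_spectrogram(chunk)
        beta, f, pr, t = generate_driving_params(spec)  # expression + pose (scale, rotation, translation)
        vertices = reconstruct_face(face_model, beta, f, pr, t)  # identity fixed, expression/pose vary
        render_frame(vertices)                          # one frame of the 3D simulated call animation
```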
与上图2A、图3所示的实施例一致的,请参阅图4,图4是本申请实施例提供的一种通话控制方法装置400的功能单元结构示意图,应用于第一终端,所述装置包括处理单元和通信单元,其中,如图所示,所述通话控制方法装置400包括:处理单元410,与通信单元420,其中:
所述处理单元410,用于在所述第一终端的第一用户与第二终端的第二用户通过所述通信单元420进行语音通话的过程中,显示所述第二用户的三维人脸模型;以及用于根据所述第二用户的通话语音确定模型驱动参数,所述模型驱动参数包括表情参数和姿态参数;以及用于根据所述模型驱动参数驱动所述第二用户的三维人脸模型,以显示所述第二用户的三维模拟通话动画,所述三维模拟通话动画呈现有与所述表情参数对应的表情动画信息,以及呈现有与所述姿态参数对应的姿态动画信息。
在一个可能的示例中,在所述显示所述第二用户的三维人脸模型方面,所述处理单元410,具体用于在所述第一终端的通话应用界面中显示所述第二用户的三维人脸模型。
在一个可能的示例中,在所述显示所述第二用户的三维人脸模型方面,所述处理单元410,具体用于分屏显示所述第一终端的通话应用界面和所述第二用户的三维人脸模型。
在一个可能的示例中,在所述显示所述第二用户的三维人脸模型方面,所述处理单元410,具体用于在与所述第一终端连接的第三终端上显示所述第二用户的三维人脸模型。
在一个可能的示例中,在所述根据所述模型驱动参数驱动所述第二用户的三维人脸模型,以显示所述第二用户的三维模拟通话动画方面,所述处理单元410,具体用于检测所述第二用户的通话语音,对所述通话语音进行处理,得到所述第二用户的语谱图;将所述第二用户的语谱图输入驱动参数生成模型,生成模型驱动参数,所述模型驱动参数包括表情参数,和/或姿态参数。
在一个可能的示例中,在所述驱动参数生成模型的训练数据的获取过程方面,所述处理单元410,具体用于采集M份音频,根据所述M份音频得到M份语谱图,所述M份音频为M个采集对象中每一个采集对象按照预设方式朗读多个文本库的录音音频;在采集所述M份音频时,按照预设频率采集所述M个采集对象中每一个采集对象的三维人脸数据,得到M组三维人脸数据;以所述通用三维人脸标准模型为模板,对齐所述M组三维人脸数据,得到M组对齐后的三维人脸数据,所述M组对齐后的三维人脸数据与所述通用人脸三维标准模型具有相同的顶点以及拓扑结构;将所述M份音频与所述M组对齐后的三维人脸数据进行时间的对齐校准,使得所述M组三维人脸数据中的每一组三维人脸数据与所述M份音频中对应的每一份音频在时间序列上一一对应;其中,所述驱动参数生成模型的训练数据,包括M份语谱图与所述M组对齐后的三维人脸数据。
在一个可能的示例中,在所述驱动参数生成模型的训练过程方面,所述处理单元410,具体用于将所述M份语谱图输入驱动神经网络模型,生成第一模型驱动参数集;将所述M组对齐后的三维人脸数据与所述通用人脸三维标准模型进行拟合优化,生成第二模型驱动参数集;将所述第一参数集中的参数与所述第二参数集中的参数一一对应,进行损失函数计算,得到损失函数值;当所述损失函数值小于预设第一损失阈值时,训练完成驱动参数生成模型。
在一个可能的示例中,所述处理单元410在所述第一终端的第一用户与第二终端的第二用户进行语音通话的过程中,显示所述第二用户的三维人脸模型之前,还用于获取所述第二用户的人脸图像;将所述第二用户的人脸图像输入预先训练的参数提取模型,得到所述第二用户的身份参数;将所述第二用户的身份参数输入通用三维人脸标准模型,得到所述第二用户的三维人脸模型;存储所述第二用户的三维人脸模型。
在一个可能的示例中,在所述参数提取模型的训练数据采集过程方面,处理单元410,具体用于采集X张人脸区域图像,并对所述X张人脸区域图像中的每一张人脸区域图像标注N个关键点,得到X张N个关键点标注的人脸区域图像;将所述X张标注N个关键点的人脸区域图像输入初级参数提取模型,生成X组参数,将所述X组参数输入所述通用三维人脸标准模型,生成X个三维人脸标准模型,将所述X个三维人脸标准模型进行N个关键点投影,得到X张N个关键点投影的人脸区域图像;采集Y组三维人脸扫描数据,平滑处理所述Y组三维人脸扫描数据,以所述通用人脸三维标准模型为模板,对齐所述Y组三维人脸扫描数据,将对齐后的所述Y组三维人脸扫描数据与所述通用人脸三维标准模型进行拟合优化,得到Y组通用三维人脸标准模型参数,所述参数包含所述身份参数,表情参数,姿态参数中的任意一种或多种;其中,所述Y组三维人脸扫描数据对应Y个采集对象,采集Y个采集对象中每一个采集对象的人脸图像,得到Y张人脸图像;对所述Y张人脸图像进行N个关键点标注,得到Y张N个关键点标注的人脸图像;所述参数提取模型的训练数据包括:所述X张N个关键点标注的人脸区域图像、X张N个关键点投影的人脸区域图像、Y组通用三维人脸标准模型参数,Y张人脸图像以及Y张N个关键点标注的人脸图像中的任意一个或多个。
在一个可能的示例中,在所述参数提取模型的训练过程方面,所述处理单元410,具体用于将所述参数提取模型的训练数据输入所述参数提取模型,计算各数据间的损失值,当所述各数据间的损失值小于所述第二损失阈值时,完成所述参数提取模型的训练过程。
在一个可能的示例中,所述处理单元410在所述显示所述第二用户的三维人脸模型之 前,还用于,检测到所述第一终端的屏幕状态为亮屏,且所述第一终端存储有所述第二用户的三维人脸模型;或者,检测到所述第一终端的屏幕状态为亮屏,且所述第一终端存储有所述第二用户的三维人脸模型,显示三维通话模式选择界面,检测到用户通过所述三维通话模式选择界面录入的三维通话模式开启指令;或者,检测到所述第一终端与所述第一用户的距离大于预设距离阈值,且所述第一终端存储有所述第二用户的三维人脸模型时,显示三维通话模式选择界面;检测到用户通过所述三维通话模式选择界面录入的三维通话模式开启指令。
其中,所述通话控制方法装置400还可以包括存储单元430,用于存储电子设备的程序代码和数据。所述处理单元410可以是处理器,所述通信单元420可以收发器,存储单元430可以是存储器。
可以理解的是,由于方法实施例与装置实施例为相同技术构思的不同呈现形式,因此,本申请中方法实施例部分的内容应同步适配于装置实施例部分,此处不再赘述。
图5是本申请实施例提供的第一终端500的结构示意图,如图所示,所述第一终端500包括应用处理器510、存储器520、通信接口530以及一个或多个程序521,其中,所述一个或多个程序521被存储在上述存储器520中,并且被配置由上述应用处理器510执行,所述一个或多个程序521包括用于执行以下步骤的指令:
在所述第一终端的第一用户与第二终端的第二用户进行语音通话的过程中,显示所述第二用户的三维人脸模型;
根据所述第二用户的通话语音确定模型驱动参数,所述模型驱动参数包括表情参数和姿态参数;
根据所述模型驱动参数驱动所述第二用户的三维人脸模型,以显示所述第二用户的三维模拟通话动画,所述三维模拟通话动画呈现有与所述表情参数对应的表情动画信息,以及呈现有与所述姿态参数对应的姿态动画信息。
可以看出,在两个或者多个终端用户的通话过程中,一方终端用户能够显示其他多方用户的三维人脸模型,并且一方终端用户对应的终端能够根据其他多方终端用户的通话语音,生成模型驱动参数,以此来驱动其他多方终端用户的三维人脸模型,以显示所述其他多方终端用户的三维模拟通话动画,提高了通话过程的可视化程度和功能性。便于一方终端用户在通话的过程中实时捕捉其他多方终端用户表情以及姿态的变化。
在一个可能的示例中,在所述显示所述第二用户的三维人脸模型方面,所述一个或多个程序521具体包括用于执行以下操作的指令,在所述第一终端的通话应用界面中显示所述第二用户的三维人脸模型。
在一个可能的示例中,所述显示所述第二用户的三维人脸模型,所述一个或多个程序521包括用于执行以下步骤的指令:分屏显示所述第一终端的通话应用界面和所述第二用户的三维人脸模型。
在一个可能的示例中,所述显示所述第二用户的三维人脸模型,所述一个或多个程序521包括用于执行以下步骤的指令:在与所述第一终端连接的第三终端上显示所述第二用户的三维人脸模型。
在一个可能的示例中,所述根据所述模型驱动参数驱动所述第二用户的三维人脸模型,以显示所述第二用户的三维模拟通话动画,所述一个或多个程序521包括用于执行以下步骤的指令:检测所述第二用户的通话语音,对所述通话语音进行处理,得到所述第二用户的语谱图;将所述第二用户的语谱图输入驱动参数生成模型,生成模型驱动参数,所述模型驱动参数包括表情参数,和姿态参数。
在一个可能的示例中,在所述驱动参数生成模型的训练数据的获取过程方面,所述一个或多个程序521包括用于执行以下步骤的指令:采集M份音频,根据所述M份音频得到M 份语谱图,所述M份音频为M个采集对象中每一个采集对象按照预设方式朗读多个文本库的录音音频;在采集所述M份音频时,按照预设频率采集所述M个采集对象中每一个采集对象的三维人脸数据,得到M组三维人脸数据;以所述通用三维人脸标准模型为模板,对齐所述M组三维人脸数据,得到M组对齐后的三维人脸数据,所述M组对齐后的三维人脸数据与所述通用人脸三维标准模型具有相同的顶点以及拓扑结构;将所述M份音频与所述M组对齐后的三维人脸数据进行时间的对齐校准,使得所述M组三维人脸数据中的每一组三维人脸数据与所述M份音频中对应的每一份音频在时间序列上一一对应;其中,所述驱动参数生成模型的训练数据,包括M份语谱图与所述M组对齐后的三维人脸数据。
在一个可能的示例中,在所述驱动参数生成模型的训练过程方面,所述一个或多个程序521包括用于执行以下步骤的指令:将所述M份语谱图输入驱动神经网络模型,生成第一模型驱动参数集;将所述M组对齐后的三维人脸数据与所述通用人脸三维标准模型进行拟合优化,生成第二模型驱动参数集;将所述第一参数集中的参数与所述第二参数集中的参数一一对应,进行损失函数计算,得到损失函数值;当所述损失函数值小于预设第一损失阈值时,训练完成驱动参数生成模型。
在一个可能的示例中,所述在所述第一终端的第一用户与第二终端的第二用户进行语音通话的过程中,显示所述第二用户的三维人脸模型之前,所述一个或多个程序521包括用于执行以下步骤的指令:获取所述第二用户的人脸图像;将所述第二用户的人脸图像输入预先训练的参数提取模型,得到所述第二用户的身份参数;将所述第二用户的身份参数输入通用三维人脸标准模型,得到所述第二用户的三维人脸模型;存储所述第二用户的三维人脸模型。
在一个可能的示例中,在所述参数提取模型的训练数据采集过程方面,所述一个或多个程序521包括用于执行以下步骤的指令:采集X张人脸区域图像,并对所述X张人脸区域图像中的每一张人脸区域图像标注N个关键点,得到X张N个关键点标注的人脸区域图像;将所述X张标注N个关键点的人脸区域图像输入初级参数提取模型,生成X组参数,将所述X组参数输入所述通用三维人脸标准模型,生成X个三维人脸标准模型,将所述X个三维人脸标准模型进行N个关键点投影,得到X张N个关键点投影的人脸区域图像;采集Y组三维人脸扫描数据,平滑处理所述Y组三维人脸扫描数据,以所述通用人脸三维标准模型为模板,对齐所述Y组三维人脸扫描数据,将对齐后的所述Y组三维人脸扫描数据与所述通用人脸三维标准模型进行拟合优化,得到Y组通用三维人脸标准模型参数,所述参数包含所述身份参数,表情参数,姿态参数中的任意一种或多种;其中,所述Y组三维人脸扫描数据对应Y个采集对象,采集Y个采集对象中每一个采集对象的人脸图像,得到Y张人脸图像;对所述Y张人脸图像进行N个关键点标注,得到Y张N个关键点标注的人脸图像;所述参数提取模型的训练数据包括:所述X张N个关键点标注的人脸区域图像、X张N个关键点投影的人脸区域图像、Y组通用三维人脸标准模型参数,Y张人脸图像以及Y张N个关键点标注的人脸图像中的任意一个或多个。
在一个可能的示例中,在所述参数提取模型的训练过程方面,所述一个或多个程序521包括用于执行以下步骤的指令:将所述参数提取模型的训练数据输入所述参数提取模型,计算各数据间的损失值,当所述各数据间的损失值小于所述第二损失阈值时,完成所述参数提取模型的训练过程。
在一个可能的示例中,所述显示所述第二用户的三维人脸模型之前,所述一个或多个程序521包括用于执行以下步骤的指令:检测到所述第一终端的屏幕状态为亮屏,且所述第一终端存储有所述第二用户的三维人脸模型;或者,检测到所述第一终端的屏幕状态为亮屏,且所述第一终端存储有所述第二用户的三维人脸模型,显示三维通话模式选择界面,检测到用户通过所述三维通话模式选择界面录入的三维通话模式开启指令;或者,检测到 所述第一终端与所述第一用户的距离大于预设距离阈值,且所述第一终端存储有所述第二用户的三维人脸模型时,显示三维通话模式选择界面;检测到用户通过所述三维通话模式选择界面录入的三维通话模式开启指令。
其中,处理器510可以包括一个或多个处理核心,比如4核心处理器、8核心处理器等。处理器510可以采用DSP(Digital Signal Processing,数字信号处理)、FPGA(Field-Programmable Gate Array,现场可编程门阵列)、PLA(Programmable Logic Array,可编程逻辑阵列)中的至少一种硬件形式来实现。处理器510也可以包括主处理器和协处理器,主处理器是用于对在唤醒状态下的数据进行处理的处理器,也称CPU(Central Processing Unit,中央处理器);协处理器是用于对在待机状态下的数据进行处理的低功耗处理器。在一些实施例中,处理器可以在集成有GPU(Graphics Processing Unit,图像处理器),GPU用于负责显示屏所需要显示的内容的渲染和绘制。一些实施例中,处理器510还可以包括AI(Artificial Intelligence,人工智能)处理器,该AI处理器用于处理有关机器学习的计算操作。
存储器520可以包括一个或多个计算机可读存储介质,该计算机可读存储介质可以是非暂态的。存储器520还可包括高速随机存取存储器,以及非易失性存储器,比如一个或多个磁盘存储设备、闪存存储设备。本实施例中,存储器520至少用于存储以下计算机程序,其中,该计算机程序被处理器510加载并执行之后,能够实现前述任一实施例公开的通话控制方法中的相关步骤。另外,存储器520所存储的资源还可以包括操作***和数据等,存储方式可以是短暂存储或者永久存储。其中,操作***可以包括Windows、Unix、Linux等。数据可以包括但不限于终端交互数据、终端设备信号等。
在一些实施例中,第一终端500还可包括有输入输出接口、通信接口、电源以及通信总线。
本领域技术人员可以理解,本实施例公开的结构并不构成对第一终端500的限定,可以包括更多或更少的组件。
上述主要从方法侧执行过程的角度对本申请实施例的方案进行了介绍。可以理解的是,第一终端为了实现上述功能,其包含了执行各个功能相应的硬件结构和/或软件模块。本领域技术人员应该很容易意识到,结合本文中所提供的实施例描述的各示例的单元及算法步骤,本申请能够以硬件或硬件和计算机软件的结合形式来实现。某个功能究竟以硬件还是计算机软件驱动硬件的方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
本申请实施例可以根据上述方法示例对第一终端进行功能单元的划分,例如,可以对应各个功能划分各个功能单元,也可以将两个或两个以上的功能集成在一个处理单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。需要说明的是,本申请实施例中对单元的划分是示意性的,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式。
本申请实施例还提供一种计算机存储介质,其中,该计算机存储介质存储用于电子数据交换的计算机程序,该计算机程序使得计算机执行如上述方法实施例中记载的任一方法的部分或全部步骤,上述计算机包括第一终端。
本申请实施例还提供一种计算机程序产品,上述计算机程序产品包括存储了计算机程序的非瞬时性计算机可读存储介质,上述计算机程序可操作来使计算机执行如上述方法实施例中记载的任一方法的部分或全部步骤。该计算机程序产品可以为一个软件安装包,上述计算机包括第一终端。
需要说明的是,对于前述的各方法实施例,为了简单描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本申请并不受所描述的动作顺序的限制,因为 依据本申请,某些步骤可以采用其他顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于优选实施例,所涉及的动作和模块并不一定是本申请所必须的。
在上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述的部分,可以参见其他实施例的相关描述。
在本申请所提供的几个实施例中,应该理解到,所揭露的装置,可通过其它的方式实现。例如,以上所描述的终端实施例仅仅是示意性的,例如上述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个***,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性或其它的形式。
上述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
上述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储器中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储器中,包括若干指令用以使得一台计算机设备(可为个人计算机、服务器或者网络设备等)执行本申请各个实施例上述方法的全部或部分步骤。而前述的存储器包括:U盘、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、移动硬盘、磁碟或者光盘等各种可以存储程序代码的介质。
本领域普通技术人员可以理解上述实施例的各种方法中的全部或部分步骤是可以通过程序来指令相关的硬件来完成,该程序可以存储于一计算机可读存储器中,存储器可以包括:闪存盘、只读存储器(英文:Read-Only Memory,简称:ROM)、随机存取器(英文:Random Access Memory,简称:RAM)、磁盘或光盘等。
以上所揭露的仅为本申请的部分实施例而已,当然不能以此来限定本申请之权利范围,本领域普通技术人员可以理解实现上述实施例的全部或部分流程,并依本申请权利要求所作的等同变化,仍属于本申请所涵盖的范围。

Claims (26)

  1. 一种通话控制方法,其特征在于,应用于第一终端,所述方法包括:
    在所述第一终端的第一用户与第二终端的第二用户进行语音通话的过程中,显示所述第二用户的三维人脸模型;
    根据所述第二用户的通话语音确定模型驱动参数,所述模型驱动参数包括表情参数和姿态参数;
    根据所述模型驱动参数驱动所述第二用户的三维人脸模型,以显示所述第二用户的三维模拟通话动画,所述三维模拟通话动画呈现有与所述表情参数对应的表情动画信息,以及呈现有与所述姿态参数对应的姿态动画信息。
  2. 根据权利要求1所述的方法,其特征在于,所述显示所述第二用户的三维人脸模型,包括:
    在所述第一终端的通话应用界面中显示所述第二用户的三维人脸模型。
  3. 根据权利要求1所述的方法,其特征在于,所述显示所述第二用户的三维人脸模型,包括:
    分屏显示所述第一终端的通话应用界面和所述第二用户的三维人脸模型。
  4. 根据权利要求1所述的方法,其特征在于,所述显示所述第二用户的三维人脸模型,包括:
    在与所述第一终端连接的第三终端上显示所述第二用户的三维人脸模型。
  5. 根据权利要求1-4任一项所述的方法,其特征在于,所述根据所述第二用户的通话语音确定模型驱动参数,包括:
    检测所述第二用户的通话语音,对所述通话语音进行处理,得到所述第二用户的语谱图;
    将所述第二用户的语谱图输入驱动参数生成模型,生成所述模型驱动参数。
  6. 根据权利要求5所述的方法,其特征在于,所述驱动参数生成模型的训练数据的获取过程包括以下步骤:
    采集M份音频,根据所述M份音频得到M份语谱图,所述M份音频为M个采集对象中每一个采集对象按照预设方式朗读多个文本库的录音音频;
    在采集所述M份音频时,按照预设频率采集所述M个采集对象中每一个采集对象的三维人脸数据,得到M组三维人脸数据;
    以所述通用三维人脸标准模型为模板,对齐所述M组三维人脸数据,得到M组对齐后的三维人脸数据,所述M组对齐后的三维人脸数据与所述通用人脸三维标准模型具有相同的顶点以及拓扑结构;
    将所述M份音频与所述M组对齐后的三维人脸数据进行时间的对齐校准,使得所述M组三维人脸数据中的每一组三维人脸数据与所述M份音频中对应的每一份音频在时间序列上一一对应;
    其中,所述驱动参数生成模型的训练数据,包括M份语谱图与所述M组对齐后的三维人脸数据。
  7. 根据权利要求6所述的方法,其特征在于,所述驱动参数生成模型的的训练过程包括以下步骤:
    将所述M份语谱图输入驱动参数生成模型,生成第一模型驱动参数集;
    将所述M组对齐后的三维人脸数据与所述通用人脸三维标准模型进行拟合优化,生成第二模型驱动参数集;
    将所述第一参数集中的参数与所述第二参数集中的参数一一对应,进行损失函数计算,得到损失函数值;
    当所述损失函数值小于预设第一损失阈值时,训练完成驱动参数生成模型。
  8. 根据权利要求1-7任一项所述的方法,其特征在于,所述在所述第一终端的第一用户与第二终端的第二用户进行语音通话的过程中,显示所述第二用户的三维人脸模型之前,所述方法还包括:
    获取所述第二用户的人脸图像;
    将所述第二用户的人脸图像输入预先训练的参数提取模型,得到所述第二用户的身份参数;
    将所述第二用户的身份参数输入通用三维人脸标准模型,得到所述第二用户的三维人脸模型;
    存储所述第二用户的三维人脸模型。
  9. 根据权利要求8所述的方法,其特征在于,所述参数提取模型的训练数据的采集过程包括以下步骤:
    采集X张人脸区域图像,并对所述X张人脸区域图像中的每一张人脸区域图像标注N个关键点,得到X张N个关键点标注的人脸区域图像;
    将所述X张标注N个关键点的人脸区域图像输入初级参数提取模型,生成X组参数,将所述X组参数输入所述通用三维人脸标准模型,生成X个三维人脸标准模型,将所述X个三维人脸标准模型进行N个关键点投影,得到X张N个关键点投影的人脸区域图像;
    采集Y组三维人脸扫描数据,平滑处理所述Y组三维人脸扫描数据,以所述通用人脸三维标准模型为模板,对齐所述Y组三维人脸扫描数据,将对齐后的所述Y组三维人脸扫描数据与所述通用人脸三维标准模型进行拟合优化,得到Y组通用三维人脸标准模型参数,所述参数包含所述身份参数,表情参数,姿态参数中的任意一种或多种;
    其中,所述Y组三维人脸扫描数据对应Y个采集对象,采集Y个采集对象中每一个采集对象的人脸图像,得到Y张人脸图像;
    对所述Y张人脸图像进行N个关键点标注,得到Y张N个关键点标注的人脸图像;
    所述参数提取模型的训练数据包括:所述X张N个关键点标注的人脸区域图像、X张N个关键点投影的人脸区域图像、Y组通用三维人脸标准模型参数,Y张人脸图像以及Y张N个关键点标注的人脸图像中的任意一个或多个。
  10. 根据权利要求8或9所述的方法,所述参数提取模型的训练过程包括以下步骤:
    将所述参数提取模型的训练数据输入所述参数提取模型,计算各数据间的损失值,当所述各数据间的损失值小于所述第二损失阈值时,完成所述参数提取模型的训练过程。
  11. 根据权利要求1-10任一项所述的方法,其特征在于,所述显示所述第二用户的三维人脸模型之前,所述方法还包括:
    检测到所述第一终端的屏幕状态为亮屏,且所述第一终端存储有所述第二用户的三维人脸模型;或者,
    检测到所述第一终端的屏幕状态为亮屏,且所述第一终端存储有所述第二用户的三维人脸模型,显示三维通话模式选择界面,检测到用户通过所述三维通话模式选择界面录入的三维通话模式开启指令;或者,
    检测到所述第一终端与所述第一用户的距离大于预设距离阈值,且所述第一终端存储有所述第二用户的三维人脸模型时,显示三维通话模式选择界面;检测到用户通过所述三维通话模式选择界面录入的三维通话模式开启指令。
  12. 根据权利要求11所述的方法,其特征在于,在所述根据所述模型驱动参数驱动所述第二用户的三维人脸模型之后,所述方法还包括:
    若检测到三维通话模式退出指令,则终止显示所述第二用户的三维人脸模型,和/或,终止根据所述第二用户的通话语音确定模型驱动参数;
    若检测到所述第一终端与所述第一用户的距离小于所述距离阈值时,终止根据所述第二用户的通话语音确定模型驱动参数。
  13. 一种通话控制方法装置,其特征在于,应用于第一终端,所述装置包括处理单元和通信单元,其中,
    所述处理单元,用于在所述第一终端的第一用户与第二终端的第二用户通过所述通信单元进行语音通话的过程中,显示所述第二用户的三维人脸模型;以及用于根据所述第二用户的通话语音确定模型驱动参数,所述模型驱动参数包括表情参数和姿态参数;以及用于根据所述模型驱动参数驱动所述第二用户的三维人脸模型,以显示所述第二用户的三维模拟通话动画,所述三维模拟通话动画呈现有与所述表情参数对应的表情动画信息,以及呈现有与所述姿态参数对应的姿态动画信息。
  14. 根据权利要求13所述的装置,其特征在于,所述显示所述第二用户的三维人脸模型,所述处理单元具体用于:
    在所述第一终端的通话应用界面中显示所述第二用户的三维人脸模型。
  15. 根据权利要求13所述的装置,其特征在于,所述显示所述第二用户的三维人脸模型,所述处理单元具体用于:
    分屏显示所述第一终端的通话应用界面和所述第二用户的三维人脸模型。
  16. 根据权利要求13所述的装置,其特征在于,所述显示所述第二用户的三维人脸模型,所述处理单元具体用于:
    在与所述第一终端连接的第三终端上显示所述第二用户的三维人脸模型。
  17. 根据权利要求13-16任一项所述的装置,其特征在于,所述根据所述第二用户的通话语音确定模型驱动参数,所述处理单元具体用于:
    检测所述第二用户的通话语音,对所述通话语音进行处理,得到所述第二用户的语谱图;
    将所述第二用户的语谱图输入驱动参数生成模型,生成所述模型驱动参数。
  18. 根据权利要求17所述的装置,其特征在于,所述驱动参数生成模型的训练数据的获取过程包括以下步骤:
    采集M份音频,根据所述M份音频得到M份语谱图,所述M份音频为M个采集对象中每一个采集对象按照预设方式朗读多个文本库的录音音频;
    在采集所述M份音频时,按照预设频率采集所述M个采集对象中每一个采集对象的三维人脸数据,得到M组三维人脸数据;
    以所述通用三维人脸标准模型为模板,对齐所述M组三维人脸数据,得到M组对齐后的三维人脸数据,所述M组对齐后的三维人脸数据与所述通用人脸三维标准模型具有相同的顶点以及拓扑结构;
    将所述M份音频与所述M组对齐后的三维人脸数据进行时间的对齐校准,使得所述M组三维人脸数据中的每一组三维人脸数据与所述M份音频中对应的每一份音频在时间序列上一一对应;
    其中,所述驱动参数生成模型的训练数据,包括M份语谱图与所述M组对齐后的三维人脸数据。
  19. 根据权利要求18所述的装置,其特征在于,所述驱动参数生成模型的的训练过程包括以下步骤:
    将所述M份语谱图输入驱动参数生成模型,生成第一模型驱动参数集;
    将所述M组对齐后的三维人脸数据与所述通用人脸三维标准模型进行拟合优化,生成第二模型驱动参数集;
    将所述第一参数集中的参数与所述第二参数集中的参数一一对应,进行损失函数计算, 得到损失函数值;
    当所述损失函数值小于预设第一损失阈值时,训练完成驱动参数生成模型。
  20. 根据权利要求13-19任一项所述的装置,其特征在于,所述在所述第一终端的第一用户与第二终端的第二用户进行语音通话的过程中,显示所述第二用户的三维人脸模型之前,所述处理单元还用于:
    获取所述第二用户的人脸图像;
    将所述第二用户的人脸图像输入预先训练的参数提取模型,得到所述第二用户的身份参数;
    将所述第二用户的身份参数输入通用三维人脸标准模型,得到所述第二用户的三维人脸模型;
    存储所述第二用户的三维人脸模型。
  21. 根据权利要求20所述的装置,其特征在于,所述参数提取模型的训练数据的采集过程包括以下步骤:
    采集X张人脸区域图像,并对所述X张人脸区域图像中的每一张人脸区域图像标注N个关键点,得到X张N个关键点标注的人脸区域图像;
    将所述X张标注N个关键点的人脸区域图像输入初级参数提取模型,生成X组参数,将所述X组参数输入所述通用三维人脸标准模型,生成X个三维人脸标准模型,将所述X个三维人脸标准模型进行N个关键点投影,得到X张N个关键点投影的人脸区域图像;
    采集Y组三维人脸扫描数据,平滑处理所述Y组三维人脸扫描数据,以所述通用人脸三维标准模型为模板,对齐所述Y组三维人脸扫描数据,将对齐后的所述Y组三维人脸扫描数据与所述通用人脸三维标准模型进行拟合优化,得到Y组通用三维人脸标准模型参数,所述参数包含所述身份参数,表情参数,姿态参数中的任意一种或多种;
    其中,所述Y组三维人脸扫描数据对应Y个采集对象,采集Y个采集对象中每一个采集对象的人脸图像,得到Y张人脸图像;
    对所述Y张人脸图像进行N个关键点标注,得到Y张N个关键点标注的人脸图像;
    所述参数提取模型的训练数据包括:所述X张N个关键点标注的人脸区域图像、X张N个关键点投影的人脸区域图像、Y组通用三维人脸标准模型参数,Y张人脸图像以及Y张N个关键点标注的人脸图像中的任意一个或多个。
  22. 根据权利要求20或21所述的装置,所述参数提取模型的训练过程包括以下步骤:
    将所述参数提取模型的训练数据输入所述参数提取模型,计算各数据间的损失值,当所述各数据间的损失值小于所述第二损失阈值时,完成所述参数提取模型的训练过程。
  23. 根据权利要求13-22任一项所述的装置,其特征在于,所述显示所述第二用户的三维人脸模型之前,所述处理单元还用于:
    检测到所述第一终端的屏幕状态为亮屏,且所述第一终端存储有所述第二用户的三维人脸模型;或者,
    检测到所述第一终端的屏幕状态为亮屏,且所述第一终端存储有所述第二用户的三维人脸模型,显示三维通话模式选择界面,检测到用户通过所述三维通话模式选择界面录入的三维通话模式开启指令;或者,
    检测到所述第一终端与所述第一用户的距离大于预设距离阈值,且所述第一终端存储有所述第二用户的三维人脸模型时,显示三维通话模式选择界面;检测到用户通过所述三维通话模式选择界面录入的三维通话模式开启指令。
  24. 根据权利要求23所述的装置,其特征在于,在所述根据所述模型驱动参数驱动所述第二用户的三维人脸模型之后,所述处理单元还用于:
    若检测到三维通话模式退出指令,则终止显示所述第二用户的三维人脸模型,和/或, 终止根据所述第二用户的通话语音确定模型驱动参数;
    若检测到所述第一终端与所述第一用户的距离小于所述距离阈值时,终止根据所述第二用户的通话语音确定模型驱动参数。
  25. 一种第一终端,其特征在于,包括处理器、存储器,所述存储器用于存储一个或多个程序,并且被配置由所述处理器执行,所述程序包括用于执行如权利要求1-12任一项所述的方法中的步骤的指令。
  26. 一种计算机可读存储介质,其特征在于,存储用于电子数据交换的计算机程序,其中,所述计算机程序使得计算机执行如权利要求1-12任一项所述的方法。
PCT/CN2020/123910 2019-10-31 2020-10-27 通话控制方法及相关产品 WO2021083125A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP20881081.2A EP4054161A1 (en) 2019-10-31 2020-10-27 Call control method and related product
US17/733,539 US20220263934A1 (en) 2019-10-31 2022-04-29 Call control method and related product

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911053845.6A CN110809090A (zh) 2019-10-31 2019-10-31 通话控制方法及相关产品
CN201911053845.6 2019-10-31

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/733,539 Continuation US20220263934A1 (en) 2019-10-31 2022-04-29 Call control method and related product

Publications (1)

Publication Number Publication Date
WO2021083125A1 true WO2021083125A1 (zh) 2021-05-06

Family

ID=69489844

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/123910 WO2021083125A1 (zh) 2019-10-31 2020-10-27 通话控制方法及相关产品

Country Status (4)

Country Link
US (1) US20220263934A1 (zh)
EP (1) EP4054161A1 (zh)
CN (1) CN110809090A (zh)
WO (1) WO2021083125A1 (zh)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10880433B2 (en) * 2018-09-26 2020-12-29 Rovi Guides, Inc. Systems and methods for curation and delivery of content for use in electronic calls
CN110809090A (zh) * 2019-10-31 2020-02-18 Oppo广东移动通信有限公司 通话控制方法及相关产品
CN112788174B (zh) * 2020-05-31 2022-07-15 深圳市睿耳电子有限公司 一种无线耳机智能找回方法及相关装置
CN113934289A (zh) 2020-06-29 2022-01-14 北京字节跳动网络技术有限公司 数据处理方法、装置、可读介质及电子设备
CN112102468B (zh) * 2020-08-07 2022-03-04 北京汇钧科技有限公司 模型训练、虚拟人物图像生成方法和装置以及存储介质
CN112527115B (zh) * 2020-12-15 2023-08-04 北京百度网讯科技有限公司 用户形象生成方法、相关装置及计算机程序产品
CN112907706A (zh) * 2021-01-31 2021-06-04 云知声智能科技股份有限公司 基于多模态的声音驱动动漫视频生成方法、装置及***
CN115083337B (zh) * 2022-07-08 2023-05-16 深圳市安信泰科技有限公司 一种led显示驱动***及方法
CN117478818A (zh) * 2023-12-26 2024-01-30 荣耀终端有限公司 语音通话方法、终端和存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101789990A (zh) * 2009-12-23 2010-07-28 宇龙计算机通信科技(深圳)有限公司 一种在通话过程中判断对方情绪的方法及移动终端
CN103279970A (zh) * 2013-05-10 2013-09-04 中国科学技术大学 一种实时的语音驱动人脸动画的方法
CN104935860A (zh) * 2014-03-18 2015-09-23 北京三星通信技术研究有限公司 视频通话实现方法及装置
WO2018108013A1 (zh) * 2016-12-14 2018-06-21 中兴通讯股份有限公司 一种媒体显示方法及终端
CN110809090A (zh) * 2019-10-31 2020-02-18 Oppo广东移动通信有限公司 通话控制方法及相关产品

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109637522B (zh) * 2018-12-26 2022-12-09 杭州电子科技大学 一种基于语谱图提取深度空间注意特征的语音情感识别方法

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101789990A (zh) * 2009-12-23 2010-07-28 宇龙计算机通信科技(深圳)有限公司 一种在通话过程中判断对方情绪的方法及移动终端
CN103279970A (zh) * 2013-05-10 2013-09-04 中国科学技术大学 一种实时的语音驱动人脸动画的方法
CN104935860A (zh) * 2014-03-18 2015-09-23 北京三星通信技术研究有限公司 视频通话实现方法及装置
WO2018108013A1 (zh) * 2016-12-14 2018-06-21 中兴通讯股份有限公司 一种媒体显示方法及终端
CN110809090A (zh) * 2019-10-31 2020-02-18 Oppo广东移动通信有限公司 通话控制方法及相关产品

Also Published As

Publication number Publication date
EP4054161A1 (en) 2022-09-07
US20220263934A1 (en) 2022-08-18
CN110809090A (zh) 2020-02-18

Similar Documents

Publication Publication Date Title
WO2021083125A1 (zh) 通话控制方法及相关产品
US11605193B2 (en) Artificial intelligence-based animation character drive method and related apparatus
CN110688911B (zh) 视频处理方法、装置、***、终端设备及存储介质
WO2021036644A1 (zh) 一种基于人工智能的语音驱动动画方法和装置
WO2021109678A1 (zh) 视频生成方法、装置、电子设备及存储介质
RU2488232C2 (ru) Сеть связи и устройства для преобразования текста в речь и текста в анимацию лица
US20140129207A1 (en) Augmented Reality Language Translation
CN112379812A (zh) 仿真3d数字人交互方法、装置、电子设备及存储介质
US11455765B2 (en) Method and apparatus for generating virtual avatar
KR102193029B1 (ko) 디스플레이 장치 및 그의 화상 통화 수행 방법
KR101944112B1 (ko) 사용자 저작 스티커를 생성하는 방법 및 장치, 사용자 저작 스티커 공유 시스템
CN110599359B (zh) 社交方法、装置、***、终端设备及存储介质
CN111538456A (zh) 基于虚拟形象的人机交互方法、装置、终端以及存储介质
CN110162598B (zh) 一种数据处理方法和装置、一种用于数据处理的装置
CN110148406B (zh) 一种数据处理方法和装置、一种用于数据处理的装置
US20210192192A1 (en) Method and apparatus for recognizing facial expression
JP2022518520A (ja) 画像変形の制御方法、装置およびハードウェア装置
CN113536007A (zh) 一种虚拟形象生成方法、装置、设备以及存储介质
CN110825164A (zh) 基于儿童专用穿戴智能设备的交互方法及***
CN111327772A (zh) 进行自动语音应答处理的方法、装置、设备及存储介质
WO2021114682A1 (zh) 会话任务生成方法、装置、计算机设备和存储介质
CN112449098B (zh) 一种拍摄方法、装置、终端及存储介质
WO2021155666A1 (zh) 用于生成图像的方法和装置
CN117370605A (zh) 一种虚拟数字人驱动方法、装置、设备和介质
CN112331209A (zh) 一种语音转文本的方法、装置、电子设备及可读存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20881081

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2020881081

Country of ref document: EP

Effective date: 20220530