CN107679519A - Virtual-human-based multi-modal interaction processing method and system - Google Patents
Publication number: CN107679519A · Application: CN201711026544.5A · Authority: CN (China) · Legal status: Pending
Classifications
- G06V20/64 — Scenes; scene-specific elements; three-dimensional objects
- G06T17/00 — Three-dimensional [3D] modelling, e.g. data description of 3D objects
- G06T19/20 — Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts
- G06V40/161 — Human faces: detection; localisation; normalisation
- G06V40/168 — Human faces: feature extraction; face representation
- G06V40/174 — Facial expression recognition
- G06T2219/2016 — Indexing scheme for editing of 3D models: rotation, translation, scaling
Abstract
The present application provides a virtual-human-based multi-modal interaction processing method and system. Multi-modal data of an imitated person is acquired, the imitated person's three-dimensional face image is extracted from the multi-modal data, the image is parsed to determine its key points, and those key points are bound to the corresponding nodes of the virtual human's three-dimensional face model. The imitated person's three-dimensional face image is then acquired and parsed in real time, the key points obtained from parsing are mapped onto the nodes of the virtual human's three-dimensional face model, and imitation data is generated and output. As a result, the virtual human's three-dimensional face model can present a lifelike, fluent human-computer interaction effect and improve the user experience.
Description
Technical field
The present application relates to the field of artificial intelligence, and in particular to a virtual-human-based multi-modal interaction processing method and system, a virtual human, and a storage medium.
Background art
With the continuous development of science and technology and the introduction of information technology, computer technology and artificial intelligence technology, robotics research has gradually moved beyond the industrial domain and extended into fields such as medical care, health care, the home, entertainment and the service industry. Accordingly, people's expectations of robots have risen from simple, repetitive mechanical actions to intelligent robots capable of anthropomorphic question answering, autonomy and interaction with other robots, and human-computer interaction has thus become a key factor in the development of intelligent robots.
Current robots include physical robots that possess a body and virtual robots installed on hardware devices. Virtual robots in the prior art cannot conduct multi-modal interaction: they always present a fixed, unchanging state and cannot achieve a lifelike, fluent, anthropomorphic interaction effect.
Improving the interaction and presentation capabilities of virtual robots is therefore a major problem that urgently needs to be solved.
Summary of the invention
In view of this, the present application provides a virtual-human-based multi-modal interaction processing method and system, a virtual human, and a storage medium, so as to solve the technical deficiencies in the prior art.
In one aspect, the present application provides a virtual-human-based multi-modal interaction processing method, wherein the virtual human runs on a smart device, the method comprising:
acquiring multi-modal data of an imitated person;
extracting the imitated person's three-dimensional face image from the multi-modal data;
parsing the three-dimensional face image and determining its key points;
binding the key points to the corresponding nodes of the virtual human's three-dimensional face model;
acquiring and parsing the imitated person's three-dimensional face image in real time;
mapping the key points of the imitated person's three-dimensional face image obtained from parsing onto the nodes of the virtual human's three-dimensional face model, and generating imitation data;
outputting the imitation data.
Optionally, before acquiring the multi-modal data of the imitated person, the method comprises: waking up the virtual human and displaying it in a preset display area.
Optionally, the virtual human is generated from a high-polygon 3D model and possesses a preset appearance and skills; the virtual human comprises an application program or executable file running on the smart device, or a hologram projected by the smart device.
Optionally, the operating system used by the smart device includes a WINDOWS system, a MAC OS system, or a system built into a holographic device.
Optionally, the preset display area includes the display interface of the smart device or the projection area of the smart device.
Optionally, outputting the imitation data comprises: causing the virtual human's three-dimensional face model to imitate the imitated person according to the received imitation data, while outputting multi-modal interaction data.
Optionally, extracting the imitated person's three-dimensional face image from the multi-modal data comprises:
parsing the multi-modal data of the imitated person to decide on and output multi-modal interaction data, where the parsing includes semantic understanding, visual recognition, affective computing and cognitive computing;
when the parsing result contains an intention to use the imitation skill, enabling the imitation skill and starting an acquisition device to acquire the imitated person's three-dimensional face image.
Optionally, when the parsing determines that the imitated person's head or face has rotated, mapping the key points of the imitated person's three-dimensional face image obtained from parsing onto the nodes of the virtual human's three-dimensional face model and generating imitation data comprises:
determining a rotation matrix from the differences between the key points of the three-dimensional face images before and after the imitated person's head rotation, as obtained from parsing;
determining the rotation angle of the imitated person's head rotation from the rotation matrix;
controlling the virtual human's three-dimensional face model according to the rotation angle so as to imitate the imitated person's head rotation.
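The rotation-matrix and rotation-angle determinations above can be sketched as follows. This is a minimal illustration, assuming the key points before and after the head rotation are available as corresponding 3-D point sets; the Kabsch algorithm used here is one standard way to recover a rigid rotation from such correspondences, not necessarily the exact procedure of the present application:

```python
import numpy as np

def rotation_from_keypoints(before, after):
    """Estimate the rigid rotation mapping keypoints `before` to `after`
    (Kabsch algorithm). Both arguments are (N, 3) arrays of 3-D face keypoints."""
    p = before - before.mean(axis=0)        # center both point clouds
    q = after - after.mean(axis=0)
    h = p.T @ q                             # cross-covariance matrix
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(vt.T @ u.T))  # guard against reflections
    return vt.T @ np.diag([1.0, 1.0, d]) @ u.T

def rotation_angle(r):
    """Overall rotation angle (radians) encoded by rotation matrix `r`."""
    return np.arccos(np.clip((np.trace(r) - 1.0) / 2.0, -1.0, 1.0))
```

The recovered matrix and angle would then drive the virtual human's head nodes by the same rotation.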
Optionally, the key points include: distribution points of the bones, muscles and/or facial features of the face.
Optionally, the nodes of the virtual human's three-dimensional face model include: the eyebrows, eyes, eyelids, mouth and/or corners of the mouth.
In another aspect, the present application further provides a virtual-human-based multi-modal interaction processing system comprising a smart device and a server, wherein the smart device comprises an acquisition module, a generation module and an output module, and the server comprises an extraction module, a determination module, a binding module and a parsing module, wherein:
the acquisition module is configured to acquire multi-modal data of an imitated person;
the extraction module is configured to extract the imitated person's three-dimensional face image from the multi-modal data;
the determination module is configured to parse the three-dimensional face image and determine its key points;
the binding module is configured to bind the key points to the corresponding nodes of the virtual human's three-dimensional face model;
the parsing module is configured to acquire and parse the imitated person's three-dimensional face image in real time;
the generation module is configured to map the key points of the imitated person's three-dimensional face image obtained from parsing onto the nodes of the virtual human's three-dimensional face model and generate imitation data;
the output module is configured to output the imitation data.
Optionally, the output module is configured to cause the virtual human's three-dimensional face model to imitate the imitated person according to the received imitation data, while outputting multi-modal interaction data.
Optionally, the server comprises:
a data parsing module configured to parse the multi-modal data of the imitated person so as to decide on and output multi-modal interaction data, where the parsing includes semantic understanding, visual recognition, affective computing and cognitive computing;
an image acquisition module configured to enable the imitation skill and start an acquisition device to acquire the imitated person's three-dimensional face image when the parsing result contains an intention to use the imitation skill.
Optionally, the generation module comprises:
a rotation matrix determination submodule configured, when the parsing determines that the imitated person's head or face has rotated, to determine a rotation matrix from the differences between the key points of the three-dimensional face images before and after the imitated person's head rotation, as obtained from parsing;
a rotation angle determination submodule configured to determine the rotation angle of the imitated person's head rotation from the rotation matrix;
a head rotation imitation submodule configured to control the virtual human's three-dimensional face model according to the rotation angle so as to imitate the imitated person's head rotation.
In another aspect, the present application further provides a virtual human that performs the above virtual-human-based multi-modal interaction processing method.
In another aspect, the present application further provides a storage medium storing computer instructions that, when executed, perform the above virtual-human-based multi-modal interaction processing method.
In the virtual-human-based multi-modal interaction processing method and system, virtual human and storage medium provided by the present application, multi-modal data of an imitated person is acquired; the imitated person's three-dimensional face image is extracted from the multi-modal data; the image is parsed to determine its key points; the key points are bound to the corresponding nodes of the virtual human's three-dimensional face model; the imitated person's three-dimensional face image is acquired and parsed in real time; the key points obtained from parsing are mapped onto the nodes of the virtual human's three-dimensional face model; and imitation data is generated and output, so that the virtual human's three-dimensional face model can present a lifelike, fluent human-computer interaction effect and improve the user experience.
Brief description of the drawings
Fig. 1 is a schematic structural diagram of a virtual-human-based multi-modal interaction processing system provided by an embodiment of the present application;
Fig. 2 is a flowchart of a virtual-human-based multi-modal interaction processing method provided by an embodiment of the present application;
Fig. 3 is a flowchart of a virtual-human-based multi-modal interaction processing method provided by an embodiment of the present application;
Fig. 4 is a flowchart of a virtual-human-based multi-modal interaction processing method provided by an embodiment of the present application;
Fig. 5 is a flowchart of a virtual-human-based multi-modal interaction processing method provided by an embodiment of the present application;
Fig. 6 is a schematic structural diagram of a virtual-human-based multi-modal interaction processing system provided by an embodiment of the present application;
Fig. 7 is a schematic structural diagram of a virtual-human-based multi-modal interaction processing system provided by an embodiment of the present application;
Fig. 8 is a schematic structural diagram of a virtual-human-based multi-modal interaction processing system provided by an embodiment of the present application.
Detailed description of the embodiments
Many specific details are set forth in the following description to facilitate a full understanding of the present application. However, the present application can be implemented in many ways other than those described here, and those skilled in the art can make similar extensions without departing from the spirit of the present application; the present application is therefore not limited by the specific implementations disclosed below.
The present application provides a virtual-human-based multi-modal interaction processing method and system, a virtual human, and a storage medium, which are described in detail one by one in the following embodiments.
In the present application, the virtual human runs on a smart device. The smart device may be an intelligent computing device such as a desktop computer, a notebook or palmtop computer, a mobile smart device, or an intelligent holographic projection device; mobile smart devices include smartphones, intelligent robots and the like.
The attributes possessed by the virtual human may include a virtual human identifier, social attributes, personality attributes, character skills and the like. Specifically, the social attributes may include attribute fields such as appearance, name, gender, birthplace, age, family relationships, occupation, position, religious belief, emotional state and educational background; the personality attributes may include attribute fields such as character and temperament; and the character skills may include professional skills such as singing, dancing, storytelling and training.
In the present application, the virtual human's attributes make the parsing of multi-modal interaction and the decision results more inclined toward, or better suited to, that virtual human. The system can invoke the attribute information to control the virtual human's wake-up, active, de-wake and log-out states; these belong to the additional attribute information that distinguishes a virtual human from a real person.
In the present application, the intelligent holographic projection device may use a system built into the holographic device, while other smart devices may use a WINDOWS system or a MAC OS system. Accordingly, the virtual human may be a hologram produced by intelligent holographic projection, or an application program or executable file running on the smart device.
Referring to Fig. 1, which is a schematic structural diagram of the virtual-human-based multi-modal interaction system of the embodiment of the present application, the system includes a smart device 120 and a server, and the server may be a cloud brain 110.
The smart device 120 may include a user interface 121, a communication module 122, a central processing unit 123 and a human-computer interaction input/output module 124. The user interface 121 displays the woken-up virtual human in a preset display area. The human-computer interaction input/output module 124 acquires multi-modal data and outputs the virtual human's execution parameters; the multi-modal data includes data from the surrounding environment and multi-modal input data from interaction with the user (at least including facial image information). The communication module 122 invokes the virtual human capability interfaces and receives the multi-modal output data decided from the multi-modal input data parsed by those interfaces. The central processing unit 123 uses the target-face and virtual-human relative position information in the multi-modal output data to calculate the execution parameters by which the virtual human's head rotates toward the direction of the target face.
The cloud brain 110 possesses a multi-modal data parsing module (also called the "virtual human capability interfaces"), which parses the multi-modal data sent by the smart device 120 and decides on multi-modal output data; the multi-modal output data includes the target-face and virtual-human relative position information.
As shown in Fig. 1, each capability interface invokes its corresponding logical processing during multi-modal data parsing. The interfaces are explained below.
The semantic understanding interface 111 receives the voice information forwarded from the communication module 122 and performs voice recognition and natural language processing on it based on a large corpus.
The visual recognition interface 112 can perform video content detection, recognition, tracking and the like on human bodies, faces, scenes and so on according to computer vision algorithms and deep learning algorithms; that is, images are recognized according to predetermined algorithms to give quantitative detection results. It possesses image preprocessing functions, feature extraction functions, decision functions and concrete application functions. Image preprocessing may be basic processing of the acquired visual data, including color space conversion, edge extraction, image transformation and image thresholding; feature extraction can extract feature information such as the skin color, color, texture, motion and coordinates of a target in the image; decision-making distributes the feature information, according to a certain decision strategy, to the concrete applications that need it; and the concrete application functions implement functions such as face detection, human limb recognition and motion detection.
The affective computing interface 114 receives the multi-modal data forwarded from the communication module 122 and calculates the user's current emotional state using affective computing logic (which may be emotion recognition technology). Emotion recognition technology is an important component of affective computing; its research covers facial expressions, voice, behavior, text and physiological signal recognition, through which the user's emotional state can be judged. Emotion recognition technology may monitor the user's emotional state through visual emotion recognition alone, or through a combination of visual emotion recognition and acoustic emotion recognition, and is not limited thereto. In this embodiment, the combination of the two is preferably used to monitor emotion.
When performing visual emotion recognition, the affective computing interface 114 collects images of human facial expressions using an image acquisition device, converts them into analyzable data, and then performs expression and emotion analysis using techniques such as image processing. Understanding facial expressions usually requires detecting subtle changes in expression, such as changes in the cheek muscles and mouth, or raised eyebrows.
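One such subtle change, a raised eyebrow, could be detected from two landmarks as in the toy sketch below; the landmark names and the 1.15 ratio are illustrative assumptions, not values given by the present application:

```python
def eyebrow_raised(landmarks, baseline, ratio=1.15):
    """Crude expression cue: the eyebrow is judged 'raised' when the current
    brow-to-eye distance exceeds the neutral `baseline` distance by `ratio`.
    `landmarks` maps names to (x, y) points; the names are illustrative."""
    bx, by = landmarks["left_brow"]
    ex, ey = landmarks["left_eye"]
    dist = ((bx - ex) ** 2 + (by - ey) ** 2) ** 0.5
    return dist > ratio * baseline
```

A production system would of course use many landmarks and a trained classifier rather than a single distance ratio.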
The cognitive computing interface 113 receives the multi-modal data forwarded from the communication module 122 and performs data acquisition, recognition and learning on it in order to obtain user profiles, knowledge graphs and the like, so as to make rational decisions on the multi-modal output data.
The above is a schematic technical solution of the virtual-human-based multi-modal interaction system of the embodiment of the present application. To help those skilled in the art understand the technical solution of the present application, the virtual-human-based multi-modal interaction processing method and system, the virtual human and the storage medium of the present application are described in further detail below through a number of embodiments.
Referring to Fig. 2, an embodiment of the present application provides a virtual-human-based multi-modal interaction processing method, wherein the virtual human runs on a smart device, comprising steps 201 to 207.
Step 201: acquire the multi-modal data of the imitated person.
In the embodiment of the present application, the imitated person is the user who communicates with the virtual human; when the virtual human carries the appearance of a celebrity, the imitated person may be a fan of that celebrity.
The multi-modal data may be collected from the imitated person's natural language, visual perception, tactile perception, spoken voice, emotional expressions, actions and the like.
Optionally, before acquiring the multi-modal data of the imitated person, the method comprises: waking up the virtual human and displaying it in a preset display area.
In the embodiment of the present application, the virtual human is generated from a high-polygon 3D model and possesses a preset appearance and skills; for example, the virtual human may have the appearance of a Chinese woman and possess the skill of imitating facial expressions.
The preset display area may include the display interface of a smart device or the projection area of an intelligent holographic projection device.
In the embodiment of the present application, the virtual human may be in standby, sleep or similar modes, and is woken up automatically or manually when face imitation is needed. For example, the virtual human may be an application program running on a smartphone that, once opened, displays the facial image of a famous Chinese movie star and performs imitation by acquiring the imitated person's facial expressions; when the application is not in use it enters a temporary resting state in the background, and when it is needed again it can be switched back manually from the background, whereupon the virtual human running in the application is woken up.
In addition, the virtual human may also be a hologram projected by an intelligent holographic projection device, in which case the projection area of the hologram is the display area of the virtual human.
Step 202: extract the imitated person's three-dimensional face image from the multi-modal data.
Referring to Fig. 3, in the embodiment of the present application, extracting the imitated person's three-dimensional face image from the multi-modal data comprises steps 301 to 302.
Step 301: parse the multi-modal data of the imitated person to decide on and output multi-modal interaction data.
The parsing includes semantic understanding, visual recognition, affective computing and cognitive computing. That is, the multi-modal data includes data from the surrounding environment and multi-modal data from interaction with the imitated person; the virtual human capability interfaces are invoked to parse the multi-modal data from interaction with the imitated person, and multi-modal interaction data is decided on and output.
In the embodiment of the present application, the imitated person's data is parsed by the server to generate the multi-modal data, which is then transmitted to the smart device on which the virtual human runs.
Step 302: when the parsing result contains an intention to use the imitation skill, enable the imitation skill and start the acquisition device to acquire the imitated person's three-dimensional face image.
In the embodiment of the present application, when the smart device receives the parsing result and recognizes that it contains an intention to use the imitation skill, it starts the acquisition device to acquire the imitated person's three-dimensional face image; the acquisition device may be a video camera, webcam or the like, either built into or external to the smart device.
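The intention check in step 302 might be sketched as below; the dictionary fields and keyword list are illustrative assumptions, since the present application leaves the parsing result's format to the capability interfaces:

```python
def has_imitate_intent(parsed_result):
    """Return True when the parsed multi-modal result carries an
    'imitate skill' intention. The keyword matching here is only a
    stand-in for the semantic understanding interface described above."""
    keywords = ("imitate", "copy my face", "mimic")
    text = parsed_result.get("utterance", "").lower()
    return parsed_result.get("intent") == "imitate" or any(k in text for k in keywords)
```

When this check passes, the device would enable the imitation skill and start the camera.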
Step 203: parse the three-dimensional face image and determine its key points.
In the embodiment of the present application, the key points may include distribution points of the bones, muscles and/or facial features of the face, and each distribution point is bound to a corresponding coordinate point.
Step 204: bind the key points to the corresponding nodes of the virtual human's three-dimensional face model.
In the embodiment of the present application, the nodes of the virtual human's three-dimensional face model may include the eyebrows, eyes, eyelids, mouth and/or corners of the mouth, each individually controlled by a node; the recognized node data is then bound to the corresponding key points of the imitated person's facial image.
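The binding in step 204 can be pictured as a simple name-based association; the key-point and node names here are illustrative assumptions:

```python
def bind_keypoints(face_keypoints, model_nodes):
    """Bind each detected face keypoint to the virtual human's model node
    of the same name; unmatched entries are skipped. Both arguments map
    names to positions, and the names are illustrative."""
    return {name: (face_keypoints[name], model_nodes[name])
            for name in face_keypoints if name in model_nodes}
```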
Step 205: acquire and parse the imitated person's three-dimensional face image in real time.
Step 206: map the key points of the imitated person's three-dimensional face image obtained from parsing onto the nodes of the virtual human's three-dimensional face model, and generate imitation data.
In the embodiment of the present application, when the imitated person's three-dimensional face image changes, the distribution points of the bones, muscles and/or facial features of the face also change with the movement of the corresponding coordinate points, and the nodes of the virtual human's three-dimensional face model bound to those distribution points change synchronously, generating a series of imitation data.
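The synchronous update described above can be sketched as propagating each key point's displacement to its bound node; the coordinate names are illustrative, and a real system would also rescale between the face-image and model coordinate spaces:

```python
def map_to_model(prev_kp, curr_kp, node_positions, bindings):
    """Propagate each keypoint's frame-to-frame displacement to its bound
    model node, producing one frame of 'imitation data'. All dicts map
    names to (x, y, z) tuples; `bindings` maps keypoint name -> node name."""
    frame = {}
    for kp_name, node_name in bindings.items():
        delta = tuple(c - p for c, p in zip(curr_kp[kp_name], prev_kp[kp_name]))
        frame[node_name] = tuple(n + d for n, d in zip(node_positions[node_name], delta))
    return frame
```

Run once per acquired frame, this yields the series of imitation data the output step consumes.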
Step 207: output the imitation data.
In the embodiment of the present application, the virtual human's three-dimensional face model imitates the imitated person according to the series of received imitation data; for example, the virtual human can imitate blinking, mouth shapes, head rotation and so on.
In the virtual-human-based multi-modal interaction processing method provided by the present application, the imitated person's three-dimensional face image is acquired and parsed in real time, the key points obtained from parsing are mapped onto the nodes of the virtual human's three-dimensional face model, imitation data is generated and finally output, so that the virtual human's three-dimensional face model can present a lifelike, fluent human-computer interaction effect and improve the user experience.
Referring to Fig. 4, an embodiment of the present application provides a virtual-human-based multi-modal interaction processing method comprising steps 401 to 409.
Step 401: wake up the virtual human and display it in a preset display area.
In the embodiment of the present application, the virtual human takes a highly realistic 3D virtual character image as its main user interface and possesses distinctive character features in appearance; it supports multi-modal human-computer interaction and possesses AI capabilities such as natural language understanding, visual perception, tactile perception, spoken voice output, and emotional expression and action output.
Step 402: acquire the multi-modal data of the imitated person.
Step 403: parse the multi-modal data of the imitated person to decide on and output multi-modal interaction data.
Step 404: when the parsing result contains an intention to use the imitation skill, enable the imitation skill and start the acquisition device to acquire the imitated person's three-dimensional face image.
Step 405: parse the three-dimensional face image and determine its key points.
Step 406: bind the key points to the corresponding nodes of the virtual human's three-dimensional face model.
Step 407: acquire and parse the imitated person's three-dimensional face image in real time.
Step 408: map the key points of the imitated person's three-dimensional face image obtained from parsing onto the nodes of the virtual human's three-dimensional face model, and generate imitation data.
Step 409: the virtual human's three-dimensional face model imitates the imitated person according to the received imitation data, while outputting multi-modal interaction data.
In the multi-modal interaction processing method provided by this embodiment, the virtual human is deployed on a smart device that supports perception, control, and other input/output modules, and its social attributes, personality attributes, character skills, and the like are configured as needed, giving the user an intelligent and personalized experience.
Referring to Fig. 5, taking as an example a virtual human that runs on a smartphone and imitates head rotation, an embodiment of the present application provides a multi-modal interaction processing method based on a virtual human, comprising steps 501 to 510.
Step 501: Wake the virtual human and display it in a preset display area on the smartphone.
In this embodiment, the virtual human may be generated from a high-poly 3D model and possess a preset appearance and skills, for example the likeness of a famous Chinese movie actress whose facial expressions can be imitated. An APP installed on the smartphone is opened, the virtual human runs inside the APP, the virtual human is woken, and it is displayed in the APP's preset display area, for example the center of the smartphone screen.
Step 502: Acquire multi-modal data of the imitated person.
In this embodiment, the imitated person may be a family member, Xiao A; the following description takes the virtual human modeling Xiao A's three-dimensional facial image as an example.
The multi-modal data may be data generated by collecting the imitated person's natural language, visual perception, touch perception, speech, emotional expressions, actions, and so on, for example Xiao A's speech, the things Xiao A sees, the feel of objects Xiao A touches, and the sounds, moods, and actions Xiao A produces.
Step 503: Parse the multi-modal data of the imitated person and output multi-modal interaction data according to the resulting decision.
In this embodiment, the parsing includes semantic understanding, visual recognition, affective computing, and cognitive computing, for example computing over the collected data described above: Xiao A's speech, the things seen, the feel of touched objects, and the sounds, moods, and actions produced.
Step 504: When the parsing result contains an intention to invoke the imitation skill, enable the imitation skill and start an acquisition device to capture the imitated person's three-dimensional facial image.
In this embodiment, when parsing the multi-modal data reveals Xiao A's intention to invoke the imitation skill, the virtual human's imitation skill is enabled and the acquisition device on the smartphone is started to capture Xiao A's three-dimensional facial image. The acquisition device in this embodiment may be the smartphone's camera.
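Steps 503 and 504 decide, from the parsed multi-modal data, whether the imitation skill should be enabled. The patent describes full semantic understanding; the sketch below stands in for that step with a hypothetical keyword rule — the function name and trigger phrases are illustrative assumptions, not part of the patent.

```python
def wants_imitation(utterance: str) -> bool:
    # Placeholder intent check standing in for the semantic-understanding step;
    # the trigger phrases are illustrative assumptions.
    triggers = ("imitate me", "copy my face", "mimic my expression")
    text = utterance.lower()
    return any(t in text for t in triggers)

# e.g. Xiao A says: "Can you imitate me?"
intent = wants_imitation("Can you imitate me?")
```

When the check succeeds, the virtual human would enable the imitation skill and start the camera, as described above.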
Step 505: Parse the three-dimensional facial image and determine its key points.
In this embodiment, Xiao A's facial image is parsed; the distribution points of the facial bones, muscles, features, and so on are taken as key points, and the initial coordinate position of each key point is recorded.
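One minimal way to hold the parsed key points, sketched below, is to record each point's initial 3D coordinates as a reference and compute per-point displacements against later frames. The point names and coordinates are illustrative assumptions, not values from the patent.

```python
import numpy as np

class FaceKeypoints:
    """Stores named face key points and their recorded initial coordinates."""

    def __init__(self, points):
        # points: {name: (x, y, z)} in the camera coordinate frame (assumed)
        self.initial = {k: np.asarray(v, dtype=float) for k, v in points.items()}

    def displacement(self, current):
        # Per-keypoint offset of the current frame from the recorded initial pose.
        return {k: np.asarray(current[k], dtype=float) - self.initial[k]
                for k in self.initial}

kp = FaceKeypoints({"left_eye": (-3, 1, 0), "right_eye": (3, 1, 0),
                    "mouth": (0, -2, 0)})
# A later frame in which the whole face has moved one unit along z:
d = kp.displacement({"left_eye": (-3, 1, 1), "right_eye": (3, 1, 1),
                     "mouth": (0, -2, 1)})
```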
Step 506: Bind the key points to the corresponding nodes of the virtual human's three-dimensional face model.
In this embodiment, the key points of Xiao A's facial image are bound to the corresponding nodes of the virtual human model's three-dimensional face model.
The nodes include the eyebrows, eyes, eyelids, mouth, and/or mouth corners. Each node is controlled independently, so isolated imitation actions such as blinking, furrowing the brows, and/or opening the mouth can be realized.
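The binding of step 506 can be sketched as a mapping from detected key points to named model nodes, so that driving one key point moves only its bound node (enabling an isolated blink, for example). The class, node names, and key-point names below are assumptions for illustration.

```python
class AvatarFaceModel:
    """Sketch of the virtual human's face model with independently driven nodes."""

    NODES = ("eyebrow", "eye", "eyelid", "mouth", "mouth_corner")

    def __init__(self):
        self.bindings = {}

    def bind(self, keypoint_name, node_name):
        # Bind a detected key point to one model node (step 506).
        if node_name not in self.NODES:
            raise ValueError(f"unknown node: {node_name}")
        self.bindings[keypoint_name] = node_name

    def drive(self, keypoint_name, offset):
        # Return the (node, offset) pair a renderer would apply; only the
        # bound node moves, so isolated actions like a blink are possible.
        return self.bindings[keypoint_name], offset

model = AvatarFaceModel()
model.bind("upper_eyelid_l", "eyelid")
node, off = model.drive("upper_eyelid_l", (-0.2, 0.0, 0.0))
```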
Step 507: Acquire and parse the imitated person's three-dimensional facial image in real time.
In this embodiment, Xiao A's three-dimensional facial image is acquired and parsed in real time, i.e. Xiao A's facial expression is captured continuously, so that the virtual human model's three-dimensional facial image can imitate it synchronously, avoiding the poor user experience caused by imitation lag.
Step 508: When the parsing determines that the imitated person's head is rotating, determine a rotation matrix from the difference between the key points of the three-dimensional facial images before and after the head rotation, as obtained by the parsing.
In this embodiment, when the coordinates of the key points of Xiao A's three-dimensional facial image change by shifting toward one side of the global coordinate system, it can be determined that Xiao A's head is rotating; the rotation matrix is then determined from the difference between the key points of the three-dimensional facial images before and after Xiao A's head rotation, as obtained by the parsing.
Whether the face in the three-dimensional facial image is turning, or the head is shaking, is determined in the same way.
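The patent does not spell out how the rotation matrix is computed from the key-point differences. One standard choice, sketched below with NumPy, is the Kabsch algorithm: an SVD of the cross-covariance of the centered before/after key-point sets yields the best-fit proper rotation. The key-point coordinates and the 30-degree test rotation are illustrative.

```python
import numpy as np

def rotation_from_keypoints(before, after):
    """Best-fit rotation R with after ≈ before @ R.T (Kabsch algorithm)."""
    P = np.asarray(before, float) - np.mean(before, axis=0)  # center both sets
    Q = np.asarray(after, float) - np.mean(after, axis=0)
    U, _, Vt = np.linalg.svd(P.T @ Q)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])          # guard against reflections
    return Vt.T @ D @ U.T               # maps "before" points onto "after"

# A 30-degree head turn about the vertical (y) axis, applied to three key points:
theta = np.radians(30)
R_true = np.array([[np.cos(theta), 0, np.sin(theta)],
                   [0, 1, 0],
                   [-np.sin(theta), 0, np.cos(theta)]])
before = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.5, 1.0]])
after = before @ R_true.T
R = rotation_from_keypoints(before, after)   # recovers R_true
```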
Step 509: Determine the rotation angle of the imitated person's head rotation from the rotation matrix.
In this embodiment, the rotation matrix is used to compute the multi-dimensional rotation angle from each captured frame of the face; this angle is the precise rotation angle of the imitated person Xiao A's head.
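The angle of step 509 can be recovered from any 3D rotation matrix through the standard trace identity trace(R) = 1 + 2·cos θ; this is ordinary linear algebra, not a computation the patent specifies. The 30-degree yaw matrix below is an illustrative test case.

```python
import numpy as np

def rotation_angle(R):
    """Rotation angle in degrees of a 3x3 rotation matrix, via its trace."""
    c = (np.trace(R) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))  # clip guards rounding

theta = np.radians(30)
R = np.array([[np.cos(theta), 0, np.sin(theta)],
              [0, 1, 0],
              [-np.sin(theta), 0, np.cos(theta)]])
angle = rotation_angle(R)   # ≈ 30 degrees
```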
Step 510: Control the virtual human's three-dimensional face model to imitate the imitated person's head rotation according to the rotation angle.
In this embodiment, once the rotation angle is obtained, the virtual human model's three-dimensional facial image can perform the corresponding rotation according to that angle, realizing a more accurate imitation.
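As an illustration of step 510, the recovered angle can be applied to the model's node coordinates. The sketch below assumes a yaw about the vertical (y) axis; the node name and position are hypothetical.

```python
import numpy as np

def rotate_nodes(nodes, angle_deg):
    """Rotate each model node about the vertical y axis by angle_deg degrees."""
    t = np.radians(angle_deg)
    R = np.array([[np.cos(t), 0, np.sin(t)],
                  [0, 1, 0],
                  [-np.sin(t), 0, np.cos(t)]])
    return {name: R @ np.asarray(p, float) for name, p in nodes.items()}

nodes = {"nose_tip": (0.0, 0.0, 1.0)}
turned = rotate_nodes(nodes, 90.0)   # nose tip swings from +z toward +x
```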
The multi-modal interaction processing method based on a virtual human provided by this embodiment enables the virtual human to imitate the user's facial expressions in real time as a multi-modal interactive skill, achieving a lifelike, fluid, anthropomorphic interaction effect.
Fig. 6 to Fig. 8 are structural schematic diagrams of a multi-modal interaction processing system based on a virtual human provided by embodiments of the present application. Since the system embodiments are substantially similar to the method embodiments, they are described only schematically; for related details, refer to the description of the method embodiments.
Referring to Fig. 6, the present application provides a multi-modal interaction processing system based on a virtual human, comprising a smart device and a server. The smart device comprises an acquisition module 601, a generation module 606, and an output module 607; the server comprises an extraction module 602, a determination module 603, a binding module 604, and a parsing module 605, wherein:
the acquisition module 601 is configured to acquire multi-modal data of the imitated person;
the extraction module 602 is configured to extract the imitated person's three-dimensional facial image from the multi-modal data;
the determination module 603 is configured to parse the three-dimensional facial image and determine its key points;
the binding module 604 is configured to bind the key points to the corresponding nodes of the virtual human's three-dimensional face model;
the parsing module 605 is configured to acquire and parse the imitated person's three-dimensional facial image in real time;
the generation module 606 is configured to map the key points of the parsed three-dimensional facial image onto the nodes of the virtual human's three-dimensional face model to generate imitation data;
the output module 607 is configured to output the imitation data.
Optionally, the smart device comprises a wake module configured to wake the virtual human and display it in a preset display area; the virtual human runs on the smart device.
Optionally, the virtual human is generated from a high-poly 3D model and possesses a preset appearance and skills.
The virtual human comprises an application program or executable file running on the smart device, or a hologram projected by the smart device.
Optionally, the operating system used by the smart device includes WINDOWS, MAC OS, or the built-in system of a holographic device.
Optionally, the preset display area includes the display interface of the smart device or the projection area of the smart device.
Optionally, the output module is configured to have the virtual human's three-dimensional face model imitate the imitated person according to the received imitation data while outputting multi-modal interaction data.
Optionally, referring to Fig. 7, the server comprises:
a data parsing module 701, configured to parse the multi-modal data of the imitated person and output multi-modal interaction data according to the resulting decision, the parsing including semantic understanding, visual recognition, affective computing, and cognitive computing; and
an image acquisition module 702, configured to, when the parsing result contains an intention to invoke the imitation skill, enable the imitation skill and start the acquisition device to capture the imitated person's three-dimensional facial image.
Optionally, referring to Fig. 8, the generation module 606 comprises:
a rotation matrix determination submodule 801, configured to, when the parsing determines that the imitated person's head or face is rotating, determine a rotation matrix from the difference between the key points of the three-dimensional facial images before and after the imitated person's head rotation, as obtained by the parsing;
a rotation angle determination submodule 802, configured to determine the rotation angle of the imitated person's head rotation from the rotation matrix; and
a head rotation imitation submodule 803, configured to control the virtual human's three-dimensional face model to imitate the imitated person's head rotation according to the rotation angle.
Optionally, the key points include the distribution points of the facial bones, muscles, and/or features.
Optionally, the nodes of the virtual human's three-dimensional face model include the eyebrows, eyes, eyelids, mouth, and/or mouth corners.
In the multi-modal interaction processing system based on a virtual human provided by the present application, the imitated person's three-dimensional facial image is acquired and parsed in real time, the key points of the parsed image are mapped onto the nodes of the virtual human's three-dimensional face model to generate imitation data, and the imitation data is finally output, so that the virtual human's three-dimensional face model presents a lifelike, fluid human-computer interaction effect and improves the user experience.
The smart device of the present application may comprise a processor and a memory, the memory storing computer instructions that the processor invokes to perform the foregoing multi-modal interaction processing method based on a virtual human.
The processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, and so on. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The processor is the control center of the smart device, connecting every part of the smart device via various interfaces and lines.
The memory mainly comprises a program storage area and a data storage area. The program storage area may store an operating system and the application programs required by at least one function (such as a sound playback function or an image playback function); the data storage area may store data created according to the use of the device (such as audio data or a phone book). The memory may comprise high-speed random access memory, and may also comprise non-volatile memory such as a hard disk, internal memory, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another solid-state storage component.
An embodiment of the present application also provides a virtual human that performs the above multi-modal interaction processing method based on a virtual human.
The above is an exemplary scheme of the virtual human of this embodiment. It should be noted that the technical scheme of the virtual human and the technical scheme of the above multi-modal interaction processing method based on a virtual human belong to the same concept; for details not described in the technical scheme of the virtual human, refer to the description of the technical scheme of the method.
An embodiment of the present application also provides a storage medium storing computer instructions that perform the above multi-modal interaction processing method based on a virtual human.
The above is an exemplary scheme of the storage medium of this embodiment. It should be noted that the technical scheme of the storage medium and the technical scheme of the above multi-modal interaction processing method based on a virtual human belong to the same concept; for details not described in the technical scheme of the storage medium, refer to the description of the technical scheme of the method.
The computer instructions comprise computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, computer memory, read-only memory (ROM), random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and so on. It should be noted that the content included in the computer-readable medium may be increased or decreased as appropriate according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunication signals.
It should be noted that, for brevity of description, the foregoing method embodiments are all expressed as series of action combinations; however, those skilled in the art should understand that the present application is not limited by the described order of actions, because according to the present application some steps may be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in this specification are preferred embodiments, and the actions and modules involved are not necessarily all required by the present application.
In the above embodiments, each embodiment has its own emphasis; for parts not described in detail in one embodiment, refer to the related descriptions of the other embodiments.
The preferred embodiments of the present application disclosed above are intended only to help illustrate the present application. The alternative embodiments do not exhaustively describe all details, nor do they limit the invention to the specific embodiments described. Obviously, many modifications and variations can be made according to the content of this specification. These embodiments were chosen and specifically described in order to better explain the principles and practical application of the present application, so that those skilled in the art can well understand and utilize it. The present application is limited only by the claims and their full scope and equivalents.
Claims (16)
- 1. A multi-modal interaction processing method based on a virtual human, wherein the virtual human runs on a smart device, the method comprising: acquiring multi-modal data of an imitated person; extracting the imitated person's three-dimensional facial image from the multi-modal data; parsing the three-dimensional facial image and determining key points of the three-dimensional facial image; binding the key points to corresponding nodes of the virtual human's three-dimensional face model; acquiring and parsing the imitated person's three-dimensional facial image in real time; mapping the key points of the parsed three-dimensional facial image onto the nodes of the virtual human's three-dimensional face model to generate imitation data; and outputting the imitation data.
- 2. The method according to claim 1, wherein before acquiring the multi-modal data of the imitated person, the method comprises: waking the virtual human and displaying the virtual human in a preset display area.
- 3. The method according to claim 1, wherein the virtual human is generated from a high-poly 3D model and possesses a preset appearance and skills; and the virtual human comprises an application program or executable file running on the smart device, or a hologram projected by the smart device.
- 4. The method according to claim 1, wherein the operating system used by the smart device includes WINDOWS, MAC OS, or the built-in system of a holographic device.
- 5. The method according to claim 2, wherein the preset display area includes the display interface of the smart device or the projection area of the smart device.
- 6. The method according to claim 1, wherein outputting the imitation data comprises: having the virtual human's three-dimensional face model imitate the imitated person according to the received imitation data while outputting multi-modal interaction data.
- 7. The method according to claim 1, wherein extracting the imitated person's three-dimensional facial image from the multi-modal data comprises: parsing the multi-modal data of the imitated person and outputting multi-modal interaction data according to the resulting decision, the parsing including semantic understanding, visual recognition, affective computing, and cognitive computing; and when the parsing result contains an intention to invoke the imitation skill, enabling the imitation skill and starting an acquisition device to capture the imitated person's three-dimensional facial image.
- 8. The method according to claim 1, wherein, when the parsing determines that the imitated person's head or face is rotating, mapping the key points of the parsed three-dimensional facial image onto the nodes of the virtual human's three-dimensional face model to generate imitation data comprises: determining a rotation matrix from the difference between the key points of the three-dimensional facial images before and after the imitated person's head rotation, as obtained by the parsing; determining the rotation angle of the imitated person's head rotation from the rotation matrix; and controlling the virtual human's three-dimensional face model to imitate the imitated person's head rotation according to the rotation angle.
- 9. The method according to claim 1, wherein the key points include: distribution points of the facial bones, muscles, and/or features.
- 10. The method according to claim 1, wherein the nodes of the virtual human's three-dimensional face model include: the eyebrows, eyes, eyelids, mouth, and/or mouth corners.
- 11. A multi-modal interaction processing system based on a virtual human, comprising a smart device and a server, the smart device comprising an acquisition module, a generation module, and an output module, and the server comprising an extraction module, a determination module, a binding module, and a parsing module, wherein: the acquisition module is configured to acquire multi-modal data of an imitated person; the extraction module is configured to extract the imitated person's three-dimensional facial image from the multi-modal data; the determination module is configured to parse the three-dimensional facial image and determine key points of the three-dimensional facial image; the binding module is configured to bind the key points to corresponding nodes of the virtual human's three-dimensional face model; the parsing module is configured to acquire and parse the imitated person's three-dimensional facial image in real time; the generation module is configured to map the key points of the parsed three-dimensional facial image onto the nodes of the virtual human's three-dimensional face model to generate imitation data; and the output module is configured to output the imitation data.
- 12. The system according to claim 11, wherein the output module is configured to have the virtual human's three-dimensional face model imitate the imitated person according to the received imitation data while outputting multi-modal interaction data.
- 13. The system according to claim 12, wherein the server comprises: a data parsing module, configured to parse the multi-modal data of the imitated person and output multi-modal interaction data according to the resulting decision, the parsing including semantic understanding, visual recognition, affective computing, and cognitive computing; and an image acquisition module, configured to, when the parsing result contains an intention to invoke the imitation skill, enable the imitation skill and start an acquisition device to capture the imitated person's three-dimensional facial image.
- 14. The system according to claim 11, wherein the generation module comprises: a rotation matrix determination submodule, configured to, when the parsing determines that the imitated person's head or face is rotating, determine a rotation matrix from the difference between the key points of the three-dimensional facial images before and after the imitated person's head rotation, as obtained by the parsing; a rotation angle determination submodule, configured to determine the rotation angle of the imitated person's head rotation from the rotation matrix; and a head rotation imitation submodule, configured to control the virtual human's three-dimensional face model to imitate the imitated person's head rotation according to the rotation angle.
- 15. A virtual human, wherein the virtual human performs the method according to any one of claims 1-10.
- 16. A storage medium storing computer instructions, wherein the computer instructions perform the method according to any one of claims 1-10.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711026544.5A CN107679519A (en) | 2017-10-27 | 2017-10-27 | A kind of multi-modal interaction processing method and system based on visual human |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711026544.5A CN107679519A (en) | 2017-10-27 | 2017-10-27 | A kind of multi-modal interaction processing method and system based on visual human |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107679519A true CN107679519A (en) | 2018-02-09 |
Family
ID=61143468
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711026544.5A Pending CN107679519A (en) | 2017-10-27 | 2017-10-27 | A kind of multi-modal interaction processing method and system based on visual human |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107679519A (en) |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108335345A (en) * | 2018-02-12 | 2018-07-27 | 北京奇虎科技有限公司 | The control method and device of FA Facial Animation model, computing device |
CN109117770A (en) * | 2018-08-01 | 2019-01-01 | 吉林盘古网络科技股份有限公司 | FA Facial Animation acquisition method, device and terminal device |
CN109278051A (en) * | 2018-08-09 | 2019-01-29 | 北京光年无限科技有限公司 | Exchange method and system based on intelligent robot |
CN110751717A (en) * | 2019-09-10 | 2020-02-04 | 平安科技(深圳)有限公司 | Virtual head model construction method and device, computer equipment and storage medium |
CN111360819A (en) * | 2020-02-13 | 2020-07-03 | 平安科技(深圳)有限公司 | Robot control method and device, computer device and storage medium |
CN112528978A (en) * | 2021-02-10 | 2021-03-19 | 腾讯科技(深圳)有限公司 | Face key point detection method and device, electronic equipment and storage medium |
CN112667068A (en) * | 2019-09-30 | 2021-04-16 | 北京百度网讯科技有限公司 | Virtual character driving method, device, equipment and storage medium |
CN113379880A (en) * | 2021-07-02 | 2021-09-10 | 福建天晴在线互动科技有限公司 | Automatic expression production method and device |
CN114998816A (en) * | 2022-08-08 | 2022-09-02 | 深圳市指南针医疗科技有限公司 | Skeleton AI video-based case improvement method, device and storage medium |
CN115052030A (en) * | 2022-06-27 | 2022-09-13 | 北京蔚领时代科技有限公司 | Virtual digital person control method and system |
CN115458128A (en) * | 2022-11-10 | 2022-12-09 | 北方健康医疗大数据科技有限公司 | Method, device and equipment for generating digital human body image based on key points |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102376100A (en) * | 2010-08-20 | 2012-03-14 | 北京盛开互动科技有限公司 | Single-photo-based human face animating method |
CN104346824A (en) * | 2013-08-09 | 2015-02-11 | 汉王科技股份有限公司 | Method and device for automatically synthesizing three-dimensional expression based on single facial image |
CN204945944U (en) * | 2015-07-08 | 2016-01-06 | 赵刚 | A kind of holographic interaction image system |
CN106447785A (en) * | 2016-09-30 | 2017-02-22 | 北京奇虎科技有限公司 | Method for driving virtual character and device thereof |
CN106919899A (en) * | 2017-01-18 | 2017-07-04 | 北京光年无限科技有限公司 | The method and system for imitating human face expression output based on intelligent robot |
CN106959839A (en) * | 2017-03-22 | 2017-07-18 | 北京光年无限科技有限公司 | A kind of human-computer interaction device and method |
CN206411651U (en) * | 2016-11-23 | 2017-08-15 | 朴明义 | A kind of virtual imaging system |
- 2017-10-27: CN CN201711026544.5A patent/CN107679519A/en active Pending
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102376100A (en) * | 2010-08-20 | 2012-03-14 | 北京盛开互动科技有限公司 | Single-photo-based human face animating method |
CN104346824A (en) * | 2013-08-09 | 2015-02-11 | 汉王科技股份有限公司 | Method and device for automatically synthesizing three-dimensional expression based on single facial image |
CN204945944U (en) * | 2015-07-08 | 2016-01-06 | 赵刚 | A kind of holographic interaction image system |
CN106447785A (en) * | 2016-09-30 | 2017-02-22 | 北京奇虎科技有限公司 | Method for driving virtual character and device thereof |
CN206411651U (en) * | 2016-11-23 | 2017-08-15 | 朴明义 | A kind of virtual imaging system |
CN106919899A (en) * | 2017-01-18 | 2017-07-04 | 北京光年无限科技有限公司 | The method and system for imitating human face expression output based on intelligent robot |
CN106959839A (en) * | 2017-03-22 | 2017-07-18 | 北京光年无限科技有限公司 | A kind of human-computer interaction device and method |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108335345A (en) * | 2018-02-12 | 2018-07-27 | 北京奇虎科技有限公司 | The control method and device of FA Facial Animation model, computing device |
CN108335345B (en) * | 2018-02-12 | 2021-08-24 | 北京奇虎科技有限公司 | Control method and device of facial animation model and computing equipment |
CN109117770A (en) * | 2018-08-01 | 2019-01-01 | 吉林盘古网络科技股份有限公司 | FA Facial Animation acquisition method, device and terminal device |
CN109278051A (en) * | 2018-08-09 | 2019-01-29 | 北京光年无限科技有限公司 | Exchange method and system based on intelligent robot |
CN110751717A (en) * | 2019-09-10 | 2020-02-04 | 平安科技(深圳)有限公司 | Virtual head model construction method and device, computer equipment and storage medium |
CN112667068A (en) * | 2019-09-30 | 2021-04-16 | 北京百度网讯科技有限公司 | Virtual character driving method, device, equipment and storage medium |
CN111360819A (en) * | 2020-02-13 | 2020-07-03 | 平安科技(深圳)有限公司 | Robot control method and device, computer device and storage medium |
CN111360819B (en) * | 2020-02-13 | 2022-09-27 | 平安科技(深圳)有限公司 | Robot control method and device, computer device and storage medium |
CN112528978B (en) * | 2021-02-10 | 2021-05-14 | 腾讯科技(深圳)有限公司 | Face key point detection method and device, electronic equipment and storage medium |
CN112528978A (en) * | 2021-02-10 | 2021-03-19 | 腾讯科技(深圳)有限公司 | Face key point detection method and device, electronic equipment and storage medium |
CN113379880A (en) * | 2021-07-02 | 2021-09-10 | 福建天晴在线互动科技有限公司 | Automatic expression production method and device |
CN113379880B (en) * | 2021-07-02 | 2023-08-11 | 福建天晴在线互动科技有限公司 | Expression automatic production method and device |
CN115052030A (en) * | 2022-06-27 | 2022-09-13 | 北京蔚领时代科技有限公司 | Virtual digital person control method and system |
CN114998816A (en) * | 2022-08-08 | 2022-09-02 | 深圳市指南针医疗科技有限公司 | Skeleton AI video-based case improvement method, device and storage medium |
CN115458128A (en) * | 2022-11-10 | 2022-12-09 | 北方健康医疗大数据科技有限公司 | Method, device and equipment for generating digital human body image based on key points |
CN115458128B (en) * | 2022-11-10 | 2023-03-24 | 北方健康医疗大数据科技有限公司 | Method, device and equipment for generating digital human body image based on key points |
Similar Documents
Publication | Publication Date | Title
---|---|---
CN107679519A (en) | | Multi-modal interaction processing method and system based on visual human |
CN107944542A (en) | | Multi-modal interactive output method and system based on visual human |
CN107894833A (en) | | Multi-modal interaction processing method and system based on visual human |
CN110163054B (en) | | Method and device for generating human face three-dimensional image |
CN105378742B (en) | | Managed biometric identity |
CN107797663A (en) | | Multi-modal interaction processing method and system based on visual human |
CN107765852A (en) | | Multi-modal interaction processing method and system based on visual human |
CN107765856A (en) | | Visual human visual processing method and system based on multi-modal interaction |
CN109271018A (en) | | Interaction method and system based on visual human behavior standard |
CN105931506B (en) | | Children's coloring system based on augmented reality and display method thereof |
CN108665492A (en) | | Dance teaching data processing method and system based on visual human |
CN107831905A (en) | | Virtual image interaction method and system based on holographic projection equipment |
CN108942919A (en) | | Interaction method and system based on visual human |
JP2019537758A (en) | | Control method, controller, smart mirror, and computer-readable storage medium |
CN108052250A (en) | | Virtual idol performance data processing method and system based on multi-modal interaction |
CN109324688A (en) | | Interaction method and system based on visual human behavior standard |
CN109035373A (en) | | Three-dimensional special effect program file package generation and three-dimensional special effect generation method and device |
CN109343695A (en) | | Interaction method and system based on visual human behavior standard |
CN109032328A (en) | | Interaction method and system based on visual human |
CN108595012A (en) | | Visual interaction method and system based on visual human |
CN109086860A (en) | | Interaction method and system based on visual human |
CN109035415B (en) | | Virtual model processing method, device, equipment and computer-readable storage medium |
CN108416420A (en) | | Limb interaction method and system based on visual human |
CN110837294A (en) | | Facial expression control method and system based on eyeball tracking |
CN109542389A (en) | | Sound effect control method and system for multi-modal story content output |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 2018-02-09