CN108537207A - Lip reading recognition methods, device, storage medium and mobile terminal - Google Patents

Lip reading recognition methods, device, storage medium and mobile terminal Download PDF

Info

Publication number
CN108537207A
CN108537207A (application CN201810372876.7A; granted publication CN108537207B)
Authority
CN
China
Prior art keywords
lip reading
lip
information
reading identification
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810372876.7A
Other languages
Chinese (zh)
Other versions
CN108537207B (en)
Inventor
陈岩 (Chen Yan)
刘耀勇 (Liu Yaoyong)
Current Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN201810372876.7A
Publication of CN108537207A
Application granted
Publication of CN108537207B
Current legal status: Expired - Fee Related
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the present application disclose a lip reading recognition method, device, storage medium and mobile terminal. The method includes: when it is detected that a lip reading recognition event is triggered, acquiring at least one frame of 3D lip image of the current user through a 3D camera; inputting the 3D lip image into a pre-trained lip reading recognition model; determining lip reading information corresponding to the 3D lip image according to the output result of the lip reading recognition model; and providing the lip reading information to the current user. By adopting the above technical solution, the embodiments of the present application can perform simple and fast lip reading recognition on 3D lip images through a pre-built lip reading recognition model, further improve the accuracy of lip reading recognition, effectively enhance the user's human-computer interaction experience, and better meet user needs.

Description

Lip reading recognition methods, device, storage medium and mobile terminal
Technical field
The embodiments of the present application relate to the technical field of information processing, and in particular to a lip reading recognition method, device, storage medium and mobile terminal.
Background technology
With the rapid development of electronic technology and the continuous improvement of people's living standards, terminals such as smartphones and tablet computers have become an indispensable part of people's lives. At the same time, more humanized human-computer interaction (HCI) with terminal devices has become increasingly important.
However, at present most terminal devices carry out human-computer interaction through operation modes such as keyboard input, handwriting input and voice input, which are not convenient enough in many cases and cannot effectively reduce the interference of the external environment on the user. For example, when voice input is performed in a relatively noisy environment, speech recognition accuracy is low, the effect is poor, and the user's privacy is easily leaked. In view of this, lip reading recognition technology has emerged, and accurate lip reading recognition has become essential in the field of human-computer interaction.
Summary of the invention
The embodiments of the present application provide a lip reading recognition method, device, storage medium and mobile terminal, which can improve the accuracy of lip reading recognition and meet user needs.
In a first aspect, an embodiment of the present application provides a lip reading recognition method, including:
when it is detected that a lip reading recognition event is triggered, acquiring at least one frame of 3D lip image of the current user through a 3D camera;
inputting the 3D lip image into a pre-trained lip reading recognition model;
determining lip reading information corresponding to the 3D lip image according to the output result of the lip reading recognition model; and
providing the lip reading information to the current user.
In a second aspect, an embodiment of the present application provides a lip reading recognition device, including:
a lip image acquisition module, configured to acquire at least one frame of 3D lip image of the current user through a 3D camera when it is detected that a lip reading recognition event is triggered;
a lip image input module, configured to input the 3D lip image into a pre-trained lip reading recognition model;
a lip reading information determination module, configured to determine lip reading information corresponding to the 3D lip image according to the output result of the lip reading recognition model; and
a lip reading information providing module, configured to provide the lip reading information to the current user.
In a third aspect, an embodiment of the present application provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the lip reading recognition method described in the embodiments of the present application is implemented.
In a fourth aspect, an embodiment of the present application provides a mobile terminal, including a memory, a processor, and a computer program stored on the memory and executable by the processor; when the processor executes the computer program, the lip reading recognition method described in the embodiments of the present application is implemented.
According to the lip reading recognition solution provided in the embodiments of the present application, when it is detected that a lip reading recognition event is triggered, at least one frame of 3D lip image of the current user is acquired through a 3D camera, the 3D lip image is input into a pre-trained lip reading recognition model, lip reading information corresponding to the 3D lip image is determined according to the output result of the model, and the lip reading information is then provided to the current user. By adopting the above technical solution, simple and fast lip reading recognition can be performed on 3D lip images through the pre-built lip reading recognition model, the accuracy of lip reading recognition is further improved, the user's human-computer interaction experience is effectively enhanced, and user needs are better met.
Description of the drawings
Fig. 1 is a schematic flowchart of a lip reading recognition method provided by an embodiment of the present application;
Fig. 2 is a schematic flowchart of another lip reading recognition method provided by an embodiment of the present application;
Fig. 3 is a schematic flowchart of another lip reading recognition method provided by an embodiment of the present application;
Fig. 4 is a structural block diagram of a lip reading recognition device provided by an embodiment of the present application;
Fig. 5 is a structural schematic diagram of a mobile terminal provided by an embodiment of the present application;
Fig. 6 is a structural schematic diagram of another mobile terminal provided by an embodiment of the present application.
Detailed description of the embodiments
The technical solution of the present application is further described below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are only used to explain the present application, not to limit it. It should also be noted that, for ease of description, the accompanying drawings show only the parts related to the present application rather than the entire structure.
Before the exemplary embodiments are discussed in greater detail, it should be mentioned that some of them are described as processes or methods depicted as flowcharts. Although a flowchart describes the steps as a sequential process, many of the steps may be implemented in parallel, concurrently or simultaneously. In addition, the order of the steps may be rearranged. A process may be terminated when its operations are completed, but may also have additional steps not included in the figure. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, and the like.
Fig. 1 is a schematic flowchart of a lip reading recognition method provided by an embodiment of the present application. This embodiment is applicable to lip reading recognition scenarios. The method may be executed by a lip reading recognition device, where the device may be implemented by software and/or hardware and may generally be integrated in a mobile terminal. As shown in Fig. 1, the method includes:
Step 101: when it is detected that a lip reading recognition event is triggered, acquire at least one frame of 3D lip image of the current user through a 3D camera.
Illustratively, the mobile terminal in the embodiments of the present application may include mobile devices such as mobile phones and tablet computers.
When it is detected that a lip reading recognition event is triggered, at least one frame of 3D lip image of the current user is acquired through the 3D camera of the mobile terminal, so as to start the lip reading recognition process.
Illustratively, in order to perform lip reading recognition at an appropriate time, trigger conditions for the lip reading recognition event can be preset. Optionally, to confirm that the user actually wants lip reading recognition, the lip reading recognition event may be triggered when it is detected that the current user has actively enabled the lip reading recognition permission. Optionally, in order to apply lip reading recognition in more valuable time windows and save the extra power consumption it causes, the time windows and application scenarios of lip reading recognition can be analyzed or surveyed, and reasonable preset scenes can be configured; when the mobile terminal is detected to be in a preset scene, the lip reading recognition event is triggered. It should be noted that the embodiments of the present application do not limit the specific form in which the lip reading recognition event is triggered.
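The trigger conditions described above (an explicitly enabled permission and a preset scene) can be sketched as a simple predicate. This is an illustrative assumption only; the flag and scene names are invented for the sketch and are not the patent's actual interface:

```python
def lip_reading_event_triggered(permission_enabled: bool,
                                current_scene: str,
                                preset_scenes: set) -> bool:
    """Hypothetical trigger check: True when lip reading recognition should start."""
    if not permission_enabled:             # user has not opted in to lip reading
        return False
    return current_scene in preset_scenes  # e.g. a noisy environment or silent mode

print(lip_reading_event_triggered(True, "noisy_environment",
                                  {"noisy_environment", "silent_mode"}))  # True
```

Either condition failing (no permission, or an unlisted scene) keeps the 3D camera off, matching the power-saving motivation above.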
In the embodiments of the present application, when it is detected that a lip reading recognition event is triggered, the 3D camera of the mobile terminal is turned on, that is, the 3D camera is controlled to be in a working state. The 3D camera may also be called a 3D sensor. A 3D camera differs from an ordinary camera in that it can obtain not only a planar image but also the depth information of the photographed object, that is, three-dimensional position and size information. At least one frame of 3D lip image of the current user is acquired through the 3D camera.
Acquiring at least one frame of 3D lip image of the current user through the 3D camera may include: acquiring at least one frame of image of the lip region of the current user through the 3D camera as the 3D lip image; or acquiring at least one frame of 3D face image of the current user through the 3D camera, extracting a lip region image from the 3D face image, and using the extracted lip region image as the 3D lip image. Optionally, the lip region image can be extracted from the 3D face image based on the height information of the lips. Because the lips are located below the nose, the height information (i.e., depth information) of the nose is greater than that of the lips, and the height information of the lips also differs from that of other regions of the face; therefore, the lip region image can be extracted from the 3D face image based on the characteristics of the lips' height information. Optionally, contour recognition technology such as edge detection can also be used to identify the specific position of the lips in the 3D face image, and the image at that position can be extracted from the 3D face image as the lip region image.
Step 102: input the 3D lip image into a pre-trained lip reading recognition model.
The lip reading recognition model can be understood as a learning model that, after a 3D lip image is input, quickly determines the lip reading information corresponding to that image. The lip reading recognition model may be a learning model generated by training on collected 3D sample lip images and their corresponding lip reading content. Illustratively, when a user says the words "hello" (ni hao) in a voiced or unvoiced manner, the corresponding lip features change as follows: first the lower lip moves downward and the corners of the mouth move upward (pronouncing "ni"); afterwards the lips form an O shape (pronouncing "hao"); and throughout the process the depth information of the lips changes with their movement. By learning the at least one frame of 3D sample lip image corresponding to each piece of lip reading content, that is, by learning various feature information of the 3D sample lip images such as depth information and lip change information, the lip reading recognition model is generated.
Step 103: determine lip reading information corresponding to the 3D lip image according to the output result of the lip reading recognition model.
In the embodiments of the present application, after the at least one frame of 3D lip image of the current user acquired in step 101 is input into the pre-trained lip reading recognition model, the model can analyze the feature information of the at least one frame of 3D lip image and determine the corresponding lip reading information according to the analysis result. Illustratively, a 3D lip image sequence (multiple frames of 3D lip images) is input into the lip reading recognition model, which analyzes the feature information of the sequence, where the feature information may include lip shape change information and lip depth change information. After analysis, the lip shape change of the sequence is: "the lower lip parts slightly and the upper lip curves upward", followed by "the lips protrude and the tongue presses against the lower teeth", and the corresponding lip depth information first increases and then decreases in a certain pattern; it can then be determined that the lip reading information corresponding to the 3D lip image sequence is "page turning".
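The final mapping from the model's output to a phrase can be sketched as choosing the highest-scoring entry over a phrase vocabulary. The vocabulary and scores below are made up for illustration; the patent does not specify the model's output format:

```python
def decode_lip_reading(output_scores: dict) -> str:
    """Map model output (phrase -> confidence score) to the best phrase."""
    return max(output_scores, key=output_scores.get)

# Hypothetical scores for the "page turning" example in the text.
scores = {"page turning": 0.81, "page burning": 0.12, "cage turning": 0.07}
print(decode_lip_reading(scores))  # page turning
```

Real systems typically decode over frame-level outputs (e.g. with a sequence model) rather than whole-phrase scores, but the selection step is the same in spirit.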
Step 104: provide the lip reading information to the current user.
Illustratively, the lip reading information is provided to the current user in the form of text or speech. For example, when the user is editing a short message, the lip reading information can be presented as text in the message editing box. For another example, when the user is chatting with a friend on WeChat, the lip reading information can be presented as text in the chat dialog box, or sent to the friend in the form of speech for a voice chat. It should be noted that the embodiments of the present application do not limit the manner in which the lip reading information is provided to the current user.
According to the lip reading recognition method provided in the embodiments of the present application, when it is detected that a lip reading recognition event is triggered, at least one frame of 3D lip image of the current user is acquired through a 3D camera, the 3D lip image is input into a pre-trained lip reading recognition model, lip reading information corresponding to the 3D lip image is determined according to the output result of the model, and the lip reading information is then provided to the current user. By adopting the above technical solution, simple and fast lip reading recognition can be performed on 3D lip images through the pre-built lip reading recognition model, the accuracy of lip reading recognition is further improved, the user's human-computer interaction experience is effectively enhanced, and user needs are better met.
In some embodiments, before it is detected that a lip reading recognition event is triggered, the method further includes: collecting at least one frame of 3D sample lip image of each individual in a preset crowd, and obtaining the lip reading content corresponding to the 3D sample lip images; labeling the 3D sample lip images according to the lip reading content to obtain a training sample set; and training a preset machine learning model with the training sample set to obtain the lip reading recognition model. Here, labeling the 3D sample lip images according to the lip reading content can be understood as recording the lip reading content as the sample label of the corresponding 3D sample lip image. The benefit of this arrangement is that using the 3D lip images of each individual in the preset crowd and their corresponding lip reading content as the sample source of the lip reading recognition model can greatly improve the accuracy of training.
In the embodiments of the present application, at least one frame of 3D sample lip image of each individual in the preset crowd is collected through a 3D camera, and the lip reading content corresponding to the 3D sample lip images is obtained. Before the mobile terminal is manufactured, each individual in the preset crowd can be asked to read a large amount of different lip reading content in a voiced or unvoiced manner, and the 3D camera of the mobile terminal collects the corresponding 3D lip images of each individual while reading, which serve as the 3D sample lip images. For example, the 3D sample face images of 5,000 users aged between 18 and 30 reading different lip reading content can be obtained, lip region images can be extracted from these 3D sample face images as 3D sample lip images, and used as the training samples of the lip reading recognition model. The preset crowd may include any one or a combination of different groups such as children, teenagers, middle-aged people and the elderly; the embodiments of the present application do not specifically limit the coverage of the preset crowd. Illustratively, the 3D lip video images and speech sounds of each individual in the preset crowd can be collected synchronously by a 3D camera with a microphone; when collecting them, the synchronism and correspondence of the two should be ensured so as not to affect the precision of the lip reading recognition model training. Of course, a large number of videos of individuals speaking can also be collected from the Internet; lip region images are extracted from each frame of the videos as 3D sample lip images, and the speech content corresponding to the videos is used as the lip reading content. The lip reading content may include words, phrases, short sentences, long sentences or paragraphs in any language such as Chinese, English and Japanese. The 3D sample lip images are labeled according to the lip reading content, that is, each 3D sample lip image is marked with its corresponding lip reading content, and the labeled 3D sample lip images are used as the training sample set of the lip reading recognition model. The preset machine learning model is trained with the training sample set to generate the lip reading recognition model. The preset machine learning model may include machine learning models such as a convolutional neural network model or a long short-term memory network model; the embodiments of the present application do not limit the preset machine learning model.
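The labeling step described above, pairing each 3D sample lip-image sequence with the content the speaker read, can be sketched as follows. All data here is synthetic; a real pipeline would then feed these labeled pairs to the CNN or LSTM model mentioned in the text:

```python
def build_training_set(sample_sequences, lip_reading_contents):
    """Label each 3D sample lip-image sequence with its lip reading content.

    sample_sequences: list of frame sequences (placeholders here)
    lip_reading_contents: the text each speaker read, one per sequence
    """
    assert len(sample_sequences) == len(lip_reading_contents)
    return [{"frames": seq, "label": text}
            for seq, text in zip(sample_sequences, lip_reading_contents)]

# Synthetic example: two speakers, two pieces of lip reading content.
samples = [["frame_a1", "frame_a2"], ["frame_b1"]]
labels = ["hello", "page turning"]
train_set = build_training_set(samples, labels)
print(len(train_set), train_set[0]["label"])  # 2 hello
```

Keeping the label attached to the whole sequence (rather than to single frames) matches the text's point that the model learns lip *change* information across frames.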
Before it is detected that a lip reading recognition event is triggered, the lip reading recognition model is obtained. It should be noted that the mobile terminal itself may acquire the above training sample set and train the preset machine learning model with it until the lip reading recognition model is generated. Alternatively, the mobile terminal may directly invoke a lip reading recognition model generated by training on another mobile terminal; for example, one mobile terminal acquires the training sample set and generates the lip reading recognition model before manufacture, and the model is then stored on other mobile terminals for their direct use. Alternatively, a server obtains a large number of 3D sample lip images, labels them according to the corresponding lip reading content to obtain a training sample set, trains the preset machine learning model with that set, and obtains the lip reading recognition model; when the mobile terminal needs to perform lip reading recognition, that is, when it detects that a lip reading recognition event is triggered, it invokes the trained lip reading recognition model from the server.
In some embodiments, before the 3D sample lip images are labeled according to the lip reading content to obtain the training sample set, the method further includes: obtaining first facial expression information corresponding to the 3D sample lip images. Labeling the 3D sample lip images according to the lip reading content to obtain the training sample set then includes: labeling the 3D sample lip images according to the lip reading content and the first facial expression information. Correspondingly, before the 3D lip image is input into the pre-trained lip reading recognition model, the method further includes: obtaining second facial expression information corresponding to the 3D lip image; and inputting the 3D lip image into the pre-trained model includes: inputting the 3D lip image and the second facial expression information into the pre-trained lip reading recognition model. The benefit of this arrangement is that the lip reading recognition model can be trained on the user's expression information under different lip reading content together with the corresponding 3D sample lip images, which can further improve the accuracy with which the model determines lip reading information.
In the embodiments of the present application, when the content a user expresses differs, the corresponding facial expression may also differ. For example, when the content the user expresses is "I am homesick", the corresponding facial expression is relatively distressed; when the content is "I found a job", the corresponding facial expression is relatively joyful, even elated. Different expressed content leads to different facial expressions, and the feature information of the corresponding 3D lip images also differs. Therefore, the facial expression information corresponding to the 3D sample lip images, namely the first facial expression information, can be obtained, the 3D sample lip images can be labeled according to the lip reading content and the corresponding first facial expression information, and the 3D sample lip images labeled with both can be used as the training samples of the lip reading recognition model. The 3D face images corresponding to the 3D sample lip images can be input into an expression recognition model, and the expression information of the 3D face images, namely the first facial expression information corresponding to the 3D sample lip images, is determined according to the model's analysis. The first facial expression information may include expression information such as happiness, anger, sadness and joy.
Correspondingly, before the 3D lip image is input into the pre-trained lip reading recognition model, the method further includes: obtaining second facial expression information corresponding to the 3D lip image, and inputting the 3D lip image and the second facial expression information into the pre-trained lip reading recognition model. It can be understood that the lip reading recognition model comprehensively analyzes the 3D lip image together with its corresponding facial expression information, performs lip reading recognition on the 3D lip image under the facial expression corresponding to that expression information, and determines the lip reading information of the 3D lip image.
In some embodiments, obtaining the second facial expression information corresponding to the 3D lip image includes: inputting the 3D face image corresponding to the 3D lip image into an expression recognition model to obtain the second facial expression information. The expression recognition model can be understood as a learning model that, after the 3D face image corresponding to a 3D lip image is input, quickly determines the expression information of that 3D face image, namely the expression information corresponding to the 3D lip image. The expression recognition model may be a learning model generated by training on collected 3D face images and their corresponding expressions. Illustratively, the 3D camera of the mobile terminal can photograph 3D sample face images of each individual in the preset crowd under different expressions, and the 3D face images are labeled with their expressions, i.e., the corresponding 3D sample face images are labeled according to the expression information. A preset machine learning model is trained on the labeled 3D sample face images to generate the expression recognition model. After the 3D face image corresponding to the 3D lip image is input into the expression recognition model, the model analyzes the 3D face image, for example the feature information of the facial features and cheeks, to obtain the second facial expression information corresponding to the 3D lip image. Through the technical solution provided by this embodiment, the second facial expression information can be determined quickly and accurately, which facilitates subsequent accurate lip reading recognition on the 3D lip image based on the facial expression information.
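The expression-conditioned recognition above can be sketched as a two-stage call: an expression classifier labels the face image, and the lip-image sequence is recognized conditioned on that label. Both models are stubs invented for the sketch; the patent does not expose any such API:

```python
def classify_expression(face_image) -> str:
    """Stub for the expression recognition model; a real model would
    analyze the facial features and cheeks of the 3D face image."""
    return "happy"

def recognise_with_expression(lip_frames, face_image, model) -> str:
    """Feed the lip-image sequence plus the classified expression to the
    lip reading recognition model (step 102 with expression conditioning)."""
    expression = classify_expression(face_image)
    return model(lip_frames, expression)

# Toy "model" that just echoes its conditioned input, for illustration only.
toy_model = lambda frames, expr: f"{len(frames)} frames / {expr}"
print(recognise_with_expression(["f1", "f2"], "face", toy_model))
# 2 frames / happy
```

The design point is that the expression label travels with the lip frames into a single model input, which is how the text says the two signals are jointly analyzed.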
In some embodiments, determining the lip reading information corresponding to the 3D lip image according to the output result of the lip reading recognition model includes: obtaining lip reading recognition results according to the output result of the model; inputting the lip reading recognition results into a pre-built semantic understanding model, where the semantic understanding model is used to determine, when multiple lip reading recognition results are obtained, the lip reading information from the multiple results based on contextual relationships; and determining the output result of the semantic understanding model as the lip reading information corresponding to the 3D lip image. It can be understood that the lip reading recognition model may make errors when recognizing the 3D lip image, and there may be multiple lip reading recognition results; through semantic analysis, the lip reading information can be accurately determined from the multiple results. The benefit of this arrangement is that performing context-based semantic understanding on the multiple recognition results output by the lip reading recognition model makes it possible to quickly and accurately determine the lip reading information the user really wants to express.
Illustratively, by analyzing the at least one frame of 3D lip image, the lip reading recognition model outputs multiple lip reading recognition results, that is, it determines that the 3D lip image corresponds to multiple candidates. For example, the lip reading information actually corresponding to the current user's 3D lip image is "in high spirits" (xing gao cai lie), but the model's analysis of the 3D lip image yields several near-homophonous recognition results: "surname Gao Cailie", "in high spirits" and "apricot cake is adopted strong". In order to accurately determine the lip reading information actually corresponding to the acquired 3D lip image, the multiple recognition results can be input into the pre-built semantic understanding model, so that it analyzes each result separately, and the lip reading information actually corresponding to the 3D lip image is finally determined according to its output. The semantic understanding model can be understood as a learning model that, after multiple lip reading recognition results are input, quickly selects one of them as the lip reading information. For example, the three results "surname Gao Cailie", "in high spirits" and "apricot cake is adopted strong" are input into the semantic understanding model, which analyzes them one by one based on the context and determines that "in high spirits" is the final lip reading information.
In the embodiment of the present application, when the multiple lip reading recognition results contain only preceding-context information and no following-context information, the semantic understanding model may analyze the preceding context to determine the final lip reading information; when they contain only following-context information and no preceding context, the model may analyze the following context; and when they contain both, the model may analyze both together. When the multiple recognition results contain neither preceding nor following context, the final lip reading information may be determined from the candidates according to statistics of the words the current user has used most frequently within a preset time period. Illustratively, a single isolated Chinese character has neither preceding nor following context; for example, the candidates are the near-homophones rendered as "I", "nest", "fertile" and "hold", and within the past week the current user has used "I" far more often than the others, which are hardly used at all, so "I" may be taken as the final lip reading information.
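The selection logic described above — contextual analysis when context is available, falling back to the user's recent word frequency when it is not — can be sketched as follows. This is a minimal illustration; the function and parameter names are hypothetical, and a real system would score contextual fit with a trained language model rather than a supplied score table.

```python
from collections import Counter

def pick_candidate(candidates, context_score, usage_history):
    """Select the final lip reading result from several candidates.

    context_score: maps candidate -> contextual-fit score (e.g. produced
    by a semantic understanding model); empty when no context exists.
    usage_history: words the user has recently used, serving as the
    frequency fallback when there is no context.
    """
    if context_score:
        # With context, take the candidate that fits the context best.
        return max(candidates, key=lambda c: context_score.get(c, 0.0))
    # Without context, fall back to the user's most frequently used word.
    counts = Counter(usage_history)
    return max(candidates, key=lambda c: counts[c])
```

Passing an empty `context_score` exercises the frequency fallback, matching the isolated-character example above.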
In the embodiment of the present application, there are many concrete ways in which the triggering of a lip reading identification event can be detected; this is not limited here, and several ways are given below by way of example.
1. Monitor whether a lip reading identification instruction input by the current user is received; when the lip reading identification instruction is received, it is determined that a lip reading identification event has been triggered.
The advantage of this arrangement is that the lip reading identification function is opened only upon receiving an explicit, user-input instruction to perform lip reading identification, which meets the user's actual demand. The lip reading identification instruction may be a control instruction, preset by the user, for opening the lip reading identification function. Such preset control instructions include, but are not limited to, a mechanical control instruction (for example, an instruction issued when the user operates a preset mechanical button), a preset voice instruction, a preset touch instruction, and/or a preset fingerprint information instruction. If a lip reading identification instruction input by the current user is detected, it is determined that a lip reading identification event has been triggered.
2. Obtain the current time and/or the current location of the mobile terminal; when the current time is within a preset time period and/or the current location is a preset location, it is determined that a lip reading identification event has been triggered.
The advantage of this arrangement is that the trigger timing of the lip reading identification event can be reasonably determined according to the current time and/or the current location of the mobile terminal. Illustratively, the current time of the mobile terminal is obtained, and it is judged whether the current time falls within a preset time period. For example, the preset time period may include 9:00-12:00 and 14:00-18:00 on working days; if the current time is 11:00 on a Tuesday, the current time is within the preset time period, and it is determined that a lip reading identification event has been triggered. It is understood that the preset time period is a period during which it is inconvenient for the user to speak or to type manually, or during which manual input would waste time; during such a period, if the user needs to use the mobile terminal, for example to chat with a friend, it may be determined that a lip reading identification event has been triggered. Also illustratively, the current location of the mobile terminal is obtained, and it is judged whether the current location is a preset location. For example, the preset location may include the user's company and other public places; if the current location is the user's company, the colleagues around the user may be absorbed in work and it is inconvenient to speak, so it may be determined that a lip reading identification event has been triggered, and the user can interact with the mobile terminal through lip reading instead. It is understood that a preset location is a location where it is inconvenient for the user to speak, or where manual operation would be troublesome, while using the mobile terminal; at such a location, if the user needs to use the mobile terminal, for example to chat with a friend, it may be determined that a lip reading identification event has been triggered. In this way the user neither disturbs others nor exposes private content. As a further example, both the current time and the current location of the mobile terminal are obtained, and it is judged whether the current time is within the preset time period and whether the current location is a preset location; only when both conditions hold is it determined that a lip reading identification event has been triggered. This two-factor verification of time and location allows the trigger timing of the lip reading identification event to be controlled reasonably.
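The two-factor time-and-location check described above can be sketched as follows. The preset periods and locations are hypothetical example values for illustration; a real implementation would read them from user settings.

```python
from datetime import datetime, time

# Hypothetical preset values, echoing the 9:00-12:00 / 14:00-18:00 example.
PRESET_PERIODS = [(time(9, 0), time(12, 0)), (time(14, 0), time(18, 0))]
PRESET_LOCATIONS = {"company", "library"}

def lip_reading_triggered(now: datetime, location: str) -> bool:
    """Two-factor check: trigger only when both time and location match."""
    in_period = any(start <= now.time() <= end for start, end in PRESET_PERIODS)
    at_location = location in PRESET_LOCATIONS
    return in_period and at_location
```

Dropping either condition from the final `and` recovers the single-factor variants ("and/or") described in the text.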
3. Obtain the environmental noise at the current location of the mobile terminal; when the environmental noise exceeds a preset noise threshold, it is determined that a lip reading identification event has been triggered.
The advantage of this arrangement is that the trigger timing of the lip reading identification event can be reasonably determined according to the level of environmental noise at the mobile terminal's current location, effectively avoiding voice communication in a noisy environment and the large speech recognition errors and poor user experience that would result. Illustratively, the preset noise threshold is 50 dB, and the measured environmental noise at the mobile terminal's current location is 60 dB, indicating that the user is currently in a relatively noisy environment; to avoid voice communication in such an environment, it may be determined that a lip reading identification event has been triggered.
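A minimal sketch of the noise-threshold trigger, assuming normalized microphone samples in [-1, 1] and a device-specific calibration offset. The offset value is a made-up assumption: mapping digital sample amplitude to an approximate sound pressure level requires per-device calibration that the patent does not specify.

```python
import math

def ambient_level_db(samples, calibration_offset_db=90.0):
    """Estimate ambient sound level (dB) from normalized mic samples.

    calibration_offset_db shifts dBFS toward an approximate SPL reading
    and is a hypothetical, device-specific constant.
    """
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0:
        return float("-inf")
    return 20 * math.log10(rms) + calibration_offset_db

def noise_trigger(samples, threshold_db=50.0):
    """True when ambient noise exceeds the preset threshold (50 dB above)."""
    return ambient_level_db(samples) > threshold_db
```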
4. Obtain a communication message sent by another mobile terminal; when the communication message contains a preset keyword, it is determined that a lip reading identification event has been triggered.
The advantage of this arrangement is that the trigger timing of the lip reading identification event can be reasonably determined according to communication messages that the current mobile terminal receives from other mobile terminals. Illustratively, user A is having a WeChat voice chat with friend B; when user A's mobile terminal detects a preset keyword in a communication message sent by user B's mobile terminal, it may be determined that a lip reading identification event has been triggered. The preset keywords may include content such as "the noise is loud", "I can't hear you" or "speak a little louder". For example, when the message sent by user B's mobile terminal is "why is it so noisy on your side", the message contains the preset keyword "noisy", and it is therefore determined that a lip reading identification event has been triggered.
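The keyword check on incoming messages can be sketched as follows. The keyword list is a hypothetical English stand-in for the Chinese phrases in the example; a real deployment would load locale-specific phrases from configuration.

```python
# Hypothetical keyword list for illustration only.
PRESET_KEYWORDS = ["noisy", "can't hear", "speak louder"]

def message_triggers_lip_reading(message: str) -> bool:
    """Trigger when an incoming chat message contains any preset keyword."""
    text = message.lower()
    return any(keyword in text for keyword in PRESET_KEYWORDS)
```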
In some embodiments, after the lip reading information is provided to the current user, the method further includes: receiving feedback information from the current user on whether the recognition result of the lip reading information is accurate, and sending the feedback information to the lip reading identification model for training. The advantage of this arrangement is that the user's feedback makes clear whether the recognition result is accurate, and the network parameters of the lip reading identification model can be adjusted at any time according to that feedback, further improving the precision of lip reading identification.
Feedback information can be understood as correction information or judgment information given by the user on the lip reading information determined by the lip reading identification model. Illustratively, a correction option or a judgment option for the determined lip reading information may be provided on the human-computer interaction interface of the terminal device. The correction option may include the two choices "Yes" and "No": "Yes" indicates that the user approves the lip reading information determined by the model, while "No" indicates that the user does not approve it, in which case the network parameters of the lip reading identification model may be corrected according to the lip reading information as corrected by the user. The judgment option may include the two choices "correct" and "incorrect": when a "correct" judgment input by the user is received, the user approves the determined lip reading information, and human-computer interaction may proceed directly on that basis; when an "incorrect" judgment input by the user is received, the user does not approve it, in which case the correct lip reading information input by the user is received, human-computer interaction proceeds on the basis of that correct information, and the network parameters of the lip reading identification model may be corrected accordingly. The embodiment of the present application does not limit the concrete form of the feedback information. The mobile terminal receives the user's feedback on whether the recognition result determined by the lip reading identification model is accurate, and sends the feedback to the model so that its network parameters can be adaptively adjusted.
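The feedback loop can be sketched as a collector that buffers the user's corrections until enough have accumulated to fine-tune the model. The class and method names are hypothetical, and the actual parameter-update step is omitted since the patent does not specify the model architecture.

```python
class FeedbackCollector:
    """Buffers corrected (image, label) pairs for later model fine-tuning."""

    def __init__(self, retrain_batch=32):
        self.buffer = []              # (lip_image_id, corrected_text) pairs
        self.retrain_batch = retrain_batch

    def submit(self, lip_image_id, predicted, user_correction=None):
        """Record user feedback; only 'incorrect' results carry a correction."""
        label = user_correction if user_correction else predicted
        if user_correction:           # user judged the result "incorrect"
            self.buffer.append((lip_image_id, label))
        return label                  # text to use for human-computer interaction

    def ready_to_retrain(self):
        """True once enough corrected samples have accumulated."""
        return len(self.buffer) >= self.retrain_batch
```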
Fig. 2 is a schematic flowchart of a lip reading recognition method provided by an embodiment of the present application. As shown in Fig. 2, the method includes:
Step 201: collecting at least one frame of 3D sample lip image for each individual in a preset crowd, and obtaining lip reading content corresponding to the 3D sample lip images.
Step 202: obtaining first facial expression information corresponding to the 3D sample lip images.
Step 203: marking the 3D sample lip images according to the lip reading content and the first facial expression information to obtain a training sample set.
Step 204: training a preset machine learning model using the training sample set to obtain a lip reading identification model.
Step 205: when it is detected that a lip reading identification event has been triggered, obtaining at least one frame of 3D lip image of the current user through a 3D camera.
Step 206: obtaining second facial expression information corresponding to the 3D lip images.
Step 207: inputting the 3D lip images and the second facial expression information into the pre-trained lip reading identification model.
Step 208: determining lip reading information corresponding to the 3D lip images according to the output result of the lip reading identification model.
Step 209: providing the lip reading information to the current user.
The lip reading information may be provided to the current user in the form of text or voice.
In the lip reading recognition method provided in this embodiment of the application, when it is detected that a lip reading identification event has been triggered, at least one frame of 3D lip image of the current user is obtained through a 3D camera, and the 3D lip image and its corresponding second facial expression information are input into a pre-trained lip reading identification model; lip reading information corresponding to the 3D lip image is determined according to the output result of the model and then provided to the current user, where the lip reading identification model is generated by training on 3D sample lip images marked with lip reading content and first facial expression information. With this technical solution, the lip reading identification model is trained on the user's expression information for different lip reading contents together with the corresponding 3D sample lip images, which can further improve the accuracy with which the model determines lip reading information.
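The data flow of steps 201-209 can be sketched as follows. A dictionary lookup stands in for the trained machine learning model so the train-then-recognize flow stays visible; this stand-in, and all names here, are illustrative assumptions rather than the patent's actual model.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    lip_features: tuple      # features extracted from a 3D sample lip image
    expression: str          # first facial expression information
    lip_content: str         # labelled lip reading content

def train_lip_model(samples):
    """Steps 201-204: build a lookup 'model' from labelled samples."""
    return {(s.lip_features, s.expression): s.lip_content for s in samples}

def recognize(model, lip_features, expression):
    """Steps 205-209: map a (lip image, expression) pair to lip content."""
    return model.get((lip_features, expression), "<unknown>")
```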
Fig. 3 is a schematic flowchart of a lip reading recognition method provided by an embodiment of the present application. As shown in Fig. 3, the method includes:
Step 301: collecting at least one frame of 3D sample lip image for each individual in a preset crowd, and obtaining lip reading content corresponding to the 3D sample lip images.
Step 302: obtaining first facial expression information corresponding to the 3D sample lip images.
Step 303: marking the 3D sample lip images according to the lip reading content and the first facial expression information to obtain a training sample set.
Step 304: training a preset machine learning model using the training sample set to obtain a lip reading identification model.
Step 305: obtaining the environmental noise at the current location of the mobile terminal.
Step 306: judging whether the environmental noise exceeds a preset noise threshold; if so, executing step 307; otherwise, returning to step 305.
Step 307: determining that a lip reading identification event has been triggered.
Step 308: obtaining at least one frame of 3D lip image of the current user through a 3D camera.
Step 309: obtaining second facial expression information corresponding to the 3D lip images.
Step 310: inputting the 3D lip images and the second facial expression information into the pre-trained lip reading identification model.
Step 311: obtaining lip reading recognition results according to the output result of the lip reading identification model.
Step 312: inputting the lip reading recognition results into a pre-built semantic understanding model.
The semantic understanding model is configured to, when multiple lip reading recognition results are obtained, determine lip reading information from the multiple results based on contextual relationships.
Step 313: determining the output result of the semantic understanding model as the lip reading information corresponding to the 3D lip images.
Step 314: providing the lip reading information to the current user.
The lip reading information may be provided to the current user in the form of text or voice.
In the lip reading recognition method provided in this embodiment of the application, the environmental noise at the mobile terminal's current location is obtained, and when it exceeds a preset noise threshold it is determined that a lip reading identification event has been triggered, so the trigger timing can be reasonably determined according to the ambient noise level at the mobile terminal's location, effectively avoiding voice communication in a noisy environment and the large speech recognition errors and poor user experience that would result. Furthermore, lip reading recognition results are obtained from the output of the lip reading identification model and input into a pre-built semantic understanding model, and the output of that model is determined as the lip reading information corresponding to the 3D lip images, so that semantic understanding based on context is applied to the multiple recognition results output by the lip reading identification model and the information the user actually intends to express is determined quickly and accurately.
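The overall control flow of Fig. 3 — noise-gated capture, recognition, then semantic disambiguation — can be sketched as follows, with every stage injected as a stub function; all names are hypothetical placeholders for the components the patent describes.

```python
def run_lip_reading(noise_db, threshold_db, capture, recognize, disambiguate, present):
    """Gate recognition on the noise trigger, then resolve candidates."""
    if noise_db <= threshold_db:      # steps 305-306: no trigger, keep waiting
        return None
    frames = capture()                # step 308: 3D camera frames
    candidates = recognize(frames)    # steps 310-311: model candidates
    info = disambiguate(candidates)   # steps 312-313: semantic understanding
    present(info)                     # step 314: output as text or voice
    return info
```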
Fig. 4 is a structural block diagram of a lip reading identification device provided by an embodiment of the present application. The device may be implemented in software and/or hardware, is typically integrated in a mobile terminal, and can accurately obtain lip reading information by executing the lip reading recognition method. As shown in Fig. 4, the device includes:
a lip image acquisition module 401, configured to obtain at least one frame of 3D lip image of the current user through a 3D camera when it is detected that a lip reading identification event has been triggered;
a lip image input module 402, configured to input the 3D lip images into a pre-trained lip reading identification model;
a lip reading information determination module 403, configured to determine lip reading information corresponding to the 3D lip images according to the output result of the lip reading identification model; and
a lip reading information providing module 404, configured to provide the lip reading information to the current user.
With the lip reading identification device provided in this embodiment of the application, when it is detected that a lip reading identification event has been triggered, at least one frame of 3D lip image of the current user is obtained through a 3D camera and input into a pre-trained lip reading identification model; lip reading information corresponding to the 3D lip image is determined according to the output result of the model and then provided to the current user. With this technical solution, the pre-built lip reading identification model performs simple, fast lip reading identification on the 3D lip images, further improving the accuracy of lip reading identification, effectively enhancing the user's human-computer interaction experience, and better meeting user demand.
Optionally, the device further includes:
a sample lip image collection module, configured to, before it is detected that a lip reading identification event has been triggered, collect at least one frame of 3D sample lip image for each individual in a preset crowd and obtain lip reading content corresponding to the 3D sample lip images;
a sample lip image marking module, configured to mark the 3D sample lip images according to the lip reading content to obtain a training sample set; and
a lip reading identification model training module, configured to train a preset machine learning model using the training sample set to obtain a lip reading identification model.
Optionally, the device further includes:
a first expression information acquisition module, configured to obtain first facial expression information corresponding to the 3D sample lip images before the 3D sample lip images are marked according to the lip reading content to obtain the training sample set;
the sample lip image marking module being configured to:
mark the 3D sample lip images according to the lip reading content and the first facial expression information to obtain a training sample set.
Correspondingly, the device further includes:
a second expression information acquisition module, configured to obtain second facial expression information corresponding to the 3D lip images before the 3D lip images are input into the pre-trained lip reading identification model;
the lip image input module being configured to:
input the 3D lip images and the second facial expression information into the pre-trained lip reading identification model.
Optionally, the second expression information acquisition module is configured to:
input the 3D facial image corresponding to the 3D lip images into an expression recognition model to obtain the second facial expression information corresponding to the 3D lip images.
Optionally, the lip reading information determination module is configured to:
obtain lip reading recognition results according to the output result of the lip reading identification model;
input the lip reading recognition results into a pre-built semantic understanding model, where the semantic understanding model is configured to, when multiple lip reading recognition results are obtained, determine lip reading information from the multiple results based on contextual relationships; and
determine the output result of the semantic understanding model as the lip reading information corresponding to the 3D lip images.
Optionally, detecting that a lip reading identification event has been triggered includes:
monitoring whether a lip reading identification instruction input by the current user is received, and when the lip reading identification instruction is received, determining that a lip reading identification event has been triggered; or
obtaining the current time and/or the current location of the mobile terminal, and when the current time is within a preset time period and/or the current location is a preset location, determining that a lip reading identification event has been triggered; or
obtaining the environmental noise at the current location of the mobile terminal, and when the environmental noise exceeds a preset noise threshold, determining that a lip reading identification event has been triggered; or
obtaining a communication message sent by another mobile terminal, and when the communication message contains a preset keyword, determining that a lip reading identification event has been triggered.
Optionally, the device further includes:
a feedback information receiving module, configured to receive, after the lip reading information is provided to the current user, feedback information from the current user on whether the recognition result of the lip reading information is accurate; and
a feedback information sending module, configured to send the feedback information to the lip reading identification model for training.
An embodiment of the present application also provides a storage medium containing computer-executable instructions which, when executed by a computer processor, perform a lip reading recognition method including:
when it is detected that a lip reading identification event has been triggered, obtaining at least one frame of 3D lip image of the current user through a 3D camera;
inputting the 3D lip images into a pre-trained lip reading identification model;
determining lip reading information corresponding to the 3D lip images according to the output result of the lip reading identification model; and
providing the lip reading information to the current user.
Storage medium — any of various types of memory devices or storage devices. The term "storage medium" is intended to include: installation media, such as CD-ROM, floppy disk, or tape devices; computer system memory or random access memory, such as DRAM, DDR RAM, SRAM, EDO RAM, Rambus RAM, and the like; non-volatile memory, such as flash memory or magnetic media (e.g. a hard disk or optical storage); registers or other memory elements of similar type; etc. The storage medium may also include other types of memory or combinations thereof. In addition, the storage medium may be located in a first computer system in which the program is executed, or in a different, second computer system connected to the first computer system through a network (such as the Internet); the second computer system may provide the program instructions to the first computer for execution. The term "storage medium" may include two or more storage media that may reside in different locations (e.g. in different computer systems connected through a network). The storage medium may store program instructions (e.g. implemented as computer programs) executable by one or more processors.
Of course, in the storage medium containing computer-executable instructions provided by this embodiment of the application, the computer-executable instructions are not limited to the lip reading identification operations described above, and may also perform relevant operations in the lip reading recognition method provided by any embodiment of the application.
An embodiment of the present application provides a mobile terminal in which the lip reading identification device provided by the embodiments of the application can be integrated. Fig. 5 is a structural schematic diagram of a mobile terminal provided by an embodiment of the present application. The mobile terminal 500 may include: a memory 501, a processor 502, and a computer program stored in the memory and executable on the processor, where the processor 502 implements the lip reading recognition method described in the embodiments of the present application when executing the computer program.
With the mobile terminal provided by this embodiment of the application, the pre-built lip reading identification model performs simple, fast lip reading identification on 3D lip images, further improving the accuracy of lip reading identification, effectively enhancing the user's human-computer interaction experience, and better meeting user demand.
Fig. 6 is a structural schematic diagram of another mobile terminal provided by an embodiment of the present application. The mobile terminal may include: a housing (not shown), a memory 601, a central processing unit (CPU) 602 (also called a processor, hereinafter CPU), a circuit board (not shown), and a power supply circuit (not shown). The circuit board is disposed inside the space enclosed by the housing; the CPU 602 and the memory 601 are arranged on the circuit board; the power supply circuit supplies power to each circuit or device of the mobile terminal; the memory 601 stores executable program code; and the CPU 602 reads the executable program code stored in the memory 601 to run the computer program corresponding to the executable program code, thereby implementing the following steps:
when it is detected that a lip reading identification event has been triggered, obtaining at least one frame of 3D lip image of the current user through a 3D camera;
inputting the 3D lip images into a pre-trained lip reading identification model;
determining lip reading information corresponding to the 3D lip images according to the output result of the lip reading identification model; and
providing the lip reading information to the current user.
The mobile terminal further includes: a peripheral interface 603, an RF (radio frequency) circuit 605, an audio circuit 606, a loudspeaker 611, a power management chip 608, an input/output (I/O) subsystem 609, other input/control devices 610, a touch screen 612, and an external port 604; these components communicate through one or more communication buses or signal lines 607.
It should be understood that the illustrated mobile terminal 600 is merely one example of a mobile terminal; the mobile terminal 600 may have more or fewer components than shown in the drawings, may combine two or more components, or may have a different component configuration. The various components shown in the drawings may be implemented in hardware, software, or a combination of hardware and software, including one or more signal processing and/or application-specific integrated circuits.
The mobile terminal for lip reading identification provided in this embodiment is described in detail below, taking a mobile phone as an example.
Memory 601: the memory 601 can be accessed by the CPU 602, the peripheral interface 603, and the like. The memory 601 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic disk storage devices, flash memory devices, or other non-volatile solid-state storage devices.
Peripheral interface 603: the peripheral interface 603 can connect the input and output peripherals of the device to the CPU 602 and the memory 601.
I/O subsystem 609: the I/O subsystem 609 can connect input/output peripherals on the device, such as the touch screen 612 and the other input/control devices 610, to the peripheral interface 603. The I/O subsystem 609 may include a display controller 6091 and one or more input controllers 6092 for controlling the other input/control devices 610. The one or more input controllers 6092 receive electrical signals from, or send electrical signals to, the other input/control devices 610, which may include physical buttons (push buttons, rocker buttons, etc.), dials, slide switches, joysticks, or click wheels. Notably, an input controller 6092 may be connected to any of the following: a keyboard, an infrared port, a USB interface, or a pointing device such as a mouse.
Touch screen 612: the touch screen 612 is the input and output interface between the mobile terminal and the user, and displays visual output to the user; the visual output may include graphics, text, icons, video, and the like.
The display controller 6091 in the I/O subsystem 609 receives electrical signals from, or sends electrical signals to, the touch screen 612. The touch screen 612 detects contact on the touch screen, and the display controller 6091 converts the detected contact into interaction with user interface objects displayed on the touch screen 612, thereby realizing human-computer interaction; the user interface objects displayed on the touch screen 612 may be icons for running games, icons for connecting to corresponding networks, and the like. Notably, the device may also include an optical mouse, which is a touch-sensitive surface that does not display visual output, or an extension of the touch-sensitive surface formed by the touch screen.
RF circuits 605 are mainly used for establishing the communication of mobile phone and wireless network (i.e. network side), realize mobile phone and wireless network The data receiver of network and transmission.Such as transmitting-receiving short message, Email etc..Specifically, RF circuits 605 receive and send RF letters Number, RF signals are also referred to as electromagnetic signal, and RF circuits 605 convert electrical signals to electromagnetic signal or electromagnetic signal is converted to telecommunications Number, and communicated with mobile communications network and other equipment by the electromagnetic signal.RF circuits 605 may include being used for Execute the known circuit of these functions comprising but it is not limited to antenna system, RF transceivers, one or more amplifiers, tuning Device, one or more oscillators, digital signal processor, CODEC (COder-DECoder, coder) chipset, Yong Hubiao Know module (Subscriber Identity Module, SIM) etc..
The audio circuit 606 is mainly used to receive audio data from the peripheral interface 603, convert the audio data into an electrical signal, and send the electrical signal to the loudspeaker 611.
The loudspeaker 611 is used to restore the voice signal received by the mobile phone from the wireless network through the RF circuit 605 to sound, and to play the sound to the user.
The power management chip 608 is used to supply power to, and manage the power of, the hardware connected through the CPU 602, the I/O subsystem, and the peripheral interface.
The lip reading recognition apparatus, storage medium, and mobile terminal provided in the above embodiments can execute the lip reading recognition method provided in any embodiment of the present application, and have the corresponding functional modules and beneficial effects for executing the method. For technical details not described in detail in the above embodiments, reference may be made to the lip reading recognition method provided in any embodiment of the present application.
Note that the above are only preferred embodiments of the present application and the technical principles applied. Those skilled in the art will appreciate that the present application is not limited to the specific embodiments described here, and that various obvious changes, readjustments, and substitutions can be made by those skilled in the art without departing from the protection scope of the present application. Therefore, although the present application has been described in further detail through the above embodiments, the present application is not limited to the above embodiments, and may include other equivalent embodiments without departing from the concept of the present application; the scope of the present application is determined by the scope of the appended claims.

Claims (10)

1. A lip reading recognition method, characterized by comprising:
when it is detected that a lip reading recognition event is triggered, acquiring at least one frame of 3D lip image of a current user through a 3D camera;
inputting the 3D lip image into a pre-trained lip reading recognition model;
determining lip reading information corresponding to the 3D lip image according to an output result of the lip reading recognition model; and
providing the lip reading information to the current user.
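As an illustration only (not part of the claimed subject matter), the flow of claim 1 can be sketched in Python. The frame representation and the model are hypothetical stand-ins; a real implementation would feed 3D camera frames to a trained neural network.

```python
# Illustrative sketch of the claim 1 pipeline: frames -> model -> lip reading
# text. The stub model and string "frames" are hypothetical stand-ins.

def recognize_lip_reading(frames, model):
    """Run a trained lip reading model over 3D lip frames and join the output."""
    outputs = [model(frame) for frame in frames]
    return " ".join(word for word in outputs if word)

def stub_model(frame):
    """Stand-in for the pre-trained lip reading recognition model."""
    return {"frame_a": "hello", "frame_b": "world"}.get(frame, "")

result = recognize_lip_reading(["frame_a", "frame_b"], stub_model)
print(result)  # hello world
```

The per-frame outputs are naively joined here; the claims leave the aggregation strategy open.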
2. The method according to claim 1, characterized in that, before it is detected that the lip reading recognition event is triggered, the method further comprises:
acquiring at least one frame of 3D sample lip image for each individual in a preset crowd, and acquiring lip reading content corresponding to the 3D sample lip images;
labeling the 3D sample lip images according to the lip reading content to obtain a training sample set; and
training a preset machine learning model with the training sample set to obtain the lip reading recognition model.
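For illustration, the training-set construction of claim 2 can be sketched as pairing each sample image sequence with its lip reading content label. The dictionary layout is an assumption made here for clarity, not the patent's data format.

```python
# Illustrative sketch of claim 2's labeling step: each 3D sample lip image
# sequence is paired with its lip reading content to form the training set.

def build_training_set(sample_images, lip_contents):
    """Label each 3D sample lip image sequence with its lip reading content."""
    if len(sample_images) != len(lip_contents):
        raise ValueError("each sample sequence needs one lip reading label")
    return [{"frames": frames, "label": content}
            for frames, content in zip(sample_images, lip_contents)]

samples = [["f1", "f2"], ["f3", "f4", "f5"]]   # hypothetical frame sequences
contents = ["open the browser", "call home"]   # corresponding lip content
training_set = build_training_set(samples, contents)
print(len(training_set))  # 2
```

The resulting labeled set would then be used to train the preset machine learning model.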
3. The method according to claim 2, characterized in that, before labeling the 3D sample lip images according to the lip reading content to obtain the training sample set, the method further comprises:
acquiring first facial expression information corresponding to the 3D sample lip images;
wherein labeling the 3D sample lip images according to the lip reading content to obtain the training sample set comprises:
labeling the 3D sample lip images according to the lip reading content and the first facial expression information to obtain the training sample set;
and correspondingly, before inputting the 3D lip image into the pre-trained lip reading recognition model, the method further comprises:
acquiring second facial expression information corresponding to the 3D lip image;
wherein inputting the 3D lip image into the pre-trained lip reading recognition model comprises:
inputting the 3D lip image and the second facial expression information into the pre-trained lip reading recognition model.
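As a sketch of claim 3's two-channel labeling, each training sample can carry both the lip reading content and the facial expression information. The field names are illustrative assumptions.

```python
# Illustrative sketch of claim 3: samples labeled with both lip reading
# content and facial expression information (field names are hypothetical).

def label_samples(images, contents, expressions):
    """Attach lip reading content and expression info to each sample image."""
    return [{"frames": f, "label": c, "expression": e}
            for f, c, e in zip(images, contents, expressions)]

train_set = label_samples([["f1", "f2"]], ["good morning"], ["smiling"])
print(train_set[0]["expression"])  # smiling
```

At inference time, the 3D lip image and the second facial expression information would be passed to the model together in the same way.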
4. The method according to claim 3, characterized in that acquiring the second facial expression information corresponding to the 3D lip image comprises:
inputting a 3D face image corresponding to the 3D lip image into an expression recognition model to obtain the second facial expression information corresponding to the 3D lip image.
5. The method according to claim 1, characterized in that determining the lip reading information corresponding to the 3D lip image according to the output result of the lip reading recognition model comprises:
obtaining a lip reading recognition result according to the output result of the lip reading recognition model;
inputting the lip reading recognition result into a pre-built semantic understanding model, wherein the semantic understanding model is configured to determine, when multiple lip reading recognition results are obtained, the lip reading information from the multiple lip reading recognition results based on contextual relationships; and
determining the output result of the semantic understanding model as the lip reading information corresponding to the 3D lip image.
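The semantic understanding step of claim 5 can be illustrated as follows: given several candidate transcriptions, keep the one most consistent with the surrounding context. The word-overlap score below is a deliberately simple stand-in for the patent's (unspecified) semantic model.

```python
# Illustrative sketch of claim 5's context-based selection: among multiple
# lip reading recognition results, pick the candidate sharing the most words
# with the context. The overlap score is a hypothetical scoring choice.

def choose_by_context(candidates, context):
    """Return the candidate with the largest word overlap with the context."""
    context_words = set(context.lower().split())
    return max(candidates,
               key=lambda c: len(context_words & set(c.lower().split())))

context = "please play some music for me"
candidates = ["pause the music", "pours the mew sick"]
print(choose_by_context(candidates, context))  # pause the music
```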
6. The method according to claim 1, characterized in that detecting that the lip reading recognition event is triggered comprises:
monitoring whether a lip reading recognition instruction input by the current user is received, and when the lip reading recognition instruction is received, determining that the lip reading recognition event is detected to be triggered; or
acquiring the current time and/or the current location of the mobile terminal, and when the current time falls within a preset time period and/or the current location is a preset location, determining that the lip reading recognition event is detected to be triggered; or
acquiring environmental noise at the current location of the mobile terminal, and when the environmental noise exceeds a preset noise threshold, determining that the lip reading recognition event is detected to be triggered; or
acquiring a communication message sent by another mobile terminal, and when the communication message contains a preset keyword, determining that the lip reading recognition event is detected to be triggered.
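The alternative trigger conditions of claim 6 reduce to a disjunction of checks. The threshold value, preset hours, and keyword set below are illustrative assumptions, not values from the patent.

```python
# Illustrative sketch of claim 6's trigger logic: any one of the four
# conditions (user instruction, time/location, noise level, message keyword)
# activates lip reading. Constants are hypothetical example values.

NOISE_THRESHOLD_DB = 70.0        # preset noise threshold (assumed)
PRESET_HOURS = {22, 23}          # preset time period, hours of day (assumed)
PRESET_KEYWORDS = {"meeting", "quiet"}  # preset keywords (assumed)

def lip_reading_triggered(user_instruction=False, hour=None,
                          noise_db=None, message=None):
    """Return True if any of claim 6's trigger conditions holds."""
    if user_instruction:
        return True
    if hour is not None and hour in PRESET_HOURS:
        return True
    if noise_db is not None and noise_db > NOISE_THRESHOLD_DB:
        return True
    if message is not None and any(k in message.lower() for k in PRESET_KEYWORDS):
        return True
    return False

print(lip_reading_triggered(noise_db=85.0))  # True
```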
7. The method according to any one of claims 1-6, characterized in that, after providing the lip reading information to the current user, the method further comprises:
receiving feedback information from the current user on whether the recognition result of the lip reading information is accurate; and
sending the feedback information to the lip reading recognition model for further training.
8. A lip reading recognition apparatus, characterized by comprising:
a lip image acquisition module, configured to acquire at least one frame of 3D lip image of a current user through a 3D camera when it is detected that a lip reading recognition event is triggered;
a lip image input module, configured to input the 3D lip image into a pre-trained lip reading recognition model;
a lip reading information determination module, configured to determine lip reading information corresponding to the 3D lip image according to an output result of the lip reading recognition model; and
a lip reading information providing module, configured to provide the lip reading information to the current user.
9. A computer-readable storage medium on which a computer program is stored, characterized in that, when the program is executed by a processor, the lip reading recognition method according to any one of claims 1-7 is implemented.
10. A mobile terminal, characterized by comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the lip reading recognition method according to any one of claims 1-7.
CN201810372876.7A 2018-04-24 2018-04-24 Lip language identification method, device, storage medium and mobile terminal Expired - Fee Related CN108537207B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810372876.7A CN108537207B (en) 2018-04-24 2018-04-24 Lip language identification method, device, storage medium and mobile terminal


Publications (2)

Publication Number Publication Date
CN108537207A true CN108537207A (en) 2018-09-14
CN108537207B CN108537207B (en) 2021-01-22

Family

ID=63478460

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810372876.7A Expired - Fee Related CN108537207B (en) 2018-04-24 2018-04-24 Lip language identification method, device, storage medium and mobile terminal

Country Status (1)

Country Link
CN (1) CN108537207B (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104484656A (en) * 2014-12-26 2015-04-01 安徽寰智信息科技股份有限公司 Deep learning-based lip language recognition lip shape model library construction method
CN105022470A (en) * 2014-04-17 2015-11-04 中兴通讯股份有限公司 Method and device of terminal operation based on lip reading
CN107799125A (en) * 2017-11-09 2018-03-13 维沃移动通信有限公司 A kind of audio recognition method, mobile terminal and computer-readable recording medium


Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109558788A (en) * 2018-10-08 2019-04-02 清华大学 Silent voice inputs discrimination method, computing device and computer-readable medium
CN109558788B (en) * 2018-10-08 2023-10-27 清华大学 Silence voice input identification method, computing device and computer readable medium
CN109637521A (en) * 2018-10-29 2019-04-16 深圳壹账通智能科技有限公司 A kind of lip reading recognition methods and device based on deep learning
CN111611827A (en) * 2019-02-25 2020-09-01 北京嘀嘀无限科技发展有限公司 Image processing method and device
CN110213431A (en) * 2019-04-30 2019-09-06 维沃移动通信有限公司 Message method and mobile terminal
CN110276259A (en) * 2019-05-21 2019-09-24 平安科技(深圳)有限公司 Lip reading recognition methods, device, computer equipment and storage medium
CN110276259B (en) * 2019-05-21 2024-04-02 平安科技(深圳)有限公司 Lip language identification method, device, computer equipment and storage medium
CN111984818A (en) * 2019-05-23 2020-11-24 北京地平线机器人技术研发有限公司 Singing following recognition method and device, storage medium and electronic equipment
CN110427809A (en) * 2019-06-21 2019-11-08 平安科技(深圳)有限公司 Lip reading recognition methods, device, electronic equipment and medium based on deep learning
WO2020252922A1 (en) * 2019-06-21 2020-12-24 平安科技(深圳)有限公司 Deep learning-based lip reading method and apparatus, electronic device, and medium
CN110427992A (en) * 2019-07-23 2019-11-08 杭州城市大数据运营有限公司 Data matching method, device, computer equipment and storage medium
CN110865705A (en) * 2019-10-24 2020-03-06 中国人民解放军军事科学院国防科技创新研究院 Multi-mode converged communication method and device, head-mounted equipment and storage medium
CN110865705B (en) * 2019-10-24 2023-09-19 中国人民解放军军事科学院国防科技创新研究院 Multi-mode fusion communication method and device, head-mounted equipment and storage medium
CN111190484A (en) * 2019-12-25 2020-05-22 中国人民解放军军事科学院国防科技创新研究院 Multi-mode interaction system and method
CN112084927B (en) * 2020-09-02 2022-12-20 中国人民解放军军事科学院国防科技创新研究院 Lip language identification method fusing multiple visual information
CN112084927A (en) * 2020-09-02 2020-12-15 中国人民解放军军事科学院国防科技创新研究院 Lip language identification method fusing multiple visual information
CN112633211A (en) * 2020-12-30 2021-04-09 海信视像科技股份有限公司 Service equipment and man-machine interaction method
CN112817575B (en) * 2021-01-19 2024-02-20 中科方寸知微(南京)科技有限公司 Assembly language editor based on lip language identification and identification method
CN112817575A (en) * 2021-01-19 2021-05-18 中科方寸知微(南京)科技有限公司 Lip language identification-based assembly language editor and identification method
WO2023006033A1 (en) * 2021-07-29 2023-02-02 华为技术有限公司 Speech interaction method, electronic device, and medium
CN113742687B (en) * 2021-08-31 2022-10-21 深圳时空数字科技有限公司 Internet of things control method and system based on artificial intelligence
CN113742687A (en) * 2021-08-31 2021-12-03 深圳时空数字科技有限公司 Internet of things control method and system based on artificial intelligence
CN114842846A (en) * 2022-04-21 2022-08-02 歌尔股份有限公司 Method and device for controlling head-mounted equipment and computer readable storage medium
CN116431005A (en) * 2023-06-07 2023-07-14 安徽大学 Unmanned aerial vehicle control method and system based on improved mobile terminal lip language recognition
CN116431005B (en) * 2023-06-07 2023-09-12 安徽大学 Unmanned aerial vehicle control method and system based on improved mobile terminal lip language recognition

Also Published As

Publication number Publication date
CN108537207B (en) 2021-01-22

Similar Documents

Publication Publication Date Title
CN108537207A (en) Lip reading recognition methods, device, storage medium and mobile terminal
CN108363706B (en) Method and device for man-machine dialogue interaction
CN109348135A (en) Photographic method, device, storage medium and terminal device
CN107944447B (en) Image classification method and device
CN108566516A (en) Image processing method, device, storage medium and mobile terminal
EP4099709A1 (en) Data processing method and apparatus, device, and readable storage medium
WO2020253128A1 (en) Voice recognition-based communication service method, apparatus, computer device, and storage medium
CN108345581A (en) A kind of information identifying method, device and terminal device
CN106506335A (en) The method and device of sharing video frequency file
CN111050023A (en) Video detection method and device, terminal equipment and storage medium
CN111382748B (en) Image translation method, device and storage medium
CN107564526B (en) Processing method, apparatus and machine-readable medium
CN109471919B (en) Zero pronoun resolution method and device
CN112154431A (en) Man-machine interaction method and electronic equipment
CN110619873A (en) Audio processing method, device and storage medium
CN111414772B (en) Machine translation method, device and medium
CN110349577B (en) Man-machine interaction method and device, storage medium and electronic equipment
CN108628819A (en) Treating method and apparatus, the device for processing
CN112735396A (en) Speech recognition error correction method, device and storage medium
CN112256827A (en) Sign language translation method and device, computer equipment and storage medium
CN112036174B (en) Punctuation marking method and device
CN114822543A (en) Lip language identification method, sample labeling method, model training method, device, equipment and storage medium
CN111580773A (en) Information processing method, device and storage medium
CN112017670B (en) Target account audio identification method, device, equipment and medium
CN107958273B (en) Volume adjusting method and device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
     Granted publication date: 20210122