CN108171148A - Method and system for establishing a lip-reading learning cloud platform - Google Patents
Method and system for establishing a lip-reading learning cloud platform
- Publication number
- CN108171148A CN108171148A CN201711432189.1A CN201711432189A CN108171148A CN 108171148 A CN108171148 A CN 108171148A CN 201711432189 A CN201711432189 A CN 201711432189A CN 108171148 A CN108171148 A CN 108171148A
- Authority
- CN
- China
- Prior art keywords
- lip reading
- cloud platform
- data
- lip
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The present invention provides a method and system for establishing a lip-reading learning cloud platform, comprising: acquiring lip-reading input, the lip reading comprising lip and tongue movements and a corresponding sentence; extracting the lip reading, splitting the lip and tongue movements into image data and the sentence into voice data, and transmitting the image data and voice data to the lip-reading learning cloud platform for data training; storing the trained data on a master node designated by the lip-reading learning cloud platform to form a training database; and building the distributed storage of the lip-reading learning cloud platform, organizing the training-database data onto other nodes of the platform as needed. By improving the accuracy with which the lip-reading model extracts sentences, the present invention improves the efficiency of lip-reading learning and promotes its development.
Description
Technical field
Embodiments of the present invention relate to the field of communication technology, and in particular to a method and system for establishing a lip-reading learning cloud platform.
Background art
Lip reading plays a crucial role in human communication and speech understanding: when the video of a person saying one phoneme is dubbed with a different phoneme spoken by someone else, the listener perceives a third, different phoneme.
In implementing the present invention, the inventors found that the prior art has at least the following problems:
Lip reading is a notoriously difficult task for humans. Apart from the lips, and sometimes the tongue and teeth, most lip-reading cues are ambiguous and hard to distinguish without linguistic context.
Automating lip reading is therefore an important goal. Machine lip readers have great practical potential, for example in improved hearing aids, silent dictation in public spaces, private conversation, speech recognition in noisy environments, biometric identification, and the processing of silent films. Machine lip reading is difficult because spatiotemporal features, such as position and motion, must be extracted from video. Although deep learning methods attempt to extract these features end to end, existing work only performs classification of single words rather than sentence-level sequence prediction.
Most current lip-reading learning relies on manual offline training and on online learning software. Human language, however, is affected by regional variation and differing nationalities, and every region has its dialects. Manual offline training is based on the official standard language, so its applicability in individual regions falls far short of expectations and learning does not reach the intended effect. Online lip-reading learning software likewise considers only the official standard language and ignores local dialects. Moreover, extraction is performed word by word within a sentence rather than predicted at the level of the entire sentence; this is a serious drawback, and extraction accuracy falls short of requirements.
It should be noted that the above introduction of the technical background is intended only to facilitate a clear and complete explanation of the technical solution of the present invention and to aid the understanding of those skilled in the art. It should not be assumed that these solutions are known to those skilled in the art merely because they are discussed in the background section of the present invention.
Summary of the invention
In view of the above problems, embodiments of the present invention aim to provide a method and system for establishing a lip-reading learning cloud platform that improve the accuracy with which the lip-reading model extracts sentences, thereby improving the efficiency of lip-reading learning and promoting its development.
To achieve the above object, an embodiment of the present invention provides a method for establishing a lip-reading learning cloud platform, comprising: acquiring lip-reading input, the lip reading comprising lip and tongue movements and a corresponding sentence; extracting the lip reading, splitting the lip and tongue movements into image data and the sentence into voice data, and transmitting the image data and voice data to the lip-reading learning cloud platform for data training; storing the trained data on a master node designated by the lip-reading learning cloud platform to form a training database; and building the distributed storage of the platform, organizing the training-database data onto other nodes of the platform as needed.
Further, the method also comprises: building a distributed-system hardware platform with at least two nodes, each node comprising a central processing unit (CPU) and a graphics processing unit (GPU); using the gRPC library for low-level inter-process communication; and using the tools provided by TensorFlow to define the cluster's cluster_spec and configure the multi-machine, multi-GPU setup.
Further, the lip reading is extracted, specifically: the lip reading is extracted through TensorFlow.
Further, transmitting the image data and voice data to the lip-reading learning cloud platform for data training and forming the training database comprises: splitting the lip and tongue movements into image data and the sentence into voice data; partitioning the data according to the data-association model's partitioning algorithm, packaging the voice data and image data into training tasks, and assigning them to different worker nodes; on each worker node, the CPU dispatches tasks to multiple GPUs, each GPU sends its training data to the CPU on completing a training task, and the CPU averages the training data and updates the parameters; when a single node's training task is complete, it broadcasts its data to the other nodes of the lip-reading learning cloud platform and waits for their training data; and after all nodes have completed their computing tasks, the final training data is stored on the designated master node, forming the training database.
Further, the neural-network architecture of the lip-reading learning cloud platform is a convolutional neural network with 128 convolution kernels and 16 layers, whose names and roles are defined as: init, network initialization; conv1, convolution with rectified-linear activation; pool1, max pooling; norm1, local response normalization; conv2, convolution with rectified-linear activation; pool2, max pooling; som, self-organizing-structure input layer; som2, self-organizing-structure output layer; norm2, local response normalization; hand1, manual network perturbation based on intermediate results; conv3, convolution with rectified-linear activation; pool3, max pooling; re, recurrent residual computation; local3, fully connected layer with rectified-linear activation; local4, fully connected layer with rectified-linear activation; and softmax_linear, a linear transform that outputs logits.
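For reference, the 16 named layers above can be summarized as an ordered list. This is only an illustrative inventory of the names and roles given in the text, not an executable network definition.

```python
# Illustrative summary of the 16-layer stack described in the patent text.
# The (name, role) pairs follow the definitions above; nothing here is executable
# network code, it only records the ordering.
LAYERS = [
    ("init", "network initialization"),
    ("conv1", "convolution with rectified-linear activation"),
    ("pool1", "max pooling"),
    ("norm1", "local response normalization"),
    ("conv2", "convolution with rectified-linear activation"),
    ("pool2", "max pooling"),
    ("som", "self-organizing-structure input layer"),
    ("som2", "self-organizing-structure output layer"),
    ("norm2", "local response normalization"),
    ("hand1", "manual network perturbation based on intermediate results"),
    ("conv3", "convolution with rectified-linear activation"),
    ("pool3", "max pooling"),
    ("re", "recurrent residual computation"),
    ("local3", "fully connected layer with rectified-linear activation"),
    ("local4", "fully connected layer with rectified-linear activation"),
    ("softmax_linear", "linear transform that outputs logits"),
]
```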
Further, the convolutional neural network includes a feedback self-oscillation mechanism that allows information to be passed across layers; specifically, the pool3 layer passes residual information back to the hand1 layer.
Further, the convolutional neural network includes a recursive structure at the re layer; specifically, an Elman network structure is used, with the hand1, conv3, pool3, and re layers acting as a hidden layer that performs recursive feedback.
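As an illustration of the Elman-style recursion, one recurrent step and its unrolling over a sequence might look like the following. The scalar weights and tanh nonlinearity are assumptions for a minimal sketch, not the patent's actual layer.

```python
import math

def elman_step(x_t, h_prev, w_xh=1.0, w_hh=0.5, b=0.0):
    """One Elman recurrent step (scalar sketch): the new hidden state
    depends on the current input and the previous hidden state."""
    return math.tanh(w_xh * x_t + w_hh * h_prev + b)

def elman_run(xs, h0=0.0):
    """Unroll the recurrence over an input sequence, returning all hidden states."""
    h, states = h0, []
    for x in xs:
        h = elman_step(x, h)
        states.append(h)
    return states
```

In the patent's architecture the hidden state would be the joint activity of the hand1, conv3, pool3, and re layers rather than a scalar; the sketch only shows the feedback pattern.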
Further, the convolutional neural network includes an SOM network structure: the som and som2 layers are the self-organizing-structure input and output layers respectively; the som-layer neurons are arranged as a matrix in two-dimensional space, each neuron has a weight vector, and after the som layer receives an input vector the som2-layer neurons are activated and adjusted according to the self-organizing-network training method.
To achieve the above object, an embodiment of the present invention also provides a lip-reading learning cloud platform system, comprising: an acquisition unit for acquiring lip-reading input, the lip reading comprising lip and tongue movements and a corresponding sentence; an extraction unit for splitting the lip and tongue movements into image data and the sentence into voice data, and transmitting the image data and voice data to worker nodes of the lip-reading learning cloud platform for data training; a master-node unit for storing the trained data on the master node designated by the platform to form a training database; and a building unit for constructing the distributed storage of the platform and organizing the training-database data onto the platform's other nodes as needed.
In summary, the method and system for establishing a lip-reading learning cloud platform provided by embodiments of the present invention use a TensorFlow lip-reading model to extract whole sentences, which is more accurate than the previous word-by-word extraction. Meanwhile, the cloud-platform design lets users learn at any time and makes it easy to exchange with other learners.
Description of the drawings
To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flow diagram of the method for establishing a lip-reading learning cloud platform provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of the neural-network structure between the hand1 and re layers provided by an embodiment of the present invention.
Specific embodiment
To make the purpose, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art without creative effort, based on the embodiments of the present invention, fall within the scope of protection of the present invention.
Embodiments of the present invention build the lip-reading learning cloud platform on distributed TensorFlow. TensorFlow is the second-generation machine-learning system developed by Google on the basis of DistBelief, and its name comes from its own operating principle: a tensor is an N-dimensional array, flow denotes computation on a dataflow graph, and TensorFlow describes tensors flowing from one end of the graph to the other. TensorFlow is a system that carries complex data structures into an artificial neural network for analysis and processing.
TensorFlow expresses high-level machine-learning computations, greatly simplifies the first-generation system, and offers better flexibility and scalability. A highlight of TensorFlow is its support for distributed computation on heterogeneous devices: models run automatically on every platform, from mobile phones and single CPUs/GPUs to distributed systems of hundreds or thousands of GPU cards.
TensorFlow can build a lip-reading model at the sentence level: an end-to-end, speaker-independent deep model that learns spatiotemporal visual features and a sequence model. On the GRID corpus, the sentence-level model built on TensorFlow achieves 93.4% accuracy, surpassing experienced human lip readers and the previous best accuracy of 79.6%. A TensorFlow-based lip-reading model can therefore effectively improve reading accuracy and thus help improve the efficiency of lip-reading learning.
The model is trained end to end to make speaker-independent sentence-level predictions. It operates at the character level, using a spatiotemporal convolutional neural network (STCNN), LSTMs, and connectionist temporal classification (CTC) loss.
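CTC maps frame-level label paths onto shorter label sequences by first merging repeated labels and then removing blanks. A minimal sketch of that collapsing rule follows; the integer labels and the choice of 0 as the blank index are illustrative assumptions.

```python
def ctc_collapse(path, blank=0):
    """Collapse a CTC alignment: merge consecutive repeats, then drop blanks.

    This is the many-to-one mapping at the heart of CTC decoding; the full
    CTC loss sums the probabilities of all paths that collapse to a target.
    """
    out = []
    prev = None
    for label in path:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out
```

For example, the frame path `[1, 1, 0, 1, 2, 2, 0]` collapses to `[1, 1, 2]`: the blank between the two 1s keeps them from merging, which is how CTC can emit repeated characters.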
Experiments on the public sentence-level dataset GRID corpus (Cooke et al., 2006) show that the sentence-level model built with TensorFlow reaches 93.4% word accuracy. By comparison, the previous best speaker-independent word-classification result on this task is 79.6%. The model's performance was also compared with that of hearing-impaired people who can lip-read: on average, on the same sentences, it performs 1.78 times better.
To implement the method of the present invention, a distributed-system hardware platform must be built, for example with two nodes, which requires:
1. each node with an Intel Core i5-7400 CPU and a GeForce GTX 1050 GPU;
2. a switch with 100 G bandwidth.
Low-level inter-process communication uses the gRPC library, and the tools provided by TensorFlow define the cluster's cluster_spec and configure the multi-machine, multi-GPU setup.
These are only examples and are not limiting.
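Concretely, the two-node cluster just described might be expressed as a cluster_spec of the following shape. The host names, ports, and job split are illustrative assumptions, and the TensorFlow calls are kept in comments so the sketch stays dependency-free.

```python
# Hypothetical cluster layout for two machines: one parameter-server task on
# the first node's CPU and one worker task per machine driving the GPUs.
cluster_spec = {
    "ps": ["node1.example.com:2222"],
    "worker": ["node1.example.com:2223", "node2.example.com:2222"],
}

# With TensorFlow installed, this dict would typically be wrapped as:
#   cluster = tf.train.ClusterSpec(cluster_spec)
#   server = tf.train.Server(cluster, job_name="worker", task_index=0)
# where the server processes communicate over gRPC, as the text describes.
n_workers = len(cluster_spec["worker"])
```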
An embodiment of the present invention provides a method for establishing a lip-reading learning cloud platform. Referring to Fig. 1, the method can be roughly divided into a background training stage and a service-provision stage, and may specifically comprise the following steps:
Step S1: acquire lip-reading input; the lip reading comprises lip and tongue movements and a corresponding sentence.
Step S2: extract the lip reading through TensorFlow, splitting the lip and tongue movements into image data and the sentence into voice data, and transmit the image data and voice data to the lip-reading learning cloud platform for data training.
In this embodiment, the data flow during the background training stage is as follows. The data is partitioned according to the data-association model's partitioning algorithm, and the voice data and image data are packaged into tasks assigned to different worker nodes. On each worker node, the CPU (central processing unit) dispatches the tasks to multiple GPUs (graphics processing units); each time a GPU completes a computing task, it sends the data to the CPU, which averages the data and updates the parameters. When a single node's task is complete, it broadcasts its training data to the other nodes of the lip-reading learning cloud platform and waits for their data. Once all nodes have completed their computing tasks, the final training data is stored by the designated master node.
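The CPU-side step described above, averaging the per-GPU results and updating the parameters, can be sketched in plain Python. The element-wise average and the plain-SGD update with an assumed learning rate are illustrative choices, since the text only says the CPU computes the average and updates the parameters.

```python
def average_updates(updates):
    """CPU-side step: element-wise average of the per-GPU updates.

    `updates` is a list with one flat parameter-update vector per GPU.
    """
    n = len(updates)
    return [sum(vals) / n for vals in zip(*updates)]

def apply_update(params, avg_update, lr=0.1):
    """Apply the averaged update as a plain SGD step (assumed update rule)."""
    return [p - lr * g for p, g in zip(params, avg_update)]
```

A node would call `average_updates` on the gradients gathered from its GPUs, then `apply_update` on the shared parameters before broadcasting, matching the data flow described above.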
The neural-network architecture of the lip-reading learning cloud platform is a convolutional neural network with 128 convolution kernels and 16 layers; the layer names and descriptions are those defined above, from init through softmax_linear.
It is worth noting that the neural-network structure between the hand1 and re layers is shown in Fig. 2. The embodiments of the present invention make at least the following three important improvements to the network structure:
First, a feedback self-oscillation mechanism is added to the self-organization rule, allowing information to be passed across layers: the pool3 layer passes residual information back to the hand1 layer.
Second, there is a recursive structure at the re layer: using an Elman network structure, the hand1, conv3, pool3, and re layers act as a hidden layer that performs recursive feedback.
Third, a partial SOM network structure: the som and som2 layers are the self-organizing-structure input and output layers respectively. The som-layer neurons are arranged as a matrix in two-dimensional space, and each neuron has a weight vector; after the som layer receives an input vector, the som2-layer neurons are activated and adjusted according to the self-organizing-network training method.
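As a rough sketch of the self-organizing-map behaviour in the third improvement, the following pure-Python fragment finds the winning neuron for an input vector and moves its weight vector toward the input. The Euclidean metric, the learning rate, and the absence of a neighbourhood function are simplifying assumptions.

```python
def som_bmu(weights, x):
    """Index of the best-matching unit: the neuron whose weight vector
    has the smallest squared Euclidean distance to the input x."""
    dists = [sum((wi - xi) ** 2 for wi, xi in zip(w, x)) for w in weights]
    return dists.index(min(dists))

def som_update(weights, x, bmu, lr=0.5):
    """Move the winning neuron's weight vector toward the input
    (simplified SOM update with no neighbourhood function)."""
    weights[bmu] = [wi + lr * (xi - wi) for wi, xi in zip(weights[bmu], x)]
    return weights
```

In a full SOM the neurons would sit on the two-dimensional grid the text describes and the update would also pull in the winner's neighbours with a decaying radius; this sketch keeps only the winner-take-all core.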
Step S3: store the trained data on the master node designated by the lip-reading learning cloud platform to form the training database.
In this embodiment, once all nodes have completed their computing tasks, the final training data is stored by the designated master node, forming the training database.
Step S4: build the distributed storage of the lip-reading learning cloud platform using Hadoop, and organize the training-database data onto the platform's other nodes as needed.
In this embodiment, during the service-provision stage, the trained data is first dumped onto the master node; Hadoop is then used to build the distributed storage, and the data is organized efficiently across several master and slave nodes.
An embodiment of the present invention also provides a lip-reading learning cloud platform system, comprising:
an acquisition unit for acquiring lip-reading input, the lip reading comprising lip and tongue movements and a corresponding sentence;
an extraction unit for splitting the lip and tongue movements into image data and the sentence into voice data, and transmitting the image data and voice data to worker nodes of the lip-reading learning cloud platform for data training;
a master-node unit for storing the trained data on the master node designated by the platform to form a training database; and
a building unit for constructing the distributed storage of the platform and organizing the training-database data onto the platform's other nodes as needed.
Specifically, the extraction unit extracts the lip reading through TensorFlow; splits the lip and tongue movements into image data and the sentence into voice data; partitions the data according to the data-association model's partitioning algorithm, packages the voice data and image data into training tasks, and assigns them to different worker nodes. On each worker node the CPU dispatches tasks to multiple GPUs; each time a GPU completes a training task, it sends the training data to the CPU, which averages it and updates the parameters. When a single node's training task is complete, it broadcasts its data to the other nodes of the lip-reading learning cloud platform and waits for their training data.
The master-node unit stores the training data after all nodes have completed their computing tasks, forming the training database.
The technical details of the lip-reading learning cloud platform system are similar to those of the method for establishing the platform described above and are therefore not repeated here.
In summary, the method and system for establishing a lip-reading learning cloud platform provided by embodiments of the present invention use a TensorFlow lip-reading model to extract whole sentences, which is more accurate than the previous word-by-word extraction. Meanwhile, the cloud-platform design lets users learn at any time and makes it easy to exchange with other learners.
The embodiments in this specification are described in a progressive manner; identical or similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the others.
Finally, it should be noted that the above description of the various embodiments of the present invention is provided for those skilled in the art. It is not intended to be exhaustive or to limit the invention to the single disclosed embodiment. As noted above, numerous alternatives and variations of the present invention will be apparent to those of ordinary skill in the art. Therefore, although some alternative embodiments have been discussed specifically, other embodiments will be apparent to, or relatively easily derived by, those skilled in the art. The present invention is intended to cover all alternatives, modifications, and variations of the invention discussed herein, as well as other embodiments falling within the spirit and scope of the above application.
Claims (10)
1. A method for establishing a lip-reading learning cloud platform, characterized by comprising:
acquiring lip-reading input, the lip reading comprising lip and tongue movements and a corresponding sentence;
extracting the lip reading, splitting the lip and tongue movements into image data and the sentence into voice data, and transmitting the image data and voice data to the lip-reading learning cloud platform for data training;
storing the trained data on a master node designated by the lip-reading learning cloud platform to form a training database; and
building the distributed storage of the lip-reading learning cloud platform, and organizing the training-database data onto other nodes of the platform as needed.
2. The method for establishing a lip-reading learning cloud platform according to claim 1, characterized in that the method further comprises:
building a distributed-system hardware platform with at least two nodes, each node comprising a central processing unit (CPU) and a graphics processing unit (GPU); and
using the gRPC library for low-level inter-process communication, and using the tools provided by TensorFlow to define the cluster's cluster_spec and configure the multi-machine, multi-GPU setup.
3. The method for establishing a lip-reading learning cloud platform according to claim 2, characterized in that extracting the lip reading is specifically:
extracting the lip reading through TensorFlow.
4. The method for establishing a lip-reading learning cloud platform according to claim 3, characterized in that transmitting the image data and voice data to the lip-reading learning cloud platform for data training and forming the training database comprises:
splitting the lip and tongue movements into image data and the sentence into voice data;
partitioning the data according to the data-association model's partitioning algorithm, packaging the voice data and image data into training tasks, and assigning them to different worker nodes;
on each worker node, dispatching the tasks from the CPU to multiple GPUs, each GPU sending its training data to the CPU on completing a training task, and the CPU averaging the training data and updating the parameters;
when a single node's training task is complete, broadcasting its data to the other nodes of the lip-reading learning cloud platform and waiting for the training data of the other nodes; and
after all nodes have completed their computing tasks, storing the final training data on the designated master node to form the training database.
5. The method for establishing a lip-reading learning cloud platform according to claim 4, characterized in that the neural-network architecture of the lip-reading learning cloud platform is a convolutional neural network with 128 convolution kernels and 16 layers, whose names and roles are defined as: init, network initialization; conv1, convolution with rectified-linear activation; pool1, max pooling; norm1, local response normalization; conv2, convolution with rectified-linear activation; pool2, max pooling; som, self-organizing-structure input layer; som2, self-organizing-structure output layer; norm2, local response normalization; hand1, manual network perturbation based on intermediate results; conv3, convolution with rectified-linear activation; pool3, max pooling; re, recurrent residual computation; local3, fully connected layer with rectified-linear activation; local4, fully connected layer with rectified-linear activation; and softmax_linear, a linear transform that outputs logits.
6. The method for establishing a lip-reading learning cloud platform according to claim 5, characterized in that the convolutional neural network includes a feedback self-oscillation mechanism that allows information to be passed across layers; specifically, the pool3 layer passes residual information back to the hand1 layer.
7. The method for establishing a lip-reading learning cloud platform according to claim 5, characterized in that the convolutional neural network includes a recursive structure at the re layer; specifically, an Elman network structure is used, with the hand1, conv3, pool3, and re layers acting as a hidden layer that performs recursive feedback.
8. The method for establishing a lip-reading learning cloud platform according to claim 5, characterized in that the convolutional neural network includes an SOM network structure; specifically, the som and som2 layers are the self-organizing-structure input and output layers respectively, the som-layer neurons are arranged as a matrix in two-dimensional space, each neuron has a weight vector, and after the som layer receives an input vector the som2-layer neurons are activated and adjusted according to the self-organizing-network training method.
9. A lip-reading learning cloud platform system, characterized by comprising:
an acquisition unit for acquiring lip-reading input, the lip reading comprising lip and tongue movements and a corresponding sentence;
an extraction unit for splitting the lip and tongue movements into image data and the sentence into voice data, and transmitting the image data and voice data to worker nodes of the lip-reading learning cloud platform for data training;
a master-node unit for storing the trained data on the master node designated by the lip-reading learning cloud platform to form a training database; and
a building unit for constructing the distributed storage of the lip-reading learning cloud platform and organizing the training-database data onto the platform's other nodes as needed.
10. The lip-reading learning cloud platform system according to claim 9, characterized in that the extraction unit extracts the lip reading through TensorFlow; splits the lip and tongue movements into image data and the sentence into voice data; partitions the data according to the data-association model's partitioning algorithm, packages the voice data and image data into training tasks, and assigns them to different worker nodes; on each worker node the CPU dispatches the tasks to multiple GPUs, each GPU sends its training data to the CPU on completing a training task, and the CPU averages the training data and updates the parameters; when a single node's training task is complete, it broadcasts its data to the other nodes of the lip-reading learning cloud platform and waits for the training data of the other nodes; and
the master-node unit stores the training data after all nodes have completed their computing tasks, forming the training database.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201711432189.1A | 2017-12-26 | 2017-12-26 | Method and system for establishing a lip-reading learning cloud platform |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN108171148A | 2018-06-15 |
Family
ID=62520954
Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201711432189.1A | Method and system for establishing a lip-reading learning cloud platform | 2017-12-26 | 2017-12-26 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN108171148A (en) |
-
2017
- 2017-12-26 CN CN201711432189.1A patent/CN108171148A/en not_active Withdrawn
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109524006A (en) * | 2018-10-17 | 2019-03-26 | 天津大学 | A Mandarin Chinese lip-reading recognition method based on deep learning |
CN109524006B (en) * | 2018-10-17 | 2023-01-24 | 天津大学 | Mandarin Chinese lip-reading recognition method based on deep learning |
CN111988652A (en) * | 2019-05-23 | 2020-11-24 | 北京地平线机器人技术研发有限公司 | Method and device for extracting lip-reading training data |
CN111988652B (en) * | 2019-05-23 | 2022-06-03 | 北京地平线机器人技术研发有限公司 | Method and device for extracting lip-reading training data |
CN113033098A (en) * | 2021-03-26 | 2021-06-25 | 山东科技大学 | Ocean target detection deep learning model training method based on AdaRW algorithm |
CN113033098B (en) * | 2021-03-26 | 2022-05-17 | 山东科技大学 | Ocean target detection deep learning model training method based on AdaRW algorithm |
CN113239902A (en) * | 2021-07-08 | 2021-08-10 | 中国人民解放军国防科技大学 | Lip-reading recognition method and device based on a dual-discriminator generative adversarial network |
CN113239902B (en) * | 2021-07-08 | 2021-09-28 | 中国人民解放军国防科技大学 | Lip-reading recognition method and device based on a dual-discriminator generative adversarial network |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP7337953B2 (en) | Speech recognition method and device, neural network training method and device, and computer program | |
EP4047598B1 (en) | Voice matching method and related device | |
US11315570B2 (en) | Machine learning-based speech-to-text transcription cloud intermediary | |
Sun et al. | Speech emotion recognition based on DNN-decision tree SVM model | |
JP6873333B2 (en) | Method using voice recognition system and voice recognition system | |
CN108171148A (en) | Method and system for establishing a lip-reading learning cloud platform | |
Nakkiran et al. | Compressing deep neural networks using a rank-constrained topology | |
Kamaruddin et al. | Cultural dependency analysis for understanding speech emotion | |
CN108172218B (en) | Voice modeling method and device | |
CN109887484A (en) | Speech recognition and speech synthesis method and device based on paired-associate learning | |
CN106683661A (en) | Role separation method and device based on voice | |
CN105390141A (en) | Sound conversion method and sound conversion device | |
CN111696572B (en) | Voice separation device, method and medium | |
US20190259384A1 (en) | Systems and methods for universal always-on multimodal identification of people and things | |
Ault et al. | On speech recognition algorithms | |
CN111192659A (en) | Pre-training method for depression detection and depression detection method and device | |
WO2021203880A1 (en) | Speech enhancement method, neural network training method, and related device | |
CN108877812B (en) | Voiceprint recognition method and device and storage medium | |
Milde et al. | Using representation learning and out-of-domain data for a paralinguistic speech task. | |
Jansson | Single-word speech recognition with convolutional neural networks on raw waveforms | |
US11100940B2 (en) | Training a voice morphing apparatus | |
CN113555032A (en) | Multi-speaker scene recognition and network training method and device | |
CN115393933A (en) | Video face emotion recognition method based on frame attention mechanism | |
Wu et al. | A sequential contrastive learning framework for robust dysarthric speech recognition | |
Kumar et al. | Designing neural speaker embeddings with meta learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
TA01 | Transfer of patent application right | Effective date of registration: 2020-11-11. Address after: No. 2-3167, Zone A, Nonggang City, No. 2388 Donghuan Avenue, Hongjia Street, Jiaojiang District, Taizhou City, Zhejiang Province, 318015. Applicant after: Taizhou Jiji Intellectual Property Operation Co.,Ltd. Address before: No. 3666 Sixian Road, Songjiang District, Shanghai, 201616. Applicant before: Phicomm (Shanghai) Co.,Ltd. |
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | Application publication date: 2018-06-15 |