CN110349573A - Speech recognition method, apparatus, machine-readable medium and device - Google Patents
Speech recognition method, apparatus, machine-readable medium and device
- Publication number
- CN110349573A (Application CN201910600014.XA)
- Authority
- CN
- China
- Prior art keywords
- model
- criterion
- trained
- training
- acoustic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
- G10L15/142—Hidden Markov Models [HMMs]
- G10L15/144—Training of HMMs
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
Abstract
The invention discloses a speech recognition method, which includes: obtaining speech to be recognized and extracting corresponding acoustic features from it; and recognizing the acoustic features with a trained acoustic model to obtain a recognition result. The trained acoustic model is obtained as follows: according to the type of acoustic model, a corresponding training criterion is determined, and the acoustic model is trained with that criterion to obtain the trained acoustic model. By using the end-to-end LF-MMI method, the invention obtains phoneme alignment information without iteratively training multiple GMM-HMM models; training is fast, and for the same recognition rate training is more than 3 times faster. At the same time, the training of preliminary models is eliminated, reducing the complexity of the whole framework.
Description
Technical field
The invention belongs to the field of speech recognition, and in particular relates to a speech recognition method, apparatus, machine-readable medium and device.
Background technique
For speech recognition, both accuracy and speed must be taken into account. It is generally desirable to improve the efficiency of model training while guaranteeing recognition accuracy. The traditional DNN-HMM speech recognition framework trains slowly: multiple GMM-HMM models must be trained iteratively to obtain phoneme alignment information. There is therefore an urgent need for a more efficient recognition method.
Summary of the invention
In view of the above shortcomings of the prior art, the present invention provides a speech recognition method, apparatus, machine-readable medium and device to solve the problem of low recognition efficiency in the prior art.
To achieve the above and other related objects, the present invention provides a speech recognition method, which includes:
obtaining speech to be recognized and extracting corresponding acoustic features from it;
recognizing the acoustic features using a trained acoustic model to obtain a recognition result, where the trained acoustic model is obtained as follows:
according to the type of acoustic model, determining a corresponding training criterion, and training the acoustic model with that criterion to obtain the trained acoustic model.
Optionally, the acoustic model is a DNN-HMM, GMM-HMM, DNN-CTC or seq2seq model; the training criterion is the LF-MMI, CE or sMBR training criterion.
Optionally, if the acoustic model is the DNN-HMM model, the corresponding training criterion is the LF-MMI training criterion.
Optionally, the LF-MMI criterion is expressed as:

F_MMI(θ) = Σ_{m=1}^{M} log [ P(o_m|s_m)^k P(w_m) / Σ_w P(o_m|s_w)^k P(w) ]    (1)

where the numerator is the total score of the paths corresponding to the correct result, and the denominator is the sum of the scores of all paths; the numerator is computed on the numerator word lattice and the denominator on the denominator word lattice; θ denotes the model parameters; S denotes the training set, S = {(o_m, w_m) | 0 ≤ m ≤ M}; o_m is the observation sequence and w_m the true word sequence of the m-th utterance; s_m is the corresponding state sequence; k is the acoustic scaling factor; P(w) is the prior probability of a sequence; and w ranges over all possible sequences in the denominator word lattice.
Optionally, the acoustic model is trained using the low frame rate (LFR) technique.
Optionally, the acoustic features are decoded using the low frame rate (LFR) technique.
Optionally, l2 regularization or batch normalization is added at the output layer of the acoustic model.
To achieve the above and other related objects, the present invention also provides a speech recognition apparatus, which includes:
an extraction module for obtaining speech to be recognized and extracting corresponding acoustic features from it; and
a recognition module for recognizing the acoustic features using a trained acoustic model to obtain a recognition result, where the trained acoustic model is obtained as follows:
according to the type of acoustic model, determining a corresponding training criterion, and training the acoustic model with that criterion to obtain the trained acoustic model.
Optionally, the acoustic model is a DNN-HMM, GMM-HMM, DNN-CTC or seq2seq model; the training criterion is the LF-MMI, CE or sMBR training criterion.
Optionally, if the acoustic model is the DNN-HMM model, the corresponding training criterion is the LF-MMI training criterion.
Optionally, the LF-MMI criterion is expressed as:

F_MMI(θ) = Σ_{m=1}^{M} log [ P(o_m|s_m)^k P(w_m) / Σ_w P(o_m|s_w)^k P(w) ]    (1)

where the numerator is the total score of the paths corresponding to the correct result, and the denominator is the sum of the scores of all paths; the numerator is computed on the numerator word lattice and the denominator on the denominator word lattice; θ denotes the model parameters; S denotes the training set, S = {(o_m, w_m) | 0 ≤ m ≤ M}; o_m is the observation sequence and w_m the true word sequence of the m-th utterance; s_m is the corresponding state sequence; k is the acoustic scaling factor; P(w) is the prior probability of a sequence; and w ranges over all possible sequences in the denominator word lattice.
Optionally, the acoustic model is trained using the low frame rate (LFR) technique.
Optionally, the acoustic features are decoded using the low frame rate (LFR) technique.
Optionally, l2 regularization or batch normalization is added at the output layer of the acoustic model.
To achieve the above and other related objects, the present invention also provides a device, comprising:
one or more processors; and
one or more machine-readable media storing instructions which, when executed by the one or more processors, cause the device to perform the speech recognition method.
To achieve the above and other related objects, the present invention also provides one or more machine-readable media storing instructions which, when executed by one or more processors, cause a device to perform the speech recognition method.
As described above, the speech recognition method, apparatus, machine-readable medium and device of the present invention have the following beneficial effects:
Using the end-to-end LF-MMI method, the present invention achieves state-of-the-art recognition rates with high recognition efficiency.
Using the end-to-end LF-MMI method, the present invention does not need to iteratively train multiple GMM-HMM models to obtain phoneme alignment information; training is fast, and for the same recognition rate training is more than 3 times faster. At the same time, the training of preliminary models is eliminated, reducing the complexity of the whole framework.
Detailed description of the invention
Fig. 1 is a flowchart of a speech recognition method in an embodiment of the invention;
Fig. 2 is a schematic diagram of the modeling unit in an embodiment of the invention;
Fig. 3 is a schematic diagram of the hardware structure of a terminal device in an embodiment of the invention;
Fig. 4 is a schematic diagram of the hardware structure of a terminal device in an embodiment of the invention.
Specific embodiment
The embodiments of the present invention are illustrated below through specific examples, from which those skilled in the art can easily understand other advantages and effects of the present invention. The present invention can also be implemented or applied through other different specific embodiments, and the details in this specification can be modified or changed in various ways based on different viewpoints and applications without departing from the spirit of the present invention. It should be noted that, where there is no conflict, the following embodiments and the features in the embodiments can be combined with each other.
It should be noted that the drawings provided in the following embodiments only illustrate the basic concept of the invention in a schematic way; they show only the components related to the invention rather than the number, shape and size of components in actual implementation. In actual implementation, the form, quantity and proportion of each component can vary, and the component layout may be more complex.
As shown in Fig. 1, a speech recognition method includes:
S1: obtaining speech to be recognized and extracting corresponding acoustic features from it;
S2: recognizing the acoustic features using a trained acoustic model to obtain a recognition result, where the trained acoustic model is obtained as follows:
according to the type of acoustic model, determining a corresponding training criterion, and training the acoustic model with that criterion to obtain the trained acoustic model.
In some embodiments, the acoustic model is a DNN-HMM model (deep neural network - hidden Markov model), a GMM-HMM model (Gaussian mixture - hidden Markov model), a DNN-CTC model (deep neural network - connectionist temporal classification) or a seq2seq (sequence-to-sequence) model; the training criterion is the LF-MMI training criterion (lattice-free maximum mutual information), the CE training criterion (cross entropy) or the sMBR training criterion (state-level minimum Bayes risk).
Several common training criteria and their formulas follow.
The CE training criterion optimizes a probability and can be expressed as:

J_CE = − Σ_i y_i log ŷ_i

where y_i is the empirical probability (from the labels of the training data) that observed feature o belongs to class i, and ŷ_i is the probability estimated by the DNN.
The sMBR training criterion can be expressed as:

F_sMBR(θ) = Σ_{m=1}^{M} [ Σ_w P(o_m|s_w)^k P(w) A(s_w, s_m) ] / [ Σ_{w'} P(o_m|s_{w'})^k P(w') ]

where A(s_w, s_m) is the state-level accuracy of state sequence s_w against the reference state sequence s_m.
In general, conventional neural network models for speech recognition are trained with the cross-entropy (CE) criterion. But speech recognition is essentially a sequence classification problem. Sequence-discriminative training makes better use of the maximum a posteriori (MAP) criterion in large-vocabulary continuous speech recognition. It breaks through the per-frame limitation, and the sequence-discriminative training used under the GMM-HMM framework can equally be applied to neural-network-based speech recognition tasks. Therefore, in this embodiment, the acoustic model is a DNN-HMM model and the training criterion is the LF-MMI training criterion.
The LF-MMI training criterion enumerates all possible label sequences at the neural network output layer, computes the corresponding MMI value and its gradient from these label sequences, and then completes training by gradient propagation. The LF-MMI training criterion can directly compute the posterior probability of all possible paths during training, eliminating the need to generate word lattices in advance before discriminative training.
LF-MMI, i.e. the lattice-free maximum mutual information criterion, aims to maximize the mutual information between the word sequence distribution and the observation sequence distribution. Suppose the observation sequence is o_m = (o_m1, ..., o_mTm) and the word sequence is w_m = (w_m1, ..., w_mNm), where m indexes the utterance, Tm is the number of frames and Nm the number of words. The training set is S = {(o_m, w_m) | 0 ≤ m ≤ M}. The LF-MMI training criterion can be expressed as:

F_MMI(θ) = Σ_{m=1}^{M} log [ P(o_m|s_m)^k P(w_m) / Σ_w P(o_m|s_w)^k P(w) ]    (1)
Here the numerator is the total score of the paths corresponding to the correct result, and the denominator is the sum of the scores of all paths; the numerator is computed on the numerator word lattice and the denominator on the denominator word lattice; θ denotes the model parameters; S is the training set S = {(o_m, w_m) | 0 ≤ m ≤ M}; o_m is the observation sequence; w_m is the true word sequence of the m-th utterance; s_m is the state sequence; k is the acoustic scaling factor; and w ranges over all possible sequences in the denominator word lattice. P(w) is the prior probability of a sequence, corresponding to the language model score. The difference between w and the target w_m is that w_m is the true word sequence of a given utterance, used as the training label, whereas w, since the denominator sums over w, represents all possible sequences in the denominator word lattice (which contains the true sequence as well as sequences similar to it). The purpose of this criterion is therefore to separate, as far as possible, the true sequence from sequences that are wrong but very similar to it.
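The effect of the numerator and denominator above can be sketched numerically in the log domain: the MMI objective for one utterance is the reference path's log-score minus the log-sum of all denominator-lattice path scores. The log-scores below are made-up values for illustration only:

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x))) over a list of log-scores."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def mmi_objective(num_logscore, den_logscores):
    """log [ numerator path score / sum over all denominator-lattice paths ].

    num_logscore: log( P(o|s_m)^k * P(w_m) ) for the reference word sequence.
    den_logscores: the same quantity for every path in the denominator
                   lattice (which includes the reference path itself).
    """
    return num_logscore - logsumexp(den_logscores)

# The reference path (-10.0) competes with two confusable paths.  The
# objective approaches 0 only when the reference path carries nearly all
# of the probability mass; otherwise it is negative.
f = mmi_objective(-10.0, [-10.0, -12.0, -15.0])
print(round(f, 4))
```

Maximizing this quantity pushes probability mass toward the reference path and away from the similar-but-wrong competitors, which is exactly the separation the criterion describes.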
During computation, the numerator and denominator results are obtained separately: the numerator result from the generated numerator word lattice, and the denominator result from the generated denominator word lattice.
Specifically, the numerator word lattice and denominator word lattice can be computed as follows.
Denominator word lattice generation:
On the basis of LF-MMI, the training labels are first converted to phone information and a phone-level language model is generated; the training data are then converted to the most probable phone sequences, and a 4-gram language model is constructed. A decoding FST graph based on the phone language model, i.e. the HCP graph, is then built.
When generating the denominator word lattice, the generation is moved onto the GPU (Graphics Processing Unit), which greatly reduces the time required.
Numerator word lattice generation:
The label data in the training data are converted to sequences of phonetic units, then converted to FST format via the HCP decoding graph, giving the possible sequences needed for the numerator word lattice.
During training, the MMI loss value, i.e. formula (1), is obtained by computing the occurrence probabilities of the sequences in the numerator and denominator word lattices.
By modifying the way the denominator and numerator word lattices are generated, this embodiment avoids the need for a pre-trained model and strict alignments: the required lattices are generated directly during training.
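The raw statistics behind the phone-level 4-gram language model mentioned above are simply n-gram counts over the phone sequences derived from the training labels. A minimal sketch, assuming hypothetical phone sequences and the usual sentence-boundary padding (none of the names below are from the patent):

```python
from collections import Counter

def phone_ngram_counts(phone_seqs, n=4):
    """Count phone n-grams from label-derived phone sequences.

    These counts are the basis of a phone-level n-gram LM such as the
    4-gram model used when building the denominator graph.  Each sequence
    is padded with <s>/</s> boundary markers before counting.
    """
    counts = Counter()
    for seq in phone_seqs:
        padded = ["<s>"] * (n - 1) + list(seq) + ["</s>"]
        for i in range(len(padded) - n + 1):
            counts[tuple(padded[i:i + n])] += 1
    return counts

# Hypothetical phone sequences for two short utterances.
counts = phone_ngram_counts([["n", "i", "h", "ao"], ["h", "ao"]], n=4)
print(counts[("<s>", "<s>", "<s>", "n")])  # -> 1 (utterances starting with "n")
```

A real system would additionally smooth these counts and compile the resulting LM into an FST before composing it into the HCP decoding graph.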
In some embodiments, the idea of CTC (connectionist temporal classification) is used: a blank state is introduced into the HMM modeling to absorb uncertain boundaries. As shown in Fig. 2, in this structure Sp is the emitting (pronounced) state and Sb is the blank state. This enlarges the modeling granularity, so that one HMM structure can represent a phone-level unit, and the structure of the decoder is optimized.
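One simplified reading of this Sp/Sb topology is the CTC-style expansion: each phone contributes one emitting state, with an optional blank state between phones to absorb uncertain boundaries. The sketch below builds that canonical state template (the function name and labels are illustrative, not from the patent):

```python
def expand_with_blank(phones):
    """Expand a phone sequence into a CTC-style state template:
    one emitting state Sp per phone, with a blank state Sb interleaved
    before, between and after the phones to absorb boundary uncertainty."""
    states = ["Sb"]
    for p in phones:
        states.append(f"Sp({p})")
        states.append("Sb")
    return states

print(expand_with_blank(["n", "i"]))
# -> ['Sb', 'Sp(n)', 'Sb', 'Sp(i)', 'Sb']
```

In a full system the blanks would be optional arcs in the HMM graph rather than mandatory states, but the template shows how a single phone-level unit is represented by a small state structure.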
In some embodiments, on the basis of the LF-MMI training criterion, the acoustic model is trained using the low frame rate (LFR) technique: frames are typically taken with a stride of 3 (3x frame skipping), which improves training speed and accuracy.
In some embodiments, the acoustic features are also decoded using the low frame rate (LFR) technique, typically taking frames with a stride of 3; decoding is then faster and more accurate.
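The 3x frame skipping described above amounts to subsampling the feature frames before the model scores them. A minimal sketch (the frame values are placeholders for real acoustic feature vectors):

```python
def lfr_subsample(frames, stride=3):
    """Low-frame-rate subsampling: keep one frame out of every `stride`,
    cutting by that factor the number of frames the acoustic model must
    score during training and decoding."""
    return frames[::stride]

feats = list(range(10))          # 10 hypothetical feature frames
print(lfr_subsample(feats))      # -> [0, 3, 6, 9]
```

Practical LFR systems often splice neighboring frames together before subsampling so no information is discarded; the sketch shows only the stride itself.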
In some embodiments, the numerator FST of each sentence is also split into multiple small chunks, which saves memory and speeds up both training and decoding.
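The chunking above can be sketched at the frame level: an utterance is cut into fixed-length pieces so each training example is short and uniform. The chunk length of 150 frames is a hypothetical choice, not a value stated in the patent:

```python
def split_chunks(frames, chunk_len=150):
    """Split an utterance's frames into fixed-length chunks; the numerator
    FST is split along the same boundaries so each chunk can be trained
    independently, saving memory and allowing uniform minibatches."""
    return [frames[i:i + chunk_len] for i in range(0, len(frames), chunk_len)]

chunks = split_chunks(list(range(400)), chunk_len=150)
print([len(c) for c in chunks])  # -> [150, 150, 100]
```

The last chunk is simply shorter; real systems may instead overlap or pad trailing chunks, a detail omitted here.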
In one embodiment, to prevent the over-fitting that purely sequence-level end-to-end training easily causes, batch normalization is added to the output of each layer of the acoustic model to speed up training and keep the data normalized; alternatively, l2 regularization is added at the output layer to avoid over-fitting of the output. To guarantee the correctness of the LF-MMI computation, no softmax normalization is added at the output layer.
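The two regularizers mentioned above can be sketched numerically: batch normalization standardizes one activation dimension across a batch, while the l2 penalty adds a scaled sum of squared outputs to the loss. All numbers and names are illustrative:

```python
import math

def batch_norm(batch, eps=1e-5):
    """Normalize one activation dimension across a batch:
    (x - mean) / sqrt(var + eps).  Note: no softmax follows, matching
    the requirement that the output layer stay unnormalized for LF-MMI."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [(x - mean) / math.sqrt(var + eps) for x in batch]

def l2_penalty(outputs, lam=1e-4):
    """Alternative regularizer: lam * sum(y^2), added to the training loss
    to keep output activations small and curb over-fitting."""
    return lam * sum(y * y for y in outputs)

normed = batch_norm([1.0, 2.0, 3.0, 4.0])
print([round(x, 3) for x in normed])  # -> [-1.342, -0.447, 0.447, 1.342]
```

In a real network batch norm also carries learnable scale/shift parameters and running statistics for inference; the sketch shows only the normalization itself.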
This embodiment also provides a speech recognition apparatus, which includes:
an extraction module for obtaining speech to be recognized and extracting corresponding acoustic features from it; and
a recognition module for recognizing the acoustic features using a trained acoustic model to obtain a recognition result, where the trained acoustic model is obtained as follows:
according to the type of acoustic model, determining a corresponding training criterion, and training the acoustic model with that criterion to obtain the trained acoustic model.
In some embodiments, the acoustic model is a DNN-HMM, GMM-HMM, DNN-CTC or seq2seq model; the training criterion is the LF-MMI, CE or sMBR training criterion.
In one embodiment, the acoustic model is the DNN-HMM model, and the corresponding training criterion is the LF-MMI training criterion.
In some embodiments, the LF-MMI criterion is expressed as:

F_MMI(θ) = Σ_{m=1}^{M} log [ P(o_m|s_m)^k P(w_m) / Σ_w P(o_m|s_w)^k P(w) ]    (1)

where the numerator is the total score of the paths corresponding to the correct result, and the denominator is the sum of the scores of all paths; the numerator is computed on the numerator word lattice and the denominator on the denominator word lattice; θ denotes the model parameters; S denotes the training set, S = {(o_m, w_m) | 0 ≤ m ≤ M}; o_m is the observation sequence and w_m the true word sequence of the m-th utterance; s_m is the corresponding state sequence; k is the acoustic scaling factor; P(w) is the prior probability of a sequence; and w ranges over all possible sequences in the denominator word lattice.
In some embodiments, the acoustic model is trained using the low frame rate (LFR) technique.
In some embodiments, the acoustic features are decoded using the low frame rate (LFR) technique.
In some embodiments, l2 regularization or batch normalization is added at the output layer of the acoustic model.
It should be noted that, since the embodiments of the apparatus correspond to the embodiments of the method, for the content of the apparatus embodiments please refer to the description of the method embodiments, which is not repeated here.
An embodiment of the present application also provides a device, which may include: one or more processors; and one or more machine-readable media storing instructions which, when executed by the one or more processors, cause the device to perform the method described in Fig. 1. In practical applications, the device can serve as a terminal device or as a server. Examples of terminal devices may include: smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, vehicle-mounted computers, desktop computers, set-top boxes, smart TVs, wearable devices and so on. The embodiments of the present application place no restriction on the specific device.
An embodiment of the present application also provides a non-volatile readable storage medium storing one or more modules (programs). When the one or more modules are applied in a device, they can cause the device to execute the instructions of the steps included in the speech recognition method of Fig. 1 of the embodiments of the present application.
Fig. 3 is a schematic diagram of the hardware structure of a terminal device provided by an embodiment of the present application. As shown, the terminal device may include: an input device 1100, a processor 1101, an output device 1102, a memory 1103 and at least one communication bus 1104. The communication bus 1104 implements the communication connections between the elements. The memory 1103 may include high-speed RAM and may also include non-volatile memory (NVM), for example at least one magnetic disk memory; various programs can be stored in the memory 1103 for completing various processing functions and implementing the method steps of this embodiment.
Optionally, the processor 1101 can be implemented as, for example, a central processing unit (CPU), an application-specific integrated circuit (ASIC), a digital signal processor (DSP), a digital signal processing device (DSPD), a programmable logic device (PLD), a field-programmable gate array (FPGA), a controller, a microcontroller, a microprocessor or other electronic components; the processor is coupled to the input device 1100 and output device 1102 through wired or wireless connections.
Optionally, the input device 1100 may include a variety of input devices, for example at least one of a user-facing user interface, a device-facing device interface, a programmable software interface, a camera and a sensor. Optionally, the device-facing device interface can be a wired interface for data transmission between devices, or a hardware plug-in interface (e.g. a USB interface, a serial port, etc.) for data transmission between devices. Optionally, the user-facing user interface can be, for example, user-facing control buttons, a voice input device for receiving voice input, and a touch sensing device (e.g. a touch screen or touch pad with a touch sensing function) through which the user provides touch input. Optionally, the programmable software interface can be, for example, an entry through which the user edits or modifies a program, such as an input pin interface or input interface of a chip. The output device 1102 may include output devices such as a display and a speaker.
In this embodiment, the processor of the terminal device includes the functions for executing each module of the speech recognition apparatus in each device; for the specific functions and technical effects, refer to the above embodiments, which are not repeated here.
Fig. 4 is a schematic diagram of the hardware structure of a terminal device provided by an embodiment of the present application; Fig. 4 is a specific embodiment in the implementation of Fig. 3. As shown, the terminal device of this embodiment may include a processor 1201 and a memory 1202.
The processor 1201 executes the computer program code stored in the memory 1202 to implement the method described in Fig. 1 of the above embodiments.
The memory 1202 is configured to store various types of data to support the operation of the terminal device. Examples of these data include the instructions of any application or method operating on the terminal device, such as messages, pictures, videos, etc. The memory 1202 may include random access memory (RAM) and may also include non-volatile memory, for example at least one magnetic disk memory.
Optionally, the processor 1201 is arranged in a processing component 1200. The terminal device may also include: a communication component 1203, a power supply component 1204, a multimedia component 1205, a voice component 1206, an input/output interface 1207 and/or a sensor component 1208. The components specifically included in the terminal device are set according to actual demand, and this embodiment places no limitation on them.
The processing component 1200 usually controls the overall operation of the terminal device. The processing component 1200 may include one or more processors 1201 to execute instructions to complete all or part of the steps of the method shown in Fig. 1. In addition, the processing component 1200 may include one or more modules to facilitate interaction between the processing component 1200 and other components; for example, it may include a multimedia module to facilitate interaction between the multimedia component 1205 and the processing component 1200.
The power supply component 1204 provides power to the various components of the terminal device. The power supply component 1204 may include a power management system, one or more power supplies, and other components associated with generating, managing and distributing power for the terminal device.
The multimedia component 1205 includes a display screen providing an output interface between the terminal device and the user. In some embodiments, the display screen may include a liquid crystal display (LCD) and a touch panel (TP). If the display screen includes a touch panel, it may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, slides and gestures on the touch panel. The touch sensor can not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.
The voice component 1206 is configured to output and/or input voice signals. For example, the voice component 1206 includes a microphone (MIC); when the terminal device is in an operating mode such as speech recognition mode, the microphone is configured to receive external voice signals. The received voice signal can further be stored in the memory 1202 or sent via the communication component 1203. In some embodiments, the voice component 1206 also includes a speaker for outputting voice signals.
The input/output interface 1207 provides an interface between the processing component 1200 and a peripheral interface module; the peripheral interface module can be a click wheel, buttons, etc. These buttons may include, but are not limited to: a volume button, a start button and a lock button.
The sensor component 1208 includes one or more sensors for providing state assessments of various aspects of the terminal device. For example, the sensor component 1208 can detect the open/closed state of the terminal device, the relative positioning of components, and the presence or absence of contact between the user and the terminal device. The sensor component 1208 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact, including detecting the distance between the user and the terminal device. In some embodiments, the sensor component 1208 can also include a camera, etc.
The communication component 1203 is configured to facilitate wired or wireless communication between the terminal device and other devices. The terminal device may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In one embodiment, the terminal device may include a SIM card slot for inserting a SIM card, so that the terminal device can log on to a GPRS network and establish communication with a server via the Internet.
As can be seen from the above, the communication component 1203, the voice component 1206, the input/output interface 1207, and the sensor component 1208 involved in the embodiment of Fig. 4 may serve as implementations of the input device in the embodiment of Fig. 3.
The above embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Anyone familiar with this technology may modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes completed by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed herein shall be covered by the claims of the present invention.
Claims (16)
1. A speech recognition method, characterized in that the speech recognition method comprises:
obtaining speech to be recognized and extracting corresponding acoustic features from the speech to be recognized;
recognizing the acoustic features using a trained acoustic model to obtain a recognition result, wherein the trained acoustic model is obtained in the following manner:
according to different acoustic models, determining a corresponding training criterion and training the acoustic model with the training criterion to obtain the trained acoustic model.
2. The speech recognition method according to claim 1, characterized in that the acoustic model comprises a DNN-HMM model, a GMM-HMM model, a DNN-CTC model, or a seq2seq model, and the training criterion comprises the LF-MMI training criterion, the CE training criterion, or the sMBR training criterion.
3. The speech recognition method according to claim 2, characterized in that if the acoustic model is the DNN-HMM model, the corresponding training criterion is the LF-MMI training criterion.
4. The speech recognition method according to claim 3, characterized in that the LF-MMI criterion is expressed as:

$$F_{\mathrm{MMI}}(\theta) = \sum_{m=0}^{M} \log \frac{P_{\theta}(o_m \mid s_m)^{k} \, P(w_m)}{\sum_{w} P_{\theta}(o_m \mid s_w)^{k} \, P(w)}$$

wherein the numerator represents the total score of the paths corresponding to the correct result, and the denominator represents the sum of the scores over all paths; the numerator is obtained from the numerator word lattice and the denominator from the denominator word lattice; θ denotes the model parameters; S denotes the training set, S = {(o_m, w_m) | 0 ≤ m ≤ M}; o_m denotes the observation sequence and w_m the true word sequence of the m-th utterance; s_m denotes the state sequence; k denotes the acoustic scaling factor; P(w) denotes the prior probability of a sequence; and w ranges over all possible sequences in the denominator word lattice.
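The criterion above can be illustrated numerically. The following is a minimal sketch (not part of the patent; all path scores are hypothetical and assumed to already include the acoustic scaling factor k and the language-model prior) that evaluates the log ratio of the numerator-lattice and denominator-lattice totals in the log domain:

```python
import math

def mmi_objective(num_logscores, den_logscores):
    """Toy MMI objective for a single utterance.

    num_logscores: log-domain scores of paths in the numerator lattice
                   (paths consistent with the reference word sequence).
    den_logscores: log-domain scores of all paths in the denominator lattice.
    Returns log(sum(num) / sum(den)), computed stably via log-sum-exp.
    """
    def logsumexp(xs):
        m = max(xs)
        return m + math.log(sum(math.exp(x - m) for x in xs))
    return logsumexp(num_logscores) - logsumexp(den_logscores)

# Hypothetical combined acoustic + LM log scores.
num = [-4.0, -5.5]               # paths matching the reference transcript
den = [-4.0, -5.5, -6.0, -7.2]   # all competing paths (superset of num)
obj = mmi_objective(num, den)
print(obj)
```

Because the numerator paths are a subset of the denominator paths, the objective is at most zero; training maximizes it by concentrating probability mass on the correct paths.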
5. The speech recognition method according to claim 3 or 4, characterized in that the acoustic model is trained using the low frame rate (LFR) technique.
6. The speech recognition method according to claim 1, characterized in that the acoustic features are decoded and recognized using the low frame rate (LFR) technique.
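The low frame rate technique referenced in claims 5 and 6 is commonly realized by stacking adjacent feature frames and subsampling, so the model sees fewer, wider frames per second. A minimal NumPy sketch under that assumption (the stacking factor `m` and stride `n` are illustrative values, not specified by the patent):

```python
import numpy as np

def lfr_stack(feats, m=3, n=3):
    """Low-frame-rate transform: concatenate m consecutive feature frames
    into one super-frame and advance n frames at a time, reducing the
    frame rate by a factor of n.

    feats: (T, D) array of per-frame acoustic features.
    Returns an array of shape (ceil(T / n), m * D); the tail is padded
    by repeating the final frame.
    """
    T, D = feats.shape
    out = []
    for t in range(0, T, n):
        chunk = feats[t:t + m]
        if len(chunk) < m:  # pad the last super-frame with copies of the final frame
            pad = np.repeat(feats[-1:], m - len(chunk), axis=0)
            chunk = np.concatenate([chunk, pad], axis=0)
        out.append(chunk.reshape(-1))
    return np.stack(out)

frames = np.arange(20, dtype=np.float32).reshape(10, 2)  # 10 frames, dim 2
stacked = lfr_stack(frames, m=3, n=3)
print(stacked.shape)  # (4, 6): 3x fewer frames, 3x wider features
```

Decoding at the reduced rate means roughly one third of the neural-network forward passes, which is the main speed benefit attributed to LFR.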
7. The speech recognition method according to claim 1, characterized in that L2 regularization or batch normalization is added at the output layer of the acoustic model.
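The two regularization options in claim 7 can be illustrated with a minimal NumPy sketch (not from the patent; the penalty weight `lam` and the toy values are hypothetical): an L2 penalty adds a λ·‖W‖² term to the training loss, while batch normalization standardizes each output dimension across the batch.

```python
import numpy as np

def l2_penalty(output_weights, lam=1e-4):
    """L2 regularization term on the output-layer weights; added to the
    training loss to discourage overly large weights."""
    return lam * np.sum(output_weights ** 2)

def batch_norm(x, eps=1e-5):
    """Batch normalization of output-layer pre-activations: normalize each
    dimension to zero mean and (near-)unit variance across the batch."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mu) / np.sqrt(var + eps)

w = np.ones((2, 3))
print(l2_penalty(w, lam=0.5))  # 3.0  (0.5 * sum of six ones)

rng = np.random.default_rng(0)
logits = rng.normal(loc=2.0, scale=3.0, size=(8, 4))  # toy pre-activations
normed = batch_norm(logits)
```

Both mechanisms serve the same purpose here: keeping the output-layer statistics well conditioned during discriminative training.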
8. A speech recognition device, characterized in that the speech recognition device comprises:
an extraction module for obtaining speech to be recognized and extracting corresponding acoustic features from the speech to be recognized;
a recognition module for recognizing the acoustic features using a trained acoustic model to obtain a recognition result, wherein the trained acoustic model is obtained in the following manner:
according to different acoustic models, determining a corresponding training criterion and training the acoustic model with the training criterion to obtain the trained acoustic model.
9. The speech recognition device according to claim 8, characterized in that the acoustic model comprises a DNN-HMM model, a GMM-HMM model, a DNN-CTC model, or a seq2seq model, and the training criterion comprises the LF-MMI training criterion, the CE training criterion, or the sMBR training criterion.
10. The speech recognition device according to claim 9, characterized in that if the acoustic model is the DNN-HMM model, the corresponding training criterion is the LF-MMI training criterion.
11. The speech recognition device according to claim 10, characterized in that the LF-MMI criterion is expressed as:

$$F_{\mathrm{MMI}}(\theta) = \sum_{m=0}^{M} \log \frac{P_{\theta}(o_m \mid s_m)^{k} \, P(w_m)}{\sum_{w} P_{\theta}(o_m \mid s_w)^{k} \, P(w)}$$

wherein the numerator represents the total score of the paths corresponding to the correct result, and the denominator represents the sum of the scores over all paths; the numerator is obtained from the numerator word lattice and the denominator from the denominator word lattice; θ denotes the model parameters; S denotes the training set, S = {(o_m, w_m) | 0 ≤ m ≤ M}; o_m denotes the observation sequence and w_m the true word sequence of the m-th utterance; s_m denotes the state sequence; k denotes the acoustic scaling factor; P(w) denotes the prior probability of a sequence; and w ranges over all possible sequences in the denominator word lattice.
12. The speech recognition device according to claim 10 or 11, characterized in that the acoustic model is trained using the low frame rate (LFR) technique.
13. The speech recognition device according to claim 8, characterized in that the acoustic features are decoded and recognized using the low frame rate (LFR) technique.
14. The speech recognition device according to claim 8, characterized in that L2 regularization or batch normalization is added at the output layer of the acoustic model.
15. A device, characterized by comprising:
one or more processors; and
one or more machine-readable media having instructions stored thereon that, when executed by the one or more processors, cause the device to perform the method according to one or more of claims 1-7.
16. One or more machine-readable media, characterized in that instructions are stored thereon that, when executed by one or more processors, cause a device to perform the method according to one or more of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910600014.XA CN110349573A (en) | 2019-07-04 | 2019-07-04 | A kind of audio recognition method, device, machine readable media and equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110349573A true CN110349573A (en) | 2019-10-18 |
Family
ID=68177412
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910600014.XA Pending CN110349573A (en) | 2019-07-04 | 2019-07-04 | A kind of audio recognition method, device, machine readable media and equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110349573A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100125457A1 (en) * | 2008-11-19 | 2010-05-20 | At&T Intellectual Property I, L.P. | System and method for discriminative pronunciation modeling for voice search |
CN108389575A (en) * | 2018-01-11 | 2018-08-10 | 苏州思必驰信息科技有限公司 | Audio data recognition methods and system |
US20180247643A1 (en) * | 2017-02-24 | 2018-08-30 | Baidu Usa Llc | Systems and methods for principled bias reduction in production speech models |
CN108629412A (en) * | 2017-03-15 | 2018-10-09 | 中国科学院声学研究所 | A kind of neural metwork training accelerated method based on mesh free maximum mutual information criterion |
CN109637526A (en) * | 2019-01-08 | 2019-04-16 | 西安电子科技大学 | The adaptive approach of DNN acoustic model based on personal identification feature |
US20190130904A1 (en) * | 2017-10-26 | 2019-05-02 | Hitachi, Ltd. | Dialog system with self-learning natural language understanding |
Non-Patent Citations (11)
Title |
---|
DONG YU ET AL.: "Recent Progresses in Deep Learning Based Acoustic Models", 《IEEE/CAA JOURNAL OF AUTOMATICA SINICA》 * |
HOSSEIN HADIAN ET AL.: "End-to-end Speech Recognition Using Lattice-free MMI", 《INTERSPEECH 2018》 * |
KAREL VESELY ET AL.: "Sequence-discriminative training of deep neural networks", 《INTERSPEECH 2013》 * |
MANOHAR, VIMAL , ET AL.: "Semi-Supervised Training of Acoustic Models Using Lattice-Free MMI", 《ICASSP 2018》 * |
POVEY, DANIEL , ET AL.: "Purely Sequence-Trained Neural Networks for ASR Based on Lattice-Free MMI", 《INTERSPEECH 2016》 * |
SIBO TONG ET AL.: "An Investigation of Multilingual ASR Using End-to-end LF-MMI", 《IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING》 * |
WEIXIN_30564901: "Kaldi中的Chain模型", 《HTTPS://BLOG.CSDN.NET/WEIXIN_30564901/ARTICLE/DETAILS/96851065》 * |
XUERUI YANG ET AL.: "A novel pyramidal-FSMN architecture with lattice-free MMI for speech recognition", 《ARXIV: SOUND》 * |
YANHUA LONG ET AL: "Domain adaptation of lattice-free MMI based TDNN models for speech recognition", 《INTERNATIONAL JOURNAL OF SPEECH TECHNOLOGY》 * |
杨阳阳: "kaldi中的chain model(LFMMI)详解", 《知乎HTTPS://ZHUANLAN.ZHIHU.COM/P/65557682》 * |
知乎用户: "LF-MMI为什么是一种端到端的方法", 《知乎HTTPS://WWW.ZHIHU.COM/QUESTION/304534046》 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111755001A (en) * | 2020-05-07 | 2020-10-09 | 国网山东省电力公司信息通信公司 | Artificial intelligence-based power grid rapid dispatching and commanding system and method |
CN113763939A (en) * | 2021-09-07 | 2021-12-07 | 普强时代(珠海横琴)信息技术有限公司 | Mixed speech recognition system and method based on end-to-end model |
CN113763939B (en) * | 2021-09-07 | 2024-04-16 | 普强时代(珠海横琴)信息技术有限公司 | Mixed voice recognition system and method based on end-to-end model |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Xu et al. | LCANet: End-to-end lipreading with cascaded attention-CTC | |
Zhang et al. | Multimodal intelligence: Representation learning, information fusion, and applications | |
Zhang et al. | Spontaneous speech emotion recognition using multiscale deep convolutional LSTM | |
Cai et al. | Audio-textual emotion recognition based on improved neural networks | |
CN107358951A (en) | A kind of voice awakening method, device and electronic equipment | |
WO2021201999A1 (en) | Pre-training with alignments for recurrent neural network transducer based end-to-end speech recognition | |
CN109887497A (en) | Modeling method, device and the equipment of speech recognition | |
CN109214002A (en) | A kind of transcription comparison method, device and its computer storage medium | |
CN110222649B (en) | Video classification method and device, electronic equipment and storage medium | |
CN110033760A (en) | Modeling method, device and the equipment of speech recognition | |
CN105096935A (en) | Voice input method, device, and system | |
CN108694940A (en) | A kind of audio recognition method, device and electronic equipment | |
CN112200318B (en) | Target detection method, device, machine readable medium and equipment | |
TW202042181A (en) | Method, device and electronic equipment for depth model training and storage medium thereof | |
CN109409241A (en) | Video checking method, device, equipment and readable storage medium storing program for executing | |
CN114021582B (en) | Spoken language understanding method, device, equipment and storage medium combined with voice information | |
CN111653274A (en) | Method, device and storage medium for awakening word recognition | |
CN110349573A (en) | A kind of audio recognition method, device, machine readable media and equipment | |
JP7178394B2 (en) | Methods, apparatus, apparatus, and media for processing audio signals | |
Wang et al. | A lip reading method based on 3D convolutional vision transformer | |
CN110335591A (en) | A kind of parameter management method, device, machine readable media and equipment | |
CN113450771A (en) | Awakening method, model training method and device | |
CN112580669B (en) | Training method and device for voice information | |
CN114155832A (en) | Speech recognition method, device, equipment and medium based on deep learning | |
CN112434746B (en) | Pre-labeling method based on hierarchical migration learning and related equipment thereof |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| CB02 | Change of applicant information | Address after: Room 1306, No. 26 Golden Road, Nansha District, Guangzhou, Guangdong 511457 (for office use only); Applicant after: Yuncong Technology Group Co., Ltd. Address before: Room 1306, No. 26 Golden Road, Nansha District, Guangzhou, Guangdong 511457 (for office use only); Applicant before: GUANGZHOU YUNCONG INFORMATION TECHNOLOGY CO., LTD. |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20191018 |