CN108648759A - A text-independent voiceprint recognition method - Google Patents
A text-independent voiceprint recognition method
- Publication number: CN108648759A (application CN201810457528.XA)
- Authority
- CN
- China
- Prior art keywords
- voice
- level
- frame
- groove
- layer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
- G10L17/04—Training, enrolment or model building
- G10L17/18—Artificial neural networks; Connectionist approaches
- G10L17/22—Interactive procedures; Man-machine interfaces
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Telephonic Communication Services (AREA)
Abstract
The invention discloses a text-independent voiceprint recognition method comprising three stages: voiceprint recognition model training, embedding extraction, and decision scoring. The model training stage consists of: 1) speech signal preprocessing; 2) frame-level operations; 3) a statistics pooling layer that aggregates the frame-level output; 4) one-dimensional convolution; 5) fully connected layers that output the speaker classes. After training is complete, the embedding is extracted before the nonlinearity of the first fully connected layer. Finally, a decision score is computed with the cosine distance and the claim is accepted or rejected. The invention combines neural-network embeddings with convolutional neural networks: one-dimensional convolutions with max-pooling layers reduce dimensionality, and the added convolutional layers perform deeper feature extraction and thereby improve model performance, while using the cosine distance as the scoring criterion makes the process faster and simpler.
Description
Technical field
The present invention relates to the technical field of voiceprint recognition, and in particular to a text-independent voiceprint recognition method that combines neural-network embeddings with convolutional neural networks.
Background technology
A voiceprint is the spectrum of the sound wave carrying the verbal information in human speech. Like a fingerprint, it is a unique biometric feature with identifying power: it is both speaker-specific and relatively stable. A speech signal is a one-dimensional continuous signal; after discretization, it becomes the discrete speech signal that computers commonly process today. Voiceprint recognition (also called speaker recognition) is a biometric technology that, much like the fingerprint recognition now widely used on smartphones, extracts voice features from the speech signal produced by a speaker and authenticates the speaker's identity accordingly.
The mainstream voiceprint recognition approach is the i-vector system. Building on joint factor analysis, it proposes that speaker and session variability can be characterized by a single subspace. Using this subspace, the high-dimensional statistics obtained from an utterance can be projected into a low-dimensional vector, the i-vector. Later, as hardware performance improved, deep neural networks were successfully applied to acoustic modeling and recognition accuracy improved considerably. Models combining DNNs with i-vectors were proposed: during sufficient-statistics extraction, the UBM of the original i-vector model is replaced with a DNN based on phoneme states, yielding the posterior probability of each class for every frame.
The most recent technique is the acoustic recognition model proposed by David Snyder et al., which extracts embedded features from a time-delay neural network and is known as the x-vector. The model computes a speaker embedding from variable-length speech and is an end-to-end system. Its steps are as follows:
Model training is performed first. The speech signal is preprocessed, and the first five layers of the network operate at the frame level. A statistics pooling layer receives the output of the last frame-level layer as input, aggregates all frames of an utterance, and computes their mean and standard deviation. The network then operates at the segment level: fully connected layers with ReLU activations are applied, and a final fully connected Softmax layer outputs the N speaker classes.
After training is complete, speech of any length is mapped directly to a fixed-length speaker embedding. Pairs of enrollment and test utterances are then scored with a PLDA back-end, and a final accept/reject decision is made.
Current network structures rely entirely on fully connected layers. It is understood that a deeper network has greater expressive power, but training a deep fully connected network by gradient descent is very difficult, because the gradients of a fully connected network are hard to propagate through more than about three layers. A very deep fully connected network is therefore impractical, which limits its capability.
Content of the invention
The object of the present invention is to overcome the shortcomings and deficiencies of the prior art by proposing a text-independent voiceprint recognition method. It improves the neural-network embedding structure with convolutional neural networks: the output of the statistics pooling layer is processed with one-dimensional convolutions, dimensionality is reduced with max-pooling layers, and additional convolutional layers perform deeper feature extraction. This improves model performance, and using the cosine distance as the scoring criterion makes the process faster and simpler.
To achieve the above object, the technical solution provided by the present invention is a text-independent voiceprint recognition method comprising the following steps:
1) Voiceprint recognition model training
1.1) speech signal preprocessing;
1.2) frame-level operations;
1.3) statistics pooling layer aggregates the frame-level output;
1.4) one-dimensional convolution;
1.5) fully connected layers output the speaker classes;
2) Embedding extraction: after model training is completed, the enrollment and test utterances are fed into the voiceprint recognition model and the embeddings are extracted;
3) Decision scoring: the score between the enrollment and test embeddings is computed with the cosine distance, and a final accept/reject decision is made.
In step 1.1), each utterance in the corpus is split into 25 ms frames and voice activity detection is applied to identify and remove long silent segments from the speech stream. Twenty Mel-frequency cepstral coefficients (MFCC) are generated per frame; adding the first- and second-order difference coefficients yields 60-dimensional MFCC feature vectors as input.
In step 1.2), the first five layers of the network operate at the frame level with a time-delay architecture. Let t be the current frame. At the input, the MFCC vectors of the frames at {t-2, t-1, t, t+1, t+2} are spliced together. The next two layers splice the previous layer's output at times {t-2, t, t+2} and {t-3, t, t+3}, respectively. The last two layers also operate at the frame level but add no additional frames. In total, the frame-level part of the network spans 15 frames, from t-7 to t+7.
In step 1.3), the statistics pooling layer receives the output of the last frame-level layer as input, aggregates all frames of an utterance, and computes their mean. Supposing an utterance is divided into T frames in total, the statistics pooling layer aggregates the outputs of all T frames from the fifth frame-level layer and computes their average, a 3200-dimensional statistic computed once per input utterance. This step aggregates information over the time dimension, so that subsequent layers operate on the entire utterance.
In step 1.4), the output of the statistics pooling layer is processed with one-dimensional convolutions, five convolutional layers in total. The first two convolutional layers use 256 kernels of size 5 with stride 2; the third, fourth, and fifth use 256 kernels of size 3 with stride 1. Each convolutional layer is followed by a max-pooling layer.
In step 1.5), two fully connected layers are attached, with ReLU and Softmax activations respectively; the output of the last fully connected layer is the N speaker classes.
In step 2), after model training is completed, the embedding, a 1024-dimensional vector, is extracted before the nonlinearity of the first fully connected layer.
In step 3), the score between the enrollment-utterance embedding and the test-utterance embedding is computed with the cosine distance and compared with a threshold to make the final accept/reject decision: the claim is accepted if the score exceeds the threshold and rejected otherwise. The formula is as follows:

score(w1, w2) = <w1, w2> / (||w1|| · ||w2||)

where w1 and w2 are the enrollment and test embeddings respectively, score(w1, w2) denotes the cosine distance, <w1, w2> is the dot product of the two embeddings, ||w1|| and ||w2|| are their lengths, and θ is a preset threshold.
Compared with the prior art, the present invention has the following advantages and beneficial effects:
1. In a convolutional network, each neuron is no longer connected to all neurons of the previous layer, but only to a small fraction of them. This greatly reduces the number of parameters.
2. A group of connections shares the same weight, rather than each connection having its own weight, which again reduces the number of parameters.
3. Max-pooling layers reduce the sample dimension of each layer, further reducing the parameter count while improving model robustness.
4. Using the cosine distance as the decision score for speaker verification makes the process faster and simpler.
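The parameter savings in advantages 1–3 can be made concrete with a small arithmetic sketch. The 256 kernels of size 5 match the first convolutional layer described in this patent; the dense-layer size used for comparison is illustrative, not a figure from the patent:

```python
# A fully connected layer mapping the 3200-dim pooled statistic to a
# 3200-dim output needs one weight per connection:
dense_params = 3200 * 3200        # 10,240,000 weights

# A 1-D convolutional layer scanning the same input with 256 *shared*
# kernels of size 5 (as in the patent's first conv layer) needs only:
conv_params = 256 * 5             # 1,280 weights (biases omitted)

print(dense_params // conv_params)  # 8000
```

Weight sharing is what makes the extra convolutional depth affordable: the same 1,280 weights are reused at every position along the input.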
Description of the drawings
Fig. 1 is the logical flow chart of the method for the present invention.
Fig. 2 is the voiceprint recognition model training flow chart of the present invention.
Specific implementation
The present invention is further described below with reference to a specific embodiment.
As shown in Fig. 1, the text-independent voiceprint recognition method provided by this embodiment is divided into three stages: voiceprint recognition model training, embedding extraction, and decision scoring.
The voiceprint recognition model is trained first. A suitable corpus is selected, for example the open-source AISHELL-ASR0009-OS1 Chinese speech database, which contains a training set and a test set.
As shown in Fig. 2, the voiceprint recognition model is trained as follows:
1) Speech signal preprocessing
Each utterance in the corpus is split into 25 ms frames and voice activity detection is applied to identify and remove long silent segments from the speech stream. Twenty Mel-frequency cepstral coefficients (MFCC) are generated per frame; adding the first- and second-order difference coefficients finally yields 60-dimensional MFCC feature vectors as input.
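The framing and silence removal can be sketched in numpy. The 25 ms frame length follows the patent; the energy-based VAD and its threshold are illustrative assumptions, since the patent does not specify the detection algorithm (nor the MFCC implementation, which is omitted here):

```python
import numpy as np

def frame_signal(x, sr, frame_ms=25):
    """Split a 1-D signal into non-overlapping 25 ms frames (step 1.1)."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(x) // frame_len
    return x[:n_frames * frame_len].reshape(n_frames, frame_len)

def energy_vad(frames, threshold=1e-3):
    """Keep frames whose mean energy exceeds a threshold (illustrative VAD)."""
    energy = (frames ** 2).mean(axis=1)
    return frames[energy > threshold]

sr = 16000
t = np.arange(sr // 2)
x = np.concatenate([np.zeros(sr // 2),                          # 0.5 s silence
                    0.5 * np.sin(2 * np.pi * 440 * t / sr)])    # 0.5 s tone
frames = frame_signal(x, sr)   # 40 frames of 400 samples each
voiced = energy_vad(frames)    # the 20 silent frames are removed
print(frames.shape, voiced.shape)  # (40, 400) (20, 400)
```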
2) Frame-level operations
The first five layers of the voiceprint model network operate at the frame level with a time-delay architecture. Let t be the current frame. At the input, we splice together the MFCC vectors of the frames at {t-2, t-1, t, t+1, t+2}. The next two layers splice the previous layer's output at times {t-2, t, t+2} and {t-3, t, t+3}, respectively. The last two layers also operate at the frame level but add no additional frames. In total, the frame-level part of the network spans 15 frames, from t-7 to t+7.
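The context splicing at the first frame-level layer can be sketched in numpy. The offsets {-2, -1, 0, 1, 2} and the 60-dimensional MFCC input follow the patent; clamping indices at the utterance boundaries is an assumption, since the patent does not specify edge handling:

```python
import numpy as np

def splice(features, offsets):
    """Concatenate each frame with its neighbours at the given offsets.

    features: (T, D) matrix of per-frame features.
    Indices are clamped at the utterance boundaries (an assumption).
    """
    T = features.shape[0]
    idx = np.clip(np.arange(T)[:, None] + np.array(offsets)[None, :], 0, T - 1)
    return features[idx].reshape(T, -1)

mfcc = np.random.randn(100, 60)               # 100 frames of 60-dim MFCCs
layer1_in = splice(mfcc, [-2, -1, 0, 1, 2])   # 5 x 60 = 300-dim, as in Table 1
print(layer1_in.shape)  # (100, 300)
```

The second and third frame-level layers apply the same operation to the previous layer's 1024-dim output with offsets {-2, 0, 2} and {-3, 0, 3}, giving the 3072-dim inputs listed in Table 1.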
3) Statistics pooling layer aggregates the frame-level output
The statistics pooling layer receives the output of the last frame-level layer as input, aggregates all frames of an utterance, and computes their mean. Supposing an utterance is divided into T frames in total, the statistics pooling layer aggregates the outputs of all T frames from the fifth frame-level layer and computes their average. The statistic is a 3200-dimensional vector, computed once per input utterance. This step aggregates information over the time dimension, so that subsequent layers operate on the entire utterance.
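The pooling step itself is a mean over the time axis; a one-line numpy sketch, with the 3200-dimensional frame-level output taken from the patent's Table 1 and T chosen arbitrarily:

```python
import numpy as np

T = 200                                # number of frames in the utterance
frame_out = np.random.randn(T, 3200)   # fifth frame-level layer output
pooled = frame_out.mean(axis=0)        # one 3200-dim statistic per utterance
print(pooled.shape)  # (3200,)
```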
4) One-dimensional convolution
The output of the statistics pooling layer is processed with one-dimensional convolutions. The first two convolutional layers use 256 kernels of size 5 with stride 2; the third, fourth, and fifth use 256 kernels of size 3 with stride 1. Each convolutional layer is followed by a max-pooling layer.
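One such layer can be sketched in numpy as a valid 1-D convolution with stride, followed by max-pooling. The kernel size 5, stride 2, and 3200-dim input follow the patent; the pooling window of 2, the small number of demo kernels (the patent uses 256), and the absence of bias and activation are simplifications for illustration:

```python
import numpy as np

def conv1d(x, kernels, stride):
    """Valid 1-D convolution. x: (in_ch, L); kernels: (out_ch, in_ch, k)."""
    out_ch, in_ch, k = kernels.shape
    L_out = (x.shape[1] - k) // stride + 1
    out = np.empty((out_ch, L_out))
    for i in range(L_out):
        window = x[:, i * stride:i * stride + k]       # (in_ch, k) slice
        out[:, i] = (kernels * window).sum(axis=(1, 2))
    return out

def maxpool1d(x, size=2):
    """Non-overlapping max-pooling along the length axis."""
    L = x.shape[1] // size
    return x[:, :L * size].reshape(x.shape[0], L, size).max(axis=2)

x = np.random.randn(1, 3200)             # pooled statistic as a 1-channel signal
k1 = np.random.randn(8, 1, 5)            # 8 demo kernels of size 5
h1 = maxpool1d(conv1d(x, k1, stride=2))  # (8, 1598) -> (8, 799)
print(h1.shape)  # (8, 799)
```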
5) Fully connected layers output the speaker classes
Two fully connected layers are attached, with ReLU and Softmax activations respectively; the output of the last fully connected layer is the N speaker classes.
The network structure of the frame-level operations and the statistics pooling layer is shown in Table 1:

Table 1. Frame-level operations and statistics pooling layer network structure

Layer | Frames per layer | Total context (frames) | Input → output |
---|---|---|---|
Frame-level layer 1 | [t-2, t+2] | 5 | 300 → 1024 |
Frame-level layer 2 | {t-2, t, t+2} | 9 | 3072 → 1024 |
Frame-level layer 3 | {t-3, t, t+3} | 15 | 3072 → 1024 |
Frame-level layer 4 | {t} | 15 | 1024 → 1024 |
Frame-level layer 5 | {t} | 15 | 1024 → 3200 |
Statistics pooling | [0, T] | T | 3200T → 3200 |
The convolutional-layer and fully-connected-layer network structure is shown in Table 2:

Table 2. Convolutional layer and fully connected layer network structure
Steps 2) to 5) above are applied to the MFCCs of every utterance, continually updating the convolution kernels and fully connected layer parameters, to complete the training of the voiceprint recognition model.
Embedding extraction: after model training is completed, the model is tested with the test-set utterances of the corpus. The enrollment and test utterances are fed into the voiceprint recognition model, and the embedding, a 1024-dimensional vector, is extracted before the nonlinearity of the first fully connected layer of the recognition model.
Decision scoring: the score between the enrollment and test embeddings is computed with the cosine distance and compared with a threshold to make the final accept/reject decision: the claim is accepted if the score exceeds the threshold and rejected otherwise. The formula is as follows:

score(w1, w2) = <w1, w2> / (||w1|| · ||w2||)

where w1 and w2 are the enrollment and test embeddings respectively, score(w1, w2) denotes the cosine distance, <w1, w2> is the dot product of the two embeddings, ||w1|| and ||w2|| are their lengths, and θ is a preset threshold.
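The scoring rule is a direct computation; a numpy sketch, with 3-dimensional toy embeddings and a threshold of 0.5 chosen purely for illustration:

```python
import numpy as np

def cosine_score(w1, w2):
    """score(w1, w2) = <w1, w2> / (||w1|| * ||w2||)."""
    return float(w1 @ w2 / (np.linalg.norm(w1) * np.linalg.norm(w2)))

def decide(w_enroll, w_test, theta=0.5):
    """Accept the identity claim if the cosine score exceeds the threshold."""
    return cosine_score(w_enroll, w_test) > theta

w_enroll = np.array([1.0, 0.0, 1.0])
same = np.array([0.9, 0.1, 1.1])     # close in direction -> high score, accept
other = np.array([-1.0, 1.0, 0.0])   # different direction -> low score, reject
print(decide(w_enroll, same), decide(w_enroll, other))  # True False
```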
The embodiment described above is only a preferred embodiment of the invention and is not intended to limit the scope of the present invention; any change made according to the shapes and principles of the present invention shall be covered within the scope of the present invention.
Claims (8)
1. A text-independent voiceprint recognition method, characterized by comprising the following steps:
1) Voiceprint recognition model training
1.1) speech signal preprocessing;
1.2) frame-level operations;
1.3) statistics pooling layer aggregates the frame-level output;
1.4) one-dimensional convolution;
1.5) fully connected layers output the speaker classes;
2) Embedding extraction: after model training is completed, the enrollment and test utterances are fed into the voiceprint recognition model and the embeddings are extracted;
3) Decision scoring: the score between the enrollment and test embeddings is computed with the cosine distance, and a final accept/reject decision is made.
2. The text-independent voiceprint recognition method according to claim 1, characterized in that in step 1.1), each utterance in the corpus is split into 25 ms frames and voice activity detection is applied to identify and remove silent segments longer than a preset value from the speech stream; twenty Mel-frequency cepstral coefficients (MFCC) are generated per frame, and adding the first- and second-order difference coefficients yields 60-dimensional MFCC feature vectors as input.
3. The text-independent voiceprint recognition method according to claim 1, characterized in that in step 1.2), the first five layers of the network operate at the frame level with a time-delay architecture; let t be the current frame: at the input, the MFCC vectors of the frames at {t-2, t-1, t, t+1, t+2} are spliced together; the next two layers splice the previous layer's output at times {t-2, t, t+2} and {t-3, t, t+3}, respectively; the last two layers also operate at the frame level but add no additional frames; in total, the frame-level part of the network spans 15 frames, from t-7 to t+7.
4. The text-independent voiceprint recognition method according to claim 1, characterized in that in step 1.3), the statistics pooling layer receives the output of the last frame-level layer as input, aggregates all frames of an utterance, and computes their mean; supposing an utterance is divided into T frames in total, the statistics pooling layer aggregates the outputs of all T frames from the fifth frame-level layer and computes their average, a 3200-dimensional statistic computed once per input utterance; this step aggregates information over the time dimension, so that subsequent layers operate on the entire utterance.
5. The text-independent voiceprint recognition method according to claim 1, characterized in that in step 1.4), the output of the statistics pooling layer is processed with one-dimensional convolutions, five convolutional layers in total; the first two convolutional layers use 256 kernels of size 5 with stride 2; the third, fourth, and fifth use 256 kernels of size 3 with stride 1; each convolutional layer is followed by a max-pooling layer.
6. The text-independent voiceprint recognition method according to claim 1, characterized in that in step 1.5), two fully connected layers are attached, with ReLU and Softmax activations respectively; the output of the last fully connected layer is the N speaker classes.
7. The text-independent voiceprint recognition method according to claim 1, characterized in that in step 2), after model training is completed, the embedding, a 1024-dimensional vector, is extracted before the nonlinearity of the first fully connected layer.
8. The text-independent voiceprint recognition method according to claim 1, characterized in that in step 3), the score between the enrollment and test embeddings is computed with the cosine distance and compared with a threshold to make the final accept/reject decision: the claim is accepted if the score exceeds the threshold and rejected otherwise; the formula is as follows:

score(w1, w2) = <w1, w2> / (||w1|| · ||w2||)

where w1 and w2 are the enrollment and test embeddings respectively, score(w1, w2) denotes the cosine distance, <w1, w2> is the dot product of the two embeddings, ||w1|| and ||w2|| are their lengths, and θ is a preset threshold.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810457528.XA CN108648759A (en) | 2018-05-14 | 2018-05-14 | A text-independent voiceprint recognition method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810457528.XA CN108648759A (en) | 2018-05-14 | 2018-05-14 | A text-independent voiceprint recognition method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108648759A true CN108648759A (en) | 2018-10-12 |
Family
ID=63755316
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810457528.XA Pending CN108648759A (en) | 2018-05-14 | 2018-05-14 | A kind of method for recognizing sound-groove that text is unrelated |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108648759A (en) |
- 2018
- 2018-05-14 CN CN201810457528.XA patent/CN108648759A/en active Pending
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20060022492A (en) * | 2004-09-07 | 2006-03-10 | 학교법인연세대학교 | Transformation method of speech feature vector for speaker recognition |
CN107492382A (en) * | 2016-06-13 | 2017-12-19 | 阿里巴巴集团控股有限公司 | Voiceprint extraction method and device based on neural network |
CN107146624A (en) * | 2017-04-01 | 2017-09-08 | 清华大学 | A kind of method for identifying speaker and device |
CN107464568A (en) * | 2017-09-25 | 2017-12-12 | 四川长虹电器股份有限公司 | Text-independent speaker recognition method and system based on a three-dimensional convolutional neural network |
CN107993071A (en) * | 2017-11-21 | 2018-05-04 | 平安科技(深圳)有限公司 | Electronic device, auth method and storage medium based on vocal print |
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109584887A (en) * | 2018-12-24 | 2019-04-05 | 科大讯飞股份有限公司 | A kind of method and apparatus that voiceprint extracts model generation, voiceprint extraction |
CN109584887B (en) * | 2018-12-24 | 2022-12-02 | 科大讯飞股份有限公司 | Method and device for generating voiceprint information extraction model and extracting voiceprint information |
CN110033757A (en) * | 2019-04-04 | 2019-07-19 | 行知技术有限公司 | A kind of voice recognizer |
CN110120223A (en) * | 2019-04-22 | 2019-08-13 | 南京硅基智能科技有限公司 | A voiceprint recognition method based on a time-delay neural network (TDNN) |
CN110136686A (en) * | 2019-05-14 | 2019-08-16 | 南京邮电大学 | Multi-to-multi voice conversion method based on STARGAN Yu i vector |
CN110189757A (en) * | 2019-06-27 | 2019-08-30 | 电子科技大学 | A kind of giant panda individual discrimination method, equipment and computer readable storage medium |
CN110675878A (en) * | 2019-09-23 | 2020-01-10 | 金瓜子科技发展(北京)有限公司 | Method and device for identifying vehicle and merchant, storage medium and electronic equipment |
CN110942777B (en) * | 2019-12-05 | 2022-03-08 | 出门问问信息科技有限公司 | Training method and device for voiceprint neural network model and storage medium |
CN110942777A (en) * | 2019-12-05 | 2020-03-31 | 出门问问信息科技有限公司 | Training method and device for voiceprint neural network model and storage medium |
CN111081260A (en) * | 2019-12-31 | 2020-04-28 | 苏州思必驰信息科技有限公司 | Method and system for identifying voiceprint of awakening word |
CN111429921A (en) * | 2020-03-02 | 2020-07-17 | 厦门快商通科技股份有限公司 | Voiceprint recognition method, system, mobile terminal and storage medium |
CN111429921B (en) * | 2020-03-02 | 2023-01-03 | 厦门快商通科技股份有限公司 | Voiceprint recognition method, system, mobile terminal and storage medium |
CN113360869A (en) * | 2020-03-04 | 2021-09-07 | 北京嘉诚至盛科技有限公司 | Method for starting application, electronic equipment and computer readable medium |
CN112382298A (en) * | 2020-11-17 | 2021-02-19 | 北京清微智能科技有限公司 | Awakening word voiceprint recognition method, awakening word voiceprint recognition model and training method thereof |
CN112382298B (en) * | 2020-11-17 | 2024-03-08 | 北京清微智能科技有限公司 | Wake-word voiceprint recognition method, wake-word voiceprint recognition model, and training method thereof |
CN113488058A (en) * | 2021-06-23 | 2021-10-08 | 武汉理工大学 | Voiceprint recognition method based on short voice |
CN113488060A (en) * | 2021-06-25 | 2021-10-08 | 武汉理工大学 | Voiceprint recognition method and system based on variational information bottleneck |
CN113488060B (en) * | 2021-06-25 | 2022-07-19 | 武汉理工大学 | Voiceprint recognition method and system based on variational information bottleneck |
CN114826709A (en) * | 2022-04-15 | 2022-07-29 | 马上消费金融股份有限公司 | Identity authentication and acoustic environment detection method, system, electronic device and medium |
CN115457968A (en) * | 2022-08-26 | 2022-12-09 | 华南理工大学 | Voiceprint verification method based on mixed-resolution depthwise separable convolutional network |
CN115457968B (en) * | 2022-08-26 | 2024-07-05 | 华南理工大学 | Voiceprint verification method based on mixed-resolution depthwise separable convolutional network |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108648759A (en) | Text-independent voiceprint recognition method | |
CN105632501B (en) | Automatic accent classification method and device based on deep learning | |
CN108564942B (en) | Voice emotion recognition method and system based on adjustable sensitivity | |
CN110473566A (en) | Audio separation method, device, electronic equipment and computer readable storage medium | |
CN104732978B (en) | Text-dependent speaker recognition method based on joint deep learning | |
CN104036774B (en) | Tibetan dialect recognition method and system | |
CN110211574A (en) | Speech recognition model building method based on bottleneck features and a multi-scale multi-head attention mechanism | |
CN110289003A (en) | Voiceprint recognition method, model training method, and server | |
CN109816092A (en) | Deep neural network training method, device, electronic equipment and storage medium | |
CN107492382A (en) | Voiceprint extraction method and device based on neural networks | |
CN110675859B (en) | Multi-emotion recognition method, system, medium, and apparatus combining speech and text | |
CN107731233A (en) | RNN-based voiceprint recognition method | |
CN106503805A (en) | Bimodal person-to-person dialogue sentiment analysis system and method based on machine learning | |
CN109119072A (en) | Civil aviation air-ground communication acoustic model construction method based on DNN-HMM | |
CN110390955A (en) | Cross-corpus speech emotion recognition method based on deep domain-adaptive convolutional neural networks | |
CN108922541A (en) | Multidimensional feature parameter voiceprint recognition method based on DTW and GMM models | |
CN110428843A (en) | Deep learning method for voice gender identification | |
CN107993664B (en) | Robust speaker recognition method based on competitive neural network | |
CN111048097B (en) | Twin network voiceprint recognition method based on 3D convolution | |
CN113223536B (en) | Voiceprint recognition method and device and terminal equipment | |
CN103578481A (en) | Cross-lingual speech emotion recognition method | |
CN108877812B (en) | Voiceprint recognition method and device and storage medium | |
CN106898355A (en) | Speaker recognition method based on two-stage modeling | |
CN110085216A (en) | Infant cry detection method and device | |
CN106297769B (en) | Discriminative feature extraction method for language identification | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20181012 |
|
WD01 | Invention patent application deemed withdrawn after publication |