CN108648759A - A text-independent voiceprint recognition method - Google Patents
A text-independent voiceprint recognition method
- Publication number: CN108648759A (application CN201810457528.XA)
- Authority
- CN
- China
- Prior art keywords
- voice
- level
- frame
- groove
- layer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
- G10L17/04—Training, enrolment or model building
- G10L17/18—Artificial neural networks; Connectionist approaches
- G10L17/22—Interactive procedures; Man-machine interfaces
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Telephonic Communication Services (AREA)
Abstract
The invention discloses a text-independent voiceprint recognition method comprising three stages: voiceprint recognition model training, embedding extraction, and decision scoring. The model training stage consists of: 1) speech signal preprocessing; 2) frame-level operations; 3) a statistics pooling layer that aggregates the frame-level output; 4) one-dimensional convolution; 5) fully connected layers that output the speaker classes. After training is complete, the embedding is extracted before the nonlinearity of the first fully connected layer. Finally, a decision score is computed with the cosine distance and the claim is accepted or rejected. The invention combines neural-network embeddings with convolutional neural networks: one-dimensional convolutions with max-pooling layers reduce dimensionality, and the added convolutional layers perform deeper feature extraction and thereby improve model performance, while using the cosine distance as the scoring criterion makes the process faster and simpler.
Description
Technical field
The present invention relates to the technical field of voiceprint recognition, and in particular to a text-independent voiceprint recognition method that combines neural-network embeddings with convolutional neural networks.
Background technology
A voiceprint is the spectrum of the sound wave carrying the verbal information in human speech. Like a fingerprint, it is a unique biometric feature with identifying power: it is both speaker-specific and relatively stable. A speech signal is a one-dimensional continuous signal; after discretization, it becomes the discrete speech signal that computers commonly process today. Voiceprint recognition (also called speaker recognition) is a biometric technology that, much like the fingerprint recognition now widely used on smartphones, extracts voice features from the speech signal produced by a speaker and authenticates the speaker's identity accordingly.
The mainstream voiceprint recognition approach is the i-vector system. Building on joint factor analysis, it proposes that speaker and session variability can be characterized by a single subspace. Using this subspace, the high-dimensional statistics obtained from an utterance can be projected into a low-dimensional vector, the i-vector. Later, as hardware performance improved, deep neural networks were successfully applied to acoustic modeling and recognition accuracy improved considerably. Models combining DNNs with i-vectors were proposed: during sufficient-statistics extraction, the UBM of the original i-vector model is replaced with a DNN based on phoneme states, yielding the posterior probability of each class for every frame.
The most recent technique is the acoustic recognition model proposed by David Snyder et al., which extracts embedded features from a time-delay neural network and is known as the x-vector. The model computes a speaker embedding from variable-length speech and is an end-to-end system. Its steps are as follows:
Model training is performed first. The speech signal is preprocessed, and the first five layers of the network operate at the frame level. A statistics pooling layer receives the output of the last frame-level layer as input, aggregates all frames of an utterance, and computes their mean and standard deviation. The network then operates at the segment level: fully connected layers with ReLU activations are applied, and a final fully connected Softmax layer outputs the N speaker classes.
After training is complete, speech of any length is mapped directly to a fixed-length speaker embedding. Pairs of enrollment and test utterances are then scored with a PLDA back-end, and a final accept/reject decision is made.
Current network structures rely entirely on fully connected layers. It is understood that a deeper network has greater expressive power, but training a deep fully connected network by gradient descent is very difficult, because the gradients of a fully connected network are hard to propagate through more than about three layers. A very deep fully connected network is therefore impractical, which limits its capability.
Content of the invention
The object of the present invention is to overcome the shortcomings and deficiencies of the prior art by proposing a text-independent voiceprint recognition method. It improves the neural-network embedding structure with convolutional neural networks: the output of the statistics pooling layer is processed with one-dimensional convolutions, dimensionality is reduced with max-pooling layers, and additional convolutional layers perform deeper feature extraction. This improves model performance, and using the cosine distance as the scoring criterion makes the process faster and simpler.
To achieve the above object, the technical solution provided by the present invention is a text-independent voiceprint recognition method comprising the following steps:
1) Voiceprint recognition model training
1.1) speech signal preprocessing;
1.2) frame-level operations;
1.3) statistics pooling layer aggregates the frame-level output;
1.4) one-dimensional convolution;
1.5) fully connected layers output the speaker classes;
2) Embedding extraction: after model training is completed, the enrollment and test utterances are fed into the voiceprint recognition model and the embeddings are extracted;
3) Decision scoring: the score between the enrollment and test embeddings is computed with the cosine distance, and a final accept/reject decision is made.
In step 1.1), each utterance in the corpus is split into 25 ms frames and voice activity detection is applied to identify and remove long silent segments from the speech stream. Twenty Mel-frequency cepstral coefficients (MFCC) are generated per frame; adding the first- and second-order difference coefficients yields 60-dimensional MFCC feature vectors as input.
In step 1.2), the first five layers of the network operate at the frame level with a time-delay architecture. Let t be the current frame. At the input, the MFCC vectors of the frames at {t-2, t-1, t, t+1, t+2} are spliced together. The next two layers splice the previous layer's output at times {t-2, t, t+2} and {t-3, t, t+3}, respectively. The last two layers also operate at the frame level but add no additional frames. In total, the frame-level part of the network spans 15 frames, from t-7 to t+7.
In step 1.3), the statistics pooling layer receives the output of the last frame-level layer as input, aggregates all frames of an utterance, and computes their mean. Supposing an utterance is divided into T frames in total, the statistics pooling layer aggregates the outputs of all T frames from the fifth frame-level layer and computes their average, a 3200-dimensional statistic computed once per input utterance. This step aggregates information over the time dimension, so that subsequent layers operate on the entire utterance.
In step 1.4), the output of the statistics pooling layer is processed with one-dimensional convolutions, five convolutional layers in total. The first two convolutional layers use 256 kernels of size 5 with stride 2; the third, fourth, and fifth use 256 kernels of size 3 with stride 1. Each convolutional layer is followed by a max-pooling layer.
In step 1.5), two fully connected layers are attached, with ReLU and Softmax activations respectively; the output of the last fully connected layer is the N speaker classes.
In step 2), after model training is completed, the embedding, a 1024-dimensional vector, is extracted before the nonlinearity of the first fully connected layer.
In step 3), the score between the enrollment-utterance embedding and the test-utterance embedding is computed with the cosine distance and compared with a threshold to make the final accept/reject decision: the claim is accepted if the score exceeds the threshold and rejected otherwise. The formula is as follows:

score(w1, w2) = <w1, w2> / (||w1|| · ||w2||)

where w1 and w2 are the enrollment and test embeddings respectively, score(w1, w2) denotes the cosine distance, <w1, w2> is the dot product of the two embeddings, ||w1|| and ||w2|| are their lengths, and θ is a preset threshold.
Compared with the prior art, the present invention has the following advantages and beneficial effects:
1. In a convolutional network, each neuron is no longer connected to all neurons of the previous layer, but only to a small fraction of them. This greatly reduces the number of parameters.
2. A group of connections shares the same weight, rather than each connection having its own weight, which again reduces the number of parameters.
3. Max-pooling layers reduce the sample dimension of each layer, further reducing the parameter count while improving model robustness.
4. Using the cosine distance as the decision score for speaker verification makes the process faster and simpler.
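The parameter savings in advantages 1–3 can be made concrete with a small arithmetic sketch. The 256 kernels of size 5 match the first convolutional layer described in this patent; the dense-layer size used for comparison is illustrative, not a figure from the patent:

```python
# A fully connected layer mapping the 3200-dim pooled statistic to a
# 3200-dim output needs one weight per connection:
dense_params = 3200 * 3200        # 10,240,000 weights

# A 1-D convolutional layer scanning the same input with 256 *shared*
# kernels of size 5 (as in the patent's first conv layer) needs only:
conv_params = 256 * 5             # 1,280 weights (biases omitted)

print(dense_params // conv_params)  # 8000
```

Weight sharing is what makes the extra convolutional depth affordable: the same 1,280 weights are reused at every position along the input.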
Description of the drawings
Fig. 1 is the logical flow chart of the method for the present invention.
Fig. 2 is the voiceprint recognition model training flow chart of the present invention.
Specific implementation
The present invention is further described below with reference to a specific embodiment.
As shown in Fig. 1, the text-independent voiceprint recognition method provided by this embodiment is divided into three stages: voiceprint recognition model training, embedding extraction, and decision scoring.
The voiceprint recognition model is trained first. A suitable corpus is selected, for example the open-source AISHELL-ASR0009-OS1 Chinese speech database, which contains a training set and a test set.
As shown in Fig. 2, the voiceprint recognition model is trained as follows:
1) Speech signal preprocessing
Each utterance in the corpus is split into 25 ms frames and voice activity detection is applied to identify and remove long silent segments from the speech stream. Twenty Mel-frequency cepstral coefficients (MFCC) are generated per frame; adding the first- and second-order difference coefficients finally yields 60-dimensional MFCC feature vectors as input.
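The framing and silence removal can be sketched in numpy. The 25 ms frame length follows the patent; the energy-based VAD and its threshold are illustrative assumptions, since the patent does not specify the detection algorithm (nor the MFCC implementation, which is omitted here):

```python
import numpy as np

def frame_signal(x, sr, frame_ms=25):
    """Split a 1-D signal into non-overlapping 25 ms frames (step 1.1)."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(x) // frame_len
    return x[:n_frames * frame_len].reshape(n_frames, frame_len)

def energy_vad(frames, threshold=1e-3):
    """Keep frames whose mean energy exceeds a threshold (illustrative VAD)."""
    energy = (frames ** 2).mean(axis=1)
    return frames[energy > threshold]

sr = 16000
t = np.arange(sr // 2)
x = np.concatenate([np.zeros(sr // 2),                          # 0.5 s silence
                    0.5 * np.sin(2 * np.pi * 440 * t / sr)])    # 0.5 s tone
frames = frame_signal(x, sr)   # 40 frames of 400 samples each
voiced = energy_vad(frames)    # the 20 silent frames are removed
print(frames.shape, voiced.shape)  # (40, 400) (20, 400)
```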
2) Frame-level operations
The first five layers of the voiceprint model network operate at the frame level with a time-delay architecture. Let t be the current frame. At the input, we splice together the MFCC vectors of the frames at {t-2, t-1, t, t+1, t+2}. The next two layers splice the previous layer's output at times {t-2, t, t+2} and {t-3, t, t+3}, respectively. The last two layers also operate at the frame level but add no additional frames. In total, the frame-level part of the network spans 15 frames, from t-7 to t+7.
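The context splicing at the first frame-level layer can be sketched in numpy. The offsets {-2, -1, 0, 1, 2} and the 60-dimensional MFCC input follow the patent; clamping indices at the utterance boundaries is an assumption, since the patent does not specify edge handling:

```python
import numpy as np

def splice(features, offsets):
    """Concatenate each frame with its neighbours at the given offsets.

    features: (T, D) matrix of per-frame features.
    Indices are clamped at the utterance boundaries (an assumption).
    """
    T = features.shape[0]
    idx = np.clip(np.arange(T)[:, None] + np.array(offsets)[None, :], 0, T - 1)
    return features[idx].reshape(T, -1)

mfcc = np.random.randn(100, 60)               # 100 frames of 60-dim MFCCs
layer1_in = splice(mfcc, [-2, -1, 0, 1, 2])   # 5 x 60 = 300-dim, as in Table 1
print(layer1_in.shape)  # (100, 300)
```

The second and third frame-level layers apply the same operation to the previous layer's 1024-dim output with offsets {-2, 0, 2} and {-3, 0, 3}, giving the 3072-dim inputs listed in Table 1.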
3) Statistics pooling layer aggregates the frame-level output
The statistics pooling layer receives the output of the last frame-level layer as input, aggregates all frames of an utterance, and computes their mean. Supposing an utterance is divided into T frames in total, the statistics pooling layer aggregates the outputs of all T frames from the fifth frame-level layer and computes their average. The statistic is a 3200-dimensional vector, computed once per input utterance. This step aggregates information over the time dimension, so that subsequent layers operate on the entire utterance.
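The pooling step itself is a mean over the time axis; a one-line numpy sketch, with the 3200-dimensional frame-level output taken from the patent's Table 1 and T chosen arbitrarily:

```python
import numpy as np

T = 200                                # number of frames in the utterance
frame_out = np.random.randn(T, 3200)   # fifth frame-level layer output
pooled = frame_out.mean(axis=0)        # one 3200-dim statistic per utterance
print(pooled.shape)  # (3200,)
```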
4) One-dimensional convolution
The output of the statistics pooling layer is processed with one-dimensional convolutions. The first two convolutional layers use 256 kernels of size 5 with stride 2; the third, fourth, and fifth use 256 kernels of size 3 with stride 1. Each convolutional layer is followed by a max-pooling layer.
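One such layer can be sketched in numpy as a valid 1-D convolution with stride, followed by max-pooling. The kernel size 5, stride 2, and 3200-dim input follow the patent; the pooling window of 2, the small number of demo kernels (the patent uses 256), and the absence of bias and activation are simplifications for illustration:

```python
import numpy as np

def conv1d(x, kernels, stride):
    """Valid 1-D convolution. x: (in_ch, L); kernels: (out_ch, in_ch, k)."""
    out_ch, in_ch, k = kernels.shape
    L_out = (x.shape[1] - k) // stride + 1
    out = np.empty((out_ch, L_out))
    for i in range(L_out):
        window = x[:, i * stride:i * stride + k]       # (in_ch, k) slice
        out[:, i] = (kernels * window).sum(axis=(1, 2))
    return out

def maxpool1d(x, size=2):
    """Non-overlapping max-pooling along the length axis."""
    L = x.shape[1] // size
    return x[:, :L * size].reshape(x.shape[0], L, size).max(axis=2)

x = np.random.randn(1, 3200)             # pooled statistic as a 1-channel signal
k1 = np.random.randn(8, 1, 5)            # 8 demo kernels of size 5
h1 = maxpool1d(conv1d(x, k1, stride=2))  # (8, 1598) -> (8, 799)
print(h1.shape)  # (8, 799)
```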
5) Fully connected layers output the speaker classes
Two fully connected layers are attached, with ReLU and Softmax activations respectively; the output of the last fully connected layer is the N speaker classes.
The network structure of the frame-level operations and the statistics pooling layer is shown in Table 1:

Table 1. Frame-level operations and statistics pooling layer network structure

Layer | Frames per layer | Total context (frames) | Input → output |
---|---|---|---|
Frame-level layer 1 | [t-2, t+2] | 5 | 300 → 1024 |
Frame-level layer 2 | {t-2, t, t+2} | 9 | 3072 → 1024 |
Frame-level layer 3 | {t-3, t, t+3} | 15 | 3072 → 1024 |
Frame-level layer 4 | {t} | 15 | 1024 → 1024 |
Frame-level layer 5 | {t} | 15 | 1024 → 3200 |
Statistics pooling | [0, T] | T | 3200T → 3200 |
The convolutional-layer and fully-connected-layer network structure is shown in Table 2:

Table 2. Convolutional layer and fully connected layer network structure
Steps 2) to 5) above are applied to the MFCCs of every utterance, continually updating the convolution kernels and fully connected layer parameters, to complete the training of the voiceprint recognition model.
Embedding extraction: after model training is completed, the model is tested with the test-set utterances of the corpus. The enrollment and test utterances are fed into the voiceprint recognition model, and the embedding, a 1024-dimensional vector, is extracted before the nonlinearity of the first fully connected layer of the recognition model.
Decision scoring: the score between the enrollment and test embeddings is computed with the cosine distance and compared with a threshold to make the final accept/reject decision: the claim is accepted if the score exceeds the threshold and rejected otherwise. The formula is as follows:

score(w1, w2) = <w1, w2> / (||w1|| · ||w2||)

where w1 and w2 are the enrollment and test embeddings respectively, score(w1, w2) denotes the cosine distance, <w1, w2> is the dot product of the two embeddings, ||w1|| and ||w2|| are their lengths, and θ is a preset threshold.
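The scoring rule is a direct computation; a numpy sketch, with 3-dimensional toy embeddings and a threshold of 0.5 chosen purely for illustration:

```python
import numpy as np

def cosine_score(w1, w2):
    """score(w1, w2) = <w1, w2> / (||w1|| * ||w2||)."""
    return float(w1 @ w2 / (np.linalg.norm(w1) * np.linalg.norm(w2)))

def decide(w_enroll, w_test, theta=0.5):
    """Accept the identity claim if the cosine score exceeds the threshold."""
    return cosine_score(w_enroll, w_test) > theta

w_enroll = np.array([1.0, 0.0, 1.0])
same = np.array([0.9, 0.1, 1.1])     # close in direction -> high score, accept
other = np.array([-1.0, 1.0, 0.0])   # different direction -> low score, reject
print(decide(w_enroll, same), decide(w_enroll, other))  # True False
```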
The embodiment described above is only a preferred embodiment of the invention and is not intended to limit the scope of the present invention; any change made according to the shapes and principles of the present invention shall be covered within the scope of the present invention.
Claims (8)
1. A text-independent voiceprint recognition method, characterized by comprising the following steps:
1) Voiceprint recognition model training
1.1) speech signal preprocessing;
1.2) frame-level operations;
1.3) statistics pooling layer aggregates the frame-level output;
1.4) one-dimensional convolution;
1.5) fully connected layers output the speaker classes;
2) Embedding extraction: after model training is completed, the enrollment and test utterances are fed into the voiceprint recognition model and the embeddings are extracted;
3) Decision scoring: the score between the enrollment and test embeddings is computed with the cosine distance, and a final accept/reject decision is made.
2. The text-independent voiceprint recognition method according to claim 1, characterized in that in step 1.1), each utterance in the corpus is split into 25 ms frames and voice activity detection is applied to identify and remove silent segments longer than a preset value from the speech stream; twenty Mel-frequency cepstral coefficients (MFCC) are generated per frame, and adding the first- and second-order difference coefficients yields 60-dimensional MFCC feature vectors as input.
3. The text-independent voiceprint recognition method according to claim 1, characterized in that in step 1.2), the first five layers of the network operate at the frame level with a time-delay architecture; let t be the current frame: at the input, the MFCC vectors of the frames at {t-2, t-1, t, t+1, t+2} are spliced together; the next two layers splice the previous layer's output at times {t-2, t, t+2} and {t-3, t, t+3}, respectively; the last two layers also operate at the frame level but add no additional frames; in total, the frame-level part of the network spans 15 frames, from t-7 to t+7.
4. The text-independent voiceprint recognition method according to claim 1, characterized in that in step 1.3), the statistics pooling layer receives the output of the last frame-level layer as input, aggregates all frames of an utterance, and computes their mean; supposing an utterance is divided into T frames in total, the statistics pooling layer aggregates the outputs of all T frames from the fifth frame-level layer and computes their average, a 3200-dimensional statistic computed once per input utterance; this step aggregates information over the time dimension, so that subsequent layers operate on the entire utterance.
5. The text-independent voiceprint recognition method according to claim 1, characterized in that in step 1.4), the output of the statistics pooling layer is processed with one-dimensional convolutions, five convolutional layers in total; the first two convolutional layers use 256 kernels of size 5 with stride 2; the third, fourth, and fifth use 256 kernels of size 3 with stride 1; each convolutional layer is followed by a max-pooling layer.
6. The text-independent voiceprint recognition method according to claim 1, characterized in that in step 1.5), two fully connected layers are attached, with ReLU and Softmax activations respectively; the output of the last fully connected layer is the N speaker classes.
7. The text-independent voiceprint recognition method according to claim 1, characterized in that in step 2), after model training is completed, the embedding, a 1024-dimensional vector, is extracted before the nonlinearity of the first fully connected layer.
8. The text-independent voiceprint recognition method according to claim 1, characterized in that in step 3), the score between the enrollment and test embeddings is computed with the cosine distance and compared with a threshold to make the final accept/reject decision: the claim is accepted if the score exceeds the threshold and rejected otherwise; the formula is as follows:

score(w1, w2) = <w1, w2> / (||w1|| · ||w2||)

where w1 and w2 are the enrollment and test embeddings respectively, score(w1, w2) denotes the cosine distance, <w1, w2> is the dot product of the two embeddings, ||w1|| and ||w2|| are their lengths, and θ is a preset threshold.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810457528.XA CN108648759A (en) | 2018-05-14 | 2018-05-14 | A text-independent voiceprint recognition method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810457528.XA CN108648759A (en) | 2018-05-14 | 2018-05-14 | A text-independent voiceprint recognition method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108648759A true CN108648759A (en) | 2018-10-12 |
Family
ID=63755316
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810457528.XA Pending CN108648759A (en) | 2018-05-14 | 2018-05-14 | A kind of method for recognizing sound-groove that text is unrelated |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108648759A (en) |
- 2018
- 2018-05-14 CN CN201810457528.XA patent/CN108648759A/en active Pending
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20060022492A (en) * | 2004-09-07 | 2006-03-10 | 학교법인연세대학교 | Transformation method of speech feature vector for speaker recognition |
CN107492382A (en) * | 2016-06-13 | 2017-12-19 | 阿里巴巴集团控股有限公司 | Voiceprint extraction method and device based on neural network |
CN107146624A (en) * | 2017-04-01 | 2017-09-08 | 清华大学 | A kind of method for identifying speaker and device |
CN107464568A (en) * | 2017-09-25 | 2017-12-12 | 四川长虹电器股份有限公司 | Text-independent speaker recognition method and system based on a three-dimensional convolutional neural network |
CN107993071A (en) * | 2017-11-21 | 2018-05-04 | 平安科技(深圳)有限公司 | Electronic device, auth method and storage medium based on vocal print |
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109584887A (en) * | 2018-12-24 | 2019-04-05 | 科大讯飞股份有限公司 | A kind of method and apparatus that voiceprint extracts model generation, voiceprint extraction |
CN109584887B (en) * | 2018-12-24 | 2022-12-02 | 科大讯飞股份有限公司 | Method and device for generating voiceprint information extraction model and extracting voiceprint information |
CN110033757A (en) * | 2019-04-04 | 2019-07-19 | 行知技术有限公司 | A kind of voice recognizer |
CN110120223A (en) * | 2019-04-22 | 2019-08-13 | 南京硅基智能科技有限公司 | A voiceprint recognition method based on a time-delay neural network (TDNN) |
CN110136686A (en) * | 2019-05-14 | 2019-08-16 | 南京邮电大学 | Multi-to-multi voice conversion method based on STARGAN Yu i vector |
CN110189757A (en) * | 2019-06-27 | 2019-08-30 | 电子科技大学 | A kind of giant panda individual discrimination method, equipment and computer readable storage medium |
CN110675878A (en) * | 2019-09-23 | 2020-01-10 | 金瓜子科技发展(北京)有限公司 | Method and device for identifying vehicle and merchant, storage medium and electronic equipment |
CN110942777B (en) * | 2019-12-05 | 2022-03-08 | 出门问问信息科技有限公司 | Training method and device for voiceprint neural network model and storage medium |
CN110942777A (en) * | 2019-12-05 | 2020-03-31 | 出门问问信息科技有限公司 | Training method and device for voiceprint neural network model and storage medium |
CN111081260A (en) * | 2019-12-31 | 2020-04-28 | 苏州思必驰信息科技有限公司 | Method and system for identifying voiceprint of awakening word |
CN111429921A (en) * | 2020-03-02 | 2020-07-17 | 厦门快商通科技股份有限公司 | Voiceprint recognition method, system, mobile terminal and storage medium |
CN111429921B (en) * | 2020-03-02 | 2023-01-03 | 厦门快商通科技股份有限公司 | Voiceprint recognition method, system, mobile terminal and storage medium |
CN113360869A (en) * | 2020-03-04 | 2021-09-07 | 北京嘉诚至盛科技有限公司 | Method for starting application, electronic equipment and computer readable medium |
CN112382298A (en) * | 2020-11-17 | 2021-02-19 | 北京清微智能科技有限公司 | Awakening word voiceprint recognition method, awakening word voiceprint recognition model and training method thereof |
CN112382298B (en) * | 2020-11-17 | 2024-03-08 | 北京清微智能科技有限公司 | Wake-word voiceprint recognition method, wake-word voiceprint recognition model, and training method thereof |
CN113488058A (en) * | 2021-06-23 | 2021-10-08 | 武汉理工大学 | Voiceprint recognition method based on short voice |
CN113488060A (en) * | 2021-06-25 | 2021-10-08 | 武汉理工大学 | Voiceprint recognition method and system based on variational information bottleneck |
CN113488060B (en) * | 2021-06-25 | 2022-07-19 | 武汉理工大学 | Voiceprint recognition method and system based on variational information bottleneck |
CN114826709A (en) * | 2022-04-15 | 2022-07-29 | 马上消费金融股份有限公司 | Identity authentication and acoustic environment detection method, system, electronic device and medium |
CN115457968A (en) * | 2022-08-26 | 2022-12-09 | 华南理工大学 | Voiceprint verification method based on mixed-resolution depthwise separable convolutional network |
CN115457968B (en) * | 2022-08-26 | 2024-07-05 | 华南理工大学 | Voiceprint verification method based on mixed-resolution depthwise separable convolutional network |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108648759A (en) | Text-independent voiceprint recognition method | |
CN105632501B (en) | Automatic accent classification method and device based on deep learning | |
CN108564942B (en) | Voice emotion recognition method and system based on adjustable sensitivity | |
CN110473566A (en) | Audio separation method, device, electronic equipment and computer readable storage medium | |
CN104732978B (en) | Text-dependent speaker recognition method based on joint deep learning | |
CN104036774B (en) | Tibetan dialect recognition method and system | |
CN110211574A (en) | Speech recognition model building method based on bottleneck features and a multi-scale multi-head attention mechanism | |
CN110289003A (en) | Voiceprint recognition method, model training method, and server | |
CN109816092A (en) | Deep neural network training method, device, electronic equipment and storage medium | |
CN107492382A (en) | Voiceprint extraction method and device based on neural networks | |
CN110675859B (en) | Multi-emotion recognition method, system, medium, and apparatus combining speech and text | |
CN107731233A (en) | RNN-based voiceprint recognition method | |
CN106503805A (en) | Bimodal person-to-person dialogue sentiment analysis system and method based on machine learning | |
CN109119072A (en) | Civil aviation air-ground communication acoustic model construction method based on DNN-HMM | |
CN110390955A (en) | Cross-corpus speech emotion recognition method based on deep domain-adaptive convolutional neural networks | |
CN108922541A (en) | Multidimensional feature parameter voiceprint recognition method based on DTW and GMM models | |
CN110428843A (en) | Deep learning method for voice gender identification | |
CN107993664B (en) | Robust speaker recognition method based on competitive neural network | |
CN111048097B (en) | Twin network voiceprint recognition method based on 3D convolution | |
CN113223536B (en) | Voiceprint recognition method and device and terminal equipment | |
CN103578481A (en) | Cross-lingual speech emotion recognition method | |
CN108877812B (en) | Voiceprint recognition method and device and storage medium | |
CN106898355A (en) | Speaker recognition method based on two-stage modeling | |
CN110085216A (en) | Infant cry detection method and device | |
CN106297769B (en) | Discriminative feature extraction method for language identification | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20181012 |
|
WD01 | Invention patent application deemed withdrawn after publication |