CN109256139A - A speaker recognition method based on Triplet-Loss - Google Patents
A speaker recognition method based on Triplet-Loss
- Publication number: CN109256139A
- Application number: CN201810835179.0A
- Authority
- CN
- China
- Prior art keywords
- neural network
- loss
- voice signal
- triplet
- voice
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G — PHYSICS
- G10 — MUSICAL INSTRUMENTS; ACOUSTICS
- G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00 — Speaker identification or verification techniques
- G10L17/18 — Artificial neural networks; Connectionist approaches
- G10L17/02 — Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
Abstract
The present invention relates to a speaker recognition method based on Triplet-Loss, comprising the following steps. S1: obtain voice signals comprising three groups of samples: a voice sequence of one speaker, another voice sequence of the same speaker, and a voice sequence of a different speaker. S2: pre-process the voice signals to remove the channel noise introduced during voice acquisition. S3: extract speech feature parameters from the denoised voice signals. S4: construct an RNN based on an LSTM network. S5: use 90% of the extracted triplets of speech feature parameters as the input for training the RNN. S6: after the RNN is trained, perform speaker recognition with the remaining 10% of the triplets as input. The present invention offers high accuracy, good recognition performance, and high reliability.
Description
Technical field
The present invention relates to the technical fields of neural networks and deep learning, and more particularly to a speaker recognition method based on Triplet-Loss.
Background technique
As information security problems grow more serious, their impact keeps increasing, and the problem of protecting personal privacy urgently needs solving. How to determine a person's identity accurately and safely has drawn wide attention. Voice, as a key interface of human-computer interaction, plays an important role in identity authentication. Voiceprint recognition, also known as speaker recognition, uses the voiceprint, a biometric unique to each speaker, as a new tool that overcomes the shortcomings of conventional authentication methods. Compared with other approaches, speech carrying voiceprint features is convenient and natural to obtain: voiceprint extraction can be completed without the user's conscious effort, so user acceptance is high. Acquiring speech is also inexpensive, requiring only a simple microphone, and no additional recording equipment at all when a communication device is used. Voiceprint recognition is well suited to remote identity confirmation: a single microphone, telephone, or mobile phone suffices to achieve remote login over a network (a telecommunication network or the Internet).
Early voiceprint recognition methods based on signal processing compute signal parameters of the voice data with signal processing techniques and then apply template matching, statistical variance analysis, and the like. Such methods are extremely sensitive to the voice data; their accuracy is very low, and the recognition performance is far from satisfactory.
Recognition methods based on Gaussian mixture models achieve reasonable results and are simple and flexible, but they demand very large amounts of voice data, are highly sensitive to channel and environmental noise, and cannot meet the requirements of real scenarios.
Existing methods based on deep neural networks do not consider the context-dependent nature of the voice signal; the extracted features cannot represent the speaker well and fail to exploit the full advantages of deep learning.
Summary of the invention
The object of the present invention is to overcome the deficiencies of the prior art and provide a speaker recognition method based on Triplet-Loss with high accuracy, good recognition performance, and high reliability.
To achieve the above object, the technical solution provided by the present invention is as follows:
A speaker recognition method based on Triplet-Loss, comprising the following steps:
S1: obtain voice signals comprising three groups of samples: a voice sequence Xa of one speaker, another voice sequence Xp of the same speaker, and a voice sequence Xn of a different speaker;
S2: pre-process the voice signals to remove the channel noise introduced during voice acquisition;
S3: extract speech feature parameters from the denoised voice signals;
S4: construct an RNN based on an LSTM network;
S5: use 90% of the triplets of speech feature parameters extracted in step S3 as the input for training the RNN;
S6: after the RNN is trained, perform speaker recognition with the remaining 10% of the triplets as the input of the RNN.
Further, step S2 denoises the voice signal by spectral subtraction, with the following specific steps:
S2-1: filter the voice signal;
S2-2: pre-emphasize the filtered signal, divide it into frames, and apply a Hamming window to each frame;
S2-3: apply the fast Fourier transform (FFT) to the windowed signal, compute the power spectrum of each frame, and then the average noise power;
S2-4: estimate the noise with VAD to monitor silent segments, then update the noise spectrum with recursive smoothing;
S2-5: perform the spectral subtraction to obtain the estimated power spectrum of the clean speech;
S2-6: reinsert the phase spectrum to compute the speech spectrum, then apply the inverse FFT to recover each speech frame;
S2-7: reassemble the speech frames and de-emphasize the result to obtain the denoised voice signal.
Further, step S3 extracts acoustic feature parameters from the denoised voice signal as follows:
S3-1: pre-emphasize the three groups of denoised voice signals, divide them into frames, and multiply each frame by a Hamming window;
S3-2: apply the FFT to each frame to obtain the energy distribution over the spectrum;
S3-3: pass the power spectrum through a bank of mel-scale triangular filters and compute the log energy output by each filter;
S3-4: obtain the output feature parameters through the discrete cosine transform.
Further, step S4 constructs the RNN by taking an LSTM network and adding a normalization layer and a Triplet-Loss layer after the LSTM feature output layer.
Further, through learning, the Triplet-Loss layer makes the distance between the feature representations of Xa and Xp as small as possible and the distance between the feature representations of Xa and Xn as large as possible, while enforcing a minimum margin α between the Xa-Xn distance and the Xa-Xp distance.
The corresponding objective function is:
L = Σ_i [ ||f(Xa_i) - f(Xp_i)||^2 - ||f(Xa_i) - f(Xn_i)||^2 + α ]_+
where ||f(Xa) - f(Xp)||^2 denotes the Euclidean distance measurement between Xa and Xp, and ||f(Xa) - f(Xn)||^2 denotes the Euclidean distance measurement between Xa and Xn. Distances are measured here with the Euclidean metric; when the value inside [·]_+ is greater than zero, that value is taken as the loss, and when it is negative, the loss is zero.
Further, step S6 performs speaker recognition as follows:
S6-1: obtain the feature representations f(Xa), f(Xp), f(Xn) of the three groups of samples through the LSTM network;
S6-2: normalize the obtained feature representations;
S6-3: optimize the neural network through the Triplet-Loss function;
S6-4: compare the Triplet-Loss metric with a preset threshold; if the metric is greater than the threshold, the speakers are the same person, otherwise they are different people.
Compared with the prior art, the principles and advantages of this solution are as follows:
1. The voice signal is pre-processed with spectral subtraction, which, compared with other methods, introduces the fewest constraints, has the most direct physical meaning, and requires little computation, thereby effectively improving recognition accuracy.
2. The model is trained with Triplet-Loss (the triplet loss function): the joint constraint of inter-class and intra-class losses drives the back-propagation optimization, so that samples of the same class lie as close as possible in feature space while samples of different classes lie as far apart as possible, improving the discriminability of the model and thus the reliability and accuracy of recognition.
Detailed description of the invention
Fig. 1 is a flowchart of the speaker recognition method based on Triplet-Loss of the present invention;
Fig. 2 is a flowchart of the spectral subtraction in the present invention;
Fig. 3 is a flowchart of the speech feature parameter extraction in the present invention.
Specific embodiment
The present invention is further described below with reference to specific embodiments:
Referring to Fig. 1, the speaker recognition method based on Triplet-Loss described in this embodiment includes the following steps:
S1: obtain voice signals comprising three groups of samples: a voice sequence Xa of one speaker, another voice sequence Xp of the same speaker, and a voice sequence Xn of a different speaker.
S2: pre-process the voice signals. Voice acquisition introduces considerable channel noise, which makes the recognition task harder, so the input voice data are first denoised by spectral subtraction: the noise spectrum estimate is subtracted from the noisy speech estimate to obtain the spectrum of the clean speech. What is removed here is channel noise, i.e., the noise caused by the recording equipment; while removing it, all the information related to the speaker is fully preserved.
As shown in Fig. 2, the voice signal is denoised by spectral subtraction as follows:
S2-1: filter the voice signal;
S2-2: pre-emphasize the filtered signal, divide it into frames, and apply a Hamming window to each frame.
Specifically, windowing is a necessary step in signal processing, because a computer can only handle signals of finite length: the original signal X(t) must be truncated over a sampling period T into a finite segment XT(t) before further processing, and this truncation is the windowing operation. In practice a rectangular window is commonly used, but it cuts the signal off abruptly at its edges, and all time-domain information outside the window disappears, which adds spurious frequency components in the frequency domain — a phenomenon known as spectral leakage. The main measure against leakage introduced by windowing is a sensible window function. The Hamming window is one such window: its main part is shaped like sin(x) over the interval from 0 to π, and the remainder is essentially zero, so that any function f multiplied by it is nonzero only over a limited region.
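The Hamming window described above can be sketched in NumPy (a minimal illustration; the 400-sample frame length is an assumed value corresponding to 25 ms at 16 kHz, not fixed by this passage):

```python
import numpy as np

def hamming(n: int) -> np.ndarray:
    # Standard Hamming window: w[k] = 0.54 - 0.46*cos(2*pi*k/(n-1)).
    # It tapers to ~0.08 at the edges instead of cutting off abruptly,
    # which reduces spectral leakage compared with a rectangular window.
    k = np.arange(n)
    return 0.54 - 0.46 * np.cos(2 * np.pi * k / (n - 1))

frame = np.ones(400)             # one hypothetical 25 ms frame at 16 kHz
windowed = frame * hamming(400)  # edges attenuated, center left near 1.0
```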
S2-3: apply the FFT to the windowed signal, compute the power spectrum of each frame, and then the average noise power;
S2-4: estimate the noise with VAD (Voice Activity Detection, i.e., speech endpoint detection) to monitor silent segments, then update the noise spectrum with recursive smoothing;
S2-5: perform the spectral subtraction to obtain the estimated power spectrum of the clean speech;
S2-6: reinsert the phase spectrum to compute the speech spectrum, then apply the inverse FFT to recover each speech frame;
S2-7: reassemble the speech frames and de-emphasize the result to obtain the denoised voice signal.
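Steps S2-3 through S2-6 can be illustrated for a single frame as follows (a simplified sketch: the FFT size and spectral floor are assumptions, pre-emphasis/de-emphasis is omitted, and the VAD-based noise tracking of step S2-4 is replaced by a noise power spectrum passed in directly):

```python
import numpy as np

def spectral_subtract(noisy: np.ndarray, noise_psd: np.ndarray,
                      n_fft: int = 512, floor: float = 1e-3) -> np.ndarray:
    """One-frame spectral subtraction: subtract the noise power spectrum
    estimate from the noisy power spectrum, keep the noisy phase, and
    invert back to the time domain."""
    spec = np.fft.rfft(noisy, n_fft)
    power = np.abs(spec) ** 2
    # Subtract the noise estimate; a small spectral floor avoids
    # negative power and heavy musical noise.
    clean_power = np.maximum(power - noise_psd, floor * power)
    # Step S2-6: reinsert the (noisy) phase spectrum, then inverse FFT.
    clean_spec = np.sqrt(clean_power) * np.exp(1j * np.angle(spec))
    return np.fft.irfft(clean_spec, n_fft)
```

With a zero noise estimate the frame is reconstructed unchanged, which is a convenient sanity check on the FFT round trip.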
S3: as shown in Fig. 3, extract speech feature parameters from the denoised voice signals as follows:
S3-1: pre-emphasize the three groups of denoised voice signals, then divide each into frames with a frame length of 25 ms and a frame shift of 10 ms, and multiply each frame by a Hamming window;
S3-2: apply the FFT to each frame to obtain the energy distribution over the spectrum;
S3-3: pass the power spectrum through a bank of mel-scale triangular filters and compute the log energy output by each filter;
S3-4: obtain the output speech feature parameters through the discrete cosine transform.
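Steps S3-1 through S3-4 amount to a standard MFCC computation, which can be sketched as follows (the filter count, FFT size, sampling rate, and number of cepstral coefficients are assumed defaults not specified in the text):

```python
import numpy as np

def mel_filterbank(n_filters: int = 26, n_fft: int = 512,
                   sr: int = 16000) -> np.ndarray:
    # Triangular filters spaced evenly on the mel scale (step S3-3).
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    return fbank

def mfcc_frame(frame: np.ndarray, n_fft: int = 512,
               n_ceps: int = 13) -> np.ndarray:
    # Steps S3-2..S3-4: windowed FFT power spectrum -> mel filterbank
    # log-energies -> DCT-II, keeping the first n_ceps coefficients.
    power = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2
    log_e = np.log(mel_filterbank(n_fft=n_fft) @ power + 1e-10)
    n = len(log_e)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps),
                                  2 * np.arange(n) + 1) / (2 * n))
    return dct @ log_e
```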
S4: after the speech feature parameters are obtained, construct an RNN (recurrent neural network) by taking an LSTM (long short-term memory) network and adding a normalization layer and a Triplet-Loss layer after the LSTM feature output layer.
Through learning, the Triplet-Loss layer used here makes the distance between the feature representations of Xa and Xp as small as possible and the distance between the feature representations of Xa and Xn as large as possible, while enforcing a minimum margin α between the Xa-Xn distance and the Xa-Xp distance.
The corresponding objective function is:
L = Σ_i [ ||f(Xa_i) - f(Xp_i)||^2 - ||f(Xa_i) - f(Xn_i)||^2 + α ]_+
where ||f(Xa) - f(Xp)||^2 denotes the Euclidean distance measurement between Xa and Xp, and ||f(Xa) - f(Xn)||^2 denotes the Euclidean distance measurement between Xa and Xn. Distances are measured here with the Euclidean metric; when the value inside [·]_+ is greater than zero, that value is taken as the loss, and when it is negative, the loss is zero.
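The objective above is the standard triplet hinge loss, which can be written directly in NumPy for a single triplet (the margin value 0.2 is an assumption; the text leaves α unspecified):

```python
import numpy as np

def triplet_loss(fa: np.ndarray, fp: np.ndarray, fn: np.ndarray,
                 alpha: float = 0.2) -> float:
    """Triplet loss on embeddings: hinge on squared Euclidean distances,
    zero once the negative is at least alpha farther than the positive."""
    d_ap = float(np.sum((fa - fp) ** 2))  # ||f(Xa) - f(Xp)||^2
    d_an = float(np.sum((fa - fn) ** 2))  # ||f(Xa) - f(Xn)||^2
    return max(d_ap - d_an + alpha, 0.0)  # the [.]_+ operation
```

When the anchor-negative distance already exceeds the anchor-positive distance by the margin, the loss vanishes and that triplet no longer drives the optimization.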
S5: use 90% of the triplets of speech feature parameters extracted in step S3 as the input for training the RNN.
S6: after the RNN is trained, perform speaker recognition with the remaining 10% of the triplets as the input of the RNN. The recognition proceeds as follows:
S6-1: obtain the feature representations f(Xa), f(Xp), f(Xn) of the three groups of samples through the LSTM network;
S6-2: normalize the obtained feature representations;
S6-3: optimize the neural network through the Triplet-Loss function;
S6-4: compare the Triplet-Loss metric with a preset threshold; if the metric is greater than the threshold, the speakers are the same person, otherwise they are different people.
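The decision in step S6-4 can be sketched as a simple threshold test (an illustration only: the threshold value is an assumption, and this sketch adopts the distance convention — accepting a match when the distance between normalized embeddings is small — whereas the text phrases the comparison in terms of the loss metric):

```python
import numpy as np

def same_speaker(fa: np.ndarray, fx: np.ndarray,
                 threshold: float = 0.8) -> bool:
    # With L2-normalized embeddings the squared Euclidean distance lies
    # in [0, 4]; declare the same speaker when it falls below the
    # (hypothetical) threshold.
    return float(np.sum((fa - fx) ** 2)) < threshold
```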
In this embodiment the voice signal is pre-processed with spectral subtraction, which, compared with other methods, introduces the fewest constraints, has the most direct physical meaning, and requires little computation, thereby effectively improving recognition accuracy. In addition, this embodiment trains the model with Triplet-Loss (the triplet loss function): the joint constraint of inter-class and intra-class losses drives the back-propagation optimization, so that samples of the same class lie as close as possible in feature space while samples of different classes lie as far apart as possible, improving the discriminability of the model and thus the reliability and accuracy of recognition.
The embodiments described above are only preferred embodiments of the invention and do not limit its scope of implementation; all changes made according to the shapes and principles of the present invention shall fall within the scope of protection of the present invention.
Claims (6)
1. A speaker recognition method based on Triplet-Loss, characterized by comprising the following steps:
S1: obtaining voice signals comprising three groups of samples: a voice sequence Xa of one speaker, another voice sequence Xp of the same speaker, and a voice sequence Xn of a different speaker;
S2: pre-processing the voice signals to remove the channel noise introduced during voice acquisition;
S3: extracting speech feature parameters from the denoised voice signals;
S4: constructing an RNN based on an LSTM network;
S5: using 90% of the triplets of speech feature parameters extracted in step S3 as the input for training the RNN;
S6: after the RNN is trained, performing speaker recognition with the remaining 10% of the triplets as the input of the RNN.
2. The speaker recognition method based on Triplet-Loss according to claim 1, characterized in that step S2 denoises the voice signal by spectral subtraction, with the following specific steps:
S2-1: filtering the voice signal;
S2-2: pre-emphasizing the filtered signal, dividing it into frames, and applying a Hamming window to each frame;
S2-3: applying the fast Fourier transform to the windowed signal, computing the power spectrum of each frame, and then the average noise power;
S2-4: estimating the noise with VAD to monitor silent segments, then updating the noise spectrum with recursive smoothing;
S2-5: performing the spectral subtraction to obtain the estimated power spectrum of the clean speech;
S2-6: reinserting the phase spectrum to compute the speech spectrum, then applying the inverse fast Fourier transform to recover each speech frame;
S2-7: reassembling the speech frames and de-emphasizing the result to obtain the denoised voice signal.
3. The speaker recognition method based on Triplet-Loss according to claim 1, characterized in that step S3 extracts acoustic feature parameters from the denoised voice signal as follows:
S3-1: pre-emphasizing the three groups of denoised voice signals, dividing them into frames, and multiplying each frame by a Hamming window;
S3-2: applying the fast Fourier transform to each frame to obtain the energy distribution over the spectrum;
S3-3: passing the power spectrum through a bank of mel-scale triangular filters and computing the log energy output by each filter;
S3-4: obtaining the output feature parameters through the discrete cosine transform.
4. The speaker recognition method based on Triplet-Loss according to claim 1, characterized in that step S4 constructs the RNN by taking an LSTM network and adding a normalization layer and a Triplet-Loss layer after the LSTM feature output layer.
5. The speaker recognition method based on Triplet-Loss according to claim 4, characterized in that, through learning, the Triplet-Loss layer makes the distance between the feature representations of Xa and Xp as small as possible and the distance between the feature representations of Xa and Xn as large as possible, while enforcing a minimum margin α between the Xa-Xn distance and the Xa-Xp distance;
the corresponding objective function is:
L = Σ_i [ ||f(Xa_i) - f(Xp_i)||^2 - ||f(Xa_i) - f(Xn_i)||^2 + α ]_+
where ||f(Xa) - f(Xp)||^2 denotes the Euclidean distance measurement between Xa and Xp, and ||f(Xa) - f(Xn)||^2 denotes the Euclidean distance measurement between Xa and Xn; distances are measured with the Euclidean metric; when the value inside [·]_+ is greater than zero, that value is taken as the loss, and when it is negative, the loss is zero.
6. The speaker recognition method based on Triplet-Loss according to claim 1, characterized in that step S6 performs speaker recognition as follows:
S6-1: obtaining the feature representations f(Xa), f(Xp), f(Xn) of the three groups of samples through the LSTM network;
S6-2: normalizing the obtained feature representations;
S6-3: optimizing the neural network through the Triplet-Loss function;
S6-4: comparing the Triplet-Loss metric with a preset threshold; if the metric is greater than the threshold, the speakers are the same person, otherwise they are different people.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810835179.0A CN109256139A (en) | 2018-07-26 | 2018-07-26 | A speaker recognition method based on Triplet-Loss
Publications (1)
Publication Number | Publication Date |
---|---|
CN109256139A true CN109256139A (en) | 2019-01-22 |
Family
ID=65049985
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810835179.0A Pending CN109256139A (en) | 2018-07-26 | 2018-07-26 | A speaker recognition method based on Triplet-Loss
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109256139A (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110390937A (en) * | 2019-06-10 | 2019-10-29 | 南京硅基智能科技有限公司 | A kind of across channel method for recognizing sound-groove based on ArcFace loss algorithm |
CN110570870A (en) * | 2019-09-20 | 2019-12-13 | 平安科技(深圳)有限公司 | Text-independent voiceprint recognition method, device and equipment |
CN110570871A (en) * | 2019-09-20 | 2019-12-13 | 平安科技(深圳)有限公司 | TristouNet-based voiceprint recognition method, device and equipment |
CN110838295A (en) * | 2019-11-17 | 2020-02-25 | 西北工业大学 | Model generation method, voiceprint recognition method and corresponding device |
CN111312259A (en) * | 2020-02-17 | 2020-06-19 | 厦门快商通科技股份有限公司 | Voiceprint recognition method, system, mobile terminal and storage medium |
CN111341304A (en) * | 2020-02-28 | 2020-06-26 | 广州国音智能科技有限公司 | Method, device and equipment for training speech characteristics of speaker based on GAN |
CN111418009A (en) * | 2019-10-31 | 2020-07-14 | 支付宝(杭州)信息技术有限公司 | Personalized speaker verification system and method |
WO2020156153A1 (en) * | 2019-01-29 | 2020-08-06 | 腾讯科技(深圳)有限公司 | Audio recognition method and system, and device |
CN112613481A (en) * | 2021-01-04 | 2021-04-06 | 上海明略人工智能(集团)有限公司 | Bearing abrasion early warning method and system based on frequency spectrum |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102637438A (en) * | 2012-03-23 | 2012-08-15 | 同济大学 | Voice filtering method |
US20170228641A1 (en) * | 2016-02-04 | 2017-08-10 | Nec Laboratories America, Inc. | Distance metric learning with n-pair loss |
CN107481736A (en) * | 2017-08-14 | 2017-12-15 | 广东工业大学 | A kind of vocal print identification authentication system and its certification and optimization method and system |
CN107731233A (en) * | 2017-11-03 | 2018-02-23 | 王华锋 | A kind of method for recognizing sound-groove based on RNN |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102637438A (en) * | 2012-03-23 | 2012-08-15 | 同济大学 | Voice filtering method |
US20170228641A1 (en) * | 2016-02-04 | 2017-08-10 | Nec Laboratories America, Inc. | Distance metric learning with n-pair loss |
CN107481736A (en) * | 2017-08-14 | 2017-12-15 | 广东工业大学 | A kind of vocal print identification authentication system and its certification and optimization method and system |
CN107731233A (en) * | 2017-11-03 | 2018-02-23 | 王华锋 | A kind of method for recognizing sound-groove based on RNN |
Non-Patent Citations (2)
Title |
---|
CHUNLEI ZHANG等: "END-TO-END TEXT-INDEPENDENT SPEAKER VERIFICATION WITH FLEXIBILITY IN UTTERANCE DURATION", 《2017 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU)》 * |
HERVÉ BREDIN: "TristouNet: Triplet loss for speaker turn embedding", 《2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP)》 * |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020156153A1 (en) * | 2019-01-29 | 2020-08-06 | 腾讯科技(深圳)有限公司 | Audio recognition method and system, and device |
CN110390937B (en) * | 2019-06-10 | 2021-12-24 | 南京硅基智能科技有限公司 | Cross-channel voiceprint recognition method based on ArcFace loss algorithm |
CN110390937A (en) * | 2019-06-10 | 2019-10-29 | 南京硅基智能科技有限公司 | A kind of across channel method for recognizing sound-groove based on ArcFace loss algorithm |
CN110570870A (en) * | 2019-09-20 | 2019-12-13 | 平安科技(深圳)有限公司 | Text-independent voiceprint recognition method, device and equipment |
CN110570871A (en) * | 2019-09-20 | 2019-12-13 | 平安科技(深圳)有限公司 | TristouNet-based voiceprint recognition method, device and equipment |
US11031018B2 (en) | 2019-10-31 | 2021-06-08 | Alipay (Hangzhou) Information Technology Co., Ltd. | System and method for personalized speaker verification |
CN111418009B (en) * | 2019-10-31 | 2023-09-05 | 支付宝(杭州)信息技术有限公司 | Personalized speaker verification system and method |
US11244689B2 (en) | 2019-10-31 | 2022-02-08 | Alipay (Hangzhou) Information Technology Co., Ltd. | System and method for determining voice characteristics |
CN111418009A (en) * | 2019-10-31 | 2020-07-14 | 支付宝(杭州)信息技术有限公司 | Personalized speaker verification system and method |
WO2020098828A3 (en) * | 2019-10-31 | 2020-09-03 | Alipay (Hangzhou) Information Technology Co., Ltd. | System and method for personalized speaker verification |
US10997980B2 (en) | 2019-10-31 | 2021-05-04 | Alipay (Hangzhou) Information Technology Co., Ltd. | System and method for determining voice characteristics |
CN110838295B (en) * | 2019-11-17 | 2021-11-23 | 西北工业大学 | Model generation method, voiceprint recognition method and corresponding device |
CN110838295A (en) * | 2019-11-17 | 2020-02-25 | 西北工业大学 | Model generation method, voiceprint recognition method and corresponding device |
CN111312259A (en) * | 2020-02-17 | 2020-06-19 | 厦门快商通科技股份有限公司 | Voiceprint recognition method, system, mobile terminal and storage medium |
CN111341304A (en) * | 2020-02-28 | 2020-06-26 | 广州国音智能科技有限公司 | Method, device and equipment for training speech characteristics of speaker based on GAN |
CN112613481A (en) * | 2021-01-04 | 2021-04-06 | 上海明略人工智能(集团)有限公司 | Bearing abrasion early warning method and system based on frequency spectrum |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109256139A (en) | A speaker recognition method based on Triplet-Loss | |
EP2763134B1 (en) | Method and apparatus for voice recognition | |
CN109215665A (en) | A kind of method for recognizing sound-groove based on 3D convolutional neural networks | |
CN102005070A (en) | Voice identification gate control system | |
CN102968990B (en) | Speaker identifying method and system | |
CN113823293B (en) | Speaker recognition method and system based on voice enhancement | |
CN111243617B (en) | Speech enhancement method for reducing MFCC feature distortion based on deep learning | |
CN101930733B (en) | Speech emotional characteristic extraction method for speech emotion recognition | |
CN111554302A (en) | Strategy adjusting method, device, terminal and storage medium based on voiceprint recognition | |
CN109473102A (en) | A kind of robot secretary intelligent meeting recording method and system | |
CN111508504B (en) | Speaker recognition method based on auditory center perception mechanism | |
CN110570871A (en) | TristouNet-based voiceprint recognition method, device and equipment | |
Charisma et al. | Speaker recognition using mel-frequency cepstrum coefficients and sum square error | |
CN112017658A (en) | Operation control system based on intelligent human-computer interaction | |
CN108172220A (en) | A kind of novel voice denoising method | |
Goh et al. | Robust computer voice recognition using improved MFCC algorithm | |
Maazouzi et al. | MFCC and similarity measurements for speaker identification systems | |
CN111105798B (en) | Equipment control method based on voice recognition | |
CN116312561A (en) | Method, system and device for voice print recognition, authentication, noise reduction and voice enhancement of personnel in power dispatching system | |
CN107993666B (en) | Speech recognition method, speech recognition device, computer equipment and readable storage medium | |
CN111862991A (en) | Method and system for identifying baby crying | |
Nijhawan et al. | A new design approach for speaker recognition using MFCC and VAD | |
Sukor et al. | Speaker identification system using MFCC procedure and noise reduction method | |
CN106971712A (en) | A kind of adaptive rapid voiceprint recognition methods and system | |
Khetri et al. | Automatic speech recognition for marathi isolated words |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | |
Application publication date: 20190122 |