CN109256139A - A speaker recognition method based on Triplet-Loss - Google Patents
A speaker recognition method based on Triplet-Loss
- Publication number: CN109256139A
- Application number: CN201810835179.0A
- Authority
- CN
- China
- Prior art keywords
- neural network
- loss
- voice signal
- triplet
- voice
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G — PHYSICS
- G10 — MUSICAL INSTRUMENTS; ACOUSTICS
- G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00 — Speaker identification or verification techniques
- G10L17/18 — Artificial neural networks; Connectionist approaches
- G10L17/02 — Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
Abstract
The present invention relates to a speaker recognition method based on Triplet-Loss, comprising the following steps. S1: obtain voice signals comprising three groups of samples: a voice sequence of one speaker, another voice sequence of the same speaker, and a voice sequence of a different speaker. S2: pre-process the voice signals to remove the channel noise introduced during voice acquisition. S3: extract speech feature parameters from the denoised voice signals. S4: construct an RNN based on an LSTM network. S5: use 90% of the extracted triplets of speech feature parameters as the input for training the RNN. S6: after the RNN is trained, perform speaker recognition with the remaining 10% of the triplets as input. The present invention offers high accuracy, good recognition performance, and high reliability.
Description
Technical field
The present invention relates to the technical fields of neural networks and deep learning, and more particularly to a speaker recognition method based on Triplet-Loss.
Background technique
As information security problems grow more serious, their impact keeps increasing, and the problem of protecting personal privacy urgently needs solving. How to determine a person's identity accurately and safely has drawn wide attention. Voice, as a key interface of human-computer interaction, plays an important role in identity authentication. Voiceprint recognition, also known as speaker recognition, uses the voiceprint, a biometric unique to each speaker, as a new tool that overcomes the shortcomings of conventional authentication methods. Compared with other approaches, speech carrying voiceprint features is convenient and natural to obtain: voiceprint extraction can be completed without the user's conscious effort, so user acceptance is high. Acquiring speech is also inexpensive, requiring only a simple microphone, and no additional recording equipment at all when a communication device is used. Voiceprint recognition is well suited to remote identity confirmation: a single microphone, telephone, or mobile phone suffices to achieve remote login over a network (a telecommunication network or the Internet).
Early voiceprint recognition methods based on signal processing compute signal parameters of the voice data with signal processing techniques and then apply template matching, statistical variance analysis, and the like. Such methods are extremely sensitive to the voice data; their accuracy is very low, and the recognition performance is far from satisfactory.
Recognition methods based on Gaussian mixture models achieve reasonable results and are simple and flexible, but they demand very large amounts of voice data, are highly sensitive to channel and environmental noise, and cannot meet the requirements of real scenarios.
Existing methods based on deep neural networks do not consider the context-dependent nature of the voice signal; the extracted features cannot represent the speaker well and fail to exploit the full advantages of deep learning.
Summary of the invention
The object of the present invention is to overcome the deficiencies of the prior art and provide a speaker recognition method based on Triplet-Loss with high accuracy, good recognition performance, and high reliability.
To achieve the above object, the technical solution provided by the present invention is as follows:
A speaker recognition method based on Triplet-Loss, comprising the following steps:
S1: obtain voice signals comprising three groups of samples: a voice sequence Xa of one speaker, another voice sequence Xp of the same speaker, and a voice sequence Xn of a different speaker;
S2: pre-process the voice signals to remove the channel noise introduced during voice acquisition;
S3: extract speech feature parameters from the denoised voice signals;
S4: construct an RNN based on an LSTM network;
S5: use 90% of the triplets of speech feature parameters extracted in step S3 as the input for training the RNN;
S6: after the RNN is trained, perform speaker recognition with the remaining 10% of the triplets as the input of the RNN.
Further, step S2 denoises the voice signal by spectral subtraction, with the following specific steps:
S2-1: filter the voice signal;
S2-2: pre-emphasize the filtered signal, divide it into frames, and apply a Hamming window to each frame;
S2-3: apply the fast Fourier transform (FFT) to the windowed signal, compute the power spectrum of each frame, and then the average noise power;
S2-4: estimate the noise with VAD to monitor silent segments, then update the noise spectrum with recursive smoothing;
S2-5: perform the spectral subtraction to obtain the estimated power spectrum of the clean speech;
S2-6: reinsert the phase spectrum to compute the speech spectrum, then apply the inverse FFT to recover each speech frame;
S2-7: reassemble the speech frames and de-emphasize the result to obtain the denoised voice signal.
Further, step S3 extracts acoustic feature parameters from the denoised voice signal as follows:
S3-1: pre-emphasize the three groups of denoised voice signals, divide them into frames, and multiply each frame by a Hamming window;
S3-2: apply the FFT to each frame to obtain the energy distribution over the spectrum;
S3-3: pass the power spectrum through a bank of mel-scale triangular filters and compute the log energy output by each filter;
S3-4: obtain the output feature parameters through the discrete cosine transform.
Further, step S4 constructs the RNN by taking an LSTM network and adding a normalization layer and a Triplet-Loss layer after the LSTM feature output layer.
Further, through learning, the Triplet-Loss layer makes the distance between the feature representations of Xa and Xp as small as possible and the distance between the feature representations of Xa and Xn as large as possible, while enforcing a minimum margin α between the Xa-Xn distance and the Xa-Xp distance.
The corresponding objective function is:
L = Σ_i [ ||f(Xa_i) - f(Xp_i)||^2 - ||f(Xa_i) - f(Xn_i)||^2 + α ]_+
where ||f(Xa) - f(Xp)||^2 denotes the Euclidean distance measurement between Xa and Xp, and ||f(Xa) - f(Xn)||^2 denotes the Euclidean distance measurement between Xa and Xn. Distances are measured here with the Euclidean metric; when the value inside [·]_+ is greater than zero, that value is taken as the loss, and when it is negative, the loss is zero.
Further, step S6 performs speaker recognition as follows:
S6-1: obtain the feature representations f(Xa), f(Xp), f(Xn) of the three groups of samples through the LSTM network;
S6-2: normalize the obtained feature representations;
S6-3: optimize the neural network through the Triplet-Loss function;
S6-4: compare the Triplet-Loss metric with a preset threshold; if the metric is greater than the threshold, the speakers are the same person, otherwise they are different people.
Compared with the prior art, the principles and advantages of this solution are as follows:
1. The voice signal is pre-processed with spectral subtraction, which, compared with other methods, introduces the fewest constraints, has the most direct physical meaning, and requires little computation, thereby effectively improving recognition accuracy.
2. The model is trained with Triplet-Loss (the triplet loss function): the joint constraint of inter-class and intra-class losses drives the back-propagation optimization, so that samples of the same class lie as close as possible in feature space while samples of different classes lie as far apart as possible, improving the discriminability of the model and thus the reliability and accuracy of recognition.
Detailed description of the invention
Fig. 1 is a flowchart of the speaker recognition method based on Triplet-Loss of the present invention;
Fig. 2 is a flowchart of the spectral subtraction in the present invention;
Fig. 3 is a flowchart of the speech feature parameter extraction in the present invention.
Specific embodiment
The present invention is further described below with reference to specific embodiments:
Referring to Fig. 1, the speaker recognition method based on Triplet-Loss described in this embodiment includes the following steps:
S1: obtain voice signals comprising three groups of samples: a voice sequence Xa of one speaker, another voice sequence Xp of the same speaker, and a voice sequence Xn of a different speaker.
S2: pre-process the voice signals. Voice acquisition introduces considerable channel noise, which makes the recognition task harder, so the input voice data are first denoised by spectral subtraction: the noise spectrum estimate is subtracted from the noisy speech estimate to obtain the spectrum of the clean speech. What is removed here is channel noise, i.e., the noise caused by the recording equipment; while removing it, all the information related to the speaker is fully preserved.
As shown in Fig. 2, the voice signal is denoised by spectral subtraction as follows:
S2-1: filter the voice signal;
S2-2: pre-emphasize the filtered signal, divide it into frames, and apply a Hamming window to each frame.
Specifically, windowing is a necessary step in signal processing, because a computer can only handle signals of finite length: the original signal X(t) must be truncated over a sampling period T into a finite segment XT(t) before further processing, and this truncation is the windowing operation. In practice a rectangular window is commonly used, but it cuts the signal off abruptly at its edges, and all time-domain information outside the window disappears, which adds spurious frequency components in the frequency domain — a phenomenon known as spectral leakage. The main measure against leakage introduced by windowing is a sensible window function. The Hamming window is one such window: its main part is shaped like sin(x) over the interval from 0 to π, and the remainder is essentially zero, so that any function f multiplied by it is nonzero only over a limited region.
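The Hamming window described above can be sketched in NumPy (a minimal illustration; the 400-sample frame length is an assumed value corresponding to 25 ms at 16 kHz, not fixed by this passage):

```python
import numpy as np

def hamming(n: int) -> np.ndarray:
    # Standard Hamming window: w[k] = 0.54 - 0.46*cos(2*pi*k/(n-1)).
    # It tapers to ~0.08 at the edges instead of cutting off abruptly,
    # which reduces spectral leakage compared with a rectangular window.
    k = np.arange(n)
    return 0.54 - 0.46 * np.cos(2 * np.pi * k / (n - 1))

frame = np.ones(400)             # one hypothetical 25 ms frame at 16 kHz
windowed = frame * hamming(400)  # edges attenuated, center left near 1.0
```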
S2-3: apply the FFT to the windowed signal, compute the power spectrum of each frame, and then the average noise power;
S2-4: estimate the noise with VAD (Voice Activity Detection, i.e., speech endpoint detection) to monitor silent segments, then update the noise spectrum with recursive smoothing;
S2-5: perform the spectral subtraction to obtain the estimated power spectrum of the clean speech;
S2-6: reinsert the phase spectrum to compute the speech spectrum, then apply the inverse FFT to recover each speech frame;
S2-7: reassemble the speech frames and de-emphasize the result to obtain the denoised voice signal.
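Steps S2-3 through S2-6 can be illustrated for a single frame as follows (a simplified sketch: the FFT size and spectral floor are assumptions, pre-emphasis/de-emphasis is omitted, and the VAD-based noise tracking of step S2-4 is replaced by a noise power spectrum passed in directly):

```python
import numpy as np

def spectral_subtract(noisy: np.ndarray, noise_psd: np.ndarray,
                      n_fft: int = 512, floor: float = 1e-3) -> np.ndarray:
    """One-frame spectral subtraction: subtract the noise power spectrum
    estimate from the noisy power spectrum, keep the noisy phase, and
    invert back to the time domain."""
    spec = np.fft.rfft(noisy, n_fft)
    power = np.abs(spec) ** 2
    # Subtract the noise estimate; a small spectral floor avoids
    # negative power and heavy musical noise.
    clean_power = np.maximum(power - noise_psd, floor * power)
    # Step S2-6: reinsert the (noisy) phase spectrum, then inverse FFT.
    clean_spec = np.sqrt(clean_power) * np.exp(1j * np.angle(spec))
    return np.fft.irfft(clean_spec, n_fft)
```

With a zero noise estimate the frame is reconstructed unchanged, which is a convenient sanity check on the FFT round trip.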
S3: as shown in Fig. 3, extract speech feature parameters from the denoised voice signals as follows:
S3-1: pre-emphasize the three groups of denoised voice signals, then divide each into frames with a frame length of 25 ms and a frame shift of 10 ms, and multiply each frame by a Hamming window;
S3-2: apply the FFT to each frame to obtain the energy distribution over the spectrum;
S3-3: pass the power spectrum through a bank of mel-scale triangular filters and compute the log energy output by each filter;
S3-4: obtain the output speech feature parameters through the discrete cosine transform.
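Steps S3-1 through S3-4 amount to a standard MFCC computation, which can be sketched as follows (the filter count, FFT size, sampling rate, and number of cepstral coefficients are assumed defaults not specified in the text):

```python
import numpy as np

def mel_filterbank(n_filters: int = 26, n_fft: int = 512,
                   sr: int = 16000) -> np.ndarray:
    # Triangular filters spaced evenly on the mel scale (step S3-3).
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    return fbank

def mfcc_frame(frame: np.ndarray, n_fft: int = 512,
               n_ceps: int = 13) -> np.ndarray:
    # Steps S3-2..S3-4: windowed FFT power spectrum -> mel filterbank
    # log-energies -> DCT-II, keeping the first n_ceps coefficients.
    power = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2
    log_e = np.log(mel_filterbank(n_fft=n_fft) @ power + 1e-10)
    n = len(log_e)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps),
                                  2 * np.arange(n) + 1) / (2 * n))
    return dct @ log_e
```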
S4: after the speech feature parameters are obtained, construct an RNN (recurrent neural network) by taking an LSTM (long short-term memory) network and adding a normalization layer and a Triplet-Loss layer after the LSTM feature output layer.
Through learning, the Triplet-Loss layer used here makes the distance between the feature representations of Xa and Xp as small as possible and the distance between the feature representations of Xa and Xn as large as possible, while enforcing a minimum margin α between the Xa-Xn distance and the Xa-Xp distance.
The corresponding objective function is:
L = Σ_i [ ||f(Xa_i) - f(Xp_i)||^2 - ||f(Xa_i) - f(Xn_i)||^2 + α ]_+
where ||f(Xa) - f(Xp)||^2 denotes the Euclidean distance measurement between Xa and Xp, and ||f(Xa) - f(Xn)||^2 denotes the Euclidean distance measurement between Xa and Xn. Distances are measured here with the Euclidean metric; when the value inside [·]_+ is greater than zero, that value is taken as the loss, and when it is negative, the loss is zero.
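The objective above is the standard triplet hinge loss, which can be written directly in NumPy for a single triplet (the margin value 0.2 is an assumption; the text leaves α unspecified):

```python
import numpy as np

def triplet_loss(fa: np.ndarray, fp: np.ndarray, fn: np.ndarray,
                 alpha: float = 0.2) -> float:
    """Triplet loss on embeddings: hinge on squared Euclidean distances,
    zero once the negative is at least alpha farther than the positive."""
    d_ap = float(np.sum((fa - fp) ** 2))  # ||f(Xa) - f(Xp)||^2
    d_an = float(np.sum((fa - fn) ** 2))  # ||f(Xa) - f(Xn)||^2
    return max(d_ap - d_an + alpha, 0.0)  # the [.]_+ operation
```

When the anchor-negative distance already exceeds the anchor-positive distance by the margin, the loss vanishes and that triplet no longer drives the optimization.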
S5: use 90% of the triplets of speech feature parameters extracted in step S3 as the input for training the RNN.
S6: after the RNN is trained, perform speaker recognition with the remaining 10% of the triplets as the input of the RNN. The recognition proceeds as follows:
S6-1: obtain the feature representations f(Xa), f(Xp), f(Xn) of the three groups of samples through the LSTM network;
S6-2: normalize the obtained feature representations;
S6-3: optimize the neural network through the Triplet-Loss function;
S6-4: compare the Triplet-Loss metric with a preset threshold; if the metric is greater than the threshold, the speakers are the same person, otherwise they are different people.
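The decision in step S6-4 can be sketched as a simple threshold test (an illustration only: the threshold value is an assumption, and this sketch adopts the distance convention — accepting a match when the distance between normalized embeddings is small — whereas the text phrases the comparison in terms of the loss metric):

```python
import numpy as np

def same_speaker(fa: np.ndarray, fx: np.ndarray,
                 threshold: float = 0.8) -> bool:
    # With L2-normalized embeddings the squared Euclidean distance lies
    # in [0, 4]; declare the same speaker when it falls below the
    # (hypothetical) threshold.
    return float(np.sum((fa - fx) ** 2)) < threshold
```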
In this embodiment the voice signal is pre-processed with spectral subtraction, which, compared with other methods, introduces the fewest constraints, has the most direct physical meaning, and requires little computation, thereby effectively improving recognition accuracy. In addition, this embodiment trains the model with Triplet-Loss (the triplet loss function): the joint constraint of inter-class and intra-class losses drives the back-propagation optimization, so that samples of the same class lie as close as possible in feature space while samples of different classes lie as far apart as possible, improving the discriminability of the model and thus the reliability and accuracy of recognition.
The embodiments described above are only preferred embodiments of the invention and do not limit its scope of implementation; all changes made according to the shapes and principles of the present invention shall fall within the scope of protection of the present invention.
Claims (6)
1. A speaker recognition method based on Triplet-Loss, characterized by comprising the following steps:
S1: obtaining voice signals comprising three groups of samples: a voice sequence Xa of one speaker, another voice sequence Xp of the same speaker, and a voice sequence Xn of a different speaker;
S2: pre-processing the voice signals to remove the channel noise introduced during voice acquisition;
S3: extracting speech feature parameters from the denoised voice signals;
S4: constructing an RNN based on an LSTM network;
S5: using 90% of the triplets of speech feature parameters extracted in step S3 as the input for training the RNN;
S6: after the RNN is trained, performing speaker recognition with the remaining 10% of the triplets as the input of the RNN.
2. The speaker recognition method based on Triplet-Loss according to claim 1, characterized in that step S2 denoises the voice signal by spectral subtraction, with the following specific steps:
S2-1: filtering the voice signal;
S2-2: pre-emphasizing the filtered signal, dividing it into frames, and applying a Hamming window to each frame;
S2-3: applying the fast Fourier transform to the windowed signal, computing the power spectrum of each frame, and then the average noise power;
S2-4: estimating the noise with VAD to monitor silent segments, then updating the noise spectrum with recursive smoothing;
S2-5: performing the spectral subtraction to obtain the estimated power spectrum of the clean speech;
S2-6: reinserting the phase spectrum to compute the speech spectrum, then applying the inverse fast Fourier transform to recover each speech frame;
S2-7: reassembling the speech frames and de-emphasizing the result to obtain the denoised voice signal.
3. The speaker recognition method based on Triplet-Loss according to claim 1, characterized in that step S3 extracts acoustic feature parameters from the denoised voice signal as follows:
S3-1: pre-emphasizing the three groups of denoised voice signals, dividing them into frames, and multiplying each frame by a Hamming window;
S3-2: applying the fast Fourier transform to each frame to obtain the energy distribution over the spectrum;
S3-3: passing the power spectrum through a bank of mel-scale triangular filters and computing the log energy output by each filter;
S3-4: obtaining the output feature parameters through the discrete cosine transform.
4. The speaker recognition method based on Triplet-Loss according to claim 1, characterized in that step S4 constructs the RNN by taking an LSTM network and adding a normalization layer and a Triplet-Loss layer after the LSTM feature output layer.
5. The speaker recognition method based on Triplet-Loss according to claim 4, characterized in that, through learning, the Triplet-Loss layer makes the distance between the feature representations of Xa and Xp as small as possible and the distance between the feature representations of Xa and Xn as large as possible, while enforcing a minimum margin α between the Xa-Xn distance and the Xa-Xp distance;
the corresponding objective function is:
L = Σ_i [ ||f(Xa_i) - f(Xp_i)||^2 - ||f(Xa_i) - f(Xn_i)||^2 + α ]_+
where ||f(Xa) - f(Xp)||^2 denotes the Euclidean distance measurement between Xa and Xp, and ||f(Xa) - f(Xn)||^2 denotes the Euclidean distance measurement between Xa and Xn; distances are measured with the Euclidean metric; when the value inside [·]_+ is greater than zero, that value is taken as the loss, and when it is negative, the loss is zero.
6. The speaker recognition method based on Triplet-Loss according to claim 1, characterized in that step S6 performs speaker recognition as follows:
S6-1: obtaining the feature representations f(Xa), f(Xp), f(Xn) of the three groups of samples through the LSTM network;
S6-2: normalizing the obtained feature representations;
S6-3: optimizing the neural network through the Triplet-Loss function;
S6-4: comparing the Triplet-Loss metric with a preset threshold; if the metric is greater than the threshold, the speakers are the same person, otherwise they are different people.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810835179.0A CN109256139A (en) | 2018-07-26 | 2018-07-26 | A speaker recognition method based on Triplet-Loss
Publications (1)
Publication Number | Publication Date |
---|---|
CN109256139A true CN109256139A (en) | 2019-01-22 |
Family
ID=65049985
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810835179.0A Pending CN109256139A (en) | 2018-07-26 | 2018-07-26 | A speaker recognition method based on Triplet-Loss
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109256139A (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110390937A (en) * | 2019-06-10 | 2019-10-29 | 南京硅基智能科技有限公司 | A kind of across channel method for recognizing sound-groove based on ArcFace loss algorithm |
CN110570870A (en) * | 2019-09-20 | 2019-12-13 | 平安科技(深圳)有限公司 | Text-independent voiceprint recognition method, device and equipment |
CN110570871A (en) * | 2019-09-20 | 2019-12-13 | 平安科技(深圳)有限公司 | TristouNet-based voiceprint recognition method, device and equipment |
CN110838295A (en) * | 2019-11-17 | 2020-02-25 | 西北工业大学 | Model generation method, voiceprint recognition method and corresponding device |
CN111312259A (en) * | 2020-02-17 | 2020-06-19 | 厦门快商通科技股份有限公司 | Voiceprint recognition method, system, mobile terminal and storage medium |
CN111341304A (en) * | 2020-02-28 | 2020-06-26 | 广州国音智能科技有限公司 | Method, device and equipment for training speech characteristics of speaker based on GAN |
CN111418009A (en) * | 2019-10-31 | 2020-07-14 | 支付宝(杭州)信息技术有限公司 | Personalized speaker verification system and method |
WO2020156153A1 (en) * | 2019-01-29 | 2020-08-06 | 腾讯科技(深圳)有限公司 | Audio recognition method and system, and device |
CN112613481A (en) * | 2021-01-04 | 2021-04-06 | 上海明略人工智能(集团)有限公司 | Bearing abrasion early warning method and system based on frequency spectrum |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102637438A (en) * | 2012-03-23 | 2012-08-15 | 同济大学 | Voice filtering method |
US20170228641A1 (en) * | 2016-02-04 | 2017-08-10 | Nec Laboratories America, Inc. | Distance metric learning with n-pair loss |
CN107481736A (en) * | 2017-08-14 | 2017-12-15 | 广东工业大学 | A kind of vocal print identification authentication system and its certification and optimization method and system |
CN107731233A (en) * | 2017-11-03 | 2018-02-23 | 王华锋 | A kind of method for recognizing sound-groove based on RNN |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102637438A (en) * | 2012-03-23 | 2012-08-15 | 同济大学 | Voice filtering method |
US20170228641A1 (en) * | 2016-02-04 | 2017-08-10 | Nec Laboratories America, Inc. | Distance metric learning with n-pair loss |
CN107481736A (en) * | 2017-08-14 | 2017-12-15 | 广东工业大学 | A kind of vocal print identification authentication system and its certification and optimization method and system |
CN107731233A (en) * | 2017-11-03 | 2018-02-23 | 王华锋 | A kind of method for recognizing sound-groove based on RNN |
Non-Patent Citations (2)
Title |
---|
CHUNLEI ZHANG等: "END-TO-END TEXT-INDEPENDENT SPEAKER VERIFICATION WITH FLEXIBILITY IN UTTERANCE DURATION", 《2017 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU)》 * |
HERVÉ BREDIN: "TristouNet: Triplet loss for speaker turn embedding", 《2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP)》 * |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020156153A1 (en) * | 2019-01-29 | 2020-08-06 | 腾讯科技(深圳)有限公司 | Audio recognition method and system, and device |
CN110390937B (en) * | 2019-06-10 | 2021-12-24 | 南京硅基智能科技有限公司 | Cross-channel voiceprint recognition method based on ArcFace loss algorithm |
CN110390937A (en) * | 2019-06-10 | 2019-10-29 | 南京硅基智能科技有限公司 | A kind of across channel method for recognizing sound-groove based on ArcFace loss algorithm |
CN110570870A (en) * | 2019-09-20 | 2019-12-13 | 平安科技(深圳)有限公司 | Text-independent voiceprint recognition method, device and equipment |
CN110570871A (en) * | 2019-09-20 | 2019-12-13 | 平安科技(深圳)有限公司 | TristouNet-based voiceprint recognition method, device and equipment |
US11031018B2 (en) | 2019-10-31 | 2021-06-08 | Alipay (Hangzhou) Information Technology Co., Ltd. | System and method for personalized speaker verification |
CN111418009B (en) * | 2019-10-31 | 2023-09-05 | 支付宝(杭州)信息技术有限公司 | Personalized speaker verification system and method |
US11244689B2 (en) | 2019-10-31 | 2022-02-08 | Alipay (Hangzhou) Information Technology Co., Ltd. | System and method for determining voice characteristics |
CN111418009A (en) * | 2019-10-31 | 2020-07-14 | 支付宝(杭州)信息技术有限公司 | Personalized speaker verification system and method |
WO2020098828A3 (en) * | 2019-10-31 | 2020-09-03 | Alipay (Hangzhou) Information Technology Co., Ltd. | System and method for personalized speaker verification |
US10997980B2 (en) | 2019-10-31 | 2021-05-04 | Alipay (Hangzhou) Information Technology Co., Ltd. | System and method for determining voice characteristics |
CN110838295B (en) * | 2019-11-17 | 2021-11-23 | 西北工业大学 | Model generation method, voiceprint recognition method and corresponding device |
CN110838295A (en) * | 2019-11-17 | 2020-02-25 | 西北工业大学 | Model generation method, voiceprint recognition method and corresponding device |
CN111312259A (en) * | 2020-02-17 | 2020-06-19 | 厦门快商通科技股份有限公司 | Voiceprint recognition method, system, mobile terminal and storage medium |
CN111341304A (en) * | 2020-02-28 | 2020-06-26 | 广州国音智能科技有限公司 | Method, device and equipment for training speech characteristics of speaker based on GAN |
CN112613481A (en) * | 2021-01-04 | 2021-04-06 | 上海明略人工智能(集团)有限公司 | Bearing abrasion early warning method and system based on frequency spectrum |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109256139A (en) | A speaker recognition method based on Triplet-Loss | |
EP2763134B1 (en) | Method and apparatus for voice recognition | |
CN109215665A (en) | A kind of method for recognizing sound-groove based on 3D convolutional neural networks | |
CN102005070A (en) | Voice identification gate control system | |
CN102968990B (en) | Speaker identifying method and system | |
CN113823293B (en) | Speaker recognition method and system based on voice enhancement | |
CN111243617B (en) | Speech enhancement method for reducing MFCC feature distortion based on deep learning | |
CN101930733B (en) | Speech emotional characteristic extraction method for speech emotion recognition | |
CN111554302A (en) | Strategy adjusting method, device, terminal and storage medium based on voiceprint recognition | |
CN109473102A (en) | A kind of robot secretary intelligent meeting recording method and system | |
CN111508504B (en) | Speaker recognition method based on auditory center perception mechanism | |
CN110570871A (en) | TristouNet-based voiceprint recognition method, device and equipment | |
Charisma et al. | Speaker recognition using mel-frequency cepstrum coefficients and sum square error | |
CN112017658A (en) | Operation control system based on intelligent human-computer interaction | |
CN108172220A (en) | A kind of novel voice denoising method | |
Goh et al. | Robust computer voice recognition using improved MFCC algorithm | |
Maazouzi et al. | MFCC and similarity measurements for speaker identification systems | |
CN111105798B (en) | Equipment control method based on voice recognition | |
CN116312561A (en) | Method, system and device for voice print recognition, authentication, noise reduction and voice enhancement of personnel in power dispatching system | |
CN107993666B (en) | Speech recognition method, speech recognition device, computer equipment and readable storage medium | |
CN111862991A (en) | Method and system for identifying baby crying | |
Nijhawan et al. | A new design approach for speaker recognition using MFCC and VAD | |
Sukor et al. | Speaker identification system using MFCC procedure and noise reduction method | |
CN106971712A (en) | A kind of adaptive rapid voiceprint recognition methods and system | |
Khetri et al. | Automatic speech recognition for marathi isolated words |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | |
Application publication date: 20190122 |