CN111951783B - Speaker recognition method based on phoneme filtering

Speaker recognition method based on phoneme filtering

Info

Publication number
CN111951783B
CN111951783B
Authority
CN
China
Prior art keywords
phoneme
voice
speech
speaker
recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010810083.6A
Other languages
Chinese (zh)
Other versions
CN111951783A
Inventor
陈仙红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology
Priority to CN202010810083.6A
Publication of CN111951783A
Application granted
Publication of CN111951783B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Characterised by the type of extracted parameters
    • G10L25/12 The extracted parameters being prediction coefficients
    • G10L25/18 The extracted parameters being spectral information of each sub-band
    • G10L25/24 The extracted parameters being the cepstrum
    • G10L25/27 Characterised by the analysis technique
    • G10L25/30 Analysis using neural networks
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Quality & Reliability (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a speaker recognition method based on phoneme filtering, belonging to the fields of voiceprint recognition, pattern recognition, and machine learning. To solve the problem that traditional speaker recognition technology does not consider the influence of speech content information, the method builds a phoneme filter for each phoneme of speech; before speaker recognition, the filter corresponding to the phoneme of each speech frame is selected to remove the content information, thereby reducing the influence of content information on speaker recognition and effectively improving recognition accuracy. The method comprises a model training stage and a testing stage: model training includes the steps of speech preprocessing, phoneme recognition, phoneme filtering, pooling, speaker recognition, and cross-entropy minimization; the testing stage includes the steps of speech preprocessing, phoneme recognition, phoneme filtering, pooling, and speaker recognition.

Description

Speaker recognition method based on phoneme filtering
Technical Field
The invention belongs to the technical fields of voiceprint recognition, pattern recognition and machine learning, and particularly relates to a speaker recognition method based on phoneme filtering.
Background
Speaker recognition refers to identifying a speaker's identity from the speaker-related information contained in speech. With the rapid development of information and communication technology, it is becoming increasingly important and widely used: for example, catching criminals over telephone channels, confirming identity from telephone recordings in court, tracking calls by voice, and voice-based anti-theft door unlocking. Speaker recognition technology can be applied to voice dialing, telephone banking, telephone shopping, database access, information services, voice e-mail, security control, remote computer login, and many other fields.
In 2011, Kenny proposed the i-vector speaker recognition method based on the Gaussian mixture model, which achieved the best performance at the time. With the large-scale application of deep neural networks, the d-vector speaker recognition method based on a deep neural network, proposed in 2014, has attracted increasing attention; compared with the traditional Gaussian mixture model, a deep neural network has stronger descriptive power and can better model very complex data distributions. In 2017, Snyder took the temporal information of speech into account and proposed the x-vector speaker recognition method based on a time-delay neural network; the x-vector is the current state of the art. In this pipeline, the speech data is first preprocessed: MFCC features are extracted, voice activity detection is performed, and silent segments are removed. The MFCC features of each frame are then fed into the time-delay neural network, the per-frame outputs of one utterance are pooled by averaging, and the speaker is recognized from the resulting mean vector. Although this approach achieves good results, it does not directly analyze or address the core difficulty of speaker recognition: speaker information is entangled with other information in the speech (e.g., noise, channel, content), and the principle of this entanglement is unknown. Therefore, when extracting speaker information, the uncertainty of the other factors, especially of the speech content, degrades system performance. Existing speaker recognition techniques do not address the effect of content mismatch on speaker recognition.
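For concreteness, the following is a minimal sketch of such an x-vector-style baseline in Python with PyTorch (neither of which the patent prescribes); the use of 1-D convolutions as a stand-in for the time-delay network and all layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Stand-in for the frame-level time-delay network; widths are assumptions.
frame_net = nn.Sequential(
    nn.Conv1d(23, 512, kernel_size=5, padding=2), nn.ReLU(),
    nn.Conv1d(512, 512, kernel_size=3, padding=1), nn.ReLU(),
)

def xvector_embedding(mfcc: torch.Tensor) -> torch.Tensor:
    """mfcc: (T, 23) MFCC frames of one utterance -> pooled utterance vector."""
    h = frame_net(mfcc.T.unsqueeze(0))  # (1, 512, T) per-frame outputs
    return h.mean(dim=2).squeeze(0)     # average over all frames of the utterance
```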
Disclosure of Invention
The invention aims to solve the problem that traditional speaker recognition technology does not consider the influence of speech content information, and provides a speaker recognition method based on phoneme filtering. The method builds a phoneme filter for each phoneme of speech; during speaker recognition, the filter corresponding to the phoneme of each speech frame is selected to remove the content information, thereby reducing the influence of content information on speaker recognition and effectively improving recognition accuracy.
The invention provides a speaker recognition method based on phoneme filtering, characterized by comprising a model training stage and a testing stage, as shown in Fig. 1. Model training includes the steps of speech preprocessing, phoneme recognition, phoneme filtering, pooling, speaker recognition, and cross-entropy minimization. The testing stage includes the steps of speech preprocessing, phoneme recognition, phoneme filtering, pooling, and speaker recognition. The method specifically comprises the following steps:
1) Model training stage; the method specifically comprises the following steps:
1-1) Speech preprocessing
Let the training speech dataset be $(x_i, z_i)$, $i = 1, \dots, I$, where $x_i$ is the $i$-th training utterance, $z_i$ is the speaker label corresponding to the $i$-th training utterance, and $I$ is the total number of training utterances. Frame $x_i$ and extract the Mel-cepstral feature of each frame, $x_t^i$ $(t = 1, \dots, T_i)$, where $x_t^i$ denotes the feature of the $t$-th frame of the $i$-th training utterance and $T_i$ denotes its total number of frames.
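As an illustration of this step, the sketch below extracts 23-dimensional Mel-cepstral features per frame; the patent does not name a toolkit, so librosa, the 16 kHz sampling rate, and the 25 ms / 10 ms framing are assumptions.

```python
import librosa
import numpy as np

def extract_mfcc(wav_path: str, n_mfcc: int = 23) -> np.ndarray:
    """Load one utterance and return per-frame features of shape (T_i, n_mfcc)."""
    signal, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(
        y=signal, sr=sr, n_mfcc=n_mfcc,
        n_fft=int(0.025 * sr),       # 25 ms analysis window (assumption)
        hop_length=int(0.010 * sr),  # 10 ms hop (assumption)
    )
    return mfcc.T  # row t is the feature x_t^i of frame t
```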
1-2) phoneme recognition
According to the Mel-cepstral features $x_t^i$ extracted in step 1-1), identify the phoneme of each frame of speech using a phoneme recognizer, obtaining $q_t^i \in \{1, 2, \dots, N\}$, where $q_t^i$ is the phoneme corresponding to the $t$-th frame of the $i$-th training utterance and $N$ is the total number of phonemes.
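The phoneme recognizer itself is an external component (an off-the-shelf recognizer is used in the embodiment below). The placeholder sketch here only fixes the interface that the later steps rely on; the random labels are purely illustrative and must be replaced by a real recognizer.

```python
import torch

def phoneme_recognizer(mfcc: torch.Tensor, n_phonemes: int = 39) -> torch.Tensor:
    """mfcc: (T, feat_dim) frames -> (T,) phoneme indices q_t in {0, ..., N-1}.
    Random stand-in for illustration only; a real system would call an
    actual phoneme recognizer here."""
    return torch.randint(0, n_phonemes, (mfcc.shape[0],))
```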
1-3) phoneme filtering
Construct a phoneme filter $f_n$ specific to phoneme $n$ $(n = 1, \dots, N)$. $f_n$ may be a deep neural network or another linear or nonlinear function, with parameters $\theta_n$. The input of the phoneme filter is the Mel-cepstral feature $x_t^i$ extracted in step 1-1); the output is the feature with the phoneme information filtered out, $y_t^i$. According to the phoneme $q_t^i$ obtained in step 1-2), if $q_t^i = n$, the phoneme filter $f_n$ corresponding to $x_t^i$ is selected, i.e.: $y_t^i = f_n(x_t^i; \theta_n)$.
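A minimal sketch of this step, assuming each $f_n$ is a small fully connected network (the patent allows any deep neural network or other linear or nonlinear function); the depth and hidden width are illustrative choices, and code indices are zero-based.

```python
import torch
import torch.nn as nn

class PhonemeFilterBank(nn.Module):
    """One filter f_n (with its own parameters theta_n) per phoneme."""
    def __init__(self, n_phonemes: int = 39, feat_dim: int = 23, hidden: int = 256):
        super().__init__()
        self.filters = nn.ModuleList([
            nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, feat_dim))
            for _ in range(n_phonemes)
        ])

    def forward(self, x: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        """x: (T, feat_dim) frame features; q: (T,) per-frame phoneme indices.
        Routes every frame through the filter of its own phoneme."""
        y = torch.zeros_like(x)
        for n, f_n in enumerate(self.filters):
            mask = q == n
            if mask.any():
                y[mask] = f_n(x[mask])  # y_t = f_n(x_t; theta_n) where q_t = n
        return y
```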
1-4) pooling
Pool the filtered features of all frames of each training utterance by taking their mean. For example, the mean of the filtered features of the $i$-th training utterance is: $y_i = \frac{1}{T_i} \sum_{t=1}^{T_i} y_t^i$.
1-5) speaker recognition
Construct a speaker recognition network $g$, which may be a deep neural network or another linear or nonlinear function, with parameter $\phi$. Its input is the mean $y_i$ of the filtered features of an utterance; its output is the probability of the utterance belonging to each speaker: $z'_i = g(y_i; \phi)$.
1-6) minimizing cross entropy
The objective function minimizes the cross entropy between the model-predicted speaker probabilities $z'_i$ and the labels $z_i$, i.e.: $\min_{\theta_1, \dots, \theta_N, \phi} \; L = -\sum_{i=1}^{I} \log z'_i(z_i)$, where $z'_i(z_i)$ denotes the predicted probability assigned to the true speaker $z_i$.
By minimizing this objective function, the parameters $\theta_n$ $(n = 1, \dots, N)$ of the phoneme filters $f_n$ and the parameter $\phi$ of the speaker recognition network $g$ are obtained through training.
The model training stage ends, yielding the phoneme filter $f_n$ corresponding to each phoneme and the speaker recognition network $g$.
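Putting steps 1-1) to 1-6) together, a training sketch might look as follows; the optimizer, learning rate, epoch count, and the assumed dataset format (per-utterance frame features, per-frame phoneme indices, integer speaker label) are illustrative assumptions rather than part of the patent.

```python
import torch
import torch.nn as nn

def train(filter_bank, speaker_net, dataset, n_epochs: int = 10, lr: float = 1e-3):
    """dataset yields (x, q, z): frames (T_i, 23), phoneme indices (T_i,),
    and a scalar long tensor holding the speaker label z_i."""
    criterion = nn.CrossEntropyLoss()  # step 1-6: cross entropy to minimize
    params = list(filter_bank.parameters()) + list(speaker_net.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    for _ in range(n_epochs):
        for x, q, z in dataset:
            y = filter_bank(x, q).mean(dim=0)     # steps 1-3 and 1-4
            logits = speaker_net(y).unsqueeze(0)  # step 1-5: scores for z'_i
            loss = criterion(logits, z.view(1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```

Here speaker_net plays the role of $g$; a plausible stand-in would be nn.Sequential(nn.Linear(23, 512), nn.ReLU(), nn.Linear(512, n_speakers)) for some assumed number of speakers n_speakers. Note that nn.CrossEntropyLoss applies the softmax internally, so the probabilities $z'_i$ correspond to the softmax of the logits.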
2) The testing stage specifically comprises the following steps:
2-1) Speech preprocessing
Frame the test speech $x$ and extract the Mel-cepstral feature of each frame, $x_t$ $(t = 1, \dots, T)$, where $x_t$ denotes the feature of the $t$-th frame of the test speech and $T$ denotes the total number of frames of the test speech.
2-2) phoneme recognition
According to the Mel-cepstral features $x_t$ extracted in step 2-1), identify the phoneme of each frame of speech using the phoneme recognizer used in step 1-2), obtaining $q_t \in \{1, 2, \dots, N\}$, where $q_t$ is the phoneme corresponding to the $t$-th frame of the test speech and $N$ is the total number of phonemes.
2-3) phoneme filtering
According to the phoneme $q_t$ obtained in step 2-2), if $q_t = n$, select the phoneme filter $f_n$ trained in the model training stage as the filter for $x_t$. The filtered feature of the $t$-th frame of the test speech is: $y_t = f_n(x_t; \theta_n)$.
2-4) pooling
Pool the filtered features of all frames of the test speech by taking their mean, i.e.: $y = \frac{1}{T} \sum_{t=1}^{T} y_t$.
2-5) speaker recognition
Using the speaker recognition network $g$ trained in the model training stage, recognize the speaker corresponding to the test speech, obtaining the probability of the speech belonging to each speaker: $z' = g(y; \phi)$.
This completes speaker recognition for the test speech.
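A matching inference sketch for the test stage (steps 2-1 to 2-5), reusing the hypothetical extract_mfcc, phoneme_recognizer, filter bank, and speaker network from the sketches above:

```python
import torch

@torch.no_grad()
def identify_speaker(wav_path: str, filter_bank, speaker_net) -> int:
    x = torch.from_numpy(extract_mfcc(wav_path)).float()  # step 2-1: (T, 23)
    q = phoneme_recognizer(x)                             # step 2-2: (T,)
    y = filter_bank(x, q).mean(dim=0)                     # steps 2-3 and 2-4
    probs = torch.softmax(speaker_net(y), dim=-1)         # step 2-5: z' = g(y; phi)
    return int(probs.argmax())                            # most probable speaker
```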
The invention has the characteristics and beneficial effects that:
Compared with existing speaker recognition technology, the invention reduces the influence of speech content information on speaker recognition. Speech mainly carries content information; speaker information is weak by comparison and easily submerged in the content information, making it hard to recognize. The invention constructs a filter for each phoneme and filters out the phoneme information before speaker recognition, reducing the influence of speech content information on speaker recognition and improving recognition accuracy.
Drawings
Fig. 1 is a flow chart of the method of the present invention.
Detailed Description
The invention provides a speaker recognition method based on phoneme filtering, comprising a model training stage and a testing stage, as shown in Fig. 1. Model training includes the steps of speech preprocessing, phoneme recognition, phoneme filtering, pooling, speaker recognition, and cross-entropy minimization. The testing stage includes the steps of speech preprocessing, phoneme recognition, phoneme filtering, pooling, and speaker recognition. A specific embodiment is described in further detail below.
1) Model training stage; the method specifically comprises the following steps:
1-1) Speech preprocessing
Let the training speech dataset be $(x_i, z_i)$ $(i = 1, \dots, I)$, where $x_i$ is the $i$-th training utterance, $z_i$ is its speaker label, and $I$ is the total number of training utterances. Frame $x_i$ and extract the Mel-cepstral feature of each frame, $x_t^i$ $(t = 1, \dots, T_i)$, where $T_i$ is the total number of frames of the $i$-th training utterance. In this embodiment, the number of training utterances is $I = 8000$, the Mel-cepstral feature of each frame is 23-dimensional, all training utterances have the same length, and each has $T_i = 300$ frames.
1-2) phoneme recognition
According to the Mel-cepstral features $x_t^i$ extracted in step 1-1), identify the phoneme of each frame of speech using a phoneme recognizer, obtaining $q_t^i \in \{1, 2, \dots, N\}$, where $q_t^i$ is the phoneme corresponding to the $t$-th frame of the $i$-th training utterance and $N$ is the total number of phonemes. In this embodiment, the phoneme recognizer is the open-source phoneme recognizer from Brno University of Technology, and the total number of phonemes is $N = 39$. This recognizer yields the phoneme corresponding to each frame of each utterance.
1-3) phoneme filtering
Construct a phoneme filter $f_n$ specific to phoneme $n$ $(n = 1, \dots, N)$. $f_n$ may be a deep neural network or another linear or nonlinear function, with parameters $\theta_n$. The input of the phoneme filter is the Mel-cepstral feature $x_t^i$ extracted in step 1-1); the output is the feature with the phoneme information filtered out, $y_t^i$. According to the phoneme $q_t^i$ obtained in step 1-2), if $q_t^i = n$, the phoneme filter $f_n$ is selected, i.e.: $y_t^i = f_n(x_t^i; \theta_n)$. In this embodiment the total number of phonemes is $N = 39$, so 39 phoneme filters $f_n$ $(n = 1, \dots, 39)$ are constructed. Each phoneme filter is a 5-layer deep neural network with parameters $\theta_n$; the $n$-th filter filters out the $n$-th phoneme. For example, frame 125 of the 5th utterance belongs to the 13th phoneme, i.e. $q_{125}^5 = 13$, so the corresponding filter $f_{13}$ is selected: $y_{125}^5 = f_{13}(x_{125}^5; \theta_{13})$.
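The following sketch mirrors this embodiment's configuration: 39 filters, each a 5-layer fully connected network over 23-dimensional features (the hidden width of 256 is an assumption, since the patent does not give one). Code indices are zero-based, so phoneme 13 is routed through filters[12].

```python
import torch.nn as nn

def make_phoneme_filter(feat_dim: int = 23, hidden: int = 256) -> nn.Sequential:
    layers, dim = [], feat_dim
    for _ in range(4):                       # four hidden layers ...
        layers += [nn.Linear(dim, hidden), nn.ReLU()]
        dim = hidden
    layers.append(nn.Linear(dim, feat_dim))  # ... plus an output layer: 5 in total
    return nn.Sequential(*layers)

filters = nn.ModuleList([make_phoneme_filter() for _ in range(39)])
# Frame 125 of utterance 5 was recognized as phoneme 13, so it is passed
# through filters[12], i.e. y^5_125 = f_13(x^5_125; theta_13).
```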
1-4) pooling
Pool the filtered features of all frames of each training utterance by taking their mean. In this embodiment, the mean of the filtered features of the $i$-th training utterance is: $y_i = \frac{1}{300} \sum_{t=1}^{300} y_t^i$.
1-5) speaker recognition
Construct a speaker recognition network $g$, which may be a deep neural network or another linear or nonlinear function, with parameter $\phi$. Its input is the mean $y_i$ of the filtered features of an utterance; its output is the probability of the utterance belonging to each speaker: $z'_i = g(y_i; \phi)$. In this embodiment, the speaker recognition network is an 8-layer deep neural network.
1-6) minimizing cross entropy
In this embodiment, the objective function minimizes the cross entropy between the model-predicted speaker probabilities $z'_i$ and the labels $z_i$, i.e.: $\min_{\theta_1, \dots, \theta_{39}, \phi} \; L = -\sum_{i=1}^{8000} \log z'_i(z_i)$.
By minimizing this objective function, the parameters $\theta_n$ $(n = 1, \dots, 39)$ of the phoneme filters $f_n$ and the parameter $\phi$ of the speaker recognition network $g$ are obtained through training.
The model training stage ends, yielding the 39 phoneme filters $f_n$ $(n = 1, \dots, 39)$ and the speaker recognition network $g$.
2) A testing stage; the method specifically comprises the following steps:
2-1) Speech preprocessing
Frame the test speech $x$ and extract the Mel-cepstral feature of each frame, $x_t$ $(t = 1, \dots, T)$, where $x_t$ denotes the feature of the $t$-th frame of the test speech and $T$ denotes its total number of frames. In this embodiment, $T = 328$.
2-2) phoneme recognition
According to the Mel-cepstral features $x_t$ extracted in step 2-1), identify the phoneme of each frame of speech using the phoneme recognizer used in step 1-2), obtaining $q_t \in \{1, 2, \dots, 39\}$, where $q_t$ is the phoneme corresponding to the $t$-th frame of the test speech and 39 is the total number of phonemes.
2-3) phoneme filtering
According to the phoneme $q_t$ obtained in step 2-2), if $q_t = n$, select the phoneme filter $f_n$ trained in the model training stage as the filter for $x_t$. The filtered feature of the $t$-th frame of the test speech is: $y_t = f_n(x_t; \theta_n)$.
2-4) pooling
Pool the filtered features of all frames of the test speech by taking their mean, i.e.: $y = \frac{1}{328} \sum_{t=1}^{328} y_t$.
2-5) speaker recognition
Using the speaker recognition network $g$ trained in the model training stage, recognize the speaker corresponding to the test speech, obtaining the probability of the speech belonging to each speaker: $z' = g(y; \phi)$.
This completes speaker recognition for the test speech.
It will be appreciated by those skilled in the art that the method of the present invention can be implemented by a program, and the program can be stored in a computer readable storage medium.
The foregoing describes only one embodiment of the present invention and is not intended to limit its scope in any way; equivalent variations fall within the scope of the appended claims.

Claims (1)

1. A speaker recognition method based on phoneme filtering, characterized by comprising a model training stage and a testing stage, wherein the model training stage comprises the steps of speech preprocessing, phoneme recognition, phoneme filtering, pooling, speaker recognition, and cross-entropy minimization, and the testing stage comprises the steps of speech preprocessing, phoneme recognition, phoneme filtering, pooling, and speaker recognition;
the model training stage specifically comprises the following steps:
1-1) Speech preprocessing
let the training speech dataset be $(x_i, z_i)$ $(i = 1, \dots, I)$, where $x_i$ is the $i$-th training utterance and $z_i$ is the speaker label corresponding to the $i$-th training utterance; frame $x_i$ and extract the Mel-cepstral feature of each frame, $x_t^i$ $(t = 1, \dots, T_i)$, where $x_t^i$ denotes the feature of the $t$-th frame of the $i$-th training utterance and $T_i$ denotes the total number of frames of the $i$-th training utterance;
1-2) phoneme recognition
according to the Mel-cepstral features $x_t^i$ extracted in step 1-1), identify the phoneme of each frame of speech using a phoneme recognizer, obtaining $q_t^i \in \{1, 2, \dots, N\}$, where $q_t^i$ is the phoneme corresponding to the $t$-th frame of the $i$-th training utterance and $N$ is the total number of phonemes;
1-3) phoneme filtering
construct a phoneme filter $f_n$ specific to phoneme $n$ $(n = 1, \dots, N)$, where $f_n$ may be a deep neural network or another linear or nonlinear function, with parameters $\theta_n$; the input of the phoneme filter is the Mel-cepstral feature $x_t^i$ extracted in step 1-1) and the output is the feature with the phoneme information filtered out, $y_t^i$; according to the phoneme $q_t^i$ obtained in step 1-2), if $q_t^i = n$, the phoneme filter $f_n$ is selected, i.e.: $y_t^i = f_n(x_t^i; \theta_n)$;
1-4) pooling
pool the filtered features of all frames of each training utterance to obtain the mean of that utterance's filtered features, where the mean of the filtered features of the $i$-th training utterance is: $y_i = \frac{1}{T_i} \sum_{t=1}^{T_i} y_t^i$;
1-5) speaker recognition
construct a speaker recognition network $g$, which may be a deep neural network or another linear or nonlinear function, with parameter $\phi$; its input is the mean $y_i$ of the filtered features of an utterance and its output is the probability of the utterance belonging to each speaker: $z'_i = g(y_i; \phi)$;
1-6) minimizing cross entropy
the objective function minimizes the cross entropy between the model-predicted speaker probabilities $z'_i$ and the labels $z_i$, i.e.: $\min_{\theta_1, \dots, \theta_N, \phi} \; L = -\sum_{i=1}^{I} \log z'_i(z_i)$;
by minimizing this objective function, the parameters $\theta_n$ $(n = 1, \dots, N)$ of the phoneme filters $f_n$ and the parameter $\phi$ of the speaker recognition network $g$ are obtained through training;
the model training stage ends, yielding the phoneme filter $f_n$ corresponding to each phoneme and the speaker recognition network $g$;
the testing stage specifically comprises the following steps:
2-1) Speech preprocessing
frame the test speech $x$ and extract the Mel-cepstral feature of each frame, $x_t$ $(t = 1, \dots, T)$, where $x_t$ denotes the feature of the $t$-th frame of the test speech and $T$ denotes the total number of frames of the test speech;
2-2) phoneme recognition
according to the Mel-cepstral features $x_t$ extracted in step 2-1), identify the phoneme of each frame of speech using the phoneme recognizer used in step 1-2), obtaining $q_t \in \{1, 2, \dots, N\}$, where $q_t$ is the phoneme corresponding to the $t$-th frame of the test speech and $N$ is the total number of phonemes;
2-3) phoneme filtering
according to the phoneme $q_t$ obtained in step 2-2), if $q_t = n$, select the phoneme filter $f_n$ trained in the model training stage as the filter for $x_t$; the filtered feature of the $t$-th frame of the test speech is: $y_t = f_n(x_t; \theta_n)$;
2-4) pooling
pool the filtered features of all frames of the test speech to obtain their mean, i.e.: $y = \frac{1}{T} \sum_{t=1}^{T} y_t$;
2-5) speaker recognition
recognize the speaker corresponding to the test speech using the network $g$ trained in the model training stage, obtaining the probability of the speech belonging to each speaker: $z' = g(y; \phi)$;
thereby completing speaker recognition for the test speech.
CN202010810083.6A, filed 2020-08-12 (priority 2020-08-12): Speaker recognition method based on phoneme filtering. Status: Active. Granted as CN111951783B.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010810083.6A CN111951783B (en) 2020-08-12 2020-08-12 Speaker recognition method based on phoneme filtering


Publications (2)

Publication Number Publication Date
CN111951783A 2020-11-17
CN111951783B 2023-08-18

Family

ID=73332504

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010810083.6A Active CN111951783B (en) 2020-08-12 2020-08-12 Speaker recognition method based on phoneme filtering

Country Status (1)

Country Link
CN (1) CN111951783B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5131043A (en) * 1983-09-05 1992-07-14 Matsushita Electric Industrial Co., Ltd. Method of and apparatus for speech recognition wherein decisions are made based on phonemes
AU2004237046A1 (en) * 2003-05-02 2004-11-18 Giritech A/S Pervasive, user-centric network security enabled by dynamic datagram switch and an on-demand authentication and encryption scheme through mobile intelligent data carriers
CN1991976A (en) * 2005-12-31 2007-07-04 潘建强 Phoneme based voice recognition method and system
CN107369440A (en) * 2017-08-02 2017-11-21 北京灵伴未来科技有限公司 The training method and device of a kind of Speaker Identification model for phrase sound
CN108172214A (en) * 2017-12-27 2018-06-15 安徽建筑大学 A kind of small echo speech recognition features parameter extracting method based on Mel domains
CN108564956A (en) * 2018-03-26 2018-09-21 京北方信息技术股份有限公司 A kind of method for recognizing sound-groove and device, server, storage medium
CN109119069A (en) * 2018-07-23 2019-01-01 深圳大学 Specific crowd recognition methods, electronic device and computer readable storage medium
CN110600018A (en) * 2019-09-05 2019-12-20 腾讯科技(深圳)有限公司 Voice recognition method and device and neural network training method and device

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5131043A (en) * 1983-09-05 1992-07-14 Matsushita Electric Industrial Co., Ltd. Method of and apparatus for speech recognition wherein decisions are made based on phonemes
AU2004237046A1 (en) * 2003-05-02 2004-11-18 Giritech A/S Pervasive, user-centric network security enabled by dynamic datagram switch and an on-demand authentication and encryption scheme through mobile intelligent data carriers
CN1991976A (en) * 2005-12-31 2007-07-04 潘建强 Phoneme based voice recognition method and system
CN107369440A (en) * 2017-08-02 2017-11-21 北京灵伴未来科技有限公司 The training method and device of a kind of Speaker Identification model for phrase sound
CN108172214A (en) * 2017-12-27 2018-06-15 安徽建筑大学 A kind of small echo speech recognition features parameter extracting method based on Mel domains
CN108564956A (en) * 2018-03-26 2018-09-21 京北方信息技术股份有限公司 A kind of method for recognizing sound-groove and device, server, storage medium
CN109119069A (en) * 2018-07-23 2019-01-01 深圳大学 Specific crowd recognition methods, electronic device and computer readable storage medium
CN110600018A (en) * 2019-09-05 2019-12-20 腾讯科技(深圳)有限公司 Voice recognition method and device and neural network training method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Speaker recognition method based on characteristic phonemes; 王昌龙; 周福才; 凌裕平; 於锋; Chinese Journal of Scientific Instrument, No. 10; full text *

Also Published As

Publication number Publication date
CN111951783A (en) 2020-11-17


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant