CN109377984A - Speech recognition method and device based on ArcFace - Google Patents
Speech recognition method and device based on ArcFace
- Publication number
- CN109377984A CN109377984A CN201811400260.2A CN201811400260A CN109377984A CN 109377984 A CN109377984 A CN 109377984A CN 201811400260 A CN201811400260 A CN 201811400260A CN 109377984 A CN109377984 A CN 109377984A
- Authority
- CN
- China
- Prior art keywords
- default
- voice
- identified
- arcface
- vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/10—Speech classification or search using distance or distortion measures between unknown speech and reference templates
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L2015/0631—Creating reference templates; Clustering
Abstract
An embodiment of the present invention provides a speech recognition method and device based on ArcFace. The method comprises: obtaining speech to be recognized, and extracting low-level frame-level features of the speech to be recognized; extracting an identity feature vector from the low-level frame-level features; obtaining, from a preset speech library, a target identity feature vector similar to the identity feature vector, the preset speech library storing in advance a correspondence between preset identity feature vectors and preset identity information, the correspondence being obtained from a pre-trained preset model, and the preset model being trained with a preset loss function derived from the ArcFace-based algorithm expression; and determining, according to the correspondence, the target identity information corresponding to the target identity feature vector, and taking the target identity information as the recognition result of the speech to be recognized. The device executes the above method. The method and device provided in the embodiments of the present invention can accurately recognize various types of speech.
Description
Technical field
Embodiments of the present invention relate to the field of speech processing technology, and in particular to a speech recognition method and device based on ArcFace.
Background technique
With the explosive growth of digital audio data, identifying speakers through speech recognition technology has gradually received more and more attention.
The i-vector systems most widely used in speaker recognition today, whether based on GMM-UBM (Gaussian mixture model-universal background model) or on GSV-SVM (Gaussian supervector-support vector machine), are built on statistical modeling theory. They therefore require that training and test speech reach a certain length; otherwise, recognition accuracy drops sharply.
On the other hand, although ArcFace is widely used in the field of face recognition, ArcFace has not yet been applied to methods in the field of speech recognition.
Therefore, how to avoid the above drawbacks and accurately recognize various types of speech (including long utterances and short utterances) based on ArcFace has become a problem that needs to be solved.
Summary of the invention
In view of the problems in the existing technology, embodiments of the present invention provide a speech recognition method and device based on ArcFace.
In a first aspect, an embodiment of the present invention provides a speech recognition method based on ArcFace, the method comprising:
obtaining speech to be recognized, and extracting low-level frame-level features of the speech to be recognized;
extracting an identity feature vector from the low-level frame-level features;
obtaining, from a preset speech library, a target identity feature vector similar to the identity feature vector, the preset speech library storing in advance a correspondence between preset identity feature vectors and preset identity information, wherein the correspondence is obtained from a pre-trained preset model, and the preset model is trained with a preset loss function derived from the ArcFace-based algorithm expression; and
determining, according to the correspondence, the target identity information corresponding to the target identity feature vector, and taking the target identity information as the recognition result of the speech to be recognized.
In a second aspect, an embodiment of the present invention provides a speech recognition device based on ArcFace, the device comprising:
a first acquisition unit, configured to obtain speech to be recognized and extract low-level frame-level features of the speech to be recognized;
an extraction unit, configured to extract an identity feature vector from the low-level frame-level features;
a second acquisition unit, configured to obtain, from a preset speech library, a target identity feature vector similar to the identity feature vector, the preset speech library storing in advance a correspondence between preset identity feature vectors and preset identity information, wherein the correspondence is obtained from a pre-trained preset model, and the preset model is trained with a preset loss function derived from the ArcFace-based algorithm expression; and
a recognition unit, configured to determine, according to the correspondence, the target identity information corresponding to the target identity feature vector, and to take the target identity information as the recognition result of the speech to be recognized.
In a third aspect, an embodiment of the present invention provides an electronic device, comprising a processor, a memory and a bus, wherein:
the processor and the memory communicate with each other through the bus;
the memory stores program instructions executable by the processor, and the processor, by calling the program instructions, is able to execute the following method:
obtaining speech to be recognized, and extracting low-level frame-level features of the speech to be recognized;
extracting an identity feature vector from the low-level frame-level features;
obtaining, from a preset speech library, a target identity feature vector similar to the identity feature vector, the preset speech library storing in advance a correspondence between preset identity feature vectors and preset identity information, wherein the correspondence is obtained from a pre-trained preset model, and the preset model is trained with a preset loss function derived from the ArcFace-based algorithm expression; and
determining, according to the correspondence, the target identity information corresponding to the target identity feature vector, and taking the target identity information as the recognition result of the speech to be recognized.
In a fourth aspect, an embodiment of the present invention provides a non-transient computer-readable storage medium, wherein:
the non-transient computer-readable storage medium stores computer instructions that cause a computer to execute the following method:
obtaining speech to be recognized, and extracting low-level frame-level features of the speech to be recognized;
extracting an identity feature vector from the low-level frame-level features;
obtaining, from a preset speech library, a target identity feature vector similar to the identity feature vector, the preset speech library storing in advance a correspondence between preset identity feature vectors and preset identity information, wherein the correspondence is obtained from a pre-trained preset model, and the preset model is trained with a preset loss function derived from the ArcFace-based algorithm expression; and
determining, according to the correspondence, the target identity information corresponding to the target identity feature vector, and taking the target identity information as the recognition result of the speech to be recognized.
The ArcFace-based speech recognition method and device provided in the embodiments of the present invention obtain, from the preset speech library, a target identity feature vector similar to the identity feature vector corresponding to the speech to be recognized, obtain the correspondence from the preset model trained in advance with the preset loss function derived from the ArcFace-based algorithm expression, and thereby obtain the target identity information, which is taken as the recognition result of the speech to be recognized. Various types of speech can thus be recognized accurately.
Brief description of the drawings
In order to explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below illustrate some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from them without creative effort.
Fig. 1 is a schematic flow diagram of the ArcFace-based speech recognition method according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of the ArcFace-based speech recognition device according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of the physical structure of the electronic device provided by an embodiment of the present invention.
Specific embodiment
In order to make the objectives, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below in conjunction with the accompanying drawings. Obviously, the described embodiments are a part of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
Fig. 1 is a schematic flow diagram of the ArcFace-based speech recognition method of an embodiment of the present invention. As shown in Fig. 1, the method provided by the embodiment comprises the following steps:
S101: obtaining speech to be recognized, and extracting low-level frame-level features of the speech to be recognized.
Specifically, the device obtains the speech to be recognized and extracts its low-level frame-level features. The device may be a server or the like that executes this method. Speech of the same speaker over different channels may be collected through equipment such as dynamic microphones, condenser microphones and micro-electro-mechanical microphones, so as to simulate real speech environments. Frame-level features of the speech to be recognized may be extracted with a frame length of 25 ms and a frame shift of 10 ms, and silence may be removed from these frame-level features using VAD (voice activity detection) to obtain the low-level frame-level features. The low-level frame-level features may be Fbank features, but are not specifically limited thereto.
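The 25 ms / 10 ms framing and filterbank step above can be sketched as follows. This is a minimal NumPy illustration rather than the patent's implementation: the 16 kHz sampling rate, 512-point FFT and 40 mel bands are assumptions, and the VAD silence-removal step is omitted.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def fbank(signal, sr=16000, frame_ms=25, hop_ms=10, n_fft=512, n_mels=40):
    """Log mel-filterbank (Fbank) features from 25 ms frames with a 10 ms shift."""
    frame_len = int(sr * frame_ms / 1000)            # 400 samples at 16 kHz
    hop = int(sr * hop_ms / 1000)                    # 160 samples
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)     # windowed frames
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # triangular mel filterbank spanning 0 .. sr/2
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, ce, hi = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, lo:ce] = (np.arange(lo, ce) - lo) / max(ce - lo, 1)
        fb[m - 1, ce:hi] = (hi - np.arange(ce, hi)) / max(hi - ce, 1)
    return np.log(power @ fb.T + 1e-10)              # (n_frames, n_mels)
```

One second of 16 kHz audio yields 98 frames of 40 log-filterbank energies, one row per 10 ms shift.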
S102: extracting an identity feature vector from the low-level frame-level features.
Specifically, the device extracts the identity feature vector from the low-level frame-level features. The identity feature vector can be understood as a feature vector that identifies the speaker. The low-level frame-level features may be input into an optimized GRU model, and the output of the optimized GRU model is taken as the identity feature vector. The GRU (Gated Recurrent Unit) is a variant of the LSTM. As a model for learning temporal features, it retains the LSTM's ability to handle long-range dependencies well while being simpler in structure and more efficient to compute. A convolutional layer may be introduced before the GRU layers to optimize the GRU model: while modeling spectral correlations, it reduces the feature dimensions in both the time and frequency domains and accelerates model computation.
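A minimal sketch of such an embedding network is given below, written from scratch in NumPy. The mean pooling over time and the L2 normalization of the output are assumptions (the patent does not specify the pooling), and the convolutional front end is omitted for brevity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell: update gate z, reset gate r, candidate state."""
    def __init__(self, in_dim, hid_dim, rng):
        scale = 0.1
        self.hid_dim = hid_dim
        self.Wz = rng.standard_normal((hid_dim, in_dim + hid_dim)) * scale
        self.Wr = rng.standard_normal((hid_dim, in_dim + hid_dim)) * scale
        self.Wh = rng.standard_normal((hid_dim, in_dim + hid_dim)) * scale

    def step(self, x, h):
        xh = np.concatenate([x, h])
        z = sigmoid(self.Wz @ xh)                      # how much to update
        r = sigmoid(self.Wr @ xh)                      # how much past state to expose
        cand = np.tanh(self.Wh @ np.concatenate([x, r * h]))
        return (1 - z) * h + z * cand

def speaker_embedding(frames, cell):
    """Run the GRU over the frame sequence and pool into one identity vector."""
    h = np.zeros(cell.hid_dim)
    states = []
    for x in frames:
        h = cell.step(x, h)
        states.append(h)
    v = np.mean(states, axis=0)          # temporal mean pooling (assumption)
    return v / np.linalg.norm(v)         # L2-normalize onto the hypersphere
```

The unit-norm output matches the ArcFace setup below, where embeddings are compared by angle on a hypersphere.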
S103: obtaining, from a preset speech library, a target identity feature vector similar to the identity feature vector, the preset speech library storing in advance a correspondence between preset identity feature vectors and preset identity information; wherein the correspondence is obtained from a pre-trained preset model, and the preset model is trained with a preset loss function derived from the ArcFace-based algorithm expression.
Specifically, the device obtains from the preset speech library a target identity feature vector similar to the identity feature vector; the preset speech library stores in advance the correspondence between preset identity feature vectors and preset identity information, the correspondence is obtained from the pre-trained preset model, and the preset model is trained with the preset loss function derived from the ArcFace-based algorithm expression. A nearest-neighbor classifier may be used: the Euclidean distance between the identity feature vector and each preset identity feature vector in the preset speech library is computed, and the preset identity feature vector with the smallest Euclidean distance is determined as the target identity feature vector. The preset identity information can be understood as the speaker corresponding to a preset identity feature vector; that is, by recognizing a preset identity feature vector, the preset model identifies which speaker that vector corresponds to.
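The nearest-neighbor step above can be sketched in a few lines; this is an illustration, with hypothetical identity names, not the patent's implementation.

```python
import numpy as np

def identify(query_vec, bank_vecs, bank_ids):
    """Nearest-neighbor match: return the enrolled identity whose preset
    identity feature vector has the smallest Euclidean distance to the query."""
    dists = np.linalg.norm(bank_vecs - query_vec, axis=1)
    k = int(np.argmin(dists))
    return bank_ids[k], float(dists[k])
```

For example, a query close to the first enrolled vector returns that speaker's identity along with the distance, which could be thresholded to reject unknown speakers.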
The embodiment of the present invention does not specifically limit the preset model. The ArcFace-based algorithm expression L3 can be obtained through the following steps:
For an input sample vector x_i and its corresponding label y_i (indicating which speaker it belongs to), the loss function L_1 is defined as follows:

$$L_1 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{f_{y_i}}}{\sum_{j=1}^{C}e^{f_j}}$$

where N is the number of samples in a training batch (i.e., the portion of the total number of training samples input per batch), C is the total number of sample classes (i.e., the number of speakers), f_{y_i} denotes the posterior probability that sample vector x_i belongs to its own class, and f_j denotes the posterior probability that sample vector x_i belongs to class j. f_j can be expressed as follows:

$$f_j = W_j^{T}x_i + b_j = \|W_j\|\,\|x_i\|\cos\theta_j + b_j$$

where W_j and b_j are the weight vector and bias of the fully connected layer, and θ_j is the angle between W_j and x_i.

To simplify the expression, b_j is set to 0 and ‖W_j‖ is set to 1 by L2 normalization, so that f_j is determined only by the sample vector x_i and the angle θ_j:

$$f_j = \|x_i\|\cos\theta_j$$

Applying L2 regularization to the features removes the radial variation of the features on the hypersphere. Setting ‖x_i‖ to a constant s, the loss function L_2 is expressed as:

$$L_2 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos\theta_{y_i}}}{\sum_{j=1}^{C}e^{s\cos\theta_j}}$$

Since this soft-margin loss function focuses on correct classification, it gives little consideration to misclassification. To solve this problem, an angular margin penalty m is added, i.e., m is introduced inside cos θ_{y_i} to tighten the constraint on the classification boundary, thereby obtaining the ArcFace algorithm expression L_3:

$$L_3 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)}+\sum_{j\neq y_i}e^{s\cos\theta_j}}$$

where θ_{y_i} belongs to the range [0, π − m].
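The expression L_3 can be sketched in code as follows — a NumPy illustration of the additive angular margin, not the patent's implementation. The scale s = 30 and margin m = 0.5 are the common ArcFace defaults, assumed here.

```python
import numpy as np

def arcface_loss(embeddings, weights, labels, s=30.0, m=0.5):
    """L3: softmax cross-entropy with an additive angular margin m on the
    true class, after L2-normalizing embeddings and class weight vectors."""
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    W = weights / np.linalg.norm(weights, axis=0, keepdims=True)
    cos = np.clip(E @ W, -1.0, 1.0)                    # cos(theta_j), shape (N, C)
    n = np.arange(len(labels))
    target = cos.copy()
    theta_y = np.arccos(cos[n, labels])
    target[n, labels] = np.cos(theta_y + m)            # margin on the true class only
    logits = s * target
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_p[n, labels].mean()
```

Because the margin shrinks the true-class logit, the loss with m > 0 is strictly larger than the plain normalized softmax (m = 0), which is what forces tighter angular clusters per speaker.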
The goal of speech recognition is to judge which speaker an unknown utterance belongs to. Assume that the posterior probability f_{y_i} of the class the utterance belongs to is greater than a preset threshold t, while the posterior probabilities f_j of the other classes are each less than t. This can be expressed as follows:

$$f_{y_i} > t,\qquad f_j < t \ \ (j \neq y_i)$$

In the classification process, f_{y_i} being less than or equal to t, or f_j being greater than or equal to t, is a misclassification, and the loss is defined as the difference between the two. For the former case, the loss L^+ is expressed as:

$$L^{+} = t - f_{y_i}$$

Similarly, the latter loss L^- is expressed as:

$$L^{-} = f_j - t$$

To express the misclassification loss as a whole, L^+ and L^- are fused, and the maximum boundary penalty term δ_y is introduced, giving the per-sample term δ_y(t − f_y). For all samples, the maximum margin constraint loss factor is:

$$C_{max\_mar} = \frac{1}{N}\sum_{i=1}^{N}\sum_{y=1}^{C}\delta_y\,(t - f_y)$$
In summary, the preset loss function L, i.e., the maximum marginal cosine distance loss function (hereinafter "MMCL"), is obtained based on ArcFace and defined as the weighted sum of L_3 and C_{max_mar}, expressed as follows:
L = L_3 + λC_{max_mar}
where λ is a weight coefficient whose value may be chosen as 0.1 to 10.
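The margin factor C_{max_mar} can be sketched as below. Since the patent's exact formula images are not reproduced here, the hinge form max(0, ·) — which keeps correctly classified samples from contributing negative loss — is an assumption; the full MMCL would then be L = L3 + lam * C_max_mar with L3 as in the ArcFace expression above.

```python
import numpy as np

def max_margin_loss_factor(posteriors, labels, t=0.5):
    """C_max_mar sketch: hinge penalties for f_{y_i} <= t (true-class
    posterior not confident enough, the delta_y = +1 case) and for
    f_j >= t with j != y_i (a wrong class too confident, delta_y = -1)."""
    N, C = posteriors.shape
    n = np.arange(N)
    l_plus = np.maximum(0.0, t - posteriors[n, labels])   # L+ = t - f_{y_i}
    others = posteriors.copy()
    others[n, labels] = t                                  # true class contributes no L-
    l_minus = np.maximum(0.0, others - t).sum(axis=1)      # L- = f_j - t
    return float((l_plus + l_minus).mean())
```

When every true class scores above t and every wrong class below it, the factor is zero; misclassified batches incur a loss that grows with the distance of each offending posterior from the threshold.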
It should be understood that the maximum margin constraint loss factor C_{max_mar} introduced in the embodiment of the present invention contains the maximum boundary penalty term δ_y: when the prediction result is correct (the f_{y_i} > t case in the expression for δ_y), δ_y = 1; when the prediction result is wrong (the f_j ≥ t case), δ_y = −1. This makes the preset loss function more discriminative with respect to the prediction result, so that the recognition result is more accurate.
S104: determining, according to the correspondence, the target identity information corresponding to the target identity feature vector, and taking the target identity information as the recognition result of the speech to be recognized.
Specifically, the device determines the target identity information corresponding to the target identity feature vector according to the correspondence, and takes the target identity information as the recognition result of the speech to be recognized. An example: a correspondence exists between preset identity feature vector A and preset identity information a, and the identity feature vector corresponding to the speech to be recognized is X. By comparing vector similarity, preset identity feature vector A is found to be the target identity feature vector similar to identity feature vector X, so preset identity information a is determined as the target identity information and taken as the recognition result of the speech to be recognized. Under four speech-length conditions of 2 s, 3 s, 5 s and 8 s, the EER of the MMCL of the embodiment of the present invention is compared with that of softmax and ArcFace, as shown in Table 1:
Table 1. Recognition performance (EER) of short-utterance speaker recognition methods under different durations
|         | 2s     | 3s     | 5s     | 8s     |
| softmax | 0.0643 | 0.0437 | 0.0363 | 0.0301 |
| ArcFace | 0.0602 | 0.0410 | 0.0307 | 0.0254 |
| MMCL    | 0.0538 | 0.0385 | 0.0272 | 0.0215 |
It can be seen that the MMCL of the embodiment of the present invention has the smallest EER at every duration, and can therefore recognize speech accurately.
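The EER index reported in Table 1 can be computed as in the following sketch, assuming higher scores indicate a more likely genuine (same-speaker) trial; the patent does not give an implementation.

```python
import numpy as np

def eer(genuine_scores, impostor_scores):
    """Equal error rate: the operating point where the false-accept rate
    (impostors scoring above threshold) equals the false-reject rate
    (genuine trials scoring below threshold)."""
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    best, best_gap = 1.0, np.inf
    for th in thresholds:
        far = np.mean(impostor_scores >= th)   # false accepts
        frr = np.mean(genuine_scores < th)     # false rejects
        if abs(far - frr) < best_gap:
            best_gap, best = abs(far - frr), (far + frr) / 2
    return best
```

Lower EER means genuine and impostor score distributions overlap less, which is what the table uses to rank softmax, ArcFace and MMCL.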
The ArcFace-based speech recognition method provided in the embodiment of the present invention obtains from the preset speech library a target identity feature vector similar to the identity feature vector corresponding to the speech to be recognized, obtains the correspondence from the preset model trained with the preset loss function derived from the ArcFace-based algorithm expression, and thereby obtains the target identity information, which is taken as the recognition result of the speech to be recognized. Various types of speech can thus be recognized accurately.
On the basis of the above embodiments, the preset loss function includes the maximum margin constraint loss factor, whose expression is:

$$C_{max\_mar} = \frac{1}{N}\sum_{i=1}^{N}\sum_{y=1}^{C}\delta_y\,(t - f_y)$$

where C_{max_mar} is the maximum margin constraint loss factor, N is the number of samples in a training batch, y is the sample class, C is the total number of sample classes, t is the preset threshold, f_{y_i} greater than the preset threshold denotes the posterior probability of the class the sample vector belongs to, and δ_y is the maximum boundary penalty term.
Specifically, the expression formula of the maximal margin constraint loss factor in device are as follows:
Wherein, Cmax_marLoss factor is constrained for the maximal margin, N is in batches trained sample set, y is sample class
Not, C be sample class sum, t be preset threshold,After the belonged to class of expression sample vector greater than the preset threshold
Test probability, δyFor maximum boundary item penalty.It can refer to above-described embodiment, repeat no more.
The ArcFace-based speech recognition method provided in the embodiment of the present invention, by training the preset model with a preset loss function that includes the maximum margin constraint loss factor, is further able to recognize various types of speech accurately.
On the basis of the above embodiments, the expression of δ_y is:

$$\delta_y = \begin{cases} 1, & f_{y_i} > t \\ -1, & f_j \geq t,\ j \neq y_i \end{cases}$$

where, when j ≠ y_i, f_j being less than the preset threshold denotes the posterior probability that the sample vector belongs to another class.
Specifically, in the device, δ_y takes the value 1 when f_{y_i} > t and the value −1 when f_j ≥ t for some j ≠ y_i; when j ≠ y_i, f_j being less than the preset threshold denotes the posterior probability that the sample vector belongs to another class. Reference may be made to the above embodiment; details are not repeated here.
The ArcFace-based speech recognition method provided in the embodiment of the present invention calculates the maximum boundary penalty term through the specific expression above, and is thus further able to recognize various types of speech accurately.
On the basis of the above embodiments, the expression of the preset loss function is:
L = L_3 + λC_{max_mar}
where L is the preset loss function, L_3 is the ArcFace-based algorithm expression, and λ is a weight coefficient whose value is 0.1 to 10.
Specifically, the expression of the preset loss function in the device is:
L = L_3 + λC_{max_mar}
where L is the preset loss function, L_3 is the ArcFace-based algorithm expression, and λ is a weight coefficient whose value is 0.1 to 10. Reference may be made to the above embodiment; details are not repeated here.
The ArcFace-based speech recognition method provided in the embodiment of the present invention calculates the preset loss function through the specific expression above, and is thus further able to recognize various types of speech accurately.
On the basis of the above embodiments, extracting the identity feature vector from the low-level frame-level features comprises:
inputting the low-level frame-level features into the optimized GRU model, and taking the output of the optimized GRU model as the identity feature vector.
Specifically, the device inputs the low-level frame-level features into the optimized GRU model and takes the output of the optimized GRU model as the identity feature vector. Reference may be made to the above embodiment; details are not repeated here.
The ArcFace-based speech recognition method provided in the embodiment of the present invention, by taking the output of the optimized GRU model as the identity feature vector, ensures that the method proceeds normally.
On the basis of the above embodiments, the optimized GRU model is a GRU model equipped with a convolutional layer.
Specifically, the optimized GRU model in the device is a GRU model equipped with a convolutional layer. Reference may be made to the above embodiment; details are not repeated here.
The ArcFace-based speech recognition method provided in the embodiment of the present invention, by choosing a GRU model equipped with a convolutional layer as the optimized GRU model, improves the computational efficiency of the GRU model and recognizes various types of speech more quickly.
On the basis of the above embodiments, the low-level frame-level features are Fbank features.
Specifically, the low-level frame-level features in the device are Fbank features. Reference may be made to the above embodiment; details are not repeated here.
The ArcFace-based speech recognition method provided in the embodiment of the present invention, by choosing Fbank features as the low-level frame-level features, ensures that the method proceeds normally.
Fig. 2 is a schematic structural diagram of the ArcFace-based speech recognition device of an embodiment of the present invention. As shown in Fig. 2, the embodiment provides a speech recognition device based on ArcFace, comprising a first acquisition unit 201, an extraction unit 202, a second acquisition unit 203 and a recognition unit 204, wherein:
the first acquisition unit 201 is configured to obtain speech to be recognized and extract low-level frame-level features of the speech to be recognized; the extraction unit 202 is configured to extract an identity feature vector from the low-level frame-level features; the second acquisition unit 203 is configured to obtain, from a preset speech library, a target identity feature vector similar to the identity feature vector, the preset speech library storing in advance a correspondence between preset identity feature vectors and preset identity information, wherein the correspondence is obtained from a pre-trained preset model, and the preset model is trained with a preset loss function derived from the ArcFace-based algorithm expression; and the recognition unit 204 is configured to determine, according to the correspondence, the target identity information corresponding to the target identity feature vector, and to take the target identity information as the recognition result of the speech to be recognized.
Specifically, the units operate as described above: the first acquisition unit 201 obtains the speech to be recognized and extracts its low-level frame-level features; the extraction unit 202 extracts the identity feature vector; the second acquisition unit 203 retrieves the target identity feature vector from the preset speech library; and the recognition unit 204 determines the target identity information and takes it as the recognition result of the speech to be recognized.
The ArcFace-based speech recognition device provided in the embodiment of the present invention obtains from the preset speech library a target identity feature vector similar to the identity feature vector corresponding to the speech to be recognized, obtains the correspondence from the preset model trained with the preset loss function derived from the ArcFace-based algorithm expression, and thereby obtains the target identity information, which is taken as the recognition result of the speech to be recognized. Various types of speech can thus be recognized accurately.
The ArcFace-based speech recognition device provided in the embodiment of the present invention may specifically be used to execute the processing flows of the above method embodiments; its functions are not described in detail here, and reference may be made to the detailed description of the above method embodiments.
Fig. 3 is a schematic structural diagram of an electronic device provided in an embodiment of the present invention. As shown in Fig. 3, the electronic device includes a processor (processor) 301, a memory (memory) 302 and a bus 303, where the processor 301 and the memory 302 communicate with each other through the bus 303.
The processor 301 calls program instructions in the memory 302 to execute the method provided by each method embodiment above, for example: obtaining a voice to be identified, and extracting low-level frame-level features of the voice to be identified; extracting an identity feature vector from the low-level frame-level features; obtaining, from a preset voice library, a target identity feature vector similar to the identity feature vector, where the preset voice library stores in advance a correspondence between preset identity feature vectors and preset identity information, the correspondence is obtained from a preset model trained in advance, and the preset model is trained with a preset loss function derived from the ArcFace-based algorithm expression; and determining, according to the correspondence, the target identity information corresponding to the target identity feature vector, and taking the target identity information as the recognition result of the voice to be identified.
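The lookup that closes this flow, comparing the extracted identity feature vector against the vectors enrolled in the preset voice library, can be sketched as a nearest-neighbour search. The similarity measure (cosine) and all names below are illustrative assumptions; the patent only states that a "similar" target identity feature vector is obtained, without fixing the metric.

```python
import numpy as np

def cosine_similarity(a, b):
    # assumed similarity measure between two identity feature vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(query, library):
    """Return the identity whose enrolled vector is most similar to `query`.

    `library` maps preset identity information -> preset identity feature
    vector, i.e. the correspondence stored in the preset voice library.
    """
    best_id, best_score = None, -2.0
    for identity, ref in library.items():
        score = cosine_similarity(query, ref)
        if score > best_score:
            best_id, best_score = identity, score
    return best_id, best_score

# toy enrolled library with two speakers (illustrative vectors only)
library = {
    "alice": np.array([0.9, 0.1, 0.0]),
    "bob":   np.array([0.0, 0.2, 0.9]),
}
who, score = identify(np.array([0.8, 0.2, 0.1]), library)
```

Here `identify` plays the role of the second acquisition unit plus the recognition unit: the returned identity is the recognition result of the voice to be identified.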
This embodiment discloses a computer program product. The computer program product includes a computer program stored on a non-transient computer-readable storage medium, and the computer program includes program instructions. When the program instructions are executed by a computer, the computer can carry out the method provided by each method embodiment above, for example: obtaining a voice to be identified, and extracting low-level frame-level features of the voice to be identified; extracting an identity feature vector from the low-level frame-level features; obtaining, from a preset voice library, a target identity feature vector similar to the identity feature vector, where the preset voice library stores in advance a correspondence between preset identity feature vectors and preset identity information, the correspondence is obtained from a preset model trained in advance, and the preset model is trained with a preset loss function derived from the ArcFace-based algorithm expression; and determining, according to the correspondence, the target identity information corresponding to the target identity feature vector, and taking the target identity information as the recognition result of the voice to be identified.
This embodiment provides a non-transient computer-readable storage medium. The non-transient computer-readable storage medium stores computer instructions, and the computer instructions cause the computer to execute the method provided by each method embodiment above, for example: obtaining a voice to be identified, and extracting low-level frame-level features of the voice to be identified; extracting an identity feature vector from the low-level frame-level features; obtaining, from a preset voice library, a target identity feature vector similar to the identity feature vector, where the preset voice library stores in advance a correspondence between preset identity feature vectors and preset identity information, the correspondence is obtained from a preset model trained in advance, and the preset model is trained with a preset loss function derived from the ArcFace-based algorithm expression; and determining, according to the correspondence, the target identity information corresponding to the target identity feature vector, and taking the target identity information as the recognition result of the voice to be identified.
Those of ordinary skill in the art will appreciate that all or part of the steps of the method embodiments above can be completed by hardware related to program instructions. The aforementioned program can be stored in a computer-readable storage medium; when the program is executed, the steps of the method embodiments above are carried out. The aforementioned storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk or an optical disk.
The embodiments of the electronic device described above are only schematic. The units described as separate components may or may not be physically separated, and components shown as units may or may not be physical units; they can be located in one place or distributed over multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment, and those of ordinary skill in the art can understand and implement this without creative labour.
Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be realized by means of software plus a necessary general hardware platform, or of course by hardware. Based on this understanding, the part of the technical solution above that in essence contributes to the existing technology can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as a ROM/RAM, magnetic disk or optical disk, and includes several instructions for causing a computer device (which can be a personal computer, a server, a network device or the like) to execute the method described in each embodiment or in certain parts of the embodiments.
Finally, it should be noted that the embodiments above are only intended to illustrate the technical solution of the embodiments of the present invention, not to limit it. Although the embodiments of the present invention have been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments can still be modified, or some or all of the technical features can be equivalently replaced; such modifications or replacements do not take the essence of the corresponding technical solution outside the scope of the technical solutions of the embodiments of the present invention.
Claims (10)
1. An ArcFace-based speech recognition method, comprising:
obtaining a voice to be identified, and extracting low-level frame-level features of the voice to be identified;
extracting an identity feature vector from the low-level frame-level features;
obtaining, from a preset voice library, a target identity feature vector similar to the identity feature vector, wherein the preset voice library stores in advance a correspondence between preset identity feature vectors and preset identity information, the correspondence is obtained from a preset model trained in advance, and the preset model is trained with a preset loss function derived from the ArcFace-based algorithm expression; and
determining, according to the correspondence, the target identity information corresponding to the target identity feature vector, and taking the target identity information as the recognition result of the voice to be identified.
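The "ArcFace-based algorithm expression" in claim 1 corresponds, in the speaker-recognition literature, to the additive angular margin softmax loss. The numpy sketch below shows that loss in its standard form; the scale `s` and margin `m` values are conventional defaults, not values taken from the patent.

```python
import numpy as np

def arcface_logits(embeddings, class_weights, labels, s=64.0, m=0.5):
    # L2-normalise embeddings and class-weight columns so the raw
    # logits are cosines of the angle between sample and class centre
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = class_weights / np.linalg.norm(class_weights, axis=0, keepdims=True)
    cos = e @ w                                    # (batch, n_classes)
    theta = np.arccos(np.clip(cos, -1.0, 1.0))
    rows = np.arange(len(labels))
    logits = cos.copy()
    # additive angular margin m, applied only to each sample's true class
    logits[rows, labels] = np.cos(theta[rows, labels] + m)
    return s * logits                              # scaled for softmax

def cross_entropy(logits, labels):
    # numerically stable softmax cross-entropy over the margin logits
    z = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return float(-np.mean(np.log(p[np.arange(len(labels)), labels])))

# two samples, two classes; with the margin the loss is strictly larger,
# which is what forces tighter angular clustering during training
emb = np.array([[1.0, 0.2], [0.1, 1.0]])
W = np.eye(2)
labels = np.array([0, 1])
with_margin = cross_entropy(arcface_logits(emb, W, labels, s=8.0, m=0.5), labels)
plain = cross_entropy(arcface_logits(emb, W, labels, s=8.0, m=0.0), labels)
```

Setting `m=0` recovers the plain normalised softmax, which is why the margin term is the part that tightens intra-class angles.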
2. The method according to claim 1, wherein the preset loss function includes a maximal-margin constraint loss factor, and the expression of the maximal-margin constraint loss factor is:
wherein C_max_mar is the maximal-margin constraint loss factor, N is the batch training sample set, y_i is the sample class, C is the total number of sample classes, t is a preset threshold, f_yi is the posterior probability, greater than the preset threshold, that the sample vector belongs to its class, and δ_y is the maximal-margin penalty term.
3. The method according to claim 2, wherein the expression of δ_y is:
wherein, when j ≠ y_i, f_j represents the posterior probability, less than the preset threshold, that the sample vector belongs to another class.
4. The method according to claim 2 or 3, wherein the expression of the preset loss function is:
L = L_3 + λ·C_max_mar
wherein L is the preset loss function, L_3 is the ArcFace-based algorithm expression, and λ is a weight coefficient with a value of 0.1 to 10.
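Claim 4 combines the ArcFace term with the maximal-margin constraint loss factor as L = L_3 + λ·C_max_mar. Since the patent's expressions for C_max_mar and δ_y are reproduced as images and are not available in this text, the sketch below substitutes a hypothetical hinge-style penalty matching the textual description (it penalises true-class posteriors f_yi that fall below the preset threshold t); this stand-in is an assumption, not the patented formula.

```python
import numpy as np

def max_margin_penalty(posteriors, labels, t=0.8):
    # hypothetical stand-in for C_max_mar: hinge penalty on samples
    # whose true-class posterior falls below the preset threshold t
    p_true = posteriors[np.arange(len(labels)), labels]
    return float(np.mean(np.maximum(0.0, t - p_true)))

def total_loss(arcface_term, posteriors, labels, lam=1.0, t=0.8):
    # claim 4: L = L_3 + lambda * C_max_mar, with lambda in [0.1, 10]
    return arcface_term + lam * max_margin_penalty(posteriors, labels, t)

# sample 0 is confidently correct (no penalty); sample 1 sits just
# below the threshold and contributes 0.8 - 0.7 = 0.1 to the hinge
p = np.array([[0.9, 0.1],
              [0.3, 0.7]])
lab = np.array([0, 1])
loss = total_loss(1.0, p, lab, lam=2.0, t=0.8)
```

The weight λ then controls how strongly the margin constraint pulls against the ArcFace term during training.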
5. The method according to claim 1, wherein extracting an identity feature vector from the low-level frame-level features comprises:
inputting the low-level frame-level features into an optimized GRU model, and taking the output of the optimized GRU model as the identity feature vector.
6. The method according to claim 5, wherein the optimized GRU model is a GRU model equipped with a convolutional layer.
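Claims 5 and 6 describe feeding the frame-level features into a GRU preceded by a convolutional layer, and taking its output as the identity feature vector. The sketch below is a toy, untrained stand-in (random weights, a depthwise 1-D convolution over time, one GRU layer whose final hidden state serves as the embedding); it illustrates only the data flow, not the patent's actual architecture or dimensions.

```python
import numpy as np

def conv1d(x, kernel):
    # x: (T, F) frame-level features; depthwise 1-D convolution over
    # time with edge padding, so the output keeps the same shape
    k = len(kernel)
    pad = np.pad(x, ((k // 2, k // 2), (0, 0)), mode="edge")
    return np.stack([np.tensordot(kernel, pad[t:t + k], axes=1)
                     for t in range(x.shape[0])])

class TinyGRU:
    # minimal single-layer GRU; weights are random and illustrative only
    def __init__(self, in_dim, hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.Wz, self.Wr, self.Wh = (rng.normal(0, 0.1, (hidden, in_dim + hidden))
                                     for _ in range(3))

    def __call__(self, x):
        h = np.zeros(self.Wz.shape[0])
        sig = lambda v: 1.0 / (1.0 + np.exp(-v))
        for frame in x:
            xh = np.concatenate([frame, h])
            z, r = sig(self.Wz @ xh), sig(self.Wr @ xh)
            h_tilde = np.tanh(self.Wh @ np.concatenate([frame, r * h]))
            h = (1 - z) * h + z * h_tilde
        return h  # final hidden state used as the identity feature vector

# 20 frames of 8-dim features -> smoothed by the conv layer -> 16-dim embedding
feats = np.linspace(0.0, 1.0, 160).reshape(20, 8)
smoothed = conv1d(feats, np.array([0.25, 0.5, 0.25]))
embedding = TinyGRU(8, 16)(smoothed)
```

In the patented design the convolutional front end presumably learns local spectral patterns before the GRU summarises them over time; here the fixed smoothing kernel merely stands in for that layer.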
7. The method according to claim 1, wherein the low-level frame-level features are Fbank features.
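Claim 7 names Fbank (log mel filterbank) features as the low-level frame-level features. A minimal sketch of computing them from a raw waveform follows; the frame length, hop, FFT size and number of mel bands are common illustrative choices, not values from the patent.

```python
import numpy as np

def mel(f):                         # Hz -> mel scale
    return 2595.0 * np.log10(1.0 + f / 700.0)

def inv_mel(m):                     # mel scale -> Hz
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def fbank(signal, sr=16000, n_fft=512, n_mels=8, frame=400, hop=160):
    # split into overlapping frames, apply a Hamming window, and take
    # the power spectrum of each frame
    n_frames = 1 + (len(signal) - frame) // hop
    win = np.hamming(frame)
    frames = np.stack([signal[i * hop:i * hop + frame] * win
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # triangular mel filterbank with band edges equally spaced in mel
    pts = inv_mel(np.linspace(mel(0.0), mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # log filterbank energies: one row of n_mels values per frame
    return np.log(power @ fb.T + 1e-10)

# one second of a 440 Hz tone -> (n_frames, n_mels) Fbank matrix
tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
features = fbank(tone)
```

Each row of the result is one low-level frame-level feature vector, which is exactly the input that claim 5 feeds into the optimized GRU model.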
8. An ArcFace-based speech recognition device, comprising:
a first acquisition unit, for obtaining a voice to be identified and extracting low-level frame-level features of the voice to be identified;
an extraction unit, for extracting an identity feature vector from the low-level frame-level features;
a second acquisition unit, for obtaining, from a preset voice library, a target identity feature vector similar to the identity feature vector, wherein the preset voice library stores in advance a correspondence between preset identity feature vectors and preset identity information, the correspondence is obtained from a preset model trained in advance, and the preset model is trained with a preset loss function derived from the ArcFace-based algorithm expression; and
a recognition unit, for determining, according to the correspondence, the target identity information corresponding to the target identity feature vector, and taking the target identity information as the recognition result of the voice to be identified.
9. An electronic device, comprising a processor, a memory and a bus, wherein the processor and the memory communicate with each other through the bus, the memory stores program instructions executable by the processor, and the processor calls the program instructions to execute the method according to any one of claims 1 to 7.
10. A non-transient computer-readable storage medium, wherein the non-transient computer-readable storage medium stores computer instructions, and the computer instructions cause the computer to execute the method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811400260.2A CN109377984B (en) | 2018-11-22 | 2018-11-22 | ArcFace-based voice recognition method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109377984A true CN109377984A (en) | 2019-02-22 |
CN109377984B CN109377984B (en) | 2022-05-03 |
Family
ID=65377103
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109377984B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110047468A (en) * | 2019-05-20 | 2019-07-23 | 北京达佳互联信息技术有限公司 | Audio recognition method, device and storage medium |
CN111582354A (en) * | 2020-04-30 | 2020-08-25 | 中国平安财产保险股份有限公司 | Picture identification method, device, equipment and storage medium |
CN112669827A (en) * | 2020-12-28 | 2021-04-16 | 清华大学 | Joint optimization method and system for automatic speech recognizer |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104732978A (en) * | 2015-03-12 | 2015-06-24 | 上海交通大学 | Text-dependent speaker recognition method based on joint deep learning |
CN105575394A (en) * | 2016-01-04 | 2016-05-11 | 北京时代瑞朗科技有限公司 | Voiceprint identification method based on global change space and deep learning hybrid modeling |
CN105632502A (en) * | 2015-12-10 | 2016-06-01 | 江西师范大学 | Weighted pairwise constraint metric learning algorithm-based speaker recognition method |
CN105931646A (en) * | 2016-04-29 | 2016-09-07 | 江西师范大学 | Speaker identification method base on simple direct tolerance learning algorithm |
CN106022380A (en) * | 2016-05-25 | 2016-10-12 | 中国科学院自动化研究所 | Individual identity identification method based on deep learning |
US20180197547A1 (en) * | 2017-01-10 | 2018-07-12 | Fujitsu Limited | Identity verification method and apparatus based on voiceprint |
US20180261236A1 (en) * | 2017-03-10 | 2018-09-13 | Baidu Online Network Technology (Beijing) Co., Ltd. | Speaker recognition method and apparatus, computer device and computer-readable medium |
Non-Patent Citations (1)
Title |
---|
CHEN, SHENG et al.: "MobileFaceNets: Efficient CNNs for Accurate Real-Time Face Verification on Mobile Devices", 13th Chinese Conference on Biometric Recognition (CCBR) *
Also Published As
Publication number | Publication date |
---|---|
CN109377984B (en) | 2022-05-03 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |