CN109377984A - A kind of audio recognition method and device based on ArcFace - Google Patents


Info

Publication number
CN109377984A
CN109377984A (application CN201811400260.2A; granted as CN109377984B)
Authority
CN
China
Prior art keywords: default, voice, identified, arcface, vector
Prior art date
Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Application number
CN201811400260.2A
Other languages
Chinese (zh)
Other versions
CN109377984B (en)
Inventors
李鹏 (Li Peng)
吉瑞芳 (Ji Ruifang)
蔡新元 (Cai Xinyuan)
Current Assignee (the listed assignees may be inaccurate)
Beijing Wisdom And Technology Co Ltd
Original Assignee
Beijing Wisdom And Technology Co Ltd
Priority date (the priority date is an assumption and is not a legal conclusion)
Filing date
Publication date
Application filed by Beijing Wisdom And Technology Co Ltd filed Critical Beijing Wisdom And Technology Co Ltd
Priority to CN201811400260.2A priority Critical patent/CN109377984B/en
Publication of CN109377984A publication Critical patent/CN109377984A/en
Application granted granted Critical
Publication of CN109377984B publication Critical patent/CN109377984B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/08: Speech classification or search
    • G10L 15/10: Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G10L 15/06: Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063: Training
    • G10L 2015/0631: Creating reference templates; clustering


Abstract

An embodiment of the present invention provides an ArcFace-based speech recognition method and device. The method comprises: obtaining a voice to be identified and extracting low-level frame-level features of the voice to be identified; extracting an identity feature vector according to the low-level frame-level features; obtaining, from a preset voice library, a target identity feature vector similar to the identity feature vector, the preset voice library pre-storing correspondences between preset identity feature vectors and preset identity information, wherein the correspondences are obtained according to a preset model trained in advance, and the preset model is trained with a preset loss function obtained from the ArcFace-based algorithm expression; and determining, according to the correspondence, target identity information corresponding to the target identity feature vector, and taking the target identity information as the recognition result of the voice to be identified. The device performs the above method. The method and device provided by embodiments of the present invention can accurately identify various types of voices.

Description

A kind of audio recognition method and device based on ArcFace
Technical field
Embodiments of the present invention relate to the field of speech processing technology, and in particular to an ArcFace-based speech recognition method and device.
Background technique
With the explosive growth of digital audio data, identifying the speaker through speech recognition technology has gradually received more and more attention.
At present, the systems most widely used in speaker recognition, namely the i-vector system based on GMM-UBM (Gaussian mixture model-universal background model) and GSV-SVM (Gaussian mean supervector-support vector machine), are all built on statistical modeling theory. They therefore require the training and test speech to reach a certain length; otherwise, recognition accuracy drops sharply. On the other hand, although ArcFace is widely used in the field of face recognition, it has not yet been applied to the field of speech recognition.
Therefore, how to avoid the above drawbacks and accurately identify various types of voices (including long utterances and short utterances) based on ArcFace has become a problem that needs to be solved.
Summary of the invention
In view of the problems of the existing technology, the embodiment of the present invention provides a kind of audio recognition method based on ArcFace And device.
In a first aspect, an embodiment of the present invention provides an ArcFace-based speech recognition method, the method comprising:
obtaining a voice to be identified, and extracting low-level frame-level features of the voice to be identified;
extracting an identity feature vector according to the low-level frame-level features;
obtaining, from a preset voice library, a target identity feature vector similar to the identity feature vector, the preset voice library pre-storing correspondences between preset identity feature vectors and preset identity information, wherein the correspondences are obtained according to a preset model trained in advance, and the preset model is trained with a preset loss function obtained from the ArcFace-based algorithm expression;
determining, according to the correspondence, target identity information corresponding to the target identity feature vector, and taking the target identity information as the recognition result of the voice to be identified.
In a second aspect, an embodiment of the present invention provides an ArcFace-based speech recognition device, the device comprising:
a first acquisition unit, configured to obtain a voice to be identified and extract low-level frame-level features of the voice to be identified;
an extraction unit, configured to extract an identity feature vector according to the low-level frame-level features;
a second acquisition unit, configured to obtain, from a preset voice library, a target identity feature vector similar to the identity feature vector, the preset voice library pre-storing correspondences between preset identity feature vectors and preset identity information, wherein the correspondences are obtained according to a preset model trained in advance, and the preset model is trained with a preset loss function obtained from the ArcFace-based algorithm expression;
a recognition unit, configured to determine, according to the correspondence, target identity information corresponding to the target identity feature vector, and to take the target identity information as the recognition result of the voice to be identified.
In a third aspect, an embodiment of the present invention provides an electronic device, comprising a processor, a memory and a bus, wherein:
the processor and the memory communicate with each other through the bus;
the memory stores program instructions executable by the processor, and by calling the program instructions the processor is able to perform the following method:
obtaining a voice to be identified, and extracting low-level frame-level features of the voice to be identified;
extracting an identity feature vector according to the low-level frame-level features;
obtaining, from a preset voice library, a target identity feature vector similar to the identity feature vector, the preset voice library pre-storing correspondences between preset identity feature vectors and preset identity information, wherein the correspondences are obtained according to a preset model trained in advance, and the preset model is trained with a preset loss function obtained from the ArcFace-based algorithm expression;
determining, according to the correspondence, target identity information corresponding to the target identity feature vector, and taking the target identity information as the recognition result of the voice to be identified.
In a fourth aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium:
the non-transitory computer-readable storage medium stores computer instructions, and the computer instructions cause a computer to perform the following method:
obtaining a voice to be identified, and extracting low-level frame-level features of the voice to be identified;
extracting an identity feature vector according to the low-level frame-level features;
obtaining, from a preset voice library, a target identity feature vector similar to the identity feature vector, the preset voice library pre-storing correspondences between preset identity feature vectors and preset identity information, wherein the correspondences are obtained according to a preset model trained in advance, and the preset model is trained with a preset loss function obtained from the ArcFace-based algorithm expression;
determining, according to the correspondence, target identity information corresponding to the target identity feature vector, and taking the target identity information as the recognition result of the voice to be identified.
The ArcFace-based speech recognition method and device provided by embodiments of the present invention obtain, from the preset voice library, a target identity feature vector similar to the identity feature vector corresponding to the voice to be identified, obtain the correspondence according to a preset model trained in advance with a preset loss function derived from the ArcFace-based algorithm expression, thereby obtain the target identity information, and take the target identity information as the recognition result of the voice to be identified, so that various types of voices can be accurately identified.
Detailed description of the invention
To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a flow diagram of the ArcFace-based speech recognition method of an embodiment of the present invention;
Fig. 2 is a structural diagram of the ArcFace-based speech recognition device of an embodiment of the present invention;
Fig. 3 is a schematic diagram of the physical structure of the electronic device provided by an embodiment of the present invention.
Specific embodiment
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are a part, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
Fig. 1 is a flow diagram of the ArcFace-based speech recognition method of an embodiment of the present invention. As shown in Fig. 1, the ArcFace-based speech recognition method provided by the embodiment of the present invention comprises the following steps:
S101: obtain a voice to be identified, and extract low-level frame-level features of the voice to be identified.
Specifically, the device obtains the voice to be identified and extracts its low-level frame-level features. The device may be a server or the like that performs the method. The voice of the same speaker over different channels may be collected with equipment such as dynamic microphones, condenser microphones and micro-electro-mechanical (MEMS) microphones, so as to simulate real speech environments. Frame-level features of the voice to be identified may be extracted with a frame length of 25 ms and a frame shift of 10 ms, and silence removal may be applied to these frame-level features using VAD (voice activity detection) to obtain the low-level frame-level features. The low-level frame-level features may be Fbank features; this is not specifically limited.
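The framing step described above can be sketched as follows. This is a minimal NumPy illustration under the stated assumptions (16 kHz audio, 25 ms frames, 10 ms shift); the energy-based VAD is a toy stand-in for the silence removal the text mentions, and all function names are illustrative, not from the patent.

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Split a 1-D waveform into overlapping frames (25 ms length, 10 ms shift)."""
    frame_len = int(sample_rate * frame_ms / 1000)   # 400 samples at 16 kHz
    shift = int(sample_rate * shift_ms / 1000)       # 160 samples at 16 kHz
    n_frames = 1 + max(0, (len(signal) - frame_len) // shift)
    idx = np.arange(frame_len)[None, :] + shift * np.arange(n_frames)[:, None]
    return signal[idx]

def energy_vad(frames, threshold_ratio=0.1):
    """Keep frames whose energy exceeds a fraction of the mean energy (toy VAD)."""
    energy = (frames ** 2).sum(axis=1)
    return frames[energy > threshold_ratio * energy.mean()]

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)   # 1 s of fake audio at 16 kHz
frames = frame_signal(speech)
voiced = energy_vad(frames)
print(frames.shape)                   # (98, 400): 98 frames of 400 samples
```

In a real pipeline the surviving frames would then be mapped to Fbank (mel filterbank) features before entering the model.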
S102: extract an identity feature vector according to the low-level frame-level features.
Specifically, the device extracts the identity feature vector according to the low-level frame-level features. The identity feature vector can be understood as a feature vector used to identify the speaker. The low-level frame-level features may be input into an optimized GRU model, and the output of the optimized GRU model taken as the identity feature vector. The GRU (gated recurrent unit) is a variant of the LSTM; as a model for learning temporal features, it retains the LSTM's strength in handling long-range dependencies while having a simpler structure and being more efficient to compute. A convolutional layer may be introduced before the GRU layers to optimize the GRU model: while modeling spectral correlations, it reduces the feature dimensions in the time and frequency domains and accelerates model computation.
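To make the GRU structure named above concrete, here is a minimal single-step GRU cell in NumPy (update gate z, reset gate r, candidate state). The weights, dimensions and names are illustrative assumptions; the patent's optimized model also places a convolutional layer in front of the GRU, which is omitted here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU time step: returns the new hidden state."""
    z = sigmoid(x @ Wz + h @ Uz)             # update gate
    r = sigmoid(x @ Wr + h @ Ur)             # reset gate
    h_cand = np.tanh(x @ Wh + (r * h) @ Uh)  # candidate state
    return (1.0 - z) * h + z * h_cand

rng = np.random.default_rng(0)
d_in, d_h = 40, 8                            # e.g. a 40-dim Fbank frame, 8-dim state
params = [rng.standard_normal(s) * 0.1
          for s in [(d_in, d_h), (d_h, d_h)] * 3]   # Wz, Uz, Wr, Ur, Wh, Uh
h = np.zeros(d_h)
for _ in range(5):                           # run 5 frames through the cell
    h = gru_step(rng.standard_normal(d_in), h, *params)
print(h.shape)                               # (8,)
```

In an identity-vector extractor, the hidden state after the last frame (or a pooled state over all frames) would serve as the utterance-level embedding.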
S103: obtain, from a preset voice library, a target identity feature vector similar to the identity feature vector, the preset voice library pre-storing correspondences between preset identity feature vectors and preset identity information, wherein the correspondences are obtained according to a preset model trained in advance, and the preset model is trained with a preset loss function obtained from the ArcFace-based algorithm expression.
Specifically, the device obtains from the preset voice library a target identity feature vector similar to the identity feature vector. A nearest-neighbour classifier may be used: the Euclidean distance between the identity feature vector and each preset identity feature vector in the preset voice library is computed, and the preset identity feature vector with the smallest Euclidean distance is determined as the target identity feature vector. The preset identity information can be understood as the speaker corresponding to a preset identity feature vector; that is, by recognizing a preset identity feature vector, the preset model identifies which speaker that vector corresponds to. The embodiment of the present invention does not specifically limit the preset model. The ArcFace-based algorithm expression L3 can be obtained through the following steps.
For an input sample vector x_i with its label y_i (indicating which speaker it corresponds to), the loss function L1 is defined as:
L1 = -(1/N) Σ_{i=1..N} log( e^{f_{y_i}} / Σ_{j=1..C} e^{f_j} )
where N is the batch training sample set size (i.e., the part of the total training samples input to the device per batch), C is the total number of sample classes (i.e., the number of speakers), f_{y_i} denotes the posterior probability of the class sample vector x_i belongs to, and f_j denotes the posterior probability of sample vector x_i for each class, which can be expressed as:
f_j = W_j^T x_i + b_j = ||W_j|| ||x_i|| cos θ_j
where W_j and b_j are the weight vector and bias of the fully connected layer, and θ_j is the angle between W_j and x_i.
To simplify the expression, b_j is set to 0 and W_j is L2-normalized so that ||W_j|| = 1; f_j is then determined only by the sample vector x_i and the angle θ_j:
f_j = ||x_i|| cos θ_j
Applying L2 regularization to the features removes their radial variation on the hypersphere. Setting ||x_i|| to a constant s, the loss function L2 is expressed as:
L2 = -(1/N) Σ_{i=1..N} log( e^{s cos θ_{y_i}} / Σ_{j=1..C} e^{s cos θ_j} )
Since this soft-boundary loss function focuses on correct classification, it gives little consideration to misclassification. To solve this problem, an angular-margin loss factor m is added, i.e. m is introduced inside θ_{y_i}, adding a boundary constraint on the class boundary, which yields the ArcFace algorithm expression L3:
L3 = -(1/N) Σ_{i=1..N} log( e^{s cos(θ_{y_i} + m)} / ( e^{s cos(θ_{y_i} + m)} + Σ_{j≠y_i} e^{s cos θ_j} ) )
where θ_{y_i} lies in the range [0, π - m].
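The ArcFace expression L3 described above can be sketched numerically as follows. This is a NumPy illustration of the standard ArcFace loss (normalized embeddings and class weights, additive angular margin m on the target angle, scale s); the data, dimensions and function name are illustrative assumptions.

```python
import numpy as np

def arcface_loss(x, W, y, s=30.0, m=0.5):
    """x: (N, d) embeddings, W: (d, C) class weights, y: (N,) integer labels."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)   # ||x_i|| -> 1 (rescaled by s)
    W = W / np.linalg.norm(W, axis=0, keepdims=True)   # ||W_j|| -> 1
    cos = np.clip(x @ W, -1.0, 1.0)                    # cos(theta_j) for every class
    theta_y = np.arccos(cos[np.arange(len(y)), y])
    logits = s * cos
    logits[np.arange(len(y)), y] = s * np.cos(theta_y + m)   # margin on the target
    logits -= logits.max(axis=1, keepdims=True)        # numerically stable softmax
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(y)), y].mean()

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16))       # 4 sample embeddings
W = rng.standard_normal((16, 3))       # 3 speaker classes
y = np.array([0, 1, 2, 0])
print(arcface_loss(x, W, y))           # a positive scalar loss
```

Setting m = 0 recovers the plain scaled-softmax loss L2 from the derivation above.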
The goal of speech recognition is to judge which speaker an unknown voice belongs to. Suppose the posterior probability f_{y_i} of the class the voice belongs to is greater than a preset threshold t, while the posterior probabilities f_j of all the other classes are less than t; this can be expressed as:
f_{y_i} > t and f_j < t for all j ≠ y_i
In the classification process, f_{y_i} being less than or equal to t, or f_j being greater than or equal to t, is a misclassification, and the loss is defined as the difference between the two. For the former case, the loss L+ is expressed as:
L+ = t - f_{y_i}
Similarly, the loss for the latter case, L-, is expressed as:
L- = f_j - t
To express the misclassification loss as a whole, L+ and L- are fused, and a maximum-boundary penalty term δ_y is introduced.
Summing over all samples yields the maximal-margin constraint loss factor C_max_mar.
Altogether, the preset loss function L obtained on the basis of ArcFace, i.e. the maximum marginal cosine distance loss function (hereinafter "MMCL"), is defined as the weighted sum of L3 and C_max_mar:
L = L3 + λ C_max_mar
where λ is a weight coefficient whose value may be chosen in the range 0.1 to 10.
It should be understood that, because the maximal-margin constraint loss factor C_max_mar introduced in the embodiment of the present invention contains the maximum-boundary penalty term δ_y, δ_y = 1 when the prediction is correct (the case f_{y_i} > t and f_j < t for all j ≠ y_i in the expression of δ_y), and δ_y = -1 when the prediction is wrong (the remaining cases in the expression of δ_y). This makes the preset loss function more discriminative with respect to the prediction result, so that the recognition result is more accurate.
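The maximum-boundary penalty δ_y described above can be sketched as follows. This is an illustrative NumPy check of the correct/incorrect condition (+1 when the target posterior exceeds the threshold t while every other class stays below it, -1 otherwise); the function name and example posteriors are assumptions, and the full fused factor C_max_mar is not reproduced here because its exact expression is not recoverable from the text.

```python
import numpy as np

def delta_y(posteriors, y, t=0.5):
    """posteriors: (C,) class posteriors for one sample, y: true class index."""
    others = np.delete(posteriors, y)          # posteriors of all classes j != y
    correct = posteriors[y] > t and np.all(others < t)
    return 1 if correct else -1

p_good = np.array([0.8, 0.1, 0.1])   # confident, correct prediction for class 0
p_bad = np.array([0.2, 0.6, 0.2])    # mass on a wrong class
print(delta_y(p_good, 0), delta_y(p_bad, 0))   # 1 -1
```

Per the text, this sign would then weight the misclassification term inside C_max_mar, and the total loss would be L3 + λ·C_max_mar.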
S104: determine, according to the correspondence, target identity information corresponding to the target identity feature vector, and take the target identity information as the recognition result of the voice to be identified.
Specifically, the device determines, according to the correspondence, the target identity information corresponding to the target identity feature vector, and takes the target identity information as the recognition result of the voice to be identified. An example: a correspondence exists between preset identity feature vector A and preset identity information a, and the identity feature vector corresponding to the voice to be identified is X; vector-similarity comparison finds that preset identity feature vector A is the target identity feature vector similar to identity feature vector X, so preset identity information a is determined to be the target identity information and taken as the recognition result of the voice to be identified. Under voice lengths of 2 s, 3 s, 5 s and 8 s, the EER of the MMCL of the embodiment of the present invention compared with softmax and ArcFace is shown in Table 1:
Table 1: Recognition performance (EER) of short-utterance speaker recognition methods under different durations

Method    2 s      3 s      5 s      8 s
softmax   0.0643   0.0437   0.0363   0.0301
ArcFace   0.0602   0.0410   0.0307   0.0254
MMCL      0.0538   0.0385   0.0272   0.0215
It can be seen that the MMCL of the embodiment of the present invention has a smaller EER error and can accurately identify voices.
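The matching in steps S103 and S104 can be sketched as a nearest-neighbour search over a preset library of identity vectors using Euclidean distance, as the text describes. The library contents and names below are made-up, three-dimensional toy examples for illustration only.

```python
import numpy as np

def match_speaker(query, library):
    """Return the identity whose stored vector is closest (Euclidean) to query."""
    names = list(library)
    vecs = np.stack([library[n] for n in names])
    dists = np.linalg.norm(vecs - query, axis=1)   # distance to every preset vector
    return names[int(dists.argmin())]

library = {                          # preset identity vectors (toy 3-D examples)
    "speaker_a": np.array([1.0, 0.0, 0.0]),
    "speaker_b": np.array([0.0, 1.0, 0.0]),
}
query = np.array([0.9, 0.1, 0.0])    # identity vector of the voice to identify
print(match_speaker(query, library))  # speaker_a
```

The matched key plays the role of the "preset identity information" returned as the recognition result.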
The ArcFace-based speech recognition method provided by the embodiment of the present invention obtains, from the preset voice library, a target identity feature vector similar to the identity feature vector corresponding to the voice to be identified, obtains the correspondence according to a preset model trained in advance with a preset loss function derived from the ArcFace-based algorithm expression, thereby obtains the target identity information, and takes the target identity information as the recognition result of the voice to be identified, so that various types of voices can be accurately identified.
On the basis of the above embodiments, the preset loss function includes the maximal-margin constraint loss factor C_max_mar, where N is the batch training sample set size, y is the sample class, C is the total number of sample classes, t is the preset threshold, f_{y_i} denotes the posterior probability, greater than the preset threshold, of the class the sample vector belongs to, and δ_y is the maximum-boundary penalty term.
Specifically, the maximal-margin constraint loss factor in the device takes the same form; reference may be made to the above embodiments, and details are not repeated here.
The ArcFace-based speech recognition method provided by the embodiment of the present invention trains the preset model with a preset loss function that includes the maximal-margin constraint loss factor, and is thereby further able to accurately identify various types of voices.
On the basis of the above embodiments, the expression of δ_y is:
δ_y = 1, if f_{y_i} > t and f_j < t for all j ≠ y_i; δ_y = -1, otherwise
where, for j ≠ y_i, f_j denotes the posterior probability, less than the preset threshold in the correct case, of the sample vector belonging to the other classes.
Specifically, the expression of δ_y in the device is the same; reference may be made to the above embodiments, and details are not repeated here.
The ArcFace-based speech recognition method provided by the embodiment of the present invention computes the maximum-boundary penalty term through this expression, and is thereby further able to accurately identify various types of voices.
On the basis of the above embodiments, the expression of the preset loss function is:
L = L3 + λ C_max_mar
where L is the preset loss function, L3 is the ArcFace-based algorithm expression, and λ is a weight coefficient with a value of 0.1 to 10.
Specifically, the expression of the preset loss function in the device is the same; reference may be made to the above embodiments, and details are not repeated here.
The ArcFace-based speech recognition method provided by the embodiment of the present invention computes the preset loss function through this expression, and is thereby further able to accurately identify various types of voices.
On the basis of the above embodiments, extracting the identity feature vector according to the low-level frame-level features comprises:
inputting the low-level frame-level features into the optimized GRU model, and taking the output of the optimized GRU model as the identity feature vector.
Specifically, the device inputs the low-level frame-level features into the optimized GRU model and takes the output of the optimized GRU model as the identity feature vector; reference may be made to the above embodiments, and details are not repeated here.
The ArcFace-based speech recognition method provided by the embodiment of the present invention takes the output of the optimized GRU model as the identity feature vector, which ensures that the method proceeds normally.
On the basis of the above embodiments, the optimized GRU model is a GRU model equipped with a convolutional layer.
Specifically, the optimized GRU model in the device is a GRU model equipped with a convolutional layer; reference may be made to the above embodiments, and details are not repeated here.
The ArcFace-based speech recognition method provided by the embodiment of the present invention selects a GRU model equipped with a convolutional layer as the optimized GRU model, which improves the computational efficiency of the GRU model and allows various types of voices to be identified more quickly.
On the basis of the above embodiments, the low-level frame-level features are Fbank features.
Specifically, the low-level frame-level features in the device are Fbank features; reference may be made to the above embodiments, and details are not repeated here.
The ArcFace-based speech recognition method provided by the embodiment of the present invention selects Fbank features as the low-level frame-level features, which ensures that the method proceeds normally.
Fig. 2 is a structural diagram of the ArcFace-based speech recognition device of an embodiment of the present invention. As shown in Fig. 2, an embodiment of the present invention provides an ArcFace-based speech recognition device comprising a first acquisition unit 201, an extraction unit 202, a second acquisition unit 203 and a recognition unit 204, wherein:
the first acquisition unit 201 is configured to obtain a voice to be identified and extract low-level frame-level features of the voice to be identified; the extraction unit 202 is configured to extract an identity feature vector according to the low-level frame-level features; the second acquisition unit 203 is configured to obtain, from a preset voice library, a target identity feature vector similar to the identity feature vector, the preset voice library pre-storing correspondences between preset identity feature vectors and preset identity information, wherein the correspondences are obtained according to a preset model trained in advance and the preset model is trained with a preset loss function obtained from the ArcFace-based algorithm expression; and the recognition unit 204 is configured to determine, according to the correspondence, target identity information corresponding to the target identity feature vector, and to take the target identity information as the recognition result of the voice to be identified.
The ArcFace-based speech recognition device provided by the embodiment of the present invention obtains, from the preset voice library, a target identity feature vector similar to the identity feature vector corresponding to the voice to be identified, obtains the correspondence according to a preset model trained in advance with a preset loss function derived from the ArcFace-based algorithm expression, thereby obtains the target identity information, and takes the target identity information as the recognition result of the voice to be identified, so that various types of voices can be accurately identified.
The ArcFace-based speech recognition device provided by the embodiment of the present invention can specifically be used to execute the processing flow of each of the above method embodiments; its functions are not described again here, and reference may be made to the detailed description of the above method embodiments.
Fig. 3 is a schematic diagram of the physical structure of the electronic device provided by an embodiment of the present invention. As shown in Fig. 3, the electronic device comprises a processor 301, a memory 302 and a bus 303;
the processor 301 and the memory 302 communicate with each other through the bus 303;
the processor 301 is configured to call the program instructions in the memory 302 to perform the method provided by each of the above method embodiments, for example: obtaining a voice to be identified, and extracting low-level frame-level features of the voice to be identified; extracting an identity feature vector according to the low-level frame-level features; obtaining, from a preset voice library, a target identity feature vector similar to the identity feature vector, the preset voice library pre-storing correspondences between preset identity feature vectors and preset identity information, wherein the correspondences are obtained according to a preset model trained in advance, and the preset model is trained with a preset loss function obtained from the ArcFace-based algorithm expression; and determining, according to the correspondence, target identity information corresponding to the target identity feature vector, and taking the target identity information as the recognition result of the voice to be identified.
This embodiment discloses a computer program product. The computer program product comprises a computer program stored on a non-transitory computer-readable storage medium, and the computer program comprises program instructions which, when executed by a computer, enable the computer to execute the method provided by each of the foregoing method embodiments, for example: obtaining speech to be recognized, and extracting low-level frame-level features of the speech to be recognized; extracting an identity feature vector according to the low-level frame-level features; obtaining, from a preset voice library, a target identity feature vector similar to the identity feature vector, the preset voice library pre-storing correspondences between preset identity feature vectors and preset identity information, where the correspondences are obtained according to a pre-trained preset model, and the preset model is trained with a preset loss function derived from the ArcFace algorithm expression; and determining, according to the correspondences, target identity information corresponding to the target identity feature vector, and taking the target identity information as the recognition result of the speech to be recognized.
This embodiment provides a non-transitory computer-readable storage medium. The non-transitory computer-readable storage medium stores computer instructions which cause a computer to execute the method provided by each of the foregoing method embodiments, for example: obtaining speech to be recognized, and extracting low-level frame-level features of the speech to be recognized; extracting an identity feature vector according to the low-level frame-level features; obtaining, from a preset voice library, a target identity feature vector similar to the identity feature vector, the preset voice library pre-storing correspondences between preset identity feature vectors and preset identity information, where the correspondences are obtained according to a pre-trained preset model, and the preset model is trained with a preset loss function derived from the ArcFace algorithm expression; and determining, according to the correspondences, target identity information corresponding to the target identity feature vector, and taking the target identity information as the recognition result of the speech to be recognized.
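The first step in the flow above is extracting low-level frame-level features, which claim 7 identifies as Fbank (log mel filterbank) features. A minimal numpy sketch follows; the frame length, hop, FFT size and filter count are common defaults in speech processing, not values specified in the patent.

```python
import numpy as np

def fbank(signal, sample_rate=16000, frame_len=0.025, frame_hop=0.010,
          n_filters=40, n_fft=512):
    """Log mel filterbank (Fbank) features: frame the waveform, take the
    power spectrum of each windowed frame, and apply a mel-spaced
    triangular filterbank. All parameter defaults are illustrative."""
    frame_size = int(round(frame_len * sample_rate))
    hop = int(round(frame_hop * sample_rate))
    n_frames = 1 + max(0, (len(signal) - frame_size) // hop)
    idx = np.arange(frame_size)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_size)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # mel-spaced triangular filterbank over the rfft bins
    def hz_to_mel(hz):
        return 2595.0 * np.log10(1.0 + hz / 700.0)
    def mel_to_hz(mel):
        return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2),
                          n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, left:centre] = (np.arange(left, centre) - left) / max(centre - left, 1)
        fb[i - 1, centre:right] = (right - np.arange(centre, right)) / max(right - centre, 1)
    return np.log(power @ fb.T + 1e-10)   # shape (n_frames, n_filters)
```

One second of 16 kHz audio with these defaults yields 98 frames of 40 filterbank energies; in the described pipeline this (frames × filters) matrix would then be fed to the embedding extractor.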
Those of ordinary skill in the art will appreciate that all or part of the steps of the foregoing method embodiments may be completed by hardware related to program instructions. The foregoing program may be stored in a computer-readable storage medium, and when the program is executed, the steps of the foregoing method embodiments are performed. The foregoing storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk or an optical disc.
The embodiments such as the electronic device described above are merely schematic. The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment, which those of ordinary skill in the art can understand and implement without creative labour.
Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus a necessary general hardware platform, and certainly also by hardware. Based on this understanding, the above technical solution, or in other words the part of it that contributes to the existing technology, can essentially be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk or an optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to execute the method described in each embodiment or in certain parts of an embodiment.
Finally, it should be noted that the above embodiments are only intended to illustrate, not to limit, the technical solutions of the embodiments of the present invention. Although the embodiments of the present invention have been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. An ArcFace-based speech recognition method, characterized by comprising:
obtaining speech to be recognized, and extracting low-level frame-level features of the speech to be recognized;
extracting an identity feature vector according to the low-level frame-level features;
obtaining, from a preset voice library, a target identity feature vector similar to the identity feature vector, the preset voice library pre-storing correspondences between preset identity feature vectors and preset identity information, wherein the correspondences are obtained according to a pre-trained preset model, and the preset model is trained with a preset loss function derived from the ArcFace algorithm expression; and
determining, according to the correspondences, target identity information corresponding to the target identity feature vector, and taking the target identity information as the recognition result of the speech to be recognized.
2. The method according to claim 1, characterized in that the preset loss function comprises a maximal-margin constraint loss factor, the expression of which is as follows:
wherein C_max_mar is the maximal-margin constraint loss factor, N is the batch training sample set, y_i is the sample class, C is the total number of sample classes, t is the preset threshold, f_{y_i} denotes the posterior probability, greater than the preset threshold, that the sample vector belongs to its own class, and δ_y is the maximum-margin penalty term.
3. The method according to claim 2, characterized in that the expression of δ_y is as follows:
wherein, when j ≠ y_i, f_j denotes the posterior probability, less than the preset threshold, that the sample vector belongs to another class.
4. The method according to claim 2 or 3, characterized in that the expression of the preset loss function is:
L = L3 + λ·C_max_mar
wherein L is the preset loss function, L3 is the ArcFace-based algorithm expression, and λ is a weight coefficient with a value in the range 0.1 to 10.
5. The method according to claim 1, characterized in that extracting an identity feature vector according to the low-level frame-level features comprises:
inputting the low-level frame-level features into an optimized GRU model, and taking the output of the optimized GRU model as the identity feature vector.
6. The method according to claim 5, characterized in that the optimized GRU model is a GRU model equipped with a convolutional layer.
7. The method according to claim 1, characterized in that the low-level frame-level features are Fbank features.
8. An ArcFace-based speech recognition device, characterized by comprising:
a first acquisition unit, configured to obtain speech to be recognized and extract low-level frame-level features of the speech to be recognized;
an extraction unit, configured to extract an identity feature vector according to the low-level frame-level features;
a second acquisition unit, configured to obtain, from a preset voice library, a target identity feature vector similar to the identity feature vector, the preset voice library pre-storing correspondences between preset identity feature vectors and preset identity information, wherein the correspondences are obtained according to a pre-trained preset model, and the preset model is trained with a preset loss function derived from the ArcFace algorithm expression; and
a recognition unit, configured to determine, according to the correspondences, target identity information corresponding to the target identity feature vector, and to take the target identity information as the recognition result of the speech to be recognized.
9. An electronic device, characterized by comprising a processor, a memory and a bus, wherein
the processor and the memory communicate with each other through the bus; and
the memory stores program instructions executable by the processor, and the processor calls the program instructions to execute the method according to any one of claims 1 to 7.
10. A non-transitory computer-readable storage medium, characterized in that the non-transitory computer-readable storage medium stores computer instructions that cause a computer to execute the method according to any one of claims 1 to 7.
CN201811400260.2A 2018-11-22 2018-11-22 ArcFace-based voice recognition method and device Active CN109377984B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811400260.2A CN109377984B (en) 2018-11-22 2018-11-22 ArcFace-based voice recognition method and device


Publications (2)

Publication Number Publication Date
CN109377984A true CN109377984A (en) 2019-02-22
CN109377984B CN109377984B (en) 2022-05-03

Family

ID=65377103

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811400260.2A Active CN109377984B (en) 2018-11-22 2018-11-22 ArcFace-based voice recognition method and device

Country Status (1)

Country Link
CN (1) CN109377984B (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104732978A (en) * 2015-03-12 2015-06-24 上海交通大学 Text-dependent speaker recognition method based on joint deep learning
CN105632502A (en) * 2015-12-10 2016-06-01 江西师范大学 Weighted pairwise constraint metric learning algorithm-based speaker recognition method
CN105575394A (en) * 2016-01-04 2016-05-11 北京时代瑞朗科技有限公司 Voiceprint identification method based on global change space and deep learning hybrid modeling
CN105931646A (en) * 2016-04-29 2016-09-07 江西师范大学 Speaker identification method based on simple direct tolerance learning algorithm
CN106022380A (en) * 2016-05-25 2016-10-12 中国科学院自动化研究所 Individual identity identification method based on deep learning
US20180197547A1 (en) * 2017-01-10 2018-07-12 Fujitsu Limited Identity verification method and apparatus based on voiceprint
US20180261236A1 (en) * 2017-03-10 2018-09-13 Baidu Online Network Technology (Beijing) Co., Ltd. Speaker recognition method and apparatus, computer device and computer-readable medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHEN, SHENG et al.: "MobileFaceNets: Efficient CNNs for Accurate Real-Time Face Verification on Mobile Devices", 13th Chinese Conference on Biometric Recognition (CCBR) *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110047468A (en) * 2019-05-20 2019-07-23 北京达佳互联信息技术有限公司 Audio recognition method, device and storage medium
CN110047468B (en) * 2019-05-20 2022-01-25 北京达佳互联信息技术有限公司 Speech recognition method, apparatus and storage medium
CN111582354A (en) * 2020-04-30 2020-08-25 中国平安财产保险股份有限公司 Picture identification method, device, equipment and storage medium
CN112669827A (en) * 2020-12-28 2021-04-16 清华大学 Joint optimization method and system for automatic speech recognizer
CN112669827B (en) * 2020-12-28 2022-08-02 清华大学 Joint optimization method and system for automatic speech recognizer

Also Published As

Publication number Publication date
CN109377984B (en) 2022-05-03

Similar Documents

Publication Publication Date Title
US9875743B2 (en) Acoustic signature building for a speaker from multiple sessions
Ittichaichareon et al. Speech recognition using MFCC
US7447338B2 (en) Method and system for face detection using pattern classifier
US20180197547A1 (en) Identity verification method and apparatus based on voiceprint
CN107221320A (en) Train method, device, equipment and the computer-readable storage medium of acoustic feature extraction model
CN108281137A (en) A kind of universal phonetic under whole tone element frame wakes up recognition methods and system
CN110910891B (en) Speaker segmentation labeling method based on long-time and short-time memory deep neural network
Gosztolya et al. DNN-based feature extraction and classifier combination for child-directed speech, cold and snoring identification
CN111462729B (en) Fast language identification method based on phoneme log-likelihood ratio and sparse representation
CN111583906B (en) Role recognition method, device and terminal for voice session
CN110287311B (en) Text classification method and device, storage medium and computer equipment
CN108520752A (en) A kind of method for recognizing sound-groove and device
CN106991312B (en) Internet anti-fraud authentication method based on voiceprint recognition
CN111816185A (en) Method and device for identifying speaker in mixed voice
CN109036385A (en) A kind of voice instruction recognition method, device and computer storage medium
CN111401105B (en) Video expression recognition method, device and equipment
CN113223536B (en) Voiceprint recognition method and device and terminal equipment
Ferrer et al. Spoken language recognition based on senone posteriors.
CN108831506A (en) Digital audio based on GMM-BIC distorts point detecting method and system
CN109448756A (en) A kind of voice age recognition methods and system
CN114678030A (en) Voiceprint identification method and device based on depth residual error network and attention mechanism
Venkatesan et al. Automatic language identification using machine learning techniques
CN109377984A (en) A kind of audio recognition method and device based on ArcFace
CN111091809B (en) Regional accent recognition method and device based on depth feature fusion
CN115801374A (en) Network intrusion data classification method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant