CN109377984A - Speech recognition method and device based on ArcFace - Google Patents
Speech recognition method and device based on ArcFace
- Publication number
- CN109377984A CN109377984A CN201811400260.2A CN201811400260A CN109377984A CN 109377984 A CN109377984 A CN 109377984A CN 201811400260 A CN201811400260 A CN 201811400260A CN 109377984 A CN109377984 A CN 109377984A
- Authority
- CN
- China
- Prior art keywords
- default
- voice
- identified
- arcface
- vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/10—Speech classification or search using distance or distortion measures between unknown speech and reference templates
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L2015/0631—Creating reference templates; Clustering
Abstract
An embodiment of the present invention provides a speech recognition method and device based on ArcFace. The method comprises: obtaining speech to be recognized, and extracting low-level frame-level features of the speech to be recognized; extracting an identity feature vector from the low-level frame-level features; obtaining, from a preset speech library, a target identity feature vector similar to the identity feature vector, the preset speech library storing in advance a correspondence between preset identity feature vectors and preset identity information, the correspondence being obtained from a pre-trained preset model, and the preset model being trained with a preset loss function derived from the ArcFace-based algorithm expression; and determining, according to the correspondence, the target identity information corresponding to the target identity feature vector, and taking the target identity information as the recognition result of the speech to be recognized. The device executes the above method. The method and device provided in the embodiments of the present invention can accurately recognize various types of speech.
Description
Technical field
Embodiments of the present invention relate to the field of speech processing technology, and in particular to a speech recognition method and device based on ArcFace.
Background technique
With the explosive growth of digital audio data, identifying speakers through speech recognition technology has gradually received more and more attention.
The i-vector systems most widely used in speaker recognition today, whether based on GMM-UBM (Gaussian mixture model-universal background model) or on GSV-SVM (Gaussian supervector-support vector machine), are built on statistical modeling theory. They therefore require that training and test speech reach a certain length; otherwise, recognition accuracy drops sharply.
On the other hand, although ArcFace is widely used in the field of face recognition, ArcFace has not yet been applied to methods in the field of speech recognition.
Therefore, how to avoid the above drawbacks and accurately recognize various types of speech (including long utterances and short utterances) based on ArcFace has become a problem that needs to be solved.
Summary of the invention
In view of the problems in the existing technology, embodiments of the present invention provide a speech recognition method and device based on ArcFace.
In a first aspect, an embodiment of the present invention provides a speech recognition method based on ArcFace, the method comprising:
obtaining speech to be recognized, and extracting low-level frame-level features of the speech to be recognized;
extracting an identity feature vector from the low-level frame-level features;
obtaining, from a preset speech library, a target identity feature vector similar to the identity feature vector, the preset speech library storing in advance a correspondence between preset identity feature vectors and preset identity information, wherein the correspondence is obtained from a pre-trained preset model, and the preset model is trained with a preset loss function derived from the ArcFace-based algorithm expression; and
determining, according to the correspondence, the target identity information corresponding to the target identity feature vector, and taking the target identity information as the recognition result of the speech to be recognized.
In a second aspect, an embodiment of the present invention provides a speech recognition device based on ArcFace, the device comprising:
a first acquisition unit, configured to obtain speech to be recognized and extract low-level frame-level features of the speech to be recognized;
an extraction unit, configured to extract an identity feature vector from the low-level frame-level features;
a second acquisition unit, configured to obtain, from a preset speech library, a target identity feature vector similar to the identity feature vector, the preset speech library storing in advance a correspondence between preset identity feature vectors and preset identity information, wherein the correspondence is obtained from a pre-trained preset model, and the preset model is trained with a preset loss function derived from the ArcFace-based algorithm expression; and
a recognition unit, configured to determine, according to the correspondence, the target identity information corresponding to the target identity feature vector, and to take the target identity information as the recognition result of the speech to be recognized.
In a third aspect, an embodiment of the present invention provides an electronic device, comprising a processor, a memory and a bus, wherein:
the processor and the memory communicate with each other through the bus;
the memory stores program instructions executable by the processor, and the processor, by calling the program instructions, is able to execute the following method:
obtaining speech to be recognized, and extracting low-level frame-level features of the speech to be recognized;
extracting an identity feature vector from the low-level frame-level features;
obtaining, from a preset speech library, a target identity feature vector similar to the identity feature vector, the preset speech library storing in advance a correspondence between preset identity feature vectors and preset identity information, wherein the correspondence is obtained from a pre-trained preset model, and the preset model is trained with a preset loss function derived from the ArcFace-based algorithm expression; and
determining, according to the correspondence, the target identity information corresponding to the target identity feature vector, and taking the target identity information as the recognition result of the speech to be recognized.
In a fourth aspect, an embodiment of the present invention provides a non-transient computer-readable storage medium, wherein:
the non-transient computer-readable storage medium stores computer instructions that cause a computer to execute the following method:
obtaining speech to be recognized, and extracting low-level frame-level features of the speech to be recognized;
extracting an identity feature vector from the low-level frame-level features;
obtaining, from a preset speech library, a target identity feature vector similar to the identity feature vector, the preset speech library storing in advance a correspondence between preset identity feature vectors and preset identity information, wherein the correspondence is obtained from a pre-trained preset model, and the preset model is trained with a preset loss function derived from the ArcFace-based algorithm expression; and
determining, according to the correspondence, the target identity information corresponding to the target identity feature vector, and taking the target identity information as the recognition result of the speech to be recognized.
The ArcFace-based speech recognition method and device provided in the embodiments of the present invention obtain, from the preset speech library, a target identity feature vector similar to the identity feature vector corresponding to the speech to be recognized, obtain the correspondence from the preset model trained in advance with the preset loss function derived from the ArcFace-based algorithm expression, and thereby obtain the target identity information, which is taken as the recognition result of the speech to be recognized. Various types of speech can thus be recognized accurately.
Brief description of the drawings
In order to explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below illustrate some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from them without creative effort.
Fig. 1 is a schematic flow diagram of the ArcFace-based speech recognition method according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of the ArcFace-based speech recognition device according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of the physical structure of the electronic device provided by an embodiment of the present invention.
Specific embodiment
In order to make the objectives, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below in conjunction with the accompanying drawings. Obviously, the described embodiments are a part of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
Fig. 1 is a schematic flow diagram of the ArcFace-based speech recognition method of an embodiment of the present invention. As shown in Fig. 1, the method provided by the embodiment comprises the following steps:
S101: obtaining speech to be recognized, and extracting low-level frame-level features of the speech to be recognized.
Specifically, the device obtains the speech to be recognized and extracts its low-level frame-level features. The device may be a server or the like that executes this method. Speech of the same speaker over different channels may be collected through equipment such as dynamic microphones, condenser microphones and micro-electro-mechanical microphones, so as to simulate real speech environments. Frame-level features of the speech to be recognized may be extracted with a frame length of 25 ms and a frame shift of 10 ms, and silence may be removed from these frame-level features using VAD (voice activity detection) to obtain the low-level frame-level features. The low-level frame-level features may be Fbank features, but are not specifically limited thereto.
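The 25 ms / 10 ms framing and filterbank step above can be sketched as follows. This is a minimal NumPy illustration rather than the patent's implementation: the 16 kHz sampling rate, 512-point FFT and 40 mel bands are assumptions, and the VAD silence-removal step is omitted.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def fbank(signal, sr=16000, frame_ms=25, hop_ms=10, n_fft=512, n_mels=40):
    """Log mel-filterbank (Fbank) features from 25 ms frames with a 10 ms shift."""
    frame_len = int(sr * frame_ms / 1000)            # 400 samples at 16 kHz
    hop = int(sr * hop_ms / 1000)                    # 160 samples
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)     # windowed frames
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # triangular mel filterbank spanning 0 .. sr/2
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, ce, hi = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, lo:ce] = (np.arange(lo, ce) - lo) / max(ce - lo, 1)
        fb[m - 1, ce:hi] = (hi - np.arange(ce, hi)) / max(hi - ce, 1)
    return np.log(power @ fb.T + 1e-10)              # (n_frames, n_mels)
```

One second of 16 kHz audio yields 98 frames of 40 log-filterbank energies, one row per 10 ms shift.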
S102: extracting an identity feature vector from the low-level frame-level features.
Specifically, the device extracts the identity feature vector from the low-level frame-level features. The identity feature vector can be understood as a feature vector that identifies the speaker. The low-level frame-level features may be input into an optimized GRU model, and the output of the optimized GRU model is taken as the identity feature vector. The GRU (Gated Recurrent Unit) is a variant of the LSTM. As a model for learning temporal features, it retains the LSTM's ability to handle long-range dependencies well while being simpler in structure and more efficient to compute. A convolutional layer may be introduced before the GRU layers to optimize the GRU model: while modeling spectral correlations, it reduces the feature dimensions in both the time and frequency domains and accelerates model computation.
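A minimal sketch of such an embedding network is given below, written from scratch in NumPy. The mean pooling over time and the L2 normalization of the output are assumptions (the patent does not specify the pooling), and the convolutional front end is omitted for brevity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell: update gate z, reset gate r, candidate state."""
    def __init__(self, in_dim, hid_dim, rng):
        scale = 0.1
        self.hid_dim = hid_dim
        self.Wz = rng.standard_normal((hid_dim, in_dim + hid_dim)) * scale
        self.Wr = rng.standard_normal((hid_dim, in_dim + hid_dim)) * scale
        self.Wh = rng.standard_normal((hid_dim, in_dim + hid_dim)) * scale

    def step(self, x, h):
        xh = np.concatenate([x, h])
        z = sigmoid(self.Wz @ xh)                      # how much to update
        r = sigmoid(self.Wr @ xh)                      # how much past state to expose
        cand = np.tanh(self.Wh @ np.concatenate([x, r * h]))
        return (1 - z) * h + z * cand

def speaker_embedding(frames, cell):
    """Run the GRU over the frame sequence and pool into one identity vector."""
    h = np.zeros(cell.hid_dim)
    states = []
    for x in frames:
        h = cell.step(x, h)
        states.append(h)
    v = np.mean(states, axis=0)          # temporal mean pooling (assumption)
    return v / np.linalg.norm(v)         # L2-normalize onto the hypersphere
```

The unit-norm output matches the ArcFace setup below, where embeddings are compared by angle on a hypersphere.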
S103: obtaining, from a preset speech library, a target identity feature vector similar to the identity feature vector, the preset speech library storing in advance a correspondence between preset identity feature vectors and preset identity information; wherein the correspondence is obtained from a pre-trained preset model, and the preset model is trained with a preset loss function derived from the ArcFace-based algorithm expression.
Specifically, the device obtains from the preset speech library a target identity feature vector similar to the identity feature vector; the preset speech library stores in advance the correspondence between preset identity feature vectors and preset identity information, the correspondence is obtained from the pre-trained preset model, and the preset model is trained with the preset loss function derived from the ArcFace-based algorithm expression. A nearest-neighbor classifier may be used: the Euclidean distance between the identity feature vector and each preset identity feature vector in the preset speech library is computed, and the preset identity feature vector with the smallest Euclidean distance is determined as the target identity feature vector. The preset identity information can be understood as the speaker corresponding to a preset identity feature vector; that is, by recognizing a preset identity feature vector, the preset model identifies which speaker that vector corresponds to.
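The nearest-neighbor step above can be sketched in a few lines; this is an illustration, with hypothetical identity names, not the patent's implementation.

```python
import numpy as np

def identify(query_vec, bank_vecs, bank_ids):
    """Nearest-neighbor match: return the enrolled identity whose preset
    identity feature vector has the smallest Euclidean distance to the query."""
    dists = np.linalg.norm(bank_vecs - query_vec, axis=1)
    k = int(np.argmin(dists))
    return bank_ids[k], float(dists[k])
```

For example, a query close to the first enrolled vector returns that speaker's identity along with the distance, which could be thresholded to reject unknown speakers.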
The embodiment of the present invention does not specifically limit the preset model. The ArcFace-based algorithm expression L3 can be obtained through the following steps:
For an input sample vector x_i and its corresponding label y_i (indicating which speaker it belongs to), the loss function L_1 is defined as follows:

$$L_1 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{f_{y_i}}}{\sum_{j=1}^{C}e^{f_j}}$$

where N is the number of samples in a training batch (i.e., the portion of the total number of training samples input per batch), C is the total number of sample classes (i.e., the number of speakers), f_{y_i} denotes the posterior probability that sample vector x_i belongs to its own class, and f_j denotes the posterior probability that sample vector x_i belongs to class j. f_j can be expressed as follows:

$$f_j = W_j^{T}x_i + b_j = \|W_j\|\,\|x_i\|\cos\theta_j + b_j$$

where W_j and b_j are the weight vector and bias of the fully connected layer, and θ_j is the angle between W_j and x_i.

To simplify the expression, b_j is set to 0 and ‖W_j‖ is set to 1 by L2 normalization, so that f_j is determined only by the sample vector x_i and the angle θ_j:

$$f_j = \|x_i\|\cos\theta_j$$

Applying L2 regularization to the features removes the radial variation of the features on the hypersphere. Setting ‖x_i‖ to a constant s, the loss function L_2 is expressed as:

$$L_2 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos\theta_{y_i}}}{\sum_{j=1}^{C}e^{s\cos\theta_j}}$$

Since this soft-margin loss function focuses on correct classification, it gives little consideration to misclassification. To solve this problem, an angular margin penalty m is added, i.e., m is introduced inside cos θ_{y_i} to tighten the constraint on the classification boundary, thereby obtaining the ArcFace algorithm expression L_3:

$$L_3 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)}+\sum_{j\neq y_i}e^{s\cos\theta_j}}$$

where θ_{y_i} belongs to the range [0, π − m].
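The expression L_3 can be sketched in code as follows — a NumPy illustration of the additive angular margin, not the patent's implementation. The scale s = 30 and margin m = 0.5 are the common ArcFace defaults, assumed here.

```python
import numpy as np

def arcface_loss(embeddings, weights, labels, s=30.0, m=0.5):
    """L3: softmax cross-entropy with an additive angular margin m on the
    true class, after L2-normalizing embeddings and class weight vectors."""
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    W = weights / np.linalg.norm(weights, axis=0, keepdims=True)
    cos = np.clip(E @ W, -1.0, 1.0)                    # cos(theta_j), shape (N, C)
    n = np.arange(len(labels))
    target = cos.copy()
    theta_y = np.arccos(cos[n, labels])
    target[n, labels] = np.cos(theta_y + m)            # margin on the true class only
    logits = s * target
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_p[n, labels].mean()
```

Because the margin shrinks the true-class logit, the loss with m > 0 is strictly larger than the plain normalized softmax (m = 0), which is what forces tighter angular clusters per speaker.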
The goal of speech recognition is to judge which speaker an unknown utterance belongs to. Assume that the posterior probability f_{y_i} of the class the utterance belongs to is greater than a preset threshold t, while the posterior probabilities f_j of the other classes are each less than t. This can be expressed as follows:

$$f_{y_i} > t,\qquad f_j < t \ \ (j \neq y_i)$$

In the classification process, f_{y_i} being less than or equal to t, or f_j being greater than or equal to t, is a misclassification, and the loss is defined as the difference between the two. For the former case, the loss L^+ is expressed as:

$$L^{+} = t - f_{y_i}$$

Similarly, the latter loss L^- is expressed as:

$$L^{-} = f_j - t$$

To express the misclassification loss as a whole, L^+ and L^- are fused, and the maximum boundary penalty term δ_y is introduced, giving the per-sample term δ_y(t − f_y). For all samples, the maximum margin constraint loss factor is:

$$C_{max\_mar} = \frac{1}{N}\sum_{i=1}^{N}\sum_{y=1}^{C}\delta_y\,(t - f_y)$$
In summary, the preset loss function L, i.e., the maximum marginal cosine distance loss function (hereinafter "MMCL"), is obtained based on ArcFace and defined as the weighted sum of L_3 and C_{max_mar}, expressed as follows:
L = L_3 + λC_{max_mar}
where λ is a weight coefficient whose value may be chosen as 0.1 to 10.
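The margin factor C_{max_mar} can be sketched as below. Since the patent's exact formula images are not reproduced here, the hinge form max(0, ·) — which keeps correctly classified samples from contributing negative loss — is an assumption; the full MMCL would then be L = L3 + lam * C_max_mar with L3 as in the ArcFace expression above.

```python
import numpy as np

def max_margin_loss_factor(posteriors, labels, t=0.5):
    """C_max_mar sketch: hinge penalties for f_{y_i} <= t (true-class
    posterior not confident enough, the delta_y = +1 case) and for
    f_j >= t with j != y_i (a wrong class too confident, delta_y = -1)."""
    N, C = posteriors.shape
    n = np.arange(N)
    l_plus = np.maximum(0.0, t - posteriors[n, labels])   # L+ = t - f_{y_i}
    others = posteriors.copy()
    others[n, labels] = t                                  # true class contributes no L-
    l_minus = np.maximum(0.0, others - t).sum(axis=1)      # L- = f_j - t
    return float((l_plus + l_minus).mean())
```

When every true class scores above t and every wrong class below it, the factor is zero; misclassified batches incur a loss that grows with the distance of each offending posterior from the threshold.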
It should be understood that the maximum margin constraint loss factor C_{max_mar} introduced in the embodiment of the present invention contains the maximum boundary penalty term δ_y: when the prediction result is correct (the f_{y_i} > t case in the expression for δ_y), δ_y = 1; when the prediction result is wrong (the f_j ≥ t case), δ_y = −1. This makes the preset loss function more discriminative with respect to the prediction result, so that the recognition result is more accurate.
S104: determining, according to the correspondence, the target identity information corresponding to the target identity feature vector, and taking the target identity information as the recognition result of the speech to be recognized.
Specifically, the device determines the target identity information corresponding to the target identity feature vector according to the correspondence, and takes the target identity information as the recognition result of the speech to be recognized. An example: a correspondence exists between preset identity feature vector A and preset identity information a, and the identity feature vector corresponding to the speech to be recognized is X. By comparing vector similarity, preset identity feature vector A is found to be the target identity feature vector similar to identity feature vector X, so preset identity information a is determined as the target identity information and taken as the recognition result of the speech to be recognized. Under four speech-length conditions of 2 s, 3 s, 5 s and 8 s, the EER of the MMCL of the embodiment of the present invention is compared with that of softmax and ArcFace, as shown in Table 1:
Table 1. Recognition performance (EER) of short-utterance speaker recognition methods under different durations
|         | 2s     | 3s     | 5s     | 8s     |
| softmax | 0.0643 | 0.0437 | 0.0363 | 0.0301 |
| ArcFace | 0.0602 | 0.0410 | 0.0307 | 0.0254 |
| MMCL    | 0.0538 | 0.0385 | 0.0272 | 0.0215 |
It can be seen that the MMCL of the embodiment of the present invention has the smallest EER at every duration, and can therefore recognize speech accurately.
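The EER index reported in Table 1 can be computed as in the following sketch, assuming higher scores indicate a more likely genuine (same-speaker) trial; the patent does not give an implementation.

```python
import numpy as np

def eer(genuine_scores, impostor_scores):
    """Equal error rate: the operating point where the false-accept rate
    (impostors scoring above threshold) equals the false-reject rate
    (genuine trials scoring below threshold)."""
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    best, best_gap = 1.0, np.inf
    for th in thresholds:
        far = np.mean(impostor_scores >= th)   # false accepts
        frr = np.mean(genuine_scores < th)     # false rejects
        if abs(far - frr) < best_gap:
            best_gap, best = abs(far - frr), (far + frr) / 2
    return best
```

Lower EER means genuine and impostor score distributions overlap less, which is what the table uses to rank softmax, ArcFace and MMCL.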
The ArcFace-based speech recognition method provided in the embodiment of the present invention obtains from the preset speech library a target identity feature vector similar to the identity feature vector corresponding to the speech to be recognized, obtains the correspondence from the preset model trained with the preset loss function derived from the ArcFace-based algorithm expression, and thereby obtains the target identity information, which is taken as the recognition result of the speech to be recognized. Various types of speech can thus be recognized accurately.
On the basis of the above embodiments, the preset loss function includes the maximum margin constraint loss factor, whose expression is:

$$C_{max\_mar} = \frac{1}{N}\sum_{i=1}^{N}\sum_{y=1}^{C}\delta_y\,(t - f_y)$$

where C_{max_mar} is the maximum margin constraint loss factor, N is the number of samples in a training batch, y is the sample class, C is the total number of sample classes, t is the preset threshold, f_{y_i} greater than the preset threshold denotes the posterior probability of the class the sample vector belongs to, and δ_y is the maximum boundary penalty term.
Specifically, the expression formula of the maximal margin constraint loss factor in device are as follows:
Wherein, Cmax_marLoss factor is constrained for the maximal margin, N is in batches trained sample set, y is sample class
Not, C be sample class sum, t be preset threshold,After the belonged to class of expression sample vector greater than the preset threshold
Test probability, δyFor maximum boundary item penalty.It can refer to above-described embodiment, repeat no more.
The ArcFace-based speech recognition method provided in the embodiment of the present invention, by training the preset model with a preset loss function that includes the maximum margin constraint loss factor, is further able to recognize various types of speech accurately.
On the basis of the above embodiments, the expression of δ_y is:

$$\delta_y = \begin{cases} 1, & f_{y_i} > t \\ -1, & f_j \geq t,\ j \neq y_i \end{cases}$$

where, when j ≠ y_i, f_j being less than the preset threshold denotes the posterior probability that the sample vector belongs to another class.
Specifically, in the device, δ_y takes the value 1 when f_{y_i} > t and the value −1 when f_j ≥ t for some j ≠ y_i; when j ≠ y_i, f_j being less than the preset threshold denotes the posterior probability that the sample vector belongs to another class. Reference may be made to the above embodiment; details are not repeated here.
The ArcFace-based speech recognition method provided in the embodiment of the present invention calculates the maximum boundary penalty term through the specific expression above, and is thus further able to recognize various types of speech accurately.
On the basis of the above embodiments, the expression of the preset loss function is:
L = L_3 + λC_{max_mar}
where L is the preset loss function, L_3 is the ArcFace-based algorithm expression, and λ is a weight coefficient whose value is 0.1 to 10.
Specifically, the expression of the preset loss function in the device is:
L = L_3 + λC_{max_mar}
where L is the preset loss function, L_3 is the ArcFace-based algorithm expression, and λ is a weight coefficient whose value is 0.1 to 10. Reference may be made to the above embodiment; details are not repeated here.
The ArcFace-based speech recognition method provided in the embodiment of the present invention calculates the preset loss function through the specific expression above, and is thus further able to recognize various types of speech accurately.
On the basis of the above embodiments, extracting the identity feature vector from the low-level frame-level features comprises:
inputting the low-level frame-level features into the optimized GRU model, and taking the output of the optimized GRU model as the identity feature vector.
Specifically, the device inputs the low-level frame-level features into the optimized GRU model and takes the output of the optimized GRU model as the identity feature vector. Reference may be made to the above embodiment; details are not repeated here.
The ArcFace-based speech recognition method provided in the embodiment of the present invention, by taking the output of the optimized GRU model as the identity feature vector, ensures that the method proceeds normally.
On the basis of the above embodiments, the optimized GRU model is a GRU model equipped with a convolutional layer.
Specifically, the optimized GRU model in the device is a GRU model equipped with a convolutional layer. Reference may be made to the above embodiment; details are not repeated here.
The ArcFace-based speech recognition method provided in the embodiment of the present invention, by choosing a GRU model equipped with a convolutional layer as the optimized GRU model, improves the computational efficiency of the GRU model and recognizes various types of speech more quickly.
On the basis of the above embodiments, the low-level frame-level features are Fbank features.
Specifically, the low-level frame-level features in the device are Fbank features. Reference may be made to the above embodiment; details are not repeated here.
The ArcFace-based speech recognition method provided in the embodiment of the present invention, by choosing Fbank features as the low-level frame-level features, ensures that the method proceeds normally.
Fig. 2 is a schematic structural diagram of the ArcFace-based speech recognition device of an embodiment of the present invention. As shown in Fig. 2, the embodiment provides a speech recognition device based on ArcFace, comprising a first acquisition unit 201, an extraction unit 202, a second acquisition unit 203 and a recognition unit 204, wherein:
the first acquisition unit 201 is configured to obtain speech to be recognized and extract low-level frame-level features of the speech to be recognized; the extraction unit 202 is configured to extract an identity feature vector from the low-level frame-level features; the second acquisition unit 203 is configured to obtain, from a preset speech library, a target identity feature vector similar to the identity feature vector, the preset speech library storing in advance a correspondence between preset identity feature vectors and preset identity information, wherein the correspondence is obtained from a pre-trained preset model, and the preset model is trained with a preset loss function derived from the ArcFace-based algorithm expression; and the recognition unit 204 is configured to determine, according to the correspondence, the target identity information corresponding to the target identity feature vector, and to take the target identity information as the recognition result of the speech to be recognized.
Specifically, the units operate as described above: the first acquisition unit 201 obtains the speech to be recognized and extracts its low-level frame-level features; the extraction unit 202 extracts the identity feature vector; the second acquisition unit 203 retrieves the target identity feature vector from the preset speech library; and the recognition unit 204 determines the target identity information and takes it as the recognition result of the speech to be recognized.
The ArcFace-based speech recognition device provided in the embodiment of the present invention obtains from the preset speech library a target identity feature vector similar to the identity feature vector corresponding to the speech to be recognized, obtains the correspondence from the preset model trained with the preset loss function derived from the ArcFace-based algorithm expression, and thereby obtains the target identity information, which is taken as the recognition result of the speech to be recognized. Various types of speech can thus be recognized accurately.
The ArcFace-based speech recognition device provided in the embodiment of the present invention may specifically be used to execute the processing flows of the above method embodiments; its functions are not described in detail here, and reference may be made to the detailed description of the above method embodiments.
Fig. 3 is a schematic structural diagram of an electronic device provided in an embodiment of the present invention. As shown in Fig. 3, the electronic device includes a processor (processor) 301, a memory (memory) 302 and a bus 303, where the processor 301 and the memory 302 communicate with each other through the bus 303.
The processor 301 calls program instructions in the memory 302 to execute the method provided by each method embodiment above, for example: obtaining a voice to be identified, and extracting low-level frame-level features of the voice to be identified; extracting an identity feature vector from the low-level frame-level features; obtaining, from a preset voice library, a target identity feature vector similar to the identity feature vector, where the preset voice library stores in advance a correspondence between preset identity feature vectors and preset identity information, the correspondence is obtained from a preset model trained in advance, and the preset model is trained with a preset loss function derived from the ArcFace-based algorithm expression; and determining, according to the correspondence, the target identity information corresponding to the target identity feature vector, and taking the target identity information as the recognition result of the voice to be identified.
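The lookup that closes this flow, comparing the extracted identity feature vector against the vectors enrolled in the preset voice library, can be sketched as a nearest-neighbour search. The similarity measure (cosine) and all names below are illustrative assumptions; the patent only states that a "similar" target identity feature vector is obtained, without fixing the metric.

```python
import numpy as np

def cosine_similarity(a, b):
    # assumed similarity measure between two identity feature vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(query, library):
    """Return the identity whose enrolled vector is most similar to `query`.

    `library` maps preset identity information -> preset identity feature
    vector, i.e. the correspondence stored in the preset voice library.
    """
    best_id, best_score = None, -2.0
    for identity, ref in library.items():
        score = cosine_similarity(query, ref)
        if score > best_score:
            best_id, best_score = identity, score
    return best_id, best_score

# toy enrolled library with two speakers (illustrative vectors only)
library = {
    "alice": np.array([0.9, 0.1, 0.0]),
    "bob":   np.array([0.0, 0.2, 0.9]),
}
who, score = identify(np.array([0.8, 0.2, 0.1]), library)
```

Here `identify` plays the role of the second acquisition unit plus the recognition unit: the returned identity is the recognition result of the voice to be identified.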
This embodiment discloses a computer program product. The computer program product includes a computer program stored on a non-transient computer-readable storage medium, and the computer program includes program instructions. When the program instructions are executed by a computer, the computer can carry out the method provided by each method embodiment above, for example: obtaining a voice to be identified, and extracting low-level frame-level features of the voice to be identified; extracting an identity feature vector from the low-level frame-level features; obtaining, from a preset voice library, a target identity feature vector similar to the identity feature vector, where the preset voice library stores in advance a correspondence between preset identity feature vectors and preset identity information, the correspondence is obtained from a preset model trained in advance, and the preset model is trained with a preset loss function derived from the ArcFace-based algorithm expression; and determining, according to the correspondence, the target identity information corresponding to the target identity feature vector, and taking the target identity information as the recognition result of the voice to be identified.
This embodiment provides a non-transient computer-readable storage medium. The non-transient computer-readable storage medium stores computer instructions, and the computer instructions cause the computer to execute the method provided by each method embodiment above, for example: obtaining a voice to be identified, and extracting low-level frame-level features of the voice to be identified; extracting an identity feature vector from the low-level frame-level features; obtaining, from a preset voice library, a target identity feature vector similar to the identity feature vector, where the preset voice library stores in advance a correspondence between preset identity feature vectors and preset identity information, the correspondence is obtained from a preset model trained in advance, and the preset model is trained with a preset loss function derived from the ArcFace-based algorithm expression; and determining, according to the correspondence, the target identity information corresponding to the target identity feature vector, and taking the target identity information as the recognition result of the voice to be identified.
Those of ordinary skill in the art will appreciate that all or part of the steps of the method embodiments above can be completed by hardware related to program instructions. The aforementioned program can be stored in a computer-readable storage medium; when the program is executed, the steps of the method embodiments above are carried out. The aforementioned storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk or an optical disk.
The embodiments of the electronic device described above are only schematic. The units described as separate components may or may not be physically separated, and components shown as units may or may not be physical units; they can be located in one place or distributed over multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment, and those of ordinary skill in the art can understand and implement this without creative labour.
Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be realized by means of software plus a necessary general hardware platform, or of course by hardware. Based on this understanding, the part of the technical solution above that in essence contributes to the existing technology can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as a ROM/RAM, magnetic disk or optical disk, and includes several instructions for causing a computer device (which can be a personal computer, a server, a network device or the like) to execute the method described in each embodiment or in certain parts of the embodiments.
Finally, it should be noted that the embodiments above are only intended to illustrate the technical solution of the embodiments of the present invention, not to limit it. Although the embodiments of the present invention have been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments can still be modified, or some or all of the technical features can be equivalently replaced; such modifications or replacements do not take the essence of the corresponding technical solution outside the scope of the technical solutions of the embodiments of the present invention.
Claims (10)
1. An ArcFace-based speech recognition method, comprising:
obtaining a voice to be identified, and extracting low-level frame-level features of the voice to be identified;
extracting an identity feature vector from the low-level frame-level features;
obtaining, from a preset voice library, a target identity feature vector similar to the identity feature vector, wherein the preset voice library stores in advance a correspondence between preset identity feature vectors and preset identity information, the correspondence is obtained from a preset model trained in advance, and the preset model is trained with a preset loss function derived from the ArcFace-based algorithm expression; and
determining, according to the correspondence, the target identity information corresponding to the target identity feature vector, and taking the target identity information as the recognition result of the voice to be identified.
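The "ArcFace-based algorithm expression" in claim 1 corresponds, in the speaker-recognition literature, to the additive angular margin softmax loss. The numpy sketch below shows that loss in its standard form; the scale `s` and margin `m` values are conventional defaults, not values taken from the patent.

```python
import numpy as np

def arcface_logits(embeddings, class_weights, labels, s=64.0, m=0.5):
    # L2-normalise embeddings and class-weight columns so the raw
    # logits are cosines of the angle between sample and class centre
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = class_weights / np.linalg.norm(class_weights, axis=0, keepdims=True)
    cos = e @ w                                    # (batch, n_classes)
    theta = np.arccos(np.clip(cos, -1.0, 1.0))
    rows = np.arange(len(labels))
    logits = cos.copy()
    # additive angular margin m, applied only to each sample's true class
    logits[rows, labels] = np.cos(theta[rows, labels] + m)
    return s * logits                              # scaled for softmax

def cross_entropy(logits, labels):
    # numerically stable softmax cross-entropy over the margin logits
    z = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return float(-np.mean(np.log(p[np.arange(len(labels)), labels])))

# two samples, two classes; with the margin the loss is strictly larger,
# which is what forces tighter angular clustering during training
emb = np.array([[1.0, 0.2], [0.1, 1.0]])
W = np.eye(2)
labels = np.array([0, 1])
with_margin = cross_entropy(arcface_logits(emb, W, labels, s=8.0, m=0.5), labels)
plain = cross_entropy(arcface_logits(emb, W, labels, s=8.0, m=0.0), labels)
```

Setting `m=0` recovers the plain normalised softmax, which is why the margin term is the part that tightens intra-class angles.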
2. The method according to claim 1, wherein the preset loss function includes a maximal-margin constraint loss factor, and the expression of the maximal-margin constraint loss factor is:
wherein C_max_mar is the maximal-margin constraint loss factor, N is the batch training sample set, y_i is the sample class, C is the total number of sample classes, t is a preset threshold, f_yi is the posterior probability, greater than the preset threshold, that the sample vector belongs to its class, and δ_y is the maximal-margin penalty term.
3. The method according to claim 2, wherein the expression of δ_y is:
wherein, when j ≠ y_i, f_j represents the posterior probability, less than the preset threshold, that the sample vector belongs to another class.
4. The method according to claim 2 or 3, wherein the expression of the preset loss function is:
L = L_3 + λ·C_max_mar
wherein L is the preset loss function, L_3 is the ArcFace-based algorithm expression, and λ is a weight coefficient with a value of 0.1 to 10.
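Claim 4 combines the ArcFace term with the maximal-margin constraint loss factor as L = L_3 + λ·C_max_mar. Since the patent's expressions for C_max_mar and δ_y are reproduced as images and are not available in this text, the sketch below substitutes a hypothetical hinge-style penalty matching the textual description (it penalises true-class posteriors f_yi that fall below the preset threshold t); this stand-in is an assumption, not the patented formula.

```python
import numpy as np

def max_margin_penalty(posteriors, labels, t=0.8):
    # hypothetical stand-in for C_max_mar: hinge penalty on samples
    # whose true-class posterior falls below the preset threshold t
    p_true = posteriors[np.arange(len(labels)), labels]
    return float(np.mean(np.maximum(0.0, t - p_true)))

def total_loss(arcface_term, posteriors, labels, lam=1.0, t=0.8):
    # claim 4: L = L_3 + lambda * C_max_mar, with lambda in [0.1, 10]
    return arcface_term + lam * max_margin_penalty(posteriors, labels, t)

# sample 0 is confidently correct (no penalty); sample 1 sits just
# below the threshold and contributes 0.8 - 0.7 = 0.1 to the hinge
p = np.array([[0.9, 0.1],
              [0.3, 0.7]])
lab = np.array([0, 1])
loss = total_loss(1.0, p, lab, lam=2.0, t=0.8)
```

The weight λ then controls how strongly the margin constraint pulls against the ArcFace term during training.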
5. The method according to claim 1, wherein extracting an identity feature vector from the low-level frame-level features comprises:
inputting the low-level frame-level features into an optimized GRU model, and taking the output of the optimized GRU model as the identity feature vector.
6. The method according to claim 5, wherein the optimized GRU model is a GRU model equipped with a convolutional layer.
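Claims 5 and 6 describe feeding the frame-level features into a GRU preceded by a convolutional layer, and taking its output as the identity feature vector. The sketch below is a toy, untrained stand-in (random weights, a depthwise 1-D convolution over time, one GRU layer whose final hidden state serves as the embedding); it illustrates only the data flow, not the patent's actual architecture or dimensions.

```python
import numpy as np

def conv1d(x, kernel):
    # x: (T, F) frame-level features; depthwise 1-D convolution over
    # time with edge padding, so the output keeps the same shape
    k = len(kernel)
    pad = np.pad(x, ((k // 2, k // 2), (0, 0)), mode="edge")
    return np.stack([np.tensordot(kernel, pad[t:t + k], axes=1)
                     for t in range(x.shape[0])])

class TinyGRU:
    # minimal single-layer GRU; weights are random and illustrative only
    def __init__(self, in_dim, hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.Wz, self.Wr, self.Wh = (rng.normal(0, 0.1, (hidden, in_dim + hidden))
                                     for _ in range(3))

    def __call__(self, x):
        h = np.zeros(self.Wz.shape[0])
        sig = lambda v: 1.0 / (1.0 + np.exp(-v))
        for frame in x:
            xh = np.concatenate([frame, h])
            z, r = sig(self.Wz @ xh), sig(self.Wr @ xh)
            h_tilde = np.tanh(self.Wh @ np.concatenate([frame, r * h]))
            h = (1 - z) * h + z * h_tilde
        return h  # final hidden state used as the identity feature vector

# 20 frames of 8-dim features -> smoothed by the conv layer -> 16-dim embedding
feats = np.linspace(0.0, 1.0, 160).reshape(20, 8)
smoothed = conv1d(feats, np.array([0.25, 0.5, 0.25]))
embedding = TinyGRU(8, 16)(smoothed)
```

In the patented design the convolutional front end presumably learns local spectral patterns before the GRU summarises them over time; here the fixed smoothing kernel merely stands in for that layer.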
7. The method according to claim 1, wherein the low-level frame-level features are Fbank features.
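Claim 7 names Fbank (log mel filterbank) features as the low-level frame-level features. A minimal sketch of computing them from a raw waveform follows; the frame length, hop, FFT size and number of mel bands are common illustrative choices, not values from the patent.

```python
import numpy as np

def mel(f):                         # Hz -> mel scale
    return 2595.0 * np.log10(1.0 + f / 700.0)

def inv_mel(m):                     # mel scale -> Hz
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def fbank(signal, sr=16000, n_fft=512, n_mels=8, frame=400, hop=160):
    # split into overlapping frames, apply a Hamming window, and take
    # the power spectrum of each frame
    n_frames = 1 + (len(signal) - frame) // hop
    win = np.hamming(frame)
    frames = np.stack([signal[i * hop:i * hop + frame] * win
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # triangular mel filterbank with band edges equally spaced in mel
    pts = inv_mel(np.linspace(mel(0.0), mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # log filterbank energies: one row of n_mels values per frame
    return np.log(power @ fb.T + 1e-10)

# one second of a 440 Hz tone -> (n_frames, n_mels) Fbank matrix
tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
features = fbank(tone)
```

Each row of the result is one low-level frame-level feature vector, which is exactly the input that claim 5 feeds into the optimized GRU model.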
8. An ArcFace-based speech recognition device, comprising:
a first acquisition unit, for obtaining a voice to be identified and extracting low-level frame-level features of the voice to be identified;
an extraction unit, for extracting an identity feature vector from the low-level frame-level features;
a second acquisition unit, for obtaining, from a preset voice library, a target identity feature vector similar to the identity feature vector, wherein the preset voice library stores in advance a correspondence between preset identity feature vectors and preset identity information, the correspondence is obtained from a preset model trained in advance, and the preset model is trained with a preset loss function derived from the ArcFace-based algorithm expression; and
a recognition unit, for determining, according to the correspondence, the target identity information corresponding to the target identity feature vector, and taking the target identity information as the recognition result of the voice to be identified.
9. An electronic device, comprising a processor, a memory and a bus, wherein the processor and the memory communicate with each other through the bus, the memory stores program instructions executable by the processor, and the processor calls the program instructions to execute the method according to any one of claims 1 to 7.
10. A non-transient computer-readable storage medium, wherein the non-transient computer-readable storage medium stores computer instructions, and the computer instructions cause the computer to execute the method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811400260.2A CN109377984B (en) | 2018-11-22 | 2018-11-22 | ArcFace-based voice recognition method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109377984A true CN109377984A (en) | 2019-02-22 |
CN109377984B CN109377984B (en) | 2022-05-03 |
Family
ID=65377103
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109377984B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110047468A (en) * | 2019-05-20 | 2019-07-23 | 北京达佳互联信息技术有限公司 | Audio recognition method, device and storage medium |
CN111582354A (en) * | 2020-04-30 | 2020-08-25 | 中国平安财产保险股份有限公司 | Picture identification method, device, equipment and storage medium |
CN112669827A (en) * | 2020-12-28 | 2021-04-16 | 清华大学 | Joint optimization method and system for automatic speech recognizer |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104732978A (en) * | 2015-03-12 | 2015-06-24 | 上海交通大学 | Text-dependent speaker recognition method based on joint deep learning |
CN105575394A (en) * | 2016-01-04 | 2016-05-11 | 北京时代瑞朗科技有限公司 | Voiceprint identification method based on global change space and deep learning hybrid modeling |
CN105632502A (en) * | 2015-12-10 | 2016-06-01 | 江西师范大学 | Weighted pairwise constraint metric learning algorithm-based speaker recognition method |
CN105931646A (en) * | 2016-04-29 | 2016-09-07 | 江西师范大学 | Speaker identification method base on simple direct tolerance learning algorithm |
CN106022380A (en) * | 2016-05-25 | 2016-10-12 | 中国科学院自动化研究所 | Individual identity identification method based on deep learning |
US20180197547A1 (en) * | 2017-01-10 | 2018-07-12 | Fujitsu Limited | Identity verification method and apparatus based on voiceprint |
US20180261236A1 (en) * | 2017-03-10 | 2018-09-13 | Baidu Online Network Technology (Beijing) Co., Ltd. | Speaker recognition method and apparatus, computer device and computer-readable medium |
Non-Patent Citations (1)
Title |
---|
CHEN, SHENG et al.: "MobileFaceNets: Efficient CNNs for Accurate Real-Time Face Verification on Mobile Devices", 13th Chinese Conference on Biometric Recognition (CCBR) *
Also Published As
Publication number | Publication date |
---|---|
CN109377984B (en) | 2022-05-03 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |