CN110473552A - Speech recognition authentication method and system - Google Patents
- Publication number: CN110473552A
- Application number: CN201910832042.4A
- Authority
- CN
- China
- Prior art keywords
- audio
- speaker
- frequency information
- feature
- voice
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/18—Artificial neural networks; Connectionist approaches
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
Abstract
An embodiment of the present invention provides a speech recognition authentication method, comprising: obtaining audio information; preprocessing the audio information to obtain voice information from it according to the short-time energy and spectral centroid of the audio information; performing speech feature extraction on the voice information; processing the speech features to obtain target speech features that more closely represent the speaker; matching the target speech features against the speaker speech features stored in a database; and, according to the matching result, outputting the identity information of the speaker corresponding to the matched speaker speech features, thereby identifying the speaker corresponding to the voice information. Embodiments of the present invention also provide a speech recognition authentication system, a computer device, and a readable storage medium. Through the embodiments of the present invention, the accuracy of speech recognition technology can be improved and the user experience greatly enhanced.
Description
Technical field
Embodiments of the present invention relate to the field of speech recognition, and in particular to a speech recognition authentication method, a speech recognition authentication system, a computer device, and a readable storage medium.
Background technique
As speech recognition technology matures, it is used very widely in daily life. For example, a household intelligent voice robot executes received spoken instructions by recognizing the voices of family members, and a meeting-minutes system records each participant's remarks by recognizing the participants' voices. However, most existing speech recognition systems suffer from problems such as unclear recognition and speaker misidentification: for example, the sound of typing on a keyboard may be treated as a valid person's speech, causing the system to give an invalid response, or speaker A's remarks may be recorded as those of speaker B. The present invention aims to solve such low-accuracy problems as unclear speech recognition and speaker misidentification.
Summary of the invention
In view of this, it is necessary to provide a speech recognition authentication method, a speech recognition authentication system, a computer device, and a readable storage medium that can improve the accuracy of speech recognition technology and greatly enhance the user experience.
To achieve the above object, an embodiment of the present invention provides a speech recognition authentication method, the method comprising:
obtaining audio information;
preprocessing the audio information to obtain voice information from the audio information according to the short-time energy and spectral centroid of the audio information;
performing speech feature extraction on the voice information;
processing the speech features to obtain target speech features that more closely represent the speaker;
matching the target speech features against the speaker speech features stored in a database; and
according to the matching result, outputting the identity information of the speaker corresponding to the matched speaker speech features, thereby identifying the speaker corresponding to the voice information.
Preferably, the step of preprocessing the audio information to obtain voice information from the audio information according to its short-time energy and spectral centroid comprises:
extracting multiple frames of short-time signals from the audio information according to a preset rule, wherein the preset rule includes a preset signal-extraction time interval;
calculating the short-time energy of the multiple frames of short-time signals according to a silence detection algorithm;
calculating the spectral centroid from the multiple frames of short-time signals;
comparing the short-time energy with a first preset value stored in a database;
comparing the spectral centroid with a second preset value stored in the database;
when the short-time energy is higher than the first preset value and the spectral centroid is higher than the second preset value, determining that the audio information is voice information; and
obtaining the voice information.
Preferably, the short-time energy is calculated as:
E = Σ_{n=1}^{N} s(n)²
where E denotes the short-time energy, N denotes the number of frames of short-time signals (N ≥ 2, an integer), and s(n) denotes the signal amplitude of the n-th frame of the short-time signal in the time domain.
Preferably, the step of calculating the spectral centroid from the multiple frames of short-time signals comprises:
obtaining the frequencies corresponding to the multiple frames of short-time signals; and
calculating the spectral centroid of the audio information from the frequencies and the multiple frames of short-time signals according to the silence detection algorithm, wherein the spectral centroid is calculated as:
C = ( Σ_{k=1}^{K} k · S(k) ) / ( Σ_{k=1}^{K} S(k) )
where C denotes the spectral centroid, K denotes the number of frequencies corresponding to the N frames of s(n) (K ≥ 2, an integer), and S(k) denotes the spectral energy distribution over the frequency domain obtained by the discrete Fourier transform of s(n).
Preferably, after the step of comparing the spectral centroid with the second preset value stored in the database, the method further comprises:
when the short-time energy is lower than the first preset value and/or the spectral centroid is lower than the second preset value, determining that the audio information is invalid audio information, wherein the invalid audio information includes at least silence, environmental noise, and non-environmental noise; and
deleting the audio information.
Preferably, the step of processing the speech features to obtain target speech features that more closely represent the speaker comprises:
normalizing the speech features using the Z-score standardization method so as to unify the speech features, wherein the normalization formula is:
x* = (x − μ) / σ
where μ is the mean of multiple pieces of voice information, σ is the standard deviation of the multiple pieces of voice information, x is the multiple single-frame voice data, and x* is the normalized speech feature;
splicing the normalized feature results to form long, overlapping splice frames; and
inputting the splice frames into a neural network for training, so as to obtain the target speech features.
Preferably, after the step of processing the speech features to obtain target speech features that more closely represent the speaker, the method further comprises:
inputting the target speech features into a pre-trained speaker verification model and intrusive-noise model;
verifying, according to the output result, whether the voice information is the voice of one of the multiple preset speakers saved in the speaker verification model; and
when the voice information is the voice of a preset speaker, obtaining the voice information.
To achieve the above object, an embodiment of the present invention also provides a speech recognition authentication system, comprising:
an acquisition module for obtaining audio information;
a preprocessing module for preprocessing the audio information to obtain voice information from it according to the short-time energy and spectral centroid of the audio information;
a feature extraction module for performing speech feature extraction on the voice information;
a processing module for processing the speech features to obtain target speech features that more closely represent the speaker;
a matching module for matching the target speech features against the speaker speech features stored in a database; and
an output module for outputting, according to the matching result, the identity information of the speaker corresponding to the matched speaker speech features, thereby identifying the speaker corresponding to the voice information.
To achieve the above object, an embodiment of the present invention also provides a computer device comprising a memory, a processor, and a computer program stored on the memory and runnable on the processor, wherein the computer program, when executed by the processor, implements the steps of the speech recognition authentication method described above.
To achieve the above object, an embodiment of the present invention also provides a computer-readable storage medium storing a computer program executable by at least one processor, so as to cause the at least one processor to execute the steps of the speech recognition authentication method described above.
With the speech recognition authentication method, speech recognition authentication system, computer device, and readable storage medium provided by the embodiments of the present invention, the acquired audio information is preprocessed so that voice information is obtained from it according to its short-time energy and spectral centroid; speech features are extracted from the voice information and processed to obtain target speech features that more closely represent the speaker; the target speech features are matched against the speaker speech features stored in a database; and, according to the matching result, the matched speaker's identity information is output, thereby identifying the speaker corresponding to the voice information. Through the embodiments of the present invention, the accuracy of speech recognition technology can be improved and the user experience greatly enhanced.
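The method summarized above can be sketched end to end as follows. This is a minimal illustration only: every function name and every stage implementation (the toy energy threshold, the mean-spectrum "feature", the inner-product match) is an assumed stand-in, not the patented algorithm.

```python
# The method steps, wired together as a minimal data flow.
# All stage implementations are trivial stand-ins for illustration.
import numpy as np

def preprocess(audio, frame_len=160):
    """Step 2 stand-in: keep frames whose energy suggests speech (toy threshold)."""
    frames = audio[: len(audio) // frame_len * frame_len].reshape(-1, frame_len)
    return frames[np.sum(frames ** 2, axis=1) > 0.01]

def extract_features(frames):
    """Step 3 stand-in: feature = mean spectrum magnitude across voiced frames."""
    return np.abs(np.fft.rfft(frames, axis=1)).mean(axis=0)

def refine(feature):
    """Step 4 stand-in for the neural-network refinement: unit-normalise."""
    return feature / (np.linalg.norm(feature) + 1e-12)

def match(feature, database):
    """Steps 5-6 stand-in: return the identity with the highest inner product."""
    return max(database, key=lambda name: float(np.dot(database[name], feature)))

def authenticate(audio, database):
    voiced = preprocess(audio)
    if len(voiced) == 0:
        return None                      # no valid speech found
    return match(refine(extract_features(voiced)), database)
```

A database here is simply a mapping from identity information to a stored speaker feature vector; `authenticate` returns the matched identity, or `None` when the audio contains no valid speech.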
Brief description of the drawings
Fig. 1 is a flowchart of the steps of the speech recognition authentication method of Embodiment 1 of the present invention.
Fig. 2 is a diagram of the spliced audio features after normalization in Embodiment 1 of the present invention.
Fig. 3 is a diagram of the specific splicing method of Embodiment 1 of the present invention.
Fig. 4 is a schematic diagram of the hardware architecture of the computer device of Embodiment 2 of the present invention.
Fig. 5 is a schematic diagram of the program modules of the speech recognition authentication system of Embodiment 3 of the present invention.
Reference numerals:
Computer device | 2 |
Memory | 21 |
Processor | 22 |
Network interface | 23 |
Speech recognition authentication system | 20 |
Acquisition module | 201 |
Preprocessing module | 202 |
Feature extraction module | 203 |
Processing module | 204 |
Matching module | 205 |
Output module | 206 |
Normalization module | 207 |
Splicing module | 208 |
Training module | 209 |
Speech verification module | 210 |
The realization of the objects, functional characteristics, and advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Specific embodiment
To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is further elaborated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only intended to explain the present invention and are not intended to limit it. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the present invention.
It should be noted that descriptions involving "first", "second", and the like in the present invention are for descriptive purposes only and cannot be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. Thus, a feature defined by "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the various embodiments can be combined with each other, but only on the basis that they can be realized by those of ordinary skill in the art; when a combination of technical solutions is contradictory or cannot be realized, it should be considered that such a combination does not exist and is not within the protection scope claimed by the present invention.
Embodiment one
Referring to Fig. 1, a flowchart of the steps of the speech recognition authentication method of Embodiment 1 of the present invention is shown. It should be understood that the flowchart in this method embodiment is not intended to limit the order in which the steps are executed. It should be noted that the present embodiment is described by way of example with the computer device 2 as the executing subject. The details are as follows:
Step S100: obtain audio information.
In a preferred embodiment, while meeting minutes are being taken, the environment contains the speech of the speakers as well as silence, environmental noise, and non-environmental noise; the speech recognition authentication system acquires these sounds, namely the audio information.
It should be noted that non-environmental noise and a speaker's speech have different short-time energies and spectral centroids.
Step S102: preprocess the audio information to obtain voice information from the audio information according to its short-time energy and spectral centroid.
Illustratively, after the audio information is obtained, because it contains the speaker's voice information together with silence, environmental noise, and non-environmental noise, the audio information needs to be processed to extract the voice information from it. The silent parts are those in which no sound is produced: for example, a speaker may pause to think or to breathe while talking, and makes no sound while thinking or breathing. Environmental noise includes, but is not limited to, sounds such as the opening and closing of doors and windows or the collision of objects. Non-environmental noise includes, but is not limited to, sounds such as coughing, clicking a mouse, or typing on a keyboard. Short-time energy and spectral centroid are two important indicators of audio information in silence detection technology: the short-time energy reflects the strength of the signal energy and can distinguish silence and environmental noise within a segment of audio, while the spectral centroid can distinguish the non-environmental-noise parts. By combining the short-time energy and the spectral centroid, effective audio, namely the voice information, is filtered out of the audio information.
In a preferred embodiment, when the audio information is preprocessed to obtain voice information from it according to its short-time energy and spectral centroid, multiple frames of short-time signals are first extracted from the audio information according to a preset rule, wherein the preset rule includes a preset signal-extraction time interval. Then, the short-time energy and the spectral centroid of the multiple frames of short-time signals are calculated according to a silence detection algorithm. Next, the short-time energy is compared with a first preset value stored in a database, and the spectral centroid is compared with a second preset value stored in the database. When the short-time energy is higher than the first preset value and the spectral centroid is higher than the second preset value, the audio information is determined to be voice information, and the voice information is obtained.
The short-time energy is calculated as E = Σ_{n=1}^{N} s(n)², where E denotes the short-time energy, N denotes the number of frames of short-time signals, and s(n) denotes the signal amplitude of the n-th frame of the short-time signal in the time domain.
Illustratively, multiple frames of short-time signals s(1), s(2), s(3), s(4), ..., s(N) are extracted from the audio information at a preset time interval (for example, 0.2 ms), and the short-time energy of the extracted frames is then calculated to determine the strength of the energy of the audio information.
It should be noted that the short-time energy is the sum of the squares of each frame signal and reflects the strength of the signal energy; when the signal energy is too weak, the signal is determined to be silence or environmental noise.
In a further preferred embodiment, when the spectral centroid is calculated from the multiple frames of short-time signals, the frequencies corresponding to the multiple frames of short-time signals are also obtained, and the spectral centroid of the audio information is calculated from the frequencies and the multiple frames of short-time signals according to the silence detection algorithm. The spectral centroid is calculated as C = ( Σ_{k=1}^{K} k · S(k) ) / ( Σ_{k=1}^{K} S(k) ), where C denotes the spectral centroid, K denotes the number of frequencies corresponding to the N frames of s(n) (K ≥ 2, an integer), and S(k) denotes the spectral energy distribution over the frequency domain obtained by the discrete Fourier transform of s(n).
It should be noted that the spectral centroid is also called the first-order moment of the spectrum; the smaller the value of the spectral centroid, the more the spectral energy is concentrated in the low-frequency range. Using the spectral centroid, the non-environmental-noise parts, such as coughing, clicking a mouse, or typing on a keyboard, can be removed.
It should be noted that when the short-time energy and spectral centroid indicators are both above the set thresholds, the audio information is effective audio, namely the voice information of the speaker; most environmental and non-environmental noise is removed, so that the remaining voice information is purer and of higher quality, which removes a large number of disturbing factors from the speech recognition process. In the embodiment of the present invention, high-quality voice information is obtained by setting the first preset value and the second preset value relatively high.
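The silence-detection step described above can be sketched as follows, computing per frame the short-time energy E = Σ s(n)² and the spectral centroid C = Σ k·S(k) / Σ S(k), then keeping only frames that exceed both thresholds. The frame length and both threshold values are illustrative assumptions, not taken from the patent.

```python
# Sketch of the silence-detection (VAD) step: frame the audio, compute
# short-time energy and spectral centroid, and keep frames above both
# thresholds. Frame length and thresholds are invented for illustration.
import numpy as np

def frame_signal(audio, frame_len=160, hop=160):
    n = (len(audio) - frame_len) // hop + 1
    return np.stack([audio[i * hop : i * hop + frame_len] for i in range(n)])

def short_time_energy(frames):
    # E = sum over samples of s(n)^2, per frame
    return np.sum(frames ** 2, axis=1)

def spectral_centroid(frames):
    spec = np.abs(np.fft.rfft(frames, axis=1))      # S(k) from the DFT
    k = np.arange(1, spec.shape[1] + 1)
    # C = sum(k * S(k)) / sum(S(k)); small epsilon guards silent frames
    return (spec @ k) / (np.sum(spec, axis=1) + 1e-12)

def select_voice(audio, e_thresh, c_thresh, frame_len=160):
    frames = frame_signal(audio, frame_len)
    mask = (short_time_energy(frames) > e_thresh) & \
           (spectral_centroid(frames) > c_thresh)
    return frames[mask]
```

Frames failing either test correspond to the invalid audio (silence, environmental noise, non-environmental noise) that the method discards.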
In a further preferred embodiment, after the spectral centroid is compared with the second preset value stored in the database, when the short-time energy is lower than the first preset value and/or the spectral centroid is lower than the second preset value, the audio information is determined to be invalid audio information and is deleted. The invalid audio information includes at least silence, environmental noise, and non-environmental noise.
Illustratively, if the short-time energy is lower than the first preset value, the environment is quiet and the audio information is silence or environmental noise. If the spectral centroid is lower than the second preset value, the environment is non-quiet and the audio information is non-environmental noise.
Step S104: perform speech feature extraction on the voice information.
In a preferred embodiment, a Hamming window with a window length of 10 frames (100 milliseconds) and a hop of 3 frames (30 milliseconds) is applied to the voice information, and the corresponding speech features are then extracted.
It should be noted that the speech features include, but are not limited to, spectral features, sound-quality features, and voiceprint features. Spectral features distinguish voice data, such as target speech and interfering speech, according to their acoustic vibration frequencies. Sound-quality and voiceprint features identify the speaker corresponding to the voice data under test according to the timbre characteristics of the voiceprint and the sound. Since speech differentiation is used to distinguish target speech from interfering speech within voice data, only the spectral features of the voice information need to be obtained to accomplish it. Here, "spectrum" is short for spectral density, and a spectral feature is a parameter reflecting the spectral density.
In a preferred embodiment, the voice information includes multiple frames of single-frame voice data. When speech feature extraction is performed on the voice information, a fast Fourier transform (FFT) is first applied to the single-frame voice data to obtain the power spectrum of the voice information; the power spectrum is then reduced in dimension using a mel filter bank to obtain a mel spectrum; finally, cepstral analysis is performed on the mel spectrum to obtain the speech features.
Illustratively, since the human auditory perception system behaves like a complex nonlinear system, the power spectrum obtained above cannot represent the nonlinear behavior of the voice data well; the spectrum is therefore further reduced in dimension with the mel filter bank so that the resulting spectrum of the voice data under test is closer to the frequencies of auditory perception. The mel filter bank is composed of multiple overlapping triangular band-pass filters, each with three characteristic frequencies: a lower cutoff frequency, an upper cutoff frequency, and a center frequency. The center frequencies of these triangular band-pass filters are equally spaced on the mel scale, which grows linearly below 1000 Hz and logarithmically above 1000 Hz. The cepstrum is the inverse Fourier transform of the logarithm of the Fourier spectrum of a signal; since the general Fourier spectrum is a complex spectrum, the cepstrum is also called the complex cepstrum.
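The spectrum → mel filter bank → cepstrum chain described above is essentially an MFCC-style front end; a rough sketch follows. The specific parameters (26 triangular filters, 13 cepstral coefficients, a DCT-based cepstrum) are common conventions assumed here, since the patent does not fix them.

```python
# Rough sketch of the power spectrum -> mel filter bank -> cepstrum chain.
# Filter count, coefficient count, and DCT cepstrum are assumed conventions.
import numpy as np

def mel_filterbank(n_filters, n_fft, sr):
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)       # Hz -> mel
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)    # mel -> Hz
    pts = imel(np.linspace(mel(0), mel(sr / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

def cepstral_features(frame, sr=16000, n_filters=26, n_ceps=13):
    power = np.abs(np.fft.rfft(frame)) ** 2                  # power spectrum
    mel_energy = mel_filterbank(n_filters, len(frame), sr) @ power
    log_mel = np.log(mel_energy + 1e-10)
    n = np.arange(n_filters)
    # DCT-II implemented directly: cepstrum of the log mel spectrum
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1))
                   / (2 * n_filters))
    return basis @ log_mel
```

The overlapping triangular filters implement the mel-scale dimension reduction described above; taking the DCT of the log filter-bank energies is one standard way of performing the cepstral analysis.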
Step S106: process the speech features to obtain target speech features that more closely represent the speaker.
In a preferred embodiment, the step of processing the speech features to obtain target speech features that more closely represent the speaker specifically includes: normalizing the speech features using the Z-score standardization method so as to unify them, wherein the normalization formula is x* = (x − μ) / σ, where μ is the mean of multiple pieces of voice information, σ is their standard deviation, x is the multiple single-frame voice data, and x* is the normalized speech feature. The normalized feature results are then spliced to form long, overlapping splice frames. Finally, the splice frames are input into a neural network for training so as to obtain the target speech features, thereby reducing the loss of the voice information.
Illustratively, referring to Fig. 2, the normalized feature results are spliced using a window length of 10 frames and a hop of 3 frames to form 390-dimensional features; every 10 frames form one splice unit. The specific splicing method is shown in Fig. 3.
It should be noted that since each frame has 39 dimensions, 10 frames spliced together give 390 dimensions. With a hop of 3 frames, three steps are taken forward from the first frame, so the next group of frames to be spliced is the 4th frame through the 13th frame, and so on.
By unifying the speech features, the embodiment of the present invention resolves the comparability between data indicators, reduces the adverse influence caused by outlier sample data, facilitates a comprehensive comparative evaluation of the speech features, and improves the voice training effect.
By splicing the features into longer, overlapping frames, the embodiment of the present invention captures more information over a longer duration and reduces the loss of information.
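The normalization-and-splicing step above can be sketched as follows: z-score each feature dimension, then concatenate every 10 consecutive 39-dimensional frames with a hop of 3 frames into overlapping 390-dimensional splice frames.

```python
# Sketch of the normalization-and-splicing step: z-score the features,
# then splice 10 consecutive 39-dim frames (hop 3) into 390-dim frames.
import numpy as np

def zscore(features):
    """Z-score per feature dimension: x* = (x - mu) / sigma."""
    mu = features.mean(axis=0)
    sigma = features.std(axis=0) + 1e-12   # guard constant dimensions
    return (features - mu) / sigma

def splice(features, window=10, hop=3):
    """Concatenate overlapping windows of frames into long splice frames."""
    n_frames, dim = features.shape
    starts = range(0, n_frames - window + 1, hop)
    return np.stack([features[s : s + window].reshape(window * dim)
                     for s in starts])
```

With a hop of 3, the second splice frame starts at the 4th input frame and covers frames 4 through 13, matching the description above.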
In a further preferred embodiment, after the speech features are processed to obtain target speech features that more closely represent the speaker, the target speech features are also input into a pre-trained speaker verification model and intrusive-noise model. Then, according to the output result, it is verified whether the voice information is the voice of one of the multiple preset speakers saved in the speaker verification model; when the voice information is the voice of a preset speaker, the voice information is obtained.
Specifically, after the speaker's speech features are extracted, it is verified whether the speech features belong to one of the preset speakers in the pre-trained speaker verification model, and the speaker is accepted or rejected according to the verification result. If the speech features are identified as an intruder impersonating a preset speaker, the voice information of that speaker is rejected.
Step S108: match the target speech features against the speaker speech features stored in the database.
Illustratively, the processed speech features are compared with the speaker speech features stored in the database to obtain the speaker speech features that match them.
In a preferred embodiment, before the extracted speech features are matched against the speaker speech features stored in the database, the speakers' speech features are collected in advance, and the speech features together with the corresponding speakers' identity information are saved in the database.
Specifically, since the environment is quiet during the collection of the speakers' speech features, the speakers' speech features are easy to obtain, and the speech features and the identity information of the corresponding speakers are stored in the database.
Step S110: according to the matching result, output the identity information of the speaker corresponding to the matched speaker speech feature, to obtain the speaker corresponding to the speech information.
Specifically, when the extracted speech feature matches the speech feature of identity information 1 of a speaker stored in the database, identity information 1 is output, and speaker A represented by identity information 1 is thereby obtained.
Through the embodiments of the present invention, the accuracy of speech recognition technology can be improved, and the user experience is greatly enhanced.
Embodiment two
Referring to Fig. 2, a schematic diagram of the hardware architecture of the computer device of Embodiment Two of the present invention is shown. The computer device 2 includes, but is not limited to, a memory 21, a processor 22, and a network interface 23, which can communicate with each other through a system bus. Fig. 2 illustrates only the computer device 2 with components 21-23; it should be understood that not all illustrated components are required, and more or fewer components may be implemented instead.
The memory 21 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disc, etc. In some embodiments, the memory 21 may be an internal storage unit of the computer device 2, such as the hard disk or memory of the computer device 2. In other embodiments, the memory 21 may also be an external storage device of the computer device 2, such as a plug-in hard disk, smart media card (Smart Media Card, SMC), secure digital (Secure Digital, SD) card, or flash card (Flash Card) equipped on the computer device 2. Of course, the memory 21 may also include both the internal storage unit and the external storage device of the computer device 2. In this embodiment, the memory 21 is generally used to store the operating system and various application software installed on the computer device 2, such as the program code of the speech recognition authentication system 20. In addition, the memory 21 may also be used to temporarily store various data that has been output or is to be output.
The processor 22 may, in some embodiments, be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor, or other data processing chip. The processor 22 is generally used to control the overall operation of the computer device 2. In this embodiment, the processor 22 is used to run the program code or process the data stored in the memory 21, for example to run the speech recognition authentication system 20.
The network interface 23 may include a wireless network interface or a wired network interface, and is generally used to establish a communication connection between the computer device 2 and other electronic devices. For example, the network interface 23 is used to connect the computer device 2 with an external terminal through a network, and to establish a data transmission channel and communication connection between the computer device 2 and the external terminal. The network may be a wireless or wired network such as an intranet (Intranet), the Internet, the Global System for Mobile Communications (GSM), Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth, or Wi-Fi.
Embodiment three
Referring to Fig. 3, a schematic diagram of the program modules of the speech recognition authentication system of Embodiment Three of the present invention is shown. In this embodiment, the speech recognition authentication system 20 may include or be divided into one or more program modules, which are stored in a storage medium and executed by one or more processors, so as to complete the present invention and realize the above-mentioned speech recognition authentication method. A program module in the embodiments of the present invention refers to a series of computer program instruction segments capable of completing a specific function, and is more suitable than the program itself for describing the execution process of the speech recognition authentication system 20 in the storage medium. The following description specifically introduces the function of each program module of this embodiment:
The obtaining module 201 is used for obtaining audio information.
In a preferred embodiment, when meeting minutes are being recorded, the environment contains the speaker's voice, silence, ambient noise, and non-ambient noise; the obtaining module 201 obtains these sounds, namely the audio information. It should be noted that non-ambient noise differs from the speaker's voice in short-time energy and spectral centroid.
The preprocessing module 202 is used for preprocessing the audio information, to obtain speech information from the audio information according to the short-time energy and spectral centroid of the audio information.
Illustratively, after the obtaining module 201 acquires the audio information, since the audio information includes the speech information of the speaker, silence, ambient noise, and non-ambient noise, the preprocessing module 202 needs to process the audio information to obtain the speech information from it. Silence refers to portions without vocalization: for example, the speaker may think or breathe while speaking, and makes no sound while thinking or breathing. The ambient noise includes, but is not limited to, sounds such as doors and windows opening or closing and objects colliding. The non-ambient noise includes, but is not limited to, coughing, clicking a mouse, or typing on a keyboard. Short-time energy and spectral centroid are two important indicators of audio information in silence detection technology: the short-time energy reflects the strength of the signal energy and can distinguish silence and ambient noise in a segment of audio, while the spectral centroid can distinguish the non-ambient noise portions. By combining the short-time energy and the spectral centroid, the effective audio, namely the speech information, is filtered out of the audio information.
In a preferred embodiment, the preprocessing module 202 is also used to extract multiple frames of short-time signals from the audio information according to preset rules, wherein the preset rules include a preset signal extraction time interval. The short-time energy and the spectral centroid of the multi-frame short-time signals are then calculated according to a silence detection algorithm. Next, the short-time energy is compared with a first preset value stored in the database, and the spectral centroid is compared with a second preset value stored in the database. When the short-time energy is higher than the first preset value and the spectral centroid is higher than the second preset value, the audio information is determined to be speech information, and the speech information is obtained.
The calculation formula of the short-time energy is: E = Σ_{n=1}^{N} s(n)^2, where E indicates the short-time energy, N indicates the number of frames of the short-time signal, N ≥ 2, and s(n) indicates the signal amplitude of the n-th frame short-time signal in the time domain.
Illustratively, the preprocessing module 202 extracts multi-frame short-time signals s(1), s(2), s(3), s(4), ..., s(N) from the audio information at a preset time interval (for example, 0.2ms), and then calculates the short-time energy of the extracted multi-frame short-time signals, to determine how strong the energy of the audio information is.
It should be noted that the short-time energy is the sum of squares of each frame signal and reflects the strength of the signal energy; when the signal energy is too weak, the signal is determined to be silence or ambient noise.
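The short-time energy computation described above can be sketched as follows (a minimal NumPy illustration, not the patent's implementation; the example amplitudes and frame length are assumptions for demonstration):

```python
import numpy as np

def short_time_energy(frame):
    """Sum of squared sample amplitudes: E = sum_n s(n)^2."""
    frame = np.asarray(frame, dtype=float)
    return float(np.sum(frame ** 2))

# A weak frame (silence or ambient noise) has far lower energy than speech.
quiet = 0.01 * np.ones(160)   # hypothetical low-amplitude frame
loud = 0.5 * np.ones(160)     # hypothetical speech-level frame
```

Comparing `short_time_energy(quiet)` with `short_time_energy(loud)` against a preset threshold realizes the "too weak, hence silence or ambient noise" decision.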
In a further preferred embodiment, the preprocessing module 202 is also used to obtain the frequencies respectively corresponding to the multi-frame short-time signals, and to calculate the spectral centroid of the audio information from the frequencies and the multi-frame short-time signals according to the silence detection algorithm, where the calculation formula of the spectral centroid is: C = Σ_{k=1}^{K} k·S(k) / Σ_{k=1}^{K} S(k), where C indicates the spectral centroid, K indicates the number of frequencies respectively corresponding to the N frames s(n), K ≥ 2 and is an integer, and S(k) indicates the spectral energy distribution obtained by the discrete Fourier transform corresponding to s(n) in the frequency domain.
It should be noted that the spectral centroid is also known as the spectral first moment; the smaller the value of the spectral centroid, the more the spectral energy is concentrated in the low-frequency range. Using the spectral centroid, the non-ambient noise portions can be removed, such as coughing, clicking a mouse, or typing on a keyboard.
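The spectral centroid formula above can be sketched as follows (an illustrative NumPy version using physical frequencies f(k) as the weights; the sample rate and tone frequency are assumptions for demonstration):

```python
import numpy as np

def spectral_centroid(frame, sample_rate):
    """Energy-weighted mean frequency: C = sum_k f(k)*S(k) / sum_k S(k)."""
    spectrum = np.abs(np.fft.rfft(frame))                     # S(k): magnitude spectrum
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)  # f(k) in Hz
    return float(np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-12))

# A pure 1 kHz tone should have its centroid near 1 kHz; low-frequency
# sounds pull the centroid down toward the low-frequency range.
fs = 8000
t = np.arange(1024) / fs
tone = np.sin(2 * np.pi * 1000 * t)
```

A low centroid value indicates energy concentrated at low frequencies, which is the cue the embodiment uses to separate the non-ambient noise portions.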
It should be noted that when the short-time energy and spectral centroid indicators are both above the set thresholds, the audio information is effective audio, namely the speech information of the speaker, and most of the ambient and non-ambient noise has been removed, so that the remaining speech information is purer and of higher quality, removing a large number of interference factors from the speech recognition process. In the embodiments of the present invention, high-quality speech information is obtained by setting the first preset value and the second preset value relatively high.
In a further preferred embodiment, the preprocessing module 202 is also used to determine that the audio information is invalid audio information and to delete it when the short-time energy is lower than the first preset value and/or the spectral centroid is lower than the second preset value. The invalid audio information includes at least: silence, ambient noise, and non-ambient noise.
Illustratively, if the short-time energy is lower than the first preset value, a quiet environment is indicated, and the audio information is silence or ambient noise. If the spectral centroid is lower than the second preset value, a non-quiet environment is indicated, and the audio information is non-ambient noise.
The feature extraction module 203 is used for performing speech feature extraction on the speech information.
In a preferred embodiment, the feature extraction module 203 performs windowing on the speech information using a Hamming window with a window length of 10 frames (100 milliseconds) and a hop of 3 frames (30 milliseconds), and then extracts the corresponding speech features.
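The windowing step can be sketched as below (an illustrative NumPy version; the 16 kHz sample rate is an assumption, while the 100 ms window and 30 ms hop come from the embodiment):

```python
import numpy as np

def frame_with_hamming(signal, sample_rate, win_ms=100, hop_ms=30):
    """Slice a signal into overlapping Hamming-windowed frames
    (100 ms window, 30 ms hop, as in the embodiment)."""
    win = int(sample_rate * win_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    window = np.hamming(win)
    frames = [signal[i:i + win] * window
              for i in range(0, len(signal) - win + 1, hop)]
    return np.array(frames)
```

Each returned row is one windowed analysis frame from which the speech features are then extracted.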
It should be noted that the speech features include, but are not limited to, spectral features, sound-quality features, and voiceprint features. Spectral features separate voice data, such as target speech and interfering speech, according to acoustic vibration frequency. Sound-quality and voiceprint features identify the speaker corresponding to the voice data to be tested according to the timbre of the voiceprint and the voice. Since speech discrimination is used to distinguish target speech from interfering speech in the voice data, only the spectral features of the speech information need to be obtained to complete speech discrimination. Here, "spectrum" is short for spectral density, and a spectral feature is a parameter reflecting the spectral density.
In a preferred embodiment, the speech information includes multiple single-frame voice data items. The feature extraction module 203 is also used to first perform a fast Fourier transform (FFT) on the single-frame voice data to obtain the power spectrum of the speech information, then apply a Mel filter bank to the power spectrum for dimensionality reduction to obtain the Mel spectrum, and finally perform cepstral analysis on the Mel spectrum to obtain the speech features.
Illustratively, since the human auditory perception system behaves like a complex nonlinear system, the obtained power spectrum cannot represent the nonlinear characteristics of the voice data well; therefore the spectrum must also be reduced in dimensionality using a Mel filter bank, so that the spectrum of the voice data to be tested comes closer to the frequencies of auditory perception. The Mel filter bank is composed of multiple overlapping triangular band-pass filters, each carrying three frequencies: a lower cutoff frequency, an upper cutoff frequency, and a center frequency. The center frequencies of these triangular band-pass filters are equally spaced on the Mel scale, which increases linearly below 1000 Hz and logarithmically above 1000 Hz. The cepstrum refers to the inverse Fourier transform of the logarithm of a signal's Fourier spectrum; since the general Fourier spectrum is a complex spectrum, the cepstrum is also called the complex cepstrum.
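The FFT → power spectrum → Mel filter bank → cepstral analysis pipeline can be sketched roughly as below. This is a simplified illustration: the filter count (26), cepstral dimension (13), and the use of a DCT-II for the cepstral step are common MFCC conventions assumed here, not values stated in the patent.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Overlapping triangular band-pass filters, equally spaced on the Mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        lo, mid, hi = bins[i - 1], bins[i], bins[i + 1]
        for k in range(lo, mid):          # rising edge of the triangle
            fbank[i - 1, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):          # falling edge of the triangle
            fbank[i - 1, k] = (hi - k) / max(hi - mid, 1)
    return fbank

def mel_cepstrum(frame, sample_rate, n_filters=26, n_ceps=13):
    power = np.abs(np.fft.rfft(frame)) ** 2 / len(frame)                   # power spectrum via FFT
    mel_spec = mel_filterbank(n_filters, len(frame), sample_rate) @ power  # Mel dimensionality reduction
    log_mel = np.log(mel_spec + 1e-12)
    n = np.arange(n_filters)                                               # cepstral analysis (DCT-II)
    return np.array([np.sum(log_mel * np.cos(np.pi * q * (2 * n + 1) / (2 * n_filters)))
                     for q in range(n_ceps)])
```

`mel_cepstrum` maps one windowed frame to a low-dimensional cepstral feature vector of the kind the feature extraction module produces.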
The processing module 204 is used for processing the speech features, to obtain target speech features closer to the speaker.
In a further preferred embodiment, the speech recognition authentication system further includes a normalization module 207, a splicing module 208, and a training module 209. The normalization module 207 is used to normalize the speech features using the Z-score standardization method, so as to unify the speech features, where the formula of the normalization is: x* = (x − μ)/σ, μ is the mean of the multiple speech information items, σ is the standard deviation of the multiple speech information items, x is the multiple single-frame voice data, and x* is the normalized speech feature. The splicing module 208 is used to splice the normalization results to form long, overlapping spliced frames. The training module 209 is used to input the spliced frames into a neural network to be trained, so as to obtain the target speech features and reduce the loss of the speech information.
Illustratively, referring to Fig. 2, the normalization results are spliced using a Hamming window with a window length of 10 frames and a hop of 3 frames, forming 390-dimensional features. Then every 10 frames form one splicing unit and are spliced together; for the specific splicing method, refer to Fig. 3.
It should be noted that since each frame is 39-dimensional, 10 frames stitched together yield 390 dimensions. Since the hop is 3 frames, stepping forward by 3 from the first frame, the next frames to be spliced together are the 4th frame to the 13th frame, and so on.
By unifying the speech features, the embodiments of the present invention resolve the comparability between data indicators, reduce the adverse effects caused by outlier sample data, help perform a comprehensive comparative evaluation of the speech features, and improve the speech training effect. The embodiments of the present invention form longer, overlapping frames by splicing features, in order to capture cross-frame information and reduce the loss of information over longer durations.
In a further preferred embodiment, the speech recognition authentication system further includes a speech verification module 210, which is used to input the speech features into a pre-trained speaker detection model and intrusion noise model, and to verify, according to the output result, whether the speech information is the voice of one of the multiple preset speakers saved in the speaker detection model; when the speech information is the voice of a preset speaker, the speech information is obtained.
Specifically, the speech verification module 210 verifies whether the speech feature belongs to one of the preset speakers in the pre-trained speaker detection model, and accepts or rejects the speaker according to the verification result. If the speech feature is identified as an impersonation by an intruder, the speech information of that speaker is rejected.
The matching module 205 is used for matching the target speech features against the speaker speech features stored in the database.
Illustratively, the matching module 205 compares the processed speech features with the speaker speech features stored in the database, to obtain the speaker speech feature that matches the speech feature.
In a preferred embodiment, the speech recognition authentication system 20 also collects the speech feature of the speaker in advance, and stores the speech feature and the identity information of the corresponding speaker in the database.
Specifically, since the environment during the collection of the speaker's speech feature is quiet, the speech feature of the speaker is easily obtained, and the speech feature and the identity information of the corresponding speaker are stored in the database.
The output module 206 is used for outputting, according to the matching result, the identity information of the speaker corresponding to the matched speaker speech feature, to obtain the speaker corresponding to the speech information.
Specifically, when the speech feature extracted by the feature extraction module 203 matches the speech feature of identity information 1 of a speaker stored in the database, the output module 206 outputs identity information 1, and speaker A represented by identity information 1 is thereby obtained.
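The matching and output steps can be sketched as below. The patent does not specify a similarity measure, so cosine similarity is an assumed stand-in, and the enrolled feature vectors are hypothetical:

```python
import numpy as np

def best_match(target, enrolled):
    """Return the identity whose enrolled feature vector is most similar
    to `target` (cosine similarity; an assumed measure, not the patent's)."""
    def cosine(a, b):
        a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return max(enrolled, key=lambda ident: cosine(target, enrolled[ident]))

# Hypothetical database of identity information -> enrolled speech feature.
database = {"identity 1": [0.9, 0.1, 0.2], "identity 2": [0.1, 0.8, 0.5]}
```

Calling `best_match` with an extracted target feature returns the stored identity information, which the output module then emits to identify the speaker.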
Through the embodiments of the present invention, the accuracy of speech recognition technology can be improved, and the user experience is greatly enhanced.
The present invention also provides a computer device capable of executing programs, such as a smartphone, tablet computer, notebook computer, desktop computer, rack server, blade server, tower server, or cabinet server (including an independent server, or a server cluster composed of multiple servers). The computer device of this embodiment includes at least, but is not limited to: a memory and a processor that can communicate with each other through a system bus.
This embodiment also provides a computer-readable storage medium, such as flash memory, a hard disk, a multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disc, server, or app store, on which a computer program is stored; the corresponding function is realized when the program is executed by a processor. The computer-readable storage medium of this embodiment is used to store the speech recognition authentication system 20, and when executed by a processor it realizes the speech recognition authentication method of Embodiment One.
The serial numbers of the above embodiments of the present invention are for description only and do not represent the merits of the embodiments.
Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be realized by means of software plus the necessary general hardware platform, and of course also by hardware, but in many cases the former is the better implementation.
The above is only a preferred embodiment of the present invention and is not intended to limit the scope of the present invention. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, applied directly or indirectly in other related technical fields, is likewise included within the scope of protection of the present invention.
Claims (10)
1. A speech recognition authentication method, characterized by comprising:
obtaining audio information;
preprocessing the audio information, to obtain speech information from the audio information according to the short-time energy and spectral centroid of the audio information;
performing speech feature extraction on the speech information;
processing the speech features, to obtain target speech features closer to the speaker;
matching the target speech features against the speaker speech features stored in a database; and
according to the matching result, outputting the identity information of the speaker corresponding to the matched speaker speech feature, to obtain the speaker corresponding to the speech information.
2. The speech recognition authentication method according to claim 1, characterized in that the step of preprocessing the audio information, to obtain speech information from the audio information according to the short-time energy and spectral centroid of the audio information, comprises:
extracting multiple frames of short-time signals from the audio information according to preset rules, wherein the preset rules include a preset signal extraction time interval;
calculating the short-time energy from the multi-frame short-time signals according to a silence detection algorithm;
calculating the spectral centroid according to the multi-frame short-time signals;
comparing the short-time energy with a first preset value stored in a database;
comparing the spectral centroid with a second preset value stored in the database;
when the short-time energy is higher than the first preset value and the spectral centroid is higher than the second preset value, determining that the audio information is speech information; and
obtaining the speech information.
3. The speech recognition authentication method according to claim 2, characterized in that the calculation formula of the short-time energy is: E = Σ_{n=1}^{N} s(n)^2, wherein E indicates the short-time energy, N indicates the number of frames of the short-time signal, N ≥ 2 and is an integer, and s(n) indicates the signal amplitude of the n-th frame short-time signal in the time domain.
4. The speech recognition authentication method according to claim 2, characterized in that the step of calculating the spectral centroid according to the multi-frame short-time signals comprises:
obtaining the frequencies respectively corresponding to the multi-frame short-time signals; and
calculating the spectral centroid of the audio information from the frequencies and the multi-frame short-time signals according to the silence detection algorithm, wherein the calculation formula of the spectral centroid is: C = Σ_{k=1}^{K} k·S(k) / Σ_{k=1}^{K} S(k), wherein C indicates the spectral centroid, K indicates the number of frequencies respectively corresponding to the N frames s(n), K ≥ 2 and is an integer, and S(k) indicates the spectral energy distribution obtained by the discrete Fourier transform corresponding to s(n) in the frequency domain.
5. The speech recognition authentication method according to claim 2, characterized in that after the step of comparing the spectral centroid with the second preset value stored in the database, the method further comprises:
when the short-time energy is lower than the first preset value and/or the spectral centroid is lower than the second preset value, determining that the audio information is invalid audio information, wherein the invalid audio information includes at least: silence, ambient noise, and non-ambient noise; and
deleting the audio information.
6. The speech recognition authentication method according to claim 1, characterized in that the step of processing the speech features, to obtain target speech features closer to the speaker, comprises:
normalizing the speech features using the Z-score standardization method, so as to unify the speech features, wherein the formula of the normalization is: x* = (x − μ)/σ, μ is the mean of the multiple speech information items, σ is the standard deviation of the multiple speech information items, x is the multiple single-frame voice data, and x* is the normalized speech feature;
splicing the normalization results to form long, overlapping spliced frames; and
inputting the spliced frames into a neural network, so as to train on the spliced frames and obtain the target speech features.
7. The speech recognition authentication method according to claim 1, characterized in that after the step of processing the speech features, to obtain target speech features closer to the speaker, the method further comprises:
inputting the target speech features into a pre-trained speaker detection model and intrusion noise model;
verifying, according to the output result, whether the speech information is the voice of one of the multiple preset speakers saved in the speaker detection model; and
when the speech information is the voice of the preset speaker, obtaining the speech information.
8. A speech recognition authentication system, characterized by comprising:
an obtaining module, for obtaining audio information;
a preprocessing module, for preprocessing the audio information, to obtain speech information from the audio information according to the short-time energy and spectral centroid of the audio information;
a feature extraction module, for performing speech feature extraction on the speech information;
a processing module, for processing the speech features, to obtain target speech features closer to the speaker;
a matching module, for matching the target speech features against the speaker speech features stored in a database; and
an output module, for outputting, according to the matching result, the identity information of the speaker corresponding to the matched speaker speech feature, to obtain the speaker corresponding to the speech information.
9. A computer device, characterized in that the computer device comprises a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein when the computer program is executed by the processor, the steps of the speech recognition authentication method according to any one of claims 1-7 are realized.
10. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, and the computer program can be executed by at least one processor, so that the at least one processor executes the steps of the speech recognition authentication method according to any one of claims 1-7.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910832042.4A CN110473552A (en) | 2019-09-04 | 2019-09-04 | Speech recognition authentication method and system |
PCT/CN2019/117554 WO2021042537A1 (en) | 2019-09-04 | 2019-11-12 | Voice recognition authentication method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910832042.4A CN110473552A (en) | 2019-09-04 | 2019-09-04 | Speech recognition authentication method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110473552A true CN110473552A (en) | 2019-11-19 |
Family
ID=68514996
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910832042.4A Pending CN110473552A (en) | 2019-09-04 | 2019-09-04 | Speech recognition authentication method and system |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110473552A (en) |
WO (1) | WO2021042537A1 (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112053695A (en) * | 2020-09-11 | 2020-12-08 | 北京三快在线科技有限公司 | Voiceprint recognition method and device, electronic equipment and storage medium |
CN112348527A (en) * | 2020-11-17 | 2021-02-09 | 上海桂垚信息科技有限公司 | Identity authentication method in bank transaction system based on voice recognition |
CN112927680A (en) * | 2021-02-10 | 2021-06-08 | 中国工商银行股份有限公司 | Voiceprint effective voice recognition method and device based on telephone channel |
CN113716246A (en) * | 2021-09-16 | 2021-11-30 | 安徽世绿环保科技有限公司 | Resident rubbish throwing traceability system |
CN113879931A (en) * | 2021-09-13 | 2022-01-04 | 厦门市特种设备检验检测院 | Elevator safety monitoring method |
CN114697759A (en) * | 2022-04-25 | 2022-07-01 | 中国平安人寿保险股份有限公司 | Virtual image video generation method and system, electronic device and storage medium |
CN115214541A (en) * | 2022-08-10 | 2022-10-21 | 海南小鹏汽车科技有限公司 | Vehicle control method, vehicle, and computer-readable storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103705333A (en) * | 2013-08-30 | 2014-04-09 | 李峰 | Method and device for intelligently stopping snoring |
CN104538036A (en) * | 2015-01-20 | 2015-04-22 | 浙江大学 | Speaker recognition method based on semantic cell mixing model |
CN106356052A (en) * | 2016-10-17 | 2017-01-25 | 腾讯科技(深圳)有限公司 | Voice synthesis method and device |
CN106782564A (en) * | 2016-11-18 | 2017-05-31 | 百度在线网络技术(北京)有限公司 | Method and apparatus for processing speech data |
US20180039888A1 (en) * | 2016-08-08 | 2018-02-08 | Interactive Intelligence Group, Inc. | System and method for speaker change detection |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030236663A1 (en) * | 2002-06-19 | 2003-12-25 | Koninklijke Philips Electronics N.V. | Mega speaker identification (ID) system and corresponding methods therefor |
JP4392805B2 (en) * | 2008-04-28 | 2010-01-06 | Kddi株式会社 | Audio information classification device |
CN102820033B (en) * | 2012-08-17 | 2013-12-04 | 南京大学 | Voiceprint identification method |
CN104078039A (en) * | 2013-03-27 | 2014-10-01 | 广东工业大学 | Voice recognition system of domestic service robot on basis of hidden Markov model |
CN106782565A (en) * | 2016-11-29 | 2017-05-31 | 重庆重智机器人研究院有限公司 | A kind of vocal print feature recognition methods and system |
CN108877775B (en) * | 2018-06-04 | 2023-03-31 | 平安科技(深圳)有限公司 | Voice data processing method and device, computer equipment and storage medium |
2019
- 2019-09-04: CN application CN201910832042.4A filed; published as CN110473552A (status: Pending)
- 2019-11-12: WO application PCT/CN2019/117554 filed (WO2021042537A1)
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103705333A (en) * | 2013-08-30 | 2014-04-09 | 李峰 | Method and device for intelligently stopping snoring |
CN104538036A (en) * | 2015-01-20 | 2015-04-22 | 浙江大学 | Speaker recognition method based on semantic cell mixing model |
US20180039888A1 (en) * | 2016-08-08 | 2018-02-08 | Interactive Intelligence Group, Inc. | System and method for speaker change detection |
CN106356052A (en) * | 2016-10-17 | 2017-01-25 | 腾讯科技(深圳)有限公司 | Voice synthesis method and device |
CN106782564A (en) * | 2016-11-18 | 2017-05-31 | 百度在线网络技术(北京)有限公司 | Method and apparatus for processing speech data |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112053695A (en) * | 2020-09-11 | 2020-12-08 | 北京三快在线科技有限公司 | Voiceprint recognition method and device, electronic equipment and storage medium |
CN112348527A (en) * | 2020-11-17 | 2021-02-09 | 上海桂垚信息科技有限公司 | Identity authentication method in bank transaction system based on voice recognition |
CN112927680A (en) * | 2021-02-10 | 2021-06-08 | 中国工商银行股份有限公司 | Voiceprint effective voice recognition method and device based on telephone channel |
CN112927680B (en) * | 2021-02-10 | 2022-06-17 | 中国工商银行股份有限公司 | Voiceprint effective voice recognition method and device based on telephone channel |
CN113879931A (en) * | 2021-09-13 | 2022-01-04 | 厦门市特种设备检验检测院 | Elevator safety monitoring method |
CN113716246A (en) * | 2021-09-16 | 2021-11-30 | 安徽世绿环保科技有限公司 | Resident rubbish throwing traceability system |
CN114697759A (en) * | 2022-04-25 | 2022-07-01 | 中国平安人寿保险股份有限公司 | Virtual image video generation method and system, electronic device and storage medium |
CN114697759B (en) * | 2022-04-25 | 2024-04-09 | 中国平安人寿保险股份有限公司 | Virtual image video generation method and system, electronic device and storage medium |
CN115214541A (en) * | 2022-08-10 | 2022-10-21 | 海南小鹏汽车科技有限公司 | Vehicle control method, vehicle, and computer-readable storage medium |
CN115214541B (en) * | 2022-08-10 | 2024-01-09 | 海南小鹏汽车科技有限公司 | Vehicle control method, vehicle, and computer-readable storage medium |
Also Published As
Publication number | Publication date |
---|---|
WO2021042537A1 (en) | 2021-03-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110473552A (en) | Speech recognition authentication method and system | |
WO2021128741A1 (en) | Voice emotion fluctuation analysis method and apparatus, and computer device and storage medium | |
CN108564940A (en) | Audio recognition method, server and computer readable storage medium | |
CN110473566A (en) | Audio separation method, device, electronic equipment and computer readable storage medium | |
CN108428446A (en) | Audio recognition method and device | |
CN110457432A (en) | Interview methods of marking, device, equipment and storage medium | |
CN110675862A (en) | Corpus acquisition method, electronic device and storage medium | |
CN110544469B (en) | Training method and device of voice recognition model, storage medium and electronic device | |
CN110880329A (en) | Audio identification method and equipment and storage medium | |
CN109584884A (en) | Speech identity feature extractor and classifier training method, and related device | |
US20060100866A1 (en) | Influencing automatic speech recognition signal-to-noise levels | |
CN109599117A (en) | Audio data recognition method and human-voice anti-replay identification system | |
CN109147798B (en) | Speech recognition method, device, electronic equipment and readable storage medium | |
CN110136726A (en) | Voice gender estimation method, device, system and storage medium | |
CN108154371A (en) | Electronic device, the method for authentication and storage medium | |
CN112382300A (en) | Voiceprint identification method, model training method, device, equipment and storage medium | |
CN110738998A (en) | Voice-based personal credit evaluation method, device, terminal and storage medium | |
CN113223536A (en) | Voiceprint recognition method and device and terminal equipment | |
CN110992940B (en) | Voice interaction method, device, equipment and computer-readable storage medium | |
CN109545226A (en) | Speech recognition method, device and computer-readable storage medium | |
CN113782032A (en) | Voiceprint recognition method and related device | |
CN109273012A (en) | Identity authentication method based on speaker identification and spoken digit recognition | |
CN109087647A (en) | Application on Voiceprint Recognition processing method, device, electronic equipment and storage medium | |
CN113112992B (en) | Voice recognition method and device, storage medium and server | |
CN112420056A (en) | Speaker identity authentication method and system based on variational self-encoder and unmanned aerial vehicle |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20191119 |