CN110148402A - Method of speech processing, device, computer equipment and storage medium - Google Patents
- Publication number: CN110148402A (application number CN201910374806.XA, also referenced as CN201910374806A)
- Authority
- CN
- China
- Prior art keywords
- voice
- card
- speech
- abnormal sound
- sound signal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
- G10L15/05—Word boundary detection
- G10L15/26—Speech to text systems
Abstract
Embodiments of this application provide a speech processing method, apparatus, computer device, and computer-readable storage medium, belonging to the technical field of speech recognition. To perform speech recognition in a non-streaming manner, the user first records the entire utterance in one pass, and the speech is obtained in non-streaming form. It is first judged whether the speech contains an abnormal sound signal, the abnormal sound signal including a mute-phase signal. If the speech contains an abnormal sound signal, the signal is located by voice activity detection, the speech is cut to delete the abnormal sound signal, and multiple speech fragments are obtained. The fragments are synthesized into a new speech according to their original order in the speech, and the new speech is then recognized as a whole sentence by the speech recognition server. Because the new speech forms a whole sentence, the acoustic model and language model can be used effectively during recognition, which effectively improves the accuracy and efficiency of speech recognition.
Description
Technical field
This application relates to the technical field of speech recognition, and more particularly to a speech processing method, apparatus, computer device, and computer-readable storage medium.
Background technique
When speech is recorded for recognition, especially when the recording is long, the speaker may pause during recording, so the recorded speech contains blank periods and the acquired speech signal is discontinuous. During recognition, such discontinuous speech cannot make effective use of the acoustic model and language model, which reduces the efficiency of speech recognition. For example, in some business scenarios an identity card number must be verified, and submitting the number by speech recognition is a convenient and quick way to do so. However, because an identity card number is long, users usually pause while reading it. If the captured audio is uploaded to the server for recognition in a streaming manner, results are produced in real time, but the acoustic model and language model based on identity card numbers cannot be fully used, errors are likely, and the efficiency of recognizing the identity card number is reduced.
Summary of the invention
Embodiments of this application provide a speech processing method, apparatus, computer device, and computer-readable storage medium, which can solve the problem of low speech recognition efficiency in the conventional technology.
In a first aspect, an embodiment of this application provides a speech processing method. The method includes: obtaining speech in a non-streaming manner through an input device; judging whether the speech contains an abnormal sound signal, the abnormal sound signal including a mute-phase signal; if the speech contains the abnormal sound signal, cutting the speech by voice activity detection to delete the abnormal sound signal and obtain multiple speech fragments; performing speech synthesis on the multiple speech fragments according to their original order in the speech to obtain a new speech; and performing speech recognition on the new speech.
In a second aspect, an embodiment of this application further provides a speech processing apparatus, comprising: an acquiring unit for obtaining speech in a non-streaming manner through an input device; a judging unit for judging whether the speech contains an abnormal sound signal, the abnormal sound signal including a mute-phase signal; a cutting unit for, if the speech contains the abnormal sound signal, cutting the speech by voice activity detection to delete the abnormal sound signal and obtain multiple speech fragments; a synthesis unit for performing speech synthesis on the multiple speech fragments according to their original order in the speech to obtain a new speech; and a recognition unit for performing speech recognition on the new speech.
In a third aspect, an embodiment of this application further provides a computer device comprising a memory and a processor. A computer program is stored in the memory, and the processor implements the speech processing method when executing the computer program.
In a fourth aspect, an embodiment of this application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the speech processing method.
Embodiments of this application provide a speech processing method, apparatus, computer device, and computer-readable storage medium. To perform speech recognition in a non-streaming manner, the user first records the entire utterance in one pass, and the speech is obtained in non-streaming form. It is first judged whether the speech contains an abnormal sound signal, the abnormal sound signal including a mute-phase signal. If the speech contains the abnormal sound signal, the signal is located by voice activity detection, the speech is cut, and the abnormal sound signal is deleted to obtain multiple speech fragments. The fragments are synthesized into a new speech according to their original order in the speech, and the new speech is then recognized as a whole sentence by the speech recognition server. Because the new speech forms a whole sentence, the acoustic model and language model can be used effectively during recognition, which effectively improves the accuracy and efficiency of speech recognition.
Detailed description of the invention
To illustrate the technical solutions in the embodiments of this application more clearly, the accompanying drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of this application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of an application scenario of the speech processing method provided by an embodiment of this application;
Fig. 2 is a schematic flowchart of the speech processing method provided by an embodiment of this application;
Fig. 3 is a waveform diagram of a speech signal in the speech processing method provided by an embodiment of this application;
Fig. 4 is a flowchart of the speech recognition principle in the speech processing method provided by an embodiment of this application;
Fig. 5 is a schematic block diagram of the speech processing apparatus provided by an embodiment of this application;
Fig. 6 is another schematic block diagram of the speech processing apparatus provided by an embodiment of this application; and
Fig. 7 is a schematic block diagram of the computer device provided by an embodiment of this application.
Specific embodiment
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the accompanying drawings in the embodiments of this application. Obviously, the described embodiments are only some, not all, of the embodiments of this application. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of this application.
It should be understood that when used in this specification and the appended claims, the terms "include" and "comprise" indicate the presence of the described features, integers, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or sets thereof.
It should also be understood that the terms used in this specification are for the purpose of describing particular embodiments only and are not intended to limit this application. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" used in this specification and the appended claims refers to, and includes, any and all possible combinations of one or more of the associated listed items.
Referring to Fig. 1, Fig. 1 is a schematic diagram of an application scenario of the speech processing method provided by an embodiment of this application.
The application scenario includes:
(1) A terminal. The terminal, also called the front end, is equipped with a speech input component such as a microphone to receive the speech input by the user. The terminal may be an electronic device such as a laptop, smartwatch, tablet computer, or desktop computer; the terminal in Fig. 1 is connected to the server.
(2) A server. The server mainly performs speech recognition. The server may be a single server, a server cluster, or a cloud server; if it is a server cluster, it may further include a primary server and secondary servers.
Continuing to refer to Fig. 1, as shown in Fig. 1, in this embodiment of the application the steps of the speech processing method are mainly executed on the server side to explain the technical solution. The workflow of each component in Fig. 1 is as follows: the terminal receives the complete speech input by the user through a speech input device and sends the speech to the server, so that the server obtains the speech in a non-streaming manner; the server judges whether the speech contains an abnormal sound signal, the abnormal sound signal including a mute-phase signal; if the speech contains the abnormal sound signal, the server cuts the speech by voice activity detection to delete the abnormal sound signal and obtain multiple speech fragments, synthesizes the fragments into a new speech according to their original order in the speech, and finally performs speech recognition on the new speech to obtain the speech recognition result.
It should be noted that the speech processing method in this embodiment can be applied to a terminal as well as to a server, as long as the speech is processed before the server recognizes it. Likewise, the application environment of the method is not limited to the one shown in Fig. 1: the speech processing and the speech recognition may both be applied together in a computer device such as a terminal, as long as the processing is carried out before the computer device performs speech recognition. The above application scenario merely illustrates the technical solution of this application and is not intended to limit it; the described connection relationships may also take other forms.
Fig. 2 is a schematic flowchart of the speech processing method provided by an embodiment of this application. The speech processing method is applied to a computer device in Fig. 1, such as the server, to complete all or part of the functions of the method.
Referring to Fig. 2, as shown in Fig. 2, the method includes the following steps S210-S250:
S210. Obtain speech in a non-streaming manner through an input device.
Here, the streaming manner means that the speech recognition device obtains the audio stream of the speech in real time and performs speech recognition while the stream is being received.
The non-streaming manner means that after the speech recognition device has obtained the speech within a preset time, or speech of a preset size, it performs complete speech recognition on the whole sentence or whole passage at once.
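A minimal sketch of the streaming/non-streaming contrast described above, assuming a toy recognizer that simply joins received symbols; all function names here are hypothetical and are not part of the patent disclosure.

```python
def recognize(audio):
    # Stand-in for the speech recognition server: joins the buffered
    # symbols into one recognition result.
    return "".join(audio)

def streaming_recognize(chunks):
    # Streaming mode: each chunk of audio is recognized as soon as it
    # arrives, yielding partial results in real time.
    return [recognize([chunk]) for chunk in chunks]

def non_streaming_recognize(chunks):
    # Non-streaming mode: buffer the whole utterance first, then run a
    # single whole-sentence recognition, so sentence-level acoustic and
    # language models can be applied.
    buffered = []
    for chunk in chunks:
        buffered.append(chunk)
    return recognize(buffered)
```

In streaming mode the partial results "A", "B", "C" appear one by one; in non-streaming mode a single result "ABC" is produced from the whole utterance.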
Specifically, the input device may be a terminal, or a speech input module such as a microphone. After the user inputs speech through the input device, the whole sentence or passage is uploaded to the server, so that the server obtains the speech in a non-streaming manner through the input device. For example, in some business scenarios an identity card number must be verified, and submitting it by speech recognition is a convenient and quick way to do so. If the captured audio is uploaded directly to the speech recognition server in a streaming manner, the server obtains the speech of the identity card number in real time and recognizes it while receiving it, producing the recognition result of the identity card number in real time. For example, suppose the identity card number to be recognized corresponds to the speech "ABCDEFGH". After the speech recognition server receives the speech of "A", it recognizes "A"; after receiving the speech of "B", it recognizes "B"; after receiving the speech of "C", it recognizes "C"; and so on. The speech of "A", "B", "C", etc. is uploaded separately, and the received audio is recognized piece by piece, so partial results of the identity card number come out in real time: recognizing "A" yields the result for "A" in real time, recognizing "B" yields the result for "B" in real time, and so on.
If instead the captured audio is uploaded to the speech recognition server in a non-streaming manner, the server obtains the speech of the complete identity card number at once, then performs speech recognition on it and obtains the recognition result of the complete identity card number. Specifically, the user records the speech of the entire identity card number in one pass, and the speech of the full number is then uploaded as a whole. The uploaded identity-card speech is cut, the speech segments are retained and spliced together, and the result is sent to the speech recognition server for recognition, yielding the recognition result of the complete identity card number. For example, if the identity card number to be recognized corresponds to the speech "ABCDEFGH", the server obtains the speech of the full number "ABCDEFGH" at once and then performs recognition, obtaining the recognition result of the complete number "ABCDEFGH". That is, after the user has recorded the speech of the whole identity card number "ABCDEFGH" in one pass, the speech is uploaded to the speech recognition server as a whole rather than "A", "B", "C", etc. being uploaded separately. The uploaded identity-card speech "ABCDEFGH" is then cut, the speech segments are retained and spliced together, and the result is sent to the speech recognition server for recognition, yielding the recognition result of the complete identity card number "ABCDEFGH".
S220. Judge whether the speech contains an abnormal sound signal, the abnormal sound signal including a mute-phase signal.
Here, an abnormal sound signal is a segment with a steep change in the audio waveform, and includes mute-phase signals. The mute phase is a silent period within the speech. Silence arises in several situations: first, one party in an interaction is listening to the other speak; second, a pause between passages of speech caused by thinking or by a short interval; third, pauses in the middle of speaking, such as hesitation, breathing, or stuttering. In the first case the pause interval is long and the frequency of occurrence is low; in the third case the pause interval is short and the frequency of occurrence is high; the second case lies between the first and the third. This characteristic of the speech source is called its switching characteristic, or sometimes its speech/silence characteristic. Referring to Fig. 3, Fig. 3 is a waveform diagram of a speech signal in the speech processing method provided by an embodiment of this application; as shown in Fig. 3, for an audio waveform L, the position L2 is a mute phase in the waveform.
Specifically, waveform audio is the most common Windows multimedia feature: a waveform audio device can capture sound through a microphone, convert it into numerical values, and store them in a wave file in memory or on disk. Sound is vibration. We perceive sound when it changes the air pressure on the eardrum. A microphone senses these vibrations and converts them into an electric current; conversely, a current becomes sound again through an amplifier and a loudspeaker. Traditionally, sound was stored in an analog manner (such as audio tape and disc records), the vibrations being stored as magnetic pulses or groove profiles. When sound is converted into a current, it can be represented as a waveform that vibrates over time, and the most natural form of vibration is represented by a sine wave. A sine wave has two parameters: the amplitude, i.e. the peak swing within one cycle, and the frequency. Amplitude corresponds to volume, and frequency corresponds to pitch. In general, the human ear can perceive sine waves ranging from low-frequency sounds of 20 Hz (cycles per second) up to high-frequency sounds of 20,000 Hz. Whether the speech contains an abnormal sound signal can therefore be judged by the detected amplitude, i.e. the volume: an abnormal signal shows up in the audio waveform of the speech as a segment with a steep change. That is, the judgement is whether the speech contains a mute period, or a piercing, loud sound whose volume obviously exceeds the swings of normal speech. Owing to the nature of speech, the mute-phase signals contained in speech are the more numerous, so detecting abnormal signals mainly means detecting the mute-phase signals in the speech.
Further, in one embodiment, the step of judging whether the speech contains an abnormal sound signal includes:
detecting whether the audio waveform of the speech contains a waveform whose audio amplitude is less than a first preset threshold; and
if the audio waveform contains a waveform whose audio amplitude is less than the first preset threshold, determining that the speech contains the abnormal sound signal.
Here, the first preset threshold corresponds to a volume below normal human hearing; a waveform whose audio amplitude is below the first preset threshold describes a low sound, or what is called silence.
Specifically, since the human ear perceives sounds roughly in the range of 20 Hz to 20,000 Hz, for ordinary spoken communication a signal below the audible range appears as a soundless mute phase, while one above the audible range exceeds human hearing. Under normal circumstances, then, such out-of-range signals show up in the audio waveform as segments with steep changes, i.e. as abnormal sound signals contained in the speech. Whether the speech contains a mute-phase signal can be judged by detecting whether the audio waveform of the speech contains a waveform whose audio amplitude is less than the first preset threshold; if it does, it is determined that the speech contains the abnormal sound signal, namely a mute-phase signal.
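The mute-phase check above can be sketched as a simple amplitude scan (an illustrative sketch only, not the patent's implementation; the names `contains_mute_phase`, `first_threshold`, and `min_run` are hypothetical):

```python
def contains_mute_phase(waveform, first_threshold, min_run=3):
    """Return True if the waveform holds `min_run` consecutive samples
    whose absolute amplitude stays below `first_threshold`,
    i.e. a mute-phase candidate."""
    run = 0
    for sample in waveform:
        if abs(sample) < first_threshold:
            run += 1
            if run >= min_run:
                return True
        else:
            run = 0
    return False
```

A real detector would typically work on per-frame energy rather than raw samples, but the threshold comparison is the same idea.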
Further, the abnormal sound signal also includes a shock-wave signal. A shock wave (English: shock wave) is the propagation of a discontinuous peak in a medium; the discontinuous peak causes physical properties of the medium such as pressure, temperature, and density to change in a jump. Whenever the speed of a wave source exceeds the propagation speed of its wave, the wave is called a shock wave. Continuing to refer to Fig. 3, as shown in Fig. 3, for the audio waveform L, the position L1 is a shock wave in the waveform.
In one embodiment, the step of judging whether the speech contains an abnormal sound signal includes:
detecting whether the audio waveform of the speech contains a waveform whose audio amplitude is greater than a second preset threshold.
Here, the second preset threshold corresponds to a volume beyond normal human hearing; a waveform whose audio amplitude is greater than the second preset threshold describes a shock wave, or what is called a high-pitched sound.
Specifically, a signal above the audible range also shows up in the audio waveform as a segment with a steep change, i.e. as an abnormal sound signal contained in the speech. Whether the speech contains a shock-wave signal can be judged by detecting whether the audio waveform of the speech contains a waveform whose audio amplitude is greater than the second preset threshold; if it does, it is determined that the speech contains the abnormal sound signal. This further filters out noise in the speech and improves the accuracy of speech recognition.
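The second-threshold check can be sketched the same way (illustrative only; the name `contains_shock_wave` is hypothetical and not from the patent):

```python
def contains_shock_wave(waveform, second_threshold):
    """Return True if any sample's absolute amplitude exceeds the second
    preset threshold, i.e. a shock-wave candidate."""
    return any(abs(sample) > second_threshold for sample in waveform)
```

In practice the two threshold checks would be run together over the same framed signal, flagging both mute phases and shock waves in one pass.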
S230. If the speech contains the abnormal sound signal, cut the speech by voice activity detection to delete the abnormal sound signal and obtain multiple speech fragments.
Here, voice activity detection (English: Voice Activity Detection, abbreviated VAD), also called speech endpoint detection, can identify and eliminate prolonged mute phases in a speech signal stream.
Cutting a waveform segment is similar to deleting one, except that deleting removes the selected waveform, whereas cutting removes the unselected waveform; the two operations have opposite effects. For example, cutting can be done with GoldWave, a digital recording and editing program, in which the button used to cut a waveform segment is Trim; after cutting, GoldWave automatically enlarges and displays the remaining waveform.
Specifically, since "cutting" is a term of art in waveform splicing, it refers specifically to deleting the unselected waveform. After the abnormal sound signal has been located by voice activity detection, the abnormal sound signal is left unselected and the normal sound signal around it is selected, so that the waveform of the unselected abnormal sound signal is deleted, or cut off, by the cutting operation; what remains is the selected non-abnormal sound signal, i.e. the normal sound signal. The server obtains the speech in a non-streaming manner and identifies the abnormal sound signals in it by voice activity detection, for example identifying whether the speech contains mute-phase signals and shock-wave signals. If the speech contains abnormal sound signals such as mute-phase signals and shock-wave signals, these abnormal signals are left unselected, the normal sound signal outside them is selected, and the speech is then cut to delete the mute-phase and shock signals in it, yielding multiple fragments of normal sound signal. If the speech contains no such abnormal sound signals, the speech is continuous. Continuing to refer to Fig. 3, suppose L in Fig. 3 is the audio waveform of the identity card number "ABCDEFGH". After the speech L of the identity card number is obtained, it is detected whether L contains the abnormal sound signals L1 and L2. If the speech contains L1 and L2, e.g. L is ABC L1 DEF L2 GH, the abnormal signals L1 and L2 in L are identified by voice activity detection; when the waveform segments are cut, L1 and L2 are left unselected and the normal sound signal "ABCDEFGH" outside L1 and L2 is selected, so that the unselected abnormal waveforms L1 and L2 are deleted during the cut and what remains is the selected audio waveform of the normal sound signal "ABCDEFGH". Deleting the abnormal sound signals L1 and L2 from the speech thus yields the speech fragments ABC, DEF, and GH of normal sound signal.
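The cutting step can be sketched as keeping only samples whose amplitude lies between the two preset thresholds, each maximal kept run becoming one fragment (a simplified stand-in for VAD-based cutting; all names are hypothetical):

```python
def cut_speech(waveform, first_threshold, second_threshold):
    """Drop samples that are abnormal (below the first threshold or above
    the second); each maximal run of retained samples becomes one
    speech fragment."""
    fragments, current = [], []
    for sample in waveform:
        if first_threshold <= abs(sample) <= second_threshold:
            current.append(sample)     # normal sound: keep selecting
        elif current:
            fragments.append(current)  # abnormal sample ends a fragment
            current = []
    if current:
        fragments.append(current)
    return fragments
```

Here a low run plays the role of L2 (mute phase) and an over-threshold spike plays the role of L1 (shock wave), mirroring the ABC/DEF/GH example above.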
S240, multiple sound bites are subjected to speech synthesis according to original sequence in each leisure voice to obtain
To new speech.
Wherein, speech synthesis, including parameter synthesis and voice joint.Voice is presented in a manner of waveform, is just spelled for waveform
It connects, waveform concatenation, which refers to, to be spliced between speech waveform segment to export continuous flow, and PSOLA algorithm is waveform concatenation skill
One kind of art.
Specifically, performing speech synthesis on the multiple voice segments according to their original order in the voice to obtain the new voice may be performing waveform concatenation of the multiple voice segments according to their original order in the voice to obtain the new voice; that is, the multiple acquired voice segments are spliced together according to their original order in the voice for subsequent speech recognition. Please continue to refer to Fig. 3. The voice is cut by voice activity detection to cut off the two abnormal sound signals L1 and L2 in the voice, obtaining the multiple voice segments ABC, DEF and GH; the voice segments ABC, DEF and GH are then spliced together to obtain the voice ABCDEFGH, that is, the complete and continuous audio of the ID card number "ABCDEFGH", so that the acoustic model and the language model based on ID card numbers can be fully and efficiently used. In the embodiments of the present application, the ID card voice is recognized as a whole sentence, which can efficiently use the acoustic model and the language model in speech recognition and effectively improve the recognition accuracy of the ID card voice.
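Step S240's splicing then reduces to concatenating the retained segments in their original order. A sketch, assuming the segments are numpy arrays such as those produced by the cutting step:

```python
import numpy as np

def splice(segments):
    """Waveform concatenation: join voice segments in their original order."""
    return np.concatenate(segments)

# the three segments of Fig. 3, represented by toy sample values
abc, def_, gh = np.array([1, 2, 3]), np.array([4, 5, 6]), np.array([7, 8])
new_voice = splice([abc, def_, gh])
print(new_voice.tolist())  # [1, 2, 3, 4, 5, 6, 7, 8]
```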
Further, the waveform concatenation speech synthesis technique directly cascades waveforms in a speech waveform database to output a continuous voice stream. These speech waveforms are taken from the words and sentences of natural speech and implicitly carry the influence of tone, stress and speaking rate, so the synthesized speech is clear and natural. Waveform concatenation speech synthesis techniques include the PSOLA algorithm and the time-frequency interpolation method. The time-frequency interpolation method (in English, Time Frequency Interpolation, abbreviated TFI) is used to realize waveform splicing. In this method, the voice signal is passed through an LPC inverse filter to obtain the excitation source, which is further pitch-marked and transformed to the frequency domain as a so-called prototype; the prototypes are stored, and at synthesis time a prototype is taken out, analyzed and adjusted for prosody accordingly, then converted back to a time-domain signal through an LPC synthesis filter to obtain the synthesized speech.
The PSOLA algorithm, i.e., the Pitch Synchronous Overlap Add technique, is one kind of waveform concatenation technique. It is mainly used for splicing speech waveform segments: the prosodic features of the concatenation units are first adjusted according to the semantics using the PSOLA algorithm, so that the synthesized waveform not only keeps the main segmental features of the original speech primitives, but also makes the prosodic features of the concatenation units conform to the semantics, thereby obtaining very high intelligibility and naturalness. When adjusting the prosodic features of the concatenation units, the waveform is modified in units of pitch periods, taking the integrity of the pitch period as the basic premise for guaranteeing a smooth and continuous waveform and spectrum. PSOLA algorithms include TD-PSOLA and FD-PSOLA. TD-PSOLA includes the following steps:
1) Pitch-synchronous analysis. Accurate pitch-synchronous marks are made on the original speech signal, and the original speech signal is multiplied by a series of pitch-synchronous window functions to obtain a number of overlapping short-time analysis signals. The window function uses a standard Hanning window or Hamming window with a window length of two pitch periods, so that adjacent short-time analysis signals have a 50% overlap. The accuracy of the pitch period and of its initial position is extremely important, as it greatly affects the quality of the synthesized speech.
2) Modification of the intermediate representation. First, according to the pitch curve and suprasegmental features of the original speech waveform and the modification requirements of the target pitch curve and suprasegmental features, the mapping relation of pitch periods between the synthesized waveform and the original waveform is established; then the short-time synthesis signal sequence required for synthesis is determined from this mapping relation.
3) Pitch-synchronous overlap-add. The short-time synthesis signal sequence is arranged synchronously with the target pitch periods and overlap-added to obtain the synthesized waveform; at this point, the synthesized speech waveform has the desired suprasegmental features.
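The three TD-PSOLA steps above can be sketched in a deliberately simplified form: two-pitch-period Hanning-windowed grains are extracted at the analysis pitch marks, each synthesis mark (re-spaced to change the pitch) borrows the nearest analysis grain, and the grains are overlap-added. Real TD-PSOLA uses accurately detected, possibly irregular pitch marks; the constant period and the pitch factor here are assumptions for illustration.

```python
import numpy as np

def td_psola(signal, period, pitch_factor):
    """Simplified TD-PSOLA pitch modification.

    Step 1: window two-pitch-period grains at each analysis pitch mark
            (Hanning window, 50% overlap between adjacent grains).
    Step 2: map synthesis marks, spaced period / pitch_factor, back to the
            nearest analysis mark (the pitch-period mapping relation).
    Step 3: overlap-add the grains at the synthesis marks.
    """
    window = np.hanning(2 * period)
    ana_marks = np.arange(period, len(signal) - period, period)
    syn_period = int(round(period / pitch_factor))
    out = np.zeros(len(signal))
    for syn_mark in range(period, len(signal) - period, syn_period):
        ana_mark = ana_marks[np.argmin(np.abs(ana_marks - syn_mark))]
        grain = signal[ana_mark - period:ana_mark + period] * window
        out[syn_mark - period:syn_mark + period] += grain
    return out

# raise the pitch of a 100-sample-period tone by 25%
tone = np.sin(2 * np.pi * np.arange(8000) / 100)
shifted = td_psola(tone, period=100, pitch_factor=1.25)
```

Because grains are whole pitch periods, the waveform stays smooth at the joins, which is exactly the premise stated above for guaranteeing waveform and spectrum continuity.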
The FD-PSOLA algorithm is similar to the TD-PSOLA algorithm and likewise consists of the three processes of pitch-synchronous analysis, modification of the intermediate representation, and pitch-synchronous overlap-add. TD-PSOLA operates in the time domain and is better suited to changes of duration; when a change of fundamental frequency is involved, especially when the change is large, it easily causes aliasing of the overlapped units. In the FD-PSOLA algorithm, not only can the time scale be changed, but the signal can also be appropriately adjusted in the frequency domain. The specific steps are as follows:
1) A discrete Fourier transform is performed on each short-time analysis signal to obtain the analysis Fourier spectrum of the signal.
2) The spectral envelope and the excitation source spectrum of the short-time analysis Fourier spectrum are separated by homomorphic filtering.
3) The spectrum is compressed or stretched. The excitation source spectrum can be compressed or stretched with linear interpolation, but this processing method easily loses information during the interpolation. Drawing on the sinusoidal model method, the compression and stretching of the excitation source spectrum are instead realized by sampling the excitation source spectrum and the spectral envelope at the new frequency points, obtaining a new Fourier spectrum. This avoids interpolation of complex spectral values and achieves the purpose of modifying the Fourier spectrum by modifying the frequency-axis coordinates and interpolating the spectral envelope.
4) The short-time synthesis signal is obtained.
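Steps 1 and 2 — obtaining the Fourier spectrum and separating envelope from excitation by homomorphic (cepstral) filtering — can be sketched as follows. The cepstral cutoff `lifter_len` is an assumed illustrative value, and this is a bare-bones separation, not a full FD-PSOLA implementation.

```python
import numpy as np

def homomorphic_split(frame, lifter_len=32):
    """Split a short-time spectrum into spectral envelope and excitation
    source spectrum by homomorphic (cepstral) filtering (FD-PSOLA steps 1-2).
    """
    spectrum = np.fft.rfft(frame)                  # step 1: analysis spectrum
    log_mag = np.log(np.abs(spectrum) + 1e-12)
    cepstrum = np.fft.irfft(log_mag)               # real, even-symmetric cepstrum
    low = np.zeros_like(cepstrum)
    low[:lifter_len] = cepstrum[:lifter_len]       # low quefrency -> envelope
    low[-lifter_len + 1:] = cepstrum[-lifter_len + 1:]
    envelope = np.exp(np.fft.rfft(low).real)       # smooth spectral envelope
    excitation = np.abs(spectrum) / envelope       # residual excitation spectrum
    return envelope, excitation

frame = np.hanning(512) * np.sin(2 * np.pi * 40 * np.arange(512) / 512)
envelope, excitation = homomorphic_split(frame)
print(envelope.shape)  # (257,)
```

By construction the product of envelope and excitation reproduces the magnitude spectrum, so step 3 can resample the two factors independently at the new frequency points.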
S250, performing speech recognition on the new voice.
Speech recognition (in English, Automatic Speech Recognition, generally abbreviated ASR) is the process of converting sound into text.
Specifically, please refer to Fig. 4, which is a schematic flowchart of the speech recognition principle in the speech processing method provided by the embodiments of the present application. The speech recognition process includes:
1) Voice input, that is, obtaining the voice, for example, obtaining the new voice after speech synthesis;
2) Encoding, that is, encoding the input voice and performing feature extraction on it through the encoding, for example, performing feature extraction on the new voice;
3) Decoding. The extracted voice features are decoded by an acoustic model and a language model, where the acoustic model has been trained on training data 1 and the language model on training data 2 so as to reach the required effect. Since speech recognition converts speech sound waves into text, given training data for the target voice, a recognition statistical model can be trained; for example, the new voice is decoded in this way;
4) Text output. The voice features decoded by the acoustic model and the language model are converted into text output; for example, the new voice is converted into text, thereby realizing the speech recognition of converting speech into text.
When the embodiments of the present application perform speech recognition, a non-streaming manner is used: after the user records all the voice at once, the voice in the non-streaming manner is obtained first, and whether the voice contains an abnormal sound signal is judged, where the abnormal sound signal includes the mute-phase signal. If the voice contains the abnormal sound signal, the abnormal sound signal is identified by voice activity detection, the voice is then cut to cut off the abnormal sound signal, and multiple voice segments are obtained; speech synthesis is performed on the multiple voice segments according to their original order in the voice to obtain the new voice, and the speech recognition server then performs speech recognition on the new voice in a whole-sentence manner. In this way, the new voice in whole-sentence form can efficiently use the acoustic model and the language model in speech recognition and effectively improve the recognition accuracy and efficiency of the voice.
In one embodiment, before the step of judging whether the voice contains an abnormal sound signal, the method further includes:
detecting whether the voice contains sound by detecting the volume of the voice;
if the voice contains sound, judging whether the voice contains an abnormal sound signal;
if the voice does not contain sound, outputting a prompt to re-enter the voice.
Specifically, when performing speech recognition, recording of the voice may have started, but for various reasons no sound was uttered. If the recording duration exceeds a certain length, this blank voice would also undergo speech recognition as a recorded voice, even though it actually contains no content and is meaningless. Therefore, whether the voice contains sound can be detected first: if no sound is contained, that is, the voice is a blank sound with no utterance, there is no need to perform abnormal sound signal detection on this voice, thereby avoiding the subsequent step of detecting abnormal sound signals in the voice in order to eliminate them. Whether the voice contains sound can be judged from the volume of the voice. If the volume of the voice is lower than what a person can hear, that is, the audio amplitudes in the audio waveform of the voice are all smaller than the first preset threshold, the voice is judged to be silent without sound; no further voice activity detection is performed to detect abnormal sound signals from the voice, and no speech recognition is needed. If it is judged that the voice contains no sound, a prompt to re-enter the voice is output so that the user records the voice again, thereby shortening the speech recognition process and improving the efficiency and accuracy of speech recognition. If the voice contains sound, whether the voice contains an abnormal sound signal is further judged; if the voice contains an abnormal sound signal, the processing for speech recognition is performed to obtain the new voice, and speech recognition is then performed, so as to ensure the accuracy of speech recognition. By first judging whether the voice contains sound, the abnormal case of no sound can be found as early as possible, improving the identification of abnormal voice situations and thus the efficiency of speech recognition.
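The volume check described above amounts to comparing every audio amplitude in the waveform against the first preset threshold. A minimal sketch, where the threshold value and the prompt text are assumptions:

```python
import numpy as np

FIRST_PRESET_THRESHOLD = 0.01  # assumed amplitude floor for audible sound

def contains_sound(wave, threshold=FIRST_PRESET_THRESHOLD):
    """Judge whether the voice contains sound: True unless every audio
    amplitude in the audio waveform is smaller than the threshold."""
    return bool(np.max(np.abs(wave)) >= threshold)

if not contains_sound(np.zeros(16000)):        # a blank recording
    print("Please re-enter the voice.")        # output the re-entry prompt
```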
Further, whether the voice contains natural language may also be detected.
Natural language here refers to human voice.
Specifically, mel-frequency cepstral coefficients can be applied in a hidden Markov model to judge whether the voice is a human voice signal. MFCCs (in English, Mel Frequency Cepstral Coefficients, abbreviated MFCCs or MFCC) are a kind of feature widely used in automatic speech and speaker recognition; an HMM (in English, Hidden Markov Model) is a statistical model used to describe a Markov process containing hidden unknown parameters. MFCCs are used in the training (and recognition) of HMMs: because in an HMM each frame of speech (or each phoneme) has a feature vector, MFCC can be selected to judge whether a voice signal exists in the voice. If there is no voice signal in the voice, no further voice activity detection is performed to detect abnormal sound signals from the voice, and no speech recognition is needed; if the voice contains a voice signal, whether the voice contains an abnormal sound signal is further judged; if the voice contains an abnormal sound signal, the voice is processed to obtain the new voice, and speech recognition is then performed. By first judging whether the voice contains natural language, the abnormal case of no human voice can be found as early as possible, further improving the identification of abnormal voice situations and thus the efficiency of speech recognition.
In one embodiment, the voice includes an ID card number, and the step of performing speech recognition on the new voice includes:
performing speech recognition on the new voice containing the ID card number;
after the step of performing speech recognition on the new voice, the method further includes:
verifying, according to a preset ID card number coding rule, whether the recognized ID card number contains an error;
if the ID card number contains an error, prompting the erroneous ID card number.
The ID card number, also known as the resident identity card number or citizenship number (in English, People's Republic of China resident identity card number), is a feature combination code.
The preset ID card number coding rule is specifically stipulated in GB11643-1999 "Citizenship Number".
Specifically, the voice in the non-streaming manner containing the ID card number is obtained through the input device. If the voice contains the abnormal sound signal, the voice is cut by voice activity detection to delete the abnormal sound signal and obtain multiple voice segments; speech synthesis is performed on the multiple voice segments according to their original order in the voice to obtain the new voice containing the ID card number, and speech recognition is performed on the new voice to recognize the ID card number contained in the voice. For example, in some business scenarios it is often necessary to verify ID card numbers. Since an ID card number is rather long, people usually pause in the middle when entering it by voice; if the captured audio were directly uploaded to the server for recognition in a streaming manner, results could be produced in real time, but the acoustic model and the language model based on ID card numbers could not be fully used, and errors would easily occur. By contrast, when the speech processing method provided by the embodiments of the present application is used to perform speech recognition on the captured voice containing the ID card number, after the recognized ID card number is obtained, whether the recognized ID card number contains an error can further be verified according to the preset ID card number coding rule. If the ID card number contains no error, the speech recognition of the ID card number is shown to be accurate; if the recognized ID card number contains an error, the erroneous ID card number can be prompted so that the user provides the voice containing the ID card number again and the ID card number is re-recognized, thereby improving the efficiency and accuracy of speech recognition of ID card numbers.
Further, the step of verifying, according to the preset ID card number coding rule, whether the recognized ID card number contains an error includes:
judging whether the ID card number contains the preset number of digits, to verify whether the digit count of the ID card number is correct;
identifying the gender of the speaker corresponding to the voice according to preset voice features, to verify whether the sequence bit in the ID card number matches the gender of the speaker;
judging whether the check code of the recognized ID card number calculated according to the check code calculation formula is consistent with the check code contained in the recognized ID card number, to verify whether the check code contained in the recognized ID card number is correct;
if the digit count of the ID card number is correct, the sequence bit in the ID card number matches the gender of the speaker, and the check code of the ID card number is correct, determining that the recognized ID card number contains no error.
Specifically, the standard "Citizenship Number" defines the coding object, number structure and representation form of citizenship numbers, so that each coding object obtains a unique and constant legal number. For example, a citizenship number is a feature combination code consisting of a seventeen-digit body code and a one-digit check code, arranged from left to right as: a six-digit address code, an eight-digit date-of-birth code, a three-digit sequence code and a one-digit check code. Since the coding of ID card numbers has a corresponding rule and structure, whether the recognized ID card number contains an error can be verified according to the coding rule of ID card numbers, which may include the following:
1) Judging whether the ID card number contains the preset number of digits, to verify whether the digit count of the ID card number is correct.
Specifically, if the ID card number contains the preset number of digits, it is determined that the digit count of the recognized ID card number is correct; if the ID card number does not contain the preset number of digits, it is determined that the digit count of the recognized ID card number contains an error. Since an ID card number has eighteen digits in total, whether the digit count of the ID card number is correct can be verified by judging whether the ID card number contains the preset number of digits, that is, judging whether the recognized ID card number contains eighteen digits, so that whether the recognized ID card number is correct is first judged from the digit count. If the recognized ID card number does not have eighteen digits, it can be directly judged that the ID card number produced by speech recognition contains an error, and no subsequent judgment is needed, thereby improving the efficiency of speech recognition; if the recognized ID card number has eighteen digits, it can be preliminarily judged, as far as the digit count is concerned, that the recognized ID card number is correct.
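The preliminary digit-count check can be sketched as a one-line pattern match: seventeen digits followed by a digit or the letter X (the last position may be the check character X, as described below). The function name is illustrative.

```python
import re

def has_valid_length(id_number):
    """Preliminary digit check: 17 digits followed by a digit or 'X'/'x'."""
    return re.fullmatch(r"\d{17}[\dXx]", id_number) is not None

print(has_valid_length("11010519491231002X"))  # True: 18 characters, well-formed
print(has_valid_length("1101051949123"))       # False: not 18 characters
```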
2) Identifying the gender of the speaker corresponding to the voice according to preset voice features, to verify whether the sequence bit in the ID card number matches the gender of the speaker.
The preset voice features include voice features such as fundamental frequency, spectrum, sound frequency and amplitude, which are used to distinguish a male voice from a female voice. The fundamental tone, as the name suggests, is the basis of the sound; the frequency of vocal cord vibration is called the fundamental frequency. The fundamental frequency is closely related to the structure of a person's vocal cords, so it can be used to identify the source of pronunciation. In general, the fundamental frequency of male speakers is lower and that of female speakers is relatively high. Since there is a large difference between the fundamental frequencies of male and female voices, male and female voices can be identified based on the fundamental frequency.
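A fundamental-frequency-based gender hint can be sketched with an autocorrelation pitch estimator: the lag with the strongest self-similarity within the plausible human pitch range gives the fundamental frequency, and a dividing line separates typically male from typically female pitch. The 165 Hz boundary and the pitch range are assumed illustrative values, not from the patent.

```python
import numpy as np

def estimate_f0(wave, sr=16000, f0_min=60, f0_max=400):
    """Estimate fundamental frequency by autocorrelation over the
    plausible human pitch range."""
    lags = np.arange(int(sr / f0_max), int(sr / f0_min))
    ac = np.array([np.sum(wave[:-lag] * wave[lag:]) for lag in lags])
    return sr / lags[np.argmax(ac)]

def likely_gender(wave, sr=16000, boundary_hz=165):
    """Lower fundamental frequency suggests a male speaker, higher a
    female speaker; 165 Hz is an assumed dividing line."""
    return "male" if estimate_f0(wave, sr) < boundary_hz else "female"

t = np.arange(16000) / 16000
print(likely_gender(np.sin(2 * np.pi * 120 * t)))  # male   (120 Hz tone)
print(likely_gender(np.sin(2 * np.pi * 220 * t)))  # female (220 Hz tone)
```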
Specifically, if the sequence bit in the ID card number matches the gender of the speaker, it is determined that the sequence bit of the recognized ID card number contains no error; if the sequence bit in the ID card number does not match the gender of the speaker, it is determined that the sequence bit of the recognized ID card number contains an error. The 15th to 17th digits of an ID card number are the sequence numbers compiled for persons born in the same year, month and day within the area identified by the address code, where an odd 17th digit is assigned to males and an even one to females. Therefore, by identifying whether the sound in the voice is a male voice or a female voice, the gender of the speaker corresponding to the voice can be identified according to the preset voice features, so as to verify whether the sequence bit in the ID card number matches the gender of the speaker. If the identified sound is female and the sequence bit is even, or the identified sound is male and the sequence bit is odd, it is judged that the gender of the speaker corresponding to the voice matches the sequence bit in the ID card number; the two are consistent, and it can further be judged that the recognition of the sequence bit of the ID card number is accurate. If the identified sound is female but the sequence bit is odd, or the identified sound is male but the sequence bit is even, it is judged that the gender of the speaker corresponding to the voice does not match the sequence bit in the ID card number; the two are inconsistent, which may mean that an error occurred in the speech recognition process of the ID card number, or that a male used a female's ID card number for speech recognition, or that a female used a male's ID card number. Through the verification of the sequence bit, the correctness of the speech recognition can be judged and the efficiency of ID card number speech recognition improved; to a certain extent an anti-counterfeiting effect for ID card numbers can also be achieved, preventing a person of the opposite gender from using an ID card number for identity verification or identification.
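The parity rule for the sequence bit can be sketched directly: the 17th character of an 18-digit number is odd for males and even for females. The function name and the detected-gender strings are illustrative; the sample number is a commonly used illustrative ID card number.

```python
def sequence_bit_matches_gender(id_number, detected_gender):
    """Check the 17th digit of an 18-character ID card number against the
    speaker's detected gender: odd digits are assigned to males, even to
    females."""
    seq_digit = int(id_number[16])   # 17th character, zero-based index 16
    expected = "male" if seq_digit % 2 == 1 else "female"
    return expected == detected_gender

print(sequence_bit_matches_gender("11010519491231002X", "female"))  # True: '2' is even
```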
Further, under normal circumstances, since male and female voices have dramatically different features, a corresponding neural network model can be trained on training data to judge whether a sound is a male voice or a female voice, that is, through neural network classification, thereby improving the efficiency of male and female voice identification.
3) Judging whether the check code of the recognized ID card number calculated according to the check code calculation formula is consistent with the check code contained in the recognized ID card number, to verify whether the check code of the recognized ID card number is correct.
The check code is calculated by the numbering unit according to the unified formula specified in the standard "Citizenship Number": it is the check code calculated from the first seventeen digits of the ID card number according to the ISO 7064:1983 MOD 11-2 check code scheme.
Specifically, since the check code, as the last digit, is calculated by the numbering unit with a unified formula, the check code of the recognized ID card number is calculated according to the check code calculation formula and compared with the check code contained in the recognized ID card number, to verify whether the check code of the recognized ID card number is correct. If the check code calculated according to the check code calculation formula is consistent with the check code contained in the recognized ID card number, it is determined that the check code contained in the recognized ID card number is correct; if they are inconsistent, it is determined that the check code contained in the recognized ID card number contains an error.
If the ID card number contains the preset number of digits, the sequence bit in the ID card number matches the gender of the speaker, and the check code of the recognized ID card number calculated according to the check code calculation formula is consistent with the check code contained in the recognized ID card number — that is, if the digit count of the ID card number is correct, the sequence bit in the ID card number matches the gender of the speaker, and the check code of the ID card number is correct — it is determined that the recognized ID card number contains no error. If the recognized ID card number contains none of the above errors, it can be judged that the recognition of the ID card number is accurate; if the recognized ID card number contains any of the above errors, the speech recognition is shown to be inaccurate, and the user needs to be prompted to re-enter the voice containing the ID card number so as to perform speech recognition again, thereby improving the accuracy of speech recognition of ID card numbers as much as possible. Conversely, if the ID card number does not contain the preset number of digits, or the sequence bit in the ID card number does not match the gender of the speaker, or the check code calculated according to the check code calculation formula is inconsistent with the check code contained in the recognized ID card number, it is determined that the recognized ID card number contains an error.
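The ISO 7064 MOD 11-2 check code named above can be computed as a weighted sum of the first seventeen digits modulo 11, mapped to the check character table. The sample value is a commonly used illustrative ID card number, not a real person's.

```python
def gb11643_check_code(first17):
    """Compute the GB11643-1999 check code (ISO 7064 MOD 11-2) from the
    first 17 digits of a citizenship number."""
    weights = [7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2]
    codes = "10X98765432"   # check character for remainder 0..10
    total = sum(int(d) * w for d, w in zip(first17, weights))
    return codes[total % 11]

def check_code_correct(id_number):
    """Compare the recognized number's last digit with the computed code."""
    return id_number[17].upper() == gb11643_check_code(id_number[:17])

print(check_code_correct("11010519491231002X"))  # True for this sample number
```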
Further, the step of prompting the erroneous ID card number includes:
prompting, for the erroneous ID card number, the specific problem corresponding to the error.
Specifically, the error of the recognized ID card number is prompted in a targeted way. For example, if the digit count of the ID card number is detected to be incorrect, the digit count error of the ID card number is prompted, so that the user notices whether the digit count of the ID card number was entered incorrectly during voice input. If it is identified, according to the preset voice features, that the gender of the speaker corresponding to the voice does not match the sequence bit in the ID card number — since an odd 17th digit of an ID card number is assigned to males and an even one to females — the user can be made to notice whether the 17th digit of the ID card number was entered incorrectly. If the check code of the ID card number is detected to be incorrect, the user can be made to notice whether the last check digit of the ID card number was entered incorrectly. Targeted prompting can improve the accuracy of the user's voice input, and thereby the efficiency and accuracy of speech recognition.
It should be noted that the technical features included in the different embodiments of the speech processing method described above can be recombined as needed to obtain combined embodiments, all of which fall within the protection scope claimed by the present application.
Please refer to Fig. 5, which is a schematic block diagram of the speech processing apparatus provided by the embodiments of the present application. Corresponding to the above speech processing method, the embodiments of the present application also provide a speech processing apparatus. As shown in Fig. 5, the speech processing apparatus includes units for executing the above speech processing method, and the apparatus can be configured in a computer device such as a server. Specifically, referring to Fig. 5, the speech processing apparatus 500 includes an acquiring unit 501, a judging unit 502, a cutting unit 503, a synthesis unit 504 and a recognition unit 505.
The acquiring unit 501 is used to obtain a voice in a non-streaming manner through an input device;
the judging unit 502 is used to judge whether the voice contains an abnormal sound signal, the abnormal sound signal including a mute-phase signal;
the cutting unit 503 is used to cut the voice by voice activity detection to delete the abnormal sound signal and obtain multiple voice segments, if the voice contains the abnormal sound signal;
the synthesis unit 504 is used to perform speech synthesis on the multiple voice segments according to their original order in the voice to obtain a new voice;
the recognition unit 505 is used to perform speech recognition on the new voice.
Please refer to Fig. 6, which is another schematic block diagram of the speech processing apparatus provided by the embodiments of the present application. As shown in Fig. 6, in this embodiment the speech processing apparatus 500 further includes:
a detection unit 506, used to detect whether the voice contains sound by detecting the volume of the voice; to judge, if the voice contains sound, whether the voice contains an abnormal sound signal; and to output a prompt to re-enter the voice if the voice does not contain sound.
Please continue to refer to Fig. 6. The judging unit 502 includes:
a detection subunit 5021, used to detect whether the audio waveform of the voice contains a waveform whose audio amplitude is smaller than the first preset threshold;
a first determination subunit 5022, used to determine that the voice contains the abnormal sound signal if the audio waveform contains a waveform whose audio amplitude is smaller than the first preset threshold.
In one embodiment, the synthesis unit 504 is used to perform waveform concatenation of the multiple voice segments according to their original order in the voice to obtain the new voice.
Please continue to refer to Fig. 6. As shown in Fig. 6, the voice includes an ID card number, and the recognition unit 505 is used to perform speech recognition on the new voice containing the ID card number;
the speech processing apparatus 500 further includes:
a verification unit 507, used to verify, according to the preset ID card number coding rule, whether the recognized ID card number contains an error;
a prompt unit 508, used to prompt the erroneous ID card number if the ID card number contains an error.
Please continue to refer to Fig. 6. As shown in Fig. 6, the verification unit 507 includes:
a first verification subunit 5071, used to judge whether the ID card number contains the preset number of digits, to verify whether the digit count of the ID card number is correct;
a second verification subunit 5072, used to identify the gender of the speaker corresponding to the voice according to the preset voice features, to verify whether the sequence bit in the ID card number matches the gender of the speaker;
a third verification subunit 5073, used to judge whether the check code of the recognized ID card number calculated according to the check code calculation formula is consistent with the check code contained in the recognized ID card number, to verify whether the check code contained in the recognized ID card number is correct;
a second determination subunit 5074, used to determine that the recognized ID card number contains no error if the digit count of the ID card number is correct, the sequence bit in the ID card number matches the gender of the speaker, and the check code of the ID card number is correct.
In one embodiment, the prompt unit 508 is used to prompt, for the erroneous ID card number, the specific problem corresponding to the error.
It should be noted that, as is apparent to those skilled in the art, the specific implementation processes of the above speech processing apparatus and of each unit may refer to the corresponding descriptions in the foregoing method embodiments; for convenience and brevity of description, they are not repeated here.
Meanwhile the division of each unit and connection type are only used for for example, at other in above-mentioned voice processing apparatus
In embodiment, voice processing apparatus can be divided into as required to different units, it can also be by each unit in voice processing apparatus
The different order of connection and mode are taken, to complete all or part of function of above-mentioned voice processing apparatus.
The above speech processing apparatus may be implemented in the form of a computer program, and the computer program may run on a computer device as shown in Fig. 7.
Referring to Fig. 7, Fig. 7 is a schematic block diagram of a computer device provided by an embodiment of the present application. The computer device 700 may be a desktop computer, a server, or a similar computer device, or may be a component or part of other equipment.
Referring to Fig. 7, the computer device 700 includes a processor 702, a memory, and a network interface 705 connected via a system bus 701, where the memory may include a non-volatile storage medium 703 and an internal memory 704.
The non-volatile storage medium 703 may store an operating system 7031 and a computer program 7032. When the computer program 7032 is executed, the processor 702 may be caused to perform the above speech processing method.
The processor 702 provides computing and control capabilities to support the operation of the entire computer device 700.
The internal memory 704 provides an environment for running the computer program 7032 stored in the non-volatile storage medium 703; when the computer program 7032 is executed by the processor 702, the processor 702 may be caused to perform the above speech processing method.
The network interface 705 is used for network communication with other devices. Those skilled in the art will understand that the structure shown in Fig. 7 is merely a block diagram of the part of the structure related to the solution of the present application and does not limit the computer device 700 to which the solution is applied; a specific computer device 700 may include more or fewer components than shown, combine certain components, or have a different component arrangement. For example, in some embodiments the computer device may include only a memory and a processor; in such embodiments, the structures and functions of the memory and the processor are consistent with the embodiment shown in Fig. 7 and are not described again here.
The processor 702 is configured to run the computer program 7032 stored in the memory, so as to implement the following steps: obtaining a voice in non-streaming mode through an input device; judging whether the voice contains an abnormal sound signal, the abnormal sound signal including a silent-period signal; if the voice contains the abnormal sound signal, cutting the voice by voice activity detection to delete the abnormal sound signal, obtaining a plurality of voice fragments; performing speech synthesis on the plurality of voice fragments in their original order in the voice to obtain a new voice; and performing speech recognition on the new voice.
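The cut-delete-rejoin pipeline of these steps can be sketched as follows. This is an illustrative sketch only, not the patent's implementation: the frame length, the energy threshold, and the representation of the voice as a list of normalized float samples are all assumed for the example.

```python
def frame_energy(frame):
    """Mean energy of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)

def remove_silence(samples, frame_len=4, threshold=0.01):
    """Drop low-energy (silent-period) frames and re-join the
    remaining fragments in their original order; the 'speech
    synthesis' of this embodiment is plain re-joining, not
    generative synthesis. frame_len and threshold are assumed."""
    kept = []
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        if frame_energy(frame) >= threshold:
            kept.extend(frame)
    return kept
```

The resulting `kept` signal would then be passed to a speech recognizer in place of the raw recording.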
In one embodiment, before implementing the step of judging whether the voice contains an abnormal sound signal, the processor 702 further implements the following steps:
detecting whether the voice contains sound by detecting the volume of the voice;
if the voice contains sound, judging whether the voice contains an abnormal sound signal;
if the voice contains no sound, outputting a prompt to re-input the voice.
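The volume check above can be sketched as an overall RMS test. This is an assumption for illustration: the patent does not specify how volume is measured, and the threshold value is invented for the example.

```python
import math

def contains_sound(samples, volume_threshold=0.02):
    """Decide whether the recording contains any sound at all by
    its overall RMS volume; below the threshold the caller would
    prompt the user to re-input the voice. Threshold is assumed."""
    if not samples:
        return False
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms >= volume_threshold
```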
In one embodiment, when implementing the step of judging whether the voice contains an abnormal sound signal, the processor 702 specifically implements the following steps:
detecting whether the audio waveform of the voice contains a waveform whose audio amplitude is less than a first preset threshold;
if the audio waveform contains a waveform whose audio amplitude is less than the first preset threshold, determining that the voice contains the abnormal sound signal.
In one embodiment, when implementing the step of performing speech synthesis on the plurality of voice fragments in their original order in the voice to obtain a new voice, the processor 702 specifically implements the following step:
concatenating the waveforms of the plurality of voice fragments in their original order in the voice to obtain the new voice.
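The waveform concatenation of this embodiment amounts to a plain end-to-end join, preserving fragment order. Representing each fragment as a list of samples is an assumption for the example.

```python
def concatenate_fragments(fragments):
    """Join voice fragments end to end in their original order;
    the 'speech synthesis' here is waveform concatenation,
    not generative synthesis."""
    new_voice = []
    for frag in fragments:
        new_voice.extend(frag)
    return new_voice
```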
In one embodiment, the voice contains an ID card number, and when implementing the step of performing speech recognition on the new voice, the processor 702 specifically implements the following step:
performing speech recognition on the new voice containing the ID card number.
After implementing the step of performing speech recognition on the new voice, the processor 702 further implements the following steps:
verifying, according to a preset ID card number coding rule, whether the recognized ID card number contains an error;
if the ID card number contains an error, prompting the erroneous ID card number.
In one embodiment, when implementing the step of verifying, according to the preset ID card number coding rule, whether the recognized ID card number contains an error, the processor 702 specifically implements the following steps:
judging whether the ID card number contains a preset number of digits, so as to verify whether the digit count of the ID card number is correct;
identifying the gender of the speaker corresponding to the voice according to preset voice features, so as to verify whether the sequence digit in the ID card number matches the gender of the speaker;
judging whether the check code calculated from the recognized ID card number according to the check-code calculation formula is consistent with the check code contained in the recognized ID card number, so as to verify whether the check code contained in the recognized ID card number is correct;
if the digit count of the ID card number is correct, the sequence digit in the ID card number matches the gender of the speaker, and the check code of the ID card number is correct, determining that the recognized ID card number contains no error.
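The three checks above can be sketched against China's public citizen-ID format (GB 11643-1999), which is assumed here to be the "preset coding rule": 18 characters, a 17th (sequence) digit that is odd for males and even for females, and a final check character computed by a weighted mod-11 formula. The speaker's gender is taken as an input, standing in for the voice-feature classification the embodiment performs.

```python
def verify_id_number(id_no, speaker_is_male):
    """Verify digit count, gender consistency of the sequence
    digit, and the mod-11 check code of an 18-character ID number.
    Assumes the GB 11643-1999 rule; not taken from the patent."""
    # Rule 1: 18 characters, first 17 must be digits.
    if len(id_no) != 18 or not id_no[:17].isdigit():
        return False
    # Rule 2: sequence digit (17th) is odd for males, even for females.
    if (int(id_no[16]) % 2 == 1) != speaker_is_male:
        return False
    # Rule 3: weighted mod-11 check code over the first 17 digits.
    weights = [7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2]
    check_map = "10X98765432"  # remainder 0..10 -> check character
    total = sum(int(d) * w for d, w in zip(id_no[:17], weights))
    return check_map[total % 11] == id_no[17].upper()
```

For example, the commonly cited sample number 11010519491231002X has an even sequence digit (female) and check character X, so it passes only when the detected speaker gender is female.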
In one embodiment, when implementing the step of prompting the erroneous ID card number, the processor 702 specifically implements the following step:
prompting the specific problem corresponding to the erroneous ID card number.
It should be understood that in the embodiments of the present application, the processor 702 may be a central processing unit (Central Processing Unit, CPU), or may be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
Those of ordinary skill in the art will appreciate that all or part of the processes of the methods in the above embodiments may be implemented by a computer program, which may be stored in a computer-readable storage medium. The computer program is executed by at least one processor in the computer system to implement the process steps of the above method embodiments.
Therefore, the present application further provides a computer-readable storage medium. The computer-readable storage medium may be a non-volatile computer-readable storage medium, and stores a computer program which, when executed by a processor, causes the processor to execute the steps of the speech processing method described in the above embodiments.
The present application further provides a computer program product which, when run on a computer, causes the computer to execute the steps of the speech processing method described in the above embodiments.
The computer-readable storage medium may be an internal storage unit of the aforementioned device, such as a hard disk or memory of the device. The computer-readable storage medium may also be an external storage device of the device, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, or a flash card (Flash Card) equipped on the device. Further, the computer-readable storage medium may include both an internal storage unit of the device and an external storage device.
It will be apparent to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described devices, apparatuses, and units may refer to the corresponding processes in the foregoing method embodiments and are not described again here.
The computer-readable storage medium may be a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a magnetic disk, an optical disk, or any other computer-readable storage medium that can store program code.
Those of ordinary skill in the art may appreciate that the units and algorithm steps described in connection with the embodiments disclosed herein can be implemented by electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are implemented in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, and the division of units is only a division by logical function; there may be other division manners in actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed.
The steps in the methods of the embodiments of the present application may be reordered, combined, and deleted according to actual needs. The units in the apparatuses of the embodiments of the present application may be combined, divided, and deleted according to actual needs. In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing an electronic device (which may be a personal computer, a terminal, a network device, or the like) to execute all or part of the steps of the methods of the embodiments of the present application.
The above are only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any person skilled in the art can readily conceive of various equivalent modifications or replacements within the technical scope disclosed in the present application, and such modifications or replacements shall all fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (10)
1. A speech processing method, characterized in that the method comprises:
obtaining a voice in non-streaming mode through an input device;
judging whether the voice contains an abnormal sound signal, the abnormal sound signal including a silent-period signal;
if the voice contains the abnormal sound signal, cutting the voice by voice activity detection to delete the abnormal sound signal, obtaining a plurality of voice fragments;
performing speech synthesis on the plurality of voice fragments in their original order in the voice to obtain a new voice;
performing speech recognition on the new voice.
2. The speech processing method according to claim 1, characterized in that, before the step of judging whether the voice contains an abnormal sound signal, the method further comprises:
detecting whether the voice contains sound by detecting the volume of the voice;
if the voice contains sound, judging whether the voice contains an abnormal sound signal;
if the voice contains no sound, outputting a prompt to re-input the voice.
3. The speech processing method according to claim 1 or 2, characterized in that the step of judging whether the voice contains an abnormal sound signal comprises:
detecting whether the audio waveform of the voice contains a waveform whose audio amplitude is less than a first preset threshold;
if the audio waveform contains a waveform whose audio amplitude is less than the first preset threshold, determining that the voice contains the abnormal sound signal.
4. The speech processing method according to claim 1, characterized in that the step of performing speech synthesis on the plurality of voice fragments in their original order in the voice to obtain a new voice comprises:
concatenating the waveforms of the plurality of voice fragments in their original order in the voice to obtain the new voice.
5. The speech processing method according to claim 1, characterized in that the voice contains an ID card number, and the step of performing speech recognition on the new voice comprises:
performing speech recognition on the new voice containing the ID card number;
after the step of performing speech recognition on the new voice, the method further comprises:
verifying, according to a preset ID card number coding rule, whether the recognized ID card number contains an error;
if the ID card number contains an error, prompting the erroneous ID card number.
6. The speech processing method according to claim 5, characterized in that the step of verifying, according to the preset ID card number coding rule, whether the recognized ID card number contains an error comprises:
judging whether the ID card number contains a preset number of digits, so as to verify whether the digit count of the ID card number is correct;
identifying the gender of the speaker corresponding to the voice according to preset voice features, so as to verify whether the sequence digit in the ID card number matches the gender of the speaker;
judging whether the check code calculated from the recognized ID card number according to the check-code calculation formula is consistent with the check code contained in the recognized ID card number, so as to verify whether the check code contained in the recognized ID card number is correct;
if the digit count of the ID card number is correct, the sequence digit in the ID card number matches the gender of the speaker, and the check code of the ID card number is correct, determining that the recognized ID card number contains no error.
7. The speech processing method according to claim 5 or 6, characterized in that the step of prompting the erroneous ID card number comprises:
prompting the specific problem corresponding to the erroneous ID card number.
8. A speech processing apparatus, characterized by comprising:
an acquiring unit, configured to obtain a voice in non-streaming mode through an input device;
a judging unit, configured to judge whether the voice contains an abnormal sound signal, the abnormal sound signal including a silent-period signal;
a cutting unit, configured to, if the voice contains the abnormal sound signal, cut the voice by voice activity detection to delete the abnormal sound signal, obtaining a plurality of voice fragments;
a synthesis unit, configured to perform speech synthesis on the plurality of voice fragments in their original order in the voice to obtain a new voice;
a recognition unit, configured to perform speech recognition on the new voice.
9. A computer device, characterized in that the computer device comprises a memory and a processor connected to the memory; the memory is configured to store a computer program; the processor is configured to run the computer program stored in the memory to execute the steps of the speech processing method according to any one of claims 1-7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to execute the steps of the speech processing method according to any one of claims 1-7.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910374806.XA CN110148402A (en) | 2019-05-07 | 2019-05-07 | Method of speech processing, device, computer equipment and storage medium |
PCT/CN2019/117786 WO2020224217A1 (en) | 2019-05-07 | 2019-11-13 | Speech processing method and apparatus, computer device, and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910374806.XA CN110148402A (en) | 2019-05-07 | 2019-05-07 | Method of speech processing, device, computer equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110148402A true CN110148402A (en) | 2019-08-20 |
Family
ID=67594842
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910374806.XA Pending CN110148402A (en) | 2019-05-07 | 2019-05-07 | Method of speech processing, device, computer equipment and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110148402A (en) |
WO (1) | WO2020224217A1 (en) |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110853622A (en) * | 2019-10-22 | 2020-02-28 | 深圳市本牛科技有限责任公司 | Method and system for sentence segmentation by voice |
CN110992989A (en) * | 2019-12-06 | 2020-04-10 | 广州国音智能科技有限公司 | Voice acquisition method and device and computer readable storage medium |
CN111583934A (en) * | 2020-04-30 | 2020-08-25 | 联想(北京)有限公司 | Data processing method and device |
CN111627453A (en) * | 2020-05-13 | 2020-09-04 | 广州国音智能科技有限公司 | Public security voice information management method, device, equipment and computer storage medium |
CN111863041A (en) * | 2020-07-17 | 2020-10-30 | 东软集团股份有限公司 | Sound signal processing method, device and equipment |
CN111912519A (en) * | 2020-07-21 | 2020-11-10 | 国网安徽省电力有限公司 | Transformer fault diagnosis method and device based on voiceprint frequency spectrum separation |
WO2020224217A1 (en) * | 2019-05-07 | 2020-11-12 | 平安科技(深圳)有限公司 | Speech processing method and apparatus, computer device, and storage medium |
CN111953727A (en) * | 2020-05-06 | 2020-11-17 | 上海明略人工智能(集团)有限公司 | Audio transmission method and device |
CN112434561A (en) * | 2020-11-03 | 2021-03-02 | 中国工程物理研究院电子工程研究所 | Method for automatically judging shock wave signal validity |
CN113542724A (en) * | 2020-04-16 | 2021-10-22 | 福建天泉教育科技有限公司 | Automatic detection method and system for video resources |
CN114121050A (en) * | 2021-11-30 | 2022-03-01 | 云知声智能科技股份有限公司 | Audio playing method and device, electronic equipment and storage medium |
CN114898755A (en) * | 2022-07-14 | 2022-08-12 | 科大讯飞股份有限公司 | Voice processing method and related device, electronic equipment and storage medium |
CN115565539A (en) * | 2022-11-21 | 2023-01-03 | 中网道科技集团股份有限公司 | Data processing method for realizing self-help correction terminal anti-counterfeiting identity verification |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105336322A (en) * | 2015-09-30 | 2016-02-17 | 百度在线网络技术(北京)有限公司 | Polyphone model training method, and speech synthesis method and device |
CN105427870A (en) * | 2015-12-23 | 2016-03-23 | 北京奇虎科技有限公司 | Voice recognition method and device aiming at pauses |
CN108564954A (en) * | 2018-03-19 | 2018-09-21 | 平安科技(深圳)有限公司 | Deep neural network model, electronic device, auth method and storage medium |
CN108564955A (en) * | 2018-03-19 | 2018-09-21 | 平安科技(深圳)有限公司 | Electronic device, auth method and computer readable storage medium |
CN109192194A (en) * | 2018-08-22 | 2019-01-11 | 北京百度网讯科技有限公司 | Voice data mask method, device, computer equipment and storage medium |
CN109389452A (en) * | 2017-08-10 | 2019-02-26 | 阿里巴巴集团控股有限公司 | The method and device of voice sale |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103581158A (en) * | 2012-08-10 | 2014-02-12 | 百度在线网络技术(北京)有限公司 | Method and system for processing voice data |
CN103400580A (en) * | 2013-07-23 | 2013-11-20 | 华南理工大学 | Method for estimating importance degree of speaker in multiuser session voice |
CN109065076B (en) * | 2018-09-05 | 2020-11-27 | 深圳追一科技有限公司 | Audio label setting method, device, equipment and storage medium |
CN109360551B (en) * | 2018-10-25 | 2021-02-05 | 珠海格力电器股份有限公司 | Voice recognition method and device |
CN109545246A (en) * | 2019-01-21 | 2019-03-29 | 维沃移动通信有限公司 | A kind of sound processing method and terminal device |
CN110148402A (en) * | 2019-05-07 | 2019-08-20 | 平安科技(深圳)有限公司 | Method of speech processing, device, computer equipment and storage medium |
CN110619897A (en) * | 2019-08-02 | 2019-12-27 | 精电有限公司 | Conference summary generation method and vehicle-mounted recording system |
- 2019-05-07 CN CN201910374806.XA patent/CN110148402A/en active Pending
- 2019-11-13 WO PCT/CN2019/117786 patent/WO2020224217A1/en active Application Filing
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020224217A1 (en) * | 2019-05-07 | 2020-11-12 | 平安科技(深圳)有限公司 | Speech processing method and apparatus, computer device, and storage medium |
CN110853622B (en) * | 2019-10-22 | 2024-01-12 | 深圳市本牛科技有限责任公司 | Voice sentence breaking method and system |
CN110853622A (en) * | 2019-10-22 | 2020-02-28 | 深圳市本牛科技有限责任公司 | Method and system for sentence segmentation by voice |
CN110992989A (en) * | 2019-12-06 | 2020-04-10 | 广州国音智能科技有限公司 | Voice acquisition method and device and computer readable storage medium |
CN110992989B (en) * | 2019-12-06 | 2022-05-27 | 广州国音智能科技有限公司 | Voice acquisition method and device and computer readable storage medium |
CN113542724A (en) * | 2020-04-16 | 2021-10-22 | 福建天泉教育科技有限公司 | Automatic detection method and system for video resources |
CN113542724B (en) * | 2020-04-16 | 2023-09-15 | 福建天泉教育科技有限公司 | Automatic detection method and system for video resources |
CN111583934A (en) * | 2020-04-30 | 2020-08-25 | 联想(北京)有限公司 | Data processing method and device |
CN111953727A (en) * | 2020-05-06 | 2020-11-17 | 上海明略人工智能(集团)有限公司 | Audio transmission method and device |
CN111627453B (en) * | 2020-05-13 | 2024-02-09 | 广州国音智能科技有限公司 | Public security voice information management method, device, equipment and computer storage medium |
CN111627453A (en) * | 2020-05-13 | 2020-09-04 | 广州国音智能科技有限公司 | Public security voice information management method, device, equipment and computer storage medium |
CN111863041A (en) * | 2020-07-17 | 2020-10-30 | 东软集团股份有限公司 | Sound signal processing method, device and equipment |
CN111912519A (en) * | 2020-07-21 | 2020-11-10 | 国网安徽省电力有限公司 | Transformer fault diagnosis method and device based on voiceprint frequency spectrum separation |
CN112434561A (en) * | 2020-11-03 | 2021-03-02 | 中国工程物理研究院电子工程研究所 | Method for automatically judging shock wave signal validity |
CN112434561B (en) * | 2020-11-03 | 2023-09-22 | 中国工程物理研究院电子工程研究所 | Method for automatically judging effectiveness of shock wave signal |
CN114121050A (en) * | 2021-11-30 | 2022-03-01 | 云知声智能科技股份有限公司 | Audio playing method and device, electronic equipment and storage medium |
CN114898755A (en) * | 2022-07-14 | 2022-08-12 | 科大讯飞股份有限公司 | Voice processing method and related device, electronic equipment and storage medium |
CN115565539B (en) * | 2022-11-21 | 2023-02-07 | 中网道科技集团股份有限公司 | Data processing method for realizing self-help correction terminal anti-counterfeiting identity verification |
CN115565539A (en) * | 2022-11-21 | 2023-01-03 | 中网道科技集团股份有限公司 | Data processing method for realizing self-help correction terminal anti-counterfeiting identity verification |
Also Published As
Publication number | Publication date |
---|---|
WO2020224217A1 (en) | 2020-11-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110148402A (en) | Method of speech processing, device, computer equipment and storage medium | |
CN102254553B (en) | The automatic normalization of spoken syllable duration | |
CN108806696B (en) | Method and device for establishing voiceprint model, computer equipment and storage medium | |
Kinnunen | Spectral features for automatic text-independent speaker recognition | |
US5729694A (en) | Speech coding, reconstruction and recognition using acoustics and electromagnetic waves | |
CN101578659A (en) | Voice tone converting device and voice tone converting method | |
US20030097254A1 (en) | Ultra-narrow bandwidth voice coding | |
US9508338B1 (en) | Inserting breath sounds into text-to-speech output | |
JPH10507536A (en) | Language recognition | |
US11727949B2 (en) | Methods and apparatus for reducing stuttering | |
JP2006267465A (en) | Uttering condition evaluating device, uttering condition evaluating program, and program storage medium | |
Gallardo | Human and automatic speaker recognition over telecommunication channels | |
Stupakov et al. | The design and collection of COSINE, a multi-microphone in situ speech corpus recorded in noisy environments | |
JPWO2006083020A1 (en) | Speech recognition system for generating response speech using extracted speech data | |
US20210118464A1 (en) | Method and apparatus for emotion recognition from speech | |
CN107004428A (en) | Session evaluating apparatus and method | |
CN112908302B (en) | Audio processing method, device, equipment and readable storage medium | |
Mandel et al. | Audio super-resolution using concatenative resynthesis | |
EP2541544A1 (en) | Voice sample tagging | |
Westall et al. | Speech technology for telecommunications | |
Nthite et al. | End-to-End Text-To-Speech synthesis for under resourced South African languages | |
JPS59501520A (en) | Device for articulatory speech recognition | |
Gallardo | Human and automatic speaker recognition over telecommunication channels | |
JP7296214B2 (en) | speech recognition system | |
Medhi et al. | Different acoustic feature parameters ZCR, STE, LPC and MFCC analysis of Assamese vowel phonemes |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||