CN110148402A - Method of speech processing, device, computer equipment and storage medium - Google Patents
- Publication number: CN110148402A (application number CN201910374806.XA, also referenced as CN201910374806A)
- Authority
- CN
- China
- Prior art keywords
- voice
- card
- speech
- abnormal sound
- sound signal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
- G10L15/05—Word boundary detection
- G10L15/26—Speech to text systems
Abstract
Embodiments of this application provide a speech processing method, apparatus, computer device, and computer-readable storage medium, belonging to the technical field of speech recognition. To perform speech recognition in a non-streaming manner, the user first records the entire utterance in one pass, and the speech is obtained in non-streaming form. It is first judged whether the speech contains an abnormal sound signal, the abnormal sound signal including a mute-phase signal. If the speech contains an abnormal sound signal, the signal is located by voice activity detection, the speech is cut to delete the abnormal sound signal, and multiple speech fragments are obtained. The fragments are synthesized into a new speech according to their original order in the speech, and the new speech is then recognized as a whole sentence by the speech recognition server. Because the new speech forms a whole sentence, the acoustic model and language model can be used effectively during recognition, which effectively improves the accuracy and efficiency of speech recognition.
Description
Technical field
This application relates to the technical field of speech recognition, and more particularly to a speech processing method, apparatus, computer device, and computer-readable storage medium.
Background technique
When speech is recorded for recognition, especially when the recording is long, the speaker may pause during recording, so the recorded speech contains blank periods and the acquired speech signal is discontinuous. During recognition, such discontinuous speech cannot make effective use of the acoustic model and language model, which reduces the efficiency of speech recognition. For example, in some business scenarios an identity card number must be verified, and submitting the number by speech recognition is a convenient and quick way to do so. However, because an identity card number is long, users usually pause while reading it. If the captured audio is uploaded to the server for recognition in a streaming manner, results are produced in real time, but the acoustic model and language model based on identity card numbers cannot be fully used, errors are likely, and the efficiency of recognizing the identity card number is reduced.
Summary of the invention
Embodiments of this application provide a speech processing method, apparatus, computer device, and computer-readable storage medium, which can solve the problem of low speech recognition efficiency in the conventional technology.
In a first aspect, an embodiment of this application provides a speech processing method. The method includes: obtaining speech in a non-streaming manner through an input device; judging whether the speech contains an abnormal sound signal, the abnormal sound signal including a mute-phase signal; if the speech contains the abnormal sound signal, cutting the speech by voice activity detection to delete the abnormal sound signal and obtain multiple speech fragments; performing speech synthesis on the multiple speech fragments according to their original order in the speech to obtain a new speech; and performing speech recognition on the new speech.
In a second aspect, an embodiment of this application further provides a speech processing apparatus, comprising: an acquiring unit for obtaining speech in a non-streaming manner through an input device; a judging unit for judging whether the speech contains an abnormal sound signal, the abnormal sound signal including a mute-phase signal; a cutting unit for, if the speech contains the abnormal sound signal, cutting the speech by voice activity detection to delete the abnormal sound signal and obtain multiple speech fragments; a synthesis unit for performing speech synthesis on the multiple speech fragments according to their original order in the speech to obtain a new speech; and a recognition unit for performing speech recognition on the new speech.
In a third aspect, an embodiment of this application further provides a computer device comprising a memory and a processor. A computer program is stored in the memory, and the processor implements the speech processing method when executing the computer program.
In a fourth aspect, an embodiment of this application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the speech processing method.
Embodiments of this application provide a speech processing method, apparatus, computer device, and computer-readable storage medium. To perform speech recognition in a non-streaming manner, the user first records the entire utterance in one pass, and the speech is obtained in non-streaming form. It is first judged whether the speech contains an abnormal sound signal, the abnormal sound signal including a mute-phase signal. If the speech contains the abnormal sound signal, the signal is located by voice activity detection, the speech is cut, and the abnormal sound signal is deleted to obtain multiple speech fragments. The fragments are synthesized into a new speech according to their original order in the speech, and the new speech is then recognized as a whole sentence by the speech recognition server. Because the new speech forms a whole sentence, the acoustic model and language model can be used effectively during recognition, which effectively improves the accuracy and efficiency of speech recognition.
Detailed description of the invention
To illustrate the technical solutions in the embodiments of this application more clearly, the accompanying drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of this application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of an application scenario of the speech processing method provided by an embodiment of this application;
Fig. 2 is a schematic flowchart of the speech processing method provided by an embodiment of this application;
Fig. 3 is a waveform diagram of a speech signal in the speech processing method provided by an embodiment of this application;
Fig. 4 is a flowchart of the speech recognition principle in the speech processing method provided by an embodiment of this application;
Fig. 5 is a schematic block diagram of the speech processing apparatus provided by an embodiment of this application;
Fig. 6 is another schematic block diagram of the speech processing apparatus provided by an embodiment of this application; and
Fig. 7 is a schematic block diagram of the computer device provided by an embodiment of this application.
Specific embodiment
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the accompanying drawings in the embodiments of this application. Obviously, the described embodiments are only some, not all, of the embodiments of this application. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of this application.
It should be understood that when used in this specification and the appended claims, the terms "include" and "comprise" indicate the presence of the described features, integers, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or sets thereof.
It should also be understood that the terms used in this specification are for the purpose of describing particular embodiments only and are not intended to limit this application. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" used in this specification and the appended claims refers to, and includes, any and all possible combinations of one or more of the associated listed items.
Referring to Fig. 1, Fig. 1 is a schematic diagram of an application scenario of the speech processing method provided by an embodiment of this application.
The application scenario includes:
(1) A terminal. The terminal, also called the front end, is equipped with a speech input component such as a microphone to receive the speech input by the user. The terminal may be an electronic device such as a laptop, smartwatch, tablet computer, or desktop computer; the terminal in Fig. 1 is connected to the server.
(2) A server. The server mainly performs speech recognition. The server may be a single server, a server cluster, or a cloud server; if it is a server cluster, it may further include a primary server and secondary servers.
Continuing to refer to Fig. 1, as shown in Fig. 1, in this embodiment of the application the steps of the speech processing method are mainly executed on the server side to explain the technical solution. The workflow of each component in Fig. 1 is as follows: the terminal receives the complete speech input by the user through a speech input device and sends the speech to the server, so that the server obtains the speech in a non-streaming manner; the server judges whether the speech contains an abnormal sound signal, the abnormal sound signal including a mute-phase signal; if the speech contains the abnormal sound signal, the server cuts the speech by voice activity detection to delete the abnormal sound signal and obtain multiple speech fragments, synthesizes the fragments into a new speech according to their original order in the speech, and finally performs speech recognition on the new speech to obtain the speech recognition result.
It should be noted that the speech processing method in this embodiment can be applied to a terminal as well as to a server, as long as the speech is processed before the server recognizes it. Likewise, the application environment of the method is not limited to the one shown in Fig. 1: the speech processing and the speech recognition may both be applied together in a computer device such as a terminal, as long as the processing is carried out before the computer device performs speech recognition. The above application scenario merely illustrates the technical solution of this application and is not intended to limit it; the described connection relationships may also take other forms.
Fig. 2 is a schematic flowchart of the speech processing method provided by an embodiment of this application. The speech processing method is applied to a computer device in Fig. 1, such as the server, to complete all or part of the functions of the method.
Referring to Fig. 2, as shown in Fig. 2, the method includes the following steps S210-S250:
S210. Obtain speech in a non-streaming manner through an input device.
Here, the streaming manner means that the speech recognition device obtains the audio stream of the speech in real time and performs speech recognition while the stream is being received.
The non-streaming manner means that after the speech recognition device has obtained the speech within a preset time, or speech of a preset size, it performs complete speech recognition on the whole sentence or whole passage at once.
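A minimal sketch of the streaming/non-streaming contrast described above, assuming a toy recognizer that simply joins received symbols; all function names here are hypothetical and are not part of the patent disclosure.

```python
def recognize(audio):
    # Stand-in for the speech recognition server: joins the buffered
    # symbols into one recognition result.
    return "".join(audio)

def streaming_recognize(chunks):
    # Streaming mode: each chunk of audio is recognized as soon as it
    # arrives, yielding partial results in real time.
    return [recognize([chunk]) for chunk in chunks]

def non_streaming_recognize(chunks):
    # Non-streaming mode: buffer the whole utterance first, then run a
    # single whole-sentence recognition, so sentence-level acoustic and
    # language models can be applied.
    buffered = []
    for chunk in chunks:
        buffered.append(chunk)
    return recognize(buffered)
```

In streaming mode the partial results "A", "B", "C" appear one by one; in non-streaming mode a single result "ABC" is produced from the whole utterance.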
Specifically, the input device may be a terminal, or a speech input module such as a microphone. After the user inputs speech through the input device, the whole sentence or passage is uploaded to the server, so that the server obtains the speech in a non-streaming manner through the input device. For example, in some business scenarios an identity card number must be verified, and submitting it by speech recognition is a convenient and quick way to do so. If the captured audio is uploaded directly to the speech recognition server in a streaming manner, the server obtains the speech of the identity card number in real time and recognizes it while receiving it, producing the recognition result of the identity card number in real time. For example, suppose the identity card number to be recognized corresponds to the speech "ABCDEFGH". After the speech recognition server receives the speech of "A", it recognizes "A"; after receiving the speech of "B", it recognizes "B"; after receiving the speech of "C", it recognizes "C"; and so on. The speech of "A", "B", "C", etc. is uploaded separately, and the received audio is recognized piece by piece, so partial results of the identity card number come out in real time: recognizing "A" yields the result for "A" in real time, recognizing "B" yields the result for "B" in real time, and so on.
If instead the captured audio is uploaded to the speech recognition server in a non-streaming manner, the server obtains the speech of the complete identity card number at once, then performs speech recognition on it and obtains the recognition result of the complete identity card number. Specifically, the user records the speech of the entire identity card number in one pass, and the speech of the full number is then uploaded as a whole. The uploaded identity-card speech is cut, the speech segments are retained and spliced together, and the result is sent to the speech recognition server for recognition, yielding the recognition result of the complete identity card number. For example, if the identity card number to be recognized corresponds to the speech "ABCDEFGH", the server obtains the speech of the full number "ABCDEFGH" at once and then performs recognition, obtaining the recognition result of the complete number "ABCDEFGH". That is, after the user has recorded the speech of the whole identity card number "ABCDEFGH" in one pass, the speech is uploaded to the speech recognition server as a whole rather than "A", "B", "C", etc. being uploaded separately. The uploaded identity-card speech "ABCDEFGH" is then cut, the speech segments are retained and spliced together, and the result is sent to the speech recognition server for recognition, yielding the recognition result of the complete identity card number "ABCDEFGH".
S220. Judge whether the speech contains an abnormal sound signal, the abnormal sound signal including a mute-phase signal.
Here, an abnormal sound signal is a segment with a steep change in the audio waveform, and includes mute-phase signals. The mute phase is a silent period within the speech. Silence arises in several situations: first, one party in an interaction is listening to the other speak; second, a pause between passages of speech caused by thinking or by a short interval; third, pauses in the middle of speaking, such as hesitation, breathing, or stuttering. In the first case the pause interval is long and the frequency of occurrence is low; in the third case the pause interval is short and the frequency of occurrence is high; the second case lies between the first and the third. This characteristic of the speech source is called its switching characteristic, or sometimes its speech/silence characteristic. Referring to Fig. 3, Fig. 3 is a waveform diagram of a speech signal in the speech processing method provided by an embodiment of this application; as shown in Fig. 3, for an audio waveform L, the position L2 is a mute phase in the waveform.
Specifically, waveform audio is the most common Windows multimedia feature: a waveform audio device can capture sound through a microphone, convert it into numerical values, and store them in a wave file in memory or on disk. Sound is vibration. We perceive sound when it changes the air pressure on the eardrum. A microphone senses these vibrations and converts them into an electric current; conversely, a current becomes sound again through an amplifier and a loudspeaker. Traditionally, sound was stored in an analog manner (such as audio tape and disc records), the vibrations being stored as magnetic pulses or groove profiles. When sound is converted into a current, it can be represented as a waveform that vibrates over time, and the most natural form of vibration is represented by a sine wave. A sine wave has two parameters: the amplitude, i.e. the peak swing within one cycle, and the frequency. Amplitude corresponds to volume, and frequency corresponds to pitch. In general, the human ear can perceive sine waves ranging from low-frequency sounds of 20 Hz (cycles per second) up to high-frequency sounds of 20,000 Hz. Whether the speech contains an abnormal sound signal can therefore be judged by the detected amplitude, i.e. the volume: an abnormal signal shows up in the audio waveform of the speech as a segment with a steep change. That is, the judgement is whether the speech contains a mute period, or a piercing, loud sound whose volume obviously exceeds the swings of normal speech. Owing to the nature of speech, the mute-phase signals contained in speech are the more numerous, so detecting abnormal signals mainly means detecting the mute-phase signals in the speech.
Further, in one embodiment, the step of judging whether the speech contains an abnormal sound signal includes:
detecting whether the audio waveform of the speech contains a waveform whose audio amplitude is less than a first preset threshold; and
if the audio waveform contains a waveform whose audio amplitude is less than the first preset threshold, determining that the speech contains the abnormal sound signal.
Here, the first preset threshold corresponds to a volume below normal human hearing; a waveform whose audio amplitude is below the first preset threshold describes a low sound, or what is called silence.
Specifically, since the human ear perceives sounds roughly in the range of 20 Hz to 20,000 Hz, for ordinary spoken communication a signal below the audible range appears as a soundless mute phase, while one above the audible range exceeds human hearing. Under normal circumstances, then, such out-of-range signals show up in the audio waveform as segments with steep changes, i.e. as abnormal sound signals contained in the speech. Whether the speech contains a mute-phase signal can be judged by detecting whether the audio waveform of the speech contains a waveform whose audio amplitude is less than the first preset threshold; if it does, it is determined that the speech contains the abnormal sound signal, namely a mute-phase signal.
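The mute-phase check above can be sketched as a simple amplitude scan (an illustrative sketch only, not the patent's implementation; the names `contains_mute_phase`, `first_threshold`, and `min_run` are hypothetical):

```python
def contains_mute_phase(waveform, first_threshold, min_run=3):
    """Return True if the waveform holds `min_run` consecutive samples
    whose absolute amplitude stays below `first_threshold`,
    i.e. a mute-phase candidate."""
    run = 0
    for sample in waveform:
        if abs(sample) < first_threshold:
            run += 1
            if run >= min_run:
                return True
        else:
            run = 0
    return False
```

A real detector would typically work on per-frame energy rather than raw samples, but the threshold comparison is the same idea.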
Further, the abnormal sound signal also includes a shock-wave signal. A shock wave (English: shock wave) is the propagation of a discontinuous peak in a medium; the discontinuous peak causes physical properties of the medium such as pressure, temperature, and density to change in a jump. Whenever the speed of a wave source exceeds the propagation speed of its wave, the wave is called a shock wave. Continuing to refer to Fig. 3, as shown in Fig. 3, for the audio waveform L, the position L1 is a shock wave in the waveform.
In one embodiment, the step of judging whether the speech contains an abnormal sound signal includes:
detecting whether the audio waveform of the speech contains a waveform whose audio amplitude is greater than a second preset threshold.
Here, the second preset threshold corresponds to a volume beyond normal human hearing; a waveform whose audio amplitude is greater than the second preset threshold describes a shock wave, or what is called a high-pitched sound.
Specifically, a signal above the audible range also shows up in the audio waveform as a segment with a steep change, i.e. as an abnormal sound signal contained in the speech. Whether the speech contains a shock-wave signal can be judged by detecting whether the audio waveform of the speech contains a waveform whose audio amplitude is greater than the second preset threshold; if it does, it is determined that the speech contains the abnormal sound signal. This further filters out noise in the speech and improves the accuracy of speech recognition.
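The second-threshold check can be sketched the same way (illustrative only; the name `contains_shock_wave` is hypothetical and not from the patent):

```python
def contains_shock_wave(waveform, second_threshold):
    """Return True if any sample's absolute amplitude exceeds the second
    preset threshold, i.e. a shock-wave candidate."""
    return any(abs(sample) > second_threshold for sample in waveform)
```

In practice the two threshold checks would be run together over the same framed signal, flagging both mute phases and shock waves in one pass.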
S230. If the speech contains the abnormal sound signal, cut the speech by voice activity detection to delete the abnormal sound signal and obtain multiple speech fragments.
Here, voice activity detection (English: Voice Activity Detection, abbreviated VAD), also called speech endpoint detection, can identify and eliminate prolonged mute phases in a speech signal stream.
Cutting a waveform segment is similar to deleting one, except that deleting removes the selected waveform, whereas cutting removes the unselected waveform; the two operations have opposite effects. For example, cutting can be done with GoldWave, a digital recording and editing program, in which the button used to cut a waveform segment is Trim; after cutting, GoldWave automatically enlarges and displays the remaining waveform.
Specifically, since "cutting" is a term of art in waveform splicing, it refers specifically to deleting the unselected waveform. After the abnormal sound signal has been located by voice activity detection, the abnormal sound signal is left unselected and the normal sound signal around it is selected, so that the waveform of the unselected abnormal sound signal is deleted, or cut off, by the cutting operation; what remains is the selected non-abnormal sound signal, i.e. the normal sound signal. The server obtains the speech in a non-streaming manner and identifies the abnormal sound signals in it by voice activity detection, for example identifying whether the speech contains mute-phase signals and shock-wave signals. If the speech contains abnormal sound signals such as mute-phase signals and shock-wave signals, these abnormal signals are left unselected, the normal sound signal outside them is selected, and the speech is then cut to delete the mute-phase and shock signals in it, yielding multiple fragments of normal sound signal. If the speech contains no such abnormal sound signals, the speech is continuous. Continuing to refer to Fig. 3, suppose L in Fig. 3 is the audio waveform of the identity card number "ABCDEFGH". After the speech L of the identity card number is obtained, it is detected whether L contains the abnormal sound signals L1 and L2. If the speech contains L1 and L2, e.g. L is ABC L1 DEF L2 GH, the abnormal signals L1 and L2 in L are identified by voice activity detection; when the waveform segments are cut, L1 and L2 are left unselected and the normal sound signal "ABCDEFGH" outside L1 and L2 is selected, so that the unselected abnormal waveforms L1 and L2 are deleted during the cut and what remains is the selected audio waveform of the normal sound signal "ABCDEFGH". Deleting the abnormal sound signals L1 and L2 from the speech thus yields the speech fragments ABC, DEF, and GH of normal sound signal.
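The cutting step can be sketched as keeping only samples whose amplitude lies between the two preset thresholds, each maximal kept run becoming one fragment (a simplified stand-in for VAD-based cutting; all names are hypothetical):

```python
def cut_speech(waveform, first_threshold, second_threshold):
    """Drop samples that are abnormal (below the first threshold or above
    the second); each maximal run of retained samples becomes one
    speech fragment."""
    fragments, current = [], []
    for sample in waveform:
        if first_threshold <= abs(sample) <= second_threshold:
            current.append(sample)     # normal sound: keep selecting
        elif current:
            fragments.append(current)  # abnormal sample ends a fragment
            current = []
    if current:
        fragments.append(current)
    return fragments
```

Here a low run plays the role of L2 (mute phase) and an over-threshold spike plays the role of L1 (shock wave), mirroring the ABC/DEF/GH example above.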
S240, multiple sound bites are subjected to speech synthesis according to original sequence in each leisure voice to obtain
To new speech.
Wherein, speech synthesis, including parameter synthesis and voice joint.Voice is presented in a manner of waveform, is just spelled for waveform
It connects, waveform concatenation, which refers to, to be spliced between speech waveform segment to export continuous flow, and PSOLA algorithm is waveform concatenation skill
One kind of art.
Specifically, performing speech synthesis on the multiple voice segments according to their original order in the voice to obtain the new voice may be performing waveform concatenation of the multiple voice segments according to their original order in the voice to obtain the new voice; that is, the multiple acquired voice segments are spliced together according to their original order in the voice for subsequent speech recognition. Please continue to refer to Fig. 3. The voice is cut by voice activity detection to cut off the two abnormal sound signals L1 and L2 in the voice, obtaining the multiple voice segments ABC, DEF and GH; the voice segments ABC, DEF and GH are then spliced together to obtain the voice ABCDEFGH, that is, the complete and continuous audio of the ID card number "ABCDEFGH", so that the acoustic model and the language model based on ID card numbers can be fully and efficiently used. In the embodiments of the present application, the ID card voice is recognized as a whole sentence, which can efficiently use the acoustic model and the language model in speech recognition and effectively improve the recognition accuracy of the ID card voice.
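Step S240's splicing then reduces to concatenating the retained segments in their original order. A sketch, assuming the segments are numpy arrays such as those produced by the cutting step:

```python
import numpy as np

def splice(segments):
    """Waveform concatenation: join voice segments in their original order."""
    return np.concatenate(segments)

# the three segments of Fig. 3, represented by toy sample values
abc, def_, gh = np.array([1, 2, 3]), np.array([4, 5, 6]), np.array([7, 8])
new_voice = splice([abc, def_, gh])
print(new_voice.tolist())  # [1, 2, 3, 4, 5, 6, 7, 8]
```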
Further, the waveform concatenation speech synthesis technique directly cascades waveforms in a speech waveform database to output a continuous voice stream. These speech waveforms are taken from the words and sentences of natural speech and implicitly carry the influence of tone, stress and speaking rate, so the synthesized speech is clear and natural. Waveform concatenation speech synthesis techniques include the PSOLA algorithm and the time-frequency interpolation method. The time-frequency interpolation method (in English, Time Frequency Interpolation, abbreviated TFI) is used to realize waveform splicing. In this method, the voice signal is passed through an LPC inverse filter to obtain the excitation source, which is further pitch-marked and transformed to the frequency domain as a so-called prototype; the prototypes are stored, and at synthesis time a prototype is taken out, analyzed and adjusted for prosody accordingly, then converted back to a time-domain signal through an LPC synthesis filter to obtain the synthesized speech.
The PSOLA algorithm, i.e., the Pitch Synchronous Overlap Add technique, is one kind of waveform concatenation technique. It is mainly used for splicing speech waveform segments: the prosodic features of the concatenation units are first adjusted according to the semantics using the PSOLA algorithm, so that the synthesized waveform not only keeps the main segmental features of the original speech primitives, but also makes the prosodic features of the concatenation units conform to the semantics, thereby obtaining very high intelligibility and naturalness. When adjusting the prosodic features of the concatenation units, the waveform is modified in units of pitch periods, taking the integrity of the pitch period as the basic premise for guaranteeing a smooth and continuous waveform and spectrum. PSOLA algorithms include TD-PSOLA and FD-PSOLA. TD-PSOLA includes the following steps:
1) Pitch-synchronous analysis. Accurate pitch-synchronous marks are made on the original speech signal, and the original speech signal is multiplied by a series of pitch-synchronous window functions to obtain a number of overlapping short-time analysis signals. The window function uses a standard Hanning window or Hamming window with a window length of two pitch periods, so that adjacent short-time analysis signals have a 50% overlap. The accuracy of the pitch period and of its initial position is extremely important, as it greatly affects the quality of the synthesized speech.
2) Modification of the intermediate representation. First, according to the pitch curve and suprasegmental features of the original speech waveform and the modification requirements of the target pitch curve and suprasegmental features, the mapping relation of pitch periods between the synthesized waveform and the original waveform is established; then the short-time synthesis signal sequence required for synthesis is determined from this mapping relation.
3) Pitch-synchronous overlap-add. The short-time synthesis signal sequence is arranged synchronously with the target pitch periods and overlap-added to obtain the synthesized waveform; at this point, the synthesized speech waveform has the desired suprasegmental features.
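The three TD-PSOLA steps above can be sketched in a deliberately simplified form: two-pitch-period Hanning-windowed grains are extracted at the analysis pitch marks, each synthesis mark (re-spaced to change the pitch) borrows the nearest analysis grain, and the grains are overlap-added. Real TD-PSOLA uses accurately detected, possibly irregular pitch marks; the constant period and the pitch factor here are assumptions for illustration.

```python
import numpy as np

def td_psola(signal, period, pitch_factor):
    """Simplified TD-PSOLA pitch modification.

    Step 1: window two-pitch-period grains at each analysis pitch mark
            (Hanning window, 50% overlap between adjacent grains).
    Step 2: map synthesis marks, spaced period / pitch_factor, back to the
            nearest analysis mark (the pitch-period mapping relation).
    Step 3: overlap-add the grains at the synthesis marks.
    """
    window = np.hanning(2 * period)
    ana_marks = np.arange(period, len(signal) - period, period)
    syn_period = int(round(period / pitch_factor))
    out = np.zeros(len(signal))
    for syn_mark in range(period, len(signal) - period, syn_period):
        ana_mark = ana_marks[np.argmin(np.abs(ana_marks - syn_mark))]
        grain = signal[ana_mark - period:ana_mark + period] * window
        out[syn_mark - period:syn_mark + period] += grain
    return out

# raise the pitch of a 100-sample-period tone by 25%
tone = np.sin(2 * np.pi * np.arange(8000) / 100)
shifted = td_psola(tone, period=100, pitch_factor=1.25)
```

Because grains are whole pitch periods, the waveform stays smooth at the joins, which is exactly the premise stated above for guaranteeing waveform and spectrum continuity.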
The FD-PSOLA algorithm is similar to the TD-PSOLA algorithm and likewise consists of the three processes of pitch-synchronous analysis, modification of the intermediate representation, and pitch-synchronous overlap-add. TD-PSOLA operates in the time domain and is better suited to changes of duration; when a change of fundamental frequency is involved, especially when the change is large, it easily causes aliasing of the overlapped units. In the FD-PSOLA algorithm, not only can the time scale be changed, but the signal can also be appropriately adjusted in the frequency domain. The specific steps are as follows:
1) A discrete Fourier transform is performed on each short-time analysis signal to obtain the analysis Fourier spectrum of the signal.
2) The spectral envelope and the excitation source spectrum of the short-time analysis Fourier spectrum are separated by homomorphic filtering.
3) The spectrum is compressed or stretched. The excitation source spectrum can be compressed or stretched with linear interpolation, but this processing method easily loses information during the interpolation. Drawing on the sinusoidal model method, the compression and stretching of the excitation source spectrum are instead realized by sampling the excitation source spectrum and the spectral envelope at the new frequency points, obtaining a new Fourier spectrum. This avoids interpolation of complex spectral values and achieves the purpose of modifying the Fourier spectrum by modifying the frequency-axis coordinates and interpolating the spectral envelope.
4) The short-time synthesis signal is obtained.
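Steps 1 and 2 — obtaining the Fourier spectrum and separating envelope from excitation by homomorphic (cepstral) filtering — can be sketched as follows. The cepstral cutoff `lifter_len` is an assumed illustrative value, and this is a bare-bones separation, not a full FD-PSOLA implementation.

```python
import numpy as np

def homomorphic_split(frame, lifter_len=32):
    """Split a short-time spectrum into spectral envelope and excitation
    source spectrum by homomorphic (cepstral) filtering (FD-PSOLA steps 1-2).
    """
    spectrum = np.fft.rfft(frame)                  # step 1: analysis spectrum
    log_mag = np.log(np.abs(spectrum) + 1e-12)
    cepstrum = np.fft.irfft(log_mag)               # real, even-symmetric cepstrum
    low = np.zeros_like(cepstrum)
    low[:lifter_len] = cepstrum[:lifter_len]       # low quefrency -> envelope
    low[-lifter_len + 1:] = cepstrum[-lifter_len + 1:]
    envelope = np.exp(np.fft.rfft(low).real)       # smooth spectral envelope
    excitation = np.abs(spectrum) / envelope       # residual excitation spectrum
    return envelope, excitation

frame = np.hanning(512) * np.sin(2 * np.pi * 40 * np.arange(512) / 512)
envelope, excitation = homomorphic_split(frame)
print(envelope.shape)  # (257,)
```

By construction the product of envelope and excitation reproduces the magnitude spectrum, so step 3 can resample the two factors independently at the new frequency points.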
S250, performing speech recognition on the new voice.
Speech recognition (in English, Automatic Speech Recognition, generally abbreviated ASR) is the process of converting sound into text.
Specifically, please refer to Fig. 4, which is a schematic flowchart of the speech recognition principle in the speech processing method provided by the embodiments of the present application. The speech recognition process includes:
1) Voice input, that is, obtaining the voice, for example, obtaining the new voice after speech synthesis;
2) Encoding, that is, encoding the input voice and performing feature extraction on it through the encoding, for example, performing feature extraction on the new voice;
3) Decoding. The extracted voice features are decoded by an acoustic model and a language model, where the acoustic model has been trained on training data 1 and the language model on training data 2 so as to reach the required effect. Since speech recognition converts speech sound waves into text, given training data for the target voice, a recognition statistical model can be trained; for example, the new voice is decoded in this way;
4) Text output. The voice features decoded by the acoustic model and the language model are converted into text output; for example, the new voice is converted into text, thereby realizing the speech recognition of converting speech into text.
When the embodiments of the present application perform speech recognition, a non-streaming manner is used: after the user records all the voice at once, the voice in the non-streaming manner is obtained first, and whether the voice contains an abnormal sound signal is judged, where the abnormal sound signal includes the mute-phase signal. If the voice contains the abnormal sound signal, the abnormal sound signal is identified by voice activity detection, the voice is then cut to cut off the abnormal sound signal, and multiple voice segments are obtained; speech synthesis is performed on the multiple voice segments according to their original order in the voice to obtain the new voice, and the speech recognition server then performs speech recognition on the new voice in a whole-sentence manner. In this way, the new voice in whole-sentence form can efficiently use the acoustic model and the language model in speech recognition and effectively improve the recognition accuracy and efficiency of the voice.
In one embodiment, before the step of judging whether the voice contains an abnormal sound signal, the method further includes:
detecting whether the voice contains sound by detecting the volume of the voice;
if the voice contains sound, judging whether the voice contains an abnormal sound signal;
if the voice does not contain sound, outputting a prompt to re-enter the voice.
Specifically, when performing speech recognition, recording of the voice may have started, but for various reasons no sound was uttered. If the recording duration exceeds a certain length, this blank voice would also undergo speech recognition as a recorded voice, even though it actually contains no content and is meaningless. Therefore, whether the voice contains sound can be detected first: if no sound is contained, that is, the voice is a blank sound with no utterance, there is no need to perform abnormal sound signal detection on this voice, thereby avoiding the subsequent step of detecting abnormal sound signals in the voice in order to eliminate them. Whether the voice contains sound can be judged from the volume of the voice. If the volume of the voice is lower than what a person can hear, that is, the audio amplitudes in the audio waveform of the voice are all smaller than the first preset threshold, the voice is judged to be silent without sound; no further voice activity detection is performed to detect abnormal sound signals from the voice, and no speech recognition is needed. If it is judged that the voice contains no sound, a prompt to re-enter the voice is output so that the user records the voice again, thereby shortening the speech recognition process and improving the efficiency and accuracy of speech recognition. If the voice contains sound, whether the voice contains an abnormal sound signal is further judged; if the voice contains an abnormal sound signal, the processing for speech recognition is performed to obtain the new voice, and speech recognition is then performed, so as to ensure the accuracy of speech recognition. By first judging whether the voice contains sound, the abnormal case of no sound can be found as early as possible, improving the identification of abnormal voice situations and thus the efficiency of speech recognition.
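The volume check described above amounts to comparing every audio amplitude in the waveform against the first preset threshold. A minimal sketch, where the threshold value and the prompt text are assumptions:

```python
import numpy as np

FIRST_PRESET_THRESHOLD = 0.01  # assumed amplitude floor for audible sound

def contains_sound(wave, threshold=FIRST_PRESET_THRESHOLD):
    """Judge whether the voice contains sound: True unless every audio
    amplitude in the audio waveform is smaller than the threshold."""
    return bool(np.max(np.abs(wave)) >= threshold)

if not contains_sound(np.zeros(16000)):        # a blank recording
    print("Please re-enter the voice.")        # output the re-entry prompt
```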
Further, whether the voice contains natural language may also be detected.
Natural language here refers to human voice.
Specifically, mel-frequency cepstral coefficients can be applied in a hidden Markov model to judge whether the voice is a human voice signal. MFCCs (in English, Mel Frequency Cepstral Coefficients, abbreviated MFCCs or MFCC) are a kind of feature widely used in automatic speech and speaker recognition; an HMM (in English, Hidden Markov Model) is a statistical model used to describe a Markov process containing hidden unknown parameters. MFCCs are used in the training (and recognition) of HMMs: because in an HMM each frame of speech (or each phoneme) has a feature vector, MFCC can be selected to judge whether a voice signal exists in the voice. If there is no voice signal in the voice, no further voice activity detection is performed to detect abnormal sound signals from the voice, and no speech recognition is needed; if the voice contains a voice signal, whether the voice contains an abnormal sound signal is further judged; if the voice contains an abnormal sound signal, the voice is processed to obtain the new voice, and speech recognition is then performed. By first judging whether the voice contains natural language, the abnormal case of no human voice can be found as early as possible, further improving the identification of abnormal voice situations and thus the efficiency of speech recognition.
In one embodiment, the voice includes an ID card number, and the step of performing speech recognition on the new voice includes:
performing speech recognition on the new voice containing the ID card number;
after the step of performing speech recognition on the new voice, the method further includes:
verifying, according to a preset ID card number coding rule, whether the recognized ID card number contains an error;
if the ID card number contains an error, prompting the erroneous ID card number.
The ID card number, also known as the resident identity card number or citizenship number (in English, People's Republic of China resident identity card number), is a feature combination code.
The preset ID card number coding rule is specifically stipulated in GB11643-1999 "Citizenship Number".
Specifically, the voice in the non-streaming manner containing the ID card number is obtained through the input device. If the voice contains the abnormal sound signal, the voice is cut by voice activity detection to delete the abnormal sound signal and obtain multiple voice segments; speech synthesis is performed on the multiple voice segments according to their original order in the voice to obtain the new voice containing the ID card number, and speech recognition is performed on the new voice to recognize the ID card number contained in the voice. For example, in some business scenarios it is often necessary to verify ID card numbers. Since an ID card number is rather long, people usually pause in the middle when entering it by voice; if the captured audio were directly uploaded to the server for recognition in a streaming manner, results could be produced in real time, but the acoustic model and the language model based on ID card numbers could not be fully used, and errors would easily occur. By contrast, when the speech processing method provided by the embodiments of the present application is used to perform speech recognition on the captured voice containing the ID card number, after the recognized ID card number is obtained, whether the recognized ID card number contains an error can further be verified according to the preset ID card number coding rule. If the ID card number contains no error, the speech recognition of the ID card number is shown to be accurate; if the recognized ID card number contains an error, the erroneous ID card number can be prompted so that the user provides the voice containing the ID card number again and the ID card number is re-recognized, thereby improving the efficiency and accuracy of speech recognition of ID card numbers.
Further, the step of verifying, according to the preset ID card number coding rule, whether the recognized ID card number contains an error includes:
judging whether the ID card number contains the preset number of digits, to verify whether the digit count of the ID card number is correct;
identifying the gender of the speaker corresponding to the voice according to preset voice features, to verify whether the sequence bit in the ID card number matches the gender of the speaker;
judging whether the check code of the recognized ID card number calculated according to the check code calculation formula is consistent with the check code contained in the recognized ID card number, to verify whether the check code contained in the recognized ID card number is correct;
if the digit count of the ID card number is correct, the sequence bit in the ID card number matches the gender of the speaker, and the check code of the ID card number is correct, determining that the recognized ID card number contains no error.
Specifically, the standard "Citizenship Number" defines the coding object, number structure and representation form of citizenship numbers, so that each coding object obtains a unique and constant legal number. For example, a citizenship number is a feature combination code consisting of a seventeen-digit body code and a one-digit check code, arranged from left to right as: a six-digit address code, an eight-digit date-of-birth code, a three-digit sequence code and a one-digit check code. Since the coding of ID card numbers has a corresponding rule and structure, whether the recognized ID card number contains an error can be verified according to the coding rule of ID card numbers, which may include the following:
1) Judging whether the ID card number contains the preset number of digits, to verify whether the digit count of the ID card number is correct.
Specifically, if the ID card number contains the preset number of digits, it is determined that the digit count of the recognized ID card number is correct; if the ID card number does not contain the preset number of digits, it is determined that the digit count of the recognized ID card number contains an error. Since an ID card number has eighteen digits in total, whether the digit count of the ID card number is correct can be verified by judging whether the ID card number contains the preset number of digits, that is, judging whether the recognized ID card number contains eighteen digits, so that whether the recognized ID card number is correct is first judged from the digit count. If the recognized ID card number does not have eighteen digits, it can be directly judged that the ID card number produced by speech recognition contains an error, and no subsequent judgment is needed, thereby improving the efficiency of speech recognition; if the recognized ID card number has eighteen digits, it can be preliminarily judged, as far as the digit count is concerned, that the recognized ID card number is correct.
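The preliminary digit-count check can be sketched as a one-line pattern match: seventeen digits followed by a digit or the letter X (the last position may be the check character X, as described below). The function name is illustrative.

```python
import re

def has_valid_length(id_number):
    """Preliminary digit check: 17 digits followed by a digit or 'X'/'x'."""
    return re.fullmatch(r"\d{17}[\dXx]", id_number) is not None

print(has_valid_length("11010519491231002X"))  # True: 18 characters, well-formed
print(has_valid_length("1101051949123"))       # False: not 18 characters
```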
2) Identifying the gender of the speaker corresponding to the voice according to preset voice features, to verify whether the sequence bit in the ID card number matches the gender of the speaker.
The preset voice features include voice features such as fundamental frequency, spectrum, sound frequency and amplitude, which are used to distinguish a male voice from a female voice. The fundamental tone, as the name suggests, is the basis of the sound; the frequency of vocal cord vibration is called the fundamental frequency. The fundamental frequency is closely related to the structure of a person's vocal cords, so it can be used to identify the source of pronunciation. In general, the fundamental frequency of male speakers is lower and that of female speakers is relatively high. Since there is a large difference between the fundamental frequencies of male and female voices, male and female voices can be identified based on the fundamental frequency.
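A fundamental-frequency-based gender hint can be sketched with an autocorrelation pitch estimator: the lag with the strongest self-similarity within the plausible human pitch range gives the fundamental frequency, and a dividing line separates typically male from typically female pitch. The 165 Hz boundary and the pitch range are assumed illustrative values, not from the patent.

```python
import numpy as np

def estimate_f0(wave, sr=16000, f0_min=60, f0_max=400):
    """Estimate fundamental frequency by autocorrelation over the
    plausible human pitch range."""
    lags = np.arange(int(sr / f0_max), int(sr / f0_min))
    ac = np.array([np.sum(wave[:-lag] * wave[lag:]) for lag in lags])
    return sr / lags[np.argmax(ac)]

def likely_gender(wave, sr=16000, boundary_hz=165):
    """Lower fundamental frequency suggests a male speaker, higher a
    female speaker; 165 Hz is an assumed dividing line."""
    return "male" if estimate_f0(wave, sr) < boundary_hz else "female"

t = np.arange(16000) / 16000
print(likely_gender(np.sin(2 * np.pi * 120 * t)))  # male   (120 Hz tone)
print(likely_gender(np.sin(2 * np.pi * 220 * t)))  # female (220 Hz tone)
```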
Specifically, if the sequence bit in the ID card number matches the gender of the speaker, it is determined that the sequence bit of the recognized ID card number contains no error; if the sequence bit in the ID card number does not match the gender of the speaker, it is determined that the sequence bit of the recognized ID card number contains an error. The 15th to 17th digits of an ID card number are the sequence numbers compiled for persons born in the same year, month and day within the area identified by the address code, where an odd 17th digit is assigned to males and an even one to females. Therefore, by identifying whether the sound in the voice is a male voice or a female voice, the gender of the speaker corresponding to the voice can be identified according to the preset voice features, so as to verify whether the sequence bit in the ID card number matches the gender of the speaker. If the identified sound is female and the sequence bit is even, or the identified sound is male and the sequence bit is odd, it is judged that the gender of the speaker corresponding to the voice matches the sequence bit in the ID card number; the two are consistent, and it can further be judged that the recognition of the sequence bit of the ID card number is accurate. If the identified sound is female but the sequence bit is odd, or the identified sound is male but the sequence bit is even, it is judged that the gender of the speaker corresponding to the voice does not match the sequence bit in the ID card number; the two are inconsistent, which may mean that an error occurred in the speech recognition process of the ID card number, or that a male used a female's ID card number for speech recognition, or that a female used a male's ID card number. Through the verification of the sequence bit, the correctness of the speech recognition can be judged and the efficiency of ID card number speech recognition improved; to a certain extent an anti-counterfeiting effect for ID card numbers can also be achieved, preventing a person of the opposite gender from using an ID card number for identity verification or identification.
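The parity rule for the sequence bit can be sketched directly: the 17th character of an 18-digit number is odd for males and even for females. The function name and the detected-gender strings are illustrative; the sample number is a commonly used illustrative ID card number.

```python
def sequence_bit_matches_gender(id_number, detected_gender):
    """Check the 17th digit of an 18-character ID card number against the
    speaker's detected gender: odd digits are assigned to males, even to
    females."""
    seq_digit = int(id_number[16])   # 17th character, zero-based index 16
    expected = "male" if seq_digit % 2 == 1 else "female"
    return expected == detected_gender

print(sequence_bit_matches_gender("11010519491231002X", "female"))  # True: '2' is even
```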
Further, under normal circumstances, since male and female voices have dramatically different features, a corresponding neural network model can be trained on training data to judge whether a sound is a male voice or a female voice, that is, through neural network classification, thereby improving the efficiency of male and female voice identification.
3) Judging whether the check code of the recognized ID card number calculated according to the check code calculation formula is consistent with the check code contained in the recognized ID card number, to verify whether the check code of the recognized ID card number is correct.
The check code is calculated by the numbering unit according to the unified formula specified in the standard "Citizenship Number": it is the check code calculated from the first seventeen digits of the ID card number according to the ISO 7064:1983 MOD 11-2 check code scheme.
Specifically, since the check code, as the last digit, is calculated by the numbering unit with a unified formula, the check code of the recognized ID card number is calculated according to the check code calculation formula and compared with the check code contained in the recognized ID card number, to verify whether the check code of the recognized ID card number is correct. If the check code calculated according to the check code calculation formula is consistent with the check code contained in the recognized ID card number, it is determined that the check code contained in the recognized ID card number is correct; if they are inconsistent, it is determined that the check code contained in the recognized ID card number contains an error.
If the ID card number contains the preset number of digits, the sequence bit in the ID card number matches the gender of the speaker, and the check code of the recognized ID card number calculated according to the check code calculation formula is consistent with the check code contained in the recognized ID card number — that is, if the digit count of the ID card number is correct, the sequence bit in the ID card number matches the gender of the speaker, and the check code of the ID card number is correct — it is determined that the recognized ID card number contains no error. If the recognized ID card number contains none of the above errors, it can be judged that the recognition of the ID card number is accurate; if the recognized ID card number contains any of the above errors, the speech recognition is shown to be inaccurate, and the user needs to be prompted to re-enter the voice containing the ID card number so as to perform speech recognition again, thereby improving the accuracy of speech recognition of ID card numbers as much as possible. Conversely, if the ID card number does not contain the preset number of digits, or the sequence bit in the ID card number does not match the gender of the speaker, or the check code calculated according to the check code calculation formula is inconsistent with the check code contained in the recognized ID card number, it is determined that the recognized ID card number contains an error.
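The ISO 7064 MOD 11-2 check code named above can be computed as a weighted sum of the first seventeen digits modulo 11, mapped to the check character table. The sample value is a commonly used illustrative ID card number, not a real person's.

```python
def gb11643_check_code(first17):
    """Compute the GB11643-1999 check code (ISO 7064 MOD 11-2) from the
    first 17 digits of a citizenship number."""
    weights = [7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2]
    codes = "10X98765432"   # check character for remainder 0..10
    total = sum(int(d) * w for d, w in zip(first17, weights))
    return codes[total % 11]

def check_code_correct(id_number):
    """Compare the recognized number's last digit with the computed code."""
    return id_number[17].upper() == gb11643_check_code(id_number[:17])

print(check_code_correct("11010519491231002X"))  # True for this sample number
```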
Further, the step of prompting the erroneous ID card number includes:
prompting, for the erroneous ID card number, the specific problem corresponding to the error.
Specifically, the error of the recognized ID card number is prompted in a targeted way. For example, if the digit count of the ID card number is detected to be incorrect, the digit count error of the ID card number is prompted, so that the user notices whether the digit count of the ID card number was entered incorrectly during voice input. If it is identified, according to the preset voice features, that the gender of the speaker corresponding to the voice does not match the sequence bit in the ID card number — since an odd 17th digit of an ID card number is assigned to males and an even one to females — the user can be made to notice whether the 17th digit of the ID card number was entered incorrectly. If the check code of the ID card number is detected to be incorrect, the user can be made to notice whether the last check digit of the ID card number was entered incorrectly. Targeted prompting can improve the accuracy of the user's voice input, and thereby the efficiency and accuracy of speech recognition.
It should be noted that the technical features included in the different embodiments of the speech processing method described above can be recombined as needed to obtain combined embodiments, all of which fall within the protection scope claimed by the present application.
Please refer to Fig. 5, which is a schematic block diagram of the speech processing apparatus provided by the embodiments of the present application. Corresponding to the above speech processing method, the embodiments of the present application also provide a speech processing apparatus. As shown in Fig. 5, the speech processing apparatus includes units for executing the above speech processing method, and the apparatus can be configured in a computer device such as a server. Specifically, referring to Fig. 5, the speech processing apparatus 500 includes an acquiring unit 501, a judging unit 502, a cutting unit 503, a synthesis unit 504 and a recognition unit 505.
The acquiring unit 501 is used to obtain a voice in a non-streaming manner through an input device;
the judging unit 502 is used to judge whether the voice contains an abnormal sound signal, the abnormal sound signal including a mute-phase signal;
the cutting unit 503 is used to cut the voice by voice activity detection to delete the abnormal sound signal and obtain multiple voice segments, if the voice contains the abnormal sound signal;
the synthesis unit 504 is used to perform speech synthesis on the multiple voice segments according to their original order in the voice to obtain a new voice;
the recognition unit 505 is used to perform speech recognition on the new voice.
Please refer to Fig. 6, which is another schematic block diagram of the speech processing apparatus provided by the embodiments of the present application. As shown in Fig. 6, in this embodiment the speech processing apparatus 500 further includes:
a detection unit 506, used to detect whether the voice contains sound by detecting the volume of the voice; to judge, if the voice contains sound, whether the voice contains an abnormal sound signal; and to output a prompt to re-enter the voice if the voice does not contain sound.
Please continue to refer to Fig. 6. The judging unit 502 includes:
a detection subunit 5021, used to detect whether the audio waveform of the voice contains a waveform whose audio amplitude is smaller than the first preset threshold;
a first determination subunit 5022, used to determine that the voice contains the abnormal sound signal if the audio waveform contains a waveform whose audio amplitude is smaller than the first preset threshold.
In one embodiment, the synthesis unit 504 is used to perform waveform concatenation of the multiple voice segments according to their original order in the voice to obtain the new voice.
Please continue to refer to Fig. 6. As shown in Fig. 6, the voice includes an ID card number, and the recognition unit 505 is used to perform speech recognition on the new voice containing the ID card number;
the speech processing apparatus 500 further includes:
a verification unit 507, used to verify, according to the preset ID card number coding rule, whether the recognized ID card number contains an error;
a prompt unit 508, used to prompt the erroneous ID card number if the ID card number contains an error.
Please continue to refer to Fig. 6. As shown in Fig. 6, the verification unit 507 includes:
a first verification subunit 5071, used to judge whether the ID card number contains the preset number of digits, to verify whether the digit count of the ID card number is correct;
a second verification subunit 5072, used to identify the gender of the speaker corresponding to the voice according to the preset voice features, to verify whether the sequence bit in the ID card number matches the gender of the speaker;
a third verification subunit 5073, used to judge whether the check code of the recognized ID card number calculated according to the check code calculation formula is consistent with the check code contained in the recognized ID card number, to verify whether the check code contained in the recognized ID card number is correct;
a second determination subunit 5074, used to determine that the recognized ID card number contains no error if the digit count of the ID card number is correct, the sequence bit in the ID card number matches the gender of the speaker, and the check code of the ID card number is correct.
In one embodiment, the prompt unit 508 is used to prompt, for the erroneous ID card number, the specific problem corresponding to the error.
It should be noted that, as is apparent to those skilled in the art, the specific implementation processes of the above speech processing apparatus and of each unit may refer to the corresponding descriptions in the foregoing method embodiments; for convenience and brevity of description, they are not repeated here.
Meanwhile the division of each unit and connection type are only used for for example, at other in above-mentioned voice processing apparatus
In embodiment, voice processing apparatus can be divided into as required to different units, it can also be by each unit in voice processing apparatus
The different order of connection and mode are taken, to complete all or part of function of above-mentioned voice processing apparatus.
The above speech processing apparatus may be implemented in the form of a computer program, and the computer program may run on a computer device as shown in Fig. 7.
Referring to Fig. 7, Fig. 7 is a schematic block diagram of a computer device provided by an embodiment of the present application. The computer device 700 may be a desktop computer, a server, or a similar computer device, or may be a component or part of other equipment.
Referring to Fig. 7, the computer device 700 includes a processor 702, a memory, and a network interface 705 connected via a system bus 701, where the memory may include a non-volatile storage medium 703 and an internal memory 704.
The non-volatile storage medium 703 may store an operating system 7031 and a computer program 7032. When the computer program 7032 is executed, the processor 702 may be caused to perform the above speech processing method.
The processor 702 provides computing and control capabilities to support the operation of the entire computer device 700.
The internal memory 704 provides an environment for running the computer program 7032 stored in the non-volatile storage medium 703; when the computer program 7032 is executed by the processor 702, the processor 702 may be caused to perform the above speech processing method.
The network interface 705 is used for network communication with other devices. Those skilled in the art will understand that the structure shown in Fig. 7 is merely a block diagram of the part of the structure related to the solution of the present application and does not limit the computer device 700 to which the solution is applied; a specific computer device 700 may include more or fewer components than shown, combine certain components, or have a different component arrangement. For example, in some embodiments the computer device may include only a memory and a processor; in such embodiments, the structures and functions of the memory and the processor are consistent with the embodiment shown in Fig. 7 and are not described again here.
The processor 702 is configured to run the computer program 7032 stored in the memory, so as to implement the following steps: obtaining a voice in non-streaming mode through an input device; judging whether the voice contains an abnormal sound signal, the abnormal sound signal including a silent-period signal; if the voice contains the abnormal sound signal, cutting the voice by voice activity detection to delete the abnormal sound signal, obtaining a plurality of voice fragments; performing speech synthesis on the plurality of voice fragments in their original order in the voice to obtain a new voice; and performing speech recognition on the new voice.
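The cut-delete-rejoin pipeline of these steps can be sketched as follows. This is an illustrative sketch only, not the patent's implementation: the frame length, the energy threshold, and the representation of the voice as a list of normalized float samples are all assumed for the example.

```python
def frame_energy(frame):
    """Mean energy of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)

def remove_silence(samples, frame_len=4, threshold=0.01):
    """Drop low-energy (silent-period) frames and re-join the
    remaining fragments in their original order; the 'speech
    synthesis' of this embodiment is plain re-joining, not
    generative synthesis. frame_len and threshold are assumed."""
    kept = []
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        if frame_energy(frame) >= threshold:
            kept.extend(frame)
    return kept
```

The resulting `kept` signal would then be passed to a speech recognizer in place of the raw recording.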
In one embodiment, before implementing the step of judging whether the voice contains an abnormal sound signal, the processor 702 further implements the following steps:
detecting whether the voice contains sound by detecting the volume of the voice;
if the voice contains sound, judging whether the voice contains an abnormal sound signal;
if the voice contains no sound, outputting a prompt to re-input the voice.
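The volume check above can be sketched as an overall RMS test. This is an assumption for illustration: the patent does not specify how volume is measured, and the threshold value is invented for the example.

```python
import math

def contains_sound(samples, volume_threshold=0.02):
    """Decide whether the recording contains any sound at all by
    its overall RMS volume; below the threshold the caller would
    prompt the user to re-input the voice. Threshold is assumed."""
    if not samples:
        return False
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms >= volume_threshold
```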
In one embodiment, when implementing the step of judging whether the voice contains an abnormal sound signal, the processor 702 specifically implements the following steps:
detecting whether the audio waveform of the voice contains a waveform whose audio amplitude is less than a first preset threshold;
if the audio waveform contains a waveform whose audio amplitude is less than the first preset threshold, determining that the voice contains the abnormal sound signal.
In one embodiment, when implementing the step of performing speech synthesis on the plurality of voice fragments in their original order in the voice to obtain a new voice, the processor 702 specifically implements the following step:
concatenating the waveforms of the plurality of voice fragments in their original order in the voice to obtain the new voice.
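The waveform concatenation of this embodiment amounts to a plain end-to-end join, preserving fragment order. Representing each fragment as a list of samples is an assumption for the example.

```python
def concatenate_fragments(fragments):
    """Join voice fragments end to end in their original order;
    the 'speech synthesis' here is waveform concatenation,
    not generative synthesis."""
    new_voice = []
    for frag in fragments:
        new_voice.extend(frag)
    return new_voice
```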
In one embodiment, the voice contains an ID card number, and when implementing the step of performing speech recognition on the new voice, the processor 702 specifically implements the following step:
performing speech recognition on the new voice containing the ID card number.
After implementing the step of performing speech recognition on the new voice, the processor 702 further implements the following steps:
verifying, according to a preset ID card number coding rule, whether the recognized ID card number contains an error;
if the ID card number contains an error, prompting the erroneous ID card number.
In one embodiment, when implementing the step of verifying, according to the preset ID card number coding rule, whether the recognized ID card number contains an error, the processor 702 specifically implements the following steps:
judging whether the ID card number contains a preset number of digits, so as to verify whether the digit count of the ID card number is correct;
identifying the gender of the speaker corresponding to the voice according to preset voice features, so as to verify whether the sequence digit in the ID card number matches the gender of the speaker;
judging whether the check code calculated from the recognized ID card number according to the check-code calculation formula is consistent with the check code contained in the recognized ID card number, so as to verify whether the check code contained in the recognized ID card number is correct;
if the digit count of the ID card number is correct, the sequence digit in the ID card number matches the gender of the speaker, and the check code of the ID card number is correct, determining that the recognized ID card number contains no error.
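The three checks above can be sketched against China's public citizen-ID format (GB 11643-1999), which is assumed here to be the "preset coding rule": 18 characters, a 17th (sequence) digit that is odd for males and even for females, and a final check character computed by a weighted mod-11 formula. The speaker's gender is taken as an input, standing in for the voice-feature classification the embodiment performs.

```python
def verify_id_number(id_no, speaker_is_male):
    """Verify digit count, gender consistency of the sequence
    digit, and the mod-11 check code of an 18-character ID number.
    Assumes the GB 11643-1999 rule; not taken from the patent."""
    # Rule 1: 18 characters, first 17 must be digits.
    if len(id_no) != 18 or not id_no[:17].isdigit():
        return False
    # Rule 2: sequence digit (17th) is odd for males, even for females.
    if (int(id_no[16]) % 2 == 1) != speaker_is_male:
        return False
    # Rule 3: weighted mod-11 check code over the first 17 digits.
    weights = [7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2]
    check_map = "10X98765432"  # remainder 0..10 -> check character
    total = sum(int(d) * w for d, w in zip(id_no[:17], weights))
    return check_map[total % 11] == id_no[17].upper()
```

For example, the commonly cited sample number 11010519491231002X has an even sequence digit (female) and check character X, so it passes only when the detected speaker gender is female.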
In one embodiment, when implementing the step of prompting the erroneous ID card number, the processor 702 specifically implements the following step:
prompting the specific problem corresponding to the erroneous ID card number.
It should be understood that in the embodiments of the present application, the processor 702 may be a central processing unit (Central Processing Unit, CPU), or may be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
Those of ordinary skill in the art will appreciate that all or part of the processes of the methods in the above embodiments may be implemented by a computer program, which may be stored in a computer-readable storage medium. The computer program is executed by at least one processor in the computer system to implement the process steps of the above method embodiments.
Therefore, the present application further provides a computer-readable storage medium. The computer-readable storage medium may be a non-volatile computer-readable storage medium, and stores a computer program which, when executed by a processor, causes the processor to execute the steps of the speech processing method described in the above embodiments.
The present application further provides a computer program product which, when run on a computer, causes the computer to execute the steps of the speech processing method described in the above embodiments.
The computer-readable storage medium may be an internal storage unit of the aforementioned device, such as a hard disk or memory of the device. The computer-readable storage medium may also be an external storage device of the device, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, or a flash card (Flash Card) equipped on the device. Further, the computer-readable storage medium may include both an internal storage unit of the device and an external storage device.
It will be apparent to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described devices, apparatuses, and units may refer to the corresponding processes in the foregoing method embodiments and are not described again here.
The computer-readable storage medium may be a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a magnetic disk, an optical disk, or any other computer-readable storage medium that can store program code.
Those of ordinary skill in the art may appreciate that the units and algorithm steps described in connection with the embodiments disclosed herein can be implemented by electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are implemented in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, and the division of units is only a division by logical function; there may be other division manners in actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed.
The steps in the methods of the embodiments of the present application may be reordered, combined, and deleted according to actual needs. The units in the apparatuses of the embodiments of the present application may be combined, divided, and deleted according to actual needs. In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing an electronic device (which may be a personal computer, a terminal, a network device, or the like) to execute all or part of the steps of the methods of the embodiments of the present application.
The above are only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any person skilled in the art can readily conceive of various equivalent modifications or replacements within the technical scope disclosed in the present application, and such modifications or replacements shall all fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (10)
1. A speech processing method, characterized in that the method comprises:
obtaining a voice in non-streaming mode through an input device;
judging whether the voice contains an abnormal sound signal, the abnormal sound signal including a silent-period signal;
if the voice contains the abnormal sound signal, cutting the voice by voice activity detection to delete the abnormal sound signal, obtaining a plurality of voice fragments;
performing speech synthesis on the plurality of voice fragments in their original order in the voice to obtain a new voice;
performing speech recognition on the new voice.
2. The speech processing method according to claim 1, characterized in that, before the step of judging whether the voice contains an abnormal sound signal, the method further comprises:
detecting whether the voice contains sound by detecting the volume of the voice;
if the voice contains sound, judging whether the voice contains an abnormal sound signal;
if the voice contains no sound, outputting a prompt to re-input the voice.
3. The speech processing method according to claim 1 or 2, characterized in that the step of judging whether the voice contains an abnormal sound signal comprises:
detecting whether the audio waveform of the voice contains a waveform whose audio amplitude is less than a first preset threshold;
if the audio waveform contains a waveform whose audio amplitude is less than the first preset threshold, determining that the voice contains the abnormal sound signal.
4. The speech processing method according to claim 1, characterized in that the step of performing speech synthesis on the plurality of voice fragments in their original order in the voice to obtain a new voice comprises:
concatenating the waveforms of the plurality of voice fragments in their original order in the voice to obtain the new voice.
5. The speech processing method according to claim 1, characterized in that the voice contains an ID card number, and the step of performing speech recognition on the new voice comprises:
performing speech recognition on the new voice containing the ID card number;
after the step of performing speech recognition on the new voice, the method further comprises:
verifying, according to a preset ID card number coding rule, whether the recognized ID card number contains an error;
if the ID card number contains an error, prompting the erroneous ID card number.
6. The speech processing method according to claim 5, characterized in that the step of verifying, according to the preset ID card number coding rule, whether the recognized ID card number contains an error comprises:
judging whether the ID card number contains a preset number of digits, so as to verify whether the digit count of the ID card number is correct;
identifying the gender of the speaker corresponding to the voice according to preset voice features, so as to verify whether the sequence digit in the ID card number matches the gender of the speaker;
judging whether the check code calculated from the recognized ID card number according to the check-code calculation formula is consistent with the check code contained in the recognized ID card number, so as to verify whether the check code contained in the recognized ID card number is correct;
if the digit count of the ID card number is correct, the sequence digit in the ID card number matches the gender of the speaker, and the check code of the ID card number is correct, determining that the recognized ID card number contains no error.
7. The speech processing method according to claim 5 or 6, characterized in that the step of prompting the erroneous ID card number comprises:
prompting the specific problem corresponding to the erroneous ID card number.
8. A speech processing apparatus, characterized by comprising:
an acquiring unit, configured to obtain a voice in non-streaming mode through an input device;
a judging unit, configured to judge whether the voice contains an abnormal sound signal, the abnormal sound signal including a silent-period signal;
a cutting unit, configured to, if the voice contains the abnormal sound signal, cut the voice by voice activity detection to delete the abnormal sound signal, obtaining a plurality of voice fragments;
a synthesis unit, configured to perform speech synthesis on the plurality of voice fragments in their original order in the voice to obtain a new voice;
a recognition unit, configured to perform speech recognition on the new voice.
9. A computer device, characterized in that the computer device comprises a memory and a processor connected to the memory; the memory is configured to store a computer program; the processor is configured to run the computer program stored in the memory to execute the steps of the speech processing method according to any one of claims 1-7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to execute the steps of the speech processing method according to any one of claims 1-7.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910374806.XA CN110148402A (en) | 2019-05-07 | 2019-05-07 | Method of speech processing, device, computer equipment and storage medium |
PCT/CN2019/117786 WO2020224217A1 (en) | 2019-05-07 | 2019-11-13 | Speech processing method and apparatus, computer device, and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910374806.XA CN110148402A (en) | 2019-05-07 | 2019-05-07 | Method of speech processing, device, computer equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110148402A true CN110148402A (en) | 2019-08-20 |
Family
ID=67594842
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910374806.XA Pending CN110148402A (en) | 2019-05-07 | 2019-05-07 | Method of speech processing, device, computer equipment and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110148402A (en) |
WO (1) | WO2020224217A1 (en) |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110853622A (en) * | 2019-10-22 | 2020-02-28 | 深圳市本牛科技有限责任公司 | Method and system for sentence segmentation by voice |
CN110992989A (en) * | 2019-12-06 | 2020-04-10 | 广州国音智能科技有限公司 | Voice acquisition method and device and computer readable storage medium |
CN111583934A (en) * | 2020-04-30 | 2020-08-25 | 联想(北京)有限公司 | Data processing method and device |
CN111627453A (en) * | 2020-05-13 | 2020-09-04 | 广州国音智能科技有限公司 | Public security voice information management method, device, equipment and computer storage medium |
CN111863041A (en) * | 2020-07-17 | 2020-10-30 | 东软集团股份有限公司 | Sound signal processing method, device and equipment |
CN111912519A (en) * | 2020-07-21 | 2020-11-10 | 国网安徽省电力有限公司 | Transformer fault diagnosis method and device based on voiceprint frequency spectrum separation |
WO2020224217A1 (en) * | 2019-05-07 | 2020-11-12 | 平安科技(深圳)有限公司 | Speech processing method and apparatus, computer device, and storage medium |
CN111953727A (en) * | 2020-05-06 | 2020-11-17 | 上海明略人工智能(集团)有限公司 | Audio transmission method and device |
CN112434561A (en) * | 2020-11-03 | 2021-03-02 | 中国工程物理研究院电子工程研究所 | Method for automatically judging shock wave signal validity |
CN113542724A (en) * | 2020-04-16 | 2021-10-22 | 福建天泉教育科技有限公司 | Automatic detection method and system for video resources |
CN114121050A (en) * | 2021-11-30 | 2022-03-01 | 云知声智能科技股份有限公司 | Audio playing method and device, electronic equipment and storage medium |
CN114898755A (en) * | 2022-07-14 | 2022-08-12 | 科大讯飞股份有限公司 | Voice processing method and related device, electronic equipment and storage medium |
CN115565539A (en) * | 2022-11-21 | 2023-01-03 | 中网道科技集团股份有限公司 | Data processing method for realizing self-help correction terminal anti-counterfeiting identity verification |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105336322A (en) * | 2015-09-30 | 2016-02-17 | 百度在线网络技术(北京)有限公司 | Polyphone model training method, and speech synthesis method and device |
CN105427870A (en) * | 2015-12-23 | 2016-03-23 | 北京奇虎科技有限公司 | Voice recognition method and device aiming at pauses |
CN108564954A (en) * | 2018-03-19 | 2018-09-21 | 平安科技(深圳)有限公司 | Deep neural network model, electronic device, auth method and storage medium |
CN108564955A (en) * | 2018-03-19 | 2018-09-21 | 平安科技(深圳)有限公司 | Electronic device, auth method and computer readable storage medium |
CN109192194A (en) * | 2018-08-22 | 2019-01-11 | 北京百度网讯科技有限公司 | Voice data mask method, device, computer equipment and storage medium |
CN109389452A (en) * | 2017-08-10 | 2019-02-26 | 阿里巴巴集团控股有限公司 | The method and device of voice sale |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103581158A (en) * | 2012-08-10 | 2014-02-12 | 百度在线网络技术(北京)有限公司 | Method and system for processing voice data |
CN103400580A (en) * | 2013-07-23 | 2013-11-20 | 华南理工大学 | Method for estimating importance degree of speaker in multiuser session voice |
CN109065076B (en) * | 2018-09-05 | 2020-11-27 | 深圳追一科技有限公司 | Audio label setting method, device, equipment and storage medium |
CN109360551B (en) * | 2018-10-25 | 2021-02-05 | 珠海格力电器股份有限公司 | Voice recognition method and device |
CN109545246A (en) * | 2019-01-21 | 2019-03-29 | 维沃移动通信有限公司 | A kind of sound processing method and terminal device |
CN110148402A (en) * | 2019-05-07 | 2019-08-20 | 平安科技(深圳)有限公司 | Method of speech processing, device, computer equipment and storage medium |
CN110619897A (en) * | 2019-08-02 | 2019-12-27 | 精电有限公司 | Conference summary generation method and vehicle-mounted recording system |
- 2019-05-07 CN CN201910374806.XA patent/CN110148402A/en active Pending
- 2019-11-13 WO PCT/CN2019/117786 patent/WO2020224217A1/en active Application Filing
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020224217A1 (en) * | 2019-05-07 | 2020-11-12 | 平安科技(深圳)有限公司 | Speech processing method and apparatus, computer device, and storage medium |
CN110853622B (en) * | 2019-10-22 | 2024-01-12 | 深圳市本牛科技有限责任公司 | Voice sentence breaking method and system |
CN110853622A (en) * | 2019-10-22 | 2020-02-28 | 深圳市本牛科技有限责任公司 | Method and system for sentence segmentation by voice |
CN110992989A (en) * | 2019-12-06 | 2020-04-10 | 广州国音智能科技有限公司 | Voice acquisition method and device and computer readable storage medium |
CN110992989B (en) * | 2019-12-06 | 2022-05-27 | 广州国音智能科技有限公司 | Voice acquisition method and device and computer readable storage medium |
CN113542724A (en) * | 2020-04-16 | 2021-10-22 | 福建天泉教育科技有限公司 | Automatic detection method and system for video resources |
CN113542724B (en) * | 2020-04-16 | 2023-09-15 | 福建天泉教育科技有限公司 | Automatic detection method and system for video resources |
CN111583934A (en) * | 2020-04-30 | 2020-08-25 | 联想(北京)有限公司 | Data processing method and device |
CN111953727A (en) * | 2020-05-06 | 2020-11-17 | 上海明略人工智能(集团)有限公司 | Audio transmission method and device |
CN111627453B (en) * | 2020-05-13 | 2024-02-09 | 广州国音智能科技有限公司 | Public security voice information management method, device, equipment and computer storage medium |
CN111627453A (en) * | 2020-05-13 | 2020-09-04 | 广州国音智能科技有限公司 | Public security voice information management method, device, equipment and computer storage medium |
CN111863041A (en) * | 2020-07-17 | 2020-10-30 | 东软集团股份有限公司 | Sound signal processing method, device and equipment |
CN111912519A (en) * | 2020-07-21 | 2020-11-10 | 国网安徽省电力有限公司 | Transformer fault diagnosis method and device based on voiceprint frequency spectrum separation |
CN112434561A (en) * | 2020-11-03 | 2021-03-02 | 中国工程物理研究院电子工程研究所 | Method for automatically judging shock wave signal validity |
CN112434561B (en) * | 2020-11-03 | 2023-09-22 | 中国工程物理研究院电子工程研究所 | Method for automatically judging effectiveness of shock wave signal |
CN114121050A (en) * | 2021-11-30 | 2022-03-01 | 云知声智能科技股份有限公司 | Audio playing method and device, electronic equipment and storage medium |
CN114898755A (en) * | 2022-07-14 | 2022-08-12 | 科大讯飞股份有限公司 | Voice processing method and related device, electronic equipment and storage medium |
CN115565539B (en) * | 2022-11-21 | 2023-02-07 | 中网道科技集团股份有限公司 | Data processing method for realizing self-help correction terminal anti-counterfeiting identity verification |
CN115565539A (en) * | 2022-11-21 | 2023-01-03 | 中网道科技集团股份有限公司 | Data processing method for realizing self-help correction terminal anti-counterfeiting identity verification |
Also Published As
Publication number | Publication date |
---|---|
WO2020224217A1 (en) | 2020-11-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110148402A (en) | Method of speech processing, device, computer equipment and storage medium | |
CN102254553B (en) | The automatic normalization of spoken syllable duration | |
CN108806696B (en) | Method and device for establishing voiceprint model, computer equipment and storage medium | |
Kinnunen | Spectral features for automatic text-independent speaker recognition | |
US5729694A (en) | Speech coding, reconstruction and recognition using acoustics and electromagnetic waves | |
CN101578659A (en) | Voice tone converting device and voice tone converting method | |
US20030097254A1 (en) | Ultra-narrow bandwidth voice coding | |
US9508338B1 (en) | Inserting breath sounds into text-to-speech output | |
JPH10507536A (en) | Language recognition | |
US11727949B2 (en) | Methods and apparatus for reducing stuttering | |
JP2006267465A (en) | Uttering condition evaluating device, uttering condition evaluating program, and program storage medium | |
Gallardo | Human and automatic speaker recognition over telecommunication channels | |
Stupakov et al. | The design and collection of COSINE, a multi-microphone in situ speech corpus recorded in noisy environments | |
JPWO2006083020A1 (en) | Speech recognition system for generating response speech using extracted speech data | |
US20210118464A1 (en) | Method and apparatus for emotion recognition from speech | |
CN107004428A (en) | Session evaluating apparatus and method | |
CN112908302B (en) | Audio processing method, device, equipment and readable storage medium | |
Mandel et al. | Audio super-resolution using concatenative resynthesis | |
EP2541544A1 (en) | Voice sample tagging | |
Westall et al. | Speech technology for telecommunications | |
Nthite et al. | End-to-End Text-To-Speech synthesis for under resourced South African languages | |
JPS59501520A (en) | Device for articulatory speech recognition | |
Gallardo | Human and automatic speaker recognition over telecommunication channels | |
JP7296214B2 (en) | speech recognition system | |
Medhi et al. | Different acoustic feature parameters ZCR, STE, LPC and MFCC analysis of Assamese vowel phonemes |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||