CN111613211B - Method and device for processing specific word voice - Google Patents
- Publication number: CN111613211B
- Application number: CN202010307655.9A
- Authority: CN (China)
- Prior art keywords: voice; trained; tested; net model; masking value
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/063—Training (creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
- G10L15/08—Speech classification or search
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L21/0208—Noise filtering (speech enhancement, e.g. noise reduction or echo cancellation)
- G10L25/51—Speech or voice analysis techniques specially adapted for comparison or discrimination
- G10L25/78—Detection of presence or absence of voice signals
- G10L2015/0631—Creating reference templates; clustering
- G10L2021/02082—Noise filtering, the noise being echo or reverberation of the speech
- Y02T10/40—Engine management systems (climate-change mitigation tagging for transportation)
Abstract
The invention relates to a method and a device for processing specific word voice. The method comprises the following steps: acquiring a voice to be trained with noise; extracting a first feature of the voice to be trained; inputting the first feature into a U-NET model to be trained to obtain a target U-NET model; acquiring a voice to be tested and extracting a second feature of the voice to be tested; and inputting the second feature into the target U-NET model to judge whether specific word voice exists in the voice to be tested and to obtain a noise-reduced version of the voice to be tested. The technical scheme of the invention fully and effectively improves both the noise-reduction quality and the detection efficiency of keywords in noisy voice.
Description
Technical Field
The present invention relates to the field of speech processing technologies, and in particular, to a method and an apparatus for processing a specific word speech.
Background
At present, a large number of voice-interaction devices for smart homes and mobile automation, such as smart speakers, Amazon Alexa and Apple Siri, are on the market. These devices must be woken up by a specific-word detection system before voice interaction can begin. However, such detection systems generally perform well only in relatively quiet scenes: detection that works on voice recorded in a quiet environment degrades sharply in noisy scenes, so keyword detection in noisy voice is inaccurate.
Disclosure of Invention
The embodiment of the invention provides a method and a device for processing specific word voice. The technical scheme is as follows:
according to a first aspect of the embodiments of the present invention, there is provided a method for processing a specific word speech, including:
acquiring a voice to be trained with noise;
extracting a first feature of the voice to be trained;
inputting the first characteristic into a U-NET model to be trained to obtain a target U-NET model;
acquiring a voice to be tested, and extracting a second characteristic of the voice to be tested;
and inputting the second characteristics into the target U-NET model to judge whether the voice to be tested has specific word voice or not and obtain noise reduction voice of the voice to be tested.
In one embodiment, the inputting the first feature into a U-NET model to be trained to obtain a target U-NET model includes:
inputting the first characteristic into the U-NET model to be trained to obtain a first estimated masking value corresponding to the voice to be trained and an estimation result of whether the voice to be trained comprises preset voice; the first characteristic is the amplitude value of the voice to be trained in a frequency domain space;
and training the U-NET model to be trained according to the first estimated masking value and the estimation result to obtain the target U-NET model.
In an embodiment, the training the U-NET model to be trained according to the first estimated masking value and the estimation result to obtain the target U-NET model includes:
acquiring a real masking value corresponding to the voice to be trained and a real judgment result of whether the voice to be trained comprises a preset voice;
calculating a model loss function according to the first estimated masking value, the estimation result, the real masking value and the real judgment result;
and adjusting the U-NET model to be trained according to the model loss function to obtain the target U-NET model.
In one embodiment, calculating a model loss function according to the first estimated masking value, the estimation result, the true masking value, and the true judgment result includes:
calculating the model Loss function Loss by a first preset formula:

Loss = MAE(PSM_est, PSM) + MAE(LABEL_est, LABEL)

where PSM_est and LABEL_est respectively denote the first estimated masking value and the estimation result, PSM and LABEL respectively denote the real masking value and the real judgment result, and MAE denotes the mean absolute error;

the real masking value PSM is calculated by a second preset formula:

PSM = (|pure| / |mixture|) · cos(θ_pure − θ_mixture)

where |pure| represents the amplitude, in frequency-domain space, of the pure voice corresponding to the voice to be trained, |mixture| represents the amplitude of the voice to be trained in frequency-domain space, θ_pure represents the phase of that pure voice in frequency-domain space, and θ_mixture represents the phase of the voice to be trained in frequency-domain space.
In one embodiment, the inputting the second feature into the target U-NET model to determine whether a specific word speech exists in the speech to be tested and obtain a noise-reduced speech of the speech to be tested includes:
inputting the second characteristic into the target U-NET model to judge whether a specific word sound exists in the voice to be tested and a second estimated masking value corresponding to the voice to be tested; the second characteristic is the amplitude of the voice to be tested in a frequency domain space;
carrying out short-time Fourier transform on the voice to be tested to obtain a frequency spectrum of the voice to be tested;
and multiplying the second estimated masking value and the frequency spectrum, and then performing inverse Fourier transform to obtain the noise-reduced voice.
According to a second aspect of the embodiments of the present invention, there is provided a processing apparatus for a specific word speech, including:
the acquisition module is used for acquiring a voice to be trained with noise;
the extraction module is used for extracting a first feature of the voice to be trained;
the input module is used for inputting the first characteristic into a U-NET model to be trained so as to obtain a target U-NET model;
the first processing module is used for acquiring a voice to be tested and extracting a second characteristic of the voice to be tested;
and the second processing module is used for inputting the second characteristics to the target U-NET model so as to judge whether specific word voice exists in the voice to be tested and obtain noise reduction voice of the voice to be tested.
In one embodiment, the input module comprises:
the input submodule is used for inputting the first characteristic into the U-NET model to be trained so as to obtain a first estimated masking value corresponding to the voice to be trained and an estimation result of whether the voice to be trained comprises preset voice; the first characteristic is the amplitude value of the voice to be trained in a frequency domain space;
and the training submodule is used for training the U-NET model to be trained according to the first estimation masking value and the estimation result so as to obtain the target U-NET model.
In one embodiment, the training submodule is specifically configured to:
acquiring a real masking value corresponding to the voice to be trained and a real judgment result of whether the voice to be trained comprises a preset voice;
calculating a model loss function according to the first estimated masking value, the estimation result, the real masking value and the real judgment result;
and adjusting the U-NET model to be trained according to the model loss function to obtain the target U-NET model.
In one embodiment, the training submodule is further configured to:
calculating the model Loss function Loss by a first preset formula as follows:

Loss = MAE(PSM_est, PSM) + MAE(LABEL_est, LABEL)

where PSM_est and LABEL_est respectively denote the first estimated masking value and the estimation result, PSM and LABEL respectively denote the real masking value and the real judgment result, and MAE denotes the mean absolute error;

the real masking value PSM is calculated by a second preset formula:

PSM = (|pure| / |mixture|) · cos(θ_pure − θ_mixture)

where |pure| represents the amplitude, in frequency-domain space, of the pure voice corresponding to the voice to be trained, |mixture| represents the amplitude of the voice to be trained in frequency-domain space, θ_pure represents the phase of that pure voice in frequency-domain space, and θ_mixture represents the phase of the voice to be trained in frequency-domain space.
In one embodiment, the second processing module comprises:
the input submodule is used for inputting the second characteristics into the target U-NET model so as to judge whether a specific word sound exists in the voice to be tested and a second estimated masking value corresponding to the voice to be tested; the second characteristic is the amplitude of the voice to be tested in a frequency domain space;
the conversion submodule is used for carrying out short-time Fourier transform on the voice to be tested so as to obtain the frequency spectrum of the voice to be tested;
and the processing submodule is used for multiplying the second estimated masking value and the frequency spectrum and then performing inverse Fourier transform to obtain the noise-reduced voice.
The technical scheme provided by the embodiment of the invention can have the following beneficial effects:
the method comprises the steps of inputting first characteristics of voice to be trained into a U-NET model to be trained to obtain a target U-NET model with higher maturity and accuracy after training, then extracting second characteristics of the voice to be tested after the voice to be tested is obtained, inputting the second characteristics into the target U-NET model with higher accuracy to obtain noise-reducing voice of the voice to be tested, namely pure voice except noise in the voice to be tested, and judging whether specific word voice exists in the voice to be tested, so that noise-reducing quality and detection efficiency of keywords in the voice with noise are fully and effectively improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1A is a flow diagram illustrating a method of processing a particular word speech according to an example embodiment.
FIG. 1B is a flow diagram illustrating a method of processing a particular word speech according to an example embodiment.
Fig. 2 is a block diagram illustrating an apparatus for processing a specific word speech according to an example embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
In order to solve the above technical problem, an embodiment of the present invention provides a method for processing a specific word speech, where the method is applicable to a specific word speech processing program, system or device, and an execution subject corresponding to the method may be a terminal or a server, as shown in fig. 1A, and the method includes steps S101 to S105:
in step S101, a speech to be trained with noise is acquired;
the speech to be trained is mixed and obtained in a simulation mode, and the obtained method is to add different types of noise to clean speech at different signal-to-noise ratios.
In step S102, a first feature of the voice to be trained is extracted. The first feature is the amplitude of the voice to be trained in frequency-domain space, i.e. the modulus of its complex frequency-domain representation. Both the first and the second feature are amplitude features only: the training stage trains on this feature, and in the testing stage the amplitude feature is likewise input into the trained model (the target U-NET model) to obtain the noise-reduced result.
In step S103, the first feature is input into a U-NET model to be trained (a deep-learning model) to obtain the target U-NET model. The U-NET model has a U-shaped network structure and can be used both to denoise or enhance noisy voice and to detect keywords in voice.
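The U-shaped structure can be illustrated with a shape-only sketch: feature maps are repeatedly downsampled on the encoder path, then upsampled and concatenated with the matching encoder output (skip connections) on the decoder path. The pooling factors, feature-map sizes, and the use of max-pooling and nearest-neighbour upsampling below are illustrative assumptions, not the patent's actual architecture; the convolutions are omitted.

```python
import numpy as np

def down(x):
    """2x2 max-pooling over (time, frequency); channels preserved."""
    t, f, c = x.shape
    return x[:t // 2 * 2, :f // 2 * 2].reshape(t // 2, 2, f // 2, 2, c).max(axis=(1, 3))

def up(x):
    """Nearest-neighbour 2x upsampling over (time, frequency)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

# U-shaped data flow on a (time, frequency, channels) magnitude feature map;
# only the shapes of the skip connections are demonstrated here
x0 = np.random.default_rng(3).random((64, 128, 1))
x1 = down(x0)                                  # encoder level 1
x2 = down(x1)                                  # bottleneck
y1 = np.concatenate([up(x2), x1], axis=2)      # decoder level 1 + skip connection
y0 = np.concatenate([up(y1), x0], axis=2)      # back to the input resolution
```

The skip connections are what let the decoder recover fine time-frequency detail lost in pooling, which matters for mask estimation.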
In step S104, acquiring a voice to be tested, and extracting a second feature of the voice to be tested;
the speech to be tested is recorded without mixing.
In step S105, the second feature is input to the target U-NET model to determine whether a specific word voice exists in the voice to be tested, and obtain a noise reduction voice of the voice to be tested.
The first feature of the voice to be trained is input into a U-NET model to be trained, yielding, after training, a mature and accurate target U-NET model. After the voice to be tested is acquired, its second feature is extracted and input into this target U-NET model, which judges whether specific word voice exists in the voice to be tested and outputs the noise-reduced voice, that is, the pure voice with the noise removed (for example, if only the voice of a specific word is needed, all other voice in the voice to be tested can be filtered out). This fully and effectively improves the noise-reduction quality and the detection efficiency of keywords or specific words in noisy voice, and in turn improves the accuracy and timeliness of voice wake-up in voice-interaction devices. The specific word voice may be the voice of a particular word, such as a wake word.
In one embodiment, the inputting the first feature into a U-NET model to be trained to obtain a target U-NET model includes:
inputting the first characteristic into the U-NET model to be trained to obtain a first estimated masking value corresponding to the voice to be trained and an estimation result of whether the voice to be trained comprises preset voice; the first characteristic is the amplitude value of the voice to be trained in a frequency domain space;
and training the U-NET model to be trained according to the first estimated masking value and the estimation result to obtain the target U-NET model.
Inputting the first feature into the U-NET model to be trained yields a first estimated masking value, a PSM (Phase Sensitive Mask), corresponding to the voice to be trained, together with an estimate of whether the voice to be trained includes the preset voice, i.e. whether it contains a specified keyword. The U-NET model to be trained is then retrained according to the first estimated masking value and the estimation result, producing an optimized and upgraded target U-NET model that can accurately denoise noisy voice and improve the efficiency and accuracy of keyword detection in noisy voice.
In an embodiment, the training the U-NET model to be trained according to the first estimated masking value and the estimation result to obtain the target U-NET model includes:
acquiring a real masking value corresponding to the voice to be trained and a real judgment result of whether the voice to be trained comprises a preset voice; the preset speech may also be a speech of a specific word or keyword, and may be the same as or different from the specific word sound.
Calculating a model loss function according to the first estimated masking value, the estimation result, the real masking value and the real judgment result;
and adjusting the U-NET model to be trained according to the model loss function to obtain the target U-NET model.
When optimizing the U-NET model, an accurate model loss function can be calculated from the first estimated masking value, the estimation result, the real masking value and the real judgment result. The U-NET model to be trained is then adjusted according to this loss function, and the adjustment can be repeated in a loop until an optimized and upgraded target U-NET model is obtained, so that noisy voice can be accurately denoised and the efficiency and accuracy of keyword detection in noisy voice improved.
In one embodiment, calculating a model loss function according to the first estimated masking value, the estimation result, the true masking value, and the true judgment result includes:
calculating the model Loss function Loss by a first preset formula:

Loss = MAE(PSM_est, PSM) + MAE(LABEL_est, LABEL)

where PSM_est and LABEL_est respectively denote the first estimated masking value and the estimation result, PSM and LABEL respectively denote the real masking value and the real judgment result, and MAE denotes the mean absolute error.

LABEL_est takes the value 1 when the voice to be trained is estimated to include the preset voice and 0 when it is not; LABEL takes the value 1 when the voice includes the specific word voice and 0 when it does not.

The real masking value PSM is calculated by a second preset formula:

PSM = (|pure| / |mixture|) · cos(θ_pure − θ_mixture)

where |pure| represents the amplitude, in frequency-domain space, of the pure voice corresponding to the voice to be trained, |mixture| represents the amplitude of the voice to be trained in frequency-domain space, θ_pure represents the phase (the argument of the complex frequency-domain representation) of that pure voice, and θ_mixture represents the phase of the voice to be trained in frequency-domain space.
When the model Loss function Loss is calculated by the above formula, training uses the mean absolute error (MAE) as the convergence criterion and stops once the loss function converges. This yields the best-optimized target U-NET model, and therefore the best voice-detection and noise-reduction performance.
In one embodiment, the inputting the second feature into the target U-NET model to determine whether a specific word speech exists in the speech to be tested and obtain a noise-reduced speech of the speech to be tested includes:
inputting the second feature into the target U-NET model to judge whether specific word voice exists in the voice to be tested and to obtain a second estimated masking value corresponding to the voice to be tested; the second feature is the amplitude of the voice to be tested in frequency-domain space, i.e. the modulus of its complex frequency-domain representation;
carrying out short-time Fourier transform on the voice to be tested to obtain a frequency spectrum of the voice to be tested;
and multiplying the second estimated masking value and the frequency spectrum, and then performing inverse Fourier transform to obtain the noise-reduced voice.
After the target U-NET model is obtained, the second feature of the voice to be tested can be input into it to judge whether specific word voice is really present, so that a given keyword in the voice to be tested can be accurately identified, and to obtain a second estimated masking value (PSM). A short-time Fourier transform (STFT) of the voice to be tested then gives its spectrum; multiplying the spectrum by the second estimated masking value and applying the inverse transform (ISTFT) produces a well-denoised voice.
The technical solution of the present invention will be further described in detail with reference to fig. 1B:
Step 1: data generation. Original specific-word data is mixed with various types of noise at different signal-to-noise ratios (-5 to 15 dB), and non-specific-word data is mixed with noise in the same way; the mixed voice serves as training data. A verification set is generated in the same manner, with noise types, signal-to-noise ratios and speakers different from those of the training set. The model is trained on the training set, while the verification set only supervises the model and does not participate in error back-propagation;
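Mixing clean speech with noise at a target signal-to-noise ratio, as in step 1, can be sketched as follows; the function name, the 16 kHz sampling rate, and the synthetic signals are illustrative assumptions:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale the noise so that 10*log10(P_clean / P_noise) equals snr_db, then add it."""
    noise = noise[:len(clean)]                    # truncate noise to the utterance length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + scale * noise

# toy signals standing in for a specific-word utterance and recorded noise
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000.0)  # 1 s tone at an assumed 16 kHz
noise = rng.standard_normal(16000)
mixed = mix_at_snr(clean, noise, snr_db=5.0)      # one point in the -5..15 dB range
```

Sweeping `snr_db` over the stated range, with different noise files per utterance, generates the training and verification sets.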
Step 2: feature extraction. The short-time Fourier transform of each training utterance is computed, and its amplitude is normalized and used as the input of the model;
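Step 2 (STFT magnitude, normalized as model input) might look like the following sketch; the frame length, hop size, Hann window, and global mean/std normalization are assumptions, since the patent does not specify them:

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Naive short-time Fourier transform with a Hann window (sizes are assumed)."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(np.stack(frames), axis=1)      # shape: (frames, n_fft // 2 + 1)

def magnitude_feature(x):
    """Amplitude of the utterance in frequency-domain space, normalized as model input."""
    mag = np.abs(stft(x))
    return (mag - mag.mean()) / (mag.std() + 1e-8)

feat = magnitude_feature(np.random.default_rng(1).standard_normal(16000))
```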
Step 3: calculating the training target, which consists of two parts. One part, computed from the noisy training speech (mixture) and its corresponding clean speech (pure), is the phase-sensitive mask (the true PSM):

PSM = (|pure| / |mixture|) · cos(θ_pure − θ_mixture)

where |·| represents amplitude and θ represents phase. The other part is the label (LABEL) of the whole utterance: specific-word speech is labelled 1 and non-specific-word speech is labelled 0;
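The phase-sensitive mask of step 3 can be computed per time-frequency bin from the complex spectra of the clean and mixed speech; the function name and the small epsilon guard are illustrative:

```python
import numpy as np

def psm(pure_spec, mixture_spec, eps=1e-8):
    """Phase-sensitive mask: (|pure| / |mixture|) * cos(theta_pure - theta_mixture)."""
    ratio = np.abs(pure_spec) / (np.abs(mixture_spec) + eps)
    return ratio * np.cos(np.angle(pure_spec) - np.angle(mixture_spec))

# toy single-bin examples
in_phase = psm(np.array([1.0 + 0.0j]), np.array([2.0 + 0.0j]))    # speech is half the mixture
quadrature = psm(np.array([0.0 + 1.0j]), np.array([1.0 + 0.0j]))  # 90 degrees out of phase
```

The cosine term is what distinguishes the PSM from a plain magnitude ratio: bins where the clean and mixed phases disagree are attenuated.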
Step 4: the extracted features are input into the U-NET network model for training, with the mean absolute error (MAE) as the convergence criterion; training stops when the loss function converges, and the model is saved. The loss function is defined as:

Loss = MAE(PSM_est, PSM) + MAE(LABEL_est, LABEL)

where PSM_est and LABEL_est are the PSM and the LABEL estimated by the model.
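A minimal sketch of this joint loss (mask MAE plus label MAE); the toy arrays stand in for model outputs and training targets:

```python
import numpy as np

def mae(a, b):
    """Mean absolute error."""
    return np.mean(np.abs(a - b))

def model_loss(psm_est, psm_true, label_est, label_true):
    """Joint loss: MAE over the estimated mask plus MAE over the keyword label."""
    return mae(psm_est, psm_true) + mae(label_est, label_true)

# toy values: mask estimate off by 0.1 per bin, label estimate off by 0.1
loss = model_loss(np.array([0.4, 0.6]), np.array([0.5, 0.5]),
                  np.array([0.9]), np.array([1.0]))
```

Summing the two MAE terms trains noise reduction and keyword detection jointly, which is the core of the patent's single-model scheme.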
In the testing stage, the features of the test voice are passed through the trained model to obtain both a judgment of whether it contains a specific word and an estimated PSM. The PSM is multiplied by the spectrum of the test voice (obtained by STFT), and an inverse Fourier transform then yields the denoised voice.
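The test-stage pipeline (mask times spectrum, then inverse STFT) can be sketched as follows; the naive STFT/ISTFT implementation, the frame parameters, and the stand-in mask (a fixed Wiener-like attenuation used here in place of the model's estimated PSM) are illustrative assumptions:

```python
import numpy as np

N_FFT, HOP = 512, 128          # assumed analysis parameters

def stft(x):
    win = np.hanning(N_FFT)
    n = (len(x) - N_FFT) // HOP + 1
    return np.fft.rfft(np.stack([x[i * HOP:i * HOP + N_FFT] * win for i in range(n)]),
                       axis=1)

def istft(spec, length):
    """Inverse STFT by windowed overlap-add with window-power normalization."""
    win = np.hanning(N_FFT)
    out = np.zeros(length)
    norm = np.zeros(length)
    for i, frame in enumerate(np.fft.irfft(spec, n=N_FFT, axis=1)):
        out[i * HOP:i * HOP + N_FFT] += frame * win
        norm[i * HOP:i * HOP + N_FFT] += win ** 2
    return out / np.maximum(norm, 1e-8)

rng = np.random.default_rng(2)
noisy = rng.standard_normal(8192)
spec = stft(noisy)
# stand-in for the model's estimated PSM; a trained U-NET would output this mask
mask = np.abs(spec) / (np.abs(spec) + 1.0)
denoised = istft(mask * spec, len(noisy))
```

With the mask set to all ones, this analysis/synthesis pair reconstructs the interior of the signal exactly, so any residual difference in `denoised` comes from the mask alone.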
Finally, it should be noted that those skilled in the art can freely combine the above embodiments according to actual needs.
Corresponding to the method for processing the specific word speech provided in the embodiment of the present invention, an embodiment of the present invention further provides a device for processing the specific word speech, as shown in fig. 2, where the device includes:
an obtaining module 201, configured to obtain a voice to be trained with noise;
an extracting module 202, configured to extract a first feature of the speech to be trained;
the input module 203 is used for inputting the first characteristic into a U-NET model to be trained to obtain a target U-NET model;
the first processing module 204 is configured to obtain a voice to be tested, and extract a second feature of the voice to be tested;
the second processing module 205 is configured to input the second feature to the target U-NET model, so as to determine whether a specific word voice exists in the voice to be tested, and obtain a noise reduction voice of the voice to be tested.
In one embodiment, the input module comprises:
the input submodule is used for inputting the first characteristic into the U-NET model to be trained so as to obtain a first estimated masking value corresponding to the voice to be trained and an estimation result of whether the voice to be trained comprises preset voice; the first characteristic is the amplitude value of the voice to be trained in a frequency domain space;
and the training submodule is used for training the U-NET model to be trained according to the first estimation masking value and the estimation result so as to obtain the target U-NET model.
In one embodiment, the training submodule is specifically configured to:
acquiring a real masking value corresponding to the voice to be trained and a real judgment result of whether the voice to be trained comprises a preset voice;
calculating a model loss function according to the first estimated masking value, the estimation result, the real masking value and the real judgment result;
and adjusting the U-NET model to be trained according to the model loss function to obtain the target U-NET model.
In one embodiment, the training submodule is further configured to:
calculating the model Loss function Loss by a first preset formula:

Loss = MAE(PSM_est, PSM) + MAE(LABEL_est, LABEL)

wherein PSM_est and LABEL_est are the first estimated masking value and the estimation result, respectively; PSM and LABEL represent the real masking value and the real judgment result; and MAE denotes the mean absolute error;

the real masking value PSM is obtained by calculation using a second preset formula, where the second preset formula is:

PSM = (|pure| / |mixture|) * cos(θ_pure − θ_mixture)

where |pure| represents the amplitude of the pure speech corresponding to the speech to be trained in the frequency domain space, |mixture| represents the amplitude of the speech to be trained in the frequency domain space, θ_pure represents the phase of the pure speech corresponding to the speech to be trained in the frequency domain space, and θ_mixture represents the phase of the speech to be trained in the frequency domain space.
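The two preset formulas can be written out directly on complex STFT arrays for the pure and mixture speech. A numpy sketch under that assumption (function names are ours, not the patent's):

```python
import numpy as np

def phase_sensitive_mask(pure_spec, mixture_spec):
    """Second preset formula:
    PSM = (|pure| / |mixture|) * cos(theta_pure - theta_mixture)."""
    ratio = np.abs(pure_spec) / np.maximum(np.abs(mixture_spec), 1e-8)
    return ratio * np.cos(np.angle(pure_spec) - np.angle(mixture_spec))

def model_loss(psm_est, psm_true, label_est, label_true):
    """First preset formula: sum of the mean absolute errors of the
    estimated masking value and of the keyword-presence estimate."""
    return (np.mean(np.abs(psm_est - psm_true))
            + np.mean(np.abs(label_est - label_true)))
```

When the mixture equals the pure speech the PSM is 1 everywhere, and a perfect estimate drives the loss to zero.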
In one embodiment, the second processing module comprises:
the input submodule is used for inputting the second feature into the target U-NET model so as to judge whether a specific word voice exists in the voice to be tested and to obtain a second estimated masking value corresponding to the voice to be tested; the second feature is the amplitude of the voice to be tested in a frequency domain space;
the conversion submodule is used for carrying out short-time Fourier transform on the voice to be tested so as to obtain the frequency spectrum of the voice to be tested;
and the processing submodule is used for multiplying the second estimated masking value and the frequency spectrum and then performing inverse Fourier transform to obtain the noise-reduced voice.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.
Claims (6)
1. A method for processing a specific word speech, comprising:
acquiring a voice to be trained with noise;
extracting a first feature of the voice to be trained;
inputting the first characteristic into a U-NET model to be trained to obtain a target U-NET model;
acquiring a voice to be tested, and extracting a second characteristic of the voice to be tested;
inputting the second characteristics into the target U-NET model to judge whether specific word voices exist in the voices to be tested or not and obtain noise reduction voices of the voices to be tested;
inputting the first characteristic into a U-NET model to be trained to obtain a target U-NET model, wherein the method comprises the following steps:
inputting the first characteristic into the U-NET model to be trained to obtain a first estimated masking value corresponding to the voice to be trained and an estimation result of whether the voice to be trained comprises preset voice; the first characteristic is the amplitude value of the voice to be trained in a frequency domain space;
training the U-NET model to be trained according to the first estimation masking value and the estimation result to obtain the target U-NET model;
the training the U-NET model to be trained according to the first estimated masking value and the estimation result to obtain the target U-NET model comprises:
acquiring a real masking value corresponding to the voice to be trained and a real judgment result of whether the voice to be trained comprises a preset voice;
calculating a model loss function according to the first estimated masking value, the estimation result, the real masking value and the real judgment result;
and adjusting the U-NET model to be trained according to the model loss function to obtain the target U-NET model.
2. The method of claim 1,
calculating a model loss function according to the first estimated masking value, the estimation result, the real masking value and the real judgment result, including:
calculating the model Loss function Loss by a first preset formula:

Loss = MAE(PSM_est, PSM) + MAE(LABEL_est, LABEL)

wherein PSM_est and LABEL_est are the first estimated masking value and the estimation result, respectively; PSM and LABEL represent the real masking value and the real judgment result; and MAE denotes the mean absolute error;

the real masking value PSM is obtained by calculation using a second preset formula, where the second preset formula is:

PSM = (|pure| / |mixture|) * cos(θ_pure − θ_mixture)

where |pure| represents the amplitude of the pure speech corresponding to the speech to be trained in the frequency domain space, |mixture| represents the amplitude of the speech to be trained in the frequency domain space, θ_pure represents the phase of the pure speech corresponding to the speech to be trained in the frequency domain space, and θ_mixture represents the phase of the speech to be trained in the frequency domain space.
3. The method according to any one of claims 1 to 2,
the inputting the second characteristic into the target U-NET model to judge whether the voice to be tested has a specific word voice and obtain the noise reduction voice of the voice to be tested includes:
inputting the second feature into the target U-NET model to judge whether a specific word voice exists in the voice to be tested and to obtain a second estimated masking value corresponding to the voice to be tested; the second feature is the amplitude of the voice to be tested in a frequency domain space;
carrying out short-time Fourier transform on the voice to be tested to obtain a frequency spectrum of the voice to be tested;
and multiplying the second estimated masking value and the frequency spectrum, and then performing inverse Fourier transform to obtain the noise-reduced voice.
4. An apparatus for processing a specific word speech, comprising:
the acquisition module is used for acquiring a voice to be trained with noise;
the extraction module is used for extracting a first feature of the voice to be trained;
the input module is used for inputting the first characteristic into a U-NET model to be trained so as to obtain a target U-NET model;
the first processing module is used for acquiring a voice to be tested and extracting a second characteristic of the voice to be tested;
the second processing module is used for inputting the second characteristics to the target U-NET model so as to judge whether specific word voices exist in the voices to be tested and obtain noise reduction voices of the voices to be tested;
the input module includes:
the input submodule is used for inputting the first characteristic into the U-NET model to be trained so as to obtain a first estimated masking value corresponding to the voice to be trained and an estimation result of whether the voice to be trained comprises preset voice; the first characteristic is the amplitude value of the voice to be trained in a frequency domain space;
the training submodule is used for training the U-NET model to be trained according to the first estimation masking value and the estimation result to obtain the target U-NET model;
the training submodule is specifically configured to:
acquiring a real masking value corresponding to the voice to be trained and a real judgment result of whether the voice to be trained comprises a preset voice;
calculating a model loss function according to the first estimated masking value, the estimation result, the real masking value and the real judgment result;
and adjusting the U-NET model to be trained according to the model loss function to obtain the target U-NET model.
5. The apparatus of claim 4,
the training submodule is further specifically configured to:
calculating the model Loss function Loss by a first preset formula:

Loss = MAE(PSM_est, PSM) + MAE(LABEL_est, LABEL)

wherein PSM_est and LABEL_est are the first estimated masking value and the estimation result, respectively; PSM and LABEL represent the real masking value and the real judgment result; and MAE denotes the mean absolute error;

the real masking value PSM is obtained by calculation using a second preset formula, where the second preset formula is:

PSM = (|pure| / |mixture|) * cos(θ_pure − θ_mixture)

where |pure| represents the amplitude of the pure speech corresponding to the speech to be trained in the frequency domain space, |mixture| represents the amplitude of the speech to be trained in the frequency domain space, θ_pure represents the phase of the pure speech corresponding to the speech to be trained in the frequency domain space, and θ_mixture represents the phase of the speech to be trained in the frequency domain space.
6. The apparatus according to any one of claims 4 to 5,
the second processing module comprises:
the input submodule is used for inputting the second feature into the target U-NET model so as to judge whether a specific word voice exists in the voice to be tested and to obtain a second estimated masking value corresponding to the voice to be tested; the second feature is the amplitude of the voice to be tested in a frequency domain space;
the conversion submodule is used for carrying out short-time Fourier transform on the voice to be tested so as to obtain the frequency spectrum of the voice to be tested;
and the processing submodule is used for multiplying the second estimated masking value and the frequency spectrum and then performing inverse Fourier transform to obtain the noise-reduced voice.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010307655.9A CN111613211B (en) | 2020-04-17 | 2020-04-17 | Method and device for processing specific word voice |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111613211A CN111613211A (en) | 2020-09-01 |
CN111613211B true CN111613211B (en) | 2023-04-07 |
Family
ID=72203952
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010307655.9A Active CN111613211B (en) | 2020-04-17 | 2020-04-17 | Method and device for processing specific word voice |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111613211B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115798455B (en) * | 2023-02-07 | 2023-06-02 | 深圳元象信息科技有限公司 | Speech synthesis method, system, electronic device and storage medium |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108986835A (en) * | 2018-08-28 | 2018-12-11 | 百度在线网络技术(北京)有限公司 | Based on speech de-noising method, apparatus, equipment and the medium for improving GAN network |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107845389B (en) * | 2017-12-21 | 2020-07-17 | 北京工业大学 | Speech enhancement method based on multi-resolution auditory cepstrum coefficient and deep convolutional neural network |
CN109065027B (en) * | 2018-06-04 | 2023-05-02 | 平安科技(深圳)有限公司 | Voice distinguishing model training method and device, computer equipment and storage medium |
CN109461456B (en) * | 2018-12-03 | 2022-03-22 | 云知声智能科技股份有限公司 | Method for improving success rate of voice awakening |
CN110060704A (en) * | 2019-03-26 | 2019-07-26 | 天津大学 | A kind of sound enhancement method of improved multiple target criterion study |
CN110600017B (en) * | 2019-09-12 | 2022-03-04 | 腾讯科技(深圳)有限公司 | Training method of voice processing model, voice recognition method, system and device |
- 2020-04-17 CN CN202010307655.9A patent CN111613211B (Active)
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||