CN111613211B - Method and device for processing specific word voice - Google Patents

Method and device for processing specific word voice

Info

Publication number
CN111613211B
Authority
CN
China
Prior art keywords
voice
trained
tested
net model
masking value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010307655.9A
Other languages
Chinese (zh)
Other versions
CN111613211A (en)
Inventor
高飞 (Gao Fei)
关海欣 (Guan Haixin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Unisound Intelligent Technology Co Ltd, Xiamen Yunzhixin Intelligent Technology Co Ltd filed Critical Unisound Intelligent Technology Co Ltd
Priority to CN202010307655.9A priority Critical patent/CN111613211B/en
Publication of CN111613211A publication Critical patent/CN111613211A/en
Application granted granted Critical
Publication of CN111613211B publication Critical patent/CN111613211B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0631 Creating reference templates; Clustering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L2021/02082 Noise filtering the noise being echo, reverberation of the speech
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)
  • Monitoring And Testing Of Exchanges (AREA)

Abstract

The invention relates to a method and a device for processing specific word voice. The method comprises the following steps: acquiring a voice to be trained with noise; extracting a first feature of the voice to be trained; inputting the first feature into a U-NET model to be trained to obtain a target U-NET model; acquiring a voice to be tested and extracting a second feature of the voice to be tested; and inputting the second feature into the target U-NET model to judge whether specific word voice exists in the voice to be tested and to obtain the noise-reduced voice of the voice to be tested. The technical scheme of the invention fully and effectively improves the noise reduction quality and the keyword detection efficiency for noisy voice.

Description

Method and device for processing specific word voice
Technical Field
The present invention relates to the field of speech processing technologies, and in particular, to a method and an apparatus for processing a specific word speech.
Background
At present, a large number of voice-interaction devices for smart homes and mobile automation, such as smart speakers, Amazon Alexa and Apple Siri, are on the market. These devices must be woken by a specific word detection system before voice interaction can begin. However, such systems generally detect well only in relatively quiet scenes and perform poorly in noisy ones: the specific word detection methods in the prior art work well on voice recorded in a relatively quiet environment, but their performance falls off a cliff in noisy scenes, so keyword detection in noisy voice is inaccurate.
Disclosure of Invention
The embodiment of the invention provides a method and a device for processing specific word voice. The technical scheme is as follows:
according to a first aspect of the embodiments of the present invention, there is provided a method for processing a specific word speech, including:
acquiring a voice to be trained with noise;
extracting a first feature of the voice to be trained;
inputting the first characteristic into a U-NET model to be trained to obtain a target U-NET model;
acquiring a voice to be tested, and extracting a second characteristic of the voice to be tested;
and inputting the second characteristics into the target U-NET model to judge whether the voice to be tested has specific word voice or not and obtain noise reduction voice of the voice to be tested.
In one embodiment, the inputting the first feature into a U-NET model to be trained to obtain a target U-NET model includes:
inputting the first characteristic into the U-NET model to be trained to obtain a first estimated masking value corresponding to the voice to be trained and an estimation result of whether the voice to be trained comprises preset voice; the first characteristic is the amplitude value of the voice to be trained in a frequency domain space;
and training the U-NET model to be trained according to the first estimated masking value and the estimation result to obtain the target U-NET model.
In an embodiment, the training the U-NET model to be trained according to the first estimated masking value and the estimation result to obtain the target U-NET model includes:
acquiring a real masking value corresponding to the voice to be trained and a real judgment result of whether the voice to be trained comprises a preset voice;
calculating a model loss function according to the first estimated masking value, the estimation result, the real masking value and the real judgment result;
and adjusting the U-NET model to be trained according to the model loss function to obtain the target U-NET model.
In one embodiment, calculating a model loss function according to the first estimated masking value, the estimation result, the true masking value, and the true judgment result includes:
calculating the model Loss function Loss by a first predetermined formula:

Loss = MAE(PSM̂, PSM) + MAE(LABEL̂, LABEL)

wherein PSM̂ and LABEL̂ are respectively the first estimated masking value and the estimation result, PSM and LABEL respectively represent the real masking value and the real judgment result, and MAE represents the mean absolute error;

the real masking value PSM is obtained by calculation using a second preset formula, where the second preset formula is:

PSM = (|pure| / |mixture|) · cos(θ_pure − θ_mixture)

wherein |pure| represents the amplitude of the pure speech corresponding to the speech to be trained in the frequency domain space, |mixture| represents the amplitude of the speech to be trained in the frequency domain space, θ_pure represents the phase, in the frequency domain space, of the pure speech corresponding to the speech to be trained, and θ_mixture represents the phase of the speech to be trained in the frequency domain space.
In one embodiment, the inputting the second feature into the target U-NET model to determine whether a specific word speech exists in the speech to be tested and obtain a noise-reduced speech of the speech to be tested includes:
inputting the second characteristic into the target U-NET model to judge whether a specific word sound exists in the voice to be tested and a second estimated masking value corresponding to the voice to be tested; the second characteristic is the amplitude of the voice to be tested in a frequency domain space;
carrying out short-time Fourier transform on the voice to be tested to obtain a frequency spectrum of the voice to be tested;
and multiplying the second estimated masking value and the frequency spectrum, and then performing inverse Fourier transform to obtain the noise-reduced voice.
According to a second aspect of the embodiments of the present invention, there is provided a processing apparatus for a specific word speech, including:
the acquisition module is used for acquiring a voice to be trained with noise;
the extraction module is used for extracting a first feature of the voice to be trained;
the input module is used for inputting the first characteristic into a U-NET model to be trained so as to obtain a target U-NET model;
the first processing module is used for acquiring a voice to be tested and extracting a second characteristic of the voice to be tested;
and the second processing module is used for inputting the second characteristics to the target U-NET model so as to judge whether specific word voice exists in the voice to be tested and obtain noise reduction voice of the voice to be tested.
In one embodiment, the input module comprises:
the input submodule is used for inputting the first characteristic into the U-NET model to be trained so as to obtain a first estimated masking value corresponding to the voice to be trained and an estimation result of whether the voice to be trained comprises preset voice; the first characteristic is the amplitude value of the voice to be trained in a frequency domain space;
and the training submodule is used for training the U-NET model to be trained according to the first estimation masking value and the estimation result so as to obtain the target U-NET model.
In one embodiment, the training submodule is specifically configured to:
acquiring a real masking value corresponding to the voice to be trained and a real judgment result of whether the voice to be trained comprises a preset voice;
calculating a model loss function according to the first estimated masking value, the estimation result, the real masking value and the real judgment result;
and adjusting the U-NET model to be trained according to the model loss function to obtain the target U-NET model.
In one embodiment, the training submodule is further configured to:
calculating the model Loss function Loss by a first preset formula as follows:

Loss = MAE(PSM̂, PSM) + MAE(LABEL̂, LABEL)

wherein PSM̂ and LABEL̂ are respectively the first estimated masking value and the estimation result, PSM and LABEL respectively represent the real masking value and the real judgment result, and MAE represents the mean absolute error;

the real masking value PSM is obtained by calculation using a second preset formula, where the second preset formula is:

PSM = (|pure| / |mixture|) · cos(θ_pure − θ_mixture)

wherein |pure| represents the amplitude of the pure speech corresponding to the speech to be trained in the frequency domain space, |mixture| represents the amplitude of the speech to be trained in the frequency domain space, θ_pure represents the phase, in the frequency domain space, of the pure speech corresponding to the speech to be trained, and θ_mixture represents the phase of the speech to be trained in the frequency domain space.
In one embodiment, the second processing module comprises:
the input submodule is used for inputting the second characteristics into the target U-NET model so as to judge whether a specific word sound exists in the voice to be tested and a second estimated masking value corresponding to the voice to be tested; the second characteristic is the amplitude of the voice to be tested in a frequency domain space;
the conversion submodule is used for carrying out short-time Fourier transform on the voice to be tested so as to obtain the frequency spectrum of the voice to be tested;
and the processing submodule is used for multiplying the second estimated masking value and the frequency spectrum and then performing inverse Fourier transform to obtain the noise-reduced voice.
The technical scheme provided by the embodiment of the invention can have the following beneficial effects:
the method comprises the steps of inputting first characteristics of voice to be trained into a U-NET model to be trained to obtain a target U-NET model with higher maturity and accuracy after training, then extracting second characteristics of the voice to be tested after the voice to be tested is obtained, inputting the second characteristics into the target U-NET model with higher accuracy to obtain noise-reducing voice of the voice to be tested, namely pure voice except noise in the voice to be tested, and judging whether specific word voice exists in the voice to be tested, so that noise-reducing quality and detection efficiency of keywords in the voice with noise are fully and effectively improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1A is a flow diagram illustrating a method of processing a particular word speech according to an example embodiment.
FIG. 1B is a flow diagram illustrating a method of processing a particular word speech according to an example embodiment.
Fig. 2 is a block diagram illustrating an apparatus for processing a specific word speech according to an example embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all embodiments consistent with the present invention; rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
In order to solve the above technical problem, an embodiment of the present invention provides a method for processing specific word voice. The method is applicable to a specific word voice processing program, system or device, and the corresponding execution subject may be a terminal or a server. As shown in fig. 1A, the method includes steps S101 to S105:
in step S101, a speech to be trained with noise is acquired;
the speech to be trained is mixed and obtained in a simulation mode, and the obtained method is to add different types of noise to clean speech at different signal-to-noise ratios.
In step S102, a first feature of the voice to be trained is extracted. The first feature is the amplitude of the voice to be trained in the frequency domain space, i.e. the modulus of its complex frequency-domain representation. Both the first feature and the second feature are simply this amplitude feature of the voice: the model is trained on it in the training stage, and in the testing stage the amplitude feature is likewise input to the trained model (i.e. the target U-NET model) to obtain the noise-reduced result.
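A minimal sketch of this feature extraction (the 16 kHz sampling rate and 512-sample frame are assumptions; the patent does not specify the STFT parameters):

```python
import numpy as np
from scipy.signal import stft

def extract_magnitude_feature(speech: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Amplitude of the utterance in frequency-domain space: the normalized
    magnitude of its short-time Fourier transform."""
    _, _, spec = stft(speech, fs=sr, nperseg=512)  # complex (freq, time) spectrum
    mag = np.abs(spec)
    return mag / (mag.max() + 1e-12)               # per-utterance normalization
```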
In step S103, the first feature is input into a U-NET model to be trained (based on deep learning) to obtain a target U-NET model. The U-NET model has a U-shaped network structure and can be used both for noise reduction or enhancement of noisy voice and for detection of keywords in the voice.
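The patent does not disclose the concrete network, so the PyTorch sketch below only illustrates the U-shaped idea: an encoder-decoder with a skip connection emits a per-bin mask, and a small head emits an utterance-level keyword probability. All layer sizes are assumptions, and inputs are expected to have frequency/time dimensions divisible by 4:

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Toy U-shaped network: mask estimation plus keyword classification."""
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU())
        # 32 input channels: 16 from the decoder path + 16 from the encoder skip.
        self.dec1 = nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1)
        self.classifier = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                        nn.Linear(32, 1), nn.Sigmoid())

    def forward(self, mag):                           # mag: (batch, 1, freq, time)
        e1 = self.enc1(mag)                           # (batch, 16, freq/2, time/2)
        e2 = self.enc2(e1)                            # (batch, 32, freq/4, time/4)
        d2 = self.dec2(e2)                            # back to (batch, 16, freq/2, time/2)
        mask = self.dec1(torch.cat([d2, e1], dim=1))  # estimated PSM (linear output)
        label = self.classifier(e2)                   # keyword probability in (0, 1)
        return mask, label

# Example: a batch of 8 magnitude spectrograms, 256 bins x 128 frames.
mask, label = TinyUNet()(torch.rand(8, 1, 256, 128))
print(mask.shape, label.shape)                        # (8, 1, 256, 128) and (8, 1)
```

The mask output is left linear because PSM targets are real-valued (in practice they are often truncated to [0, 1]).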
In step S104, acquiring a voice to be tested, and extracting a second feature of the voice to be tested;
the speech to be tested is recorded without mixing.
In step S105, the second feature is input to the target U-NET model to determine whether a specific word voice exists in the voice to be tested, and obtain a noise reduction voice of the voice to be tested.
The first feature of the voice to be trained is input into the U-NET model to be trained so that, after training, a mature and accurate target U-NET model is obtained. After the voice to be tested is acquired, its second feature is extracted and input into this target U-NET model, which judges whether specific word voice exists in the voice to be tested and outputs the noise-reduced voice, i.e. the pure voice with the noise removed (for example, if only the voice of a specific word is wanted, everything other than the specific word voice can be filtered out). The noise reduction quality and the efficiency of detecting keywords or specific words in noisy voice are thereby fully and effectively improved, which in turn improves the accuracy and timeliness of voice wake-up for voice-interaction devices. The specific word voice may be the voice of a particular word, such as a wake-up word.
In one embodiment, the inputting the first feature into a U-NET model to be trained to obtain a target U-NET model includes:
inputting the first characteristic into the U-NET model to be trained to obtain a first estimated masking value corresponding to the voice to be trained and an estimation result of whether the voice to be trained comprises preset voice; the first characteristic is the amplitude value of the voice to be trained in a frequency domain space;
and training the U-NET model to be trained according to the first estimated masking value and the estimation result to obtain the target U-NET model.
By inputting the first feature into the U-NET model to be trained, a first estimated masking value, a PSM (Phase Sensitive Mask), corresponding to the voice to be trained is obtained, together with an estimate of whether the voice to be trained comprises the preset voice, i.e. whether it contains a specified keyword. The U-NET model to be trained is then trained further according to the first estimated masking value and the estimation result, yielding an optimized and upgraded target U-NET model that can conveniently and accurately denoise noisy voice and improve the efficiency and accuracy of keyword detection in it.
In an embodiment, the training the U-NET model to be trained according to the first estimated masking value and the estimation result to obtain the target U-NET model includes:
acquiring a real masking value corresponding to the voice to be trained and a real judgment result of whether the voice to be trained comprises a preset voice; the preset voice may itself be the voice of a specific word or keyword, and may be the same as or different from the specific word voice.
Calculating a model loss function according to the first estimated masking value, the estimation result, the real masking value and the real judgment result;
and adjusting the U-NET model to be trained according to the model loss function to obtain the target U-NET model.
When optimizing the U-NET model, an accurate model loss function can be calculated from the first estimated masking value, the estimation result, the real masking value and the real judgment result. The U-NET model to be trained is then adjusted according to this loss function, and the adjustment can be repeated in a loop to obtain the optimized and upgraded target U-NET model, so that noisy voice can be denoised accurately and the efficiency and accuracy of keyword detection in noisy voice are improved.
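Under the loss defined below, one such adjustment cycle could look like this hedged PyTorch sketch (train_step and its argument names are illustrative, not from the patent):

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, mag, psm_true, label_true):
    """One optimization cycle: estimate PSM and label, compute
    Loss = MAE(psm_hat, psm) + MAE(label_hat, label), and update the model."""
    psm_hat, label_hat = model(mag)
    loss = F.l1_loss(psm_hat, psm_true) + F.l1_loss(label_hat, label_true)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()  # the loop repeats until this converges
```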
In one embodiment, calculating the model loss function according to the first estimated masking value, the estimation result, the real masking value and the real judgment result includes:
calculating the model Loss function Loss by a first predetermined formula:

Loss = MAE(PSM̂, PSM) + MAE(LABEL̂, LABEL)

wherein PSM̂ and LABEL̂ are respectively the first estimated masking value and the estimation result, and PSM and LABEL respectively represent the real masking value and the real judgment result.

LABEL takes the value 1 when the voice to be trained includes the preset voice and 0 when it does not; likewise, in the testing stage the label is 1 when the voice to be tested includes the specific word voice and 0 when it does not.

MAE denotes the mean absolute error.

The real masking value PSM is obtained by calculation using a second preset formula, where the second preset formula is:

PSM = (|pure| / |mixture|) · cos(θ_pure − θ_mixture)

wherein |pure| represents the amplitude of the pure speech corresponding to the speech to be trained in the frequency domain space, |mixture| represents the amplitude of the speech to be trained in the frequency domain space, θ_pure represents the phase, in the frequency domain space, of the pure speech corresponding to the speech to be trained (i.e. the argument of its complex frequency-domain representation), and θ_mixture represents the phase of the speech to be trained in the frequency domain space.
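In code, the true PSM of the second preset formula can be computed directly from the complex spectra. A NumPy sketch (the epsilon guard is an added assumption to avoid division by zero in silent bins):

```python
import numpy as np

def phase_sensitive_mask(pure_spec: np.ndarray, mixture_spec: np.ndarray) -> np.ndarray:
    """PSM = (|pure| / |mixture|) * cos(theta_pure - theta_mixture)."""
    eps = 1e-12
    ratio = np.abs(pure_spec) / (np.abs(mixture_spec) + eps)
    return ratio * np.cos(np.angle(pure_spec) - np.angle(mixture_spec))
```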
When the model Loss function Loss is calculated with the above formula, the mean absolute error (MAE) serves as the convergence criterion: training stops once the Loss function converges, at which point the best-optimized target U-NET model is obtained, giving the best voice detection and noise reduction performance.
In one embodiment, the inputting the second feature into the target U-NET model to determine whether a specific word speech exists in the speech to be tested and obtain a noise-reduced speech of the speech to be tested includes:
inputting the second feature into the target U-NET model to judge whether specific word voice exists in the voice to be tested and to obtain a second estimated masking value corresponding to the voice to be tested; the second feature is the amplitude of the voice to be tested in the frequency domain space, i.e. the modulus of its complex frequency-domain representation;
carrying out short-time Fourier transform on the voice to be tested to obtain a frequency spectrum of the voice to be tested;
and multiplying the second estimated masking value and the frequency spectrum, and then performing inverse Fourier transform to obtain the noise-reduced voice.
After the target U-NET model is obtained, the second feature of the voice to be tested can be input into it to judge whether specific word voice really exists in the voice to be tested, so that a given keyword can be identified accurately, and a second estimated masking value (PSM) is obtained. Applying the short-time Fourier transform (STFT) to the voice to be tested yields its spectrum; multiplying the spectrum by the second estimated masking value and applying the inverse transform (ISTFT) then gives a good noise reduction result.
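The reconstruction step can be sketched as follows (SciPy-based; the STFT parameters must match those used for feature extraction, and psm_hat must have the same shape as the resulting spectrum):

```python
import numpy as np
from scipy.signal import stft, istft

def denoise(noisy: np.ndarray, psm_hat: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Multiply the estimated PSM with the noisy spectrum, then inverse-STFT."""
    _, _, spec = stft(noisy, fs=sr, nperseg=512)         # spectrum of the test voice
    _, enhanced = istft(psm_hat * spec, fs=sr, nperseg=512)
    return enhanced                                      # noise-reduced waveform
```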
The technical solution of the present invention will be further described in detail with reference to fig. 1B:
step 1: generate data. Original specific word data is mixed with various types of noise at different signal-to-noise ratios (-5 to 15 dB), and non-specific word data is mixed with noise at different signal-to-noise ratios in the same way; the mixed voice is used as training data. A validation set is generated in the same manner, with noise types, signal-to-noise ratios and speakers different from those of the training set. The model is trained with the training set and monitored with the validation set, which does not participate in error back-propagation;
step 2: extract features. The short-time Fourier transform of each training utterance is computed, and its amplitude is normalized and used as the input of the model;
step 3: calculate the training target, which consists of two parts. One part is computed from the noisy training speech (mixture) and its corresponding clean speech (pure) and yields the phase-sensitive mask (the true PSM):

PSM = (|pure| / |mixture|) · cos(θ_pure − θ_mixture)

where |·| denotes amplitude and θ denotes phase (θ_pure for the clean speech, θ_mixture for the mixed speech). The other part is a LABEL for the whole utterance: specific word speech is labeled 1 and non-specific word speech is labeled 0;
step 4: input the extracted features into the U-NET network model for training, using the mean absolute error (MAE) as the convergence criterion; training stops once the loss function converges, and the model is saved. The loss function is defined as:

Loss = MAE(PSM̂, PSM) + MAE(LABEL̂, LABEL)

where PSM̂ and LABEL̂ are respectively the PSM and the LABEL estimated by the model.
In the testing stage, the features of the test voice are processed by the trained model to obtain a judgment of whether the test voice contains the specific word together with an estimated PSM; the PSM is multiplied by the spectrum of the test voice (obtained by STFT) and an inverse Fourier transform is applied to obtain the noise-reduced voice.
Finally, it should be noted that the above embodiments can be freely combined by those skilled in the art according to actual needs.
Corresponding to the method for processing the specific word speech provided in the embodiment of the present invention, an embodiment of the present invention further provides a device for processing the specific word speech, as shown in fig. 2, where the device includes:
an obtaining module 201, configured to obtain a voice to be trained with noise;
an extracting module 202, configured to extract a first feature of the speech to be trained;
the input module 203 is used for inputting the first characteristic into a U-NET model to be trained to obtain a target U-NET model;
the first processing module 204 is configured to obtain a voice to be tested, and extract a second feature of the voice to be tested;
the second processing module 205 is configured to input the second feature to the target U-NET model, so as to determine whether a specific word voice exists in the voice to be tested, and obtain a noise reduction voice of the voice to be tested.
In one embodiment, the input module comprises:
the input submodule is used for inputting the first characteristic into the U-NET model to be trained so as to obtain a first estimated masking value corresponding to the voice to be trained and an estimation result of whether the voice to be trained comprises preset voice; the first characteristic is the amplitude value of the voice to be trained in a frequency domain space;
and the training submodule is used for training the U-NET model to be trained according to the first estimation masking value and the estimation result so as to obtain the target U-NET model.
In one embodiment, the training submodule is specifically configured to:
acquiring a real masking value corresponding to the voice to be trained and a real judgment result of whether the voice to be trained comprises a preset voice;
calculating a model loss function according to the first estimated masking value, the estimation result, the real masking value and the real judgment result;
and adjusting the U-NET model to be trained according to the model loss function to obtain the target U-NET model.
In one embodiment, the training submodule is further configured to:
calculating the model Loss function Loss by a first predetermined formula:

Loss = MAE(PSM̂, PSM) + MAE(LABEL̂, LABEL)

wherein PSM̂ and LABEL̂ are respectively the first estimated masking value and the estimation result, PSM and LABEL respectively represent the real masking value and the real judgment result, and MAE represents the mean absolute error;

the real masking value PSM is obtained by calculation using a second preset formula, where the second preset formula is:

PSM = (|pure| / |mixture|) · cos(θ_pure − θ_mixture)

wherein |pure| represents the amplitude of the pure speech corresponding to the speech to be trained in the frequency domain space, |mixture| represents the amplitude of the speech to be trained in the frequency domain space, θ_pure represents the phase, in the frequency domain space, of the pure speech corresponding to the speech to be trained, and θ_mixture represents the phase of the speech to be trained in the frequency domain space.
In one embodiment, the second processing module comprises:
the input submodule is used for inputting the second characteristics into the target U-NET model so as to judge whether a specific word sound exists in the voice to be tested and a second estimated masking value corresponding to the voice to be tested; the second characteristic is the amplitude of the voice to be tested in a frequency domain space;
the conversion submodule is used for carrying out short-time Fourier transform on the voice to be tested so as to obtain the frequency spectrum of the voice to be tested;
and the processing submodule is used for multiplying the second estimated masking value and the frequency spectrum and then performing inverse Fourier transform to obtain the noise-reduced voice.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (6)

1. A method for processing a specific word speech, comprising:
acquiring a voice to be trained with noise;
extracting a first feature of the voice to be trained;
inputting the first characteristic into a U-NET model to be trained to obtain a target U-NET model;
acquiring a voice to be tested, and extracting a second characteristic of the voice to be tested;
inputting the second characteristics into the target U-NET model to judge whether specific word voices exist in the voices to be tested or not and obtain noise reduction voices of the voices to be tested;
inputting the first characteristic into a U-NET model to be trained to obtain a target U-NET model, wherein the method comprises the following steps:
inputting the first characteristic into the U-NET model to be trained to obtain a first estimated masking value corresponding to the voice to be trained and an estimation result of whether the voice to be trained comprises preset voice; the first characteristic is the amplitude value of the voice to be trained in a frequency domain space;
training the U-NET model to be trained according to the first estimation masking value and the estimation result to obtain the target U-NET model;
the training the U-NET model to be trained according to the first estimated masking value and the estimation result to obtain the target U-NET model comprises:
acquiring a real masking value corresponding to the voice to be trained and a real judgment result of whether the voice to be trained comprises a preset voice;
calculating a model loss function according to the first estimated masking value, the estimation result, the real masking value and the real judgment result;
and adjusting the U-NET model to be trained according to the model loss function to obtain the target U-NET model.
2. The method of claim 1,
calculating a model loss function according to the first estimated masking value, the estimation result, the real masking value and the real judgment result, including:
calculating the model Loss function Loss by a first predetermined formula:

Loss = MAE(PSM̂, PSM) + MAE(LABEL̂, LABEL)

wherein PSM̂ and LABEL̂ are respectively the first estimated masking value and the estimation result, PSM and LABEL respectively represent the real masking value and the real judgment result, and MAE represents the mean absolute error;

the real masking value PSM is obtained by calculation using a second preset formula, where the second preset formula is:

PSM = (|pure| / |mixture|) · cos(θ_pure − θ_mixture)

wherein |pure| represents the amplitude of the pure speech corresponding to the speech to be trained in the frequency domain space, |mixture| represents the amplitude of the speech to be trained in the frequency domain space, θ_pure represents the phase, in the frequency domain space, of the pure speech corresponding to the speech to be trained, and θ_mixture represents the phase of the speech to be trained in the frequency domain space.
3. The method according to any one of claims 1 to 2,
the inputting the second characteristic into the target U-NET model to judge whether the voice to be tested has a specific word voice and obtain the noise reduction voice of the voice to be tested includes:
inputting the second characteristic into the target U-NET model to judge whether a specific word sound exists in the voice to be tested and a second estimated masking value corresponding to the voice to be tested; the second characteristic is the amplitude of the voice to be tested in a frequency domain space;
carrying out short-time Fourier transform on the voice to be tested to obtain a frequency spectrum of the voice to be tested;
and multiplying the second estimated masking value and the frequency spectrum, and then performing inverse Fourier transform to obtain the noise-reduced voice.
4. An apparatus for processing a specific word speech, comprising:
the acquisition module is used for acquiring a voice to be trained with noise;
the extraction module is used for extracting a first feature of the voice to be trained;
the input module is used for inputting the first characteristic into a U-NET model to be trained so as to obtain a target U-NET model;
the first processing module is used for acquiring a voice to be tested and extracting a second characteristic of the voice to be tested;
the second processing module is used for inputting the second characteristics to the target U-NET model so as to judge whether specific word voices exist in the voices to be tested and obtain noise reduction voices of the voices to be tested;
the input module includes:
the input submodule is used for inputting the first characteristic into the U-NET model to be trained so as to obtain a first estimated masking value corresponding to the voice to be trained and an estimation result of whether the voice to be trained comprises preset voice; the first characteristic is the amplitude value of the voice to be trained in a frequency domain space;
the training submodule is used for training the U-NET model to be trained according to the first estimation masking value and the estimation result to obtain the target U-NET model;
the training submodule is specifically configured to:
acquiring a real masking value corresponding to the voice to be trained and a real judgment result of whether the voice to be trained comprises a preset voice;
calculating a model loss function according to the first estimated masking value, the estimation result, the real masking value and the real judgment result;
and adjusting the U-NET model to be trained according to the model loss function to obtain the target U-NET model.
5. The apparatus of claim 4,
the training submodule is further specifically configured to:
calculating the model Loss function Loss by a first predetermined formula:

Loss = MAE(PSM̂, PSM) + MAE(LABEL̂, LABEL)

wherein PSM̂ and LABEL̂ are respectively the first estimated masking value and the estimation result, PSM and LABEL respectively represent the real masking value and the real judgment result, and MAE represents the mean absolute error;

the real masking value PSM is obtained by calculation using a second preset formula, where the second preset formula is:

PSM = (|pure| / |mixture|) · cos(θ_pure − θ_mixture)

wherein |pure| represents the amplitude of the pure speech corresponding to the speech to be trained in the frequency domain space, |mixture| represents the amplitude of the speech to be trained in the frequency domain space, θ_pure represents the phase, in the frequency domain space, of the pure speech corresponding to the speech to be trained, and θ_mixture represents the phase of the speech to be trained in the frequency domain space.
6. The apparatus according to any one of claims 4 to 5,
the second processing module comprises:
the input submodule is used for inputting the second characteristics into the target U-NET model so as to judge whether a specific word sound exists in the voice to be tested and a second estimated masking value corresponding to the voice to be tested; the second characteristic is the amplitude of the voice to be tested in a frequency domain space;
the conversion submodule is used for carrying out short-time Fourier transform on the voice to be tested so as to obtain the frequency spectrum of the voice to be tested;
and the processing submodule is used for multiplying the second estimated masking value and the frequency spectrum and then performing inverse Fourier transform to obtain the noise-reduced voice.
CN202010307655.9A 2020-04-17 2020-04-17 Method and device for processing specific word voice Active CN111613211B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010307655.9A CN111613211B (en) 2020-04-17 2020-04-17 Method and device for processing specific word voice

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010307655.9A CN111613211B (en) 2020-04-17 2020-04-17 Method and device for processing specific word voice

Publications (2)

Publication Number Publication Date
CN111613211A CN111613211A (en) 2020-09-01
CN111613211B true CN111613211B (en) 2023-04-07

Family

ID=72203952

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010307655.9A Active CN111613211B (en) 2020-04-17 2020-04-17 Method and device for processing specific word voice

Country Status (1)

Country Link
CN (1) CN111613211B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115798455B * 2023-02-07 2023-06-02 深圳元象信息科技有限公司 (Shenzhen Yuanxiang Information Technology Co., Ltd.) Speech synthesis method, system, electronic device and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108986835A * 2018-08-28 2018-12-11 Baidu Online Network Technology (Beijing) Co., Ltd. Speech de-noising method, apparatus, device and medium based on an improved GAN network

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107845389B * 2017-12-21 2020-07-17 Beijing University of Technology Speech enhancement method based on multi-resolution auditory cepstrum coefficient and deep convolutional neural network
CN109065027B * 2018-06-04 2023-05-02 Ping An Technology (Shenzhen) Co., Ltd. Voice distinguishing model training method and device, computer equipment and storage medium
CN109461456B * 2018-12-03 2022-03-22 Unisound Intelligent Technology Co., Ltd. Method for improving success rate of voice awakening
CN110060704A * 2019-03-26 2019-07-26 Tianjin University A sound enhancement method of improved multiple-target criterion learning
CN110600017B * 2019-09-12 2022-03-04 Tencent Technology (Shenzhen) Co., Ltd. Training method of voice processing model, voice recognition method, system and device

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108986835A * 2018-08-28 2018-12-11 Baidu Online Network Technology (Beijing) Co., Ltd. Speech de-noising method, apparatus, device and medium based on an improved GAN network

Also Published As

Publication number Publication date
CN111613211A (en) 2020-09-01


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant