CN111028861B - Spectrum mask model training method, audio scene recognition method and system - Google Patents

Spectrum mask model training method, audio scene recognition method and system

Info

Publication number
CN111028861B
Authority
CN
China
Prior art keywords
audio
scene
spectrum
mixed
audio sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911257776.0A
Other languages
Chinese (zh)
Other versions
CN111028861A
Inventor
俞凯
吴梦玥
徐薛楠
丁翰林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sipic Technology Co Ltd
Original Assignee
Sipic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sipic Technology Co Ltd filed Critical Sipic Technology Co Ltd
Priority to CN201911257776.0A priority Critical patent/CN111028861B/en
Publication of CN111028861A publication Critical patent/CN111028861A/en
Application granted granted Critical
Publication of CN111028861B publication Critical patent/CN111028861B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L21/0216: Noise filtering characterised by the method used for estimating noise
    • G10L21/0232: Processing in the frequency domain
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272: Voice signal separating
    • G10L21/028: Voice signal separating using properties of sound source
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L2021/02087: Noise filtering the noise being separate speech, e.g. cocktail party

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention discloses a spectrum mask model training method, which comprises the following steps: generating a mixed audio sample set based on a scene audio sample set and a speech audio sample set; acquiring a mixed audio sample from the mixed audio sample set and inputting it to a spectrum mask model to be trained to obtain a mask corresponding to the scene audio contained in the mixed audio sample; multiplying the mask with the mixed spectrum of the mixed audio sample to filter out the spectrum corresponding to the speech audio from the mixed spectrum; obtaining the spectrum of the scene audio sample used to generate the mixed audio sample; and training the spectral mask model to be trained by minimizing the difference between the filtered mixed spectrum and the spectrum of the scene audio sample. The embodiment of the application realizes a spectrogram masking framework to filter out the speech contained in mixed audio and significantly improve the classification performance on scene data accompanied by speech.

Description

Spectrum mask model training method, audio scene recognition method and system
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a spectrum mask model training method, an audio scene recognition method and an audio scene recognition system.
Background
Past acoustic scene classification (audio scene classification) research has focused mainly on audio data containing only ambient sounds. However, in real-world scenes, speech may appear in any acoustic scene. Preliminary experiments performed by the inventors show that the accuracy of scene classification decreases significantly in the presence of speech. This speech-rich situation presents a new challenge to audio scene classification and to the broader audio processing field: how to make the environmental features easier to identify.
Audio research has become an area of great interest in recent years, as the rich information provided by ambient sounds can help improve the perception capabilities of machines. However, most audio scene classification studies are performed on clean environmental data, whereas in the real world most scenes are accompanied by speech, which is usually the dominant signal source.
Google's newly released Android 10 provides the Live Caption technology (an instant captioning function: while a video is played, a sentence describing its video and audio content is output at the same time). Live Caption automatically generates a sentence describing the video as it is played, whereas pure audio classification technology judges, from the features of the input audio, which scene the background of the audio belongs to.
Live Caption does not explicitly give the background scene information of the audio, and the generated description is sometimes not accurate enough; pure audio classification technology, in turn, does not perform well on audio that contains clear speech.
The inventors have discovered, in the course of practicing the present invention, that one skilled in the art would typically address the above problems by adding more training data, i.e. by training on data that contains speech.
Disclosure of Invention
The embodiment of the invention provides a spectrum mask model training method, an audio scene recognition method and an audio scene recognition system, which are used for solving at least one of the technical problems.
In a first aspect, an embodiment of the present invention provides a spectrum mask model training method, including:
generating a mixed audio sample set based on the scene audio sample set and the speech audio sample set;
acquiring mixed audio samples from the mixed audio sample set, inputting the mixed audio samples to a spectrum mask model to be trained to obtain a mask corresponding to scene audio contained in the mixed audio samples;
multiplying the mask with a mixed spectrum of the mixed audio samples to filter out the spectrum corresponding to the speech audio from the mixed spectrum;
obtaining a spectrum of a scene audio sample used to generate the mixed audio sample;
training the spectral mask model to be trained by minimizing a difference between the filtered mixed spectrum and a spectrum of the scene audio sample.
In some embodiments, the spectral mask model to be trained includes: an input layer, a plurality of GLU mask blocks, a linear layer, and an output layer, which are sequentially connected.
In some embodiments, each of the plurality of GLU mask blocks includes: sequentially connected convolutional layers, batch normalization layers and gated linear units as activation functions.
In some embodiments, generating the set of mixed audio samples based on the set of scene audio samples and the set of speech audio samples comprises:
generating the mixed audio sample set at a plurality of preset signal-to-noise ratios based on the scene audio sample set and the speech audio sample set.
In some embodiments, training the spectral mask model to be trained by minimizing a difference between the filtered mixed spectrum and the spectrum of the scene audio sample comprises:
minimizing the following loss function to train the spectral mask model to be trained:

L = || S_scene - Ŝ_scene ||²

wherein S_scene is the spectrum of the scene audio sample and Ŝ_scene is the filtered mixed spectrum.
In a second aspect, an embodiment of the present invention provides an audio scene identification method, including:
inputting audio data to be recognized into a spectrum mask model obtained by adopting any one of the spectrum mask model training methods, so as to filter out voice audio data contained in the audio data to be recognized, wherein the audio data to be recognized contains voice audio data and scene audio data;
and processing the output data of the spectrum mask model by adopting a pre-trained audio scene classification model so as to determine the audio scene corresponding to the audio data to be identified.
In a third aspect, an embodiment of the present invention provides an audio scene recognition system, including: a spectrum mask model obtained by any one of the foregoing spectrum mask model training methods and a pre-trained audio scene classification model; wherein,
the spectrum mask model is used for filtering voice audio data contained in audio data to be recognized, and the audio data to be recognized contains the voice audio data and scene audio data;
the pre-trained audio scene classification model is used for processing the output data of the spectrum mask model to determine the audio scene corresponding to the audio data to be identified.
In a fourth aspect, an embodiment of the present invention provides a spectrum mask model training system, including:
a sample generation module for generating a mixed audio sample set based on the scene audio sample set and the speech audio sample set;
a spectral mask model to be trained, configured to receive mixed audio samples obtained from the mixed audio sample set and to output a mask corresponding to the scene audio contained in the mixed audio samples;
a filtering module, configured to multiply the mask with a mixed spectrum of the mixed audio sample to filter out a spectrum corresponding to a voice audio in the mixed spectrum;
a spectrum data obtaining module, configured to obtain a spectrum of a scene audio sample used for generating the mixed audio sample;
a training module for training the to-be-trained spectral mask model by minimizing a difference between the filtered mixed spectrum and the spectrum of the scene audio sample.
In a fifth aspect, an embodiment of the present invention provides a storage medium, where one or more programs including execution instructions are stored, where the execution instructions can be read and executed by an electronic device (including but not limited to a computer, a server, or a network device, etc.) to perform any one of the above-described spectral mask model training method and/or audio scene recognition method of the present invention.
In a sixth aspect, an electronic device is provided, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executable by the at least one processor to enable the at least one processor to perform any one of the spectral mask model training methods and/or the audio scene recognition methods of the present invention described above.
In a seventh aspect, an embodiment of the present invention further provides a computer program product, where the computer program product includes a computer program stored on a storage medium, and the computer program includes program instructions, which, when executed by a computer, cause the computer to execute any one of the above-mentioned spectral mask model training method and/or audio scene recognition method.
The embodiment of the invention has the following beneficial effects: the embodiment of the application realizes a spectrogram masking framework in which the speech information is masked by a spectral mask, so that the scene information is enhanced and the shortcoming of insufficient classification performance is overcome.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a flow chart of an embodiment of a spectral mask model training method of the present invention;
FIG. 2 is a framework of an embodiment of an audio scene recognition system of the present invention;
FIG. 3a is a schematic diagram of the architecture and parameters of an embodiment of a spectral mask model in the present invention;
FIG. 3b is a schematic diagram of the architecture and parameters of an embodiment of an audio scene classification model according to the present invention;
FIG. 4 is a graph showing the comparison of the accuracy of scene classification by the reference model and the spectral mask model of the present invention in the case of mixed audio of different signal-to-noise ratios;
fig. 5 is a schematic structural diagram of an embodiment of an electronic device according to the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
As used in this disclosure, "module," "device," "system," and the like are intended to refer to a computer-related entity, either hardware, a combination of hardware and software, or software in execution. In particular, for example, an element may be, but is not limited to being, a process running on a processor, an object, an executable, a thread of execution, a program, and/or a computer. Also, an application or script running on a server, or a server, may be an element. One or more elements may be in a process and/or thread of execution and an element may be localized on one computer and/or distributed between two or more computers and may be operated by various computer-readable media. The elements may also communicate by way of local and/or remote processes based on a signal having one or more data packets, e.g., from a data packet interacting with another element in a local system, distributed system, and/or across a network in the internet with other systems by way of the signal.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The past acoustic scene classification (audio scene classification) research has focused mainly on audio data containing ambient sounds. However, in real-world scenes, speech may appear in any acoustic scene. Our preliminary experiments show that the accuracy of scene classification is significantly reduced in the presence of speech. This rich speech situation presents new challenges to audio scene classification and other broader audio processing areas: how to make the environmental features easier to identify.
The present invention is therefore directed to enhancing scene information in speech-rich audio for the audio scene classification task. To this end, data mixing speech and scene audio at various SNR (signal-to-noise ratio) values is first generated, and a spectral mask method is then proposed to filter out the speech and enhance the scene information. Experimental results show that the proposed masking method effectively improves classification accuracy on speech-rich acoustic scene data. In particular, when the speech and the acoustic scene are equally dominant, i.e. at SNR = 0 dB, the classification performance is relatively improved by 36%.
As shown in fig. 1, an embodiment of the present invention provides a spectrum mask model training method, including:
and S10, generating a mixed audio sample set based on the scene audio sample set and the voice audio sample set.
Illustratively, the scene audio sample set contains pure scene audio, which may include, for example, background audio of scenes such as airports, buses, shopping centers, pedestrian streets, street traffic, subway stations, parks, subways, public squares, and trams.
Illustratively, the set of speech audio samples comprises clean speech audio, which may include, for example, speech audio of males and females of various ages.
Illustratively, a mixed audio sample set is generated according to a plurality of set signal-to-noise ratios based on the scene audio sample set and the voice audio sample set, and the obtained mixed audio sample set comprises mixed audio samples with a plurality of signal-to-noise ratios.
And S20, acquiring mixed audio samples from the mixed audio sample set, inputting the mixed audio samples to a spectrum mask model to be trained, and obtaining masks corresponding to the scene audio contained in the mixed audio samples.
Illustratively, the spectrum mask model to be trained employs a CNN-based network model. The mixed audio sample may be a mixture of a scene audio sample and a speech audio sample.
Illustratively, the spectrum mask model to be trained includes: an input layer, a plurality of GLU mask blocks, a linear layer, and an output layer, which are sequentially connected. Wherein each of the plurality of GLU mask blocks respectively comprises: sequentially connected convolutional layers, batch normalization layers and gated linear units as activation functions.
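As an illustration only, the following is a minimal PyTorch sketch of one such GLU mask block; the class name, channel counts and kernel size are assumptions for illustration and are not the parameters of the embodiment shown in fig. 3 a.

```python
# Hypothetical sketch of a "GLU Mask Block": Conv2d -> BatchNorm2d -> gated linear unit.
# Channel counts and kernel size are illustrative assumptions, not the patented parameters.
import torch
import torch.nn as nn

class GLUMaskBlock(nn.Module):
    def __init__(self, in_channels: int, out_channels: int, kernel_size: int = 3):
        super().__init__()
        # The convolution outputs 2 * out_channels so that the GLU can split the
        # feature maps into a "content" half and a "gate" half.
        self.conv = nn.Conv2d(in_channels, 2 * out_channels, kernel_size, padding=kernel_size // 2)
        self.bn = nn.BatchNorm2d(2 * out_channels)
        self.glu = nn.GLU(dim=1)  # gate = sigmoid(second half), output = first half * gate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, frequency) spectrogram feature map
        return self.glu(self.bn(self.conv(x)))
```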
And S30, multiplying the mask and the mixed spectrum of the mixed audio sample to filter out the spectrum corresponding to the voice audio in the mixed spectrum.
Illustratively, enhancement of the scene audio in the mixed audio sample is achieved by multiplying the mask with the mixed spectrum of the mixed audio sample.
S40, obtaining the frequency spectrum of the scene audio sample used for generating the mixed audio sample.
Illustratively, the scene audio samples employed in generating the mixed audio samples are also obtained at the same time as the mixed audio samples are obtained.
For example, the generated mixed audio sample and the adopted scene audio and voice audio may be stored in association when the mixed audio sample is generated, so as to be used in the subsequent flow.
S50, training the spectrum mask model to be trained by minimizing the difference between the filtered mixed spectrum and the spectrum of the scene audio sample.
Illustratively, in some embodiments, training the spectral mask model to be trained by minimizing a difference between the filtered mixed spectrum and the spectrum of the scene audio sample comprises:
minimizing the following loss function to train the spectral mask model to be trained:

L = || S_scene - Ŝ_scene ||²

wherein S_scene is the spectrum of the scene audio sample and Ŝ_scene is the filtered mixed spectrum.
The embodiment of the application realizes a spectrogram masking framework to filter out the speech contained in the mixed audio and significantly improve the classification performance on scene data accompanied by speech.
In the embodiments of the application, clean speech and clean scene audio are superimposed to obtain mixed audio as training data. For an input mixed audio spectrum, a CNN-based mask model produces a mask of the same size as the spectrum; multiplying the mask with the original spectrum yields a spectrum with the speech information filtered out, i.e. with the scene information enhanced. During training, the difference between the clean scene spectrum and the filtered spectrum is used as the loss for training the mask model. During testing, the mask model filters out the speech, scene classification is performed on the filtered spectrum, and a CNN classifier outputs the probability of each class.
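The following is a minimal sketch of one training step consistent with the above description; the mask_model object and the use of mean squared error as the "difference" being minimized are assumptions, not the exact implementation of the embodiment.

```python
# Minimal sketch of one training step of the spectral mask model.
# `mask_model` is any network mapping a spectrogram to a mask of the same shape
# with values in [0, 1]; MSE is assumed as the measure of the spectral difference.
import torch
import torch.nn.functional as F

def train_step(mask_model, optimizer, s_mix, s_scene):
    """s_mix, s_scene: magnitude spectrograms of shape (batch, time, freq)."""
    optimizer.zero_grad()
    mask = mask_model(s_mix)                 # mask with the same shape as s_mix
    s_filtered = mask * s_mix                # element-wise masking filters out speech
    loss = F.mse_loss(s_filtered, s_scene)   # difference to the clean scene spectrum
    loss.backward()
    optimizer.step()
    return loss.item()
```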
In some embodiments, the present invention further provides a spectrum mask model training system, including:
a sample generation module for generating a mixed audio sample set based on the scene audio sample set and the speech audio sample set;
a spectral mask model to be trained, configured to receive mixed audio samples obtained from the mixed audio sample set and to output a mask corresponding to the scene audio contained in the mixed audio samples;
a filtering module, configured to multiply the mask with a mixed spectrum of the mixed audio sample to filter out a spectrum corresponding to a voice audio in the mixed spectrum;
a spectrum data obtaining module, configured to obtain a spectrum of a scene audio sample used for generating the mixed audio sample;
a training module for training the to-be-trained spectral mask model by minimizing a difference between the filtered mixed spectrum and the spectrum of the scene audio sample.
In some embodiments, the present invention further provides an audio scene recognition method, including:
inputting audio data to be recognized into a spectrum mask model obtained by adopting any one of the spectrum mask model training methods, so as to filter out voice audio data contained in the audio data to be recognized, wherein the audio data to be recognized contains voice audio data and scene audio data;
and processing the output data of the spectrum mask model by adopting a pre-trained audio scene classification model so as to determine the audio scene corresponding to the audio data to be identified.
With the speech filtering and scene classification models of the invention, high-level scene information can be extracted from audio. The same idea can also be used to filter out scene information: unnecessary information can be filtered out as needed, and the audio information of interest can be enhanced.
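A minimal sketch of how the two models might be chained at recognition time is given below; the model objects and the scene_labels list are hypothetical placeholders, not part of the original disclosure.

```python
# Hypothetical inference pipeline: filter speech with the mask model,
# then classify the filtered spectrogram with the scene classifier.
import torch

@torch.no_grad()
def recognize_scene(mask_model, scene_classifier, s_mix, scene_labels):
    """s_mix: magnitude spectrogram of the audio to be recognized, shape (1, time, freq)."""
    mask = mask_model(s_mix)                # mask for the scene audio
    s_filtered = mask * s_mix               # speech is suppressed, scene information enhanced
    logits = scene_classifier(s_filtered)   # per-class scores
    probs = torch.softmax(logits, dim=-1)
    return scene_labels[int(probs.argmax(dim=-1))], probs
```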
In some embodiments, the present invention also provides an audio scene recognition system comprising: a spectrum mask model obtained by any one of the foregoing spectrum mask model training methods and a pre-trained audio scene classification model; wherein,
the spectrum mask model is used for filtering voice audio data contained in audio data to be recognized, and the audio data to be recognized contains the voice audio data and scene audio data;
the pre-trained audio scene classification model is used for processing the output data of the spectrum mask model to determine the audio scene corresponding to the audio data to be identified.
The advantageous effects achieved by the present invention and their verification through actual experiments are described in detail below.
1. The present invention aims to investigate how to filter out speech and enhance scene sounds in the audio scene classification task. Our preliminary experiments show that adding speech to ambient sound at SNR = 0 dB causes the audio scene classification accuracy to drop sharply from 63.2% to 24.1%. Enhancing an acoustic scene by filtering out speech may be more difficult than speech enhancement: speech information is generally distributed over all frequency bins, while scene information is concentrated in the low frequency bins. Therefore, we need to attend to a wider frequency range in order to filter out speech.
The research significance of the acoustic scene enhancement surpasses the scene classification. Speech and ambient sounds are often entangled and understanding the environment is as important as understanding speech. For speech related tasks in different acoustic scenarios, an accurate estimate of the environment may help improve the performance of the speech task. This prompted us to conduct acoustic scene environment enhancement studies, especially in terms of mixing speech and scene audio. It is desirable that the model learn the underlying environmental information by correctly identifying the scene. Ultimately, this better scene understanding may be beneficial for other audio and speech signal processing tasks.
As shown in fig. 2, the framework of an embodiment of the audio scene recognition system of the present invention includes (a) a spectral mask model and (b) an audio scene classification network.
The main contributions of the invention can be summarized at least as follows:
the influence of the speech on the classification of the acoustic scene is analyzed, and the severe influence of clean speech on the classification performance of the audio scene is verified.
A spectrum mask framework is proposed to filter out speech and significantly improve the classification performance of scene data accompanying speech.
2. Related work in the prior art
Acoustic scene classification estimates the scene class, e.g. cafe, street, train, from the input acoustic signal. The main challenge remains the influence of different recording locations and recording devices. A complete description and recent advances can be found in Task 1 of the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge. However, the impact of speech has not been addressed in previous audio scene classification tasks. The DCASE2019 baseline reports an accuracy of 62.5% for this 10-class classification problem.
3. Method of the invention
The invention filters out speech by a spectrogram masking method, so that the model achieves better classification performance on mixed audio.
The present invention simulates real-world scenes by mixing the speech and scene audio signals x(t), as follows:

x_mix(t) = x_speech(t) + x_scene(t) (1)
although the summation is done on the original audio signal, the standard practice of using the amplitude spectrogram S as an audio representation is incorporated. Suppose a mixed audio spectrogram SmixContains information from scenes and speech:
Figure BDA0002310769710000101
wherein
Figure BDA0002310769710000102
And
Figure BDA0002310769710000103
respectively represent a representative SmixVoice and scene information.
Figure BDA0002310769710000104
Speech from S may be modified by applying a soft mask MmixFiltering:
Figure BDA0002310769710000105
after filtration
Figure BDA0002310769710000106
With clean scene spectrogram SsceneMay be slightly different but they represent the same context information. Therefore, when training M, we use S directlysceneTo represent
Figure BDA0002310769710000107
The system consists of two separate components:
(a) A spectral mask model, which takes (S_mix, S_scene) pairs as input and is trained by minimizing the difference between S_scene and the computed masked spectrogram Ŝ_scene. The architecture and parameters of the spectral mask model are shown in fig. 3a.
(b) An audio scene classification model, which, as a typical classification model, takes a spectrogram as input and outputs the probability of each class. During training, only clean scene data is used, while during evaluation both S_mix and the filtered Ŝ_scene are used to evaluate the masking performance. The architecture and parameters of the audio scene classification model are shown in fig. 3b.
3.1 Spectral mask model
The inspiration for the spectral mask model comes from recent advances in audio source separation. The network takes S_mix as input and predicts a mask M of the same shape as S_mix, which is then multiplied with S_mix to obtain Ŝ_scene:

Ŝ_scene = M ⊙ S_mix

L = || S_scene - Ŝ_scene ||² (4)

The spectral mask model is trained to minimize the loss in equation (4). Each block in the spectral mask model (see fig. 3a) contains a convolutional layer, a batch normalization layer and a gated linear unit (GLU) as the activation function. The GLU was originally proposed for language modeling, replacing recurrent networks with gated temporal convolutional networks. We choose it as the activation function instead of ReLU because its operation is similar to that of the spectral mask model: a mask is computed from the input, the mask values are compressed into the range 0 to 1, and the result is then multiplied with the input. The output of each block is a masked feature map, so we name the block "GLU Mask Block". The last part of the spectral mask model is a sigmoid activation that ensures the mask values lie between 0 and 1.
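Purely for illustration, such a model could be assembled as sketched below, reusing the GLUMaskBlock sketched earlier; the number of blocks, the channel width and the placement of the linear layer over frequency bins are assumptions, not the parameters of fig. 3a.

```python
# Illustrative assembly of the spectral mask model: input -> GLU mask blocks
# -> linear layer -> sigmoid, producing a mask the same size as the input spectrogram.
import torch
import torch.nn as nn

class SpectralMaskModel(nn.Module):
    def __init__(self, n_freq_bins: int = 513, channels: int = 32, n_blocks: int = 4):
        super().__init__()
        blocks = [GLUMaskBlock(1, channels)]
        blocks += [GLUMaskBlock(channels, channels) for _ in range(n_blocks - 1)]
        self.blocks = nn.Sequential(*blocks)
        self.proj = nn.Conv2d(channels, 1, kernel_size=1)   # collapse back to one channel
        self.linear = nn.Linear(n_freq_bins, n_freq_bins)   # linear layer over frequency bins

    def forward(self, s_mix: torch.Tensor) -> torch.Tensor:
        # s_mix: (batch, time, freq) magnitude spectrogram
        x = s_mix.unsqueeze(1)                  # add a channel dimension
        x = self.proj(self.blocks(x)).squeeze(1)
        return torch.sigmoid(self.linear(x))    # mask values in (0, 1), same shape as s_mix
```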
3.2 Classification model
The classification model is a Light CNN based network consisting of Max-Feature-Map (MFM) blocks and groups. Each MFM block contains a convolutional layer, an MFM activation and a batch normalization layer. An MFM group is a combination of two MFM blocks, where the convolutional layer of the second block has the same number of output filters as its input. Since Light CNN performs well in face recognition with noisy labels, we adopt it as a reliable classification model.
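The sketch below illustrates the MFM idea only; the channel counts and kernel size are assumptions and do not reproduce the Light CNN configuration of fig. 3b.

```python
# Sketch of an MFM block: convolution whose output channels are split in half,
# followed by an element-wise max (the MFM activation) and batch normalization.
import torch
import torch.nn as nn

class MFMBlock(nn.Module):
    def __init__(self, in_channels: int, out_channels: int, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 2 * out_channels, kernel_size, padding=kernel_size // 2)
        self.bn = nn.BatchNorm2d(out_channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, b = torch.chunk(self.conv(x), 2, dim=1)  # split the channels into two halves
        return self.bn(torch.max(a, b))             # MFM activation: element-wise max
```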
4. Experimental setup
In the experiments, a standard speech separation dataset is intentionally not used; instead, in order to obtain clean, labeled and comparable acoustic scene data, speech-rich scene data is generated by mixing background scene audio with speech. The individual data sources, the data generation process, and our training procedure are described below.
4.1 Datasets
Scene audio: the corpus in Task1A challenged with DCASE2019 was used. Since the evaluation dataset labels are not available, we train and test our model on the development dataset. The development dataset contained 14400 audio segments from 10 cities with a sampling rate of 48kHz, each segment lasting 10 seconds. Including 10 scenarios: airports, buses, shopping centers, pedestrian streets, street traffic, subway stations, parks, subways, public squares and trams. To make our results comparable, we followed the official provided evaluation setup: 9185 subdivisions in the training subset and 4185 subdivisions in the testing subset. The remaining 1030 segments are not used in this setup.
Voice audio: VoxCeleb1, a benchmark dataset for speaker recognition, is used as the speech source. It is a large-scale text-independent dataset containing short speech clips from 1251 speakers of different ethnicities, accents and ages. The sampling rate of the audio clips is 16 kHz.
4.2 Data generation
Training data is generated by mixing clean scene audio with clean speech audio. To match the segment length of the scene dataset and generate natural-sounding samples, we only select speech utterances longer than 10 s. The mixing is then completed by trimming the speech to 10 s and mixing it with the scene audio. The scene audio segments are downsampled to 16 kHz to match the speech utterances. The selected 13370 utterances come from 1227 different speakers, covering nearly all of the 1251 speakers. The number of selected speakers ensures that the generated data contains a wide variety of speech content and speakers.
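For illustration, the selection, trimming and resampling steps could look like the sketch below; the use of librosa, the function interface and the file paths are assumptions, not part of the original disclosure.

```python
# Hypothetical preparation of one (scene, speech) pair: resample the scene clip
# to 16 kHz and trim a >10 s speech clip to exactly 10 s before mixing.
import librosa
import numpy as np

SR = 16000          # target sampling rate
CLIP_SECONDS = 10   # segment length of the scene dataset

def load_pair(scene_path: str, speech_path: str):
    scene, _ = librosa.load(scene_path, sr=SR)    # downsample scene audio to 16 kHz
    speech, _ = librosa.load(speech_path, sr=SR)
    if len(speech) < CLIP_SECONDS * SR:
        raise ValueError("only utterances longer than 10 s are selected")
    speech = speech[: CLIP_SECONDS * SR]          # trim speech to 10 s
    scene = scene[: CLIP_SECONDS * SR]
    return scene, speech
```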
To explore the impact of speech, we generate mixed audio at different SNRs. As mentioned earlier, our goal is the opposite of previous speech enhancement studies: we need to filter out speech rather than ambient sound, so here the SNR represents the scene-to-speech ratio. A positive SNR indicates that the scene sound is dominant, and a negative SNR indicates that speech is dominant. The SNRs used in this study range from -5 dB to 15 dB with a step size of 5 dB.
Using the selected SNR value, we can calculate the scaling factor applied to the original speech signal when mixing:

α(SNR) = sqrt( E(x_scene) / ( E(x_speech) · 10^(SNR/10) ) ) (5)

x_mix(SNR) = x_scene + α(SNR) · x_speech (6)

where E(·) denotes the signal energy.
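A short sketch of equations (5) and (6) follows, assuming the energy-ratio form of α(SNR) reconstructed above.

```python
# Mix scene and speech at a target scene-to-speech SNR (in dB), following
# x_mix = x_scene + alpha(SNR) * x_speech with alpha derived from signal energies.
import numpy as np

def mix_at_snr(x_scene: np.ndarray, x_speech: np.ndarray, snr_db: float) -> np.ndarray:
    e_scene = np.sum(x_scene ** 2)     # signal energy E(x_scene)
    e_speech = np.sum(x_speech ** 2)   # signal energy E(x_speech)
    alpha = np.sqrt(e_scene / (e_speech * 10 ** (snr_db / 10)))  # equation (5)
    return x_scene + alpha * x_speech  # equation (6)

# Example: generate mixtures from -5 dB to 15 dB in 5 dB steps.
# mixtures = [mix_at_snr(scene, speech, snr) for snr in range(-5, 20, 5)]
```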
4.3 Training
The input to the model is a standard spectrogram, extracted by a short-time Fourier transform (STFT) every 20 milliseconds over a 40 millisecond analysis window. Since audio scene classification does not rely on rapid changes within short frames, we set the frame length and shift to values larger than the standard ones (25 ms and 10 ms). We use a Hann window of length 1024 samples, corresponding to 64 ms at a sampling rate of 16 kHz. For each audio clip, we obtain a magnitude spectrogram of size 500 × 513.
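For illustration, the spectrogram extraction could be performed as sketched below; the use of librosa and the choice of the 1024-sample window (which yields the 513 frequency bins quoted above) are assumptions.

```python
# Extract a magnitude spectrogram with a 1024-sample Hann window and a 20 ms hop
# at 16 kHz, giving roughly 500 frames x 513 frequency bins for a 10 s clip.
import librosa
import numpy as np

def magnitude_spectrogram(y: np.ndarray, sr: int = 16000) -> np.ndarray:
    hop = int(0.020 * sr)        # 20 ms frame shift -> 320 samples at 16 kHz
    stft = librosa.stft(y, n_fft=1024, hop_length=hop, window="hann")
    return np.abs(stft).T        # shape: (time frames, 513 frequency bins)
```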
As described in section 4.1, we use the official evaluation setup. The development set is split into a training subset and a validation subset to evaluate the classification performance of the model. The training subset contains 75% of the entire development dataset. The spectral mask model and the audio scene classification model are trained separately. The audio scene classification model is trained on S_scene. We use an Adam optimizer with β1 and β2 set to 0.9 and 0.999 and a learning rate of 1 × 10^-3. The audio scene classification model is trained until the validation loss does not decrease for 5 epochs. The mask model is then trained for 20 epochs with the same optimization settings, except that the learning rate is set to 1 × 10^-4. The optimal mask model, selected by its performance on the validation subset, is used for filtering S_mix. Finally, we evaluate the classification performance of the audio scene classification model on S_mix and on the filtered S_mix.
5. Results and discussion
Fig. 4 compares the scene classification accuracy of the baseline model and the model of the invention on mixed audio with different signal-to-noise ratios. Since we did not focus on improving the classification performance on pure scene audio, the classification model alone achieves an accuracy of 63.2%, comparable to the DCASE2019 baseline (62.5%). Speech has a significant impact on classification performance: accuracy decreases as the speech component increases, i.e. as the SNR becomes smaller. When speech and scene are equally dominant (SNR = 0), the audio scene classification accuracy drops to as low as 24.1%.
Our proposed spectrogram mask model is shown to be effective for acoustic scene enhancement and speech filtering. For all SNRs from -5 dB to 15 dB the classification accuracy improves, and the largest gain is reached at SNR = 0, a relative improvement of 36%. When the SNR is -5 dB, the speech signal becomes very dominant and the acoustic scene sound is hardly recognizable, so the improvement is relatively small. As the SNR increases, the scene energy becomes easier to resolve, so the improvement brought by the mask model becomes less pronounced, but in any event its performance exceeds that of the baseline system. Compared to speech enhancement, this enhancement effect seems less pronounced; however, it is worth noting that filtering out speech is a more challenging task.
6. Conclusion
The invention addresses the significant influence of speech on audio analysis, especially acoustic scene classification. In addition to generating a speech-rich audio dataset, this study also proposes an effective spectrogram masking method to enhance the ambient sound and thus improve audio scene classification performance at different SNRs. Especially when the speech and scene signals are equally dominant (SNR = 0 dB), the relative improvement exceeds 36%. Future work may extend the challenge of speech-rich scenes to other audio processing tasks.
It should be noted that for simplicity of explanation, the foregoing method embodiments are described as a series of acts or combination of acts, but those skilled in the art will appreciate that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention. In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In some embodiments, the present invention provides a non-transitory computer-readable storage medium, in which one or more programs including executable instructions are stored, where the executable instructions can be read and executed by an electronic device (including but not limited to a computer, a server, or a network device, etc.) to perform any one of the above-described spectrum mask model training method and/or audio scene recognition method of the present invention.
In some embodiments, the present invention further provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions that, when executed by a computer, cause the computer to perform any one of the above-mentioned spectral mask model training method and/or audio scene recognition method.
In some embodiments, an embodiment of the present invention further provides an electronic device, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a spectral mask model training method and/or an audio scene recognition method.
In some embodiments, an embodiment of the present invention further provides a storage medium having a computer program stored thereon, where the program is configured to implement a spectral mask model training method and/or an audio scene recognition method when executed by a processor.
The spectrum mask model training system and/or the audio scene recognition system according to the embodiments of the present invention may be used to execute the spectrum mask model training method and/or the audio scene recognition method according to the embodiments of the present invention, and accordingly achieve the technical effects achieved by the spectrum mask model training method and/or the audio scene recognition method according to the embodiments of the present invention, which are not described herein again. In the embodiment of the present invention, the relevant functional module may be implemented by a hardware processor (hardware processor).
Fig. 5 is a schematic diagram of a hardware structure of an electronic device for executing an audio scene recognition method according to another embodiment of the present application, and as shown in fig. 5, the electronic device includes:
one or more processors 510 and memory 520, with one processor 510 being an example in fig. 5.
The apparatus for performing the audio scene recognition method may further include: an input device 530 and an output device 540.
The processor 510, the memory 520, the input device 530, and the output device 540 may be connected by a bus or other means, and the bus connection is exemplified in fig. 5.
The memory 520, which is a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as program instructions/modules corresponding to the audio scene recognition method in the embodiment of the present application. The processor 510 executes various functional applications of the server and data processing by executing the nonvolatile software programs, instructions and modules stored in the memory 520, namely, implements the audio scene recognition method of the above-described method embodiment.
The memory 520 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the audio scene recognition device, and the like. Further, the memory 520 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the memory 520 may optionally include memory located remotely from the processor 510, which may be connected to the audio scene recognition device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 530 may receive input numeric or character information and generate signals related to user settings and function control of the audio scene recognition device. The output device 540 may include a display device such as a display screen.
The one or more modules are stored in the memory 520 and, when executed by the one or more processors 510, perform the audio scene recognition method of any of the method embodiments described above.
The product can execute the method provided by the embodiment of the application, and has the corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the methods provided in the embodiments of the present application.
The electronic device of the embodiments of the present application exists in various forms, including but not limited to:
(1) Mobile communication devices, which are characterized by mobile communication capabilities and are primarily targeted at providing voice and data communications. Such terminals include smartphones (e.g., the iPhone), multimedia phones, feature phones, and low-end phones, among others.
(2) Ultra-mobile personal computer devices, which belong to the category of personal computers, have computing and processing functions, and generally have mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as the iPad.
(3) Portable entertainment devices, which can display and play multimedia content. Such devices include audio and video players (e.g., the iPod), handheld game consoles, electronic books, as well as smart toys and portable car navigation devices.
(4) The server is similar to a general computer architecture, but has higher requirements on processing capability, stability, reliability, safety, expandability, manageability and the like because of the need of providing highly reliable services.
(5) And other electronic devices with data interaction functions.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a general hardware platform, and certainly can also be implemented by hardware. Based on such understanding, the above technical solutions substantially or contributing to the related art may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (10)

1. A method of spectral mask model training, comprising:
generating a mixed audio sample set based on the scene audio sample set and the speech audio sample set;
acquiring mixed audio samples from the mixed audio sample set, inputting the mixed audio samples to a spectrum mask model to be trained to obtain a mask corresponding to scene audio contained in the mixed audio samples;
multiplying the mask with a mixed spectrum of the mixed audio samples to filter out the spectrum corresponding to the speech audio from the mixed spectrum;
obtaining a spectrum of a scene audio sample used to generate the mixed audio sample;
training the spectral mask model to be trained by minimizing a difference between the filtered mixed spectrum and a spectrum of the scene audio sample.
2. The method of claim 1, wherein the spectral mask model to be trained comprises: an input layer, a plurality of GLU mask blocks, a linear layer, and an output layer, which are sequentially connected.
3. The method of claim 2, wherein each of the plurality of GLU mask blocks respectively comprises: sequentially connected convolutional layers, batch normalization layers and gated linear units as activation functions.
4. The method of claim 1, wherein generating a mixed set of audio samples based on a set of scene audio samples and a set of speech audio samples comprises:
generating the mixed audio sample set at a plurality of preset signal-to-noise ratios based on the scene audio sample set and the speech audio sample set.
5. The method of claim 1, wherein training the spectral mask model to be trained by minimizing a difference between the filtered mixed spectrum and the spectrum of the scene audio samples comprises:
minimizing the following loss function to train the spectral mask model to be trained:

L = || S_scene - Ŝ_scene ||²

wherein S_scene is the spectrum of the scene audio sample and Ŝ_scene is the filtered mixed spectrum.
6. An audio scene recognition method, comprising:
inputting audio data to be recognized into a spectrum mask model obtained by adopting the method of any one of claims 1 to 5, so as to filter out voice audio data contained in the audio data to be recognized, wherein the audio data to be recognized contains the voice audio data and scene audio data;
and processing the output data of the spectrum mask model by adopting a pre-trained audio scene classification model so as to determine the audio scene corresponding to the audio data to be identified.
7. An audio scene recognition system comprising: a spectrum mask model obtained by adopting the method of any one of claims 1-5 and a pre-trained audio scene classification model; wherein,
the spectrum mask model is used for filtering voice audio data contained in audio data to be recognized, and the audio data to be recognized contains the voice audio data and scene audio data;
the pre-trained audio scene classification model is used for processing the output data of the spectrum mask model to determine the audio scene corresponding to the audio data to be identified.
8. A spectral mask model training system, comprising:
a sample generation module for generating a mixed audio sample set based on the scene audio sample set and the speech audio sample set;
a spectral mask model to be trained, configured to receive mixed audio samples obtained from the mixed audio sample set and to output a mask corresponding to the scene audio contained in the mixed audio samples;
a filtering module, configured to multiply the mask with a mixed spectrum of the mixed audio sample to filter out a spectrum corresponding to a voice audio in the mixed spectrum;
a spectrum data obtaining module, configured to obtain a spectrum of a scene audio sample used for generating the mixed audio sample;
a training module for training the to-be-trained spectral mask model by minimizing a difference between the filtered mixed spectrum and the spectrum of the scene audio sample.
9. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1-6.
10. A storage medium on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
CN201911257776.0A 2019-12-10 2019-12-10 Spectrum mask model training method, audio scene recognition method and system Active CN111028861B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911257776.0A CN111028861B (en) 2019-12-10 2019-12-10 Spectrum mask model training method, audio scene recognition method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911257776.0A CN111028861B (en) 2019-12-10 2019-12-10 Spectrum mask model training method, audio scene recognition method and system

Publications (2)

Publication Number Publication Date
CN111028861A CN111028861A (en) 2020-04-17
CN111028861B true CN111028861B (en) 2022-02-22

Family

ID=70205305

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911257776.0A Active CN111028861B (en) 2019-12-10 2019-12-10 Spectrum mask model training method, audio scene recognition method and system

Country Status (1)

Country Link
CN (1) CN111028861B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111653290B (en) * 2020-05-29 2023-05-02 北京百度网讯科技有限公司 Audio scene classification model generation method, device, equipment and storage medium
CN112116025A (en) * 2020-09-28 2020-12-22 北京嘀嘀无限科技发展有限公司 User classification model training method and device, electronic equipment and storage medium

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9124981B2 (en) * 2012-11-14 2015-09-01 Qualcomm Incorporated Systems and methods for classification of audio environments
US10373611B2 (en) * 2014-01-03 2019-08-06 Gracenote, Inc. Modification of electronic system operation based on acoustic ambience classification
WO2017059881A1 (en) * 2015-10-05 2017-04-13 Widex A/S Hearing aid system and a method of operating a hearing aid system
JP6517760B2 (en) * 2016-08-18 2019-05-22 日本電信電話株式会社 Mask estimation parameter estimation device, mask estimation parameter estimation method and mask estimation parameter estimation program
CN108305616B (en) * 2018-01-16 2021-03-16 国家计算机网络与信息安全管理中心 Audio scene recognition method and device based on long-time and short-time feature extraction
CN109616104B (en) * 2019-01-31 2022-12-30 天津大学 Environment sound identification method based on key point coding and multi-pulse learning
CN109741747B (en) * 2019-02-19 2021-02-12 珠海格力电器股份有限公司 Voice scene recognition method and device, voice control method and device and air conditioner
CN111863009B (en) * 2020-07-15 2022-07-26 思必驰科技股份有限公司 Training method and system of context information prediction model
CN112967730B (en) * 2021-01-29 2024-07-02 北京达佳互联信息技术有限公司 Voice signal processing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN111028861A (en) 2020-04-17

Similar Documents

Publication Publication Date Title
CN109637546B (en) Knowledge distillation method and apparatus
Bhat et al. A real-time convolutional neural network based speech enhancement for hearing impaired listeners using smartphone
Barker et al. The PASCAL CHiME speech separation and recognition challenge
CN109766759A (en) Emotion identification method and Related product
CN106486131A (en) A kind of method and device of speech de-noising
CN108417201B (en) Single-channel multi-speaker identity recognition method and system
CN108922559A (en) Recording terminal clustering method based on voice time-frequency conversion feature and integral linear programming
CN111862942B (en) Method and system for training mixed speech recognition model of Mandarin and Sichuan
CN111028861B (en) Spectrum mask model training method, audio scene recognition method and system
CN111755013B (en) Denoising automatic encoder training method and speaker recognition system
CN111179915A (en) Age identification method and device based on voice
CN113555032B (en) Multi-speaker scene recognition and network training method and device
CN112309365A (en) Training method and device of speech synthesis model, storage medium and electronic equipment
CN112989108A (en) Language detection method and device based on artificial intelligence and electronic equipment
Barker et al. The CHiME challenges: Robust speech recognition in everyday environments
CN111081260A (en) Method and system for identifying voiceprint of awakening word
CN110232928B (en) Text-independent speaker verification method and device
CN110232927B (en) Speaker verification anti-spoofing method and device
CN114255782A (en) Speaker voice enhancement method, electronic device and storage medium
CN111863009B (en) Training method and system of context information prediction model
Enzinger et al. Mismatched distances from speakers to telephone in a forensic-voice-comparison case
Lin et al. Focus on the sound around you: Monaural target speaker extraction via distance and speaker information
CN113241091B (en) Sound separation enhancement method and system
CN116978359A (en) Phoneme recognition method, device, electronic equipment and storage medium
CN112784094B (en) Automatic audio summary generation method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant after: Sipic Technology Co.,Ltd.

Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant before: AI SPEECH Ltd.

CB02 Change of applicant information
GR01 Patent grant
GR01 Patent grant