CN115359771B - Underwater sound signal noise reduction method, system, equipment and storage medium


Info

Publication number: CN115359771B (application CN202210868441.8A)
Authority: CN (China)
Original language: Chinese (zh)
Other versions: CN115359771A
Inventors: 周翱隆, 李小勇, 宋君强, 徐国军, 任开军, 邓科峰, 冷洪泽, 任小丽
Assignee: National University of Defense Technology
Legal status: Active (granted)

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10K SOUND-PRODUCING DEVICES; METHODS OR DEVICES FOR PROTECTING AGAINST, OR FOR DAMPING, NOISE OR OTHER ACOUSTIC WAVES IN GENERAL; ACOUSTICS NOT OTHERWISE PROVIDED FOR
    • G10K11/00 Methods or devices for transmitting, conducting or directing sound in general; Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
    • G10K11/16 Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
    • G10K11/161 Methods or devices for protecting against, or for damping, noise or other acoustic waves in general in systems with fluid flow
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/70 Reducing energy consumption in communication networks in wireless communication networks


Abstract

The invention discloses an underwater sound signal noise reduction method, system, equipment, and storage medium. The method obtains noisy audio, extracts a complex spectrum from the noisy audio, downsamples the complex spectrum through an encoder, learns a first feature of the complex spectrum in the time dimension through a first self-attention module and a second feature in the frequency dimension through a second self-attention module, fuses the first and second features, upsamples the fused feature through a decoder to obtain a noise-reduced audio complex spectrum, and converts that spectrum into noise-reduced audio. Long-term and local dependencies are captured from the time and frequency dimensions of the features respectively, and the two branches exchange complementary information, which enhances the expressive capability of the module, improves the accuracy of the learned time-frequency features of the noisy complex spectrum, effectively suppresses the interference of noise signals, and improves noise reduction performance.

Description

Underwater sound signal noise reduction method, system, equipment and storage medium
Technical Field
The invention relates to the technical field of underwater acoustic signal noise reduction, and in particular to an underwater acoustic signal noise reduction method, system, device, and storage medium.
Background
At present, acoustic waves are the only form of radiated energy that can propagate over long distances in the ocean, and as information carriers they are an effective means of information transmission. Underwater acoustic technology has become an important means of developing advanced marine technology, improving a nation's comprehensive marine competitiveness, and safeguarding marine security rights and interests. Related fields such as the automatic acquisition, efficient processing, and intelligent identification and tracking of underwater acoustic signals, sonar-based monitoring of the underwater environment, communication, and target positioning and tracking are among the advanced marine technologies being developed with priority. However, the marine environment is not quiet, and the signal detected by an underwater sonar may contain many kinds of signal information: not only the target's radiated noise but also other environmental noise. Underwater noise signals can be broadly divided into three categories: the first is noise caused by various marine environmental factors, commonly referred to as ocean dynamic noise; the second is sound emitted by marine organisms, commonly referred to as marine biological noise; the third is noise caused by human activity, also called man-made noise, which mainly includes radiated noise from marine vessels, sound waves emitted by active sonar, underwater engineering noise, and so on. The processing and application of ship radiated noise signals is one of the focuses and difficulties of modern underwater acoustic signal processing. Interference from environmental noise lowers the signal-to-noise level of target signals and seriously affects the effective detection, identification, and tracking of targets. Therefore, effectively denoising the signals received by sonar, reducing the influence of interfering noise signals, and enhancing target signals is of great significance and research value.
To effectively suppress the influence of interfering noise on a target signal and separate the target acoustic signal from the ambient noise interference, underwater acoustic signal enhancement is the most critical technology. Underwater acoustic signal enhancement, also called noise reduction, aims to suppress the energy of non-target noise signals and boost the energy of the target acoustic signal. Traditional underwater acoustic noise reduction techniques rely on accurate analytical modeling of the signal and fine tuning of optimized parameters, and mainly include methods based on spectral subtraction, filtering, wavelet transforms, and signal decomposition. Although these methods reduce the influence of interfering noise on target signals to some extent, they still have limitations of varying degrees and problems to be solved. For example, spectral subtraction and filtering methods can introduce irrelevant spectral components such as musical noise after denoising, which degrades audio quality; wavelet-transform and signal-decomposition methods face problems such as optimal basis function selection, threshold setting, and mode mixing. In addition, these methods often require a large number of iterative computations, suffer from long running times and low processing efficiency, perform well only under high signal-to-noise ratios, and have limited performance under the low signal-to-noise conditions of a real marine environment.
To address the problems of traditional noise reduction methods, researchers have built noise reduction models based on deep neural networks (Deep Neural Network, DNN). Relying on the strong knowledge learning and inductive reasoning capabilities of neural network models, such methods can often achieve performance superior to traditional ones. DNN-based methods typically use the short-time Fourier transform (STFT) to convert time-domain audio into a time-frequency (T-F) spectrum, and most ignore the estimation of phase information, which places an upper limit on their performance.
Disclosure of Invention
The present invention aims to solve at least the technical problems existing in the prior art. To that end, the invention provides an underwater sound signal noise reduction method, system, equipment, and storage medium, which can effectively suppress the interference of noise signals, extract time-frequency spectrum features more accurately, and improve noise reduction performance.
In a first aspect of the present invention, there is provided a method for noise reduction of an underwater acoustic signal, comprising the steps of:
acquiring noisy audio;
extracting a noisy complex spectrum from the noisy audio;
downsampling the noisy complex spectrum by an encoder to obtain high-dimensional features of the noisy audio;
learning a first feature of the complex spectrum in the time dimension by a first self-attention module and a second feature of the complex spectrum in the frequency dimension by a second self-attention module;
fusing the first feature and the second feature to obtain a fused feature;
upsampling the fused feature by a decoder to obtain a noise-reduced audio complex spectrum;
and converting the noise-reduced audio complex spectrum into noise-reduced audio.
According to the embodiment of the invention, at least the following technical effects are achieved:
According to the method, a noisy complex spectrum is extracted from the noisy audio and downsampled through an encoder to obtain high-dimensional features of the noisy audio; a first feature of the complex spectrum in the time dimension is learned through a first self-attention module and a second feature in the frequency dimension through a second self-attention module; the two features are fused, the fused feature is upsampled through a decoder to obtain the noise-reduced audio complex spectrum, and that spectrum is converted into noise-reduced audio. Long-term and local dependencies of the features are captured from the time and frequency dimensions respectively, and the two branches exchange complementary information, which enhances the expressive capability of the module, improves the accuracy of the learned time-frequency features of the noisy complex spectrum, effectively suppresses the interference of noise signals, and improves noise reduction performance.
According to some embodiments of the invention, the encoder comprises 3 stacked convolution blocks, where each convolution block comprises a 2D convolution layer, a batch normalization layer, and a parametric rectified linear unit (PReLU) activation function. Downsampling of the noisy complex spectrum by the encoder is computed as:

Y ∈ R^{T×F×2}

U_i^{Encoder} = PReLU(BN(Conv2D(U_{i-1}^{Encoder}))), i = 1, 2, 3, with U_0^{Encoder} = Y

U^{Encoder} ∈ R^{T′×F′×C}

where Y is the noisy complex spectrum, U_i^{Encoder} is the output of the i-th convolution block, PReLU is the parametric rectified linear unit activation function, BN is the batch normalization layer, U^{Encoder} is the high-dimensional feature of the noisy audio, T is the number of time frames, F is the number of frequency bins, and C is the number of channels. The convolution kernels of the 2D convolution layers are (6, 4), the strides are (2, 2), and the channel numbers are 16, 32, and 64.
According to some embodiments of the invention, each self-attention module includes a plurality of self-attention blocks, and learning the first feature of the complex spectrum in the time dimension by the first self-attention module and the second feature in the frequency dimension by the second self-attention module includes:

inputting the high-dimensional features of the noisy audio into the 1st first self-attention block and the 1st second self-attention block in parallel to obtain the 1st first feature and the 1st second feature;

performing information interaction between the 1st first feature and the 1st second feature to obtain the interacted 1st first feature and the interacted 1st second feature, where the information interaction is computed as:

Ũ_T^{(1)} = U_T^{(1)} + Conv(U_F^{(1)})

Ũ_F^{(1)} = U_F^{(1)} + Conv(U_T^{(1)})

where Conv is a two-dimensional convolution layer with kernel size (1, 1) and stride (1, 1), U_T^{(1)} is the 1st first feature, Ũ_T^{(1)} is the interacted 1st first feature, U_F^{(1)} is the 1st second feature, and Ũ_F^{(1)} is the interacted 1st second feature;

inputting the interacted 1st first feature and the interacted 1st second feature into the 2nd first self-attention block and the 2nd second self-attention block in parallel to obtain the 2nd first feature and the 2nd second feature, performing information interaction between them to obtain the interacted 2nd first feature and the interacted 2nd second feature, and so on, until the n-th first feature and the n-th second feature output by the last first self-attention block and the last second self-attention block are obtained and their information interaction yields the first feature and the second feature, where n is the number of self-attention blocks.
According to some embodiments of the invention, each first self-attention block includes a first global self-attention block and a first local self-attention block, and inputting the high-dimensional features of the noisy audio into the 1st first self-attention block to obtain the 1st first feature includes:

inputting the high-dimensional features of the noisy audio into the first global self-attention block and the first local self-attention block of the first self-attention block in parallel to obtain the global self-attention output feature of the 1st time dimension and the local self-attention output feature of the 1st time dimension;

where the high-dimensional features of the noisy audio are input into the first global self-attention block and the global self-attention output feature of the 1st time dimension is computed as:

Q, K, V = Reshape_global(Linear(U^{Encoder})), {Q, K, V} ∈ R^{T′×(F′×C)}

W_T = softmax(Q K^⊤)

SA_T^{global} = W_T V

U_T^{global,(1)} = Reshape_global*(SA_T^{global})

where Q, K, and V are the Query, Key, and Value core components of the self-attention mechanism, all operations between Q, K, and V are matrix multiplications, Reshape_global(·) converts the shape of the tensor from R^{T′×F′×C} to R^{T′×(F′×C)}, Linear is a fully connected layer, W_T is the attention matrix in the time dimension, softmax is the activation function, Reshape_global*(·) is the reverse reshape operation, and U_T^{global,(1)} is the global self-attention output feature of the 1st time dimension;

where the high-dimensional features of the noisy audio are input into the first local self-attention block and the local self-attention output feature of the 1st time dimension is computed as:

T_local = 2N_T + 1

S_T = T′

U_seg^T = Segmentation(U^{Encoder}), U_seg^T ∈ R^{S_T×T_local×(F′×C)}

SA_T = LSA(U_seg^T)

Conv(SA_T) ∈ R^{T′×1×(F′×C)}

U_T^{local,(1)} = Reshape_local(Conv(SA_T))

where N_T is the number of adjacent time frames selected on each side of the current time frame, T_local is the width of each local segment, S_T is the number of local segments, U_seg^T is the input feature of the time-dimension local self-attention mechanism, SA_T is the output feature of the time-dimension local self-attention mechanism, Segmentation(·) divides the feature vector into local segments and recombines them into the resulting feature vector, LSA(·) performs the self-attention operation on the local segments, Reshape_local(·) converts the shape of the tensor from R^{T′×1×(F′×C)} to R^{T′×F′×C}, and U_T^{local,(1)} is the local self-attention output feature of the 1st time dimension;

concatenating and convolving the global self-attention output feature of the 1st time dimension and the local self-attention output feature of the 1st time dimension to obtain the 1st first feature, computed as:

U_T^{(1)} = Conv(Concat(U_T^{global,(1)}, U_T^{local,(1)})), U_T^{(1)} ∈ R^{T′×F′×C}

where U_T^{(1)} is the 1st first feature, Concat is the concatenation operation, and Conv is a two-dimensional convolution layer with kernel size (1, 1) and stride (1, 1).
According to some embodiments of the invention, each second self-attention block includes a second global self-attention block and a second local self-attention block, and inputting the high-dimensional features of the noisy audio into the 1st second self-attention block to obtain the 1st second feature includes:

inputting the high-dimensional features of the noisy audio into the second global self-attention block and the second local self-attention block of the first second self-attention block in parallel to obtain the global self-attention output feature of the 1st frequency dimension and the local self-attention output feature of the 1st frequency dimension;

where the high-dimensional features of the noisy audio are input into the second global self-attention block and the global self-attention output feature of the 1st frequency dimension is computed as:

Q, K, V = Reshape_global(Linear(U^{Encoder})), {Q, K, V} ∈ R^{F′×(T′×C)}

W_F = softmax(Q K^⊤)

SA_F^{global} = W_F V

U_F^{global,(1)} = Reshape_global*(SA_F^{global})

where Q, K, and V are the Query, Key, and Value core components of the self-attention mechanism, all operations between Q, K, and V are matrix multiplications, Reshape_global(·) converts the shape of the tensor from R^{T′×F′×C} to R^{F′×(T′×C)}, Linear is a fully connected layer, W_F is the attention matrix in the frequency dimension, softmax is the activation function, Reshape_global*(·) is the reverse reshape operation, and U_F^{global,(1)} is the global self-attention output feature of the 1st frequency dimension;

where the high-dimensional features of the noisy audio are input into the second local self-attention block and the local self-attention output feature of the 1st frequency dimension is computed as:

F_local = 2N_F + 1

S_F = F′

U_seg^F = Segmentation(U^{Encoder}), U_seg^F ∈ R^{S_F×F_local×(T′×C)}

SA_F = LSA(U_seg^F)

Conv(SA_F) ∈ R^{F′×1×(T′×C)}

U_F^{local,(1)} = Reshape_local(Conv(SA_F))

where N_F is the number of adjacent frequency bins selected on each side of the current frequency bin, F_local is the width of each local segment, S_F is the number of local segments, U_seg^F is the input feature of the frequency-dimension local self-attention mechanism, SA_F is the output feature of the frequency-dimension local self-attention mechanism, Segmentation(·) divides the feature vector into local segments and recombines them into the resulting feature vector, LSA(·) performs the self-attention operation on the local segments, Reshape_local(·) converts the shape of the tensor from R^{F′×1×(T′×C)} to R^{F′×T′×C}, and U_F^{local,(1)} is the local self-attention output feature of the 1st frequency dimension;

concatenating and convolving the global self-attention output feature of the 1st frequency dimension and the local self-attention output feature of the 1st frequency dimension to obtain the 1st second feature, computed as:

U_F^{(1)} = Conv(Concat(U_F^{global,(1)}, U_F^{local,(1)})), U_F^{(1)} ∈ R^{F′×T′×C}

where U_F^{(1)} is the 1st second feature, Concat is the concatenation operation, and Conv is a two-dimensional convolution layer with kernel size (1, 1) and stride (1, 1).
According to some embodiments of the invention, fusing the first feature and the second feature to obtain the fused feature includes:

concatenating the first feature and the second feature to obtain a concatenation result;

inputting the concatenation result into a convolution block to obtain the fused feature, where the convolution block comprises a 2D convolution layer, a batch normalization layer, and a parametric rectified linear unit activation function, the convolution kernel of the 2D convolution layer is (1, 1), and the stride is (1, 1).
According to some embodiments of the invention, the decoder comprises 3 sub-pixel blocks, each of which comprises a 2D convolution layer, a batch normalization layer, and a parametric rectified linear unit activation function, with the channel numbers of the 2D convolutions set to 32, 16, and 2, respectively. Upsampling the fused feature by the decoder to obtain the noise-reduced audio complex spectrum includes:

inputting the fused feature into the sub-pixel layer and expanding its channel number C to C×r² using a convolution operation to obtain a first fused feature;

rearranging the pixels of the first fused feature using pixel shuffling to obtain the noise-reduced audio complex spectrum.
In a second aspect of the present invention, there is provided an underwater acoustic signal noise reduction system, which applies the underwater acoustic signal noise reduction method according to the first aspect of the present invention, the system comprising:

a data acquisition module for acquiring noisy audio;

a data extraction module for extracting a noisy complex spectrum from the noisy audio;

a data encoding module for downsampling the noisy complex spectrum through an encoder to obtain high-dimensional features of the noisy audio;

a feature learning module for learning a first feature of the complex spectrum in the time dimension through the first self-attention module and a second feature in the frequency dimension through the second self-attention module;

a feature fusion module for fusing the first feature and the second feature to obtain a fused feature;

a data decoding module for upsampling the fused feature through a decoder to obtain a noise-reduced audio complex spectrum;

a data output module for converting the noise-reduced audio complex spectrum into noise-reduced audio.
The system acquires noisy audio, extracts a noisy complex spectrum from it, downsamples the spectrum through an encoder to obtain high-dimensional features of the noisy audio, learns a first feature of the complex spectrum in the time dimension through a first self-attention module and a second feature in the frequency dimension through a second self-attention module, fuses the two features, upsamples the fused feature through a decoder to obtain the noise-reduced audio complex spectrum, and converts that spectrum into noise-reduced audio. Long-term and local dependencies of the features are captured from the time and frequency dimensions respectively, and the two branches exchange complementary information, which enhances the expressive capability of the module, improves the accuracy of the learned time-frequency features of the noisy complex spectrum, effectively suppresses the interference of noise signals, and improves noise reduction performance.
In a third aspect of the invention, there is provided an underwater acoustic signal noise reduction electronic device comprising at least one control processor and a memory communicatively connected to the at least one control processor; the memory stores instructions executable by the at least one control processor to enable the at least one control processor to perform the underwater acoustic signal noise reduction method described above.

In a fourth aspect of the present invention, there is provided a computer-readable storage medium storing computer-executable instructions for causing a computer to perform the above-described underwater acoustic signal noise reduction method.

It should be noted that the advantages of the second to fourth aspects of the present invention over the prior art are the same as those of the above-described underwater acoustic signal noise reduction method over the prior art, and will not be described in detail here.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the invention will become apparent and may be better understood from the following description of embodiments taken in conjunction with the accompanying drawings in which:
FIG. 1 is a schematic diagram of the time-frequency domain characteristics of ship signals and marine environmental noise;

FIG. 2 is a flow chart of an underwater acoustic signal noise reduction method according to an embodiment of the present invention;

FIG. 3 is an overall flow chart of an underwater acoustic signal noise reduction method according to an embodiment of the present invention;

FIG. 4 is a diagram of the global-local self-attention block architecture of an embodiment of the present invention;

Table 1 shows experimental comparison results between traditional statistics-based noise reduction methods and the dual-branch self-attention network model of an embodiment of the present invention;

Table 2 shows experimental comparison results between DNN-based noise reduction methods and the dual-branch self-attention network model of an embodiment of the present invention;

FIG. 5 shows the time-frequency spectrum and power spectral density of the ship target signal after noise reduction according to an embodiment of the present invention;

FIG. 6 is a comparison of ship target time-domain signals after noise reduction by different models according to an embodiment of the present invention;

FIG. 7 shows the SDRi and SI-SNRi index scores of the dual-branch self-attention network model on seen and unseen datasets according to an embodiment of the present invention;

FIG. 8 is a schematic diagram of an underwater acoustic signal noise reduction system according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention.
In the description of the present invention, descriptors such as first and second are used only to distinguish technical features and should not be understood as indicating or implying relative importance, the number of technical features indicated, or the precedence of the technical features indicated.

In the description of the present invention, it should be understood that any description of orientation or positional relationship, such as up or down, is based on the orientation or positional relationship shown in the drawings, is merely for convenience of describing the present invention and simplifying the description, and does not indicate or imply that the apparatus or element referred to must have a specific orientation or be constructed and operated in a specific orientation; it should therefore not be construed as limiting the present invention.

In the description of the present invention, unless explicitly defined otherwise, terms such as arrangement, installation, and connection should be construed broadly, and the specific meaning of such terms in the present invention can be reasonably determined by a person skilled in the art in combination with the specific content of the technical solution.
Traditional underwater acoustic noise reduction techniques rely on accurate analytical modeling of the signal and fine tuning of optimized parameters, and mainly include methods based on spectral subtraction, filtering, wavelet transforms, and signal decomposition. Although these methods reduce the influence of interfering noise on target signals to some extent, they still have limitations of varying degrees and problems to be solved. For example, spectral subtraction and filtering methods can introduce irrelevant spectral components such as musical noise after denoising, which degrades audio quality; wavelet-transform and signal-decomposition methods face problems such as optimal basis function selection, threshold setting, and mode mixing. In addition, these methods often require a large number of iterative computations, suffer from long running times and low processing efficiency, perform well only under high signal-to-noise ratios, and have limited performance under the low signal-to-noise conditions of a real marine environment. To address these problems, researchers have built noise reduction models based on deep neural networks (Deep Neural Network, DNN); relying on the strong knowledge learning and inductive reasoning capabilities of neural network models, such methods can often achieve performance superior to traditional ones. DNN-based methods typically use the short-time Fourier transform (STFT) to convert time-domain audio into a time-frequency (T-F) spectrum, and most ignore the estimation of phase information, which places an upper limit on their performance.
Noise reduction for underwater acoustic signals differs from that for airborne acoustic signals. First, researchers need to recover target signals from signals with very low signal-to-noise ratios (around -10 dB); second, the characteristic information of underwater targets such as ships is concentrated in a relatively narrow frequency band of the time-frequency features. Therefore, a network model needs to be constructed specifically around the characteristics of underwater acoustic signals. Fig. 1 shows the behavior of a ship target signal and a marine environmental noise signal in the time-frequency domain. As can be seen from the figure, the frequency band distribution of the ship target signal is concentrated and its temporal continuity is strong, while the noise signal is distributed over a wide frequency range and shows strong randomness. The network model therefore needs to learn the long-term dependence of the target signal's distribution in the global features and the time dimension, while also grasping local features in the specific frequency band range of the target signal.
To overcome the above technical drawbacks, referring to fig. 2, an embodiment of the present invention provides an underwater acoustic signal noise reduction method, including:

Step S101: acquire the noisy audio.

Step S102: extract a noisy complex spectrum from the noisy audio.

Step S103: downsample the noisy complex spectrum with an encoder to obtain high-dimensional features of the noisy audio.

Step S104: learn a first feature of the complex spectrum in the time dimension with a first self-attention module and a second feature of the complex spectrum in the frequency dimension with a second self-attention module.

Step S105: fuse the first feature and the second feature to obtain a fused feature.

Step S106: upsample the fused feature with a decoder to obtain the noise-reduced audio complex spectrum.

Step S107: convert the noise-reduced audio complex spectrum into noise-reduced audio.
The method obtains noisy audio, extracts a noisy complex spectrum from it, downsamples the spectrum through an encoder to obtain high-dimensional features of the noisy audio, learns a first feature of the complex spectrum in the time dimension through a first self-attention module and a second feature in the frequency dimension through a second self-attention module, fuses the two features, upsamples the fused feature through a decoder to obtain the noise-reduced audio complex spectrum, and converts that spectrum into noise-reduced audio. Long-term and local dependencies of the features are captured from the time and frequency dimensions respectively, and the two branches exchange complementary information, which enhances the expressive capability of the module, improves the accuracy of the learned time-frequency features of the noisy complex spectrum, effectively suppresses the interference of noise signals, and improves noise reduction performance.
In some embodiments, the encoder comprises 3 stacked convolution blocks, where each convolution block comprises a 2D convolution layer, a batch normalization layer, and a parametric rectified linear unit (PReLU) activation function. Downsampling of the noisy complex spectrum by the encoder is computed as:

Y ∈ R^{T×F×2}

U_i^{Encoder} = PReLU(BN(Conv2D(U_{i-1}^{Encoder}))), i = 1, 2, 3, with U_0^{Encoder} = Y

U^{Encoder} ∈ R^{T′×F′×C}

where Y is the noisy complex spectrum, U_i^{Encoder} is the output of the i-th convolution block, PReLU is the parametric rectified linear unit activation function, BN is the batch normalization layer, U^{Encoder} is the high-dimensional feature of the noisy audio, T is the number of time frames, F is the number of frequency bins, and C is the number of channels. The convolution kernels of the 2D convolution layers are (6, 4), the strides are (2, 2), and the channel numbers are 16, 32, and 64.
Referring to fig. 3, in some embodiments, each self-attention module includes a plurality of self-attention blocks, and step S104 may include, but is not limited to, steps S201 to S203:

Step S201: input the high-dimensional features of the noisy audio into the 1st first self-attention block and the 1st second self-attention block in parallel to obtain the 1st first feature and the 1st second feature.

Step S202: perform information interaction between the 1st first feature and the 1st second feature to obtain the interacted 1st first feature and the interacted 1st second feature, where the information interaction is computed as:

Ũ_T^{(1)} = U_T^{(1)} + Conv(U_F^{(1)})

Ũ_F^{(1)} = U_F^{(1)} + Conv(U_T^{(1)})

where Conv is a two-dimensional convolution layer with kernel size (1, 1) and stride (1, 1), U_T^{(1)} is the 1st first feature, Ũ_T^{(1)} is the interacted 1st first feature, U_F^{(1)} is the 1st second feature, and Ũ_F^{(1)} is the interacted 1st second feature.

Step S203: input the interacted 1st first feature and the interacted 1st second feature into the 2nd first self-attention block and the 2nd second self-attention block in parallel to obtain the 2nd first feature and the 2nd second feature, perform information interaction between them to obtain the interacted 2nd first feature and the interacted 2nd second feature, and so on, until the n-th first feature and the n-th second feature output by the last first self-attention block and the last second self-attention block are obtained and their information interaction yields the first feature and the second feature, where n is the number of self-attention blocks.
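The sketch below illustrates the information interaction of steps S202 and S203. The patent's interaction equations survive only as images, so the additive residual form used here (each branch adds a 1×1-convolved projection of the other branch) is an assumption consistent with the surrounding text; only the (1, 1) kernel size and stride of the convolution are stated explicitly.

```python
import torch.nn as nn

class BranchInteraction(nn.Module):
    """Exchange complementary information between the time branch and the
    frequency branch with 1x1 convolutions (additive form is an assumption)."""
    def __init__(self, channels: int):
        super().__init__()
        self.t_from_f = nn.Conv2d(channels, channels, kernel_size=1, stride=1)
        self.f_from_t = nn.Conv2d(channels, channels, kernel_size=1, stride=1)

    def forward(self, u_t, u_f):  # both branches: (B, C, T', F')
        return u_t + self.t_from_f(u_f), u_f + self.f_from_t(u_t)
```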
Referring to fig. 4, in some embodiments, each first self-attention block includes one first global self-attention block and one first local self-attention block, and step S201 may include, but is not limited to, steps S301 to S304:

Step S301: input the high-dimensional features of the noisy audio into the first global self-attention block and the first local self-attention block of the first self-attention block in parallel to obtain the global self-attention output feature of the 1st time dimension and the local self-attention output feature of the 1st time dimension.

Step S302: the high-dimensional features of the noisy audio are input into the first global self-attention block, and the global self-attention output feature of the 1st time dimension is computed as:

Q, K, V = Reshape_global(Linear(U^{Encoder})), {Q, K, V} ∈ R^{T′×(F′×C)}

W_T = softmax(Q K^⊤)

SA_T^{global} = W_T V

U_T^{global,(1)} = Reshape_global*(SA_T^{global})

where Q, K, and V are the Query, Key, and Value core components of the self-attention mechanism, all operations between Q, K, and V are matrix multiplications, Reshape_global(·) converts the shape of the tensor from R^{T′×F′×C} to R^{T′×(F′×C)}, Linear is a fully connected layer, W_T is the attention matrix in the time dimension, softmax is the activation function, Reshape_global*(·) is the reverse reshape operation, and U_T^{global,(1)} is the global self-attention output feature of the 1st time dimension.
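A PyTorch sketch of the time-dimension global self-attention of step S302: the (F′ × C) values of each time frame are flattened into one token, a fully connected layer produces Q, K, and V, and softmax attention is applied across all time frames. The 1/√d scaling is a standard convention added here as an assumption, since the patent's equations are only available as images.

```python
import torch
import torch.nn as nn

class GlobalTimeAttention(nn.Module):
    """Global self-attention along the time axis: every time frame attends
    to all frames; a frame's (F' x C) values form a single token."""
    def __init__(self, f_dim: int, channels: int):
        super().__init__()
        d = f_dim * channels
        self.qkv = nn.Linear(d, 3 * d)  # produces Q, K, V in one pass

    def forward(self, u):                                      # u: (B, C, T', F')
        b, c, t, f = u.shape
        tokens = u.permute(0, 2, 3, 1).reshape(b, t, f * c)    # Reshape_global
        q, k, v = self.qkv(tokens).chunk(3, dim=-1)
        w_t = torch.softmax(q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5), dim=-1)
        out = w_t @ v                                          # (B, T', F'*C)
        return out.reshape(b, t, f, c).permute(0, 3, 1, 2)     # Reshape_global*
```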
Step S303: the high-dimensional features of the noisy audio are input into the first local self-attention block, and the local self-attention output feature of the 1st time dimension is computed as:

T_local = 2N_T + 1

S_T = T′

U_seg^T = Segmentation(U^{Encoder}), U_seg^T ∈ R^{S_T×T_local×(F′×C)}

SA_T = LSA(U_seg^T)

Conv(SA_T) ∈ R^{T′×1×(F′×C)}

U_T^{local,(1)} = Reshape_local(Conv(SA_T))

where N_T is the number of adjacent time frames selected on each side of the current time frame, T_local is the width of each local segment, S_T is the number of local segments, U_seg^T is the input feature of the time-dimension local self-attention mechanism, SA_T is the output feature of the time-dimension local self-attention mechanism, Segmentation(·) divides the feature vector into local segments and recombines them into the resulting feature vector, LSA(·) performs the self-attention operation on the local segments, Reshape_local(·) converts the shape of the tensor from R^{T′×1×(F′×C)} to R^{T′×F′×C}, and U_T^{local,(1)} is the local self-attention output feature of the 1st time dimension.
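A sketch of the time-dimension local self-attention of step S303: every frame attends only to its N_T neighbors on each side (segment width T_local = 2·N_T + 1; N_T = 5 in the experiments below). Realizing Segmentation(·) by padding and unfolding the time axis is an implementation assumption, as is the 1/√d scaling.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalTimeAttention(nn.Module):
    """Local self-attention along time over windows of width 2*n_t + 1."""
    def __init__(self, f_dim: int, channels: int, n_t: int = 5):
        super().__init__()
        self.n_t = n_t
        d = f_dim * channels
        self.qkv = nn.Linear(d, 3 * d)

    def forward(self, u):                                     # u: (B, C, T', F')
        b, c, t, f = u.shape
        x = u.permute(0, 2, 3, 1).reshape(b, t, f * c)        # (B, T', d)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        pad = self.n_t
        # Segmentation: pad the time axis, then gather one window per frame.
        k_seg = F.pad(k, (0, 0, pad, pad)).unfold(1, 2 * pad + 1, 1)  # (B, T', d, T_local)
        v_seg = F.pad(v, (0, 0, pad, pad)).unfold(1, 2 * pad + 1, 1)
        scores = torch.einsum('btd,btdl->btl', q, k_seg) / (q.shape[-1] ** 0.5)
        w = torch.softmax(scores, dim=-1)                     # attention within each segment
        out = torch.einsum('btl,btdl->btd', w, v_seg)         # (B, T', d)
        return out.reshape(b, t, f, c).permute(0, 3, 1, 2)    # Reshape_local
```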
Step S304: concatenate and convolve the global self-attention output feature of the 1st time dimension and the local self-attention output feature of the 1st time dimension to obtain the 1st first feature, computed as:

U_T^{(1)} = Conv(Concat(U_T^{global,(1)}, U_T^{local,(1)})), U_T^{(1)} ∈ R^{T′×F′×C}

where U_T^{(1)} is the 1st first feature, Concat is the concatenation operation, and Conv is a two-dimensional convolution layer with kernel size (1, 1) and stride (1, 1).
Referring to fig. 4, in some embodiments, each second self-attention block includes one second global self-attention block and one second local self-attention block, and step S201 may include, but is not limited to, steps S401 to S404. The frequency branch mirrors the time branch, with the roles of the T′ and F′ axes exchanged:

Step S401: input the high-dimensional features of the noisy audio into the second global self-attention block and the second local self-attention block of the first second self-attention block in parallel to obtain the global self-attention output feature of the 1st frequency dimension and the local self-attention output feature of the 1st frequency dimension.

Step S402: the high-dimensional features of the noisy audio are input into the second global self-attention block, and the global self-attention output feature of the 1st frequency dimension is computed as:

Q, K, V = Reshape_global(Linear(U^{Encoder})), {Q, K, V} ∈ R^{F′×(T′×C)}

W_F = softmax(Q K^⊤)

SA_F^{global} = W_F V

U_F^{global,(1)} = Reshape_global*(SA_F^{global})

where Q, K, and V are the Query, Key, and Value core components of the self-attention mechanism, all operations between Q, K, and V are matrix multiplications, Reshape_global(·) converts the shape of the tensor from R^{T′×F′×C} to R^{F′×(T′×C)}, Linear is a fully connected layer, W_F is the attention matrix in the frequency dimension, softmax is the activation function, Reshape_global*(·) is the reverse reshape operation, and U_F^{global,(1)} is the global self-attention output feature of the 1st frequency dimension.

Step S403: the high-dimensional features of the noisy audio are input into the second local self-attention block, and the local self-attention output feature of the 1st frequency dimension is computed as:

F_local = 2N_F + 1

S_F = F′

U_seg^F = Segmentation(U^{Encoder}), U_seg^F ∈ R^{S_F×F_local×(T′×C)}

SA_F = LSA(U_seg^F)

Conv(SA_F) ∈ R^{F′×1×(T′×C)}

U_F^{local,(1)} = Reshape_local(Conv(SA_F))

where N_F is the number of adjacent frequency bins selected on each side of the current frequency bin, F_local is the width of each local segment, S_F is the number of local segments, U_seg^F is the input feature of the frequency-dimension local self-attention mechanism, SA_F is the output feature of the frequency-dimension local self-attention mechanism, Segmentation(·) divides the feature vector into local segments and recombines them into the resulting feature vector, LSA(·) performs the self-attention operation on the local segments, Reshape_local(·) converts the shape of the tensor from R^{F′×1×(T′×C)} to R^{F′×T′×C}, and U_F^{local,(1)} is the local self-attention output feature of the 1st frequency dimension.

Step S404: concatenate and convolve the global self-attention output feature of the 1st frequency dimension and the local self-attention output feature of the 1st frequency dimension to obtain the 1st second feature, computed as:

U_F^{(1)} = Conv(Concat(U_F^{global,(1)}, U_F^{local,(1)})), U_F^{(1)} ∈ R^{F′×T′×C}

where U_F^{(1)} is the 1st second feature, Concat is the concatenation operation, and Conv is a two-dimensional convolution layer with kernel size (1, 1) and stride (1, 1).
The invention uses the global and local self-attention layers of the global-local self-attention block to capture the long-term and local dependencies of the features from the time dimension and the frequency dimension respectively, and the two branches exchange complementary information, thereby enhancing the expressive capability of the model.
In some embodiments, step S105 may include, but is not limited to, steps S501 to S502:

Step S501: concatenate the first feature and the second feature to obtain a concatenation result.

Step S502: input the concatenation result into a convolution block to obtain the fused feature; the convolution block comprises a 2D convolution layer, a batch normalization layer, and a parametric rectified linear unit activation function, where the convolution kernel of the 2D convolution layer is (1, 1) and the stride is (1, 1).
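A minimal PyTorch sketch of this fusion step. The module name and the channel bookkeeping (concatenation doubles the channel count, which the 1×1 convolution block reduces back to C) are illustrative assumptions consistent with the description above.

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Concatenate the two branch outputs along the channel axis, then fuse
    them with a 1x1 convolution block (Conv2d -> BatchNorm -> PReLU)."""
    def __init__(self, channels: int):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=(1, 1), stride=(1, 1)),
            nn.BatchNorm2d(channels),
            nn.PReLU(),
        )

    def forward(self, u_t, u_f):  # both: (B, C, T', F')
        return self.fuse(torch.cat([u_t, u_f], dim=1))
```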
In some embodiments, the decoder comprises 3 sub-pixel blocks, each of which comprises a 2D convolution layer, a batch normalization layer, and a parametric rectified linear unit activation function, with the channel numbers of the 2D convolutions set to 32, 16, and 2, respectively. Step S106 may include, but is not limited to, steps S601 to S602:

Step S601: input the fused feature into the sub-pixel layer and expand its channel number C to C×r² using a convolution operation to obtain a first fused feature.

Step S602: rearrange the pixels of the first fused feature using pixel shuffling to obtain the noise-reduced audio complex spectrum.
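A sketch of one sub-pixel upsampling block: a convolution expands the channels from C to C×r², and pixel shuffling rearranges the extra channels into an r-times larger feature map. The kernel size (3) and upscale factor r = 2 are assumptions chosen to mirror the encoder's stride-2 downsampling; the patent states only the output channel numbers 32, 16, and 2.

```python
import torch.nn as nn

class SubPixelBlock(nn.Module):
    """Channel-expanding convolution followed by pixel shuffle (steps S601-S602)."""
    def __init__(self, c_in: int, c_out: int, r: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(c_in, c_out * r * r, kernel_size=3, padding=1),  # C -> C*r^2
            nn.PixelShuffle(r),  # (B, C*r^2, H, W) -> (B, C, H*r, W*r)
            nn.BatchNorm2d(c_out),
            nn.PReLU(),
        )

    def forward(self, x):
        return self.net(x)

# Decoder with output channel numbers 32, 16 and 2, mirroring the encoder.
decoder = nn.Sequential(SubPixelBlock(64, 32), SubPixelBlock(32, 16), SubPixelBlock(16, 2))
```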
The experimental validation process and results are provided below:
1. Experimental data:
Experimental validation is carried out on the ShipsEar dataset, an underwater ship noise database collected by hydrophones deployed in shallow sea, which contains 11 classes of high-definition ship engine noise and marine environmental noise. Two ship classes with sufficient sample sizes (motorboats and passenger vessels) and 3 classes of marine environmental noise (precipitation, ocean currents, and ocean surface wind) are used to construct the experimental datasets D-I and D-II:
Dataset 1: the D-I dataset contains 43 segments of ship audio data (motorboats and passenger vessels) and 3 marine environmental noises. All audio files are downsampled to 16 kHz and cut into segments of about 3 s (49152 sampling points). Each short segment of ship noise is mixed with a random noise segment, with a signal-to-noise ratio drawn at random from the range [-15 dB, -5 dB], producing mixed audio data with low signal-to-noise ratios; 73557 audio segments are generated in total.
To verify the noise reduction performance of the model in different scenarios, three validation scenarios are designed: (1) seen ship: the ship signals of the validation set come from the same ships as the training set; (2) unseen ship: the ship signals of the validation set do not appear in the training set, but the ship types are the same; (3) unseen ship type: the ship types of the validation set (Dredger and Fishing vessel) differ from those of the training set. In addition, the mixed signals of the validation set contain three different signal-to-noise ratios: -5 dB, -10 dB, and -15 dB.
Dataset 2: the D-II dataset generates a new noise dataset by superimposing the three classes of noise data, and noisy samples are then generated from the mixed noise and the ship data; the other settings are the same as for D-I.
2. Experiment setting:
All audio signals are converted into the time-frequency domain by the short-time Fourier transform, and the complex-valued spectral features are used as the input of the network model, with a window length of 512 sampling points and a hop size of 256 sampling points. In the local self-attention model, the number of adjacent frames N is set to 5. The model is trained with the Adam optimizer for up to 100 epochs with an initial learning rate of 1e-4; if the evaluation metric on the training set does not improve for two consecutive epochs, the learning rate is decayed to 0.1 of its value, and if it does not improve for 10 consecutive epochs, training is stopped. The batch size is set to 8.
A time-domain loss function, the scale-invariant signal-to-noise ratio (SI-SNR), is selected for training the model. SI-SNR is defined as:

x_target = (⟨x̂, x⟩ / ‖x‖²₂) · x

e_noise = x̂ − x_target

SI-SNR = 10 log₁₀(‖x_target‖²₂ / ‖e_noise‖²₂)

where x and x̂ are the clean and estimated time-domain audio signals respectively, ⟨·,·⟩ is the dot product between two vectors, and ‖·‖₂ is the L2 norm.
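A compact PyTorch implementation of this loss; its negation is minimized during training. Zero-meaning both signals before the projection is a common convention in SI-SNR implementations and is an assumption here.

```python
import torch

def si_snr(estimate: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SNR in dB for (B, L) waveforms; use -si_snr(...) as the loss."""
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)  # zero-mean (assumed convention)
    target = target - target.mean(dim=-1, keepdim=True)
    dot = torch.sum(estimate * target, dim=-1, keepdim=True)
    x_target = dot * target / (target.pow(2).sum(dim=-1, keepdim=True) + eps)
    e_noise = estimate - x_target
    ratio = x_target.pow(2).sum(dim=-1) / (e_noise.pow(2).sum(dim=-1) + eps)
    return 10 * torch.log10(ratio + eps)
```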
In the experiments, 4 evaluation metrics are selected to evaluate model performance: SDR, SDRi, SI-SNRi, and SSNR; for all of them, larger values indicate better performance.
3. Experimental results:
First, comparison with traditional methods:
The baseline models for traditional noise reduction selected in the experiments are spectral subtraction, Wiener filtering, a wavelet-transform-based method, and an empirical-mode-decomposition-based method. The experimental results are shown in Table 1; the evaluation covers three low signal-to-noise scenarios (-5 dB, -10 dB, -15 dB) on the two datasets. As Table 1 shows, the proposed DBSA-Net achieves the best performance in all comparison scenarios, far exceeding the traditional methods: it achieves an average SDR improvement of 14.41 dB and an SSNR improvement of 13.12 dB on dataset D-I, and an average SDR improvement of 11.19 dB and an SSNR improvement of 10.02 dB on dataset D-II. Compared with deep-learning-based methods, traditional noise reduction methods perform poorly in low signal-to-noise underwater acoustic environments; although Wiener filtering achieves the best results among the traditional methods, it still lags far behind DBSA-Net.
Second, comparison with noise reduction methods based on deep neural networks:
DBSA-Net is further compared with existing deep-learning-based noise reduction models. The baseline models include: (1) SEGAN: a time-domain signal noise reduction method based on a generative adversarial network; (2) DCCRN: a complex-spectrum-based network model that captures global feature dependencies with an LSTM; (3) FullSubNet: a complex-spectrum-based network model that extracts global and local feature dependencies with a full-band model and a sub-band model respectively; (4) DBSA-Global: a derived model of DBSA-Net that uses only the global attention pattern; (5) DBSA-Local: a derived model of DBSA-Net that uses only the local attention pattern. All models are trained on the same data for 100 epochs.
Referring to Table 2, Table 2 shows the experimental comparison between DNN-based noise reduction methods and the proposed DBSA-Net model on datasets D-I and D-II. DBSA-Net achieves the best performance on all metrics on the D-I dataset, with an SDR improvement of 14.41 dB, an SSNR score of 13.12 dB, and an SI-SNRi score of 14.51 dB. On dataset D-II, the FullSubNet method achieves the best performance on the SDR and SDRi metrics but scores lower on SI-SNRi. A likely reason is that this method removes part of the target signal during noise reduction, causing target signal distortion that affects the SI-SNRi metric. To verify this hypothesis, the denoised ship signals are compared in the time domain and the frequency domain respectively.
Referring to fig. 5, the ship target signal contains 5 major frequency components: 140.5 Hz, 169 Hz, 92 Hz, 1068 Hz and 2031 Hz, and the energy of these bands gradually decreases as frequency increases. The signals processed by the SEGAN, DCCRN and FullSubNet methods lose the high-frequency band components, distorting the original target signal. Comparing the DBSA-Global and DBSA-Local models, the latter can capture fine-grained changes in the target signal and recovers the lower-energy high-frequency signal components while effectively eliminating noise. The proposed DBSA-Net combines the advantages of the local and global models and therefore achieves the best performance on all metrics.
Referring to fig. 6, fig. 6 further compares the results of different models on the ship signal in the time domain after noise reduction, where the lines in the figure represent the clean target signal, the noisy acoustic signal, and the signals denoised by the different models. Comparing the sampling points inside the box, the signals denoised by the DBSA-Net and DBSA-Local models match the original target signal most closely, far better than the other models. DBSA-Net not only uses its branch structure to extract features from the time and frequency dimensions respectively, but also uses the GL-SA block to capture the long-term dependencies and fine-grained local dependencies of the features from global and local perspectives; this multi-module fusion ensures the model's noise reduction capability under low-SNR conditions.
Referring to fig. 7, to further verify the generalization of DBSA-Net, the invention verifies model performance on two unseen data sets; fig. 7 shows the SDRi and SI-SNRi scores of DBSA-Net on the seen and the two unseen data sets. For the unseen-ship test scenario, the model shows no decrease in SDRi score on either the D-I or D-II data set, and only a 0.135 dB decrease in the SI-SNRi metric. This is because ships of the same type have similar time-frequency spectrum characteristics, so the trained model can effectively suppress the interference of noise signals and extract the time-frequency spectrum characteristics of ships of that type. For the unseen-ship-type test scenario, the model drops by 2.24% and 4.68% on the two evaluation metrics respectively, because different types of ships exhibit different time-frequency spectrum characteristics and the model pays insufficient attention to the characteristic frequency bands of ship types that did not appear in training, which affects the noise reduction effect. Even so, the performance of DBSA-Net on both data sets still exceeds the evaluation scores that most models in Table 2 achieve on the seen data sets.
In addition, referring to fig. 8, the present invention further provides an underwater sound signal noise reduction system, which includes a data acquisition module 1100, a data extraction module 1200, a data encoding module 1300, a feature learning module 1400, a feature fusion module 1500, a data decoding module 1600, and a data output module 1700, wherein:
The data acquisition module 1100 is configured to acquire noisy audio.
The data extraction module 1200 is configured to extract a noisy-audio complex spectrum from the noisy audio.
The data encoding module 1300 is configured to downsample the noisy-audio complex spectrum with an encoder to obtain the high-dimensional feature of the noisy audio.
The feature learning module 1400 is configured to learn a first feature of the complex spectrum in a time dimension by the first self-attention module and learn a second feature of the complex spectrum in a frequency dimension by the second self-attention module.
The feature fusion module 1500 is configured to fuse the first feature and the second feature to obtain a fused feature.
The data decoding module 1600 is configured to upsample the fused feature with a decoder to obtain a noise-reduced audio complex spectrum.
The data output module 1700 is configured to convert the noise-reduced audio complex spectrum into noise-reduced audio.
The system acquires noisy audio, extracts a noisy-audio complex spectrum from it, downsamples the complex spectrum through an encoder to obtain high-dimensional features, learns a first feature of the complex spectrum in the time dimension through a first self-attention module and a second feature in the frequency dimension through a second self-attention module, fuses the two features to obtain a fused feature, upsamples the fused feature through a decoder to obtain a noise-reduced audio complex spectrum, and converts that spectrum into noise-reduced audio. In this way the long-term dependencies and local dependencies of the features are captured from both the time and frequency dimensions, and the two branches exchange complementary information through information interaction, which strengthens the expressive power of the module, improves the accuracy of the extracted time-frequency spectrum characteristics, effectively suppresses the interference of noise signals, and improves noise reduction performance.
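A structural sketch of this data flow follows, with the two self-attention branches reduced to identity placeholders; the layer shapes and names are assumptions for illustration only, not the disclosed architecture.

```python
import torch
from torch import nn

class DenoisePipeline(nn.Module):
    """Encoder -> parallel time/frequency branches -> fusion -> decoder."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(2, 64, (6, 4), stride=(2, 2)),
                                     nn.BatchNorm2d(64), nn.PReLU())
        self.time_branch = nn.Identity()  # placeholder: first self-attention module
        self.freq_branch = nn.Identity()  # placeholder: second self-attention module
        self.fuse = nn.Conv2d(128, 64, kernel_size=1)
        self.decoder = nn.ConvTranspose2d(64, 2, (6, 4), stride=(2, 2))

    def forward(self, noisy_spec: torch.Tensor) -> torch.Tensor:
        u = self.encoder(noisy_spec)               # high-dimensional features
        f_t, f_f = self.time_branch(u), self.freq_branch(u)
        fused = self.fuse(torch.cat([f_t, f_f], dim=1))
        return self.decoder(fused)                 # denoised complex spectrum

x = torch.randn(1, 2, 130, 130)                    # (batch, real/imag, T, F)
print(DenoisePipeline()(x).shape)                  # torch.Size([1, 2, 130, 130])
```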
It should be noted that, the method embodiment and the system embodiment described above are based on the same inventive concept, so that the relevant content of the method embodiment described above is also applicable to the system embodiment, and will not be repeated here.
The application also provides an underwater acoustic signal noise reduction electronic device, comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the underwater acoustic signal noise reduction method described above.
The processor and the memory may be connected by a bus or other means.
The memory, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory remotely located relative to the processor, the remote memory being connectable to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The non-transitory software program and instructions required to implement the underwater sound signal noise reduction method of the above-described embodiment are stored in the memory, and when executed by the processor, the underwater sound signal noise reduction method of the above-described embodiment is performed, for example, the method steps S101 to S107 in fig. 2 described above are performed.
The present application also provides a computer-readable storage medium storing computer-executable instructions for performing the underwater acoustic signal noise reduction method described above.
The computer-readable storage medium stores computer-executable instructions that, when executed by a processor or controller, for example by a processor in the above-described electronic device embodiment, cause the processor to perform the underwater sound signal noise reduction method in the above-described embodiment, for example method steps S101 to S107 in fig. 2.
Those of ordinary skill in the art will appreciate that all or some of the steps, systems, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program elements or other data, as known to those skilled in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically embodies computer readable instructions, data structures, program elements or other data in a modulated data signal such as a carrier wave or other transport mechanism and may include any information delivery media.
The embodiments of the present invention have been described in detail with reference to the accompanying drawings, but the present invention is not limited to the above embodiments, and various changes can be made within the knowledge of one of ordinary skill in the art without departing from the spirit of the present invention.

Claims (9)

1. A method of noise reduction of an underwater acoustic signal, the method comprising:
acquiring noisy audio;
extracting a noisy-audio complex spectrum from the noisy audio;
downsampling the noisy-audio complex spectrum by an encoder to obtain a high-dimensional feature of the noisy audio;
learning a first feature of the complex spectrum in the time dimension by a first self-attention module and a second feature of the complex spectrum in the frequency dimension by a second self-attention module, wherein each self-attention module comprises a plurality of self-attention blocks, specifically:
inputting the high-dimensional feature of the noisy audio into the 1st first self-attention block and the 1st second self-attention block in parallel to obtain the 1st first feature and the 1st second feature;
performing information interaction on the 1st first feature and the 1st second feature to obtain the interacted 1st first feature and the interacted 1st second feature, wherein the calculation formulas for the information interaction between the 1st first feature and the 1st second feature are:

$$\tilde{U}_T^{(1)} = U_T^{(1)} + \mathrm{Conv}\big(U_F^{(1)}\big), \qquad \tilde{U}_F^{(1)} = U_F^{(1)} + \mathrm{Conv}\big(U_T^{(1)}\big)$$

wherein Conv is a two-dimensional convolution layer with a convolution kernel size of (1, 1) and a step size of (1, 1), $U_T^{(1)}$ is the 1st first feature, $\tilde{U}_T^{(1)}$ is the interacted 1st first feature, $U_F^{(1)}$ is the 1st second feature, and $\tilde{U}_F^{(1)}$ is the interacted 1st second feature;
inputting the interacted 1st first feature and the interacted 1st second feature into the 2nd first self-attention block and the 2nd second self-attention block in parallel to obtain the 2nd first feature and the 2nd second feature, performing information interaction on the 2nd first feature and the 2nd second feature to obtain the interacted 2nd first feature and the interacted 2nd second feature, and so on, until the nth first feature and the nth second feature output by the last first self-attention block and the last second self-attention block are obtained, and performing information interaction on the nth first feature and the nth second feature to obtain the first feature and the second feature, wherein n is the number of self-attention blocks;
fusing the first feature and the second feature to obtain a fused feature;
upsampling the fused feature by a decoder to obtain a noise-reduced audio complex spectrum; and
converting the noise-reduced audio complex spectrum into noise-reduced audio.
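A sketch of the cross-branch information interaction recited above, under the assumption (per the reconstructed formulas) that each branch adds a (1, 1) convolution of the other branch's feature:

```python
import torch
from torch import nn

class BranchInteraction(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        self.conv_t = nn.Conv2d(channels, channels, kernel_size=1, stride=1)
        self.conv_f = nn.Conv2d(channels, channels, kernel_size=1, stride=1)

    def forward(self, u_time: torch.Tensor, u_freq: torch.Tensor):
        # Each branch receives complementary information from the other branch.
        return u_time + self.conv_t(u_freq), u_freq + self.conv_f(u_time)

u_t, u_f = torch.randn(1, 64, 16, 16), torch.randn(1, 64, 16, 16)
print([t.shape for t in BranchInteraction()(u_t, u_f)])
```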
2. The method of claim 1, wherein the encoder comprises 3 stacked convolution blocks, each convolution block comprising a 2D convolution layer, a batch normalization layer, and a parametric rectified linear unit (PReLU) activation function; the calculation formulas for downsampling the noisy-audio complex spectrum by the encoder are:

$$Y \in \mathbb{R}^{T \times F \times 2}$$

$$U^{(i)} = \mathrm{PReLU}\big(\mathrm{BN}\big(\mathrm{Conv2D}(U^{(i-1)})\big)\big), \quad i = 1, 2, 3, \quad U^{(0)} = Y$$

$$U_{Encoder} = U^{(3)} \in \mathbb{R}^{T' \times F' \times C}$$

wherein $Y$ is the noisy complex spectrum, $U^{(i)}$ is the output of the $i$th convolution block, PReLU is the parametric rectified linear unit activation function, BN is the batch normalization layer, $U_{Encoder}$ is the high-dimensional feature of the noisy audio, $T$ is the number of time frames, $F$ is the number of frequency bins, and $C$ is the number of channels; the convolution kernels of the 2D convolution layers are (6, 4), the step sizes are (2, 2), and the channel numbers are 16, 32 and 64.
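A sketch of the claim-2 encoder: three stacked Conv2d + BatchNorm + PReLU blocks with kernel (6, 4), stride (2, 2) and channels 16/32/64; the padding values are assumptions, since the claim does not specify them.

```python
import torch
from torch import nn

def make_encoder() -> nn.Sequential:
    blocks, in_ch = [], 2  # input: real/imaginary parts of the complex spectrum
    for out_ch in (16, 32, 64):
        blocks += [nn.Conv2d(in_ch, out_ch, kernel_size=(6, 4),
                             stride=(2, 2), padding=(2, 1)),
                   nn.BatchNorm2d(out_ch), nn.PReLU()]
        in_ch = out_ch
    return nn.Sequential(*blocks)

spec = torch.randn(1, 2, 256, 257)   # (batch, 2, T, F) toy input
print(make_encoder()(spec).shape)    # downsampled high-dimensional feature
```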
3. The method of claim 2, wherein each first self-attention block comprises a first global self-attention block and a first local self-attention block, and inputting the high-dimensional feature of the noisy audio into the 1st first self-attention block to obtain the 1st first feature comprises:
inputting the high-dimensional feature of the noisy audio into the first global self-attention block and the first local self-attention block of the 1st first self-attention block in parallel to obtain the global self-attention output feature of the 1st time dimension and the local self-attention output feature of the 1st time dimension;
wherein the calculation formulas for inputting the high-dimensional feature of the noisy audio into the first global self-attention block of the 1st first self-attention block to obtain the global self-attention output feature of the 1st time dimension are:
$$Q, K, V = \mathrm{Reshape}_{global}\big(\mathrm{Linear}(U_{Encoder})\big), \quad \{Q, K, V\} \in \mathbb{R}^{T' \times (F' \times C)}$$

$$W_T = \mathrm{Softmax}\!\left(\frac{Q K^{\mathsf{T}}}{\sqrt{F' \times C}}\right)$$

$$U_T^{global} = \mathrm{Reshape}_{global^*}\big(W_T V\big)$$

wherein $Q$, $K$ and $V$ are the Query, Key and Value core components of the self-attention mechanism, all operations between $Q$, $K$ and $V$ being matrix multiplications; $\mathrm{Reshape}_{global}(\cdot)$ converts the shape of the tensor from $\mathbb{R}^{T' \times F' \times C}$ to $\mathbb{R}^{T' \times (F' \times C)}$; Linear is a fully connected layer; $W_T$ is the attention matrix in the time dimension; Softmax is the activation function; $\mathrm{Reshape}_{global^*}(\cdot)$ is the reverse reshaping operation; and $U_T^{global}$ is the global self-attention output feature of the 1st time dimension;
the calculation formulas for inputting the high-dimensional feature of the noisy audio into the first local self-attention block of the 1st first self-attention block to obtain the local self-attention output feature of the 1st time dimension are:

$$T_{local} = 2N_T + 1, \qquad S_T = T'$$

$$U_{seg} = \mathrm{Segmentation}(U_{Encoder}), \quad U_{seg} \in \mathbb{R}^{S_T \times T_{local} \times (F' \times C)}$$

$$SA_T = \mathrm{LSA}(U_{seg})$$

$$\mathrm{Conv}(SA_T) \in \mathbb{R}^{T' \times 1 \times (F' \times C)}$$

$$U_T^{local} = \mathrm{Reshape}_{local}\big(\mathrm{Conv}(SA_T)\big)$$

wherein $N_T$ is the number of adjacent time frames selected on each side of the current time frame, $T_{local}$ is the width of each local segment, $S_T$ is the number of local segments, $U_{seg}$ is the input feature of the local self-attention mechanism in the time dimension, $SA_T$ is the output feature of the self-attention mechanism in the time dimension, $\mathrm{Segmentation}(\cdot)$ divides the feature vector into local segments and recombines them into the resulting feature vector, $\mathrm{LSA}(\cdot)$ performs the self-attention operation on the local segments, $\mathrm{Reshape}_{local}(\cdot)$ converts the shape of the tensor from $\mathbb{R}^{T' \times 1 \times (F' \times C)}$ to $\mathbb{R}^{T' \times F' \times C}$, and $U_T^{local}$ is the local self-attention output feature of the 1st time dimension;
linking and convolving the global self-attention output feature of the 1st time dimension and the local self-attention output feature of the 1st time dimension to obtain the 1st first feature, according to:

$$U_T^{(1)} = \mathrm{Conv}\big(\mathrm{Concat}(U_T^{global}, U_T^{local})\big) + U_{Res}, \quad U_{Res} \in \mathbb{R}^{T' \times F' \times C}$$

wherein $U_T^{(1)}$ is the 1st first feature, $U_{Res}$ is the residual input feature, Concat is the linking operation, and Conv is a two-dimensional convolution layer with a convolution kernel size of (1, 1) and a step size of (1, 1).
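A sketch of the two attention patterns in the time dimension under stated assumptions: the Linear projection is omitted, attention is single-head, and the claim's Conv reduction over each local segment is replaced by taking the centre frame of each segment.

```python
import torch

def global_time_attention(u: torch.Tensor) -> torch.Tensor:
    # u: (T', F', C); attend across all T' frames at once.
    T, F, C = u.shape
    qkv = u.reshape(T, F * C)                                  # Reshape_global
    w = torch.softmax(qkv @ qkv.T / (F * C) ** 0.5, dim=-1)    # (T', T')
    return (w @ qkv).reshape(T, F, C)                          # Reshape_global*

def local_time_attention(u: torch.Tensor, n_t: int = 5) -> torch.Tensor:
    # Each frame attends only to N_T neighbours per side (T_local = 2*N_T + 1).
    T, F, C = u.shape
    x = u.reshape(T, F * C)
    pad = torch.nn.functional.pad(x, (0, 0, n_t, n_t))         # pad the time axis
    segs = pad.unfold(0, 2 * n_t + 1, 1).transpose(1, 2)       # (T', T_local, F'*C)
    w = torch.softmax(segs @ segs.transpose(1, 2)
                      / segs.shape[-1] ** 0.5, dim=-1)
    return (w @ segs)[:, n_t].reshape(T, F, C)                 # centre frame

u = torch.randn(40, 8, 4)
print(global_time_attention(u).shape, local_time_attention(u, 2).shape)
```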
4. The method of claim 3, wherein each second self-attention block comprises a second global self-attention block and a second local self-attention block, and inputting the high-dimensional feature of the noisy audio into the 1st second self-attention block to obtain the 1st second feature comprises:
inputting the high-dimensional feature of the noisy audio into the second global self-attention block and the second local self-attention block of the 1st second self-attention block in parallel to obtain the global self-attention output feature of the 1st frequency dimension and the local self-attention output feature of the 1st frequency dimension;
wherein the calculation formulas for inputting the high-dimensional feature of the noisy audio into the second global self-attention block of the 1st second self-attention block to obtain the global self-attention output feature of the 1st frequency dimension are:
$$Q, K, V = \mathrm{Reshape}_{global}\big(\mathrm{Linear}(U_{Encoder})\big), \quad \{Q, K, V\} \in \mathbb{R}^{F' \times (T' \times C)}$$

$$W_F = \mathrm{Softmax}\!\left(\frac{Q K^{\mathsf{T}}}{\sqrt{T' \times C}}\right)$$

$$U_F^{global} = \mathrm{Reshape}_{global^*}\big(W_F V\big)$$

wherein $Q$, $K$ and $V$ are the Query, Key and Value core components of the self-attention mechanism, all operations between $Q$, $K$ and $V$ being matrix multiplications; $\mathrm{Reshape}_{global}(\cdot)$ converts the shape of the tensor from $\mathbb{R}^{T' \times F' \times C}$ to $\mathbb{R}^{F' \times (T' \times C)}$; Linear is a fully connected layer; $W_F$ is the attention matrix in the frequency dimension; Softmax is the activation function; $\mathrm{Reshape}_{global^*}(\cdot)$ is the reverse reshaping operation; and $U_F^{global}$ is the global self-attention output feature of the 1st frequency dimension;
the calculation formulas for inputting the high-dimensional feature of the noisy audio into the second local self-attention block of the 1st second self-attention block to obtain the local self-attention output feature of the 1st frequency dimension are:

$$F_{local} = 2N_F + 1, \qquad S_F = F'$$

$$U_{seg} = \mathrm{Segmentation}(U_{Encoder}), \quad U_{seg} \in \mathbb{R}^{S_F \times F_{local} \times (T' \times C)}$$

$$SA_F = \mathrm{LSA}(U_{seg})$$

$$\mathrm{Conv}(SA_F) \in \mathbb{R}^{F' \times 1 \times (T' \times C)}$$

$$U_F^{local} = \mathrm{Reshape}_{local}\big(\mathrm{Conv}(SA_F)\big)$$

wherein $N_F$ is the number of adjacent frequency bins selected on each side of the current frequency bin, $F_{local}$ is the width of each local segment, $S_F$ is the number of local segments, $U_{seg}$ is the input feature of the local self-attention mechanism in the frequency dimension, $SA_F$ is the output feature of the self-attention mechanism in the frequency dimension, $\mathrm{Segmentation}(\cdot)$ divides the feature vector into local segments and recombines them into the resulting feature vector, $\mathrm{LSA}(\cdot)$ performs the self-attention operation on the local segments, $\mathrm{Reshape}_{local}(\cdot)$ converts the shape of the tensor from $\mathbb{R}^{F' \times 1 \times (T' \times C)}$ to $\mathbb{R}^{F' \times T' \times C}$, and $U_F^{local}$ is the local self-attention output feature of the 1st frequency dimension;
linking and convolving the global self-attention output feature of the 1st frequency dimension and the local self-attention output feature of the 1st frequency dimension to obtain the 1st second feature, according to:

$$U_F^{(1)} = \mathrm{Conv}\big(\mathrm{Concat}(U_F^{global}, U_F^{local})\big) + U_{Res}, \quad U_{Res} \in \mathbb{R}^{F' \times T' \times C}$$

wherein $U_F^{(1)}$ is the 1st second feature, $U_{Res}$ is the residual input feature, Concat is the linking operation, and Conv is a two-dimensional convolution layer with a convolution kernel size of (1, 1) and a step size of (1, 1).
5. The method of claim 4, wherein fusing the first feature and the second feature to obtain the fused feature comprises:
linking the first feature and the second feature to obtain a linking result; and
inputting the linking result into a convolution block to obtain the fused feature, wherein the convolution block comprises a 2D convolution layer, a batch normalization layer and a parametric rectified linear unit activation function, the convolution kernel of the 2D convolution layer being (1, 1) and the step size being (1, 1).
6. The method of claim 5, wherein the decoder comprises 3 sub-pixel blocks, each sub-pixel block comprising a 2D convolution layer, a batch normalization layer and a parametric rectified linear unit activation function, the channel numbers of the 2D convolutions being set to 32, 16 and 2 respectively; and upsampling the fused feature by the decoder to obtain the noise-reduced audio complex spectrum comprises:
inputting the fused feature into a sub-pixel layer, and expanding the channel number $C$ of the fused feature to $C \times r^2$ by a convolution operation to obtain a first fused feature; and
rearranging the pixels of the first fused feature by pixel shuffling to obtain the noise-reduced audio complex spectrum.
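A sketch of one sub-pixel block as recited: a convolution expands the channel count C to C·r², then pixel shuffling rearranges pixels to upsample by r; the kernel size and r = 2 are assumptions for illustration.

```python
import torch
from torch import nn

def subpixel_block(in_ch: int, out_ch: int, r: int = 2) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch * r * r, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch * r * r),
        nn.PReLU(),
        nn.PixelShuffle(r),   # (B, C*r^2, H, W) -> (B, C, H*r, W*r)
    )

x = torch.randn(1, 64, 32, 32)
print(subpixel_block(64, 32)(x).shape)   # torch.Size([1, 32, 64, 64])
```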
7. A system for reducing noise in an underwater acoustic signal, the system comprising:
a data acquisition module, configured to acquire noisy audio;
a data extraction module, configured to extract a noisy-audio complex spectrum from the noisy audio;
a data encoding module, configured to downsample the noisy-audio complex spectrum by an encoder to obtain the high-dimensional feature of the noisy audio;
a feature learning module, configured to learn a first feature of the complex spectrum in the time dimension through a first self-attention module and a second feature of the complex spectrum in the frequency dimension through a second self-attention module, wherein each self-attention module comprises a plurality of self-attention blocks, specifically:
inputting the high-dimensional feature of the noisy audio into the 1st first self-attention block and the 1st second self-attention block in parallel to obtain the 1st first feature and the 1st second feature;
performing information interaction on the 1st first feature and the 1st second feature to obtain the interacted 1st first feature and the interacted 1st second feature, wherein the calculation formulas for the information interaction between the 1st first feature and the 1st second feature are:

$$\tilde{U}_T^{(1)} = U_T^{(1)} + \mathrm{Conv}\big(U_F^{(1)}\big), \qquad \tilde{U}_F^{(1)} = U_F^{(1)} + \mathrm{Conv}\big(U_T^{(1)}\big)$$

wherein Conv is a two-dimensional convolution layer with a convolution kernel size of (1, 1) and a step size of (1, 1), $U_T^{(1)}$ is the 1st first feature, $\tilde{U}_T^{(1)}$ is the interacted 1st first feature, $U_F^{(1)}$ is the 1st second feature, and $\tilde{U}_F^{(1)}$ is the interacted 1st second feature;
inputting the interacted 1st first feature and the interacted 1st second feature into the 2nd first self-attention block and the 2nd second self-attention block in parallel to obtain the 2nd first feature and the 2nd second feature, performing information interaction on the 2nd first feature and the 2nd second feature to obtain the interacted 2nd first feature and the interacted 2nd second feature, and so on, until the nth first feature and the nth second feature output by the last first self-attention block and the last second self-attention block are obtained, and performing information interaction on the nth first feature and the nth second feature to obtain the first feature and the second feature, wherein n is the number of self-attention blocks;
a feature fusion module, configured to fuse the first feature and the second feature to obtain a fused feature;
a data decoding module, configured to upsample the fused feature by a decoder to obtain a noise-reduced audio complex spectrum; and
a data output module, configured to convert the noise-reduced audio complex spectrum into noise-reduced audio.
8. An underwater acoustic signal noise reduction device, comprising at least one control processor and a memory communicatively connected to the at least one control processor, wherein the memory stores instructions executable by the at least one control processor, the instructions being executed by the at least one control processor to enable the at least one control processor to perform the underwater acoustic signal noise reduction method of any one of claims 1 to 6.
9. A computer-readable storage medium, characterized by: the computer-readable storage medium stores computer-executable instructions for causing a computer to perform a method of reducing noise of an underwater sound signal as claimed in any one of claims 1 to 6.
CN202210868441.8A 2022-07-22 2022-07-22 Underwater sound signal noise reduction method, system, equipment and storage medium Active CN115359771B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202210868441.8A | 2022-07-22 | 2022-07-22 | Underwater sound signal noise reduction method, system, equipment and storage medium


Publications (2)

Publication Number | Publication Date
CN115359771A (en) | 2022-11-18
CN115359771B (en) | 2023-07-07

Family

ID=84031531

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202210868441.8A (Active) | Underwater sound signal noise reduction method, system, equipment and storage medium | 2022-07-22 | 2022-07-22

Country Status (1)

Country Link
CN (1) CN115359771B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115630326B (en) * 2022-12-19 2023-05-16 广州海洋地质调查局三亚南海地质研究所 Method and device for monitoring health state of marine ecosystem by hydrophone
CN115691541B (en) * 2022-12-27 2023-03-21 深圳元象信息科技有限公司 Voice separation method, device and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10672414B2 (en) * 2018-04-13 2020-06-02 Microsoft Technology Licensing, Llc Systems, methods, and computer-readable media for improved real-time audio processing
CN113870888A (en) * 2021-09-24 2021-12-31 武汉大学 Feature extraction method and device based on time domain and frequency domain of voice signal, and echo cancellation method and device



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant