CN115331691A - Pickup method and device for unmanned aerial vehicle, unmanned aerial vehicle and computer readable storage medium - Google Patents


Info

Publication number
CN115331691A
CN115331691A (application CN202211250290.6A)
Authority
CN
China
Prior art keywords
layer
sound signal
sampling
module
sampling module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211250290.6A
Other languages
Chinese (zh)
Inventor
陈翔
吕继先
雷文彬
廖科文
Current Assignee
Guangzhou Chengzhi Intelligent Machine Technology Co ltd
Original Assignee
Guangzhou Chengzhi Intelligent Machine Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Guangzhou Chengzhi Intelligent Machine Technology Co ltd
Priority to CN202211250290.6A
Publication of CN115331691A
Legal status: Pending

Classifications

    • G PHYSICS
      • G10 MUSICAL INSTRUMENTS; ACOUSTICS
        • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
          • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
            • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
              • G10L21/0208 Noise filtering
                • G10L21/0216 Noise filtering characterised by the method used for estimating noise
                  • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
                  • G10L2021/02166 Microphone arrays; Beamforming
          • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
            • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
              • G10L25/30 Speech or voice analysis techniques using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)

Abstract

The invention relates to a sound pickup method for an unmanned aerial vehicle, comprising the following steps: acquiring an original sound signal to be processed; carrying out primary noise reduction processing on the original sound signal to obtain an enhanced sound signal; and inputting the enhanced sound signal into a noise reduction neural network to obtain an effective sound signal. Compared with the prior art, the method fuses the features of different layers of the encoder and decoder through the noise reduction neural network, making full use of the features of different sizes extracted by different receptive fields. This fusion of multi-scale features improves the accuracy with which effective sound signals are extracted and, by targeting the high-decibel self-noise and wind noise of the unmanned aerial vehicle platform, achieves human-voice enhancement in the extremely low signal-to-noise-ratio environment of the drone.

Description

Pickup method and device for unmanned aerial vehicle, unmanned aerial vehicle and computer readable storage medium
Technical Field
The invention relates to the technical field of pickup of unmanned aerial vehicles, in particular to a pickup method and device of an unmanned aerial vehicle, electronic equipment and a computer readable storage medium.
Background
A drone produces significant self-noise in flight, including steady-state mechanical noise, the unsteady paddle noise generated by the rotating propellers, and the wind noise generated by the airflow the propellers induce. This self-noise is generally above 90 decibels, far louder than the effective sound to be received, such as a human voice; moreover, the effective sound must propagate a long distance from the ground sound source to the drone's microphone, and it is attenuated along the way. In such a low signal-to-noise-ratio environment, the effective sound signal received by a microphone mounted on the drone is drowned out by the drone's self-noise, so the drone's microphone has difficulty collecting effective sound signals.
Disclosure of Invention
The invention aims to overcome the above defects in the prior art by providing a sound pickup method for an unmanned aerial vehicle that attenuates the drone's self-noise and improves the signal-to-noise ratio of the sound signal, so that the drone's microphone can effectively acquire effective sound signals.
The invention is realized by the following technical scheme: an unmanned aerial vehicle pickup method comprises the following steps:
acquiring an original sound signal to be processed;
carrying out primary noise reduction processing on the original sound signal to obtain an enhanced sound signal;
inputting the enhanced sound signal into a noise reduction neural network for processing to obtain an effective sound signal;
the noise reduction neural network comprises an encoder and a decoder, wherein the encoder comprises a plurality of down-sampling modules and a first convolution module which are sequentially connected, each down-sampling module comprises a one-dimensional convolution layer and a down-sampling layer, and the one-dimensional convolution layer is used for performing convolution operation on the enhanced sound signal or the sound signal output by the last sampling module; the down-sampling layer is used for performing down-sampling operation on the characteristics output by the one-dimensional convolution layer on the same layer; the first convolution module is used for performing one-dimensional convolution operation on the sound signal output by the last layer of the down-sampling module;
the decoder comprises a plurality of up-sampling modules and a second convolution module which are sequentially connected, the down-sampling modules correspond to the up-sampling modules layer by layer, each up-sampling module comprises an up-sampling layer, a splicing layer and a one-dimensional deconvolution layer, and the up-sampling layer of the up-sampling module of the first layer is used for performing up-sampling operation on the sound signal output by the first convolution module; the up-sampling layers from the second layer to the last layer of the up-sampling module are used for performing up-sampling operation on the sound signal output by the up-sampling module in the last layer; the splicing layer of the first layer of the up-sampling module is used for splicing the sound signal output by the up-sampling layer on the same layer with the feature extracted by the one-dimensional convolution layer of the up-sampling module on the same layer, and performing linear interpolation operation; the splicing layers from the second layer of up-sampling module to the last layer of up-sampling module are used for splicing the sound signal output by the up-sampling layer at the same layer with the extracted feature of the one-dimensional convolution layer of the down-sampling module at the same layer and the extracted feature of the one-dimensional convolution layer of the up-sampling module at the same layer; the one-dimensional deconvolution layer is used for performing deconvolution operation on the sound signals output by the splicing layer; and the second convolution module is used for performing one-dimensional convolution operation on the sound signal output by the last layer of the up-sampling module.
Compared with the prior art, the sound pickup method provided by the invention fuses the features of different layers of the encoder and decoder through the noise reduction neural network, making full use of the features of different sizes extracted by different receptive fields. This fusion of multi-scale features improves the accuracy with which effective sound signals are extracted and, by targeting the high-decibel self-noise and wind noise of the unmanned aerial vehicle platform, achieves human-voice enhancement in the extremely low signal-to-noise-ratio environment of the drone.
Further, the excitation function of the one-dimensional convolution layer is a leaky linear rectification function (Leaky ReLU); the excitation function of the one-dimensional deconvolution layers from the first up-sampling module to the second-to-last up-sampling module is a linear rectification function (ReLU), and the excitation function of the one-dimensional deconvolution layer of the last up-sampling module is a Sigmoid function.
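The three excitation functions named above can be written out in a few lines of NumPy; this is an illustrative sketch only, and the 0.01 negative slope of the leaky rectifier is an assumed value that the patent does not specify:

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    # Leaky ReLU: passes positives through, scales negatives by a small slope.
    return np.where(x > 0, x, slope * x)

def relu(x):
    # Linear rectification function: zeroes out negatives.
    return np.maximum(x, 0.0)

def sigmoid(x):
    # Squashes the output into (0, 1); used on the last deconvolution layer.
    return 1.0 / (1.0 + np.exp(-x))
```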
Further, the original sound signal is collected by a microphone linear array;
carrying out primary noise reduction processing on the original sound signal to obtain an enhanced sound signal, and comprising the following steps of:
performing framing and windowing processing on the original sound signal;
in a preset angle range, calculating a P value of each frame of the original sound signal for each angle, and determining the angle corresponding to the maximum P value as the sound source direction of the frame, wherein the expression of the P value is:

P(θ, l) = | Σ_{n=1}^{m} X_n(l) · e^{jk(n-1)d·cosθ} |

wherein m is the number of microphones in the microphone linear array; k = w/c and w = 2πf, where f is the frequency of the Fourier-transformed original sound signal (the frequency of the effective sound) and c is the speed of sound propagation in air; X_n(l) is the short-time Fourier transform of the l-th frame sound signal of the n-th channel of the original sound signal; e^{jk(n-1)d·cosθ} is the delay phase of the l-th frame sound signal of the n-th channel; d is the spacing between adjacent microphones of the linear array; and θ is the angle being evaluated;
for each frame of the original sound signal, obtaining an enhanced sound signal X according to the sound source direction, wherein the expression of the enhanced sound signal X is:

X = Σ_{n=1}^{m} x_n · e^{jk(n-1)d·cosθ₀}

wherein x_n is the original sound signal of the n-th microphone channel and θ₀ is the sound source direction of the frame.
Further, before inputting the enhanced sound signal into a noise reduction neural network for processing to obtain an effective sound signal, the method further comprises the following steps:
and after the enhanced sound signal is input into a band-pass filter for filtering, detecting effective sound in the enhanced sound signal through a VAD algorithm, and entering the subsequent step when continuous effective sound is detected.
Based on the same inventive concept, the application also provides an unmanned aerial vehicle pickup device, which comprises:
the signal acquisition module is used for acquiring an original sound signal to be processed;
the signal enhancement module is used for carrying out primary noise reduction processing on the original sound signal to obtain an enhanced sound signal;
the noise reduction processing module is used for inputting the enhanced sound signal into a noise reduction neural network for processing to obtain an effective sound signal;
the noise reduction neural network comprises an encoder and a decoder, the encoder comprising a plurality of down-sampling modules and a first convolution module connected in sequence; each down-sampling module comprises a one-dimensional convolution layer and a down-sampling layer, the one-dimensional convolution layer performing a convolution operation on the enhanced sound signal or on the sound signal output by the previous down-sampling module, and the down-sampling layer performing a down-sampling operation on the features output by the one-dimensional convolution layer;
the decoder comprises a plurality of up-sampling modules and a second convolution module connected in sequence, the down-sampling modules corresponding to the up-sampling modules layer by layer; each up-sampling module comprises an up-sampling layer, a splicing layer and a one-dimensional deconvolution layer, the up-sampling layer performing an up-sampling operation on the sound signal output by the first convolution module or on the sound signal output by the preceding up-sampling module; the splicing layer of the first up-sampling module splices the sound signal output by its up-sampling layer with the features extracted by the one-dimensional convolution layer of the corresponding down-sampling module and performs a linear interpolation operation; the splicing layers of the second through last up-sampling modules splice the sound signal output by their up-sampling layer with the features extracted by the one-dimensional convolution layer of the corresponding down-sampling module and with the features extracted by the one-dimensional convolution layer of the preceding down-sampling module; and the one-dimensional deconvolution layer performs a deconvolution operation on the sound signal output by the splicing layer.
Further, the excitation function of the one-dimensional convolution layer is a leaky linear rectification function (Leaky ReLU); the excitation function of the one-dimensional deconvolution layers from the first up-sampling module to the second-to-last up-sampling module is a linear rectification function (ReLU), and the excitation function of the one-dimensional deconvolution layer of the last up-sampling module is a Sigmoid function.
Further, the original sound signal is collected by a microphone linear array;
the signal enhancement module includes:
the framing windowing submodule is used for performing framing windowing processing on the original sound signal;
a sound source direction sub-module, configured to calculate, for each angle within a preset angle range, a P value of each frame of the original sound signal, and determine the angle corresponding to the maximum P value as the sound source direction of the frame, where the expression of the P value is:

P(θ, l) = | Σ_{n=1}^{m} X_n(l) · e^{jk(n-1)d·cosθ} |

wherein m is the number of microphones in the microphone linear array; k = w/c and w = 2πf, where f is the frequency of the Fourier-transformed original sound signal (the frequency of the effective sound) and c is the speed of sound propagation in air; X_n(l) is the short-time Fourier transform of the l-th frame sound signal of the n-th channel of the original sound signal; e^{jk(n-1)d·cosθ} is the delay phase of the l-th frame sound signal of the n-th channel; d is the spacing between adjacent microphones of the linear array; and θ is the angle being evaluated;
a signal accumulation submodule, configured to obtain, for each frame of the original sound signal, an enhanced sound signal X according to the sound source direction, where the expression of the enhanced sound signal X is:

X = Σ_{n=1}^{m} x_n · e^{jk(n-1)d·cosθ₀}

wherein x_n is the original sound signal of the n-th microphone channel and θ₀ is the sound source direction of the frame.
Further, still include:
and the continuous effective sound detection module is used for inputting the enhanced sound signal into a band-pass filter for filtering, detecting effective sound in the enhanced sound signal through a VAD algorithm, and entering the noise reduction processing module when the continuous effective sound is detected.
Based on the same inventive concept, the present application also provides an unmanned aerial vehicle, comprising a fuselage and further comprising:
a microphone array, arranged on the fuselage, for collecting original sound signals and transmitting them to the controller;
a controller, comprising:
a processor;
a memory for storing a computer program for execution by the processor;
wherein the processor implements the steps of the above method when executing the computer program.
Based on the same inventive concept, the present application also provides a computer-readable storage medium on which a computer program is stored, which when executed performs the steps of the above-described method.
For a better understanding and practice, the invention is described in detail below with reference to the accompanying drawings.
Drawings
Fig. 1 is a schematic diagram of an exemplary application environment of a pickup method of an unmanned aerial vehicle according to an embodiment;
fig. 2 is a schematic flow chart of a pickup method of an unmanned aerial vehicle according to an embodiment;
FIG. 3 is a flow diagram of a spatial filtering process according to one embodiment;
FIG. 4 is a schematic diagram of a noise reduction neural network in one embodiment;
FIG. 5 is a time domain diagram of an original sound signal collected from a sound source;
FIG. 6 is a time domain diagram of a valid sound signal;
fig. 7 is a schematic structural diagram of a pickup device of an unmanned aerial vehicle in one embodiment;
fig. 8 is a schematic structural diagram of the drone in one embodiment.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
It should be understood that the described embodiments are only some embodiments of the invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the application, as detailed in the appended claims.
In the description of the present application, it is to be understood that the terms "first," "second," "third," and the like are used solely for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order, nor should be construed to indicate or imply relative importance. The specific meaning of the above terms in the present application can be understood by those of ordinary skill in the art as appropriate. Further, in the description of the present application, "a plurality" means two or more unless otherwise specified. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
The invention denoises the sound signal collected by the unmanned aerial vehicle with a noise reduction neural network improved from the U-Net++ and LSTM base network frameworks, and is particularly suitable for low signal-to-noise-ratio pickup environments. The following embodiments describe the details.
Please refer to fig. 1, a schematic diagram of an exemplary application environment of the drone sound pickup method, comprising a drone microphone 11 and a remote controller 12. The drone microphone 11 is a sound receiving device mounted on the drone, and may be a microphone array or the like; the remote controller 12 comprises a memory storing a computer program and a processor that can execute the program in that memory. After the drone microphone 11 collects a sound signal, it is transmitted remotely to the remote controller 12, for example via a Bluetooth module or a wireless Wi-Fi module; the remote controller 12 then processes the received sound signal with the sound pickup method of this embodiment to obtain a clear effective sound signal.
Please refer to fig. 2, which is a flowchart illustrating an exemplary method for picking up sound by an unmanned aerial vehicle. The method comprises the following steps:
s1: acquiring an original sound signal to be processed;
s2: carrying out primary noise reduction processing on an original sound signal to obtain an enhanced sound signal;
s3: and inputting the enhanced sound signal into a noise reduction neural network for processing to obtain an effective sound signal.
In step S1, the original sound signal is a sound signal directly collected by a microphone mounted on the unmanned aerial vehicle, and the original sound signal can be obtained through wired or wireless transmission with the microphone.
In step S2, a preliminary noise reduction process is performed on the original sound signal to enhance the effective sound in the original sound signal, where the preliminary noise reduction process is related to the structure of the microphone that collects the original sound signal. Please refer to fig. 3, which is a flowchart illustrating a spatial filtering process according to an embodiment, including the following steps:
s21: performing frame windowing processing on an original sound signal;
the original sound signal is subjected to frame windowing processing, so that short-time analysis is performed on the original sound signal, and processing of non-stationary signals is facilitated.
S22: calculating the P value of each frame of original sound signals aiming at each angle within a preset angle range, and determining the angle corresponding to the maximum P value as the sound source direction of the frame;
The preset angle range can be set according to the relative position of the microphone and the drone. For example, when the microphone is mounted at the very front of the drone, the effective sound most probably comes from the drone's front side while the propeller noise lies directly behind the microphone, so the preset angle range can be restricted to a sector directly in front of the drone to reduce the amount of calculation.
The P value of the original sound signal is a spatial filter function, and its expression is:

P(θ, l) = | Σ_{n=1}^{m} X_n(l) · e^{jk(n-1)d·cosθ} |

wherein m is the number of microphones in the microphone linear array; n indexes the channel of the n-th microphone; l indexes the l-th frame; k = w/c and w = 2πf, where f is the frequency obtained by Fourier-transforming the time-domain original sound signal (the frequency of the effective sound) and c is the speed of sound propagation in air; X_n(l) is the short-time Fourier transform of the l-th frame sound signal of the n-th channel; e^{jk(n-1)d·cosθ} is the delay phase of the l-th frame sound signal of the n-th channel; d is the spacing between adjacent microphones of the linear array; and θ is the angle being evaluated.
S23: for each channel of the original sound signal in the same frame, apply the delay phase corresponding to the frame's sound source direction, and accumulate the signals of all channels to obtain the enhanced sound signal.

The expression of the enhanced sound signal X is:

X = Σ_{n=1}^{m} x_n · e^{jk(n-1)d·cosθ₀}

wherein x_n is the original sound signal of the n-th microphone channel and θ₀ is the sound source direction of the frame.
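Steps S22 and S23 amount to a steered-response-power direction search followed by delay-and-sum accumulation. The NumPy sketch below works per frame in the frequency domain; the phase-sign convention and the final averaging over channels are assumptions, since the original expressions appear only as images:

```python
import numpy as np

C = 343.0  # speed of sound in air, m/s

def steering(m, d, theta, freqs):
    # Delay phases e^{jk(n-1)d cos(theta)} for each mic n and frequency bin.
    k = 2 * np.pi * freqs / C
    n = np.arange(m)
    return np.exp(1j * np.outer(k, n * d * np.cos(theta)))   # (bins, mics)

def srp_direction(frames, d, angles, fs):
    # One windowed frame per mic, shape (mics, samples). Returns the angle
    # whose delay-and-sum output power P is largest (step S22).
    spec = np.fft.rfft(frames, axis=1)                       # per-mic spectrum
    freqs = np.fft.rfftfreq(frames.shape[1], 1 / fs)
    powers = []
    for theta in angles:
        ph = steering(frames.shape[0], d, theta, freqs)
        aligned = (spec.T * ph).sum(axis=1)                  # phase-align and sum
        powers.append(np.sum(np.abs(aligned) ** 2))
    return angles[int(np.argmax(powers))]

def delay_and_sum(frames, d, theta, fs):
    # Step S23: align every channel to the chosen direction, accumulate,
    # and return to the time domain (averaged here over the m channels).
    m = frames.shape[0]
    spec = np.fft.rfft(frames, axis=1)
    freqs = np.fft.rfftfreq(frames.shape[1], 1 / fs)
    ph = steering(m, d, theta, freqs)
    return np.fft.irfft((spec.T * ph).sum(axis=1), n=frames.shape[1]) / m
```

For a broadside source (θ = 90°, zero inter-mic delay), identical channels sum coherently and the search returns 90°, while off-axis steering angles dephase the channels and score lower.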
In step S3, the noise reduction neural network performs further human voice enhancement and noise reduction processing on the enhanced sound signal. Please refer to fig. 4, which is a schematic structural diagram of a noise reduction neural network in an embodiment, the noise reduction neural network includes an encoder and a decoder. The encoder is used for down-sampling and feature extraction of an input enhanced sound signal, and the decoder is used for up-sampling features output by the encoder and then outputting an effective sound signal.
Specifically, the encoder includes a plurality of down-sampling modules (Downsampling Blocks) and a first convolution module (1D Convolution) connected in sequence. Each down-sampling module includes a one-dimensional convolution layer (1D Convolution) and a down-sampling layer (Downsampling). The one-dimensional convolution layer performs a convolution operation on the enhanced sound signal, or on the sound signal output by the previous down-sampling module, to extract features; in one implementation, the stride of the one-dimensional convolution layer is set to 2, the convolution kernel size is set to 15, and the excitation function is a leaky linear rectification function (Leaky ReLU). The down-sampling layer performs a down-sampling operation on the features output by the one-dimensional convolution layer, and the signal output by the down-sampling layer is the sound signal output by that down-sampling module. The first convolution module performs a one-dimensional convolution operation on the sound signal output by the last down-sampling module to extract features; in one embodiment, its convolution kernel size is set to 15.
The decoder comprises a plurality of up-sampling modules (Upsampling Blocks) and a second convolution module (1D Convolution) connected in sequence, the down-sampling modules corresponding to the up-sampling modules layer by layer: the first down-sampling module corresponds to the last up-sampling module, the second down-sampling module to the second-to-last up-sampling module, and so on. Each up-sampling module comprises an up-sampling layer (Upsampling), a splicing layer, and a one-dimensional deconvolution layer (1D Deconvolution). The up-sampling layer performs an up-sampling operation on the sound signal output by the first convolution module, or on the sound signal output by the previous up-sampling module. The splicing layer of the first up-sampling module splices the sound signal output by its up-sampling layer with the features extracted by the one-dimensional convolution layer of the corresponding down-sampling module, a feature skip connection, and performs a linear interpolation operation. The splicing layers of the second through last up-sampling modules join the sound signal output by their up-sampling layer with the features extracted by the one-dimensional convolution layer of the corresponding down-sampling module (feature skip connection), and additionally splice in the features extracted by the one-dimensional convolution layer of the preceding down-sampling module (sampling skip connection). The one-dimensional deconvolution layer performs a deconvolution operation on the sound signal output by the splicing layer; in one embodiment, the stride of the deconvolution layers in the decoder is set to 2 and the convolution kernel size to 15, the excitation function of the one-dimensional deconvolution layers from the first to the second-to-last up-sampling module is a linear rectification function (ReLU), and that of the last up-sampling module is a Sigmoid function. The second convolution module performs a one-dimensional convolution operation on the sound signal output by the last up-sampling module and outputs the effective sound signal; preferably, its convolution kernel size is 1, so that the original sound signal can be fully utilized without changing the length of the sound data, the noise in the original sound signal is suppressed, and clean effective sound is restored.
The effective sound signal is a sound signal which needs to be collected actually and can be preset as a human sound signal and the like.
In an alternative embodiment, the encoder comprises 12 downsampling modules and the decoder comprises 12 upsampling modules.
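The encoder/decoder data flow described above can be sketched at shape level. The snippet below is a toy stand-in, not the trained network: fixed smoothing kernels replace the learned 1-D convolutions, nearest-neighbour repetition replaces the learned up-sampling, and an average replaces the splice-plus-deconvolution fusion, so only the skip-connection bookkeeping and length handling are illustrated:

```python
import numpy as np

def conv1d(x, kernel):
    # 'Same'-length 1-D convolution; a fixed kernel stands in for a learned layer.
    return np.convolve(x, kernel, mode="same")

def downsample(x):
    # Down-sampling layer: decimate by a factor of 2.
    return x[::2]

def upsample(x, length):
    # Up-sampling layer: nearest-neighbour repetition, trimmed/padded to the
    # length of the matching encoder feature (approximating the
    # linear-interpolation step described in the patent).
    y = np.repeat(x, 2)[:length]
    return np.pad(y, (0, length - len(y)))

def unet_pass(x, depth=3):
    """One encoder -> bottleneck -> decoder pass with skip connections."""
    kernel = np.array([0.25, 0.5, 0.25])
    skips = []
    for _ in range(depth):              # encoder: conv, remember features, downsample
        x = conv1d(x, kernel)
        skips.append(x)
        x = downsample(x)
    x = conv1d(x, kernel)               # bottleneck 1-D convolution
    for skip in reversed(skips):        # decoder, deepest skip first
        up = upsample(x, len(skip))
        fused = 0.5 * (up + skip)       # averaging stands in for splice + deconv
        x = conv1d(fused, kernel)
    return x                            # same length as the input signal
```

Because every decoder stage is restored to its matching encoder feature's length, the output has the same number of samples as the input, which is the property the kernel-size-1 second convolution module relies on.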
When the noise reduction neural network is trained, a mean square error (MSE) loss function is adopted. In one specific implementation, training is performed on a Quadro P1000 GPU with audio at a 16 kHz sampling rate, the batch size is 30, and an Adam optimizer is adopted with the following parameters: an initial learning rate of 0.001, a first-order moment estimation exponential decay rate of 0.9, and a second-order moment estimation exponential decay rate of 0.99. The self-made data set includes clean human voice samples based on the ST-CMDS-20170001_1-OS data set and noise samples that mix in wind noise and propeller noise at different signal-to-noise ratios (5, 0, -5, -10 and -15 dB). The data set comprises 400,000 samples in total, 200,000 clean human voice samples and 200,000 mixed noise samples; of the noise samples, 150,000 are used as the training set, 30,000 as the validation set and 20,000 as the test set.
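A minimal sketch of one training step with the hyper-parameters stated above (MSE loss, Adam with learning rate 0.001 and moment decay rates 0.9/0.99, batch size 30). The model here is a single stand-in convolution, and the waveforms are dummy data, not the described data set:

```python
import torch

# Stand-in for the denoising network; any nn.Module would do here.
model = torch.nn.Conv1d(1, 1, 15, padding=7)
criterion = torch.nn.MSELoss()                 # mean square error loss
optimizer = torch.optim.Adam(model.parameters(),
                             lr=1e-3, betas=(0.9, 0.99))

noisy = torch.randn(30, 1, 1024)               # one batch of noisy waveforms
clean = torch.randn(30, 1, 1024)               # matching clean targets (dummy)

optimizer.zero_grad()
loss = criterion(model(noisy), clean)          # compare denoised vs. clean
loss.backward()
optimizer.step()                               # one Adam update
```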
In a preferred embodiment, before the enhanced sound signal is input into the noise reduction neural network to obtain the effective sound signal, the method further comprises the following step: after the enhanced sound signal is input into a band-pass filter for filtering, effective sound in the enhanced sound signal is detected by a VAD algorithm, with continuous detection performed over a sliding window, and the subsequent step is entered only when continuous effective sound is detected. The pass band of the band-pass filter is set to the frequency range of the effective sound; if the effective sound is human voice, the pass band can be set to 300-3500 Hz. The VAD (Voice Activity Detection) algorithm can detect the start point and the end point of effective sound against a noise background. Because the noise reduction processing of the noise reduction neural network involves a large amount of computation, it places a high computing-power demand on the chip; by inputting the enhanced sound signal into the noise reduction neural network only after continuous effective sound has been detected, the computational load on the chip can be reduced, heat generation decreased, and the service life of the chip extended.
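The gating step above can be sketched as follows. The band-pass filter uses the stated 300-3500 Hz pass band; since the text does not specify which VAD algorithm is used, a simple frame-energy detector stands in for it, and the function name, frame size, window length and threshold are all illustrative assumptions:

```python
import numpy as np
from scipy.signal import butter, sosfilt

def gate_for_denoiser(x, fs=16000, band=(300.0, 3500.0),
                      frame=320, win_frames=10, thresh=0.01, min_active=8):
    """Band-pass the enhanced signal, then run an energy-based VAD
    (a stand-in for the unspecified VAD algorithm) over a sliding window.
    Returns True when enough consecutive frames contain effective sound,
    i.e. when the signal should be forwarded to the denoising network."""
    sos = butter(4, band, btype="bandpass", fs=fs, output="sos")
    y = sosfilt(sos, x)                               # band-pass filtering
    n = len(y) // frame
    rms = np.sqrt(np.mean(y[: n * frame].reshape(n, frame) ** 2, axis=1))
    voiced = rms > thresh                             # per-frame activity flag
    # Sliding window: trigger when min_active of win_frames frames are voiced.
    for i in range(max(0, n - win_frames) + 1):
        if int(voiced[i : i + win_frames].sum()) >= min_active:
            return True
    return False
```

A one-second 1 kHz tone (inside the pass band) triggers the gate, while silence does not.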
Compared with the prior art, the noise reduction neural network disclosed by the invention fuses the features of different layers in the encoder and the decoder, making full use of features of different scales extracted under different receptive fields. This multi-scale feature fusion improves the extraction accuracy of the effective sound signal, and by targeting the high-decibel self-noise and wind noise of the unmanned aerial vehicle platform, achieves voice enhancement in the extremely low signal-to-noise-ratio environment of the unmanned aerial vehicle.
In addition, the invention collects sound through the microphone linear array and provides a spatial filtering processing algorithm aiming at the microphone linear array, so that effective sound can be enhanced directionally, and the effect of further denoising is achieved.
Please refer to fig. 5, which is a time domain diagram of an original sound signal collected from a sound source, and to fig. 6, which is a time domain diagram of the effective sound signal obtained after that original sound signal is processed by the above pickup method for the unmanned aerial vehicle. The comparison shows that after processing, the noise in the original sound signal is suppressed while the human voice is retained.
Based on the same inventive concept, the invention also provides a pickup apparatus for an unmanned aerial vehicle. Please refer to fig. 7, which is a schematic structural diagram of the pickup apparatus in an embodiment. The apparatus includes a signal obtaining module 21, a signal enhancing module 22, and a noise reduction processing module 23, wherein the signal obtaining module 21 is configured to obtain an original sound signal to be processed; the signal enhancing module 22 is configured to perform spatial filtering processing on the original sound signal to obtain an enhanced sound signal; and the noise reduction processing module 23 is configured to input the enhanced sound signal into a noise reduction neural network for processing, so as to obtain an effective sound signal.
In an optional embodiment, the signal enhancement module 22 includes a framing windowing sub-module 221, a sound source direction sub-module 222, and a signal accumulation sub-module 223, where the framing windowing sub-module 221 is configured to perform framing windowing on the original sound signal; the sound source direction sub-module 222 is configured to calculate, for each angle within a preset angle range, a P value of each frame of original sound signals, and determine an angle corresponding to a maximum P value as a sound source direction of the frame; the signal accumulation sub-module 223 is configured to obtain a delay phase in the sound source direction of the frame for each path of original sound signals of the same frame, and accumulate the original sound signals of all the paths to obtain an enhanced sound signal.
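The work of the sound source direction submodule and the signal accumulation submodule can be sketched with a narrowband delay-and-sum beamformer for a linear array. The function names, the single-frequency-bin simplification, and the example values (speed of sound 343 m/s, 1 kHz steering frequency, 5 cm spacing, 181-point angle grid) are illustrative assumptions, not specifics from the text:

```python
import numpy as np

C = 343.0  # assumed speed of sound in air (m/s)

def delay_and_sum(bins, f0, d, theta):
    """Align the complex STFT bin of each microphone channel with the
    delay phase exp(-j*k*n*d*cos(theta)), k = 2*pi*f0/C, and accumulate
    the channels, as the signal accumulation submodule does per frame.
    bins: (m,) complex values, one bin per microphone channel."""
    m = len(bins)
    k = 2.0 * np.pi * f0 / C
    phases = np.exp(-1j * k * d * np.cos(theta) * np.arange(m))
    return np.sum(bins * phases)

def estimate_direction(bins, f0, d, angles):
    """Sound source direction submodule: evaluate the steered response
    power for every candidate angle and keep the angle where it peaks."""
    powers = [abs(delay_and_sum(bins, f0, d, a)) ** 2 for a in angles]
    return angles[int(np.argmax(powers))]
```

For a simulated plane wave arriving from a grid angle, the estimate recovers that angle exactly, and the aligned sum reaches its maximum magnitude m there.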
In a preferred embodiment, the pickup apparatus for unmanned aerial vehicle further includes a continuous effective sound detection module 24, where the continuous effective sound detection module 24 is configured to detect an effective sound in the enhanced sound signal through a VAD algorithm after the enhanced sound signal is input into the band-pass filter for filtering, and perform continuous detection through a sliding window, and when a continuous effective sound is detected, enter a subsequent step.
For the device embodiments, reference is made to the description of the method embodiments for relevant details, since they correspond essentially to the method embodiments.
Based on the above pickup method for the unmanned aerial vehicle, the present application further provides an unmanned aerial vehicle. Please refer to fig. 8, which is a schematic structural diagram of an embodiment of the unmanned aerial vehicle. The unmanned aerial vehicle includes a body 31, a microphone array 32, a support rod 33, an unmanned aerial vehicle controller (not shown), and a remote controller (not shown). The body 31 is the flying carrier. The microphone array 32 is mounted on the body 31 through the support rod 33, and can be arranged either directly in front of the body 31 or in the 45-degree direction above the front of the body 31; the microphone array 32 can be a linear array consisting of 2-4 microphones. For the case that the microphone array 32 is arranged directly in front of the body 31, cardioid directional microphones may be selected; for the case that the microphone array 32 is arranged in the 45-degree direction above the front of the body 31, figure-of-eight (bidirectional) microphones may be selected; the directivity of sound collection can thereby be improved. The support rod 33 may be an elongated lightweight carbon tube. The unmanned aerial vehicle controller comprises a pickup module, a data transmission module and a broadcasting module, wherein the pickup module is used for receiving the original sound signal collected by the microphone array 32; the data transmission module is used for remotely transmitting the original sound signal in the pickup module to the remote controller and for receiving a broadcast voice signal from the remote controller; and the broadcasting module is used for receiving and playing the broadcast voice signal in the data transmission module. The remote controller comprises one or more processors and a memory, wherein the processor is configured to execute a program implementing the pickup method for the unmanned aerial vehicle of the method embodiments, and the memory is used for storing the computer program executable by the processor.
Based on the same inventive concept, the present invention further provides a computer-readable storage medium, corresponding to the foregoing embodiments of the sound pickup method for the unmanned aerial vehicle, wherein the computer-readable storage medium stores thereon a computer program, and when the computer program is executed by a processor, the computer program implements the steps of the sound pickup method for the unmanned aerial vehicle, which are described in any one of the foregoing embodiments.
This application may take the form of a computer program product embodied on one or more storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having program code embodied therein. Computer-usable storage media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of the storage medium of the computer include, but are not limited to: phase change memory (PRAM), static Random Access Memory (SRAM), dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), read Only Memory (ROM), electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technologies, compact disc read only memory (CD-ROM), digital Versatile Discs (DVD) or other optical storage, magnetic tape storage or other magnetic storage devices, or any other non-transmission medium, may be used to store information that may be accessed by a computing device.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above-mentioned embodiments express only several embodiments of the present invention, and while their description is relatively specific and detailed, they should not be construed as limiting the scope of the invention. It should be noted that those skilled in the art may make changes and modifications without departing from the spirit of the present invention, and it is intended that the present invention encompass such changes and modifications.

Claims (10)

1. A pickup method for an unmanned aerial vehicle, characterized by comprising the following steps:
acquiring an original sound signal to be processed;
carrying out primary noise reduction processing on the original sound signal to obtain an enhanced sound signal;
inputting the enhanced sound signal into a noise reduction neural network for processing to obtain an effective sound signal;
the noise reduction neural network comprises an encoder and a decoder, wherein the encoder comprises a plurality of down-sampling modules and a first convolution module which are sequentially connected, each down-sampling module comprises a one-dimensional convolution layer and a down-sampling layer, and the one-dimensional convolution layer is used for performing a convolution operation on the enhanced sound signal or on the sound signal output by the down-sampling module of the previous layer; the down-sampling layer is used for performing a down-sampling operation on the features output by the one-dimensional convolution layer of the same layer; the first convolution module is used for performing a one-dimensional convolution operation on the sound signal output by the last-layer down-sampling module;
the decoder comprises a plurality of up-sampling modules and a second convolution module which are sequentially connected, the down-sampling modules correspond to the up-sampling modules layer by layer, and each up-sampling module comprises an up-sampling layer, a splicing layer and a one-dimensional deconvolution layer, wherein the up-sampling layer of the first-layer up-sampling module is used for performing an up-sampling operation on the sound signal output by the first convolution module; the up-sampling layers of the second-layer to last-layer up-sampling modules are used for performing an up-sampling operation on the sound signal output by the up-sampling module of the previous layer; the splicing layer of the first-layer up-sampling module is used for splicing the sound signal output by the up-sampling layer of the same layer with the features extracted by the one-dimensional convolution layer of the corresponding down-sampling module and performing a linear interpolation operation; the splicing layers of the second-layer to last-layer up-sampling modules are used for splicing the sound signal output by the up-sampling layer of the same layer with the features extracted by the one-dimensional convolution layer of the corresponding down-sampling module and the features extracted by the one-dimensional convolution layer of the down-sampling module of the previous layer; the one-dimensional deconvolution layer is used for performing a deconvolution operation on the sound signal output by the splicing layer; and the second convolution module is used for performing a one-dimensional convolution operation on the sound signal output by the last-layer up-sampling module.
2. The method of claim 1, wherein: the excitation function of the one-dimensional convolution layer is a leaky linear rectification function; the excitation function of the one-dimensional deconvolution layer from the first-layer up-sampling module to the second-to-last-layer up-sampling module is a linear rectification function, and the excitation function of the one-dimensional deconvolution layer of the last-layer up-sampling module is a Sigmoid function.
3. The method of claim 1, wherein: the original sound signal is collected through a microphone linear array;
wherein carrying out primary noise reduction processing on the original sound signal to obtain an enhanced sound signal comprises the following steps:
performing frame windowing processing on the original sound signal;
in a preset angle range, calculating a P value of each frame of the original sound signal for each angle, and determining the angle corresponding to the maximum P value as the sound source direction of the frame, wherein the expression of the P value is:

P(θ) = Σ_w | Σ_{n=1}^{m} X_n^l(w) · e^(−jk(n−1)d·cos θ) |²

wherein m is the number of microphones in the microphone linear array; k = w/c, w = 2πf, f is the frequency of the Fourier-transformed original sound signal, taken at the frequency of the effective sound, and c is the speed of sound propagation in air; X_n^l(w) is the short-time Fourier transform of the l-th frame sound signal of the n-th channel of the original sound signal; e^(−jk(n−1)d·cos θ) is the delay phase of the l-th frame sound signal of the n-th channel; d is the microphone spacing of the microphone linear array; and θ is the candidate angle;

for each frame of the original sound signal, obtaining an enhanced sound signal X according to the sound source direction θ̂ of the frame, wherein the expression of the enhanced sound signal X is:

X(l, w) = Σ_{n=1}^{m} X_n^l(w) · e^(−jk(n−1)d·cos θ̂)

wherein X_n^l(w) is the original sound signal of the n-th microphone channel.
4. The method of claim 1, wherein: before inputting the enhanced sound signal into a noise reduction neural network for processing to obtain an effective sound signal, the method further comprises the following steps:
and after the enhanced sound signal is input into a band-pass filter for filtering, detecting effective sound in the enhanced sound signal through a VAD algorithm, and entering the subsequent step when continuous effective sound is detected.
5. A pickup apparatus for an unmanned aerial vehicle, characterized by comprising:
the signal acquisition module is used for acquiring an original sound signal to be processed;
the signal enhancement module is used for carrying out primary noise reduction processing on the original sound signal to obtain an enhanced sound signal;
the noise reduction processing module is used for inputting the enhanced sound signal into a noise reduction neural network for processing to obtain an effective sound signal;
the noise reduction neural network comprises an encoder and a decoder, wherein the encoder comprises a plurality of down-sampling modules and a first convolution module which are sequentially connected, each down-sampling module comprises a one-dimensional convolution layer and a down-sampling layer, and the one-dimensional convolution layer is used for performing a convolution operation on the enhanced sound signal or on the sound signal output by the down-sampling module of the previous layer; the down-sampling layer is used for performing a down-sampling operation on the features output by the one-dimensional convolution layer;
the decoder comprises a plurality of up-sampling modules and a second convolution module which are sequentially connected, the down-sampling modules correspond to the up-sampling modules layer by layer, and each up-sampling module comprises an up-sampling layer, a splicing layer and a one-dimensional deconvolution layer, wherein the up-sampling layer is used for performing an up-sampling operation on the sound signal output by the first convolution module or by the up-sampling module of the previous layer; the splicing layer of the first-layer up-sampling module is used for splicing the sound signal output by the up-sampling layer with the features extracted by the one-dimensional convolution layer of the corresponding down-sampling module and performing a linear interpolation operation; the splicing layers of the second-layer to last-layer up-sampling modules are used for splicing the sound signal output by the up-sampling layer with the features extracted by the one-dimensional convolution layer of the corresponding down-sampling module and the features extracted by the one-dimensional convolution layer of the down-sampling module of the previous layer; and the one-dimensional deconvolution layer is used for performing a deconvolution operation on the sound signal output by the splicing layer.
6. The apparatus of claim 5, wherein: the excitation function of the one-dimensional convolution layer is a leaky linear rectification function; the excitation function of the one-dimensional deconvolution layer from the first-layer up-sampling module to the second-to-last-layer up-sampling module is a linear rectification function, and the excitation function of the one-dimensional deconvolution layer of the last-layer up-sampling module is a Sigmoid function.
7. The apparatus of claim 5, wherein: the original sound signal is collected through a microphone linear array;
the signal enhancement module includes:
the framing windowing submodule is used for performing framing windowing processing on the original sound signal;
a sound source direction submodule, configured to calculate, for each angle within a preset angle range, a P value of each frame of the original sound signal, and determine the angle corresponding to the maximum P value as the sound source direction of the frame, wherein the expression of the P value is:

P(θ) = Σ_w | Σ_{n=1}^{m} X_n^l(w) · e^(−jk(n−1)d·cos θ) |²

wherein m is the number of microphones in the microphone linear array; k = w/c, w = 2πf, f is the frequency of the Fourier-transformed original sound signal, taken at the frequency of the effective sound, and c is the speed of sound propagation in air; X_n^l(w) is the short-time Fourier transform of the l-th frame sound signal of the n-th channel of the original sound signal; e^(−jk(n−1)d·cos θ) is the delay phase of the l-th frame sound signal of the n-th channel; d is the microphone spacing of the microphone linear array; and θ is the candidate angle;

a signal accumulation submodule, configured to obtain, for each frame of the original sound signal, an enhanced sound signal X according to the sound source direction θ̂ of the frame, wherein the expression of the enhanced sound signal X is:

X(l, w) = Σ_{n=1}^{m} X_n^l(w) · e^(−jk(n−1)d·cos θ̂)

wherein X_n^l(w) is the original sound signal of the n-th microphone channel.
8. The apparatus of claim 5, further comprising:
and the continuous effective sound detection module is used for inputting the enhanced sound signal into a band-pass filter for filtering, detecting effective sound in the enhanced sound signal through a VAD algorithm, and entering the noise reduction processing module when the continuous effective sound is detected.
9. An unmanned aerial vehicle, comprising a body, characterized by further comprising:
the microphone array is arranged on the machine body and used for collecting original sound signals and transmitting the original sound signals to the controller;
a controller, comprising:
a processor;
a memory for storing a computer program for execution by the processor;
wherein the processor, when executing the computer program, implements the steps of the method of any one of claims 1-4.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed, carries out the steps of the method of any one of claims 1 to 4.
CN202211250290.6A 2022-10-13 2022-10-13 Pickup method and device for unmanned aerial vehicle, unmanned aerial vehicle and computer readable storage medium Pending CN115331691A (en)

Publications (1)

Publication Number Publication Date
CN115331691A true CN115331691A (en) 2022-11-11

Family

ID=83913561



Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109831731A (en) * 2019-02-15 2019-05-31 杭州嘉楠耘智信息科技有限公司 Sound source orientation method and device and computer readable storage medium
CN109949821A (en) * 2019-03-15 2019-06-28 慧言科技(天津)有限公司 A method of far field speech dereverbcration is carried out using the U-NET structure of CNN
CN112904279A (en) * 2021-01-18 2021-06-04 南京工程学院 Sound source positioning method based on convolutional neural network and sub-band SRP-PHAT space spectrum
CN114333796A (en) * 2021-12-27 2022-04-12 深圳Tcl数字技术有限公司 Audio and video voice enhancement method, device, equipment, medium and smart television


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CRAIG MACARTNEY ET AL: "Improved Speech Enhancement with the Wave-U-Net", 《ARXIV》 *
DANIEL STOLLER ET AL: "WAVE-U-NET: A MULTI-SCALE NEURAL NETWORK FOR END-TO-END AUDIO SOURCE SEPARATION", 《ARXIV》 *
袁安富等: "一种改进的联合 SRP-PHAT 语音定位算法", 《南京信息工程大学学报》 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20221111