CN111126199B - Signal characteristic extraction and data mining method based on echo measurement data - Google Patents


Info

Publication number
CN111126199B
CN111126199B (granted from application CN201911268281.8A; application published as CN111126199A)
Authority
CN
China
Prior art keywords
input
training
decoder
encoder
residual block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911268281.8A
Other languages
Chinese (zh)
Other versions
CN111126199A (en)
Inventor
朱殷
张军平
Current Assignee
Fudan University
Original Assignee
Fudan University
Priority date
Filing date
Publication date
Application filed by Fudan University
Priority to CN201911268281.8A
Publication of CN111126199A
Application granted
Publication of CN111126199B
Status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G06N 3/08 - Learning methods
    • G06F 2218/00 - Aspects of pattern recognition specially adapted for signal processing
    • G06F 2218/08 - Feature extraction
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A - TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A 90/00 - Technologies having an indirect contribution to adaptation to climate change
    • Y02A 90/10 - Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Error Detection And Correction (AREA)
  • Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)

Abstract

The invention belongs to the technical fields of machine learning and signal feature extraction, and specifically relates to a signal feature extraction and data mining method based on echo measurement data. The method comprises the following steps: generating additional simulation data from the existing simulation data; constructing a basic residual block that can serve in both the encoder and the decoder; cross-linking the encoder and decoder at the same level of abstraction; feeding the raw echo produced jointly by several targets into the encoder to obtain a feature map that is easy to process, using the decoder to recover the echoes produced by the individual targets, and evaluating performance by the relative errors of the peak amplitude and peak position. The invention separates the echo produced jointly by several targets into the echoes produced by each individual target, so that the model learns the characteristics of target-generated echoes well and extracts several comparatively simple single-target echoes.

Description

Signal characteristic extraction and data mining method based on echo measurement data
Technical Field
The invention belongs to the technical field of machine learning and signal feature extraction, and particularly relates to a signal feature extraction and data mining method based on echo measurement data.
Background
Feature extraction and data mining of echo measurement data belong, strictly speaking, to the field of signal processing. Traditional signal-processing methods, however, rely on a physical model of the signal to be processed, whereas deep-learning approaches require a network built around the characteristics of the signal. Few practical publications exist on deep-learning-based echo feature extraction.
The problem addressed by the invention, separating the signal reflected by a single source from the mixed echo reflected by several sources, is known as Blind Source Separation [1]: because effective information about the system characteristics of the mixed source signals is lacking, each desired signal must be recovered through suitable methods and transformations.
Herault and Jutten first achieved blind separation of two signals in 1986 with an adaptive algorithm based on a simple neural network [2]. In 1994, Comon proposed an ICA method based on minimal mutual information, systematically setting forth the concept of independent components and defining the basic assumptions of the blind source separation problem [3]. In 1995, the ICA algorithm based on the information-maximization criterion published by Bell and Sejnowski exploited the information transmitted by a nonlinear network by maximizing the entropy of its nonlinear nodes, successfully extending to adaptive blind separation and blind deconvolution [4]. Blind source separation theory has since developed and been widely applied in image processing, speech processing, signal processing, and other fields.
In the era of widespread deep learning, many blind source separation problems can be solved with deep-learning methods. In speech processing, for example, the cocktail-party problem is among the most studied [5]: it attempts to separate each speaker's voice from a waveform in which several voices are mixed. A similar task in music processing is separating the individual instruments and the vocal part that make up a song; both belong to the multi-channel blind deconvolution problem [6]. Many deep-learning approaches have solved such problems successfully. Jansson et al. borrowed the U-Net structure from the field of image segmentation [7] and applied it to music spectrograms, successfully separating the vocals from the instrument parts [8]. Stoller et al. then proposed the Wave-U-Net structure, which operates directly in the time domain to preserve the phase information that is lost when music is Fourier-transformed [9]. Li et al. in turn applied the attention mechanism to this problem with TF-Attention-Net [10].
[1] Chabriel G, Kleinsteuber M, Moreau E, et al. Joint matrices decompositions and blind source separation: A survey of methods, identification, and applications[J]. IEEE Signal Processing Magazine, 2014, 31(3): 34-43.
[2] Jutten C, Herault J. Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture[J]. Signal Processing, 1991, 24(1): 1-10.
[3] Comon P. Separation of stochastic processes[C]// Workshop on Higher-Order Spectral Analysis. IEEE, 1989: 174-179.
[4] Bell A J, Sejnowski T J. An information-maximization approach to blind separation and blind deconvolution[J]. Neural Computation, 1995, 7(6): 1129-1159.
[5] Haykin S, Chen Z. The cocktail party problem[J]. Neural Computation, 2005, 17(9): 1875-1902.
[6] Cardoso J F. Blind signal separation: statistical principles[J]. Proceedings of the IEEE, 1998, 86(10): 2009-2025.
[7] Ronneberger O, Fischer P, Brox T. U-Net: Convolutional networks for biomedical image segmentation[C]// International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, Cham, 2015: 234-241.
[8] Jansson A, Humphrey E, Montecchio N, et al. Singing voice separation with deep U-Net convolutional networks[J]. 2017.
[9] Stoller D, Ewert S, Dixon S. Wave-U-Net: A multi-scale neural network for end-to-end audio source separation[J]. arXiv preprint arXiv:1806.03185, 2018.
[10] Li T, Chen J, Hou H, et al. TF-Attention-Net: An end-to-end neural network for singing voice separation[J]. arXiv preprint arXiv:1909.05746, 2019.
Disclosure of Invention
The object of the invention is a method that performs accurate and rapid feature extraction and data mining on signals without knowledge of their physical properties, so that an echo signal generated jointly by several targets can be separated into the echo signals generated by each individual target.
The invention provides a signal feature extraction and data mining method based on echo measurement data and employing deep-learning information processing. The specific steps are:
(1) Additionally generate echo data for each individual target, based on the echoes originally generated jointly by the plurality of targets, to serve as labels for echo separation;
(2) Construct a basic residual block whose structure comprises: three one-dimensional convolutions, one ReLU activation function, two batch normalization functions, and one maximum pooling layer; see fig. 1;
(3) Respectively link 12 basic residual blocks into an encoder and a decoder, join the two by a one-dimensional convolution, and finally form the complete deep learning network with cross-layer links between the encoder and decoder at the same abstraction level;
(4) Divide the complete data set into training, validation, and test sets at a ratio of 8:1:1 according to the different relative poses at the same target position;
(5) Train the deep learning network on the training set for a number of iterations, then lower the learning rate and continue training while validating on the validation set; stop training once the stopping conditions are met and output the network model that performed best on the validation set.
In step (1) of the invention, the process of additionally generating a single-target echo is as follows: for a given set of target positions and poses, say k targets, the echo generated by the first target alone is obtained by keeping the first target's position and relative pose and removing the other targets, which yields the label corresponding to that target. Repeating this process k times yields all the labels.
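The k-pass labeling loop described above can be sketched as follows. This is a minimal illustration, not the patent's own code: `simulate_echo` stands in for whatever echo simulator produced the original data, and the toy stand-in below simply places each target's amplitude at its position.

```python
def make_labels(targets, simulate_echo):
    """For k targets, run the simulator k times, each time keeping one
    target (its position and relative pose) and removing all the others;
    the k resulting single-target echoes are the separation labels."""
    labels = []
    for kept in targets:
        labels.append(simulate_echo([kept]))  # all other targets removed
    return labels

def toy_simulate(targets, length=16):
    """Toy stand-in simulator: the 'echo' of a target set is a sequence
    with each target's amplitude placed at its position."""
    echo = [0.0] * length
    for pos, amp in targets:
        echo[pos] += amp
    return echo

targets = [(2, 1.0), (7, 0.5), (11, 0.25)]   # (position, amplitude) pairs
labels = make_labels(targets, toy_simulate)  # k = 3 single-target echoes
```

Each label contains exactly one target's contribution, which is what makes it usable as a separation target later.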
In step (2) of the invention, the network needs more layers and more neurons for stronger nonlinear expressive power, because the amplitude distribution of the echo signals processed here is far less even than that of ordinary sound signals. Deeper networks easily suffer exploding or vanishing gradients, so every layer uses a residual connection. Let x denote the input, of shape N x C x L: the number of samples in a batch (the N samples together form one batch), the number of channels, and the length. Let f denote one layer of the network; with the residual connection, the layer's output changes from the usual f(x) to x + f(x). If more neurons are used so that the layer increases the number of output channels, however, f(x) has more channels than x and the two cannot be added directly.
The basic residual block therefore contains two paths. One, called the mapping path, maps the input signal directly into a higher-dimensional space with more channels, in the hope of obtaining a higher-dimensional representation that is easier to process. The other, called the residual path, learns the residual between the higher-dimensional representation obtained in the former path and a better higher-dimensional representation.
In the mapping path, since echo signals are strongly correlated in time, a one-dimensional convolution, denoted conv_f, captures the temporal information and changes the number of input channels. To make the residual addition possible, the stride is set according to the kernel size of this convolution (the method uses kernel size 3 and stride 1) so that the length of the input is unchanged and it can be added directly to the output of the residual path. Let conv_f(x) denote the resulting mapping.
In the residual path, the same structure is used twice to increase expressive power. The structure is as follows:
(a) Again to capture temporal information, the input first passes through a one-dimensional convolution, denoted conv_1.
(b) A rectified linear unit (ReLU) then serves as the activation function: analogous to a neuron's firing mechanism, any part of the result below 0 is forced to 0 and excluded from gradient backpropagation. This is denoted relu_1.
(c) To ease training, batch normalization is applied, denoted bn_1. Because inputs and outputs are handled per batch, batch normalization maps all samples in the batch to mean 0 and variance 1, which speeds up network training.
The input passes through the (a)(b)(c) path twice; the second pass's one-dimensional convolution, activation function, and batch normalization are denoted conv_2, relu_2, and bn_2, respectively. The result is the residual, denoted res(x).
Finally, a maximum pooling layer produces the final result: a sliding window moves over its input, conv_f(x) + res(x), and the maximum within each window is output. The pooling layer, denoted MaxPool, reduces the number of parameters while retaining the salient features.
The whole basic residual block diagram is shown in fig. 1, and the algorithm pseudo code of the basic residual block is shown in annex 1.
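As a quick check of what step (c) does, the following minimal snippet maps a batch of values in one channel to mean 0 and variance 1. It is plain Python for illustration only, and includes the standard small epsilon that the text omits.

```python
def batch_norm_1d(values, eps=1e-5):
    """Normalize one channel across a batch: subtract the batch mean,
    divide by the batch standard deviation (eps avoids division by 0)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return [(v - mean) / (var + eps) ** 0.5 for v in values]

normed = batch_norm_1d([1.0, 2.0, 3.0, 4.0])  # mean ~0, variance ~1
```

In the real block this operates per channel over the whole batch; the learnable scale and shift parameters of full batch normalization are omitted here.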
In step (3) of the invention, the complete network is formed as follows. First take L basic residual blocks (L = 10 in the specific implementation) and chain them end to end into an encoder; numbering them in the order the input passes through, the first basic residual block is number 1 (denoted f_1, and likewise for the others) and the last one passed through is number L.
Then a one-dimensional convolution, denoted conv_med, processes the encoded features for the decoder that follows. Another L basic residual blocks are chained end to end into a decoder, numbered in the opposite order to the encoder: the first block passed through is number L (denoted h_L, and likewise for the others) and the last is number 1. Note that the decoder's input is the feature processed by the one-dimensional convolution just mentioned.
Finally, cross-links between the encoder and decoder are established. For the decoder's residual block number i: if i < L, i.e., it is not the first decoder block passed through, its input is the output of encoder residual block number i concatenated along the channel dimension with the output of decoder residual block number i+1 (the previously passed decoder block); if i = L, the output of conv_med is concatenated in as the input instead.
The whole deep learning network is shown in figure 2, and the algorithm pseudo code of the deep learning network is shown in annex 2.
In step (4) of the invention, the data set consists of files, each corresponding to one multi-target situation (for example, three targets); each file contains one acquired echo per relative angle between the targets and the radar, with the single-target echoes generated in step (1) as labels. To cover the various situations, the contents of each file are divided into training, validation, and test sets at a ratio of 8:1:1.
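The 8:1:1 division inside one file can be sketched as below. The function name and the use of a seeded shuffle are illustrative assumptions, not specified by the patent.

```python
import random

def split_8_1_1(samples, seed=0):
    """Split the samples inside one file (one target position, all its
    relative poses) into train/validation/test at an 8:1:1 ratio."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)      # reproducible shuffle
    n = len(samples)
    n_train = int(n * 0.8)
    n_val = int(n * 0.1)
    train = [samples[i] for i in idx[:n_train]]
    val = [samples[i] for i in idx[n_train:n_train + n_val]]
    test = [samples[i] for i in idx[n_train + n_val:]]
    return train, val, test

poses = list(range(24))  # e.g. 24 relative poses at one target position
train, val, test = split_8_1_1(poses)
```

With 24 poses this yields 19/2/3 samples; the three subsets are disjoint and together cover the file.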
In step (5) of the invention, the network is trained as follows. The data in the training set are randomly divided into batches of 64 echo signals each. The final network model is generally the one performing best on the validation set, although there is no universally optimal rule for when to stop training. Training proceeds in two stages. The first stage uses a relatively high learning rate, which allows a reasonable number of training iterations to be estimated from the loss curve and reduces the probability of getting stuck at saddle points. Specifically, an Adam optimizer is used with learning rate 0.001, beta1 = 0.9, beta2 = 0.999. Iterations (e.g., 1000) are run on the training set only, with MSE (mean square error) as the loss function for gradient backpropagation:
MSE(Z_i, Y_i) = ||Z_i - Y_i||²
where Z_i and Y_i denote the i-th output and its label, respectively.
The second stage uses a lower learning rate in order to fine-tune the model. It differs from the first stage in that the learning rate is reduced to 0.0001; iterations (e.g., 1000) are run on the training set with the same loss function, and performance is evaluated on the validation set every 5 iterations. If performance does not improve within 50 iterations, the model is considered converged; training stops and the historically best model is output.
Two indexes are used to evaluate network performance in practice: the relative error of the peak amplitude and the relative error of the peak position. An output is considered qualified if both relative errors are within fifteen percent.
The exact definitions of these two performance indexes follow. First, the peak of a signal X (here and below, X is assumed to be a tensor with a single length dimension) is defined as:
φ(X)=max(X)
the peak position of the redefined signal X is:
P(X)=argmax(X)
the relative error in peak size is then:
e_φ(Z, Y) = |φ(Z) - φ(Y)| / φ(Y)
the relative error in peak position is:
e_P(Z, Y) = |P(Z) - P(Y)| / W_k(Y)
note that the denominator is defined herein as the length of the occurrence of a peak, i.e., the number of points in the whole wave that are greater than a certain proportion of the peak, where a certain proportion is denoted by k, and a specific implementation is 0.01.
Since these evaluation indexes depend only on the maximum point of a peak, their input signals must contain exactly one distinct peak, or the indexes lose their meaning. This is why the echo of a single target is extracted: such an echo necessarily has only one distinct peak.
It also follows that backpropagating gradients directly from the evaluation indexes would tend to distort the separated signal at the other points, which is why MSE (mean square error) is used as the loss function for gradient backpropagation in the first stage. In a concrete implementation the gradient may be smoothed by averaging over the number of samples in the batch or over the signal length. Using this loss function for backpropagation keeps every point of the output signal as close to the label as possible, making the output more reliable.
The algorithmic pseudocode for training the deep learning network is found in appendix 3.
The method separates the echoes generated by different targets well from the echo they generate jointly, so that the model learns the characteristics of target-generated echoes and extracts several comparatively simple echoes generated by single targets.
Drawings
Fig. 1 is a basic residual block structure diagram.
Fig. 2 is a schematic diagram of the structure of the model of the present invention.
Fig. 3 is a sample presentation of the input data, where (a) shows the echo signal generated jointly by three targets, and (b), (c), (d) are the echo signals generated by each of the three targets separately.
Fig. 4 is a visual presentation of the prediction results of the invention. In (a)-(f), each row shows, from left to right, a predicted echo and its label (as in (b), (c), (d) of fig. 3).
Detailed Description
Having introduced the algorithm's principles and specific steps, the signal-separation test results of the invention on simulation data are demonstrated below.
The data set used in the experiment is special-band echo data for three targets generated from 802. There are 428 target positions and 24 relative poses, for a total of 10272 samples.
In the tests, two indexes, the peak amplitude relative error (e_φ) and the peak position relative error (e_P), measure the experimental effect; relative errors within 15% are generally considered of practical application value.
Experimental example 1: random 20 samples
I denotes the sample number, H the peak amplitude relative error, and P the peak position relative error; the trailing digit indicates which target is concerned. For example, H1 is the relative error between the amplitude of the predicted first-target echo and that of the actual echo. Bold numbers indicate an unqualified index.
Table 1: performance of the algorithm on random 20 samples
(Table 1 appears as an image in the original publication; the per-sample values are not reproduced here.)
From the randomly drawn samples it can be seen that:
1. The average peak amplitude error produced by the model is at most 8.455%.
2. The average peak position error produced by the model is at most 5.761%.
3. In some cases the model may suffer from over-fitting, because heavy noise in the data can reduce its robustness. The experimental results nevertheless show good predictive performance in most cases. We will address this in future work by augmenting the data and improving the model's robustness.
Experimental example 2: all samples
Table 2: performance of the algorithm on all samples
            H1      P1      H2      P2      H3      P3
Mean (%)    8.380   8.396   6.739   7.064   7.684   6.775
Std. dev.   0.104   0.163   0.099   0.210   0.111   0.146
In absolute terms, the average peak gap is 0.0004 and the average position gap is 1.436, where the peak gap is in units of signal strength and the position gap in steps (0.01 m). Moreover, for a random sample input, the probability of an output with a qualified peak relative error is 87.208%, and with a qualified peak position relative error 84.820%, comfortably meeting the 15% evaluation criterion.
Appendix 1: algorithmic pseudocode for basic residual block
Input X
f=conv_f(X)
t=conv_1(X)
t=relu_1(t)
t=bn_1(t)
t=conv_2(t)
t=relu_2(t)
res=bn_2(t)
t=f+res
output=MaxPool(t)
return output
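The pseudocode above can be rendered as runnable Python. The following is a minimal single-channel sketch: the fixed kernels, the single-channel restriction, and the per-sequence batch normalization are simplifications for illustration (the actual network uses learned multi-channel convolutions over batches).

```python
def conv1d(x, w):
    """Length-preserving 1-D convolution: kernel size 3, stride 1,
    zero padding 1, as described in the text."""
    padded = [0.0] + x + [0.0]
    return [sum(w[j] * padded[i + j] for j in range(3))
            for i in range(len(x))]

def relu(x):
    return [max(0.0, v) for v in x]

def bn(x, eps=1e-5):
    """Simplified normalization of one sequence to mean 0, variance 1."""
    m = sum(x) / len(x)
    var = sum((v - m) ** 2 for v in x) / len(x)
    return [(v - m) / (var + eps) ** 0.5 for v in x]

def max_pool(x, size=2):
    """Sliding-window maximum, as the final MaxPool layer."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]

def residual_block(x, w_f, w_1, w_2):
    f = conv1d(x, w_f)                 # mapping path: conv_f(X)
    t = bn(relu(conv1d(x, w_1)))       # residual path, first (a)(b)(c) pass
    res = bn(relu(conv1d(t, w_2)))     # second pass: conv_2, relu_2, bn_2
    summed = [a + b for a, b in zip(f, res)]   # conv_f(x) + res(x)
    return max_pool(summed)            # output = MaxPool(f + res)

out = residual_block([1.0, 2.0, 3.0, 4.0],
                     w_f=[0.0, 1.0, 0.0],    # identity kernel
                     w_1=[0.5, 0.5, 0.5],
                     w_2=[0.0, 1.0, 0.0])
```

Because the convolution preserves length, the mapping-path output and the residual can be added element-wise exactly as the pseudocode requires, and the pooling then halves the length.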
Appendix 2: algorithm pseudocode for deep learning networks
(The pseudocode of Appendix 2 appears as an image in the original document.)
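The pseudocode of Appendix 2 survives only as an image in this copy. As an illustrative stand-in, the following sketch reproduces the data flow of step (3) with strings as feature maps, so the f_i/h_i numbering and the channel concatenations are visible; `block` and `cat` are placeholders, and L is reduced to 3 for readability.

```python
L = 3  # residual blocks per side (the patent's implementation uses 10)

def block(name, x):
    """Placeholder for a basic residual block: records its application."""
    return f"{name}({x})"

def cat(a, b):
    """Placeholder for concatenation along the channel dimension."""
    return f"[{a}|{b}]"

def forward(x):
    # Encoder: f_1 .. f_L, numbered in the order the input passes through.
    enc = []
    h = x
    for i in range(1, L + 1):
        h = block(f"f{i}", h)
        enc.append(h)
    h = block("conv_med", h)   # bridge convolution between the two halves
    # Decoder: h_L .. h_1; block i takes encoder output i concatenated
    # with the previous decoder output (the conv_med output when i == L).
    for i in range(L, 0, -1):
        h = block(f"h{i}", cat(enc[i - 1], h))
    return h

trace = forward("x")
```

Printing `trace` shows the full wiring: the innermost term is conv_med applied to the deepest encoder feature, and each decoder block i receives the skip connection from encoder block i.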
Appendix 3: algorithmic pseudocode for training deep learning networks
(The pseudocode of Appendix 3 appears as an image in the original document.)
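The pseudocode of Appendix 3 likewise survives only as an image in this copy. The control flow it describes (a fixed-budget first stage at learning rate 0.001, then a fine-tuning stage at 0.0001 with validation every 5 iterations and a 50-iteration patience) can be sketched as follows; `run_epoch` and `validate` are placeholders for the real training and validation passes, and the optimizer itself is not modeled.

```python
def mse(z, y):
    """Mean-square-error style loss: ||Z_i - Y_i||^2 for one pair."""
    return sum((a - b) ** 2 for a, b in zip(z, y))

def train(run_epoch, validate, stage1_iters=1000, eval_every=5, patience=50):
    # Stage 1: higher learning rate, fixed iteration budget, no validation
    # (used to estimate a reasonable budget and escape saddle points).
    for _ in range(stage1_iters):
        run_epoch(lr=0.001)
    # Stage 2: lower learning rate; validate every `eval_every` iterations
    # and stop once `patience` iterations pass without improvement.
    best, since_improve, it = float("inf"), 0, 0
    while since_improve < patience:
        run_epoch(lr=0.0001)
        it += 1
        if it % eval_every == 0:
            loss = validate()
            if loss < best:
                best, since_improve = loss, 0
            else:
                since_improve += eval_every
    return best  # a real implementation would also return the best model

# Toy usage: validation loss improves three times, then plateaus.
losses = iter([5.0, 4.0, 3.0] + [3.0] * 40)
best = train(lambda lr: None, lambda: next(losses),
             stage1_iters=3, eval_every=1, patience=5)
```

In the toy run the loop stops five non-improving evaluations after the loss bottoms out, and the historically best value is returned.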

Claims (2)

1. A signal feature extraction and data mining method based on echo measurement data, characterized by adopting deep-learning information processing and comprising the specific steps of:
(1) Additionally generating echo data for each individual target, based on the echoes originally generated jointly by the plurality of targets, as labels for echo separation;
(2) Constructing a basic residual block whose structure comprises: three one-dimensional convolutions, one ReLU activation function, two batch normalization functions, and one maximum pooling layer;
(3) Respectively linking 12 basic residual blocks into an encoder and a decoder, joining the two by a one-dimensional convolution, and finally forming the complete deep learning network with cross-layer links between the encoder and decoder at the same abstraction level;
(4) Dividing the complete data set into training, validation, and test sets at a ratio of 8:1:1 according to the different relative poses at the same target position;
(5) For the deep learning network, first training on the training set a number of times, then lowering the learning rate and continuing to train while validating on the validation set; stopping training once the stopping conditions are met and outputting the network model that performed best on the validation set;
the process described in step (1) of additionally generating echo data for each individual target being: for a given set of target positions and poses with k targets, the echo generated by the first target alone is obtained by keeping the first target's position and relative pose, removing the other targets, and generating the label corresponding to that target; this process is repeated k times to obtain all labels;
in step (2), two paths are set in the basic residual block: one, called the mapping path, maps the input signal directly into a higher-dimensional space with more channels, to obtain a higher-dimensional representation that is easier to process; the other, called the residual path, learns the residual between the higher-dimensional representation obtained in the former path and a better higher-dimensional representation;
in the mapping path, since echo signals are strongly correlated in time, a one-dimensional convolution, denoted conv_f, captures the temporal information and changes the number of input channels; to make the residual addition possible, the stride is set according to the kernel size of this convolution so that the length of the input is unchanged and it can be added directly to the output of the residual path; the mapping result is conv_f(x);
in the residual path, the same structure is used twice, described as follows:
(a) again to capture temporal information, the input first passes through a one-dimensional convolution, denoted conv_1;
(b) a linear rectification function then serves as the activation function, forcing any part of the result below 0 to 0; this part is excluded from gradient backpropagation and is denoted relu_1;
(c) to ease training, batch normalization is applied, denoted bn_1; because inputs and outputs are handled per batch, batch normalization maps all samples in the batch to mean 0 and variance 1 so as to speed up network training;
the input passes through the above (a)(b)(c) path twice; the second pass's one-dimensional convolution, activation function, and batch normalization are denoted conv_2, relu_2, and bn_2 respectively, and the result is the residual, denoted res(x);
finally, a maximum pooling layer produces the final result: a sliding window moves over its input, conv_f(x) + res(x), and the maximum within each window is output; the pooling layer, denoted MaxPool, reduces the number of parameters while retaining the salient features;
in step (5), the specific training steps are: randomly divide the data in the training set into batches of 64 echo signals each; training proceeds in two stages:
the first stage uses a higher learning rate, namely an Adam optimizer with learning rate 0.001, beta1 = 0.9, beta2 = 0.999; iterations are run on the training set only, with MSE as the loss function for gradient backpropagation:
MSE(Z_i, Y_i) = ||Z_i - Y_i||²
the second stage uses a lower learning rate, differing from the first in that the learning rate is reduced to 0.0001; iterations are again run on the training set with the same loss function, performance is evaluated on the validation set every 5 iterations, and if performance does not improve within 50 iterations the model is considered converged, training stops, and the historically best model is output;
in step (5), two indexes evaluate the trained network: the relative error of the peak amplitude and the relative error of the peak position; both relative errors within fifteen percent count as qualified;
the exact definitions of the two performance indexes follow; first, the peak of a signal X is defined as:
φ(X)=max(X)
here the signal X, and every X below, is assumed to be a tensor with a single length dimension;
the peak position of the signal X is defined as:
P(X)=argmax(X)
the relative error in peak size is then:
e_φ(Z, Y) = |φ(Z) - φ(Y)| / φ(Y)
the relative error in peak position is:
e_P(Z, Y) = |P(Z) - P(Y)| / W_k(Y)
the denominator W_k(Y) is defined here as the length over which the peak occurs, i.e., the number of points in the whole wave greater than a certain proportion k of the peak;
since the evaluation indexes depend only on the maximum point of a peak, the input signals of the evaluation indexes must have only one distinct peak;
the gradient is smoothed by averaging over the number of samples in the batch or over the signal length; this loss function is used for gradient backpropagation so that every point of the output signal stays as close to the label as possible, making the output more reliable.
2. The method for signal characteristic extraction and data mining based on echo measurement data according to claim 1, wherein the method of composing the complete deep learning network in step (3) is: take L basic residual blocks and connect them end to end to form the encoder; number them in the order the input passes through them, the first basic residual block being numbered 1 and denoted f_1, and so on in order, the last basic residual block being numbered L;
then apply a one-dimensional convolution, denoted conv_med, to process the encoded features before passing them to the decoder; likewise take L basic residual blocks and connect them end to end to form the decoder, numbered in the order opposite to the encoder: the first basic residual block passed through is numbered L and denoted h_L, and so on in order, the last being numbered 1; the decoder's input is the feature produced by the one-dimensional convolution mentioned above;
finally, establish cross links between the encoder and the decoder: for the decoder's basic residual block numbered i, if i < L, i.e., it is not the first basic residual block passed through, its input is the output of the encoder residual block numbered i spliced along the channel dimension with the output of the decoder residual block numbered i+1; if i = L, it is the output of conv_med that is spliced in as input instead.
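The encoder/conv_med/decoder wiring described in this claim can be sketched as a 1-D U-Net-style network in PyTorch. The internal structure of the basic residual block is not fixed by this claim, so the two-convolution form below, the channel width `ch`, and the final 1x1 output convolution are illustrative assumptions; the end-to-end chaining, the conv_med bottleneck, and the channel-wise splicing of encoder block i with the running decoder feature follow the text.

```python
import torch
from torch import nn

class ResBlock1d(nn.Module):
    """A plausible basic 1-D residual block (internals assumed)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv1d(in_ch, out_ch, 3, padding=1)
        self.conv2 = nn.Conv1d(out_ch, out_ch, 3, padding=1)
        self.skip = nn.Conv1d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()
        self.act = nn.ReLU()

    def forward(self, x):
        h = self.act(self.conv1(x))
        return self.act(self.conv2(h) + self.skip(x))

class EncoderDecoder(nn.Module):
    def __init__(self, L=3, ch=16):
        super().__init__()
        # Encoder blocks f_1 .. f_L, chained end to end.
        self.enc = nn.ModuleList(
            [ResBlock1d(1 if i == 0 else ch, ch) for i in range(L)])
        # One-dimensional convolution between encoder and decoder.
        self.conv_med = nn.Conv1d(ch, ch, 1)
        # Decoder blocks h_L .. h_1; each sees a 2*ch-channel concatenation.
        self.dec = nn.ModuleList([ResBlock1d(2 * ch, ch) for _ in range(L)])
        self.out = nn.Conv1d(ch, 1, 1)
        self.L = L

    def forward(self, x):
        feats = []
        h = x
        for f in self.enc:                 # encoder pass, numbered 1..L
            h = f(h)
            feats.append(h)
        h = self.conv_med(h)
        for i in range(self.L, 0, -1):     # decoder pass, numbered L..1
            # Cross link: splice encoder output i with the running decoder
            # feature (conv_med output when i == L) along the channel axis.
            h = self.dec[self.L - i](torch.cat([feats[i - 1], h], dim=1))
        return self.out(h)
```

Because every residual block preserves signal length, the output has the same length as the input echo signal, which is what lets the MSE loss of claim 1 compare them point by point.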
CN201911268281.8A 2019-12-11 2019-12-11 Signal characteristic extraction and data mining method based on echo measurement data Active CN111126199B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911268281.8A CN111126199B (en) 2019-12-11 2019-12-11 Signal characteristic extraction and data mining method based on echo measurement data


Publications (2)

Publication Number Publication Date
CN111126199A CN111126199A (en) 2020-05-08
CN111126199B true CN111126199B (en) 2023-05-30

Family

ID=70498598

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911268281.8A Active CN111126199B (en) 2019-12-11 2019-12-11 Signal characteristic extraction and data mining method based on echo measurement data

Country Status (1)

Country Link
CN (1) CN111126199B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109036460A (en) * 2018-08-28 2018-12-18 百度在线网络技术(北京)有限公司 Method of speech processing and device based on multi-model neural network
CN109614943A (en) * 2018-12-17 2019-04-12 电子科技大学 A kind of feature extracting method for blind source separating
CN110073301A (en) * 2017-08-02 2019-07-30 强力物联网投资组合2016有限公司 The detection method and system under data collection environment in industrial Internet of Things with large data sets

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5706402A (en) * 1994-11-29 1998-01-06 The Salk Institute For Biological Studies Blind signal processing system employing information maximization to recover unknown signals through unsupervised minimization of output redundancy
CN107064894A (en) * 2017-01-11 2017-08-18 南京御达电信息技术有限公司 A kind of clutter suppression method based on deep learning
WO2019074339A1 (en) * 2017-10-15 2019-04-18 알레시오 주식회사 Signal conversion system and signal conversion method
CN108229404B (en) * 2018-01-09 2022-03-08 东南大学 Radar echo signal target identification method based on deep learning
CN109034070B (en) * 2018-07-27 2021-09-14 河南师范大学 Blind separation method and device for replacement aliasing image
CN109871882A (en) * 2019-01-24 2019-06-11 重庆邮电大学 Method of EEG signals classification based on Gauss Bernoulli convolution depth confidence network
CN110007366B (en) * 2019-03-04 2020-08-25 中国科学院深圳先进技术研究院 Life searching method and system based on multi-sensor fusion
CN110045419B (en) * 2019-05-21 2020-10-16 西南石油大学 Sensor residual self-coding network seismic data denoising method
CN110321810A (en) * 2019-06-14 2019-10-11 华南师范大学 Single channel signal two-way separation method, device, storage medium and processor
CN110428424B (en) * 2019-07-02 2023-04-07 中国航空工业集团公司雷华电子技术研究所 Radar echo image high-voltage line segmentation method based on deep learning
CN110333489B (en) * 2019-07-24 2021-03-09 北京航空航天大学 Processing method for SAR echo data sidelobe suppression by adopting CNN and RSVA combination

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Adaptive image denoising based on an improved stacked sparse denoising autoencoder; Ma Hongqiang; Ma Shiping; Xu Yuelei; Lyu Chao; Zhu Mingming; Acta Optica Sinica (No. 10) *

Similar Documents

Publication Publication Date Title
Kong et al. On fast sampling of diffusion probabilistic models
US20220004870A1 (en) Speech recognition method and apparatus, and neural network training method and apparatus
Grcić et al. Densely connected normalizing flows
CN110827804B (en) Sound event labeling method from audio frame sequence to event label sequence
Jassim et al. New orthogonal polynomials for speech signal and image processing
CN106503659A (en) Action identification method based on sparse coding tensor resolution
Tristanov et al. A new approach to study of geoacoustic emission signals
CN112861066B (en) Machine learning and FFT (fast Fourier transform) -based blind source separation information source number parallel estimation method
CN110458235A (en) Movement posture similarity comparison method in a kind of video
Tengtrairat et al. Single-channel separation using underdetermined blind autoregressive model and least absolute deviation
Seichepine et al. Piecewise constant nonnegative matrix factorization
Zhou et al. Immune K-SVD algorithm for dictionary learning in speech denoising
Zhang et al. Temporal Transformer Networks for Acoustic Scene Classification.
Chien et al. Learning flow-based disentanglement
Chen et al. Data augmentation for environmental sound classification using diffusion probabilistic model with top-k selection discriminator
CN111695444B (en) Wave atom transformation-based radiation source individual feature extraction method
CN111126199B (en) Signal characteristic extraction and data mining method based on echo measurement data
Moghrabi Implicit extra-update multi-step quasi-newton methods
CN111488486A (en) Electronic music classification method and system based on multi-sound-source separation
Narayanaswamy et al. Audio source separation via multi-scale learning with dilated dense u-nets
Kalkan et al. Online feature selection and classification
Muñoz-Romero et al. Nonnegative OPLS for supervised design of filter banks: application to image and audio feature extraction
Sari et al. Texture defect detection using independent vector analysis in wavelet domain
CN108280470B (en) Discrete wavelet domain copula model image classification method
Kawashima et al. Automatic piano music transcription by hadamard product of low-rank NMF and CNN/CDAE outputs

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant