CN116187206B

CN116187206B - COD spectral data migration method based on a generative adversarial network

Info

Publication number: CN116187206B
Application number: CN202310450642.0A
Authority: CN
Inventors: 张颖颖; 侯士伟; 袁达; 吴丙伟; 冯现东; 曹璐; 程岩; 王茜
Original assignee: Institute of Oceanographic Instrumentation Shandong Academy of Sciences
Current assignee: Institute of Oceanographic Instrumentation Shandong Academy of Sciences
Priority date: 2023-04-25
Filing date: 2023-04-25
Publication date: 2023-07-07
Anticipated expiration: 2043-04-25
Also published as: CN116187206A

Abstract

The invention discloses a COD spectral data migration method based on a generative adversarial network, which relates to the technical field of seawater detection and comprises the following steps. Data acquisition: COD spectral data of seawater samples are collected in the field as source domain data, and target domain data are determined. Definition of the network structures of the generator G and the discriminator D in the WGAN-GP network: the generator G uses an LSTM as its network structure, and the discriminator D uses fully connected layers as its network structure. Training of the WGAN-GP network: the discriminator D and the generator G are trained using the source domain data and the target domain data. The source domain data are then input into the trained WGAN-GP network to obtain simulated spectral data, and the similarity between the simulated spectral data and the target domain spectral data is compared. The invention effectively addresses the differing spectral characteristics and insufficient samples caused by the different COD compositions of different areas, and improves the accuracy and reliability of identification of seawater COD spectral data.

Description

COD spectral data migration method based on a generative adversarial network

Technical Field

The invention relates to the technical field of seawater detection, and in particular to a COD spectral data migration method based on a generative adversarial network.

Background

Chemical oxygen demand (COD) is a comprehensive index of organic pollution in the marine environment, and the sensitive band of its spectrum depends on the type and concentration of the organic substances in the water. Because the COD composition differs from area to area and the number of samples available from any single area is limited, spectral characteristics vary and samples are insufficient, which makes it difficult to process the spectral data with deep learning.

Disclosure of Invention

To overcome the problems in the prior art, the invention provides a COD spectral data migration method based on a generative adversarial network.

The technical scheme adopted to solve the technical problem is a COD spectral data migration method based on a generative adversarial network, comprising the following steps:

step 1, data acquisition: collecting COD spectral data of seawater samples in the field as source domain data, and determining target domain data;

step 2, defining the network structures of the generator G and the discriminator D in the WGAN-GP network: the generator G uses an LSTM as its network structure, and the discriminator D uses fully connected layers as its network structure;

step 3, training the WGAN-GP network: training the discriminator D and the generator G using the source domain data and the target domain data obtained in step 1;

step 4, inputting the source domain data of step 1 into the WGAN-GP network trained in step 3 to obtain simulated spectral data, comparing the similarity between the simulated spectral data and the target domain spectral data by comparing the overlapping portions of the distributions of the two groups of data, and verifying the effectiveness of the model.
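The distribution comparison in step 4 can be illustrated with a minimal sketch. The patent does not state how the overlap is measured; the histogram overlap coefficient used here, the function name `overlap_coefficient`, the bin count and the toy values are all assumptions made for illustration only.

```python
def overlap_coefficient(a, b, bins=10):
    """Histogram overlap of two 1-D samples (assumed metric, not from the patent).

    Returns 1.0 when both normalized histograms coincide and 0.0 when they
    share no bin.
    """
    lo = min(min(a), min(b))
    hi = max(max(a), max(b))
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal

    def histogram(values):
        counts = [0] * bins
        for v in values:
            idx = min(int((v - lo) / width), bins - 1)  # largest value goes in the last bin
            counts[idx] += 1
        return [c / len(values) for c in counts]

    ha, hb = histogram(a), histogram(b)
    return sum(min(p, q) for p, q in zip(ha, hb))


# Toy absorbance values standing in for simulated and target domain spectra.
simulated = [0.41, 0.45, 0.47, 0.50, 0.52]
target = [0.40, 0.44, 0.48, 0.49, 0.55]
print(overlap_coefficient(simulated, target, bins=5))
```

A value near 1 would suggest that the simulated spectra occupy the same range of values as the target domain spectra; any acceptance level is left to the practitioner.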

In the above COD spectral data migration method based on a generative adversarial network, in step 2 the generator G comprises two LSTM layers and one fully connected network, and the discriminator D comprises three fully connected layers.

In the above COD spectral data migration method based on a generative adversarial network, step 3 specifically comprises:

step 3.1, selecting m samples from the target domain data, inputting them into the discriminator D, and labeling them as real;

step 3.2, selecting m samples from the source domain data and inputting them into the generator G to obtain simulated spectral data;

step 3.3, inputting the simulated spectral data obtained in step 3.2 into the discriminator D, which calculates the Wasserstein distance between the data of step 3.1 and the simulated spectral data;

step 3.4, judging whether the Wasserstein distance obtained in step 3.3 meets a set threshold; if it does, outputting the simulated spectral data; if it does not, repeating steps 3.1 to 3.3 until the Wasserstein distance meets the set threshold, and then outputting the simulated spectral data.
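The control flow of steps 3.1 to 3.4 can be sketched with a deliberately tiny stand-in model. This is not the patent's LSTM generator or fully connected discriminator: the toy generator is a single shift, G(t) = t + shift, and the toy discriminator is the best linear function D(x) = w·x with |w| ≤ 1, so the estimated distance reduces to the absolute difference of batch means. The function name, learning rate, iteration cap and data are hypothetical.

```python
import random


def train_until_threshold(target, source, m=5, threshold=0.01,
                          lr=0.005, max_iters=10_000, seed=0):
    """Toy loop over steps 3.1-3.4 (assumed stand-in, not the patented networks)."""
    rng = random.Random(seed)
    shift = 0.0  # the toy generator's only parameter: G(t) = t + shift
    for iteration in range(1, max_iters + 1):
        real = rng.sample(target, m)                       # step 3.1: m target samples
        fake = [t + shift for t in rng.sample(source, m)]  # step 3.2: m simulated samples
        gap = sum(real) / m - sum(fake) / m
        w = 1.0 if gap >= 0 else -1.0                      # best linear D(x) = w*x with |w| <= 1
        distance = w * gap                                 # step 3.3: E[D(x)] - E[D(G(t))] = |gap|
        if distance <= threshold:                          # step 3.4: stop once the threshold is met
            return shift, distance, iteration
        shift += lr * w  # generator step: the loss -E[D(G(t))] has derivative -w w.r.t. shift
    raise RuntimeError("threshold not reached within max_iters")


target = [1.0, 1.2, 1.4, 1.6, 1.8]
source = [0.5, 0.7, 0.9, 1.1, 1.3]
shift, distance, iterations = train_until_threshold(target, source, m=5)
print(shift, distance, iterations)
```

The default threshold of 0.01 is taken from the embodiment; the step size is chosen smaller than the threshold so that this toy loop cannot step over it.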

The Wasserstein distance in step 3.3 is calculated as:

$$L = E_{t\sim P_t}[D(G(z))] - E_{x\sim P_r}[D(x)] + \lambda\, E_{\hat{x}\sim P_{\hat{x}}}\left[\left(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\right)^2\right],$$

where $P_r$ is the probability distribution of the target data and $P_t$ is the probability distribution of the source data; $\hat{x}$ is sampled from the space between the two distributions $P_r$ and $P_t$; $D(x)$ and $D(G(z))$ represent the Wasserstein distances of the real target domain data and the simulated spectral data, respectively, and $E_{x\sim P_r}[D(x)]$ and $E_{t\sim P_t}[D(G(z))]$ represent the corresponding expected values; $\lambda$ is a hyperparameter, and $\lambda\, E_{\hat{x}\sim P_{\hat{x}}}[(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1)^2]$ is the gradient penalty term.

In the above COD spectral data migration method based on a generative adversarial network, the gradient penalty term is obtained as follows:

for each target domain sample x and simulated spectral data sample t, an interpolated sample $\hat{x}$ is calculated between them:

$$\hat{x} = \epsilon x + (1 - \epsilon)\, t,$$

where $\epsilon$ is a random number in the range 0 to 1;

the gradient norm at the interpolated sample $\hat{x}$ is calculated:

$$\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2,$$

where $\lVert \cdot \rVert_2$ denotes the $L_2$ norm of a vector and $\nabla_{\hat{x}} D(\hat{x})$ denotes the gradient of the discriminator D with respect to $\hat{x}$;

the squared deviation of the norm from 1 is multiplied by the hyperparameter $\lambda$ to give the gradient penalty term.
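The gradient penalty steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the gradient is approximated by central finite differences rather than by automatic differentiation, the critic is any Python callable, and the default λ = 10 is a commonly used value that the patent does not specify. The names `gradient_penalty`, `lam`, `eps` and `h` are hypothetical.

```python
import math
import random


def gradient_penalty(critic, x, t, lam=10.0, eps=None, h=1e-5):
    """lam * (||grad D(x_hat)||_2 - 1)^2 at x_hat = eps*x + (1 - eps)*t.

    Assumptions: finite-difference gradient; lam = 10 is not from the patent.
    """
    if eps is None:
        eps = random.random()  # random number in [0, 1)
    # Interpolated sample between a target sample x and a simulated sample t.
    x_hat = [eps * xi + (1.0 - eps) * ti for xi, ti in zip(x, t)]
    # Central finite-difference approximation of the gradient of D at x_hat.
    grad = []
    for i in range(len(x_hat)):
        plus = list(x_hat)
        minus = list(x_hat)
        plus[i] += h
        minus[i] -= h
        grad.append((critic(plus) - critic(minus)) / (2.0 * h))
    norm = math.sqrt(sum(g * g for g in grad))  # L2 norm of the gradient
    return lam * (norm - 1.0) ** 2


# A linear critic with unit gradient norm incurs (almost) no penalty.
unit_critic = lambda v: 0.6 * v[0] + 0.8 * v[1]
print(gradient_penalty(unit_critic, [0.2, 0.4], [0.3, 0.1]))
```

In a real implementation the gradient would come from the framework's automatic differentiation and the penalty would be averaged over the batch before being added to the discriminator loss.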

In the above COD spectral data migration method based on a generative adversarial network, the threshold in step 3.4 is set in the range 0.001 to 0.1.

The beneficial effects of the invention are as follows. Migration of COD spectral data from the source domain to the target domain is achieved by a generative adversarial network: taking the sample distribution of the real spectral data of the target domain as the basis, the two networks, generator G and discriminator D, learn adversarially to identify and simulate the sample distribution of the target domain COD spectral data, and sampling from that distribution achieves COD spectrum migration between different areas. In addition, a WGAN-GP generator with an improved architecture generates simulated data highly similar to the real data, enabling migration of seawater COD spectral data between different areas. The method effectively addresses the differing spectral characteristics and insufficient samples caused by the different COD compositions of different areas, and improves the accuracy and reliability of identification of seawater COD spectral data.

Drawings

The invention will be further described with reference to the drawings and examples.

Fig. 1 is a schematic diagram of a WGAN-GP network structure according to an embodiment of the present invention;

Fig. 2 is a comparison of the generated sample spectra in an embodiment of the present invention, wherein Fig. 2(a) shows the source domain spectra of 5 samples, Fig. 2(b) shows the simulated spectra of 5 samples, and Fig. 2(c) shows the target domain spectra of 5 samples.

Detailed Description

The present invention will be described in detail below with reference to the drawings and detailed description to enable those skilled in the art to better understand the technical scheme of the present invention.

This embodiment discloses a COD spectral data migration method based on a generative adversarial network, comprising steps 1 to 4 as set out above.

In this embodiment, the specific structure of the WGAN-GP network is shown in Fig. 1: the generator G comprises two LSTM layers and one fully connected network, and the discriminator D comprises three fully connected layers.

Step 3 is carried out through steps 3.1 to 3.4 as set out above.

The Wasserstein distance is calculated as:

$$L = E_{t\sim P_t}[D(G(z))] - E_{x\sim P_r}[D(x)] + \lambda\, E_{\hat{x}\sim P_{\hat{x}}}\left[\left(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\right)^2\right],$$

where $P_r$ is the probability distribution of the target data and $P_t$ is the probability distribution of the source data; $\hat{x}$ is sampled from the space between $P_r$ and $P_t$; $D(x)$ and $D(G(z))$ represent the Wasserstein distances of the real target domain data and the simulated spectral data, respectively, and $E_{x\sim P_r}[D(x)]$ and $E_{t\sim P_t}[D(G(z))]$ represent the corresponding expected values; $\lambda$ is a hyperparameter, and $\lambda\, E_{\hat{x}\sim P_{\hat{x}}}[(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1)^2]$ is the gradient penalty term.

The gradient penalty term smooths the gradient of the discriminator at the interpolated samples, which makes training of the generator more stable. It can also improve the quality and diversity of the samples generated by the WGAN-GP. The gradient penalty term is obtained as follows:

for each target domain sample x and simulated spectral data sample t, an interpolated sample $\hat{x}$ is calculated between them:

$$\hat{x} = \epsilon x + (1 - \epsilon)\, t,$$

where $\epsilon$ is a random number in the range 0 to 1;

the gradient norm at the interpolated sample $\hat{x}$ is calculated:

$$\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2,$$

where $\lVert \cdot \rVert_2$ denotes the $L_2$ norm of a vector and $\nabla_{\hat{x}} D(\hat{x})$ denotes the gradient of the discriminator D with respect to $\hat{x}$;

the squared deviation of the norm from 1 is multiplied by the hyperparameter $\lambda$ to give the gradient penalty term.

The threshold is set empirically in the range 0.001 to 0.1; in this embodiment it is set to 0.01.

The background of the WGAN-GP network in this embodiment is as follows. A generative adversarial network (GAN) consists of two parts, a generator and a discriminator. The generator produces simulated data for the target domain, and the discriminator distinguishes the real data of the target domain from the simulated data. The goal of the generator is to produce simulated data so similar to the target domain data that the discriminator classifies it as real. The goal of the discriminator is to keep telling the simulated data produced by the generator apart from the real data. Through the competition between these two networks, the GAN converts COD spectral data from the source domain to the target domain.

Example 1

Two COD data sets from different seawater regions are prepared; the target domain data x follow the distribution $P_r(x)$ and the source domain data t follow the distribution $P_t(t)$. A variable t is first sampled from $P_t(t)$ and fed to the generator to produce simulated data $x' = G(t)$. The existing real data x and the simulated data x' are then labeled: the real data x as 1 and the simulated data x' as 0. The labeled real data x and simulated data x' are fed together into the discriminator, which is trained by supervised learning. Meanwhile, the generator is trained according to the performance of the discriminator, so that the two networks compete against each other. The loss function in GAN training is:

$$\min_G \max_D V(D, G) = E_{x\sim P_r}[\log D(x)] + E_{t\sim P_t}[\log(1 - D(G(t)))] \qquad (1)$$

In formula (1), the discriminator and the generator are denoted D and G, the target domain data is x, the source domain data is t, and E denotes an expected value. Because the logarithm is monotonic, the discriminator D seeks to drive D(G(t)) towards 0 and D(x) towards 1, so that formula (1) takes its maximum value, while the generator G seeks to drive D(G(t)) towards 1, so that formula (1) takes its minimum value. Here D(x) is the probability that the discriminator assigns to real target domain data being real, G(t) is the data produced by the generator, and log is the natural logarithm. The first expectation reflects the classification accuracy of the discriminator on real target domain data, and the second its classification accuracy on generated data. The goal of the GAN is therefore to maximize the ability of D to distinguish generated data from real target data while minimizing the difference between the distribution of the data generated by G and that of the target data, which produces the adversarial effect between the two networks. The loss function of this adversarial scheme is a cross-entropy loss function.
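To make the behaviour of the cross-entropy objective concrete, the following sketch evaluates the value of formula (1) on a batch of discriminator outputs. The function name `gan_value` and the example probabilities are hypothetical; the discriminator outputs are assumed to lie strictly between 0 and 1.

```python
import math


def gan_value(d_real, d_fake):
    """Batch estimate of E[log D(x)] + E[log(1 - D(G(t)))].

    d_real: discriminator probabilities for real target domain samples.
    d_fake: discriminator probabilities for generated samples.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# A confident, correct discriminator pushes the value towards its maximum of 0.
print(gan_value([0.99, 0.98], [0.01, 0.02]))
# A discriminator that cannot tell the two apart yields 2 * log(0.5).
print(gan_value([0.5, 0.5], [0.5, 0.5]))
```

The discriminator tries to raise this value, and the generator tries to lower it by making D(G(t)) approach 1.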

However, the cross-entropy loss function suffers from problems such as mode collapse and vanishing gradients, which make GAN training unstable. To solve these problems, the Wasserstein distance (also known as the Earth Mover's Distance) is used to measure the distance between two probability distributions. Optimizing the Wasserstein distance provides the generator G with a stable gradient, which reduces the distance between the generated distribution and the target distribution and pulls the two distributions closer together.

The Wasserstein distance is calculated as follows. For the probability distributions $P_r(x)$ and $P_t(t)$ of the target data and the source data, with the expectations of a function f under them being $E_{x\sim P_r}[f(x)]$ and $E_{x\sim P_t}[f(x)]$ respectively, the Wasserstein distance between them can be expressed as:

$$W(P_r, P_t) = \frac{1}{K} \sup_{\lVert f \rVert_L \le K} \left( E_{x\sim P_r}[f(x)] - E_{x\sim P_t}[f(x)] \right),$$

where $\lVert f \rVert_L \le K$ indicates that there exists a constant $K$ such that any two values $x_1$ and $x_2$ in the domain satisfy:

$$\lvert f(x_1) - f(x_2) \rvert \le K \lvert x_1 - x_2 \rvert.$$

In calculating the Wasserstein distance from the distribution $P_r(x)$ to the distribution $P_t(t)$, the Lipschitz constant K of the function f(x) must also be taken into account. Specifically, f(x) in the formula is an evaluation function, E denotes an expected value, and sup denotes the supremum (least upper bound). The formula states that, over all functions f(x) satisfying the constraint, the supremum of $E_{x\sim P_r}[f(x)] - E_{x\sim P_t}[f(x)]$ is taken and then multiplied by 1/K to obtain the value of the Wasserstein distance. The advantage of this formula is that the Wasserstein distance can be calculated by optimizing the function f(x), without directly calculating the distance between the two distributions.
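The relationship between the Wasserstein distance and the supremum over Lipschitz functions can be checked numerically in one dimension. The sketch below is an illustration, not part of the patented method: it relies on the known fact that, for two equal-size one-dimensional empirical samples, the Wasserstein-1 distance equals the mean absolute difference of the sorted values, and it compares that value with the expectation gap of a hand-picked 1-Lipschitz function (K = 1). The function names and the data are hypothetical.

```python
def wasserstein_1d(a, b):
    """Exact W1 between two equal-size 1-D empirical samples (sorted matching)."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal sample counts")
    return sum(abs(p - q) for p, q in zip(sorted(a), sorted(b))) / len(a)


def dual_gap(f, a, b):
    """E_a[f] - E_b[f]; never exceeds W1(a, b) when f is 1-Lipschitz."""
    return sum(map(f, a)) / len(a) - sum(map(f, b)) / len(b)


p_r = [0.0, 2.0]  # stand-in samples for the target distribution
p_t = [1.0, 1.0]  # stand-in samples for the source distribution

print(wasserstein_1d(p_r, p_t))                   # 1.0
print(dual_gap(lambda v: v, p_r, p_t))            # 0.0, a loose lower bound
print(dual_gap(lambda v: abs(v - 1.0), p_r, p_t)) # 1.0, attains the supremum here
```

Searching for the function that maximizes the gap is what the discriminator does during training; the better the function, the tighter the estimate of the distance.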

The essence of the Wasserstein distance is to measure the minimum transport cost from one distribution to another, i.e. the minimum cost required to move the mass of one distribution onto the other. Using the Wasserstein distance as the loss function, in place of the binary classification of the cross-entropy loss function, improves the training stability of the GAN and the quality of the generated data. Taking the Lipschitz constant K of the evaluation function f(x) into account further improves the efficiency and accuracy of the Wasserstein distance calculation.

The loss function of the discriminator D of the WGAN is thus defined as:

$$L_D = E_{t\sim P_t}[D(G(z))] - E_{x\sim P_r}[D(x)],$$

and the loss function of the generator G is defined as:

$$L_G = -E_{t\sim P_t}[D(G(z))].$$

The Wasserstein distance is an important indicator of the difference between generated samples and real samples. Here $D(x)$ and $D(G(z))$ represent the Wasserstein distances of the real target domain data and the generated target domain data, respectively, and $E_{x\sim P_r}[D(x)]$ and $E_{t\sim P_t}[D(G(z))]$ represent the corresponding expected values. In WGAN, $E_{t\sim P_t}[D(G(z))] - E_{x\sim P_r}[D(x)]$ is minimized, i.e. the expected Wasserstein distance between the generated samples and the real samples is minimized, thereby training the generator and the discriminator.
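The two WGAN losses described above reduce to simple batch averages of discriminator outputs. The sketch below shows this under the assumption that the discriminator returns an unbounded real score per sample; the function names and the example scores are hypothetical.

```python
def discriminator_loss(d_real, d_fake):
    """E[D(G(z))] - E[D(x)], estimated on one batch of scores."""
    return sum(d_fake) / len(d_fake) - sum(d_real) / len(d_real)


def generator_loss(d_fake):
    """-E[D(G(z))], estimated on one batch of scores."""
    return -sum(d_fake) / len(d_fake)


d_real = [2.0, 4.0]    # scores for real target domain samples
d_fake = [-1.0, 1.0]   # scores for simulated samples

print(discriminator_loss(d_real, d_fake))  # -3.0
print(generator_loss(d_fake))              # -0.0 (mean fake score is 0)
```

The negative of the discriminator loss is the batch estimate of the distance that step 3.4 compares with the threshold.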

However, directly computing the gradient of the Wasserstein distance is not feasible, because all possible joint distributions cannot be enumerated. To solve this problem, an approximation is used: the discriminator function approximates the Wasserstein distance, which allows the distance to be optimized effectively. Specifically, WGAN approximates the Wasserstein distance using the discriminator function D(x), i.e.

$$W(P_r, P_t) \approx E_{x\sim P_r}[D(x)] - E_{t\sim P_t}[D(G(z))].$$

This approximation can be achieved by gradient descent, but since the Lipschitz constant of D (x) can be very large, it is not feasible to calculate the gradient directly.

Thus, the gradient penalty technique is employed to limit the Lipschitz constant of D (x), thereby achieving efficient optimization of Wasserstein distance. Specifically, the WGAN adopts a gradient constraint mode to enable the equation to meet the requirement of Lipschitz continuity, namely, by punishing the gradient of the discriminator, the norm of the gradient is limited to not exceed a preset threshold value. In this way, the Lipschitz constant of D (x) can be effectively controlled, thereby achieving effective optimization of the Wasserstein distance.

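The equation is made to satisfy Lipschitz continuity by a gradient constraint, as follows:

$$L = E_{t\sim P_t}[D(G(z))] - E_{x\sim P_r}[D(x)] + \lambda\, E_{\hat{x}\sim P_{\hat{x}}}\left[\left(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\right)^2\right],$$

where $P_r$ is the probability distribution of the target data and $P_t$ is the probability distribution of the source data; $\hat{x}$ is sampled from the space between the two distributions $P_r$ and $P_t$; $\lambda$ is a hyperparameter, and $\lambda\, E_{\hat{x}\sim P_{\hat{x}}}[(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1)^2]$ is the gradient penalty term.

The gradient penalty term is obtained as follows. For each target domain sample x and simulated spectral data sample t, an interpolated sample $\hat{x}$ is calculated between them:

$$\hat{x} = \epsilon x + (1 - \epsilon)\, t,$$

where $\epsilon$ is a random number in the range 0 to 1;

the gradient norm at the interpolated sample $\hat{x}$ is calculated:

$$\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2,$$

where $\lVert \cdot \rVert_2$ denotes the $L_2$ norm of a vector and $\nabla_{\hat{x}} D(\hat{x})$ denotes the gradient of the discriminator D with respect to $\hat{x}$;

the squared deviation of the norm from 1 is multiplied by the hyperparameter $\lambda$ and, as the gradient penalty term, is added to the loss function of the discriminator.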

The gradient penalty term smooths the gradient of the discriminator at the interpolated samples, which makes training of the generator more stable. It can also improve the quality and diversity of the samples generated by the WGAN-GP.

In this embodiment, the source domain of the COD spectral data is seawater samples actually collected from the gulf, and the target domain is COD seawater samples prepared from artificial seawater and o-benzene; the data size is 63×601. A Wasserstein GAN with Gradient Penalty (WGAN-GP) network is implemented, in which an LSTM is used as the network structure of the generator G and fully connected layers are used as the network structure of the discriminator D.

First, the hyperparameters are defined: the batch size (number of samples per batch) is 5; the learning rate is 0.0002; the number of training epochs is 1000; the discriminator is updated 5 times before each update of the generator; the weight clipping threshold is 0.01; the latent variable dimension is 1; the number of spectral features is 601 (this depends on the number of variables per sample; the conversion of samples with 601 variables is taken as the example here); and the hidden dimension of the LSTM is 128. These hyperparameters are adjusted manually according to how well the network trains.

Next, the network structures of the generator G and the discriminator D are defined. The generator G comprises two LSTM layers followed by a fully connected network of 601×601; the discriminator D comprises three fully connected layers, the first of 601×1024, the second of 1024×512, and the third of 512×1. In the generator G, the source domain data are input into the LSTM, and the output of the last time step of the LSTM is taken as the generated simulated seawater COD spectral data. In the discriminator D, LeakyReLU is used as the activation function to avoid the vanishing gradient problem.
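The discriminator layer sizes can be sanity-checked with a short sketch. It reads the second layer as 1024×512 so that its output matches the 512-wide input of the third layer, and it assumes that each layer has a bias vector and that the LeakyReLU negative slope is 0.2; neither the bias nor the slope is stated in the patent, and all names below are hypothetical.

```python
DISCRIMINATOR_LAYERS = [(601, 1024), (1024, 512), (512, 1)]  # (inputs, outputs) per layer


def layers_chain(layers):
    """True when every layer's output width equals the next layer's input width."""
    return all(layers[i][1] == layers[i + 1][0] for i in range(len(layers) - 1))


def count_parameters(layers, bias=True):
    """Number of weights (plus biases, if assumed) in a stack of dense layers."""
    return sum(n_in * n_out + (n_out if bias else 0) for n_in, n_out in layers)


def leaky_relu(v, negative_slope=0.2):
    """LeakyReLU: identity for positive inputs, small slope for negative ones."""
    return v if v > 0 else negative_slope * v


print(layers_chain(DISCRIMINATOR_LAYERS))      # True
print(count_parameters(DISCRIMINATOR_LAYERS))  # 1141761
print(leaky_relu(-1.0), leaky_relu(2.0))       # -0.2 2.0
```

Because LeakyReLU keeps a non-zero slope for negative inputs, gradients still flow through units that would be switched off by a plain ReLU.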

Then, a loss function and optimizer are defined for calculating gradient penalty term and Wasserstein distance.

Finally, training of the model begins. In each iteration, the discriminator D is trained first and then the generator G. When training the discriminator D, the gradient penalty term is calculated and added to the loss function. When training the generator G, the seawater COD data collected in the source domain, with 601 variable points per sample, are taken as input, and simulated spectral data with 601 variables are output through the LSTM layers and the fully connected network. The target domain data with 601 variables are input into the discriminator D, which outputs its judgment; the generated simulated spectral data are likewise input into the discriminator D, which outputs its judgment. The WGAN-GP method no longer performs classification but evaluates the distance between the two distributions. The loss function of the generator G uses equation (3).

The quality of the output of the generator G is then judged. If it meets the requirement, the result is output; if it does not, a batch of samples is selected for a single round of training, the discriminator D is trained for 5 cycles, and the generator G is then trained further. This is repeated until the output of the generator G meets the requirement. Through this iteration, the generator G and the discriminator D are continually trained against each other, which improves the quality of the data generated by G and brings it closer to the target domain data x.

Fig. 2 is a comparison of the generated spectra of 5 samples, wherein Fig. 2(a) shows the source domain spectra of 5 samples, Fig. 2(b) shows the simulated spectra of 5 samples, and Fig. 2(c) shows the target domain spectra of 5 samples. By calculation, the real source domain data have an expectation of 0.457 and a variance of 0.036, and the simulated spectral data have an expectation of 0.468 and a variance of 0.041. This indicates that the centre of the Gaussian distribution of the simulated spectral data differs only slightly from that of the source domain data, while the distribution is somewhat more dispersed.

Table 1 lists the d_loss and g_loss values of the neural network for different numbers of runs, together with the sample misclassification rates on the training set and the test set, where d_loss is the value of the discriminator loss function and g_loss is the value of the generator loss function. As Table 1 shows, the d_loss and g_loss values decrease gradually as the number of runs increases, indicating that the performance of the network keeps improving. The sample misclassification rates on the training set and the test set also decrease gradually, indicating that the generalization ability of the network is gradually strengthened. At higher numbers of runs, however, the sample misclassification rate rises again, possibly because of overfitting. In practical applications, an appropriate number of runs should therefore be chosen according to the circumstances.

The above embodiments are only exemplary embodiments of the present invention and are not intended to limit the present invention, the scope of which is defined by the claims. Various modifications and equivalent arrangements of this invention will occur to those skilled in the art, and are intended to be within the spirit and scope of the invention.

Claims

1. A COD spectral data migration method based on a generative adversarial network, characterized by comprising the following steps:

step 4, inputting the source domain data of step 1 into the WGAN-GP network trained in step 3 to obtain simulated spectral data, comparing the similarity between the simulated spectral data and the target domain spectral data by comparing the overlapping portions of the distributions of the two groups of data, and verifying the effectiveness of the model;

step 3 specifically comprises:

step 3.4, judging whether the Wasserstein distance obtained in step 3.3 meets a set threshold; if it does, outputting the simulated spectral data; if it does not, repeating steps 3.1 to 3.3 until the Wasserstein distance meets the set threshold, and then outputting the simulated spectral data;

wherein $P_r$ is the probability distribution of the target data and $P_t$ is the probability distribution of the source data; $\hat{x}$ is sampled from the space between the two distributions $P_r$ and $P_t$; $D(x)$ and $D(G(z))$ represent the Wasserstein distances of the real target domain data and the simulated spectral data, respectively, and $E_{x\sim P_r}[D(x)]$ and $E_{t\sim P_t}[D(G(z))]$ represent the corresponding expected values; $\lambda$ is a hyperparameter, and $\lambda\, E_{\hat{x}\sim P_{\hat{x}}}[(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1)^2]$ is the gradient penalty term.

2. The COD spectral data migration method based on a generative adversarial network according to claim 1, wherein in step 2 the generator G comprises two LSTM layers and one fully connected network, and the discriminator D comprises three fully connected layers.

3. The COD spectral data migration method based on a generative adversarial network according to claim 1, wherein the gradient penalty term is obtained as follows:

for each target domain sample x and simulated spectral data sample t, an interpolated sample $\hat{x}$ is calculated between them:

$$\hat{x} = \epsilon x + (1 - \epsilon)\, t,$$

wherein $\epsilon$ is a random number in the range 0 to 1;

the gradient norm at the interpolated sample $\hat{x}$ is calculated:

$$\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2,$$

wherein $\lVert \cdot \rVert_2$ denotes the $L_2$ norm of a vector and $\nabla_{\hat{x}} D(\hat{x})$ denotes the gradient of the discriminator D with respect to $\hat{x}$;

the squared deviation of the norm from 1 is multiplied by the hyperparameter $\lambda$ to give the gradient penalty term.

4. The COD spectral data migration method based on a generative adversarial network according to claim 1, wherein the threshold in step 3.4 is set in the range 0.001 to 0.1.