CN116777196A - LSTM-DFGAN-based grain processing process pollutant data expansion and risk prediction method - Google Patents

LSTM-DFGAN-based grain processing process pollutant data expansion and risk prediction method Download PDF

Info

Publication number
CN116777196A
CN116777196A (application CN202310343276.9A)
Authority
CN
China
Prior art keywords
data
model
lstm
input
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310343276.9A
Other languages
Chinese (zh)
Inventor
王立
郑浪
金学波
王小艺
于家斌
白玉廷
赵峙尧
郭天洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Technology and Business University
Original Assignee
Beijing Technology and Business University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Technology and Business University filed Critical Beijing Technology and Business University
Priority to CN202310343276.9A priority Critical patent/CN116777196A/en
Publication of CN116777196A publication Critical patent/CN116777196A/en
Pending legal-status Critical Current

Classifications

    • G06Q10/0635 Risk analysis of enterprise or organisation activities
    • G06F18/214 Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/24323 Tree-organised classifiers
    • G06N3/0442 Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N3/045 Combinations of networks
    • G06N3/0475 Generative networks
    • G06N3/094 Adversarial learning
    • G06Q50/02 Agriculture; Fishing; Forestry; Mining


Abstract

The invention discloses an LSTM-DFGAN-based method for expanding pollutant data and predicting risk in the grain processing process, belonging to the field of food safety. First, raw pollutant data from the grain processing process are obtained and processed into one-dimensional data with 1 row and N×(I+J) columns, which are input into TimeGAN for expansion; each group of expanded data is divided into a training set and a test set, and LSTM and GAN models are trained. The input data of the training set are then fed into the trained LSTM/GAN model to produce predicted data; the predicted output is fused with the training-set input to serve as the input of a DF model, and the training-set output serves as the DF model's output, with which the DF model is trained. Based on the trained DF, DF is embedded into GAN to establish a DFGAN model. Finally, on the basis of the DFGAN model, DF is replaced with the LSTM-DF model to establish the LSTM-DFGAN model, realizing accurate prediction of pollutants in the grain processing process. The method has the advantages of an original fusion of models and a better prediction effect.

Description

LSTM-DFGAN-based grain processing process pollutant data expansion and risk prediction method
Technical Field
The invention relates to a method for expanding pollutant data and predicting risk in a grain processing process, and belongs to the field of food safety.
Background
Grain is fundamental to survival; China's grain output reached 630 million tons in 2021. Metal pollutants and microbial pollutants are the main pollutants present in grain and are widespread across many links of the processing process. These pollutants directly reduce grain quality, and at excessive concentrations can even threaten human health. The risk prediction of contaminants in grain processing is therefore a problem of research interest.
Grain processing is an intermittent (batch) process. Compared with a continuous industrial process, its procedure is more complicated: the process mechanism characteristics do not change smoothly with time, but change regularly with the processing progress, so the process is segmented, with each segment showing different process-variable characteristics. The process is complex and the pollutant types are numerous: metal pollutants include Pb, Cr, Hg, As, Cd and other metals, and microbial pollutants include aflatoxin B1, vomitoxin (deoxynivalenol) and the like. Data acquisition is therefore difficult and the amount of data is small; training directly on the collected data easily overfits and gives a poor prediction effect, so data expansion is needed, yet existing research has not expanded pollutant data from the grain processing process. Likewise, existing research has not proposed a suitable method for predicting pollutants in the grain processing process.
Predicting grain processing pollutants means modelling their change, so that the growth trend of grain pollutants can be reflected by a model. Prediction models for grain processing pollutants divide into mechanism-driven and data-driven models. Mechanism-based prediction describes pollutant growth with formulas and can predict the future pollutant amount or the probability that pollutants exceed standards. A data-driven prediction model trains a sufficiently complex model on data, learns deep features among the data and finds the variation laws among the data to achieve prediction. Data-driven prediction models have shown good prediction effects, especially in recent years, and can be applied to predicting pollutants in the grain processing process.
Therefore, having recognized the problems of grain processing pollutant data, how to preprocess the existing pollutant data so that they can serve as modelling data, and how to extract the trend information of the pollutant data so as to predict and model the grain processing pollutants at future moments, are problems to be solved in the food safety research field.
Disclosure of Invention
The invention provides an LSTM-DFGAN-based grain processing process pollutant data expansion and risk prediction method for solving the problems of grain processing process data expansion and pollutant prediction.
The LSTM-DFGAN-based grain processing process pollutant data expansion and risk prediction method comprises the following specific steps:
step one, obtaining original data of pollutants in the grain processing process, and dividing and treating the original data;
the number of ring joints in the grain processing process is recorded as I+J, wherein I is the number of links with known pollutant concentration, J is the number of links with pollutant concentration to be predicted, and the original data of the pollutant in the grain processing process is defined as shown in a formula (1):
where N is the number of sample data, and the size of the data is n× (i+j).
Dividing the original data, recording link data of known pollutant concentration as X and recording link data of pollutant concentration to be predicted as Y, wherein the link data are shown in a formula (2) and a formula (3):
the processing procedure of the original data is as follows:
firstly, the pollutant data of each processing link are arranged in descending order, giving ordered data of N rows and I+J columns;
the ordered N×(I+J) data are then transformed into one-dimensional data with 1 row and N×(I+J) columns.
And step two, inputting the transformed one-dimensional data into a TimeGAN for expansion to obtain a data set containing a plurality of groups of N rows of I+J columns of data.
Firstly, TimeGAN learns the transformed one-dimensional data (1 row, N×(I+J) columns) and generates several groups of one-dimensional data;
then each group of one-dimensional data is inversely transformed and restored into several groups of data with N rows and I+J columns.
Dividing each group of expanded data sets into a training set and a testing set, and respectively dividing an input set and an output set;
the expanded data groups are randomly shuffled by rows and divided into a training set and a test set in the ratio N1 : (N−N1).
For each training set and test set, the input and output are divided respectively: the pollutant concentrations of the links already known in the grain processing process are taken as input, and the pollutant concentrations of the links to be predicted are taken as output.
The method comprises the following steps:
the input of the training set, i.e. the data of the known links, is X_train, and the output of the training set, i.e. the links to be predicted, is Y_train; similarly, the input and output of the test set are X_test and Y_test respectively. After division they are as shown in formulas (4) to (7):
X_train = rows 1 … N1 of X (4)
Y_train = rows 1 … N1 of Y (5)
X_test = rows N1+1 … N of X (6)
Y_test = rows N1+1 … N of Y (7)
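The step-three split can be sketched as follows (an illustrative aid, not part of the patent; the function and parameter names are this sketch's own):

```python
import numpy as np

def split_dataset(data, n1, i_known, seed=0):
    """Shuffle one expanded N x (I+J) data group by rows, split it
    N1 : (N - N1) into training and test sets, and use the first I
    columns as input and the remaining J columns as output."""
    rng = np.random.default_rng(seed)
    shuffled = data[rng.permutation(len(data))]
    train, test = shuffled[:n1], shuffled[n1:]
    return (train[:, :i_known], train[:, i_known:],   # X_train, Y_train
            test[:, :i_known], test[:, i_known:])     # X_test,  Y_test

data = np.arange(40.0).reshape(8, 5)                  # N=8, I+J=5 (toy values)
X_tr, Y_tr, X_te, Y_te = split_dataset(data, n1=6, i_known=3)
assert X_tr.shape == (6, 3) and Y_te.shape == (2, 2)
```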
step four, inputting X through training set train And output Y train Training LSTM with data to obtain training The LSTM model is refined.
Step five, training the GAN model through a training set at the same time;
the GAN model comprises a generator and a discriminator, and the training process comprises the following steps:
first, input data X of I links in training set train As input of the generator, obtaining predicted J link data through 4-layer DNN networkAnd fusing with input data of the links I in the training set;
then, inputting the training set data and the data which are fused with the input data of the I links in the training set and the predicted J links into a discriminator at the same time;
the I link data in the training set plus the J link data predicted by the generator are not real data change trends, so that the label is 0 when the I link data and the J link data are input to the discriminator; similarly, the tag is 1 when training set data is input to the arbiter, so the arbiter outputs a one-dimensional scalar.
And finally, inputting the one-dimensional scalar output by the discriminator into the activation function, wherein the larger the output value of the activation function is, the more real the input data is, namely the more accurate the change rule is.
After adversarial training of the generator and the discriminator, the generator can output J-link data similar to real data when given known I-link data; training of the GAN model is then complete, and it possesses prediction ability.
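The labelling rule of step five can be sketched as follows (illustrative only; a full GAN would train the generator and discriminator on such labelled batches):

```python
import numpy as np

def discriminator_batch(X_train, Y_train, Y_pred):
    """Real rows [X | Y_train] carry the true data trend and are labelled 1;
    generator rows [X | Y_pred] are labelled 0 before entering the
    discriminator. Shapes and names are illustrative."""
    real = np.hstack([X_train, Y_train])   # true trend  -> label 1
    fake = np.hstack([X_train, Y_pred])    # predicted trend -> label 0
    batch = np.vstack([real, fake])
    labels = np.concatenate([np.ones(len(real)), np.zeros(len(fake))])
    return batch, labels

batch, labels = discriminator_batch(np.zeros((4, 3)), np.ones((4, 2)),
                                    np.full((4, 2), 0.5))
assert batch.shape == (8, 5)
```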
And step six, taking the input data in the training set as the input of the trained LSTM/GAN model to obtain the predicted output data of the LSTM/GAN model, fusing the predicted output data with the input data of the training set to serve as the input data of the DF model, and taking the output data in the training set as the output data of the DF model to train the DF model to obtain the LSTM-DF model/GAN-DF model.
The specific training process is as follows:
step 601, for N 1 Pollutant input number in grain processing process of row I+J columnAccording to the data, DF extracts the input data vector by using a plurality of sliding windows with different sizes, and the extracted data vector is obtained.
The sliding window is p×q in size, and the number of times of sliding extraction of the input data is (N 1 -p+1) x (i+j-q+1), the extracted data vector size is: p x q x (N) 1 -p+1)×(I+J-q+1)。
Step 602, the data vectors extracted from each sliding window are respectively input into a complete random forest a and a random forest B, and each complete random forest a or random forest B outputs k prediction results, so that the number of the data vectors finally obtained from each sliding window is as follows: p x q x (N) 1 -p+1) x (i+j-q+1) x k x 2, wherein 2 is meant to include both completely random forest a and random forest B.
Step 603, assuming that the number of sliding windows is M, the total number of data vectors obtained through the multi-granularity scanning stage is m= Σ m p×q×(N 1 -p+1)×(I+J-q+1)×k×2。
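The sliding-window extraction of step 601 can be sketched as follows (window sizes and data are illustrative, not the patent's):

```python
import numpy as np

def sliding_windows(data, p, q):
    """Multi-granularity scanning sketch: slide a p x q window over an
    N1 x (I+J) matrix with stride 1. The number of extracted windows is
    (N1 - p + 1) * (I + J - q + 1), matching the count in the text."""
    n1, cols = data.shape
    return np.array([data[r:r + p, c:c + q].ravel()
                     for r in range(n1 - p + 1)
                     for c in range(cols - q + 1)])

wins = sliding_windows(np.zeros((6, 5)), p=2, q=3)
assert wins.shape == ((6 - 2 + 1) * (5 - 3 + 1), 2 * 3)   # 15 windows, 6 values each
```

In a full implementation each window vector would then be fed to a completely random forest and a random forest, as in step 602.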
Step 604, the M data values are concatenated in the order of the multi-granularity scanning to obtain the enhanced data vector, which serves as the input of the cascade forest.
The enhanced data vector contains M rows and 1 column of data.
The cascade forest comprises a multi-layer cascade structure, and each layer of cascade structure comprises x random forests and y completely random forests.
Step 605, the enhanced data vector is input into the first cascade layer for learning, which outputs a predicted data vector; the number of values output by the first cascade layer is k×(x+y).
Step 606, the data vector output by the first cascade layer is combined with the enhanced data vector to form a new data vector, which is passed to the second cascade layer for learning, and so on: each cascade layer learns the data vector formed by combining the previous layer's output with the enhanced data vector, until the maximum number of layers is reached or the prediction result no longer changes; the number of cascade layers is thereby determined and DF training is completed.
When DF training is completed, each random forest and completely random forest in the last cascade layer outputs a group of predicted data vectors, and the outputs of all random forests and completely random forests in that layer are averaged to obtain the final prediction result.
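The final averaging step can be sketched as follows (illustrative values; the forests themselves are omitted):

```python
import numpy as np

def final_prediction(last_layer_outputs):
    """Average the prediction vectors of all x random forests and
    y completely random forests in the last cascade layer to obtain
    the final DF prediction."""
    return np.mean(np.asarray(last_layer_outputs), axis=0)

# e.g. three forests in the last layer, each predicting two links
pred = final_prediction([[0.2, 0.4], [0.4, 0.6], [0.6, 0.8]])
assert np.allclose(pred, [0.4, 0.6])
```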
And 607, summarizing the output result of the cascade forest by the trained DF algorithm, and then verifying the prediction effect by using the test set to obtain the prediction result and the accuracy.
Step 608, after DF learns the variation rule of the grain pollutant processing process and the error of LSTM/GAN prediction through training, an LSTM-DF model/GAN-DF model is obtained.
Step seven, based on the trained DF, the DF is embedded into the GAN, a DFGAN model is built, the DF is used as a generator of the GAN, and the discriminator still uses the DNN model.
The specific process is as follows:
step 701, X train Inputting into the trained DF to obtain the predicted value of DFX is to be train And->Performing transverse combination to obtain input ++of the discriminator>As shown in equation (14):
wherein the method comprises the steps ofIs to the N th 1 Predicted value of group data in the I+J th link, due to +.>From the I+1 to I+J ring members of (C)Since the data is predicted by DF, the arbiter needs to determine false, and thus, the flag 0 is added to the arbiter.
Step 702, at the same time, the inputs of the N−N1 test samples are combined with their true output data; because the outputs are real data, the combined data are labelled 1 and fed into the discriminator.
In step 703, according to the discrimination results, the discriminator is optimized by back propagation; the generator's performance is judged through the discriminator's loss function, which detects the current generator's prediction accuracy on the training data.
The results of judging the generator's performance by the loss function fall into the following cases:
(1) if the discriminator's loss function has not changed relative to the loss function at the last generator adjustment, the generator's prediction effect is still good and no generator optimization is needed;
(2) if the discriminator's loss function has changed relative to the loss function at the last generator adjustment, the change must be examined further; go to step 704.
Step 704, judge whether the absolute value of the change of the loss function reaches the adjustment threshold; if so, adjust the generator by increasing the number of cascade forest layers and return to step 701; otherwise go to step 705.
the value of the adjustment threshold is shown in formula (15):
δ_e = a × b^epoch + c (15)
where δ_e is the absolute value of the current prediction error minus the last prediction error, epoch is the current training round, the initial value of δ_e is defined as 0, and δ_e decreases as epoch increases; a is determined by the order of magnitude of the prediction accuracy, b is a factor controlling the decay speed of δ_e, and c ensures that δ_e still has a certain value when the number of training rounds is already large.
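The decay behaviour of formula (15) can be sketched as follows (the values of a, b, c are illustrative only):

```python
def adjustment_threshold(a, b, c, epoch):
    """Sketch of formula (15): delta_e = a * b**epoch + c. With 0 < b < 1
    the threshold decays toward the floor c as the training round `epoch`
    grows, so generator adjustments become harder to trigger later on."""
    return a * b ** epoch + c

early = adjustment_threshold(a=0.1, b=0.9, c=0.01, epoch=1)
late = adjustment_threshold(a=0.1, b=0.9, c=0.01, epoch=100)
assert late < early            # threshold shrinks with training rounds
assert late > 0.01             # but never drops below the floor c
```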
Step 705, judge whether this round of training is completed; if so, go to step 706, otherwise return to step 703 and continue optimizing the discriminator.
Step 706, judge whether all training rounds are completed; if so, DF's generator optimization process is complete, otherwise return to step 703 and continue optimizing.
Step 707, using DF as a generator, establishing a DFGAN model, and predicting the known I link data to obtain predicted J link data.
And step eight, replacing the generator DF with an LSTM-DF model on the basis of the DFGAN model, and establishing the LSTM-DFGAN model to realize accurate prediction of pollutants in the grain processing process.
Step 801, inputting the known I link data into the trained LSTM to obtain the prediction result of the LSTM;
step 802, fusing the LSTM prediction result with training set input, and changing the fused data into N 1 Row i+j columns;
step 803, taking the fused data as input of DF, and outputting predicted J link data by DF and taking the predicted J link data as input of a discriminator;
step 804, the discriminator judges J pieces of link data predicted by DF output, detects the prediction accuracy of the current DF on the training data, judges whether the loss function of the discriminator changes relative to the loss function of the previous round, if so, executes step 805; otherwise, continuing to train the arbiter, without performing an optimization generator, step 806 is performed.
Step 805, further determining whether the magnitude of the change of the loss function of the round relative to the loss function of the previous round reaches the adjustment threshold, if so, adjusting DF, increasing the number of cascaded forest layers, and returning to step 803; otherwise, step 806 is performed.
Step 806, determining whether the round is completed, if yes, executing step 807; otherwise, returning to the step 804, and continuing to optimize the discriminator;
step 807, judging whether all rounds are completed, if yes, completing the establishment of an LSTM-DFGAN model; otherwise, return to step 804, proceed with arbiter optimization.
And step 808, predicting the known I link data by using the established LSTM-DFGAN model to obtain predicted J link data.
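The flow of steps 801-803 and 808 can be sketched at a high level as follows (`lstm` and `df` stand in for the trained models as plain callables; all names are illustrative, not the patent's):

```python
def lstm_dfgan_predict(lstm, df, x_known):
    """Run the trained LSTM on the known I-link data, fuse its prediction
    with the input row-wise, and let DF produce the J-link prediction."""
    y_lstm = lstm(x_known)                                            # step 801
    fused = [row_x + row_y for row_x, row_y in zip(x_known, y_lstm)]  # step 802
    return df(fused)                                                  # steps 803/808

# stub models standing in for the trained LSTM and DF
lstm_stub = lambda rows: [[sum(r) / len(r)] * 2 for r in rows]
df_stub = lambda rows: [r[-2:] for r in rows]
out = lstm_dfgan_predict(lstm_stub, df_stub, [[1.0, 2.0, 3.0]])
assert out == [[2.0, 2.0]]
```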
The invention has the advantages that:
1. The invention proposes for the first time using the TimeGAN model to expand pollutant data of the grain processing process. The model suits this type of data: the expanded data follow a variation law close to that of the original data while still differing from it to a certain extent, and their distribution is consistent with the original data, with the overall mean and variance basically unchanged.
2. The invention proposes for the first time using the DF model to predict grain processing data. Compared with neural-network-type learning models such as LSTM, DF has a better prediction effect on discrete data, and because DF integrates random forests, its prediction is more accurate than a single random forest.
3. The invention proposes for the first time an LSTM-DF model for predicting pollutant data of the grain processing process; compared with a single LSTM or DF model, it combines the advantages of both and has a better prediction effect.
4. The invention proposes for the first time a GAN-DF model for predicting pollutant data of the grain processing process; compared with a single GAN or DF model, it combines the advantages of both and has a better prediction effect.
5. The invention proposes for the first time a DFGAN model for predicting pollutants in the grain processing process. Referring to the working principle of the generative adversarial network, DF serves as the generator and the discriminator's output is used to optimize DF's parameters; the final DF prediction effect is better than that of the original DF.
6. The invention proposes for the first time an LSTM-DFGAN model for predicting pollutants in the grain processing process. Compared with the LSTM-DF model it adds the GAN structure, and compared with the DFGAN model it upgrades DF to the more accurate LSTM-DF model, making the prediction more accurate.
Drawings
FIG. 1 is an overall flow chart of the present invention for grain processing prediction based on the LSTM-DFGAN model;
FIG. 2 is a schematic diagram of the present invention for data augmentation using TimeGAN;
FIG. 3 is a structural schematic diagram of DF for predicting pollutants in grain processing;
FIG. 4 is a flowchart of a DF algorithm for predicting the pollutants in grain processing;
FIG. 5 is a schematic diagram of LSTM model for predicting contaminants in grain processing;
FIG. 6 is a schematic diagram of the LSTM-DF model for predicting contaminants during grain processing;
FIG. 7 is a flowchart of an algorithm for predicting contaminants in grain processing using the LSTM-DF model;
FIG. 8 is a schematic diagram of the structure of the GAN-DF model for predicting pollutants in grain processing;
FIG. 9 is a flowchart of an algorithm for predicting grain process pollutants by using a GAN-DF model;
FIG. 10 is a schematic diagram of the structure of the DFGAN model for predicting contaminants in grain processing;
FIG. 11 is a flowchart of an algorithm for predicting grain process pollutants by the DFGAN model;
FIG. 12 is a schematic diagram of the LSTM-DFGAN model for predicting contaminants during grain processing;
FIG. 13 is a flowchart of an algorithm for predicting grain process pollutants using the LSTM-DFGAN model;
FIG. 14 is a visual comparison of a set of predictions for example 1;
FIG. 15 is a schematic view of the rice processing flow mentioned in example 2;
FIG. 16 is the result of expanding 1 set of data in example 2;
FIG. 17 is the result of expanding 10 sets of data in example 2;
FIG. 18 is the result of expanding 100 sets of data in example 2;
FIG. 19 is the result of expanding 1000 sets of data in example 2;
fig. 20 is a graph showing the accuracy of the prediction results of the original data and the respective expanded data in example 2.
Detailed Description
The invention will be described in further detail with reference to the drawings and examples.
The invention aims to solve problems such as the difficulty of acquiring pollutant data in the grain processing process and the small amount of such data: it first expands the grain processing pollutant data based on TimeGAN (Time-series Generative Adversarial Network). Addressing the lack of suitable prediction methods for grain processing pollutants, the invention proposes predicting the unknown links of grain processing pollutants based on Deep Forest (DF), LSTM-DF, GAN-DF, DFGAN, LSTM-DFGAN and other methods.
The grain processing pollutant data expanding method and the various grain processing pollutant risk prediction methods are shown in fig. 1, and mainly comprise the following steps:
step one, obtaining original data of pollutants in the grain processing process, and dividing and treating the original data;
the number of ring joints in the grain processing process is recorded as I+J, wherein I is the number of links with known pollutant concentration, J is the number of links with pollutant concentration to be predicted, and the original data of the pollutant in the grain processing process is defined as shown in a formula (1):
Where N is the number of sample data, and the size of the data is n× (i+j).
Dividing the original data, recording link data of known pollutant concentration as X and recording link data of pollutant concentration to be predicted as Y, wherein the link data are shown in a formula (2) and a formula (3):
considering that grain pollutant data are data of different links in the same row and different columns, and different rows of data are another group of samples, the change trend of the relationship between two rows of data is the same, but the initial data value of the t-th row of data(t=1, 2 … N) which leads to the value of the subsequent step +.>(h=2, 3 … i+j), but still is different from +.>Therefore, each column of data has a certain change rule, and the original data is processed according to the change rule, specifically:
firstly, the pollutant data of each processing link in the original data are arranged in descending order, giving ordered data of N rows and I+J columns;
the ordered N×(I+J) data are then transformed into one-dimensional data with 1 row and N×(I+J) columns.
And step two, inputting the transformed one-dimensional data into a TimeGAN for expansion to obtain a data set containing a plurality of groups of N rows of I+J columns of data.
The amount of grain processing pollutant data is small, and the data type differs from common long time-series data: a certain amount of pollutants exists in this special intermittent process, but because the pollutant types are numerous, the processing links are complex and detecting the pollutants is costly, the data volume is small and cannot reach the amount required for training a predictive model. Data expansion is therefore needed, and the TimeGAN method, which suits expanding grain processing pollutant data, is used: the "temporal dynamics" in the grain pollutant data is the relationship among the groups of data, and the method is applicable to the various metal pollutants and microbial pollutants of grain processing.
The process of data expansion using TimeGAN is shown in fig. 2, and specifically includes:
firstly, the TimeGAN learns the transformed 1 row of one-dimensional data of N× (I+J) columns to generate a plurality of groups of one-dimensional data, and the generated data is similar to the original data;
then, each group of one-dimensional data is inversely transformed and restored into a plurality of groups of N rows and I+J columns of data.
Finally, enough generated data can be obtained according to the requirement so as to achieve the data volume required by training deep learning.
Dividing each group of expanded data sets into a training set and a testing set, and respectively dividing an input set and an output set;
the expanded data groups are randomly ordered by row unit, and the data set is divided into a training set and a test set at a ratio of N1:(N−N1).
And dividing input and output of each group of training set and test set respectively, taking the pollutant concentration of the grain processing process with known ring number as input and taking the pollutant concentration of the grain processing process needing prediction as output.
The method comprises the following steps:
the input of the training set, i.e. the data of the known links, is X_train, and the output of the training set, i.e. the links to be predicted, is Y_train; similarly, the input and output of the test set are X_test and Y_test respectively, as shown in formulas (4) to (7):
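The splitting in step three might be sketched as follows, under hypothetical sizes N, I, J and N1; the shuffling and slicing details are illustrative assumptions:

```python
import numpy as np

# One expanded group: N rows, each row covering I known links and J links
# to be predicted. Shuffle rows, split N1 : (N - N1), then split each part
# into inputs (known links) and outputs (links to predict).
N, I, J, N1 = 10, 7, 5, 8
rng = np.random.default_rng(1)
data = rng.random((N, I + J))

perm = rng.permutation(N)                        # random ordering by row unit
train, test = data[perm[:N1]], data[perm[N1:]]

X_train, Y_train = train[:, :I], train[:, I:]    # cf. formulas (4)-(5)
X_test,  Y_test  = test[:, :I],  test[:, I:]     # cf. formulas (6)-(7)
```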
Step four, inputting X through training set train And output Y train And training LSTM by the data to obtain a trained LSTM model.
The Long Short-Term Memory (LSTM) network contains a cell processor and uses gated recurrent units to process information, allowing information to pass selectively; discarded information is forgotten through the forget gate. This avoids the large prediction errors of long-sequence information and allows long time-series information to be processed.
LSTM is mainly used to solve the gradient vanishing and gradient explosion problems that occur with long sequences during training. Its chain structure, composed of a forget gate, an input gate, an output gate, a memory cell and activation functions, is shown in figure 5.
The LSTM individual gate outputs are calculated as shown in equations (8) through (13):
f_n = σ(W_f·[X_n, h_{n-1}] + b_f) (8)

where f_n is the output of the forget gate for the n-th group, σ is the Sigmoid operation, W_f is the weight of the forget gate, and X_n is the pollutant concentration of the n-th group in grain processing. General time-series data is N rows by 1 column, where N is the sequence length; differing from this, the X_n here comes from the N-row, (I+J)-column grain processing data — each group is arranged in processing order but the time is discontinuous, its size is 1 row by I columns, and the trend of change is the same across groups. h_{n-1} is the output of the (n−1)-th LSTM group and b_f is the bias of the forget gate. The forget gate reads the previous output h_{n-1} and the current input X_n and performs a Sigmoid operation to produce the n-th output vector f_n.
i_n = σ(W_i·[X_n, h_{n-1}] + b_i) (9)

C̃_n = tanh(W_c·[X_n, h_{n-1}] + b_c) (10)

C_n = f_n·C_{n-1} + i_n·C̃_n (11)

where i_n is the output of the n-th input gate, W_i and W_c are the weights of the input gate, b_i and b_c are its biases, C̃_n is the candidate value vector, and C_n is the cell state of the n-th group. The input gate reads the (n−1)-th output h_{n-1} and the n-th pollutant concentration X_n and performs a Sigmoid operation, while a tanh operation creates the candidate value vector C̃_n; the old state C_{n-1} scaled by f_n is then added to the product of the input-gate output i_n and the candidate vector C̃_n to form the new cell state C_n.
O_n = σ(W_o·[X_n, h_{n-1}] + b_o) (12)

h_n = O_n·tanh(C_n) (13)

where O_n is the output of the output gate, W_o is its weight, b_o is its bias, and h_n is the output of the n-th LSTM group. The output gate reads the (n−1)-th output h_{n-1} and the n-th input X_n and performs a Sigmoid operation to obtain O_n; at the same time a tanh operation is applied to C_n and multiplied by O_n to obtain the current output h_n.
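Equations (8) to (13) can be checked with a small NumPy sketch of a single LSTM step; the dimensions, weight initialization and dictionary naming below are illustrative assumptions, not the patent's implementation:

```python
import numpy as np

def lstm_step(X_n, h_prev, C_prev, W, b):
    """One LSTM step implementing equations (8)-(13); W and b hold the
    forget (f), input (i), candidate (c) and output (o) parameters."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    concat = np.concatenate([X_n, h_prev])       # [X_n, h_{n-1}]
    f_n = sigmoid(W["f"] @ concat + b["f"])      # (8) forget gate
    i_n = sigmoid(W["i"] @ concat + b["i"])      # (9) input gate
    c_tilde = np.tanh(W["c"] @ concat + b["c"])  # (10) candidate vector
    C_n = f_n * C_prev + i_n * c_tilde           # (11) new cell state
    O_n = sigmoid(W["o"] @ concat + b["o"])      # (12) output gate
    h_n = O_n * np.tanh(C_n)                     # (13) hidden output
    return h_n, C_n

# Toy dimensions (assumptions): input size I, hidden size H.
I_dim, H = 7, 4
rng = np.random.default_rng(2)
W = {k: rng.standard_normal((H, I_dim + H)) * 0.1 for k in "fico"}
b = {k: np.zeros(H) for k in "fico"}
h, C = np.zeros(H), np.zeros(H)
h, C = lstm_step(rng.standard_normal(I_dim), h, C, W, b)
```

Since O_n lies in (0, 1) and tanh(C_n) in (−1, 1), each component of h_n stays strictly inside (−1, 1).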
Step five, training the GAN model through a training set at the same time;
The Generative Adversarial Network (GAN) consists of a generator and a discriminator, as shown in fig. 8. To enable the GAN generator to predict grain-processing pollutants, it is designed so that the generator takes the pollutant data of the known links as input and outputs the pollutant data of the links to be predicted. The generator's role is to learn to produce data similar to the real data; the discriminator's role is to distinguish real data from data produced by the generator, embodying the idea of competitive learning. The generator and discriminator optimize against each other, and after training the data produced by the generator is realistic enough to pass for real. A DNN network is used to design the generator so that it can learn the variation of pollutants across the grain process and thus predict them.
The training process comprises the following steps:
first, the input data X_train of the I links in the training set is taken as the input of the generator; the predicted J-link data is obtained through a 4-layer DNN network and fused with the input data of the I links in the training set;
then, the training-set data and the fused data (training-set I-link inputs plus predicted J-link data) are input into the discriminator simultaneously; the discriminator must distinguish the real data from the generated data;
the I link data in the training set plus the J link data predicted by the generator are not real data change trends, so that the label is 0 when the I link data and the J link data are input to the discriminator; similarly, the tag is 1 when training set data is input to the arbiter, so the arbiter outputs a one-dimensional scalar.
And finally, inputting the one-dimensional scalar output by the discriminator into the activation function, wherein the larger the output value of the activation function is, the more real the input data is, namely the more accurate the change rule is.
After the adversarial training of the generator and discriminator, when known I-link data is input to the generator it can output J-link data similar to the real data; GAN training is then complete and the model has prediction capability.
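The labeling scheme for the discriminator's training data described above might be sketched as follows; the shapes and the random stand-in for the 4-layer DNN generator output are assumptions for illustration:

```python
import numpy as np

# Real training rows get label 1; rows formed by fusing the known I-link
# inputs with the generator's predicted J-link outputs get label 0.
N1, I, J = 8, 4, 3
rng = np.random.default_rng(3)
X_train = rng.random((N1, I))
Y_train = rng.random((N1, J))
Y_hat = rng.random((N1, J))          # stand-in for the DNN generator output

real = np.hstack([X_train, Y_train])   # label 1: true trend
fake = np.hstack([X_train, Y_hat])     # label 0: generated trend
disc_in = np.vstack([real, fake])
labels = np.concatenate([np.ones(N1), np.zeros(N1)])
```

The discriminator would map each (I+J)-column row of `disc_in` to a one-dimensional scalar and be trained against `labels`.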
Step six: the input data of the training set is fed to the trained LSTM/GAN model to obtain its predicted output; this predicted output is fused with the training-set input to form the input of the DF model, and the training-set output is used as the DF model's output, training the DF model to obtain the LSTM-DF model / GAN-DF model.
Because of the characteristics of pollutant data in the grain process — the links of each group are arranged in processing order, the time is discontinuous, and the trend of each link is the same across groups — a general time-series prediction model performs poorly. Compared with common neural network models, DF predicts discrete time series well, so the DF model is used to predict pollutants in the grain process; the prediction algorithm flow is shown in figure 4.
LSTM has advantages for time-series processing but handles discrete time series poorly, while DF handles them well; the LSTM and DF models are therefore combined, exploiting the advantages of both to predict pollutants in the grain process. Likewise, since the GAN generator is designed on a neural network and DF favors ensemble learning, the GAN and DF models are combined, drawing on the advantages of both.
The training process of the LSTM-DF model/GAN-DF model is shown in fig. 6, 7, 8 and 9, and the specific training process is as follows:
Step 601: fuse the known I links with the J links predicted by the LSTM/GAN model, reshape the fused data into N1 rows by I+J columns, and take the fused data as the input of DF;
Step 602: for the N1-row, (I+J)-column grain-processing pollutant input data, DF uses several sliding windows of different sizes to extract input data vectors, obtaining the extracted data vectors.
With a sliding window of size p×q, the number of sliding extractions over the input data is (N1−p+1)×(I+J−q+1), and the extracted data vector size is p×q×(N1−p+1)×(I+J−q+1).
As shown in fig. 3, taking 2×2 and 3×3 sliding windows as an example: when the data is 12 rows by 7 columns (84 values in total), sliding extraction with the 2×2 and 3×3 windows yields 2×2×11×6 and 3×3×10×5 values respectively, i.e. 264 and 450 values, where 2×2 and 3×3 are the sizes of each extraction and 11×6 and 10×5 are the numbers of extractions for each window size.
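The extraction counts in this example can be reproduced with a short sketch; the helper function is illustrative, not the patent's code:

```python
import numpy as np

def sliding_extract(data, p, q):
    """Extract all p x q windows (stride 1) from a 2-D array, as in the
    multi-granularity scanning stage; returns an array of flattened windows."""
    n, m = data.shape
    wins = [data[r:r + p, c:c + q].ravel()
            for r in range(n - p + 1)
            for c in range(m - q + 1)]
    return np.array(wins)

data = np.arange(84, dtype=float).reshape(12, 7)  # 12 rows x 7 links, 84 values
w22 = sliding_extract(data, 2, 2)  # (11*6) windows of 4 values -> 264 values
w33 = sliding_extract(data, 3, 3)  # (10*5) windows of 9 values -> 450 values
print(w22.size, w33.size)          # matches the 264 and 450 in the text
```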
Step 603: the data vectors extracted by each sliding window are input into completely random forest A and random forest B respectively; each forest outputs k prediction results, so the number of values finally obtained from each sliding window is p×q×(N1−p+1)×(I+J−q+1)×k×2, where the 2 accounts for using both completely random forest A and random forest B.
Assuming each random forest must output 3 prediction results, a total of 3570 data vectors are obtained from the 264×3×2 and 450×3×2 extractions.
Step 604: assuming the number of sliding windows is M, the total number of values obtained through the multi-granularity scanning stage is the sum over all M windows: Σ_m p_m×q_m×(N1−p_m+1)×(I+J−q_m+1)×k×2.
Step 605, concatenating the M data vectors according to the sequence of the multi-granularity scanning, obtaining an enhanced data vector, and using the enhanced data vector as an input of the cascade forest.
The enhanced data vector contains M rows and 1 column of data. After extraction through the 2×2 and 3×3 sliding windows, the data size changes from 84 to 3570; this enhanced data vector is used as the input of the cascade forest.
The cascade forest comprises a multi-layer cascade structure, and each layer of cascade structure comprises x random forests and y completely random forests.
Step 606: the enhanced data vector is input into the first cascade layer for learning, which outputs predicted data vectors; the number of values output by the first layer is k×(x+y).
In this embodiment, the first cascade layer learns from the 3570 values obtained by multi-granularity scanning. Assuming each cascade layer has 2 random forests and 2 completely random forests, the first layer outputs 3×4 values; these are combined with the enhanced data vector into a new data vector of 3582 values, which is then passed to the second layer of forests for learning.
Step 607: the data vector output by the first cascade layer is combined with the enhanced data vector and, as a new data vector, passed to the second cascade layer for learning, and so on: each layer learns from the previous layer's output combined with the enhanced data vector, until the maximum number of layers is reached or the prediction result no longer changes. All current parameters are then saved and recorded, the number of cascade layers is fixed, and DF training is complete.
When DF training is completed, each random forest and the complete random forest of the last layer of cascade structure can output a group of predicted data vectors, and the outputs of all random forests and the complete random forests of the layer are averaged to obtain a final predicted result.
In this embodiment, the DF last layer concatenation structure outputs only 3×4 data vectors.
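The cascade stage of steps 605 to 607 might be sketched structurally as follows; the stub predictors merely stand in for real (completely) random forests, and all sizes are illustrative assumptions:

```python
import numpy as np

def stub_forest(vec, k=3, seed=0):
    # Placeholder "forest": k pseudo-predictions derived from the input;
    # a real implementation would use (completely) random forests.
    r = np.random.default_rng(seed)
    w = r.random((k, vec.size))
    return w @ vec / vec.size

rng = np.random.default_rng(4)
enhanced = rng.random(20)            # enhanced vector from multi-grained scanning
x_forests, y_forests, max_layers = 2, 2, 5
layer_in, prev_pred = enhanced, None
for layer in range(max_layers):
    outs = [stub_forest(layer_in, seed=layer * 10 + i)
            for i in range(x_forests + y_forests)]
    pred = np.mean(outs, axis=0)     # final layer: average all forest outputs
    if prev_pred is not None and np.allclose(pred, prev_pred):
        break                        # prediction no longer changes
    prev_pred = pred
    # Each layer re-fuses its outputs with the enhanced vector (step 607).
    layer_in = np.concatenate([np.concatenate(outs), enhanced])
```

Each layer here emits k×(x+y) = 12 values, matching the 3×4 count described for this embodiment's layers.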
Step 608: the trained DF algorithm summarizes the output of the cascade forest, and the test set is then used to verify the prediction effect, yielding the prediction result and accuracy.
Step 609, after DF learns the variation rule of the grain pollutant processing course and the error of LSTM/GAN prediction through training, an LSTM-DF model/GAN-DF model is obtained.
Here, DF learns the change relation between the known I links and the J links to be predicted, learns the deviation of LSTM/GAN prediction results, and further improves the accuracy of prediction.
Step seven, based on the trained DF, the DF is embedded into the GAN, a DFGAN model is built, the DF is used as a generator of the GAN, and the discriminator still uses the DNN model.
Because DF is trained by an ensemble-learning method while GAN is trained by neural-network back-propagation, DF cannot be trained when embedded directly into GAN. Based on the GAN idea, DF is embedded into GAN as the generator of the GAN model, the DFGAN model is established, and pollutant prediction for the rice process is carried out.
As shown in fig. 10 and 11, the specific process is:
Step 701: X_train is input into the trained DF to obtain the DF predicted value Ŷ; X_train and Ŷ are combined horizontally to obtain the discriminator input X̂, as shown in formula (14):

where ŷ_{N1,I+J} is the predicted value of the N1-th group of data at the (I+J)-th link. Since the links I+1 to I+J of X̂ are predicted by DF, the discriminator must judge it as false, so label 0 is attached and the data enters the discriminator.
Step 702: at the same time, the inputs of the N−N1 test-set groups are combined with their true outputs; since the output data are real, they are labeled 1 and enter the discriminator.
Step 703: according to the discrimination effect, the discriminator is first optimized by back-propagation; the generator's performance is judged through the discriminator's loss function, detecting the current generator's prediction accuracy on the training data.
Since the generator cannot be trained directly by back-propagating the discriminator's result, its performance is judged using the discriminator's loss function (BCE Loss); the judgment falls into the following cases:
(1) If the loss function of the discriminator is not changed relative to the loss function when the generator is adjusted last time, the prediction effect of the generator is still good, and generator optimization is not needed;
(2) if the loss function of the discriminator has changed relative to the loss function when the generator was last adjusted, the change in the loss function needs further judgment, and step 704 is performed;
step 704, judging whether the absolute value of the change of the loss function reaches an adjustment threshold, if so, adjusting a generator to increase the number of layers of the cascade forest by one, and returning to step 701; otherwise, go to step 705;
The value of the adjustment threshold is shown in formula (15):
δ_e = a×b^epoch + c (15)

where δ_e is the absolute value of the current prediction error minus the last prediction error, epoch is the current training round, the initial value of δ_e is defined as 0, and δ_e decreases as epoch increases. a must be set according to the order of magnitude of the prediction accuracy; b is the factor controlling how fast δ_e decays, taking a value between about 0.8 and 0.99; c ensures that δ_e still has a certain value when the number of training rounds is already large. Early in training, epoch is small and the prediction effect changes markedly as epoch grows, so a larger δ_e is desired, allowing cascade layers to be added quickly. When epoch is relatively large, the prediction effect changes little as epoch increases, so a small δ_e is desired, so that the number of cascade-forest layers can still change when accuracy improves only slightly. When epoch is very large, a δ_e of a certain size can still be obtained.
When the magnitude of δ_e reaches the adjustment threshold, it indicates that adding a cascade layer at the last generator adjustment changed the prediction effect noticeably, so the generator still needs optimizing.
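Formula (15) can be sketched directly; the constants a, b and c below are illustrative choices within the stated guidance (b between about 0.8 and 0.99), not values from the patent:

```python
def adjust_threshold(epoch, a=0.01, b=0.9, c=1e-4):
    """Adjustment threshold of formula (15): decays with the training
    round but never drops below c, so a large enough change in the
    discriminator loss can still add a cascade layer late in training."""
    return a * (b ** epoch) + c

early, late = adjust_threshold(1), adjust_threshold(200)
```

With these constants the threshold shrinks from roughly 9.1e-3 at epoch 1 toward the floor c = 1e-4 at large epochs.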
Step 705, judging whether the training of the round is completed, if yes, executing step 706; otherwise, return to step 703 and continue with the arbiter optimization.
Step 706, judging whether training of all rounds is completed, if yes, completing an optimization process of the generator by DF; otherwise, return to step 703 and continue the optimization.
Step 707, using DF as a generator, establishing a DFGAN model, and predicting the known I link data to obtain predicted J link data.
And step eight, because the LSTM-DF model has higher precision compared with the DF model, the generator DF is replaced by the LSTM-DF model on the basis of the DFGAN model, and the LSTM-DFGAN model is built, so that the accurate prediction of the pollutants in the grain processing process is realized.
As shown in fig. 13, the process of building the LSTM-DFGAN model is:
step 801, inputting the known I link data into the trained LSTM to obtain the prediction result of the LSTM;
Step 802: the LSTM prediction result is fused with the training-set input, merging columns while the number of rows is unchanged, and the fused data is reshaped into N1 rows by I+J columns;
step 803, taking the fused data as input of DF, and outputting predicted J link data by DF and taking the predicted J link data as input of a discriminator;
Step 804: the discriminator judges the J-link data predicted and output by DF, detecting the current DF's prediction accuracy on the training data; it is judged whether the discriminator's loss function has changed relative to the previous round's, and if so, step 805 is executed; otherwise the discriminator continues to be trained without optimizing the generator, and step 806 is executed.
Step 805, further determining whether the magnitude of the change of the loss function of the round relative to the loss function of the previous round reaches the adjustment threshold, if so, adjusting DF, increasing the number of layers of the cascade forest, and returning to step 803; otherwise, step 806 is performed.
Step 806, determining whether the round is completed, if yes, executing step 807; otherwise, returning to the step 804, and continuing to optimize the discriminator;
step 807, judging whether all rounds are completed, if yes, completing the establishment of an LSTM-DFGAN model; otherwise, return to step 804, proceed with arbiter optimization.
The LSTM-DFGAN model is built as shown in FIG. 12.
And step 808, predicting the known I link data by using the established LSTM-DFGAN model to obtain predicted J link data.
Example 1
Data set: the experimental data come from the aflatoxin monitoring data of the peanut oil supply chain of an oils company in Zhongliang; the peanut oil supply chain has 12 links, and the total data amount is 103899. For this data set, the number of known links is 7 and the number of links to be predicted is 5; the methods of steps two and three are verified, with a single random forest and LSTM used as comparison experiments, and a training-to-test ratio of 8:2. Because the data span is large, links 9 to 12 of one group of prediction results are selected for visualization; the results are shown in fig. 14. The prediction accuracy results are shown in table 1:
Table 1 Example 1 prediction results

          RF        LSTM      DF        LSTM-DF
RMSE      2.14e-2   4.89e-4   5.17e-4   6.56e-5
MAE       2.08e-2   1.66e-2   1.02e-2   5.43e-3
Analysis of results: on this data set, the LSTM-DF model proposed by the invention has the smallest RMSE, followed by the LSTM and DF models, with the RF model largest; the gap between LSTM-DF and the other methods is large, while the gap between LSTM and DF is small. For MAE, the proposed LSTM-DF model is again smallest, followed by DF, then LSTM, with RF largest; the error of the LSTM-DF model is about half that of the DF model. Overall, among all models on this prediction task, LSTM-DF predicts best, LSTM and DF are second and close to each other, and RF predicts worst.
Example 2
In example 2, the rice processing procedure is more and is divided into raw grain, one-step rice hulling, two-step rice hulling, one-step rice milling, two-step rice milling, three-step rice milling, four-step rice milling, etc., as shown in fig. 15.
Data set: the data set consists of rice of different varieties produced in Jiangsu, Hubei and Heilongjiang. Under the process shown in fig. 15, each step is tested for metal pollutants — lead (Pb), chromium (Cr), arsenic (As), cadmium (Cd) and mercury (Hg), 5 in total — with 12 groups of data per pollutant, each group covering the 7 steps of fig. 15. Taking Pb as an example, the method of the invention is verified; the 4-layer DNN networks in the GAN all use two hidden layers of sizes 512 and 256, with the input-layer node count equal to the input data size and the output-layer node count equal to the output data size.
Step one, using TimeGAN to expand Pb data in the rice processing process;
fig. 16 to 19 show the expansion results: expanding the original single group of data to 1, 10, 100 and 1000 groups respectively, where each group is 12×7 in size.
in order to verify that the original data is equally valid as the generated data, some verification of the data on statistics is done.
First centroid contrast:
the raw data centroid is: [0.12, 0.06, 0.05, 0.05, 0.04, 0.03, 0.04]
the generated data centroid is: [0.13, 0.06, 0.05, 0.05, 0.05, 0.03, 0.04]
The second is the ratio of the pollutants contained in the raw grain in different intervals, as shown in table 2:
Table 2 Distribution results after expansion (data volume/ratio)

                min~0.09    0.09~0.11   0.11~0.13   0.13~0.15   0.15~0.17   0.17~max
Raw data        4/33.3%     1/8.3%      3/25%       1/8.3%      0/0%        3/24.9%
Expanded data   4502/37.5%  733/6.10%   1400/11.6%  2278/18.9%  29/0.2%     3018/25.0%
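The two statistical checks (centroid contrast and interval distribution) might be sketched as follows; the single-group vectors and bin edges are illustrative and do not reproduce Table 2's 12-group statistics:

```python
import numpy as np

raw = np.array([0.12, 0.06, 0.05, 0.05, 0.04, 0.03, 0.04])
gen = np.array([0.13, 0.06, 0.05, 0.05, 0.05, 0.03, 0.04])

# Centroid contrast: largest per-link gap between the two centroids.
gap = np.abs(raw - gen).max()

# Interval distribution: share of values falling into each Table-2-style bin
# (edges 0.0 and 1.0 stand in for "min" and "max").
edges = [0.0, 0.09, 0.11, 0.13, 0.15, 0.17, 1.0]
counts, _ = np.histogram(raw, bins=edges)
ratios = counts / raw.size
```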
Step two, predicting Pb data in the rice processing process;
For the data generated in step one, the RF, DF, LSTM-DF, GAN-DF, DFGAN and LSTM-DFGAN prediction methods are each applied to the original data and to the different amounts of expanded data, using the first 4 steps to predict the last 3, with a training-to-test ratio of 8:2 and a random forest as the comparison experiment. One group from the fifth set of prediction results is visualized in fig. 20; each model's predictions fall within a certain range, but differences exist, and the LSTM-DFGAN model is best overall. The quantified prediction accuracy is shown in table 3, measured by MAE and RMSE (smaller is more accurate for both).
TABLE 3 predicted results for example 2 (MAE/RMSE)
From the perspective of the different models, every method's prediction error is larger on the raw data; the GAN-DF model has the smallest MAE and the RF model the smallest RMSE, and every model's MAE and RMSE are of the same order of magnitude, the largest MAE being about 23% higher and the largest RMSE about 27% higher.
For the second set of data, LSTM-DFGAN is the same as and minimal as the MAE of GAN-DF, within 18% of LSTM-DF, with the MAE of DF being maximal, exceeding 50% of the MAE of LSTM-DFGAN, with the RMSE of the GAN-DF model being minimal, within 21% of the RMSE of LSTM-DF, LSTM-DFGAN, and likewise with the RMSE of DF being maximal, exceeding 50% of the RMSE of LSTM-DFGAN.
For the third set of data, the MAE of the GAN-DF model was minimal, but very close to LSTM-DF, DFGAN, LSTM-DFGAN, with a difference of about 8%, the MAE of RF, DF was about 50% higher, the RMSE of the LSTM-DFGAN model was minimal, the LSTM-DF, GAN-DF, DFGAN were closer, with a difference of about 21% to LSTM-DFGAN, and a difference of over 100% to RF, DF.
For the fourth and fifth sets of data, MAE and RMSE were both minimal for the LSTM-DFGAN model, MAE was more than 15% different from LSTM-DF, GAN-DF, DFGAN, more than 100% different from RF, DF, RMSE was more than 5% different from LSTM-DF, GAN-DF, DFGAN, more than 100% different from RF, DF.
Overall, from the perspective of the different models, the LSTM-DFGAN model has the smallest error, the GAN-DF model is next, then the LSTM-DF and DFGAN models; the errors of these four are basically of the same magnitude with small differences. The largest errors come from the RF and DF models, about 2 to 3 times those of the first three; DF is smaller when the data volume is large, but RF performs better when the data volume is small. This shows that the LSTM-DFGAN model has an excellent prediction effect and can realize prediction of pollutants in the grain process.
From the data-expansion point of view, prediction on the original data is poor because the data volume is too small and the models overfit easily. As the expansion grows, every model's prediction improves; the improvement brought by more data is larger for LSTM-DF, GAN-DF, DFGAN and LSTM-DFGAN and relatively smaller for RF and DF, with prediction accuracy improving by about 2 orders of magnitude overall. This shows that the proposed data expansion is significant for prediction.

Claims (6)

1. A grain processing process pollutant data expansion and risk prediction method based on LSTM-DFGAN is characterized by comprising the following specific steps:
Step one, acquiring original data of pollutants in a grain processing process, and dividing and processing to obtain 1 row of one-dimensional data of N× (I+J) columns;
the number of ring joints in the grain processing process is recorded as I+J, wherein I is the number of links with known pollutant concentration, J is the number of links with pollutant concentration to be predicted, and the original data of the pollutant in the grain processing process is defined as shown in a formula (1):
wherein, N is the number of sample data, and the size of the data is N× (I+J);
step two, inputting the transformed one-dimensional data into TimeGAN for expansion to obtain a data set containing a plurality of groups of N rows of I+J columns of data;
dividing each group of expanded data sets into a training set and a testing set, and respectively dividing an input set and an output set;
randomly ordering the expanded data groups by row unit, and dividing the data set into a training set and a test set at a ratio of N1:(N−N1);
step four, using the training set input X_train and output Y_train, the LSTM and the GAN are trained simultaneously to obtain a trained LSTM model and a trained GAN model;
inputting the input data in the training set into a trained LSTM/GAN model to obtain predicted LSTM/GAN model output data, fusing the predicted output data with the input data of the training set to serve as input data of a DF model, and training the DF model by using the output data in the training set as output data of the DF model to obtain an LSTM-DF model/GAN-DF model;
Step six, embedding DF into GAN based on the trained DF, establishing a DFGAN model, taking DF as a GAN generator, and still using DNN model by a discriminator;
the specific process is as follows:
step 601, X_train is input into the trained DF to obtain the DF predicted value Ŷ; X_train and Ŷ are combined horizontally to obtain the discriminator input X̂, as shown in formula (14):

wherein ŷ_{N1,I+J} is the predicted value of the N1-th group of data at the (I+J)-th link; since the links I+1 to I+J of X̂ are predicted by DF, the discriminator must judge it as false, so label 0 is attached and the data enters the discriminator;
step 602, at the same time, the inputs of the N−N1 test-set groups are combined with their true outputs; since the output data are real, they are labeled 1 and enter the discriminator;
step 603, according to the judging effect of the judging device, firstly, the judging device is optimized through back propagation, the performance of the generating device is judged through the loss function of the judging device, and the prediction precision of the current generating device to the training data is detected;
the result of discriminating the performance of the generator by the loss function is divided into the following:
(1) If the loss function of the discriminator is not changed relative to the loss function when the generator is adjusted last time, the prediction effect of the generator is still good, and generator optimization is not needed;
(2) If the loss function of the arbiter changes for the loss function at the last time the generator was adjusted, then a further determination of the change in the loss function is needed, step 604 is performed;
step 604, judging whether the absolute value of the change of the loss function reaches an adjustment threshold, if so, adjusting a generator, increasing the number of layers of the cascade forests, and returning to step 601; otherwise, go to step 605;
the value of the adjustment threshold is shown in formula (15):
δ_e = a×b^epoch + c (15)

wherein δ_e is the absolute value of the current prediction error minus the last prediction error, epoch is the current training round, the initial value of δ_e is defined as 0, and δ_e decreases as epoch increases; a is determined according to the order of magnitude of the prediction accuracy, b is the factor controlling how fast δ_e decays, and c ensures that δ_e still has a certain value when the number of training rounds is already large;
step 605, judging whether the training of the round is completed, if yes, executing step 606; otherwise, returning to the step 603, and continuing to optimize the discriminator;
step 606, judging whether training of all rounds is completed, if so, completing an optimization process of the generator by DF; otherwise, returning to the step 603, and continuing to optimize;
Step 607, using DF as a generator, establishing a DFGAN model, and predicting the known I link data to obtain predicted J link data;
and seventhly, replacing the generator DF with an LSTM-DF model on the basis of the DFGAN model, and establishing the LSTM-DFGAN model to realize accurate prediction of pollutants in the grain processing process.
2. The method for expanding and predicting risk of food processing pollutant data based on LSTM-DFGAN according to claim 1, wherein the processing of the raw data in step one is as follows:
firstly, the pollutant data of each processing link are arranged in order from large to small, obtaining N rows of ordered data in I+J columns;
the ordered N rows of I+J columns of data are then transformed into one-dimensional data of 1 row and N×(I+J) columns.
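A minimal numpy sketch of the two-step arrangement above; column-major flattening is an assumption, since the claim does not state the flattening order:

```python
import numpy as np

def arrange_raw_data(raw):
    """Claim-2 preprocessing sketch: sort each processing link's column
    from large to small, then flatten the ordered N x (I+J) matrix into
    a single row of N*(I+J) values (column-major order assumed)."""
    ordered = -np.sort(-raw, axis=0)           # descending sort per column
    return ordered.reshape(1, -1, order="F")   # 1 row, N*(I+J) columns
```

For a 2 x 2 input this yields one row of 4 values, the first column's sorted data followed by the second's.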
3. The method for expanding and predicting risk of grain processing pollutant data based on LSTM-DFGAN according to claim 1, wherein in step three, the training set and the test set are each divided into an input set and an output set: the grain processing pollutant concentrations of the known number of links in each training set or test set are used as input, and the pollutant concentrations to be predicted are used as output, specifically:
The input of the training set, i.e. the data of the known links, is X_train, and the output of the training set, i.e. the data of the links to be predicted, is Y_train; similarly, the input and output of the test set are X_test and Y_test respectively; after division, they are as shown in formulas (4) to (7):
4. The method for expanding and predicting risk of grain processing process pollutant data based on LSTM-DFGAN as set forth in claim 1, wherein the training of the GAN model by the training set is as follows:
first, the input data X_train of the I links in the training set is used as the input of the generator; the predicted J link data is obtained through a 4-layer DNN network and fused with the input data of the I links in the training set;
then, the training set data and the fused data, i.e. the input data of the I links in the training set combined with the predicted J link data, are simultaneously input into the discriminator;
the I link data in the training set plus the J link data predicted by the generator do not follow the real data change trend, so the label is 0 when they are input to the discriminator; similarly, the label is 1 when the training set data is input to the discriminator; the discriminator then outputs a one-dimensional scalar;
finally, the one-dimensional scalar output by the discriminator is input into the activation function; the larger the output value of the activation function, the more real the input data, i.e. the more accurate the change rule;
After the adversarial training of the generator and the discriminator, when known I link data are input to the generator, J link data similar to the real data are output; at this point the training of the GAN model is completed and it possesses prediction capability.
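The adversarial setup of claim 4 can be sketched as below; the layer widths, optimizer settings and random placeholder data are illustrative assumptions, and the sigmoid discriminator head stands in for the activation function whose larger output indicates more realistic data:

```python
import torch
import torch.nn as nn

I, J, N = 4, 2, 32  # illustrative numbers of known/predicted links and samples

# Generator: 4-layer DNN mapping the I known links to J predicted links.
G = nn.Sequential(nn.Linear(I, 16), nn.ReLU(),
                  nn.Linear(16, 16), nn.ReLU(),
                  nn.Linear(16, 16), nn.ReLU(),
                  nn.Linear(16, J))
# Discriminator: scores a full I+J row; output near 1 means "real trend".
D = nn.Sequential(nn.Linear(I + J, 16), nn.ReLU(),
                  nn.Linear(16, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

x_real = torch.rand(N, I + J)   # placeholder for real training-set rows
x_i = x_real[:, :I]             # known I-link inputs

# One adversarial step: the fused rows (I inputs + predicted J links)
# carry label 0, the real training-set rows carry label 1.
fake = torch.cat([x_i, G(x_i)], dim=1)
loss_d = (bce(D(x_real), torch.ones(N, 1))
          + bce(D(fake.detach()), torch.zeros(N, 1)))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# The generator is then updated to make its fused rows look real to D.
loss_g = bce(D(torch.cat([x_i, G(x_i)], dim=1)), torch.ones(N, 1))
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```

Repeating this step over many batches is the countermeasure training the claim describes; after convergence, G alone maps known I link data to plausible J link data.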
5. The method for expanding and predicting risk of grain processing process pollutant data based on LSTM-DFGAN as set forth in claim 1, wherein the predicted output data is fused with the input data of the training set to serve as the input data of the DF model, and the output data of the training set serves as the output data of the DF model to train the DF model; the specific training process is as follows:
step 501, for the N_1 rows of I+J columns of input data, DF extracts data vectors from the input using a plurality of sliding windows of different sizes;
with a sliding window of size p×q, the number of sliding extractions over the input data is (N_1-p+1)×(I+J-q+1), and the size of the extracted data vector is p×q×(N_1-p+1)×(I+J-q+1);
step 502, the data vectors extracted by each sliding window are input into a completely random forest A and a random forest B respectively; each completely random forest A or random forest B outputs k prediction results, so the number of data vectors finally obtained from each sliding window is p×q×(N_1-p+1)×(I+J-q+1)×k×2, where the factor 2 accounts for using both the completely random forest A and the random forest B;
step 503, with the number of sliding windows set to m, the total number of data vectors obtained through the multi-granularity scanning stage is M = Σ_m p×q×(N_1-p+1)×(I+J-q+1)×k×2;
Step 504, connecting M data vectors in series according to the sequence of multi-granularity scanning to obtain enhanced data vectors, and using the enhanced data vectors as input of a cascade forest;
the enhanced data vector contains M rows and 1 column of data;
the cascade forests comprise a multi-layer cascade structure, and each layer of cascade structure comprises x random forests and y completely random forests;
step 505, the enhanced data vector is input into the first-layer cascade structure for learning, and a predicted data vector is output;
the number of data output by the first-layer cascade structure is: k×(x+y);
Step 506, combining the data vector output by the first hierarchical structure with the enhanced data vector, as a new data vector, transmitting the new data vector to the second hierarchical structure for learning, so as to push the new data vector, wherein each layer of cascade structure learns the data vector after the output of the last layer of cascade structure is combined with the enhanced data vector until the maximum layer number is reached or the prediction result is not changed, determining the layer number of the cascade structure, and completing DF training;
When DF training is completed, each random forest and depth random forest of the last layer of cascade structure outputs a group of predicted data vectors, and the outputs of all random forests and complete random forests of the layer are averaged to obtain a final predicted result.
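The vector counts of steps 501–503 can be checked with a short helper; the window sizes and k used in the example are illustrative:

```python
def mgs_vector_count(N1, I, J, windows, k):
    """Total number of data vectors M produced by multi-grained scanning
    (steps 501-503): each p x q window slides (N1-p+1)*(I+J-q+1) times
    over the N1 x (I+J) input, each p*q patch is fed to a completely
    random forest A and a random forest B, and each forest emits k
    prediction results."""
    total = 0
    for p, q in windows:
        slides = (N1 - p + 1) * (I + J - q + 1)
        total += p * q * slides * k * 2  # x2: forests A and B
    return total
```

For example, with N_1 = 5 rows, I+J = 5 columns, a single 2×2 window and k = 3, the scan slides 4×4 = 16 times and yields 2×2×16×3×2 = 384 vectors.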
6. The method for expanding and predicting risk of grain processing process pollutant data based on LSTM-DFGAN as set forth in claim 1, wherein the specific process of establishing the LSTM-DFGAN model is as follows:
step 701, inputting known I link data into a trained LSTM to obtain a prediction result of the LSTM;
step 702, fusing the LSTM prediction result with the training set input, and converting the fused data into N_1 rows of I+J columns;
step 703, taking the fused data as input of DF, and DF outputs predicted J link data and takes the predicted J link data as input of a discriminator;
step 704, the discriminator judges the J link data predicted by DF, detecting the current prediction accuracy of DF on the training data, and judges whether the loss function of the discriminator has changed relative to the loss function of the previous round; if so, step 705 is executed; otherwise, the discriminator continues to be trained without optimizing the generator, and step 706 is executed;
step 705, further judging whether the change of this round's loss function relative to the previous round's loss function reaches the adjustment threshold; if so, adjusting DF by increasing the number of layers of the cascade forest, and returning to step 703; otherwise, go to step 706;
step 706, determining whether the round is completed, if yes, executing step 707; otherwise, returning to step 704, continuing the optimization of the discriminator;
step 707, judging whether all rounds are completed, if yes, completing the establishment of the LSTM-DFGAN model; otherwise, returning to step 704, continuing the optimization of the discriminator;
step 708, predicting the known I link data by using the established LSTM-DFGAN model to obtain predicted J link data.
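The adjustment loop of steps 704–708 reduces to the branching logic below; `_StubDF`, the loss sequence and the threshold function are hypothetical placeholders standing in for the trained components:

```python
class _StubDF:
    """Hypothetical stand-in for the LSTM-DF generator."""
    def __init__(self):
        self.cascade_layers = 1

    def add_cascade_layer(self):
        self.cascade_layers += 1

def optimize_generator(df, losses, threshold):
    """Branching logic of steps 704-708: compare each discriminator loss
    with the previous round's loss; when it has changed and the absolute
    change reaches the adjustment threshold, grow the cascade forest,
    otherwise keep training the discriminator with the generator fixed."""
    prev = None
    for step, loss in enumerate(losses):           # steps 706-707: loop over rounds
        if prev is not None and loss != prev:      # step 704: did the loss change?
            if abs(loss - prev) >= threshold(step):  # step 705: change large enough?
                df.add_cascade_layer()
        prev = loss
    return df

# One sharp loss swing triggers a new cascade layer; the small later
# change leaves the model unchanged.
df = optimize_generator(_StubDF(), [1.0, 0.3, 0.31], threshold=lambda s: 0.5)
```

The same skeleton applies to steps 603–606 of claim 1, with the 4-layer DNN generator in place of DF.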
CN202310343276.9A 2023-03-31 2023-03-31 LSTM-DFGAN-based grain processing process pollutant data expansion and risk prediction method Pending CN116777196A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310343276.9A CN116777196A (en) 2023-03-31 2023-03-31 LSTM-DFGAN-based grain processing process pollutant data expansion and risk prediction method


Publications (1)

Publication Number Publication Date
CN116777196A true CN116777196A (en) 2023-09-19

Family

ID=87988519


Country Status (1)

Country Link
CN (1) CN116777196A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113837002A (en) * 2021-08-18 2021-12-24 Xi'an University of Technology Small sample data fault diagnosis method based on improved TimeGan model
CN113837002B (en) * 2021-08-18 2024-04-23 Xi'an University of Technology Small sample data fault diagnosis method based on improved TimeGan model

Similar Documents

Publication Publication Date Title
CN108596327B (en) Seismic velocity spectrum artificial intelligence picking method based on deep learning
CN108985335B (en) Integrated learning prediction method for irradiation swelling of nuclear reactor cladding material
CN111148118A (en) Flow prediction and carrier turn-off method and system based on time sequence
CN106407649A (en) Onset time automatic picking method of microseismic signal on the basis of time-recursive neural network
CN100520817C (en) Improved performance of artificial neural network model in the presence of instrumental noise and measurement error
CN113723007B (en) Equipment residual life prediction method based on DRSN and sparrow search optimization
CN111144552B (en) Multi-index grain quality prediction method and device
CN115099519B (en) Oil well yield prediction method based on multi-machine learning model fusion
CN112085161B (en) Graph neural network method based on random information transmission
CN108897354B (en) Aluminum smelting process hearth temperature prediction method based on deep belief network
CN110455512B (en) Rotary mechanical multi-integration fault diagnosis method based on depth self-encoder DAE
CN116777196A (en) LSTM-DFGAN-based grain processing process pollutant data expansion and risk prediction method
CN113011796A (en) Edible oil safety early warning method based on hierarchical analysis-neural network
CN112147432A (en) BiLSTM module based on attention mechanism, transformer state diagnosis method and system
Wei et al. An effective gas sensor array optimization method based on random forest
CN113591215B (en) Abnormal satellite component layout detection method based on uncertainty
CN110739031A (en) Method and device for supervised prediction of metallurgical sintering processes and storage medium
CN113283288B (en) Nuclear power station evaporator eddy current signal type identification method based on LSTM-CNN
CN114266278A (en) Dual-attention-network-based method for predicting residual service life of equipment
CN112949391A (en) Intelligent security inspection method based on deep learning harmonic signal analysis
CN114970353A (en) MSWI process dioxin emission soft measurement method based on missing data filling
CN116822920A (en) Flow prediction method based on cyclic neural network
Yamada et al. Weight Features for Predicting Future Model Performance of Deep Neural Networks.
CN105974058A (en) Method for rapidly detecting potassium content of tobacco leaves based on electronic nose-artificial neural network
Elashmawi et al. Neural network monitoring model for industrial gas turbine

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination